
What Would You Stop Doing When UI Tests Are Flaky?
Most QA engineers answer this interview question confidently wrong. Here's what "What would you stop doing when UI tests are flaky?" is actually testing and what an experienced answer sounds like.
This interview question was recently posted in a QA forum, and the discussion it generated is more interesting than the question itself:
"What would you stop doing when UI tests are flaky?"
The phrasing trips people up. Most interview questions ask what you would do: what's your process, how do you handle it, what tools do you reach for. This one inverts it. It's asking about habits to eliminate, which implies the interviewer already assumes you have them. It's also, perhaps intentionally, phrased awkwardly.
I've spent over 20 years in software testing across fintech, SaaS HCM, and insurtech, and serve as Director of Quality Engineering at my current employer. I haven't been asked this question in exactly this phrasing, but I've used similar ones from the other side of the table. I know what this type of question is designed to surface.
Before we get to the answer, let's look at what the QA community said. See if you can guess the most popular responses before reading on.
Survey Says — What the QA Community Answered This Interview Question
What would you stop doing when UI tests are flaky?
The most popular community answers were technical and relatable — stop using sleep(), fix timing and waits — the instinctive responses from anyone who has spent time debugging intermittent failures. Investigate root cause first ranked lower by sheer volume but drew the most endorsement from people who paused to think about what was actually being asked.
I also ran a LinkedIn poll with the same question. It had 357 impressions and only 5 votes — low participation — but those 5 voters unanimously chose investigate root cause first. The gap between the free-comment community vote pattern and the forced-choice poll result is itself telling: when people had to commit to one answer, they chose the diagnostic approach. When free-commenting, they led with the most relatable war story.
Why Most Candidates Answer the Wrong Question
Here's what's worth pausing on: many of the most popular community answers — including quarantine tests from CI, add retry logic, report flakiness to the dev team — are valid responses to "what would you do about flaky tests." They are not answers to "what would you stop doing."
Quarantining is an action you add to your process. Retries are something you implement. Reporting is something you start doing. None of these are things you stop.
The community's own discussion demonstrated the exact failure mode the question is designed to surface: answering a different question than the one being asked.
This is worth a conscious moment when you're in an interview seat. Before diving in, restate the question: "So you're asking what habits I'd stop — not what I'd add to my process?" That one sentence signals precision under pressure, and precision matters.
When I'm conducting an interview, if a candidate is giving an answer that feels off, I'll ask them to repeat back their understanding of the question. Sometimes they're just wrong, but more often they didn't fully process it in the moment due to nerves, a language barrier, or, in remote interviews, dropped audio. The candidates who handle interviews best are the ones who preemptively restate their understanding before answering. It reads as both confident and careful (good qualities for testers and quality engineers).
What This Flaky Test Interview Question Is Actually Testing
This question tests at least four things at once:
- Technical knowledge — Do you know the common anti-patterns that cause flaky UI tests?
- Diagnostic thinking — Can you reason about root causes rather than recite a fix list?
- Listening comprehension — Did you actually process what was asked?
- Confidence to challenge ambiguity — Will you accept the awkwardly worded question as-is, or point that out and ask for clarification?
A junior answer names tactics: stop using sleep, fix your waits, add retries. Not wrong, but symptom-level.
An experienced answer narrates a thought process — how you'd identify what's causing the flakiness before deciding what to change. The "stop doing" framing is a clue. It's asking which habits you've already had to unlearn, implying you've operated at enough scale to have learned them the hard way.
What to Stop Doing When UI Tests Are Flaky: The Full Answer
If asked this question in an interview, I'd clarify the framing first: "Are you asking about common anti-patterns that lead to flakiness, or more about how I'd approach the investigation?" That distinction matters, and asking it signals diagnostic thinking before the answer even starts.
If they want the approach angle, this is how I'd answer.
Stop Adding Tests to an Unstable Suite
This would be my first answer, and I'd lead with it.
Adding tests to a flaky suite compounds the problem. Every new test inherits the instability of the environment it runs in. Before expanding coverage, you need to stop the bleeding and understand whether the flakiness lives in the test code, the application behavior, or the infrastructure. That distinction determines the shape of your fix.
Stop Using sleep() and pause Statements
This is the answer that generates the most community agreement, and for good reason — it's the most widespread bad habit in UI test automation.
sleep() and pause are blunt instruments. They wait a fixed amount of time regardless of whether the condition they're waiting for became true a second in or never became true at all. They're slow, brittle, and mask the real problem: the test doesn't know what it's waiting for.
This is so well understood that Playwright formally marks page.waitForTimeout() as Discouraged in their own API docs:
From the Playwright docs for page.waitForTimeout(): "Never wait for timeout in production. Tests that wait for time are inherently flaky. Use Locator actions and web assertions that wait automatically."
I've mandated the removal of pause statements from test suites I've managed and replaced them with explicit wait patterns — waitForElementPresent, custom polling waits — anything that returns as soon as the condition is true rather than waiting out a fixed interval. I've added lint rules to prevent .pause commands from being checked in at all. On one large serial suite, removing sleep and pause statements alone saved over an hour off the total test run time.
One practical detail: when setting a max wait timeout, I set it to roughly twice what I'd expect the worst case to be. CI environments consistently run slower than local development in ways that aren't always predictable. A wait that looks generous locally can time out under CI load.
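The difference is easy to sketch. Here's a minimal condition-polling wait in Python, a hand-rolled illustration rather than any particular framework's API, that returns the moment the condition holds and only hits the ceiling on genuine failure:

```python
import time

def wait_for(condition, timeout=10.0, poll_interval=0.25):
    """Poll `condition` until it returns truthy, or raise on timeout.

    Unlike sleep()/pause, this returns as soon as the condition is
    satisfied. Size `timeout` at roughly twice the worst case you
    expect, since CI runs slower than local development.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(poll_interval)
    raise TimeoutError(f"condition not met within {timeout}s")
```

In a UI suite the condition would be something like `lambda: element.is_displayed()` (hypothetical element handle); the key property is that a fast app pays nothing and only a genuinely broken one waits out the full timeout.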
Stop Assuming the Problem Is in the Test Code
Some flakiness isn't in the test at all.
I had a test that failed intermittently depending on what time of day the build kicked off. After investigation, the root cause was a timezone mismatch between the server under test and the system running the tests. A validation rule in the application behaved differently at a specific hour because of this offset. The test was faithfully catching real behavior, but it looked like random flakiness until you looked closely enough. The initial investigation was tricky because the test passed reliably during normal business hours, which is exactly when we kept trying to reproduce the failure!
The fix was a conditional branch in the test to account for the business rule at that magic hour. I generally avoid conditional branched logic in tests — it adds complexity and makes tests harder to reason about. But we couldn't time-travel or alter system clocks, and the conditional was the honest solution.
The point: before assuming the test is broken, determine whether you're dealing with test code, an application bug, or an infrastructure mismatch. The investigation approach is different for each.
It's also worth noting that some intermittent failures aren't flakiness at all — they're the test catching a real intermittent bug in the application. A test that fails once and passes on the next re-run looks identical to a flaky test on the surface. One is noise; the other is a signal you're about to dismiss. This is why every failure deserves investigation before it gets written off.
The goal is a suite trustworthy enough that the team's first instinct when a test fails is "it found something" — not "ugh, it's flaky, just re-run it." The moment re-running becomes the default response, it becomes an annoying car alarm at 3 AM instead of a useful tool.
Stop Running Tests in Parallel Without Isolating Shared State
Parallelism is worth pursuing — the time savings on a large suite are significant, and it's one of the highest-leverage improvements you can make to CI feedback time. The problem isn't parallelism itself; it's running tests in parallel that were never designed for it.
Tests that share data, database state, or external resources become order-dependent and environment-dependent the moment you parallelize them. A suite that runs cleanly in serial can look deeply flaky in parallel for no obvious reason — because the flakiness is in the interaction between tests, not in any individual test.
The practical solution is to stop treating your suite as a single homogeneous run and start thinking in terms of what can safely run concurrently:
- Read-only tests — tests that only query state without mutating it — are natural candidates for parallel execution. They can't interfere with each other.
- Write operations, state-dependent flows, and anything touching shared fixtures are better kept in a serial suite until you've isolated their data properly (unique test data per run, dedicated test accounts, isolated environments).
A combined approach — a parallel suite for safe tests and a serial suite for the rest — gets you most of the speed benefit while keeping the flakiness surface small. Once the serial tests are properly isolated with their own data, you can graduate them into the parallel suite over time.
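The data-isolation piece can be as small as a per-run namespace. This is an illustrative sketch, assuming a pytest-xdist style setup; CI_RUN_ID is a hypothetical variable your pipeline would set, while PYTEST_XDIST_WORKER is set automatically by pytest-xdist for each parallel worker:

```python
import os
import uuid

# Give each CI run, and each parallel worker within it, its own
# namespace so write-heavy tests can never collide on shared records.
RUN_ID = os.environ.get("CI_RUN_ID") or uuid.uuid4().hex[:8]

def unique_username(prefix="qa"):
    """A username no concurrent run or worker will reuse."""
    worker = os.environ.get("PYTEST_XDIST_WORKER", "gw0")
    return f"{prefix}-{RUN_ID}-{worker}-{uuid.uuid4().hex[:6]}"
```

Once every write test creates its own records through helpers like this, order dependence disappears and the test becomes a candidate for the parallel suite.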
Stop Treating Flakiness as Normal
The most damaging thing a team can do with a flaky test is shrug and accept it.
Flakiness trains everyone to ignore failures. Once the build becomes a noise generator instead of a signal, real regressions slip through unchallenged. A test suite that cries wolf is functionally worse than no test suite, because it creates false confidence.
I've used flakiness scoring in both Bitbucket and BrowserStack Test Analytics to identify and mute the worst offenders. Muting is not the same as deleting: the test still runs, it just doesn't fail the build while it's under investigation. That distinction matters, because it preserves your ability to track whether improvements helped without letting the instability contaminate every build in the meantime.
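A flakiness score can be as crude as the flip rate across a test's recent history. The sketch below is a generic illustration of the idea, not the formula Bitbucket or BrowserStack actually uses:

```python
def flakiness_score(outcomes):
    """Flip rate across a test's recent run history.

    outcomes: pass/fail booleans, oldest first. 0.0 means perfectly
    stable (all passes, or a consistently failing test, which is a bug
    rather than flakiness); values near 1.0 mean the result alternates
    almost every run, the signature of a flaky test.
    """
    if len(outcomes) < 2:
        return 0.0
    flips = sum(1 for a, b in zip(outcomes, outcomes[1:]) if a != b)
    return flips / (len(outcomes) - 1)
```

Sort tests by score descending, mute the worst offenders above whatever threshold your team tolerates, and investigate them in that order.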
How to Answer Flaky UI Test Interview Questions
A few framing notes regardless of how you structure your answer:
Restate first. Before diving in, confirm you understood the question. "So you're asking what habits I'd stop, not what I'd add to my process?" One sentence of confirmation demonstrates careful listening — which is arguably what the question is testing most.
Narrate, don't list. A list of tactics sounds like you memorized a checklist. A thought process — "I'd start by determining whether this is test code, application behavior, or environment, because the fix is different for each" — sounds like someone who has actually dealt with this at scale.
Distinguish the problem type. Not all flakiness has the same root cause. Timing issues, shared state, environment inconsistency, and automating an unstable UI are four different problems with four different fixes. Showing you can distinguish them is what separates a good answer from a more experienced one.
Own a specific example. The most memorable interview answers are concrete. If you've refactored a suite full of sleep statements, or tracked down a timezone mismatch that looked like random flakiness for weeks, say so. Specific experience is more credible than correct-sounding generalizations.