
AI in Test Automation: Real Limitations vs. User Error
Not every AI testing complaint is a real tool limitation.
Spend enough time in testing communities and you'll hear the same AI complaints on repeat. Some of them are legitimate. Some of them are skill gaps dressed up as tool limitations. And some land somewhere in the middle — real friction, but solvable with the right setup.
This article goes through the most common ones and gives each a verdict — with practical mitigations for the ones that hold up.
"AI Will Rewrite Your Assertions to Make Tests Pass"
This is the most dangerous pitfall in AI-assisted testing and the one that gets the least attention in vendor demos. AI models are trained to produce working code. A test that passes is working code to the model — so when a test is failing, the path of least resistance is to make it pass, not to understand why it was failing.
In practice this means AI will silently:
- Weaken assertions (`toEqual` → `toBeTruthy`)
- Add conditional logic that bypasses the failing case
- Change expected values to match whatever the app currently returns
The result is a green test suite that no longer tests what it was supposed to test. And because the change looks like a reasonable fix in isolation, it often gets through code review unnoticed.
Mitigation:
- Never let AI resolve a failing test without human review of the assertion specifically
- Treat assertion changes in AI-generated diffs with extra scrutiny — they're the most likely place the model took a shortcut
- Understand what a good assertion looks like before using AI to write them. If you can't evaluate the output, you can't catch when it's wrong
- Explicitly tell the AI to expect that it may find defects and what it should do when one is detected (e.g. "Leave the test failing, log a defect, and set the @Disabled annotation with comment tying back to the defect ticket")
This is the clearest argument for why AI in testing raises the ceiling for experienced practitioners and lowers the floor for those without fundamentals. Without human-in-the-loop auditing and supervision, the model will confidently write you a useless test.
Characterization Tests and the Case of the Silent Rewrite
When using Claude to modernize legacy functions, I asked it to establish test coverage for the existing code before beginning the refactor — a characterization test workflow. Without explicit instruction, it would start reasoning out loud in the terminal: "The test is failing — I just need to change the expected value from foo to bar." Left unchecked, it would have written tests that codified the defect as the expected output. The fix was explicit instruction: assume the legacy code may have bugs, leave failing tests in a failed state, and we'll rerun them after the refactor to validate the improvements.
Here's a representative example of what that assertion rewrite looks like in practice:
```javascript
// Legacy function — bug: discount applies to subtotal before tax (should be after)
function calculateOrderTotal(subtotal, taxRate, discountPct) {
  const discount = subtotal * (discountPct / 100);
  const tax = subtotal * taxRate;
  return subtotal + tax - discount;
}
```
```javascript
// Test written to document correct behavior before refactor
test('discount should apply to post-tax total', () => {
  // subtotal: 100, 10% tax, 10% discount
  // correct: (100 + 10) * 0.90 = 99
  expect(calculateOrderTotal(100, 0.10, 10)).toBe(99); // FAILS — returns 100
});
```
AI sees the failure and silently "fixes" it:
```javascript
// What AI changed it to — test now passes, bug is invisible
test('discount should apply to pre-tax total', () => {
  expect(calculateOrderTotal(100, 0.10, 10)).toBe(100); // ← matches buggy output
});
```
The model updates both the assertion and the test name to stay internally consistent, so the result looks like a deliberate design decision, not a shortcut. Nothing in the diff signals that a bug just became the spec.
The mitigation was giving Claude explicit rules for exactly this scenario before it started writing any tests:
"These functions have no existing test coverage, so we may discover bugs as we add characterization tests. If a test fails, do not rewrite it to pass. Instead: leave the assertion as-is documenting the expected correct behavior, disable the test with `test.skip`, and add a comment with the defect ticket ID and a TODO to re-enable once it's resolved."
With that instruction in place, the same failing scenario produces this instead:
```javascript
// TODO: Re-enable once resolved — see WEB-1234
// BUG: Discount is applied to pre-tax subtotal instead of post-tax total
test.skip('discount should apply to post-tax total', () => {
  expect(calculateOrderTotal(100, 0.10, 10)).toBe(99);
});
```
The test documents the intended behavior, the skip keeps the suite green without hiding the problem, and the ticket reference means it isn't silently forgotten — which is the whole point of a characterization test suite.
"AI Testing Tools Cost Too Much to Run at Scale"
I was recently talking with a peer evaluating AI tooling for spec-driven testing — writing tests close to acceptance criteria that non-technical stakeholders could read and contribute to. That's exactly the use case platforms like testRigor and Momentic are built for. The appeal is real: tests read like plain English, the barrier to authoring drops, and product and QA can collaborate on coverage.
The cost concern is real too, and it's specific to how these tools work. Tests are written in natural language and an LLM interprets and executes each step against the live application at runtime. Every test step triggers an API call — and at scale that compounds fast. If you've vendor-locked into one of these platforms, the cost and inefficiency complaints are justified.
This isn't a new problem space though — tools like SpecFlow and Cucumber solved spec-driven testing before AI by generating the translation layer once at authoring time as coded step definitions. The difference with AI execution platforms is that translation happens at runtime on every run.
Mitigation — the best of both worlds: Use AI to generate Playwright step definitions from your Gherkin scenarios. You get the plain-English spec, stakeholder-readable coverage, and deterministic execution without the per-run API cost or vendor lock-in. The translation layer is authored once, not re-interpreted on every CI run.
- AI for authoring (Claude Code, Copilot) — tokens consumed once at write time, tests run deterministically forever after
- AI for execution (testRigor, Momentic) — per-run API cost that scales with suite size and CI frequency; vendor lock-in compounds the risk
- AI-generated Gherkin + Playwright step definitions — spec-driven workflow, one-time authoring cost, deterministic execution
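The economics difference is easiest to see in a toy sketch of the translate-once approach. Nothing here is a real Cucumber or Playwright API: a regex registry stands in for step matching, and plain functions stand in for browser actions. The point is that once a plain-English step is bound to code, runtime execution involves zero LLM calls.

```javascript
// Toy "translate once" sketch: plain-English steps are bound to coded
// step definitions at authoring time, so every run is deterministic
// and makes no per-step LLM calls.
const steps = [];
const Given = (pattern, fn) => steps.push({ pattern, fn });

// Step definitions, authored once (by a human or an AI assistant)
Given(/^the cart contains (\d+) items?$/, (world, count) => {
  world.cartCount = Number(count);
});
Given(/^the cart badge shows (\d+)$/, (world, expected) => {
  if (world.cartCount !== Number(expected)) {
    throw new Error(`expected badge ${expected}, got ${world.cartCount}`);
  }
});

// Runtime: each scenario line is matched against the registry.
// Undefined steps fail loudly instead of being "interpreted".
function runScenario(lines) {
  const world = {};
  for (const line of lines) {
    const step = steps.find((s) => s.pattern.test(line));
    if (!step) throw new Error(`undefined step: ${line}`);
    step.fn(world, ...line.match(step.pattern).slice(1));
  }
  return world;
}

runScenario(['the cart contains 3 items', 'the cart badge shows 3']);
```

Real implementations (cucumber-js, SpecFlow) do exactly this binding, with step bodies driving a browser instead of an in-memory object.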
"Self-Healing Tests Are a Game Changer"
Self-healing sounds appealing until you watch it in practice. The mechanism is: test fails → AI tries alternative selectors → updates the test if one works. The problem is what this hides.
A test that needs to heal constantly is a signal:
- The selectors are brittle to begin with
- The application's DOM structure is changing in ways that aren't intentional
- Nobody is reviewing what the "heal" actually changed
Self-healing burns tokens in retry loops, produces increasingly complex scripts to work around what should be a simple locator fix, and obscures whether the application itself changed in a meaningful way.
More importantly, a breaking test on a well-written selector is a useful signal — something changed and deserves attention. An accidental commit, a feature flag flipped in the wrong environment, a UI change pushed without notice that would have gone through untested. Self-healing silently absorbs that signal on your behalf. You avoid the maintenance burden, but you also lose the bump in the road that was trying to tell you something.
The real fix: write resilient selectors from the start. Prefer `getByRole`, `getByLabel`, and `getByTestId` over XPath or CSS chains. If a selector breaks, fix it — don't automate around it. The discipline to write good selectors upfront costs less than the ongoing overhead of managing a self-healing test suite.
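The resilience gap is mechanical, and a toy model makes it concrete. This sketch uses a plain object tree in place of a real DOM: a positional nth-child-style path breaks the moment a wrapper element appears, while a testid lookup keeps resolving.

```javascript
// Positional lookup, analogous to a CSS chain like div > :nth-child(1) > :nth-child(2)
function nthChildPath(node, indexes) {
  return indexes.reduce((n, i) => n && n.children && n.children[i], node);
}

// Attribute lookup, analogous to getByTestId
function byTestId(node, id) {
  if (node.testId === id) return node;
  for (const child of node.children || []) {
    const hit = byTestId(child, id);
    if (hit) return hit;
  }
  return undefined;
}

const before = { children: [{ children: [{ testId: 'submit', label: 'Submit' }] }] };
// A redesign adds one wrapper element around the button:
const after = { children: [{ children: [{ children: [{ testId: 'submit', label: 'Submit' }] }] }] };

nthChildPath(before, [0, 0]).label; // 'Submit'
nthChildPath(after, [0, 0]).label;  // undefined: the positional path broke
byTestId(after, 'submit').label;    // 'Submit': the testid still resolves
```

A self-healing tool papers over the second case at runtime; a role- or testid-based locator never breaks in the first place.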
"AI Agents Can't Get Past Corporate SSO"
Google Auth and enterprise SSO do actively block automated agents because the same techniques are used by bad actors. But treating this as an insurmountable AI limitation misses the point — this same friction exists with traditional automation too.
The solutions are engineering problems, not AI problems:
- Feature-flag a test auth bypass — a password-based test login path, never exposed near production, that bypasses SSO for automation
- Cookie injection — capture an authenticated session and inject the cookies into your Playwright context via `storageState`
- Pre-authenticated session state — Playwright's built-in `playwright/test` supports saving and reusing auth state across tests
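The third option is the one Playwright documents directly: log in once in a global setup, persist the session, and start every test already authenticated. A sketch, assuming a feature-flagged test login path exists; the URL, field labels, and env var names are illustrative:

```javascript
// auth.setup.js: log in once, persist the session to disk
const { chromium } = require('@playwright/test');

module.exports = async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto('https://app.example.com/test-login'); // SSO-bypass path, test envs only
  await page.getByLabel('Username').fill(process.env.TEST_USER);
  await page.getByLabel('Password').fill(process.env.TEST_PASS);
  await page.getByRole('button', { name: 'Sign in' }).click();
  await page.context().storageState({ path: 'auth.json' }); // cookies + localStorage
  await browser.close();
};

// playwright.config.js: wire it up so tests reuse the saved session
//   globalSetup: './auth.setup.js',
//   use: { storageState: 'auth.json' },
```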
The teams calling SSO an AI blocker are usually the same teams that haven't solved it for traditional automation either. That's an org prioritization problem, not a tool limitation.
"AI Can't Understand Why a Test Is Failing"
AI can only reason about what you give it. Feed it raw logs and it burns tokens guessing. Feed it a screenshot of the failure, the relevant DOM snapshot, network traffic, and the test output together and it's significantly more useful.
Playwright gives you everything you need:
- Trace viewer — full timeline of actions, screenshots, and network calls
- HAR files — captured network traffic for the failing scenario
- Console logs — surfaced alongside test output
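All of these can be captured automatically on failure; a minimal config fragment using Playwright's documented options:

```javascript
// playwright.config.js: keep rich failure artifacts, skip them on passing runs
const { defineConfig } = require('@playwright/test');

module.exports = defineConfig({
  use: {
    trace: 'retain-on-failure',      // timeline of actions, screenshots, network calls
    screenshot: 'only-on-failure',
    video: 'retain-on-failure',
  },
});
```

HAR files are recorded separately, per browser context, via the `recordHar` context option.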
The failure is usually in how failure information is surfaced to the model, not the model's ability to interpret it. MCP servers that expose the running app state close this gap further — giving the agent direct access to the DOM rather than a static snapshot.
If your AI debugging workflow is "paste the error message and ask what's wrong," you're leaving most of the capability on the table.
"Just Point AI at a User Story and It'll Write Good Tests"
If you feed acceptance criteria directly to AI and expect meaningful test coverage, you'll get tests that mirror the criteria without covering edge cases, negative paths, or real-world usage patterns. The model faithfully tests what the story says — which is a problem when the story is incomplete, ambiguous, or just wrong.
Garbage in, garbage out. AI doesn't save you from bad requirements. Testing the requirements before writing any code or tests is the most important step to prevent defects and rework later:
- Apply shift-left techniques to poke holes in the story before anyone writes a line of code
- Use AI as another voice in the room — ask it to surface ambiguities, missing edge cases, and unstated assumptions in the acceptance criteria
The better framing is to treat AI output as a first draft from a capable junior tester. It covers the happy path, follows the instructions it was given, and misses the things an experienced tester would catch. Your job is to review it with that lens — not to treat it as done.
What makes the workflow salvageable:
- Encode your testing conventions and selector strategy in a reusable Skill so the model isn't inventing its own patterns each time
- Give the model the page source alongside the spec so it's working from real structure, not assumptions
- Plan before writing — have the agent outline which scenarios it intends to cover before generating any code, so you can catch misunderstandings early
How to validate what AI produced:
- Be explicit about the coverage type you want — statement, branch, or line — and verify with your coverage tool's report rather than trusting the AI's output
- Check that tests cover edges, not just the happy path — AI-generated tests tend to do just enough to pass
- Run mutation testing to validate test quality — tools like Stryker introduce small code changes to verify your tests actually catch them; if a mutant survives, the test isn't doing its job
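For StrykerJS that setup is small; a minimal config sketch for a Jest project (the file globs are illustrative):

```json
{
  "mutate": ["src/**/*.js", "!src/**/*.test.js"],
  "testRunner": "jest",
  "coverageAnalysis": "perTest",
  "reporters": ["html", "clear-text", "progress"]
}
```

Save it as `stryker.conf.json` and run `npx stryker run`; every surviving mutant in the report points at an assertion that isn't pulling its weight.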
"AI Has No Place in Test Execution"
AI belongs in authoring and maintenance, not runtime execution — your regression suite needs to be deterministic and cost-predictable. Non-deterministic tests erode trust fast: a flaky red build gets ignored, and an ignored build stops being a safety net. AI in the execution loop introduces variability in both behavior and cost that undermines both. Debugging failures becomes an exercise in chasing ghosts — different inputs or different test paths on each run mean you can't reliably reproduce what actually failed.
Write deterministic Playwright tests and deploy them to CI. When selectors break or frameworks change, AI is a useful tool for resolving those failures — but that's a deliberate maintenance task triggered by a human, not an automated self-healing loop running on every failure.
Where AI at runtime does make sense:
- Exploratory accessibility scans — running axe-core across a surface and having AI triage and prioritize findings
- One-off audit workflows run by a human — not scheduled CI jobs
- Failure investigation — giving an agent access to a failing test's trace to diagnose the root cause
If it runs in CI on every commit, it should be deterministic. If it's a human-driven investigative workflow, AI at runtime can be useful.
Separating AI Testing Hype from Legitimate Limitation
The complaints that are legitimate — assertion rewriting, execution cost — are the ones you almost never hear in vendor demos. The complaints that turn out to be user error — SSO, AI can't understand failures — are solvable with the right setup.
The tooling is only part of the equation. AI will confidently write passing, useless tests if you let it — and most of the mitigations in this article come down to the same thing: staying in the loop, knowing what good looks like, and not outsourcing your judgment to the model.