
Stop Writing Tests. Start Guiding AI to Write Them.

When a project is moving fast, manual regression testing becomes the tax on every sprint. Features stack up, changes land daily, and the team ends up retesting ground it has already covered. 

On a recent project we picked up, there was no test automation in place, and the cost of that gap was growing. Rather than keep paying the manual regression tax or pause delivery to build a framework the slow way, we took a third option: use AI to build it faster without cutting corners.

What followed was a complete test automation framework built with Claude Code, covering agents, conventions, a skills layer and a test management integration. This article walks through how it was built, what the AI got wrong early on, and what it took to make the output production-ready.

A 2025 survey of 1,400 QA professionals found that 55% cite insufficient time for thorough testing as their biggest obstacle, with high workload a close second. The most commonly automated test type? Regression.

 

Setting Things Up

We picked Playwright for both API and UI tests. One framework, one language (TypeScript), one runner. For reporting, we went with Allure Reports because we needed more than just pass/fail. Allure lets us tag tests with metadata, track history across runs and get enough detail to debug failures without digging through logs.
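
For reference, a minimal configuration along these lines looks roughly like this. The directory names and base URLs are illustrative, not our actual project values:

```typescript
// playwright.config.ts - a sketch of one runner covering both layers.
// Directory names and URLs are illustrative placeholders.
import { defineConfig } from '@playwright/test';

export default defineConfig({
  reporter: [
    ['list'],
    ['allure-playwright'], // metadata, history across runs, attachments per test
  ],
  projects: [
    {
      name: 'api',
      testDir: './tests/api',
      use: { baseURL: 'https://staging.example.com/api' },
    },
    {
      name: 'ui',
      testDir: './tests/ui',
      use: { baseURL: 'https://staging.example.com', browserName: 'chromium' },
    },
  ],
});
```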

We wrote the first few tests ourselves to establish the patterns. Then we captured all our best practices in a CLAUDE.md file, a special markdown file that Claude Code automatically loads as context for every conversation. It became the single source of truth: folder structure, fixture and page object conventions, locator strategies, authorization patterns, test data management and general coding rules. Every agent and skill we built afterwards used this file as its reference standard.
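
To make that concrete, here is the kind of pattern the file codifies. The page object and fixture names are hypothetical; the rules they demonstrate (role- and label-based locators first, one page object per page, fixtures own the wiring) are the point:

```typescript
// A sketch of the page object + fixture conventions captured in CLAUDE.md.
// LoginPage and the fixture name are hypothetical examples.
import { test as base, expect, type Page } from '@playwright/test';

class LoginPage {
  constructor(private readonly page: Page) {}

  // Locator priority: getByRole / getByLabel over raw CSS or XPath selectors.
  readonly email = () => this.page.getByLabel('Email');
  readonly password = () => this.page.getByLabel('Password');
  readonly submit = () => this.page.getByRole('button', { name: 'Sign in' });

  async login(email: string, password: string) {
    await this.email().fill(email);
    await this.password().fill(password);
    await this.submit().click();
  }
}

// Tests get page objects through fixtures instead of instantiating them inline.
export const test = base.extend<{ loginPage: LoginPage }>({
  loginPage: async ({ page }, use) => {
    await use(new LoginPage(page));
  },
});
export { expect };
```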

The Plugin and Its Agents

Once the foundation was in place, we built the tooling. Rather than creating a single agent that does everything, we split the work into five specialized agents, each with a clear responsibility. We packaged all of them as a Claude Code plugin so the whole setup can be installed on any project without copying files around.

test-designer analyzes code and designs behavioral test scenarios, focusing on what should be tested rather than how.

The designer reads from both the backend and frontend (paths come from the test-config.json that the setup skill writes) to understand the intended flow, not to mirror the implementation. Code can have bugs too. If we just validated existing behavior, we'd lock those in. So the designer focuses on the business flow and writes scenarios against that, and the test-writer turns those scenarios into code afterwards.

test-case-writer turns those scenarios into structured documentation with steps and expected results, writing to a test management tool via MCP if one is connected or to markdown files if not.

test-writer converts test cases into executable code, auto-detecting the framework and language from project files and running in an isolated git worktree.

test-healer diagnoses failed tests by cross-referencing the source code. If the test is wrong, it fixes it. If the test is correct and the application is returning something unexpected, it reports it as a bug instead of touching the test.

test-reviewer audits test quality and produces a score from 0 to 100 across compliance, code quality, coverage, meaningfulness and maintainability.

Each agent has a clear boundary. When something crosses that boundary, it gets handed off rather than handled in place. And nothing the agents produce lands in the repo automatically. The test-writer runs in an isolated git worktree, so we always review and commit manually.

We also created a shared conventions skill that all agents preload. It contains the naming rules, authorization test patterns, data mutation reversal rules, locator priorities and other standards. It's not something you invoke directly. It's hidden from the menu and gets injected into each agent's context automatically, so every agent works from the same set of rules without duplicating them across five files.
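
As an example of what the skill carries, the data mutation reversal rule boils down to a pattern like this. The fixture name and the /api/projects route are hypothetical; the convention is that anything a test creates gets registered and deleted afterwards:

```typescript
// A sketch of the data-mutation-reversal convention from the shared skill.
// Route and fixture names are illustrative.
import { test as base, expect } from '@playwright/test';

type CreatedEntity = { endpoint: string; id: string };

export const test = base.extend<{ createdEntities: CreatedEntity[] }>({
  createdEntities: async ({ request }, use) => {
    const created: CreatedEntity[] = [];
    await use(created);
    // After the test body, reverse every mutation, newest first.
    for (const entity of created.reverse()) {
      await request.delete(`${entity.endpoint}/${entity.id}`);
    }
  },
});

test('creating a project returns its id', async ({ request, createdEntities }) => {
  const response = await request.post('/api/projects', { data: { name: 'Demo' } });
  expect(response.status()).toBe(201);
  const { id } = await response.json();
  createdEntities.push({ endpoint: '/api/projects', id });
});
```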

Connecting a Test Management System

Writing tests with AI was fast, but reviewing them became a bottleneck. When the agent drops dozens of tests into a PR at once, going through each one line by line to check what it actually covers is exhausting. Code-level test reviews don't scale well and they're not readable by everyone on the team.

So we introduced a test management system into the workflow. Instead of going straight from code to automation, we added a step in between: the agent first creates test cases in the test management tool with clear steps and expected results. The team reviews them there, where they're easier to read and discuss than raw code. Once a test case is approved, its status gets set to "Ready for Automation."

From there, the agent picks up test cases in that status, writes the actual automated tests in the framework and updates the status to "Automated" in the test management tool. This created a full loop: write test cases, review, automate, track. And when tests fail, the healer skill closes the loop by either fixing the test or flagging a real bug in the application.

We built our own MCP (Model Context Protocol) tool to connect the agent to the test management system. It can create test cases, read statuses and update them, all within the same workflow.
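
The tool itself is specific to the system we connect to, but the shape is roughly this. It's a sketch using the TypeScript MCP SDK; the tool names, statuses and the REST endpoint (TM_API) are placeholders for whatever test management product sits behind it, and a read/list tool follows the same pattern:

```typescript
// A sketch of an MCP server bridging the agent and a test management tool.
// Tool names, statuses and the TM_API endpoint are placeholders.
import { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js';
import { StdioServerTransport } from '@modelcontextprotocol/sdk/server/stdio.js';
import { z } from 'zod';

const TM_API = process.env.TM_API ?? 'https://testtool.example.com/api';
const server = new McpServer({ name: 'test-management', version: '0.1.0' });

server.tool(
  'create_test_case',
  { title: z.string(), steps: z.array(z.string()), expected: z.string() },
  async ({ title, steps, expected }) => {
    const res = await fetch(`${TM_API}/test-cases`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ title, steps, expected, status: 'Draft' }),
    });
    const { id } = await res.json();
    return { content: [{ type: 'text', text: `Created test case ${id}` }] };
  },
);

server.tool(
  'update_status',
  { caseId: z.string(), status: z.enum(['Ready for Automation', 'Automated']) },
  async ({ caseId, status }) => {
    await fetch(`${TM_API}/test-cases/${caseId}`, {
      method: 'PATCH',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ status }),
    });
    return { content: [{ type: 'text', text: `${caseId} set to ${status}` }] };
  },
);

await server.connect(new StdioServerTransport());
```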

This changed the dynamic. Reviews moved from pull requests to a tool everyone could use, not just developers. QA, product and team leads could all see what's being tested and weigh in before anything gets automated.

Here's how the pieces fit together end-to-end:


  1. Code changes trigger the agent via manage-test-cases, which drafts test cases in the test management tool.
  2. The team reviews them there and marks the approved ones as Ready for Automation.
  3. automate-tests picks those up, writes the actual Playwright code and updates the status to Automated.
  4. test-review scores the code against our standards before commit.
  5. When tests fail in CI, test-healer either fixes the test, flags a real bug, or hands a complex failure back to the agent.

What Broke and How We Fixed It

The agents didn't get everything right from the start. Early on, we got tests that just skipped themselves without testing anything, scenarios with weak assertions like expect(response).toBeTruthy() that would pass no matter what, and duplicate page objects targeting the same page. Another one: "ghost endpoint" tests, where the agent wrote authorization tests for routes that didn't actually exist in the backend, so they'd return 404 instead of the expected 401 or 403. Technically passing, practically useless.
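
The weak-assertion fix was mostly about insisting on specific expectations. Roughly the difference, with an illustrative endpoint and payload:

```typescript
// Illustrative route and fields; the contrast between the assertions is the point.
import { test, expect } from '@playwright/test';

test('GET /api/projects returns real data', async ({ request }) => {
  const response = await request.get('/api/projects');

  // What we kept seeing: passes for any response object, even a 500.
  // expect(response).toBeTruthy();

  // What the conventions now require: assert the status and the payload shape.
  expect(response.status()).toBe(200);
  const projects = await response.json();
  expect(projects.length).toBeGreaterThan(0);
  expect(projects[0]).toMatchObject({ id: expect.any(String), name: expect.any(String) });
});
```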

Each time we spotted something like this, we added it as a rule. Some went into the conventions skill, some became checks in the test-reviewer agent, and a few (like naming patterns and hardcoded data) became a lightweight hook that runs after every file edit using a smaller model. Over time, these issues stopped appearing.

What We Gained

We got the critical flows covered in days instead of weeks. But speed wasn't the only win. Every test follows the same patterns because they all come from the same conventions skill, and that shared skills layer is what keeps the suite maintainable over time. The agents also cover things we probably wouldn't have bothered with manually, like authorization tests for every single endpoint. Adding the test management system gave us two review layers: one for test cases that the whole team can see and one for the actual code before it gets committed. That made the process more transparent and caught issues earlier.

Packaging it as a plugin turned out to be worth the effort too. The plugin reads paths from a config file, discovers test management tools through MCP at runtime and detects everything else from the project structure. When we needed test automation on a second project, we just installed the plugin and ran the setup skill.
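
The config stays deliberately small. The field names below are an approximation of what the setup skill writes, not the plugin's exact schema:

```typescript
// A sketch of loading the per-project config the setup skill writes.
// Field names are an approximation, not the actual schema.
import { readFileSync } from 'node:fs';

interface TestConfig {
  backendPath: string;    // where the designer reads routes and handlers
  frontendPath: string;   // where it reads pages and components
  testDir: string;        // where generated specs land
  testManagement?: {
    mcpServer: string;        // MCP server to call, if one is connected
    readyStatus: string;      // e.g. "Ready for Automation"
    automatedStatus: string;  // e.g. "Automated"
  };
}

const config: TestConfig = JSON.parse(readFileSync('test-config.json', 'utf8'));
console.log(`Designing scenarios from ${config.backendPath} and ${config.frontendPath}`);
```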

The 2025 Testing in DevOps Report found that 55% of organisations are already using AI for development and testing, with mature DevOps teams reaching 70% adoption. Yet test maintenance still consumes 20% of team time, and only 14% of teams have achieved 80% or more test coverage. 

 

Looking Back

What was unexpected was how much of the work would be about refining rules rather than writing tests. The AI didn't get it right on its own, but every time we found something off and added a rule, the output got noticeably better.

It still needs a human in the loop. But the human gets to spend their time reviewing and thinking instead of typing repetitive test code. That alone made it worth the setup.

Does this replace test automation engineers? Not really. The agents still make mistakes and still need someone who knows what good tests look like to review their output. But as an assistant, it's genuinely useful if you put in the effort to guide it.

For our team, this solved a real problem. We needed test automation and didn't have time to write it all ourselves. But it also became a way to understand how far AI can go when it's given a real project to work on. How agents and skills work together, where the boundaries need to be, where it breaks down. That was worth exploring on its own.

References

  • Katalon. State of Software Quality Report 2025. katalon.com
  • mabl. 2025 Testing in DevOps Report. mabl.com