The honest QA engineer’s guide to building an agentic automation pipeline with Playwright MCP and Claude Code

My honest journey from skepticism to a fully agentic automation pipeline.

When Playwright MCP started showing up everywhere in the QA and dev communities, I didn’t believe the hype instantly. My instinct was to test it and prove it out, not adopt it blindly. I’ve been in QA for 15 years and I’ve seen enough tools promise to eliminate the hard parts of automation, only to hand the hard parts back to you with extra steps.

Before I even think about adoption, I run any tool against the same criteria, with one overarching question: Does this actually solve the pain points the entire delivery team has been living with, not just QAs?

My criteria:

  • Maintenance burden — test suites that break every time the UI shifts, and someone has to drop everything to fix locators instead of shipping features
  • Slow script completion rate — automation perpetually behind the sprint, never quite catching up
  • Skills gap — test scripts that only an SDET or automation engineer can write, which means the bottleneck is always one person
  • ROI — does this actually pay back, how much, and when

Playwright MCP looked promising on people’s screens but I needed to know if it held up in practice.

This post is that full story: the first experiments, messy, honest, not what I expected; then the moment something clicked, which led to the pipeline architecture I can now consistently run today.

Before we get into the detail, here’s the full journey at a glance.

Chapter 1: Discovery

I started the experiments on June 26th, my daughter’s birthday. After we finished our celebration, I was at my desk running test scenarios. Some things you just can’t help.

The setup: VS Code, TypeScript, and GitHub Copilot as the AI companion. TypeScript was new territory for me (Java is my home language), so I leaned on Copilot to explain syntax and async/await patterns as I went. That context matters. What I was testing wasn’t just MCP. It was the whole workflow of using an AI companion to generate and fix tests in real time, on real apps, not demo screens.

I ran three test scenarios with increasing complexity: a search on a demo e-commerce app, a book search on a form-heavy site, and a full registration and login flow on a banking demo app.

The first two were almost impressive. Simple pages, simple locators, code generated in under two seconds. But the moment complexity increased, the same problem appeared every time: locator failures.

Chapter 2: Observation – MCP and GitHub Copilot’s ceiling

These are my notes from the experiments, as I wrote them at the time.

On MCP:

  • Test generation is only impressive when the app and the scenario are simple: navigating to a page, searching, filling a form with placeholder-friendly inputs. Anything beyond that, and the default locator strategy breaks down.
  • getByRole, getByLabel, getByPlaceholder (Playwright’s defaults) are not always reliable enough on their own. They look clean in the generated code, but they don’t always hold up or match the actual DOM. This is the part where it “hallucinates” locator IDs.
  • Stable, proven locator strategies are still the go-to: IDs, name attributes, visible text, and specific XPaths. MCP doesn’t know this inherently.
  • You can’t skip the step of manually observing and understanding the application under test. MCP doesn’t replace that and if your foundation is weak, the output is weaker.
  • At this level, MCP felt comparable to a recording tool like Katalon. It scaffolds a starting point. Everything after that is still your job.

On Copilot Agent:

  • Agent mode is a step up from Ask mode: it can create files, run tests, and append to existing classes automatically.
  • Where it earns its place: explaining errors in plain English, identifying which locator is failing and why, and suggesting fixes. That alone cut my debugging time significantly.
  • Where it doesn’t: the locator fixes it suggested were not always correct. Sometimes it would repeat the same incorrect suggestion across multiple attempts without recognizing that it had already failed.
  • It also didn’t always hold context between prompts. When I asked it to run a specific test, it would sometimes drift, ignoring the scope I’d set, then repeating mistakes it had already made. I found myself re-supplying the same instructions and context, prompt after prompt. I even created a dedicated agent skill file in the codebase to give it a persistent reference point. It helped, but it wasn’t a full fix.

Chapter 3: The turning point

A few things happened around the same time.

With my continuous research, I stumbled upon Anthropic’s video introducing the skills system for Claude Code. The idea: you could encode your domain knowledge into a markdown file, and an AI agent would load and follow it contextually. Not just prompting an AI, but architecting how it reasons about your specific domain.

That reframed everything.

Around the same time, I was active in the Claude Code Manila community and picking up how practitioners, not just demos, were actually using this. People weren’t just asking AI to write code. They were designing systems where the AI operated inside explicit constraints they controlled.

That’s when the question changed for me. Not “can MCP solve my automation pains?” but “what if I gave MCP a locator strategy, a structured output format, and a clear scope, and used it as one component in a systematized workflow?”

I’d love to take a moment to thank the QA community because this chapter wouldn’t exist without them. Back when I first started learning how to automate on my own, it was the community that helped me succeed. People sharing their experiments, their failures, their half-finished ideas. Now that AI-assisted development is in the picture, that hasn’t changed. The community is still one of my biggest driving forces for learning and creating.

A specific thank you to Erkan Barin, whose open-source repo playwright-agent-mcp-starter gave me a foundation to test and iterate on my assumptions faster. The subagent architecture and skill files that came after are my own design, but I wouldn’t have gotten there as quickly without that starting point, and without a community that keeps sharing.

Chapter 4: Deeper observation – MCP’s real job

By this point, automation had become entirely my responsibility. One person. Building everything, maintaining everything, making sure it runs. So I wasn’t looking for a mere framework embellished with AI anymore. I needed a system that could genuinely act as a second pair of hands, something that would help carry the load of building this alone.

So my next question was this: could Playwright MCP and Claude Code hold up with my actual legacy framework — Java, Maven, TestNG?

The predominant issue from my earlier experiments was still there: locator errors. That hadn’t gone away. So I set a specific target: with a detailed locator priority strategy supplied to the agent, could I get at least 90% correct locators out of the box?
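The priority strategy itself is simple to make concrete. Here is a minimal, self-contained Java sketch of the idea; the class, method, and attribute names are my own illustration, not the actual skill file:

```java
import java.util.Map;

/** Minimal sketch of a locator priority strategy: prefer the most stable
 *  attribute available, falling back in a fixed, deterministic order. */
class LocatorPicker {

    /** Picks the highest-priority locator from the attributes scraped off an element. */
    static String pick(Map<String, String> attrs) {
        if (attrs.containsKey("id"))   return "#" + attrs.get("id");
        if (attrs.containsKey("name")) return "[name='" + attrs.get("name") + "']";
        if (attrs.containsKey("text")) return "text=" + attrs.get("text");
        // Last resort: a specific XPath captured during screen exploration.
        return attrs.getOrDefault("xpath", "UNRESOLVED");
    }
}
```

Fed the same element twice, a rule like this lands on the same locator both times, which is exactly the determinism the default getByRole/getByPlaceholder guesses were missing.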

Next, I watched how MCP works when it explores a screen. And I noticed something. Every time it “listens” to a target screen, it creates a temporary file where it writes all the elements and locators it scraped, then deletes it once it’s done.

The first time I saw that file, one thought came immediately: this is an object repository.

After multiple runs, I still wasn’t convinced that relying on MCP alone would get me to the locator accuracy I needed. But that temporary file planted an idea. What if I made this an official document in my codebase, a permanent, structured locator repository and then used MCP not to generate everything in one shot, but specifically as a self-healing mechanism to verify and repair it?

I tried it. The result was finally satisfying.

I tried a one-shot prompt using the MCP tool to see how accurately it could fix 39 flaky locators. Combined with my detailed locator strategy skill, all 39 locators were healed in a single pass. Each one was documented with the old locator, the new locator, and the reason for the change.

I had just built my own self-healing system. Without subscribing to any external tool. Without the high licensing cost. Just MCP, a structured JSON file, and a clear instruction.

Chapter 5: System design – The Agent + Skills pipeline

Once I had that insight, the architecture became clear. MCP’s job was never to generate everything in one shot. Its job was to be one precise stage in a pipeline where each stage had a clear input, a clear output, and explicit rules.

I designed a subagent — ui-automation-pipeline — with five modes.

Mode 1 — Locator repository creation

MCP navigates to the target screen, inspects every element, and produces a structured JSON file. Not a test. Not a page object. Just a verified, prioritised map of every element and its most stable locator. This file is the source of truth for everything downstream.
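For a sense of the shape, an entry in that repository might look something like this; the schema here is my own illustration, not the exact file MCP produces:

```json
{
  "screen": "LoginPage",
  "elements": {
    "usernameField": { "strategy": "id",    "locator": "#username" },
    "loginButton":   { "strategy": "xpath", "locator": "//button[normalize-space()='Log in']" }
  }
}
```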

Mode 2 — Page object creation

The agent reads the JSON repository and generates a Java page object class. It cannot invent locators. It can only use what the repository provides. Every method wraps an Allure step. Every interaction is preceded by a visibility check.
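The contract is easier to see in code. Here is a minimal, self-contained Java sketch of the pattern a generated method follows; the step() and waitVisible() helpers are stand-ins for the real Allure step reporting and WebDriver waits, and every name is my own illustration:

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the generated page-object pattern: every public method wraps a
 *  reported step, and every interaction is preceded by a visibility check. */
class LoginPage {
    // Stand-in for Allure's step log, so the pattern is visible without a browser.
    static final List<String> report = new ArrayList<>();

    // Real code would use an @Step-annotated method; here we just record the step.
    static void step(String description, Runnable action) {
        report.add(description);
        action.run();
    }

    // Real code would run an explicit WebDriver visibility wait on the locator.
    static void waitVisible(String locator) {
        report.add("waitVisible " + locator);
    }

    /** The locator comes from the JSON repository; the agent may not invent it. */
    public void enterUsername(String value) {
        step("Enter username: " + value, () -> {
            waitVisible("#username");
            // Real code: driver.findElement(By.cssSelector("#username")).sendKeys(value)
        });
    }
}
```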

Mode 3 — Test class creation

You provide a structured test scenario — priority level, method name, steps with clear actions (matching what’s in the target screen) and specific assertions. The agent reads the page object and generates the test class. Allure annotations applied consistently, every time.

Mode 4 — Add a test case

Extend an existing verified test class with one new scenario. Existing tests are preserved. If a required page object method is missing, the agent reports the gap rather than inventing something.

Mode 5 — Self-healing locator repair

When a test fails due to a locator error, you trigger this mode. MCP re-explores the screen, reconciles the live DOM against the repository, heals broken locators, updates the JSON, and brings the page object back in line. It delivers a structured heal report.
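The reconciliation step at the heart of this mode is simple. Here is a minimal, self-contained Java sketch of the idea; in real runs the "live scrape" comes from MCP exploring the screen, but here it is just a map, and all names are my own:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Sketch of Mode 5 reconciliation: compare the stored repository against
 *  locators freshly scraped from the live DOM, heal, and report every change. */
class SelfHeal {

    /** Heals the repository in place; returns one report line per healed locator. */
    static List<String> reconcile(Map<String, String> repository,
                                  Map<String, String> liveScrape) {
        List<String> healReport = new ArrayList<>();
        for (Map.Entry<String, String> e : repository.entrySet()) {
            String live = liveScrape.get(e.getKey());
            if (live != null && !live.equals(e.getValue())) {
                // Document old locator, new locator, and the reason for the change.
                healReport.add(e.getKey() + ": " + e.getValue() + " -> " + live
                        + " (stale locator replaced from live DOM)");
                repository.put(e.getKey(), live);
            }
        }
        return healReport;
    }
}
```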

Each mode is backed by a skill file: a markdown document encoding the conventions, locator strategy, naming rules, and quality gates for that artifact. The agent follows those rules. It doesn’t make QA decisions on its own; it follows the rules, principles, and guardrails I designed.
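To make that tangible, here is a trimmed, illustrative fragment of what such a skill file can look like; the headings and wording are my paraphrase, not the actual file:

```markdown
# Skill: page-object-creation

## When this skill triggers
Mode 2 requests: generating a Java page object from a verified locator repository.

## Rules
1. Use only locators present in the JSON repository; never invent one.
2. Wrap every public method in an Allure step.
3. Precede every interaction with a visibility check.
4. If a required locator is missing, report the gap; do not guess.
```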

Here’s a sample test case prompt I use to ensure my AI agent doesn’t get lost in its inference.

FLOW: User Management — Create User: Happy Path
Test Method Name: createUser_withValidData_isSavedSuccessfully()
Flow Steps + Assertions:
→ Navigate to Users and click Create User
→ Enter valid user details
→ Assign a role from the available options
→ Select one or more permissions
→ Click Save
✓ assert assigned permissions are shown in the user draft
→ Confirm the creation
✓ assert success message is shown
✓ assert new user appears in the users list with correct saved values
✓ assert user details persist when the record is reopened
✓ assert audit log contains the correct Create User entry

Human-readable enough for a manual QA or a developer to write. Structured enough to feed directly into Mode 3.

Chapter 6: Sharpening the tool

The pipeline worked. But here’s another lesson: it was expensive to run.

Before I tightened the skill files, I was burning 90–100K tokens per mode. That’s not sustainable on any subscription. And the reason was straightforward once I looked at it: my skill files were bloated. The same constraints stated three different ways across three different sections. Wide markdown tables that cost tokens without adding information. Operating boundaries that were already enforced by the subagent, restated inside every skill.

So I rewrote everything lean. One rule stated once, in the right place. Tables converted to numbered lists. Redundant sections removed entirely. The subagent’s system prompt cut by more than half without losing a single behavior.

Chapter 7: Making the system reliable – Token discipline

Once my subagent and skills were refined with lean structure and context, and flowed cleanly from one another, the result was rewarding: a complete test case from scratch now costs 4.4K to 8K tokens on average. That’s the difference between a tool that’s expensive to run and one you can actually rely on daily.

Plan mode — knowing when to use it

The other sharpening I had to incorporate was deciding where plan mode belonged in the pipeline and where it didn’t.

Modes 2 and 3 use plan mode first. Before the agent writes a single method or test, it reads the page object and test class skills so it won’t reinvent its way, then flags me before it commits anything. The flow: it generates the code, I review, then I approve.

Mode 4 skips plan mode for simple additions. If the new test case fits cleanly into the existing class, there’s no need for the overhead.

Mode 5, the self-heal loop, skips it entirely. It’s a precise, scoped task with a defined output format.

The vision behind all of this isn’t just a pipeline I can run. It’s a system anyone can follow.

Every mode has a documented prompt format. Every skill file has a clear description of when it triggers and what it produces. The subagent’s instructions are written to be readable not just by Claude, but by a QA engineer or a developer who wants to understand what the system is doing and why.

The goal: a QA who didn’t build this can pick it up and start building. A developer who wants to contribute a test scenario can write it in the structured format and hand it off. The knowledge lives in the system and not just isolated in someone’s head.

This is what makes it scale.

Chapter 8: The final architecture

This is the system as it stands today.

What this led to

I was 2 sprints behind. One person. Building the framework, writing the scripts, maintaining the locators, keeping it all running.

Two weeks after this system was fully operational, I was finally able to work in the current sprint.

Not because the agent did everything, or the AI handled all of it, but because intentional systems design was applied. That’s not something the AI can be credited with. It’s the culmination of human architecture and experience.

Where this is going

The pipeline is designed to scale beyond me. A QA can write a test scenario in the structured format and hand it to Mode 3. A developer can do the same. They don’t need to know the framework internals. The knowledge lives in the system, in the skill files, in the locator repository, in the conventions baked into every generated artifact.

That’s the real unlock. Not “AI writes my tests.” But: anyone who understands the application can contribute to the test suite. The bottleneck of needing one specialist to build everything is what this system is designed to remove.

What’s next is bringing the rest of the delivery team into this. Not just handing them a system to use, but educating them on what it does, coaching them on automation foundations and principles, and then improving it together. The pipeline was built by one person. The next version will be built by a team.


Discover more from QA Education For Real-World Success

Subscribe to get the latest posts sent to your email.

