How I Onboarded Myself Into a Production App Using Exploratory Testing and Playwright MCP

No documentation. No handover. Just a URL and the tools to figure it out.

May–June 2026 • ~12 min read

The Setup: A URL and Zero Context

A freelance client handed me a production URL and said: audit this.

The app was an AI companion chatbot - a PWA with chat, voice calls, image generation, dozens of content scenarios, a token economy, and multi-language support. I had never seen it before. There was no documentation, no test environment walkthrough, no previous bug reports, no handover of any kind.

This article focuses on the first phase - the initial exploratory testing that I split across 8 sessions. This is where I onboarded myself into a product I knew nothing about, mapped its features, and surfaced the bugs that shaped the rest of the engagement.

This is the story of how I did it using exploratory testing, Playwright MCP, Claude Code, and Notion as my command center.

Why Exploratory Testing Was the Only Option

When you have no documentation, no test cases, and no one to explain the app to you, scripted testing is impossible. You can't write test cases for features you don't know exist.

Exploratory testing flips this: you learn the product by testing it. Every click is both discovery and verification. You're building a mental model of the application while simultaneously checking whether that model holds up.

I structured my exploration using session-based test management. Each session had one focus area, and across the 8 sessions I covered six:

First impression: Landing page, login, initial user experience
Core features: Primary functionality, content creation tools, AI interactions
Settings & navigation: Every section, every button, monetization flows
Mobile: Android and iOS PWA install, platform-specific behavior
Localization: Multiple languages spot-checked, RTL layout, AI language persistence
Edge cases: Network behavior, heuristic evaluation, final report

Each session started with a charter ("Explore the content creation tools to discover how they work and what can go wrong") and ended with notes in Notion. But within each session, I followed my curiosity - when something felt off, I pulled the thread.
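A session is easier to compare with the others when each one is recorded in the same shape. Here is a minimal sketch, assuming a hypothetical `Session` record. The field and method names are illustrative and are not the structure of the Notion pages used in the audit:

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """One exploratory session: a focus area, a charter, and what came out of it."""
    focus: str
    charter: str
    notes: list[str] = field(default_factory=list)
    threads: list[str] = field(default_factory=list)  # leads that "felt off"

    def note(self, text: str) -> None:
        # Plain observation made while following the charter
        self.notes.append(text)

    def pull_thread(self, lead: str) -> None:
        # Something suspicious worth following before returning to the charter
        self.threads.append(lead)


session = Session(
    focus="Core features",
    charter="Explore the content creation tools to discover how they work and what can go wrong",
)
session.note("Placeholder observation")
session.pull_thread("Placeholder lead to investigate")
```

Keeping threads separate from notes makes it visible afterwards which findings came from the charter and which came from curiosity.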

Playwright MCP: The Browser as a Tool, Not Just a Window

Before I even started manual testing, Playwright MCP helped me onboard. I pointed it at the app and let it crawl - navigating every section, taking snapshots, mapping the structure. This gave both me and the AI agent a shared understanding of what the app looked like, how it was organized, and where the main features lived. Instead of starting from a completely blank slate, I had a map before the first manual session even began.

Playwright MCP is a browser automation server that Claude Code can drive directly. I used it not for end-to-end test scripts, but as a power tool throughout the entire process - from initial onboarding to deep investigation:

Console monitoring: While I explored manually, Playwright MCP watched for JavaScript errors. The app turned out to be remarkably clean - only 1 console error across the entire desktop session (a third-party analytics failure). That was a good early sign for the quality of the frontend engineering.
Network verification: When I found a feature whose API call returned an HTTP 400 error, Playwright MCP helped me capture the exact request/response for the bug report. The team initially said "can't reproduce" - having the exact network trace made the bug undeniable.
DOM inspection: When Arabic text displayed correctly but the layout stayed left-to-right, Playwright MCP confirmed that document.dir stayed ltr regardless of language setting. A human can see the text is Arabic; the tool proves the document's dir attribute never switches to rtl.
Cross-platform spot checks: After finding a navigation bug on iOS manually, I used Playwright MCP to verify the same interaction worked correctly on desktop - confirming it was a platform-specific touch event issue, not a general bug.
Systematic page walks: Playwright MCP navigated to each section's URL directly and verified pages loaded without errors. This was faster than clicking through manually and caught things like a 404 page still showing developer-facing text instead of a user-friendly message.
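In the audit these checks were driven through Playwright MCP by the agent, so no script was written. For readers who want the same signals in a standalone script, here is a rough sketch using Playwright's Python library instead. The names `PageFindings`, `walk_page`, `walk_site`, and `direction_mismatch` are hypothetical, and the list of right-to-left languages is an assumption to adjust per product:

```python
from dataclasses import dataclass, field

RTL_LANGUAGES = {"ar", "fa", "he", "ur"}  # Arabic, Persian, Hebrew, Urdu


@dataclass
class PageFindings:
    url: str
    console_errors: list[str] = field(default_factory=list)
    failed_responses: list[tuple[int, str, str]] = field(default_factory=list)
    document_dir: str = ""        # value of document.dir ("" when unset)
    computed_direction: str = ""  # computed CSS direction of <html>

    def is_clean(self) -> bool:
        return not self.console_errors and not self.failed_responses


def direction_mismatch(language: str, computed_direction: str) -> bool:
    """True when the rendered direction contradicts the UI language."""
    expected = "rtl" if language in RTL_LANGUAGES else "ltr"
    return computed_direction != expected


def walk_page(page, url: str) -> PageFindings:
    """Load one URL on a Playwright Page and record console errors,
    HTTP responses with status >= 400, and the document direction."""
    findings = PageFindings(url=url)

    def on_console(msg):
        if msg.type == "error":
            findings.console_errors.append(msg.text)

    def on_page_error(error):
        # Uncaught exceptions thrown inside the page
        findings.console_errors.append(str(error))

    def on_response(response):
        if response.status >= 400:
            findings.failed_responses.append(
                (response.status, response.request.method, response.url)
            )

    page.on("console", on_console)
    page.on("pageerror", on_page_error)
    page.on("response", on_response)

    page.goto(url, wait_until="networkidle")
    findings.document_dir = page.evaluate("document.dir")
    findings.computed_direction = page.evaluate(
        "getComputedStyle(document.documentElement).direction"
    )
    return findings


def walk_site(urls: list[str]) -> list[PageFindings]:
    """Visit each URL in a fresh page so listeners never leak between pages."""
    # Imported here so the helpers above work without Playwright installed.
    from playwright.sync_api import sync_playwright

    results = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        for url in urls:
            page = browser.new_page()
            results.append(walk_page(page, url))
            page.close()
        browser.close()
    return results
```

Calling `walk_site` with a list of section URLs returns one `PageFindings` per page. With the UI language set to Arabic, `direction_mismatch("ar", findings.computed_direction)` would flag a layout that stays left-to-right.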

The key insight: I never used Playwright MCP instead of manual testing. I used it alongside manual testing. My eyes and judgment caught the things that require human context - confusing flows, missing feedback, inconsistent behavior across sessions. Playwright MCP caught what's invisible to the naked eye - console errors, broken API calls, incorrect DOM attributes. And before any of that, it helped me get to know the product in the first place. Together, they covered more ground than either could alone.

Notion: The Living Bug Tracker

Every bug needs a home. For this engagement, Notion was the entire project management system - bug tracker, session notes, evidence archive, and final report, all in one workspace.

I set up a simple structure:

Bug database: Each bug got a number (#1 through #37), severity, affected area, platform, status, and description. Filterable by any field.
Session log: One page per testing session. Charter at the top, findings inline, screenshots and video embedded directly. This was my exploration journal - not just what I found, but how I found it and what I was thinking.
Evidence folder: Screenshots organized by session. Video recordings for bugs that were hard to reproduce. This became critical when the team pushed back on a bug they couldn't reproduce - the video evidence settled it.
Status tracking: As the dev team fixed bugs during the audit, I updated statuses in real time. By the end, 9 of 37 issues were already fixed, 2 were partially fixed, and 5 were confirmed as intended behavior.
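The bug database described above boils down to a flat record with a handful of filterable fields. Here is a small sketch, assuming a hypothetical `Bug` record and helper names. The three sample rows are placeholders, not findings from the audit, and only the totals at the end come from the article:

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Bug:
    number: int
    severity: str
    area: str
    platform: str
    status: str
    description: str


def filter_bugs(bugs, **criteria):
    """Return bugs whose fields equal every given criterion, e.g. platform='iOS'."""
    return [b for b in bugs if all(getattr(b, k) == v for k, v in criteria.items())]


def status_counts(bugs):
    """Count bugs per status value."""
    return Counter(b.status for b in bugs)


# Placeholder rows for illustration only.
bugs = [
    Bug(1, "major", "navigation", "iOS", "open", "Placeholder description"),
    Bug(2, "minor", "localization", "desktop", "fixed", "Placeholder description"),
    Bug(3, "major", "settings", "Android", "intended", "Placeholder description"),
]

# End-of-audit totals as reported in the article.
total = 37
reported = {"fixed": 9, "partially fixed": 2, "intended behavior": 5}
remaining = total - sum(reported.values())        # status not broken down in the article
fixed_share = round(reported["fixed"] / total * 100, 1)  # percent of all issues
```

With the reported totals, `remaining` is 21 and `fixed_share` is 24.3, so roughly a quarter of the issues were fixed while the audit was still running.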

The reporting pipeline was also streamlined through Claude Code. When I found a bug during exploration, I'd describe it in the Claude console - what I saw, the steps to reproduce, the severity, the platform. Claude would then write the formatted bug report directly into Notion with proper structure: title, steps to reproduce, expected vs actual behavior, severity, platform, and screenshots. This meant I never had to context-switch from testing to writing - I stayed in exploration mode while the bug reports wrote themselves.
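The report structure listed above (title, steps to reproduce, expected vs actual behavior, severity, platform, screenshots) can be pictured as a small template. This is only a sketch of that shape, with hypothetical names `BugReport` and `render`. It does not call Notion or reproduce what Claude actually wrote, and the example steps and filename are invented for illustration:

```python
from dataclasses import dataclass, field


@dataclass
class BugReport:
    title: str
    steps: list[str]
    expected: str
    actual: str
    severity: str
    platform: str
    screenshots: list[str] = field(default_factory=list)


def render(report: BugReport) -> str:
    """Format a report as plain text with numbered reproduction steps."""
    lines = [
        f"Title: {report.title}",
        f"Severity: {report.severity}",
        f"Platform: {report.platform}",
        "",
        "Steps to reproduce:",
    ]
    lines += [f"{i}. {step}" for i, step in enumerate(report.steps, start=1)]
    lines += ["", f"Expected: {report.expected}", f"Actual: {report.actual}"]
    if report.screenshots:
        lines += ["", "Screenshots:"] + [f"- {name}" for name in report.screenshots]
    return "\n".join(lines)


# Illustrative report based on the RTL finding; steps and filename are hypothetical.
example = BugReport(
    title="Layout stays left-to-right when the language is Arabic",
    steps=["Open settings", "Switch the language to Arabic", "Return to the chat screen"],
    expected="Layout mirrors to right-to-left",
    actual="Text is Arabic but document.dir stays ltr",
    severity="major",
    platform="desktop",
    screenshots=["rtl-layout.png"],
)
text = render(example)
```

Fixing the field order once is what lets the spoken description stay informal while every report lands in the tracker looking the same.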

The Notion workspace served a dual purpose: it was my working tool during the audit AND the deliverable format. The client could see progress in real time, comment on bugs, mark things as intended behavior, and track fixes - all without me generating separate status reports.

When it came time to write the final audit report, everything was already documented. I just structured it into sections: executive summary, feature-by-feature findings, bug table, heuristic scorecard, and recommendations.

The Workflow: How Everything Fit Together

The arrangement was: me exploring manually, Playwright MCP running alongside for technical verification, Claude Code as my assistant for investigation and bug reporting, and Notion as the shared workspace where everything landed.

A typical exploration session looked like this:

Pick a charter - "Explore the voice/audio feature and its options"
Open the app manually - click through as a real user would
Playwright MCP running alongside - monitoring console, ready for DOM queries
Notice something off - multiple options that should differ behave identically
Investigate with Playwright MCP - check if different assets are loaded, verify network requests
Report to Claude - describe what I found in the Claude console, and Claude writes the formatted bug report into Notion
Continue exploring - stay within the charter, test related functionality around the same area, follow the thread

The key to this setup was never breaking flow. I didn't stop testing to write bug reports. I didn't switch between apps to format a Notion page. I described what I found to Claude, it handled the documentation, and I kept testing. The explore-verify-report loop stayed fast because the reporting step was practically zero friction.

This loop repeated hundreds of times across the audit. The tools never replaced my judgment about what to test next. They just made each cycle faster and more thorough.

What Exploratory Testing Found That Scripts Never Would

Some of the highest-impact findings came from thinking like a user, not like a tester:

State that doesn't survive restart: Some settings worked perfectly during a live session but reverted or broke after closing and reopening the PWA. This bug only surfaces when you test like a real user - someone who uses the app across multiple sessions, not just one.
Systemic patterns across isolated bugs: Multiple bugs pointed to the same root cause in the backend. A bug-by-bug approach treats these as separate issues. Exploratory testing reveals the pattern - and that changes the fix from patching individual symptoms to addressing the underlying architecture.
Platform-specific breakage: Core navigation worked perfectly on desktop and Android but was broken on iOS due to touch event handling differences. Without testing on all three platforms as a real user would, this would have shipped unnoticed to a significant chunk of the audience.
Competitor context: Researching how similar apps in the category handle the same flows helped me frame findings with market awareness. This kind of context comes from a human who understands the product landscape, not from a test script.

The Heuristic Evaluation: Structured Judgment

After thorough exploration, I scored the app against Nielsen's 10 Usability Heuristics. This wasn't a separate activity - it was a synthesis of everything I'd already observed.

The overall score was 3.7 out of 5. But the interesting part was the gap between categories:

Efficiency & Satisfaction: 4.4/5 - the app is genuinely excellent for power users. Premium design, deep features, flexible customization.
Learnability & Reliability: 3.0/5 - error recovery had gaps, and some technical issues affected the experience for returning users.

This gap told the whole story: the core product was strong, but technical reliability issues were holding back the overall experience. The features were there - they just needed polish and bug fixes to match the quality of the design and content.
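The two category scores line up with the overall figure if both categories carry equal weight. That weighting is an assumption, since the article does not say how the ten heuristics were grouped or weighted. A quick check of the arithmetic:

```python
from statistics import mean

# Category scores as reported in the article (out of 5).
category_scores = {
    "Efficiency & Satisfaction": 4.4,
    "Learnability & Reliability": 3.0,
}

# Assumption: equal weight per category.
overall = round(mean(list(category_scores.values())), 1)

# Size of the gap between the strongest and weakest category.
gap = round(max(category_scores.values()) - min(category_scores.values()), 1)
```

Under that assumption `overall` comes out at 3.7, matching the reported score, and `gap` is 1.4 points on a 5-point scale.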

Results

Metric	Value
Total issues found	37
Major severity	7
Fixed during audit	9
Platforms tested	3 (Desktop, Android, iOS)
Languages spot-checked	5 of 13+
Heuristic score	3.7 / 5

What I'd Do Differently

Looking back, a few things I'd adjust:

Push harder for test credentials upfront. The client's #1 priority was payment flow testing. Without test credentials, I couldn't test it. I should have made this a blocking prerequisite before starting, not a noted limitation in the report.
Start with the mobile experience. I did desktop first because it felt natural. But most of the audience uses the iOS/Android PWA. Starting on mobile would have surfaced the highest-impact platform-specific bugs earlier.
Record more video, fewer screenshots. Video evidence settled disputed bugs immediately. Screenshots required more explanation. For the next audit, I'd default to screen recordings for any bug that involves interaction sequences.

Lessons for Solo QA Audits

Exploratory testing is the fastest way to learn a product you've never seen. You can't write test cases for features you don't know exist. Explore first, structure later.
Combine manual and automated. Playwright MCP helps you onboard into the product and catches technical problems. Human eyes catch everything that requires context and judgment. Neither replaces the other.
Your bug tracker is your deliverable. With Notion, I wasn't maintaining a separate bug list and a separate report. The tracking system was the report. When the audit ended, structuring the final document took only hours.
Patterns matter more than individual bugs. Finding 37 bugs is useful. Identifying that several of them share a systemic root cause is what changes architecture decisions.
Competitor context matters. Researching how similar apps handle the same flows gave my findings more weight and helped frame recommendations realistically. QA isn't just finding bugs - it's understanding the product's context.
Evidence wins arguments. The dev team said "can't reproduce" on a bug I'd recorded. The video ended the debate. Always capture evidence at the moment of discovery.
Scope what you can't test. I couldn't test payment flows without credentials. Documenting this honestly in the report maintained trust and set up a clear follow-up engagement.

The Bottom Line

I walked into this engagement knowing nothing about the product. By the end of the initial exploratory phase, I'd mapped every feature, found 37 issues across 3 platforms and 5 languages, and delivered a heuristic evaluation. The team had already fixed 9 of them before I submitted the first report - and the findings from this phase shaped the direction of the rest of the engagement.

The toolchain was simple: exploratory testing for discovery, Playwright MCP for onboarding and verification, Claude Code as the glue that drove the browser and wrote bug reports directly into Notion. No heavyweight test management software. Just focused exploration with the right tools at hand, and an AI assistant that kept the documentation flowing while I kept testing.

The product turned out to be genuinely good - strong AI conversation quality, deep features, premium design. It just needed someone to sit with it, use it like a real person across real devices, and tell the team what was broken before their users did.

That's the job.

Exploratory Testing • Playwright MCP • Claude Code • Notion • Heuristic Evaluation • Cross-Platform • Localization • Freelance • PWA