Your AI agent just wrote 5,000 lines of code. How do you know it actually works?
Genuinely curious what the community does here.
We've been talking to hundreds of teams building with Cursor, Claude Code, and other agentic tools — and the honest answer from most of them is: "We just run it and hope."
Some do a quick manual click-through. Some write a few spot checks. Some just ship and wait for users to find the bugs.
We built TestSprite to solve exactly this — autonomous testing that runs from your PRD and codebase — but I'm curious what your actual workflow looks like before you merge.
Do you have a testing step that actually works? Or is verification still the part of the agentic workflow nobody has figured out?
👇 Drop your answer below — genuinely want to know.
— Yunhao, co-founder @ TestSprite



Replies
Honest answer from someone who just shipped a solo AI-built product:
My testing workflow was embarrassingly unsophisticated and it mostly worked, which I think is both encouraging and a little alarming.
I built Circle (an AI relationship memory app) entirely with Lovable and Claude, no traditional dev team. My "testing" was three things: I used the product obsessively myself before anyone else touched it; I asked Claude for UAT and ran it through agents; and I ran a security check on every new feature to make sure there were no obvious flaws.
What that caught: real UX confusion, edge cases in the birthday reminder logic, and one fairly embarrassing auth flow issue that a proper test suite would have caught in seconds.
What it missed: things that only surface at scale, race conditions, and anything my own mental model was too close to the product to see.
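For concreteness, the birthday reminder edge cases were the kind of date math a small spec check surfaces quickly. A minimal sketch of that kind of check (hypothetical helper name, not Circle's actual code):

```python
from datetime import date

def next_birthday(birthdate: date, today: date) -> date:
    """Return the next occurrence of a birthday on or after `today`.

    Feb 29 birthdays fall back to Feb 28 in non-leap years.
    """
    def occurrence(year: int) -> date:
        try:
            return birthdate.replace(year=year)
        except ValueError:  # Feb 29 in a non-leap year
            return date(year, 2, 28)

    candidate = occurrence(today.year)
    if candidate < today:
        candidate = occurrence(today.year + 1)
    return candidate

# The cases a quick manual click-through tends to miss:
assert next_birthday(date(1992, 2, 29), date(2023, 3, 1)) == date(2024, 2, 29)
assert next_birthday(date(1992, 2, 29), date(2023, 1, 1)) == date(2023, 2, 28)
assert next_birthday(date(1990, 6, 15), date(2023, 6, 15)) == date(2023, 6, 15)  # today counts
```

Three asserts like these take minutes to write and would have caught the reminder bugs before any user did.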
The honest meta-point is that agentic tools have made building dramatically faster but the verification layer hasn't kept pace. The gap between "Claude wrote this and it looks right" and "this is actually right" is where most of the risk lives and right now most solo builders are bridging that gap with intuition and luck.
The PRD-to-test-suite idea is genuinely interesting to me because the PRD is often the clearest expression of intent before the code drifts away from it. Would be curious whether TestSprite handles the case where the PRD and the shipped behaviour diverge and whether it can tell you why, not just that they do.
the honest answer is that most verification for agent-generated code is still manual and ad hoc. i run multiple coding agents in parallel on different features and the only thing that consistently catches issues is treating the diff like a code review, not a rubber stamp. i read every file changed, check for hallucinated imports, verify the logic actually matches what i asked for.
one pattern that helps: keep agent tasks small and scoped. 5000 lines in one shot is where things go sideways. if each task touches 200-400 lines max you can actually reason about what changed. the agents that let you set up CLAUDE.md or similar instruction files help a lot here because you can enforce patterns the model tends to drift from.
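to make the instruction-file point concrete, here's a hypothetical CLAUDE.md fragment of the kind that keeps an agent on-pattern (illustrative rules, not anyone's actual file):

```markdown
# Project conventions (apply to every task)
- Keep each task's diff under ~400 lines; split larger work into follow-up tasks.
- Never add a new dependency without flagging it in the task summary.
- All database access goes through the repo's db helpers; no inline SQL in handlers.
- Run the existing test suite and report failures before declaring a task done.
```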
automated tests are great in theory but the irony is the agent writes the tests too, so they often just verify the agent's own assumptions. the real signal comes from running the actual app and seeing if it behaves correctly. not glamorous but it works.
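a toy illustration of that irony (hypothetical names, not from any real agent run): the agent misreads the spec, then writes a test that asserts its own misreading, so everything is green and still wrong.

```python
# Spec: apply the 10% discount only to items over $100.
# The agent instead discounted the whole cart...
def cart_total(prices, discount=0.10):
    return sum(prices) * (1 - discount)  # wrong: discounts every item

# ...and then wrote a test that encodes the same misunderstanding:
def test_cart_total():
    assert cart_total([150, 50]) == 180.0  # passes; the spec says 150*0.9 + 50 = 185

test_cart_total()  # green, and still wrong

# An independently written check against the spec catches it immediately:
def spec_total(prices, discount=0.10):
    return sum(p * (1 - discount) if p > 100 else p for p in prices)

assert spec_total([150, 50]) == 185.0
assert cart_total([150, 50]) != spec_total([150, 50])  # the gap the agent's test never sees
```

the fix isn't more agent-written tests, it's at least one check whose assumptions came from a human reading the spec.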
Imagine you hired a developer and requested a new feature to be implemented.
How do you verify that the code quality is high and that it works?
It is the same process with a human developer or an AI agent.
Follow the best engineering practices for checking the definition of done, code review, and testing.
Coding might be a solved problem nowadays, but engineering skills and experience are even more valuable!
This question keeps me up at night honestly.
With Hello Aria, we use AI-generated code throughout our codebase, and here's the workflow that actually works for us:
1. Write the test first (spec the behavior), let AI write the implementation
2. Run the AI output against your existing test suite immediately — not "later"
3. Any AI-generated code that touches auth, billing, or data persistence gets manually reviewed, no exceptions
4. End-to-end smoke tests for critical user paths run on every PR
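A minimal sketch of step 1 (hypothetical feature, plain asserts rather than a real test runner): the human writes the spec first, then hands the signature to the agent and runs the spec against whatever comes back.

```python
# 1. The human-written spec, committed before any implementation exists.
def spec_normalize_email(normalize_email):
    assert normalize_email("  Ada@Example.COM ") == "ada@example.com"
    assert normalize_email("bob@example.com") == "bob@example.com"
    try:
        normalize_email("not-an-email")
    except ValueError:
        pass
    else:
        raise AssertionError("invalid input must raise ValueError")

# 2. The agent's implementation, run against the spec immediately, not "later".
def normalize_email(raw: str) -> str:
    cleaned = raw.strip().lower()
    if "@" not in cleaned:
        raise ValueError(f"not an email address: {raw!r}")
    return cleaned

spec_normalize_email(normalize_email)  # the merge gate: spec passes or the PR waits
```

The point of the ordering is that the spec encodes the human's intent before the agent's assumptions exist, so the test can't just mirror the implementation.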
The scariest AI code mistake we've seen isn't the obvious bugs — it's the subtly wrong edge case that passes all tests and only fails in production under specific conditions.
The answer to "how do you know it works" is: you can't fully know. You can only make the failure modes smaller and faster to detect.
Honest answer: Claude Code plus reading every diff carefully before committing.
I build nights and weekends on a TypeScript monorepo with a fairly complex backend — Azure services, AI pipelines, SSE streams. Claude Code generates a lot of the implementation. My verification step is less about automated tests and more about understanding exactly what changed and why.
Umair's point about task scope is right. The moment I let a task get too large, the diff becomes unreadable and I lose confidence in what I'm merging. Small, focused tasks where I can reason about the full change — that's where I catch problems.
The awkward irony he also nailed: the agent writes the tests too. So you're really just testing the agent's own assumptions about what the code should do. Real confidence comes from running the actual flow end to end.