We let Claude write 100% of our code for 7 days. Here's what broke first.
Last week we did something stupid.
We paused all human coding. Gave Claude (Anthropic) access to our GitHub repo. Told it to build new features, fix bugs, and ship.
No human review. No guardrails. Just Claude and our codebase.
For 7 days, it ran the engineering team.
Here's what happened.
Day 1: Confidence was high.
Claude (Sonnet 4.6 then Opus 4.5) fixed a small CSS bug in 30 seconds. Then refactored a messy function into something readable. We felt like geniuses.
By end of day, it had shipped 3 minor improvements. We started talking about cutting engineering costs.
Day 2: The first crack.
We asked Claude to add a new filter to our dashboard. It wrote the code. It worked locally. We merged.
That night, something else broke. A completely unrelated chart stopped loading. No error logs. No obvious cause.
We spent 2 hours tracing it back to Claude's change. The filter logic was fine. But it had refactored a shared utility function that 5 other features relied on. It didn't check dependencies. It just assumed.
We rolled back. Lesson one: AI doesn't think about side effects.
Day 3: The false confidence trap.
We asked Claude to build a new feature from scratch. It generated 800 lines of code. Beautiful structure. Clean comments. Tests included.
We reviewed it quickly. Looked perfect.
Pushed to staging. The feature worked. We celebrated.
Then we noticed something strange. Our API costs had spiked. Claude was making 3x more calls than necessary — not because the code was wrong, but because it didn't understand pricing implications. It called external APIs in loops where a batch request would have been fine.
No error. Just expensive.
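The pattern, roughly. The names and the client below are hypothetical stand-ins, not our actual stack: the per-item version makes one request per record, the batch version makes one per hundred. A stub client that just counts requests makes the cost difference visible:

```python
# Sketch only -- hypothetical API names, not our real integration.

def enrich_per_item(client, items):
    # N round trips: one API call per item. This is what Claude wrote.
    return [client.enrich(item) for item in items]

def enrich_batched(client, items, batch_size=100):
    # ceil(N / batch_size) round trips: same result, a fraction of the calls.
    results = []
    for i in range(0, len(items), batch_size):
        results.extend(client.enrich_batch(items[i:i + batch_size]))
    return results

class CountingClient:
    """Stub client that counts requests, standing in for a paid external API."""
    def __init__(self):
        self.requests = 0
    def enrich(self, item):
        self.requests += 1
        return item.upper()
    def enrich_batch(self, batch):
        self.requests += 1
        return [item.upper() for item in batch]

items = [f"item{i}" for i in range(300)]

per_item = CountingClient()
enrich_per_item(per_item, items)
batched = CountingClient()
enrich_batched(batched, items)
print(per_item.requests, batched.requests)  # 300 1
```

Both versions return identical data. The only difference is the request count, which is exactly why nothing errored and nothing looked wrong until the bill arrived.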
Day 4: The silent failure.
We asked Claude to optimize our database queries. It wrote better SQL. Things ran faster.
Then user emails started coming in. "Where did my old data go?"
Claude had dropped a table. Not a critical one. But a table with 3 months of user activity logs. Not backed up. Not in our retention policy.
It didn't ask permission. It didn't warn us. It just did what we asked: "clean up old data."
We spent the next 2 hours restoring a DB snapshot and rolling the change back.
Day 5: The paradox.
We asked Claude to fix the backup issue. It wrote a beautiful automated backup script. Scheduled. Logged. Perfect.
We asked it to add a new feature. It worked flawlessly.
We asked it to review its own code from day 3. It found 2 potential bugs and fixed them.
We started feeling safe again.
Then at 3am, our site went down. Claude had updated a core dependency to the latest version. It worked in test. But the new version had a breaking change our production environment didn't support. No human would have made that mistake.
Day 6: The blame game.
We spent the morning restoring the site. Asked Claude what happened. It explained the dependency logic perfectly. It acknowledged the mistake. Then it suggested 3 ways to prevent it in the future.
One of the suggestions was to implement a dependency review process before merging.
It was telling us to put humans back in the loop.
The hardcoded amateur sh*t came the day before. We asked Claude to add a simple feature: a discount code field on checkout. It worked. Beautifully. Until we realized it had hardcoded the discount logic. Not configurable. Not in settings. Just raw numbers and conditions buried in the code. Changing the discount amount meant a developer digging in and rewriting it. It didn't ask. It just assumed. That's when we realized we needed to rethink the whole AI visibility engine.
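The difference, roughly, with hypothetical names rather than our actual checkout code: in the first version the business values live in the code, in the second they live in data the code merely reads.

```python
# Sketch only -- hypothetical names, not our real checkout handler.

# What Claude shipped: the discount buried as raw numbers in the handler.
def apply_discount_hardcoded(total, code):
    if code == "LAUNCH10":
        return total * 0.90  # changing this means editing and redeploying code
    return total

# What we wanted: rules live in config or the database, code just looks them up.
DISCOUNTS = {         # in practice this would come from settings or a table
    "LAUNCH10": 0.10,
    "VIP25": 0.25,
}

def apply_discount_configurable(total, code, discounts=DISCOUNTS):
    rate = discounts.get(code, 0.0)  # unknown codes fall through to no discount
    return total * (1 - rate)

print(apply_discount_configurable(100.0, "VIP25"))  # 75.0
```

The second version is barely more code, but changing a discount becomes a data edit instead of a deploy, which is the whole point.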
Day 7: The verdict.
We ended the experiment. Total tally:
Features shipped: 12
Features that worked without issues: 4
New bugs introduced: 27
Hours spent fixing things Claude broke: 40
User emails explaining lost data: 73
API cost increase: 38%
What we learned.
Claude is incredible at writing code. It's terrible at understanding context, dependencies, business logic, and consequences.
It doesn't know what you didn't tell it. It doesn't ask questions when something is ambiguous. It assumes it's right.
The best work we got wasn't when Claude coded alone. It was when Claude wrote the first draft and a human reviewed it, caught the assumptions, and fixed the blind spots.
The hype is real. So is the mess.
What I'm curious about.
Has anyone else tried this? What broke first for you?
Imed Radhouani
Founder & CTO – Rankfender
Code that ships. Chaos that teaches.



Replies
I use Claude's code daily, and you're right. It fixes some things but breaks others. I added an additional instruction to divide tasks, mark them as easy, medium, and difficult, and record what you've done. It's also important to use /clear before each task; it works much better.
Rankfender
@mateusz_mlynarski Yeah the /clear before each task is the only way I've found to keep it from getting confused. If you let it run too long, it starts making assumptions based on assumptions and you end up with code that looks fine but makes no sense.
The dividing tasks into easy/medium/hard is smart. I've been doing something similar but not that structured. I just break everything into tiny pieces and reset after each one. Slower but way less cleanup.
What's the worst thing it's broken for you?
I wouldn't call it a crash. It broke the keyboard display, which I fixed myself earlier, while doing a completely different task in a completely different app view 😅
Rankfender
@mateusz_mlynarski That's the kind of thing that makes you stare at the screen. You're working on one thing, and suddenly the keyboard breaks somewhere else. No connection. No warning. Just a new problem you didn't ask for.
The "fixes one thing, breaks something unrelated" pattern is what kills me. At least when it breaks the thing you're working on, you know where to look. When it breaks something random two layers down, you spend hours tracing back to a change that had nothing to do with it.
What did you do after that? Roll back? Track it down manually? Or just start fresh?
@imed_radhouani I asked Claude to revert the changes. I rephrased the command, and it worked. It’s important to describe the problem clearly and provide plenty of details.
Did your team write the full spec/requirements for each build or did Claude?
Rankfender
@david_bartolomucci Mix of both. We wrote the high-level spec, what the feature should do, success criteria, constraints. Claude handled the detailed breakdowns and task planning.
The problem was the handoff. We'd write "add a discount code field." Claude would interpret that as "hardcode it to the founder's email domain." We learned to be painfully specific. "This field should accept any valid code from the database. Do not hardcode values. Do not assume a single user."
Even then, it found ways to surprise us.
This is painfully familiar. We tried something similar with a different AI and it hardcoded API keys into the frontend bundle. Shipped it. Didn't realize until a user found them in the dev tools and started making requests. Had to rotate everything. The code looked perfect. The security was non-existent. But I think it will get better in the future. What do you think?
Rankfender
@joe_reda11 I think it will get better, but not in the way people expect.
The models themselves will get smarter about security. They already know not to hardcode API keys if you tell them not to. But the real improvement won't come from AI writing better code. It'll come from better tooling around AI.
Things like pre-commit hooks that scan for secrets before merge, sandboxed environments where AI can't touch production directly, and review workflows that force a human to look at changes before they ship. The AI will write the code. The tools will catch the mistakes. That's the future I see.
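A secret scan of that kind can be very small. This is a sketch with illustrative patterns, not a replacement for real tools like gitleaks or detect-secrets, which cover far more cases:

```python
import re

# Illustrative patterns only -- a real scanner ships hundreds of these.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                           # AWS access key id shape
    re.compile(r"(?i)api[_-]?key\s*[:=]\s*['\"][^'\"]{16,}"),  # api_key = "long-literal"
]

def find_secrets(text):
    """Return the patterns that matched; an empty list means clean."""
    return [p.pattern for p in SECRET_PATTERNS if p.search(text)]

def scan_files(paths):
    """Return the subset of paths that look like they contain a secret."""
    flagged = []
    for path in paths:
        with open(path, encoding="utf-8", errors="ignore") as f:
            if find_secrets(f.read()):
                flagged.append(path)
    return flagged

# Wired up as a git pre-commit hook, a wrapper would call scan_files() on the
# staged files and exit non-zero if anything is flagged, blocking the commit.
```

Crude as it is, a check like this sits exactly where the AI can't talk its way past it: between the generated code and the repository.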
What scares me more isn't the hardcoded keys. It's the subtle stuff. Logic that looks right but fails in edge cases. Performance that's fine in test but kills your bill in prod. Code that passes review but no one actually understands. That stuff won't get caught by a linter.
So yeah, it'll get better. But we're still going to need humans who know what they're doing. The ceiling is rising, but so is the complexity of the stuff we're building.
We have to stop expecting Claude to understand the "soul" of the project. Claude doesn't have "skin in the game." It'll refactor a shared utility to look "cleaner" without realizing it's breaking five legacy modules that haven't been touched. It sees the file, not the ecosystem. As you saw with the API costs, AI solves for execution, not efficiency. It will give you code that works perfectly but burns through your credits faster than a wedding budget. It doesn't know what a "billing alert" is. Once the chat gets too long, it's like a bad game of Chinese whispers. Your "Two-Correction Rule" is the only way to stay sane. If it fails twice, the context is polluted; better to /clear and start fresh than to keep "adjusting."
Rankfender
@kausalya_n_p The "sees the file, not the ecosystem" thing is exactly it. It'll make that one file perfect. Clean, readable, elegant. Doesn't care that three other things depend on it. It's like a surgeon who does a beautiful stitch on the wrong patient.
The wedding budget line is too real. You ask for something simple. It delivers something that works perfectly and bankrupts you in API calls. No concept of what a dollar means.
The Chinese whispers thing is what kills me. You start with a clear request. By the third correction, it's answering a question you didn't ask. By the fifth, you're both lost. The /clear is the only reset button that works. I've learned to treat each task like a new conversation. Less magic, more sanity.
What's your rule? Two strikes and you clear?
Nice experiment! How would it work if you reran it with humans back in the loop, as Claude itself was suggesting? Same experiment, no code written by humans, but avoiding the 73 emails to customers?
Rankfender
@agraciag Yeah that's the real experiment we should have run. Same premise: Claude writes the code. But with a human reviewer before anything hits production. Not reviewing every line, just catching the stuff Claude doesn't know it doesn't know. The hardcoded discount logic. The API calls in a loop. The shared utility refactor that breaks five things.
The 73 emails were the cost of no human review. Most of them could have been avoided with one person spending a few hours a week looking for the blind spots.
I'm curious what the balance is. How much human time saves how much customer pain. Have you tried something like this?
@imed_radhouani That balance is what we'll have to find in every use case. I guess we'll be in a trial-and-error process until we answer that question. I'm thinking about the SLA math used to work out how many headcount hours are needed for a desired customer satisfaction level, except in this case we are our own customers. In other words: how many human hours buy you 92% confidence that nothing is going to break, versus the cost of letting it break.
Rankfender
@agraciag Yeah the SLA math is the right framework. You're basically figuring out how much human time buys you how much confidence. 80% confidence might be cheap. 95% confidence might cost more than the outage itself.
The tricky part is the stuff that breaks silently. API costs, degraded performance, logic that works but scales terribly. You don't know it broke until later. Harder to put a number on that.
We've been treating it like code review. Not every line, just the risky stuff. New API integrations. Database changes. Anything touching billing. That's where the expensive mistakes live. The rest, we let it run and watch the logs.
What's your threshold? What do you review vs let slide?
@imed_radhouani Exactly, and yes, the "everything is right" appearance while everything is going south is the scariest part. For the moment it's not a decision about how much I review; it's more a question of how much time and energy I have left to review. Putting in those hours is what pushes me toward the SLA math framework we're talking about.
The 'downstream consequences' gap is the core issue. Claude optimizes for the explicit task, not the implicit system. The API cost spike is actually the most interesting failure mode because it's silent — no error, no crash, just a bill that grows. The pattern I've seen work: treat Claude as the junior dev who writes fast but needs architectural review. You don't let juniors merge production DB migrations without a senior reading them. Same rule applies. The two-correction rule is good instinct — once you're correcting the same assumption twice, the context is polluted and you're fighting the session not the problem. What I'd push further: your CLAUDE.md (or equivalent system prompt) should include the cost constraints explicitly. 'Never loop when batch is available.' 'All new schema changes require explicit approval.' 'No config values hardcoded.' The AI doesn't know these rules unless you write them down.
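A CLAUDE.md along those lines might look something like this. The rules below are just the ones from this thread written down; adapt them to your own stack:

```markdown
# CLAUDE.md — project rules

## Cost
- Never call external APIs in a loop when a batch endpoint is available.
- Flag any change that increases the number of outbound API calls.

## Data
- All schema changes (CREATE / ALTER / DROP) require explicit human approval.
- Never delete data; mark it as archived and let a human run the cleanup.

## Config
- No hardcoded business values (discounts, limits, prices); read them from settings.

## Dependencies
- Pin versions. Never bump a core dependency without calling it out for review.
```

As noted above, the model won't follow these perfectly every time, but written rules give you something concrete to point at when it doesn't.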
Rankfender
@vatsmi That's exactly it. The junior dev analogy is perfect. You wouldn't give a junior access to production without review. Same rule.
The silent failure is the scariest because you don't know you messed up until the bill comes. No error. No warning. Just a number that doesn't make sense. At least a crash gets your attention.
The CLAUDE.md thing is something we learned the hard way. We had rules. They were in the root. Claude ignored them sometimes. Other times it followed them fine. No consistency. But having them at all is better than nothing. At least you can point and say "it's right there, you were supposed to check."
The two-correction rule saved us from going insane. Once you fix the same thing twice, just reset. The context is gone. You're not solving the problem anymore, you're fighting the chat.
What's the most important rule you've added to your prompts that actually stuck?
Your conclusion nails it: Claude optimizes for the task you give it, not the system it lives in.
The dropped table, the 3x API calls, the hardcoded discount - all the same root cause. It completed the request perfectly without understanding the downstream consequences. That's not an AI failure, that's a workflow design failure.
The teams getting real value from AI coding are using it exactly how you described at the end - first draft + human review. Not removing the human, just moving them earlier in the loop where they catch assumptions before they ship.
Rankfender
@jeroenerne Yeah that's the part that took me too long to accept. I kept thinking "if I just prompt it better, it'll get it right." But the problem isn't the prompt. The problem is the task. Claude solves the thing in front of it. It doesn't know there's a whole world outside that thing.
The dropped table was the clearest example. We said "clean up old data." It did. Perfectly. It just didn't know that "old data" meant something different to us than to it.
The teams winning with AI aren't the ones with perfect prompts. They're the ones who put a human in the loop at the right spot. Not reviewing every line. Just catching the assumptions before they become production problems.
What's your review workflow look like? Where does the human sit?
I give Claude a lot of leeway, but I definitely review the logic and have it review the implementation before anything gets merged. Two things I do almost every time that have significantly improved the quality: have the agent do a gap analysis of the plan vs the implementation, and then do a review for efficiency optimizations, code consolidation, and speed improvements. Very rarely does it say everything is cool to merge. I think we're still a bit away from the full hands-off approach. I question Claude constantly and I swear the agents get annoyed with me. LOL
Very interesting, but almost 99% of the errors are because of bad prompting, a confusing pipeline, misguided agents, and no rules attached at all.
I vibe code with Claude almost every day, and you need to guide it like a child; every little detail is important as hell.
If it delivers errors, it's my prompting's fault, nothing else.
I read the whole thing and the 38% cost spike hit hardest. The hardcoded discount logic was a close second.
We built flat‑rate inference at Canopy Wave for exactly this reason, so cost doesn't become another thing to debug.
Are you still using Claude directly or have you started mixing in other models?