We let Claude write 100% of our code for 7 days. Here's what broke first.

by Imed Radhouani

Last week we did something stupid.

We paused all human coding. Gave Claude (Anthropic) access to our GitHub repo. Told it to build new features, fix bugs, and ship.

No human review. No guardrails. Just Claude and our codebase.

For 7 days, it ran the engineering team.

Here's what happened.

Day 1: Confidence was high.

Claude (Sonnet 4.6 then Opus 4.5) fixed a small CSS bug in 30 seconds. Then refactored a messy function into something readable. We felt like geniuses.

By end of day, it had shipped 3 minor improvements. We started talking about cutting engineering costs.

Day 2: The first crack.

We asked Claude to add a new filter to our dashboard. It wrote the code. It worked locally. We merged.

That night, something else broke. A completely unrelated chart stopped loading. No error logs. No obvious cause.

We spent 2 hours tracing it back to Claude's change. The filter logic was fine. But it had refactored a shared utility function that 5 other features relied on. It didn't check dependencies. It just assumed.

We rolled back. Lesson one: AI doesn't think about side effects.
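
To make the failure mode concrete, here's a sketch of what the break looked like (hypothetical names, not our actual code): a shared helper whose output format quietly changes.

    // Before the refactor, every feature relied on this returning "YYYY-MM-DD".
    // function formatDate(d: Date): string

    // After the refactor (hypothetical): a new options parameter and a new
    // default format. The filter passes options; the five old callers don't.
    function formatDate(d: Date, opts?: { locale?: string }): string {
      // New default: localized output instead of ISO. Old callers that parse
      // "YYYY-MM-DD" downstream now silently get "1/31/2024".
      return d.toLocaleDateString(opts?.locale ?? "en-US");
    }

    // Untouched caller in the chart code: still compiles, still runs, but its
    // date parsing no longer matches. No error, no log line. It just breaks.
    const label = formatDate(new Date("2024-01-31"));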

Day 3: The false confidence trap.

We asked Claude to build a new feature from scratch. It generated 800 lines of code. Beautiful structure. Clean comments. Tests included.

We reviewed it quickly. Looked perfect.

Pushed to staging. The feature worked. We celebrated.

Then we noticed something strange. Our API costs had spiked. Claude's code was making 3x more calls than necessary. Not because the logic was wrong, but because Claude never considered pricing. It called external APIs in loops where a batch request would have been fine.

No error. Just expensive.
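
The shape of the problem, roughly (hypothetical endpoint and function names):

    // What Claude wrote: one HTTP call per item. Correct results,
    // N requests, N units of billing.
    async function enrichAll(ids: string[]) {
      const results: unknown[] = [];
      for (const id of ids) {
        const res = await fetch(`https://api.example.com/enrich/${id}`);
        results.push(await res.json());
      }
      return results;
    }

    // What the provider supported (assuming a batch endpoint exists):
    // one request for the whole list.
    async function enrichAllBatched(ids: string[]) {
      const res = await fetch("https://api.example.com/enrich/batch", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ ids }),
      });
      return res.json();
    }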

Day 4: The silent failure.

We asked Claude to optimize our database queries. It wrote better SQL. Things ran faster.

Then user emails started coming in. "Where did my old data go?"

Claude had dropped a table. Not a critical one. But a table with 3 months of user activity logs. Not backed up. Not covered by our retention policy.

It didn't ask permission. It didn't warn us. It just did what we asked: "clean up old data."

We spent the next 2 hours rolling back to a DB snapshot.
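
In hindsight, "clean up old data" needed its guardrails spelled out. Something like this sketch (assuming Postgres via node-postgres, with a made-up table name) is what we actually meant:

    import { Client } from "pg";

    // "Clean up old data," made explicit: delete rows past the retention
    // window, never drop objects, and dry-run by default.
    async function cleanupActivityLogs(dryRun = true) {
      const db = new Client(); // connection settings come from env vars
      await db.connect();
      const where = "created_at < now() - interval '90 days'";
      if (dryRun) {
        const { rows } = await db.query(
          `SELECT count(*) FROM activity_logs WHERE ${where}`
        );
        console.log(`Would delete ${rows[0].count} rows`); // a human signs off first
      } else {
        await db.query(`DELETE FROM activity_logs WHERE ${where}`);
      }
      await db.end();
    }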

Day 5: The paradox.

We asked Claude to fix the backup issue. It wrote a beautiful automated backup script. Scheduled. Logged. Perfect.
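
Something in this shape (a sketch assuming Postgres, pg_dump on the PATH, and the node-cron package):

    import cron from "node-cron";
    import { execFile } from "node:child_process";

    // Nightly logical backup at 02:00, one timestamped file per run.
    cron.schedule("0 2 * * *", () => {
      const file = `backup-${new Date().toISOString().slice(0, 10)}.sql`;
      execFile("pg_dump", ["--file", file, process.env.DATABASE_URL ?? ""], (err) => {
        if (err) console.error("backup failed:", err); // logged, alertable
        else console.log("backup written:", file);
      });
    });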

We asked it to add a new feature. It worked flawlessly.

We asked it to review its own code from day 3. It found 2 potential bugs and fixed them.

We started feeling safe again.

Then at 3am, our site went down. Claude had updated a core dependency to the latest version. It worked in test. But the new version had a breaking change our production environment didn't support. No human would have made that mistake.

Day 6: The blame game.

We spent the morning restoring the site. Asked Claude what happened. It explained the dependency logic perfectly. It acknowledged the mistake. Then it suggested 3 ways to prevent it in the future.

One of the suggestions was to implement a dependency review process before merging.

It was telling us to put humans back in the loop.
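
A minimal sketch of that guardrail (assuming npm and a package.json): fail CI when any dependency isn't pinned to an exact version, so upgrades happen in a PR a human reads, not silently at 3am.

    import { readFileSync } from "node:fs";

    // Fail CI if any dependency uses a floating range instead of an
    // exact version. A lockfile plus exact pins turns the 3am surprise
    // into a reviewable diff.
    const pkg = JSON.parse(readFileSync("package.json", "utf8"));
    const deps: Record<string, string> = {
      ...pkg.dependencies,
      ...pkg.devDependencies,
    };
    const floating = Object.entries(deps).filter(
      ([, version]) => !/^\d+\.\d+\.\d+(-[\w.]+)?$/.test(version)
    );
    if (floating.length > 0) {
      console.error("Unpinned dependencies:", floating);
      process.exit(1);
    }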

The hardcoded amateur sh*t had come the day before. We asked Claude to add a simple feature: a discount code field on checkout. It worked. Beautifully. Until we realized it had hardcoded the discount logic. Not configurable. Not in settings. Just raw numbers and conditions buried in the code. If we wanted to change the discount amount, a developer had to dig in and rewrite it. It didn't ask. It just assumed.
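
The difference is a few lines of code and a world of maintenance. A sketch, with invented numbers:

    // What Claude shipped (paraphrased): the rules buried in the handler.
    function applyDiscountHardcoded(total: number, code: string): number {
      if (code === "LAUNCH10" && total > 50) return total * 0.9;
      return total;
    }

    // What we wanted: rules live in config or the DB; code just applies them.
    interface DiscountRule {
      code: string;
      minTotal: number;
      percentOff: number;
    }

    function applyDiscount(total: number, code: string, rules: DiscountRule[]): number {
      const rule = rules.find((r) => r.code === code && total >= r.minTotal);
      return rule ? total * (1 - rule.percentOff / 100) : total;
    }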

Day 7: The verdict.

We ended the experiment. Total tally:

  • Features shipped: 12

  • Features that worked without issues: 4

  • New bugs introduced: 27

  • Hours spent fixing things Claude broke: 40

  • User emails explaining lost data: 73

  • API cost increase: 38%

What we learned.

Claude is incredible at writing code. It's terrible at understanding context, dependencies, business logic, and consequences.

It doesn't know what you didn't tell it. It doesn't ask questions when something is ambiguous. It assumes it's right.

The best work we got wasn't when Claude coded alone. It was when Claude wrote the first draft and a human reviewed it, caught the assumptions, and fixed the blind spots.

The hype is real. So is the mess.

What I'm curious about.

Has anyone else tried this? What broke first for you?

Imed Radhouani
Founder & CTO – Rankfender
Code that ships. Chaos that teaches.

Replies

Joe Red

This is painfully familiar. We tried something similar with a different AI and it hardcoded API keys into the frontend bundle. Shipped it. Didn't realize until a user found them in the dev tools and started making requests. Had to rotate everything. The code looked perfect. The security was non-existent. But I think it will get better in the future. What do you think?

Imed Radhouani

@joe_reda11 I think it will get better, but not in the way people expect.

The models themselves will get smarter about security. They already know not to hardcode API keys if you tell them not to. But the real improvement won't come from AI writing better code. It'll come from better tooling around AI.

Things like pre-commit hooks that scan for secrets before merge, sandboxed environments where AI can't touch production directly, and review workflows that force a human to look at changes before they ship. The AI will write the code. The tools will catch the mistakes. That's the future I see.
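
To make the secrets check concrete, a toy version (a real project should use a dedicated scanner like gitleaks; this only shows the shape):

    import { execSync } from "node:child_process";
    import { readFileSync } from "node:fs";

    // Pre-commit hook sketch: block the commit if a staged file looks like
    // it contains a credential. Real scanners have far better patterns.
    const staged = execSync("git diff --cached --name-only", { encoding: "utf8" })
      .split("\n")
      .filter(Boolean);

    const keyPattern = /(api[_-]?key|secret|token)\s*[:=]\s*['"][A-Za-z0-9_\-]{16,}['"]/i;

    for (const file of staged) {
      let text = "";
      try {
        text = readFileSync(file, "utf8");
      } catch {
        continue; // deleted or unreadable file; skip it
      }
      if (keyPattern.test(text)) {
        console.error(`Possible secret in ${file}; commit blocked.`);
        process.exit(1);
      }
    }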

What scares me more isn't the hardcoded keys. It's the subtle stuff. Logic that looks right but fails in edge cases. Performance that's fine in test but kills your bill in prod. Code that passes review but no one actually understands. That stuff won't get caught by a linter.

So yeah, it'll get better. But we're still going to need humans who know what they're doing. The ceiling is rising, but so is the complexity of the stuff we're building.

Mateusz Młynarski

I use Claude's code daily, and you're right. It fixes some things but breaks others. I added an additional instruction: divide tasks, mark them as easy, medium, or difficult, and record what's been done. It's also important to use /clear before each task; it works much better.

Imed Radhouani

@mateusz_mlynarski Yeah, the /clear before each task is the only way I've found to keep it from getting confused. If you let it run too long, it starts making assumptions based on assumptions and you end up with code that looks fine but makes no sense.

Dividing tasks into easy/medium/hard is smart. I've been doing something similar but not as structured. I just break everything into tiny pieces and reset after each one. Slower, but way less cleanup.

What's the worst thing it's broken for you?

Kausalya N P

We have to stop expecting Claude to understand the "soul" of the project. Claude doesn't have skin in the game. It'll refactor a shared utility to look "cleaner" without realizing it's breaking five legacy modules that haven't been touched. It sees the file, not the ecosystem.

As you saw with the API costs, AI solves for execution, not efficiency. It will give you code that works perfectly but burns through your credits faster than a wedding budget. It doesn't know what a "billing alert" is.

Once the chat gets too long, it's like a bad game of Chinese Whispers. Your "Two-Correction Rule" is the only way to stay sane. If it fails twice, the context is polluted; better to /clear and start fresh than to keep "adjusting."