We let Claude write 100% of our code for 7 days. Here's what broke first.

by Imed Radhouani

Last week we did something stupid.


We paused all human coding. Gave Claude (Anthropic) access to our GitHub repo. Told it to build new features, fix bugs, and ship.

No human review. No guardrails. Just Claude and our codebase.

For 7 days, it ran the engineering team.

Here's what happened.

Day 1: Confidence was high.

Claude (Sonnet 4.6 then Opus 4.5) fixed a small CSS bug in 30 seconds. Then refactored a messy function into something readable. We felt like geniuses.

By end of day, it had shipped 3 minor improvements. We started talking about cutting engineering costs.

Day 2: The first crack.

We asked Claude to add a new filter to our dashboard. It wrote the code. It worked locally. We merged.

That night, something else broke. A completely unrelated chart stopped loading. No error logs. No obvious cause.

We spent 2 hours tracing it back to Claude's change. The filter logic was fine. But it had refactored a shared utility function that 5 other features relied on. It didn't check dependencies. It just assumed.

We rolled back. Lesson one: AI doesn't think about side effects.

Day 3: The false confidence trap.

We asked Claude to build a new feature from scratch. It generated 800 lines of code. Beautiful structure. Clean comments. Tests included.

We reviewed it quickly. Looked perfect.

Pushed to staging. The feature worked. We celebrated.

Then we noticed something strange. Our API costs had spiked. Claude was making 3x more calls than necessary — not because the code was wrong, but because it didn't understand pricing implications. It called external APIs in loops where a batch request would have been fine.

No error. Just expensive.
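The loop-versus-batch pattern is easy to sketch. This is a hypothetical toy, not the actual code: `fetch_price`, `fetch_prices`, and the counting client are invented stand-ins for whatever external API you pay per request.

```python
class CountingClient:
    """Toy stand-in for a paid external API; it just counts requests."""
    def __init__(self):
        self.calls = 0
    def fetch_price(self, sku):       # one item per request
        self.calls += 1
        return 9.99
    def fetch_prices(self, skus):     # whole list in one request
        self.calls += 1
        return [9.99] * len(skus)

def fetch_prices_looped(skus, client):
    """What the generated code did: one API call per item."""
    return [client.fetch_price(sku) for sku in skus]

def fetch_prices_batched(skus, client):
    """What a batch request looks like: one call for the whole list."""
    return client.fetch_prices(skus)

client = CountingClient()
fetch_prices_looped(["a", "b", "c"], client)   # 3 calls
fetch_prices_batched(["a", "b", "c"], client)  # 1 more call, 4 total
```

Both versions return the same prices and pass the same tests, which is exactly why nothing flagged it. The only difference shows up on the invoice.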

Day 4: The silent failure.

We asked Claude to optimize our database queries. It wrote better SQL. Things ran faster.

Then user emails started coming in. "Where did my old data go?"

Claude had dropped a table. Not a critical one. But a table with 3 months of user activity logs. Not backed up. Not in our retention policy.

It didn't ask permission. It didn't warn us. It just did what we asked: "clean up old data."

We spent the next 2 hours restoring a DB snapshot and rolling back the change.
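In hindsight, the guardrail is simple: "clean up old data" should mean deleting rows past the retention window, behind a dry run, and never touching the schema. A minimal sketch of that idea, with an invented `activity_logs` table and an assumed 90-day retention:

```python
import sqlite3
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 90  # assumption: 90-day retention on activity logs

def cleanup_activity_logs(conn, dry_run=True):
    """Delete rows past retention; defaults to a dry run, never drops tables."""
    cutoff = (datetime.now(timezone.utc)
              - timedelta(days=RETENTION_DAYS)).isoformat()
    doomed = conn.execute(
        "SELECT COUNT(*) FROM activity_logs WHERE created_at < ?", (cutoff,)
    ).fetchone()[0]
    if dry_run:
        return doomed          # report only; nothing deleted
    conn.execute("DELETE FROM activity_logs WHERE created_at < ?", (cutoff,))
    conn.commit()
    return doomed

# toy data: one expired row, one recent row
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE activity_logs (created_at TEXT)")
conn.execute("INSERT INTO activity_logs VALUES ('2020-01-01T00:00:00+00:00')")
conn.execute("INSERT INTO activity_logs VALUES (?)",
             (datetime.now(timezone.utc).isoformat(),))
conn.commit()

flagged = cleanup_activity_logs(conn)                 # dry run: 1 row flagged
deleted = cleanup_activity_logs(conn, dry_run=False)  # recent row survives
```

The dry-run default is the point: a human has to see what would be deleted before anything is.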

Day 5: The paradox.

We asked Claude to fix the backup issue. It wrote a beautiful automated backup script. Scheduled. Logged. Perfect.
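The shape of such a script is roughly this. A hedged sketch only: a real one would wrap pg_dump or similar, while this copies a plain file so it's self-contained; the paths and the keep-last-7 retention are assumptions.

```python
import shutil
import tempfile
from datetime import datetime, timezone
from pathlib import Path

KEEP = 7  # assumption: retain the last 7 snapshots

def snapshot(db_path: Path, backup_dir: Path) -> Path:
    """Copy the database to a timestamped file, then prune old snapshots."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%f")
    dest = backup_dir / f"{db_path.stem}-{stamp}{db_path.suffix}"
    shutil.copy2(db_path, dest)                       # take the snapshot
    snaps = sorted(backup_dir.glob(f"{db_path.stem}-*{db_path.suffix}"))
    for old in snaps[:-KEEP]:                         # prune the oldest
        old.unlink()
    return dest

# usage with a throwaway file standing in for the database
tmp = Path(tempfile.mkdtemp())
db = tmp / "app.db"
db.write_bytes(b"fake database contents")
backups = tmp / "backups"
for _ in range(10):
    snapshot(db, backups)   # only the newest KEEP snapshots remain
```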

We asked it to add a new feature. It worked flawlessly.

We asked it to review its own code from day 3. It found 2 potential bugs and fixed them.

We started feeling safe again.

Then at 3am, our site went down. Claude had updated a core dependency to the latest version. It worked in test. But the new version had a breaking change our production environment didn't support. No human would have made that mistake.

Day 6: The blame game.

We spent the morning restoring the site. Asked Claude what happened. It explained the dependency logic perfectly. It acknowledged the mistake. Then it suggested 3 ways to prevent it in the future.

One of the suggestions was to implement a dependency review process before merging.

It was telling us to put humans back in the loop.

The hardcoded amateur sh*t had come the day before. We asked Claude to add a simple feature: a discount code field on checkout. It worked. Beautifully. Until we realized it had hardcoded the discount logic. Not configurable. Not in settings. Just raw numbers and conditions buried in the code. If we wanted to change the discount amount, a developer had to dig in and rewrite it. It didn't ask. It just assumed.
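For contrast, this is roughly what "not hardcoded" looks like: the rules live in data (a dict here; settings or a DB table in real life) and the code path never changes. The codes and rates below are invented for illustration.

```python
# Editable config: changing a rate means editing data, not redeploying code.
DISCOUNTS = {
    "LAUNCH10": 0.10,
    "VIP25": 0.25,
}

def apply_discount(total, code, discounts=DISCOUNTS):
    """Return the discounted total; unknown codes change nothing."""
    rate = discounts.get(code.strip().upper(), 0.0)
    return round(total * (1 - rate), 2)

apply_discount(100.0, "launch10")  # -> 90.0
apply_discount(100.0, "NOPE")      # -> 100.0
```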

Day 7: The verdict.

We ended the experiment. Total tally:

  • Features shipped: 12

  • Features that worked without issues: 4

  • New bugs introduced: 27

  • Hours spent fixing things Claude broke: 40

  • User emails explaining lost data: 73

  • API cost increase: 38%

What we learned.

Claude is incredible at writing code. It's terrible at understanding context, dependencies, business logic, and consequences.

It doesn't know what you didn't tell it. It doesn't ask questions when something is ambiguous. It assumes it's right.

The best work we got wasn't when Claude coded alone. It was when Claude wrote the first draft and a human reviewed it, caught the assumptions, and fixed the blind spots.

The hype is real. So is the mess.

What I'm curious about.

Has anyone else tried this? What broke first for you?

Imed Radhouani
Founder & CTO – Rankfender
Code that ships. Chaos that teaches.

447 views


Replies

Daniel Yoon

This lines up a lot with what I’ve been seeing. The “it works but breaks something else” pattern is real.

Also ran into issues with Claude recently — not just behavior-wise but even hitting message limits way faster than expected, which makes longer workflows kind of unpredictable.

Feels like the real challenge isn’t capability, it’s consistency + awareness of context over time.

Imed Radhouani

@danielyoonadoba Yeah, that's exactly it. The capability is already insane. It's the consistency that breaks.

The message limits are brutal. We kept hitting them mid-flow and losing context. By the time we got a new session going, Claude had forgotten the previous decisions and started fresh. That's when things got weird.

The pattern I noticed was: first hour, perfect. Second hour, still good but starting to get creative. Third hour, confidently wrong about things it nailed earlier. It's like the context window fills up with its own decisions and it starts making assumptions based on assumptions.

The "works but breaks something else" pattern is the hardest to debug. No error. No crash. Just subtle drift. Everything looks fine until a user emails you asking why their discount code doesn't work.

I think the workflow is going to evolve into very narrow, short-lived sessions. Ask Claude for one thing. Get the output. Review it. Commit it. Close the session. Start fresh for the next thing. No long-running context. No compounding assumptions.

It's less magical. But it's more reliable.

What were you building when you hit the limits?

Daniel Yoon

@imed_radhouani We were experimenting with building multi-region content workflows — scheduling posts, localizing copy, and generating simple variations automatically. It worked fine in short bursts, but as soon as we tried chaining multiple steps across platforms, the session limits and context drift started breaking the sequence.

Ended up realizing the “one request, review, commit, reset” pattern is way more predictable, even if it’s less magical than letting it run free.

Curious if you found any tricks to keep longer sessions somewhat reliable, or if it’s just a hard limit of the system right now.

Imed Radhouani

@danielyoonadoba Yeah, the "works but breaks something else" pattern is the one that gets you. No crash, no error. Just subtle drift until a user emails you asking why something random stopped working.

We didn't find a reliable way to keep longer sessions clean. We tried everything — CLAUDE.md with architecture rules, .claudeignore to filter noise, Plan Mode before every major change. It helped, but eventually the context would still get polluted.

The best thing we found was the two-correction rule. If Claude made the same mistake twice, we'd just /clear and start fresh. Fighting a polluted context is pointless. It's cheaper in both time and tokens to reset than to keep correcting.

We also started using /compact proactively around 70% utilization, not waiting for auto-compaction to kick in. That gave us a clean summary before things got messy. Running /compact with a custom focus prompt was useful too: telling it "keep the database schema decisions, discard the UI brainstorming" helped preserve what actually mattered.

But honestly? The real reliability came from giving up on long sessions entirely. One request, review, commit, reset. It's less impressive but way more predictable.

Gabe Perez

Did you break out Claude into different agents/subagents (or chats) to handle different tasks or how were tasks, scope, and project planning assigned?

Imed Radhouani

@gabe Great question. And yeah, we learned this the hard way (it was a liiiittle bit late, but better late than never, hihi).

We started with one Claude session handling everything. Big mistake. Context got polluted, it started making assumptions that carried across tasks, and by day 3 we had code that looked fine but made zero sense together.

We didn't use subagents at first because honestly we didn't know they existed. After the mess, we restructured. We gave Claude a CLAUDE.md file at the root with our architecture, tech stack, and hard rules (like "never hardcode config values"). That helped a lot.

For the actual work, we used Plan Mode (Shift+Tab) to force it to think before coding. Claude would lay out a plan, we'd review it, then let it execute in a fresh session. This stopped the butterfly effect where one wrong assumption ruins everything.

What we didn't do (but probably should have) is use subagents for different roles. There's a pattern where you have an Architect agent design the system, a Builder implement, a Validator test, and a Scribe document. All working in parallel on the same project. Anthropic even did this to build a C compiler with 16 agents running simultaneously. That's way beyond what we attempted...

Next time? We'd set up separate agents with distinct roles, give each a narrow scope, and let the lead agent coordinate. The "two-correction rule" is also key: if you correct Claude twice on the same thing, the context is polluted. Start fresh.

Honestly though, for a 7-day experiment, we probably overcomplicated it. The real lesson was: scope tasks tiny, use Plan Mode, and never let it code without a human reviewing the plan first. Subagents are for when you're building something much bigger than a few features.

What's your setup like? Are you using subagents already or just running single sessions? Would really help us in our next experiment (since we're seriously considering adding it to our process, especially Opus 4.6!)

Simon Wallace

From my testing, yeah very relatable. The main one for me was it ballooned our operations in our database, which as well as causing performance issues, was expensive.

I also found the isolated thinking an issue too, plus complex original thinking it doesn't handle all too well.

Still, it's a powerful tool in the right circumstances; we just need to manage those carefully IMO.

Imed Radhouani

@dr_simon_wallace Yeah the database ballooning is brutal. We had the same thing. Claude wrote efficient-looking queries that turned into full table scans at scale. Worked fine in dev. Killed our production database twice before we caught the pattern.
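One way to catch this before production is to ask the database for its plan, not just its result. A small sketch using SQLite's EXPLAIN QUERY PLAN (Postgres's EXPLAIN is the same idea); the `events` table and index name are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, payload TEXT)")

def plan(sql):
    """Flatten EXPLAIN QUERY PLAN output into one readable string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(row[-1] for row in rows)  # last column is the detail text

query = "SELECT * FROM events WHERE user_id = 42"
before = plan(query)  # without an index: a full scan of events
conn.execute("CREATE INDEX idx_events_user ON events(user_id)")
after = plan(query)   # with the index: an index search
```

The query returns identical rows either way, so it "works" in dev. Only the plan tells you it was about to scan the whole table at scale.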

The isolated thinking piece is interesting. It's great at solving the thing you ask it to solve. Terrible at noticing the thing you didn't ask it to solve. That's where the expensive mistakes live. You ask it to add a feature. It does. It doesn't notice that the feature breaks the pricing logic three steps downstream.

Complex original thinking is where it falls apart for us too. Give it a problem that requires a novel architecture, something that isn't just a variation of a common pattern, and it starts generating plausible nonsense. Looks good on the surface. Breaks in ways you can't anticipate.

Powerful tool in the right circumstances is exactly right. The skill now isn't prompting. It's knowing which circumstances are the right ones.

Just out of curiosity, what's your threshold? When do you decide "this one, I'm doing myself" vs "let AI handle it"?

Simon Wallace

@imed_radhouani exactly.

Honestly, depends how tired I am 🤣 I like to think of AI like junk food - I know it's bad for me, but sometimes when you're hungry it's quick and easy.

For me, it depends on what's defined around the issue: if it's a full new feature I'll tend to use it to refine, but if I know the architecture is in place and it's "minor" optimisations I'll give it more autonomy.
What my experiences have taught me is that I won't give it free rein at the moment, revenue is great but retained profit is better! If it doesn't know how to care about that and costs too much to run, that's an issue

Imed Radhouani

@dr_simon_wallace Hahaha loved the comparison! Junk food is exactly right. You know you shouldn't, but sometimes you just need something on the page and it's better than staring at a blank screen.

AI doesn't care about your margins. It'll write beautiful code that triples your API bill and never think twice. You ask for a feature, you get the feature. You don't ask how much it costs to run, it doesn't tell you.

The "if I know the architecture is in place" bit is where I've landed too. For new stuff I want to be in the room. For the boring optimizations, the repetitive stuff, the things I've done a hundred times? Let it run.

Jade Melissa

This is exactly why "it works" is such a dangerous benchmark with AI written code. Did the DB wipe scare you the most, or was it the silent API cost spike?

Ian Maxwell

The hardcoded discount logic part is painfully real. That's such a classic AI move: solve the task, ignore the future. Were you giving it product constraints upfront or mostly just technical prompts?

Imed Radhouani

@ian_maxwell2 Yeah it's the perfect example of "solve the task, ignore the future." The prompt was something like "add a discount code field to checkout." It did exactly that. Added the field. Made it work. Never occurred to it that hardcoding the logic to one email was maybe not the plan.

We gave it product constraints upfront. We had a CLAUDE.md file with architecture rules, tech stack, things like "never hardcode config values." It ignored it. Or maybe it just didn't think "this discount code thing counts as a config value."

The technical prompts were fine. It understood the code. It didn't understand the product. That's the gap.

What's the worst "solved the task, ignored the future" thing you've seen?

Kyle Bennett

This feels less like "Claude can't code" and more like "Claude has no sense of downstream consequences." That's probably the biggest gap right now.

Leah Josephine

The scary part is how clean the code can look while the thinking underneath is still wrong. That false confidence trap is probably where most teams will get burned.

Miles Anthony

"No human review" was brave but also a really useful experiment. Did you notice it getting worse as context piled up, or were the mistakes random throughout?

Imed Radhouani

@miles_anthony2 It got worse as context piled up. The first day was solid. By day 3 it was confidently wrong about things it had gotten right on day 1. The pattern was consistent: start fresh, brilliant. Let it run, slow decay. It's like the context window fills up with its own assumptions and it stops questioning itself.

The randomness came later. Day 5 it would nail something complex and then hardcode a simple config value. No logic to it. Just collapse.

The two-correction rule saved us. If it made the same mistake twice, we stopped trying to fix it and started fresh. Fighting a polluted context is pointless.

What's your experience with context drift? Notice it getting worse over time too?

Naomi Florence

The API cost spike is such an underrated failure mode. A lot of AI generated code is "correct" in a way that becomes financially stupid in production.

Imed Radhouani

@naomi_florence1 Yeah the financial stupidity is the worst part because you don't notice it until the bill arrives. The code works. Tests pass. Everything looks fine. Then you get a $2,000 invoice and spend three days tracing it back to a loop that calls an API 400 times when it should've called it once.

What gets me is the AI doesn't think about cost. It solves the technical problem. The financial problem is invisible to it. So you end up with a solution that's technically perfect and commercially insane.

Oliver Nathan

I think this is the clearest argument for AI as a draft engineer, not a decision maker. Would be super interesting to see the same experiment repeated with strict human review in the loop.

Imed Radhouani

@oliver_nathan3 That's exactly it. AI is incredible at generating possibilities. Terrible at choosing which ones matter.

The gap between "here's a feature" and "here's a feature that actually helps someone" is where the human still matters. The AI competitor exercise is useful because it surfaces ideas. But the ones worth building are the ones that feel right, not just the ones that check boxes.

We've been using this as a forcing function on our own product. Every quarter we ask AI to build a better version of Rankfender. Some of the suggestions are genuinely useful. Most are technically correct but miss the point entirely. The stuff that survives the human filter is the stuff we actually build.

What's the most useful thing you've pulled from an exercise like this?
