Romanela Polutak

We benchmarked Claude Code refactoring, with and without code health guidance

We ran a benchmark to see how well Claude Code refactors legacy code on its own, and then repeated the same test with code-health guidance via an MCP server.

  • To limit any vendor bias, we used a public data set of 25,000 source code files from competitive programming, including carefully crafted unit tests. 

  • We assessed agent correctness by running those tests. 

  • We measured the Code Health impact using CodeScene.

  • (See our research "Code for Machines, Not just Humans" for more details on the methodology and data.)
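
For concreteness, the per-task evaluation loop looked roughly like the sketch below. This is a simplified illustration rather than our actual harness: `refactor_with_agent` and `code_health_score` are placeholder names standing in for the Claude Code invocation and the CodeScene measurement step.

```python
import subprocess
from pathlib import Path

def refactor_with_agent(source_file: Path, guided: bool) -> None:
    """Placeholder for invoking Claude Code on the file, with or without MCP code-health guidance."""
    raise NotImplementedError

def code_health_score(source_file: Path) -> float:
    """Placeholder for the Code Health measurement (we used CodeScene for this step)."""
    raise NotImplementedError

def evaluate_task(source_file: Path, test_dir: Path, guided: bool) -> dict:
    baseline = code_health_score(source_file)
    refactor_with_agent(source_file, guided)
    tests = subprocess.run(["pytest", str(test_dir), "-q"], capture_output=True)
    return {
        "correct": tests.returncode == 0,                           # unit tests decide behavioral correctness
        "health_delta": code_health_score(source_file) - baseline,  # did the structure actually improve?
    }
```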

MCP-guided Claude Code achieved 2–5x more Code Health improvements than unguided refactoring.

Some nuance:

  • The difference wasn’t just in quantity, but in the type of changes

  • Unguided runs mostly made shallow edits (e.g. renaming variables)

  • Guided runs performed significantly more structural refactorings (e.g. extracting methods, reducing responsibilities)

In other words, same model, but very different behavior.
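
To make that distinction concrete, here is a small, made-up Python example (not taken from the benchmark data) contrasting the two kinds of edits: a cosmetic rename versus an extract-method refactoring that splits out responsibilities.

```python
# Original: one function mixing validation, pricing, and formatting.
def process_order(order):
    if not order.get("items"):
        raise ValueError("empty order")
    total = sum(item["price"] * item["qty"] for item in order["items"])
    if order.get("vip"):
        total *= 0.9
    return f"Order {order['id']}: {total:.2f}"

# Shallow edit (typical of the unguided runs): a rename, same structure, same responsibilities.
def process_order_shallow(customer_order):
    if not customer_order.get("items"):
        raise ValueError("empty order")
    grand_total = sum(item["price"] * item["qty"] for item in customer_order["items"])
    if customer_order.get("vip"):
        grand_total *= 0.9
    return f"Order {customer_order['id']}: {grand_total:.2f}"

# Structural refactoring (more common in the guided runs): extracted methods, one responsibility each.
def validate(order):
    if not order.get("items"):
        raise ValueError("empty order")

def order_total(order):
    total = sum(item["price"] * item["qty"] for item in order["items"])
    return total * 0.9 if order.get("vip") else total

def process_order_structural(order):
    validate(order)
    return f"Order {order['id']}: {order_total(order):.2f}"
```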

This lines up with other research suggesting that agents refactor more than humans, but those changes often lack structural impact: “...these changes do not necessarily have the same structural impact as human refactorings.”

What seems to be happening is that, without a signal for “what good looks like”, the model defaults to safe, low-risk edits.

Another pattern we saw: Code Health determines AI performance (see also my earlier post)

  • On code with lower Code Health scores, results were less reliable

  • Defect rates increased significantly: in unhealthy code we observed a defect risk of 60% or more

  • As code quality improved, results stabilized: we observed that AI needs a Code Health score of around 9.5/10.0 to become stable and work reliably

This suggests that legacy code isn’t just a maintenance problem, it’s also a bottleneck for AI-assisted development.
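
One practical way to act on this, sketched below with a placeholder `health_score` function and partly hypothetical thresholds: triage files by their current Code Health before letting an agent loose on them, so low-health code gets explicit guidance or human-led cleanup first.

```python
from pathlib import Path

# Thresholds based on the observations above; tune for your own codebase and tooling.
RELIABLE_HEALTH = 9.5   # above this, agent refactoring was stable in our runs
RISKY_HEALTH = 4.0      # hypothetical cut-off where defect risk starts to dominate

def health_score(path: Path) -> float:
    """Placeholder: plug in your Code Health measurement (e.g. CodeScene) here."""
    raise NotImplementedError

def triage(path: Path) -> str:
    score = health_score(path)
    if score >= RELIABLE_HEALTH:
        return "agent-refactor"        # safe to let the agent restructure on its own
    if score >= RISKY_HEALTH:
        return "agent-with-guidance"   # supply explicit code-health goals (e.g. via MCP)
    return "human-first"               # pay down the worst debt manually before automating
```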

There’s also a broader implication here:

Average code health in many systems is far below what’s considered “easy to understand” for humans, and the bar seems even higher for AI.

So in practice, faster code generation doesn’t automatically translate into faster delivery if the underlying system is hard to reason about.

Curious what you think?

Replies

Ahana

The finding that a human-readability metric predicts AI refactoring success better than the model's own confidence signals is the most interesting part here; it suggests human and AI comprehension are more aligned than expected.

The conservative vs structural refactoring split also makes sense. Without a clear quality signal, defaulting to safe surface-level edits is rational.

Curious whether specific code smells drove the differences or whether it was the aggregate score that mattered most?

Ali Johnson

This is a fascinating benchmark. It mirrors exactly what we're seeing on the physical side of AI implementation.

I'm launching fixRAgent today, and the biggest hurdle with 'unguided' AI in home repair is the lack of structural signal. If you just ask a generic model for a part number, it defaults to 'low-risk' (and often wrong) answers because it can't reason about the physical geometry.

Just like you found that Claude needs a 9.5/10 code health to be stable, we found that computer vision needs a highly specific 'data signal' from model plates and SKUs to avoid dangerous mechanical advice. Faster generation is useless if the underlying logic can't 'reason' about the hardware. Great data. I'm serious.