speedy_devv

Anyone else running Opus 4.7 yet? This one feels different (with CC harness)


Anthropic just shipped Opus 4.7 today and I had to write about it somewhere, because the jump is weird.

I ran the same backlog task on 4.6 and 4.7 back to back: same repo, same prompt, same tools. 4.6 looped on a bug for 25 minutes and was not going to solve it. 4.7 closed it in 11, and the part that freaked me out is that it paused in the middle to sanity-check an assumption I had not asked it to check. It literally wrote "before i write this migration, let me verify the actual shape of the response object, because my assumption here might be wrong" and then went and verified it. Unprompted.

That self-verification behavior is the thing. Vercel is reporting it does proofs on systems code before starting work. Hex says it flags missing data instead of making up plausible-but-wrong fallbacks. Genspark measured loop rates on hard queries and 4.7 basically stopped looping. different teams, different harnesses, same pattern.

The numbers are nuts too:

- CursorBench: 58% on 4.6 → 70% on 4.7

- Rakuten SWE-Bench: 3x more production tasks resolved

- XBOW visual-acuity: 54.5% → 98.5%

- Notion: tool errors cut to 1/3

- same $5 / $25 pricing as 4.6

Also new stuff that shipped with it: an `xhigh` effort tier (between `high` and `max`; it's now the default in Claude Code), task budgets in beta (an advisory token cap the model can see and plan against, distinct from `max_tokens`), an `/ultrareview` slash command in Claude Code, and Auto mode extended to Max subscribers.
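To make the budget-vs-cap distinction concrete, here is a minimal sketch of how the two might sit side by side in a request body. The field names `task_budget_tokens` and `effort` are my guesses based on the description above, not confirmed API parameters, and the model string is just the name as used in this post.

```python
def build_request(prompt: str, hard_cap: int, budget: int) -> dict:
    """Build a hypothetical request body.

    max_tokens is a hard cutoff enforced server-side; the task budget is
    advisory: the model can see it and plan against it, but it is not a
    cutoff. Field names here are assumptions, not documented parameters.
    """
    if budget > hard_cap:
        raise ValueError("advisory budget should not exceed the hard cap")
    return {
        "model": "claude-opus-4-7",       # name as used in this post
        "max_tokens": hard_cap,           # hard cutoff
        "task_budget_tokens": budget,     # advisory; hypothetical field name
        "effort": "xhigh",                # the new default tier per the post
        "messages": [{"role": "user", "content": prompt}],
    }
```

The invariant worth keeping either way: the advisory number should sit below the hard cap, or the model is planning against headroom it doesn't have.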

Migration on the API is not free though: extended thinking is gone, requests with sampling params now return a 400, and the new tokenizer uses 1.0-1.35x the token count of the old one. Claude Code users get most of this handled for them.
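For API users, the two breakages above translate into local pre-processing before you call the endpoint. A hedged sketch of that shim, assuming the worst-case 1.35x tokenizer inflation and a guessed set of sampling param names (neither helper is anything official):

```python
import math

# Assumed set of params that would now 400; adjust to what your pipeline sends.
SAMPLING_PARAMS = {"temperature", "top_p", "top_k"}

def migrate_payload(payload: dict, tokenizer_factor: float = 1.35) -> dict:
    """Drop params that would 400 and scale token limits for the new tokenizer."""
    out = {k: v for k, v in payload.items() if k not in SAMPLING_PARAMS}
    if "max_tokens" in out:
        # Old limits were sized for the old tokenizer; inflate to compensate.
        out["max_tokens"] = math.ceil(out["max_tokens"] * tokenizer_factor)
    return out
```

Silently dropping params is the lazy choice here; in production you probably want to log or raise instead, so you notice which pipelines were relying on them.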

Anyway, two questions for anyone who's already on it:

1. If you run agentic pipelines on the API, what's the worst migration pain you've hit?

2. Anyone sizing task budgets in production yet? How are you picking the number?
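On question 2, my own starting point (a heuristic, not anything the docs prescribe) would be a high percentile of historical per-task token usage plus headroom:

```python
import math

def size_budget(historical_tokens: list[int], pct: float = 0.95,
                headroom: float = 1.2) -> int:
    """Pick an advisory budget: the `pct` percentile of past per-task usage,
    scaled by a headroom factor. Purely a local sizing heuristic."""
    if not historical_tokens:
        raise ValueError("need at least one observation")
    xs = sorted(historical_tokens)
    idx = min(len(xs) - 1, math.ceil(pct * len(xs)) - 1)
    return math.ceil(xs[idx] * headroom)
```

The obvious caveat is that a single distribution across all task types is wrong; you'd want to bucket by task shape first.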


Full writeup here if you want the details: https://buildthisnow.com/blog/models/claude-opus-4-7


Replies

Brandon Elliott

The looping fix is huge if it holds up across different setups. That's been one of the biggest pain points with agentic workflows.

Brian Douglas

The idea of task budgets is really interesting. Feels like early infrastructure for more predictable agent behavior, but it also adds another layer to tune.

Bruce Warren

Migration pain sounds non-trivial though. Removing sampling params and changing tokenization could break a lot of existing pipelines.

James Swift

On task budgets, I'm not sizing them yet. I'm pinned to 4.6 through a launch in a few weeks, and the tokenizer change alone means I want to benchmark before migrating anything. The point about task shape vs. task size is exactly right though: a budget that works for greenfield generation would be wrong for constrained validation work, even at a fraction of the line count.

Shyun Bill

That self-verification shift is exactly what turns a standard bot into a reliable agentic engine. During my own market discovery for a specific niche, the "hallucination loop" was the biggest productivity killer, so 4.7 feels like a godsend. The tokenizer change is a bit of a hidden tax, but it's a small price for a model that actually stress-tests its own logic. I'm still wrapping my head around task budgets. Are you seeing them actually curb the "infinite loop" costs in production? It definitely beats "hallucinated encouragement" any day of the week!