Same model, same tasks. 4 browser automation tools used wildly different amounts of tokens. Why?
I watched Claude read the same Wikipedia page 6 times to extract one fact. The answer was right there after the first read. But something about the tool interface kept making it look again.
That got me curious. If every browser automation tool can get the right answer, what actually determines how much it costs to get there?
So I ran a benchmark. 4 CLI browser automation tools. Same model (Claude Sonnet 4.6). Same 6 real-world tasks against live websites. Same single Bash tool wrapper. Randomized approach and task order. 3 runs each. 10,000-sample bootstrap confidence intervals.
The results (average tokens per task / wall time / tool calls):
Tool A: 36,010 tokens / 84.8s / 15.3 tool calls
Tool B: 77,123 tokens / 106.0s / 20.7 tool calls
Tool C: 94,130 tokens / 118.3s / 25.7 tool calls
Tool D: 90,107 tokens / 99.0s / 25.0 tool calls
All four scored 100% accuracy across all 18 task executions. Every tool got every task right. But the other three used 2.1 to 2.6x more tokens than the cheapest.
The biggest predictor of cost was tool call count. Every call forces the LLM to re-process the entire conversation history. Tool A averaged 15.3 calls. The others averaged 20 to 26. That gap alone accounts for most of the token difference.
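A back-of-the-envelope sketch shows why call count dominates: every tool call re-sends the history so far, so input tokens grow roughly quadratically with the number of calls. The token constants below are illustrative, not measured from the benchmark:

```python
# Toy model: each tool call re-sends the full conversation history,
# so input tokens accumulate quadratically with call count.
def total_input_tokens(n_calls: int, system_prompt: int = 2_000,
                       per_call_output: int = 1_500) -> int:
    """Sum of the history size at each call (illustrative constants)."""
    total = 0
    history = system_prompt
    for _ in range(n_calls):
        total += history            # the model re-reads everything so far
        history += per_call_output  # tool result gets appended to history
    return total

print(total_input_tokens(15))  # 187500
print(total_input_tokens(25))  # 500000 -- ~2.7x the input tokens for 1.7x the calls
```

Under this toy model, going from 15 to 25 calls (a 1.7x increase) nearly triples cumulative input tokens, which is the same shape as the gap in the results above.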
How they differ
All four maintain persistent browser sessions via background daemons. All four can execute JavaScript and return the result. All four operate on compact page state rather than raw page dumps. So the capabilities are similar.
The difference is in how they expose those capabilities to the LLM.
Three of the tools expose individual CLI commands (open, click, fill, scroll, etc.). The LLM issues one command per tool call. One of them also has a code execution mode for JS batching, but it still defaults to individual commands for most operations.
Tool A has no individual commands. Its only interface is a code block: navigate, click, and evaluate are all async Python functions. The LLM writes multiple operations as consecutive lines in a single call because there is no other way to use the tool.
This seems to encourage batching naturally. When there is no "click" command to reach for, the LLM writes click + evaluate + print as three lines in one call instead of three separate calls. That said, every tool's run got the same instruction to batch and be efficient, so the interface explanation is plausible, not proven.
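To make the shape difference concrete, here is a hypothetical sketch. The function names and stub bodies are invented for illustration; they are not any of the benchmarked tools' real APIs:

```python
import asyncio

# Hypothetical stand-ins for a code-block browser API -- invented
# names, not any real tool's interface.
async def navigate(url: str) -> None: ...
async def click(selector: str) -> None: ...
async def evaluate(js: str) -> int: return 42  # pretend the page has 42 results

async def one_batched_call() -> int:
    # A command-style tool needs three tool calls (three full history
    # re-reads) for this sequence; a code-block interface does it in one.
    await navigate("https://en.wikipedia.org/wiki/Special:Search")
    await click("#search-submit")
    return await evaluate("document.querySelectorAll('.result').length")

print(asyncio.run(one_batched_call()))  # → 42
```

Only the final value returns to the model, so the two intermediate steps never appear in the conversation history at all.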
Where the gap was biggest
The per-task breakdown is interesting. On simple tasks like content analysis (read a page and summarize it), all four tools used roughly the same tokens. The gap showed up on complex multi-step tasks:
search + navigate: Tool A used 16k tokens, Tools B-D used 28k to 48k
form fill: Tool A used 8k tokens, Tools B-D used 16k to 32k
Tasks that require multiple sequential interactions are exactly where batching has the most room to reduce round trips. Makes sense.
What this means in dollars
At scale this adds up. On Sonnet 4.6 pricing ($3/$15 per million tokens), if you run 1,000 browser tasks per day:
Tool A: roughly $600/month
Tools B-D: roughly $1,200 to $1,450/month
On Opus 4.6 ($5/$25 per million), the spread is $1,200 vs $2,250-$2,800/month.
Same model. Same tasks. Same accuracy. $600 to $1,600 per month difference just from how the tool presents itself to the LLM.
Methodology
- Single generic Bash tool for all 4 (identical tool-definition overhead)
- Both approach order and task order randomized per run
- Persistent daemon for all 4 tools (no cold-start bias)
- Browser cleanup between approaches
- 6 tasks: Wikipedia fact lookup, httpbin form fill, Hacker News extraction, Wikipedia search+navigate, GitHub release lookup, example.com content analysis
- N=3 runs, 10,000-sample bootstrap CIs
- Full methodology and raw data linked in comments
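For reference, the bootstrap step works roughly like this. This is a generic percentile-bootstrap sketch with made-up sample values, not the benchmark's actual script or data:

```python
import random
import statistics

def bootstrap_ci(samples, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    means = sorted(
        statistics.mean(rng.choices(samples, k=len(samples)))  # resample with replacement
        for _ in range(n_resamples)
    )
    lo = means[int(n_resamples * alpha / 2)]        # 2.5th percentile
    hi = means[int(n_resamples * (1 - alpha / 2))]  # 97.5th percentile
    return lo, hi

# e.g. three per-run token totals for one tool (illustrative numbers)
lo, hi = bootstrap_ci([35_200, 36_100, 36_700])
print(lo, hi)
```

With N=3 the intervals are wide, which is why reporting them matters: a 2x+ gap that survives a bootstrap CI at this sample size is a large effect.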
The bigger question
The thing that stuck with me is not really about browser tools. It is about how we design interfaces for LLMs in general. These four tools have remarkably similar capabilities. But the LLM used them very differently. Something about the interface shape changed the model's behavior, and that behavior change drove a 2x cost difference.
Has anyone else noticed this pattern with other types of tools? I am curious whether code-first interfaces consistently lead to fewer tool calls across other domains too.
Full methodology: https://docs.openbrowser.me/cli-comparison