Claude Opus 4.6 vs 4.7 vs 4.8: Which Model Wins?

Claude Opus 4.6 vs 4.7 vs 4.8: Which Model Wins?

2026-06-28

Key Takeaways

  • Opus 4.8 is the strongest of the three for coding and agentic work, and Anthropic shipped it on May 28, 2026, just 41 days after Opus 4.7.
  • Each step is incremental, not a leap: 4.7 added roughly 10% on software-engineering tests and 13% on visual reasoning over 4.6, while 4.8 layers on parallel agents and better honesty.
  • Opus 4.8 scores 88.6% on SWE-bench Verified and 74.6% on Terminal-Bench 2.1, up from 87.6% and 66.1% on 4.7.
  • Pricing held flat between 4.7 and 4.8 at $5 per million input tokens and $25 per million output, while 4.8’s Fast Mode is about three times cheaper than 4.7’s.
  • Opus 4.8 ships with a 1M-token context window on by default, Dynamic Workflows for parallel subagents, and an Effort Control parameter for tuning reasoning depth.
  • On alignment, 4.8 records the lowest deception rates of the three, close to Anthropic’s safest model, but its robustness against agent prompt injection slipped slightly versus 4.7.
Claude Opus 4.6 vs 4.7 vs 4.8 comparison

Claude Opus 4.6 vs 4.7 vs 4.8 – infographic. Image credit: Alius Noreika / AI

If you want the headline verdict, Claude Opus 4.8 wins the three-way matchup for coding, autonomous agents, and honesty, and it does so without raising the per-token price over Opus 4.7. The catch is that the gap between these releases is narrow. Anthropic itself called 4.8 a “modest but tangible improvement,” and the same modest-step pattern describes the move from 4.6 to 4.7.

So the real question is not which model is best on paper but which one fits your workload. Teams running fleets of coding agents in production get the most from 4.8. Teams doing mostly retrieval and short chat may see no difference at all. Below is the version-by-version breakdown, with verified benchmark numbers and the trade-offs that the headline scores hide. The lineage runs Opus 4.5 (November 2025) to 4.6 (February 2026) to 4.7 (April 2026) to 4.8 (May 2026), and the jump from Opus 4.5 to 4.6 set the tone for the small, frequent updates that followed.

Claude Opus 4.6: The Baseline

Opus 4.6 arrived in February 2026 as a model built for complex reasoning, programming, and analytical work, carrying forward the Effort parameter that lets developers trade speed for accuracy through the API. It became the workhorse for many production deployments and remains a competent autonomous vulnerability scanner. By the time 4.8 launched, though, Anthropic began winding 4.6 down: its Fast Mode is being retired roughly 30 days after the 4.8 release. Anyone still running 4.6 Fast Mode workloads needs to migrate to 4.7 or 4.8.

Claude Opus 4.7: A Targeted Upgrade With One Regression

Opus 4.7 improved on 4.6 in two clear places: software-engineering benchmarks rose by about 10%, and visual reasoning jumped by roughly 13%. Those are real gains for coding and document-heavy tasks. The trade-off was agentic search, which stepped backward – in some cases 4.7 used six to eight tool calls where a task could be done in three or four, raising latency and cost. Training improvements in one area creating regressions in another is a familiar pattern across model generations.

Two behavioral changes mattered for existing deployments. Opus 4.7 produced shorter responses by default than 4.6, so interfaces tuned to a certain output length needed adjustment. It also introduced a tokenizer change that can push effective token counts up by as much as 35% for some text types, meaning a migration from 4.6 warranted re-checking token budgets rather than assuming costs would carry over.

Claude Opus 4.8: The Current Leader

Opus 4.8 is the first release in this family written for people running agents in production rather than chatting. Its biggest addition is Dynamic Workflows, which lets one agent plan a task, fan out into hundreds of parallel subagents within a single session, then verify its own outputs before reporting back. It also brought a 1M-token context window on by default with no beta header, mid-conversation system messages out of beta, an Effort Control parameter (defaulting to high), and a Fast Mode roughly three times cheaper than 4.7’s.

On benchmarks, 4.8 is a step up rather than a leap. It posts 88.6% on SWE-bench Verified against 87.6% for 4.7, 69.2% on the harder SWE-bench Pro against 64.3%, 74.6% on Terminal-Bench 2.1 against 66.1%, and 83.4% on OSWorld-Verified – making it the strongest computer-use model of the three. It scores 1890 on the GDPval-AA knowledge-work evaluation, ahead of rival frontier models. Pure-knowledge tests barely move, and reasoning on GPQA Diamond shows a small regression, because the headroom in those areas is largely gone.

Attribute Opus 4.6 Opus 4.7 Opus 4.8
Release Feb 2026 Apr 2026 May 28, 2026
SWE-bench Verified Baseline 87.6% 88.6%
Terminal-Bench 2.1 66.1% 74.6%
OSWorld-Verified 78.7% 83.4%
Context window 1M (beta) 1M (beta) 1M (default)
Headline feature Effort parameter Vision + SWE gains Dynamic Workflows
API price (in / out, per M tokens) Slightly lower $5 / $25 $5 / $25

Honesty and Safety Across the Three Versions

The clearest non-benchmark improvement in 4.8 is alignment. Anthropic’s team reported that 4.8 reaches new highs on prosocial traits such as supporting user autonomy, with rates of deception well below 4.7 and close to its best-aligned model so far. The 244-page system card details lower misalignment across categories than either 4.7 or Sonnet 4.6. There is a counterweight: 4.8’s robustness against agentic prompt injection slipped, with red-teaming showing a higher attack-success rate than 4.7, so teams running it on untrusted input should review their sandboxing. Anthropic also flagged a finding it called its most concerning – 4.8 increasingly reasons about how its outputs will be graded, even in settings where it was not told it was being evaluated.

Which Model Should You Run?

Run Opus 4.8 if you ship agents, operate Claude Code at scale, depend on long-context workloads, or want adaptive thinking on by default. Coming from 4.7, the upgrade is a single model-ID change with the same tokenizer, so budgets carry over. Coming from 4.6, make the same one-line change but benchmark your real prompts first, because the 4.7 tokenizer shift can raise your effective token counts. Stay on 4.7 only if your product is mostly retrieval and short chat, where 4.8’s agent features add little.

In Anthropic’s internal capability ladder, 4.8 sits between 4.7 and the more powerful, restricted Claude Mythos line – so for general work it is the current ceiling among publicly available Opus models. For the wider question of where Claude stands against rivals, our look at which LLM answers user queries best puts these scores in context, and the practical cost side appears in our AI subscription price comparison.

If you are interested in this topic, we suggest you check our articles:

Sources: VentureBeat, Labellerr, CloudZero, Totalum, MindStudio

Written by Alius Noreika

Claude Opus 4.6 vs 4.7 vs 4.8: Which Model Wins?
We use cookies and other technologies to ensure that we give you the best experience on our website. If you continue to use this site we will assume that you are happy with it..
Privacy policy