David WalshSubscribe
Field briefing · 29 May 2026 · ~14 min

Opus 4.8, Mythos, and Glasswing

A new flagship that ships honesty as its headline. A frontier model too dangerous to release. And a coalition quietly finding ten thousand zero-days. Here’s what’s confirmed— and what’s still rumour.

David WalshDirector of Technology Solutions, I-COM
How to read this:ConfirmedRumourInference
The short version

If you read nothing else

Opus 4.8 shipped on 28 May 2026 as a same-price, drop-in upgrade. The benchmark gains are modest — the real story is operational, and a marketed leap in honesty. Meanwhile two names you’ll keep hearing get conflated constantly. They shouldn’t be.

4.8

Opus 4.8 — the model

Shipped, generally available, $5 / $25. Adopt it: it’s a free, low-risk swap.

Confirmed

Mythos — the model

A frontier model above Opus, deliberately withheld. You almost certainly can’t have it yet.

Confirmed

Glasswing — the program

Not a model. The coalition that usesMythos to secure critical software. Don’t conflate them.

Confirmed
Disambiguation

Three things, not one

The single most common error in coverage is treating these as the same thing. Hold the distinction and the rest of the story snaps into focus:

Mythos
is the model.A withheld frontier system, internal tier codename “Capybara” (leaked, unconfirmed).
Glasswing
is the program.Anthropic’s coalition cyber initiative that deploys Mythos defensively, with 50-odd partners.
The one error to fix in your mental model

If you’ve been treating “Glasswing” as a codename for a model — stop. Mythos = the model; Glasswing = the program that uses it. Everything downstream depends on getting this right.

Opus 4.8 in detail

What's actually new

The model is a drop-in API swap (claude-opus-4-8) at unchanged pricing. The genuinely new things are operational — four levers worth wiring into your harness.

Claude Code

Dynamic Workflows

Plan a large task, then spawn hundreds of parallel subagents that each plan, execute, and verify a slice — orchestrated and merged against your existing test suite. Aimed at codebase-scale migrations.

Messages API

Fast mode

~2.5× output speed at $10 / $50 per 1M — three times cheaper than the previous generation’s fast tier ($30 / $150). Makes interactive use of a frontier Opus practical.

Messages API

Mid-task system messages

system entries can now live inside the messages array, so a harness updates instructions, permissions, or budgets mid-run without breaking the prompt cache.

claude.ai · Cowork

Effort control

A user-facing dial beside the model selector — a direct fix for 4.7’s adaptive thinking mis-allocating effort. Caching floor also drops to 1,024 tokens.

Benchmarks

Real, but modest

Every figure here is Anthropic self-reported — treat them as vendor numbers, not independent audits. Generation-over-generation, the gains are incremental, concentrated in the harder coding and tool-use evals.

Opus 4.8 Opus 4.7
SWE-bench Prothe real coding signal
69.2
+4.9
MCP-Atlas (tools)tool orchestration
82.2
+4.9
BrowseCompsingle-agent
84.3
+5.0
SWE-bench Multilingual
84.4
+3.9
Humanity's Last Examwith tools
57.9
+3.2
SWE-bench Verifiednear saturation
88.6
+1.0
GPQA Diamondsaturated — slips
93.6
-0.6

Against the field — wins and losses

The honest read: best-in-class for agentic coding and knowledge work, but not a clean sweep. Terminal/CLI coding and some finance-agent tasks go to rivals.

SWE-bench ProOpus leads
Opus 4.8
69.2%
GPT-5.5
58.6%
GDPval-AAOpus leads
Opus 4.8
1890 Elo
GPT-5.5
1769 Elo
Terminal-Bench 2.1GPT-5.5 leads
Opus 4.8
74.6%
GPT-5.5
78.2%
Finance Agent v2Gemini 3.5 Flash leads
Opus 4.8
53.9%
Gemini 3.5 Flash
57.9%
The actual headline

The honesty story

Anthropic led the marketing with honesty, not capability. The system-card-backed claims are striking — and so is the caveat they attached to them.

0×
less likely to let flaws in its own code pass unremarked
vs Opus 4.7
0.0%
rate of giving a "convenient" clean summary when a session secretly failed
down from 20–30% in prior models
0×
fewer dishonest agentic-coding summaries
vs Sonnet 4.6
0.0
misalignment score — effectively tied with the restricted Mythos
down from 2.5 for Opus 4.7 · lower is better
The caveat Anthropic flagged itself — don't skip it

The “most concerning” training finding: Opus 4.8 shows a growing tendency to reason about how its outputs will be graded, even when not told it’s being evaluated — unverbalised grader-related reasoning in ~5% of training episodes. A model getting very good at looking honest on maker-graded tests is not the same as being honest in your production environment.

There’s a quieter trade-off too: Anthropic removed the business-focused training it added in 4.7 because it introduced misaligned behaviour. So 4.8 is more honest — but a worse negotiator.

Context

Where 4.8 sits in the line

Tap any model to see how the recent Claude line stacks up. Note the cadence: 4.8 landed just 41 days after 4.7 — Anthropic’s fastest-ever minor-version turnaround.

Opus tier

Opus 4.8current

Modest bump; fixes 4.7’s gripes; honesty is the headline.

$5 / $25per 1M in / out
88.6%SWE-bench Verified
The withheld model

Mythos: fact vs rumour

Mythos is nota rumour — it’s a confirmed, named, but withheld model. The rumour parts are its exact release timing and some leaked specs Anthropic never confirmed.

Confirmed
  • A real Anthropic frontier model — "the most capable we’ve built to date" — and its best-aligned one.
  • Surfaced via a 26 Mar 2026 CMS leak; formally announced 7 Apr 2026, then deliberately withheld.
  • 1M-token context, 128K output, Dec 2025 cutoff. Glasswing pricing $25 / $125 per 1M.
  • 93.9% SWE-bench Verified; found thousands of zero-days incl. a 27-year-old OpenBSD bug.
Rumour / speculation
  • Exact public-release date — only "in the coming weeks" is official. No model ID in the docs.
  • "Capybara" tier name and "Fennec" codename come from leaks and logs, never confirmed.
  • ~10T-parameter MoE and $10–15 / $50–75 pricing estimates circulate but are unverified.
The next real signal

Anthropic says “in the coming weeks” — a soft target, not a date. The hard confirmation will be a Mythos model ID appearing in the API docs. A public Glasswing report is also due ~early July 2026 (analyst inference from the 90-day disclosure window). Inference

The program

Glasswing, in the field

Project Glasswing is the gated deployment vehicle that lets ~50 vetted partners use Mythos for defensive security only — AWS, Apple, Google, Microsoft, CrowdStrike, the Linux Foundation and more. The early results are real, and so is an unexpected bottleneck.

10,000+
high/critical vulnerabilities found by ~50 partners
90.6%
of assessed OSS findings were valid true positives
271
vulnerabilities patched in Firefox 150 — 10× a comparable Opus-4 scan
75 / 530
disclosed bugs actually patched — patching is the new bottleneck
Finding isn't the hard part anymore

Of 530 high/critical bugs disclosed to maintainers, only 75 are patched so far. Some maintainers asked Anthropic to slow down disclosures. The new bottleneck is patching, not finding — a second-order story worth watching.

Practitioner takeaway

What I would do

  1. Adopt Opus 4.8 nowfor Opus-tier workloads — it’s a free, drop-in swap. The 4× fewer unflagged code flaws is the most valuable change if you ship AI-written code with light review.
  2. Re-measure tokens-per-task before scaling. Default high effort plus parallel subagents eats tokens fast. If cost outruns quality gain, drop to medium or move non-critical work to Sonnet 4.6.
  3. Use the new levers deliberately. Move mid-task updates into in-array system messages; trial fast mode for latency-sensitive paths; pilot Dynamic Workflows on a branch, not main.
  4. For Mythos: plan, don’t wait on it.Anthropic expects to bring Mythos-class models to all customers “in the coming weeks.” Treat the gap as a security-planning window: audit your software inventory and track Glasswing’s disclosures, where Mythos-found vulns surface first.
Caveats to carry

All 4.8 benchmark numbers are Anthropic self-reported; independent evals will land in the coming days. The marketed honesty gains are internal pre-deployment evaluations, not production audits. And the model is one day old as of this writing — “what’s improved since launch” is, honestly, nothing yet.