Sakana Fugu: a multi-agent system as a model

2026-06-23 · 12 min · llm · multi-agent · orchestration · reinforcement-learning · explainer

No single LLM wins everywhere. One model leads on competition math, another on agentic coding, a third on multilingual work, and open models win on cost. The usual response is to pick one and absorb its weak spots. Sakana AI's bet is the other one: don't pick a model — orchestrate a pool of them, and make the orchestration itself the model.

That product is Sakana Fugu (and a heavier tier, Fugu Ultra), shipped behind a single API. Underneath are two ICLR 2026 papers that attack the same problem from opposite ends: TRINITY evolves a tiny coordinator over frozen models, and the Conductor reinforcement-learns a 7B model to write orchestration plans in natural language. This is a walk through both, and what they add up to.

A multi-agent system as a model

Fugu's framing is the whole pitch: one OpenAI-compatible endpoint. You send a request to model: fugu; behind it a learned coordinator assembles a team from a pool of frontier and open models, runs them over several turns, and returns one answer. You never see the routing.

one endpoint · a pool of agents

POST /v1/chat · model: fugu

coordinator

TRINITY · Conductor

agents live

latency

low

billing

single rate

Base Fugu favours a lean, low-latency subset for everyday tasks. Opt a model out for compliance and the coordinator simply routes around it.

Sakana Fugu over a pool of closed and open models, with Fugu itself as one of the workers. — Sakana's own framing of the idea: one Fugu endpoint coordinating a pool of closed and open models — and Fugu can even call itself as a worker (the recursive node on the right).

The pool is swappable — you can opt a model out for compliance and the coordinator routes around it — and billing is a single top-tier rate rather than stacked per-model fees. There's even an export-controls angle: because Fugu can hit frontier-level quality by coordinating open and semi-open models, you get the capability without hard dependence on any one restricted vendor.

But the API is the boring part. The interesting part is that the coordinator is learned, not hand-written. There are two ways to learn it.

TRINITY: evolve a tiny coordinator

TRINITY's constraint shapes everything: you cannot fine-tune GPT-5's weights, and merging models with incompatible architectures doesn't work. So freeze every model in the pool, and learn only a tiny thing on top that decides who does what.

TRINITY's coordination architecture: a coordinator selects an agent and a role each turn, looping Thinker, Worker, Verifier, with a worked example. — TRINITY's coordination loop, from the paper: the coordinator picks an agent and a role each turn, with a worked Thinker → Worker → Verifier example on the right.

The coordinator is under 20,000 parameters

A small model — Qwen3-0.6B — reads the current problem state and produces a hidden vector; a linear head turns that into a choice of agent and role. Given the penultimate-token hidden state $h(s)\in\mathbb{R}^{d}$ from the small model, a head $f_\theta$ of roughly 10K parameters emits logits over $L$ agents plus 3 roles, and the coordinator samples its action $a$ from

\pi_\theta(a \mid s) \;\propto\; \exp\!\big(f_\theta(h(s))_a\big), \qquad a \in \{1,\dots,L\}\cup\{\mathrm{T},\mathrm{W},\mathrm{V}\}

where $s$ is the running transcript, $\mathrm{T},\mathrm{W},\mathrm{V}$ are the three roles below, and $\theta$ is everything that gets trained. On top of the head, TRINITY adds singular-value fine-tuning: take an SVD of one or two of the small model's weight matrices and learn only the singular-value scales, keeping the orthogonal factors fixed. That's a few thousand more numbers. Total trainable: under 20K parameters. The 0.6B backbone and all seven frontier and open models stay frozen.

The entire trainable surface of TRINITY: a hidden state, a ~10K linear head, and a categorical choice over agents and roles. Everything below the head is frozen.

The small model's hidden states are linearly separable by task type (SVM) and form clear task clusters in a t-SNE plot. — Why a ~10K linear head is enough: the small model's hidden states already separate by task type — a linear SVM classifies them almost perfectly (left), and t-SNE shows clean task clusters (right).

Three roles, looped until accepted

Each turn, the coordinator gives the chosen agent one of three roles:

Thinker — plan, decompose, or critique; no direct work.
Worker — do the work: derive, compute, write code.
Verifier — check the current answer and return ACCEPT or REVISE.

It loops, accumulating a transcript, and halts the moment a Verifier accepts (or a fixed turn budget $K$ is exhausted):

\tau \;=\; \min\{\, k \le K \;:\; R_k = \mathrm{V} \ \text{and}\ u_k = \mathrm{ACCEPT} \,\}

where $R_k$ is the role at turn $k$ and $u_k$ is the verifier's verdict. Step through one problem — watch a wrong answer get caught and revised before it's accepted:

TRINITY — coordinate a pool by roleturn 1/5

ThinkerWorkerVerifier

problem: a $50,000 machine, $5,000 salvage, 8-year life — annual depreciation?

Thinker←Gemini-2.5-Pro· plan / decompose / critique

Decompose: straight-line depreciation = (cost − salvage) / life. Read off cost = 50000, salvage = 5000, life = 8.

Trained by evolution, not gradients

Why not just RL the head? Because the reward is binary — the final answer is right or wrong — and the head is tiny, so the per-parameter gradient signal is buried in noise. TRINITY instead optimizes the coordinator with a derivative-free evolution strategy, maximizing expected terminal reward:

J(\theta) \;=\; \mathbb{E}_{\tau \sim \pi_\theta}\big[\, R(\tau) \,\big], \qquad R(\tau) \in \{0, 1\}

The optimizer is separable CMA-ES: it keeps a diagonal Gaussian over the ~10K parameters, samples a small population each generation — $\lambda = \lceil 4 + 3\ln n \rceil \approx 32$ for $n \approx 10{,}000$ — evaluates each candidate's fitness by actually running rollouts, and shifts the distribution toward the winners. The paper shows the coordination objective is nearly block-separable, which is exactly the regime where a diagonal evolution strategy beats both random search and gradient RL under a tight evaluation budget. The honest cost: no gradients means you pay in environment evaluations, and each one is a full multi-turn rollout against real model APIs.

It beats every model in its pool

This is the result that matters. Transferred zero-shot to four held-out tasks, the evolved coordinator outscored every individual model in its pool — including GPT-5, Gemini-2.5-Pro, and Claude-4-Sonnet. On LiveCodeBench it set a record at the time of submission:

LiveCodeBench v6 — pass@1 (%)

TRINITY

86.2

GPT-5

83.8

Gemini-2.5-Pro

67.2

Claude-4-Sonnet

46.5

And the multi-turn loop earns its keep: accuracy climbs from 0.823 at two turns to 0.863 at six. One cheap evolved head, a frozen pool, and the ensemble beats its best member.

TRINITY's LiveCodeBench result and its accuracy rising with the turn budget. — TRINITY's own result: on LiveCodeBench it reaches 0.862 pass@1, above GPT-5 (0.838), Gemini-2.5-Pro (0.672), and Claude-4-Sonnet (0.465) — and accuracy keeps climbing with the turn budget (bottom).

Conductor: orchestration written in natural language

The Conductor attacks the same problem with a bigger hammer: a 7B model (Qwen2.5-7B) trained with RL to write the entire workflow itself, in natural language.

Three lists are a workflow

For each problem the Conductor emits three synchronized lists:

model_id — which agent runs each step.
subtasks — a natural-language instruction for each step.
access_list — which earlier outputs each step is allowed to read.

Those three lists are a directed graph. The access_list is the load-bearing idea: [] means the step sees only the original question, ["all"] means it sees everything produced so far, and [0, 2] means it sees steps 0 and 2. By choosing access lists, the Conductor designs the communication topology — a chain, parallel branches, a verify-and-merge — per problem, not from a fixed template. Flip between the topologies it learns to produce:

three lists → one workflow

model_id = [2, 0, 3, 1]

subtasks = ["restate", "solve A", "solve B", "verify + merge"]

access_list = [[], [0], [0], ["all"]]

0: Gemini-2.51: GPT-52: DeepSeek-R13: Claude-4

Two solvers run from the same setup, then a verifier with access "all" sees every prior output and merges them. The Conductor designs this branch-and-join itself — no human topology.

Trained with GRPO

The Conductor is trained end-to-end with GRPO. For each question it samples a group of $G = 64$ candidate workflows, scores each, and pushes the policy toward the above-average ones using the group-normalized advantage

A_i \;=\; \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)}

The reward $r_i$ is blunt on purpose: $0$ if the three lists don't parse, $1$ if the final workflow output is correct, and $0.5$ otherwise — with no KL penalty ( $\beta = 0$ ). The whole thing trains on just 960 problems for 200 iterations on two H100s. To make one Conductor work over any pool, they then fine-tune it with randomly sampled $k$ -model subsets per question, so it adapts to whatever agents you hand it.

Conductor accuracy climbing over 200 GRPO iterations for out-of-distribution, in-distribution, and mixed agent pools. — Coordination strategy emerging during training: accuracy climbs over 200 GRPO iterations as the Conductor learns to design better workflows — fastest when its few-shot examples are held out-of-distribution.

It can call itself

The Conductor may name itself as a worker. That spawns a fresh sub-workflow on its own draft — a recursive topology that turns inference depth into a tunable compute axis, what Sakana calls dynamic test-time scaling. Recursion buys a point or two on the hardest benchmarks for under 2× the agent calls.

Results

A 7B model orchestrating frontier workers beats the frontier workers. In a controlled run over the same pool:

LiveCodeBench — controlled, shared worker pool (%)

Conductor (7B)

64.3

GPT-5

57.5

Gemini-2.5-Pro

40.1

Claude-4

MoA

38.6

Unconstrained, the headline numbers were each a new high at publication and each above the best single worker: 83.9% on LiveCodeBench, 87.5% on GPQA-Diamond, 93.3% on AIME25 — reached with about 3 agent calls per question, versus 5–8 for prior multi-agent methods.

Conductor leading both GPQA-Diamond and LiveCodeBench against every individual worker model. — The Conductor (highlighted) tops both GPQA-Diamond and LiveCodeBench against every individual worker in its pool — GPT-5, Gemini-2.5-Pro, DeepSeek-R1, and Claude Opus 4.

Scatter of average performance versus average number of agent calls: the Conductor is high-performance at about 3 calls, versus MoA at 8 calls. — Performance versus cost: the Conductor sits top-left — higher accuracy than every multi-agent baseline at roughly 3 agent calls, where MoA needs 8.

Two routes to the same place

TRINITY and the Conductor are the same idea — a learned layer that coordinates a pool — built at opposite scales:

	TRINITY	Conductor
Learnable size	< 20K params (evolved head)	7B params (RL-trained model)
Training	derivative-free sep-CMA-ES	GRPO (reinforcement learning)
Output per step	(agent, role)	a full natural-language workflow
Coordination	fixed Thinker/Worker/Verifier loop	a topology it designs per problem
Reads the task via	the small model's hidden state	reasoning in language
Adapts to new pools	re-evolve (cheap)	randomized-pool fine-tune

TRINITY is the minimal, almost-free coordinator; the Conductor is the expressive one that designs bespoke pipelines. Fugu uses both as its engine.

What ships: Fugu and Fugu Ultra

Two tiers. Base Fugu balances quality and latency over a lean pool. Fugu Ultra coordinates a deeper pool over more turns for hard, high-stakes problems, and takes longer for it. On Sakana's reported numbers, both match or beat the frontier:

SWE-Bench Pro (%)

Fugu Ultra

73.7

Claude Opus 4.8

69.2

Fugu Ultra also posts 50.0 on Humanity's Last Exam, against baselines in the 41–50 range. It's an OpenAI-compatible endpoint — change the base URL and key, no SDK migration — and it bills at a single top-tier rate. (Not available in the EU yet, pending GDPR; the exact routing decisions are kept proprietary.)

Fugu and Fugu Ultra versus Fable 5, Gemini 3.1 Pro, GPT-5.5, and Claude Opus 4.8 across eight benchmarks. — Fugu and Fugu Ultra (red) against Fable 5, Gemini 3.1 Pro, GPT-5.5, and Claude Opus 4.8 across eight benchmarks (Sakana). Fugu Ultra leads on SWE-Bench Pro (73.7 vs 69.2), GPQA-D, LiveCodeBench, and Humanity's Last Exam.

What I make of it

The honest read:

The win is real. An orchestration layer that beats every model it coordinates — and generalizes zero-shot to unseen tasks — is a genuine result. "Coordination" is now a trainable layer that sits above frontier models rather than inside one.
The costs are real too. Every model in the pool has to be available at inference; you trade single-model simplicity for a fleet, and latency rises with the extra turns. The biggest gains concentrate on long-tail reasoning and coding benchmarks — on easy tasks the lift is small — and leaning on GPT-5/Claude/Gemini as workers inherits their cost.
The framing is the interesting part. TRINITY argues the coordinator can be almost free: 20K evolved parameters over frozen models. The Conductor argues coordination is itself a reasoning skill worth a 7B model and a full RL run. Both point the same way — as individual models plateau, the next axis is how you make several of them work together, and that orchestration is learnable.

Built on Sakana AI's TRINITY: An Evolved LLM Coordinator and Learning to Orchestrate Agents in Natural Language with the Conductor, both ICLR 2026. Product: Sakana Fugu.