[{"title":"GLM 5.2: long-horizon coding at a million tokens","description":"Z.ai's GLM 5.2 is a 744B/40B-active open-weights MoE with a real 1M-token context, built for long-horizon agentic coding. How IndexShare makes that context cheap, what changed in training, and where it lands against the frontier — with the benchmarks.","date":"2026-06-23","tags":["llm","glm","long-context","agentic-coding","explainer"],"draft":false,"kind":"articles","slug":"glm-5-2","body":"GLM 5.2, from Z.ai (Zhipu AI), is the flagship of the GLM-5 line: a 744-billion-\nparameter mixture-of-experts with 40B active per token, MIT-licensed open weights,\nand — the headline — a genuine **1-million-token context**. It is tuned for one thing\nin particular: long-horizon agentic coding, the sessions that run hundreds of rounds\nand thousands of tool calls without losing the thread.\n\nThere is no standalone GLM 5.2 paper. It builds on the GLM-5 technical report\n([arXiv 2602.15763](https://arxiv.org/abs/2602.15763)) and, for the context trick at\nits center, a method paper — IndexCache / IndexShare\n([arXiv 2603.12201](https://arxiv.org/abs/2603.12201)). This pulls from both, plus the\n[release blog](https://z.ai/blog/glm-5.2).\n\n## What changed from 5.1\n\nGLM-5 → 5.1 → 5.2 share the same 744B/40B backbone. What 5.2 adds:\n\n- a real **1M-token context**, up from 200K;\n- **IndexShare**, the architecture change that makes that context affordable;\n- a shift to **critic-based PPO** for very long RL rollouts;\n- faster speculative decoding (**+20% acceptance length**);\n- a **thinking-effort** dial (High / Max).\n\nThe first two are the load-bearing pair: the long context, and the trick that keeps\nit cheap.\n\n## The model\n\n744B total parameters, 40B active per token — a mixture-of-experts on an 80-layer,\n256-expert backbone. 
Attention is **DeepSeek Sparse Attention (DSA)**: Multi-head\nLatent Attention plus a lightweight *indexer* that, for each query, selects the\ntop-$k$ tokens worth attending to instead of the whole sequence. That sparsity is\nwhat makes a million-token context tractable at all.\n\n<Figure\n  src=\"/articles/glm-5-2/glm52-architecture-1m.png\"\n  alt=\"GLM 5.2 architecture for 1M context — DeepSeek Sparse Attention with the IndexShare layout.\"\n  caption=\"GLM 5.2's architecture for 1M context: sparse attention with a shared indexer (from the Z.ai release).\"\n/>\n\n## IndexShare: making 1M context cheap\n\nDSA has a catch. The indexer runs at *every layer*, and as the context grows toward\n1M tokens, that per-query top-$k$ search becomes the dominant cost. The IndexCache\npaper's observation is the whole insight: **adjacent DSA layers select almost the same\ntokens** — 70–100% of their top-$k$ overlap.\n\n<Figure\n  src=\"/articles/glm-5-2/indexcache-fig4-overlap-heatmap.png\"\n  alt=\"Heatmap of top-k token-selection overlap between every pair of layers, mostly 70-100%.\"\n  caption=\"Pairwise overlap of each layer's selected tokens. 
Neighbouring layers pick nearly identical sets — so recomputing the indexer for each is wasted work.\"\n/>\n\nSo compute the indexer once per group of layers and reuse its selection for the rest.\nGLM 5.2 shares one indexer across every 4 layers — skipping it in 3 of every 4:\n\n<IndexShare />\n\nIf the indexer's cost per layer scales with selecting top-$k$ over $L$ tokens, then\nsharing it across a group of $g$ layers amortizes that cost to $O(L/g)$ per layer.\nWith $g = 4$ and the rest of each layer unchanged, GLM 5.2 reports **2.9× lower\nper-token FLOPs at a 1M-token context**, with quality essentially intact.\n\n<Figure\n  src=\"/articles/glm-5-2/indexcache-fig2-architecture.png\"\n  alt=\"IndexCache inference loop: F-layers compute and cache indices, S-layers reuse them.\"\n  caption=\"The mechanism: an F-layer computes the indices and caches them; the following S-layers reuse the cache, skipping the indexer entirely.\"\n/>\n\nThe honest tradeoff: push reuse too far — share across 8 layers instead of 4 — and\nlong-context fidelity starts to degrade. One indexer per four layers is the sweet\nspot the paper settles on.\n\n## Faster decoding: MTP and KVShare\n\nGLM 5.2 also sharpens its multi-token-prediction layer (speculative decoding). With\nIndexShare, KVShare, and end-to-end training, the average **acceptance length rises\n~20% — from 4.56 to 5.47 tokens** per verification pass. More accepted tokens per\npass means faster generation, which matters most when you are streaming long agent\ntraces.\n\n<Figure\n  src=\"/articles/glm-5-2/glm52-mtp-indexshare-kvshare.png\"\n  alt=\"Two-step MTP inference with IndexShare and KVShare keeping train/infer KV consistent.\"\n  caption=\"Speculative decoding with IndexShare + KVShare — keeping the draft and verify passes consistent.\"\n/>\n\n## Training for the long horizon\n\nPretraining scaled to **28.5T tokens** (up from GLM-4.5's 23T). But the interesting\nchange in 5.2 is the agentic post-training. 
It moves from group-relative RL to a\n**critic-based PPO** that estimates token-level advantages from individual rollouts —\nwhich accommodates *trajectory compaction* without capping how long a trace can get.\nThat is exactly what you need when a single agent run is thousands of tool calls long\nand won't fit in one rollout.\n\nIt also adds an **anti-reward-hacking module**: a rule-based filter first catches\nlikely hacks (tuned for recall), then an LLM judge checks intent; on a detected hack\nthe system blocks the call and returns dummy information so the rollout continues\ninstead of being thrown away. All of it runs on Zhipu's open asynchronous RL\nframework, **slime**.\n\n## Benchmarks\n\nThe headline result: GLM 5.2 is the **strongest open-weights model on standard and\nlong-horizon coding**, closing much of the gap to Claude Opus 4.8 and GPT-5.5.\n\n<Figure\n  src=\"/articles/glm-5-2/glm52-coding-bench.png\"\n  alt=\"GLM 5.2 standard coding benchmark chart vs competitors.\"\n  caption=\"Standard coding benchmarks — GLM 5.2 as the strongest open model (Z.ai).\"\n/>\n\n<BenchBars\n  title=\"SWE-Bench Pro (%)\"\n  unit=\"\"\n  bars={[\n    { label: \"Claude Opus 4.8\", value: 69.2 },\n    { label: \"GLM 5.2\", value: 62.1, highlight: true },\n    { label: \"Qwen3.7-Max\", value: 60.6 },\n    { label: \"GPT-5.5\", value: 58.6 },\n    { label: \"GLM 5.1\", value: 58.4 },\n    { label: \"DeepSeek-V4-Pro\", value: 55.4 },\n  ]}\n/>\n\nWhere it stands out most is *long-horizon* coding — runs that have to stay coherent\nover many rounds — where it nearly catches Opus 4.8 and leaves the rest behind:\n\n<Figure\n  src=\"/articles/glm-5-2/glm52-longhorizon-bench.png\"\n  alt=\"Long-horizon coding benchmarks: FrontierSWE, PostTrainBench, SWE-Marathon.\"\n  caption=\"Long-horizon benchmarks (FrontierSWE, PostTrainBench, SWE-Marathon) — the gap to the frontier is small.\"\n/>\n\n<BenchBars\n  title=\"FrontierSWE — long-horizon dominance (%)\"\n  unit=\"\"\n  bars={[\n    { 
label: \"Claude Opus 4.8\", value: 75.1 },\n    { label: \"GLM 5.2\", value: 74.4, highlight: true },\n    { label: \"GPT-5.5\", value: 72.6 },\n    { label: \"Gemini 3.1 Pro\", value: 39.6 },\n    { label: \"GLM 5.1\", value: 30.5 },\n  ]}\n/>\n\nReasoning is strong — a near-perfect AIME — though it trails the very top closed\nmodels on the hardest knowledge benchmarks (GPQA, HLE):\n\n<BenchBars\n  title=\"AIME 2026 (%)\"\n  unit=\"\"\n  bars={[\n    { label: \"GLM 5.2\", value: 99.2, highlight: true },\n    { label: \"GPT-5.5\", value: 98.3 },\n    { label: \"Gemini 3.1 Pro\", value: 98.2 },\n    { label: \"Claude Opus 4.8\", value: 95.7 },\n    { label: \"GLM 5.1\", value: 95.3 },\n  ]}\n/>\n\n## Thinking effort, and what 1M costs to serve\n\nGLM 5.2 exposes two reasoning-effort levels — `high` for everyday speed and `max` for\nhard multi-step coding — and Z.ai positions its capability between Claude Opus 4.7 and\n4.8 at similar token spend.\n\n<Figure\n  src=\"/articles/glm-5-2/glm52-effort-tokenbudget.png\"\n  alt=\"Agentic coding performance vs token budget at High and Max effort levels.\"\n  caption=\"Effort vs token budget — Max trades more tokens for more capability on hard tasks.\"\n/>\n\nThe 1M context is not free to serve. 
The bottleneck moves from raw compute to\n**KV-cache capacity, long-context kernels, and CPU-side overhead**; the throughput\nadvantage grows with context length, but you need 8×H100-class hardware and ~1.5 TB\nfor the weights, and the API meters at 3× during peak hours.\n\n<Figure\n  src=\"/articles/glm-5-2/glm52-1m-throughput.png\"\n  alt=\"Serving throughput vs context length — GLM 5.2's advantage grows as context grows.\"\n  caption=\"The IndexShare payoff at serving time: the throughput edge widens as context approaches 1M tokens.\"\n/>\n\n## What I make of it\n\n- **The genuinely new bit is IndexShare** — a clean, well-motivated systems trick\n  (reuse what's nearly identical instead of recomputing it), with a paper that shows\n  *why* it's almost lossless. That's what turns \"1M context\" from a spec-sheet number\n  into something you can actually serve.\n- **It's the strongest open-weights model for long-horizon agentic coding**, and it's\n  MIT-licensed. That combination matters more than the benchmark deltas — you can run\n  and fine-tune it yourself.\n- **It still trails the best closed frontier models** on most hard coding and\n  reasoning axes (SWE-Bench Pro 62.1 vs Opus 4.8's 69.2), and it is heavy to\n  self-host. The bet was never \"beat Opus 4.8 everywhere\" — it's \"match the frontier on\n  long-horizon work, in the open, at a million tokens.\" On that, it largely delivers.\n\n---\n\n*Sources: the [GLM 5.2 release blog](https://z.ai/blog/glm-5.2), the GLM-5 technical\nreport ([arXiv 2602.15763](https://arxiv.org/abs/2602.15763)), and the IndexCache\nmethod paper behind IndexShare ([arXiv 2603.12201](https://arxiv.org/abs/2603.12201)).\nBenchmark figures are from Z.ai; numbers quoted as reported.*\n","readingTimeMins":7,"url":"https://ai.thesatyajit.com/articles/glm-5-2"},{"title":"Sakana Fugu: a multi-agent system as a model","description":"Sakana AI turned LLM orchestration into a single model. 
A walk through the two ICLR 2026 papers behind Fugu — TRINITY, an evolved sub-20K-parameter coordinator, and the Conductor, a 7B reinforcement-learned orchestrator — and how routing a pool of frontier models beats any one of them.","date":"2026-06-23","tags":["llm","multi-agent","orchestration","reinforcement-learning","explainer"],"draft":false,"kind":"articles","slug":"sakana-fugu","body":"No single LLM wins everywhere. One model leads on competition math, another on\nagentic coding, a third on multilingual work, and open models win on cost. The\nusual response is to pick one and absorb its weak spots. Sakana AI's bet is the\nother one: don't pick a model — *orchestrate a pool of them*, and make the\norchestration itself the model.\n\nThat product is **Sakana Fugu** (and a heavier tier, Fugu Ultra), shipped behind a\nsingle API. Underneath are two ICLR 2026 papers that attack the same problem from\nopposite ends: [TRINITY](https://arxiv.org/abs/2512.04695) *evolves* a tiny\ncoordinator over frozen models, and the\n[Conductor](https://arxiv.org/abs/2512.04388) *reinforcement-learns* a 7B model to\nwrite orchestration plans in natural language. This is a walk through both, and what\nthey add up to.\n\n## A multi-agent system as a model\n\nFugu's framing is the whole pitch: one OpenAI-compatible endpoint. You send a\nrequest to `model: fugu`; behind it a learned coordinator assembles a team from a\npool of frontier and open models, runs them over several turns, and returns one\nanswer. 
You never see the routing.\n\n<FuguPool />\n\n<Figure\n  src=\"/articles/sakana-fugu/fugu-architecture.png\"\n  alt=\"Sakana Fugu over a pool of closed and open models, with Fugu itself as one of the workers.\"\n  caption=\"Sakana's own framing of the idea: one Fugu endpoint coordinating a pool of closed and open models — and Fugu can even call itself as a worker (the recursive node on the right).\"\n/>\n\nThe pool is swappable — you can opt a model out for compliance and the coordinator\nroutes around it — and billing is a single top-tier rate rather than stacked\nper-model fees. There's even an export-controls angle: because Fugu can hit\nfrontier-level quality by coordinating open and semi-open models, you get the\ncapability without hard dependence on any one restricted vendor.\n\nBut the API is the boring part. The interesting part is that the coordinator is\n*learned*, not hand-written. There are two ways to learn it.\n\n## TRINITY: evolve a tiny coordinator\n\nTRINITY's constraint shapes everything: you cannot fine-tune GPT-5's weights, and\nmerging models with incompatible architectures doesn't work. So freeze every model\nin the pool, and learn only a tiny thing on top that decides who does what.\n\n<Figure\n  src=\"/articles/sakana-fugu/trinity-architecture.png\"\n  alt=\"TRINITY's coordination architecture: a coordinator selects an agent and a role each turn, looping Thinker, Worker, Verifier, with a worked example.\"\n  caption=\"TRINITY's coordination loop, from the paper: the coordinator picks an agent and a role each turn, with a worked Thinker → Worker → Verifier example on the right.\"\n/>\n\n### The coordinator is under 20,000 parameters\n\nA small model — Qwen3-0.6B — reads the current problem state and produces a hidden\nvector; a linear head turns that into a choice of *agent* and *role*. 
Given the\npenultimate-token hidden state $h(s)\\in\\mathbb{R}^{d}$ from the small model, a head\n$f_\\theta$ of roughly 10K parameters emits logits over $L$ agents plus 3 roles, and\nthe coordinator samples its action $a$ from\n\n$$\n\\pi_\\theta(a \\mid s) \\;\\propto\\; \\exp\\!\\big(f_\\theta(h(s))_a\\big),\n\\qquad a \\in \\{1,\\dots,L\\}\\cup\\{\\mathrm{T},\\mathrm{W},\\mathrm{V}\\}\n$$\n\nwhere $s$ is the running transcript, $\\mathrm{T},\\mathrm{W},\\mathrm{V}$ are the three\nroles below, and $\\theta$ is everything that gets trained. On top of the head, TRINITY\nadds *singular-value fine-tuning*: take an SVD of one or two of the small model's\nweight matrices and learn only the singular-value scales, keeping the orthogonal\nfactors fixed. That's a few thousand more numbers. Total trainable: **under 20K\nparameters.** The 0.6B backbone and all seven frontier and open models stay frozen.\n\n<Diagram caption=\"The entire trainable surface of TRINITY: a hidden state, a ~10K linear head, and a categorical choice over agents and roles. 
Everything below the head is frozen.\">\n  <svg viewBox=\"0 0 640 200\" role=\"img\" aria-label=\"The TRINITY coordinator: the small model maps the problem state to a hidden vector; a tiny linear head turns it into logits over agents and roles.\" style={{ width: \"100%\", height: \"auto\" }}>\n    <rect x=\"16\" y=\"74\" width=\"104\" height=\"44\" rx=\"8\" fill=\"var(--background)\" stroke=\"var(--border)\" />\n    <text x=\"68\" y=\"92\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"11\" fill=\"var(--foreground)\">problem</text>\n    <text x=\"68\" y=\"108\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"11\" fill=\"var(--foreground)\">state s</text>\n    <line x1=\"120\" y1=\"96\" x2=\"156\" y2=\"96\" stroke=\"var(--muted-foreground)\" strokeWidth=\"1.3\" />\n    <rect x=\"156\" y=\"66\" width=\"120\" height=\"60\" rx=\"8\" fill=\"oklch(0.72 0.05 260)\" opacity=\"0.25\" stroke=\"var(--border)\" />\n    <text x=\"216\" y=\"90\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"11\" fill=\"var(--foreground)\">Qwen3-0.6B</text>\n    <text x=\"216\" y=\"106\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"9\" fill=\"var(--muted-foreground)\">frozen · SLM</text>\n    <line x1=\"276\" y1=\"96\" x2=\"312\" y2=\"96\" stroke=\"var(--muted-foreground)\" strokeWidth=\"1.3\" />\n    <text x=\"294\" y=\"88\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"9\" fill=\"var(--muted-foreground)\">h(s)</text>\n    <rect x=\"312\" y=\"72\" width=\"96\" height=\"48\" rx=\"8\" fill=\"oklch(0.72 0.15 150)\" opacity=\"0.85\" />\n    <text x=\"360\" y=\"92\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"11\" fill=\"oklch(0.2 0 0)\">head fθ</text>\n    <text x=\"360\" y=\"107\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"9\" fill=\"oklch(0.2 0 0)\">~10K params</text>\n    <line x1=\"408\" y1=\"96\" x2=\"444\" y2=\"96\" stroke=\"var(--muted-foreground)\" strokeWidth=\"1.3\" />\n    <rect x=\"444\" y=\"40\" 
width=\"180\" height=\"50\" rx=\"8\" fill=\"var(--background)\" stroke=\"var(--border)\" />\n    <text x=\"534\" y=\"60\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"10\" fill=\"var(--foreground)\">L agent logits</text>\n    <text x=\"534\" y=\"76\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"9\" fill=\"var(--muted-foreground)\">GPT-5 · Claude · Gemini · …</text>\n    <rect x=\"444\" y=\"102\" width=\"180\" height=\"50\" rx=\"8\" fill=\"var(--background)\" stroke=\"var(--border)\" />\n    <text x=\"534\" y=\"122\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"10\" fill=\"var(--foreground)\">3 role logits</text>\n    <text x=\"534\" y=\"138\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"9\" fill=\"var(--muted-foreground)\">Thinker · Worker · Verifier</text>\n  </svg>\n</Diagram>\n\n<Figure\n  src=\"/articles/sakana-fugu/trinity-hidden-state-separability.png\"\n  alt=\"The small model's hidden states are linearly separable by task type (SVM) and form clear task clusters in a t-SNE plot.\"\n  caption=\"Why a ~10K linear head is enough: the small model's hidden states already separate by task type — a linear SVM classifies them almost perfectly (left), and t-SNE shows clean task clusters (right).\"\n/>\n\n### Three roles, looped until accepted\n\nEach turn, the coordinator gives the chosen agent one of three roles:\n\n- **Thinker** — plan, decompose, or critique; no direct work.\n- **Worker** — do the work: derive, compute, write code.\n- **Verifier** — check the current answer and return `ACCEPT` or `REVISE`.\n\nIt loops, accumulating a transcript, and halts the moment a Verifier accepts (or a\nfixed turn budget $K$ is exhausted):\n\n$$\n\\tau \\;=\\; \\min\\big(\\{\\, k \\le K \\;:\\; R_k = \\mathrm{V} \\ \\text{and}\\ u_k = \\mathrm{ACCEPT} \\,\\} \\cup \\{K\\}\\big)\n$$\n\nwhere $R_k$ is the role at turn $k$ and $u_k$ is the verifier's verdict; the\n$\\cup\\,\\{K\\}$ term sets $\\tau = K$ when no Verifier accepts within the budget. 
Step through\none problem — watch a wrong answer get caught and revised before it's accepted:\n\n<TrinityLoop />\n\n### Trained by evolution, not gradients\n\nWhy not just RL the head? Because the reward is binary — the final answer is right or\nwrong — and the head is tiny, so the per-parameter gradient signal is buried in\nnoise. TRINITY instead optimizes the coordinator with a *derivative-free* evolution\nstrategy, maximizing expected terminal reward:\n\n$$\nJ(\\theta) \\;=\\; \\mathbb{E}_{\\tau \\sim \\pi_\\theta}\\big[\\, R(\\tau) \\,\\big],\n\\qquad R(\\tau) \\in \\{0, 1\\}\n$$\n\nThe optimizer is separable CMA-ES: it keeps a diagonal Gaussian over the ~10K\nparameters, samples a small population each generation —\n$\\lambda = \\lceil 4 + 3\\ln n \\rceil \\approx 32$ for $n \\approx 10{,}000$ — evaluates\neach candidate's fitness by actually running rollouts, and shifts the distribution\ntoward the winners. The paper shows the coordination objective is nearly\nblock-separable, which is exactly the regime where a diagonal evolution strategy\nbeats both random search and gradient RL under a tight evaluation budget. The honest\ncost: no gradients means you pay in *environment evaluations*, and each one is a full\nmulti-turn rollout against real model APIs.\n\n### It beats every model in its pool\n\nThis is the result that matters. Transferred zero-shot to four held-out tasks, the\nevolved coordinator outscored every individual model in its pool — including GPT-5,\nGemini-2.5-Pro, and Claude-4-Sonnet. On LiveCodeBench it set a record at the time of\nsubmission:\n\n<BenchBars\n  title=\"LiveCodeBench v6 — pass@1 (%)\"\n  unit=\"\"\n  bars={[\n    { label: \"TRINITY\", value: 86.2, highlight: true },\n    { label: \"GPT-5\", value: 83.8 },\n    { label: \"Gemini-2.5-Pro\", value: 67.2 },\n    { label: \"Claude-4-Sonnet\", value: 46.5 },\n  ]}\n/>\n\nAnd the multi-turn loop earns its keep: accuracy climbs from 0.823 at two turns to\n0.863 at six. 
One cheap evolved head, a frozen pool, and the ensemble beats its best\nmember.\n\n<Figure\n  src=\"/articles/sakana-fugu/trinity-livecodebench.png\"\n  alt=\"TRINITY's LiveCodeBench result and its accuracy rising with the turn budget.\"\n  caption=\"TRINITY's own result: on LiveCodeBench it reaches 0.862 pass@1, above GPT-5 (0.838), Gemini-2.5-Pro (0.672), and Claude-4-Sonnet (0.465) — and accuracy keeps climbing with the turn budget (bottom).\"\n/>\n\n## Conductor: orchestration written in natural language\n\nThe Conductor attacks the same problem with a bigger hammer: a 7B model (Qwen2.5-7B)\ntrained with RL to *write the entire workflow itself*, in natural language.\n\n### Three lists are a workflow\n\nFor each problem the Conductor emits three synchronized lists:\n\n- `model_id` — which agent runs each step.\n- `subtasks` — a natural-language instruction for each step.\n- `access_list` — which earlier outputs each step is allowed to read.\n\nThose three lists *are* a directed graph. The `access_list` is the load-bearing\nidea: `[]` means the step sees only the original question, `[\"all\"]` means it sees\neverything produced so far, and `[0, 2]` means it sees steps 0 and 2. By choosing\naccess lists, the Conductor designs the communication topology — a chain, parallel\nbranches, a verify-and-merge — *per problem*, not from a fixed template. Flip between\nthe topologies it learns to produce:\n\n<ConductorWorkflow />\n\n### Trained with GRPO\n\nThe Conductor is trained end-to-end with GRPO. For each question it samples a group\nof $G = 64$ candidate workflows, scores each, and pushes the policy toward the\nabove-average ones using the group-normalized advantage\n\n$$\nA_i \\;=\\; \\frac{r_i - \\operatorname{mean}(r_1, \\dots, r_G)}{\\operatorname{std}(r_1, \\dots, r_G)}\n$$\n\nThe reward $r_i$ is blunt on purpose: $0$ if the three lists don't parse, $1$ if the\nfinal workflow output is correct, and $0.5$ otherwise — with no KL penalty\n($\\beta = 0$). 
The whole thing trains on just 960 problems for 200 iterations on two\nH100s. To make one Conductor work over *any* pool, they then fine-tune it with\nrandomly sampled $k$-model subsets per question, so it adapts to whatever agents you\nhand it.\n\n<Figure\n  src=\"/articles/sakana-fugu/conductor-training-emergence.png\"\n  alt=\"Conductor accuracy climbing over 200 GRPO iterations for out-of-distribution, in-distribution, and mixed agent pools.\"\n  caption=\"Coordination strategy emerging during training: accuracy climbs over 200 GRPO iterations as the Conductor learns to design better workflows — fastest when its few-shot examples are held out-of-distribution.\"\n/>\n\n### It can call itself\n\nThe Conductor may name *itself* as a worker. That spawns a fresh sub-workflow on its\nown draft — a recursive topology that turns inference depth into a tunable compute\naxis, what Sakana calls dynamic test-time scaling. Recursion buys a point or two on\nthe hardest benchmarks for under 2× the agent calls.\n\n### Results\n\nA 7B model orchestrating frontier workers beats the frontier workers. 
In a\ncontrolled run over the same pool:\n\n<BenchBars\n  title=\"LiveCodeBench — controlled, shared worker pool (%)\"\n  unit=\"\"\n  bars={[\n    { label: \"Conductor (7B)\", value: 64.3, highlight: true },\n    { label: \"GPT-5\", value: 57.5 },\n    { label: \"Gemini-2.5-Pro\", value: 40.1 },\n    { label: \"Claude-4\", value: 38.0 },\n    { label: \"MoA\", value: 38.6 },\n  ]}\n/>\n\nUnconstrained, the headline numbers were each a new high at publication and each\nabove the best single worker: **83.9% on LiveCodeBench, 87.5% on GPQA-Diamond, 93.3%\non AIME25** — reached with about 3 agent calls per question, versus 5–8 for prior\nmulti-agent methods.\n\n<Figure\n  src=\"/articles/sakana-fugu/conductor-leaderboard.png\"\n  alt=\"Conductor leading both GPQA-Diamond and LiveCodeBench against every individual worker model.\"\n  caption=\"The Conductor (highlighted) tops both GPQA-Diamond and LiveCodeBench against every individual worker in its pool — GPT-5, Gemini-2.5-Pro, DeepSeek-R1, and Claude Opus 4.\"\n/>\n\n<Figure\n  src=\"/articles/sakana-fugu/conductor-efficiency.png\"\n  alt=\"Scatter of average performance versus average number of agent calls: the Conductor is high-performance at about 3 calls, versus MoA at 8 calls.\"\n  caption=\"Performance versus cost: the Conductor sits top-left — higher accuracy than every multi-agent baseline at roughly 3 agent calls, where MoA needs 8.\"\n/>\n\n## Two routes to the same place\n\nTRINITY and the Conductor are the same idea — a learned layer that coordinates a\npool — built at opposite scales:\n\n| | TRINITY | Conductor |\n|---|---|---|\n| Learnable size | < 20K params (evolved head) | 7B params (RL-trained model) |\n| Training | derivative-free sep-CMA-ES | GRPO (reinforcement learning) |\n| Output per step | (agent, role) | a full natural-language workflow |\n| Coordination | fixed Thinker/Worker/Verifier loop | a topology it designs per problem |\n| Reads the task via | the small model's hidden state | 
reasoning in language |\n| Adapts to new pools | re-evolve (cheap) | randomized-pool fine-tune |\n\nTRINITY is the minimal, almost-free coordinator; the Conductor is the expressive one\nthat designs bespoke pipelines. Fugu uses both as its engine.\n\n## What ships: Fugu and Fugu Ultra\n\nTwo tiers. Base **Fugu** balances quality and latency over a lean pool. **Fugu\nUltra** coordinates a deeper pool over more turns for hard, high-stakes problems, and\ntakes longer for it. On Sakana's reported numbers, both match or beat the frontier:\n\n<BenchBars\n  title=\"SWE-Bench Pro (%)\"\n  unit=\"\"\n  bars={[\n    { label: \"Fugu Ultra\", value: 73.7, highlight: true },\n    { label: \"Claude Opus 4.8\", value: 69.2 },\n  ]}\n/>\n\nFugu Ultra also posts **50.0 on Humanity's Last Exam**, against baselines in the\n41–50 range. It's an OpenAI-compatible endpoint — change the base URL and key, no SDK\nmigration — and it bills at a single top-tier rate. (Not available in the EU yet,\npending GDPR; the exact routing decisions are kept proprietary.)\n\n<Figure\n  src=\"/articles/sakana-fugu/fugu-benchmarks.png\"\n  alt=\"Fugu and Fugu Ultra versus Fable 5, Gemini 3.1 Pro, GPT-5.5, and Claude Opus 4.8 across eight benchmarks.\"\n  caption=\"Fugu and Fugu Ultra (red) against Fable 5, Gemini 3.1 Pro, GPT-5.5, and Claude Opus 4.8 across eight benchmarks (Sakana). Fugu Ultra leads on SWE-Bench Pro (73.7 vs 69.2), GPQA-D, LiveCodeBench, and Humanity's Last Exam.\"\n/>\n\n## What I make of it\n\nThe honest read:\n\n- **The win is real.** An orchestration layer that beats every model it coordinates —\n  and generalizes zero-shot to unseen tasks — is a genuine result. \"Coordination\" is\n  now a trainable layer that sits *above* frontier models rather than inside one.\n- **The costs are real too.** Every model in the pool has to be available at\n  inference; you trade single-model simplicity for a fleet, and latency rises with\n  the extra turns. 
The biggest gains concentrate on long-tail reasoning and coding\n  benchmarks — on easy tasks the lift is small — and leaning on GPT-5/Claude/Gemini as\n  workers inherits their cost.\n- **The framing is the interesting part.** TRINITY argues the coordinator can be\n  almost free: under 20K evolved parameters over frozen models. The Conductor argues\n  coordination is itself a reasoning skill worth a 7B model and a full RL run. Both\n  point the same way — as individual models plateau, the next axis is how you make\n  several of them work together, and that orchestration is learnable.\n\n---\n\n*Built on Sakana AI's [TRINITY: An Evolved LLM Coordinator](https://arxiv.org/abs/2512.04695)\nand [Learning to Orchestrate Agents in Natural Language with the Conductor](https://arxiv.org/abs/2512.04388),\nboth ICLR 2026. Product: [Sakana Fugu](https://sakana.ai/fugu/).*\n","readingTimeMins":12,"url":"https://ai.thesatyajit.com/articles/sakana-fugu"},{"title":"Mixture of Experts, from scratch","description":"Why MoE lets a model carry billions of parameters but only pay for a slice of them per token — built up from one MLP, a router, and a sparse forward pass, with the gating, dispatch, and load-balancing made visible.","date":"2026-06-10","tags":["deep-learning","transformers","mixture-of-experts","explainer"],"draft":false,"kind":"articles","slug":"mixture-of-experts-from-scratch","body":"Scaling a transformer the dense way is a bad trade. Every parameter you add runs\non every token. Double the width of the feed-forward layers and you double both\ntheir capacity *and* the FLOPs they burn per token — capacity and compute are\nwelded together. You pay for the whole network on every single token, whether that\ntoken needs it or not.\n\nMixture of Experts breaks the weld. The idea is **conditional computation**: keep a\nlarge pile of parameters around, but for any given token, only run a small slice of\nthem. 
A tiny router looks at each token and picks a couple of sub-networks — the\n*experts* — to handle it. The rest sit idle for that token. You get the capacity of\na big model at the compute of a small one.\n\nHere is the whole model we'll build, end to end. The only thing that makes it a MoE\nis one swapped line — tap the **sparse MoE** block to see it:\n\n<MoeArchitecture />\n\nEverything except that one block is a standard decoder-only transformer: token and\nposition embeddings, a stack of blocks, a final norm, an LM head. Attention is\nuntouched. MoE is a surgical replacement for the feed-forward layer inside each\nblock, and nothing else. So the whole thing reduces to three questions: what is an\nexpert, who decides which experts run, and how do you run only the chosen few.\n\n## An expert is just an MLP\n\nStart with the thing we're replacing. In a normal transformer block, after\nattention, every token goes through the same two-layer MLP — expand to `4 * n_embed`,\nnonlinearity, project back. That's the feed-forward network.\n\nAn expert is exactly that MLP. Nothing more.\n\n```python\n# Imports shared by this block and the later ones.\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\n\nclass Expert(nn.Module):\n    # One expert: the usual two-layer feed-forward MLP from a transformer block.\n    def __init__(self, n_embed, dropout=0.1):\n        super().__init__()\n        self.net = nn.Sequential(\n            nn.Linear(n_embed, 4 * n_embed),   # expand\n            nn.ReLU(),\n            nn.Linear(4 * n_embed, n_embed),   # project back\n            nn.Dropout(dropout),\n        )\n\n    def forward(self, x):\n        return self.net(x)\n```\n\nThe move is to keep `num_experts` copies of this MLP instead of one. With 8 experts\nyou have 8× the feed-forward parameters. If every token went through all 8, you'd\nhave spent 8× the compute and gained nothing but a slow, fat FFN. The whole game is\nto run only `top_k` of them — say 2 — per token. So you carry 8 experts' worth of\nparameters and pay for 2.\n\nThe piece that makes that decision is the router.\n\n## Who decides? 
The router\n\nThe router's job: look at a token's vector $x$ and produce a weight for each expert,\nmostly zero, so that only a few experts actually contribute. Build it up in three\nsteps, because the naive versions teach you why the real one looks the way it does.\n\n**Attempt 1 — send every token to every expert, weighted.** A linear layer maps the\ntoken to one logit per expert, softmax over them, take a weighted sum of all expert\noutputs:\n\n$$\ng(x) = \\mathrm{softmax}(x W_g), \\qquad y = \\sum_{i=1}^{N} g(x)_i \\, E_i(x)\n$$\n\nHere $W_g$ is the router's weight matrix (`n_embed × num_experts`) and $E_i$ is the\n$i$-th expert. This is differentiable and trains fine — but it's *dense*. Every\nexpert runs on every token. We've built an expensive ensemble, not a sparse model.\n\n**Attempt 2 — hard pick the single best expert.** Take $\\arg\\max$ of the logits, run\nonly that expert. Now it's sparse and cheap. But $\\arg\\max$ is piecewise constant, so\nits gradient with respect to the logits is zero almost everywhere: no learning signal\nreaches the router, and it never gets a reason to try an expert other than the one it\nalready picks. Routing freezes. Dead end.\n\n**Attempt 3 — top-$k$ softmax.** Keep the largest $k$ logits, set the rest to\n$-\\infty$, *then* softmax. The $-\\infty$ entries become exactly 0, so only $k$ experts\ncontribute — sparse like attempt 2 — but the softmax over the survivors is smooth, so\ngradients flow to the router's logits for all $k$ chosen experts. This is the real router:\n\n$$\ng(x) = \\mathrm{softmax}\\big(\\mathrm{KeepTopK}(x W_g,\\, k)\\big), \\qquad\n\\mathrm{KeepTopK}(v, k)_i = \\begin{cases} v_i & v_i \\text{ in top } k \\\\ -\\infty & \\text{otherwise} \\end{cases}\n$$\n\nWith $k = 2$ and $N = 8$, six of the eight gate weights are zero for every token, and\nthe two survivors sum to 1. Watch one token go through it — logits, keep the top two,\nsoftmax to gates, combine:\n\n<MoeRouter />\n\nThat stepper is the entire routing mechanism. 
The bars are the per-expert logits;\ntop-2 keeps two; softmax turns them into weights; the output is just those two\nexperts' outputs scaled by their gates and added.\n\n## Why the noise\n\nThere's one addition that the bare top-$k$ router needs in practice: noise. Before\npicking the top $k$, add a learned, per-expert amount of Gaussian noise to the\nlogits:\n\n$$\nH(x)_i = (x W_g)_i + \\varepsilon_i \\cdot \\mathrm{softplus}\\big((x W_{\\text{noise}})_i\\big), \\qquad \\varepsilon_i \\sim \\mathcal{N}(0, 1)\n$$\n\nThe noise scale is itself learned (a second linear layer $W_{\\text{noise}}$, passed\nthrough `softplus` to keep it positive). Why bother? Because early in training the\nrouter is random, and whichever experts happen to win first get all the gradient and\npull ahead — a rich-get-richer collapse. The noise jitters the top-$k$ selection so\nborderline experts occasionally win, get some tokens, and get a chance to become\nuseful. It's exploration, baked into the forward pass. Hit *resample noise* in the\nwidget above and you can watch which two experts win flip.\n\nIn code the router is four lines of real work:\n\n```python\nclass NoisyTopKRouter(nn.Module):\n    def __init__(self, n_embed, num_experts, top_k):\n        super().__init__()\n        self.top_k = top_k\n        self.route = nn.Linear(n_embed, num_experts)   # gate logits\n        self.noise = nn.Linear(n_embed, num_experts)   # per-expert noise scale\n\n    def forward(self, x):\n        logits = self.route(x)\n        noisy = logits + torch.randn_like(logits) * F.softplus(self.noise(x))\n\n        top_logits, idx = noisy.topk(self.top_k, dim=-1)     # the chosen experts\n        sparse = torch.full_like(noisy, float(\"-inf\"))\n        sparse.scatter_(-1, idx, top_logits)                 # keep top-k, rest -inf\n        return F.softmax(sparse, dim=-1), idx\n```\n\n`scatter_` is the one trick worth pausing on: it writes the kept logits back into a\ntensor of `-inf`, at the indices the `topk` 
chose. After the softmax those `-inf`\nslots are 0. The router returns the gate weights and the chosen indices — the\nindices tell the next stage which experts to actually run.\n\n## The sparse forward pass\n\nNow the part that earns the word *sparse*. We have gate weights and, for each token,\nthe indices of its top-$k$ experts. We want to run each expert on only the tokens\nrouted to it, scale by the gate, and add the result back.\n\nThe straightforward way: loop over experts, and for each one, build a boolean mask\nthat selects the tokens that picked it.\n\n```python\nclass SparseMoE(nn.Module):\n    def __init__(self, n_embed, num_experts, top_k):\n        super().__init__()\n        self.router = NoisyTopKRouter(n_embed, num_experts, top_k)\n        self.experts = nn.ModuleList([Expert(n_embed) for _ in range(num_experts)])\n\n    def forward(self, x):\n        gates, idx = self.router(x)            # (B,T,N), (B,T,k)\n        out = torch.zeros_like(x)\n\n        flat_x = x.view(-1, x.size(-1))        # (B*T, C)\n        flat_gates = gates.view(-1, gates.size(-1))\n        flat_out = out.view(-1, x.size(-1))\n\n        for i, expert in enumerate(self.experts):\n            mask = (idx == i).any(dim=-1).view(-1)   # tokens routed to expert i\n            if mask.any():\n                y = expert(flat_x[mask])             # run on its tokens only\n                flat_out[mask] += flat_gates[mask, i:i+1] * y\n        return out\n```\n\nThe `mask = (idx == i).any(dim=-1)` line is the dispatch: it's true for exactly the\ntokens that have expert `i` somewhere in their top-$k$. We gather those tokens, run\nthe expert once on the batch of them, scale each by its gate weight, and scatter-add\nback into the output. A token routed to experts 2 and 5 gets contributions from both\nloop iterations, summed — which is exactly $\\sum_i g(x)_i E_i(x)$ with all but $k$\nterms zero.\n\nPicture the dispatch over a short sequence. 
Each token connects to just two of the\neight experts, so most of the grid stays dark — that darkness is the compute you're\n*not* spending:\n\n<MoeRouting />\n\nThe bars underneath are the per-expert load: how many tokens each expert handled.\nNotice it's already uneven — some experts attract more traffic than others. Hold that\nthought; it's the central problem with MoE.\n\n<Callout type=\"note\">\n  This masked loop is the *teaching* implementation. It's correct but it runs every\n  expert as a separate kernel and materialises a mask per expert. Production MoE\n  instead sorts/permutes tokens by expert and does one grouped matmul, and in the\n  distributed case each expert lives on a different GPU and tokens are shipped to\n  them (expert parallelism). Same math, very different plumbing.\n</Callout>\n\n## The one line that changes\n\nWith the experts and the router in hand, dropping MoE into a transformer block is\nanticlimactic — which is the point. A standard block is `attention → FFN`, each\nwrapped in a layer-norm and a residual. MoE swaps the FFN for the `SparseMoE` module\nand touches nothing else:\n\n```python\nclass Block(nn.Module):\n    def __init__(self, n_embed, n_head, num_experts, top_k, block_size):\n        super().__init__()\n        self.sa = MultiHeadAttention(n_head, n_embed, block_size)\n        self.smoe = SparseMoE(n_embed, num_experts, top_k)   # was: FeedForward(n_embed)\n        self.ln1 = nn.LayerNorm(n_embed)\n        self.ln2 = nn.LayerNorm(n_embed)\n\n    def forward(self, x):\n        x = x + self.sa(self.ln1(x))      # attention — unchanged\n        x = x + self.smoe(self.ln2(x))    # MoE replaces the feed-forward layer\n        return x\n```\n\nThat's the whole architectural delta. One `FeedForward` becomes one `SparseMoE`:\n\n<Diagram caption=\"Same slot in the block. 
Dense runs one MLP on every token; sparse runs a router plus the two chosen experts.\">\n  <svg viewBox=\"0 0 640 250\" role=\"img\" aria-label=\"A dense feed-forward layer applies one MLP to every token; the sparse MoE layer routes each token to two of eight experts.\" style={{ width: \"100%\", height: \"auto\" }}>\n    <defs>\n      <marker id=\"moe-arrow\" viewBox=\"0 0 10 10\" refX=\"8\" refY=\"5\" markerWidth=\"6\" markerHeight=\"6\" orient=\"auto-start-reverse\">\n        <path d=\"M0,0 L10,5 L0,10 z\" fill=\"var(--muted-foreground)\" />\n      </marker>\n    </defs>\n\n    {/* dense side */}\n    <text x=\"150\" y=\"24\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"13\" fill=\"var(--foreground)\">dense FFN</text>\n    <rect x=\"110\" y=\"44\" width=\"80\" height=\"22\" rx=\"5\" fill=\"var(--background)\" stroke=\"var(--border)\" />\n    <text x=\"150\" y=\"59\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"10\" fill=\"var(--foreground)\">tokens</text>\n    <line x1=\"150\" y1=\"66\" x2=\"150\" y2=\"92\" stroke=\"var(--muted-foreground)\" strokeWidth=\"1.2\" markerEnd=\"url(#moe-arrow)\" />\n    <rect x=\"95\" y=\"94\" width=\"110\" height=\"50\" rx=\"8\" fill=\"oklch(0.72 0.13 250)\" opacity=\"0.9\" />\n    <text x=\"150\" y=\"116\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"11\" fill=\"oklch(0.2 0 0)\">one MLP</text>\n    <text x=\"150\" y=\"132\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"9\" fill=\"oklch(0.2 0 0)\">every token</text>\n    <line x1=\"150\" y1=\"144\" x2=\"150\" y2=\"170\" stroke=\"var(--muted-foreground)\" strokeWidth=\"1.2\" markerEnd=\"url(#moe-arrow)\" />\n    <text x=\"150\" y=\"190\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"10\" fill=\"var(--muted-foreground)\">100% of params, every token</text>\n\n    {/* divider */}\n    <line x1=\"320\" y1=\"30\" x2=\"320\" y2=\"210\" stroke=\"var(--border)\" strokeDasharray=\"3 4\" />\n\n    {/* sparse side */}\n    <text 
x=\"490\" y=\"24\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"13\" fill=\"var(--foreground)\">sparse MoE</text>\n    <rect x=\"450\" y=\"44\" width=\"80\" height=\"22\" rx=\"5\" fill=\"var(--background)\" stroke=\"var(--border)\" />\n    <text x=\"490\" y=\"59\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"10\" fill=\"var(--foreground)\">tokens</text>\n    <line x1=\"490\" y1=\"66\" x2=\"490\" y2=\"84\" stroke=\"var(--muted-foreground)\" strokeWidth=\"1.2\" markerEnd=\"url(#moe-arrow)\" />\n    <rect x=\"448\" y=\"86\" width=\"84\" height=\"20\" rx=\"5\" fill=\"var(--background)\" stroke=\"var(--foreground)\" strokeOpacity=\"0.4\" />\n    <text x=\"490\" y=\"100\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"10\" fill=\"var(--foreground)\">router</text>\n\n    {/* 8 experts, 2 lit */}\n    {[0,1,2,3,4,5,6,7].map((i) => {\n      const x = 372 + i * 30\n      const lit = i === 2 || i === 5\n      return (\n        <g key={i}>\n          <line x1=\"490\" y1=\"106\" x2={x + 11} y2=\"150\" stroke={`oklch(0.72 0.13 ${(i*45)%360})`} strokeWidth={lit ? 2 : 1} opacity={lit ? 0.9 : 0.12} />\n          <rect x={x} y=\"150\" width=\"22\" height=\"34\" rx=\"4\" fill={`oklch(0.72 0.13 ${(i*45)%360})`} opacity={lit ? 1 : 0.16} />\n        </g>\n      )\n    })}\n    <text x=\"490\" y=\"204\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"10\" fill=\"var(--muted-foreground)\">8 experts stored · 2 run</text>\n  </svg>\n</Diagram>\n\nStack eight of these blocks, add embeddings and an LM head, and you have the model\nfrom the top of the page. Train it exactly like a dense transformer — cross-entropy\non next-token prediction. The router learns its weights from the same gradient as\neverything else. 
No special routing supervision; it figures out a useful assignment\non its own.\n\n## Run it yourself\n\nHere is everything above assembled into one file — a char-level model that trains on\ntiny Shakespeare in about 200 lines, with no dependency past PyTorch. The `Expert`,\n`NoisyTopKRouter`, and `SparseMoE` are exactly the pieces we just built; the rest is\nthe smallest transformer that can hold them. Copy it, run `python tinymoe.py`, and\nwatch the loss come down.\n\n```python\n\"\"\"\ntinymoe — a tiny Mixture-of-Experts language model in one file.\nChar-level, trains on tiny Shakespeare. ~4.5M params, ~1.4M active per token.\nRuns on CPU; much faster on a GPU.\n\n    python tinymoe.py        # download data, train, then sample\n\nIt's a small decoder-only transformer where the feed-forward layer of every\nblock is replaced by a sparse mixture of experts with noisy top-k routing.\n\"\"\"\nimport os\nimport urllib.request\n\nimport torch\nimport torch.nn as nn\nfrom torch.nn import functional as F\n\n# --------------------------------------------------------------------- config\nbatch_size = 32          # sequences per step\nblock_size = 128         # context length (chars)\nn_embed = 128            # embedding / residual width\nn_head = 4               # attention heads\nn_layer = 4              # transformer blocks\nnum_experts = 8          # experts per MoE layer\ntop_k = 2                # experts actually run per token\ndropout = 0.1\nlearning_rate = 3e-4\nmax_iters = 5000\neval_interval = 500\neval_iters = 100\ndevice = \"cuda\" if torch.cuda.is_available() else \"cpu\"\ntorch.manual_seed(1337)\n\n# ----------------------------------------------------------- data (shakespeare)\nif not os.path.exists(\"input.txt\"):\n    url = (\"https://raw.githubusercontent.com/karpathy/char-rnn/\"\n           \"master/data/tinyshakespeare/input.txt\")\n    urllib.request.urlretrieve(url, \"input.txt\")\ntext = open(\"input.txt\", encoding=\"utf-8\").read()\n\nchars = 
sorted(set(text))\nvocab_size = len(chars)\nstoi = {c: i for i, c in enumerate(chars)}\nitos = {i: c for i, c in enumerate(chars)}\nencode = lambda s: [stoi[c] for c in s]\ndecode = lambda t: \"\".join(itos[i] for i in t)\n\ndata = torch.tensor(encode(text), dtype=torch.long)\nn = int(0.9 * len(data))\ntrain_data, val_data = data[:n], data[n:]\n\n\ndef get_batch(split):\n    d = train_data if split == \"train\" else val_data\n    ix = torch.randint(len(d) - block_size, (batch_size,))\n    x = torch.stack([d[i:i + block_size] for i in ix])\n    y = torch.stack([d[i + 1:i + block_size + 1] for i in ix])\n    return x.to(device), y.to(device)\n\n\n# ------------------------------------------------------------------- attention\nclass Head(nn.Module):\n    def __init__(self, head_size):\n        super().__init__()\n        self.key = nn.Linear(n_embed, head_size, bias=False)\n        self.query = nn.Linear(n_embed, head_size, bias=False)\n        self.value = nn.Linear(n_embed, head_size, bias=False)\n        self.register_buffer(\"tril\", torch.tril(torch.ones(block_size, block_size)))\n        self.drop = nn.Dropout(dropout)\n\n    def forward(self, x):\n        B, T, C = x.shape\n        k, q = self.key(x), self.query(x)\n        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5\n        wei = wei.masked_fill(self.tril[:T, :T] == 0, float(\"-inf\"))\n        wei = self.drop(F.softmax(wei, dim=-1))\n        return wei @ self.value(x)\n\n\nclass MultiHeadAttention(nn.Module):\n    def __init__(self, n_head, head_size):\n        super().__init__()\n        self.heads = nn.ModuleList([Head(head_size) for _ in range(n_head)])\n        self.proj = nn.Linear(n_embed, n_embed)\n        self.drop = nn.Dropout(dropout)\n\n    def forward(self, x):\n        out = torch.cat([h(x) for h in self.heads], dim=-1)\n        return self.drop(self.proj(out))\n\n\n# --------------------------------------------------------- mixture of experts\nclass Expert(nn.Module):\n    \"\"\"One 
expert = one MLP. Same shape as a normal transformer FFN.\"\"\"\n\n    def __init__(self):\n        super().__init__()\n        self.net = nn.Sequential(\n            nn.Linear(n_embed, 4 * n_embed), nn.ReLU(),\n            nn.Linear(4 * n_embed, n_embed), nn.Dropout(dropout),\n        )\n\n    def forward(self, x):\n        return self.net(x)\n\n\nclass NoisyTopKRouter(nn.Module):\n    \"\"\"Score experts per token, add learned noise, keep top-k, softmax.\"\"\"\n\n    def __init__(self):\n        super().__init__()\n        self.route = nn.Linear(n_embed, num_experts)\n        self.noise = nn.Linear(n_embed, num_experts)\n\n    def forward(self, x):\n        logits = self.route(x)\n        noisy = logits + torch.randn_like(logits) * F.softplus(self.noise(x))\n        top_logits, idx = noisy.topk(top_k, dim=-1)\n        sparse = torch.full_like(noisy, float(\"-inf\")).scatter(-1, idx, top_logits)\n        return F.softmax(sparse, dim=-1), idx\n\n\nclass SparseMoE(nn.Module):\n    \"\"\"Run only the top-k experts per token; combine them by gate weight.\"\"\"\n\n    def __init__(self):\n        super().__init__()\n        self.router = NoisyTopKRouter()\n        self.experts = nn.ModuleList([Expert() for _ in range(num_experts)])\n\n    def forward(self, x):\n        gates, idx = self.router(x)                  # (B,T,E), (B,T,k)\n        out = torch.zeros_like(x)\n        flat_x = x.reshape(-1, x.size(-1))\n        flat_gates = gates.reshape(-1, gates.size(-1))\n        flat_out = out.reshape(-1, x.size(-1))\n        for i, expert in enumerate(self.experts):\n            mask = (idx == i).any(dim=-1).reshape(-1)  # tokens routed to expert i\n            if mask.any():\n                flat_out[mask] += flat_gates[mask, i:i + 1] * expert(flat_x[mask])\n        return out\n\n\n# ------------------------------------------------------------- block + model\nclass Block(nn.Module):\n    def __init__(self):\n        super().__init__()\n        self.sa = 
MultiHeadAttention(n_head, n_embed // n_head)\n        self.smoe = SparseMoE()                      # <- replaces the FFN\n        self.ln1 = nn.LayerNorm(n_embed)\n        self.ln2 = nn.LayerNorm(n_embed)\n\n    def forward(self, x):\n        x = x + self.sa(self.ln1(x))\n        x = x + self.smoe(self.ln2(x))\n        return x\n\n\nclass MoELanguageModel(nn.Module):\n    def __init__(self):\n        super().__init__()\n        self.tok_emb = nn.Embedding(vocab_size, n_embed)\n        self.pos_emb = nn.Embedding(block_size, n_embed)\n        self.blocks = nn.Sequential(*[Block() for _ in range(n_layer)])\n        self.ln_f = nn.LayerNorm(n_embed)\n        self.head = nn.Linear(n_embed, vocab_size)\n\n    def forward(self, idx, targets=None):\n        B, T = idx.shape\n        x = self.tok_emb(idx) + self.pos_emb(torch.arange(T, device=idx.device))\n        x = self.ln_f(self.blocks(x))\n        logits = self.head(x)\n        loss = None\n        if targets is not None:\n            loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))\n        return logits, loss\n\n    @torch.no_grad()\n    def generate(self, idx, max_new_tokens):\n        for _ in range(max_new_tokens):\n            logits, _ = self(idx[:, -block_size:])\n            probs = F.softmax(logits[:, -1, :], dim=-1)\n            idx = torch.cat([idx, torch.multinomial(probs, 1)], dim=1)\n        return idx\n\n\n# --------------------------------------------------------------------- train\n@torch.no_grad()\ndef estimate_loss(model):\n    out = {}\n    model.eval()\n    for split in (\"train\", \"val\"):\n        losses = torch.zeros(eval_iters)\n        for k in range(eval_iters):\n            x, y = get_batch(split)\n            _, losses[k] = model(x, y)\n        out[split] = losses.mean().item()\n    model.train()\n    return out\n\n\nmodel = MoELanguageModel().to(device)\ntotal = sum(p.numel() for p in model.parameters())\nprint(f\"{total / 1e6:.2f}M params on {device}\")\nopt = 
torch.optim.AdamW(model.parameters(), lr=learning_rate)\n\nfor it in range(max_iters):\n    if it % eval_interval == 0:\n        l = estimate_loss(model)\n        print(f\"step {it:5d} | train {l['train']:.3f} | val {l['val']:.3f}\")\n    x, y = get_batch(\"train\")\n    _, loss = model(x, y)\n    opt.zero_grad(set_to_none=True)\n    loss.backward()\n    opt.step()\n\n# -------------------------------------------------------------------- sample\nmodel.eval()  # turn dropout off for sampling (the router noise stays on)\nctx = torch.zeros((1, 1), dtype=torch.long, device=device)\nprint(decode(model.generate(ctx, 500)[0].tolist()))\n```\n\nAt the default size it prints `4.52M params` — but only **~1.4M of them run on any\ngiven token**, because 6 of every 8 experts sit out. That's the parameter-vs-compute\nsplit in miniature. Raise `num_experts` and the total climbs while the active count\nbarely moves; lower `top_k` to 1 and it gets sparser still. The same lever Mixtral\npulls, in a model you can train on a laptop.\n\nOne honesty note: this minimal version relies entirely on the routing noise to keep\nexperts balanced — there's no auxiliary loss. At toy scale it trains fine. Scale it\nup and a few experts quietly take over, which is the next problem.\n\n## The catch: load balancing\n\nMoE has one failure mode that dominates everything else, and you saw it forming in the\ndispatch map: **expert collapse**. Routing is a positive feedback loop. An expert that\nwins a few tokens early gets gradient, improves, and so becomes the router's favourite\nfor even more tokens. Meanwhile the experts that lost early get no tokens, no\ngradient, and never improve. Left alone, a handful of experts end up doing all the\nwork and the rest are dead weight — you're paying to store 8 experts and effectively\nrunning 2 or 3.\n\n<MoeLoadBalance />\n\nThe noise we added earlier is the first defense — it keeps the routing from hardening\ntoo fast. 
The second, used in every serious MoE, is an **auxiliary load-balancing\nloss**: a term added to the training objective that measures how lopsided the routing\nis across a batch and penalises imbalance, nudging the router toward spreading tokens\nevenly. It's a soft constraint — you're not forcing exactly equal load, just paying a\ncost for collapse. Tuning its weight is part of the unglamorous reality of training a\nMoE: too little and experts collapse, too much and you fight the router's ability to\nactually specialise.\n\nThis is the honest tradeoff. A dense FFN has no routing, no balance to maintain, no\nextra loss to tune. MoE buys you cheap capacity and hands you a load-balancing problem\nin return.\n\n## What the experts actually learn\n\nIt's tempting to picture expert 3 as \"the Python expert\" and expert 5 as \"the French\nexpert.\" That's mostly not what happens. When the Mixtral authors inspected their\nrouter, they found no clean topic or domain specialization — experts don't map to\nsubjects. What the router learns is lower-level and more syntactic: routing is\nstrongly correlated across consecutive tokens, and individual experts lean toward\nthings like indentation, punctuation, or particular token shapes. The specialization\nis real, but it's structural, not semantic, and not especially interpretable.\n\"Experts\" is a useful name, not a promise that each one becomes a tidy domain\nspecialist.\n\n## Beyond the basic router\n\nThe router we built is *token-choice*: each token picks its experts. 
Three variations\nare worth knowing, because they're all different answers to the same load-balancing\nproblem:\n\n- **Expert-choice routing** flips the selection — each expert picks its top tokens.\n  Load is balanced by construction (every expert takes a fixed budget), at the cost of\n  some tokens getting chosen by many experts and others by none.\n- **Shared experts** (as in DeepSeek-MoE) keep one or two experts always on for every\n  token, so the routed experts don't burn capacity re-learning common patterns and can\n  specialize at the margin.\n- **Capacity and token dropping** — in batched or distributed training each expert gets\n  a fixed number of slots per batch; tokens that overflow their chosen expert are\n  dropped and pass through on the residual alone. A blunt cap that keeps the per-expert\n  matmuls a fixed, rectangular shape.\n\nSame tradeoff surface — cheap capacity versus keeping every expert fed — approached\nfrom different sides.\n\n## What you actually buy\n\nWhy put up with the routing machinery? Because the parameter-vs-compute decoupling is\nreal and large. Mixtral 8×7B is the clean reference: 8 experts per layer, top-2\nrouting — the exact configuration we just built. It holds **47B parameters total**,\nbut because only 2 of 8 experts run per token, a forward pass touches **about 13B\nactive parameters**. It runs at the speed and memory-bandwidth cost of a ~13B dense\nmodel while matching or beating a 70B dense one across benchmarks.\n\nThat's the pitch in one line: **capacity you don't pay for on every token.** The\nparameters are the model's knowledge; the active fraction is what each token can\nafford to consult.\n\nThere's a cost on the other side of the ledger, and it's worth stating plainly. MoE\ntrades **compute for memory**. 
Only $k$ experts run, but *all* of them have to be\nresident — you still hold 47B parameters in memory even though each token uses 13B.\nAnd at batch scale the router scatters tokens across all experts, so the bandwidth and\nthe all-to-all communication of shipping tokens to the right expert (across GPUs)\nbecome the real bottleneck, not the matmuls. MoE doesn't make models free. It moves\nthe cost from FLOPs, which you pay per token, to memory, which you pay once, and to\ncommunication bandwidth, which grows with the batch. For inference-bound serving at\nscale, that's usually the trade you want.\n\n## The whole thing, in one breath\n\nStrip away the engineering and MoE is small: an expert is the FFN you already had;\nkeep several of them; a one-layer router scores them per token; keep the top two,\nsoftmax for weights, run only those two, add a little noise so routing explores and a\nbalancing loss so it doesn't collapse. One line in the transformer block changes. In\nreturn, the model's parameter count and its per-token compute stop being the same\nnumber — and that decoupling is the entire reason the largest models you can name are\nbuilt this way.\n","readingTimeMins":18,"url":"https://ai.thesatyajit.com/articles/mixture-of-experts-from-scratch"},{"title":"Coroutines in C, intuitively","description":"How to pause a function in the middle and resume it later — using nothing but a switch statement and __LINE__. An intuitive tour of Simon Tatham's classic trick, with a step-through animation.","date":"2026-06-09","tags":["c","coroutines","systems","explainer"],"draft":false,"kind":"articles","slug":"coroutines-in-c","body":"Some functions want to be *callers*. Some want to be *callees*. The trouble starts\nwhen two pieces of code both want to be the caller.\n\nPicture a decompressor that walks a byte stream and emits one character at a time,\nand a parser that consumes characters one at a time. Each is most natural as a loop\nthat *drives* the other:\n\n<Diagram caption=\"Both want to be the loop. 
Only one can be — the other must invert into a state machine.\">\n  <svg\n    viewBox=\"0 0 600 220\"\n    role=\"img\"\n    aria-label=\"Two functions, a decompressor and a parser, each naturally a loop that wants to drive the other.\"\n    style={{ width: \"100%\", height: \"auto\", color: \"var(--foreground)\" }}\n  >\n    <defs>\n      <marker id=\"cf-arrow\" viewBox=\"0 0 10 10\" refX=\"8\" refY=\"5\" markerWidth=\"6\" markerHeight=\"6\" orient=\"auto-start-reverse\">\n        <path d=\"M0,0 L10,5 L0,10 z\" fill=\"currentColor\" />\n      </marker>\n    </defs>\n\n    {/* left: decompressor loop */}\n    <rect x=\"20\" y=\"50\" width=\"200\" height=\"120\" rx=\"10\" fill=\"none\" stroke=\"currentColor\" strokeOpacity=\"0.5\" />\n    <text x=\"120\" y=\"78\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"14\" fill=\"currentColor\">decompressor</text>\n    <text x=\"120\" y=\"98\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"11\" fill=\"currentColor\" opacity=\"0.6\">while (bytes) emit(c)</text>\n    {/* loop arrow */}\n    <path d=\"M 92 120 A 28 28 0 1 1 148 120\" fill=\"none\" stroke=\"currentColor\" strokeWidth=\"1.5\" markerEnd=\"url(#cf-arrow)\" />\n    <text x=\"120\" y=\"128\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"10\" fill=\"currentColor\" opacity=\"0.6\">loop</text>\n    <text x=\"120\" y=\"190\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"11\" fill=\"currentColor\" opacity=\"0.55\">wants to push</text>\n\n    {/* right: parser loop */}\n    <rect x=\"380\" y=\"50\" width=\"200\" height=\"120\" rx=\"10\" fill=\"none\" stroke=\"currentColor\" strokeOpacity=\"0.5\" />\n    <text x=\"480\" y=\"78\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"14\" fill=\"currentColor\">parser</text>\n    <text x=\"480\" y=\"98\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"11\" fill=\"currentColor\" opacity=\"0.6\">while (chars) use(c)</text>\n    <path d=\"M 452 120 A 28 28 0 1 1 508 
120\" fill=\"none\" stroke=\"currentColor\" strokeWidth=\"1.5\" markerEnd=\"url(#cf-arrow)\" />\n    <text x=\"480\" y=\"128\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"10\" fill=\"currentColor\" opacity=\"0.6\">loop</text>\n    <text x=\"480\" y=\"190\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"11\" fill=\"currentColor\" opacity=\"0.55\">wants to pull</text>\n\n    {/* the clash in the middle */}\n    <line x1=\"232\" y1=\"104\" x2=\"368\" y2=\"104\" stroke=\"currentColor\" strokeWidth=\"1.5\" markerEnd=\"url(#cf-arrow)\" opacity=\"0.8\" />\n    <line x1=\"368\" y1=\"124\" x2=\"232\" y2=\"124\" stroke=\"currentColor\" strokeWidth=\"1.5\" markerEnd=\"url(#cf-arrow)\" opacity=\"0.8\" />\n    <text x=\"300\" y=\"150\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"22\" fill=\"currentColor\" fontWeight=\"bold\">?</text>\n    <text x=\"300\" y=\"172\" textAnchor=\"middle\" fontFamily=\"monospace\" fontSize=\"10\" fill=\"currentColor\" opacity=\"0.55\">who calls whom</text>\n  </svg>\n</Diagram>\n\nWhichever one you make a *callee*, you have to turn inside-out: rip out its loop,\nhoist its locals into `static` state, and reconstruct \"where was I?\" by hand every\ntime it's called. The algorithm disappears into a state machine.\n\nA **coroutine** is the escape hatch: a function you can `return` from *in the middle*\nand later resume *exactly where it left off*, locals and loop position intact. C\ndoesn't have them. But — as Simon Tatham showed in his\n[classic note](https://www.chiark.greenend.org.uk/~sgtatham/coroutines.html) — you\ncan fake them with a `switch` statement and one preprocessor macro.\n\n## The painful version first\n\nHere's that decompressor rewritten as a callee the honest way — a hand-rolled state\nmachine. 
It works, and it's miserable:\n\n```c\nint decompressor(void) {\n  static int state = 0, len, c;\n  switch (state) {\n    case 0:                 /* fresh start */\n      while (1) {\n        c = getchar();\n        if (c == EOF) return EOF;\n        if (c == 0xFF) {    /* run-length escape */\n          len = getchar();\n          c = getchar();\n          while (len--) {\n            state = 1; return c;   /* <-- emit, remember we're here */\n            case 1: ;              /* <-- ...come back to here */\n          }\n        } else {\n          state = 2; return c;\n          case 2: ;\n        }\n      }\n  }\n}\n```\n\nEvery `return` needs a unique number, a matching `case`, and an assignment to\n`state`. Add a branch and you renumber everything. The bookkeeping *is* the bug\nsurface.\n\n<Callout type=\"note\">\n  Notice the `case 1:` sitting **inside** the `while` loop, underneath a `switch`\n  that's outside it. That's legal C — `case` labels can live in any sub-block of a\n  `switch`. This is the same quirk that powers Duff's device, and it's the whole\n  trick.\n</Callout>\n\n## The insight: let `__LINE__` be the state\n\nThe numbers are pure noise. We never *read* them — we only need each `return` to\nhave a label unique to its position, and a way to jump back to it. The C preprocessor\nalready hands out a unique number per source line: `__LINE__`.\n\nSo: on the way out, save `__LINE__`. On the way back in, `switch` on the saved value\nand let a `case __LINE__:` right after the `return` catch it. Three macros:\n\n```c\n#define crBegin     static int state = 0; switch (state) { case 0:\n#define crReturn(x) do { state = __LINE__; return x; \\\n                         case __LINE__: ; } while (0)\n#define crFinish    }\n```\n\nThat's the entire idea. `crBegin` opens a `switch` on the saved state. `crReturn`\nstamps the current line into `state`, returns, and drops a `case` label at that exact\nline so the next call resumes one statement later. 
`crFinish` closes the brace.\n\n## Watch it run\n\nA three-value generator — `next()` returns 0, 1, 2, then -1 — makes the control flow\nvisible. Step through it: watch `state` get stamped with a line number on the way out,\nand the `switch` teleport straight back into the middle of the `for` loop on the way\nback in.\n\n<CoroutineStepper />\n\nThe magic moment is the jump from `switch (state)` to `case __LINE__:` *inside* the\nloop. The function never \"starts over\" — it lands back exactly where it returned, with\n`i` right where it was.\n\n## How the macros expand\n\nIt reads like ordinary code, but here's what the preprocessor actually produces, one\nlayer at a time:\n\n<StepThrough titles={[\"you write\", \"expand crBegin\", \"expand crReturn\", \"what runs\"]}>\n\nYou write the coroutine in its natural, loop-shaped form:\n\n```c\nint next(void) {\n  static int i;\n  crBegin;\n  for (i = 0; i < 3; i++)\n    crReturn(i);\n  crFinish;\n  return -1;   /* loop finished: nothing left to yield */\n}\n```\n\n`crBegin` becomes a `switch` on the saved state, entered at `case 0` on the first call:\n\n```c\nint next(void) {\n  static int i;\n  static int state = 0; switch (state) { case 0:\n  for (i = 0; i < 3; i++)\n    crReturn(i);\n  }\n  return -1;\n}\n```\n\n`crReturn(i)` stamps the line number, returns, and leaves a `case` label with that same\nnumber right after the `return`:\n\n```c\nfor (i = 0; i < 3; i++) {\n  state = __LINE__; return i;\n  case __LINE__: ;\n}\n```\n\nSo the next call jumps from `switch (state)` *directly* to that `case` — back inside\nthe `for` loop, with `i` preserved. No re-entry, no restart:\n\n```c\nswitch (state) {     /* state == that line number */\n  case 0: ...\n  case 17: ;         /* <-- lands here, mid-loop */\n}\n```\n\n</StepThrough>\n\n## Where it bites\n\nThis is a beautiful hack, and like every beautiful hack it has sharp edges. 
Tatham is\ncandid about them, and you should be too:\n\n<Callout type=\"warn\">\n  **Only `static` locals survive.** A normal `auto` variable is undefined after a\n  `crReturn` — its storage isn't preserved across the return. Loop counters and any\n  state you care about must be `static`. **One `crReturn` per line** (two share a\n  `__LINE__` and collide). And you **can't put a `crReturn` inside a `switch` of your\n  own**: the inner `switch` would capture the `case` label meant for the coroutine.\n</Callout>\n\nThe `static` rule hides a worse problem: `static` means *one shared instance*. Two\ncallers can't run the same coroutine independently — they'd stomp each other's `state`\nand `i`. Fine for a single global decompressor; fatal for anything reentrant or\nthreaded.\n\n## Making it reentrant\n\nThe fix is to stop using `static` and instead thread all the state through a context\nstruct the caller owns. Every \"serious\" local becomes a field; the macros read and\nwrite `ctx->state` instead of a function-local `static`:\n\n```c\nstruct coro {\n  int state;\n  int i, len, c;   /* everything that must survive a yield */\n};\n\n#define crBegin(ctx)     switch ((ctx)->state) { case 0:\n#define crReturn(ctx, x) do { (ctx)->state = __LINE__; return x; \\\n                              case __LINE__: ; } while (0)\n#define crFinish         }\n\nint next(struct coro *ctx) {\n  crBegin(ctx);\n  for (ctx->i = 0; ctx->i < 3; ctx->i++)\n    crReturn(ctx, ctx->i);\n  crFinish;\n  return -1;\n}\n```\n\nNow each caller allocates its own `struct coro`, zero-initialised so that `state`\nstarts at 0 and the first call enters at `case 0`, and you can run a hundred independent\ngenerators at once. The price is cosmetic — `ctx->i` everywhere you'd have written\n`i` — and Tatham's own verdict is the honest one: *\"virtually all your serious\nvariables become elements of the coroutine context structure.\"* You trade a little\nsyntax for reentrancy. 
Usually worth it.\n\n## Why this matters beyond the trick\n\nYou don't reach for these macros often — real codebases use explicit state machines,\nthreads, or a language with `async`/`yield` built in. But the idea underneath is worth\nkeeping: **a coroutine is just a state machine where the compiler tracks the state for\nyou.** `async/await` in Rust, generators in Python, goroutines parked on a channel —\nall of them are, at bottom, \"save where I am, return, resume later.\" Tatham's macro is\nthat idea stripped to its absolute minimum: one `switch`, one `__LINE__`, and the\nnerve to put a `case` label inside a loop.\n\n---\n\n*Built on Simon Tatham's [Coroutines in C](https://www.chiark.greenend.org.uk/~sgtatham/coroutines.html) (2000) — still the clearest thing ever written on the subject.*\n","readingTimeMins":7,"url":"https://ai.thesatyajit.com/articles/coroutines-in-c"},{"title":"How self-attention works in transformers","description":"A from-scratch explainer of scaled dot-product attention — queries, keys, values, the softmax, and why the √d scaling matters.","date":"2026-06-02","tags":["transformers","deep-learning","explainer"],"draft":false,"kind":"articles","slug":"how-transformers-attention-works","body":"Self-attention is the single mechanism that lets a transformer decide, for every\ntoken in a sequence, which other tokens are worth listening to. Older architectures\nlike RNNs squeezed an entire sentence through a fixed-size hidden state and read it\nleft to right. Attention throws that bottleneck out: every token can look directly\nat every other token in one parallel step, and it learns *how much* to look.\n\nThe trick is to give each token three learned vectors. The **query** asks a question\n(\"what am I looking for?\"), the **key** advertises what a token offers (\"here is what\nI am about\"), and the **value** is the actual content that gets passed along once a\nmatch is found. 
You compute these by multiplying the input embeddings by three\nlearned weight matrices, $W_Q$, $W_K$, and $W_V$, giving matrices $Q$, $K$, and $V$.\n\nA token attends to another by comparing its query against that token's key with a\ndot product — a large dot product means the two vectors point in a similar direction,\nso the question and the offer line up. Do this for every query against every key and\nyou get a full grid of raw compatibility scores.\n\n$$\n\\text{Attention}(Q, K, V) = \\text{softmax}\\!\\left(\\frac{Q K^{\\top}}{\\sqrt{d_k}}\\right) V\n$$\n\nThat one line is the whole operation. The matrix below shows the resulting weights\nfor a tiny three-token sequence: each row is one query token, each column is a key it\nmight attend to, and the cell shading is how much weight that pair receives after the\nsoftmax. Hover a row to see where that token looks.\n\n<AttentionMatrix tokens={[\"the\", \"cat\", \"sat\"]} />\n\nIt helps to walk the formula from the inside out. Each step below takes the previous\nresult and transforms it; together they go from raw vectors to a context-aware output.\n\n<StepThrough titles={[\"scores\", \"weights\", \"mix\"]}>\n\n**Q·Kᵀ — raw scores.** Multiply the query matrix by the transpose of the key matrix.\nThe entry at row *i*, column *j* is the dot product of token *i*'s query with token\n*j*'s key — an unnormalised score for how relevant token *j* is to token *i*. The\nresult is a square matrix, one score for every ordered pair of tokens.\n\n**Scale, then softmax — attention weights.** Divide every score by $\\sqrt{d_k}$, the\nsquare root of the key dimension. Without this, large dimensions produce dot products\nwith a big variance, pushing the softmax into saturated regions where gradients\nvanish; the scaling keeps the distribution well-behaved. 
Then apply softmax across\neach row so the weights are non-negative and sum to one — a proper distribution over\n\"where this token attends.\"\n\n**Weighted sum — the output.** Multiply the weight matrix by the value matrix $V$.\nEach output row is a weighted average of all value vectors, blended according to that\ntoken's attention weights. A token that attended strongly to \"cat\" inherits most of\n\"cat\"'s value, so its new representation is now informed by the context around it.\n\n</StepThrough>\n\nStack several of these in parallel — each with its own $W_Q$, $W_K$, $W_V$ — and you\nget **multi-head attention**, where different heads specialise in different relations\n(syntax, coreference, positional patterns). Concatenate the heads, project once more,\nand that becomes one transformer sub-layer. Repeat across depth and the model builds\nincreasingly abstract, context-rich representations of the sequence.\n\n<Callout type=\"tip\">\n  The √dₖ scaling is easy to skip when implementing attention from scratch, but\n  dropping it is one of the most common reasons a hand-rolled transformer trains\n  slowly or not at all — the softmax saturates and gradients stop flowing.\n</Callout>\n\nThat is the entire idea: project tokens into queries, keys, and values; score every\npair with a scaled dot product; turn the scores into a distribution with softmax; and\nread out a weighted mix of values. Everything else in a transformer — feed-forward\nlayers, residual connections, layer norm, positional encodings — exists to support\nand stack this one operation.\n","readingTimeMins":3,"url":"https://ai.thesatyajit.com/articles/how-transformers-attention-works"}]