Nvidia spent $20B on Groq to fix the part of inference where GPUs waste 99.8% of their silicon. That number sounds like hyperbole. It isn't. During the decode phase of LLM inference - the autoregressive, one-token-at-a-time part that dominates real-time serving - a modern GPU sustains roughly 0.2% of its peak compute. The rest of the die sits idle, waiting on memory bandwidth. I spent enough time at Google staring at TPU utilization dashboards to know what a workload mismatch looks like, and this one is hard to overstate. It falls directly out of the roofline model (sourced at the end): at batch size one, decode is effectively matrix-vector multiplication, so each 2-byte weight fetched from HBM feeds a single multiply-accumulate (2 FLOPs) - an arithmetic intensity of 1 FLOP/byte against a ridge point of 591 FLOP/byte. The chip is 591x over-provisioned for the actual workload.
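To make that concrete, here is a minimal roofline sketch in Python. The peak-compute and bandwidth figures are illustrative assumptions chosen only to reproduce the 591 FLOP/byte ridge point cited above, not vendor specs, and `decode_arithmetic_intensity` is a hypothetical helper that models a 2-byte-per-weight matrix applied to a small batch of decode tokens.

```python
# Minimal roofline sketch for LLM decode. The hardware numbers below are
# illustrative assumptions picked to match the 591 FLOP/byte ridge point
# discussed in the text, not published specs for any particular GPU.

PEAK_FLOPS = 2.0e15        # assumed peak compute, FLOP/s
HBM_BANDWIDTH = 3.383e12   # assumed memory bandwidth, bytes/s
BYTES_PER_WEIGHT = 2       # FP16/BF16 weights

ridge_point = PEAK_FLOPS / HBM_BANDWIDTH   # ~591 FLOP/byte


def decode_arithmetic_intensity(batch_size: int) -> float:
    """FLOP/byte of a weight matrix applied to `batch_size` decode tokens.

    Every weight is read from HBM once per step regardless of batch size,
    and contributes one multiply-accumulate (2 FLOPs) per token in the batch.
    """
    flops_per_weight = 2 * batch_size
    return flops_per_weight / BYTES_PER_WEIGHT


for batch in (1, 8, 64, 591):
    intensity = decode_arithmetic_intensity(batch)
    # Below the ridge point, attainable compute is capped by bandwidth:
    # utilization = min(1, arithmetic intensity / ridge point).
    utilization = min(1.0, intensity / ridge_point)
    print(f"batch={batch:4d}  intensity={intensity:6.1f} FLOP/B  "
          f"compute utilization={utilization:7.2%}")
```

Under these assumptions, batch size one comes out at roughly 0.17% compute utilization, which is where the 99.8%-wasted figure comes from; the batch has to reach the hundreds before decode crosses the ridge point and becomes compute-bound.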
The distribution surface question - who has the captive query volume to justify half a billion dollars in NRE for custom silicon - was settled years ago by Google, then confirmed by Amazon, Meta, and Microsoft. OpenAI's $10B Cerebras deal does not change the picture. Cerebras was 87% dependent on a single sovereign client before that contract showed up. It needed OpenAI's 300M weekly active users to have a viable business, which is the thesis working, not a counterexample.
The question that matters now is narrower and more immediate: inference is already the majority of AI compute spend, GPUs are structurally wrong for the decode half of it, and Nvidia just paid $20B to bolt on the architecture that fixes it. What does the repricing look like, and who ends up on the right side?
The Distribution Surface Thesis
Every custom AI ASIC that made it to production at scale has a massive captive workload behind it. Every attempt without one ended in an acquisition, a pivot, or a quiet shutdown. There are no exceptions to this pattern, though Cerebras comes closest to looking like one (more on that shortly).