Nvidia spent $20B on Groq to fix the part of inference where GPUs waste 99.8% of their silicon. That number sounds like hyperbole. It isn't. During the decode phase of LLM inference - the autoregressive, one-token-at-a-time part that dominates real-time serving - a modern GPU sustains roughly 0.2% of its peak compute. The rest of the die sits idle, waiting on memory bandwidth. I spent enough time at Google staring at TPU utilization dashboards to know what a workload mismatch looks like, and this one is hard to overstate. It falls directly out of the roofline model (sourced at the end): at batch size one, decode is effectively matrix-vector multiplication, so each 2-byte weight fetched from HBM feeds a single multiply-accumulate (2 FLOPs) - an arithmetic intensity of 1 FLOP/byte against a ridge point of 591 FLOP/byte. The chip is 591x over-provisioned for the actual workload.
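To make that concrete, here is a minimal roofline sketch in Python. The peak-compute and bandwidth figures are illustrative assumptions chosen only to reproduce the 591 FLOP/byte ridge point cited above, not vendor specs, and `decode_arithmetic_intensity` is a hypothetical helper that models a 2-byte-per-weight matrix applied to a small batch of decode tokens.

```python
# Minimal roofline sketch for LLM decode. The hardware numbers below are
# illustrative assumptions picked to match the 591 FLOP/byte ridge point
# discussed in the text, not published specs for any particular GPU.

PEAK_FLOPS = 2.0e15        # assumed peak compute, FLOP/s
HBM_BANDWIDTH = 3.383e12   # assumed memory bandwidth, bytes/s
BYTES_PER_WEIGHT = 2       # FP16/BF16 weights

ridge_point = PEAK_FLOPS / HBM_BANDWIDTH   # ~591 FLOP/byte


def decode_arithmetic_intensity(batch_size: int) -> float:
    """FLOP/byte of a weight matrix applied to `batch_size` decode tokens.

    Every weight is read from HBM once per step regardless of batch size,
    and contributes one multiply-accumulate (2 FLOPs) per token in the batch.
    """
    flops_per_weight = 2 * batch_size
    return flops_per_weight / BYTES_PER_WEIGHT


for batch in (1, 8, 64, 591):
    intensity = decode_arithmetic_intensity(batch)
    # Below the ridge point, attainable compute is capped by bandwidth:
    # utilization = min(1, arithmetic intensity / ridge point).
    utilization = min(1.0, intensity / ridge_point)
    print(f"batch={batch:4d}  intensity={intensity:6.1f} FLOP/B  "
          f"compute utilization={utilization:7.2%}")
```

Under these assumptions, batch size one comes out at roughly 0.17% compute utilization, which is where the 99.8%-wasted figure comes from; the batch has to reach the hundreds before decode crosses the ridge point and becomes compute-bound.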
The distribution surface question - who has the captive query volume to justify half a billion dollars in NRE for custom silicon - was settled years ago by Google, then confirmed by Amazon, Meta, and Microsoft. OpenAI's $10B Cerebras deal does not change the picture. Cerebras was 87% dependent on a single sovereign client before that contract showed up. It needed OpenAI's 300M weekly active users to have a viable business, which is the thesis working, not a counterexample.
The question that matters now is narrower and more immediate: inference is already the majority of AI compute spend, GPUs are structurally wrong for the decode half of it, and Nvidia just paid $20B to bolt on the architecture that fixes it. What does the repricing look like, and who ends up on the right side?
The Distribution Surface Thesis
Every custom AI ASIC that made it to production at scale has a massive captive workload behind it. Every attempt without one ended in an acquisition, a pivot, or a quiet shutdown. There are no exceptions to this pattern, though Cerebras comes closest to looking like one (more on that shortly).