September 13, 2025 · Hugo Latapie
The “innovation theater” trap—where flashy claims sprint ahead of substance—strikes again.
Thinking Machines recently published Defeating Nondeterminism in LLM Inference, claiming a breakthrough in making large language models (LLMs) consistently produce the same output for the same input. The social media buzz is real—some call it a game-changer, touting potential in finance, healthcare, and compliance. Their method uses batch-invariant GPU kernels to eliminate run-to-run numerical variation, making outputs bit-for-bit reproducible. Credit where it’s due: determinism can simplify debugging, ensure consistent contract analysis in legal tech, and make outputs traceable for regulatory audits, like financial risk scoring.
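For context on where the nondeterminism comes from in the first place: floating-point addition is not associative, so a kernel that changes its reduction order as batch composition changes can return slightly different logits for the identical request. Here is a rough NumPy illustration of that effect (a deliberate simplification of GPU kernel behavior, not Thinking Machines’ actual code):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.standard_normal(10_000).astype(np.float32)

# One mathematical sum, two reduction orders: fully sequential versus
# chunked (roughly what a kernel does when batch size changes its tiling).
seq_sum = np.float32(0.0)
for v in x:
    seq_sum += v

chunked_sum = sum(np.sum(chunk) for chunk in np.split(x, 100))

print(seq_sum, chunked_sum)    # typically differ in the low-order bits
print(seq_sum == chunked_sum)  # often False: same data, different order
```

Batch-invariant kernels pin down the reduction order so those low-order bits stop drifting. That buys reproducibility, and only reproducibility.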
But consistency isn’t correctness. The paper sidesteps a critical question: does determinism improve accuracy? Our analysis suggests it can harm performance, especially in real-world scenarios. A model that is correct 20% of the time thanks to sampling randomness at least offers a lottery ticket; determinism that locks in a wrong response drops accuracy to 0%. If the model samples a correct answer with probability 0.2 and the single response it locks onto lies outside the correct set, every correct output vanishes.
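A toy simulation makes the arithmetic concrete (the 20% figure is the assumption from the text above, not a measurement of any real model):

```python
import random

random.seed(0)

P_CORRECT = 0.2     # assumed chance a sampled response is correct
N_QUERIES = 10_000

# Stochastic decoding: each query independently lands on a correct
# response with probability P_CORRECT.
sampled_hits = sum(random.random() < P_CORRECT for _ in range(N_QUERIES))

# Deterministic decoding locked onto a response outside the correct set:
# every query returns the same wrong answer.
locked_hits = 0

print(f"sampling accuracy:     {sampled_hits / N_QUERIES:.1%}")  # ~20%
print(f"locked-wrong accuracy: {locked_hits / N_QUERIES:.1%}")   # 0%

# Sampling also admits retries: P(at least one of k draws is correct)
# is 1 - (1 - P_CORRECT) ** k, so best-of-k beats a locked wrong answer.
for k in (1, 3, 5):
    print(f"best-of-{k}: {1 - (1 - P_CORRECT) ** k:.1%}")
```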
To expose the hype touting determinism as a game-changer for real-world applications, consider a reductio ad absurdum: imagine a model that always outputs “True” for any true/false question. It’s perfectly deterministic—same input, same output—but it’s only right when the answer actually is “True” (50% on a balanced dataset). For “False” cases, it’s wrong every time: reproducible but useless. On real-world, out-of-distribution (OOD) data—diverse patient records in healthcare, or market data with sudden volatility shifts that look nothing like the clean training set—this risk grows. A deterministic model might consistently misdiagnose patients or misprice risks, tanking accuracy and trust when it matters most.
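The constant classifier fits in a few lines (toy code with a hypothetical dataset, purely to make the point):

```python
def constant_model(question: str) -> bool:
    """Perfectly deterministic: same input, same output, every time."""
    return True

# A balanced evaluation set: half the answers are True, half False.
# (Hypothetical questions, purely for illustration.)
dataset = [
    ("Is 2 + 2 equal to 4?", True),
    ("Is the moon made of cheese?", False),
] * 50

correct = sum(constant_model(q) == answer for q, answer in dataset)
print(f"accuracy: {correct / len(dataset):.0%}")  # 50%: reproducible, useless
```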
This highlights the innovation theater trap. As we’ve argued, venture capital dynamics push companies to prioritize flashy claims over substance. Thinking Machines’ paper feels like a case study: it provides no accuracy benchmarks, limited clarity on real-world testing, and no formal acknowledgment of limitations. A buried line admits: “we aim for reproducibility, not necessarily improved performance.” Without benchmarks, testing detail, or acknowledged limitations, the work reads more like a product showcase than a scientific contribution. This isn’t just about one company: VC expectations often force even solid technical work into overhyped narratives presented as research.
The hype isn’t harmless. If determinism is sold as a win for finance, healthcare, or compliance without accuracy data, it risks disasters—mispriced risks crashing markets, wrong diagnoses harming patients, or audit trails of consistent mistakes, like misclassified trades, fooling regulators. Real-world compliance demands robustness in dynamic, OOD scenarios, not just predictability.
We’re not here to demonize Thinking Machines—tackling determinism is a valuable research problem with niche applications. Theoretical advances can be profound without immediate real-world evidence. But when practical impact is implied, serious work demands critical self-analysis—especially on obvious gaps like accuracy impacts. The paper misses this, and the social media hype calling it “great for finance and healthcare” is downright terrifying, risking catastrophic real-world harm without evidence.
The field deserves better than marketing hype dressed as science. True progress demands rigorous, transparent research: honest acknowledgment of limitations and, when practical impact is claimed, empirical evidence of accuracy and robustness in messy, out-of-distribution conditions. Investors and researchers should demand evidence over spectacle, prioritizing accuracy and robustness over product showcases—because science thrives on scrutiny, not hype, and progress means getting it right in the unpredictable world we live in.