AI is Dead, and We Have Killed It

Before you close this tab, hear me out. This is not a claim that AI does not work. GPT-4 exists. AlphaFold cracked the protein folding problem that stumped biologists for fifty years. Machine translation works. Nobody serious disputes those things. This is about what we destroyed to get there.

Between 2012 and 2024, AI research went from exploring six or seven competing ideas to optimizing exactly one. Not because that one approach was proven best, but because it fit hardware we already owned. We built an entire scientific field around what was convenient, called it progress, and buried everything else. The researchers who built alternative frameworks, rigorous and mathematically solid frameworks with decades of history, retired without training replacements. The grad students who might have learned those ideas never did, because universities stopped teaching them. The infrastructure needed to test those ideas does not exist anymore, because nobody funded it.

When I say AI is dead, I mean it the way a monoculture is dead. Biologically alive, but ecologically extinct.


The hardware lottery

Sara Hooker at Google Brain coined this term in her 2020 paper “The Hardware Lottery.” A research idea wins the hardware lottery when it succeeds because it fits whatever hardware already exists, not because it is the best solution to the problem.

History is full of this. Charles Babbage designed the Analytical Engine in 1837 and the design was mathematically correct, but Victorian fabrication could not build it. Perceptrons worked conceptually when Rosenblatt published them in 1958, died when Minsky and Papert proved in 1969 that single-layer perceptrons could not solve XOR, then returned in the 1980s when hardware finally existed to train multi-layer networks. Backpropagation was theoretically viable from the moment Rumelhart, Hinton, and Williams published it in 1986, but nobody could scale it until GPUs arrived.

Nvidia released CUDA in 2007. General-purpose computing on graphics cards became possible. GPUs turned out to be extraordinarily fast at one specific operation: dense matrix multiplication. Take a large grid of numbers, multiply it by another large grid, do it thousands of times in parallel. Games need that. Intelligence, it turns out, may not.

Fei-Fei Li released ImageNet in 2009 with 1.2 million labeled images. Three years later, Alex Krizhevsky trained a neural network on two consumer GPUs and won the ImageNet competition by 10.9 percentage points. In machine learning competitions, improvements of half a percent are notable. Krizhevsky won by nearly eleven points. What followed was not a scientific revolution. It was a hardware revolution wearing scientific clothing.

The lock-in that came after explains everything. GPUs make dense matrix multiplication cheap, so architectures built for that operation win benchmarks. Venture capital floods toward benchmark winners. Companies hire for GPU expertise because that is where the results live. Universities train students in GPU methods because industry demands it. Doctoral programs drop courses in alternative approaches because nobody is hiring for those skills. The professors who know symbolic AI, neuromorphic computing, and probabilistic systems retire, nobody replaces them, and the knowledge disappears.

Information theory makes the problem precise. Shannon’s channel capacity theorem gives the maximum rate at which information passes reliably through a noisy channel:

\[C = B \log_2\left(1 + \frac{S}{N}\right)\]

where $B$ is bandwidth, $S$ is signal power, and $N$ is noise power. This is a hard physical limit. No engineering exceeds it. The theorem tells you exactly when a communication system fails and why. There is no equivalent theorem for deep learning. Nobody writes down the architectural capacity of a transformer, the maximum intelligence it reliably extracts from a given data distribution. We have channel capacity bounds for radio. We have nothing analogous for the systems we deploy in hospitals and courts.
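Part of what makes Shannon’s bound useful is that it is explicit: you can compute it in one line. A minimal sketch (the 1 MHz bandwidth and 20 dB SNR figures are arbitrary example values, not from any real system):

```python
import math

def channel_capacity(bandwidth_hz: float, signal_power: float, noise_power: float) -> float:
    """Shannon capacity C = B * log2(1 + S/N), in bits per second."""
    return bandwidth_hz * math.log2(1 + signal_power / noise_power)

# Example: a 1 MHz channel at 20 dB SNR (S/N = 100).
# No modulation scheme, however clever, can exceed this rate.
print(f"{channel_capacity(1e6, 100.0, 1.0):,.0f} bits/s")
```

That is the kind of hard, checkable ceiling deep learning has never had for any of its architectures.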

The scaling laws used to justify hundred-million-dollar training runs are empirical fits, not theorems. Kaplan et al. (2020) found that loss scales approximately as

\[L(N, D) \approx \frac{A}{N^\alpha} + \frac{B}{D^\beta} + L_0\]

where $N$ is parameters, $D$ is training tokens, and $\alpha, \beta, A, B, L_0$ are fitted constants. This curve was measured over a specific range and extrapolated to justify models orders of magnitude larger. There is no proof it holds. Chinchilla showed the original Kaplan fits were wrong about the optimal $N$-to-$D$ ratio. The scaling laws that justified GPT-4’s training budget were themselves incorrect, and the field kept scaling anyway.
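The fit itself is a few lines of arithmetic, which underlines how thin the foundation is. A sketch using Chinchilla-style constants (illustrative values standing in for a fitted curve, not the exact published coefficients):

```python
def scaling_loss(n_params: float, n_tokens: float,
                 A: float = 406.4, B: float = 410.7,
                 alpha: float = 0.34, beta: float = 0.28,
                 L0: float = 1.69) -> float:
    """Empirical loss fit L(N, D) = A/N^alpha + B/D^beta + L0.
    The constants are illustrative Chinchilla-style values; the point is
    that this is a curve fit over a measured range, not a theorem."""
    return A / n_params**alpha + B / n_tokens**beta + L0

# Extrapolating the fitted curve orders of magnitude beyond where it was
# measured is exactly what the large training-run budgets relied on.
for n, d in [(1e8, 2e9), (1e10, 2e11), (1e12, 2e13)]:
    print(f"N={n:.0e}, D={d:.0e} -> predicted loss {scaling_loss(n, d):.3f}")
```

Nothing in the formula tells you where the fit stops holding; Chinchilla showed empirically that the original fit already had the optimal ratio wrong.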

GPUs are fast at dense matrix operations, parallel processing of identical computations, and backpropagation through smooth differentiable functions. They are slow at sparse event-driven computation, irregular memory access, sequential dependencies, symbolic manipulation, and logical reasoning. Your brain does most of that second list. It does not activate all neurons at once. It uses sparse, event-driven signals, with roughly 1 to 4 percent of neurons firing at any given moment. Information encodes in spike timing, not continuous values. The human brain runs 86 billion neurons on 20 watts through extreme sparsity and parallelism. We built our field on algorithms that activate every single parameter for every single computation, consuming megawatts to train and kilowatts to run. We chose algorithms that fit our silicon, not algorithms that fit intelligence.
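The arithmetic behind that mismatch is simple. A back-of-the-envelope sketch comparing multiply-accumulate counts for one dense layer against an event-driven version at cortex-like activity levels (the layer size and the 2 percent figure are illustrative assumptions, not measurements):

```python
def dense_macs(n_in: int, n_out: int) -> int:
    """Dense matmul: every weight participates in every forward pass."""
    return n_in * n_out

def event_driven_macs(n_in: int, n_out: int, activity: float = 0.02) -> int:
    """Event-driven: only weights fanning out from active inputs do work
    (roughly 1 to 4 percent of cortical neurons fire at any moment)."""
    return int(n_in * activity) * n_out

n_in = n_out = 4096
ratio = dense_macs(n_in, n_out) / event_driven_macs(n_in, n_out)
print(f"dense activation does ~{ratio:.0f}x more multiply-accumulates")
```

GPUs invert this advantage: their memory access patterns make the sparse version slower in practice, which is precisely the hardware lottery at work.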

Cloud economics made it worse. AWS, Azure, and Google Cloud sell compute by the hour. Revenue scales with consumption. Efficient algorithms mean less compute sold. Cloud providers did not deliberately suppress efficient methods. They responded rationally to their pricing model, which naturally rewards waste. Not one major cloud provider offers pricing advantages for sample-efficient or energy-efficient approaches. Researchers need compute, so they use cloud infrastructure. They optimize for the GPUs those platforms provide. They publish results showing better performance with scale. Other researchers adopt the same approach. Cloud providers see more usage and invest in more GPUs. No coordination required. Just locally rational decisions producing a globally irrational outcome.


September 30, 2012

AlexNet won ImageNet with a 15.3 percent error rate. Second place got 26.2 percent. The architecture was eight layers deep, roughly 60 million parameters, trained on two Nvidia GTX 580 GPUs you could buy at a retail store. Training time was six days.

The effects cascaded in about eighteen months. November 2012, Hinton starts consulting for Google. March 2013, Google acquires his company DNNresearch. September 2013, Facebook hires Yann LeCun. January 2014, Google acquires DeepMind for roughly 500 million dollars. Microsoft, Baidu, and Amazon launch major AI hiring programs. The field got strip-mined for talent inside two years.

Yoshua Bengio estimated in 2014 that roughly 50 world-class deep learning experts existed globally. Not 5,000. Not 500. Fifty. DeepMind hired approximately twelve of them. That is 24 percent of global expertise absorbed by one acquisition. By 2015, Google employed somewhere between five and fifty percent of the field, according to Peter Norvig’s carefully worded public statements. If Google pays 500 million dollars for 50 people, that is 10 million dollars per researcher, at a time when academic salaries ran 100,000 to 200,000 dollars. The industry premium was 50 to 100 times the academic rate. Universities lost the only places where alternative AI approaches could be explored outside quarterly earnings pressure.

The hidden cost was not just that researchers moved. They stopped teaching alternatives. Knowledge about symbolic AI, hybrid systems, neuromorphic computing, and sample-efficient learning died with the generation that held it. Nobody trained replacements because industry hired only for GPU expertise, and universities taught only what industry would hire.

Before 2012, research evaluation was multidimensional. People measured sample efficiency, proved theoretical guarantees about convergence and generalization, ran computational complexity analysis, and checked biological plausibility. After 2012, evaluation collapsed into leaderboard position. ImageNet top-5 accuracy became the number everyone optimized. Scientific understanding got replaced by competitive gaming.

Goodhart’s Law: when a measure becomes a target, it stops being a good measure. Once funding and careers depended on benchmark scores, research optimized for benchmarks instead of intelligence. The things that stopped being measured, sample efficiency, energy per task, compositional generalization, and causal reasoning, disappeared not because they stopped mattering, but because scaling approaches showed weaker results there and the incentives pointed elsewhere.

The gap between benchmark scores and real capability is documented. Models scoring 85 to 90 percent on MMLU show dramatically different performance on genuine clinical, legal, or mathematical tasks outside the benchmark distribution:

| Model / Benchmark | Reported score | Independent real-world test | Source |
| --- | --- | --- | --- |
| GPT-4 on MMLU | 86.4% | 57 to 72% on equivalent clinical questions | Microsoft Research, Nori et al. 2023 |
| GPT-4 on MedQA | 90.2% | Physicians outperform on novel cases not in training distribution | NEJM AI 2024 |
| GPT-4 on bar exam | 90th percentile | Fails on novel legal reasoning chains | Stanford Legal AI Lab 2023 |
| GPT-4 on GSM8K math | 92% | 60% on isomorphic problems with changed surface features | arXiv:2307.02477 |
| MMLU virology section | 65%+ model scores | Section is 57% factually incorrect | University of Edinburgh, Gema et al. 2024 |

The last row is the sharpest. Models that score well on the virology section of MMLU do so partly because they memorized wrong answers: the benchmark measuring medical AI knowledge is itself medically wrong more than half the time in that domain. When OpenAI announces that its new model scored 90 percent on MMLU, the score partly reflects having memorized the test. A student who steals the answer key scores 100 percent, but that tells you nothing about what they know.

The benchmark problem went beyond incentive drift. In December 2024, it became active fraud. OpenAI secretly funded the creation of FrontierMath, a mathematical benchmark designed to test whether AI could solve research-level math problems. The funding was hidden from the public and from the mathematicians who contributed questions, under NDA. OpenAI’s model scored remarkably well. When Epoch AI, the organization managing it, tried to reproduce the results independently, they could not. The mathematicians who wrote the questions had no idea their intellectual labor was being used to make a private company’s product look better.

What followed 2012 was not just a loss of methods. It was a loss of the habit of understanding. Researchers who were present in that period began describing neural networks not as systems they understood but as things they observed. They used language like “it learned to detect edges” or “it appears to have developed an internal representation.” When engineers describe a bridge they built, they say we designed the load-bearing structure to distribute stress across the arch. The first person is there because they understand the mechanism. When researchers say “it learned,” the mechanism is missing.

The symbolic AI tradition, the expert systems tradition, the Bayesian network tradition, all of these produced systems where you could trace any output back through the reasoning chain to the inputs and the rules and explain exactly why that output appeared. The interpretability research community tried to adapt after 2012, producing work on saliency maps, attention visualization, and probing classifiers. But this work was always trying to peer inside systems not designed for transparency. The field knew this was happening and continued anyway because benchmark numbers were going up and benchmark numbers were what mattered for publications, funding, and hiring. The result is a generation of AI systems embedded in healthcare, criminal justice, and financial decisions where nobody can tell you from the inside why the system produced a particular output.


What AI looked like before benchmarks ate everything

Most people reading this were not alive the last time an AI system matched medical specialists in a formal clinical evaluation.

Edward Shortliffe began building MYCIN at Stanford in 1972 as his doctoral dissertation. The system diagnosed bacterial infections of the blood and meningitis by asking physicians a structured sequence of questions, then working backward through roughly 600 production rules to identify the likely organism and recommend the appropriate antibiotic at the correct dosage for the patient’s body weight. Every conclusion came with a certainty factor. Every recommendation came with an explanation. If a physician asked why MYCIN suggested a particular drug, the system walked through precisely which rules it had fired and why.

In the formal evaluation published in JAMA in 1979, MYCIN received an acceptability rating of 65 percent from expert infectious disease specialists. The five Stanford Medical School faculty members in the evaluation scored between 42.5 and 62.5 percent. The expert judges agreed more with MYCIN than with the Stanford faculty. A separate accuracy study found MYCIN selected appropriate antimicrobial therapy in 90.9 percent of cases. The system performed at specialist level using 600 human-readable rules that any physician could read, audit, or challenge.

MYCIN was never deployed in hospitals. Not because it did not work, but because nobody could figure out who bore legal liability if it made a wrong call.

John McDermott at Carnegie Mellon built XCON in 1978 for Digital Equipment Corporation. DEC faced a specific problem: configuring VAX computer systems for customers required selecting the right components from a catalog of 420 possible parts, and technicians made enough errors that DEC regularly shipped free replacement parts. DEC had already tried and failed twice to write a conventional program to solve this, first in FORTRAN and then in BASIC. McDermott encoded the configuration expertise of DEC’s engineers as approximately 2,500 production rules. XCON first went into production use in 1980. By 1986, it had processed 80,000 customer orders and achieved 95 to 98 percent accuracy, saving DEC an estimated 25 million dollars per year.

XCON worked. It was deployed. It ran in production for years. It did so using interpretable rules that any engineer could read, audit, and modify.

Today, in 2026, you cannot deploy a neural AI system for medical diagnosis in most regulated healthcare environments because you cannot explain its decisions. The EU AI Act, the FDA’s guidance on AI-based medical devices, and hospital liability frameworks all require the ability to trace how a conclusion was reached. MYCIN could do that in 1974. Modern deep learning systems generally cannot. We had explainable AI that worked at specialist-level accuracy in clinical settings, and we abandoned it because it did not produce impressive numbers on benchmarks. Explainability is now a research frontier we are trying to recover. This is not progress. This is a circle.


The Soviet AI program nobody taught you about

Western AI history has a gap in it, and filling that gap changes how you understand what knowledge was lost.

In the early 1950s, cybernetics was politically suspect in the Soviet Union, labeled by Stalin-era ideologues as a bourgeois pseudoscience. After Stalin’s death in 1953, that changed. Cybernetics became a national priority, and what followed was a parallel AI research program that the Western field has almost entirely erased from its own history.

Viktor Glushkov was born in Rostov-on-Don in 1923, the son of a mining engineer. He survived the German occupation during the war and lost his mother, who had been helping the partisans. Because he had lived in occupied territory, he was barred from studying in Moscow or Leningrad after the war; all those who had been encircled by German forces were treated as politically suspect. He found his way to Rostov University instead, passed four years of exams in a single year, and graduated with degrees in both technical and mathematical sciences in 1948. In 1952, working at a forestry institute in the Urals, he solved a generalization of Hilbert’s fifth problem. Each of Hilbert’s 23 problems represented a frontier of world mathematics, and solving even one was a career-defining achievement. He was offered positions in Moscow, Leningrad, and Kyiv. His wife chose Kyiv because she wanted to go south.

He pivoted entirely to computing after reading Anatoly Kitov’s book Electronic Digital Machines in 1956, which he described as the moment that changed everything. In 1962 he established the Institute of Cybernetics of the Ukrainian Academy of Sciences. He filled it with ambitious young researchers, average age about 25, and set them to work on problems the West was not yet taking seriously: automated theorem proving, formal verification of programs, algebraic models of computation, and intelligent management of complex systems. His institute produced the MIR series of computers, personal computers for engineers with keyboards and cathode-ray tube screens, a decade before personal computers existed in the West. In 1960, Glushkov guided industrial processes at a facility 500 kilometres away using his Kyiv computer. He published nearly 800 works and trained over a hundred doctoral students.

His most ambitious project was OGAS, the All-State Automated System for the Gathering and Processing of Information for the Accounting, Planning and Governance of the National Economy. First proposed in 1962, OGAS was designed as a real-time national computer network spanning the entire Soviet Union, with a central computing center in Moscow connected to up to 200 regional centers and up to 20,000 local terminals. Glushkov estimated that if paper-based Soviet planning methods continued unchanged, the planning bureaucracy would need to grow fortyfold by 1980 just to keep functioning. He designed the network with deliberate decentralisation: any authorised node could contact any other node without routing through the central hub. Distributed network architecture, built independently, at the same time ARPANET was being constructed on the other side of the Iron Curtain. The CIA monitored Glushkov’s work. Arthur Schlesinger Jr. described an all-out Soviet commitment to cybernetics as providing a potential tremendous advantage in production technology, complex industries, feedback control, and self-teaching computers.

On the morning of October 1, 1970, Glushkov walked into Stalin’s former office in the Kremlin to present his funding request to the Politburo. He warned the assembled officials directly: if the full OGAS was not built now, the Soviet economy would encounter such difficulties in the second half of the 1970s that they would have to return to the question regardless. When he looked around the room, the two chairs that mattered most were empty. Brezhnev and Kosygin, General Secretary and Prime Minister, were absent from the meeting, ostensibly on diplomatic business. Finance Minister Vasily Garbuzov, who had no interest in a nationwide computer network that would shift economic power away from his ministry, argued against the project. He suggested at one point that computers were already being used effectively on chicken farms and perhaps that was the right scale to consider first. With the two supporters gone, the Politburo sided with Garbuzov. Glushkov was offered a compromise: build disconnected systems for individual ministries, with no network connecting them. A historian of the project, writing for MIT Press, later put it precisely: Soviet attempts to build a national computer network were undone by socialists who behaved like competitive capitalists.

Glushkov spent the rest of his life pushing truncated versions of the project through defence ministries that wanted the efficiency gains without the distributed network. He died in January 1982, still working, of a brain tumour. He was 58 years old. In the last nine days of his life, having regained consciousness from a coma, he dictated his memories and ideas to his daughter Olga.

This section is not actually about OGAS. OGAS is the famous part of the story because it is dramatic, a Soviet internet denied by bureaucrats, lost to history. The connection to this essay’s argument runs deeper, through the people who worked in Glushkov’s institute and what they were building.

One of them was Kateryna Yushchenko. She was born in 1919 in Chyhyryn, central Ukraine. In 1937 her father was arrested as a Ukrainian nationalist and died in a gulag. When her mother went to the authorities to inquire, she too was arrested and imprisoned for ten years. Yushchenko was eighteen and had just started at Kyiv University, which expelled her as the daughter of enemies of the people. The only institution that would accept her on a full scholarship was a university in Samarkand, Uzbekistan. She made her way there, survived the war, returned to Ukraine, and in 1950 defended a doctorate at the Kyiv Institute of Mathematics, becoming the first woman in the USSR to earn a doctorate in physical and mathematical sciences for work in programming.

In 1955, working with the MESM computer, one of the first computers in continental Europe, she created the Address Programming Language. The language introduced indirect addressing: the ability to refer to a memory cell that holds the address of data rather than the data itself. This is the pointer, one of the fundamental concepts in all of modern computer science, present in every systems language written since. Her language appeared in 1955, two years before FORTRAN, three years before COBOL, five years before ALGOL. Western computer science credits Harold Lawson with inventing the pointer variable in 1964 for PL/I. Lawson received the Computer Pioneer Award from the IEEE in 2000 for this invention. Yushchenko had done it nine years earlier and run it on production hardware. Her language was implemented on most Soviet and Chinese computers for over twenty years and controlled the Apollo-Soyuz space mission in 1975. She supervised 45 doctoral students and wrote more than 200 publications. She died in 2001. In Western computer science curricula, she is essentially unknown.

Glushkov’s institute was doing formal methods, symbolic reasoning, automated theorem proving, distributed systems, and explainable computational logic, exactly the tradition this essay argues was killed by the hardware lottery. When the Soviet Union collapsed in 1991, that entire research tradition was archived in Russian-language journals, associated with a political system that had just failed, and therefore easy to dismiss. Solomonoff worked in isolation from it. Pearl did not know it existed. The theoretical foundations built in Kyiv, Novosibirsk, and Moscow throughout the 1960s and 1970s were not absorbed into Western AI at the moment the field was consolidating around neural networks. We did not just fail to build on that work. We did not even know it existed.

The hardware lottery has a geopolitical dimension. It was not only Western economic incentives that narrowed the possibility space of AI. It was also the accident of which civilisation’s research tradition happened to survive the Cold War. One tradition went through Bell Labs, MIT, and Stanford. The other went through Kyiv and Novosibirsk, and it ended.


The people we buried

We talk about this as a graveyard of ideas. It is a graveyard of people. Specific human beings who built rigorous, working, mathematically complete frameworks for intelligence, and watched those frameworks get defunded because they did not run fast on gaming cards.

Judea Pearl and the mathematics we refused to use

Judea Pearl won the Turing Award in 2011, the highest honour in computer science, for building the formal mathematical framework for reasoning about cause and effect. His do-calculus, structural causal models, and the ladder of causation are not philosophical speculation. They are precise mathematical machinery with decades of verified application in medicine, economics, and policy.

His argument against current AI is specific. Every large language model, every image classifier, every recommendation system operates at what Pearl calls the first rung of the ladder of causation: association. It learns $P(Y \mid X)$, the probability of $Y$ given that we observe $X$. That is all it does.

The three rungs are mathematically distinct. The first rung is observation: $P(Y \mid X = x)$. The second rung is intervention, written using Pearl’s do-operator:

\[P(Y \mid \text{do}(X = x)) \neq P(Y \mid X = x)\]

These are different quantities, and no amount of observational data makes them equal unless you have a causal model of the data-generating process. Pearl’s do-calculus provides three inference rules for translating between them when a causal graph is known. Without the graph, the translation is provably impossible. Pearl published the rules in 1995. It took eleven years to prove the calculus complete, meaning that if repeated application of the rules cannot eliminate the do-operator, the causal effect is genuinely not identifiable from observational data. Huang and Valtorta proved completeness in 2006, independently confirmed by Shpitser and Pearl the same year. A field that was paying attention would have found this result important.

The third rung is counterfactual: $P(Y_x = y \mid X = x', Y = y')$, what $Y$ would have been had $X$ been $x$, given that we actually observed $X = x'$ and $Y = y'$. This requires a structural causal model with specific functional forms. No neural network trained on observational data can answer this. It is not a matter of scale or architecture. It is a mathematical impossibility, and Pearl proved it.

Here is why it matters in practice. A medical AI trained on hospital data observes that sicker patients receive more treatment, and learns that more treatment correlates with worse outcomes. Ask it a first-rung question about what outcomes patients with these symptoms typically see, and it answers correctly. Ask it whether to give this patient more treatment, and it gives you the systematically wrong answer, because treatment and outcome are confounded by severity. This is not a training data problem. No amount of additional data fixes it. Pearl proved this mathematically in 1995.
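The failure is easy to reproduce in simulation. A sketch with a hypothetical severity-confounded treatment (all numbers invented for illustration): conditioning on treatment makes it look harmful, while intervening on it shows it helps.

```python
import random
random.seed(0)

def mean_recovery(n=100_000, intervene=None):
    """Severity confounds treatment and outcome: sicker patients are more
    likely to be treated AND recover less, so observationally the treated
    group looks worse even though treatment causally helps (+0.2)."""
    totals, counts = {0: 0.0, 1: 0.0}, {0: 0, 1: 0}
    for _ in range(n):
        severity = random.random()
        # Observational regime: P(treated | severity) = severity.
        t = intervene if intervene is not None else int(random.random() < severity)
        recovery = 1.0 - 1.5 * severity + 0.2 * t + random.gauss(0, 0.1)
        totals[t] += recovery
        counts[t] += 1
    return {k: totals[k] / counts[k] for k in (0, 1) if counts[k]}

obs = mean_recovery()                                    # first rung: P(Y | X)
do = {t: mean_recovery(intervene=t)[t] for t in (0, 1)}  # second rung: P(Y | do(X))
print(f"observational gap:  {obs[1] - obs[0]:+.2f}")     # negative: treated look worse
print(f"interventional gap: {do[1] - do[0]:+.2f}")       # positive: treatment helps
```

A model fitting $P(Y \mid X)$ on this data learns the negative gap, no matter how much data it sees; only the interventional quantity answers the treatment question.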

Legal reasoning, scientific inference, moral responsibility, and individual treatment estimation all require the third rung. Language models produce text that sounds like counterfactual reasoning because they trained on humans who do it, but when a problem requires genuinely novel causal inference outside the training distribution, they fail. Pearl published a paper in 2018 listing specific tasks provably impossible for systems that only learn correlations. The field’s response was largely silence, not refutation, just silence. Because integrating causal structure into neural networks is hard, produces no benchmark improvements on ImageNet, and runs inefficiently on GPU matrix operations. Pearl is 88 years old and still publishing. The field is still not listening.

Vladimir Vapnik and the guarantees we threw away

Vladimir Vapnik, working with Alexey Chervonenkis at the Institute of Control Sciences in Moscow in the 1960s, built VC theory: a rigorous mathematical framework for understanding when a learning algorithm generalises from training data to new data. The VC dimension measures hypothesis class complexity. Structural risk minimisation tells you how to balance fit against complexity to get provable generalisation guarantees. Sample complexity bounds take the form $O(d/\epsilon)$ for VC dimension $d$, meaning you calculate in advance exactly how much data you need to achieve a given error rate with a given probability.
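That calculation really is mechanical. A sketch using one standard sufficient-sample bound (the exact constants vary across derivations in the literature; the VC dimension of a linear classifier on $k$ features is $k + 1$):

```python
import math

def vc_sample_bound(d: int, eps: float, delta: float) -> int:
    """Training-set size sufficient for error <= eps with probability >= 1 - delta,
    for a hypothesis class of VC dimension d. One standard sufficient bound;
    constants differ between derivations, the scaling O(d/eps) does not."""
    return math.ceil((4 / eps) * math.log2(2 / delta)
                     + (8 * d / eps) * math.log2(13 / eps))

# Linear classifier on 10 features (VC dimension 11), 5% error, 95% confidence:
print(vc_sample_bound(d=11, eps=0.05, delta=0.05))
```

You run this before collecting a single training example. No comparable calculation exists for a transformer.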

Their 1971 paper existed. Western researchers simply did not read it. Vapnik noted that his research was essentially invisible to Western colleagues until he emigrated from the Soviet Union around 1990. It was not until 1995 — the same year he introduced SVMs with Corinna Cortes at AT&T Bell Labs — that the first international workshop on VC dimension was held. The theory had been sitting in Moscow for nearly thirty years. SVMs find the maximum margin hyperplane separating classes, using the kernel trick for nonlinear problems, and they came with theoretical guarantees you could prove. SVMs dominated machine learning through the mid-2000s, and then AlexNet happened. Within three years the field stopped caring. Not because SVMs were proven wrong. Not because anyone showed the theory was flawed. Neural networks given enough GPU compute and labeled data scored higher on benchmarks, and the guarantees became inconvenient. Before 2012, you calculated sample complexity bounds in advance. After 2012, nobody knows. The deep learning community genuinely cannot tell you in advance how much data a model needs, whether it will generalise, or when it will fail. We traded provable guarantees for higher benchmark numbers.

Vapnik was also working in Moscow. The Institute of Control Sciences sits in the same intellectual tradition as Glushkov’s Kyiv institute: formal, mathematically grounded, focused on theoretical guarantees rather than empirical performance. The hardware lottery did not just kill Western alternatives. It killed an entire hemisphere’s worth of them, twice — once when the Cold War made Soviet work invisible, and once when benchmarks made theoretical guarantees inconvenient.

Jürgen Schmidhuber and the framework that got stolen without credit

Schmidhuber invented Long Short-Term Memory networks in 1997 with Sepp Hochreiter, solving the vanishing gradient problem that made training deep recurrent networks impossible. His broader research program asked a more fundamental question: how do you assign credit across long causal chains, when the action that produced a good outcome happened many steps ago? His 1991 PhD thesis was titled “Dynamic Neural Networks and the Fundamental Spatio-Temporal Credit Assignment Problem.” He spent his career building meta-learning systems that learn how to learn, and formal theories of curiosity and intrinsic motivation grounded in information-theoretic compression. His argument: intelligence is the construction of compressed, predictive models of the environment.

He built practical systems that won international competitions on handwriting recognition, speech recognition, and sequence prediction before deep learning became mainstream. When the field exploded after 2012, Schmidhuber found himself writing public documents arguing that the dominant figures had systematically failed to credit earlier work. He was not wrong about the credit. The practical cost was that his work on meta-learning and on compression as intelligence stayed marginal for years. These ideas are finally getting serious attention in 2025 under different branding, but without the theoretical framework he built.

Ray Solomonoff and the optimal theory nobody uses

Ray Solomonoff died in 2009. He was one of three people who stayed the entire Dartmouth summer of 1956 — the event where AI was named as a field — the other two being McCarthy and Minsky. He circulated the first report on machine learning that summer. He invented algorithmic probability in 1960 and published the first formal theory of inductive inference in 1964. In 1965, Kolmogorov independently published similar ideas. When Kolmogorov became aware of Solomonoff’s earlier work, he acknowledged priority. For several years afterward, Solomonoff’s work was better known in the Soviet Union than in the West. The general consensus eventually named the underlying complexity measure after Kolmogorov, because Kolmogorov was focused on randomness in ways that mapped cleanly to information theory, while Solomonoff was focused on inductive reasoning, which the field did not yet know how to absorb. The founding framework for machine learning now carries the name of the man who discovered it second. In 2003, Solomonoff received the first Kolmogorov Award, given by the University of London. The award was named after the man who came after him.

The core idea: given a sequence of observations, the best prediction of the next observation is a weighted average over all computable theories consistent with the data, weighted by their simplicity in bits. This is Solomonoff induction. It is provably the optimal inductive inference algorithm. It is also uncomputable, which is why nobody uses it directly.

It provides the theoretical ideal against which all practical learning algorithms can be measured. Given a sequence of observed symbols $T$, the probability it is followed by subsequence $a$ is:

\[P(a, T, M) \equiv \frac{\sum_{i=1}^{\infty} 2^{-N(Ta, i)}}{\sum_{i=1}^{\infty} 2^{-N(T, i)}}\]

where $N(T, i)$ is the number of bits in the $i$-th shortest program that produces $T$ on a universal Turing machine $M$. The weight assigned to each explanation of the data is $2^{-N}$: shorter programs get exponentially higher probability. This is Occam’s Razor made mathematically precise. Sequences with short descriptions receive high prior probability. Sequences requiring long programs to generate receive low prior probability. The framework is provably optimal. Solomonoff showed that a system using a weighted mean over all possible probability evaluation methods performs at least as well as any individual method for predicting future data, including methods not yet discovered. The framework is machine-independent for sufficiently long sequences: different choices of universal Turing machine change predictions by at most a constant number of bits, which becomes negligible as the observed data grows.
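Solomonoff's mixture can be illustrated with a deliberately tiny stand-in. The sketch below (the three rules and their hand-assigned description lengths are hypothetical; the real mixture ranges over all programs and is uncomputable) keeps only the hypotheses consistent with the observations and weights each by $2^{-N}$:

```python
# Toy, finite stand-in for Solomonoff prediction: "programs" are a handful of
# hand-picked rules, each with an assumed description length in bits.

def predict_next(observed, hypotheses):
    """Weighted vote over hypotheses consistent with the observed sequence.

    hypotheses: list of (description_bits, rule) where rule(i) emits symbol i.
    Returns {symbol: probability} for the next position.
    """
    weights = {}
    for bits, rule in hypotheses:
        generated = [rule(i) for i in range(len(observed))]
        if generated == observed:                   # keep only consistent programs
            nxt = rule(len(observed))
            weights[nxt] = weights.get(nxt, 0.0) + 2.0 ** -bits  # simplicity prior
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}

hypotheses = [
    (3,  lambda i: 0),                        # "all zeros": very short program
    (5,  lambda i: i % 2),                    # "alternate 0,1": short program
    (18, lambda i: [0, 1, 0, 1, 1][i % 5]),   # longer, more contrived rule
]

print(predict_next([0, 1, 0, 1], hypotheses))
# Both the alternating rule and the contrived rule fit the data,
# but the shorter program dominates the mixture.
```

The all-zeros program is excluded because it contradicts the data; among the survivors, the 5-bit rule outweighs the 18-bit rule by a factor of $2^{13}$, which is Occam's Razor doing the work.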

Jorma Rissanen’s Minimum Description Length from 1978 is a computable approximation: choose the hypothesis that minimises the combined length of the model description plus the data description given the model. Good learning is compression. A child who learns addition does not store every arithmetic fact they have seen. They store a compact rule. That is what learning looks like mathematically.
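Rissanen's two-part criterion is easy to make concrete. The toy below (the function names and the flat 32-bits-per-integer cost model are illustrative assumptions, not Rissanen's formulation) compares the total description length of memorising a regular sequence against storing the generating rule plus its exceptions:

```python
# A hypothetical MDL comparison: "memorize the data" vs "store a rule plus
# exceptions". Code lengths are simplified to 32 bits per stored integer.

BITS_PER_INT = 32

def mdl_memorize(data):
    # Model is empty; the "description" is just the raw data.
    return 0 + BITS_PER_INT * len(data)

def mdl_linear_rule(data):
    # Model: two integers (start, step). Data given model: only the exceptions,
    # each stored as an (index, value) pair.
    start, step = data[0], data[1] - data[0]
    exceptions = sum(1 for i, x in enumerate(data) if x != start + step * i)
    return 2 * BITS_PER_INT + exceptions * (2 * BITS_PER_INT)

data = list(range(0, 2000, 7))          # 0, 7, 14, ... : highly regular
print(mdl_memorize(data))               # cost of raw storage
print(mdl_linear_rule(data))            # cost of the compact rule
# MDL picks the hypothesis with the smaller total: here, the rule.
```

The 286-element sequence costs 9,152 bits to memorise and 64 bits to state as a rule; a noisy sequence would shift the balance back toward memorisation, which is exactly the tradeoff the criterion formalises.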

Now look at GPT-4 by that standard. A reported 1.8 trillion parameters trained on several trillion tokens of text. The parameter count is on the same order of magnitude as the token count of the data it consumed. That is not compression. A model that understood language at the level human linguists do would represent those patterns with vastly fewer bits. Nobody measures this. Bits-per-parameter does not appear on a leaderboard.

The compression comparison is stark:

| System | Parameters / storage | Domain | Compression ratio |
|---|---|---|---|
| Human brain linguistic knowledge | ~$10^{15}$ synaptic weights | All human language and reasoning | Estimated 10,000:1 over input experience |
| GPT-4 | $1.8 \times 10^{12}$ parameters | Text prediction | Roughly 1:1 to 2:1 over training tokens |
| GPT-2 (1.5B) | $1.5 \times 10^{9}$ parameters | Text prediction | Roughly 3:1 over training tokens |
| Kolmogorov-optimal English grammar | Unknown, estimated ~$10^{6}$ bits | Core grammatical structure | Would compress entire corpus |
| XCON expert system (1980) | 2,500 rules (~$10^{5}$ bits) | VAX computer configuration | 80,000 orders processed at 95% accuracy |

XCON solved a real industrial problem with $10^5$ bits of stored knowledge. GPT-4 uses $10^{12}$ parameters and cannot reliably configure a VAX computer unless the configuration was in its training data. That is not progress in compression. It is regression in compression dressed as progress in benchmark scores.

These systems fail when you push them slightly outside their training distribution because they did not compress a rule. They stored a weighted average of examples. Solomonoff predicted this failure mode in 1964. His paper is absent from machine learning courses not because it is wrong, but because it does not run on a GPU, and the field stopped caring about theoretical foundations around 2012, when scaling started producing empirical results. Empirically working is not the same as theoretically correct, and the theoretical incorrectness shows up in every deployment failure, in every model that aces a test and then fails on a slightly reworded version of the same question.

The engineers who warned us and were ignored

Claude Shannon built information theory in 1948 and the field owes its existence to him. Less known is his 1950 paper “Programming a Computer for Playing Chess,” where he explicitly warned against brute-force search as a path to intelligence. He proposed evaluation functions requiring genuine positional understanding, and wrote that a machine following a fixed exhaustive procedure could not be called intelligent in any meaningful sense. The field eventually solved chess by doing exactly what Shannon warned against: brute force at sufficient scale. It declared that a victory, and moved on. Shannon showed that every communication channel has a hard theoretical capacity $C = B\log_2(1 + S/N)$. There is no equivalent capacity theorem for intelligence. We have the mathematics to bound one kind of information processing and no mathematics at all to bound the other, and we deployed the second at planetary scale anyway.

Norbert Wiener published “Some Moral and Technical Consequences of Automation” in Science in 1960. Not a philosophy paper. A technical one. He identified what we now call goal misspecification with precision: a system optimized for a proxy objective achieves the proxy even when doing so destroys the underlying goal. His example was a chess-playing machine instructed to “win at all costs” that learned to disable the opponent’s ability to move rather than play good chess. He wrote this more than half a century before AI alignment became a research field. He also founded cybernetics, the study of feedback, control, and communication in biological and mechanical systems, which provided the mathematical vocabulary for understanding how complex systems regulate themselves. Cybernetics was absorbed into engineering control theory and abandoned by AI, which chose to optimize loss functions with no feedback model at all.

John von Neumann delivered lectures in 1955 on self-reproducing automata that included a precise technical observation about error tolerance. Digital computers fail catastrophically when individual components fail. Biological neural systems tolerate massive component failure gracefully. A human brain losing millions of neurons remains functional. Von Neumann identified that the difference lies in the coding scheme: biological systems use highly redundant distributed representations where function is not localised to individual components. Modern transformer models are catastrophically sensitive to weight perturbation in ways biological systems are not. Von Neumann described this failure mode seventy years ago. It has not been addressed. The system architecture that would address it was never funded because it did not run efficiently on matrix multiplication hardware.

Leslie Valiant invented PAC learning in 1984: Probably Approximately Correct. The formal framework establishing that learning is a computational problem with provable complexity bounds. His 1984 paper defined what it means for a concept class to be learnable: there exists a polynomial-time algorithm that with high probability produces a hypothesis with low error on fresh data. Valiant received the Turing Award in 2010. His 2013 book, Probably Approximately Correct, argues that evolution and learning are the same computational process, governed by the same complexity bounds, and that any theory of intelligence must be grounded in computational complexity. The field did not engage with this. PAC learning was taught until 2012, then dropped from most curricula because it produced no benchmark improvements and GPUs did not benefit from thinking about sample complexity bounds before training.

Gregory Chaitin extended Solomonoff and Kolmogorov’s work into a result with direct implications for what any learning system can ever know. His incompleteness theorem for algorithmic information theory shows that for any formal system of sufficient complexity, almost all mathematical facts within that system are algorithmically random. They have no proof shorter than the statement of the fact itself. This means there are patterns that exist but cannot be compressed. A learning system encounters them and mistakes them for noise, or memorizes them without generalizing. Chaitin’s results bound the frontier of what compression-based learning can reach. Nobody building large language models has engaged with this literature. The theoretical ceiling of their approach has been published for fifty years.


The alchemy accusation

Ali Rahimi, a Google researcher, stood up at NeurIPS 2017 to accept his Test of Time award and told the assembled audience that machine learning had become alchemy. He showed that researchers had stripped most of the complexity from a state-of-the-art translation algorithm and it translated better, meaning the original creators did not understand what they had built.

He offered a specific example. A research team changed the rounding mode of floating point arithmetic in TensorFlow. Not the data, not the architecture, not the training procedure. Only the direction in which numbers were rounded when they exceeded floating point precision. The model’s error rate changed from 25 percent to 99 percent.

A change with no semantic meaning, that preserves every meaningful quantity in the computation to within machine precision, that would be a complete non-event in any mature engineering discipline, caused the error rate to quadruple. In civil engineering, aerospace, chemical engineering, a system with this property would not get deployed. It would go back to the design phase with a requirement that the sensitivity be characterised and bounded. In machine learning, this example was presented as an illustration of a general problem and the field kept deploying systems at accelerating speed.

Szegedy et al. (2013) showed that for any neural network classifier $f$ and any correctly classified input $x$, there exists a perturbation $\eta$ satisfying

\[\|\eta\|_\infty \leq \epsilon \quad \text{(imperceptible to humans)}\]

such that $f(x + \eta) = t$ for any target class $t \neq f(x)$. The perturbation is found by solving:

\[\eta^* = \arg\min_{\eta} \|\eta\| \quad \text{subject to } f(x + \eta) = t\]

This is not a minor bug. It is a structural property of decision surfaces built by gradient descent on high-dimensional data. The adversarial perturbation exploits the geometry of a surface that happens to agree with human labels on the training distribution but is otherwise unconstrained. A human reclassifying images does not fail this way because human perception is built on invariant causal structure, not correlated pixel statistics. The field has known about adversarial examples for over a decade. Robust classifiers with provable $\ell_\infty$ certificates exist in the research literature. They cost 10 to 100 times more to train and score lower on standard benchmarks. So they are not used.
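The geometry can be demonstrated without a neural network at all. The sketch below uses a fast-gradient-sign-style perturbation (a later and simpler construction than the Szegedy et al. optimisation quoted above) against a synthetic logistic classifier; every weight and input is invented for illustration:

```python
import numpy as np

# A minimal sketch of adversarial-example geometry on a toy logistic
# classifier. Not the Szegedy et al. optimisation: this is the cheaper
# steepest-ascent construction under an L-infinity budget.
rng = np.random.default_rng(0)
d = 1000                                  # high dimension is what makes it cheap

w = rng.normal(size=d) / np.sqrt(d)       # a fixed, synthetic "trained" weight vector
x0 = rng.normal(size=d)
x = x0 + w * (3.0 - w @ x0) / (w @ w)     # shift x so the model is confident (logit = 3)

def logit(x):
    return float(w @ x)

eps = 0.25                                # per-coordinate budget, small vs |x_i| ~ 1
eta = -eps * np.sign(w)                   # move every coordinate eps against w

print(logit(x))                           # confident positive before perturbing
print(logit(x + eta))                     # sign flips: the predicted class changes
# No coordinate moved more than eps, but all d tiny nudges align with w,
# so their contributions add up to a large swing in the logit.
```

The per-coordinate change is invisible against unit-scale inputs, yet the logit swings by roughly $\epsilon \sum_i |w_i|$, which grows like $\sqrt{d}$; high dimensionality is doing the damage, exactly as in the image classifiers.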

Both failure modes come from the same place. Fitting data without compressing it produces surfaces that are arbitrarily sensitive to irrelevant variations. The field has not fixed this after a decade of research because fixing it would require abandoning the hardware-lottery architecture. Provably robust classifiers exist, but training them costs orders of magnitude more and scores worse on the benchmarks that drive adoption, which means the incentive to solve the problem correctly is structurally suppressed.

Rahimi’s precise statement: “we are building systems that govern healthcare and mediate our civic dialogue. we can influence elections. i would like to live in a society whose systems are built on top of verifiable, rigorous, thorough knowledge, and not on alchemy.” The audience gave him a standing ovation. Yann LeCun called it insulting. Citation counts on benchmark papers continued to climb.


The architecture graveyard

The following approaches meet four specific criteria: published peer-reviewed research proving they worked, theoretical advantages in at least one measurable dimension, documented evidence of funding decline, and an identifiable mechanism explaining why they lost the hardware lottery. No speculation. Things that worked and died anyway.

Spiking neural networks

Real neurons do not do continuous mathematics. They spike. Discrete voltage spikes: that is the entire mechanism. The human brain runs 86 billion neurons on 20 watts through event-driven computation where energy is consumed only when neurons actually fire, sparse activation where only 1 to 4 percent of neurons are active at any moment, and temporal coding where information is encoded in spike timing rather than activation magnitude.
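A minimal sketch of the basic unit, a leaky integrate-and-fire neuron (the constants here are illustrative, not fitted to biology):

```python
import numpy as np

# A minimal leaky integrate-and-fire neuron, the basic unit of a spiking
# network. Membrane voltage leaks toward rest, integrates input current,
# and emits a discrete spike when it crosses threshold.
def lif(input_current, dt=1.0, tau=20.0, v_rest=0.0, v_thresh=1.0):
    v = v_rest
    spikes = []
    for i, current in enumerate(input_current):
        v += dt * (-(v - v_rest) + current) / tau  # leak toward rest, integrate input
        if v >= v_thresh:                          # all-or-nothing event
            spikes.append(i)                       # information lives in spike *times*
            v = v_rest                             # reset after firing
    return spikes

# Constant drive above threshold produces a regular spike train; between
# spikes nothing is emitted -- computation is event-driven by construction.
current = np.concatenate([np.zeros(20), 1.5 * np.ones(80)])
print(lif(current))
```

The output is a short list of spike times, not a dense activation vector, which is why this maps naturally onto event-driven neuromorphic hardware and poorly onto dense matrix-multiply hardware.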

The efficiency numbers are real. Best-case measurements on neuromorphic hardware show spiking neural networks on Intel’s Loihi chips using roughly $10^{-11}$ joules per synaptic operation. Standard deep neural networks on GPUs use $10^{-6}$ to $10^{-9}$ joules per multiply-accumulate operation. On compatible tasks, the gap is 100 to 1,000 times. The caveat matters: that advantage only appears on tasks matching spike-compatible processing like temporal pattern recognition or event-driven perception. Dense matrix multiplication on a spiking network is worse than standard approaches.

Physics sets the absolute floor. Landauer’s principle gives the minimum energy to erase one bit of information:

\[E_{\min} \geq k_B T \ln 2 \approx 2.9 \times 10^{-21} \text{ J at } T = 300\text{K}\]

Biological synaptic operations at $\approx 10^{-15}$ J sit about six orders of magnitude above the Landauer limit. GPU floating-point operations at $\approx 10^{-11}$ to $10^{-6}$ J sit nine to fifteen orders of magnitude above it. The brain is operating nine orders of magnitude closer to what thermodynamics permits. Current deep learning is not inefficient in the way a car engine is inefficient. It is inefficient in the way burning the entire building is inefficient for heating one room.
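The orders-of-magnitude comparison above is a two-line calculation (the per-operation energies are the approximate figures quoted in the text):

```python
import math

# Reproducing the comparison against the Landauer limit. All energies in
# joules; the per-operation figures are the approximate values in the text.
k_B = 1.380649e-23                         # Boltzmann constant, J/K
T = 300.0                                  # room temperature, K
landauer = k_B * T * math.log(2)           # minimum energy to erase one bit

for name, e_op in [("biological synapse", 1e-15),
                   ("GPU op (best case)", 1e-11),
                   ("GPU op (worst case)", 1e-6)]:
    orders = math.log10(e_op / landauer)   # distance above the physical floor
    print(f"{name}: {orders:.1f} orders of magnitude above the Landauer limit")
```

The Landauer floor comes out near $2.9 \times 10^{-21}$ J; the synapse sits roughly five and a half orders above it, the GPU figures roughly nine and a half to fourteen and a half.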

Spiking networks failed to scale because spike generation is non-differentiable, which breaks backpropagation, and event-driven processing does not map to dense matrix operations. PyTorch and TensorFlow do not support spike-timing-dependent plasticity, the learning rule spiking networks actually use, and no mature developer ecosystem exists. In 2015, roughly 20 research groups globally worked on spiking neural networks. By 2024, maybe 12 remained. Neuromorphic VC investment finally exceeded 200 million dollars in 2025, while GPU infrastructure investment that same year was roughly 580 billion dollars. The ratio is 2,900 to 1.

If neuromorphic hardware had received GPU-level development investment from 2012 to 2024, a 100-fold energy reduction would turn a 50,000 megawatt-hour training run into a 500 megawatt-hour one. We did not fund that path because nobody’s quarterly revenue depended on it.

Capsule networks

In 2017, Geoffrey Hinton published “Dynamic Routing Between Capsules.” This requires a moment. Hinton co-created backpropagation in 1986. His student created AlexNet in 2012. He architected modern deep learning. In 2017, he published a paper saying we made a fundamental mistake.

Convolutional neural networks use max pooling for translation invariance, which lets you detect a cat anywhere in an image but discards spatial relationships between features. CNNs detect eyes, nose, and mouth without caring whether those features are arranged in any anatomically coherent configuration. They classify a scrambled face as a face because the individual features are present. Capsule networks output vectors instead of scalars, encoding pose information including position, orientation, and size, and use dynamic routing to create part-whole hierarchies. On MNIST overlapping digits, capsule networks outperformed CNNs using 100 times less training data.

Two explanations exist for why they failed, and intellectual honesty requires presenting both. The infrastructure explanation: dynamic routing does not map efficiently to GPU matrix operations, requiring completely new tooling and expertise, and no benchmark breakthrough justified the switching costs. The fundamental limitation explanation: the routing mechanism has $O(n^2)$ computational complexity, which might be intrinsic to the approach regardless of hardware. Attempts to scale capsules to ImageNet hit 100x compute overhead. Capsules work on MNIST, address a real CNN limitation regarding spatial reasoning, and failed to scale to ImageNet with 2017 to 2020 methods. Whether the scaling issues are fundamental or fixable is unknown, because research stopped. Between 2017 and 2020, roughly 50 papers examined capsule networks. Between 2020 and 2024, about 20. No major lab adopted them at scale.

Neuro-symbolic integration

Combining neural networks for pattern recognition with symbolic reasoning for logic and knowledge makes obvious theoretical sense. Neural components handle noisy perceptual data. Symbolic components provide explainability, logical guarantees, and compositional reasoning. Sun and Bookman wrote “Computational Architectures Integrating Neural and Symbolic Processes” in the 1990s. Gary Marcus argued in 2018 that we could not construct rich cognitive models without hybrid architecture. He turned out to be right.

Systems that actually worked and received inadequate funding: DeepProbLog in 2018 combined neural networks with logic programming and produced promising results. MIT-IBM’s Gen in 2019 combined probabilistic and deep learning, and Intel used it for real applications. The funding imbalance was roughly 100 to 1. Pure deep learning investment from 2012 to 2024 ran around 100 billion dollars. Neuro-symbolic research received perhaps 1 billion. In 1990, 60 percent of computer science departments offered a course in logic programming. By 2000, 40 percent. By 2010, 15 percent. By 2024, fewer than 5 percent. The expertise required to build neuro-symbolic systems is vanishing from universities in real time.

Backpropagation’s biological impossibility

Francis Crick, Nobel laureate and co-discoverer of DNA structure, identified in 1989 that backpropagation is fundamentally incompatible with biological neural architecture. Backpropagation requires the gradient of the loss with respect to a weight to depend on the transpose of the forward weights, but forward synapses go axon to dendrite in a unidirectional physical structure, and no known biological mechanism maintains symmetric feedback across billions of synapses. Timothy Lillicrap showed in 2014 that feedback alignment using random feedback weights instead of the transpose still allows networks to learn. Rao and Ballard documented predictive coding in 1999, showing how hierarchical prediction-error signals could drive learning locally, consistent with neuroscience. Predictive coding research received less than 0.5 percent of deep learning funding. We built an entire field on an algorithm that Crick told us in 1989 cannot be how biological intelligence works.
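Feedback alignment itself fits in a few lines. The toy below (sizes, learning rate, and the linear regression task are arbitrary choices for illustration) trains a two-layer linear network while routing the error through a fixed random matrix $B$ instead of $W_2^\top$, in the spirit of Lillicrap's result:

```python
import numpy as np

# Feedback alignment sketch: the backward pass for the first layer uses a
# fixed random matrix B rather than the transpose of W2, so no symmetric
# "backward synapses" are required. Toy sizes and task throughout.
rng = np.random.default_rng(1)
n_in, n_hidden, n_out = 20, 30, 5

W_true = rng.normal(size=(n_out, n_in)) * 0.5   # target linear map to learn
W1 = rng.normal(size=(n_hidden, n_in)) * 0.1
W2 = rng.normal(size=(n_out, n_hidden)) * 0.1
B = rng.normal(size=(n_hidden, n_out)) * 0.1    # fixed random feedback weights

lr = 0.02
losses = []
for step in range(2000):
    x = rng.normal(size=n_in)
    target = W_true @ x
    h = W1 @ x                          # forward pass (linear for simplicity)
    y = W2 @ h
    e = y - target
    losses.append(float(e @ e))
    W2 -= lr * np.outer(e, h)           # standard delta rule at the output
    W1 -= lr * np.outer(B @ e, x)       # error fed back through B, not W2.T

print(losses[0], sum(losses[-100:]) / 100)   # error falls despite "wrong" feedback
```

The squared error drops by orders of magnitude even though the hidden layer never sees the true gradient, which is the point: exact weight transport is not required for learning.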

Mixture-of-experts: thirty years late

Robert Jacobs, Michael Jordan, Steven Nowlan, and Geoffrey Hinton published “Adaptive Mixtures of Local Experts” in 1991. Multiple specialist networks, a gating mechanism routing each input to relevant experts, only a fraction of parameters activating for any given input: computational efficiency through sparse activation. This paper was effectively ignored for thirty years, because dense networks were good enough, GPU infrastructure was optimised for dense computation, and no financial incentive existed to invest in routing mechanisms.

In 2017, Google’s Noam Shazeer published “Outrageously Large Neural Networks,” rediscovering the same idea with modern tools. GPT-4 reportedly uses MoE architecture. DeepSeek’s V3 model, the base for its R1 reasoning model, uses MoE with 671 billion total parameters but only 37 billion active per token, achieving GPT-4-class performance at a reported training cost of approximately 6 million dollars versus the hundreds of millions spent on comparable OpenAI training runs. The efficient architecture was published in 1991. It saw serious use starting in 2017. Twenty-six years lost. Not because the idea was wrong, but because inefficiency was not penalised.
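The core mechanism from the 1991 paper, sparse routing through a learned gate, can be sketched in a few lines (all sizes and the untrained random weights here are illustrative):

```python
import numpy as np

# A minimal sparse mixture-of-experts layer in the spirit of Jacobs et al.
# (1991) and Shazeer et al. (2017): a gate scores the experts for each input
# and only the top-k experts run. Weights are random, purely for illustration.
rng = np.random.default_rng(0)
d_model, n_experts, k = 16, 8, 2

W_gate = rng.normal(size=(d_model, n_experts)) * 0.1
experts = [rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(n_experts)]

def moe_forward(x):
    scores = x @ W_gate                          # gate logits, one per expert
    top = np.argsort(scores)[-k:]                # route to the k best-scoring experts
    gate = np.exp(scores[top] - scores[top].max())
    gate /= gate.sum()                           # softmax over the chosen k only
    # Only k of the n_experts matrices are touched: compute scales with k, not n.
    return sum(g * (x @ experts[i]) for g, i in zip(gate, top))

x = rng.normal(size=d_model)
y = moe_forward(x)
print(y.shape)            # same width as the input; 6 of 8 experts never ran
```

The parameter count scales with `n_experts` while per-token compute scales with `k`, which is exactly the decoupling that makes a 671-billion-parameter model affordable to run at 37 billion active parameters.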


The economic structure that produced this

This section does not claim tech companies intentionally suppressed superior technologies. It analyses how profit models structurally favoured certain research directions, and how rational individual decisions produced a collectively irrational outcome.

Cloud providers sell compute by the hour. Revenue scales with consumption. Efficient algorithms mean less compute sold, and compute-intensive algorithms mean more compute sold. The pricing model that would have changed everything, accuracy-per-joule instead of compute-per-hour, was never economically viable for cloud providers because it would mean selling less compute.

Between 70 and 95 percent of all AI chips sold globally come from Nvidia. This is not a natural monopoly in the classical sense. It is the product of CUDA, Nvidia’s proprietary parallel computing platform released in 2007, which has accumulated seventeen years of optimized libraries, debugging tools, profiling infrastructure, and compiler optimization that simply does not exist for any alternative hardware.

The French Competition Authority investigated Nvidia’s market position in 2024 and reached a specific finding. Switching away from Nvidia hardware does not merely require purchasing different chips. It requires rewriting the entire software stack, retraining every researcher and engineer who uses it, rebuilding every optimization and library the field depends on, and accepting a period of dramatically reduced productivity while the new ecosystem gets built. The Authority described this as a switching cost so high it constitutes a lock-in even for large well-resourced organizations. For academic research groups, the lock-in is absolute.

Spiking neural networks achieve 100 to 1,000 times better energy efficiency on compatible tasks. The research literature is clear about this. The efficiency advantage is real, documented, and reproducible. But spiking network research has received roughly 1/2,900th of the investment GPU infrastructure has received over the same period. This is not because spiking networks failed scientifically. It is because they do not run efficiently on CUDA, which means the entire research infrastructure built around CUDA does not benefit them, which means researchers who want to use existing tools and benchmarks use GPU-compatible architectures, which means the talent and resources flow to GPU-compatible architectures, which means the CUDA ecosystem gets better and the alternatives fall further behind.

The mixture-of-experts architecture was first proposed in 1991. It achieves dramatically better computational efficiency by activating only a subset of model parameters for any given input. DeepSeek’s demonstration in early 2025 that a mixture-of-experts model matched GPT-4’s performance at roughly 6 percent of the training cost was treated as a surprise. It should not have been. The technique was 34 years old. The reason it had not been pursued aggressively by well-funded American labs is that mixture-of-experts reduces compute requirements, and reducing compute requirements reduces cloud revenue for Amazon, Microsoft, and Google, all of whom are simultaneously AI investors and compute sellers.

Researchers used CUDA because the tools were there. The tools were there because Nvidia built them because they increased GPU sales. Cloud providers funded research that required GPU-scale compute because that research became deployed systems running on their GPU-equipped infrastructure. Every turn of the cycle makes it harder to exit because the switching costs grow with every new library, every new optimized kernel, every new model checkpoint that cannot be converted to run efficiently on alternative hardware.

Venture capital selection pressures compounded this. Time to demo for GPU-scalable deep learning: three to six months, fine-tuning something that already exists. Time to demo for sample-efficient hybrids: one to two years to build new architecture. VC fund cycles run two to three years. The maths makes alternative approaches structurally unfundable regardless of technical merit. Between 2012 and 2022, over 50 billion dollars went into AI startups, roughly 95 percent focused on scaling existing architectures, and zero unicorns got built on neuromorphic, symbolic, or hybrid approaches.

Academic career incentives completed the loop. PhD programs run four to five years. Three to five top-venue publications are required to graduate, and review cycles run three to six months per paper. You must show publishable results within twelve to eighteen months per project. Industry machine learning positions pay 300,000 to 500,000 dollars for PyTorch expertise. Neuromorphic and symbolic research positions pay 150,000 dollars where they exist at all. The rational student learns skills with the highest job market value. The knowledge extinction spiral follows directly: professors who know symbolic AI and formal methods retire, grad students avoid learning fields with no job market, universities stop hiring in those areas, course offerings disappear, and the next generation has zero exposure to the ideas. Institutional knowledge dies.

Research funding tiers created a capital intensity filter. Symbolic, hybrid, and neuromorphic approaches demonstrate ideas at the 10,000 to 1 million dollar tier. Scaling approaches require 1 million to 10 million to show competitive results. Funding committees prefer approaches with proven scale at the 10 to 100 million dollar tier. Alternative approaches need tier-3 funding to prove viability, but to get tier-3 funding you must show tier-4 results. DARPA’s program for probabilistic programming got 50 million dollars over four years from 2013 to 2017, then ended. DeepMind’s budget runs roughly 1 billion dollars per year, sustained. The funding ratio for incumbent versus alternative is approximately 100 to 1.

The open source counterargument deserves an honest response. PyTorch is free. TensorFlow is free. Hugging Face hosts thousands of models. You can download GPT-2 for free and train it from scratch if you have 50,000 dollars in cloud credits. Knowledge is open. Experimentation is gated by capital. PyTorch is optimised for GPU operations; implementing capsule networks in it is painful and slow, and implementing spiking neural networks is ten times harder than standard deep neural nets. Open source infrastructure encodes architectural assumptions. Open-sourcing every automobile design does not help if you want to build a bicycle.


What your PhD program stopped teaching

Compare two students. One enters a Stanford computer science PhD program in 2008. The other enters the same program in 2024.

The 2008 student takes courses in logic programming, learning Prolog and building systems that reason through formal rules. They study formal verification, learning to prove properties of programs mathematically before running them. They take probabilistic graphical models, covering Bayesian networks, Markov random fields, and inference algorithms. They study computational learning theory, working through PAC learning, VC dimension, and sample complexity bounds. They take a cognitive architectures course examining how symbolic and connectionist models of cognition interact.

The 2024 student takes courses in deep learning fundamentals using PyTorch. They study transformer architectures and attention mechanisms. They learn fine-tuning and RLHF. They take benchmark evaluation courses examining how to measure model performance on standard datasets. Optional electives cover diffusion models and foundation model deployment.

The 2008 curriculum asked: what is intelligence and how does learning work mathematically? The 2024 curriculum asks: how do you use the tools that already exist?

The knowledge that disappeared can be measured as a course catalog:

| Topic | Taught in 2008 | Taught in 2024 | What was lost |
|---|---|---|---|
| Logic programming (Prolog) | ~60% of CS depts | <5% | Ability to build reasoning systems with formal guarantees |
| PAC learning / VC theory | ~75% | ~15% | Sample complexity bounds; when learning provably works |
| Bayesian networks / PGMs | ~70% | ~25% | Principled uncertainty; causal structure |
| Formal verification | ~50% | ~20% | Mathematical proof of system properties before deployment |
| Cognitive architectures | ~40% | ~5% | Interface between symbolic and neural models of mind |
| Computational complexity | ~80% | ~60% | Big-O analysis; when algorithms scale and when they do not |
| Neuromorphic computing | ~20% | ~3% | Event-driven, energy-efficient computation |
| Algorithmic information theory | ~15% | ~2% | Solomonoff, Kolmogorov, Chaitin; compression as intelligence |

The 2024 student knows more PyTorch than any student in 2008. They know less about whether what they are building can work, when it will fail, and whether it is the right approach.


The scientific method we abandoned

Science and engineering ask different questions. Engineering: does it work, can we build it, does it scale? Science: why does it work, through what mechanism, when will it fail given which boundary conditions? Why do transformers work? Engineering says they achieve 90 percent accuracy on GLUE. The scientific answer exists — it just did not come from the mainstream.

Yi Ma and colleagues at Berkeley derived transformer architecture from first principles in 2023. Starting from a single objective — maximise the reduction in coding rate between a compressed representation of all data and the sum of compressed representations per class — and applying gradient descent, they recovered multi-head self-attention as one step of the optimisation and sparse coding (ISTA) as the next. The CRATE architecture that falls out of this derivation achieves comparable ImageNet performance to Vision Transformers of similar parameter count, with every layer having a precise geometric interpretation: compression toward low-dimensional subspaces, then sparsification. The reason existing transformers work is not mysterious — they are approximately implementing the same optimisation, without anyone designing them to, because gradient descent over the data distribution tends to find it. This is what a scientific explanation looks like. It makes predictions. You can derive architecture choices from it rather than tuning them empirically. The paper was presented at NeurIPS 2023.

It received nowhere near the attention of a new scaling law. When will scaling stop? Engineering says when we run out of compute. Science would answer: at architectural capacity C given data distribution D. We do not have those bounds for the architectures we actually deploy. The CRATE result suggests the bounds may be derivable — they are derivable for the architectures built from first principles — but deriving them requires the theoretical investment the field stopped making in 2012.

Tishby’s information bottleneck principle from 2000 gives the right framework for what a neural network should be doing. If a network learns representation $T$ of input $X$ to predict label $Y$, the optimal representation minimises:

\[\mathcal{L}[p(t|x)] = I(X; T) - \beta \cdot I(T; Y)\]

where $I$ denotes mutual information and $\beta$ controls the tradeoff between compression and predictive power. A network that genuinely understood its task would find a minimally sufficient statistic. It would compress $X$ down to only the information relevant for predicting $Y$, discarding everything else. This is mathematically what understanding looks like. The problem: we cannot compute $I(X;T)$ for a 1.8 trillion parameter model. We cannot verify whether transformers do this. We cannot measure how close any deployed system is to the information bottleneck optimum. The framework for asking whether a system understands its task exists. The tools to answer the question at scale do not, because building them would require theoretical investment the field stopped making in 2012.
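On a toy discrete problem the objective is computable exactly. The sketch below (the distributions and the two candidate encoders are invented for illustration) compares an encoder that copies $X$ against one that keeps only the label-relevant bit:

```python
import numpy as np

def mutual_info(p_xy):
    """I(X;Y) in bits for a joint distribution given as a 2-D array."""
    px = p_xy.sum(axis=1, keepdims=True)
    py = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return float((p_xy[mask] * np.log2(p_xy[mask] / (px @ py)[mask])).sum())

# Toy setup (hypothetical numbers): X is uniform over 4 values, Y is the
# parity of X, and the encoder T either copies X or keeps only the parity bit.
p_x = np.full(4, 0.25)
y_of_x = np.array([0, 1, 0, 1])                 # Y is a deterministic bit of X

def joints(t_of_x):
    """Joint p(x,t) and p(t,y) for a deterministic encoder T = t_of_x[X]."""
    n_t = int(t_of_x.max()) + 1
    p_xt = np.zeros((4, n_t))
    p_ty = np.zeros((n_t, 2))
    for x in range(4):
        p_xt[x, t_of_x[x]] += p_x[x]
        p_ty[t_of_x[x], y_of_x[x]] += p_x[x]
    return p_xt, p_ty

for name, t_of_x in [("copy X", np.arange(4)), ("keep parity only", y_of_x)]:
    p_xt, p_ty = joints(t_of_x)
    print(name, "I(X;T) =", mutual_info(p_xt), "I(T;Y) =", mutual_info(p_ty))
# "Copy X" pays 2 bits of I(X;T) for the same 1 bit of I(T;Y); the parity
# encoder is the minimal sufficient statistic the objective rewards.
```

Both encoders are equally predictive, but the objective prefers the one that discards everything irrelevant to $Y$; the difficulty the text describes is doing this comparison when $X$ is a trillion-dimensional activation rather than four symbols.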

Pre-2012 research had scientific frameworks. PAC learning gave sample complexity bounds: to learn a concept class with VC dimension $d$ to accuracy $\epsilon$ with confidence $1-\delta$, you need at most

\[m \geq \frac{1}{\epsilon}\left(4\log\frac{2}{\delta} + 8d\log\frac{13}{\epsilon}\right)\]

labeled examples. This is a hard mathematical guarantee. Not an empirical observation, not a benchmark score, but a proof that says: if you see this much data, your generalization error is bounded by this amount, regardless of what data you get. VC theory bounded generalisation error by model complexity. These frameworks were falsifiable, testable, and predictive. Post-2012 deep learning has empirical observations instead: bigger models usually perform better, more data usually helps, overparameterisation somehow does not hurt. At NeurIPS 2012, 30 percent of papers included theoretical analysis; at NeurIPS 2023, fewer than 10 percent did. Rich Sutton’s “bitter lesson” from 2019 stated that general methods leveraging computation are ultimately most effective. That translates to: theory does not matter, just scale. This is engineering philosophy, not science. No theory means no predictions. No predictions means no science. Just empirical tuning. This is what Rahimi called alchemy.
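The bound is directly computable, which is the point: it makes a prediction before you see any data. A small sketch using the formula quoted above, with logarithms base 2 as in the Blumer et al. form of the bound (an assumption; the text leaves the base unspecified):

```python
import math

def pac_sample_bound(vc_dim, epsilon, delta):
    """Sample-complexity upper bound quoted above, logs base 2:
    m >= (1/eps) * (4 * log2(2/delta) + 8 * d * log2(13/eps))."""
    return (1.0 / epsilon) * (4 * math.log2(2 / delta)
                              + 8 * vc_dim * math.log2(13 / epsilon))

# Example: a class of VC dimension 10, learned to 5% error, 95% confidence.
m = pac_sample_bound(vc_dim=10, epsilon=0.05, delta=0.05)
print(f"at most {math.ceil(m)} labeled examples suffice")
```

Roughly 13,000 examples, guaranteed in advance, for any data distribution. No scaling law offers anything comparable.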


The replication crisis nobody talks about

Physics has independent labs that replicate experiments. Biology requires confirmation from multiple research groups before findings enter the canon. Chemistry publishes synthesis procedures in enough detail that any competent lab can reproduce them. AI has a replication crisis built into its economics, and the scale of the problem is qualitatively different from anything in the history of science.

Henderson et al. (2018) attempted to reproduce benchmark results across six major reinforcement learning papers and found that reported results could not be reproduced in the majority of cases due to undisclosed implementation details, hyperparameter sensitivity, and random seed dependence. Bouthillier et al. (2019) showed that the majority of deep learning results are irreproducible without access to the exact codebase, hardware, and random seeds used. The Papers With Code reproducibility project found that fewer than 15 percent of AI papers provide enough detail for full independent replication. A partial record:

| Paper / claim | Reported result | Replication finding |
|---|---|---|
| Rainbow DQN (Hessel 2017) | SOTA Atari across 57 games | 7 of 57 results not reproducible without exact seed |
| BERT (Devlin 2018) | 80.5% on SQuAD 2.0 | ±2.4% variance across runs; some seeds fail entirely |
| AlphaGo Zero (Silver 2017) | Superhuman Go from scratch | Training cost $25M+; no lab has independently replicated |
| GPT-4 (OpenAI 2023) | SOTA across 20+ benchmarks | Architecture undisclosed; zero independent replication possible |
| FrontierMath (OpenAI 2024) | High score on research math | Epoch AI could not reproduce results independently |
| PaLM 2 (Google 2023) | Medical exam performance | Reproduction attempts show 8 to 15% performance variation |

The GPT-4 case is categorically different from the others. The others are failures of documentation. GPT-4 is deliberate opacity.

GPT-3 cost approximately 4.6 million dollars to train. GPT-4 cost somewhere above 100 million dollars by Sam Altman’s own public statement, with the amortised hardware cost for the full development process estimated at around 90 million dollars in an Epoch AI 2024 analysis, not counting the hundreds of millions in staff costs and infrastructure. Epoch AI found that the amortised cost of training frontier AI models has grown at roughly 2.4 times per year since 2016, and concluded that training runs will exceed 1 billion dollars by 2027, meaning that only the most well-funded organisations will be able to finance them.
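Epoch AI’s projection is a two-line calculation. Compounding from the GPT-4 estimate at the reported growth rate (the starting figures are the ones cited above):

```python
import math

# Epoch AI's reported growth rate for frontier training costs since 2016.
growth_per_year = 2.4
gpt4_cost, gpt4_year = 100e6, 2023  # ~$100M, per Altman's public statement

# Years until a frontier run crosses $1B at that compounding rate.
years = math.log(1e9 / gpt4_cost) / math.log(growth_per_year)
print(f"$1B threshold crossed around {gpt4_year + years:.1f}")
```

The crossing lands in roughly 2025 to 2026, consistent with the “by 2027” conclusion.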

The cost and energy trajectory in numbers:

| Model | Year | Training energy (MWh) | Training cost (USD) | Capability gain over prior model |
|---|---|---|---|---|
| GPT-2 (1.5B) | 2019 | ~27 | ~$50K | Large leap from GPT-1 |
| GPT-3 (175B) | 2020 | ~1,287 | ~$4.6M | Large leap from GPT-2 |
| GPT-4 (~1.8T est.) | 2023 | ~50,000 | ~$100M+ | Marginal over GPT-3.5 on many tasks |
| GPT-4 Turbo | 2024 | ~45,000 est. | ~$80M est. | Faster; modest capability gain |
| Gemini Ultra | 2024 | ~50,000+ est. | ~$100M+ est. | Comparable to GPT-4 |
| DeepSeek-V3 (MoE) | 2025 | ~2,800 est. | ~$6M | GPT-4-class performance |

The last row is the argument. GPT-3 to GPT-4 was a 39x energy increase for diminishing returns. DeepSeek matched GPT-4-class performance at 6 percent of the energy cost using an architecture published in 1991. The waste was a choice. The only organisations capable of independently verifying claims made about frontier AI models today are OpenAI, Google DeepMind, Microsoft, Meta, and possibly Anthropic and xAI. That is approximately five to seven entities on the entire planet. Every university, every national research lab, every government agency, and every independent researcher must trust what these companies report about their models, because the cost of verification is prohibitive.
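The ratios in that paragraph follow directly from the table’s energy estimates:

```python
# Sanity-check the quoted ratios against the table's energy estimates (MWh).
gpt3_mwh, gpt4_mwh, deepseek_mwh = 1_287, 50_000, 2_800

gpt4_multiple = gpt4_mwh / gpt3_mwh            # GPT-3 -> GPT-4 increase
deepseek_share = 100 * deepseek_mwh / gpt4_mwh  # DeepSeek-V3 vs GPT-4

print(f"GPT-3 -> GPT-4 energy multiple: {gpt4_multiple:.0f}x")
print(f"DeepSeek-V3 share of GPT-4 energy: {deepseek_share:.1f}%")
```

The multiple rounds to 39x and the share to 5.6 percent, the “6 percent” in the text.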

The GPT-4 technical report, published by OpenAI in March 2023, states explicitly in its own text that given both the competitive landscape and the safety implications of large-scale models like GPT-4, the report contains no further details about the architecture, including model size, hardware, training compute, dataset construction, training method, or similar. The paper that introduced one of the most consequential AI systems in history provides no information that would allow independent replication. This is not a minor procedural issue. It is the difference between science and product announcement. When challenged, OpenAI’s Chief Scientist Ilya Sutskever told The Verge that publishing technical details was wrong and that the company had been wrong to be open-source in the beginning. He said this about a company named OpenAI.

When a lab claims their model is safe for deployment, or that it has certain capabilities or lacks others, or that their alignment technique works, there is no independent verification mechanism at scale. The organisations doing safety evaluations are largely funded by or affiliated with the same labs building the systems. Independent academics cannot train models at the scale needed to test the claims. Governments cannot verify technical assertions without access to weights, training data, and infrastructure that companies refuse to share.

Nuclear power plants have independent safety regulators with technical expertise and legal access to facilities. Pharmaceutical companies must submit clinical trial data to regulators before drugs get approved. Aircraft manufacturers certify systems through independent testing requirements. AI systems are being deployed in healthcare, credit decisions, hiring, criminal justice, and national security with verification mechanisms that would be considered absurd in any of those other domains. The contrast with physics is precise: the Large Hadron Collider at CERN cost approximately 4.75 billion dollars to build and is operated by thousands of scientists from 100 countries. Every measurement is independently verifiable by competing teams within the same facility. The most expensive AI training runs are approaching comparable cost, and they are operated in complete secrecy by single companies with commercial interests in the results.


The safety organisation that wasn’t

OpenAI was incorporated in December 2015 as a nonprofit with an explicit mission to develop artificial general intelligence for the benefit of humanity rather than shareholders. That structure attracted specific people: researchers who wanted to work on AGI without quarterly earnings pressure, ethicists who believed safety considerations could be structurally protected from commercial override, technical staff who turned down higher-paying offers elsewhere because they believed the nonprofit structure meant something.

What happened over the following nine years is documented. The capped-profit structure introduced in 2019 brought investor interests into the organization for the first time. In November 2023, the board fired Sam Altman, citing a finding that he had not been consistently candid with the board. Within five days he was reinstated after investors and employees threatened a mass exodus to Microsoft. The board members who voted to fire him were removed and replaced with people more aligned with the commercial entity’s interests. The nonprofit that legally owned 100 percent of OpenAI ended up controlling 26 percent of the restructured entity.

In May 2024, Jan Leike resigned from his position as head of the Superalignment team, the team created to ensure that as AI systems became more powerful they remained aligned with human values. His public resignation statement said that safety culture and processes had taken a back seat to products. He wrote that he had been unable to secure the compute resources, the headcount, and the organizational priority necessary to do the safety work the team had been publicly committed to doing. The same week, chief scientist Ilya Sutskever, the team’s co-lead, announced his departure, and the Superalignment team was dissolved. Within months, the CTO, the chief research officer, and a research VP had left as well. The organization’s most senior technical and safety leaders departed in the space of a few months.

When the person most responsible for safety at the most prominent AI organization in the world writes publicly that safety lost the internal competition for organizational resources, the interpretation is not ambiguous. The nonprofit betrayal closes a loop. The hardware lottery pushed the field toward architectures that were scientifically wrong because the incentive structure rewarded benchmark performance. The investor lottery pushed the safety organization toward products that were commercially successful. The same force operating at two different levels, financial incentives running through institutional structures, systematically suppressing the people and ideas measuring the gap between what was being built and what was being claimed.


Where scaling actually worked

Credibility requires acknowledging genuine success.

AlphaFold solved the 50-year protein structure prediction problem. This was genuinely transformative for biology, and the compute investment was justified. Machine translation through Google Translate handles over 100 languages and provides real utility for billions of people. Image recognition achieves cancer detection at dermatologist-level accuracy. Speech recognition made voice interfaces reliable. These are real achievements that benefited real people.

The argument for concentration deserves honest engagement. Spreading 100 billion dollars across ten approaches might have produced ten mediocre outcomes. Concentrating on one approach produced transformative capabilities in twelve years. Neural scaling needed massive datasets, massive compute, and massive models simultaneously, and fragmented approaches cannot achieve the coordination needed for ImageNet, PyTorch, and CUDA optimisation together. The monoculture enabled the ecosystem. That argument has real force.

The counter-position is narrower: even if concentration was optimal from 2012 to 2020, it is not optimal now. Scaling is hitting physical limits from energy, cost, and data availability. We have no backup plan. The 100 billion dollars invested in scaling might have been allocated differently: 50 billion to scaling, still enough for the major successes; 25 billion to neuro-symbolic systems for sample efficiency; 15 billion to neuromorphic hardware for a 100x energy reduction; 10 billion to theoretical foundations. Same practical capabilities. A fraction of the energy cost. A field with options.


The collapse becoming visible

November 2024. Ilya Sutskever told Reuters the age of scaling was over. Anonymous sources reported to Bloomberg and The Information that OpenAI’s GPT-5 training was not meeting expectations, Google’s Gemini was falling short internally, and Anthropic delayed its next-generation Claude.

The test-time compute pivot tells you something. OpenAI’s o1 and o3 models do not scale model size. They scale inference-time computation, letting the model think longer before responding. This is an admission that pre-training scaling is hitting limits. The irony is precise: 1980s symbolic AI used search at inference time. From 2012 to 2024, the field declared search obsolete and claimed you just needed to scale neural networks. In 2024, search came back, rebranded as test-time compute.

DeepSeek proved the waste was a choice. In January 2025, a Chinese lab released a model matching GPT-4-class performance at approximately 5 to 6 million dollars in training cost, against hundreds of millions reportedly spent on comparable OpenAI runs. DeepSeek used mixture-of-experts architecture from 1991. They were forced to be efficient because US export restrictions blocked access to the highest-end Nvidia hardware. Constrained away from the expensive path, they built the cheap path. Nvidia’s stock dropped 17 percent the day of the release, because if efficient AI works as well as wasteful AI, the premise that you need to keep buying more GPUs indefinitely becomes questionable.

Geoffrey Hinton left Google in May 2023. His stated reason was that he wanted to discuss AI safety without worrying about how it interacted with Google’s business. He said part of him regrets his life’s work. He won the 2024 Nobel Prize in Physics for foundational work in neural networks, and he regrets what that work became.


What is coming back

Publication trends show exponential growth from a small base: 53 neuro-symbolic papers in 2020, 236 in 2023, a 66 percent annual increase. LLM limitations around reasoning, factuality, and explainability are becoming obvious. Regulation is demanding transparency through the EU AI Act. Scaling is hitting limits, making efficiency matter again.

DeepMind’s AlphaGeometry in 2024 solves International Mathematical Olympiad geometry problems by combining a neural language model with a symbolic deduction engine. Performance approaches human gold medalists. MIT-IBM’s Neuro-Symbolic Concept Learner does visual reasoning with compositional structure, learns from few examples, and provides interpretable reasoning chains. Sample efficiency is roughly 100 times better than pure neural approaches in structured domains. Healthcare applications combining GPT-4 with rule-based expert systems show 90-plus percent accuracy with full audit trails.

The techniques exist now. We will never know what else might have existed, and when, because the exploration stopped. That is the actual tragedy of the hardware lottery. Not that we definitely chose wrong, but that we stopped asking whether we had.


What needs to happen

The DeepSeek result contains the only concrete proof of concept for intervention that this essay can offer. Export restrictions on high-end Nvidia hardware forced one lab to use efficient architecture. They did not choose efficiency. They were denied the option of waste, and efficiency followed. The mechanism is not mysterious: remove the subsidy for inefficiency and the field will find efficient paths, because the paths exist and have existed for decades.

The most direct lever available to any government is procurement and regulatory standards. The EU AI Act already requires transparency for high-risk AI deployments. Adding an energy-per-inference ceiling for regulated deployments — say, a hard limit of 1 joule per query for systems deployed in healthcare, criminal justice, or credit decisions — would do more to redirect research toward spiking networks and hybrid architectures than any grant program. The limit is technically achievable: Intel’s Loihi 2 neuromorphic chip already operates in that range on compatible workloads. It is not achievable on a dense transformer running on a GPU cluster. The regulation would not ban any technology. It would end the subsidy for wasteful ones in the domains where the public bears the risk.

The numbers for a self-contained alternative ecosystem are specific enough to be credible. TSMC’s N3 node development cost approximately 15 billion dollars. Building neuromorphic fabrication to that scale of investment would cost roughly the same. PyTorch’s full development history, including all tooling and optimization infrastructure, cost approximately 500 million dollars in engineering time at Meta. A spiking-network equivalent framework — call it something like SpikeTorch — would cost less, since the architectural problem is simpler. The curriculum restoration is cheapest: 200 million dollars distributed across 50 universities over five years would fund dedicated faculty lines in algorithmic information theory, formal verification, and neuromorphic systems at every major research institution. Total: roughly 16 billion dollars. GPT-4 training cost an estimated 100 million dollars and produced a system that cannot tell you when it will fail. 16 billion dollars is the cost of 160 GPT-4 training runs. It is also, by the estimates in this essay, the cost of rebuilding the ecosystem of alternatives that would make the next 160 training runs unnecessary.
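The arithmetic behind those totals, using the estimates above:

```python
# Back-of-envelope totals for the alternative ecosystem sketched above (USD).
neuromorphic_fab  = 15.0e9  # TSMC N3-scale fabrication investment
spiking_framework = 0.5e9   # PyTorch-scale software effort
curriculum        = 0.2e9   # faculty lines at 50 universities over 5 years

total = neuromorphic_fab + spiking_framework + curriculum
gpt4_run = 100e6  # estimated GPT-4 training cost

print(f"total: ${total / 1e9:.1f}B "
      f"(~{total / gpt4_run:.0f} GPT-4-scale training runs)")
```

The sum is 15.7 billion dollars, rounded in the text to 16; at 100 million per run that is roughly 157 GPT-4-scale training runs, rounded to 160.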

The political economy thesis and the epistemological thesis converge here. The reason MYCIN’s approach was abandoned was not that explainability failed technically. It was that explainability produced no benchmark score, ran on hardware nobody was selling, and required theoretical knowledge universities stopped teaching. The reason Solomonoff’s framework sits unused is not that it is wrong. It is that it does not fit a GPU, and the institutions that might have built approximations to it collapsed for funding reasons in the 1990s. Both problems have the same structure: the incentive to waste was subsidized, and the incentive to understand was not. Changing that does not require banning anything. It requires making inefficiency expensive in the domains where the public bears the cost of failure, and funding the infrastructure that currently does not exist to replace it.

The field is not short of intelligence. It is short of the institutional conditions under which intelligence about intelligence can survive.


The record in one table

Every argument in this essay compresses into a few rows. This is the history of the field telling itself what was wrong, and the field not listening.

| Year | Theoretical result or warning | The field’s response |
|---|---|---|
| 1950 | Shannon warns brute-force search is not intelligence | Field eventually solves chess by brute force; calls it victory |
| 1960 | Wiener formalises goal misspecification as a technical problem | Ignored; rediscovered as “alignment” in 2017 |
| 1964 | Solomonoff proves optimal inductive inference theory | Unknown to most ML practitioners today |
| 1984 | Valiant formalises PAC learning with provable complexity bounds | Taught until 2012; dropped when benchmarks became the metric |
| 1989 | Crick argues backpropagation is biologically implausible | Field continues using backpropagation exclusively |
| 1991 | Jacobs et al. publish mixture-of-experts with 10x efficiency | Ignored for 26 years; rediscovered in 2017; DeepSeek uses it in 2025 |
| 2000 | Pearl publishes the causal hierarchy; proves correlation cannot yield intervention | Mainstream ML still operates at rung one |
| 2017 | Rahimi tells NeurIPS the field is alchemy | Standing ovation; citation counts on scaling papers continue climbing |
| 2019 | Hinton says backpropagation is wrong | Field ignores him; he wins the Nobel Prize in 2024 |
| 2025 | DeepSeek demonstrates 94% cost reduction using known techniques | Nvidia stock drops 17%; labs do not change their approach |

The monoculture this produced is measurable. Simpson’s diversity index $D = 1 - \sum_i p_i^2$, where $p_i$ is the fraction of papers in research topic $i$, applied to NeurIPS topics would show a steep collapse in diversity by 2023. In 2010, the top five NeurIPS research clusters by paper count were roughly equal in size. By 2023, transformer and scaling research represented over 60 percent of all accepted papers. The diversity collapsed to the point where a field studying intelligence became a field studying one architecture running on one hardware type optimised for one company’s revenue model.
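The index is trivial to compute, which makes the claim testable against topic counts from the proceedings. A sketch with illustrative numbers (the 2023 row assumes the 60-percent concentration cited above; the counts themselves are hypothetical):

```python
def simpson_diversity(counts):
    """Simpson's diversity index D = 1 - sum(p_i^2) over topic shares p_i."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Hypothetical topic counts, for shape only:
# 2010-style field: five clusters of roughly equal size.
# 2023-style field: one cluster holding >60% of accepted papers.
print(f"balanced field: D = {simpson_diversity([20, 20, 20, 20, 20]):.2f}")
print(f"monoculture:    D = {simpson_diversity([62, 12, 10, 9, 7]):.2f}")
```

The balanced field scores D = 0.80; the concentrated one drops to 0.58, and it falls further as the dominant cluster grows.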


What we actually know

The peer-reviewed record is unambiguous on certain points. Spiking neural networks achieve 100 to 1,000 times the energy efficiency of conventional hardware on compatible tasks. Neuro-symbolic systems learn from 10,000 to 100,000 times less data in structured domains. Capsule networks address spatial reasoning problems that CNNs demonstrably fail on. Mixture-of-experts provides a tenfold compute efficiency gain and was proven viable in 1991. Biological brains run on 20 watts while achieving general intelligence, a fact that sits there quietly, indicting every architectural choice we made.

Technical debt compounds. GPT-3 to GPT-4 was forty times the energy for marginal capability gain. GPT-4 to GPT-5 is ten times more energy for diminishing returns. Exponential cost for logarithmic improvement. Physical limits approach regardless of willingness to pay. When scaling stops, by choice or by force, the efficient alternatives will still be there. The mathematics does not care about funding decisions. Spiking networks will still be 100 times more efficient. Hybrids will still learn from less data. Vapnik’s generalisation bounds will still hold. The bound

\[R(\alpha) \leq R_{\text{emp}}(\alpha) + \sqrt{\frac{h\left(\ln\frac{2n}{h} + 1\right) - \ln\frac{\eta}{4}}{n}}\]

relating true risk $R(\alpha)$ to empirical risk $R_{\text{emp}}(\alpha)$ through VC dimension $h$, sample size $n$, and confidence $1-\eta$, is a theorem. It was true before GPT-4 and will be true after. It tells you exactly when your model is likely to generalise and when it is not. That is the one thing the current paradigm cannot tell you about itself. Pearl’s causal calculus will still be the only rigorous framework for interventional reasoning.
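The bound is also cheap to evaluate. A sketch of the confidence term for a hypothesis class of assumed VC dimension $h = 100$, showing the guaranteed gap between empirical and true risk shrinking as the sample size grows past capacity:

```python
import math

def vc_bound_gap(h, n, eta):
    """Confidence term in Vapnik's bound quoted above:
    sqrt((h * (ln(2n/h) + 1) - ln(eta/4)) / n)."""
    return math.sqrt((h * (math.log(2 * n / h) + 1) - math.log(eta / 4)) / n)

# With VC dimension h = 100 and confidence 1 - eta = 95%.
for n in (1_000, 100_000, 10_000_000):
    print(f"n = {n:>10,}: gap <= {vc_bound_gap(100, n, 0.05):.4f}")
```

At a thousand samples the guarantee is loose; at ten million it is tight. That tradeoff is computable in advance, which is exactly what no deployed transformer can tell you about itself.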

This is one feedback loop. CUDA releases in 2007. AlexNet wins ImageNet in 2012. Venture capital floods toward GPU-compatible approaches. Cloud providers build GPU infrastructure where revenue scales with consumption. Benchmarks get designed around GPU tasks. Papers optimise for benchmarks. Corporate hiring targets benchmark winners. PhD students learn what gets jobs. Professors in symbolic and neuromorphic retire without replacements. Universities drop courses in dead fields. Funding committees see no proposals in alternatives and conclude nobody is working on them. Path dependency completes. The loop reinforces. Breaking it requires intervention at multiple points simultaneously.

We chose algorithms fitting available hardware over algorithms fitting intelligence. We chose architectures scaling compute over architectures scaling understanding. We chose training maximising cloud revenue over training minimising energy. We chose benchmarks measuring GPU performance over benchmarks measuring cognition.

Judea Pearl built the mathematics of cause and effect and is still publishing at 88 while the field ignores him. Vladimir Vapnik proved the generalisation bounds that deep learning abandoned when they became inconvenient. Jürgen Schmidhuber was working on meta-learning and compression as intelligence before anyone used the phrase “deep learning.” Ray Solomonoff proved the optimal theory of inductive inference in 1964. Viktor Glushkov built distributed networked systems in Kyiv in the 1960s that anticipated remote industrial control, distributed computing, and automated management, his entire tradition erased from Western AI history when the Cold War ended. Kateryna Yushchenko invented the pointer in 1955, nine years before the man Western computer science credits for the same idea, and her language ran on the computers that controlled Apollo-Soyuz while the West was busy crediting someone else. Edward Shortliffe built a system in 1974 that diagnosed infections as accurately as specialists and could explain every step of its reasoning. We cannot deploy its modern equivalent in hospitals today because we abandoned the principles that made explainability possible.

These people built rigorous frameworks for intelligence. We had the frameworks. The field chose not to use them.

The question is not who was right about AI. The question is whether we explored enough of the possibility space to know what we were doing. The answer, measurably, is no.


References

  1. Hooker, S. (2020). “The Hardware Lottery.” arXiv:2009.06489
  2. Krizhevsky, A., Sutskever, I., and Hinton, G.E. (2012). “ImageNet Classification with Deep Convolutional Neural Networks.” NeurIPS 2012
  3. Pearl, J. (2018). “Theoretical Impediments to Machine Learning With Seven Sparks from the Causal Revolution.” arXiv:1801.04016
  4. Pearl, J. and Mackenzie, D. (2018). The Book of Why. Basic Books
  5. Vapnik, V. and Chervonenkis, A. (1971). “On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities.” Theory of Probability and Its Applications
  6. Cortes, C. and Vapnik, V. (1995). “Support-Vector Networks.” Machine Learning, 20(3)
  7. Hochreiter, S. and Schmidhuber, J. (1997). “Long Short-Term Memory.” Neural Computation, 9(8)
  8. Schmidhuber, J. (2022). “Annotated History of Modern AI and Deep Learning.” arXiv:2212.11279
  9. Solomonoff, R. (1964). “A Formal Theory of Inductive Inference.” Information and Control, 7(1)
  10. Rissanen, J. (1978). “Modeling by Shortest Data Description.” Automatica, 14(5)
  11. Kaplan, J. et al. (2020). “Scaling Laws for Neural Language Models.” arXiv:2001.08361
  12. Frankle, J. and Carbin, M. (2018). “The Lottery Ticket Hypothesis.” arXiv:1803.03635
  13. Jacobs, R. et al. (1991). “Adaptive Mixtures of Local Experts.” Neural Computation, 3(1)
  14. Shazeer, N. et al. (2017). “Outrageously Large Neural Networks.” arXiv:1701.06538
  15. Sabour, S., Frosst, N., and Hinton, G. (2017). “Dynamic Routing Between Capsules.” NeurIPS 2017
  16. Davies, M. et al. (2018). “Loihi: A Neuromorphic Manycore Processor.” IEEE Micro, 38(1)
  17. Lillicrap, T. et al. (2014). “Random Feedback Weights Support Learning.” arXiv:1411.0247
  18. Rao, R. and Ballard, D. (1999). “Predictive Coding in the Visual Cortex.” Nature Neuroscience, 2(1)
  19. Crick, F. (1989). “The Recent Excitement About Neural Networks.” Nature, 337
  20. Marcus, G. (2018). “Deep Learning: A Critical Appraisal.” arXiv:1801.00631
  21. Rahimi, A. (2017). NeurIPS Test of Time Award Talk. “Machine Learning is Alchemy.”
  22. Strubell, E. et al. (2019). “Energy and Policy Considerations for Deep Learning in NLP.” ACL 2019
  23. Patterson, D. et al. (2021). “Carbon Emissions and Large Neural Network Training.” arXiv:2104.10350
  24. Li, P. et al. (2023). “Making AI Less Thirsty.” arXiv:2304.03271
  25. PJM Independent Market Monitor (2025). Capacity Market Report. monitoringanalytics.com
  26. IEA (2025). Energy and AI. iea.org/reports/energy-and-ai
  27. Hinton, G. (2023). BBC and MIT Technology Review interviews, May 2023
  28. Sutskever, I. (2024). Interview with Reuters, November 2024
  29. Deng, J. et al. (2009). “ImageNet: A Large-Scale Hierarchical Image Database.” CVPR 2009
  30. Rumelhart, D.E., Hinton, G.E., and Williams, R.J. (1986). “Learning Representations by Back-Propagating Errors.” Nature, 323
  31. Herculano-Houzel, S. (2009). “The Human Brain in Numbers.” Frontiers in Human Neuroscience, 3
  32. Shortliffe, E.H. (1976). Computer-Based Medical Consultations: MYCIN. Elsevier
  33. Yu, V.L. et al. (1979). “Antimicrobial Selection by a Computer.” JAMA, 242(12)
  34. Buchanan, B. and Shortliffe, E.H. (1984). Rule-Based Expert Systems: The MYCIN Experiments. Addison-Wesley
  35. McDermott, J. (1980). “R1: An Expert in the Computer Systems Domain.” AAAI-80
  36. Barker, V.E. and O’Connor, D.E. (1989). “Expert Systems for Configuration at Digital: XCON and Beyond.” Communications of the ACM, 32(3)
  37. Peters, B. (2016). How Not to Network a Nation: The Uneasy History of the Soviet Internet. MIT Press
  38. Glushkov, V.M. (1975). Macroeconomic Models and Principles of Building OGAS. Statistics Publishing House, Moscow
  39. Kitova, O.V. and Kitov, V.A. (2018). “Anatoly Kitov and Victor Glushkov: Pioneers of Russian Digital Economy and Informatics.” IFIP History of Computing Conference. Springer
  40. Cottier, B. et al. (2024). “The Rising Costs of Training Frontier AI Models.” arXiv:2405.21015
  41. OpenAI (2023). “GPT-4 Technical Report.” arXiv:2303.08774
  42. Sutskever, I. (2023). Interview with The Verge, March 2023
  43. IEA (2025). Energy and AI Report. iea.org
  44. Heinrich Böll Foundation (2025). “AI Wants Our Water.” eu.boell.org
  45. Epoch AI (2024). “Tracking Large-Scale AI Models.” epoch.ai
  46. Manhaeve, R. et al. (2018). “DeepProbLog: Neural Probabilistic Logic Programming.” NeurIPS 2018
  47. Cusumano-Towner, M. et al. (2019). “Gen: A General-Purpose Probabilistic Programming System.” ACM SIGPLAN 2019
  48. Gerovitch, S. (2008). “InterNyet: Why the Soviet Union Did Not Build a Nationwide Computer Network.” History and Technology
  49. Pikhorovich, V. (2022). “Glushkov and His Ideas: Cybernetics of the Future.” Cosmonaut Magazine
  50. Yushchenko biography and Address Programming Language. Wikipedia; Ada Lovelace Day 2022; A Computer of One’s Own (Medium, 2018)
  51. OGAS project documentation and Glushkov biography. glushkov.su
  52. Peters, B. (2016). Aeon Essays: “How the Soviets invented the internet and why it didn’t work.” aeon.co
  53. Kagan, B.J. et al. (2022). “In vitro neurons learn and exhibit sentience when embodied in a simulated game-world.” Neuron, 110(24), 3952-3969. doi:10.1016/j.neuron.2022.09.001
  54. Gema, A.P. et al. (2024). “Are We Done with MMLU?” arXiv:2406.04127. University of Edinburgh.
  55. Besiroglu, T. and Sevilla, J. (2025). “Clarifying the creation and use of the FrontierMath benchmark.” Epoch AI. epoch.ai/blog/openai-and-frontiermath.
  56. Szegedy, C. et al. (2013). “Intriguing Properties of Neural Networks.” arXiv:1312.6199.
  57. Leike, J. (2024). Resignation statement, X (formerly Twitter), May 17, 2024.
  58. Autorité de la concurrence (2024). Opinion on the competitive functioning of the generative AI sector. June 2024. autoritedelaconcurrence.fr.
  59. OpenSecrets / TechCrunch (2025). “AI companies upped their federal lobbying spend in 2024.” techcrunch.com, January 24, 2025.
  60. Hinton, G. (2024). Nobel Prize lecture and associated interviews. Royal Swedish Academy of Sciences, October 2024.
  61. American Action Forum (2024). “The DOJ and Nvidia: AI Market Dominance and Antitrust Concerns.” americanactionforum.org.
  62. DeepSeek (2025). “DeepSeek-V3 Technical Report.” arXiv:2412.19437.
  63. OpenAI incorporation documents and IRS Form 990 filings (2015-2024).
  64. Shannon, C.E. (1950). “Programming a Computer for Playing Chess.” Philosophical Magazine, 41(314).
  65. Wiener, N. (1960). “Some Moral and Technical Consequences of Automation.” Science, 131(3410).
  66. Valiant, L.G. (1984). “A Theory of the Learnable.” Communications of the ACM, 27(11).
  67. Valiant, L.G. (2013). Probably Approximately Correct. Basic Books.
  68. Chaitin, G.J. (1987). Algorithmic Information Theory. Cambridge University Press.
  69. Chaitin, G.J. (1975). “A Theory of Program Size Formally Identical to Information Theory.” Journal of the ACM, 22(3).
  70. Tishby, N., Pereira, F.C., and Bialek, W. (2000). “The Information Bottleneck Method.” arXiv:physics/0004057.
  71. Landauer, R. (1961). “Irreversibility and Heat Generation in the Computing Process.” IBM Journal of Research and Development, 5(3).
  72. Henderson, P. et al. (2018). “Deep Reinforcement Learning That Matters.” AAAI 2018. arXiv:1709.06560.
  73. Bouthillier, X. et al. (2019). “Unreproducible Research is Reproducible.” ICML 2019. arXiv:1807.01774.
  74. Madry, A. et al. (2018). “Towards Deep Learning Models Resistant to Adversarial Attacks.” ICLR 2018. arXiv:1706.06083.
  75. Nori, H. et al. (2023). “Capabilities of GPT-4 on Medical Challenge Problems.” arXiv:2303.13375. Microsoft Research.
  76. Von Neumann, J. (1966). Theory of Self-Reproducing Automata. Edited by A.W. Burks. University of Illinois Press.
  77. Hoffmann, J. et al. (2022). “Training Compute-Optimal Large Language Models.” arXiv:2203.15556. (Chinchilla)
  78. Solomonoff, R. (1956). “An Inductive Inference Machine.” IRE Convention Record, Section on Information Theory, Part 2, pp. 56–62. (The first machine learning report, circulated at the Dartmouth conference.)
  79. Huang, Y. and Valtorta, M. (2006). “Pearl’s Calculus of Intervention Is Complete.” Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence. AUAI Press.
  80. Yu, Y. et al. (2023). “White-Box Transformers via Sparse Rate Reduction.” Advances in Neural Information Processing Systems 36 (NeurIPS 2023).
  81. Buchanan, S., Pai, D., Wang, P., and Ma, Y. (2025). Learning Deep Representations of Data Distributions. Pre-publication textbook, Toyota Technological Institute at Chicago / UC Berkeley.
This post is licensed under CC BY 4.0 by the author.