Technical Research Survey — 80+ Papers

Why Large Language Models
Cannot Reach Artificial General Intelligence

A deep technical survey across five research fronts: reasoning failures, memory limitations, ARC-AGI benchmarks, neuroscience comparisons, and proposed solutions. The evidence converges on a clear conclusion — LLMs are sophisticated interpolation engines whose capabilities are fundamentally bounded by the autoregressive next-token prediction objective.

80+: papers surveyed across NeurIPS, ICML, arXiv, Nature
~3%: o3's score on ARC-AGI-2 (humans avg 60%)
7: core capabilities humans have that LLMs lack
5: hybrid approaches showing genuine promise
Angle 1
Reasoning & Abstraction Failures
LLMs do not reason — they pattern-match on training data. Performance collapses under trivial modifications to familiar problems, revealing that models are doing sophisticated interpolation, not systematic inference.

Core finding: Transformers solve multi-step reasoning by reducing it to "linearized subgraph matching" — matching previously seen computation fragments. When a single irrelevant clause is added to grade-school math problems, accuracy drops up to 65%. Chain-of-thought explanations are post-hoc rationalization, not genuine reasoning traces.

NeurIPS 2023
Faith and Fate: Limits of Transformers on Compositionality
Dziri et al. — Allen Institute for AI
Transformers reduce multi-step reasoning to linearized subgraph matching. GPT-4 achieves only 59% on 3×3 digit multiplication. Performance approaches zero as complexity grows.
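Dziri et al. frame tasks like multi-digit multiplication as computation graphs whose size grows with problem width. A toy sketch of that framing (my own illustration, not the paper's graph construction) counts the primitive digit-level operations in schoolbook multiplication:

```python
def schoolbook_multiply(a: int, b: int):
    """Schoolbook multiplication, tracking each primitive digit product.

    Toy illustration of the computation-graph framing: every digit pair
    contributes one multiply node, before any carry/addition nodes.
    """
    total = 0
    n_products = 0
    for i, da in enumerate(reversed(str(a))):
        for j, db in enumerate(reversed(str(b))):
            total += int(da) * int(db) * 10 ** (i + j)
            n_products += 1
    return total, n_products

result, n_products = schoolbook_multiply(123, 456)
print(result, n_products)  # 56088, 9 digit products for a 3x3-digit problem
```

Even before carry propagation is counted, a 3×3-digit problem requires composing nine exact sub-results, which is the regime where GPT-4 falls to 59%.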
ICLR 2025
GSM-Symbolic: Understanding Limitations of Mathematical Reasoning in LLMs
Mirzadeh et al. — Apple Research
Performance drops up to 65% when an irrelevant clause is added. Concluded: "We found no evidence of formal reasoning in language models." All frontier models affected.
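The perturbation behind that drop can be sketched in a few lines. The problem text below is illustrative, loosely modeled on examples discussed around the paper's "GSM-NoOp" variant, not copied from the benchmark:

```python
# Sketch of a GSM-Symbolic / GSM-NoOp style perturbation (illustrative text,
# not drawn from the actual benchmark).
base_problem = (
    "Oliver picks 44 kiwis on Friday and 58 kiwis on Saturday. "
    "How many kiwis does Oliver have?"
)

# A "No-Op" clause: numerically irrelevant (the answer is still 44 + 58 = 102),
# yet Mirzadeh et al. report large accuracy drops when such distractors appear.
noop_clause = "Five of Saturday's kiwis were a bit smaller than average. "
perturbed_problem = base_problem.replace("How many", noop_clause + "How many")

print(perturbed_problem)
```

A model doing formal arithmetic would ignore the clause; models that subtract the five kiwis are matching surface patterns ("smaller" suggests subtraction) rather than reasoning.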
ICML 2024
LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks
Kambhampati et al. — Arizona State University
LLMs function as approximate knowledge sources (System 1) but cannot perform principled planning or self-verification. Even o1 fails to saturate planning benchmarks.
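The LLM-Modulo architecture pairs an LLM generator with a sound external verifier in a generate-test loop. A minimal sketch of that loop, with hypothetical interface names (the paper describes the architecture, not this API):

```python
def llm_modulo(propose, verify, max_iters=10):
    """Generate-test loop: an LLM proposes candidate plans; a sound external
    verifier accepts them or returns a critique fed back to the proposer.
    Correctness guarantees come from the verifier, not the LLM."""
    feedback = None
    for _ in range(max_iters):
        candidate = propose(feedback)
        ok, feedback = verify(candidate)
        if ok:
            return candidate
    return None  # no verified plan within budget

# Toy stand-ins: the "LLM" guesses increasing numbers; the verifier wants 3.
guesses = iter(range(10))
plan = llm_modulo(
    propose=lambda fb: next(guesses),
    verify=lambda c: (c == 3, "too low" if c < 3 else "too high"),
)
print(plan)  # 3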
ICLR 2024
The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"
Berglund et al. — Oxford
GPT-4 answers "Who is Tom Cruise's mother?" correctly 79% of the time, but "Who is Mary Lee Pfeiffer's son?" only 33% of the time. Basic bidirectional inference fails across all model sizes.
PNAS 2024
Embers of Autoregression: Understanding LLMs Through the Problem They're Trained to Solve
McCoy et al. — Yale
GPT-4 accuracy on a simple shift cipher drops from 51% to 13% depending purely on whether the output is a high- or low-probability token sequence. Performance is governed by statistics, not logic.
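The shift cipher itself is a trivial deterministic algorithm, which is what makes the result striking. A short sketch (the example string is mine, not from the paper):

```python
def shift_decode(ciphertext: str, shift: int) -> str:
    """Decode a Caesar/shift cipher by shifting each letter back."""
    out = []
    for ch in ciphertext:
        if ch.isalpha():
            base = ord("a") if ch.islower() else ord("A")
            out.append(chr((ord(ch) - base - shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

# The algorithm is identical for every shift value; McCoy et al.'s point is
# that GPT-4's accuracy depends on how probable the *decoded output* is as
# English text, not on the difficulty of the algorithm.
print(shift_decode("Uryyb, jbeyq!", 13))  # rot-13 -> "Hello, world!"
```

Any system executing the procedure would be shift-invariant; sensitivity to output probability is the signature of statistical prediction rather than rule following.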
NeurIPS 2024
Chain of Thoughtlessness? An Analysis of CoT in Planning
Stechly et al. — Arizona State University
CoT gains on planning tasks are highly sensitive to prompt wording and do not transfer to semantically equivalent but differently phrased prompts. Improvements reflect narrow pattern matching.
Accuracy degradation on modified benchmarks:
GSM-Symbolic, irrelevant clause added: −65%
3×3 digit multiplication (GPT-4): −41%
Reversal Curse, "Who is B?" vs "Who is A?": −46%
Shift cipher, low- vs high-probability output: −38%
Alice-in-Wonderland trivial reasoning task: below 50% accuracy
Survey covers 80+ papers · NeurIPS · ICML · ICLR · Nature · arXiv · 2018–2026 · Last updated March 2026