Technical Research Survey — 80+ Papers

Why Large Language Models
Cannot Reach Artificial General Intelligence

A deep technical survey across five research fronts: reasoning failures, memory limitations, ARC-AGI benchmarks, neuroscience comparisons, and proposed solutions. The evidence converges on a clear conclusion — LLMs are sophisticated interpolation engines whose capabilities are fundamentally bounded by the autoregressive next-token prediction objective.

80+: papers surveyed across NeurIPS, ICML, arXiv, Nature
~3%: o3's score on ARC-AGI-2 (humans avg 60%)
7: core capabilities humans have that LLMs lack
5: hybrid approaches showing genuine promise
Angle 1
Reasoning & Abstraction Failures
LLMs do not reason — they pattern-match on training data. Performance collapses under trivial modifications to familiar problems, revealing that models are doing sophisticated interpolation, not systematic inference.

Core finding: Transformers solve multi-step reasoning by reducing it to "linearized subgraph matching" — matching previously seen computation fragments. When a single irrelevant clause is added to grade-school math problems, accuracy drops up to 65%. Chain-of-thought explanations are post-hoc rationalization, not genuine reasoning traces.

NeurIPS 2023
Faith and Fate: Limits of Transformers on Compositionality
Dziri et al. — Allen Institute for AI
Transformers reduce multi-step reasoning to linearized subgraph matching. GPT-4 achieves only 59% on 3×3 digit multiplication. Performance approaches zero as complexity grows.
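Dziri et al. frame tasks like multi-digit multiplication as computation graphs whose size grows with problem width. A toy sketch of that framing (my own illustration, not the paper's graph construction) counts the primitive digit-level operations in schoolbook multiplication:

```python
def schoolbook_multiply(a: int, b: int):
    """Schoolbook multiplication, tracking each primitive digit product.

    Toy illustration of the computation-graph framing: every digit pair
    contributes one multiply node, before any carry/addition nodes.
    """
    total = 0
    n_products = 0
    for i, da in enumerate(reversed(str(a))):
        for j, db in enumerate(reversed(str(b))):
            total += int(da) * int(db) * 10 ** (i + j)
            n_products += 1
    return total, n_products

result, n_products = schoolbook_multiply(123, 456)
print(result, n_products)  # 56088, 9 digit products for a 3x3-digit problem
```

Even before carry propagation is counted, a 3×3-digit problem requires composing nine exact sub-results, which is the regime where GPT-4 falls to 59%.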
ICLR 2025
GSM-Symbolic: Understanding Limitations of Mathematical Reasoning in LLMs
Mirzadeh et al. — Apple Research
Performance drops up to 65% when an irrelevant clause is added. Concluded: "We found no evidence of formal reasoning in language models." All frontier models affected.
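The perturbation behind that drop can be sketched in a few lines. The problem text below is illustrative, loosely modeled on examples discussed around the paper's "GSM-NoOp" variant, not copied from the benchmark:

```python
# Sketch of a GSM-Symbolic / GSM-NoOp style perturbation (illustrative text,
# not drawn from the actual benchmark).
base_problem = (
    "Oliver picks 44 kiwis on Friday and 58 kiwis on Saturday. "
    "How many kiwis does Oliver have?"
)

# A "No-Op" clause: numerically irrelevant (the answer is still 44 + 58 = 102),
# yet Mirzadeh et al. report large accuracy drops when such distractors appear.
noop_clause = "Five of Saturday's kiwis were a bit smaller than average. "
perturbed_problem = base_problem.replace("How many", noop_clause + "How many")

print(perturbed_problem)
```

A model doing formal arithmetic would ignore the clause; models that subtract the five kiwis are matching surface patterns ("smaller" suggests subtraction) rather than reasoning.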
ICML 2024
LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks
Kambhampati et al. — Arizona State University
LLMs function as approximate knowledge sources (System 1) but cannot perform principled planning or self-verification. Even o1 fails to saturate planning benchmarks.
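The LLM-Modulo architecture pairs an LLM generator with a sound external verifier in a generate-test loop. A minimal sketch of that loop, with hypothetical interface names (the paper describes the architecture, not this API):

```python
def llm_modulo(propose, verify, max_iters=10):
    """Generate-test loop: an LLM proposes candidate plans; a sound external
    verifier accepts them or returns a critique fed back to the proposer.
    Correctness guarantees come from the verifier, not the LLM."""
    feedback = None
    for _ in range(max_iters):
        candidate = propose(feedback)
        ok, feedback = verify(candidate)
        if ok:
            return candidate
    return None  # no verified plan within budget

# Toy stand-ins: the "LLM" guesses increasing numbers; the verifier wants 3.
guesses = iter(range(10))
plan = llm_modulo(
    propose=lambda fb: next(guesses),
    verify=lambda c: (c == 3, "too low" if c < 3 else "too high"),
)
print(plan)  # 3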
ICLR 2024
The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"
Berglund et al. — Oxford
GPT-4 answers "Who is Tom Cruise's mother?" correctly 79% of the time, but "Who is Mary Lee Pfeiffer's son?" only 33% of the time. Basic bidirectional inference fails across all model sizes.
PNAS 2024
Embers of Autoregression: Understanding LLMs Through the Problem They're Trained to Solve
McCoy et al. — Yale
GPT-4 accuracy on a simple shift cipher drops from 51% to 13% depending purely on whether the output is a high- or low-probability token sequence. Performance is governed by statistics, not logic.
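The shift cipher itself is a trivial deterministic algorithm, which is what makes the result striking. A short sketch (the example string is mine, not from the paper):

```python
def shift_decode(ciphertext: str, shift: int) -> str:
    """Decode a Caesar/shift cipher by shifting each letter back."""
    out = []
    for ch in ciphertext:
        if ch.isalpha():
            base = ord("a") if ch.islower() else ord("A")
            out.append(chr((ord(ch) - base - shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

# The algorithm is identical for every shift value; McCoy et al.'s point is
# that GPT-4's accuracy depends on how probable the *decoded output* is as
# English text, not on the difficulty of the algorithm.
print(shift_decode("Uryyb, jbeyq!", 13))  # rot-13 -> "Hello, world!"
```

Any system executing the procedure would be shift-invariant; sensitivity to output probability is the signature of statistical prediction rather than rule following.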
NeurIPS 2024
Chain of Thoughtlessness? An Analysis of CoT in Planning
Stechly et al. — Arizona State University
CoT gains on planning tasks are highly sensitive to prompt wording and do not transfer to semantically equivalent but differently phrased prompts. Improvements reflect narrow pattern matching.
Accuracy degradation on modified benchmarks:
GSM-Symbolic, irrelevant clause added: −65%
3×3 digit multiplication (GPT-4): −41%
Reversal Curse, "Who is B?" vs "Who is A?": −46%
Shift cipher, low- vs high-probability output: −38%
Alice-in-Wonderland trivial reasoning task: below 50% accuracy
Survey covers 80+ papers · NeurIPS · ICML · ICLR · Nature · arXiv · 2018–2026 · Last updated March 2026