Scientists Discover Artificial Intelligence’s Fatal Flaw: Most Advanced Models Fail Basic Logic Tests

  • Large language models (LLMs) like ChatGPT show reasoning errors in many domains.

  • Identifying vulnerabilities is good for public safety, industry, and the scientists who make these models.

  • The human brain is remarkably adept at reasoning in ways that LLMs are not, and arguably cannot be.


In a groundbreaking new paper, scientists from Stanford, Caltech, and Carleton College combined existing research with new ideas to analyze the reasoning failures of large language models (LLMs) like ChatGPT and Claude. Those who rely on LLMs for intellectual work often cite the models’ reasoning ability as a major attraction, despite evidence that this ability is limited, even when dealing with simple questions. So what is the truth?

First, a quick primer. One of the main lines of criticism leveled by today’s AI skeptics goes something like this: large language models work just like your phone’s autocomplete (spicy autocomplete, so to speak). But there are significant differences. LLMs have a much longer attention span and a far bigger computing system behind them than your phone’s messaging app does. It comes down to data and processing power. Huge swaths of the public internet, books, magazines, academic journals – whatever is most relevant to a particular model – are turned into numerical representations that organize everything into enormous lists of numbers. Furthermore, while computing in general is nothing like the human brain, LLMs do have something in common with the way humans think. Given a prompt, both your brain and an LLM run through many possible paths and touch on a bunch of ideas before using logic to piece together an answer. We tend to think of computers as doing binary arithmetic, but LLMs start with college-level linear algebra and matrix math and get more complicated from there. A rough sketch of that math follows.
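
To make that concrete, here is a minimal toy sketch, in Python, of the kind of matrix math involved in predicting the next word. The tiny vocabulary, the random weights, and the averaging of context vectors are all simplifications invented for illustration; real models learn billions of parameters across many stacked layers.

```python
# Toy sketch of the matrix math behind next-word prediction.
# The vocabulary, weights, and dimensions are invented for illustration;
# real models learn billions of parameters across many stacked layers.
import numpy as np

rng = np.random.default_rng(0)

vocab = ["the", "cat", "sat", "on", "mat"]   # tiny stand-in vocabulary
d_model = 8                                  # toy embedding size

E = rng.normal(size=(len(vocab), d_model))   # one embedding vector per token
W = rng.normal(size=(d_model, len(vocab)))   # projection back to vocabulary scores

def next_token_probs(context_ids):
    """Average the context embeddings, project to scores, convert to probabilities."""
    h = E[context_ids].mean(axis=0)          # crude stand-in for a context vector
    logits = h @ W                           # the matrix multiplication step
    exp = np.exp(logits - logits.max())      # numerically stable softmax
    return exp / exp.sum()

probs = next_token_probs([vocab.index("the"), vocab.index("cat")])
for token, p in sorted(zip(vocab, probs), key=lambda pair: -pair[1]):
    print(f"{token}: {p:.2f}")
```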

All this behind-the-curtain math may give the impression that an LLM is thinking or feeling, but it is not. An LLM is, however, capable of certain types of associative reasoning – a technical and philosophical term meaning that it can consider information and apply logic to draw a conclusion. However, as the authors of the new research paper make clear, there are limits. “Despite these advances, significant reasoning failures persist, occurring even in seemingly simple scenarios.”

In their review, which is available now on the preprint site arXiv as well as through the journal Transactions on Machine Learning Research, the scientists classified LLM reasoning failures and identified common categories of errors, some of which are listed below. (You can also find a link to their repository of compiled references and research here.)

Individual cognitive reasoning

  • LLMs perpetuate human errors like bias and make other human-like errors because they lack the intuitive scaffolding that helps humans learn not to make those mistakes.

  • LLMs lack the basic executive functions (working memory, cognitive flexibility, and inhibitory control) that help people succeed in reasoning, leading to systemic failures in LLMs.

  • LLMs are weak at abstract reasoning, such as understanding relationships between intangible concepts (e.g., knowledge, trust, security) and selecting rules that apply to small sets.

  • LLMs show a human-like confirmation bias, favoring information that fits conclusions they have already reached.

  • LLMs show ordering and anchoring biases, such as overweighting the first item in a list.

Implicit social reasoning

  • LLMs fail at Theory of Mind tasks, such as inferring what someone is thinking, predicting behavior, making judgments, and suggesting actions.

  • LLMs fail with the moral and social rules that people learn in complex and subtle ways in real life.

  • “Without consistent and reliable moral reasoning, LLMs are not fully prepared for real-world decision-making involving ethical considerations.”

  • The sum of these errors results in less system robustness, meaning that LLMs are vulnerable to “jailbreaks and tampering”.

Explicit social reasoning

  • LLMs fail to maintain a coherent plan or pattern of reasoning over long interactions because they rely on local, shorter-term information. This can create disagreement between agents as that information changes.

  • LLMs are weak with “multi-step, jointly conditioned objectives” such as planning.

  • Cognitive biases and reliance on local information lead to errors that snowball.

Logic in natural language

  • LLMs cannot consistently perform “trivial” types of natural language logic, e.g., if A = B, then B = A. (See the consistency-check sketch after this list.)

  • Studies show systematic failures in basic two-hop reasoning, that is, combining just two facts.

  • “[S]tudies reveal LLM weaknesses in certain types of logic, such as causal inference and even superficial yes/no questions.”
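
To give a rough sense of how a reversal failure like “if A = B then B = A” gets measured, here is a small consistency-check sketch. The `ask_model` function below is a fake, order-sensitive stand-in that simulates the failure; in a real test it would be replaced by a call to whatever model is being evaluated.

```python
# Sketch of a reversal-consistency check: a model that accepts "A is B"
# should also accept "B is A". `ask_model` is a FAKE, order-sensitive
# stand-in that simulates the failure; swap in a real model client to test one.
def ask_model(prompt: str) -> str:
    known = {("Mary Ann Evans", "George Eliot")}   # fact stored in one direction only
    for a, b in known:
        if f"Is {a} also known as {b}" in prompt:
            return "yes"
    return "no"

def reversal_consistent(a: str, b: str) -> bool:
    forward = ask_model(f"Answer yes or no. Is {a} also known as {b}?")
    backward = ask_model(f"Answer yes or no. Is {b} also known as {a}?")
    return forward == backward

print(reversal_consistent("Mary Ann Evans", "George Eliot"))  # False: the B = A direction fails
```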

Arithmetic and mathematics

  • “Despite its simplicity, counting poses a notable fundamental challenge for LLMs, even for reasoning ones (Malek et al., 2025), which extend[s] to basic character-level operations such as reordering or substitution.” (See the grading sketch after this list.)

  • LLMs fail to assess and solve math word problems (MWPs) and struggle to analyze whether MWPs contain errors.
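
One reason counting and character-level failures are easy to document is that the ground truth is trivial to compute in ordinary code. The sketch below shows that kind of grading; the example word and the hard-coded “model answers” are illustrative placeholders, not outputs from any real model.

```python
# Grading counting and character-level tasks is easy because plain code
# gives exact ground truth. The "model answers" passed in below are
# illustrative placeholders, not outputs from any real model.
def grade_count(word: str, letter: str, model_answer: int) -> bool:
    return model_answer == word.count(letter)

def grade_reverse(word: str, model_answer: str) -> bool:
    return model_answer == word[::-1]

print(grade_count("strawberry", "r", 2))          # False: the correct count is 3
print(grade_reverse("strawberry", "yrrebwarts"))  # True
```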

Reasoning in embodied environments

  • LLMs fail at “even basic physical reasoning”, such as knowing where things are located in a given scenario.

  • LLMs fail at scientific reasoning because it requires more steps and logic.

Physical reasoning failures in the real 3-D world

  • LLMs fail with spatial tasks such as moving objects to correct locations.

  • Plans generated by LLMs for robotic tasks change with prompt wording and are vulnerable to tampering techniques such as jailbreaking to access private data.

  • LLMs have poor self-awareness and need external structures to incorporate feedback going forward.

The news sounds bad (and it is), but identifying weaknesses and working to mitigate them is key to developing any model or product. The failures of today’s LLMs can be instructive for building better AI architectures in the future. For example, the scholars point to architecture and training as areas where major improvements are feasible: “[R]oot-cause analyses in these categories are particularly rich, suggesting meaningful methods not only for mitigating specific failures, but also for overall improvement of the architecture and our understanding of it.” In other words, large language models are great for a lot of things, but they are not the path to artificial general intelligence.

The scientists also suggested some domain-wide structures for improvement:

1. Root cause analysis for all types of reasoning failures that LLMs display.

2. Unified, persistent benchmarks for all types of reasoning failures; “Such benchmarks should preserve historically challenging cases while incorporating newly discovered ones.”

3. Principles of failure injection, applied “by adding adversarial sections, multi-level task difficulty, or multi-domain compositions designed to trigger known weaknesses.” (A minimal sketch of this idea follows the list.)

4. “[D]ynamic and event-driven benchmarks could combat overfitting and encourage continuous improvement.”
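
As a rough illustration of the failure-injection idea in item 3, the sketch below takes an existing benchmark item and appends an adversarial distractor that leaves the correct answer unchanged. The data structure and the distractor text are assumptions for illustration, not the paper’s actual benchmark format.

```python
# Minimal sketch of "failure injection": take an existing benchmark item and
# add an adversarial distractor designed to trigger a known weakness, without
# changing the correct answer. The data structure and distractor text here
# are assumptions for illustration, not the paper's benchmark format.
from dataclasses import dataclass

@dataclass
class BenchmarkItem:
    question: str
    answer: str

def inject_distractor(item: BenchmarkItem, distractor: str) -> BenchmarkItem:
    """Return a harder variant: same answer, with an irrelevant detail appended."""
    return BenchmarkItem(question=f"{item.question} {distractor}", answer=item.answer)

base = BenchmarkItem(
    question="A farmer has 3 pens with 4 sheep in each pen. How many sheep are there in total?",
    answer="12",
)
hard = inject_distractor(base, "Each pen cost $250 to build five years ago.")
print(hard.question)   # same arithmetic, now with a numeric distractor appended
```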

“Overall,” the researchers concluded, “the systematic study of reasoning failures in LLMs parallels fault tolerance research in early computing and incident analysis in safety-critical industries: understanding and classifying failures is a prerequisite for building resilient systems.”
