The Real Illusion in Apple’s “Illusion of Thinking” Paper

Ron Green, Co-founder, Chief Technology Officer

A few weeks ago, Apple’s AI research group published a paper titled “The Illusion of Thinking.” It made some waves by claiming that large language models (LLMs) dramatically fail when faced with true reasoning tasks, specifically the classic Tower of Hanoi puzzle.

The implication was clear: LLMs don’t really “think.” They just stitch together language in a way that looks smart until they hit complexity, and then the illusion collapses.

I was skeptical of these claims from the start.

For context, the Tower of Hanoi is a logic puzzle that involves moving disks between three pegs under specific rules: only one disk at a time, and you can’t place a larger disk on a smaller one. The difficulty ramps up fast. The number of required steps grows exponentially with each additional disk.
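To make that growth concrete, here is a minimal Python sketch of the standard recursive solution (the function and variable names are my own). The optimal solution for n disks always takes 2^n - 1 moves, so the count roughly doubles with every disk you add:

```python
# Minimal sketch of the classic recursive Tower of Hanoi solution.
def hanoi(n, source, target, spare, moves):
    """Append the moves needed to shift n disks from source to target."""
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)  # move n-1 disks out of the way
    moves.append((source, target))              # move the largest disk
    hanoi(n - 1, spare, target, source, moves)  # restack the n-1 disks on top

for n in (3, 10, 15, 20):
    moves = []
    hanoi(n, "A", "C", "B", moves)
    assert len(moves) == 2**n - 1               # optimal move count is 2^n - 1
    print(f"{n} disks: {len(moves):,} moves")
```

Three disks take 7 moves; twenty disks take 1,048,575. That exponential blowup is what matters for the critiques below.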

Apple’s study forced LLMs to output every move explicitly, no shortcut algorithms allowed. And once the number of disks reached a certain point, performance fell to zero. Literally zero.

However, a critical analysis of the paper reveals significant shortcomings.

Here’s what stood out:

1. Context Window Limitations – By the time you hit 15 disks, you’re looking at over 32,000 moves. At 20 disks, it’s over a million. The paper used models with a 64,000-token context window, so the full move-by-move solution literally could not fit in the output (see the token-count sketch just after this list). The failure wasn’t about reasoning. It was about real estate.

2. Evaluation Methodology – Models that generated correct recursive or high-level logic were marked incorrect because they didn’t provide step-by-step solutions. That penalizes abstraction, the very thing they were trying to measure. The test essentially punished thinking in favor of brute-force doing.

3. Sampling Parameters – They used a sampling temperature of 1.0, which leaves substantial run-to-run variance in the output (the temperature sketch below shows the effect). In other words, the models weren’t being asked to provide consistent or optimal answers. A more rigorous study would’ve used deterministic decoding or a large sample size to separate luck from skill.
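To put the context-window point in item 1 in concrete terms, here is a rough back-of-the-envelope sketch. The figure of roughly 10 output tokens per explicitly written move is my own assumption, not a number from the paper:

```python
# Back-of-the-envelope check: can the full move list even fit in the output?
# ASSUMPTION: roughly 10 output tokens per explicitly written move
# (e.g. "Move disk 3 from peg A to peg C"); my estimate, not the paper's.
TOKENS_PER_MOVE = 10
CONTEXT_WINDOW = 64_000  # token budget cited in the critique above

for disks in (10, 12, 15, 20):
    moves = 2**disks - 1                     # optimal move count
    tokens_needed = moves * TOKENS_PER_MOVE
    verdict = "fits" if tokens_needed <= CONTEXT_WINDOW else "cannot fit"
    print(f"{disks} disks: {moves:>9,} moves ~ {tokens_needed:>10,} tokens -> {verdict}")
```

Even at a frugal two tokens per move, 15 disks already needs about 65,500 tokens, which is more than the entire 64,000-token budget before the model has written a single word of explanation.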
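And on the sampling point in item 3, here is a small sketch, using plain NumPy and toy logits of my own choosing, of how temperature reshapes the next-token distribution. At 1.0 the probability mass stays spread across candidates, so repeated runs can diverge; pushing the temperature toward zero concentrates it on the top token and makes decoding effectively deterministic:

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Convert raw logits to next-token probabilities at a given temperature."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()                  # numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()

# Toy logits for four candidate next tokens (illustrative values only).
logits = [2.0, 1.5, 1.0, 0.2]

for t in (1.0, 0.5, 0.1):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: {np.round(probs, 3)}")
# temperature=1.0 leaves the mass spread across candidates;
# temperature=0.1 puts nearly all of it on the top token.
```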

And here's the kicker: every one of these critiques came from OpenAI’s o3 model, the same kind of LLM the paper claimed to debunk. A reasoning model demonstrated that the supposed failures were not intrinsic limitations but rather artifacts of experimental design.

That’s not just ironic, it’s inspiring. Because it shows that we may be entering a new phase in how models reason, reflect, and even critique their own limits.

The illusion isn’t that these models appear to think.

The illusion is believing that success or failure on a single task like this is definitive proof of reasoning capabilities or lack thereof.

Intelligence and reasoning are multifaceted, and evaluating them requires nuanced, well-designed approaches that consider context, abstraction, and consistency.

If you’re curious about this topic, we talked about it in this episode of Hidden Layers.
