
The Real Breakthrough Behind DeepSeek R1
Many friends, family, and clients have asked for my thoughts on the DeepSeek AI models and their market impact, so I thought I'd share them here in case they are helpful.
First, I don’t think the new DeepSeek model will change the need for massive compute and energy. In fact, I bought a bunch of Nvidia Monday morning when it dipped. Here's why:
- Efficiency fuels demand: More efficient compute doesn’t reduce the need for compute—it enables even larger and more ambitious models.
- Lower costs lead to higher usage: Cheaper inference will drive skyrocketing adoption and greater overall compute demand (Jevons paradox; see the toy calculation after this list).
- New behaviors emerge at scale: Larger models have unique capabilities that only emerge as their parameter count and training data increase, reinforcing the push for scale.
- Size matters: Both biological brains and AI show that intelligence scales with size. Larger models (almost) always outperform smaller ones.
- Compute wins long-term: Rich Sutton's "Bitter Lesson" argues that general methods which leverage massive compute ultimately outperform clever hand-engineered optimizations.
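To make the Jevons point concrete, here's a toy back-of-the-envelope calculation. The prices and volumes are invented for illustration, not real market data:

```python
# Toy Jevons-paradox arithmetic with invented numbers (not market data):
# a 10x drop in cost per token, met by elastic demand that grows 30x,
# INCREASES total compute spend.

cost_per_1k_tokens = 0.010           # hypothetical baseline price in dollars
baseline_tokens = 1e12               # hypothetical baseline monthly usage

new_cost = cost_per_1k_tokens / 10   # 10x efficiency gain
new_tokens = baseline_tokens * 30    # demand grows faster than cost falls

spend_before = cost_per_1k_tokens * baseline_tokens / 1_000
spend_after = new_cost * new_tokens / 1_000
print(f"before: ${spend_before:,.0f}  after: ${spend_after:,.0f}")
# before: $10,000,000  after: $30,000,000
```

Cheaper tokens, yet more total dollars flowing into compute. That's the bet.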
One way to think about AI progress is that it’s really just a function of compute. The more compute we throw at the problem, the better the models get. Algorithmic improvements matter, but history shows that scaling data and compute ultimately wins.
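A rough way to picture this is the power-law form that published scaling laws take, where loss keeps falling as compute grows. Here's a sketch in that spirit; every constant is invented purely to show the shape of the curve, not a fitted value:

```python
# Loss as a power law in training compute, in the spirit of published
# scaling laws (Kaplan et al. 2020; Hoffmann et al. 2022). All constants
# are made up for illustration.

def loss(compute_flops: float, l_irreducible: float = 1.7,
         c0: float = 1e18, alpha: float = 0.05) -> float:
    """Irreducible loss plus a term that shrinks as compute grows."""
    return l_irreducible + (c0 / compute_flops) ** alpha

for flops in (1e21, 1e23, 1e25):
    print(f"{flops:.0e} FLOPs -> loss {loss(flops):.2f}")
# 1e+21 FLOPs -> loss 2.41
# 1e+23 FLOPs -> loss 2.26
# 1e+25 FLOPs -> loss 2.15
```

The curve bends, but it never stops rewarding more compute.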
Training, data generation, reinforcement learning — all of these collapse into compute at some level. Synthetic data generation means you can trade compute for data. Reinforcement learning means you can trade compute for experience. The distinction between compute, data, and learning gets fuzzier over time, but the underlying dynamic stays the same: more compute means better AI.
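Here's a minimal sketch of what "trading compute for data" can look like in practice, in the spirit of rejection-sampling self-training loops like STaR. The `model.sample` and `check_answer` hooks are hypothetical stand-ins, not any real API:

```python
# Hedged sketch of trading compute for data via rejection sampling.
# `model.sample` and `check_answer` are hypothetical stand-ins.

def generate_synthetic_dataset(model, problems, check_answer,
                               samples_per_problem=16):
    """Spend extra inference compute to mint verified training data."""
    dataset = []
    for problem in problems:
        for _ in range(samples_per_problem):     # more compute spent here...
            solution = model.sample(problem)     # ...yields more candidates
            if check_answer(problem, solution):  # keep only verified answers
                dataset.append((problem, solution))
    return dataset  # verified pairs become supervised training data
```

Every extra sample is compute converted directly into training data.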
There are several important things about DeepSeek R1 (clever training optimizations, using pure RL at scale, etc.). But by far the most significant is that they documented their training process in a technical report, so we know how they did it. This is huge. It marks a major shift in reasoning model research. Until now, there hasn’t been a definitive guide to training these models. Researchers had to piece things together from scattered papers and trial and error. This changes that. Having a clear blueprint will likely accelerate progress in reasoning language models, making 2025 a big year for their development and deployment.
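For the curious, the core trick in the R1 report is GRPO: sample a group of answers per prompt, score them with simple rule-based rewards, and rank each answer against its own group instead of training a separate value model. Here's a simplified sketch of that group-relative advantage (real training also adds clipped probability ratios and a KL penalty against a reference model):

```python
import numpy as np

# Simplified sketch of the group-relative advantage at the heart of GRPO,
# the RL algorithm described in the R1 report.

def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Score each sampled answer relative to its own group.

    rewards: rule-based rewards for G answers sampled from one prompt,
    e.g. 1.0 if the final answer is correct, else 0.0.
    """
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Example: 8 sampled answers to one prompt, 3 of them correct.
print(grpo_advantages(np.array([1, 0, 0, 1, 0, 0, 1, 0], dtype=float)))
```

The elegance is that a checkable reward (did the answer come out right?) plus relative ranking is enough to make reasoning emerge.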
In short, DeepSeek R1 is impressive, but its real significance lies in its transparency. By openly documenting their training process, they’ve set a new precedent for the AI community. This will likely lead to faster iteration, better models, and a deeper understanding of what it takes to train state-of-the-art reasoning models.
For investors, builders, and researchers, the takeaway is clear: AI progress is still fundamentally about compute, and the race to scale continues. 2025 is shaping up to be a pivotal year. Buckle up.
AI teaching AI: Techniques like distillation demonstrate an accelerating loop in which AI models teach other AI models, driving rapid innovation and demand for compute. It seems plausible that DeepSeek benefited from distillation in training R1. That, though, is itself an important takeaway: we now have a situation where AI models are teaching AI models, and where AI models are teaching themselves. We are watching the assembly of an AI takeoff scenario in real time.
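For reference, here's what distillation looks like in its classic logit-matching form (Hinton et al. 2015). Note that the distillation described in the R1 report is simpler still: supervised fine-tuning of smaller models on reasoning traces generated by R1.

```python
import torch
import torch.nn.functional as F

# Classic logit distillation (Hinton et al. 2015): the student mimics
# the teacher's softened output distribution, not just hard labels.

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """KL divergence pulling the student toward the teacher."""
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    # t**2 keeps gradient magnitudes comparable across temperatures
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * t * t
```

Either way, the pattern is the same: a stronger model's outputs become a weaker model's curriculum.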
When I read the R1 paper, I assumed that DeepSeek bootstrapped it, via distillation, on data from other frontier models. Interestingly, I don’t think that’s much of an indictment. I expect all significant models to be trained on the output of previous generations from this point on. I think we have legitimately entered the phase where models are strong enough to bootstrap the next generation indefinitely.
Between the emergent reasoning that R1 showed at scale with simple reinforcement learning and the use of distillation, I think we are entering the final phase before AGI. I did a lot of hard thinking over the weekend and I can see it now. Autoregressive models will work and I can explain why. We’re three to five years away from AGI.