On DeepSeek and Export Controls by Dario Amodei
Dario Amodei writes on DeepSeek and export controls
in January 2025, as narrated by
the A OK voice bot from ElevenLabs. Let's begin.
A few weeks ago, I made the case for stronger US export controls
on chips to China. Since then, DeepSeek,
a Chinese AI company, has managed to, at least
in some respects, come close to the performance of US frontier
AI models at lower cost.
Here, I won't focus on whether DeepSeek is or
isn't a threat to US AI companies like Anthropic, although I
do believe many of the claims about their threat to US AI leadership are
greatly overstated.
Instead, I'll focus on whether DeepSeek's releases
undermine the case for those export control policies on chips.
I don't think they do. In fact, I think they make export control
policies even more existentially important than
they were a week ago. Export controls serve a vital purpose:
keeping democratic nations at the forefront of AI development.
To be clear, they're not a way to duck the competition between the US
and China. In the end, AI companies in the US
and other democracies must have better models than those in China if we want
to prevail. But we shouldn't hand the Chinese Communist Party technological
advantages when we don't have to.

Three Dynamics of AI Development

Before I make my policy argument,
I'm going to describe three basic dynamics of AI systems that
it's crucial to understand.

1. Scaling laws.
A property of AI, which I and my co-founders were among the first
to document back when we worked at OpenAI, is that, all
else equal, scaling up the training of AI systems
leads to smoothly better results on a range of cognitive tasks
across the board. So, for example,
a $1 million model might solve 20% of important
coding tasks, a $10 million model
might solve 40%, a $100 million model might
solve 60%,
and so on. These differences tend to have huge implications
in practice. Another factor of 10 may correspond to the difference
between an undergraduate and PhD skill level, and thus
companies are investing heavily in training these models.
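
The illustrative numbers above can be put into a small sketch. It takes the $1 million and $10 million anchor points from the text (20% and 40%) and assumes the same gain for every further tenfold increase in spend. The log-linear form, the function name `solve_rate_pct`, and the constants are assumptions made for illustration, not a fitted scaling law.

```python
import math

# Anchor points from the illustrative example in the text:
# a $1M model solves 20% of tasks, a $10M model solves 40%.
# Assumption of this sketch: the same +20 points per 10x of spend continues.
BASE_COST_USD = 1_000_000
BASE_SOLVE_PCT = 20.0
GAIN_PCT_PER_TENFOLD = 20.0


def solve_rate_pct(training_cost_usd: float) -> float:
    """Hypothetical percentage of coding tasks solved at a given training cost."""
    tenfolds = math.log10(training_cost_usd / BASE_COST_USD)
    pct = BASE_SOLVE_PCT + GAIN_PCT_PER_TENFOLD * tenfolds
    # A percentage cannot leave the range 0..100.
    return max(0.0, min(100.0, pct))


if __name__ == "__main__":
    for cost in (1e6, 1e7, 1e8):
        print(f"${cost:,.0f} model: about {solve_rate_pct(cost):.0f}% of tasks")
```

Each tenfold increase in cost buys a fixed step in capability under this assumption, which is why the curve is usually drawn against the logarithm of spend.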

2. Shifting the curve.
The field is constantly coming
up with ideas, large and small, that
make things more effective or efficient.
It could be an improvement to the architecture of the model,
a tweak to the basic transformer architecture that all of today's
models use, or simply a way of running the model more efficiently on
the underlying hardware. New generations of
hardware also have the same effect.
What this typically does is shift the curve. If the innovation
is a 2 times compute multiplier, then it allows
you to get 40% on a coding task for $5 million instead of
$10 million, or 60% for $50 million instead
of $100 million, etc.
Every frontier AI company regularly discovers many of these compute multipliers (CMs):
frequently small ones, approximately 1.2 times,
sometimes medium-sized ones, approximately 2 times,
and every once in a while very large ones, approximately 10 times.
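
A minimal sketch of what a compute multiplier does to the cost of reaching a fixed score, reusing the illustrative curve from the text (20% at $1 million, plus 20 points per tenfold increase in spend). The function name `cost_for_score_usd` and the log-linear curve are assumptions of this sketch.

```python
# Illustrative curve from the text: 20% at $1M, +20 points per 10x of spend.
# The log-linear shape is an assumption of this sketch.
BASE_COST_USD = 1_000_000
BASE_SOLVE_PCT = 20.0
GAIN_PCT_PER_TENFOLD = 20.0


def cost_for_score_usd(target_pct: float, compute_multiplier: float = 1.0) -> float:
    """Hypothetical training cost needed to reach target_pct on the curve.

    A compute multiplier of k means every dollar does k times as much work,
    so the cost of any fixed score is divided by k.
    """
    tenfolds = (target_pct - BASE_SOLVE_PCT) / GAIN_PCT_PER_TENFOLD
    baseline_cost = BASE_COST_USD * 10.0 ** tenfolds
    return baseline_cost / compute_multiplier


if __name__ == "__main__":
    for target in (40.0, 60.0):
        before = cost_for_score_usd(target)
        after = cost_for_score_usd(target, compute_multiplier=2.0)
        print(f"{target:.0f}%: ${before:,.0f} without the CM, ${after:,.0f} with a 2x CM")
```

With a 2 times multiplier, the 40% model drops from $10 million to $5 million and the 60% model from $100 million to $50 million, matching the figures in the text.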
Because the value of having a more intelligent system is so high,
this shifting of the curve typically causes companies to spend more, not less,
on training models. The gains in cost efficiency
end up entirely devoted to training smarter models,
limited only by the company's financial resources. People are
naturally attracted to the idea that first something is expensive,
then it gets cheaper, as if AI is a single thing of
constant quality, and when it gets cheaper, we'll use fewer chips
to train it. But what's important is the scaling
curve. When it shifts, we simply traverse
it faster because the value of what's at the end of the curve
is so high. In 2020 my
team published a paper suggesting that the shift in the curve due
to algorithmic progress is approximately 1.68
times per year. That has probably sped up significantly
since; it also doesn't take efficiency and hardware into account.
I'd guess the number today is maybe approximately four times a year.
Shifts in the training curve
also shift the inference curve, and as a result large
decreases in price, holding constant the quality of model, have been
occurring for years. For instance, Claude 3.5 Sonnet,
which was released 15 months later than the original GPT-4,
outscores GPT-4 on almost all benchmarks while having
a 10 times lower API price.

3. Shifting the paradigm.
Every once in a while the underlying thing that is being scaled
changes a bit or a new type of scaling is added to the training
process. From 2020 to 2023,
the main thing being scaled was pretrained models:
models trained on increasing amounts of internet text with
a tiny bit of other training on top. In 2024,
the idea of using reinforcement learning (RL) to
train models to generate chains of thought has become a new focus of
scaling. Anthropic, DeepSeek, and many
other companies, perhaps most notably OpenAI, who released their o1-preview
model in September, have found that this training greatly increases
performance on certain select, objectively measurable tasks like math,
coding competitions, and on reasoning that resembles these
tasks. This new paradigm involves starting with the ordinary
type of pretrained models and then, as a second stage, using RL to
add the reasoning skills. Importantly, because this type
of RL is new, we are still very early on the scaling curve.
The amount being spent on the second, RL stage is small
for all players. Spending $1 million instead
of $100,000 is
enough to get huge gains. Companies are now working
very quickly to scale up the second stage to hundreds of millions and
billions, but it's crucial to understand that we're at a unique crossover
point where there is a powerful new paradigm that is
early on the scaling curve and therefore can make big gains quickly.

DeepSeek's Models

The three dynamics above can help
us understand DeepSeek's recent releases. About a month ago,
DeepSeek released a model called DeepSeek-V3 that was
a pure pretrained model, the first stage described in number
three above. Then last week they released R1,
which added a second stage. It's not possible to determine
everything about these models from the outside, but the following
is my best understanding of the two releases.
DeepSeek-V3 was actually the real innovation and what should
have made people take notice a month ago (we certainly did).
As a pretrained model, it appears to come close to the performance
of state-of-the-art US models on some important tasks while
costing substantially less to train, although we find that Claude
3.5 Sonnet in particular remains much better on some
other key tasks, such as real-world coding.
DeepSeek's team did this via some genuine and impressive innovations,
mostly focused on engineering efficiency. There were
particularly innovative improvements in the management of an aspect called
the key-value cache, and in enabling a method called mixture
of experts to be pushed further than it had before.
However, it's important to look closer. DeepSeek does not do
for $6 million what cost US AI companies billions.
I can only speak for Anthropic, but Claude 3.5 Sonnet
is a mid-sized model that cost a few tens of millions of dollars
to train. I won't give an exact number.
Also, 3.5 Sonnet was not trained in any
way that involved a larger or more expensive model,
contrary to some rumors. Sonnet's training was conducted
nine to 12 months ago and DeepSeek's model was trained in
November or December, while Sonnet remains notably ahead in many
internal and external evals. Thus I think
a fair statement is: DeepSeek produced a model close to the performance of
US models seven to ten months older, for a good deal less cost,
but not anywhere near the ratios people have suggested.
If the historical trend of the cost curve decrease is approximately four times
per year, that means that in the ordinary course of business,
in the normal trends of historical cost decreases like
those that happened in 2023 and 2024, we'd expect a
model 3 to 4 times cheaper than 3.5 Sonnet/GPT-4o
around now. Since DeepSeek-V3 is worse than those US
frontier models, let's say by 2 times on the scaling curve,
which I think is quite generous to DeepSeek-V3, that means
it would be totally normal, totally on trend, if
DeepSeek-V3 training cost 8 times less than the current
US models developed a year ago. I'm not going to
give a number, but it's clear from the previous bullet point
that even if you take DeepSeek's training cost at face value,
they are on trend at best, and probably not even
that. For example, this is less steep than the original
GPT-4 to Claude 3.5
Sonnet inference price differential (10 times), and
3.5 Sonnet is a better model than GPT-4.
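
The on-trend arithmetic above can be reproduced in a few lines. This is a sketch under the essay's own rough inputs: a cost decline of about 4 times per year, US models trained roughly 9 to 12 months earlier, and a quality gap worth about 2 times on the scaling curve. The function names and the constant-rate compounding are assumptions of this sketch.

```python
# Rough inputs taken from the text; treat them as guesses, not measurements.
ANNUAL_COST_DECLINE = 4.0   # cost of a fixed quality level falls ~4x per year
QUALITY_GAP_FACTOR = 2.0    # a weaker model is ~2x cheaper on the same curve


def trend_cost_ratio(months_elapsed: float,
                     annual_decline: float = ANNUAL_COST_DECLINE) -> float:
    """How much cheaper a same-quality model should be after months_elapsed,
    assuming the annual decline compounds at a constant rate."""
    return annual_decline ** (months_elapsed / 12.0)


def on_trend_cost_ratio(months_elapsed: float,
                        quality_gap: float = QUALITY_GAP_FACTOR) -> float:
    """Expected cost ratio for a newer model that is also somewhat weaker."""
    return trend_cost_ratio(months_elapsed) * quality_gap


if __name__ == "__main__":
    for months in (9, 12):
        same_quality = trend_cost_ratio(months)
        weaker = on_trend_cost_ratio(months)
        print(f"{months} months: {same_quality:.1f}x cheaper at equal quality, "
              f"{weaker:.1f}x cheaper with the quality gap")
```

At 12 months this gives 4 times at equal quality and 8 times once the quality gap is included, which is the "on trend" figure used in the argument. At 9 months the numbers are somewhat lower.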
All of this is to say that DeepSeek-V3 is not
a unique breakthrough or something that fundamentally changes the economics
of LLMs. It's an expected point on an
ongoing cost reduction curve. What's different this time
is that the company that was first to demonstrate the expected cost
reductions was Chinese. This has never happened before
and is geopolitically significant. However,
US companies will soon follow suit, and they
won't do this by copying DeepSeek, but because they too are
achieving the usual trend in cost reduction. Both
DeepSeek and US AI companies have much more money and many more
chips than they used to train their headline models.
The extra chips are used for R&D to develop the ideas behind the
model, and sometimes to train larger models that are not yet
ready or that needed more than one try to get right.
It's been reported (we can't be certain it is true) that DeepSeek
actually had 50,000 Hopper generation chips,
which I'd guess is within a factor of 2 to 3 of what
the major US AI companies have (for example, it's 2
to 3 times less than the xAI Colossus cluster).
Those 50,000 Hopper chips cost on the order of
$1 billion. Thus DeepSeek's total spend
as a company, as distinct from spend to train an individual model,
is not vastly different from US AI labs.
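
As a quick check on the fleet figures above, the sketch below works out what the reported numbers imply. Both outputs are implications of the essay's round numbers (50,000 chips, roughly $1 billion, a cluster 2 to 3 times larger), not quoted prices or confirmed chip counts. The function names are hypothetical.

```python
# Round figures as reported in the text; neither is confirmed.
REPORTED_CHIP_COUNT = 50_000
REPORTED_FLEET_COST_USD = 1_000_000_000


def implied_cost_per_chip_usd(fleet_cost_usd: float, chip_count: int) -> float:
    """Average cost per chip implied by a total fleet cost."""
    return fleet_cost_usd / chip_count


def implied_larger_cluster(chip_count: int, low: int = 2, high: int = 3) -> tuple[int, int]:
    """Chip-count range for a cluster said to be `low` to `high` times larger."""
    return chip_count * low, chip_count * high


if __name__ == "__main__":
    per_chip = implied_cost_per_chip_usd(REPORTED_FLEET_COST_USD, REPORTED_CHIP_COUNT)
    smallest, largest = implied_larger_cluster(REPORTED_CHIP_COUNT)
    print(f"Implied average cost per chip: ${per_chip:,.0f}")
    print(f"Implied size of a 2x to 3x larger cluster: {smallest:,} to {largest:,} chips")
```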
It's worth noting that the scaling curve analysis is a
bit oversimplified, because models are somewhat differentiated and
have different strengths and weaknesses. The scaling curve numbers
are a crude average that ignores a lot of details.
I can only speak to Anthropic's models, but as I've hinted at
above, Claude is extremely good at coding and at
having a well-designed style of interaction with people. Many people
use it for personal advice or support. On these and
some additional tasks, there's just no comparison with
DeepSeek. These factors don't appear in the scaling numbers.
R1, which is the model that was released last week and
which triggered an explosion of public attention, including a
17% decrease in Nvidia's stock price,
is much less interesting from an innovation or engineering
perspective than V3. It adds the second phase
of training, reinforcement learning, described in number 3
in the previous section, and essentially replicates what
OpenAI has done with o1. They appear to be at similar scale
with similar results.
However, because we are on the early part of the scaling curve,
it's possible for several companies to produce models of this type, as
long as they're starting from a strong pretrained model.
Producing R1 given V3 was probably very cheap.
We're therefore at an interesting crossover point,
where it is temporarily the case that several companies can produce good reasoning
models. This will rapidly cease to be true as everyone moves further
up the scaling curve on these models.

Export Controls

All of this is just a preamble to my main topic of
interest: the export controls on chips to China.
In light of the above facts, I see the situation as
follows. There is an ongoing trend where companies spend
more and more on training powerful AI models, even
as the curve is periodically shifted and the cost of training
a given level of model intelligence declines rapidly.
It's just that the economic value of training more and more intelligent models
is so great that any cost gains are more than eaten up
almost immediately. They're poured back into making even
smarter models for the same huge cost we were originally planning
to spend. To the extent that US labs
haven't already discovered them, the efficiency innovations DeepSeek developed
will soon be applied by both US and Chinese labs to train
multi-billion-dollar models. These will perform better than
the multi-billion-dollar models they were previously planning to train,
but they'll still spend multi-billions.
That number will continue going up until we reach AI that
is smarter than almost all humans at almost all
things. Making AI that is smarter than almost all humans
at almost all things will require millions of chips,
tens of billions of dollars at least, and is
most likely to happen in 2026-2027.
DeepSeek's releases don't change this, because
they're roughly on the expected cost reduction curve that has always been factored
into these calculations. This means that in 2026-2027
we could end up in one of two starkly different worlds. In
the US, multiple companies will definitely have the required millions
of chips, at the cost of tens of billions of dollars.
The question is whether China will also be able to get millions of chips.
If they can, we'll live in a bipolar world, where both the US
and China have powerful AI models that will cause extremely
rapid advances in science and technology, what I've
called countries of geniuses in a data center. A bipolar
world would not necessarily be balanced indefinitely. Even if the
US and China were at parity in AI systems, it seems
likely that China could direct more talent, capital, and focus to
military applications of the technology. Combined with
its large industrial base and military-strategic advantages,
this could help China take a commanding lead on the global stage,
not just for AI but for everything. If China
can't get millions of chips, we'll at least temporarily
live in a unipolar world, where only the US and its allies
have these models. It's unclear whether the unipolar world will last,
but there's at least the possibility that, because AI systems can eventually
help make even smarter AI systems, a temporary lead
could be parlayed into a durable advantage. Thus, in this world,
the US and its allies might take a commanding and long-lasting lead on
the global stage. Well-enforced export controls
are the only thing that can prevent China from getting millions of
chips, and are therefore the most important determinant of
whether we end up in a unipolar or bipolar world.
The performance of DeepSeek does not mean the export controls
failed. As I stated above, DeepSeek had a moderate to
large number of chips, so it's not surprising that they were able to
develop and then train a powerful model. They were
not substantially more resource-constrained than US AI companies,
and the export controls were not the main factor causing
them to innovate. They are simply very
talented engineers and show why China is a serious competitor
to the US. DeepSeek also does not show that China
can always obtain the chips it needs via smuggling, or that
the controls always have loopholes. I don't believe the export controls
were ever designed to prevent China from getting a few tens of thousands of chips.
$1 billion of economic activity can be hidden,
but it's hard to hide $100 billion or even $10 billion.
A million chips may also be physically difficult to smuggle.
It's also instructive to look at the chips DeepSeek is currently reported to
have. This is a mix of H100s,
H800s, and H20s, according to SemiAnalysis,
adding up to 50,000 total. H100s
have been banned under the export controls since their release, so if DeepSeek
has any, they must have been smuggled. (Note that Nvidia
has stated that DeepSeek's advances are fully export control compliant.)
H800s were allowed under the initial round of 2022
export controls, but were banned in October 2023 when the
controls were updated, so these were probably shipped before the ban.
H20s are less efficient for training and more efficient for sampling,
and are still allowed, although I think they should be banned.
All of that is to say that it appears that a substantial
fraction of DeepSeek's AI chip fleet consists of chips
that haven't been banned but should be, chips that
were shipped before they were banned, and some that seem very likely to
have been smuggled. This shows that the export controls are actually working
and adapting: loopholes are being closed; otherwise, they
would likely have a full fleet of top-of-the-line H100s.
If we can close them fast enough, we may be able to prevent China from
getting millions of chips, increasing the likelihood of a unipolar
world with the US ahead. Given my focus on export
controls and US national security, I want to be clear on one
thing. I don't see DeepSeek themselves as adversaries, and the point
isn't to target them in particular. In interviews
they've done, they seem like smart, curious researchers who just want to make useful
technology. But they're beholden to an authoritarian
government that has committed human rights violations,
has behaved aggressively on the world stage, and will be
far more unfettered in these actions if they're able to match the US
in AI. Export controls are one of our most powerful tools
for preventing this, and the idea that the technology getting more powerful,
having more bang for the buck, is a reason to lift our export controls makes
no sense at all.
