On DeepSeek and Export Controls by Dario Amodei


Dario Amodei writes on DeepSeek and export controls in January 2025, as narrated by the A-OK voice bot from ElevenLabs. Let's begin.

A few weeks ago, I made the case for stronger U.S. export controls

on chips to China. Since then, DeepSeek, a Chinese AI company, has managed to, at least in some respects, come close to the performance of US frontier

AI models at lower cost.

Here, I won't focus on whether DeepSeek is or isn't a threat to US AI companies like Anthropic, although I do believe many of the claims about their threat to US AI leadership are greatly overstated.

Instead, I'll focus on whether DeepSeek's releases

undermine the case for those export control policies on chips.

I don't think they do. In fact, I think they make export control

policies even more existentially important than they were a week ago. Export controls serve a vital purpose: keeping democratic nations at the forefront of AI development.

To be clear, they're not a way to duck the competition between the US

and China. In the end, AI companies in the US

and other democracies must have better models than those in China if we want

to prevail. But we shouldn't hand the Chinese Communist Party technological advantages when we don't have to.

Three Dynamics of AI Development

Before I make my policy argument,

I'm going to describe three basic dynamics of AI systems that

it's crucial to understand.

1. Scaling laws. A property of AI, which my co-founders and I were among the first to document back when we worked at OpenAI, is that, all else equal, scaling up the training of AI systems

leads to smoothly better results on a range of cognitive tasks

across the board. So, for example,

a $1 million model might solve 20% of important coding tasks, a $10 million model might solve 40%, a $100 million model might solve 60%,

and so on. These differences tend to have huge implications

in practice. Another factor of 10 may correspond to the difference

between an undergraduate and PhD skill level, and thus

companies are investing heavily in training these models.
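As a toy illustration of that smooth log-linear relationship (the 20-points-per-10x slope here is assumed purely to match the example figures above, not a measured value):

```python
import math

def solve_rate(cost_usd: float, slope: float = 0.20, base_cost: float = 1e6) -> float:
    """Toy log-linear scaling curve: each 10x increase in training cost
    adds `slope` (here, 20 percentage points) to the task solve rate."""
    return slope * (1 + math.log10(cost_usd / base_cost))

for cost in (1e6, 1e7, 1e8):
    print(f"${cost:,.0f} -> {solve_rate(cost):.0%}")
```

This is only a shape sketch of the dynamic; real scaling-law measurements are more subtle than a single solve-rate number.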

2. Shifting the curve. The field is constantly coming up with ideas, large and small, that make things more effective or efficient.

It could be an improvement to the architecture of the model (a tweak to the basic Transformer architecture that all of today's models use), or simply a way of running the model more efficiently on

the underlying hardware. New generations of

hardware also have the same effect.

What this typically does is shift the curve. If the innovation

is a 2 times compute multiplier, then it allows

you to get 40% on a coding task for $5 million instead of

$10 million, or 60% for $50 million instead

of $100 million, etc.
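One way to see what "shifting the curve" means is to invert a toy scaling curve and apply a compute multiplier (CM); the slope and base cost below are illustrative assumptions chosen to match the example figures in the text:

```python
def cost_for_rate(target: float, multiplier: float = 1.0,
                  slope: float = 0.20, base_cost: float = 1e6) -> float:
    """Dollars needed to hit `target` solve rate on a toy log-linear
    scaling curve; `multiplier` is an algorithmic compute multiplier (CM)."""
    return base_cost * 10 ** (target / slope - 1) / multiplier

print(cost_for_rate(0.40))                 # cost for 40% with no innovation
print(cost_for_rate(0.40, multiplier=2))   # same 40% after a 2x CM
```

The multiplier divides the cost axis uniformly, which is exactly why a 2x CM turns a $10 million result into a $5 million one.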

Every frontier AI company regularly discovers many of these CMs (compute multipliers): frequently small ones (approximately 1.2x), sometimes medium-sized ones (approximately 2x), and every once in a while very large ones (approximately 10x).

Because the value of having a more intelligent system is so high,

this shifting of the curve typically causes companies to spend more, not less,

on training models. The gains in cost efficiency

end up entirely devoted to training smarter models,

limited only by the company's financial resources. People are

naturally attracted to the idea that first something is expensive,

then it gets cheaper, as if AI is a single thing of

constant quality. And when it gets cheaper, we'll use fewer chips

to train it. But what's important is the scaling

curve. When it shifts, we simply traverse

it faster because the value of what's at the end of the curve

is so high. In 2020 my

team published a paper suggesting that the shift in the curve due

to algorithmic progress is approximately 1.68

times per year. That has probably sped up significantly since; it also doesn't take efficiency and hardware into account.

I'd guess the number today is maybe approximately four times a year.
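The compounding effect of these annual rates can be checked with a line of arithmetic (1.68x is the 2020 paper's estimate; 4x is the essay's rough guess):

```python
# Compounding annual algorithmic-efficiency gains over four years
paper_estimate = 1.68   # ~1.68x per year (2020 paper's estimate)
todays_guess = 4.0      # ~4x per year (the essay's rough guess)

print(paper_estimate ** 4)  # total efficiency gain over 4 years at the 2020 rate
print(todays_guess ** 4)    # total gain over 4 years at the guessed current rate
```

The gap between roughly 8x and 256x over four years is why the choice of annual rate matters so much for cost projections.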

Another estimate is here. Shifts in the training curve also shift the inference curve, and as a result, large decreases in price, holding constant the quality of model, have been occurring for years. For instance, Claude 3.5 Sonnet,

which was released 15 months later than the original GPT-4, outscores GPT-4 on almost all benchmarks while having a 10 times lower API price.

3. Shifting the paradigm. Every once in a while, the underlying thing that is being scaled changes a bit, or a new type of scaling is added to the training process. From 2020 to 2023,

the main thing being scaled was pretrained models: models trained on increasing amounts of Internet text with a tiny bit of other training on top. In 2024,

the idea of using reinforcement learning (RL) to train models to generate chains of thought became a new focus of scaling. Anthropic, DeepSeek, and many

other companies (perhaps most notably OpenAI, who released their o1-preview model in September) have found that this training greatly increases performance on certain select, objectively measurable tasks, like math and coding competitions, and on reasoning that resembles these

tasks. This new paradigm involves starting with the ordinary type of pretrained model, and then, as a second stage, using RL to add the reasoning. Importantly, because this type

of RL is new, we are still very early on the scaling curve.

The amount being spent on the second RL stage is small.

For all players, spending $1 million instead

of $100,000 is

enough to get huge gains. Companies are now working

very quickly to scale up the second stage to hundreds of millions and

billions, but it's crucial to understand that we're at a unique crossover

point where there is a powerful new paradigm that is

early on the scaling curve and therefore can make big gains quickly.

DeepSeek's Models

The three dynamics above can help us understand DeepSeek's recent releases. About a month ago,

DeepSeek released a model called DeepSeek-V3 that was a pure pretrained model (the first stage described in number 3 above). Then last week they released R1, which added the second stage. It's not possible to determine

everything about these models from the outside, but the following

is my best understanding of the two releases.

DeepSeek-V3 was actually the real innovation, and what should have made people take notice a month ago (we certainly did).

As a pretrained model, it appears to come close to the performance of state-of-the-art US models on some important tasks, while costing substantially less to train, although we find that Claude 3.5 Sonnet in particular remains much better on some

other key tasks such as real world coding.

DeepSeek's team did this via some genuine and impressive innovations,

mostly focused on engineering efficiency. There were

particularly innovative improvements in the management of an aspect called the key-value cache, and in enabling a method called mixture of experts to be pushed further than it had been before.

However, it's important to look closer: DeepSeek does not "do for $6 million what cost US AI companies billions."

I can only speak for Anthropic, but Claude 3.5 Sonnet is a mid-sized model that cost a few tens of millions of dollars to train. I won't give an exact number. Also, 3.5 Sonnet was not trained in any way that involved a larger or more expensive model.

Contrary to some rumors, Sonnet's training was conducted nine to 12 months ago, and DeepSeek's model was trained in November or December, yet Sonnet remains notably ahead in many internal and external evals. Thus I think

a fair statement is: DeepSeek produced a model close to the performance of US models seven to ten months older, for a good deal less cost, but not anywhere near the ratios people have suggested.

If the historical trend is that the cost curve decreases approximately 4x per year, that means that in the ordinary course of business, in the normal trends of historical cost decreases like those that happened in 2023 and 2024, we'd expect a model 3 to 4x cheaper than 3.5 Sonnet or GPT-4 around now. Since DeepSeek-V3 is worse than those US frontier models, let's say by 2x on the scaling curve, which I think is quite generous to DeepSeek-V3. That means

it would be totally normal, totally on trend, if DeepSeek-V3 training cost 8x less than current US models developed a year ago. I'm not going to

give a number, but it's clear from the previous bullet point

that even if you take DeepSeek's training cost at face value,

they are on trend at best, and probably not even

that. For example, this is less steep than the original GPT-4 to Claude 3.5 Sonnet inference price differential (10x), and 3.5 Sonnet is a better model than GPT-4.
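The "on trend" claim here reduces to a two-factor multiplication, sketched with the essay's own estimates:

```python
# If equal-quality training cost drops ~4x per year, a model trained a year
# later should cost ~4x less; being ~2x lower on the scaling curve is worth
# roughly another ~2x in cost. Both factors are the essay's estimates.
yearly_cost_drop = 4
quality_gap = 2
on_trend_ratio = yearly_cost_drop * quality_gap
print(on_trend_ratio)  # an 8x lower training cost would be exactly "on trend"
```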

All of this is to say that DeepSeek-V3 is not

a unique breakthrough or something that fundamentally changes the economics

of LLMs. It's an expected point on an

ongoing cost reduction curve. What's different this time

is that the company that was first to demonstrate the expected cost

reductions was Chinese. This has never happened before

and is geopolitically significant. However,

US companies will soon follow suit, and they won't do this by copying DeepSeek, but because they too are achieving the usual trend in cost reduction. Both DeepSeek and US AI companies have much more money and many more chips than they use to train their headline models.

The extra chips are used for R&D to develop the ideas behind the

model and sometimes to train larger models that are not yet

ready or that needed more than one try to get right.

It's been reported (we can't be certain it's true) that DeepSeek actually had 50,000 Hopper-generation chips, which I'd guess is within a factor of 2 to 3x of what the major US AI companies have; for example, it's 2 to 3x less than the xAI Colossus cluster. Those 50,000 Hopper chips cost on the order of $1 billion. Thus DeepSeek's total spend

as a company, as distinct from spend to train an individual model,

is not vastly different from that of US AI labs.
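The $1 billion figure follows from a back-of-envelope multiplication; the per-chip price below is an assumption for illustration (Hopper-class chips have been reported at roughly $20k and up):

```python
# Rough check of "50,000 Hopper chips ~ $1 billion"
chips = 50_000
price_per_chip = 20_000  # USD; assumed average price for illustration
total = chips * price_per_chip
print(f"${total:,}")  # order of $1 billion
```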

It's worth noting that the scaling curve analysis is a

bit oversimplified because models are somewhat differentiated and

have different strengths and weaknesses. The scaling curve numbers

are a crude average that ignores a lot of details.

I can only speak to Anthropic's models, but as I've hinted at above, Claude is extremely good at coding and at having a well-designed style of interaction with people (many people use it for personal advice or support). On these and some additional tasks, there's just no comparison with DeepSeek. These factors don't appear in the scaling numbers.

R1, which is the model that was released last week and which triggered an explosion of public attention (including a 17% decrease in Nvidia's stock price), is much less interesting from an innovation or engineering perspective than V3. It adds the second phase of training (reinforcement learning, described in number 3 in the previous section) and essentially replicates what OpenAI has done with o1. They appear to be at similar scale

with similar results. However, because we are in the early part of the scaling curve, it's possible for several companies to produce models of this type, as long as they're starting from a strong pretrained model.

Producing R1 given V3 was probably very cheap.

We're therefore at an interesting crossover point

where it is temporarily the case that several companies can produce good reasoning

models. This will rapidly cease to be true as everyone moves further up the scaling curve on these models.

Export Controls

All of this is just a preamble to my main topic of interest: the export controls on chips to China.

In light of the above facts, I see the situation as

follows. There is an ongoing trend where companies spend

more and more on training powerful AI models even

as the curve is periodically shifted and the cost of training

a given level of model intelligence declines rapidly.

It's just that the economic value of training more and more intelligent models

is so great that any cost gains are more than eaten up

almost immediately. They're poured back into making even

smarter models for the same huge cost we were originally planning

to spend. To the extent that US labs haven't already discovered them, the efficiency innovations DeepSeek developed

will soon be applied by both US and Chinese labs to train multi-billion-dollar models. These will perform better than the multi-billion-dollar models they were previously planning to train, but they'll still spend multiple billions.

That number will continue going up until we reach AI that

is smarter than almost all humans at almost all

things. Making AI that is smarter than almost all humans

at almost all things will require millions of chips,

tens of billions of dollars at least, and is

most likely to happen in 2026-2027.

DeepSeek's releases don't change this because

they're roughly on the expected cost reduction curve that has always been factored

into these calculations. This means that in 2026-2027 we could end up in one of two starkly different worlds. In the US, multiple companies will definitely have the required millions of chips, at the cost of tens of billions of dollars. The question is whether China will also be able to get millions of chips.

If they can, we'll live in a bipolar world where both the US

and China have powerful AI models that will cause extremely

rapid advances in science and technology, what I've called "countries of geniuses in a datacenter." A bipolar

world would not necessarily be balanced indefinitely. Even if the

US and China were at parity in AI systems, it seems

likely that China could direct more talent, capital and focus to

military applications of the technology. Combined with

its large industrial base and military strategic advantages,

this could help China take a commanding lead on the global stage.

Not just for AI, but for everything. If China

can't get millions of chips, we'll at least temporarily

live in a unipolar world where only the US and its allies

have these models. It's unclear whether the unipolar world will last,

but there's at least the possibility that because AI systems can eventually

help make even smarter AI systems, a temporary lead

could be parlayed into a durable advantage. Thus, in this world,

the US and its allies might take a commanding and long lasting lead on

the global stage. Well enforced export controls

are the only thing that can prevent China from getting millions of

chips and are therefore the most important determinant of

whether we end up in a unipolar or bipolar world.

The performance of DeepSeek does not mean the export controls

failed. As I stated above, DeepSeek had a moderate-to-large number of chips, so it's not surprising that they were able to

develop and then train a powerful model. They were

not substantially more resource constrained than US AI companies,

and the export controls were not the main factor causing

them to innovate. They are simply very

talented engineers and show why China is a serious competitor

to the US. DeepSeek also does not show that China

can always obtain the chips it needs via smuggling or that

the controls always have loopholes. I don't believe the export controls

were ever designed to prevent China from getting a few tens of thousands of chips.

$1 billion of economic activity can be hidden,

but it's hard to hide $100 billion or even $10 billion.

A million chips may also be physically difficult to smuggle.

It's also instructive to look at the chips DeepSeek is currently reported to have: a mix of H100s, H800s, and H20s, adding up to 50,000 total, according to SemiAnalysis. H100s have been banned under the export controls since their release, so if DeepSeek has any, they must have been smuggled. Note that Nvidia

has stated that DeepSeek's advances are fully export control compliant.

H800s were allowed under the initial round of 2022

export controls, but were banned in October 2023 when the

controls were updated, so these were probably shipped before the ban.

H20s are less efficient for training and more efficient for sampling

and are still allowed, although I think they should be banned.

All of that is to say that it appears that a substantial fraction of DeepSeek's AI chip fleet consists of chips that haven't been banned (but should be), chips that were shipped before they were banned, and some that seem very likely to have been smuggled. This shows that the export controls are actually working and adapting: loopholes are being closed. Otherwise, they would likely have a full fleet of top-of-the-line H100s.

If we can close them fast enough, we may be able to prevent China from

getting millions of chips, increasing the likelihood of a unipolar

world with the US ahead. Given my focus on export controls and US national security, I want to be clear on one thing: I don't see DeepSeek themselves as adversaries, and the point

isn't to target them in particular. In interviews

they've done, they seem like smart, curious researchers who just want to make useful

technology. But they're beholden to an authoritarian

government that has committed human rights violations,

has behaved aggressively on the world stage, and will be

far more unfettered in these actions if they're able to match the US

in AI. Export controls are one of our most powerful tools

for preventing this, and the idea that the technology getting more powerful (having more bang for the buck) is a reason to lift our export controls makes no sense at all.
