
The insane engineering of DeepSeek V4

29m 32s · 4,981 words · 766 segments · English

FULL TRANSCRIPT

0:00

DeepSeek is built different. They just

0:02

dropped a new model, DeepSeek V4. But

0:05

here's the thing. Unlike the top closed

0:08

AI labs out there that are spending

0:10

billions on data centers with unlimited

0:13

compute, DeepSeek is incredibly

0:15

constrained. They don't have nearly as

0:18

much compute. Heck, they don't even have

0:19

the top NVIDIA chips. And their team is

0:22

like 40 times smaller than OpenAI.

0:24

They're incredibly resource-limited. And

0:27

yet, they managed to build a model

0:29

that's on par with the top closed models

0:32

out there. The ridiculous thing is they

0:34

even open-sourced this, and they even

0:36

released a paper on how they built it.

0:38

Now, I spent the past few days reading

0:41

this and the design is absolutely

0:43

beautiful and ingenious. But here's the

0:46

thing. If you're not technical, all of

0:48

this looks like alien language. So, in

0:50

this video, I'm going to break down

0:51

everything in simple terms so you can

0:54

understand how incredibly cracked this

0:56

model and the DeepSeek team are. Let's

0:59

jump right in. First, let's talk specs.

1:01

So, this latest V4 Pro model is massive.

1:05

It has 1.6 trillion parameters. If

1:09

you're not technical, parameters are

1:10

basically the dials and knobs inside the

1:12

model's brain that store everything it

1:15

knows. In theory, the more parameters,

1:17

the smarter and more capable the model

1:19

is. And at 1.6 trillion parameters, it's among

1:22

the largest models out there. But this

1:24

comes with a catch. It's way harder to

1:27

build and train a model of that size,

1:29

which we'll talk more about in a second.

1:31

Now, this new V4 also has a context

1:34

length of 1 million tokens. This is

1:36

basically the model's short-term memory,

1:38

or how much info you can stuff into your

1:40

prompt for it to remember all at once. 1

1:43

million tokens is roughly 750,000 words,

1:47

which is huge. It's like feeding it the

1:49

bulk of the Harry Potter series and asking it

1:52

about a very specific detail on a page,

1:55

and it'll actually remember it. Or for

1:57

agents, it means the model can run for

1:59

hours on long tasks without losing track

2:01

of what it's doing. But again, a 1

2:03

million token context window is insanely

2:06

hard to actually build correctly. So,

2:08

the DeepSeek team had to come up with

2:10

some ingenious solutions to all of this.

2:12

In fact, let's dig deeper into why

2:15

having a 1 million token context window

2:17

is so hard. You see, modern AI models

2:20

don't read text the way we do. Every

2:22

time a model reads a new word, or what's

2:24

technically called a token, it asks,

2:26

"How does this word relate to all the

2:28

other words before it?" Take a simple

2:30

example. The cat didn't cross the street

2:33

because it was too tired. When it

2:35

processes each word, it looks at all the

2:37

words before it to see which ones are

2:39

most relevant to the current word. In

2:41

fact, this is the attention mechanism, first

2:44

introduced by the legendary paper

2:46

"Attention Is All You Need" by Google. And

2:48

this is the foundation behind all large

2:51

language models that we know of today.

2:53

In fact, if you're interested in

2:54

learning more, definitely see this video

2:56

for a full explainer. Now, for a short

2:59

sentence like this, it's fine. If you're

3:01

at the 10th word, that's just 10

3:03

comparisons. No big deal. But if you're

3:05

at the 100,000th word, that's 100,000

3:09

comparisons. And this is the fundamental

3:11

bottleneck of every large language

3:13

model. Imagine pushing this to a million

3:15

tokens. At that scale, the number of

3:18

comparisons becomes astronomical. So

3:21

large that even high-end hardware starts

3:23

to choke just trying to keep up. And

3:25

it's not just the compute that suffers.
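
To put rough numbers on that growth, here is a minimal Python sketch. It assumes plain causal attention, where each token is compared with itself and every token before it, and it only counts comparisons for a single attention head in a single layer. The function names are made up for illustration and don't come from the paper.

```python
def comparisons_at(position: int) -> int:
    """Comparisons needed for the token at a 1-based position (itself plus everything before it)."""
    return position


def causal_comparisons(n_tokens: int) -> int:
    """Total comparisons to process a whole sequence: 1 + 2 + ... + n."""
    return n_tokens * (n_tokens + 1) // 2


if __name__ == "__main__":
    for n in (10, 100_000, 1_000_000):
        print(f"{n:>9,} tokens: {comparisons_at(n):,} for the last token, "
              f"{causal_comparisons(n):,} in total")
```

The per-token cost grows in a straight line, but the total grows with the square of the length, which is why going from 100,000 to a million tokens costs about 100 times more, not 10 times more.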

3:28

To make this all work, the model has to

3:29

store intermediate results. Basically, a

3:32

running memory of everything it has seen

3:34

so far. This is called the key-value

3:37

cache, or KV cache. You can think of it

3:39

as a massive lookup table. For

3:42

every past word, it stores information

3:44

about what that word meant in context.
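
Here's a back-of-the-envelope sketch of how big that lookup table gets for an ordinary, uncompressed KV cache. Every model dimension below (60 layers, 8 key-value heads, 128 numbers per head, 2 bytes per number) is a made-up placeholder for illustration, not DeepSeek V4's actual configuration, and `kv_cache_bytes` is a hypothetical helper name.

```python
def kv_cache_bytes(n_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Size of a plain KV cache: one key vector and one value vector
    per token, per layer, per key-value head."""
    keys_and_values = 2
    return keys_and_values * n_tokens * n_layers * n_kv_heads * head_dim * bytes_per_value


if __name__ == "__main__":
    # Hypothetical model shape, chosen only to show the scaling.
    for n in (1_000, 100_000, 1_000_000):
        size_gb = kv_cache_bytes(n, n_layers=60, n_kv_heads=8, head_dim=128) / 1e9
        print(f"{n:>9,} tokens: about {size_gb:,.2f} GB of cache")
```

The cache grows in direct proportion to the number of tokens, so whatever it costs for a short paragraph, a million-token conversation costs thousands of times more.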

3:46

Now, at small scales, like for a short

3:49

paragraph, this is pretty manageable.

3:51

But at a million tokens, it becomes

3:53

absurd. You're now storing a ton of data

3:56

just to maintain context for a single

3:58

conversation. That's gigabytes of data

4:00

sitting in expensive GPU memory just so

4:03

the model doesn't forget what it read

4:05

earlier. And if you put these two things

4:07

together, the insane compute required

4:09

and this exploding memory issue, you hit

4:12

a wall. So, how did DeepSeek solve this?

4:16

Looking at the paper, what's interesting

4:17

is that the team didn't just throw more

4:19

brute force compute at the problem

4:21

because, well, they didn't have a lot of

4:23

compute. Instead, they asked a much more

4:25

elegant question. What if the model

4:27

didn't have to look at everything in the

4:29

first place? The key idea behind DeepSeek

4:32

V4 is deceptively simple. Don't treat

4:35

all past information as equally

4:37

important because in reality it isn't.

4:39

When you're reading a book, you don't

4:41

constantly reread every page you've ever

4:43

seen. You skim, you summarize, and you

4:45

jump back only when something is

4:47

relevant. Your brain is selective, and

4:49

DeepSeek tries to do the same thing.

4:52

They call this a hybrid attention

4:53

architecture. And at its core are two

4:55

complementary strategies called CSA and

4:58

HCA. Basically, compress the past and

5:01

then ignore most of it. Let's start with

5:03

the compression. So in a traditional

5:06

model, every token is stored

5:07

individually. One word, one entry, no

5:10

shortcuts. What DeepSeek did instead is

5:13

group them. So one part of the system is

5:15

called Compressed Sparse Attention, or

5:18

CSA for short. It takes small chunks of

5:21

tokens, say four at a time, and merges

5:23

their information into a single denser

5:26

representation. So instead of

5:27

remembering individual tokens, it stores

5:30

a compact summary of all four. So right

5:32

away you've reduced the sequence length

5:34

by a factor of four, which means fewer

5:37

comparisons and less memory and

5:39

therefore less compute. But that alone

5:41

isn't enough. Even after compressing,

5:44

you're still left with hundreds of

5:45

thousands of these blocks at large

5:47

scales. Still too many to process

5:49

efficiently. Compression helps, but it

5:51

doesn't solve the core problem. The real

5:53

breakthrough comes from the next step,

5:55

which is sparsity. Once the past is

5:58

compressed, the model doesn't treat all

6:00

of it as equally relevant. Instead, it uses

6:03

a fast internal mechanism, kind of like

6:05

a built-in search engine, to pick out

6:08

only the most useful pieces. They call

6:10

this the lightning indexer for sparse

6:12

selection. When the model processes a

6:14

new token, it doesn't scan the entire

6:16

history. It rapidly scores all those

6:18

compressed blocks and selects only a

6:21

small subset, the ones that most likely

6:23

matter for the current context.
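
The compress-then-select idea can be sketched in a few lines of plain Python. This is only a toy under loud assumptions: the real compressor and indexer are learned parts of the model, whereas here averaging stands in for compression and a simple dot product stands in for the indexer's score. The names `compress` and `select_top_k` and the value of `k` are all hypothetical.

```python
def compress(tokens, group=4):
    """Merge each run of `group` token vectors into one block.
    Averaging is a stand-in; the real model learns how to merge."""
    blocks = []
    for start in range(0, len(tokens), group):
        chunk = tokens[start:start + group]
        dim = len(chunk[0])
        blocks.append([sum(vec[d] for vec in chunk) / len(chunk) for d in range(dim)])
    return blocks


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def select_top_k(query, blocks, k):
    """Score every block against the query (dot product as a stand-in score)
    and return the indices of the k best blocks, in their original order."""
    ranked = sorted(range(len(blocks)), key=lambda i: dot(query, blocks[i]), reverse=True)
    return sorted(ranked[:k])


# 16 toy token vectors, in four runs of identical vectors.
tokens = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4 + [[1.0, 1.0]] * 4 + [[-1.0, 0.0]] * 4
blocks = compress(tokens)                       # 16 tokens become 4 blocks
chosen = select_top_k([1.0, 0.2], blocks, k=2)  # keep only the 2 best-matching blocks

if __name__ == "__main__":
    print(len(tokens), "tokens ->", len(blocks), "blocks; attending to blocks", chosen)
```

Full attention would then run only over the chosen blocks, so the expensive step touches 2 entries here instead of 16.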

6:25

Everything else is ignored, just skipped

6:28

entirely. And this is a subtle but

6:29

profound shift. The model isn't trying

6:32

to remember everything perfectly. It's

6:34

trying to remember the right things at

6:35

the right time. And that changes things

6:38

completely. Instead of doing massive

6:40

computations over the entire past, it

6:42

focuses compute only where it actually

6:45

matters. And this drastically reduces

6:47

the workload without losing meaningful

6:49

context. But DeepSeek didn't stop there

6:52

because sometimes you do want a broad

6:54

high-level understanding of everything

6:56

that came before even if you don't need

6:58

the fine details. That's where the

7:00

second system comes in. So this is

7:02

called Heavily Compressed Attention, or

7:05

HCA for short. Here the compression is

7:08

far more aggressive. Instead of grouping

7:10

four tokens, it groups something like

7:12

128 tokens, roughly an entire paragraph,

7:15

into a single representation. Now you're

7:18

shrinking the sequence length by orders

7:20

of magnitude. And at that point,

7:22

something interesting happens. The

7:23

sequence becomes so short that the model

7:26

can afford to look at everything at once

7:28

because everything is now small enough

7:31

to handle. So you end up with a layered

7:33

strategy. One pathway keeps moderately

7:36

detailed chunks and selectively

7:38

retrieves the most relevant ones.
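
Some quick arithmetic shows what the two pathways buy you at a million tokens. The group sizes of 4 and 128 are the ones mentioned above. How many blocks the sparse pathway actually keeps isn't stated here, so `top_k = 1024` is purely a placeholder, and `compressed_length` is a hypothetical helper name.

```python
import math


def compressed_length(n_tokens: int, group: int) -> int:
    """Number of blocks left after merging every `group` tokens into one."""
    return math.ceil(n_tokens / group)


n_tokens = 1_000_000
csa_blocks = compressed_length(n_tokens, 4)    # moderately detailed blocks
hca_blocks = compressed_length(n_tokens, 128)  # heavily compressed blocks
top_k = 1024                                   # placeholder: blocks the sparse pathway keeps

# Entries a new token attends to: a selected few detailed blocks plus every coarse block.
entries_seen = top_k + hca_blocks

if __name__ == "__main__":
    print(f"detailed blocks available: {csa_blocks:,}")
    print(f"coarse blocks (all read):  {hca_blocks:,}")
    print(f"entries attended per new token: {entries_seen:,} instead of {n_tokens:,}")
```

Under these assumptions, each new token attends to a few thousand entries rather than a million, though the indexer still has to score the detailed blocks cheaply before picking its subset.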

7:39

Another one keeps extremely compressed
