The insane engineering of Deepseek V4

29m 32s, 4,981 words, 766 segments, English

0:00

DeepSeek is built different. They just

0:02

dropped a new model, DeepSeek V4. But

0:05

here's the thing. Unlike the top closed

0:08

AI labs out there that are spending

0:10

billions on data centers with unlimited

0:13

compute, DeepSeek is incredibly

0:15

constrained. They don't have nearly as

0:18

much compute. Heck, they don't even have

0:19

the top NVIDIA chips. And their team is

0:22

like 40 times smaller than OpenAI's.

0:24

They're incredibly resource limited. But

0:27

yet, they managed to build a model

0:29

that's on par with the top closed models

0:32

out there. The ridiculous thing is they

0:34

even open-sourced this, and they even

0:36

released a paper on how they built it.

0:38

Now, I spent the past few days reading

0:41

this and the design is absolutely

0:43

beautiful and ingenious. But here's the

0:46

thing. If you're not technical, all of

0:48

this looks like alien language. So, in

0:50

this video, I'm going to break down

0:51

everything in simple terms so you can

0:54

understand how incredibly cracked this

0:56

model and the DeepSeek team are. Let's

0:59

jump right in. First, let's talk specs.

1:01

So, this latest V4 Pro model is massive.

1:05

It has 1.6 trillion parameters. If

1:09

you're not technical, parameters are

1:10

basically the dials and knobs inside the

1:12

model's brain that store everything it

1:15

knows. In theory, the more parameters,

1:17

the smarter and more capable the model

1:19

is. And 1.6 trillion parameters puts it among

1:22

the largest models out there. But this

1:24

comes with a catch. It's way harder to

1:27

build and train a model of that size,

1:29

which we'll talk more about in a second.

1:31

Now, this new V4 also has a context

1:34

length of 1 million tokens. This is

1:36

basically the model's short-term memory,

1:38

or how much info you can stuff into your

1:40

prompt for it to remember all at once. 1

1:43

million tokens is roughly 750,000 words,

1:47

which is huge. It's like feeding it the

1:49

entire Harry Potter series and asking it

1:52

about a very specific detail on a page,

1:55

and it'll actually remember it. Or for

1:57

agents, it means the model can run for

1:59

hours on long tasks without losing track

2:01

of what it's doing. But again, a 1

2:03

million token context window is insanely

2:06

hard to actually build correctly. So,

2:08

the DeepSeek team had to come up with

2:10

some ingenious solutions to all of this.

2:12

In fact, let's dig deeper into why

2:15

having a 1 million token context window

2:17

is so hard. You see, modern AI models

2:20

don't read text the way we do. Every

2:22

time a model reads a new word, or what's

2:24

technically called a token, it asks,

2:26

"How does this word relate to all the

2:28

other words before it?" Take a simple

2:30

example. "The cat didn't cross the street

2:33

because it was too tired." When the model

2:35

processes each word, it looks at all the

2:37

words before it to see which ones are

2:39

most relevant to the current word. In

2:41

fact, this is the attention part first

2:44

introduced by the legendary paper

2:46

"Attention Is All You Need" by Google. And

2:48

this is the foundation behind all large

2:51

language models that we know of today.
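
To make that "how does this word relate to all the words before it" step concrete, here is a minimal NumPy sketch of causal attention. It is a simplification, not how any real model is wired: it treats one word as one token, uses random vectors in place of learned embeddings, and skips the learned query, key, and value projections a real model would have. The function name `causal_attention` is made up for this illustration.

```python
import numpy as np

def causal_attention(q, k, v):
    """Scaled dot-product attention where each token sees only itself and earlier tokens."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)                      # (n, n) relevance of every token to every token
    future = np.triu(np.ones((n, n), dtype=bool), k=1) # True above the diagonal = later tokens
    scores = np.where(future, -np.inf, scores)         # hide the future from each token
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability before exp
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax: each row sums to 1
    return weights @ v, weights

# Assumption: one token per word, so the 11-word example sentence has 11 tokens.
rng = np.random.default_rng(0)
n_tokens, dim = 11, 8
x = rng.normal(size=(n_tokens, dim))                   # stand-in for learned embeddings
out, weights = causal_attention(x, x, x)

# Row i of `weights` says how much token i draws on each of tokens 0..i.
print(weights[7].round(2))
```

Row 7 corresponds to the word "it" in the example, and its weights are spread only over the eight words up to and including itself. Everything after it gets exactly zero.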

2:53

In fact, if you're interested in

2:54

learning more, definitely see this video

2:56

for a full explainer. Now, for a short

2:59

sentence like this, it's fine. If you're

3:01

at the 10th word, that's just 10

3:03

comparisons. No big deal. But if you're

3:05

at the 100,000th word, that's 100,000

3:09

comparisons. And this is the fundamental

3:11

bottleneck of every large language

3:13

model. Imagine pushing this to a million

3:15

tokens. At that scale, the number of

3:18

comparisons becomes astronomical. So

3:21

large that even high-end hardware starts

3:23

to choke just trying to keep up. And

3:25

it's not just the compute that suffers.

3:28

To make this all work, the model has to

3:29

store intermediate results. Basically, a

3:32

running memory of everything it has seen

3:34

so far. This is called the key-value

3:37

cache or KV cache. You can think of it

3:39

as like a massive lookup table. For

3:42

every past word, it stores information

3:44

about what that word meant in context.

3:46

Now, at small scales, like for a short

3:49

paragraph, this is pretty manageable.

3:51

But at a million tokens, it becomes

3:53

absurd. You're now storing a ton of data

3:56

just to maintain context for a single

3:58

conversation. These are like gigabytes

4:00

sitting in expensive GPU memory just so

4:03

the model doesn't forget what it read

4:05

earlier. And if you put these two things

4:07

together, the insane compute required

4:09

and this exploding memory issue, you hit

4:12

a wall. So, how did DeepSeek solve this?
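
To put rough numbers on those two problems, here is a back-of-the-envelope sketch. The comparison count follows directly from the description above (the token at position n looks at n tokens). The KV cache estimate needs a model shape, and the one used here (60 layers, 8 key-value heads, head size 128, 2 bytes per value) is entirely made up for illustration. It is not DeepSeek V4's configuration, and both function names are hypothetical.

```python
def causal_comparisons(n_tokens):
    """Total comparisons for a whole sequence: 1 + 2 + ... + n_tokens."""
    return n_tokens * (n_tokens + 1) // 2

def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Plain KV cache size: one key and one value vector per token, head, and layer."""
    return 2 * n_tokens * n_layers * n_kv_heads * head_dim * bytes_per_value

# Hypothetical model shape, NOT DeepSeek V4's real configuration.
LAYERS, KV_HEADS, HEAD_DIM = 60, 8, 128

for n in (10, 100_000, 1_000_000):
    comparisons = causal_comparisons(n)
    cache_gb = kv_cache_bytes(n, LAYERS, KV_HEADS, HEAD_DIM) / 1e9
    print(f"{n:>9,} tokens: {comparisons:,} comparisons, {cache_gb:,.3f} GB of KV cache")
```

Going from 100,000 to 1 million tokens is 10 times more text but about 100 times more comparisons, because the total grows with the square of the length. With these made-up settings, the uncompressed cache for 1 million tokens comes to about 246 GB for a single conversation.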

4:16

Looking at the paper, what's interesting

4:17

is that the team didn't just throw more

4:19

brute force compute at the problem

4:21

because, well, they didn't have a lot of

4:23

compute. Instead, they asked a much more

4:25

elegant question. What if the model

4:27

didn't have to look at everything in the

4:29

first place? The key idea behind DeepSeek

4:32

V4 is deceptively simple. Don't treat

4:35

all past information as equally

4:37

important because in reality it isn't.

4:39

When you're reading a book, you don't

4:41

constantly reread every page you've ever

4:43

seen. You skim, you summarize, and you

4:45

jump back only when something is

4:47

relevant. Your brain is selective, and

4:49

DeepSeek tries to do the same thing.

4:52

They call this a hybrid attention

4:53

architecture. And at its core are two

4:55

complementary strategies called CSA and

4:58

HCA. Basically, compress the past and

5:01

then ignore most of it. Let's start with

5:03

the compression. So in a traditional

5:06

model, every token is stored

5:07

individually. One word, one entry, no

5:10

shortcuts. What DeepSeek did instead is

5:13

group them. So one part of the system is

5:15

called compressed sparse attention or

5:18

CSA for short. It takes small chunks of

5:21

tokens, say four at a time, and merges

5:23

their information into a single denser

5:26

representation. So instead of

5:27

remembering individual tokens, it stores

5:30

a compact summary of all four. So right

5:32

away you've reduced the sequence length

5:34

by a factor of four, which means fewer

5:37

comparisons and less memory and

5:39

therefore less compute. But that alone

5:41

isn't enough. Even after compressing,

5:44

you're still left with hundreds of

5:45

thousands of these blocks at large

5:47

scales. Still too many to process

5:49

efficiently. Compression helps, but it

5:51

doesn't solve the core problem. The real

5:53

breakthrough comes from the next step,

5:55

which is sparsity. Once the past is

5:58

compressed, the model doesn't treat all

6:00

of it as equally relevant. Instead, it uses

6:03

a fast internal mechanism, kind of like

6:05

a built-in search engine, to pick out

6:08

only the most useful pieces. They call

6:10

this the lightning indexer for sparse

6:12

selection. When the model processes a

6:14

new token, it doesn't scan the entire

6:16

history. It rapidly scores all those

6:18

compressed blocks and selects only a

6:21

small subset, the ones that most likely

6:23

matter for the current context.
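
Here is a toy sketch of those two steps together: compress the history in groups of four, then score the compressed blocks and attend only to the best few. Several things are stand-ins chosen for simplicity, not taken from the paper: mean pooling plays the role of the compression, a plain dot product plays the role of the indexer's scoring, and the names `compress_blocks` and `sparse_attend` and the `top_k=32` setting are hypothetical.

```python
import numpy as np

def compress_blocks(x, block_size):
    """Merge each run of `block_size` token vectors into one vector (mean pooling as a stand-in)."""
    n, d = x.shape
    usable = (n // block_size) * block_size        # drop a ragged tail to keep the sketch short
    return x[:usable].reshape(-1, block_size, d).mean(axis=1)

def sparse_attend(query, blocks, top_k):
    """Score every block cheaply, keep the top_k, and run attention over only those."""
    index_scores = blocks @ query                  # one cheap relevance score per block
    top_k = min(top_k, len(blocks))
    keep = np.argpartition(index_scores, -top_k)[-top_k:]  # indices of the top_k scores
    scores = blocks[keep] @ query / np.sqrt(len(query))
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                       # softmax over the kept blocks only
    return weights @ blocks[keep], keep

rng = np.random.default_rng(0)
history = rng.normal(size=(4096, 16))              # 4,096 past tokens, 16 numbers each
query = rng.normal(size=16)                        # the token being processed now

blocks = compress_blocks(history, block_size=4)    # 4,096 entries become 1,024
out, keep = sparse_attend(query, blocks, top_k=32) # attention touches 32 of them

print(len(history), "tokens ->", len(blocks), "blocks ->", len(keep), "blocks attended")
```

The scoring pass still visits every block, so it has to be much cheaper than full attention for this to pay off. The expensive attention step then runs over 32 entries instead of 4,096.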

6:25

Everything else is ignored, just skipped

6:28

entirely. And this is a subtle but

6:29

profound shift. The model isn't trying

6:32

to remember everything perfectly. It's

6:34

trying to remember the right things at

6:35

the right time. And that changes things

6:38

completely. Instead of doing massive

6:40

computations over the entire past, it

6:42

focuses compute only where it actually

6:45

matters. And this drastically reduces

6:47

the workload without losing meaningful

6:49

context. But DeepSeek didn't stop there

6:52

because sometimes you do want a broad

6:54

high-level understanding of everything

6:56

that came before even if you don't need

6:58

the fine details. That's where the

7:00

second system comes in. So this is

7:02

called heavily compressed attention or

7:05

HCA for short. Here the compression is

7:08

far more aggressive. Instead of grouping

7:10

four tokens, it groups something like

7:12

128 tokens or like an entire paragraph

7:15

into a single representation. Now you're

7:18

shrinking the sequence length by orders

7:20

of magnitude. And at that point,

7:22

something interesting happens. The

7:23

sequence becomes so short that the model

7:26

can afford to look at everything at once

7:28

because everything is now small enough

7:31

to handle. So you end up with a layered

7:33

strategy. One pathway keeps moderately

7:36

detailed chunks and selectively

7:38

retrieves the most relevant ones.

7:39

Another one keeps extremely compressed
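
The heavily compressed pathway can be sketched the same way. This is a rough illustration under stated assumptions: mean pooling again stands in for whatever compression the model actually learns, 128 is the "something like 128" group size mentioned above, and the names `num_blocks`, `heavy_compress`, and `dense_attend` are hypothetical.

```python
import math
import numpy as np

def num_blocks(n_tokens, block_size):
    """How many entries are left after grouping tokens into blocks."""
    return math.ceil(n_tokens / block_size)

def heavy_compress(x, block_size=128):
    """Average every `block_size` tokens into one vector, including a shorter final block."""
    n, d = x.shape
    pad = (-n) % block_size                        # zero rows needed to fill the last block
    padded = np.pad(x, ((0, pad), (0, 0)))
    counts = np.full(len(padded) // block_size, block_size)
    counts[-1] -= pad                              # the last block holds fewer real tokens
    return padded.reshape(-1, block_size, d).sum(axis=1) / counts[:, None]

def dense_attend(query, blocks):
    """Ordinary attention over every block, with no selection step."""
    scores = blocks @ query / np.sqrt(len(query))
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ blocks

# Sequence lengths each pathway sees for a 1 million token prompt.
print(num_blocks(1_000_000, 4), "blocks at 4 tokens per block")
print(num_blocks(1_000_000, 128), "blocks at 128 tokens per block")

rng = np.random.default_rng(0)
history = rng.normal(size=(1000, 16))
summary_blocks = heavy_compress(history)           # 1,000 tokens become 8 coarse summaries
overview = dense_attend(rng.normal(size=16), summary_blocks)
```

At four tokens per block, a 1 million token prompt still leaves 250,000 entries, which is why that pathway needs the selection step. At 128 per block it leaves 7,813, few enough to look at all of them.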
