The ML Technique Every Founder Should Know
FULL TRANSCRIPT
Welcome back to another episode of
Decoded. Today I'm sitting down with YC
visiting partner Francois Chaubard to talk
about one of the most important topics
in AI today, diffusion. Francois has
been doing computer vision since 2012, when he started in Fei-Fei Li's lab. After a decade running Focal Systems, he's currently back at Stanford finishing his PhD, working on diffusion-based world models for AGI.
We're going to break down what diffusion
is, how it's evolved over the past
decade, and how it's used today.
>> [music]
>> Francois, thanks for being here.
>> Thank you for having me.
>> Well, we just got back from NeurIPS. We just spent a lot of time talking to researchers and thinking about all the newest models out there. I think we saw diffusion pop up over and over, and newer versions of these types of approaches that are not autoregressive LLMs. And so I wanted to talk to you about those today. So first, why don't we start by defining: what is diffusion?
>> Diffusion is a very fundamental machine learning framework that allows you to learn any p(data) — any probability distribution over data — for any domain, as long as you have the data.
>> So you're trying to learn some data
distribution.
>> That's right.
>> Now in a sense all LLMs or all machine
learning models are about learning data
distributions.
>> How does diffusion in particular — what stance does it take, or what approach does it take, to learning a distribution?
>> Yeah, I mean, I think you can use diffusion to always do that. Where it stands out in particular is mapping from high dimensions to high dimensions, especially in low-data regimes. So, say I only have 30 images of Gary — and I actually have some code that we're going to walk through. I only have 30 images of Gary, and we're in this 1,000 by 1,000 by 3 dimensional space, and I want to map to another 3-million-dimensional space with only 30 training samples — and I can still do it. It's pretty powerful in that way.
>> Okay, cool. So you have this ability to use relatively small amounts of data, compared to the dimensionality, to learn a p(data).
>> That's right.
>> What's the basic process by which diffusion works? Just walk through it at a very high level — we'll get into the math a little bit later — how does this process actually work?
>> We take some sample of the data — an image of Anka, an image of Gary — and we just hit it with noise. And then we just keep hitting it with noise, and we create this train of noised-up images. It's very easy to create noisy images, right? It's hard to walk backwards and create, from noise, images of you or Gary. And so then we flip it, and we try to teach the model to reverse that process. And that's basically it.
>> Okay, cool. So it's basically a noiser and a denoiser, and the denoiser is the model that you end up training.
>> Exactly. Yeah. You will basically give the model noised-up images and have it learn intermediate representations to get back to p(data).
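The forward noising chain described above can be sketched in a few lines. This is a minimal illustration with stand-in random data, not the notebook code from the episode; the step count and noise rate are arbitrary choices:

```python
import numpy as np

def noise_chain(x0, num_steps=100, beta=0.02, seed=0):
    """Forward diffusion sketch: repeatedly mix the signal with fresh
    Gaussian noise. Each step keeps sqrt(1 - beta) of the current image
    and adds sqrt(beta) of new noise, so the variance stays ~1 while
    the original structure is gradually destroyed."""
    rng = np.random.default_rng(seed)
    xt = x0.copy()
    chain = [xt]
    for _ in range(num_steps):
        eps = rng.standard_normal(xt.shape)
        xt = np.sqrt(1.0 - beta) * xt + np.sqrt(beta) * eps
        chain.append(xt)
    return chain

# A stand-in 64x64x3 "image" (random pixels; real code would load a photo).
x0 = np.random.default_rng(1).standard_normal((64, 64, 3))
chain = noise_chain(x0)
```

By the end of the chain, the correlation with the clean image has mostly decayed — the "random static" the hosts describe.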
>> Cool. Nice. And what kinds of stuff is
diffusion used for today? What are some
applications that it's widely deployed
in?
>> It's honestly surprising how applicable this process is. I think the original 2015 Jascha Sohl-Dickstein paper was on CIFAR-10, which is just images. It has its roots in images, but it is far more sprawling than just images. As you've seen, DeepMind just won the Nobel Prize for doing this exact procedure on protein folding. You can drive cars with this, with the diffusion policy paper, which is like an insane result. You can predict the weather. There's really no limit to the things this can do.
>> Yeah, it's pretty incredible to see. I mean, we have these image and video generation models that have been really advancing over the last few years. Stable Diffusion is the one that I think many people have heard of, and newer versions of it seem to be using this as well. And then, in the world of life sciences that my company was in too, we see this newest generation of life sciences AI companies heavily investing in this set of technologies. There's a model called DiffDock that works really well for predicting small-molecule binding to proteins, and AlphaFold — especially the newest AlphaFold versions — uses diffusion pretty heavily. It's really cool to see the same core piece of technology apply to so many different domains.
>> Yeah. Yeah.
>> This class of models has evolved over the years, and there's a whole slew of papers someone could read — you should probably go read the papers to learn all the details. But maybe at a high level we can trace out a few of the key innovations, starting with the paper you already mentioned, that led to the newest versions of these models. So how would you map those out? What was the first turn of the crank — the first version of this very high-level diffusion process you outlined that started to work?
>> Yeah, so I think the 2015 original Jascha paper set up all the key pieces, all the key components, of modern diffusion, and now we're just playing with different knobs. The scheduler: how do we add noise, and at what weight? That's a whole part we can discuss. What's the loss function? Should the deep learning model, conditioned on x_t, predict the actual data x_{t-1}, or should it predict the error that was just added to it, or should it predict the velocity, which is the error divided by the time, or should it predict the velocity between the start and the end — that's called flow matching. So there are all these different plays on what the loss function is.
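The loss targets listed here — the data, the error, the velocity — are all linear functions of the same (data, noise) pair. A hedged numpy sketch of how they relate, using the linear-interpolation convention described later in the episode; the variable names are illustrative, not any particular paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal(8)     # a clean data sample
eps = rng.standard_normal(8)    # the Gaussian noise mixed into it
t = 0.3                         # interpolation time in [0, 1]; t = 1 is clean data

# Linear-interpolation corruption: t * data + (1 - t) * noise
xt = t * x0 + (1.0 - t) * eps

# Candidate regression targets for the denoising model:
target_data = x0                # predict the clean data directly
target_error = eps              # predict the noise that was added
target_velocity = eps - x0      # flow matching: one global, time-independent velocity

# Any one target recovers the others given xt and t, e.g. data from velocity:
x0_from_velocity = xt - (1.0 - t) * target_velocity
```

The choice among them changes what the network finds easy to learn, not what information the target carries.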
>> So in all of those, the idea is still to do denoising.
>> Yes.
>> But the objective for each of them is somewhat different, and they're all pretty closely related — whether it's basically a delta between two things, or the previous step, or the first step. How did these all actually come together? Were these a series of papers that happened one after another?
>> Yeah, I think we just kind of hill-climbed on this Fréchet inception distance metric — it's kind of a kooky, weird measure of how good an image is — but we just kept getting better and better on it by doing these little tricks. It turns out that predicting the actual data itself is actually quite hard, and maybe predicting the error is easier, and predicting the velocity was even easier than that. And then predicting the global error across the entire diffusion schedule is even easier than that. We just kept finding easier and easier ways to sample from noise to data.
>> And here, when you say easier — was the ease largely driven by it being mathematically simpler, or easier to implement and engineer, or simpler to reason about? What got easier, really?
>> It actually is that too, but I didn't mean it that way. What I actually meant was it's easier for the model to learn.
>> But it is also — and we'll go through some coding examples — that the math actually got easier. And the code got smaller, which is the opposite of most machine learning, where things usually get more complicated. I think we started with UNets, and that was the predominant architecture — we haven't really talked about architectures that much — but then we got into these diffusion transformers and this cross-attention mechanism and things like that. And so, yeah, we just kept getting better and better at reducing FID.
>> Interesting. Should we dive into some
code examples?
>> Let's do it.
>> Let's do it.
>> I'll walk you through. I made about — one, two, three, four, five, six, seven of these that I implemented, with varying levels of success, but the structure is going to be the same for all of them. So the Jascha paper, the non-equilibrium thermodynamics paper — you can see here some nice images of Gary. Very nice. This is what I could find online.
>> Nice.
>> Um and then
>> So those are images of Gary that you've downsampled so that they're smaller than 1,000 by 1,000?
>> These are 64 by 64. Yeah, they're really small. This is just a very small example. And then I randomly augment to create more data.
>> Great.
>> Because I was lazy, and that was easier than downloading more images.
>> Cool. [laughter]
>> Didn't want to get security called on you.
>> Exactly. So then I implemented this diffusion schedule, and this is probably one of the most important parts of diffusion, and one of the most difficult to comprehend. I would say the noise schedule is actually the hardest part to understand — I really struggled with it myself. And so you can see here, the noise that's added from time step zero to 10 to 25, all the way to 100, is clearly destroying the structure.
>> Yes.
>> And then we want to train —
>> Where you end is basically random static.
>> Exactly. And we want to basically reverse this: from here, get to here, and have the model get to that point, then that point, then that point, etc. And so, the interesting part — and Jascha really implemented almost everything that we needed for diffusion; there were just a few little tweaks that were missing, and he didn't scale it up. Those, to me, are the parts that were missing. And if you see here, the noise schedule: it would make sense to me that I would have linear interpolation between the image and the noise, and I would start with one and zero — one being the image and zero being the noise.
>> You gradually add it.
>> And I linearly add it. But if you do that, it's actually massively unstable, because the instantaneous amount of error you're adding is very small in the beginning, if you think about an image —
>> on a relative basis
>> On a relative basis. And then at the end, to get to complete noise, you have to destroy everything — you need to add a lot of error. And so if you're a model and you're just looking at this little chunk of the noise schedule, you have to handle a lot of error in one step, and on this side of the schedule you need to handle such small amounts of error. What you actually want is a relatively constant amount of error being introduced every single time step, and the cumulative sum of all that error actually ends up looking like this curve here.
>> That's the pink curve.
>> Yeah.
>> And so they call this a beta schedule. Beta is the diffusion rate — the rate of diffusion while I'm rolling this thing out from time zero to time capital T. And so you can see here the beta schedule. We usually have some beta-min to beta-max, and one minus that is the alpha. You can think of the beta as how much noise I'm adding at every time step.
>> Yep. And you think of the alpha as how much —
>> How much is being retained. And the term that really matters is the alpha bar — these are the weights that are used, and it has this kind of one-minus-sigmoid-looking shape. But that's basically the noise schedule, and once you get that right — really this part here — everything else just works. And then I train some model, and then we can actually —
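The beta/alpha/alpha-bar relationship described here can be sketched directly. The beta_min and beta_max defaults below are common illustrative values, not necessarily the ones in the notebook:

```python
import numpy as np

def linear_beta_schedule(num_steps=100, beta_min=1e-4, beta_max=0.02):
    """Return per-step noise rates (beta), retention rates (alpha),
    and the cumulative product alpha_bar, which tells you how much of
    the original signal survives after t steps."""
    betas = np.linspace(beta_min, beta_max, num_steps)
    alphas = 1.0 - betas                 # fraction of signal kept at each step
    alpha_bars = np.cumprod(alphas)      # total signal kept after t steps
    return betas, alphas, alpha_bars

betas, alphas, alpha_bars = linear_beta_schedule()

# alpha_bar lets you noise to any step t in one shot (DDPM-style):
#   x_t = sqrt(alpha_bar[t]) * x0 + sqrt(1 - alpha_bar[t]) * eps
```

The alpha-bar curve decays monotonically toward zero, which is the "one minus sigmoid looking thing" being pointed at on screen.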
>> So what was the training objective again? You were adding this noise, and the training objective was to do what, exactly?
>> In this case, it's to minimize the KL divergence between the real distribution and the distribution that I'm learning. And so I won't go through the code for this one, because it's a little bit hairier, but you can see the result on these generated images after 100 diffusion steps at inference time. And you can see that the Fréchet inception distance is 222 —
>> which is extremely high; modern day would be maybe eight or ten or something. And what's interesting here — you kind of scrolled through it, and you mentioned it — there's quite a lot of code it actually takes to do that KL-divergence-based loss. I suspect that in these later models you're going to show, it gets significantly simpler. So I'm mentally noting that, because I suspect there's going to be an interesting contrast to draw between the two.
>> Yeah. So, the next one I'd like to show is flow matching, which is just so beautiful and simple. This was out of Meta — Yaron Lipman — where he basically said: we don't need a lot of this stuff. Think about the noising process as: I start from data, I randomly sample a vector
>> of noise
>> and I just go in this direction, and then I do it again, and again — this direction, that direction — and then I'm here at noise. And then you have to teach the thing to go back along the exact opposite path, this very circuitous path. So at test time it's actually quite expensive. We've all waited for ChatGPT or Midjourney to make an image, and it takes a while — it's doing like a thousand calls to the model, again and again, iterating through to get to that point of p(data), right?
>> Instead —
>> And intuitively it's like, okay, we're doing the circuitous path, but surely there's a shorter path between those two.
>> And so that's what makes flow matching so cool, to me at least: they said, forget all of those intermediary results. There is a global velocity between the noise and the data, and it's just this direction — this straight line. And I don't care where you are: go along that line. Wherever you are, go along that line, and teach the model to go along that line. That's what flow matching does. And so I'll show that in the code. I bet it's like five lines of code. It really is quite simple. And so, this is pretty cool — here you go. You basically have 10 or 15 lines of code that is the most powerful machine learning procedure ever.
>> So, I have some data — an image of Gary.
>> Yep.
>> I have some noise — some isotropic Gaussian noise that I sample from.
>> Yep.
>> There's some time that I'm trying to index into in the diffusion schedule, and I create x_t, which is the noised-up image that's somewhere between extremely noisy and not noisy at all.
>> And that's basically just the sampling procedure. It's t times data —
>> Yep.
>> plus one minus t, times noise.
>> That's right. And then I compute the velocity, which is independent of the time — I don't care where you are — this global velocity, which is just the noise minus the data. And then I return that back to my training loop, which is the shortest training loop I've ever written [laughter] — it's five lines of code. I have my batch, I have some time, I sample from that function I just explained, and then I have my prediction from the model: I feed it some noised-up image, somewhere between lots of noise and little noise — x_t, let's call it — and I just want it to predict the velocity, the direction I want to go.
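A minimal sketch of the sampling function and training loop just described, in numpy so it stays self-contained. The tiny linear "model" and toy data stand in for whatever network and dataset you would actually use; all names and hyperparameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_xt(x0):
    """Draw a random time per example, mix the data with fresh noise,
    and return the noised sample plus the global velocity target."""
    t = rng.uniform(size=(x0.shape[0], 1))       # one t in [0, 1] per example
    noise = rng.standard_normal(x0.shape)
    xt = t * x0 + (1.0 - t) * noise              # t * data + (1 - t) * noise
    velocity = noise - x0                        # time-independent target
    return xt, t, velocity

# Stand-in "model": a linear map trained by plain gradient descent.
dim = 4
W = np.zeros((dim, dim))

def model(xt):
    return xt @ W

data = rng.standard_normal((256, dim)) + 3.0     # toy stand-in for p(data)
for _ in range(500):                             # the whole training loop
    xt, t, v = sample_xt(data)
    pred = model(xt)
    grad = 2.0 * xt.T @ (pred - v) / len(xt)     # gradient of MSE(pred, v) in W
    W -= 0.01 * grad                             # gradient descent step
loss = float(np.mean((model(xt) - v) ** 2))
```

Swapping the linear map for a UNet or diffusion transformer changes nothing about this loop — which is the clean abstraction the hosts discuss next.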
>> And this is also really powerful, because here you have a model abstraction, but that model can be any model, right? So you can put in whatever the relevant model is for your distribution — whether that's a protein model for proteins, or an LLM for text, or an image-based model for images. That is a very clean abstraction: as long as you can predict this velocity, you can then move in that direction.
>> That's right. This code here has nothing to do with images. It could be weather data. It could be stock market data. It could be trajectories from a robot in a teleop setup. It could be proteins. It could be DNA. It doesn't really matter — it's all the exact same code. And we also haven't talked about the architecture, so this model here could be anything you want it to be: it could be an RNN; it could be a UNet, which traditionally it was; and, more modernly, they use these diffusion transformers with the cross-attention mechanism. It can be whatever you want. But all of that is independent from whether or not you're doing flow matching.
>> I think this is a really profoundly interesting result. We often assume, as models have gotten more sophisticated, that they become less accessible for people to understand, but this is
>> quite literally 10 lines of code, right? [laughter]
>> that explains essentially all of the most important mathematical and fundamental foundations of the models that we all see generating basically magical AI results on our phones. Of course, there's lots of engineering in how you scale them up — that model could be a 100-billion-parameter transformer across data centers of GPUs.
>> Totally. Yeah, 100%.
>> So, it's the engineering that's the
really hard part there, but a lot of the
basic machine learning math is actually
quite straightforward.
>> That's right. And so, there are a bunch of these tangent fields to diffusion that all have some different interpretation of what's actually happening, but it's all the same exact math. And most people learning diffusion actually get quite confused, because if you talk to some probabilistic graphical model people, they'll say: oh, this is a probabilistic graphical model — actually, this is a hidden Markov model, and what we're doing is learning this Markovian thing, or whatever. It's like, okay, fine, but it's just
>> noise [laughter]
>> and you should just show that first. And then if you think about it from a physics perspective, there are all these stat-mech people who have that interpretation — there's a whole bunch of different interpretations, and I think it gets a little bit confusing. And the stochastic differential equation people like thinking about this as an SDE. I think that's all fine, and it probably is helpful to think about, but in terms of teaching it, it's actually quite simple, which is powerful.
>> Cool.
>> So if we go back to here, you can see that this is just literally predicting the velocity. Your goal is to have the model predict —
>> You're minimizing the loss between predicted velocity
>> and the actual velocity. That's it. And that's super stable, and it's really clean. And then at test time — for the physics people, this is like an Euler step that you're doing, where you call the model a bunch of times and you iteratively refine.
>> So back to the hill climbing that we were talking about: I'll grab some random noise here, x, and I basically reverse that noising process — denoise, denoise, denoise, and —
>> It's literally Euler's method: you're using the velocity to point in the direction you want.
>> Point in the direction, and just keep going, keep going, keep going, until you've done the number of steps. The one thing that I really don't like about diffusion as it's done today is that I can't keep calling it beyond the 100 diffusion steps I trained on in my diffusion schedule. If I change that at test time, it doesn't work. So you can't say: oh, I want it even better, so I'll call it even more. You can't — I've tried it. It doesn't work.
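The Euler-style sampling loop described here is: start from noise, ask the model for a velocity, step along it, repeat. A hedged sketch, where a closed-form stand-in plays the role of the trained network so the loop has a correct field to integrate (everything here is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal(8)            # the "data" a real model would have learned

def velocity_model(x, t):
    """Stand-in for the trained network: the exact straight-line velocity
    field (noise - data) toward a single known target, so the Euler loop
    below has something correct to integrate."""
    return (x - x0) / (1.0 - t)        # equals noise - x0 along the line

num_steps = 100                        # must match training; see the caveat above
x = rng.standard_normal(8)             # start from pure noise (t = 0)
for k in range(num_steps):
    t = k / num_steps
    v = velocity_model(x, t)
    x = x - v / num_steps              # Euler step toward the data (t -> 1)
```

With this exact field, the loop lands on the target; a real network's velocity estimates are imperfect, which is why the step count it was trained with matters.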
>> Yeah. There are various tricks people try there, but yeah.
>> Yeah. And so there are games being played there that are actually quite exciting, to get around all that expense.
>> But wait, sorry, we should be clear here. You're saying that's not relevant here, right? Because in this type of model, you don't have this time dependency.
>> Well, you do. So at test time, if you change, for example, the number of steps — if you double it, say, and you expect to get even higher-resolution images — it actually will just turn into white. It just doesn't work at all. So you can't step beyond the number of steps that was trained.
>> That's an important detail. There are tricks that people are doing to try to compress that representation. Like, if at train time I train for 100 steps and at test time I want to do 10 steps, then what you can do is distillation into the model, to try to have the 10-step model learn what the 100-step model does. But then you've still got to train with 10 steps — if you're training with X steps, you have to be using X steps at test time.
>> I see — interesting. You've talked about this concept of a squint test. Why don't you define the squint test for a second? Tell me a little about where this comes from, and then I'd be curious to hear how you think about diffusion models in the context of general intelligence broadly.
>> Yann LeCun has this interesting lecture where he talks about our discovery of flight, and that we didn't need flapping wings — we kept trying to mimic a bat, and how that was a waste of time. And to that I say: you're 100% right. However, we did need two wings. You look at the Wright brothers' original plane and you squint, and you look at a bird, and you're just like, hm. And while we have helicopters and jets and rockets and things like that, we got there eventually. And so there are many elements in the set of things that can achieve flight, and they have different pros and cons. And there are many elements in the set of things that can achieve intelligence. We are the only existence proof of it at all, and I'm sure there will be more elements in the set, and maybe LLMs, broadly speaking, can get there. But if I squint and look at the LLM setup, I see this monolithic stack of transformers — the same thing stacked and stacked — and three stages of training: we do this pre-train, SFT, post-train, and then no learning at all beyond that. And it produces exactly one token at a time,
>> right? So, one token at a time, iteratively.
>> One token at a time, and it never goes backwards. And then you look at a brain: massive amounts of recursion. You have one learning procedure the whole time. You have these two lobes with a corpus callosum between them that's going back and forth, and we think. And I definitely don't think one token at a time. When I write code, I don't write one little character at a time and never go backwards — I'm kind of going backwards, recursively improving, going backwards again and again. I'm thinking in concepts.
>> There's this dynamic process that's emitting concepts, and then higher-level concepts, and then lower-level manifestations of them.
>> And I'm sure that may be happening inside the LLM, but it's almost stuck. It can't do more than one step, even though it might want to, because of the way that we trained it.
>> Right. It might have all that inside the LLM, but then it's sort of bottlenecked, ultimately, because its action space is one —
>> One token at a time. And so that's where I think about diffusion. There are two main things that diffusion gives me. It doesn't get me all the way to passing my squint test, but it gives me two things that I'm sure the brain is doing. Number one: the entirety of biology and nature leverages randomness. Randomness is good. And what is diffusion doing? Leveraging randomness. If you give me data and I noise it up, from that I can learn about the data. And can the brain add noise to input data? Absolutely — neurons are massively random: log-normal distributions, spike patterns, things like that. And the other is this emission of one thing at a time, versus thinking in concepts and then decoding into a big chunk of text and thought, and revising previous thoughts. And so I think diffusion gives me both of those things, for sure.
>> People have probably heard of Stable Diffusion as a very common application of this — it's an image generation model that's been pretty widely available for the last few years. What people may not be so aware of is all the other ways diffusion has been used, in the last few years, in products people are widely using. So what are some of the areas in which diffusion is most widely deployed?
>> Yeah, it's really any mapping from very high-dimensional p(data) to very high-dimensional action spaces, or another p(data) that you may want to map to. And so, of course, everyone knows generating images, because we've done Midjourney and things like that, and even more modern versions of that with Sora and Veo and Flux and SD3 now. And we're generating videos, which is just images stapled together — video gen and image gen and things like that. However, there are so many more applications that we're seeing now — that's the most exciting part, in my view, all the new applications. Whether you're now creating sentences — I mean, diffusion LLMs were one of the biggest topics we saw at NeurIPS, whether it's continuous diffusion LMs or discrete diffusion LLMs. It's writing code now. It's creating proteins — DeepMind won the Nobel Prize for that. There are robotic policies — this diffusion policy thing, which I think might actually be one of the biggest uses of it, and will result in robotics actually working, Rosie the robot actually working. There's weather forecasting — GenCast is the most accurate weather forecasting system in the world. It's really anything. I even mentioned Harrison working on diffusion for failure sampling — sampling for failures and bad things that could happen — we can do that as well.
>> So a lot of the products where we see people actually using AI — especially for things other than just text-based chat — are using diffusion: especially images and videos, and increasingly now things like code and the life sciences. So a pretty wide breadth of things.
>> Yeah. In fact, I would say the only two holdouts right now where the state of the art is not diffusion — diffusion has eaten all of AI except two — are LLMs, where autoregression is still outperforming, and game play, things like AlphaGo, where MCTS is still state of the art for those types of things. And so we haven't seen diffusion really take a step in those two areas, but more research is needed.
>> So, to bring the conversation to a head: how should people think about this research area, either as researchers contributing to the field or as founders looking to build a new product?
>> Yeah, I mean, I would say it falls into two camps: whether you're training models yourself, or you're using models and not in the business of training them. If you're in the business of training models, I would seriously look at diffusion. I don't care what your application is — you should be looking at this procedure, even if it's just to get a latent space that you can then train off of. There's no application in machine learning where I don't think you should be heavily looking at diffusion procedures as a fundamental piece of your training loop. In the case of people who are not training models, I would just say: update your prior on how good these things are getting. If you just look at the last five years — how good image generation got, from Midjourney when it first came out to Veo and Sora and Flux and SD3 now — it's like a thousand times better, right? The answer was just to scale it up, and that takes time, and that takes money, and data, and all those things. And now you apply that to proteins, you apply that to DNA, you apply that to robotics policies, a self-driving car — I mean, skate to where the puck's going to go: all these things are going to work, and we're watching it happen. It may cost money and time and those kinds of things, but those are solvable things — tractable problems that we can go solve. And also, the core procedure of diffusion is getting better — a lot simpler, and it's just working better. And so, skate to where the puck's going to go. Bet that Rosie the robot will work in people's homes. Bet that protein folding is only going to get better, and that we're going to apply it to DNA and all these other things — metabolomics and so on.
>> We see founders develop new models for robotics or text generation or video using diffusion, and we see founders who take all these methods coming from other places and build companies on top of them. It seems like there's a whole new wave of companies that can be built on either end of this now.
>> Right. I think it's going to redefine the entire economy.
>> Thanks so much for joining us. We're going to keep digging into topics related to machine learning research, like diffusion. Can't wait to see you at the next one.
[music]