The ML Technique Every Founder Should Know
FULL TRANSCRIPT
Welcome back to another episode of
Decoded. Today I'm sitting down with YC
visiting partner Francois Chaubard to talk
about one of the most important topics
in AI today, diffusion. Francois has
been doing computer vision since 2012, when he started in Fei-Fei Li's lab. After a decade running Focal Systems, he's currently back at Stanford finishing his PhD, working on diffusion-based world models for AGI.
We're going to break down what diffusion
is, how it's evolved over the past
decade, and how it's used today.
>> [music]
>> Francois, thanks for being here.
>> Thank you for having me.
>> Well, we just got back from NeurIPS. We just spent a lot of time talking to researchers and thinking about all the newest models out there. I think we saw diffusion pop up over and over, and newer versions of these types of approaches that are not autoregressive LLMs. And so I wanted to talk to you about those today. So first, why don't we start by defining: what is diffusion?
>> Diffusion is a very fundamental machine learning framework that allows you to learn any p(data) — any probability distribution over data — for any domain, as long as you have the data.
>> So you're trying to learn some data
distribution.
>> That's right.
>> Now in a sense all LLMs or all machine
learning models are about learning data
distributions.
>> How does diffusion in particular — what stance does it take, or what approach does it take, to learning a distribution?
>> Yeah, I mean, I think you can use diffusion to always do that. Where it stands out in particular is mapping from high dimensions to high dimensions, especially in low-data regimes. So, say I only have 30 images of Gary — and I actually have some code that we're going to walk through. I only have 30 images of Gary, and we're in this 1,000 by 1,000 by 3 dimensional space, and I want to map to another 3-million-dimensional space with only 30 training samples — and I can still do it. It's pretty powerful in that way.
>> Okay, cool. So you have this ability to use relatively small amounts of data, compared to the dimensionality, to learn a p(data).
>> That's right.
>> What's the basic process by which diffusion works? Just walk through it at a very high level — we'll get into the math a little bit later — how does this process actually work?
>> We take some sample of the data — an image of Anka, an image of Gary — and we just hit it with noise. And then we just keep hitting it with noise, and we create this train of noised-up images. It's very easy to create noisy images, right? It's hard to walk backwards and create, from noise, images of you or Gary. And so then we flip it, and we try to teach the model to reverse that process. And that's basically it.
>> Okay, cool. So it's basically a noiser and a denoiser, and the denoiser is the model that you end up training.
>> Exactly. Yeah. You will basically give the model noised-up images and have it learn intermediate representations to get back to p(data).
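The forward noising chain described above can be sketched in a few lines. This is a minimal illustration with stand-in random data, not the notebook code from the episode; the step count and noise rate are arbitrary choices:

```python
import numpy as np

def noise_chain(x0, num_steps=100, beta=0.02, seed=0):
    """Forward diffusion sketch: repeatedly mix the signal with fresh
    Gaussian noise. Each step keeps sqrt(1 - beta) of the current image
    and adds sqrt(beta) of new noise, so the variance stays ~1 while
    the original structure is gradually destroyed."""
    rng = np.random.default_rng(seed)
    xt = x0.copy()
    chain = [xt]
    for _ in range(num_steps):
        eps = rng.standard_normal(xt.shape)
        xt = np.sqrt(1.0 - beta) * xt + np.sqrt(beta) * eps
        chain.append(xt)
    return chain

# A stand-in 64x64x3 "image" (random pixels; real code would load a photo).
x0 = np.random.default_rng(1).standard_normal((64, 64, 3))
chain = noise_chain(x0)
```

By the end of the chain, the correlation with the clean image has mostly decayed — the "random static" the hosts describe.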
>> Cool. Nice. And what kinds of stuff is
diffusion used for today? What are some
applications that it's widely deployed
in?
>> It's honestly surprising how applicable this process is. I think the original 2015 Jascha Sohl-Dickstein paper was on CIFAR-10, which is just images. It has its roots in images, but it is far more sprawling than just images. As you've seen, DeepMind just won the Nobel Prize for doing this exact procedure on protein folding. You can drive cars with this, with the diffusion policy paper, which is like an insane result. You can predict the weather. There's really no limit to the things this can do.
>> Yeah, it's pretty incredible to see. I mean, we have these image and video generation models that have been really advancing over the last few years. Stable Diffusion is the one that I think many people have heard of, and newer versions of it seem to be using this as well. And then, in the world of life sciences that my company was in too, we see this newest generation of life sciences AI companies heavily investing in this set of technologies. There's a model called DiffDock that works really well for predicting small-molecule binding to proteins, and AlphaFold — especially the newest AlphaFold versions — uses diffusion pretty heavily. It's really cool to see the same core piece of technology apply to so many different domains.
>> Yeah. Yeah.
>> This class of models has evolved over the years, and there's a whole slew of papers someone could read — you should probably go read the papers to learn all the details. But maybe at a high level we can trace out a few of the key innovations, starting with the paper you already mentioned, that led to the newest versions of these models. So how would you map those out? What was the first turn of the crank — the first version of this very high-level diffusion process you outlined that started to work?
>> Yeah, so I think the 2015 original Jascha paper set up all the key pieces, all the key components, of modern diffusion, and now we're just playing with different knobs. The scheduler: how do we add noise, and at what weight? That's a whole part we can discuss. What's the loss function? Should the deep learning model, conditioned on x_t, predict the actual data x_{t-1}, or should it predict the error that was just added to it, or should it predict the velocity, which is the error divided by the time, or should it predict the velocity between the start and the end — that's called flow matching. So there are all these different plays on what the loss function is.
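The loss targets listed here — the data, the error, the velocity — are all linear functions of the same (data, noise) pair. A hedged numpy sketch of how they relate, using the linear-interpolation convention described later in the episode; the variable names are illustrative, not any particular paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal(8)     # a clean data sample
eps = rng.standard_normal(8)    # the Gaussian noise mixed into it
t = 0.3                         # interpolation time in [0, 1]; t = 1 is clean data

# Linear-interpolation corruption: t * data + (1 - t) * noise
xt = t * x0 + (1.0 - t) * eps

# Candidate regression targets for the denoising model:
target_data = x0                # predict the clean data directly
target_error = eps              # predict the noise that was added
target_velocity = eps - x0      # flow matching: one global, time-independent velocity

# Any one target recovers the others given xt and t, e.g. data from velocity:
x0_from_velocity = xt - (1.0 - t) * target_velocity
```

The choice among them changes what the network finds easy to learn, not what information the target carries.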
>> So in all of those, the idea is still to do denoising.
>> Yes.
>> But the objective for each of them is somewhat different, and they're all pretty closely related — whether it's basically a delta between two things, or the previous step, or the first step. How did these all actually come together? Were these a series of papers that happened one after another?
>> Yeah, I think we just kind of hill-climbed on this Fréchet inception distance metric — it's kind of a kooky, weird measure of how good an image is — but we just kept getting better and better on it by doing these little tricks. It turns out that predicting the actual data itself is actually quite hard, and maybe predicting the error is easier, and predicting the velocity was even easier than that. And then predicting the global error across the entire diffusion schedule is even easier than that. We just kept finding easier and easier ways to sample from noise to data.
>> And here, when you say easier — was the ease largely driven by it being mathematically simpler, or easier to implement and engineer, or simpler to reason about? What got easier, really?
>> It actually is that too, but I didn't mean it that way. What I actually meant was it's easier for the model to learn.
>> But it is also — and we'll go through some coding examples — that the math actually got easier. And the code got smaller, which is the opposite of most machine learning, where things usually get more complicated. I think we started with UNets, and that was the predominant architecture — we haven't really talked about architectures that much — but then we got into these diffusion transformers and this cross-attention mechanism and things like that. And so, yeah, we just kept getting better and better at reducing FID.
>> Interesting. Should we dive into some
code examples?
>> Let's do it.
>> Let's do it.
>> I'll walk you through. I made about — one, two, three, four, five, six, seven of these that I implemented, with varying levels of success, but the structure is going to be the same for all of them. So the Jascha paper, the non-equilibrium thermodynamics paper — you can see here some nice images of Gary. Very nice. This is what I could find online.
>> Nice.
>> Um and then
>> So those are images of Gary that you've downsampled so that they're smaller than 1,000 by 1,000?
>> These are 64 by 64. Yeah, they're really small. This is just a very small example. And then I randomly augment to create more data.
>> Great.
>> Because I was lazy, and that was easier than downloading more images.
>> Cool. [laughter]
>> Didn't want to get security called on you.
>> Exactly. So then I implemented this diffusion schedule, and this is probably one of the most important parts of diffusion, and one of the most difficult to comprehend. I would say the noise schedule is actually the hardest part to understand — I really struggled with it myself. And so you can see here, the noise that's added from time step zero to 10 to 25, all the way to 100, is clearly destroying the structure.
>> Yes.
>> And then we want to train —
>> Where you end is basically random static.
>> Exactly. And we want to basically reverse this: from here, get to here, and have the model get to that point, then that point, then that point, etc. And so, the interesting part — and Jascha really implemented almost everything that we needed for diffusion; there were just a few little tweaks that were missing, and he didn't scale it up. Those, to me, are the parts that were missing. And if you see here, the noise schedule: it would make sense to me that I would have linear interpolation between the image and the noise, and I would start with one and zero — one being the image and zero being the noise.
>> You gradually add it.
>> And I linearly add it. But if you do that, it's actually massively unstable, because the instantaneous amount of error you're adding is very small in the beginning, if you think about an image —
>> on a relative basis
>> On a relative basis. And then at the end, to get to complete noise, you have to destroy everything — you need to add a lot of error. And so if you're a model and you're just looking at this little chunk of the noise schedule, you have to handle a lot of error in one step, and on this side of the schedule you need to handle such small amounts of error. What you actually want is a relatively constant amount of error being introduced every single time step, and the cumulative sum of all that error actually ends up looking like this curve here.
>> That's the pink curve.
>> Yeah.
>> And so they call this a beta schedule. Beta is the diffusion rate — the rate of diffusion while I'm rolling this thing out from time zero to time capital T. And so you can see here the beta schedule. We usually have some beta-min to beta-max, and one minus that is the alpha. You can think of the beta as how much noise I'm adding at every time step.
>> Yep. And you think of the alpha as how much —
>> How much is being retained. And the term that really matters is the alpha bar — these are the weights that are used, and it has this kind of one-minus-sigmoid-looking shape. But that's basically the noise schedule, and once you get that right — really this part here — everything else just works. And then I train some model, and then we can actually —
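The beta/alpha/alpha-bar relationship described here can be sketched directly. The beta_min and beta_max defaults below are common illustrative values, not necessarily the ones in the notebook:

```python
import numpy as np

def linear_beta_schedule(num_steps=100, beta_min=1e-4, beta_max=0.02):
    """Return per-step noise rates (beta), retention rates (alpha),
    and the cumulative product alpha_bar, which tells you how much of
    the original signal survives after t steps."""
    betas = np.linspace(beta_min, beta_max, num_steps)
    alphas = 1.0 - betas                 # fraction of signal kept at each step
    alpha_bars = np.cumprod(alphas)      # total signal kept after t steps
    return betas, alphas, alpha_bars

betas, alphas, alpha_bars = linear_beta_schedule()

# alpha_bar lets you noise to any step t in one shot (DDPM-style):
#   x_t = sqrt(alpha_bar[t]) * x0 + sqrt(1 - alpha_bar[t]) * eps
```

The alpha-bar curve decays monotonically toward zero, which is the "one minus sigmoid looking thing" being pointed at on screen.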
>> So what was the training objective again? You were adding this noise, and the training objective was to do what, exactly?
>> In this case, it's to minimize the KL divergence between the real distribution and the distribution that I'm learning. And so I won't go through the code for this one, because it's a little bit hairier, but you can see the result on these generated images after 100 diffusion steps at inference time. And you can see that the Fréchet inception distance is 222 —
>> which is extremely high; modern day would be maybe eight or ten or something. And what's interesting here — you kind of scrolled through it, and you mentioned it — there's quite a lot of code it actually takes to do that KL-divergence-based loss. I suspect that in these later models you're going to show, it gets significantly simpler. So I'm mentally noting that, because I suspect there's going to be an interesting contrast to draw between the two.
>> Yeah. So, the next one I'd like to show is flow matching, which is just so beautiful and simple. This was out of Meta — Yaron Lipman — where he basically said: we don't need a lot of this stuff. Think about the noising process as: I start from data, I randomly sample a vector
>> of noise
>> and I just go in this direction, and then I do it again, and again — this direction, that direction — and then I'm here at noise. And then you have to teach the thing to go back along the exact opposite path, this very circuitous path. So at test time it's actually quite expensive. We've all waited for ChatGPT or Midjourney to make an image, and it takes a while — it's doing like a thousand calls to the model, again and again, iterating through to get to that point of p(data), right?
>> Instead —
>> And intuitively it's like, okay, we're doing the circuitous path, but surely there's a shorter path between those two.
>> And so that's what makes flow matching so cool, to me at least: they said, forget all of those intermediary results. There is a global velocity between the noise and the data, and it's just this direction — this straight line. And I don't care where you are: go along that line. Wherever you are, go along that line, and teach the model to go along that line. That's what flow matching does. And so I'll show that in the code. I bet it's like five lines of code. It really is quite simple. And so, this is pretty cool — here you go. You basically have 10 or 15 lines of code that is the most powerful machine learning procedure ever.
>> So, I have some data — an image of Gary.
>> Yep.
>> I have some noise — some isotropic Gaussian noise that I sample from.
>> Yep.
>> There's some time that I'm trying to index into in the diffusion schedule, and I create x_t, which is the noised-up image that's somewhere between extremely noisy and not noisy at all.
>> And that's basically just the sampling procedure. It's t times data —
>> Yep.
>> plus one minus t, times noise.
>> That's right. And then I compute the velocity, which is independent of the time — I don't care where you are — this global velocity, which is just the noise minus the data. And then I return that back to my training loop, which is the shortest training loop I've ever written [laughter] — it's five lines of code. I have my batch, I have some time, I sample from that function I just explained, and then I have my prediction from the model: I feed it some noised-up image, somewhere between lots of noise and little noise — x_t, let's call it — and I just want it to predict the velocity, the direction I want to go.
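A minimal sketch of the sampling function and training loop just described, in numpy so it stays self-contained. The tiny linear "model" and toy data stand in for whatever network and dataset you would actually use; all names and hyperparameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_xt(x0):
    """Draw a random time per example, mix the data with fresh noise,
    and return the noised sample plus the global velocity target."""
    t = rng.uniform(size=(x0.shape[0], 1))       # one t in [0, 1] per example
    noise = rng.standard_normal(x0.shape)
    xt = t * x0 + (1.0 - t) * noise              # t * data + (1 - t) * noise
    velocity = noise - x0                        # time-independent target
    return xt, t, velocity

# Stand-in "model": a linear map trained by plain gradient descent.
dim = 4
W = np.zeros((dim, dim))

def model(xt):
    return xt @ W

data = rng.standard_normal((256, dim)) + 3.0     # toy stand-in for p(data)
for _ in range(500):                             # the whole training loop
    xt, t, v = sample_xt(data)
    pred = model(xt)
    grad = 2.0 * xt.T @ (pred - v) / len(xt)     # gradient of MSE(pred, v) in W
    W -= 0.01 * grad                             # gradient descent step
loss = float(np.mean((model(xt) - v) ** 2))
```

Swapping the linear map for a UNet or diffusion transformer changes nothing about this loop — which is the clean abstraction the hosts discuss next.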
>> And this is also really powerful, because here you have a model abstraction, but that model can be any model, right? So you can put in whatever the relevant model is for your distribution — whether that's a protein model for proteins, or an LLM for text, or an image-based model for images. That is a very clean abstraction: as long as you can predict this velocity, you can then move in that direction.
>> That's right. This code here has nothing to do with images. It could be weather data. It could be stock market data. It could be trajectories from a robot in a teleop setup. It could be proteins. It could be DNA. It doesn't really matter — it's all the exact same code. And we also haven't talked about the architecture, so this model here could be anything you want it to be: it could be an RNN; it could be a UNet, which traditionally it was; and, more modernly, they use these diffusion transformers with the cross-attention mechanism. It can be whatever you want. But all of that is independent from whether or not you're doing flow matching.
>> I think this is a really profoundly interesting result. We often assume, as models have gotten more sophisticated, that they become less accessible for people to understand, but this is
>> quite literally 10 lines of code, right? [laughter]
>> that explains essentially all of the most important mathematical and fundamental foundations of the models that we all see generating basically magical AI results on our phones. Of course, there's lots of engineering in how you scale them up — that model could be a 100-billion-parameter transformer across data centers of GPUs.
>> Totally. Yeah, 100%.
>> So, it's the engineering that's the
really hard part there, but a lot of the
basic machine learning math is actually
quite straightforward.
>> That's right. And so, there are a bunch of these tangent fields to diffusion that all have some different interpretation of what's actually happening, but it's all the same exact math. And most people learning diffusion actually get quite confused, because if you talk to some probabilistic graphical model people, they'll say: oh, this is a probabilistic graphical model — actually, this is a hidden Markov model, and what we're doing is learning this Markovian thing, or whatever. It's like, okay, fine, but it's just
>> noise [laughter]
>> and you should just show that first. And then if you think about it from a physics perspective, there are all these stat-mech people who have that interpretation — there's a whole bunch of different interpretations, and I think it gets a little bit confusing. And the stochastic differential equation people like thinking about this as an SDE. I think that's all fine, and it probably is helpful to think about, but in terms of teaching it, it's actually quite simple, which is powerful.
>> Cool.
>> So if we go back to here, you can see that this is just literally predicting the velocity. Your goal is to have the model predict —
>> You're minimizing the loss between predicted velocity
>> and the actual velocity. That's it. And that's super stable, and it's really clean. And then at test time — for the physics people, this is like an Euler step that you're doing, where you call the model a bunch of times and you iteratively refine.
>> So back to the hill climbing that we were talking about: I'll grab some random noise here, x, and I basically reverse that noising process — denoise, denoise, denoise, and —
>> It's literally Euler's method: you're using the velocity to point in the direction you want.
>> Point in the direction, and just keep going, keep going, keep going, until you've done the number of steps. The one thing that I really don't like about diffusion as it's done today is that I can't keep calling it beyond the 100 diffusion steps I trained on in my diffusion schedule. If I change that at test time, it doesn't work. So you can't say: oh, I want it even better, so I'll call it even more. You can't — I've tried it. It doesn't work.
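The Euler-style sampling loop described here is: start from noise, ask the model for a velocity, step along it, repeat. A hedged sketch, where a closed-form stand-in plays the role of the trained network so the loop has a correct field to integrate (everything here is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal(8)            # the "data" a real model would have learned

def velocity_model(x, t):
    """Stand-in for the trained network: the exact straight-line velocity
    field (noise - data) toward a single known target, so the Euler loop
    below has something correct to integrate."""
    return (x - x0) / (1.0 - t)        # equals noise - x0 along the line

num_steps = 100                        # must match training; see the caveat above
x = rng.standard_normal(8)             # start from pure noise (t = 0)
for k in range(num_steps):
    t = k / num_steps
    v = velocity_model(x, t)
    x = x - v / num_steps              # Euler step toward the data (t -> 1)
```

With this exact field, the loop lands on the target; a real network's velocity estimates are imperfect, which is why the step count it was trained with matters.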
>> Yeah. There are various tricks people try there, but yeah.
>> Yeah. And so there are games being played there that are actually quite exciting, to get around all that expense.
>> But wait, sorry, we should be clear here. You're saying that's not relevant here, right? Because in this type of model, you don't have this time dependency.
>> Well, you do. So at test time, if you change, for example, the number of steps — if you double it, say, and you expect to get even higher-resolution images — it actually will just turn into white. It just doesn't work at all. So you can't step beyond the number of steps that was trained.
>> That's an important detail. There are tricks that people are doing to try to compress that representation. Like, if at train time I train for 100 steps and at test time I want to do 10 steps, then what you can do is distillation into the model, to try to have the 10-step model learn what the 100-step model does. But then you've still got to train with 10 steps — if you're training with X steps, you have to be using X steps at test time.
>> I see — interesting. You've talked about this concept of a squint test. Why don't you define the squint test for a second? Tell me a little about where this comes from, and then I'd be curious to hear how you think about diffusion models in the context of general intelligence broadly.
>> Yann LeCun has this interesting lecture where he talks about our discovery of flight, and that we didn't need flapping wings — we kept trying to mimic a bat, and how that was a waste of time. And to that I say: you're 100% right. However, we did need two wings. You look at the Wright brothers' original plane and you squint, and you look at a bird, and you're just like, hm. And while we have helicopters and jets and rockets and things like that, we got there eventually. And so there are many elements in the set of things that can achieve flight, and they have different pros and cons. And there are many elements in the set of things that can achieve intelligence. We are the only existence proof of it at all, and I'm sure there will be more elements in the set, and maybe LLMs, broadly speaking, can get there. But if I squint and look at the LLM setup, I see this monolithic stack of transformers — the same thing stacked and stacked — and three stages of training: we do this pre-train, SFT, post-train, and then no learning at all beyond that. And it produces exactly one token at a time,
>> right? So, one token at a time, iteratively.
>> One token at a time, and it never goes backwards. And then you look at a brain: massive amounts of recursion. You have one learning procedure the whole time. You have these two lobes with a corpus callosum between them that's going back and forth, and we think. And I definitely don't think one token at a time. When I write code, I don't write one little character at a time and never go backwards — I'm kind of going backwards, recursively improving, going backwards again and again. I'm thinking in concepts.
>> There's this dynamic process that's emitting concepts, and then higher-level concepts, and then lower-level manifestations of them.
>> And I'm sure that may be happening inside the LLM, but it's almost stuck. It can't do more than one step, even though it might want to, because of the way that we trained it.
>> Right. It might have all that inside the LLM, but then it's sort of bottlenecked, ultimately, because its action space is one —
>> One token at a time. And so that's where I think about diffusion. There are two main things that diffusion gives me. It doesn't get me all the way to passing my squint test, but it gives me two things that I'm sure the brain is doing. Number one: the entirety of biology and nature leverages randomness. Randomness is good. And what is diffusion doing? Leveraging randomness. If you give me data and I noise it up, from that I can learn about the data. And can the brain add noise to input data? Absolutely — neurons are massively random: log-normal distributions, spike patterns, things like that. And the other is this emission of one thing at a time, versus thinking in concepts and then decoding into a big chunk of text and thought, and revising previous thoughts. And so I think diffusion gives me both of those things, for sure.
>> People have probably heard of Stable Diffusion as a very common application of this — it's an image generation model that's been pretty widely available for the last few years. What people may not be so aware of is all the other ways diffusion has been used, in the last few years, in products people are widely using. So what are some of the areas in which diffusion is most widely deployed?
>> Yeah, it's really any mapping from very high-dimensional p(data) to very high-dimensional action spaces, or another p(data) that you may want to map to. And so, of course, everyone knows generating images, because we've done Midjourney and things like that, and even more modern versions of that with Sora and Veo and Flux and SD3 now. And we're generating videos, which is just images stapled together — video gen and image gen and things like that. However, there are so many more applications that we're seeing now — that's the most exciting part, in my view, all the new applications. Whether you're now creating sentences — I mean, diffusion LLMs were one of the biggest topics we saw at NeurIPS, whether it's continuous diffusion LMs or discrete diffusion LLMs. It's writing code now. It's creating proteins — DeepMind won the Nobel Prize for that. There are robotic policies — this diffusion policy thing, which I think might actually be one of the biggest uses of it, and will result in robotics actually working, Rosie the robot actually working. There's weather forecasting — GenCast is the most accurate weather forecasting system in the world. It's really anything. I even mentioned Harrison working on diffusion for failure sampling — sampling for failures and bad things that could happen — we can do that as well.
>> So a lot of the products where we see people actually using AI — especially for things other than just text-based chat — are using diffusion: especially images and videos, and increasingly now things like code and the life sciences. So a pretty wide breadth of things.
>> Yeah. In fact, I would say the only two holdouts right now where the state of the art is not diffusion — diffusion has eaten all of AI except two — are LLMs, where autoregression is still outperforming, and game play, things like AlphaGo, where MCTS is still state of the art for those types of things. And so we haven't seen diffusion really take a step in those two areas, but more research is needed.
>> So, to bring the conversation to a head: how should people think about this research area, either as researchers contributing to the field or as founders looking to build a new product?
>> Yeah, I mean, I would say it falls into two camps: whether you're training models yourself, or you're using models and not in the business of training them. If you're in the business of training models, I would seriously look at diffusion. I don't care what your application is — you should be looking at this procedure, even if it's just to get a latent space that you can then train off of. There's no application in machine learning where I don't think you should be heavily looking at diffusion procedures as a fundamental piece of your training loop. In the case of people who are not training models, I would just say: update your prior on how good these things are getting. If you just look at the last five years — how good image generation got, from Midjourney when it first came out to Veo and Sora and Flux and SD3 now — it's like a thousand times better, right? The answer was just to scale it up, and that takes time, and that takes money, and data, and all those things. And now you apply that to proteins, you apply that to DNA, you apply that to robotics policies, a self-driving car — I mean, skate to where the puck's going to go: all these things are going to work, and we're watching it happen. It may cost money and time and those kinds of things, but those are solvable things — tractable problems that we can go solve. And also, the core procedure of diffusion is getting better — a lot simpler, and it's just working better. And so, skate to where the puck's going to go. Bet that Rosie the robot will work in people's homes. Bet that protein folding is only going to get better, and that we're going to apply it to DNA and all these other things — metabolomics and so on.
>> We see founders develop new models for robotics or text generation or video using diffusion, and we see founders who take all these methods coming from other places and build companies on top of them. It seems like there's a whole new wave of companies that can be built on either end of this now.
>> Right. I think it's going to redefine the entire economy.
>> Thanks so much for joining us. We're going to keep digging into topics related to machine learning research, like diffusion. Can't wait to see you at the next one.
[music]