
Gemma 4 Has Landed!

18m 34s · 3,380 words · 478 segments · English

FULL TRANSCRIPT

0:00
Okay, so Google has just dropped Gemma 4, and this is four new models with multimodality, thinking, function calling, the works. And honestly, that alone would get me covering this. But that's not even the interesting part. The interesting part is the license. Gemma 4 ships under an Apache 2 license. Not a custom license with weird restrictions, with the whole sort of "open weights but don't compete with us" clauses. This is an actual, real Apache 2 license, which means for the first time you can take Google's best open model, modify it, fine-tune it, deploy it commercially, do whatever you want with it. No strings attached. And when you combine that with what's inside these models, we're talking about 128 experts here, native audio, native vision, built-in reasoning, all of that becomes a pretty big deal.

0:56
Okay, so let me give you a quick orientation, because there are four models and the naming is a little bit confusing here. Gemma 4 comes in two tiers. You've got what they're calling the workstation models: a 31 billion parameter dense model, and a 26 billion parameter mixture-of-experts model with 4 billion parameters active. And then you've got the edge models, the E2B and the E4B. Now, these are tiny, efficient models designed to run on phones, Raspberry Pis, Jetson Nanos, and really pretty much anywhere at the edge where you need a good-quality model.

1:33
Now, I've covered the Gemma line of models since the original release. I covered Gemma 3 on the channel, and I know back then, while a lot of people were very impressed with it, they were kind of frustrated with some of the things around the license. So you had this capable model, but a license with enough restrictions that a lot of people went with Llama or went with Qwen instead. So the Apache 2.0 move here is Google basically saying, "Okay, fine. We'll play by the same terms as some of the other open model providers out there." And in fact, as we're talking about this, some of the other open model providers in China are actually pulling back their latest releases and not making them open like they have in the past.

2:11
The other big thing up front here is that Google is saying these are built from Gemini 3 research. So basically, the architecture innovations that went into some of their flagship commercial models are now slowly trickling down into the open-weights models. If you've been running local models, and I know a lot of you have, the landscape has kind of settled into a pattern. We've gone past the Llama models; we've now got Qwen and Mistral, and they're all competing on benchmarks in this sort of fixed parameter range for dense models. But we've also seen that, up until recently, most of these models were text-only, or at best text plus vision. If you wanted audio, you were bolting on Whisper or some external ASR pipeline. And often, if you wanted something like function calling, you were kind of hoping the model would cooperate with your prompt template. So what Gemma 4 is doing here is shipping all of that natively in a single model family: vision, audio, thinking, function calling. And all four of these are actually built in at the architecture level, not bolted on after the fact.

3:22
All right, so one of the key things that makes Gemma 4 better than the previous Gemma series is that it now has the ability to do long chain-of-thought reasoning. And we've seen clearly that this can improve outputs and get you better final answers, etc. Now, not only can it reason across text, it can reason across different modalities. So it can reason across images, if you want to pass in an image and make use of that. And for the first time, you can actually reason across audio, which is also cool here. Obviously, this ability to do long chain of thought has improved a lot of the benchmark results out there, and they're getting really strong numbers on MMMU Pro as well as SWE-bench Pro.

4:07
Along with the reasoning comes function calling. So for anything you want to do that's agentic, you want to be using function calling and tools. This integrates a lot of the research they put into the Function Gemma model, which they released at the end of last year, but now it's in both the small models and the bigger models. A lot of people will think this is not that new, but really, the way people did this kind of function calling in the past was just having the model be better at instruction following and then sort of coaxing it into it. Gemma 4 actually has function calling baked in from scratch. So it's optimized for multi-turn agentic flows with multiple tools, and that really shows up in some of the agentic benchmarks and tasks that you can do.
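To make that concrete, here's a minimal sketch of a single function-calling turn against an OpenAI-compatible local server. The endpoint, the `gemma-4` model tag, and the `get_weather` tool are all placeholders for illustration, not anything confirmed in the release:

```python
# Hedged sketch: assumes a local OpenAI-compatible server is serving a
# Gemma 4 checkpoint; base_url, model tag, and the tool are placeholders.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# A made-up tool the model can choose to call.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gemma-4",  # hypothetical tag
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=tools,
)

# If the model decided to call the tool, the arguments come back as JSON.
call = resp.choices[0].message.tool_calls[0]
print(call.function.name, json.loads(call.function.arguments))
```

In a multi-turn agentic loop, you'd then execute the tool yourself, append a `tool` role message with the result, and call the model again so it can use the output.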

4:57
All right, I mentioned earlier in the reasoning part that the two smaller models, unfortunately not all four models, but the two smaller models, actually have audio support, and that audio support is a lot better than what we had in Gemma 3N and some of the previous Gemma models that had audio support. This means you can do things like ASR and transcription, but you can also do speech to translated text, and I'll show you that when we go through the walkthrough. On top of this, the audio encoder is not only better, it's just a lot smaller. That helps a lot for anything you want to do at the edge with these models, because you're just not going to be using as much device storage and memory.
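As a rough sketch of what that could look like, assuming Gemma 4 keeps the multimodal chat-template convention that transformers uses for Gemma 3N; the model id and auto classes below are guesses based on that pattern:

```python
# Hedged sketch: the model id is a placeholder, and the processor/model
# classes assume Gemma 4 is wired up in transformers like Gemma 3N was.
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "google/gemma-4-e4b-it"  # hypothetical name
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, device_map="auto")

# Audio goes straight into the chat template as a content part.
messages = [{
    "role": "user",
    "content": [
        {"type": "audio", "audio": "meeting.wav"},
        {"type": "text", "text": "Transcribe this audio, then translate it to French."},
    ],
}]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

out = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, not the prompt.
print(processor.batch_decode(
    out[:, inputs["input_ids"].shape[-1]:], skip_special_tokens=True)[0])
```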

5:38
Another thing when comparing Gemma 4 to, say, the Gemma 3N series is the image encoder. The image encoder in those Gemma 3N models, while it was good, was a bit old-fashioned in the way they did it. It didn't handle things like aspect ratios well, and because of that, you would often see it do a poor job at things like OCR, etc. The Gemma 4 models basically have native support for interleaved multi-image inputs. My guess from playing with it is that it's probably had a decent amount of OCR and document-understanding training in there. And because you can do that sort of multi-image input, you can actually do video here and have reasoning across those multiple images.
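As a sketch, "video" here just means sampling frames yourself and passing them as an interleaved multi-image prompt. Again via an OpenAI-compatible endpoint, with the server URL and model tag as placeholders:

```python
# Hedged sketch: interleaved multi-image input through an OpenAI-compatible
# endpoint; the frames are ones you'd extract from a video yourself.
import base64
from openai import OpenAI

def as_data_url(path: str) -> str:
    # Inline a local JPEG as a base64 data URL for the image_url content part.
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
resp = client.chat.completions.create(
    model="gemma-4",  # hypothetical tag
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Frame 1:"},
            {"type": "image_url", "image_url": {"url": as_data_url("frame_001.jpg")}},
            {"type": "text", "text": "Frame 2:"},
            {"type": "image_url", "image_url": {"url": as_data_url("frame_060.jpg")}},
            {"type": "text", "text": "What changed between these two frames?"},
        ],
    }],
)
print(resp.choices[0].message.content)
```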

6:23
So, generally, comparing Gemma 4 against Gemma 3 and Gemma 3N, you've got a lot of updates in both tiers, with the smaller models supporting audio and better multimodality overall. And whereas Gemma 3N only had a context window of 32K, even the small Gemma 4 models have a context window of 128K, and it's 256K for the bigger models.

6:47
All right, so let's talk about some of these architecture choices and the model sizes themselves. The mixture-of-experts model is 26 billion total parameters, but only 3.8 billion are active at any time. Now, they haven't gone for a huge number of experts like we've seen some other models go for recently. They've got 128 of these sort of tiny experts, eight being activated for each token, plus one shared, always-on expert. Compare that to Gemma 3, where the largest model was a 27 billion parameter dense model: obviously, in that case, you are using all 27 billion at the same time. So roughly, this is giving you the intelligence of a 27B model with the compute cost of something around a 4B model.
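The back-of-envelope math behind that claim looks like this; the split between routed-expert weights and everything else (attention, embeddings, the shared expert) is an assumed illustration, not a published breakdown:

```python
# Back-of-envelope MoE arithmetic. The shared/routed split below is an
# assumed illustration for the 26B model, not a published figure.
total = 26e9
shared = 2.3e9                 # assumed: attention + embeddings + shared expert
routed = total - shared        # weight spread across the 128 routed experts
n_experts, n_active = 128, 8   # 8 routed experts fire per token

active = shared + routed * (n_active / n_experts)
print(f"~{active / 1e9:.1f}B parameters active per token out of {total / 1e9:.0f}B")
# -> ~3.8B parameters active per token out of 26B
```

Eight routed experts out of 128 means only about 1/16 of the routed-expert weights participate in any forward pass, which is where the "27B quality at roughly 4B compute" framing comes from.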

7:37
Now, this you can certainly run on consumer GPUs, and I'm sure that even as I'm recording this, before it comes out, we will see this on Ollama, on LM Studio, etc. And Google themselves are also releasing the QAT checkpoints, that's the quantization-aware training
