Strix Halo & R9700 AI PRO Updates: Qwen 3.6 and Gemma 4 Support, ROCm 7.2.2 and Two New Series

14m 59s · 2,010 words · 305 segments · English

FULL TRANSCRIPT

0:00

If you look at the channel, you'll see

0:02

my last video was uploaded over 47 days

0:06

ago. As a content creator, seeing the

0:09

number of days since the last upload go

0:12

up makes you feel very anxious. The

0:16

temptation is to just release unpolished

0:19

videos to feed the algorithm, but I

0:22

didn't want to release new videos just

0:25

for the sake of it. Reality is between

0:28

work and personal life, things have been

0:31

busy in the past 2 months. At the same

0:34

time, even without new videos, I kept

0:38

working on updating and maintaining the

0:40

toolboxes for Strix Halo and the

0:44

R9700. Even when things work smoothly,

0:47

running the benchmarks can take over 48

0:51

hours across all the main toolboxes for

0:54

llama.cpp and vLLM, and that is not even

0:57

considering the ComfyUI ones for image

1:00

and video generation. But, I think it's

1:03

important to ensure that the toolboxes

1:06

support new models and that there are

1:09

benchmarks so you can get an idea of

1:11

what kind of performance you get with

1:13

these new models. Aside from the

1:16

toolboxes, I have also been doing the

1:18

background research and prep work for

1:21

two entirely new video series coming to

1:25

the channel starting in May. So, the

1:27

purpose of this video is to give you a

1:29

quick update on all that's been going on

1:32

and what's coming in the next few

1:35

months.

1:36

Starting with Strix Halo, the llama.cpp

1:40

toolboxes are generally

1:42

straightforward to maintain. I have

1:44

pipelines that automatically check twice

1:47

a day if there are updates and rebuild

1:50

them from scratch. Now, sometimes

1:53

something breaks and I have to go in and

1:55

dig into the logs. For example, the

1:58

Vulkan toolboxes broke, but the fix was

2:01

fairly easy. So, now they are updated

2:04

and work fine. I also recently ran a new

2:08

set of benchmarks for the new open-weight

2:10

models like the Qwen 3.5 and 3.6

2:14

families as well as Gemma 4. As usual,

2:18

they are available at the URL on the

2:20

screen and you can also find them linked

2:24

on the GitHub repository.

2:27

As I said, most times the toolboxes are

2:30

updated automatically and nothing fails.

2:33

There is, however, a current issue with

2:36

ROCm nightly builds introduced around

2:39

the 14th of April. This bug prevents the

2:42

driver from properly using the unified

2:46

memory. This means that if you're using

2:49

the latest nightly builds, models that

2:52

require more than 64 GB

2:54

of memory will fail to load.

2:58

There is an open issue tracking this bug

3:00

and the fix should be implemented soon.

3:03

Ultimately, remember that nightly builds

3:05

are developer releases, so they will

3:08

occasionally break. I relied heavily on

3:11

nightly builds initially because Strix

3:14

Halo support was frankly not fully

3:16

implemented in the stable releases.

3:19

There weren't many other options. But,

3:21

realistically, looking at the benchmarks

3:24

for the stable ROCm 7.2 platforms, they

3:28

are essentially the same on llama.cpp as

3:31

the nightly builds. At the moment, there

3:34

is no reason to use the nightly builds

3:37

for llama.cpp. Additionally, AMD

3:40

recently released ROCm 7.2.2

3:44

and I have pushed a llama.cpp toolbox

3:46

that uses just that release. It's

3:49

currently going through benchmarks, so

3:52

hopefully you'll see them up in the next

3:54

few days.

3:56

Also, on Strix Halo, there is a

3:59

patch by Sunil Pedapudi. I hope I

4:02

pronounced the name properly, but the

4:04

chances of that are very slim. This

4:07

patch is for llama.cpp and it aims to

4:09

improve performance. At a high level,

4:12

the patch tweaks matrix multiplication

4:14

parameters like the tile size to reduce

4:18

register pressure. Essentially, it

4:20

attempts to prevent the GPU compute

4:22

units from overflowing their available

4:25

vector registers, which would force them

4:27

to spill data into main memory and cause

4:31

a bottleneck. I added a test toolbox

4:35

that includes this PR and ran the

4:37

benchmarks. You can see the results on

4:40

screen. While there is a measurable

4:42

performance increase for short context

4:45

lengths, particularly on mixture-of-experts

4:47

models, these gains evaporate on

4:51

longer contexts. The real issue holding

4:54

this PR back and why it is currently

4:56

stalled is its high variance. While it

5:00

helps mixture-of-experts models, it can

5:03

actually cause a performance regression

5:05

on standard dense models. On top of

5:09

that, the llama.cpp maintainers are

5:11

pushing back against merging

5:13

optimizations, which are hard-coded

5:15

specifically for Strix Halo, as

5:18

doing so risks degrading performance on

5:21

other RDNA 3.5 APUs, which might have

5:25

smaller register files.
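
The register-pressure trade-off described above can be sketched with some rough arithmetic. This is only an illustration under stated assumptions: the function names, the register budget, the wave limit, the per-thread register limit, and the 32 "extra" registers are all hypothetical values picked to make the numbers easy to follow. They are not the real Strix Halo figures and are not taken from the patch itself.

```python
def accumulators_per_thread(tile_m: int, tile_n: int, threads: int) -> int:
    """Output elements each thread keeps in registers for one tile."""
    return (tile_m * tile_n) // threads


def waves_per_simd(regs_per_thread: int, reg_budget: int, max_waves: int) -> int:
    """Wavefronts that can be resident at once, limited by register use."""
    return min(max_waves, reg_budget // regs_per_thread)


def spilled_registers(regs_needed: int, regs_per_thread_limit: int) -> int:
    """Values that no longer fit in registers and must live in memory."""
    return max(0, regs_needed - regs_per_thread_limit)


# Hypothetical hardware limits, chosen only to make the arithmetic visible.
REG_BUDGET = 1536  # registers available per lane across all resident waves
MAX_WAVES = 16     # scheduler limit on resident waves
REG_LIMIT = 256    # most registers a single thread may use

if __name__ == "__main__":
    for tile in (64, 128, 256):
        acc = accumulators_per_thread(tile, tile, threads=256)
        needed = acc + 32  # assume 32 extra registers for addresses and operands
        used = min(needed, REG_LIMIT)
        print(
            f"tile={tile} needed={needed} "
            f"waves={waves_per_simd(used, REG_BUDGET, MAX_WAVES)} "
            f"spilled={spilled_registers(needed, REG_LIMIT)}"
        )
```

With these made-up limits, only the largest tile exceeds the per-thread limit and spills, and it also leaves room for fewer resident waves. A chip with a smaller register budget would hit that point at a smaller tile, which is the concern about hard-coding one tile size for every APU.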

5:30

Finally, I also updated the Strix

5:33

Halo vLLM toolbox to support the new

5:37

model families like Qwen 3.5, 3.6, and

5:40

Gemma 4. This turned out to be a long

5:45

process taking an entire weekend and

5:48

most weekday evenings this week. It

5:51

required several patches, but it is now

5:54

stable and performs well. This update

5:57

includes RCCL, which is the library that

6:00

allows you to use vLLM tensor

6:03

parallelism for multi-node setups. So,

6:06

if you are interested in that, take a

6:08

look at my clustering video for vLLM on

6:11

Strix Halo.

6:13

The current vLLM toolbox is based on the

6:16

ROCm nightly builds, but I patched them

6:20

specifically to get around that bug that

6:23

causes the 64 GB memory cap, so they are

6:27

okay to use. Now, if you want to see the

6:30

latest benchmarks, as usual, you will

6:33

find the link in the GitHub repository

6:36

for the project. Remember that these

6:39

throughput benchmarks are designed to

6:42

saturate the GPU with many concurrent

6:45

requests, in my case, 64 requests, and

6:49

this is the kind of workload that vLLM

6:52

is specifically optimized for. It is

6:56

different than the performance you get

6:58

on a single request. So, unless you have

7:01

a use case that requires heavy

7:03

concurrency, you might want to stick to

7:06

llama.cpp, which is usually a more

7:08

practical option.
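
To make the difference between a saturating throughput benchmark and a single request concrete, here is a minimal sketch. It does not talk to vLLM at all: `fake_request` is a hypothetical stand-in that just sleeps, and the token count and latency are invented. It only shows how aggregate tokens per second is measured when many requests are in flight at once.

```python
import asyncio
import time


async def fake_request(tokens: int, latency_s: float) -> int:
    """Hypothetical stand-in for one generation request."""
    await asyncio.sleep(latency_s)
    return tokens


async def run_benchmark(
    concurrency: int,
    total_requests: int,
    tokens: int = 128,
    latency_s: float = 0.02,
) -> dict:
    """Run requests with at most `concurrency` in flight; report throughput."""
    limit = asyncio.Semaphore(concurrency)

    async def one_request() -> int:
        async with limit:
            return await fake_request(tokens, latency_s)

    start = time.perf_counter()
    results = await asyncio.gather(*(one_request() for _ in range(total_requests)))
    elapsed = time.perf_counter() - start
    total_tokens = sum(results)
    return {
        "total_tokens": total_tokens,
        "elapsed_s": elapsed,
        "tokens_per_s": total_tokens / elapsed,
    }


if __name__ == "__main__":
    for c in (1, 64):
        stats = asyncio.run(run_benchmark(concurrency=c, total_requests=64))
        print(c, round(stats["tokens_per_s"]))
```

In this toy version each request takes the same time no matter how many run together, which a real GPU will not do. The point is only that the headline number from a 64-request run is an aggregate across all requests, not what a single user sees.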

7:10

Finally, on vLLM, I'm also considering

7:13

releasing a version of these toolboxes

7:15

based on the latest stable release of

7:18

ROCm, which should ensure we have an overall

7:21

more stable environment.

7:26

Before moving on, I want to give a quick

7:28

shout-out to Adrian, known as Lafu

7:32

Namor, and again, probably I pronounced

7:35

this nickname wrong, but I want to thank

7:37

Adrian for all the help with these PRs

7:40

and testing the llama.cpp toolboxes in

7:43

the background. And I want to add a

7:45

thank you to Patrick Audley for his

7:48

repository with extensive build notes of

7:52

vLLM on Strix Halo, and I used that

7:55

repository quite a lot for the new vLLM

7:59

toolbox.

8:01

Moving over to the R9700, I ran the

8:05

llama.cpp benchmarks for all the new

8:08

models, and those results are available

8:11

on the GitHub repository as usual. I

8:14

also tried to update the vLLM toolbox,

8:18

but ran into multiple issues.

8:21

Specifically, AMD had to stop compiling

8:25

RCCL into the nightly builds for

8:28

platforms like the R9700. This means

8:32

that you cannot use multi-GPU setups

8:36

with the current ROCm nightly builds.

8:39

This is Donato from the future. As I'm

8:41

editing this video, I want to give you

8:43

an update. I did actually

8:45

try to compile RCCL for the R9700,

8:50

but that still didn't work because there

8:53

is currently a bug. The good news is

8:56

that the bug is tracked and AMD is

8:58

aware, and they're investigating it.

9:00

But, essentially, right now in recent

9:03

versions of vLLM and ROCm, I don't know

9:06

exactly when this started, you cannot

9:10

have

9:11

multi-GPU setups with vLLM

9:15

on, again, the R9700 gfx1201

9:18

architecture. Again, the good news is

9:20

that AMD is very responsive. They are

9:23

aware of this, and I expect,

9:25

probably by the time I publish this

9:27

video, there will be a fix, and

9:29

certainly, as soon as there is a fix, I

9:31

will let you know. Because of this, I

9:34

decided not to push the new vLLM toolbox

9:37

version yet and to keep the older

9:40

version up. Just be aware that the older

9:43

version will not support the newer
