
Strix Halo & R9700 AI PRO Updates: Qwen 3.6 and Gemma 4 Support, ROCm 7.2.2 and Two New Series

14m 59s, 2,010 words, 305 segments, English

FULL TRANSCRIPT

0:00

If you look at the channel, you'll see

0:02

my last video was uploaded over 47 days

0:06

ago. As a content creator, seeing the

0:09

number of days since the last upload go

0:12

up makes you feel very anxious. The

0:16

temptation is to just release unpolished

0:19

videos to feed the algorithm, but I

0:22

didn't want to release new videos just

0:25

for the sake of it. The reality is, between

0:28

work and personal life, things have been

0:31

busy in the past 2 months. At the same

0:34

time, even without new videos, I kept

0:38

working on updating and maintaining the

0:40

toolboxes for Strix Halo and the

0:44

R9700. Even when things work smoothly,

0:47

running the benchmarks can take over 48

0:51

hours across all the main toolboxes for

0:54

llama.cpp and vLLM, and that is not even

0:57

considering the ComfyUI ones for image

1:00

and video generation. But, I think it's

1:03

important to ensure that the toolboxes

1:06

support new models and that there are

1:09

benchmarks so you can get an idea of

1:11

what kind of performance you get with

1:13

these new models. Aside from the

1:16

toolboxes, I have also been doing the

1:18

background research and prep work for

1:21

two entirely new video series coming to

1:25

the channel starting in May. So, the

1:27

purpose of this video is to give you a

1:29

quick update on all that's been going on

1:32

and what's coming in the next few

1:35

months.

1:36

Starting with Strix Halo, the

1:40

llama.cpp toolboxes are generally

1:42

straightforward to maintain. I have

1:44

pipelines that automatically check twice

1:47

a day if there are updates and rebuild

1:50

them from scratch. Now, sometimes

1:53

something breaks and I have to go in and

1:55

dig into the logs. For example, the

1:58

Vulkan toolboxes broke, but the fix was

2:01

fairly easy. So, now they are updated

2:04

and work fine. I also recently ran a new

2:08

set of benchmarks for the new open-weight

2:10

models like the Qwen 3.5 and 3.6

2:14

families as well as Gemma 4. As usual,

2:18

they are available at the URL on the

2:20

screen and you can also find them linked

2:24

on the GitHub repository.

2:27

As I said, most times the toolboxes are

2:30

updated automatically and nothing fails.

2:33

There is, however, a current issue with

2:36

ROCm nightly builds introduced around

2:39

the 14th of April. This bug prevents the

2:42

driver from properly using the unified

2:46

memory. This means that if you're using

2:49

the latest nightly builds, models that

2:52

require more than 64 GB

2:54

of memory will fail to load.

2:58

There is an open issue tracking this bug

3:00

and the fix should be implemented soon.

3:03

Ultimately, remember that nightly builds

3:05

are developer releases, so they will

3:08

occasionally break. I relied heavily on

3:11

nightly builds initially because Strix

3:14

Halo support was frankly not fully

3:16

implemented in the stable releases.

3:19

There weren't many other options. But,

3:21

realistically, looking at the benchmarks

3:24

for the stable ROCm 7.2 platforms, they

3:28

are essentially the same on llama.cpp as

3:31

the nightly builds. At the moment, there

3:34

is no reason to use the nightly builds

3:37

for llama.cpp. Additionally, AMD

3:40

recently released ROCm 7.2.2

3:44

and I have pushed a llama.cpp toolbox

3:46

that uses just that release. It's

3:49

currently going through benchmarks, so

3:52

hopefully you'll see them up in the next

3:54

few days.

3:56

Also, on Strix Halo, there is a

3:59

patch by Sunil Pedapudi. I hope I

4:02

pronounced the name properly, but the

4:04

chances of that are very slim. This

4:07

patch is for llama.cpp and it aims to

4:09

improve performance. At a high level,

4:12

the patch tweaks matrix multiplication

4:14

parameters like the tile size to reduce

4:18

register pressure. Essentially, it

4:20

attempts to prevent the GPU compute

4:22

units from overflowing their available

4:25

vector registers, which would force them

4:27

to spill data into main memory and cause

4:31

a bottleneck. I added a test toolbox

4:35

that includes this PR and ran the

4:37

benchmarks. You can see the results on

4:40

screen. While there is a measurable

4:42

performance increase for short context

4:45

lengths, particularly on mixture-of-experts

4:47

models, these gains evaporate on

4:51

longer contexts. The real issue holding

4:54

this PR back and why it is currently

4:56

stalled is its high variance. While it

5:00

helps mixture-of-experts models, it can

5:03

actually cause a performance regression

5:05

on standard dense models. On top of

5:09

that, the llama.cpp maintainers are

5:11

pushing back against merging

5:13

optimizations that are hard-coded

5:15

specifically for Strix Halo, as

5:18

doing so risks degrading performance on

5:21

other RDNA 3.5 APUs, which might have

5:25

smaller register files.
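
The register-pressure argument above can be made concrete with a toy model. This is only a sketch under stated assumptions: the tile sizes, thread count, overhead, and register budget are hypothetical numbers picked for illustration. They are not taken from the PR, from llama.cpp, or from any AMD hardware documentation.

```python
def accumulators_per_thread(tile_m: int, tile_n: int, threads: int) -> int:
    """Output elements each thread must keep live for one tile_m x tile_n tile."""
    return (tile_m * tile_n) // threads


def spilled_registers(tile_m: int, tile_n: int, threads: int,
                      other_regs: int, budget: int) -> int:
    """Registers that do not fit in the per-thread budget (0 means no spill)."""
    needed = accumulators_per_thread(tile_m, tile_n, threads) + other_regs
    return max(0, needed - budget)


# Hypothetical values, chosen only to show the trade-off.
BUDGET = 96        # assumed vector registers available per thread
OTHER_REGS = 40    # assumed registers for addresses, loop state, operands
THREADS = 256      # assumed threads cooperating on one output tile

large_tile = spilled_registers(128, 128, THREADS, OTHER_REGS, BUDGET)
small_tile = spilled_registers(128, 64, THREADS, OTHER_REGS, BUDGET)

print("large tile spills:", large_tile)
print("small tile spills:", small_tile)
```

In this toy model the smaller tile fits in the budget and the larger one does not. It also hints at why a tile size tuned for one chip may not carry over to another with a different register budget.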

5:30

Finally, I also updated the Strix

5:33

Halo vLLM toolbox to support the new

5:37

model families like Qwen 3.5, 3.6, and

5:40

Gemma 4. This turned out to be a long

5:45

process taking an entire weekend and

5:48

most weekday evenings this week. It

5:51

required several patches, but it is now

5:54

stable and performs well. This update

5:57

includes RCCL, which is the library that

6:00

allows you to use vLLM tensor

6:03

parallelism for multi-node setups. So,

6:06

if you are interested in that, take a

6:08

look at my clustering video for vLLM on

6:11

Strix Halo.

6:13

The current vLLM toolbox is based on the

6:16

ROCm nightly builds, but I patched them

6:20

specifically to get around that bug that

6:23

causes the 64 GB memory cap, so they are

6:27

okay to use. Now, if you want to see the

6:30

latest benchmarks, as usual, you will

6:33

find the link in the GitHub repository

6:36

for the project. Remember that these

6:39

throughput benchmarks are designed to

6:42

saturate the GPU with many concurrent

6:45

requests, in my case, 64 requests, and

6:49

this is the kind of workload that vLLM

6:52

is specifically optimized for. It is

6:56

different from the performance you get

6:58

on a single request. So, unless you have

7:01

a use case that requires heavy

7:03

concurrency, you might want to stick to

7:06

llama.cpp, which is usually a more

7:08

practical option.
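
To make the difference between saturated throughput and single-request speed concrete, here is a minimal arithmetic sketch. The token counts and timings are hypothetical and are not results from the benchmarks mentioned in the video. Only the concurrency of 64 comes from the transcript.

```python
def throughput(total_tokens: int, elapsed_s: float, concurrency: int):
    """Return (aggregate tokens/s, average tokens/s seen by each request)."""
    aggregate = total_tokens / elapsed_s
    per_request = aggregate / concurrency
    return aggregate, per_request


# Hypothetical run: 64 concurrent requests, 256 generated tokens each,
# all finishing after 32 seconds of wall-clock time.
CONCURRENCY = 64
agg, per_req = throughput(total_tokens=CONCURRENCY * 256,
                          elapsed_s=32.0,
                          concurrency=CONCURRENCY)

print(f"aggregate:   {agg:.1f} tokens/s")
print(f"per request: {per_req:.1f} tokens/s")
```

A large aggregate number from a saturated benchmark therefore says little about how fast one interactive session will feel. Each individual request only sees its share of the total.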

7:10

Finally, on vLLM, I'm also considering

7:13

releasing a version of these toolboxes

7:15

based on the latest stable release of

7:18

ROCm, to ensure we have an overall

7:21

more stable environment.

7:26

Before moving on, I want to give a quick

7:28

shout-out to Adrian, known as Lafu

7:32

Namor, and again, probably I pronounced

7:35

this nickname wrong, but I want to thank

7:37

Adrian for all the help with these PRs

7:40

and testing the llama.cpp toolboxes in

7:43

the background. And I want to add a

7:45

thank you to Patrick Audley for his

7:48

repository with extensive build notes of

7:52

vLLM on Strix Halo, and I used that

7:55

repository quite a lot for the new vLLM

7:59

toolbox.

8:01

Moving over to the R9700, I ran the

8:05

llama.cpp benchmarks for all the new

8:08

models, and those results are available

8:11

on the GitHub repository as usual. I

8:14

also tried to update the vLLM toolbox,

8:18

but ran into multiple issues.

8:21

Specifically, AMD had to stop compiling

8:25

RCCL into the nightly builds for

8:28

platforms like the R9700. This means

8:32

that you cannot use multi-GPU setups

8:36

with the current ROCm nightly builds.

8:39

This is Donato from the future. As I'm

8:41

editing this video, I want to give you

8:43

an update. I did actually

8:45

try to compile RCCL for the R9700,

8:50

but that still didn't work because there

8:53

is currently a bug. The good news is

8:56

that the bug is tracked and AMD is

8:58

aware, and they're investigating it.

9:00

But, essentially, right now in recent

9:03

versions of vLLM and ROCm, I don't know

9:06

exactly when this started, you cannot

9:10

have

9:11

multi-GPU setups with vLLM

9:15

on, again, the R9700 gfx1201

9:18

architecture. Again, the good news is

9:20

that AMD is very responsive. They are

9:23

aware of this, and I expect,

9:25

probably by the time I publish this

9:27

video, there will be a fix, and

9:29

certainly, as soon as there is a fix, I

9:31

will let you know. Because of this, I

9:34

decided not to push the new vLLM toolbox

9:37

version yet and to keep the older

9:40

version up. Just be aware that the older

9:43

version will not support the newer
