Strix Halo & R9700 AI PRO Updates: Qwen 3.6 and Gemma 4 Support, ROCm 7.2.2 and Two New Series

14m 59s · 2,010 words · 305 segments · English

Full transcript

0:00

If you look at the channel, you'll see

0:02

my last video was uploaded over 47 days

0:06

ago. As a content creator, seeing the

0:09

number of days since the last upload go

0:12

up makes you feel very anxious. The

0:16

temptation is to just release unpolished

0:19

videos to feed the algorithm, but I

0:22

didn't want to release new videos just

0:25

for the sake of it. The reality is that between

0:28

work and personal life, things have been

0:31

busy in the past 2 months. At the same

0:34

time, even without new videos, I kept

0:38

working on updating and maintaining the

0:40

toolboxes for Strix Halo and the R9700.

0:44

Even when things work smoothly,

0:47

running the benchmarks can take over 48

0:51

hours across all the main toolboxes for

0:54

llama.cpp and vLLM, and that is not even

0:57

considering the ComfyUI ones for image

1:00

and video generation. But, I think it's

1:03

important to ensure that the toolboxes

1:06

support new models and that there are

1:09

benchmarks so you can get an idea of

1:11

what kind of performance you get with

1:13

these new models. Aside from the

1:16

toolboxes, I have also been doing the

1:18

background research and prep work for

1:21

two entirely new video series coming to

1:25

the channel starting in May. So, the

1:27

purpose of this video is to give you a

1:29

quick update on all that's been going on

1:32

and what's coming in the next few

1:35

months.

1:36

Starting with Strix Halo, the llama.cpp

1:40

toolboxes are generally

1:42

straightforward to maintain. I have

1:44

pipelines that automatically check twice

1:47

a day if there are updates and rebuild

1:50

them from scratch. Now, sometimes

1:53

something breaks and I have to go in and

1:55

dig into the logs. For example, the

1:58

Vulkan toolboxes broke, but the fix was

2:01

fairly easy. So, now they are updated

2:04

and work fine. I also recently ran a new

2:08

set of benchmarks for the new open-weight

2:10

models like the Qwen 3.5 and 3.6

2:14

families as well as Gemma 4. As usual,

2:18

they are available at the URL on the

2:20

screen and you can also find them linked

2:24

on the GitHub repository.

2:27

As I said, most times the toolboxes are

2:30

updated automatically and nothing fails.

2:33

There is, however, a current issue with

2:36

ROCm nightly builds introduced around

2:39

the 14th of April. This bug prevents the

2:42

driver from properly using the unified

2:46

memory. This means that if you're using

2:49

the latest nightly builds, models that

2:52

require more than 64 GB

2:54

of memory will fail to load.
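As a rough illustration of the limit described here, the check below flags a model whose estimated footprint is above 64 GB. It is only a sketch: the function name `exceeds_cap`, the `extra_bytes` allowance for things like the KV cache, and the use of decimal gigabytes are all assumptions, not something taken from the toolboxes or from ROCm.

```python
GB = 10**9  # assumption: decimal gigabytes; the video does not say GB vs GiB


def exceeds_cap(weights_bytes, extra_bytes=0, cap_bytes=64 * GB):
    """Return True if the estimated footprint is above the cap.

    weights_bytes: size of the model weights (for example, the file size).
    extra_bytes:   hypothetical allowance for KV cache and runtime buffers.
    cap_bytes:     the 64 GB limit mentioned for the affected nightly builds.
    """
    return weights_bytes + extra_bytes > cap_bytes


if __name__ == "__main__":
    # Hypothetical sizes, for illustration only.
    print(exceeds_cap(70 * GB))           # weights alone are over the cap
    print(exceeds_cap(40 * GB, 10 * GB))  # 50 GB total stays under the cap
```

A model that this check flags would be one to run on a stable ROCm release or on a patched build instead of the affected nightlies.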

2:58

There is an open issue tracking this bug

3:00

and the fix should be implemented soon.

3:03

Ultimately, remember that nightly builds

3:05

are developer releases, so they will

3:08

occasionally break. I relied heavily on

3:11

nightly builds initially because Strix

3:14

Halo support was frankly not fully

3:16

implemented in the stable releases.

3:19

There weren't many other options. But,

3:21

realistically, looking at the benchmarks

3:24

for the stable ROCm 7.2 platforms, they

3:28

are essentially the same on llama.cpp as

3:31

the nightly builds. At the moment, there

3:34

is no reason to use the nightly builds

3:37

for llama.cpp. Additionally, AMD

3:40

recently released ROCm 7.2.2

3:44

and I have pushed a llama.cpp toolbox

3:46

that uses just that release. It's

3:49

currently going through benchmarks, so

3:52

hopefully you'll see them up in the next

3:54

few days.

3:56

Also, on Strix Halo, there is a

3:59

patch by Sunil Pedapudi. I hope I

4:02

pronounced the name properly, but the

4:04

chances of that are very slim. This

4:07

patch is for llama.cpp and it aims to

4:09

improve performance. At a high level,

4:12

the patch tweaks matrix multiplication

4:14

parameters like the tile size to reduce

4:18

register pressure. Essentially, it

4:20

attempts to prevent the GPU compute

4:22

units from overflowing their available

4:25

vector registers, which would force them

4:27

to spill data into main memory and cause

4:31

a bottleneck. I added a test toolbox

4:35

that includes this PR and ran the

4:37

benchmarks. You can see the results on

4:40

screen. While there is a measurable

4:42

performance increase for short context

4:45

lengths, particularly on

4:47

mixture-of-experts models, these gains evaporate on

4:51

longer contexts. The real issue holding

4:54

this PR back and why it is currently

4:56

stalled is its high variance. While it

5:00

helps mixture-of-experts models, it can

5:03

actually cause a performance regression

5:05

on standard dense models. On top of

5:09

that, the llama.cpp maintainers are

5:11

pushing back against merging

5:13

optimizations that are hard-coded

5:15

specifically for Strix Halo, as

5:18

doing so risks degrading performance on

5:21

other RDNA 3.5 APUs, which might have

5:25

smaller register files.
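To make the register-pressure idea from this part of the video more concrete, here is a toy model of how the tile size affects the number of accumulator values each thread has to hold. This is not the logic of the patch and the numbers are not RDNA 3.5 hardware values: the function names, the thread count, the `other_regs` allowance, and the register budget are all hypothetical, chosen only to show the trade-off.

```python
def accumulators_per_thread(tile_m, tile_n, threads):
    """Accumulator values each thread holds for one tile_m x tile_n output tile.

    Assumes the tile is split evenly across the threads (rounded up).
    """
    return (tile_m * tile_n + threads - 1) // threads


def would_spill(tile_m, tile_n, threads, other_regs, budget):
    """True if accumulators plus other live values exceed the per-thread budget."""
    return accumulators_per_thread(tile_m, tile_n, threads) + other_regs > budget


if __name__ == "__main__":
    # Hypothetical configuration: 256 threads, 40 registers used for other
    # values, and a budget of 96 registers per thread.
    for tile in [(128, 128), (64, 128)]:
        acc = accumulators_per_thread(tile[0], tile[1], 256)
        print(tile, acc, would_spill(tile[0], tile[1], 256, 40, 96))
```

In this toy setup the larger tile goes over the budget while the smaller one fits. It also shows why a tile size tuned for one chip can be the wrong choice on a chip with a smaller register budget.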

5:30

Finally, I also updated the Strix

5:33

Halo vLLM toolbox to support the new

5:37

model families like Qwen 3.5, 3.6, and

5:40

Gemma 4. This turned out to be a long

5:45

process taking an entire weekend and

5:48

most weekday evenings this week. It

5:51

required several patches, but it is now

5:54

stable and performs well. This update

5:57

includes RCCL, which is the library that

6:00

allows you to use vLLM tensor

6:03

parallelism for multi-node setups. So,

6:06

if you are interested in that, take a

6:08

look at my clustering video for vLLM on

6:11

Strix Halo.

6:13

The current vLLM toolbox is based on the

6:16

ROCm nightly builds, but I patched them

6:20

specifically to get around that bug that

6:23

causes the 64 GB memory cap, so they are

6:27

okay to use. Now, if you want to see the

6:30

latest benchmarks, as usual, you will

6:33

find the link in the GitHub repository

6:36

for the project. Remember that these

6:39

throughput benchmarks are designed to

6:42

saturate the GPU with many concurrent

6:45

requests, in my case, 64 requests, and

6:49

this is the kind of workload that vLLM

6:52

is specifically optimized for. It is

6:56

different from the performance you get

6:58

on a single request. So, unless you have

7:01

a use case that requires heavy

7:03

concurrency, you might want to stick to

7:06

llama.cpp, which is usually a more

7:08

practical option.
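The difference between a throughput number and what a single user sees can be sketched with simple arithmetic. The helper names and the figures below are hypothetical, not results from the benchmarks, and the even split across requests is a simplifying assumption.

```python
def per_request_rate(aggregate_tokens_per_s, concurrency):
    """Share of the aggregate throughput that one request gets.

    Assumes the server splits its output evenly across concurrent requests.
    """
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    return aggregate_tokens_per_s / concurrency


def seconds_for_reply(reply_tokens, aggregate_tokens_per_s, concurrency):
    """Time for one request to receive reply_tokens at its share of the rate."""
    return reply_tokens / per_request_rate(aggregate_tokens_per_s, concurrency)


if __name__ == "__main__":
    # Hypothetical: 640 tokens/s in aggregate across 64 concurrent requests.
    print(per_request_rate(640.0, 64))        # tokens/s seen by each request
    print(seconds_for_reply(500, 640.0, 64))  # seconds for a 500-token reply
```

So a large aggregate figure at 64 concurrent requests says little about how fast one chat session feels. A single-request run typically has a much lower aggregate figure, but all of it goes to that one user.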

7:10

Finally, on vLLM, I'm also considering

7:13

releasing a version of these toolboxes

7:15

based on the latest stable release of

7:18

ROCm, which should ensure we have an overall

7:21

more stable environment.

7:26

Before moving on, I want to give a quick

7:28

shout-out to Adrian, known as Lafu

7:32

Namor, and again, probably I pronounced

7:35

this nickname wrong, but I want to thank

7:37

Adrian for all the help with these PRs

7:40

and testing the llama.cpp toolboxes in

7:43

the background. And I want to add a

7:45

thank you to Patrick Audley for his

7:48

repository with extensive build notes of

7:52

vLLM on Strix Halo, and I used that

7:55

repository quite a lot for the new vLLM

7:59

toolbox.

8:01

Moving over to the R9700, I ran the

8:05

llama.cpp benchmarks for all the new

8:08

models, and those results are available

8:11

on the GitHub repository as usual. I

8:14

also tried to update the vLLM toolbox,

8:18

but ran into multiple issues.

8:21

Specifically, AMD had to stop compiling

8:25

RCCL into the nightly builds for

8:28

platforms like the R9700. This means

8:32

that you cannot use multi-GPU setups

8:36

with the current ROCm nightly builds.

8:39

This is Donato from the future. As I'm

8:41

editing this video, I want to give you

8:43

an update. I did actually

8:45

try to compile RCCL for the R9700,

8:50

but that still didn't work because there

8:53

is currently a bug. The good news is

8:56

that the bug is tracked and AMD is

8:58

aware, and they're investigating it.

9:00

But, essentially, right now in recent

9:03

versions of vLLM and ROCm, I don't know

9:06

exactly when this started, you cannot

9:10

have

9:11

multi-GPU setups with vLLM

9:15

on, again, the R9700 gfx1201

9:18

architecture. Again, the good news is

9:20

that AMD is very responsive. They are

9:23

aware of this, and I expect that,

9:25

probably by the time I publish this

9:27

video, there will be a fix, and

9:29

certainly, as soon as there is a fix, I

9:31

will let you know. Because of this, I

9:34

decided not to push the new vLLM toolbox

9:37

version yet and to keep the older

9:40

version up. Just be aware that the older

9:43

version will not support the newer
