Strix Halo & R9700 AI PRO Updates: Qwen 3.6 and Gemma 4 Support, ROCm 7.2.2 and Two New Series
FULL TRANSCRIPT
If you look at the channel, you'll see
my last video was uploaded over 47 days
ago. As a content creator, seeing the
number of days since the last upload go
up makes you feel very anxious. The
temptation is to just release unpolished
videos to feed the algorithm, but I
didn't want to release new videos just
for the sake of it. Reality is between
work and personal life, things have been
busy in the past 2 months. At the same
time, even without new videos, I kept
working on updating and maintaining the
toolboxes for Strix Halo and the R9700.
Even when things work smoothly,
running the benchmarks can take over 48
hours across all the main toolboxes for
llama.cpp and vLLM, and that is not even
considering the ComfyUI ones for image
and video generation. But, I think it's
important to ensure that the toolboxes
support new models and that there are
benchmarks so you can get an idea of
what kind of performance you get with
these new models. Aside from the
toolboxes, I have also been doing the
background research and prep work for
two entirely new video series coming to
the channel starting in May. So, the
purpose of this video is to give you a
quick update on all that's been going on
and what's coming in the next few
months.
Starting with Strix Halo, the llama.cpp
toolboxes are generally
straightforward to maintain. I have
pipelines that automatically check twice
a day if there are updates and rebuild
them from scratch. Now, sometimes
something breaks and I have to go in and
dig into the logs. For example, the
Vulkan toolboxes broke, but the fix was
fairly easy. So, now they are updated
and work fine. I also recently ran a new
set of benchmarks for the new open-weight
models like the Qwen 3.5 and 3.6
families as well as Gemma 4. As usual,
they are available at the URL on the
screen and you can also find them linked
on the GitHub repository.
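
The twice-daily update check described above can be sketched roughly as follows. This is only an illustration, not the actual pipeline: `BuildState`, `needs_rebuild`, and `run_check` are hypothetical names, and the upstream commit is passed in as a plain string rather than fetched from a real repository.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class BuildState:
    """Remembers which upstream commit the toolbox was last built from."""
    last_built_commit: Optional[str] = None


def needs_rebuild(state: BuildState, upstream_commit: str) -> bool:
    """True when upstream has moved since the last successful build."""
    return state.last_built_commit != upstream_commit


def run_check(
    state: BuildState,
    upstream_commit: str,
    rebuild: Callable[[str], None],
) -> bool:
    """Run one scheduled check; rebuild from scratch only if needed.

    `rebuild` is a hypothetical callback standing in for the real build step.
    The state is updated only after `rebuild` returns without raising, so a
    broken build is retried on the next scheduled check.
    """
    if not needs_rebuild(state, upstream_commit):
        return False
    rebuild(upstream_commit)
    state.last_built_commit = upstream_commit
    return True
```

A scheduler (for example a CI job that fires twice a day) would call `run_check` with the latest upstream commit; a failed rebuild leaves the state untouched, which is the point at which someone has to go and read the logs.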
As I said, most times the toolboxes are
updated automatically and nothing fails.
There is, however, a current issue with
ROCm nightly builds introduced around
the 14th of April. This bug prevents the
driver from properly using the unified
memory. This means that if you're using
the latest nightly builds, models that
require more than 64 GB of memory will
fail to load.
There is an open issue tracking this bug
and the fix should be implemented soon.
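
As a rough way to reason about which models are affected, the check below compares an estimated memory requirement against a 64 GB cap. It is a sketch under assumptions: the function names are hypothetical, the cap is treated as 64 GiB, and the KV cache and overhead figures are inputs you would have to estimate yourself.

```python
GIB = 1024 ** 3  # assumption: the "64 GB" cap is interpreted as 64 GiB


def required_bytes(
    weights_bytes: int,
    kv_cache_bytes: int = 0,
    overhead_bytes: int = 0,
) -> int:
    """Very rough total: model weights plus KV cache plus runtime overhead."""
    return weights_bytes + kv_cache_bytes + overhead_bytes


def loads_under_cap(
    weights_bytes: int,
    cap_gib: int = 64,
    kv_cache_bytes: int = 0,
    overhead_bytes: int = 0,
) -> bool:
    """True if the estimated requirement fits under the memory cap."""
    total = required_bytes(weights_bytes, kv_cache_bytes, overhead_bytes)
    return total <= cap_gib * GIB


# Illustrative sizes only, not measurements of any real model.
small_model = 40 * GIB
large_model = 70 * GIB
print(loads_under_cap(small_model))  # fits under the cap
print(loads_under_cap(large_model))  # exceeds the cap
```

Note that a model whose weights fit can still cross the cap once the KV cache for a long context is added, so the weights file size alone is an optimistic estimate.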
Ultimately, remember that nightly builds
are developer releases, so they will
occasionally break. I relied heavily on
nightly builds initially because Strix
Halo support was frankly not fully
implemented in the stable releases.
There weren't many other options. But,
realistically, looking at the benchmarks
for the stable ROCm 7.2 platforms, they
are essentially the same on llama.cpp as
the nightly builds. At the moment, there
is no reason to use the nightly builds
for llama.cpp. Additionally, AMD
recently released ROCm 7.2.2,
and I have pushed a llama.cpp toolbox
that uses just that release. It's
currently going through benchmarks, so
hopefully you'll see them up in the next
few days.
Also, on Strix Halo, there is a
patch by Sunil Pedapudi. I hope I
pronounced the name properly, but the
chances of that are very slim. This
patch is for llama.cpp and it aims to
improve performance. At a high level,
the patch tweaks matrix multiplication
parameters like the tile size to reduce
register pressure. Essentially, it
attempts to prevent the GPU compute
units from overflowing their available
vector registers, which would force them
to spill data into main memory and cause
a bottleneck. I added a test toolbox
that includes this PR and ran the
benchmarks. You can see the results on
screen. While there is a measurable
performance increase for short context
lengths, particularly on
mixture-of-experts models, these gains
evaporate on longer contexts. The real
issue holding this PR back, and why it
is currently stalled, is its high
variance. While it helps
mixture-of-experts models, it can
actually cause a performance regression
on standard dense models. On top of
that, the llama.cpp maintainers are
pushing back against merging
optimizations that are hard-coded
specifically for Strix Halo, as
doing so risks degrading performance on
other RDNA 3.5 APUs, which might have
smaller register files.
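
The link between tile size and register pressure can be illustrated with a toy model. This is a simplified sketch, not how the patch or the llama.cpp kernels actually account for registers: the function names, the per-thread register budget, and the "other registers" figure are all hypothetical numbers chosen for illustration.

```python
def accumulators_per_thread(tile_m: int, tile_n: int, threads: int) -> int:
    """Each output element of the tile needs one accumulator register.

    The tile's outputs are assumed to be split evenly across the threads
    (ceiling division so a partial share still costs a register).
    """
    outputs = tile_m * tile_n
    return -(-outputs // threads)


def estimated_registers(
    tile_m: int, tile_n: int, threads: int, other_regs: int
) -> int:
    """Accumulators plus a lump sum for operands, indices and temporaries."""
    return accumulators_per_thread(tile_m, tile_n, threads) + other_regs


def would_spill(
    tile_m: int, tile_n: int, threads: int, other_regs: int, reg_budget: int
) -> bool:
    """True if the estimate exceeds the (hypothetical) per-thread budget."""
    return estimated_registers(tile_m, tile_n, threads, other_regs) > reg_budget


# Hypothetical configuration: 256 threads, 40 registers of fixed cost,
# and a budget of 96 registers per thread.
print(would_spill(128, 128, 256, other_regs=40, reg_budget=96))  # larger tile
print(would_spill(64, 128, 256, other_regs=40, reg_budget=96))   # smaller tile
```

In this toy model, halving one tile dimension halves the accumulators each thread has to hold, which is the kind of trade-off a tile-size tweak plays with. It also shows why a value tuned for one register budget may be wrong for a chip with a smaller one.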
Finally, I also updated the Strix
Halo vLLM toolbox to support the new
model families like Qwen 3.5, 3.6, and
Gemma 4. This turned out to be a long
process taking an entire weekend and
most weekday evenings this week. It
required several patches, but it is now
stable and performs well. This update
includes RCCL, which is the library that
allows you to use vLLM tensor
parallelism for multi-node setups. So,
if you are interested in that, take a
look at my clustering video for vLLM on
Strix Halo.
The current vLLM toolbox is based on the
ROCm nightly builds, but I patched them
specifically to get around that bug that
causes the 64 GB memory cap, so they are
okay to use. Now, if you want to see the
latest benchmarks, as usual, you will
find the link in the GitHub repository
for the project. Remember that these
throughput benchmarks are designed to
saturate the GPU with many concurrent
requests, in my case, 64 requests, and
this is the kind of workload that vLLM
is specifically optimized for. It is
different than the performance you get
on a single request. So, unless you have
a use case that requires heavy
concurrency, you might want to stick to
llama.cpp, which is usually a more
practical option.
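
The difference between aggregate throughput and what a single request sees can be worked through with simple arithmetic. The sketch below uses made-up numbers purely for illustration (they are not results from these benchmarks), and the function names are hypothetical.

```python
def aggregate_throughput(total_tokens: int, wall_seconds: float) -> float:
    """Tokens per second summed over all concurrent requests."""
    return total_tokens / wall_seconds


def per_request_throughput(
    total_tokens: int, wall_seconds: float, concurrency: int
) -> float:
    """Average tokens per second seen by each individual request."""
    return aggregate_throughput(total_tokens, wall_seconds) / concurrency


# Illustrative run: 64 concurrent requests, 256 generated tokens each,
# all finishing in 32 seconds of wall-clock time.
concurrency = 64
total_tokens = concurrency * 256
wall_seconds = 32.0

print(aggregate_throughput(total_tokens, wall_seconds))                 # 512.0
print(per_request_throughput(total_tokens, wall_seconds, concurrency))  # 8.0
```

A headline figure from a saturating benchmark is the first number; a single user chatting with the model experiences something closer to the second, which is why the two should not be compared directly.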
Finally, on vLLM, I'm also considering
releasing a version of these toolboxes
based on the latest stable release of
ROCm, which should ensure we have an
overall more stable environment.
Before moving on, I want to give a quick
shout-out to Adrian, known as Lafu
Namor, and again, probably I pronounced
this nickname wrong, but I want to thank
Adrian for all the help with these PRs
and testing the llama.cpp toolboxes in
the background. And I want to add a
thank you to Patrick Audley for his
repository with extensive build notes of
vLLM on Strix Halo, and I used that
repository quite a lot for the new vLLM
toolbox.
Moving over to the R9700, I ran the
llama.cpp benchmarks for all the new
models, and those results are available
on the GitHub repository as usual. I
also tried to update the vLLM toolbox,
but ran into multiple issues.
Specifically, AMD had to stop compiling
RCCL into the nightly builds for
platforms like the R9700. This means
that you cannot use multi-GPU setups
with the current ROCm nightly builds.
This is Donato from the future. As I'm
editing this video, I want to give you
an update. I did actually
try to compile RCCL for the R9700,
but that still didn't work because there
is currently a bug. The good news is
that the bug is tracked and AMD is
aware, and they're investigating it.
But, essentially, right now in recent
versions of vLLM and ROCm, I don't know
exactly when this started, you cannot
have multi-GPU setups with vLLM on,
again, the R9700 gfx1201
architecture. Again, the good news is
that AMD is very responsive. They are
aware of this, and I can expect,
probably by the time I publish this
video, there will be a fix, and
certainly, as soon as there is a fix, I
will let you know. Because of this, I
decided not to push the new vLLM toolbox
version yet and to keep the older
version up. Just be aware that the older
version will not support the newer