Intel's Battle Matrix Benchmarks and Review
FULL TRANSCRIPT
Intel Battlemage for AI use cases.
Intel Arc Pro
B60
24, 48, 96 GB of VRAM sitting right here.
This is of course Battle Matrix and I
need your help. Yes, you, you there
watching the video. We have a lot to
discuss about Battle Matrix.
I enjoy this because these cards have an
ordinary 8-pin connector, just a single
8-pin connector on the end of the
card. Now, these are the 24 gig version.
There is a 48 gig version of these GPUs.
It's not really 48 gigs. It's two 24 gig
GPUs on one card. So, you could pack
eight of these into four slots. For us,
we're going to do four in four slots.
Well, four in four double slots, but
that'll work in an ordinary desktop
computer with an ordinary North American
power outlet. You know, 15 amps, 120
volts, because that's what we're rocking
here in North America. These Sparkle Arc
Pro cards, in addition to the normal 8-pin
power connector, have four DisplayPort
connections, a PCIe Gen 5 x8 interface, and
a flow-through design, which is actually
important when you're packing in this many
GPUs. It makes it a little easier to deal
with. So, let's get our Intel Xeon host system.
This machine is Battle Matrix. It's a
Xeon 3435X. Eight memory channels, 16
cores, and that matters because if
you're going to hang four GPUs off of a
workstation and actually feed them, you
don't want a platform that collapses
under its own weight. You need a lot of
IO. But it's interesting because the
CPUs are not doing the heavy lifting
here. I mean, this is basically an Intel
Xeon reference platform, and this is
pretty much what you need to drive four
of these GPUs. Each one of these has 20 Xe
cores, five render slices, 160 XMX
engines, a 192-bit GDDR6 bus, 24 gigs of VRAM.
There's eight lanes of PCIe Gen 5 for
connectivity. Together, these four GPUs
have 96 gigs of VRAM, as much as an
Nvidia RTX Pro 6000, which costs $8,500.
But the four of these GPUs cost about
the same as a 32 gig RTX 5090. So, this
is the part that people keep missing,
I think, about the B60. It's not trying
to be the biggest baddest data center
GPU ever, or even a workstation GPU. It's
about VRAM per dollar, and VRAM per dollar
that speaks modern AI, so that you can have
this workstation power with a relatively
modest power envelope because each one
of the GPUs is only on the
order of about 150 to 200 watts. There's
some variability in that with board
partners, but Sparkle's doing a good job
here. Intel's dedicated matrix hardware
here is XMX. Intel's positioning with
that, I think, is run it locally. Intel
is leaning pretty hard into the INT8
throughput, though: 197 TOPS per card.
And the practical story is if your model
fits in VRAM and your kernel path is
INT8 compatible, you should have a
pretty good experience with this
platform. 456 GB per second of memory
bandwidth per GPU is one of the quiet
headline specs, and eight lanes of PCIe Gen 5
is a lot of bandwidth. For data-parallel
workloads that don't need constant
cross-GPU chatter, you'll be fine. For
tensor-parallel workloads that do need
synchronization traffic, PCIe becomes a
little bit more of the story. But keep in
mind, you still have as much bandwidth as
16 lanes of Gen 4.
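As a quick back-of-the-envelope check on that comparison (the per-lane rates below are the usual published PCIe figures, not anything measured on this board):

```python
# Rough PCIe bandwidth comparison, per direction, after encoding overhead.
GEN4_GBPS_PER_LANE = 1.97   # ~2 GB/s per lane
GEN5_GBPS_PER_LANE = 3.94   # ~4 GB/s per lane

gen5_x8 = 8 * GEN5_GBPS_PER_LANE     # what each B60 gets
gen4_x16 = 16 * GEN4_GBPS_PER_LANE   # a "full" Gen 4 x16 slot

print(f"PCIe Gen 5 x8  ~ {gen5_x8:.1f} GB/s")
print(f"PCIe Gen 4 x16 ~ {gen4_x16:.1f} GB/s")  # effectively the same number
```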
And if you're doing pro workflows, it's not
just for AI. You've also got ray tracing
hardware and modern graphics APIs, and the
media block with the dual codec engines. I
think this might deserve its own separate
investigation apart from the AI and Battle
Matrix focus of this video. So, 24
gigs of VRAM per card, decent Gen 5
bandwidth, XMX for AI kernels, and Pro
Media Workstation features. That's the
baseline for this platform before we
even touch the benchmarks. So, keep that
in mind. Just level-setting expectations
here. It's like, oh, it's not as fast as
an RTX Pro 6000. Let's talk about
strategy because Intel's fighting for AI
market share and not from the high
ground. Nvidia owns mind share. AMD has
real momentum at the highest end in the
enterprise and they're starting to have
a credible workstation and server story.
So Intel, their angle with Battle Matrix
is different: a validated full stack for
people who want local inference, local
workstation-type tasks. Intel's own
description is basically, code name
Battle Matrix, an all-in-one
inferencing platform combining validated
hardware and software, aimed at privacy-conscious
pros who want to avoid
subscription costs. Okay, I mean, yeah,
don't we all? Intel literally says
on their public Battle Matrix web page,
yes, this could be your taking-off
point: up to 192 gigs of VRAM. That's on
the dual-GPU version of the card, so
still four physical cards but two GPUs per
card, and a first-class Linux workstation
experience. Intel's pitch will fail if
the software part of it fails. And
historically, Intel has been a software
juggernaut, but 2026 is moving fast.
Mainline versus vendor branch. I'm
looking at vLLM here. Mainline vLLM is at
0.15 at the time I'm doing this video,
released December 18th, 2025.
Intel's LLM stack that shipped in December
was a little behind that. As of January,
it's 11.1. So, it's
a little behind. Yes, Intel is
maintaining cadence, but there's still a
gap versus mainline vLLM. And whether
that gap matters or not depends on
whether or not you need the latest
features right now, or you want
something stable that's going to ship
with known good drivers and kernels.
Intel is leaning on known-good drivers and
testing. The target OS here, though, is
Ubuntu 25.04, kernel 6.14. I was a little
surprised it's Ubuntu 25.04, but I'll
take it. The vLLM difference also means
that I have to be a little bit careful
here. When Nvidia shipped Blackwell, a
lot of the optimized software wasn't in
place yet for Blackwell. I'm not ready
to do a full comparison because I'd like
to try to control vLLM versions, but I'd
also like to use a version of vLLM that
has been properly optimized for
Blackwell. I don't want to accidentally
compare and contrast like, okay, this is
the Blackwell performance that we can
expect. We're using an older vLLM when
that older version of vLLM lacks the
Blackwell optimizations. You see, it's a
little tough for me here. So, if
your workflow is to serve a model,
you're going to be fine. It's
pretty great. But if your workflow is to
fine-tune and mess around with the
stack, I mean, there are certain
problems with W8A8 implementation
support and LLM Compressor. I don't
know. For diffusion and attention-heavy
runs, you may see bandwidth ceilings
before compute ceilings. We can
experiment with this with resolution and
context length. But let's do a quick
run-through of the low-hanging fruit, the
benchmarking that I can do with those
caveats. All right, benchmarking Battle
Matrix. It's a little challenging
because of the vLLM version. First off,
MXFP4. That is definitely the best foot
forward with this platform, both the 120
billion parameter model, requiring 60
gigs of VRAM, and the 20 billion
parameter model. The 120B runs great on a
four-GPU setup because you have all that
extra room for lots of KV cache. And also,
if you're tempted, it's like, oh, I could
run this on three GPUs. That is not a thing.
Pretty much universally, you can have
tensor parallelism of two or four; three
is right out, just like in Monty Python.
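To make that concrete, here's a minimal sketch of what loading a model across all four cards looks like with vLLM's Python API. This is my own illustration, not Intel's documented command; on Battle Matrix you'd normally run it inside Intel's LLM Scaler container.

```python
# Minimal sketch: shard one model across the four B60s with vLLM.
# tensor_parallel_size must be 2 or 4 here -- 3 is right out.
from vllm import LLM, SamplingParams

llm = LLM(
    model="openai/gpt-oss-120b",   # MXFP4 weights, roughly 60 GB
    tensor_parallel_size=4,        # one shard per Arc Pro B60
)

params = SamplingParams(max_tokens=256, temperature=0.7)
out = llm.generate(
    ["Write an efficient Python program that searches for perfect numbers."],
    params,
)
print(out[0].outputs[0].text)
```

The equivalent for serving would be passing the same tensor-parallel setting to vllm serve.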
So, the performance and the benchmarks:
I've collected all of that and put it in a
thread on the forum, and that's really
where you should go if
you're interested in the nuts and bolts
of it. This is YouTube. It's video. This
is for me to wax poetic and get
everything mixed up and blah blah blah.
The ground truth is there in the forum
thread. So check out the forum thread.
GPT-OSS 120B running the vLLM serve
benchmark: 51,000 tokens of input, 25,000
tokens generated, 3,900 milliseconds, or
3.9 seconds, to the first response. On a
120 billion parameter model, that's pretty
good, and 986 tokens per second.
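For context, Open WebUI (or any other client) is just talking to the OpenAI-compatible endpoint that vllm serve exposes, so you can hit it directly too. A minimal sketch, where the host, port, and model name are assumptions for illustration:

```python
# Minimal sketch of querying the OpenAI-compatible endpoint from `vllm serve`.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user",
               "content": "Please write an efficient Python program for "
                          "searching for perfect numbers."}],
    max_tokens=512,
)
print(resp.choices[0].message.content)
```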
But I also use Open WebUI, because the
question with Open WebUI is, okay, how are
you going to use it interactively with a
UI? And that's this. I just hit the
dropdown and pick different models. I'm
testing Qwen here instead of GPT-OSS 120B: Qwen3 30
billion. And it's like, oh hello, and it's
like, please write an efficient Python
program for searching for perfect numbers,
and then watching the result and seeing how
that goes. Generally it does pretty well.
If we go through the tasks here, everything
was basically going according to plan. I encountered
some problems with Llama 70B. It's like,
all right, Meta Llama 70B: this is an FP16
model, so even though it's 70 billion
parameters, the ground-truth model from
Meta is 70 billion parameters at FP16.
So 140 GB of VRAM.
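A quick back-of-the-envelope on why that is, and why the dynamic FP8 and INT4 paths matter here (weights only; KV cache and overhead come on top):

```python
# Weight-memory estimate for a 70B model at different precisions.
params = 70e9
bytes_per_param = {"FP16": 2.0, "FP8": 1.0, "INT4": 0.5}

for dtype, b in bytes_per_param.items():
    print(f"Llama 70B @ {dtype}: ~{params * b / 1e9:.0f} GB of weights")
# FP16 ~140 GB does not fit in 96 GB of VRAM; FP8 ~70 GB and INT4 ~35 GB do.
```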
That's not going to run here. But Intel
does support dynamic quant. You see, you
need to live and die by the GitHub
repository that Intel has for LLM
Scaler. That's their fork of vLLM. It's
a slightly older version, and they paper
over the fact that it's a slightly older
version of vLLM. But the README is very
good, and it has green
check marks for all of the models that
you would think that you want to run.
And they also have helpful hints and
notes: vLLM MLA disabled if you're going
to run DeepSeek V2 Lite, for example. Okay. But
FP16, dynamic online FP8, dynamic online
INT4, and MXFP4. Now, GPT-OSS is MXFP4,
and these cards show how strong
they are with MXFP4, IMHO. So, maybe
other quants that other people are
doing, your mileage may vary. Maybe you
can get into some fun quants from
Unsloth and get some pretty good
performance here. But this is kind of
the sandbox that you have to stick to at
this point with the work that Intel has
done in order to make this operate. I
was really excited for a second with
GLM. It was like, "Oh, I'll be able to
run GLM 4.5." Ah, GLM 4.5 is too big.
You'd have to run like a really tiny
version of it in order to get it to fit
with this. But Qwen and, you know,
Whisper and the 8 to 20 billion
parameter models, oh yeah, all day long.
All day long, it's going to work pretty
well on this platform. So, what was my
trouble with Llama 70B? Well, I got an
unsupported dtype error. And this is
because I needed to run the command with
a block size of 64, and I did not
explicitly say block size 64.
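For illustration, here's roughly what the fix looks like, a hedged sketch and not the exact command from the forum thread; the model ID is an assumption, and on the Intel stack this runs inside the LLM Scaler container:

```python
# Sketch: be explicit about the KV-cache block size and use dynamic FP8
# so an FP16 70B checkpoint fits across 4 x 24 GB.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",  # assumption: any FP16 70B Llama
    tensor_parallel_size=4,
    quantization="fp8",   # dynamic/online FP8 quantization
    block_size=64,        # the setting I had originally left implicit
)
```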
The other thing that confused me for a
second was VLLM_USE_V1=1. Like, when
troubleshooting this kind of dtype
error, it's like, oh, I should probably
turn V1 off. But the V1 engine is designed
to always use chunked prefill, and that's
commentary, that's a vLLM thing. That
doesn't have anything to do with Intel.
But I was trying to troubleshoot why it
was telling me that. So, something about
the block size with that running. This
is the performance that I got
immediately, which is also not the
best performance you can actually get
out of Llama 70B, but I'll come back to
that in a second. So I was like, okay,
I'm a little worried about the 30 seconds
time to first token, but these random
tests are kind of challenging. So I
hooked it up to Open WebUI and I got
this error. It's like, oh, as of
Transformers v4.44, you need a chat
template. And it's like, but the
Jinja chat template's not part of
Meta's repository.
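If you hit the same wall, a quick way to check whether a checkpoint actually ships a chat template is with Transformers; the model ID here is an assumption, and if nothing comes back, vllm serve needs an explicit --chat-template file:

```python
# Check for a bundled chat template before wiring the model into a chat UI.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.3-70B-Instruct")  # assumption

if tok.chat_template is None:
    print("No chat template shipped -- pass your own Jinja file via --chat-template.")
else:
    msgs = [{"role": "user",
             "content": "Write a Python program that searches for perfect numbers."}]
    print(tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True))
```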
So, I created one and ran it, and it mostly
worked, but it was also a little bit
incoherent. So, if you want to do that,
this is the exact command that I used here,
so you could recreate what I did or mock me
derisively in the comments for something
that I did wrong. But with that, you
know, we're getting about 20 tokens per
second of output. And this is running Llama
70B, the FP16 model, but in a dynamic
FP8 quant, so it's sort of doing
this on the fly. Now, the best-case
scenario for Llama 70B is this: about
12.9 seconds time to first token and
about 366 output tokens per second, in
your absolute best case with Llama
70B in this kind of a scenario. So,
Qwen3 30B, again, is a much smaller
model. This is the exact command that I
ran. This is the performance that I got
for that. And that is throughput of 991
tokens per second. And I think that
you'll want to follow this thread
because there'll probably be follow-ups
that are like, oh, there's a new version
of vLLM from Intel, or, oh, instead of
block size 64, you know, use a smaller
block size and your tokens per second
and your output will go up a little
bit, or your prompt
processing will improve. I also did the
DeepSeek Distill Llama 70B, which
again is another Llama 70B model, and so
this should work just the same as the Llama
70B model, but the reality is that the
distillation works much better.
And so this again was on the order of
178 tokens per second, total token
throughput 536,
and the mean time to first token was
43.8 seconds, but again, long prompt. And
the actual web UI results were much
more coherent. So this parallelizes
really well. But I can also tell you
that in Open WebUI, the time to first
token for my prompt, which was a very
short "write a Python program to
search for perfect numbers," there's not
a lot of tokens in that prompt, and
the time to first token here was a
second or two. Now, there is a place
where the time-to-first-token
benchmark numbers are a little bit of a
warning, and that is when you're using
agentic coding. So if you're using an
editor that is going to load a lot of
context, like, say, 10,000 tokens of
context, that is when the prefill
performance will bite you a little bit,
and we talk about that a little
bit in this thread. I also talk
a little bit more about no chunked
prefill and some of the things that
lead to that.
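Here's the rough intuition in numbers; the prefill rate below is a placeholder picked purely for illustration, not a measured figure for the B60:

```python
# Time-to-first-token is roughly prompt length divided by prefill throughput.
def ttft_seconds(prompt_tokens: int, prefill_tokens_per_s: float) -> float:
    return prompt_tokens / prefill_tokens_per_s

ASSUMED_PREFILL_RATE = 1500.0  # tokens/s, illustrative assumption only

for prompt in (50, 10_000, 50_000):
    ttft = ttft_seconds(prompt, ASSUMED_PREFILL_RATE)
    print(f"{prompt:>6} prompt tokens -> ~{ttft:.1f} s to first token")
# A one-line chat prompt feels instant; a 10k+ token agentic coding context does not.
```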
While I was doing this testing, I
also managed to crash the GPU kernel.
So I put together a little script
here to help you diagnose when a GPU is
down. This could actually be added,
like we did in the HAProxy video (you
should check that out), as a health
check on HAProxy, which could
send you an email when a GPU crashes
like this. This is not necessarily an
Intel thing. I have this problem
with Blackwell GPUs. I have this problem
pretty much with any systems that
I'm administering that have more than,
like, a couple of GPUs in them. Like,
something will happen and the GPU will
drop. And so you can put together a
little script to make it part of your
HAProxy, or part of your
availability proxy, to just check and be
like, "Oh, one of the GPUs is down. I
should email an administrator because
we're probably going to need to reboot
or we're probably going to need to do
something to get that GPU back."
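Something along these lines. This is a minimal sketch of the idea, not the exact script from the forum thread; the endpoint list and SMTP details are placeholders you'd fill in yourself, and it assumes each vLLM instance exposes its /health endpoint.

```python
# Minimal watchdog sketch: check each vLLM /health endpoint and email an
# admin if one stops answering. Endpoints and SMTP settings are placeholders.
import smtplib
from email.message import EmailMessage
from urllib.request import urlopen

ENDPOINTS = {"b60-node-0": "http://localhost:8000/health"}  # assumption

def is_healthy(url: str, timeout: float = 5.0) -> bool:
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:          # connection refused, timeout, HTTP errors, etc.
        return False

def alert(name: str) -> None:
    msg = EmailMessage()
    msg["Subject"] = f"GPU backend {name} is down"
    msg["From"], msg["To"] = "watchdog@example.com", "admin@example.com"
    msg.set_content(f"{name} failed its health check; it may need a reboot.")
    with smtplib.SMTP("localhost") as s:  # assumption: local mail relay
        s.send_message(msg)

if __name__ == "__main__":
    for name, url in ENDPOINTS.items():
        if not is_healthy(url):
            alert(name)
```

You could run that from cron, or adapt the same check as an HAProxy external health check so the load balancer stops sending traffic to the dead backend.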
And then there are the exact commands
that I ran for the DeepSeek Distill
configuration and all of the other stuff
that goes with that. So, there's some
discussion, there's good discussion to
be had on the forum, but for the first
volley from Intel and for how they're
structuring their development here and
how organized they are and how
everything is happening in the open on
GitHub,
good results there. I want to see better
time to first token, and I want to see
better performance when we are doing
agentic-type coding, or if you're using an
editor that needs a lot of context. I want
to see these GPUs process that really
quickly. You know, I don't know if that's
prefill; I don't know where the improvement
needs to be made. But the end-user
experience is like, all right, I don't care
about the nuts and bolts, I just want to
get up and running with this quickly.
forum.level1techs.com.
And what about ComfyUI? Well, you're
gonna have to join us in the forum.
LLM Scaler Omni is the path to get
through to something that resembles a
ComfyUI setup. And it's kind of
similar: it's the LLM Scaler
container with the older version of vLLM.
Run a Docker image, you get a ComfyUI
GUI. We want to pick the preview method
latent2rgb. And then these are the
supported models, and this is sort of a
subset of what you might be expecting,
but you can still do a lot with this.
And so if you have a particularly
interesting something that you've done
with ComfyUI, or a ComfyUI workflow that
you want to show off, post it in the
forum and then we'll see if we can do
image generation and some other stuff
like that. I'm going to keep tinkering
with the unsupported dtype stuff and try
to make that a little bit more robust,
but we'll see. It's like, oh, what should
I do if I run out of memory? Disable
smart memory. But I've got 96 gigs to
work with, so let's just see how it goes.
So, digging into this with some of
the models, some of the squirreliness here
is apparently the chunked prefill and dtypes.
I literally set that we were going
to disable chunked prefill, and it says,
okay, I'm going to use chunked prefill anyway.
Even if you set chunked prefill to zero,
it accepts that, but then crashes. So,
probably a bug in the container. Now, on
the oneAPI side of things, I still get
the warm and fuzzies for what Intel is
trying to do with oneAPI. And
generally, I've had a great experience
with oneAPI in a sort of DIY context.
Putting it plainly, having a vendor who
wants you to use and build with that
kind of a thing where you can actually
script and automate and just deploy
stuff with Python, that's worth some
points. And I like that and I like where
Intel's going with that. So today, this
video is about establishing the
platform, what B60 is, and what we can
do with Battle Matrix and the concept of
what we're trying to accomplish here.
It's a first pass at a validated stack,
or validating the validated stack, I
guess. And yeah, there's more benchmarks
yet to do. But what workloads
do you have? Like, I kind of want to turn
this around because what do you want to
see me try to run on four 24 gig GPUs
that doesn't easily run on a single
card? Do you care about throughput and
latency, long context, concurrency,
tensor parallelism of 4? Do you want
diffusion, video, RAG pipelines, OCR, something
multimodal, or something totally cursed
that only this community would
like to try? Yeah, maybe. I mean, it's
sort of fun to experiment with, you
know, Docker, sort of fun for AI and AI
ideas. Drop your workload ideas below
or hit me up in the forum and let's
discuss something reproducible,
containerized, scriptable, something I
can easily run because the only way we
figure out where B60 actually stacks up
and where it actually belongs is to try
to actually do something useful with it.
And that is you in the community. What
useful thing are you trying to do with
AI? What do you want to see me try to
run? Because
you know, I know that, like, HAProxy and
that kind of stuff we can do. We can
split loads. We could move these GPUs
into another machine and you know have
two GPUs in this machine, two GPUs in
that machine and do some like network
stuff. Like there's a lot of fun things
that we could do, but overall for the
testing that I was able to do, the
performance is pretty solid, especially
if you consider performance per watt and
where they are with the VRAM. If Intel
can scale this up, this will be really
good. Architecturally, you know, if you
look at, like, Gaudi 2 and Gaudi 3 and
what Intel is doing on the
enterprise side, they've got a little
bit of the same kind of a problem that
AMD has, which is that like RDNA and
CDNA are not at all the same thing. And
Intel basically has the same problem.
And how Intel is going to solve that
problem is nowhere near as much at the
forefront as it is on the AMD side.
then meanwhile, you know, Nvidia is just
off in the background doing Nvidia
things. But still for inferencing and
how accessible this is and the costs,
Intel is uncharacteristically aggressive
with the costs here. Impressively so. So
this is a very promising platform and it
has a lot of really interesting aspects
to it, but there is more science to be
done. This has been a Level One quick
look at the Battle Matrix B60 four-GPU
configuration. We're going to run some
more stuff, so engage. All right, I'm
signing out and I'll see you in the
Level One forums, where I will be
engaging with those of you that chose to
engage.