Intel's Battle Matrix Benchmarks and Review
FULL TRANSCRIPT
Intel Battlemage for AI use cases.
Intel Arc Pro
B60
24, 48, 96 GB of VRAM sitting right here.
This is of course Battle Matrix and I
need your help. Yes, you, you there
watching the video. We have a lot to
discuss about Battle Matrix.
I enjoy this because these cards have an
ordinary 8-pin connector, just a single
8-pin connector on the end of the
card. Now, these are the 24 gig version.
There is a 48 gig version of these GPUs.
It's not really 48 gigs. It's two 24 gig
GPUs on one card. So, you could pack
eight of these into four slots. For us,
we're going to do four in four slots.
Well, four in four double slots, but
that'll work in an ordinary desktop
computer with an ordinary North American
power outlet. You know, 15 amps, 120
volts, because that's what we're rocking
here in North America. These Sparkle Arc
Pro cards, in addition to the normal 8-pin
power connector, have four DisplayPort
connections, a PCIe Gen 5 x8 interface, and
a flow-through design, which is actually
important when you're packing in this many
GPUs. It makes it a little easier to deal
with. So, let's get our Intel Xeon host system.
This machine is Battle Matrix. It's a
Xeon 3435X. Eight memory channels, 16
cores, and that matters because if
you're going to hang four GPUs off of a
workstation and actually feed them, you
don't want a platform that collapses
under its own weight. You need a lot of
IO. But it's interesting because the
CPUs are not doing the heavy lifting
here. I mean, this is basically an Intel
Xeon reference platform, and this is
pretty much what you need to drive four
of these GPUs. Each one of these has 20 Xe
cores, five render slices, 160 XMX
engines, a 192-bit GDDR6 bus, 24 gigs of VRAM.
There's eight lanes of PCIe Gen 5 for
connectivity. Together, these four GPUs
have 96 gigs of VRAM, as much as an
Nvidia RTX Pro 6000, which costs $8,500.
But the four of these GPUs cost about
the same as a 32 gig RTX 5090. So, this
is the part that people keep missing,
I think, about the B60. It's not trying
to be the biggest baddest data center
GPU ever, or even a workstation GPU. It's
about VRAM per dollar, and VRAM per dollar
that speaks modern AI, so that you can have
this workstation power with a relatively
modest power envelope because each one
of the GPUs is only on the
order of about 150 to 200 watts. There's
some variability in that with board
partners, but Sparkle's doing a good job
here. Intel's dedicated matrix hardware
here is XMX. Intel's positioning with
that, I think, is run it locally. Intel
is leaning pretty hard into the INT8
throughput, though: 197 TOPS per card.
And the practical story is if your model
fits in VRAM and your kernel path is
INT8 compatible, you should have a
pretty good experience with this
platform. 456 GB per second of memory
bandwidth per GPU is one of the quiet
headline specs, and eight lanes of PCIe Gen 5
is a lot of bandwidth. For data-parallel
workloads that don't need constant
cross-GPU chatter, you'll be fine. For
tensor-parallel workloads that do need
synchronization traffic, PCIe becomes a
little bit more of the story. But keep in
mind, you still have as much bandwidth as
16 lanes of Gen 4.
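As a quick back-of-the-envelope check on that comparison (the per-lane rates below are the usual published PCIe figures, not anything measured on this board):

```python
# Rough PCIe bandwidth comparison, per direction, after encoding overhead.
GEN4_GBPS_PER_LANE = 1.97   # ~2 GB/s per lane
GEN5_GBPS_PER_LANE = 3.94   # ~4 GB/s per lane

gen5_x8 = 8 * GEN5_GBPS_PER_LANE     # what each B60 gets
gen4_x16 = 16 * GEN4_GBPS_PER_LANE   # a "full" Gen 4 x16 slot

print(f"PCIe Gen 5 x8  ~ {gen5_x8:.1f} GB/s")
print(f"PCIe Gen 4 x16 ~ {gen4_x16:.1f} GB/s")  # effectively the same number
```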
And if you're doing pro workflows, it's not
just for AI. You've also got ray tracing
hardware and modern graphics APIs, and the
media block with the dual codec engines. I
think this might deserve its own separate
investigation apart from the AI and Battle
Matrix focus of this video. So, 24
gigs of VRAM per card, decent Gen 5
bandwidth, XMX for AI kernels, and Pro
Media Workstation features. That's the
baseline for this platform before we
even touch the benchmarks. So, keep that
in mind. Just level-setting expectations
here. It's like, oh, it's not as fast as
an RTX Pro 6000. Let's talk about
strategy because Intel's fighting for AI
market share and not from the high
ground. Nvidia owns mind share. AMD has
real momentum at the highest end in the
enterprise and they're starting to have
a credible workstation and server story.
So Intel, their angle with Battle Matrix
is different: a validated full stack for
people who want local inference, local
workstation-type tasks. Intel's own
description is basically, code name
Battle Matrix, an all-in-one
inferencing platform combining validated
hardware and software, aimed at privacy-conscious
pros who want to avoid
subscription costs. Okay, I mean, yeah,
don't we all? Intel literally says
on their public Battle Matrix web page,
yes, this could be your taking-off
point: up to 192 gigs of VRAM. That's on
the dual-GPU version of the card, so
still four physical cards but two GPUs per
card, and a first-class Linux workstation
experience. Intel's pitch will fail if
the software part of it fails. And
historically, Intel has been a software
juggernaut, but 2026 is moving fast.
Mainline versus vendor branch. I'm
looking at vLLM here. Mainline vLLM is at
0.15 at the time I'm doing this video,
released December 18th, 2025.
Intel's LLM stack that shipped in December
was a little behind that. As of January,
it's 11.1. So, it's
a little behind. Yes, Intel is
maintaining cadence, but there's still a
gap versus mainline vLLM. And whether
that gap matters or not depends on
whether or not you need the latest
features right now, or you want
something stable that's going to ship
with known good drivers and kernels.
Intel is leaning on known-good drivers and
testing. The target OS here, though, is
Ubuntu 25.04, kernel 6.14. I was a little
surprised it's Ubuntu 25.04, but I'll
take it. The vLLM difference also means
that I have to be a little bit careful
here. When Nvidia shipped Blackwell, a
lot of the optimized software wasn't in
place yet for Blackwell. I'm not ready
to do a full comparison because I'd like
to try to control vLLM versions, but I'd
also like to use a version of vLLM that
has been properly optimized for
Blackwell. I don't want to accidentally
compare and contrast like, okay, this is
the Blackwell performance that we can
expect. We're using an older vLLM when
that older version of vLLM lacks the
Blackwell optimizations. You see, it's a
little tough for me here. So, if
your workflow is to serve a model,
you're going to be fine. It's
pretty great. But if your workflow is to
fine-tune and mess around with the
stack, I mean, there are certain
problems with W8A8 implementation
support and LLM Compressor. I don't
know. For diffusion and attention-heavy
runs, you may see bandwidth ceilings
before compute ceilings. We can
experiment with this with resolution and
context length. But let's do a quick
run-through of the low-hanging fruit, the
benchmarking that I can do with those
caveats. All right, benchmarking Battle
Matrix. It's a little challenging
because of the vLLM version. First off,
MXFP4. That is definitely the best foot
forward with this platform, both the 120
billion parameter model, requiring 60
gigs of VRAM, and the 20 billion
parameter model. The 120B runs great on a
four-GPU setup because you have all that
extra room for lots of KV cache. And also,
if you're tempted, it's like, oh, I could
run this on three GPUs. That is not a thing.
Pretty much universally, you can have
tensor parallelism of two or four; three
is right out, just like in Monty Python.
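To make that concrete, here's a minimal sketch of what loading a model across all four cards looks like with vLLM's Python API. This is my own illustration, not Intel's documented command; on Battle Matrix you'd normally run it inside Intel's LLM Scaler container.

```python
# Minimal sketch: shard one model across the four B60s with vLLM.
# tensor_parallel_size must be 2 or 4 here -- 3 is right out.
from vllm import LLM, SamplingParams

llm = LLM(
    model="openai/gpt-oss-120b",   # MXFP4 weights, roughly 60 GB
    tensor_parallel_size=4,        # one shard per Arc Pro B60
)

params = SamplingParams(max_tokens=256, temperature=0.7)
out = llm.generate(
    ["Write an efficient Python program that searches for perfect numbers."],
    params,
)
print(out[0].outputs[0].text)
```

The equivalent for serving would be passing the same tensor-parallel setting to vllm serve.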
So, the performance and the benchmarks:
I've collected all of that and put it in a
thread on the forum, and that's really
where you should go if
you're interested in the nuts and bolts
of it. This is YouTube. It's video. This
is for me to wax poetic and get
everything mixed up and blah blah blah.
The ground truth is there in the forum
thread. So check out the forum thread.
GPT-OSS 120B running the vLLM serve
benchmark: 51,000 tokens of input, 25,000
tokens generated, 3,900 milliseconds, or
3.9 seconds, to the first response. On a
120 billion parameter model, that's pretty
good, and 986 tokens per second.
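For context, Open WebUI (or any other client) is just talking to the OpenAI-compatible endpoint that vllm serve exposes, so you can hit it directly too. A minimal sketch, where the host, port, and model name are assumptions for illustration:

```python
# Minimal sketch of querying the OpenAI-compatible endpoint from `vllm serve`.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user",
               "content": "Please write an efficient Python program for "
                          "searching for perfect numbers."}],
    max_tokens=512,
)
print(resp.choices[0].message.content)
```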
But I also use Open WebUI, because the
question with Open WebUI is, okay, how are
you going to use it interactively with a
UI? And that's this. I just hit the
dropdown and pick different models. I'm
testing Qwen here instead of GPT-OSS 120B: Qwen3 30
billion. And it's like, oh hello, and it's
like, please write an efficient Python
program for searching for perfect numbers,
and then watching the result and seeing how
that goes. Generally it does pretty well.
If we go through the tasks here, everything
was basically going according to plan. I encountered
some problems with Llama 70B. It's like,
all right, Meta Llama 70B: this is an FP16
model, so even though it's 70 billion
parameters, the ground-truth model from
Meta is 70 billion parameters at FP16.
So 140 GB of VRAM.
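A quick back-of-the-envelope on why that is, and why the dynamic FP8 and INT4 paths matter here (weights only; KV cache and overhead come on top):

```python
# Weight-memory estimate for a 70B model at different precisions.
params = 70e9
bytes_per_param = {"FP16": 2.0, "FP8": 1.0, "INT4": 0.5}

for dtype, b in bytes_per_param.items():
    print(f"Llama 70B @ {dtype}: ~{params * b / 1e9:.0f} GB of weights")
# FP16 ~140 GB does not fit in 96 GB of VRAM; FP8 ~70 GB and INT4 ~35 GB do.
```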
That's not going to run here. But Intel
does support dynamic quant. You see, you
need to live and die by the GitHub
repository that Intel has for LLM
Scaler. That's their fork of vLLM. It's
a slightly older version, and they paper
over the fact that it's a slightly older
version of vLLM. But the README is very
good, and it has green
check marks for all of the models that
you would think that you want to run.
And they also have helpful hints and
notes: vLLM MLA disabled if you're going
to run DeepSeek V2 Lite, for example. Okay. But
FP16, dynamic online FP8, dynamic online
INT4, and MXFP4. Now, GPT-OSS is MXFP4,
and these cards show how strong
they are with MXFP4, IMHO. So, maybe
other quants that other people are
doing, your mileage may vary. Maybe you
can get into some fun quants from
Unsloth and get some pretty good
performance here. But this is kind of
the sandbox that you have to stick to at
this point with the work that Intel has
done in order to make this operate. I
was really excited for a second with
GLM. It was like, "Oh, I'll be able to
run GLM 4.5." Ah, GLM 4.5 is too big.
You'd have to run like a really tiny
version of it in order to get it to fit
with this. But Qwen and, you know,
Whisper and the 8 to 20 billion
parameter models, oh yeah, all day long.
All day long, it's going to work pretty
well on this platform. So, what was my
trouble with Llama 70B? Well, I got an
unsupported dtype error. And this is
because I needed to run the command with
a block size of 64, and I did not
explicitly say block size 64.
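For illustration, here's roughly what the fix looks like, a hedged sketch and not the exact command from the forum thread; the model ID is an assumption, and on the Intel stack this runs inside the LLM Scaler container:

```python
# Sketch: be explicit about the KV-cache block size and use dynamic FP8
# so an FP16 70B checkpoint fits across 4 x 24 GB.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",  # assumption: any FP16 70B Llama
    tensor_parallel_size=4,
    quantization="fp8",   # dynamic/online FP8 quantization
    block_size=64,        # the setting I had originally left implicit
)
```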
The other thing that confused me for a
second was VLLM_USE_V1=1. Like, when
troubleshooting this kind of dtype
error, it's like, oh, I should probably
turn V1 off. But the V1 engine is designed
to always use chunked prefill, and that's
commentary, that's a vLLM thing. That
doesn't have anything to do with Intel.
But I was trying to troubleshoot why it
was telling me that. So, something about
the block size with that running. This
is the performance that I got
immediately, which is also not the
best performance you can actually get
out of Llama 70B, but I'll come back to
that in a second. So I was like, okay,
I'm a little worried about the 30 seconds
time to first token, but these random
tests are kind of challenging. So I
hooked it up to Open WebUI and I got
this error. It's like, oh, as of
Transformers v4.44, you need a chat
template. And it's like, but the
Jinja chat template's not part of
Meta's repository.
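If you hit the same wall, a quick way to check whether a checkpoint actually ships a chat template is with Transformers; the model ID here is an assumption, and if nothing comes back, vllm serve needs an explicit --chat-template file:

```python
# Check for a bundled chat template before wiring the model into a chat UI.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.3-70B-Instruct")  # assumption

if tok.chat_template is None:
    print("No chat template shipped -- pass your own Jinja file via --chat-template.")
else:
    msgs = [{"role": "user",
             "content": "Write a Python program that searches for perfect numbers."}]
    print(tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True))
```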
So, I created one and ran it, and it mostly
worked, but it was also a little bit
incoherent. So, if you want to do that,
this is the exact command that I used here,
so you could recreate what I did or mock me
derisively in the comments for something
that I did wrong. But with that, you
know, we're getting about 20 tokens per
second of output. And this is running Llama
70B, the FP16 model, but in a dynamic
FP8 quant, so it's sort of doing
this on the fly. Now, the best-case
scenario for Llama 70B is this: about
12.9 seconds time to first token and
about 366 output tokens per second, in
your absolute best case with Llama
70B in this kind of a scenario. So,
Qwen3 30B, again, is a much smaller
model. This is the exact command that I
ran. This is the performance that I got
for that. And that is throughput of 991
tokens per second. And I think that
you'll want to follow this thread
because there'll probably be follow-ups
that are like, oh, there's a new version
of vLLM from Intel, or, oh, instead of
block size 64, you know, use a smaller
block size and your tokens per second
and your output will go up a little
bit, or your prompt
processing will improve. I also did the
DeepSeek Distill Llama 70B, which
again is another Llama 70B model, and so
this should work just the same as the Llama
70B model, but the reality is that the
distillation works much better.
And so this again was on the order of
178 tokens per second, total token
throughput 536,
and the mean time to first token was
43.8 seconds, but again, long prompt. And
the actual web UI results were much
more coherent. So this parallelizes
really well. But I can also tell you
that in Open WebUI, the time to first
token for my prompt, which was a very
short "write a Python program to
search for perfect numbers," there's not
a lot of tokens in that prompt, and
the time to first token here was a
second or two. Now, there is a place
where the time-to-first-token
benchmark numbers are a little bit of a
warning, and that is when you're using
agentic coding. So if you're using an
editor that is going to load a lot of
context, like, say, 10,000 tokens of
context, that is when the prefill
performance will bite you a little bit,
and we talk about that a little
bit in this thread. I also talk
a little bit more about no chunked
prefill and some of the things that
lead to that.
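Here's the rough intuition in numbers; the prefill rate below is a placeholder picked purely for illustration, not a measured figure for the B60:

```python
# Time-to-first-token is roughly prompt length divided by prefill throughput.
def ttft_seconds(prompt_tokens: int, prefill_tokens_per_s: float) -> float:
    return prompt_tokens / prefill_tokens_per_s

ASSUMED_PREFILL_RATE = 1500.0  # tokens/s, illustrative assumption only

for prompt in (50, 10_000, 50_000):
    ttft = ttft_seconds(prompt, ASSUMED_PREFILL_RATE)
    print(f"{prompt:>6} prompt tokens -> ~{ttft:.1f} s to first token")
# A one-line chat prompt feels instant; a 10k+ token agentic coding context does not.
```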
While I was doing this testing, I
also managed to crash the GPU kernel.
So I put together a little script
here to help you diagnose when a GPU is
down. This could actually be added,
like we did in the HAProxy video (you
should check that out), as a health
check on HAProxy, which could
send you an email when a GPU crashes
like this. This is not necessarily an
Intel thing. I have this problem
with Blackwell GPUs. I have this problem
pretty much with any systems that
I'm administering that have more than,
like, a couple of GPUs in them. Like,
something will happen and the GPU will
drop. And so you can put together a
little script to make it part of your
HAProxy, or part of your
availability proxy, to just check and be
like, "Oh, one of the GPUs is down. I
should email an administrator because
we're probably going to need to reboot
or we're probably going to need to do
something to get that GPU back."
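Something along these lines. This is a minimal sketch of the idea, not the exact script from the forum thread; the endpoint list and SMTP details are placeholders you'd fill in yourself, and it assumes each vLLM instance exposes its /health endpoint.

```python
# Minimal watchdog sketch: check each vLLM /health endpoint and email an
# admin if one stops answering. Endpoints and SMTP settings are placeholders.
import smtplib
from email.message import EmailMessage
from urllib.request import urlopen

ENDPOINTS = {"b60-node-0": "http://localhost:8000/health"}  # assumption

def is_healthy(url: str, timeout: float = 5.0) -> bool:
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:          # connection refused, timeout, HTTP errors, etc.
        return False

def alert(name: str) -> None:
    msg = EmailMessage()
    msg["Subject"] = f"GPU backend {name} is down"
    msg["From"], msg["To"] = "watchdog@example.com", "admin@example.com"
    msg.set_content(f"{name} failed its health check; it may need a reboot.")
    with smtplib.SMTP("localhost") as s:  # assumption: local mail relay
        s.send_message(msg)

if __name__ == "__main__":
    for name, url in ENDPOINTS.items():
        if not is_healthy(url):
            alert(name)
```

You could run that from cron, or adapt the same check as an HAProxy external health check so the load balancer stops sending traffic to the dead backend.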
And then there are the exact commands
that I ran for the DeepSeek Distill
configuration and all of the other stuff
that goes with that. So, there's some
discussion, there's good discussion to
be had on the forum, but for the first
volley from Intel and for how they're
structuring their development here and
how organized they are and how
everything is happening in the open on
GitHub,
good results there. I want to see better
time to first token, and I want to see
better performance when we are doing
agentic-type coding, or if you're using an
editor that needs a lot of context. I want
to see these GPUs process that really
quickly. You know, I don't know if that's
prefill; I don't know where the improvement
needs to be made. But the end-user
experience is like, all right, I don't care
about the nuts and bolts, I just want to
get up and running with this quickly.
forum.level1techs.com.
And what about ComfyUI? Well, you're
gonna have to join us in the forum.
LLM Scaler Omni is the path to get
through to something that resembles a
ComfyUI setup. And it's kind of
similar: it's the LLM Scaler
container with the older version of vLLM.
Run a Docker image, you get a ComfyUI
GUI. We want to pick the preview method
latent2rgb. And then these are the
supported models, and this is sort of a
subset of what you might be expecting,
but you can still do a lot with this.
And so if you have a particularly
interesting something that you've done
with ComfyUI, or a ComfyUI workflow that
you want to show off, post it in the
forum and then we'll see if we can do
image generation and some other stuff
like that. I'm going to keep tinkering
with the unsupported dtype stuff and try
to make that a little bit more robust,
but we'll see. It's like, oh, what should
I do if I run out of memory? Disable
smart memory. But I've got 96 gigs to
work with, so let's just see how it goes.
So, digging into this with some of
the models, some of the squirreliness here
is apparently the chunked prefill and dtypes.
I literally set that we were going
to disable chunked prefill, and it says,
okay, I'm going to use chunked prefill anyway.
Even if you set chunked prefill to zero,
it accepts that, but then crashes. So,
probably a bug in the container. Now, on
the oneAPI side of things, I still get
the warm and fuzzies for what Intel is
trying to do with oneAPI. And
generally, I've had a great experience
with oneAPI in a sort of DIY context.
Putting it plainly, having a vendor who
wants you to use and build with that
kind of a thing where you can actually
script and automate and just deploy
stuff with Python, that's worth some
points. And I like that and I like where
Intel's going with that. So today, this
video is about establishing the
platform, what B60 is, and what we can
do with Battle Matrix and the concept of
what we're trying to accomplish here.
It's a first pass at a validated stack,
or validating the validated stack, I
guess. And yeah, there's more benchmarks
yet to do. But what workloads
do you have? Like, I kind of want to turn
this around because what do you want to
see me try to run on four 24 gig GPUs
that doesn't easily run on a single
card? Do you care about throughput and
latency, long context, concurrency,
tensor parallelism of 4? Do you want
diffusion, video, RAG pipelines, OCR, something
multimodal, or something totally cursed
that only this community would
like to try? Yeah, maybe. I mean, it's
sort of fun to experiment with, you
know, Docker, sort of fun for AI and AI
ideas. Drop your workload ideas below
or hit me up in the forum and let's
discuss something reproducible,
containerized, scriptable, something I
can easily run because the only way we
figure out where B60 actually stacks up
and where it actually belongs is to try
to actually do something useful with it.
And that is you in the community. What
useful thing are you trying to do with
AI? What do you want to see me try to
run? Because
you know, I know that, like, HAProxy and
that kind of stuff we can do. We can
split loads. We could move these GPUs
into another machine and you know have
two GPUs in this machine, two GPUs in
that machine and do some like network
stuff. Like there's a lot of fun things
that we could do, but overall for the
testing that I was able to do, the
performance is pretty solid, especially
if you consider performance per watt and
where they are with the VRAM. If Intel
can scale this up, this will be really
good. Architecturally, you know, if you
look at, like, Gaudi 2 and Gaudi 3 and
what Intel is doing on the
enterprise side, they've got a little
bit of the same kind of a problem that
AMD has, which is that like RDNA and
CDNA are not at all the same thing. And
Intel basically has the same problem.
And how Intel is going to solve that
problem is nowhere near as much at the
forefront as it is on the AMD side.
then meanwhile, you know, Nvidia is just
off in the background doing Nvidia
things. But still for inferencing and
how accessible this is and the costs,
Intel is uncharacteristically aggressive
with the costs here. Impressively so. So
this is a very promising platform and it
has a lot of really interesting aspects
to it, but there is more science to be
done. This has been a Level One quick
look at the Battle Matrix B60 four-GPU
configuration. We're going to run some
more stuff, so engage. All right, I'm
signing out and I'll see you in the
Level One forums, where I will be
engaging with those of you that chose to
engage.