I Tested the Cheapest Path to 96GB of VRAM
Usually, when you hear 96 GB of VRAM, you expect something absurdly expensive like this Nvidia RTX Pro 6000, which was 10 grand but is now down to 8,500. Still, this right here might be the most affordable 96 GB of VRAM you can buy in a single system right now.
The question is whether cheap VRAM is actually useful or just cheap. This server has four Intel Arc Pro B60 cards in it. Yes, Intel is continuing the Pro line. And Intel's pitch here is pretty clear. Each B60 has 24 GB of GDDR6, so together that gives me 96 GB of total VRAM in one box, plus 456 GB per second of memory bandwidth, which is useful for the decode phase of LLM inference. If you've been watching this channel, you know what that is. It also has about 200 W of
board power. And this particular version
is the Sparkle card, and it's listed at $799, but I've seen it on Newegg for $650. $650 for 24 GB. Nvidia's
previous generation 4090 has 24 gigs of
VRAM. And this one cost me over 2 grand.
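That 456 GB/s bandwidth number matters because the decode phase is memory-bandwidth-bound: every generated token has to stream the full set of model weights from VRAM. A back-of-envelope sketch (the 8 GB weight size is an assumption, matching a ~4B-parameter model in BF16):

```python
# Back-of-envelope decode ceiling: tokens/s <= bandwidth / weight bytes.
# This ignores KV-cache reads and scheduling overhead, so measured
# numbers will come in below this bound.

def decode_ceiling_tps(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on single-stream decode throughput in tokens/sec."""
    return bandwidth_gb_s / weights_gb

print(f"B60: ~{decode_ceiling_tps(456, 8):.0f} tok/s ceiling for an 8 GB model")
```

It's only an upper bound, but it's a useful sanity check against the measured generation speeds later in the video.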
The newest Blackwell generation 5080
also has 24 gigs. And that one, even
though it's listed at $1,000, you can probably find it for $1,500 to $1,800
now. So on paper, this looks like a
pretty simple idea. A lot of VRAM for
not that much money. So, I wanted to
compare this to a couple of GPUs in the
same price range. What's available?
Well, from AMD, we've got the RX 7900 XT.
There's nothing in the Pro line from AMD
that's close to the price. And from
Nvidia, we've got the RTX Pro 2000
Blackwell. Yes, same generation as the
big brother, but this is a tiny little
one with very different specs, yet it
carries that Pro name and the price tag.
The RX 7900 XT goes in a very different direction. This one has 20 GB of VRAM, not 24 like the Intel. All of these cards will run smaller models by themselves; this one just leaves you less room for context and KV cache. However, this card has 800 GB per
second of memory bandwidth and 315 watts
of board power. So, compared to the B60,
AMD is basically giving me less memory,
but a lot more bandwidth and a lot more
power. The RTX Pro 2000 is a bit of an oddball. It has 16 GB of GDDR7, so the brand-new memory, but it only has 288 GB per second of memory bandwidth, and it uses 70 watts of power, which means you don't need extra power cables to run it. It just gets its power from the PCIe slot, but this cost me 800 bucks, so
price is up there. Now, Nvidia's angle
is almost the opposite of Intel's.
There's less VRAM, much less bandwidth,
way lower power, and a much smaller
card. So, those are the things you get
for that price range. Now, the B60 is
not trying to be the fastest GPU. It's
just trying to be the GPU that gives you
the most VRAM density for the money. And
once I stack four of them into one
server, that becomes the real question.
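"VRAM density for the money" is easy to put in numbers using the prices quoted so far (the 7900 XT's street price wasn't given, so it's left out):

```python
# Dollars per gigabyte of VRAM, using prices quoted earlier in the video.
cards = {
    "Intel Arc Pro B60": (650, 24),   # Newegg street price
    "RTX Pro 2000":      (800, 16),   # what it cost the author
    "RTX 4090":          (2000, 24),  # "cost me over 2 grand"
}
for name, (price_usd, vram_gb) in cards.items():
    print(f"{name}: ~${price_usd / vram_gb:.0f} per GB")
```

The B60 lands around $27/GB, roughly a third of a 4090's cost per gigabyte.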
Is cheap VRAM actually useful or is it
just cheap? Can we actually use this and
get good results? We're about to see. As
devs, our info ends up everywhere.
Repos, bug trackers, random API signups. That turns into a profile that
data brokers can package and resell. The
harder you are to find, the harder you
are to target. In a lot of countries
now, the law says data brokers have to
remove your info when you request it.
But doing it yourself means hunting down
hundreds of brokers, dealing with each
one, and checking back again later.
Incogni sends those removal requests
automatically and keeps following up
until they comply. My own dashboard
showed hundreds of hits connected to my
details, and most of them have already
been taken down. And when I find a
specific page exposing my info, I use
custom removals. It's easy. I submit the
link, and their team handles the
takedown and follow-ups. You can think
of it like this. Find it, remove it, and
keep it removed. For extra peace of mind, Deloitte verified their data removal
processes. It helps with broker sites
and eligible pages, but it's not for
things like official records or random
social posts. You're on your own for
those. And if you want to test it out
first, there is a 30-day money back
guarantee. Take your personal data back
with Incogni. Go to incogn.com/alexiskin
and use code alexiskin for 60% off an
annual plan. Link down below. I'm going
to kick things off with the RTX Pro 2000
comparison cuz it's already near me and
I don't need to plug anything in. It's
nice. Boom. Oh, this is just to get a
flavor of how these cards compare. So,
I'm going to use a relatively small
model, but remember it needs context.
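Context needs memory because the KV cache grows with every token kept in context. A rough sketch of the footprint, using illustrative config values for a ~4B-parameter model (not necessarily the exact model's config):

```python
# Why VRAM beyond the weights matters: weights are fixed, but KV cache
# grows with context length (and with concurrent requests).

params_b = 4          # billions of parameters
bytes_per_param = 2   # BF16
weights_gb = params_b * bytes_per_param  # ~8 GB, as quoted in the video

# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes.
# Layer/head counts below are illustrative assumptions.
layers, kv_heads, head_dim = 36, 8, 128
kv_per_token = 2 * layers * kv_heads * head_dim * 2   # bytes, BF16
ctx_gb = 32_768 * kv_per_token / 1e9                  # a 32k-token context

print(f"weights: {weights_gb} GB, KV cache at 32k ctx: {ctx_gb:.1f} GB")
```

So a "small" 8 GB model can still want several extra gigabytes once you give it a long context, which is exactly where a 24 GB card earns its keep.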
So, even the Qwen 3 4B model, we're running the full BF16 on all these machines. That one is 8 GB. It's already
half the memory of what's available on
this RTX Pro 2000. Yeah, you're not
going to be able to run huge models on
this. But this will give us a little
comparison point of how perhaps these
machines will scale. In actuality, when
you scale them out, it might be a little
different, but I don't have four RTX Pro
2000s or four of the AMD cards. I do
have four of these, and we'll get to
that. So, I'm going to kick off vLLM, and we're going to use vLLM throughout here.
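The exact launch command isn't shown on screen; a typical invocation, assuming vLLM's standard `vllm serve` CLI and a hypothetical model id, would look like this:

```python
# A plausible vLLM launch for this single-GPU test. The model id and
# flag choices are assumptions; the video doesn't show the command.
import shlex

model = "Qwen/Qwen3-4B"  # assumed model identifier
cmd = [
    "vllm", "serve", model,
    "--dtype", "bfloat16",           # full BF16, as stated in the video
    "--tensor-parallel-size", "1",   # one GPU for this first comparison
]
print(shlex.join(cmd))
```

On the multi-GPU Intel box, the same command with `--tensor-parallel-size 4` would shard the model across all four B60s; here we stay on a single card.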
I'm going to keep an eye on nvidia-smi here. We've got 70 watts maximum for this
GPU. And over here on the Intel box,
this is showing us that I have four GPUs
installed: 0, 1, 2, and 3, but we're just going to be using GPU 0 for this
test. And over here, I'll kick off the
same exact model, but using the Intel version of vLLM, and I'll get into that in a moment. Here, I'm going to run llama-benny, which is a nice tool by Yuger. You can find it on GitHub. It's really a good tool because of its flexibility; it works kind of like llama-bench, but over HTTP, so you can run it against any backend. First,
let's do concurrency of one, which means
it's going to do only one request,
simulating kind of like a chat scenario.
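At concurrency one, each benchmark request is just a single chat completion sent to the OpenAI-compatible endpoint vLLM exposes at `/v1/chat/completions`. A minimal sketch of the request body (model id and prompt are assumptions):

```python
# What a concurrency-1 run sends, one request at a time, to vLLM's
# OpenAI-compatible API. Model name and prompt are illustrative.
import json

def chat_payload(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Request body for a non-streaming /v1/chat/completions call."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": False,
    }

body = chat_payload("Qwen/Qwen3-4B", "Explain KV cache in one sentence.")
print(json.dumps(body, indent=2))
```

The benchmark then divides prompt tokens by prefill time and generated tokens by decode time to get the two throughput numbers reported below.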
And boom, there we go. You can see we
got that request right here in VLM. And
we're using 69 watts of power out of 70.
So, pretty much maxing it out. Prompt processing: 5,223 tokens per second. Nvidia is really good
at prompt processing speed, even on such
a tiny GPU. That's really impressive. 27
tokens per second for token generation.
Remember, this is a BF16 model, even
though it's a small one. Now, let's do
this against the Intel box.
What? I think I named my models
differently there. Indeed. Let's copy
that model name. And there we go. You
can see that this is only using that zeroth GPU, not the rest of them. And
we got 17% utilization. Not great. About
120 watts of power also. But look at
that: 22 GB is being used up on that machine, which is giving us all that extra cache, all that extra space for the context. That's where it's really handy to have more VRAM. How's the speed? Wow. I mean, it does have much higher bandwidth than the Nvidia GPU: 9,576 tokens per second for prompt processing, and token generation is 45 tokens per
second. Now, what happens if we change the concurrency to, say, 32? That means 32 requests at a time are being handled.
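Batching helps because each decode step streams the weights from VRAM once regardless of how many requests share it, so aggregate throughput climbs with concurrency even as each individual request slows down. A crude scaling model (the efficiency factor is purely an assumption, standing in for KV-cache bandwidth and scheduling overhead):

```python
# Illustrative batching model, not measured data: aggregate tok/s grows
# with batch size, at less than perfect scaling.

def aggregate_tps(batch: int, single_stream_tps: float,
                  efficiency: float = 0.6) -> float:
    """Aggregate tokens/sec across a batch; `efficiency` is a fudge
    factor (assumed, not measured) for per-request overhead."""
    return batch * single_stream_tps * efficiency

print(f"batch 1:  {aggregate_tps(1, 27, 1.0):.0f} tok/s total")
total = aggregate_tps(32, 27)
print(f"batch 32: {total:.0f} tok/s total ({total / 32:.1f} tok/s per request)")
```

The shape of the curve is the point: total throughput up, per-request speed down, which is what the batched run below shows.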
Send that over and that's going to the
Nvidia GPU right now. There you can see
that we got a bunch of requests at the
same time. They're all being processed.
So this is going to take a little bit
longer. 69 watts being used out of 70.
And this is the entire system. 158 watts
being used right now by this entire
computer. I mean, it's kind of not a
fair comparison because this is a very
different kind of system than this
server. This is an AMD desktop chip and this is a server-based Xeon machine. And it's done now. Now that it's done, we're down to 75, 74 watts. Okay, that makes sense. Whoa. So, the prompt processing speed went down a little bit: 1,313,