I Tested the Cheapest Path to 96GB of VRAM
Usually, when you hear 96 GB of VRAM, you expect something absurdly expensive like this Nvidia RTX Pro 6000, which was 10 grand but is now down to 8,500. Still, this right here might be the most affordable 96 GB of VRAM you can buy in a single system right now.
The question is whether cheap VRAM is actually useful or just cheap. This server has four Intel Arc Pro B60 cards in it. Yes, Intel is continuing the Pro line. And Intel's pitch here is pretty clear. Each B60 has 24 GB of GDDR6, so together that gives me 96 GB of total VRAM in one box, plus 456 GB per second of memory bandwidth, which is useful for the decode phase of LLM inference. If you've been watching this channel, you know what that is. It also has about 200 W of
board power. And this particular version
is the Sparkle card, and it's listed at $799, but I've seen it on Newegg for $650. $650 for 24 GB. Nvidia's
previous generation 4090 has 24 gigs of
VRAM. And this one cost me over 2 grand.
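That 456 GB/s bandwidth number matters because the decode phase is memory-bandwidth-bound: every generated token has to stream the full set of model weights from VRAM. A back-of-envelope sketch (the 8 GB weight size is an assumption, matching a ~4B-parameter model in BF16):

```python
# Back-of-envelope decode ceiling: tokens/s <= bandwidth / weight bytes.
# This ignores KV-cache reads and scheduling overhead, so measured
# numbers will come in below this bound.

def decode_ceiling_tps(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on single-stream decode throughput in tokens/sec."""
    return bandwidth_gb_s / weights_gb

print(f"B60: ~{decode_ceiling_tps(456, 8):.0f} tok/s ceiling for an 8 GB model")
```

It's only an upper bound, but it's a useful sanity check against the measured generation speeds later in the video.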
The newest Blackwell generation 5080
also has 24 gigs. And that one, even
though it's listed at $1,000, you can probably find it for $1,500 to $1,800
now. So on paper, this looks like a
pretty simple idea. A lot of VRAM for
not that much money. So, I wanted to
compare this to a couple of GPUs in the
same price range. What's available?
Well, from AMD, we've got the RX 7900 XT.
There's nothing in the Pro line from AMD
that's close to the price. And from
Nvidia, we've got the RTX Pro 2000
Blackwell. Yes, same generation as the
big brother, but this is a tiny little
one with very different specs, yet it
carries that Pro name and the price tag.
The RX 7900 XT goes in a very different direction. This one has 20 GB of VRAM, not 24 like the Intel. All of these cards will run smaller models by themselves; this one just leaves you less room for context and KV cache. However, this card has 800 GB per
second of memory bandwidth and 315 watts
of board power. So, compared to the B60,
AMD is basically giving me less memory,
but a lot more bandwidth and a lot more
power. The RTX Pro 2000 is a bit of an oddball. It has 16 GB of GDDR7, so the brand-new memory, but it only has 288 GB per second of memory bandwidth, and it uses 70 watts of power, which means you don't need extra power cables to run it. It just gets its power from the PCIe slot, but this cost me 800 bucks, so
price is up there. Now, Nvidia's angle
is almost the opposite of Intel's.
There's less VRAM, much less bandwidth,
way lower power, and a much smaller
card. So, those are the things you get
for that price range. Now, the B60 is
not trying to be the fastest GPU. It's
just trying to be the GPU that gives you
the most VRAM density for the money. And
once I stack four of them into one
server, that becomes the real question.
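"VRAM density for the money" is easy to put in numbers using the prices quoted so far (the 7900 XT's street price wasn't given, so it's left out):

```python
# Dollars per gigabyte of VRAM, using prices quoted earlier in the video.
cards = {
    "Intel Arc Pro B60": (650, 24),   # Newegg street price
    "RTX Pro 2000":      (800, 16),   # what it cost the author
    "RTX 4090":          (2000, 24),  # "cost me over 2 grand"
}
for name, (price_usd, vram_gb) in cards.items():
    print(f"{name}: ~${price_usd / vram_gb:.0f} per GB")
```

The B60 lands around $27/GB, roughly a third of a 4090's cost per gigabyte.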
Is cheap VRAM actually useful or is it
just cheap? Can we actually use this and
get good results? We're about to see. As
devs, our info ends up everywhere.
Repos, bug trackers, random API signups. That turns into a profile that
data brokers can package and resell. The
harder you are to find, the harder you
are to target. In a lot of countries
now, the law says data brokers have to
remove your info when you request it.
But doing it yourself means hunting down
hundreds of brokers, dealing with each
one, and checking back again later.
Incogni sends those removal requests
automatically and keeps following up
until they comply. My own dashboard
showed hundreds of hits connected to my
details, and most of them have already
been taken down. And when I find a
specific page exposing my info, I use
custom removals. It's easy. I submit the
link, and their team handles the
takedown and follow-ups. You can think
of it like this. Find it, remove it, and
keep it removed. For extra peace of mind, Deloitte verified their data removal
processes. It helps with broker sites
and eligible pages, but it's not for
things like official records or random
social posts. You're on your own for
those. And if you want to test it out
first, there is a 30-day money back
guarantee. Take your personal data back
with Incogni. Go to incogn.com/alexiskin
and use code alexiskin for 60% off an
annual plan. Link down below. I'm going
to kick things off with the RTX Pro 2000
comparison cuz it's already near me and
I don't need to plug anything in. It's
nice. Boom. Oh, this is just to get a
flavor of how these cards compare. So,
I'm going to use a relatively small
model, but remember it needs context.
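Context needs memory because the KV cache grows with every token kept in context. A rough sketch of the footprint, using illustrative config values for a ~4B-parameter model (not necessarily the exact model's config):

```python
# Why VRAM beyond the weights matters: weights are fixed, but KV cache
# grows with context length (and with concurrent requests).

params_b = 4          # billions of parameters
bytes_per_param = 2   # BF16
weights_gb = params_b * bytes_per_param  # ~8 GB, as quoted in the video

# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes.
# Layer/head counts below are illustrative assumptions.
layers, kv_heads, head_dim = 36, 8, 128
kv_per_token = 2 * layers * kv_heads * head_dim * 2   # bytes, BF16
ctx_gb = 32_768 * kv_per_token / 1e9                  # a 32k-token context

print(f"weights: {weights_gb} GB, KV cache at 32k ctx: {ctx_gb:.1f} GB")
```

So a "small" 8 GB model can still want several extra gigabytes once you give it a long context, which is exactly where a 24 GB card earns its keep.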
So, even the Qwen 3 4B model, we're running the full BF16 on all these machines. That one is 8 GB. It's already
half the memory of what's available on
this RTX Pro 2000. Yeah, you're not
going to be able to run huge models on
this. But this will give us a little
comparison point of how perhaps these
machines will scale. In actuality, when
you scale them out, it might be a little
different, but I don't have four RTX Pro
2000s or four of the AMD cards. I do
have four of these, and we'll get to
that. So, I'm going to kick off vLLM, and we're going to use vLLM throughout here.
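The exact launch command isn't shown on screen; a typical invocation, assuming vLLM's standard `vllm serve` CLI and a hypothetical model id, would look like this:

```python
# A plausible vLLM launch for this single-GPU test. The model id and
# flag choices are assumptions; the video doesn't show the command.
import shlex

model = "Qwen/Qwen3-4B"  # assumed model identifier
cmd = [
    "vllm", "serve", model,
    "--dtype", "bfloat16",           # full BF16, as stated in the video
    "--tensor-parallel-size", "1",   # one GPU for this first comparison
]
print(shlex.join(cmd))
```

On the multi-GPU Intel box, the same command with `--tensor-parallel-size 4` would shard the model across all four B60s; here we stay on a single card.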
I'm going to keep an eye on nvidia-smi here. We've got 70 watts maximum for this
GPU. And over here on the Intel box,
this is showing us that I have four GPUs
installed: 0, 1, 2, and 3, but we're just going to be using GPU 0 for this
test. And over here, I'll kick off the
same exact model, but using the Intel version of vLLM, and I'll get into that in a moment. Here, I'm going to run llama-benny, which is a nice tool by Yuger. You can find it on GitHub. It's really a good tool because of its flexibility; it works kind of like llama-bench, but over HTTP, so you can run it against any backend. First,
let's do concurrency of one, which means
it's going to do only one request,
simulating kind of like a chat scenario.
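At concurrency one, each benchmark request is just a single chat completion sent to the OpenAI-compatible endpoint vLLM exposes at `/v1/chat/completions`. A minimal sketch of the request body (model id and prompt are assumptions):

```python
# What a concurrency-1 run sends, one request at a time, to vLLM's
# OpenAI-compatible API. Model name and prompt are illustrative.
import json

def chat_payload(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Request body for a non-streaming /v1/chat/completions call."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": False,
    }

body = chat_payload("Qwen/Qwen3-4B", "Explain KV cache in one sentence.")
print(json.dumps(body, indent=2))
```

The benchmark then divides prompt tokens by prefill time and generated tokens by decode time to get the two throughput numbers reported below.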
And boom, there we go. You can see we
got that request right here in VLM. And
we're using 69 watts of power out of 70.
So, pretty much maxing it out. Prompt processing: 5,223 tokens per second. Nvidia is really good
at prompt processing speed, even on such
a tiny GPU. That's really impressive. 27
tokens per second for token generation.
Remember, this is a BF16 model, even
though it's a small one. Now, let's do
this against the Intel box.
What? I think I named my models
differently there. Indeed. Let's copy
that model name. And there we go. You
can see that this is only using that zeroth GPU, not the rest of them. And
we got 17% utilization. Not great. About
120 watts of power also. But look at
that: 22 GB is being used up on that machine, which is giving us all that extra cache, all that extra space for the context. That's where it's really handy to have more VRAM. How's the speed? Wow. I mean, it does have much higher bandwidth than the Nvidia GPU: 9,576 tokens per second for prompt processing, and token generation is 45 tokens per
second. Now, what happens if we change the concurrency to, say, 32? That means 32 requests at a time are being handled.
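Batching helps because each decode step streams the weights from VRAM once regardless of how many requests share it, so aggregate throughput climbs with concurrency even as each individual request slows down. A crude scaling model (the efficiency factor is purely an assumption, standing in for KV-cache bandwidth and scheduling overhead):

```python
# Illustrative batching model, not measured data: aggregate tok/s grows
# with batch size, at less than perfect scaling.

def aggregate_tps(batch: int, single_stream_tps: float,
                  efficiency: float = 0.6) -> float:
    """Aggregate tokens/sec across a batch; `efficiency` is a fudge
    factor (assumed, not measured) for per-request overhead."""
    return batch * single_stream_tps * efficiency

print(f"batch 1:  {aggregate_tps(1, 27, 1.0):.0f} tok/s total")
total = aggregate_tps(32, 27)
print(f"batch 32: {total:.0f} tok/s total ({total / 32:.1f} tok/s per request)")
```

The shape of the curve is the point: total throughput up, per-request speed down, which is what the batched run below shows.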
Send that over and that's going to the
Nvidia GPU right now. There you can see
that we got a bunch of requests at the
same time. They're all being processed.
So this is going to take a little bit
longer. 69 watts being used out of 70.
And this is the entire system. 158 watts
being used right now by this entire
computer. I mean, it's kind of not a
fair comparison because this is a very
different kind of system than this
server. This is an AMD desktop chip and this is a server-based Xeon machine. And it's done now. Now that it's done, we're down to 75, 74 watts. Okay, that makes sense. Whoa. So, the prompt processing speed went down a little bit: 1,313,