
I Tested the Cheapest Path to 96GB of VRAM

19m 39s · 3,718 words · 511 segments · English

Full Transcript

0:00

Usually, when you hear 96 GB of VRAM, you expect something absurdly expensive like this Nvidia RTX Pro 6000, which was 10 grand but is now down to $8,500. Still, this right here might be the most affordable 96 GB of VRAM you can buy in a single system right now.

0:17

The question is whether cheap VRAM is actually useful or just cheap. This server has four Intel Arc Pro B60 cards in it. Yes, Intel is continuing the Pro line, and Intel's pitch here is pretty clear: each B60 has 24 GB of GDDR6, so together that gives me 96 GB of total VRAM in one box, with 456 GB per second of memory bandwidth, which is useful for the decode phase of LLM inference. If you've been watching this channel, you know what that is. It also has about 200 W of board power.

0:51

This particular version is the Sparkle card. It's listed at $799, but I've seen it on Newegg for $650. $650 for 24 GB. Nvidia's previous-generation 4090 has 24 gigs of VRAM, and that one cost me over 2 grand.
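Why memory bandwidth matters for the decode phase: at batch size 1, every generated token has to stream the full set of weights through the GPU, so bandwidth divided by model size gives a rough ceiling on tokens per second. A minimal sketch, using the bandwidth figures quoted in the video and an illustrative 8 GB model:

```python
def decode_ceiling(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on tokens/s for memory-bound, single-stream decode:
    each token requires reading all weights once."""
    return bandwidth_gb_s / model_gb

cards = {
    "Arc Pro B60":  456,  # GB/s, as quoted in the video
    "RX 7900 XT":   800,
    "RTX Pro 2000": 288,
}

model_gb = 8  # e.g. a small model in BF16; real throughput lands below this
for name, bw in cards.items():
    print(f"{name}: <= {decode_ceiling(bw, model_gb):.0f} tok/s")
```

Real numbers come in lower (KV-cache reads and scheduling overhead aren't free), but the ranking by bandwidth usually holds for token generation.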

1:05

The newest Blackwell-generation 5080 also has 24 gigs, and even though it's listed at $1,000, you can probably find it for $1,500 to $1,800 now. So on paper, this looks like a pretty simple idea: a lot of VRAM for not that much money. So I wanted to compare this to a couple of GPUs in the same price range. What's available?
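The prices quoted so far can be reduced to a single comparable figure, dollars per gigabyte of VRAM. A quick sketch using only the numbers from the video (street prices, which vary):

```python
def usd_per_gb(price_usd: float, vram_gb: float) -> float:
    """Dollars per gigabyte of VRAM, the metric the whole comparison hinges on."""
    return price_usd / vram_gb

# (price_usd, vram_gb) as quoted in the video
cards = {
    "Intel Arc Pro B60": (650, 24),
    "RTX 4090":          (2000, 24),
    "RTX 5080 (street)": (1500, 24),
    "RTX Pro 2000":      (800, 16),
}

for name, (price, gb) in cards.items():
    print(f"{name}: ${usd_per_gb(price, gb):.0f}/GB")
```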

1:22

Well, from AMD, we've got the RX 7900 XT; there's nothing in AMD's Pro line that's close to this price. And from Nvidia, we've got the RTX Pro 2000 Blackwell. Yes, it's the same generation as its big brother, but this is a tiny little card with very different specs, yet it carries that Pro name and the price tag.

1:43

The RX 7900 XT goes in a very different direction. This one has 20 GB of VRAM, not 24 like the Intel. All of these cards will run smaller models by themselves; this one will just leave you less room for context and KV cache. However, this card has 800 GB per second of memory bandwidth and 315 watts of board power. So compared to the B60, AMD is basically giving me less memory but a lot more bandwidth and a lot more power.

2:12

The RTX Pro 2000 is a bit of an oddball. It has 16 GB of GDDR7, the brand-new memory, but only 288 GB per second of memory bandwidth, and it uses 70 W of power, which means you don't need extra power cables to run it; it just gets its power from the PCIe bus. But it cost me 800 bucks, so the price is up there. Nvidia's angle is almost the opposite of Intel's.
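The "less room for context and KV cache" tradeoff can be made concrete. As a sketch, assuming a hypothetical model shape of 36 layers, 8 KV heads, and a head dimension of 128 (none of these numbers come from the video), the per-token KV-cache cost in FP16 is:

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elt: int = 2) -> int:
    """Per-token KV-cache size: one K and one V vector per layer per KV head,
    at bytes_per_elt bytes each (2 for FP16/BF16)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elt

# Hypothetical model shape, for illustration only
per_token = kv_bytes_per_token(36, 8, 128)   # 147,456 bytes, ~144 KiB/token
ctx_32k_gib = per_token * 32768 / 2**30      # 4.5 GiB for a 32k-token context
print(per_token, ctx_32k_gib)
```

At that rate, the 4 GB gap between a 20 GB and a 24 GB card is roughly an extra 28k tokens of context, which is why the extra VRAM matters even when the model itself fits everywhere.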

2:37

There's less VRAM, much less bandwidth, way lower power, and a much smaller card. So those are the things you get in that price range. Now, the B60 is not trying to be the fastest GPU; it's just trying to be the GPU that gives you the most VRAM density for the money. And once I stack four of them into one server, that becomes the real question.

2:55

Is cheap VRAM actually useful, or is it just cheap? Can we actually use this and get good results? We're about to see.

3:01

As devs, our info ends up everywhere: repos, bug trackers, random API signups. That turns into a profile that data brokers can package and resell. The harder you are to find, the harder you are to target. In a lot of countries now, the law says data brokers have to remove your info when you request it, but doing it yourself means hunting down hundreds of brokers, dealing with each one, and checking back again later. Incogni sends those removal requests automatically and keeps following up until they comply. My own dashboard showed hundreds of hits connected to my details, and most of them have already been taken down. And when I find a specific page exposing my info, I use custom removals. It's easy: I submit the link, and their team handles the takedown and follow-ups. You can think of it like this: find it, remove it, and keep it removed. For extra peace of mind, Deloitte verified their data removal processes. It helps with broker sites and eligible pages, but it's not for things like official records or random social posts; you're on your own for those. And if you want to test it out first, there is a 30-day money-back guarantee. Take your personal data back with Incogni. Go to incogni.com/alexiskin and use code alexiskin for 60% off an annual plan. Link down below.

4:10

I'm going to kick things off with the RTX Pro 2000 comparison, because it's already near me and I don't need to plug anything in. It's nice. Boom. Oh, this is just to get a flavor of how these cards compare. So I'm going to use a relatively small model, but remember, it needs context.

4:25

So we're running even the Qwen3 4B model in full BF16 on all these machines. That one is 8 GB, which is already half the memory of what's available on this RTX Pro 2000. Yeah, you're not going to be able to run huge models on this, but it will give us a little comparison point for how these machines might scale. In actuality, when you scale them out, it might be a little different, but I don't have four RTX Pro 2000s or four of the AMD cards. I do have four of these, and we'll get to that. So I'm going to kick off vLLM, and we're going to use vLLM throughout here.
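The 8 GB figure follows directly from the parameter count: BF16 stores two bytes per parameter, so a 4B-parameter model needs about 8 GB for weights alone, before any KV cache or activations. A quick sanity check (the 4B count is my reading of the model name, not a number stated on screen):

```python
def weights_gb(params_billion: float, bytes_per_param: float = 2) -> float:
    """Approximate weight memory in decimal GB.
    Excludes KV cache and activations, which need headroom on top."""
    return params_billion * bytes_per_param

print(weights_gb(4))       # BF16: 2 bytes/param -> 8 GB, half of a 16 GB card
print(weights_gb(4, 0.5))  # the same model at 4-bit would be ~2 GB
```

This is why running full BF16 is the stress test here: a quantized copy of the same model would fit comfortably on every card in this comparison.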

4:57

I'm going to keep an eye on nvidia-smi here. We've got a 70 W maximum for this GPU. And over here on the Intel box, this is showing that I have four GPUs installed, 0, 1, 2, and 3, but we're just going to be using GPU 0 for this test. And over here, I'll kick off the exact same model, but using the Intel version of vLLM, and I'll get into that in a moment.

5:18

Here, I'm going to run Llama Beni, which is a nice tool by Yuger; you can find it on GitHub. It's a really good tool because of its flexibility: it works kind of like llama-bench, but across HTTP, and you can run it against any back end. First, let's do a concurrency of one, which means it's going to do only one request at a time, simulating kind of like a chat scenario.
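The benchmarking pattern used here, N simultaneous requests against an HTTP inference server, then aggregate tokens per second, can be sketched as follows. This is my illustration of the idea, not the actual tool; the stubbed `send_request` stands in for a real call to the server's OpenAI-compatible completions endpoint:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def bench(send_request, concurrency: int, n_requests: int) -> float:
    """Fire n_requests with up to `concurrency` in flight at once;
    return aggregate generated tokens per second.

    `send_request` performs one completion call and returns the number
    of tokens it generated."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        total_tokens = sum(pool.map(lambda _: send_request(), range(n_requests)))
    return total_tokens / (time.perf_counter() - start)

# Stub standing in for a real HTTP request to the vLLM server:
fake = lambda: 128  # pretend every request generated 128 tokens
chat_like  = bench(fake, concurrency=1,  n_requests=4)   # one at a time
batched    = bench(fake, concurrency=32, n_requests=64)  # server under load
```

Concurrency 1 approximates a single chat session; concurrency 32 measures how well the server batches, which is where per-request speed drops but aggregate throughput climbs.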

5:36

And boom, there we go. You can see we got that request right here in vLLM, and we're using 69 watts of power out of 70, so we're pretty much maxing it out. Prompt processing: 5,223 tokens per second. Nvidia is really good at prompt processing speed, even on such a tiny GPU; that's really impressive. Token generation: 27 tokens per second. Remember, this is a BF16 model, even though it's a small one. Now, let's do this against the Intel box.

6:06

What? I think I named my models differently there. Indeed. Let's copy that model name. And there we go. You can see that this is only using the zeroth GPU, not the rest of them, and we've got 17% utilization. Not great. About 120 watts of power, too. But look at that: 22 GB is being used on that machine, which gives us all that extra cache, all that extra space for the context. That's where it's really handy to have more VRAM.

6:32

How's the speed? Whoa. I mean, it does have higher bandwidth, much higher bandwidth than the Nvidia GPU: 9,576 tokens per second for prompt processing, and token generation is 45 tokens per second. Now, what happens if we change the concurrency to, say, 32? That means 32 requests at a time are being handled.

6:53

Send that over, and that's going to the Nvidia GPU right now. There you can see that we got a bunch of requests at the same time, and they're all being processed, so this is going to take a little bit longer. 69 watts being used out of 70. And for the whole system, 158 watts is being drawn right now by this entire computer. I mean, it's kind of an unfair comparison, because this is a very different kind of system than this server: this is an AMD desktop chip, and that is a server-based Xeon machine. And it's done now. Now that it's done, we're down to 75, 74. Okay, that makes sense. Whoa. So, the prompt processing speed went down a little bit: 1,313,
