Intel just CRUSHED Nvidia & AMD GPU pricing

25m 18s · 4,450 words · 641 segments · English

Full transcript

0:00

This cute little guy is the Intel Arc

0:02

Pro B50. This is its bigger brother, the

0:05

Arc Pro B70. It just came out. Now, the

0:08

B50 didn't need any extra power cables.

0:10

It got all its power straight from the

0:12

PCIe bus. Nice, simple, civilized. The

0:15

B70 is a little bit less civilized. This

0:18

one needs power, but it also brings 32

0:21

GB of VRAM and comes in under $1,000.

0:25

And because local AI tends to turn into

0:27

a hardware addiction, I'm plugging in

0:29

four of them. That gives me 128 gigs of

0:32

total VRAM, which means I'm going to

0:33

need a computer that can actually power

0:35

all four, which is exactly the kind of

0:37

sentence Local AI makes you say out

0:40

loud. Now, I already did a video on the

0:42

B50 last year, and it was the most bang

0:44

for the buck GPU you can get in 2025.

0:46

So, now I want to know if the B70 could

0:49

do the same thing for Local AI in 2026.

0:52

Because just for comparison, a single

0:54

RTX 5090

0:56

from Nvidia also has 32 GB of VRAM, but

0:59

that comes in at just under $4,000 at

1:02

this point. So, yeah, we're about to

1:04

find out. So, of course, to do this

1:06

comparison properly, I did what any

1:08

reasonable person would do. I bought

1:10

more GPUs. This is the Nvidia RTX Pro

1:12

4000 Blackwell. Check out that price

1:14

tag right there. Not flexing, just

1:16

saying that this retails at two times

1:19

higher than this card and it's 24 gigs

1:21

of memory, not 32. However, it is GDDR7,

1:26

so it's the newer VRAM. And it has 672

1:29

GB per second of memory bandwidth. It's

1:32

also very skinny, just like my wallet.

1:37

Wow, I can see you through the fan. The

1:40

comparison physically. The RTX Pro 4000

1:42

is a much skinnier card and it also

1:44

gives you the four DisplayPort ports.

1:46

So, that's a single card slot with

1:48

pretty impressive bandwidth and very

1:50

impressive price, too. By the way, I got

1:52

it at Micro Center for $1,699, not $1,999.

1:55

Not sponsored by Micro Center, but still a

1:57

lot of money. Next, I got AMD's Radeon

1:59

AI R9700.

2:03

And up until now, this was the only GPU

2:06

you can get

2:08

with 32 gigs of memory for about $1,300.
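The price-versus-VRAM comparison so far can be put in rough numbers. This is a minimal Python sketch, assuming the prices quoted in the video are treated as round figures: the B70 is entered at its "under $1,000" ceiling and the RTX 5090 at its "just under $4,000" ceiling, so both are upper bounds rather than exact street prices. The names `cards` and `dollars_per_gb` are illustrative only.

```python
# Prices (USD) and VRAM (GB) as quoted in the video.
# The B70 and RTX 5090 figures are rounded-up ceilings, not exact prices.
cards = {
    "Intel Arc Pro B70": (1000, 32),
    "AMD Radeon AI R9700": (1300, 32),
    "Nvidia RTX Pro 4000": (1699, 24),
    "Nvidia RTX 5090": (4000, 32),
}


def dollars_per_gb(price_usd: float, vram_gb: int) -> float:
    """Return the cost of one gigabyte of VRAM for a card."""
    return price_usd / vram_gb


cost_per_gb = {
    name: round(dollars_per_gb(price, vram), 2)
    for name, (price, vram) in cards.items()
}

# Print the cards from cheapest to most expensive per gigabyte.
for name, cost in sorted(cost_per_gb.items(), key=lambda item: item[1]):
    print(f"{name}: ${cost:.2f} per GB of VRAM")
```

Cost per gigabyte says nothing about memory bandwidth or software support, which the rest of the video goes on to compare.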

2:12

This is 32 gigs of GDDR6 and 640 GB per second of memory

2:16

bandwidth, so not as high as the Nvidia

2:18

one. So, they match on VRAM, but this

2:21

card is about $350 more per GPU. Now,

2:25

the B70 comes in at the lowest price.

2:27

However, it also has the lowest memory

2:29

bandwidth at 608 GB per second. So yeah, this has been

2:31

pretty heavy on the wallet, which is why

2:34

I'm thankful to the sponsor of this

2:36

video. So these days, I'm always

2:37

flipping between models. GPT for

2:40

research, Claude for coding, Nano Banana

2:42

for image generation, Veo, Kling, and

2:44

Runway for video: six tabs, six bills,

2:47

and counting. Enter ChatLLM Teams. One

2:50

dashboard houses every top LLM and RouteLLM

2:53

picks the right one. GPT Mini for

2:55

ultra fast answers, Claude Sonnet for

2:57

coding, Gemini Pro for massive context.

3:00

They recently added Gemini 3 and GPT-5.1

3:03

the moment they dropped. Create

3:04

professional presentations with graphs,

3:06

charts, and deep research detailed

3:08

content. Need human sounding copy?

3:10

Humanize rewrites text to defeat AI

3:12

detectors. Need visuals? Pick frontier

3:14

or open-source models: Nano Banana,

3:17

Midjourney, and Flux for images, Magnific

3:19

upscaling, plus Veo, Wan, and Sora for video,

3:23

all built in. You also get Abacus AI Deep

3:25

Agent to pretty much do anything: build

3:27

full stack apps, websites, reports with

3:30

just text prompts and deploy them on the

3:32

spot. They have Abacus AI Desktop,

3:34

which is the brand new coding editor and

3:36

assistant that lets you vibe code and

3:38

build production-ready apps. And the

3:40

kicker, it's just $10 a month, less than

3:42

one premium model. Head over to

3:44

chatlm.abacus.ai

3:46

or click the link below to level up with

3:48

ChatLLM Teams. I just finished testing

3:51

the B60 and depending on the software

3:53

stack that you're running, you get very

3:55

different results. Here's SYCL, which

3:57

is running llama.cpp. You can run

4:00

llama.cpp and that'll utilize either SYCL,

4:02

which is an Intel-specific stack. It

4:05

gives you really good performance there.

4:06

By the way, this is the Qwen3 4B model

4:09

at Q4_K_M

4:11

quantization. We're getting 1,0 tokens

4:13

per second for prompt processing here.

4:15

And since these are professional level

4:16

cards, it's also a good idea to test

4:18

them with a higher concurrency. So, this

4:20

is concurrency of one, which kind of

4:22

simulates chatting with the thing,

4:24

right? But if you have a concurrency of

4:26

four, which is leaning towards more of an

4:29

agentic workflow or multiple users at

4:32

the same time, then we come down to 898

4:34

tokens per second here. And that just

4:35

shows that llama.cpp is not the best for

4:40

higher concurrency throughput. Now, for

4:42

token generation, we're getting 66

4:44

tokens per second here for C1. And for

4:47

C4, which is concurrency 4, we're

4:49

getting 83. So just a little bit higher

4:52

than your single. And of course, I did

4:54

llama.cpp with Vulkan, which is a

4:57

cross-platform approach. And Vulkan did

4:59

better in certain scenarios, like for

5:01

example, prompt processing, which is

5:03

kind of surprising. 1,162 tokens per

5:06

second there for single. But SYCL did

5:08

do better for token generation, 66

5:10

tokens per second versus Vulkan at 44.

5:13

However, the best performance

5:16

we got was from vLLM, of course. Look at

5:19

that huge difference right there. This

5:21

is meant to run on professional GPUs

5:23

like this with high concurrency and

5:25

throughput. And this is the 4-bit AWQ

5:28

quantization for vLLM. 8,118 tokens per

5:31

second here for concurrency of one. Token

5:34

generation: 67. So not that much higher

5:36

for single concurrency for token

5:39

generation than SYCL. But look at the

5:40

scaling when it comes down to

5:42

concurrency of four. 215 tokens per

5:45

second for token generation. All right,

5:46

that's just a review of the B60. Let's

5:48

see what happens when we do the B70

5:51

along with these other ones. So, I'm

5:52

going to kick things off with one B70

5:55

and this Nvidia 4000. I do like that one

5:58

slot feel. Ah, that fresh new GPU smell.
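The B60 token-generation figures quoted a moment ago show how differently the two stacks scale with concurrency. The sketch below only does arithmetic on those four numbers (66 and 83 tokens per second for llama.cpp with SYCL, 67 and 215 for vLLM with 4-bit AWQ). The dictionary and function names are made up for illustration, and "efficiency" here simply means the speedup divided by the ideal 4x.

```python
# Token-generation throughput (tokens/s) on the B60 as quoted in the video:
# (concurrency 1, concurrency 4).
b60_token_generation = {
    "llama.cpp (SYCL)": (66, 83),
    "vLLM (4-bit AWQ)": (67, 215),
}


def scaling_factor(c1_tps: float, c4_tps: float) -> float:
    """How many times faster aggregate generation is at concurrency 4."""
    return c4_tps / c1_tps


def scaling_efficiency(c1_tps: float, c4_tps: float, concurrency: int = 4) -> float:
    """Fraction of ideal linear scaling actually achieved."""
    return scaling_factor(c1_tps, c4_tps) / concurrency


for stack, (c1, c4) in b60_token_generation.items():
    factor = scaling_factor(c1, c4)
    efficiency = scaling_efficiency(c1, c4)
    print(f"{stack}: {factor:.2f}x speedup, {efficiency:.0%} of ideal 4x")
```

On these numbers llama.cpp gains roughly a quarter more throughput at four concurrent requests, while vLLM a little more than triples it.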

6:02

nvidia-smi. We've got 145 watts

6:06

available on this thing and 24 GB. So, I

6:09

kicked off vLLM cuz I'm not doing

6:11

llama.cpp at this point. vLLM is the way to go

6:14

on these kinds of machines and these

6:16

kinds of GPUs. I'm running both of these

6:17

now and I'm pointing to Qwen3 4B. This

6:20

is the full BF16 version. And the way

6:22

vLLM likes to work is it takes up as much

6:26

memory as possible. So, it's going to

6:27

fill up all 24 GB on this board and all

6:31

32 GB on that board just because it

6:33

likes to have extra room for KV cache.
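To get a feel for why the extra memory matters, here is a rough Python estimate of how many tokens of KV cache fit on each card once the BF16 weights are loaded. Everything numeric here is an assumption to verify rather than something stated in the video: the model shape (36 layers, 8 KV heads, head dimension 128) is what the Qwen3 4B config is believed to use, the weights are taken as about 8 GB (4 billion parameters at 2 bytes each), and the 0.9 memory fraction mirrors vLLM's `gpu_memory_utilization` setting. Activation and framework overhead are ignored, so real capacity will be lower.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token needs: a key and a value vector per layer."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def kv_token_capacity(vram_gb: float, weights_gb: float, utilization: float, bytes_per_token: int) -> int:
    """Tokens of KV cache that fit in the memory left after loading the weights."""
    budget_bytes = vram_gb * 1e9 * utilization - weights_gb * 1e9
    return int(budget_bytes // bytes_per_token)


# Assumed Qwen3 4B shape (check against the published model config).
per_token = kv_bytes_per_token(layers=36, kv_heads=8, head_dim=128)

# Assumed: ~8 GB of BF16 weights, 90% of VRAM handed to the server.
capacity_24gb = kv_token_capacity(24, 8, 0.9, per_token)
capacity_32gb = kv_token_capacity(32, 8, 0.9, per_token)

print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"24 GB card: ~{capacity_24gb:,} tokens of KV cache")
print(f"32 GB card: ~{capacity_32gb:,} tokens of KV cache")
```

Under these assumptions the 32 GB card has noticeably more room for cached context, which is what allows more concurrent requests or longer prompts before the server has to evict or queue.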

6:35

And I'm using a tool called llama-benchy

6:37

by eugr over here. It's an open source

6:39

project and it's a really nice tool for

6:41

doing this kind of benchmarking. Don't

6:42

confuse this with just llama-bench. This

6:45

is llama-benchy. All right, it's

6:47

different. So, what's different about it

6:50

specifically is in the README, so you

6:52

can go read it. But just to give you a

6:54

brief overview is first of all, it's

6:55

going to work with any kind of server,

6:57

not just llama.cpp, including vLLM. And

7:00

second, it allows me to prefill the

7:01

context so that we can test filled

7:04

contexts also, not just empty context.
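What a concurrency benchmark with a prefilled context measures can be sketched in a few lines of Python. This is not how the tool mentioned above is implemented. It is a minimal illustration that fires several requests at once and reports aggregate tokens per second. `fake_generate` is a hypothetical stand-in for a real server call, and the filler text used to pad the prompt is arbitrary.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def measure_throughput(
    generate: Callable[[str], Awaitable[int]],
    prompt: str,
    concurrency: int,
) -> float:
    """Run `concurrency` requests at once; return aggregate tokens per second."""
    start = time.perf_counter()
    token_counts = await asyncio.gather(*(generate(prompt) for _ in range(concurrency)))
    elapsed = time.perf_counter() - start
    return sum(token_counts) / elapsed


async def fake_generate(prompt: str) -> int:
    """Hypothetical stand-in for a server request: pretends to emit 64 tokens."""
    await asyncio.sleep(0.05)
    return 64


def run_benchmark(concurrency: int, prefill_words: int = 0) -> float:
    """Benchmark with an optional block of filler text prepended to the prompt."""
    prefill = "lorem " * prefill_words  # crude way to start from a non-empty context
    prompt = prefill + "Summarize the text above."
    return asyncio.run(measure_throughput(fake_generate, prompt, concurrency))


if __name__ == "__main__":
    for c in (1, 4):
        print(f"concurrency {c}: {run_benchmark(c, prefill_words=2000):.0f} tokens/s")
```

To test a real server, `fake_generate` would be replaced by a coroutine that sends the prompt to the server's HTTP endpoint and returns the number of tokens it produced.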

7:07

And that is very useful. All right, here

7:09

we go. We're going to do a little race,

7:10

but it's not going to be instant because

7:12

I got it on two different windows. Boom.

7:13

And boom. And I

7:17

It's funny the sounds that these things

7:18

make cuz they all have coil whine. And

7:21

also the coil whine is different based on

7:23

what model you're running and what

7:24

concurrency you're running. And probably

7:26

other parameters will affect it too. So

7:28

if you have a really really keen

7:30

hearing, you'll be able to tell me

7:32

exactly what model is running, how many

7:35

concurrent requests we're processing.

7:37

I'm just kidding. But maybe AI will be

7:39

able to tell the difference. And here we

7:40

go. We got our first results. 56 tokens

7:43

per second token generation on the B70

7:47

compared to 51 over here. So the B70

7:50

beats it out just by a little bit. Ooh,

7:53

that's hot. Um, also prompt processing
