Intel just CRUSHED Nvidia & AMD GPU pricing
This cute little guy is the Intel Arc
Pro B50. This is its bigger brother, the
Arc Pro B70. It just came out. Now, the
B50 didn't need any extra power cables.
It got all its power straight from the
PCIe bus. Nice, simple, civilized. The
B70 is a little bit less civilized. This
one needs power, but it also brings 32
GB of VRAM and comes in under $1,000.
And because local AI tends to turn into
a hardware addiction, I'm plugging in
four of them. That gives me 128 gigs of
total VRAM, which means I'm going to
need a computer that can actually power
all four, which is exactly the kind of
sentence local AI makes you say out
loud. Now, I already did a video on the
B50 last year, and it was the most
bang-for-the-buck GPU you could get in
2025. So now I want to know if the B70
can do the same thing for local AI in
2026. Because, just for comparison, a
single RTX 5090 from Nvidia also has 32
GB of VRAM, but that comes in at just
under $4,000 at this point. So, yeah,
we're about to
find out. So, of course, to do this
comparison properly, I did what any
reasonable person would do. I bought
more GPUs. This is the Nvidia RTX Pro
4000 Blackwell. Well, check out that
price tag right there. Not flexing,
just saying that this retails at about
two times the price of the B70, and
it's 24 gigs of memory, not 32.
However, it is GDDR7, so it's the newer
VRAM, and it has 672 GB per second of
memory bandwidth. It's also very
skinny, just like my wallet. Wow, I can
see through the fan. Here's the
physical comparison: the RTX Pro 4000
is a much skinnier card, and it also
gives you four DisplayPort outputs. So
that's a single-slot card with pretty
impressive bandwidth and a very
impressive price, too. By the way, I
got it at Micro Center for $1,699, not
$1,999. Not sponsored by Micro Center,
but still a lot of money. Next, I got
AMD's Radeon AI Pro R9700. And up until
now, this was the only GPU you could
get with 32 gigs of memory, for about
$1,300. This is 32 gigs of GDDR6 with
640 GB per second of memory bandwidth,
so not as high as the Nvidia one. So
they match on VRAM, but this card is
about $350 more per GPU. Now, the B70
comes in at the lowest price. However,
it also has the lowest memory bandwidth
at 68. So yeah, this has been
pretty heavy on the wallet, which is why
I'm thankful to the sponsor of this
video. So, these days I'm always
flipping between models: GPT for
research, Claude for coding, Nano
Banana for image generation, Veo,
Kling, and Runway for video. Six tabs,
six bills, and counting. Enter ChatLLM
Teams. One dashboard houses every top
LLM, and RouteLLM picks the right one:
GPT Mini for ultra-fast answers, Claude
Sonnet for coding, Gemini Pro for
massive context. They recently added
Gemini 3 and GPT-5.1 the moment they
dropped. Create professional
presentations with graphs, charts, and
detailed deep-research content. Need
human-sounding copy? Humanize rewrites
text to defeat AI detectors. Need
visuals? Pick frontier or open-source
models: Nano Banana, Midjourney, and
Flux for images; Magnific upscaling;
plus Veo, Wan, and Sora for video, all
built in. You also get Abacus.AI's
DeepAgent to do pretty much anything:
build full-stack apps, websites, and
reports with just text prompts and
deploy them on the spot. They also have
Abacus.AI Desktop, the brand-new coding
editor and assistant that lets you
vibe-code and build production-ready
apps. And the kicker: it's just $10 a
month, less than one premium model.
Head over to chatllm.abacus.ai or click
the link below to level up with ChatLLM
Teams. I just finished testing
the B60, and depending on the software
stack you're running, you get very
different results. Here's SYCL, which
is running on llama.cpp. You can run
llama.cpp and it'll use either SYCL,
which is an Intel-specific stack that
gives you really good performance, or
Vulkan, which is cross-platform. By the
way, this is the Qwen3 4B model at
Q4_K_M quantization. We're getting
around 1,000 tokens per second for
prompt processing here.
And since these are professional-level
cards, it's also a good idea to test
them with higher concurrency. So this
is concurrency of one, which kind of
simulates chatting with the thing,
right? But if you have a concurrency of
four, which leans more toward an
agentic workflow or multiple users at
the same time, then we come down to 898
tokens per second here. And that just
shows that llama.cpp is not the best
for higher-concurrency throughput.
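To reproduce this kind of concurrency test, here's a minimal sketch, not the harness used in the video, that fires several simultaneous requests at any OpenAI-compatible endpoint (llama.cpp's server, vLLM, and so on) and reports aggregate generation throughput. The URL, model name, and prompt are all placeholders.

```python
# Minimal sketch of a concurrency test against an OpenAI-compatible server.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8000/v1/completions"  # placeholder endpoint
CONCURRENCY = 4
MAX_TOKENS = 256

def one_request(_):
    r = requests.post(URL, json={
        "model": "qwen3-4b",  # placeholder model name
        "prompt": "Write a haiku about VRAM.",
        "max_tokens": MAX_TOKENS,
    })
    r.raise_for_status()
    # OpenAI-compatible servers report token counts in the usage field.
    return r.json()["usage"]["completion_tokens"]

start = time.time()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    tokens = sum(pool.map(one_request, range(CONCURRENCY)))
elapsed = time.time() - start
print(f"C{CONCURRENCY}: {tokens / elapsed:.1f} tokens/s aggregate")
```

Running it with CONCURRENCY set to 1 and then 4 gives you the C1-versus-C4 comparison from the video.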
Now, for token generation, we're
getting 66 tokens per second here for
C1. And for C4, which is concurrency
four, we're getting 83, so just a
little bit higher than single. And of
course, I also ran llama.cpp with
Vulkan, which is the cross-platform
approach. Vulkan did better in certain
scenarios, like prompt processing,
which is kind of surprising: 1,162
tokens per second there for single. But
SYCL did do better for token
generation: 66 tokens per second versus
Vulkan at 44.
However, the best performance we got
was from vLLM, of course. Look at that
huge difference right there. vLLM is
meant to run on professional GPUs like
these with high concurrency and
throughput. And this is the 4-bit AWQ
quantization for vLLM: 8,118 tokens per
second here for concurrency of one, and
token generation at 67. So, not that
much higher than SYCL for
single-concurrency token generation.
But look at the scaling when it comes
to concurrency of four: 215 tokens per
second for token generation.
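Here's a minimal sketch of running an AWQ-quantized model through vLLM's offline Python API with a batch of prompts. The model ID is an assumption; substitute whatever AWQ checkpoint you're actually using.

```python
# Minimal sketch: serve an AWQ-quantized model with vLLM and batch several
# prompts at once, which is where vLLM's concurrency scaling comes from.
from vllm import LLM, SamplingParams

# Model ID is an assumption -- swap in your own AWQ checkpoint.
llm = LLM(model="Qwen/Qwen3-4B-AWQ", quantization="awq")

params = SamplingParams(max_tokens=128)
prompts = ["Summarize PCIe power delivery."] * 4  # 4 prompts ~ concurrency 4

for out in llm.generate(prompts, params):
    print(out.outputs[0].text[:80])
```

vLLM's continuous batching is what lets those C4 numbers scale so much better than llama.cpp's.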
All right, that's just a review of the
B60. Let's see what happens when we run
the B70 along with these other ones.
So, I'm going to kick things off with
one B70 and this Nvidia RTX Pro 4000. I
do like that one-slot feel. Ah, that
fresh new GPU smell.
Running nvidia-smi, we've got 145 watts
available on this thing and 24 GB. So,
I kicked off vLLM, because I'm not
doing llama.cpp at this point; vLLM is
the way to go on these kinds of
machines and these kinds of GPUs. I'm
running both of these now, and I'm
pointing to Qwen3 4B. This is the full
BF16 version. And the way vLLM likes to
work is it takes up as much memory as
possible. So it's going to fill up all
24 GB on this board and all 32 GB on
that board, just because it likes to
have extra room for KV cache.
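That preallocation is tunable. A minimal sketch, assuming vLLM's standard engine arguments (the model ID is a placeholder):

```python
# The fraction of GPU memory vLLM preallocates for weights plus KV cache is
# controlled by gpu_memory_utilization; max_model_len caps KV-cache size.
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen3-4B",         # placeholder model ID
    dtype="bfloat16",              # the full BF16 weights, as in the video
    gpu_memory_utilization=0.80,   # leave some VRAM headroom (default is 0.9)
    max_model_len=8192,            # shorter max context -> smaller KV cache
)
```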
And I'm using a tool called Llama Beni
by Yugger over here. It's an
open-source project, and it's a really
nice tool for doing this kind of
benchmarking. Don't confuse it with
llama-bench; this is Llama Beni. All
right, it's different. What's different
about it specifically is in the README,
so you can go read it. But to give you
a brief overview: first of all, it
works with any kind of server, not just
llama.cpp, including vLLM. And second,
it lets me prefill the context, so we
can test filled contexts too, not just
an empty context. And that is very
useful.
All right, here we go. We're going to
do a little race, but it's not going to
be instant, because I've got it on two
different windows. Boom. And boom. It's
funny, the sounds these things make,
because they all have coil whine. And
the coil whine is different based on
what model you're running and what
concurrency you're running, and
probably other parameters affect it
too. So if you have really keen
hearing, you'll be able to tell me
exactly what model is running and how
many concurrent requests we're
processing. I'm just kidding. But maybe
AI will be able to tell the difference.
And here we go, we've got our first
results: 56 tokens per second of token
generation on the B70, compared to 51
over here. So the B70 beats it out just
by a little bit. Ooh, that's hot. Also,
prompt processing