Your Mac Has Hidden VRAM… Here's How to Unlock It
FULL TRANSCRIPT
If your Apple silicon machine, like a
MacBook or a little Mac Mini, doesn't
have much RAM, and you still want to run
large language models that are decently
sized, you can download more RAM. You
can't download more RAM, but there is a
trick you can do to allow your system to
use more memory. Here, I've got the base
model M4 MacBook Air, and it only has 16
GB of unified memory. So, when I open up
LM Studio and I try to load GPT-OSS
20B, a pretty popular model, look at
that. It's 11.28 GB on disk, which
should be okay for a 16 GB machine,
right? Let's load it up. Load model.
There it goes. It's trying, desperately
trying to load that model. Oh, what
happened here? Oh no, failed to load the
model. Oh, please check the settings and
try loading it again. That's not going
to help. You can't just check the
settings without doing anything and try
loading it again. That's like the
definition of insanity. You need to make
some changes. So, let's take a look at
LM Studio. You can go to settings down
here and if you go to the hardware tab,
it'll show you what you have available.
There you go. Apple M4 RAM is 16 GB and
VRAM is 11.84 GB. So, LM Studio actually
tells you you can't use all of 16 GB.
There's some memory that needs to be
allocated for other tasks like your
operating system and other things you're
running. There are things like the kernel,
I/O buffers, WindowServer, the GPU driver, the
compressed memory pool, and all the
background tasks. You can see some of
these actually running. If you go to
Activity Monitor and take a look, there
they are. Hey, look at that. They're
using some memory. Now, macOS is pretty
good at handling this stuff, putting
aside what's not being used, compressing
the rest, and things like that. But there
is a limit to it, right? If you're
trying to load a model that's going to
be bigger than this 11.84 GB, with
context that is. Because when you start to
load GPT-OSS 20B, estimated memory usage
is 9.26 GB and the total is 12.34 GB. So, we need a
little bit more, don't we? I'm going to
show you how to download this RAM. All
right, enough of that joke. I'm not
going to show you how to download RAM.
I'm going to show you this command here
right now. If you take a look at this
command, sudo, which means you're going
to need to execute it from an admin
account.
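The command itself isn't spelled out in the transcript, only shown on screen, but the one that matches this description on modern Apple silicon macOS is the iogpu.wired_limit_mb sysctl. Treat the key name as my assumption from the on-screen command, not something stated in the video:

```shell
# Read the current GPU wired-memory limit, in MiB.
# 0 means macOS picks the default on its own.
# Assumed key: iogpu.wired_limit_mb (Apple silicon macOS).
sudo sysctl iogpu.wired_limit_mb

# Set the limit to 8192 MiB (8 GiB). Takes effect immediately,
# but resets back to the default on reboot.
sudo sysctl iogpu.wired_limit_mb=8192
```

Note the value is in MiB, and the setting does not survive a reboot, so you'd re-run it after each restart.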
>> Hey, mom. Some guy on the internet told
me to run a weird command on my
computer.
>> Okay, just be careful, dear.
>> It's like, hey, Mac, what's the maximum
amount of memory the GPU is allowed to
lock and keep for itself? Right, kid?
>> Huh? And then this command will show you
what this limit is set to right now. You
need to enter your password. And usually
it's zero by default, and that's the
default setting, but you can change
that. 8192. Why the weird number, Alex?
Well, these all have to be powers of
two. Like memory, it goes 2, 4, 8, 16,
and it goes up from there. 8192 is 4096
times 2. These are all powers of two, and
it's used all throughout computing.
That's how things just work really well.
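A quick sanity check on that arithmetic, using nothing beyond the numbers from the video:

```shell
# 8192 is a power of two: 2^13
echo $((2 ** 13))      # prints 8192

# and it's exactly double 4096
echo $((4096 * 2))     # prints 8192
```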
So, I'm going to set that. And now, if I
run that initial command, the check, not
the set, it shows 8192. Let's restart LM Studio.
What do you think is going to happen
here? I've just set the memory limit in
megabytes to basically half of 16 GB.
So, if I check hardware, look at that.
VRAM is now 8 GB instead of 11.84. So, it's
even less. You get where I'm going here,
right? I'm just showing you that
basically you can alter it. You can make
it less, you can make it more. And LM
Studio is a good tool to show you it
graphically, which is pretty cool. So
now, let's go a little higher so we can
actually load that model. Well, should
we set it to 16 GB? No, because if you
do, you're going to break things. Why
the heck not? You know what? This
channel is for that kind of thing. I
break things so you don't have to. I'm
going to set it to 16,000 and this is
probably not going to be good. Yeah,
it's set to 16,000. Let's quit LM
Studio. Hey, maybe it'll work. I don't
know. Let's see. Things are still
working and snappy. Let's check
hardware. VRAM is set to 15.63 GB. Um,
I'm pretty sure that's not a good idea.
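That 15.63 GB figure is just the 16,000 MiB limit converted to GiB, the same way the earlier 8192 became 8 GB. A quick check of the conversion (using awk for the fractional division):

```shell
# LM Studio reports the limit in GiB; the sysctl takes MiB.
# 16000 MiB / 1024 = 15.625 GiB, which LM Studio rounds to 15.63 GB.
awk 'BEGIN { printf "%.3f\n", 16000 / 1024 }'   # prints 15.625

# And the earlier 8192 MiB is exactly 8 GiB:
echo $((8192 / 1024))                           # prints 8
```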
This is probably not very safe, but I'm
going to go ahead and try to load this
20 billion parameter model. Load. Let's
see what's happening here. Let's take a
look at Activity Monitor and the
memory pressure. The memory used is
going up. And there it goes. It's going
up. Oh boy. This is making me a little
nervous here. Look at that. It's orange
now. And now it's going back down. And
it loaded. It loaded.
What? Okay, I'm going to keep an eye on
that little window there while I try to
run a prompt. New chat. Write a story.
Boom. It doesn't matter what the prompt
is. It's just going to generate text no
matter what. And it's working. That's
crazy. The memory used is 15.65 GB. The
memory pressure is crazy high. It's not
in the red though. Sometimes it turns
red and that's really not good. But this
is actually working. Do I recommend that
you set that setting? No. You want to
leave a little bit of room unless this
is all you're doing with this machine.
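The headroom math here is simple once you remember the limit is in MiB. A sketch of the two numbers that come up for a 16 GB machine: the zero default reportedly works out to roughly 75% of RAM, and reserving about 2 GiB for the system gives the 14336 value the video settles on. These formulas are my reading of the video, not documented Apple behavior:

```shell
# Total RAM on a 16 GB machine, in MiB
total=$((16 * 1024))        # 16384

# Default behavior (limit = 0): roughly 75% of RAM goes to the GPU
echo $((total * 3 / 4))     # prints 12288

# Leaving ~2 GiB of headroom for macOS and background processes
echo $((total - 2 * 1024))  # prints 14336
```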
If you're operating your Mac Mini in
headless mode, for example. And you can
do that on all the Apple silicon
machines, Mac Studios too. In fact, back
here I have a cluster of Mac Studios
where I can run really gigantic models,
sharing the memory between all of them. I
had to go in there and manually adjust
the available memory. Each one of those
boxes has 512 GB of memory, but by
default, that wired limit in megabytes
is set to zero, which means it gets
calculated automatically to about 75% of
what's available. So, if you want to
squeeze a little bit more out of that to
be able to run larger models, that's the
setting you want to use. Now, going back
to a small machine like this one, look
at that. 16.72 tokens per second. I'm
going to actually set the memory limit
to something more reasonable. And this
is the number I'm going to use, 14336 on
a 16 gigabyte machine, which is still
not the safest thing you can do, but
it'll allow me to keep a couple of
gigabytes for some background
processes and still have enough room to
run GPT-OSS. Going to restart LM Studio
because it needs to restart to detect
those changes. And yeah, so 14 GB now is
allocated to VRAM. And now I can load my
model, GPT-OSS 20B. There's that memory
pressure going up again. But in the end,
it loaded. Write a story, thought for a
brief moment, and off it goes. Now, if
you want to see some huge models
running, check out my video on that
cluster over there. That video is right
over here. Thanks for watching, and I'll
see you next time.