GPT-5.4: The Best Model That's Almost Perfect
FULL TRANSCRIPT
So, I don't know if you guys are going to believe this or not, but the new model is better than the old model. And in fact, it's actually a very, very good model. I was lucky enough to get to test this thing early. I've had it for about a week and done about 300 million tokens on it over Codex. I don't know how many I've done over the API or on chatgpt.com, but probably another 50 million or so there. So I've definitely got a lot of opinions and things to say about this thing.

Before we get into all that, I want to be super clear upfront: this video is not sponsored by OpenAI in any way at all. They offered me a free year of Codex as a thank you for early testing. I said no. I don't want to have any biasing influences with any of these labs. This is entirely my opinion, not vetted by anyone but myself. The only thing that happened here is they gave me early access to test within my Codex and ChatGPT account. That is literally it.

With that out of the way, there's a lot of stuff I want to talk about with this model: the training cutoff, the model behavior, how I've been using it, what I built with it, what it's like to talk to within ChatGPT (because it actually does feel different from the previous models), how to prompt it correctly for more complex agents over the API (because that actually is a very real thing; we'll get to it), and the absolutely awful UIs it has been generating.

First, though, I want to start with just how difficult it has gotten to test a new model at this point, because we've definitely crossed a threshold. When Opus 4.5 came out, like two months ago or whatever, that was the first time I had used a new model and instantly felt it. Within like four generations, I was looking at the output and seeing what it was doing, and I was like, "Oh, holy [ __ ], this is very different. This is clearly a level beyond what we had before." And it changed a lot of things. I'm sure if you've been on Twitter, you know exactly what that's resulted in over the last couple months. But now that we're past that point, all of the really good models, the Opuses, 5.2 Codex, and the OpenAI models beyond it, are really good at a lot of things. You can give them most tasks and they'll just get them done fine. The difference between 5.3 Codex and 5.4 on normal CRUD logic tasks is just not that big, because that's basically a solved problem for these models at this point.
It really takes time now to get the full picture of the difference between two models, because the reality is that the gains we're making are in very long-running, very complicated tasks. I've tried to throw a very wide variety of problems at this model to see what it's good at and what it's not good at. And while I'm definitely feeling a very real difference between 5.4 and 5.3 Codex, the really big differences and wins are things we're realistically just not going to find for some time. That started to change with 5.3 Codex, but I haven't been able to test this in Cursor's long-running agent harness, where you can basically set up a model to do a very long-running task. I've had 5.3 Codex do some pretty enormous tasks; I think the longest one was 10-plus hours. I haven't gotten to test 5.4 in any scenario like that other than some contrived stuff locally. It's going to take time for us to see all these things in the real world. What this video is mostly going to be about is my first impressions using it day-to-day for normal random work stuff.

I think the best place to start is actually with the training data cutoff. For a very long time, the biggest problem I've had with OpenAI models is that their knowledge just felt super out of date, like it was stuck in 2024. I don't think they've had any fresh data in there; they've just been doing RL on top of GPT-4o or GPT-4.5. That's what I've heard through the rumor mill, and I don't want to speculate too much about it because I barely understand how models work under the hood. It's not my thing, not my area. Just know that the old models were really bad about modern, up-to-date knowledge. The new ones aren't. It seems like both 5.3 and 5.4 were RLed on a similar snapshot, because you can kind of tell when you put their answers side by side that they're very similar in nature. I asked all three of these models, 5.2, 5.3, and 5.4:
"What are all of the different remote functions in SvelteKit?" The answer we get from 5.2 is total nonsense. It's just incorrect. This is its out-of-date knowledge showing, where it doesn't even know that remote functions exist. These are a new primitive that was added last fall, which is why I like using this as a [ __ ] test for what models actually know. The new models do know about this. Both 5.3 and 5.4 (and this is 5.3 Codex) say something like "in SvelteKit, remote functions usually refers to server-side functions," blah blah blah. You can just see in the writing that there's something similar happening deep underneath the hood of these models, even though in day-to-day use they definitely feel very, very different. But they actually know what remote functions are. They know what queries are. They know what commands are. They know what forms are. It feels really good to have an OpenAI model with actual up-to-date information in it. That was always something that really only the Claude family of models had for the longest time. I remember last summer I really, really liked Sonnet because it knew about the actual modern technologies and practices, whereas with GPT you would constantly have to give it extra context on how to do things in the modern way. That's kind of gotten fixed.
Another interesting thing is that the new model is actually really fast. If I just send this query, it's going to start streaming in here. It took a little bit longer to start showing up in 5.4, but the actual tokens per second on this is really, really good. And I've noticed that, compared to 5.2 and even 5.3, the thing OpenAI models used to do where they would just reason and reason and reason and take forever to do anything has really gotten smacked out of 5.4. It's very good at just doing things, and it feels very fast in day-to-day use. Now, obviously, I am using an early alpha cluster. Once this actually goes live, the real tokens per second is probably going to end up going down slightly, but you can see even 5.3 was pretty damn fast, not that much slower than 5.4. I feel like in day-to-day use this thing is going to feel much faster than old GPT models, partly because of the tokens per second, but far more importantly because it's very efficient with tool calling. It does not like calling tons of tools. It's very surgical in what it actually picks and does, unless you tell it exactly what to do otherwise. And we're going to have a long section about that later. But before we do that, we need to hear a word from today's sponsor.

Today's sponsor is Greptile. You've probably already heard the AI code review pitch before, so we're going to skip all that, and instead I want to show you this.
This is a change I've been working on for Better Context, where I'm fixing some system prompts and doing some things that we'll talk about later in this video. When I pushed this up to GitHub and made the PR, Greptile left a little review here with a confidence score of four out of five, with a little nitpick of, hey, we should probably fix this. I could manually copy-paste this; that works. Instead, what I'm going to do is use the Greptile MCP through Codex to check for any code reviews on this PR, see if there are any comments, and have it fix any notes that were left behind by Greptile. What's really cool about this is that since it's an MCP, it can be called by agents, which means you can have your agents automatically review themselves. Once they're done making a change, you can just tell them, hey, go check for any reviews on this. You don't have to go into GitHub, find the PR, check the review, and copy-paste it back in. It just kind of works. You can see right here what it's doing: it is getting the merge request, listing the merge request comments, and checking the actual code reviews. It found two unaddressed comments, and it is now exploring them and getting them fixed. Now that they're fixed, I can tell it to just commit and push the changes. This feedback loop feels so much better. I've been really, really liking this in all my workflows. Just having the ability to pull down code review changes directly within my actual environment, instead of having to go up to GitHub, is such a lifesaver. It makes things so much easier. You can already see back on the PR that the little eyes emoji has been left by Greptile to let you know that it is now re-reviewing this PR, because we just pushed it up. It's 100% off for open source and 50% off for startups. There's no reason not to be using an AI code reviewer. They just make life so much easier at this point. Greptile is my personal favorite. I highly recommend it at davis7.link/gile.
So, now I want to talk about how this thing feels to actually use. And it feels amazing. It's extremely good at tool calling, to the point where I don't think I've ever seen a tool call fail. It almost always calls the logical, sensible tool. It's just extremely good at operating within the Codex harness, within custom harnesses I've given it, within the ChatGPT web app harness. It just works. An interesting thing I've noticed it doing, and I think GPT models have done this for a while: sometimes it will just eject out of the Codex tools, decide "no, I don't want to do that," and instead handwrite a Python script and run it in a REPL to actually change a file. So here it's doing a file edit where it wrote out a full Python script to change all the files in the codebase.
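For anyone who hasn't seen the trick, the scripts it writes look roughly like this. This is my own minimal sketch, not the model's actual output; the directory, glob, and strings are purely illustrative:

```python
# Sketch of a throwaway bulk-edit script like the ones the model writes:
# walk a tree, rewrite one string in every file matching a glob pattern.
from pathlib import Path

def bulk_replace(root: str, pattern: str, old: str, new: str) -> int:
    """Replace `old` with `new` in files under `root` matching `pattern`.

    Returns the number of files that were changed.
    """
    changed = 0
    for path in Path(root).rglob(pattern):
        text = path.read_text(encoding="utf-8")
        if old in text:
            path.write_text(text.replace(old, new), encoding="utf-8")
            changed += 1
    return changed

# Example: bulk_replace("src", "*.ts", "oldName", "newName")
```

The appeal is obvious: one script applies a consistent edit across the whole codebase in a single tool call instead of dozens of individual file edits.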
I've seen it do this very consistently throughout all my different projects, and the scripts always work. I have never seen it malform one of these. I've never seen a Python error come out of it. It just kind of works.

Another thing I've been noticing with this model is that it feels a lot less like 5.3 Codex did and a lot more like Opus does. It's not quite at the Opus level of full-sending whatever you tell it to do, consequences be damned, but it is definitely more proactive and aggressive with its changes than 5.3 Codex was. 5.3 Codex really liked taking its time, doing a bunch of exploring, thinking things through, and asking permission before it did anything. This one feels a lot more biased towards actually taking actions. I had this old thread from a couple days ago where I wanted to set up the local Convex dev stuff within a project I was working on. I gave it another project where I had already set this up as a reference. We'll talk about this later, but the way I've been using these models has changed a lot. I don't let these things run in any way other than full system access. This thing could wipe my home directory at any point if it wanted to. Honestly, I'd be kind of okay with that, because it'd be really funny and would make a hilarious video. But the reality is, even if there are some concerns with doing that, the pros so massively outweigh the cons for me personally that I've just been letting it go and do its thing. And the results have been really good. It was able to take this pretty simple prompt and just do it right. It did its exploration step, then it put together a plan. This is usually where I would find 5.3 Codex would stop, tell me the plan, and be like, "Hey, does this look okay? Can I go and actually do this?" 5.4 just full-sends. It actually executes the plan. It got all of this stuff implemented, and at the end it gave me the summary, and the actual change worked exactly as you would want. It does the very ChatGPT thing where, every time you have it make a change in Codex or ask it something on chatgpt.com, it will always prompt you with the next actionable step: "Hey, if you want, the next useful step is X." Here it said the next useful step was to add a .env.local.example to make things easier. I was like, yeah, that's a great idea, let's just do that. So I had it add the .env.local.example. It got it set up in there.
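For context, a .env.local.example is just a committed template of the real, gitignored .env.local. Something in this spirit (the variable names here are placeholders I made up, not the actual ones from the project):

```
# .env.local.example -- copy to .env.local and fill in real values.
# Variable names below are illustrative placeholders only.
CONVEX_URL=http://127.0.0.1:3210
CONVEX_ADMIN_KEY=replace-me
DISCORD_WEBHOOK_URL=replace-me
```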
Everything worked. And then after all that, I realized, oh wait, I do actually need webhooks. I'm kind of dumb here. Let's just get rid of all that stuff, stash it away, and leave anything unrelated to that change alone, because I had other miscellaneous changes unstaged within this branch. And again, it was really good about that. I told it, hey, if there's anything unrelated, leave it as is. And it did. A big change in the way I've been using these things, and I don't know if this is just the model or just me, is that I've been much more willing to let them handle git for me. Let them handle stashing for me. In the past, I would have really wanted to do this by hand. At this point, I trust the model enough, and it works well enough, that I just let it go do it. It did the stashing for me. It didn't touch anything unrelated. That was true, and it was all good. It worked. I've really been racking my brain to find things it egregiously does wrong over the last week or so. I could easily find examples of a lot of other models flying off the rails, but the best way I can describe this one is that it's really the first model where I kind of just trust it to go do the thing. There's nothing super fancy about it. It's nothing super insane. I didn't do anything super difficult with this task, but it followed the instructions perfectly. It made the change perfectly. That is the best way I can describe this model. It is the model that just works.
You don't have to think that hard about it. You don't have to guardrail it too hard. And I've done a pretty insane amount of work with this over the last week or so; these are just the saved threads in here, because I had to wipe all my old threads when we were doing a data model migration on the T3 Code alpha. It's a good model. This is a very, very good model.

On that topic, I've started using this model for a lot of things that aren't just code. This is something I've really started doing recently, and this model has helped kickstart it: more and more, the way I use my terminal and my computer is just letting Codex go do it. That's why I like giving it full root access to everything, because I can allow it to do something like, "Hey, could you build the desktop app and install the DMG onto my computer, replacing the existing T3 Code build there?" As Julius has been building out the T3 Code alpha, he's constantly shipping a bunch of changes up to GitHub, and we don't have a full release pipeline ready for it yet. So I was like, okay, I want to get the latest changes, but I don't want to have to manually run the build command, grab the DMG, put it in my Applications folder, and do all this random stuff that I could do but just don't care enough to do, because I can tell the model to do it. And within two minutes, it'll just do the thing. It'll build the project, grab the DMG, put it where it needs to be, restart my T3 Code instance, and it just works. I've even gotten to the point where I've had a lot of these where I'm like, "Hey, can you pull the latest changes from main, rebuild the project, install the DMG as my T3 Code instance on my computer, and give me a list of everything that changed from the pull?" It just works super, super well.
I've started building out this experiments directory, which I effectively just let the model run and use itself. Whenever I have some random idea or thing I want to make, which we'll talk about later in this video, I'll just open up a Codex terminal in this experiments directory and be like, "Hey, make a new SvelteKit project that does XYZ." For example, I wanted a random one-off script that would run in the cloud and send me a Discord notification whenever this new monitor that Theo and I want to buy for the studio comes up and is available on Amazon, because it's not clear when it's going to come out yet.
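A minimal version of that kind of watcher might look like this. To be clear, this is my own sketch, not the script the model produced, and the URL, webhook, and availability heuristic are all placeholder assumptions:

```python
# Sketch of a stock watcher: fetch a product page and post to a Discord
# webhook when it stops showing an "unavailable" marker. The URLs and the
# heuristic below are illustrative placeholders, not from the real script.
import json
import urllib.request

PRODUCT_URL = "https://www.amazon.com/dp/EXAMPLE"      # placeholder
WEBHOOK_URL = "https://discord.com/api/webhooks/..."   # placeholder

def looks_available(page_html: str) -> bool:
    """Crude heuristic: the page no longer says 'currently unavailable'."""
    return "currently unavailable" not in page_html.lower()

def notify(message: str) -> None:
    """Post a simple message to a Discord webhook."""
    data = json.dumps({"content": message}).encode()
    req = urllib.request.Request(
        WEBHOOK_URL, data=data, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req)

def check_once() -> bool:
    """Fetch the product page once; notify and return True if it looks available."""
    html = urllib.request.urlopen(PRODUCT_URL).read().decode("utf-8", "replace")
    if looks_available(html):
        notify(f"Monitor might be in stock: {PRODUCT_URL}")
        return True
    return False
```

In practice you'd run `check_once` on a schedule (a cron job or a cloud scheduler), which is presumably what "run in the cloud" means here.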
I just let it go make that: let it make the project, let it do all those things, let it deploy it. And it just kind of worked. I've even started to use it for randomly testing things. One of the projects I'm working on, Better Context, has a CLI, and the CLI is kind of annoying to test manually every single time I want to make a change and make sure I didn't accidentally break something important.
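Incidentally, this kind of CLI smoke check is also easy to script directly. Here's a tiny runner sketch; the command/expected-substring format is my own invention, not the actual smoke test file from the video:

```python
# Tiny smoke-test runner sketch: run each shell command and check that an
# expected substring appears in its stdout. Returns a list of failures.
import subprocess

def run_smoke(cases: list[tuple[str, str]]) -> list[str]:
    """cases: (shell command, substring expected in stdout)."""
    failures = []
    for cmd, expected in cases:
        result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        if expected not in result.stdout:
            failures.append(f"{cmd!r}: expected {expected!r} in output")
    return failures
```

The point in the video, though, is that you don't even need the runner: you can hand the model the same list of commands and expected outputs and let it act as the runner itself.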
I've got unit tests in there, but a good smoke test is really nice. In the past, I would just run it manually, and I'd get kind of lazy with it, until I realized: oh wait, this model can run commands on my computer and act like a human would. Why don't I just give it a smoke test file with a list of all the commands I want it to run, the desired outputs, the desired error shapes, and all that stuff, and let it go through and actually test the project on my machine? At the end, it gave me this nice little breakdown saying everything does seem to be working just fine. And I'm like, cool, that's a really nice sanity check. I can just let this thing use my computer. Once you unlock that in your brain, these models and harnesses get so much more useful. I really think these coding agent harnesses right now are really good for writing code, but over time they're going to get to the point where they're really useful for doing anything on your computer. I mean, that's basically what Claude Cowork is: Claude Code with a nice UI on it, running a bunch of commands under the hood to use your computer like normal. It's really, really good.

But none of that actually matters without answering the question of what I've actually built with this thing. The right way to answer that is to start with the thing the model's really, really bad at, and that is UI. I was working on some browser automation stuff, because the browser use capabilities of this model are really, really good, and this is the UI it came up with for a live view of the browser automation agent. It is unusably awful. It really likes to use this weird singular style for a bunch of random things. I've had it pop this out for a bunch of internal tools, and I thought that maybe it was just because I had the front-end design skill on and that was [ __ ] with it. So I turned that off and tried a side-by-side: here's a very simple UI prompt, given to both Opus 4.6 and GPT-5.4, to see what the two options look like. And the results were honestly bad.
This is the UI that came out of GPT-5.4. It very much has that GPT-5 UI smell. This is what the GPT model UIs have looked like since GPT-5, and that smell has not gone away with the newest model. It's got the weird text thing up here that they all love doing. It's got these weird rounded corners. This UI just does not look or feel good. It's very cluttered and gross. The one Opus created was this, which is just infinitely better. This is such a better chat UI, one that is actually usable and that I could build something on top of.
Obviously, it's not the most inspired thing in the world, but with a little bit of tuning and work, this could be usable. This just can't. And I've had a lot of examples like this. I'm working on BTCA v2, and one of the things I wanted to do was build out a nice research view for it. And this is just terrible. There is too much random crap going on in this UI. It looks the exact same as all the normal GPT UIs do. It's not very readable. It just doesn't feel good. It didn't even follow the styles I had established within this codebase, which I told it to follow. The really interesting thing is that I had 5.3 Codex do the same exact thing, but in the Cursor cloud agent harness.
So to be fair, this comparison is definitely biased, because in the cloud agent harness the model has access to a browser, so it was able to see what it was actually doing as it happened and get some good back pressure there, versus just writing the Tailwind classes and hoping they're good. But still, this one followed the styles I had already established for the project. It's still very cluttered. It's still kind of GPT-y. But it's much better than the other one. Unfortunately, I still think we're at the point where GPT-5.4, as good as it is for basically everything else, still can't handle UI super well. So there's still a need to keep something like Claude on the back burner, so that anytime you need to make a page not look like this and do some very specific UI tweaks, you have a competent model for that task.

Once you get outside of front-end problems, though, it's really, really good. A big thing I've been working on is migrating the Better Context CLI and server over to the Effect v4 beta.
I want all of my services and projects to be built on top of this going forward. It is very much, in my opinion, the right way to build anything that is a CLI or a server at this point. And I was shocked at how well it was able to actually parse and use this thing. I gave it access to the source code for Effect v4 with the BTCA tool, but other than that, all it had was this link and some basic instructions on how to port it over, and it did a really good job. It centralized the services correctly. It used good Effect patterns.
There are still some weird holdovers in the actual implementation from the old version, because the old version was very promise-based and I want it to be fully Effect-generator-based. Getting all of those boundaries hammered out so that there aren't a bunch of random awaits in the codebase is still a thing; I haven't gotten all of those ironed out yet. You still have to be in the driver's seat for that. It's not magically doing all of it for me. But again, if it were in a better harness designed to run for very, very long periods of time, maybe it could have one-shotted it. This is the kind of thing I really want to test once it's available in something like the Cursor cloud agents. It has been working very well for feature additions and security-audit-type stuff.
One of the things I wanted to do was add support for GitHub private repos to the BTCA web app. This is something a couple different people wanted, so I was like, yeah, sure, let's get that added in. I gave it some basic information on how to do it, and we designed how it would actually work in a back and forth. It was able to use Clerk to reauthorize the GitHub integration and get the read-only user-repo permissions it needed for this. As a result, I'm able to add something like this BTCAEX repo, which is a private repo with just some internal experimentations, and now if I go in here and ask "what is at btca?"
It'll just work. It'll actually load this into the sandbox correctly, authenticating with the proper credentials passed into the sandbox. The way it was architected was really, really good. It did a full security audit to make sure everything was sound in how this was actually implemented. And you can see it works. It works really, really well. I also had it do a pretty heavy security audit of all the endpoints, queries, mutations, and actions on this app. It found a couple of actually pretty novel edge cases I had not thought of, where you could slip past the guards that had already been implemented (the normal middleware Clerk guards on the back end) to exploit random things like the sandbox and the sandbox URL. It found all of those, it patched all of those, and the end result is very, very good. This is a great model for systems, security, backends, CLIs; anything like that it can handle very well.

The only other thing it doesn't do super well, though frankly no model does this well, is, funny enough, crafting prompts for itself to use. I wish I could show it, but it's internal alpha documentation, so I don't want to leak it. When this goes live tomorrow, all of it will be public and I can link it down in the description. OpenAI gave us a really nice site with a bunch of references on how to write good system prompts for it, what the model behavior is like, and how it works in different harnesses. One of the things they suggest is adding sections to your system prompt that define the behavior you want from it: what its personality is, how it should use its tools, what the output needs to look like, how verbose it should be, how it should follow instructions, all that stuff. This stuff needs to be handwritten. I tried to let the model handwrite the system prompt, and the difference between a system prompt written by me and one written by the model is night and day. I was actually testing this within BTCA. I'm about to ship an update with a much better system prompt to get much better results out of the model.
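To give a feel for the sectioned style being described, here's a rough skeleton. The section names and wording are my own illustration, not OpenAI's actual guidance:

```xml
<!-- Illustrative system prompt skeleton; section names are my own invention. -->
<behavior>
  Take action without asking permission unless a change is destructive.
</behavior>
<personality>
  Candid and terse. No filler, no emojis.
</personality>
<tool_use>
  Prefer the fewest tool calls that answer the question. Cite files you read.
</tool_use>
<output_format>
  Short summary first, then details. Link a source for every claim.
</output_format>
<verbosity>
  Default to brief answers; expand only when asked to be thorough.
</verbosity>
```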
One of the things I did was add those really nice XML blocks and format the prompt in a way that very clearly directed the model on how it should behave, because this is a model that follows instructions very, very well. I have one run on the right using the new system prompt and one on the left using the old system prompt. When I hit enter on both of these, it'll take a second for the output to come in, but the outputs you get from the new system prompt are so much better. They're much more thorough and include much better examples. It just does a better job when you give it better instructions, because this model, like all GPT models, is incredible at following instructions.
If you tell it to just YOLO the thing and not think too hard, it'll YOLO the thing and not think too hard. If you tell it to be very thorough in its research and do proper citations, it'll take slightly longer; this one did a few more tool calls than the other. But when we look at the actual end results, the new one is much more thorough and much more correct. It caught that, hey, you can handle argument validation failures at the endpoint boundary with handleValidationError in the hooks.ts file, which is basically just the middleware in SvelteKit land. And it gave a very long, detailed answer with really good examples for all of the different ways you can handle these things, including that handleValidationError down here, with all of the sources properly linked to the original GitHub repo. The old one still gets the job done, but it's not quite as good. Honestly, going forward, one of the hardest and most important parts of building a complex agent is the system prompt. You have to be very clear on what it can do, what it can't do, and how it should behave in different scenarios, because this thing will do exactly what you tell it to.
That prompt I showed earlier was from a browser use agent session I had it do. I wanted to see how well this thing could handle working in a browser with a bunch of computer-use-type tools, and it worked super well. I had it search for flights from San Francisco to Columbus on a random date towards the end of the summer, just because I was curious whether it could do it, and it did it really well. You can see it's inputting the text correctly there, and it inputs the text correctly here. It does end up breaking, not because of the model, but because of how the site is coded; it just breaks in weird ways. That's not the model's fault. That's something we can tune out.
The important part is that it can operate in a wide variety of harnesses very, very well. Another thing I was pretty impressed it did: Maria, who's been helping out with my channel and Theo's channel, sent me this image of a feature she wanted added to PicThing for thumbnail face stuff. It was just a bunch of random red text and random images with arrows; it's kind of a funny-looking drawing. And I was really curious: if I just give this drawing to the model with "follow the instructions on this image and implement it," could it do it? You can see I gave it this, it did a bunch of work here, I tested the actual thing, it works perfectly, and I shipped it. It's not the prettiest thing in the world; this definitely could be better. But the feature's implemented, and it works.
It's just fine. It was able to extract all the information it needed directly from that very random image and turn it into a useful feature. There's a bunch of other stuff I had it do, and to avoid bloating the runtime I'll skip over most of it. I had it make a Semox alternative just to test whether that was even possible and to look at some different UX flows; it was able to basically one-shot that perfectly and get the Swift app and the Ghostty rendering all working in a really nice way. It was able to make my Codex usage script, which is the thing I use to figure out how much I've been using the new model on my computer. It was able to put together a React Native plus Expo mobile app very easily that I was using for a bunch of random personal finance stuff, which is why I unfortunately can't show that one.
It's pretty cool, but I don't want to leak any of that information. It's a really, really good model.

The last thing I wanted to talk a little bit about is how it feels within ChatGPT, because as you've seen through this entire video, this thing is a monster at instruction following. My personalization is set to be candid, less warm, not that enthusiastic; headers and lists are fine, no emojis. I don't like any of that stuff. I wanted to get the ChatGPT voice out of it, and it's mostly gone. The weird speech patterns you're probably used to if you use the ChatGPT app a lot with 5.2 or any other model are pretty much gone from this one, which feels really good. I'm glad that's not a thing I have to deal with anymore. I like my models being much more pragmatic and to the point in the way they talk to me and the way they do things. This is definitely not a sycophantic model at all. I have had this thing tell me multiple times, flat out, "you're an idiot, no, do XYZ," or "no, that doesn't make any sense, do this or that." It is very happy to disagree with you, and it does not take [ __ ]. This is not another 4o, which I know a lot of people are not going to like. But if you are using these things for work, this is an incredible work model, an incredible, actually useful research-assistant-type thing. I'm very happy with it.

I want to close by talking about what the general landscape of models looks like right now, especially now that GPT-5.4 is out. It's released in the API, it's released on chatgpt.com, and it's released within Codex. They did it right this time. I'm very happy about that. You can just use it everywhere now, and I will be using it everywhere now.
This is getting very close to being the god model. I think we are rapidly heading to the point where the old split of models, where you had Haiku, Sonnet, and Opus each serving their own purpose, just goes away. This model, if it's priced the way I assume it is, which is the same as GPT-5.3 Codex (I don't see why it wouldn't be), is very reasonably priced. It's very efficient with its token usage. Its reasoning is not all that bad. You can use it for very small one-off tasks that you would use a Haiku for, and you can use it for massive five-hour-long background jobs, all on the same model. The only places where this model has really felt bad to me are UI and making system prompts and agent harnesses for other models, but that's always been a thing. There is no model that can write a good system prompt for another model. You have to do those by hand. There's been a lot of research about this, and it's one of the most important parts to pay very close attention to if you are building anything involving AI. But honestly, if we got GPT-5.5 in a couple months and it was just as good at UI as Opus 4.6 while still being better at everything else and maintaining its current characteristics, I don't really see a world where I would use literally any other model for anything. We're getting into speculation land, so take all this with a grain of salt.
Predicting the future is a fool's errand, but we're going to do it now because it's fun. I really think we're going to get to the point where one model wins at literally everything; it's just going to be better than all the other models at all of the tasks. And realistically speaking, I think OpenAI is clearly on the path to making it. They're pretty damn close with this one. Anthropic is kind of getting there with Opus, but it's still just not there, and I still actually find myself kind of liking the way Sonnet feels. There's still more of a distinction between their different models, but this is the first one where it's all unified into one. Everything from chatting with it as a normie to heavy agentic work is wrapped up in one model, GPT-5.4. It is the single model for everything. Honestly, my biggest hope for the future at this point is that Google gets their [ __ ] together so that OpenAI has some more real competition, because right now the only competition OpenAI actually has is Anthropic. There just isn't anything else quite at the level of GPT-5.4 and Opus 4.6, and I can't really imagine anyone else making anything quite this good. It is impossible to put into words exactly how good this model is and how good it feels without just using it. Give it time. Give it a bunch of different prompts. Give it a bunch of different things to work on, and you'll get a feel for how it actually works, and I think you'll slowly but surely realize that this is the model that can do basically anything. We're heading into weird, weird times. I don't really have anything else to say on this; I'm sure I will in the future, after I've done more work with it and put it in more complex harnesses. I really like this thing. You should go try it. If you enjoyed this video, there are more on screen that you should probably click on. They're probably pretty good.