TRANSCRIPT (English)

GPT-5.4: The Best Model That's Almost Perfect

28m 31s · 6,729 words · 911 segments · English

FULL TRANSCRIPT

0:00

So, I don't know if you guys are going

0:01

to believe this or not, but the new

0:02

model is better than the old model. And

0:04

in fact, it's actually a very, very good

0:06

model. I was lucky enough to get to test

0:08

this thing early. I've had it for about

0:10

a week, done about 300 million tokens on

0:12

it over Codex. Don't know how many I've

0:14

done over API or in chatgpt.com, but

0:17

probably another like 50 million or so

0:19

there. So, I've definitely got a lot of

0:20

opinions and things to say about this

0:22

thing. Before we get into all that, I

0:23

want to be super clear upfront. This

0:25

video is not sponsored by OpenAI in any

0:27

way at all. They offered me a free year

0:30

of Codex as a thank you for early

0:31

testing. I said no. I don't want to have

0:34

any biasing influences with any of these

0:36

labs. This is entirely my opinion. This

0:38

is not vetted by anyone but myself. The

0:40

only thing that happened here is they

0:41

gave me early access to test within my

0:43

Codex and ChatGPT account. That is

0:45

literally it. But with that out of the

0:47

way, there's a lot of stuff that I want

0:48

to talk about with this model. the

0:50

training cutoff, the model behavior, how

0:52

I've been using it, what I built with

0:53

it, what it's like to talk to within

0:55

ChatGPT, because it actually does feel

0:57

different from the previous models, how

0:58

to prompt it correctly for like more

1:00

complex agents over API because that

1:02

actually is a very real thing. We'll

1:05

get to it. The absolutely awful UIs it

1:08

has been generating, and finally, I want

1:09

to start with just how difficult it has

1:11

gotten to test a new model at this point

1:13

because we've definitely crossed a

1:15

threshold. Like when Opus 4.5 came out

1:17

like 2 months ago or whatever, that was

1:19

the first time I had used a new model

1:21

and like instantly felt it. Within like

1:23

four generations, I was looking at the

1:24

output and seeing what it was doing. I

1:26

was like, "Oh, holy [ __ ] This is very

1:27

different. This is clearly a level

1:30

beyond what we had before." And it

1:31

changed a lot of things. I'm sure if

1:33

you've been on Twitter, you know exactly

1:35

what that's resulted in over the last

1:36

couple months. But now that we're kind

1:38

of past that point, all of the really

1:40

good models, like the Opuses and the

1:43

5.2-to-Codex-and-beyond OpenAI models.

1:45

They're all like really good at a lot of

1:47

things. You can kind of just give them

1:48

most tasks and they'll just get them

1:50

done just fine. The difference between

1:52

5.3 Codex and 5.4 in just like normal

1:55

CRUDL logic tasks is just not that big

1:57

because it's just kind of a solved

1:59

problem for these models at this point.

2:01

It really takes time at this point to

2:02

get the full picture of what the

2:04

difference between two models is because

2:06

the reality is the gains that we're

2:09

making are in very long-running, very

2:11

complicated tasks. I've tried to throw a

2:13

very wide variety of problems at this

2:15

model. See what it's good at. See what

2:16

it's not good at. And while I'm

2:17

definitely feeling a very real

2:19

difference between 5.4 and 5.3 Codex, the

2:22

really big differences and wins,

2:24

realistically we're just not going to

2:25

find those for some time. That's kind of

2:28

started to change with 5.3 Codex, but

2:30

like I haven't been able to test this in

2:31

Cursor's long-running agent harness, where

2:33

you can basically set up a model to do a

2:36

very long-running task. I've had 5.3

2:38

Codex do some pretty enormous tasks

2:40

that took a while; I think the longest one I've

2:42

had was like 10 plus hours. I haven't

2:44

gotten to test 5.4 in any scenario like

2:46

that other than some contrived stuff

2:48

locally. It's going to take time for us

2:50

to see all these things in the real

2:51

world. What this video is mostly going

2:52

to be about is just my first impressions

2:54

using it day-to-day for like normal

2:56

random work stuff. And I think the best

2:58

place to start is actually with the

2:59

training data cutoff. For a very long

3:01

time, the biggest problem I've had with

3:03

OpenAI models is the fact that their

3:04

knowledge just felt super, super out of date,

3:07

like it was stuck in 2024. I don't think

3:09

that they've had any like fresh data in

3:12

there. They've just been RLing on top

3:14

of GPT or GPT-4.5. That's like what I've

3:17

heard through the rumor mill. I don't

3:18

want to speculate too much about that

3:20

because I barely understand how models

3:22

work under the hood. It's not my thing,

3:23

not my area. Just know that the old

3:25

models were really bad about modern

3:28

up-to-date knowledge. The new ones

3:29

aren't. It seems like both 5.3 and 5.4

3:32

were RLed on a similar snapshot because

3:35

you can kind of tell when you put their

3:36

answers side by side that it's very

3:38

similar in nature. Like I asked all

3:40

three of these models, 5.2, 5.3, and 5.4:

3:43

What are all of the different remote

3:44

functions in SvelteKit? The answer we get

3:46

from 5.2 is total nonsense. This is just

3:49

incorrect. This is its out-of-date

3:50

knowledge showing where it doesn't even

3:52

know that remote functions exist. These

3:54

are a new primitive that was added last

3:55

fall, which is why I like using this as

3:57

a [ __ ] test for models and what they

3:59

actually know. The new models do

4:00

actually know about this. Both 5.3 and

4:02

5.4, and this is 5.3 Codex, are saying in

4:05

SvelteKit remote functions usually refers

4:07

to server-side functions, blah blah blah.

4:09

You can just kind of see in

4:10

the writing that there's something

4:12

similar happening deep underneath the

4:14

hood of these models. Even though in

4:15

day-to-day use they definitely feel very

4:17

very different, but they actually know

4:18

what they are. They know what queries

4:20

are. They know what commands are. They

4:21

know what forms are. It feels really

4:23

good to have an OpenAI model with actual

4:26

up-to-date information in it. That was

4:28

always something that really only the

4:29

Claude family of models had for the

4:31

longest time. I remember last summer I

4:33

really really liked Sonnet because it

4:34

knew about the actual modern

4:36

technologies and practices versus with

4:38

GPT you would constantly have to give it

4:40

extra context on how to do things in the

4:42

modern way. That's kind of gotten fixed.

4:44

Another interesting thing is that the

4:46

new model is actually really fast. Like

4:48

if I just send this query, it's going to

4:50

start streaming in here. It took a

4:51

little bit longer to start showing up in

4:53

5.4, but like the actual tokens per

4:55

second on this is really really good.

4:57

And I've noticed that compared to like

4:59

5.2 and even 5.3, the thing that OpenAI

5:02

models used to do where they would just

5:03

reason and reason and reason and take

5:05

forever to do anything has really gotten

5:07

smacked out of 5.4. It's very good at

5:10

just kind of doing things and it feels

5:12

very fast in day-to-day use. Now,

5:13

obviously, I am using an early alpha

5:15

cluster. Once this actually goes live,

5:17

the real tokens per second is probably

5:19

going to end up going down slightly, but

5:21

you can even still see like 5.3 was

5:23

pretty damn fast. Not that much slower

5:25

than 5.4. I feel like in day-to-day use,

5:27

this thing is going to feel much faster

5:29

than old GPT models, entirely because of

5:31

(a) the tokens per second, but (b) far more

5:33

importantly, it's very efficient with

5:35

tool calling. Like it does not like

5:36

calling tons of tools. It's very

5:38

surgical in what it actually picks and

5:40

does unless you tell it exactly what to

5:42

do otherwise. And we're going to have a

5:44

long section about that later. But

5:45

before we do that, we need to hear a

5:46

word from today's sponsor. Today's

5:48

sponsor is Greptile. And you've probably

5:50

already heard the AI code review pitch

5:52

before, so we're going to skip all that

5:53

and instead I want to show you this.

5:55

This is a change I've been working on

5:56

for better context where I'm fixing some

5:58

system prompts and doing some things

5:59

that we'll talk about later in this

6:01

video. But when I pushed this up to

6:02

GitHub and made the PR, Greptile left a

6:04

little review here with a confidence

6:06

score of four out of five with a little

6:07

nitpick here of like, hey, we should

6:09

probably fix this. What I could do is I

6:11

could manually copy paste this. That

6:13

works. Instead, what I'm going to do is

6:14

use the Greptile MCP through Codex to

6:17

check for any code reviews on this PR,

6:18

see if there are any comments, and have

6:20

it fix any notes that were left behind

6:22

by Greptile. And what's really cool about

6:24

this is since it is an MCP, this can now

6:26

be called by agents, which means that

6:28

you can have your agents automatically

6:29

review themselves. Once they're done

6:31

making a change, you can just tell them,

6:33

hey, go check for any reviews on this.

6:35

You don't have to go into GitHub, find

6:36

the PR, check the review, copy paste it

6:38

back in. It just kind of works. And you

6:40

can see right here what it's doing. It

6:42

is getting the merge request. It is

6:43

listing the merge request comments. It

6:45

is checking for the actual code reviews.

6:47

It found two unaddressed comments. It is

6:49

now exploring these, getting these

6:51

fixed. And now that they're fixed, I can

6:52

tell it to just commit and push the

6:54

changes. This feedback loop feels so

6:56

much better. I've been really, really

6:58

liking this in all my workflows. Just

7:00

having the ability to pull down code

7:02

review changes directly within my actual

7:05

environment instead of having to go up

7:06

to GitHub. It's such a lifesaver. It

7:08

makes things so much easier. You can

7:10

already see back on the PR, the little

7:11

eyes emoji has been left by Greptile to

7:13

let you know that, hey, it is now

7:15

re-reviewing this PR cuz we just pushed

7:16

it up. It's 100% off for open source,

7:18

50% off for startups. There's no reason

7:20

to not be using an AI code reviewer.

7:22

They just make life so much easier at

7:23

this point. Greptile is my personal

7:25

favorite. I highly recommend it at

7:26

davis7.link/gile.

7:28

So, now I want to talk about how this

7:30

thing feels to actually use. And it

7:32

feels amazing. It's extremely good at

7:34

tool calling to the point where I don't

7:35

think I've ever seen a tool fail. It

7:37

almost always calls the logical sensible

7:39

tool. It's just extremely good at

7:42

operating within the codeex harness,

7:43

within custom harnesses I've given it,

7:45

within the ChatGPT web app harness. It

7:47

just it works. An interesting thing I've

7:49

noticed it doing, and I think GPT models

7:51

have done this for a while, sometimes it

7:53

will just eject out of the Codex tools

7:55

and it'll just be like, "No, I don't

7:56

want to do that." And it will instead

7:57

just handwrite a Python script and then

8:00

run that in a REPL to actually change

8:02

a file. So here it's doing a file edit

8:04

where it wrote out a full Python script

8:05

to change all the files in the codebase.
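A hypothetical sketch of the kind of one-off edit script the model writes in this situation; the function name, file pattern, and identifiers here are illustrative placeholders, not code from an actual Codex session:

```python
import pathlib

def bulk_replace(root: str, old: str, new: str, pattern: str = "*.ts") -> int:
    """Replace `old` with `new` in every file matching `pattern` under `root`.

    Returns the number of files that were actually modified.
    """
    changed = 0
    for path in pathlib.Path(root).rglob(pattern):
        text = path.read_text(encoding="utf-8")
        if old in text:
            path.write_text(text.replace(old, new), encoding="utf-8")
            changed += 1
    return changed

if __name__ == "__main__":
    # e.g. rename an identifier across the whole codebase in one pass
    print(bulk_replace(".", "oldName", "newName"))
```

The appeal over per-file edit tools is presumably that one script plus one run touches every file in a single tool call, which would explain why the model keeps reaching for this pattern.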

8:08

I've seen it do this very consistently

8:09

throughout all my different projects and

8:11

they always work. I have never seen it

8:13

malform one of these. I've never seen a

8:15

Python error come out of it. It just

8:16

kind of works. Another thing I've been

8:18

noticing with this model is it feels a

8:20

lot less like 5.3 Codex did and a lot more

8:23

like Opus does. It's not quite at the

8:25

Opus level of just full sending whatever

8:27

you tell it to do and it will just do

8:29

it. consequences be damned. But it is

8:31

definitely more proactive and aggressive

8:33

with its changes than 5.3 Codex was. 5.3

8:36

Codex really likes just taking its time

8:38

and doing a bunch of exploring and

8:40

really just kind of thinking through and

8:42

asking permission before it did

8:43

anything. This one feels a lot more

8:45

biased towards actually just taking

8:46

actions. I had this old thread I did a

8:48

couple days ago where I wanted to set up

8:50

the local Convex dev stuff within a

8:53

project that I was working on. I gave it

8:54

another project where I had already set

8:56

this up as a reference, which we'll talk

8:58

about this later, but the way I've been

8:59

using these models has changed a lot. I

9:01

don't let these things run in any way

9:03

other than full system access. This

9:05

thing could wipe my home directory at

9:07

any point if it wanted to. Honestly, I'd

9:08

be kind of okay with it cuz it'd be

9:10

really funny and it would be a hilarious

9:11

video to make. But like the reality is,

9:13

even if there are some concerns with

9:15

doing that, the pros so massively

9:17

outweigh the cons for me personally that

9:19

I've just been letting it go and do its

9:21

thing. And the results have been really

9:22

good. It was able to just take this

9:24

pretty simple prompt and just do it

9:26

right. It did its exploration step, then

9:28

it put together a plan. And this is

9:30

usually where I would find 5.3 Codex would

9:32

step in and just stop and it would tell

9:34

me the plan and be like, "Hey, does this

9:36

look okay? Can I go and actually do

9:37

this?" 5.4 is just full sending. It is

9:39

just actually doing the plan. It gets

9:41

all of this stuff actually implemented.

9:43

And then at the end here, it gave me the

9:44

summary, the actual change worked

9:46

exactly as you would want. It does the

9:48

very ChatGPT thing where every time

9:51

you have it make a change in Codex or

9:53

you ask it something in chatgpt.com, it

9:56

will always ask you hey if you want the

9:58

next useful step is blank and it will

10:00

always effectively prompt you with the

10:03

next actionable step to be like hey you

10:05

can keep running if you just do this and

10:07

it said hey if the next useful step is

10:09

to add a .env.local.example

10:12

to make things easier so I was like yeah

10:13

that's a great idea let's just do that

10:15

so I had it add the .env.local.example

10:17

file. It got it set up in there.

10:18

Everything worked. And then after all

10:20

that, I realized, oh wait, I do actually

10:21

need webhooks. I'm kind of dumb here.

10:23

Let's just get rid of all that stuff,

10:26

stash it away, and then leave anything

10:28

unrelated to that change cuz I had other

10:30

miscellaneous changes unstaged within

10:32

this branch. And again, it was really

10:34

good about that. I told it, hey, if

10:35

there's anything unrelated, leave it as

10:37

is. And it did. A big change in the way

10:40

I've been using these things. And I

10:41

don't know if this is just the model or

10:42

if it's just me, but I've been much more

10:44

willing to just kind of let them handle

10:46

git for me. Let them handle stashing for

10:48

me. In the past, I would have really

10:49

wanted to do this by hand. At this

10:51

point, I trust the model enough. It just

10:53

works well enough that I just kind of

10:54

let it go do it. It did the stashing for

10:56

me. It didn't touch anything unrelated.

10:58

That was true. And it was all good. It

11:00

worked. I've really been trying and

11:01

racking my brain to find things that it

11:03

just egregiously does wrong over the

11:05

last week or so. I could easily find

11:07

examples of a lot of other models just

11:09

kind of flying off the rails, but the

11:10

best way I can describe it is this is

11:12

really the first model where I kind of

11:13

just trust it to go do the thing.

11:16

There's nothing super fancy about it.

11:18

It's nothing super insane. I didn't do

11:20

anything super difficult with this task,

11:22

but it just followed the instructions

11:24

perfectly. It made the change perfectly.

11:26

That is the best way I can describe this

11:28

model. It is the model that just works.

11:31

You don't have to think that hard about

11:32

it. You don't have to guard rail it too

11:34

hard. And having done a pretty insane

11:36

amount of work with this over the last

11:38

week or so, and these are just the saved

11:40

threads in here. I had to wipe all my

11:42

old threads when we were doing a data

11:43

model migration on the T3 Code alpha.

11:46

It's a good model. This is a very, very

11:48

good model. And on that topic, I've

11:49

started using this model for a lot of

11:51

things that aren't just code. This is

11:53

something that I've really started doing

11:54

recently, and this model has helped

11:56

kickstart more and more the way I use my

11:58

terminal and use my computer is just

12:00

letting Codex go do it. That's why I

12:02

like giving it full root access to

12:04

everything is because I can allow it to

12:06

do something like, "Hey, could you build

12:07

the desktop app and install the DMG onto

12:09

my computer, replace the existing T3

12:11

Code build there?" Because as Julius has

12:13

been building out the T3 Code alpha,

12:15

he's constantly shipping a bunch of

12:16

changes up to GitHub. We don't have a

12:18

full release pipeline ready for it yet.

12:20

So, I was like, "Okay, I want to get the

12:22

latest changes, but I don't want to have

12:23

to like manually run the build command

12:25

and then like grab the DMG and put it in

12:27

my applications folder and just like do

12:28

all this random stuff that I could do,

12:30

but I just don't care enough to do

12:31

because I can tell the model to do it."

12:33

And within 2 minutes, it'll just do the

12:35

thing. It'll build the project, grab the

12:37

DMG, put it where it needs to be,

12:39

restart my T3 Code instance, and it just

12:41

works. I've even gotten to the point

12:42

where I've had a lot of these where I'm

12:43

like, "Hey, can you pull the latest

12:45

changes from main, rebuild the project,

12:46

install the DMG as my T3 Code instance

12:49

on my computer and give me a list of

12:51

everything that changed from the pull

12:52

down. It just works super super well.

12:54

I've started building out this

12:56

experiments directory which I

12:57

effectively just let the model run and

12:59

use itself where whenever I have some

13:01

random idea or thing I want to make,

13:03

which we'll talk about those later in

13:04

this video, I will just open up a Codex

13:06

terminal in this experiments directory.

13:08

I'll be like, "Hey, make a new SvelteKit

13:10

project that does XYZ." I wanted a

13:12

random one-off script that would run in

13:14

the cloud so that whenever this new

13:16

monitor that Theo and I want to buy for

13:17

the studio comes up and is available on

13:20

Amazon because it's not clear when it's

13:22

going to come out yet, it'll give me a

13:23

Discord notification when that happens.
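The watcher it built might look something like this minimal sketch; the product URL, webhook URL, and availability heuristic are all placeholder assumptions, not the actual generated code:

```python
import json
import urllib.request

# Placeholder values -- a real run would use the actual product page and webhook.
PRODUCT_URL = "https://www.amazon.com/dp/PLACEHOLDER"
WEBHOOK_URL = "https://discord.com/api/webhooks/PLACEHOLDER"

def looks_available(page_html: str) -> bool:
    """Crude availability check on the raw page HTML (assumed heuristic)."""
    html = page_html.lower()
    return "add to cart" in html and "currently unavailable" not in html

def notify(message: str) -> None:
    """Post a simple message to a Discord webhook."""
    body = json.dumps({"content": message}).encode("utf-8")
    req = urllib.request.Request(
        WEBHOOK_URL, data=body, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req)

def check_once() -> None:
    """One polling pass: fetch the page, ping Discord if it looks purchasable."""
    with urllib.request.urlopen(PRODUCT_URL) as resp:
        if looks_available(resp.read().decode("utf-8", errors="replace")):
            notify(f"Monitor is in stock: {PRODUCT_URL}")
```

Run `check_once` on a schedule (cron, or whatever the cloud host provides) and the Discord ping arrives whenever the listing flips to purchasable.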

13:25

I just let it go make that, let it make

13:27

the project, let it do all those things,

13:28

let it deploy it. And it just kind of

13:30

worked. And I've even started to use it

13:32

for just randomly testing things. Where

13:34

in one of the projects I'm working on,

13:35

Better Context, it has a CLI. And the

13:38

CLI is kind of annoying to test manually

13:40

every single time I want to make a

13:41

change and make sure I didn't

13:43

accidentally break something important.

13:44

I've got unit tests in there, but like a

13:46

good smoke test is really nice. In the

13:48

past, I would just run it manually, but

13:49

I'd get kind of lazy with it until I

13:51

realized, oh wait, this model can just

13:53

run commands on my computer and act like

13:55

a human would on my computer. Why don't

13:57

I just give it a smoke test file which

13:59

has a list of all the commands that I

14:00

want it to run with the desired outputs

14:02

and the desired error shapes and all

14:04

that stuff. Let it run that, have it go

14:05

through and actually test the project on

14:07

my machine. And then at the end of it,

14:09

it gave me this nice little breakdown of

14:11

everything does seem to be working just

14:12

fine. And I'm like, cool, that's a

14:14

really nice sanity check. I can just let

14:16

this thing use my computer. And once you

14:18

kind of unlock that in your brain, these

14:20

models and harnesses get so much more

14:22

useful. I really think that these coding

14:24

agent harnesses right now they're really

14:25

good for writing code, but over time

14:27

they're going to just get to the point

14:29

where they're really useful for just

14:31

doing anything on your computer. I mean,

14:32

that's like what Claude Co-work is. It

14:34

is Claude Code with a nice UI on it to

14:36

do a bunch of commands under the hood to

14:38

just use your computer like normal. It's

14:40

really, really good. But none of that

14:42

actually matters without answering the

14:44

question of what have I actually built

14:46

with this thing? The right way to answer

14:47

that is to start with the thing that the

14:49

model's really, really bad at, and that

14:50

is UI. I was working on some browser

14:52

automation stuff because the browser use

14:54

capabilities of this model are really

14:56

really good. And this is the UI it came

14:57

up with for a live view of the browser

15:00

automation agent. Like it is unusably

15:02

awful and bad. It really likes to use

15:05

this weird singular style for a bunch of

15:07

random things. I've had it pop this out

15:09

for a bunch of internal tools and I

15:10

thought that maybe this was just like

15:12

because I had the front-end design skill

15:13

on and that was [ __ ] with it. So I

15:15

turned that off and I tried to just do a

15:17

sidebyside of okay, here's a very simple

15:19

UI prompt. I gave it to both Opus 4.6 and

15:22

GPT 5.4. See what the two options look

15:24

like. And the results were honestly bad.

15:26

Like this is the UI that came out of GPT

15:29

5.4. It very much has that GPT-5 UI

15:32

smell. This is what the GPT model UIs

15:35

have looked like since GPT-5, and that

15:37

smell has not gone away with the newest

15:39

model. It's got the weird text thing up

15:41

here that they all love doing. It's got

15:43

these weird rounded corners. This UI

15:44

just does not look or feel good. It's

15:46

very cluttered and gross. The one that

15:48

Opus created was this, which is just

15:50

infinitely better. This is such a better

15:53

chat UI that is actually usable and I

15:55

could build something on top of.

15:56

Obviously, it's not the most inspired

15:57

thing in the world, but with a little

15:59

bit of tuning and work, this could be

16:01

usable. This just can't. And I've had a

16:03

lot of examples like this. I'm working

16:04

on BTCA v2, and one of the things I

16:06

wanted to do was build out like a nice

16:08

research view for it. And this is just

16:10

terrible. Like, there is too much random

16:12

crap going on in this UI. It looks the

16:14

exact same as all the normal GPT UIs do.

16:17

It's not very readable. It just doesn't

16:19

feel good. It didn't even follow the

16:21

styles that I had established within

16:22

this codebase which I told it to follow.

16:24

The really interesting thing is that I

16:26

had 5.3 Codex do the same exact thing

16:28

but in the Cursor cloud agent harness.

16:31

So to be fair, this is definitely biased

16:33

because in the cloud agent harness, they

16:34

have access to a browser. So it was able to

16:36

see what it was actually doing as it was

16:38

happening and get some good back

16:40

pressure there versus just writing the

16:42

Tailwind classes and hoping they're

16:43

good. But still like this followed the

16:45

styles that I had already established

16:47

for the project. It's still very

16:48

cluttered. It's still kind of GPT-y. But

16:50

still this is much better than this. And

16:53

unfortunately I still think we are at

16:54

the point where GPT 5.4, as good as

16:57

it is for basically everything, still

16:59

can't handle UI super well. So there

17:01

still is a need to have something like

17:03

Claude on the back burner so that

17:06

anytime you need to make a page not look

17:08

like this and do some very specific UI

17:11

tweaks, you have a competent model for

17:13

that task. But once you get outside of

17:15

front-end problems, it's really really

17:16

good. A big thing I've been working on

17:18

is migrating the better context CLI and

17:21

server over to using the Effect v4 beta.

17:24

I want all of my services and projects

17:26

to be built on top of this going

17:27

forward. It is very much, in my opinion,

17:29

the right way to build anything that is

17:30

a CLI or a server at this point. And I

17:33

was shocked at how well it was able to

17:35

actually parse and use this thing. I

17:37

gave it access to the source code for

17:38

Effect v4 with the BTCA tool, but other

17:41

than that, all it had was just this link

17:44

and some basic instructions on how to

17:46

port it over, and it did a really good

17:48

job. It centralized the services

17:50

correctly. It used good Effect patterns.

17:52

There's still some weird holdovers in

17:53

the actual implementation from the old

17:56

version because the old version was very

17:57

promise-based and I want it to be all

17:59

like Effect-generator-based. So like

18:01

getting all of those boundaries hammered

18:03

out so that there aren't a bunch of

18:04

random awaits in the codebase is still a

18:06

thing. I still haven't gotten all of

18:07

those ironed out yet. You still have

18:09

to be in the driver's seat for that.

18:10

It's not magically doing all that for

18:12

me. But again, if it was in a better

18:14

harness that was designed to run for

18:16

very very long periods of time, maybe it

18:18

could have one-shotted it. This is the

18:19

kind of thing that I really want to test

18:21

once it's available in something like

18:22

the Cursor cloud agents or whatever. It

18:24

has been working very well for feature

18:26

additions and security audit type stuff.

18:28

One of the things I wanted to do was add

18:30

in support for GitHub private repos to

18:32

the BTCA web app. This is something that

18:34

a couple different people wanted. So I

18:35

was like yeah sure let's get that added

18:37

in. I gave it some basic information on

18:38

how to do it. We designed out how it

18:40

will actually end up working in the back

18:41

and forth. It was able to use Clerk to

18:44

reauthorize the GitHub instance to get the

18:47

user repo read only permissions that it

18:49

needed for this. And as a result, I'm

18:51

able to add in something like this

18:52

BTCAEX, which is a private repo. I have

18:54

just some internal experimentations

18:56

where now if I go in here, what is at

18:58

btca?

19:00

It'll just work. It'll actually be able

19:01

to load this into the sandbox correctly,

19:03

authenticate it with the proper

19:04

credentials being passed into the

19:06

sandbox. The way it was architected was

19:08

really, really good. It did a full

19:10

security audit to make sure that

19:11

everything was super sound in how this

19:13

was actually implemented. And you can

19:15

see it it works. It works really really

19:17

well. I also had it do a pretty heavy

19:19

security audit of all the endpoints and

19:22

queries and mutations and actions on

19:24

this app. And it found a couple actually

19:26

pretty novel edge cases that I had not

19:28

thought of that you could slip past the

19:30

guards that had already been implemented

19:32

from like the normal middleware clerk

19:34

guards on the back end to exploit random

19:36

things like the sandbox and the sandbox

19:38

URL and all this stuff. It found all of

19:40

those. It patched all of those and the

19:42

end result is very very good. This is a

19:44

great model for systems, security,

19:45

backends, CLIs: anything like that it can

19:48

handle very well. The only other thing

19:50

that it doesn't do super well, but

19:52

frankly no models do this well, is

19:54

actually, funny enough, crafting prompts

19:56

for itself to use. I wish I could show

19:58

it, but it's internal like alpha

20:00

documentation. So I don't want to leak

20:01

that hopefully. But when this goes live

20:03

tomorrow, all this is public and I can

20:05

link it down in the description. But

20:06

OpenAI gave us a really nice site that

20:08

has a bunch of references on how to

20:10

write good system prompts for it, what

20:12

the model behavior is, like, how it

20:14

works in different harnesses. And one of

20:15

the things that they suggested is adding

20:17

in these sections to your system prompt

20:19

to define the behavior that you want

20:21

from it, what its personality is, how it

20:23

should use its tools, what the output

20:25

needs to look like, how verbose it

20:27

should be, how it should follow

20:28

instructions, all that stuff. This stuff

20:30

needs to be handwritten. I tried to let

20:32

the model handwrite the system prompt.

20:34

And the difference between a system

20:36

prompt written by me and a system prompt

20:38

written by the model is night and day. I

20:40

was actually testing this within BTCA.

20:42

I'm about to ship an update which will

20:44

have a much better system prompt to get

20:45

much better results out of the model.

20:47

One of the things I did was add in those

20:48

really nice XML blocks and format the

20:50

prompt in a way that very clearly

20:52

directed the model on how it should

20:54

behave because this is a model that

20:55

follows instructions very very well. I

20:58

have this one on the right that is using

20:59

the new system prompt. This one on the

21:00

left that is using the old system

21:02

prompt. And when I hit enter on both of

21:03

these, it'll take a second to actually

21:05

come in. But the actual outputs you get

21:07

from the new system prompt are so much

21:09

better. They're much more thorough. They

21:11

include much better examples. It just

21:13

does a better job of actually running

21:15

when you give it better instructions

21:17

because this model, like all GPT models,

21:19

is incredible at following instructions.
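A sectioned system prompt of the kind described here might look something like this; the section names and wording are illustrative guesses, not OpenAI's actual template:

```xml
<behavior>
  Be direct and concise. Do not ask for confirmation before acting
  unless the action is destructive.
</behavior>
<tool_use>
  Prefer the fewest tool calls that answer the question. Cite every
  source you fetch.
</tool_use>
<output_format>
  Answer first, then a short list of sources with links.
</output_format>
<verbosity>
  Default to short answers; expand only when the user asks for depth.
</verbosity>
```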

21:21

If you tell it to just yolo do the

21:22

thing, don't think too hard, it'll yolo

21:24

do the thing and not think too hard, versus

21:26

if you tell it to be very thorough in

21:28

its research, make sure it's doing

21:29

proper citations. It'll end up taking

21:31

slightly longer, like this one did a few

21:33

more tool calls than this one did. But

21:35

when we look at the actual end results,

21:37

the new one is much more thorough and

21:39

much more correct, where this one

21:40

caught, hey, you can handle argument

21:42

validation failures at the endpoint

21:44

boundary of handle validation error

21:46

within the hooks.ts file, which is

21:49

basically just the middleware in SvelteKit

21:50

land, and it gave a very long, detailed

21:53

answer that includes really good

21:54

examples for all of the different ways

21:56

you can handle these things, including

21:58

that handle validation error down here

22:00

and all of the sources properly linked

22:01

to the original GitHub repo. This one

22:03

still gets the job done, but it's not

22:05

quite as good. And honestly, going

22:07

forward, one of the hardest and most

22:09

important parts of building a complex

22:11

agent is the system prompt. You do have

22:14

to be very clear on what it can do, what

22:16

it can't do, how it should behave in

22:18

different scenarios because this thing

22:20

will do exactly what you tell it to.
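The sectioned, XML-style system prompt described above can be sketched in a few lines. Everything in this example is illustrative: the section names and rule wording are my own assumptions, not OpenAI's exact recommendations or the prompt from the video.

```python
# A minimal sketch of an XML-sectioned system prompt.
# Section names and rule text below are hypothetical examples,
# not OpenAI's official recommendations verbatim.

SECTIONS = {
    "identity": "You are a concise research assistant for a SvelteKit codebase.",
    "tool_use": "Prefer reading files over guessing. Cite the file path for every claim.",
    "output_format": "Answer in short paragraphs with code blocks for examples.",
    "verbosity": "Be thorough in research, brief in prose.",
    "instruction_following": "Follow these rules exactly; do not improvise new behaviors.",
}

def build_system_prompt(sections: dict[str, str]) -> str:
    """Wrap each behavior rule in its own XML block so the model can
    clearly separate identity, tool use, formatting, and verbosity."""
    blocks = [f"<{name}>\n{text}\n</{name}>" for name, text in sections.items()]
    return "\n\n".join(blocks)

print(build_system_prompt(SECTIONS))
```

The point of the structure is simply that an instruction-following model treats each clearly delimited block as a distinct rule set, so behavior, tool use, and output format don't bleed into each other.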

22:22

That prompt I showed earlier was from a

22:24

browser use agent session that I had it

22:25

do. I wanted to see how well this thing

22:27

could handle working in a browser with a

22:29

bunch of like computery type tools and

22:31

it worked super well. I wanted to have

22:33

it search for flights from San Francisco

22:34

to Columbus just at a random date

22:37

towards the end of the summer just cuz I

22:38

was curious if it could do it and it did

22:40

it really well. You can see it's

22:40

inputting the text correctly there.

22:42

It'll input the text correctly here. It

22:44

does end up breaking, not because of the

22:46

model, but because the way the site is

22:48

coded, it just kind of breaks in weird

22:49

ways. That's not the model's fault.

22:51

That's something that we can tune out.

22:52

The important part is that it can

22:54

operate in a wide variety of harnesses

22:57

very, very well. Another thing I was

22:58

pretty impressed it handled:

23:00

Maria, who's been helping out with my

23:01

channel and Theo's channel, sent me this

23:03

image of like a feature that she wanted

23:05

to get added to Pick Thing for like

23:07

thumbnail face stuff. And it was just a

23:09

bunch of like random red text and a

23:11

bunch of random images with arrows and

23:13

like it's kind of a funny looking

23:15

drawing. And I was really curious if I

23:17

just give this drawing with follow the

23:19

instructions on this image and implement

23:20

it to the model. Could it do it? And you

23:22

can see I gave it this, and it did a bunch of

23:24

work here. I tested the actual thing. It

23:26

works perfectly. I shipped it. It's not

23:28

the prettiest thing in the world. Like

23:29

this definitely could be better. But

23:31

like the features implemented, it works.

23:33

It's just fine. it was able to extract

23:35

all the information it needed directly

23:37

from that very random image to actually

23:40

make a useful feature. There's a bunch

23:42

of other stuff I had it do and to avoid

23:44

bloating the runtime, I'll skip over

23:45

those, but I had it make a Semox

23:47

alternative just to test if that was

23:49

even possible and look at some different

23:50

UX flows. It was able to basically

23:52

oneshot that perfectly, get the Swift

23:54

app and the Ghostty rendering all working

23:56

in a really nice way. It was able to

23:57

make my Codex usage script, which is

23:59

the thing that I use to figure out how

24:01

much I've been using the new model on my

24:03

computer. It was able to put together a

24:05

React Native Plus Expo mobile app very

24:08

easily that I was using for a bunch of

24:10

random personal finance stuff, which is

24:11

why I unfortunately can't show that one.

24:13

It's pretty cool. I don't want to leak

24:14

any of that information. It's a

24:16

really really good model. The last thing

24:17

I wanted to talk a little bit about was

24:19

just how it feels within ChatGPT

24:21

because as you've seen through this

24:23

entire video, this thing is a monster at

24:25

instruction following. My

24:26

personalization is set to be candid,

24:29

less warm, not that enthusiastic.

24:31

Headers and lists are fine, no emojis. I

24:34

don't like any of that stuff. Like, I

24:35

wanted to get the ChatGPT voice out of

24:37

it. And it's mostly gone. The weird

24:40

speech patterns that you're probably

24:41

used to if you use the ChatGPT app a

24:43

lot with like 5.2 or any other model.

24:46

They're pretty much gone from this

24:47

model, which feels really good. I'm glad

24:49

that that's not a thing I have to deal

24:50

with anymore. I like my models being

24:52

much more pragmatic and to the point

24:54

with the way they talk to me and the way

24:56

they do things. This is definitely not a

24:58

sycophantic model at all. I have had

25:00

this thing tell me multiple times flat

25:02

out you're an idiot. No, do XYZ or no

25:05

that doesn't make any sense. Do this or

25:07

that. It is very happy to disagree with

25:09

you and it does not take [ __ ] This

25:11

is not another 4o, which I know a lot of

25:13

people are not going to like. But if you

25:15

are using these things for work, this is

25:16

an incredible work model. An incredible

25:18

like actual useful research assistant

25:21

type thing. I'm very happy with it. I

25:23

want to close this by talking about what

25:25

the general landscape of models looks

25:27

like right now. Especially now that GPT

25:29

5.4 is out and it is released in the

25:31

API. It's released on chatgpt.com and

25:34

it's released within Codex. They did it

25:36

right this time. I'm very happy about

25:37

that. You can just use it everywhere now

25:39

and I will be using it everywhere now.

25:41

This is getting very close to being the

25:43

god model. I think we are rapidly

25:45

heading to the point where the old split

25:47

of models where like you had Haiku,

25:49

Sonnet, and Opus, they all kind of serve

25:51

their purpose. I think that's kind of

25:52

just going away. This model, if it's

25:54

priced the way I assume it is, which is

25:56

just the same as GPT 5.3 Codex, I don't

25:58

see why it wouldn't be, is very

26:00

reasonably priced. It's very efficient

26:01

with its token usage. Its reasoning is

26:03

not all that bad. You can use this for

26:05

very small one-off tasks that you would

26:07

use a Haiku for, and you can use it

26:09

for massive 5-hour long background jobs

26:11

all on the same model. The only places

26:13

where this model has really felt bad to

26:15

me is in UI and in making system prompts

26:18

and agent harnesses for other models,

26:20

but that's always been a thing. There is

26:22

no model that can write a good system

26:24

prompt for another model. You have to do

26:25

those by hand. There's been a lot of

26:27

research about this. That is like one of

26:28

the most important parts to be paying

26:30

very close attention to if you are

26:32

building anything involving AI. But like

26:34

honestly, if we got GPT 5.5 in a couple

26:37

months and it's just as good at UI as

26:39

Opus 4.6 is while still being better at

26:41

everything else and maintaining its

26:43

current characteristics, I don't really

26:44

see a world where I would use literally

26:46

any other model for anything else. We're

26:48

getting into speculation land, so take

26:50

all this with a grain of salt.

26:51

Predicting the future is a fool's

26:53

errand, but we're going to do it now cuz

26:55

it's fun. I really think we are going to

26:56

get to the point where one model is

26:59

going to win literally everything. It is

27:01

just going to be better than all the other

27:02

models at all of the other tasks. And

27:04

realistically speaking, I think we're at

27:06

the point where OpenAI is clearly on the

27:08

path to making it. They're pretty damn

27:10

close with this one. Anthropic is kind

27:12

of getting there with Opus, but it's

27:14

still just not. And I still actually

27:16

find myself kind of liking the way

27:17

Sonnet feels. There's still more of a

27:19

discretion between their different

27:20

models, but this is the first one where

27:22

like it's all unified into one.

27:24

Everything from chatting with it for a

27:26

normie to heavy agentic work is all

27:29

wrapped up in one model, GPT 5.4. It is

27:32

the single model for everything. And

27:34

honestly, my biggest hope for the future

27:36

at this point is I kind of hope Google

27:38

gets their [ __ ] together so that OpenAI

27:40

has some more real competition cuz right

27:42

now the only competition OpenAI actually

27:44

has is anthropic. There just isn't

27:46

anything quite at the level of GPT 5.4

27:49

and Opus 4.6. And I can't really imagine

27:52

anyone else making anything quite this

27:54

good. It is impossible to put into words

27:56

exactly how good this model is and

27:57

exactly how good it feels without just

27:59

using it. Give it time. Give it a bunch

28:01

of different prompts. Give it a bunch of

28:03

different things to work on and you'll

28:05

just get a feel for how it actually

28:06

works. And I think slowly but surely

28:08

realize that this is the model that can

28:10

do basically anything. And that means we're

28:13

heading into weird, weird times. I don't

28:16

really have anything else to say on

28:17

this. I'm sure I will in the future

28:18

after I've done more work with it and

28:20

put it in more complex harnesses. I

28:22

really like this thing. You should go

28:23

try it. If you enjoyed this video, there

28:25

are more on screen that you should

28:26

probably click on. They're probably

28:27

pretty good.
