
Delete your CLAUDE.md (and your AGENT.md too)

29m 18s · 6,360 words · 891 segments · English

FULL TRANSCRIPT

Are you really an AI engineer if you haven't put a ton of time into your AGENT.md or CLAUDE.md files, or both? Like, really, come on. Everyone's doing it. It has to be good, right? Well, what if I told you that a study just came out all about those CLAUDE.md and AGENT.md files, and, uh, the numbers weren't good. They were actually quite bad. Here we see comparisons across Sonnet 4.5, GPT-5.2, 5.1 Mini, and Qwen 3, and when given an AGENT.md or CLAUDE.md file, they consistently performed worse.

This is a thing we should probably be talking about, right? Like, I've been told so many times that I'm having prompt issues because I didn't write an AGENT.md or CLAUDE.md. They're so important. Every codebase needs them. Everybody's publishing their own rule files and skills and all these things. It'd be pretty bad if it turned out those things were making the tools worse, right? Well, that's a thing we're going to have to dive into, because on one hand, a lot of people are using these files wrong, and on the other hand, it's likely they shouldn't be used at all in many cases, because they steer models incorrectly.

This is going to be a fun deep dive on context management and best practices for using AI to code and build real software. A lot of this is coming from my own experience, which is admittedly subjective, but it's really cool to see it coming out in a study. It's been awesome to see studies like this popping up recently: things like figuring out whether these AGENT.md files are actually useful, SkillsBench figuring out how useful skills are as a thing the models have access to while they work, and even benchmarks trying to figure out why models are more likely to get a question right if you ask them the same thing twice. There's a lot of fun stuff to dive into here, and it's all about context management. I do actually think this video can help you get better at using AI to code.

That all said, I'm about to do a lot of work for free that OpenAI and Anthropic probably should be doing. Neither is paying me for this, and the team needs to get paid. So we're going to take a quick break for today's sponsor. If Clawdbot, sorry, OpenClaw, has proven anything to be true, it's that AI is way more powerful when you give it a computer of its own. Which is why I'm so excited about today's sponsor. And no, it's not a Mac Mini.

It's Daytona. So, is it another GPU cloud? No, it's way better than that. It's elastic containers for running your agents in. If you want agents that are able to do things like edit code, write code, run code, make file changes, edit things in Git, and all the stuff you would do on a computer, Daytona has you covered and then some. Here's how easy it is to set up a secure sandbox with Daytona. You create a Daytona instance with your API key. You define a sandbox. Then there's this, I would argue optional, try/catch wrapper: `sandbox` is `await daytona.create(...)`, and `response` equals `await sandbox.process.codeRun(...)`.
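Pieced together, the flow being narrated looks roughly like this. This is a sketch based on Daytona's TypeScript SDK as described in the video; treat the package name, option names, and response fields as assumptions rather than a definitive reference:

```
// Assumed package name and API shape — verify against Daytona's docs.
import { Daytona } from "@daytonaio/sdk";

// Create a client with your API key, then spin up a sandbox.
const daytona = new Daytona({ apiKey: process.env.DAYTONA_API_KEY });
const sandbox = await daytona.create({ language: "typescript" });

try {
  // Run TypeScript code inside the remote sandbox.
  const response = await sandbox.process.codeRun(
    'console.log("hello from the sandbox")'
  );
  if (response.exitCode !== 0) throw new Error(response.result); // error if it failed
  console.log(response.result); // show the result if it didn't
} finally {
  await sandbox.delete(); // sandboxes spin up and down instantly, so clean up
}
```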

Then you pass it TypeScript code. It executes, you get a response, you throw an error if it failed, and you show the result if it didn't. It has never been easier to set up a remote box for your code to run in. And don't worry, they have Python bindings as well for you Python people. But what about other languages? Well, I have good news, because the snapshot can be anything. As long as it runs in Docker or Kubernetes, it can probably run on Daytona without issue, with all their crazy benefits, from the networking to the memory to the storage. Insane. I just learned about this while filming: apparently, they now have full computer-use sandboxes with virtual desktops on all major OSes. I did not know they did this. That's really cool. It suddenly makes a lot of sense why all of the companies I talk to that are doing things like mobile app deployments suddenly have support for cloud-based builds. They're all probably using this macOS automation.

Crazy, especially when you realize how absurdly cheap the platform is. We're talking 5 cents per hour of compute, 1.66 cents per hour of memory, and a basically impossible-to-measure cost per hour per gig of storage. And remember, everything spins up and down instantly, so you're only paying for the compute you're actually using, making the costs hit the floor really fast. I'm going to be real with you guys: if you need a sandbox, you should use Daytona. And if you don't, you'll probably need one soon. Check them out now at soy.link/tona.

Let's dive into this study, 'cause I'm actually really excited. "A widespread practice in software development is to tailor coding agents to repositories using context files, such as an AGENTS.md, by either manually or automatically generating them. Although this practice is strongly encouraged by agent developers, there is currently no rigorous investigation into whether such context files are actually effective for real-world tasks. In this work, we study this question and evaluate coding agents' task-completion performance in two complementary settings: established SWE-bench tasks for popular repos with LLM-generated context files, and a novel collection of issues from repositories containing developer-committed context files." Interesting.

4:25

Should probably give some context on

4:26

what an agent MMD is and what they're

4:27

talking about with the init. Let's start

4:29

with the T3 chat one. This file provides

4:32

guidance to cloud code when working in

4:33

the repo. We should probably remove that

4:35

part cuz this is actually an agent MD

4:36

that we sim linked the cloud MD to. You

4:38

can tell a lot of this was generated and

4:40

probably isn't useful. Always use PMPM

4:42

to run scripts. We describe the common

4:44

scripts dev lint lint-fix check test

4:48

watch vest and pmppm generate. This is

4:51

all pretty basic stuff from our package

4:53

json. But here you'll see how a little

4:55

bit of how I use things leaking. Spoiler

4:57

for the future. Do not run pnpm dev

4:59

assume already running pnpm build ci

5:02

only. Interesting. I wonder why that's

5:04

there. We then have an architecture

5:05

overview telling it where things are. We

5:07

have the front-end folder, the backend

5:09

folder, the shared folder, app, and then

5:11

the convex folder for all of our convex

5:12

stuff. We also describe what services

5:14

we're using for what things. And one of

5:16

the mistakes we have in here that I need

5:17

to change is putting TRPC here confuses

5:19

the model a bunch, and it tries to use

5:20

TRPC in places where it shouldn't. I

5:22

guess I'm doing an audit as well as this

5:25

in this video. We have key patterns for

5:27

things that we do and how we recommend

5:28

doing them. Code style stuff, talking

5:30

about how we use effect, stuff like

5:31

that. Follow the convex guidelines. Do

5:33

not use the filter and convex queries.

Use .withIndex() instead. Always include return validators on Convex functions. Use internal queries. Also, I don't like return validators always being included; I was wondering why this codebase did that. It's apparently specified in there. I guess someone else on the team likes that. To each their own. Then there's "Additional information": this is the part I wrote, which we'll talk more about in a bit.

It's probably important to know what this file even does, though, 'cause this is just like an overview of the repo, right? The way these work is, to put it simply, context management. When you make a prompt to an AI system of some form, that prompt is not the only thing the model is getting. The way most people think of AI is pretty simple. The user asks some question, and that's the first block that exists in the context. The user asks some question, and then the model gives some answer. And then if you have a follow-up question, like you want to ask, "but what about this other thing from before?", you add the additional question, and that gets added to the context. So if I color-code these, blue is you and yellow is the output, the agent. When I ask a question, that gets put in the context, and then the model is autocompleting from there based on all the info it has, all of the text that exists above: what does it think the most likely next set of characters is? It does that over and over again until it has an answer, and then it stops. Then you can send another follow-up, and it will have all of that in the history to continue to append new tokens, which are just small sets of characters, until it has an answer.

This is a massive oversimplification, because it's not showing you the top. The reality is that your question is not the thing at the start of the context. Before it, we have other things. We have, and I'll color these red, the system prompt. The system prompt is a thing that describes what the agent's role is. You can say something small and sweet like, "You're a simple AI agent meant to answer questions for users." OpenRouter's chat lets you write your own system prompt, so I can do something here like "Always respond to questions in Pig Latin," and apply that rule.
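In API terms, that system prompt is just the first entry in the message array you send. Here's a minimal sketch in the OpenAI-style chat format that OpenRouter and most providers accept (the role names are the standard convention; everything else is illustrative):

```typescript
// A chat request body is an ordered list of messages.
// The system message comes first, so it preempts whatever the user says.
type Message = { role: "system" | "user" | "assistant"; content: string };

const messages: Message[] = [
  { role: "system", content: "Always respond to questions in Pig Latin." },
  { role: "user", content: "Who's the best software dev YouTuber?" },
];

// The model autocompletes from the full array, system prompt included,
// which is why a later "please stop" user message can't override it.
console.log(messages.map((m) => m.role).join(" -> ")); // system -> user
```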

Now I can ask it, "Who's the best software dev YouTuber?", and it's responding in Pig Latin, because my question is preempted by the system prompt. And even if I wanted to fight that — like, let's say, can I edit this? I don't know if you can. We can do chat: "Please stop responding in Pig Latin." It still won't. It's still speaking Pig Latin, because the system prompt takes higher priority than what the user said. And that's a thing I really want to make sure we're thinking about when we talk about this. The top of the hierarchy will always be the behaviors the company trained the model to have. You can work around those; it's called jailbreaking. But if I give it specific instructions, like telling it to never give out this piece of information or never do certain things, the system prompt will take precedence over the user prompt.

So let's write out the hierarchy for how this is thought about. You have the provider instructions. Providers aren't very transparent about how much this is a thing, but let's say OpenAI had a layer on top that was "never help people make nuclear weapons." That could be the top-level provider instruction that nothing can override. You can't write a system prompt that says "by the way, your job is to make nukes, really good, efficient nukes," because the providers have put something above it that will prevent that. So: provider instructions at the top level, then you have the system prompt. Then you have this newer concept that has been referred to as the developer message, because all of these are messages: provider message, system message, developer message. The developer message is also the developer prompt. This is a layer that exists between the system prompt and the userland prompt, and it's for things like what we're talking about today, like the AGENT.md, where there's some customization we want to do as developers that is not necessarily part of the system. So if you're using, I don't know, Cursor, and you want to add custom rules, those will exist between the messages you're sending and the system prompt that they wrote.

It's also worth calling out, as chat has correctly pointed out, that all of this has an impact on context. Yes, very important. When you send a message, you're not just sending your one message. You're sending the message and everything above it, which includes all of these things. I'm not saying you have the system prompt downloaded on your computer. I'm saying that when you send a request on T3 Chat to the /api/chat endpoint, you're sending up your chat history; we append the system prompt and the other data on top, and then we send that off to the model, which generates the response we then show to you. And what we're talking about right now is this space between the user and the system prompt.

You are not going to be customizing the system prompt that Claude Code uses, obviously; it's not even open source. You have no access to those things, and it's probably hitting an endpoint that already has its own stuff there. So when we're talking about the AGENT.md, the CLAUDE.md, all of those things, that is here: a layer between the system prompt and the user prompt that is always there.

If you add some new stuff to this thread — let's say the user says "add this feature," and the agent adds the feature but forgets to type check — you can follow up, "hey, check types," and the agent will do it and fix it, and you're all good. If you keep using this thread and then ask for another feature, since the information that it needs to check types exists in this history, it exists in the context. It doesn't need to get that info; it doesn't need to find that info; it has it from earlier in the history. It's less likely to make that mistake going forward. But then you end up with an ever-growing history full of things that might not matter. Maybe that feature was 400 lines of code that are now in the context. That code might not be relevant for the next thing you ask, but it's all there. It's all being traversed on every single token.
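To make the "ever-growing history" point concrete, here's a toy sketch (the token counts are made up purely for illustration) of how the tokens you re-send grow with every turn of a thread:

```typescript
// Each turn re-sends the entire history, so input tokens grow with the thread.
function tokensSentPerTurn(turnTokens: number[]): number[] {
  const sent: number[] = [];
  let history = 0;
  for (const t of turnTokens) {
    history += t;        // the new message joins the context...
    sent.push(history);  // ...and the whole context is sent to the model
  }
  return sent;
}

// Four turns of ~equal size: the fourth turn re-sends 4x the first.
const perTurn = tokensSentPerTurn([500, 500, 500, 500]);
console.log(perTurn); // [500, 1000, 1500, 2000]
```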

It's all costing money, and most importantly, it's distracting the model from the thing you actually want it to be doing. And that's a thing I really want us to think about when we talk about all of this: help the model, don't distract it. You know how much we all hate having endless meetings as developers? We don't need to know all the intricate details of the five versions that product and design went back and forth on before we go implement it. We're in the meeting anyway. Why the [bleep] do we think the AI likes it more than us? Why are we giving it all of this useless information? We want the AI to do a thing; it should only have to think about the thing. Part of this is how we design the systems we're using. Part of this is how we write the prompts, how we handle the AGENT.md, how we handle all these things. But if you tell the agent about all of these things that exist in your codebase, it's probably going to think about them, even when you don't want it to.

To go back to our AGENT.md: mentioning that we use tRPC on the back end is now going to bias it towards using tRPC, even though we only use it for a handful of legacy functions. Almost everything is now on Convex. Not only does it know we have tRPC, we actually put it in front of the Convex part, so it is much more likely to reach for tRPC where it might not make sense. I am going to remove this and make a separate "legacy technologies" section to put it in. But that's what we want to talk about here: the best time to update your AGENT.md isn't when you start a project, and it is certainly not when you clone something for the first time and type /init, where it will initialize the CLAUDE.md for you and choose what it thinks matters. If the model knows about something already and can find it quickly, it probably does not belong in your AGENT.md.

Great example from chat here: don't think about pink elephants. You're all now thinking about pink elephants. That's just how it works. That's how brains work, and that's how LLMs work too. If you tell it not to do a thing, it is now thinking about that thing. Ideally, you just make it hard to do the thing. And you certainly don't want to tell it about things that don't matter, because they will be in context, and whatever you put in context is much more likely to happen. It's all an autocomplete machine.

So, to go back here: if we want the model to know that it needs to check types, and we notice it wasn't doing a good job at that, there are a couple of options we have.
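The first option can be as simple as folding type checking into a script the agent already runs. As a sketch with a hypothetical package.json (script names are made up; adapt to your repo):

```
{
  "scripts": {
    "typecheck": "tsc --noEmit",
    "test": "pnpm typecheck && vitest run",
    "check": "pnpm lint && pnpm typecheck"
  }
}
```

Now any agent that runs `test` or `check` gets type errors surfaced for free, with no rule file needed.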

Option one: look through the things it did, figure out what it ran, and maybe attach type checking to one of those parts. If it ran some command that doesn't include type checking, maybe update that command to also include the type checking. If you try that and it doesn't work, or it doesn't make sense, that's when you make changes to the file. If you notice the agent consistently forgets to do the type checking, put that in the AGENT.md; tell it "you should type check all of your changes." I'm going to run an init on a real project here. This is Lawn; it's my alternative to Frame.io for doing video review stuff. Apparently it inited one at some point with all of the design language, so I'm going to stop that, delete that, /init.

Okay, you know what? We'll come back to this, 'cause it's going to take a sec. We'll go through all the things that should and shouldn't be in your AGENT.md in a bit, but I want to spend a little more time on the study, because you just listening to me means a lot and all, but we should probably have numbers to back this up. The work of this paper is to benchmark context files and their impact on resolving GitHub issues. They're investigating the effect of actively used context files on the resolution of real-world coding tasks, evaluating agents both in popular and in less-known repositories and, importantly, with context files provided by repository devs. They tested three conditions: one where the developer wrote and provided an instruction file for that repo (so they're testing against real repos), one where they removed it to see how the agent would do, and one where they let the agent generate its own instruction file before continuing. And they checked: did it succeed at the task or not?

In the things they tested, they observed that the developer-provided files only marginally improved performance compared to omitting them entirely, an increase of 4% on average, while the LLM-generated context files had a small negative effect, a decrease of 3% on average. These observations are robust across different LLMs and prompts used to generate the context files. "In a more detailed analysis, we observe that context files lead to increased exploration, testing, and reasoning by coding agents, and as a result increase costs by over 20%. We therefore suggest omitting LLM-generated context files for the time being, contrary to agent developers' recommendations, and including only minimal requirements like specific tooling to use with the repository." I fully agree.

To prove this out, I ran an init on a real project that I've been working on. It's called Lawn; it's an alternative to Frame.io that's going to be open source soon, just a way to do video review for my team. And I had it init a CLAUDE.md. Let's see how it did.

"This file provides guidance to Claude Code (claude.ai) when working with code in this repo." That's the intro it uses on all of these; it used it on the other ones as well. Lawn's a video review platform for creative teams: users upload video, leave timestamped comments, and manage review workflows within team and project hierarchies. It shows all of the commands it can run. It shows the architecture: frontend is TanStack Start in SPA mode with React 19 and Vite; backend is Convex, with functions living in /convex — auth, video pipelines, storage, all the usual stuff. It has a pile of key patterns for aliasing, route data, auth guards, Convex actions, yada yada, and a very vague description of the data model and video workflow states. I don't think there's anything in here that will actually help at all, straight up.

To be more bold with how I think about this: if the info is in the codebase, it probably doesn't need to be in the AGENT.md file. Generally speaking, these models have all been RL'd to hell and back on doing bash calls and using the tools provided to them in really long threads.

These models are good at finding information in a codebase. If I paste it a screenshot with some broken UI and say "fix it," without even having an AGENT.md, it will look for strings that are likely to be specific to that UI. It will rg around until it finds them in the codebase. It will check to make sure nothing else is using the thing. It will make the change, it will tell you it's done, and then you're good. Turns out these models are really good at doing things like figuring out which files and folders matter for their task. They're really good at figuring out what commands they can run, 'cause they check your package.json. They're good at figuring out what dependencies you have when they check the package.json, as well as the files that are using them. Funny enough, this also causes them to struggle a bit when they don't have those things. Like, when I was initing a new project and I hadn't even set up the package.json yet and I told it to use environment variables, it tried importing things it didn't have, because the project hadn't been inited yet. There are assumptions these tools make, but the assumptions they're making are based on real-world codebases, and you're probably working in one of those. Thereby, they're good at this.

So, what do you put in? As I mentioned before, when there are behaviors the models and agents are exhibiting that are not ideal, that's when you spin up the AGENT.md file and start steering it in the direction you want. If it's consistently not running type checks and you want it to, maybe that fits in there. If there's a specific pattern it's using with one of your dependencies that is wrong, and it keeps trying to do it over and over again, tell it not to. Generally speaking, it's rare that I find the AGENT.md or CLAUDE.md files to be the thing you need to reach for. You have to start building an intuition for what the models are doing and how long they should take. If you ask an agent to complete a task and it is faster than you expected, you're probably setting things up well, and that's a good thing to hear. If you're asking the agent to do something simple and it takes a long time, that means some changes need to be made. Generally speaking, the hierarchy of where I look to change things does not start in the AGENT.md file. It starts in the codebase itself.
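In that spirit, and in line with the study's suggestion to include "only minimal requirements like specific tooling," a pared-down AGENT.md might be as small as this (hypothetical contents, shown only to illustrate the shape):

```
# AGENT.md

- Use pnpm, never npm or yarn.
- Do not run `pnpm dev`; assume the dev server is already running.
- Run `pnpm check` (lint + typecheck) before declaring a task done.
```

Everything else — folder layout, dependency lists, architecture summaries — the agent can rediscover on its own, and that self-discovered version can't go stale.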

If the models are struggling to find something, it's probably in a bad place; you should move it. If the agents are struggling to use a tool properly, it might not be the right tool for the job, or it might be shaped in a way that is confusing for the model; fix it. If the agent is changing files in one place that are causing other things to break, you should probably move off of Opus and give Codex a shot. But seriously though, it probably needs better feedback systems to identify when that failure occurred, so it knows the change here broke the thing over there. And making sure the agents have the tools they need to unblock themselves is essential. I think it would be a much better use of your time to make better unit tests, integration tests, type checks, and those kinds of things you can expose to the model than it would be to update your AGENT.md or CLAUDE.md file, the majority of the time. If you can make it easier for the model to do the right thing, make it harder for it to do the wrong thing, and have your whole codebase architected to steer it in the right direction, that's going to be a much, much bigger win. The AGENT.md is almost like a band-aid solution; you're patching over the problem with it. If you have tried and failed to structure the codebase in a way that the agent can manage, you should probably pull up the AGENT.md as an interim solution until you find better tech that the agents are better with.

And as I was hinting at before, the biggest thing is just: read the outputs. Like here, I ran the /init command. It searched for six patterns and it read 21 files. Let's see what the files it read were. It read the package.json. It read the README.md. It searched around the codebase for `*`. Interesting choice; at 100 files, I'm guessing that's just to list all the files. This is its hack for figuring out the structure of the whole codebase.

Then it searched for files matching the pattern app/**/*.ts or .tsx to find all the files there. It did the same for convex, did the same for general source, found the Convex schema, found the app routes, found the Vite config. It just read all of these things, and after reading all of that, it concluded it had a good understanding of the codebase, and it wrote this. But remember: what it wrote is based on things it was already able to find. In fact, it found all of that and wrote all of this in just over a minute. That means almost none of this info is useful.

Chat is making some important points here, one of which is a misconception I had: "But every time it needs to read all of that, it starts from no memory." Yeah, kind of. When given the task of summarizing the entire repo, it's going to touch everything. But here, I'll give you guys a quick experiment. We are going to delete that file entirely; the CLAUDE.md is now gone. We are going to run Claude Code here and ask it a question about the project: "What optimizations can we make for the video pipeline in this app?" And it knows nothing about this app. There is nothing being fed into its context ahead of time. All it knows is that it is Claude Code, and it is in a project that has files in it.

I'm asking it this question; let's see how it performs. (Did they really add the cheesy birthday hat that stopped animating that quickly? Hilarious.) And now it's exploring. We can press Ctrl+O to see what it's doing. It looks like it's exploring pretty damn fast: "Explore the video pipeline in this codebase thoroughly. I need to understand how videos are uploaded, processed, and stored; the schema for videos in the database; video actions and processing," all that. And it spun up a subagent to go explore and find this information. Note that this information is different from the information it would have gotten from the AGENT.md. We'll see how long this takes, and then we'll rerun it with that file restored. So that took a minute and 11 seconds and got some decent answers. So let's try that again, but with the file that it generated, asking the exact same question, and we'll see how it differs. Oh. Huh.

22:04

Even though I have this Claude MD file,

22:08

it appears to be doing pretty much the

22:10

same thing except it specifies the names

22:12

of files a little bit earlier.

22:15

Interesting. So again, for comparison

22:17

here, it said explore the code base for

22:19

this, how videos are uploaded, yada

22:21

yada. It does know about the schema

22:22

file, the video actions in video. Don't

22:24

know how it knew about that. Probably

22:25

found it in an earlier step that's being

22:27

hidden. But once you have the agent MD,

22:29

it is much quicker to identify the names

22:32

of files. Benefits and negatives there.

22:34

We'll talk about both momentarily. Looks

22:36

like the timer froze. Cloud code quality

22:38

is great. Yeah, this timer freezes

22:40

whenever you go to this view and back.

22:42

Ah, that's hilarious. When I switch and

22:44

go back, it updates, but it's not

22:46

updating live. They made some

22:48

optimization for performance, and it's

22:50

breaking things. Cool. Check that out.

22:52

It took more time. The agent MD run took

22:56

1 minute and 29 seconds, and the version

22:59

without it only took a minute and 11.
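For what it's worth, here's the quick math on that one-off. The times are my measurements from these two runs; the script is just a sanity check on the percentage:

```python
# One-off timing: 1:11 without the agent MD, 1:29 with it.
without_md = 71  # seconds, no CLAUDE.md / agent MD in the repo
with_md = 89     # seconds, freshly generated CLAUDE.md present

slowdown = (with_md - without_md) / without_md * 100
print(f"agent MD run was {slowdown:.0f}% slower")  # → agent MD run was 25% slower
```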

23:01

And that is with a brand new freshly

23:03

minted-from-the-init-command Claude MD

23:06

file. And now just hypothetically

23:08

speaking, let's pretend that this

23:11

codebase, for whatever reason,

23:13

changed. Maybe, just maybe, these video

23:16

action files aren't the only place that

23:19

matters anymore. And maybe, just maybe,

23:21

somebody forgot to update that MD file.

23:24

Now, not only is it not helping, it's

23:27

probably actively hurting because just

23:29

like all other docs, agent MD files will

23:32

go out of date. So, what would you

23:34

prefer? Letting the agent do it itself

23:36

or steering it at a 25% time penalty,

23:40

with the likelihood of that going out of

23:42

date. Yeah, not good. Sure, it's easy to

23:46

dismiss a single try. You could ask the LLM

23:47

the same thing 10 times and get

23:49

different times without changing

23:49

anything. Now, if only somebody had

23:52

published a study that showed the exact

23:54

same results consistently. You know, the

23:57

increased cost of over 20%. It's

23:59

almost like cost, context, and time

24:02

spent have a lot of overlap. And the 20%

24:05

number I saw in my one-off test happens

24:08

to, for whatever reason, line up really

24:10

well with the 20% that the study had.

24:12

Crazy. Definitely a one-off thing I just

24:15

experienced and not the consistent

24:17

reality that I just managed to

24:18

demonstrate in one shot. Definitely not

24:20

that. Here's another great example from

24:22

chat from Lincoln. I had issues with

24:24

project structures being outdated in the

24:25

Agent MD file. The models were

24:27

consistently placing files in the wrong

24:29

location. Yep, I've had so many problems

24:31

that turned out to be something in the

24:33

Claude MD or Agent MD that should have

24:35

been changed forever ago. Happens all

24:36

the time. And remember, all of these

24:38

tests were with freshly generated

24:41

context files that the agents made right

24:43

before doing the task. So there was no

24:44

way it could be out of date. Outdated

24:46

context files are going to cause you way

24:48

more problems. So how do I use these?

24:51

What is my philosophy? Well, the core of

24:53

it is I use these files to steer the

24:56

model away from things it is

24:57

consistently doing wrong. I am surprised

25:00

at how rare that is nowadays. I find

25:02

with every new model release, I can

25:03

delete more and more of the agent MD.

25:05

I'll sometimes when trying a new model

25:06

just delete it entirely and see what

25:08

changes and then bring back the parts

25:10

that matter. My little hack I recommend

25:11

and I brought this up in other agent

25:12

decoding videos. I'll just show you it.

25:15

I'm going to delete all of this cuz it's

25:16

garbage. The role of this file is to

25:18

describe common mistakes and confusion

25:19

points that agents might encounter as

25:21

they work in this project. If you ever

25:22

encounter something in the project that

25:24

surprises you, please alert the

25:26

developer working with you and indicate

25:27

that this is the case in the agent MD

25:29

file to help prevent future agents from

25:31

having the same issue. To be very clear

25:33

about this, the instruction I'm giving

25:35

here is not what I actually want it to

25:37

do. I don't want the agents constantly

25:39

changing the claude MD or agent MD

25:42

files. What I do want them to do is try

25:45

to change the thing when it gets stuck.
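As an actual file, the sketch I dictate there looks something like this (paraphrased, not a verbatim copy of mine):

```markdown
# AGENTS.md

The role of this file is to describe common mistakes and confusion
points that agents might encounter as they work in this project.

If you ever encounter something in this project that surprises you,
alert the developer working with you, and note it in this file to
help prevent future agents from having the same issue.
```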

25:48

Because most of the time the agent gets

25:50

stuck on something or finds something

25:53

surprising or confusing, it's not

25:55

something I want it to know about. It's

25:56

something that I want to go fix. So, I

25:58

try to sneak this into all of the agent

26:00

MDs for all of the projects I'm actively

26:02

working on, especially in the earlier

26:04

stages to figure out what the agent does

26:07

and doesn't understand. And then when I

26:09

learn about those things that it's

26:10

struggling with and I see the mistakes

26:12

that it's making and the things it

26:14

thinks are confusing, I will adjust the

26:16

codebase accordingly. But the

26:18

instruction I'm giving the model here to

26:19

change the file is not actually the

26:22

thing I want it to do. I want it to try

26:23

and change the file so I can take that

26:25

information and then go fix something

26:28

else with it. If I see what it's

26:30

struggling with or what it finds

28:32

confusing, I can then go make

26:33

better decisions about how I architect

26:35

the codebase. I merge less than a fifth

26:37

of the changes it proposes. But the

26:39

other four out of five I use to make the

26:41

codebase better. Generally speaking, I

26:43

feel like developers don't understand

26:45

how powerful it is to lie or

26:47

intentionally mislead the agents in ways

26:49

that set both you and the agent up for

26:52

more success. Another example of this

26:54

that I do a lot is I'll tell the agent,

26:55

"Hey, this app has no users. There's no

26:57

real data yet. Make whatever changes you

27:00

want and don't worry about it. We'll

27:02

figure it out when we ship, even if the

27:03

project's already live, because I don't

27:05

want it spending a ton of time on weird

27:07

backfill data patterns, unless I

27:09

actually want it to do that." So I'll

27:10

often put in the agent MD or Claude MD. This

27:13

project is super green field. It's okay

27:15

if you change the schema entirely. We're

27:17

trying to get it in the right shape.
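Concretely, that lie is just a line or two in the file. The exact wording here is mine, but it's the shape of what I described above:

```markdown
# CLAUDE.md

This project is super green field. It has no users and no real data
yet. It's okay to change the schema entirely; we're trying to get it
into the right shape. Don't worry about migrations or backfill. We'll
figure that out when we ship.
```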

27:18

Those are the types of things I put in

27:20

the Claude MD or agent MD files. I'm

27:22

effectively lying to the agents to steer

27:24

them the way I want, but it works really

27:26

well. Another example of this that I run

27:28

into a lot is if I'm trying to build

27:29

something that takes multiple steps and

27:31

I'm asking it to do step two over and

27:33

over and it keeps failing. Instead, I'll

27:35

ask it for step three because then it

27:37

will try step two to get there. It won't

27:39

work and it will be able to often fix

27:41

itself. If you're struggling to get the

27:43

agent to do step two of a three-step

27:46

process, tell it to do step three and it

27:47

will unblock itself for step two pretty

27:49

consistently. These types of

27:51

things are the clever

27:53

engineering hacks that I'm genuinely

27:55

enjoying discovering and playing around

27:57

with. It's one of those

27:59

time-in-the-saddle things. You start to

28:01

build an intuition for how they behave

28:03

and what context matters. But if you're

28:06

filling the context with all of these

28:07

giant Claude MD files with piles of

28:10

skills you downloaded from the internet,

28:11

a bunch of MCP servers you're not using,

28:13

and a bunch of cursor rules somebody

28:14

told you about on GitHub, you'll never

28:16

be able to diagnose why the model's

28:18

doing things wrong. If all you have is

28:20

your code base, your prompt, and a

28:23

minimal agent MD file, you've

28:25

meaningfully reduced the places where

28:27

the agent can be misled. Everything the

28:29

agents do exists because of one of the

28:30

sources it has. And if you can reduce

28:32

those sources, you can make it much more

28:34

likely that it behaves. Speaking of

28:36

which, I'm going to have to do a long

28:38

rant about skills in the very near

28:40

future. Let me know if that's exciting

28:42

to you, and let me know if this video is

28:44

helpful at all. I know all of this stuff

28:45

is very different and new and kind of

28:47

crazy, but it is genuinely really fun.

28:49

I've been enjoying it a ton, and I hope

28:51

that these rants and lessons are helpful

28:53

to those who are trying to figure it out

28:54

as they go. In the end, you need to just

28:56

experiment a bunch. This is so different

28:58

from how coding used to look and you'll

29:00

find certain skills end up more

29:01

important than ever and others are just

29:03

new things you're going to have to build

29:04

as you go. I've been enjoying it a ton.

29:06

I hope that comes across in the content

29:08

I've been making and I hope that maybe,

29:10

just maybe, this can help you out, too.

29:12

Let me know how you feel. And until next

29:13

time, peace nerds.
