No Vibes Allowed: Solving Hard Problems in Complex Codebases – Dex Horthy, HumanLayer
FULL TRANSCRIPT
[music]
Hi everybody. How y'all doing?
>> It's exciting. I'm Dex. As they said in the great intro, I've been hacking on agents for a while. Our talk, 12-Factor Agents, at AI Engineer in June was one of the top talks of all time, I think top eight or something, one of the best ones from AI Engineer in June. May or may not have said something about context engineering. So why am I here today? What am I here to talk about?
I want to talk about one of my favorite talks from AI Engineer in June. I know we all got the update from Yegor yesterday, but they wouldn't let me change my slides, so this is going to be about what Yegor talked about in June. Basically, they surveyed 100,000 developers across all company sizes, and they found that most of the time you use AI for software engineering, you're doing a lot of rework and a lot of codebase churn, and it doesn't really work well for complex tasks or brownfield codebases. You can see in the chart that you're shipping a lot more, but a lot of it is just reworking the slop that you shipped last week. And the other side was that if you're doing a greenfield little Vercel dashboard, something like that, it's going to work great. If you're going into a ten-year-old Java codebase, maybe not so much. This matched my experience, personally and talking to a lot of founders and great engineers: too much slop, a tech debt factory. "It's just not going to work for our codebase. Maybe someday when the models get better." But that's what context engineering is all about: how can we get the most out of today's models? How do we manage our context window? We talked about this in August.
I have to confess something. The first time I used Claude Code, I was not impressed. It was like, okay, this is a little bit better, I get it, I like the UX. But since then, we as a team figured something out, and we were actually able to get two to three times more throughput. We were shipping so much that we had no choice but to change the way we collaborated. We rewired everything about how we build software. It was a team of three, it took eight weeks, and it was really freaking hard. But now that we've solved it, we're never going back. This is the whole "no slop" thing. I think we got somewhere with this; it went super viral on Hacker News in September, and thousands of folks have gone on to GitHub and grabbed our research-plan-implement prompt system. So the goals here, which we kind of backed our way into: we need AI that can work well in brownfield codebases and solve complex problems. No slop, right? No more slop. And we had to maintain mental alignment; I'll talk more about what that means in a minute. And of course we want to spend as many tokens as possible: what we can offload meaningfully to the AI is really, really important. Super high leverage.
So, this is advanced context engineering for coding agents. I'll start by framing it. The most naive way to use a coding agent is to ask it for something, then tell it why it's wrong and re-steer it, and ask and ask and ask until you run out of context, or you give up, or you cry. We can be a little bit smarter than this. Most people discover this pretty early on in their AI exploration: if you start a conversation and you're off track, it might be better to just start a new context window. You say, "Okay, we went down that path. Let's start again. Same prompt, same task, but this time we're going to go down this path, and don't go over there, because that doesn't work." So how do you know when it's time to start over?
If you see this, it's probably time to start over, right? This is what Claude says when you tell it it's screwing up.
So we can be even smarter about this. We can do what I call intentional compaction. Whether you're on track or not, you can take your existing context window and ask the agent to compress it down into a markdown file. You can review this, you can tag it, and then when the new agent starts, it gets straight to work instead of having to do all that searching and codebase understanding and getting caught up. What goes into compaction? Well, the question is what takes up space in your context window: looking for files, understanding code flow, editing files, test and build output. And if you have one of those MCPs that's dumping JSON and a bunch of UUIDs into your context window, God help you. So what should we compact? I'll get into more specifics, but this is a really good compaction: this is exactly what we're working on, and the exact files and line numbers that matter to the problem we're solving.
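To make intentional compaction concrete, here is a minimal sketch of what such a compaction file might capture, written as a Python template string. The structure is an illustration, not HumanLayer's actual prompt; every section name here is an assumption.

```python
# Hypothetical structure for an intentional-compaction file. The agent fills
# this in from its current context window; you review and correct it, and the
# next fresh agent starts from it instead of re-searching the codebase.
COMPACTION_TEMPLATE = """\
# Task
{one_sentence_goal}

# Relevant files and line numbers
- {path}:{line_range} -- {why_it_matters}

# What we learned (verified against the code)
- {confirmed_fact}

# Dead ends (do not retry these)
- {approach_that_failed_and_why}

# Next step
{single_concrete_action}
"""

if __name__ == "__main__":
    print(COMPACTION_TEMPLATE)
```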
Why are we so obsessed with context? Because LLMs are, and I actually got roasted on YouTube for this one, effectively pure functions. They're not pure functions because they're nondeterministic, but they are stateless. The only way to get better performance out of an LLM is to put better tokens in, and then you get better tokens out. Every turn of the loop, when Claude, or any coding agent, is picking the next tool, there could be hundreds of right next steps and hundreds of wrong next steps, but the only thing that influences what comes out next is what is in the conversation so far.
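As an illustration of that statelessness, here is a sketch of the agent loop; `call_model` and `run_tool` are hypothetical stand-ins, not any real API:

```python
# Sketch of a stateless agent loop: every turn, the model sees the entire
# conversation so far and nothing else. Better tokens in, better tokens out.

def call_model(messages: list) -> dict:
    """Hypothetical stand-in for an LLM API call."""
    return {"role": "assistant", "content": "done", "tool_call": None}

def run_tool(tool_call: dict) -> dict:
    """Hypothetical tool executor; its output lands back in the context."""
    return {"role": "tool", "content": "..."}

def agent_loop(task: str, max_turns: int = 20) -> list:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        reply = call_model(messages)  # the whole window is re-sent every turn
        messages.append(reply)
        if not reply.get("tool_call"):  # model decided it is done
            break
        messages.append(run_tool(reply["tool_call"]))
    return messages
```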
So we're going to optimize this context window for correctness, completeness, size, and a little bit of trajectory. The trajectory one is interesting, because a lot of people say, "Well, I told the agent to do something and it did something wrong. So I corrected it and I yelled at it, and then it did something wrong again, and then I yelled at it." And the LLM looks at this conversation and says, "Okay, cool. I did something wrong, the human yelled at me, then I did something wrong and the human yelled at me. So the next most likely token in this conversation is: I'd better do something wrong so the human can yell at me again." So be mindful of your trajectory. If you were to invert this, the worst thing you can have is incorrect information, then missing information, and then just too much noise. If you like equations, there's a dumb equation if you want to think about it this way. Geoffrey Huntley did a lot of research on coding agents, and he put it really well: the more you use the context window, the worse outcomes you'll get.
This leads to a concept, a very, very academic concept, called the dumb zone. You have your context window: roughly 168,000 tokens, with some reserved for output and compaction. This varies by model, but we'll use Claude Code as the example here. Around the 40% line is where you're going to start to see diminishing returns, depending on your task. If you have too many MCPs in your coding agent, you are doing all your work in the dumb zone, and you're never going to get good results. People have talked about this; I'm not going to dwell on it. Your mileage may vary; 40% depends on how complex the task is, but it's a good guideline.
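Using the talk's rough numbers, the back-of-the-envelope arithmetic looks like this (the 168k figure and the 40% line are the guideline above, not hard limits, and the MCP overhead figure is a made-up example):

```python
# Back-of-the-envelope smart-zone budget from the numbers above.
USABLE_TOKENS = 168_000        # rough usable window after reserved output space
DUMB_ZONE_THRESHOLD = 0.40     # quality starts dropping past ~40% utilization

smart_zone = int(USABLE_TOKENS * DUMB_ZONE_THRESHOLD)
print(f"Smart-zone budget: ~{smart_zone:,} tokens")  # ~67,200

# Hypothetical: an MCP server that injects 20k tokens of tool schemas has
# eaten ~30% of that budget before the agent reads a single file.
mcp_overhead = 20_000
print(f"MCP overhead: {mcp_overhead / smart_zone:.0%} of the smart zone")
```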
So, back to compaction, or as I will call it from now on, cleverly avoiding the dumb zone. We can do subagents. If you have a frontend subagent and a backend subagent and a QA subagent and a data scientist subagent: please stop. Subagents are not for anthropomorphizing roles. They are for controlling context. What you can do, if you want to find how something works in a large codebase, is steer the coding agent to do this, if it supports subagents, or build your own subagent system. Basically you say, "Hey, go find how this works," and it forks out a new context window that does all that reading and searching and finding, reading entire files and understanding the codebase, and then returns just a really succinct message back up to the parent agent: "Hey, the file you want is here." The parent agent can read that one file and get straight to work. This is really powerful. If you wield these correctly, you can get good responses like this, and then you can manage your context really, really well.
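A sketch of what that looks like, with `run_subagent` as a hypothetical helper that forks a fresh context window and returns only its final message (the prompt wording and the example question are invented):

```python
def run_subagent(prompt: str) -> str:
    """Hypothetical: run a fresh agent to completion in its own context
    window and return only its final message to the caller."""
    ...

def locate(question: str) -> str:
    # All the grepping and whole-file reads happen in the child's context.
    # Only this succinct answer ever enters the parent's context window.
    return run_subagent(
        "Find how this works in the codebase: " + question + "\n"
        "Reply with ONLY the relevant file paths and line numbers, plus one "
        "sentence each on why they matter. Do not include file contents."
    )

# The parent spends a few hundred tokens here instead of tens of thousands,
# then reads the one file it actually needs and gets straight to work.
answer = locate("where are SCM webhook signatures verified?")
```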
What works even better than subagents, or as a layer on top of subagents, is a workflow I call frequent intentional compaction. We're going to talk about research-plan-implement in a minute, but the point is you're constantly keeping your context window small. You're building your entire workflow around context management. It comes in three phases: research, plan, implement, and we're going to try to stay in the smart zone the whole time. The research is all about understanding how the system works, finding the right files, and staying objective. Here's a prompt you can use to do research, and here's the output of a research prompt. These are all open source; you can go grab them and play with them yourself. In planning, you're going to outline the exact steps. You're going to include file names and line snippets, and you're going to be very explicit about how we're going to test things after every change. Here's a good planning prompt, and here's one of our plans; it's got actual code snippets in it. And then we're going to implement. If you read one of these plans, you can see very easily how the dumbest model in the world is probably not going to screw it up. So we just go through, run the plan, and keep the context low. The implementation prompt, like I said, is the least exciting part of the process.
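Stitched together, the workflow is three agent runs with a human review gate between each. A minimal sketch, assuming a hypothetical `run_in_fresh_context` wrapper; the real prompts are the open-source ones mentioned above:

```python
def run_in_fresh_context(prompt: str, files: list) -> str:
    """Hypothetical: one agent run in a brand-new context window;
    returns the path of the markdown artifact it produced."""
    ...

def human_review(path: str) -> None:
    input(f"Read and correct {path}, then press Enter to continue... ")

def research_plan_implement(task: str) -> None:
    research = run_in_fresh_context(f"/research {task}", files=[])
    human_review(research)  # bad research sends everything downstream off course

    plan = run_in_fresh_context(f"/plan {task}", files=[research])
    human_review(plan)      # one bad plan step can be a hundred bad lines of code

    # Each phase starts near-empty: the markdown artifacts are the compaction.
    run_in_fresh_context("/implement the plan", files=[plan])
```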
I wanted to put this into practice. I do a podcast with my buddy Vaibhav, the CEO of a company called Boundary ML, and I said, "Hey, I'm going to try to one-shot a fix to your 300,000-line Rust codebase for a programming language." The whole episode is up; it's like an hour and a half, and I'm not going to talk through it right now, but we built a bunch of research documents and then threw them out because they were bad. Then we made a plan without research and a plan with research and compared all the results. It's a fun time. That was Monday night. By Tuesday morning, we were on the show, and the CTO had seen the PR, didn't realize I was doing it as a bit for a podcast, and basically said, "Yeah, this looks good. We'll get it in the next release." I think he was a little confused. Here's the plan. But anyway: confirmed, it works in brownfield codebases, and no slop. But I wanted to see if we could solve complex problems. Vaibhav was still a little skeptical, so we sat down for like seven hours on a Saturday and shipped 35,000 lines of code to BAML. One of the PRs got merged like a week later. I will say some of this is codegen: you update your behavior, all the golden files update, and so on, but we shipped a lot of code that day. He estimates it was about one to two weeks of work, done in seven hours. So cool, we can solve complex problems.
There are limits to this. I sat down with my buddy Blake, and we tried to remove the Hadoop dependencies from Parquet Java. If you know what Parquet Java is, I'm sorry for whatever happened in your career to get you to this point. It did not go well. Here are the plans, here's the research. At a certain point, we threw everything out and actually went back to the whiteboard. Once we had learned where all the foot guns were, we went back to: okay, how is this actually going to fit together? And this brings me to a really interesting point that Jake's going to talk about later: do not outsource the thinking. AI cannot replace thinking. It can only amplify the thinking you have done, or the lack of thinking you have done.
So people ask, "Dex, this is spec-driven development, right?" No. Spec-driven development is broken. Not the idea, but the phrase. It's not well defined; this is Birgitta from Thoughtworks. A lot of people just say "spec" and mean a more detailed prompt. Does anyone remember this picture? Does anyone know what this is from? All right, that's a deep cut. There will never be a "year of agents," because of semantic diffusion. Martin Fowler wrote about this in 2006: we come up with a good term with a good definition, then everybody gets excited, everybody starts using it to mean a hundred things to a hundred different people, and it becomes useless. We had "an agent is a person," "an agent is a microservice," "an agent is a chatbot," "an agent is a workflow." And thank you, Simon, we're back to the beginning: an agent is just tools in a loop. This is happening to spec-driven dev. I used to have Sean's slide at the beginning of this talk, but it caused a bunch of people to focus on the wrong things. His idea, forget the code, it's like assembly now, you just focus on the markdown, is a very cool idea. But people say spec-driven dev is writing a better prompt, a product requirements document. Sometimes it's using verifiable feedback loops and back pressure. Maybe it's treating the code like assembly, like Sean taught us. For a lot of people it's just using a bunch of markdown files while you're coding. Or my favorite, which I just stumbled upon last week: a spec is documentation for an open source library. So it's gone. Spec-driven dev is overhyped, it's useless now, it's semantically diffused.
So I want to talk about four things that actually work today: the tactical and practical steps that we've found working internally and with a bunch of users. We do the research; we figure out how the system works. Remember Memento? This is the best movie on context engineering, as Peter puts it. The guy wakes up, he has no memory, and he has to read his own tattoos to figure out who he is and what he's up to. If you don't onboard your agents, they will make stuff up. So if this is your team, and this is very simplified, most of you have much bigger orgs than this, let's say you want to do some work over here. One thing you could do is put onboarding into every repo. You put in a bunch of context: here's the repo, here's how it works. This is a compression of all the context in the codebase that the agent can see ahead of time, before actually getting to work. This is challenging because sometimes it gets too long. As your codebase gets really big, you either have to make this longer or you have to leave information out. So as the agent reads through the context of this big five-million-line monorepo, it's going to use up all of the smart zone just to learn how the system works, and it's not going to be able to do any good tool calling once it's in the dumb zone.
So you can shard this down the stack. This is what people are calling progressive disclosure. You could split it up: put a file in the root of every repo, and then at every level you have additional context, "if you're working here, this is what you need to know." We don't document the files themselves, because they're the source of truth. Then, as your agent is working, it pulls in the root context and then the subcontext. I won't go into specifics; you could use CLAUDE.md for this, you could use hooks for this, whatever it is. But you still have plenty of room in the smart zone, because you're only pulling in what you need to know.
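A minimal sketch of that lookup, assuming a per-directory context file named AGENTS.md; the file name, the HTML-comment separators, and the example paths are conventions you would pick, not fixed:

```python
from pathlib import Path

CONTEXT_FILE = "AGENTS.md"  # assumed convention; Claude Code reads CLAUDE.md

def gather_context(workdir: Path, repo_root: Path) -> str:
    """Collect context files from the working directory up to the repo root,
    returned root-first, so the agent loads only the relevant slice."""
    chunks = []
    d = workdir.resolve()
    root = repo_root.resolve()
    while True:
        f = d / CONTEXT_FILE
        if f.is_file():
            chunks.append(f"<!-- {f} -->\n{f.read_text()}")
        if d == root or d == d.parent:
            break
        d = d.parent
    return "\n\n".join(reversed(chunks))  # root-level context first

# Working in services/billing/ pulls in the repo-root file plus only the
# billing-specific context, leaving the rest of the smart zone free.
print(gather_context(Path("services/billing"), Path(".")))
```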
The problem with this is that it gets out of date. Every time you ship a new feature, you need to invalidate and rebuild large parts of this internal documentation. You could use a lot of AI and make it part of your process to update it. But I want to ask a question about the actual code, the function names, the comments, and the documentation: does anyone want to guess what's on the y-axis of this chart?
>> Slop.
>> Slop. It's actually the amount of lies you can find in any one part of your codebase. So you could make updating this part of your process, but you probably shouldn't, because you probably won't.
What we prefer is on-demand compressed context. If I'm building a feature that relates to SCM providers and Jira and Linear, I would just give the agent a little bit of steering: "Hey, we're going over to this part of the codebase." A good research prompt, or slash command, or even a skill, might launch a bunch of subagents to take these vertical slices through the codebase and then build up a research document that is a snapshot of the parts of the codebase that matter, actually true because it's based on the code itself. We are compressing truth.
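As a sketch, fanning those vertical slices out in parallel and merging the succinct findings into one research document might look like this; `run_subagent` is the same hypothetical helper as before, and the slice descriptions are invented for the SCM/Jira/Linear example:

```python
from concurrent.futures import ThreadPoolExecutor

SLICES = [  # hypothetical vertical slices for an SCM/Jira/Linear feature
    "how SCM provider integrations are registered and configured",
    "how Jira and Linear issue syncing is triggered",
    "where incoming webhook payloads are validated",
]

def run_subagent(prompt: str) -> str:
    """Hypothetical: fresh context window; returns file:line findings only."""
    ...

with ThreadPoolExecutor() as pool:
    findings = pool.map(run_subagent, SLICES)

research_doc = "# Research: SCM provider sync\n\n" + "\n\n".join(
    f"## {s}\n{finding}" for s, finding in zip(SLICES, findings)
)
# Review and correct this document, then hand it to the planning phase.
```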
Planning is leverage. Planning is about compression of intent. In the plan, we outline the exact steps. We take our research and our PRD, or our bug ticket, or whatever it is, and we create a plan file. So we're compacting again. And I want to pause to talk about mental alignment. Does anyone know what code review is for?
>> Mental alignment.
>> Mental alignment. It's about making sure things are correct and so on, but the most important thing is: how do we keep everybody on the team on the same page about how the codebase is changing and why? I can read a thousand lines of Golang every week. Sorry, I can't read a thousand lines; it's hard. I can do it, I just don't want to. As our team grows, all the code still gets reviewed; we don't skip reading the code. But as a technical leader on the team, I can read the plans and keep up to date, and that's enough: I can catch some problems early, and I maintain an understanding of how the system is evolving. Mitchell had a really good post about putting his Amp threads on his pull requests, so that you see not just a wall of green text in GitHub, but the exact steps, the prompts, and "hey, I ran the build at the end and it passed." This takes the reviewer on a journey in a way that a GitHub PR just can't. And as you're shipping two to three times as much code, it's really on you to find ways to keep your team on the same page and show them the steps you took and how you tested it manually.
Your goal is leverage. You want high confidence that the model will actually do the right thing. I can't read this plan and know what's actually going to happen and what code changes are going to be made. So over time we've iterated toward plans that include actual code snippets of what's going to change. Your goal is leverage: you want compression of intent and reliable execution. I have a physics background, and we like to draw lines through the centers of peaks and curves. As your plans get longer, reliability goes up and readability goes down. There's a sweet spot for you and your team and your codebase; you should try to find it, because when we review the research and the plans, if they're good, then we can get mental alignment.
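For a sense of the level of detail that hits that sweet spot, here is one hypothetical plan step; the file, snippet, and commands are all invented for illustration, not from an actual HumanLayer plan:

```python
# One step of a plan, detailed enough that "the dumbest model in the world"
# can execute it: exact file, exact change, exact verification.
PLAN_STEP = """\
## Step 3: reject unsigned webhook requests

File: src/webhooks/handler.py (around line 42), add before dispatch:

    if not verify_signature(request):
        return Response("invalid signature", status=401)

Verify: `pytest tests/webhooks/` passes, and a manual POST without a
signature header returns 401.
"""

if __name__ == "__main__":
    print(PLAN_STEP)
```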
Don't outsource the thinking. I've said this before: this is not magic. There is no perfect prompt. It still will not work if you do not read the plan. So we built our entire process around you, the builder, being in a back-and-forth with the agent, reading the plans as they're created. And if you need peer review, you can send it to someone and say, "Hey, does this plan look right? Is this the right approach? Is this the right order to look at these things?" Jake, again, wrote a really good blog post about how the thing that makes research-plan-implement valuable is you, the human in the loop, making sure it's correct. If you take one thing away from this talk, it should be this: a bad line of code is a bad line of code, a bad part of a plan could be a hundred bad lines of code, and a bad line of research, a misunderstanding of how the system works and where things are, means your whole thing is going to be hosed: you're sending the model off in the wrong direction. So when we're working internally and with users, we're constantly trying to move human effort and focus to the highest-leverage parts of this pipeline. Don't outsource the thinking.
Watch out for tools that just spew out a bunch of markdown files to make you feel good; I'm not going to name names. Sometimes this is overkill. The way I like to think about it is: you don't always need a full research-plan-implement. Sometimes you need more, sometimes you need less. If you're changing the color of a button, just talk to the agent and tell it what to do. If you're doing a small feature, do a simple plan. If you're doing medium features across multiple repos, do one research pass, then build a plan. Basically, the ceiling on the hardest problem you can solve goes up the more of this context engineering and compaction you're willing to do. So if you're in the top right corner, you're probably going to have to do more. A lot of people ask me, "How do I know how much context engineering to use?" It takes reps. You will get it wrong. You have to get it wrong over and over and over again. Sometimes you'll go too big, sometimes too small. Pick one tool and get some reps; I recommend against min-maxing across Claude and Codex and all these different tools. So, I'm not a big acronym guy.
We said spec-driven dev was broken, and research, plan, implement: I don't think those will be the steps either. The important part is compaction and context engineering and staying in the smart zone. But people are calling this RPI, and there's nothing I can do about it. So just be wary: there is no perfect prompt, there is no silver bullet. If you really want a hypey word, you can call this harness engineering, which is part of context engineering: how you integrate with the integration points on Codex, Claude, Cursor, whatever, and how you customize them for your codebase.
So what's next? I think the coding agent stuff is actually going to be commoditized. People are going to learn how to do this and get better at it. The hard part is going to be how you adapt your team, your workflow, and the SDLC to work in a world where 99% of your code is shipped by AI. If you can't figure this out, you're hosed, because there's a rift growing: staff engineers don't adopt AI because it doesn't make them that much faster, while junior and mid-level engineers use it a lot because it fills in skill gaps, and it also produces some slop. And then the senior engineers hate it more and more every week, because they're cleaning up slop that was shipped by Cursor the week before. This is not AI's fault, and it's not the mid-level engineers' fault. Cultural change is really hard, and it needs to come from the top if it's going to work. So if you're a technical leader at your company, pick one tool and get some reps. If you want to help, we are hiring. We're building an agentic IDE to help teams of all sizes speedrun the journey to 99% AI-generated code. We'd love to talk if you want to work with us: hit our website, send us an email, come find me in the hallway. Thank you all so much for your energy.
[music]