Ralph Wiggum Showdown w/ @GeoffreyHuntley
FULL TRANSCRIPT
I think we're live. What's up, Geoff?
>> What's up, Dex?
>> Uh, I'm really jealous of your DJ setup over there. That's pretty incredible.
>> It's been a while. Thanks, mate. Like, I remember when I first caught up with you in San Fran, probably what, June, July, rocking into a meetup and going up to Allison like, here's some free alpha: if you run it in a loop, you get crazy outcomes. And this was with Sonnet 4.5. And now we're up to Opus 4.5.
>> No, dude. This was not Sonnet 4.5. This was in May. This would have been like Sonnet 3.5, I think.
>> Yeah, it was. Anyway, it was cooked back then. Six months later, the model gets better, the techniques evolve, and there have been a few attempts to turn it into products. But I don't think that will work, because I see LLMs as an amplifier of operator skill. If you just set it off and run away, you're not going to get as great an outcome. You really want to babysit this thing, get really curious why it did that thing, and then try to tune that behavior in or out. Really think about it, never blame the model, and always be curious about what's going on. So it's really highly supervised.
>> Highly supervised. Yeah. You were talking with Matt today: human on the loop is better than human in the loop. Which is like, don't ask me, but I'm going to go poke it and prod it and test it, and I might stop you at certain points. The model's not deciding when and how.
>> Correct. So it's really cute that Anthropic has made the Ralph plugin, which is nice. It's starting to cross the chasm, but I do have some concerns that people will just try the official plugin and go, that's not it. And you've poked at the internals. We sat down and you've done it. You see the concepts. It's like some of the ideas behind HumanLayer.
>> You say that it's not it. So how is it not it, Dex?
>> Okay. So I'm going to talk about what we actually want to do today. I have two GCP VMs, and in both of them we have these specs, and they both have a repo checked out. This one actually doesn't even have a loop.sh yet. This just has the slash ralph-wiggum create-loop command or whatever, I forget what the exact thing is. We're going to go set it up today. I haven't actually turned this on yet, but I've created these two git repos. One has a prompt.md and a loop.sh, and it will eventually create this implementation plan. This is vanilla Ralph from the Geoff recipe, right? And so in this shell I have my loop.sh, which is literally just: run claude in yolo mode, cat the prompt in, and let it go do its thing.
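That loop can be sketched as a bash function, bounded here only so it can be sanity-checked; the real loop.sh is just `while true`. The flag names assume current Claude Code (`-p` print mode, `--dangerously-skip-permissions`), so verify against your version's `claude --help`:

```shell
# A minimal sketch of the vanilla Ralph loop.sh.
ralph_loop() {
  max_iters="${1:-0}"   # 0 = loop forever, like the real loop.sh
  i=0
  while true; do
    # A fresh process each iteration: the context window is re-allocated
    # deterministically from prompt.md and never compacted.
    cat prompt.md | claude -p --dangerously-skip-permissions || true
    i=$((i + 1))
    if [ "$max_iters" -gt 0 ] && [ "$i" -ge "$max_iters" ]; then
      break
    fi
  done
  echo "ran $i iterations"
}
```

The `|| true` matters: one bad exit from the agent shouldn't kill the outer loop.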
>> Yeah. Bigger, by the way. Triple size. Let's go bigger. Bigger. Bigger.
>> Yeah. Yeah. And I'm actually going to close some of these terminals. And then, let's see if we can pull this down. Yeah. There are two directories, two git repos I've made: one to test the Anthropic version, and one to test the, I'll call it the Geoff version of Ralph. So we've got the bash one and then we've got the plugin one.
And these are both just empty repos so far. I'm going to add the loop and the prompt; we'll look at the prompt in a sec. But then we've got these specs for a project that I was hacking on called customark, which, if you remember Kubernetes and the kustomize world, is sort of a customization pipeline for incrementally building markdown files with patches and stuff.
>> So anyway, they're both getting the same set of specs, and they're both basically being instructed to run the same prompt. And actually, I guess this one will also get the implementation plan, right?
>> Yeah
>> Assuming we have the same prompt. And the prompt is essentially, I'll just push it and you can go get it.
>> While you go get it: now, in that diagram you have GCP. Folks, we've been at AGI for a very long time, if you define AGI as disruptive for software engineers. At least six months now, and these models are just getting better. Now, the GCP thing. I see people go, oh, what about the sandbox? Sandboxing? Dangerously-allow-all? Think about it: the permission prompts are literally deliberately injecting humans into the loop. You don't want to inject yourself into the loop, because that's essentially not AGI. You're dumbing it down.
>> But,
>> interestingly, it is kind of dangerous to do things. So the fact that you're running on a GCP VM is key, right? You want to enable all the tools,
>> but remember the trifecta, right? Untrusted input, access to the network, and access to private data. We are giving it access to do everything, which means it can search the web, which means it can accidentally stumble on untrusted input. We're giving it access to the network because it needs to do things, I don't know, search the web, whatever it is. And we're not giving it access to private data. So here's why we're safe: this is running in a dev cluster in GCP, and I think the only thing on there is the default IAM key, which can literally just look up information about the instance. You can look at this as layers of an onion, layers of the security onion.
>> So if you run dangerously-allow-all from your local laptop, congrats. They go nab your Bitcoin wallet if it's on your computer, they steal your Slack authentication cookies, GitHub cookies, and they pivot, right? That's terrible. But if you create a custom-purpose VM or an ephemeral instance just for this, you start restricting its network connectivity and you do all the things that you should do as a security engineer. The next thing to think about is not if it gets popped but when it gets popped. I develop on the basis that it's a when. So the blast radius, if that GCP VM is the worst thing it can reach, is fine, because there's no public IP.
>> Yep.
>> There is no absolutely terrible thing there. Okay.
>> We've restricted it: the only permissions on this box are my cloud API keys and deploy keys to push to the two GitHub repos.
>> Correct. Proper security engineering.
>> It's not if it gets popped. It's when it gets popped. And what is the blast radius?
>> This is, however, not an invitation to go pop my GCP VMs. I will not be sharing the IP addresses.
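The network-restriction layer of that onion can be made concrete. This sketch emits a default-deny egress ruleset with an allowlist; the hosts and ports shown are illustrative, and you'd run the emitted commands as root on the sandbox VM itself:

```shell
# Emit an iptables ruleset: default-deny egress, allow loopback and DNS,
# and allow HTTPS only to an allowlist. github.com / api.anthropic.com
# are example hosts; hostname matches resolve once at rule-insert time,
# so pin IPs where you can.
egress_rules() {
  cat <<'EOF'
iptables -P OUTPUT DROP
iptables -A OUTPUT -o lo -j ACCEPT
iptables -A OUTPUT -p udp --dport 53 -j ACCEPT
iptables -A OUTPUT -p tcp --dport 443 -d github.com -j ACCEPT
iptables -A OUTPUT -p tcp --dport 443 -d api.anthropic.com -j ACCEPT
EOF
}
```

On a cloud VM you'd usually do the same thing with VPC firewall rules instead, but the shape of the policy is identical: deny by default, allow the few things the agent actually needs.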
>> If you want to share API keys with me, Dex, I always need some.
>> You know what, man? I heard you've got a lot of tokens floating around over there. If anything, you should be bringing me some tokens.
>> Facts.
>> All right, let's look at this prompt.
>> Yeah, let's look at the concept of the prompt. Look at the prompt.
>> So, here's what I'm using. This is my take on the original Ralph prompt. Sorry, I have tmux inside tmux here, so it's getting a little weird.
>> That's fine.
>> Okay, let's look at zero A, right?
>> Yep.
>> So, you've got to think like a C or C++ engineer, and you've got to think of context windows as arrays, because they literally are arrays.
>> Context windows are arrays.
When you chat with the LLM, you allocate to the array. When it executes bash or another tool, it auto-allocates to the array.
>> Yep.
>> So you get into something like context engineering. I heard there's a guy who knows a thing or two about that definition.
>> Hey, I just talked to people like you, who knew things, and put a name on a thing that hundreds of people were doing.
>> But yeah, context engineering is all
about designing this array.
>> It's all about the array. Think about how LLMs are essentially a sliding window over the array, and the less that window needs to slide, the better. There is no memory server side. It's literally that: an array. The array is the memory. So you want to allocate less.
So let's go back to the prompt.
>> Yep.
>> Straight away, we're deliberately allocating. This is the key: deliberate malloc'ing of context about your application.
>> We're going to say we're just going to have 5,000-ish tokens that are dedicated to, here's what we're building, and we want that in every time.
>> Yeah. This could be an index.md or readme.md, which is a whole bunch of hyperlinks out to different specs, enough to tease and tickle the latent space that there are files there. So you can either go for an index, or if Ralph starts being dumb, you can go for deliberate injection.
>> So you can @-mention specs, right, and the behavior is just to list that out.
>> Correct. You mention a file name, and the tool registration for read-file is going to go: is there a file at that path? I'm going to read it.
>> Mhm.
>> So you can give it a directory path, or you can give it a direct file. That is the key. So if we go back to your context window diagram.
>> Yeah.
>> Right. Think about this. You're allocating the array deliberately. So the first couple of allocations are about the actual application.
>> Mhm. And every loop, that allocation is always there. LLM engineering is kind of tarot card reading; it's not really a science. But to me, on vibes, it felt like it was a little bit more deterministic if I allocated the first couple of things deterministically.
>> Yeah.
>> Now, once you've got that, we go on to essentially the next line in the spec. So the first one is deliberate malloc'ing, on every loop, of the array.
>> Yep.
>> Okay. So now we've got a to-do-list type thing, like an implementation plan.
>> Yeah. Now, something that's kind of missing in there is: pick one.
>> Oh, it says implement the single highest priority feature.
>> Yeah. Okay. Yeah, I see that. Sorry.
>> That's the idea.
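Put together, a Ralph-style prompt.md can look something like the file this heredoc writes. The section wording and spec paths are illustrative, not the exact prompt from the stream; what matters is the shape: deliberate allocation up front, then one item per loop:

```shell
# Write an illustrative Ralph-style prompt.md. Wording and file paths
# are assumptions, not the stream's actual prompt.
cat > prompt.md <<'EOF'
0a. Orientation (deliberately allocated every loop):
    Read @specs/README.md, an index of hyperlinks to the project specs.

1.  Read IMPLEMENTATION_PLAN.md and pick the SINGLE highest-priority
    unfinished item. One context window, one activity, one goal.

2.  Implement that one item. Run the relevant tests and fix failures.

3.  Update IMPLEMENTATION_PLAN.md, then git commit and git push.
EOF
```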
>> Yeah. So a lot of people do these multi-stage things. Let's go back to the context window diagram. They do these multi-stage things, but what you want to do is, for each one item, reset the goal. Free and re-malloc the objective.
>> Yes.
>> Because when you reset the objective,
>> imagine that somewhere down here is the line where performance degrades noticeably.
>> Correct. There is a dumb zone. You
should stay out of it.
>> If the dumb zone is down here, and it's very dependent on where this line is, depending on what you're doing, what your trajectory is, how much you're reading, all of this. But if you ask it to do too much in the working context, then some of your results are going to be dumb. Especially the important part where it's like, okay, I've made all the changes, let me run the tests. And then the tests are failing and it's scrambling and flailing to try to get everything working. You want to have this, and then a little bit of headroom for finalizing, for doing the git commits and the pushes and making sure that all works. You want to have all of that happening in the smart zone.
>> This is human on the loop, not in the loop. We set this up, we architect this loop in this way, and you can either go complete AFK or you can be on the loop. What you just drew there is on the loop. When I'm doing this, I always leave myself a little bit of space for steering, like when I'm reviewing the work. This is when software, instead of Lego bricks, is now clay. So this is where I'll do my final wrap-up steering, or I just throw it away, do a git reset hard, adjust my technique, and let it rip again.
>> So you're saying, in the early days, you might just run one iteration of this loop and then actually sit here and check it, basically wait for input between looping again. Right.
>> So, there's a reason I do the live streams. I literally use my phone as a cheeky portable monitor. I'm doing housework and stuff, and it's there like a portable monitor, and I check in, I watch it. You start to notice patterns, and you start to anthropomorphize certain tendencies. Opus 4.5 doesn't get high anxiety as the context window fills, but it does seem to be forgetful of some objectives.
>> So I want to quickly, because I know you have a limited amount of time, go through the architecture of the Anthropic plugin and how it's different. And I really want to get these things kicked off, because I want people to start seeing how they actually work. So in the Ralph Wiggum plugin, rather than doing the very first thing... we're going to use the exact same prompt for both of them, because we want to change as little as possible. But what's going to happen in the Anthropic plugin is basically, I forget where the performance line is, but whenever it gets to the end and you have your final assistant message, it uses a promise. The user's got to define a promise, and it relies on the LLM
>> to promise that it's completed.
>> Yeah. So you have your final message, and then basically, unless this contains the promise... sorry, let's just drop this in. If it's not there, then the hook injects a new user message that is just prompt.md again, which is then going to cause this stuff to be reallocated and happen again.
>> And then you get things like compaction and all this stuff.
>> Compaction is the devil, Dex.
>> Yeah. At some point you get compacted here, and then instead of having all of the context, you end up with, okay, you were running some tools, and then you get compacted,
>> and then you have the model summary
>> Yeah.
>> of what the model thinks is important, and then you keep going, until you get your final message, and then this process repeats. And so at these points it has a very different behavior.
>> This is why I say deterministic. One model has zero auto-compaction, ever. The other one is using auto-compaction, and auto-compaction is lossy. It could remove the specs, it can remove the task and goal and objective. With the Ralph loop, the idea is you set one goal, one objective, in that context window, and so it knows when it's done. If you keep extending the context window forever,
>> you lose your deterministic allocation.
>> You lose your deterministic allocation. And more so, let's assume the garbage collection hasn't run, it hasn't been compacted: that window has to slide over two or three goals, and some of those goals have already actually been completed.
>> Mhm.
>> One context window, one activity, one goal. And that goal can be very fine-grained: do a refactor, add structured logging, what have you. And you can have multiple of these running. You can have multiple Ralph loops running.
>> Mhm. Okay. So, I'm on my Ralph plugin one. I'm going to run Claude, and I'm going to kick off this loop for the other one. So we're going to do Ralph Wiggum, ralph-loop, read, and then our... what is it? What is the name of the flag?
>> Sorry, I'm doing something.
>> Yeah.
>> Completion promise.
>> Completion promise. Yeah. And this is going to turn on the hook, and it's going to start working. And over here, I'm going to kick off our loop.sh. Oh, I think I might need to grab the prompt first.
>> Yeah. All right. So another thing to think about: the Ralph plugin is running within Claude Code, and the non-plugin, keep-it-really-simple version is the idea of an orchestrator running Claude Code, or running a harness.
>> Yeah. You have the outer harness and then the inner harness. Right.
>> This is the idea of the inner harness and the outer harness. So remember I said Opus is forgetful. The current Opus is forgetful. For example, when I'm building loom, I see that it always forgets translations.
>> So, cool, you've got this Ralph loop to do what it's meant to do, and you've got a supervisor on top which asks if it did the translations. And if the translations don't work, you run another Ralph loop to nudge it: hey, did you do the translations? So the idea behind Ralph is an outer-layer orchestrator, not just an agent in a loop.
>> So it doesn't just have to be loop-and-do-it-forever. Your loop could actually run the main prompt, and then you could have another one which is: classify whether X was done.
>> Correct.
>> And then you can jump out to other prompts, like add the tests and fix the tests, or do the translations, or whatever it is.
>> Yeah, we're engineering in places that don't even have names for these concepts yet, Dex. [laughter]
>> Yeah, you can front-run Anthropic on this one.
>> Yeah. I saw there was some conversation on Twitter which was like, okay, if Claude Code is the harness, what is the name you give for engineering the slash commands and plugins and prompts, and maybe the bash loop that you wrap around it? Because you could say that the Ralph loop script becomes part of the harness, and you've created a new harness on the building block that is Claude Code or Amp or opencode or whatever. But someone else posted, well, if the coding agent CLI tool is the harness, then the things you build to control it are the reins. And so now I'm like, what is reins engineering? But I hope that one doesn't catch on, because it sounds really dumb.
>> No, no, I have a spicy one. It's called software engineering.
>> It's called software engineering.
>> I like it.
>> We need the new term because there are so many people who just don't get it right now, who are in denial that this is good. They're in their cope land, and people want a way to differentiate,
>> they want to differentiate their skills.
>> Like we had sysadmins and DevOps and SREs. They created these new titles to differentiate, and eventually those titles got muddied, because people go, oh, I'm DevOps now because I know Kubernetes. Oh, I'm an AI engineer now because I know how to malloc the array, or how the inferencing loop works. No, no, no.
These are just fundamental new skills, and if you don't have what we're talking about in a year, I think it's going to be really rough in the employment market at high-performance companies. Like, I'm already seeing things at FAANG-ish companies. Won't go into specifics because we're live, but if you're a software engineering manager right now, the axes are coming out. They want your team, which you have no control over really, because they're humans, to get good at AI.
>> So it's got to be kind of brutal. It is kind of brutal. Everyone wants people to get good at AI, but it really comes down to whether someone's curious or not. Really, did you make the right hire originally?
>> Yep.
>> So I think it's software engineering, Dex.
>> I think it's just literally software engineering, but what it means to be a software engineer changes.
>> I did realize that I think we can push. I just want to make sure that we're allowed to commit, because I know you have to do some...
>> Yeah, gh auth login.
>> So I have deploy keys on both these boxes. [clears throat] Let's see. tmux within tmux is great. I'm really glad I changed my default tmux prefix key, but now I have to remember what the default one is on the new boxes.
>> While we're on the tangent, folks:
>> Yeah.
>> You should be thinking about loopbacks, any way that the LLM can automatically scrape context. The LLMs know how to drive tmux. So instead of some background Claude Code agent, etc., just tell it to spawn a tmux session, split the pane, and scrape the pane. It does it really well. If you've got a web server log and a backend API log, create them in two splits, and tell Claude, or the model, to go grab the pane. That's your automatic loopback for troubleshooting. And you don't need to be in the loop. You're on the loop, you're programming the loop, and this is all Ralph.
>> Yes. A couple of weeks ago on AI That Works we did a session on git worktrees, and we did some demos of having one Claude running over here, and using tmux to scrape the panes of the other ones, and then merging in the results from the worktrees and resolving the conflicts.
>> Yeah. Well, whilst that kicks off, we're also on another tangent. This is a concept that you coined. Damn it, because I [snorts] just didn't write it down first.
>> That's why I invite you on my streams. I want you to come up with fun words, and I'll just be there while you do it, which is just recording what happened anyway.
>> Most test runners are trash. They output too many tokens. You only want to output the failing test case.
>> I wrote a blog post on this. Did you see it?
>> You did. And it's golden, Dex. It's golden. Most test runners are trash.
>> This is actually based on a bunch of work; I think the first person to write this stuff in our codebase was Allison when she was hacking. This is a version of a script that Allison and Claude built a while ago, because why would you want to output a million tokens of Go JSON test output spew if the test is passing?
>> What happens normally is the output's so large that the model goes tail -100, but if the error is at the top, the tail misses it.
Yeah. No, this is the thing that happened all the time, where it's head -n 50, and yeah, if your tests take 30 seconds, then you're fine. But most people that we work with are teams with 50, a hundred, thousands of engineers, and their test suites, if you run them wrong, can take hours. And so there's some work to be done there, because if it runs the head, and then something fails but it doesn't see it, and it has to run it again, that's not just wasted tokens. It is wasted tokens, and it is wasted time. But in most cases, most people aren't doing this super hands-off Ralph Wiggum thing. And so what just happened is, I finished my code, and I, the human, am sitting there waiting for it to run this 5-minute test suite again.
>> That's the key.
>> And I'm like, why would I ever use this tool?
>> That's the key. I'm not in the loop bashing the array and manually allocating, trying to steer it the way most people use Cursor. Instead, I try to one-shot it at the top, and then I watch it. And if you watch it enough, you notice stupid patterns, and then you make discoveries like the test runner thing that you just showed, and you go, ooh, that's a trick that works.
>> I've also...
>> Discoveries are found by treating Claude Code as a fireplace.
>> As a fireplace. You just sit there and watch it.
>> You just sit there and watch it, like you're out camping, sitting there watching the fire.
>> I actually had a party on Tuesday, a little pre-New Year's event, and I wanted to set this up, but I just didn't have time. I really wanted one of the attractions at the party to be a laptop hooked up to the TV, with a terminal in a web app, so you can see Ralph working, and anyone at the party can go up and edit the specs and control the trajectory of the loop. So next time you come to one of my parties, we'll have that.
>> Mate, I've still got a couple of pre-planned trips, so it's just a matter of when I come to SF.
>> Okay. When you come to SF, we're doing a Cursed Lang hackathon. We could probably also do a Ralph plus Cursed Lang hackathon. I think that would be really, really fun. And yeah, it's deeply technical, and you can change the world. You could build incredibly useful things that actually make many people's lives better. But also, some of this is just art, and how do you bridge the gap between art and utility? It's a fun time.
>> Yeah, it's a crazy time. So, I'm down for that. Let me get loom done, because I think loom is the encapsulation of some of these ideas into, essentially, a remote ephemeral sandbox coding harness.
>> Mhm.
>> So, the ability for a self-hosted platform to actually create its own remote agents, weavers, and then it's just your standard agentic harness, which is 300 lines of code. If people think Claude Code's amazing, it's not. It's literally the model that does all the work. Go look at my how-to-build-an-agent-harness post. All right. So you've got this harness, you've got this remote provisioner on infrastructure.
>> The next step there is really, how could you encodify Ralph, all these little nudges and all these pokes? And what about the source control? It's also source control. Like, I've been wanting to get off GitHub for a long time and evolve SCM.
>> Did you build your own now?
>> Yeah. The last three days, like, AFK: I now have a remote provisioner, full RBAC, device login flows, OAuth login flows, Tailwind UI. It's got full SCM hosting, full SCM mirroring. We've got a harness, so I've got this CLI now that can spawn remote infrastructure, kick off an agent, and then, when it says it thinks it's done, I can set up almost like a chain reaction where agent pokes agent. So this is, did you do the translations, do all these things. And if you control the entire stack from source code, you can modify and change that stack to your needs, including source control as a memory for agents.
>> I love it. I've realized one other thing here, which is that I did not put a push command in my prompt, and so the agents didn't push their stuff.
>> Yeah. So that's another thing we haven't covered off yet: the idea of having a shell script on the outside, or an orchestrator over the harness.
>> That's true. You could just do the push in the orchestrator.
>> Correct. Which makes it deterministic.
But you can also add a deterministic push, a deterministic commit. You could add a deterministic evaluation of whether it meets your criteria. Does it do a git reset hard? Does it run Ralph further on what you've already got? Does it bake it more, or does it just reset and try again?
>> Yeah.
>> But if you leave it all inside the harness, you're just going to get steak that's either blue or charred.
>> Okay, so here's what's interesting: we are back to non-determinism. You see, this one over here started running the thing, and it actually emitted the promise, because it read the prompt and said, okay, everything is done with the first thing. It finished the prompt and it did the first thing, but it's now not looping,
>> and so the kind of...
>> If I tell you not to think about an elephant, what are you thinking about, Dex?
>> Elephants.
>> Exactly. This is another thing about prompt engineering. People go, it's important that you do NOT do XYZ. Right, and next thing, in the context window, it's going to think about XYZ. The less that's in that context window, the better your outcomes. That includes not treating it like a little kid.
>> Mhm. I want to actually edit this, because I haven't worked with this plugin much. So a little bit of this is my... huh... a little bit of this is just me learning the tricks of this plugin. But it looks like the Ralph loop is finished. So I'm going to make another one. Let's see. Or, what is it, completion promise? I'm just going to try to run it without a completion promise and see if this will just run forever.
>> Yeah. I hope people stumble upon this video and they're able to disconnect the two: the official product implementation, oh wow, it's by Anthropic, versus learning the fundamentals of
>> why
>> why it works and how it works, and actually watching it. I have AFK'd it for three months, but I wasn't paying for tokens. I saw it rewrite the lexer and parser so many times, and I thought the model was the issue. It wasn't the model.
>> Hey Dex, do you know someone who said that you should spend some time reading the specs, and more time on the spec? Because one bad line of code is one bad line of code, but one bad spec is like 10 new product features, 10,000 lines of crap and junk. Because in the case of cursed,
>> Yeah,
>> in the case of cursed, my spec was wrong. So it was tearing down the lexer and the parser, because I declared 'and' and 'or' to be the same keyword,
>> because you had a mistake in the list, you couldn't come up with...
>> And I was saying that the model was bad, and the loop. It was literally garbage in, garbage out.
>> Because you didn't know enough Gen Z slang to do a good job.
>> Yeah, I literally ran out of Gen Z words. [laughter]
>> I'm just going to show this real quick for people who are not familiar: this is a programming language that was built with Ralph, three times over, in three different languages. It was C, and then Rust, and then Zig, right?
>> Yeah. Playing with the notion of backpressure, and what's in the training data sets, and all that stuff.
>> Yeah, this is cool. Anyways, I'm going to leave this running for a while. I'm probably not going to be sitting here, but I hope, if you're watching, you had fun and you learned some stuff. And Geoff, I know you've got to head into work in a minute.
>> I've got to head into work.
>> Any final thoughts? Any last words? I mean, you kind of said your advice, which is, don't just jump on the plugin and the name and the cartoon character. It's as much a teaching tool as anything, so go learn why it works and why it was designed the way it was.
>> Yeah. Think like a C or C++ engineer. Think that you've got this array. There's no memory on the server side. It's a sliding window over the array. You want to set only one goal and objective in that array. And you want to leave some headroom. If you're not going complete AFK, you want to leave some headroom, because sometimes you get this beautiful context window that you just fall in love with,
>> and then you're like, oh, can I squeeze some more out? Maybe it's not a new loop. You get these golden windows.
>> Yeah, where the trajectory is perfect, and it's running the tests properly, and you get into the right...
>> You want to save it. You want to save it.
>> That's something I think is an area of research in agentic harnesses: the ability to say, this is the perfect context, I want to go back to it.
>> Yeah. Deliberate malloc'ing.
>> Yeah. Deliberate malloc'ing. And less is more. Holy crap. Take [snorts] your Claude Code rules and tokenize them. [laughter] Go get [snorts] tiktoken off GitHub. Run them through that tokenizer, or the OpenAI tokenizer.
>> Read the harness
guides.
Um, read the harness guides. Like,
Anthropic says it's important to shout
at the LLM. GPT-5 says if you shout at
it, it becomes timid.
>> You [laughter] detune the model. Yeah.
It stops being good. But yeah, you can
look at the look at the tokenizer. I
mean, this is easy because it's this,
but like yeah, we talk about this all
the time as if like you should go look
at how the model sees what you say
because when you type JSON into here,
you see there are so many extra
characters; this is way denser than
just feeding the model words. And so you
should deterministically turn the JSON
into words or XML or something
more token-efficient.
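One deterministic way to do that conversion, as a sketch (the dotted key-path format here is an arbitrary choice, not anything the stream prescribes):

```python
import json

def json_to_lines(obj, prefix=""):
    """Deterministically flatten JSON into plain "key: value" lines,
    dropping the braces, quotes, and commas the tokenizer would
    otherwise have to carry."""
    lines = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            lines.extend(json_to_lines(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            lines.extend(json_to_lines(value, f"{prefix}{i}."))
    else:
        lines.append(f"{prefix.rstrip('.')}: {obj}")
    return lines

payload = {"tool": "ralph", "tests": {"passed": 202, "failed": 0}}
flat = "\n".join(json_to_lines(payload))
print(len(json.dumps(payload)), len(flat))  # → 56 45
```

Even on this tiny payload the flattened form is shorter in characters, and the saving compounds on real tool output where the JSON punctuation repeats thousands of times.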
Yeah, I'll leave you with a quip.
>> Yeah, let's go.
>> So,
you could only fit about a...
Oh, actually, here's the
quip.
I remember someone coming to me and
wanting to do an analysis on some data
using our labs.
>> Mhm. And I go, "How big is the data
set?" And that person went, "Oh,
it's small. It's only a terabyte."
So, I had to pull up the chair
and go, "Oh, this is
only a Commodore 64
worth of memory." So, if you want to
know how big like 200k of tokens is,
>> yeah,
>> It's tiny. You've got like...
the model gets about a 16k token
overhead.
>> The harness gets about a 16k overhead.
You've only got about 176k usable, not the
full 200k, because there are overheads,
>> right? There's the system
messages that come in, right?
>> Yeah. Yeah. Yeah. So, for that person,
I um downloaded the Star Wars Episode I movie
script.
>> Mhm.
>> And I tokenized it.
>> Okay.
>> And that that worked out to be about 60K
of tokens or about 136 KB on disk.
You can only fit a max of
one movie or two movies into the context
window.
>> Here's a new measurement.
>> How many movies can you fit into the
context window? To get people thinking
visually when we talk about tokens. It's
just this weird concept: you
can only fit about 136 KB, and people go,
what's 136 KB? It's the Star Wars movie
script.
>> Amazing. And that includes the tool
output; if you apply it back to the
domain, that includes your tool
output, your spec allocation, and
your initial prompts. It goes
by fast. Goes by fast.
>> Yeah,
>> With those you can do a ton, but it's
also incredibly small, and uh
the engineering and being thoughtful
about how you use this stuff uh can make
a huge impact.
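The budget arithmetic from this exchange, written out. All figures are the speakers' rough estimates from the conversation, not official numbers:

```python
# Rough context-budget math using the figures quoted on stream.
ADVERTISED = 200_000      # advertised context window, in tokens
USABLE = 176_000          # the speakers' estimate after model + harness overhead

MOVIE_SCRIPT = 60_000     # Star Wars Episode I script, tokenized (~136 KB on disk)

print(USABLE // MOVIE_SCRIPT)   # → 2  whole movie scripts fit
print(ADVERTISED - USABLE)      # → 24000  tokens lost to overhead
```

That two-movie ceiling has to hold the system prompt, specs, tool output, and every intermediate step, which is why it "goes by fast".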
>> Correct. And your best learnings will
come by treating it like a Twitch stream
or sitting by the fireplace, and then
asking it all these questions and
trying to figure out why it does certain
behaviors when there's no explainable
reason. But then you notice patterns, and
then you tune things like your
AGENTS.md, which should only be about 60
lines, uh, by the way.
>> Yeah. AGENTS.md should be small.
Everything should be small. Everything
should be so small. You want to maximize
>> useful working time in the smart zone.
Uh, this was super fun. I decided to do
this as a bit, and Geoff texted
me, like, I'm gonna come hang out
and talk about Ralph, and I was like,
incredible. So, thank you so much for
joining.
>> Anytime, man.
>> Post the video somewhere. Uh, if you
want to do a recap or or a
retrospective, I'm I'm happy to dive
deeper once this thing is like cooked
for a couple hours.
>> Peace until I'm next in San Fran, mate.
>> All right, sir. Enjoy.
>> See you.
>> Okay, that was Geoff Huntley. I am now
going to uh get back to work. Uh and
we're going to let this thing cook and
we will just leave it online on the
stream for a bit. So, I'm going to turn
off the OBS camera. I'm gone now. And uh
yeah, enjoy. We'll check back in in a
little bit. Cheers.
>> So, it appears that the Anthropic one is
completely dead.
I'm wondering
what is going on here. Oops. I didn't
mean to cancel that one. Let's get that
one going again. So, it just keeps
running the hook over and over again.
Let's see if we can Ctrl-C this.
The only thing I can assume is...
the stop hook says iteration 110, no completion
promise set. That echo is coming from my phone;
I'm hearing the stream on my phone. Um,
yeah. I wonder what's going on here. Did
it delete the prompt? The prompt is here.
What happened? Stop hook error.
Stop says... yeah, what happened here?
That's the wrong tmux session. What was
the last thing that happened before that
got stuck in this loop? Oh my god, look
at all these iterations. So, all
milestones complete. 202 tests passing.
Complete. Project is feature complete.
Features listed and out of scope are
intentionally deferred or not planned.
Milestones complete. All milestones
complete.
Project is complete. Project complete.
All milestones done. Okay. So, this one
says it's baked.
[clears throat] All done.
Complete. All milestones complete. Done.
Project complete.
This is funny because it's all in one
context window. It's just seeing its own
context and just returning this check
mark. Whereas like Ralph starts fresh
and it's like learning again and then
it's more likely to come up with new
things to do.
Um, but what we're going to do is we're
going to actually pull down this repo
and we'll see how it goes. So,
let's just pop this guy open. Oops. All
right. I'm going to get logged in here
and we're going to explore this and see
how it did. H. [clears throat] So, this
was the plug-in one.
get poll and then tell me about this
project
and teach me how to use it. We'll see
how this goes.
>> All right, fine. I will get... Oh, because
we opened it not from the CLI. All
right, we're going to pull this down
over here and see what fast Haiku can
do. I guess we'll check out the other
one, too.
See how far we've gotten. Teach me how
to use it and let's run through some
examples. Let's go fast Haiku, this guy
on DSP. Clean up some of these other
ones and check out the code. All right,
we got CLI, we got the core, config,
parser, front matter, parser, get URL
parser. Oh, this is sick we got here.
Okay, tests are failing. Perfect: 100
pass, nine failures. We got some good
GitHub URL parsing.
Cannot find package. Ah, so we should
install.
[clears throat] This one appears to have
more tests. Nope, fewer tests on the
plug-in one. We'll have to see the source
and how it differs from slash
adheres to the specs. [clears throat] I
would guess that we're not going to get
the emergent behavior here
that we are used to getting from the
Ralph in a loop which is going to just
keep finding stuff to do if we've done
everything. One thing we could do is we
could extend it so that... Oh, that's why.
Okay,
it's parsing that as a comment. Um,
moving anything in the out-of-scope
slash future-work section into scope,
and then we can go launch it again.
Okay. And we'll do the same one over
here. [clears throat]
I should probably
make sure they match. All right. Next
time it runs, it will pull that out of
scope stuff into scope. [clears throat]
Excellent. The tests were fixed by the
agent.
[clears throat and cough] I really, um,
really should have told this not to use
background agents. Really wish it would
just run all the agents in the
foreground, but such is life. Okay, so
in our Ralph plugin,
M3 is partially implemented,
not implemented. The spec has this but
it's not there. I wonder if our
implementation plan was broken. Let's
see this one. M3 remote sources partial
offline update lock files still coming
along. Well, this one didn't actually
say it was done. So, and this says M4
was not started. So, you could allege
that this one actually finished faster,
but at the end of the day, the real
answer is how do they work? So, we're
going to jump into
let's just do this guy. Customer Val
plugin.
Well, that's not good. This was for the
plug-in one. Human on the loop, folks.
Build it. Here we go. Okay. Interesting.
It really hasn't taken the customization
flag. Uh, now what do I do?
[clears throat]
Okay, I built our stuff.
So, we want to do something like slow
down, buddy. We got to understand what
we did here. What I really want to do,
what's it called? I don't remember the
ticket number. Try new things about
find. We got lazygit going on here.
We're not going to worry too much about
this. All these fun watch hooks
on build, on error, on shell,
committing with git, pushing up. Getting
stuff, folks. I think neither of these
are really fully baked yet, but um we're
gonna let this cook for a little bit longer
and uh see where we get to. This one has
way more tests. Worth noting. Um
actually, what I want is to compare both
implementations. All right, we're going
to kick that off. We're going to let
them keep cooking. Um
should probably start a Ralph loop
locally to run through and just explore
and compare these things. But
[clears throat] anyways, we're going to
let this cook for a little bit more.
We'll be back. Okay, so we started at 2.
It's about 4:20 again. This one's been
done for a while, it feels like.
But we added some more stuff to scope
just so it could keep going. Um, I'm
actually going to go check out the
repos. [cough and clears throat]
So, here we'll be able to see, here was
the plug-in one. A little bummed this
didn't get a README. Um, we can come
through and like look at the
implementation plan here. Seems like
everything is done. And here's our
progress log. Apparently, we have
parallel flags. Um, I like that this one
made a readme that explains how it works
and how to use it. Um, so I am on my
other workstation uh that is not being
used for streaming and uh I am testing
the actual use case for this which was
to take a bunch of skills from a claude
plugin and patch them with repo specific
rules. So, we'll be back to talk through
how that went. Um, and we'll be trying
it with both versions of the plugin. So,
um I'm going to go put the put the
Ralphs back on and we will uh keep
rocking. Okay. So, we did uh start
implementing this over there. We did
find one issue um which I from my other
workstation have opened
um because we really want it to
maintain the
directory structure, but instead it
flattens it, um and so it's done some
root causing over there because it had
access to the source code. So we're
going to update our Ralph loop to not
only read from the specs but also um
from the known issues folder
um and see how that works. I've done
this a couple times before. Um so I'm
going to
pop this one open over here.
Um, what's the best way to install
the GitHub CLI on a
workstation? I'm asking; there are
very easy answers to
these questions, or you can install
whatever you want. Guess we'll do this
on both of these. I hope we can do this
unauthenticated.
Okay. Oh, we made the same mistake
deterministically in a
nondeterministic world.
That's fun. They both made the exact same
mistake. Let's see if we can fetch
issues without logging in. Amazing. It
worked. This one's still having trouble.
We're going to add this to the original
prompt. We get rid of this. I guess we
better make sure there's an issue on the
other one, too. Okay, we got an issue
there. Now, let's see. Guess we should
make sure we have JQ. Nice. Thank you,
Ubuntu, for including JQ.
Or maybe it came with Claude. Who knows?
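For reference, pulling open issues without logging in only needs the public REST API. A sketch (unauthenticated calls are rate-limited to roughly 60 requests an hour):

```python
import json
from urllib.request import urlopen

def issue_titles(payload):
    """Extract issue titles, skipping pull requests, which GitHub's
    issues endpoint also returns."""
    return [item["title"] for item in payload if "pull_request" not in item]

def open_issues(owner, repo):
    """Fetch open-issue titles for a public repo; no token required."""
    url = f"https://api.github.com/repos/{owner}/{repo}/issues?state=open"
    with urlopen(url) as resp:
        return issue_titles(json.load(resp))
```

This is roughly what `gh issue list` or a `curl ... | jq '.[].title'` one-liner gives you, minus the authentication step the stream was trying to avoid.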
Okay, sick. My stop hook hit, so we
should be good here. I'm going to just
cancel the loop and make sure it's able
to do this stuff. This is the other nice
thing is you can just kill these things
and they'll pick back up wherever they
left off.
[clears throat] Oh, and this one is now
going along. Stop this one. We cannot
have two of these working on the same
thing. Yeah, it's fine. We could have
done that as our own separate step in
the loop script, but uh I think that's
fine for now.
Uh so now these things will pull in any
GitHub issues that get open. So go open
GitHub issues and see if you can uh
prompt inject my Ralphs into doing
something weird. Have fun.