AWS Certified Generative AI Developer - Professional: Cost optimisation
FULL TRANSCRIPT
Day 23, cost optimization, batching,
caching, model choice. This is the day
where AWS stops asking whether your
GenAI system works and starts asking
whether it can survive finance. Because
in the real world, systems don't get
shut down for being inaccurate. They get
shut down for being too expensive.
Imagine this. A company launches an AI
platform for internal use. It summarizes
reports, answers policy questions, runs
an operations agent. Adoption explodes.
Everyone loves it. Then finance sends
one email. Why did AI cost triple this
month? Nothing is broken. Nothing is
unsafe. Nothing is inaccurate. But the
system is wasteful. That's the day 23
problem. To understand cost, you must
first understand where it actually comes
from. There are five real cost buckets
in GenAI systems: input tokens (your
prompts and retrieved context), output
tokens (what the model generates), model
choice (big brains versus small ones),
embeddings and vector storage, and
repeated unnecessary calls. If you can
shrink even one of these buckets, cost
drops. The biggest lever by far is model
choice. Here is the exam truth. Using
the smartest model for everything is
almost always wrong. If the task is
shallow, repetitive or predictable,
rewriting text, summarizing,
classifying, formatting, you do not need
a thinking monster. You use cheaper
models: Titan Text Lite, Titan Text
Express, Claude Haiku. They are fast,
they are cheap, and they are good
enough. You only use larger models when
reasoning or accuracy actually matters.
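That split (cheap models for shallow tasks, larger models only when reasoning matters) can be sketched as a tiny router. This is a sketch, not an official mapping: the task categories are assumptions, and the model IDs are illustrative Bedrock-style names, abbreviated rather than exact.

```python
# Hypothetical task-to-model router. Task categories and model IDs are
# illustrative assumptions, not an official AWS mapping.

CHEAP_TASKS = {"summarize", "classify", "rewrite", "format"}

def pick_model(task_type: str) -> str:
    """Default to the cheap model; escalate only when reasoning matters."""
    if task_type in CHEAP_TASKS:
        return "anthropic.claude-3-haiku"   # fast, cheap, good enough
    return "anthropic.claude-3-sonnet"      # reasoning or accuracy matters
```

In practice the routing signal might come from request metadata or a lightweight classifier rather than a hard-coded set.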
Complex RAG answers, ambiguous
decisions, policy interpretation. That's
when Claude Sonnet or Mistral Large
earns its cost. AWS does not reward
overkill. A very common exam trap is
this. Use a larger model to reduce
hallucinations. That is almost never
correct. Hallucinations are fixed by
retrieval quality and grounding, not by
buying a more expensive brain. Once
retrieval is correct, you pick the
smallest model that still works. The
next major cost sink is token waste.
Tokens are money and tokens are wasted
in very predictable ways. Long system
prompts repeated on every request,
retrieving too many chunks, passing
entire documents instead of small
sections, verbose answers when short
ones would do. This is where token
discipline matters. You cut token cost
by being intentional, short reusable
system prompts, lower top-K retrieval,
re-rankers instead of more context,
clear instructions to answer concisely,
strict grounding so the model doesn't
ramble. AWS loves the phrase token
budgeting. Now let's talk about one of
the cheapest wins in the entire exam.
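First, though, token budgeting can be made concrete with a small sketch: trim retrieved context to a fixed budget before it ever reaches the prompt. The 4-characters-per-token ratio is a crude assumption for illustration; a real system would use the model's actual tokenizer.

```python
# Token-budgeting sketch. The 4-characters-per-token estimate is a crude
# heuristic, not a real tokenizer; function names are illustrative.

def estimate_tokens(text: str) -> int:
    """Very rough token estimate (~4 characters per token for English)."""
    return max(1, len(text) // 4)

def fit_context_to_budget(chunks: list[str], budget_tokens: int) -> list[str]:
    """Keep retrieved chunks, in rank order, until the token budget is spent."""
    kept, used = [], 0
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if used + cost > budget_tokens:
            break  # stop instead of overflowing the prompt
        kept.append(chunk)
        used += cost
    return kept
```

Because chunks arrive in rank order, the budget cuts the least relevant context first.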
Caching. Caching means you don't
recompute the same thing twice, and
GenAI systems repeat themselves
constantly. You can cache embeddings
almost always, RAG retrieval results,
final answers for common questions, and
tool call results in agents. Anywhere the
same input produces the same output.
Caching saves money. Picture this. Users
constantly ask, "What is the expense
policy?" That policy changes maybe twice
a year. Without caching, you pay for
retrieval and generation every time.
With caching, you answer once and reuse
it. That's massive savings. If the exam
mentions frequently asked questions,
repeated queries, static documents, your
brain should immediately say, "Cache it."
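A minimal sketch of that idea: cache the final answer keyed on the normalized question, with a TTL so a policy change eventually forces a fresh retrieval. All names and the TTL value are illustrative, not a prescribed design.

```python
import time

# Minimal answer cache for repeated questions. The TTL guards against
# staleness (the "policy changes twice a year" case). Names are illustrative.

class AnswerCache:
    def __init__(self, ttl_seconds: float = 24 * 3600):
        self.ttl = ttl_seconds
        self._store = {}  # normalized question -> (answer, timestamp)

    @staticmethod
    def _key(question: str) -> str:
        # Cheap normalization: lowercase and collapse whitespace.
        return " ".join(question.lower().split())

    def get(self, question: str):
        entry = self._store.get(self._key(question))
        if entry is None:
            return None
        answer, ts = entry
        if time.time() - ts > self.ttl:
            return None  # stale: force a fresh retrieval + generation
        return answer

    def put(self, question: str, answer: str) -> None:
        self._store[self._key(question)] = (answer, time.time())
```

On a cache hit you pay nothing for retrieval or generation; on a miss or an expired entry you pay once and refill.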
Next is batching, which is incredibly
powerful, but only in the right place.
Batching means processing many items in
one call instead of many small calls. It
shines in offline and high volume
scenarios, generating embeddings for
thousands of documents, daily
summarization jobs, bulk document
processing. This is where batching
slashes cost. But batching is not for
everything. You do not batch real-time
chat. You do not batch interactive
agents. You do not batch low latency
workflows. AWS will penalize you if you
suggest batching where users expect
instant responses. RAG systems deserve
special attention here. Most RAG cost
problems come from bad retrieval. Top-K
too high. No metadata filters.
Re-embedding unchanged documents.
Rerunning the same searches again and
again. The fixes are retrieval fixes,
not model fixes. Embed once and reuse.
Filter aggressively. Cache retrieval
results. Tune chunking and overlap.
Better retrieval means fewer tokens and
fewer retries. That's real cost control.
Agents are their own cost problem.
Agents are expensive by default because
they take many steps, call many tools
and generate long traces. But the
correct response is not avoid agents.
The correct response is optimize agents.
You control agent cost by capping
maximum steps, caching tool results,
using smaller models for planning, and
escalating to larger models only when
necessary. AWS rewards this thinking. At
the heart of day 23 is one question.
What is the cheapest architecture that
still meets requirements? Not the
cheapest possible, not the most accurate
possible, the right balance. That is
senior engineering. Let's call out the
classic exam traps. Fine-tuning does not
automatically reduce cost. One model for
everything is lazy and expensive.
Caching blindly creates staleness.
Batching real-time chat is wrong. Every
optimization is contextual. AWS wants
thoughtful trade-offs. If you remember
one decision tree, remember this. First,
can I use a smaller model? Second, can I
reduce tokens? Third, can I cache
results? Fourth, can I batch requests?
Fifth, can I reduce agent steps? That
sequence solves most cost questions.
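That decision tree can be sketched as a checklist function. The task attributes and thresholds below are illustrative assumptions, not an AWS-defined schema:

```python
# The five-question decision tree as a sketch. Attribute names and
# thresholds are illustrative assumptions.

def plan_cost_optimizations(task: dict) -> list[str]:
    plan = []
    # 1. Can I use a smaller model?
    if not task.get("needs_deep_reasoning", False):
        plan.append("route to a smaller model")
    # 2. Can I reduce tokens?
    if task.get("context_tokens", 0) > 2000:
        plan.append("tighten retrieval and trim context to a token budget")
    # 3. Can I cache results?
    if task.get("repeated_queries", False):
        plan.append("cache grounded answers and retrieval results")
    # 4. Can I batch requests?
    if not task.get("interactive", True):
        plan.append("batch the workload offline")
    # 5. Can I reduce agent steps?
    if task.get("agent_max_steps", 0) > 5:
        plan.append("cap agent steps and cache tool results")
    return plan
```

Walking the questions in order mirrors the leverage each one has: model choice first, agent tuning last.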
Here is the one sentence to lock this
entire day into memory. Cost is
controlled by model choice, token
discipline, caching, and batching in
that order. Say it once. Day 23 sticks.
Final self-test. An internal RAG system
answers the same policy questions
repeatedly and costs keep rising. What's
the best optimization? Cache grounded
responses and reduce repeated retrieval.
That's day 23 mastered.