AWS Certified Generative AI Developer - Professional: Cost optimisation

6m 14s · 912 words · 155 segments · English

FULL TRANSCRIPT

0:06

Day 23: cost optimization, batching, caching, model choice. This is the day where AWS stops asking whether your GenAI system works and starts asking whether it can survive finance. Because in the real world, systems don't get shut down for being inaccurate. They get shut down for being too expensive.

0:24

Imagine this: a company launches an AI platform for internal use. It summarizes reports, answers policy questions, runs an operations agent. Adoption explodes. Everyone loves it. Then finance sends one email: why did AI cost triple this month? Nothing is broken. Nothing is unsafe. Nothing is inaccurate. But the system is wasteful. That's the day 23 problem.

0:46

To understand cost, you must first understand where it actually comes from. There are five real cost buckets in GenAI systems: input tokens (your prompts and retrieved context), output tokens (what the model generates), model choice (big brains versus small ones), embeddings and vector storage, and repeated unnecessary calls. If you can shrink even one of these buckets, cost drops.

1:10

The biggest lever by far is model choice. Here is the exam truth: using the smartest model for everything is almost always wrong. If the task is shallow, repetitive, or predictable (rewriting text, summarizing, classifying, formatting), you do not need a thinking monster. You use cheaper models: Titan Text Lite, Titan Text Express, Claude Haiku. They are fast, they are cheap, and they are good enough. You only use larger models when reasoning or accuracy actually matters.
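The routing idea above can be sketched as a simple task-tier router. This is a minimal sketch, not an official AWS pattern: the task taxonomy is an assumption, and the Bedrock model IDs are illustrative and should be checked against your region's model list.

```python
# Hypothetical router: shallow, predictable tasks go to a cheap model,
# genuine reasoning tasks to a larger one. Model IDs are illustrative
# Bedrock identifiers -- verify them before use.

CHEAP_MODEL = "anthropic.claude-3-haiku-20240307-v1:0"   # fast, low cost
LARGE_MODEL = "anthropic.claude-3-sonnet-20240229-v1:0"  # reserved for real reasoning

# Assumed task taxonomy: shallow/repetitive work routes to the cheap tier.
SHALLOW_TASKS = {"rewrite", "summarize", "classify", "format"}

def choose_model(task_type: str) -> str:
    """Return the cheapest model that plausibly handles the task."""
    if task_type in SHALLOW_TASKS:
        return CHEAP_MODEL
    # Complex RAG answers, ambiguous decisions, policy interpretation, etc.
    return LARGE_MODEL
```

In a real system the `task_type` label would come from the request path or a lightweight classifier, and escalation to the large model would be the exception, not the default.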

1:40

Complex RAG answers, ambiguous decisions, policy interpretation: that's when Claude Sonnet or Mistral Large earns its cost. AWS does not reward overkill. A very common exam trap is this: "use a larger model to reduce hallucinations." That is almost never correct. Hallucinations are fixed by retrieval quality and grounding, not by buying a more expensive brain. Once retrieval is correct, you pick the smallest model that still works.

2:07

The next major cost sink is token waste. Tokens are money, and tokens are wasted in very predictable ways: long system prompts repeated on every request, retrieving too many chunks, passing entire documents instead of small sections, verbose answers when short ones would do. This is where token discipline matters. You cut token cost by being intentional: short reusable system prompts, lower top-k retrieval, re-rankers instead of more context, clear instructions to answer concisely, strict grounding so the model doesn't ramble. AWS loves the phrase "token budgeting."

2:42

Now let's talk about one of the cheapest wins in the entire exam: caching. Caching means you don't recompute the same thing twice, and GenAI systems repeat themselves constantly. You can cache embeddings (almost always), RAG retrieval results, final answers for common questions, and tool call results in agents. Anywhere the same input produces the same output.

3:05

Caching saves money. Picture this: users constantly ask, "What is the expense policy?" That policy changes maybe twice a year. Without caching, you pay for retrieval and generation every time. With caching, you answer once and reuse it. That's massive savings. If the exam mentions frequently asked questions, repeated queries, or static documents, your brain should immediately say, "Cache it."

3:30

Next is batching, which is incredibly powerful, but only in the right place. Batching means processing many items in one call instead of many small calls. It shines in offline and high-volume scenarios: generating embeddings for thousands of documents, daily summarization jobs, bulk document processing. This is where batching slashes cost. But batching is not for everything. You do not batch real-time chat. You do not batch interactive agents. You do not batch low-latency workflows. AWS will penalize you if you suggest batching where users expect instant responses.

4:04

RAG systems deserve special attention here. Most RAG cost problems come from bad retrieval: top-k too high, no metadata filters, re-embedding unchanged documents, rerunning the same searches again and again. The fixes are retrieval fixes, not model fixes: embed once and reuse, filter aggressively, cache retrieval results, tune chunking and overlap. Better retrieval means fewer tokens and fewer retries. That's real cost control.
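The offline batching pattern above (embedding thousands of documents) can be sketched as a fixed-size batch loop. `batch_embed` is a hypothetical stand-in for a bulk embeddings call; the point is the call count, which drops from one per document to one per batch.

```python
def batch_embed(texts: list[str]) -> list[list[float]]:
    # Placeholder: a single bulk API call would embed the whole batch here.
    return [[float(len(t))] for t in texts]

def embed_corpus(docs: list[str], batch_size: int = 32) -> list[list[float]]:
    """Embed a corpus in fixed-size batches: ~len(docs)/batch_size calls
    instead of len(docs) calls, while preserving input order."""
    vectors: list[list[float]] = []
    for i in range(0, len(docs), batch_size):
        vectors.extend(batch_embed(docs[i:i + batch_size]))
    return vectors
```

This only pays off where latency is flexible (nightly ingestion, bulk processing); per the transcript, an interactive chat request should never wait for a batch to fill.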

4:33

Agents are their own cost problem. Agents are expensive by default because they take many steps, call many tools, and generate long traces. But the correct response is not "avoid agents." The correct response is "optimize agents." You control agent cost by capping maximum steps, caching tool results, using smaller models for planning, and escalating to larger models only when necessary. AWS rewards this thinking.

5:00

At the heart of day 23 is one question: what is the cheapest architecture that still meets requirements? Not the cheapest possible, not the most accurate possible, the right balance. That is senior engineering.

5:11

Let's call out the classic exam traps. Fine-tuning does not automatically reduce cost. One model for everything is lazy and expensive. Caching blindly creates staleness. Batching real-time chat is wrong. Every optimization is contextual. AWS wants thoughtful trade-offs.

5:31

If you remember one decision tree, remember this. First, can I use a smaller model? Second, can I reduce tokens? Third, can I cache results? Fourth, can I batch requests? Fifth, can I reduce agent steps? That sequence solves most cost questions.
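The five-question sequence above can be written down as a first-match checklist. The workload flags below are assumptions about what you would know about a given system, not an AWS-defined schema; the value is the ordering.

```python
def first_optimization(workload: dict) -> str:
    """Walk the day-23 decision tree in order and return the first
    applicable cost lever for this workload."""
    if workload.get("task_is_shallow"):        # 1. smaller model?
        return "use a smaller model"
    if workload.get("prompts_are_verbose"):    # 2. fewer tokens?
        return "reduce tokens"
    if workload.get("queries_repeat"):         # 3. cache?
        return "cache results"
    if workload.get("offline_bulk_work"):      # 4. batch?
        return "batch requests"
    if workload.get("agent_steps_uncapped"):   # 5. fewer agent steps?
        return "cap agent steps"
    return "architecture already lean; re-measure"
```

Note the order matters: a workload that both repeats queries and runs on an oversized model should fix the model first, because model choice is the biggest lever.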

5:46

Here is the one sentence to lock this entire day into memory: cost is controlled by model choice, token discipline, caching, and batching, in that order. Say it once. Day 23 sticks.

5:57

Final self-test: an internal RAG system answers the same policy questions repeatedly and costs keep rising. What's the best optimization? Cache grounded responses and reduce repeated retrieval. That's day 23 mastered.
