Ch 12: Prompt Engineering & In-Context Learning - Advanced¶

Track: Practitioner | Try code in Playground | Back to chapter overview

Read online or run locally

To run the code interactively, clone the repo and open chapters/chapter-12-prompt-engineering-and-in-context-learning/notebooks/03_prompt_systems.ipynb in Jupyter.

Chapter 12: Prompt Engineering — Notebook 03 (Prompt Systems in Production)¶

This notebook covers systematic evaluation (golden sets, graders, LLM-as-judge), A/B testing with bootstrap CIs, prompt-injection defenses, versioning + registry, and production observability.

What you'll learn¶

Topic	Section
Golden datasets and grader functions (exact / regex / embedding)	§1
LLM-as-judge: when it helps and when it lies	§2
A/B testing prompts with bootstrap confidence intervals	§3
Prompt-injection defenses: filters, sandwich, hierarchy, output validation	§4
Versioned prompt registry with named, dated revisions	§5
Production observability: logging, tracing, fallback chains	§6

Time estimate: 1.5–2 hours

Key concepts¶

Eval harness — Fix a golden set, run candidate prompts, compute metrics with CIs — repeatable.
LLM-as-judge — Cheap and scalable but biased; calibrate against human labels first.
Prompt injection — Treat user input as untrusted; defend with filters, output validation, and privilege isolation.
Prompt registry — Version prompts like code: ID, timestamp, author, eval scores, rollback path.
Observability — Log prompt + response + metadata; alert on drift in latency, cost, or quality.

Run the full notebook for code and outputs.

Generated by Berta AI