Ch 12: Prompt Engineering & In-Context Learning - Advanced¶
Track: Practitioner | Try code in Playground | Back to chapter overview
Read online or run locally
To run the code interactively, clone the repo and open chapters/chapter-12-prompt-engineering-and-in-context-learning/notebooks/03_prompt_systems.ipynb in Jupyter.
Chapter 12: Prompt Engineering — Notebook 03 (Prompt Systems in Production)¶
This notebook covers systematic evaluation (golden sets, graders, LLM-as-judge), A/B testing with bootstrap CIs, prompt-injection defenses, versioning + registry, and production observability.
What you'll learn¶
| Topic | Section |
|---|---|
| Golden datasets and grader functions (exact / regex / embedding) | §1 |
| LLM-as-judge: when it helps and when it lies | §2 |
| A/B testing prompts with bootstrap confidence intervals | §3 |
| Prompt-injection defenses: filters, sandwich, hierarchy, output validation | §4 |
| Versioned prompt registry with named, dated revisions | §5 |
| Production observability: logging, tracing, fallback chains | §6 |
Time estimate: 1.5–2 hours
Key concepts¶
- Eval harness — Fix a golden set, run candidate prompts, compute metrics with CIs — repeatable.
- LLM-as-judge — Cheap and scalable but biased; calibrate against human labels first.
- Prompt injection — Treat user input as untrusted; defend with filters, output validation, and privilege isolation.
- Prompt registry — Version prompts like code: ID, timestamp, author, eval scores, rollback path.
- Observability — Log prompt + response + metadata; alert on drift in latency, cost, or quality.
Run the full notebook for code and outputs.
Generated by Berta AI