Skip to content

Ch 12: Prompt Engineering & In-Context Learning - Advanced

Track: Practitioner | Try code in Playground | Back to chapter overview

Read online or run locally

To run the code interactively, clone the repo and open chapters/chapter-12-prompt-engineering-and-in-context-learning/notebooks/03_prompt_systems.ipynb in Jupyter.


Chapter 12: Prompt Engineering — Notebook 03 (Prompt Systems in Production)

This notebook covers systematic evaluation (golden sets, graders, LLM-as-judge), A/B testing with bootstrap CIs, prompt-injection defenses, versioning + registry, and production observability.

What you'll learn

Topic Section
Golden datasets and grader functions (exact / regex / embedding) §1
LLM-as-judge: when it helps and when it lies §2
A/B testing prompts with bootstrap confidence intervals §3
Prompt-injection defenses: filters, sandwich, hierarchy, output validation §4
Versioned prompt registry with named, dated revisions §5
Production observability: logging, tracing, fallback chains §6

Time estimate: 1.5–2 hours


Key concepts

  • Eval harness — Fix a golden set, run candidate prompts, compute metrics with CIs — repeatable.
  • LLM-as-judge — Cheap and scalable but biased; calibrate against human labels first.
  • Prompt injection — Treat user input as untrusted; defend with filters, output validation, and privilege isolation.
  • Prompt registry — Version prompts like code: ID, timestamp, author, eval scores, rollback path.
  • Observability — Log prompt + response + metadata; alert on drift in latency, cost, or quality.

Run the full notebook for code and outputs.


Generated by Berta AI