Ch 11: Large Language Models & Transformers - Introduction
Read online or run locally
You can read this content here on the web. To run the code interactively, either use the Playground or clone the repo and open chapters/chapter-11-large-language-models-and-transformers/notebooks/01_transformer_architecture.ipynb in Jupyter.
Chapter 11: LLMs & Transformers — Notebook 01 (Transformer Architecture)
This notebook builds the Transformer from first principles: starting with the limits of RNNs, it develops scaled dot-product attention, multi-head attention, sinusoidal positional encoding, and a full encoder block, all implemented in NumPy.
What you'll learn
| Topic | Section |
|---|---|
| Why attention: limits of RNNs and motivation for transformers | §1 |
| Scaled dot-product attention in NumPy | §2 |
| Multi-head attention and shape bookkeeping | §3 |
| Sinusoidal positional encoding | §4 |
| End-to-end encoder block + encoder-only/decoder-only/encoder-decoder families | §5–6 |
Time estimate: 3 hours
Key concepts
- Self-attention — Every token attends to every other token via Query/Key/Value projections.
- Scaled dot-product — `softmax(QKᵀ / √dₖ) V` keeps gradients stable as dimensions grow (see the first sketch after this list).
- Multi-head attention — Run several attention "heads" in parallel and concatenate the results to capture different relations.
- Positional encoding — Inject token order via sinusoids, since attention itself is permutation-invariant (second sketch below).
- Encoder block — Attention → residual → layer norm → feed-forward → residual → layer norm (third sketch below).
- Model families — Encoder-only (BERT), decoder-only (GPT), encoder-decoder (T5) — each family suits different tasks.
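
To make the scaled dot-product bullet concrete, here is a minimal NumPy sketch. The function name and the toy 4-token input are illustrative choices, not the notebook's exact code:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for a single sequence (illustrative sketch)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (seq, seq) similarity logits
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # each row is a weighted mix of values

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                       # toy input: 4 tokens, d_model = 8
out = scaled_dot_product_attention(x, x, x)       # self-attention: Q = K = V = x
print(out.shape)                                  # (4, 8)
```

Without the `1/√dₖ` factor, the dot products grow with `dₖ`, saturating the softmax and shrinking its gradients; dividing by `√dₖ` keeps the logits in a well-behaved range.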
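
Sinusoidal positional encoding admits an equally short sketch, assuming an even `d_model` (the names and shapes here are again illustrative):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(seq_len)[:, None]               # (seq_len, 1) token positions
    i = np.arange(0, d_model, 2)[None, :]           # even embedding dimensions
    angles = pos / np.power(10000.0, i / d_model)   # (seq_len, d_model // 2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                    # sine on even dimensions
    pe[:, 1::2] = np.cos(angles)                    # cosine on odd dimensions
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=64)
print(pe.shape)                                     # (50, 64); added to token embeddings
```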
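
Finally, the encoder block is just the wiring pattern listed above. This compressed sketch omits the learned Q/K/V projections and the multi-head split from §3, and the feed-forward weights are random placeholders rather than trained parameters:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance."""
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(z):
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def encoder_block(x, W1, W2):
    """Attention -> residual -> layer norm -> feed-forward -> residual -> layer norm."""
    attn = softmax(x @ x.T / np.sqrt(x.shape[-1])) @ x  # single-head, unprojected self-attention
    x = layer_norm(x + attn)                            # first residual connection + norm
    ff = np.maximum(0.0, x @ W1) @ W2                   # two-layer ReLU feed-forward net
    return layer_norm(x + ff)                           # second residual connection + norm

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 8))                             # 4 tokens, d_model = 8
W1, W2 = rng.normal(size=(8, 32)), rng.normal(size=(32, 8))
print(encoder_block(x, W1, W2).shape)                   # (4, 8): the block preserves shape
```

Because the block maps `(seq_len, d_model)` to `(seq_len, d_model)`, blocks can be stacked to any depth, which is what the full encoder in §5 does.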
Run the full notebook in the chapter folder for code and outputs.