
Ch 11: Large Language Models & Transformers - Introduction

Track: Practitioner | Try code in Playground | Back to chapter overview

Read online or run locally

You can read this content here on the web. To run the code interactively, either use the Playground or clone the repo and open chapters/chapter-11-large-language-models-and-transformers/notebooks/01_transformer_architecture.ipynb in Jupyter.


Chapter 11: LLMs & Transformers — Notebook 01 (Transformer Architecture)

This notebook builds the Transformer from first principles: from the limits of RNNs to scaled dot-product attention, multi-head attention, positional encoding, and a full encoder block — all implemented in NumPy.

What you'll learn

  • §1 — Why attention: limits of RNNs and motivation for transformers
  • §2 — Scaled dot-product attention in NumPy
  • §3 — Multi-head attention and shape bookkeeping
  • §4 — Sinusoidal positional encoding
  • §5–6 — End-to-end encoder block, plus encoder / decoder / encoder-decoder model families

Time estimate: 3 hours


Key concepts

  • Self-attention — Every token attends to every other token via Query/Key/Value projections.
  • Scaled dot-product attention — softmax(QKᵀ / √dₖ) V; dividing by √dₖ keeps the scores well-scaled and the softmax gradients stable as the key dimension dₖ grows.
  • Multi-head attention — Run several attention "heads" in parallel and concatenate to capture different relations.
  • Positional encoding — Inject token order via sinusoids since attention itself is permutation-invariant.
  • Encoder block — Attention → residual → layer norm → feed-forward → residual → layer norm.
  • Model families — Encoder (BERT), decoder (GPT), encoder-decoder (T5) — each suits different tasks.
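
The scaled dot-product formula above can be sketched in a few lines of NumPy. This is a minimal single-sequence version for illustration; the function and variable names are my own, not necessarily those used in the notebook:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K: (seq_len, d_k); V: (seq_len, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len) similarity scores
    weights = softmax(scores, axis=-1)   # each row is a distribution over tokens
    return weights @ V, weights          # weighted sum of values, plus weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # output has one d_v-dimensional vector per query token
```

Each row of `w` sums to 1, so every output token is a convex combination of the value vectors — the "every token attends to every other token" behavior described above.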

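Since attention is permutation-invariant, the sinusoidal positional encoding of §4 injects order information. A small NumPy sketch, assuming an even `d_model` (helper name is illustrative):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # (1, d_model // 2) even indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)               # odd dimensions get cosine
    return pe

pe = sinusoidal_positional_encoding(16, 64)
print(pe.shape)  # one d_model-dimensional encoding per position
```

Each dimension pair oscillates at a different wavelength, so every position gets a unique pattern that is simply added to the token embeddings.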
Run the full notebook in the chapter folder for code and outputs.


Generated by Berta AI