L0 — Architecture overview

What an LLM inference engine actually does, end-to-end.

0.1 The seven things every serving engine does

An LLM inference server is a pipeline. Forget any single buzzword you've read; concretely, when a request hits the server, this is what happens:

1. Tokenize. Text bytes → token IDs.
2. Schedule. Decide which requests run in this forward step (and how many tokens of each).
3. Allocate KV cache slots. Reserve room in the paged KV pool for the new tokens.
4. Forward pass. Run the model, which reads and writes the KV pool through a paged attention kernel.
5. Sample. Convert logits → one new token per active request.
6. Detokenize incrementally. Token IDs → streamable text fragments.
7. Emit / stream / repeat. Push fragments to clients, then loop back to step 2.

Everything else (radix cache, speculative decoding, CUDA graphs, EP/TP/PP, hicache, etc.) is an optimization on top of this skeleton.
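
The seven steps above can be sketched as a toy loop. Nothing here comes from mini_sglang: `ToyRequest`, `toy_forward`, `serve`, and the four-symbol vocabulary are hypothetical stand-ins, and the "model" simply favours the token after the last one. The point is only the shape of the loop, not the real components.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical 4-symbol vocabulary standing in for a real tokenizer.
VOCAB = ["<eos>", "a", "b", "c"]
EOS_ID = 0


@dataclass
class ToyRequest:
    prompt: str
    token_ids: List[int] = field(default_factory=list)
    kv_slots: int = 0   # toy stand-in for reserved KV cache slots
    text: str = ""      # text streamed so far
    done: bool = False


def toy_forward(batch: List[ToyRequest]) -> List[List[float]]:
    """Fake model: the logit for (last_token + 1) mod vocab size is highest."""
    logits = []
    for req in batch:
        row = [0.0] * len(VOCAB)
        row[(req.token_ids[-1] + 1) % len(VOCAB)] = 1.0
        logits.append(row)
    return logits


def serve(prompts: List[str], max_steps: int = 8) -> List[str]:
    reqs = [ToyRequest(p) for p in prompts]
    for req in reqs:
        req.token_ids = [VOCAB.index(ch) for ch in req.prompt]  # 1. tokenize
    for _ in range(max_steps):
        batch = [r for r in reqs if not r.done]                 # 2. schedule
        if not batch:
            break
        for req in batch:
            req.kv_slots = len(req.token_ids)                   # 3. allocate KV slots
        logits = toy_forward(batch)                             # 4. forward pass
        for req, row in zip(batch, logits):
            tok = max(range(len(row)), key=row.__getitem__)     # 5. sample (greedy)
            req.token_ids.append(tok)
            if tok == EOS_ID:
                req.done = True
            else:
                req.text += VOCAB[tok]                          # 6-7. detokenize, emit
    return [r.text for r in reqs]


print(serve(["a", "ab"]))
```

With this fake model the prompt "a" streams "bc" and the prompt "ab" streams "c" before each request hits `<eos>`. Note that the second request finishes one step earlier and simply drops out of the batch, which is the behaviour the scheduling step exists to handle.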

0.2 Module map for mini_sglang

mini_sglang/
├── config.py            # ModelConfig.from_pretrained
├── weights.py           # safetensors → torch state dict, name remapping
├── model/
│   ├── layers.py        # RMSNorm, SwiGLU, RoPE precompute & apply
│   └── qwen3.py         # Qwen3Attention, DecoderLayer, Model, ForCausalLM
├── cache/
│   ├── kv_pool.py       # KvPool: per-layer [num_blocks, block_size, H_kv, D]
│   ├── block_alloc.py   # BlockAllocator: free-list of block IDs
│   └── request.py       # Request, ForwardMeta, reserve()
├── sampler.py           # L4
├── scheduler.py         # L5
├── tokenizer.py         # L6
└── server.py            # L7  (FastAPI)
scripts/
└── l1_smoke.py … l9_smoke.py

0.3 The data structure that ties it all together

If you remember one thing from this curriculum, make it this: the central data structure of a modern engine is the paged KV cache plus per-step metadata:

KvPool        per layer:             [num_blocks, block_size, H_kv, D]   (raw HBM, never moved)
Request       per in-flight request: prompt_ids, output_ids,
                                     blocks=[b0, b1, …], slot_indices=[…], cur_len
ForwardMeta   per forward call:      positions, slot_mapping,                  ← writes
                                     cu_seqlens_q, seq_lens_kv, block_table,   ← reads
                                     block_size

The scheduler's job (L5) is to take a list of Request objects and produce one ForwardMeta for the next call. The model never sees a Request; it only consumes the pool + meta.
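
A minimal sketch of that Request → ForwardMeta step, using plain Python lists instead of tensors. The field names follow the listing above, but the semantics are assumptions: `cur_len` is taken to mean "tokens whose KV is already written", the slot for position `p` is assumed to be `blocks[p // block_size] * block_size + p % block_size`, `slot_indices` is omitted for brevity, and `build_forward_meta` is a hypothetical name, not the L5 scheduler itself.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Request:
    prompt_ids: List[int]
    output_ids: List[int] = field(default_factory=list)
    blocks: List[int] = field(default_factory=list)  # block IDs owned by this request
    cur_len: int = 0  # assumed: number of tokens already present in the KV pool


@dataclass
class ForwardMeta:
    positions: List[int]          # position of every query token in this step
    slot_mapping: List[int]       # flat pool slot each new token's KV is written to
    cu_seqlens_q: List[int]       # cumulative query lengths, starts at 0
    seq_lens_kv: List[int]        # total KV length per request after this step
    block_table: List[List[int]]  # per-request block IDs, used for reads
    block_size: int


def build_forward_meta(requests: List[Request], block_size: int) -> ForwardMeta:
    positions, slot_mapping = [], []
    cu_seqlens_q, seq_lens_kv, block_table = [0], [], []
    for req in requests:
        total = len(req.prompt_ids) + len(req.output_ids)
        # Only tokens not yet in the cache are run through the model.
        for pos in range(req.cur_len, total):
            block = req.blocks[pos // block_size]
            positions.append(pos)
            slot_mapping.append(block * block_size + pos % block_size)
        cu_seqlens_q.append(cu_seqlens_q[-1] + (total - req.cur_len))
        seq_lens_kv.append(total)
        block_table.append(list(req.blocks))
    return ForwardMeta(positions, slot_mapping, cu_seqlens_q,
                       seq_lens_kv, block_table, block_size)


# One prefill request (6 new tokens) batched with one decode request (1 new token).
prefill = Request(prompt_ids=[1, 2, 3, 4, 5, 6], blocks=[7, 2])
decode = Request(prompt_ids=[1, 2, 3, 4, 5], output_ids=[9], blocks=[0, 3], cur_len=5)
meta = build_forward_meta([prefill, decode], block_size=4)
print(meta.slot_mapping, meta.cu_seqlens_q)
```

The example mixes a prefill and a decode request in one call, which is why `cu_seqlens_q` (query tokens this step) and `seq_lens_kv` (full history) are tracked separately.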

0.4 Why "paged" and not "contiguous"?

Naïve per-request KV tensors waste memory: you would have to pre-allocate [max_len, …] for every request even when most are only 50 tokens long. Paging instead allocates KV in fixed-size blocks (typically 16 or 32 tokens) on demand. With 16-token blocks, a request whose history is 73 tokens uses ceil(73/16) = 5 blocks, regardless of max_len. The cost is one indirection through a per-request block_table, which the paged attention kernel handles natively.
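
The block arithmetic and on-demand growth can be sketched with a free-list allocator. This is an illustration under assumptions, not the code in block_alloc.py or request.py: the `BlockAllocator` methods, `blocks_needed`, and this particular `reserve` signature are hypothetical.

```python
from typing import List


class BlockAllocator:
    """Free-list of block IDs (hypothetical minimal interface)."""

    def __init__(self, num_blocks: int) -> None:
        self.free: List[int] = list(range(num_blocks))

    def alloc(self, n: int) -> List[int]:
        if n > len(self.free):
            raise MemoryError(f"need {n} blocks, only {len(self.free)} free")
        taken, self.free = self.free[:n], self.free[n:]
        return taken

    def release(self, blocks: List[int]) -> None:
        self.free.extend(blocks)


def blocks_needed(num_tokens: int, block_size: int) -> int:
    # Integer ceil(num_tokens / block_size).
    return (num_tokens + block_size - 1) // block_size


def reserve(blocks: List[int], new_len: int, block_size: int,
            allocator: BlockAllocator) -> List[int]:
    """Grow `blocks` in place until it covers `new_len` tokens."""
    missing = blocks_needed(new_len, block_size) - len(blocks)
    if missing > 0:
        blocks.extend(allocator.alloc(missing))
    return blocks


allocator = BlockAllocator(num_blocks=8)
blocks = reserve([], 73, 16, allocator)   # 73 tokens -> 5 blocks
reserve(blocks, 80, 16, allocator)        # still fits in 5 blocks (5 * 16 = 80)
reserve(blocks, 81, 16, allocator)        # token 81 spills into a 6th block
print(blocks, allocator.free)
```

A new block is only taken when the history crosses a block boundary, so per-request waste is bounded by one partially filled block rather than by max_len.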

0.5 What we will skip (intentionally)

Distributed (TP/PP/EP/DP). Pure single-GPU.
Quantization (FP8/INT4/AWQ).
Speculative decoding, draft models, MTP heads.
Multi-modality.
Beam search, structured generation, logprobs API.

Each of these is at most a focused extension of the skeleton you'll have at L9.

0.6 Lesson contract

Contract: every lesson keeps scripts/lN_smoke.py green. The smoke test greedily decodes 20 tokens from the prompt "The capital of France is" and compares the resulting token IDs exactly against transformers.AutoModelForCausalLM.generate(do_sample=False). The expected token IDs are [12095, 13, 576, 6722, 315, 9625, 374, 12095, 13, 576, 6722, 315, 9625, 374, 12095, 13, 576, 6722, 315, 9625] ("Paris. The capital of France is Paris…").
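
The comparison half of such a smoke test might look like the sketch below. The expected IDs are the ones quoted above; `first_mismatch` and `check_smoke` are hypothetical helpers, and how the engine actually produces its 20 token IDs is left out because it depends on the lesson.

```python
from typing import List, Optional

# Reference token IDs quoted in the lesson contract.
EXPECTED_IDS = [12095, 13, 576, 6722, 315, 9625, 374, 12095, 13, 576,
                6722, 315, 9625, 374, 12095, 13, 576, 6722, 315, 9625]


def first_mismatch(expected: List[int], actual: List[int]) -> Optional[int]:
    """Return the index of the first differing token, or None if identical."""
    for i, (e, a) in enumerate(zip(expected, actual)):
        if e != a:
            return i
    if len(expected) != len(actual):
        # One sequence is a strict prefix of the other.
        return min(len(expected), len(actual))
    return None


def check_smoke(actual: List[int]) -> None:
    idx = first_mismatch(EXPECTED_IDS, actual)
    if idx is not None:
        raise AssertionError(
            f"diverged at step {idx}: "
            f"expected {EXPECTED_IDS[idx:idx + 1]}, got {actual[idx:idx + 1]}"
        )


check_smoke(list(EXPECTED_IDS))  # passes silently when the IDs match
```

Reporting the first diverging step rather than a bare pass/fail tends to help here, since with greedy decoding a single wrong token changes everything after it.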
