L0 — Architecture overview
0.1 The seven things every serving engine does
An LLM inference server is a pipeline. Forget any single buzzword you've read; concretely, when a request hits the server, this is what happens:
- Tokenize. Bytes → token IDs.
- Schedule. Decide which requests run in this forward step (and how many tokens of each).
- Allocate KV cache slots. Reserve room in the paged KV pool for the new tokens.
- Forward pass. Run the model. Reads & writes the KV pool through a paged attention kernel.
- Sample. Convert logits → one new token per active request.
- Detokenize incrementally. Token IDs → streamable text fragments.
- Emit / stream / repeat. Push fragments to clients, then loop back to step 2 (Schedule).
Everything else (radix cache, speculative decoding, CUDA graphs, EP/TP/PP, hicache, etc.) is an optimization on top of this skeleton.
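To make the skeleton concrete, here is a toy version of the loop with every stage stubbed out. It is a sketch under loud assumptions, not the mini_sglang code: ToyRequest, tokenize, toy_forward and run are hypothetical names, the "tokenizer" is one token per byte, and the "model" just returns the last token plus one. What it does show faithfully is the control flow: tokenize once, then schedule, allocate, forward, sample, detokenize and emit on every step until all requests finish.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRequest:
    """Hypothetical request record, not the mini_sglang Request class."""
    rid: int
    prompt: str                 # assumed non-empty
    max_new_tokens: int
    token_ids: list = field(default_factory=list)
    num_prompt: int = 0
    kv_slots: int = 0           # how many tokens have KV "reserved" (toy counter)
    text: str = ""

    @property
    def done(self):
        return len(self.token_ids) - self.num_prompt >= self.max_new_tokens


def tokenize(text):
    # 1. Tokenize: bytes -> token IDs (toy: one token per UTF-8 byte).
    return list(text.encode("utf-8"))


def toy_forward(batch):
    # 4 + 5. Forward and sample, collapsed: the "model" predicts last token + 1.
    return [(r.token_ids[-1] + 1) % 256 for r in batch]


def run(requests, max_batch=2):
    stream = []                 # (request id, text fragment) in emission order
    for r in requests:
        r.token_ids = tokenize(r.prompt)
        r.num_prompt = len(r.token_ids)
    steps = 0
    while any(not r.done for r in requests):
        # 2. Schedule: pick the requests that run in this forward step.
        batch = [r for r in requests if not r.done][:max_batch]
        # 3. Allocate KV slots: a real engine reserves pool blocks here.
        for r in batch:
            r.kv_slots = len(r.token_ids)
        next_tokens = toy_forward(batch)
        for r, tok in zip(batch, next_tokens):
            r.token_ids.append(tok)
            fragment = chr(tok)             # 6. Detokenize incrementally.
            r.text += fragment
            stream.append((r.rid, fragment))  # 7. Emit, then loop to step 2.
        steps += 1
    return steps, stream
```

Running three requests with max_batch=2 shows the scheduler interleaving them: the third request only enters the batch once an earlier one has finished.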
0.2 Module map for mini_sglang
mini_sglang/
├── config.py # ModelConfig.from_pretrained
├── weights.py # safetensors → torch state dict, name remapping
├── model/
│ ├── layers.py # RMSNorm, SwiGLU, RoPE precompute & apply
│ └── qwen3.py # Qwen3Attention, DecoderLayer, Model, ForCausalLM
├── cache/
│ ├── kv_pool.py # KvPool: per-layer [num_blocks, block_size, H_kv, D]
│ ├── block_alloc.py # BlockAllocator: free-list of block IDs
│ └── request.py # Request, ForwardMeta, reserve()
├── sampler.py # L4
├── scheduler.py # L5
├── tokenizer.py # L6
└── server.py # L7 (FastAPI)
scripts/
├── l1_smoke.py … l9_smoke.py
0.3 The data structure that ties it all together
If you remember one thing from this curriculum: the central data structure of a modern engine is the paged KV cache plus per-step metadata.
The scheduler's job (L5) is to take a list of Request objects and produce one ForwardMeta for the next call. The model never sees a Request; it only consumes the pool + meta.
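One way to picture that hand-off is the sketch below. The class names Request and ForwardMeta come from the module map, but every field shown here (token_ids, block_table, num_cached, input_ids, positions, slot_mapping, seq_lens, block_tables) and the build_meta helper are assumptions for illustration; the real definitions may differ. Plain Python lists stand in for tensors so the example runs without a GPU.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Assumed fields; the real class lives in cache/request.py."""
    token_ids: list         # prompt + generated tokens so far
    block_table: list       # physical block IDs, in logical order
    num_cached: int = 0     # tokens whose KV is already in the pool


@dataclass
class ForwardMeta:
    """Assumed per-step metadata, flattened across the whole batch."""
    input_ids: list         # tokens to run through the model this step
    positions: list         # position of each token within its request
    slot_mapping: list      # flat pool slot where each token's KV is written
    seq_lens: list          # total length of each request after this step
    block_tables: list      # one block table per request, for attention reads


def build_meta(requests, block_size):
    """Flatten a batch of requests into a single ForwardMeta."""
    meta = ForwardMeta([], [], [], [], [])
    for r in requests:
        # Only tokens without cached KV need to go through the model.
        for pos in range(r.num_cached, len(r.token_ids)):
            block = r.block_table[pos // block_size]
            meta.input_ids.append(r.token_ids[pos])
            meta.positions.append(pos)
            meta.slot_mapping.append(block * block_size + pos % block_size)
        meta.seq_lens.append(len(r.token_ids))
        meta.block_tables.append(list(r.block_table))
    return meta
```

Note that a prefill request (num_cached = 0) and a decode request (all but the last token cached) go through the same code path; the only difference is how many tokens each contributes to the flattened batch.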
0.4 Why "paged" and not "contiguous"?
Naïve per-request KV tensors waste memory: you'd have to pre-allocate [max_len, …] per request even when most are 50 tokens long. Paging lets you allocate KV in fixed-size blocks (16 or 32 tokens) on demand. A request whose history is 73 tokens uses ceil(73/16) = 5 blocks, regardless of max_len. The cost is one indirection through a per-request block_table — which the kernel handles natively.
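The arithmetic and the indirection fit in a few lines. The sketch below is a minimal free-list allocator plus the block-table lookup, assuming block IDs are plain integers and a flat slot index is block_id * block_size + offset. BlockAllocator and reserve are named in the module map, but the method names and signatures here (alloc, release, blocks_needed, slot_of) are guesses, not the curriculum's actual code.

```python
class BlockAllocator:
    """Free list of block IDs (sketch; the real one is in cache/block_alloc.py)."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def alloc(self):
        if not self.free:
            raise RuntimeError("KV pool exhausted")
        return self.free.pop()

    def release(self, block_ids):
        # Return a finished request's blocks to the pool.
        self.free.extend(block_ids)


def blocks_needed(num_tokens, block_size):
    # Integer ceiling division: ceil(num_tokens / block_size).
    return (num_tokens + block_size - 1) // block_size


def reserve(block_table, new_len, block_size, allocator):
    """Grow block_table in place until it covers new_len tokens."""
    while len(block_table) < blocks_needed(new_len, block_size):
        block_table.append(allocator.alloc())


def slot_of(block_table, pos, block_size):
    """Translate a logical token position into a flat pool slot."""
    return block_table[pos // block_size] * block_size + pos % block_size
```

With block_size = 16, reserving room for 73 tokens takes 5 blocks, and growing to 80 tokens takes no new block because the fifth one still has free slots. Only the 81st token triggers another allocation.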
0.5 What we will skip (intentionally)
- Distributed (TP/PP/EP/DP). Pure single-GPU.
- Quantization (FP8/INT4/AWQ).
- Speculative decoding, draft models, MTP heads.
- Multi-modality.
- Beam search, structured generation, logprobs API.
Each of these is at most a focused extension of the skeleton you'll have at L9.
0.6 Lesson contract
Contract: Every lesson keeps scripts/lN_smoke.py green. The smoke test is greedy decoding of 20 tokens on the prompt "The capital of France is", compared bit-for-bit against transformers.AutoModelForCausalLM.generate(do_sample=False). The expected token IDs are [12095, 13, 576, 6722, 315, 9625, 374, 12095, 13, 576, 6722, 315, 9625, 374, 12095, 13, 576, 6722, 315, 9625] ("Paris. The capital of France is Paris…").
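The comparison half of such a smoke test could look like the sketch below. It is not the contents of any lN_smoke.py: first_mismatch and check_smoke are hypothetical helpers, and generate_fn is a stand-in for whatever greedy-decoding entry point your engine exposes at a given lesson. Only the prompt, the token count and the expected IDs are taken from the contract above.

```python
EXPECTED = [12095, 13, 576, 6722, 315, 9625, 374, 12095, 13, 576,
            6722, 315, 9625, 374, 12095, 13, 576, 6722, 315, 9625]


def first_mismatch(expected, actual):
    """Return the index of the first differing token, or None if identical."""
    for i, (e, a) in enumerate(zip(expected, actual)):
        if e != a:
            return i
    if len(expected) != len(actual):
        # One sequence is a strict prefix of the other.
        return min(len(expected), len(actual))
    return None


def check_smoke(generate_fn, prompt="The capital of France is", max_new_tokens=20):
    """Run generate_fn (hypothetical: prompt, n -> token IDs) and compare exactly."""
    actual = list(generate_fn(prompt, max_new_tokens))
    idx = first_mismatch(EXPECTED, actual)
    if idx is not None:
        raise AssertionError(
            f"token mismatch at index {idx}: "
            f"expected {EXPECTED[idx:idx + 1]}, got {actual[idx:idx + 1]}"
        )
    return actual
```

Reporting the first differing index is a deliberate choice: with greedy decoding, one wrong token changes everything after it, so the earliest mismatch is the one worth debugging.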