Rebuilding sglang from scratch
Each lesson teaches one production-engine concept by having you build it.
The author plays host: explains the idea, hands you a design and reference snippets,
you implement, then we debug together. The contract is that every lesson keeps the
same end-to-end smoke test green: greedy 20-token generation must match Hugging Face
transformers token-for-token on the prompt
"The capital of France is".
Setup
# Blackwell (sm_120, RTX 5090) requires the cu128 PyTorch wheels.
# pyproject.toml pins torch via [tool.uv.sources] -> pytorch-cu128.
curl -LsSf https://astral.sh/uv/install.sh | sh
cd mini_sglang
uv venv
uv sync
source .venv/bin/activate
python -m scripts.l1_smoke # verify L1 still passes
Lessons
Architecture overview
What an inference engine actually does, end-to-end. The seven stages every modern engine has.
done L1Model & weight loading
Build Qwen3-8B in PyTorch from spec, load HF safetensors, match HF token-for-token on greedy decode.
done L2Paged KV cache
Replace the contiguous KV with a paged pool + block allocator. The data structure that makes batching possible.
done L3Paged attention kernel
Block-table-aware attention. The three metadata tensors every paged kernel takes. Pure-PyTorch and flash-attn paths.
done L4Sampler
Greedy, temperature, top-k, top-p, repetition penalty — per-row, in the right order.
done L5Scheduler / continuous batching
Multiple in-flight requests in one forward pass. Where most of the throughput comes from.
done L6Tokenizer + incremental detokenize
Streaming text out without re-decoding the whole sequence each step.
done L7HTTP server
FastAPI + asyncio glue. /generate end-to-end.
Radix prefix cache
Reuse KV across requests with shared prefixes. The defining sglang trick. 79% prefill compute saved on chat-style workloads.
doneCUDA graphs decode (stretch)
Replay the decode step at near-zero CPU overhead.
stretchReference target
Final goal: python -m mini_sglang.server --model /path/to/Qwen3-8B and
curl localhost:8000/generate -d '{"prompt":"...", "max_tokens":64}'
serves correct text with sane throughput.