Rebuilding sglang from scratch

A curriculum that takes you from an empty Python package to an end-to-end LLM inference server serving Qwen3-8B.

Each lesson teaches one production-engine concept by having you build it. The author acts as host, explaining the idea and handing you a design and reference snippets; you implement it, and then we debug together. The contract is that every lesson keeps the same end-to-end smoke test green: greedy 20-token generation on the prompt "The capital of France is" must match Hugging Face transformers token for token.
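The smoke-test contract can be sketched roughly as follows. This is an illustrative sketch, not the course's actual scripts.l1_smoke: the helper names hf_greedy_ids and first_mismatch are hypothetical, and the reference side assumes a CUDA device and bfloat16 weights.

```python
from typing import Optional, Sequence

PROMPT = "The capital of France is"
MAX_NEW_TOKENS = 20


def hf_greedy_ids(model_path: str,
                  prompt: str = PROMPT,
                  max_new_tokens: int = MAX_NEW_TOKENS) -> list:
    """Reference side: greedy decoding with Hugging Face transformers.

    Imports are local so the comparison helper below works without torch.
    The device and dtype are assumptions for this sketch.
    """
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path, torch_dtype=torch.bfloat16
    ).to("cuda")
    inputs = tok(prompt, return_tensors="pt").to("cuda")
    with torch.no_grad():
        # do_sample=False selects greedy decoding.
        out = model.generate(**inputs, max_new_tokens=max_new_tokens,
                             do_sample=False)
    # Drop the prompt tokens; keep only the newly generated ones.
    return out[0, inputs["input_ids"].shape[1]:].tolist()


def first_mismatch(ref: Sequence[int], got: Sequence[int]) -> Optional[int]:
    """Return the index of the first differing token, or None if identical."""
    for i, (r, g) in enumerate(zip(ref, got)):
        if r != g:
            return i
    if len(ref) != len(got):
        # One sequence is a strict prefix of the other.
        return min(len(ref), len(got))
    return None
```

A lesson's smoke test would then assert that first_mismatch(hf_greedy_ids(path), engine_ids) is None, where engine_ids stands for whatever your own engine produced for the same prompt.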

Setup

# Blackwell (sm_120, RTX 5090) requires the cu128 PyTorch wheels.
# pyproject.toml pins torch via [tool.uv.sources] -> pytorch-cu128.

curl -LsSf https://astral.sh/uv/install.sh | sh
cd mini_sglang
uv venv
uv sync
source .venv/bin/activate

python -m scripts.l1_smoke   # verify L1 still passes

CUDA graphs decode (stretch)

Replay the decode step at near-zero CPU overhead.
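As a rough illustration of the idea, the sketch below captures one decode step into a torch.cuda.CUDAGraph and replays it. It follows the general capture pattern from the PyTorch documentation and is not the lesson's reference design: step_fn, static_input, CAPTURE_SIZES and both function names are hypothetical. Because a captured graph has fixed tensor shapes, the sketch assumes batches are padded up to one of a few pre-captured sizes.

```python
import bisect

# Hypothetical batch sizes for which a graph would be captured ahead of time.
CAPTURE_SIZES = [1, 2, 4, 8, 16]


def padded_batch_size(n, sizes=CAPTURE_SIZES):
    """Smallest captured size >= n, or None if n is too large for any graph."""
    i = bisect.bisect_left(sizes, n)
    return sizes[i] if i < len(sizes) else None


def capture_decode_graph(step_fn, static_input):
    """Capture step_fn(static_input) into a CUDA graph (requires a GPU).

    step_fn must read from static_input in place; the returned static_output
    tensor is overwritten on every replay.
    """
    import torch

    # Warm up on a side stream so lazy initialization is not captured.
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):
        for _ in range(3):
            step_fn(static_input)
    torch.cuda.current_stream().wait_stream(side)

    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        static_output = step_fn(static_input)
    return graph, static_output
```

At decode time the loop would copy fresh data into the captured buffer with static_input.copy_(new_input), call graph.replay(), and read the result from static_output. Replay launches the whole recorded kernel sequence with a single call, which is where the CPU savings come from.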

Reference target

Final goal: with the server started by python -m mini_sglang.server --model /path/to/Qwen3-8B, a request such as curl localhost:8000/generate -d '{"prompt":"...", "max_tokens":64}' returns correct text at reasonable throughput.
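For scripted checks, the same request can be issued from Python. This is a minimal client sketch under assumptions: the endpoint path and the JSON fields come from the curl command above, while the Content-Type header, the helper names and the treatment of the response as raw text are guesses, since the server's response schema is not specified here.

```python
import json
import urllib.request


def build_generate_request(prompt, max_tokens=64,
                           base_url="http://localhost:8000"):
    """Build a POST request for the /generate endpoint without sending it."""
    body = json.dumps({"prompt": prompt, "max_tokens": max_tokens})
    return urllib.request.Request(
        f"{base_url}/generate",
        data=body.encode("utf-8"),
        # Assumption: the server accepts an explicit JSON content type.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, max_tokens=64):
    """Send the request and return the raw response body as text."""
    req = build_generate_request(prompt, max_tokens)
    with urllib.request.urlopen(req, timeout=60) as resp:
        return resp.read().decode("utf-8")
```

With the server running, print(generate("The capital of France is", max_tokens=20)) gives a quick way to inspect the output by eye.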

Lessons

Architecture overview

Model & weight loading

Paged KV cache

Paged attention kernel

Sampler

Scheduler / continuous batching

Tokenizer + incremental detokenize

HTTP server

Radix prefix cache

CUDA graphs decode (stretch)
