mini_sglang

Rebuilding sglang from scratch

A curriculum that takes you from an empty Python package to an end-to-end LLM inference server serving Qwen3-8B.

Each lesson teaches one production-engine concept by having you build it. The author plays host: explains the idea, hands you a design and reference snippets, you implement, then we debug together. The contract is that every lesson keeps the same end-to-end smoke test green: greedy 20-token generation must match Hugging Face transformers token-for-token on the prompt "The capital of France is".

Setup

# Blackwell (sm_120, RTX 5090) requires the cu128 PyTorch wheels.
# pyproject.toml pins torch via [tool.uv.sources] -> pytorch-cu128.

curl -LsSf https://astral.sh/uv/install.sh | sh
cd mini_sglang
uv venv
uv sync
source .venv/bin/activate

python -m scripts.l1_smoke   # verify L1 still passes

Lessons

L0

Architecture overview

What an inference engine actually does, end-to-end. The seven stages every modern engine has.

done
L1

Model & weight loading

Build Qwen3-8B in PyTorch from spec, load HF safetensors, match HF token-for-token on greedy decode.

done
L2

Paged KV cache

Replace the contiguous KV with a paged pool + block allocator. The data structure that makes batching possible.

done
L3

Paged attention kernel

Block-table-aware attention. The three metadata tensors every paged kernel takes. Pure-PyTorch and flash-attn paths.

done
L4

Sampler

Greedy, temperature, top-k, top-p, repetition penalty — per-row, in the right order.

done
L5

Scheduler / continuous batching

Multiple in-flight requests in one forward pass. Where most of the throughput comes from.

done
L6

Tokenizer + incremental detokenize

Streaming text out without re-decoding the whole sequence each step.

done
L7

HTTP server

FastAPI + asyncio glue. /generate end-to-end.

done
L8

Radix prefix cache

Reuse KV across requests with shared prefixes. The defining sglang trick. 79% prefill compute saved on chat-style workloads.

done
L9

CUDA graphs decode (stretch)

Replay the decode step at near-zero CPU overhead.

stretch

Reference target

Final goal: python -m mini_sglang.server --model /path/to/Qwen3-8B and curl localhost:8000/generate -d '{"prompt":"...", "max_tokens":64}' serves correct text with sane throughput.