Building a mini_sglang from scratch

Posted at — May 24, 2026

Why

I have been using sglang and vLLM as black boxes for a while. Paged KV cache, continuous batching, prefill/decode split — I could repeat the words but not actually draw the boxes. So I sat down with a coding agent and rebuilt the smallest possible serving engine that still has all of the real pieces, one lesson at a time.

The result is mini_sglang, a minimal inference engine that loads Qwen3 and serves it with continuous batching and a working radix KV cache.

tl;dr Nine lessons (L0–L8) that take you from an empty package to a FastAPI server serving Qwen3-8B with paged KV, continuous batching, incremental detokenization, and a radix prefix cache. Lesson site is hosted here. Code on github.com/moomou/mini_sglang.

The lessons

Each lesson builds one module and keeps the smoke test green.

L0 — Architecture overview. The seven things every serving engine does. Module map. Why “paged” and not “contiguous”. What we intentionally skip.
L1 — Model & weight loading. Qwen3-8B from spec: RMSNorm, SwiGLU, RoPE, GQA. safetensors loader. Eager forward pass using F.scaled_dot_product_attention. This is the baseline the rest of the lessons replace piece by piece.
L2 — Paged KV cache. Block allocator, page tables, the ForwardMeta struct. Token-flat vs block-tensor storage and why L3 forces the switch.
L3 — Paged attention kernel. Block-table-aware attention. Three metadata tensors (cu_seqlens_q, cu_seqlens_k, block_table). Optionally flash-attn varlen if you have nvcc and ten minutes to compile.
L4 — Sampler. Greedy, temperature, top-p. The canonical pipeline order (logit processors → temperature → top-k/p → sample) and the disabled-default sentinels worth memorizing.
L5 — Scheduler / continuous batching. Two queues, four scheduling decisions, the Req state machine, per-request sampling params. This is the lesson where the engine stops being a script and starts being a server.
L6 — Tokenizer + incremental detokenize. The window-diff algorithm and the subtle UTF-8 bug that bites every implementation that tries to detokenize one token at a time.
L7 — HTTP server. FastAPI /generate, the Engine class, the SSE streaming protocol. The scheduler runs in its own thread; the HTTP layer just queues requests and drains output.
L8 — Radix prefix cache. A radix tree of token sequences over the paged KV pool, with block refcounting as the central invariant. Match is one _common_prefix_len per descent; eviction is descending-level, LRU-within-level. This is the lesson that turns “shared system prompt” from a 30%-of-prefill tax into a cache hit.
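To make the L4 pipeline order concrete, here is a minimal pure-Python sketch of logit processors → temperature → top-k/p → sample. It is not the mini_sglang sampler. The function name `sample`, the `processors` argument, and the sentinel values (`temperature=0.0` for greedy, `top_k=-1` and `top_p=1.0` for "disabled") are assumptions chosen for illustration.

```python
import math
import random
from typing import Callable, List, Optional, Sequence


def sample(
    logits: Sequence[float],
    temperature: float = 1.0,   # assumed sentinel: 0.0 means greedy
    top_k: int = -1,            # assumed sentinel: -1 means disabled
    top_p: float = 1.0,         # assumed sentinel: 1.0 means disabled
    processors: Optional[List[Callable[[List[float]], List[float]]]] = None,
    rng: Optional[random.Random] = None,
) -> int:
    """Return one token id, applying the stages in the order named in L4."""
    logits = list(logits)
    # 1. Logit processors (penalties, bias, masks) run on raw logits.
    for proc in processors or []:
        logits = proc(logits)
    # Greedy shortcut: argmax is unaffected by any positive temperature.
    if temperature == 0.0:
        return max(range(len(logits)), key=logits.__getitem__)
    # 2. Temperature, then a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # 3. Top-k, then top-p, over token ids sorted by descending probability.
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    if top_k > 0:
        order = order[:top_k]
    if top_p < 1.0:
        kept, cumulative = [], 0.0
        for token_id in order:
            kept.append(token_id)
            cumulative += probs[token_id]
            if cumulative >= top_p:
                break
        order = kept
    # 4. Sample from the surviving tokens; choices() renormalizes the weights.
    picker = rng or random
    return picker.choices(order, weights=[probs[i] for i in order], k=1)[0]
```

With the defaults, every filter is a no-op and the call reduces to plain multinomial sampling, which is why the disabled values are worth memorizing.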

What I actually learned

A few things that stuck with me and that I would not have internalized from reading sglang source.

Paged KV is mostly an allocator problem. The L3 kernel change is real but small. The bulk of paged KV is bookkeeping: who owns which blocks, when to free them, how to gather a block_table into a flat tensor the kernel can index. Once L2’s allocator exists, L3 feels mechanical.
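The bookkeeping described above can be sketched as a free list plus a per-request page table. This is a toy under stated assumptions, not the L2 code: the class name `BlockAllocator`, its methods, and the string request ids are all hypothetical.

```python
from typing import Dict, List, Tuple


class BlockAllocator:
    """Toy paged-KV bookkeeping: which request owns which physical blocks."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        # Reversed so that pop() hands out block 0 first.
        self.free: List[int] = list(range(num_blocks - 1, -1, -1))
        self.page_table: Dict[str, List[int]] = {}  # request id -> block ids
        self.seq_len: Dict[str, int] = {}           # request id -> tokens stored

    def append_tokens(self, req_id: str, n: int) -> None:
        """Reserve room for n more tokens, allocating blocks only when needed."""
        table = self.page_table.get(req_id, [])
        new_len = self.seq_len.get(req_id, 0) + n
        blocks_needed = -(-new_len // self.block_size) - len(table)  # ceil div
        if blocks_needed > len(self.free):
            raise MemoryError("out of KV blocks")
        for _ in range(blocks_needed):
            table.append(self.free.pop())
        self.page_table[req_id] = table
        self.seq_len[req_id] = new_len

    def slot(self, req_id: str, pos: int) -> Tuple[int, int]:
        """Map a logical token position to (physical block, offset in block)."""
        block = self.page_table[req_id][pos // self.block_size]
        return block, pos % self.block_size

    def release(self, req_id: str) -> None:
        """Return all blocks of a finished request to the free list."""
        self.free.extend(self.page_table.pop(req_id))
        del self.seq_len[req_id]
```

A request's blocks do not need to be adjacent: if request "a" grows after request "b" was admitted, its page table skips over b's block, and `slot` still resolves positions correctly. The page table is what the kernel-side `block_table` is gathered from.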

Continuous batching is a scheduling decision, not a kernel feature. The kernels do not know about batching; they take whatever batch you hand them. The scheduler decides every step whether to admit a new prefill, continue decoding, or evict — and the magic is that those decisions happen every step instead of once per request.
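A toy step loop shows what "decided every step" means. This is a sketch with no model and no kernels, and it is not the L5 scheduler: `ToyScheduler`, the `Req` fields, and the admission rule (reserve `prompt_len + max_new_tokens` up front against a fixed token budget) are assumptions. Eviction and preemption are left out to keep it short.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List


@dataclass
class Req:
    rid: str
    prompt_len: int
    max_new_tokens: int
    generated: int = 0
    state: str = "waiting"  # waiting -> running -> finished


class ToyScheduler:
    def __init__(self, token_budget: int) -> None:
        self.token_budget = token_budget  # KV tokens the pool can hold
        self.waiting: Deque[Req] = deque()
        self.running: List[Req] = []

    def add(self, req: Req) -> None:
        self.waiting.append(req)

    def _reserved(self) -> int:
        # Conservative: count each running request at its maximum length.
        return sum(r.prompt_len + r.max_new_tokens for r in self.running)

    def _retire_finished(self) -> None:
        for r in self.running:
            if r.generated >= r.max_new_tokens:
                r.state = "finished"
        self.running = [r for r in self.running if r.state != "finished"]

    def step(self) -> str:
        """One engine step: admit a prefill if it fits, otherwise decode."""
        if self.waiting:
            head = self.waiting[0]
            need = head.prompt_len + head.max_new_tokens
            if self._reserved() + need <= self.token_budget:
                self.waiting.popleft()
                head.state = "running"
                head.generated = 1  # prefill yields the first new token
                self.running.append(head)
                self._retire_finished()
                return f"prefill {head.rid}"
        if self.running:
            batch = ",".join(r.rid for r in self.running)
            for r in self.running:
                r.generated += 1  # one token per running request
            self._retire_finished()
            return f"decode {batch}"
        return "idle"
```

Because the admission check runs on every call to `step`, a waiting request joins the batch as soon as a finished one frees budget, while older requests are still decoding. Nothing in the "kernel" part (the decode loop) knows that the batch membership changed.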

Incremental detokenization is harder than the model. L6 is the shortest lesson and the one I burned the most time on. BPE has no clean one-token-to-one-piece-of-text mapping, so you keep a sliding window of recent tokens, detokenize the window, and diff the result against the previous window's text. Get the window size wrong and you split a UTF-8 codepoint, and the stream starts emitting \ufffd.
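Here is a minimal sketch of the window-diff idea, using raw byte chunks as stand-in "tokens" so that it runs without a real tokenizer. `toy_decode` and `IncrementalDetokenizer` are hypothetical names, and the two-offset bookkeeping is one common way to implement the window, not necessarily what L6 does.

```python
from typing import List


def toy_decode(tokens: List[bytes]) -> str:
    """Stand-in for tokenizer.decode; each toy token is a raw byte chunk."""
    return b"".join(tokens).decode("utf-8", errors="replace")


class IncrementalDetokenizer:
    def __init__(self) -> None:
        self.tokens: List[bytes] = []
        self.prefix_offset = 0  # where the decode window starts
        self.read_offset = 0    # tokens before this index were already emitted

    def push(self, token: bytes) -> str:
        """Add one token and return only the text that is safe to emit."""
        self.tokens.append(token)
        # Text of the window as it looked at the last emit.
        prefix_text = toy_decode(self.tokens[self.prefix_offset:self.read_offset])
        # Text of the window including every token not yet emitted.
        full_text = toy_decode(self.tokens[self.prefix_offset:])
        # A trailing U+FFFD means the last codepoint is still incomplete.
        if len(full_text) > len(prefix_text) and not full_text.endswith("\ufffd"):
            delta = full_text[len(prefix_text):]
            self.prefix_offset = self.read_offset
            self.read_offset = len(self.tokens)
            return delta
        return ""  # hold back until more tokens complete the character
```

Feeding the UTF-8 bytes of "hé 世界" one byte at a time, `push` returns an empty string while a multi-byte character is incomplete and emits the whole character once its last byte arrives. Decoding each byte on its own is the naive approach, and it produces replacement characters for every non-ASCII byte.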

The HTTP server is mostly cooperative queueing. Once the engine exists, the “server” is a thread that drains a request queue and an asyncio.Queue per request for streaming output. The interesting design choice is where the event loop lives, not what the routes look like.
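The queueing pattern can be sketched without FastAPI at all. In this toy, a worker thread drains a request queue and pushes chunks into a per-request `asyncio.Queue` through `loop.call_soon_threadsafe`. `ToyEngine` and `collect` are hypothetical names, and the "model" simply uppercases each word of the prompt.

```python
import asyncio
import queue
import threading
from typing import AsyncIterator, List

_DONE = object()  # sentinel marking the end of one request's stream


class ToyEngine:
    def __init__(self) -> None:
        self.requests: queue.Queue = queue.Queue()
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def _run(self) -> None:
        """Engine thread: drain the request queue, stream chunks back."""
        while True:
            item = self.requests.get()
            if item is None:  # shutdown signal
                break
            prompt, loop, out = item
            for word in prompt.split():
                # asyncio.Queue is not thread-safe; hop onto the event loop.
                loop.call_soon_threadsafe(out.put_nowait, word.upper())
            loop.call_soon_threadsafe(out.put_nowait, _DONE)

    async def generate(self, prompt: str) -> AsyncIterator[str]:
        """HTTP-side coroutine: enqueue the request, then drain its output."""
        loop = asyncio.get_running_loop()
        out: asyncio.Queue = asyncio.Queue()
        self.requests.put((prompt, loop, out))
        while True:
            chunk = await out.get()
            if chunk is _DONE:
                break
            yield chunk

    def shutdown(self) -> None:
        self.requests.put(None)
        self.thread.join()


async def collect(engine: ToyEngine, prompt: str) -> List[str]:
    return [chunk async for chunk in engine.generate(prompt)]
```

A streaming route would iterate `generate` and write each chunk as an SSE event. The event loop stays on the HTTP side and the engine thread never awaits anything, which is the design choice the paragraph above points at.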

Prefix caching is refcounting, not just a tree. The radix tree is the obvious part; the part that took longer to internalize is that it only works because every cached block carries a refcount that the scheduler bumps and drops at exactly the right moments (admit, finish, evict). Get the refcounting wrong and you either leak blocks or free a block that another request is still reading.
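The refcount invariant can be shown with a deliberately simplified cache. Instead of a radix tree, this sketch keys a plain dict by the full token prefix at each block boundary, which gives the same sharing behavior for full blocks. The class `PrefixCache`, the block size of 4, and the three method names are assumptions, and LRU ordering is omitted (eviction here only goes longest prefix first).

```python
from typing import Dict, List, Tuple

BLOCK = 4  # tokens per block (assumed for the example)


class PrefixCache:
    def __init__(self, num_blocks: int) -> None:
        self.free: List[int] = list(range(num_blocks))
        self.refcount: Dict[int, int] = {}               # block -> active readers
        self.by_prefix: Dict[Tuple[int, ...], int] = {}  # token prefix -> block

    def admit(self, tokens: List[int]) -> Tuple[List[int], int]:
        """Return (blocks covering each full block of tokens, tokens reused)."""
        blocks: List[int] = []
        hit_tokens, matching = 0, True
        for end in range(BLOCK, len(tokens) + 1, BLOCK):
            key = tuple(tokens[:end])
            if matching and key in self.by_prefix:
                block = self.by_prefix[key]  # cache hit: share the block
                hit_tokens = end
            else:
                matching = False
                block = self.free.pop()      # raises IndexError if pool is empty
                self.by_prefix[key] = block
                self.refcount[block] = 0
            self.refcount[block] += 1        # this request now reads the block
            blocks.append(block)
        return blocks, hit_tokens

    def finish(self, blocks: List[int]) -> None:
        """Drop a finished request's references; blocks stay cached."""
        for block in blocks:
            self.refcount[block] -= 1
            assert self.refcount[block] >= 0, "refcount underflow"

    def evict(self, n: int) -> int:
        """Free up to n unreferenced blocks, longest prefixes (leaves) first."""
        idle = [k for k, b in self.by_prefix.items() if self.refcount[b] == 0]
        victims = sorted(idle, key=len, reverse=True)[:n]
        for key in victims:
            block = self.by_prefix.pop(key)
            del self.refcount[block]
            self.free.append(block)
        return len(victims)
```

Two requests that share an 8-token system prompt end up pointing at the same two blocks, each with a refcount of 2. `evict` refuses to touch them until both requests have called `finish`, which is exactly the "free a block that another request is still reading" bug the refcount exists to prevent.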

On using an agent to learn

Each lesson started as a chat: ask the agent to explain a concept, probe until the explanation was clear, then ask for a design and snippets. I wrote most of the code myself and used the agent to cross-reference real sglang and vLLM source when I got stuck. The lesson HTML in docs/ is a cleaned-up writeup of each session, including the Q&A blocks and a “pitfalls” / “debug log” section of bugs we actually hit; those are the parts most worth reading.

This is the same loop as my CT scan post: the agent collapses the discovery loop, but the only way to actually learn the material is to refuse the agent’s offer to just write the thing for you. Build the smallest version yourself; let the agent explain and review.

Where to start

If you want to read: the lesson site is here.

If you want to build: clone mini_sglang, run uv sync, and then run python -m scripts.l1_smoke. Then start at L1 and delete code until you can put it back yourself.