I have been using sglang and vLLM as black boxes for a while. Paged KV cache, continuous batching, prefill/decode split — I could repeat the words but not actually draw the boxes. So I sat down with a coding agent and rebuilt the smallest possible serving engine that still has all of the real pieces, one lesson at a time.
The result is mini_sglang, a minimal inference engine that loads Qwen3 and serves it with continuous batching and a working radix KV cache.
tl;dr Nine lessons (L0–L8) that take you from an empty package to a FastAPI server serving Qwen3-8B with paged KV, continuous batching, incremental detokenization, and a radix prefix cache. Lesson site is hosted here. Code on github.com/moomou/mini_sglang.
Each lesson builds one module and keeps the smoke test green.
F.scaled_dot_product_attention. This is the baseline the rest of the lessons replace piece by piece.
ForwardMeta struct. Token-flat vs block-tensor storage and why L3 forces the switch.
Req state machine, per-request sampling params. This is the lesson where the engine stops being a script and starts being a server.
/generate, the Engine class, the SSE streaming protocol. The scheduler runs in its own thread; the HTTP layer just queues requests and drains output.
_common_prefix_len per descent; eviction is descending-level, LRU-within-level. This is the lesson that turns “shared system prompt” from a 30%-of-prefill tax into a cache hit.
A few things that stuck with me and that I would not have internalized from reading sglang source.
Paged KV is mostly an allocator problem. The L3 kernel change is real but small. The bulk of paged KV is bookkeeping: who owns which blocks, when to free them, how to gather a block_table into a flat tensor the kernel can index. Once L2’s allocator exists, L3 feels mechanical.
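To make the bookkeeping concrete, here is a minimal sketch of a free-list allocator plus the block_table-to-flat-index gather. The names (BlockAllocator, blocks_needed, slot_indices) are my own for illustration and are not taken from mini_sglang; plain Python lists stand in for tensors.

```python
from dataclasses import dataclass, field


@dataclass
class BlockAllocator:
    """Toy free-list allocator over fixed-size KV blocks (hypothetical names)."""
    num_blocks: int
    block_size: int
    free: list[int] = field(init=False)

    def __post_init__(self) -> None:
        # Every block starts free; pop() hands out the lowest id first.
        self.free = list(range(self.num_blocks - 1, -1, -1))

    def alloc(self, n: int) -> list[int]:
        if n > len(self.free):
            raise MemoryError(f"need {n} blocks, only {len(self.free)} free")
        return [self.free.pop() for _ in range(n)]

    def release(self, blocks: list[int]) -> None:
        self.free.extend(blocks)


def blocks_needed(num_tokens: int, block_size: int) -> int:
    return -(-num_tokens // block_size)  # ceiling division


def slot_indices(block_table: list[int], num_tokens: int, block_size: int) -> list[int]:
    """Flatten a block table into one physical slot index per token."""
    return [
        block_table[pos // block_size] * block_size + pos % block_size
        for pos in range(num_tokens)
    ]


alloc = BlockAllocator(num_blocks=8, block_size=4)
table_a = alloc.alloc(blocks_needed(6, 4))  # two blocks for a 6-token request
table_b = alloc.alloc(blocks_needed(3, 4))  # one block for a 3-token request
alloc.release(table_a)                      # request A finishes
table_c = alloc.alloc(blocks_needed(9, 4))  # reuses A's blocks, out of order
slots_c = slot_indices(table_c, 9, 4)
print(table_c, slots_c)
```

The point of the last two lines is that table_c is not contiguous and not sorted, yet slot_indices still gives the kernel one flat index per token. That indirection is the whole trick; everything else is ownership.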
Continuous batching is a scheduling decision, not a kernel feature. The kernels do not know about batching; they take whatever batch you hand them. The scheduler decides every step whether to admit a new prefill, continue decoding, or evict — and the magic is that those decisions happen every step instead of once per request.
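A toy version of that loop, with no model at all, shows the shape of the decision. This is a sketch under loud assumptions: the Req fields and the run function are hypothetical, prefill and decode are mixed in one step for brevity, and eviction is left out entirely. It is not how mini_sglang's scheduler is written.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Req:
    rid: str
    max_new_tokens: int
    generated: int = 0
    prefilled: bool = False


def run(waiting: deque, max_running: int) -> list[list[str]]:
    """Return, per step, what was in the batch (P = prefill, D = decode)."""
    running: list[Req] = []
    trace: list[list[str]] = []
    while waiting or running:
        # Admission is re-evaluated on every step, not once per request.
        while waiting and len(running) < max_running:
            running.append(waiting.popleft())
        step = []
        for req in running:
            if not req.prefilled:
                req.prefilled = True  # prefill also produces the first token
                step.append(f"P:{req.rid}")
            else:
                step.append(f"D:{req.rid}")
            req.generated += 1
        trace.append(step)
        # Finished requests leave now, freeing a slot for the next step.
        running = [r for r in running if r.generated < r.max_new_tokens]
    return trace


trace = run(deque([Req("a", 3), Req("b", 1), Req("c", 2)]), max_running=2)
for i, step in enumerate(trace):
    print(i, step)
```

Request c joins on the second step, as soon as b finishes, while a is still mid-decode. With static batching c would have waited for a to finish too.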
Incremental detokenization is harder than the model. L6 is the
shortest lesson and the one I burned the most time on. BPE has no
clean one-token-to-one-piece-of-text mapping, so you keep a sliding
window of recent tokens, detokenize the window, and diff the result
against the previous decode. Get the window size wrong and you split
a multi-byte UTF-8 character, and the stream starts emitting \ufffd.
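Here is a small sketch of the window-and-diff idea, using a two-offset variant that holds output back while the decoded tail is still an incomplete character. To keep it runnable without a tokenizer, toy_decode is a stand-in I made up: tokens are raw byte chunks, which is roughly what byte-level BPE pieces look like. The function and variable names are assumptions, not mini_sglang's.

```python
from typing import Iterable, Iterator, Sequence


def toy_decode(tokens: Sequence[bytes]) -> str:
    # Stand-in for tokenizer.decode: join byte chunks, replace broken UTF-8.
    return b"".join(tokens).decode("utf-8", errors="replace")


def stream_text(token_iter: Iterable[bytes], decode=toy_decode) -> Iterator[str]:
    tokens: list[bytes] = []
    prefix_offset = 0  # start of the decode window
    read_offset = 0    # tokens before this index were already emitted
    for tok in token_iter:
        tokens.append(tok)
        prefix_text = decode(tokens[prefix_offset:read_offset])
        full_text = decode(tokens[prefix_offset:])
        # Emit only when the window grew and does not end mid-character.
        if len(full_text) > len(prefix_text) and not full_text.endswith("\ufffd"):
            yield full_text[len(prefix_text):]
            prefix_offset = read_offset
            read_offset = len(tokens)


data = "héllo 🙂".encode("utf-8")
# Split so that both 'é' and the emoji straddle token boundaries.
toks = [data[0:1], data[1:2], data[2:3], data[3:7], data[7:9], data[9:11]]

naive = "".join(toy_decode([t]) for t in toks)  # decode each token alone
chunks = list(stream_text(toks))
print(repr(naive))
print(chunks)
```

Decoding token by token produces replacement characters; the windowed version emits nothing for a half-finished character and then the whole character once its last bytes arrive.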
The HTTP server is mostly cooperative queueing. Once the engine
exists, the “server” is a thread that drains a request queue and an
asyncio.Queue per request for streaming output. The interesting
design choice is where the event loop lives, not what the routes
look like.
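A stripped-down sketch of that queueing pattern, using only the standard library. The engine thread here just echoes words back, and engine_loop and generate are hypothetical names standing in for the scheduler and a route handler; there is no FastAPI or SSE in it.

```python
import asyncio
import queue
import threading


def engine_loop(requests: queue.Queue, loop: asyncio.AbstractEventLoop) -> None:
    """Runs in a worker thread; stands in for the scheduler (toy: echoes words)."""
    while True:
        item = requests.get()
        if item is None:  # shutdown sentinel
            return
        prompt, out = item
        for word in prompt.split():
            # asyncio.Queue is not thread-safe, so hop back onto the event loop.
            loop.call_soon_threadsafe(out.put_nowait, word)
        loop.call_soon_threadsafe(out.put_nowait, None)  # end-of-stream marker


async def generate(requests: queue.Queue, prompt: str) -> list[str]:
    """What a streaming handler does: enqueue, then drain its own queue."""
    out: asyncio.Queue = asyncio.Queue()
    requests.put((prompt, out))
    chunks = []
    while (chunk := await out.get()) is not None:
        chunks.append(chunk)
    return chunks


async def main() -> list[list[str]]:
    requests: queue.Queue = queue.Queue()
    worker = threading.Thread(
        target=engine_loop,
        args=(requests, asyncio.get_running_loop()),
        daemon=True,
    )
    worker.start()
    results = await asyncio.gather(
        generate(requests, "paged kv cache"),
        generate(requests, "radix tree"),
    )
    requests.put(None)
    worker.join()
    return results


results = asyncio.run(main())
print(results)
```

The event loop lives on the main thread and the engine never touches it directly; every output crosses the boundary through call_soon_threadsafe. Putting the loop anywhere else changes which side has to do that hop.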
Prefix caching is refcounting and not just a tree. The radix tree is the obvious part; the part that took longer to internalize is that it only works because every cached block carries a refcount the scheduler bumps and drops at exactly the right moments (admit, finish, evict). Get the refcounting wrong and you either leak blocks or free one that another request is still reading.
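The refcount discipline can be sketched without the tree at all. In the toy below a flat dict keyed by block-aligned prefixes stands in for the radix tree, and eviction simply frees every unreferenced block rather than following the level and LRU order described earlier. PrefixCache and its methods are hypothetical names for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PrefixCache:
    """Toy prefix cache: a flat dict stands in for the radix tree (assumption)."""
    block_size: int
    blocks: dict[tuple[int, ...], int] = field(default_factory=dict)  # prefix -> block id
    refcount: dict[int, int] = field(default_factory=dict)            # block id -> readers
    next_id: int = 0

    def admit(self, tokens: list[int]) -> tuple[list[int], int]:
        """Pin one block per full prefix chunk; return (block ids, cache hits)."""
        held, hits = [], 0
        for end in range(self.block_size, len(tokens) + 1, self.block_size):
            key = tuple(tokens[:end])
            if key in self.blocks:
                hits += 1
            else:
                self.blocks[key] = self.next_id
                self.refcount[self.next_id] = 0
                self.next_id += 1
            block = self.blocks[key]
            self.refcount[block] += 1  # bump on admit
            held.append(block)
        return held, hits

    def finish(self, held: list[int]) -> None:
        for block in held:
            self.refcount[block] -= 1  # drop on finish; the block stays cached

    def evict(self) -> list[int]:
        """Free only blocks that no running request is reading."""
        victims = [k for k, b in self.blocks.items() if self.refcount[b] == 0]
        freed = [self.blocks.pop(k) for k in victims]
        for block in freed:
            del self.refcount[block]
        return freed


cache = PrefixCache(block_size=2)
system = [1, 2, 3, 4]                             # shared system prompt
held_a, hits_a = cache.admit(system + [10, 11])
held_b, hits_b = cache.admit(system + [20, 21])   # reuses the two system blocks
cache.finish(held_a)
freed = cache.evict()                             # only A's private block goes
cache.finish(held_b)
freed_late = cache.evict()
print(hits_b, freed, freed_late)
```

After request A finishes, the shared system-prompt blocks survive eviction because B still holds a reference to them. Skip the bump in admit and that same evict call would free blocks B is reading; skip the drop in finish and nothing is ever freed.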
Each lesson started as a chat: ask the agent to explain a concept,
probe and ask for a clearer explanation, then ask for a design and
snippets. I wrote most of the code myself, with the agent for
cross-referencing real sglang and vLLM source when I got stuck. The
lesson HTML in docs/ is a cleaned-up writeup of each session,
including the Q&A blocks and a “pitfalls” / “debug log” section of
bugs we actually hit — those are the parts most worth reading.
This is the same loop as my CT scan post: the agent collapses the discovery loop, but the only way to actually learn the material is to refuse the agent’s offer to just write the thing for you. Build the smallest version yourself; let the agent explain and review.
If you want to read: the lesson site is here.
If you want to build: clone
mini_sglang, uv sync, and
run python -m scripts.l1_smoke. Then start at L1 and delete code
until you can put it back yourself.