mini_sglang

L6 — Tokenizer + incremental detokenize

Stream text fragments out per token without re-decoding the full sequence each step.

6.1 Why this is its own lesson

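L5's loop calls tokenizer.decode(outputs[rid]) once at the end. That is fine for batch jobs but broken for streaming servers. Two naïve approaches both fail: decoding each new token on its own yields \ufffd (or a UnicodeDecodeError) on CJK and emoji, and re-decoding the full history every step costs O(N²), with latency visible above ~256 tokens.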

The right answer is a small window + diff trick, with one subtle correctness gap to handle.
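The two naive strategies (decoding each token alone, and re-decoding the whole history every step) can be sketched with a toy byte-level vocabulary. Everything here is an assumption for illustration: the VOCAB table, its token IDs, and the decode helper stand in for a real tokenizer, and Python's UTF-8 decoder plays the role of the tokenizer's byte-to-text step.

```python
# Toy byte-level vocabulary (hypothetical IDs, not a real tokenizer's).
VOCAB = {0: b"hi ", 1: b"\xf0\x9f", 2: b"\xab", 3: b"\xa8"}

def decode(ids):
    """Concatenate token bytes, then decode; invalid bytes become U+FFFD."""
    return b"".join(VOCAB[i] for i in ids).decode("utf-8", errors="replace")

stream = [0, 1, 2, 3]  # "hi " followed by an emoji split across 3 tokens

# Naive approach 1: decode every new token in isolation.
# The emoji's bytes never meet, so it can never be reassembled.
per_token = "".join(decode([t]) for t in stream)

# Naive approach 2: re-decode the full history each step and emit the tail.
# Output is correct, but step n decodes n tokens, so total work is N(N+1)/2.
emitted, chunks, work = 0, [], 0
for n in range(1, len(stream) + 1):
    full = decode(stream[:n])
    work += n                      # tokens decoded at this step
    if full.endswith("\ufffd"):    # still mid-character, hold back
        continue
    chunks.append(full[emitted:])
    emitted = len(full)
full_history = "".join(chunks)

print(repr(per_token))
print(repr(full_history), "tokens decoded:", work)
```

The second loop is the correctness baseline the window-diff approach tries to match while decoding only a bounded number of tokens per step.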

6.2 The window-diff algorithm

K = 4   # window size in tokens; at least the longest UTF-8 character in bytes (4)

class IncrementalDetokenizer:
    def __init__(self, tokenizer, prompt_ids):
        self.tokenizer   = tokenizer
        self.tokens      = prompt_ids.tolist()
        self.emitted_len = len(tokenizer.decode(prompt_ids, skip_special_tokens=False))

    def push(self, new_token_id):
        self.tokens.append(new_token_id)
        start = max(0, len(self.tokens) - K)
        window         = self.tokens[start:]
        decoded_window = self.tokenizer.decode(window, skip_special_tokens=False)

        if decoded_window.endswith("\ufffd"):
            return ""                                            # still mid-char

        prefix_text = self.tokenizer.decode(
            self.tokens[start:-1], skip_special_tokens=False,
        )
        if "\ufffd" in prefix_text:
            # the previous step had partial chars that this token resolved.
            # window-diff is unreliable; fall back to the global cursor.
            full = self.tokenizer.decode(self.tokens, skip_special_tokens=False)
            new_text = full[self.emitted_len:]
        else:
            new_text = decoded_window[len(prefix_text):]

        self.emitted_len += len(new_text)
        return new_text

    def flush(self):
        final = self.tokenizer.decode(self.tokens, skip_special_tokens=False)
        if len(final) == self.emitted_len:
            return ""
        out = final[self.emitted_len:]
        self.emitted_len = len(final)
        return out

6.3 The subtle bug that bites every implementation

Consider an emoji that splits into 3 BPE tokens, like 🫨 = [9284, 104, 101] in Qwen3:

push(9284) -> ''       (decoded window ends with \ufffd, wait)
push(104)  -> ''       (still mid-char)
push(101)  -> ???      (emoji is now complete)

Without the fallback at push 3:

decoded_window = context + "🫨"             (clean; does NOT end with \ufffd)
prefix_text    = context + "\ufffd\ufffd"  (two replacement chars for 2 partial bytes)
new_text       = decoded_window[len(prefix_text):]
               = (context + "🫨")[len(context) + 2:]
               = ""                       ← BUG: emoji never emits

With N = len(context), the diff math overshoots because prefix_text (N+2 chars) is longer than decoded_window (N+1 chars), so the slice starts past the end of the string. The emoji would only appear at flush().

The if "\ufffd" in prefix_text branch detects exactly this case and falls back to full[emitted_len:], which is the ground-truth diff against what we actually emitted.
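The arithmetic above can be reproduced in a few lines. This is a sketch under assumptions: the VOCAB table and token IDs are hypothetical, and Python's UTF-8 decoder stands in for the tokenizer, so the number of \ufffd characters in prefix_text may differ from what a real tokenizer produces. The window diff comes out empty either way, while the cursor-based fallback recovers the emoji.

```python
K = 4
# Hypothetical byte-level vocabulary: one context token, then a 3-token emoji.
VOCAB = {0: b"a", 1: b"\xf0\x9f", 2: b"\xab", 3: b"\xa8"}

def decode(ids):
    """Join token bytes and decode; incomplete sequences become U+FFFD."""
    return b"".join(VOCAB[i] for i in ids).decode("utf-8", errors="replace")

tokens = [0, 1, 2, 3]     # state right after the resolving token is appended
emitted_len = len("a")    # only the context has actually been emitted so far

start = max(0, len(tokens) - K)
decoded_window = decode(tokens[start:])      # context + emoji, clean
prefix_text = decode(tokens[start:-1])       # context + replacement char(s)

# Fast path without the guard: slices at or past the end of decoded_window.
naive_new_text = decoded_window[len(prefix_text):]

# Fallback: diff the full decode against the cursor of emitted characters.
full = decode(tokens)
fallback_new_text = full[emitted_len:]

print(repr(naive_new_text), repr(fallback_new_text))
```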

Student question: If I see \ufffd, I don't emit and don't update emitted_len, right? Doesn't that handle it?

You're right that emitted_len isn't updated on a no-emit step. But the bug isn't about emitted_len: push() doesn't use emitted_len on the fast path. It re-derives the "previous state" by decoding window[:-1] from scratch each call. That re-derivation introduces spurious \ufffd chars (the byte-level replacement chars) that didn't exist in any text we actually emitted.

The endswith check correctly catches "the CURRENT window ends mid-char". It misses "the PREVIOUS window had \ufffds that just got resolved" — that's where the diff math breaks. The fallback handles exactly that case.

Verification

$ python /tmp/l6_proof.py
push(9284) -> ''
push(104)  -> ''
push(101)  -> '🫨'     ← now emits at the resolving token
flush()    -> ''       ← nothing left
MATCH:     True
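The proof script itself is not shown, so the following is only an approximation of such a check. It assumes a toy byte-level tokenizer (ToyTokenizer and its token IDs are hypothetical, not Qwen3's) and a copy of the class from 6.2 adapted to take a plain list for prompt_ids instead of a tensor.

```python
K = 4

class ToyTokenizer:
    # Hypothetical vocabulary: the emoji is split across tokens 2, 3 and 4.
    VOCAB = {0: b"hi", 1: b" ", 2: b"\xf0\x9f", 3: b"\xab", 4: b"\xa8", 5: b"!"}

    def decode(self, ids, skip_special_tokens=False):
        raw = b"".join(self.VOCAB[i] for i in ids)
        return raw.decode("utf-8", errors="replace")

class IncrementalDetokenizer:
    def __init__(self, tokenizer, prompt_ids):
        self.tokenizer = tokenizer
        self.tokens = list(prompt_ids)          # list instead of tensor.tolist()
        self.emitted_len = len(tokenizer.decode(self.tokens))

    def push(self, new_token_id):
        self.tokens.append(new_token_id)
        start = max(0, len(self.tokens) - K)
        decoded_window = self.tokenizer.decode(self.tokens[start:])
        if decoded_window.endswith("\ufffd"):
            return ""                           # still mid-character
        prefix_text = self.tokenizer.decode(self.tokens[start:-1])
        if "\ufffd" in prefix_text:             # window diff unreliable
            full = self.tokenizer.decode(self.tokens)
            new_text = full[self.emitted_len:]
        else:
            new_text = decoded_window[len(prefix_text):]
        self.emitted_len += len(new_text)
        return new_text

    def flush(self):
        final = self.tokenizer.decode(self.tokens)
        out = final[self.emitted_len:]
        self.emitted_len = len(final)
        return out

tok = ToyTokenizer()
detok = IncrementalDetokenizer(tok, prompt_ids=[0, 1])   # prompt "hi "
generated = [2, 3, 4, 5]                                 # emoji, then "!"
pieces = [detok.push(t) for t in generated]
tail = detok.flush()
streamed = "".join(pieces) + tail
expected = tok.decode(generated)

for t, p in zip(generated, pieces):
    print(f"push({t}) -> {p!r}")
print(f"flush() -> {tail!r}")
print("MATCH:", streamed == expected)
```

The final comparison checks that the concatenated stream equals a one-shot decode of the generated tokens, which is the property a streaming detokenizer has to preserve.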

6.4 Edge cases

6.5 Layering

Design: The scheduler does NOT touch the detokenizer. The engine emits token IDs; the streaming layer (L7 HTTP) turns them into text. This keeps string operations off the scheduler's hot path and lets L7 swap streaming protocols cleanly.
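One way to picture this split is an engine that yields only (request id, token id) pairs and a separate streaming layer that owns every string operation. The sketch below is illustrative only: engine_steps, StreamingLayer, and the VOCAB table are hypothetical names, and Python's codecs incremental UTF-8 decoder is swapped in as a stand-in for the IncrementalDetokenizer from 6.2.

```python
import codecs

# Hypothetical byte-level vocabulary shared by the toy engine and streamer.
VOCAB = {0: b"hi", 1: b" ", 2: b"\xf0\x9f", 3: b"\xab", 4: b"\xa8"}

def engine_steps():
    """Stand-in for the engine: yields (request id, token id), never text."""
    schedule = [("a", 0), ("b", 0), ("a", 1), ("a", 2),
                ("b", 1), ("a", 3), ("a", 4)]
    for rid, token_id in schedule:
        yield rid, token_id

class StreamingLayer:
    """Owns all string work; keeps one incremental decoder per request."""

    def __init__(self):
        self.decoders = {}

    def on_token(self, rid, token_id):
        if rid not in self.decoders:
            self.decoders[rid] = codecs.getincrementaldecoder("utf-8")()
        # Returns "" while a multi-byte character is still incomplete.
        return self.decoders[rid].decode(VOCAB[token_id])

layer = StreamingLayer()
streams = {}
for rid, token_id in engine_steps():
    chunk = layer.on_token(rid, token_id)
    streams.setdefault(rid, []).append(chunk)

print(streams)
```

Because the engine side never imports or calls anything text-related, replacing the streaming protocol or the detokenizer only touches StreamingLayer.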

6.6 Pitfalls (table of seven)

| pitfall | symptom |
| --- | --- |
| decode single token at a time | UnicodeDecodeError or silent \ufffd on CJK / emoji |
| decode full history each step | O(N²) latency visible above ~256 tokens |
| start = prefix_offset - K (chars as token index) | window is empty, every push returns "", all output appears in flush() |
| window-diff without the \ufffd-in-prefix fallback | multi-token UTF-8 chars never emit |
| skip_special_tokens=True mid-stream | EOS / system markers silently vanish |
| clean_up_tokenization_spaces=True | text retroactively reformats; breaks deltas |
| put detokenizer inside scheduler | scheduler becomes hot path for string ops |

6.7 Acceptance

L6 pass criteria