L6 — Tokenizer + incremental detokenize

Stream text fragments out per token without re-decoding the full sequence each step.

6.1 Why this is its own lesson

L5's loop has tokenizer.decode(outputs[rid]) at the end. Fine for batch jobs, broken for streaming servers. Two naïve approaches both fail:

Decode each token alone, concatenate. Breaks on multi-byte UTF-8. The Chinese char 你 is 3 bytes that a BPE tokenizer often splits across multiple tokens. decode([single_byte_token]) → UnicodeDecodeError or silent \ufffd.
Decode the full history every step. Correct but O(N²); visible above ~256 tokens.
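Both failure modes can be reproduced without a real model. The sketch below is not the L5 code: it assumes a hypothetical toy tokenizer in which every token id is a single UTF-8 byte, which is enough to show the replacement characters and the quadratic amount of decoding work.

```python
# Hypothetical stand-in for a byte-level tokenizer: one token id per UTF-8 byte.
def decode(ids):
    return bytes(ids).decode("utf-8", errors="replace")

ids = list("你好".encode("utf-8"))  # 6 byte-tokens for 2 characters

# Naive approach 1: decode each token alone and concatenate.
# No single byte of a 3-byte character is valid UTF-8 on its own.
per_token = "".join(decode([i]) for i in ids)

# Naive approach 2: re-decode the whole history at every step.
# Correct at the end, but step n decodes n tokens, so the total is N(N+1)/2.
tokens_decoded = 0
full = ""
for n in range(1, len(ids) + 1):
    full = decode(ids[:n])
    tokens_decoded += n

print(repr(per_token))   # six replacement characters
print(full)              # the intended two characters
print(tokens_decoded)    # 21 token decodes for a 6-token sequence
```

A real BPE tokenizer splits text differently, but the shape of the problem is the same: a character split across tokens cannot be decoded one token at a time.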

The right answer is a small window + diff trick, with one subtle correctness gap to handle.

6.2 The window-diff algorithm

K = 4   # window size in tokens; >= 4, the max byte length of one UTF-8 char

class IncrementalDetokenizer:
    def __init__(self, tokenizer, prompt_ids):
        self.tokenizer   = tokenizer
        self.tokens      = prompt_ids.tolist()
        self.emitted_len = len(tokenizer.decode(prompt_ids, skip_special_tokens=False))

    def push(self, new_token_id):
        self.tokens.append(new_token_id)
        start = max(0, len(self.tokens) - K)
        window         = self.tokens[start:]
        decoded_window = self.tokenizer.decode(window, skip_special_tokens=False)

        if decoded_window.endswith("\ufffd"):
            return ""                                            # still mid-char

        prefix_text = self.tokenizer.decode(
            self.tokens[start:-1], skip_special_tokens=False,
        )
        if "\ufffd" in prefix_text:
            # the previous step had partial chars that this token resolved.
            # window-diff is unreliable; fall back to the global cursor.
            full = self.tokenizer.decode(self.tokens, skip_special_tokens=False)
            new_text = full[self.emitted_len:]
        else:
            new_text = decoded_window[len(prefix_text):]

        self.emitted_len += len(new_text)
        return new_text

    def flush(self):
        final = self.tokenizer.decode(self.tokens, skip_special_tokens=False)
        if len(final) == self.emitted_len:
            return ""
        out = final[self.emitted_len:]
        self.emitted_len = len(final)
        return out

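Window of K=4 tokens. A UTF-8 char is at most 4 bytes and every byte-level token carries at least one byte, so a char split across tokens always fits inside 4 tokens.
Endswith check. If the decoded window ends with the replacement char \ufffd, the stream is still mid-char; return "" and wait for more tokens.
Diff via decode(window[:-1]). Fast path; the new text is whatever decoded_window adds beyond prefix_text.
Fallback when prefix has \ufffd. See §6.3.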

6.3 The subtle bug that bites every implementation

Consider an emoji that splits into 3 BPE tokens, like 🫨 = [9284, 104, 101] in Qwen3:

push(9284) -> ''       (decoded window ends with \ufffd, wait)
push(104)  -> ''       (still mid-char)
push(101)  -> ???      (emoji is now complete)

Without the fallback at push 3:

decoded_window = context + "🫨"            (clean; does NOT end with \ufffd)
prefix_text    = context + "\ufffd\ufffd"  (two replacement chars for the 2 partial tokens)
new_text       = decoded_window[len(prefix_text):]
               = (context + "🫨")[len(context) + 2:]
               = ""                        ← BUG: emoji never emits

Here context is the text of the one earlier token that still falls inside the K=4 window. With N = len(context), prefix_text is N+2 chars but decoded_window is only N+1 chars, so the slice starts past the end of the string and comes back empty. The emoji would only appear at the next flush().

The if "\ufffd" in prefix_text branch detects exactly this case and falls back to full[emitted_len:], which is the ground-truth diff against what we actually emitted.
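The missed emission can be reproduced with a toy setup. This is a sketch under stated assumptions, not the Qwen3 trace: ByteTokenizer is a hypothetical tokenizer with one token per UTF-8 byte, so the emoji takes four tokens instead of three, and stream() is a condensed copy of the push() logic with the fallback made switchable.

```python
K = 4  # window size in tokens, as in the class above

class ByteTokenizer:
    """Hypothetical toy tokenizer: every token id is one UTF-8 byte."""
    def decode(self, ids, skip_special_tokens=False):
        return bytes(ids).decode("utf-8", errors="replace")

def stream(tokenizer, prompt_ids, output_ids, use_fallback):
    """Run the window-diff logic over output_ids; return the text of each push."""
    tokens = list(prompt_ids)
    emitted_len = len(tokenizer.decode(tokens))
    pieces = []
    for tok_id in output_ids:
        tokens.append(tok_id)
        start = max(0, len(tokens) - K)
        decoded_window = tokenizer.decode(tokens[start:])
        if decoded_window.endswith("\ufffd"):
            pieces.append("")            # still mid-char
            continue
        prefix_text = tokenizer.decode(tokens[start:-1])
        if use_fallback and "\ufffd" in prefix_text:
            # Ground-truth diff against what was actually emitted.
            new_text = tokenizer.decode(tokens)[emitted_len:]
        else:
            # Window diff; wrong when prefix_text holds replacement chars.
            new_text = decoded_window[len(prefix_text):]
        emitted_len += len(new_text)
        pieces.append(new_text)
    return pieces

tok = ByteTokenizer()
prompt = list("a".encode("utf-8"))
emoji = list("\U0001FAE8".encode("utf-8"))  # the shaking-face emoji, 4 byte-tokens

without_fallback = stream(tok, prompt, emoji, use_fallback=False)
with_fallback = stream(tok, prompt, emoji, use_fallback=True)
print(without_fallback)  # every push is empty; the emoji is lost until flush()
print(with_fallback)     # the last push carries the emoji
```

The exact number of replacement chars in prefix_text depends on the decoder, but any count of one or more is enough to push the slice start past the emoji.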

Student question: If I see \ufffd, I don't emit and don't update emitted_len, right? Doesn't that handle it?

Right that emitted_len isn't updated on no-emit. But the bug isn't about emitted_len — push() doesn't use emitted_len on the fast path. It re-derives the "previous state" by decoding window[:-1] from scratch each call. That re-derivation introduces spurious \ufffd chars (the byte-level replacement chars) that didn't exist in any text we actually emitted.

The endswith check correctly catches "the CURRENT window ends mid-char". It misses "the PREVIOUS window had \ufffds that just got resolved" — that's where the diff math breaks. The fallback handles exactly that case.

Verification

$ python /tmp/l6_proof.py
push(9284) -> ''
push(104)  -> ''
push(101)  -> '🫨'     ← now emits at the resolving token
flush()    -> ''       ← nothing left
MATCH:     True

6.4 Edge cases

First call: emitted_len initialised to prompt's decoded length, so we never emit prompt text to the client.
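EOS: when tok_id == eos_id, the scheduler marks the request finished. What push() returns for EOS depends on the decode flags: with skip_special_tokens=False, as in the class above, the EOS marker's text comes back and goes on the wire; with skip_special_tokens=True it decodes to "". If the marker should not reach the client, filter it in the streaming layer rather than changing the flag mid-stream.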
flush(): matters when generation stops mid-UTF-8-char (model emits a partial char and then EOS). Drains the dangling bytes or emits the replacement.

6.5 Layering

Design: The scheduler does NOT touch the detokenizer. The engine emits token IDs; the streaming layer (L7 HTTP) turns them into text. This keeps string operations off the scheduler's hot path and lets L7 swap streaming protocols cleanly.
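One way to picture that boundary is a small generator owned by the streaming layer. The names here (Detokenizer, stream_text, AsciiStub) are hypothetical and only illustrate the split: the engine side hands over an iterable of token IDs, and everything string-related happens on the consumer side.

```python
from typing import Iterable, Iterator, Protocol

class Detokenizer(Protocol):
    """The push/flush interface this lesson's class exposes."""
    def push(self, new_token_id: int) -> str: ...
    def flush(self) -> str: ...

def stream_text(token_ids: Iterable[int], detok: Detokenizer) -> Iterator[str]:
    """Streaming-layer side: turn engine token IDs into non-empty text deltas."""
    for tok_id in token_ids:
        piece = detok.push(tok_id)
        if piece:              # skip the empty "still mid-char" results
            yield piece
    tail = detok.flush()       # drain anything held back at end of generation
    if tail:
        yield tail

class AsciiStub:
    """Stand-in detokenizer for the demo: one ASCII char per token id."""
    def push(self, new_token_id: int) -> str:
        return chr(new_token_id)
    def flush(self) -> str:
        return ""

deltas = list(stream_text([104, 105], AsciiStub()))
print(deltas)
```

Because the scheduler only ever produces integers, the same token stream could feed a different protocol formatter without touching the engine.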

6.6 Pitfalls (table of seven)

pitfall	symptom
decode single token at a time	`UnicodeDecodeError` or silent `\ufffd` on CJK / emoji
decode full history each step	O(N²) latency visible above ~256 tokens
`start = prefix_offset - K` (chars as token index)	window is empty, every push returns `""`, all output appears in `flush()`
window-diff without the `\ufffd`-in-prefix fallback	multi-token UTF-8 chars never emit
`skip_special_tokens=True` mid-stream	EOS / system markers silently vanish
`clean_up_tokenization_spaces=True`	text retroactively reformats; breaks deltas
put detokenizer inside scheduler	string ops land on the scheduler's hot path

6.7 Acceptance

L6 pass criteria

IncrementalDetokenizer exposes __init__(tokenizer, prompt_ids), push(tok) -> str, flush() -> str.

Streamed text concatenated equals tokenizer.decode(prompt + outputs)[len(prompt_text):] exactly.

Multi-token UTF-8 chars (e.g. 🫨) emit at the resolving token, not at flush().

No \ufffd in streamed text for prompts containing CJK / emoji.

O(1)-ish per push() in the common case; worst case (fallback) is O(N).
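The exact-match and no-\ufffd criteria can be checked mechanically. The helper below is a sketch, not the course's grader: check_acceptance, ByteTokenizer (one token per UTF-8 byte) and FullDecodeDetokenizer (a slow reference that re-decodes everything) are all hypothetical names, and the reference class simply stands in for whatever implementation is under test.

```python
class ByteTokenizer:
    """Hypothetical toy tokenizer: every token id is one UTF-8 byte."""
    def decode(self, ids, skip_special_tokens=False):
        return bytes(ids).decode("utf-8", errors="replace")

class FullDecodeDetokenizer:
    """Slow O(N)-per-push reference; stands in for the class under test."""
    def __init__(self, tokenizer, prompt_ids):
        self.tokenizer = tokenizer
        self.tokens = list(prompt_ids)
        self.emitted_len = len(tokenizer.decode(self.tokens))

    def push(self, new_token_id):
        self.tokens.append(new_token_id)
        full = self.tokenizer.decode(self.tokens)
        if full.endswith("\ufffd"):
            return ""                      # still mid-char
        out = full[self.emitted_len:]
        self.emitted_len = len(full)
        return out

    def flush(self):
        full = self.tokenizer.decode(self.tokens)
        out = full[self.emitted_len:]
        self.emitted_len = len(full)
        return out

def check_acceptance(tokenizer, detok, prompt_ids, output_ids):
    """True if the streamed text equals the full decode minus the prompt text."""
    pieces = [detok.push(t) for t in output_ids]
    pieces.append(detok.flush())
    streamed = "".join(pieces)
    prompt_text = tokenizer.decode(list(prompt_ids))
    full_text = tokenizer.decode(list(prompt_ids) + list(output_ids))
    expected = full_text[len(prompt_text):]
    return streamed == expected and "\ufffd" not in streamed

tok = ByteTokenizer()
prompt_ids = list("hi ".encode("utf-8"))
output_ids = list("你好 \U0001FAE8".encode("utf-8"))
passed = check_acceptance(tok, FullDecodeDetokenizer(tok, prompt_ids), prompt_ids, output_ids)
print(passed)
```

Swapping FullDecodeDetokenizer for the windowed class should leave the result unchanged; the criterion about emitting at the resolving token needs a separate per-push assertion.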
