L6 — Tokenizer + incremental detokenize
6.1 Why this is its own lesson
L5's loop has tokenizer.decode(outputs[rid]) at the end. Fine for batch jobs, broken for streaming servers. Two naïve approaches both fail:
- Decode each token alone, concatenate. Breaks on multi-byte UTF-8. The Chinese char 你 is 3 bytes that a BPE tokenizer often splits across multiple tokens. decode([single_byte_token]) → UnicodeDecodeError or silent \ufffd.
- Decode the full history every step. Correct but O(N²); visible above ~256 tokens.
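The first failure can be reproduced without any tokenizer. The sketch below assumes the worst case, where every UTF-8 byte of 你 lands in its own token; real BPE vocabularies merge many common characters, so treat this as an illustration rather than a claim about a specific model.

```python
# Stand-in for a byte-level tokenizer: assume each UTF-8 byte is its own token.
char_bytes = "你".encode("utf-8")                      # 3 bytes
pieces = [char_bytes[i:i + 1] for i in range(len(char_bytes))]

# Strict decoding of a single piece raises.
try:
    pieces[0].decode("utf-8")
    strict_result = "decoded"
except UnicodeDecodeError:
    strict_result = "UnicodeDecodeError"

# Lenient decoding silently substitutes U+FFFD for every piece.
per_piece = "".join(p.decode("utf-8", errors="replace") for p in pieces)

# Decoding the pieces together recovers the character.
joined = b"".join(pieces).decode("utf-8")

print(strict_result)      # UnicodeDecodeError
print(repr(per_piece))    # '\ufffd\ufffd\ufffd'
print(joined)             # 你
```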
The right answer is a small window + diff trick, with one subtle correctness gap to handle.
6.2 The window-diff algorithm
K = 4  # tokens ≥ max UTF-8 char in bytes

class IncrementalDetokenizer:
    def __init__(self, tokenizer, prompt_ids):
        self.tokenizer = tokenizer
        self.tokens = prompt_ids.tolist()
        self.emitted_len = len(tokenizer.decode(prompt_ids, skip_special_tokens=False))

    def push(self, new_token_id):
        self.tokens.append(new_token_id)
        start = max(0, len(self.tokens) - K)
        window = self.tokens[start:]
        decoded_window = self.tokenizer.decode(window, skip_special_tokens=False)
        if decoded_window.endswith("\ufffd"):
            return ""  # still mid-char
        prefix_text = self.tokenizer.decode(
            self.tokens[start:-1], skip_special_tokens=False,
        )
        if "\ufffd" in prefix_text:
            # the previous step had partial chars that this token resolved.
            # window-diff is unreliable; fall back to the global cursor.
            full = self.tokenizer.decode(self.tokens, skip_special_tokens=False)
            new_text = full[self.emitted_len:]
        else:
            new_text = decoded_window[len(prefix_text):]
        self.emitted_len += len(new_text)
        return new_text

    def flush(self):
        final = self.tokenizer.decode(self.tokens, skip_special_tokens=False)
        if len(final) == self.emitted_len:
            return ""
        out = final[self.emitted_len:]
        self.emitted_len = len(final)
        return out
- Window of K=4 tokens. Enough for any UTF-8 char (max 4 bytes).
- Endswith check. If the new window ends with the replacement char, we're mid-char globally — wait for more tokens.
- Diff via decode(window[:-1]). Fast path; the new text is the suffix difference.
- Fallback when prefix has \ufffd. See §6.3.
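To put a number on the window's benefit, one can count how many tokens each strategy hands to decode() over a generation. This is a rough counting sketch: the helper names are illustrative, the window count is an upper bound for the fast path (window plus window[:-1] on every step), and it ignores the occasional O(N) fallback.

```python
K = 4

def full_history_cost(n_new: int, n_prompt: int = 0) -> int:
    """Tokens decoded if every step re-decodes the prompt plus all output so far."""
    return sum(n_prompt + i for i in range(1, n_new + 1))

def window_cost(n_new: int, n_prompt: int = 0, k: int = K) -> int:
    """Tokens decoded on the fast path: window and window[:-1] on each step."""
    total = 0
    for i in range(1, n_new + 1):
        w = min(k, n_prompt + i)   # the window is shorter for the first few tokens
        total += w + (w - 1)
    return total

print(full_history_cost(256))   # 32896
print(window_cost(256))         # 1780
```

The full-history count grows quadratically with output length, while the window count grows by at most 2K - 1 tokens per step.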
6.3 The subtle bug that bites every implementation
Consider an emoji that splits into 3 BPE tokens, like 🫨 = [9284, 104, 101] in Qwen3:
push(9284) -> '' (decoded window ends with \ufffd, wait)
push(104) -> '' (still mid-char)
push(101) -> ??? (emoji is now complete)
Without the fallback at push 3 (writing context for the text of the earlier tokens in the window, and N = len(context)):
decoded_window = context + "🫨"            (N+1 chars; clean, does NOT end with \ufffd)
prefix_text    = context + "\ufffd\ufffd"  (N+2 chars; two replacement chars for 2 partial bytes)
new_text = decoded_window[len(prefix_text):]
         = decoded_window[N + 2:]
         = ""   ← BUG: emoji never emits
The diff math overshoots because prefix_text (N+2 chars) is longer than decoded_window (N+1 chars). The emoji would only appear at the next flush().
The if "\ufffd" in prefix_text branch detects exactly this case and falls back to full[emitted_len:], which is the ground-truth diff against what we actually emitted.
student question If I see \ufffd, I don't emit and don't update emitted_len, right? Doesn't that handle it?
Right that emitted_len isn't updated on no-emit. But the bug isn't about emitted_len: on the fast path push() never reads it when computing the delta, it only increments it afterwards. Instead it re-derives the "previous state" by decoding window[:-1] from scratch each call. That re-derivation introduces spurious \ufffd chars (the byte-level replacement chars) that didn't exist in any text we actually emitted.
The endswith check correctly catches "the CURRENT window ends mid-char". It misses "the PREVIOUS window had \ufffds that just got resolved" — that's where the diff math breaks. The fallback handles exactly that case.
Verification
$ python /tmp/l6_proof.py
push(9284) -> ''
push(104) -> ''
push(101) -> '🫨' ← now emits at the resolving token
flush() -> '' ← nothing left
MATCH: True
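The same behaviour can be reproduced without loading a model tokenizer. The sketch below swaps in a toy byte-level tokenizer (token id == byte value, an assumption made purely for illustration), so the emoji takes four pushes here instead of the three in the Qwen3 trace. The stream() helper and its use_fallback flag are hypothetical names that mirror the push() logic from §6.2.

```python
K = 4
REPL = "\ufffd"

def decode(ids):
    """Toy byte-level tokenizer: token id == byte value (an assumption)."""
    return bytes(ids).decode("utf-8", errors="replace")

def stream(prompt_ids, output_ids, use_fallback):
    """Run the window-diff push loop and return the list of emitted deltas."""
    tokens = list(prompt_ids)
    emitted_len = len(decode(tokens))
    deltas = []
    for tok in output_ids:
        tokens.append(tok)
        start = max(0, len(tokens) - K)
        window = decode(tokens[start:])
        if window.endswith(REPL):
            deltas.append("")                  # still mid-char
            continue
        prefix = decode(tokens[start:-1])
        if use_fallback and REPL in prefix:
            # Ground-truth diff against what was actually emitted.
            new_text = decode(tokens)[emitted_len:]
        else:
            new_text = window[len(prefix):]    # fast path
        emitted_len += len(new_text)
        deltas.append(new_text)
    return deltas

prompt = list("hi ".encode("utf-8"))
output = list("🫨!".encode("utf-8"))           # 4 emoji bytes, then "!"

broken = stream(prompt, output, use_fallback=False)
fixed = stream(prompt, output, use_fallback=True)
print(broken)   # ['', '', '', '', '!']
print(fixed)    # ['', '', '', '🫨', '!']
```

Without the fallback the emoji is dropped at the resolving token; with it, the emoji is emitted as soon as its last byte arrives.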
6.4 Edge cases
- First call: emitted_len initialised to prompt's decoded length, so we never emit prompt text to the client.
- EOS: when tok_id == eos_id, the scheduler marks finished. The detokenizer's push() for EOS typically returns "" (decoded EOS is empty or a special marker). Pass skip_special_tokens=False if you want to see it on the wire.
- flush(): matters when generation stops mid-UTF-8-char (model emits a partial char and then EOS). Drains the dangling bytes or emits the replacement.
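The flush() case can be illustrated with a toy byte-level tokenizer (token id == byte value, an assumption for this sketch only). Generation stops after two of the three bytes of 你, so every push would have returned "" and only the final drain surfaces the dangling bytes.

```python
def decode(ids):
    """Toy byte-level tokenizer: token id == byte value (an assumption)."""
    return bytes(ids).decode("utf-8", errors="replace")

prompt = list(b"ok ")
emitted_len = len(decode(prompt))      # prompt text is never streamed

# Generation stops after only 2 of the 3 bytes of "你".
tokens = prompt + list("你".encode("utf-8")[:2])

# Each push() would have seen a window ending in U+FFFD and returned "",
# so emitted_len is still the prompt length when flush() runs.
final = decode(tokens)
tail = final[emitted_len:]             # what flush() would return
print(repr(tail))
```

Here the tail consists only of replacement characters, because the missing byte never arrived.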
6.5 Layering
design The scheduler does NOT touch the detokenizer. Engine emits token IDs; the streaming layer (L7 HTTP) formats them. Keeps the scheduler off the hot path for string operations and lets L7 swap streaming protocols cleanly.
6.6 Pitfalls (table of seven)
| pitfall | symptom |
|---|---|
| decode single token at a time | UnicodeDecodeError or silent \ufffd on CJK / emoji |
| decode full history each step | O(N²) latency visible above ~256 tokens |
| start = prefix_offset - K (chars as token index) | window is empty, every push returns "", all output appears in flush() |
| window-diff without the \ufffd-in-prefix fallback | multi-token UTF-8 chars never emit |
| skip_special_tokens=True mid-stream | EOS / system markers silently vanish |
| clean_up_tokenization_spaces=True | text retroactively reformats; breaks deltas |
| put detokenizer inside scheduler | scheduler becomes hot path for string ops |
6.7 Acceptance
L6 pass criteria
- IncrementalDetokenizer exposes __init__(tokenizer, prompt_ids), push(tok) -> str, flush() -> str.
- Streamed text concatenated equals tokenizer.decode(prompt + outputs)[len(prompt_text):] exactly.
- Multi-token UTF-8 chars (e.g. 🫨) emit at the resolving token, not at flush().
- No \ufffd in streamed text for prompts containing CJK / emoji.
- O(1)-ish per push() in the common case; worst case (fallback) is O(N).
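The second and fourth criteria lend themselves to a small automated check. The sketch below is one possible harness, not the course's grader: check_stream, FullHistoryDetok and toy_decode are hypothetical names, and the toy tokenizer (token id == byte value) stands in for a real one. It does not cover the "emit at the resolving token" or complexity criteria.

```python
def toy_decode(ids):
    """Toy byte-level tokenizer: token id == byte value (an assumption)."""
    return bytes(ids).decode("utf-8", errors="replace")

class FullHistoryDetok:
    """Correct but O(N^2) reference: re-decodes everything on each push."""
    def __init__(self, prompt_ids):
        self.tokens = list(prompt_ids)
        self.emitted_len = len(toy_decode(self.tokens))

    def push(self, tok):
        self.tokens.append(tok)
        full = toy_decode(self.tokens)
        if full.endswith("\ufffd"):
            return ""                      # still mid-char
        out = full[self.emitted_len:]
        self.emitted_len = len(full)
        return out

    def flush(self):
        full = toy_decode(self.tokens)
        out = full[self.emitted_len:]
        self.emitted_len = len(full)
        return out

def check_stream(make_detok, decode, prompt_ids, output_ids):
    """True if the streamed deltas reproduce the reference suffix exactly."""
    detok = make_detok(prompt_ids)
    streamed = "".join(detok.push(t) for t in output_ids) + detok.flush()
    prompt_text = decode(prompt_ids)
    expected = decode(list(prompt_ids) + list(output_ids))[len(prompt_text):]
    return streamed == expected and "\ufffd" not in streamed

prompt = list("你好 ".encode("utf-8"))
output = list("🫨 ok".encode("utf-8"))
ok = check_stream(FullHistoryDetok, toy_decode, prompt, output)
print(ok)   # True
```

To check the real class, one would pass a factory such as lambda ids: IncrementalDetokenizer(tokenizer, ids) together with a decode function that wraps tokenizer.decode.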