<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Tokenization on Chen Kai Blog</title><link>https://www.chenk.top/en/tags/tokenization/</link><description>Recent content in Tokenization on Chen Kai Blog</description><generator>Hugo</generator><language>en</language><lastBuildDate>Sat, 28 Mar 2026 09:00:00 +0000</lastBuildDate><atom:link href="https://www.chenk.top/en/tags/tokenization/index.xml" rel="self" type="application/rss+xml"/><item><title>LLM Engineering (2): Tokenization Deep Dive</title><link>https://www.chenk.top/en/llm-engineering/02-tokenization/</link><pubDate>Sat, 28 Mar 2026 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/llm-engineering/02-tokenization/</guid><description>&lt;p>Tokenization is the layer everyone skips. It&amp;rsquo;s also the layer where I&amp;rsquo;ve debugged the most production bugs — silent quality regressions, mysterious cost spikes, models refusing to follow instructions because someone formatted the chat template wrong. This chapter is everything I wish I&amp;rsquo;d internalized before shipping a multilingual product.&lt;/p>
&lt;p>&lt;figure class="article-figure">
 &lt;img src="https://blog-pic-ck.oss-cn-beijing.aliyuncs.com/posts/en/llm-engineering/02-tokenization/illustration_1.png" alt="LLM Engineering (2): Tokenization Deep Dive — Chapter overview" loading="lazy" decoding="async" class="content-image">
 
&lt;/figure>
&lt;/p>
&lt;hr>
&lt;h2 id="what-a-tokenizer-actually-does" class="heading-anchor">What a tokenizer actually does&lt;a href="#what-a-tokenizer-actually-does" class="heading-link" aria-label="Permalink to this section" title="Copy link to this section">#&lt;/a>
&lt;/h2>&lt;p>A tokenizer maps a string to a list of integer IDs; decoding maps the IDs back to a string. Both directions are deterministic, but the mapping is not bijective in general: round-tripping &lt;code>tokenizer.decode(tokenizer.encode(s))&lt;/code> can lose whitespace, normalize Unicode, or collapse repeated punctuation, depending on the algorithm.&lt;/p></description></item></channel></rss>