<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Text Preprocessing on Chen Kai Blog</title><link>https://www.chenk.top/en/tags/text-preprocessing/</link><description>Recent content in Text Preprocessing on Chen Kai Blog</description><generator>Hugo</generator><language>en</language><lastBuildDate>Wed, 01 Oct 2025 09:00:00 +0000</lastBuildDate><atom:link href="https://www.chenk.top/en/tags/text-preprocessing/index.xml" rel="self" type="application/rss+xml"/><item><title>NLP (1): Introduction and Text Preprocessing</title><link>https://www.chenk.top/en/nlp/introduction-and-preprocessing/</link><pubDate>Wed, 01 Oct 2025 09:00:00 +0000</pubDate><guid>https://www.chenk.top/en/nlp/introduction-and-preprocessing/</guid><description>&lt;p>Every time you ask Claude a question, autocomplete a sentence in Gmail, or read a Google Translate page, you&amp;rsquo;re using a stack that took seventy years to build. Natural Language Processing (NLP) is the field that taught machines to read, score, transform, and write human language. Surprisingly, much of the modern NLP stack still relies on preprocessing techniques from decades ago.&lt;/p>
&lt;p>This first article in the series does two things. First, it maps out the field&amp;rsquo;s history, its current scope, and the reasons behind the tools we use. Second, it builds the foundational layer (cleaning, tokenization, normalization, and feature extraction) with code you can use directly in a project. By the end, you&amp;rsquo;ll have a reusable preprocessing pipeline and, more importantly, an understanding of when each step helps and when it can destroy signal.&lt;/p></description></item></channel></rss>