NLP (1): Introduction and Text Preprocessing

Every time you ask Claude a question, autocomplete a sentence in Gmail, or read a Google Translate page, you’re using a stack that took seventy years to build. Natural Language Processing (NLP) is the field that taught machines to read, score, transform, and write human language. Surprisingly, much of the modern NLP stack still relies on preprocessing techniques from decades ago.

This first article in the series does two things. First, it maps out the field’s history, current scope, and the reasons behind the tools we use. Second, it builds the foundational layer — cleaning, tokenization, normalization, and feature extraction — with code you can use directly in a project. By the end, you’ll have a reusable preprocessing pipeline and, more importantly, an understanding of when each step is helpful and when it can destroy signal.

What You Will Learn#

The four paradigms of NLP and the technical reason each one displaced the previous
A precise vocabulary for tokenization: characters, words, subwords, and why BPE won
How to build a configurable preprocessing pipeline in Python with NLTK, spaCy, and scikit-learn
The math behind Bag-of-Words and TF-IDF, and how to read the resulting matrices
Zipf’s law, n-gram language models, and why one-hot vectors fail
A decision table for when to apply (or skip) each preprocessing step

Prerequisites: Comfortable Python, light familiarity with NumPy and pandas, no prior NLP exposure required.

Four Eras of NLP#

NLP did not advance smoothly. It moved in jumps, each driven by a new representation of language. Knowing the sequence helps you choose the right tool: rule systems still beat neural nets for narrow form-filling, statistical methods still drive search ranking, and embeddings dominate everything else.

Symbolic Era (1950s — late 1980s)#

Early systems treated language as a logic problem. ELIZA (1966) matched user input against hand-crafted keyword patterns, much like today's regex rules, and rephrased the captured fragments; SHRDLU (1970) parsed instructions about a blocks world using a hand-written grammar. These systems were precise within their domain and completely brittle outside it: a synonym or a typo broke them. The lesson, in hindsight, is that language has too many surface forms for any human to enumerate.

Statistical Revolution (1990s)#

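The statistical turn replaced hand-written rules with probabilities estimated from corpus counts. The simplest case is the bigram model, which predicts each word from the one before it:

P(w_t \mid w_{t-1}) = \frac{\text{count}(w_{t-1}, w_t)}{\text{count}(w_{t-1})}

Count-based estimates like this one powered IBM’s statistical machine translation, the first viable speech recognizers, and probabilistic part-of-speech taggers. Hidden Markov Models extended the same idea to latent state, and probabilistic context-free grammars handled syntax. Features were still hand-engineered, but the parameters were learned from data.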

Deep Learning Era (2013 — 2016)#

The turning point was word2vec (2013), which learned a dense vector for every word from raw text. The vectors supported arithmetic on meaning:

\vec{\text{king}} - \vec{\text{man}} + \vec{\text{woman}} \approx \vec{\text{queen}}

Words were no longer atomic identifiers. They lived in a continuous space where similarity could be measured with a cosine. RNNs and LSTMs then became the standard sequence models, letting networks thread context through a sequence and learn from word order, not just bag-of-tokens counts.

Transformer Revolution (2017 — present)#

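The Transformer ("Attention Is All You Need", 2017) replaced recurrence with scaled dot-product attention:

\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V

Here $Q$, $K$, and $V$ are the query, key, and value matrices and $d_k$ is the key dimension. Two practical consequences mattered. First, all positions are processed in parallel during training, so there is no step-by-step recurrence to wait for; the price is compute and memory that grow quadratically with sequence length. Second, every token can attend directly to every other token, which makes long-range dependencies far easier to learn. BERT, GPT, and every modern LLM are direct descendants.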
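
To make the attention formula above concrete, here is a minimal NumPy sketch of a single attention head. The function name, the shapes, and the random inputs are illustrative assumptions; real Transformer layers add learned projections, masking, and multiple heads on top of this.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n_queries, n_keys)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))    # 4 query positions, d_k = 8
K = rng.normal(size=(4, 8))    # 4 key positions
V = rng.normal(size=(4, 16))   # one value vector per key, d_v = 16

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape)            # (4, 16)
print(weights.shape)           # (4, 4): one weight per (query, key) pair
```

The `weights` matrix is the source of the quadratic cost: it has one entry for every pair of positions.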

Era	Years	Core idea	What broke it
Symbolic	1950s to late 1980s	Hand-written rules and grammars	Cannot enumerate surface forms
Statistical	1990s to 2010s	Estimate probabilities from corpora	Hand-engineered features hit a ceiling
Deep learning	2013 to 2016	Learn dense representations end-to-end	Recurrence is sequential, slow to train
Transformer	2017 to now	Self-attention over the whole sequence	(Still being explored)

Insight: each shift solved the previous era’s bottleneck without throwing away the layer below. Even today, an LLM tokenizer is a statistical artifact, and many retrieval systems still keep a TF-IDF-style lexical scorer such as BM25 next to their embeddings.

Where NLP Shows Up Today#

Domain	Examples
Text classification	Sentiment, spam, intent routing
Information extraction	Named entities, relations, knowledge graphs
Generation	Translation, summarization, code
Conversational AI	ChatGPT, Claude, voice assistants
Search and analysis	Semantic search, topic modeling, RAG

The figure above arranges these into six clusters. Notice that almost every cluster ultimately consumes a vector — which is exactly what preprocessing produces.

The Preprocessing Pipeline at a Glance#

Before any model, raw text has to become numerical features. The standard pipeline has six stages, and each one is a deliberate choice that trades information for regularity.

Raw text
  -> Cleaning      (strip HTML, URLs, emails, junk characters)
  -> Tokenization  (split into words / subwords)
  -> Normalization (lowercase, lemmatize, optionally stem)
  -> Stopword pass (drop "the", "is", "at" if they hurt)
  -> Vectorization (BoW, TF-IDF, or embeddings)
  -> Model

A common mistake is to apply every step by reflex. The right framing is: each stage should remove noise that downstream cannot handle and preserve everything else. We will revisit this trade-off at every step.

Environment Setup#

pip install nltk spacy scikit-learn matplotlib numpy pandas beautifulsoup4
python -m spacy download en_core_web_sm

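import nltk
# Recent NLTK releases load 'punkt_tab' for tokenization; older ones use
# 'punkt'. Requesting both should be harmless.
for pkg in ['punkt', 'punkt_tab', 'stopwords', 'wordnet',
            'averaged_perceptron_tagger']:
    nltk.download(pkg, quiet=True)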

Step 1 — Text Cleaning#

Web text comes wrapped in HTML, peppered with URLs, and littered with control characters. Cleaning removes the obvious noise without touching meaning.

import re

def clean_text(text: str) -> str:
    """Strip HTML, URLs, emails, non-letters; collapse whitespace."""
    text = re.sub(r'<[^>]+>', ' ', text)             # HTML tags -> space, so "a<br>b" does not become "ab"
    text = re.sub(r'http\S+|www\.\S+', '', text)     # URLs
    text = re.sub(r'\S+@\S+', '', text)              # emails
    text = re.sub(r'[^a-zA-Z\s]', '', text)          # keep letters and spaces
    text = re.sub(r'\s+', ' ', text).strip()         # normalize whitespace
    return text

raw = """<p>Check out https://example.com for info!</p>
Contact info@test.com. Price: $29.99"""
print(clean_text(raw))
# Check out for info Contact Price

The aggressive cleaning trade-off. The function above also deletes digits and punctuation. That is fine for topic modeling, where numbers add noise, but it is wrong for:

Sentiment analysis — !!! and ?! carry emotion.
Named entity recognition — “Apple Inc.” needs the period and the capitalization.
Financial NLP — $29.99 is the actual signal you care about.

Always tailor the regex set to the task; do not apply a one-size-fits-all cleaner.
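
One way to tailor the cleaner is to expose the destructive steps as flags. The sketch below is a hypothetical variant of clean_text; the function name `clean_for_task`, the flag names, and the exact character sets are assumptions chosen for illustration, not a recommended standard.

```python
import re

TAG_RE = re.compile(r'<[^>]+>')
URL_RE = re.compile(r'http\S+|www\.\S+')
EMAIL_RE = re.compile(r'\S+@\S+')

def clean_for_task(text: str, keep_digits: bool = False,
                   keep_punct: bool = False) -> str:
    """Task-aware cleaner: always strips markup, optionally keeps signal."""
    text = TAG_RE.sub(' ', text)
    text = URL_RE.sub(' ', text)
    text = EMAIL_RE.sub(' ', text)
    allowed = r'a-zA-Z\s'
    if keep_digits:
        allowed += r'0-9$.,%'   # prices, amounts, percentages
    if keep_punct:
        allowed += "!?'"        # emphasis marks and apostrophes
    text = re.sub(f'[^{allowed}]', ' ', text)
    return re.sub(r'\s+', ' ', text).strip()

raw = "<p>Great deal!!! Only $29.99 at https://example.com</p>"
print(clean_for_task(raw))                    # Great deal Only at
print(clean_for_task(raw, keep_punct=True))   # Great deal!!! Only at
print(clean_for_task(raw, keep_digits=True))  # Great deal Only $29.99 at
```

The first call suits topic modeling, the second keeps the emphasis a sentiment model needs, and the third keeps the price a financial pipeline cares about.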

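Performance tip. Python's re module caches recently used patterns, so the gain is modest, but compiling once makes the intent explicit and skips the cache lookup when you process millions of documents: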

HTML_RE = re.compile(r'<[^>]+>')
URL_RE  = re.compile(r'http\S+|www\.\S+')
text = URL_RE.sub('', HTML_RE.sub('', text))

For HTML in the wild (malformed tags, embedded scripts), regex is fragile. Reach for a parser:

from bs4 import BeautifulSoup
text = BeautifulSoup(html_text, 'html.parser').get_text(' ', strip=True)

Step 2 — Tokenization#

Tokenization splits text into the atomic units a model will see. The boundary you choose — characters, words, subwords — determines vocabulary size, sequence length, and how gracefully the model handles words it has never seen.

Three tokenization strategies for the same input

Word Tokenization#

# Naive: a whitespace split cannot separate contractions or trailing punctuation
"Don't split can't".split()
# ["Don't", 'split', "can't"]

from nltk.tokenize import word_tokenize
tokens = word_tokenize("Dr. Smith earned $150,000 in 2023! Isn't that amazing?")
# ['Dr.', 'Smith', 'earned', '$', '150,000', 'in', '2023', '!',
#  'Is', "n't", 'that', 'amazing', '?']

NLTK keeps Dr. as one token, separates punctuation, and splits the contraction Isn't into Is + n't. Each of those decisions is a hard-coded English convention — which is exactly why word tokenization is brittle across languages.

Sentence Tokenization#

from nltk.tokenize import sent_tokenize
text = "Dr. Johnson works at A.I. Corp. He earned his Ph.D. in 2010."
sent_tokenize(text)
# Expected with the pretrained English Punkt model (exact splits can vary by version):
# ['Dr. Johnson works at A.I. Corp.', 'He earned his Ph.D. in 2010.']

NLTK’s Punkt model learns from data which periods end sentences and which mark abbreviations.
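
To see why a learned model is worth having, compare it with two hand-written rules. The sketch below is illustrative only: the regex rule and the hand-picked `ABBREVIATIONS` set are assumptions, not how Punkt works internally.

```python
import re

text = "Dr. Johnson works at A.I. Corp. He earned his Ph.D. in 2010."

# Rule 1: a sentence ends at '.', '!' or '?' followed by whitespace.
naive = re.split(r'(?<=[.!?])\s+', text)
print(naive)
# ['Dr.', 'Johnson works at A.I.', 'Corp.', 'He earned his Ph.D.', 'in 2010.']

# Rule 2: never split after a known abbreviation (hypothetical list).
ABBREVIATIONS = {"dr.", "a.i.", "corp.", "ph.d."}

def merge_abbreviations(pieces):
    """Glue a piece onto the next one when it ends in a known abbreviation."""
    sentences, buffer = [], ""
    for piece in pieces:
        buffer = f"{buffer} {piece}".strip()
        if buffer.split()[-1].lower() not in ABBREVIATIONS:
            sentences.append(buffer)
            buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences

patched = merge_abbreviations(naive)
print(len(patched))  # 1
```

Rule 1 over-splits into five fragments, and Rule 2 over-corrects into a single sentence because "Corp." is both an abbreviation and a sentence end here. Deciding between the two readings requires context, which is what a statistical model supplies.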

Subword Tokenization (BPE)#

Modern models (GPT, BERT, Llama, Claude) do not tokenize on words. They use subword tokenization, most often Byte-Pair Encoding (BPE) or a close relative such as BERT's WordPiece. BPE training works like this:

Start with a vocabulary of individual characters.
Count adjacent character pair frequencies across the corpus.
Merge the most frequent pair into a new symbol.
Repeat until the vocabulary reaches a target size (commonly 30k — 100k).

Corpus: "low" x5, "lower" x2, "newest" x6, "widest" x3
Initial:  l o w  /  l o w e r  /  n e w e s t  /  w i d e s t

Merge 1: (e, s) -> es        # frequent in "newest", "widest"
Merge 2: (es, t) -> est
Merge 3: (l, o) -> lo
...

Why BPE matters in practice:

Rare words decompose: unbelievable might split into un + believ + able (the exact pieces depend on the learned merges), and each piece has appeared elsewhere.
Vocabulary stays bounded: a 50k subword vocabulary can encode essentially any English text and most code, because unknown words fall back to smaller pieces.
Cross-lingual transfer: the same tokenizer handles English, French, and Mandarin if trained on a multilingual corpus.

Here is a minimal, runnable implementation:

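import re
from collections import defaultdict

def get_stats(vocab):
    """Count frequency of adjacent symbol pairs, weighted by word frequency."""
    pairs = defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Fuse every occurrence of `pair` into a single symbol."""
    # Match the pair only at symbol boundaries. A plain str.replace would
    # also rewrite 's t' inside 'es t' and fuse the wrong symbols.
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    merged = ''.join(pair)
    return {pattern.sub(lambda m: merged, w): f for w, f in vocab.items()}

# Words are pre-split into characters; '</w>' marks the end of a word.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}

for step in range(5):
    pairs = get_stats(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)   # ties go to the pair seen first
    vocab = merge_vocab(best, vocab)
    print(f"merge {step + 1}: {best} -> {''.join(best)}")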

For production, use Hugging Face’s tokenizers library, which ships BPE, WordPiece, and Unigram (the model family behind many SentencePiece tokenizers) behind a unified API.
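
Training is only half of BPE; at inference time the learned merges are replayed, in order, on each new word. The sketch below assumes the five merges that the toy training loop above should learn on its four-word corpus (including the `</w>` end-of-word marker); the hard-coded `merges` list and the function name `bpe_encode` are illustrative assumptions.

```python
def bpe_encode(word, merges):
    """Segment one word by replaying merges in the order they were learned."""
    symbols = list(word) + ['</w>']
    for left, right in merges:
        i, merged = 0, []
        while i < len(symbols):
            if (i < len(symbols) - 1
                    and symbols[i] == left and symbols[i + 1] == right):
                merged.append(left + right)   # fuse the pair
                i += 2
            else:
                merged.append(symbols[i])     # copy the symbol unchanged
                i += 1
        symbols = merged
    return symbols

# Assumed result of five training steps on the toy corpus.
merges = [('e', 's'), ('es', 't'), ('est', '</w>'), ('l', 'o'), ('lo', 'w')]

print(bpe_encode('lowest', merges))  # ['low', 'est</w>']
print(bpe_encode('slow', merges))    # ['s', 'low', '</w>']
```

Neither "lowest" nor "slow" appears in the training corpus, yet both are encoded from known pieces. That is the out-of-vocabulary behavior that makes subword tokenization attractive.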

Step 3 — Normalization#

Normalization collapses surface variants of the same word into a single form, which shrinks vocabulary and improves matching. It also throws information away, so apply it deliberately.

Lowercasing#

"Apple Inc. sells apples in APPLE stores".lower()
# "apple inc. sells apples in apple stores"

Lowercasing helps search and topic modeling. It hurts named-entity recognition (Apple the company collapses with apple the fruit) and any task where capitalization signals emphasis.

Stemming vs Lemmatization#

Stemming chops suffixes with deterministic rules. Fast, crude, sometimes wrong:

from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
for w in ['running', 'easily', 'connection']:
    print(f"{w} -> {stemmer.stem(w)}")
# running -> run
# easily -> easili
# connection -> connect

easili is not a word — the Porter stemmer optimizes for matching, not for legibility.

Lemmatization uses a dictionary plus part-of-speech information to return the actual lemma:

import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp("The geese were running and swimming better than the mice")
for token in doc:
    print(f"{token.text:10} -> {token.lemma_:10} ({token.pos_})")
# Selected rows; exact lemmas and tags can vary with the model version:
# geese      -> goose      (NOUN)
# were       -> be         (AUX)
# running    -> run        (VERB)
# swimming   -> swim       (VERB)
# better     -> well       (ADV)
# mice       -> mouse      (NOUN)

Aspect	Stemming	Lemmatization
Speed	Microseconds	Milliseconds (POS-tagged)
Output	May not be a real word	Always a dictionary form
Accuracy	Lower	Higher
Best for	Search and IR	NLU and QA

A useful default: use lemmatization unless you are running a high-throughput retrieval system where the latency budget is tight.

Step 4 — Stopwords and Zipf’s Law#

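Stopwords are common closed-class words such as the, is, at that carry little task-specific meaning. The list itself is short (a few hundred words at most), but those words are so frequent that removing them typically cuts the token count by roughly a third or more and concentrates signal in content words.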

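Zipf’s law explains why. In most natural-language corpora, a word’s frequency is roughly inversely proportional to its frequency rank:

f(\text{rank}) \propto \frac{1}{\text{rank}}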

Zipf distribution: head dominated by stopwords, long tail of rare words

The top ten words alone often account for 25 — 30% of all tokens. That is the head of the distribution, and it is mostly stopwords. The tail — thousands of words appearing once or twice — is where most semantic content lives, but it is also where models struggle and where subword tokenization earns its keep.
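
The head-versus-tail picture is easy to check on your own corpus. The sketch below uses only the standard library; the two-line toy text and the helper names are assumptions for illustration, and a corpus this small will only loosely follow Zipf's law.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, count) triples, most frequent word first."""
    counts = Counter(tokens).most_common()
    return [(rank, word, count)
            for rank, (word, count) in enumerate(counts, start=1)]

def top_k_coverage(tokens, k):
    """Fraction of all tokens covered by the k most frequent word types."""
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(k))
    return covered / len(tokens)

text = ("the cat sat on the mat and the dog sat on the log "
        "while the bird sang to the cat and the dog")
tokens = text.split()

for rank, word, count in rank_frequency(tokens)[:5]:
    print(f"{rank:>2}  {word:<5} count={count}  rank*count={rank * count}")
print(f"top-3 coverage: {top_k_coverage(tokens, 3):.0%}")
```

On a large corpus, the `rank*count` column stays roughly constant over the first few hundred ranks, and plotting count against rank on log-log axes gives an approximately straight line.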

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
text = "The quick brown fox jumps over the lazy dog"
filtered = [w for w in word_tokenize(text.lower()) if w not in stop_words]
# ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']

When to remove stopwords:

Yes — bag-of-words and topic models, search indexing.
No — sentiment (not good is not the same as good), QA (function words carry the question), any deep model that learns to weight tokens itself.
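
A middle path for sentiment work is to remove stopwords but protect negations. The sketch below uses a hypothetical miniature stopword set so it runs without downloads; in practice you would start from a full list such as NLTK's English stopwords, which also contains "not" and "no".

```python
# Hypothetical miniature lists, for illustration only.
STOP_WORDS = {"the", "is", "a", "this", "was", "not", "no", "at", "it"}
NEGATIONS = {"not", "no", "never", "nor"}

def remove_stopwords(tokens, keep_negations=False):
    """Drop stopwords, optionally keeping negation words."""
    stop = STOP_WORDS - NEGATIONS if keep_negations else STOP_WORDS
    return [t for t in tokens if t not in stop]

review = "the movie was not good".split()
print(remove_stopwords(review))                       # ['movie', 'good']
print(remove_stopwords(review, keep_negations=True))  # ['movie', 'not', 'good']
```

The first result reads as a positive review; the second keeps the word that flips its meaning.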

Step 5 — From Tokens to Vectors#

A model needs numbers. Two classical encodings — Bag-of-Words and TF-IDF — still anchor most retrieval systems and are the right baseline for any new task.

One-hot vs Distributed Representations#

Before we get to BoW, it helps to see why naive encodings fail. A one-hot vector assigns each word a unique index, with a 1 in that position and 0 everywhere else. Every pair of words is orthogonal, which means the encoding carries zero similarity information.

One-hot encoding loses semantics; learned embeddings recover them

Distributed representations — which we will build in Part 2 — pack meaning into dense vectors where related words sit near each other. BoW and TF-IDF are a halfway step: each word still gets its own dimension, but the value in that dimension is a frequency, not just a marker.
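
The orthogonality claim can be verified in a few lines. In this sketch the dense vectors are hand-made, two-dimensional, and purely illustrative (nothing is learned from data); only the one-hot part reflects a real encoding.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

vocab = ["cat", "dog", "car"]
one_hot = {w: [1.0 if i == j else 0.0 for j in range(len(vocab))]
           for i, w in enumerate(vocab)}

# Hand-made dense vectors: "cat" and "dog" point in similar directions.
dense = {"cat": [0.9, 0.1], "dog": [0.8, 0.2], "car": [0.1, 0.9]}

print(cosine(one_hot["cat"], one_hot["dog"]))   # 0.0
print(cosine(one_hot["cat"], one_hot["car"]))   # 0.0
print(cosine(dense["cat"], dense["dog"]) > cosine(dense["cat"], dense["car"]))  # True
```

With one-hot vectors, every distinct pair scores exactly zero, so "cat" is no closer to "dog" than to "car". Dense vectors can encode that difference.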

Bag of Words#

Represent each document as a vector of word counts, ignoring order:

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

docs = [
    "I love machine learning",
    "Machine learning is amazing",
    "I love deep learning and machine learning",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
print(pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out()))

   amazing  and  deep  is  learning  love  machine
0        0    0     0   0         1     1        1
1        1    0     0   1         1     0        1
2        0    1     1   0         2     1        1

The fatal limitation: dog bites man and man bites dog produce identical vectors. BoW discards order entirely.
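
A partial remedy is to count short word sequences instead of single words. The standard-library sketch below (the helper name `bag_of_ngrams` is an assumption) shows that unigram bags cannot tell the two sentences apart while bigram bags can; in scikit-learn the equivalent switch is the `ngram_range` parameter of CountVectorizer.

```python
from collections import Counter

def bag_of_ngrams(text, n=1):
    """Multiset of word n-grams, ignoring where they occur in the text."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

a, b = "dog bites man", "man bites dog"
print(bag_of_ngrams(a, 1) == bag_of_ngrams(b, 1))  # True: same words
print(bag_of_ngrams(a, 2) == bag_of_ngrams(b, 2))  # False: different pairs
```

Bigrams recover local order only, and they multiply the vocabulary size, which is the sparsity cost discussed under n-gram language models.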

TF-IDF#

\text{TF-IDF}(t, d) = \text{TF}(t, d) \cdot \text{IDF}(t)

\text{IDF}(t) = \log\!\frac{1 + N}{1 + \text{df}(t)} + 1

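where $N$ is the number of documents and $\text{df}(t)$ is the number of documents containing term $t$. This is the smoothed form that scikit-learn's TfidfVectorizer uses by default. The 1 added to the numerator and denominator prevents division by zero for a term with $\text{df}(t) = 0$, and the trailing +1 keeps a term that appears in every document from getting a weight of exactly zero.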
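
Working the formula by hand makes the weighting tangible. The sketch below computes the smoothed IDF above on a three-document toy corpus (an assumption for illustration) with raw counts as TF. The final L2 normalization is not part of the formula; it is included because TfidfVectorizer applies it by default.

```python
import math
from collections import Counter

docs = [
    "machine learning is fun",
    "deep learning is powerful",
    "machine translation uses deep learning",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

# Document frequency: in how many documents does each term occur?
df = Counter(term for doc in tokenized for term in set(doc))

def idf(term):
    """Smoothed IDF: log((1 + N) / (1 + df)) + 1, natural log."""
    return math.log((1 + N) / (1 + df[term])) + 1

def tfidf_vector(doc):
    """Raw-count TF times IDF, then scaled to unit L2 length."""
    tf = Counter(doc)
    weights = {t: tf[t] * idf(t) for t in tf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

print(f"idf(learning) = {idf('learning'):.3f}")   # 1.000, in every document
print(f"idf(fun)      = {idf('fun'):.3f}")        # 1.693, in one document
print({t: round(w, 3) for t, w in tfidf_vector(tokenized[0]).items()})
```

In the first document, "fun" ends up with the largest weight and "learning" with the smallest, even though each occurs exactly once there.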

Bag of Words counts versus TF-IDF weights on the same toy corpus

The figure above shows both matrices side by side. Notice how learning — present in every document — gets weighted down by TF-IDF, while a word like vision that is unique to one document gets lifted. That is exactly the ranking behavior you want for search.

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

docs = [
    "Machine learning is a subset of artificial intelligence",
    "Deep learning is a subset of machine learning",
    "Natural language processing uses machine learning",
    "Computer vision uses deep learning techniques",
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)
df = pd.DataFrame(X.toarray(), columns=tfidf.get_feature_names_out())

for i, doc in enumerate(docs):
    top = df.iloc[i].sort_values(ascending=False).head(3)
    print(f"Doc {i + 1}: {top.round(3).to_dict()}")  # to_dict() yields plain Python floats

Production-grade TF-IDF. The defaults rarely survive contact with a real corpus. Tune at least these knobs:

tfidf = TfidfVectorizer(
    max_features=5_000,    # cap vocabulary
    min_df=2,              # drop terms found in fewer than 2 documents
    max_df=0.8,            # drop terms found in more than 80% of docs (effective stopwords)
    ngram_range=(1, 2),    # unigrams + bigrams capture short phrases
    sublinear_tf=True,     # log-scale TF, dampens repetition
    stop_words='english',
)

Step 6 — N-gram Language Models#

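An n-gram language model scores a word sequence by conditioning each word on only the previous $n-1$ words (a Markov assumption), which approximates the exact chain-rule factorization:

P(w_1, \ldots, w_T) \approx \prod_{t=1}^{T} P(w_t \mid w_{t-n+1}, \ldots, w_{t-1})

A bigram model ($n = 2$) uses one word of context, a trigram ($n = 3$) uses two, and so on.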

N-gram windows, the bigram formula, and the perplexity vs sparsity trade-off

The trade-off is sharp:

Larger n captures more context, which lowers perplexity as long as there is enough training data (perplexity is roughly the model’s effective branching factor, so lower is better).
Larger n also explodes the parameter count and starves on rare contexts. With a vocabulary of size $V$, a trigram model has up to $V^3$ parameters, most of which see zero training examples. This is the sparsity problem, the central pain point of statistical NLP.

Smoothing techniques (Laplace, Kneser-Ney) patch the holes by redistributing probability mass to unseen n-grams. Modern neural language models largely sidestep the issue by sharing parameters across contexts via embeddings, which is the bridge to Part 2.
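
Here is a minimal sketch of a bigram model with add-one (Laplace) smoothing and a perplexity function. The three-sentence corpus, the `<s>` and `</s>` padding symbols, and the function names are assumptions for illustration; Kneser-Ney, the stronger method in practice, is not shown.

```python
import math
from collections import Counter

corpus = [
    "i love machine learning",
    "i love deep learning",
    "machine learning is fun",
]
# Pad each sentence so the first and last words get a context too.
sentences = [["<s>"] + s.split() + ["</s>"] for s in corpus]

unigrams = Counter(w for sent in sentences for w in sent)
bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
V = len(unigrams)  # vocabulary size, including the padding symbols

def prob(prev, word, alpha=1.0):
    """Add-alpha estimate of P(word | prev); alpha=1 is Laplace smoothing."""
    return (bigrams[prev, word] + alpha) / (unigrams[prev] + alpha * V)

def perplexity(sentence):
    """exp of the average negative log-probability per predicted token."""
    sent = ["<s>"] + sentence.split() + ["</s>"]
    log_p = sum(math.log(prob(p, w)) for p, w in zip(sent, sent[1:]))
    return math.exp(-log_p / (len(sent) - 1))

print(f"P(love | i)   = {prob('i', 'love'):.3f}")    # seen bigram
print(f"P(fun | love) = {prob('love', 'fun'):.3f}")  # unseen bigram, still > 0
print(f"training sentence: {perplexity('i love machine learning'):.2f}")
print(f"shuffled words:    {perplexity('learning machine love i'):.2f}")
```

The shuffled sentence uses the same words but scores a higher perplexity, because none of its bigrams were observed. Without smoothing, its probability would be zero and the perplexity undefined.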

A Reusable Preprocessing Class#

Putting the steps together into something you can drop into a project:

import re
import spacy
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

class TextPreprocessor:
    """Configurable English text preprocessing pipeline."""

    def __init__(self, use_lemmatization: bool = True,
                 remove_stopwords: bool = True):
        self.use_lemmatization = use_lemmatization
        self.remove_stopwords = remove_stopwords

        if use_lemmatization:
            self.nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
        else:
            self.stemmer = PorterStemmer()

        if remove_stopwords:
            self.stop_words = set(stopwords.words('english'))

    def clean(self, text: str) -> str:
        text = text.lower()
        text = re.sub(r'<[^>]+>', ' ', text)  # space, so words split by a tag do not fuse
        text = re.sub(r'http\S+|www\.\S+', '', text)
        text = re.sub(r'\S+@\S+', '', text)
        text = re.sub(r'[^a-zA-Z\s]', '', text)
        return re.sub(r'\s+', ' ', text).strip()

    def tokenize_and_normalize(self, text: str) -> list[str]:
        # Filter stopwords on the surface form, before stemming or
        # lemmatizing: Porter turns "this" into "thi", which would no
        # longer match the stopword list.
        stop = self.stop_words if self.remove_stopwords else frozenset()
        if self.use_lemmatization:
            doc = self.nlp(text)
            return [t.lemma_ for t in doc
                    if not t.is_space and t.text not in stop]
        return [self.stemmer.stem(t) for t in word_tokenize(text)
                if t not in stop]

    def preprocess(self, text: str) -> str:
        return ' '.join(self.tokenize_and_normalize(self.clean(text)))

    def preprocess_corpus(self, texts: list[str]) -> list[str]:
        return [self.preprocess(t) for t in texts]

pre = TextPreprocessor(use_lemmatization=True, remove_stopwords=True)
texts = [
    "Natural Language Processing (NLP) is amazing! Visit https://example.com",
    "Machine learning models are trained on large datasets.",
    "Deep learning has revolutionized computer vision and NLP.",
]
for orig, proc in zip(texts, pre.preprocess_corpus(texts)):
    print(f"original:  {orig}")
    print(f"processed: {proc}\n")

End-to-End Example: a Minimal Spam Classifier#

Combining everything into a working classifier. Use the SMS Spam Collection or any Kaggle spam dataset for real experiments; the snippet below is intentionally tiny so it runs anywhere.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import numpy as np

texts = [
    "Congratulations! You've won a $1000 gift card. Call now!",
    "Hey, are we still meeting for dinner tonight?",
    "URGENT: Your account will be closed. Click here immediately!",
    "Can you send me the project report by EOD?",
    "Get rich quick! Amazing investment opportunity!",
    "Don't forget to pick up milk on your way home",
    "You have been selected for a free cruise. Reply YES",
    "Meeting moved to 3pm tomorrow in conference room B",
    "Lose 20 pounds in 2 weeks with this miracle pill!",
    "Thanks for your help with the presentation yesterday",
]
labels = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0])  # 1=spam, 0=ham

# Keep stopwords: words like "now", "you", and "your" are spam signals.
pre = TextPreprocessor(use_lemmatization=True, remove_stopwords=False)
processed = pre.preprocess_corpus(texts)

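# Split first, then fit the vectorizer on the training fold only, so the
# test fold's vocabulary and IDF statistics do not leak into training.
train_txt, test_txt, y_train, y_test = train_test_split(
    processed, labels, test_size=0.3, random_state=42, stratify=labels)

vectorizer = TfidfVectorizer(max_features=50, ngram_range=(1, 2))
X_train = vectorizer.fit_transform(train_txt)
X_test = vectorizer.transform(test_txt)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)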

new_msgs = ["Can you review my code?", "FREE MONEY!!! Click now!!!"]
new_vecs = vectorizer.transform(pre.preprocess_corpus(new_msgs))
for msg, pred in zip(new_msgs, model.predict(new_vecs)):
    print(f"[{'SPAM' if pred else 'HAM'}] {msg}")

The point of the example is not the accuracy on ten samples; it is the shape of the pipeline. Swap in 5,000 SMS messages and the same code reaches roughly 97% accuracy with no further engineering. Because only about 13% of the SMS Spam Collection is spam, check precision and recall on the spam class as well as accuracy. That is the strength of the classical NLP stack: short, transparent, and remarkably hard to beat without a GPU.

Decision Table: Which Steps for Which Task#

Task	Tokenization	Normalization	Stopwords	Features
Search / IR	Word	Stem	Remove	TF-IDF
Sentiment	Word / subword	Lemma	Keep	TF-IDF or embeddings
Topic modeling	Word	Lemma	Remove	BoW or TF-IDF
Machine translation	Subword (BPE)	Minimal	Keep	Embeddings
NER	Word	None	Keep	Embeddings + context
Modern LLMs	Subword (BPE)	None	Keep	Learned embeddings

Rule of thumb. The more model capacity and data you have, the less preprocessing you should do. Deep models learn their own normalization; aggressive preprocessing destroys signal they could have used. Classical ML benefits from careful feature engineering; LLMs prefer raw text.

Summary#

Preprocessing is task-specific. Search wants aggressive normalization; neural models want raw text.
Subword tokenization (BPE, WordPiece, SentencePiece) is the modern default because it bounds vocabulary and handles unseen words.
TF-IDF remains the right baseline. If a TF-IDF + logistic regression baseline beats your fancy model, the fancy model is broken.
Zipf’s law explains why stopword removal helps classical models and why long-tail words are hard.
Less is often more. Over-preprocessing hurts representation learning. Always measure.