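Text scraped from web pages is usually littered with HTML tags, URLs, and control characters. The goal of cleaning is to strip out this obvious noise while preserving as much of the original meaning as possible.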
```python
import re


def clean_text(text: str) -> str:
    """Remove HTML tags, URLs, email addresses, and non-letter characters; normalize whitespace."""
    text = re.sub(r'<[^>]+>', '', text)            # strip HTML tags
    text = re.sub(r'http\S+|www\.\S+', '', text)   # strip URLs
    text = re.sub(r'\S+@\S+', '', text)            # strip email addresses
    text = re.sub(r'[^a-zA-Z\s]', '', text)        # keep only letters and whitespace
    text = re.sub(r'\s+', ' ', text).strip()       # collapse runs of whitespace
    return text


raw = """<p>Check out https://example.com for info!</p>
Contact info@test.com. Price: $29.99"""
print(clean_text(raw))
# Check out for info Contact Price
```
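The function above is deliberately aggressive: it also drops digits and punctuation, so a price like "$29.99" disappears. When those carry meaning, a lighter pass may fit the stated goal better. The sketch below is one possible variant, not part of the original pipeline; the name `clean_text_light` and the sample string are made up for illustration, and it relies only on the standard library.

```python
import html
import re
import unicodedata


def clean_text_light(text: str) -> str:
    """Remove markup, URLs, and control characters; keep digits and punctuation."""
    text = re.sub(r'<[^>]+>', ' ', text)                 # drop HTML tags
    text = html.unescape(text)                           # decode entities such as &amp;
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)   # drop URLs
    # Drop control and format characters (Unicode categories Cc and Cf),
    # but keep ordinary whitespace so words do not get glued together.
    text = ''.join(
        ch for ch in text
        if ch.isspace() or unicodedata.category(ch) not in ('Cc', 'Cf')
    )
    return re.sub(r'\s+', ' ', text).strip()             # collapse whitespace


# Hypothetical input containing a NUL byte and a zero-width space.
sample = "<p>Price:\x00 $29.99 &amp; free\u200b shipping</p> https://example.com"
print(clean_text_light(sample))
# Price: $29.99 & free shipping
```

Which variant is appropriate depends on the downstream task: a letters-only pass suits simple bag-of-words experiments, while the lighter pass keeps numbers and punctuation available to models that can use them.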
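```python
from nltk.tokenize import sent_tokenize

# Requires the Punkt sentence models, e.g. nltk.download('punkt')
# (the resource name may differ depending on the NLTK version).
text = "Dr. Johnson works at A.I. Corp. He earned his Ph.D. in 2010."
sent_tokenize(text)
# ['Dr. Johnson works at A.I. Corp.', 'He earned his Ph.D. in 2010.']
```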
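Modern models such as GPT, BERT, Llama, and Claude no longer tokenize at the word level. They use subword tokenization instead, and most of them build on byte-pair encoding (BPE) or a close variant such as WordPiece. The core logic of BPE is straightforward:
1. Start with a vocabulary that contains only individual characters.
2. Count how often each pair of adjacent symbols occurs in the corpus.
3. Merge the most frequent pair into a new symbol.
4. Repeat until the vocabulary reaches the target size (typically between 30,000 and 100,000 entries).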
```
Corpus:  "low" x5, "lower" x2, "newest" x6, "widest" x3
Initial: l o w / l o w e r / n e w e s t / w i d e s t
Merge 1: (e, s) -> es    # frequent in both "newest" and "widest"
Merge 2: (es, t) -> est
Merge 3: (l, o) -> lo
...
```
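Why does BPE matter so much in practice?
Rare words can be decomposed: for example, unbelievable is split into un + believ + able, and each piece has been seen elsewhere in the corpus.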
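```python
import re
from collections import defaultdict


def get_stats(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs


def merge_vocab(pair, vocab):
    """Merge every occurrence of `pair` into a single symbol."""
    # Match the pair only at symbol boundaries; a plain str.replace could
    # also match across the edges of longer, already merged symbols.
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    replacement = ''.join(pair)
    return {pattern.sub(replacement, w): f for w, f in vocab.items()}


# Words are pre-split into characters; </w> marks the end of a word.
vocab = {
    'l o w </w>': 5,
    'l o w e r </w>': 2,
    'n e w e s t </w>': 6,
    'w i d e s t </w>': 3,
}

for step in range(5):
    pairs = get_stats(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
    print(f"Merge {step + 1}: {best} -> {''.join(best)}")
```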
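Training only produces an ordered list of merges; segmenting a new word means replaying those merges in the same order. The sketch below illustrates this for the unseen word "lowest". The function name `apply_merges` and the merge list are assumptions made for this example (the list is the kind of sequence the training loop above could learn from the toy corpus), not output copied from a real tokenizer.

```python
def apply_merges(word, merges):
    """Segment `word` by replaying learned merges in order."""
    symbols = list(word) + ['</w>']          # start from single characters
    for left, right in merges:
        merged = []
        i = 0
        while i < len(symbols):
            # Merge the pair wherever the two symbols are adjacent.
            if i < len(symbols) - 1 and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge list, in the order a training run might have learned it.
merges = [('e', 's'), ('es', 't'), ('est', '</w>'), ('l', 'o'), ('lo', 'w')]

print(apply_merges('lowest', merges))
# ['low', 'est</w>']
```

"lowest" never appears in the toy corpus, yet it splits into two pieces that do, which is the rare-word behavior described above.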
In production, use Hugging Face's tokenizers library directly. Through a single API it supports GPT-style BPE, BERT's WordPiece, and the Unigram model used by SentencePiece.
```python
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("The geese were running and swimming better than the mice")

for token in doc:
    print(f"{token.text:10} -> {token.lemma_:10} ({token.pos_})")

# geese      -> goose      (NOUN)
# were       -> be         (AUX)
# running    -> run        (VERB)
# swimming   -> swim       (VERB)
# better     -> well       (ADV)
# mice       -> mouse      (NOUN)
```
```python
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))

text = "The quick brown fox jumps over the lazy dog"
filtered = [w for w in word_tokenize(text.lower()) if w not in stop_words]
# ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
```
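So when should you remove stop words?
Remove them for bag-of-words models, topic modeling, and inverted indexes for search.
Keep them for sentiment analysis (not good and good mean opposite things), question answering (function words carry the intent of the question), and any deep learning model that can learn token weights on its own.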
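The sentiment case is easy to reproduce. The sketch below uses a deliberately tiny, hypothetical stop word list (NLTK's English list also contains "not") and a made-up helper, `remove_stop_words`, to show how filtering erases a negation.

```python
# Hypothetical, deliberately tiny stop word list for illustration only.
STOP_WORDS = {"the", "is", "was", "not", "a"}


def remove_stop_words(text: str) -> list[str]:
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


positive = remove_stop_words("The movie was good")
negative = remove_stop_words("The movie was not good")

print(positive)              # ['movie', 'good']
print(negative)              # ['movie', 'good']
print(positive == negative)  # True: the negation is gone
```

After filtering, a classifier sees the same tokens for both reviews, so no amount of training can separate them.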
```python
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

docs = [
    "I love machine learning",
    "Machine learning is amazing",
    "I love deep learning and machine learning",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
print(pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out()))
```
```
   amazing  and  deep  is  learning  love  machine
0        0    0     0   0         1     1        1
1        1    0     0   1         1     0        1
2        0    1     1   0         2     1        1
```
The fatal flaw: dog bites man and man bites dog produce exactly the same vector. A bag-of-words model throws away word order entirely.
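This is easy to verify with plain counters. The sketch below uses only the standard library; the helper names `bag_of_words` and `bag_of_bigrams` are made up for illustration. It also shows that counting adjacent word pairs (bigrams) recovers some local order, which is one common way to soften the problem.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count single words, ignoring their order."""
    return Counter(text.lower().split())


def bag_of_bigrams(text: str) -> Counter:
    """Count adjacent word pairs, which keeps some local order."""
    words = text.lower().split()
    return Counter(zip(words, words[1:]))


a, b = "dog bites man", "man bites dog"

print(bag_of_words(a) == bag_of_words(b))      # True: identical unigram counts
print(bag_of_bigrams(a) == bag_of_bigrams(b))  # False: the bigrams differ
```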
```python
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

docs = [
    "Machine learning is a subset of artificial intelligence",
    "Deep learning is a subset of machine learning",
    "Natural language processing uses machine learning",
    "Computer vision uses deep learning techniques",
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)
df = pd.DataFrame(X.toarray(), columns=tfidf.get_feature_names_out())

# Show the three highest-weighted terms for each document.
for i, doc in enumerate(docs):
    top = df.iloc[i].sort_values(ascending=False).head(3)
    print(f"Document {i + 1}: {dict(top.round(3))}")
```
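To see what the vectorizer is doing, it can help to compute the weights by hand. The sketch below is a simplified standard-library version that assumes the smoothed formula scikit-learn documents as its default, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization. It tokenizes on whitespace only, so its numbers will not match `TfidfVectorizer` exactly; the function name `tfidf_weights` and the toy documents are invented for this example.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Term count times smoothed IDF: ln((1 + n) / (1 + df)) + 1
        raw = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(v * v for v in raw.values()))
        weights.append({t: v / norm for t, v in raw.items()})
    return weights


toy_docs = [
    "machine learning is fun",
    "deep learning is powerful",
    "language models use machine learning",
]

for i, w in enumerate(tfidf_weights(toy_docs), start=1):
    ranked = sorted(w.items(), key=lambda kv: kv[1], reverse=True)
    print(f"Document {i}: {[(t, round(v, 3)) for t, v in ranked]}")
```

In the first toy document, "fun" occurs in one document and "learning" in all three, so "fun" receives the larger weight even though both appear once.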
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import numpy as np

texts = [
    "Congratulations! You've won a $1000 gift card. Call now!",
    "Hey, are we still meeting for dinner tonight?",
    "URGENT: Your account will be closed. Click here immediately!",
    "Can you send me the project report by EOD?",
    "Get rich quick! Amazing investment opportunity!",
    "Don't forget to pick up milk on your way home",
    "You have been selected for a free cruise. Reply YES",
    "Meeting moved to 3pm tomorrow in conference room B",
    "Lose 20 pounds in 2 weeks with this miracle pill!",
    "Thanks for your help with the presentation yesterday",
]
labels = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0])  # 1 = spam, 0 = ham

# Keep stop words: words like "now" and "you" are often strong spam signals.
# TextPreprocessor is not defined in this snippet; it is assumed to be the
# preprocessing class defined elsewhere in this article.
pre = TextPreprocessor(use_lemmatization=True, remove_stopwords=False)
processed = pre.preprocess_corpus(texts)

# Unigram and bigram TF-IDF features, capped at 50 terms.
vectorizer = TfidfVectorizer(max_features=50, ngram_range=(1, 2))
X = vectorizer.fit_transform(processed)

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.3, random_state=42, stratify=labels
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Classify two unseen messages.
new_msgs = ["Can you review my code?", "FREE MONEY!!! Click now!!!"]
new_vecs = vectorizer.transform(pre.preprocess_corpus(new_msgs))
for msg, pred in zip(new_msgs, model.predict(new_vecs)):
    print(f"[{'spam' if pred else 'ham'}] {msg}")
```