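Text Preprocessing

Text preprocessing is the first step in most natural language processing pipelines, and a critical one. Raw text usually contains noise, inconsistencies, and redundant information, so it has to be processed before it can be used for model training and analysis.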

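Why Text Preprocessing Is Needed

Raw text data has the following problems:

Noisy data: HTML tags, special characters, emoji, and similar content can interfere with analysis.

Inconsistent formatting: case, encoding, whitespace, and other formatting vary across sources.

Redundant information: stop words and punctuation contribute nothing to some tasks.

Language-specific properties: different languages need different handling; Chinese, for example, requires word segmentation.

Preprocessing produces a clean, consistent, and meaningful text representation, which improves the performance of downstream models.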

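Tokenization

Tokenization splits a continuous text sequence into meaningful units (tokens). It is the most basic operation in NLP.

English Tokenization

English words are separated by spaces, so tokenization is relatively simple, but contractions, punctuation, and similar cases still need handling.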
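To make the point about contractions and punctuation concrete, here is a minimal sketch that compares plain whitespace splitting with a small regular-expression tokenizer. The pattern and the `simple_tokenize` helper are assumptions made up for this illustration, not part of any library, and they keep contractions whole rather than splitting them the way NLTK does.

```python
import re

# Hypothetical pattern for illustration: a run of letters/digits with an
# optional apostrophe suffix, or any single non-space punctuation character.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?|[^\sA-Za-z0-9]")

def simple_tokenize(text):
    """Split text into word tokens and single-character punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

text = "Hello, world! I'm learning NLP."

# Whitespace splitting leaves punctuation glued to the neighboring word.
whitespace_tokens = text.split()
print(whitespace_tokens)
# ['Hello,', 'world!', "I'm", 'learning', 'NLP.']

# The regex tokenizer separates punctuation into its own tokens.
regex_tokens = simple_tokenize(text)
print(regex_tokens)
# ['Hello', ',', 'world', '!', "I'm", 'learning', 'NLP', '.']
```

A hand-written pattern like this breaks down quickly on abbreviations, URLs, and hyphenated words, which is the main reason to prefer a library tokenizer.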

Using NLTK

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

# Download the Punkt tokenizer data (recent NLTK releases may need 'punkt_tab' instead)
nltk.download('punkt')

text = "Hello, world! I'm learning NLP. It's fascinating!"

# Sentence splitting
sentences = sent_tokenize(text)
print(sentences)
# ['Hello, world!', "I'm learning NLP.", "It's fascinating!"]

# Word-level tokenization
tokens = word_tokenize(text)
print(tokens)
# ['Hello', ',', 'world', '!', 'I', "'m", 'learning', 'NLP', '.', 'It', "'s", 'fascinating', '!']

NLTK's tokenizer splits contractions (for example, "I'm" becomes "I" and "'m") and separates punctuation into its own tokens.

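Using spaCy

import spacy

# Requires the model to be installed first: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

text = "Hello, world! I'm learning NLP."

doc = nlp(text)

# Tokenization
tokens = [token.text for token in doc]
print(tokens)
# ['Hello', ',', 'world', '!', 'I', "'m", 'learning', 'NLP', '.']

# Sentence splitting
sentences = [sent.text for sent in doc.sents]
print(sentences)
# ['Hello, world!', "I'm learning NLP."]

spaCy's tokenizer applies language-specific rules and exception lists, so it handles more edge cases than whitespace splitting or a simple regular expression.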

Chinese Word Segmentation

Chinese text has no natural word boundaries, so word segmentation is both important and challenging.
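To show why missing word boundaries make segmentation a real algorithmic problem, the sketch below implements forward maximum matching, one of the simplest dictionary-based approaches. It is a swapped-in teaching technique, not the algorithm jieba uses, and the `forward_max_match` helper and the toy vocabulary are assumptions invented for this example.

```python
def forward_max_match(text, vocabulary, max_word_len=4):
    """Segment text greedily, always taking the longest dictionary word."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink the window.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens

# Toy vocabulary, invented for this example.
vocabulary = {"自然语言", "处理", "人工智能", "重要", "分支", "自然", "语言"}

segmented = forward_max_match("自然语言处理是人工智能的重要分支", vocabulary)
print("/".join(segmented))
# 自然语言/处理/是/人工智能/的/重要/分支
```

Greedy matching cannot resolve ambiguous boundaries and depends entirely on dictionary coverage, which is why practical segmenters combine a dictionary with statistical models.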

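Using jieba

import jieba

text = "自然语言处理是人工智能的重要分支"

# Precise mode (the default)
seg_list = jieba.cut(text, cut_all=False)
print("Precise mode:", "/".join(seg_list))
# Precise mode: 自然语言/处理/是/人工智能/的/重要/分支

# Full mode
seg_list = jieba.cut(text, cut_all=True)
print("Full mode:", "/".join(seg_list))
# Full mode: 自然/自然语言/语言/处理/是/人工/人工智能/智能/的/重要/分支

# Search engine mode
seg_list = jieba.cut_for_search(text)
print("Search engine mode:", "/".join(seg_list))
# Search engine mode: 自然/语言/自然语言/处理/是/人工/智能/人工智能/的/重要/分支

The three modes differ as follows:

Precise mode: cuts the sentence into the most accurate segmentation; suited to text analysis.
Full mode: scans out every possible word in the sentence; fast, but the output is redundant and ambiguity is not resolved.
Search engine mode: starts from the precise-mode result and splits long words again to improve recall.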

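Custom Dictionaries

jieba supports custom dictionary entries to improve segmentation accuracy:

import jieba

# Add custom words
jieba.add_word("自然语言处理")
jieba.add_word("人工智能")

text = "自然语言处理是人工智能的重要分支"
seg_list = jieba.cut(text)
print("/".join(seg_list))
# 自然语言处理/是/人工智能/的/重要/分支

# Remove a custom word
jieba.del_word("自然语言处理")

# Load a dictionary from a file
# jieba.load_userdict("user_dict.txt")

Dictionary file format: one entry per line, written as the word, its frequency, and its part-of-speech tag, separated by spaces. The frequency and the part-of-speech tag are optional.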
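As an illustration of that file format, the sketch below parses dictionary lines into their three fields. The `parse_dict_line` helper is hypothetical and is not part of jieba; it only shows how the optional frequency and tag fields can be told apart, and the sample entries are invented.

```python
def parse_dict_line(line):
    """Parse one 'word [frequency] [pos_tag]' line into a (word, freq, tag) tuple."""
    parts = line.split()
    word, freq, tag = parts[0], None, None
    for field in parts[1:]:
        if field.isdigit():
            freq = int(field)   # numeric field is the frequency
        else:
            tag = field         # anything else is treated as the POS tag
    return word, freq, tag

# Sample entries invented for this example.
sample_lines = ["自然语言处理 10 n", "人工智能 5", "深度学习 n", "分词"]
parsed = [parse_dict_line(line) for line in sample_lines]
for entry in parsed:
    print(entry)
# ('自然语言处理', 10, 'n')
# ('人工智能', 5, None)
# ('深度学习', None, 'n')
# ('分词', None, None)
```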

Subword Tokenization

Pretrained models usually rely on subword tokenization methods such as BPE and WordPiece.

from transformers import AutoTokenizer

# Load the BERT tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")

text = "自然语言处理很有趣"

# Tokenize (this tokenizer splits Chinese text into single characters)
tokens = tokenizer.tokenize(text)
print(tokens)
# ['自', '然', '语', '言', '处', '理', '很', '有', '趣']

# Convert to IDs (encode adds the special tokens [CLS] and [SEP])
ids = tokenizer.encode(text)
print(ids)
# A list of 11 IDs: 101 ([CLS]), one ID per character, then 102 ([SEP])

# Decode
decoded = tokenizer.decode(ids)
print(decoded)
# [CLS] 自 然 语 言 处 理 很 有 趣 [SEP]
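Because the Chinese example above only ever produces single characters, it hides how subword splitting works on longer words. The following is a minimal sketch of the greedy longest-match-first lookup used at inference time by WordPiece-style tokenizers. The `wordpiece_tokenize` function and the toy vocabulary are assumptions for illustration; the sketch does not cover how a vocabulary is trained.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No prefix matched, so the whole word maps to the unknown token.
            return [unk_token]
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabulary, invented for this example.
vocab = {"token", "##ization", "##ize", "un", "##related", "##s"}

print(wordpiece_tokenize("tokenization", vocab))  # ['token', '##ization']
print(wordpiece_tokenize("unrelated", vocab))     # ['un', '##related']
print(wordpiece_tokenize("xyz", vocab))           # ['[UNK]']
```

The benefit of this scheme is that rare words decompose into known pieces instead of collapsing into a single unknown token.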

Text Cleaning

Text cleaning removes noise and irrelevant information from text.

Removing HTML Tags

import re
from bs4 import BeautifulSoup

html_text = "<p>这是一段<b>HTML</b>文本。</p><br/>"

# Method 1: use a regular expression
clean_text = re.sub(r'<[^>]+>', '', html_text)
print(clean_text)
# 这是一段HTML文本。

# Method 2: use BeautifulSoup
clean_text = BeautifulSoup(html_text, 'html.parser').get_text()
print(clean_text)
# 这是一段HTML文本。
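The regular expression above leaves HTML entities such as `&amp;` untouched and keeps the contents of `<script>` blocks. As a sketch of a dependency-free alternative, the example below uses the standard library's `html.parser` module. The `TextExtractor` class and `strip_html` helper are names chosen for this illustration.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of script and style elements."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; automatically.
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)

def strip_html(markup):
    """Return the visible text of an HTML fragment."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)

html_text = "<p>Tom &amp; Jerry</p><script>var x = 1;</script>"
print(strip_html(html_text))
# Tom & Jerry
```

For messy real-world pages, BeautifulSoup remains the more forgiving choice; this sketch mainly shows what the regex approach misses.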

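Removing Special Characters

import re

text = "Hello!!! 这是测试文本... @#$%^&*()"

# Keep only Chinese characters, English letters, digits, and whitespace
clean_text = re.sub(r'[^\u4e00-\u9fa5a-zA-Z0-9\s]', '', text)
print(repr(clean_text))
# 'Hello 这是测试文本 '

# Keep only English letters and whitespace
clean_text = re.sub(r'[^a-zA-Z\s]', '', text)
print(repr(clean_text))
# 'Hello  '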
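Hard-coded character ranges drop letters from any script that is not listed. As a sketch of a more general alternative, the example below filters by Unicode general category instead, removing punctuation (categories starting with `P`) and symbols (categories starting with `S`). The `remove_punct_and_symbols` helper is a name invented for this example.

```python
import unicodedata

def remove_punct_and_symbols(text):
    """Drop characters whose Unicode category is punctuation (P*) or symbol (S*)."""
    return "".join(
        ch for ch in text
        if unicodedata.category(ch)[0] not in ("P", "S")
    )

text = "Hello!!! 这是测试文本... @#$%^&*() Привет"
cleaned = remove_punct_and_symbols(text)
print(cleaned)
# Hello 这是测试文本  Привет
```

Whitespace is left in place, so the result still contains a double space that a later whitespace-collapsing step would remove.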

Removing Extra Whitespace

import re

text = "  这   是   多余   空白  的  文本  "

# Strip leading and trailing whitespace
text = text.strip()

# Collapse each run of whitespace into a single space
clean_text = re.sub(r'\s+', ' ', text)
print(clean_text)
# 这 是 多余 空白 的 文本

Handling URLs and Email Addresses

import re

# user@example.com is an illustrative placeholder address
text = "访问 https://example.com 或联系 user@example.com"

# Replace URLs
text_with_url_placeholder = re.sub(
    r'https?://[^\s]+',
    '[URL]',
    text
)
print(text_with_url_placeholder)
# 访问 [URL] 或联系 user@example.com

# Replace email addresses
text_with_email_placeholder = re.sub(
    r'[\w.-]+@[\w.-]+\.\w+',
    '[EMAIL]',
    text
)
print(text_with_email_placeholder)
# 访问 https://example.com 或联系 [EMAIL]
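In Python 3, `\w` also matches Chinese characters, so the email pattern above can swallow adjacent Chinese text when no space separates it from the address. The sketch below applies both replacements in one helper and uses the `re.ASCII` flag to limit `\w` to ASCII letters, digits, and underscore. The `mask_urls_and_emails` name and the sample address are assumptions for illustration.

```python
import re

URL_RE = re.compile(r"https?://[^\s]+")
# re.ASCII restricts \w to [a-zA-Z0-9_], so neighboring Chinese characters
# are not absorbed into the local part of the address.
EMAIL_RE = re.compile(r"[\w.-]+@[\w.-]+\.\w+", re.ASCII)

def mask_urls_and_emails(text):
    """Replace URLs first, then email addresses, with placeholder tokens."""
    text = URL_RE.sub("[URL]", text)
    return EMAIL_RE.sub("[EMAIL]", text)

# Note: no space between the Chinese text and the address.
masked = mask_urls_and_emails("访问 https://example.com 或联系user@example.com")
print(masked)
# 访问 [URL] 或联系[EMAIL]
```

Neither pattern is a full validator; both are loose matchers meant for masking, not for checking that a URL or address is well formed.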

Text Normalization

Text normalization converts text into a uniform format.

Case Conversion

text = "Hello World! NLP is Amazing!"

# Convert to lowercase
lower_text = text.lower()
print(lower_text)
# hello world! nlp is amazing!

# Convert to uppercase
upper_text = text.upper()
print(upper_text)
# HELLO WORLD! NLP IS AMAZING!
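When the goal is case-insensitive comparison rather than display, `str.casefold()` is a more aggressive option than `lower()`. The short sketch below shows the difference on the German letter "ß"; the word list is invented for this example.

```python
words = ["Straße", "STRASSE", "strasse"]

# lower() keeps "ß", so the three spellings do not all collapse together.
lowered = {w.lower() for w in words}
print(len(lowered))
# 2

# casefold() maps "ß" to "ss", so all three become the same string.
folded = {w.casefold() for w in words}
print(folded)
# {'strasse'}
```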

Full-Width to Half-Width Conversion

def fullwidth_to_halfwidth(text):
    result = []
    for char in text:
        code = ord(char)
        if code == 0x3000:  # full-width (ideographic) space
            result.append(' ')
        elif 0xFF01 <= code <= 0xFF5E:  # full-width ASCII variants
            result.append(chr(code - 0xFEE0))
        else:
            result.append(char)
    return ''.join(result)

text = "Ｈｅｌｌｏ　Ｗｏｒｌｄ！１２３"
converted = fullwidth_to_halfwidth(text)
print(converted)
# Hello World!123
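The standard library offers a shortcut for the same conversion: Unicode NFKC normalization maps full-width characters to their ordinary equivalents. The sketch below is an alternative to the hand-written loop, with the caveat that NFKC also rewrites other compatibility characters, which may or may not be wanted.

```python
import unicodedata

text = "Ｈｅｌｌｏ　Ｗｏｒｌｄ！１２３"
normalized = unicodedata.normalize("NFKC", text)
print(normalized)
# Hello World!123

# NFKC goes beyond full-width forms, for example circled digits and unit symbols.
print(unicodedata.normalize("NFKC", "①㎏"))
# 1kg
```

The manual function is the better fit when only the full-width ASCII range should change and every other character must be left exactly as it is.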

Traditional and Simplified Chinese Conversion

from opencc import OpenCC

# Traditional to Simplified
cc = OpenCC('t2s')
traditional = "自然語言處理"
simplified = cc.convert(traditional)
print(simplified)
# 自然语言处理

# Simplified to Traditional
cc = OpenCC('s2t')
simplified = "自然语言处理"
traditional = cc.convert(simplified)
print(traditional)
# 自然語言處理

Installation: pip install OpenCC

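Stop Word Handling

Stop words are words that occur frequently in text but contribute little to its meaning, such as "的", "是", and "在" in Chinese or "the", "is", and "in" in English.

English Stop Words

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
# Recent NLTK releases may need 'punkt_tab' instead of 'punkt'
nltk.download('punkt')

text = "This is a sample sentence, showing off the stop words filtration."

# Get the English stop word list
stop_words = set(stopwords.words('english'))

# Tokenize
tokens = word_tokenize(text)

# Remove stop words (compare in lowercase, since the list is lowercase)
filtered_tokens = [w for w in tokens if w.lower() not in stop_words]
print(filtered_tokens)
# ['sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']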

Chinese Stop Words

import jieba

# A small Chinese stop word list (punctuation is included so it is filtered too)
stop_words = {
    '的', '了', '在', '是', '我', '有', '和', '就',
    '不', '人', '都', '一', '一个', '上', '也', '很', '到',
    '说', '要', '去', '你', '会', '着', '没有', '看', '好',
    '一名', '这个', '，'
}

text = "我是一名自然语言处理工程师，我非常喜欢这个领域"

# Tokenize
tokens = jieba.cut(text)

# Remove stop words
filtered_tokens = [w for w in tokens if w not in stop_words]
print(filtered_tokens)
# ['自然语言', '处理', '工程师', '非常', '喜欢', '领域']
# (the exact segmentation depends on the jieba dictionary version)

Stop words can also be loaded from a file:

def load_stopwords(file_path):
    # One stop word per line; blank lines are skipped
    with open(file_path, 'r', encoding='utf-8') as f:
        return {line.strip() for line in f if line.strip()}

stop_words = load_stopwords('stopwords.txt')

Stemming and Lemmatization

Stemming and lemmatization both reduce words to a base form. They are mainly used for English text.

Stemming

Stemming strips affixes using simple rules, so the result is not always a real word.

from nltk.stem import PorterStemmer, LancasterStemmer

words = ['running', 'ran', 'runs', 'easily', 'fairly']

# Porter Stemmer
porter = PorterStemmer()
porter_stems = [porter.stem(w) for w in words]
print("Porter:", porter_stems)
# Porter: ['run', 'ran', 'run', 'easili', 'fairli']

# Lancaster Stemmer
lancaster = LancasterStemmer()
lancaster_stems = [lancaster.stem(w) for w in words]
print("Lancaster:", lancaster_stems)
# Lancaster: ['run', 'ran', 'run', 'easy', 'fair']
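To make "simple rules" concrete, here is a deliberately tiny suffix-stripping stemmer. It is a toy written for this explanation, not the Porter or Lancaster algorithm, and the suffix table in `toy_stem` is an assumption chosen to keep the example short.

```python
def toy_stem(word):
    """Strip a few common English suffixes; not the Porter algorithm."""
    rules = (("ies", "i"), ("ing", ""), ("ly", ""), ("s", ""))
    for suffix, replacement in rules:
        # Keep at least three characters of stem to avoid over-stripping.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["studies", "jumping", "quickly", "cats", "bus", "ran"]:
    print(w, "->", toy_stem(w))
# studies -> studi
# jumping -> jump
# quickly -> quick
# cats -> cat
# bus -> bus
# ran -> ran
```

The toy rules turn "running" into "runn" and leave the irregular form "ran" unchanged, which illustrates why real stemmers need extra rules for doubled consonants and why stems are often not dictionary words.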

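Lemmatization

Lemmatization relies on a dictionary and grammatical rules, so the result is a real word.

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

nltk.download('wordnet')
# Recent NLTK releases may need 'averaged_perceptron_tagger_eng' instead
nltk.download('averaged_perceptron_tagger')

def get_wordnet_pos(word):
    """Map the word's Penn Treebank POS tag to a WordNet POS constant."""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {
        'J': wordnet.ADJ,
        'N': wordnet.NOUN,
        'V': wordnet.VERB,
        'R': wordnet.ADV
    }
    # Fall back to noun when the tag is not covered
    return tag_dict.get(tag, wordnet.NOUN)

lemmatizer = WordNetLemmatizer()

words = ['running', 'ran', 'runs', 'better', 'studies']

for word in words:
    pos = get_wordnet_pos(word)
    lemma = lemmatizer.lemmatize(word, pos=pos)
    print(f"{word} -> {lemma} (POS: {pos})")
# Output when the tagger assigns the part of speech shown
# (tagging isolated words is unreliable, so actual tags and lemmas may differ):
# running -> run (POS: v)
# ran -> run (POS: v)
# runs -> run (POS: v)
# better -> good (POS: a)
# studies -> study (POS: n)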
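Pulling the cleaning and normalization steps from the earlier sections together, the following is a minimal standard-library sketch of a preprocessing function. The `preprocess` name, the step order, and the sample input are assumptions made for illustration; a real pipeline would choose steps and order to suit its task, and would add tokenization and stop word removal afterwards.

```python
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"https?://\S+")
SPACE_RE = re.compile(r"\s+")

def preprocess(text):
    """Apply cleaning and normalization steps in a fixed order."""
    text = TAG_RE.sub(" ", text)                 # strip HTML tags
    text = unicodedata.normalize("NFKC", text)   # full-width -> half-width
    text = text.lower()                          # lowercase
    text = URL_RE.sub("[URL]", text)             # mask URLs after lowercasing
    return SPACE_RE.sub(" ", text).strip()       # collapse whitespace

raw = "<p>Ｖｉｓｉｔ  https://example.com  for <b>NLP</b> tips！</p>"
result = preprocess(raw)
print(result)
# visit [URL] for nlp tips!
```

URLs are masked after lowercasing so that the `[URL]` placeholder keeps its capitalization, a small example of how step order affects the result.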