分词器详解
分词器(Tokenizer)是 Transformers 中至关重要的组件,负责将原始文本转换为模型能够处理的数字序列。本章将详细介绍分词器的原理、使用方法和最佳实践。
分词器概述
什么是分词?
分词是将文本切分成模型可处理的最小单元(token)的过程:
原始文本: "I love natural language processing!"
分词过程(token ID 以 bert-base-uncased 的词表为例):
┌─────────────────────────────────────────────────────────────┐
│  I           →  [1045]                                      │
│  love        →  [2293]                                      │
│  natural     →  [3019]                                      │
│  language    →  [2653]                                      │
│  processing  →  [6364]                                      │
│  !           →  [999]                                       │
└─────────────────────────────────────────────────────────────┘
最终输入: [101, 1045, 2293, 3019, 2653, 6364, 999, 102]
            ↑                                        ↑
          [CLS]                                    [SEP]
分词器的类型
| 类型 | 示例 | 特点 |
|---|---|---|
| 词级别 | 按空格/规则切词(早期 NLP 模型常用) | 按完整单词切分,词汇表大,OOV 问题严重 |
| 子词级别 | BPE (GPT)、WordPiece (BERT)、SentencePiece | 按子词切分,平衡词汇表大小与 OOV 处理 |
| 字符级别 | 字符级分词 | 按字符切分,几乎没有 OOV,但序列很长 |
常见分词算法
WordPiece(BERT 使用)
- 贪婪算法,按合并得分(似然)而非单纯频率选择子词合并
- 将未登录单词拆分为词表中的常见子词
- 词汇表通常 30000 左右(bert-base-uncased 为 30522)
Byte Pair Encoding (BPE, GPT 使用)
- 从单个字符(GPT-2 为字节级)开始,迭代合并最频繁的相邻符号对
- 逐步构建出子词词汇表
- 能处理任意 Unicode 字符
SentencePiece
- 将文本视为 Unicode 字符序列
- 不依赖语言特定规则
- 支持 BPE 和 Unigram
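下面用同一个单词对比 WordPiece 与 BPE 的切分结果,直观感受两种算法的差异(输出仅为常见示例,以实际词表为准):
from transformers import AutoTokenizer
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")  # WordPiece
gpt2_tok = AutoTokenizer.from_pretrained("gpt2")               # 字节级 BPE
word = "tokenization"
# WordPiece:非词首的子词带 "##" 前缀
print(bert_tok.tokenize(word))        # 例如 ['token', '##ization']
# BPE(GPT-2):用 "Ġ" 表示前导空格,切分边界与 WordPiece 不同
print(gpt2_tok.tokenize(" " + word))  # 例如 ['Ġtoken', 'ization']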
基本使用
加载分词器
from transformers import AutoTokenizer
# 使用 AutoTokenizer 自动加载
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# 或使用特定的分词器类
from transformers import BertTokenizer, GPT2Tokenizer, T5Tokenizer
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
gpt2_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
基础分词
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# 单条文本分词
text = "I love natural language processing"
tokens = tokenizer.tokenize(text)
print(tokens)
# ['i', 'love', 'natural', 'language', 'processing']
# 编码(文本 → token IDs)
encoded = tokenizer.encode(text)
print(encoded)
# [101, 1045, 2293, 3019, 2653, 6364, 102]
# 解码(token IDs → 文本)
decoded = tokenizer.decode(encoded)
print(decoded)
# '[CLS] i love natural language processing [SEP]'
Batch 分词
texts = [
"I love natural language processing",
"Deep learning is amazing",
"Transformers are powerful"
]
# 批量编码
encoded_batch = tokenizer(texts)
print(encoded_batch)
# {'input_ids': [[101, 1045, ...], [101, 2785, ...], ...],
#  'token_type_ids': [[0, 0, ...], [0, 0, ...], ...],
#  'attention_mask': [[1, 1, ...], [1, 1, ...], ...]}
# 批量解码
decoded_batch = tokenizer.batch_decode(encoded_batch['input_ids'])
print(decoded_batch)
详细参数配置
padding(填充)
# 不同填充策略
texts = ["short", "this is a longer text", "medium length"]
# padding=True:填充到本批次最长序列
padded = tokenizer(texts, padding=True)
print([len(x) for x in padded['input_ids']])  # [7, 7, 7],都补齐到最长的一条
# 等价写法:padding='longest'
padded = tokenizer(texts, padding='longest')
print([len(x) for x in padded['input_ids']])  # [7, 7, 7]
# 填充到固定长度
padded = tokenizer(texts, padding='max_length', max_length=10)
print([len(x) for x in padded['input_ids']]) # [10, 10, 10]
# 不填充(默认行为,各序列保持原长)
padded = tokenizer(texts, padding=False)
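填充出来的 [PAD] 位置会在 attention_mask 中标为 0,模型据此忽略它们。下面是一个小演示:
padded = tokenizer(["short", "this is a longer text"], padding=True)
for ids, mask in zip(padded['input_ids'], padded['attention_mask']):
    print(tokenizer.convert_ids_to_tokens(ids))
    print(mask)
# ['[CLS]', 'short', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
# [1, 1, 1, 0, 0, 0, 0]
# ['[CLS]', 'this', 'is', 'a', 'longer', 'text', '[SEP]']
# [1, 1, 1, 1, 1, 1, 1]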
truncation(截断)
# 截断策略
long_text = "This is a very long text " * 100
# 截断到最大长度
truncated = tokenizer(long_text, max_length=50, truncation=True)
print(len(truncated['input_ids'])) # 50
# 只截断第一句/第二句(用于句子对任务)
pair1 = "The first sentence"
pair2 = "This is a much longer second sentence that will definitely exceed the maximum length"
result = tokenizer(
    pair1, pair2,
    max_length=10,
    truncation='only_second'  # 只截断第二个句子,可选值还有 'only_first'、'longest_first'
)
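解码截断后的结果,可以直观看到只有第二个句子被截短(输出为示意):
print(tokenizer.decode(result['input_ids']))
# 大致输出: [CLS] the first sentence [SEP] this is a much [SEP]
# 第一个句子保持完整,被截掉的只有第二句的后半部分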
return_tensors(返回张量格式)
texts = ["I love NLP", "Deep learning is great"]
# 返回 Python 列表(默认,不指定 return_tensors)
result = tokenizer(texts)
# {'input_ids': [[...], [...]], 'attention_mask': [[...], [...]]}
# 返回 PyTorch 张量(批内长度不一时需同时指定 padding)
result = tokenizer(texts, padding=True, return_tensors='pt')
# {'input_ids': tensor([[...]]), 'attention_mask': tensor([[...]])}
# 返回 TensorFlow 张量
result = tokenizer(texts, padding=True, return_tensors='tf')
# 返回 JAX 张量
result = tokenizer(texts, padding=True, return_tensors='jax')
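之所以需要张量格式,是因为分词结果可以直接用 ** 解包后送入模型。下面是一个示意(以 bert-base-uncased 的 AutoModel 为例):
import torch
from transformers import AutoModel
model = AutoModel.from_pretrained("bert-base-uncased")
inputs = tokenizer(texts, padding=True, return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)  # 字典的键名与模型 forward 的参数名一一对应
print(outputs.last_hidden_state.shape)  # 形如 torch.Size([2, 序列长度, 768])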
add_special_tokens(特殊标记)
# BERT 特殊标记
# [CLS] - 分类标记,用于序列开头
# [SEP] - 分隔标记,用于句子对分隔
# [PAD] - 填充标记
# [UNK] - 未知词标记
# [MASK] - 掩码标记
text = "I love BERT"
tokens = tokenizer.tokenize(text)
print(tokens)
# ['i', 'love', 'bert']
# 添加特殊标记(默认开启)
encoded = tokenizer.encode(text, add_special_tokens=True)
print(encoded)
# [101, 1045, 2293, 14324, 102] # 101=[CLS], 102=[SEP]
# 不添加特殊标记
encoded_no_special = tokenizer.encode(text, add_special_tokens=False)
print(encoded_no_special)
# [1045, 2293, 14324]
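这些特殊标记及其 ID 也可以直接从分词器对象上查到:
# 查看分词器的特殊标记与对应 ID
print(tokenizer.special_tokens_map)
# {'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]',
#  'cls_token': '[CLS]', 'mask_token': '[MASK]'}
print(tokenizer.cls_token, tokenizer.cls_token_id)    # [CLS] 101
print(tokenizer.sep_token, tokenizer.sep_token_id)    # [SEP] 102
print(tokenizer.pad_token, tokenizer.pad_token_id)    # [PAD] 0
print(tokenizer.unk_token, tokenizer.unk_token_id)    # [UNK] 100
print(tokenizer.mask_token, tokenizer.mask_token_id)  # [MASK] 103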
分词器的高级功能
句子对编码
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
sentence1 = "What is natural language processing?"
sentence2 = "NLP is a branch of artificial intelligence"
# 编码句子对
encoded = tokenizer(
sentence1,
sentence2,
padding=True,
truncation=True,
max_length=128,
return_tensors='pt'
)
print(encoded)
# {
# 'input_ids': tensor([[101, 2054, 2003, ...]]),
# 'token_type_ids': tensor([[0, 0, 0, ..., 1, 1, ...]]), # 标记属于哪个句子
# 'attention_mask': tensor([[1, 1, 1, ...]])
# }
# 解码查看
print(tokenizer.decode(encoded['input_ids'][0]))
# [CLS] what is natural language processing ? [SEP] nlp is a branch of artificial intelligence [SEP]
处理特殊字符
# 中文分词示例
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
chinese_text = "你好,世界!"
tokens = tokenizer.tokenize(chinese_text)
print(tokens)
# ['你', '好', ',', '世', '界', '!']
# 编码
encoded = tokenizer.encode(chinese_text)
print(encoded)
# [101, 872, 1962, 8024, 3362, 8013, 106, 102]
# 带标点符号和特殊字符
mixed_text = "Hello 你好 123"
tokens = tokenizer.tokenize(mixed_text)
print(tokens)
# 中文始终按字切分;英文单词和数字可能保持完整,也可能被切成子词,具体取决于词表
获取 token 对应位置
text = "I love natural language processing"
# 获取每个 token 对应的字符位置(需要 Fast Tokenizer 支持)
encoding = tokenizer(text, return_offsets_mapping=True, add_special_tokens=False)
tokens = tokenizer.convert_ids_to_tokens(encoding['input_ids'])
offsets = encoding['offset_mapping']
print("Token 位置映射:")
for token, (start, end) in zip(tokens, offsets):
    print(f"  '{token}': 字符 {start} - {end}")
# 示例输出:
# 'i': 字符 0 - 1
# 'love': 字符 2 - 6
# ...
Fast vs Slow Tokenizer
from transformers import AutoTokenizer
# Slow Tokenizer(Python 实现)
slow_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)
# Fast Tokenizer(Rust 实现,更快)
fast_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
print(type(slow_tokenizer)) # <class 'transformers.models.bert.tokenization_bert.BertTokenizer'>
print(type(fast_tokenizer)) # <class 'transformers.models.bert.tokenization_bert_fast.BertTokenizerFast'>
# Fast Tokenizer 额外功能
fast_result = fast_tokenizer(text, return_offsets_mapping=True)
print(fast_result.word_ids())  # [None, 0, 1, 2, 3, 4, None],None 对应特殊标记,其余为 token 所属单词的索引
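批量处理时两者的速度差距最明显,可以用类似下面的计时脚本自行验证(耗时数值因机器而异):
import time
texts = ["I love natural language processing"] * 10000
start = time.time()
slow_tokenizer(texts, padding=True, truncation=True)
print(f"Slow tokenizer: {time.time() - start:.2f}s")
start = time.time()
fast_tokenizer(texts, padding=True, truncation=True)
print(f"Fast tokenizer: {time.time() - start:.2f}s")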
自定义分词器
添加新 token
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# 向词汇表添加新 token(作为普通 token;特殊标记见下一小节的 add_special_tokens)
new_tokens = ["[URL]", "[EMAIL]", "[PHONE]"]
num_added = tokenizer.add_tokens(new_tokens)
print(f"添加了 {num_added} 个新 token")
# 添加后需要同步调整模型 embedding 层大小
from transformers import AutoModel
model = AutoModel.from_pretrained("bert-base-uncased")
model.resize_token_embeddings(len(tokenizer))
# 使用新 token
text = "Visit our website [URL] or contact us [EMAIL]"
tokens = tokenizer.tokenize(text)
print(tokens)
# ['visit', 'our', 'website', '[URL]', 'or', 'contact', 'us', '[EMAIL]']
添加特殊标记
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# 添加特殊标记(不会被进一步分词,也不会被小写化;添加后同样需要 resize_token_embeddings)
special_tokens = {"additional_special_tokens": ["[USER]", "[SYSTEM]"]}
tokenizer.add_special_tokens(special_tokens)
text = "[USER] Hello [SYSTEM] How are you?"
tokens = tokenizer.tokenize(text)
print(tokens)
# ['[USER]', 'hello', '[SYSTEM]', 'how', 'are', 'you', '?']
分词器缓存
from transformers import AutoTokenizer
# 首次加载(会下载和缓存)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# 后续使用(从缓存加载,速度更快)
# Transformers 自动处理缓存
# 手动指定缓存目录
tokenizer = AutoTokenizer.from_pretrained(
"bert-base-uncased",
cache_dir="./huggingface_cache"
)
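除了依赖自动缓存,也可以显式把分词器保存到本地目录,便于之后离线加载(目录名仅为示例):
# 把分词器保存到本地目录
tokenizer.save_pretrained("./my_tokenizer")
# 之后可直接从本地目录加载,无需联网
tokenizer = AutoTokenizer.from_pretrained("./my_tokenizer")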
常见问题
1. 词汇表外(OOV)问题
# 检查 token 是否在词汇表中
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
word = "Transformer"
token_id = tokenizer.convert_tokens_to_ids(word)
print(f"Token ID: {token_id}")  # 100,即 [UNK] 的 ID:直接查表不做小写化,'Transformer' 不在词表中
# 而 tokenize 会先做小写化,再用 WordPiece 按需拆分
tokens = tokenizer.tokenize(word)
print(f"分词结果: {tokens}")  # 未登录词通常会被拆成词表中的子词(带 '##' 前缀),而不是直接变成 [UNK]
2. 中文分词
# 中文应该按字符级别分词(使用中文预训练模型)
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
chinese_text = "深度学习"
tokens = tokenizer.tokenize(chinese_text)
print(tokens) # ['深', '度', '学', '习']
# 英文用英文模型的子词(WordPiece)分词
english_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
english_text = "deep learning"
tokens = english_tokenizer.tokenize(english_text)
print(tokens)  # ['deep', 'learning']
3. 长文本处理
# 超过模型最大长度的文本需要截断
long_text = "..." # 很长的文本
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# 方法1:直接截断
result = tokenizer(long_text, max_length=512, truncation=True)
# 方法2:滑动窗口(用于文档级任务)
result = tokenizer(
long_text,
max_length=512,
truncation=True,
return_overflowing_tokens=True, # 返回截断后的片段
stride=128 # 片段之间重叠的 token 数
)
print(f"总片段数: {len(result['input_ids'])}")
分词器工具函数
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# 将 token 转换为 ID
ids = tokenizer.convert_tokens_to_ids(["hello", "world"])
print(ids) # [7592, 2088]
# 将 ID 转换为 token
tokens = tokenizer.convert_ids_to_tokens([7592, 2088])
print(tokens) # ['hello', 'world']
# 编码单个 token
token_id = tokenizer.convert_tokens_to_ids("hello")
print(token_id)  # 7592
# 解码单个 ID
token = tokenizer.convert_ids_to_tokens(7592)
print(token)  # 'hello'
# 获取词汇表大小(不包含后续新增的 token;len(tokenizer) 才包含)
print(tokenizer.vocab_size)  # 30522
# 获取词汇表(token → ID 的映射)
vocab = tokenizer.get_vocab()
print(f"词汇表大小: {len(vocab)}")
下一步
掌握分词器后,你可以继续学习: