
Guide to Using Chinese Models

Chinese natural language processing comes with its own challenges and characteristics. This chapter introduces the Chinese pretrained models commonly used on Hugging Face and shows how to use them correctly on Chinese text.

Characteristics of Chinese NLP

Chinese differs markedly from English and other Western languages, and these differences affect which models you choose and how you use them:

┌───────────────────────────────────────────────────────────────────┐
│ English vs. Chinese NLP                                           │
├───────────────────────────────────────────────────────────────────┤
│                                                                   │
│  English                          Chinese                         │
│  ───────                          ───────                         │
│  Words separated by spaces        No explicit word delimiters     │
│  "I love programming"             "我喜欢编程"                    │
│                                                                   │
│  Relatively bounded vocabulary    Flexible word composition       │
│  ~500k common words               Compounds yield an open-ended   │
│                                   vocabulary                      │
│                                                                   │
│  Rich inflection                  No inflection                   │
│  run / running / ran              跑 / 跑 / 跑                    │
│                                                                   │
│  Case distinctions                No case distinction             │
│  Apple / apple                    苹果 / 苹果                     │
│                                                                   │
└───────────────────────────────────────────────────────────────────┘

Core Challenges

  1. Word segmentation: Chinese has no natural word boundaries, so text must either be segmented into words or processed at the character level (see the sketch after this list)
  2. Large vocabulary: Chinese words combine freely, so the vocabulary needs to be large enough
  3. Character semantics: individual characters carry meaning of their own, so character-level representations matter
  4. Polysemy: the same word can mean different things in different contexts
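
As a concrete illustration of the segmentation problem, a word segmenter such as jieba recovers word boundaries that character-level tokenizers ignore. A minimal sketch, assuming jieba is installed (pip install jieba):

import jieba  # widely used Chinese word segmenter; assumed installed

text = "我喜欢自然语言处理"

# Word-level segmentation
print(list(jieba.cut(text)))   # e.g. ['我', '喜欢', '自然语言', '处理']

# Character-level "segmentation", as used by bert-base-chinese
print(list(text))              # ['我', '喜', '欢', '自', '然', '语', '言', '处', '理']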

Overview of Mainstream Chinese Models

Chinese BERT-family Models

Model | Features | Typical tasks
bert-base-chinese | Google's official Chinese BERT, character-level tokenization | General understanding tasks
hfl/chinese-roberta-wwm-ext | Whole-word-masking model from HFL (HIT & iFLYTEK joint lab) | Text classification, NER
hfl/chinese-roberta-wwm-ext-large | Large whole-word-masking variant | High-accuracy understanding tasks
hfl/chinese-bert-wwm-ext | Whole-word-masking BERT-base | General understanding tasks

Chinese RoBERTa-family Models

Model | Features | Typical tasks
uer/chinese_roberta_L-12_H-768 | Pretrained with UER-py | General understanding tasks
uer/chinese_roberta_L-24_H-1024 | Larger variant | High-accuracy tasks
sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 | Multilingual sentence embeddings | Semantic similarity

Generative Chinese Models

Model | Features | Typical tasks
Qwen/Qwen-7B-Chat | Alibaba's Tongyi Qianwen | Dialogue, generation
THUDM/chatglm3-6b | Tsinghua's ChatGLM | Dialogue, generation
baichuan-inc/Baichuan2-7B-Chat | Baichuan LLM | Dialogue, generation
shibing624/text2vec-base-chinese | Chinese sentence embeddings | Semantic matching

Using Chinese BERT in Detail

bert-base-chinese

The official Chinese BERT released by Google, which tokenizes at the character level:

from transformers import BertTokenizer, BertModel
import torch

# Load the tokenizer and model
model_name = "bert-base-chinese"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

# Tokenize a Chinese sentence
text = "自然语言处理是人工智能的重要分支"
inputs = tokenizer(text, return_tensors="pt")

print(f"Text: {text}")
print(f"Token IDs: {inputs['input_ids']}")
print(f"Tokens: {tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])}")

# Example output:
# Text: 自然语言处理是人工智能的重要分支
# Tokens: ['[CLS]', '自', '然', '语', '言', '处', '理', '是', '人', '工', '智', '能', '的', '重', '要', '分', '支', '[SEP]']

Note: bert-base-chinese tokenizes at the character level, so every Chinese character becomes its own token. This can make word boundaries ambiguous for downstream tasks.

Whole Word Masking BERT

The whole-word-masking (wwm) models released by the HIT-iFLYTEK joint lab (HFL) mask entire words rather than individual characters during pretraining, and usually outperform plain character-masked models. The sketch below illustrates the masking difference, followed by a text-classification example built on the wwm checkpoint.
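
A rough illustration of the idea, not the actual pretraining code: whole-word masking replaces every character of a segmented word, whereas character-level masking may hit just one character inside a word. The snippet assumes jieba is installed for word segmentation.

import random
import jieba  # assumed available; used only to find word boundaries

sentence = "使用语言模型来预测下一个词的概率"
words = list(jieba.cut(sentence))  # e.g. ['使用', '语言模型', '来', '预测', ...]

# Whole-word masking: mask every character of one chosen word
target = random.choice(words)
wwm = []
for w in words:
    if w == target:
        wwm.extend(["[MASK]"] * len(w))
    else:
        wwm.extend(list(w))
print(" ".join(wwm))

# Character-level masking: mask a single character anywhere in the sentence
chars = list(sentence)
chars[random.randrange(len(chars))] = "[MASK]"
print(" ".join(chars))

The classification example below loads the wwm checkpoint through the standard BertForSequenceClassification API: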

from transformers import BertTokenizer, BertForSequenceClassification
import torch

model_name = "hfl/chinese-roberta-wwm-ext"
tokenizer = BertTokenizer.from_pretrained(model_name)
# Note: the classification head is newly initialized here; predictions are only
# meaningful after fine-tuning (see the complete example below)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Text classification example
text = "这家餐厅的服务态度非常好,菜品也很美味!"
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128)

with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    predicted_class = torch.argmax(logits, dim=-1).item()

print(f"Text: {text}")
print(f"Predicted class: {'positive' if predicted_class == 1 else 'negative'}")

Complete Chinese Text Classification Example

import torch
from torch.utils.data import Dataset, DataLoader
from torch.optim import AdamW  # transformers.AdamW is deprecated in recent versions
from transformers import BertTokenizer, BertForSequenceClassification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Custom dataset
class ChineseTextDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]

        encoding = self.tokenizer(
            text,
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )

        return {
            'input_ids': encoding['input_ids'].squeeze(),
            'attention_mask': encoding['attention_mask'].squeeze(),
            'labels': torch.tensor(label, dtype=torch.long)
        }

# Sample data
texts = [
    "这个产品质量很好,非常满意",
    "服务态度太差了,不会再来了",
    "物流很快,包装也很完整",
    "完全是假货,严重差评",
    "价格实惠,性价比很高",
    "收到的东西和描述完全不一样",
    # ... more data
]
labels = [1, 0, 1, 0, 1, 0]  # 1: positive, 0: negative

# Split into train and validation sets
train_texts, val_texts, train_labels, val_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)

# Load the model
model_name = "hfl/chinese-roberta-wwm-ext"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Build datasets and data loaders
train_dataset = ChineseTextDataset(train_texts, train_labels, tokenizer)
val_dataset = ChineseTextDataset(val_texts, val_labels, tokenizer)

train_loader = DataLoader(train_dataset, batch_size=4, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=4)

# Training setup
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
optimizer = AdamW(model.parameters(), lr=2e-5)
# Training loop
def train_epoch(model, dataloader, optimizer, device):
    model.train()
    total_loss = 0

    for batch in dataloader:
        optimizer.zero_grad()

        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss

        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    return total_loss / len(dataloader)

# Train the model
num_epochs = 3
for epoch in range(num_epochs):
    train_loss = train_epoch(model, train_loader, optimizer, device)
    print(f"Epoch {epoch + 1}/{num_epochs}, Loss: {train_loss:.4f}")

# Evaluation
def evaluate(model, dataloader, device):
    model.eval()
    predictions = []
    true_labels = []

    with torch.no_grad():
        for batch in dataloader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels']

            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            preds = torch.argmax(outputs.logits, dim=-1).cpu().numpy()

            predictions.extend(preds)
            true_labels.extend(labels.numpy())

    return accuracy_score(true_labels, predictions)

accuracy = evaluate(model, val_loader, device)
print(f"Validation accuracy: {accuracy:.4f}")

Chinese Named Entity Recognition

from transformers import BertTokenizerFast, BertForTokenClassification
import torch

model_name = "uer/chinese_roberta_L-12_H-768"
tokenizer = BertTokenizerFast.from_pretrained(model_name)

# NER labels (example)
label_list = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]
label2id = {label: i for i, label in enumerate(label_list)}
id2label = {i: label for i, label in enumerate(label_list)}

# Note: the token-classification head is newly initialized; fine-tune on labeled
# NER data before expecting sensible predictions
model = BertForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(label_list),
    id2label=id2label,
    label2id=label2id
)

# Entity recognition
text = "张三在北京清华大学工作"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=2)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
predictions = predictions[0].numpy()

for token, pred in zip(tokens, predictions):
    if token not in ["[CLS]", "[SEP]"]:
        label = id2label[pred]
        if label != "O":
            print(f"{token}: {label}")

Chinese Question Answering

from transformers import BertTokenizer, BertForQuestionAnswering
import torch

# Note: this base checkpoint has no trained QA head; swap in a model fine-tuned
# on a Chinese QA dataset to get meaningful answers
model_name = "uer/chinese_roberta_L-12_H-768"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForQuestionAnswering.from_pretrained(model_name)

def answer_question(question, context):
    # Encode the question together with the context
    inputs = tokenizer(
        question,
        context,
        return_tensors="pt",
        truncation=True,
        max_length=512
    )

    with torch.no_grad():
        outputs = model(**inputs)

    # Most likely start and end positions of the answer span
    answer_start = torch.argmax(outputs.start_logits)
    answer_end = torch.argmax(outputs.end_logits)

    # Decode the answer span
    answer_tokens = inputs["input_ids"][0][answer_start:answer_end + 1]
    answer = tokenizer.decode(answer_tokens, skip_special_tokens=True)

    return answer

# Example
question = "北京有什么著名景点?"
context = "北京是中国的首都,拥有众多著名景点,如故宫、长城、天安门广场等。故宫是世界上现存规模最大的宫殿建筑群。"

answer = answer_question(question, context)
print(f"Question: {question}")
print(f"Answer: {answer}")

Chinese Large Language Models

Qwen (Tongyi Qianwen)

An open-source Chinese-English bilingual large language model from Alibaba that supports a wide range of tasks:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "Qwen/Qwen-7B-Chat"

# Load the model in float16 to reduce GPU memory usage
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Single-turn generation
response, history = model.chat(
    tokenizer,
    "请介绍一下北京的历史文化",
    history=None
)
print(response)

# Multi-turn conversation
response, history = model.chat(
    tokenizer,
    "那北京有哪些著名景点?",
    history=history
)
print(response)
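
The model.chat helper above comes from Qwen's custom remote code. Newer Qwen checkpoints (for example Qwen/Qwen2-7B-Instruct; substitute whichever instruct checkpoint you actually use) go through the standard transformers chat-template API instead. A minimal sketch under that assumption:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "Qwen/Qwen2-7B-Instruct"  # assumed checkpoint; any chat-template model works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

# Build the prompt from chat messages using the model's chat template
messages = [{"role": "user", "content": "请介绍一下北京的历史文化"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))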

ChatGLM

An open-source Chinese-English bilingual dialogue model from Tsinghua University:

from transformers import AutoTokenizer, AutoModel
import torch

model_name = "THUDM/chatglm3-6b"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Chat
response, history = model.chat(
    tokenizer,
    "你好,请用中文介绍一下自己",
    history=[]
)
print(response)

Chinese Text Embeddings

Used for semantic similarity, text retrieval, and related tasks:

from transformers import BertTokenizer, BertModel
import torch
import torch.nn.functional as F

model_name = "shibing624/text2vec-base-chinese"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

def get_embedding(text):
    """Return a sentence embedding for the given text."""
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

    with torch.no_grad():
        outputs = model(**inputs)

    # Mean pooling over the token embeddings
    attention_mask = inputs["attention_mask"]
    last_hidden_state = outputs.last_hidden_state

    # Weighted average that ignores padding tokens
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(last_hidden_state.size()).float()
    sum_embeddings = torch.sum(last_hidden_state * input_mask_expanded, 1)
    sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    embedding = sum_embeddings / sum_mask

    # L2 normalization
    embedding = F.normalize(embedding, p=2, dim=1)

    return embedding

# Semantic similarity
text1 = "今天天气很好"
text2 = "今天阳光明媚"
text3 = "今天下雨了"

emb1 = get_embedding(text1)
emb2 = get_embedding(text2)
emb3 = get_embedding(text3)

# Cosine similarity
sim_12 = F.cosine_similarity(emb1, emb2)
sim_13 = F.cosine_similarity(emb1, emb3)

print(f"Similarity between '{text1}' and '{text2}': {sim_12.item():.4f}")
print(f"Similarity between '{text1}' and '{text3}': {sim_13.item():.4f}")

Choosing a Chinese Model

Choosing by Task

Task | Recommended model | Notes
Text classification | hfl/chinese-roberta-wwm-ext | Whole-word masking, consistently solid results
Named entity recognition | hfl/chinese-roberta-wwm-ext | Or a fine-tuned task-specific model
Sentiment analysis | uer/chinese_roberta_L-12_H-768 | Good general-purpose choice
Semantic similarity | shibing624/text2vec-base-chinese | Optimized for vector matching
Question answering | hfl/chinese-roberta-wwm-ext-large | When high-accuracy comprehension is required
Text generation | Qwen/Qwen-7B-Chat | Bilingual (Chinese/English), strong conversational ability
Dialogue systems | THUDM/chatglm3-6b | Lightweight, easy to deploy

Choosing by Model Size

Model size | Parameters | Recommended scenarios
Small | ~10M | Quick prototyping, resource-constrained environments
Base | ~100M | General tasks, balanced accuracy and speed
Large | ~300M+ | High-accuracy requirements, ample resources
LLM | 7B+ | Complex generation tasks, dialogue systems

Caveats for Processing Chinese

1. Encoding

Make sure text is read as UTF-8:

# Specify the encoding when reading a file
with open("chinese_text.txt", "r", encoding="utf-8") as f:
    text = f.read()

# When the encoding is unknown, try UTF-8 first and fall back to GBK
def safe_decode(raw_bytes):
    try:
        return raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        return raw_bytes.decode("gbk", errors="ignore")

2. Text Cleaning

Chinese text usually needs some preprocessing:

import re

def clean_chinese_text(text):
    """Clean Chinese text."""
    # Remove whitespace
    text = re.sub(r'\s+', '', text)

    # Remove special characters (keep Chinese characters, Latin letters, and digits)
    text = re.sub(r'[^\u4e00-\u9fa5a-zA-Z0-9]', '', text)

    # Traditional-to-simplified conversion (requires the opencc library)
    # import opencc
    # converter = opencc.OpenCC('t2s')
    # text = converter.convert(text)

    return text

3. Sequence Length

Chinese inputs often produce longer token sequences, so watch the maximum length:

# With character-level models, Chinese sequences are roughly 2-3x longer than English ones
# Prefer a larger max_length

tokenizer(text, max_length=256, truncation=True)  # 256 or more is a reasonable default for Chinese
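
Before settling on a max_length, it can help to look at the token-length distribution of your own corpus. A quick sketch, assuming `texts` holds your corpus (as in the classification example above):

# Inspect how long the tokenized sequences actually are
lengths = sorted(len(tokenizer(t)["input_ids"]) for t in texts)
print("median:", lengths[len(lengths) // 2])
print("p95:", lengths[int(len(lengths) * 0.95)])
# Pick a max_length that covers ~95% of examples without excessive padding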

4. Stopword Removal

# Common Chinese stopwords
stopwords = set([
    "的", "了", "在", "是", "我", "有", "和", "就", "不", "人", "都", "一", "一个", "上", "也", "很", "到", "说", "要", "去", "你", "会", "着", "没有", "看", "好", "自己", "这"
])

def remove_stopwords(text):
    # Requires word segmentation first
    import jieba
    words = jieba.cut(text)
    return " ".join([w for w in words if w not in stopwords])

Performance Optimization Tips

1. Use a Fast Tokenizer

# Fast tokenizers are implemented in Rust and are much faster
from transformers import BertTokenizerFast
tokenizer = BertTokenizerFast.from_pretrained(model_name)
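
To check the speedup on your own data, time both tokenizers over the same batch of texts. A rough sketch; exact numbers depend on hardware and text length:

import time
from transformers import BertTokenizer, BertTokenizerFast

model_name = "bert-base-chinese"
slow = BertTokenizer.from_pretrained(model_name)
fast = BertTokenizerFast.from_pretrained(model_name)

sample = ["自然语言处理是人工智能的重要分支"] * 1000

for name, tok in [("slow", slow), ("fast", fast)]:
    start = time.perf_counter()
    tok(sample, padding=True, truncation=True, max_length=128)
    print(f"{name}: {time.perf_counter() - start:.2f}s")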

2. Model Quantization

from transformers import AutoModelForSequenceClassification

# 8-bit quantization
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    load_in_8bit=True,
    device_map="auto"
)

# 4-bit quantization (requires bitsandbytes)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    load_in_4bit=True,
    device_map="auto"
)
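
Recent transformers versions prefer passing quantization settings through BitsAndBytesConfig rather than the bare load_in_8bit/load_in_4bit flags. A sketch, assuming bitsandbytes is installed:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # dtype used for compute on dequantized weights
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-7B-Chat",
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map="auto",
)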

3. Batched Inference

def batch_inference(texts, model, tokenizer, batch_size=32):
    results = []

    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        inputs = tokenizer(batch, padding=True, truncation=True, return_tensors="pt")
        # Move the batch to the same device as the model
        inputs = {k: v.to(model.device) for k, v in inputs.items()}

        with torch.no_grad():
            outputs = model(**inputs)

        results.extend(outputs.logits.argmax(-1).tolist())

    return results

References