Question Answering Systems
Question answering (QA) is one of the most practically valuable applications of natural language processing. Its goal is to find an accurate answer to a user's question in a knowledge base or document collection. Unlike a search engine, which returns relevant documents, a QA system returns the specific answer directly, greatly improving the efficiency of information access.
What Is a Question Answering System
A question answering system understands a natural language question and returns an accurate answer. Depending on where the answer comes from and how it is produced, QA systems fall into the following types:
Extractive QA: extracts the answer directly from a given context document; the answer must be a contiguous span of the original text. This is currently the most mature QA technique, with SQuAD as the representative dataset.
Abstractive QA: generates a complete answer from its understanding of the question. The answer need not be a span of the source text; it can be the model's own synthesis. This is closer to how humans answer, but technically harder.
Knowledge Base QA: queries a structured knowledge base (such as Freebase or Wikidata) for the answer. The natural language question must first be translated into a structured query, as the sketch after this list illustrates.
Open-Domain QA: no context document is given; the system must retrieve relevant documents from a large corpus and then locate the answer within them. This is the most challenging QA setting.
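To make the knowledge-base case concrete, the following minimal sketch answers "What is the capital of China?" by sending a hand-written SPARQL query to the public Wikidata endpoint. The identifiers wd:Q148 (China) and wdt:P36 (capital) are real Wikidata IDs; in a full KBQA system, the question-to-query translation done here by hand would be performed by a semantic parser.
import requests

# Hand-written SPARQL query, standing in for the output of an
# automatic question-to-query translation step.
SPARQL = """
SELECT ?capitalLabel WHERE {
  wd:Q148 wdt:P36 ?capital .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "zh,en". }
}
"""

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": SPARQL, "format": "json"},
    headers={"User-Agent": "qa-demo/0.1"},  # Wikidata asks clients to identify themselves
)
for row in response.json()["results"]["bindings"]:
    print(row["capitalLabel"]["value"])  # Beijing / 北京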
Architecture of a QA System
A complete QA system usually contains the following components:
Question → Question Understanding → Document Retrieval → Answer Extraction/Generation → Answer Ranking → Final Answer
Question understanding analyzes the intent and key information of the question. Document retrieval finds documents in the knowledge base that may contain the answer. Answer extraction or generation locates the concrete answer within the retrieved documents. Answer ranking orders the candidate answers and selects the most suitable one.
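These stages map naturally onto a small pipeline skeleton. The sketch below is purely illustrative: every function is a trivial stand-in (character-overlap scoring) for the real components introduced later in this chapter.
def understand(question):
    # Stand-in for question analysis: keep the raw text as the query.
    return question

def retrieve(query, corpus, top_k=3):
    # Stand-in for retrieval: rank documents by character overlap.
    ranked = sorted(corpus, key=lambda d: len(set(query) & set(d)), reverse=True)
    return ranked[:top_k]

def extract(question, doc):
    # Stand-in for a reader: return the document with an overlap score.
    return doc, len(set(question) & set(doc))

def answer_question(question, corpus):
    query = understand(question)                    # question understanding
    docs = retrieve(query, corpus)                  # document retrieval
    candidates = [extract(query, d) for d in docs]  # answer extraction
    best = max(candidates, key=lambda c: c[1])      # answer ranking
    return best[0]

print(answer_question("中国的首都是哪里?", ["北京是中国的首都。", "上海是经济中心。"]))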
Extractive QA
Extractive QA is the most common form of question answering: given a question and a context passage, the model extracts the answer from the context.
Basic Principle
An extractive QA model typically predicts the start and end positions of the answer. Suppose the context contains n tokens; the model outputs two probability distributions, a start-position distribution P_start and an end-position distribution P_end, each of dimension n. The answer is the text span between the predicted start and end positions.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class ExtractiveQAModel(nn.Module):
    """Extractive QA model: BERT encoder + span-prediction head."""
    def __init__(self, model_name="bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        # One linear layer producing two logits per token: start and end
        self.qa_outputs = nn.Linear(self.bert.config.hidden_size, 2)

    def forward(self, input_ids, attention_mask, token_type_ids=None,
                start_positions=None, end_positions=None):
        # Encode with BERT
        outputs = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids
        )
        sequence_output = outputs.last_hidden_state  # (batch_size, seq_len, hidden_size)
        # Predict start and end positions
        logits = self.qa_outputs(sequence_output)  # (batch_size, seq_len, 2)
        start_logits, end_logits = logits.split(1, dim=-1)
        start_logits = start_logits.squeeze(-1)  # (batch_size, seq_len)
        end_logits = end_logits.squeeze(-1)      # (batch_size, seq_len)
        # Compute the loss when gold positions are provided
        if start_positions is not None and end_positions is not None:
            loss_fct = nn.CrossEntropyLoss()
            start_loss = loss_fct(start_logits, start_positions)
            end_loss = loss_fct(end_logits, end_positions)
            total_loss = (start_loss + end_loss) / 2
            return {"loss": total_loss, "start_logits": start_logits, "end_logits": end_logits}
        return {"start_logits": start_logits, "end_logits": end_logits}

# Create the model. Note: the span-prediction head is randomly initialized,
# so the prediction below is meaningless until the model is fine-tuned;
# it only demonstrates the input/output mechanics.
model = ExtractiveQAModel()
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

# Example input (kept in Chinese, matching the Chinese checkpoint)
question = "北京是中国的什么?"
context = "北京是中国的首都,也是政治、文化和国际交流中心。"

# Tokenize question and context as a pair
inputs = tokenizer(
    question,
    context,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=512
)

# Predict
with torch.no_grad():
    outputs = model(**inputs)
    start_logits = outputs["start_logits"]
    end_logits = outputs["end_logits"]

# Take the argmax positions
start_index = torch.argmax(start_logits, dim=1).item()
end_index = torch.argmax(end_logits, dim=1).item()

# Decode the answer span
answer_tokens = inputs["input_ids"][0][start_index:end_index+1]
answer = tokenizer.decode(answer_tokens, skip_special_tokens=True)
print(f"Question: {question}")
print(f"Context: {context}")
print(f"Answer: {answer}")
Using the Hugging Face Pipeline
Hugging Face provides a convenient pipeline interface for question answering:
from transformers import pipeline

# Create a question-answering pipeline
qa_pipeline = pipeline("question-answering", model="deepset/roberta-base-squad2")

# Ask a question
question = "What is the capital of China?"
context = "Beijing is the capital of China. It is located in the northern part of the country."
result = qa_pipeline(question=question, context=context)
print(f"Question: {question}")
print(f"Answer: {result['answer']}")
print(f"Confidence: {result['score']:.4f}")
print(f"Answer span: {result['start']}-{result['end']}")
Chinese QA Models
from transformers import pipeline

# Use a Chinese extractive QA model
qa_pipeline = pipeline(
    "question-answering",
    model="uer/roberta-base-chinese-extractive-qa"
)

# Chinese QA examples (the context and questions stay in Chinese,
# since the model was trained on Chinese data)
context = """
自然语言处理(NLP)是人工智能领域的重要分支。
它致力于让计算机理解和处理人类语言。
NLP 技术广泛应用于机器翻译、情感分析、问答系统等领域。
深度学习的发展极大地推动了 NLP 的进步。
"""

questions = [
    "自然语言处理是什么?",
    "NLP 技术有哪些应用?",
    "什么推动了 NLP 的进步?"
]

for question in questions:
    result = qa_pipeline(question=question, context=context)
    print(f"Question: {question}")
    print(f"Answer: {result['answer']}")
    print(f"Confidence: {result['score']:.4f}")
    print()
Generative QA
Generative QA is not limited to extracting spans from the context; it can generate an answer based on its understanding.
Using a Seq2Seq Model
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load a generative QA model
model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def generate_answer(question, context, max_length=100):
    """Generative QA with a seq2seq model."""
    # Build the input in the "question: ... context: ..." format
    input_text = f"question: {question} context: {context}"
    # Tokenize
    inputs = tokenizer(
        input_text,
        return_tensors="pt",
        max_length=512,
        truncation=True
    )
    # Generate the answer with beam search
    outputs = model.generate(
        inputs["input_ids"],
        max_length=max_length,
        num_beams=4,
        early_stopping=True
    )
    # Decode
    answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return answer

# Example
question = "What is machine learning?"
context = "Machine learning is a subset of artificial intelligence that enables systems to learn from data."
answer = generate_answer(question, context)
print(f"Question: {question}")
print(f"Answer: {answer}")
Using Large Language Models
Large language models (such as GPT) can serve as powerful QA systems:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load an instruction-tuned model
model_name = "Qwen/Qwen2-1.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def chat_qa(question, context=None, max_new_tokens=256):
    """QA with a large language model via a plain-text prompt.

    For an Instruct model, tokenizer.apply_chat_template would normally be
    preferable; a raw prompt is used here to keep the example minimal.
    """
    # Build the prompt (Chinese prompt text, since the model is Chinese-capable)
    if context:
        prompt = f"根据以下信息回答问题。\n\n信息:{context}\n\n问题:{question}\n\n答案:"
    else:
        prompt = f"问题:{question}\n\n答案:"
    # Tokenize
    inputs = tokenizer(prompt, return_tensors="pt")
    # Generate
    outputs = model.generate(
        inputs["input_ids"],
        max_new_tokens=max_new_tokens,
        temperature=0.7,
        top_p=0.9,
        do_sample=True
    )
    # Decode
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Keep only the portion after the "答案:" (answer) marker
    if "答案:" in response:
        answer = response.split("答案:")[-1].strip()
    else:
        answer = response[len(prompt):].strip()
    return answer

# Example
question = "什么是深度学习?"
context = "深度学习是机器学习的一个子领域,使用多层神经网络来学习数据的表示。"
answer = chat_qa(question, context)
print(f"Answer: {answer}")
Fine-Tuning a QA Model
When a pretrained model does not handle questions in a particular domain well, it can be fine-tuned.
Preparing the Data
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
from transformers import TrainingArguments, Trainer, DefaultDataCollator

# Load the SQuAD dataset
dataset = load_dataset("squad")

# Inspect the data structure
print(dataset["train"][0])

# Load the tokenizer and model. SQuAD is an English dataset, so an English
# checkpoint is used; for Chinese data, substitute e.g. bert-base-chinese.
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

# Preprocessing: map character-level answer spans to token positions
def preprocess_function(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )
    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []
    for i, offset in enumerate(offset_mapping):
        answer = answers[i]
        start_char = answer["answer_start"][0]
        end_char = start_char + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)
        # Find where the context starts and ends in the token sequence
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while idx < len(sequence_ids) and sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1
        # If the answer is not fully inside the context window, label it (0, 0)
        if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise locate the start and end token positions of the answer
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)
            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)
    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

# Preprocess the dataset
tokenized_dataset = dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=dataset["train"].column_names
)

# Data collator
data_collator = DefaultDataCollator()
Training the Model
# Training arguments
training_args = TrainingArguments(
    output_dir="./qa_model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=False,
)

# Create the trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"].select(range(1000)),  # subset for a quick run
    eval_dataset=tokenized_dataset["validation"].select(range(100)),
    tokenizer=tokenizer,
    data_collator=data_collator,
)

# Train
trainer.train()

# Save the model
trainer.save_model("./qa_model_final")
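Once training finishes, the saved checkpoint can be loaded back into a pipeline for inference. A brief sketch (assuming the tokenizer was saved alongside the model, which Trainer.save_model does when a tokenizer is passed to the Trainer):
from transformers import pipeline

qa_pipeline = pipeline(
    "question-answering",
    model="./qa_model_final",
    tokenizer="./qa_model_final"
)
result = qa_pipeline(
    question="What is the capital of France?",
    context="Paris is the capital and largest city of France."
)
print(result["answer"], f"{result['score']:.4f}")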
Open-Domain QA
Open-domain QA must answer questions without a given context, first retrieving relevant documents from a large corpus and then answering from them.
Retriever-Reader Architecture
Open-domain QA typically uses a retriever-reader architecture:
- Retriever: finds relevant documents in the corpus
- Reader: extracts the answer from the retrieved documents
from transformers import DPRQuestionEncoder, DPRContextEncoder, DPRQuestionEncoderTokenizer, DPRContextEncoderTokenizer
from transformers import DPRReader, DPRReaderTokenizer
import torch

class OpenDomainQA:
    """Open-domain QA system with a DPR retriever and reader."""
    def __init__(self):
        # Load the retriever (question and context encoders)
        self.question_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
        self.question_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
        self.context_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
        self.context_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
        # Load the reader
        self.reader_tokenizer = DPRReaderTokenizer.from_pretrained("facebook/dpr-reader-single-nq-base")
        self.reader = DPRReader.from_pretrained("facebook/dpr-reader-single-nq-base")
        # Document store (toy in-memory version)
        self.documents = []
        self.document_embeddings = []

    def index_documents(self, documents):
        """Index the document collection."""
        self.documents = documents
        # Encode every document
        for doc in documents:
            inputs = self.context_tokenizer(doc, return_tensors="pt", max_length=512, truncation=True)
            with torch.no_grad():
                embedding = self.context_encoder(**inputs).pooler_output
            self.document_embeddings.append(embedding)
        self.document_embeddings = torch.cat(self.document_embeddings, dim=0)

    def retrieve(self, question, top_k=5):
        """Retrieve relevant documents."""
        # Encode the question
        inputs = self.question_tokenizer(question, return_tensors="pt", max_length=512, truncation=True)
        with torch.no_grad():
            question_embedding = self.question_encoder(**inputs).pooler_output
        # Score documents by inner product
        scores = torch.matmul(question_embedding, self.document_embeddings.T).squeeze()
        # Take the top-k documents
        top_indices = torch.topk(scores, min(top_k, len(self.documents))).indices.tolist()
        return [self.documents[i] for i in top_indices], [scores[i].item() for i in top_indices]

    def answer(self, question, top_k=5):
        """Answer a question."""
        # Retrieve relevant documents
        retrieved_docs, scores = self.retrieve(question, top_k)
        # Run the reader over each document and keep the best span
        best_answer = None
        best_score = float("-inf")  # logit sums can be negative, so start at -inf
        for doc in retrieved_docs:
            inputs = self.reader_tokenizer(
                questions=[question],
                titles=[""],
                texts=[doc],
                return_tensors="pt",
                max_length=512,
                truncation=True
            )
            with torch.no_grad():
                outputs = self.reader(**inputs)
            # Pick the best span positions
            start_idx = outputs.start_logits.argmax().item()
            end_idx = outputs.end_logits.argmax().item()
            # Score the span
            score = (outputs.start_logits[0, start_idx] + outputs.end_logits[0, end_idx]).item()
            if score > best_score:
                best_score = score
                # Decode the answer span
                tokens = inputs["input_ids"][0][start_idx:end_idx+1]
                best_answer = self.reader_tokenizer.decode(tokens, skip_special_tokens=True)
        return best_answer, retrieved_docs

# Usage. The DPR checkpoints above were trained on English data
# (Natural Questions), so the demo corpus is in English.
qa_system = OpenDomainQA()

# Index the documents
documents = [
    "Beijing is the capital of China, located in the north of the country.",
    "Shanghai is China's largest city and its economic center.",
    "Shenzhen is at the forefront of China's reform and opening-up, with a strong tech industry.",
    "Hangzhou is home to Alibaba's headquarters and a thriving e-commerce industry.",
    "Chengdu is the capital of Sichuan Province, famous for its food and pandas."
]
qa_system.index_documents(documents)

# Ask a question
question = "What is the capital of China?"
answer, sources = qa_system.answer(question)
print(f"Question: {question}")
print(f"Answer: {answer}")
print(f"Source document: {sources[0]}")
Retrieval with BM25
Besides neural retrievers, the classic BM25 algorithm remains a widely used retrieval method:
from rank_bm25 import BM25Okapi
import jieba

class BM25Retriever:
    """BM25 retriever (jieba is used for Chinese word segmentation)."""
    def __init__(self, documents):
        # Segment the documents into words
        self.documents = documents
        self.tokenized_docs = [list(jieba.cut(doc)) for doc in documents]
        self.bm25 = BM25Okapi(self.tokenized_docs)

    def retrieve(self, query, top_k=5):
        """Retrieve relevant documents."""
        tokenized_query = list(jieba.cut(query))
        scores = self.bm25.get_scores(tokenized_query)
        top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
        return [self.documents[i] for i in top_indices], [scores[i] for i in top_indices]

# Usage (Chinese documents, matching the jieba segmenter)
documents = [
    "Python 是一种高级编程语言,由 Guido van Rossum 创建。",
    "Java 是一种面向对象的编程语言,由 Sun Microsystems 开发。",
    "机器学习是人工智能的一个分支,使用算法从数据中学习。",
    "深度学习使用多层神经网络进行特征学习。",
    "自然语言处理让计算机理解和生成人类语言。"
]
retriever = BM25Retriever(documents)
query = "什么是 Python?"
results, scores = retriever.retrieve(query, top_k=3)
print(f"Query: {query}")
for i, (doc, score) in enumerate(zip(results, scores)):
    print(f"{i+1}. [score: {score:.4f}] {doc}")
Evaluating QA Systems
Evaluation Metrics
from collections import Counter

def compute_exact_match(prediction, ground_truth):
    """Exact match: 1 if the normalized strings are identical, else 0."""
    return int(prediction.strip().lower() == ground_truth.strip().lower())

def compute_f1(prediction, ground_truth):
    """Token-overlap F1 score."""
    pred_tokens = prediction.lower().split()
    gt_tokens = ground_truth.lower().split()
    common = Counter(pred_tokens) & Counter(gt_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gt_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return f1
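# Quick sanity check of the two metrics (illustrative values)
print(compute_exact_match("Beijing", "beijing"))  # 1
print(f"{compute_f1('the capital of China', 'capital of China'):.2f}")  # 0.86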
# Using the Hugging Face evaluate library
import evaluate

# Load the SQuAD metric
squad_metric = evaluate.load("squad")

predictions = [
    {"id": "1", "prediction_text": "北京"},
    {"id": "2", "prediction_text": "上海"}
]
references = [
    {"id": "1", "answers": {"text": ["北京"], "answer_start": [0]}},
    {"id": "2", "answers": {"text": ["上海", "上海市"], "answer_start": [0, 0]}}
]

results = squad_metric.compute(predictions=predictions, references=references)
print(f"Exact Match: {results['exact_match']:.2f}")
print(f"F1 Score: {results['f1']:.2f}")
Practical Applications
A Knowledge-Base QA System
from transformers import pipeline
import json

class KnowledgeBaseQA:
    """Knowledge-base QA system: keyword retrieval + extractive reader."""
    def __init__(self, knowledge_file):
        # Load the knowledge base
        with open(knowledge_file, 'r', encoding='utf-8') as f:
            self.knowledge = json.load(f)
        # Load the QA model
        self.qa_model = pipeline("question-answering", model="uer/roberta-base-chinese-extractive-qa")
        # Build the document store
        self.documents = []
        self.doc_metadata = []
        for item in self.knowledge:
            self.documents.append(item["content"])
            self.doc_metadata.append(item)

    def search_relevant_docs(self, question, top_k=3):
        """Search for relevant documents (simple character-overlap matching)."""
        # A real application would use a proper retriever such as BM25 or DPR
        scores = []
        question_keywords = set(question)
        for doc in self.documents:
            score = sum(1 for kw in question_keywords if kw in doc)
            scores.append(score)
        top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
        return [self.documents[i] for i in top_indices]

    def answer(self, question):
        """Answer a question."""
        # Retrieve relevant documents
        relevant_docs = self.search_relevant_docs(question)
        # Read each document and keep the most confident answer
        best_answer = None
        best_score = 0
        best_source = None
        for doc in relevant_docs:
            result = self.qa_model(question=question, context=doc)
            if result["score"] > best_score:
                best_score = result["score"]
                best_answer = result["answer"]
                best_source = doc
        return {
            "answer": best_answer,
            "confidence": best_score,
            "source": best_source
        }

# Sample knowledge base (in practice, loaded from a file)
sample_knowledge = [
    {
        "title": "公司简介",
        "content": "ABC公司成立于2010年,是一家专注于人工智能技术研发的高科技企业。公司总部位于北京,在上海和深圳设有分公司。"
    },
    {
        "title": "产品介绍",
        "content": "公司主要产品包括智能客服系统、智能问答平台和知识图谱构建工具。智能客服系统已服务超过1000家企业客户。"
    },
    {
        "title": "联系方式",
        "content": "公司客服电话:400-123-4567,邮箱:[email protected],地址:北京市海淀区中关村大街1号。"
    }
]
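# Write the sample knowledge base to disk so the class above can load it
# ("knowledge.json" is an illustrative file name, matching the usage below)
with open("knowledge.json", "w", encoding="utf-8") as f:
    json.dump(sample_knowledge, f, ensure_ascii=False, indent=2)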
# Usage (with the knowledge base written to disk as above)
qa_system = KnowledgeBaseQA("knowledge.json")
result = qa_system.answer("公司的客服电话是多少?")
print(f"Answer: {result['answer']}")
A Multi-Turn QA System
from transformers import AutoModelForCausalLM, AutoTokenizer

class ConversationalQA:
    """Multi-turn QA system."""
    def __init__(self, model_name="Qwen/Qwen2-1.5B-Instruct"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.conversation_history = []

    def chat(self, question, context=None):
        """Multi-turn QA. The history is flattened into a plain-text prompt;
        for an Instruct model, tokenizer.apply_chat_template would be the
        more idiomatic choice."""
        # Build the prompt from the conversation history (Chinese role labels,
        # since the model is Chinese-capable)
        if context:
            prompt = f"根据以下信息回答问题:\n{context}\n\n"
        else:
            prompt = ""
        for turn in self.conversation_history:
            prompt += f"用户:{turn['question']}\n助手:{turn['answer']}\n"
        prompt += f"用户:{question}\n助手:"
        # Generate the reply
        inputs = self.tokenizer(prompt, return_tensors="pt")
        outputs = self.model.generate(
            inputs["input_ids"],
            max_new_tokens=256,
            temperature=0.7,
            top_p=0.9,
            do_sample=True
        )
        response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        answer = response[len(prompt):].strip()
        # Update the conversation history
        self.conversation_history.append({"question": question, "answer": answer})
        return answer

    def reset(self):
        """Reset the conversation."""
        self.conversation_history = []

# Usage
# qa = ConversationalQA()
# print(qa.chat("什么是机器学习?"))
# print(qa.chat("它有哪些应用?"))  # remembers the previous turn
Challenges and Solutions
Answers Not in the Context
When a question cannot be answered from the given context, the model should be able to recognize this and return "unanswerable".
def answer_with_unanswerable(question, context, model, tokenizer, threshold=0.5):
    """Handle unanswerable questions via the [CLS] score."""
    inputs = tokenizer(question, context, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    start_logits = outputs.start_logits
    end_logits = outputs.end_logits
    # Score of the [CLS] token (conventionally used to represent "no answer")
    cls_score = start_logits[0, 0] + end_logits[0, 0]
    # Score of the best non-[CLS] answer span
    start_idx = start_logits[0, 1:].argmax() + 1
    end_idx = end_logits[0, 1:].argmax() + 1
    answer_score = start_logits[0, start_idx] + end_logits[0, end_idx]
    # If [CLS] scores higher (or the answer score falls below the raw-logit
    # threshold, which must be tuned per model), treat the question as unanswerable
    if cls_score > answer_score or answer_score < threshold:
        return "The question cannot be answered from the provided context."
    # Otherwise decode and return the answer span
    answer_tokens = inputs["input_ids"][0][start_idx:end_idx+1]
    return tokenizer.decode(answer_tokens, skip_special_tokens=True)
Handling Long Documents
Documents longer than the model's maximum input length must be processed in chunks:
def answer_from_long_document(question, document, model, tokenizer, max_length=512, stride=128):
    """Answer a question from a long document by character-level chunking."""
    # Split the document into overlapping chunks (character-based, so only
    # an approximation of the model's token limit)
    chunks = []
    for i in range(0, len(document), max_length - stride):
        chunk = document[i:i + max_length]
        if chunk:
            chunks.append(chunk)
    # Look for the answer in every chunk
    best_answer = None
    best_score = float("-inf")  # logit sums can be negative, so start at -inf
    for chunk in chunks:
        inputs = tokenizer(question, chunk, return_tensors="pt", truncation=True, max_length=max_length)
        with torch.no_grad():
            outputs = model(**inputs)
        start_idx = outputs.start_logits.argmax().item()
        end_idx = outputs.end_logits.argmax().item()
        # Score the candidate span
        score = (outputs.start_logits[0, start_idx] + outputs.end_logits[0, end_idx]).item()
        if score > best_score:
            best_score = score
            answer_tokens = inputs["input_ids"][0][start_idx:end_idx+1]
            best_answer = tokenizer.decode(answer_tokens, skip_special_tokens=True)
    return best_answer
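Character-level splitting ignores tokenization. Hugging Face fast tokenizers can build the sliding window natively via return_overflowing_tokens, which is usually the more robust route. A brief sketch, reusing the question, document, model, and tokenizer names from the function above:
inputs = tokenizer(
    question,
    document,
    max_length=384,
    truncation="only_second",       # truncate only the context, never the question
    stride=128,                     # token overlap between consecutive windows
    return_overflowing_tokens=True,
    padding=True,
    return_tensors="pt",
)
inputs.pop("overflow_to_sample_mapping")  # bookkeeping key, not a model input
with torch.no_grad():
    outputs = model(**inputs)  # one row of logits per overlapping window
# Pick the best-scoring span across all windows, as in the loop above.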
Summary
Question answering is one of the most practically valuable NLP applications. This chapter covered:
- QA types: extractive, generative, knowledge-base, and open-domain QA
- Extractive QA: predicting the start and end positions of the answer
- Generative QA: generating answers with seq2seq models or large language models
- Open-domain QA: the retriever-reader architecture
- Fine-tuning: adapting a model on domain-specific data
- Evaluation: exact match and F1 score
- Applications: knowledge-base QA and multi-turn QA
QA technology is evolving rapidly; combining large language models with retrieval-augmented generation (RAG) makes it possible to build more powerful and reliable QA applications.
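To make the RAG idea concrete, here is a minimal sketch that chains the BM25Retriever and the chat_qa helper defined earlier in this chapter (both assumed to be in scope): retrieved passages are placed into the prompt as context for the language model.
def rag_answer(question, retriever, top_k=3):
    """Minimal RAG loop: retrieve passages, then let the LLM answer over them."""
    passages, _ = retriever.retrieve(question, top_k=top_k)
    context = "\n".join(passages)
    return chat_qa(question, context=context)

# Example (reusing the BM25 retriever built earlier)
# print(rag_answer("什么是机器学习?", retriever))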