Question Answering Systems

Question answering (QA) is one of the most practically valuable applications in natural language processing. Its goal is to find accurate answers to a user's questions in a knowledge base or document collection. Unlike a search engine, which returns relevant documents, a QA system returns the specific answer directly, greatly improving the efficiency of information access.

What Is a Question Answering System

A question answering system understands natural language questions and returns accurate answers. Based on where the answer comes from and how it is produced, QA systems fall into several types:

Extractive QA: extracts the answer directly from a given context document; the answer must be a contiguous span of the original text. This is currently the most mature QA technique; the representative dataset is SQuAD.

Abstractive QA: generates a complete answer based on its understanding of the question. The answer need not be a span of the source text; it can be the model's own synthesized phrasing. This is closer to how humans answer, but is technically harder.

Knowledge Base QA: queries a structured knowledge base (such as Freebase or Wikidata) for the answer. The natural language question must first be converted into a structured query.

Open-Domain QA: no context document is given; relevant documents must be retrieved from a large corpus and the answer found within them. This is the most challenging QA task.
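The span constraint that separates the extractive setting from the abstractive one can be made concrete with a trivial check (a toy illustration, not a model; `is_extractive_answer` is a hypothetical helper):

```python
def is_extractive_answer(answer, context):
    """An extractive answer must appear verbatim as a contiguous span of the context."""
    return answer in context

context = "Beijing is the capital of China."
print(is_extractive_answer("the capital of China", context))          # a valid extractive span
print(is_extractive_answer("China's capital city is Beijing", context))  # an abstractive rephrasing
```

The second answer conveys the same information but is not a substring of the context, so only an abstractive system could produce it.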

Architecture of a QA System

A complete QA system usually contains the following components:

Question → Question Understanding → Document Retrieval → Answer Extraction/Generation → Answer Ranking → Final Answer

Question understanding analyzes the intent and key information of the question. Document retrieval finds documents in the knowledge base that may contain the answer. Answer extraction or generation locates the specific answer within the relevant documents. Answer ranking orders multiple candidate answers and selects the best one.
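The flow above can be sketched end to end with toy components, where simple keyword-overlap scoring stands in for the real question-understanding, retrieval, and extraction models (all function names here are illustrative):

```python
import re

def tokenize(text):
    """Lowercase word set: a crude stand-in for question understanding."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(keywords, documents, top_k=2):
    """Document retrieval: rank documents by keyword overlap."""
    ranked = sorted(documents, key=lambda d: len(keywords & tokenize(d)), reverse=True)
    return ranked[:top_k]

def extract(keywords, document):
    """Answer extraction: pick the sentence with the most keyword hits."""
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    return max(sentences, key=lambda s: len(keywords & tokenize(s)))

def answer(question, documents):
    """Full pipeline: understand -> retrieve -> extract -> rank candidates."""
    keywords = tokenize(question)
    candidates = [extract(keywords, d) for d in retrieve(keywords, documents)]
    return max(candidates, key=lambda c: len(keywords & tokenize(c)))

docs = [
    "Beijing is the capital of China. It is in the north.",
    "Shanghai is the largest city in China.",
]
print(answer("What is the capital of China?", docs))  # Beijing is the capital of China
```

In a real system each stage would be a learned model; the point here is only the shape of the pipeline and how candidates flow through it.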

Extractive QA

Extractive QA is the most common form of question answering: given a question and a passage of context, the model extracts the answer from the context.

Basic Principle

An extractive QA model typically predicts the start and end positions of the answer. Suppose the context has n tokens; the model outputs two probability distributions, a start-position distribution P_start and an end-position distribution P_end, each n-dimensional. The answer is the text span from the start position to the end position.

import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class ExtractiveQAModel(nn.Module):
    """Extractive QA model"""
    def __init__(self, model_name="bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.qa_outputs = nn.Linear(self.bert.config.hidden_size, 2)

    def forward(self, input_ids, attention_mask, token_type_ids=None, start_positions=None, end_positions=None):
        # Encode with BERT
        outputs = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids
        )

        sequence_output = outputs.last_hidden_state  # (batch_size, seq_len, hidden_size)

        # Predict start and end positions
        logits = self.qa_outputs(sequence_output)  # (batch_size, seq_len, 2)
        start_logits, end_logits = logits.split(1, dim=-1)
        start_logits = start_logits.squeeze(-1)  # (batch_size, seq_len)
        end_logits = end_logits.squeeze(-1)  # (batch_size, seq_len)

        # Compute the loss when labels are provided
        if start_positions is not None and end_positions is not None:
            loss_fct = nn.CrossEntropyLoss()
            start_loss = loss_fct(start_logits, start_positions)
            end_loss = loss_fct(end_logits, end_positions)
            total_loss = (start_loss + end_loss) / 2
            return {"loss": total_loss, "start_logits": start_logits, "end_logits": end_logits}

        return {"start_logits": start_logits, "end_logits": end_logits}

# Create the model
model = ExtractiveQAModel()
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

# Example input (kept in Chinese to match the Chinese BERT model)
question = "北京是中国的什么?"
context = "北京是中国的首都,也是政治、文化和国际交流中心。"

# Tokenize the question/context pair
inputs = tokenizer(
    question,
    context,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=512
)

# Predict
with torch.no_grad():
    outputs = model(**inputs)
    start_logits = outputs["start_logits"]
    end_logits = outputs["end_logits"]

# Get the predicted positions
start_index = torch.argmax(start_logits, dim=1).item()
end_index = torch.argmax(end_logits, dim=1).item()

# Decode the answer span
answer_tokens = inputs["input_ids"][0][start_index:end_index+1]
answer = tokenizer.decode(answer_tokens, skip_special_tokens=True)

print(f"Question: {question}")
print(f"Context: {context}")
print(f"Answer: {answer}")

Using the Hugging Face Pipeline

Hugging Face provides a convenient pipeline interface for question answering:

from transformers import pipeline

# Create a question answering pipeline
qa_pipeline = pipeline("question-answering", model="deepset/roberta-base-squad2")

# Ask a question
question = "What is the capital of China?"
context = "Beijing is the capital of China. It is located in the northern part of the country."

result = qa_pipeline(question=question, context=context)

print(f"Question: {question}")
print(f"Answer: {result['answer']}")
print(f"Confidence: {result['score']:.4f}")
print(f"Answer span: {result['start']}-{result['end']}")

Chinese QA Models

from transformers import pipeline

# Use a Chinese extractive QA model
qa_pipeline = pipeline(
    "question-answering",
    model="uer/roberta-base-chinese-extractive-qa"
)

# Chinese QA example (context and questions are in Chinese to match the model)
context = """
自然语言处理(NLP)是人工智能领域的重要分支。
它致力于让计算机理解和处理人类语言。
NLP 技术广泛应用于机器翻译、情感分析、问答系统等领域。
深度学习的发展极大地推动了 NLP 的进步。
"""

questions = [
    "自然语言处理是什么?",
    "NLP 技术有哪些应用?",
    "什么推动了 NLP 的进步?"
]

for question in questions:
    result = qa_pipeline(question=question, context=context)
    print(f"Question: {question}")
    print(f"Answer: {result['answer']}")
    print(f"Confidence: {result['score']:.4f}")
    print()

Generative QA

Generative QA is not restricted to extracting spans from the context; it can generate an answer based on its understanding.

Using a Seq2Seq Model

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load a generative QA model
model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def generate_answer(question, context, max_length=100):
    """Generative QA"""
    # Build the input text
    input_text = f"question: {question} context: {context}"

    # Tokenize
    inputs = tokenizer(
        input_text,
        return_tensors="pt",
        max_length=512,
        truncation=True
    )

    # Generate the answer
    outputs = model.generate(
        inputs["input_ids"],
        max_length=max_length,
        num_beams=4,
        early_stopping=True
    )

    # Decode
    answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return answer

# Example
question = "What is machine learning?"
context = "Machine learning is a subset of artificial intelligence that enables systems to learn from data."

answer = generate_answer(question, context)
print(f"Question: {question}")
print(f"Answer: {answer}")

Using Large Language Models

Large language models (such as GPT) can serve as powerful QA systems:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load an instruction-tuned chat model
model_name = "Qwen/Qwen2-1.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def chat_qa(question, context=None, max_new_tokens=256):
    """QA with a large language model"""
    # Build the prompt (kept in Chinese to match the Chinese instruction model;
    # the template reads "answer the question based on the following information")
    if context:
        prompt = f"根据以下信息回答问题。\n\n信息:{context}\n\n问题:{question}\n\n答案:"
    else:
        prompt = f"问题:{question}\n\n答案:"

    # Tokenize
    inputs = tokenizer(prompt, return_tensors="pt")

    # Generate
    outputs = model.generate(
        inputs["input_ids"],
        max_new_tokens=max_new_tokens,
        temperature=0.7,
        top_p=0.9,
        do_sample=True
    )

    # Decode
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Extract the answer part ("答案:" marks the answer in the prompt template)
    if "答案:" in response:
        answer = response.split("答案:")[-1].strip()
    else:
        answer = response[len(prompt):].strip()

    return answer

# Example
question = "什么是深度学习?"
context = "深度学习是机器学习的一个子领域,使用多层神经网络来学习数据的表示。"

answer = chat_qa(question, context)
print(f"Answer: {answer}")

Fine-Tuning a QA Model

When a pretrained model does not handle questions in a specific domain well, it can be fine-tuned.

Preparing the Data

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
from transformers import TrainingArguments, Trainer, DefaultDataCollator

# Load the SQuAD dataset
dataset = load_dataset("squad")

# Inspect the data structure
print(dataset["train"][0])

# Load the tokenizer and model
# (SQuAD is an English dataset, so use an English BERT checkpoint)
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

# Preprocessing function
def preprocess_function(examples):
    questions = [q.strip() for q in examples["question"]]

    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        answer = answers[i]
        start_char = answer["answer_start"][0]
        end_char = start_char + len(answer["text"][0])

        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context within the sequence
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while idx < len(sequence_ids) and sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label it (0, 0)
        if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise find the start and end token positions of the answer
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

# Preprocess the dataset
tokenized_dataset = dataset.map(
    preprocess_function,
    batched=True,
    remove_columns=dataset["train"].column_names
)

# Data collator
data_collator = DefaultDataCollator()

Training the Model

# Training arguments
training_args = TrainingArguments(
    output_dir="./qa_model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=False,
)

# Create the trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"].select(range(1000)),  # use a subset of the data
    eval_dataset=tokenized_dataset["validation"].select(range(100)),
    tokenizer=tokenizer,
    data_collator=data_collator,
)

# Start training
trainer.train()

# Save the model
trainer.save_model("./qa_model_final")

Open-Domain QA

Open-domain QA must retrieve relevant documents from a large document collection and answer the question without being given any context.

Retriever-Reader Architecture

Open-domain QA typically uses a retriever-reader architecture:

  1. Retriever: retrieves relevant documents from the document collection
  2. Reader: extracts the answer from the retrieved documents

from transformers import DPRQuestionEncoder, DPRContextEncoder, DPRQuestionEncoderTokenizer, DPRContextEncoderTokenizer
from transformers import DPRReader, DPRReaderTokenizer
import torch

class OpenDomainQA:
    """Open-domain QA system"""

    def __init__(self):
        # Load the retriever
        # (these DPR checkpoints are trained on English Natural Questions;
        # the Chinese example documents below only illustrate the workflow)
        self.question_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
        self.question_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
        self.context_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
        self.context_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

        # Load the reader
        self.reader_tokenizer = DPRReaderTokenizer.from_pretrained("facebook/dpr-reader-single-nq-base")
        self.reader = DPRReader.from_pretrained("facebook/dpr-reader-single-nq-base")

        # Document store (example)
        self.documents = []
        self.document_embeddings = []

    def index_documents(self, documents):
        """Index the document collection"""
        self.documents = documents
        self.document_embeddings = []

        # Encode every document
        for doc in documents:
            inputs = self.context_tokenizer(doc, return_tensors="pt", max_length=512, truncation=True)
            with torch.no_grad():
                embedding = self.context_encoder(**inputs).pooler_output
            self.document_embeddings.append(embedding)

        self.document_embeddings = torch.cat(self.document_embeddings, dim=0)

    def retrieve(self, question, top_k=5):
        """Retrieve relevant documents"""
        # Encode the question
        inputs = self.question_tokenizer(question, return_tensors="pt", max_length=512, truncation=True)
        with torch.no_grad():
            question_embedding = self.question_encoder(**inputs).pooler_output

        # Compute similarity scores
        scores = torch.matmul(question_embedding, self.document_embeddings.T).squeeze()

        # Take the top-k documents
        top_indices = torch.topk(scores, min(top_k, len(self.documents))).indices.tolist()

        return [self.documents[i] for i in top_indices], [scores[i].item() for i in top_indices]

    def answer(self, question, top_k=5):
        """Answer the question"""
        # Retrieve relevant documents
        retrieved_docs, scores = self.retrieve(question, top_k)

        # Use the reader to extract an answer from each document
        best_answer = None
        best_score = float("-inf")  # logit sums can be negative, so don't start at -1

        for doc in retrieved_docs:
            inputs = self.reader_tokenizer(
                questions=[question],
                titles=[""],
                texts=[doc],
                return_tensors="pt",
                max_length=512,
                truncation=True
            )

            with torch.no_grad():
                outputs = self.reader(**inputs)

            # Get the predicted answer span
            start_idx = outputs.start_logits.argmax().item()
            end_idx = outputs.end_logits.argmax().item()

            # Score the answer
            score = (outputs.start_logits[0, start_idx] + outputs.end_logits[0, end_idx]).item()

            if score > best_score:
                best_score = score
                # Decode the answer
                tokens = inputs["input_ids"][0][start_idx:end_idx+1]
                best_answer = self.reader_tokenizer.decode(tokens, skip_special_tokens=True)

        return best_answer, retrieved_docs

# Usage example
qa_system = OpenDomainQA()

# Index documents
documents = [
    "北京是中国的首都,位于中国北部。",
    "上海是中国最大的城市,是经济中心。",
    "深圳是中国改革开放的前沿城市,科技产业发达。",
    "杭州是阿里巴巴的总部所在地,电子商务产业发达。",
    "成都是四川省的省会,以美食和熊猫闻名。"
]

qa_system.index_documents(documents)

# Ask a question
question = "中国的首都是哪里?"
answer, sources = qa_system.answer(question)

print(f"Question: {question}")
print(f"Answer: {answer}")
print(f"Source document: {sources[0]}")

Retrieval with BM25

Besides neural retrievers, the classic BM25 algorithm is also a widely used retrieval method:

from rank_bm25 import BM25Okapi
import jieba

class BM25Retriever:
    """BM25 retriever"""

    def __init__(self, documents):
        # Tokenize with jieba (Chinese word segmentation)
        self.documents = documents
        self.tokenized_docs = [list(jieba.cut(doc)) for doc in documents]
        self.bm25 = BM25Okapi(self.tokenized_docs)

    def retrieve(self, query, top_k=5):
        """Retrieve relevant documents"""
        tokenized_query = list(jieba.cut(query))
        scores = self.bm25.get_scores(tokenized_query)
        top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
        return [self.documents[i] for i in top_indices], [scores[i] for i in top_indices]

# Usage example
documents = [
    "Python 是一种高级编程语言,由 Guido van Rossum 创建。",
    "Java 是一种面向对象的编程语言,由 Sun Microsystems 开发。",
    "机器学习是人工智能的一个分支,使用算法从数据中学习。",
    "深度学习使用多层神经网络进行特征学习。",
    "自然语言处理让计算机理解和生成人类语言。"
]

retriever = BM25Retriever(documents)

query = "什么是 Python?"
results, scores = retriever.retrieve(query, top_k=3)

print(f"Query: {query}")
for i, (doc, score) in enumerate(zip(results, scores)):
    print(f"{i+1}. [score: {score:.4f}] {doc}")

Evaluating QA Systems

Evaluation Metrics

from collections import Counter

def compute_exact_match(prediction, ground_truth):
    """Exact match: 1 if the normalized strings are identical, else 0"""
    return int(prediction.strip().lower() == ground_truth.strip().lower())

def compute_f1(prediction, ground_truth):
    """Token-level F1 score"""
    pred_tokens = prediction.lower().split()
    gt_tokens = ground_truth.lower().split()

    common = Counter(pred_tokens) & Counter(gt_tokens)
    num_same = sum(common.values())

    if num_same == 0:
        return 0

    precision = num_same / len(pred_tokens)
    recall = num_same / len(gt_tokens)
    f1 = 2 * precision * recall / (precision + recall)

    return f1

# Using the Hugging Face evaluate library
import evaluate

# Load the SQuAD metric
squad_metric = evaluate.load("squad")

predictions = [
    {"id": "1", "prediction_text": "北京"},
    {"id": "2", "prediction_text": "上海"}
]

references = [
    {"id": "1", "answers": {"text": ["北京"], "answer_start": [0]}},
    {"id": "2", "answers": {"text": ["上海", "上海市"], "answer_start": [0, 0]}}
]

results = squad_metric.compute(predictions=predictions, references=references)
print(f"Exact Match: {results['exact_match']:.2f}")
print(f"F1 Score: {results['f1']:.2f}")

Practical Applications

A Knowledge Base QA System

from transformers import pipeline
import json

class KnowledgeBaseQA:
    """Knowledge base QA system"""

    def __init__(self, knowledge_file):
        # Load the knowledge base
        with open(knowledge_file, 'r', encoding='utf-8') as f:
            self.knowledge = json.load(f)

        # Load the QA model
        self.qa_model = pipeline("question-answering", model="uer/roberta-base-chinese-extractive-qa")

        # Build the document store
        self.documents = []
        self.doc_metadata = []
        for item in self.knowledge:
            self.documents.append(item["content"])
            self.doc_metadata.append(item)

    def search_relevant_docs(self, question, top_k=3):
        """Search for relevant documents (simple character-overlap matching)"""
        # A real system would use a stronger retrieval method (BM25, dense retrieval, ...)
        scores = []
        question_chars = set(question)  # character-level overlap works reasonably for Chinese

        for doc in self.documents:
            score = sum(1 for ch in question_chars if ch in doc)
            scores.append(score)

        top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
        return [self.documents[i] for i in top_indices]

    def answer(self, question):
        """Answer a question"""
        # Retrieve relevant documents
        relevant_docs = self.search_relevant_docs(question)

        # Extract a candidate answer from each document
        best_answer = None
        best_score = 0
        best_source = None

        for doc in relevant_docs:
            result = self.qa_model(question=question, context=doc)
            if result["score"] > best_score:
                best_score = result["score"]
                best_answer = result["answer"]
                best_source = doc

        return {
            "answer": best_answer,
            "confidence": best_score,
            "source": best_source
        }

# Sample knowledge base (in practice, loaded from a file; content is in Chinese to match the model)
sample_knowledge = [
    {
        "title": "公司简介",
        "content": "ABC公司成立于2010年,是一家专注于人工智能技术研发的高科技企业。公司总部位于北京,在上海和深圳设有分公司。"
    },
    {
        "title": "产品介绍",
        "content": "公司主要产品包括智能客服系统、智能问答平台和知识图谱构建工具。智能客服系统已服务超过1000家企业客户。"
    },
    {
        "title": "联系方式",
        "content": "公司客服电话:400-123-4567,邮箱:[email protected],地址:北京市海淀区中关村大街1号。"
    }
]

# Usage example (save the knowledge base to a file first)
# qa_system = KnowledgeBaseQA("knowledge.json")
# result = qa_system.answer("公司的客服电话是多少?")
# print(f"Answer: {result['answer']}")

Multi-Turn Conversational QA

from transformers import AutoModelForCausalLM, AutoTokenizer

class ConversationalQA:
    """Multi-turn conversational QA system"""

    def __init__(self, model_name="Qwen/Qwen2-1.5B-Instruct"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.conversation_history = []

    def chat(self, question, context=None):
        """Multi-turn QA"""
        # Build the prompt from the conversation history
        # (the prompt is in Chinese to match the Chinese instruction model;
        # "用户" = user, "助手" = assistant)
        if context:
            prompt = f"根据以下信息回答问题:\n{context}\n\n"
        else:
            prompt = ""

        for turn in self.conversation_history:
            prompt += f"用户:{turn['question']}\n助手:{turn['answer']}\n"

        prompt += f"用户:{question}\n助手:"

        # Generate a response
        inputs = self.tokenizer(prompt, return_tensors="pt")
        outputs = self.model.generate(
            inputs["input_ids"],
            max_new_tokens=256,
            temperature=0.7,
            top_p=0.9,
            do_sample=True
        )

        response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        answer = response[len(prompt):].strip()

        # Update the conversation history
        self.conversation_history.append({"question": question, "answer": answer})

        return answer

    def reset(self):
        """Reset the conversation"""
        self.conversation_history = []

# Usage example
# qa = ConversationalQA()
# print(qa.chat("什么是机器学习?"))
# print(qa.chat("它有哪些应用?"))  # remembers the previous turn's context

Challenges and Solutions

Answers Missing from the Context

When a question cannot be answered from the given context, the model should recognize this and return "unanswerable".

def answer_with_unanswerable(question, context, model, tokenizer, threshold=0.5):
    """Handle the unanswerable case"""
    inputs = tokenizer(question, context, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs)

    start_logits = outputs.start_logits
    end_logits = outputs.end_logits

    # Score of the [CLS] token (conventionally used to represent "unanswerable")
    cls_score = start_logits[0, 0] + end_logits[0, 0]

    # Score of the best non-[CLS] answer span
    start_idx = start_logits[0, 1:].argmax() + 1
    end_idx = end_logits[0, 1:].argmax() + 1
    answer_score = start_logits[0, start_idx] + end_logits[0, end_idx]

    # If the [CLS] score is higher, treat the question as unanswerable
    if cls_score > answer_score or answer_score < threshold:
        return "The question cannot be answered from the given context."

    # Otherwise return the answer span
    answer_tokens = inputs["input_ids"][0][start_idx:end_idx+1]
    return tokenizer.decode(answer_tokens, skip_special_tokens=True)

Handling Long Documents

Documents longer than the model's maximum input length must be processed in overlapping segments:

def answer_from_long_document(question, document, model, tokenizer, max_length=512, stride=128):
    """Answer a question from a long document"""
    # Split the document into overlapping chunks
    # (character-based splitting is a simplification; token-based striding,
    # e.g. the tokenizer's return_overflowing_tokens option, is more precise)
    chunks = []
    for i in range(0, len(document), max_length - stride):
        chunk = document[i:i + max_length]
        if chunk:
            chunks.append(chunk)

    # Extract a candidate answer from each chunk
    best_answer = None
    best_score = float("-inf")  # logit sums can be negative, so don't start at -1

    for chunk in chunks:
        inputs = tokenizer(question, chunk, return_tensors="pt", truncation=True, max_length=max_length)

        with torch.no_grad():
            outputs = model(**inputs)

        start_idx = outputs.start_logits.argmax().item()
        end_idx = outputs.end_logits.argmax().item()

        # Score the answer
        score = (outputs.start_logits[0, start_idx] + outputs.end_logits[0, end_idx]).item()

        if score > best_score:
            best_score = score
            answer_tokens = inputs["input_ids"][0][start_idx:end_idx+1]
            best_answer = tokenizer.decode(answer_tokens, skip_special_tokens=True)

    return best_answer

Summary

Question answering is one of the most practically valuable NLP applications. This chapter covered:

  • QA types: extractive, generative, knowledge base, and open-domain QA
  • Extractive QA: predicting the start and end positions of the answer
  • Generative QA: generating answers with Seq2Seq models or large language models
  • Open-domain QA: the retriever-reader architecture
  • Fine-tuning: adapting models on domain-specific data
  • Evaluation: exact match and F1 score
  • Applications: knowledge base QA and multi-turn QA

QA technology is evolving rapidly; combining large language models with retrieval-augmented generation (RAG) makes it possible to build more powerful and reliable QA applications.
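The RAG pattern mentioned above can be sketched without any model dependencies: retrieve the best-matching documents, then build a grounded prompt for a generator (the generator call itself is omitted; `build_prompt` and the overlap scoring are illustrative stand-ins, not from any library):

```python
import re

def tokenize(text):
    """Lowercase word set used for overlap scoring."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(question, documents, top_k=2):
    """Rank documents by word overlap with the question (a stand-in for BM25/DPR)."""
    q = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:top_k]

def build_prompt(question, documents):
    """Assemble a grounded prompt: retrieved evidence followed by the question."""
    evidence = "\n".join(f"- {d}" for d in retrieve(question, documents))
    return f"Answer using only the information below.\n{evidence}\nQuestion: {question}\nAnswer:"

docs = [
    "Beijing is the capital of China.",
    "Shanghai is the largest city in China.",
    "Paris is the capital of France.",
]
prompt = build_prompt("What is the capital of China?", docs)
print(prompt)
```

In a production system the retriever would be BM25 or a dense encoder over an indexed corpus, and the prompt would be passed to an LLM; grounding the generation in retrieved evidence is what makes the answers more reliable.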