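Chinese Word Segmentation

Chinese word segmentation is the foundation of Chinese-language search. This chapter shows how to configure and use the IK Chinese analyzer in Elasticsearch.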

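Analyzer Overview

Why is Chinese word segmentation needed?

English tokenization: "The quick brown fox" -> ["the", "quick", "brown", "fox"]
(words are already separated by spaces)

Chinese segmentation: "搜索技术很强大" -> ?
- Character by character: ["搜", "索", "技", "术", "很", "强", "大"]
- Dictionary-aware: ["搜索", "技术", "很", "强大"]
(Chinese text has no spaces between words, so finding the right boundaries requires knowledge of the vocabulary and context)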
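
To make the difference concrete, here is a minimal Python sketch that contrasts character-by-character splitting with a greedy dictionary lookup (forward maximum matching). This is an illustration only, not IK's actual algorithm; the function names and the three-word DICTIONARY are hypothetical.

```python
def split_by_character(text):
    """Naive segmentation: every character becomes its own token."""
    return list(text)


def forward_max_match(text, dictionary, max_len=4):
    """Greedy longest-match segmentation against a word list."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink the window.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when nothing matches.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical mini-dictionary, just large enough for this sentence.
DICTIONARY = {"搜索", "技术", "强大"}

print(split_by_character("搜索技术很强大"))
print(forward_max_match("搜索技术很强大", DICTIONARY))
```

The second call yields the four words from the dictionary-aware line above. A production analyzer such as IK uses a far larger dictionary and additional disambiguation, so treat this only as a picture of why a word list matters.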

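The IK Analyzer

The IK analyzer is one of the most widely used Chinese analysis plugins for Elasticsearch. It provides two modes:

ik_smart: coarse-grained segmentation that keeps the most plausible split with the fewest tokens; suited to search time
ik_max_word: fine-grained segmentation that emits every word it can find; suited to index time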

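Installing the IK Analyzer

Online installation

# Go to the Elasticsearch home directory (the one that contains bin/)
cd /usr/share/elasticsearch

# Install the plugin on every node (its version must match the Elasticsearch version)
bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v8.12.0/elasticsearch-analysis-ik-8.12.0.zip

# Restart Elasticsearch so the plugin is loaded
systemctl restart elasticsearch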

Docker installation

# Dockerfile
FROM docker.elastic.co/elasticsearch/elasticsearch:8.12.0
RUN bin/elasticsearch-plugin install --batch https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v8.12.0/elasticsearch-analysis-ik-8.12.0.zip

# Build the image (run in the directory that contains the Dockerfile)
docker build -t elasticsearch-ik:8.12.0 .

Verifying the installation

# Test the segmentation output
GET /_analyze
{
  "analyzer": "ik_max_word",
  "text": "中华人民共和国国歌"
}

# Response (the exact tokens can vary with the plugin version and dictionary)
{
  "tokens": [
    { "token": "中华人民共和国", "start_offset": 0, "end_offset": 7, "type": "CN_WORD" },
    { "token": "中华人民", "start_offset": 0, "end_offset": 4, "type": "CN_WORD" },
    { "token": "中华", "start_offset": 0, "end_offset": 2, "type": "CN_WORD" },
    { "token": "华人", "start_offset": 1, "end_offset": 3, "type": "CN_WORD" },
    { "token": "人民共和国", "start_offset": 2, "end_offset": 7, "type": "CN_WORD" },
    { "token": "人民", "start_offset": 2, "end_offset": 4, "type": "CN_WORD" },
    { "token": "共和国", "start_offset": 4, "end_offset": 7, "type": "CN_WORD" },
    { "token": "共和", "start_offset": 4, "end_offset": 6, "type": "CN_WORD" },
    { "token": "国歌", "start_offset": 7, "end_offset": 9, "type": "CN_WORD" }
  ]
}
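
When reading an _analyze response, it helps to check that each token really is the slice of the input between its start_offset and end_offset. The Python sketch below does this for a trimmed copy of the response above; the helper names are hypothetical, and it assumes the text contains only characters from the Basic Multilingual Plane, where Elasticsearch offsets coincide with Python string indices.

```python
import json

# Trimmed copy of the _analyze response shown above.
SAMPLE_RESPONSE = json.loads("""
{
  "tokens": [
    { "token": "中华人民共和国", "start_offset": 0, "end_offset": 7, "type": "CN_WORD" },
    { "token": "中华", "start_offset": 0, "end_offset": 2, "type": "CN_WORD" },
    { "token": "国歌", "start_offset": 7, "end_offset": 9, "type": "CN_WORD" }
  ]
}
""")


def token_spans(response, text):
    """Pair each token with the slice of the source text its offsets point at."""
    return [
        (t["token"], text[t["start_offset"]:t["end_offset"]])
        for t in response["tokens"]
    ]


def offsets_are_consistent(response, text):
    """True when every token equals the text between its offsets."""
    return all(token == span for token, span in token_spans(response, text))


print(offsets_are_consistent(SAMPLE_RESPONSE, "中华人民共和国国歌"))
```

Overlapping offsets, such as "中华人民共和国" and "中华" both starting at 0, are expected with ik_max_word because it emits every word it finds.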

Comparing the Two Modes

ik_max_word (use at index time)

Fine-grained segmentation that emits as many words as possible:

GET /_analyze
{
  "analyzer": "ik_max_word",
  "text": "中华人民共和国"
}

# Result: ["中华人民共和国", "中华人民", "中华", "华人", "人民共和国", "人民", "共和国", "共和", "国"]

ik_smart (use at search time)

Coarse-grained segmentation that keeps only the most plausible split:

GET /_analyze
{
  "analyzer": "ik_smart",
  "text": "中华人民共和国"
}

# Result: ["中华人民共和国"]

Best practice

# When creating an index, analyze documents with ik_max_word and queries with ik_smart
PUT /articles
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      },
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}

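Explanation:

Indexing with ik_max_word emits as many words as possible, so more queries can find the document (higher recall)
Searching with ik_smart keeps only the most plausible split of the query, so fewer unrelated documents match (higher precision)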
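
The recall argument can be checked with a small Python sketch. The token lists are copied from the two _analyze results above; the matches helper is hypothetical and only models a match query with the default OR operator. It also assumes that ik_smart keeps the short query "共和国" as a single token.

```python
# Token lists copied from the two _analyze results above.
MAX_WORD_TOKENS = ["中华人民共和国", "中华人民", "中华", "华人", "人民共和国",
                   "人民", "共和国", "共和", "国"]
SMART_TOKENS = ["中华人民共和国"]


def matches(indexed_tokens, query_tokens):
    """A document matches when at least one query token was indexed."""
    return any(token in indexed_tokens for token in query_tokens)


# Assumption: ik_smart leaves the short query "共和国" as one token.
query_tokens = ["共和国"]

print(matches(set(MAX_WORD_TOKENS), query_tokens))  # document indexed with ik_max_word
print(matches(set(SMART_TOKENS), query_tokens))     # document indexed with ik_smart
```

Under these assumptions the first call finds the document and the second does not, which is why the fine-grained mode is the usual choice at index time.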

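Custom Dictionaries

IK supports custom dictionaries, which let you add words the built-in dictionary does not know, such as product names or domain terms.

Dictionary file locations

config/analysis-ik/
├── IKAnalyzer.cfg.xml    # configuration file
├── main.dic              # main dictionary
├── quantifier.dic        # measure-word (quantifier) dictionary
├── stopword.dic          # stop-word dictionary
├── extra_main.dic        # extended dictionary
└── extra_stopword.dic    # extended stop-word dictionary

Configuring a custom dictionary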

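Edit IKAnalyzer.cfg.xml. Dictionary paths are relative to the directory that contains this file:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <entry key="ext_dict">custom/my_dict.dic</entry>
    <entry key="ext_stopwords">custom/my_stopwords.dic</entry>
</properties>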

Creating a custom dictionary

# Create the dictionary file (UTF-8, one word per line)
# Restart Elasticsearch afterwards so IK loads the new words
mkdir -p config/analysis-ik/custom
echo "Elasticsearch" >> config/analysis-ik/custom/my_dict.dic
echo "Kibana" >> config/analysis-ik/custom/my_dict.dic
echo "微服务" >> config/analysis-ik/custom/my_dict.dic
echo "分布式系统" >> config/analysis-ik/custom/my_dict.dic
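
For longer word lists, generating the file from a script avoids stray whitespace and duplicate entries. The following Python sketch writes one word per line in UTF-8; the write_dictionary function is a hypothetical helper, not part of IK or Elasticsearch.

```python
from pathlib import Path


def write_dictionary(path, words):
    """Write one word per line in UTF-8, dropping blanks and duplicates."""
    seen = set()
    cleaned = []
    for word in words:
        word = word.strip()  # remove accidental leading/trailing spaces
        if word and word not in seen:
            seen.add(word)
            cleaned.append(word)
    path = Path(path)
    # Create the parent directory (for example custom/) when it is missing.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return cleaned
```

A call such as write_dictionary("config/analysis-ik/custom/my_dict.dic", ["Elasticsearch", "Kibana", "微服务", "分布式系统"]) would produce the same file as the echo commands above, replacing any previous content rather than appending to it.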

Remote dictionaries

IK can also load dictionaries from a remote HTTP server:

<properties>
    <entry key="remote_ext_dict">http://example.com/dict.txt</entry>
    <entry key="remote_ext_stopwords">http://example.com/stopwords.txt</entry>
</properties>

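IK polls remote dictionaries periodically (every 60 seconds) and reloads them without a restart when the server's Last-Modified or ETag response header changes. The remote file must be UTF-8 with one word per line.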
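
A remote dictionary endpoint only has to serve a plain-text word list together with the Last-Modified and ETag headers. The sketch below uses Python's standard http.server for local experiments; DictHandler, make_server, and the port are hypothetical, the handler serves the same file for every request path, and it is not meant for production use.

```python
import hashlib
from email.utils import formatdate
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path


class DictHandler(BaseHTTPRequestHandler):
    """Serve one dictionary file for every path, with change-detection headers."""

    def _load(self):
        path = self.server.dict_path
        return path.read_bytes(), path.stat().st_mtime

    def _send_headers(self, body, mtime):
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        # Last-Modified follows the file's mtime; ETag follows its content.
        self.send_header("Last-Modified", formatdate(mtime, usegmt=True))
        self.send_header("ETag", '"' + hashlib.sha256(body).hexdigest() + '"')
        self.end_headers()

    def do_HEAD(self):
        # Headers only, so a client can check for changes cheaply.
        body, mtime = self._load()
        self._send_headers(body, mtime)

    def do_GET(self):
        body, mtime = self._load()
        self._send_headers(body, mtime)
        self.wfile.write(body)


def make_server(dict_path, port=8000):
    """Create a server bound to localhost that serves dict_path."""
    server = HTTPServer(("127.0.0.1", port), DictHandler)
    server.dict_path = Path(dict_path)
    return server
```

Running make_server("dict.txt").serve_forever() and pointing remote_ext_dict at http://127.0.0.1:8000/dict.txt would then let you edit dict.txt and watch the headers change on the next request.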

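Stop-Word Configuration

Stop words are words the analyzer drops from the token stream, such as "的", "是", and "在", so they are neither indexed nor matched at search time.

# View the default stop words
cat config/analysis-ik/stopword.dic

# Add custom stop words (the file must be listed under ext_stopwords in IKAnalyzer.cfg.xml; restart Elasticsearch afterwards)
echo "啊" >> config/analysis-ik/extra_stopword.dic
echo "嗯" >> config/analysis-ik/extra_stopword.dic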
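
Conceptually, stop-word removal is a filter over the token stream. The Python sketch below shows the idea with a hypothetical token list and stop-word set; it does not reproduce IK's internal implementation.

```python
def remove_stop_words(tokens, stop_words):
    """Drop every token that appears in the stop-word set."""
    return [token for token in tokens if token not in stop_words]


# Hypothetical token stream and stop-word set, for illustration only.
STOP_WORDS = {"的", "是", "在", "啊", "嗯"}
tokens = ["搜索", "的", "技术", "是", "强大"]

print(remove_stop_words(tokens, STOP_WORDS))
```

Because the same filter runs on documents and on queries, a stop word can never contribute to a match.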

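Synonym Configuration

Creating a synonym file

# Create the synonym file (one comma-separated group of equivalent terms per line)
echo "手机,移动电话,智能机" > config/analysis-ik/synonyms.txt
echo "电脑,计算机,PC" >> config/analysis-ik/synonyms.txt

Configuring a synonym analyzer

PUT /articles fails if the index already exists, so delete it first (DELETE /articles) or use another index name. The synonyms_path value is relative to the Elasticsearch config directory.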

PUT /articles
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonyms": {
          "type": "synonym",
          "synonyms_path": "analysis-ik/synonyms.txt"
        }
      },
      "analyzer": {
        "ik_synonym": {
          "type": "custom",
          "tokenizer": "ik_max_word",
          "filter": ["my_synonyms"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_synonym"
      }
    }
  }
}

Testing synonyms

GET /articles/_analyze
{
  "analyzer": "ik_synonym",
  "text": "我买了一部手机"
}

# The output contains "手机" together with its synonyms "移动电话" and "智能机"
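
The effect of a comma-separated synonym file can be pictured with a short Python sketch. parse_synonyms and expand are hypothetical helpers that handle only the simple equivalent-group format used above (not explicit "=>" mappings), and they append synonyms to a flat list, whereas the real filter places them at the same token position.

```python
def parse_synonyms(lines):
    """Map every term in a comma-separated group to the whole group."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        group = [term.strip() for term in line.split(",") if term.strip()]
        for term in group:
            table[term] = group
    return table


def expand(tokens, table):
    """Replace each token with its synonym group; keep other tokens as-is."""
    expanded = []
    for token in tokens:
        expanded.extend(table.get(token, [token]))
    return expanded


SYNONYMS = parse_synonyms(["手机,移动电话,智能机", "电脑,计算机,PC"])
print(expand(["买", "手机"], SYNONYMS))
```

Here "手机" is expanded to all three terms of its group while "买" passes through unchanged, which mirrors what the _analyze request above is meant to show.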

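Pinyin Analysis

Installing the pinyin plugin

# The plugin version must match the Elasticsearch version; restart the node afterwards
bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-pinyin/releases/download/v8.12.0/elasticsearch-analysis-pinyin-8.12.0.zip

Configuring a pinyin analyzer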

PUT /articles
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ik_pinyin": {
          "type": "custom",
          "tokenizer": "ik_max_word",
          "filter": ["pinyin_filter"]
        }
      },
      "filter": {
        "pinyin_filter": {
          "type": "pinyin",
          "keep_full_pinyin": true,
          "keep_original": true
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_pinyin"
      }
    }
  }
}

Search example

# Documents whose title contains "北京" can be found by searching for the pinyin "beijing"
GET /articles/_search
{
  "query": {
    "match": {
      "title": "beijing"
    }
  }
}

Worked Example

An article search system

# Create the index
PUT /articles
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonyms": {
          "type": "synonym",
          "synonyms": ["Python,Python语言", "Java,Java语言"]
        }
      },
      "analyzer": {
        "ik_index": {
          "type": "custom",
          "tokenizer": "ik_max_word",
          "filter": ["my_synonyms"]
        },
        "ik_search": {
          "type": "custom",
          "tokenizer": "ik_smart"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_index",
        "search_analyzer": "ik_search"
      },
      "content": {
        "type": "text",
        "analyzer": "ik_index",
        "search_analyzer": "ik_search"
      },
      "author": {
        "type": "keyword"
      },
      "tags": {
        "type": "keyword"
      },
      "created_at": {
        "type": "date"
      }
    }
  }
}

# Index a document
POST /articles/_doc
{
  "title": "Python 机器学习入门教程",
  "content": "本文介绍如何使用 Python 进行机器学习开发，包括数据预处理、模型训练等内容",
  "author": "张三",
  "tags": ["Python", "机器学习", "教程"],
  "created_at": "2024-01-15"
}

# Search
GET /articles/_search
{
  "query": {
    "match": {
      "title": "Python机器学习"
    }
  },
  "highlight": {
    "fields": {
      "title": {}
    }
  }
}
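
Outside Kibana's console, the same requests can be sent from a script. The Python sketch below uses only the standard library; BASE_URL, build_request, and send are hypothetical names, and it assumes a local node reachable over plain HTTP without authentication (Elasticsearch 8 enables TLS and authentication by default, so a real deployment needs HTTPS and credentials).

```python
import json
import urllib.request

# Assumption: a local node reachable over plain HTTP without authentication.
BASE_URL = "http://localhost:9200"


def build_request(method, path, body=None):
    """Build a JSON request for the Elasticsearch REST API."""
    data = None
    if body is not None:
        # ensure_ascii=False keeps Chinese text readable in the payload.
        data = json.dumps(body, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )


def send(request):
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# The search from the example above, sent as POST (which _search also accepts).
search = build_request("POST", "/articles/_search", {
    "query": {"match": {"title": "Python机器学习"}},
    "highlight": {"fields": {"title": {}}},
})
# print(send(search))  # uncomment when a node with the articles index is running
```

Keeping request construction separate from sending makes it easy to inspect the payload before it reaches the cluster.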

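Summary

In this chapter we covered:

Why Chinese text needs word segmentation
Installing and configuring the IK analyzer
The two segmentation modes, ik_max_word and ik_smart
Custom dictionaries and stop-word configuration
Synonyms and pinyin analysis
A worked example: an article search system

Exercises

Install the IK analyzer and test its segmentation output
Configure a custom dictionary and add domain-specific terms
Build an article index that supports synonym search
Configure pinyin analysis to enable search by pinyin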
