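Aggregations

Aggregations are Elasticsearch's data-analysis feature: they compute statistics, group documents, and derive values across a result set. Unlike a search query, an aggregation describes the overall shape and trends of the data rather than individual documents. This chapter covers the main aggregation types and how to use them.

Aggregation overview

Aggregation types

Elasticsearch provides three main types of aggregation:

Bucket aggregations: group documents into buckets, where each bucket holds the documents that match its criterion. Similar to GROUP BY in SQL.

Metric aggregations: compute numeric metrics such as the average, maximum, or minimum. Similar to the SQL aggregate functions (AVG, MAX, MIN, SUM).

Pipeline aggregations: compute values from the output of other aggregations, for more complex analyses.

Basic syntax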

GET /articles/_search
{
"size": 0, # return no documents, only aggregation results
"aggs": {
"<aggregation_name>": {
"<aggregation_type>": {
"field": "<field_name>"
}
}
}
}

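Setting size to 0 returns only the aggregation results and no document hits. When only the statistics are needed, this reduces the amount of data sent over the network.

Bucket aggregations

A bucket aggregation groups documents into buckets; each bucket has a unique key and the set of documents that match it.

terms aggregation

The terms aggregation groups documents by field value and counts them. It is the most commonly used bucket aggregation: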

GET /articles/_search
{
"size": 0,
"aggs": {
"by_category": {
"terms": {
"field": "category.keyword",
"size": 10, # return the top 10 buckets
"order": {
"_count": "desc" # sort by document count, descending
}
}
}
}
}

Example response:

{
"aggregations": {
"by_category": {
"buckets": [
{ "key": "Python", "doc_count": 150 },
{ "key": "Java", "doc_count": 120 },
{ "key": "Go", "doc_count": 80 }
]
}
}
}

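How the terms aggregation works:

Each shard computes term counts independently and sends its top shard_size terms to the coordinating node (by default shard_size is size * 1.5 + 10). The coordinating node merges the per-shard results and returns the final top size buckets. Because a term that ranks high overall can fall outside an individual shard's top list, the counts can be approximate, especially when values are distributed unevenly across shards.

Improving accuracy: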

GET /articles/_search
{
"size": 0,
"aggs": {
"by_category": {
"terms": {
"field": "category.keyword",
"size": 10,
"shard_size": 50 # each shard returns its top 50 buckets
}
}
}
}

Increasing shard_size improves accuracy at the cost of more network traffic and memory use.
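The effect of shard_size can be illustrated with a small simulation. This is a minimal sketch in plain Python, not Elasticsearch code: the function names and the two-shard data set are hypothetical, and the merge logic only approximates what a coordinating node does.

```python
from collections import Counter

def shard_top(counts, shard_size):
    """Return one shard's top `shard_size` terms as (term, count) pairs."""
    return Counter(counts).most_common(shard_size)

def merged_terms(shards, size, shard_size):
    """Merge per-shard candidates and keep the top `size` terms.

    Hypothetical stand-in for the coordinating node: it only sees the
    terms each shard chose to return, not the full per-shard counts.
    """
    merged = Counter()
    for counts in shards:
        for term, count in shard_top(counts, shard_size):
            merged[term] += count
    return merged.most_common(size)

# Illustrative data: "Go" is the most frequent term overall (16 documents),
# but it is not the top term on either shard.
shards = [
    {"Python": 10, "Java": 9, "Go": 8},
    {"Rust": 9, "Go": 8, "Java": 1},
]

print(merged_terms(shards, size=1, shard_size=1))  # "Go" is missed
print(merged_terms(shards, size=1, shard_size=3))  # "Go" is found
```

With shard_size equal to 1, neither shard reports "Go", so the merged result cannot contain it. Raising shard_size lets each shard pass along more candidates, which is the trade-off described above.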

range aggregation

The range aggregation groups documents by numeric range:

GET /articles/_search
{
"size": 0,
"aggs": {
"view_ranges": {
"range": {
"field": "views",
"ranges": [
{ "to": 100, "key": "low" },
{ "from": 100, "to": 1000, "key": "medium" },
{ "from": 1000, "key": "high" }
]
}
}
}
}

Example response:

{
"aggregations": {
"view_ranges": {
"buckets": [
{ "key": "low", "to": 100, "doc_count": 50 },
{ "key": "medium", "from": 100, "to": 1000, "doc_count": 200 },
{ "key": "high", "from": 1000, "doc_count": 30 }
]
}
}
}

Range parameters:

  • to: exclusive upper bound (value < to)
  • from: inclusive lower bound (value >= from)
  • key: custom bucket name
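The boundary rules above (from is inclusive, to is exclusive) can be checked with a short sketch. The helper below is hypothetical and written in plain Python; the bucket keys are illustrative labels for the three ranges in the request.

```python
def range_buckets(value, ranges):
    """Return the keys of every range that contains `value`.

    A missing "from" means no lower bound; a missing "to" means no upper bound.
    """
    keys = []
    for r in ranges:
        lower_ok = "from" not in r or value >= r["from"]  # inclusive
        upper_ok = "to" not in r or value < r["to"]       # exclusive
        if lower_ok and upper_ok:
            keys.append(r["key"])
    return keys

ranges = [
    {"to": 100, "key": "low"},
    {"from": 100, "to": 1000, "key": "medium"},
    {"from": 1000, "key": "high"},
]

for views in (99, 100, 999, 1000):
    print(views, range_buckets(views, ranges))
```

A value of exactly 100 lands in the second range, not the first, and a value of exactly 1000 lands in the third.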

date_histogram aggregation

The date_histogram aggregation groups documents by time interval, which makes it well suited to time-series analysis:

GET /articles/_search
{
"size": 0,
"aggs": {
"articles_over_time": {
"date_histogram": {
"field": "created_at",
"calendar_interval": "month", # one bucket per month
"format": "yyyy-MM",
"min_doc_count": 0, # also return months with no documents
"extended_bounds": {
"min": "2024-01",
"max": "2024-12"
}
}
}
}
}

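Interval options:

Parameter           Description
calendar_interval   Calendar-aware interval: minute, hour, day, week, month, quarter, year
fixed_interval      Fixed-length interval, for example 30d, 12h, or 5m

Calendar interval vs. fixed interval:

  • A calendar interval follows calendar rules, so a month can be 28, 29, 30, or 31 days long
  • A fixed interval always has the same length, so 30d is always 30 days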
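The difference is easy to see with dates. The sketch below uses only the Python standard library; the helper names and the start date are arbitrary choices for illustration, and it does not reproduce how Elasticsearch aligns bucket boundaries.

```python
import calendar
from datetime import date, timedelta

def calendar_month_lengths(year):
    """Number of days in each calendar month of `year`."""
    return [calendar.monthrange(year, month)[1] for month in range(1, 13)]

def fixed_bucket_starts(start, days, count):
    """Start dates of `count` consecutive fixed-length buckets."""
    return [start + timedelta(days=days * i) for i in range(count)]

lengths = calendar_month_lengths(2024)
starts = fixed_bucket_starts(date(2024, 1, 1), days=30, count=4)

print(lengths)  # month lengths vary, and February has 29 days in 2024
print(starts)   # 30-day buckets drift away from month boundaries
```

Monthly calendar buckets always start on the first of a month, while the 30-day buckets begin on 2024-01-01, 2024-01-31, 2024-03-01, and 2024-03-31.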

histogram aggregation

The histogram aggregation groups documents by fixed numeric interval:

GET /articles/_search
{
"size": 0,
"aggs": {
"view_histogram": {
"histogram": {
"field": "views",
"interval": 500,
"min_doc_count": 1
}
}
}
}

Example response (buckets only):

{
"buckets": [
{ "key": 0, "doc_count": 100 },
{ "key": 500, "doc_count": 80 },
{ "key": 1000, "doc_count": 50 }
]
}
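Each bucket key is the lower bound of its interval: a value is rounded down to the nearest multiple of interval. The sketch below applies that rounding rule in plain Python. The function names and sample values are hypothetical, and it only emits non-empty buckets, which corresponds to min_doc_count of 1.

```python
import math
from collections import Counter

def histogram_key(value, interval, offset=0):
    """Round `value` down to the start of its bucket."""
    return math.floor((value - offset) / interval) * interval + offset

def histogram(values, interval):
    """Count values per bucket and return non-empty buckets sorted by key."""
    counts = Counter(histogram_key(v, interval) for v in values)
    return [{"key": key, "doc_count": n} for key, n in sorted(counts.items())]

views = [0, 120, 499, 500, 730, 1000]
for bucket in histogram(views, interval=500):
    print(bucket)
```

With an interval of 500, the value 499 falls in the bucket with key 0 and the value 500 starts the next bucket.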

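filter aggregation

The filter aggregation narrows the document set with a query and then runs its sub-aggregations on the matching documents: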

GET /articles/_search
{
"size": 0,
"aggs": {
"published_articles": {
"filter": {
"term": { "status": "published" }
},
"aggs": {
"avg_views": {
"avg": { "field": "views" }
}
}
}
}
}

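This request first selects the published articles and then computes their average view count.

filters aggregation

The filters aggregation defines several named filters, each of which produces its own bucket: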

GET /articles/_search
{
"size": 0,
"aggs": {
"articles_by_status": {
"filters": {
"filters": {
"published": { "term": { "status": "published" } },
"draft": { "term": { "status": "draft" } },
"archived": { "term": { "status": "archived" } }
}
}
}
}
}

missing aggregation

The missing aggregation counts the documents that have no value for a field:

GET /articles/_search
{
"size": 0,
"aggs": {
"missing_author": {
"missing": { "field": "author" }
}
}
}

Metric aggregations

Metric aggregations compute numeric metrics. They are either single-value or multi-value aggregations.

Basic statistics

GET /articles/_search
{
"size": 0,
"aggs": {
"avg_views": { "avg": { "field": "views" } },
"max_views": { "max": { "field": "views" } },
"min_views": { "min": { "field": "views" } },
"sum_views": { "sum": { "field": "views" } },
"count_views": { "value_count": { "field": "views" } }
}
}

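What each aggregation returns:

Aggregation   Description
avg           Average
max           Maximum
min           Minimum
sum           Sum
value_count   Number of values (documents without the field are not counted)

stats aggregation

The stats aggregation returns several statistics in one request: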

GET /articles/_search
{
"size": 0,
"aggs": {
"views_stats": {
"stats": { "field": "views" }
}
}
}

Example response:

{
"aggregations": {
"views_stats": {
"count": 1000,
"min": 0,
"max": 10000,
"avg": 1500.5,
"sum": 1500500
}
}
}

extended_stats aggregation

The extended_stats aggregation returns more detailed statistics:

GET /articles/_search
{
"size": 0,
"aggs": {
"views_stats": {
"extended_stats": { "field": "views" }
}
}
}

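The response includes count, min, max, avg, sum, sum_of_squares, variance, std_deviation, and more.

cardinality aggregation

The cardinality aggregation counts the number of distinct values (an approximate distinct count):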

GET /articles/_search
{
"size": 0,
"aggs": {
"unique_authors": {
"cardinality": {
"field": "author.keyword",
"precision_threshold": 1000
}
}
}
}

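The precision_threshold parameter:

  • Trades accuracy against memory use
  • Larger values give more accurate counts but use more memory
  • Useful values range from 100 to 40000; 40000 is the maximum supported value

The cardinality aggregation uses the HyperLogLog++ algorithm. It is probabilistic, so counts can be off by a small margin, but it is very efficient for distinct counts over large data sets.

percentiles aggregation

The percentiles aggregation computes percentiles and is commonly used to analyze metrics such as response time and latency: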

GET /articles/_search
{
"size": 0,
"aggs": {
"views_percentiles": {
"percentiles": {
"field": "views",
"percents": [50, 75, 90, 95, 99]
}
}
}
}

Example response:

{
"aggregations": {
"views_percentiles": {
"values": {
"50.0": 1200,
"75.0": 2500,
"90.0": 4500,
"95.0": 6800,
"99.0": 9200
}
}
}
}

In other words, about 50% of articles have fewer than 1200 views, about 90% have fewer than 4500, and so on.
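To make the meaning of a percentile concrete, the sketch below computes exact percentiles on a small list with the nearest-rank method. It is a plain-Python illustration with hypothetical names and data; Elasticsearch computes percentiles approximately, so its values on large data sets can differ slightly from an exact calculation like this one.

```python
import math

def nearest_rank_percentile(values, percent):
    """Smallest value such that at least `percent`% of values are <= it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(percent * len(ordered) / 100))
    return ordered[rank - 1]

views = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]

for p in (50, 90, 99):
    print(p, nearest_rank_percentile(views, p))
```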

top_hits aggregation

The top_hits aggregation returns the top matching documents in each bucket:

GET /articles/_search
{
"size": 0,
"aggs": {
"by_category": {
"terms": { "field": "category.keyword" },
"aggs": {
"top_articles": {
"top_hits": {
"size": 3,
"sort": [{ "views": "desc" }],
"_source": ["title", "author", "views"]
}
}
}
}
}
}

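This request groups articles by category and returns the 3 most-viewed articles in each category.

Nested aggregations

Multi-level grouping

Aggregations can be nested to group on several levels: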

GET /articles/_search
{
"size": 0,
"aggs": {
"by_category": {
"terms": { "field": "category.keyword" },
"aggs": {
"by_author": {
"terms": { "field": "author.keyword", "size": 5 },
"aggs": {
"avg_views": {
"avg": { "field": "views" }
}
}
}
}
}
}
}

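This request is evaluated as follows:

  1. Group by category
  2. Within each category bucket, group by author
  3. Within each author bucket, compute the average view count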
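The three steps above can be sketched over an in-memory list of documents. This is a plain-Python analogy with hypothetical data, meant to show the shape of the nested result rather than how Elasticsearch evaluates the request internally.

```python
from collections import defaultdict
from statistics import mean

def avg_views_by_category_and_author(docs):
    """Group by category, then by author, then average the views."""
    grouped = defaultdict(lambda: defaultdict(list))
    for doc in docs:
        grouped[doc["category"]][doc["author"]].append(doc["views"])
    return {
        category: {author: mean(views) for author, views in authors.items()}
        for category, authors in grouped.items()
    }

# Illustrative documents, not taken from a real index.
docs = [
    {"category": "Python", "author": "alice", "views": 100},
    {"category": "Python", "author": "alice", "views": 300},
    {"category": "Python", "author": "bob", "views": 50},
    {"category": "Go", "author": "alice", "views": 80},
]

result = avg_views_by_category_and_author(docs)
print(result)
```

Each outer key plays the role of a category bucket, each inner key an author bucket, and each inner value the avg_views metric.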

Bucket ordering

Buckets can be ordered by the result of a sub-aggregation:

GET /articles/_search
{
"size": 0,
"aggs": {
"by_category": {
"terms": {
"field": "category.keyword",
"order": {
"avg_views": "desc" # order by the sub-aggregation result
}
},
"aggs": {
"avg_views": {
"avg": { "field": "views" }
}
}
}
}
}

Bucket filtering

bucket_selector filters the buckets produced by the parent aggregation:

GET /articles/_search
{
"size": 0,
"aggs": {
"by_category": {
"terms": { "field": "category.keyword" },
"aggs": {
"avg_views": {
"avg": { "field": "views" }
},
"views_filter": {
"bucket_selector": {
"buckets_path": {
"avgViews": "avg_views"
},
"script": "params.avgViews > 1000"
}
}
}
}
}
}

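This request keeps only the categories whose average view count is greater than 1000.

Pipeline aggregations

Pipeline aggregations compute values from the output of other aggregations.

derivative aggregation

Computes the change between adjacent buckets: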

GET /articles/_search
{
"size": 0,
"aggs": {
"articles_over_time": {
"date_histogram": {
"field": "created_at",
"calendar_interval": "day"
},
"aggs": {
"daily_count": {
"value_count": { "field": "created_at" } # counts documents in the bucket; avoids aggregating on _id
},
"derivative": {
"derivative": {
"buckets_path": "daily_count"
}
}
}
}
}
}

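The derivative is the day-over-day change in the number of articles: a positive value means an increase, a negative value a decrease.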
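The calculation itself is a difference between neighbouring buckets. A minimal sketch in plain Python, with a hypothetical function name and made-up daily counts:

```python
def derivative(counts):
    """Difference between each bucket and the one before it.

    The first bucket has no predecessor, so its entry is None.
    """
    if not counts:
        return []
    return [None] + [curr - prev for prev, curr in zip(counts, counts[1:])]

daily_counts = [5, 8, 6, 6]
print(derivative(daily_counts))
```

For these counts the changes are +3, -2, and 0, and the first day has no value.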

cumulative_sum aggregation

Computes a running total:

GET /articles/_search
{
"size": 0,
"aggs": {
"articles_over_time": {
"date_histogram": {
"field": "created_at",
"calendar_interval": "day"
},
"aggs": {
"daily_count": {
"value_count": { "field": "created_at" } # counts documents in the bucket; avoids aggregating on _id
},
"cumulative": {
"cumulative_sum": {
"buckets_path": "daily_count"
}
}
}
}
}
}
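A running total adds each bucket's value to the sum of all earlier buckets. The sketch below shows the same calculation with the Python standard library; the function name and daily counts are hypothetical.

```python
from itertools import accumulate

def cumulative_sum(counts):
    """Running total of per-bucket values, one entry per bucket."""
    return list(accumulate(counts))

daily_counts = [5, 8, 6, 6]
print(cumulative_sum(daily_counts))
```

The last entry equals the total number of articles across all buckets.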

moving_fn aggregation

Computes a statistic over a sliding window of buckets:

GET /articles/_search
{
"size": 0,
"aggs": {
"articles_over_time": {
"date_histogram": {
"field": "created_at",
"calendar_interval": "day"
},
"aggs": {
"daily_views": {
"sum": { "field": "views" }
},
"moving_avg": {
"moving_fn": {
"buckets_path": "daily_views",
"window": 7,
"script": "MovingFunctions.unweightedAvg(values)"
}
}
}
}
}
}

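Available moving functions:

  • MovingFunctions.unweightedAvg(values): moving average
  • MovingFunctions.sum(values): moving sum
  • MovingFunctions.min(values): moving minimum
  • MovingFunctions.max(values): moving maximum
  • MovingFunctions.stdDev(values, avg): moving standard deviation, where avg is the mean of the window, for example MovingFunctions.unweightedAvg(values)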
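The sliding-window idea can be sketched in plain Python. The sketch assumes the window covers the buckets immediately before the current one and excludes the current bucket, which is my understanding of the default window placement; treat that as an assumption to verify against your Elasticsearch version. The function names and data are hypothetical.

```python
import math

def unweighted_avg(values):
    """Mean of the window, or NaN when the window is empty."""
    return sum(values) / len(values) if values else math.nan

def moving_fn(series, window, fn):
    """Apply `fn` to the `window` values preceding each position.

    Assumption: the current value is excluded from its own window.
    """
    return [fn(series[max(0, i - window):i]) for i in range(len(series))]

daily_views = [10, 20, 30, 40, 50]
result = moving_fn(daily_views, window=2, fn=unweighted_avg)
print(result)
```

The first entry is NaN because no earlier buckets exist, and every later entry averages up to two preceding values.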

bucket_sort aggregation

Sorts and paginates buckets:

GET /articles/_search
{
"size": 0,
"aggs": {
"by_category": {
"terms": { "field": "category.keyword", "size": 100 },
"aggs": {
"total_views": { "sum": { "field": "views" } },
"sort": {
"bucket_sort": {
"sort": [{ "total_views": "desc" }],
"from": 0,
"size": 10
}
}
}
}
}
}
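In this request bucket_sort works on the buckets that the parent terms aggregation has already produced (at most 100 here), then sorts them and keeps one page. A minimal sketch of that sort-then-slice step in plain Python, with a hypothetical helper and made-up buckets:

```python
def bucket_sort(buckets, sort_key, descending=True, from_=0, size=None):
    """Sort buckets by `sort_key`, then return the slice [from_, from_ + size)."""
    ordered = sorted(buckets, key=lambda b: b[sort_key], reverse=descending)
    end = None if size is None else from_ + size
    return ordered[from_:end]

buckets = [
    {"key": "Go", "total_views": 300},
    {"key": "Python", "total_views": 900},
    {"key": "Java", "total_views": 500},
]

first_page = bucket_sort(buckets, "total_views", from_=0, size=2)
second_page = bucket_sort(buckets, "total_views", from_=2, size=2)
print([b["key"] for b in first_page])
print([b["key"] for b in second_page])
```

Because the sort only sees the buckets returned by the parent aggregation, a category that the terms aggregation did not return cannot appear on any page.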

Complete examples

Article statistics

GET /articles/_search
{
"size": 0,
"query": {
"bool": {
"filter": [
{ "term": { "status": "published" } }
]
}
},
"aggs": {
"by_category": {
"terms": { "field": "category.keyword" },
"aggs": {
"total_views": { "sum": { "field": "views" } },
"avg_views": { "avg": { "field": "views" } },
"top_articles": {
"top_hits": {
"size": 3,
"sort": [{ "views": "desc" }],
"_source": ["title", "views"]
}
}
}
},
"views_over_time": {
"date_histogram": {
"field": "created_at",
"calendar_interval": "month"
},
"aggs": {
"monthly_views": { "sum": { "field": "views" } }
}
},
"popular_tags": {
"terms": {
"field": "tags.keyword",
"size": 10
}
},
"overall_stats": {
"stats": { "field": "views" }
}
}
}

Sales data analysis

GET /sales/_search
{
"size": 0,
"query": {
"range": {
"date": {
"gte": "now-30d/d"
}
}
},
"aggs": {
"by_product": {
"terms": { "field": "product_id" },
"aggs": {
"total_revenue": {
"sum": {
"script": "doc['price'].value * doc['quantity'].value"
}
},
"avg_order_value": {
"avg": { "field": "price" }
},
"daily_sales": {
"date_histogram": {
"field": "date",
"calendar_interval": "day"
},
"aggs": {
"revenue": {
"sum": { "field": "price" }
}
}
}
}
},
"revenue_percentiles": {
"percentiles": {
"field": "price",
"percents": [25, 50, 75, 90, 95, 99]
}
}
}
}

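Summary

This chapter covered the core of Elasticsearch aggregations:

  1. Aggregation overview: the three aggregation types (Bucket, Metric, Pipeline)
  2. Bucket aggregations: terms, range, date_histogram, histogram, filter, filters, missing
  3. Metric aggregations: avg, max, min, sum, value_count, stats, extended_stats, cardinality, percentiles, top_hits
  4. Nested aggregations: multi-level grouping, bucket ordering, bucket filtering
  5. Pipeline aggregations: derivative, cumulative_sum, moving_fn, bucket_sort

Exercises

  1. Compute the number of articles, total views, and average views for each category
  2. Count the new articles per month and compute the month-over-month growth rate
  3. Find the 5 most-viewed articles in each category
  4. Compute the percentile distribution of view counts and describe the shape of the data
  5. Build a sales dashboard that analyzes the data by product, region, and time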