窗口函数详解
窗口函数是 SQL 中最强大的分析工具之一,也是面试中的高频考点。它允许你在不减少行数的情况下进行聚合计算,解决许多传统聚合函数无法处理的问题。MySQL 8.0 正式引入了窗口函数,掌握它是从"会写 SQL"到"精通 SQL"的关键一步。
什么是窗口函数?
窗口函数(Window Function)是一种特殊的函数,它对一组查询行(称为"窗口")执行计算,为每一行返回一个结果。
与普通聚合函数的区别
理解窗口函数的关键是理解它与普通聚合函数的区别:
-- 普通聚合函数:多行合并为一行
SELECT department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department;
-- 结果:每个部门一行
-- 窗口函数:每行都返回结果,不减少行数
SELECT name, department, salary,
AVG(salary) OVER (PARTITION BY department) AS dept_avg
FROM employees;
-- 结果:每个员工一行,但包含部门平均工资
核心区别:
| 特性 | 普通聚合函数 | 窗口函数 |
|---|---|---|
| 结果行数 | 减少为一行/组 | 保持原行数不变 |
| 计算范围 | 整个组 | 可定义的窗口范围 |
| 能否访问其他行 | 不能 | 能(如 LAG、LEAD) |
| 使用场景 | 汇总统计 | 排名、累计、对比分析 |
基本语法
窗口函数的基本语法结构如下:
函数名(参数) OVER (
PARTITION BY 分组表达式 -- 可选:定义分区
ORDER BY 排序表达式 -- 可选:定义排序
frame_clause -- 可选:定义窗口帧
)
语法拆解
1. PARTITION BY 子句
将数据分成多个分区,每个分区独立计算窗口函数。类似于 GROUP BY,但不会减少行数。
-- 计算每个员工的工资在其部门内的排名
SELECT name, department, salary,
RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dept_rank
FROM employees;
如果不指定 PARTITION BY,整个结果集被视为一个分区。
2. ORDER BY 子句
在分区内定义排序顺序。对于排名函数(RANK、DENSE_RANK、ROW_NUMBER)和偏移函数(LAG、LEAD),这是必须的。
-- 按日期排序计算累计销售额
SELECT date, amount,
SUM(amount) OVER (ORDER BY date) AS running_total
FROM sales;
3. frame_clause 窗口帧
定义窗口的具体范围,用于聚合窗口函数(SUM、AVG 等)。
-- 计算每行及其前两行的移动平均
SELECT date, amount,
AVG(amount) OVER (
ORDER BY date
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
) AS moving_avg
FROM daily_sales;
窗口函数分类
MySQL 提供的窗口函数可以分为以下几类:
1. 排名函数
| 函数 | 说明 |
|---|---|
ROW_NUMBER() | 为每行分配唯一的连续编号 |
RANK() | 分配排名,相同值并列,后续排名跳过 |
DENSE_RANK() | 分配排名,相同值并列,后续排名不跳过 |
NTILE(n) | 将分区内的行分成 n 个桶 |
2. 偏移函数
| 函数 | 说明 |
|---|---|
LAG(expr, n, default) | 获取当前行前 n 行的值 |
LEAD(expr, n, default) | 获取当前行后 n 行的值 |
FIRST_VALUE(expr) | 获取窗口第一行的值 |
LAST_VALUE(expr) | 获取窗口最后一行的值 |
NTH_VALUE(expr, n) | 获取窗口第 n 行的值 |
3. 分布函数
| 函数 | 说明 |
|---|---|
PERCENT_RANK() | 百分比排名(0 到 1) |
CUME_DIST() | 累积分布值 |
4. 聚合窗口函数
普通的聚合函数(SUM、AVG、COUNT、MAX、MIN)也可以作为窗口函数使用。
排名函数详解
排名函数是面试中最常考的窗口函数。三种排名函数的核心区别在于处理"并列"的方式。
ROW_NUMBER()
为每行分配唯一的连续编号,不考虑并列情况。
-- 示例数据:salary = [100, 100, 90, 80]
SELECT name, salary,
ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_num
FROM employees;
-- 结果:
-- | name | salary | row_num |
-- | A | 100 | 1 |
-- | B | 100 | 2 | <- 相同薪水,但编号不同
-- | C | 90 | 3 |
-- | D | 80 | 4 |
使用场景:
- 需要唯一编号时(如分页查询)
- 随机抽取排名(当有并列时)
RANK()
分配排名,相同值获得相同排名,但后续排名会跳过。
-- 示例数据:salary = [100, 100, 90, 80]
SELECT name, salary,
RANK() OVER (ORDER BY salary DESC) AS rank_num
FROM employees;
-- 结果:
-- | name | salary | rank_num |
-- | A | 100 | 1 |
-- | B | 100 | 1 | <- 相同薪水,相同排名
-- | C | 90 | 3 | <- 跳过了 2
-- | D | 80 | 4 |
使用场景:
- 竞赛排名(有并列时保留名次的"空缺")
- 需要反映真实排名情况
DENSE_RANK()
分配排名,相同值获得相同排名,后续排名不跳过。
-- 示例数据:salary = [100, 100, 90, 80]
SELECT name, salary,
DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rank
FROM employees;
-- 结果:
-- | name | salary | dense_rank |
-- | A | 100 | 1 |
-- | B | 100 | 1 | <- 相同薪水,相同排名
-- | C | 90 | 2 | <- 不跳过,继续是 2
-- | D | 80 | 3 |
使用场景:
- 需要连续排名(如"薪水等级")
- 查询"前 N 高"(如部门工资前三高)
排名函数对比表
| 场景 | 函数选择 |
|---|---|
| 分页查询,需要唯一编号 | ROW_NUMBER() |
| 竞赛排名,有空缺名次 | RANK() |
| 连续排名,无空缺 | DENSE_RANK() |
| 部门工资前三高 | DENSE_RANK() |
偏移函数详解
偏移函数允许你访问"其他行"的数据,这是窗口函数独有的能力。
LAG 函数
获取当前行之前第 n 行的值。
-- 语法
LAG(expr, n, default) OVER (...)
-- expr: 要获取的列或表达式
-- n: 向前偏移的行数(默认 1)
-- default: 当没有前 n 行时的默认值(默认 NULL)
示例:计算环比增长
SELECT date, revenue,
LAG(revenue, 1) OVER (ORDER BY date) AS prev_revenue,
revenue - LAG(revenue, 1) OVER (ORDER BY date) AS growth,
ROUND((revenue - LAG(revenue) OVER (ORDER BY date))
/ LAG(revenue) OVER (ORDER BY date) * 100, 2) AS growth_rate
FROM daily_revenue;
-- 结果:
-- | date | revenue | prev_revenue | growth | growth_rate |
-- | 2024-01-01 | 1000 | NULL | NULL | NULL |
-- | 2024-01-02 | 1200 | 1000 | 200 | 20.00 |
-- | 2024-01-03 | 1150 | 1200 | -50 | -4.17 |
示例:判断连续登录
-- 找出连续 3 天登录的用户
SELECT DISTINCT user_id
FROM (
SELECT user_id, login_date,
LAG(login_date, 1) OVER (PARTITION BY user_id ORDER BY login_date) AS prev1,
LAG(login_date, 2) OVER (PARTITION BY user_id ORDER BY login_date) AS prev2
FROM login_log
) t
WHERE DATEDIFF(login_date, prev1) = 1
AND DATEDIFF(prev1, prev2) = 1;
LEAD 函数
获取当前行之后第 n 行的值。语法与 LAG 相同。
-- 示例:查看下一天的销售额
SELECT date, revenue,
LEAD(revenue, 1) OVER (ORDER BY date) AS next_revenue
FROM daily_revenue;
FIRST_VALUE 和 LAST_VALUE
获取窗口内第一行和最后一行的值。
-- 计算每个部门员工工资与最高工资的差距
SELECT name, department, salary,
FIRST_VALUE(salary) OVER (
PARTITION BY department
ORDER BY salary DESC
) AS max_salary,
FIRST_VALUE(salary) OVER (
PARTITION BY department
ORDER BY salary DESC
) - salary AS diff
FROM employees;
注意:LAST_VALUE 的默认窗口范围是 RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW,这可能导致意外结果。如果需要整个分区的最后一个值,应该显式指定窗口范围:
SELECT name, department, salary,
LAST_VALUE(salary) OVER (
PARTITION BY department
ORDER BY salary
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS max_salary
FROM employees;
聚合窗口函数
普通聚合函数配合 OVER 子句可以成为窗口函数,实现滚动计算。
累计求和
-- 计算累计销售额
SELECT date, amount,
SUM(amount) OVER (ORDER BY date) AS running_total
FROM sales;
-- 等价于
SELECT date, amount,
SUM(amount) OVER (
ORDER BY date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS running_total
FROM sales;
分组内累计
-- 计算每个部门的累计销售额
SELECT department, month, revenue,
SUM(revenue) OVER (
PARTITION BY department
ORDER BY month
) AS dept_running_total
FROM monthly_revenue;
移动平均
-- 计算 3 天移动平均
SELECT date, amount,
AVG(amount) OVER (
ORDER BY date
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
) AS moving_avg_3
FROM daily_sales;
-- 计算 7 天移动平均(当前 + 前 6 天)
SELECT date, amount,
AVG(amount) OVER (
ORDER BY date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS moving_avg_7
FROM daily_sales;
分组内统计
-- 计算每个员工工资占部门总工资的比例
SELECT name, department, salary,
SUM(salary) OVER (PARTITION BY department) AS dept_total,
ROUND(salary / SUM(salary) OVER (PARTITION BY department) * 100, 2) AS pct
FROM employees;
窗口帧详解
窗口帧(Frame)定义了窗口函数的计算范围,是高级应用的关键。
帧语法
{ROWS | RANGE} BETWEEN frame_start AND frame_end
帧边界
| 边界 | 说明 |
|---|---|
UNBOUNDED PRECEDING | 分区的第一行 |
n PRECEDING | 当前行之前的第 n 行 |
CURRENT ROW | 当前行 |
n FOLLOWING | 当前行之后的第 n 行 |
UNBOUNDED FOLLOWING | 分区的最后一行 |
ROWS vs RANGE
ROWS:基于物理行号RANGE:基于逻辑值(排序键的值)
-- ROWS:精确的行数
ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING -- 前2行 + 当前行 + 后2行 = 5行
-- RANGE:值的范围(对于相同值的行会一起包含)
RANGE BETWEEN INTERVAL 7 DAY PRECEDING AND CURRENT ROW -- 过去7天的所有行
常用帧模式
-- 从开头到当前行(累计)
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
-- 当前行到末尾
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
-- 整个分区
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
-- 滑动窗口(前后 n 行)
ROWS BETWEEN n PRECEDING AND n FOLLOWING
实战案例
案例一:部门工资前三高
-- 使用 DENSE_RANK 找出每个部门工资前三高的员工
WITH ranked AS (
SELECT name, department_id, salary,
DENSE_RANK() OVER (
PARTITION BY department_id
ORDER BY salary DESC
) AS rnk
FROM employees
)
SELECT r.name AS Employee, d.name AS Department, r.salary AS Salary
FROM ranked r
JOIN departments d ON r.department_id = d.id
WHERE r.rnk <= 3;
为什么用 DENSE_RANK 而不是 RANK?
如果部门有两个并列第一的工资,用 DENSE_RANK,下一个是第二高;用 RANK,下一个是第三高(跳过了第二)。题目要求的是"前三高",所以用 DENSE_RANK。
案例二:连续出现三次的数字
-- 方法一:使用 LAG
SELECT DISTINCT num AS ConsecutiveNums
FROM (
SELECT num,
LAG(num, 1) OVER (ORDER BY id) AS prev1,
LAG(num, 2) OVER (ORDER BY id) AS prev2
FROM Logs
) t
WHERE num = prev1 AND num = prev2;
-- 方法二:使用自连接
SELECT DISTINCT l1.num AS ConsecutiveNums
FROM Logs l1
JOIN Logs l2 ON l1.id = l2.id - 1
JOIN Logs l3 ON l1.id = l3.id - 2
WHERE l1.num = l2.num AND l2.num = l3.num;
案例三:滚动累计直到达到阈值
-- 找出累计重量不超过 1000 的最后一个人
SELECT person_name
FROM (
SELECT person_name, weight, turn,
SUM(weight) OVER (ORDER BY turn) AS total_weight
FROM Queue
) t
WHERE total_weight <= 1000
ORDER BY total_weight DESC
LIMIT 1;
案例四:同比和环比计算
-- 年同比:与去年同期比较
-- 月环比:与上月比较
SELECT
year, month, revenue,
-- 月环比
LAG(revenue, 1) OVER (ORDER BY year, month) AS prev_month,
ROUND((revenue - LAG(revenue, 1) OVER (ORDER BY year, month))
/ LAG(revenue, 1) OVER (ORDER BY year, month) * 100, 2) AS month_growth,
-- 年同比
LAG(revenue, 12) OVER (ORDER BY year, month) AS prev_year,
ROUND((revenue - LAG(revenue, 12) OVER (ORDER BY year, month))
/ LAG(revenue, 12) OVER (ORDER BY year, month) * 100, 2) AS year_growth
FROM monthly_revenue;
性能考虑
窗口函数虽然强大,但也有性能上的考量:
1. 执行时机
窗口函数在 WHERE、GROUP BY、HAVING 之后执行,不能在这些子句中直接使用窗口函数的结果。
-- 错误:不能在 WHERE 中使用窗口函数
SELECT name, salary, RANK() OVER (ORDER BY salary DESC) AS rnk
FROM employees
WHERE rnk <= 3; -- 报错!
-- 正确:使用子查询或 CTE
WITH ranked AS (
SELECT name, salary, RANK() OVER (ORDER BY salary DESC) AS rnk
FROM employees
)
SELECT * FROM ranked WHERE rnk <= 3;
2. 索引使用
ORDER BY子句中的列如果有索引,可以提高性能- 多个窗口函数使用相同的
PARTITION BY和ORDER BY,可以共享排序结果
3. 窗口帧范围
- 大的窗口帧范围(如
UNBOUNDED PRECEDING)需要更多内存 - 尽量使用具体的行数限制(如
ROWS 100 PRECEDING)而非无界范围
4. 与子查询的比较
在许多情况下,窗口函数比相关子查询性能更好:
-- 子查询方式:对每行执行一次子查询
SELECT name, salary,
(SELECT AVG(salary) FROM employees e2 WHERE e2.dept_id = e1.dept_id) AS avg_salary
FROM employees e1;
-- 窗口函数方式:只排序一次
SELECT name, salary,
AVG(salary) OVER (PARTITION BY dept_id) AS avg_salary
FROM employees;
知识点总结
语法速查
-- 基本语法
函数名() OVER (
PARTITION BY 分组列
ORDER BY 排序列
ROWS/RANGE BETWEEN ... AND ...
)
-- 排名
ROW_NUMBER() OVER (ORDER BY col)
RANK() OVER (ORDER BY col)
DENSE_RANK() OVER (ORDER BY col)
-- 偏移
LAG(col, n, default) OVER (ORDER BY col)
LEAD(col, n, default) OVER (ORDER BY col)
-- 聚合窗口
SUM(col) OVER (ORDER BY col)
AVG(col) OVER (PARTITION BY grp ORDER BY col ROWS BETWEEN ... AND ...)
函数选择指南
| 需求 | 推荐函数 |
|---|---|
| 唯一编号 | ROW_NUMBER() |
| 竞赛排名 | RANK() |
| 连续排名 | DENSE_RANK() |
| 前一行值 | LAG(col, 1) |
| 下一行值 | LEAD(col, 1) |
| 累计求和 | SUM(col) OVER (ORDER BY ...) |
| 移动平均 | AVG(col) OVER (ORDER BY ... ROWS ...) |
| 分组排名 | DENSE_RANK() OVER (PARTITION BY ... ORDER BY ...) |
注意事项
- 窗口函数在
WHERE/GROUP BY/HAVING之后执行 RANK会跳号,DENSE_RANK不跳号LAST_VALUE默认窗口范围可能不是整个分区- 使用子查询或 CTE 过滤窗口函数结果
- 相同的
PARTITION BY和ORDER BY可以共享排序