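Time Series

Time series are one of the most common kinds of data in analysis work. Pandas provides extensive support for them, including creating dates and times, indexing by time, resampling, and moving-window calculations.

Time Data Types

Pandas works with four time-related concepts:

Concept	Scalar type	Array type	Description
Timestamp	`Timestamp`	`DatetimeIndex`	A specific point in time
Time delta	`Timedelta`	`TimedeltaIndex`	An absolute duration
Time span	`Period`	`PeriodIndex`	A span of time (e.g. January 2024)
Date offset	`DateOffset`	-	A calendar-aware relative duration (e.g. one business day)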
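
The table can be made concrete with one value of each type. The snippet below is a minimal sketch; the variable names and dates are chosen only for illustration.

```python
import pandas as pd

# Scalar types: one value each
ts = pd.Timestamp("2024-01-15 14:30")   # a point in time
td = pd.Timedelta(days=1, hours=6)      # an absolute duration
p = pd.Period("2024-01", freq="M")      # a span: all of January 2024
off = pd.DateOffset(months=1)           # a calendar-aware relative step

# Array types: many values, usable as an index
dti = pd.DatetimeIndex(["2024-01-15", "2024-01-16"])
tdi = pd.TimedeltaIndex(["1 day", "2 days"])
pi = pd.PeriodIndex(["2024-01", "2024-02"], freq="M")

print(ts + td)                            # Timestamp + Timedelta gives a Timestamp
print(pd.Timestamp("2024-01-31") + off)   # one calendar month later, clipped to 2024-02-29
print(p.start_time, p.end_time)           # the boundaries of the period
```

Note the difference between `Timedelta` and `DateOffset`: the first always adds a fixed amount of time, while the second follows the calendar, so "one month" can be 28 to 31 days long.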

Creating Time Data

Timestamp

import pandas as pd
import numpy as np

# Create a timestamp
ts = pd.Timestamp('2024-01-15')
print(ts)  # 2024-01-15 00:00:00

# With a time of day
ts = pd.Timestamp('2024-01-15 14:30:00')
ts = pd.Timestamp('2024/01/15 2:30 PM')
ts = pd.Timestamp(year=2024, month=1, day=15, hour=14, minute=30)

# From a Unix timestamp (seconds since the epoch, interpreted as UTC)
ts = pd.Timestamp(1705329000, unit='s')  # 2024-01-15 14:30:00

# Timestamp attributes
print(ts.year)        # 2024
print(ts.month)       # 1
print(ts.day)         # 15
print(ts.hour)        # 14
print(ts.minute)      # 30
print(ts.dayofweek)   # 0 (Monday=0)
print(ts.day_name())  # 'Monday'
print(ts.month_name())  # 'January'

Converting with to_datetime

# Convert from strings
dates = pd.to_datetime(['2024-01-01', '2024-02-15', '2024-03-20'])
print(dates)
# DatetimeIndex(['2024-01-01', '2024-02-15', '2024-03-20'], dtype='datetime64[ns]', freq=None)

# Specify the format explicitly
dates = pd.to_datetime(['01/15/2024', '02/20/2024'], format='%m/%d/%Y')

# Handle inconsistently formatted data (format='mixed' requires pandas 2.0+).
# Each element is parsed on its own; ambiguous strings such as '01/02/2024'
# are read month-first unless dayfirst=True is passed.
dates = pd.to_datetime(['2024-01-15', '15/02/2024', 'March 20, 2024'], format='mixed')

# Error handling
dates = pd.to_datetime(['2024-01-15', 'invalid'], errors='coerce')  # invalid values become NaT
# DatetimeIndex(['2024-01-15', 'NaT'], dtype='datetime64[ns]', freq=None)

# Assemble dates from DataFrame columns
df = pd.DataFrame({
    'year': [2024, 2024, 2024],
    'month': [1, 2, 3],
    'day': [15, 20, 25]
})
df['date'] = pd.to_datetime(df[['year', 'month', 'day']])

Generating Sequences with date_range

# Generate a range of dates
dates = pd.date_range('2024-01-01', '2024-01-10')
print(dates)
# DatetimeIndex(['2024-01-01', '2024-01-02', ..., '2024-01-10'], freq='D')

# Specify the number of periods
dates = pd.date_range('2024-01-01', periods=5)

# Specify the frequency
dates = pd.date_range('2024-01-01', periods=5, freq='h')   # hourly
dates = pd.date_range('2024-01-01', periods=5, freq='W')   # weekly
dates = pd.date_range('2024-01-01', periods=5, freq='ME')  # month end
dates = pd.date_range('2024-01-01', periods=5, freq='MS')  # month start
dates = pd.date_range('2024-01-01', periods=5, freq='B')   # business days
dates = pd.date_range('2024-01-01', periods=5, freq='QE')  # quarter end

# Multiples of a frequency
dates = pd.date_range('2024-01-01', periods=5, freq='2D')   # every 2 days
dates = pd.date_range('2024-01-01', periods=5, freq='3h')   # every 3 hours
dates = pd.date_range('2024-01-01 09:00', periods=5, freq='30min')  # every 30 minutes

Common Frequency Strings

The aliases below follow pandas 2.2 and later, where `ME`, `QE`, `YE`, `h`, `min`, and `s` replace the deprecated `M`, `Q`, `Y`, `H`, `T`, and `S`.

Alias	Description
`D`	Calendar day
`B`	Business day
`W`	Week
`ME`	Month end
`MS`	Month start
`QE`	Quarter end
`QS`	Quarter start
`YE`	Year end
`YS`	Year start
`h`	Hour
`min`	Minute
`s`	Second
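
To see what a few of these aliases actually generate, the sketch below prints the first three dates for each one. It sticks to aliases whose spelling has not changed between pandas versions; the `examples` dictionary is just a helper for this illustration.

```python
import pandas as pd

start = "2024-01-01"  # a Monday, and the first day of a month and quarter
examples = {}
for alias in ["D", "B", "W", "MS", "QS"]:
    idx = pd.date_range(start, periods=3, freq=alias)
    examples[alias] = [d.strftime("%Y-%m-%d") for d in idx]
    print(f"{alias:>2}: {examples[alias]}")
```

Anchored aliases snap to their anchor: `W` defaults to weeks ending on Sunday, so its first date is 2024-01-07 rather than the start date itself.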

Time Deltas with Timedelta

# Create a time delta
td = pd.Timedelta(days=5)
td = pd.Timedelta('5 days')
td = pd.Timedelta('5 days 3 hours 30 minutes')
td = pd.Timedelta(weeks=2, days=3, hours=4)

# Arithmetic with timestamps
ts = pd.Timestamp('2024-01-15')
print(ts + pd.Timedelta(days=7))   # add 7 days
print(ts - pd.Timedelta(hours=24))  # subtract 24 hours

# Sequences of time deltas
tds = pd.to_timedelta(['1 day', '2 days', '3 hours'])
tds = pd.to_timedelta([1, 2, 3], unit='day')

# Subtracting datetime columns yields time deltas
df = pd.DataFrame({
    'start': pd.to_datetime(['2024-01-01', '2024-01-15']),
    'end': pd.to_datetime(['2024-01-10', '2024-02-01'])
})
df['duration'] = df['end'] - df['start']
print(df['duration'])
# 0   9 days
# 1  17 days

# Extract the number of whole days
df['days'] = df['duration'].dt.days
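
One detail worth knowing: `.dt.days` returns only the whole-day component and drops any remaining hours. The sketch below, using made-up durations, shows two ways to get fractional values instead.

```python
import pandas as pd

durations = pd.Series(pd.to_timedelta(["1 day 12 hours", "36 hours", "90 minutes"]))

whole_days = durations.dt.days                           # whole-day component only
fractional_days = durations.dt.total_seconds() / 86400   # 86400 seconds per day
as_hours = durations / pd.Timedelta(hours=1)             # divide by a unit to convert

print(whole_days.tolist())       # [1, 1, 0]
print(fractional_days.tolist())  # [1.5, 1.5, 0.0625]
print(as_hours.tolist())         # [36.0, 36.0, 1.5]
```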

Time Spans with Period

# Create a period
p = pd.Period('2024-01', freq='M')  # January 2024
print(p)  # 2024-01

# Period arithmetic
print(p + 1)  # 2024-02 (next month)
print(p - 1)  # 2023-12 (previous month)

# Range of periods
periods = pd.period_range('2024-01', '2024-12', freq='M')
print(periods)
# PeriodIndex(['2024-01', '2024-02', ..., '2024-12'], dtype='period[M]')

# Conversion
p = pd.Period('2024-01-15', freq='D')
print(p.to_timestamp())  # convert to a timestamp (start of the period)

ts = pd.Timestamp('2024-01-15')
print(ts.to_period('M'))  # convert to a monthly period

Time-Based Indexing

DatetimeIndex

# Create time series data
dates = pd.date_range('2024-01-01', periods=100, freq='D')
ts = pd.Series(np.random.randn(100), index=dates)

# Select by date
print(ts['2024-01-15'])           # a single day
print(ts['2024-01'])              # a whole month
print(ts['2024-01-01':'2024-01-10'])  # a date range (both ends inclusive)
print(ts.loc['2024-01-15'])       # using loc

# Partial string indexing
print(ts['2024-01-15':'2024-01-20'])
print(ts['2024'])                 # a whole year

The dt Accessor

# Create a DataFrame with a date column
df = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=5),
    'value': [1, 2, 3, 4, 5]
})

# Extract date components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['hour'] = df['date'].dt.hour
df['minute'] = df['date'].dt.minute
df['second'] = df['date'].dt.second
df['dayofweek'] = df['date'].dt.dayofweek
df['dayofyear'] = df['date'].dt.dayofyear
df['weekofyear'] = df['date'].dt.isocalendar().week
df['quarter'] = df['date'].dt.quarter
df['is_month_start'] = df['date'].dt.is_month_start
df['is_month_end'] = df['date'].dt.is_month_end
df['is_year_start'] = df['date'].dt.is_year_start
df['is_year_end'] = df['date'].dt.is_year_end
df['is_weekend'] = df['date'].dt.dayofweek >= 5  # Saturday=5, Sunday=6

print(df)
#         date  value  year  month  day  ...  is_month_end  is_year_start  is_year_end  is_weekend
# 0 2024-01-01      1  2024      1    1  ...         False           True        False       False
# 1 2024-01-02      2  2024      1    2  ...         False          False        False       False

Time Zones

# Create a time series (naive, with no time zone)
ts = pd.date_range('2024-01-01', periods=5, freq='h')

# Localize (attach a time zone)
ts_utc = ts.tz_localize('UTC')
print(ts_utc)
# DatetimeIndex(['2024-01-01 00:00:00+00:00', ...], dtype='datetime64[ns, UTC]', freq='h')

# Convert to another time zone
ts_shanghai = ts_utc.tz_convert('Asia/Shanghai')
print(ts_shanghai)
# DatetimeIndex(['2024-01-01 08:00:00+08:00', ...], dtype='datetime64[ns, Asia/Shanghai]', freq='h')

ts_newyork = ts_utc.tz_convert('America/New_York')
print(ts_newyork)

# Specify the time zone at creation
ts = pd.date_range('2024-01-01', periods=5, freq='h', tz='Asia/Shanghai')

# Commonly used time zones
# 'UTC' - Coordinated Universal Time
# 'Asia/Shanghai' - Shanghai time (Beijing time)
# 'America/New_York' - New York time
# 'Europe/London' - London time
# 'Asia/Tokyo' - Tokyo time
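
The two methods are easy to mix up: `tz_localize` attaches a zone to naive wall-clock time, while `tz_convert` re-expresses the same instant in another zone. The following sketch illustrates this on a single timestamp, including the two ways of dropping the zone again; the variable names are only for illustration.

```python
import pandas as pd

naive = pd.Timestamp("2024-01-01 08:00")
utc = naive.tz_localize("UTC")               # interpret the wall time as UTC
shanghai = utc.tz_convert("Asia/Shanghai")   # same instant, shown in UTC+8

print(shanghai)          # 2024-01-01 16:00:00+08:00
print(utc == shanghai)   # True: both describe the same moment

# Dropping the time zone
local_wall = shanghai.tz_localize(None)   # keeps the local wall time (16:00)
utc_wall = shanghai.tz_convert(None)      # converts to UTC first (08:00)
print(local_wall, utc_wall)
```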

Resampling

Resampling changes the frequency of a time series and is one of the core operations in time series analysis.

Downsampling (high frequency → low frequency)

# Create high-frequency data
dates = pd.date_range('2024-01-01', periods=100, freq='D')
ts = pd.Series(np.random.randn(100).cumsum(), index=dates)

# Monthly mean
monthly = ts.resample('ME').mean()
print(monthly.head())
# Illustrative output; actual values vary because the data is random
# 2024-01-31    0.523
# 2024-02-29    1.234
# ...

# Other aggregations
monthly_sum = ts.resample('ME').sum()       # monthly total
monthly_ohlc = ts.resample('ME').ohlc()     # open, high, low, close
monthly_first = ts.resample('ME').first()   # first value of each month
monthly_last = ts.resample('ME').last()     # last value of each month

# Several aggregations at once
monthly_stats = ts.resample('ME').agg(['mean', 'std', 'min', 'max'])

# Weekly data
weekly = ts.resample('W').mean()

# Quarterly data
quarterly = ts.resample('QE').mean()

Upsampling (low frequency → high frequency)

# Create low-frequency data (month ends: 2024-01-31, 2024-02-29, 2024-03-31)
monthly = pd.Series([1, 2, 3], index=pd.date_range('2024-01-01', periods=3, freq='ME'))

# Upsample to daily frequency
daily = monthly.resample('D').asfreq()  # NaN on days that are not month ends
daily_ffill = monthly.resample('D').ffill()  # forward fill
daily_bfill = monthly.resample('D').bfill()  # backward fill
daily_interp = monthly.resample('D').interpolate()  # linear interpolation

print(daily_ffill)
# 2024-01-31    1
# 2024-02-01    1
# 2024-02-02    1
# ...
# 2024-02-29    2

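Resampling Parameters

# label: which bin edge is used as the label
ts.resample('ME', label='right').mean()   # label with the right edge, the month end (default for 'ME')
ts.resample('ME', label='left').mean()    # label with the left edge, i.e. the previous month end

# closed: which side of each bin is inclusive
ts.resample('ME', closed='right').mean()  # right-closed bins (default for 'ME')
ts.resample('ME', closed='left').mean()   # left-closed bins

# origin: anchor point for the bin edges (shown here with fixed-width 7-day bins)
ts.resample('7D', origin='start').mean()   # anchor at the first timestamp in the data
ts.resample('7D', origin='end').mean()     # anchor at the last timestamp in the data
ts.resample('7D', origin='epoch').mean()   # anchor at the Unix epoch (1970-01-01)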
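
The effect of `closed` and `label` is easiest to see on a tiny series. The sketch below uses six daily values and 2-day bins (a made-up setup, not the random series above) so that the bin contents can be checked by hand.

```python
import pandas as pd

# Values 1..6 on 2024-01-01 .. 2024-01-06
s = pd.Series(range(1, 7), index=pd.date_range("2024-01-01", periods=6, freq="D"))

# Default for '2D': left-closed bins [Jan 1, Jan 3), [Jan 3, Jan 5), ... labeled by the left edge
left = s.resample("2D").sum()

# Right-closed bins (Dec 30, Jan 1], (Jan 1, Jan 3], ... labeled by the right edge
right = s.resample("2D", closed="right", label="right").sum()

print(left)    # Jan 1 -> 3, Jan 3 -> 7, Jan 5 -> 11
print(right)   # Jan 1 -> 1, Jan 3 -> 5, Jan 5 -> 9, Jan 7 -> 6
```

With right-closed bins the first observation falls into a bin of its own, because the bin ending on January 1 includes that date but nothing before it.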

Moving Windows

Rolling Windows with rolling

# Create a time series
ts = pd.Series(np.random.randn(100).cumsum(), index=pd.date_range('2024-01-01', periods=100))

# 7-day rolling mean
ts_rolling = ts.rolling(window=7).mean()

# Several statistics
rolling_stats = pd.DataFrame({
    'original': ts,
    'rolling_mean': ts.rolling(7).mean(),
    'rolling_std': ts.rolling(7).std(),
    'rolling_min': ts.rolling(7).min(),
    'rolling_max': ts.rolling(7).max(),
    'rolling_sum': ts.rolling(7).sum()
})

# Minimum number of observations
ts.rolling(window=7, min_periods=3).mean()  # emit a result once the window holds at least 3 values

# Centered window
ts.rolling(window=7, center=True).mean()  # window centered on the current point

# Custom aggregation
ts.rolling(7).apply(lambda x: x.max() - x.min())

# Several statistics over the same window
result = ts.rolling(7).agg(['mean', 'std', 'median'])
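
The examples above use an integer window, which counts rows. On a `DatetimeIndex`, `rolling` also accepts an offset string such as `'3D'`, which counts elapsed time instead. The sketch below contrasts the two on a small, irregularly spaced series invented for this purpose.

```python
import pandas as pd

idx = pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03",
                      "2024-01-10", "2024-01-11"])
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=idx)

by_count = s.rolling(3).mean()     # the last 3 rows, regardless of the dates
by_time = s.rolling("3D").mean()   # rows within the 3 days up to and including each timestamp

print(pd.DataFrame({"by_count": by_count, "by_time": by_time}))
```

The count-based window averages January 2, 3, and 10 together even though a week lies between them, whereas the time-based window on January 10 only sees that day. Offset windows also default to `min_periods=1`, so they produce no leading NaN values.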

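Expanding Windows with expanding

# Expanding window: from the first observation up to the current one
ts_expanding = ts.expanding().mean()  # cumulative mean

# Other statistics
ts.expanding().sum()     # cumulative sum
ts.expanding().max()     # cumulative maximum
ts.expanding().std()     # cumulative standard deviation

# Minimum number of observations
ts.expanding(min_periods=10).mean()

Exponentially Weighted Windows with ewm

# Exponentially weighted moving average (recent data gets more weight)
ts_ewm = ts.ewm(span=7).mean()  # span of 7

# Alternative ways to specify the decay
ts.ewm(alpha=0.3).mean()      # smoothing factor
ts.ewm(com=0.5).mean()        # center of mass
ts.ewm(halflife=3).mean()     # half-life

# Other statistics
ts.ewm(span=7).std()
ts.ewm(span=7).var()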

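Moving Windows vs Resampling

# Resampling: changes the frequency (downsampling yields fewer points)
monthly = ts.resample('ME').mean()

# Rolling window: computed at the original frequency (same number of points)
rolling = ts.rolling(7).mean()

# Combined: regularize to daily frequency, then smooth with a 7-day window
daily_rolling = ts.resample('D').mean().rolling(7).mean()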
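
The difference in output size can be checked directly. This sketch uses 28 days of deterministic values (an assumption made so that the numbers are reproducible) and compares 7-day bins with a 7-row rolling window.

```python
import numpy as np
import pandas as pd

ts_small = pd.Series(np.arange(28.0),
                     index=pd.date_range("2024-01-01", periods=28, freq="D"))

binned = ts_small.resample("7D").mean()   # one value per non-overlapping 7-day bin
smoothed = ts_small.rolling(7).mean()     # one value per original row

print(len(ts_small), len(binned), len(smoothed))   # 28 4 28
print(binned.tolist())                             # [3.0, 10.0, 17.0, 24.0]
print(smoothed.isna().sum())                       # 6 leading NaN values
```

Resampling partitions the data into disjoint bins, while a rolling window slides one row at a time and therefore overlaps with its neighbors.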

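Shifting in Time

Shifting Values with shift

# Create data
df = pd.DataFrame({
    'value': [1, 2, 3, 4, 5]
}, index=pd.date_range('2024-01-01', periods=5))

# Lag: the index stays fixed and the values move down
df['shifted'] = df['value'].shift(1)  # each row gets the previous row's value
print(df)
#            value  shifted
# 2024-01-01      1      NaN
# 2024-01-02      2      1.0
# 2024-01-03      3      2.0
# 2024-01-04      4      3.0
# 2024-01-05      5      4.0

# Lead: each row gets the next row's value
df['shifted_back'] = df['value'].shift(-1)

# Compute changes
df['change'] = df['value'] - df['value'].shift(1)
df['pct_change'] = df['value'].pct_change()  # fractional change from the previous row

Shifting the Index (tshift, removed)

# ⚠️ tshift was deprecated and then removed in pandas 2.0; use shift with the freq parameter
df.shift(1, freq='D')  # move the index forward by 1 day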
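
The two forms of `shift` move different things. A small sketch with three made-up values shows the contrast.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=pd.date_range("2024-01-01", periods=3, freq="D"))

moved_values = s.shift(1)            # same index, values lagged, first value becomes NaN
moved_index = s.shift(1, freq="D")   # same values, every index label moved one day later

print(moved_values)
print(moved_index)
```

Because `shift(1, freq='D')` relabels the data rather than realigning it, no values are lost and no NaN is introduced.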

Time Series in Practice

Example 1: Stock Data Analysis

# Simulate stock data
np.random.seed(42)
dates = pd.date_range('2023-01-01', '2024-12-31', freq='B')  # business days
prices = 100 + np.random.randn(len(dates)).cumsum()
volume = np.random.randint(100000, 500000, len(dates))

stock = pd.DataFrame({
    'close': prices,
    'volume': volume
}, index=dates)

# Compute technical indicators
stock['ma_5'] = stock['close'].rolling(5).mean()       # 5-day moving average
stock['ma_20'] = stock['close'].rolling(20).mean()     # 20-day moving average
stock['ewm_12'] = stock['close'].ewm(span=12).mean()   # 12-day exponential moving average
stock['returns'] = stock['close'].pct_change()          # daily return
stock['volatility'] = stock['returns'].rolling(20).std() # 20-day volatility

# Monthly statistics
monthly_stats = stock.resample('ME').agg({
    'close': ['first', 'last', 'max', 'min'],
    'volume': 'sum'
})
# Flatten the two-level column index produced by agg
monthly_stats.columns = ['open', 'close', 'high', 'low', 'volume']

# Monthly return (last close of the month relative to the first)
monthly_stats['monthly_return'] = monthly_stats['close'] / monthly_stats['open'] - 1

print(monthly_stats.head())

Example 2: Time-Based Analysis of Sales Data

# Create sales data
np.random.seed(42)
dates = pd.date_range('2023-01-01', '2024-12-31', freq='D')
sales = pd.DataFrame({
    'date': dates,
    'sales': np.random.randint(100, 500, len(dates)) + 
             50 * np.sin(np.arange(len(dates)) * 2 * np.pi / 365)  # add seasonality
})
sales = sales.set_index('date')

# Add time features
sales['year'] = sales.index.year
sales['month'] = sales.index.month
sales['weekday'] = sales.index.dayofweek
sales['is_weekend'] = sales['weekday'] >= 5

# Weekdays vs weekends
weekend_analysis = sales.groupby('is_weekend')['sales'].mean()

# Monthly trend
monthly_trend = sales.resample('ME').sum()

# Year-over-year analysis
# shift(365) looks 365 rows back, which is only approximately the same date last year:
# 2024 is a leap year, so rows after 2024-02-29 are compared with the following calendar day of 2023
sales['sales_ly'] = sales['sales'].shift(365)
sales['yoy_growth'] = (sales['sales'] - sales['sales_ly']) / sales['sales_ly']

# Smooth with a 7-day moving average
sales['sales_ma7'] = sales['sales'].rolling(7).mean()
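
Growth rates are often computed on monthly totals rather than daily rows, which sidesteps the leap-day misalignment of a 365-row shift. The sketch below assumes a synthetic monthly series rising by one unit per month (not the `sales` data above) and uses `pct_change` with different lags.

```python
import numpy as np
import pandas as pd

# 24 months of synthetic totals: 100, 101, ..., 123
idx = pd.date_range("2023-01-01", periods=24, freq="MS")
monthly_total = pd.Series(np.arange(100.0, 124.0), index=idx)

mom = monthly_total.pct_change(1)    # month-over-month: compare with the previous month
yoy = monthly_total.pct_change(12)   # year-over-year: compare with the same month last year

print(mom.loc["2023-02-01"])   # 101 / 100 - 1, about 0.01
print(yoy.loc["2024-01-01"])   # 112 / 100 - 1, about 0.12
```

The first 12 entries of `yoy` are NaN because there is no earlier year to compare against.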

Example 3: Handling Missing Values in a Time Series

# Create a time series with missing values
dates = pd.date_range('2024-01-01', '2024-01-10', freq='D')
values = [1, np.nan, 3, np.nan, np.nan, 6, 7, np.nan, 9, 10]
ts = pd.Series(values, index=dates)

# Forward fill
ts_ffill = ts.ffill()

# Backward fill
ts_bfill = ts.bfill()

# Linear interpolation (treats rows as equally spaced)
ts_interp = ts.interpolate(method='linear')

# Time-based interpolation (accounts for the actual time between observations)
ts_time_interp = ts.interpolate(method='time')

# Polynomial interpolation (requires SciPy)
ts_poly_interp = ts.interpolate(method='polynomial', order=2)

# Compare the results
result = pd.DataFrame({
    'original': ts,
    'ffill': ts_ffill,
    'bfill': ts_bfill,
    'linear': ts_interp,
    'time': ts_time_interp
})
print(result)
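
On evenly spaced daily data like the series above, `'linear'` and `'time'` give identical results. They only diverge when the timestamps are irregular, as in this small sketch with made-up values.

```python
import numpy as np
import pandas as pd

# Gaps of 1 day and then 3 days between observations
idx = pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-05"])
gappy = pd.Series([0.0, np.nan, 4.0], index=idx)

linear = gappy.interpolate(method="linear")   # midpoint between neighbors by row position
by_time = gappy.interpolate(method="time")    # weighted by elapsed time

print(linear.iloc[1])    # 2.0: halfway between 0 and 4
print(by_time.iloc[1])   # 1.0: one day into a four-day interval
```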

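Performance Tips

Use a Time Index to Speed Up Queries

# For large datasets, index by time
dates = pd.date_range('2020-01-01', periods=1000000, freq='min')  # one row per minute, through late 2021
ts = pd.Series(np.random.randn(1000000), index=dates)

# Time range query (takes advantage of the sorted index)
result = ts['2021-01']  # fast lookup of a whole month

# Using a slice
result = ts.loc['2021-01-01':'2021-01-31']  # efficient

Avoid Row-by-Row Operations

# Assumes df has a datetime64 column named 'date'

# ❌ Slow: iterate row by row
for idx in df.index:
    df.loc[idx, 'weekday'] = df.loc[idx, 'date'].weekday()

# ✅ Fast: vectorized operation
df['weekday'] = df['date'].dt.weekday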

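Summary

Time data types:

Timestamp / DatetimeIndex: points in time
Timedelta / TimedeltaIndex: durations
Period / PeriodIndex: spans of time

Common operations:

Creation: to_datetime(), date_range()
Extraction: the .dt accessor
Resampling: resample()
Moving windows: rolling(), ewm(), expanding()
Shifting: shift()

Best practices:

Set the time column as the index to speed up queries
Use vectorized operations instead of loops
Choose the resampling frequency and window size to suit the data

Exercises

Create a time series covering a full year and compute the mean and sum for each month
Use a rolling window to compute the 7-day moving average and standard deviation
Handle missing values in a time series and compare the results of different interpolation methods
Convert a time series from one time zone to another
Compute the year-over-year and period-over-period growth rates of a time series

Next Steps

With time series covered, the next topic is advanced indexing.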
