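Data Transformation

Data transformation is a core step in data analysis, and Pandas provides a rich set of tools for transforming, mapping, and reshaping data. This chapter covers how to use apply, map, transform, and related methods to transform data efficiently.

Overview of Function Application

Pandas offers transformation methods at several levels of granularity: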

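Method	Applies to	Returns	Typical use
`map()`	Series, element-wise	Series	Mapping values in a single column
`apply()`	Series, or DataFrame rows/columns	Series/DataFrame	Complex row or column operations
`applymap()`	DataFrame, element-wise	DataFrame	Transforming every element (deprecated since Pandas 2.1 in favor of `DataFrame.map()`)
`transform()`	Series or DataFrame	Series/DataFrame	Shape-preserving transforms, often with groupby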

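The map Method

Series.map() maps each value in a Series to a new value. It accepts a function, a dictionary, or a Series as the mapping rule. Since Pandas 2.1, DataFrame also has an element-wise map() (see the applymap section).

Mapping with a Function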

import pandas as pd
import numpy as np

# Create sample data
s = pd.Series(['apple', 'banana', 'cherry', 'date'])

# Transform with a function
result = s.map(str.upper)
print(result)
# 0     APPLE
# 1    BANANA
# 2    CHERRY
# 3      DATE

# Transform with a lambda function
result = s.map(lambda x: x[:3].upper())
print(result)
# 0    APP
# 1    BAN
# 2    CHE
# 3    DAT
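
Function mapping calls the function on every value, including missing ones. The sketch below uses made-up data with one missing entry to show how the na_action='ignore' option of Series.map() passes missing values through instead of handing them to the function:

```python
import pandas as pd

# Illustrative data with one missing entry
s = pd.Series(['apple', None, 'cherry'])

# str.upper(None) would raise a TypeError, so skip missing values explicitly
upper = s.map(str.upper, na_action='ignore')
print(upper)
```

Without na_action='ignore', this call would fail on the missing entry, so the option is worth considering whenever the mapped function cannot handle NaN or None.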

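Mapping with a Dictionary

Dictionary mapping is the most common use of map(), and it is especially handy for converting categories:

# Create employee data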
df = pd.DataFrame({
    'name': ['张三', '李四', '王五', '赵六'],
    'department': ['技术', '销售', '技术', '人事'],
    'level': ['A', 'B', 'A', 'C']
})

# Map departments to department codes with a dictionary
dept_code = {'技术': 'TECH', '销售': 'SALES', '人事': 'HR'}
df['dept_code'] = df['department'].map(dept_code)
print(df)
#   name department level dept_code
# 0   张三       技术     A      TECH
# 1   李四       销售     B     SALES
# 2   王五       技术     A      TECH
# 3   赵六       人事     C        HR

# Map each level to a base salary
level_salary = {'A': 20000, 'B': 15000, 'C': 10000}
df['base_salary'] = df['level'].map(level_salary)

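Values that have no matching key in the dictionary are mapped to NaN. To keep the original value instead, combine map() with fillna():

# An incomplete mapping dictionary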
partial_map = {'技术': 'TECH'}
result = df['department'].map(partial_map)
print(result)
# 0    TECH
# 1     NaN
# 2    TECH
# 3     NaN

# Keep the original value where no mapping exists
result = df['department'].map(partial_map).fillna(df['department'])
print(result)
# 0    TECH
# 1      销售
# 2    TECH
# 3      人事

Mapping with a Series

A Series can also serve as the mapping rule; its index supplies the keys:

# Create a mapping Series
mapping = pd.Series({'技术': 'Technology', '销售': 'Sales', '人事': 'HR'})
df['dept_en'] = df['department'].map(mapping)

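The apply Method

apply() is the most flexible transformation method; it applies a custom function to a Series or a DataFrame.

apply on a Series

When called on a Series, apply() invokes the function on each element:

# Sample data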
s = pd.Series([1, 2, 3, 4, 5])

# Simple calculation
result = s.apply(lambda x: x ** 2)
print(result)
# 0     1
# 1     4
# 2     9
# 3    16
# 4    25

# Conditional transformation
def categorize(x):
    if x < 3:
        return 'low'
    elif x < 4:
        return 'medium'
    else:
        return 'high'

result = s.apply(categorize)
print(result)
# 0       low
# 1       low
# 2    medium
# 3      high
# 4      high
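
Hard-coded thresholds make a function like this difficult to reuse. As a sketch, Series.apply() can forward extra arguments to the function, either positionally through args or as keyword arguments. The parameter names low and high below are hypothetical, introduced only for illustration:

```python
import pandas as pd

def categorize(x, low, high):
    """Label x using two thresholds (low and high are illustrative names)."""
    if x < low:
        return 'low'
    elif x < high:
        return 'medium'
    return 'high'

s = pd.Series([1, 2, 3, 4, 5])

# Extra positional arguments go through args=...
labels = s.apply(categorize, args=(3, 4))

# ...or they can be passed as keyword arguments
labels_kw = s.apply(categorize, low=3, high=4)
print(labels)
```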

apply on a DataFrame

On a DataFrame, you can choose the axis along which the function is applied:

# Create sample data
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9]
})

# Default axis=0: apply the function to each column
col_sum = df.apply(lambda x: x.sum())
print(col_sum)
# A     6
# B    15
# C    24

# axis=1: apply the function to each row
row_sum = df.apply(lambda x: x.sum(), axis=1)
print(row_sum)
# 0    12
# 1    15
# 2    18

# Aggregation that returns several values per column
def describe_column(col):
    return pd.Series({
        'mean': col.mean(),
        'std': col.std(),
        'min': col.min(),
        'max': col.max()
    })

stats = df.apply(describe_column)
print(stats)
#          A    B    C
# mean   2.0  5.0  8.0
# std    1.0  1.0  1.0
# min    1.0  4.0  7.0
# max    3.0  6.0  9.0

Complex Row Operations

apply is particularly useful for operations that need to read several columns at once:

# Create employee data
df = pd.DataFrame({
    'name': ['张三', '李四', '王五'],
    'base_salary': [10000, 15000, 20000],
    'bonus_rate': [0.1, 0.15, 0.2]
})

# Compute total salary (several columns are involved)
def calculate_total(row):
    base = row['base_salary']
    bonus = base * row['bonus_rate']
    return base + bonus

df['total_salary'] = df.apply(calculate_total, axis=1)
print(df)
#   name  base_salary  bonus_rate  total_salary
# 0   张三        10000        0.10       11000.0
# 1   李四        15000        0.15       17250.0
# 2   王五        20000        0.20       24000.0
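
A row function is not limited to a single return value. As a sketch, returning a pd.Series from a row-wise apply() produces one output column per Series label. The function name split_pay and the column name bonus are assumptions made for this illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['张三', '李四', '王五'],
    'base_salary': [10000, 15000, 20000],
    'bonus_rate': [0.1, 0.15, 0.2],
})

def split_pay(row):
    """Return bonus and total as a Series, so each label becomes a column."""
    bonus = row['base_salary'] * row['bonus_rate']
    return pd.Series({'bonus': bonus, 'total_salary': row['base_salary'] + bonus})

pay = df.apply(split_pay, axis=1)   # DataFrame with columns bonus, total_salary
df = df.join(pay)
print(df)
```

For arithmetic this simple, plain column expressions would be faster; the pattern is more useful when the per-row logic cannot easily be vectorized.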

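Performance Tips

apply is flexible, but it is slower than vectorized operations because it calls a Python function once per row or element. The snippets below assume a DataFrame with numeric columns A and B:

# ❌ Slow: row-wise apply
df['total'] = df.apply(lambda row: row['A'] + row['B'], axis=1)

# ✅ Fast: vectorized operation
df['total'] = df['A'] + df['B']

# ❌ Slow: element-wise apply
df['squared'] = df['A'].apply(lambda x: x ** 2)

# ✅ Fast: vectorized arithmetic
df['squared'] = df['A'] ** 2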
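
To see the difference on your own machine, a rough timing sketch such as the following may help. The data is randomly generated for illustration, and the absolute numbers will vary with hardware and Pandas version:

```python
import timeit

import numpy as np
import pandas as pd

# Illustrative data: 10,000 rows of random integers
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'A': rng.integers(0, 100, 10_000),
    'B': rng.integers(0, 100, 10_000),
})

def with_apply():
    return df.apply(lambda row: row['A'] + row['B'], axis=1)

def vectorized():
    return df['A'] + df['B']

# Both approaches should produce identical values
same = with_apply().equals(vectorized())

t_apply = timeit.timeit(with_apply, number=3)
t_vec = timeit.timeit(vectorized, number=3)
print(f'same result: {same}  apply: {t_apply:.4f}s  vectorized: {t_vec:.4f}s')
```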

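The applymap Method

applymap() applies a function to every element of a DataFrame. It has been deprecated since Pandas 2.1, where DataFrame.map() is the recommended replacement:

# Create sample data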
df = pd.DataFrame({
    'A': [1.123, 2.456, 3.789],
    'B': [4.111, 5.222, 6.333]
})

# Format every element (applymap() emits a FutureWarning on Pandas 2.1+)
formatted = df.applymap(lambda x: f'{x:.2f}')
print(formatted)
#       A     B
# 0  1.12  4.11
# 1  2.46  5.22
# 2  3.79  6.33

# On Pandas 2.1+, use df.map() instead
formatted = df.map(lambda x: f'{x:.2f}')
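
Code that has to run on both older and newer Pandas versions could pick the method at runtime. The helper below is a sketch only; the name elementwise is hypothetical and not part of Pandas:

```python
import pandas as pd

def elementwise(df, func):
    """Apply func to every cell, preferring DataFrame.map when it exists."""
    if hasattr(pd.DataFrame, 'map'):    # Pandas 2.1 and later
        return df.map(func)
    return df.applymap(func)            # older Pandas versions

df = pd.DataFrame({
    'A': [1.123, 2.456, 3.789],
    'B': [4.111, 5.222, 6.333],
})
formatted = elementwise(df, lambda x: f'{x:.2f}')
print(formatted)
```

Note that the result holds strings. If the goal is only to limit precision while keeping numbers, df.round(2) is a vectorized alternative.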

The transform Method

transform() returns a result with the same shape as its input and is often used together with groupby.

Basic Usage

# Sample data
df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B', 'C'],
    'value': [1, 2, 3, 4, 5]
})

# With groupby: standardize the values within each group
df['normalized'] = df.groupby('group')['value'].transform(
    lambda x: (x - x.mean()) / x.std()
)
print(df)
#   group  value  normalized
# 0     A      1   -0.707107
# 1     A      2    0.707107
# 2     B      3   -0.707107
# 3     B      4    0.707107
# 4     C      5         NaN  # the sample standard deviation of a single value is undefined

# Rank the values within each group
df['rank_in_group'] = df.groupby('group')['value'].transform('rank')

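Multiple Transformations

# GroupBy.transform() takes one function at a time, so build one column per function
grouped = df.groupby('group')['value']
transformed = pd.DataFrame({
    name: grouped.transform(name) for name in ['mean', 'std', 'count']
})
print(transformed)
#    mean       std  count
# 0   1.5  0.707107      2
# 1   1.5  0.707107      2
# 2   3.5  0.707107      2
# 3   3.5  0.707107      2
# 4   5.0       NaN      1

# Attach group-level statistics as new columns
df['group_mean'] = df.groupby('group')['value'].transform('mean')
df['group_max'] = df.groupby('group')['value'].transform('max')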

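transform vs apply

# transform: the result has the same shape as the original data
df['group_mean'] = df.groupby('group')['value'].transform('mean')
# The result is aligned with df['value'], one value per row

# apply with an aggregating function: the result has the aggregated shape
group_means = df.groupby('group')['value'].apply(lambda x: x.mean())
# The result has one row per group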
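
Because the transformed result lines up with the original rows, it can be compared against the original column directly. A small sketch, reusing the sample data from this section, checks the shapes and then keeps only the rows that exceed their group mean:

```python
import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B', 'C'],
    'value': [1, 2, 3, 4, 5],
})
grouped = df.groupby('group')['value']

per_row = grouped.transform('mean')   # one value per original row
per_group = grouped.agg('mean')       # one value per group

# Row-aligned result makes boolean filtering straightforward
above_mean = df[df['value'] > per_row]
print(len(per_row), len(per_group))
print(above_mean)
```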

Data Type Conversion

The astype Method

# Create sample data
df = pd.DataFrame({
    'int_col': [1, 2, 3],
    'float_col': [1.1, 2.2, 3.3],
    'str_col': ['1', '2', '3'],
    'bool_col': [True, False, True]
})

# Convert a single column
df['int_col'] = df['int_col'].astype(float)

# Convert several columns at once
df = df.astype({
    'int_col': 'float64',
    'str_col': 'int64'
})

# Convert to strings
df['int_col'] = df['int_col'].astype(str)

# Convert to a categorical type (saves memory when values repeat)
df['category_col'] = df['str_col'].astype('category')
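
The memory saving from the categorical type depends on how often values repeat. A rough sketch with made-up data (30,000 rows but only three distinct labels) compares the two representations using memory_usage(deep=True):

```python
import pandas as pd

# Illustrative column: many rows, few distinct values
labels = pd.Series(['low', 'medium', 'high'] * 10_000, dtype='object')
as_category = labels.astype('category')

plain_bytes = labels.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(f'object: {plain_bytes} bytes, category: {category_bytes} bytes')
```

For a column where nearly every value is unique, the categorical type is unlikely to help and may even use more memory.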

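Smart Type Inference

# to_numeric: convert values to numbers
s = pd.Series(['1', '2.5', 'three', '4'])

# By default, a value that cannot be parsed raises an exception
# pd.to_numeric(s)  # ValueError

# Ignore errors: the input is returned unchanged
# (errors='ignore' is deprecated since Pandas 2.2)
result = pd.to_numeric(s, errors='ignore')
print(result)
# 0        1
# 1      2.5
# 2    three
# 3        4

# Coerce: values that cannot be parsed become NaN
result = pd.to_numeric(s, errors='coerce')
print(result)
# 0    1.0
# 1    2.5
# 2    NaN
# 3    4.0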
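
Coercion silently hides bad input, so it is often worth inspecting what was lost. One possible approach, sketched here, is to compare the coerced result with the original to list the values that failed to parse:

```python
import pandas as pd

s = pd.Series(['1', '2.5', 'three', '4'])
converted = pd.to_numeric(s, errors='coerce')

# Values that were present in the input but became NaN after conversion
bad_values = s[converted.isna() & s.notna()]
print(bad_values)
```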

# to_datetime: convert to datetime values
dates = pd.Series(['2024-01-01', '2024-02-15', '2024-03-20'])
df['date'] = pd.to_datetime(dates)

# Specify the format explicitly
df['date'] = pd.to_datetime(dates, format='%Y-%m-%d')
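
pd.to_datetime() accepts the same errors='coerce' option as to_numeric. A short sketch with one deliberately invalid entry (the data is made up) shows unparseable strings turning into NaT:

```python
import pandas as pd

raw = pd.Series(['2024-01-01', 'not a date', '2024-03-20'])

# Strings that do not match the format become NaT instead of raising
parsed = pd.to_datetime(raw, format='%Y-%m-%d', errors='coerce')
bad_rows = raw[parsed.isna()]
print(parsed)
print(bad_rows)
```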

String Operations

Pandas provides the .str accessor for working with string columns:

# Create sample data
df = pd.DataFrame({
    'name': ['  张三  ', '李四', '王 五  '],
    'email': ['[email protected]', '[email protected]', '[email protected]'],
    'phone': ['138-1234-5678', '139 5678 9012', '13612345678']
})

# Strip leading and trailing whitespace
df['name_clean'] = df['name'].str.strip()

# Change case
df['email_lower'] = df['email'].str.lower()
df['email_upper'] = df['email'].str.upper()
df['email_title'] = df['email'].str.title()

# Replace substrings (here: remove hyphens and whitespace)
df['phone_clean'] = df['phone'].str.replace(r'[-\s]', '', regex=True)

# Split strings
df['email_parts'] = df['email'].str.split('@')
df['email_domain'] = df['email'].str.split('@').str[1]

# Test whether a string contains a pattern
mask = df['email'].str.contains('example', case=False)

# Extract with a regular expression (expand=False returns a Series)
df['area_code'] = df['phone'].str.extract(r'(\d{3})', expand=False)

# String length
df['name_len'] = df['name'].str.len()

# Test whether a string starts or ends with given characters
mask_start = df['email'].str.startswith('ZHANG')
mask_end = df['email'].str.endswith('.com')
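
When a pattern has several parts, named groups in str.extract() give each part its own column. As a sketch using the phone numbers from this section, with the group names prefix, middle, and last chosen purely for illustration:

```python
import pandas as pd

phones = pd.Series(['138-1234-5678', '139 5678 9012', '13612345678'])

# Normalize first, then split the 11 digits into three named parts
digits = phones.str.replace(r'[-\s]', '', regex=True)
parts = digits.str.extract(r'^(?P<prefix>\d{3})(?P<middle>\d{4})(?P<last>\d{4})$')
print(parts)
```

Rows that do not match the pattern would come back as NaN in every column, which makes malformed numbers easy to spot.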

Conditional Transformation

np.where

import numpy as np

df = pd.DataFrame({
    'score': [85, 62, 91, 45, 78]
})

# Single condition
df['pass'] = np.where(df['score'] >= 60, 'Pass', 'Fail')

# Multiple conditions (nested np.where, evaluated from the highest threshold down)
df['grade'] = np.where(df['score'] >= 90, 'A',
                 np.where(df['score'] >= 80, 'B',
                 np.where(df['score'] >= 60, 'C', 'D')))
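
Nested np.where calls become hard to read as the number of branches grows. One alternative, sketched below, is np.select, which takes a list of conditions and a matching list of choices; the first condition that is true wins, so the order matters:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'score': [85, 62, 91, 45, 78]})

# Conditions are checked in order; the first match decides the grade
conditions = [
    df['score'] >= 90,
    df['score'] >= 80,
    df['score'] >= 60,
]
choices = ['A', 'B', 'C']

df['grade'] = np.select(conditions, choices, default='D')
print(df)
```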

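Binning with pd.cut

# Bin a continuous variable. right=False makes each bin include its left edge,
# so a score of 60 counts as 'Pass', in line with the np.where example;
# the last edge is 101 so that a score of 100 still falls into the top bin
df['score_level'] = pd.cut(
    df['score'],
    bins=[0, 60, 70, 80, 90, 101],
    labels=['Fail', 'Pass', 'Fair', 'Good', 'Excellent'],
    right=False
)

# Equal-frequency binning
df['score_quartile'] = pd.qcut(df['score'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])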
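
Bin edges are a common source of off-by-one surprises. By default pd.cut uses right-closed intervals, so the lowest edge is excluded and a value sitting exactly on an edge lands in the lower bin. The sketch below uses made-up boundary scores to compare the default with right=False:

```python
import pandas as pd

scores = pd.Series([0, 59, 60, 100])
bins = [0, 60, 100]
labels = ['fail', 'pass']

# Default right=True: intervals are (0, 60] and (60, 100]
right_closed = pd.cut(scores, bins=bins, labels=labels)

# right=False: intervals are [0, 60) and [60, 100)
left_closed = pd.cut(scores, bins=bins, labels=labels, right=False)

print(pd.DataFrame({'score': scores, 'right_closed': right_closed, 'left_closed': left_closed}))
```

With the default, 0 falls outside every bin and 60 is labeled 'fail'; with right=False, 60 is labeled 'pass' but 100 falls outside. Extending the last edge or passing include_lowest=True are two ways to cover the remaining endpoint.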

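Best Practices for Custom Functions

Prefer Vectorization

# ❌ Slow: element-wise operation
df['total'] = df['price'].apply(lambda x: x * 1.1)

# ✅ Fast: vectorized operation
df['total'] = df['price'] * 1.1

Use NumPy Functions

# ❌ Slow: np.log is called once per element from Python
df['log_value'] = df['value'].apply(lambda x: np.log(x))

# ✅ Fast: a single vectorized NumPy call
df['log_value'] = np.log(df['value'])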

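Caching Intermediate Results

# ❌ Slow: repeated computation (complex_calc stands for any expensive function)
df['result'] = df.apply(
    lambda row: complex_calc(row['A']) + complex_calc(row['A']) * 2,
    axis=1
)

# ✅ Fast: cache intermediate results so each distinct value is computed once
cache = {}
def cached_calc(x):
    if x not in cache:
        cache[x] = complex_calc(x)
    return cache[x]

df['result'] = df['A'].apply(cached_calc) * 3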
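
The standard library offers the same idea through functools.lru_cache, which avoids managing a dictionary by hand. In the sketch below, complex_calc is a hypothetical stand-in for an expensive pure function, and the calls counter exists only to show how often it actually runs:

```python
from functools import lru_cache

import pandas as pd

calls = 0  # illustration only: counts real executions of complex_calc

@lru_cache(maxsize=None)
def complex_calc(x):
    """Hypothetical expensive function; results are cached per argument."""
    global calls
    calls += 1
    return x ** 2 + 1

df = pd.DataFrame({'A': [1, 2, 2, 3, 3, 3]})
df['result'] = df['A'].map(complex_calc) * 3
print(df)
print(f'complex_calc ran {calls} times for {len(df)} rows')
```

This only pays off when the column contains repeated values and the function's arguments are hashable.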

Practical Examples

Example 1: Data Cleaning

# Raw data
df = pd.DataFrame({
    'name': ['  张三  ', '李四 ', '王 五'],
    'phone': ['13812345678', '139-5678-9012', '136 1234 5678'],
    'email': ['[email protected]', '[email protected]', '[email protected]  ']
})

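# Cleaning pipeline
def clean_data(df):
    # Work on a copy so the caller's DataFrame is left unchanged
    df = df.copy()

    # Strip whitespace
    df['name'] = df['name'].str.strip()
    df['email'] = df['email'].str.strip()

    # Normalize phone numbers
    df['phone'] = df['phone'].str.replace(r'[-\s]', '', regex=True)

    # Normalize email addresses
    df['email'] = df['email'].str.lower()

    return df

df_clean = clean_data(df)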
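
A cleaning step can also be written so that it never modifies its input. The sketch below uses assign(), which returns a new DataFrame, together with pipe() for chaining. The function name clean_contacts and the two-column sample data are assumptions made for this illustration:

```python
import pandas as pd

raw = pd.DataFrame({
    'name': ['  张三  ', '李四 ', '王 五'],
    'phone': ['13812345678', '139-5678-9012', '136 1234 5678'],
})

def clean_contacts(df):
    """Return a cleaned copy; assign() builds a new DataFrame and leaves df as it was."""
    return df.assign(
        name=df['name'].str.strip(),
        phone=df['phone'].str.replace(r'[-\s]', '', regex=True),
    )

cleaned = raw.pipe(clean_contacts)
print(cleaned)
```

Keeping the raw data untouched makes it easier to rerun or compare cleaning steps later.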

Example 2: Feature Engineering

# Raw data
df = pd.DataFrame({
    'transaction_time': pd.to_datetime([
        '2024-01-15 09:30:00',
        '2024-01-15 14:20:00',
        '2024-01-16 20:15:00'
    ]),
    'amount': [100, 5000, 200]
})

# Extract time-based features
df['year'] = df['transaction_time'].dt.year
df['month'] = df['transaction_time'].dt.month
df['day'] = df['transaction_time'].dt.day
df['hour'] = df['transaction_time'].dt.hour
df['day_of_week'] = df['transaction_time'].dt.dayofweek
df['is_weekend'] = df['day_of_week'].isin([5, 6])

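# Classify by time of day; right=False gives [0, 6), [6, 12), [12, 18), [18, 24),
# so hour 0 is included in the first bin
df['time_period'] = pd.cut(
    df['hour'],
    bins=[0, 6, 12, 18, 24],
    labels=['Early morning', 'Morning', 'Afternoon', 'Evening'],
    right=False
)

# Classify by amount
df['amount_level'] = pd.cut(
    df['amount'],
    bins=[0, 100, 1000, 10000],
    labels=['Small', 'Medium', 'Large']
)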

Example 3: Complex Transformations

# Compute moving averages and normalize values
df = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=10),
    'value': [10, 12, 15, 14, 18, 20, 19, 22, 25, 23]
})

# Moving average (the first two rows are NaN because the window is incomplete)
df['ma_3'] = df['value'].rolling(window=3).mean()

# Exponential moving average
df['ema'] = df['value'].ewm(span=3).mean()

# Z-score standardization
df['z_score'] = (df['value'] - df['value'].mean()) / df['value'].std()

# Min-max normalization
df['normalized'] = (df['value'] - df['value'].min()) / (df['value'].max() - df['value'].min())

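Summary

Guide to choosing a method:

Need	Recommended method
Mapping values in a single column	`map()`
Complex row- or column-level operations	`apply()`
Element-wise operations	`applymap()` or `map()`
Transformations combined with grouping	`transform()`
Type conversion	`astype()` or `to_numeric()`
String processing	`.str` accessor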

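Performance tips:

Prefer vectorized operations
Avoid loops inside apply
Use NumPy functions on whole columns instead of calling Python functions per element
Consider chunked processing for large datasets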
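
As a sketch of the chunked-processing idea, pd.read_csv() can return an iterator of smaller DataFrames when given chunksize. The in-memory CSV text below is a made-up stand-in for a large file on disk:

```python
import io

import pandas as pd

# Illustrative stand-in for a large CSV file
rows = [f'{g},{i}' for i, g in enumerate('ABABABABAB')]
csv_text = 'group,value\n' + '\n'.join(rows)

# Aggregate each chunk separately, then combine the partial results
partials = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    partials.append(chunk.groupby('group')['value'].sum())

totals = pd.concat(partials).groupby(level=0).sum()
print(totals)
```

This pattern works for aggregations that can be combined from partial results, such as sums and counts; a mean would need the sum and count carried separately.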

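Exercises

Use map() to convert employee levels into salary grades
Use apply() to compute a weighted average for each row
Use transform() to compute the rank within each group
Write a string-cleaning function that handles whitespace, letter case, and special characters
Use pd.cut() to convert a continuous age variable into age brackets

Next Steps

Now that you have mastered data transformation, let's move on to merging data!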
