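Scikit-learn Data Preprocessing

Data preprocessing is a key step in machine learning, and scikit-learn provides a rich set of tools for it.

Why Preprocess?

Raw data often has the following problems:

  • Features are on inconsistent numeric scales
  • Values are missing
  • Categorical features need to be encoded
  • Outliers are present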

Feature Scaling

StandardScaler

Standardization: rescales each feature to zero mean and unit variance.

from sklearn.preprocessing import StandardScaler
import numpy as np

# Sample data
X = np.array([[1, 2], [3, 4], [5, 6]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled)
# Output:
# [[-1.22474487 -1.22474487]
#  [ 0.          0.        ]
#  [ 1.22474487  1.22474487]]
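
The standardization above amounts to subtracting each column's mean and dividing by its standard deviation. The following is a minimal sketch that checks this by hand; it assumes the population standard deviation (ddof=0), which is what StandardScaler uses, and the name X_manual is only illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1, 2], [3, 4], [5, 6]], dtype=float)

scaler = StandardScaler().fit(X)

# Manual standardization: subtract the column mean, divide by the column std.
# np.std defaults to ddof=0 (population standard deviation).
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# The fitted statistics are stored on the scaler and reused by transform().
print(scaler.mean_)   # per-column means learned during fit
print(scaler.scale_)  # per-column standard deviations learned during fit
print(np.allclose(scaler.transform(X), X_manual))
```

Because the statistics live on the fitted scaler, the same object can later be applied to new data with transform() without refitting.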

MinMaxScaler

Normalization: rescales each feature to a given range (0 to 1 by default).

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_normalized = scaler.fit_transform(X)

print(X_normalized)
# Output:
# [[0.  0. ]
#  [0.5 0.5]
#  [1.  1. ]]
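
The target range is controlled by the feature_range parameter. As a rough sketch, the snippet below rescales the same sample data to [-1, 1] and also reproduces the default (0, 1) scaling by hand; the names X_range and X_unit are only illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1, 2], [3, 4], [5, 6]], dtype=float)

# feature_range sets the target interval; the default is (0, 1).
scaler = MinMaxScaler(feature_range=(-1, 1))
X_range = scaler.fit_transform(X)

# Manual computation for the default (0, 1) range:
# (x - column_min) / (column_max - column_min)
X_unit = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(X_range)
print(np.allclose(X_unit, MinMaxScaler().fit_transform(X)))
```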

MaxAbsScaler

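Scales each feature by its maximum absolute value so that values fall within [-1, 1]. Because it does not shift or center the data, it is suitable for sparse data.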

from sklearn.preprocessing import MaxAbsScaler

scaler = MaxAbsScaler()
X_scaled = scaler.fit_transform(X)
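
To illustrate why this scaler suits sparse data, here is a small sketch using a SciPy CSR matrix. The matrix values are made up for the example; the point is that zeros stay zero, so the result remains sparse.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler

# A small sparse matrix (hypothetical values chosen for illustration).
X_sparse = sparse.csr_matrix(np.array([[0.0, -4.0],
                                       [2.0, 0.0],
                                       [0.0, 8.0]]))

scaler = MaxAbsScaler()
X_scaled = scaler.fit_transform(X_sparse)

# Each column is divided by its maximum absolute value (2 and 8 here).
print(sparse.issparse(X_scaled))  # the output is still a sparse matrix
print(X_scaled.toarray())
```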

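Encoding Categorical Features

LabelEncoder

Converts class labels to integers. It is intended for target labels (y) rather than input features.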

from sklearn.preprocessing import LabelEncoder

y = ["cat", "dog", "cat", "bird"]
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)

# Codes follow the sorted order of classes_: bird=0, cat=1, dog=2
print(y_encoded)         # [1 2 1 0]
print(encoder.classes_)  # ['bird' 'cat' 'dog']
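
A fitted encoder can also map integer codes back to the original labels and encode new labels it has already seen. A short sketch, with y_decoded and new_codes as illustrative names:

```python
from sklearn.preprocessing import LabelEncoder

y = ["cat", "dog", "cat", "bird"]
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)

# inverse_transform maps integer codes back to the original labels.
y_decoded = encoder.inverse_transform(y_encoded)

# transform reuses the mapping learned during fit; labels that were
# not seen during fit would raise an error.
new_codes = encoder.transform(["dog", "bird"])

print(list(y_decoded))
print(list(new_codes))
```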

OneHotEncoder

One-hot encoding, suitable for unordered (nominal) categories.

from sklearn.preprocessing import OneHotEncoder

categories = [["cat"], ["dog"], ["bird"], ["cat"]]
encoder = OneHotEncoder(sparse_output=False)
y_onehot = encoder.fit_transform(categories)

print(y_onehot)
# Output (columns follow the sorted categories: bird, cat, dog):
# [[0. 1. 0.]
#  [0. 0. 1.]
#  [1. 0. 0.]
#  [0. 1. 0.]]

Handling Missing Values

SimpleImputer

from sklearn.impute import SimpleImputer
import numpy as np

# Data containing missing values
X = [[1, 2], [np.nan, 3], [7, np.nan]]

# Fill missing entries with the column mean
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)

print(X_imputed)
# Output:
# [[1.  2. ]
#  [4.  3. ]
#  [7.  2.5]]

Other strategies: median, most_frequent, constant
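
The other strategies can be compared side by side. This is a small sketch on made-up data (two extra rows are added so that the median and the most frequent value differ); the dictionary name filled is only illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data: column 0 has observed values 1, 7, 1, 10;
# column 1 has observed values 2, 3, 3, 6.
X = [[1, 2], [np.nan, 3], [7, np.nan], [1, 3], [10, 6]]

filled = {}
for strategy in ("median", "most_frequent"):
    filled[strategy] = SimpleImputer(strategy=strategy).fit_transform(X)

# The "constant" strategy fills every missing entry with fill_value.
filled["constant"] = SimpleImputer(strategy="constant", fill_value=0).fit_transform(X)

# Show the two imputed cells for each strategy.
for name, result in filled.items():
    print(name, result[1, 0], result[2, 1])
```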

Feature Selection

VarianceThreshold

Removes features whose variance is below a threshold.

from sklearn.feature_selection import VarianceThreshold

X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 0, 0]]

selector = VarianceThreshold(threshold=0.1)
X_selected = selector.fit_transform(X)
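
To see which columns are kept, the fitted selector exposes the per-column variances and a boolean mask. The sketch below extends the sample data with a constant fourth column (an addition made only for illustration) so that one feature is actually removed.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Same data as above plus a constant column (hypothetical, for illustration).
X = np.array([[0, 0, 1, 5],
              [0, 1, 0, 5],
              [1, 0, 0, 5],
              [0, 0, 0, 5]])

selector = VarianceThreshold(threshold=0.1)
X_selected = selector.fit_transform(X)

print(selector.variances_)     # variance of each input column
print(selector.get_support())  # True for columns that are kept
print(X_selected.shape)
```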

SelectKBest

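Selects the K features that score highest on a univariate statistical test.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

# Iris is used here only as example data (150 samples, 4 features)
X, y = load_iris(return_X_y=True)

# Keep the 2 best features according to the ANOVA F-test (f_classif)
selector = SelectKBest(f_classif, k=2)
X_selected = selector.fit_transform(X, y)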
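
After fitting, the selector's scores and mask show why particular features were chosen. A minimal sketch on the Iris dataset, which is assumed here purely as example data:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

iris = load_iris()
X, y = iris.data, iris.target

selector = SelectKBest(f_classif, k=2).fit(X, y)

# scores_ holds the F-statistic of each feature; higher means more relevant.
for name, score in zip(iris.feature_names, selector.scores_):
    print(f"{name}: {score:.1f}")

# get_support() returns a boolean mask over the input features.
mask = selector.get_support()
selected = [name for name, keep in zip(iris.feature_names, mask) if keep]
print(selected)
```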

Pipeline

Chain preprocessing and a model together:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris

# Load the data
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Build the pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),  # Step 1: standardization
    ('classifier', SVC())          # Step 2: classifier
])

# Train
pipeline.fit(X_train, y_train)

# Predict
y_pred = pipeline.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
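
One practical benefit of a pipeline is that it can be cross-validated as a single estimator, so the scaler is refit on the training portion of each fold. The following sketch assumes 5-fold cross-validation, which is a choice made for this example and not something the text above prescribes.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", SVC()),
])

# Each fold fits the scaler and the classifier on that fold's training split only.
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"Mean CV accuracy: {scores.mean():.2f}")

# After fitting, individual steps are accessible by name.
pipeline.fit(X, y)
fitted_scaler = pipeline.named_steps["scaler"]
print(fitted_scaler.mean_)
```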

ColumnTransformer

Apply different preprocessing to different columns:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

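# Assume columns 0 and 1 are numeric and column 2 is categorical
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), [0, 1]),  # standardize the numeric columns
        ('cat', OneHotEncoder(), [2])       # one-hot encode the categorical column
    ]
)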

Next Steps