Scikit-learn Data Preprocessing
Data preprocessing is a key step in machine learning, and scikit-learn provides a rich set of tools for it.
Why Preprocess?
Raw data typically suffers from the following problems:
- Features on inconsistent numeric scales
- Missing values
- Categorical features that need encoding
- Outliers
Feature Scaling
StandardScaler
Standardization: transforms each feature to a distribution with mean 0 and variance 1
from sklearn.preprocessing import StandardScaler
import numpy as np
# Example data
X = np.array([[1, 2], [3, 4], [5, 6]])
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled)
# Output:
# [[-1.22474387 -1.22474387]
# [ 0. 0. ]
# [ 1.22474387 1.22474387]]
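In practice the scaler should learn its statistics from the training data only, and those same statistics are then reused to transform the test data. A minimal sketch with made-up train/test arrays:

```python
from sklearn.preprocessing import StandardScaler
import numpy as np

# Hypothetical train/test split (toy data for illustration)
X_train = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
X_test = np.array([[2.0, 3.0]])

scaler = StandardScaler()
scaler.fit(X_train)                       # learns mean_ and scale_ from X_train only
X_test_scaled = scaler.transform(X_test)  # reuses the training statistics

print(scaler.mean_)     # [3. 4.]
print(X_test_scaled)    # [[-0.61237244 -0.61237244]]
```

Calling `fit_transform` on the test set instead would leak test-set statistics into the preprocessing step.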
MinMaxScaler
Normalization: scales each feature to a given range (0-1 by default)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_normalized = scaler.fit_transform(X)
print(X_normalized)
# Output:
# [[0. 0. ]
# [0.5 0.5]
# [1. 1. ]]
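The target interval is configurable through the `feature_range` parameter. A quick sketch mapping the same toy data into [-1, 1]:

```python
from sklearn.preprocessing import MinMaxScaler
import numpy as np

X = np.array([[1, 2], [3, 4], [5, 6]])

# feature_range rescales into any interval, e.g. [-1, 1]
scaler = MinMaxScaler(feature_range=(-1, 1))
X_scaled = scaler.fit_transform(X)
print(X_scaled)
# [[-1. -1.]
#  [ 0.  0.]
#  [ 1.  1.]]
```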
MaxAbsScaler
Scales each feature into [-1, 1] by dividing by its maximum absolute value; since the data is not shifted, sparsity is preserved, which makes it suitable for sparse data
from sklearn.preprocessing import MaxAbsScaler
scaler = MaxAbsScaler()
X_scaled = scaler.fit_transform(X)
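To see the sparsity-preserving behavior, here is a small sketch on a SciPy sparse matrix (toy values); the output stays sparse because no centering is applied:

```python
from scipy.sparse import csr_matrix
from sklearn.preprocessing import MaxAbsScaler

# Toy sparse matrix with mixed-sign values; each column is divided
# by its maximum absolute value (5 for column 0, 4 for column 1)
X_sparse = csr_matrix([[1.0, -2.0], [0.0, 4.0], [-5.0, 0.0]])
scaler = MaxAbsScaler()
X_scaled = scaler.fit_transform(X_sparse)

print(X_scaled.toarray())
# [[ 0.2 -0.5]
#  [ 0.   1. ]
#  [-1.   0. ]]
```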
Encoding Categorical Features
LabelEncoder
Converts class labels to integers (intended for the target y, not for input features)
from sklearn.preprocessing import LabelEncoder
y = ["cat", "dog", "cat", "bird"]
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)
print(y_encoded) # [1 2 1 0]
print(encoder.classes_) # ['bird' 'cat' 'dog']
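The codes follow the alphabetical order of `classes_`, and `inverse_transform` maps them back. A short sketch continuing the example above:

```python
from sklearn.preprocessing import LabelEncoder

y = ["cat", "dog", "cat", "bird"]
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)

# inverse_transform maps integer codes back to the original string labels,
# following the sorted classes_ array ['bird', 'cat', 'dog']
labels = encoder.inverse_transform([2, 0, 1])
print(labels)  # ['dog' 'bird' 'cat']
```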
OneHotEncoder
One-hot encoding, for unordered (nominal) categories
from sklearn.preprocessing import OneHotEncoder
categories = [["cat"], ["dog"], ["bird"], ["cat"]]
encoder = OneHotEncoder(sparse_output=False)
y_onehot = encoder.fit_transform(categories)
print(y_onehot)
# Output (columns follow the sorted categories ['bird', 'cat', 'dog']):
# [[0. 1. 0.]
# [0. 0. 1.]
# [1. 0. 0.]
# [0. 1. 0.]]
Handling Missing Values
SimpleImputer
from sklearn.impute import SimpleImputer
import numpy as np
# Data with missing values
X = [[1, 2], [np.nan, 3], [7, np.nan]]
# Fill with the column mean
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)
print(X_imputed)
# Output:
# [[1. 2. ]
# [4. 3. ]
# [7. 2.5]]
Other strategies: median, most_frequent, constant
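For example, `strategy='constant'` fills every gap with a fixed `fill_value` rather than a computed statistic, which is common when a sentinel like 0 is meaningful. A sketch on the same toy data:

```python
from sklearn.impute import SimpleImputer
import numpy as np

X = [[1, 2], [np.nan, 3], [7, np.nan]]

# strategy='constant' fills each missing cell with fill_value
imputer = SimpleImputer(strategy="constant", fill_value=0)
X_filled = imputer.fit_transform(X)
print(X_filled)
# [[1. 2.]
#  [0. 3.]
#  [7. 0.]]
```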
Feature Selection
VarianceThreshold
Removes features whose variance falls below a threshold
from sklearn.feature_selection import VarianceThreshold
X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 0, 0]]
selector = VarianceThreshold(threshold=0.1)
X_selected = selector.fit_transform(X)
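After fitting, the selector exposes the per-feature variances and a boolean keep/drop mask, which helps explain why features survive. With this toy X, every column has variance 0.1875, so all three pass the 0.1 threshold:

```python
from sklearn.feature_selection import VarianceThreshold

X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 0, 0]]
selector = VarianceThreshold(threshold=0.1)
selector.fit(X)

# variances_ holds each feature's variance; get_support() is the keep mask
print(selector.variances_)    # [0.1875 0.1875 0.1875]
print(selector.get_support()) # [ True  True  True]
```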
SelectKBest
Selects the K features most related to the target
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.datasets import load_iris
# Load a labeled dataset so X and y are defined
X, y = load_iris(return_X_y=True)
# Select the best 2 features
selector = SelectKBest(f_classif, k=2)
X_selected = selector.fit_transform(X, y)
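The fitted selector also records the score of every feature, so you can see which ones were chosen and why. On the iris dataset, the two petal features (columns 2 and 3) have by far the highest ANOVA F-scores:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
selector = SelectKBest(f_classif, k=2).fit(X, y)

# scores_ holds the ANOVA F-statistic per feature (higher = more
# discriminative); get_support() marks the k winners
print(selector.scores_.round(1))
print(selector.get_support())  # [False False  True  True]
```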
Pipeline
Chain preprocessing steps and a model together:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris
# Load data
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Build the pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),  # step 1: standardization
    ('classifier', SVC())          # step 2: classifier
])
# Train
pipeline.fit(X_train, y_train)
# Predict
y_pred = pipeline.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
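A key benefit of putting the scaler inside the pipeline is leakage-free cross-validation: each fold re-fits the scaler on that fold's training portion only. A sketch using `cross_val_score` on iris:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Because scaling lives inside the pipeline, each CV fold fits the
# scaler on that fold's training split only -- no data leakage
pipeline = Pipeline([("scaler", StandardScaler()), ("classifier", SVC())])
scores = cross_val_score(pipeline, X, y, cv=5)
print(scores.mean())
```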
ColumnTransformer
Apply different preprocessing to different columns:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
# Assume columns 0-1 are numeric and column 2 is categorical
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), [0, 1]),  # standardize the numeric columns
        ('cat', OneHotEncoder(), [2])       # one-hot encode the categorical column
    ]
)