# Scikit-learn Classification Algorithms

Classification is a core supervised-learning task, and scikit-learn provides a rich set of classification algorithms.
## Classification Overview

The goal of classification is to predict discrete class labels:

- Binary classification: positive/negative class (e.g., spam detection)
- Multiclass classification: more than two classes (e.g., handwritten digit recognition)
## Common Classification Algorithms

### 1. Logistic Regression

Suited to binary classification; outputs class probabilities.
```python
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Load the data (reduced to a binary problem)
iris = load_iris()
X = iris.data
y = (iris.target != 0).astype(int)  # binary: class 0 vs classes 1 and 2

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Create the model
clf = LogisticRegression(random_state=42)

# Train
clf.fit(X_train, y_train)

# Predict
y_pred = clf.predict(X_test)

# Evaluate
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
print(classification_report(y_test, y_pred))
```
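Since logistic regression models class probabilities, `predict_proba` exposes them directly. A minimal sketch on the same binary iris split (the `max_iter=200` here is only an addition to guard against convergence warnings, not part of the example above):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris.data
y = (iris.target != 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

clf = LogisticRegression(random_state=42, max_iter=200).fit(X_train, y_train)

# Each row holds [P(class 0), P(class 1)] and sums to 1
proba = clf.predict_proba(X_test)
print(proba[:3])
```

`predict` simply thresholds these probabilities at 0.5, so `predict_proba` is the one to use when you need a confidence score or a custom decision threshold.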
### 2. Decision Trees

Highly interpretable; supports multiclass classification natively.
```python
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(max_depth=5, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
```
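The interpretability claim can be made concrete with `export_text`, which prints the learned decision rules. A sketch assuming the full three-class iris split rather than the binary one:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

clf = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_train, y_train)

# Dump the learned tree as human-readable if/else rules
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```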
### 3. Random Forest

An ensemble of many decision trees; generalizes well.
```python
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
    n_estimators=100,  # number of trees
    max_depth=10,      # maximum tree depth
    random_state=42
)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
```
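A fitted forest also reports which features drove its decisions via `feature_importances_`; a self-contained sketch on the iris data (the split mirrors the earlier examples):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

clf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
clf.fit(X_train, y_train)

# Importances are non-negative and sum to 1 across features
for name, imp in zip(iris.feature_names, clf.feature_importances_):
    print(f"{name}: {imp:.3f}")
```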
### 4. Support Vector Machine (SVM)

Works well on small-to-medium datasets.
```python
from sklearn.svm import SVC

clf = SVC(
    kernel='rbf',   # kernel function
    C=1.0,          # regularization parameter
    gamma='scale',  # kernel coefficient
    random_state=42
)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
```
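SVMs are sensitive to feature scale, so in practice the classifier is usually paired with a scaler. A sketch using `make_pipeline` (the pipeline wrapper is an addition for illustration, not part of the example above); bundling the two steps ensures the scaler is fit only on the training data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# Scaling happens inside the pipeline, so predict/score apply it automatically
pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1.0, gamma='scale'))
pipe.fit(X_train, y_train)
acc = pipe.score(X_test, y_test)
print(f"Accuracy: {acc:.2f}")
```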
### 5. K-Nearest Neighbors (KNN)

A simple distance-based algorithm.
```python
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
```
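Because KNN predicts from the `n_neighbors` closest training points, the choice of k matters. A quick sketch comparing a few illustrative k values via cross-validation (the values 1, 5, 15 are arbitrary, not a recommendation):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

# Mean 5-fold accuracy for each candidate k
scores = {}
for k in (1, 5, 15):
    clf = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(clf, iris.data, iris.target, cv=5).mean()
    print(f"k={k}: mean accuracy {scores[k]:.3f}")
```

Small k tends to overfit (very jagged decision boundaries), while large k oversmooths; cross-validation is the usual way to pick between them.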
## Complete Example: Iris Classification

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
import pandas as pd

# 1. Load the data
iris = load_iris()
X = iris.data
y = iris.target

# 2. Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 4. Compare several models
models = {
    "Logistic Regression": LogisticRegression(max_iter=200),
    "Random Forest": RandomForestClassifier(n_estimators=100),
    "SVM": SVC(kernel='rbf')
}

results = []
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    acc = accuracy_score(y_test, y_pred)
    results.append({"Model": name, "Accuracy": acc})

# 5. Display the results
df = pd.DataFrame(results)
print(df)
```
## Model Evaluation Metrics

### Accuracy

```python
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred)
```
### Confusion Matrix

```python
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
print(cm)
```
### Classification Report

```python
from sklearn.metrics import classification_report

report = classification_report(y_test, y_pred)
print(report)
```
Sample output:

```text
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       0.90      1.00      0.95        10
           2       1.00      0.80      0.89        10

    accuracy                           0.93        30
   macro avg       0.97      0.93      0.95        30
weighted avg       0.97      0.93      0.95        30
```
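The `macro avg` row is simply the unweighted mean of the per-class scores. A tiny hand-made check (the labels below are invented for illustration, not from the iris example):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Three classes, a few deliberate mistakes
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

# average='macro' takes the plain mean over classes, ignoring class sizes
p = precision_score(y_true, y_pred, average='macro')
r = recall_score(y_true, y_pred, average='macro')
f = f1_score(y_true, y_pred, average='macro')
print(f"macro precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

On imbalanced data the macro and weighted averages can diverge sharply, which is why the report prints both.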
## Hyperparameter Tuning

### GridSearchCV

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 'auto']
}

grid_search = GridSearchCV(
    SVC(),
    param_grid,
    cv=5,
    scoring='accuracy'
)
grid_search.fit(X_train_scaled, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.2f}")

# Predict with the best model
y_pred = grid_search.best_estimator_.predict(X_test_scaled)
```
### Cross-Validation

```python
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, random_state=42)

# 5-fold cross-validation
scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
print(f"Per-fold accuracy: {scores}")
print(f"Mean accuracy: {scores.mean():.2f} (+/- {scores.std() * 2:.2f})")
```
## Choosing an Algorithm

| Scenario | Recommended algorithms |
|---|---|
| Small dataset | KNN, SVM |
| Medium dataset | Random forest, gradient boosting |
| Large dataset | Logistic regression, linear SVM |
| Interpretability needed | Decision tree, logistic regression |
| Accuracy first | Random forest, XGBoost |