Scikit-learn Classification Algorithms

Classification is a core supervised learning task, and scikit-learn provides a rich set of classification algorithms.

Classification Overview

The goal of classification is to predict discrete class labels:

  • Binary classification: positive vs. negative class (e.g., spam detection)
  • Multiclass classification: more than two classes (e.g., handwritten digit recognition)

Common Classification Algorithms

1. Logistic Regression

Suited to binary classification; outputs class probabilities.

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Load data (as a binary problem)
iris = load_iris()
X = iris.data
y = (iris.target != 0).astype(int)  # binary: class 0 vs classes 1 and 2

# Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Create the model
clf = LogisticRegression(random_state=42)

# Train
clf.fit(X_train, y_train)

# Predict
y_pred = clf.predict(X_test)

# Evaluate
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
print(classification_report(y_test, y_pred))
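The snippet above only calls predict, even though logistic regression also provides class probabilities. A minimal sketch of retrieving them with predict_proba, reusing the same binary iris setup (max_iter=200 here is just a convergence safeguard, not part of the original example):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris.data
y = (iris.target != 0).astype(int)  # binary: class 0 vs classes 1 and 2
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

clf = LogisticRegression(max_iter=200, random_state=42).fit(X_train, y_train)

# predict_proba returns one column per class; each row sums to 1
proba = clf.predict_proba(X_test)
print(proba[:3])
```

The column order follows clf.classes_, so proba[:, 1] is the probability of the positive class.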

2. Decision Tree

Highly interpretable; supports multiclass classification.

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(max_depth=5, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
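To see that interpretability in practice, sklearn.tree.export_text renders the learned splits as plain if/else rules. A small sketch on the full iris dataset (the max_depth=3 fit here is an illustrative choice, not the original model):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(iris.data, iris.target)

# export_text prints the learned decision rules as indented text
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```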

3. Random Forest

An ensemble of many decision trees with strong generalization.

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
    n_estimators=100,  # number of trees
    max_depth=10,      # maximum tree depth
    random_state=42
)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
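A practical side benefit of the ensemble is the fitted model's feature_importances_ attribute, which averages impurity reductions across trees. A brief sketch on iris (fitting on the full dataset purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(iris.data, iris.target)

# importances sum to 1; larger values mean the feature drove more splits
for name, imp in zip(iris.feature_names, clf.feature_importances_):
    print(f"{name}: {imp:.3f}")
```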

4. Support Vector Machine (SVM)

Well suited to small and medium-sized datasets.

from sklearn.svm import SVC

clf = SVC(
    kernel='rbf',    # kernel function
    C=1.0,           # regularization parameter
    gamma='scale',   # kernel coefficient
    random_state=42
)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

5. K-Nearest Neighbors (KNN)

A simple distance-based algorithm.

from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
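KNN's main hyperparameter is n_neighbors. One simple way to pick it is to compare a few candidate values with cross-validation; a sketch (the candidate list [1, 3, 5, 7, 9] is arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

# evaluate several k values with 5-fold cross-validation
scores = {}
for k in [1, 3, 5, 7, 9]:
    clf = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(clf, iris.data, iris.target, cv=5).mean()

best_k = max(scores, key=scores.get)
print(scores)
print(f"best k: {best_k}")
```

Odd values of k avoid ties in binary problems, which is why they are the usual candidates.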

Complete Example: Iris Classification

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
import pandas as pd

# 1. Load data
iris = load_iris()
X = iris.data
y = iris.target

# 2. Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 4. Compare multiple models
models = {
    "Logistic Regression": LogisticRegression(max_iter=200),
    "Random Forest": RandomForestClassifier(n_estimators=100),
    "SVM": SVC(kernel='rbf')
}

results = []
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    acc = accuracy_score(y_test, y_pred)
    results.append({"Model": name, "Accuracy": acc})

# 5. Display results
df = pd.DataFrame(results)
print(df)

Model Evaluation Metrics

Accuracy

from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred)

Confusion Matrix

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
print(cm)

Classification Report

from sklearn.metrics import classification_report

report = classification_report(y_test, y_pred)
print(report)

Example output:

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       0.90      1.00      0.95        10
           2       1.00      0.80      0.89        10

    accuracy                           0.93        30
   macro avg       0.97      0.93      0.95        30
weighted avg       0.97      0.93      0.95        30

Hyperparameter Tuning

GridSearchCV

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 'auto']
}

grid_search = GridSearchCV(
    SVC(),
    param_grid,
    cv=5,
    scoring='accuracy'
)

grid_search.fit(X_train_scaled, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.2f}")

# Predict with the best model
y_pred = grid_search.best_estimator_.predict(X_test_scaled)

Cross-Validation

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, random_state=42)

# 5-fold cross-validation
scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')

print(f"Per-fold accuracy: {scores}")
print(f"Mean accuracy: {scores.mean():.2f} (+/- {scores.std()*2:.2f})")

Advice on Choosing an Algorithm

Scenario                   Recommended algorithms
Small dataset              KNN, SVM
Medium-sized dataset       Random Forest, Gradient Boosting
Large dataset              Logistic Regression, linear SVM
Interpretability needed    Decision Tree, Logistic Regression
Maximum accuracy           Random Forest, XGBoost