Scikit-learn Clustering Algorithms

Clustering is a core unsupervised learning task: it partitions data into groups without using labels.

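Overview of Clustering

The goal of clustering is to put similar samples into the same group, with no labels defined in advance. Common algorithms include:

  • K-Means: centroid-based clustering
  • Hierarchical clustering: builds a hierarchy of nested clusters
  • DBSCAN: density-based clustering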

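K-Means Clustering

How It Works

  1. Randomly choose K initial centroids
  2. Assign each point to its nearest centroid
  3. Update each centroid to the mean of the points assigned to it
  4. Repeat steps 2 and 3 until the assignments stop changing (convergence)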
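
The four steps above can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not scikit-learn's implementation (whose `KMeans` defaults to k-means++ initialization rather than purely random centroids). The function name `kmeans_sketch` and the toy array `X_demo` are hypothetical and exist only for this example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    """Toy K-Means loop; illustrative only, not scikit-learn's implementation."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign every point to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated groups of 2-D points (hypothetical toy data)
X_demo = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
labels, centroids = kmeans_sketch(X_demo, k=2)
print(labels)
print(centroids)
```

Which group receives label 0 depends on the random initialization, so only the grouping itself is meaningful, not the label numbers.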

Code Example

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# Generate synthetic data: 300 points around 3 centers
X, y_true = make_blobs(
    n_samples=300,
    centers=3,
    cluster_std=0.8,
    random_state=42
)

# Fit K-Means and get a cluster label for every sample
kmeans = KMeans(n_clusters=3, random_state=42)
y_pred = kmeans.fit_predict(X)

# Plot the samples colored by cluster, with the centroids as red crosses
plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.scatter(kmeans.cluster_centers_[:, 0],
            kmeans.cluster_centers_[:, 1],
            c='red', marker='x', s=200)
plt.show()

Choosing the Best K

Use the elbow method (inertia plotted against K) together with the silhouette score:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

# X is the make_blobs data from the previous example

# Try a range of K values
inertias = []
silhouettes = []
K_range = range(2, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X)
    # Inertia: sum of squared distances from samples to their closest centroid
    inertias.append(kmeans.inertia_)
    silhouettes.append(silhouette_score(X, kmeans.labels_))

# Plot both curves side by side
fig, axes = plt.subplots(1, 2)
axes[0].plot(K_range, inertias, 'bo-')
axes[0].set_xlabel('K')
axes[0].set_ylabel('Inertia')

axes[1].plot(K_range, silhouettes, 'ro-')
axes[1].set_xlabel('K')
axes[1].set_ylabel('Silhouette Score')
plt.show()
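
The elbow in the inertia curve has to be read by eye, whereas silhouette scores can be compared numerically. The sketch below regenerates the same `make_blobs` data and picks the K with the highest silhouette score. Treat this as one heuristic rather than a definitive rule; `n_init=10` is set explicitly here as an assumption so that behavior does not depend on the installed scikit-learn version's default.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Same synthetic data as in the K-Means example
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# Silhouette score for each candidate K
scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Pick the K whose clustering scores highest
best_k = max(scores, key=scores.get)
print(f"Best K by silhouette score: {best_k}")
```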

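DBSCAN Clustering

Characteristics

  • Does not require the number of clusters to be specified
  • Can find clusters of arbitrary shape
  • Identifies noise points (labeled -1)

Code Example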

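from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Generate two interleaving half-moon shapes
X, y_true = make_moons(n_samples=200, noise=0.05, random_state=42)

# DBSCAN: eps is the neighborhood radius, min_samples the density threshold
dbscan = DBSCAN(eps=0.3, min_samples=5)
y_pred = dbscan.fit_predict(X)

# Noise points get the label -1, so exclude it from the cluster count
n_clusters = len(set(y_pred)) - (1 if -1 in y_pred else 0)
print(f"Number of clusters: {n_clusters}")
print(f"Number of noise points: {(y_pred == -1).sum()}")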
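
To make the "arbitrary shape" property concrete, the sketch below runs DBSCAN and K-Means on the same half-moon data and compares each result with the generating labels using the adjusted Rand index. The choice of `n_clusters=2` and `n_init=10` for K-Means is an assumption made for this comparison; the exact scores depend on the data and parameters, so none are quoted here.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Same half-moon data as in the DBSCAN example
X, y_true = make_moons(n_samples=200, noise=0.05, random_state=42)

# Density-based clustering follows the curved shapes
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

# Centroid-based clustering splits the plane with a straight boundary
km_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

ari_db = adjusted_rand_score(y_true, db_labels)
ari_km = adjusted_rand_score(y_true, km_labels)
print(f"DBSCAN ARI:  {ari_db:.2f}")
print(f"K-Means ARI: {ari_km:.2f}")
```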

Hierarchical Clustering

AgglomerativeClustering

from sklearn.cluster import AgglomerativeClustering

# Agglomerative (bottom-up) hierarchical clustering into 3 clusters
# X is the data from the previous example
hc = AgglomerativeClustering(n_clusters=3)
y_pred = hc.fit_predict(X)

Visualizing the Dendrogram

from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Compute the merge hierarchy using Ward linkage
Z = linkage(X, method='ward')

# Draw the dendrogram
dendrogram(Z)
plt.title('Hierarchical Clustering Dendrogram')
plt.show()
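
A dendrogram shows the full merge hierarchy; to get flat cluster labels from the same linkage matrix, SciPy's `fcluster` can cut the tree. The sketch below is a minimal example assuming a small `make_blobs` dataset (chosen here only for illustration) and a cut into at most 3 clusters.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

# Small illustrative dataset (an assumption for this sketch)
X, _ = make_blobs(n_samples=60, centers=3, cluster_std=0.8, random_state=42)

# Build the hierarchy with Ward linkage
Z = linkage(X, method='ward')

# Cut the tree so that at most 3 flat clusters remain
labels = fcluster(Z, t=3, criterion='maxclust')
print(sorted(set(labels)))
```

Note that `fcluster` numbers clusters starting from 1, unlike scikit-learn estimators, which start from 0.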

Clustering Evaluation Metrics

Silhouette Coefficient

from sklearn.metrics import silhouette_score

# Needs no ground-truth labels; ranges from -1 to 1, higher is better
score = silhouette_score(X, y_pred)
print(f"Silhouette coefficient: {score:.2f}")
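
For intuition about what the score measures: for each sample, let a be its mean distance to the other members of its own cluster and b its mean distance to the members of the nearest other cluster; the sample's silhouette is (b - a) / max(a, b), and the overall score is the mean over all samples. The sketch below computes this by hand on a hypothetical six-point dataset and checks it against `silhouette_score`. It assumes every cluster has at least two points.

```python
import numpy as np
from sklearn.metrics import pairwise_distances, silhouette_score

# Hypothetical toy data: two tight groups of three points each
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [8.0, 8.0], [8.0, 9.0], [9.0, 8.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

D = pairwise_distances(X)  # Euclidean distance matrix
per_sample = []
for i in range(len(X)):
    same = labels == labels[i]
    same[i] = False  # exclude the point itself
    a = D[i, same].mean()  # mean distance within its own cluster
    # mean distance to the nearest other cluster
    b = min(D[i, labels == c].mean() for c in set(labels) if c != labels[i])
    per_sample.append((b - a) / max(a, b))

manual = np.mean(per_sample)
print(f"manual: {manual:.4f}, sklearn: {silhouette_score(X, labels):.4f}")
```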

Adjusted Rand Index

from sklearn.metrics import adjusted_rand_score

# Requires ground-truth labels; 1.0 means the two partitions are identical
ari = adjusted_rand_score(y_true, y_pred)
print(f"Adjusted Rand index: {ari:.2f}")
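
The adjusted Rand index compares partitions, not label names, which matters because clustering algorithms number their clusters arbitrarily. A small sketch with hypothetical label lists:

```python
from sklearn.metrics import adjusted_rand_score

truth = [0, 0, 0, 1, 1, 1]

# Same partition with the label names swapped
renamed = [1, 1, 1, 0, 0, 0]
ari_renamed = adjusted_rand_score(truth, renamed)
print(ari_renamed)  # 1.0

# A partition that ignores the true grouping scores lower
mixed = [0, 1, 0, 1, 0, 1]
ari_mixed = adjusted_rand_score(truth, mixed)
print(ari_mixed < ari_renamed)  # True
```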

Complete Example

from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Generate test data
X, y_true = make_blobs(n_samples=300, centers=4,
                       cluster_std=0.8, random_state=42)

# Algorithms to compare
algorithms = {
    "K-Means": KMeans(n_clusters=4, random_state=42),
    "DBSCAN": DBSCAN(eps=0.8, min_samples=5),
    "Hierarchical": AgglomerativeClustering(n_clusters=4)
}

for name, algo in algorithms.items():
    y_pred = algo.fit_predict(X)

    if name == "DBSCAN":
        # Exclude DBSCAN noise points (label -1) before scoring
        mask = y_pred != -1
        # silhouette_score needs at least 2 clusters among the kept points
        if len(set(y_pred[mask])) > 1:
            score = silhouette_score(X[mask], y_pred[mask])
        else:
            score = float("nan")  # score is undefined in this case
    else:
        score = silhouette_score(X, y_pred)

    print(f"{name}: silhouette coefficient = {score:.2f}")

Algorithm Selection Guide

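Data characteristic            Recommended algorithm
-----------------------------  --------------------------------
Spherical clusters             K-Means
Arbitrarily shaped clusters    DBSCAN
Hierarchy needed               Hierarchical clustering
Large datasets                 MiniBatch K-Means
Known number of clusters       K-Means, hierarchical clustering
Unknown number of clusters     DBSCAN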
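
MiniBatch K-Means appears in the table but not in the examples above. The sketch below shows its basic use through scikit-learn's `MiniBatchKMeans`, which updates centroids from small random batches instead of the full dataset on every iteration. The dataset size, `batch_size=1024`, and `n_init=3` are assumptions chosen for illustration, not tuned recommendations.

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# A larger synthetic dataset (size chosen only for illustration)
X, _ = make_blobs(n_samples=20000, centers=4, cluster_std=0.8, random_state=42)

# Fit on mini-batches of 1024 samples at a time
mbk = MiniBatchKMeans(n_clusters=4, batch_size=1024, n_init=3, random_state=42)
labels = mbk.fit_predict(X)

print(mbk.cluster_centers_.shape)  # (4, 2)
```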