Clustering Algorithms in Scikit-learn
Clustering is a core unsupervised learning task: it groups data points without using any labels.
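Overview of Clustering
The goal of clustering is to place similar samples in the same group without predefined labels. This section covers three algorithms:
- K-Means: centroid-based clustering
- Hierarchical clustering: builds a nested hierarchy of clusters
- DBSCAN: density-based clustering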
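K-Means Clustering
How It Works
- Choose K initial centroids (at random in the basic algorithm; scikit-learn defaults to k-means++ seeding)
- Assign each point to its nearest centroid
- Move each centroid to the mean of the points assigned to it
- Repeat the assignment and update steps until the centroids stop moving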
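The four steps above can be sketched directly in NumPy. This is only an illustration of the basic loop, not scikit-learn's implementation; the helper name `kmeans_sketch`, its parameters, and the toy array `X_demo` are made up for this example.

```python
import numpy as np


def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Basic K-Means loop for illustration (hypothetical helper, not sklearn's code)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign every point to its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (a centroid that lost all of its points stays where it is).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


# Toy data: two well-separated groups of three points each.
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centroids = kmeans_sketch(X_demo, k=2)
print(labels)      # the first three points share one label, the last three the other
print(centroids)   # one centroid near each group
```

Which group ends up with label 0 depends on the random initialization, so only the grouping is meaningful, not the label values.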
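Code Example
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
# Generate three Gaussian blobs
X, y_true = make_blobs(
    n_samples=300,
    centers=3,
    cluster_std=0.8,
    random_state=42
)
# Fit K-Means and get a cluster label for every sample
# (n_init is set explicitly because its default differs across scikit-learn versions)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
y_pred = kmeans.fit_predict(X)
# Plot the samples colored by cluster, with centroids as red crosses
plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.scatter(kmeans.cluster_centers_[:, 0],
            kmeans.cluster_centers_[:, 1],
            c='red', marker='x', s=200)
plt.show()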
Choosing the Best K
Use the elbow method (inertia) together with the silhouette score:
from sklearn.metrics import silhouette_score
# Try a range of K values
inertias = []
silhouettes = []
K_range = range(2, 11)
for k in K_range:
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)
    silhouettes.append(silhouette_score(X, kmeans.labels_))
# Plot both curves: look for the "elbow" where inertia stops dropping sharply,
# and for the K with the highest silhouette score
fig, axes = plt.subplots(1, 2)
axes[0].plot(K_range, inertias, 'bo-')
axes[0].set_xlabel('K')
axes[0].set_ylabel('Inertia')
axes[1].plot(K_range, silhouettes, 'ro-')
axes[1].set_xlabel('K')
axes[1].set_ylabel('Silhouette Score')
plt.show()
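Reading the elbow off a plot is subjective, so it can help to pick a candidate K programmatically as well. The sketch below simply takes the K with the highest silhouette score; that is a heuristic rather than a guarantee, and the names `X_demo`, `scores`, and `best_k` are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Same kind of data as in the K-Means example: three Gaussian blobs.
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# Silhouette score for every candidate K.
scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

# Candidate K: the one with the highest silhouette score.
best_k = max(scores, key=scores.get)
print(f"K with the highest silhouette score: {best_k}")
```

On real data the silhouette curve is often flatter than on synthetic blobs, so it is worth checking the result against the inertia curve and domain knowledge.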
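DBSCAN Clustering
Key Properties
- Does not require the number of clusters up front; it needs a neighborhood radius (eps) and a density threshold (min_samples) instead
- Can find clusters of arbitrary shape
- Identifies noise points, which are labeled -1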
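Two of the properties above, no preset cluster count and explicit noise labels, are easy to see on a tiny hand-made dataset. The array `X_demo` and the parameter values here are chosen purely for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups of four points each, plus one isolated point.
X_demo = np.array([
    [0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [0.1, 0.1],
    [5.0, 5.0], [5.0, 5.1], [5.1, 5.0], [5.1, 5.1],
    [20.0, 20.0],
])

# No n_clusters argument: the clusters follow from eps and min_samples.
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X_demo)
print(labels.tolist())  # [0, 0, 0, 0, 1, 1, 1, 1, -1]

n_clusters = len(set(labels.tolist()) - {-1})
n_noise = int((labels == -1).sum())
print(n_clusters, n_noise)  # 2 1
```

The isolated point has no neighbors within eps, so it never joins a cluster and keeps the noise label -1.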
Code Example
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
# Generate two interleaving half-moons
X, y_true = make_moons(n_samples=200, noise=0.05, random_state=42)
# Fit DBSCAN; noise points receive the label -1
dbscan = DBSCAN(eps=0.3, min_samples=5)
y_pred = dbscan.fit_predict(X)
# Count clusters without counting the noise label
print(f"Number of clusters: {len(set(y_pred)) - (1 if -1 in y_pred else 0)}")
print(f"Number of noise points: {(y_pred == -1).sum()}")
Hierarchical Clustering
AgglomerativeClustering
from sklearn.cluster import AgglomerativeClustering
# Agglomerative (bottom-up) clustering on the data X defined earlier;
# Ward linkage is the default
hc = AgglomerativeClustering(n_clusters=3)
y_pred = hc.fit_predict(X)
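Visualizing the Dendrogram
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt
# Compute the merge hierarchy with Ward linkage
Z = linkage(X, method='ward')
# Plot the dendrogram
dendrogram(Z)
plt.title('Hierarchical clustering dendrogram')
plt.show()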
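A dendrogram shows the full merge hierarchy, and flat cluster labels can be obtained from the same linkage matrix by cutting the tree. The sketch below uses SciPy's `fcluster` with `criterion='maxclust'`; the synthetic array `X_demo` is an assumption made for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Three tight, well-separated groups of 20 points each.
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(20, 2)),
    rng.normal(loc=(5, 5), scale=0.3, size=(20, 2)),
    rng.normal(loc=(0, 5), scale=0.3, size=(20, 2)),
])

# Build the hierarchy, then cut it so that at most 3 flat clusters remain.
Z = linkage(X_demo, method='ward')
labels = fcluster(Z, t=3, criterion='maxclust')

# fcluster labels start at 1, not 0.
print(np.unique(labels, return_counts=True))
```

Cutting by `criterion='distance'` instead would select clusters by merge height, which corresponds to drawing a horizontal line across the dendrogram.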
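Clustering Evaluation Metrics
Silhouette Coefficient
from sklearn.metrics import silhouette_score
# Ranges from -1 to 1; higher means tighter, better-separated clusters.
# Needs no ground-truth labels.
score = silhouette_score(X, y_pred)
print(f"Silhouette coefficient: {score:.2f}")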
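Adjusted Rand Index
from sklearn.metrics import adjusted_rand_score
# Compares predicted labels with ground-truth labels:
# 1.0 is a perfect match, values near 0 are chance level
ari = adjusted_rand_score(y_true, y_pred)
print(f"Adjusted Rand index: {ari:.2f}")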
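Because cluster labels are arbitrary names, the adjusted Rand index compares groupings rather than label values. The small example below, with made-up label lists, shows that renaming the clusters leaves the score at 1.0 while a single misassigned sample lowers it.

```python
from sklearn.metrics import adjusted_rand_score

y_true_demo = [0, 0, 1, 1, 2, 2]
renamed = [2, 2, 0, 0, 1, 1]       # same grouping, different label names
one_mistake = [0, 0, 1, 1, 2, 1]   # last sample put in the wrong group

ari_renamed = adjusted_rand_score(y_true_demo, renamed)
ari_mistake = adjusted_rand_score(y_true_demo, one_mistake)
print(ari_renamed)   # 1.0
print(ari_mistake)   # lower than 1.0
```

Since the metric needs ground-truth labels, it is mainly useful on benchmark or synthetic data where the true grouping is known.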
Complete Example
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
import numpy as np
# Generate test data
X, y_true = make_blobs(n_samples=300, centers=4,
                       cluster_std=0.8, random_state=42)
# Algorithms to compare
algorithms = {
    "K-Means": KMeans(n_clusters=4, n_init=10, random_state=42),
    "DBSCAN": DBSCAN(eps=0.8, min_samples=5),
    "Hierarchical": AgglomerativeClustering(n_clusters=4)
}
for name, algo in algorithms.items():
    y_pred = algo.fit_predict(X)
    # Drop noise points (label -1); only DBSCAN produces them
    mask = y_pred != -1
    # silhouette_score raises ValueError with fewer than two clusters
    if np.unique(y_pred[mask]).size >= 2:
        score = silhouette_score(X[mask], y_pred[mask])
    else:
        score = float("nan")
    print(f"{name}: silhouette coefficient = {score:.2f}")
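Choosing an Algorithm
| Data characteristics | Recommended algorithm |
|---|---|
| Spherical clusters | K-Means |
| Arbitrary shapes | DBSCAN |
| Hierarchy needed | Hierarchical clustering |
| Large datasets | MiniBatch K-Means |
| Number of clusters known | K-Means, hierarchical clustering |
| Number of clusters unknown | DBSCAN |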
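
The table recommends MiniBatch K-Means for large datasets. A minimal sketch with scikit-learn's `MiniBatchKMeans` follows; the dataset size, `batch_size`, and `n_init` values are arbitrary choices for illustration, not tuned settings.

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# A larger synthetic dataset: 20,000 samples around 4 centers.
X_big, _ = make_blobs(n_samples=20_000, centers=4, cluster_std=0.8, random_state=42)

# Centroids are updated from small random batches instead of the full dataset.
mbk = MiniBatchKMeans(n_clusters=4, batch_size=1024, n_init=3, random_state=42)
labels = mbk.fit_predict(X_big)

print(mbk.cluster_centers_.shape)  # (4, 2)
print(labels.shape)                # (20000,)
```

The mini-batch updates trade a small amount of clustering quality for lower time and memory cost, which is why this variant is usually preferred once full K-Means becomes slow.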