Evaluation for Clustering Model
1. Evaluation for Clustering Tendency
Before conducting clustering, we usually evaluate whether a natural clustering tendency exists in the dataset. The most common measure is the Hopkins statistic.
Let S denote a sample space.
Randomly pick a subsample p with k points from the sample space S.
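The steps above lead to the standard form of the Hopkins statistic. As a reconstruction (the generated points and nearest-neighbor notation below are assumptions, since the original formula is missing), we additionally generate $k$ points uniformly at random over the range of $S$ and compare nearest-neighbor distances:

$$H = \frac{\sum_{i=1}^{k} y_i}{\sum_{i=1}^{k} x_i + \sum_{i=1}^{k} y_i}$$

where $x_i$ is the distance from the $i$-th point of the subsample $p$ to its nearest neighbor in $S$, and $y_i$ is the distance from the $i$-th uniformly generated point to its nearest neighbor in $S$. A value of $H$ close to 0.5 suggests the data are close to uniform (no clustering tendency), while a value close to 1 suggests a strong clustering tendency.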
2. Evaluation for Clustering Goodness
The evaluation of clustering goodness can be categorized into internal evaluation and external evaluation. In external evaluation, we suppose we do have ground truth (true labels) during training, even though the labels will still be unavailable in real applications once the model is online. In such scenarios, we can simply reuse classification losses and classification evaluation metrics, such as entropy, impurity, accuracy, and F1 score.
However, the more common scenario is that we do not have labels to evaluate the quality of the clustering results even in training. In this case, we need internal evaluation. Usually, we consider a clustering result better if the points are compact inside each cluster while the clusters themselves are spread out in the whole space. Thus, we often construct measures based on the internal distances within a cluster and the distances between clusters.
Distortion
The overall distortion is defined as the sum of squared distances from each point to the centroid of its assigned cluster:

$$J = \sum_{c=1}^{k} \sum_{x \in C_c} \lVert x - \mu_c \rVert^2$$

where $\mu_c$ is the centroid of cluster $C_c$. The smaller the distortion, the better the clustering result. We can also replace the squared Euclidean distance with another distance measure when appropriate.
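As a minimal NumPy sketch (the function name and the toy data are illustrative, not from the original), distortion can be computed directly from points, labels, and centroids:

```python
import numpy as np

def distortion(X, labels, centroids):
    """Sum of squared Euclidean distances from each point to the
    centroid of its assigned cluster."""
    return float(sum(np.sum((X[labels == c] - centroids[c]) ** 2)
                     for c in range(len(centroids))))

# two compact, well-separated clusters of two points each
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([X[labels == c].mean(axis=0) for c in range(2)])
print(distortion(X, labels, centroids))  # → 4.0
```

Each point sits at squared distance 1 from its cluster centroid, so the total is 4.0; a worse assignment of the same points would yield a larger value.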
Cohesion, Separation and Silhouette Coefficient
These three metrics are all on a sample point level.
The cohesion is the average distance between a sample and all other samples in the same cluster:

$$a(i) = \frac{1}{|C_{\text{in}}| - 1} \sum_{j \in C_{\text{in}},\, j \neq i} d(i, j)$$
The separation is the average distance between a sample and all samples in the nearest neighboring cluster:

$$b(i) = \min_{C \neq C_{\text{in}}} \frac{1}{|C|} \sum_{j \in C} d(i, j)$$
Good clustering performance should yield a small $a(i)$ and a large $b(i)$. Thus, we define the silhouette coefficient of a data point as:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$$
We can then calculate the overall silhouette coefficient by averaging $s(i)$ over all data points:

$$S = \frac{1}{n} \sum_{i=1}^{n} s(i)$$
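The a(i)/b(i) definitions above can be sketched in plain NumPy (the function name and toy data are illustrative assumptions):

```python
import numpy as np

def silhouette(X, labels):
    """Per-point silhouette coefficients: a(i) is the mean distance to
    points in the same cluster, b(i) the mean distance to the nearest
    other cluster, s(i) = (b - a) / max(a, b)."""
    # full pairwise Euclidean distance matrix
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    ks = np.unique(labels)
    s = np.empty(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself from a(i)
        a = D[i, same].mean() if same.any() else 0.0
        b = min(D[i, labels == k].mean() for k in ks if k != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
labels = np.array([0, 0, 1, 1])
print(silhouette(X, labels).mean())  # close to 1: compact, separated
```

For these two tight, far-apart clusters the mean silhouette is about 0.9; values near 0 or below would indicate overlapping or misassigned clusters.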
Calinski-Harabasz Index
The Calinski-Harabasz index is the ratio of the between-cluster dispersion to the within-cluster dispersion. The within-cluster dispersion is defined in the same way as distortion:

$$SS_W = \sum_{c=1}^{k} \sum_{x \in C_c} \lVert x - \mu_c \rVert^2, \qquad SS_B = \sum_{c=1}^{k} |C_c|\, \lVert \mu_c - \mu \rVert^2$$

$$CH = \frac{SS_B / (k - 1)}{SS_W / (n - k)}$$

where $\mu$ is the overall mean of the data, $n$ the number of points, and $k$ the number of clusters. A larger index indicates a better clustering.
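A minimal NumPy sketch of the index, assuming the standard SS_B/SS_W definitions (the function name and toy data are illustrative):

```python
import numpy as np

def calinski_harabasz(X, labels):
    """CH = (between-cluster dispersion / (k-1)) /
            (within-cluster dispersion / (n-k)); larger is better."""
    n, ks = len(X), np.unique(labels)
    k, mean = len(ks), X.mean(axis=0)
    ssb = sum((labels == c).sum()
              * np.sum((X[labels == c].mean(axis=0) - mean) ** 2)
              for c in ks)
    ssw = sum(np.sum((X[labels == c] - X[labels == c].mean(axis=0)) ** 2)
              for c in ks)
    return float((ssb / (k - 1)) / (ssw / (n - k)))

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
labels = np.array([0, 0, 1, 1])
print(calinski_harabasz(X, labels))  # → 200.0
```

Here SS_B = 100 and SS_W = 1 with k = 2, n = 4, giving (100/1)/(1/2) = 200; scikit-learn's `calinski_harabasz_score` computes the same quantity.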
3. Evaluation for Clustering Stability
A clustering should be stable: it should produce roughly the same clustering results (low variance) when the sample points vary. Suppose we have a dataset $X$ and a clustering plan $A$:

- Generate $m$ perturbed versions of the original dataset $X$, denoted as $X_1, \dots, X_m$
- For each version, conduct clustering plan $A$ and obtain clustering results $C_1, \dots, C_m$
- For each pair $(C_i, C_j)$, compute the distance $d(C_i, C_j)$
- The overall instability metric is the average pairwise distance:

$$\text{Instab} = \frac{2}{m(m-1)} \sum_{i < j} d(C_i, C_j)$$
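The procedure above can be sketched end to end in NumPy. Everything here is an illustrative assumption: the perturbation is subsampling without replacement, the clustering plan is a stand-in deterministic rule, and the pairwise distance is simple label disagreement on shared points (not the permutation-aligned minimal matching distance):

```python
import numpy as np

def instability(X, cluster_fn, m=10, frac=0.8, seed=0):
    """Average pairwise label disagreement across m subsample runs."""
    rng = np.random.default_rng(seed)
    n = len(X)
    labelings = []
    for _ in range(m):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        labels = np.full(n, -1)          # -1 marks points not sampled
        labels[idx] = cluster_fn(X[idx])
        labelings.append(labels)
    dists = []
    for i in range(m):
        for j in range(i + 1, m):
            a, b = labelings[i], labelings[j]
            shared = (a >= 0) & (b >= 0)  # points present in both runs
            dists.append(np.mean(a[shared] != b[shared]))
    return float(np.mean(dists))

# toy 1-D data: two well-separated blobs; stand-in clustering rule
toy = np.concatenate([np.random.default_rng(1).normal(-3, 0.5, 50),
                      np.random.default_rng(2).normal(3, 0.5, 50)]
                     ).reshape(-1, 1)
split_at_zero = lambda X: (X[:, 0] > 0).astype(int)
print(instability(toy, split_at_zero))  # → 0.0 (perfectly stable)
```

Because the stand-in rule is deterministic and the blobs never cross zero, every subsample agrees and the instability is exactly 0; a real clustering plan such as k-means on overlapping data would yield a positive value.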
These procedures raise the following questions:
Generating perturbed versions
Solutions include:

- Draw a random subsample from the original dataset, with or without replacement
- Add random noise to the original samples
- Create artificial samples through oversampling methods
Distance Measure
The clustering results of two perturbed versions can be written as label assignments over the same data points:

| Data points | Label in $C_i$ | Label in $C_j$ |
|---|---|---|
| $x_1$ | 1 | 2 |
| $x_2$ | 3 | 3 |
| ... | ... | ... |
| $x_n$ | k | k |
These can be seen as two collections of cluster labels, and the distance between them can be measured with:
- Minimal Matching Distance
- Jaccard Distance
- Hamming Distance
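As a hedged sketch of these three distances on label vectors (the function names and the pair-set formulation of the Jaccard distance are my assumptions; the minimal matching distance uses brute-force label permutations, fine only for small k):

```python
import numpy as np
from itertools import permutations

def pair_set(labels):
    """Set of unordered point pairs placed in the same cluster."""
    n = len(labels)
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if labels[i] == labels[j]}

def jaccard_distance(a, b):
    # 1 - overlap of co-clustered pairs; ignores label names entirely
    pa, pb = pair_set(a), pair_set(b)
    return 1.0 - len(pa & pb) / len(pa | pb)

def hamming_distance(a, b):
    # fraction of points with different labels; label-name sensitive
    return float(np.mean(np.asarray(a) != np.asarray(b)))

def minimal_matching_distance(a, b):
    # Hamming distance minimized over relabelings of b
    a, b = np.asarray(a), np.asarray(b)
    ks = np.unique(b)
    best = 1.0
    for perm in permutations(ks):
        mapping = dict(zip(ks, perm))
        relabeled = np.array([mapping[v] for v in b])
        best = min(best, float(np.mean(a != relabeled)))
    return best

a = [0, 0, 1, 1]
b = [1, 1, 0, 0]                        # same partition, labels swapped
print(minimal_matching_distance(a, b))  # → 0.0
print(hamming_distance(a, b))           # → 1.0
print(jaccard_distance(a, b))           # → 0.0
```

The example shows why alignment matters: the two clusterings are identical as partitions, so minimal matching and Jaccard distances are 0, while the raw Hamming distance is maximal because it compares label names directly.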
Details of these metrics can be found in this article