A Method for Comparing Two Hierarchical Clusterings: A Critical Analysis

Hierarchical clustering is a popular unsupervised learning technique for grouping similar data points into clusters. Measuring the similarity between two hierarchical clusterings, usually represented as dendrograms, matters for tasks such as method selection and cluster validation. Cophenetic correlation is a common approach, but this article examines the complications and pitfalls of comparing dendrograms in order to select the “right” method or distance measure.

Key Considerations When Comparing Hierarchical Clusterings

Several critical factors influence the interpretation and comparison of hierarchical clustering results:

Impact of Agglomeration Methods on Dendrogram Appearance

Different agglomeration methods (e.g., single, complete, Ward’s) inherently produce distinct dendrogram structures, even with identical or random data. Visually comparing dendrograms generated by different methods can be misleading and should not be the sole basis for method selection. A more appropriate comparison involves using the same method on different datasets to observe variations in clustering patterns.
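The point about random data can be checked numerically. The following is a minimal sketch, assuming NumPy and SciPy are available; the sample size, the seed, and the helper name merge_heights are illustrative choices, not part of any referenced method. It builds dendrograms for the same structureless data with three agglomeration methods and prints their merge heights, which differ even though the data contain no clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Structureless random data (assumed example: 30 points in 2 dimensions).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))


def merge_heights(data, method):
    """Return the n-1 merge heights of the dendrogram built with `method`."""
    Z = linkage(data, method=method, metric="euclidean")
    return Z[:, 2]  # third column of the linkage matrix holds merge heights


heights = {m: merge_heights(X, m) for m in ("single", "complete", "ward")}

for method, h in heights.items():
    print(f"{method:>8}: median merge height {np.median(h):.3f}, "
          f"last merge height {h[-1]:.3f}")
```

Because the data are identical across the three runs, any difference in the printed heights is attributable to the agglomeration method alone, which is why a visual difference between such dendrograms says little about which method fits the data better.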

Ward’s Method and Cluster Number Selection

Ward’s method, while effective, makes it harder to read the optimal number of clusters off its dendrogram. Its colligation coefficient is summative: it accumulates the within-cluster error sum of squares over all merges so far, so it grows fastest when large clusters are joined and can exaggerate how well separated the clusters appear. Dividing the coefficient growth at each step by the combined size of the two clusters being merged mitigates this, though plotting the standardized values as a dendrogram can be awkward. Formal internal clustering criteria offer a more robust basis for choosing the number of clusters than visual inspection alone.
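The standardization described above can be sketched with SciPy, under some stated assumptions. SciPy's linkage with method="ward" returns a Ward distance at each merge rather than a cumulative coefficient; under SciPy's definition the within-cluster sum of squares grows by half the squared Ward distance at each merge, so the summative coefficient is reconstructed here as a running total. The three-blob data set and all variable names are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Assumed example data: three Gaussian blobs of 20 points each.
rng = np.random.default_rng(1)
centers = ((0.0, 0.0), (5.0, 0.0), (0.0, 5.0))
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(20, 2)) for c in centers])

Z = linkage(X, method="ward")  # raw observations, Euclidean distance
ward_distance = Z[:, 2]        # Ward distance at each merge
merged_size = Z[:, 3]          # number of objects in the newly formed cluster

# Growth of the within-cluster sum of squares at each step
# (holds for SciPy's Ward distance definition).
growth = ward_distance ** 2 / 2.0

# Summative coefficient: running total of the growth.
cumulative = np.cumsum(growth)

# Standardized growth: divide by the combined size of the two merged clusters.
standardized = growth / merged_size

for step in range(-4, 0):
    print(f"step {len(growth) + step + 1}: cumulative={cumulative[step]:.2f}, "
          f"growth={growth[step]:.2f}, standardized={standardized[step]:.3f}")
```

As a sanity check on the reconstruction, the final cumulative value should equal the total sum of squares of the data around its grand mean, since the last merge puts every object in one cluster.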

Conscious Selection of Distance Measures and Methods

Blindly experimenting with distance measures and agglomeration methods is discouraged. Instead, a conscious selection based on the data characteristics and research objectives is recommended. The chosen distance should accurately reflect the relevant aspects of dissimilarity, while the method should align with the desired cluster archetype (e.g., type, circle, chain).

Data Type, Distance Measure, and Method Compatibility

Certain methods have specific data and distance requirements. For instance, Ward’s and centroid methods, relying on centroid calculations in Euclidean space, are best suited for continuous data and (squared) Euclidean distance. Applying these methods to binary data or non-Euclidean distances can lead to incongruous results. Careful consideration of data/distance/method compatibility is paramount.

Data Preprocessing and Visual Inspection

Data preprocessing, including centering, scaling, and transformations, significantly impacts clustering outcomes. Thoughtful preprocessing, guided by interpretational considerations, is essential. Visual inspection of the data before applying clustering algorithms can reveal underlying patterns and inform preprocessing decisions.
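A small constructed example shows how strongly scaling can change the result. This is a sketch under assumed conditions: the toy data, the choice of average linkage, and the helper same_partition are all illustrative. One feature carries the group structure on a small scale, the other is uninformative but measured on a large scale.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data (assumed): feature `a` separates two groups, feature `b` does not
# but has a much larger numeric range.
group = np.array([0, 0, 0, 0, 1, 1, 1, 1])
a = group.astype(float)
b = np.tile([0.0, 400.0, 800.0, 1200.0], 2)
X = np.column_stack([a, b])


def two_cluster_labels(data):
    """Cut an average-linkage dendrogram into two clusters."""
    Z = linkage(data, method="average", metric="euclidean")
    return fcluster(Z, t=2, criterion="maxclust")


def same_partition(labels_a, labels_b):
    """True if two label vectors induce the same grouping, ignoring label names."""
    co_a = labels_a[:, None] == labels_a[None, :]
    co_b = labels_b[:, None] == labels_b[None, :]
    return bool(np.array_equal(co_a, co_b))


raw_labels = two_cluster_labels(X)

# Center and scale each feature to zero mean and unit variance.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
scaled_labels = two_cluster_labels(X_scaled)

print("raw data recovers groups:   ", same_partition(raw_labels, group))
print("scaled data recovers groups:", same_partition(scaled_labels, group))
```

On the raw data the large-scale feature dominates the Euclidean distance, so objects are paired by their value of b regardless of group. Whether scaling is the right choice still depends on what the features mean, which is the interpretational judgment the paragraph above calls for.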

Hierarchical Classification vs. Clustering History

Not all agglomerative clustering methods yield a true hierarchical classification. While some methods, like the centroid method, define clusters based on emergent features, others, like complete linkage, focus on individual object distances, creating a historical record of merging rather than a taxonomic structure.

Limitations of Greedy Algorithms in Hierarchical Clustering

Hierarchical clustering employs a greedy algorithm that makes the locally optimal choice at each step. Because a merge is never undone, an early suboptimal choice propagates through all later steps, so the final solution can drift further from the optimum as the number of objects, and with it the number of steps, grows. For large samples (thousands of objects), alternative clustering techniques may be more suitable.

Cophenetic Correlation and Beyond

Cophenetic correlation compares two dendrograms by correlating, over all pairs of objects, the colligation coefficients (the heights at which each pair is first joined into one cluster) or their ranks. It offers a quantitative measure of similarity, but it does not remove the complexities described above: a high or low correlation still reflects the choice of method, distance, and preprocessing. Further research into comparing dendrograms and hierarchical classifications is encouraged, and alternative metrics are worth exploring alongside the inherent limitations of hierarchical clustering.
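The pairwise comparison described above can be sketched with SciPy's cophenet function, which returns the cophenetic distance for every pair of objects in condensed form. The data, the two linkage methods compared, and the function name dendrogram_similarity are assumptions made for illustration; Pearson correlation is used for the coefficients themselves and Spearman correlation for their ranks.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.stats import pearsonr, spearmanr


def dendrogram_similarity(Z1, Z2):
    """Correlate the cophenetic distances of two linkage matrices.

    Both dendrograms must be built on the same objects in the same order.
    Returns (Pearson r, Spearman rho).
    """
    d1 = cophenet(Z1)  # condensed vector, one entry per pair of objects
    d2 = cophenet(Z2)
    r, _ = pearsonr(d1, d2)
    rho, _ = spearmanr(d1, d2)
    return float(r), float(rho)


# Assumed example data: 40 random points in 3 dimensions.
rng = np.random.default_rng(42)
X = rng.normal(size=(40, 3))

Z_average = linkage(X, method="average")
Z_complete = linkage(X, method="complete")

r, rho = dendrogram_similarity(Z_average, Z_complete)
print(f"Pearson r on cophenetic distances: {r:.3f}")
print(f"Spearman rho on their ranks:       {rho:.3f}")
```

The rank-based version is insensitive to monotone rescaling of the merge heights, which can matter when the two methods report coefficients on different scales.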

Conclusion

Comparing hierarchical clusterings requires careful consideration of various factors beyond simple visual comparison of dendrograms. Understanding the influence of agglomeration methods, distance measures, data types, and preprocessing techniques is crucial for accurate interpretation and method selection. While cophenetic correlation offers a quantitative approach, a holistic evaluation encompassing these factors is essential for robust and meaningful comparison of hierarchical clustering results.