A Comparative Approach to Clustering Algorithms: Performance Evaluation for Data Analysis

Introduction

The increasing automation of data collection across diverse systems has led to an unprecedented surge in available information [1–8]. This data deluge necessitates robust methodologies for organization and modeling [9], driving advancements in fields like diagnostics [10], education [11], forecasting [12], and numerous other domains [13]. These methodologies are central to the field of machine learning [14], a critical area within computer science and statistics due to its pivotal role in modern technology and data-driven decision-making.

Machine learning encompasses a broad spectrum of techniques, including regression analysis [15], feature selection [16], and classification [14]. Classification, the focus of this study, involves assigning objects within a dataset to predefined categories. Within classification, three primary paradigms exist: supervised, semi-supervised, and unsupervised. Supervised classification relies on prior knowledge of class labels for a subset of objects (the training set) to train algorithms for classification. Semi-supervised methods utilize both labeled and unlabeled data, often employed when manual labeling is expensive. Unsupervised classification, or clustering, is distinct in its approach. It aims to discover inherent groupings or classes directly from the data itself, without any pre-existing knowledge of class labels. Clustering algorithms identify clusters of objects that exhibit greater similarity to each other than to objects in other clusters. This approach is fundamental to data modeling, simplifying complex datasets into intuitive representations of relevant patterns. While generally more computationally intensive than supervised methods, clustering offers deeper insights into complex data structures and is the primary focus of this comparative study.

Clustering algorithm performance is significantly influenced by factors such as parameter settings, high dimensionality, and the presence of noise, incompleteness, or sampling biases in data. This variability has spurred the development of numerous clustering approaches (e.g., [17–19]). For practitioners, selecting the most appropriate clustering method for a given dataset or problem is a challenging task. Therefore, a comparative approach to evaluating different clustering methods is invaluable. Previous research has explored the comparison of clustering algorithms [20–29]. This study contributes to this body of work by employing a diversified and extensive set of artificial, normally distributed datasets. These datasets are designed with tunable properties including the number of classes, features, objects, and class separation, as well as variations in group structure, such as predefined correlation distributions between features. The use of artificial data allows for generating unlimited samples and systematically manipulating dataset properties. This controlled environment facilitates a comprehensive and rigorous evaluation of clustering algorithms across a wide range of conditions. It also allows us to quantify performance sensitivity to subtle data variations. It is crucial to note that the performance results presented here are specific to normally distributed data; different data types with alternative statistical behaviors may yield different outcomes. Performance in this study is defined as the similarity between known object labels and those identified by the clustering algorithm. Various metrics quantify this similarity [30], and we compare the Jaccard index [31], Adjusted Rand index [32], Fowlkes-Mallows index [33], and Normalized mutual information [34]. We utilize a modified version of a procedure developed by [35] to generate 400 distinct datasets for evaluating clustering algorithm performance. The data generation procedure and parameter settings are detailed herein. Related methodologies can be found in [36].

Parameter configuration is critical for clustering algorithm performance. A persistent challenge in machine learning is determining optimal parameter settings [37]. While optimization procedures (e.g., simulated annealing [38] or genetic algorithms [39]) could, in principle, identify parameter configurations that maximize algorithm performance, this approach faces significant hurdles. First, parameter optimization tailored to a specific dataset can lead to overfitting [40], where optimal parameters for the training data may result in reduced performance on new, unseen data. Second, parameter optimization can be computationally prohibitive, given the complexity of many algorithms and their often numerous parameters. Consequently, many researchers resort to using default parameter settings provided by software implementations. Therefore, evaluating and comparing clustering algorithm performance in both optimized and default parameter scenarios is essential. This study examines representative algorithms commonly used in the literature [37, 41].

Clustering algorithms are implemented in diverse programming languages and software packages. Ongoing development and optimization often lead to variations in algorithm implementations across different packages. This study focuses on a comparative analysis of several clustering algorithms available in popular R programming language packages [42]. R is a widely adopted language in data mining, offering well-established clustering packages, making it a relevant platform for this analysis. This research aims to guide researchers proficient in R but with limited experience in data clustering.

We evaluate algorithm performance in three scenarios: using default parameters, varying single parameters while keeping others at default values, and simultaneously varying all parameters through random sampling. Comparing results from these scenarios to default parameter performance allows us to assess potential performance improvements achievable through parameter modification.

The algorithms are tested on 400 artificial, normally distributed datasets generated using a robust methodology previously described in [36]. This methodology allows systematic manipulation of the number of features, classes, objects per class, and average inter-class distance across datasets.

This paper is structured as follows: We begin with a review of key approaches to clustering algorithm comparison. Next, we describe the clustering methods selected for analysis and the R packages implementing them. We then detail the data generation method and performance metrics employed for algorithm comparison. Finally, we present and discuss performance results obtained for default parameters, single parameter variations, and random parameter sampling.

Related Works

Prior research comparing clustering algorithm performance can be categorized by the type of datasets used. Some studies utilize real-world data, others artificial data, and some employ both to provide a more comprehensive comparative approach.

Comparative analyses using real-world datasets are documented in numerous studies [20, 21, 24, 25, 43, 44]. Several of these are briefly reviewed below. In [43], authors propose an evaluation framework based on multiple criteria decision-making in financial risk analysis, using three real-world credit and bankruptcy risk datasets. Clustering algorithms are evaluated using a combination of clustering measurements, including both external and internal validity indices. Their findings indicate that no single algorithm consistently achieves optimal performance across all measurements or datasets. This underscores the necessity of using multiple performance measures when evaluating clustering algorithms, highlighting the need for a comparative approach across different metrics.

In [21], a comparative analysis of clustering methods was conducted for text-independent speaker verification, using three document datasets. Two approaches were considered: distance-based objective function minimization clustering algorithms and Gaussian model-based approaches. The algorithms compared were k-means, random swap, expectation-maximization, hierarchical clustering, self-organizing maps (SOM), and fuzzy c-means. The study concluded that model order, representing the number of centroids or Gaussian components, was the most critical factor for algorithm success. Distance-based clustering algorithms showed similar recognition accuracy overall. SOM and hierarchical methods exhibited lower accuracy than other methods when the number of clusters was small. Computational efficiency analysis revealed split hierarchical clustering as the fastest method for the datasets examined.

The study in [25] evaluated five clustering methods: k-means, multivariate Gaussian mixture, hierarchical clustering, spectral, and nearest neighbor methods. Four proximity measures were used: Pearson and Spearman correlation coefficients, cosine similarity, and Euclidean distance. The algorithms were tested on 35 gene expression datasets from Affymetrix or cDNA chip platforms, using the adjusted Rand index for performance evaluation. The multivariate Gaussian mixture method demonstrated the best performance in recovering the actual number of clusters, with k-means showing similar performance. Hierarchical clustering exhibited limited performance, while spectral methods were highly sensitive to the chosen proximity measure. This emphasizes the importance of considering proximity measures in a comparative approach to clustering algorithm selection.

In [24], a comparative study examined five clustering algorithms: CLICK, self-organized mapping-based method (SOM), k-means, hierarchical, and dynamical clustering. Gene expression time series datasets from Saccharomyces cerevisiae yeast were used. A k-fold cross-validation procedure was employed for algorithm comparison. K-means, dynamical clustering, and SOM tended to achieve high accuracy across experiments. Hierarchical clustering, however, showed limited performance with larger datasets, exhibiting low accuracy in some instances.

Comparative analyses using artificial data are presented in [45–47]. In [47], two subspace clustering methods, MAFIA (Adaptive Grids for Clustering Massive Data Sets) [48] and FINDIT (A Fast and Intelligent Subspace Clustering Algorithm Using Dimension Voting) [49], were compared. Artificial data, modeled with normal distribution, allowed for controlled manipulation of dimensions and instances. Methods were evaluated for scalability and accuracy. Scalability was assessed by comparing running times for varying numbers of instances and features. The ability of methods to identify appropriate subspaces for each cluster was also evaluated. MAFIA identified all relevant clusters but often omitted one significant dimension. FINDIT performed better at identifying relevant dimensions. Both algorithms scaled linearly with the number of instances, but MAFIA generally outperformed FINDIT.

Another common comparative approach involves using a combination of real-world and artificial datasets (e.g., [23, 26–28, 50]). In [28], k-means, single linkage, and simulated annealing (SA) performance was evaluated using different partitions obtained by validation indices. Two real-world datasets from [51] and three artificial datasets (two dimensions, 10 clusters) were used. A new validation index, the I index, was proposed. This index measures separation based on maximum inter-cluster distance and compactness based on the sum of distances between objects and their centroids. The I index was found to be the most reliable among indices tested, reaching its maximum value when the number of clusters was correctly chosen.

A systematic quantitative evaluation of four graph-based clustering methods was performed in [27]. The methods compared were Markov clustering (MCL), restricted neighborhood search clustering (RNSC), super paramagnetic clustering (SPC), and molecular complex detection (MCODE). Six protein interaction datasets from Saccharomyces cerevisiae and 84 random graphs were used for comparison. Algorithm robustness was measured by quantifying performance variation in response to changes in (i) method parameters and (ii) dataset properties (connection additions/removals to simulate relationship uncertainties). Restricted neighborhood search clustering proved particularly robust to parameter variations, while other algorithms were more robust to dataset alterations.

In [52], the authors briefly compared clustering algorithms using the Fundamental Clustering Problems Suite (FCPS). The FCPS contains artificial and real datasets designed to test clustering algorithms against specific challenges. For example, in the Hepta and Lsun datasets, clusters are linearly separable but have different densities and variances. The ChainLink and Atom datasets are not linearly separable, and the Target dataset contains outliers. Single linkage clustering exhibited lower performance on the Tetra, EngyTime, TwoDiamonds, and WingNut datasets. While versatile, the FCPS does not allow for controlled evaluation of how characteristics such as the number of features affect clustering accuracy, highlighting a limitation of some comparative approaches.

Clustering Methods

A wide variety of clustering methods exist in the literature [53–56]. While diverse, some methods are more frequently used [57]. Many common methods share underlying assumptions about data (e.g., k-means and k-medoids) or employ similar mathematical concepts (e.g., similarity matrices in spectral or graph clustering), leading to potentially similar performance in typical scenarios. Therefore, we have selected a range of clustering algorithms representing different method families for this comparative analysis. Several taxonomies categorize clustering algorithms [29, 58]. Some categorize algorithms by objective function [58], others by desired cluster structures (e.g., hierarchical) [29]. We consider the algorithms listed in Table 1 as representative examples of the categories indicated. These algorithms represent major method types in the field. Note that while some algorithms belong to the same family, they may have significant differences in application (e.g., clara for very large datasets). A brief description of the parameters for each algorithm is provided in S1 File of the supplementary material.

[Table 1. Clustering methods considered in our analysis and the respective libraries and functions in R employing the methods.

Algorithm Category R Function R Library
K-means Partitional kmeans stats
Clara Partitional clara cluster
Optics Density-based optics dbscan
Agnes Hierarchical agnes cluster
Hcmodel Model-based hc mclust
EM Model-based Mclust mclust
Spectral Spectral specc kernlab
Subspace (HDDC) Subspace hddc HDclassif
Dbscan Density-based dbscan dbscan

The first column shows the name of the algorithms used throughout the text. The second column indicates the category of the algorithms. The third and fourth columns contain, respectively, the function name and R library of each algorithm.](article/figure/image?size=large&id=10.1371/journal.pone.0210236.t001)

Partitional Approaches:

The k-means algorithm [68] is a widely used partitional clustering method [57]. It requires the number of clusters (k) and a distance metric as input. Initially, data points are assigned to one of the k clusters based on their distance to cluster centroids (cluster centers). Figure 1(a) illustrates this, with black points representing centroids and other points colored according to their closest centroid. New centroids are then calculated, and data point classification is repeated for the new centroids, as shown in Figure 1(b), where gray points indicate previous centroid positions. This iterative process continues until centroid positions stabilize, as depicted in Figures 1(c) and 1(d).

[Fig 1. Illustration of the k-means clustering method.

Each plot shows the partition obtained after specific iterations of the algorithm. The centroids of the clusters are shown as a black marker. Points are colored according to their assigned clusters. Gray markers indicate the position of the centroids in the previous iteration. The dataset contains 2 clusters, but k = 4 seeds were used in the algorithm.](article/figure/image?size=large&id=10.1371/journal.pone.0210236.g001)

A key limitation of k-means is the a priori requirement to specify the number of clusters. The final classification can be highly sensitive to this choice [68]. K-means is also less suitable for clusters that are non-convex or have significantly different sizes [59, 60]. Furthermore, k-means is sensitive to initial seed selection [41]. Despite these limitations, its low computational cost and effectiveness in many practical applications, such as anomaly detection [66] and data segmentation [67], make it valuable. The R implementation of k-means used here is the kmeans function from the stats package, which includes the algorithms proposed by Macqueen [68] and Hartigan and Wong [69]. The Hartigan and Wong algorithm is used by default, while Macqueen’s algorithm can be selected as an alternative through the function’s algorithm argument.
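
As a concrete illustration, the following minimal R sketch applies kmeans to synthetic data; the data, the number of clusters, and the nstart value are arbitrary choices for illustration and are not the settings used in this study.

```r
# Minimal sketch: k-means with the stats package (loaded by default in R).
# The synthetic data, k = 2 and nstart = 10 are illustrative choices.
set.seed(1)
x <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
           matrix(rnorm(200, mean = 4), ncol = 2))
fit <- kmeans(x, centers = 2, nstart = 10, algorithm = "Hartigan-Wong")
table(fit$cluster)  # cluster sizes
fit$centers         # estimated centroids
```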

Clustering for large applications (clara) [70] is another important partitional method designed for large datasets. Clara takes multiple fixed samples of the dataset to minimize sampling bias and selects the best medoids from these samples. A medoid is an object i that minimizes the average dissimilarity to other objects in its cluster. Clara is efficient for large datasets because it does not examine the entire data neighborhood [71], although result quality can be sensitive to the number of objects in the sample data [62]. The clara algorithm used in this analysis is the clara function in the cluster package, implementing the method developed by Kaufman and Rousseeuw [70].
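
A minimal sketch of a clara call is given below; the data size, the number of clusters, and the samples value are illustrative choices.

```r
library(cluster)
# Minimal sketch: clara on a moderately large numeric matrix.
# The data, k = 3 and samples = 20 are illustrative choices.
set.seed(1)
x <- matrix(rnorm(20000), ncol = 2)             # 10000 objects, 2 features
fit <- clara(x, k = 3, metric = "euclidean", samples = 20)
fit$medoids                                     # representative objects (medoids)
table(fit$clustering)                           # cluster sizes
```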

Density-based Approaches:

Ordering Points To Identify the Clustering Structure (OPTICS) [72, 73] is a density-based clustering algorithm based on the concept of maximal density-reachability [72]. OPTICS starts with a data point and expands its neighborhood similarly to the DBSCAN algorithm [74], but expands to points with low core-distance first. The core distance of an object p is the m-th smallest distance between p and objects in its ϵ-neighborhood (objects within distance ϵ of p), where m is a parameter representing the minimum cluster size. OPTICS can detect clusters with varying densities and irregular shapes. The R implementation of OPTICS is the optics function in the dbscan package, which implements the original algorithm by Ankerst et al. [72]. Hierarchical cluster structures can be extracted from OPTICS output using the extractXi function from the dbscan package. The extractDBSCAN function in the same package provides clustering similar to DBSCAN from an OPTICS ordering.
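
The sketch below illustrates the typical workflow with the dbscan package: computing the OPTICS ordering and then extracting a flat, DBSCAN-like clustering; the data and parameter values are illustrative.

```r
library(dbscan)
# Minimal sketch: OPTICS ordering followed by a DBSCAN-like cluster extraction.
# The data and the eps, minPts and eps_cl values are illustrative choices.
set.seed(1)
x <- matrix(rnorm(600), ncol = 2)
ord <- optics(x, eps = 2, minPts = 10)       # reachability ordering
res <- extractDBSCAN(ord, eps_cl = 0.6)      # flat clustering at a fixed threshold
# res <- extractXi(ord, xi = 0.05)           # alternative: hierarchical Xi extraction
table(res$cluster)                           # cluster 0 corresponds to noise points
```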

Hierarchical Approaches:

Hierarchical clustering methods, which consider data point linkages, are broadly divided into agglomerative and divisive approaches [59]. Agglomerative hierarchical clustering starts with each object in its own cluster and iteratively merges clusters until a stopping condition is met. Divisive hierarchical clustering begins with all objects in one cluster and recursively splits clusters. The R language provides hierarchical clustering routines in the stats and cluster packages. Here, we use the agnes routine from the cluster package, which implements the algorithm by Kaufman and Rousseeuw [70]. agnes offers four linkage criteria: single linkage, complete linkage, Ward’s method, and weighted average linkage [75].
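
The sketch below illustrates an agnes call followed by the extraction of a flat partition with cutree; the data, the linkage criterion, and the number of clusters are illustrative choices.

```r
library(cluster)
# Minimal sketch: agglomerative clustering with agnes and extraction of a flat
# partition. The data, the linkage method and k = 3 are illustrative choices.
set.seed(1)
x <- matrix(rnorm(300), ncol = 3)                         # 100 objects, 3 features
ag <- agnes(x, metric = "euclidean", method = "average")  # UPGMA linkage (the default)
labels <- cutree(as.hclust(ag), k = 3)                    # cut the dendrogram into 3 clusters
table(labels)
```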

Model-based Approaches:

Model-based methods provide a framework for estimating the maximum likelihood of underlying distribution parameters for a given dataset. The expectation-maximization (EM) algorithm is a prominent example. Often, data from each class is modeled by multivariate normal distributions, and the overall data distribution is seen as a mixture of these normal distributions. A maximum likelihood approach is used to find the most probable parameters for each class’s normal distribution. EM clustering is particularly useful for incomplete datasets [76, 77]. However, cluster results can be sensitive to initial conditions [54], and the algorithm may struggle to identify very small clusters [29, 78]. In R, the mclust package [79, 80] provides iterative EM methods for maximum likelihood estimation using parameterized Gaussian mixture models. The estep and mstep functions implement individual EM iteration steps. A related algorithm analyzed here is hcmodel, found in the hc function of the mclust package. Hcmodel, also based on Gaussian mixtures, was proposed by Fraley [81]. It includes additional steps compared to traditional EM methods, such as agglomerative procedures and model parameter adjustment using Bayes factor selection with BIC approximation [82].
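
The sketch below illustrates both routines; the data, the range of mixture components, and the model name are illustrative choices.

```r
library(mclust)
# Minimal sketch: EM-based Gaussian mixture clustering (Mclust) and model-based
# hierarchical clustering (hc). The data, G = 1:5 and modelName are illustrative.
set.seed(1)
x <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
           matrix(rnorm(200, mean = 4), ncol = 2))
em_fit <- Mclust(x, G = 1:5)          # BIC selects the number of components and model
summary(em_fit)
hc_fit <- hc(x, modelName = "VVV")    # agglomerative model-based hierarchy
table(hclass(hc_fit, G = 2))          # 2-group partition extracted from the hierarchy
```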

Spectral Clustering:

Spectral clustering methods emerged as an alternative to traditional approaches that are unable to define nonlinear discriminative hypersurfaces [83]. Their key advantage is defining an adjacency structure directly from the dataset, avoiding pre-imposed cluster shapes [84]. The first step is constructing an affinity matrix A, where each entry Aij represents the similarity between two points; this matrix represents the data as a weighted graph. Eigenvalues and eigenvectors of this matrix are then used to partition the data based on a chosen criterion. Various similarity matrices can be used, with the Laplacian matrix being a popular choice [85]. Spectral methods can be computationally intensive due to the eigenvector calculation [54]. The specc function from the kernlab R package implements the algorithm by Ng, Jordan, and Weiss [86], using a kernel function to compute the affinity matrix as Aij = exp(−||xi − xj||²/(2σ²)), where xi and xj are the points to be clustered into k subsets and σ² controls the decay rate of the affinity with distance. A recent discussion of the relationship between spectral clustering and k-means can be found in [87]. Weighted kernel k-means is a generalization of k-means used to locally optimize graph partitioning objectives into k disjoint clusters; spectral algorithms based on eigenvectors are typically used for this purpose but can be computationally expensive for large datasets, and weighted kernel k-means offers a less costly way of optimizing such objectives.
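
The following minimal sketch illustrates a specc call; the synthetic data and parameter values are illustrative choices.

```r
library(kernlab)
# Minimal sketch: spectral clustering with specc. The data, centers = 2 and the
# automatic estimation of the kernel width are illustrative choices.
set.seed(1)
x <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
           matrix(rnorm(200, mean = 4), ncol = 2))
sc <- specc(x, centers = 2, kernel = "rbfdot", kpar = "automatic")
table(sc@.Data)     # cluster labels
centers(sc)         # cluster centers in the original space
```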

Subspace Clustering:

Efficient handling of high-dimensional data is increasingly important, making this a desirable feature in clustering methods. Subspace clustering [49] addresses high dimensionality by considering object similarity within different attribute subsets [88]. This is motivated by the idea that different attribute subsets may reveal distinct data separations. The algorithm can identify clusters existing in multiple, potentially overlapping, subspaces [49]. Subspace algorithms are categorized into lattice, statistical, approximation, and hybrid families [89]. The hddc function from the HDclassif package implements Bouveyron’s subspace clustering method [90] in R. This statistical model-based algorithm assumes all attributes may be relevant for clustering [91]. Parameters like the number of clusters and model type are estimated using an EM procedure.
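
A minimal sketch of an hddc call is given below; the data and the candidate range of cluster numbers are illustrative choices.

```r
library(HDclassif)
# Minimal sketch: subspace clustering with hddc. The data and the candidate range
# K = 1:4 are illustrative; the number of clusters is selected internally (BIC).
set.seed(1)
x <- rbind(matrix(rnorm(500, mean = 0), ncol = 5),
           matrix(rnorm(500, mean = 3), ncol = 5))
fit <- hddc(x, K = 1:4)     # EM-based estimation for each candidate K
fit$K                       # selected number of clusters
table(fit$class)            # cluster assignments
```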

While we have focused on static data clustering, analyzing dynamic data is also crucial. Dynamic data, unlike static data, changes over time. Data streams, such as network packets and transaction streams, are transient. Time series data also represents dynamic data, with values changing over time [92]. Dynamic data typically involves numerous features and potentially unbounded objects [59]. This necessitates novel approaches for rapid processing of continuous data streams [93], detection of new cluster formations, and outlier identification [94]. This comparative approach focuses on static data clustering, but dynamic data clustering is an important area for future research.

Materials and Methods

Artificial Datasets

Robust comparison of clustering algorithms requires a method for generating diverse artificial datasets. We use a methodology based on the work of Hirschberger et al. [35]. This procedure generates normally distributed samples with F features, divided into C classes. It allows control over variance and correlation distributions among features within each class. The number of objects per class (Ne) and the expected separation (α) between classes are also tunable.

Generating datasets with these properties hinges on defining a valid covariance matrix R for the features. A valid covariance matrix must be positive semi-definite [95], which is not always straightforward to ensure. However, for any given matrix G, the matrix R = GGᵀ is guaranteed to be positive semi-definite [95]. Thus, any random matrix G can define a valid covariance matrix. By imposing constraints on G, we can generate datasets with desired properties. Hirschberger et al. [35] developed a method to generate such a matrix based on the first two statistical moments of the covariance distribution of F artificial features. The resulting covariance matrix contains variances and covariances drawn from this distribution. Here, we use a normal distribution to represent the elements of R.

For each class i, a covariance matrix Ri of size F × F is created and used to generate Ne objects for that class. This allows for distinct feature correlations for each class. Generated class values are divided by α and translated by si, where si is a uniform random variable in the interval [−1, 1]. The parameter α relates to the expected inter-class distances. These distances can have varying impacts on clustering depending on the number of objects and features. Features in the generated data follow a multivariate normal distribution, and the covariance among features also follows a normal distribution. This data generation procedure was previously used in [36].
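
The following R sketch illustrates the core idea of the procedure, namely that a random matrix G yields a valid covariance matrix R = GGᵀ, which is then used to draw, rescale, and translate each class; it is a simplified illustration rather than the exact procedure of [35], and all parameter values are arbitrary.

```r
library(MASS)  # for mvrnorm
# Simplified sketch of the data generation idea: a random matrix G defines a valid
# covariance matrix R = G %*% t(G); each class is drawn from a multivariate normal
# distribution, rescaled by alpha and randomly translated. This illustrates the idea
# only; it is not the exact procedure of [35], and all values are arbitrary.
set.seed(1)
gen_class <- function(nf, ne, alpha) {   # nf features, ne objects per class
  G <- matrix(rnorm(nf * nf), ncol = nf)
  R <- G %*% t(G)                        # positive semi-definite covariance matrix
  s <- runif(nf, min = -1, max = 1)      # random translation of the class
  sweep(mvrnorm(ne, mu = rep(0, nf), Sigma = R) / alpha, 2, s, "+")
}
data   <- do.call(rbind, lapply(1:2, function(i) gen_class(nf = 2, ne = 100, alpha = 3.3)))
labels <- rep(1:2, each = 100)
```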

Figure 2 shows examples of artificially generated data, all with F = 2 features for visualization. Parameter settings for each case are described in the figure caption. This methodology can generate diverse dataset configurations, including variations in feature correlation across classes.

[Fig 2. Examples of artificial datasets generated by the methodology.

The parameters used for each case are (a) C = 2, Ne = 100 and α = 3.3. (b) C = 2, Ne = 100 and α = 2.3. (c) C = 10, Ne = 50 and α = 4.3. (d) C = 10, Ne = 50 and α = 6.3. Note that each class can present highly distinct properties due to differences in correlation between their features.](article/figure/image?size=large&id=10.1371/journal.pone.0210236.g002)

In this study, we used the following parameter values for artificial datasets:

  • Number of classes (C): C = {2, 10, 50}
  • Number of features (F): F = {2, 5, 10, 50, 200}
  • Number of objects per class (Ne): Ne = {5, 50, 100, 500, 5000} (the number of instances per class is constant within a dataset).
  • Mixing parameter (α): Tuned for each dataset to avoid algorithms achieving 0% or 100% accuracy, as α’s effect is non-trivially dependent on the number of classes and features.

Datasets with 2, 10, 50, and 200 features are referred to as DB2F, DB10F, DB50F, and DB200F, respectively. These datasets include all considered numbers of classes (C = {2, 10, 50}) and 50 objects per class (Ne = 50). Specific datasets are sometimes denoted, e.g., DB2C10F refers to a dataset with 2 classes, 10 features, and 50 objects per class.

For each parameter combination, 10 dataset realizations were generated, resulting in a total of 400 datasets.

Evaluating the Performance of Clustering Algorithms

Evaluating partition quality is crucial in cluster analysis [30]. Quality indices are categorized as internal or external. Internal validation indices assess clustering structure goodness using only data-intrinsic information, without external knowledge. They evaluate cluster tightness and separation and are used to select optimal algorithms for specific datasets when correct partitions are unavailable [96]. External validation indices measure similarity between algorithm output and the known correct dataset partitioning. Jaccard, Fowlkes-Mallows, and adjusted Rand indices are pair-counting category indices and are closely related. Differences include potential biases with respect to the number of clusters or class size distributions. Normalization helps mitigate these biases. In [97], biases affecting pair-counting external cluster validity indices are discussed. Across 26 such indices, bias from cluster number was examined. Fowlkes-Mallows and Jaccard indices monotonically decrease as cluster number increases, favoring fewer clusters. The Adjusted Rand Index is generally indifferent to cluster number.

Here, we use traditional external indices: the Jaccard Index (J) [31], Adjusted Rand Index (ARI) [32], Fowlkes-Mallows Index (FM) [33], and Normalized Mutual Information (NMI) [34]. To define these, let U = {u1, u2, …, uR} represent the original dataset partition, with ui being the subset of objects in cluster i. Let V = {v1, v2, …, vC} be the partition from a clustering algorithm. Let a be the number of object pairs in the same group in both U and V. Mathematically, a is:

a = ∑i ∑j nij(nij − 1)/2,   (1)

where nij is the number of objects belonging to both subsets ui and vj.

Let b be the number of object pairs in the same group in U but different groups in V:

b = ∑i ni.(ni. − 1)/2 − ∑i ∑j nij(nij − 1)/2,   (2)

where ni. = ∑j nij. Let c be the number of object pairs in different groups in U but in the same group in V:

c = ∑j n.j(n.j − 1)/2 − ∑i ∑j nij(nij − 1)/2,   (3)

where n.j = ∑i nij.

The Jaccard Index (J), Adjusted Rand Index (ARI), and Fowlkes-Mallows (FM) index are defined as:

J = a / (a + b + c),   (4)

ARI = [a − (a + b)(a + c)/T] / [(2a + b + c)/2 − (a + b)(a + c)/T],   (5)

FM = a / √((a + b)(a + c)),   (6)

where T = N(N − 1)/2 is the total number of object pairs and N is the total number of objects.

We also use Normalized Mutual Information (NMI) because it quantifies mutual dependence between random variables based on information theory [98]. NMI is defined as [99]:

NMI(C, T) = I(C, T) / √(H(C) H(T)),   (7)

where C is the random variable for cluster assignments and T is the random variable for true class labels. I(C, T) = H(C) − H(C|T) is the mutual information between C and T, H(C) and H(T) are the Shannon entropies of C and T, and H(C|T) is the conditional entropy of C given T.

When two sets of labels perfectly correspond, all quality measures equal unity.
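
For illustration, the quantities a, b, and c, and the Jaccard and Fowlkes-Mallows indices, can be computed directly from the contingency table of two labelings, as in the sketch below; established implementations (e.g., adjustedRandIndex in the mclust package) can be used for the ARI and NMI, and the labelings here are illustrative.

```r
# Minimal sketch: pair-counting quantities a, b, c from the contingency table of two
# labelings, and the Jaccard and Fowlkes-Mallows indices derived from them.
pair_counts <- function(u, v) {
  n  <- table(u, v)
  a  <- sum(choose(n, 2))
  b  <- sum(choose(rowSums(n), 2)) - a
  cc <- sum(choose(colSums(n), 2)) - a
  list(a = a, b = b, c = cc)
}
u  <- rep(1:2, each = 50)                               # reference labels
v  <- sample(1:2, 100, replace = TRUE)                  # labels from a clustering algorithm
pc <- pair_counts(u, v)
J  <- pc$a / (pc$a + pc$b + pc$c)                       # Jaccard index
FM <- pc$a / sqrt((pc$a + pc$b) * (pc$a + pc$c))        # Fowlkes-Mallows index
```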

Previous studies indicate no single internal cluster validation index consistently outperforms others [100, 101]. In [101], comparing internal indices across scenarios showed the Silhouette index performing best most often.

The Dunn validation index quantifies both cluster compactness and inter-cluster separation, favoring partitions with large inter-cluster distances and small intra-cluster distances. Denoting the i-th cluster by ci, the index is defined as

D = min_{i≠j} dist(ci, cj) / max_k diam(ck),   (8)

where dist(ci, cj) is the minimum distance between objects of clusters ci and cj, and diam(ck) is the diameter of cluster ck, that is, the maximum distance between two of its objects. High Dunn index values indicate compact, well-separated clusters.

The Silhouette index (SI) computes a width for each point based on its cluster membership:

SI = (1/n) ∑i (bi − ai) / max(ai, bi),   (9)

where n is the total number of points, ai is the average distance between point i and the other points in its cluster, and bi is the minimum average dissimilarity between point i and the points of any other cluster. The SI range is −1 ≤ SI ≤ 1, with higher SI indicating better partitions.
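
In R, these internal indices are readily available; the sketch below evaluates a k-means partition with the silhouette function of the cluster package and the dunn function of the clValid package, where the data and the number of clusters are illustrative choices.

```r
library(cluster)   # silhouette()
library(clValid)   # dunn()
# Minimal sketch: internal validation of a k-means partition with the Silhouette and
# Dunn indices. The data and k = 3 are illustrative choices.
set.seed(1)
x  <- matrix(rnorm(300), ncol = 2)
cl <- kmeans(x, centers = 3)$cluster
d  <- dist(x)
mean(silhouette(cl, d)[, "sil_width"])   # average Silhouette width (SI)
dunn(distance = d, clusters = cl)        # Dunn index
```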

Results and Discussion

The accuracy of each clustering algorithm was evaluated using three methods: default algorithm parameters, single parameter variation, and random parameter variation. Default parameter evaluation simulates a user applying algorithms without parameter tuning, a common scenario for non-experts. Single parameter variation quantifies parameter influence on accuracy and tests simple optimization strategies. Random parameter variation samples algorithm performance across its parameter space, allowing quantification of maximum accuracy and sensitivity to parameter changes.

Performance when using default parameters

Algorithm performance was evaluated on all datasets described in Section Artificial datasets. All algorithms used default parameter configurations. Results are presented by dataset feature number (F). For each F, datasets with C = {2, 10, 50} classes and Ne = {5, 50, 100} objects per class were used. Performance for each F is averaged across different class numbers and objects per class. The subspace algorithm could not be applied to 2-feature datasets.

Figure 3 shows performance metric values. All metrics show similar trends. Hierarchical methods appear strongly affected by feature number, with lower accuracy at 50 and 200 features. K-means, spectral, optics, and DBSCAN benefit from increased features. HCmodel performs best with 10 features, suggesting optimal performance around this dimensionality. For 2 features, algorithms perform similarly, except optics and DBSCAN. More features induce larger performance differences. Spectral algorithms show best performance at 200 features.

[Fig 3. Average performance of the seven considered clustering algorithms according to the number of features in the dataset.

All artificial datasets were used for evaluation. The averages were calculated separately for datasets containing 2, 10 and 50 features. The considered performance indexes are (a) adjusted Rand, (b) Jaccard, (c) normalized mutual information and (d) Fowlkes Mallows.](article/figure/image?size=large&id=10.1371/journal.pone.0210236.g003)

The Kruskal-Wallis test [102], a non-parametric test, was used to assess statistical performance differences across feature numbers. For 2 features, the test yielded a p-value of p = 6.48 × 10−7, with a chi-squared statistic χ2 = 41.50, indicating significant performance differences. For 10 features, p = 1.53 × 10−8 (χ2 = 52.20). For 50 features, p = 1.56 × 10−6 (χ2 = 41.67). For 200 features, p = 2.49 × 10−6 (χ2 = 40.58). In all cases, the null hypothesis was rejected, confirming statistically significant performance differences across algorithms for 2, 10, 50, and 200 features, consistent with Figure 3.
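
For illustration, the sketch below applies kruskal.test to simulated per-dataset ARI values grouped by algorithm; the values are simulated and do not reproduce the results reported above.

```r
# Illustrative sketch of the statistical comparison: Kruskal-Wallis test on per-dataset
# ARI values grouped by algorithm. The values are simulated, not the study's results.
set.seed(1)
ari <- c(rnorm(30, mean = 0.6, sd = 0.1),
         rnorm(30, mean = 0.5, sd = 0.1),
         rnorm(30, mean = 0.4, sd = 0.1))
algorithm <- factor(rep(c("spectral", "kmeans", "hierarchical"), each = 30))
kruskal.test(ari ~ algorithm)   # reports the chi-squared statistic and the p-value
```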

To examine the influence of object number, average accuracy was calculated for datasets grouped by Ne. Figure 4 shows these results. The impact of Ne varies by algorithm. Hierarchical, k-means, and clara show lower accuracy with more data, suggesting reduced robustness to increased cluster overlap from more objects. HCmodel, optics, and DBSCAN performance improves with larger Ne, consistent with [90].

[Fig 4. Average performance of the seven considered clustering algorithms according to the number of objects per class in the dataset.

All artificial datasets were used for evaluation. The averages were calculated separately for datasets containing 5, 50 and 100 objects per class. The considered performance indexes are (a) adjusted Rand, (b) Jaccard, (c) normalized mutual information and (d) Fowlkes Mallows.](article/figure/image?size=large&id=10.1371/journal.pone.0210236.g004)

Data size affects clustering quality for most algorithms. To quantify this, we tested a scenario with high instance numbers using DB10C5F datasets (F = 5, C = 10, Ne = {5, 50, 500, 5000}). Figure 5 shows subspace and spectral methods improving with instance number. K-means, clara, hcmodel, and EM algorithm accuracy is largely unaffected. Spectral, hierarchical, and hcmodel accuracy could not be calculated at 5000 instances per class due to memory limitations. For spectral methods, kernel matrix computation and storage become computationally demanding at large dataset sizes. Subspace algorithms show lower accuracy at very small dataset sizes.

[Fig 5. Performance of the algorithms when the number of elements by class correspond to Ne = 5, 50, 500, 5000.

The plots correspond to the ARI, Jaccard and FM indexes averaged for all datasets containing 10 classes and 5 features (DB10C5F).](article/figure/image?size=large&id=10.1371/journal.pone.0210236.g005)

Algorithm performance was also assessed while varying the expected number of clusters (K), a parameter typically unknown for real datasets. We varied K for each algorithm and observed the resulting accuracy changes. Optics and DBSCAN, which do not take the number of clusters as a parameter, were excluded from this analysis. DB10C10F datasets (10 features, 10 classes) were used for simplicity. Figure 6 shows the results. Setting K < 10 leads to worse performance than K > 10, suggesting that slightly overestimating the number of clusters is the safer strategy. Hierarchical clustering shows accuracy improvement with increasing expected cluster number. This is due to the default method parameter, set to “average” (UPGMA), which averages the dissimilarities between the points of two clusters when deciding which clusters to merge. Its moderate performance, even with high subgroup differentiation, may be due to UPGMA’s tendency to produce unbalanced clusters, with most objects concentrated in a few clusters and many clusters containing only one or two objects.

[Fig 6. Performance of the algorithms when changing the expected number of clusters K in the dataset.

The upper plots correspond to the ARI and Jaccard indices averaged for all datasets containing 10 classes and 10 features (DB10C10F). The lower plots correspond to the Silhouette and Dunn indices for the same dataset. The red line indicates the actual number of clusters in the dataset.](article/figure/image?size=large&id=10.1371/journal.pone.0210236.g006)

External validation indices show that most algorithms correctly identify the 10 clusters in DB10C10F. However, the true cluster number is usually unknown in real analyses. Therefore, internal validation indices, which provide feedback on partition quality without reference labels, were also considered. The Silhouette and Dunn indices were applied to the DB10C10F and DB10C2F datasets while varying K. Figures 6 and 7 show the results. In Figure 6, results are similar across algorithms. The Silhouette index peaks around K = 10, while the Dunn index is slightly less accurate, misestimating the cluster number for the hierarchical algorithm. In Figure 7, the Silhouette and Dunn indices show similar behavior.

[Fig 7. Performance of the algorithms when changing the expected number of clusters K in the dataset.

The upper plots correspond to the ARI and Jaccard indices averaged for all datasets containing 10 classes and 2 features (DB10C2F). The lower plots correspond to the Silhouette and Dunn indices for the same dataset. The red line indicates the actual number of clusters in the dataset.](article/figure/image?size=large&id=10.1371/journal.pone.0210236.g007)

Default parameter results are summarized in Table 2. Each table section corresponds to a performance metric. Entry (i, j) is the average performance difference between method i and method j. The last column shows average performance for each algorithm across all datasets.

[Table 2. Average difference of accuracies obtained when clustering algorithms are used with their default configuration of parameters.

In general, the spectral algorithm provides the highest accuracy rate among all evaluated methods.](article/figure/image?size=large&id=10.1371/journal.pone.0210236.t002)

Table 2 indicates spectral algorithms generally outperform others by at least 10%. Hierarchical methods show lower performance in most cases. K-means and clara show equivalent performance across datasets. Spectral methods appear preferable when parameter optimization is not performed.

One-dimensional analysis

One-dimensional analysis assesses algorithm accuracy sensitivity to single parameter variations and tests simple optimization effectiveness. DB2C2F (α = 2.5), DB10C2F (α = 4.3), DB2C10F (α = 1.16), DB10C10F (α = 1.75), DB2C200F (α = 0.87), and DB10C200F (α = 1.09) datasets were used. For each parameter, values were varied while others remained at default. Performance change from parameter P variation was quantified by comparing accuracy Γ(x) at value x to default accuracy Γdef. Performance improvement was measured by average (〈S〉) and maximum (max S) values:

〈S〉 = (1/nP) ∑x [Γ(x) − Γdef],   (10)

max S = max_x [Γ(x) − Γdef],   (11)

where nP is the number of tested values for parameter P. Sensitivity to variations of P is measured by the standard deviation ΔS:

ΔS = √{(1/nP) ∑x [Γ(x) − Γdef − 〈S〉]²}.   (12)

The average maximum accuracy 〈max Acc〉 across datasets, achieved by varying each parameter, was also measured. Table 3 shows 〈S〉, max S, ΔS, and 〈max Acc〉 for the 2-feature datasets. For the two-class DB2C2F datasets, significant improvements (e.g., 〈S〉 = 10.75% and 〈S〉 = 13.35%) were observed when varying modelName, minPts, and kpar for the EM, optics, and spectral methods, respectively. Other cases showed only minor average gains. For the 10-class datasets, varying the method parameter of the hierarchical algorithm can cause a substantial accuracy loss (16.15% on average). In general, however, the average performance variation was small.
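
For clarity, the sketch below computes these measures from a hypothetical vector of ARI values obtained while varying a single parameter; the numbers are illustrative.

```r
# Illustrative computation of the single-parameter measures from a hypothetical vector
# `acc` of ARI values (one per tested parameter value) and the default-parameter ARI.
acc     <- c(0.52, 0.55, 0.61, 0.49, 0.58)   # illustrative values
acc_def <- 0.50
S <- acc - acc_def
mean(S)                       # <S>: average change relative to the default configuration
max(S)                        # max S: best improvement found for this parameter
sqrt(mean((S - mean(S))^2))   # Delta S: standard deviation of the changes (Eq 12)
max(acc)                      # best ARI for this parameter, used for <max Acc>
```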

[Table 3. One-parameter analysis performed in DB2C2F and DB10C2F.

This analysis is based on the performance (measured through the ARI index) obtained when varying a single parameter of the clustering algorithm, while maintaining the others in their default configuration. 〈S〉, max S, ΔS are associated with the average, standard deviation and maximum difference between the performance obtained when varying a single parameter and the performance obtained for the default parameter values. We also measure 〈max Acc〉, the average of best ARI values obtained when varying each parameter, where the average is calculated over all considered datasets.](article/figure/image?size=large&id=10.1371/journal.pone.0210236.t003)

Table 4 shows values for 10-feature datasets. For two-class clustering, moderate improvement can be seen for k-means, hierarchical, and optics algorithms by varying nstart, method, and minPts, respectively. EM method modelName variation shows large accuracy increase, averaging 18.8%. Similar behavior is seen for 10 classes. For 10 classes, hierarchical method variation improves average performance by 6.72%. EM modelName variation also shows high improvement, averaging 13.63%.

[Table 4. One-parameter analysis performed in DB2C10F and DB10C10F.

This analysis is based on the performance obtained when varying a single parameter, while maintaining the others in their default configuration. 〈S〉, max S, ΔS are associated with the average, standard deviation and maximum difference between the performance obtained when varying a single parameter and the performance obtained for the default parameter values. We also measure 〈max Acc〉, the average of best ARI values obtained when varying each parameter, where the average is calculated over all considered datasets.](article/figure/image?size=large&id=10.1371/journal.pone.0210236.t004)

Variation of some parameters has only a minor impact on the discriminative power of the algorithms. Examples include the kernel and iter parameters of spectral clustering and the iter.max parameter of k-means. In some cases, unidimensional parameter variation actually reduces performance. For example, varying the min.individuals and models parameters of the subspace algorithm causes an average accuracy loss of around 20%, depending on the dataset. Similar behavior is seen for DBSCAN, where varying minPts leads to an average accuracy loss of 20.32%. The metric and rngR parameters of clara also lead to marked performance decreases.

Table 5 shows 〈S〉, max S, ΔS, and 〈max Acc〉 for the 200-feature datasets. For two-class problems, significant performance improvements are obtained by varying the nstart parameter of k-means, the method parameter of the hierarchical algorithm, and the modelName parameter of hcmodel and EM. However, varying the clara metric, subspace min.individuals, and hcmodel use parameters causes average accuracy losses greater than 10%. The largest accuracy loss occurs for the DBSCAN minPts parameter (49.47%). For 10-class problems, similar results are observed, except that changes to the clara parameters result in large accuracy losses.

[Table 5. One-parameter analysis performed in DB2C200F and DB10C200F.

This analysis is based on the performance obtained when varying a single parameter, while maintaining the others in their default configuration. 〈S〉, max S, ΔS are associated with the average, standard deviation and maximum difference between the performance obtained when varying a single parameter and the performance obtained for the default parameter values. We also measure 〈max Acc〉, the average of best ARI values obtained when varying each parameter.](article/figure/image?size=large&id=10.1371/journal.pone.0210236.t005)

Multi-dimensional analysis

Comprehensive algorithm performance analysis requires simultaneous variation of all parameters. However, this is practically challenging due to the vast parameter combination space. Here, we use random parameter variation to sample algorithm performance across its multi-dimensional parameter space.

The method is as follows: based on the one-dimensional parameter variation results, we identify, for each parameter, bounds [Pmin, Pmax] beyond which performance either stabilizes or degrades significantly compared to the default values. These bounds define the random sampling interval. For an algorithm with parameters P(1), P(2), …, P(n), each parameter P(i) is randomly sampled from a uniform distribution within its interval [Pmin(i), Pmax(i)]. For each algorithm, 500 parameter sets were generated.
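
A minimal sketch of this sampling step is shown below; the parameter names and bounds are hypothetical and do not correspond to any particular algorithm in this study.

```r
# Illustrative sketch of the random sampling of parameter sets. The parameter names
# and bounds are hypothetical; each sampled row would be passed to the clustering
# function and the resulting ARI recorded.
set.seed(1)
n_sets <- 500
param_sets <- data.frame(
  k      = sample(2:20, n_sets, replace = TRUE),      # integer parameter
  nstart = sample(1:50, n_sets, replace = TRUE),      # integer parameter
  tol    = runif(n_sets, min = 1e-6, max = 1e-2)      # continuous parameter
)
head(param_sets)
```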

Algorithm performance for the different parameter sets was evaluated as follows. Figure 8 shows the histogram of ARI values obtained from random sampling of the k-means parameters. The red dashed line indicates the ARI obtained with default parameters, and the light blue shaded region indicates parameter configurations with improved performance. From this distribution we calculated four measures: the p-value (the blue region area divided by the total histogram area, expressed as a percentage), representing the percentage of parameter configurations that improve performance; the average improvement 〈R〉; the improvement standard deviation ΔR; and the maximum improvement max R, all calculated only over the cases showing performance improvement (blue region in Figure 8). Relative performance is defined as the difference between the performance obtained with random parameters and with default parameters. The average maximum accuracy 〈max ARI〉 across datasets was also measured for the random parameter selection. S2 File shows the ARI distribution histograms for random parameter sampling across all algorithms.

[Fig 8. Distribution of ARI values obtained for the random sampling of the k-means parameters.

The algorithm was applied to dataset DB10C10F, and 500 sets of parameters were drawn.](article/figure/image?size=large&id=10.1371/journal.pone.0210236.g008)

Table 6 shows algorithm performance (ARI) for DB2C2F with random parameter selection. Optics and EM are the only algorithms with p-value > 50%. EM (22.1%) and hierarchical (30.6%) methods show high average performance gain. HCmodel, k-means, spectral, optics, and DBSCAN show moderate improvement.

[Table 6. Multi-parameter analysis performed in dataset DB2C2F.

The p-value represents the probability that the classifier set with a random configuration of parameters outperform the same classifier set with its default parameters. 〈R〉, ΔR and max R represent the average, standard deviation and maximum value of the improvement obtained when random parameters are considered. Column 〈max ARI〉 indicates the average of the best accuracies obtained for each dataset.](article/figure/image?size=large&id=10.1371/journal.pone.0210236.t006)

Table 7 shows performance for DB10C2F. High p-values are seen for optics (96.6%), EM (76.5%), and k-means (77.7%). Average performance improvement is relatively low for most, except optics, which averages 15.9% improvement.

[Table 7. Multi-parameter analysis performed in dataset DB10C2F.

The p-value represents the probability that the classifier set with a random configuration of parameters outperform the same classifier set with its default parameters. 〈R〉, ΔR and max R represent the average, standard deviation and maximum value of the improvement obtained when random parameters are considered. Column 〈max ARI〉 indicates the average of the best accuracies obtained for each dataset.](article/figure/image?size=large&id=10.1371/journal.pone.0210236.t007)

More marked performance variation is seen for DB2C10F (Table 8). EM, k-means, hierarchical, and optics algorithms have p-value > 50%. Average performance gain when improved was 30.1%, 18.0%, 25.9%, and 15.5%, respectively. Random parameter variation may improve these algorithms. Except for clara and DBSCAN, all methods show significant average improvement for this dataset. Maximum accuracy of 100% can be achieved for EM and subspace algorithms.

[Table 8. Multi-parameter analysis performed in dataset DB2C10F.

The p-value represents the probability that the classifier set with a random configuration of parameters outperform the same classifier set with its default parameters. 〈R〉, ΔR and max R represent the average, standard deviation and maximum value of the improvement obtained when random parameters are considered. Column 〈max ARI〉 indicates the average of the best accuracies obtained for each dataset.](article/figure/image?size=large&id=10.1371/journal.pone.0210236.t008)

Table 9 shows performance for DB10C10F. EM, clara, k-means, and optics p-values indicate random parameter selection generally improves performance. Hierarchical algorithms can be significantly improved due to the default method parameter being unsuitable for this dataset.

[Table 9. Multi-parameter analysis performed in dataset DB10C10F.

The p-value represents the probability that the classifier set with a random configuration of parameters outperform the same classifier set with its default parameters. 〈R〉, ΔR and max R represent the average, standard deviation and maximum value of the improvement obtained when random parameters are considered. Column 〈max ARI〉 indicates the average of the best accuracies obtained for each dataset.](article/figure/image?size=large&id=10.1371/journal.pone.0210236.t009)

Performance for DB2C200F is in Table 10. High p-values are seen for EM (65.1%) and k-means (65.6%). Average performance gain is 39.1% and 35.4%, respectively. Spectral and Subspace methods show improved ARI in ~16% of cases. Random parameter variation leads to large average performance improvements across all algorithms.

[Table 10. Multi-parameter analysis performed in dataset DB2C200F.

The p-value represents the probability that the classifier set with a random configuration of parameters outperform the same classifier set with its default parameters. 〈R〉, ΔR and max R represent the average, standard deviation and maximum value of the improvement obtained when random parameters are considered. Column 〈max ARI〉 indicates the average of the best accuracies obtained for each dataset.](article/figure/image?size=large&id=10.1371/journal.pone.0210236.t010)

Table 11 shows performance for DB10C200F. High p-values are seen for all methods. Average accuracy improvement is generally lower than for DB2C200F.

[Table 11. Multi-parameter analysis performed in dataset DB10C200F.

The p-value represents the probability that the classifier set with a random configuration of parameters outperform the same classifier set with its default parameters. 〈R〉, ΔR and max R represent the average, standard deviation and maximum value of the improvement obtained when random parameters are considered. Column 〈max ARI〉 indicates the average of the best accuracies obtained for each dataset.](article/figure/image?size=large&id=10.1371/journal.pone.0210236.t011)

Conclusions

Clustering data is complex, involving choices among methods, parameters, and performance metrics, with real-world implications [63, 103–108]. Analyzing clustering algorithm advantages and disadvantages is also complex and has received considerable attention. This study used a comprehensive methodology for generating diverse heterogeneous datasets with controlled properties like inter-class distances and feature correlations. Using R packages, we compared the performance of nine popular clustering methods across 400 artificial datasets, considering default parameters, single parameter variation, and random parameter variation. It is important to remember that these results are specific to normally distributed data and algorithm implementations; different results may occur in other situations. Beyond practical guidance for non-expert users, interesting findings were obtained about the clustering methods.

For default parameters, the differences in performance across algorithms were statistically significant for all considered numbers of features. Kruskal-Wallis test p-values were p = 6.48 × 10−7 (χ2 = 41.50) for 2 features, p = 1.53 × 10−8 (χ2 = 52.20) for 10 features, p = 1.56 × 10−6 (χ2 = 41.67) for 50 features, and p = 2.49 × 10−6 (χ2 = 40.58) for 200 features.

Spectral methods showed best default parameter performance, with an ARI of 68.16% (Table 2). Hierarchical methods had lower performance (ARI 21.34%). Underestimating cluster number led to worse performance than overestimation, consistent with [44].

Single parameter variations showed significant performance variation for hierarchical, optics, and EM methods with 2-feature datasets. For ≥10 features, most methods could be improved by parameter adjustments.

Multi-dimensional analysis for 10-class, 2-feature datasets showed similar performance to default parameters, suggesting low parameter sensitivity. For 2-class, 10-feature datasets, EM, hcmodel, subspace, and hierarchical algorithms showed significant performance gains. EM algorithm had high p-value (70.8%), indicating many parameter values outperform defaults. For 10-class, 10-feature datasets, improvement was lower, except for hierarchical clustering. With 200 features, large performance gains were seen for all methods.

Tables 12, 13, and 14 summarize best accuracies achieved in different scenarios (default, one-dimensional, and multi-dimensional parameter adjustment) for DB2C2F, DB10C2F, DB2C10F, DB10C10F, DB2C200F, and DB10C200F datasets. For 2-feature datasets, algorithms show similar performance, especially with increased class number. For ≥10 features, spectral algorithms generally perform best, although EM, hierarchical, k-means, and subspace can achieve similar performance with tuning. Algorithms like optics and DBSCAN are designed for other data distributions (e.g., elongated, S-shaped) [72, 74], so results may differ for non-normally distributed data.

[Table 12. Summary table for the performance of clustering algorithms in datasets DB2C2F and DB10C2F.

ARIdef represents the average accuracy obtained when considering the default parameters of the algorithms. The table also reports the average of the best accuracies obtained when varying a single parameter and the average of the best accuracies obtained when parameters are randomly selected.](article/figure/image?size=large&id=10.1371/journal.pone.0210236.t012)

[Table 13. Summary table for the performance of clustering algorithms in datasets DB2C10F and DB10C10F.

ARIdef represents the average accuracy obtained when considering the default parameters of the algorithms. The table also reports the average of the best accuracies obtained when varying a single parameter and the average of the best accuracies obtained when parameters are randomly selected.](article/figure/image?size=large&id=10.1371/journal.pone.0210236.t013)

[Table 14. Summary table for the performance of clustering algorithms in datasets DB2C200F and DB10C200F.

ARIdef represents the average accuracy obtained when considering the default parameters of the algorithms. The table also reports the average of the best accuracies obtained when varying a single parameter and the average of the best accuracies obtained when parameters are randomly selected.](article/figure/image?size=large&id=10.1371/journal.pone.0210236.t014)

Future work could compare more algorithms and explore other statistical data distributions. Applying a similar comparative approach to semi-supervised classification is another important direction.

Supporting information

S1 File. Description of the clustering algorithms’ parameters.

We provide a brief description about the parameters of the clustering algorithms considered in the main text.

S2 File. Clustering performance obtained for the random selection of parameters.

The file contains figures showing the histograms of ARI values obtained for identifying the clusters of, respectively, datasets DB10C10F and DB2C10F using a random selection of parameters. Each plot corresponds to a clustering method considered in the main text.
