In the realm of data mining, a comparative study for outlier detection techniques is crucial for identifying anomalies and unusual patterns that deviate significantly from the norm. compare.edu.vn offers a comprehensive platform to explore and contrast various outlier detection algorithms. By examining the relative strengths of different anomaly detection methods, users can gain valuable insights, identify data defects, and improve predictive accuracy. Explore outlier analysis, anomaly scoring, and diverse detection approaches for enhanced data understanding.
1. Introduction to Outlier Detection
Outlier detection, also known as anomaly detection, is the process of identifying data points that deviate significantly from the majority of the dataset. These outliers can represent errors, novelties, or critical events, making their detection essential in various fields. A comparative study for outlier detection techniques enables data scientists and analysts to choose the most appropriate method for their specific needs. This involves understanding the strengths and weaknesses of different algorithms under various conditions.
2. Importance of Outlier Detection in Data Mining
Outlier detection plays a vital role in data mining by uncovering hidden patterns and providing actionable insights. These techniques are essential for various applications, including fraud detection, network intrusion detection, and predictive maintenance. A comparative study for outlier detection techniques in data mining is crucial because no single method is universally optimal; the best choice depends on the characteristics of the dataset and the specific goals of the analysis.
3. Key Concepts in Outlier Detection
Before delving into a comparative study for outlier detection techniques, it’s important to define some key concepts:
- Outliers: Data points that differ significantly from other observations.
- Anomaly Score: A measure of how likely a data point is to be an outlier.
- Univariate Outliers: Outliers in a single dimension or feature.
- Multivariate Outliers: Outliers in multiple dimensions or features.
4. Intentions Behind User Searches
Understanding user intent is crucial for creating relevant and valuable content. Here are five potential user intentions when searching for “A Comparative Study For Outlier Detection Techniques In Data Mining”:
- Seeking a comprehensive overview: Users want a general understanding of different outlier detection methods.
- Comparing specific algorithms: Users are interested in a detailed comparison of specific outlier detection techniques.
- Finding the best method for a specific dataset: Users need guidance on selecting the most appropriate method for their data.
- Understanding the performance of algorithms: Users want to see how different methods perform in different situations.
- Learning about the applications of outlier detection: Users are interested in real-world applications of outlier detection.
5. Types of Outlier Detection Techniques
A comparative study for outlier detection techniques typically categorizes methods into several broad types:
5.1. Statistical Methods
Statistical methods assume that normal data points follow a certain distribution. Outliers are then defined as points that deviate significantly from this distribution. A brief code sketch follows the list below.
- Z-score: Measures how many standard deviations a data point is from the mean.
- Grubbs’ Test: Detects a single outlier in a univariate dataset.
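As a minimal illustration of the Z-score idea above, the sketch below flags univariate outliers whose absolute Z-score exceeds a cutoff. Python with NumPy is assumed purely for illustration, and the threshold of 2 standard deviations is an arbitrary choice, not a recommendation.

```python
import numpy as np

def zscore_outliers(x, threshold=2.0):
    """Flag points whose absolute Z-score exceeds the threshold."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return np.abs(z) > threshold

# A univariate sample with one obvious outlier at the end.
data = [10.1, 9.8, 10.3, 9.9, 10.0, 25.0]
print(zscore_outliers(data))  # only the last point is flagged
```

In practice, the threshold (commonly 2.5 or 3) should be tuned to the data, and robust variants based on the median and MAD are often preferred when the sample itself contains outliers.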
5.2. Machine Learning Methods
Machine learning-based outlier detection techniques learn from the data and identify anomalies based on learned patterns. A short code sketch follows the list below.
- Isolation Forest: Isolates outliers by randomly partitioning the data space.
- One-Class SVM: Trains a model on normal data and identifies points outside this region as outliers.
- Local Outlier Factor (LOF): Measures the local density deviation of a data point compared to its neighbors.
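To make these methods concrete, here is a hedged sketch using scikit-learn (an assumed dependency; the article does not prescribe any particular library). The toy data, contamination rate, and neighborhood size are illustrative values only.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),    # dense "normal" cluster
               rng.uniform(-6, 6, size=(10, 2))])  # scattered anomalies

# Each detector returns +1 for inliers and -1 for outliers.
iso_labels = IsolationForest(contamination=0.05, random_state=0).fit_predict(X)
lof_labels = LocalOutlierFactor(n_neighbors=20, contamination=0.05).fit_predict(X)
svm_labels = OneClassSVM(nu=0.05, gamma="scale").fit(X).predict(X)

print("flagged by iForest:", int((iso_labels == -1).sum()))
print("flagged by LOF:    ", int((lof_labels == -1).sum()))
print("flagged by OC-SVM: ", int((svm_labels == -1).sum()))
```

Note that One-Class SVM is normally trained on (mostly) normal data; applying it directly to contaminated data, as done here for brevity, only approximates that setting.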
5.3. Clustering-Based Methods
Clustering algorithms group similar data points together, and outliers are identified as points that do not belong to any cluster or belong to very small clusters. A short code sketch follows the list below.
- K-Means Clustering: Identifies outliers as points far from any cluster centroid.
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Marks points in low-density regions as outliers.
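The following sketch, again assuming scikit-learn, shows the two clustering-based ideas from the list: scoring points by their distance to the nearest K-Means centroid, and taking DBSCAN's noise label as an outlier flag. The cluster count, eps, and percentile cutoff are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),   # cluster 1
               rng.normal(8, 1, size=(100, 2)),   # cluster 2
               [[20.0, 20.0]]])                   # one far-away point

# K-Means: score each point by its distance to its assigned centroid.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
kmeans_outliers = dist > np.percentile(dist, 99)  # flag the top 1% most distant points

# DBSCAN: points labeled -1 lie in low-density regions and are treated as noise/outliers.
db_labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
dbscan_outliers = db_labels == -1

print("K-Means flags:", int(kmeans_outliers.sum()), "DBSCAN flags:", int(dbscan_outliers.sum()))
```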
5.4. Proximity-Based Methods
Proximity-based methods calculate the distance or similarity between data points and identify outliers based on their isolation from other points. A short code sketch follows the list below.
- K-Nearest Neighbors (KNN): Identifies outliers as points with large distances to their k-nearest neighbors.
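A brief sketch of the k-NN distance score follows, with scikit-learn's NearestNeighbors assumed; the choice of k and the percentile cutoff are arbitrary illustrations.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(300, 2)), [[10.0, 10.0]]])  # the last point is isolated

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 because each point is its own nearest neighbor
distances, _ = nn.kneighbors(X)
knn_score = distances[:, -1]                      # distance to the k-th neighbor, excluding the point itself

outliers = knn_score > np.percentile(knn_score, 99)
print(np.where(outliers)[0])                      # index 300 (the isolated point) should appear here
```

Common variants use the average distance to the k nearest neighbors rather than the distance to the k-th one.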
6. Comparative Analysis of Outlier Detection Techniques
A comparative study for outlier detection techniques in data mining involves evaluating different methods based on several criteria, including accuracy, computational complexity, and suitability for different types of data.
6.1. Accuracy
Accuracy refers to the ability of an outlier detection technique to correctly identify outliers and normal data points. Evaluation metrics include precision, recall, F1-score, and Area Under the Receiver Operating Characteristic Curve (AUC-ROC).
6.2. Computational Complexity
Computational complexity measures the resources (time and memory) required by an algorithm to process data. This is especially important for large datasets.
6.3. Data Type Suitability
Different outlier detection techniques are better suited for different types of data. Some methods work well with numerical data, while others are designed for categorical or mixed data.
7. Evaluation Metrics for Outlier Detection
To perform a thorough comparative study for outlier detection techniques, it is crucial to understand the various evaluation metrics used to assess their performance.
7.1. Precision
Precision measures the proportion of correctly identified outliers out of all data points flagged as outliers. A high precision indicates that the technique is good at avoiding false positives.
7.2. Recall
Recall measures the proportion of actual outliers that are correctly identified by the technique. A high recall indicates that the technique is good at avoiding false negatives.
7.3. F1-Score
The F1-score is the harmonic mean of precision and recall, providing a balanced measure of the technique’s accuracy.
7.4. AUC-ROC
The Area Under the Receiver Operating Characteristic Curve (AUC-ROC) measures the ability of the technique to discriminate between outliers and normal data points. A higher AUC-ROC indicates better performance.
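As a small, hedged illustration, the snippet below computes these metrics with scikit-learn on made-up labels and anomaly scores; the 0.5 decision threshold is an arbitrary assumption needed only for the threshold-based metrics.

```python
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             roc_auc_score, average_precision_score)

y_true  = [0, 0, 0, 0, 1, 0, 1, 0, 0, 1]                                # 1 = outlier, 0 = normal
y_score = [0.10, 0.20, 0.15, 0.30, 0.90, 0.40, 0.80, 0.20, 0.10, 0.35]  # anomaly scores
y_pred  = [1 if s >= 0.5 else 0 for s in y_score]                       # arbitrary threshold

print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1-score: ", f1_score(y_true, y_pred))
print("AUC-ROC:  ", roc_auc_score(y_true, y_score))        # threshold-free, uses the raw scores
print("Avg. precision:", average_precision_score(y_true, y_score))
```

Note that AUC-ROC and average precision are computed from the raw anomaly scores, whereas precision, recall, and F1 depend on the chosen threshold.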
8. Experimental Evaluation and Results
In this section, we present an experimental evaluation and report the results obtained from a comparative study for outlier detection techniques in data mining. The evaluation addresses three questions: whether clustering-based anomaly detection methods are competitive in accuracy with the non-clustering-based state of the art (Q1, Sect. 8.1), how resilient the evaluated methods are to data variation (Q2, Sect. 8.2), and how resilient they are to parameter configuration (Q3, Sect. 8.3).
8.1. Accuracy
This section intends to answer the question Q1: “Are clustering-based anomaly detection methods competitive in accuracy with those of the non-clustering-based state-of-the-art?”. To this end, we first performed Friedman’s test to reject the null hypothesis that the methods provide statistically equivalent results (Demšar 2006). For the post-hoc analysis (Benavoli et al. 2016), we abandoned the average rank comparison in favor of a pairwise statistical comparison: the Wilcoxon signed-rank test with Holm’s alpha correction (5%). Finally, we employed a critical difference diagram (Demšar 2006) to depict the results of these statistical tests projected onto the average rank axis, with a thick horizontal line showing a clique of methods that are not significantly different in terms of AUROC, Average Precision, R-Precision, or Max-F1.
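For readers who wish to reproduce this style of analysis, a minimal sketch of the workflow is given below using SciPy. It is not the authors' code: the score matrix is synthetic, the method names are placeholders, and only the two steps described above are mirrored (Friedman's test, then pairwise Wilcoxon signed-rank tests with Holm's step-down correction at alpha = 5%).

```python
import numpy as np
from itertools import combinations
from scipy.stats import friedmanchisquare, wilcoxon

rng = np.random.default_rng(7)
scores = rng.uniform(0.5, 1.0, size=(20, 4))   # rows = datasets, columns = methods (synthetic AUROC values)
methods = ["KNNOutlier", "LOF", "KMeans--", "GLOSH"]

# Step 1: Friedman test of the null hypothesis that all methods perform equivalently.
stat, p = friedmanchisquare(*[scores[:, j] for j in range(scores.shape[1])])
print(f"Friedman p-value: {p:.3f}")

# Step 2: post-hoc pairwise Wilcoxon signed-rank tests with Holm's correction.
pairs = list(combinations(range(len(methods)), 2))
pvals = [wilcoxon(scores[:, i], scores[:, j]).pvalue for i, j in pairs]
alpha, stop = 0.05, False
for rank, idx in enumerate(np.argsort(pvals)):   # Holm: test the smallest p-value first
    i, j = pairs[idx]
    if stop or pvals[idx] > alpha / (len(pairs) - rank):
        stop = True                              # once one comparison fails, all later ones fail too
        verdict = "not significant"
    else:
        verdict = "significant"
    print(f"{methods[i]} vs {methods[j]}: p = {pvals[idx]:.3f} -> {verdict}")
```

Drawing the critical difference diagram itself requires an additional plotting step (ranking the methods per dataset and connecting cliques of methods whose pairwise differences are not significant), which is omitted here.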
Figure 2 presents the critical difference diagrams regarding the best values per evaluation measure obtained by every method on each dataset considering all possible parameter configurations. We evaluate the results separately for each group of datasets, i.e., those used in the literature, the semantically meaningful ones, and the synthetic datasets. Regarding the datasets used in the literature, the algorithms KNNOutlier, KMeans–, OPTICS-OF, and GLOSH are generally at the top of the average rank considering the four measures. On the other hand, the algorithms SilhouetteOD and OutRank S1/H are the worst performers. We also note that the gap (amplitude) between the first and the last average rank in the diagrams is almost 8, which could suggest a meaningful difference in performance. Nevertheless, the lines connecting all the methods show that there is no statistically significant difference between the results they obtained. A similar scenario is seen in the semantically meaningful datasets, except for the appearance of the OutRank S1/D at the top of the ranking of R-Precision (Fig. 2g), separated from the second place by a small difference in the average rank. On the other hand, the results are considerably different for the synthetic datasets. We note that the amplitude of the average rank is almost 10 for these datasets, and there is a clear superiority of the method in third place over the one in fourth place regarding the measures AUROC and Average Precision. The KNNOutlier always ranks first, followed by the DBSCANOD and the EMOutlier in the successive positions. The KMeans– and the OPTICS-OF show the worst performances. It is worth noting that both the DBSCANOD and the EMOutlier improved their performance compared to the other datasets. We believe this is because the synthetic data contain clusters following Gaussian distributions and only a few outliers. Methods tend to perform better when their assumptions match the type of patterns present in the data.
Outlier detection is mainly an unsupervised task. Hence, the results of the previous paragraph may not represent a real-world scenario well enough. Rather, they represent an ideal scenario where we know a priori the correct parameterization to obtain the best possible result from each method and dataset. Figure 3 presents the critical difference plots regarding the average values per evaluation measure obtained by every method on each dataset considering all possible parameter configurations. We believe they assess a more plausible real-world scenario by simulating the use of values selected at random for each parameter from a reasonable range of values. In all three groups of datasets, the amplitude of the average rank is between 8 and 10, with the latter value observed in the synthetic data. For the first two groups, the algorithms LOF, KNNOutlier, KMeans–*, and KMeansOD are generally the best performers. The good performance of the non-clustering-based methods KNNOutlier and LOF aligns with previous findings in the literature and shows that we chose a reasonable baseline. The OutRank S1/H, the SilhouetteOD, and the KMeans– show the worst performances. For the synthetic datasets, the algorithms EMOutlier and KMeans– are always at the top of the ranking. From a statistical point of view, we deduce that none of the methods presents significantly better performance than the others considering the datasets used in the literature and the semantically meaningful ones. In contrast, for the synthetic datasets, the algorithms EMOutlier and KMeans– perform considerably better than the other methods regarding the four measures.
Overall, we conclude that there is no significant superiority of the non-clustering-based algorithms over the clustering-based ones. Both approaches exhibit similar performance on the real data. On the other hand, a clear advantage of the clustering-based methods is seen on synthetic data, which we believe is determined partly by the setup of the datasets studied that involves the generation of clusters of normal data. Finally, it is worth noting that the continuous scores of the KMeans–* lead to significant improvements compared to the binary labels of the original method KMeans–, and it usually performs better than the standard variant KMeansOD, too.
8.2. Resilience to Data Variation
This section intends to answer the question Q2: “How resilient to data variation are the evaluated methods?”. We leverage boxplots to assess each method’s resilience across the various datasets. Each plot depicts the locality, skewness, and dispersion of a set of values obtained for one evaluation measure through their quartiles. The higher the dispersion in the set of values, the lower the resilience of the method. We first investigate the real datasets (Sect. 8.2.1) and then move on to the synthetic ones (Sect. 8.2.2).
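As an aside, a boxplot of this kind can be reproduced with matplotlib, which is assumed here; the AUROC values below are randomly generated stand-ins, not the measurements from the study.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
methods = ["KNNOutlier", "LOF", "KMeans--", "EMOutlier"]
# One value per dataset for each method; a wider box means higher dispersion, i.e., lower resilience.
auroc_per_method = [np.clip(rng.normal(loc=0.75, scale=s, size=15), 0, 1)
                    for s in (0.05, 0.10, 0.08, 0.03)]

plt.boxplot(auroc_per_method)
plt.xticks(range(1, len(methods) + 1), methods)
plt.ylabel("AUROC across datasets")
plt.title("Dispersion as a proxy for resilience to data variation")
plt.show()
```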
8.2.1. Real data
Figure 4 shows boxplots of the best values obtained by each method per evaluation measure and dataset. Every boxplot represents a set of values, each of which is the best result obtained by one method from one dataset regarding one measure. Here we discuss the results obtained from real data, that is, those depicted in the top two rows of plots shown in Fig. 4; the remaining results are discussed later in the paper. We observe that the AUROC values of the methods in all the real datasets generally present little dispersion, and they also present medians larger than 0.6. Thus, under the best parameter configuration, every method can detect outliers better than a randomized detector, although, for many applications, 0.6 will be much too low to be useful. On the other hand, the best values of Average Precision, R-Precision, and Max-F1 (all of them adjusted for chance) tend to be considerably lower, with medians often smaller than ≈0.4. We believe that there are three reasons for these undesired results. Firstly, the datasets cover a large variety of domains. Secondly, the number of outliers and inliers is not balanced, with outliers accounting for approximately 5% of the data, which is common in anomaly detection scenarios. Thirdly, the datasets were usually labeled for a classification task, not for outlier detection, and may contain anomalous objects that nevertheless belong to the majority class and are labeled as “normal”. Furthermore, the relatively large spread of values in these three measures shows that, except for SilhouetteOD, all methods have low resilience concerning the type of data under analysis even when using their best possible parameter configurations. Overall, both the clustering-based and the non-clustering-based approaches present comparable resilience to data variation, assuming that the optimal setup for each method and dataset is known a priori.
The boxplots corresponding to the average values obtained by each method per evaluation measure and dataset are shown in Fig. 5. We note that the AUROC scores for the datasets used in the literature have a moderate dispersion of nearly 0.5 between the minimum and the maximum values of the plots, and the medians range from 0.5 to 0.9. For the other evaluation measures, the algorithms KMeansOD, KMeans–, and KNNOutlier have large spread values with low resilience to data variation. The SilhouetteOD and the SilhouetteOD* exhibit much better resilience, though. Unfortunately, it is not a positive result because their median values are very low. In the semantically meaningful datasets, the dispersion of AUROC values is approximately 0.4, thus being smaller than that of the other group of real datasets, and the medians range from 0.5 to 0.8. Additionally, the other evaluation measures exhibit reduced dispersion of values, such as for the algorithms KMeansOD, KMeans–, and KNNOutlier, which indicates moderate resilience to data variation regarding all methods and this group of datasets. Overall, when considering the results in a randomized configuration scenario, the clustering-based methods are slightly superior to the non-clustering-based ones in terms of their resilience to variation in the data analyzed.
8.2.2. Synthetic data
In general terms, most methods have similar resilience to data variation when evaluated considering our real datasets. Nevertheless, we have little knowledge about the characteristics of these datasets, such as the underlying data distributions, the types of the outliers, if there are irrelevant dimensions, and a number of other unknown features that may impact the results of the methods. For this reason, we also evaluate the methods using synthetic data in a controlled experimental environment. We intend to glimpse if there are superior algorithms regarding the resilience to data variation under certain conditions.
Figure 4 shows that many methods are highly resilient to data variation when considering their best results obtained from the synthetic datasets; see the last row of boxplots at the bottom of the figure. Most boxes in the plots depict very concentrated values and large medians of at least 0.8 or so for the four evaluation measures, which indicates high resilience and good performance. It is important to emphasize that even an algorithm using random scoring could obtain high resilience with very low dispersion of values. Nevertheless, its performance would be poor, such as having a median of around 0.5 regarding AUROC. For instance, the KMeans– presents a lower-than-average performance, although it is one of the most resilient methods. On the other hand, the OPTICS-OF and the LOF present low resilience and medium-to-low performance. The OutRank S1/D continues to exhibit average resilience with a median above 0.8 in all evaluation measures. Overall, under the best parameterization conditions, most methods of both the clustering-based and the non-clustering-based approaches demonstrate higher resilience on the synthetic datasets than on the real datasets.
Figure 5 presents considerably distinct results for the group of synthetic datasets compared with those of the other two groups. As described before, this figure depicts the resilience of the methods regarding the average values obtained per evaluation measure in each dataset. Note that the medians of AUROC values are well above 0.6 for the synthetic data and generally better than those of the real data. The medians of the other measures are also considerably better for the synthetic data compared with the real data. The algorithms EMOutlier, KMeansOD, and KMeans–* stand out as the most resilient ones, also having nearly perfect medians close to 1.0 for the four measures. The KMeans– presents high resilience too, but its median values are all considerably lower. On the other hand, the remaining methods exhibit considerably worse resilience, being much more susceptible to variation in their effectiveness in the synthetic data than in the real data. Given that the medians of these methods are considerably better for the synthetic data, there is also more room for variation in the scores obtained, with some particular configurations performing much better or worse than others. Overall, in a randomized parameterization scenario, the non-clustering-based methods are less resilient to synthetic data variation than the clustering-based ones, especially when considering the algorithms EMOutlier, KMeansOD, and KMeans–*.
Figure 6 presents results regarding the resilience of the methods to specific variations in our synthetic data. Each row of plots regards a subset of our synthetic datasets in which all characteristics are equal to those of our standard dataset except for one single characteristic. For example, the first row of plots regards datasets having distinct numbers of clusters while the remaining characteristics are the same as in the standard dataset. Once again, we consider the resilience of the methods regarding the average of the values obtained per evaluation measure in each dataset, because they assess a more plausible real-world scenario than the evaluation considering the best values obtained per dataset. Considering the datasets with different numbers of clusters, i.e., 2, 5, and 10 clusters, the AUROC medians are all above 0.6. The EMOutlier stands out not only for having a high resilience, but also for presenting a median of almost 1.0 (unsurprisingly, as it matches the data generation process very well). Similar results are seen for the algorithms KMeansOD and KMeans–*. The DBSCANOD and the KMeans– also show very similar results in terms of high resilience, but their median values are less than ≈0.7. The remaining methods demonstrate moderate resilience with dispersion around 0.2. In the other three evaluation measures, we see that the KMeans–, the SilhouetteOD, and the DBSCANOD are highly resilient to data variation, but they present low effectiveness with medians below 0.4. The EMOutlier is highly resilient too; distinctly, this method is highly effective presenting medians above 0.9. The other methods demonstrated moderate or low resilience with the KNNOutlier being the one with the lowest resilience. Overall, when considering the variation in the number of clusters, it is evident that most clustering-based methods are more resilient and perform better than the non-clustering-based methods.
We now move on to the datasets with varying cardinalities, i.e., 1k, 5k, and 10k. For the four evaluation measures, the most resilient methods are the EMOutlier, the KMeansOD, the KMeans–*, the KMeans–, and the SilhouetteOD. Note, however, that the last two methods have low effectiveness, with medians much smaller than those of the other methods. Regarding AUROC, the methods OPTICS-OF, SilhouetteOD, KNNOutlier, and LOF presented the lowest resilience, with dispersion of values close to 0.5. The methods DBSCANOD, OPTICS-OF, KNNOutlier, and LOF are the least resilient ones regarding the remaining three measures, with dispersion of values close to 0.9. Overall, considering the variation in data cardinality, we observed a higher impact on the effectiveness of the non-clustering-based methods compared to the clustering-based ones. The latter are again more resilient to data variation, especially considering the methods EMOutlier, KMeansOD, and KMeans–*.
Regarding the datasets with different numbers of relevant attributes, i.e., 2, 5, and 10 relevant attributes, all methods had medians of AUROC values above 0.6. The methods EMOutlier, KMeansOD, KMeans–*, and KMeans– are the most resilient ones in all evaluation measures, but the KMeans– is considerably less effective than the others. The DBSCANOD stands out as the least resilient method considering all evaluation measures, with dispersion values often around 0.6. Overall, when varying the number of relevant attributes, both the clustering-based and the non-clustering-based approaches present a similar degree of resilience. For datasets with varying numbers of irrelevant attributes, i.e., 2, 5, and 10 irrelevant attributes, we see that the DBSCANOD continues to be the least resilient method regarding AUROC. Nevertheless, the remaining methods generally presented higher resilience compared to the previous data variation. For the other three measures, the DBSCANOD continued to be the least resilient method, followed by the SilhouetteOD, whose medians are around 0.4. The algorithms EMOutlier, KMeans–, and OutRank S1/D are the most resilient ones, but only the first method also presents high effectiveness. Overall, the non-clustering-based methods are more resilient than the clustering-based ones to variation in the number of irrelevant attributes.
For the datasets with varying cluster positioning, i.e., grid-, sine-, and uniform-based positioning, most methods presented high resilience, with dispersion values often below 0.1 regarding the four evaluation measures. The exceptions are the methods DBSCANOD, OutRank S1/D, KNNOutlier, and LOF, which presented only moderate resilience to the cluster positioning pattern. A similar result can be seen in the next data variation, where the clusters are generated by different distributions, i.e., Gaussian and uniform distributions. The main exception is that the SilhouetteOD* appeared as the least resilient method in this case, followed by the KNNOutlier, the OutRank S1/H, and the LOF. On the other hand, the DBSCANOD greatly improved its resilience and obtained dispersion values as low as 0.1, although its effectiveness remained very low. Overall, considering the variation of the clusters’ positioning and of the distribution of their instances, the clustering-based methods showed slightly better resilience than the non-clustering-based ones.
For the last two scenarios, we consider datasets with distinct percentages of outliers, i.e., 1%, 5%, and 10%, as well as with different outlier types, including local, global, and collective outliers. In the first scenario, based on AUROC values, we see that the methods EMOutlier, KMeansOD, KMeans–, OutRank S1/D, SilhouetteOD, iForest, and KNNOutlier exhibit high resilience with dispersion as small as 0.1. Except for the SilhouetteOD, all of these methods were also effective, with medians above 0.8. The remaining methods demonstrated lower resilience, with dispersion between roughly 0.2 and 0.5, and moderate effectiveness. Considering the other three evaluation measures, the EMOutlier, the KMeansOD, and the KMeans– stand out as the most resilient methods, with dispersion close to 0.1. These methods were also the most effective ones, with medians above 0.9. The least resilient methods are the LOF and the OPTICS-OF, with dispersion as large as 0.8. Additionally, it is worth noting that the OutRank S1/D was much less resilient in these three measures than in the AUROC measure. A reason could be that the subspace clustering algorithm DiSH used by OutRank S1/D might be identifying tiny clusters of anomalous points, especially when the percentage of outliers is large: the larger the number of outliers, the larger the probability that they form clusters. Such potential clusters of outliers might have small cardinalities and exist only in subspaces, but they would still be detected by DiSH and considered less anomalous by OutRank S1/D. This might put inliers ahead of a few outliers in the rankings of points, which only marginally impacts the AUROC values because a small number of mistakes often has little impact on this metric; in contrast, the impact is more noticeable in the other three measures.
We now move on to the last scenario. Note that most methods struggle in both resilience and effectiveness when the type of the outliers varies. When considering the AUROC measure, the EMOutlier, the KMeans–, and the OutRank S1/D remain the most resilient methods, with dispersion around 0.1. In contrast, the methods GLOSH, SilhouetteOD, SilhouetteOD*, and LOF stand out as the least resilient ones, with dispersion as large as 0.6. For the other evaluation measures, the methods were in general even less resilient; the least resilient ones are the GLOSH, the SilhouetteOD, and the LOF. Unfortunately, the most resilient methods were the ones that consistently presented low effectiveness, including the SilhouetteOD and the OPTICS-OF. Ultimately, in these last two scenarios, no approach stood out in resilience; however, when considering the medians of the values in all evaluation measures, it is plausible to state that the clustering-based methods demonstrated better performance.
In conclusion, we highlight that varying either the percentage or the type of the outliers generally harmed the resilience of the methods. On the other hand, varying the number of instances, the clusters’ positions, or the distribution of their instances had a minor impact on the methods’ resilience. Overall, when considering synthetic data, the clustering-based methods were once more slightly superior to the non-clustering-based ones in terms of their resilience to variation in the data analyzed.
8.3. Resilience to Parameter Configuration
Here we refer to question Q3: “How resilient to parameter configuration are the evaluated methods?”. We first consider one dataset at a time and later present an overall view of the results considering all the datasets studied.
8.3.1. Analysis per dataset
Figure 7 shows the dispersion of the evaluation measures when varying the parameter configuration of each method. That is, each boxplot illustrates the dispersion of the evaluation measures obtained for each method when using the many parameter configurations shown in Table 3. For this purpose, we consider three datasets (i.e., Lymphography, Glass and Wilt) selected to represent distinct levels of difficulty according to a score introduced by Campos et al. (2016). This score is defined as the average of the (binned) ranks of all outliers reported by a given set of outlier detectors for one dataset. If a dataset has a low score, it contains outliers that are relatively easy to detect using most methods. A high score means that all or most methods have high difficulty in detecting the outliers. The score values range from 1 to almost 10, representing a perfect and a random ranking respectively. Importantly, note that the scores of difficulty were generated by Campos et al. (2016) using a different set of methods than ours; however, we consider it plausible to use these scores in our work due to the large number and variety of approaches employed.
Lymphography is the easiest dataset; it has a low difficulty score of nearly 1. We confirm this by observing that the AUROC values obtained by most methods have medians above 0.7, although no method manages to obtain a median larger than 0.9. Under this evaluation measure, the most resilient methods are the EMOutlier, SilhouetteOD, and KNNOutlier. The least resilient ones are DBSCANOD, GLOSH, OPTICS-OF, and OutRank S1/H. Regarding the measure Average Precision, the OPTICS-OF is the most resilient method, but it always presented low-quality results. In contrast, the EMOutlier and KMeansOD presented much better effectiveness with slightly lower resilience to parameter configuration. The methods KMeans–, KMeans–*, and LOF are the least resilient ones. Similar results are seen for the other two measures.
Glass has a difficulty score of about 4.5 and can, therefore, be considered a medium-difficulty dataset. Looking at the AUROC values reported, we can confirm that this dataset is more challenging than Lymphography because four methods (and not only two) obtained medians below 0.6. In this case, most methods (i.e., EMOutlier, GLOSH, KMeansOD, KMeans–, KMeans–*, iForest, KNNOutlier, and LOF) showed high resilience concerning parameter variation. Reviewing the results of the remaining three evaluation measures, we notice that not only is the effectiveness of all methods low, with Average Precision and R-Precision smaller than ≈0.2 and Max-F1 below ≈0.4, but the dispersion of values is also reduced, leading to a phenomenon that we shall call “forced high resilience”. This phenomenon reflects a direct relationship between the difficulty of a dataset and the apparent resilience of the methods that analyze it: the more difficult it is to detect the outliers in a dataset, the more compact the dispersion of values of an evaluation measure, with the measurements usually concentrated at values representing low performance.
Wilt is a very difficult dataset with a difficulty score of ≈6.5. According to the reported AUROC values, all methods had medians below 0.5, except for the EMOutlier and the OutRank S1/D whose medians are slightly larger than 0.6. Furthermore, we notice that the dispersion of values is small (≈0.1), once again demonstrating the phenomenon of forced high resilience. The few exceptions are the OPTICS-OF and the OutRank S1/D with dispersion of ≈0.4. In the other three evaluation measures, we again confirm the aforementioned phenomenon. Since this dataset has a high degree of difficulty, all methods have very high resilience and poor effectiveness.
The results obtained from the remaining datasets are reported in Figs. 12 and 13 of Appendix A. Note that the datasets with known difficulty scores, as reported by Campos et al. (2016), are presented in ascending order of difficulty. The datasets with unknown scores are shown separately. Let us now summarize these results while also contrasting them with the degree of difficulty of each dataset. It is worth noting that 18 out of the 23 datasets studied by Campos et al. (2016) were characterized as being difficult; 6 of them come from the group of datasets commonly used in the literature, and the remaining 12 are from the semantically meaningful group. Starting with the group of datasets typically used in the literature, we confirm that the datasets ALOI and Waveform are difficult, with all methods struggling to detect the outliers. Also, WPBC is the most challenging dataset of the group: all methods experienced forced high resilience with very low effectiveness regardless of parameter configuration. In contrast, the datasets WBC and WDBC are much easier, which is confirmed by most methods achieving relatively low resilience and high values in all four evaluation measures. As a contribution of our survey, we can say that Ionosphere is a moderately easy dataset, with most methods performing well, except for some very resilient methods that presented low effectiveness. For KDDCup99, PenDigits, and Shuttle, all methods obtained resilience and effectiveness similar to those observed in Glass. Therefore, we characterize them as datasets of medium difficulty. Overall, clustering-based and non-clustering-based methods reported similar resilience to parameter configuration, without considerable distinction between the approaches.
Finally, in the group of semantically meaningful datasets, the results we obtained also corroborate the levels of difficulty reported by Campos et al. (2016). The clustering-based methods presented low resilience and high effectiveness in the datasets with a low difficulty score (≲4). We also observed forced high resilience in the datasets of medium-to-high difficulty (≳6). Importantly, the non-clustering-based methods demonstrated better performance than the clustering-based ones for the datasets with low-to-medium difficulty. That is, they obtained better quality of the results in terms of the evaluation measures and higher resilience to parameter configuration.
8.3.2. Overall view
Motivated by the previously presented results, we elaborated Fig. 8 as an “overall view” to depict how resilient the methods are when varying their parameter configuration. We consider the average and the standard deviation of the evaluation measurements. The horizontal axis of each quadrant represents the average of the values of one measure obtained from a method when evaluating one dataset; the vertical axis presents the corresponding standard deviation. We divided the 18 datasets characterized by Campos et al. (2016) into three groups according to their difficulty score: Lymphography, Parkinson, WBC, and WDBC are considered to have a low degree of difficulty; PageBlocks, Stamps, HeartDisease, Hepatitis, Arrhythmia, InternetAds, Glass, and Cardiotocography have a medium difficulty; and SpamBase, Pima, ALOI, Annthyroid, Wilt, and Waveform have a high degree of difficulty. Each dataset is represented by a unique marker, while each method has a unique color. Consequently, a marker appears in 12 different positions, each time drawn in a different color according to the method being considered.
Under this setting, the markers are positioned in such a way that the lower the standard deviation in the vertical axis of each plot, the higher the dispersion of the average results shown in the horizontal axis. It is possible to note that this dispersion is reduced even further as the level of difficulty of the datasets increases. We also notice that the results get more concentrated in values that represent low performance as the degree of difficulty becomes higher. For example, the results regarding the datasets of medium and high difficulty are often close to 0.5 in the AUROC measure and near 0 in the Average Precision, R-Precision, and Max-F1 measures.
In the low-difficulty datasets, we notice that the AUROC averages are mostly above 0.4, with some values getting close to 1.0. The methods EMOutlier, KMeansOD, KMeans–*, KNNOutlier, and LOF stand out by reporting averages around 0.9 and standard deviations smaller than 0.1, thus presenting both high effectiveness and high resilience to parameter configuration. The methods SilhouetteOD and KMeans– also present high resilience, but with lower quality of results. The least resilient methods are DBSCANOD, GLOSH, and OPTICS-OF, presenting standard deviations close to 0.2 despite having relatively good effectiveness, with average AUROC values around 0.8. The SilhouetteOD* is the worst-performing method, with standard deviations close to 0.2 and averages between 0.2 and 0.6. Once again, the Average Precision, R-Precision, and Max-F1 measurements show similar patterns, except that the OPTICS-OF and the OutRank S1/D reported considerably large standard deviations between 0.3 and 0.4, i.e., low resilience. Overall, these datasets do not represent tough challenges; most methods demonstrated high performance both in terms of resilience and in the quality of the results, while a few others report low resilience and accuracy. Considering this, we conclude that there is no remarkable superiority between the clustering-based and non-clustering-based methods regarding resilience to parameter configuration for these datasets.
Concerning the datasets of medium difficulty, GLOSH had the largest variations in the AUROC measurements. It presented large standard deviation values of around 0.2 and average values ranging from