IVOA KDD-IG: A user guide for Data Mining in Astronomy

5: Which algorithm to use?

Unfortunately, there is no simple answer, because the differing characteristics of the algorithms render them more or less suitable for different problems. Within the data mining community there is much literature comparing, for example, different algorithms on a specific dataset, or examining their theoretical properties in idealized (read: unrealistic) situations, but much less is available to help make a practical choice with real data. We attempt to remedy that here by comparing and contrasting the characteristics of some commonly used algorithms.

Given the number of factors influencing the decision, we do not recommend a particular algorithm. However, if you truly do not know where to start, the decision tree is a good bet: although it may not provide the best results, it is generally robust to the issues encountered in real-world data.
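As a concrete illustration of this advice, the sketch below fits a decision tree as a first-attempt classifier. It assumes the Python scikit-learn library purely for illustration; the arrays X and y are synthetic placeholders standing in for real measured attributes (e.g., photometric colors) and known class labels (e.g., star/galaxy), not data from any actual survey.

<verbatim>
# A minimal "start with a decision tree" sketch, assuming scikit-learn.
# X and y below are synthetic placeholders for real attributes and labels.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                  # stand-in for measured attributes
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # stand-in for known labels

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Trees need little preprocessing (no normalization; mixed attribute types are
# fine) and tolerate irrelevant attributes, which is why they make a robust
# first attempt; capping the depth is one simple guard against overfitting.
tree = DecisionTreeClassifier(max_depth=5, random_state=0)
tree.fit(X_train, y_train)
print("held-out accuracy:", tree.score(X_test, y_test))
</verbatim>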

An interesting point to note is that, aside from their relative popularity, none of the attributes of the algorithms mentioned are specific to astronomy, or, indeed, to any given problem. This emphasizes their generic utility for a variety of analyses, and hence, science goals.

Material for this table is based on section 4 of this guide, Table 1 of Ball & Brunner (2010), and Table 10.1 of Hastie et al. (2001).


Note that the table is sorted in ascending alphabetical order of the Disadvantages column (in the wiki it is sortable by column), so the rows for a given algorithm are not contiguous: an algorithm's name appears on only one of its rows.

| *Algorithm* | *Advantages* | *Disadvantages* |
| | constrained k-means, guided by training data | Affected by outliers |
| Decompositions (SVD, PCA, ICA) | PCA is extensively used in astronomy | Applicable to numerical data only |
| | | Basic algorithm requires no overlap between classes |
| | Many extensions, e.g., | Biased towards finding spherical clusters |
| Artificial Neural Network | Good approximation of nonlinear functions | Black-box model |
| | Good computational scalability | Building the tree can be slow (data sorting) |
| Mixture Models & Expectation Maximization | Gives number of clusters in the data | Can be biased toward Gaussians |
| | | Can be slow to converge |
| | Can give class labels for semi-supervised learning | Can be slow to converge |
| Decision Tree | Popular real-world data mining algorithm | Can generate large trees that require pruning |
| | Robust to irrelevant or redundant attributes | Can overfit |
| | Interpretable model, easily converted to rules | Can overfit |
| | Good scalability to high-dimensional data | Can overfit |
| k-Nearest Neighbor | Uses all available information | Computationally intensive (but can be mitigated, e.g., k-d tree) |
| | Conceptually simple, and easy to code | |
| | Copes with missing data | Doesn't work well with a large number of components |
| | Can input and output numerical or categorical variables | Generally poorer predictive power than ANN, SVM or kNN |
| | Good at handling missing values and outliers | |
| | Good predictive power | |
| Support Vector Machine | Copes with noise | Harder to classify > 2 classes |
| | Probabilistic cluster membership | Has difficulties with different densities in the clusters |
| | Suitable for clusters of different density, size and shape | Local minima |
| | Good predictive power | Long training time |
| | Robust to outliers, noisy or redundant attributes, missing values | Many adjustable parameters |
| | | No missing values |
| | Gives expected error rate | No model is created |
| | Does not require training | No model is created |
| | Good predictive power | Non-trivial architecture: many adjustable parameters |
| | Easily parallelized | Non-trivial choice of fitness function and solution representation |
| | | Not guaranteed to find a local minimum |
| Genetic algorithm | Able to avoid getting trapped in local minima | Not guaranteed to find the global minimum |
| | | Often requires a large number of iterations through the data |
| | Few or no adjustable parameters | Performs poorly with high-dimensional data |
| | Can approximate nonlinear functions | Poor at handling missing or irrelevant attributes |
| | Popular algorithm in astronomy | Poor interpretability |
| | | Requires normalized data |
| | | Requires numerical data, not categorical |
| | | Requires numerical inputs if using Euclidean distance |
| | Can be extended to the nonlinear case and noisy data | Requires whitening of the data as a preprocessing step |
| | Extensively used in astronomy | Sensitive to noise |
| | | Sensitive to outliers if maximum likelihood is used |
| | Gives unique solution (no local minima) | Some adjustable parameters |
| | Easily parallelized | Susceptible to local minima |
| k-means clustering | Well-known, simple, popular algorithm | Susceptible to noise |
| Kohonen Self-Organizing Maps | Well-known, popular algorithm | Susceptible to noise |
| | Easily parallelized | Susceptible to noise and irrelevant attributes |
| | | Susceptible to outliers |
| | | Training can be slow |
| | | Uses numerical data |
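Since none of these trade-offs singles out a universally best method, one practical way to choose among the candidates in the table is simply to cross-validate several of them on your own data. The sketch below shows this, again assuming scikit-learn and synthetic stand-in data; the distance- and margin-based methods are wrapped with the normalization step the table lists as a requirement for them.

<verbatim>
# Sketch: empirically comparing several of the algorithms above by
# cross-validation. scikit-learn is assumed; the dataset is synthetic.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

candidates = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    # kNN, SVM and ANN are sensitive to attribute scales, so they get the
    # normalization the table notes they need; the tree does not.
    "kNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "ANN": make_pipeline(StandardScaler(), MLPClassifier(max_iter=2000, random_state=0)),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print("%s: %.3f +/- %.3f" % (name, scores.mean(), scores.std()))
</verbatim>

Held-out scores like these capture only predictive power; the other columns of the table (training time, interpretability, handling of missing values and outliers) still have to be weighed by hand.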

References

  1. Ball N.M. & Brunner R.J., "Data Mining and Machine Learning in Astronomy", International Journal of Modern Physics D 19 (7) 1049-1106 (2010); arXiv:0906.2173
  2. Hastie T., Tibshirani R. & Friedman J., The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer Series in Statistics, 1st edn. (Springer, New York, 2001).


-- NickBall - 05 Sep 2010
-- NickBall - 23 Sep 2011

