IVOA Web>IvoaKDD>IvoaKDDguide>IvoaKDDguideComparison (2011-09-24, NickBall)

IVOA KDD-IG: A user guide for Data Mining in Astronomy

5: Which algorithm to use?

Unfortunately, there is no simple answer, because the differing characteristics of different algorithms render them more or less suitable for different problems. While within in the data mining community there is much literature, for example, comparing different algorithms on a specific dataset, or looking at their theoretical properties in idealized (read: unrealistic) situations, there is much less available to help make a practical choice with real data. We attempt to remedy that here by comparing and contrasting the characteristics of some commonly used algorithms.

Given the number of factors influencing the decision, we do not recommend a particular algorithm. However, if it is truly unknown where to start, the decision tree is a good bet, because, although it may not provide the best results, it is generally robust to issues encountered in real-world data.

An interesting point to note is that, aside from their relative popularity, none of the attributes of the algorithms mentioned are specific to astronomy, or, indeed, any given problem. This emphasizes their generic utility for a variety of analyses, and hence, science goals.

Material for this table is based on section 4 of this guide, Table 1 of Ball & Brunner (2010), and Table 10.1 of Hastie et al. (2001).

Algorithm	Advantages	Disadvantages
Artificial Neural Network	Good approximation of nonlinear functions	Black-box model
	Easily parallelized	Susceptible to local minima
	Good predictive power	Non-trivial architecture: many adjustable parameters
	Extensively used in astronomy	Sensitive to noise
	Robust to irrelevant or redundant attributes	Can overfit
		Training can be slow
		No missing values
Decision Tree	Popular real-world data mining algorithm	Can generate large trees that require pruning
	Can input and output numerical or categorical variables	Generally poorer predictive power than ANN, SVM or kNN
	Interpretable model, easily converted to rules	Can overfit
	Robust to outliers, noisy or redundant attributes, missing values	Many adjustable parameters
	Good computational scalability	Building the tree can be slow (data sorting)
Genetic algorithm	Able to avoid getting trapped in local minima	Not guaranteed to find the local minimum
	Easily parallelized	Non-trivial choice of fitness function and solution representation
		Can be slow to converge
Support Vector Machine	Copes with noise	Harder to classify > 2 classes
	Gives expected error rate	No model is created
	Good predictive power	Long training time
	Popular algorithm in astronomy	Poor interpretability
	Can approximate nonlinear functions	Poor at handling missing or irrelevant attributes
	Good scalability to high-dimensional data	Can overfit
	Gives unique solution (no local minima)	Some adjustable parameters
k-Nearest Neighbor	Uses all available information	Computationally intensive (but can be mitigated, e.g., k-d tree)
	Does not require training	No model is created
	Easily parallelized	Susceptible to noise and irrelevant attributes
	Few or no adjustable parameters	Performs poorly with high-dimensional data
	Good predictive power
	Uses numerical data
	Conceptually simple, and easy to code
	Good at handling missing values and outliers
k-means clustering	Well-known, simple, popular algorithm	Susceptible to noise
	Many extensions, e.g.,	Biased towards finding spherical clusters
	probabilistic cluster membership	Has difficulties with different densities in the clusters
	constrained k-means, guided by training data	Affected by outliers
		Requires numerical inputs if using Euclidean distance
		Not guaranteed to find local minimum
		Basic algorithm requires no overlap between classes
Mixture Models & Expectation Maximization	Gives number of clusters in the data	Can be biased toward Gaussians
	Suitable for clusters of differerent density, size and shape	Local minima
	Copes with missing data	Doesn't work well with a large number of components
	Can give class labels for semi-supervised learning	Can be slow to converge
	Requires numerical data, not categorical
Kohonen Self-Organizing maps	Well-known, popular algorithm	Susceptible to noise
		Requires normalized data
		Susceptible to outliers
		Often requires a large number of iterations through the data
Decompositions (SVD, PCA, ICA)	PCA is extensively used in astronomy	Applicable to numerical data only
	Can be extended to the nonlinear case and noisy data	Requires whitening of the data as preprocessing step
		Sensitive to outliers if maximum likelihood is used

References

Ball N.M. & Brunner R.J., "Data Mining and Machine Learning in Astronomy", International Journal of Modern Physics D 19 (7) 1049-1106 (2010); arXiv/0906.2173
Hastie T., Tibshirani R. & Friedman J., The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer Series in Statistics, 1st edn. (Springer, New York, 2001).

-- NickBall - 05 Sep 2010
-- NickBall - 23 Sep 2011

Topic revision: r2 - 2011-09-24 - NickBall

IVOA

Log in or Register

IVOA.net
Wiki Home
WebChanges
WebTopicList
WebStatistics

Twiki Meta & Help
IVOA
Know
Main
Sandbox
TWiki

TWiki intro
TWiki tutorial
User registration
Notify me

Working Groups

Interest Groups

Time Domain

Committees

Stds&Procs

www.ivoa.net
Documents
Events
Members
XML Schema