# IVOA KDD-IG: A user guide for Data Mining in Astronomy

## 7: Algorithms and techniques astronomers could benefit from but seldom use

There are a large number of algorithms and approaches that are well-known to workers in other fields, but are seldom used in astronomy. There is much potential for novelty in collaboration with these subject areas. In particular, significant collaboration with statisticians has created a field related to astroinformatics, astrostatistics.

Subject areas include:

• Applied mathematics
• Artificial intelligence
• Computer science
• Signal processing
• Statistics

We now describe some of the techniques that hold potential.

• Bootstrap aggregating (bagging): Random subsamples with replacement (bootstrap samples) are taken, and the algorithm, for example, a decision tree, is trained on each. The created algorithms vote on the testing set. Bagging often works well on real-world data. When bagging is used with the random selection of a small subset of features for splitting at each node of a decision tree, the procedure is known as a Random Forest.

• Mixture of experts: This is a general approach that often works well on real-world data. The outputs from several instances of an algorithm (also called a committee), or several different algorithms, are combined, using some method such as voting, or the output of one algorithm forming the input of another. The final performance is often better than any of the individual algorithms.

• Boosting: This uses the fact that several weaker instances of an algorithm can be combined to produce a stronger instance, often better than any single instance. Boosted decision trees have gained popularity in recent years in several fields. The weak learners are iteratively added, and misclassified objects receive a higher weight. Thus, differences between classes are emphasized, i.e., boosted. Boosting is different to bagging or mixture of experts, because the data themselves are altered.

• Simulated annealing: This is an addition to a search for a minimum in a parameter space, in which a point is perturbed. This allows it not to get trapped in a local minimum, and hence be more likely to find the global minimum.

• Semi-supervised learning: Combines the advantages of supervised and unsupervised learning. Existing classes are incorporated as training data, but new classes can be discovered in an automated way from the data. Could be used for, e.g., large surveys, where spectra are available, but also applied to the much larger photometric regime.

• Radial basis function networks: Similar to the usual artificial neural networks used in astronomy, these fit smooth functions to a training set and can thus interpolate. This allows one to overcome the problem of high dimensional data being sparse (the 'curse of dimensionality'). They use a hyperelliptical kernel, and are a route to partial unsupervised learning.

• Adaptive resonance theory: Patterns are fed sequentially to an ANN, and new examples can add to existing clusters, or form new ones.

• Fuzzy c-means clustering: Extension of k-means clustering that allows the probability of cluster membership to be assigned to each object, removing the requirement that clusters be completely separated. Useful, because different classes of astronomical object often overlap each other in the parameter space of observations.

• Locally linear embedding: Part of manifold learning, this is another method of nonlinear dimension reduction that can project high dimensional data onto something more manageable.

• Probabilistic principal surfaces: A generalization of PCA that performs unsupervised clustering using dimension reduction

• Negative entropy clustering: Clustering performed hierarchically based on the information difference between a Gaussian and a generic multivariate distribution in a space of the same number of dimensions

• Wavelets: A wavelet is a wave that starts and ends at amplitude zero. A given shape of wavelet can be convolved with data to extract signals of the same shape from the data. These are already used in astronomy, for example in analyzing the cosmic microwave background, but there remains much potential to be exploited, including other shapes, such as needlets, curvelets, platelets, and so on, and other generalizations, such as to the low counts regime.

• Hidden Markov models: part of a broader class of Bayesian networks, which are themselves graphical models, these have been used in astronomy for real-time classification of time domain data, where the existing base of knowledge that would form the training set to a more traditional supervised method is sparse (e.g., few objects, or missing data), thus prior knowledge is useful. Markov logic networks allow domain knowledge to be expressed as a quantified value within the network.

• Online learning: a machine learning algorithm learns the data from a training set as it is streamed, as opposed to all at once. This is useful for utilizing training sets that are too large to be loaded into machine memory.

• Diffusion maps: Part of spectral connectivity analysis, these are a nonlinear method for transforming observed data onto a simpler, more natural coordinate system. Results can be superior to those gained from principal component analysis, and the method is robust to outliers.

• False discovery rate: This is a method for establishing a significant discovery from a set of data smaller than the usual N sigma hypothesis test. It allows one to a priori control the number of false rejections made when testing a model against a null hypothesis, and reduces the number of false rejections compared to correct detections. It also works with correlated data.

• Anderson-Darling test: A statistical test that is more sensitive than the Kolmogorov-Smirnov test (widely used in astronomy) to test for departures of a sample from a normal distribution.

• Mark correlation function: This generalizes the usual 2- or N-point correlation function, so that the properties (or marks) of an object are taken into account. An example is galaxies that are clustered in space, where the clustering is a function of several properties. Related to this, the mark connection function gives the relative frequencies of mark pairs as a function of pair separation. Both of these are part of the statistics of marked point processes.

• Multiscale methods: These are capable of adapting to spatially varying smoothness in images, enabling deconvolution with error estimates.

-- NickBall - 19 Mar 2011
-- NickBall - 23 Sep 2011

Edit | Watch | Print version | History: r4 < r3 < r2 < r1 | Backlinks | Raw View | More topic actions
Topic revision: r4 - 2011-09-24 - NickBall

Twiki Meta & Help
IVOA
Know
Main
Sandbox
TWiki

TWiki intro
TWiki tutorial
User registration
Notify me

Working Groups

Interest Groups

Committees

Copyright © 2008-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback