IVOA Web>IvoaKDD>IvoaKDDguide>IvoaKDDguideAlgorithms (revision 7)~~EditAttach~~

IVOA KDD-IG: A user guide for Data Mining in Astronomy

4: The main data mining algorithms

One of the aims of the KDD-IG is to build up an inventory of data mining algorithms that are of use to astronomy. We don't attempt to duplicate that here, but instead provide descriptions of some of the most well-known data mining algorithms, many of which have been fairly extensively used in astronomy.

Artificial neural network
Decision trees
Genetic algorithms
k Nearest Neighbor
k-means clustering
Kernel density estimation
Kohonen self-organizing map
Independent component analysis
Mixture models and EM algorithm
Support vector machine

Artificial Neural Networks

Artificial Neural Networks (ANNs) are one of the oldest data-mining algorithms, and one of the first to be applied in astronomy. Modelled after the mammalian Brain, ANNs consist of a large number of processing units that are interconnected with each other. The interconnections are represented by weights (numerical values in the range 0 to 1 or -1 to 1), and learning of the model occurs by adjusting the weights during the training phase. ANNs can be used for both supervised (predictive) and unsupervised (descriptive) data mining.

ANNs come in a large variety of flavors, more so with respect to the architecture in which the processing units, so-called perceptrons , are connected, but also with respect to the learning algorithm used. One typical architecture is that of a feedforward network, in which a distinct input layer is connected to a distinct output layer via one or more hidden layers. Connections in this particular network point in the forward direction only. Each node in the input layer represents one attribute for each sample, while each node in the output layer typically represents a class (with the exception of a single output node for a two-class problem). ANNs require numerical values as input, which should be normalized.

For a feedforward network, learning typically occurs through the Backpropagation algorithm. For this algorithm, input values are presented to the input layer and passed through the hidden layers to produce values at each node in the output layer. The produced output is compared to the desired output (in the form of the correct class); the resulting error is then backpropagated in the reverse direction and used to adjust the weight so as to minimize the error.

Main characteristics:

convergence of the weights is typically slow
the model (the numerical values associated with the weights) is hard to interpret
prone to settle into a local minimum because of the complexity of the error surface
sensitive to noise
choice of the architecture is non-trivial
able to approximate any function, given a lack of certain restrictions with respect to the number of hidden layers and number of nodes in hidden layers
easy to parallize

Decision Trees

Decision Trees build tree-like structures from the data in which leaf nodes represent classes, while internal nodes represent tests to be performed on the attribute values. The training phase (building of the tree) proceeds as follows. Initially, each datum is associated with the root. Using measures such as entropy, the Gini index, or the misclassification error, the data is separated into disjoint subsets based on splits of the attribute values. The splits are applied recursively to each of the nodes at the current level until a purity condition is satisfied and the node turns into a leaf. Typically, splits are binary and are performed based on single attribute values; however, one variant are oblique decision trees which make decisions based on combinations of attribute values. Other variants are regression trees for prediction of numerical values, and Random Forests which introduce a random factor when choosing the splitting attribute during the training phase.
When the model is applied after construction of the tree, a new datum to be classified is presented to the root. Based on the outcome of the attribute test, the datum is then associated with one of the root's children. Tests are then performed recursively until the datum reaches one of the leaves, at which point the datum is assigned to the class represented by the leaf (which is usually the majority of the training samples assigned to the leaf during the training phase).

Main characteristics:

are not restricted to numerical attributes like ANNs
the model (the sequence of tests on the attribute values) is easy to understand and can readily be converted to a rule set
building of the tree requires sorting of the data, which can be slow for large, numerical datasets
robust to noise because the tree is typically pruned after the building phase

Genetic Algorithms

Genetic algorithms are evolutionary algorithms that represent the members of the population as chromosomes. Typically, the chromosomes consist of a series of 0s and 1s, although other representations are also possible.The population is evolved over time over many generations by allowing those members with the highest evaluation through a problem-dependent fitness function to reproduce. Typical reproduction mechanisms are crossover and mutation. For crossover, a random location in the chromosomes is chosen to split two parents. Two offspring are then created by combining two of the resulting pieces, one from each parent. During mutation, a random position in the chromosome is chosen and flipped (if the bit representation is used) to produce a single offspring.

One key point is that the members of the population represent solutions (the function to be learned), rather than the data itself.

Main characteristics:

able to explore the solution space without getting trapped in local minima
not guaranteed to find the optimal solution
choice of fitness function and representation of solutions is problem dependent and often non-trivial
can be slow to converge
lend themselves to easy parallelization

k Nearest Neighbour

Algorithms such as ANNs and Decision Trees introduced above are eager learners that perform the work during a typically slow training phase with the benefits of fast classification of new data during the deployment phase. In contrast, k Nearest Neighbour (kNN) approaches are lazy learners, which restrict the model-building phase to storing of the training data, while all the work is performed when new data is to be classified. This results in a slow deployment phase, which is one of the main disadvantage of this algorithm.

In order to classify a new input sample, the algorithm determines the k-closest points from the training set to the datum to be classified (typically using the Euclidean distance, although other measures are possible). The class of the new datum is then determined as the majority of the classes of the k-closest points.

Main characteristics:

does not work well with high-dimensional data
susceptible to noise
very slow for large datasets, even though this can be improved upon through the use of data structures such as k-d trees
uses numerical data
easy to code and parallelize

k-means Clustering

k-means Clustering is an unsupervised, partition-based approach that assigns each of the input samples to one of k cluster centers (centroids), usually based on the Euclidean distance.
In Step 1 of the approach, the cluster centers are selected, typically on a random basis. In that case, there is no guarantee that the algorithm finds the optimal location of the cluster centers and the algorithm should be run multiple times. Other variants try to select more suitable initial cluster centers, for example by choosing one representative of each cluster present in the data, or by choosing a larger number of cluster centers followed by subsequent merging of clusters. Step 2 of the algorithm assigns each input datum to the closest cluster center, while Step 3 adjusts the cluster centers so that they better represent the assigned data. Steps 2 and 3 are then repeated until a termination criterion is satisfied(for example, when the number of samples changing clusters or the SSE falls below a threshold).

Main characteristics:

susceptible to noise
biased towards finding spherical clusters
has difficulties with different densities in the clusters, and outliers int he data
requires numerical inputs if using Euclidean distance
not guaranteed to find optimal solution since choice of initial cluster centers influences quality of result

Kohonen Self-Organizing Map

Under construction by group members

-- NickBall - 05 Sep 2010
-- SabineMcConnell - 16 Jan 2011
-- SabineMcConnell - 21 Jan 2011

Topic revision: r7 - 2011-01-21 - SabineMcConnell

IVOA

Log in or Register

IVOA.net
Wiki Home
WebChanges
WebTopicList
WebStatistics

Twiki Meta & Help
IVOA
Know
Main
Sandbox
TWiki

TWiki intro
TWiki tutorial
User registration
Notify me

Working Groups

Interest Groups

Time Domain

Committees

Stds&Procs

www.ivoa.net
Documents
Events
Members
XML Schema