4: The main data mining algorithms
One of the aims of the KDD-IG is to build up an
inventory of data mining algorithms that are of use to astronomy. We don't attempt to duplicate that here, but instead provide descriptions of some of the most well-known data mining algorithms, many of which have been fairly extensively used in astronomy.
- Artificial neural network
- Decision trees
- Genetic algorithms
- k nearest neighbor
- k-means clustering
- Kernel density estimation
- Kohonen self-organizing map
- Independent component analysis
- Mixture models and EM algorithm
- Support vector machine
Artificial Neural Networks
Artificial Neural Networks (ANNs) are one of the oldest data-mining algorithms, and one of the first to be applied in astronomy. Modelled after the mammalian Brain, ANNs consist of a large number of processing units that are interconnected with each other. The interconnections are represented by weights (numerical values in the range 0 to 1 or -1 to 1), and learning of the model occurs by adjusting the weights during the training phase. ANNs can be used for both supervised (predictive) and unsupervised (descriptive) data mining.
ANNs come in a large variety of flavors, more so with respect to the architecture in which the processing units, so-called
perceptrons , are connected, but also with respect to the learning algorithm used. One typical architecture is that of a
feedforward network, in which a distinct
input layer is connected to a distinct
output layer via one or more
hidden layers. Connections in this particular network point in the forward direction only. Each node in the input layer represents one attribute for each sample, while each node in the output layer typically represents a class (with the exception of a single output node for a two-class problem). ANNs require numerical values as input, which should be normalized.
For a feedforward network, learning typically occurs through the
Backpropagation algorithm. For this algorithm, input values are presented to the input layer and passed through the hidden layers to produce values at each node in the output layer. The produced output is compared to the desired output (in the form of the correct class); the resulting error is then backpropagated in the reverse direction and used to adjust the weight so as to minimize the error.
Main characteristics:
- convergence of the weights is slow
- the model (the numerical values associated with the weights) is hard to interpret
- prone to settle into a local minimum because of the complexity of the error surface
- sensitive to noise
- choice of the architecture is non-trivial
- able to approximate any function, given a lack of certain restrictions with respect to the number of hidden layers and number of nodes in hidden layers
- easy to parallize
Decision Trees
Decision Trees build tree-like structures from the data in which leaf nodes represent classes, while internal nodes represent tests to be performed on the attribute values. The training phase (building of the tree) proceeds as follows. Initially, each datum is associated with the root. Using measures such as
entropy, the
Gini index, or the
misclassification error, the data is separated into disjoint subsets based on splits of the attribute values. The splits are applied recursively to each of the nodes at the current level until a purity condition is satisfied and the node turns into a leaf. Typically, splits are binary and are performed based on single attribute values; however, one variant are
oblique decision trees which make decisions based on combinations of attribute values. Other variants are
regression trees for prediction of numerical values, and
Random Forests which introduce a random factor when choosing the splitting attribute during the training phase.
When the model is applied after construction of the tree, a new datum to be classified is presented to the root. Based on the outcome of the attribute test, the datum is then associated with one of the root's children. Tests are then performed recursively until the datum reaches one of the leaves, at which point the datum is assigned to the class represented by the leaf (which is usually the majority of the training samples assigned to the leaf during the training phase).
Main characteristics:
- are not restricted to numerical attributes like ANNs
- the model (the sequence of tests on the attribute values) is easy to understand and can readily be converted to a rule set
- building of the tree requires sorting of the data, which can be slow for large, numerical datasets
- robust to noise because the tree is typically pruned after the building phase
Genetic Algorithms
Genetic algorithms
Main characteristics:
Under construction by group members
--
NickBall - 05 Sep 2010
--
SabineMcConnell - 16 Jan 2011