One of the aims of the KDD-IG is to build up an inventory of data mining algorithms that are of use to astronomy. We don't attempt to duplicate that here, but instead provide descriptions of some of the most well-known data mining algorithms, many of which have been fairly extensively used in astronomy.

- Artificial Neural Networks
- Decision Trees
- Genetic Algorithms
- Support Vector Machines
- k-Nearest Neighbor algorithm
- k-means clustering
- Mixture models and EM algorithm
- Kernel density estimation
- Kohonen Self-Organizing maps
- Decompositions (SVD, PCA, ICA)

**Artificial Neural Networks**

Artificial Neural Networks (ANNs) are one of the oldest data-mining algorithms, and one of the first to be applied in astronomy. Modelled after the mammalian Brain, ANNs consist of a large number of processing units that are interconnected with each other. The interconnections are represented by weights (numerical values in the range 0 to 1 or -1 to 1), and learning of the model occurs by adjusting the weights during the training phase. ANNs can be used for both supervised (predictive) and unsupervised (descriptive) data mining.

ANNs come in a large variety of flavors, more so with respect to the architecture in which the processing units, so-called * perceptrons *, are connected, but also with respect to the learning algorithm used. One typical architecture is that of a *feedforward* network, in which a distinct *input layer* is connected to a distinct *output layer* via one or more *hidden layers*. Connections in this particular network point in the forward direction only. Each node in the input layer represents one attribute for each sample, while each node in the output layer typically represents a class (with the exception of a single output node for a two-class problem). ANNs require numerical values as input, which should be normalized.

For a feedforward network, learning typically occurs through the *Backpropagation* algorithm. For this algorithm, input values are presented to the input layer and passed through the hidden layers to produce values at each node in the output layer. The produced output is compared to the desired output (in the form of the correct class); the resulting error is then backpropagated in the reverse direction and used to adjust the weight so as to minimize the error.

__Main characteristics:__

- convergence of the weights is typically slow
- the model (the numerical values associated with the weights) is hard to interpret
- prone to settle into a local minimum because of the complexity of the error surface
- sensitive to noise
- choice of the architecture is non-trivial
- able to approximate any function, given a lack of certain restrictions with respect to the number of hidden layers and number of nodes in hidden layers
- easy to parallelize

**Decision Trees**

Decision Trees build tree-like structures from the data in which leaf nodes represent classes, while internal nodes represent tests to be performed on the attribute values. The training phase (building of the tree) proceeds as follows. Initially, each datum is associated with the root. Using measures such as *entropy*, the *Gini index*, or the *misclassification error*, the data is separated into disjoint subsets based on splits of the attribute values. The splits are applied recursively to each of the nodes at the current level until a purity condition is satisfied and the node turns into a leaf. Typically, splits are binary and are performed based on single attribute values; however, one variant are *oblique* decision trees which make decisions based on combinations of attribute values. Other variants are *regression trees * for prediction of numerical values, and *Random Forests* which introduce a random factor when choosing the splitting attribute during the training phase.

When the model is applied after construction of the tree, a new datum to be classified is presented to the root. Based on the outcome of the attribute test, the datum is then associated with one of the root's children. Tests are then performed recursively until the datum reaches one of the leaves, at which point the datum is assigned to the class represented by the leaf (which is usually the majority of the training samples assigned to the leaf during the training phase).

__Main characteristics:__

- are not restricted to numerical attributes like ANNs
- the model (the sequence of tests on the attribute values) is easy to understand and can readily be converted to a rule set
- building of the tree requires sorting of the data, which can be slow for large, numerical datasets
- robust to noise because the tree is typically pruned after the building phase

Genetic algorithms are

One key point is that the members of the population represent solutions (the function to be learned), rather than the data itself.

- able to explore the solution space without getting trapped in local minima
- not guaranteed to find the optimal solution
- choice of fitness function and representation of solutions is problem dependent and often non-trivial
*can be slow to converge*- lend themselves to easy parallelization

Support Vector Machines are supervised techniques that separate the data by finding the hyperplane that maximizes the

- suitable for high-dimensional data
- the use of a convex optimization approach leads to a global minimum

Algorithms such as ANNs and Decision Trees introduced above are

In order to classify a new input sample, the algorithm determines the k-closest points from the training set to the datum to be classified (typically using the Euclidean distance, although other measures are possible). The class of the new datum is then determined as the majority of the classes of the k-closest points.

- does not work well with high-dimensional data
- susceptible to noise
- very slow for large datasets, even though this can be improved upon through the use of data structures such as k-d trees
- uses numerical data
- easy to code and parallelize

k-means Clustering is an unsupervised, partition-based approach that assigns each of the input samples to one of k cluster centers (centroids), usually based on the Euclidean distance.

In Step 1 of the approach, the cluster centers are selected, typically on a random basis. In that case, there is no guarantee that the algorithm finds the optimal location of the cluster centers and the algorithm should be run multiple times. Other variants try to select more suitable initial cluster centers, for example by choosing one representative of each cluster present in the data, or by choosing a larger number of cluster centers followed by subsequent merging of clusters. Step 2 of the algorithm assigns each input datum to the closest cluster center, while Step 3 adjusts the cluster centers so that they better represent the assigned data. Steps 2 and 3 are then repeated until a termination criterion is satisfied (for example, when the number of samples changing clusters or the SSE falls below a threshold).

- susceptible to noise
- biased towards finding spherical clusters
- has difficulties with different densities in the clusters, and outliers in the data
- requires numerical inputs if using Euclidean distance
- not guaranteed to find optimal solution since choice of initial cluster centers influences quality of result

Mixture models describe the data as a linear combination of a number of components that are described via parameters, for example the mean and standard deviation for a Gaussian distribution. The exact number of components, as well as the parameters describing them are estimated from the data. The Expectation Maximization (EM) algorithm is one approach to estimate these parameters iteratively over two steps. Initially, the parameter values for each of the mixture components are guessed. In Step one, the algorithm determines the probability of belonging to each of the components for each datum. Using the probabilities determined in Step 1, Step 2 then calculates the estimated parameters for each of the components. The two steps are repeated until a stopping criterion (for example, when the change in parameter estimates falls below a threshold) is satisfied. In this context, the application of EM is typically viewed as a clustering approach, in which the maximum likelihood is used to allocate the data to the clusters once the algorithm terminates. Assuming spherical Gaussian distributions with identical covariance matrices, the EM algorithm reduces to the k-means algorithm.

- The EM algorithm is not guaranteed to find global maxima for the likelihood, multiple runs with different initial parameter estimates are therefore required.
- applicable to numerical data only
- best for small numbers of components
- can be slow to converge
- suitable for data that contains clusters which differ in density, size and shape

Self-Organizing Maps (SOMs) belong to the class of Artificial Neural Networks, and were originally proposed by Kohonen. The map consists of a typically two-dimensional arrangement of neurons, for example in the form of a hexagon or a rectangle. Each neuron is represented by a weight, and all neurons are repeatedly presented with the input samples. The neuron closest (in terms of Euclidean distance) to a given input datum is deemed the

Updates to the weights of the winners and their neighbourhood can either be performed sequentially after each input is represented or in batch mode where updates are performed after a complete iteration through the input data only. In addition, a learning rate is used in determining the weight changes; this learning rate also decreases over time.

- susceptible to noise
- requires normalized data
- susceptible to outliers
- often requires a large number of iterations through the data

Main decompositions used for unsupervised data mining include Principal Component Analysis (PCA), Singular Value Decomposition (SVD), and Independent Component Analysis (ICA). The latter is an extension of PCA which models the data as a linear mixture of unknown but independent non-gaussian variables and is often applied to the problem of blind source separation. One of the methods used to estimate the mixing matrix of the variables is Maximum likelihood estimation.

- applicable to numerical data only
- requires whitening of the data as preprocessing step
- can be extended to the nonlinear case and noisy data
- sensitive to outliers if Maximum likelihood is used

-- NickBall - 05 Sep 2010

-- SabineMcConnell - 16 Jan 2011

-- SabineMcConnell - 21 Jan 2011

-- SabineMcConnell - 29 Jan 2011

**IVOA.net**

Wiki Home

WebChanges

WebTopicList

WebStatistics

**Twiki Meta & Help**

IVOA

Know

Main

Sandbox

TWiki

TWiki intro

TWiki tutorial

User registration

Notify me

**Working Groups**

Data Access

Data Model

GWS

Query Language

Registry

Stds&Procs

Semantics

VOTable

Applications

**Interest Groups**

Data Curation

Theory

Education

Operations

Time Domain

Knowledge Discovery in Database

Solar System

Copyright © 2008-2020 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.

Ideas, requests, problems regarding TWiki? Send feedback

Ideas, requests, problems regarding TWiki? Send feedback