IVOA KDD-IG: A user guide for Data Mining in Astronomy

3: Overview of the data mining process

The data-mining process can be broken down into the following steps, each of which is executed and refined repeatedly.

  • Data collection
  • Data pre-processing
  • Model building
  • Model validation
  • Model deployment

The extraction of information through data mining is an iterative process that cannot be fully automated. Each dataset is different in a multitude of ways, having been collected for a different purpose and often as a by-product of some other process. In addition, a large variety of algorithms is available, each with its own characteristics. Therefore, one of the most important rules is to be cautious and not to trust any results without careful validation of the approach. The following sections point out some of the main considerations for each of the above steps in the data-mining process.

Step 1: Data Collection. This is usually outside the control of the data miner. The data-mining process typically starts with a dataset that was collected as a by-product of some other process. Therefore, the most important point for this step is to understand the bias in the data and the restrictions this imposes on your choices in the later steps.

Step 2: Data Pre-processing. This is the most time-consuming step in the overall approach. It differs from the typical pre-processing steps applied in astronomy and is applied after them. Not only do issues with the data such as missing values, multiple measurements, and noise have to be addressed; the data also has to be transformed into a format suitable for the algorithm that is to be applied. Since the algorithm is likely to change during the course of a data-mining exercise (at least in the initial iterations through this process), the pre-processing step has to be revisited over and over again, with varying requirements.

Typical pre-processing steps include:

  • normalization of data
  • removal of attributes/attribute selection
  • transformation of attributes (categorical to numerical or vice versa)
  • binning of attribute values
  • replacement of missing values
  • removal of noise
  • sampling

Here are some examples of pre-processing requirements for particular algorithms. Neural networks work best with normalized data. Decision trees are insensitive to normalization but work better with discrete attributes that have a small number of possible values, so the data might have to be binned. Redundant attributes can lead to large decision trees, while they affect neural networks much less; for decision trees, redundant attributes should therefore be removed during pre-processing. Neural networks are sensitive to noisy data, especially for small datasets, so noise should be reduced beforehand (for example through PCA), while decision trees are less affected by noise, but only because they are typically pruned in the model-building stage. Nearest-neighbour approaches can handle noise if the number of neighbours taken into account is adjusted. However, distance-based approaches do not work well if the attributes are not equally weighted (requiring normalization), and they typically work with numerical data only (requiring transformation of categorical attributes). Expectation-maximization approaches can deal with missing data, but k-means techniques require substitution of missing values.
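
Three of the pre-processing steps listed above (replacement of missing values, normalization, and binning) can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions: the function names, the mean-substitution strategy, the min-max scaling, and the equal-width bins are choices made for this example, not methods prescribed by this guide, and the magnitude values are invented.

```python
import statistics

def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]

def min_max_normalize(values):
    """Rescale values linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def equal_width_bins(values, n_bins):
    """Map each value to the index of one of n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

# Hypothetical magnitudes with one missing measurement.
magnitudes = [18.0, None, 20.0, 22.0]
filled = impute_mean(magnitudes)      # [18.0, 20.0, 20.0, 22.0]
scaled = min_max_normalize(filled)    # [0.0, 0.5, 0.5, 1.0]
binned = equal_width_bins(filled, 2)  # [0, 1, 1, 1]
```

Which of these steps is needed depends on the algorithm chosen later: for example, the scaled values would suit a distance-based method, while the binned values would suit a decision tree.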

Step 3: Model Building. This step is not necessarily deterministic, even for a given dataset and algorithm, and may therefore have to be run multiple times. For example, the initial choice of cluster centers for the k-means algorithm may influence the final clustering; therefore, this algorithm is typically run multiple times, and the final result is the best of the set of results according to some evaluation criterion.
Typically, data-mining algorithms also require decisions at this stage (for example, whether a decision tree should be pruned and, if so, when and how; which evaluation criterion should be used to determine the splitting attributes; whether to use multi-way or binary splits) or parameter values (for example, the number of clusters to look for, or the number of hidden nodes in a neural network). No single data-mining algorithm is the best choice for all datasets, for a variety of reasons, and it is usually appropriate (and desirable) to select multiple algorithms at this stage.
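
The k-means restart strategy mentioned above can be sketched as follows. This is a minimal illustration, assuming that the sum of squared errors (SSE) is the evaluation criterion and that initial centers are sampled at random from the data; the function names, the toy points, and the default of 10 restarts are hypothetical choices for this example.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]

def kmeans_once(points, k, rng, max_iter=100):
    """One k-means run (Lloyd iterations) from k randomly sampled start centers."""
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                # Move the center to the mean of its members.
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                # Keep the old center if the cluster lost all its members.
                new_centers.append(centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    labels = assign(points, centers)
    sse = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return sse, centers, labels

def kmeans_best_of(points, k, n_restarts=10, seed=0):
    """Run k-means n_restarts times and keep the run with the lowest SSE."""
    rng = random.Random(seed)
    runs = [kmeans_once(points, k, rng) for _ in range(n_restarts)]
    return min(runs, key=lambda run: run[0])

# Toy data: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
best_sse, best_centers, best_labels = kmeans_best_of(points, k=2)
```

Even on this toy dataset a single run can converge to a poor solution: starting from the two left-hand points as centers, the algorithm settles on a top/bottom split with an SSE of 100 instead of the left/right split with an SSE of 1, which is why the best of several runs is kept.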

Step 4: Model Validation. This is perhaps the most important part of the data-mining process. In the case of predictive approaches, the existing class attribute can be used to evaluate the outcome. 10-fold cross-validation is the appropriate method in the majority of cases: the data is first split into 10 folds. One fold is set aside for evaluation of the model, while the remaining 9 folds are used for model building. The process is repeated so that each of the 10 folds is used once for evaluation, and the overall error is the sum of the errors on the individual folds (divided by the total number of samples if an error rate is wanted).
Evaluation on the training set will be too optimistic, while evaluation on a single split into training and test data (for example, 2/3 of the data for training and 1/3 for evaluation) is usually not indicative of the generalization capabilities of the model. In addition, some techniques require a third dataset for validation of the model.
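
The 10-fold procedure described above can be sketched as follows. This is a minimal illustration in plain Python: the function names, the fixed shuffling seed, and the toy one-attribute nearest-neighbour classifier are assumptions made for this example. The error counts of the individual folds are summed and then divided by the number of samples to give an overall error rate.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(X, y, fit, predict, k=10, seed=0):
    """Fraction of misclassified samples, accumulated over all k held-out folds."""
    folds = k_fold_indices(len(X), k, seed)
    errors = 0
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        # Build the model on the remaining k-1 folds only.
        model = fit([X[i] for i in train], [y[i] for i in train])
        # Count the errors on the fold that was set aside.
        errors += sum(predict(model, X[i]) != y[i] for i in held_out)
    return errors / len(X)

# Toy classifier (hypothetical): 1-nearest neighbour on a single attribute.
def fit_1nn(X_train, y_train):
    return list(zip(X_train, y_train))

def predict_1nn(model, x):
    return min(model, key=lambda pair: abs(pair[0] - x))[1]

X = [float(i) for i in range(20)]
y = [0] * 10 + [1] * 10
error_rate = cross_validate(X, y, fit_1nn, predict_1nn, k=10)
```

Because every sample is held out exactly once, the resulting error rate is never computed on data that was also used to build the model being tested.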
In the case of unsupervised approaches, evaluation becomes more difficult. For clustering approaches, measures such as the sum of squared errors (SSE), cohesion, separation, and the silhouette coefficient can be used, as well as the proximity matrix of the input data. In addition, care has to be taken to determine the correct number of clusters, as this will influence the evaluation criteria.
The above measures are criteria for single models only. A further level of comparison is added by comparing pairs of models built from different datasets.
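
Two of the clustering measures named above, the SSE and the silhouette coefficient, can be computed as in the following sketch. It assumes Euclidean distance and points given as tuples; the function names and the toy data are hypothetical, and assigning a score of 0 to single-member clusters is a common convention rather than something specified in this guide.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def sse(points, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for c in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == c]
        centroid = [sum(col) / len(members) for col in zip(*members)]
        total += sum((x - m) ** 2 for p in members for x, m in zip(p, centroid))
    return total

def silhouette(points, labels):
    """Mean silhouette coefficient over all points; needs at least two clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for single-member clusters
            continue
        # a: mean distance to the other members of the same cluster (cohesion).
        a = sum(euclidean(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster (separation).
        b = min(sum(euclidean(p, q) for q in other) / len(other)
                for c, other in clusters.items() if c != lab)
        scores.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return sum(scores) / len(scores)

# Toy data: two tight groups on a line, with a sensible and a poor labelling.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good_labels = [0, 0, 1, 1]
poor_labels = [0, 1, 0, 1]
good_score = silhouette(points, good_labels)
poor_score = silhouette(points, poor_labels)
```

Here the sensible labelling has an SSE of 1.0 and a silhouette coefficient of about 0.90, while the poor labelling has a silhouette coefficient of -0.45. Note that the SSE always decreases as clusters are added, so it cannot be used on its own to choose the number of clusters.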

Step 5: Deployment. This is the final step, reached only after a number of iterations through the previous steps and careful validation of the model. Even if the model is sufficient given the data available at the time of model building, the exercise should be repeated whenever the input data changes.


-- NickBall - 05 Sep 2010
-- SabineMcConnell - 30 Jan 2011
-- SabineMcConnell - 31 Jan 2011
-- SabineMcConnell - 11 Feb 2011


Topic revision: r5 - 2011-09-24 - NickBall
 