IVOA KDD-IG: A user guide for Data Mining in Astronomy
3: Overview of the data mining process
<--Here we elucidate common steps in the data mining process, from raw data to science result. While each particular case will be unique, and driven by the particular science question being addressed, many of the issues encountered are common to different pieces of work, and we describe those here.-->The data-mining process can be broken down into the following steps, each of which is repeatedly executed and refined.
The extraction of information through data mining is an iterative process that cannot be fully automated. Each dataset differs in a multitude of ways: it was collected for a different purpose, often as a byproduct of some other process.
In addition, a large variety of algorithms is available, each with its own characteristics.
Therefore, one of the most important rules is to be cautious and not to trust any result without careful validation of the approach. The following section highlights the main points for each of the above steps in the data-mining process.
Step 3: Model Building | ||||||||
Changed: | ||||||||
< < | ||||||||
> > | This step is not necessarily deterministic, even for a given dataset and algorithm, and may therefore have to be applied multiple times. For example, the initial choice of cluster centers for the k-means algorithm may influence the final clustering; the algorithm is therefore typically run multiple times, and the best result according to some evaluation criterion (for example, the lowest SSE) is kept. | |||||||
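The repeated-run strategy for k-means can be sketched as follows. This is a minimal illustration in plain Python, not a reference implementation: the function names (`kmeans_once`, `best_of_n_kmeans`), the choice of random data points as initial centers, and the use of SSE as the selection criterion are all assumptions made for the example.

```python
import random


def sse(points, centers, labels):
    """Sum of squared Euclidean distances from each point to its assigned center."""
    return sum(
        sum((p - c) ** 2 for p, c in zip(point, centers[label]))
        for point, label in zip(points, labels)
    )


def kmeans_once(points, k, rng, max_iter=100):
    """One k-means run, starting from k randomly chosen data points as centers."""
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest center.
        new_labels = [
            min(
                range(k),
                key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centers[j])),
            )
            for point in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing: converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster became empty
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, labels, sse(points, centers, labels)


def best_of_n_kmeans(points, k, n_runs=10, seed=0):
    """Run k-means n_runs times with different initial centers; keep the lowest SSE."""
    rng = random.Random(seed)
    runs = [kmeans_once(points, k, rng) for _ in range(n_runs)]
    return min(runs, key=lambda run: run[2])


# Toy data (made up for illustration): two well-separated groups in 2-D.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
centers, labels, error = best_of_n_kmeans(points, k=2, n_runs=10)
```

Fixing the seed makes the set of runs reproducible, which helps when the same analysis has to be repeated after a change in the input data.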
Added: | ||||||||
> > | Typically, data mining algorithms also require decisions at this stage (for example, whether a decision tree should be pruned and, if so, when and how; which evaluation criteria should be used to determine the splitting attributes; and whether to use multi-way or binary splits) or parameter values (for example, the number of clusters to look for, or the number of hidden nodes in a neural network). No single data-mining algorithm is the best choice for all datasets, so it is usually appropriate (and desirable) to select multiple algorithms at this stage. | |||||||
Step 4: Model Validation
This is perhaps the most important part of the data-mining process. In the case of predictive approaches, the existing class attribute can be used to evaluate the outcome. 10-fold cross-validation is the appropriate method in the majority of cases: the data is first split into 10 folds. One fold is set aside for evaluation of the model, while the remaining 9 folds are used for model building. The process is repeated so that each of the 10 folds is used once for evaluation, and the overall error is the sum of the errors achieved on the individual folds. Evaluation on the training set will be too optimistic, while evaluation on a single split (for example, 2/3 of the data for training and 1/3 for evaluation) is usually not indicative of the generalization capabilities of the model. In addition, some techniques require a third dataset for validation of the model.
In the case of unsupervised approaches, evaluation becomes more difficult. For clustering approaches, measures such as SSE (the sum of squared errors), cohesion, separation and the Silhouette coefficient can be used, as well as the proximity matrix of the input data. In addition, care has to be taken to determine the correct number of clusters, as this will influence the evaluation criteria.
The above measures are criteria for single models only. Another level of comparison is added by comparing pairs of models built from different datasets.
Step 5: Deployment | ||||||||
Added: | ||||||||
> > | This is the final step, reached only after a number of iterations through the previous steps and careful validation of the model. Even if the model is sufficient given the data available at the time of model building, the exercise should be repeated whenever there is a change in the input data. | |||||||
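Because deployment depends on careful validation, and that validation has to be repeated whenever the input data change, it can help to have the 10-fold cross-validation procedure from Step 4 in reusable form. The sketch below uses only the Python standard library. The 1-nearest-neighbour classifier is a placeholder chosen for brevity, not a recommended model, and all function names are hypothetical.

```python
import random


def k_fold_indices(n_samples, n_folds=10, seed=0):
    """Shuffle the sample indices and deal them into n_folds nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def nearest_neighbour_predict(train_x, train_y, query):
    """Placeholder model: return the label of the closest training point (1-NN)."""
    nearest = min(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], query)),
    )
    return train_y[nearest]


def cross_validate(features, labels, predict=nearest_neighbour_predict,
                   n_folds=10, seed=0):
    """Return (total_errors, error_rate) accumulated over all held-out folds."""
    folds = k_fold_indices(len(features), n_folds, seed)
    total_errors = 0
    for held_out in folds:
        held_out_set = set(held_out)
        # Build the model from the remaining folds only.
        train_idx = [i for i in range(len(features)) if i not in held_out_set]
        train_x = [features[i] for i in train_idx]
        train_y = [labels[i] for i in train_idx]
        # Count misclassifications on the held-out fold, which the model never saw.
        total_errors += sum(
            predict(train_x, train_y, features[i]) != labels[i] for i in held_out
        )
    return total_errors, total_errors / len(features)
```

The overall error is the sum of the per-fold error counts, as in the description under Step 4; dividing by the number of samples gives an error rate that can be compared across datasets of different size. Any other model can be substituted by passing a function with the same three arguments as `nearest_neighbour_predict`.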
Added: | ||||||||
> > | <-- Under construction by group members --> | |||||||
Deleted: | ||||||||
< < | Under construction by group members | |||||||
-- NickBall - 05 Sep 2010
| ||||||||
Changed: | ||||||||
< < | -- SabineMcConnell - 30 Jan 2011 | |||||||
> > | -- SabineMcConnell - 30 Jan 2011 | |||||||
Added: | ||||||||
> > | ||||||||
-- SabineMcConnell - 31 Jan 2011 | ||||||||
Added: | ||||||||
> > | -- SabineMcConnell - 11 Feb 2011 | |||||||
<--
|