IVOA KDDIG: A user guide for Data Mining in Astronomy
3: Overview of the data mining process
Here we elucidate common steps in the data mining process, from raw data to science result. While each particular case will be unique, and driven by the particular science question being addressed, many of the issues encountered are common to different pieces of work, and we describe those here.
The datamining process can be broken down into the following steps, each of which is repeatedly executed and refined: data collection, data preprocessing, model building, model validation, and deployment.
The extraction of information through data mining is an iterative process that is impossible to automate. Each dataset is different in a multitude of ways, collected for a different purpose and often as a byproduct of some other process.
In addition, a large variety of algorithms is available, each with its own characteristics.
Therefore, one of the most important rules is to be cautious, and not to trust any results without careful validation of the approach. The following section points out some of the main considerations for each of the above steps in the datamining process.
Step 1: Data Collection
Usually this is outside the control of a data miner. The datamining process typically starts with a dataset that was collected as a byproduct of some other process. Therefore, the most important point for this step is to understand the bias in the data, and the restrictions this imposes on your choices for the later steps.
Step 2: Data Preprocessing
This is the most time-consuming step in the overall approach. It differs from the typical preprocessing steps applied in astronomy, and is applied after them. Not only do issues with the data such as missing values, multiple measurements, noise, etc. have to be addressed; the data also has to be transformed into a format suitable for the algorithm that is to be applied. Since the algorithm is likely to change during the course of a datamining exercise (at least in the initial iterations through this process), the preprocessing step has to be revisited repeatedly, with varying requirements.
Step 3: Model Building
This step is not necessarily deterministic, even for a given dataset and algorithm, and may therefore have to be applied multiple times. For example, the initial choice of cluster centers for the k-means algorithm may influence the final clustering; the algorithm is therefore typically run multiple times, and the final result is the best of the set of results according to some evaluation criterion. Typically, data mining algorithms also require decisions (for example, whether a decision tree should be pruned and, if so, when and how; which evaluation criterion should be used to determine the splitting attributes; multiway vs. binary splits; etc.) or parameter values (for example, the number of clusters to look for, or the number of hidden nodes in a neural network) at this stage. No single datamining algorithm is the best choice for all datasets, and it is usually appropriate (and desirable) to select multiple algorithms at this stage.
Step 4: Model Validation
This is perhaps the most important part of the datamining process. In the case of predictive approaches, the existing class attribute can be used to evaluate the outcome. 10-fold cross-validation is an appropriate method in the majority of cases: initially, the data is split into 10 folds. One of the folds is set aside for evaluation of the model, while the remaining 9 folds are used for model building. The process is repeated for each of the 10 folds, and the overall error is the sum of the errors on the individual folds, so that every object is used for evaluation exactly once. Evaluation on the training set will be too optimistic, while evaluation on a single split into training and evaluation data (for example, 2/3 of the data for training and 1/3 for evaluation) is usually not indicative of the generalization capabilities of the model. In addition, some techniques require a third dataset for validation of the model.
In the case of unsupervised approaches, evaluation becomes more difficult.
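As an illustration of the 10-fold cross-validation procedure described above for predictive approaches, the following is a minimal sketch in plain Python. The function names, the synthetic one-dimensional data, and the toy nearest-mean classifier are assumptions made for this example only; any real model would take the place of the classifier.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the indices 0..n_samples-1 and deal them into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def train_nearest_mean(xs, ys):
    """Toy 1-D classifier (a stand-in for a real model): store each class mean."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def predict_nearest_mean(model, x):
    """Assign x to the class whose stored mean is closest."""
    return min(model, key=lambda label: abs(x - model[label]))

def cross_validate(xs, ys, k=10, seed=0):
    """Return (total misclassifications, error rate) summed over k held-out folds."""
    folds = k_fold_indices(len(xs), k, seed)
    errors = 0
    for held_out in folds:
        held = set(held_out)
        # Build the model on the other k-1 folds only
        train_x = [x for i, x in enumerate(xs) if i not in held]
        train_y = [y for i, y in enumerate(ys) if i not in held]
        model = train_nearest_mean(train_x, train_y)
        # Evaluate on the fold that was set aside
        errors += sum(predict_nearest_mean(model, xs[i]) != ys[i] for i in held_out)
    return errors, errors / len(xs)

# Synthetic data (hypothetical): two well-separated 1-D classes
rng = random.Random(42)
xs = [rng.gauss(0.0, 1.0) for _ in range(100)] + [rng.gauss(10.0, 1.0) for _ in range(100)]
ys = ["a"] * 100 + ["b"] * 100

total_errors, error_rate = cross_validate(xs, ys, k=10)
print(total_errors, error_rate)
```

Each object appears in exactly one held-out fold, so the summed error counts every object once and never evaluates a model on data it was built from.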
For clustering approaches, measures such as the sum of squared errors (SSE), cohesion, separation and the Silhouette coefficient can be used, as well as the proximity matrix of the input data. In addition, care has to be taken to determine the correct number of clusters, as this will influence the evaluation criteria. The above measures are criteria for single models only. Another level of comparison is added by comparing pairs of models built from different datasets.
Step 5: Deployment
This is the final step, reached only after a number of iterations through the previous steps and careful validation of the model. Even if the model is sufficient given the data available at the time of model building, the exercise should be repeated whenever there is a change in the input data.
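The repeated k-means runs mentioned under model building, and the use of SSE and the Silhouette coefficient under model validation, can be sketched as follows. This is a plain-Python illustration under stated assumptions: the function names, the synthetic two-dimensional data, the number of restarts, and the candidate range for the number of clusters are all hypothetical choices for the example, not a recommended configuration.

```python
import math
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_once(points, k, rng, n_iter=50):
    """One k-means run from randomly chosen initial centers; returns (sse, labels)."""
    centers = rng.sample(points, k)
    labels = None
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    sse = sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))
    return sse, labels

def kmeans_best(points, k, n_restarts=10, seed=0):
    """Run k-means n_restarts times and keep the run with the lowest SSE."""
    rng = random.Random(seed)
    return min((kmeans_once(points, k, rng) for _ in range(n_restarts)), key=lambda r: r[0])

def silhouette(points, labels):
    """Mean Silhouette coefficient; points in singleton clusters score 0."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    if len(clusters) < 2:
        return 0.0  # undefined for a single cluster; report 0 in this sketch
    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster (cohesion)
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster (separation)
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != l)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

# Synthetic data (hypothetical): three 2-D blobs of 30 points each
rng = random.Random(1)
blobs = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)) for cx, cy in blobs for _ in range(30)]

results = {}
for k in range(2, 6):
    sse, labels = kmeans_best(points, k, n_restarts=20)
    results[k] = (sse, silhouette(points, labels))
    print(k, round(sse, 1), round(results[k][1], 3))

best_k = max(results, key=lambda k: results[k][1])
print("number of clusters with the highest Silhouette coefficient:", best_k)
```

SSE tends to decrease as more clusters are allowed, so in this sketch it is used only to compare restarts at a fixed number of clusters, while the Silhouette coefficient is used to compare different numbers of clusters.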
 NickBall  05 Sep 2010

 
 SabineMcConnell  30 Jan 2011
 SabineMcConnell  31 Jan 2011
 SabineMcConnell  11 Feb 2011
