IVOA KDD-IG: A user guide for Data Mining in Astronomy
1: What is 'data mining' and why is it important in astronomy?
Changed:
< < We begin by attempting to summarize what exactly is meant by 'data mining', and why it will be important to a significant fraction of the astronomical community.
> > We begin by attempting to summarize what exactly is meant by 'data mining', and why it will be important to a significant fraction of the astronomical community. This section is deliberately kept quite short.
A Description of the SKA Project
From the talk of Ray Norris [1] at Astroinformatics 2010, June 16-19 in Pasadena, California:
Changed:
< < Sounds rather challenging, and it is.
> > Currently, astronomers and their software have problems even handling gigabyte-sized files, and this project will produce gigabytes of data per second.
Changed:
< < Unfortunately, the above is not describing the SKA project. It is describing the ASKAP pathfinder. The final project will be 100-1000 times larger than this.
> > Unfortunately, the above is not describing the SKA project. It is describing the ASKAP pathfinder. The final project and datasets will be 100-1000 times larger. In other words, 100-1000 times larger than a dataset that astronomers already cannot handle. Clearly, there is work to be done.
Deleted:
< < Clearly, there is work to do here.
What is 'Data Mining'?
Changed:
< < The exact meaning of 'data mining' is somewhat nebulous and subject to debate. However, in its broadest sense, it is the process of extracting useful information from a set of data. From this Interest Group's front page:
> > The exact meaning of 'data mining' is somewhat nebulous and subject to debate. However, in its broadest sense, it is the process of extracting useful information from a set of data. From the Interest Group's front page:
"...Data mining, or KDD, is the semi-automatic discovery of patterns, associations, changes, anomalies, and statistically significant structures and events in data. In other words, traditional data analysis is assumption driven as a hypothesis is formed and validated against the data. Data mining, in contrast, is discovery driven as the patterns are automatically extracted from data...."
This does not imply that data mining is incompatible with the traditional scientific method (form a hypothesis, test it against experiment and observation, and modify it accordingly), only that it differs from what has gone before.
The approach is different from the more traditional interplay between theory and observation, because the amount of useful information present in already-available datasets potentially far exceeds that which has presently been extracted. It is thus a logical and justifiable extension of the traditional hypothesis-driven analysis of data to not only search the data for patterns with a hypothesis in mind, but to also look for unknown patterns as an intrinsically useful exercise in itself. These new patterns suggest new hypotheses, which can lead to new discoveries.
Due to the size and complexity of modern data (sheer numbers of data points, intrinsic dimensionality, and so on), simply finding and describing or visualizing the patterns can be extremely non-trivial. Data mining encompasses the methods used to make the finding of useful information possible.
Why is it Important for Astronomy?
Data mining is important for astronomy because:
The Fourth Paradigm
Data mining has been described as the Fourth Paradigm of science [2]. The First Paradigm was empirical observation, via description or experiment. The Second Paradigm was theory, the formulation of models to explain the observations. The observations test the theories, which in turn are modified. The Third Paradigm, which has become prevalent in the last 60 years or so, is the use of computers to construct models (i.e., simulations) that are too complex to be calculated or described more simply. The Fourth Paradigm that is now emerging is somewhat analogous to the Third, except that, instead of complex models, it deals with an exponentially increasing and complex set of available data. These data contain much useful information but are impossible to analyse in anything other than an automated way. The complexity arises not simply from the amount of data, but also from its often extremely high intrinsic dimensionality, which makes it impossible to describe or visualize easily in any conventional way. Astronomy is one of many fields that are subject to this exponential data increase.

Knowledge Discovery in Databases
Generally, contemporary astronomical data are stored in the form of machine-readable files. Hence, the data mining process described in this guide will be the process of Knowledge Discovery in Databases (KDD), from which the IVOA interest group takes its name. In fact, most astronomical data, while online, are not in proper databases, but simply available in some form. Nevertheless, the end product, computerized data describing astronomical phenomena, is the same in principle, so in practice the term KDD is appropriate. One can think of the 'database' in KDD in its broadest sense.

Computational-X and X-informatics
The KDD approach has become prevalent in many scientific fields in the past decade or two, and astronomy is part of this broader context. Both the Third and Fourth Paradigms are represented, in the form of computational-X and X-informatics, where X is a scientific discipline. While astronomy has traditionally been advanced in many technological areas, in database technology it lags well behind the commercial sector, and in informatics, bio- and geoinformatics are considerably more advanced. It is the aim of the KDD-IG and this guide to begin to redress this imbalance.

Astroinformatics
Following the informatics idea, the term Astroinformatics was coined in 2009 to concisely describe the emerging use of informatics techniques within the subject. The idea and the word are thus quite new (for example, the word 'astroinformatics' does not appear in the recently completed Astro2010 decadal report from the US community).

What Data Mining is Not
Changed:
< < Finally, a word on what data mining is not. This guide is not aiming to uncritically champion the approach, but to point out the advantages and the possibilities that will enable better science. We thus summarize some of the limitations of data mining:
> > We also include some comments on what data mining is not. It is not the purpose of this guide to uncritically champion the approach, but to point out the advantages and the possibilities that will enable better science.
References
Changed:
< < -- NickBall - 05 Sep 2010
> > -- NickBall - 07 Jan 2011