IVOA KDD-IG: A user guide for Data Mining in Astronomy

1: What is 'data mining' and why is it important in astronomy?

We begin by attempting to summarize what exactly is meant by 'data mining', and why it will become increasingly important to the significant fraction of the astronomical community who will be involved in data-intensive astronomy. To aid clarity, this section is deliberately kept quite concise.

A Description of the SKA Project

From the paper 'Data Challenges for Next-generation Radio Telescopes' [1]:

  • 10 terabytes s⁻¹ data from the antenna
  • 40 gigabytes s⁻¹ from correlator to processor
  • 750 teraflops to 1 petaflops computing power
  • 10 MW power
  • 70 petabytes yr⁻¹ data products
  • Writing 12 h of data to disk will take 8 h at 10 gigabytes s⁻¹

Currently, astronomers and their software struggle to handle even gigabyte-sized files, yet this project will produce gigabytes of data per second.
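To put the quoted figures on a common scale, the following sketch converts the 70 petabytes per year of data products into an average rate in gigabytes per second. It assumes decimal (SI) prefixes and a 365.25-day year, neither of which is stated in [1], and the function name is purely illustrative.

```python
# Illustrative arithmetic only: assumes SI prefixes (1 PB = 1e15 bytes,
# 1 GB = 1e9 bytes) and a year of 365.25 days. Neither is given in [1].
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def petabytes_per_year_to_gigabytes_per_second(pb_per_year):
    """Convert a yearly data volume in PB to an average rate in GB/s."""
    bytes_per_second = pb_per_year * 1e15 / SECONDS_PER_YEAR
    return bytes_per_second / 1e9


average_rate = petabytes_per_year_to_gigabytes_per_second(70)
print(f"{average_rate:.1f} GB/s")
```

Under these assumptions, the archived data products alone average a little over 2 gigabytes per second, sustained for the whole year.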

Unfortunately, the above does not describe the SKA project. It describes the ASKAP pathfinder. The final project and its datasets will be 100 times larger; in other words, 100 times larger than a dataset that astronomers already cannot handle. Clearly, the ability to move to a regime in which we can handle these data is desirable.

What is 'Data Mining'?

The exact meaning of 'data mining' is somewhat nebulous and subject to debate. However, in its broadest sense, it is the process of extracting useful information from a set of data. From the Interest Group's front page:

"...Data mining, or KDD, is the semi-automatic discovery of patterns, associations, changes, anomalies, and statistically significant structures and events in data. In other words, traditional data analysis is assumption driven as a hypothesis is formed and validated against the data. Data mining, in contrast, is discovery driven as the patterns are automatically extracted from data...."

This is not meant to imply that data mining is incompatible with the traditional scientific method (form a hypothesis, test it by reference to experiment and observation, and modify it accordingly), only that it differs from what has gone before.

The approach is different from the more traditional interplay between theory and observation, because the amount of useful information present in already-available datasets potentially far exceeds that which has so far been extracted. It is thus a logical and justifiable extension of traditional hypothesis-driven analysis not only to search the data for patterns with a hypothesis in mind, but also to look for unknown patterns as a useful exercise in itself. These new patterns suggest new hypotheses, which can lead to new discoveries.

Due to the size and complexity of modern data (sheer numbers of data points, intrinsic dimensionality, and so on), simply finding and describing or visualizing the patterns can be far from trivial. Data mining encompasses the methods that make finding this useful information possible.
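As a toy illustration of looking for patterns without a prior hypothesis, the sketch below groups unlabelled values using k-means clustering (Lloyd's algorithm) in one dimension. The choice of k-means, the invented 'colour' values, and the function name are all assumptions made for illustration; this guide does not prescribe a particular method here.

```python
# Toy sketch of discovery-driven analysis: find groups in unlabelled data.
# The values below are invented, and k-means is only one possible method.

def kmeans_1d(values, centres, iterations=20):
    """Lloyd's algorithm in one dimension, starting from the given centres."""
    centres = list(centres)
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest centre.
        groups = [[] for _ in centres]
        for v in values:
            nearest = min(range(len(centres)), key=lambda i: abs(v - centres[i]))
            groups[nearest].append(v)
        # Update step: move each centre to the mean of its group
        # (an empty group leaves its centre where it was).
        centres = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
    return centres, groups


# Hypothetical measurements with no labels attached.
colours = [0.31, 0.35, 0.29, 0.40, 1.12, 1.05, 1.20, 0.33, 1.15, 1.08]
centres, groups = kmeans_1d(colours, centres=[min(colours), max(colours)])
print(centres)
print(groups)
```

The algorithm is told only how many groups to look for, not where they lie. Whether the two groups it returns mean anything physically is a question that the algorithm cannot answer.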

Why is it Important for Astronomy?

Data mining is important for astronomy because:

  • It allows one to do better science with given data
  • Handling future large astronomical datasets will be intractable without it

In other words, the techniques of data mining will not merely be important, they will be unavoidable. They will also be cheap: mining a large database for new science is far less expensive than building the hardware to generate more data to populate a new one. While maintaining the databases is not free, the science return per dollar, euro, etc. will be significant.

The Fourth Paradigm

Data mining has been described as the Fourth Paradigm of science [2].

The First Paradigm was empirical observation, via description or experiment. This was followed by the Second Paradigm, theory: the formulation of models to explain the observations. The observations test the theories, which in turn are modified. The Third Paradigm, which has become prevalent in the last 60 years or so, is the use of computers to construct models (i.e., simulations) that are too complex to be calculated or described more simply.

The Fourth Paradigm that is now emerging is somewhat analogous to the Third, except that, instead of complex models, the challenge lies in dealing with an exponentially increasing and complex set of available data. These data contain much useful information but are impossible to analyse in anything other than an automated way. The complexity arises not simply from the amount of data, but also from its often extremely high intrinsic dimensionality, which makes it impossible to describe or visualize easily in any conventional way.

Astronomy is one of many fields that are subject to this exponential data increase.

Knowledge Discovery in Databases

Generally, contemporary astronomical data are stored in the form of machine-readable files. Hence, the data mining process described in this guide will be the process of Knowledge Discovery in Databases (KDD), from which the IVOA interest group takes its name.

In fact, most astronomical data, while online, are not in proper databases, but are simply available in some form. Nevertheless, the product obtained, computerized data describing astronomical phenomena, is the same in principle, so in practice the term KDD is correct. One can think of the 'database' in KDD in its broadest sense.

Computational-X and X-informatics

The KDD approach has become prevalent in many scientific fields in the past decade or two, and astronomy is part of this broader context. Both the Third and Fourth Paradigms are represented, in the form of computational-X and X-informatics, where X is a scientific discipline.

While astronomy has traditionally been advanced in many technological areas, in database technology it lags well behind the commercial sector, and in informatics, bio- and geoinformatics (Earth observation) are considerably more advanced. It is the aim of the KDD-IG and this guide to move toward redressing this imbalance.

Astroinformatics

Following the informatics idea, the term Astroinformatics was coined in 2009 to concisely describe the emerging use of informatics techniques within the subject.

The idea and the word are thus quite new (for example, the word 'astroinformatics' does not appear in the recently completed Astro2010 decadal report from the US community).

What Data Mining is Not

We conclude this section with a few words on what data mining is not. The ultimate purpose of this guide is to provide enlightenment that enables better science to be done. Data mining is simply a tool to that end. Therefore, it is not the purpose of this guide to uncritically champion the approach, but to point out its advantages, disadvantages, and possibilities.

  • Data mining is not a substitute for thinking. The patterns searched for and found must still be meaningfully interpreted. There will be far more non-useful patterns than useful ones.

  • It is not a 'black box', incompatible with the construction of useful physical theory, nor is it unreproducible. Although some methods (e.g., standard artificial neural networks, K-nearest neighbour) enable predictions without producing a human-comprehensible description of the data, others, such as decision trees or association rules, do produce such descriptions. Explaining these models remains the job of the theorist.

  • It is not a remedy for bad data. Often, the range of wavebands available, the quality of the data, their correct preparation, and the accounting for systematic effects will have a far greater effect on the outcome than the exact choice of data mining algorithm(s) employed.

  • It is not the work of a moment. Data mining is an entire subject in itself, and a large number of new techniques continue to appear. While there is a much smaller subset of well-known and useful techniques that can be readily applied to give useful science results, their use still entails some learning.

  • Nevertheless, it is not something to simply leave to a specialist with no knowledge of astronomy. A given data mining project must be science-driven, and only the astronomer can ask the best questions, and know the quirks of the data. The astronomer must therefore have sufficient expertise to be able to carry out the project, or at the very least communicate effectively with those engaged with the databases themselves.
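The contrast drawn above between methods that yield a human-comprehensible model and those that only predict can be sketched with a toy example. The code below fits a one-level decision tree (a 'decision stump') and a 1-nearest-neighbour classifier to the same invented data. The measurements, labels, and function names are hypothetical, and real applications would normally use an established library rather than hand-written code like this.

```python
# Toy contrast between an interpretable model and an opaque one.
# The data, labels and function names are hypothetical.

def fit_stump(xs, labels):
    """One-level decision tree: pick the threshold with the fewest errors."""
    best = None
    ordered = sorted(set(xs))
    for low, high in zip(ordered, ordered[1:]):
        threshold = (low + high) / 2
        # Try both possible labellings of the two sides of the threshold.
        for below, above in (("star", "galaxy"), ("galaxy", "star")):
            errors = sum((below if x <= threshold else above) != y
                         for x, y in zip(xs, labels))
            if best is None or errors < best[0]:
                best = (errors, threshold, below, above)
    _, threshold, below, above = best
    return threshold, below, above


def nearest_neighbour(xs, labels, query):
    """1-nearest-neighbour: predicts, but yields no rule to inspect."""
    index = min(range(len(xs)), key=lambda i: abs(xs[i] - query))
    return labels[index]


# Invented apparent sizes for eight already-classified objects.
sizes = [0.9, 1.0, 1.1, 1.2, 2.4, 2.8, 3.1, 3.5]
labels = ["star"] * 4 + ["galaxy"] * 4

threshold, below, above = fit_stump(sizes, labels)
rule = f"if size <= {threshold:.1f} then {below} else {above}"
prediction = nearest_neighbour(sizes, labels, 2.0)
print(rule)
print(prediction)
```

The stump returns a rule that can be read, questioned, and compared with physical expectation, whereas the nearest-neighbour classifier returns only a label for each query. Both can be useful; which is appropriate depends on whether a prediction or an explanation is wanted.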

References

  1. Norris, R.P., 'Data Challenges for Next-generation Radio Telescopes', accepted for Proceedings of the 2010 Sixth IEEE International Conference on e-Science, arXiv:1101.1355
  2. Hey, T., Tansley, S. and Tolle, K. (eds.), The Fourth Paradigm: Data-Intensive Scientific Discovery (Microsoft Research, Redmond, WA, 2009)


-- NickBall - 19 Mar 2011
-- NickBall - 21 Sep 2011


Topic revision: r7 - 2011-10-03 - NickBall
 