
IVOA KDD-IG: A user guide for Data Mining in Astronomy




Who's interested?

NickBall
RaffaeleDAbrusco


Draft table of contents

Each heading (1, 2, ...) will link to its own page. Items in square brackets are examples of possible content to include.

  • 1: What is 'data mining' and why is it important in astronomy?

We begin by attempting to summarize what exactly is meant by 'data mining', and why it will be important to a significant fraction of the astronomical community.

The two most important points are:

    • It allows one to do better science with given data
    • Handling upcoming large astronomical datasets will be intractable without it

[Astroinformatics in context: The 'fourth paradigm', X-informatics and computational-X. Bio- geo- etc.]

  • 2: Examples of improved science enabled by data mining techniques

We describe some examples where data mining techniques allowed improved science results.

[Published science results are best; indirect examples include improved object detection, object classification, and photometric redshifts, but these are not really convincing in their own right.]

  • 3: Overview of the data mining process

Here we elucidate common steps in the data mining process, from raw data to science result. While each particular case will be unique, and driven by the particular science question being addressed, many of the issues encountered are common to different pieces of work, and we describe those here.

    • Data collection
    • Data preprocessing
    • Attribute selection [Incl. dimension reduction]
    • Selection of algorithm
    • Improving results
    • Algorithm application and limitations
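As a rough illustration of how these steps fit together, here is a minimal sketch in Python using only the standard library. Everything in it is an assumption made for illustration: the synthetic two-class "catalog", the function names (`collect`, `standardize`, `select_attributes`, `knn_predict`, `run_pipeline`), the variance threshold, and the choice of a k nearest neighbor classifier. A real analysis would substitute survey data and whichever algorithm suits the science question.

```python
import math
import random


def collect(n=200, seed=0):
    """Data collection: a synthetic two-class 'catalog' (stand-in for real data)."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        label = i % 2
        # Attributes 0 and 1 separate the classes; attribute 2 is constant
        # and therefore carries no information.
        x0 = rng.gauss(2.0 * label, 0.5)
        x1 = rng.gauss(-2.0 * label, 0.5)
        rows.append(([x0, x1, 1.0], label))
    rng.shuffle(rows)
    return rows


def standardize(features):
    """Data preprocessing: rescale each attribute to zero mean and unit variance."""
    cols = list(zip(*features))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c))
            for c, m in zip(cols, means)]
    return [[(v - m) / s if s > 0 else 0.0
             for v, m, s in zip(row, means, stds)]
            for row in features]


def select_attributes(features, min_var=1e-12):
    """Attribute selection: drop attributes whose variance is (almost) zero."""
    keep = []
    for j, col in enumerate(zip(*features)):
        mean = sum(col) / len(col)
        if sum((v - mean) ** 2 for v in col) / len(col) > min_var:
            keep.append(j)
    return keep, [[row[j] for j in keep] for row in features]


def knn_predict(train_x, train_y, query, k=5):
    """Algorithm: majority vote among the k nearest training objects."""
    nearest = sorted(range(len(train_x)),
                     key=lambda i: math.dist(train_x[i], query))[:k]
    votes = sum(train_y[i] for i in nearest)
    return 1 if 2 * votes > k else 0


def run_pipeline():
    rows = collect()
    features = [f for f, _ in rows]
    labels = [y for _, y in rows]
    # For brevity the scaling uses all objects; a careful analysis would
    # derive it from the training set only.
    features = standardize(features)
    kept, features = select_attributes(features)
    # Hold out 30% of the objects to estimate how well the classifier works.
    split = int(0.7 * len(features))
    train_x, test_x = features[:split], features[split:]
    train_y, test_y = labels[:split], labels[split:]
    predicted = [knn_predict(train_x, train_y, q) for q in test_x]
    accuracy = sum(p == t for p, t in zip(predicted, test_y)) / len(test_y)
    return kept, accuracy


if __name__ == "__main__":
    kept, accuracy = run_pipeline()
    print("attributes kept:", kept)
    print("hold-out accuracy:", accuracy)
```

The hold-out accuracy is the natural starting point for the "improving results" step: changing k, the attributes kept, or the algorithm itself and re-measuring on held-out objects shows whether a change actually helps.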

  • 4: The main data mining algorithms

One of the aims of the KDD-IG is to build up an inventory of data mining algorithms that are of use to astronomy. (The current list is here [link].) We don't attempt to duplicate that here, but instead provide descriptions of some of the most well-known data mining algorithms, many of which have been fairly extensively used in astronomy.

    • Artificial neural network
    • Decision tree
    • Genetic algorithms
    • k-nearest neighbor
    • k-means clustering
    • Kernel density estimation
    • Kohonen self-organizing map
    • Independent component analysis
    • Mixture models and EM algorithm
    • Support vector machine
    • Bayesian algorithms
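To give a flavor of what one of these descriptions might be accompanied by, here is a minimal sketch of k-means clustering (Lloyd's algorithm) in standard-library Python. The function names (`assign`, `kmeans`, `demo`), the iteration cap, the random initialization from the data points, and the synthetic two-blob data are all illustrative assumptions, not part of any KDD-IG inventory.

```python
import math
import random


def assign(points, centers):
    """Assignment step: index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points]


def kmeans(points, k, iterations=50, seed=0):
    """Lloyd's algorithm: alternate assignment and center-update steps."""
    rng = random.Random(seed)
    # Initialize the centers with k distinct data points chosen at random.
    centers = rng.sample(points, k)
    for _ in range(iterations):
        labels = assign(points, centers)
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                # Update step: move the center to the mean of its members.
                new_centers.append(
                    tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                # An empty cluster keeps its previous center.
                new_centers.append(centers[c])
        if new_centers == centers:
            break  # No center moved, so the assignments are stable.
        centers = new_centers
    return centers, assign(points, centers)


def demo():
    """Cluster two well-separated synthetic blobs of 50 points each."""
    rng = random.Random(1)
    blob_a = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)]
    blob_b = [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(50)]
    return kmeans(blob_a + blob_b, k=2)


if __name__ == "__main__":
    centers, labels = demo()
    print("centers:", centers)
```

Note that k-means needs the number of clusters k to be chosen in advance and can converge to different solutions from different starting centers, which is the kind of practical limitation the algorithm descriptions should spell out.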

  • 5: Which algorithm to use?

Unfortunately, there is no simple answer, because the differing characteristics of the algorithms render them more or less suitable for different problems. The data mining community has produced much literature that, for example, compares different algorithms on a specific dataset, or examines their theoretical properties in idealized (read: unrealistic) situations, but there is much less available to help make a practical choice with real data. We attempt to remedy that here by comparing and contrasting the characteristics of some commonly used algorithms.

[Comparison table: e.g., algorithm, advantages, disadvantages]

  • 6: Present and future directions

The combination of abundant data mining algorithms, advancing technology, large amounts of new astronomical data that continuously open up new regions of parameter space, and the resulting large number of newly addressable science questions means that several interesting new directions for data mining in astronomy are opening up in the near term.

    • The time domain [Markov models, etc.]
    • Graphics processing units [CUDA, code types amenable to speedup]
    • Parallel/distributed data mining [Clock speed -> more cores, code has to be rewritten]
    • Visualization [High dimensionality]
    • The VO [Standardized data access]
    • Semantics [e.g. MG's Semantics and Data Mining IVOA talk]
    • Clouds

  • 7: Algorithms and techniques astronomers could benefit from but don't use

There are a large number of algorithms and approaches that are well known to computer scientists and statisticians but are little used or unused in astronomy. There is much potential for novel work through collaboration between these three subject areas.

  • 8: Links: websites, books

There are of course a huge number of websites and books about data mining. This section aims to point to some of those that are most useful for astronomy.

[Elements of Statistical Learning, mine and Kirk's DM reviews, 4th paradigm book, ...]

  • 9: Worked example

Illustrate raw-data-to-science from an existing paper.

-- NickBall - 04 Aug 2010


Topic revision: r5 - 2010-08-05 - NinanSajeethPhilip
 