IVOA KDD-IG: A user guide for Data Mining in Astronomy

6: Present and future directions

The combination of an abundance of available data mining algorithms, advancing technology, large amounts of new astronomical data continuously opening up new regions of parameter space, and the consequent large number of newly addressable science questions, means that several interesting new directions for data mining in astronomy are opening up in the near-term future.

Several reviews also discuss future directions, including the topics below in more detail, e.g., Ball & Brunner (2010), Borne (2009), or Pesenson et al. (2010), all linked in section 8 of the guide. This section aims to be fairly generic, not a detailed literature review, so long lists of links and references are avoided.

This section assumes some basic knowledge of both astronomy and data mining, so astronomers without a data mining background will likely find it useful to read the earlier sections first. We also assume a basic knowledge of computing, e.g., what a CPU is. Material in different sections inevitably overlaps, and making each section self-contained would lengthen them considerably, so we do not attempt this.


Astrostatistics

Although the interaction of astronomy and statistics has a history going back centuries, because the large number of observable objects makes astronomy an intrinsically statistical subject, modern astrostatistics is necessitated not only by the large number of objects, but also by the need for the sophistication to deal appropriately with data that are large, high dimensional, and subject to numerous real-world issues.

There has also been a long-running debate in statistics between the frequentist and Bayesian approaches. Until the advent of fast and cheap computing, the Bayesian approach was somewhat limited by the computation time needed to evaluate the integrals. That constraint has now largely gone away, and packages such as those in the widely used R language make Bayesian analyses more straightforward to perform. The frequentist vs. Bayesian debate has also largely subsided, and the general consensus is that one should use whichever approach is more appropriate to the problem.

The Bayesian approach has many advantages, because one can rigorously incorporate prior knowledge. Indeed, some analyses are bringing astronomy results that were thought to be reasonably well established back into question. But there are also disadvantages, an obvious one being that the results can depend strongly on the prior that is put in.
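The dependence on the prior can be illustrated with a minimal sketch. It assumes the simplest conjugate case, a Beta prior on a fraction updated with binomial counts; the counts and prior parameters are hypothetical and chosen only for illustration.

```python
def posterior_mean(successes, trials, prior_a, prior_b):
    """Posterior mean of a fraction under a Beta(prior_a, prior_b) prior
    after observing `successes` out of `trials` (Beta-binomial update)."""
    return (prior_a + successes) / (prior_a + prior_b + trials)

# Hypothetical small sample: 3 of 10 objects belong to the class of interest.
flat = posterior_mean(3, 10, 1.0, 1.0)          # flat prior
informative = posterior_mean(3, 10, 20.0, 2.0)  # prior favouring a high fraction

# The same fraction observed in a sample 100 times larger.
flat_big = posterior_mean(300, 1000, 1.0, 1.0)
informative_big = posterior_mean(300, 1000, 20.0, 2.0)

print(flat, informative)          # the two priors disagree strongly
print(flat_big, informative_big)  # the data now dominate the prior
```

With little data the two priors give very different answers. With abundant data they nearly agree, which is one reason the choice of prior matters most in the sparse regime.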

Some issues that astronomy data are subject to are:

  • Large, complex, increasingly high-dimensional, time domain
  • Missing data: non-observation or non-detection
  • Heteroscedastic, non-Gaussian, underestimated errors
  • Outliers, artifacts, false detections, systematic effects
  • Correlated inputs

It is beyond the scope of this guide to discuss astrostatistics in more detail than this. The author is not aware of an equivalent 'statistics in astronomy' guide to this one, but simple web searches, or the links in section 8, will reveal starting points. The Cosmology and Machine Learning Uninstitute is somewhat similar to this guide, and has many links to the statistical cosmology and other literature, which has a strong Bayesian emphasis (because we only observe one universe).

It is likely that many of the truly novel projects done in astronomy (and in other sciences) will continue to result from cross-disciplinary collaboration. Whether the collaboration is with statisticians, computer scientists, or others, subjects such as astrostatistics will continue to thrive and grow. Of particular note is that statisticians like astronomy data: they are abundant, contain lots of interesting and complex information that challenges the boundaries of their subject, and are also worthless (i.e., they do not have large commercial value, and are generally freely accessible). But most good science will still need the guidance of astronomers to ask the right questions, and to know when results make sense.

Cloud Computing

The idea of storage, processing, and distribution of data as an available resource akin to other utilities such as electricity or water is an attractive one for astronomers and their institutions. If one can delegate the routine maintenance of hardware, backup of data, and so on, to a professional centre, then money, physical space, and time are freed up for improved research. One could, in principle, work from anywhere that has a computer screen and Internet access.

Currently, cloud computing in environments such as the Amazon EC2 cloud shows promise, but remains fairly little used in astronomy. For large data, such environments are expensive. They are also not developed with scientists in mind as the primary users, which subjects them to issues of usability, long-term stability, etc., similar to those of other rapidly developing technologies such as GPUs (see below).

Issues include:

  • Expense, typically significantly more than running the hardware oneself, if that is feasible
  • Use is often made of virtual machines, which, while abstracting the hardware, may perform less well than a regular local machine
  • Transfer of data to and from the cloud
  • Software licenses may be valid only at one's own site
  • Proprietary data are now held offsite
  • Working on one cloud may make it difficult to transfer to another

More specialized resources are available. The CANFAR project in Victoria, Canada, is designed to provide cloud computing for astronomers, combined with the job scheduling abilities of a supercomputing cluster. In CANFAR's setup, one runs one or more virtual machines (VMs), in the same way as managing a desktop, but has access via Condor to the batch processing power of several hundred (and growing) processor cores on computing nodes. The VMs run Linux, and one can install one's usual astronomy software on a VM, without alteration. The Condor batch script invokes the software using whatever command and arguments are used in its regular use. The main downside to CANFAR is that it is not designed for jobs in which the operations at one node depend on those at another, such as would be the case with MPI on a distributed memory machine or OpenMP on a shared memory machine. Rather, it is designed for simple parallel processing.

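Currently, whether the cloud is suitable depends on the application: it fits best when the data volumes to be transferred are modest and the tasks are independent of each other, and less well when the data are very large or the processing is tightly coupled.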

The Curse of Dimensionality

A well-known problem in data mining is the 'curse of dimensionality'. Because modern astronomical detectors not only detect large numbers of objects, but also enable a lot of parameters (perhaps hundreds) to be measured for each one, many of which are of physical interest, astronomical data are increasingly subject to this problem. One can still make plots, but when there are N parameters of interest, making all N(N-1)/2 pairwise plots (i.e., 'N choose 2') may be time consuming. One may reduce this multiplicity somewhat by the application of judicious physical thinking, but this is not always possible.

Aspects of the curse of dimensionality are:

  • The more dimensions there are, the larger the fraction of each dimension's range that is required to cover a given fraction of the volume - most of the space is near a 'corner'. In a ten-dimensional unit cube, one has to span about 80% of each edge to cover 10% of the volume. Distances, such as Euclidean, also become increasingly similar to one another.
  • No matter how many data points one has observed, the data rapidly become sparse in high dimensions: the volume becomes exponentially large, and most of the space is empty.
  • The problem also occurs in Bayesian inference.
  • One can reduce the number of dimensions in many ways, such as principal component analysis, or its more sophisticated nonlinear variants, but what each component represents may not be immediately clear, either physically, or in relation to a model or simulation.
  • The effects can also be mitigated by using sampling methods such as Markov Chain Monte Carlo.
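The geometric points above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names, sample sizes, and random seed are illustrative assumptions, not part of any astronomy package.

```python
import math
import random

def n_pairwise_plots(n_parameters):
    """Number of distinct two-parameter scatter plots, N choose 2."""
    return math.comb(n_parameters, 2)

def edge_fraction(volume_fraction, dims):
    """Fraction of each edge of a unit hypercube that a sub-cube must span
    in order to enclose the given fraction of the total volume."""
    return volume_fraction ** (1.0 / dims)

def relative_contrast(dims, n_points=200, seed=0):
    """(farthest - nearest) / nearest Euclidean distance from a random query
    point to uniformly distributed random points in the unit hypercube."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dims)] for _ in range(n_points)]
    query = [rng.random() for _ in range(dims)]
    dists = [math.dist(query, p) for p in points]
    return (max(dists) - min(dists)) / min(dists)

print(n_pairwise_plots(100))               # plots needed for 100 parameters
print(edge_fraction(0.1, 10))              # about 0.79 in ten dimensions
print(relative_contrast(2), relative_contrast(500))
```

The relative contrast shrinks as the number of dimensions grows, which is the sense in which distances become increasingly similar: the nearest and farthest neighbours of a point are almost equally far away.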

What is not necessarily clear, is whether most astrophysical processes are intrinsically high dimensional, or whether we are just measuring a lot of parameters, that can still be explained by a combination of simpler processes. E.g., one could measure dozens of parameters for stars, but they might all be explained by the Hertzsprung-Russell diagram.

However, the fact that using, for example, probability density functions gives significantly improved measures of distance via photometric redshifts, compared to a simpler approach on the exact same data, suggests that, in general, better signals will be extracted from higher dimensional information. Also, while the underlying physical laws may indeed be simple, complex emergent processes that are not yet understood, such as star or galaxy formation, will in general require high-dimensional observations, equally complex simulations, and comparison between the two.

Graphical Processing Units, and Other Novel Hardware

From circa 1965-2005, computer processor speeds doubled approximately every 18 months. Thus, computer codes simply rode this Moore's law speedup, and ran faster on every succeeding machine. From around 2005, however, heat dissipation rendered further clock speedups impractical, and processors have been increasing in speed by continuing to shrink parts, and containing more processor cores, executing in parallel. This means that to continue to speed up on these processors, codes must be parallelized. However, even this only goes so far, and other forms of hardware provide potentially even greater speedups. Given the exponential increase in data, every possible speedup that still gives the correct answer is desirable.

The most prominent alternative hardware is the graphical processing unit (GPU). Driven by the huge resources of the computer gaming industry, these chips were originally designed to render graphics at high speed, but their ability to process vector datatypes at a much higher speed than a regular CPU has rendered them useful for more general applications. More recently, they have also been able to deal with floating point datatypes, first in single precision, and now in the double precision that makes them useful across the range of scientific applications. These chips are known as general purpose GPUs, or GPGPUs.

The downside of GPUs is that the code must be ported to run on them. This may be as simple as wrapping the appropriate single function, using an environment designed for GPU programming such as CUDA or OpenCL, but in other cases, the algorithm itself may need to be changed, which means rewriting the code. For codes that are part of a large pipeline, or have a large user base within the community, this means one loses the feedback, updates and support of those other users.

Several papers have appeared demonstrating speedups of astronomy code, from a few x, to over 100x. At a factor of 100, code that took a week runs in under two hours.

Not every algorithm is suitable for GPU speedup. Desirable characteristics for an algorithm include:

  • Ability for massive parallelism: lots of subsections of the code are independent of each other
  • Locality of memory reference: threads close to each other in the code access similarly close together memory locations
  • Minimize threads conditionally executed by the portion of the code that is on the GPU
  • Maximize the number of mathematical operations per item of data, including by changing the algorithm
  • Minimize the data transfer to and from the GPU
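The last two points in the list can be illustrated with a deliberately crude timing model. It is a sketch under stated assumptions: the kernel speedup, data volume, and bus bandwidth are hypothetical inputs, and real performance depends on many factors this model ignores (overlap of transfer and compute, memory access patterns, thread divergence).

```python
def offload_speedup(cpu_seconds, kernel_speedup, bytes_moved, bus_bytes_per_second):
    """Overall speedup from moving a computation to a GPU, when the data must
    first be copied over the bus and the copy is not overlapped with compute."""
    transfer_seconds = bytes_moved / bus_bytes_per_second
    gpu_seconds = cpu_seconds / kernel_speedup + transfer_seconds
    return cpu_seconds / gpu_seconds

# Hypothetical case 1: many operations per byte (100 s of CPU work on 8 GB).
print(offload_speedup(100.0, 100.0, 8e9, 8e9))

# Hypothetical case 2: few operations per byte (2 s of CPU work on the same 8 GB).
print(offload_speedup(2.0, 100.0, 8e9, 8e9))
```

In the second case the transfer time dominates, so a kernel that is 100 times faster yields less than a factor of two overall. This is why maximizing the operations per item of data, and minimizing transfers, both matter.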

A useful guide for astronomers considering speeding up their codes is Barsdell et al., MNRAS 408 1936 (2010). They indicate that one would expect algorithms such as N-body and semi-analytic simulations, source extraction, and machine learning, to be highly amenable to speedup, perhaps by a factor of O(100), and others, such as smoothed particle hydrodynamics, adaptive mesh, stacking, and selection by criteria, by a factor of O(10).

Other novel hardware besides GPUs includes the Field Programmable Gate Array (FPGA), long used as specialized hardware (e.g., in radio telescope correlators), but now able to be programmed more generally to instantiate in hardware a particular function, such as a trained neural network, and then be reprogrammed to a different function as needed. The programming can be done in variants of, e.g., C. There is also the IBM Cell processor, used in places including the petascale Roadrunner machine that has performed cosmological simulations. And next year, there will be the Intel MIC architecture, which aims to provide speedup with a friendlier programming environment than GPUs.

Parallel and Distributed Data Mining

As datasets become increasingly large, moving the data is becoming an increasing problem. Despite novel projects such as Globus Online, the fastest way to move a large dataset (e.g., 500 TB) across the country is to ship the disks physically, e.g., by FedEx.
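A back-of-the-envelope calculation shows why. This sketch assumes a sustained link speed, which is a hypothetical input; real transfers rarely achieve the nominal rate, so the `efficiency` parameter is included as an assumption the reader can adjust.

```python
def transfer_days(n_bytes, link_bits_per_second, efficiency=1.0):
    """Days needed to move n_bytes over a link of the given nominal speed,
    where efficiency is the fraction of that speed actually sustained."""
    seconds = n_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / 86400.0

dataset_bytes = 500e12  # the 500 TB example from the text

days_1g = transfer_days(dataset_bytes, 1e9)     # 1 Gbit/s link
days_10g = transfer_days(dataset_bytes, 10e9)   # 10 Gbit/s link

print(round(days_1g, 1), round(days_10g, 1))
```

Even at a perfectly sustained 1 Gbit/s the transfer takes about a month and a half, compared to a few days for a courier.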

This means that to apply data mining on a scale that utilizes these data, one must move the code to the data, rather than download the data. Although in many cases most of the science comes from a small fraction of the full data volume, e.g., a catalogue instead of an image, this is not always true, and even then downloading may still not be the best approach.

Even when the data are in one place, however, applying a data mining algorithm to them may result in intractable computing times, because the algorithms often scale as N^2 for N objects, or worse. This may be alleviated by parallelizing the tasks, or using faster versions of the algorithms that scale as N log N or better, e.g., by employing kd-trees. Parallel and distributed data mining has been widely employed in the commercial sector, but so far has been little used in astronomy.
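To make the scaling argument concrete, the following is a minimal pure-Python sketch of a kd-tree nearest-neighbour search alongside the brute-force equivalent. It is a teaching example, not a production implementation: the function names are illustrative, the build step re-sorts at every level, and established libraries offer much faster trees.

```python
import math

def build_kdtree(points, depth=0):
    """Recursively build a kd-tree as nested (point, axis, left, right) tuples,
    splitting on the median along one coordinate axis per level."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (points[mid], axis,
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))

def nearest(node, target, best=None):
    """Return (distance, point) for the tree point closest to target."""
    if node is None:
        return best
    point, axis, left, right = node
    d = math.dist(point, target)
    if best is None or d < best[0]:
        best = (d, point)
    diff = target[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, target, best)
    # The far side can only hold a closer point if the splitting plane
    # is nearer to the target than the best match found so far.
    if abs(diff) < best[0]:
        best = nearest(far, target, best)
    return best

def nearest_brute_force(points, target):
    """Check every point: O(N) per query, hence O(N^2) for N queries."""
    return min((math.dist(p, target), p) for p in points)
```

Once the tree is built, each query visits on average of order log N nodes rather than all N points, which is where the improvement from N^2 to roughly N log N for all-pairs style problems comes from.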

Parallel Programming

The convergence of technologies, from increasing processor cores on the one hand, to increasing generality of highly parallel chips such as GPUs (see above), means that, in general, future codes will be executed as several threads running in parallel. The speedup factor is limited by the portion of the code that is not parallel, so ideally this is minimized.
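The limit set by the non-parallel portion is usually called Amdahl's law. A minimal sketch, assuming an idealized code whose parallel part scales perfectly with the number of workers:

```python
def amdahl_speedup(parallel_fraction, n_workers):
    """Ideal speedup when a fraction of the runtime parallelizes perfectly
    over n_workers and the remainder runs serially."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_workers)

# A code that is 95% parallel can never run more than 20 times faster,
# however many cores are used.
for workers in (1, 10, 100, 10000):
    print(workers, round(amdahl_speedup(0.95, workers), 2))
```

The example fraction of 95% is hypothetical; the point is that the serial 5% caps the speedup at 1/0.05 = 20.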

Parallelization causes numerous potential issues for astronomy codes:

  • Most existing codes are not designed to run in parallel
  • Astronomers are not trained in parallel programming
  • Parallelizing code may be extremely non-trivial, including altering algorithms, where even a reasonably efficient arrangement may not be obvious
  • Debugging parallel code can be difficult, because issues arise not encountered in sequential programming, such as interaction of processes, e.g., the output of one forms the input of another, or they operate on the same data
  • There are many different types of parallelism, e.g., task, data, coarse- and fine-grained

Programming languages such as Python support basic parallel programming, e.g., utilizing several of one's processor cores, but for larger scales there are many dedicated parallel programming languages.

However, a subclass of parallel tasks, those that are independent of each other ('embarrassingly parallel') often occur in astronomical data processing (e.g., classifying objects), and these are relatively straightforward to implement by simply calling the code multiple times.
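As a minimal sketch of the embarrassingly parallel case, the example below uses the process pool from the Python standard library. The `classify` function is a toy stand-in with a hypothetical colour cut, not a real classifier; the worker count and chunk size are likewise illustrative.

```python
from concurrent.futures import ProcessPoolExecutor

def classify(magnitudes):
    """Toy classifier (hypothetical rule): label an object by a single
    colour cut on its (g, r) magnitudes. Each call is independent."""
    g, r = magnitudes
    return "red" if (g - r) > 0.7 else "blue"

def classify_serial(objects):
    """Classify the objects one after another on a single core."""
    return [classify(obj) for obj in objects]

def classify_parallel(objects, workers=4):
    """Classify the objects independently across several processes.
    Results are returned in the same order as the input."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(classify, objects, chunksize=1000))
```

Because no object's result depends on another's, the parallel version needs no communication between workers. When calling `classify_parallel` from a script, the call should sit under an `if __name__ == "__main__":` guard, since some platforms start worker processes by re-importing the script.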

Petascale and Exascale Computing

Most current astronomy is done using files of sizes measured in megabytes, or gigabytes. Many surveys, however, now have dataset sizes in the terascale (10^12 bytes) regime, and some, e.g., Pan-STARRS, have entered the petascale (10^15 bytes) regime. Similarly, most computers have performance measured in gigaflops, but modern supercomputers have performances in the teraflops range, and some have now exceeded 1 petaflops. The trend will continue, and exaflop machines (and exabytes of data) are anticipated by the end of the decade.

Astronomical projects already use petascale machines (e.g., cosmological simulations on Roadrunner at Los Alamos National Lab), several petascale surveys are already running, being developed, or are planned (LSST, etc.), and projects such as the SKA circa 2020 will enter well into the exascale regime.

Petascale and exascale machines have several issues, including:

  • Cost
  • Power consumption
  • Balancing processing with data input/output (I/O)
  • Failure of parts (e.g., among the thousands of disk drives)
  • Scaling software to use hundreds of thousands of processor cores
  • Decrease in memory per processor core

Balancing processing with I/O is important, because typical scientific applications require about 10,000 CPU cycles per byte of I/O, whereas, to avoid being delayed by waiting for the disk, one wants approximately 50,000 cycles. This rule of thumb is one of several useful approximations known as Amdahl's laws. Another states that the ratio of I/O per second to the number of instructions per second (the Amdahl number) should be approximately 1. Applications dominated by I/O thus have numbers far below this, e.g., 10^-5, and this is the case with most supercomputing applications. Recent machines, e.g., GrayWulf at Johns Hopkins and the supernode architecture of the 'Gordon' machine at the San Diego Supercomputer Center, have been developed that are more balanced (Amdahl number ~ 0.5-1), and this trend will continue.
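The Amdahl number can be computed directly from two machine specifications. This sketch assumes the common convention of counting I/O in bits per second; the two example machines are hypothetical and their figures are invented for illustration.

```python
def amdahl_number(io_bytes_per_second, instructions_per_second):
    """Ratio of sequential I/O (converted to bits per second, an assumed
    convention) to instructions per second. Values near 1 are 'balanced'."""
    return (io_bytes_per_second * 8) / instructions_per_second

# Hypothetical balanced node: 125 MB/s of I/O and 10^9 instructions/s.
balanced = amdahl_number(125e6, 1e9)

# Hypothetical compute-heavy node: 1.25 MB/s of I/O and 10^12 instructions/s.
compute_heavy = amdahl_number(1.25e6, 1e12)

print(balanced, compute_heavy)
```

The second machine has the kind of very low Amdahl number quoted in the text: it can compute far faster than it can be fed with data.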

Many of the petascale issues will continue to the exascale, and become even worse. Power consumption will rise to circa 100 MW, equivalent to a small power station, and most of it will be used for accessing memory and moving data, not processing. Core counts will rise from of order 100,000 to over 100 million, and I/O will be even more dominant as a limiting factor. Hardware such as GPUs will not solve these more fundamental problems. Nevertheless, current projections show that the SKA project will require the world's fastest supercomputer circa 2020 to process its data, so astronomers have an interest in the outcome.

Real-time Processing and the Time Domain

Historical precedent has shown that whenever a new region of parameter space in astronomy is explored, unexpected new discoveries are made (e.g., pulsars). In the space of timescale of variability versus brightness, particularly over large regions of the sky, a large parameter space remains unexplored, but in the next decade, large synoptic surveys such as LSST will explore this space. They will greatly increase the number of known variable objects, discover more examples of known rare objects, confirm objects predicted to exist but not yet observed, and likely find unexpected new classes of object.

While data mining techniques able to classify static data have been used for several years in astronomy (see section 2), several further issues are brought up by the time domain, including:

  • Multiple observations of objects that can vary in irregular and unpredictable ways, both intrinsic and due to the observational equipment
  • Objects in difference images, in which the static background is subtracted, leaving the variation, do not look like regular objects
  • The need for extremely rapid response to certain events, such as gamma-ray bursts, where physical information can be lost mere seconds after an event becomes detectable
  • Robust classification of large streams of data in real time
  • Lack of previous information on several phenomena, leaving a sparse or absent training set
  • The volume and storage of time domain information in databases

Other challenges are also seen in static data, but will assume increased importance as real-time accuracy is needed:

  • Removal of artifacts that might otherwise be flagged as unusual objects and incur expensive follow-up telescope time
  • Variability will be both photometric, and astrometric, i.e., objects can vary in brightness, and/or move
  • Variability shows many different patterns, depending on the type of object, including regular, nonlinear, irregular, stochastic, chaotic, or heteroscedastic (the variability itself can change with time)

There has been much progress in this field in recent years, and machine learning processes that emphasize a training set when information is abundant, or prior information when it is not, have been utilized. Examples include Hidden Markov models, a type of Bayesian network that generalizes mixture models to the time domain. Other probabilistic and/or semi-supervised classifiers should also prove useful. The time domain also allows object classification in new ways not possible in static data, e.g., using photometric variability to select quasar candidates, or astrometric variability to find asteroids.
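As a much simpler stand-in for the classifiers mentioned above, photometric variability can be flagged with a reduced chi-square test against a constant brightness, which also accounts for measurement errors that differ from point to point. This is a minimal sketch, not the method of any particular survey; the light curves below are invented for illustration.

```python
def reduced_chi_square(mags, errors):
    """Reduced chi-square of a light curve about its inverse-variance
    weighted mean. Values near 1 are consistent with a constant source;
    values much greater than 1 suggest real variability."""
    weights = [1.0 / e ** 2 for e in errors]
    mean = sum(w * m for w, m in zip(weights, mags)) / sum(weights)
    chi2 = sum(((m - mean) / e) ** 2 for m, e in zip(mags, errors))
    return chi2 / (len(mags) - 1)

errors = [0.02] * 5  # hypothetical per-epoch magnitude errors

constant_source = [18.00, 18.02, 17.99, 18.01, 17.98]
variable_source = [18.0, 18.3, 17.7, 18.4, 17.6]

print(reduced_chi_square(constant_source, errors))
print(reduced_chi_square(variable_source, errors))
```

A statistic of this kind is sensitive to underestimated errors and to artifacts, two of the issues listed above, so in practice it would be a first cut rather than a final classification.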

Events in real time are recorded and passed to appropriate telescopes via the Virtual Observatory (see below) protocol VOEvent net, and are then followed up, either automatically or by human observers. Time series analysis is a well developed area of statistics, and many further techniques will no doubt be useful. Several large survey projects, such as LSST and Gaia, have data mining working groups, for which the time domain is a significant component.


Semantics

For astronomical data to be useful, they must be described by metadata: what telescope was used, when, what was the exposure time, which instrument, what are the units of measurement, and so on. This is crucial not only for the meaningfulness of the results, but also for their scientific credibility through being reproducible. The FITS file format is popular in part because it stores the associated metadata for observations.
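To show how FITS carries its metadata, the sketch below parses a header made of 80-character 'KEYWORD = value / comment' cards. It is deliberately simplified (for example, it does not handle a '/' inside a quoted string), and the header values are invented; for real files an established library such as astropy.io.fits should be used instead.

```python
def make_card(keyword, value_and_comment):
    """Format one 80-character header card (helper for this example only)."""
    return "{:<8}= {}".format(keyword, value_and_comment).ljust(80)

def parse_header(header_text):
    """Parse simple keyword/value cards into a dictionary."""
    metadata = {}
    for start in range(0, len(header_text), 80):
        card = header_text[start:start + 80]
        keyword = card[:8].strip()
        if keyword == "END":
            break
        if card[8:10] != "= ":
            continue  # skip COMMENT, HISTORY, and blank cards
        value = card[10:].split("/", 1)[0].strip()  # drop the comment
        if value.startswith("'"):
            metadata[keyword] = value.strip("'").strip()
        elif value in ("T", "F"):
            metadata[keyword] = (value == "T")
        else:
            try:
                metadata[keyword] = int(value)
            except ValueError:
                try:
                    metadata[keyword] = float(value)
                except ValueError:
                    metadata[keyword] = value
    return metadata

# Hypothetical header for illustration.
example_header = "".join([
    make_card("SIMPLE", "T"),
    make_card("TELESCOP", "'Example 2m'"),
    make_card("EXPTIME", "300.0 / [s] exposure time"),
    make_card("BUNIT", "'adu'"),
    "END".ljust(80),
])
header = parse_header(example_header)
print(header)
```

The telescope, exposure time, and units travel with the data in a form that software can read without human intervention.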

It is also useful for the automated components of data analysis to understand this information. Consider searching Google image search for the Eagle Nebula, Messier 16. Typing 'm16' will give you the image, after several pages of pictures of the US military rifle. One can refine the search, but the fundamental limitation is that only the string of characters, not what is actually wanted, is being matched.

This is an issue, because semantic information may be required for meaningful queries to be possible, e.g., one may want to search a compilation of catalogues in the Virtual Observatory for redshifts, which requires the system to know which catalogues contain redshifts. Semantics also allows the idea of annotations, in which knowledge concerning the data is added, e.g., by a user - this object classified as unknown by the algorithm is an artifact caused by the superposition of a star and part of a galaxy.
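The redshift example can be sketched with column-level semantic tags. The tags below follow the style of the IVOA Unified Content Descriptor (UCD) vocabulary, but the catalogue names, column names, and the lookup function are hypothetical and exist only for this illustration.

```python
# Hypothetical catalogue descriptions: column name -> semantic tag.
CATALOGUES = {
    "survey_a": {"ra": "pos.eq.ra", "dec": "pos.eq.dec", "z": "src.redshift"},
    "survey_b": {"RAJ2000": "pos.eq.ra", "DEJ2000": "pos.eq.dec",
                 "Vmag": "phot.mag;em.opt.V"},
    "survey_c": {"alpha": "pos.eq.ra", "delta": "pos.eq.dec",
                 "zphot": "src.redshift.phot"},
}

def columns_with_concept(catalogues, concept):
    """Find, per catalogue, the columns whose tag is the given concept or a
    more specific form of it (e.g. 'src.redshift.phot' for 'src.redshift')."""
    matches = {}
    for name, columns in catalogues.items():
        found = [col for col, tag in columns.items()
                 if any(word == concept or word.startswith(concept + ".")
                        for word in tag.split(";"))]
        if found:
            matches[name] = found
    return matches

print(columns_with_concept(CATALOGUES, "src.redshift"))
```

The query succeeds even though the three catalogues name their columns differently, because it matches on meaning rather than on the string of characters in the column name.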

The main issue with semantic astronomy is its practical deployment such that it is useful for the community to allow new knowledge, i.e., science, to be discovered.

Unfortunately, semantics also suffers from a level of abstraction from the data. If the metadata are inaccurate, or missing (which is not unheard of), then that is propagated through the analysis. It is also not necessarily simple to define a fast-moving and constantly changing research area with the level of precision required for machines to use the data.

However, the potential payoff of directly increasing what we are searching for, i.e., data -> information -> knowledge -> wisdom, will almost certainly outweigh, in its cumulative usefulness, the problems in the descriptions of particular datasets. The field is part of a broader effort in computer science to bring about the 'semantic web', with similar efforts in related fields such as bioinformatics, and will benefit from progress in these areas.

The Virtual Observatory

Many novel discoveries in astrophysics, especially in the last half century, have been made by combining more than one dataset over different wavelengths. For example, quasars were discovered as radio sources that turned out to match starlike optical objects. The aim of the Virtual Observatory (VO) is to make the plethora of thousands of astronomy datasets interoperable, and provide corresponding analysis tools, so that the true potential of the data may be realized.

Internationally, each of several countries has its own VO, based at one or more data centres. These are funded by that country, and the VOs are federated into the International Virtual Observatory Alliance (IVOA). The IVOA has defined and developed a range of standards such that datasets can indeed be made interoperable. The aim is that regular users do not need to know about these standards, but utilize them transparently. In practice, a basic knowledge of what is going on is likely to be useful, e.g., your RA+decs from Topcat are being sent to DS9 not magically, but using SAMP, which is a general way to link the output from one application to the input of another.

The VO has been in operation since around the year 2000. A huge amount of progress has been made, with many standards defined, and a lot of good, working software. Unfortunately, there is not yet a significant body of science that can be pointed to as being impossible without the VO (although see section 2 of this guide). This means that its reach within the wider astronomical community has been poor, and various misconceptions, such as that it is simply a data repository (which we could have anyway), abound.

Another problem is in cross-matching data that is spread over different locations, if those data are large and hence difficult to move. What has been demonstrated, however, is the ability to cross-match large datasets in a given location, and VO tools to do this are becoming available.
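At its core, a positional cross-match pairs each source in one catalogue with the nearest source in another, within some matching radius. The following is a minimal brute-force sketch, not a VO tool: the coordinates are invented, the one-arcsecond radius is an arbitrary assumption, and the double loop scales as the product of the catalogue sizes, so large catalogues need tree-based or zone-based methods.

```python
import math

def angular_separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation via the haversine formula.
    Inputs are in degrees, the result is in arcseconds."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(h))) * 3600.0

def cross_match(cat_a, cat_b, radius_arcsec=1.0):
    """For each (ra, dec) in cat_a, find the nearest source in cat_b within
    the radius. Returns a list of (index_a, index_b, separation_arcsec)."""
    matches = []
    for i, (ra_a, dec_a) in enumerate(cat_a):
        best = None
        for j, (ra_b, dec_b) in enumerate(cat_b):
            sep = angular_separation_arcsec(ra_a, dec_a, ra_b, dec_b)
            if sep <= radius_arcsec and (best is None or sep < best[1]):
                best = (j, sep)
        if best is not None:
            matches.append((i, best[0], best[1]))
    return matches

# Hypothetical positions in degrees.
optical = [(10.0, 20.0), (150.0, -30.0)]
radio = [(150.0001, -30.0), (10.0, 20.0002), (200.0, 5.0)]
print(cross_match(optical, radio))
```

The calculation itself is simple. The difficulty described above comes from having to bring both catalogues to the same place before it can run.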

There is much promise, however. Various VO schools and outreach programs have been run, and most users are very positive about their experience. This means that basic knowledge of the VO and its capabilities will become increasingly important.

Visualization of Large, Complex, and High-dimensional Data

Visualization comes in two forms: graphical, and scientific. While the former is concerned with presentation and impact, the latter aims to increase qualitative and quantitative understanding of the data, by revealing or clarifying patterns not otherwise evident.

An immediate difficulty with visualization is that the human brain is not designed to visualize more than three spatial dimensions. However, astronomy does not fully utilize even the well-studied research that exists within this limitation, for example, methods for visualizing four variables on a 2D plot using different shaped glyphs overlaid on a colour map, and the most commonly used visualization programs are over two decades old. A classic book on the subject is Edward Tufte's 'The Visual Display of Quantitative Information'.

-- NickBall - 19 Mar 2011
-- NickBall - 02 Oct 2011

Topic revision: r4 - 2011-10-03 - NickBall