IVOA KDD-IG: A user guide for Data Mining in Astronomy

6: Present and future directions

The combination of an abundance of available data mining algorithms, advancing technology, large amounts of new astronomical data continuously opening up new regions of parameter space, and the consequent large number of newly addressable science questions, means that several interesting new directions for data mining in astronomy are opening up in the near-term future.

Several reviews also discuss future directions, including the topics below in more detail, e.g., Ball & Brunner (2010), Borne (2009), or Pesenson et al. (2010), all linked in section 8 of the guide. The aim of this section is to provide a generic overview, not a detailed literature review, so long lists of links and references are avoided.

This section assumes some basic knowledge of both astronomy and data mining, so astronomers without a data mining background will likely find it useful to read the earlier sections first. We also assume a basic knowledge of computing. Inevitably, material in different subsections will overlap; making each one self-contained would add considerable length, so we do not attempt this.

Astrostatistics


Although the interaction of astronomy and statistics has a history going back centuries, because the large number of observable objects makes astronomy an intrinsically statistical subject, the modern incarnation of astrostatistics is necessitated not only by the large number of objects, but also by the need for the sophistication to deal appropriately with data that are large, high dimensional, and subject to numerous real-world issues.

There has also been a long running debate in statistics between the frequentist and Bayesian approaches. Until the advent of fast and cheap computing, the Bayesian approach was somewhat limited by the computation time needed to evaluate the integrals. That constraint has now lessened, and through packages such as those in the widely used R language, Bayesian analyses are more straightforward to perform. The frequentist vs. Bayesian debate has now largely evaporated, and the general consensus is that one should use whatever approach is more appropriate.

The Bayesian approach has many advantages, because one can rigorously incorporate prior knowledge. Indeed, some analyses are bringing astronomy results that were thought to be reasonably well established back into question. But there are also disadvantages, one of particular importance being that the results are strongly dependent on the prior that is put in.
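The dependence on the prior can be illustrated with a minimal sketch, assuming a conjugate Beta prior on a fraction (for example, the fraction of objects in a sample that show some property) and binomially distributed counts. The function name, the priors, and the counts are all hypothetical and chosen only for illustration.

```python
def beta_posterior_mean(successes, trials, prior_a, prior_b):
    """Posterior mean of a fraction under a Beta(prior_a, prior_b) prior,
    after observing `successes` out of `trials` (binomial likelihood)."""
    return (prior_a + successes) / (prior_a + prior_b + trials)

# Hypothetical small sample: 3 of 10 objects show the property of interest.
flat = beta_posterior_mean(3, 10, 1.0, 1.0)           # flat prior
informative = beta_posterior_mean(3, 10, 20.0, 2.0)   # prior favouring high fractions
print(round(flat, 3), round(informative, 3))          # 0.333 0.719

# Hypothetical large sample with the same observed fraction: the prior matters far less.
flat_big = beta_posterior_mean(300, 1000, 1.0, 1.0)
informative_big = beta_posterior_mean(300, 1000, 20.0, 2.0)
print(round(flat_big, 3), round(informative_big, 3))  # 0.3 0.313
```

With only ten objects the two priors give very different answers. With a thousand, the data dominate and the choice of prior becomes much less important.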

Issues that astronomy data are commonly subject to include:

  • Large, complex, increasingly high-dimensional, time domain
  • Missing data: non-observation or non-detection
  • Heteroscedastic, non-Gaussian, underestimated errors
  • Outliers, artifacts, false detections
  • Systematic effects such as instrument calibration
  • Correlated data points

It is beyond the scope of this guide to discuss astrostatistics in more detail than this. The author is not aware of an equivalent 'statistics in astronomy' guide to this one, but simple web searches, or the links in section 8, will reveal starting points. The Cosmology and Machine Learning Uninstitute is somewhat similar to this guide, containing many links to the statistical cosmology literature, which has a strong Bayesian emphasis (because we only observe one universe).

It is likely to continue to be the case that many of the truly novel projects done in astronomy (and in other sciences) will result from cross-disciplinary collaboration. With the help of statisticians, computer scientists, or others, the subjects of astrostatistics and the like will continue to thrive and grow. Of particular note is that statisticians are often keen to work with astronomy data: they are abundant, contain lots of interesting and complex information that challenges the boundaries of their subject, and are also worthless (i.e., they do not have large commercial value, and are generally freely accessible). But most good science will still need the guidance of astronomers to ask the right questions, and give guidance on whether results make sense.

Cloud Computing

The idea of storage, processing, and distribution of data as an available resource akin to other utilities such as electricity or water is an attractive one for astronomers, and their institutions. If one can delegate the routine maintenance of hardware, backup of data, and so on, to a professional centre, then money, physical space, and time are freed up for improved research. One could, in principle, work from anywhere that has a computer screen and Internet access.

Currently, cloud computing environments such as the Amazon EC2 cloud show promise, but remain little used in astronomy. For large data, they are expensive. They are also not developed with scientists in mind as the primary users, which subjects them to issues of usability, long-term stability, etc., similar to those of other rapidly developing technologies such as GPUs (see below).

Issues include:

  • Expense, typically significantly more than running equivalent hardware oneself, if that is feasible
  • Use is often made of virtual machines, which, while abstracting the hardware, may perform less well than a regular local machine
  • Transfer of data to and from the cloud
  • Software licenses may be restricted to one's own site
  • Proprietary data is now offsite
  • Working on one cloud may make it difficult to transfer to another

More specialized resources are available. The CANFAR project in Victoria, Canada, is designed to provide cloud computing for astronomers, combined with the job scheduling abilities of a supercomputing cluster. In the CANFAR setup, the user runs one or more virtual machines (VMs), in the same way as managing a desktop, but has access via Condor to the batch processing power of several hundred (and growing) processor cores on computing nodes. The VMs run Linux, and one can install one's usual astronomy software on a VM, without alteration. The Condor batch script invokes the software using the same command and arguments that are normally used. The main downside to CANFAR is that it is not designed for jobs in which the operations at one node depend on those at another, such as would be the case with MPI on a distributed memory machine or OpenMP on a shared memory machine. Rather, it is designed for simple parallel processing.

Overall, whether the cloud is suitable depends on one's desired application.

The Curse of Dimensionality

A well-known problem in data mining is the 'curse of dimensionality'. Because modern astronomical detectors not only detect large numbers of objects, but also enable a lot of parameters (perhaps hundreds) to be measured for each one, many of which are of physical interest, astronomical data are increasingly subject to this problem. One can still make plots, but this may be time consuming. The multiplicity of plots may be reduced somewhat by the application of judicious physical thinking, but this is not necessarily the case.

Aspects of the curse of dimensionality are:

  • The more dimensions that are present, the larger the fraction of the full extent of each dimension that is required to cover a given fraction of the volume - most of the space is near a 'corner'. In a ten dimensional cube, a sub-cube has to span about 80% of the side length of the cube to cover 10% of the volume (0.1^(1/10) ≈ 0.79). Measurements of distance, such as Euclidean, also become increasingly similar.
  • No matter how many data points one has observed, the data rapidly become sparse in high dimensions, because the volume becomes exponentially large, and most of the space is therefore empty.
  • The time taken to run a data mining algorithm can scale exponentially with the number of dimensions.
  • One can reduce the number of dimensions in many ways, such as principal component analysis, or its more sophisticated nonlinear variants, but what each component represents may immediately become unclear, physically, or in relation to a model or simulation.
  • The effects can be mitigated by using sampling methods such as Markov Chain Monte Carlo.
  • The problems also occur in Bayesian inference.
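
The first of these aspects can be checked numerically. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, and the points are drawn uniformly in a unit hypercube purely for illustration.

```python
import math
import random

def edge_fraction(volume_fraction, ndim):
    """Side length (as a fraction of the full side) of a sub-cube that
    contains `volume_fraction` of a unit hypercube in `ndim` dimensions."""
    return volume_fraction ** (1.0 / ndim)

def distance_contrast(ndim, npoints=200, seed=0):
    """(max - min) / min of the Euclidean distances from one random point
    to all the others, for points uniform in the unit hypercube."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(ndim)] for _ in range(npoints)]
    ref = points[0]
    dists = [math.dist(ref, p) for p in points[1:]]
    return (max(dists) - min(dists)) / min(dists)

print(round(edge_fraction(0.1, 10), 3))  # 0.794, i.e. about 80% of the side

# The contrast between the nearest and farthest neighbour shrinks as the
# number of dimensions grows, so distances become increasingly similar.
for ndim in (2, 10, 100, 1000):
    print(ndim, round(distance_contrast(ndim), 2))
```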

What may not be clear is whether the astrophysical processes under study are intrinsically high dimensional, or whether the study has simply measured a lot of parameters. The parameter set may still be explained by a combination of simpler processes. For example, one could measure dozens of parameters for stars, but they might all be explained by the framework of the Hertzsprung-Russell diagram.

However, the fact that, for example, using probability density functions gives significantly better measures of distance via photometric redshifts than a simpler approach applied to the exact same data suggests that, in general, better signals will be extracted from higher dimensional information. While the underlying physical laws may indeed be simple, complex emergent processes that are not yet understood, such as star or galaxy formation, will in general require high-dimensional observations, equally complex simulations, and comparison between the two.

Graphical Processing Units, and Other Novel Hardware

From circa 1965-2005, computer processor speeds doubled approximately every 18 months. Thus, computer codes simply rode this Moore's law speedup, and ran faster on every succeeding machine. From around 2005, however, heat dissipation rendered further clock speedups impractical, and processors have been increasing in speed by continuing to shrink parts, and containing more processor cores, executing in parallel. This means that to continue their acceleration on these processors, codes must be parallelized. However, even this only goes so far, and other forms of hardware provide potentially even greater speedups. Given the exponential increase in data, every possible speedup that still gives the correct answer is desirable.

The most prominent alternative hardware is the graphical processing unit (GPU). Driven by the huge resources of the computer gaming industry, these chips were originally designed to render graphics at high speed, but their ability to process vector datatypes at a much higher speed than a regular CPU has rendered them useful for more general applications. Now able to deal with floating point datatypes, first single precision, and subsequently double precision, they have become useful across a range of scientific applications. These chips are known as general purpose GPUs, or GPGPUs.

The downside of GPUs is that code must be ported to run on them. This may be as simple as wrapping the appropriate single function, using an environment designed for GPU programming such as CUDA or OpenCL, but in other cases, the algorithm itself may need to be changed, which means rewriting the code. For codes that are part of a large pipeline, or have a large user base within the community, this means that one loses the feedback, updates and support of other users.

Several papers have appeared demonstrating speedups of astronomy code, from a factor of a few, to over 100x. Code sped up by a factor of 100 that previously took a week (168 hours) to run would instead take under two hours.

Not every algorithm is suitable for GPU speedup. Desirable characteristics for an algorithm include:

  • Ability for massive parallelism: lots of subsections of the code are independent of each other
  • Locality of memory reference: threads close to each other in the code access similarly close together memory locations
  • Minimize threads conditionally executed by the portion of the code on the GPU
  • Maximize the number of mathematical operations per item of data, including by changing the algorithm
  • Minimize the data transfer to and from the GPU

A useful guide for astronomers considering speeding up their codes is Barsdell et al., MNRAS 408 1936 (2010). They indicate that one would expect algorithms such as N-body and semi-analytic simulations, source extraction, and machine learning, to be highly amenable to speedup, perhaps by a factor of O(100x), and others, such as smoothed particle hydrodynamics, adaptive mesh, stacking, and selection by criteria, by a factor of O(10x).

Other novel hardware besides GPUs includes the Field Programmable Gate Array (FPGA), long used as specialized hardware (e.g., radio telescope correlators), but now able to be programmed more generally to instantiate in hardware a particular function, and then be reprogrammed to a different function as needed. Such a function could be, e.g., a trained artificial neural network. The programming can be done in variants of a regular language, such as C. There is also the IBM Cell processor, used in places including the petascale Roadrunner machine that has performed cosmological simulations. In 2012, the Intel MIC architecture will be available. This will provide speedup in a friendlier programming environment than GPUs.

Parallel and Distributed Data Mining

As datasets become larger, moving the data is becoming an increasing problem. Despite novel projects such as Globus Online, the fastest way to move a large dataset (e.g., 500 TB) across the country is still to copy it to disk or tape, and physically transport it.
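
A rough calculation shows why. The sketch below estimates the time to move a 500 terabyte dataset over a network link. The link speeds are assumed values for illustration, not measurements, and protocol overheads are ignored unless an `efficiency` factor (a hypothetical parameter) is supplied.

```python
def transfer_days(size_bytes, link_bits_per_second, efficiency=1.0):
    """Days needed to move `size_bytes` over a link of the given raw speed.
    `efficiency` (0 to 1) is the assumed fraction of that speed achieved."""
    seconds = size_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / 86400.0

dataset = 500e12  # 500 terabytes, in bytes

# Assumed sustained link speeds, for illustration only.
for label, rate in (("100 Mbit/s", 100e6), ("1 Gbit/s", 1e9), ("10 Gbit/s", 10e9)):
    print(label, round(transfer_days(dataset, rate), 1), "days")
# 100 Mbit/s 463.0 days
# 1 Gbit/s 46.3 days
# 10 Gbit/s 4.6 days
```

Even a sustained 10 Gbit/s link takes several days, which is comparable to shipping physical media.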

Hence, to apply data mining on a scale that utilizes these data, one must move the code to the data, rather than attempt to download the dataset. Although in many cases most of the science comes from a small fraction of the full data volume, e.g., a catalogue instead of an image, this is not always true, and even when it is, downloading may still not be the best approach.

Even when the data are in one place, however, applying a data mining algorithm to them may result in intractable computing times, because data mining algorithms often scale, for N objects, as N^2, or worse. This may be alleviated by parallelizing the tasks, or using faster versions of the algorithms that scale as N log N or better. Such algorithmic speedups are achieved in various ways, a common one being the use of tree structures, such as the kd-tree.
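
As an illustration of the tree-based approach, the following is a minimal sketch of a kd-tree nearest-neighbour search in pure Python, checked against a brute-force search. It is a teaching example under simple assumptions (points as tuples, Euclidean distance, random 2D test data), not an optimized implementation. Production work would normally use an established library.

```python
import math
import random

def build_kdtree(points, depth=0):
    """Recursively build a kd-tree as nested tuples (point, axis, left, right)."""
    if not points:
        return None
    axis = depth % len(points[0])          # cycle through the coordinates
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                 # split at the median point
    return (points[mid], axis,
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))

def nearest(node, target, best=None):
    """Return (distance, point) for the tree point closest to `target`."""
    if node is None:
        return best
    point, axis, left, right = node
    d = math.dist(point, target)
    if best is None or d < best[0]:
        best = (d, point)
    diff = target[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, target, best)
    # The far side can only hold a closer point if the splitting plane
    # is nearer than the best match found so far.
    if abs(diff) < best[0]:
        best = nearest(far, target, best)
    return best

def nearest_brute(points, target):
    """Reference answer: check every point (N distance evaluations)."""
    return min((math.dist(p, target), p) for p in points)

rng = random.Random(42)
catalogue = [(rng.random(), rng.random()) for _ in range(2000)]
tree = build_kdtree(catalogue)
query = (0.5, 0.5)
print(nearest(tree, query) == nearest_brute(catalogue, query))  # True
```

In low dimensions each query typically visits only of order log N nodes, so finding the neighbours of all N objects scales roughly as N log N rather than N^2. The advantage erodes in high dimensions, for the reasons given in the section on the curse of dimensionality.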

Parallel and distributed data mining has been widely employed in the commercial sector, but so far has been little used in astronomy, because it generally requires porting codes, and is affected by the type of parallelism, whether subsets of the data are independent of each other, and the architecture of a machine, e.g., distributed or shared memory. Incorporating data mining algorithms into parallel systems such as MPI or OpenMP is also not straightforward. Grid computing and crowdsourcing are also approaches likely to become more important in future.

Parallel Programming

The convergence of technologies, from increasing processor cores on the one hand, to increasing generality of highly parallel chips such as GPUs (see above), means that, in general, future codes will be executed as several threads running in parallel. The speedup factor is limited by the portion of the code that is not parallel, so ideally this is minimized.
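
This limit is commonly known as Amdahl's law. As a short sketch, with a hypothetical function name and illustrative fractions, the ideal speedup for a code in which a fraction p of the run time can be parallelized over n workers is 1 / ((1 - p) + p / n):

```python
def amdahl_speedup(parallel_fraction, n_workers):
    """Ideal speedup when `parallel_fraction` of the run time is spread
    perfectly over `n_workers` and the rest remains serial."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_workers)

# Illustrative case: 95% of the run time is parallelizable.
print(round(amdahl_speedup(0.95, 8), 2))     # 5.93
print(round(amdahl_speedup(0.95, 1024), 2))  # 19.64
# However many workers are added, the speedup cannot exceed 1 / 0.05 = 20.
```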

Parallelization causes numerous potential issues for astronomy codes:

  • Most existing codes are not designed to run in parallel
  • Astronomers are not trained in parallel programming
  • Parallelizing code may be extremely non-trivial, and may include altering algorithms; even finding an arrangement that is not highly suboptimal may not be obvious
  • Debugging parallel code can be difficult, because issues arise not encountered in sequential programming, such as interaction of processes. The output of one process may form the input of another, or they may operate on the same data.
  • There are many different types of parallelism, e.g., task, data, coarse- and fine-grained

Programming languages such as Python support basic parallel programming, e.g., utilizing several of one's processor cores, but for larger scales there are many dedicated parallel programming languages.

However, a subclass of parallel tasks, those that are independent of each other ('embarrassingly parallel') often occur in astronomical data processing tasks such as classifying objects, and these are relatively straightforward to implement by simply calling the code multiple times.
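
A minimal sketch of this pattern, using Python's standard `concurrent.futures` module, is given below. The `classify` function and the object magnitudes are hypothetical stand-ins for a real classifier and catalogue.

```python
from concurrent.futures import ThreadPoolExecutor

def classify(obj):
    """Hypothetical stand-in classifier: label an object from its colour alone."""
    return "red" if obj["g"] - obj["r"] > 0.6 else "blue"

# Hypothetical objects with g and r band magnitudes.
objects = [
    {"g": 18.2, "r": 17.4},
    {"g": 19.0, "r": 18.8},
    {"g": 20.1, "r": 19.3},
]

# Each object is classified independently of the others, so the calls can
# run in any order; map() returns the results in input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    labels = list(pool.map(classify, objects))

print(labels)  # ['red', 'blue', 'red']
```

Threads are used here to keep the example simple. For CPU-bound work in pure Python, `ProcessPoolExecutor` from the same module offers the same `map` interface and runs the calls in separate processes.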

Petascale and Exascale Computing

Most current astronomy is done using files of sizes measured in megabytes, or gigabytes. Many surveys, however, now possess dataset sizes in the terascale (10^12 bytes) regime, and some, e.g., Pan-STARRS, have entered the petascale (10^15 bytes) regime. Similarly, most computers achieve performance measured in gigaflops, but modern supercomputers have entered the teraflops range, and some have now exceeded 1 petaflops. The trend will continue, and exaflops (10^18) machines, and exabytes of data, are anticipated by the end of the decade.

Astronomical projects already use petascale machines (e.g., cosmological simulations on Roadrunner at Los Alamos National Lab), several petascale surveys are already running, being developed, or are planned (LSST, etc.), and projects such as the SKA, circa 2020, will enter well into the exascale regime.

Petascale machines have several issues, including:

  • Cost
  • Power consumption
  • Balancing processing with data input/output (I/O)
  • Failure of parts (e.g., among the thousands of disk drives)
  • Scaling software to use hundreds of thousands of processor cores
  • Decrease in memory per processor core

Balancing processing with I/O is important, because typical scientific applications require about 10,000 CPU cycles per byte of I/O, whereas, to avoid being delayed by waiting for the disk, one wants approximately 50,000 cycles. This rule of thumb is one of several useful approximations known as Amdahl's laws. Another of the laws states that the ratio of I/O per second to the number of instructions per second (the Amdahl number) should be approximately 1. Applications dominated by I/O thus have numbers far below this, e.g., 10^-5, and this is the case with most supercomputing applications. Recent machines such as GrayWulf at Johns Hopkins, and the supernode architecture of the 'Gordon' machine at the San Diego Supercomputer Center, have been developed that are more balanced (Amdahl number ~ 0.5-1), and this trend will continue.
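
The Amdahl number can be estimated from two figures for a machine. The sketch below assumes that I/O is counted in bits per second, which is how this law is usually stated. The function name and the hardware figures are hypothetical.

```python
def amdahl_number(io_bytes_per_second, instructions_per_second):
    """Bits of I/O per second divided by instructions per second.
    Assumes I/O is counted in bits, as in the usual statement of the law."""
    return io_bytes_per_second * 8 / instructions_per_second

# Hypothetical node: 8 cores at 3e9 instructions/s each, one disk at 100 MB/s.
node = amdahl_number(100e6, 8 * 3e9)
print(round(node, 3))  # 0.033

# I/O rate the same node would need for an Amdahl number of 1.
balanced_bytes_per_second = 8 * 3e9 / 8
print(balanced_bytes_per_second / 1e9, "GB/s")  # 3.0 GB/s
```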

Many of the petascale issues will continue to the exascale, and become even more significant. Power consumption will rise to circa 100 MW, equivalent to a small power station, and most of it will be consumed by memory access and data transfer, rather than processing. Processor cores numbering in the hundreds of thousands will rise to over 100 million, and I/O will be even more dominant as a limiting factor. Alternative hardware such as GPUs will not solve these fundamental problems. Nevertheless, current projections show that the SKA project will require the world's fastest supercomputer circa 2020 to process its data, so astronomers have an interest in the outcome.

Real-time Processing and the Time Domain

Historical precedent has shown that whenever a new region of parameter space in astronomy is explored, unexpected new discoveries are made (e.g., pulsars). In the space of timescale of variability, versus brightness, particularly over large regions of the sky, a large parameter space remains unexplored. In the next decade, large synoptic surveys such as LSST will explore this space. They will greatly increase the instances of known variable objects, discover more examples of known rare objects, confirm objects predicted to exist but not yet observed, and likely find unexpected new classes of object.

While data mining techniques able to classify static data have been used for several years in astronomy (see section 2), several new issues are encountered when one extends the analysis to time domain:

  • Multiple observations of objects that can vary in irregular and unpredictable ways, both intrinsic and due to the observational equipment
  • Objects in difference images, in which the static background is subtracted, leaving the variation, do not look like regular objects
  • The necessarily extremely rapid response to certain events such as gamma-ray bursts where physical information can be lost mere seconds after an event becomes detectable
  • Robust classification of large streams of data in real time
  • Lack of previous information on several phenomena, leaving a sparse or absent training set
  • The volume and storage of time domain information in databases
  • Available commercial databases may not contain a time domain datatype
  • Removal of artifacts that might otherwise be flagged as unusual objects, and incur expensive follow-up telescope time
  • Variability will be both photometric, and astrometric, i.e., objects can vary in brightness, and/or move
  • Variability shows many different patterns, depending on the type of object, including regular, nonlinear, irregular, stochastic, chaotic, or heteroscedastic (the variability itself can change with time)

There has been much progress in this field in recent years, and machine learning processes that emphasize a training set when information is abundant, or prior information when it is not, have been utilized. Examples include Hidden Markov models, a type of Bayesian network which is a generalization of mixture models to the time domain, and other probabilistic and/or semi-supervised classifiers. The time domain also allows object classification in new ways not possible in static data, e.g., using photometric variability to select quasar candidates, or astrometric variability to discover asteroids.
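
As a much simpler illustration of selecting objects by photometric variability than the classifiers named above, the following sketch flags a light curve as variable using the reduced chi-square against a constant-brightness model. This is a swapped-in textbook statistic, not the method of any particular survey, and the magnitudes and errors are hypothetical. It needs no assumption about the sampling times, so irregular sampling is not a problem, and it weights each point by its own error, so heteroscedastic errors are handled.

```python
def reduced_chi2_constant(mags, errs):
    """Reduced chi-square of a light curve against a constant model, where
    the constant is the inverse-variance weighted mean magnitude."""
    weights = [1.0 / e ** 2 for e in errs]
    mean = sum(w * m for w, m in zip(weights, mags)) / sum(weights)
    chi2 = sum(((m - mean) / e) ** 2 for m, e in zip(mags, errs))
    return chi2 / (len(mags) - 1)   # one fitted parameter: the mean

# Hypothetical light curves (magnitudes and 1-sigma errors).
steady = reduced_chi2_constant([18.00, 18.02, 17.99, 18.01],
                               [0.02, 0.02, 0.02, 0.02])
variable = reduced_chi2_constant([18.00, 18.40, 17.70, 18.25],
                                 [0.02, 0.03, 0.02, 0.05])

# Values near 1 are consistent with noise alone; values much greater
# than 1 suggest real variability (or underestimated errors).
print(steady < 1.5, variable > 10)  # True True
```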

Events in real time are recorded and passed to appropriate telescopes via the Virtual Observatory (see below) protocol VOEvent net, and followed up, either automatically, or by human observers. Time series analysis is a well developed area of statistics, and many further techniques will no doubt be useful. Several large survey projects, such as LSST and Gaia, have data mining working groups, for which the time domain is a significant component.

Semantics


For astronomical data to be useful, they must be described by metadata: what telescope was used, when, what was the exposure time, which instrument, what are the units of measurement, and so on. This is crucial not only for the meaningfulness of their results, but their scientific credibility through being reproducible. The FITS file format is popular in part because it stores associated metadata for observations.

It is also useful for the automated components of data analysis to understand this information. Consider searching Google image search for the Eagle Nebula, Messier 16. Typing 'm16' will give you the image, after several pages of pictures of the US military rifle. One can refine the search, but the fundamental limitation remains that only the string of characters, not what is actually wanted, is being matched.

This is an issue, because semantic information may be required for meaningful queries to be possible, e.g., one may want to search a compilation of catalogues in the Virtual Observatory for redshifts, which requires the system to know which catalogues, and which columns within them, contain redshifts. Semantics also allows the idea of annotations, in which knowledge concerning the data is added, e.g., a user could note that an object classified as unknown by the algorithm is an artifact caused by the superposition of a star and part of a galaxy.
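
One existing mechanism for this is the IVOA's Unified Content Descriptor (UCD), a short controlled-vocabulary string attached to a column to say what it means. The sketch below searches column metadata by meaning rather than by column name. The catalogue names, column names, and the `find_columns` helper are hypothetical. The UCD strings follow the UCD1+ vocabulary.

```python
# Hypothetical catalogue metadata: column name -> UCD describing its meaning.
catalogues = {
    "survey_a": {"RAJ2000": "pos.eq.ra", "DEJ2000": "pos.eq.dec", "z": "src.redshift"},
    "survey_b": {"ra": "pos.eq.ra", "dec": "pos.eq.dec", "zphot": "src.redshift.phot"},
    "survey_c": {"ra": "pos.eq.ra", "dec": "pos.eq.dec", "Vmag": "phot.mag;em.opt.V"},
}

def find_columns(catalogues, concept):
    """Return (catalogue, column) pairs whose UCD matches `concept`,
    including more specific descriptors such as 'src.redshift.phot'."""
    hits = []
    for cat_name, columns in sorted(catalogues.items()):
        for col_name, ucd in columns.items():
            # A UCD may combine several words separated by ';'.
            words = ucd.split(";")
            if any(w == concept or w.startswith(concept + ".") for w in words):
                hits.append((cat_name, col_name))
    return hits

print(find_columns(catalogues, "src.redshift"))
# [('survey_a', 'z'), ('survey_b', 'zphot')]
```

A search on the string 'z' or 'redshift' in the column names would have missed one or both of these columns. Matching on the descriptor finds them regardless of what each catalogue's authors chose to call them.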

The main issue with semantic astronomy is its practical deployment such that it is useful for the community to allow new knowledge, i.e., science, to be discovered.

Unfortunately, semantics also suffers from a level of abstraction from the data. If the metadata are inaccurate, or missing (which is not unheard of), then that is propagated through the analysis. It is also not necessarily simple to define a fast-moving and constantly changing research area with the level of ontological precision required for machines to use the data.

However, the potential payoff of directly advancing what we are searching for, i.e., the progression data -> information -> knowledge -> wisdom, will almost certainly outweigh, in its cumulative usefulness, problems in the descriptions of particular datasets. The field is part of a broader effort in computer science, and in related fields such as bioinformatics, to bring about the 'semantic web', and will benefit from progress in these areas.

The Virtual Observatory

Many novel discoveries in astrophysics, especially in the last half century, have been made by combining more than one dataset over different wavelengths. For example, quasars were discovered as radio sources that turned out to match starlike optical objects. The aim of the Virtual Observatory (VO) is to make the plethora of thousands of astronomy datasets interoperable, and provide corresponding analysis tools, so that the true potential of the data may be realized.

Internationally, each of several countries has its own VO, based at one or more data centres. These are funded by that country, and the VOs are federated into the International Virtual Observatory Alliance (IVOA). The IVOA has defined and developed a range of standards such that datasets can indeed be made interoperable. The aim is that regular users do not need to know about these standards, but utilize them transparently. In practice, a basic knowledge of what is going on is likely to be useful, e.g., your RA+decs from Topcat are being sent to DS9 not magically, but using SAMP, which is a general way to link the output of one application to the input of another.

Analysis tools that have been developed include several visualization programs, the ability to collect data for a given position or object, to construct SEDs over multiple wavelengths, find, integrate and cross-match data from catalogues, time domain tools, e.g., light curves, semantic tools, and tools to run data mining algorithms, e.g., the DAME system.
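
To make the cross-matching step concrete, the following is a minimal sketch of a positional cross-match between two small catalogues, using the haversine formula for the angular separation. The source identifiers, coordinates, and match radius are hypothetical. The brute-force loop scales as N x M, so real tools use spatial indexing for large catalogues.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))

def cross_match(cat1, cat2, radius_arcsec):
    """For each (id, ra, dec) in cat1, find the nearest cat2 source within
    the radius. Returns (id1, id2, separation in arcsec) tuples."""
    radius_deg = radius_arcsec / 3600.0
    matches = []
    for id1, ra1, dec1 in cat1:
        best = None
        for id2, ra2, dec2 in cat2:
            sep = angular_separation_deg(ra1, dec1, ra2, dec2)
            if sep <= radius_deg and (best is None or sep < best[1]):
                best = (id2, sep)
        if best is not None:
            matches.append((id1, best[0], best[1] * 3600.0))
    return matches

# Hypothetical radio and optical sources: (id, RA in degrees, Dec in degrees).
radio = [("R1", 150.0000, 2.2000), ("R2", 150.1000, 2.3000)]
optical = [("O1", 150.0002, 2.2001), ("O2", 151.0000, 2.0000)]

matches = cross_match(radio, optical, radius_arcsec=2.0)
print([(a, b) for a, b, _ in matches])  # [('R1', 'O1')]
```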

The VO has been in operation since around the year 2000. A huge amount of progress has been made, with many standards defined, and a lot of good, working software. Unfortunately, there is not yet a significant body of science that can be pointed to as being impossible without the VO (although see section 2 of this guide). This means that its reach within the wider astronomical community has been poor, and various misconceptions, such as that it is simply a data repository (such repositories exist anyway), abound.

There is much promise, however. Various VO schools and outreach programs have been run, and most users are very positive about their experience. This means that basic knowledge of the VO and its capabilities will become increasingly important.

Visualization of Large, Complex, and High-dimensional Data

Visualization comes in two forms: graphical, and scientific. While the former is concerned with presentation and impact, the latter aims to increase qualitative and quantitative understanding of the data, by revealing or clarifying patterns not otherwise evident.

An immediate difficulty with visualization is that the human brain is not designed to visualize more than three spatial dimensions. However, astronomy does not fully utilize even the well-studied methods that exist within this limitation, for example, methods for visualizing four variables on a 2D plot using different shaped glyphs overlaid on a colour map. The most commonly used visualization programs are over two decades old and are not designed for datasets larger than machine memory.

Newer tools are becoming available, including ones compatible with the Virtual Observatory standards. Collaborations with other fields such as medical imaging have occurred, and large applications have been demonstrated. A system at Swinburne uses GPUs to visualize 1 terabyte ASKAP data cubes. As with many of the subjects discussed in this section, it is likely that continued collaboration with other fields will produce substantial progress.

-- NickBall - 19 Mar 2011
-- NickBall - 03 Oct 2011

Topic revision: r7 - 2011-10-03 - NickBall