IVOA KDD-IG: A user guide for Data Mining in Astronomy

6: Present and future directions

The combination of an abundance of available data mining algorithms, advancing technology, large amounts of new astronomical data continuously opening up new regions of parameter space, and the consequent large number of newly addressable science questions, means that several interesting new directions for data mining in astronomy will open up in the near-term future. Several reviews discuss these future directions, including the topics below, in more detail, e.g., Ball & Brunner (2010), Borne (2009), and Pesenson et al. (2010), all linked in section 8 of the guide. This section aims to be fairly generic rather than a detailed literature review, so long lists of links and references are avoided. It assumes some basic knowledge of both astronomy and data mining; astronomers lacking the latter will likely find it useful to read the earlier sections first. We also assume a basic knowledge of computing, e.g., what a CPU is. It is inevitable that material in different subsections will overlap; making each subsection fully self-contained would considerably lengthen them, so we do not attempt this.

Astrostatistics

Although the interaction of astronomy and statistics has a history going back centuries, because the large number of observable objects makes astronomy an intrinsically statistical subject, the modern incarnation of astrostatistics is necessitated not only by the large number of objects, but by the need for sufficient sophistication to deal appropriately with data that are at once large, high-dimensional, and subject to numerous real-world issues. There has also been a long-running debate in statistics between the frequentist and Bayesian approaches. Until the advent of fast and cheap computing, the Bayesian approach was somewhat limited by the computation time needed to evaluate its integrals.
But that constraint has now gone away, and through packages such as those in the widely used R language, Bayesian analyses are more straightforward to perform. The frequentist vs. Bayesian debate has now largely subsided, and the general consensus is that one should use whichever approach is more appropriate. The Bayesian approach has many advantages, because one can rigorously incorporate prior knowledge. Indeed, some analyses are bringing astronomy results that were thought to be reasonably well established back into question. But there are also disadvantages, one obvious one being that the results can depend strongly on the prior that is put in. Astronomy data are also subject to a number of real-world issues of their own.
It is beyond the scope of this guide to discuss astrostatistics in more detail. The author is not aware of a 'statistics in astronomy' guide equivalent to this one, but simple web searches, or the links in section 8, will reveal starting points. The Cosmology and Machine Learning Uninstitute is somewhat similar in spirit to this guide, and has many links to the statistical cosmology and related literature, which has a strong Bayesian emphasis (because we only observe one universe).
It is likely to continue to be the case that many of the truly novel projects in astronomy (and in other sciences) will result from cross-disciplinary collaboration. Whether with statisticians, computer scientists, or others, subjects such as astrostatistics will continue to thrive and grow. Of particular note is that statisticians like astronomy data: they are abundant, contain lots of interesting and complex information that challenges the boundaries of their subject, and are also, commercially speaking, worthless (i.e., they do not have large commercial value, and are generally freely accessible). But most good science will still need the guidance of astronomers to ask the right questions, and to know when results make sense.
Cloud Computing

The idea of the storage, processing, and distribution of data as an available resource akin to other utilities such as electricity or water is an attractive one for astronomers and their institutions. If one can delegate the routine maintenance of hardware, backup of data, and so on to a professional centre, then money, physical space, and time are freed up for improved research. One could, in principle, work from anywhere that has a computer screen and Internet access. Currently, cloud computing environments such as the Amazon EC2 cloud show promise, but remain fairly little used in astronomy. For large data, they are expensive. They are also not developed with scientists in mind as the primary users, subjecting them to issues of usability, long-term stability, etc., similar to those of other rapidly developing technologies such as GPUs (see below).
The Curse of Dimensionality

A well-known problem in data mining is the 'curse of dimensionality'. Because modern astronomical detectors not only detect large numbers of objects, but also enable a lot of parameters (perhaps hundreds) to be measured for each one, many of which are of physical interest, astronomical data are increasingly subject to this problem. One can still make plots, but with N parameters of interest, inspecting all N(N-1)/2 pairwise plots quickly becomes time-consuming. One may reduce this multiplicity somewhat by the application of judicious physical thinking, but this is not always possible. The curse of dimensionality has several aspects.
What is not necessarily clear is whether most astrophysical processes are intrinsically high dimensional, or whether we are just measuring a lot of parameters that can still be explained by a combination of simpler processes. E.g., one could measure dozens of parameters for stars, but they might all be explained by the Hertzsprung-Russell diagram.
However, the use of probability density functions to give, for example, significantly improved measures of distance via photometric redshifts compared with a simpler approach on the exact same data suggests that, in general, better signals will be extracted from higher-dimensional information. Also, while the underlying physical laws may indeed be simple, complex emergent processes that are not yet understood, such as star or galaxy formation, will in general require comparison between high-dimensional observations and equally complex simulations.
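One well-known aspect of the curse, the concentration of distances, can be illustrated with a short sketch (the function and numbers are illustrative only): as the dimension grows, randomly placed points become nearly equidistant, which weakens distance-based methods such as nearest-neighbour searches.

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_contrast(dim, n=200):
    """Relative spread of pairwise distances among n random points in the
    unit hypercube [0, 1]^dim; small values mean near-equidistant points."""
    pts = rng.random((n, dim))
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    d = d[np.triu_indices(n, k=1)]  # keep each unique pair once
    return (d.max() - d.min()) / d.mean()

# The contrast shrinks markedly going from 2 to 100 dimensions.
print(distance_contrast(2), distance_contrast(100))
```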
Graphical Processing Units, and Other Novel Hardware

From circa 1965-2005, computer processor speeds doubled approximately every 18 months. Computer codes simply rode this Moore's law speedup, and ran faster on each succeeding machine. From around 2005, however, heat dissipation rendered further clock speedups impractical, and processors have instead improved by continuing to shrink components and by containing more processor cores, executing in parallel. This means that to continue to speed up on these processors, codes must be parallelized. Even this only goes so far, however, and other forms of hardware provide potentially even greater speedups. Given the exponential increase in data, every possible speedup that still gives the correct answer is desirable. The most prominent alternative hardware is the graphical processing unit (GPU). Driven by the huge resources of the computer gaming industry, these chips were originally designed to render graphics at high speed, but their ability to process vector datatypes at a much higher speed than a regular CPU has rendered them useful for more general applications. More recently, they have also become able to handle floating-point datatypes, first in single precision, and now in the double precision that makes them useful across the range of scientific applications. Such chips are known as general-purpose GPUs, or GPGPUs. The downside of GPUs is that code must be ported to run on them. This may be as simple as wrapping the appropriate single function using an environment designed for GPU programming such as CUDA or OpenCL, but in other cases the algorithm itself may need to be changed, which means rewriting the code. For codes that are part of a large pipeline, or that have a large user base within the community, this means one loses the feedback, updates, and support of those other users. Several papers have appeared demonstrating speedups of astronomy code, from a few times, to over 100x.
The latter means code that took a week now takes an hour. Not every algorithm is suitable for GPU speedup, however.
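One key characteristic of a GPU-friendly algorithm is data parallelism: the same operation applied independently to many elements. A minimal sketch of that pattern (NumPy on the CPU standing in for a GPU kernel; the zeropoint value is an invented illustration):

```python
import numpy as np

def mags_to_flux(mags, zeropoint=25.0):
    """Convert magnitudes to fluxes for every source at once. Each element
    is independent of all the others -- exactly the access pattern that
    maps well onto a GPU's many cores."""
    return 10.0 ** (-0.4 * (mags - zeropoint))

mags = np.linspace(15.0, 25.0, 1_000_000)
flux = mags_to_flux(mags)  # one vectorized 'kernel' over a million elements
print(flux.shape, flux[-1])  # at the zeropoint magnitude, the flux is 1.0
```

An environment such as CUDA or OpenCL would express the same per-element function as a kernel launched over the whole array; algorithms dominated by branching or by communication between elements gain far less.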
Parallel and Distributed Data Mining

As datasets become increasingly large, moving the data is becoming an increasing problem. Despite novel projects such as Globus Online, the fastest way to move a large dataset (e.g., 500 TB) across the country is to FedEx it. This means that to apply data mining on a scale that utilizes such data, one must move the code to the data, rather than download the data. Although it is true that in many cases most of the science comes from a small fraction of the full data volume, e.g., a catalogue instead of an image, in general this is not true, and downloading may still not be the best approach.
Even when the data are in one place, however, applying a data mining algorithm to them may result in intractable computing times, because the algorithms often scale as N^2 for N objects, or worse. This may be alleviated by parallelizing the tasks, or by using faster versions of the algorithms that scale as N log N or better, e.g., by employing kd-trees.
Parallel and distributed data mining has been widely employed in the commercial sector, but so far has been little used in astronomy, because it generally requires porting codes, and is affected by the type of parallelism, whether subsets of the data are independent of each other, and the architecture of the machine, e.g., distributed or shared memory. Incorporating data mining algorithms into parallel systems such as MPI or OpenMP is also not straightforward. Grid computing and crowdsourcing are also approaches likely to become more important in future.
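The map-reduce idea underlying much distributed data mining can be sketched in a few lines (function names invented; threads stand in for the separate machines of a real distributed system): each worker summarizes only its own chunk of the data, and only the small summaries are combined.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def chunk_minmax(chunk):
    """Partial statistic, computed where a chunk of the data lives."""
    return float(chunk.min()), float(chunk.max())

def distributed_minmax(data, nworkers=4):
    # Map: each worker handles its own chunk independently of the others.
    chunks = np.array_split(data, nworkers)
    with ThreadPoolExecutor(nworkers) as ex:
        partial = list(ex.map(chunk_minmax, chunks))
    # Reduce: combine the tiny per-chunk summaries, never the raw data.
    return min(p[0] for p in partial), max(p[1] for p in partial)

print(distributed_minmax(np.arange(1000.0)))  # (0.0, 999.0)
```

The same pattern works for any statistic that decomposes into per-chunk partial results; statistics that need all pairs of objects at once, such as naive clustering, are the hard cases.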
Parallel Programming

The convergence of technologies, from increasing processor cores on the one hand, to the increasing generality of highly parallel chips such as GPUs (see above) on the other, means that, in general, future codes will be executed as several threads running in parallel. The speedup factor is limited by the portion of the code that is not parallel, so ideally this portion is minimized. Parallelization raises numerous potential issues for astronomy codes.
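The limit imposed by the serial portion of a code is quantified by Amdahl's law; a minimal worked sketch (function name and figures chosen for illustration):

```python
def amdahl_speedup(parallel_fraction, n_processors):
    """Amdahl's law: overall speedup when a fraction p of the runtime is
    perfectly parallelized over n processors and the rest remains serial."""
    p = parallel_fraction
    return 1.0 / ((1.0 - p) + p / n_processors)

print(round(amdahl_speedup(0.95, 100), 1))   # 16.8 -- diminishing returns
print(round(amdahl_speedup(0.95, 10**9), 1)) # 20.0 -- the serial-portion ceiling
```

Even with effectively unlimited processors, a 5% serial portion caps the achievable speedup at 20x, which is why minimizing the serial portion matters more than adding cores.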
Petascale and Exascale Computing

Most current astronomy is done using files of sizes measured in megabytes or gigabytes. Many surveys, however, now have dataset sizes in the terascale (10^12 bytes) regime, and some, e.g., Pan-STARRS, have entered the petascale (10^15 bytes) regime. Similarly, most computers have performance measured in gigaflops, but modern supercomputers have performances in the teraflops range, and some have now exceeded 1 petaflops. The trend will continue, and exaflop machines (and exabytes of data) are anticipated by the end of the decade. Astronomical projects already use petascale machines (e.g., cosmological simulations on Roadrunner at Los Alamos National Lab), several petascale surveys are running, being developed, or planned (LSST, etc.), and projects such as the SKA, circa 2020, will enter well into the exascale regime. Petascale and exascale machines have several issues.
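One such issue is that I/O, not raw compute, often dominates at these scales. A back-of-the-envelope sketch (the 10 GB/s sustained bandwidth figure is an assumption for illustration):

```python
def hours_to_read(bytes_total, bandwidth_bytes_per_s):
    """Time for a single sequential pass over a dataset at a given bandwidth."""
    return bytes_total / bandwidth_bytes_per_s / 3600.0

petabyte = 10**15
# At an assumed sustained 10 GB/s, just reading one petabyte once takes
# over a day -- before any computation is done on it.
print(round(hours_to_read(petabyte, 10 * 10**9), 1))  # 27.8
```

This is one arithmetic reason why analyses at this scale must stream over the data in as few passes as possible, and why moving the code to the data is preferred.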
Real-time Processing and the Time Domain

Historical precedent has shown that whenever a new region of parameter space in astronomy is explored, unexpected new discoveries are made (e.g., pulsars). In the parameter space of variability timescale versus brightness, particularly over large regions of the sky, a large area remains unexplored, but in the next decade, large synoptic surveys such as LSST will explore this space. They will greatly increase the number of known variable objects, discover further examples of known rare objects, confirm objects predicted to exist but not yet observed, and likely find unexpected new classes of object. While data mining techniques able to classify static data have been used for several years in astronomy (see section 2), the time domain brings up several further issues.
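One such issue is detecting transient events quickly enough to trigger follow-up. A toy sketch of robust outlier flagging on a light curve (the data and threshold are invented; real alert pipelines are far more sophisticated):

```python
import numpy as np

def flag_transients(flux, nsigma=5.0):
    """Flag epochs deviating by more than nsigma robust standard deviations
    from the median, estimating the scatter from the median absolute
    deviation so the outlier itself does not inflate the threshold."""
    median = np.median(flux)
    sigma = 1.4826 * np.median(np.abs(flux - median))
    return np.abs(flux - median) > nsigma * sigma

lc = 1.0 + 0.01 * np.sin(np.arange(100) / 5.0)  # ordinary low-level variability
lc[42] += 1.0                                   # an injected transient event
print(np.flatnonzero(flag_transients(lc)))      # [42]
```

The real-time constraint is that such a test must run on millions of sources per night, which is where the parallel and distributed techniques above come back in.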
Semantics

For astronomical data to be useful, they must be described by metadata: what telescope was used, when, what was the exposure time, which instrument, what are the units of measurement, and so on. This is crucial not only for the meaningfulness of results, but for their scientific credibility through being reproducible. The FITS file format is popular in part because it stores the associated metadata for observations. It is also useful for the automated components of data analysis to understand this information. Consider searching Google image search for the Eagle Nebula, Messier 16. Typing 'm16' will give you the image, after several pages of pictures of the US military rifle. One can refine the search, but the fundamental limitation remains that only the string of characters, not the concept actually wanted, is being matched. This is an issue because semantic information may be required for meaningful queries to be possible, e.g., one may want to search a compilation of catalogues in the Virtual Observatory for redshifts, which requires the system to know which catalogues contain redshifts. Semantics also allows the idea of annotations, in which knowledge concerning the data is added, e.g., by a user: this object classified as unknown by the algorithm is an artifact caused by the superposition of a star and part of a galaxy. The main issue with semantic astronomy is its practical deployment such that it is useful for the community, allowing new knowledge, i.e., science, to be discovered. Unfortunately, semantics also suffers from a level of abstraction from the data. If the metadata are inaccurate or missing (which is not unheard of), then that is propagated through the analysis. It is also not necessarily simple to define a fast-moving and constantly changing research area with the level of precision required for machines to use the data.
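The string-versus-concept limitation can be sketched in a few lines (the tiny synonym table below is invented for illustration, not a real VO vocabulary): plain string matching retrieves only the literal hits, while expanding the query through known synonyms also retrieves the documents that mean the same object.

```python
# Invented synonym table standing in for a real astronomical ontology.
SYNONYMS = {
    "m16": ["eagle nebula", "messier 16", "ngc 6611"],
}

documents = [
    "Review of the M16 rifle",
    "Star formation in the Eagle Nebula",
    "A survey of NGC 6611",
]

def string_search(query, docs):
    """Literal substring match only."""
    return [d for d in docs if query.lower() in d.lower()]

def semantic_search(query, docs):
    """Match the query or any of its known synonyms."""
    names = [query.lower()] + SYNONYMS.get(query.lower(), [])
    return [d for d in docs if any(n in d.lower() for n in names)]

print(string_search("m16", documents))   # only the rifle article matches
print(semantic_search("m16", documents)) # the nebula documents are found too
```

A real semantic system would go further and use the concept (the nebula, not the rifle) to rank or exclude matches, but even simple synonym expansion shows the gain over raw string matching.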
However, the potential payoff of moving directly up the chain of data -> information -> knowledge -> wisdom will almost certainly outweigh, in its cumulative usefulness, problems in the descriptions of particular datasets. The field is part of a broader effort in computer science to bring about the 'semantic web', and of related efforts in fields such as bioinformatics, and will benefit from progress in these areas.

The Virtual Observatory

Many novel discoveries in astrophysics, especially in the last half century, have been made by combining more than one dataset over different wavelengths. For example, quasars were discovered as radio sources that turned out to match starlike optical objects. The aim of the Virtual Observatory (VO) is to make the plethora of thousands of astronomy datasets interoperable, and to provide corresponding analysis tools, so that the true potential of the data may be realized. Internationally, each of several countries has its own VO, based at one or more data centres. These are funded by that country, and the VOs are federated into the International Virtual Observatory Alliance (IVOA). The IVOA has defined and developed a range of standards such that datasets can indeed be made interoperable. The aim is that regular users do not need to know about these standards, but utilize them transparently. In practice, a basic knowledge of what is going on is likely to be useful: e.g., your RAs and Decs are sent from TOPCAT to DS9 not by magic, but using SAMP, which is a general way to link the output of one application to the input of another.
Analysis tools that have been developed include several visualization programs, the ability to collect data for a given position or object, to construct SEDs over multiple wavelengths, and to find, integrate, and cross-match data from catalogues, time domain tools, e.g., light curves, semantic tools, and tools to run data mining algorithms, e.g., the DAME system.
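The core of a positional cross-match can be sketched as follows (a brute-force version with invented coordinates; real VO tools use indexed schemes such as kd-trees or HEALPix to scale to large catalogues):

```python
import numpy as np

def angular_sep_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees, inputs in degrees (haversine)."""
    ra1, dec1, ra2, dec2 = map(np.radians, (ra1, dec1, ra2, dec2))
    s = (np.sin((dec2 - dec1) / 2) ** 2
         + np.cos(dec1) * np.cos(dec2) * np.sin((ra2 - ra1) / 2) ** 2)
    return np.degrees(2 * np.arcsin(np.sqrt(s)))

def crossmatch(cat1, cat2, radius_arcsec=1.0):
    """For each (RA, Dec) in cat1, index of the nearest cat2 source within
    the match radius, or -1 if none qualifies."""
    matches = []
    for ra, dec in cat1:
        sep = angular_sep_deg(ra, dec, cat2[:, 0], cat2[:, 1]) * 3600.0
        best = int(np.argmin(sep))
        matches.append(best if sep[best] <= radius_arcsec else -1)
    return matches

radio = np.array([[150.000, 2.200], [150.100, 2.300]])
optical = np.array([[150.0001, 2.2001], [151.000, 3.000]])
print(crossmatch(radio, optical))  # [0, -1]: one match, one unmatched source
```

This is the quasar-discovery operation from the paragraph above in miniature: radio positions matched against an optical catalogue within a small angular radius.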
The VO has been in operation since around the year 2000. A huge amount of progress has been made, with many standards defined, and a lot of good, working software. Unfortunately, there is not yet a significant body of science that can be pointed to as being impossible without the VO (although see section 2 of this guide). This means that its reach within the wider astronomical community has been poor, and various misconceptions, such as that it is simply a data repository (which we could have anyway), abound.
There is much promise, however. Various VO schools and outreach programs have been run, and most participants are very positive about their experience. Basic knowledge of the VO and its capabilities will thus become increasingly important.
Visualization of Large, Complex, and High-dimensional Data

Visualization comes in two forms: graphical and scientific. While the former is concerned with presentation and impact, the latter aims to increase qualitative and quantitative understanding of the data by revealing or clarifying patterns not otherwise evident.
An immediate difficulty with visualization is that the human brain is not designed to visualize more than three spatial dimensions. However, astronomy does not utilize even the well-studied research that exists within this limitation, for example, methods for visualizing four variables on a 2D plot using different shaped glyphs overlaid on a colour map, and the most commonly used visualization programs are over two decades old and are not designed for datasets larger than machine memory.
Newer tools are becoming available, including ones compatible with the Virtual Observatory standards, and large applications have been demonstrated, e.g., a system at Swinburne uses GPUs to visualize 1 terabyte ASKAP data cubes. As with many of the subjects discussed in this section, it is likely that continued collaboration with other fields will produce substantial progress.
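A minimal sketch of the four-variables-on-a-2D-plot idea mentioned above (random data and invented variable names; Matplotlib is assumed to be available):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
# Four variables per object: sky position (x, y), plus brightness and colour.
x, y, brightness, colour = rng.random((4, 300))

fig, ax = plt.subplots()
# Encode the third variable as glyph size and the fourth as glyph colour,
# putting four dimensions onto a single 2D plot.
sc = ax.scatter(x, y, s=10 + 90 * brightness, c=colour, cmap="viridis")
fig.colorbar(sc, ax=ax, label="colour")
fig.savefig("four_variables.png")
```

Further channels (glyph shape, orientation, transparency) can encode still more variables, though legibility degrades quickly beyond five or six.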
-- NickBall - 19 Mar 2011
-- NickBall - 02 Oct 2011