IVOA KDD-IG: A user guide for Data Mining in Astronomy

9: Worked example

We illustrate some of the concepts described in this guide by describing and annotating the raw-data-to-science process from an existing paper, "Robust Machine Learning Applied to Astronomical Datasets. I. Star-Galaxy Classification of the Sloan Digital Sky Survey Using Decision Trees" (Ball et al. 2006).

We frame the example in terms of the headings in section 3 of the guide.

The abstract of the paper is:

"We provide classifications for all 143 million nonrepeat photometric objects in the Third Data Release of the SDSS using decision trees trained on 477,068 objects with SDSS spectroscopic data. We demonstrate that these star/galaxy classifications are expected to be reliable for approximately 22 million objects with r < ~20. The general machine learning environment Data-to-Knowledge and supercomputing resources enabled extensive investigation of the decision tree parameter space. This work presents the first public release of objects classified in this way for an entire SDSS data release. The objects are classified as either galaxy, star, or nsng (neither star nor galaxy), with an associated probability for each class. To demonstrate how to effectively make use of these classifications, we perform several important tests. First, we detail selection criteria within the probability space defined by the three classes to extract samples of stars and galaxies to a given completeness and efficiency. Second, we investigate the efficacy of the classifications and the effect of extrapolating from the spectroscopic regime by performing blind tests on objects in the SDSS, 2dFGRS, and 2QZ surveys. Given the photometric limits of our spectroscopic training data, we effectively begin to extrapolate past our star-galaxy training set at r ~ 18. By comparing the number counts of our training sample with the classified sources, however, we find that our efficiencies appear to remain robust to r ~ 20. As a result, we expect our classifications to be accurate for 900,000 galaxies and 6.7 million stars and remain robust via extrapolation for a total of 8.0 million galaxies and 13.9 million stars."

Data collection

As alluded to in section 3, this paper is quite typical of a data mining analysis, because the data were collected independently. Here, the data are the Third Data Release of the Sloan Digital Sky Survey.

Nevertheless, some additional work was involved in data collection because, at the time, it was more practical for the Illinois Laboratory for Data Mining to obtain, by special arrangement, and maintain locally a copy of the SDSS database than to query the public one for 143 million objects.

The work was also done in collaboration with machine learning specialists in the Automated Learning Group (ALG) at the National Center for Supercomputing Applications (NCSA), who wrote and maintained the 'Data to Knowledge' program used here. We used a version of the code with customized extensions written by an ALG member that allowed streaming of our large data.

The work also utilized the NCSA supercomputing facilities, to which we had access via time that was competitively allocated nationally via proposals.

Data pre-processing

Sloan catalogue data are supplied reduced and calibrated from the images, but beyond that they remain in relatively raw form, so sample cuts were made on the numerous processing flags. The purpose was to remove unphysical data from the inputs while retaining a high completeness of real objects.

The form of the data used was ASCII, with one row per object and one column per object feature, e.g., RA, dec, magnitude. ASCII (rather than, e.g., FITS) was used so that the data, being much larger than machine memory, could be streamed, and because it was readable by the D2K program.
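The streaming approach can be sketched in a few lines of Python. The file path and column layout here are hypothetical, and the actual streaming was done within the D2K framework rather than in Python; this only illustrates the idea of processing one row at a time so the full catalogue never needs to fit in memory.

```python
def stream_objects(path):
    """Yield one object per catalogue row as a list of floats."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):  # skip blanks and comments
                continue
            yield [float(x) for x in line.split()]

def running_mean(path, column):
    """Accumulate a running mean of one column without loading the file."""
    total, n = 0.0, 0
    for row in stream_objects(path):
        total += row[column]
        n += 1
    return total / n if n else float("nan")
```

Because each row is discarded as soon as it is processed, memory use is constant regardless of catalogue size.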

Because the number of possible combinations of cuts multiplies exponentially, astronomical domain knowledge was required to select what was reasonable. Compared to many analyses, however, fewer cuts were applied here, especially to the testing set, because (a) each classification is independent of the others, and (b) the spectroscopic classifications are generally (although, it turns out, not completely) reliable.

The object features used were the PSF, fiber, Petrosian, and model magnitudes in the five ugriz passbands, with differences (u-g, g-r, etc.) taken to form colours. Each magnitude type contains different information (e.g., the SDSS basic star-galaxy separator is psfMag - cModelMag), but, as one would expect, the magnitudes are highly correlated.

Cuts included:

- Primary objects: photoPrimary SQL database view; exclude any repeat observations of an object.
- Apply correction for galactic extinction using the SDSS-supplied SFD '98 values.
- Remove non-physical values, e.g., -9999. Bad values in the data are generally given a sentinel number such as this rather than, e.g., NaN; left in place, they would propagate through the data mining process and distort the results. Colours were also restricted to -40 < colour < 40.

resulting in 142,705,734 objects.
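The sentinel-value and colour cuts above can be sketched as follows. The sentinel value (-9999) and the colour range follow the text, but the per-object data layout (a dict of ugriz magnitudes) is purely illustrative.

```python
SENTINEL = -9999.0  # SDSS-style sentinel for bad values (not NaN)

def colours(mags):
    """Form adjacent-band colours u-g, g-r, r-i, i-z from magnitudes."""
    bands = "ugriz"
    return {f"{a}-{b}": mags[a] - mags[b] for a, b in zip(bands, bands[1:])}

def passes_cuts(mags):
    """Reject objects with sentinel magnitudes or unphysical colours."""
    if any(m == SENTINEL for m in mags.values()):
        return False
    return all(-40.0 < c < 40.0 for c in colours(mags).values())
```

Applying such a filter while streaming (rather than after loading) keeps the pre-processing within the same constant-memory pipeline.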

For the training set:

- specObj SQL database view
- We assume the spectroscopic classifications have no error

resulting in 477,068 objects. This is not perfect, but elucidating misclassifications in the spectra for the extremely small number of objects for which this may occur was beyond the scope of this paper, and decision trees are less affected by such errors than some methods.

Many of these cuts make the assumption that the data within the SDSS DR3 are correct. At some level this is unavoidable, e.g., it may be unreasonable to verify data beyond a certain point, and one also reasonably assumes that all levels of one's computer operating system and software are working correctly.

Model Building

Our philosophy was to provide probabilistic classifications for the broadest reasonable set of objects, allowing users of the classifications to impose further cuts after the fact, as dictated by science requirements. For example, calculation of a galaxy correlation function using the objects classified as galaxies would require a low contamination from stars, possibly at the expense of completeness, but searching for, e.g., candidate objects for gravitational lensing, using the nsng classification, would require a high completeness, while not being overly concerned with contamination. More generally, a user may simply want a maximal combination of completeness and efficiency. Users may also verify the classifications at a later date, and this was indeed done (see below).

We chose decision trees for the model, because of their general robustness to real-world data, e.g., unevenly sampled training data (which is the case here), outlying or irrelevant values, and correlated features such as magnitudes or colours. They were also readily available as part of the D2K software, and the ALG group at NCSA had expertise in them.

The D2K software and supercomputing resources allowed us to test several variants in decision tree construction, resulting in the construction of several thousand trees. Tree parameters included maximum tree depth, minimum decomposition population within a node, maximum number of nodes, minimum error reduction for node splitting, and so on. We also varied the ratio of the sizes of the training and testing sets, the fraction of objects used for bagging (see below), and the random seed for selecting subsamples.
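The sweep over tree parameters amounts to enumerating and training every combination of a parameter grid. A minimal sketch of the grid enumeration is below; the parameter names and values are illustrative, not the actual D2K settings used in the paper.

```python
from itertools import product

def parameter_grid(**options):
    """Yield one parameter dict per combination of the supplied lists."""
    names = list(options)
    for values in product(*(options[n] for n in names)):
        yield dict(zip(names, values))

grid = list(parameter_grid(
    max_depth=[10, 20, None],          # maximum tree depth
    min_node_population=[2, 10, 50],   # minimum decomposition population
    min_error_reduction=[0.0, 1e-4],   # minimum error reduction to split
))
# 3 * 3 * 2 = 18 parameter combinations to train and compare
```

Each dict in the grid would be handed to one training run, which is why supercomputing resources were useful: the runs are independent and parallelize trivially.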

All evidence, including visualizing the parameter space using the Partiview 3D software (also written at NCSA), suggested that a range of decision tree parameters gave a close approximation to the minimal error: a tree close to optimal for these data was quite easy to find, with many runs giving a similar ~3% error. The training sets were also easily large enough for the task (subsets gave similar results). This may or may not be typical for astronomy, depending on the data and the application.

We chose not to prune the trees because our main concern was the prediction of new classes, not elucidating the particular set of rules by which they were arrived at.

We chose three classes because any object that is not a star or galaxy at high probability, e.g., a quasar, is of potential astrophysical interest. This differs from the usual star-galaxy separation into two classes.

The tree provides classification probabilities by counting the fraction of each object type in an object's leaf node, and assigning that value as the probability.
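This leaf-fraction probability can be sketched directly; the class labels are illustrative.

```python
from collections import Counter

def leaf_probabilities(leaf_classes):
    """Given the training classes in one leaf node, return P(class)
    as the fraction of each class among the leaf's training objects."""
    counts = Counter(leaf_classes)
    n = len(leaf_classes)
    return {cls: count / n for cls, count in counts.items()}
```

For example, a leaf containing eight galaxies and two stars assigns any object landing in it P(galaxy) = 0.8 and P(star) = 0.2. Note that this makes the probabilities discretized, a point that matters later in the deployment section.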

Model Validation

We took 80% of our training data, and optimized the trees by minimizing the usual sum-of-squares error between the training set classifications and those assigned by the tree. 10-fold bagging was employed, as this improved the results.
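The 10-fold bagging can be sketched generically: train one model per bootstrap resample of the training set, then average the predicted probabilities. Here `train` and `predict` stand in for the D2K decision tree, which is not reproduced.

```python
import random

def bag(train, data, n_bags=10, seed=0):
    """Return one model per bootstrap resample (with replacement) of `data`."""
    rng = random.Random(seed)
    return [train([rng.choice(data) for _ in data]) for _ in range(n_bags)]

def bagged_probability(models, predict, x):
    """Average the probability each bagged model assigns to object x."""
    return sum(predict(m, x) for m in models) / len(models)
```

Averaging over resamples smooths the leaf-fraction probabilities and reduces the variance of any single tree, which is why it improved the results here.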

We then quoted results for the remaining 20% using the optimal tree. Because the training set was easily large enough, this pseudo-blind test was a useful check.

Our training set was limited to r ~ 18. Nevertheless, we wished to provide classifications for the whole survey, to r ~ 22, four magnitudes fainter. This sounds problematic at first glance, but this extrapolation in magnitude, astrophysically, does not necessarily imply a large extrapolation in colour, and it is colours that we were using.

Therefore, for objects fainter than the training set, we blind-tested the trained decision trees on two other, independent sky surveys: the 2dFGRS for galaxies, and the 2QZ for quasars. These, again, required sample cuts guided by astrophysical knowledge, e.g., classes 11, 12, 21, and 22 in 2QZ, i.e., best or second-best classification and redshift. The surveys were cross-matched using a tolerance of 2 arcseconds. A basic cross-match sufficed because the object spacing on the sky is such that the number of objects that would be confused with others is negligible.
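A basic positional cross-match of the kind described can be sketched as below. The brute-force search is quadratic and only suitable for small lists; a match over full surveys would use a spatially indexed method, but the tolerance logic is the same.

```python
import math

TOL_ARCSEC = 2.0  # matching tolerance from the text

def angular_sep_arcsec(ra1, dec1, ra2, dec2):
    """Approximate small-angle separation in arcsec (RA/dec in degrees)."""
    dra = (ra1 - ra2) * math.cos(math.radians(0.5 * (dec1 + dec2)))
    ddec = dec1 - dec2
    return math.hypot(dra, ddec) * 3600.0

def cross_match(cat1, cat2, tol=TOL_ARCSEC):
    """Return (i, j) index pairs: each cat1 object's nearest cat2
    neighbour, kept only if it lies within `tol` arcsec."""
    pairs = []
    for i, (ra1, dec1) in enumerate(cat1):
        best, best_sep = None, tol
        for j, (ra2, dec2) in enumerate(cat2):
            sep = angular_sep_arcsec(ra1, dec1, ra2, dec2)
            if sep <= best_sep:
                best, best_sep = j, sep
        if best is not None:
            pairs.append((i, best))
    return pairs
```

Keeping only the nearest neighbour within tolerance is what makes the match "basic": it assumes, as the text notes, that confusion between neighbouring objects is negligible at these source densities.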

We detailed how to select a sample of given completeness and efficiency using the output probabilities, finding that the optimal completeness and efficiency for each type was obtained with a probability threshold close to 50%, as one might expect.
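Selecting a sample by probability threshold, and measuring the resulting completeness and efficiency against known classes, can be sketched as follows; the parallel lists of probabilities and true classes are illustrative.

```python
def completeness_efficiency(probs, truths, target, threshold):
    """Completeness and efficiency of selecting objects of class
    `target` by requiring P(target) >= threshold.

    completeness: fraction of true `target` objects recovered
    efficiency:   fraction of selected objects that are truly `target`
    """
    selected = [t for p, t in zip(probs, truths) if p >= threshold]
    n_true = sum(1 for t in truths if t == target)
    n_sel_true = sum(1 for t in selected if t == target)
    completeness = n_sel_true / n_true if n_true else 0.0
    efficiency = n_sel_true / len(selected) if selected else 0.0
    return completeness, efficiency
```

Sweeping the threshold trades the two quantities against each other; as noted above, the optimum for these data fell close to a threshold of 50%.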

A more recent paper by Vasconcellos et al. (2011) showed that the classifications are indeed reliable, to r ~ 21.

Model Deployment

Finally, the model was deployed on the full SDSS DR3 dataset of 143 million objects.

To run the trees for a dataset of this size required construction of scripts to submit jobs on the supercomputing cluster to run subsets of the data in parallel. The bookkeeping for this was not unduly difficult, but it was not trivial.
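The job-splitting step can be sketched as below: divide the catalogue into numbered chunk files, one per cluster job. The paths and chunk size are illustrative; the actual scripts also drove the site's batch scheduler, which is omitted here.

```python
import os

def split_catalogue(path, out_dir, rows_per_chunk):
    """Write `path` into numbered chunk files; return the chunk paths."""
    os.makedirs(out_dir, exist_ok=True)
    chunks, buf, idx = [], [], 0

    def flush():
        nonlocal buf, idx
        if buf:
            chunk_path = os.path.join(out_dir, f"chunk_{idx:04d}.txt")
            with open(chunk_path, "w") as out:
                out.writelines(buf)
            chunks.append(chunk_path)
            buf, idx = [], idx + 1

    with open(path) as f:
        for line in f:
            buf.append(line)
            if len(buf) == rows_per_chunk:
                flush()
    flush()  # write any final partial chunk
    return chunks
```

The bookkeeping the text mentions is then largely a matter of tracking which chunk files have produced output and re-submitting any that failed.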

The data were made available, although we did not have the resources locally to supply them as a database. (In fact, they remain available, at http://www.cadc-ccda.hia-iha.nrc-cnrc.gc.ca/community/ml_archive ).


In dealing with large data such as this, unexpected issues are encountered. Some examples not already described are:

Because the output probabilities are discretized (due to the finite numbers of objects in the nodes), we used the Python 'decimal' data type to avoid incorrect bin counts caused by finite floating point precision. Approximately 0.2% of objects (285,839) had equal highest probability classifications. (Python was used only in portions of the analysis where speed was not important; D2K itself is written in Java.)
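The exact-binning idea can be sketched with the 'decimal' module: by parsing the probabilities as Decimal strings and binning in Decimal arithmetic, a value such as 0.3 lands in the intended bin rather than drifting across a bin edge through binary floating point error. The bin width here is illustrative.

```python
from collections import Counter
from decimal import Decimal

def bin_probabilities(prob_strings, width="0.1"):
    """Count probabilities (supplied as strings) into exact Decimal bins,
    keyed by each bin's lower edge."""
    w = Decimal(width)
    counts = Counter()
    for s in prob_strings:
        p = Decimal(s)
        counts[(p // w) * w] += 1  # floor to the bin's lower edge, exactly
    return counts
```

Passing the probabilities as strings is deliberate: converting through a binary float first would reintroduce exactly the rounding error the Decimal type is meant to avoid.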

Different versions of our copy of the SDSS database were not identical, which was traced to a single SDSS run. The author does not recall the cause of this, or whose fault, if any, it was.

The bagging was limited by the memory on the NCSA supercomputer (tungsten) nodes.

The D2K extensions were not documented as extensively as the main software, necessitating communication at various points.

Various systems, e.g., the supercomputer, mass storage system, etc., are generally reliable, but uptime is not 100%. A given supercomputer eventually disappears and is replaced by a new one, sometimes with not much notice, e.g., a month or two.

At the end of a long run, "no such file or directory" is not the correct result.


The example of this paper shows that the practical application of a relatively well-known data mining algorithm to a large dataset can produce novel and scientifically interesting results. In this case, the results were aided by collaboration with both machine learning and supercomputing experts.


  1. Ball N.M. et al., "Robust Machine Learning Applied to Astronomical Datasets. I. Star-Galaxy Classification of the Sloan Digital Sky Survey Using Decision Trees", ApJ 650 497 (2006)
  2. Vasconcellos E.C. et al., "Decision Tree Classifiers for Star/Galaxy Separation", AJ 141 189 (2011)

-- NickBall - 19 Mar 2011
