IVOA KDD-IG: Template datasets for algorithm benchmarking




Who's interested?

RoyWilliams
CiroDonalek
RaffaeleDAbrusco

ftp://ftp.astro.caltech.edu/users/donalek/DM_templates/

reg_class: data from SDSS-DR7.

Dataset 1: regression problem.
# err_umg err_gmr err_rmi err_imz umg gmr rmi imz z
The first 8 columns contain the features (color errors and colors of galaxies); the last column contains the target (spectroscopic redshift). Any combination of the first 8 columns can be used as input. Training, evaluation (if needed) and test sets must be extracted from this file.
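
Extracting the three sets from the single file could look like the sketch below. It is only an illustration: it assumes the file is whitespace-separated text with a `#` header line, and the function names, the 60/20/20 split fractions and the fixed seed are hypothetical choices, not part of the dataset definition.

```python
import random

# Column names taken from the header line of Dataset 1.
COLUMNS = ["err_umg", "err_gmr", "err_rmi", "err_imz",
           "umg", "gmr", "rmi", "imz", "z"]


def parse_rows(lines):
    """Parse whitespace-separated rows, skipping blank and '#' comment lines.

    Assumption: the file is plain text with one source per line.
    """
    rows = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        values = [float(tok) for tok in line.split()]
        if len(values) != len(COLUMNS):
            raise ValueError(
                f"expected {len(COLUMNS)} columns, got {len(values)}")
        rows.append(values)
    return rows


def split_rows(rows, frac_train=0.6, frac_valid=0.2, seed=0):
    """Shuffle rows reproducibly, then cut into train/validation/test lists.

    The fractions are illustrative; whatever is left after the training and
    validation cuts becomes the test set.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * frac_train)
    n_valid = int(len(shuffled) * frac_valid)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test


def features_and_target(rows):
    """Separate the 8 feature columns from the final target column (z)."""
    features = [row[:-1] for row in rows]
    target = [row[-1] for row in rows]
    return features, target
```

With a local copy of the file, `parse_rows(open(path))` followed by `split_rows(...)` gives the three sets. Fixing the seed keeps the split identical across runs, which matters when different algorithms are benchmarked on the same data.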

******************************

Dataset 2: classification problem.
# umg gmr rmi imz specClass
The first 4 columns contain the features (colors of stellar sources), the last column contains the target (spectroscopic classification). Training, evaluation (if needed) and test sets must be extracted from this file.
The target values are (0, 1, 2, 3, 4, 6).
While a classification into separate classes, one for each distinct value of the target, would be preferable, a coarser classification into two classes (namely, target = (0, 1, 2, 6) for stars and galaxies, and target = (3, 4) for quasars) would be interesting and useful as well.
Classes:
0 -> unknown source
1 -> star
2 -> galaxy
3 -> quasar
4 -> high-redshift quasar
5 -> artifact
6 -> late-type star
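
The two-class grouping described above can be written as a small lookup. This is a sketch only: the names `CLASS_NAMES`, `QUASAR_CODES`, `NON_QUASAR_CODES` and `coarse_label` are hypothetical, as is the choice of 1 for quasars and 0 for everything else. Code 5 (artifact) is not among the listed target values or in either group, so the sketch rejects it rather than guessing.

```python
# Class codes as listed for Dataset 2.
CLASS_NAMES = {
    0: "unknown source",
    1: "star",
    2: "galaxy",
    3: "quasar",
    4: "high redshift quasar",
    5: "artifact",
    6: "late type star",
}

# Coarse grouping from the dataset description.
QUASAR_CODES = {3, 4}
NON_QUASAR_CODES = {0, 1, 2, 6}


def coarse_label(spec_class):
    """Map a specClass code to 1 (quasar) or 0 (star, galaxy, other).

    Raises ValueError for codes outside both groups, e.g. 5 (artifact).
    """
    code = int(spec_class)
    if code in QUASAR_CODES:
        return 1
    if code in NON_QUASAR_CODES:
        return 0
    raise ValueError(f"specClass {code} is not part of the coarse grouping")
```

For example, `[coarse_label(c) for c in (0, 1, 2, 3, 4, 6)]` yields one binary label per target value.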

******************************

Dataset 3: regression problem.
# err_umg err_gmr err_rmi err_imz umg gmr rmi imz zspec
This dataset is similar to dataset 1, except that the sources are quasars, not galaxies. The same rules apply.


Notes

I am going to add more datasets... -- Ciro

It seems to me that these two points are intimately related: (3) the templates and (5) working with IVOA, meaning the choice of data models and formats. I see the following as interesting data objects for the KDD-IG:

  • Catalog of sources, each with ID, position, magnitudes, ... more.
  • Light curve of source, with time, magnitude, upper-limit observations, ... more.
  • Image, with WCS, calibration, ... more.

In each case, there is an IVOA approach to these things through the spectrum data model (VOTable + Utypes). There are also other formats for these things that are not IVOA approved. And of course there is always a call for the stripped-down CSV (just gimme the data). The archive can have many format choices (so what are they?).

One could also say that the formatting is easy once the data is in the database, which avoids choice of a specific syntax. But the semantics of that database schema must enable the semantics of every proposed output format. -- Roy
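
As a rough illustration of the point about one schema feeding several output formats, the sketch below keeps a catalog source as a typed record and derives the stripped-down CSV from it. The record `CatalogSource`, its field names and the function `to_csv` are hypothetical and chosen only for illustration; they do not come from any IVOA data model, and a VOTable writer would be a second function over the same record.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class CatalogSource:
    """Hypothetical minimal schema for one catalog entry."""
    source_id: str
    ra_deg: float   # right ascension in degrees (assumed unit)
    dec_deg: float  # declination in degrees (assumed unit)
    mag_r: float    # r-band magnitude (assumed band)


def to_csv(sources):
    """Serialize sources to CSV text, taking the header from the schema."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=[f.name for f in fields(CatalogSource)],
        lineterminator="\n",
    )
    writer.writeheader()
    for source in sources:
        writer.writerow(asdict(source))
    return buf.getvalue()
```

Because the column list is read from the record definition, adding a field to the schema changes every output format derived from it in one place.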


Topic revision: r9 - 2011-05-04 - CiroDonalek
 