reg_class: data from SDSS-DR7.
Dataset 1: regression problem.
# err_umg err_gmr err_rmi err_imz umg gmr rmi imz z
The first 8 columns contain the features (color errors and colors of galaxies), the last contains the target (spectroscopic redshifts).
Any combination of the first 8 columns can be used as input features.
Training, evaluation (if needed) and test sets must be extracted from this file.
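A minimal sketch of how the training, evaluation and test sets could be extracted from one of the regression files. It assumes the file is whitespace-separated with `#` header lines, which the column listing above suggests but does not confirm. The function names, the 60/20/20 split and the fixed seed are illustrative choices, not part of the dataset description.

```python
import random


def load_table(path):
    """Read a whitespace-separated numeric table, skipping blank and '#' lines.

    Assumption: the file layout matches the '# err_umg ... z' header shown above.
    """
    rows = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            rows.append([float(value) for value in line.split()])
    return rows


def split_rows(rows, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle a copy of the rows and cut it into train, evaluation and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = int(train_frac * len(shuffled))
    n_val = int(val_frac * len(shuffled))
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


def features_and_target(rows):
    """Separate the feature columns from the last column (the target)."""
    features = [row[:-1] for row in rows]
    target = [row[-1] for row in rows]
    return features, target
```

Typical use would be `train, val, test = split_rows(load_table(path))` followed by `features_and_target(train)`, where `path` points to the downloaded file. Fixing the seed keeps the split reproducible, so results from different methods can be compared on the same test set.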
******************************
Dataset 2: classification problem.
# umg gmr rmi imz specClass
The first 4 columns contain the features (colors of stellar sources), the last column contains the target (spectroscopic classification).
Training, evaluation (if needed) and test sets must be extracted from this file.
The target values are (0, 1, 2, 3, 4, 6).
While a classification into a separate class for each distinct target value would be preferable,
a coarser classification into two classes (namely, target = (0, 1, 2, 6) for stars, galaxies and unknown sources, and target = (3, 4) for quasars) would be interesting and useful as well.
Classes:
0 -> unknown source
1 -> star
2 -> galaxy
3 -> quasar
4 -> high-redshift quasar
5 -> artifact
6 -> late-type star
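The two-class grouping mentioned above can be sketched as a simple label mapping. This assumes that the second group, (3, 4), is the quasar group, as the class list suggests. The names `QUASAR_CLASSES`, `NON_QUASAR_CLASSES` and `to_binary_label` are hypothetical, and rejecting class 5 (artifact) is a choice made here because that value is not among the listed target values.

```python
# Grouping taken from the dataset description: (3, 4) versus (0, 1, 2, 6).
QUASAR_CLASSES = {3, 4}
NON_QUASAR_CLASSES = {0, 1, 2, 6}


def to_binary_label(spec_class):
    """Map a specClass value to 1 (quasar) or 0 (star, galaxy or unknown source)."""
    spec_class = int(spec_class)
    if spec_class in QUASAR_CLASSES:
        return 1
    if spec_class in NON_QUASAR_CLASSES:
        return 0
    # Class 5 (artifact) or any other value is not covered by the two groups.
    raise ValueError(f"unexpected specClass value: {spec_class}")
```

Applying `to_binary_label` to the last column of each row gives the binary target, while the original values remain available for the finer multi-class problem.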
******************************
Dataset 3: regression problem.
# err_umg err_gmr err_rmi err_imz umg gmr rmi imz zspec
This dataset is similar to dataset 1, except that the sources are quasars rather than galaxies. The same rules apply.
I am going to add more datasets... -- Ciro
It seems to me that two of the points below are intimately related: (3) the templates and (5) working with IVOA, meaning the choice of data models and formats. I see the following as interesting data objects for the KDDIG:
In each case, there is an IVOA approach to these things through the spectrum data model (VOTable + Utypes). There are also other formats for these things that are not IVOA approved. And of course there is always a call for the stripped-down CSV (just gimme the data). The archive can have many format choices (so what are they?).
One could also say that the formatting is easy once the data is in the database, which avoids committing to a specific syntax. But the semantics of that database schema must be rich enough to support the semantics of every proposed output format. -- Roy