Proposed Extensions to VOTable 1.0

Document 1x (27 Sep 2003)

Document repository: http://cdsweb.u-strasbg.fr/doc/VOTable/
Comments: votable(at)ivoa.net


Contents:

  1. Introduction
    1. Example
  2. Proposed Extensions
  3. Metadata improvements
    1. The utype attribute
    2. The <GROUP> proposal
    3. Arrays of variable-length strings
  4. Diversified data streaming
    1. The type="location" attribute
    2. The encoding attribute in <TD>


1  Introduction

The VOTable format is a proposed XML standard for representing tabular data in the context of the Virtual Observatory (VO); its version 1.0, available from http://www.ivoa.net/twiki/bin/view/IVOA/IvoaVOTable, defines the basic layout and the relations with existing data formats such as FITS tables.

In our context, the specificity of a VOTable lies in the way the metadata (data describing the data) are organized, with the aim of allowing VO tools to interpret automatically data coming from a wide variety of sources. The example below, slightly expanded from the example of version 1.0, shows the metadata (located between the <TABLE> and <DATA> markers), which essentially consist of a set of <FIELD>s; the data follow, expressed in the example as an XML-formatted <TABLEDATA> structure. In the terms of the semantic web community, the organization of a VOTable can be viewed as a definition of the properties of the tabular entity (the metadata), followed by the values of these properties in the many instances of the entity (the rows).

1.1  Example

This simple example of a VOTable document lists 3 galaxies with their velocity, distance, and literature references where velocity measurements have been published.
<?xml version="1.0"?>
<VOTABLE version="1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:noNamespaceSchemaLocation="http://vizier.u-strasbg.fr/xml/VOTable.xsd">
  <DEFINITIONS>
  <COOSYS ID="myJ2000" equinox="2000." epoch="2000." system="eq_FK5"/>
  </DEFINITIONS>
  <RESOURCE name="myFavouriteGalaxies">
    <TABLE name="results">
      <DESCRIPTION>Velocities and Distance estimations</DESCRIPTION>
      <FIELD name="RA" ucd="POS_EQ_RA_MAIN" ref="myJ2000" datatype="float"
             width="6" precision="2" unit="deg"/>
      <FIELD name="Dec" ucd="POS_EQ_DEC_MAIN" ref="myJ2000" datatype="float"
             width="6" precision="2" unit="deg"/>
      <FIELD name="Name" ucd="ID_MAIN" datatype="char" arraysize="8*"/>
      <FIELD name="RVel" ucd="VELOC_HC" datatype="float" 
             width="5" unit="km/s"/>
      <FIELD name="R" ucd="PHYS_DISTANCE_TRUE" datatype="float" 
             width="4" precision="1" unit="Mpc">
        <DESCRIPTION>Distance of Galaxy, assuming H=75km/s/Mpc</DESCRIPTION>
      </FIELD>
      <FIELD name="references" ucd="REFER_BIBCODE" datatype="char" 
             arraysize="20x*"/>
      <DATA>
        <TABLEDATA>
        <TR><TD>010.68</TD><TD>+41.27</TD><TD>N  224</TD><TD>-192</TD>
            <TD>0.7</TD><TD>1995ApJS...98..477H 1997ApJS..112..315H</TD>
        </TR>
        <TR><TD>287.43</TD><TD>-63.85</TD><TD>N 6744</TD><TD>842</TD>
            <TD>10.4</TD><TD>1995ApJS...96..123T</TD>
        </TR>
        <TR><TD>023.48</TD><TD>+30.66</TD><TD>N  598</TD><TD>-182</TD>
            <TD>0.7</TD><TD>1997ApJS..112..315H 1973UGC...C......0N</TD>
        </TR>
        </TABLEDATA>
      </DATA>
    </TABLE>
  </RESOURCE>
</VOTABLE>

2  Proposed Extensions

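VOTable 1.0 has proved to be very useful, and the discussions concerning the limitations found in its usage have led to the following proposed modifications or additions:

  1. the utype attribute (section 3.1)
  2. the <GROUP> element (section 3.2)
  3. arrays of variable-length strings (section 3.3)
  4. diversified data streaming (section 4)

The first 3 items are actually additional features to better express the metadata part, while the last expresses a wish to use VOTable as an interface to data coded in various forms.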

3  Metadata improvements

3.1   The utype attribute

In some contexts, it may be important that <FIELD>s are explicitly designated as parameters of a given data model. For instance, it might be important for an application to know that a given <FIELD> expresses the surface brightness processed by an explicit method. None of the existing name, ID or ucd attributes can fill this role, and we therefore propose the addition of a utype attribute. The respective roles of these attributes are:

It was indeed proposed during the discussions on UCDs that the ucd attribute could be replaced by a pointer to some data model in the future; in practice it seems hardly possible for the UCD to convey a global meaning enabling global interoperability and, at the same time, to define precisely which parameter it represents in the context of a data model. The utype attribute is a simple solution to this dilemma.

Since <FIELD> and <PARAMETER> share the same set of attributes (with the exception of the value attribute), it is proposed that the <PARAMETER> entity can also carry a utype attribute.
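
As a purely illustrative sketch, a <FIELD> carrying a utype attribute might look like the following. The utype value and its "sbmodel:" prefix are hypothetical placeholders (no data model vocabulary is defined in this proposal), and the field name, UCD and unit are likewise assumptions chosen for the surface-brightness case mentioned above.

```xml
<!-- Hypothetical example: the utype value below is an invented placeholder,
     not a term from any published data model -->
<FIELD name="SuBr" ucd="PHOT_SB_GENERAL" utype="sbmodel:SurfaceBrightness.Isophotal"
       datatype="float" width="5" precision="2" unit="mag/arcsec2"/>
```

In this sketch the ucd keeps its role of conveying the global meaning of the column, while the utype would tell an application aware of the (assumed) data model exactly which of its parameters the column represents.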

3.2   The <GROUP> proposal

The <GROUP> tag is proposed to group together a set of <FIELD>s which are logically related, like a value and its error. The fields participating in a <GROUP> can be defined either physically (the field is defined within the group), or logically, for fields merely referenced in the group via the ref attribute (which references the ID attribute of another field); the same physical field (i.e. a single column of the table) may therefore participate in several groups.

A straightforward example of a group is:

    <GROUP name="Velocity" ucd="VELOC_HC">
      <DESCRIPTION>Velocity and its error</DESCRIPTION>
      <FIELD name="RVel" ucd="VELOC_HC" datatype="float" 
             width="5" unit="km/s"/>
      <FIELD name="e_RVel" ucd="ERROR" datatype="float" 
             width="3" unit="km/s"/>
    </GROUP>

The <GROUP> entity can have the name, ID, ucd, utype and ref attributes. It can include a <DESCRIPTION>, <FIELD>s, <PARAMETER>s, and other <GROUP>s; this recursive grouping enables the definition of arbitrarily complex structures.
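
Logical grouping by reference and recursive nesting can be sketched as follows. This is only an illustration: it assumes that a field referenced from a group is written as an empty <FIELD> carrying a ref attribute, and that <GROUP>s sit next to the <FIELD>s inside the <TABLE>. The ID values col_RVel and col_R and the group names are invented for the example; the H0 value is the one quoted in the description of the R field in section 1.1.

```xml
<TABLE name="results">
  <!-- Columns are defined once, each with an ID that groups can reference -->
  <FIELD ID="col_RVel" name="RVel" ucd="VELOC_HC" datatype="float"
         width="5" unit="km/s"/>
  <FIELD ID="col_R" name="R" ucd="PHYS_DISTANCE_TRUE" datatype="float"
         width="4" precision="1" unit="Mpc"/>

  <!-- Logical group: refers to the velocity column without redefining it -->
  <GROUP name="Observed">
    <DESCRIPTION>Directly measured quantities</DESCRIPTION>
    <FIELD ref="col_RVel"/>
  </GROUP>

  <!-- The same column takes part in a second group, which contains a nested group -->
  <GROUP name="DistanceEstimate">
    <DESCRIPTION>Distance estimate and related quantities</DESCRIPTION>
    <FIELD ref="col_RVel"/>
    <GROUP name="Result">
      <FIELD ref="col_R"/>
      <PARAMETER name="H0" datatype="float" unit="km/s/Mpc" value="75"/>
    </GROUP>
  </GROUP>
</TABLE>
```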

The possibility of adding <PARAMETER>s to groups also makes it possible to describe parameters more accurately, and is an alternative to the proposal of parametrized UCDs. For instance, it is possible to describe the actual frequency of a radio survey with:

    <GROUP name="Flux" ucd="PHOT_FLUX_RADIO_400M">
      <DESCRIPTION>Flux measured at </DESCRIPTION>
      <FIELD name="Flux" ucd="PHOT_FLUX_RADIO_400M" datatype="float" 
             width="6" precision="1" unit="mJy"/>
      <PARAMETER name="Freq" ucd="OBS_FREQUENCY" unit="MHz" datatype="float"
             value="352"/>
      <FIELD name="e_Flux" ucd="ERROR" datatype="float" width="4" 
             precision="1" unit="mJy"/>
    </GROUP>

Similarly, the <GROUP> can be used to associate several parameters with one or several <FIELD>s: a filter may for instance be characterized by the central wavelength and the FWHM of its transmission curve, or several parameters of an instrument setup may be detailed.
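
For the filter case, a sketch could look like the following. The field, the parameter names and the numerical values (rounded figures for a V-band filter) are assumptions made for illustration, not values taken from an actual catalogue.

```xml
<GROUP name="Vphot">
  <DESCRIPTION>V magnitude with the characteristics of the filter used</DESCRIPTION>
  <FIELD name="Vmag" ucd="PHOT_JHN_V" datatype="float"
         width="5" precision="2" unit="mag"/>
  <!-- Illustrative, rounded filter characteristics -->
  <PARAMETER name="lambda0" datatype="float" unit="nm" value="550"/>
  <PARAMETER name="FWHM" datatype="float" unit="nm" value="90"/>
</GROUP>
```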

3.3  Arrays of variable-length strings

Following the FITS conventions, strings are defined as arrays of characters. This definition raises problems for arrays of strings, which then have to be defined as 2D arrays of characters; in this case only the slowest-varying dimension (i.e. the number of strings) can be variable. Because of this limitation, the list of references given in the example above (<FIELD name="references">) was assigned an arraysize of 20x*, the fixed length of 20 accounting for the blank which separates two references of 19 characters each.

FITS introduced the Substring Array convention (defined in an appendix, i.e. not officially approved), which defines a separator character used to denote the end of a string and the beginning of the next one. In this convention (rA:SSTRw/ccc), r specifies the total size of the character array, w defines the maximal length of one string, and ccc defines the separator character by its ASCII code value. The possible values for the separator include the space and any printable character, but exclude the control characters.

Since such arrays of variable-length strings are frequently used, a similar convention can be introduced in VOTable in the arraysize attribute, using the letter s followed by the separator character; an example is   arraysize="100s,"   indicating a string made of up to 100 characters, where the comma is used to separate the elements of the array.
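
Applied to the references column of the example in section 1.1, the proposed notation could be used as sketched below. The choice of the comma as separator and of 100 as the maximal total length is arbitrary, and only the relevant column is shown.

```xml
<TABLE name="results">
  <!-- Up to 100 characters in total; the comma separates the individual strings -->
  <FIELD name="references" ucd="REFER_BIBCODE" datatype="char"
         arraysize="100s,"/>
  <DATA>
    <TABLEDATA>
      <TR><TD>1995ApJS...98..477H,1997ApJS..112..315H</TD></TR>
      <TR><TD>1995ApJS...96..123T</TD></TR>
    </TABLEDATA>
  </DATA>
</TABLE>
```

The elements no longer need to be padded to a common length, since the separator alone marks where one string ends and the next begins.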

4  Diversified data streaming

Rather than requiring that all data described in the set of <FIELD>s be contained in a single stream which follows the metadata part, it is proposed to let the <FIELD> act as a pointer to the actual data, either in the form of a URI or of a reference to a component of a multipart document.

Each component of the data described by a <FIELD> may indeed have different requirements: while text data or small lists of numbers are quite efficiently represented in pure XML, long lists like spectra or images perform poorly when converted to XML. The method proposed in VOTable 1.0 to gain efficiency is to use a binary representation of the whole data stream by means of the <STREAM> element, at the price of delivering data that are not human-readable at all.

4.1  The type="location" attribute

In order to allow more flexibility in the way the various <FIELD>s can be accessed, the following additions are proposed:

Note that the <LINK> is not required - a <FIELD> declared with type="location" and containing no <LINK> element is assumed to contain URIs.

An example of a table describing a set of spectra looks like the following:

<TABLE name="SpectroLog">
  <FIELD name="Target" ucd="ID_TARGET" datatype="char" arraysize="30*"/>
  <FIELD name="Instr" ucd="INST_SETUP" datatype="char" arraysize="5*"/>
  <FIELD name="Dur" ucd="TIME_EXPTIME" datatype="int" width="5" unit="s"/>
  <FIELD name="Spectrum" ucd="DATA_LINK" datatype="float" arraysize="*"
         unit="mW/m2/nm" type="location">
    <DESCRIPTION>Spectrum absolutely calibrated</DESCRIPTION>
    <LINK type="location" 
        href="http://ivoa.spectr/server?obsno="/>
  </FIELD>
  <DATA><TABLEDATA>
    <TR><TD>NGC6543</TD><TD>SWS06</TD><TD>2028</TD><TD>01301903</TD></TR>
    <TR><TD>NGC6543</TD><TD>SWS07</TD><TD>2544</TD><TD>01302004</TD></TR>
  </TABLEDATA></DATA>
</TABLE>
The reading program therefore has to retrieve the data by resolving a URI formed by appending the cell content to the href of the <LINK>, e.g. http://ivoa.spectr/server?obsno=01301903 for the first row.

The same method could also be applied directly to Content-IDs, which designate elements of a multipart message, using the protocol prefix cid: (RFC 2111).
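
A sketch of this usage, for a table shipped as one part of a multipart message, is given below. It relies on the rule stated above that a <FIELD> declared with type="location" and no <LINK> contains URIs; the Content-ID values are hypothetical and would have to match the Content-ID headers of the parts that carry the spectra.

```xml
<TABLE name="SpectroLog">
  <FIELD name="Target" ucd="ID_TARGET" datatype="char" arraysize="30*"/>
  <!-- No <LINK>: each cell holds a complete URI, here a cid: reference -->
  <FIELD name="Spectrum" ucd="DATA_LINK" datatype="float" arraysize="*"
         unit="mW/m2/nm" type="location"/>
  <DATA><TABLEDATA>
    <!-- Hypothetical Content-IDs of the message parts holding the spectra -->
    <TR><TD>NGC6543</TD><TD>cid:obs01301903@ivoa.spectr</TD></TR>
    <TR><TD>NGC6543</TD><TD>cid:obs01302004@ivoa.spectr</TD></TR>
  </TABLEDATA></DATA>
</TABLE>
```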

4.2  The encoding attribute in <TD>

Accessing binary data significantly improves the efficiency in both storage and CPU usage, especially in comparison with an XML-encoded data stream. But binary data cannot be included in the same stream as the metadata description, unless a dedicated coding filter is applied which converts the binary data into an ASCII representation. Base64 is the most widely used filter for this conversion: 3 bytes of data are coded as 4 ASCII characters, which implies an overhead of 33% in storage, and some (small) computing time for the reverse transformation.

In order to keep the full VOTable document in a single stream, VOTable 1.0 introduced the encoding attribute in the <STREAM> element, meaning that the data, stored as binary records, are converted into some ASCII representation compatible with the XML definitions. One drawback of this method is that the entire data content becomes non-human-readable. The addition of the encoding attribute in the <TD> element allows the data server to decide, at the cell level, whether it is more efficient to distribute the data as binary-encoded or as fully edited values. The result may look like the following:

<TABLE name="SpectroLog">
  <FIELD name="Target" ucd="ID_TARGET" datatype="char" arraysize="30*"/>
  <FIELD name="Instr" ucd="INST_SETUP" datatype="char" arraysize="5*"/>
  <FIELD name="Dur" ucd="TIME_EXPTIME" datatype="int" width="5" unit="s"/>
  <FIELD name="Spectrum" ucd="SPECT_FLUX_VALUE" datatype="float" arraysize="*"
         unit="mW/m2/nm"/>
  <DATA><TABLEDATA>
    <TR><TD>NGC6543</TD><TD>SWS06</TD><TD>2028</TD><TD encoding="base64">
    QJKPXECHvndAgMScQHul40CSLQ5ArocrQLxiTkC3XClAq0OWQKQIMUCblYFAh753QGij10BT
    Em9ARKwIQExqf0BqbphAieuFQJS0OUCJWBBAhcrBQJMzM0CmRaJAuRaHQLWZmkCyhytAunbJ
    QLN87kC26XlA1KwIQOu+d0DsWh1A5an8QN0m6UDOVgRAxO2RQM9Lx0Din75A3o9cQMPfO0C/
    dLxAvUeuQKN87kCXQ5ZAjFodQH0vG0B/jVBAgaHLQI7Ag0CiyLRAqBBiQLaXjUDYcrBA8p++
    QPcKPUDg7ZFAwcKPQLafvkDDlYFA1T99QM2BBkCs3S9AjLxqQISDEkCO6XlAmlYEQKibpkC5
    wo9AvKPXQLGBBkCs9cNAuGp/QL0euEC4crBAuR64QL6PXEDOTdNA2987QN9T+EDoMSdA8mZm
    QOZumEDDZFpAmmZmQGlYEEBa4UhAivGqQLel40Dgan9A4WBCQLNcKUCIKPZAk1P4QNWRaEEP
    kWhBKaHLQTkOVkFEan9BUWBCQVyfvg==
    </TD></TR>
  </TABLEDATA></DATA>
</TABLE>


François Ochsenbein Observatoire Astronomique de Strasbourg, France