Note: this page contains discussion on VOTable 1.2, and the issues here were mostly considered and used as input to the VOTable 1.3 standard. For discussion of issues still current with respect to VOTable 1.3, see VOTableIssues13.

Nulls in VOTables

VOTables are the standard representation for data output from TAP and other DAL services and may be used for table input in TAP. VOTables are likely to be used in the exchange of information between databases for any implementation of distributed queries within the VO. This memo discusses issues that arise from the standard representation of nulls according to the VOTable standard. [The initial version of this document was written by Tom McGlynn, representing his understanding, to start discussion. Comments, corrections and suggestions by others are welcome.]

Floats and Doubles

For scalar float and double fields the VOTable specification (section 6) states The IEEE NaN pattern is used to represent null values. [italics in original] Relational databases clearly distinguish between floating NaN values and nulls. Whether this usage of the term null in the VOTable standard is intended to precisely match the database definition is not clear. If so, then the existence of NaNs in a database cannot be represented in a VOTable. No table that includes NaN values could be accurately transcribed into a VOTable.

If the intent was to suggest that NaNs are themselves rather like nulls and thus a user needing to represent a null should use a NaN, then the situation is rather worse. Someone reading either a null or NaN value cannot be sure of which is intended.

This can have significant consequences for many SQL queries. E.g., suppose we have a column, X, whose value in a TABLEDATA serialization has one row as <TD/> and a second which has <TD>NaN</TD>. Consider the following queries:


     Select count(*)  from table where X is null

     Select count(*) from table where X != 0

     Select count(*) from table where X != X
If we consistently treat all NaNs as nulls, then both rows are counted in the first select and ignored in the last two. If we treat both as NaNs then the last two expressions count the rows but not the first. The last select is a traditional idiom for finding NaNs (sometimes improperly optimized away).

The most ‘natural’ reading to the table might treat the first row as a null and the second as a NaN but this seems counter to the standard.

Strings and arrays

Strings in VOTables are not primitives but a special case of an array (of char or Unicode). Arrays of other types are permitted and frequently used. The creation of nulls of arrays is not addressed very clearly in the standard. Strings bring up yet another usage of the word NULL (capitalized in the text) the ASCII NULL (0x00). Currently the standard is silent on specificially how to represent a null string (or char). The ASCII NULL is used to terminate strings in binary representions, but this is not mentioned for TABLEDATA where the string comprises all characters between <TD> and </TD>.

Databases distinguish between null strings and strings of zero length. I.e.,

    insert into table (myString) values(null)
is entirely distinct from
   insert into table (myString) values(‘’)
where myString is a varchar field.

The simple query select length(myString) from table will return null for the first row and 0 for the second.

The interpretation of <TD/> and <TD></TD> is not explicitly defined for char* fields, but looking at how these would be represented in the binary representations (where we would use a length prefix of 0) suggests that these should be interpreted as a String of 0 length where feasible. For fixed length char fields it is unclear how these should be interpreted in the context of a presumed charN field in the database. An interpretation equating them to inserting ‘’ into the column seems most consistent, which would make them equivalent to N spaces.

Note that use of the null field in the VALUES subelement of the FIELD defining a character or string can be used to unambiguously define a null representation. For a single character field (or short strings) this can suffer from the same issues as we see for integer types as discussed below. However null strings cannot be robustly specified absent such a specification.

Defining nulls for other array types also requires care. The VOTable specification notes that the content of each TD element must be consistent with the defined field type. So if we have an array column, say the CDMatrix for an image returned in a SIA, then the null must be a valid 2x2 matrix. I.e., <TD/> or <TD></TD> are not acceptable only <TD>NaN NaN NaN NaN</TD> (after specifying <VALUES null=’NaN NaN NaN NaN’ />) Currently we don’t worry much about array values in our databases (though some support arrays).

Integers

Integer types can only specify nulls using the VALUES/null attribute. For short integer types it is very easy for this to mean that some information must be lost in the database. E.g., since there are only 256 distinct byte values a byte field in a table may easily include all possible values. Using a legal value to represent a null will inevitably lose information. The common standard seems to be to use the minimum value for the signed integer types. However this can be dangerous even for long values in the not uncommon scenario where an integer represents a set of flags. The minimum value is the first bit set and all others 0 quite possibly a common pattern. [The maximum is not better since it just flips the bits.] The requirement to use explicit null values for encoding null integer values substantially increases the complexity of reading and writing these datatypes.

Streaming

To most reliably handle string and integer nulls the results of the query must be examined to find an out-of-band value (if any) that can be specified as the null. Since the FIELD specification must be given at the beginning of the VOTable this effectively means that robust generation of VOTable results is incompatible with streaming results. In some cases, e.g., where the result fields are copied from fields in the table, potential nulls might be precomputed, but more generally queries may include result fields that are expressions of the inputs.

Possible Workarounds.

The VOTable standard suggests in 4.7 when discussing handling nulls "When this value is found in the corresponding data, it is assumed that no data exists for that table cell; the parser may also choose to use this when unparsable data is found, and the null value will be substituted instead."

Could this be recommended rather than optional practice? This allows for the easy specification of nulls for integer and non-character fixed-length array types at least in the TABLEDATA representation. For variable length arrays and strings a convention with <TD/> means null and <TD></TD> means a 0 length array would allow both of these to be represented, but this seems counter to the XML standard and doesn’t seem viable. String values probably require an explicit VALUES/null to be used for nulls with empty fields treated as 0-length stings.

The <TD> element could have a null attribute added. E.g., <TD null=’true’/> to represent a null for any type.

None of these help in the binary encodings.

No obvious simple recommendation seems to resolve the conflation of NaN and null values for floats.

Most integer types could be returned using a longer integer type with a value not possible for the original type used as null. A special xtype attribute for the field could indicate that this is a lengthened type, <FIELD datatype=”int” xtype=”ext:short” …><VALUES null=”-100000000”/></FIELD> to define a field that is a short type in the database but returned as an 4-byte integer in the VOTable. This cannot help for long values.


I agree with all the points on weaknesses in VOTable NULLs that Tom made, though I'm not 100% sure whether I worry too much about the representability of IEEE special values; there are people that say they wouldn't have been in IEEE either in a better world (in which FORTRAN has exceptions).

I also agree that we need fixes, and I'm quite sure the only clean solution would be explicit nulls for all defined "encodings" of VOTable data.

Compared to other options, the null attribute on TDs in TABLEDATA is clearly preferable (and would also resolve the issue with IEEE special values and NULL). It's simple and robust. The only thing it can't do is encode NULLs within arrays, but we could say (and I'd be all for it) that VOTable doesn't support this. A spec would probably need to clarify what should happen with 42 for a field with VALUES null='42'.

For BINARY serialization, I think what we want is record headers, with a bit field that says what columns are null. However, such a change would produce streams older software could not decode at all, so I think that's flat out. As regards FITS serialization, we're not at liberty to change anything anyway.

So, here's an alternative approach independent of serialization methods and allowing for old clients to fail more or less gracefully: Just add a column conventionally named, say, __NULLCOLS__ (case-sensitively, if I had to choose) of type bit*number_of_fields, which would have a true for each field that's null. Libraries that understand the convention and have internal NULL representations would not expose that column to users but instead insert the NULLs. With legacy clients users can emulate the correct behaviour manually.

This also works for streaming data. Again a clarification is required for sorting out contradictions between NULLCOLS and VALUES null attributes, and there's again no solution for NULLs in arrays, but otherwise I'm starting to like such an idea. Worse hacks have survived in FITS for ages...

-- MarkusDemleitner - 25 Jul 2011


Comments from MarkTaylor

I agree that some issues in VOTable relating to null values have not been particularly well thought out.

  • Floating point: Yes, the differences between nulls and NaNs in a database cannot be represented in a VOTable; VOTable has a different model for numeric data than RDBMS. The fact that databases use one model and that VOTables use a different one does not, to my mind, constitute an argument that the VOTable model is defective or needs to be changed. My feeling is that for most kinds of processing of tabular data which is done in astronomy, the VOTable model is adequate - either you have got a number you can use or you haven't.
  • Strings: In most cases in for astronomy data in practice I'm inclined to think similarly that it's not important to distinguish between and empty or zero-length string and a null, though if someone comes up with a counter-example I might change my mind.
  • Arrays (non-string): Yes, it's fiddly and tricky to represent null arrays, but it can be done (I've got a feeling that STIL doesn't do this correctly at present - my guess is it's not a common requirement in VOTables).
  • Integers and streaming: I agree this is a serious problem for software that outputs VOTables (though for VOTable parsing software it's easy enough to cope with). It may in practice mean that a service or application which would otherwise stream data is unable to do so, or that columns which would otherwise be output as one type have instead to get output as a different (wider) type.

If we feel that these things need to be addressed I think that the lowest-impact way of doing it would be to allow empty TD elements (<TD/> or equivalently <TD></TD>) to represent null values. In practice, it's not uncommon to find VOTables in the wild which do this for integer as well as non-integer data types rather than using the VALUES/null attribute. When STIL (hence TOPCAT etc) encounters a usage like this it logs a minor warning but treats the value as a null; I expect that other VOTable parsers do something similar. So in practice this illegal convention is already in use.

This would effectively mean that BINARY and FITS encoded VOTables are less capable than TABLEDATA ones. It would mean that you can't automatically/easily translate from TABLEDATA to BINARY/FITS encoded VOTables or to FITS tables. We could just say: if you want to stream VOTables in a way which you can't do for BINARY encoding, you'll have to use TABLEDATA instead.

(disclaimer: I'm not necessarily advocating this for now, just airing my thoughts).

-- MarkTaylor - 27 Jul 2011


Edit | Attach | Watch | Print version | History: r6 < r5 < r4 < r3 < r2 | Backlinks | Raw View | Raw edit | More topic actions
Topic revision: r6 - 2014-06-01 - MarkTaylor
 
This site is powered by the TWiki collaboration platform Powered by Perl This site is powered by the TWiki collaboration platformCopyright © 2008-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback