Issues with VOTable version 1.3

This page contains notes and discussions of items in VOTable version 1.3 that may need attention or changes in some future version of the VOTable standard. It is a discussion forum, and inclusion here does not guarantee that any changes will be made. The existence of this page does not necessarily indicate that there will be a new revision of VOTable in the near or distant future, but it makes sense to gather issues here rather than to let them get lost in old mailing list threads.

See also VOTableIssues, which discusses items in VOTable version 1.2 that were addressed in version 1.3 (mostly to do with representing null values).

Unicode primitive character data representation

The discussion of the unicodeChar primitive data type in section 2.1 of the standard says:

VOTables support two kinds of characters: ASCII 1-byte characters and Unicode (UCS-2) 2-byte characters. Unicode is a way to represent characters that is an alternative to ASCII. It uses two bytes per character instead of one, it is strongly supported by XML tools, and it can handle a large variety of international alphabets. Therefore VOTable supports not only ASCII strings (datatype="char"), but also Unicode (datatype="unicodeChar").

This is somewhat confused. Unicode itself does not specify how many bytes are used to encode each character. UCS-2 (a Unicode encoding scheme) does specify 2 bytes per character, but it is obsolete and cannot represent characters outside the Basic Multilingual Plane. It resembles UTF-16, which uses two bytes for most characters but four bytes (a surrogate pair) for characters outside that plane. Moreover, the encoding used for Unicode character data inline (e.g. in TABLEDATA/TD elements) is determined by the encoding of the XML document hosting the VOTable, so it makes no sense to mandate an encoding (UCS-2 or anything else) for inline Unicode character data. The encoding does, however, need to be determined somehow for BINARY or BINARY2-serialized data, which is not considered to form part of an XML stream. Another complication is that if UCS-2 really is intended (for BINARY), there should be some discussion of endianness. This topic was discussed in the "Unicode in VOTable" thread started by Walter Landry on the apps mailing list.
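As a rough illustration of the points above (not part of the standard), the following Python sketch compares how many bytes a character occupies in UTF-8 and UTF-16, and shows why byte order has to be stated for any 2-byte binary encoding. The helper name encoded_lengths and the sample characters are chosen here purely for illustration.

```python
# Three sample characters: plain ASCII, a non-ASCII character inside the
# Basic Multilingual Plane, and one outside it (needs a surrogate pair in UTF-16).
samples = ["A", "\u00fc", "\U0001D49C"]  # 'A', u-umlaut, MATHEMATICAL SCRIPT CAPITAL A


def encoded_lengths(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for a single character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))


for ch in samples:
    utf8_len, utf16_len = encoded_lengths(ch)
    print(f"U+{ord(ch):04X}: UTF-8 {utf8_len} bytes, UTF-16 {utf16_len} bytes")

# The same character serialized with the two possible byte orders.
big = "\u00fc".encode("utf-16-be")
little = "\u00fc".encode("utf-16-le")
print(big, little)
```

For characters inside the Basic Multilingual Plane the UTF-16 bytes are the same as UCS-2 would give, which is why the two are so easily conflated; only the third sample shows the difference.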

What to do?

Proposals include, in ascending order of disruptiveness:

  1. Rewording the definition of the unicodeChar type to clarify that inline text uses the document's encoding, but retaining (the obsolete and deprecated) UCS-2 for binary serializations, and adding discussion of endianness. This just clarifies the existing text so it makes sense.
  2. Rewording and clarifying as above, but redefining the unicodeChar type to use UTF-16 rather than UCS-2. Given the characters likely to appear in VOTables, the encodings are likely to be identical in most or all cases.
  3. Changing the char type to mean Unicode character, with its encoding for BINARY serialization being UTF-8, while allowing inline char arrays to contain any Unicode character encoded as per the VOTable document's encoding. In practice, some existing software in Unicode-friendly languages such as Java may inadvertently treat char data as UTF-8 already. Another argument in favour is that, strictly speaking, ASCII characters are 7-bit and have the same interpretation in UTF-8 as in ASCII. However, various extended ASCII character sets are probably in use in some VOTables, which would cause trouble under such a redefinition.
  4. Rewording and clarifying char as above, but adding a new type, which may or may not go by the name utf8, with a BINARY serialization of UTF-8.

Note that in all except the first case, to make the BINARY encodings work and retain consistency between BINARY and TABLEDATA, we end up with a type in which the value of the arraysize attribute is no longer the number of characters in the string. For the UTF-8 options it is the number of bytes the UTF-8 serialization would take; for changing UCS-2 to UTF-16, it should probably be the number of 16-bit words the UTF-16 serialization would take. This would require some careful rewording in the standard.
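To make the arraysize point concrete, here is a small Python sketch that counts the three candidate units for a string: Unicode code points, UTF-8 bytes, and 16-bit UTF-16 words. The function name size_units is hypothetical, and nothing here is taken from the standard; it only shows how the numbers diverge.

```python
def size_units(text):
    """Return (code points, UTF-8 bytes, UTF-16 16-bit words) for a string."""
    code_points = len(text)
    utf8_bytes = len(text.encode("utf-8"))
    # Each UTF-16 code unit is two bytes; the -be codec adds no byte order mark.
    utf16_words = len(text.encode("utf-16-be")) // 2
    return code_points, utf8_bytes, utf16_words


for value in ["abc", "M\u00fcller", "\U0001D49C"]:
    print(repr(value), size_units(value))
```

For pure ASCII all three counts agree. A single non-ASCII character is enough to make the UTF-8 byte count differ, while the UTF-16 word count only differs for characters outside the Basic Multilingual Plane.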

My take is that, for maximum backward compatibility (including with non-Unicode-friendly formats and languages), char should continue to represent 1-byte characters (ASCII or extended ASCII), while people who want to write Unicode now should do it using the unicodeChar type. A new type utf8 could be introduced as in 4 for future use. However, repurposing char as in 3 seems to have more popular support, and I think it's a reasonable way forward too.

-- MarkTaylor - 2014-03-31, updated 2014-08-18

Java uses UTF-16 internally. Converting to UTF-8 requires an explicit step. However, there are other systems which default to UTF-8, such as Linux file systems. So storing a list of files would be UTF-8 unless you do something special. Also, my take is that extended ASCII (e.g. ISO 8859-1) is already illegal, so anyone using it is already in trouble. I have seen no evidence of any use of extended ASCII in a BINARY2 serialization in a VOTable in the wild. -- WalterLandry 2014-04-09
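As an aside on the extended-ASCII concern, this Python sketch shows what would happen if a reader treated char bytes as UTF-8: bytes written as ISO 8859-1 with a non-ASCII character are generally not valid UTF-8, whereas pure 7-bit ASCII is unaffected. The helper is_valid_utf8 is an illustrative name, not something from any VOTable library.

```python
def is_valid_utf8(raw):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


ascii_bytes = b"Mueller"                          # 7-bit ASCII only
latin1_bytes = "M\u00fcller".encode("iso-8859-1")  # u-umlaut as the single byte 0xFC
utf8_bytes = "M\u00fcller".encode("utf-8")         # u-umlaut as the two bytes 0xC3 0xBC

for label, raw in [("ascii", ascii_bytes), ("latin-1", latin1_bytes), ("utf-8", utf8_bytes)]:
    print(label, raw, is_valid_utf8(raw))
```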

I'm pretty much for changing char to allow non-ASCII, with utf-8 encoding for non-TABLEDATA (TABLEDATA chars would of course keep having the encoding of the embedding XML). This resolves the trouble of what to do when char-valued columns contain non-ASCII in TABLEDATA, helps round-tripping those through non-TABLEDATA, probably won't break any existing software, and will in all likelihood work well all-around. At the same time, I'd deprecate unicodeChar, as it doesn't really serve any purpose any more after that change (except possibly saving re-coding in environments that use utf-16 internally -- but if that re-coding is a bottleneck for your application, I'd be really surprised). Oh, and non-ASCII in BINARY VOTables in the wild: At least with the relational registry, that's fairly normal -- try, e.g.,

 select * from rr.res_role where 1=ivo_hasword(role_name, 'Müller')

on a RegTAP endpoint like http://dc.g-vo.org/tap.

-- MarkusDemleitner - 2014-06-02

RESOURCE type attribute

The type attribute of the RESOURCE element is defined like this:

  <xs:attribute name="type" default="results">
    <xs:simpleType>
      <xs:restriction base="xs:NMTOKEN">
        <xs:enumeration value="results"/>
        <xs:enumeration value="meta"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:attribute>

The two possible values seem a bit ad-hoc, and this restriction has been a cause of surprise (see the thread "Datalink feedback II: RESOURCE type" on the DAL list in March 2014). Does it make sense to change it? -- MarkTaylor - 2014-03-31
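For illustration, the effect of the restriction can be sketched in Python with the standard library XML parser. This is not a schema validator; it only mimics the enumeration and default shown above. The function name resource_types and the value "links" are made up for this example.

```python
import xml.etree.ElementTree as ET

ALLOWED_TYPES = {"results", "meta"}  # the enumeration from the schema fragment
DEFAULT_TYPE = "results"             # the declared default


def resource_types(votable_xml):
    """Return (type value, allowed?) for every RESOURCE element in the document."""
    root = ET.fromstring(votable_xml)
    found = []
    for elem in root.iter():
        # Strip the "{namespace}" prefix that ElementTree puts on tag names.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "RESOURCE":
            value = elem.get("type", DEFAULT_TYPE)
            found.append((value, value in ALLOWED_TYPES))
    return found


EXAMPLE = """<VOTABLE version="1.3" xmlns="http://www.ivoa.net/xml/VOTable/v1.3">
  <RESOURCE/>
  <RESOURCE type="meta"/>
  <RESOURCE type="links"/>
</VOTABLE>"""

print(resource_types(EXAMPLE))
```

The third RESOURCE, with its hypothetical type="links", is the kind of value that the current enumeration rules out.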

Schema tidy-up

The XSD schema for VOTable 1.3 contains a number of comments of only historical interest, which should be removed. Some of these are blocks of XSD from previous versions which have been commented out as XML comments and can easily be misread when eyeballing the text.

-- MarkTaylor - 2014-06-01

Review of "Possible VOTable Extensions" Appendix

Appendix A "Possible VOTable Extensions" contains a number of ideas of varying degrees of bakedness which were raised during the early history of VOTable (I believe there have been no changes to this part of the document since v1.1 in 2004). The value of including these unreviewed and unimplemented suggestions in the normative VOTable document itself is (to me) questionable. I suggest consideration of removing this appendix altogether, possibly moving the text out to a Note.

-- MarkTaylor - 2014-06-01

I'm all for that -- unless someone wanted to champion the inclusion of some of these into the main body of the standard.

-- MarkusDemleitner - 2014-06-02


Topic revision: r7 - 2014-08-18 - MarkTaylor
 