Issues with VOTable version 1.3

This page contains notes and discussions of items in VOTable version 1.3 that may need attention or changes in some future version of the VOTable standard. It is a discussion forum, and inclusion here does not guarantee that any changes will be made. The existence of this page does not necessarily indicate that there will be a new revision of VOTable in the near or distant future, but it makes sense to gather issues here rather than to let them get lost in old mailing list threads. See also VOTableIssues, which discusses items in VOTable version 1.2 that were addressed in version 1.3 (mostly to do with representing null values).

Unicode primitive character data representation

The discussion of the unicodeChar primitive data type in section 2.1 of the standard says:
VOTables support two kinds of characters: ASCII 1-byte characters and Unicode (UCS-2) 2-byte characters. Unicode is a way to represent characters that is an alternative to ASCII. It uses two bytes per character instead of one, it is strongly supported by XML tools, and it can handle a large variety of international alphabets. Therefore VOTable supports not only ASCII strings (datatype="char"), but also Unicode (datatype="unicodeChar").
This is a bit confused. Unicode itself does not specify how many bytes are used to encode each character. UCS-2 (a Unicode encoding scheme) does specify 2 bytes per character, but is very out of date. It resembles UTF-16, which uses two bytes for most characters, but more bytes for some exotic characters. Moreover the encoding used for Unicode character data inline (e.g. in TABLEDATA/TD elements) is determined by the encoding of the XML document hosting the VOTable, so it makes no sense to mandate an encoding (UCS-2 or anything else) for inline unicode character data. The encoding does however need to be determined somehow for BINARY or BINARY2-serialized data, which is not considered to form part of an XML stream. Another complication is that if UCS-2 really is intended (for BINARY) there should be some discussion of endianness. This topic was discussed in the "Unicode in VOTable" thread started by Walter Landry on the apps mailing list here.
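To make the byte-count and endianness points concrete, here is a small Python sketch. It is only an illustration using Python's built-in codecs, not anything prescribed by the VOTable standard; the helper names `encoded_lengths` and `fits_in_ucs2` and the sample characters are made up for this example.

```python
# Illustrative only: how many bytes the same character needs in different
# Unicode encodings, and why byte order matters for 2-byte code units.

def encoded_lengths(text):
    """Return the number of bytes needed to encode text in several encodings."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16-be": len(text.encode("utf-16-be")),
        "utf-16-le": len(text.encode("utf-16-le")),
    }

def fits_in_ucs2(text):
    """True if every code point fits in one 16-bit unit, as UCS-2 requires."""
    return all(ord(ch) <= 0xFFFF for ch in text)

samples = {
    "ASCII letter A": "A",
    "Greek alpha (U+03B1)": "\u03b1",
    "Mathematical italic alpha (U+1D6FC)": "\U0001d6fc",
}

for label, text in samples.items():
    print(label, encoded_lengths(text), "fits in UCS-2:", fits_in_ucs2(text))

# The same character yields different byte sequences depending on byte order.
print("\u03b1".encode("utf-16-be").hex())  # 03b1
print("\u03b1".encode("utf-16-le").hex())  # b103
```

The last sample lies outside the Basic Multilingual Plane, so UTF-16 needs a surrogate pair (4 bytes) for it and a strict 2-bytes-per-character scheme cannot represent it at all.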
What to do?
Proposals include, in ascending order of disruptiveness:
char should continue to represent 1-byte characters (ASCII or extended ASCII), while unicodeChar should be redefined to use UTF-16 in binary contexts. People who want to write Unicode now should do it using the unicodeChar type, employing UTF-16 with a BOM in BINARY contexts. -- MarkTaylor - 2014-03-31
Java uses UTF-16 internally. Converting to UTF-8 requires an explicit step. However, there are other systems which default to UTF-8, such as Linux file systems. So storing a list of files would be UTF-8 unless you do something special. Also, my take is that extended ASCII (e.g. ISO 8859-1) is already illegal, so anyone using it is already in trouble. I have seen no evidence of any use of extended ASCII in a BINARY2 serialization in a VOTable in the wild. -- WalterLandry 2014-04-09
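As a rough illustration of the "UTF-16 with a BOM" suggestion and of the extended-ASCII point, the Python sketch below writes a string with a leading byte order mark, reads it back by inspecting that mark, and checks whether a byte sequence is pure 7-bit ASCII. The function names are hypothetical, the choice of big-endian output is an assumption made for the example, and nothing here reflects how BINARY field lengths are actually laid out in a VOTable.

```python
import codecs

def encode_unicode_char(text):
    """Encode text as UTF-16 with a leading BOM (big-endian is an assumption here)."""
    return codecs.BOM_UTF16_BE + text.encode("utf-16-be")

def decode_unicode_char(data):
    """Decode UTF-16 bytes, using the leading BOM to choose the byte order."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    raise ValueError("no UTF-16 byte order mark found")

def is_pure_ascii(data):
    """True if every byte is in the 7-bit ASCII range."""
    return all(b < 0x80 for b in data)

payload = encode_unicode_char("\u03b1 Cen")
print(payload.hex())
print(decode_unicode_char(payload))

# An ISO 8859-1 ("extended ASCII") byte such as 0xE9 is not 7-bit ASCII.
print(is_pure_ascii("M31".encode("ascii")))
print(is_pure_ascii("\u00e9".encode("iso-8859-1")))
```

A reader that requires the BOM, as this sketch does, never has to guess the byte order; a format could equally fix the byte order by definition and omit the BOM, which is the kind of choice the discussion above leaves open.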
RESOURCE type attribute

The type attribute of the RESOURCE element is defined like this:
    <xs:attribute name="type" default="results">
      <xs:simpleType>
        <xs:restriction base="xs:NMTOKEN">
          <xs:enumeration value="results"/>
          <xs:enumeration value="meta"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:attribute>

The two possible values seem a bit ad-hoc, and this restriction has been a cause of surprise (see the thread "Datalink feedback II: RESOURCE type" on the DAL list in March 2014 here). Does it make sense to change it? -- MarkTaylor - 2014-03-31
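The effect of this restriction can be sketched in a few lines of Python that read the type attribute of each RESOURCE and compare it with the two enumerated values, applying the schema default when the attribute is absent. This is only a sketch using the standard library XML parser, not a schema validator; the function name `resource_types`, the sample document, and the value "datalink" are hypothetical and chosen purely to show a value the schema would reject.

```python
import xml.etree.ElementTree as ET

ALLOWED_TYPES = {"results", "meta"}  # the two enumerated values in the schema
DEFAULT_TYPE = "results"             # the schema's declared default

def resource_types(votable_xml):
    """Return (type value, allowed by the schema?) for each RESOURCE element."""
    root = ET.fromstring(votable_xml)
    found = []
    for elem in root.iter():
        local_name = elem.tag.rsplit("}", 1)[-1]  # drop any "{namespace}" prefix
        if local_name == "RESOURCE":
            value = elem.get("type", DEFAULT_TYPE)
            found.append((value, value in ALLOWED_TYPES))
    return found

# Hypothetical sample document; "datalink" is not a value the schema permits.
EXAMPLE = """<VOTABLE version="1.3" xmlns="http://www.ivoa.net/xml/VOTable/v1.3">
  <RESOURCE/>
  <RESOURCE type="meta"/>
  <RESOURCE type="datalink"/>
</VOTABLE>"""

for value, allowed in resource_types(EXAMPLE):
    print(value, "allowed" if allowed else "not allowed by the schema")
```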