VOTableIssues13
(revision 4) (raw view)
---+ Issues with VOTable version 1.3

This page contains notes and discussions of items in VOTable version 1.3 that may need attention or changes in some future version of the VOTable standard. It is a discussion forum, and inclusion here does not guarantee that any changes will be made. The existence of this page does not necessarily indicate that there will be a new revision of VOTable in the near or distant future, but it makes sense to gather issues here rather than to let them get lost in old mailing list threads.

See also VOTableIssues, which discusses items in VOTable version 1.2 that were addressed in version 1.3 (mostly to do with representing null values).

---++ Unicode primitive character data representation

The discussion of the =unicodeChar= primitive data type in section 2.1 of the standard says:

<blockquote>VOTables support two kinds of characters: ASCII 1-byte characters and Unicode (UCS-2) 2-byte characters. Unicode is a way to represent characters that is an alternative to ASCII. It uses two bytes per character instead of one, it is strongly supported by XML tools, and it can handle a large variety of international alphabets. Therefore VOTable supports not only ASCII strings (datatype="char"), but also Unicode (datatype="unicodeChar").</blockquote>

This is a bit confused. Unicode itself does not specify how many bytes are used to encode each character. UCS-2 (a Unicode encoding scheme) does specify 2 bytes per character, but is very out of date. It resembles UTF-16, which uses two bytes for most characters but four bytes (a surrogate pair) for characters outside the Basic Multilingual Plane. Moreover, the encoding used for Unicode character data inline (e.g. in TABLEDATA/TD elements) is determined by the encoding of the XML document hosting the VOTable, so it makes no sense to mandate an encoding (UCS-2 or anything else) for inline Unicode character data.
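
As a rough illustration of the byte-count point, the following Python sketch compares how many bytes a single character occupies in UTF-8 and UTF-16, and checks whether it could be stored in a single 2-byte UCS-2 unit. It is not taken from the standard or from any VOTable library; the helper names =encoded_sizes= and =fits_ucs2= and the sample characters are made up for this example.

```python
def encoded_sizes(text):
    """Return the number of bytes `text` occupies in UTF-8 and UTF-16.

    "utf-16-be" is used rather than "utf-16" so that Python does not
    prepend a byte order mark, which would add two bytes to the count.
    """
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16-be": len(text.encode("utf-16-be")),
    }


def fits_ucs2(text):
    """True if every character has a code point at or below U+FFFF,
    i.e. it lies in the Basic Multilingual Plane and needs only one
    2-byte unit."""
    return all(ord(ch) <= 0xFFFF for ch in text)


# Hypothetical sample characters: plain ASCII, an accented Latin letter,
# and a character outside the Basic Multilingual Plane.
SAMPLES = ["A", "\u00e9", "\U0001d6fc"]

for sample in SAMPLES:
    print(f"U+{ord(sample):04X}", encoded_sizes(sample),
          "fits UCS-2:", fits_ucs2(sample))
```

The first two samples take two bytes each in UTF-16, so UCS-2 and UTF-16 agree on them. The third (U+1D6FC) needs four bytes in UTF-16 and cannot be stored as a single UCS-2 unit, which is where the two encodings part ways.
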
The encoding does however need to be determined somehow for BINARY or BINARY2-serialized data, which is not considered to form part of an XML stream. Another complication is that if UCS-2 really is intended (for BINARY) there should be some discussion of endianness.

This topic was discussed in the "Unicode in VOTable" thread started by Walter Landry on the apps mailing list [[http://www.ivoa.net/pipermail/apps/2014-March/000938.html][here]].

What to do? Proposals include, in ascending order of disruptiveness:
   1 Rewording the definition of the =unicodeChar= type to clarify that inline text uses the document's encoding, but retaining (the obsolete and deprecated) UCS-2 for binary serializations, and adding discussion of endianness. This just clarifies the existing text so it makes sense.
   1 Rewording and clarifying as above, but redefining the =unicodeChar= type to use UTF-16 rather than UCS-2. Given the characters likely to appear in VOTables, the encodings are likely to be identical in most or all cases.
   1 Changing =char= type to mean unicode character, with its encoding for BINARY serialization being UTF-8, while allowing inline char arrays to contain any Unicode character encoded as per the VOTable document's encoding. In practice it may be the case that some existing software in Unicode-friendly languages such as Java inadvertently treats =char= data as UTF-8 already. Another argument in favour is that strictly speaking ASCII characters are 7-bit and should have the same interpretation in UTF-8 as in ASCII, but it's probably the case that various extended ASCII character sets are in use in some VOTables which would cause trouble under such a redefinition.

My take is that, for maximum backward compatibility (including with non-Unicode-friendly formats and languages), =char= should continue to represent 1-byte characters (ASCII or extended ASCII), while =unicodeChar= should be redefined to use UTF-16 in binary contexts.
People who want to write Unicode now should do it using the =unicodeChar= type, employing UTF-16 with a BOM in BINARY contexts.

-- IVOA.MarkTaylor - 2014-03-31

Java uses UTF-16 internally. Converting to UTF-8 requires an explicit step. However, there are other systems which default to UTF-8, such as Linux file systems. So storing a list of files would be UTF-8 unless you do something special.

Also, my take is that extended ASCII (e.g. ISO 8859-1) is already illegal, so anyone using it is already in trouble. I have seen no evidence of any use of extended ASCII in a BINARY2 serialization in a VOTable in the wild.

-- WalterLandry 2014-04-09

---++ RESOURCE type attribute

The =type= attribute of the =RESOURCE= element is defined like this:

<verbatim>
<xs:attribute name="type" default="results">
  <xs:simpleType>
    <xs:restriction base="xs:NMTOKEN">
      <xs:enumeration value="results"/>
      <xs:enumeration value="meta"/>
    </xs:restriction>
  </xs:simpleType>
</xs:attribute>
</verbatim>

The two possible values seem a bit ad-hoc, and this restriction has been a cause of surprise (see the thread "Datalink feedback II: RESOURCE type" on the DAL list in March 2014 [[http://www.ivoa.net/pipermail/dal/2014-March/006743.html][here]]). Does it make sense to change it?

-- IVOA.MarkTaylor - 2014-03-31
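
To make the effect of the =RESOURCE= =type= restriction concrete, here is a small Python sketch that reports, for each =RESOURCE= element in a document, the effective =type= value and whether it falls within the schema enumeration quoted above. It is only an illustration and not a substitute for real schema validation: the function name =resource_types=, the sample document and the value =service= are all invented for this example, and the namespace URI in the sample is an assumption (the code ignores namespaces in any case).

```python
import xml.etree.ElementTree as ET

# Values permitted by the schema fragment quoted above, and its default.
ALLOWED_TYPES = {"results", "meta"}
DEFAULT_TYPE = "results"


def resource_types(votable_xml):
    """Return a list of (effective type, is_allowed) pairs, one per
    RESOURCE element, in document order."""
    root = ET.fromstring(votable_xml)
    report = []
    for elem in root.iter():
        # Strip any "{namespace}" prefix so the check works with or
        # without a default namespace on the document.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "RESOURCE":
            value = elem.get("type", DEFAULT_TYPE)
            report.append((value, value in ALLOWED_TYPES))
    return report


# Hypothetical document: the third RESOURCE uses a made-up type value
# that the enumeration does not permit.
SAMPLE = """<VOTABLE xmlns="http://www.ivoa.net/xml/VOTable/v1.3" version="1.3">
  <RESOURCE/>
  <RESOURCE type="meta"/>
  <RESOURCE type="service"/>
</VOTABLE>"""

for value, allowed in resource_types(SAMPLE):
    print(value, "allowed" if allowed else "not in schema enumeration")
```

A =RESOURCE= with no =type= attribute is reported as =results= because of the schema default, while any value other than the two enumerated ones is flagged.
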
Topic revision: r4 - 2014-04-09 - WalterLandry
Copyright © 2008-2025 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.