VO data types
A review of the data types defined in the VO specifications.
Specifically, it looks at the relationships between types, attributes and columns with similar names in different standards, and how they relate to each other.
VODataService
The
VODataService specification defines an XML schema for describing data collections and the services that access them.
This review refers to
version 1.1 (20101202) of the specification.
The data types defined in
VODataService are intended to be used to describe the data in VO data sets and the services and protocols used to access them.
DataType element
The
DataType XML element is defined in
section 3.5 (Data Parameters) of the
VODataService specification.
DataType defines the following attributes:
- arraysize
- delim
- extendedType
- extendedSchema
DataType arraysize
The
DataType arraysize
attribute is defined in
section 3.5 (Data Parameters) of the
VODataService specification.
The specification text describes the
arraysize
attribute as follows:
- "The arraysize attribute indicates the parameter is an array of values of the named type."
- "Its value describes the shape of the array, and the delim attribute may be used to indicate the delimiter that should appear between elements of an array value."
- "The attribute's presence indicates that parameter holds an array values; the attribute's value indicates the length of the array along each dimension of the multi-dimensional array."
The text of the
VODataService specification describes the syntax for the
arraysize
attribute value as follows:
- "the VOTable arraysize format (vs:ArrayShape): LxMxN..., where each x-delimited positive integer is a length along a dimension of a multi-dimensional array. A single integer indicates a one dimensional array. Instead of an integer, the last length can be set to "*" which indicates a variable length."
Note - The reference to
"the VOTable arraysize format (vs:ArrayShape)" should probably be
"the vs:ArrayShape format ".
The text of the
VODataService specification does not describe the
ArrayShape string syntax.
The
VODataService XML schema defines the
ArrayShape string syntax as follows:
<!--
- this definition is taken from the VOTable arrayDEF type
-->
<xs:simpleType name="ArrayShape">
<xs:annotation>
<xs:documentation>
An expression of a the shape of a multi-dimensional array
of the form LxNxM... where each value between gives the
integer length of the array along a dimension. An
asterisk (*) as the last dimension of the shape indicates
that the length of the last axis is variable or
undetermined.
</xs:documentation>
</xs:annotation>
<xs:restriction base="xs:token">
<xs:pattern value="([0-9]+x)*[0-9]*[*]?"/>
</xs:restriction>
</xs:simpleType>
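The pattern above can be exercised directly. The following is a minimal sketch, assuming that Python's `re` module behaves the same as the XSD regular expression dialect for this particular pattern; the function name `is_array_shape` is hypothetical and not part of any VO library. XSD patterns are implicitly anchored, so `fullmatch` is used.

```python
import re

# Pattern copied from the vs:ArrayShape definition quoted above.
# XSD patterns must match the whole value, hence fullmatch() below.
ARRAY_SHAPE = re.compile(r"([0-9]+x)*[0-9]*[*]?")


def is_array_shape(value: str) -> bool:
    """Return True if value matches the ArrayShape pattern (hypothetical helper)."""
    return ARRAY_SHAPE.fullmatch(value) is not None


if __name__ == "__main__":
    for candidate in ["10", "3x4", "3x4x*", "*", "", "3x", "x3", "3x*x4"]:
        print(repr(candidate), is_array_shape(candidate))
```

Under these assumptions the pattern is more permissive than the prose description: it also accepts the empty string and values with a trailing `x` such as `3x`.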
As the comment in the XML schema suggests, the
ArrayShape string syntax defined
in the
VODataService schema is similar to, but not explicitly linked to,
the
arrayDEF string format defined in the
VOTable specification.
The
ArrayShape string syntax is used in several places in the
VODataService XML schema to define the content of
arraysize
attributes on elements derived from
DataType, including
VOTableType and
TAPType.
The
ArrayShape string syntax is not used in any of the other VO specifications.
DataType delim
The
DataType delim
attribute is defined in
section 3.5 (Data Parameters) of the
VODataService specification.
The specification text describes the
delim
attribute as follows:
- "the string that is used to delimit element of an array value when arraysize is not "1""
The specification text does not define a default value for the
delim
attribute.
The specification text encourages applications to allow optional spaces before and after the delimiter (e.g. "1, 5" when delim=",").
The XML schema defines a default value as a single white space " ".
<xs:attribute name="delim" type="xs:string" default=" ">
The comments in the XML schema also encourage applications to allow optional spaces
before and after the delimiter (e.g. "1, 5" when delim=","),
but this is not encoded in the XML schema itself.
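As an illustration of the behaviour described above, the following sketch splits an array value using a delim value, tolerating optional spaces around the delimiter. The function name `split_array_value` is hypothetical, and treating a white space delimiter as "any run of white space" is an assumption rather than something the specification states.

```python
def split_array_value(text: str, delim: str = " ") -> list:
    """Split an array value into its elements (hypothetical helper).

    The default delim mirrors the schema default of a single space.
    """
    if delim.strip() == "":
        # Assumption: a white space delimiter means any run of white space.
        return text.split()
    # Tolerate optional spaces before and after a non-space delimiter.
    return [item.strip() for item in text.split(delim)]


if __name__ == "__main__":
    print(split_array_value("1, 5", delim=","))
    print(split_array_value("1 2  3"))
```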
The
delim
attribute is not referred to by any of the other VO specifications.
So far, the examples we have found in the other VO specifications all use white space as the delimiter:
- The VOTable
TABLEDATA
serialization for arrays of numeric values explicitly uses white space as the delimiter.
- The VOTable
TABLEDATA
serialization for floatComplex
and doubleComplex
explicitly uses white space as the delimiter.
DataType extendedType
The
DataType extendedType
attribute is defined in
section 3.5 (Data Parameters) of the
VODataService specification.
The specification text describes the
extendedType
attribute as follows:
- "The data value represented by this type can be interpreted as of a custom type identified by the value of this attribute. "
- "The name implies a particular expected format for the data value that can be parsed into a value in memory."
- " If an application does not recognize this extendedType, it should attempt to handle value assuming the type given by the element's value. "string" (or its equivalent) is a recommended default type."
- " This element may make use of the extendedSchema attribute and/or any arbitrary (qualified) attribute to refine the identification of the type. "
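The fallback behaviour described in these quotes can be sketched as follows. This is only an illustration: the registry contents, the type names and the function `parse_value` are all hypothetical, since the specification does not give an example of an extendedType value.

```python
from typing import Callable, Dict, Optional

# Hypothetical registry of parsers for extendedType values we recognise.
EXTENDED_PARSERS: Dict[str, Callable[[str], object]] = {
    "example:csv-ints": lambda text: [int(part) for part in text.split(",")],
}

# Hypothetical parsers for the base type named by the element's value.
BASE_PARSERS: Dict[str, Callable[[str], object]] = {
    "string": str,
    "int": int,
    "double": float,
}


def parse_value(text: str, base_type: str = "string",
                extended_type: Optional[str] = None) -> object:
    """Use the extendedType parser if recognised, else fall back to the base type."""
    parser = EXTENDED_PARSERS.get(extended_type) if extended_type else None
    if parser is None:
        # Unrecognised extendedType: handle the value as the base type,
        # defaulting to string as the specification recommends.
        parser = BASE_PARSERS.get(base_type, str)
    return parser(text)
```

For example, `parse_value("1,2,3", "string", "unknown:type")` simply returns the original string, because the extended type is not in the registry.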
Looking at the body of standards as a whole, we assume that the
extendedType
attribute is functionally equivalent to the
xtype attribute defined in the
something specification.
However, as far as we can tell, this is not explicitly stated anywhere, and there is no mapping defined between the
(
extendedType
|
extendedSchema
) attribute pair defined in
VODataService
and the (
xtype with a prefix) attribute defined in the
something specification.
The
VODataService specification does not provide an example of how the
extendedType
attribute could be used.
The
extendedType
attribute is not referred to in any of the other VO specifications.
DataType extendedSchema
The
DataType extendedSchema
attribute is defined in
section 3.5 (Data Parameters) of the
VODataService specification.
The specification text describes the
extendedSchema
attribute as follows:
- "An identifier for the schema that the value given by the extended attribute is drawn from."
The
VODataService specification does not provide an example of how the
extendedSchema
attribute could be used.
The
extendedSchema
attribute is not referred to in any of the other VO specifications.
TableDataType element
The
TableDataType XML element is defined in
section 3.5.3 (Table Column Data Types) of the
VODataService specification.
TableDataType extends
DataType.
The comment in the XML schema describes
TableDataType as:
- "an abstract parent for a class of data types that can be used to specify the data type of a table column."
VOTableType element
The
VOTableType XML element is defined in
section 3.5.3 (Table Column Data Types) of the
VODataService specification.
VOTableType inherits the following attributes from
DataType:
- arraysize
- delim
- extendedType
- extendedSchema
VOTableType defines the following set of allowed values:
-
boolean
-
bit
-
unsignedByte
-
short
-
int
-
long
-
char
-
unicodeChar
-
float
-
double
-
floatComplex
-
doubleComplex
The specification text describes
VOTableType as follows :
- "data types that correspond to the parameter and column types defined in the VOTable schema"
The XML schema comments describe
VOTableType as follows :
- "a data type supported explicitly by the VOTable format".
The definition of
VOTableType does not provide any further details about the sizes, ranges or content of the data types. It is left to the reader to refer to the
VOTable specification for details about the data types.
Note - the bibliography reference to the
VOTable specification explicitly refers to
version 1.2 (20091130) of the specification; this has since been superseded by
version 1.3 (20130920).
The definition of
VOTableType states that string values of arbitrary length are represented by a data type of
char
with
arraysize="*"
.
In order to support strings with Unicode characters, it may be clearer to state explicitly that
ASCII strings should be represented by a data type of
char
with
arraysize="*"
and that
Unicode strings should be represented by a data type of
unicodeChar
with
arraysize="*"
.
TAPDataType
The specification text does not describe the
TAPDataType element directly.
The XML schema comments describe
TAPDataType as follows:
- "an abstract parent for the specific data types supported by the Table Access Protocol"
The
TAPDataType element defines the following attributes:
- size
Note - the
TAPDataType element name reflects the historical situation where the data types were originally defined in the
TAP specification. The data type definitions have since been moved to the
ADQL specification, but for backward compatibility the XML element name has not been changed.
TAPType
The
TAPType XML element is defined in
section 3.5.3 (Table Column Data Types) of the
VODataService specification.
TAPType inherits the following attributes from
DataType:
- arraysize
- delim
- extendedType
- extendedSchema
TAPType inherits the following attributes from
TAPDataType:
- size
TAPType defines the following set of allowed values:
-
BOOLEAN
-
SMALLINT
-
INTEGER
-
BIGINT
-
REAL
-
DOUBLE
-
TIMESTAMP
-
CHAR
-
VARCHAR
-
BINARY
-
VARBINARY
-
POINT
-
REGION
-
CLOB
-
BLOB
The specification text describes
TAPType as follows :
- "data types that correspond column types defined in the Table Access Protocol (v1.0) [TAP]"
The explicit reference to version 1.0 of the
TAP specification is no longer valid.
The
TAPType element name reflects the historical situation where the data types were originally defined in the
TAP specification. The data type definitions have since been moved to the
ADQL specification, but for compatibility reasons, the XML element name has not been changed.
The definition of
TAPType does not provide any further details about the sizes, ranges or content of the data types.
It is left to the reader to refer to the
TAP (now
ADQL) specification for details about the data types.
The text at the end of the section refers to a mapping between
TAP_SCHEMA
types and VOTable types in the
TAP specification.
- "Note that the TAP standard [TAP] defines an explicit mapping between TAP_SCHEMA types and VOTable types."
This mapping is no longer part of the
TAP specification.
The definition of
TAPType states that string values should be represented by a data type of
VARCHAR
, but it does not say whether this should be accompanied by a
size or
arraysize
attribute.
TAPType size
The
size
attribute is described as an attribute of the
TAPType element in
section 3.5.3 (Table Column Data Types)
of the
VODataService specification.
However, technically, in the XML schema
size
is an attribute of the abstract
TAPDataType parent element,
which is then inherited by
TAPType.
The
VODataService specification describes the
size
attribute as follows:
- "The length of the variable-length data type."
- "In the context of TAP, this attribute is only meaning when the data type is CHAR or BINARY; see discussion below."
This restriction seems to imply that
CHAR
and
BINARY
values have an inherent
'size' property, and are not treated as arrays of values, which have a different
'arraysize' property.
In the discussion that follows, the
VODataService specification gives two examples which are equivalent:
<dataType xsi:type="vs:VOTableType" arraysize="*"> char </dataType>
and
<dataType xsi:type="vs:TAPType"> VARCHAR </dataType>
A third example describes a fixed length string, using the
size
rather than the
arraysize
attribute
<dataType xsi:type="vs:TAPType" size="8" > CHAR </dataType>
However, the
VODataService specification does not explicitly explain the difference (if any) between
<dataType xsi:type="vs:TAPType" size="8" > CHAR </dataType>
and
<dataType xsi:type="vs:TAPType" arraysize="8" > CHAR </dataType>
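To make the ambiguity concrete, the sketch below parses both forms and shows that they surface as two different attributes with no stated relationship between them. The helper name `describe_tap_type` is hypothetical, and the namespace declarations are assumptions added so that the fragments parse on their own.

```python
import xml.etree.ElementTree as ET

# Namespace URIs are assumptions added so each fragment is well-formed alone.
XSI = "http://www.w3.org/2001/XMLSchema-instance"
VS = "http://www.ivoa.net/xml/VODataService/v1.1"


def describe_tap_type(attrs: str, name: str) -> dict:
    """Parse a vs:TAPType dataType element and report its length attributes."""
    xml_text = (
        f'<dataType xmlns:xsi="{XSI}" xmlns:vs="{VS}" '
        f'xsi:type="vs:TAPType" {attrs}> {name} </dataType>'
    )
    elem = ET.fromstring(xml_text)
    return {
        "name": elem.text.strip(),
        "size": elem.get("size"),
        "arraysize": elem.get("arraysize"),
    }


with_size = describe_tap_type('size="8"', "CHAR")
with_arraysize = describe_tap_type('arraysize="8"', "CHAR")
```

A client reading these two descriptions sees `size` set in one and `arraysize` set in the other, and has to decide for itself whether they mean the same thing.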
This distinction between
CHAR
,
VARCHAR
and
BINARY
values with a
'size' property, and arrays of numeric values with an
'arraysize' property
is possibly left over from previous versions of the VO specifications.
The documentation element in the XML schema for
TAPDataType describes the
size
attribute as follows:
- "This corresponds to the size Column attribute in the TAP_SCHEMA and can be used with data types that are defined with a length (CHAR, BINARY)."
This establishes a reference link from
VODataService TAPDataType to
TAP_SCHEMA.columns in the
TAP specification.
In the
TAP specification the corresponding
size
column is described as :
- "retained for backwards compatibility to TAP-1.0"
The original text in version 1.0 of the
TAP specification describes the
size
column as follows :
- "The “size” gives the length of variable length datatypes, for example varchar(256);"
Neither version of the
TAP specification contains a reference from the
size
column back to
TAPDataType in the
VODataService specification.
The
size
attribute is not referred to by any of the other VO specifications.
VOTable
The
VOTable specification defines an XML based serialization format for exchanging tabular data within the VO.
VOTableTypes
Section 2.1 (Primitives) of the
VOTable specification defines the following data types
and their corresponding FITS data type and size in bytes:
| datatype | Meaning | FITS | Bytes |
| boolean | Logical | L | 1 |
| bit | Bit | X | * |
| unsignedByte | Byte (0 to 255) | B | 1 |
| short | Short Integer | I | 2 |
| int | Integer | J | 4 |
| long | Long integer | K | 8 |
| char | ASCII Character | A | 1 |
| unicodeChar | Unicode Character | | 2 |
| float | Floating point | E | 4 |
| double | Double | D | 8 |
| floatComplex | Float Complex | C | 8 |
| doubleComplex | Double Complex | M | 16 |
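As a worked example of how the byte sizes in this table combine with an arraysize, the sketch below computes the storage needed for a fixed-size value. The function name `fixed_size_in_bytes` is hypothetical; the bit type is left out because the table gives no fixed size for it, and variable-length shapes are rejected.

```python
import math

# Byte sizes copied from the table above (bit omitted: no fixed size is given).
BYTES_PER_ELEMENT = {
    "boolean": 1, "unsignedByte": 1, "short": 2, "int": 4, "long": 8,
    "char": 1, "unicodeChar": 2, "float": 4, "double": 8,
    "floatComplex": 8, "doubleComplex": 16,
}


def fixed_size_in_bytes(datatype: str, arraysize: str = "1") -> int:
    """Return the byte length of a fixed-size value (hypothetical helper)."""
    if "*" in arraysize:
        raise ValueError("variable-length arrays have no fixed size")
    # Multiply the lengths along each dimension, e.g. "2x3" gives 6 elements.
    count = math.prod(int(dim) for dim in arraysize.split("x"))
    return BYTES_PER_ELEMENT[datatype] * count
```

For instance, a float field with arraysize="3" occupies 3 x 4 = 12 bytes under these assumptions.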
Section 6 (Definitions of Primitive Datatypes)
of the
VOTable specification describes the representation of these primitives in the
BINARY
,
BINARY2
and
TABLEDATA
serializations.
VOTable boolean
VOTable bit
VOTable unsignedByte
VOTable short
VOTable int
VOTable long
VOTable char
VOTable float
VOTable double
VOTable unicodeChar
The description for the
BINARY
serialization of
unicodeChar
defines it as a
Unicode (UCS-2) fixed width 2-byte character.
- "Each Unicode character is represented in the BINARY/BINARY2 serialization by two bytes, using the big-endian UCS-2 encoding (ISO-10646-UCS-2)"
The UCS-2 character set includes all of the characters in the
Basic Multilingual Plane (BMP),
which contains characters for almost all modern languages.
The description for the
TABLEDATA
serialization includes an example showing how a
unicodeChar
that is outside the ASCII character
set can be represented in an XML document by using a
numeric character reference (NCR).
- "The representation of a Unicode character in the
TABLEDATA
serialization follows the XML specifications, and e.g. the Cyrillic uppercase ``Ya'' can be written &#x042F; in UTF-8."
The reference to
UTF-8 in the description of the
TABLEDATA
serialization may be misleading,
because a UTF-8 XML document can contain the multi-byte Cyrillic uppercase ``Ya'' character (Я) from the example as-is, without
needing to use a numeric character reference.
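The point can be checked with a small sketch: an XML parser returns the same text whether the character is written as a numeric character reference or as literal UTF-8 bytes. The TD fragments are illustrative and are not taken from the VOTable specification.

```python
import xml.etree.ElementTree as ET

# The same cell written with a numeric character reference ...
with_ncr = b'<?xml version="1.0" encoding="utf-8"?><TD>&#x042F;</TD>'
# ... and with the character itself, encoded as UTF-8 bytes.
as_is = '<?xml version="1.0" encoding="utf-8"?><TD>\u042f</TD>'.encode("utf-8")

ncr_text = ET.fromstring(with_ncr).text
literal_text = ET.fromstring(as_is).text

# Both forms parse to the single character U+042F.
print(ncr_text == literal_text)
```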
Declaring a UTF-8 encoding for a VOTable XML document containing
TABLEDATA
data may also be problematic,
<?xml version="1.0" encoding="utf-8"?>
as this would mean the XML document would be able to contain characters
that are beyond the range of the UCS-2 fixed-width character set.
Note - since 2005 it is no longer possible to encode all of the mandatory components defined in the
[[https://en.wikipedia.org/wiki/GB_18030#As_a_national_standard][official character set of the People's Republic of China (GB 18030-2005)]]
in a fixed width 2 byte character set. In addition, as of May 1, 2006, support for the GB 18030-2005 character set is officially
required for all software products sold in the PRC.
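The range problem can be illustrated with a short sketch. It assumes that, for characters inside the Basic Multilingual Plane, Python's `utf-16-be` codec produces the same two bytes as big-endian UCS-2; the helper names are hypothetical.

```python
def fits_ucs2(text: str) -> bool:
    """True if every character is a BMP code point outside the surrogate range."""
    return all(
        ord(ch) <= 0xFFFF and not 0xD800 <= ord(ch) <= 0xDFFF
        for ch in text
    )


def encode_unicode_char(text: str) -> bytes:
    """Encode text as 2 bytes per character, rejecting anything outside UCS-2."""
    if not fits_ucs2(text):
        raise ValueError("text contains characters outside the UCS-2 range")
    # For BMP characters UTF-16BE and big-endian UCS-2 coincide (assumption above).
    return text.encode("utf-16-be")
```

The Cyrillic ``Ya'' (U+042F) fits in two bytes, whereas a supplementary-plane character such as U+20000 needs a four-byte surrogate pair in UTF-16 and is rejected by this sketch.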
VOTable floatComplex
The description for the
BINARY
serialization of
floatComplex
defines it as a pair of 32-bit, single precision, floating point numbers.
- "a sequence of pairs of 32-bit single precision floating point numbers in big-endian order"
The description for the
TABLEDATA
serialization of
floatComplex
defines it as a pair of floating point numbers separated by white space.
- "two representations of a Single Precision Floating Point numbers separated by whitespace, representing the real and imaginary part respectively"
Note that this effectively fixes the delimiter for the
TABLEDATA
serialization to white space, regardless of the
delim
attribute
set by the
VODataService description of the source data table.
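The two serializations described above can be sketched as follows, using only the Python standard library. The function names are hypothetical, and the sketch handles a single floatComplex value rather than an array of them.

```python
import struct


def parse_float_complex_tabledata(cell: str) -> complex:
    """Parse 'real imag' separated by white space, as in TABLEDATA."""
    real, imag = cell.split()
    return complex(float(real), float(imag))


def parse_float_complex_binary(data: bytes) -> complex:
    """Parse two big-endian 32-bit floats (real, imaginary), as in BINARY."""
    real, imag = struct.unpack(">ff", data)
    return complex(real, imag)
```

Note that the TABLEDATA parser splits on white space only, so a delim value other than white space would have no effect on it.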
VOTable doubleComplex
The description for the
BINARY
serialization of
doubleComplex
defines it as a pair of 64-bit, double precision, floating point numbers.
- "a sequence of pairs of 64-bit double precision floating point numbers in big-endian order"
The description for the
TABLEDATA
serialization of
doubleComplex
defines it as a pair of floating point numbers separated by white space.
- "two representations of a Double Precision Floating Point numbers separated by whitespace, representing the real and imaginary part respectively"
Note that this effectively fixes the delimiter for the
TABLEDATA
serialization to white space, regardless of the
delim
attribute
set by the
VODataService description of the source data table.
VOTableArrays
The
VOTable specification and schema include an
arraysize
attribute, but not a
delim
attribute.
Section 2.2 of the
VOTable specification uses a number of examples to show how a
combination of
datatype
and
arraysize
attributes can be used to describe
arrays of values in the metadata for a FIELD.
Section 5.1 of the
VOTable specification describes the
TABLEDATA
serialization of arrays as follows:
- "If a cell contains an array of numbers or a complex number, it should be encoded as multiple numbers separated by whitespace. However in the case of character and Unicode strings (declared in the corresponding FIELD as an array of char or unicodeChar datatype), no separator should exist."
It uses the following example to illustrate the difference between arrays of numbers and arrays of characters:
<TABLE>
<FIELD name="aString" datatype="char" arraysize="10"/>
<FIELD name="aShort" datatype="short"/>
<FIELD name="varInts" datatype="int" arraysize="*"/>
<FIELD name="Floats" datatype="float" arraysize="3"/>
<DATA><TABLEDATA>
<TR> <TD>Apple</TD> <TD/> <TD>1 2 4 8 16</TD> <TD>1.62 4.56 3.44</TD> </TR>
<TR> <TD>Orange</TD> <TD>15</TD> <TD>23 -11 9</TD> <TD>2.33 4.66 9.53</TD> </TR>
</TABLEDATA></DATA>
</TABLE>
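A reader for this example might look like the sketch below, which splits numeric arrays on white space but keeps char arrays as a single string. It is a simplified illustration using the standard library rather than a real VOTable parser; the helper names and the converter table are hypothetical, and the fragment is repeated inside the code so that the block is self-contained.

```python
import xml.etree.ElementTree as ET

VOTABLE_FRAGMENT = """
<TABLE>
  <FIELD name="aString" datatype="char" arraysize="10"/>
  <FIELD name="aShort" datatype="short"/>
  <FIELD name="varInts" datatype="int" arraysize="*"/>
  <FIELD name="Floats" datatype="float" arraysize="3"/>
  <DATA><TABLEDATA>
    <TR> <TD>Apple</TD> <TD/> <TD>1 2 4 8 16</TD> <TD>1.62 4.56 3.44</TD> </TR>
    <TR> <TD>Orange</TD> <TD>15</TD> <TD>23 -11 9</TD> <TD>2.33 4.66 9.53</TD> </TR>
  </TABLEDATA></DATA>
</TABLE>
"""

# Only the numeric types used in the example are covered here.
CONVERTERS = {"short": int, "int": int, "float": float}


def parse_cell(text, datatype, arraysize):
    """Convert one TD cell according to its FIELD metadata."""
    if text is None or text == "":
        return None  # empty cell
    if datatype in ("char", "unicodeChar"):
        return text  # character arrays have no separator
    convert = CONVERTERS[datatype]
    if arraysize is None:
        return convert(text)
    return [convert(item) for item in text.split()]  # white space separated


def parse_table(xml_text):
    """Return one dict per TR, keyed by FIELD name."""
    table = ET.fromstring(xml_text)
    fields = table.findall("FIELD")
    rows = []
    for tr in table.iter("TR"):
        row = {}
        for field, td in zip(fields, tr.findall("TD")):
            row[field.get("name")] = parse_cell(
                td.text, field.get("datatype"), field.get("arraysize"))
        rows.append(row)
    return rows
```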
VOTable arraysize
The text of the
VOTable specification does not explicitly define the
arraysize
attribute.
The text of the
VOTable specification does not link the
VOTable arraysize
attribute with the
DataType arraysize
attribute defined in the
VODataService specification.
VOTable arrayDEF
The text of the
VOTable specification does not explicitly define the format of the
arraysize
attribute value.
The
VOTable XML schema defines the
arrayDEF string syntax as follows:
<xs:simpleType name="arrayDEF">
<xs:restriction base="xs:token">
<xs:pattern value="([0-9]+x)*[0-9]*[*]?(s\W)?"/>
</xs:restriction>
</xs:simpleType>
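The difference between this pattern and the ArrayShape pattern in the VODataService schema is the optional `(s\W)?` suffix. The sketch below compares the two, assuming that Python's `re` module is close enough to the XSD regular expression dialect for these inputs (the exact definition of `\W` differs between the two dialects); the helper name `accepted_by` is hypothetical.

```python
import re

# Patterns copied from the VOTable and VODataService schemas respectively.
ARRAY_DEF = re.compile(r"([0-9]+x)*[0-9]*[*]?(s\W)?")
ARRAY_SHAPE = re.compile(r"([0-9]+x)*[0-9]*[*]?")


def accepted_by(value: str) -> tuple:
    """Return (matches arrayDEF, matches ArrayShape) for a candidate value."""
    return (
        ARRAY_DEF.fullmatch(value) is not None,
        ARRAY_SHAPE.fullmatch(value) is not None,
    )
```

Under these assumptions a value such as `3x4x*` is accepted by both patterns, while `*s,` is accepted only by arrayDEF.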
However, the
arrayDEF string syntax is not used in the definition of the
arraysize
attribute
<xs:complexType name="Field">
....
<xs:attribute name="arraysize" type="xs:string"/>
....
</xs:complexType>
The only reference to the
VOTable arrayDEF
string syntax in the other VO specifications is a comment in the definition of the
ArrayShape in the
VODataService schema.
The text of the
VOTable specification does not link the
VOTable
arrayDEF string syntax with the
ArrayShape string
syntax defined in the
VODataService schema.
The
arrayDEF string syntax is not used anywhere in
VOTable XML schema.
The
arrayDEF string syntax is not used in any of the other VO specifications.
DALI
The
DALI specification defines ...
TAP
The
TAP specification defines ...
ADQL
The
ADQL specification defines ...
xtype
The
xtype
attribute is defined in ...
The
xtype
attribute is referred to in ...
TAP_SCHEMA
The
TAP_SCHEMA
tables are defined in ...
The
TAP_SCHEMA
tables are referred to in ...