UTypes and URIs
Draft version $Revision: 1.1 $

IVOA Draft Note, $Date: 2007/01/28 13:51:22 $

This version: XXX to appear
Latest version: XXX to appear
Author: Norman Gray, Euro-VOTech project and University of Leicester

Abstract

We describe a minor change to the interpretation of UType values in VOTables, which helps document UType meanings, and makes it easy to relate UTypes to each other, supporting interoperability while requiring minimal standardisation.

Status of this document

This is an IVOA Note.

This document is an IVOA Note expressing suggestions from and opinions of the authors. The first release of this document was YYYY Month DD.

It is intended to share best practices, possible approaches, or other perspectives on interoperability with the Virtual Observatory. It should not be referenced or otherwise interpreted as a standard specification.

A list of current IVOA Recommendations and other technical documents can be found at http://www.ivoa.net/Documents/.

Acknowledgments

None, yet

Abstract
Status of this document
1–Introduction
- 1.1–Creating UTypes
2–UTypes as URIs
3–Documentation
4–Shared semantics
- 4.1–Describing subclass relationships
- 4.2–Reasoning about UTypes
A–UTypes and FITS
B–Apache recipes
C–Rationale
Bibliography

1–Introduction

UTypes are defined in section 4.5 of the VOTable standard [std:votable], with a definition which is sufficiently compact that we can reproduce it in full here.

In some contexts, it can be important that FIELDs or PARAMeters are explicitly designed as being the parameter performing some well-defined role in some external data model. For instance, it might be important for an application to know that a given FIELD expresses the surface brightness processed by an explicit method. None of the existing name, ID or ucd attributes can fill this role, and the utype (usage-specific or unique type) attribute has been added in VOTable 1.1 to fill this gap. By extension, most elements may refer to some external data model, and the utype attribute is legal also in RESOURCE, TABLE and GROUP elements.

In order to avoid name collisions, the data model identification should be introduced following the XML namespace conventions, as utype="datamodel_identifier:role_identifier". The mapping of datamodel_identifier to an xml-type attribute is recommended, but not required.

At the time, this was addressing an anticipated, but not yet actual, need, and so this terse definition sensibly neither greatly constrains UType syntax, nor defines any specific instances.

Our situation is now different. The SIA protocol [std:sia] has acquired a number of UTypes (informally introduced in a mail message from J C McDowell), and the on-going Dataset Characterisation effort [std:characterisation] includes a list of UTypes in at least one version of its draft note. None of these have yet been formally standardised, so that now, with examples in mind and standardisation in prospect, is a good moment to refine the UType definition.

We make three suggestions, which we can summarise as follows.

1. Regard the datamodel_identifier prefix above as an XML namespace, with the syntactic requirements that implies, and interpret the UType as a URI naming a concept.
2. Require that each UType URI be resolvable, on the web, to human-readable documentation for the concept thus named.
3. Require that each UType URI be resolvable, separately, to a formal (RDF) expression of its semantics, which would therefore be immediately retrievable, aggressively cacheable, and easily used by software to interpret data annotated with the UType.

The second and third suggestions build on the first, but are independent of each other.

@@TODO MBT strongly recommends that `require', above, be changed to `strongly recommend', on the practical grounds that that is how it would probably be used in fact. My own feeling is that blessing that degree of casualness in creating UTypes might be harmful to their usefulness, but I can appreciate the practical force of the argument, and can see the extra permissiveness as encouraging the uptake of UTypes.

Further discussion of each of these appears in the sections below, and a rationale for the overall approach appears in C–Rationale. Although simple uses of the reasoning framework described there would be immediately available, the more elaborate possibilities would require further work. We would like to stress, however, that this is not the only benefit of the UType refinement we are suggesting, and that the consistency and documentation benefits described here would follow even if the reasoning potential were never exploited.

The draft characterization document describes a possible mechanism for serialising a data source using a data model and UTypes. We presume the existence of such an agreed-upon mechanism in the discussion of data sharing below.

1.1–Creating UTypes

In this proposal, an organisation creating a UType must perform three steps, mirroring the steps described in section 1–Introduction.

1. Determine a namespace URI, creating a URI in a DNS domain the organisation controls; then identify individual UType names respecting the syntax described in section 2–UTypes as URIs.
2. Create documentation for the namespace, as described in section 3–Documentation.
3. Create a simple RDF document expressing how the new UTypes relate to other standardised or well-known UTypes, as described in section 4–Shared semantics.

2–UTypes as URIs

The UType definition quoted above (section 1–Introduction) includes a datamodel_identifier which syntactically resembles an XML namespace identifier without necessarily being one, and in particular without being necessarily associated with a URI which would give it uniqueness and a potential reference to documentation.

We suggest slightly expanding the UType definition by interpreting this datamodel_identifier prefix as precisely an XML namespace identifier (which must therefore be defined using an xmlns attribute if it is used), and identifying the UType as the string concatenation of the namespace name and the local name as given in the utype attribute, using the terminology of [std:xmlns]. There is precedent for this approach in the definition of `Compact URIs' (CURIE, see [birbeck05]), and it is a syntax used extensively and successfully in the RDF world.

In this interpretation, the following three fragments would represent identical UTypes and would be deemed to be equivalent.

xmlns:utns="http://www.ivoa.net/ut/#" utype="utns:axis"
xmlns="http://www.ivoa.net/ut/#" utype="axis"
utype="http://www.ivoa.net/ut/#axis"

The first is the usual XML namespace mechanism, and closely resembles the VOTable definition, the second uses the XML notion of the default namespace, and the third explicitly gives the URI which the other two resolve to. As with XML namespaces, the string used as the prefix -- utns in the example here -- is arbitrary, and it is only the post-concatenation URI that has any meaning attached to it.
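The equivalence of the three forms can be illustrated with a minimal Python sketch. The function name expand_utype and its arguments are hypothetical, introduced only for illustration; the test for an explicit URI (looking for "://") is a simplifying assumption, not part of the proposal.

```python
def expand_utype(utype, prefixes, default_ns=None):
    """Return the full URI named by a utype attribute value.

    prefixes maps declared namespace prefixes to namespace names;
    default_ns is the default namespace name, if one is declared.
    """
    # Assumption: a value containing "://" is already an explicit URI
    # (the third form) and is returned unchanged.
    if "://" in utype:
        return utype
    prefix, sep, local = utype.partition(":")
    if sep:
        # Prefixed form: concatenate namespace name and local name.
        return prefixes[prefix] + local
    if default_ns is None:
        raise ValueError("no default namespace declared for %r" % utype)
    # Unprefixed form: the default namespace supplies the namespace name.
    return default_ns + utype


NS = "http://www.ivoa.net/ut/#"
forms = [
    expand_utype("utns:axis", {"utns": NS}),
    expand_utype("axis", {}, default_ns=NS),
    expand_utype("http://www.ivoa.net/ut/#axis", {}),
]
```

All three entries of forms are the same string, which is the sense in which the three fragments are deemed equivalent.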

This proposal requires no syntactic changes to the VOTable specification. It is purely a mild reinterpretation of the syntax already defined and used.

The UType string that results from this concatenation must be a valid URI. Since the namespace name is necessarily a URI, this constraint is satisfied if the local name matches a restricted form of the URI syntax of RFC3986 (see [std:rfc3986]):

( path-absolute | path-rootless ) [ "?" query ] [ "#" fragment ]

In practice, we expect most UTypes' local name parts would match the fragment syntax, and more specifically that subset of it matching [0-9a-zA-Z_/.-]+.
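A check against this conservative subset might look like the following Python sketch. The function name is hypothetical, and the pattern is simply the character class quoted above, not a full implementation of the RFC3986 grammar.

```python
import re

# The conservative subset suggested in the text: one or more of
# digits, ASCII letters, underscore, slash, dot and hyphen.
LOCAL_NAME = re.compile(r"[0-9a-zA-Z_/.-]+")


def is_conservative_local_name(name):
    """True if the whole of name lies within the conservative subset."""
    return LOCAL_NAME.fullmatch(name) is not None
```

For example, characterizationAxis-coverage-bounds passes this check, while a name containing a space or a non-ASCII character does not.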

@@TODO: what characters should be allowed in the local name? The above is a rather conservative set. XML allows the local name to be (Letter | "_") (NameChar-":")*, but NameChar includes large chunks of Unicode. This could be accommodated by requiring support for IRIs [std:rfc3987], but the XML namespace document includes only ambiguous support for that. Is the VO ready for kanji in its UTypes? Probably not.

Even without worrying about IRIs, we shouldn't rely on the fact that XML has Unicode sorted out. Other formats, and other software, will have to read UTypes, and so encoding issues rear their heads. In particular, we mustn't require any encoding which uses more than one byte per character, since that would generate various transcoding challenges, to put it mildly, when handling FITS files.

We could restrict ourselves to the characters of 7-bit ASCII, but it would probably be painless to use ISO-8859-1 in fact. The defined 0-127 characters in that set exactly match the printable 7-bit ASCII characters, and ISO-8859-1 as a whole matches Unicode code points 0-255. Thus, although this does not correspond to any Unicode encoding, there is a broad compatibility with Unicode in this case.
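The correspondence between ISO-8859-1 bytes and the first 256 Unicode code points can be checked directly. The following Python sketch (the function name is hypothetical) also shows one way software might test whether a candidate UType survives a one-byte-per-character encoding.

```python
def fits_one_byte_per_char(text):
    """True if every character of text has an ISO-8859-1 encoding."""
    try:
        text.encode("iso-8859-1")
    except UnicodeEncodeError:
        return False
    return True


# Each ISO-8859-1 byte value decodes to the Unicode code point
# with the same number.
all_bytes_decoded = bytes(range(256)).decode("iso-8859-1")
same_code_points = all_bytes_decoded == "".join(chr(i) for i in range(256))
```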

It would be wise to exclude '.' from the set of UType characters, as this character plays a syntactic role in Notation3, so that it would be mildly inconvenient to describe UTypes including a dot. Are there more similar restrictions?

In this example and below, we illustrate UTypes using the URI fragment identifier #: this is regarded as best practice in the RDF community and would generally be more convenient in the procedure we illustrate, but there is no technical reason why a set of distinct, fragmentless, URIs could not be used instead. One advantage of using the fragment identifier is that in this case it is natural to have the namespace URI refer to an overview document describing the namespace as a whole.

UTypes used in non-XML contexts -- such as FITS files -- would have to use either the third explicit mechanism or some separate namespacing mechanism, not specified here, though briefly discussed in appendix A–UTypes and FITS.

This mechanism makes it possible to mint URI UTypes through a wide variety of processes, from very formal and widely shared ones, managed by an elaborate standards process and probably in a www.ivoa.net namespace; through semi-formal ones specific to, and managed by, particular interest groups, perhaps on the way to full standardisation; to very precise ones, perhaps specific to a single instrument. Applications would choose which UTypes it was most useful to them to support: presumably most generic VO applications would support most www.ivoa.net UTypes, and X-ray applications, for example, might support many X-ray-specific UTypes. Perhaps a few applications will support instrument-specific UTypes directly -- perhaps because they fill a gap in a community-supported vocabulary -- but most such UTypes would likely be handled via the reasoning mechanisms described below.

3–Documentation

Once UTypes have been defined as URIs, then they immediately provide a source of documentation, if the namespace URI is made dereferenceable.

For example, to define a UType http://example.org/utypes/1.0#sharpBounds (presuming that we own the example.org domain), we would create a web page at the URL http://example.org/utypes/1.0 (see section B–Apache recipes for hints on making Apache return HTML for such URLs which don't end in .html), within which we have a link target with the same name, which leads to a human-readable description of the UType's semantics.

<h2><a name='sharpBounds'>Accurate bounds</a></h2>
<p>In our data, <code>#sharpBounds</code> are the
bounds on a bandpass where the transmission goes from 0% to 100%
within 10nm.  This is distinguished from
<code>#fuzzyBounds</code> data, where...

The description here can go into as much or as little detail as is appropriate for the formality and intricacy of the document. Thus the URI UType will, when entered into a browser, show the documentation for precisely that concept.

Obviously, any entity minting UTypes is making an institutional commitment to the long-term stability of the namespace URI. An entity unable or unwilling to make such a commitment should avoid creating externally visible UTypes.

4–Shared semantics

While the UType documentation described in section 3–Documentation is useful for humans, it is of course unintelligible to the applications that must interpret the data source annotated with the UType.

To continue our example, we might wish to share data using our new #sharpBounds concept. Doing so means that any application which is written to understand our more precise concept can make good use of the more precise meaning, but we want to make it possible for applications which do not know about this concept to make use of the data also.

4.1–Describing subclass relationships

We suggest a minimal profile of the W3C best-practice document [w3c:swbp] which describes how best to share standard RDF [std:rdf] and RDFS [std:rdfs] vocabularies.

We wish to assert that our new #sharpBounds UType is a more specific version of a concept #characterizationAxis-coverage-bounds, which we presume has already been defined by the IVOA in the namespace http://www.ivoa.net/ut/characterization#, and which we can reasonably expect software to know about. We can do this using RDFS (here written in Notation3 syntax [std:n3]):

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix myns: <http://example.org/utypes/1.0#>.
@prefix ivoa: <http://www.ivoa.net/ut/characterization#>.

myns:sharpBounds a rdfs:Class; 
    rdfs:subClassOf 
        ivoa:characterizationAxis-coverage-bounds .

This asserts that http://example.org/utypes/1.0#sharpBounds is a concept -- a Class in RDFS terms -- and that it is a more specific concept than the Characterisation model's bounds concept.

We propose that the file containing this machine-readable documentation of our UTypes be available at the namespace URI, and returned when the URI is dereferenced using an HTTP Accept header of text/rdf+n3. All non-trivial HTTP APIs have support for manipulating request headers in this fashion, and if all else fails, the command-line curl application can do the retrieval:

% curl --header accept:text/rdf+n3 http://example.org/utypes/1.0

Recipes for setting up a web server to support such content negotiation are in section B–Apache recipes.
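The same retrieval can be sketched with the Python standard library. The function names are hypothetical, and decoding the response as UTF-8 is an assumption about how the server is configured; this is an illustration of the request, not a reference client.

```python
import urllib.request


def build_rdf_request(utype_uri, media_type="text/rdf+n3"):
    """Build a GET request for the namespace document of a UType URI."""
    # The fragment is never sent to the server, so request the
    # namespace URI without it.
    url, _, _ = utype_uri.partition("#")
    return urllib.request.Request(url, headers={"Accept": media_type})


def fetch_rdf(utype_uri):
    """Dereference the namespace URI and return the RDF as text."""
    # urlopen follows redirects such as a 303 response by default.
    with urllib.request.urlopen(build_rdf_request(utype_uri)) as response:
        # Assumption: the server serves the Notation3 file as UTF-8.
        return response.read().decode("utf-8")


request = build_rdf_request("http://example.org/utypes/1.0#sharpBounds")
```

Here request asks for http://example.org/utypes/1.0 with an Accept header of text/rdf+n3; nothing is sent over the network until fetch_rdf is called.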

4.2–Reasoning about UTypes

There are multiple systems (for example [app:jena] and [app:pellet]), in multiple languages, which can ingest such specifications and help an application make the necessary deduction. While an application could incorporate such functionality, it is straightforward to wrap such a reasoning system in a web-based service, and a system such as this has been prototyped.

Using such a resolver, an application which comes across the previously-unknown UType http://example.org/utypes/1.0#sharpBounds can resolve it in a single URL dereference (shown using curl here):

% curl http://localhost/resolver?q=http://example.org/utypes/1.0%23sharpBounds
http://www.w3.org/2000/01/rdf-schema#Class
http://www.ivoa.net/ut/characterization#characterizationAxis-coverage-bounds
http://example.org/utypes/1.0#sharpBounds

This returns the list of superclasses of the #sharpBounds concept (which includes the #sharpBounds class itself, and the technical RDFS class), and so the application can simply work through this list until it finds a UType it recognises, and then proceed exactly as if that UType had been the one found in the input data stream, instead of the previously unknown #sharpBounds UType. By making the subClassOf assertion above we have stated that this is a reasonable thing for an application to do.
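The application-side step of working through the list can be sketched in Python as follows. The function name and the set of known UTypes are hypothetical; the URIs are those of the example in section 4.1–Describing subclass relationships.

```python
def first_recognised(superclasses, known_utypes):
    """Return the first UType in superclasses that the application
    knows how to handle, or None if it recognises none of them."""
    for utype in superclasses:
        if utype in known_utypes:
            return utype
    return None


# Hypothetical application that only knows the IVOA bounds concept.
KNOWN = {
    "http://www.ivoa.net/ut/characterization#characterizationAxis-coverage-bounds",
}
resolved = [
    "http://www.w3.org/2000/01/rdf-schema#Class",
    "http://www.ivoa.net/ut/characterization#characterizationAxis-coverage-bounds",
    "http://example.org/utypes/1.0#sharpBounds",
]
handled_as = first_recognised(resolved, KNOWN)
```

The application would then treat the column as if it had been annotated with the value of handled_as.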

The resolver does not need to be pre-loaded with a set of known UTypes. In fact, the reasoner can start off knowing about no UTypes at all, since when it is asked to resolve a hitherto unknown UType such as this one, it can simply dereference the URI as described in section 4.1–Describing subclass relationships, and add the retrieved relationships to its knowledgebase, ready to respond to this and any future queries. Since UType definitions will be stable, they can be aggressively cached (the assertions will be permanent in principle, but might include bugfixes and updates in practice). Thus this proposal requires no infrastructure beyond the dereferenceable URIs described above, and the commitment of the authors of those UTypes to maintain the URIs into the future.
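A resolver of this kind can be sketched in a few lines of Python. This is not the prototype mentioned above: the class name and the fetch_superclasses callback (which stands in for dereferencing a URI and extracting its rdfs:subClassOf assertions) are assumptions made for illustration, and the sketch omits the rdfs:Class entry that a full RDFS reasoner would report.

```python
class Resolver:
    """Answer superclass queries, loading UType definitions on demand."""

    def __init__(self, fetch_superclasses):
        # fetch_superclasses(uri) returns the direct superclasses
        # asserted for uri; here it is a caller-supplied stand-in
        # for an HTTP dereference plus RDF parsing.
        self._fetch = fetch_superclasses
        self._direct = {}  # cache: URI -> set of direct superclasses

    def _direct_superclasses(self, uri):
        # Dereference each URI at most once; later queries hit the cache.
        if uri not in self._direct:
            self._direct[uri] = set(self._fetch(uri))
        return self._direct[uri]

    def superclasses(self, uri):
        """Return uri together with all of its transitive superclasses."""
        seen = set()
        todo = [uri]
        while todo:
            current = todo.pop()
            if current in seen:
                continue
            seen.add(current)
            todo.extend(self._direct_superclasses(current))
        return seen


SHARP = "http://example.org/utypes/1.0#sharpBounds"
BOUNDS = "http://www.ivoa.net/ut/characterization#characterizationAxis-coverage-bounds"
fetched = []


def fake_fetch(uri):
    # Stand-in for the network: records each dereference.
    fetched.append(uri)
    return {SHARP: [BOUNDS]}.get(uri, [])


resolver = Resolver(fake_fetch)
first = resolver.superclasses(SHARP)
second = resolver.superclasses(SHARP)
```

The resolver starts with an empty knowledgebase, and the second query is answered entirely from the cache.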

Appendices

A–UTypes and FITS

The description above is expressed in terms of XML, through its reference to XML namespaces and its use of VOTable examples, but it is not specific to XML. To demonstrate this, and illustrate the potential use of these UTypes in other systems, we present here an example of how one might include UTypes in FITS files.

In a message to the IVOA data-modelling group, Jonathan McDowell proposed FITS keywords for UCDs and UTypes, namely TUCDnnnn and TUTYPnnn, each providing a UCD and UType for the data in the nnnth column.

This is already enough to reliably associate UTypes with columns, but it has the disadvantage that the UTypes in question would probably quickly run into the 72-character limit on FITS card values.

We could expand Jonathan's proposal by requiring the TUTYPnnn to include a namespace prefix, exactly as the utype VOTable attribute has, and adding a further header card to define the namespace prefix. This could be done with a header card TUTNSnnn, as follows:

TUTNS001=pfx:http://www.ivoa.net/ut/#
TUTYP010=pfx:axis

where the numbers nnn in TUTYPnnn refer to the annotated column, and nnn in TUTNSnnn distinguishes the namespace header cards from each other. Alternatively, namespaces could be defined with a card TUTNSaaa where the aaa letters define the necessarily short namespace prefix, as in

TUTNSpfx=http://www.ivoa.net/ut/#
TUTYP010=pfx:axis

This would have the side-effect of requiring that UTypes (or rather, the part of them following the namespace URI) have a maximum length of 68 characters (72 characters of a FITS card value, minus the three aaa characters and the colon). While this is unlikely to be a great imposition, it is worth noting that some of the proposed Characterisation UTypes [std:characterisation] are already tens of characters long.
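The first of the two schemes above (numbered TUTNSnnn cards) could be resolved along the following lines. This Python sketch treats the header as a plain mapping from keyword to value and ignores FITS string quoting; the function name is hypothetical and no real FITS library is used.

```python
def resolve_fits_utypes(cards):
    """Map column numbers to full UType URIs.

    cards is a dict of header keyword -> value string, using the
    TUTNSnnn / TUTYPnnn keywords discussed in the text.
    """
    # Collect namespace declarations of the form "prefix:namespace-URI".
    prefixes = {}
    for keyword, value in cards.items():
        if keyword.startswith("TUTNS"):
            prefix, _, namespace = value.partition(":")
            prefixes[prefix] = namespace

    # Expand each TUTYPnnn value of the form "prefix:local-name".
    utypes = {}
    for keyword, value in cards.items():
        if keyword.startswith("TUTYP"):
            column = int(keyword[len("TUTYP"):])
            prefix, _, local = value.partition(":")
            utypes[column] = prefixes[prefix] + local
    return utypes


header = {
    "TUTNS001": "pfx:http://www.ivoa.net/ut/#",
    "TUTYP010": "pfx:axis",
}
column_utypes = resolve_fits_utypes(header)
```

With the example cards above, column 10 is associated with the URI http://www.ivoa.net/ut/#axis.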

@@TODO is there more to say, here?

B–Apache recipes

In sections 3–Documentation and 4–Shared semantics above we describe dereferencing a URL and retrieving either HTML or RDF depending on the content-negotiation phase of the HTTP transaction -- that is, depending on the content of the HTTP Accept header. In this appendix we describe a simple recipe for configuring Apache to support this; there will be similar configurations for other web servers. We describe only a single configuration here; fuller examples are available in the W3C best-practice document [w3c:swbp].

A namespace such as http://www.ivoa.net/ut/# would (typically) correspond to a directory .../ut on the web server. Let us suppose that we have, in this server directory, HTML documentation in a file namespace.html and RDF in the Notation3 syntax in a file namespace.n3. For completeness, we might as well have the same information in (the largely unreadable) RDF/XML [std:rdfxml] syntax as well, in a file namespace.rdf.

We presume that this configuration is being done in a per-directory .htaccess file, and that the server has been configured to allow this, by allowing the FileInfo Options overrides. The following .htaccess file will have the desired effect:

AddType application/rdf+xml .rdf
# The MIME type for .n3 should be text/rdf+n3, not application/n3:
# see MIME notes at http://www.w3.org/2000/10/swap/doc/changes.html
AddType text/rdf+n3 .n3
AddCharset UTF-8 .n3

RewriteEngine on
# RewriteBase is the path to the current directory
RewriteBase /ut

# Use response code 303, 'See Other'.
RewriteCond %{HTTP_ACCEPT} application/rdf\+xml
RewriteRule ^$ namespace.rdf [R=303]

RewriteCond %{HTTP_ACCEPT} text/rdf\+n3
RewriteRule ^$ namespace.n3 [R=303]

# Default -- typically text/html
RewriteRule ^$ namespace.html

With this configuration we can dereference the namespace URL in two different ways, to retrieve two different results:

% curl http://www.ivoa.net/ut/
<html>
<head>
[...]
% curl -i --header accept:text/rdf+n3 http://www.ivoa.net/ut/
HTTP/1.0 303 See Other
Date: Thu, 30 Nov 2006 16:19:51 GMT
Server: Apache/1.3.33
Location: http://www.ivoa.net/ut/namespace.n3
Content-Type: text/html; charset=iso-8859-1

[...]
% curl -L --header accept:text/rdf+n3 http://www.ivoa.net/ut/
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
[...]
%

(The HTTP 303 `See Other' response is the appropriate RFC2616 [std:rfc2616] response, indicating that `[t]he response to the request can be found under a different URI and SHOULD be retrieved using a GET method on that resource'; the -L option tells curl to follow any Location headers in the initial response.)
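The decision made by the rewrite rules above can be mimicked in Python to make the order of the tests explicit. This is an illustrative model of the configuration, not a replacement for it: the function name is hypothetical, and, like the RewriteCond patterns, it does a simple pattern search on the Accept header and ignores any q-values.

```python
import re


def choose_representation(accept_header):
    """Return (file name, HTTP status) as the .htaccess rules would."""
    # Rules are tried in order; the first matching condition wins.
    if re.search(r"application/rdf\+xml", accept_header):
        return ("namespace.rdf", 303)
    if re.search(r"text/rdf\+n3", accept_header):
        return ("namespace.n3", 303)
    # Default: an internal rewrite to the HTML page, with no redirect.
    return ("namespace.html", 200)
```

A client sending both RDF media types would therefore be redirected to the RDF/XML file, since that rule comes first.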

C–Rationale

We include in this appendix a more discursive introduction to the problem this proposal is attempting to solve, and the larger social structure we expect to arise from it.

Standardisation is expensive, in both time and effort.

A standard must be as small as possible, so that it is more easily agreed on, and so that its documentation is not overwhelming; and it must at the same time be as large as possible, so that it covers enough of what its users want to exchange, to justify the effort of agreeing. The pressure for expanding the standard arises because, while standardisation is expensive, going beyond the standard incurs crippling costs as a result of the consequent loss of interoperability. Thus standardisation is not an end in itself, but merely a means to reach the real goal of universal interoperability.

The costs of standardisation arise because the participants in the standardisation process will have different designs in mind, and bring different implementations to the discussion. Sometimes these differences are merely accidents of history and taste, but sometimes they arise because the participants have different and incompatible requirements, so that the resulting standard ends up substantially more complicated than the designs that preceded it, still without completely satisfying anybody. Our particular concern here is the data models which structure shared data, which are variously designed for the convenience of the various data providers, but which a wide variety of data reduction applications nonetheless hope to read.

In this Note, we propose a structure which allows the different participants to retain their data models, and achieve interoperability, not by transforming their data into some never quite satisfactory consensus model, but by `explaining' their data model in terms applications can understand. Data providers can `explain' their model by analogy, saying that a concept in their data model is the same as, or a more specific variant of, a concept in another data model; if the latter concept is one which an application understands, then it knows how to handle the underlying data.

We would therefore expect to see a hierarchy of sets of UTypes.

At one extreme, it would be practical for data providers to provide extremely specific UTypes, to support those users who must deal with, for example, the nitty-gritty of a specific instrument. Such information would either be ignored by more generic data users, or be used by them as an instance of a more general, and more generally known, concept. A specialist user might care which particular filter was used in an observation, where a more general user would only need to know that it was J-band.
Users in a specific community, defined by wavelength (X-Ray or radio) or object (solar physicists), have sets of interoperable concepts which are important to them, but which would bloat an astronomy-wide standard. Radio astronomers use `janskys per beam' and `beam width in RA', in the face of general incomprehension; and though X-Ray astronomers are happy to talk about `barycentric coordinate time', (most) other astronomers are extremely happy not to.
At the top of this stack is the set of concepts which is used and understood by almost all of astronomy, and thus the set of UTypes used and understood by almost all astronomical applications. This is the level which would see the most careful standardisation of a relatively small set of UTypes.

We would therefore expect to see a large number of UTypes, which are of equal status in principle, but not in practice. It is in data providers' interests to make their data as widely intelligible as possible, by either using well-known UTypes or, where that is insufficiently precise, by `explaining' more specific ones in those terms. This creates an instability which produces a consensus on which UTypes are recognised as `well-known'. Of course, this process could be primed with an initial set of high level IVOA standard UTypes.

With this proposal, this last highest-level set of UTypes can be smaller than it might otherwise be, because it is no longer a costly disaster to omit things. If in retrospect it appears that a high-level standard omitted important concepts, then those can be developed in an agile fashion and stitched into the larger structure.

This agility emerges because this proposal facilitates not only different levels of specification, but also versioning and deprecation. The costs of versioning arise because it is expensive for applications to be reworked to use an updated version of a standard. If the new version's concepts are described in terms of the older version's, however, then it becomes reasonable for data providers to use the new improved version of a UType set, knowing that applications can deduce the relationship with the previous version they have coded-in knowledge of.

As well as versioning, reducing the community's reliance on a small set of gold-plated standards makes it possible for components of, or extensions to, standards to be designed, prototyped and maintained by specific interest groups, working independently.

Bibliography

[app:jena] Jena -- a semantic web framework for Java.: [Online].
[app:pellet] Pellet: An OWL DL reasoner.: [Online].
[birbeck05] Mark Birbeck.: CURIE syntax 1.0: A compact syntax for expressing URIs. [Online].
[std:characterisation] IVOA Data Model Working Group.: Data model for astronomical dataset characterisation. IVOA Note, feb 2006.
[std:n3] Tim Berners-Lee.: Notation 3. Web page, mar 2006.
[std:rdf] World Wide Web Consortium.: Resource Description Framework. [Online, cited February 2005].
[std:rdfs] Dan Brickley and R V Guha.: RDF vocabulary description language 1.0: RDF Schema. W3C Recommendation, feb 2004.
[std:rdfxml] Dave Beckett.: RDF/XML syntax specification (revised). W3C Recommendation, feb 2004.
[std:rfc2616] R T Fielding, J Gettys, J Mogul, H Frystyk, L Masinter, P Leach, and T Berners-Lee.: Hypertext transfer protocol -- HTTP/1.1. RFC 2616, jun 1999.
[std:rfc3986] Tim Berners-Lee, Roy Thomas Fielding, and L Masinter.: Uniform resource identifier (URI): Generic syntax. RFC 3986, jan 2005.
[std:rfc3987] M Duerst and M Suignard.: Internationalized resource identifiers (IRIs). RFC 3987, jan 2005.
[std:sia] Doug Tody and Ray Plante.: Simple image access specification. IVOA Working Draft, may 2004.
[std:votable] François Ochsenbein, Roy Williams, Clive Davenhall, Daniel Durand, Pierre Fernique, David Giaretta, Robert Hanisch, Tom McGlynn, Alex Szalay, Mark B. Taylor, and Andreas Wicenec.: VOTable format definition. [Online, cited July 2005].
[std:xmlns] Tim Bray, Dave Hollander, Andrew Layman, and Richard Tobin.: Namespaces in XML 1.0 (second edition). W3C Recommendation, aug 2006.
[w3c:swbp] Alistair Miles, Thomas Baker, and Ralph Swick.: Best practice recipes for publishing RDF vocabularies. W3C Working Draft, mar 2006.
