Rationale and charter of the WG
Early in the work of the NVO and AVO/Interop teams a group of us
emphasized the need for standards describing the structure and semantics
of astronomical data to permit dataset interoperability. The VO will
enable astronomers, and writers of astronomical software, to more easily
locate public archival data, but that is only of limited use if all of
the datasets are in different, mutually incompatible, and poorly
described formats. The DM WG was established to attack this problem.
A data model, in the sense used by the WG, is an abstract description of
concepts and their interrelationships, used to fix both the names and
meanings of concepts in the VO context and also their internal structure
and cross-connections.
The WG was created after the AVO demo meeting at Jodrell Bank in Jan
2003, following collaboration between the NVO Data Model WBS in the USA
and interested parties in the AVO projects in the UK and France which
led to a Data Model Technical Meeting in Cambridge (USA) in Oct 2002.
Jonathan McDowell (SAO/Cambridge, USA) was appointed chair of the WG. At
the Cambridge (UK) Interoperability Meeting in April 2003, the WG discussed
issues of scope and process. It was decided that an IVOA data model
would consist of:
(1) A white paper discussing and naming the concepts involved and describing
the model and its intent in text form.
(2) A set of UML diagrams (including at least class diagrams) constituting
a formal definition of the model.
(3) An XML schema file providing a reference serialization of the model.
The schema does not define the model, but validates it and clarifies
the intent of the model. XML document instances should also be provided
as examples.
A combined document containing these three parts will then be
circulated to the IVOA as an IVOA Working Draft.
Although other technologies could have been chosen (OWL or other ontology
tools rather than UML, FITS serializations rather than XML), the
above choices reflect a reasonable combination of simplicity, familiarity
to most WG members, and consistency with the rest of VO work. Of course
the approach is subject to future review by the WG, but so far
the object-oriented rather than ontological methodology is what has
been agreed on.
Although almost all aspects of the VO could benefit from a standard
data model, we decided in Cambridge that the WG itself would focus
on data models for datasets themselves, to solve the problem of
dataset description and dataset interoperability. Other data models
may be developed by other WGs and submitted to the DM WG for verification.
For instance, the Registry WG is developing a data model for registries.
Interface of the WG with other parts of the IVOA
Data models will be used:
- by data providers to describe their data to the VO in a standard way
- by service and application developers to read those standard descriptions
and work in terms of their concepts
- by later versions of the query language to express complex queries
in ways that can be reliably matched against data.
This means that data models will be serialized as XML and FITS in VO-published
datasets, serialized as XML in the VO query language, and implemented
as software classes (bindings) in VO-aware applications. These serializations
will probably be specified by the DAL and VOQL working groups in collaboration
with the DM WG, but this is still to be worked out.
Clearly the strongest connections of the DM WG are with the DAL (Data
Access Layer) and VO Query Language (VOQL) working groups, although all
WGs will make use of the concepts worked out in the DM WG.
Another strong interaction is with the UCD group. UCDs provide simple
descriptions of different physical concepts, without the precision
and structure of a data model. We anticipate that we will use the UCD
semantic vocabulary to tag data model objects with meanings.
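As a rough illustration of tagging data model objects with UCD meanings,
the sketch below attaches a semantic tag to each attribute of a model
object and selects attributes by tag prefix. The class names, the
attribute names and the tag strings are all assumptions made for this
example (the tags are merely written in the style of UCD words); none of
them is defined by the WG.

```python
from dataclasses import dataclass, field


@dataclass
class ModelAttribute:
    """One attribute of a hypothetical data model object."""
    name: str   # attribute name within the model
    unit: str   # physical unit, kept as free text in this sketch
    ucd: str    # UCD-style semantic tag (illustrative value only)


@dataclass
class ModelObject:
    """A hypothetical data model object holding tagged attributes."""
    name: str
    attributes: list = field(default_factory=list)

    def find_by_ucd(self, prefix):
        """Return the attributes whose tag starts with the given prefix."""
        return [a for a in self.attributes if a.ucd.startswith(prefix)]


# Illustrative instance: a spectrum with two tagged attributes.
spectrum = ModelObject("Spectrum", [
    ModelAttribute("wavelength", "nm", "em.wl"),
    ModelAttribute("flux", "Jy", "phot.flux.density"),
])

matches = spectrum.find_by_ucd("phot.")
print([a.name for a in matches])
```

The point of the sketch is the division of labour: the model supplies
structure (objects and attributes), while the tag supplies a simple
statement of meaning that a generic tool can match on.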
Requirements from other WGs
Requirements from other WGs will be provided both informally
and in terms of use cases. Currently we have a requirement from the
DAL WG to provide a data model that will let data providers describe
1-dimensional spectra in sufficient detail to permit their selection
in a query and their interpretation at a simple level after retrieval.
At present, requirements are submitted by email to the IVOA DM mailing
list.
Data Models from other WGs
We recommend that data models created by other WGs should be submitted
first to the DM mailing list in the form of an IVOA Working Draft. The
DM WG will comment on the model's completeness and its consistency with
other models in the IVOA, but will in general refrain from
second-guessing the technical details specific to the other WG's area of
specialty. The model will then be circulated as an IVOA Working Draft
in the usual way.
Connecting Different Models
In principle one could generate a single massive data model for the
whole of the IVOA. However, such a model would likely never converge.
Part of the WG charter is to ensure that the different data models for
different parts of the VO domain are as consistent with one another as
possible, and in particular that objects are reused where appropriate.
The WG should also ensure that models are as generic as possible - in
particular, that the generalization of a given model to all domains of
astronomy is discussed. This does not rule out the existence of models
which are specialized to particular subdisciplines, but such
specialization should be shown to be necessary (due to true uniqueness
or to considerations of efficiency or complexity).
Namespaces
At this stage we anticipate that different models will exist
for the same part of the problem domain. It has been suggested that
models used by particular projects within the VO be prefixed by
namespaces, e.g. CfA:Quantity, while models adopted as the preferred
description of a concept by the WG be identified by the IVOA namespace,
e.g. IVOA:Quantity. This proposal should be discussed further.
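One way to picture the namespace proposal is as a lookup keyed on a
(namespace, concept) pair. The sketch below is only an illustration:
the function names, the contents of the table, and the rule that an
unprefixed name is treated as belonging to the IVOA namespace are
assumptions made here, not decisions of the WG.

```python
def split_qualified(name, default_namespace="IVOA"):
    """Split 'Namespace:Concept' into a (namespace, concept) pair.

    Treating an unprefixed name as IVOA is an assumption of this sketch.
    """
    namespace, sep, concept = name.partition(":")
    if not sep:
        return default_namespace, name
    return namespace, concept


# Hypothetical table of known models, keyed by (namespace, concept).
MODELS = {
    ("IVOA", "Quantity"): "WG-preferred description",
    ("CfA", "Quantity"): "project-specific description",
}


def resolve(name):
    """Look up a qualified name; return None if no such model is known."""
    return MODELS.get(split_qualified(name))


print(resolve("CfA:Quantity"))
print(resolve("IVOA:Quantity"))
```

Keeping the namespace as part of the key lets a project-specific model
and the WG-preferred model of the same concept coexist without clashing.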
We also want to define interfaces with individual non-VO-specific projects.
Specific data models tuned to a particular set of data have been
developed, or are in progress, at projects such as SDSS, ALMA, Chandra
and Planck. Typically
these models will hide implicit assumptions specific to their domain,
and only make explicit those things which are not constant within their
problem. On the other hand, they may elaborate some areas of the problem
important to them in much greater detail than the IVOA generic models.
We should be able to provide partial mappings between these specific
models and our more general approach.
Serializations
The assumption in the IVOA is that the serialization language of
choice is XML. We still have to work out the details of
the interaction between XML-serialized data models and the VOTable
XML format.
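To make the idea of an XML serialization concrete, here is a minimal
sketch that writes a single quantity to XML and reads it back using the
Python standard library. The element and attribute names (Quantity,
Value, Unit, name) are invented for this example; they are not taken
from any IVOA schema and say nothing about how the interaction with
VOTable will be resolved.

```python
import xml.etree.ElementTree as ET


def quantity_to_xml(name, value, unit):
    """Serialize one quantity as a small XML fragment (hypothetical layout)."""
    elem = ET.Element("Quantity", {"name": name})
    ET.SubElement(elem, "Value").text = repr(value)
    ET.SubElement(elem, "Unit").text = unit
    return ET.tostring(elem, encoding="unicode")


def quantity_from_xml(text):
    """Parse the fragment written by quantity_to_xml back into a tuple."""
    elem = ET.fromstring(text)
    return elem.get("name"), float(elem.findtext("Value")), elem.findtext("Unit")


xml_text = quantity_to_xml("wavelength", 656.28, "nm")
print(xml_text)
print(quantity_from_xml(xml_text))
```

A round trip of this kind is the simplest check that a serialization
preserves the content of the model instance it represents.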
The FITS format is likely to remain the underlying binary data format
for large datasets for the foreseeable future. While the FITS format is
limited, in particular due to the 8-character keyword limit, it is
possible that it will sometimes be convenient to serialize data models
fully in FITS, to help analysis systems which do not conveniently
interact with XML parsers.
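The 8-character keyword limit means that any FITS serialization needs
some rule for shortening data model attribute names. The sketch below
shows one possible rule, assumed purely for illustration: uppercase the
name, drop characters that FITS keyword names do not allow, truncate to
8 characters, and resolve collisions with a numeric suffix. The
attribute names are hypothetical, and a real mapping would more likely
be a fixed table agreed by the WG.

```python
import re


def to_fits_keyword(name, used):
    """Map an attribute name to a unique keyword of at most 8 characters.

    `used` is the set of keywords already assigned; it is updated in place.
    """
    # FITS keyword names use only uppercase letters, digits, '-' and '_'.
    base = re.sub(r"[^A-Z0-9_-]", "", name.upper())[:8] or "KEY"
    candidate = base
    counter = 1
    while candidate in used:
        # On a collision, replace the tail of the keyword with a counter.
        suffix = str(counter)
        candidate = base[:8 - len(suffix)] + suffix
        counter += 1
    used.add(candidate)
    return candidate


used = set()
names = ["SpectralCoordinate", "SpectralCoordinateError", "flux"]
mapping = {n: to_fits_keyword(n, used) for n in names}
print(mapping)
```

Note how the first two names collapse to the same 8 characters and have
to be told apart by a suffix, which illustrates why the limit is a real
constraint on serializing richly named models.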
Bindings
The first priority in interoperability is defining the serialization
(byte representation as disk file or data stream) of the data model and
thus the interpretation of the serialized data. Without this, not even a
human can consistently interpret data from different sources. Also
important is the definition of the binding of the data model class
methods, i.e. the subroutine interface to the data. These two
definitions interact, to the extent that use cases first imply software
methods, and thinking about what these methods need in order to work
tells us what attributes and metadata must be present in the data files.
However, it is not necessary to elaborate all possible operations one
might want to do with a data model to have a working implementation. We
should therefore err on the side of defining simple interfaces which
expose the data concepts as directly as possible while still hiding
abstractions which unify different flavors of those concepts.
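As a sketch of what a simple binding of this kind might look like, the
example below defines one small interface for a spectral axis and two
implementations that store the data in different flavors (wavelength or
frequency). All class and method names are hypothetical; this is one
possible shape for such an interface, not a proposed IVOA binding.

```python
from abc import ABC, abstractmethod

SPEED_OF_LIGHT = 299_792_458.0  # metres per second


class SpectralAxis(ABC):
    """Hypothetical interface: callers only ever ask for wavelengths."""

    @abstractmethod
    def wavelengths_m(self):
        """Return the axis values as wavelengths in metres."""


class WavelengthAxis(SpectralAxis):
    """Axis stored directly as wavelengths in metres."""

    def __init__(self, wavelengths_m):
        self._wavelengths = list(wavelengths_m)

    def wavelengths_m(self):
        return list(self._wavelengths)


class FrequencyAxis(SpectralAxis):
    """Axis stored as frequencies in hertz, converted on request."""

    def __init__(self, frequencies_hz):
        self._frequencies = list(frequencies_hz)

    def wavelengths_m(self):
        return [SPEED_OF_LIGHT / f for f in self._frequencies]


def longest_wavelength_m(axis):
    """Works on any SpectralAxis, whichever flavor stores the data."""
    return max(axis.wavelengths_m())


print(longest_wavelength_m(WavelengthAxis([4.0e-7, 5.0e-7])))
print(longest_wavelength_m(FrequencyAxis([299_792_458.0, 599_584_916.0])))
```

The interface is deliberately small: it exposes one concept directly and
leaves further operations to be added only when a use case demands them.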
Proposal by the DM WG chair: The DM WG should define bindings in
languages which are ANSI standard and implemented portably on most
platforms in common use in the astronomical community. In particular,
since user surveys show that Linux is becoming the most common operating
system used by astronomers, bindings should be implementable on some
flavor of Linux. This in no way prevents IVOA member institutions from
implementing non-IVOA-standardized bindings in other non-portable or
proprietary languages only available on other operating systems.
Informally, Java seems the most likely language for initial
implementations. However, since ANSI C and Fortran remain the most
common languages used by developers of astronomical analysis software
and by actual astronomers, we may wish to explore bindings in these
traditional but still widely used languages. Note that the DM work
straddles the interface between the VO web portals and services - firmly
in the computer science world where fashionable XML/Java/WSDL
technologies are appropriate - and the end-user astronomer analysing
data on their desktop, whom we must be careful not to alienate by forcing
them to use technologies which are both unfamiliar and less well
adapted to their problem.