IVOA Web>IvoaTheory>InterOpOct2008Theory>InterOpOct2008TheoryDiscussion (2008-10-31, HerveWozniak)

Discussion page of Theory Sessions (see also InterOpOct2008Theory)

This is a repository for the discussion of critical issues that appeared during the various TIG sessions.

(scroll down to see the Session II and Session III discussions and open issues)

Session I: SimDB

[Gerard Lemson: I was writing an email, will instead try to address some points here]

I am trying to comment on some of the comments made yesterday in the theory 1 session on SimDB. It was clearly very unfortunate I could not be there, as the skype interface made proper discussions almost impossible for me and the presentation was hard to give. Also I had hoped that the audience would be able to follow the more technical reasonings.

I do realise that we could have made our lives simpler by simply giving an XML schema and describing it, similar to the way others have done. I believe that this is not the best way of doing data modelling, and is also not sufficient for the type of service we wanted to define. Hence we concentrate on the UML and often leave out discussion of the physical products, the serialisations. And if we do discuss them, it often gets tangled up with the discussion of the automation that we are using to derive these products, which is in principle completely irrelevant for the SimDB spec. I am furthermore not very good at writing documents, and would very much appreciate help with that. Please give suggestions on how, in your view, the explanation should have gone.

To start some other discussion I would like to react to some comments I (think I) heard.

First this. We have a reference implementation of a possible SimDb spec that prescribes the use of a particular XML schema for registering resources and a TAP compatible interface for querying it. The only thing we need to do there is finalise the data model (in UML), the rest will be automatically adjusted. And we need some text around this, for which there is work in progress.

(from Ray) "What is the connection of SimDB to the Registry?"

I think a SimDB is probably closer to a Registry than to a simple * access service's queryData part. SimDB is supposed to describe the design of, and interface to, a database (of some sort) that allows people to register (using XML documents) their simulation and post-processing results (experiments), and their simulation and post-processing codes (protocols). And SimDB supports querying this database using ADQL. So there are clear correspondences to a Registry. I believe the main difference is that SimDB/Resource-s are in general more fine-grained than the resources that are registered in a Registry. The other difference is that we try to work out the content model in quite some detail. Whether this means SimDB is "merely" a Registry extension, I don't know. Let's discuss that. Yesterday, and already in Beijing, we had a discussion with Ray about it being useful that some of the SimDB/Resources could also be registered directly in a Registry. We discussed there that scientific projects, often using a range of simulations and post-processing experiments, would be obvious targets for this. Hence we have explicitly added such a component to our model.

Precisely how one turns a SimDB/Resource into a Registry/Resource would have to be discussed. One reason why we want to have close cooperation from the Registry WG in this project.

(from Francoise) "The fact that the Registry model does not have a UML expression should not prevent us from using (some of) its concepts."

I am very well aware of this. I even said so explicitly in my presentation. We have already been inspired by concepts from the resource model in SimDB: after all, our base class is called Resource, we have a class called Contact, and we could easily change the design there to more closely follow the Registry design for Curation. I think those are really details in the end.

But the translation of these registry concepts into the SimDB context (and UML), as well as the discussion on which concepts could/should be added requires work. Work that I think should be done with assistance of the Registry WG. The current model is in some places on purpose somewhat vague, and the Curation part is there more as a place holder.

(from Ray) "IVO identifiers are supposed to be resolvable to a Registry Resource. So SimDB should use maybe only the positions after the # to identify elements."

This is the kind of feedback we have been asking for and that we should discuss more fully.
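To make Ray's suggestion concrete, here is a minimal Python sketch, assuming that the part of an identifier before the `#` resolves to a Registry Resource and the part after it names an element inside a SimDB. The identifier and the function name are hypothetical and not taken from any specification.

```python
from typing import Tuple


def split_identifier(identifier: str) -> Tuple[str, str]:
    """Split an identifier into (registry part, local fragment).

    The fragment is empty when the identifier contains no '#'.
    """
    registry_part, _, fragment = identifier.partition("#")
    return registry_part, fragment


# Hypothetical identifier, for illustration only.
registry_part, local_id = split_identifier("ivo://example.org/simdb#experiment/42")
```

Under this assumption only `registry_part` would need to be known to a Registry, while `local_id` would be meaningful only to the SimDB service itself.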

(from Rick) "We might use the Registry's Resource as base class in the SimDB data model, with our Experiment, Protocol etc classes inheriting from it."

This is the standard first guess on how to reuse other data models, it is also often wrong. Inheritance is a very tricky technique to use in data modelling, because it is very restrictive. For example saying that a SimDB/Simulation "is a" Registry/Resource would imply that I can send it to a Registry and it should accept it. This is LIKELY (may be discussed) not what is desired, any more than a Spectrum in the Spectrum data model "is a" resource, or an SIA model image is. Using inheritance, a SimDB/Simulation would furthermore inherit all features of a Registry/Resource, including constraints such as not-null. This would include features in the other model that are treated for specific reasons in more detail in SimDB. These other features were added into the Registry model in the form they are for the particular design purposes and requirements of a Registry. The requirements from Registry were different from those of SimDB, and consequently the resulting models are not going to be exactly the same.

However, what may be possible is to identify elements in the SimDB model that can be translated into the (likely) coarser elements in the Registry model. This translation can maybe even be formalised, for example as XSLT transformations, transforming (maybe multiple!!) SimDB/Resource XML documents into a Registry/Resource one. If doing so requires adding metadata to the SimDB model, that should be done. We have identified possible missing concepts already in the Curation area. We may, when adding these features, try to follow as much as possible the structure of the other model. This could/should be discussed with the Registry WG. And it may not be necessary to do this particular exercise in a first version of SimDB.
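The idea of translating several fine-grained documents into one coarser one can be sketched as follows. This is written in Python rather than XSLT purely for illustration. All element names (`simulation`, `postProcessing`, `project`, `contact`, `Resource`, `title`, `description`) are assumptions and do not reflect the actual SimDB or Registry schemas.

```python
import xml.etree.ElementTree as ET

# Two hypothetical SimDB resource documents belonging to the same project.
SIMDB_DOCS = [
    "<simulation><name>Run A</name><project>Demo Project</project>"
    "<contact>J. Doe</contact></simulation>",
    "<postProcessing><name>Halo finder on Run A</name><project>Demo Project</project>"
    "<contact>J. Doe</contact></postProcessing>",
]


def to_registry_resource(simdb_docs):
    """Collapse several fine-grained documents into one coarse resource element."""
    roots = [ET.fromstring(doc) for doc in simdb_docs]
    resource = ET.Element("Resource")
    # Coarse metadata is taken from the first document (an assumption of this sketch).
    ET.SubElement(resource, "title").text = roots[0].findtext("project")
    ET.SubElement(resource, "contact").text = roots[0].findtext("contact")
    # The fine-grained detail survives only as a summary string.
    ET.SubElement(resource, "description").text = "; ".join(
        root.findtext("name") for root in roots
    )
    return resource


registry_xml = ET.tostring(to_registry_resource(SIMDB_DOCS), encoding="unicode")
```

The mapping is deliberately lossy: detail present in the source documents is reduced to a summary, which is the sense in which the target elements are coarser.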

(?) apparent changes to the model

I want people to be aware that the changes I discussed yesterday and which I also implemented in the actual UML model are NOT to be taken to be the final word. I had discussed with Herve that I was going to propose some changes and that I was going to implement them so that it would be easier to discuss them if other people could see them. The old version of the model is still available on Volute (SVN version 779).

(?) too technical

I would have loved to have a less technical presentation. The main technicalities should have been discussed before. The fact is that only a few people are actually actively studying the model, though everybody has been invited to do so and ALL our efforts are openly available on Volute, as was mentioned in Trieste explicitly with all links.

For more than 1.5 years already, since before the SNAP workshop in Garching, I have contacted exec, tcg, wgs for assistance. These requests have almost continuously been met with almost complete silence, I am afraid to say.

(?) "The model seems too complex".

I think this is partly due to people not being familiar with the type of simulations we are modelling here. They therefore likely do not see the intrinsic different nature of these resources from the more standard images, spectra and source catalogues.

But I agree there is some level of complexity. First, if we look for simple measures of complexity, say number of classes, I count:

43 classes (object types) in SimDB (version 832) having in total 64 attributes, 20 references and 27 collections.
49 complexType declarations in the generated schemas (6 of these correspond to value types or enumerations).
8 root elements in the SimDB_root.xsd schema. These correspond to the SimDB resources.

Compare this with the total number of elements in other schemas (hope I used the latest versions):

registry (possibly rather old): O(15) simpleType and O(20) complexType declarations
Characterisation has 33 complexType declarations
spectrum datamodel has 46 complexType declarations
STC: 114 complexType and 16 simpleType declarations and I think O(200) root elements.

In terms of these numbers we're not overly complex I'd say.
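Counts of this kind can be reproduced with a few lines of Python. The following is a minimal sketch, assuming the schema is available as text; the tiny inline schema is a made-up example, not one of the schemas listed above.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Made-up schema used only to exercise the counting function.
SAMPLE_XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="Resource"/>
  <xs:complexType name="Experiment"/>
  <xs:simpleType name="DataType"><xs:restriction base="xs:string"/></xs:simpleType>
  <xs:element name="experiment" type="Experiment"/>
</xs:schema>"""


def count_declarations(xsd_text):
    """Count type declarations and top-level (root) elements in an XSD document."""
    root = ET.fromstring(xsd_text)
    return {
        # iter() also counts anonymous types nested inside other declarations.
        "complexType": len(list(root.iter(XS + "complexType"))),
        "simpleType": len(list(root.iter(XS + "simpleType"))),
        # findall() with a bare tag only matches direct children of the schema root.
        "rootElements": len(root.findall(XS + "element")),
    }


counts = count_declarations(SAMPLE_XSD)
```

Whether anonymous nested types should be included is a choice; counting only named top-level declarations would give lower numbers for schemas that nest heavily.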

I do agree that normalisation is complicating support. But could we please have this discussion properly, not coming with blanket statements that "the model is too complex, go simplify it" without understanding the motivations we have had. We have been trying to look at this issue very hard, and I realise that not everybody is as familiar with data modelling, but the normalisation came from requests from those who actually tried mapping their simulations to the model and tried to implement prototypes. Sometimes the world is simply more complex than we might like!

ADDENDUM

I do think that it may be an option to consider whether the current SimDB model is too much of an analysis or domain model, and should be denormalised to simplify usage. Note that in Victoria, where the decisions for our methodology were taken, we promised to deliver an analysis model, from which to derive the logical model and the physical models (serialisations). The fact remains that when we had a simpler logical model, actual users/providers were the ones arguing that it did not serve their purposes.

on characterisation

We have had many discussions about this one box in our model. It is indeed not the same as any of the elements in the Characterisation DM. But that model was intended to characterise observations, not N-body simulations etc. Hence it contains many features that are irrelevant for SimDB, for example calibration information, which do not belong in this model. There are other reasons.

However I hope people will be able to appreciate that we have actually attempted to follow the patterns from Characterisation DM of having more and more detailed descriptions of the actual data.

I have tried raising these issues in the DM WG as requirements from the theory IG and though the most recent discussion I had with Mireille and Francois seemed promising, we're clearly doomed to keep repeating the discussion.

Removed SubvolumeExtraction/-or and Visualisation/-or.

This was a proposal, implemented in the post-779 version of the data model on Volute. Arguments against this were brought forth by Rick. In particular, subvolume extraction still produces Snapshots (i.e. representations of 3+1D space), so it qualifies as a SimDB protocol/experiment.

The latter is not the case for Visualisations. Rick argues (if I am not mistaken) that it would be useful, when describing a complete project, to add the possibility of visualisations. The problem is that the Snapshot as a result is then not correct: we need to open up this part of the model, introducing a general Result base class, with for example Image as an additional subclass. As we have discussed before, such a design will also be required if we want to fit micro-simulations, for example, into the SimDB model.
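The kind of generalisation meant here can be sketched in Python as below. Only the names Result, Snapshot, Image and Experiment come from the discussion; every attribute is a hypothetical placeholder, and this is not the actual SimDB UML.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Result:
    """General base class for anything an experiment can produce."""
    name: str


@dataclass
class Snapshot(Result):
    """Representation of 3+1D space at one time (attribute is hypothetical)."""
    time: float


@dataclass
class Image(Result):
    """Result of a visualisation (attributes are hypothetical)."""
    width: int
    height: int


@dataclass
class Experiment:
    name: str
    # Any Result subclass is allowed here, not only Snapshot.
    results: List[Result] = field(default_factory=list)


visualisation = Experiment("density map of Run A", [Image("xy projection", 512, 512)])
extraction = Experiment("subvolume of Run A", [Snapshot("z=0 cutout", 13.7)])
```

The cost is that every consumer of `Experiment.results` must now handle more than one kind of result, which is the added complexity in question.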

I think we should consider whether we want to add this complexity for the current version of the model.

Remove CompositeExperiment and –Protocol.

A CompositeExperiment was introduced by request (I believe) of Franck. It would be good if he gave his original motivation for it.

Note that it should not, for it need not, represent a pipeline with individual experiments processing the result of earlier experiments, as in the example of the GalICS pipeline mentioned by Herve. Such pipelines can already be built explicitly in the model, and can be gathered into single Resources as Projects.

Session II: SimDAP

Service operations

GetAvailability - required
GetCapabilities - required
ListExperiments - required
ListSnapshots - required
- input parameters : experiment (required)
QueryData - required
- input parameters : experiment (required), snapshot (opt), properties (opt)
Cutout - optional
- input parameters : experiment (required), snapshot (opt), properties (opt), volume (optional)
Preview - optional
- input parameters : experiment (required), snapshot (opt), properties (opt)
Custom - optional
- input parameters : experiment (required), snapshot (opt), properties (opt)
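The operation list above can be written down as a small table and used to check a request. This is a minimal Python sketch; the operation and parameter names are those listed, while the dictionary layout, the function name and the error messages are assumptions made for illustration.

```python
# For each operation: is it required of a service, and which parameters does it
# take (True = required parameter, False = optional parameter).
COMMON = {"experiment": True, "snapshot": False, "properties": False}
OPERATIONS = {
    "GetAvailability": {"required": True, "params": {}},
    "GetCapabilities": {"required": True, "params": {}},
    "ListExperiments": {"required": True, "params": {}},
    "ListSnapshots": {"required": True, "params": {"experiment": True}},
    "QueryData": {"required": True, "params": dict(COMMON)},
    "Cutout": {"required": False, "params": dict(COMMON, volume=False)},
    "Preview": {"required": False, "params": dict(COMMON)},
    "Custom": {"required": False, "params": dict(COMMON)},
}


def check_request(operation, **params):
    """Return a list of problems with a request; an empty list means it is acceptable."""
    if operation not in OPERATIONS:
        return ["unknown operation: " + operation]
    spec = OPERATIONS[operation]["params"]
    problems = [
        "missing required parameter: " + name
        for name, required in spec.items()
        if required and name not in params
    ]
    problems += ["unexpected parameter: " + name for name in params if name not in spec]
    return problems
```

For example, `check_request("ListSnapshots")` reports the missing `experiment` parameter, while `check_request("QueryData", experiment="exp1")` returns an empty list.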

FLP: how to get the list of properties? RW: with ListExperiments or ListSnapshots

PSKoda: very similar to 3D spectral cube in SSA apart from projection RW: projection could be a custom service

VOTable describing file properties

CRB: is the file format organization specific to this kind of simulation? RW+CG: yes

Opened issues to be discussed on the theory@ivoa.net list

Query response: columns in the VOTable
Properties specification (e.g. density of the DM particles)
Data model extension to describe file content: separate data model or extension of SimDB?
Optional and required services for SimDAP

Exchanges are expected on the mailing list. Waiting for the CG and RW proposal on theory@ivoa.net.

After debate, it seems that a separate data model allows more flexibility. Moreover, the description of the file content is useless for discovery.

Session III: MicroSimulations

Protocol issues:

How to describe modular codes in SimDB? In particular when modules are not part of the core code and are developed by third parties.
No vocabulary for all parameters yet
Description of physics covered by the code

[HW: I wonder to what level we have to go down in the description of the physics. If we allow too much detail we increase the risk that no simulation at all is available]

Experiment issues:

Not obvious to choose between RepresentationObjects and Properties in some cases
Vocabulary again...

Various issues:

No way to describe stationary simulations: SimDB assumes implicitly that simulations are time dependent
How to distinguish between an experiment with only 1 snapshot 1/ corresponding to 1 time step for a time dependent code and 2/ corresponding to a stationary solution