SimDAL 1.0 Proposed Recommendation: Request for Comments

Public discussion page for the IVOA SimDAL 1.0 Proposed Recommendation.

The latest version of the SimDAL Specification can be found at:

http://www.ivoa.net/documents/SimDAL/index.html

Reference Interoperable Implementations

SimDAL Reference Implementations

Comments from the IVOA Community and TCG members during RFC period: 2016-07-08 - 2016-08-22

Comments from Enrique Solano

"First SimDAL Repositories store codes and theoretical projects descriptions. They can be used by clients to discover theoretical services"

As it is written now, it seems to me that SimDAL Repositories are the only way to discover theoretical services. This is not true: at least the "simdal search" services can also be found using the Registries. This should be clarified.

Answer: This has been clarified. The text now mentions that the components can be found in SimDAL repositories and registries.

"Finally, SimDAL Data Access services are dedicated to retrieve raw data."

Only raw data? This is not true. I would remove "raw" from the sentence.

Answer: Indeed. This has been corrected.

The inclusion of an Appendix describing some implementations and showing how these services work in real life would be more than desirable. This was done with SSAP and it was very useful.

Answer: This would be useful. An implementation note was written for the Simulation Data Model. It presents how to map the DM onto different kinds of simulations. We plan to do the same for SimDAL. Once the standard is accepted, we plan to write an Implementation Note that presents how to use SimDAL to publish different categories of simulations / numerical models.

A typo: In the Introduction, the sentence "It is a fine grain registry for numerical codes and simulations in the Virtual Observatory" is repeated twice

Answer: Thank you. This has been corrected.

Comments from Mark Taylor

I don't have a strong interest in SimDAL, and I have not thoroughly reviewed this draft, but I read it and have some comments.

This document departs from usual VO procedures in various ways, apparently reinventing the capabilities of TAP and the Registry for its own purposes. There is a rationale provided in Appendix B for avoiding use of TAP, which I'm not sure I find convincing, but I haven't gone into the requirements of simulation data access carefully enough to want to comment further on that.

Answer:

The notion of views may present similarities with TAP/TAP schemas. TAP has not been chosen as a solution because it does not fulfill the requirements for Theory. Theoretical services will publish very different kinds of numerical models and simulations (N-body / SPH / MHD simulations, asteroseismology models, radiative transfer codes, astrochemistry models, ...). Some of these theoretical results have a very large number of properties characterizing simulated objects (> 100 000 in one of the SimDAL implementations). These numbers are growing with the progress of numerical models.

Storing such data the TAP way would require having each property as a column of a table in a relational database, since TAP is strongly coupled to SQL and therefore to the relational model. This is simply not possible for the majority of the RDBMS currently in use (PostgreSQL, MySQL), which cannot manage such high-dimension data (i.e. that many table columns). High-dimension data and their use are served much better by other types of storage architecture, which publishers cannot use with TAP (or only with unreasonable difficulty) when these architectures have no SQL compatibility or adapter.
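To give a rough sense of scale for the argument above, the sketch below compares the quoted figure of 100 000 properties with the per-table column limits documented for PostgreSQL (1600) and MySQL (hard limit of 4096). The helper `tables_needed` is hypothetical, and the calculation ignores key columns and row-size limits, so the real situation would be somewhat worse.

```python
# Documented per-table column limits (PostgreSQL: 1600; MySQL: hard limit 4096).
COLUMN_LIMITS = {"PostgreSQL": 1600, "MySQL": 4096}

# Figure quoted in the answer: more than 100 000 properties per simulated object.
N_PROPERTIES = 100_000


def tables_needed(n_properties: int, max_columns: int) -> int:
    """Lower bound on the number of tables if every property is its own column."""
    return -(-n_properties // max_columns)  # ceiling division


for name, limit in COLUMN_LIMITS.items():
    count = tables_needed(N_PROPERTIES, limit)
    print(f"{name}: at least {count} tables of {limit} columns each")
```

Splitting one logical table across that many physical tables would require joins for almost every query, which is one way to read the "high dimension" objection.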

Note that the definition of SimDAL has taken so long because many technological solutions were tested (and implemented) before reaching the present proposal. Among them, TAP was tested on various data management systems / storage architectures. The conclusion of these tests is that TAP is not an option. The views solution adopted in SimDAL has two benefits:
1 - it decouples the standard VO interface from the technology used to store the data (so publishers can choose the technology they prefer depending on the particularities of their data)
2 - it is as similar as possible to TAP (virtual table + view schema) so that publishers already familiar with the VO should not be lost.

Concerning the SimDAL Repository part:
First, note that SimDAL components (and among them the SimDAL Repositories) are registered in the IVOA registries.
Unlike the registries, SimDAL Repositories describe resources (protocols / codes, projects, etc.) with the semantics defined in the Simulation Data Model. So it is only with SimDAL Repositories that a search for resources can be done using the SimDM semantics. Moreover, SimDAL Repositories are places where the SimDM XML serializations of projects and protocols (codes) are stored. These serializations are the descriptions of theoretical projects and codes that are published in the VO. IVOA registries do not have functionalities to store and query such serializations whereas SimDAL Repositories do.
Discussions with Markus (for the Registry W.G.) showed that some parts of these serializations could be transformed and ingested in the IVOA registries. Nevertheless, this would come at the cost of losing the relationships between SimDM classes, and so losing the hierarchy of the model and a part of the SimDM semantics.
Presently, the SimDAL Repository search API does not allow clients to fully benefit from the SimDM XML serializations, even though most scientific use cases would require fine-grained search in these serializations to efficiently discover protocols and projects of interest. This has been a choice for version 1.0 of SimDAL. Indeed, in the coming months / years we do not expect to have many registered IVOA theory services, so it should be easy for users to discover theoretical services with the SimDAL Repositories as presently defined. Nevertheless, as more and more theoretical services are registered, finer-grained search will become necessary. SimDAL Repositories as defined in version 1.0, storing the full XML serializations of projects and protocols, contain all the information, and the standardized relationships between these pieces of information, needed to answer these use cases. It will then be time to extend the capabilities of the Search API.

Section 1.1: Only a specimen IVOA architecture diagram is included; a real one should be used. In view of the unusual content of this standard as I discussed above, there should be some more detailed discussion here of which IVOA standards this document uses, which ones it avoids in favour of its own ways of doing similar things, and why.

Answer: Indeed. The diagram has been replaced. If a diagram with all the standards is required, it will be introduced in the corrected version of the document.

Section 3.2: The use of VOTable to encode errors here says it follows DALI, but in fact it looks different from the usual way that DALI-compliant services do it. The specification in this document encodes errors as a sequence of multiple (error_msg,error_code) pairs as rows within a TABLE, while DALI encodes an error as a single INFO element outside the TABLE element. I suspect this is a misunderstanding of DALI intention, but maybe it's deliberate because of the need to report sequences of errors rather than single ones. It should either be changed to match standard DALI practice, or if not it should be clear from the text that this is not DALI standard.

Answer: Thank you. That has been corrected.
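As a rough illustration of the DALI convention described in the comment (a single INFO element carrying the status, outside any TABLE), here is a minimal sketch using only the Python standard library. The function name `dali_error_votable` and the sample message are hypothetical, and the sketch does not claim to reproduce what the corrected SimDAL text specifies.

```python
import xml.etree.ElementTree as ET

VOTABLE_NS = "http://www.ivoa.net/xml/VOTable/v1.3"


def dali_error_votable(message: str) -> str:
    """Serialize a DALI-style error document: one INFO element, no TABLE."""
    votable = ET.Element("VOTABLE", {"version": "1.3", "xmlns": VOTABLE_NS})
    resource = ET.SubElement(votable, "RESOURCE", {"type": "results"})
    info = ET.SubElement(
        resource, "INFO", {"name": "QUERY_STATUS", "value": "ERROR"}
    )
    info.text = message  # human-readable text, suitable for direct display
    return ET.tostring(votable, encoding="unicode")


# Hypothetical error message, for illustration only.
print(dali_error_votable("Unknown parameter: projects"))
```

A service that needs to report several errors at once could still keep this INFO element for the displayable summary and add its own table alongside, provided the text states clearly that the table is not part of DALI.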

Section 4.2: "The response schema of the results table is (FIELD IDs):" but the following table has FIELDs with name attributes as listed rather than ID attributes. Some of the VOTable samples use lower-case element names, which is not permitted in VOTable.

Answer: Thank you. Also corrected.

There are reference implementations listed, which is good. However, I don't see any validators. I played around a bit with the implementations (not really understanding how to drive them properly); quite a few links in the obspm implementation lead to error pages. Validation tools should be provided by this stage of the review process, and ought to help in identifying missing/broken functionality like that I currently see in the obspm implementation.

Answer: At the Sesto InterOp in May 2015, when the procedure to finalize SimDAL was launched, Severin Gaudet (as chair of the TCG) asked for a validator but said that a client compatible with the reference implementations counts as a validator. So a client, rather than a simple validator, has been developed. It is compatible with the two reference implementations.

We tested the client (https://app.ism.obspm.fr/simdal-client/) and it seems to work properly.
A few comments on its use:
1 - To search for simulations, follow the order of the top menu: Search in the Repository, then do a SimDAL Search, and finally search in Access data. Each step provides the URIs for the next one.
2 - In the repository search, first select a SimDAL Repository before doing a {search} or asking for the list of {projects}.
3 - At each step, after a search, the system provides the URIs of the services. These URIs have to be copied and pasted into the next step.

-- MarkTaylor - 2016-07-14

Answers: --IVOA.FranckLePetit and DavidLanguignon - 2016-08-09

Comments from Markus Demleitner

Let me start with the very general remark that I believe this standard tries to do too much. I think it should be three different standards at least. When reading it, I kept having the creepy feeling that far too many details are left open, more or less by necessity because there's so much to specify. You're defining more than a dozen endpoints and quite a few VOTable hacks on 50 pages; perhaps tight integration, solid SimDM foundation and specialisation on particular use cases actually let you do that, but I'm concerned that all kinds of little issues will come up when different implementations try to interoperate. Is there a client that would exercise even half of the features described in the document? In your experimental implementations, was underspecification an issue?

In particular, I'm a bit concerned about the proliferation of end points. You're defining about as many end point types as the entire rest of the VO combined. Perhaps that's ok, in particular because by and large your interfaces appear fairly "small" and tidy compared to some other things we've produced in the VO, but it's at least somewhat of a liability for writing validators, and I suspect for implementations, too. Since quite a few of the interfaces are essentially just searches in (perhaps virtual) XML documents: have you investigated whether you could reduce the number of interfaces required by re-using, say, xpath or xquery or whatever?

In short, I believe you should split up this document into three pieces, each of which would work out to be more handleable.

Individual issues:

I couldn't find the document source, so I couldn't fix a number of typos and editorial glitches (e.g., "SimDAL as then" -> "SimDAL and the", two instances of "It is a fine grain registry for numerical codes and simulations in the Virtual Observatory" in the Introduction, "fine grain" -> "fine-grained" throughout). If you tell me where the source is, I'd volunteer for another round of proofreading.

sect. 2.1 ends with a pointer to use cases in Appendix A. The text continues with "use cases" in 2.2. It would help the comprehension of the document if the reason for this distribution of use cases were made clear (actually, I think 2.2 could be re-formulated a bit so they actually become requirements rather than use cases).

I think the document would profit from a bit of de-duplication (e.g., the affirmation that "only a few data centers" would implement a repository is made at least twice; the different URI forms for-id/3 vs. views?id=3).

Starting p. 8, there are references to "UML classes". I don't think the "UML" should be there. Perhaps just "classes" is enough, or one needs a different terminology. That (initial) modelling has been done in UML is, I think, of no import for this specification, and indeed I would hope that future versions of SimDM will come in VO-DML.

On p. 8, there's "ex star, cloud, halo" -- I'd much rather see "e.g.," than "ex"; in general, I think it would be better if the term "object" could be avoided here (if it indeed refers to "astronomical object"). Does SimDM perhaps already offer precise terms for what's meant here?

On p. 9, the "pivot format" (incidentally, I'm not sure I understand why it is called "pivot" -- perhaps a brief explanation could help?) is defined as consisting of several files which are given as what looks like file names. It is not clear to me whether these file names are part of the standard, and if so, how multiple experiment files are to be stored under one name. If, as I suppose, these are generic identifiers for "sub-formats", I think you shouldn't use file name-like names for these but instead use more format-like names for them. But perhaps all that should rather be part of SimDM.

On p. 10, I think "with a list of couple (error messages, error code)" should be "containing rows consisting of a string-valued column error_msg and an integer-valued column error_code." or so. Also, DALI says: "The content of the INFO element conveying the status should be a message suitable for display to the user describing the status." Of course, this cannot convey multiple error messages, but for improved compatibility with DALI I think you should keep the INFO text as something immediately displayable to a user. Also: Does the table in the error case have a name results as well or does it not? If it does, then perhaps the text currently in the paragraph "Result" on p. 11 should be an introductory paragraph to 3.2?

In 3.3, you define the "links" table; this is pretty much a stripped-down datalink table -- why do you not simply use datalink itself? It would have the advantage that client authors might already have code to parse and display datalink tables, and they'll curse you if they have to unnecessarily write some glue code just to shoehorn your table into their datalink data structures. Getting your SimDAL-specific terms into the Datalink vocabulary should not be a big deal.

I have a heartfelt dislike for your "foreign-key" GROUP in 3.3. My preference would be to just fix the column name(s) in results and links and be done with it. If, on the other hand, you want to establish a general mechanism for declaring foreign key relationships, don't do it here, do it in VOTable or in the VO-DML mapping document. We should do this properly; if every standard starts to ad-hoc this kind of annotation, VOTable will become an unimplementable, contradictory mess.

In general, after reading 3.3, I'd not be sure what I'm supposed to return in results. ident, yes. created? MUST? SHOULD? Just as an example? And as a client, what am I supposed to do with the results table? Just display it as an opaque table? You have a couple of words on "general" return fields quite a bit later; perhaps the document would profit if you pulled that part up a bit or at least referenced that from here.

In 3.4, I think you should give some explicit guidance as to what to say when a next_page/previous_page link has expired, be it because the query result was cached somewhere, be it because the underlying result has changed. In that same vein, you might consider recommending that services communicate an (estimated) validity span of the pagination link (see, e.g., OAI-PMH for how they did that).
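One way to picture the suggestion above is a pagination link that carries its own expiry time, in the spirit of the optional expirationDate attribute that OAI-PMH attaches to a resumptionToken. The sketch below is only an assumption about how a service might model this; `PageLink`, `issue_link` and `resolve_link` are hypothetical names, not part of SimDAL.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class PageLink:
    """Hypothetical pagination link with an estimated validity span."""

    token: str
    expires: datetime  # plays the role of OAI-PMH's expirationDate


def issue_link(token: str, now: datetime, lifetime: timedelta) -> PageLink:
    """Create a link that the service promises to honour until now + lifetime."""
    return PageLink(token=token, expires=now + lifetime)


def resolve_link(link: PageLink, now: datetime) -> str:
    """Return the token if the link is still valid, else fail explicitly."""
    if now >= link.expires:
        raise LookupError(
            f"pagination link expired at {link.expires.isoformat()}"
        )
    return link.token


start = datetime(2016, 9, 13, 12, 0, tzinfo=timezone.utc)
link = issue_link("page-2", now=start, lifetime=timedelta(minutes=30))
print(resolve_link(link, now=start + timedelta(minutes=5)))
```

Whatever the mechanism, the point of the comment stands: the client needs a defined, recognisable response when a link is no longer usable, rather than an arbitrary failure.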

Still in 3.4, I'd say there's not enough value in letting clients specify the page size to justify the complication in implementation. Let the service decide on the page size and trust that it's not so large as to overwhelm the client. Pagination is hard enough to get actually (!) right even without extra tricks.

In 3.7, you currently say "eventually followed by a decimal point and fractions of seconds"; I think you intend this to be "optionally followed", right? If not, I'd be severely concerned. In this context I think you should allow an optional "Z" at the end for compliance with other timestamp formats in the VO (ideally, just reference DALI 1.1 here).
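The reading proposed in this comment (fractional seconds optional, trailing "Z" optional) can be written down as a small validator. This is a sketch under assumptions: it requires the time part to be present and does not check calendar validity, and it is not taken from the SimDAL or DALI text.

```python
import re

TIMESTAMP_RE = re.compile(
    r"\d{4}-\d{2}-\d{2}"   # date: YYYY-MM-DD
    r"T\d{2}:\d{2}:\d{2}"  # time: hh:mm:ss (assumed mandatory here)
    r"(?:\.\d+)?"          # optional decimal point and fractions of seconds
    r"Z?"                  # optional UTC marker
)


def is_valid_timestamp(text: str) -> bool:
    """Check the syntactic form only; month/day ranges are not verified."""
    return TIMESTAMP_RE.fullmatch(text) is not None


for sample in ("2016-09-13T10:15:30", "2016-09-13T10:15:30.25Z",
               "2016-09-13T10:15:30.", "2016-09-13 10:15:30"):
    print(sample, is_valid_timestamp(sample))
```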

In 4.1, you are claiming {search} were "search for concepts" -- as far as I can make out, this is just a full-text search. If so, I'd say just say so: "perform full-text searches".

In 4.1, I guess I'd rather start with "formal" definition of the query parameters and then go on with all the explanation. I was a bit confused about the talk about q and att. (And I'd then remove the "Note" about att, too, as it only repeats what's (now) later said under "Parameter".)

In 4.1, you say that without a document schema "it is up to the client to understand what the attributes are and what they mean." I think that's misleading. The client simply has no way to figure out what attributes there are, no? Wouldn't "the metadata schema has to be communicated by non-standard means" or something like that be more appropriate?

In 4.1 and following, I think you should give the VOTable types you expect in the response schemas ("text" is fine, since I don't think you should mandate arraysize="*" on char fields; however, if, e.g., created is a timestamp, I think you should mandate the corresponding xtype).

In 4.2, is the project parameter mandatory?

In 4.3, the query example uses "projects" as the parameter name, whereas the defined parameter name is "project".

Which brings me to a general point: I think SimDAL should say somewhere whether its parameters are supposed to be repeatable (i.e.: can I pass multiple "project" parameters to, say, ?)
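The question matters in practice because HTTP query strings allow a parameter to occur several times, and a service has to decide what to do when it does. The sketch below uses Python's standard `urllib.parse.parse_qs`; the query string, the identifiers in it and the helper `get_single` are made up for illustration.

```python
from typing import Dict, List, Optional
from urllib.parse import parse_qs

# Hypothetical query string with the "project" parameter given twice.
QUERY = "project=ivo%3A%2F%2Fexample%2Fp1&project=ivo%3A%2F%2Fexample%2Fp2&q=halo"


def get_single(params: Dict[str, List[str]], name: str) -> Optional[str]:
    """Return the value of a non-repeatable parameter, rejecting repeats."""
    values = params.get(name, [])
    if len(values) > 1:
        raise ValueError(f"parameter {name!r} must not be repeated")
    return values[0] if values else None


params = parse_qs(QUERY)
print(params["project"])        # every occurrence is kept, in order
print(get_single(params, "q"))  # a parameter given once is returned as-is
```

If the standard declares a parameter repeatable, the service would read the whole list; if not, rejecting repeats as in `get_single` is one possible behaviour, but only the specification can say which applies.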

In 5, you say, at various places "These views can be seen as ASCII tab separated files.", "That is what would be done when performing a SQL query on a single flat table", "This server-side file is abstracted, in a VO context, as a VOTable.", "It aims at untying the standard and the implementation details." So -- I have to say I'm fairly confused. From what I can fathom from this, you're saying the underlying data structure is a relation, and a view is a projection of a subset of that relation? Whatever it is, I think you should define the basic data structure without reference to any specific serialisation, just in terms of the underlying mathematical model, exactly to untie model and implementation.

On p. 26, you give a VOTable group to declare foreign keys, which is fairly related to the foreign-key from 3.3, but has some additional PARAMs, but doesn't have a name and doesn't use ref. I appreciate that the use case is a bit different here, but couldn't there be one common mechanism for "foreign-key-like relationship between entities declared in some VOTable"? Sure, this might make the 3.3 GROUP a bit clumsier, and perhaps the typedness of the FIELDref is lost, but I'd consider this a small price to pay for at least internal consistency of SimDAL (of course, I'm still all for trying to do without some generic foreign-key mechanism defined in a DAL standard; let's have that in VOTable).

On p. 26, "Query Language", you should reference the concrete JSON standard people should implement against (or reference some Javascript specification and say which nonterminal your dictionaries should conform to). There are just too many flavours of JSON out there.
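A small example of the "flavours" problem: Python's standard `json` module accepts the tokens NaN, Infinity and -Infinity by default, although they are not part of the JSON grammar in RFC 8259. The sketch below shows one way to reject them; RFC 8259 is used here only as an example of a concrete reference, and `loads_strict` is a hypothetical helper, not something SimDAL defines.

```python
import json


def _reject_constant(name: str):
    """Called by the json module for NaN, Infinity and -Infinity."""
    raise ValueError(f"{name} is not valid JSON under RFC 8259")


def loads_strict(text: str) -> dict:
    """Parse a JSON object, refusing the non-standard numeric constants."""
    value = json.loads(text, parse_constant=_reject_constant)
    if not isinstance(value, dict):
        raise ValueError("expected a JSON object at the top level")
    return value


print(json.loads('{"mass": NaN}'))   # the lenient default accepts this
print(loads_strict('{"mass": 1.5}'))
```

Two implementations that disagree on such details would both claim to "speak JSON", which is the interoperability risk the comment points at.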

The {fields} end point apparently uses a REQUEST parameter and is polymorphic on it (REQUEST=search has a q parameter, REQUEST=schema has a field parameter). Isn't that a bit at odds with the rest of the design, where you have different endpoints for different functionalities? Why don't you split up these two functionalities into two endpoints (or, conversely, join a few other endpoints and use REQUEST to dispatch between different sub-functions; obviously, that's not my preference)?

In 6.3., a reader might suppose the first job creation request already returns the UWS document with the results element filled out. I'd suggest putting an "eventually" or something like this into "It returns a UWS resource".

p. 41f, "UWS extension" stipulates that SimDAL "differs" from UWS in two points. First, there's no joblist -- where does the SimDAL say that? If the sentence itself is the norm, this should be made much clearer. Also, I don't think there's a necessity to even outlaw it, since I'd expect most people would use off-the-shelf UWS components anyway that have their own ways of dealing with the "security" issues you cite. The second point, the use of JSON as a JCL, I cannot see as a difference from UWS, which does not specify the JCL in the first place. Which stipulation of UWS do you see violated there?

Finally, even in a technical text, the pervasive use of male-only forms reads a bit odd and cranky these days. Just use plural forms and don't worry about it ("...the final user can get a hint about if he is asking for too many..." -> "users can get a hint whether they are asking for too many...").

-- MarkusDemleitner - 2016-09-13

Comments from TCG members during the TCG Review Period: 2016-07-08 - 2016-08-22

WG chairs or vice chairs must read the Document, provide comments if any and formally indicate if they approve or do not approve of the Standard.

IG chairs or vice chairs are also encouraged to do the same, although their inputs are not compulsory.

TCG Chair & Vice Chair ( Matthew Graham, Pat Dowler )

Applications Working Group ( Pierre Fernique, Tom Donaldson )

Data Access Layer Working Group ( François Bonnarel, Marco Molinaro )

Data Model Working Group ( Mark Cresitello-Dittmar, Laurent Michel )

Grid & Web Services Working Group ( Brian Major, Giuliano Taffoni )

Registry Working Group ( Markus Demleitner, Theresa Dower )

We are somewhat concerned that there is a fairly large overlap between Registry and your repositories. It would seem that at least the plain {search} endpoint is largely covered by standard Registry infrastructure; for {projects} and {protocols} I'd say it's a matter of a Registry extension.

When you say "SimDAL services may be discovered through Registry queries", I think you should say "by looking for capabilities with the standard ids defined in sect. 3.6."

Beyond that, if you want to define a Registry extension (and I think you should), I think you should do so in the document. Splitting up the "DAL part" and the extension, as we've done with S*AP and TAP, has proven to be a severe maintenance liability. We are happy to assist you there, and as long as you have your metadata concepts worked out, this would be a quick process.

Talking about standardIds, in 3.6, it seems you are saying the curly braces should actually be part of the URI ("ivo://ivoa.net/std/SimDALSearch#{views}-1.0" in what you label "Example"). I doubt that is intended, but if it is, we would veto it; curly braces are not allowed in URIs.
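The objection to curly braces can be checked mechanically: RFC 3986 allows only unreserved characters, reserved characters and percent-escapes in a URI, and "{" and "}" belong to none of these sets. The sketch below is a character-level check only (it does not validate the full URI grammar), and the brace-free identifier in it is a made-up illustration, not the standardID the specification defines.

```python
import re
from urllib.parse import quote

# Characters RFC 3986 permits in a URI: unreserved, gen-delims, sub-delims,
# plus "%" to introduce percent-escapes.
URI_CHARS = re.compile(r"[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]*")


def has_only_uri_characters(uri: str) -> bool:
    """Character-level check only; the URI structure is not validated."""
    return URI_CHARS.fullmatch(uri) is not None


with_braces = "ivo://ivoa.net/std/SimDALSearch#{views}-1.0"
without_braces = "ivo://ivoa.net/std/SimDALSearch#views-1.0"  # hypothetical form

print(has_only_uri_characters(with_braces))
print(has_only_uri_characters(without_braces))
print(quote("{views}"))  # braces can only appear percent-encoded
```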

In 4.1, you say that search "should implement the pagination API" -- so, how does a client find out whether it does? As long as there is a possibility that a given service doesn't support pagination, I'd suggest you should say in 3.4 how to discover pagination support once and for all. From a Registry perspective, I'd say this is a fairly natural item for a Registry extension's metadata model.

In 4.1, you define what boils down to a universal metadata model, including a means for schema discovery. We note for the record that from the Registry experience we are fairly uneasy about the usability of such an extremely generic thing; also, we've found many metadata items have a natural tree structure, which of course is not really representable in such a flat key-value structure.

In 4.1, you say "(in the sense of ivo:// id)" for the authority. We are not quite sure what you intend to do here, but we strongly suspect you do not want an authority here. A publisher typically is an Organization in VOResource, not an Authority, and an Authority can register multiple Organizations. We believe what you want to say here is: "The IVOA Identifier of the publisher of the project...". This would mean that, provided the publishers did their job right, that # is a globally unique identifier (albeit one for which only the publisher part properly resolves, but that's fine).

In the VO Registry, the real complex point is the proliferation of information from the publishers to the searchable registries. This problem must surely exist in the proposed SimDAL system, as the number of publishers is apparently expected to be much larger than the number of repositories. Some indication of how the initial metadata transfer and subsequent updates should be performed (file format, transfer modalities, signaling,...) would strengthen our confidence in this part of the standard.