Provenance Data Model RFC Page
The RFC Review Period has ended. After several meetings with the DM chairs during the College Park Interop, the document will now evolve to include the comments. -- MathieuServillat - 2018-11-20
We are pleased to announce the new Working Draft period for Provenance DM.
Document
The attached document presents the IVOA Provenance Data Model v1.0, which is proposed for review in this PR document. The model has been examined thoroughly with respect to the original use cases. Recent contacts at the Provenance Week have led to a number of discussions aimed at stabilising the model and comparing it to the W3C data model and its customisations in various application domains (ProvONE, UniProv, RDA Prov patterns, etc). The provenance information is modeled with different focuses: the core model, with the main classes bound to the tasks (activities) in which datasets (entities) are involved, and an extended model to attach information on the configuration of those tasks (input parameters, config file), their description, and the context (ambient conditions, execution environment, etc). Not all projects need the full detail of the model, but the core model will answer the general need to trace the progenitors of a dataset, for instance.
Reference Implementation
Rave provenance data database access (K. Riebe)
CTA implementation (M. Servillat)
The Cherenkov Telescope Array (CTA) is the next generation ground-based very high energy gamma-ray instrument. Contrary to previous Cherenkov experiments, it will serve as an open observatory providing data to a wide astrophysics community, with the requirement to offer self-described data products to users who may be unaware of the specificities of Cherenkov astronomy. Because of the complexity of the detection process and of the data processing chain, provenance information on data products is necessary for the user to perform a correct scientific analysis. Provenance concepts are relevant for different aspects of CTA:
Pollux Provenance: a simple access protocol to provenance of theoretical spectra (M. Sanguillon)
POLLUX is a stellar spectra database providing access to high resolution synthetic spectra computed using the best available models of atmosphere (CMFGEN, ATLAS and MARCS), high-performance spectral synthesis codes (CMF_FLUX, SYNSPEC and TURBOSPECTRUM), atomic line lists from the VALD database, and specific molecular line lists for cool stars. Currently the provenance information is given to the astronomer in the header of the spectra files (depending on the format: FITS, ASCII, XML, VOTable, ...) but in a non-normalized description format. Implementing the provenance concepts in a standardized format allows users, on the one hand, to benefit from tools that create, visualize and convert the provenance description of these spectra and, on the other hand, to select data according to provenance criteria. In this context, the ProvSAP protocol has been implemented to retrieve provenance information in different serialization formats (PROV-N, PROV-JSON, PROV-XML, VOTABLE) and to build diagrams in the following graphic formats: PDF, PNG, SVG. These serializations and graphics are generated using the voprov Python package, derived from the prov Python library (MIT license) developed by Trung Dong Huynh (University of Southampton).
SVOM Quick Analysis (L. Michel)
The SVOM satellite is a Sino-French variable object monitor to be launched in 2021. When a transient object is detected, a fast alert is sent to the ground station through a worldwide VHF network. Burst advocates and instrument scientists are in charge of evaluating the scientific relevance of the event. To complete the assessment, scientists have at their disposal high level data products such as light curves or spectra generated by an automatic data reduction pipeline. In some cases, they might need to reprocess the raw data with refined parameters.
To do so, scientific products (calib_level >= 2) embed their JSON provenance serialization in a specific extension. This provenance instance can be extracted, updated and then uploaded to a dedicated pipeline to reprocess the photon list with different parameters.
PROV CDS Database: Implementation of a TAP service for Prov metadata bound to ObsTAP metadata (F. Bonnarel, M. Louys, G. Mantelet) TBC
ProvTAP is a proposal for providing Provenance metadata via TAP services. The draft can be found here (https://wiki.ivoa.net/internal/IVOA/ObservationProvenanceDataModel/ProvTAP.pdf ). It essentially provides a TAP schema that maps the provenance model onto a relational schema. The CDS ProvTAP service prototype implements this ProvTAP specification on top of a database providing metadata for HiPS generation as well as Schmidt plate digitization, cutout operations and RGB image construction. A full presentation of the prototype, with a "slide demo" including many ADQL queries, can be found here: https://indico.obspm.fr/event/59/session/1/contribution/7/material/slides/ An oral presentation will be made at the next ADASS.
A triple store implementation for an image database (M. Louys, F.X. Pineau, L. Holzmann, F. Bonnarel)
We have implemented the IVOA provenance DM proposed here in a BlazeGraph triplestore. It handles images with a succession of Activities and derived entities. ActivityDescription is searchable, as are Parameter values. The full paper and the corresponding ADASS poster are attached below. The list of [[https://wiki.ivoa.net/internal/IVOA/ProvenanceRFC/ProvQuerytest-3store.pdf][queries]] supported shows how the data provider can benefit from the encoded provenance to reorganise and optimise the database. It also shows the various queries an astronomer can use to assess data quality from the recorded Activities, applied methods, parameter values and dataset dependencies.
MuseWise Provenance: Implementation of ProvSAP (O. Streicher)
MUSE is an integral field spectrograph installed at the Very Large Telescope (VLT) of the European Southern Observatory (ESO). It consists of 24 spectrographs, providing a 1x1 arcmin FOV (7.5" in Narrow Field Mode) with 300x300 pixels. For each pixel, a spectrum covering the range 465-930 nm is provided. MuseWise is the data reduction framework that is used within the MUSE collaboration. It is built on the Astro-WISE system, which has been extended to support 3D spectroscopic data and integrates data reduction, provenance tracking, quality control and data analysis. The MUSE pipeline is very flexible and offers a variety of options and alternative data flows to fit the different instrument modes and scientific requirements. The implementation provides the information collected by the system using the ProvSAP protocol. The problems that arose during this implementation are discussed below.
Implementations Validators
TBC
RFC Review Period: 2018 October 22 - 2018 November 19
Comments from WG members
Comments by Ole Streicher
The PR was made before the working group converged on a common draft, leaving some discussions for the RFC period. So, the draft represents only part of the working group.
The main concerns and alternative proposals are presented here. For convenience, we compiled the changes from the first three topics (Parameters, special Entities, W3C serialization) into a PDF covering sections 2.2 and 3.2 of the draft, including updated class diagrams.
Reply from FrancoisBonnarel: The situation came from the fact that the concerns you expressed were not firmly supported by difficulties in tackling real use cases, despite your claims. See below why Parameters have been managed the way they are in the PR. I am pretty sure this can also work for your MUSE use case, Ole -- FrancoisBonnarel - 2018-10-23
Special handling of Parameters
While in the model
Reply from MathieuServillat: The PR explains that hadConfiguration is to be seen as a simple association table to attach more attributes (parameters) to an activity. Indeed, the ParameterDescription transports the attributes that describe the parameter, i.e. its name, ucd, utype, or finer definition such as min, max, options... So the ParameterDescription already transports the role of the parameter (its name in fact, completed by the ucd and utype) and it would be redundant to repeat the usage in a UsedDescription. "Mangled" is when you separate this description into several classes, which are here unified for one purpose -- MathieuServillat - 2018-10-24
Specifically the last item limits the usability of
Reply from FrancoisBonnarel: This is intentional -- FrancoisBonnarel - 2018-10-23
The specific handling of A discussion within the working group pointed out three ways to handle configuration values:
Parameter complicates the voprov data model significantly, without gaining an appropriate advantage.
Reply from MathieuServillat: In the PR, the hadConfiguration relation is the only way to declare a configuration; usage is another relation. For flexibility, what is marked as a parameter can also be generated and used for purposes other than configuration.
Another sign of a bad model is that for UWS parameters it requires a hack of having
Reply from MathieuServillat: This is not about modeling, but about the fact that an input parameter to an activity can happen to be a reference.
Therefore, we propose to homogenize the handling of
Reply from MathieuServillat: The limit here is that we should not assume what a general entity will be (how can we?), so we do not describe its content/value, just the general category (specialized entities that are commonly manipulated in astronomy) or the type of container (e.g. in the EntityDescription with format, content_type... i.e. how to read it and not what we read). However, configuration information in the form of parameters given as input to an activity is relevant provenance information (it helps assess the reliability and quality of the activity, see section 2.2.6 of the PR), so for such an input parameter we describe and restrict its expected value, so that it becomes queryable in the standard. The content/value of a Data entity is not expected to be queryable using a provenance system, nor to provide relevant provenance information. -- MathieuServillat - 2018-10-24
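The idea of describing and restricting a parameter's expected value so that it becomes queryable can be sketched as follows. This is only an illustration under stated assumptions, not the model's normative definition: the class layout, the field names `min_value` and `max_value` (standing in for the "min" and "max" mentioned above), and the helper `find_parameters` are hypothetical, and the parameter values are made up.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Sequence


@dataclass
class ParameterDescription:
    """Hypothetical stand-in for the model's ParameterDescription."""
    name: str
    ucd: Optional[str] = None
    utype: Optional[str] = None
    min_value: Optional[float] = None   # assumed name for "min"
    max_value: Optional[float] = None   # assumed name for "max"
    options: Optional[Sequence[Any]] = None

    def accepts(self, value: Any) -> bool:
        """Check a value against the declared restrictions."""
        if self.options is not None and value not in self.options:
            return False
        if self.min_value is not None and value < self.min_value:
            return False
        if self.max_value is not None and value > self.max_value:
            return False
        return True


@dataclass
class Parameter:
    """Hypothetical Parameter: a value bound to its description."""
    description: ParameterDescription
    value: Any


def find_parameters(parameters: List[Parameter], name: str,
                    predicate: Callable[[Any], bool]) -> List[Parameter]:
    """Select parameters by name and by a condition on their value."""
    return [p for p in parameters
            if p.description.name == name and predicate(p.value)]


# Illustrative values only; the names come from the SExtractor example.
zeropoint = ParameterDescription(name="MAG_ZEROPOINT",
                                 min_value=0.0, max_value=40.0)
parameters = [
    Parameter(zeropoint, 25.0),
    Parameter(ParameterDescription(name="ANALYSIS_THRESH"), 1.5),
]
selected = find_parameters(parameters, "MAG_ZEROPOINT", lambda v: v > 20)
```

Because the description carries the restrictions, a query on the value can be expressed without knowing anything about the entity's container format.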
Reply from FrancoisBonnarel: General statement to refuse this evolution: It is true that Parameter is managed in a special way in the current PR. It is intentional. Why? The Parameter class is there to tackle what astronomy application users generally call "parameters" of the application. Think of SExtractor or HipsGen. They have a couple of parameters such as "ANALYSIS_THRESH", "MAG_ZEROPOINT" (SExtractor), fov, skyval, border, publisher (HipsGen). They are definitely a different concept from DataEntity, which are the things we want to follow with our Provenance model, the things which are used or transformed by the activities. Parameters are so strongly bound to activities that they have no existence outside their binding to an activity. They have a value which is either a number or string, or a reference to external structures (files) or internal entities. In the latter case it is possible to use as a parameter value the id of something which has a history in the provenance. ActivityDescription and EntityDescription are there to gather all the properties common to activities sharing the same kind of processing and to the Entities they use or generate. ParameterDescription gathers all the metadata common to Parameters of Activities sharing the same ActivityDescription. They have the same kind of binding to ActivityDescription as Parameters have to activities. The organisation you are proposing, Ole, simply misses the specificity of what is intended by the parameter concept. The Parameter class also has a strong semantic value, and so have ParameterDescription and hadConfiguration. Last point: Parameter is a derivation of Entity and ParameterDescription a derivation of EntityDescription. The benefit of this is that generic W3C-aware software can manage these classes. Parameter is an Entity with a special type = voprov:parameter. But this type changes all the "movie" (as we say in French) for behavior and interpretation.
-- FrancoisBonnarel - 2018-10-23
Linking usage to EntityDescription
The voprov model defines a number of specialization of
Entities are linked to the Activity with different relations: used, hadConfiguration, and hadContext. Binding the usage to the Entity specialization is wrong, since the same Entity may be used differently by different Activities: e.g. what is a configuration for a data processing Activity may be input data for the generation of some visualization. It is also redundant, since the usage type is already specified by the usage relation.
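As a minimal illustration of this point, the sketch below records the same entity under two different relations. All identifiers (`ex:config_file`, `ex:reduce_data`, `ex:render_plot`) and the tuple layout are hypothetical; only the relation names used and hadConfiguration come from the draft.

```python
from typing import Dict, List, Tuple

# Each relation is stored as (activity, relation name, entity).
Relation = Tuple[str, str, str]

relations: List[Relation] = [
    # The file configures a data processing activity...
    ("ex:reduce_data", "hadConfiguration", "ex:config_file"),
    # ...and is plain input data for a visualization activity.
    ("ex:render_plot", "used", "ex:config_file"),
]


def roles_of(entity: str, rels: List[Relation]) -> Dict[str, str]:
    """Map each activity to the relation through which it reaches the entity."""
    return {activity: relation
            for activity, relation, ent in rels
            if ent == entity}


roles = roles_of("ex:config_file", relations)
```

In this representation the kind of usage is carried by the relation alone, so it does not have to be repeated in the type of the entity.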
The description of the
We therefore propose to remove the specialized
The specialized
Reply from MathieuServillat: The specialized entities were defined based on their existence as noticed in astronomy projects, not based on their usage. This gives latitude to users to tag their entities if they need to.
W3C serialization
The W3C serialization is an integral part of the draft (Sec. 3.2), but largely unspecified. It is, e.g., undefined how
Since the model is already largely W3C consistent, we propose to include normative W3C representations for all attributes and classes directly in the model description. The draft normatively defines the prefix for the namespaces, but not their URI. This is incomplete and unnecessarily inflexible. Usually the namespace is defined by (only) the URI, which should be followed here as well.
Reply from MathieuServillat: The serialization of hadConfiguration, hadContext, and hadDescription relations is defined in the document. The namespace label should indeed not be enforced.
VO-DML compatible model
The VO-DML compatible data model (Fig. 6) does not correspond to the original voprov model (e.g. Fig. 5) and is significantly restricted: While in the original model
Since the VO-DML compatible model is the base for other standards like provTAP, full correspondence to the original model is important here. So, the VO-DML compatible model must still be updated to reflect the recent changes. It is important to have serializations like provTAP fully compatible with the original model (and thus with the W3C prov structure): for many data processing frameworks that are used in astronomy (Kepler, Taverna etc.) there is ongoing work to produce W3C compatible provenance information, and it should be possible to use this to fill a database that can then be queried, e.g. with provTAP, without major restructuring.
Reply from MathieuServillat: Because provenance relations have to be converted to be VO-DML compliant, the idea was to show the conversion of the UML to a VO-DML compliant model. In Modelio, all the relations are there; the missing arrow will be added. The workflow frameworks you mention did not change their internal data model, but simply convert their information to a W3C compatible serialization.
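To make the "fill a database and query it" scenario concrete, here is a rough sketch that loads a PROV-JSON-like dictionary into relational tables and traces the progenitors of a dataset. Everything specific is an assumption: the table and column names are invented for illustration and are not the ProvTAP schema, the identifiers are made up, and the query is plain SQLite (a recursive common table expression), not ADQL.

```python
import sqlite3
from typing import List

# Hypothetical provenance document using the PROV-JSON relation layout.
prov_doc = {
    "entity": {"ex:raw": {}, "ex:calibrated": {}, "ex:image": {}},
    "activity": {"ex:calibrate": {}, "ex:stack": {}},
    "used": {
        "_:u1": {"prov:activity": "ex:calibrate", "prov:entity": "ex:raw"},
        "_:u2": {"prov:activity": "ex:stack", "prov:entity": "ex:calibrated"},
    },
    "wasGeneratedBy": {
        "_:g1": {"prov:entity": "ex:calibrated", "prov:activity": "ex:calibrate"},
        "_:g2": {"prov:entity": "ex:image", "prov:activity": "ex:stack"},
    },
}


def load(doc: dict) -> sqlite3.Connection:
    """Copy entities, activities and their relations into in-memory tables."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE entity (id TEXT PRIMARY KEY);
        CREATE TABLE activity (id TEXT PRIMARY KEY);
        CREATE TABLE used (activity TEXT, entity TEXT);
        CREATE TABLE was_generated_by (entity TEXT, activity TEXT);
    """)
    con.executemany("INSERT INTO entity VALUES (?)",
                    [(e,) for e in doc["entity"]])
    con.executemany("INSERT INTO activity VALUES (?)",
                    [(a,) for a in doc["activity"]])
    con.executemany("INSERT INTO used VALUES (?, ?)",
                    [(r["prov:activity"], r["prov:entity"])
                     for r in doc["used"].values()])
    con.executemany("INSERT INTO was_generated_by VALUES (?, ?)",
                    [(r["prov:entity"], r["prov:activity"])
                     for r in doc["wasGeneratedBy"].values()])
    return con


def progenitors(con: sqlite3.Connection, entity_id: str) -> List[str]:
    """Return all entities the given entity was (transitively) produced from."""
    rows = con.execute("""
        WITH RECURSIVE ancestors(id) AS (
            SELECT u.entity
              FROM was_generated_by g
              JOIN used u ON u.activity = g.activity
             WHERE g.entity = ?
            UNION
            SELECT u.entity
              FROM ancestors a
              JOIN was_generated_by g ON g.entity = a.id
              JOIN used u ON u.activity = g.activity
        )
        SELECT id FROM ancestors ORDER BY id
    """, (entity_id,)).fetchall()
    return [row[0] for row in rows]


connection = load(prov_doc)
history = progenitors(connection, "ex:image")
```

The loading step is a direct copy of the used and wasGeneratedBy relations, which is the sense in which a W3C-structured source needs no major restructuring before it can be queried.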
Attributes designated to Workflow
While the draft explicitly excludes reproducibility in a workflow as a use case, there are a number of attributes that do not describe the actual provenance or how the
Reply from MathieuServillat: This is not about workflows, this is about having proper activity descriptions that help the user understand what the activity does. It may help to build workflows, but managing workflows is about plugging an output into the input of another activity, and triggers other requirements. This is not in the scope of the model.
Several inconsistencies
There is a large number of inconsistencies in the document:
Reply from MathieuServillat: The caption indicates that this diagram is completed by other figures, see Fig. 8, and this is explained in the text.
Reply from MathieuServillat: See the diagram in Figure 8.
Reply from MathieuServillat: UsedDescription is clearly to describe Used. Having a context is not part of an activity description. A configuration is always described by ParameterDescription.
Reply from MathieuServillat: Indeed, this should be m:n.
Reply from MathieuServillat: OK, but start and stop time are the intrinsic properties of an activity; there should be some strong encouragement to fill in this information, which is always known somehow.
Reply from MathieuServillat: I think you are mixing the concept of a configuration parameter and its various representations in different contexts (VOTable, UWS...). Parameter is clearly defined in the document, which fixes its meaning.
Reply from MathieuServillat: The idea was to consider any piece of provenance information optional: the more we know the better, without enforcing. But in the case of parameters, it would indeed be relevant to force the presence of ParameterDescription.
Reply from MathieuServillat: There are no attributes... so they are not defined, yes.
Reply from MathieuServillat: Optional attributes may not be implemented. Those in bold must not be null. The others are expected to exist in the implementations (but can be null).
Reply from MathieuServillat: No role, they point to specialized entities, as those specialized entities generally exist as such in projects.
-- OleStreicher - 2018-10-22
General Reply: Because of the many and lengthy comments by Ole that did not come in time for the document but criticize its logic, the DM chairs consulted the group and fixed a new base for the model that should answer the various points of view expressed above.
Comments by Markus Demleitner
(a) Can I ask you to remove "IVOA Data Model Working Group" from the list of authors? I don't think it helps anyone, but things like these are painful for computers trying to do something sensible with author lists and have stung me far too often.
Reply: agreed.
(b) Introduction: "In this document, we discuss a draft of an IVOA standard data model for describing...". This obviously shouldn't make it into a REC. I'd drop the sentence right now and start with: "According to \citet{std:W3CProvDM}, provenance is ... For this document, we adopt that definition.".
Reply: agreed. This kind of phrasing will be changed in the final document.
(c) Minimum requirements: "We derived from our goals and use cases" doesn't seem to be quite true to me -- e.g., I don't see a use case for exchange of provenance information with non-IVOA software ("standard model") or even the links to other IVOA DMs. I don't dispute these are sane requirements, of course. Can't you just write: "We adopt the following requirements for the Prov DM"?
Reply: agreed.
(d) In the requirements, I'm not terribly happy about "if applicable" and friends. Can't you, for instance, say just which activities are exempt from having to have input entities? Sure, if that gets too verbose, it's counter-productive, but perhaps a few words can already go a long way towards making the requirements a bit more precise?
Reply: agreed. This will be rephrased.
(e) "Activities may point to output entities." -- why just "may"? What purpose could an output-less activity serve?
Reply: This will be rediscussed. Most of the provenance goes backward in time, so the strict requirement is for entities to point to the activity that generated them, which is sometimes the missing information. As you say, activities generally already keep track of the generated entities, so the requirement was seen as less strong.
(f) "Entities, Activities and Agents [...] should have persistent identifiers." I wouldn't do this -- many entities are fairly ephemeral, and even recommending to obtain a DOI for, say, a flatfield is, I think, going much too far. Similarly, not everyone may want to have an ORCID or spread it in a provenance database (and I'm not getting started on the GDPR here). And no, "it's optional" doesn't invalidate that point: if it's a SHOULD, a tool would still drop warnings if your flatfield doesn't have a DOI, and that can very well hide actual problems. Can't we just strike any language on PIDs here?
Reply: agreed. This topic should be avoided here; the requirement is to have a unique identifier.
(g) Fig. 3, "main core classes". I'm still unconvinced the wasDerivedFrom and wasInformedBy relations are a good idea in our context. I realise they are shortcuts and thus might seem convenient for people generating provenance instances. However, many more people will consume them. To them, every feature you add is extra work, and they'd probably have to de-serialise your shortcuts into null activities or null entities. Which they won't appreciate. Also, since you could just as well generate these null entities or activities yourself (i.e., in your provenance instances), these two additional relationships introduce multiple ways to represent the same thing. That's always an indication for a feature that will lead to headache later. So, let me plead again: Are the shortcuts really so valuable to you that it's worth burdening our implementors with them?
I also don't find too convincing the rationale for wasDerivedFrom on p. 14, "If there is more than one input and more than one output to an activity, it is not clear which entity was derived from which". If you have an activity with multiple inputs and outputs, it stands to reason that all inputs influence all outputs, so there's nothing for wasDerivedFrom to annotate. If there are distinct, unrelated groups of inputs and outputs then you really have two activities and you should describe them as such rather than hack around the deficient description.
Similarly for the "deemed to be not important enough to be recorded in a pipeline" on wasInformedBy. The overhead of introducing an Entity is really not high (unless of course you require persistent identifiers for them...). And nothing is so insignificant that a few words of description couldn't come in handy when someone reads a provenance graph. And then "state that an activity communicates with another"... hm -- that's not provenance, that's activity description ("workflow"), no?
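The de-serialisation of a shortcut into a null activity that this comment describes can be sketched as below. This is an illustration only: the function name `expand_derivations`, the generated identifiers (`ex:derivation_1`, `_:du1`, `_:dg1`) and the label text are hypothetical; the dictionary layout follows PROV-JSON, where a wasDerivedFrom record carries `prov:generatedEntity` and `prov:usedEntity`.

```python
from typing import Dict


def expand_derivations(doc: Dict[str, dict]) -> Dict[str, dict]:
    """Return a copy of a PROV-JSON-like document in which every
    wasDerivedFrom shortcut is replaced by a placeholder activity
    linked through used and wasGeneratedBy relations."""
    # Shallow-copy each section so the input document is left untouched.
    out = {key: dict(value) for key, value in doc.items()
           if key != "wasDerivedFrom"}
    for section in ("activity", "used", "wasGeneratedBy"):
        out.setdefault(section, {})

    derivations = doc.get("wasDerivedFrom", {})
    for n, relation in enumerate(derivations.values(), start=1):
        activity_id = f"ex:derivation_{n}"  # hypothetical identifier scheme
        out["activity"][activity_id] = {
            "prov:label": "placeholder for an unrecorded activity",
        }
        out["used"][f"_:du{n}"] = {
            "prov:activity": activity_id,
            "prov:entity": relation["prov:usedEntity"],
        }
        out["wasGeneratedBy"][f"_:dg{n}"] = {
            "prov:entity": relation["prov:generatedEntity"],
            "prov:activity": activity_id,
        }
    return out


shortcut_doc = {
    "entity": {"ex:raw": {}, "ex:image": {}},
    "wasDerivedFrom": {
        "_:d1": {"prov:generatedEntity": "ex:image",
                 "prov:usedEntity": "ex:raw"},
    },
}
expanded = expand_derivations(shortcut_doc)
```

Both forms describe the same derivation, which is the duplication the comment objects to: a consumer has to handle either representation.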
Reply: Those features are now optional, but it is important to normalize their meaning in the data model document.
(i) Table 1, "attributes of the Entity class": From my Registry experience, "rights" as specified here has been profoundly useless (in 10 years of having it in the Registry nobody has used it as designed); in VOResource 1.1 we therefore moved to DataCite's model of copyright and licensing information, which I'd recommend here as well if I didn't recommend removing rights here in the first place. You see, I don't think this is provenance's turf -- it's not in W3C PROV either. What use case did you have in mind for that?
Reply: agreed, it will be removed to focus on provenance information. As provenance information is to be attached to entities that are distributed, access rights may or may not apply. We did not plan to describe those rights in detail.
(j) Table 1, "attributes of the Entity class": if W3C PROV calls the description "description", and most everything else in the VO has "description", is there any deep reason you're using "annotation"? What would break if you used "description", too?
Reply: In fact "description" is not a W3C term; this was a mistake, and this attribute is new to the IVOA model. The idea is to have a place for comments on an entity. A description is not a comment, but a detailed explanation of what something is. We decided to use "description" for a free text attribute in the different description classes, and "comment" for a free text attribute of Activity, Entity and Agent to comment on the creation of a new instance. Annotation is reserved for a system that could add comments on something after it is created (it cannot be an attribute, but would be a class).
(k) Table 1, "attributes of the Entity class": in the caption you offer "url" as a "project-specific attribute" -- how would that be different from the standard "location" attribute? What should a client do if there is both url and location?
Reply: agreed. This was supposed to be removed from the table.
(l) Sect. 2.1.2 cites the "Dataset Metadata Model" -- since DatasetDM has a large overlap with ProvDM, and DatasetDM hasn't seen activity since March 2016, I'd rather not reference it here (as it says in opening material of WDs: 'It is inappropriate to use IVOA Working Drafts as reference materials or to cite them as other than "work in progress".'). My hope is still that once ProvDM is there we can perhaps create a version of DatasetDM with a clear separation of concerns with ProvDM. If that happens, we'll be happy if we've kept recursive dependencies at a minimum here. A similar argument applies to 2.1.6.
Reply: agreed. There will be no reference to DatasetDM, just to ObsCore if relevant.
(m) Sect 2.1.4 Activity -- what's the rationale for making startTime and endTime mandatory? Is there actually software that would become more complex if it couldn't rely on these? As an occasional user of provenance information, I have to say time was one of the processing attributes I've used less often (compared to, say, a description or the parameters of the processing step).
Reply: agreed. It can be null. What is important is that start and stop time are the structural attributes of an activity (an activity is something that has a start and a stop time), so they always exist, even if they are not recorded.
(n) Sect 2.1.4 Activity -- I'm very skeptical of the "status" attribute. Do you really want to record failed activities? If so, at least precisely define what you can have in status and define what it's for (a use case in section 1.1 would also be helpful). As a cautionary tale, the Registry lets people say that Resources can be active, inactive or deleted (in addition to the sensible deleted flag on the OAI-PMH level). Few VOResource features have wreaked more havoc, while really giving one nothing over what OAI-PMH already has. It's really much safer if you say "if it broke, don't advertise it".
Reply: agreed. This will be removed. Note that some activities may have a final status indicating an error and results generated before the error occurred... the status is therefore some kind of quality control parameter. Anyway, this won't be handled as an attribute.
(o) Sect. 2.1.5 Used/@time, WasGeneratedBy/@time -- are there really important use cases in which these couldn't be replaced by the activity's startTime and endTime (operationally, not conceptually)? Again, each extra feature puts a burden on the implementors, and I have a hard time imagining use cases in which this granularity would be necessary (if there are, you should really put them into Sect. 1.1).
Reply: The time of generation is now attached to the entity (commonly done now when a file is written, for example). The Used.time has been seen as relevant in case of long execution times (things may change in the context of the activity), so the possibility to have a usage time different from the start or stop time has been kept.
(p) 2.1.6 Agent, WasAttributedTo/@role. Rather than provide an "e.g." table of terms in the document, why don't you create a vocabulary right away? There's nice tooling for this -- just ask me if interested. But I'll admit right away I'm not terribly happy with the list of terms as it stands now -- if you look at the DataCite metadata kernel (https://support.datacite.org/docs/schema-40), contributorType, there are many overlaps with your list -- can't you re-use/reference what DataCite has? You see, it would suck if I had to introduce some static mapping between my DataCite metadata and provenance metadata I may need to write somewhere.
Reply: We now propose a free text parameter with a list of reserved words for this attribute.
(q) Extended Model. I admit to not having reviewed it. I'd strongly vote for having the core model put into REC first and only then going for the extended model. My impression is it's difficult enough to get core right. As ProvDM-Core is taken up, we can figure out what else we need and what might already be covered sufficiently well by core. Which of your use cases would you have to drop until something like the extended metadata were standardised?
Reply: The presentation may be misleading, as the core is in fact just the application of the W3C PROV concepts made VODML compliant. However, this does not cover all the goals of the model, hence the necessity of other classes in the model. The structure of the document has changed to map the features of the model to the exposed goals and use cases.
(r) Serialisation, Introduction: 'For FITS files, a provenance extension called “PROVENANCE” could be added which contains provenance information of the activities that generated the FITS file.' Please let's avoid subjunctive language in specs -- it helps nobody ("should I implement this, now, or shouldn't I?"). Either say "To include provenance information into a FITS file, generate a PROV-N string and write it as the array of a PROVENANCE extension (BITPIX=8)" (or whatever) or don't say anything at all.
Reply: Serializations will be proposed in a separate document.
(s) Sect 3.3, "VOTable Format" -- as I said in my last review, I don't see what purpose this VOTable serialisation serves. At least "emphasize the compatibility" is far too weak a reason for putting something into a standard in my book. Remember, people have to implement this stuff. I'm not arguing against a relational mapping of your model, but that needs to be defined much more carefully (presumably in ProvTAP, then). I strongly vote to remove the entire section 3.3; but if you don't remove it, it needs to explain much better what to do where (and why).
Reply: Serializations will be proposed in a separate document.
(t) Sect 3.4 "Description classes for web services" -- it's a cute idea, but it's so far from provenance that it really doesn't belong in this specification. If you think there's a use case for this, please transport this material into the DataLink specification (there's going to be an update for it anyway fairly soon). Nobody will look for material like this in the documentation for the provenance DM.
Reply: Serializations will be proposed in a separate document. However, the model has been designed with this compatibility in mind, hence there may first be a note, which may later become a recommendation for this feature.
(u) Sect 4 "Accessing provenance information" -- my advice is to strike this section and integrate what little material there still is in it into the introduction.
Reply: This section has been removed.
(v) Appendix A "Examples" -- wouldn't it be enough to just show (perhaps an abbreviated rendition of) the PROV-N example and tell people how to use standard software to get to the PROV-JSON one? As to VOTable, see above. Note that ivoatex also has an auxiliaryurl macro that you can use to deliver example files without having to include them verbatim in the document (see ivoatexDoc). Apart from reducing the scariness of the document (shaving off 10 pages downgrades it from OMG-60-pages! to Oh-dang-50-pages, which probably helps adoption a lot), shortening the examples section to what humans actually want or need to see probably saves a few trees, too -- IVOA documents are printed occasionally...
Reply: Serializations will be proposed in a separate document.
(w) Appendix B "Links to other DMs" -- oh my. "When delivering the data on request, the serialized versions can be adjusted to the corresponding notation." -- excuse me, but that won't work. If I get a request, how am I to know if I should include ProvDM or DatasetDM metadata? What technical reason should there be to distinguish between the two? No, I'm sorry, but we simply have to clean up our act. There needs to be at most one model per piece of reality. DatasetDM fortunately isn't REC yet -- it still can be re-written to use your classes where appropriate. With SimDM I'd not be worried too much -- people doing it probably won't bother with ProvDM or much else anyway. I suppose it'd be all right to just say in the introduction something like "For historical reasons, SimDM has its own rendition of provenance; we make no effort of reconciling the current and W3C's efforts with it". I'm also not convinced that the appendix on UWS (B.3) couldn't be shortened into a paragraph in the introduction; that, of course, is even easier if we keep to the core model and thus won't have to explain about Parameter.
Reply: Those appendices have been removed from the data model document. They may appear in some form in an implementation note.
B.4, finally, is too much of a "could be further developed" thing. I'm always advocating keeping promises for the future out of standard documents unless they actually help to keep implementors from making false assumptions. This doesn't seem to be the case here. Can we remove it?
Reply: Appendix removed.
-- MarkusDemleitner - 2018-10-29
Comments by Laurent Michel
General comments:
Reply: We now group all classes from W3C in the first section and use the exact W3C definition, with a reference to the section of the W3C PROV-DM document.
Reply: OK, the document now concentrates on defining the classes, giving examples and constraints, with no further justifications.
Reply: The reference to ProvONE has been removed.
Reply: The diagrams and the VODML definition have been updated to indicate the cardinalities.
Reply: We now show only the VODML compatible diagram.
Section 2.1:
Reply: Those parts have been rephrased or updated.
Section 2.2:
Reply: Those parts have been rephrased or updated.
TCG Review Period: TCG_start_date - TCG_end_date
TBC