Provenance Data Model RFC Page
Document
The attached document presents the IVOA Provenance Data Model v1.0 which is proposed for review in this PR document.
The model has been examined thoroughly with respect to the original use cases. Recent contacts at the Provenance Week have led to a number of discussions in order to stabilise the model and compare it to the W3C data model and its customisations in various application domains (ProvONE, UniProv, RDA Prov patterns, etc).
The provenance information is modeled with different focuses: the core model with the main classes bound to the tasks (activities) in which datasets (entities) are involved, and an extended model to attach information on the configuration of those tasks (input parameters, config file), their description, and the context (ambient conditions, execution environment, etc).
Not all projects need the full detail of the model, but the core model will answer the general needs to trace progenitors of a dataset, for instance.
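As a rough illustration of that core use case, the sketch below walks "used" and "wasGeneratedBy" records backwards to collect the progenitors of a dataset. The record layout and all identifiers (ent:raw, act:calibrate, ...) are hypothetical and not taken from the draft; only the two relation names come from the model.

```python
from collections import deque

# Hypothetical provenance records, for illustration only.
# used: (activity, input entity) pairs; wasGeneratedBy: output entity -> activity.
USED = [
    ("act:calibrate", "ent:raw"),
    ("act:calibrate", "ent:flat"),
    ("act:stack", "ent:calibrated"),
]
WAS_GENERATED_BY = {
    "ent:calibrated": "act:calibrate",
    "ent:stacked": "act:stack",
}


def progenitors(entity, used=USED, generated_by=WAS_GENERATED_BY):
    """Return the set of all entities the given entity was derived from."""
    # Index the inputs of each activity once.
    inputs_of = {}
    for activity, input_entity in used:
        inputs_of.setdefault(activity, []).append(input_entity)

    found = set()
    queue = deque([entity])
    while queue:
        current = queue.popleft()
        activity = generated_by.get(current)
        if activity is None:
            continue  # no recorded generating activity: end of the chain
        for parent in inputs_of.get(activity, []):
            if parent not in found:
                found.add(parent)
                queue.append(parent)
    return found
```

Calling progenitors("ent:stacked") on these sample records collects the calibrated frame and, through it, the raw frame and the flat field.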
Reference Implementation
Rave provenance data database access (K. Riebe)
CTA implementation (M. Servillat)
The Cherenkov Telescope Array (CTA) is the next generation ground-based very high energy gamma-ray instrument. Contrary to previous Cherenkov experiments, it will serve as an open observatory providing data to a wide astrophysics community, with the requirement to offer self-described data products to users who may be unaware of the specificities of Cherenkov astronomy. Because of the complexity of the detection process and of the data processing chain, provenance information on data products is necessary for the user to perform a correct scientific analysis.
Provenance concepts are relevant for different aspects of CTA:
- Pipeline: the CTA Observatory must ensure that data processing is traceable and reproducible, making the capture of provenance information necessary.
- Data diffusion: the diffused data products have to contain all the relevant context information as well as a description of the methods and algorithms used during the data processing.
- Instrument Configuration: the characteristics of the instrument at a given time have to be available and traceable (hardware changes, measurements of e.g. a reflectivity curve of a mirror, ...)
In the context of CTA, two main implementations were developed:
- a python Provenance class dedicated to the capture of provenance information of each CTA pipeline tool (including contextual information and connected to configuration information)
https://cta-observatory.github.io/ctapipe/api/ctapipe.core.Provenance.html
- OPUS: a job controller that can execute predefined jobs on a work cluster and expose the results. This job controller follows the IVOA UWS pattern and the definition of jobs proposed in the IVOA Provenance DM, so as to capture and expose the provenance information for each job and result via a ProvSAP interface.
https://uws-server.readthedocs.io
Pollux Provenance: a simple access protocol to provenance of theoretical spectra (M. Sanguillon)
POLLUX is a stellar spectra database proposing access to high resolution synthetic spectra computed using the best available models of atmosphere (CMFGEN, ATLAS and MARCS), performant spectral synthesis codes (CMF_FLUX, SYNSPEC and TURBOSPECTRUM) and atomic line lists from VALD database and specific molecular line lists for cool stars.
Currently the provenance information is given to the astronomer in the header of the spectra files (depending on the format: FITS, ASCII, XML, VOTable, ...) but in a non-normalized description format. Implementing the provenance concepts in a standardized format allows users, on the one hand, to benefit from tools that create, visualize, and convert the provenance description of these spectra to another format and, on the other hand, to select data based on provenance criteria.
In this context, the ProvSAP protocol has been implemented to retrieve provenance information in different serialization formats (PROV-N, PROV-JSON, PROV-XML, VOTABLE) and to build diagrams in the following graphic formats: PDF, PNG, SVG. These serializations and graphics are generated using the voprov Python package, derived from the prov Python library (MIT license) developed by Trung Dong Huynh (University of Southampton).
SVOM Quick Analysis (L. Michel)
The SVOM satellite is a Sino-French variable object monitor to be launched in 2021. When a transient object is detected, a fast alert is sent to the ground station through a worldwide VHF network. Burst advocates and instrument scientists are in charge of evaluating the scientific relevance of the event. To complete the assessment, scientists have at their disposal high level data products such as light curves or spectra generated by an automatic data reduction pipeline. In some cases, they might need to reprocess the raw data with refined parameters. To do so, scientific products (calib_level >= 2) embed their JSON provenance serialization in a specific extension. This provenance instance can be extracted, updated and then uploaded to a dedicated pipeline to reprocess the photon list with different parameters.
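The extract/update/upload cycle can be pictured with a minimal sketch. It assumes the embedded provenance is a PROV-JSON-like document in which a parameter is an entity carrying a prov:value attribute; the actual SVOM layout is not described here, and the identifiers (ex:energy_min, ex:make_lightcurve) and the example.org namespace are hypothetical.

```python
import json


def update_parameter(prov_json_text, parameter_id, new_value):
    """Return a new PROV-JSON string with one parameter value replaced.

    Assumes (hypothetically) that parameters are stored as entities
    whose value is held in the "prov:value" attribute.
    """
    doc = json.loads(prov_json_text)  # parsing builds a fresh structure
    try:
        doc["entity"][parameter_id]["prov:value"] = new_value
    except KeyError:
        raise KeyError(f"no entity {parameter_id!r} in provenance record") from None
    return json.dumps(doc, indent=2)


# Hypothetical provenance record as it might be extracted from a product.
extracted = json.dumps({
    "prefix": {"ex": "http://example.org/svom/"},
    "entity": {
        "ex:photon_list": {},
        "ex:energy_min": {"prov:value": 4.0},
    },
    "activity": {"ex:make_lightcurve": {}},
    "used": {
        "_:u1": {"prov:activity": "ex:make_lightcurve", "prov:entity": "ex:photon_list"},
        "_:u2": {"prov:activity": "ex:make_lightcurve", "prov:entity": "ex:energy_min"},
    },
})

# The refined record would then be uploaded to the reprocessing pipeline.
refined = update_parameter(extracted, "ex:energy_min", 8.0)
```

Only the parameter value changes; the relations to the photon list and to the activity are carried over untouched, which is what lets the pipeline rerun the same step with different settings.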
PROV CDS database: implementation of a TAP service for Prov metadata bound to ObsTAP metadata (F. Bonnarel, M. Louys, G. Mantelet) TBC
ProvTAP is a proposal for providing Provenance metadata via TAP services. The draft can be found here (http://volute.g-vo.org/svn/trunk/projects/dm/provenance/provtap/ProvTAP.pdf). It basically provides a TAP schema that maps the provenance model onto a relational schema.
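To give a feel for such a relational mapping, here is a small sketch using an in-memory SQLite database. The table and column names (entity, activity, used, was_generated_by) and the sample rows are hypothetical and are not taken from the ProvTAP draft, and the query is plain SQL rather than ADQL.

```python
import sqlite3

# Hypothetical relational layout: one table per core class and relation.
SCHEMA = """
CREATE TABLE entity (id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE activity (id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE used (activity_id TEXT, entity_id TEXT);
CREATE TABLE was_generated_by (entity_id TEXT, activity_id TEXT);
"""


def build_demo_db():
    """Create an in-memory database with a few made-up provenance rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO entity VALUES (?, ?)",
        [
            ("e:plate1", "digitized plate 1"),
            ("e:plate2", "digitized plate 2"),
            ("e:hips", "HiPS"),
        ],
    )
    conn.execute("INSERT INTO activity VALUES (?, ?)", ("a:hipsgen", "HiPS generation"))
    conn.executemany(
        "INSERT INTO used VALUES (?, ?)",
        [("a:hipsgen", "e:plate1"), ("a:hipsgen", "e:plate2")],
    )
    conn.execute("INSERT INTO was_generated_by VALUES (?, ?)", ("e:hips", "a:hipsgen"))
    return conn


def direct_progenitors(conn, entity_id):
    """Return ids of the entities used by the activity that generated entity_id."""
    rows = conn.execute(
        """
        SELECT u.entity_id
        FROM was_generated_by AS g
        JOIN used AS u ON u.activity_id = g.activity_id
        WHERE g.entity_id = ?
        ORDER BY u.entity_id
        """,
        (entity_id,),
    )
    return [row[0] for row in rows]
```

With the sample rows, direct_progenitors(build_demo_db(), "e:hips") returns the two plate entities: one join across the two relation tables answers the "what was this made from" question.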
The CDS ProvTAP service prototype implements this ProvTAP specification on top of a database providing metadata for HiPS generation as well as Schmidt plate digitization, cutout operations, and RGB image construction. A full presentation of the prototype, with a "slide demo" including many ADQL queries, can be found here:
https://indico.obspm.fr/event/59/session/1/contribution/7/material/slides/
An oral presentation will be made at the next ADASS.
MuseWise Provenance: Implementation of ProvSAP (O. Streicher)
MUSE is an integral field spectrograph installed at the Very Large Telescope (VLT) of the European Southern Observatory (ESO). It consists of 24 spectrographs, providing a 1x1 arcmin FOV (7.5" in Narrow Field Mode) with 300x300 pixels. For each pixel, a spectrum covering the range 465-930 nm is provided.
MuseWise is the data reduction framework that is used within the MUSE collaboration. It is built on the Astro-WISE system, which has been extended to support 3D spectroscopic data and integrates data reduction, provenance tracking, quality control and data analysis. The MUSE pipeline is very flexible and offers a variety of options and alternative data flows to fit the different instrument modes and scientific requirements.
The implementation provides the information collected by the system using the ProvSAP protocol. The problems that arose during this implementation are discussed below.
Implementations Validators
TBC
RFC Review Period: 2018 October 22 - 2018 November 19
Comments from WG members
The PR was made before the working group converged on a common draft, leaving some discussions for the RFC period. So, the draft represents only part of the working group.
The situation came from the fact that the concerns you expressed were not firmly supported by difficulties in tackling real use cases, despite your claims. See below why Parameters have been managed the way they are in the PR. I am pretty sure this can also work for your MUSE use-case, Ole -- FrancoisBonnarel - 2018-10-23
The main concerns and alternative proposals are presented here.
For convenience, we compiled the changes from the first three topics (Parameters, special Entities, W3C serialization) into a PDF covering sections 2.2 and 3.2 of the draft, including updated class diagrams.
Special handling of Parameters
While in the model Parameter is labelled as an Entity, they are used in a completely different way:
- The relation between the Entity and the EntityDescription is a hasDescription relation, while the Parameter is linked to the ParameterDescription by an attribute.
- The usage of an Entity by an Activity is specified by the used relation, while the usage of a Parameter by an Activity is specified by the hadConfiguration relation.
- While the description of the usage for an Entity happens with a UsedDescription, the description of the usage for a Parameter is mangled into the ParameterDescription.
This is not true. Used class contains a role attribute and HadConfiguration is a specialization of Used, which can give the role of the Parameter with respect to the Activity -- IVOA.FrancoisBonnarel - 2018-10-23
The class hierarchy of hadConfiguration is not specified in your Proposal. And all properties of the Parameter in your proposal are only with respect to a certain activity (this is how you actually define the Parameter, see your comment below), so they are in the ParameterDescription. For example the name, min, max, default, unit. The role attribute for hadConfiguration is explicitly undefined, therefore hadConfiguration has intentionally no corresponding description. -- OleStreicher - 2018-10-24
The PR explains that hadConfiguration is to be seen as a simple association table, no role there, a simplification of used, not a specialization. Indeed, the ParameterDescription transports the attributes that describe the parameter, i.e. its name, ucd, utype, or finer definition such as min, max, options... So the ParameterDescription already transports the role of the parameter (its name in fact, completed by the ucd and utype) and it would be redundant to repeat the usage in a UsedDescription. -- MathieuServillat - 2018-10-24
So you agree that Francois' comment was wrong. Maybe we remove the comments related to this to keep the discussion clean? Francois? -- OleStreicher - 2018-10-24
Specifically the last item limits the usability of Parameter: Parameter is bound to a specific ActivityDescription (since the Parameter has a single link to the ParameterDescription, and this has a single link to the ActivityDescription):
- They cannot be created independently from their usage
- They cannot be re-used by an Activity with a different ActivityDescription (not even just another version!).
- An Activity cannot accept Parameters with different ParameterDescriptions (f.e. different units) for the same place.
This is intentional -- FrancoisBonnarel - 2018-10-23
The specific handling of Parameters is limited to values that are used in the configuration of an Activity. In contrast, values that carry non-configuration data are handled as a general Entity. This is also the case for values that are used for configuration, but don't fit into the listed restrictions.
A discussion within the working group pointed out three ways to handle configuration values:
- They may be handled as a normal Parameter,
- They may be defined as Parameters, but used with a general Entity. In this case it is unclear which ParameterDescription is relevant: the one that is related to the Entity by the hasDescription relation or the one related to the Activity by the hadConfiguration relation.
- They may be handled like a general Entity. This contradicts however the class hierarchy, which only defines Parameter, ConfigFile and ObsConfig as subclasses of Configuration, but not Entities that carry a value. So it is unclear whether f.e. they shall be related to the Activity via the used or the hadConfiguration relation.
The choice between these three ways may vary between different implementations, and even within one implementation. A client (or a query) therefore needs to check all possible ways to get a complete answer, and on the implementation side the selection of the correct way is not trivial either. So, the special handling of Parameter complicates the voprov data model significantly, without gaining an appropriate advantage.
Another sign of a bad model is that for UWS parameters it requires a hack of having Parameters that actually contain a reference instead of a value, which is explained in the Appendix B3 of the draft.
Therefore, we propose to homogenize the handling of Parameter with the handling of other Entities by re-using the used, EntityDescription, and UsageDescription classes also for Parameter. Then, Parameter is special only by the fact that it directly contains a value, while other Entities would refer to their content by a link.
No, (Data) Entities have values too. -- FrancoisBonnarel - 2018-10-23
You did misunderstand this sentence: Our alternative proposal is to make Parameter be a special Entity that carries a value, while other (Data) Entities refer to the content via link. -- OleStreicher - 2018-10-24
The limit here is that we should not assume what a general entity will be (how can we?), so we do not describe the content/value of it, just the general category (specialized entities that are commonly manipulated in astronomy) or the type of container (in e.g. the EntityDescription with format, content_type... i.e. how to read it and not what we read). However, configuration information in the form of parameters as an input to an activity is relevant provenance information (it helps assess the reliability and quality of the activity, see section 2.2.6 of the PR), so for this input parameter, we describe and restrict the expected value of it, so that it becomes queryable in the standard. The content/value of a Data entity is not expected to be queryable using a provenance system, nor to provide relevant provenance information. -- MathieuServillat - 2018-10-24
The value of a Parameter is also stored in the simpler alternative model. Our model does not restrict the queryability of that value. Also your proposal does not specify that BTW. -- OleStreicher - 2018-10-24
This would also make the described hack unnecessary.
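As a rough sketch of the alternative proposal only (not of the PR model), the structure can be written down with a few classes. The class and field layout below is an illustrative assumption: Parameter is an Entity that carries its value directly, other Entities point to their content by a link, and hadConfiguration is a specialized used relation. The "border" value and the identifiers are made up.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class EntityDescription:
    name: str
    unit: Optional[str] = None


@dataclass
class Entity:
    id: str
    description: Optional[EntityDescription] = None
    location: Optional[str] = None  # link to the content, for non-Parameter entities


@dataclass
class Parameter(Entity):
    value: Any = None  # a Parameter carries its value directly


@dataclass
class Activity:
    id: str
    description_name: str  # stands in for a full ActivityDescription


@dataclass
class Used:
    activity: Activity
    entity: Entity
    role: str


@dataclass
class HadConfiguration(Used):
    """Specialized used relation for configuration input."""


# The same Parameter is re-used by two versions of an application.
border = Parameter(id="p:border", description=EntityDescription("border"), value=10)
run_old = Activity("a:run1", "HipsGen/1.00")
run_new = Activity("a:run2", "HipsGen/1.01")
relations = [
    HadConfiguration(run_old, border, role="border"),
    HadConfiguration(run_new, border, role="border"),
]
```

Because the Parameter is linked to an EntityDescription rather than to one ActivityDescription, nothing in this layout prevents the second run from pointing at the same object.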
General statement to refuse this evolution: It is true that Parameter is managed in a special way in the current PR. It is intentional. Why? The Parameter class is there to tackle what astronomy application users generally call "parameters" of the application. Think of SExtractor or HipsGen. They have a couple of parameters such as "ANALYSIS_THRESH", "MAG_ZEROPOINT" (SExtractor), fov, skyval, border, publisher (HipsGen). They are definitely a different concept from DataEntity, which are the things we want to follow with our Provenance model, the things which are used or transformed by the activities. So parameters are so strongly bound to activities that they don't have existence outside their binding to an activity. They have a value which is either a number or string, or a reference to external structures (files) or internal entities. In the latter case it is possible to use as a parameter value an id of something which has a history in the provenance. ActivityDescription and EntityDescription are there to gather all the properties common to activities sharing the same kind of processing and to Entities they use or generate. ParameterDescription gathers all the metadata common to Parameters of Activities sharing the same ActivityDescription. They have the same kind of binding to ActivityDescription as Parameters have to activities. The organisation you are proposing, Ole, simply misses the specificity of what is intended by the parameter concept. Parameter class has also a strong semantic value and so have ParameterDescription and hadConfiguration. Last point: Parameter is a derivation of Entity and ParameterDescription a derivation of EntityDescription. The benefit of this is that generic W3C-aware software can manage these classes. Parameter is an Entity with a special type = voprov:parameter. But this type changes all the "movie" (as we say in French) for behavior and interpretation. -- FrancoisBonnarel - 2018-10-23
For the discussion of (IVOA) parameters see point 6 in section "Several inconsistencies" below.
Independent of this, you did not explain why specifics of Parameter require a distinct structure in the model instead of completely integrating them into the normal Entity - Activity relations. All the requirements you specified for parameters are already fulfilled by the simpler alternative model. Getting specific configuration parameters can simply be done by querying for the hadConfiguration relation. And specifically for parameters like ANALYSIS_THRESH, I don't see why you never want to give them a provenance and follow them in our provenance model (Where did this value come from? How was it created? Who entered that value into the pipeline?). Or why you don't want to re-use the same Parameter for an Activity with a different ActivityDescription (f.e. for a bugfixed "HipsGen/1.01" instead of "HipsGen/1.00").
So, parameters are (from the provenance point of view) not so different from other entities:
- They may have provenance information attached, or come without provenance,
- They may be restricted to a specific ActivityDescription (which is one specific version of an application), or they may be applied to other versions, or even other applications
-- OleStreicher - 2018-10-26 (edited)
In addition to François' examples, I answer this above, and it is explained why the value of a parameter is relevant provenance information in the PR at the beginning of section 2.2.6. -- MathieuServillat - 2018-10-24
Linking usage to EntityDescription
The voprov model defines a number of specializations of Entity, based on their usage:
- MainEntity (Data, Visualization, Document),
- Configuration (Parameter, ConfigFile, ObsConfig),
- Context (AmbientConditions, InstrumentalContext, ExecutionEnvironment).
These Entities are linked to the Activity with different relations: used, hadConfiguration, and hadContext.
Binding the usage to the Entity specialization is wrong, since the same Entity may be used differently by different Activities: f.e. what is a configuration for a data processing Activity may be input data for the generation of some visualization. It is also redundant, since the usage type is already specified by the usage relation.
The description of the Entity itself (f.e. whether it is a "key=value" list in a given format) is already done in the EntityDescription. This makes further specializations unnecessary.
We therefore propose to remove the specialized Entity classes Configuration and Context and their subclasses from the standard and use the main class here, together with specialized usage relations. It should then also be defined that the hadConfiguration and hadContext relations are specialized used relations.
The specialized Entity classes Visualization, Document, and Device can also be replaced by putting this information into their EntityDescription (this is what the EntityDescription was made for). In ProvONE, this could not be done, since ProvONE does not have an EntityDescription to describe the Entity.
W3C serialization
The W3C serialization is an integral part of the draft (Sec. 3.2), but largely unspecified. It is f.e. undefined how the hadConfiguration, hadContext, and hadDescription relations for Entity are represented in the W3C serialization. For the representation of other attributes, only a suggestion is given. This makes it impossible to develop a client that consistently uses these attributes, which makes the format rather useless.
Since the model is already largely W3C consistent, we propose to include normative W3C representations for all attributes and classes directly in the model description.
The draft normatively defines the prefix for the namespaces, but not their URI. This is incomplete and unnecessarily inflexible. Usually the namespace is defined by (only) the URI, which should be followed here as well.
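The point about prefixes can be illustrated with a small sketch that expands qualified names in a PROV-JSON-like document against its prefix table. The voprov namespace URI used below is a placeholder (the draft does not define one), and the attribute name "unit" is likewise hypothetical; only the W3C PROV namespace URI is the real one.

```python
PROV_NS = "http://www.w3.org/ns/prov#"  # W3C PROV namespace
VOPROV_URI = "http://example.org/voprov#"  # placeholder, not defined by the draft


def expand(qualified_name, prefixes):
    """Turn 'prefix:local' into a full URI using the prefix table."""
    prefix, sep, local = qualified_name.partition(":")
    if not sep:
        raise ValueError(f"not a qualified name: {qualified_name!r}")
    return prefixes[prefix] + local


def expanded_attributes(prov_json, entity_id):
    """Return the attributes of one entity, keyed by fully expanded names."""
    prefixes = {"prov": PROV_NS, **prov_json.get("prefix", {})}
    attributes = prov_json["entity"][entity_id]
    return {expand(key, prefixes): value for key, value in attributes.items()}


# Two documents that differ only in the prefix chosen for the same namespace.
doc_a = {
    "prefix": {"voprov": VOPROV_URI, "ex": "http://example.org/"},
    "entity": {"ex:p1": {"voprov:unit": "deg", "prov:value": 2.0}},
}
doc_b = {
    "prefix": {"vp": VOPROV_URI, "ex": "http://example.org/"},
    "entity": {"ex:p1": {"vp:unit": "deg", "prov:value": 2.0}},
}
```

After expansion both documents yield the same attribute names, so a client that compares URIs does not care whether the producer wrote "voprov:" or "vp:". A client that matches on the literal prefix would treat them as different.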
VO-DML compatible model
The VO-DML compatible data model (Fig. 6) does not correspond to the original voprov model (f.e. Fig. 5) and is significantly restricted: While in the original model Parameter is an Entity and may have provenance information attached (like how the Parameter was created), in the VO-DML compatible model Parameter is not derived from Entity and therefore cannot have provenance information.
Since the VO-DML compatible model is the base for other standards like provTAP, full correspondence to the original model is important here. So, the VO-DML compatible model must still be updated to reflect the recent changes.
It is important to have serializations like provTAP fully compatible with the original model (and thus with the W3C prov structure): for many data processing frameworks that are used in astronomy (Kepler, Taverna etc.) there is ongoing work to produce W3C compatible provenance information, and it should be possible to use this to fill a database that can be f.e. queried with provTAP without major restructuring.
Attributes designated to Workflow
While the draft explicitly excludes reproducibility in a workflow as a use case, there are a number of attributes that do not describe the actual provenance or how the Activity actually works, but document how an Activity should be used. This includes:
- min, max, options, and (default) value in ParameterDescription
- multiplicity, entityDescription in UsedDescription
- multiplicity, entityDescription in WasGeneratedByDescription
These attributes describe the requirements of the input resp. the intended output, which is part of the workflow instead of the provenance, and should therefore be removed from the draft.
Several inconsistencies
There is a large number of inconsistencies in the document:
- The class diagram (Fig. 5) limits hadConfiguration to Parameter, while the text implies that this relation should also be used for ConfigFile and ObsConfig
- The text allows a Parameter to be replaced by a general Entity, but this is not shown in the class diagram
- It is unclear whether the usage of a Context or a (non-Parameter) Configuration (ConfigFile, ObsConfig), expressed by a hadContext or a hadConfiguration relation, shall/may be accompanied by a UsedDescription.
- The class diagram specifies a m:n relation between UsedDescription resp. WasGeneratedByDescription and EntityDescription, but the corresponding tables (Table 14 and 15) specify only a single link (1:n relation).
- The startTime and endTime attributes in Activity (Table 3) are mandatory (cannot be empty). However, sometimes they may be unknown or not relevant. They should be optional (this does not affect the ability to sort or to query those attributes).
- As motivation for the special handling of Parameter, the text lists several uses of the word "param(eter)" in the VO, claiming that they belong to a single concept. This is however not true: in UWS a parameter (uws:param) corresponds to a generic Entity and a used relation in voprov, since it may contain either a value or a reference, and it is unspecified whether it is data or configuration. A VOTABLE column with its FIELD element is somehow modelled similar to a list of Parameters in the draft; however it is also not limited to configuration and may actually carry non-config data. It is also unrelated to a specific activity. So, this cannot serve as motivation for the definition of voprov:Parameter in its current form (carry a value, limited to configuration, bound to a single ActivityDescription).
- In the documentation of Parameter, value is documented as "[...] type depends on ParameterDescription.datatype and xtype". However, the ParameterDescription is optional. It is unclear how the type is specified when the ParameterDescription is missing. It is also unclear whether the same is true for other Entities that carry a value.
- The attributes of the specialized relations (hasDescription, hadDescription, hadConfiguration, hadContext) are not defined.
- The tables often contain attributes that "must not be null" (bold), other attributes, and "Optional Attributes". It is unclear what the difference between other attributes and "Optional Attributes" is. Also, below the "Optional Attributes", links are listed (separated by a horizontal line). It is unclear whether they also belong to the Optional Attributes.
- The hadContext and hadConfiguration relations have an empty role attribute. This makes it impossible to relate them to a certain description. If an Activity uses several Entities (files) for configuration with the hadConfiguration relation, there is no way to distinguish their role.
-- OleStreicher - 2018-10-22
Comments by Markus Demleitner
(a) Can I ask you to remove "IVOA Data Model Working Group" from the
list of authors? I don't think it helps anyone, but things like these
are painful for computers trying to do something sensible with author
lists and have stung me far too often.
(b) Introduction: "In this document, we discuss a draft of an IVOA
standard data model for describing...". This obviously shouldn't make
it into a REC. I'd drop the sentence right now and start with:
"According to \citet{std:W3CProvDM}, provenance is ... For this
document, we adopt that definition.".
(c) Minimum requirements: "We derived from our goals and use cases"
doesn't seem to be quite true to me -- e.g., I don't see a use case for
exchange of provenance information with non-IVOA software ("standard
model") or even the links to other IVOA DMs. I don't dispute these are
sane requirements, of course. Can't you just write: "We adopt the
following requirements for the Prov DM"?
(d) In the requirements, I'm not terribly happy about "if applicable"
and friends. Can't you, for instance, say just which activities are
exempt from having to have input entities? Sure, if that gets too
verbose, it's counter-productive, but perhaps a few words can already go
a long way towards making the requirements a bit more precise?
(e) "Activities may point to output entities." -- why just "may"? What
purpose could an output-less activity serve?
(f) "Entities, Activities and Agents [...] should have persistent
identifiers." I wouldn't do this -- many entities are fairly ephemeral,
and even recommending to obtain a DOI for, say, a flatfield is, I think,
going much too far. Similarly, not everyone may want to have an ORCID
or spread it in a provenance database (and I'm not getting started on
the GDPR here). And no, "it's optional" doesn't invalidate that
point: if it's a SHOULD a tool would still drop warnings if your
flatfield doesn't have a DOI, and that can very well hide actual
problems. Can't we just strike any language on PIDs here?
(g) Fig. 3, "main core classes". I'm still unconvinced the
wasDerivedFrom and wasInformedBy relations are a good idea in our
context. I realise they are shortcuts and thus might seem convenient
for people generating provenance instances. However, many more people
will consume them. To them, every feature you add is extra work, and
they'd probably have to de-serialise your shortcuts into null activities
or null entities. Which they won't appreciate.
Also, since you could just as well generate these null entities or
activities yourself (i.e., in your provenance instances), these two
additional relationships introduce multiple ways to represent the
same thing. That's always an indication for a feature that will lead
to headache later.
So, let me plead again: Are the shortcuts really so valuable to you
that it's worth burdening our implementors with them?
I also don't find too convincing the rationale for wasDerivedFrom on
p. 14, "If there is more than one input and more than one output to
an activity, it is not clear which entity was derived from which".
If you have an activity with multiple inputs and outputs, it stands
to reason that all inputs influence all outputs, so there's nothing
for wasDerivedFrom to annotate. If there's distinct, unrelated
groups of inputs and outputs then you really have two activities and
you should describe them as such rather than hack around the
deficient description.
Similarly for the "deemed to be not important enough to be recorded in a
pipeline" on wasInformedBy. The overhead of introducing an Entity is
really not high (unless of course you require persistent identifiers for
them...). And nothing is so insignificant that a few words of
description couldn't come in handy when someone reads a provenance
graph.
And then "state that an activity communicates with another"... hm --
that's not provenance, that's activity description ("workflow"), no?
(i) Table 1, "attributes of the Entity class": From my Registry
experience, "rights" as specified here has been profoundly useless (in
10 years of having it in the Registry nobody has used it as designed);
in VOResource 1.1 we therefore moved to DataCite's model of copyright
and licensing information, which I'd recommend here as well if I
didn't recommend removing rights here in the first place. You see, I
don't think this is provenance's turf -- it's not in W3C PROV either.
What use case did you have in mind for that?
(j) Table 1, "attributes of the Entity class": if W3C PROV calls the
description "description", and most everything else in the VO has
"description", is there any deep reason you're using "annotation"?
What would break if you used "description", too?
(k) Table 1, "attributes of the Entity class": in the caption you offer
"url" as a "project-specific attribute" -- how would that be
different from the standard "location" attribute? What should a
client do if there is both url and location?
(l) Sect. 2.1.2 cites the "Dataset Metadata Model" -- since DatasetDM has
a large overlap with ProvDM, and DatasetDM hasn't seen activity since
March 2016, I'd rather not reference it here (as it says in opening
material of WDs: 'It is inappropriate to use IVOA Working Drafts as
reference materials or to cite them as other than "work in progress".').
My hope is still that once ProvDM is there we can perhaps create a
version of DatasetDM with a clear separation of concerns with ProvDM.
If that happens, we'll be happy if we've kept recursive
dependencies at a minimum here. A similar argument applies to 2.1.6.
(m) Sect 2.1.4 Activity -- what's the rationale for making startTime and
endTime mandatory? Is there actually software that would become more
complex if it couldn't rely on these? As an occasional user of provenance
information, I have to say time was one of the processing attributes
I've used less often (compared to, say a description or the parameters
of the processing step).
(n) Sect 2.1.4 Activity -- I'm very skeptical of the "status" attribute.
Do you really want to record failed activities? If so, at least
precisely define what you can have in status and define what it's for
(a use case in section 1.1 would also be helpful). As a cautionary
tale, the Registry lets people say that Resources can be active,
inactive or deleted (in addition to the sensible deleted flag on the
OAI-PMH level). Few VOResource features have wreaked more havoc,
while really giving one nothing over what OAI-PMH already has. It's
really much safer if you say "if it broke, don't advertise it".
(o) Sect. 2.1.5 Used/@time, WasGeneratedBy/@time -- are there really
important use cases in which these couldn't be replaced by the
activity's startTime and endTime (operationally, not conceptually)?
Again, each extra feature puts a burden on the implementors, and I
have a hard time imagining use cases in which this granularity would
be necessary (if there are, you should really put them into Sect.
1.1).
(p) 2.1.6 Agent, WasAttributedTo/@role. Rather than provide an "e.g."
table of terms in the document, why don't you create a vocabulary
right away? There's nice tooling for this -- just ask me if
interested. But I'll admit right away I'm not terribly happy with
the list of terms as it stands now -- if you look at the DataCite
metadata kernel (https://support.datacite.org/docs/schema-40),
contributorType, there are many overlaps with your list -- can't you
re-use/reference what DataCite has? You see, it would suck if I had
to introduce some static mapping between my DataCite metadata and
provenance metadata I may need to write somewhere.
(q) Extended Model. I admit to not having reviewed it. I'd strongly
vote for having the core model put into REC first and only then going
for the extended model.
My impression is it's difficult enough to get core right.
As ProvDM-Core is taken up, we can figure out what else we need and
what might already be covered sufficiently well by core. Which of
your use cases would you have to drop until something like the
extended metadata were standardised?
(r) Serialisation, Introduction: 'For FITS files, a provenance extension
called “PROVENANCE” could be added which contains provenance information
of the activities that generated the FITS file.' Please let's avoid
subjunctive language in specs -- it helps nobody ("should I implement
this, now, or shouldn't I?").
Either say "To include provenance information into a FITS file,
generate a PROV-N string and write it as the array of a PROVENANCE
extension (BITPIX=8)" (or whatever) or don't say anything at all.
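To show how little such a concrete rule would demand of implementors, here is a minimal sketch using only the Python standard library. It assumes the layout floated above (the PROV-N string stored as a one-dimensional BITPIX=8 array in an IMAGE extension named PROVENANCE); that layout is an illustration and is not defined by the specification. The function names and the example PROV-N document are hypothetical.

```python
BLOCK = 2880  # FITS headers and data are padded to 2880-byte blocks

# Hypothetical PROV-N document, used only for illustration.
PROVN = (
    "document\n"
    "  prefix ex <http://example.org/>\n"
    "  entity(ex:image)\n"
    "  activity(ex:calibration, 2018-10-29T10:00:00, 2018-10-29T10:05:00)\n"
    "  wasGeneratedBy(ex:image, ex:calibration, -)\n"
    "endDocument\n"
)


def card(keyword, value):
    """Format one 80-character fixed-format header card (int or short str values only)."""
    if isinstance(value, int):
        field = str(value).rjust(20)
    else:
        # Strings are quoted and padded to at least 8 characters inside the quotes.
        field = ("'" + str(value).ljust(8) + "'").ljust(20)
    return f"{keyword:<8}= {field}".ljust(80)


def provenance_extension(provn):
    """Return the bytes of an IMAGE extension whose data array is the PROV-N text."""
    data = provn.encode("ascii")
    cards = [
        card("XTENSION", "IMAGE"),
        card("BITPIX", 8),
        card("NAXIS", 1),
        card("NAXIS1", len(data)),
        card("PCOUNT", 0),
        card("GCOUNT", 1),
        card("EXTNAME", "PROVENANCE"),
        "END".ljust(80),
    ]
    header = "".join(cards).encode("ascii")
    header += b" " * (-len(header) % BLOCK)  # pad header with spaces
    data += b"\x00" * (-len(data) % BLOCK)   # pad data with zero bytes
    return header + data
```

The returned bytes would be appended after an existing primary HDU; production code would more likely use a FITS library than format cards by hand.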
(s) Sect 3.3, "VOTable Format" -- as I said in my last review, I don't
see what purpose this VOTable serialisation serves. At least "emphasize
the compatibility" is far too weak a reason for putting something into a
standard in my book. Remember, people have to implement this stuff.
I'm not arguing against a relational mapping of your model, but that
needs to be defined much more carefully (presumably in ProvTAP, then).
I strongly vote to remove the entire section 3.3; but if you don't
remove it, it needs to explain much better what to do where (and why).
(t) Sect 3.4 "Description classes for web services" -- it's a cute idea,
but it's so far from provenance that it really doesn't belong in this
specification. If you think there's a use case for this, please
transport this material into the DataLink specification (there's going
to be an update for it anyway fairly soon). Nobody will look for
material like this in the documentation for the provenance DM.
(u) Sect 4 "Accessing provenance information" -- my advice is to strike
this section and integrate what little material there still is in it into
the introduction.
(v) Appendix A "Examples" -- wouldn't it be enough to just show (perhaps
an abbreviated rendition of) the PROV-N example and tell people how to
use standard software to get to the PROV-JSON one? As to VOTable, see
above.
Note that ivoatex also has an auxiliaryurl macro that you can use to
deliver example files without having to include them verbatim in the
document (see ivoatexDoc).
Apart from reducing the scariness of the document (shaving off 10
pages downgrades it from OMG-60-pages! to Oh-dang-50-pages, which
probably helps adoption a lot), shortening the examples section to
what humans actually want or need to see probably saves a few trees,
too -- IVOA documents are printed occasionally...
(w) Appendix B "Links to other DMs" -- oh my. "When delivering the data on
request, the serialized versions can be adjusted to the corresponding
notation." -- excuse me, but that won't work. If I get a request, how am
I to know if I should include ProvDM or DatasetDM metadata? What
technical reason should there be to distinguish between the two?
No, I'm sorry, but we simply have to clean up our act. There needs
to be at most one model per piece of reality. DatasetDM fortunately
isn't REC yet -- it still can be re-written to use your classes where
appropriate.
With SimDM I'd not be worried too much -- people doing it probably won't
bother with ProvDM or much else anyway. I suppose it'd be all right to
just say in the introduction something like "For historical reasons,
SimDM has its own rendition of provenance; we make no effort to
reconcile the current and W3C's efforts with it".
I'm also not convinced that appendix on UWS (B.3) couldn't be shortened
into a paragraph in the introduction; that, of course, is even easier if
we keep to the core model and thus won't have to explain about
Parameter.
B.4, finally, is too much of a "could be further developed" thing. I'm
always advocating keeping promises for the future out of standard
documents unless they actually help to keep implementors from making
false assumptions. This doesn't seem to be the case here. Can we
remove it?
-- MarkusDemleitner - 2018-10-29
TCG Review Period: TCG_start_date - TCG_end_date
TBC