Provenance Data Model RFC #2 PageHistoryAfter the first RFC period (November 2018) during which no consensus was reached, the Provenance model has been revamped.
DocumentThe attached document presents the IVOA Provenance Data Model v1.0 which is proposed for review is accessible in ivoadoc. This document describes how provenance information can be modeled, stored and exchanged within the astronomical community in a standardized way. We follow the definition of provenance as proposed by the W3C (https://www.w3.org/TR/prov-overview/), i.e. that "provenance is information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability or trustworthiness."Such provenance information in astronomy is important to enable any scientist to trace back the origin of a dataset (e.g. an image, spectrum, catalog or single points in a spectral energy distribution diagram or a light curve), a document (e.g. an article, a technical note) or a device (e.g. a camera, a telescope), learn about the people and organizations involved in a project and assess the reliability, quality as well as the usefulness of the dataset, document or device for her own scientific work.
Reference Implementation
CTA implementations (M. Servillat)The Cherenkov Telescope Array (CTA) is the next generation ground-based very high energy gamma-ray instrument. Contrary to previous Cherenkov experiments, it will serve as an open observatory providing data to a wide astrophysics community, with the requirement to propose self-described data products to users that may be unaware of the Cherenkov astronomy specificities. Because of the complexity in the detection process and in the data processing chain, provenance information of data products are necessary to the user to perform a correct scientific analysis. Provenance concepts are relevant for different aspects of CTA:
Pollux Provenance: a simple access protocol to provenance of theoretical spectra (M. Sanguillon)POLLUX is a stellar spectra database proposing access to high resolution synthetic spectra computed using the best available models of atmosphere (CMFGEN, ATLAS and MARCS), performant spectral synthesis codes (CMF_FLUX, SYNSPEC and TURBOSPECTRUM) and atomic line lists from VALD database and specific molecular line lists for cool stars. Currently the provenance information is given to the astronomer in the header of the spectra files (depending on the format: FITS, ASCII, XML, VOTable, ...) but in a non-normalized description format. The implementation of the provenance concepts in a standardized format allows users on one hand to benefit from tools to create, visualize and transform to another format the description of the provenance of these spectra and on a second hand to select data depending on provenance criteria. In this context, the ProvSAP protocol has been implemented to retrieve provenance information in different formats of the serialized data: PROV-N, PROV-JSON, PROV-XML, VOTABLE and to build diagrams in the following graphic formats: PDF, PNG, SVG. These serializations and graphics are generated using the voprov python package derived from the prov Python library (MIT license) developed by Trung Dong Huynh (University of Southampton).SVOM Quick Analysis (L. Michel)The SVOM satellite is a Sino-French variable object monitor to be launched in 2021. When a transient object is detected, a fast alert is sent to the ground station through a worldwide VHF network. Burst advocates and instrument scientists are in charge of evaluating the scientific relevance of the event. To complete the assement, scientists have at their disposal high level data products such as light curves or spectra generated by an automatic data reduction pipeline. In some case, they might need to reprocess raw data with refined parameters. To do so, scientific products (calib_leve >= 2) embedd their JSON provenance serialization in a specific extension. This provenance instance can be extracted, updated and then uploaded to a dedicated pipeline to reprocess the photon list with different parameters.ProvHIPS CDS prototype service providing provenance metadata for HiPS datasets stored at CDS. ( F. Bonnarel, A. Egner)This prototype is both an implementation of Provenance Data Model and of the DAL ProvTAP protocol. ProvTAP is a proposal for providing Provenance metadata via TAP services. The current draft for this DAL protocol definition (TBC) can be found here). It is basically providing a TAP-schema mapping the IVOA Provenance model onto a relational schema.
A triple Store implementation for an image database (M.Louys, F.X Pineau, L.Holzmann, F.Bonnarel).We have implemented the IVOA provenance DM proposed using a BlazeGraph triplestore data base. It handles astronomical images as Entities with a succession of Activities that produce or consume them. Entities represent image files or photometric plates. Activities are described by an ActivityDescription instance which gives the template for execution of several different Activity instances based on the same template. / The Blazegraph prov-test prototype translates the classes via an Ontology of Object in OWL. The Prov_Owl diagram attached exposes the main objects of this ontology. This prototype implements:
MuseWise Provenance: Implementation of ProvTAP, ProvSAP, W3C, and visualisation (O. Streicher)MUSE is an integral field spectrograph installed at the Very Large Telescope (VLT) of the European Southern Observatory (ESO). It consists of 24 spectrographs, providing a 1x1arcmin FOV (7.5" in Narrow Field Mode) with 300x300 pixel. For each pixel, a spectrum covering the range 465-930nm is provided. MuseWise is the data reduction framework that is used within the MUSE collaboration. It is built on the Astro-WISE system, which has been extended to support 3D spectroscopic data and integrates data reduction, provenance tracking, quality control and data analysis. The MUSE pipeline is very flexible and offers a variety of options and alternative data flows to fit to the different instrument modi and scientific requirements. MuseWISE hast provenance "built-in", i.e. it stores all relevant provenance information during the execution of the pipeline. Our implementation presents this provenance information conform to the upcoming standard. Activity configuration is modelled as Entities, to enable accessing the provenance of the configuration, and for W3C compatibility. The provenance data are primarily presented as a relational database which is conform to the IVOA Provenance model and close to the upcoming ProvTAP standard. It can be queried via SQL and ADQL queries. The database covers the full processing chain from the exposure to the science-ready product, including all necessary entity, activity, usage, and generation descriptions. On top of the relational representation, we developed a few tools for alternative acess methods: 1. A complete ProvSAP sever that translates REST queries into relational queries for a provenance database, converts the result into the W3C Provenance model and presents it in W3C formats (XML, Prov-N, OWL2, JSON). The part of the IVOA model that is not W3C compatible (ActivityConfiguration package) is not queried. The result can be processed and stored by any W3C compatible tool and service. 2. A tabular vizualisation prototype of the provenance based on the W3C model (from the first tool), to present the data to the user. This page shows an example output for the complete chain of one data product. We also tested this prototype on other available data that we locally stored in databases with the same structure as our MuseWISE implementation: HIPS generation, CTA, and cube segmentation.APPLAUSE Provenance: publically available implementation via TAP (A. Galkin)German astronomical observatories own considerable collection of photographic plates. While these observations lead to significant discoveries in the past, they are also of interest for scientists today and in the future. In particular, for the study of long-term variability of many types of stars, these measurements are of immense scientific value. There are about 85000 plates in the archives of Hamburger Sternwarte, Dr. Karl Remeis-Sternwarte Bamberg, and Leibniz-Institut für Astrophysik Potsdam (AIP). The plates are digitized with high-resolution flatbed scanners. In addition, the corresponding plate envelopes and observation logbooks are digitized, and further metadata are entered into the database. The APPLAUSE implementaion of the provenance model is the only publically accessable imlementation and is accessable through the TAP protocol on https://www.plate-archive.org Currently, the provenance schema for the APPLAUSE Data Release 3 has 528082 entities, 425421 activities, 327 agents, 799097 used relations, 471888 wasGeneratedBy relations and 138552 wasAttributedTo relations.Implementations ValidatorsThere is no validator for the model as such. Validation tools should be applied to specific implementations of the model. | |||||||||||||||
Changed: | |||||||||||||||
< < | RFC Review Period: 2019 July 23 - 2019 September 03Comments from WG members | ||||||||||||||
> > | RFC Review Period: 2019 July 23 - 2019 September 03 | ||||||||||||||
Added: | |||||||||||||||
> > | Comments from WG members | ||||||||||||||
Comment from -- LaurentMichel - 2019-08-26
Comments from Markus DemleitnerRevisiting the two reviews I've written on this before, let me say that despite the points below, this version of the document is much improved. Thank you for that. Some concerns remain, however. In the list below, I've marked some I consider show stoppers with (!). Everything else is somehwere on the spectrum between woulnd't-it-be-nice and mark-my-words. In my review of a previous version 2017-09, I made the point that the document has no guidance for how to replace the plethora of FITS keywords with a with a ProvDM-compatible representation. I still think having a section in the introduction or an Appendix on this would make the document much easier to digest to the wider community. And if you adopt VOTable types (please! please!), you can scratch current Appendix C, so the document wouldn't even get longer. In my review of the previous PR 2018-10, I wrote: "I'd strongly vote for having the core model put into REC frist and only then going for the extended model" on grounds that it's difficult enough to get the core right. While I'd still vote for this if it were an option -- in particular because once we have interoperable clients that consume this kind of thing, we're far less likely to waste a lot of work on things clients won't or can't look at --, I'll grudgingly shut up on this and trust that people have made sure there's no way to have a VO-useful model without the extended part. (1) Personally, I'm not a big fan of the List of Figures and the List of Tables. If you're not terribly attached of them, consider dropping them. One less page, 2% less scary spec. (2) ivoatex uses non-French spacing by default, which means that after a full stop there is additional white space. The result is that if you have a period character that is not a sentence end (e.g., "i.e." or "e.g."), you should do something. The preferable way is to have a comma after abbreviations like "i.e.", which is what style guides still tell you to have. Where no comma is appropriate, use a tilda (as in "M.~Demleitner"). Also typographically, there are a few naked double quotes (") left in the text, and it would be nice if they could be replaced by typographical quotes (e.g., Use Case C: "well", "unbounded" in Table 14). There's also a typo in 2.5.3: "GenerationDescription classes[,] that are meant to store" -- the comma doesn't belong there. (3) In use case (B), there's "Find out who was on shift for data taking for a given dataset", which differs in style from the other bullet points, which are questions. Why not write "Who was on shift while the data was taken?"? (4) In use case (C) you write "Is the dataset produced or published by a person/organisation I can trust?" -- that's a bit marginal, and I'd prefer if it was withdrawn. If you take that use case seriously, you'll have to derive the requirement that the provenance itself becomes trustworthy, which then entails signatures and a public key infrastructure, which is something we certainly don't want to go into here. So, if you're not too attached to this particular thing, I'd prefer if it went away. (5) in use case (D), third bullet point, there's "coding language version", which I think at least lacks commas but would certainly profit from a tiny bit more verbosity. (6) in "General Remarks", you talk about it being "possible to enable the reproducibility", which opens a can of worms ("workflows") that I'd rather keep out of Provenance. Again, if you're not terribly attached to that passage, I'd rather see it removed. (7) I'm a bit confused by the distinction between "Model requirements" and "Best practices", in particular because quite a few of the "best practices" are actually required if you want to cover your use cases. Why did you create these two classes? And what's the relationship of the "best practices" to the modelling? It seems to me that something like "an entity should be linked to the activity that generated it" could simply be written as an actual requirement like "The model must link entities with the activities that generated it". That would, at least to me, indicate that a constraint on the model is being derived. (8) In sect. 1.4, the transition from "...was slowing down." to "The IVOA Data Model Working Group first gathered various..." is a bit sudden. Perhaps something like "To regain momentum" or so would improve the narrative here? (9) For the record, I still think the model's implementability would greatly profit from dropping the loop relations (wasDerivedFrom and wasInformedBy) -- they induce a combinatorial explosion of possible vertices to inspect when following a provenance chain (in an 11-entity provenance chain, you have to look at 21 vertices without them, and about 1045 ones with them). To me, that's a bad deal for saving on possibly empty Activities (or Entities). But well, I guess I can only hope I won't have to deal with this. Oh, and I mention in passing that when drop the loop relations, you also get to drop WasAttributedTo. How good can a deal get? (10) I haven't been able to work out what "optional" means in this specification, and I'd suggest explaining it somewhere in the introduction. Does it mean a client can ignore specifications in optional elements and still be compliant? If it means something else, what is it? And what about writers? Do writers have to write all "non-optional" elements? (11) In 2.5.1, you write "An activity is then a concrete case (instance) with a given start and stop time, and it refers to a description for further information." -- given that startTime, stopTime and description are optional as per Table 2, that is a misleading statement (and of course "description" is called "comment" in Activity). Can't you just write "ActivityDescription in this sense is a template defining the metadata of a set of slots filled using an AcitivityConfiguration to build a concrete Activity"? [disclaimer: I may not have quite gotten the plan straight on this first reading] (12) If there is some systematics behind the apparent inconsistency that human-readable annotation is in the "description" attribute in ActivityDescription but in the "comment" attribute in Activity? If so, it would help adopters if it were explained. (13) I can't say I like "doculink" too much. Is there a strong reason not to use the term "reference URL" that's well established for what I take this to be? (14) Table 13 and 14: There's multiplicity there, but it's not really clear what should be there if it were, say, "zero or one", or "exactly two", or really anything else than an unbounded number of entities. The text on p. 24 also suggests to write multiplicity=*, which isn't mentioned at all in the tables. (!) (15) On p. 24, there's the sentence "For example: when the input bias files..." that than peters off without saying what happens then. I can marginally see that with more effort a reader might be able to figure out the "then" part, but I guess it would be a nice service to say "the EntityDescription referencing the UsageDescription would declare..." (or whatever; I admit I've not tried very hard to figure it out myself). (16) In table 18, I'm deeply unhappy to see that you're requesting a VO-DML type to describe the data type. In both Registry and DAL (TAP), we've regretted trying shortcuts from VOTable types -- most of the values we're talking about are (at least optionally) going to end up in VOTables, so generating code in all likelihood already knows about VOTable types. Nobody will have thought about VO-DML types. Also, the VOTable is rich and implies serialisation -- you really don't want to go into all the intricacies of timestamps and polygons and all that again. So, please, please use datatype, arraysize and xtype as in VOTable here. I notice in A1 that the move to VO-DML types apparently was a change made towards this PR. Why was it taken? [I guess this is a good example why it pays to have public discussions on post-WD changes]. (17) In table 18, you say the possible values are "comma separated" -- but what happens if a value contains a comma? You might get by referencing the CSV standard that has escaping rules, but really: my advice is to just get rid of min, max, and options -- I don't see overwhelming use cases, and their semantics becomes really hard once you start digging (e.g., arrays: min, max of all components? by component? in some metric? cf. also the discussions on MAX interpretation in Geometries with SODA). (!) (18) I can marginally see why you might want a ConfigFile class, though I'd much prefer if a configuration file were simply modeled as another input (i.e., Entity); I don't see what would be lost, and I can easily imagine quite a few cases when the line between a configuration file and something you would probably call an Entity is really blurred (e.g., a catalogue of sources). What I really object to, tough, is ConfigFileDescription -- this is nowhere near specified well enough to enable any sort of interoperable use. Also, the statement that ConfigFiles contain a "list of (key,value) pairs" is either meaningless (when you let value be just about anything and allow for arbitrarily complex keys) or too restrictive (when key is unique and value atomic). There's no way to write a client that will reliably understand any of this. In consequence, if a value should be available to a provenance reader, it needs to go into the provenance literally anyway. So, you don't lose anything you can do right now if you drop ConfigFileDescription, but you save pages and implementor confusion ("What the heck should I do with an ConfigFile that is application/x-votable+xml?"). If you absolutely have to have ConfigFile, then at least say it's opaque (and essentially just a special kind of entity), and drop the language on keys and values, as well as ConfigFileDescription. Side benefit: the uglyness of requiring name in ConfigFileDescription to match name in ConfigFile (which I take exception with, too) can go. -- MarkusDemleitner - 2019-09-02 | |||||||||||||||
Changed: | |||||||||||||||
< < | TCG Review Period: TCG_start_date - TCG_end_date | ||||||||||||||
> > | Comments from TCG Members | ||||||||||||||
Early Comments from the Semantics WG(1) The Provenance DM currently comes with the vocabulary http://www.ivoa.net/rdf/agent-role (though that is not stated in the spec as it is). This is nice and certainly the way to go, but the vocabulary itself is severely lacking at this point. My main concern is that the various terms are not really well specified, and some of them quite obviously overlap (creator vs. author, provider vs. publisher vs. curator). Since the vocabulary would become normative (meaning: no terms can be removed) when ProvDM becomes REC, this is bad enough that Semantics can't approve this as is. Ideally, there should be a use case for each term. Which essentially means: Only include terms people have actually used in their existing descriptions, and then extend the vocabulary as actual provenance descriptions require (this is going to be a reasonably easy and fast process; see http://docs.g-vo.org/vocinvo2.pdf for the current draft). If that's impossible for lack of existing descriptions, my initial list would be to have the terms:
<--
|