
Provenance Data Model RFC #2 Page

History

After the first RFC period (November 2018) during which no consensus was reached, the Provenance model has been revamped.

  • Based on a similar skeleton, this new version differs significantly from the former one.
  • The current document was issued after intensive sessions held at the Spring 2019 meeting in Paris.
  • After this, the TCG suggested going through a new WG stage, shortened to 4 weeks, which ended on 2019/07/19.

Document

The attached document presents the IVOA Provenance Data Model v1.0, which is proposed for review and is accessible in ivoadoc.

This document describes how provenance information can be modeled, stored and exchanged within the astronomical community in a standardized way. We follow the definition of provenance as proposed by the W3C (https://www.w3.org/TR/prov-overview/), i.e. that "provenance is information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability or trustworthiness."

Such provenance information in astronomy is important to enable any scientist to trace back the origin of a dataset (e.g. an image, spectrum, catalog or single points in a spectral energy distribution diagram or a light curve), a document (e.g. an article, a technical note) or a device (e.g. a camera, a telescope), learn about the people and organizations involved in a project and assess the reliability, quality as well as the usefulness of the dataset, document or device for her own scientific work.

Reference Implementation


CTA implementations (M. Servillat)

The Cherenkov Telescope Array (CTA) is the next-generation ground-based very-high-energy gamma-ray instrument. Unlike previous Cherenkov experiments, it will serve as an open observatory providing data to a wide astrophysics community, with the requirement to provide self-described data products to users who may be unfamiliar with the specifics of Cherenkov astronomy. Because of the complexity of the detection process and of the data processing chain, provenance information about data products is necessary for the user to perform a correct scientific analysis.

Provenance concepts are relevant for different aspects of CTA:

  • Pipeline: the CTA Observatory must ensure that data processing is traceable and reproducible, making the capture of provenance information necessary.
  • Data dissemination: the distributed data products have to contain all the relevant context information as well as a description of the methods and algorithms used during the data processing.
  • Instrument Configuration: the characteristics of the instrument at a given time have to be available and traceable (hardware changes, measurements of e.g. a reflectivity curve of a mirror, ...)
In the context of CTA, two main implementations were developed:
  • a python Provenance class dedicated to the capture of provenance information of each CTA pipeline tool (including contextual information and connected to configuration information)
    https://cta-observatory.github.io/ctapipe/api/ctapipe.core.Provenance.html
  • OPUS: a job controller that can execute predefined jobs on a work cluster and expose the results. This job controller follows the IVOA UWS pattern and the definition of jobs proposed in the IVOA Provenance DM (Description classes), so as to capture and expose the provenance information for each job and result via a ProvSAP interface.
    https://opus-job-manager.readthedocs.io
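
As a rough illustration of what capturing provenance for a pipeline tool can involve, the sketch below records an activity's start and stop times, some system context, its configuration and its input and output files. It is not the ctapipe API: the `ActivityRecorder` class, its method names and the file names are hypothetical, and only the Python standard library is used.

```python
import json
import platform
import sys
from datetime import datetime, timezone


class ActivityRecorder:
    """Hypothetical, minimal stand-in for a provenance capture class."""

    def __init__(self, name):
        self.record = {
            "activity_name": name,
            "start_time": datetime.now(timezone.utc).isoformat(),
            "stop_time": None,
            # Contextual information about the execution environment.
            "system": {
                "python_version": platform.python_version(),
                "platform": sys.platform,
            },
            "config": {},
            "input": [],
            "output": [],
        }

    def add_config(self, config):
        """Attach configuration values used by the tool."""
        self.record["config"].update(config)

    def add_input(self, url, role):
        """Register a file consumed by the activity."""
        self.record["input"].append({"url": url, "role": role})

    def add_output(self, url, role):
        """Register a file produced by the activity."""
        self.record["output"].append({"url": url, "role": role})

    def finish(self):
        """Close the activity and return the captured record."""
        self.record["stop_time"] = datetime.now(timezone.utc).isoformat()
        return self.record


# Illustrative tool run; all names below are made up.
recorder = ActivityRecorder("calibrate_events")
recorder.add_config({"threshold": 5.0})
recorder.add_input("events_raw.fits", role="raw events")
recorder.add_output("events_dl1.fits", role="calibrated events")
record = recorder.finish()
print(json.dumps(record, indent=2))
```

Keeping the record as a plain dictionary makes it easy to write next to the data product or to convert to another serialization later.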

Pollux Provenance: a simple access protocol to provenance of theoretical spectra (M. Sanguillon)

POLLUX is a stellar spectra database offering access to high-resolution synthetic spectra computed using the best available atmosphere models (CMFGEN, ATLAS and MARCS), efficient spectral synthesis codes (CMF_FLUX, SYNSPEC and TURBOSPECTRUM), atomic line lists from the VALD database and specific molecular line lists for cool stars.

Currently the provenance information is given to the astronomer in the header of the spectra files (depending on the format: FITS, ASCII, XML, VOTable, ...) but in a non-normalized description format. Implementing the provenance concepts in a standardized format allows users, on the one hand, to benefit from tools that create, visualize and convert the provenance description of these spectra to other formats and, on the other hand, to select data according to provenance criteria.

In this context, the ProvSAP protocol has been implemented to retrieve provenance information in several serialization formats (PROV-N, PROV-JSON, PROV-XML, VOTable) and to build diagrams in the following graphic formats: PDF, PNG, SVG. These serializations and graphics are generated using the voprov Python package, derived from the prov Python library (MIT license) developed by Trung Dong Huynh (University of Southampton).
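
To give an idea of what two of these serializations look like, the sketch below writes the same small provenance description as PROV-N and as PROV-JSON. It does not use the voprov or prov packages. The `ex:` namespace, the identifiers and the helper functions are assumptions made for illustration, and only core W3C relations (used, wasGeneratedBy) are covered.

```python
import json

PREFIX = {"ex": "http://example.org/"}  # hypothetical namespace

# Hypothetical provenance of one synthetic spectrum.
entities = ["ex:marcs_model", "ex:line_list", "ex:spectrum"]
activities = ["ex:spectral_synthesis"]
used = [
    ("ex:spectral_synthesis", "ex:marcs_model"),
    ("ex:spectral_synthesis", "ex:line_list"),
]
generated = [("ex:spectrum", "ex:spectral_synthesis")]


def to_provn():
    """Serialize the description in PROV-N notation."""
    lines = ["document"]
    for name, uri in PREFIX.items():
        lines.append(f"  prefix {name} <{uri}>")
    lines += [f"  entity({e})" for e in entities]
    lines += [f"  activity({a})" for a in activities]
    # The trailing "-" marks an unspecified time.
    lines += [f"  used({a}, {e}, -)" for a, e in used]
    lines += [f"  wasGeneratedBy({e}, {a}, -)" for e, a in generated]
    lines.append("endDocument")
    return "\n".join(lines)


def to_provjson():
    """Serialize the same description following the PROV-JSON layout."""
    doc = {
        "prefix": PREFIX,
        "entity": {e: {} for e in entities},
        "activity": {a: {} for a in activities},
        "used": {
            f"_:u{i}": {"prov:activity": a, "prov:entity": e}
            for i, (a, e) in enumerate(used, start=1)
        },
        "wasGeneratedBy": {
            f"_:g{i}": {"prov:entity": e, "prov:activity": a}
            for i, (e, a) in enumerate(generated, start=1)
        },
    }
    return json.dumps(doc, indent=2)


print(to_provn())
print(to_provjson())
```

Both functions read from the same in-memory lists, which mirrors the idea of one provenance record exposed in several formats.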

SVOM Quick Analysis (L. Michel)

The SVOM satellite is a Sino-French variable object monitor to be launched in 2021. When a transient object is detected, a fast alert is sent to the ground station through a worldwide VHF network. Burst advocates and instrument scientists are in charge of evaluating the scientific relevance of the event. To complete the assessment, scientists have at their disposal high-level data products such as light curves or spectra generated by an automatic data reduction pipeline. In some cases, they might need to reprocess raw data with refined parameters. To do so, scientific products (calib_level >= 2) embed their JSON provenance serialization in a specific extension. This provenance instance can be extracted, updated and then uploaded to a dedicated pipeline to reprocess the photon list with different parameters.
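
The extract, update and upload cycle described above might look like the following sketch. The layout of the embedded JSON, the activity identifier and the parameter names are hypothetical (the actual SVOM serialization is not described on this page), and the upload step is reduced to producing the JSON payload.

```python
import copy
import json

# Hypothetical provenance instance, as it might be extracted from a product.
EMBEDDED_PROVENANCE = json.loads("""
{
  "activity": {
    "svom:light_curve_extraction": {
      "parameters": {"energy_min_kev": 4.0, "time_bin_s": 1.0}
    }
  }
}
""")


def with_updated_parameter(provenance, activity_id, name, value):
    """Return a copy of the provenance with one parameter value replaced."""
    updated = copy.deepcopy(provenance)
    parameters = updated["activity"][activity_id]["parameters"]
    if name not in parameters:
        raise KeyError(f"unknown parameter: {name}")
    parameters[name] = value
    return updated


# Refine one parameter and serialize the result for reprocessing.
refined = with_updated_parameter(
    EMBEDDED_PROVENANCE, "svom:light_curve_extraction", "time_bin_s", 0.5
)
payload = json.dumps(refined, indent=2)
print(payload)
```

Working on a deep copy keeps the original provenance intact, so the reprocessing request can be compared with the record of the first run.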

ProvHiPS: CDS prototype service providing provenance metadata for HiPS datasets stored at CDS (F. Bonnarel, A. Egner)

This prototype is both an implementation of the Provenance Data Model and of the DAL ProvTAP protocol.

ProvTAP is a proposal for providing provenance metadata via TAP services. The current draft of this DAL protocol definition (TBC) can be found here. It essentially provides a TAP schema mapping the IVOA Provenance model onto a relational schema.

  • A first implementation of a TAP service for Provenance metadata bound to ObsTAP metadata was done by F. Bonnarel, M. Louys and G. Mantelet in August 2018.
This first prototype of a ProvHiPS service implemented this ProvTAP specification on top of a PostgreSQL relational database with metadata for the generation of a few HiPS datasets as well as Schmidt plate digitization, image cutout operations and RGB image construction. A full presentation of the prototype with a "slide demo" including many ADQL queries can be found here.
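
The idea of mapping the model onto relational tables and querying it can be sketched with an in-memory SQLite database. The table names, column names and rows below are hypothetical and do not reproduce the ProvTAP schema. The query uses plain SQL joins of the kind that can also be written in ADQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entity (id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE activity (id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE used (activity_id TEXT, entity_id TEXT);
CREATE TABLE was_generated_by (entity_id TEXT, activity_id TEXT);

INSERT INTO entity VALUES
    ('e1', 'Schmidt plate'),
    ('e2', 'digitized image'),
    ('e3', 'HiPS dataset');
INSERT INTO activity VALUES
    ('a1', 'plate digitization'),
    ('a2', 'HiPS generation');
INSERT INTO used VALUES ('a1', 'e1'), ('a2', 'e2');
INSERT INTO was_generated_by VALUES ('e2', 'a1'), ('e3', 'a2');
""")

# One step back: which activity generated the entity, and what did it use?
QUERY = """
SELECT a.name, src.name
FROM was_generated_by AS g
JOIN activity AS a ON a.id = g.activity_id
JOIN used AS u ON u.activity_id = a.id
JOIN entity AS src ON src.id = u.entity_id
WHERE g.entity_id = ?
"""

rows = conn.execute(QUERY, ("e3",)).fetchall()
print(rows)  # [('HiPS generation', 'digitized image')]
```

Each relation of the model (used, wasGeneratedBy) becomes a link table between the entity and activity tables, so provenance steps are followed with joins.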

HiPS is an all-sky organisation of astronomical data (images, catalogs) based on the HEALPix tessellation of the sky, whose tiles have iso-area and iso-latitude properties and build up the mesh structure (see HiPS description). Each HiPS data collection is obtained by reprojecting onto the celestial sphere the original FITS images belonging to the same image collection. The hipsgen code is used at CDS to generate the HiPS format of data stored in the CDS image database. These image collections are themselves produced in various ways and stages, from raw data through mosaicked, calibrated and normalized formats, before they can be ingested into the HiPS processing activity.

For example, in the case of HST images it is possible to trace back from HST HiPS datasets to stacked image collections, single calibrated exposure collections or raw exposure image collections. In the case of HiPS obtained from Schmidt plate datasets, it is possible to go back to the plates through the digitized images. As a HiPS metadata tree also stores the identifiers of the progenitor images inside each tile, it is also possible to trace the history of a tile. The whole HiPS dataset can appear both as an entity and as a collection of tiles.
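
Tracing the history of a tile amounts to walking backwards along derivation links. The sketch below does this breadth-first over a small hand-written graph. The identifiers are hypothetical and the dictionary stands in for whatever store actually holds the progenitor links.

```python
# Hypothetical derivation links: entity -> entities it was derived from.
derived_from = {
    "hips_tile": ["stacked_image"],
    "stacked_image": ["calibrated_exposure_1", "calibrated_exposure_2"],
    "calibrated_exposure_1": ["raw_exposure_1"],
    "calibrated_exposure_2": ["raw_exposure_2"],
}


def trace_back(entity, depth=None):
    """Return the progenitors of an entity, nearest first.

    depth limits the number of derivation steps followed;
    None means following the chain to its end.
    """
    seen = []
    frontier = [entity]
    level = 0
    while frontier and (depth is None or level < depth):
        next_frontier = []
        for current in frontier:
            for parent in derived_from.get(current, []):
                if parent not in seen:
                    seen.append(parent)
                    next_frontier.append(parent)
        frontier = next_frontier
        level += 1
    return seen


print(trace_back("hips_tile", depth=1))
# ['stacked_image']
print(trace_back("hips_tile"))
# ['stacked_image', 'calibrated_exposure_1', 'calibrated_exposure_2',
#  'raw_exposure_1', 'raw_exposure_2']
```

The optional depth limit lets a client stop at the stacked images or continue down to the raw exposures.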

We are currently creating a new version of our ProvHiPS database containing the history of the HiPS tiles for all the HST HiPS subsets (see here, here and there) as well as the HiPS version of the DSS2 Schmidt plates survey. This is also illustrated globally by this viewgraph, and the result of this work will be presented at the next Interop meeting in Groningen.

A triple store implementation for an image database (M. Louys, F.X. Pineau, L. Holzmann, F. Bonnarel)

We have implemented the proposed IVOA Provenance DM using a Blazegraph triplestore database. It handles astronomical images as Entities with a succession of Activities that produce or consume them. Entities represent image files or photometric plates. Activities are described by an ActivityDescription instance, which serves as the template for executing several different Activity instances. The Blazegraph prov-test prototype translates the classes via an ontology of objects in OWL. The attached Prov_Owl diagram shows the main objects of this ontology. This prototype implements:

  • Agent, Activity, Entity (mainly DatasetEntity)
  • Used, wasGeneratedBy
  • wasAssociatedTo, wasAttributedTo
  • wasDerivedFrom
  • hasParameter
  • ActivityDescription
  • describedBy
  • describes
  • UsageDescription
  • GenerationDescription
  • ParameterDescription
  • Parameter
ActivityConfiguration in our case implements only the Parameter Class and no ConfigFile instances. Therefore the ActivityConfiguration is implemented using a direct relation called hasParameter attached to the Activity instance. This hasParameter predicate is an implementation of wasConfiguredBy/artefactType=Parameter in our implementation. It supports the same cardinality for the Activity <--> parameter relation as shown in the Proposed REC UML diagram.

The translation of association classes like wasAttributedTo or wasAssociatedTo into the ontology requires a kind of predicate extension. These two association classes contain a role attribute, which relates to an agent instance when it is in relation with an Activity (wasAssociatedTo) or with an Entity (wasAttributedTo). This role value depends only on the relation, not on the two node objects. Therefore the ontology realizes this by adding a new predicate, holdsRoleinTime, on the agent instance when it is active in one of these two relations. Description links are represented by the describedBy predicate, which is the generic link from the classes on the left of the UML diagram to their description companion classes on the right.

The Blazegraph prov-test database contains all the triples extracted from the initial PostgreSQL database. It answers queries formed in SPARQL, compliant with the defined ontology. Examples of queries available with this prototype implementation are given in the attached document ProvQuerytest-3store.pdf.
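
To show how predicates such as wasGeneratedBy and hasParameter can be chained in a query, the sketch below uses a tiny in-memory set of triples and a pattern matcher in plain Python. It stands in for a SPARQL basic graph pattern and does not involve Blazegraph. The `ex:` and `voprov:` prefixes, the `voprov:value` predicate and all instances are assumptions made for illustration.

```python
# Hypothetical triples: (subject, predicate, object).
TRIPLES = {
    ("ex:image42", "rdf:type", "voprov:Entity"),
    ("ex:stacking1", "rdf:type", "voprov:Activity"),
    ("ex:image42", "voprov:wasGeneratedBy", "ex:stacking1"),
    ("ex:stacking1", "voprov:hasParameter", "ex:param_sigma"),
    ("ex:param_sigma", "voprov:value", "3.0"),
    ("ex:stacking1", "voprov:describedBy", "ex:stacking_description"),
}


def match(s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in TRIPLES
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def parameters_of_generating_activity(entity):
    """Chain three patterns: entity -> activity -> parameter -> value."""
    results = []
    for _, _, activity in match(s=entity, p="voprov:wasGeneratedBy"):
        for _, _, param in match(s=activity, p="voprov:hasParameter"):
            for _, _, value in match(s=param, p="voprov:value"):
                results.append((activity, param, value))
    return results


print(parameters_of_generating_activity("ex:image42"))
# [('ex:stacking1', 'ex:param_sigma', '3.0')]
```

Because hasParameter is attached directly to the Activity instance, reaching a parameter value from an image takes only three chained patterns.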

The list shows how the user can query for provenance metadata. Some queries focus on the management aspect of these metadata, as seen from the data provider's point of view. These can help to reorganise and optimise the image database and its provenance metadata. The list also shows the various queries an astronomer can build in order to assess data quality from the recorded Activities, applied methods, parameter values and dataset dependencies, typically the link to a progenitor Entity implemented using the wasDerivedFrom relation.

MuseWise Provenance: Implementation of ProvTAP, ProvSAP, W3C, and visualisation (O. Streicher)

MUSE is an integral field spectrograph installed at the Very Large Telescope (VLT) of the European Southern Observatory (ESO). It consists of 24 spectrographs, providing a 1x1 arcmin FOV (7.5" in Narrow Field Mode) with 300x300 pixels. For each pixel, a spectrum covering the range 465-930 nm is provided. MuseWise is the data reduction framework that is used within the MUSE collaboration. It is built on the Astro-WISE system, which has been extended to support 3D spectroscopic data and integrates data reduction, provenance tracking, quality control and data analysis. The MUSE pipeline is very flexible and offers a variety of options and alternative data flows to fit the different instrument modes and scientific requirements.

MuseWise has provenance "built-in", i.e. it stores all relevant provenance information during the execution of the pipeline. Our implementation presents this provenance information in conformance with the upcoming standard. Activity configuration is modelled as Entities, to enable access to the provenance of the configuration and for W3C compatibility. The provenance data are primarily presented as a relational database which conforms to the IVOA Provenance model and is close to the upcoming ProvTAP standard. It can be queried via SQL and ADQL queries. The database covers the full processing chain from the exposure to the science-ready product, including all necessary entity, activity, usage, and generation descriptions.

On top of the relational representation, we developed a few tools for alternative access methods:

1. A complete ProvSAP server that translates REST queries into relational queries for a provenance database, converts the result into the W3C Provenance model and presents it in W3C formats (XML, Prov-N, OWL2, JSON). The part of the IVOA model that is not W3C compatible (ActivityConfiguration package) is not queried. The result can be processed and stored by any W3C compatible tool and service.

2. A tabular visualisation prototype of the provenance based on the W3C model (from the first tool), to present the data to the user. This page shows an example output for the complete chain of one data product. We also tested this prototype on other available data that we stored locally in databases with the same structure as our MuseWise implementation: HiPS generation, CTA, and cube segmentation.
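
The translation step performed by the first tool, from a REST query to a relational query, can be pictured with the sketch below. It is not the MuseWise code. The request parameter names `ID` and `RESPONSEFORMAT` are assumed here, loosely following the ProvSAP draft, and the table and column names in the SQL text are hypothetical.

```python
from urllib.parse import parse_qs

ALLOWED_FORMATS = {"PROV-N", "PROV-JSON", "PROV-XML"}


def translate(query_string):
    """Turn a REST query string into (sql, arguments, response format)."""
    # Parameter names are treated case-insensitively; first value wins.
    params = {k.upper(): v[0] for k, v in parse_qs(query_string).items()}
    if "ID" not in params:
        raise ValueError("the ID parameter is required")
    response_format = params.get("RESPONSEFORMAT", "PROV-JSON").upper()
    if response_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported format: {response_format}")
    # Parameterized SQL: the identifier is passed separately, never inlined.
    sql = (
        "SELECT u.entity_id FROM was_generated_by AS g "
        "JOIN used AS u ON u.activity_id = g.activity_id "
        "WHERE g.entity_id = ?"
    )
    return sql, (params["ID"],), response_format


sql, args, response_format = translate("ID=ex:cube1&RESPONSEFORMAT=PROV-N")
print(sql)
print(args, response_format)
```

Returning the SQL text and its arguments separately leaves execution and the conversion to a W3C serialization to later stages.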

APPLAUSE Provenance: publicly available implementation via TAP (A. Galkin)

German astronomical observatories own a considerable collection of photographic plates. While these observations led to significant discoveries in the past, they are also of interest for scientists today and in the future. In particular, for the study of long-term variability of many types of stars, these measurements are of immense scientific value. There are about 85000 plates in the archives of Hamburger Sternwarte, Dr. Karl Remeis-Sternwarte Bamberg, and Leibniz-Institut für Astrophysik Potsdam (AIP). The plates are digitized with high-resolution flatbed scanners. In addition, the corresponding plate envelopes and observation logbooks are digitized, and further metadata are entered into the database.

The APPLAUSE implementation of the provenance model is the only publicly accessible implementation and can be reached through the TAP protocol at https://www.plate-archive.org

Currently, the provenance schema for the APPLAUSE Data Release 3 has 528082 entities, 425421 activities, 327 agents, 799097 used relations, 471888 wasGeneratedBy relations and 138552 wasAttributedTo relations.

Implementations Validators

There is no validator for the model as such. Validation tools should be applied to specific implementations of the model.

RFC Review Period: 2019 July 23 - 2019 September 03

Comments from WG members

TCG Review Period: TCG_start_date - TCG_end_date

TBC

Topic attachments
  • ImageProvHiPS.jpg (92.2 K, 2019-08-20 14:19, FrancoisBonnarel): HST HiPS and tiles provenance metadata
  • OntologyProvenance2019-07-22.png (124.6 K, 2019-07-22 20:07, MireilleLouys): list of OWL objects in the Prov-test ontology, M. Louys
  • OwlOntologyPanel1.png (297.9 K, 2019-07-24 16:22, MireilleLouys): list of OWL objects, properties and predicates in the Prov-test ontology, M. Louys
  • ProvQuerytest-3store.pdf (45.1 K, 2019-07-22 19:57, MireilleLouys): list of example queries for the CDS triplestore implementation, M. Louys
  • VisuProvenance.png (443.8 K, 2019-08-20 14:41, FrancoisBonnarel): full provenance chain HiPS HST 1
  • VisuProvenance1.png (622.9 K, 2019-08-20 14:41, FrancoisBonnarel): full provenance chain HiPS HST 2
  • VisuProvenance2.png (560.9 K, 2019-08-20 14:42, FrancoisBonnarel): full provenance chain HiPS HST 3
Topic revision: r12 - 2019-08-20 - FrancoisBonnarel
 