Provenance Data Model

A workpackage inside the DM WG


ProvDayMarch2017

Events

Meetings

(supported by ASTERICS/DADI WP, CTA project, GAVO Project, ASOV OV-France, Paris Data Center)

Goal

This workpackage focuses on describing the data processing applied to produce astronomical data. We identified two different use cases:

  • data processing and recombination, as in the pipeline of a typical astronomical survey.
  • acquisition configuration
We take the point of view of a user who wants to select appropriate data sets for a science study. In most cases the user will start with science-ready data products but will have to discover progenitor data and possibly reprocess the original data.

The purpose is to outline the main blocks of information that can summarise how a dataset was produced using some specific facility, in particular observation conditions. This will be articulated with the Dataset Metadata DM, a Data Model that federates the developments of existing models: Observation Core Components DM, Spectral DM, Characterisation DM, Photometry DM.

Currently the development focus is on observational data. Simulated data have been modeled with the IVOA Simulation DM (Simulation Data Model) specification, which offers a different granularity for the descriptions of parameters and data.

Next steps

Documents

Use-cases

  • CTA project https://portal.cta-observatory.org/Pages/Home.aspx

    The Cherenkov Telescope Array (CTA) is the next generation ground-based very high energy gamma-ray instrument. It will provide a deep insight into the non-thermal high-energy universe. Contrary to previous Cherenkov experiments, it will serve as an open observatory providing data to a wide astrophysics community, with the requirement to propose self-described data products to users that may be unaware of the Cherenkov astronomy specificities.

    Cherenkov telescopes indirectly detect gamma-rays by observing the flashes of Cherenkov light emitted by particle cascades initiated when the gamma-rays interact with nuclei in the atmosphere. The main difficulty is that charged cosmic rays also produce such cascades in the atmosphere, which represent an enormous background compared to genuine gamma-ray-induced cascades. Monte Carlo simulations of the shower development and Cherenkov light emission and detection, corresponding to many different observing conditions, are used to model the response of the detectors. With an array of such detectors the shower is observed from several points and, working backwards, one can reconstruct the origin, energy and time of the incident particle. Extensive simulations are needed in order to perform this reconstruction.

    Because of this complexity in the detection process, provenance information about the data products is necessary for the user to perform a correct scientific analysis.

    Provenance concepts are relevant for different aspects of CTA :

    Data dissemination: the distributed data products have to contain all the relevant context information, including the assumptions made and a description of the methods and algorithms used during the data processing.

    Pipeline : the CTA Observatory must ensure that data processing is traceable and reproducible.

    Instrument Configuration : the characteristics of the instrument at a given time have to be available and traceable (hardware changes, measurements of e.g. a reflectivity curve of a mirror, ...)

    USE CASE 1 :
    Reprocess a data product: The different processing steps and relevant parameters used in the original analysis are required, as well as the progenitor.
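
    The information needed for such a reprocessing can be gathered by walking the provenance chain backwards. The following is a minimal Python sketch, assuming the provenance is available as simple mappings between entities and activities. All entity, activity and parameter names are invented for illustration and are not actual CTA pipeline names.

```python
from collections import deque

# Hypothetical provenance records (all names and values are illustrative only).
# generated_by: entity   -> activity that produced it
# used:         activity -> list of input entities
# parameters:   activity -> parameters recorded for that processing step
generated_by = {
    "event_list": "reconstruction",
    "calibrated_data": "calibration",
}
used = {
    "reconstruction": ["calibrated_data", "lookup_tables"],
    "calibration": ["raw_data", "calibration_coefficients"],
}
parameters = {
    "reconstruction": {"method": "example"},
    "calibration": {"pedestal_window": 16},
}


def trace_back(entity):
    """Return (steps, progenitors) for an entity.

    steps lists (activity, parameters) pairs found while following the
    chain backwards, nearest step first; progenitors are the entities
    for which no generating activity is recorded.
    """
    steps, progenitors = [], []
    seen = {entity}
    queue = deque([entity])
    while queue:
        current = queue.popleft()
        activity = generated_by.get(current)
        if activity is None:
            progenitors.append(current)
            continue
        steps.append((activity, parameters.get(activity, {})))
        for source in used.get(activity, []):
            if source not in seen:
                seen.add(source)
                queue.append(source)
    return steps, progenitors


steps, progenitors = trace_back("event_list")
```

    Here steps holds the processing steps with their recorded parameters, and progenitors holds the inputs from which the product could be rebuilt.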

  • Pollux and Polarbase databases : http://pollux.graal.univ-montp2.fr and http://polarbase.irap.omp.eu

    Pollux is a database of synthetic spectra, while Polarbase is a database of observed spectra. Both databases are accessible through their web interfaces or via the VO protocol SSA.

    For the synthetic spectra, provenance tracing is important to know, on the one hand, the workflow that generated each synthetic spectrum and, on the other hand, the important input parameters that characterize the result. Through the SSA protocol, whose FORMAT=METADATA query lists the supported parameters, users can query the database with different parameters defined by the provider (including provenance entities or activities). Currently users can query Pollux on the atmosphere model (which corresponds to the first code of the workflow) and on the effective temperature, the gravity, the mass and the microturbulence, which are input parameters of this code (not all the selection parameters available on the web site are implemented in the SSA protocol). All the provenance data are stored in the header of the Pollux spectrum in a human-readable format.

    For the observed spectra, provenance is also important, and its characteristics are mostly described by the ObsConfig part of the Spectral data model. However, not all observed spectra offered to users are raw data: they have often been transformed by programs (calibration, ...), and no provenance information is given about those programs.

    The provenance data model, which can be included in many VO data models such as the Spectral data model, allows providers and users to share the same format for describing data provenance (PROV-N for example) and to translate this description into other formats (JSON, SVG, ...) via existing tools.


    USE CASE 1 :
    Show me a list of synthetic spectra satisfying :
    - domain of wavelength = visible
    - domain of effective temperature = [4000, 5000]

    USE CASE 2 :
    Show me a list of synthetic spectra satisfying :
    - code for model atmosphere = MARCS
    - type of model atmosphere = spherical

    USE CASE 3 :
    Show me a list of synthetic spectra satisfying :
    - code for spectral synthesis = Turbospectrum
    - version of this code = 2008.1

    USE CASE 4 :
    For a given star identified by POS and SIZE, show me a list of spectra satisfying :
    - Stokes parameter = Q
    - Result of the LSD (code 1) treatment = definite
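
    The use cases above could translate into SSA queries along the following lines. This is only a sketch: the service URL is a placeholder, and the provider-defined parameter names (TEFF, ATMOSPHERE_CODE, ATMOSPHERE_TYPE, STOKES, LSD_RESULT) as well as the coordinates are assumptions; the real names are whatever the service advertises in its FORMAT=METADATA response. REQUEST, FORMAT, POS, SIZE and BAND are standard SSA parameters.

```python
from urllib.parse import urlencode

# Placeholder endpoint; replace with the real SSA access URL of the service.
BASE_URL = "http://example.org/ssa"


def ssa_url(**params):
    """Build an SSA query URL from keyword parameters."""
    query = {"REQUEST": "queryData"}
    query.update(params)
    # Keep "/" (range separator) and "," (coordinate separator) readable.
    return BASE_URL + "?" + urlencode(query, safe="/,")


# Ask the service which query parameters it supports.
metadata_url = ssa_url(FORMAT="METADATA")

# Use case 1: visible wavelengths (BAND is in metres), Teff in [4000, 5000].
# TEFF is an assumed provider-defined parameter name.
use_case_1 = ssa_url(BAND="4e-7/7e-7", TEFF="4000/5000")

# Use case 2: assumed provider-defined parameter names.
use_case_2 = ssa_url(ATMOSPHERE_CODE="MARCS", ATMOSPHERE_TYPE="spherical")

# Use case 4: a star located by POS and SIZE (illustrative coordinates),
# plus assumed provider-defined parameters for the polarimetric selection.
use_case_4 = ssa_url(POS="201.37,-43.02", SIZE="0.01",
                     STOKES="Q", LSD_RESULT="definite")
```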

  • Linking Lightcurve Points and Source Images

    In a plot of a lightcurve, people should be able to view the image the flux was extracted from by clicking on a photometric point.

    Looking at what I think would be under the hood of such a thing, I think there are at least two levels of refinement we could aim for here.

    Level 1: simply say "this point was derived from this image". A group at the Czech Academy of Sciences already does something like that, and it's fairly harmless: Just add a column with the URL of the image in question. The role the provenance DM has to play there: add some annotation to the field so clients can figure out that a URL that's in there points to the image the photometric point is derived from (more complex provenance scenarios are conceivable).

    In Level 2, the table would not contain a ready-made URL, but rather some sort of data id and a global reference for a datalink-type service descriptor; the advantage would be that the client can choose how big the cutout retrieved would be. The Provenance DM would in this case have to have some model of accessing artifacts from the provenance chain through datalink/SODA services (which might be something useful beyond just this use case).
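
    The difference between the two levels can be sketched from the client side as follows. This is an illustration under assumptions: the column names, identifiers and URLs are invented, and the Level 2 request only imitates the general shape of a SODA call (ID and CIRCLE parameters) rather than reproducing a real service descriptor.

```python
from urllib.parse import urlencode

# Level 1: each photometric point carries a ready-made image URL.
# Column names and URLs are illustrative, not taken from an existing service.
lightcurve_level1 = [
    {"mjd": 57000.1, "flux": 1.23, "image_url": "http://example.org/images/a.fits"},
    {"mjd": 57001.1, "flux": 1.31, "image_url": "http://example.org/images/b.fits"},
]


def image_for_point(point):
    """Level 1: return the URL of the image the point was extracted from."""
    return point["image_url"]


# Level 2: each point carries only a dataset identifier; the access URL is
# built by the client from a service endpoint shared by the whole table.
SODA_ENDPOINT = "http://example.org/soda/sync"  # placeholder endpoint
lightcurve_level2 = [
    {"mjd": 57000.1, "flux": 1.23, "image_id": "ivo://example.org/images?a"},
]


def cutout_url(point, ra, dec, radius_deg):
    """Level 2: build a cutout request; the client chooses the radius."""
    params = {"ID": point["image_id"], "CIRCLE": f"{ra} {dec} {radius_deg}"}
    return SODA_ENDPOINT + "?" + urlencode(params)


full_image = image_for_point(lightcurve_level1[0])
cutout = cutout_url(lightcurve_level2[0], 150.0, 2.2, 0.05)
```

    In Level 1 the client can only fetch the whole image, whereas in Level 2 the cutout size becomes a client-side choice.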

  • HIPS creation

    A HiPS image should contain information about its progenitors (the original FITS files).

  • Other Projects

    Starting from the pipeline descriptions of various projects (RAVE, XMM, ...), we check how to apply the main classes of the W3C Provenance Model.

  • Off-line Processing

    An observer wants to reprocess an observation event list to produce level 3 products that fit their science requirements better than those delivered by the routine pipeline. The Provenance model can be used to annotate the regular datasets with the parameters of all tasks run by the pipeline. This would make it easier to set up an off-line processing by refining some of these initial parameter values.
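
    As an illustration of this idea, the following Python sketch starts from task parameters recorded as provenance and produces a refined set for an off-line run. The task names, parameter names and values are invented for the example and do not describe any real pipeline.

```python
import copy

# Parameters recorded in the provenance of a pipeline product
# (task and parameter names are invented for illustration).
recorded_tasks = [
    {"task": "filter_events",
     "params": {"energy_min_kev": 0.2, "energy_max_kev": 12.0}},
    {"task": "make_spectrum", "params": {"bin_size": 5}},
]


def refine(tasks, overrides):
    """Return a copy of the task list with selected parameters replaced.

    overrides maps a task name to the parameters to change; unknown
    parameter names raise KeyError so that typos are not silently ignored.
    """
    new_tasks = copy.deepcopy(tasks)
    for task in new_tasks:
        for name, value in overrides.get(task["task"], {}).items():
            if name not in task["params"]:
                raise KeyError(f"{task['task']} has no parameter {name!r}")
            task["params"][name] = value
    return new_tasks


# Re-run with a higher low-energy cut; everything else stays as recorded.
offline_tasks = refine(recorded_tasks, {"filter_events": {"energy_min_kev": 0.5}})
```

    The recorded parameters are left untouched, so the original processing remains documented next to the refined one.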

Data Model concepts


The W3C provenance DM (http://www.w3.org/TR/prov-overview/) offers an Activity/Entity/Agent pattern that seems attractive for the current use cases. We are currently prototyping some use cases following these ideas.

Provenance DM Vocabulary

Terms describing the provenance relationships of a document or data set already exist in various data publication projects (DataCite, provdm, etc.).

We are currently examining the vocabulary needs in the scope of this IVOA Provenance DM.

The W3C pattern has five main relationships, with qualified roles, between the three parts Entity/Activity/Agent.

  • was attributed to Entity --> Agent: may act as contributor, author/creator, publisher etc.
  • was derived from Entity --> Entity: points at 'progenitor' data sets; role = {IsDerivedFrom, IsSourceOf} in DataCite terms
  • was generated by Entity --> Activity: points from an entity to the action or operation that produced it.
  • used Activity --> Entity
  • was associated with Activity --> Agent
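
As an illustration, the five relationships listed above can be written in the W3C PROV-N notation. The short Python sketch below generates such a document for a synthetic spectrum; the identifiers under the "ex:" prefix and the namespace URL are invented for the example, and the "-" marker stands for an optional argument that is left unspecified.

```python
# Argument layout of each relation in PROV-N; "-" marks an omitted
# optional argument (time for wasGeneratedBy/used, plan for wasAssociatedWith).
RELATIONS = {
    "wasAttributedTo": "{0}, {1}",
    "wasDerivedFrom": "{0}, {1}",
    "wasGeneratedBy": "{0}, {1}, -",
    "used": "{0}, {1}, -",
    "wasAssociatedWith": "{0}, {1}, -",
}


def provn_document(elements, relations, namespace="http://example.org/"):
    """Serialize elements and relations as a PROV-N document string.

    elements:  list of (kind, identifier), kind in entity/activity/agent
    relations: list of (relation name, source identifier, target identifier)
    """
    lines = ["document", f"  prefix ex <{namespace}>"]
    for kind, identifier in elements:
        lines.append(f"  {kind}({identifier})")
    for name, source, target in relations:
        lines.append(f"  {name}({RELATIONS[name].format(source, target)})")
    lines.append("endDocument")
    return "\n".join(lines)


# Hypothetical identifiers, for illustration only.
elements = [
    ("entity", "ex:spectrum"),
    ("entity", "ex:model_atmosphere"),
    ("activity", "ex:spectral_synthesis"),
    ("agent", "ex:provider"),
]
relations = [
    ("wasAttributedTo", "ex:spectrum", "ex:provider"),
    ("wasDerivedFrom", "ex:spectrum", "ex:model_atmosphere"),
    ("wasGeneratedBy", "ex:spectrum", "ex:spectral_synthesis"),
    ("used", "ex:spectral_synthesis", "ex:model_atmosphere"),
    ("wasAssociatedWith", "ex:spectral_synthesis", "ex:provider"),
]

provn = provn_document(elements, relations)
print(provn)
```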

Prototyping along the W3C Provenance pattern

Interactions with other efforts

Requirements from other IVOA WG groups

Previous IVOA efforts (before 2013)

* By following the Legacy link you will find a compilation of previous efforts on Provenance made during the stone and copper ages of the IVOA.

Topic attachments
Attachment                   Size      Date              Who
Example4.pdf                 32.8 K    2015-09-30 16:32  MicheleSanguillon
Example4.png                 934.7 K   2015-09-30 16:32  MicheleSanguillon
Example4.provn               15.8 K    2015-09-30 16:32  MicheleSanguillon
Example4.svg                 190.8 K   2015-09-30 16:33  MicheleSanguillon
PROV-N-Example3.txt          16.2 K    2015-09-23 13:43  MicheleSanguillon
Pollux-PROV-N-Example1.txt   19.9 K    2015-09-22 13:57  MicheleSanguillon
Provenance_150930.png        60.9 K    2015-09-30 16:34  MicheleSanguillon
Topic revision: r40 - 2017-04-11 - MicheleSanguillon