Provenance Workshop in Montpellier, May 3rd - May 4th 2017
==========================================================

Participants: 
Francois Bonnarel, Mireille Louys, Markus Nullmeier, Kristin Riebe, Michele Sanguillon, Mathieu Servillat

Session 1: Implementation feedback
----------------------------------
3rd May, 11:00 - 19:00, and 4th May, 9:00 - 10:00
Chair: Mireille Louys

* Discussed the overview diagram (in the draft)
    - **TODO Mireille:** create a new version of the UML model in Modelio, but without the intermediate relation classes (replace them with association classes)
* Michele presented her questions from implementing the voprov Python library:
	- which namespace should be used for id, label, description etc.? prov:id or voprov:id?
	- **Decision:** use voprov everywhere, because:
		- there is no UML (or valid VO-DML) for W3C
		- the voprov:Entity and its attributes are in principle the same as in W3C, but are semantically different, e.g. the string type is from IVOA, not the general UML string, and there may be other subtle differences; so prov:Entity is not exactly the same as voprov:Entity, though they can be translated into each other
		- it's easier to make a consistent model if it does not depend too much on external models -- if we included W3C classes directly and the W3C model changed, then our model might break
* W3C uses prov:label for "human-readable" labels/names, but SimDM uses "name" instead and uses "label" for SKOS labels. We decided to stay closer to SimDM, i.e.:
    - **Decision:** rename 'label' to 'name' everywhere in the model
* For implementations, need attributes for all the classes, including 'used' etc.
    - put a table for each class, even those relation classes, in appendix for the working draft?
    - should also mark the foreign keys in the tables
    - but: will have an automated html-documentation including all this once the vodml-workflow works with the xmi exported from Modelio
    - [Update from 05.05.2017: we now have an up-to-date version of that vodml-generated html-documentation at 
        http://volute.g-vo.org/svn/trunk/projects/dm/vo-dml/models/provenancedm/ProvenanceDM.html
      ]
    - xml-namespaces are usually used for elements of xml-documents, should not be used for values; but here we want to use namespaces for the values. Can we do that?
    - add ucd to each attribute in the model? (and thus also to each table in the draft?)
    - **TODO Kristin**: update the working draft (voprov-namespace, label->name renaming); mark foreign keys in attribute tables
* There should be a new paragraph on the implementation of voprov in the implementation section.
* Restructure section 4) Accessing provenance information: 
    - have a really small example for provenance information, in different formats: votable, json, prov-n
    - paragraph on description serialisation, slightly more extended example for this
    - maybe add new paragraph about graphical representation of prov
* A TAP service needs to return exactly one VOTABLE -> how to represent the VOPROV result then, if usually we have multiple tables with relations between them?
    - A TAP service returns the result of a query; attributes of entity, agent etc. may be mixed in it (e.g. id and name exist for each of them). It is then the user's responsibility to use proper column aliasing.
    - => No need to use unique names for the attributes (e.g. like `ent_id`, `ent_name`, `ag_id`, etc.)
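
A minimal sketch of the column-aliasing point above, not taken from any implementation discussed at the workshop. SQLite stands in for a TAP/ADQL service here, and the table layout (`entity`, `activity`, `wasGeneratedBy` with `entity_id`/`activity_id` columns) as well as the example ids are assumptions made for illustration. Both `entity` and `activity` keep plain `id` and `name` columns; the query itself disambiguates them with `AS`, so the single result table still has unique column names.

```python
import sqlite3

# Hypothetical, simplified provenance tables; the schema is an assumption.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entity (id TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE activity (id TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE wasGeneratedBy (entity_id TEXT, activity_id TEXT);
    INSERT INTO entity VALUES ('ex:image1', 'Calibrated image');
    INSERT INTO activity VALUES ('ex:calib1', 'Calibration run');
    INSERT INTO wasGeneratedBy VALUES ('ex:image1', 'ex:calib1');
""")

# The user aliases the clashing id/name columns in the query itself.
query = """
    SELECT e.id AS entity_id, e.name AS entity_name,
           a.id AS activity_id, a.name AS activity_name
    FROM entity AS e
    JOIN wasGeneratedBy AS g ON g.entity_id = e.id
    JOIN activity AS a ON a.id = g.activity_id
"""
cursor = conn.execute(query)
columns = [d[0] for d in cursor.description]  # column names of the one result table
rows = cursor.fetchall()
print(columns)
print(rows)
```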

* A provenance service has to provide a way for the user to discover whether it can also talk directly to an obscore table, which additional attributes it supports etc.

* Entities currently contain "dataproductType", "dataProductSubType" and "level" as obscore-specific attributes; but these only fit observation datasets; they have limited usefulness for simulated data and make no sense for entities containing system configuration information, log files and similar.
* => Discussion: use "desctype" for this? or "category"?
  => **Decision:** Remove obscore-specific attributes from entity

* Discussion on how to link Entity, EntityDescription and Dataset
    - if we were just using W3C, then Dataset would belong to Entity; as it is now, we have split up Entity and EntityDescription (for normalisation); one can do the same with the dataset attributes, but not class-wise, only on a per-attribute basis
    - thus:
    - **Decision**: do not link classes from other data models directly with Provenance classes; rather provide a mapping on a per-attribute basis (as already done in the mapping tables in the draft). Just explain the pattern. It's not as easy as taking the prov-classes and adding them to the Dataset model or vice versa, because they follow different design goals.
* Kristin showed example web application for mapping simulation metadata to SimDM and ProvenanceDM classes 
    => it's the implementer's decision whether to have a SimDM service with provenance on top (derived from SimDM classes on the fly) or a Provenance service enriched with SimDM attributes. Both work fine (especially since SimDM includes the relations needed for provenance tracking)

* Do we need a mapping of entity.id to some external ids of the same entity?
    - i.e. a mapping to obscore_id, simdm_id, spectra_id (there can be more than 1)
=> No decision reached yet.

* **TODO**: (someone) Make an example of implementing obscore alongside provenance?

* Mireille and Francois reported on a student who is doing an internship at CDS on implementing a PROV-TAP service
* Mireille made a table of possible attributes with ucds and utypes, see wikipage attachment; could be put as an example in appendix of working draft

* **TODO**: put into the draft a section on how to build a provenance service (simple recipe for others to follow, identify your entities, your activities, ...)

* Mathieu: Implementations for CTA - UWS service
- The UWS service uses a VOTABLE template as ActivityDescription. When a job is created, the input and output as well as the parameters of the activity are known to the UWS service, so it can automatically create provenance.
- The VOTABLE uses PARAM-groups to group attributes that belong to 'used' and 'wasGeneratedBy'
- this is used for constructing the web form, but also for provenance (~ prototype), i.e.: if job is finished, generate the provenance from this on the fly

- Problem:
    + A user creates a job today, takes the result, and several days later uploads that result as input for a new job to the same UWS service.
    + How can the service know that the input file was the result of a previous job?
    + The UWS service currently does NOT keep the provenance information in a database, thus the service does not know the jobid.
- possible solution: keep the ids of created files in a database, maybe use a SHA algorithm to create a unique hash for each file; store the hash with the file, so that the UWS service can recognize it.
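
The hash-based solution could look roughly like the sketch below. It is not the CTA UWS implementation: the `ResultRegistry` class, its method names and the in-memory dictionary (standing in for the database) are hypothetical, and SHA-256 is just one possible choice of SHA algorithm. The service registers the digest of every output file together with the job id; when a file is later uploaded as input, the same digest identifies the job that generated it.

```python
import hashlib


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    sha = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            sha.update(chunk)
    return sha.hexdigest()


class ResultRegistry:
    """Hypothetical lookup of job ids by output-file digest.

    A real service would persist this mapping in a database.
    """

    def __init__(self):
        self._job_by_digest = {}

    def register_output(self, path, job_id):
        """Remember which job produced the file at `path`."""
        digest = file_digest(path)
        self._job_by_digest[digest] = job_id
        return digest

    def find_generating_job(self, path):
        """Return the job id that produced this file content, or None."""
        return self._job_by_digest.get(file_digest(path))
```

Because the lookup uses the file content only, it still works if the user renames the file before uploading it again; it fails, by design, as soon as the content is modified.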

* Mathieu: Implementation in CTA pipe
- Provenance class, use Provenance.startActivity() and Provenance.stopActivity(),
- add entities using Provenance.addInput and ...addOutput
- may also start activities within other activities (subactivities)
- shall record provenance automatically
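
To illustrate the pattern, here is a minimal sketch of such a recorder. Only the four method names (`startActivity`, `stopActivity`, `addInput`, `addOutput`) come from the notes; their signatures, the record layout, the `role` argument and the file names are assumptions, and this is not the actual CTA pipe code. Subactivities are handled with a stack: an activity started while another one is running gets that one as its parent.

```python
from datetime import datetime, timezone


class Provenance:
    """Hypothetical recorder; only the four method names come from the notes."""

    def __init__(self):
        self.activities = []  # every activity record, in start order
        self._running = []    # stack of running activities, innermost last

    def startActivity(self, name):
        # An activity started inside a running one becomes its subactivity.
        parent = self._running[-1]["name"] if self._running else None
        record = {
            "name": name,
            "parent": parent,
            "startTime": datetime.now(timezone.utc),
            "stopTime": None,
            "used": [],
            "generated": [],
        }
        self.activities.append(record)
        self._running.append(record)
        return record

    def stopActivity(self):
        # Stops the innermost running activity.
        record = self._running.pop()
        record["stopTime"] = datetime.now(timezone.utc)
        return record

    def addInput(self, entity_id, role=None):
        # Corresponds to a 'used' relation of the innermost running activity.
        self._running[-1]["used"].append((entity_id, role))

    def addOutput(self, entity_id, role=None):
        # Corresponds to a 'wasGeneratedBy' relation.
        self._running[-1]["generated"].append((entity_id, role))


prov = Provenance()
prov.startActivity("pipeline")
prov.addInput("raw_events.fits", role="events")
prov.startActivity("calibration")  # subactivity of "pipeline"
prov.addOutput("calibrated_events.fits", role="calibrated events")
prov.stopActivity()  # stops "calibration"
prov.stopActivity()  # stops "pipeline"
```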


Session 2: Working draft
------------------------
4th May, 10:00 - 13:00
Chair: Kristin Riebe

* **TODO:** rename 'access' to 'rights' in entity attributes, to be more consistent with other VO models

* EntityDescription and link to other data models:
    - we realized that merging with other (IVOA) data models is not working; it's not straightforward to connect classes from different data models
    - thus: rather focus on provenance here, use external ids for entities to connect them (e.g. use obscore's PublisherDID as entityId or provide a map for them)

* **TODO:** remove Dataset Box from diagram in Figure 5; move figure as detailed collection figure to collection-section

* What is needed to make a service a provenance service?
    - given an id of an entity or activity, return the backwards history in one of the (vo)prov serialisation formats
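
A rough sketch of what "return the backwards history" could mean in practice, under stated assumptions: the two dictionaries below stand in for the stored wasGeneratedBy and used relations, entity and activity ids are assumed not to collide, and the function name and example ids are hypothetical. The serialisation into one of the (vo)prov formats is left out.

```python
from collections import deque


def backward_history(start_id, was_generated_by, used):
    """Collect all ids reachable backwards in time from start_id.

    was_generated_by: entity id -> list of activity ids that generated it
    used:             activity id -> list of entity ids it used
    Returns the ids in breadth-first order, starting with start_id.
    """
    visited = []
    seen = {start_id}
    queue = deque([start_id])
    while queue:
        node = queue.popleft()
        visited.append(node)
        # An entity leads back to its generating activities,
        # an activity leads back to the entities it used.
        for parent in was_generated_by.get(node, []) + used.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return visited


# Hypothetical example: raw image -> calibration -> image -> extraction -> lightcurve
was_generated_by = {"ex:lightcurve": ["ex:extract"], "ex:image": ["ex:calib"]}
used = {"ex:extract": ["ex:image"], "ex:calib": ["ex:raw"]}
history = backward_history("ex:lightcurve", was_generated_by, used)
```

Using lists for `was_generated_by` also covers the case of multiple wasGeneratedBy relations per entity.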

* Discussion about entity types: prov:type for entity is 'entity' or 'collection'; but need something to distinguish between logging, system, provenance, calibration, simulation, observation, configuration, misc
=> call this "descriptionType"? or "category"?

* **TODO Michele, Francois, Kristin:** Explain in the introduction to section 4 that a serialised provenance description can be included in a FITS file etc., as Laurent wants to use it (Kristin makes a first draft for this)

* Discussion on ActivityFlow (section 5.3 in draft) and multiplicity of wasGeneratedBy (not unique anymore):
    - could enforce uniqueness by enforcing that finest level always has to exist; activityflow is only a "view" generated by the client to allow users to choose different levels of detail for viewing/exporting provenance
    - Michele sees no problem in keeping multiple wasGeneratedBy relations per entity in implementations, thus allow this

* Discussion: add attribute "viewLevel" for ActivityFlow?
    - problem: which range? Is 0 finest or coarsest version?
    - after defining an activityFlow, finer and coarser levels can always be added, so where to set the 0 point? Use positive and negative integers?
    - What if there are activityFlows chained together that have different viewLevels?

* section 5.4: vodml: put link to vodml-stuff here

* **TODO**: rewrite sections 5.3 and 5.4 accordingly, i.e. put to the right places in the main part of the draft

* **TODO**: add address and phone number to agents (similar to DatasetDM)

* **TODO**: add new table 5 with examples for entity roles in used/wasGeneratedBy.


Additional discussion on lightcurves use case
---------------------------------------------
(after lunch on 4th May, just Michele, Kristin and Markus)

* Idea: user extracts a lightcurve from a database (see e.g. first lightcurves in GAVO data center, from Kepler, using SPLAT for publication), the data points can be viewed in SPLAT or Topcat or so, magnitudes vs. time.

* image labels exist in lightcurve table for each data point
* lightcurve table is served as VOTable, should be enriched with provenance metadata, so that one can track for each datapoint where it is coming from
* each row in the lightcurve table is an entity with a provenance record
* use the id, a link, ... to retrieve the accompanying provenance record from a web service
* or have the provenance information stored as a blob in the light curves table
* could consider the light curves table as a collection of entities, creationActivity is the activity of putting the data together.
* Idea: be able to click on a point of the lightcurve to retrieve and view the progenitor image
 => provenance information needs to contain enough information so that one can retrieve the image from another service.