Provenance Workshop in Montpellier, May 3rd - May 4th 2017
==========================================================

Participants: Francois Bonnarel, Mireille Louys, Markus Nullmeier, Kristin Riebe, Michele Sanguillon, Mathieu Servillat

Session 1: Implementation feedback
----------------------------------
3rd May, 11:00 - 19:00 + 4th May, 9:00 - 10:00
Chair: Mireille Louys

* Discussed the overview diagram (in the draft)
  - **TODO Mireille:** create a new version of the UML model in Modelio, but without the intermediate relation classes (replace them with association classes)
* Michele presented her questions from implementing the voprov Python library:
  - which namespace should be used for id, label, description etc.? prov:id or voprov:id?
  - **Decision:** use voprov everywhere, because:
    - there is no UML (or valid VO-DML) for W3C
    - voprov:Entity and its attributes are in principle the same as in W3C, but are semantically different, e.g. the string type is from IVOA, not the general UML string; there may be other subtle differences, so prov:Entity is not exactly the same as voprov:Entity, though they can be translated into each other
    - it's easier to make a consistent model if it does not depend too much on external models -- if we included W3C classes directly and the W3C model changed, then our model might break
* W3C uses prov:label for "human-readable" labels/names; but SimDM uses "name" instead and uses "label" for SKOS labels. Decided to stay closer to SimDM, i.e.:
  - **Decision:** rename 'label' to 'name' everywhere in the model
* For implementations, need attributes for all the classes, including 'used' etc.
  - put a table for each class, even those relation classes, in appendix for the working draft?
  - should also mark the foreign keys in the tables
  - but: will have an automated HTML documentation including all this once the vodml workflow works with the XMI exported from Modelio
  - [Update from 05.05.2017: we have an up-to-date version of that vodml-generated HTML documentation now at http://volute.g-vo.org/svn/trunk/projects/dm/vo-dml/models/provenancedm/ProvenanceDM.html ]
  - XML namespaces are usually used for elements of XML documents and should not be used for values; but here we want to use namespaces for the values. Can we do that?
  - add ucd to each attribute in the model? (and thus also to each table in the draft?)
  - **TODO Kristin**: update the working draft (voprov-namespace, label->name renaming); mark foreign keys in attribute tables
* There should be a new paragraph on the implementation of voprov in the implementation section.
* Restructure section 4) Accessing provenance information:
  - have a really small example for provenance information, in different formats: votable, json, prov-n
  - paragraph on description serialisation, slightly more extended example for this
  - maybe add new paragraph about graphical representation of prov
* A TAP service needs to return exactly one VOTABLE -> how to represent the VOPROV result then, if usually we have multiple tables with relations between them?
  - TAP service returns the result of a query; attributes of entity, agent etc. may be mixed (e.g. id, name exist for each of them). But then it's the user's responsibility to use proper column aliasing.
  - => No need to use unique names for the attributes (e.g. like `ent_id`, `ent_name`, `ag_id`, etc.)
* The provenance service has to provide a way for the user to discover whether it can also talk directly to an obscore table, which additional attributes it supports etc.
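The column-aliasing point from the TAP discussion above can be illustrated with a small sketch. It uses an in-memory SQLite database as a stand-in for a TAP service; the table names (`entity`, `agent`, `was_attributed_to`), the sample rows and the alias names are assumptions made for illustration only, not part of the model or of any existing service.

```python
import sqlite3

# In-memory SQLite database standing in for a TAP service (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entity (id TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE agent (id TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE was_attributed_to (entity_id TEXT, agent_id TEXT);
    INSERT INTO entity VALUES ('ent:1', 'calibrated image');
    INSERT INTO agent VALUES ('ag:1', 'Example Observatory');
    INSERT INTO was_attributed_to VALUES ('ent:1', 'ag:1');
""")

# Both tables have 'id' and 'name' columns; the user disambiguates them
# with aliases in the query, so the tables themselves need no prefixes.
query = """
    SELECT e.id AS ent_id, e.name AS ent_name,
           a.id AS ag_id, a.name AS ag_name
    FROM entity AS e
    JOIN was_attributed_to AS w ON w.entity_id = e.id
    JOIN agent AS a ON a.id = w.agent_id
"""
cursor = conn.execute(query)
columns = [col[0] for col in cursor.description]
rows = cursor.fetchall()

print(columns)
print(rows)
```

The single result table carries the aliased column names, which is why the stored attributes can keep the plain names `id` and `name`.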
* Entities currently contain "dataproductType", "dataProductSubType" and "level" as obscore-specific attributes; but these only fit observation datasets; they have limited usefulness for simulated data and make no sense for entities containing system configuration information, log files and similar.
* => Discussion: use "desctype" for this? or "category"? => **Decision:** Remove obscore-specific attributes from entity
* Discussion on how to link Entity, EntityDescription and Dataset
  - if we were just using W3C, then Dataset belongs to Entity; as it is now, we have split up entity and entitydescription (for normalisation); and one can do the same with the dataset attributes, but not class-wise, only on a per-attribute basis
  - thus:
  - **Decision**: do not link classes from other data models directly with Provenance classes, rather provide a mapping on a per-attribute basis (as already done in the mapping tables in the draft). Just explain the pattern. It's not as easy as taking the prov-classes and adding them to the Dataset model or vice versa, because they follow different design goals.
* Kristin showed an example web application for mapping simulation metadata to SimDM and ProvenanceDM classes => it's the implementer's decision whether to have a SimDM service with provenance on top (derived from SimDM classes on the fly) or a Provenance service enriched with SimDM attributes. Both work fine (especially since SimDM includes the relations needed for provenance tracking)
* Do we need a mapping of entity.id to some external ids of the same entity?
  - i.e. a mapping to obscore_id, simdm_id, spectra_id (there can be more than 1) => No decision reached yet.
* **TODO**: (someone) Make an example of implementing obscore alongside provenance?
* Mireille and Francois reported on a student who is doing an internship at CDS on implementing a PROV-TAP service;
* Mireille made a table of possible attributes with ucds and utypes, see wikipage attachment; could be put as an example in appendix of working draft
* **TODO**: put into the draft a section on how to build a provenance service (simple recipe for others to follow, identify your entities, your activities, ...)
* Mathieu: Implementations for CTA - UWS service
  - UWS service uses a VOTABLE template as ActivityDescription; when a job is created, the input and output as well as the parameters of the activity are known to the UWS service, so it can automatically create provenance.
  - The VOTABLE uses PARAM-groups to group attributes that belong to 'used' and 'wasGeneratedBy'
  - this is used for constructing the web form, but also for provenance (~ prototype), i.e.: if job is finished, generate the provenance from this on the fly
  - Problem:
    + a user creates a job today, takes the result, and several days later uploads the result as input for a new job to the same UWS service.
    + How can the service know that the input file was the result of a previous job?
    + The UWS service currently does NOT keep the provenance information in a database, thus the service does not know the jobid.
  - possible solution: keep the ids of created files in a database, maybe use a SHA-algorithm to create a unique hash for each file; store the hash with the file, so that the UWS service can recognize it.
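The hashing idea from the UWS discussion above could look roughly like the following sketch. It is an illustration only: the helper names (`sha256_of_file`, `register_output`, `find_generating_job`), the in-memory dictionary standing in for a database, and the choice of SHA-256 are all assumptions and not part of the CTA UWS implementation.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Optional

# Hypothetical stand-in for a database table mapping file hash -> job id.
known_outputs: dict[str, str] = {}


def sha256_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def register_output(path: Path, job_id: str) -> str:
    """Remember that the file at `path` was generated by `job_id`."""
    file_hash = sha256_of_file(path)
    known_outputs[file_hash] = job_id
    return file_hash


def find_generating_job(path: Path) -> Optional[str]:
    """Return the id of the job that produced an uploaded file, if known."""
    return known_outputs.get(sha256_of_file(path))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        result = Path(tmp) / "result.dat"
        result.write_bytes(b"output of the first job")
        register_output(result, "job-1")

        # Days later the same bytes are uploaded under a different name.
        upload = Path(tmp) / "upload.dat"
        upload.write_bytes(result.read_bytes())
        print(find_generating_job(upload))
```

Because the lookup depends only on the file content, a renamed upload is still recognised, while any modification of the file yields a different hash and therefore no match.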
* Mathieu: Implementation in CTA pipe
  - Provenance class, use Provenance.startActivity() and Provenance.stopActivity(),
  - add entities using Provenance.addInput and ...addOutput
  - may also start activities within other activities (subactivities)
  - shall record provenance automatically

Session 2: Working draft
------------------------
4th May, 10:00 - 13:00
Chair: Kristin Riebe

* **TODO:** rename 'access' to 'rights' in entity attributes, to be more consistent with other VO models
* EntityDescription and link to other data models:
  - we realized that merging with other (IVOA) data models is not working; it's not straightforward to connect classes from different data models
  - thus: rather focus on provenance here, use external ids for entities to connect them (e.g. use obscore's PublisherDID as entityId or provide a map for them)
* **TODO:** remove Dataset Box from diagram in Figure 5; move figure as detailed collection figure to collection-section
* What is needed to make a service a provenance service?
  - given an id of an entity or activity, return the backwards history in one of the (vo)prov serialisation formats
* Discussion about entity types: prov:type for entity is 'entity' or 'collection'; but need something to distinguish between logging, system, provenance, calibration, simulation, observation, configuration, misc => call this "descriptionType"? or "category"?
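The minimal requirement for a provenance service noted above (given an id, return the backwards history) can be sketched as a simple graph traversal. This is only an illustration under stated assumptions: the identifiers, the two in-memory dictionaries and the function name `backwards_history` are hypothetical, and serialising the resulting triples into one of the (vo)prov formats is left out.

```python
from collections import deque

# Hypothetical in-memory relations; a real service would query its database.
# wasGeneratedBy: entity id -> ids of the activities that generated it
# (a list, since an entity may have more than one such relation).
was_generated_by = {
    "ent:image_cal": ["act:calibrate"],
    "ent:image_raw": ["act:observe"],
}
# used: activity id -> ids of the entities it used
used = {
    "act:calibrate": ["ent:image_raw", "ent:flatfield"],
    "act:observe": [],
}


def backwards_history(start_id):
    """Return (subject, relation, object) triples found by walking backwards
    from an entity or activity id, breadth first."""
    triples = []
    seen = {start_id}
    queue = deque([start_id])
    while queue:
        node = queue.popleft()
        links = [("wasGeneratedBy", t) for t in was_generated_by.get(node, [])]
        links += [("used", t) for t in used.get(node, [])]
        for relation, target in links:
            triples.append((node, relation, target))
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return triples


if __name__ == "__main__":
    for triple in backwards_history("ent:image_cal"):
        print(triple)
```

The `seen` set keeps the traversal from visiting a node twice when several paths lead back to the same entity or activity.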
* **TODO Michele, Francois, Kristin:** Explain in introduction to section 4 that a serialised provenance description can be included in a fits-file etc, as Laurent wants to use it (Kristin makes first draft for this)
* Discussion on ActivityFlow (section 5.3 in draft) and multiplicity of wasGeneratedBy (not unique anymore):
  - could enforce uniqueness by enforcing that finest level always has to exist; activityflow is only a "view" generated by the client to allow users to choose different levels of detail for viewing/exporting provenance
  - Michele sees no problem in keeping multiple wasGeneratedBy relations per entity in implementations, thus allow this
* Discussion: add attribute "viewLevel" for ActivityFlow?
  - problem: which range? Is 0 finest or coarsest version?
  - after defining an activityFlow, there can always be finer and coarser levels added, so where to set the 0 point? use pos. + negative integers?
  - What if there are activityFlows chained together that have different viewLevels?
* section 5.4: vodml: put link to vodml-stuff here
* **TODO**: rewrite sections 5.3 and 5.4 accordingly, i.e. put to the right places in the main part of the draft
* **TODO**: add address and phone number to agents (similar to DatasetDM)
* **TODO**: add new table 5 with examples for entity roles in used/wasGeneratedBy.

Additional discussion on lightcurves use case
---------------------------------------------
(after lunch on 4th May, just Michele, Kristin and Markus)

* Idea: user extracts a lightcurve from a database (see e.g. first lightcurves in GAVO data center, from Kepler, using SPLAT for publication), the data points can be viewed in SPLAT or Topcat or so, magnitudes vs. time.
* image labels exist in lightcurve table for each data point
* lightcurve table is served as VOTable, should be enriched with provenance metadata, so that one can track for each datapoint where it is coming from
* each row in the lightcurve table is an entity with a provenance record
* use the id, a link, ... to retrieve the accompanying provenance record from a web service
* or have the provenance information stored as a blob in the light curves table
* could consider the light curves table as a collection of entities, creationActivity is the activity of putting the data together.
* Idea: be able to click on a point of the lightcurve to retrieve and view the progenitor image => provenance information needs to contain enough information so that one can retrieve the image from another service.
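The lightcurve idea above, where each row is an entity whose provenance record leads back to its progenitor image, might be prototyped along the following lines. Everything here is a hypothetical sketch: the class `LightcurvePoint`, the identifiers, the two lookup dictionaries (standing in for a provenance web service and an image service) and the example.org URLs are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LightcurvePoint:
    """One row of the lightcurve table, treated as an entity."""
    entity_id: str      # hypothetical id used to look up the provenance
    time_mjd: float
    magnitude: float


# Hypothetical provenance records keyed by entity id. In practice these
# would come from a provenance web service or from a blob column.
provenance_records = {
    "lc:point1": {"wasDerivedFrom": "img:0001"},
    "lc:point2": {"wasDerivedFrom": "img:0002"},
}

# Hypothetical access information for the progenitor images.
image_locations = {
    "img:0001": "https://example.org/images/0001.fits",
    "img:0002": "https://example.org/images/0002.fits",
}


def progenitor_image_url(point: LightcurvePoint) -> Optional[str]:
    """Return where the progenitor image of a data point can be retrieved,
    or None if the provenance record or the image location is unknown."""
    record = provenance_records.get(point.entity_id)
    if record is None:
        return None
    return image_locations.get(record["wasDerivedFrom"])


if __name__ == "__main__":
    clicked = LightcurvePoint("lc:point2", 57000.5, 12.3)
    print(progenitor_image_url(clicked))
```

A client such as a plotting tool would call a lookup of this kind when the user clicks a data point, which is why the provenance record has to carry enough information to locate the image at another service.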