Provenance management

"Provenance is information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability or trustworthiness."

For a scientific workflow system, provenance can have several aspects:

  1. Provenance of the workflow definition
  2. Provenance of a workflow run
  3. Provenance of data

Provenance of workflow definitions

Taverna does not capture provenance of editing a workflow definition, but assumes the scientist manages the evolution of workflow definitions through existing means for versioning files, such as filenames and folders, version control systems like Git, or workflow sharing websites like myExperiment.

Within Taverna, a workflow (or nested workflow) can be annotated to attribute its Authors. We recommend separating multiple authors with commas or line breaks.

Taverna's workflow file format has an internal workflow identifier (UUID) which is updated for every workflow change. A log of previous workflow identifiers is included within the workflow definition in both the t2flow and Taverna 3 workflow bundle formats, allowing detection of workflows with common ancestry.
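As a rough illustration of how such an identifier log could be used, the sketch below checks two workflows for common ancestry. It assumes the identifier history of each workflow has already been extracted into a Python list (oldest first, current identifier last); the function names and the generated identifiers are hypothetical and not part of Taverna, and reading the history out of a t2flow file or workflow bundle is not shown.

```python
import uuid


def common_ancestors(history_a, history_b):
    """Return the identifiers present in both histories, in history_a order."""
    seen = set(history_b)
    return [wf_id for wf_id in history_a if wf_id in seen]


def share_ancestry(history_a, history_b):
    """Two workflows share ancestry if any identifier occurs in both histories."""
    return bool(common_ancestors(history_a, history_b))


def new_id():
    """Stand-in for the identifier assigned after each edit."""
    return str(uuid.uuid4())


# Illustrative histories: two workflows edited independently from one original,
# plus an unrelated workflow created from scratch.
original = [new_id()]
branch_a = original + [new_id(), new_id()]
branch_b = original + [new_id()]
unrelated = [new_id()]

print(share_ancestry(branch_a, branch_b))   # the branches share the original identifier
print(share_ancestry(branch_a, unrelated))  # no identifier in common
```

Because every edit produces a fresh identifier, any overlap between two histories indicates that both workflows passed through the same earlier version.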

Provenance of workflow runs

Taverna can capture provenance of workflow runs, including individual processor iterations and their inputs and outputs. This provenance is kept in an internal database, which is used to populate Previous runs and Intermediate results in the Taverna Workbench Results perspective.

The provenance trace can be used by the Taverna-PROV plugin to export (1) the workflow run, including the output and intermediate values, and (2) the provenance trace as a PROV-O RDF graph which can be queried using SPARQL and processed with other PROV tools, such as the PROV Toolbox.

We are planning to extend myExperiment to handle uploading of such provenance traces, which would give a mechanism to present and browse values and details of a workflow run within the browser.

This presentation about Taverna's provenance support gives an overview of the model and software architecture.

Provenance of data

Scientists using Taverna to perform analyses are often more interested in derivation and attribution of workflow data and less concerned about the detailed workflow run provenance. For example, a workflow may perform text-mining on a biomedical article to extract gene names, and then retrieve the genome sequences for those genes using a database lookup. The workflow in effect derives the sequences from the database. Consequently, the sequences should (according to the license of the web service) be attributed to the database's maintainers. Similarly, the sequence list is derived from the biomedical article and also requires attribution.
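The text-mining example can be sketched as a small derivation graph. This is a minimal illustration, not Taverna's data model: the dictionaries, item names, agent labels and the `attributions` function are all hypothetical, chosen only to mirror the derivation and attribution relations described above.

```python
from collections import deque

# Hypothetical record of which data items were directly derived from which.
derived_from = {
    "sequences": ["gene_names", "sequence_database"],
    "gene_names": ["article"],
}

# Hypothetical record of who each source should be attributed to.
attributed_to = {
    "sequence_database": "database maintainers",
    "article": "article authors",
}


def attributions(item, derived_from, attributed_to):
    """Collect the agents for an item and everything it transitively derives from."""
    agents = set()
    seen = {item}
    queue = deque([item])
    while queue:
        current = queue.popleft()
        if current in attributed_to:
            agents.add(attributed_to[current])
        for source in derived_from.get(current, []):
            if source not in seen:
                seen.add(source)
                queue.append(source)
    return agents


# The sequences inherit attribution from both the database and the article.
print(sorted(attributions("sequences", derived_from, attributed_to)))
```

Following the derivation links transitively is what lets the final sequence list carry attribution to the article, even though the article was only used in an earlier step.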

Taverna workflows typically use local tools to combine web services found “in the wild” (e.g., BioCatalogue). This approach does not usually provide “science-level provenance.” myGrid plans to support such data provenance in several ways:

  1. Merging and propagation of PROV-AQ-provided provenance traces for REST services (including matching data identity) -- “white-box service”

  2. A provenance “backchannel” for Components, which can be populated either by the underlying service directly or by shims within the component. This allows higher-level provenance that is meaningful for a set of components instead of service-specific execution details.

  3. Annotation of workflow fragments by common motifs, which can provide higher-level provenance for data generated by the workflow

The paper Enhancing and Abstracting Scientific Workflow Provenance for Data Publishing (doi 10.1145/2457317.2457370) details these approaches.


myGrid actively participated in the W3C Provenance Working Group which developed the PROV family of standards. The Taverna-PROV plugin has been developed for Taverna and allows the export of workflow run provenance as PROV-O RDF.

The wf4ever project is investigating the sharing of workflows and workflow runs as research objects. Of particular importance for Taverna is the development of the Research Object Bundle, which will form a single archive of a workflow run, including run provenance, inputs, outputs, intermediate values, workflow definition and (for Taverna 3) information about the run environment.

Past collaborations

Since early 2010, we have been invited partners of the NSF DataONE project, dedicated to large-scale preservation of scientific data, and founding members of the Workflow and Provenance Working Group promoted by the project, along with Prof. Ludaescher at UC Davis, USA and Juliana Freire at University of Utah, USA.

Historically, work on provenance within the myGrid consortium and Taverna team has focused on multiple aspects, beginning with the design and implementation of Janus, a data model and software component for provenance capture and analysis for Taverna. Our research in this area is often pursued in collaboration with external partners:

Other past collaborations on the topic of provenance include: