Provenance and Linked Open Data

Today and tomorrow I’m at a workshop on Provenance and Linked Open Data in Edinburgh. The workshop is linked to the work of the W3C Provenance Incubator Group.

First up, Paul Groth (@pgroth) from VU University Amsterdam is going to summarise the work of the incubator group and outline the remaining open questions.

Paul says this audience (at this workshop) takes as a given the need for provenance. Provenance is fundamental to the web – it is a pressing issue in many areas for W3C:

  • Linked data/Semantic web
  • Open government (data.gov, data.gov.uk)
  • HCLS (Health Care and Life Sciences)

Most people do not know how to approach provenance – they are looking for a standard and methodology that they can use immediately. Existing research/work on provenance is scattered across computer science and library science research – hard to get an overview. Enterprise/business systems also often have a concept of provenance, but without using the same terminology.

The provenance group was tasked to ‘provide state-of-the-art understanding and develop a roadmap’. About 20 active members worked over about a year and came to:

  • Common (working) definition of provenance

“Provenance of a resource is a record that describes entities and processes involved in producing and delivering or otherwise influencing that resource”
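
As a rough illustration of that definition, a provenance record can be sketched as a small data structure linking a resource to the entities and processes that influenced it. The class and field names below (`Entity`, `Process`, `ProvenanceRecord`) and the sample data are assumptions made up for this example, not vocabulary from the incubator group.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    """Something involved in producing a resource (hypothetical name)."""
    identifier: str


@dataclass(frozen=True)
class Process:
    """An activity that produced, delivered or influenced a resource."""
    name: str
    inputs: tuple = ()  # entities the process used


@dataclass
class ProvenanceRecord:
    """A record describing what was involved in producing one resource."""
    resource: str
    entities: list = field(default_factory=list)
    processes: list = field(default_factory=list)

    def involved(self):
        """Identifiers of every entity named directly or as a process input."""
        names = {e.identifier for e in self.entities}
        for p in self.processes:
            names.update(e.identifier for e in p.inputs)
        return sorted(names)


# Illustrative data only
survey = Entity("survey-2010.csv")
analyst = Entity("analyst:alice")
cleaning = Process("clean-and-aggregate", inputs=(survey,))
record = ProvenanceRecord("report.pdf", entities=[analyst], processes=[cleaning])
print(record.involved())
```

The record here is plain metadata about `report.pdf`; nothing in it says whether the report should be trusted, which matches the distinction Paul draws next.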

Provenance is metadata (but not all metadata is provenance). Provenance provides a ‘substrate for deriving different trust metrics’ (but it isn’t trust)

Provenance records can be used to verify and authenticate among other uses – but you can have provenance without cryptography/security

Provenance assertions can have their own provenance!
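
One way to picture this: each assertion carries an optional pointer to a further assertion describing where it came from. This is only a sketch; `Assertion`, `assertion_chain` and the sample statements are hypothetical names invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Assertion:
    """A provenance statement, who asserted it, and (optionally) its own provenance."""
    statement: str
    asserted_by: str
    provenance: Optional["Assertion"] = None  # provenance of this assertion


def assertion_chain(assertion):
    """Walk from an assertion through the provenance of each assertion in turn."""
    chain = []
    current = assertion
    while current is not None:
        chain.append((current.asserted_by, current.statement))
        current = current.provenance
    return chain


# Illustrative data only
meta = Assertion("claim extracted from archive catalogue", asserted_by="crawler-7")
claim = Assertion(
    "report.pdf was derived from survey-2010.csv",
    asserted_by="alice",
    provenance=meta,
)
for who, what in assertion_chain(claim):
    print(who, "->", what)
```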

Inference is useful if provenance records are incomplete. There may be different accounts of provenance for the same data.
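
A minimal sketch of the kind of inference this might mean, assuming provenance is recorded only as direct "derived from" links: following those links transitively recovers dependencies that nobody recorded explicitly. The function name `infer_ancestors` and the file names are assumptions for illustration, and real provenance inference could be considerably richer than this.

```python
def infer_ancestors(derived_from, resource):
    """Return everything `resource` depends on, following recorded derivations transitively."""
    seen = set()
    stack = [resource]
    while stack:
        current = stack.pop()
        for source in derived_from.get(current, ()):
            if source not in seen:
                seen.add(source)
                stack.append(source)
    return seen


# Only direct derivations are recorded; nothing states outright that
# the report depends on the raw survey.
derived_from = {
    "report.pdf": ["summary-table.csv"],
    "summary-table.csv": ["survey-2010.csv"],
}
print(sorted(infer_ancestors(derived_from, "report.pdf")))
```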

  • Developed set of key dimensions for provenance

3 top level dimensions:

Content – ability to identify things; describe processes; describe who made a statement; know how a database solved a specific query

Management – How should provenance be ‘exposed’; How do we deal with the scale of provenance?

Use – How do we go about using provenance information – showing trust, ownership, uncovering errors, interoperability…

Each of these dimensions broken down into further sub-categories.

  • Collected use cases

Over 30 use cases – from many domains (at least two from Cultural Heritage: ‘Collection vs Objects in Cultural Heritage’, ‘Different_Levels_Cultural_Heritage’).

  • Designed 3 flagship scenarios from the use cases

The 30+ use cases were boiled down into three ‘super use-cases’ – trying to cover everything:

  1. News aggregator
  2. Disease Outbreak
  3. Business Contracts

  • Created mappings for existing vocabularies for provenance
  • … more

Group came up with recommendations:

  • Proposed a Provenance Interchange Working Group – to define a provenance exchange language – to enable systems to exchange provenance information, and to make it possible to publish this on the web

Timeline:

W3C is in the process of deciding whether the Provenance Interchange Working Group should be approved. If this goes ahead, it will start soon. Two-year working group with an aggressive deliverable target. “Standards work is hard” says Paul. Will rely on the next version of RDF (no time to cover this now).

Open Questions:

  • How to deal with Complex Objects
    • dealing with multiple levels of granularity
    • how provenance interacts with Named Graphs
    • Unification of database provenance and process ‘style’ provenance
    • objects, their versions and their provenance
    • visualisation and summarization
  • Imperfections
    • What is adequate provenance for proof/quality?
    • How do we deal with gaps in provenance?
    • Repeatability vs. reproduction and how much provenance is enough?
    • Can provenance help us get around the problem of reasoning over integrated data?
    • Using provenance as a platform for trust, does it work?
  • Distribution
    • How do we encourage provenance capture?
    • Multiple disagreeing claims about the origins of data – which one is right?
    • SameAs detection through provenance
    • Distribution often gives us privacy – once we integrate how do we preserve privacy
    • Scale (way more provenance than data! Has to scale – to very large)
    • Hypothesis: distribution is a fundamental property of provenance
