Today and tomorrow I’m at a workshop on Provenance and Linked Open Data in Edinburgh. The workshop is linked to the work of the W3C Provenance Incubator Group.
First up Paul Groth (@pgroth) from VU University Amsterdam is going to summarise the work of the incubator group and outline remaining open questions.
Paul says this audience (at this workshop) takes as a given the need for provenance. Provenance is fundamental to the web – it is a pressing issue in many areas for W3C:
- Linked data/Semantic web
- Open government (data.gov, data.gov.uk)
- HCLS (Health Care and Life Sciences)
Most people do not know how to approach provenance – they are looking for a standard and methodology that they can use immediately. Existing research/work on provenance is scattered across computer science and library science research, so it is hard to get an overview. Enterprise/business systems also often have a concept of provenance, but without using the same terminology.
The provenance group was tasked to ‘provide state-of-the-art understanding and develop a roadmap’. About 20 active members worked over about a year and came to:
- Common (working) definition of provenance
“Provenance of a resource is a record that describes entities and processes involved in producing and delivering or otherwise influencing that resource”
Provenance is metadata (but not all metadata is provenance). Provenance provides a ‘substrate for deriving different trust metrics’ (but it isn’t trust)
Provenance records can be used to verify and authenticate among other uses – but you can have provenance without cryptography/security
Provenance assertions can have their own provenance!
Inference is useful if provenance records are incomplete. There may be different accounts of provenance for the same data.
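To make the working definition a bit more concrete, here is a minimal Python sketch of a provenance record (my own illustration, not anything the group proposed). The class name `ProvenanceRecord`, the helper `accounts_for`, and all identifiers such as `article-42` are hypothetical. It tries to show three of the points above: a record describes entities and processes, there can be different accounts of provenance for the same resource, and a provenance assertion can itself have provenance.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ProvenanceRecord:
    """A record of the entities and processes that influenced a resource."""
    record_id: str               # identifier, so the record is itself a resource
    resource: str                # identifier of the thing being described
    entities: Tuple[str, ...]    # people, organisations, sources involved
    processes: Tuple[str, ...]   # steps that produced or delivered the resource
    asserted_by: str             # who made this provenance statement


def accounts_for(resource: str,
                 records: List[ProvenanceRecord]) -> List[ProvenanceRecord]:
    """Return every account of provenance that describes the given resource."""
    return [r for r in records if r.resource == resource]


# All identifiers below are invented for illustration.
records = [
    # Two different (possibly disagreeing) accounts of the same article.
    ProvenanceRecord("prov-1", "article-42",
                     ("reporter-a", "wire-feed"), ("written", "aggregated"),
                     "aggregator"),
    ProvenanceRecord("prov-2", "article-42",
                     ("reporter-b",), ("written",),
                     "rival-site"),
    # Provenance of a provenance assertion: this record describes prov-1.
    ProvenanceRecord("prov-3", "prov-1",
                     ("crawler-log",), ("extracted",),
                     "auditor"),
]

article_accounts = accounts_for("article-42", records)
meta_accounts = accounts_for("prov-1", records)

print(len(article_accounts))        # 2
print(meta_accounts[0].asserted_by)  # auditor
```

Note that the sketch does not try to decide which account is right, and it says nothing about trust. It only records who asserted what, which is the ‘substrate’ on which a trust metric could later be built.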
- Developed set of key dimensions for provenance
3 top level dimensions:
Content – ability to identify things; describe processes; describe who made a statement; to know how a database solved a specific query
Management – How should provenance be ‘exposed’; How do we deal with the scale of provenance?
Use – How do we go about using provenance information – showing trust, ownership, uncovering errors, interoperability…
Each of these dimensions broken down into further sub-categories.
Over 30 use cases – from many domains (at least two from Cultural Heritage: ‘Collection vs Objects in Cultural Heritage’ and ‘Different_Levels_Cultural_Heritage’).
- Designed 3 flagship scenarios from the use cases
The 30+ use cases were boiled down into three ‘super use-cases’ – trying to cover everything:
- News aggregator
- Disease Outbreak
- Business Contracts
- Created mappings for existing vocabularies for provenance
Group came up with recommendations:
- Proposed a Provenance Interchange Working Group – to define a provenance exchange language – to enable systems to exchange provenance information, and to make it possible to publish this on the web
Timeline:
W3C is in the process of deciding whether the Provenance Interchange Working Group should be approved. If this goes ahead it will start soon. It would be a two-year working group with an aggressive deliverable target. “Standards work is hard” says Paul. It will rely on the next version of RDF (no time to cover this now).
Open Questions:
- How to deal with Complex Objects
- dealing with multiple levels of granularity
- how provenance interacts with Named Graphs
- Unification of database provenance and process ‘style’ provenance
- objects, their versions and their provenance
- visualisation and summarization
- Imperfections
- What is adequate provenance for proof/quality?
- How do we deal with gaps in provenance?
- Repeatability vs. reproduction and how much provenance is enough?
- Can provenance help us get around the problem of reasoning over integrated data?
- Using provenance as a platform for trust, does it work?
- Distribution
- How do we encourage provenance capture?
- Multiple disagreeing claims about the origins of data – which one is right?
- SameAs detection through provenance
- Distribution often gives us privacy – once we integrate, how do we preserve privacy?
- Scale (way more provenance than data! Has to scale – to very large)
- Hypothesis: distribution is a fundamental property of provenance