Opening Data – Opening Doors: Technical Standards

Some slightly sketchy notes of Paul Walk’s talk

Paul says the real challenges are around:

  • Business case
  • IPR
  • etc.

Technical issues not trivial, but insignificant compared to other challenges

We aren’t building a system here – but thinking about an environment … although probably will need to build systems on top of this at some point

‘The purple triangle of interoperability’!:

  • Shared Principles
  • Technical Standards
  • Community/Domain Conventions

Standards are not the whole story

  • Use (open) technical standards
  • Require standards only where necessary
  • Avoid pushing standards to create adoption
  • Establish and understand high-level principles and ‘explain the working out’ – support deeper understanding

Paul suggests some ‘safe bets’ in terms of approaches/principles:

  • Use Resource Oriented Architecture
  • Identify persistently – assign global, public identifiers to your high-order entities (e.g. metadata records, actual resources)
    • URLs (or HTTP URIs) are a sensible default for us (although not the only game in town)
  • Use HTTP and REST (a minimal sketch of this resource-oriented approach follows this list)
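
Here is a minimal sketch (not from the talk) of what those 'safe bets' can look like in practice: every high-order entity gets a persistent HTTP URI, dereferenced over plain HTTP GET. The `/records/<id>` pattern, the in-memory store and the use of Flask are illustrative assumptions, not anything prescribed by the RDTF.

```python
from flask import Flask, jsonify, abort

app = Flask(__name__)

# A stand-in for a real metadata store, keyed by persistent identifier.
RECORDS = {
    "rec-0001": {"title": "Example record", "creator": "Example, A."},
}

@app.route("/records/<record_id>")
def get_record(record_id):
    """Dereference the persistent URI for a metadata record."""
    record = RECORDS.get(record_id)
    if record is None:
        abort(404)
    # The identifier travels with the representation, so aggregators can
    # always point back at the canonical resource.
    return jsonify({"id": f"/records/{record_id}", **record})

if __name__ == "__main__":
    app.run()
```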

Aggregation is a cornerstone of the RDTF vision – so make your resources a target for aggregation:

  • use persistent identifiers for everything
  • adopt appropriate licensing
  • ‘Share-alike’ may be easier than ‘attribution’

Paul is still a little sceptical of ‘Linked Data’ – it’s been ‘the future’ for a long time. Tools for Linked Data are still not good enough – it can be a real challenge for developers. However, we should be …

Quoting Tom Coates: “Build for normal users, developers and machines” – and if possible, build the same interface for all three [hint: a simple dump of RDF data isn’t going to achieve this!]

Expect and enable users to filter – give them ‘feeds’ (e.g. RSS/Atom) – concentrate on making your resources available
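
On the 'give them feeds' point, this is one way a minimal Atom feed of recently changed resources could be produced using only the Python standard library – the entries and URLs are made up for illustration.

```python
from xml.etree.ElementTree import Element, SubElement, tostring

ATOM = "http://www.w3.org/2005/Atom"

def build_feed(entries):
    # Build a bare-bones Atom feed: title, id, updated, then one entry per resource.
    feed = Element("feed", xmlns=ATOM)
    SubElement(feed, "title").text = "Recently updated records"
    SubElement(feed, "id").text = "http://example.org/records/feed"
    SubElement(feed, "updated").text = "2011-05-12T00:00:00Z"
    for entry in entries:
        e = SubElement(feed, "entry")
        SubElement(e, "title").text = entry["title"]
        SubElement(e, "id").text = entry["id"]
        SubElement(e, "updated").text = entry["updated"]
        SubElement(e, "link", href=entry["id"])
    return tostring(feed, encoding="unicode")

print(build_feed([{"title": "Example record",
                   "id": "http://example.org/records/rec-0001",
                   "updated": "2011-05-12T00:00:00Z"}]))
```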

Paul sees a slight risk that we embrace ‘open’ at the expense of ‘usability’ – being open is an important first step – but we need to invest in making things useful and usable

Developer friendly formats:

  • XML has a lot going for it, but also quite a few issues
    • well understood
    • lots of tools available
    • validation is a pain
    • very verbose
    • not everything is a tree
  • JSON has gained rapid adoption
    • less verbose – simple
    • enables client side manipulation
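
To make the XML/JSON comparison concrete, here is an illustrative example (not from the talk) of the same record expressed in both formats and parsed with the standard library – it shows why JSON appeals to developers: the parsed result is already a plain data structure.

```python
import json
import xml.etree.ElementTree as ET

xml_record = """<record>
  <title>Example record</title>
  <creator>Example, A.</creator>
</record>"""

json_record = '{"title": "Example record", "creator": "Example, A."}'

tree = ET.fromstring(xml_record)
print(tree.findtext("title"))        # navigate the tree explicitly

data = json.loads(json_record)
print(data["title"])                 # plain dict access, ready for client-side use
```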

Character encodings – a huge number of XML records from UK institutional repositories (IRs) are invalid due to character encoding issues
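
A rough sketch of the kind of check that catches these problems: try to parse each harvested record and report the ones that are not well-formed (bad byte sequences surface as parse errors). The file paths are illustrative.

```python
import glob
import xml.etree.ElementTree as ET

bad = []
for path in glob.glob("harvested/*.xml"):
    try:
        ET.parse(path)                      # fails on undeclared or invalid bytes
    except ET.ParseError as err:
        bad.append((path, str(err)))

for path, reason in bad:
    print(f"invalid: {path} ({reason})")
```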

Technical Foundations:

  • Work going on now – there will be a website (ETA June 2011)
  • The JISC Observatory will gather evidence of ‘good use’ of technical standards etc.
  • Need to understand federated aggregation better

Question for data providers: do you want to provide a data service, or just provide data?

Opening Data – Opening Doors: Cambridge University Library

Finally in this set of three ‘perspectives’ talks, Ed Chamberlain from Cambridge University Library.

Why expose bibliographic data?

  • Natural follow-on from the philosophy of ‘meeting the reader in their (online) place’
  • Already exposing data to others (OCLC, COPAC, SUNCAT etc.) – lots of work to set up each agreement and export – an open data approach might give an easier way of approaching this
  • Offer value for money (for taxpayer)
  • Internal academic pressure – ‘we are being asked for data’

e.g. use case: Rufus Pollock wanted to do an ‘analysis of size and growth of the public domain’ using CUL bibliographic data (http://rufuspollock.org/tags/eupd)

The COMET (Cambridge Open METadata) project will be releasing large amounts of bibliographic data under an Open Data Commons License. Formats will include MARC21 and RDF – partnering with OCLC so linking into related services such as FAST and VIAF.
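
As a hedged sketch (not COMET's actual output) of what an RDF description of a bibliographic record linking out to VIAF might look like, built with rdflib – the identifiers and the choice of Dublin Core terms are assumptions for illustration only.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS

VIAF = Namespace("http://viaf.org/viaf/")

g = Graph()
record = URIRef("http://example.org/records/12345")   # placeholder record URI

g.add((record, DCTERMS.title, Literal("An example title")))
# Point the creator at a VIAF identity rather than a local text string,
# so the record can be joined up with other data sets.
g.add((record, DCTERMS.creator, VIAF["12345678"]))

print(g.serialize(format="turtle"))
```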

Ed thinks the library sector should have following ambitions around resource discovery:

  • Hope to see ‘long tail’ effect – exposing data to large audience
  • ‘Out of domain’ discovery
  • Multiple points of discovery at multiple levels for multiple audiences
  • Services for Undergraduates, for academics AND for developers

Practicalities/Challenges:

  • Licensing
    • While individual records may not be protected by copyright, collections of records may be – and they are often obtained by the library from shared catalogue resources/commercial suppliers under contract
    • Ideal is full unrestricted access
    • Better to publish data (as much as you can, even if necessary to have more restrictive licensing attached)
  • RDF vocab and mappings – no standard
  • Triplestores – for managing RDF – but new technology, seems complex

Opportunities:

  • Strong platform for future development
  • Linked formats and open licenses are virtuous pairing
  • Huge scope for back office benefits

Need to also think beyond bibliography – what about holdings? libraries (physical locations)? librarians as linked data (!) (finding people with specialisms etc.)

Opening Data – Opening Doors: The National Archives

Next up, Nick Kingsley from the National Archives.

For ‘non-archivists’, a whistle-stop tour:

  • Archival holdings consist of collections (or ‘fonds’) representing any number of archival objects – these are primary units of management
  • Collections often have ‘natural’ or imposed structure
  • Ideally catalogues are linked to authority records for names and places, plus taxonomies for subjects
  • Archive users typically use both search and browse to aid resource discovery
  • Archive catalogues are compiled over long periods (a century or more in some cases) – so there are inconsistencies/changes in language etc.

The ‘National Register of Archives’ – the start of ‘aggregation’ for archives – computerised and then made available online through the 80s and 90s

Funding silos meant the outcomes of the ‘Archives Online’ report published by the National Council on Archives (1998) were taken forward through a series of different projects – but all with a commitment to interoperability to allow for integration or cross-searching. Projects include: …

About 5 years ago, a view started to emerge that the future was not about aggregations ‘doing it for’ archives, but about individual archives publishing their own catalogues online – but this has “usually proved disappointing” (personal view of Nick) – because:

  • constrained by lack of technical support
  • 2 widely adopted commercial platforms – developments limited to those supported by the ‘majority’ of customers
  • Rarely offer robust search/browse
  • ….

The National Archives is committed to supporting and promoting open data. It has also been a pioneer in exploiting the potential of Linked Data – through http://legislation.gov.uk – and is looking at a Linked Data version of PRONOM (impartial, definitive information about file formats, software products and other technical components) – see the National Archives ‘labs’ page http://labs.nationalarchives.gov.uk/wordpress/index.php/2011/01/linked-data-and-pronom

Lots of work going on at the National Archives – see http://labs.nationalarchives.gov.uk/wordpress/index.php. Also looking at a review of the National Register of Archives and considering a linked data approach.

Opening Data – Opening Doors: National Maritime Museum

Today I’m at the ‘Opening Data – Opening Doors’ event, which is both the first public event around the ‘Resource Discovery Task Force’ work, and also an opportunity to launch the JISC ‘Guide to Open Bibliographic Data’. The event is being live streamed at http://www.livestream.com/johnpopham

Following an introduction from David Baker on the background of the RDTF and the ‘vision’ that came out of that work, Laurence Chiles from the National Maritime Museum is now talking about how they’ve approached publishing their collections and data on the web. The results are going live in the next few weeks.

Amongst a wide range of aims, they wanted to:

  • Connect objects & records across varied collections – use linked data to enable connectivity between objects; help develop the story and relationships across the collections
  • Give objects a growing online identity – permanent/stable home based on Object IDs
  • To be conversational – let people use the data but then start/react to the conversation – if no one knows it’s there …

Actions they took:

Changed the criteria:

  • From ‘web ready’ to ‘not for web use’ – i.e. moved from a ‘not on web’ assumption to an ‘on web unless a specific decision not to’ assumption
  • Decided publishing data without images was OK
  • 4 basic mandatory fields – (Title, ID, ?, ?)

Offer new ways to the data:

  • OAI-PMH (for aggregation into Culture Grid and onwards to Europeana)
  • OpenSearch
  • XML
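
As a minimal sketch of what harvesting via OAI-PMH looks like: the verbs and the oai_dc metadata prefix are part of the protocol, but the endpoint URL here is a placeholder, not the NMM's actual base URL.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

BASE_URL = "http://example.org/oai"   # assumed OAI-PMH endpoint
params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}

with urllib.request.urlopen(BASE_URL + "?" + urllib.parse.urlencode(params)) as resp:
    tree = ET.fromstring(resp.read())

# Walk the response and print each record's OAI identifier.
OAI = "{http://www.openarchives.org/OAI/2.0/}"
for record in tree.iter(OAI + "record"):
    identifier = record.find(f".//{OAI}identifier")
    if identifier is not None:
        print(identifier.text)
```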

Used principles of linked data to link out of collections online:

  • AEON (Archival retrieval service through their Archive catalogue)
  • Cabinet (for print ordering service)
  • WorldCat (links to publications)
  • Plans to work with Wikimedia Commons to enhance authority records
  • Exposed both the SOLR API for ‘traditional’ search and the SPARQL end-point for linked data (an illustrative query sketch follows this list)
    • Promoted at Culture and History Hack days
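
Here is an illustrative query against a SPARQL endpoint of the kind the NMM exposes – the endpoint URL and the vocabulary in the query are assumptions; the point is just the shape of the interaction.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://example.org/sparql")  # assumed endpoint URL
sparql.setQuery("""
    PREFIX dcterms: <http://purl.org/dc/terms/>
    SELECT ?object ?title WHERE {
        ?object dcterms:title ?title .
    } LIMIT 10
""")
sparql.setReturnFormat(JSON)

# Print each matching object URI and its title.
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["object"]["value"], "-", row["title"]["value"])
```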

Going forward want to:

  • Promote a collaborative, conversational approach – e.g. a ‘Help the NMM’ feature on all records
  • Improve ‘on-gallery’ experiences
  • Continue to release more data and monitor – e.g. 1915 Crew Lists

Provenance use cases for legislation.gov.uk

Stephen Cresswell from The Stationery Office outlining:

Background:

  • legislation.gov.uk – 60k pieces of legislation for the UK, managed by the National Archives
  • Publication in various formats – paper docs, pdf, xml, xhtml+RDFa, RDF
  • TSO (The Stationery Office) currently redesigning workflows for legislation
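
A hedged sketch of pulling one of the machine-readable formats mentioned above: it assumes legislation.gov.uk serves RDF/XML via content negotiation on item URIs, which should be checked against the site's own documentation; the item URI below is a placeholder.

```python
import urllib.request

uri = "http://www.legislation.gov.uk/ukpga/YYYY/NN"   # placeholder – substitute a real item URI
req = urllib.request.Request(uri, headers={"Accept": "application/rdf+xml"})

with urllib.request.urlopen(req) as resp:
    # Inspect what was actually negotiated before parsing anything.
    print(resp.headers.get("Content-Type"))
    rdf_xml = resp.read()
```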

Use cases/Requirements:

  • Drafters of legislation (e.g. government departments) – “How is my job progressing through the publishing workflow?”
  • Management information (aggregated) – “Where are the bottlenecks in the publishing workflow?”
  • Maintainers may want to trace problems – “Which documents were derived from this XSLT?”
  • Anyone might ask “Where did this document come from?”
  • Acceptance test – re-run workflow from provenance graph (to prove that the provenance recorded is the true provenance – that re-running the same workflow results in the same outcome)
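
As a sketch of the 'which documents were derived from this XSLT?' use case expressed as a query over an RDF provenance graph – the wasDerivedFrom property and the URIs are assumptions standing in for whatever provenance vocabulary the workflow actually records.

```python
from rdflib import Graph, Namespace, URIRef

PROV = Namespace("http://example.org/prov#")   # placeholder provenance vocabulary
g = Graph()

xslt = URIRef("http://example.org/transforms/act-to-xhtml.xsl")
doc = URIRef("http://example.org/docs/ukpga-2011-1")
g.add((doc, PROV.wasDerivedFrom, xslt))

# All documents recorded as (directly) derived from this stylesheet:
for subject in g.subjects(PROV.wasDerivedFrom, xslt):
    print(subject)
```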


Provenance and Linked Data Issues

This list was formed from discussions and voting yesterday (and will be posted to the wiki at some point)

Top Issues:

  • Things that change – if the thing that a URI points at changes (e.g. the boundaries of a geographical area), what happens to the existing provenance statements?
  • Provenance of links between data ‘islands’
  • Summarisation of provenance to aid usability and scalability
  • Reasoning about provenance: cross-referencing to fill the gaps in the provenance

Further issues

  • 80/20 principle for provenance
  • What general provenance model to use to enable interoperability
  • Provenance for validation of facts
  • Is reasoning even possible over provenance?
  • Interaction of triple stores and data integration/transformation
  • Semantics vs data capture (does the rich semantic nature of Linked Data offset the need to capture provenance data ‘at source’)
  • Access level of provenance (public vs private provenance statements)

The Open Provenance Model

Introduction to the Open Provenance Model (OPM) from Luc Moreau, University of Southampton.

Came out of a series of ‘provenance challenges’ – driven by a desire to understand provenance systems. The first 3 provenance challenges (around interoperability of provenance information from different systems) led to the OPM, which became the basis of a further challenge event.

OPM is an ‘annotated causality/dependency graph – directed, acyclic’ – it is an abstract model with serialisations to XML and RDF. It encourages specialisation for specific domains through profiles.

A simple use case inspired by a Jeni Tennison blog post http://www.jenitennison.com/blog/node/133 [worth reading for a practical application of provenance in RDF]
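
Here is a toy OPM-style graph (my own sketch, not Luc's example): a process that used one artifact and generated another, recorded as a small directed acyclic graph in RDF. The namespace is a placeholder – the real OPM vocabulary should be used for actual interchange.

```python
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import RDF

OPM = Namespace("http://example.org/opm#")   # placeholder for the OPM vocabulary
g = Graph()

source = URIRef("http://example.org/data/raw-dataset")
process = URIRef("http://example.org/processes/cleanup-run-42")
result = URIRef("http://example.org/data/clean-dataset")

g.add((source, RDF.type, OPM.Artifact))
g.add((result, RDF.type, OPM.Artifact))
g.add((process, RDF.type, OPM.Process))

# Causal/dependency edges: the process used the source artifact, and the
# result artifact was generated by the process.
g.add((process, OPM.used, source))
g.add((result, OPM.wasGeneratedBy, process))

print(g.serialize(format="turtle"))
```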

Lots of detail here that I didn’t get – need to go and read http://openprovenance.org/

OPM itself doesn’t say anything about aggregations – but using OPM it is possible to build an ontology that expresses this – there is a draft proposed collection profile – but needs more work.

Some work around being able to add a digital signature to provenance information – to validate it as from a specific source/person.

… at this point I’m afraid the 5am start got to me and I didn’t manage to capture the rest of this presentation 🙁

Some pointers

Becoming Dataware

This presentation from James Goulding …

Lots of ‘personal’ data being collected (e.g. by Tesco) – but not open. Who owns it? Very unclear – no precedents. Perhaps parallel with photography – if someone takes a photo of you, it is data about you that you don’t own.

Do ‘business’ own it? Lots of data in their data silos (Facebook, Tesco etc.)

Do we own the data? What if ‘I’ (as an individual) want ‘my’ data to be open

Data ‘tug-of-war’ – data can be duplicated, instantly transferrable

Marx: split between ‘those who own the means of production and those who work on them’ – but in data creation of this type, we’re not the worker? We’re not the customer?

We are the product!

Policy is slow to catch up with practice.

Big data generates new data…

So the question is not who owns an individual’s data, but: who controls the means of analysis?

Vision of a ‘personal datasphere’:

  • My (SPARQL?) endpoint – under my control
  • Logically a single entity – a catalogue on hosts under my control or on my cloud – maintains privacy
  • User controlled – may decide to expose to trusted partners or for a price… (right to access data may not be right to process it!)

Dataware concept – it updates the ‘catalogue’ as other data sources (e.g. energy consumption data, social media etc.) update. Third party applications then could request permission to access data from the Dataware catalogue, the catalogue issues a token, which then allows the 3rd party app to access the data source (Facebook, Bank data, etc. etc.)
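
A very rough sketch of that token flow as I understood it (not the project's actual API): a third-party app asks the catalogue for permission, receives a token if the user approves, and then presents that token to the data source.

```python
import secrets

class Catalogue:
    def __init__(self):
        self.tokens = {}          # token -> data source the holder may access

    def request_access(self, app_name, source, user_approves):
        """Issue a token only if the user grants the request."""
        if not user_approves:
            return None
        token = secrets.token_hex(16)
        self.tokens[token] = source
        return token

    def is_valid(self, token, source):
        return self.tokens.get(token) == source

class DataSource:
    def __init__(self, name, catalogue):
        self.name, self.catalogue = name, catalogue

    def fetch(self, token):
        # The source checks the token with the catalogue before releasing data.
        if not self.catalogue.is_valid(token, self.name):
            raise PermissionError("token not valid for this source")
        return f"data from {self.name}"

catalogue = Catalogue()
bank = DataSource("bank", catalogue)
token = catalogue.request_access("budget-app", "bank", user_approves=True)
print(bank.fetch(token))
```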

[Diagram illustrating the Dataware catalogue and token flow]

James interested in how provenance might apply to the Dataware catalogue.

Provenance in the Dynamic, Collaborative New Science

This presentation by Jun Zhao, University of Oxford – but I missed the start as I presented just now and was getting myself sorted…

Want to use some of the systems/expertise of libraries (esp. digital preservation) to preserve workflows – to make experiments repeatable at any time in the future. Project is ‘Workflow4Ever‘ – http://www.wf4ever-project.org/

Need to be able to re-run experiments over data sets as the data sets grow – does the finding remain true as the data grows?

Biology Use case – ‘reuse’:

  • Search for existing experiments from myExperiment (http://myexperiment.org)
    • Challenge – understand the workflow
    • Perform test runs with test data and his/her own data
    • Read others’ logs
    • Read annotations to workflows
  • Reuse scripts from colleagues and perform tests that his/her colleagues are familiar with

Provenance Challenges:

  • Identity
  • Context
  • Storage
  • Retrieval

Provenance and Linked Open Data

Today and tomorrow I’m at a workshop on Provenance and Linked Open Data in Edinburgh. The workshop is linked to the work of the W3C Provenance Incubator Group.

First up Paul Groth (@pgroth) from VU University Amsterdam is going to summarise the work of the incubator group and outline remaining open questions.

Paul says this audience (at this workshop) takes as a given the need for provenance. Provenance is fundamental to the web – it is a pressing issue in many areas for W3C:

  • Linked data/Semantic web
  • Open government (data.gov, data.gov.uk)
  • HCLS (Health Care and Life Sciences)

Most people do not know how to approach provenance – people are looking for a standard and a methodology that they can use immediately. Existing research/work on provenance is scattered across computer and library science research – hard to get an overview. Also, within enterprise/business systems there is often a concept of provenance, but without using the same terminology.

The provenance group was tasked to ‘provide state-of-the-art understanding and develop a roadmap’. About 20 active members, worked over about a year and came to:

  • Common (working) definition of provenance

“Provenance of a resource is a record that describes entities and processes involved in producing and delivering or otherwise influencing that resource”

Provenance is metadata (but not all metadata is provenance). Provenance provides a ‘substrate for deriving different trust metrics’ (but it isn’t trust)

Provenance records can be used to verify and authenticate among other uses – but you can have provenance without cryptography/security

Provenance assertions can have their own provenance!

Inference is useful if provenance records are incomplete. There may be different accounts of provenance for the same data.

  • Developed set of key dimensions for provenance

3 top level dimensions:

Content – ability to identify things; describe processes; describe who made a statement; to know how a database solved a specific query

Management – how should provenance be ‘exposed’? How do we deal with the scale of provenance?

Use – How do we go about using provenance information – showing trust, ownership, uncovering errors, interoperability…

Each of these dimensions broken down into further sub-categories.

  • Collected use cases

Over 30 use cases – from many domains (at least two from Cultural Heritage ‘Collection vs Objects in Cultural Heritage‘, ‘Different_Levels_Cultural_Heritage‘).

  • Designed 3 flagship scenarios from the use cases

The 30+ use cases were boiled down into three ‘super use-cases’ – trying to cover everything:

  1. News aggregator
  2. Disease Outbreak
  3. Business Contracts

  • Created mappings for existing vocabularies for provenance
  • … more

Group came up with recommendations:

  • Proposed a Provenance Interchange Working Group – to define a provenance exchange language – to enable systems to exchange provenance information, and to make it possible to publish this on the web

Timeline:

The W3C is in the process of deciding whether the Provenance Interchange Working Group should be approved. If this goes ahead it will start soon. It will be a two-year working group with aggressive deliverable targets. “Standards work is hard” says Paul. It will rely on the next version of RDF (no time to cover this now).

Open Questions:

  • How to deal with Complex Objects
    • dealing with multiple levels of granularity
    • how provenance interacts with Named Graphs
    • Unification of database provenance and process ‘style’ provenance
    • objects, their versions and their provenance
    • visualisation and summarization
  • Imperfections
    • What is adequate provenance for proof/quality?
    • How do we deal with gaps in provenance?
    • Repeatability vs. reproduction and how much provenance is enough?
    • Can provenance help us get around the problem of reasoning over integrated data?
    • Using provenance as a platform for trust, does it work?
  • Distribution
    • How do we encourage provenance capture?
    • Multiple disagreeing claims about the origins of data – which one is right?
    • SameAs detection through provenance
    • Distribution often gives us privacy – once we integrate, how do we preserve privacy?
    • Scale (way more provenance than data! Has to scale – to very large)
    • Hypothesis: distribution is a fundamental property of provenance