Linked Data and Libraries: LODUM

LODUM stands for Linked Open Data University of Münster – presented by Carsten Kessler
Started LODUM – about providing scientific and educational data as Linked Open Data
Have started linking Library and CRIS… next want to start linking Courses, and then Buildings (and Bus Stops…)

Brazil – data from a research project
Maps – starting to annotate maps and have descriptions as LD
Biotope data – species etc. from the local region
Interviews – want to annotate recordings and link to transcripts

Library has a central role – it is the hub that provides the publications, and they want to link all the datasets to the relevant publications in the library. Hope that in the future there will be pointers from publications (within the text) to the data.

Concrete use case:
The Institute of Planetology has information about the Moon. To save money they look for areas of the Earth with similar characteristics to use as ‘reference data’ – hope that this will be something they can provide.

Development work is advancing with 4 student assistants. Can see some data at http://data.uni-muenster.de; Establishing contacts with other universities via http://linkeduniversities.org
Need more funding – only have startup funding at the moment

Linked Data and Libraries: Linked Data OPAC

This session by Phil John – Technical Lead for Prism (was Talis, now Capita). Prism is a ‘next generation’ discovery interface – but built on Linked Data through and through.

Slides available from http://www.slideshare.net/philjohn/linked-library-data-in-the-wild-8593328

Now moving to next phase of development – not going to be just about library catalogue data – but also journal metadata; archives/records (e.g. from the CALM archive system); thesis repositories; rare items and special collections (often not done well in traditional OPACs) … and more – e.g. community information systems.

When populating Prism from MARC21 – do an initial ‘bulk’ conversion, then periodic ‘delta’ files – to keep in sync with the LMS. Borrower and availability data is pulled from the LMS “live” – via a suite of RESTful web services.

Prism is also a Linked Data API… just add .rss to collections, or .rdf/.nt/.ttl/.json to items. This makes it simple to publish RSS feeds of preconfigured searches – e.g. new stock, or new stock in specific subjects etc.

Every HTML page in Prism has data behind it you can get as RDF.
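To make that concrete, here is a minimal sketch (in Python with rdflib, not Capita's code) of fetching the RDF behind a Prism page by appending a format extension; the item URL below is made up for illustration.

```python
# A minimal sketch: fetch the RDF representation of a Prism HTML page by
# appending a format extension, then list the triples. The URL is hypothetical.
from rdflib import Graph

item_url = "http://prism.talis.com/bradford/items/123456"  # illustrative item URI

g = Graph()
g.parse(item_url + ".rdf", format="xml")  # same resource, RDF/XML representation

for subject, predicate, obj in g:
    print(subject, predicate, obj)
```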

One of the biggest challenges – Extracting data from MARC21 – MARC very rich, but not very linked… Phil fills the screen with #marcmustdie tweets 🙂

But have to be realistic – 10s of millions of MARC21 records exist – so need to be able to deal with this.
Decided to tackle the problem in small chunks. Created a solution that allows you to build a model iteratively. It also compartmentalises code for different sections – these can communicate but work separately and can be developed separately. Makes it easy to tweak parts of the model.

Feel they have a robust solution that performs well – performance matters: even at only 10 seconds per MARC record, several million records would take months to convert.

No matter what MARC21 and AACR2 say – you will see variations in real data.

Have a conversion pipeline (sketched roughly below):
Parser – reads in MARC21 and fires events as it encounters different parts of the record – it’s very strict with syntax, so insists on valid MARC21
Observer – listens for MARC21 data structures and hands control over to …
Handler – knows how to convert MARC21 structures and fields into Linked Data
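For what it's worth, here is a rough Python sketch of that parser → observer → handler pattern as I understood it – this is not the actual Prism implementation, and the field tags and callback names are invented:

```python
# Sketch of the described pipeline: the parser fires events, the observer
# routes them, handlers emit Linked Data. Names and tags are illustrative.
class Observer:
    """Listens for MARC21 structures and routes them to handlers."""
    def __init__(self):
        self.handlers = {}            # field tag -> handler callable

    def register(self, tag, handler):
        self.handlers[tag] = handler

    def on_field(self, tag, value):
        handler = self.handlers.get(tag)
        if handler:
            handler(value)            # hand control over to the handler

class FormatHandler:
    """Knows how to turn format-related MARC fields into Linked Data."""
    def __call__(self, value):
        print("emit format triples for:", value)

observer = Observer()
observer.register("007", FormatHandler())

# A strict parser would fire events like this as it reads each record:
observer.on_field("007", "sd fsngnnmmned")    # example 007 value for a sound disc
```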

First area they tackled was Format (and duration) – good starting point as it allows you to reason more fully about the record – once you know Format you know what kind of data to expect.

In theory should be quite easy – MARC21 has lots of structured info about format – but in practice there are lots of issues:

  • no code for CD (it’s a 12 cm sound disk that travels at 1.4m/s!)
  • DVD and LaserDisc shared a code for a while
  • Libraries slow to support new formats
  • limited use of 007 in the real world

E.g. places to look for format information:
007
245$$h
300$$a (mixed in with other info)
538$$a

Decided to do the duration at the same time (a rough field-checking sketch follows this list):
306$$a
300$$a (but lots of variation in this field)
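A rough pymarc sketch of that “look in several places” approach – the field choices mirror the lists above, the file name is hypothetical, and real records need far more normalisation than this:

```python
# Collect format/duration clues from the fields mentioned above. Illustrative only.
from pymarc import MARCReader

def guess_format_and_duration(record):
    clues = {}
    for f in record.get_fields('007'):          # physical description fixed field
        clues.setdefault('007', []).append(f.data)
    for f in record.get_fields('245'):
        clues['245$h'] = f.get_subfields('h')   # medium, e.g. [sound recording]
    for f in record.get_fields('300'):
        clues['300$a'] = f.get_subfields('a')   # extent, mixed in with other info
    for f in record.get_fields('538'):
        clues['538$a'] = f.get_subfields('a')   # system details, e.g. "DVD"
    for f in record.get_fields('306'):
        clues['306$a'] = f.get_subfields('a')   # playing time
    return clues

with open('records.mrc', 'rb') as fh:           # hypothetical file name
    for record in MARCReader(fh):
        print(guess_format_and_duration(record))
```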

Now Phil talking about ‘Title’ – v important, but of course quite tricky…
245 field in MARC may duplicate information from elsewhere
Got lots of help from http://journal.code4lib.org/articles/3832 (with additional work and modification)

Retained a ‘statement of responsibility’ – but mostly for search and display…

Identifiers…
Lots of non identifier information mixed in with other stuff – e.g. ISBN followed by ‘pbk.’
Many variations in abbreviations used – have to parse all this stuff, then validate the identifier
Once you have an identifier, you start being able to link to other stuff – which is great.
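Something like the following (illustrative only, not Prism's code) is what that clean-then-validate step amounts to for an ISBN-13 with qualifier text mixed in:

```python
# Strip qualifier text like "(pbk.)" from a raw 020$a value and check the
# ISBN-13 check digit before using it as a link target.
import re

def clean_isbn(raw):
    match = re.search(r'[\dXx-]{10,17}', raw)      # pull out the digit/X run
    return match.group(0).replace('-', '').upper() if match else None

def valid_isbn13(isbn):
    if len(isbn) != 13 or not isbn.isdigit():
        return False
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(isbn))
    return total % 10 == 0

raw_020a = "9780141036144 (pbk.)"                  # typical messy 020$a value
isbn = clean_isbn(raw_020a)
print(isbn, valid_isbn13(isbn))                    # 9780141036144 True
```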

Author – Pseudonyms, variations in names, generally no ‘relator terms’ in 100/700 $$e or $$4 – which would show the nature of the relationship between the person and the work (e.g. ‘author’ ‘illustrator’) – because these are missing have to parse information out of the 245$$c

… and not just dealing with English records – especially in academic libraries.

Have licensed Library of Congress authority files – which helps… – authority matching requirements were:
Has to be fast – able to parse 2M records in hours not days/months
Has to be accurate

So – store authorities as RDF but index them in SOLR – gives speed, and for bulk conversions you don’t get HTTP overhead…
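A minimal sketch of that “RDF for storage, Solr for matching” idea using pysolr – the core name, field names and the example authority entry are mine, not Capita's:

```python
# Index (URI, preferred label) pairs extracted from the authority RDF, then
# match heading strings against the local index instead of a remote service.
import pysolr

solr = pysolr.Solr('http://localhost:8983/solr/authorities', timeout=10)

solr.add([{
    'id': 'http://id.loc.gov/authorities/names/n79021164',   # example authority URI
    'label': 'Twain, Mark, 1835-1910',
}], commit=True)

# Bulk conversion matches 100/700 strings locally, avoiding an HTTP lookup per name.
for doc in solr.search('label:"Twain, Mark"'):
    print(doc['id'], doc['label'])
```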

Language/Alternate representation – this is a nice ‘high impact’ feature – allows switching between representations – both forms can be searched for – uses RDF language tags – so also useful for people consuming the machine-readable RDF
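The RDF side of this is just language-tagged literals; a tiny rdflib illustration (with made-up URIs):

```python
# Both forms of a title attached to the same resource, distinguished by
# language tag, so either can be searched for or displayed.
from rdflib import Graph, Literal, Namespace, URIRef

DC = Namespace("http://purl.org/dc/terms/")
g = Graph()
work = URIRef("http://example.org/items/123")     # illustrative item URI

g.add((work, DC.title, Literal("War and Peace", lang="en")))
g.add((work, DC.title, Literal("Война и мир", lang="ru")))

for title in g.objects(work, DC.title):
    print(title, title.language)
```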

Using and Linking to external data sets…
part of the reason for using linked data – but some challenges….

  • what if datasource suffers downtime
  • worse – what if datasource removed permanently?
  • trust
  • can we display it? is it susceptible to vandalism?

Potential solutions (not there yet):

  • Harvest datasets and keep them close to the app
  • if that’s not practical, proxy requests using a caching proxy – e.g. Squid (a simple in-app cache along these lines is sketched below)
  • if using Wikipedia and worried about vandalism – put in checks for likely vandalism activity – e.g. many edits in a short time
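As a trivial illustration of keeping external data “close to the app”, the sketch below caches remote lookups in memory – in practice you would put Squid (or a persistent local copy of the dataset) in front instead:

```python
# Cache external lookups so a downed source only affects URIs never seen before.
from functools import lru_cache
import urllib.request

@lru_cache(maxsize=1024)
def fetch_external(uri):
    # Only hits the remote source on a cache miss.
    with urllib.request.urlopen(uri, timeout=5) as resp:
        return resp.read()
```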

Want to see
More library data as LOD – especially on the peripheries – authority data, author information, etc.
LMS vendors adopting LOD
LOD replacing MARC21 as standard representation of bibliographic records!

Questions?
Is process (MARC->RDF) documented?
A: Would like to open source at least some of it… but there are discussions to have internally in Capita – so something to keep an eye on…

Is there a running instance of Prism to play with?
A: Yes – e.g. http://prism.talis.com/bradford/

[UPDATE: See in the comments – Phil suggests http://catalogue.library.manchester.ac.uk/ as one that uses a more up-to-date version of the transform]

Linked Data and Libraries: Report on the LOD LAM Summit

This report from Adrian Stevenson from UKOLN. Summit held 2-3 June 2011. Brought together 100 people from around the world, with generous funding from the Internet Archive; National Endowment for the Humanities; Alfred P. Sloan Foundation.

Adrian’s slides at http://www.slideshare.net/adrianstevenson/report-on-the-international-linked-open-data-for-libraries-archives-and-museums-summit

Find out more at http://lod-lam.net/summit

85 organisations represented from across libraries, archives, museums.

Summit aimed to be practical and about actionable approaches to publishing Linked Open Data. Looking at Tools, licensing policy/precedents, definitions and use cases.

Meeting ran on an ‘Open Space Technology’ format – somewhere between a formal conference and an unconference – agenda created via audience proposals – but there were huge numbers of sessions being proposed/run.

First day was more discursive, second day about action.

Some sessions that ran:
Business Case for LOD; Provenance/Scalability; Crowdsourcing LOD; Preservation of RDF/Vocabulary Maintenance

What next?
Connections made; activities kick-started
Many more LOD LAM events planned – follow #lodlam on Twitter
#lodlam #london had a meeting yesterday; but a bigger event is planned for November – see http://bit.ly/oJ6qsl for more details as they become available

Linked Data and Libraries: Richard Wallis

A year on from the first Talis Linked Data and Libraries meeting – a lot has happened. The W3C has a group on ‘linked data and libraries’; the BL has released a load of records as RDF/XML – a brave decision; Richard went to a meeting in Denmark where there was discussion about trying to persuade publishers to release metadata; Europeana Linked Data is now available; parts of the French National Catalogue, etc. etc.

While it may feel that progress is slow – people are getting out there and ‘just doing it’ as Lynne Brindley just suggested.

Talis – started pioneering in Semantic Web/Linked Data areas. Recently the Library Systems Division has been sold to Capita – allowing it to focus on libraries, while ‘Talis’ goes forward with a focus on linked data/semantic web.

Now Talis is made up of:

Talis Group – core technologies/strategy – run the Talis Platform
Talis Education – applications in academia – offer the Talis Aspire product (for ‘reading lists’ in Universities)
Talis Systems Ltd – Consulting/Training/Evangelism around Linked Data etc.
Kasabi.com – [Linked] Data Marketplace – free hosting at the moment – Community – APIs. Eventually looking to monetise

Enough about Talis …

UK Government still pressing forward with open linked data
BBC have done more with Linked Data – e.g. World Cup 2010 site was based on Linked Data – delivered more pages and more traffic with less staff. BBC already working with same technology to look at Olympics 2012 site…

Richard now mentioning Good Relations ontology – adoption by very large commercial organisations.

Linked Data ‘cloud’ has got larger – more links – but what are these links for?
Links (i.e. http URIs) identify things – and so you can deliver information about things you link to… Richard says lots of the ‘talk’ is about things like SPARQL endpoints etc. But should be about identifying things and delivering information about them.

Richard breaking down Linked Data – how RDF describes stuff etc. Allows you to find relationships between things – that machines can parse… [Richard actually said ‘understand’ but don’t think he is necessarily talking AI here]

Richard stressing that Linked Data is about technology and Open Data about licensing – separate issues which talking about ‘Linked Open Data’ conflates – quotes Paul Walk on this from http://blog.paulwalk.net/2009/11/11/linked-open-semantic/ – but says he (Richard) would talk about the Linked Data web not the Semantic Web (Paul uses latter term)

Richard thinks that Libraries have an advantage in entering the Linked Data world – lots of experience, lots of standards, lots of records. We have described things, whereas generally people just have the things they need to describe.

Already have lots of lists of things – subject headings (lcsh), classifications (dewey), people (authority files)

Are libraries good at describing things… or just books?

Are Libraries applicable for Linked Data? How are we doing? Richard gives a school report – “Could do better”; “Showing promise”

When we look at expressing stuff as linked data we need to ask “Why and Who For!”

Linked Data and Libraries: Keynote

Today at the second Linked Data and Libraries meeting organised by Talis.

Lynne Brindley is kicking off the day…
Noting that the broad agenda is about getting access to information, linking across domains etc. Sees the potential of the Linked Data approach to increase the use of their catalogue and so their collections. Bringing better discovery and therefore utility to researchers – and exploiting the legacy of data that has been created over long periods of time.

The British Library is ‘up for it’ – but needs to look at the costs and benefits and may need to convince sceptics. But the BL has a history of taking innovative steps – it introduced public online information retrieval systems to the UK around 40 years ago (MEDLARS in the 1970s). 10 years later the UK was one of the first countries to publish its National Bibliography on CD-ROM (now ‘museum pieces’! says Lynne).

And now exposing national bibliography as linked open data… – some history:

BL involved in UK PubMed Central (UKPMC) – repository of papers, patents, reports etc. etc. Contains many data types from many organisations. Provides better access to hard to find reports, theses etc. For Lynne this is also about ‘linking’ even if not built on “Linked Data” technology stack. – sees it as part and parcel of same thing and movement in a direction of linking materials/collections.

Also ‘sounds’ – UK Sound Map http://sounds.bl.uk/uksoundmap/index.aspx – linked across domains and also involved public in capturing ‘sounds of Britain’ – via AudioBoo – added metadata and mashed up with Google Maps…

‘Evolving English’ exhibition – had a ‘map your voice’ element – many people recorded the same piece of material – which has been incorporated into a research database of linguistic recordings – global collaboration and massive participation.

Lynne says – it is pretty difficult to do multi-national, multi-organisational stuff – and should learn from these examples.

The BL Catalogue is the primary tool to access, order and manage the BL collections. The BL has long operated a priced service where catalogue records are sold to others – in various forms. Despite pressure from Government to earn money, the BL decided to take the step of offering BNB records as RDF/XML under a CC0 licence. Today they will be announcing a linked data version of the BNB – more later today from Neil Wilson.

Hope that the data will get used in a wide variety of ways. Key lesson for BL says Lynne – is ‘relinquish control, let go’ – however you think people are going to use what you put out there, they will use it in a different way.

Promise of linked data offers many benefits across sectors, across ‘memory institutions’. But the institutions involved will need to face cultural change to achieve this. ‘Curators’ in any context (libraries, archives, museums) are used to their ‘vested authority’ – and we need to both recognise this at the same time as ‘letting go’ – from the library point of view no-one can afford to stand on the sidelines – we need to get in there and experiment.

Need to get out of our institutional and metadata silos – and take a journey to the ‘mainstream future’. Partnerships are important – and everyone wants to ‘partner’ with the British Library – but often proposed partnerships are one sided – we need to look for win-win partnerships with institutions like the BL.

Final message – we are good at talking – but we need to ‘just do it’ – do it and show benefits and convince people.

JISCExpo: Past, Present and Future of Linked data in Higher Education

Opening the day is Paul Miller who authored the Linked Data Horizon Scan.

The horizon scan was written in Q3/4 2009 to see what JISC needed to focus on in terms of Linked Data. It included 9 recommendations – 3 on web identifiers; 4 on data publishing; 2 support measures

Paul going to revisit these recommendations this morning…

“Learn from Cabinet Office Guidance on the creation of URIs”
Paul thinks this ‘is done’ – “almost unnecessary to say in 2011”
[I’m not as optimistic as Paul on this one – still feel huge battle to fight in terms of convincing people that URIs are identifiers not just web addresses]

“Identify a core set of widely used identifiers (JACS, institution codes, etc.) and assign HTTP URIs”
Paul says “some progress I suppose – really should have been more though – this is the step that will make all of this stuff work across data sets”

“Identify the ways that researchers identify themselves and link to institutional, professional, social identities as appropriate”
Paul says – on the whole this remains ad hoc, with individuals using self-defined URLs (whether self-owned or twitter/linkedin/blog URLs) as surrogates. Paul asks – is this OK?

David Shotton (Oxford) mentions ORCID http://www.orcid.org/ – feeling from the room that it shows promise for solving the problem of identifying people in authoring contexts – but there is a larger problem, and the question of how multiple identifiers for people are matched up is also going to be problematic

“Look at OPSI Unlocking Service (http://www.opsi.gov.uk/unlocking-service/OPSIpage.aspx?page=UnlockIndex) and consider whether a similar approach might be used in helping the community identify data sets to prioritise”
Paul says – not really been looked at – do we still need to?

“Evaluate the effectiveness of Data Incubator etc as a way of marrying data to developers”
Paul says – some DevCSI activity on this but nothing systematic?
Also http://getthedata.org gets a mention, and ‘data without borders’ initiative http://jakeporway.com/2011/06/data-without-borders-an-exciting-first-day/

“Validate existing data licenses, and engage with government”
Not perfect but pretty good across the Strategic Content Alliance (SCA http://sca.jiscinvolve.org/wp/), the Discovery initiative (http://discovery.ac.uk), etc.

Paul feels we are moving to the point where the question is ‘why can this not be open’ rather than ‘why should this be open’

“Demonstrate the utility of embedding RDFa on institutional web pages”
Paul is really surprised by the apparent lack of any serious progress in this area… Debate from the floor – some question the value of RDFa – why do it? If the CMS doesn’t support it then it is difficult to achieve. I raise the issue we saw in Lucero of how Google et al. actually present or use data published in RDFa

“Identify ways in which the community can consume and contribute to existing data services.”
? Missed Paul’s summary…

“Identify a focus for Linked Data activities”
This programme – #jiscexpo – so challenge now is how to share and get issues out.

Paul thinks on balance – good progress on 4, failed on 5

Paul says – we need to focus less on raw numbers more on real utility – and more links between resources – how do we achieve this? Very little interlinking going on – except small number of key resources such as Dbpedia. Need real linking beyond a single data set, beyond a single institution…

Debate from the floor – is ‘link more’ really relevant? Paul agrees – again just ‘lots of links’ is not the point – about valuable linking.

I make point about re-use of URIs – more difficult than coining URIs!

Paul says we need to share wisdom around – Where does Linked Data add real value – where is it merely possible and where is it ‘really stupid’…

JISCExpo: Notes from the Linked data Workshop at Stanford

This session being presented by Jerry Persons.

This workshop spent a week looking at Linked Data – ‘be part of the web, not just on it’. Workshop sponsored by a variety of research and national libraries, research groups, companies, etc.

The workshop focused on “crafting fund-able plans for creating tools, processes and vehicles to expedite a disruptive paradigm shift in the workflows, data stores and interfaces used for managing, discovery and navigating…”

The workshop was deliberately ‘library focussed’ but recognised much wider issues – especially synergy for GLAM (galleries, libraries, archives, museums)

“I’ve liked to characterize the current moment as a circle of libraries, museums, archives, universities, journalists, publishers, broadcasters and a number of others in the culture industries standing around, eyeing one another and at the space between them while wondering how they need to reconfigure for a world of digitally networked knowledge” – Josh Greenberg

“The biggest problem we face right now is a way to ‘link’ information that comes from different sources that can scale to hundreds of millions of statements” – Stefano Mazzocchi

22 issues were identified by the mid-point of the workshop – just a few here:
co-referencing, reconciliation
use of extant metadata
killer apps
user seduction and training
workflow
scalability
licensing

Jerry says … “The elevator pitch for linked data does not yet exist”

Thinking about ‘novice’ (apprentice), ‘journeyman’, ‘master’ stages of engaging with Linked Data:

  • Value statement use cases
  • Publishing data
  • etc.

At each stage we should be looking at model implementations that people can look at/follow

Elephants in the room:
URIs not strings – don’t underestimate the amount of effort required to transform large subsets of GLAM metadata into linked data with URIs as identifiers
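A tiny illustration of that ‘URIs not strings’ shift – swapping a literal subject heading for an id.loc.gov URI via a lookup table. Real reconciliation is a far bigger matching problem, and the single table entry here is just an example:

```python
# Replace a string subject heading with a URI so it becomes linkable.
from rdflib import Graph, Literal, Namespace, URIRef

DC = Namespace("http://purl.org/dc/terms/")
LCSH = {   # example reconciliation table entry, illustrative only
    "World War, 1939-1945": URIRef("http://id.loc.gov/authorities/subjects/sh85148273"),
}

g = Graph()
book = URIRef("http://example.org/items/42")
g.add((book, DC.subject, Literal("World War, 1939-1945")))     # string heading

for s, p, o in list(g.triples((None, DC.subject, None))):
    uri = LCSH.get(str(o))
    if uri:
        g.remove((s, p, o))
        g.add((s, p, uri))                                      # now a linkable URI
```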

Caveats…
Management of co-references needs to be a bottom-up process
Build systems that accept the way the world is, not the way you would like it to be
Focus on changing current practices (in the long run) not only on reconciling data (in the short run) – preventing problems is better than solving them!

Some docs will be coming out from the workshop very soon as well as proposals for work – over next few months

JISCExpo: Community and Linked Data

I’m at the #jiscexpo programme meeting today and tomorrow…

Ben O’Steen is the first formal talk of the day … talking about ‘community’…

Ben notes that SPARQL has a very bad reputation – people don’t like it and don’t want to use it. Taking a step back – SQL is standard way of interacting with databases, but in general you don’t write SQL queries against someone else’s database – and v unusual to do this without permission and documentation etc. (I guess unless you are really hacking into it!)

In general SQL databases are hidden from ‘remote’ users via APIs or other interfaces which present you with views on the data, not the raw data structure.

So what does this tell us about what we need to do with Linked Data?

Interaction Feedback Loop – fundamental – if you can get this you get engagement. Example ‘mouse presses button, mouse gets cheese’ – this encourages a behaviour in the mouse. Ben uses World of WarCraft as example of interaction feedback loop that works incredibly well – people write their own programmes and interfaces for WoW.

Ben notes this is not about Gamification… this is about getting pay-off for interaction.

Ben sets some homework – go read http://jboriss.wordpress.com/2011/07/06/user-testing-in-the-wild-joes-first-computer-encounter/ – blog post about user testing on web browsers and the experience of ‘Joe’ a 60 year-old who has never used a computer before – and what happened when he tried to find a local restaurant to eat in via three major web browsers “There is little modern applications do to guide people who have never used a computer”.

Sliding scale of interaction

  • googling and finding a website;
  • hunting and clicking round the website for information;
  • using a well-documented or cookie-cutter API (such as an Atom feed or a search interface);
  • Using boolean searching or other simple API ‘tricks’ –
    • WITHOUT requiring understanding of the true data model

Ben now going back to SPARQL – it is common when interacting with an unknown SPARQL endpoint to become frustrated….

What do you need to understand to craft successful SPARQL? You need to understand (the sketch after this list gives a sense of how much is involved):

  • RDF and triple/quad model
  • RDF types and namespaces
  • structures in an endpoint
  • SPARQL syntaxes
  • SPARQL return formats
  • libraries for RDF responses
  • libraries for XML responses
  • … and more
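To give a flavour of how much of that list you need before you get anything back, here is a small example using SPARQLWrapper against the public DBpedia endpoint (assuming it is up and its data layout hasn't changed):

```python
# Endpoint URL, SPARQL syntax, return format and a client library all have to
# be in place before any "cheese" appears.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?label WHERE {
        <http://dbpedia.org/resource/Linked_data> rdfs:label ?label .
        FILTER (lang(?label) = "en")
    }
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["label"]["value"])
```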

Developers are clamouring for APIs

  • Every new social/web service is seen to be lacking if it is missing an API due to desire to build mobile applications
  • Whilst SPARQL can be seen as the ultimate API, by that logic the ultimate Twitter API would be direct access to its Scala/Java libraries
  • Many need to see the benefits of something simple in order to hook them into learning something more complex

Taking an ‘opinionated view’ on information helps adopters – offering a constrained view of the model. Could offer csv/json/html views on the data behind a SPARQL endpoint. Ben notes ‘access to the full model is a wonderful thing’ – but don’t forget (paraphrasing) ‘most average developers want a constrained view’
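A sketch of what such a constrained view might look like – flattening a few agreed predicates from an RDF graph into plain JSON so a developer never has to see the full model; the predicate and URI choices are illustrative:

```python
# An "opinionated" flat view over an RDF graph, of the kind a /items/42.json
# endpoint could return.
import json
from rdflib import Graph, Literal, Namespace, URIRef

DC = Namespace("http://purl.org/dc/terms/")

def simple_view(graph, item_uri):
    """Flatten a handful of agreed predicates into a plain dict."""
    return {
        "title": str(graph.value(item_uri, DC.title, default="")),
        "creator": str(graph.value(item_uri, DC.creator, default="")),
        "date": str(graph.value(item_uri, DC.date, default="")),
    }

g = Graph()
item = URIRef("http://example.org/items/42")          # made-up item URI
g.add((item, DC.title, Literal("Brighton Rock")))
g.add((item, DC.creator, Literal("Greene, Graham")))

print(json.dumps(simple_view(g, item)))
```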
Ben now talking about schema.org – a new initiative from Google, Bing and Yahoo! Ben notes – schema.org delivers ‘cheese’ immediately – it is clear that the reason you want to do this is to improve search engine results.

Ben notes – schema.org contains very ‘opinionated’ views of the things it can describe – but this gives simplicity and lowers barriers to adoption.

Schema.org is going to increase the amount of structured data on the web.

In summary:

  • Be empathetic to those who don’t understand what you are doing
  • Need to provide gamut of views on your data
  • You don’t have to use a triplestore to use RDF
  • Raw dumps of data are often far better than dumps of structured data such as RDF if that structure is not documented
  • “Semantic Web” has garnered such bad PR that ‘we’ (?) are on the back foot – things and attitudes need to change or it will be forgotten in favour of schema.org

An internal monologue on Metadata Licensing

I’ve been involved in discussions around the licensing of library/museum/archive metadata over the last couple of years, specifically through my work on UK Discovery (http://discovery.ac.uk) – an initiative to enable resource discovery through the publication and aggregation of metadata according to simple, open, principles.

In the course of this work I’ve co-authored the JISC Guide to Open Bibliographic Data and become a signatory of the Discovery Open Metadata Principles. Recently I’ve been discussing some of the issues around licensing with Ed Chamberlain and others (see Ed’s thoughts on licensing on the CUL-COMET blog), and over coffee this morning I was trying to untangle the issues and reasons for specific approaches to licensing – for some reason they formed in my head as a set of Q&A so I’ve jotted them down in this form… at the moment this is really to help me with my thinking but I thought I’d share in case.

N.B. These are just some thoughts – not comprehensive and not the official views of the Discovery initiative

Q1: Why apply an explicit license to your metadata?
A1.1: To enable appropriate re-use

Q2: What prevents appropriate re-use?
A2.1: Uncertainty/lack of clarity about what can be done with the data
A2.2: Any barriers that add an overhead – could be technical or legal

Q3: What sort of barriers add overhead?
A3.1: Attribution licensing – where data from many sources are being mixed together this overhead can be considerable.
A3.2: Machine readable licensing data to be provided with data – adds complexity to data processing, potentially increases network traffic and slows applications
A3.3: Licensing requiring human intervention to determine rights for reuse at a data level – this type of activity effectively stops applications being built on the data as it isn’t possible for software to decide if a piece of data can be included or not (NB human intervention for whole datasets is less of an issue – building an app on a dataset where all data is covered by the same license which has been interpreted by a human in advance of writing software is not an issue)
A3.4: Licensing which is not clear about what type of reuse is allowed. The NC (Non-commercial) licenses exemplify this as the definition of what amounts to ‘commercial use’ is often unclear.
A3.5: Licensing not generally familiar to the potential consumers of the data (for re-use purposes) – e.g. writing a new license specific to your requirements rather than adopting a Creative Commons or other more widely used licence.

Q4: What does this suggest in terms of data licensing decisions?
A4.1: Putting data in public domain removes all doubt – it can be reused freely – a consumer doesn’t have to check anything etc.
A4.2: Putting data in public domain removes the overhead of attribution – where data from many sources are being mixed together this overhead can be considerable
A4.3: Where there is licensing beyond public domain, reuse will be encouraged if it is easy to establish (preferably in an automated way) what licensing is associated with any particular data
A4.4: Where data within a single set is available under different licensing, reuse will be encouraged by making it easy to address only data with a specified license attached. E.g. ‘only give me data that is CC0 or ODC-PDDL’ (a rough sketch of this kind of filter follows below)
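As a hedged illustration of A4.4, the sketch below keeps only named graphs carrying a permissive licence statement – it assumes (purely for illustration) that each named graph describes its own licence with dcterms:license, and the input file name is made up:

```python
# Filter a multi-licence dataset down to the parts that can be reused freely.
from rdflib import ConjunctiveGraph, Namespace, URIRef

DCT = Namespace("http://purl.org/dc/terms/")
OPEN_LICENCES = {
    URIRef("http://creativecommons.org/publicdomain/zero/1.0/"),
    URIRef("http://opendatacommons.org/licenses/pddl/"),
}

dataset = ConjunctiveGraph()
dataset.parse("mixed_licences.trig", format="trig")   # hypothetical input file

for graph in dataset.contexts():
    licences = set(graph.objects(graph.identifier, DCT.license))
    if licences & OPEN_LICENCES:
        print("reusable:", graph.identifier)
```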

Comments/questions welcome…

Open Culture 2011 – Google Art Project

Laura Scott – Head of External Relations from Google UK

Art Project (http://www.googleartproject.com/) – v ambitious for Google – making art and information relating to art more accessible to people everywhere in the world. Google is new to ‘arts’ – but has commitment.

Google 20% time – if you have a great idea, you can spend up to 20% of your time on it – e.g. Google Mail was a big success – but failure is also embraced. Failure does cost – Google may well be able to take the financial risk, but when things fail the media will pick up on this…

Encourage experimentation – small things as many previous speakers have mentioned.

Partnerships are crucial for Google – Art project had partnerships with 9(?) galleries across the world.

Google is not a curator – doesn’t want to be – wanted to make art immediately accessible – possible to ‘jump right in’ – but in no sense is it meant to replace a physical visit – and evidence so far suggests this is not the case at all. Each museum chose one image to do in very high resolution – can zoom in to very high levels – example of ‘The Harvesters’ from the Metropolitan Museum of Art.

Example of ‘No Woman, No Cry’ from Tate Britain – again v high levels of zoom – but also ‘view in darkness’ option – to reveal message about Stephen Lawrence – a key part of artwork.

Also ability to integrate other media – e.g. films of people commenting on works

This is the first version – Google is very aware it is not perfect, and there are things to improve. Want to increase the geographical spread of museums included. Have had over 10 million visitors and 90k people creating ‘my collections’ – shows the importance of making things social and being able to share – and they have realised they need to develop that feature further.

Art Project took 18 months to get up and running – felt like a long time – but it takes time, and it is a long-term project for Google – next phase over the next couple of years.