This session was given by Phil John – Technical Lead for Prism (formerly Talis, now Capita). Prism is a ‘next generation’ discovery interface – but built on Linked Data through and through.
Slides available from http://www.slideshare.net/philjohn/linked-library-data-in-the-wild-8593328
Now moving to next phase of development – not going to be just about library catalogue data – but also journal metadata; archives/records (e.g. from the CALM archive system); thesis repositories; rare items and special collections (often not done well in traditional OPACs) … and more – e.g. community information systems.
When populating Prism from MARC21 they do an initial ‘bulk’ conversion, then periodic ‘delta’ files to keep in sync with the LMS. Borrower and availability data is pulled from the LMS “live” – via a suite of RESTful web services.
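The talk didn’t go into the shape of those services, but the pattern is roughly this – a minimal sketch in Python, assuming a purely hypothetical JSON availability endpoint (the real Prism/LMS service names aren’t public):

```python
# A sketch of a live availability lookup over REST. The endpoint and the
# JSON shape are hypothetical - the actual web services were not described.
import requests

LMS_BASE = "https://lms.example.org/api"  # hypothetical base URL

def get_availability(item_id):
    """Fetch live borrower/availability data for one item from the LMS."""
    resp = requests.get(f"{LMS_BASE}/items/{item_id}/availability", timeout=5)
    resp.raise_for_status()
    return resp.json()  # e.g. {"status": "on loan", "due": "2011-07-29"}
```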
Prism is also a Linked Data API… just add .rss to collection URIs, or .rdf/.nt/.ttl/.json to item URIs. This makes it simple to publish RSS feeds of preconfigured searches – e.g. new stock, or new stock in specific subjects etc.
Every HTML page in Prism has data behind it you can get as RDF.
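So the description behind any item page can be fetched as, say, Turtle just by changing the extension – a minimal sketch (the item path below is made up for illustration):

```python
# Fetch the RDF behind an HTML item page by swapping the extension on
# the URI; the item path here is illustrative, not a real record.
import requests

item = "http://prism.talis.com/bradford/items/123456"  # illustrative URI
turtle = requests.get(item + ".ttl", timeout=10).text  # same data as the page
print(turtle[:300])
```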
One of the biggest challenges: extracting data from MARC21 – MARC is very rich, but not very linked… Phil fills the screen with #marcmustdie tweets 🙂
But have to be realistic – 10s of millions of MARC21 records exist – so need to be able to deal with this.
Decided to tackle the problem in small chunks. Created a solution that allows you to build a model iteratively. It also compartmentalises code for different sections – these can communicate but work separately and can be developed separately – which makes it easy to tweak individual parts of the model.
Feel they have a robust solution that performs well – important because even if a single MARC record takes only 10 seconds to convert, across several million records that adds up to months.
No matter what MARC21 and AACR2 say – you will see variations in real data.
Have a conversion pipeline (a minimal sketch follows the list):
- Parser – reads in MARC21 and fires events as it encounters different parts of the record. It is very strict about syntax, so insists on valid MARC21.
- Observer – listens for MARC21 data structures and hands control over to…
- Handler – knows how to convert MARC21 structures and fields into Linked Data.
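In Python terms the pattern looks something like this – a sketch only, using pymarc for the strict parsing; the class names are illustrative, not Prism’s actual code:

```python
# A sketch of the parser/observer/handler pipeline described above.
from pymarc import MARCReader

class FormatHandler:
    """Handler: knows how to turn one MARC field into Linked Data."""
    def handle(self, field):
        print("would emit RDF for", field.tag, field.value())

class Observer:
    """Observer: listens for interesting fields, hands control to a handler."""
    def __init__(self, handlers):
        self.handlers = handlers  # e.g. {"007": FormatHandler()}

    def field_event(self, field):
        handler = self.handlers.get(field.tag)
        if handler:
            handler.handle(field)

def parse(path, observer):
    """Parser: reads MARC21 and fires an event per field encountered."""
    with open(path, "rb") as fh:
        for record in MARCReader(fh):
            if record is None:  # pymarc yields None for unreadable records
                continue
            for field in record.fields:
                observer.field_event(field)

parse("records.mrc", Observer({"007": FormatHandler()}))
```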
First area they tackled was Format (and duration) – a good starting point as it allows you to reason more fully about the record: once you know the format, you know what kind of data to expect.
In theory should be quite easy – MARC21 has lots of structured info about format – but in practice there are lots of issues:
- no code for CD (it’s a 12 cm sound disc that travels at 1.4 m/s!)
- DVD and LaserDisc shared a code for a while
- Libraries slow to support new formats
- limited use of 007 in the real world
E.g. places to look for format information (see the sketch after this list):
- 007
- 245$$h
- 300$$a (mixed in with other info)
- 538$$a
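A minimal sketch of cascading across those fields, assuming pymarc; the keyword heuristics are illustrative only, not the rules Prism actually uses:

```python
# Try the structured 007 first, then fall back to free-text fields.
def detect_format(record):
    # 007/00 is the category of material - when it is actually present
    for f007 in record.get_fields("007"):
        if f007.data and f007.data[0] == "s":
            return "sound recording"
        if f007.data and f007.data[0] == "v":
            return "video recording"
    # Fall back to the GMD in 245$h, then the free text of 300$a / 538$a
    for tag, code in (("245", "h"), ("300", "a"), ("538", "a")):
        for field in record.get_fields(tag):
            for text in field.get_subfields(code):
                if "videorecording" in text.lower() or "DVD" in text:
                    return "video recording"
                if "sound disc" in text.lower():
                    return "sound recording"
    return "book"  # default assumption when nothing else matches
```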
Decided to do the duration at the same time (sketch below):
- 306$$a
- 300$$a (but lots of variation in this field)
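306$$a is defined as six digits (hhmmss), while 300$$a needs a much looser fallback – a minimal sketch, with illustrative regexes:

```python
import re

def duration_seconds(f306a=None, f300a=None):
    # 306$a is hhmmss, e.g. "013230" = 1h 32m 30s
    if f306a:
        f306a = f306a.strip()
        if re.fullmatch(r"\d{6}", f306a):
            h, m, s = int(f306a[0:2]), int(f306a[2:4]), int(f306a[4:6])
            return h * 3600 + m * 60 + s
    # 300$a varies wildly: "1 sound disc (74 min.)", "(ca. 90 min)", etc.
    if f300a:
        match = re.search(r"(\d+)\s*min", f300a)
        if match:
            return int(match.group(1)) * 60
    return None

print(duration_seconds(f306a="013230"))                  # 5550
print(duration_seconds(f300a="1 sound disc (74 min.)"))  # 4440
```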
Now Phil is talking about ‘Title’ – very important, but of course quite tricky…
245 field in MARC may duplicate information from elsewhere
Got lots of help from http://journal.code4lib.org/articles/3832 (with additional work and modification)
Retained a ‘statement of responsibility’ – but mostly for search and display…
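The gist, as I understood it, is to lean on the ISBD punctuation in 245 – a minimal sketch of that general idea (not the article’s exact algorithm):

```python
# Split 245 $a/$b/$c into title proper, remainder of title and statement
# of responsibility by trimming the ISBD punctuation; real records deviate,
# which is why the published approach needed additional work.
def split_245(a, b=None, c=None):
    title = a.rstrip(" /:;")
    subtitle = b.rstrip(" /") if b else None
    responsibility = c.rstrip(" .") if c else None
    return title, subtitle, responsibility

print(split_245("Moby Dick :", "a novel /", "by Herman Melville."))
# ('Moby Dick', 'a novel', 'by Herman Melville')
```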
Identifiers…
Lots of non-identifier information mixed in with the identifiers – e.g. an ISBN followed by ‘pbk.’
Many variations in the abbreviations used – have to parse all this out, then validate the identifier.
Once you have an identifier, you start being able to link to other stuff – which is great.
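A minimal sketch of that parse-then-validate step for ISBN-10s pulled from 020$$a (the qualifier handling is illustrative):

```python
# Strip qualifiers like "(pbk.)" out of 020$a, then validate the ISBN-10
# check digit before minting any links from it.
import re

def extract_isbn10(f020a):
    m = re.search(r"\b(\d{9}[\dXx])\b", f020a.replace("-", ""))
    if not m:
        return None
    isbn = m.group(1).upper()
    # ISBN-10 checksum: sum of digit * weight (10..1) must be 0 mod 11
    total = sum((10 - i) * (10 if ch == "X" else int(ch))
                for i, ch in enumerate(isbn))
    return isbn if total % 11 == 0 else None

print(extract_isbn10("0330258648 (pbk.)"))  # 0330258648
print(extract_isbn10("0330258649 (pbk.)"))  # None - bad check digit
```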
Author – pseudonyms, variations in names, and generally no ‘relator terms’ in 100/700 $$e or $$4 – which would show the nature of the relationship between the person and the work (e.g. ‘author’, ‘illustrator’). Because these are missing, they have to parse the information out of 245$$c.
… and not just dealing with English records – especially in academic libraries.
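A minimal sketch of recovering relator-style roles from 245$$c; the phrase list is illustrative and, as just noted, real data would need far more than English cues:

```python
import re

ROLE_CUES = {"illustrated by": "illustrator",
             "translated by": "translator",
             "edited by": "editor"}

def roles_from_245c(f245c):
    """Pull (name, role) pairs out of a statement of responsibility."""
    roles = []
    for cue, role in ROLE_CUES.items():
        m = re.search(cue + r"\s+([^;/.]+)", f245c, re.IGNORECASE)
        if m:
            roles.append((m.group(1).strip(), role))
    return roles

print(roles_from_245c("by Roald Dahl ; illustrated by Quentin Blake."))
# [('Quentin Blake', 'illustrator')]
```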
Have licensed the Library of Congress authority files – which helps… Authority matching requirements were:
- Has to be fast – able to parse 2M records in hours, not days or months
- Has to be accurate
So – store the authorities as RDF but index them in Solr – gives speed, and for bulk conversions you don’t incur HTTP overhead to remote services… (a sketch follows)
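A minimal sketch of that “RDF for storage, Solr for matching” split, assuming pysolr and a local core called "authorities"; the field names and the authority URI are illustrative. The point is that bulk conversion matches against a fast local index rather than calling out to something like id.loc.gov per record:

```python
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/authorities", timeout=10)

# Index once, from the authority RDF: the URI plus all its label variants.
solr.add([{"id": "http://id.loc.gov/authorities/names/n00000000",
           "label": ["Dahl, Roald", "Dal', Roal'd"]}],
         commit=True)

def match_author(name):
    """Look up a name string and return the matching authority URI."""
    hits = solr.search('label:"{}"'.format(name), rows=1)
    return next(iter(hits), {}).get("id")

print(match_author("Dahl, Roald"))
```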
Language/alternate representation – this is a nice ‘high impact’ feature – allows switching between representations (e.g. original script and romanisation), and both forms can be searched for. Uses RDF language-tagged literals – so it’s also useful for people consuming the machine-readable RDF.
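A minimal sketch of language-tagged literals, which is the RDF feature referred to; the title pair and item URI are illustrative:

```python
from rdflib import Graph, Literal, URIRef, Namespace

DCT = Namespace("http://purl.org/dc/terms/")
g = Graph()
item = URIRef("http://example.org/items/1")  # hypothetical item URI
g.add((item, DCT.title, Literal("Sputnik Sweetheart", lang="en")))
g.add((item, DCT.title, Literal("スプートニクの恋人", lang="ja")))

# A client can pick whichever representation suits the reader
for title in g.objects(item, DCT.title):
    print(title.language, title)
```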
Using and Linking to external data sets…
This is part of the reason for using Linked Data – but there are some challenges…
- what if datasource suffers downtime
- worse – what if datasource removed permanently?
- trust
- can we display it? is it susceptible to vandalism?
Potential solutions (not there yet; a sketch of the caching idea follows the list):
- Harvest datasets and keep them close to the app
- if that’s not practical proxy requests using caching proxy – e.g. Squid
- if using wikipedia and worried about vandalism – put in checks for likely vandalism activity – e.g. many multiple edits in short time
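The talk suggests a caching proxy such as Squid; as an application-level illustration of the same idea, this sketch uses the requests-cache library to keep a local copy of fetched Linked Data so repeat lookups don’t hit the remote source:

```python
import requests
import requests_cache

# Cache responses locally for a week (expire_after is in seconds)
requests_cache.install_cache("external_lod", expire_after=7 * 24 * 3600)

resp = requests.get("http://dbpedia.org/data/Roald_Dahl.json", timeout=10)
print("served from cache:", resp.from_cache)
```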
Want to see:
- More library data as LOD – especially on the peripheries – authority data, author information, etc.
- LMS vendors adopting LOD
- LOD replacing MARC21 as the standard representation of bibliographic records!
Questions?
Q: Is the process (MARC->RDF) documented?
A: Would like to open source at least some of it… but there are discussions to have internally in Capita – so something to keep an eye on…
Q: Is there a running instance of Prism to play with?
A: Yes – e.g. http://prism.talis.com/bradford/
[UPDATE: See in the comments that Phil suggests http://catalogue.library.manchester.ac.uk/ as one that uses a more up-to-date version of the transform.]