ResourceSync: Web-based Resource Synchronization

Final paper in the ‘Repository Services’ session at OR2012 is presented by Simeon Warner. This is the paper I really wanted to see this morning as I’ve seen various snippets on twitter about it (via @azaroth42 and @hvdsomp). Simeon says so far it’s been lots of talking rather than doing 🙂

A lot of the stuff in this post is also available on the ResourceSync specification page http://resync.github.com/spec/

Synchronize what?

  • Web resources – things with a URI that can be dereferenced and are cache-able – not dependent on underlying OS or tech
  • Small websites to large repositories – needs to work at all scales
  • Need to deal with things that change slowly (weeks/months) or quickly (seconds) and where latency needs may vary
  • Focus on needs of research communication and cultural heritage orgs – but aim for generality

Why?

Because lots of projects and services are doing synchronization but have to roll their own on a case by case basis. Lots of examples where local copies of objects are needed to carry out work (I think CORE gives an example of this kind of application).

OAI-PMH is over 10 years old as a protocol, and was designed to do XML metadata – not files/objects (exactly the issue we’ve seen in CORE).

Rob Sanderson has done work on a range of use cases – including things like aggregation (multiple sources to one centre). Also ruled out some use cases – e.g. not going to deal with bi-directional synchronization at the moment. Some real-life use cases they’ve looked at in detail:

DBpedia live duplication – 20 million entries updated at about 1 per second – though sporadic. Low latency needed. This suggests it has to be a ‘push’ mechanism – can’t have lots of services polling every second for updates.

arXiv mirroring – 1 million article versions – about 800 created per day. Need metadata and full-text for each article. Accuracy important. Want low barrier for others to use.

Some terminology they have determined:

  • Resource – an object to be synchronized – a web resource
  • Source – system with the original or master resource
  • Destination – system to which resource from the source will be copied
  • Pull – process to get information from source to destination, initiated by destination
  • Push – process to get information from source to destination. Initiated by source
  • Metadata – information about Resources such as URI, modification time, checksum, etc. (Not to be confused with Resources that may themselves be metadata about another resource, e.g. a DC record)

Three basic needs:

  • Baseline synchronization – perform initial load or catchup between source and destination
  • Incremental synchronization – deal with updates/creates/deletes
  • Audit – is my current copy in sync with the source?

Need to use an ‘inventory’ approach to know what is there and what needs updating. So Audit uses the inventory to check what has changed between source and destination, then do incremental synchronization. Don’t necessarily need to use the full inventory – could use an inventory change set to know what has changed.

Once you’ve got agreement on the change set, need to get source and destination back in sync – whether by exchanging objects, or doing diffs and updates etc.
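To make the audit and incremental steps concrete, here is a minimal sketch of my own (not anything from the spec – inventories here are just Python dictionaries mapping URI to checksum) of how a destination could work out what to fetch, update and delete:

# Illustrative only: inventories as dicts of URI -> checksum (e.g. md5).
# In practice the inventories would come from sitemaps/change sets published by the source.

def audit(source, destination):
    """Return (to_create, to_update, to_delete) for the destination."""
    to_create = [uri for uri in source if uri not in destination]
    to_update = [uri for uri in source
                 if uri in destination and source[uri] != destination[uri]]
    to_delete = [uri for uri in destination if uri not in source]
    return to_create, to_update, to_delete

src = {"http://example.org/a": "aaa1", "http://example.org/b": "bbb2"}
dst = {"http://example.org/b": "bbb0", "http://example.org/c": "ccc3"}
print(audit(src, dst))
# (['http://example.org/a'], ['http://example.org/b'], ['http://example.org/c'])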

Decided that for simplicity needed to focus on ‘pull’ but with some ‘push’ mechanism available so sources can push changes when necessary.

What they’ve come up with is a framework based on Sitemaps – that Google uses to know what to crawl on a website. It’s a modular framework to allow selective deployment of different parts. For example – basic baseline sync looks like:

Level zero -> Publish a Sitemap

Periodic publication of a sitemap is the basic implementation. A sitemap contains at least a list of URLs – one for each resource. But you could add in more information – e.g. a checksum for each resource – which would enable better comparison.
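As an illustration (my own, using only standard sitemap protocol elements – exactly how ResourceSync layers extra information such as checksums on top is for the draft spec to say), a minimal baseline sitemap might look like:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.org/objects/1</loc>
    <lastmod>2012-07-10</lastmod>
  </url>
  <url>
    <loc>http://example.org/objects/2</loc>
    <lastmod>2012-07-11</lastmod>
  </url>
</urlset>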

Another use case – incremental sync. In this case use the sitemap format but include information only for change events. One <url> element per change event.

What about ‘push’ notification? They believe XMPP is the best bet – this is used for messaging services (like Google/Facebook chat systems). This allows rapid notification of change events. XMPP is a bit ‘heavyweight’ – but lots of libraries are already available for this, so not going to have to implement from scratch.

LANL Research Library ran a significant scale experiment in synchronization of the DBpedia Live database to two remote sites using XMPP to push changes. A couple of issues, but overall very successful.

Sitemaps have some limits on size (50,000 URLs per sitemap and 50,000 sitemaps per sitemap index – the 2.5 billion URLs I think Simeon mentioned) – but not hard to see how this could be extended if required.

Dumps: a dump format is necessary – to avoid repeated HTTP GET requests for multiple resources. Used for baseline and changeset. Options are:

  • Zip + sitemap – Zip very common, but would require custom mechanism to link
  • WARC – designed for this purpose but not widely implemented

Simeon guesses they will end up with hooks for both of these.
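Purely to illustrate the ‘Zip + sitemap’ option (my own sketch – the way the sitemap is linked to the packaged resources is exactly the custom mechanism that hasn’t been defined yet, and the file layout here is an assumption):

# Sketch only: bundle some resources plus a sitemap listing them into one ZIP.
import zipfile

resources = {"objects/1.pdf": b"%PDF-...", "objects/2.xml": b"<record/>"}

sitemap = ['<?xml version="1.0" encoding="UTF-8"?>',
           '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
for name in resources:
    sitemap.append("  <url><loc>http://example.org/%s</loc></url>" % name)
sitemap.append("</urlset>")

with zipfile.ZipFile("dump.zip", "w") as dump:
    dump.writestr("sitemap.xml", "\n".join(sitemap))   # the inventory
    for name, content in resources.items():
        dump.writestr(name, content)                   # the resources themselves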

Expecting a draft spec very very soon (July 2012). Code and other stuff already on GitHub https://github.com/resync/

Repository Services

I’m briefly at the Open Repositories 2012 conference in Edinburgh, and this morning in a session about ‘repository services’ – which sounds like a nice easy session to ease into the morning, but is actually diving into some pretty hard technical detail pretty quickly!

There are three papers in this session.

Built to scale?

Edwin Shin is describing using Hydra (a repository stack built on Fedora, Solr, Blacklight). I missed the start, but the presentation is about dealing with very large numbers of digital objects – from millions to hundreds of millions. It’s a pretty technical talk – optimisation of Solr through sharding, taking a ‘sharded’ approach to Fedora (in the ActiveFedora layer).

Perhaps the high level lesson to pull out is that you ought to look at how people use a system when planning quite technical aspects of the repository. For example – they reworked their disaster recovery strategy based on the knowledge that the vast majority of requests were for the current year – since a full system recovery takes days (or weeks?) they now deposit objects from the current year so that these can be restored first and quickly.

Similarly with Solr optimisation – having done a lot of generic optimisation they were still finding performance (query response times) far too slow on very large sets of documents. By analysing how the system was used they were able to perform some very specific optimisations (I think this was around increasing the filterCache settings) to achieve a significant reduction in query response times.
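For reference (not the exact settings from the talk, which I didn’t capture), the filterCache is configured in Solr’s solrconfig.xml, and this kind of tuning typically means raising its size and autowarm values – something like:

<filterCache class="solr.FastLRUCache"
             size="16384"
             initialSize="4096"
             autowarmCount="4096"/>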

Inter-repository Linking of Research Objects with Webtracks

This paper is being presented by Shirley Ying Crompton. Shirley describing how the research process leads to research data and outputs being stored in different places with no links between them. So decided to use RDF/linked data to add structured citation links between research objects (and people – e.g. creators).

However, different objects created in different systems – so how to make sure objects are linked as they are created? Looked at existing protocols for enabling links to be created:

  • Trackbacks – used for blogs/comments
  • Semantic Pingback – an RPC protocol to form semantic links between objects
  • Salmon – RSS protocol

Decided to take the ‘Webtracks’ approach – this is an inter-repository communication protocol. The Webtracks InteRCom protocol allows formation of links between objects in two different repositories. InteRCom is a two-stage protocol – the first stage is ‘harvest’ to get links, then the second stage ‘request’ a link between two objects.

InteRCom implementation has been done in Java, available as open source – available for download from http://sourceforge.net/projects/webtracks/.

Shirley says: Webtracks facilitates propagation of citation links to provide a linked web of data – it uses the emerging linked data environment and supports linking between diverse types of digital research objects. There are no constraints on link semantics or metadata. Importantly (for the project), it does not rely on a centralised service – it is peer-to-peer.

Webtracks has been funded by JISC and is a collaboration between the University of Southampton and the STFC – more information at http://www.jisc.ac.uk/whatwedo/programmes/mrd/clip/webtracks.aspx

ResourceSync: Web-based Resource Synchronization

This session is of particular interest to me, and I took more extensive notes – so I’ve put these into a separate post http://www.meanboyfriend.com/overdue_ideas/2012/07/resourcesync-web-based-resource-synchronization/

Boutique Catalogues

In my previous post on MARC and SolrMARC I described how SolrMARC could be used, as part of Blacklight, VuFind or other discovery layers, to create indexes from various parts of the MARC record. My question to those at Mashcat was “what would you do, or like to do, with these capabilities?” – I was especially interested in how you might be able to offer better search experiences for specific types of materials – what we talked about in the pub afterwards as ’boutique catalogues’.

During the Mashcat session I asked the audience what types of materials/collections it might be interesting to look at with this in mind. In the session itself we came up with three suggestions – so I thought I’d capture these here, and maybe try to start working out what the SolrMARC index configuration and (if necessary) related scripts might look like. I’d hoped to take advantage of having a lot of cataloguing knowledge in the room to drill into some of these ideas in detail, but in the end – entirely down to me – this didn’t happen. Looking back on it I should have suggested breaking into groups to do this so that people could have discussed the detail and brought it back together – next time …

Please add suggestions through the comments for other ’boutique catalogue’ ideas, additional ideas on what indexes might be useful, and what the search/display experience might be like:

(Born) Digital formats

Main suggestion – provide a ‘file format’ facet
If 007/00 = c, then it’s an electronic resource and there’s information to dig out of 007/01-13
Might be worth looking at 347$$a$$b (as well as other subfields)

Would it be possible to look up file formats on Pronom/Droid and add extra information either when indexing or in the display?
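As a (completely untested) starting point for the facet itself, following the index.properties patterns shown in the MARC and SolrMARC post – the facet and map names here are just mine:

file_format_facet = 347a:347b
electronic_facet = 007[0], (map.electronic)
map.electronic.c = Electronic resource

Digging character positions 01-13 out of the 007 (or enriching with PRONOM/DROID information) would probably need a custom script rather than a one-line definition.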

Rare books

Main suggestion – use ‘relators’ from the added entry fields – specifically 700$$e. For rare books these can often contain information about the previous owners of the item, which can be of real interest to researchers.
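Again purely as an untested sketch in index.properties terms (the facet name is mine), the relator terms themselves could be faceted with the built-in punctuation-stripping helper described in the MARC and SolrMARC post:

relator_facet = custom, removeTrailingPunct(700e)

Pulling out the associated names only when the relator is ‘former owner’ is exactly the kind of ‘conditional’ indexing that would need a custom bean shell script.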

I’d asked a question about indexing rare books previously on LibCatCode – see http://www.libcatcode.org/questions/42/indexing-rare-book-collections. I suspect it might be worth re-asking on the new Stack Exchange site for Libraries and Information Science

Music

Main suggestions – create faceted indexes on the following:

  • Key
  • Beats per minute
  • Time signature

I’m keen on the music idea – MARC isn’t great for cataloguing music in the first place, and much useful information isn’t exposed in generic library catalogue interfaces, so I had a quick look at where ‘key’ might be stored – it turns out to be in quite a few places – and I started putting together an expression of this that could be dropped into a SolrMARC index.properties file:
musicalkey_facet = custom, getLinkedFieldCombined(600r:240r:700r:710r:800r:810r:830r)
I’m not sure if you actually need to use the ‘getLinkedFieldCombined’ (probably not). I’m also not sure I’ve got all the places that the key can be recorded explicitly – $$r for musical key appears in lots of places.

What I definitely do know is that although 600$$r etc. can be used to record the musical key, it might appear as a text string in the 245, or possibly a notes (500, 505, etc.) field. Whether it is put explicitly in the 600$$r (or other $$r) will depend on local cataloguing practice. I’m guessing that it might be worth writing a custom indexing script that uses regular expressions to search for expressions of musical key in 245 and 5XX fields – although it would need to cover multiple languages (I’d say English, German and French at least).
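To give a flavour, a custom script along these lines might look something like the sketch below (untested; the field spec, the English-only regular expression and the function name are all my own assumptions, following the pattern of the format-full.bsh excerpt in the MARC and SolrMARC post):

import java.util.*;
import java.util.regex.*;
import org.marc4j.marc.Record;

// Sketch: look for an English statement of musical key in the 245 $a/$b text.
Set getKeyFromTitle(Record record)
{
    Set result = new LinkedHashSet();
    String title = indexer.getFirstFieldVal(record, null, "245ab");
    if (title != null)
    {
        Pattern p = Pattern.compile("in ([A-G](?: flat| sharp)? (?:major|minor))",
                                    Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(title);
        if (m.find()) result.add(m.group(1));
    }
    return result;
}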

I haven’t looked at Beats per minute or Time signature and where you might get that from. It seems obvious that also getting information on what instruments are involved etc. would be of interest.

In fact there has already been some work on a customised Blacklight interface for a Music library – mentioned at https://www.library.ns.ca/node/2851 – although I can’t find any further details right now (and I don’t have access to the Library Hi-Tech journal). If the details of this are published online anywhere I’d be very interested. Also the example of building an index of instruments is one of the examples in the SolrMARC wiki page on the index.properties file.

Perhaps a final word of caution on all this – you can only build indexes on this rich data if it exists in the MARC record to start with. The MARC record can hold much more information than is typically entered – some of the fields I mention in the examples above may not be commonly used – and either the information isn’t recorded at all, or you would have to write scripts to extract it from notes fields etc. The latter, though painful, might be possible; but in the former case, there is nothing you can do…

MARC and SolrMARC

At the recent Mashcat event I volunteered to do a session called ‘making the most of MARC’. What I wanted to do was demonstrate how some of the current ‘resource discovery’ products are based on technology that can really extract value from bibliographic data held in MARC format, and how this creates opportunities both for creating tools for users, and also for library staff.

One of the triggers for the session was seeing, over a period of time, a number of complaints about the limitations of ‘resource discovery’ solutions – I wanted to show that many of the perceived limitations were not about the software, but about the implementation. I also wanted to show that while some technical knowledge is needed, some of these solutions can be run on standard PCs and this puts the tools, and the ability to experiment and play with MARC records, in the grasp of any tech-savvy librarian or user.

Many of the current ‘resource discovery’ solutions available are based on a search technology called Solr – part of a project at the Apache Software Foundation. Solr provides a powerful set of indexing and search facilities, but what makes it especially interesting for libraries is that there has been some significant work already carried out to use Solr to index MARC data – by the SolrMARC project. SolrMARC delivers a set of pre-configured indexes, and the ability to extract data from MARC records (gracefully handling ‘bad’ MARC data – such as badly encoded characters etc. – as well). While Solr is powerful, it is SolrMARC that makes it easy to implement and exploit in a library context.

SolrMARC is used by two open source resource discovery products – VuFind and Blacklight. Although VuFind and Blacklight have differences, and are written in different languages (VuFind is PHP while Blacklight is Ruby), since they both use Solr and specifically SolrMARC to index MARC records the indexing and search capabilities underneath are essentially the same. What makes the difference between implementations is not the underlying technology but the configuration. The configuration allows you to define what data, from which part of the MARC records, goes into which index in Solr.

The key SolrMARC configuration file is index.properties. Simple configuration can be carried out in one line for example (and see the SolrMARC wiki page on index.properties for more examples and details):

title_t = 245a

This creates a searchable ‘title’ index from the contents of the 245 $$a field. If you wanted to draw information in from multiple parts of the MARC record, this can be done easily – for example:

title_t = 245ab:246a

Similarly you can extract characters from the MARC ‘fixed fields’:

language_facet = 008[35-37]:041a:041d

This creates a ‘faceted’ index (for browsing and filtering) for the language of the material based on the contents of 008 chars 35-37, as well as the 041 $$a and $$d.

As well as making it easy to take data from specific parts of the MARC record, SolrMARC also comes pre-packaged with some common tasks you might want to carry out on a field before adding to the index. The three most common are:

Removing trailing punctuation – e.g.
publisher_facet = custom, removeTrailingPunct(260b)

This does exactly what it says – removes any punctuation at the end of the field before adding to the index

Use data from ‘linked’ fields – e.g.
title_t = custom, getLinkedFieldCombined(245a)

This takes advantage of the ability in MARC to link MARC fields to alternative representations of the same text – e.g. for the same text in a different language.

Map codes/abbreviations to proper language – e.g.
format = 000[6-7], (map.format)

Because the ‘format’ in the MARC leader (represented here by ‘000’) is recorded as a code, when creating a search index it makes sense to translate this into more meaningful terms. The actual mapping of terms can either be done in the index.properties file, or in separate mapping files. The mapping for the example above looks like:

map.format.aa = Book
map.format.ab = Serial
map.format.am = Book
map.format.as = Serial
map.format.ta = Book
map.format.tm = Book
map.format = Unknown

These (and a few other) built-in functions make it easy to index the MARC record, but you may still find that they don’t cover exactly what you want to achieve. For example, they don’t allow for ‘conditional’ indexing (such as ‘only index the text in field XXX when the record is for a Serial’), or for extracting only specific text from a MARC subfield.

Happily, you can extend the indexing by writing your own scripts which add new functions. There are a couple of ways of doing this, but the easiest is to write ‘bean shell’ scripts (basically Java) which you can then call from the index.properties file. Obviously we are going beyond simple configuration and into programming at this point, but with a little knowledge you can start to work the data from the MARC record even harder.

Once you’ve written a script, you can use it from index.properties as follows:

format = script(format-full.bsh), getFormat

This uses the getFormat function from the format-full.bsh script. In this case I was experimenting with extracting not just basic ‘format’ information, but also more granular information on the type of content as described in the 008 field – but the meaning of the 008 field varies based on the type of material being catalogued, so you get code like:

f_000 = f_000.toUpperCase();
if (f_000.startsWith("C"))
{
    result.add("MusicalScore");
    String formatCode = indexer.getFirstFieldVal(record, null, "008[18-19]").toUpperCase();
    if (formatCode.equals("BT")) result.add("Ballet");
    if (formatCode.equals("CC")) result.add("ChristianChants");
    if (formatCode.equals("CN")) result.add("CanonsOrRounds");
    if (formatCode.equals("DF")) result.add("Dances");
    if (formatCode.equals("FM")) result.add("FolkMusic");
    if (formatCode.equals("HY")) result.add("Hymns");
    if (formatCode.equals("MD")) result.add("Madrigals");
    if (formatCode.equals("MO")) result.add("Motets");
    if (formatCode.equals("MS")) result.add("Masses");
    if (formatCode.equals("OP")) result.add("Opera");
    if (formatCode.equals("PT")) result.add("PartSongs");
    if (formatCode.equals("SG")) result.add("Songs");
    if (formatCode.equals("SN")) result.add("Sonatas");
    if (formatCode.equals("ST")) result.add("StudiesAndExercises");
    if (formatCode.equals("SU")) result.add("Suites");
}
else if (f_000.startsWith("D"))

(I’ve done an example file for parsing out detailed format/genre information which you can get from https://github.com/ostephens/solrmarc-indexproperties/blob/master/index_scripts/format-full.bsh – but although more granular it still doesn’t exploit all possible granularity from the MARC fixed fields)

Once you’ve configured the indexing, you run this over a file of MARC records. The screenshot here shows a Blacklight installation with a faceted ‘format’ index which I created using a custom indexing script.


These tools excite me for a couple of reasons:

  1. A shared platform for MARC indexing, with a standard way of programming extensions, gives the opportunity to share techniques and scripts across platforms – if I write a clever set of bean shell scripts to calculate page counts from the 300 field (along the lines demonstrated by Tom Meehan in another Mashcat session), you can use the same scripts with no effort in your SolrMARC installation
  2. The ability to run powerful, but easy to configure, search tools on standard computers. I can get Blacklight or VuFind running on a laptop (Windows, Mac or Linux) with very little effort, and I can have a few hundred thousand MARC records indexed using my own custom routines and searchable via an interface I have complete control over

While the second of these points may seem like it’s a pretty niche market – and of course it is – we are increasingly seeing librarians and scholars making use of this kind of solution, especially in the digital humanities space. These solutions are relatively cheap and easy to run. Indexing a few hundred thousand MARC records takes a little time, but we are talking tens of minutes, not hours – you can try stuff, delete the index and try something else. You can focus on drawing out very specific values from the MARC record and even design specialist indexes and interfaces for specific kinds of material – this is not just within the grasp of library services, but the individual researcher.

In the pub after the main Mashcat event had finished, we were chatting about the possibilities offered by Blacklight/VuFind and SolrMARC. I used a phrase I know I borrowed from someone else, but I don’t know who – ’boutique search’ – highly customised search interfaces that serve a specific audience or collection.

A final note – we have the software, what we need is data – unless more libraries follow the lead of Harvard, Cambridge and others and make MARC records available to use, any software which consumes MARC records is of limited use …

Mendeley and APIs

Now Ian Mulvany talking about Mendeley and how they use APIs – both publishing and consuming.

Try to expose all the metadata being added by users via an API – a “social catalogue”. This enables ‘discovery’, but not ‘delivery’ – this is where Mendeley can make use of external APIs – such as the WorldCat API.

Mendeley invest in APIs because

  • It helps them extend their product, by integrating data/functionality from other places
  • It enables others to extend their product – they don’t have time to build everything that users are asking for. E.g. the Android client was built by users, as the company didn’t have the resource

Mendeley uses the WorldCat Registry to find/suggest an appropriate OpenURL resolver depending on the user’s location – as most users won’t know what an OpenURL resolver is, or what the details are.

Mendeley uses OAuth – which means they can integrate institutional repositories with a user’s own publications in Mendeley – going to be live soon (working with JISC, Symplectic and the University of Cambridge on this – http://jisc-dura.blogspot.com/). Learnt a lot about consuming their own APIs in this project – and uncovered bugs…

“We should have built the API first, and the product second” – the fact they didn’t is now creating work. Now they are creating a new application for libraries, and building the API first. Ian firmly believes this is a better approach.

Ian’s top 10 tips for API provision:

  • first API, then app
  • use your own APIs (and he believes Mendeley should do this more)
  • make an (API) interface you would use yourself
  • provide lots of example docs – coders like to do stuff quickly – if they can get something working from an example quickly, they’ll then invest
  • version your API – backwards compatible
  • put rate limits in place
  • work with a 3rd party to provide keys
  • have clear licensing
  • engage with your community
  • promote, promote, promote, promote, promote

In terms of consuming:

  • know what you want to do
  • define the value – this may be service delivery, or could be development of skills for developers etc.
  • measure the value – otherwise difficult to prioritise future developments
  • understand the SLA
  • if it’s important – have a backup plan – dependence on 3rd party is a risk which you should manage
  • don’t wait on an API for page loads – they found that the Mendeley homepage was waiting for a response from an API that was down, and so the page didn’t load…
  • get on the mailing list/dev group
  • look for good example code
  • don’t be afraid to pay – if it’s important, it’s worth paying for
  • use Mendeley’s APIs 😉


Citavi and APIs

I’m at the OCLC EMEARC meeting today, talking and hearing about APIs. Having done my bit at the start, now trying to relax into the other presentations before questions and general discussion at the end.

Now Antonio Tejada and Hans-Siem Schweiger are talking about Citavi which combines Reference Management and Knowledge Organisation. Citavi is designed to help with searching, retrieving results, acquiring materials – all of these require interaction with library sources. Citavi supports adding data manually, from file upload, via browser extensions and via APIs.

Manual entry is error prone and time consuming

File upload – uses standard formats (RIS/BibTeX); supported by a wide range of catalogues and databases; but still time consuming

Browser Extension – e.g. looks for embedded metadata in the page (e.g. COinS) or finds standard identifiers in the page (e.g. ISBN) and imports data

APIs – eliminate the browser – you don’t need to go to lots of different sources on the web. Fastest mechanism. Direct. Integrates into the workflow much better. However the cost of implementation can vary quite a bit – it all depends on the API – some are very fast (e.g. z39.50 support can be added in minutes now), but custom APIs can be more difficult.

Citavi Features which use APIs:

  • Online search – integrated into the Citavi application
  • Retrieve by identifier (e.g. DOI, ISBN, PubMed ID)
  • Import formatted bibliography – can take a bibliography from a word file and Citavi will run a search for each item in the bibliography
  • Find Library Locations
  • Find Full Text
  • Check availability with OpenURL (seems like this actually just pushes user to their local resolver?)
Citavi supports a proxy service for some resources, when needed. E.g. for WorldCat API where an API key is required.

Universities can get site license for Citavi – allows library to create a special settings file with authentication details for databases (that are not IP authenticated)

Challenges for Citavi using APIs:

  • Administrative challenges
    • Some libraries don’t want to be accessible (at least via a desktop application)
    • Catalogues that charge by the record for metadata
    • Inconsistent communication – e.g. change of settings on library system, don’t inform Citavi
  • Technical challenges
    • Custom catalog software – missing or inconsistent standards support and inconsistent field mapping
    • Legacy data – not as well-structured; inconsistent data entry

Going forward:

  • Geographic search (WorldCat Search API)
  • Enhanced availability search (WorldCat Registry and OpenURL Gateway)
  • Acquisitions management (WMS Acquisitions) – Citavi didn’t anticipate this, but some libraries using Citavi to manage acquisitions processes
  • Metadata – looking at authority control (WorldCat Identities); Alternate editions (xISBN); ISSN lookup (xISSN)

CETIS Conference: Learning Registry show and tell

First up Scott Wilson (@scottbw) describing potential use of learning registry to bring together ‘paradata’ (activity/usage data) for ‘widgets’ (or apps) across different widget (app) stores – the idea that you could have the same app in different stores, and want to aggregate the reviews or ratings from each store. Put in bid to JISC under rapid innovation call for a project ‘SPAWS’…

Terry McAndrew (from JISC TechDis) – want to network experience with resources – identify accessible practice/purposes. Terry says most ‘OER Problems’ are social not technical. Asks – can we find learning registry output via Google?

Walt Grata showing tools that he has built on top of Learning Registry … (on github):

  • ‘Landing pages’ for content – that can be indexed via Google (think this is new, and not up at github yet)
  • Harvesting tool – to grab stuff from a node and put it into another storage mechanism – e.g. CouchDB, PostgreSQL, etc.

Pat Lockley – slides

  • Chrome plugin, code on github
    • No-one will search outside Google – so take learning registry to Google. Chrome plugin finds all links on the page, and checks each one on the learning registry – and looks for some common attributes – like ‘title’ or ‘description’ etc. – and can then manipulate browser display to make use of this data.
  • WordPress Widget  – code on github
    • plugin for WordPress to display content from a learning registry (node) in a wordpress blog

Steven Cook – used Cake (PHP framework) to extract and ‘slice’ data from a learning registry node. Also pulling data from other sources – like Topsy. Code on github. Talking about how you can’t expect the Learning Registry to do the hard work here – have to expect to pull out data, cache it, etc. Notes the Learning Registry API isn’t completely RESTful (? not sure what the issues are).

CETIS Conference: Capturing Conversations About Learning Resources

This session is really why I’ve come to the CETIS Conference (apart from the general opportunity to meet and chat to people which is also great) – it’s about “The Learning Registry” (@learningreg and http://learningregistry.org). The Learning Registry is not a destination – it’s about building infrastructure – and in some ways has both parallels and relevance to the work the Discovery programme is undertaking (which I’m involved in).

A simple use case for the learning registry is:

  • NASA publishes a physics video
    • PBS posts a link to the video
    • NSDL posts a link to the video
    • A school uses the video in a course in their Moodle VLE

Each place/portal where the link is used or published only knows about the use of their link or copy of the resource. So the Learning Registry aims to support a way of sharing this type of activity ‘in the open’ – so that this can be captured and reflected – the ‘social metadata timeline’ – the Learning Registry is to provide infrastructure to support this. The Learning Registry describes this type of activity/usage data associated with a resource as ‘paradata’ – although the Learning Registry doesn’t care what type of data it stores (as long as it can be expressed as JSON).
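To give a rough idea of what ‘paradata’ might look like, here is a loose illustration in the actor/verb/object style that was discussed – this is not the actual Learning Registry paradata schema, and the field names are my own assumptions:

{
  "activity": {
    "actor": "teacher at example-school",
    "verb": "used",
    "object": "http://example.org/physics-video",
    "date": "2012-02-24",
    "content": "Used in a Moodle course for GCSE physics"
  }
}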

The learning registry is “an idea, a research project, an open source community project, a public social metadata distribution network”…

The guiding principles: be enabling, capability not solutions, no barriers to entry, no single point of failure – everything distributed…

Not going to try to blog the technical architecture, but a summary of the APIs:

  • Distribute API – uses http POST. About copying data from one node to another – i.e. achieving the distributed part of the architecture
  • Publish API – how you get stuff into a Learning Registry node (that is, you, the producer of information, publish it *to* the learning registry node) – uses http POST. Learning reg also supports SWORD for publishing data into a node
  • Obtain API – getting data out of a learning registry node – uses http GET (see the rough sketch after this list)
  • Harvest and OAI-PMH APIs – another way of getting stuff out of the node. Harvest returns JSON but supports OAI-PMH type actions. OAI-PMH also supported.
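As a rough sketch of what ‘getting data out’ looks like in practice (the endpoint path and the shape of the response are my assumptions about the node API, so check the Learning Registry documentation rather than trusting this):

# Sketch only: fetch whatever an 'obtain' request returns from a node and inspect it.
# Node URL is the JLern alpha node mentioned later in these notes; the /obtain
# path is an assumption on my part.
import json
import urllib.request

node = "http://alpha.mimas.ac.uk"

with urllib.request.urlopen(node + "/obtain") as response:
    data = json.load(response)

print(list(data.keys()))  # see what the node actually gives back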

It is stressed that this is really a project at the start of its work – the way to engage and to find out how to do this stuff is to join the community – join the developer list etc. and raise issues, ask questions – this is part of the experiment and will inform the development.

JLern

JLern is the project to setup an experimental node in the UK – being run by Mimas.

2 kinds of nodes in the Learning Registry:

  • Common node
  • Gateway node

JLern have setup a ‘common node’ – this can support:

  • Publish services
  • Access services
  • Distribution services (JLern now have a 2nd common node up and running to try these)
  • ….

Common nodes can be part of ‘networks’. Networks can (only) be connected via ‘gateway nodes’

When networks are connected, this is called a ‘community’. A ‘network community’ is a collection of interconnected resource distribution networks. A resource network can only be a member of one community.

Now have published the Jorum metadata (via OAI-PMH) – so about 15k resources. The Open University is now looking at similar activity, and Jorum is exploring a framework for capturing paradata about resources.

Gathering ideas and use cases now – e.g. see JLern challenge from dev8D http://dev8d.org/challenges/

The JLern ‘Alpha’ node is at alpha.mimas.ac.uk – you can authenticate using details given in this blog post http://jlernexperiment.wordpress.com/2012/02/02/alpha-node/.

As already mentioned they’ve harvested JORUM OAI-PMH data and published on JLern alpha node.

They now have a ‘Beta’ node (this doesn’t represent a level of development – just a naming convention I think) – this is running on Windows (Alpha is on Linux). Also planning a ‘Gamma’ node running on Amazon EC2.

JLern hackday held in January – write-up at http://jlernexperiment.wordpress.com/2012/02/21/the-hackday-report-and-reflections/, and also a Java library for interacting with Learning Registry nodes at https://github.com/navnorth/LRJavaLib.


CETIS Conference: Bring on the metaverse

Ian Hughes (@epredator) from Feeding Edge Ltd. Ian describes himself as a ‘metaverse evangelist’. He ended up presenting a section on ‘Cool Stuff’ on CITV – and found he was talking to children about exactly the same stuff he’d been talking to the commercial sector/large corporations about. He tries to include mentions of open source, and to show that children can get involved and affect stuff … in the way he did when he was young.

A lot of the stuff Ian talks about comes back to games – perhaps because they are about playing and about building – he went into computing because he wanted to build games. Not just about writing games – but using toolkits to mod characters, game play etc.

Also an interest in animation (involved in a BCS group about animation) – and this is about art and technical skill – you need to bring together people with these different skill sets, and each needs to understand what the other has to offer. Things like http://unity3d.com and http://opensimulator.org allow you to write stuff yourself. Tools like http://evolver.com give you easy ways into building characters etc. Also platforms are available – http://smartfoxserver.com – used to deliver Club Penguin – and if you have <100 people connecting it is free to run.

Forza – racing game/driving sim – that allows you to mod the car – including paintwork etc. – and then when you race against someone they see your car. Can include things like logos etc…

Ian now demonstrating how he can run OpenSim on his laptop – his customised avatar wears a digital version of the leather jacket he is currently wearing – identity and links between virtual and real. He can create virtual objects immediately – and all viewers of the shared space see them immediately – you’ve distributed just by creating it. Ian talks about how he finds virtual objects useful cues for talking – using ‘3 dimensional’ cues for what he is going to say (reminds me of ‘palace of memories’ type stuff).

Ian says, a shared ‘space’ when presenting gives different effects and works well for some people – you can share presentations in the space, and also place discussion in the space.

Now moving onto Minecraft – much more game based, but lots of similarity with virtual worlds. You can run a Minecraft server yourself, or on the web. Starting to see some use of this in schools – mentions “Minecraft Teacher” http://minecraftteacher.net/. Ian describing how his children used Minecraft together for the first time – collaboration, exploration, building etc. Minecraft also allows building of mechanical devices – using things like trip switches, trains, etc.

Ian mentions Arduino and 3D printing as things he’s got onto the Cool Stuff program. Ian is especially enthusiastic about 3D printing – highlighting possibilities of moving between environments like Skylanders – you could print out your own figures, with RFID chips in them…

Finally Ian closes by talking about how engaging children is about presenting this stuff in fun/interesting ways – but perhaps also about trusting that children will be interested and will learn, if you make it interesting.


Experimenting with British Museum data

[UPDATE 2014-11-20: The British Museum data model has changed quite a bit since I wrote this. While there is still useful stuff in this post, the detail of any particular query or comment may well now be outdated. I’ve added some updates in square brackets in some cases]

In September 2011 the British Museum started publishing descriptions of items in its collections as RDF (the data structure that underlies Linked Data). The data is available from http://collection.britishmuseum.org/ where the Museum have made a ‘SPARQL Endpoint’ available. SPARQL is a query language for extracting data from RDF stores – it can be seen as a parallel to SQL, which is a query language for extracting data from traditional relational databases.

Although I knew what SPARQL was, and what it looked like, I really hadn’t got to grips with it, and since I’d just recently purchased “Learning SPARQL” it seemed like a good opportunity to get familiar with the British Museum data and SPARQL syntax. So I had a play (more below). Skip forward a few months, and I noticed some tweets from a JISC meeting about the Pelagios project (which is interested in the creation of linked (geo)data to describe ‘ancient places’), and in particular from Mia Ridge and Alex Dutton, which indicated they were experimenting with the British Museum data. Their experience seemed to gel with mine, and prompted me to finally get on with a blog post documenting it so hopefully others can benefit.

Perhaps one reason I’ve been a bit reluctant to blog this is that I struggled with the data, and I don’t want this post to come across as overly critical of the British Museum. The fact they have their data out there at all is amazing – and I hope other museums (and archives and libraries) follow the lead of the British Museum in releasing data onto the web. So I hope that all comments/criticisms below come across as offering suggestions for improving the Museum data on the web (and offering pointers to others doing similar projects), and of course the opportunity for some dialogue about the issues. There is also no doubt that some of the issues I encountered were down to my own ignorance/stupidity – so feel free to point out obvious errors.

When you arrive at the British Museum SPARQL endpoint the nice thing is there is a pre-populated query that you can run immediately. It just retrieves 10 results, of any type, from the data – but it means you aren’t staring at a blank form, and those ten results give a starting point for exploring the data set. Most URIs in the resulting data are clickable, and give you a nice way of finding what data is in the store, and to start to get a feel for how it is structured.

For example, running the default search now brings back the triple:

Subject: http://collection.britishmuseum.org/id/object/EAF119772
Predicate: http://collection.britishmuseum.org/id/crm/P3F.has_note
Object: Object type :: marriage equipment ::


Which is intriguing enough to make you want to know more (I am married, and have to admit I don’t remember any special equipment). Clicking on the URI http://collection.britishmuseum.org/id/object/EAF119772 in a browser takes you to an HTML representation of the resource – a list of all the triples that make statements about the item in the British Museum identified by that URI.

While I think it would be an exaggeration to say this is ‘easily readable’, sometimes, as with the triple above, there is enough information to guess the basics of what is being said – for example:

Subject: http://collection.britishmuseum.org/id/object/EAF119772
Predicate: http://collection.britishmuseum.org/id/crm/P3F.has_note
Object: Acquisition date :: 1994 ::


From this it is perhaps easy enough to see that there is some item (identified by the URI http://collection.britishmuseum.org/id/object/EAF119772) which has a note related to it stating that it was acquired (presumably by the museum) in 1994.

So far, so good. I’d got an idea of the kind of information that might be in the database. So the next question I had was “what kind of queries could I throw at the data that might produce some interesting/useful results?” Since I’d recently been playing around with data about composers I thought it might be interesting to see if the British Museum had any objects that were related to a well-known composer – say Mozart.

This is where I started to hit problems… In my initial explorations, while some information was obvious, I’d also realised that the data was modelled using something called CIDOC CRM, which is intended to model ‘cultural heritage’ data. With some help from Twitter (including staff at the British Museum) I started to read up on CIDOC CRM – and struggled! Even now I’m not sure I’d say I feel completely on top of it, but I now have a bit of a better understanding. Much of the CIDOC model is based around ‘events’ – things that happened at a certain time/in a certain place. This means that often what might seem like a simple piece of information – such as where an item in the museum originates from – becomes complex.

To give a simple example, the ‘discovery’ of an item is a kind of event. So to find all the items in the British Museum ‘discovered’ in Greenwich you have to first find all the ‘discovery’ events that ‘took place at’ Greenwich, then link these discovery events back to the items they are related to:

An item -> was discovered by a discovery event -> which took place at Greenwich

This adds extra complexity to what might seem initially (naively?) a simple query. This example was inspired by discussion at the Pelagios event mentioned earlier – the full query is:

SELECT ?greenwichitem WHERE
{
	?s <http://collection.britishmuseum.org/id/crm/P7F.took_place_at> <http://collection.britishmuseum.org/id/thesauri/x34215> .
	?subitem <http://collection.britishmuseum.org/id/crm/bm-extensions/PX.was_discovered_by> ?s .
	?greenwichitem <http://collection.britishmuseum.org/id/crm/P46F.is_composed_of> ?subitem
}

and the results can be seen at http://bit.ly/vojTWq.

[UPDATE 2014-11-20: This query no longer works. The query is now simpler:

PREFIX ecrm: <http://erlangen-crm.org/current/>
SELECT ?greenwichitem WHERE 
{ 
 ?find ecrm:P7_took_place_at <http://collection.britishmuseum.org/id/place/x34215> .
 ?greenwichitem ecrm:P12i_was_present_at ?find
}

END UPDATE]

To make things even more complex, the British Museum data seems to describe all items as made up of (what I’m calling) ‘sub-items’. In some cases this makes sense. If a single item is actually made up of several pieces, each with its own properties and provenance, it clearly makes sense to describe each part separately.

However, the British Museum data describes even single items as made up of ‘pieces’ – just that the single item consists of a single piece – and it is then that piece that has many of the properties of the item associated with it. To illustrate, a multi-piece item is like:

Which makes sense to me. But a single piece item is like:


Which I found (and continue to find) confusing. This isn’t helped in my view by the fact that some properties are attached to the ‘parent’ object, and some to the ‘child’ object, and I can’t really work out the logic associated with this. For example it is the ‘parent’ object that belongs to a department in the British Museum, while it is the ‘child’ object that is made of a specific material. Both the parent and child in this situation are classified as physical objects, and this feels wrong to me.

Thankfully a link from the Pelagios meeting alerted me to some more detailed documentation around the British Museum data (http://www.researchspace.org/Stage-2-Outputs), and this suggests that the British Museum are going to move away from this model:

Firstly, after much debate we have concluded that preserving the existing modelling relationship as described earlier whereby each object always consists of at least one part is largely nonsense and should not be preserved.

While arguments were put forward earlier for retaining this minimum one part per object scheme, it has now been decided that only objects which are genuinely composed of multiple parts will be shown as having parts.

The same document notes that the current modelling “may be slightly counter-intuitive” – I can back up this view!

So – back to finding stuff related to Mozart… apart from struggling with the data model, the other issue I encountered was that it was difficult to approach the dataset through anything except a URI for an entity. That is to say, if you knew the URI for ‘Wolfgang Amadeus Mozart’ in the museum data set, the query would be easy, but if you only know a name, then it is much more difficult. How could I find the URI for Mozart, to then find all related objects?

Just using SPARQL, there are two approaches that might work. If you know the exact (and I mean exact) form of the name in the data, you can query for a ‘literal’ – i.e. do a SPARQL query for a textual string such as “Mozart, Wolfgang Amadeus”. If this is the exact form used in the data, the query will be successful, but if you get this slightly wrong then you’ll fail to get any result. A working example for the British Museum data is:

SELECT * WHERE 
{ 
	?s ?p "Mozart, Wolfgang Amadeus"
}

The second approach you can use is to do a more general query and ‘filter’ the result using a regular expression. Regular expressions are ways of looking for patterns in text strings, and are incredibly powerful (supporting things like wildcards, ignoring case etc. etc.). So you can be a lot less precise than searching for an exact string, and for example, you might try to retrieve all the statements about ‘people’ and filter for those containing the (case insensitive) word ‘mozart’. While this would get you Leopold Mozart as well as Wolfgang Amadeus if both are present in the data, there are probably a small enough number of mozarts that you would be able to pick out WA Mozart by eye, and get the relevant URI which identifies him.

A possible query of this type is:

SELECT * WHERE 
{ 
	?s <http://xmlns.com/foaf/0.1/Name> ?o 
	FILTER regex(?o, "mozart", "i") 
}

Unfortunately this latter type of ‘filter’ query is pretty inefficient, and the British Museum SPARQL endpoint has some restrictions which mean that if you try to retrieve more than a relatively small amount of data at one time you just get an error. Since this is essentially how ‘filter’ queries work (retrieve a largish amount of data first, then filter out the stuff you don’t want), I couldn’t get this working. The issue of only being able to retrieve small sets of data was a bit of a frustration overall with the SPARQL endpoint, not helped by the fact that it seemed relatively arbitrary what ‘size’ of result set caused an error – I assume it is something about the overall amount of data retrieved, as it seemed unrelated to the actual number of results – for example using:

SELECT * WHERE
{
	?s ?p ?o
}

You can retrieve only 123 results before you get an error, while using

SELECT ?s WHERE
{
	?s ?p ?o
}

You can retrieve over 300 results without getting an error.

This limitation is an issue in itself (and the British Museum are by no means alone in having performance issues with an RDF triple store), but it is doubly frustrating that the limit is unclear.

The difficulty of exploring the British Museum data from a simple textual string became a real frustration as I explored the data – it made me realise that while the Linked Data/RDF concept of using URIs and not literals is something I understand and agree with, as people all we know is the textual strings that describe things, so to make the data more immediately usable, supporting textual searches (e.g. via a Solr index over the literals in the data) might be a good idea.

I got so frustrated that I went looking for ways of compensating. The British Museum data makes extensive use of ‘thesauri’ – lists of terms for describing people, places, times, object types, etc. In theory these thesauri would give the text string entry points into the data, and I found that one of the relevant thesauri (object types) was available on the Collections Link website (http://www.collectionslink.org.uk/assets/thesaurus/Objintro.htm). Each term in this data corresponds to a URI in the British Museum data, and so I wrote a ScraperWiki script which would search for each term in the British Museum data and identify the relevant URI and record both the term and the URI. At the same time a conversation with @portableant on twitter alerted me to the fact that the ‘Portable Antiquities‘ site uses a (possibly modified) version of the same thesaurus for classifying objects, so I added in a lookup of the term on this site to start to form connections between the Portable Antiquities data and the British Museum data. This script is available at https://scraperwiki.com/scrapers/british_museum_object_thesaurus/, but comes with some caveats about the question of how up to date the thesaurus on the Collections Link website is, and the possible imperfections of the matching between the thesaurus and the British Museum data.

Unfortunately it seems that this ‘object type’ thesaurus is the only one made publicly available (or at least the only one I could find), while clearly the people and place thesauri would be really interesting, and provide valuable access points into the data. But really ideally these would be built from the British Museum data directly, rather than being separate lists.

So, finally back to Mozart. I discovered another way into the data – via the really excellent British Museum website, which offers the ability to search the collections via a nice web interface. This is a good search interface, and gives access to the collections – to be honest already solving problems such as the one I set myself here (of finding all objects related to Mozart) – but never mind that now! If you search this interface and find an object, when you view the record for the object, you’ll probably be at a URL something like:

http://www.britishmuseum.org/research/search_the_collection_database/search_object_details.aspx?objectid=3378094&partid=1&searchText=mozart&numpages=10&orig=%2fresearch%2fsearch_the_collection_database.aspx&currentPage=1

If you extract the “objectid” (in this case ‘3378094’) from this, you can use this to look up the RDF representation of the same object using a query like:

SELECT * WHERE
{
	?s <http://www.w3.org/2002/07/owl#sameAs> <http://collection.britishmuseum.org/id/codex/3378094>
}

This gives you the URI for the object, which you can then use to find other relevant URIs. So in this case I was able to extract the URI for Wolfgang Amadeus Mozart (http://collection.britishmuseum.org/id/person-institution/39629) and so create a query like:

SELECT ?item WHERE
{
	?s ?p <http://collection.britishmuseum.org/id/person-institution/39629> .
	?item <http://collection.britishmuseum.org/id/crm/P46F.is_composed_of> ?s
}

To find the 9 (as of today) items that are in some way related to Mozart (mostly pictures/engravings of Mozart).

The discussion at the Pelagios meeting identified several ‘anti-patterns’ related to the usability of Linked Data – and some of these jumped out at me as being issues when using the British Museum data:

Anti-patterns

  • homepages that don’t say where data can be found
  • not providing info on licences
  • not providing info on RDF syntaxes
  • not providing egs of query construction
  • not providing easy way to get at term lists
  • no html browsing
  • complex data models

The Pelagios wiki has some more information on ‘stumbling blocks’ at http://pelagios.pbworks.com/w/page/48544935/Stumbling%20Blocks, and the group exploring (amongst other things) the British Museum data made notes at http://pelagios.pbworks.com/w/page/48535503/UK%20Cultural%20Heritage. Also I know that Dominic Oldman from the British Museum was at the meeting, and was keen to get feedback on how they could improve the data or the way it is made available.

One thing I felt strongly when I was looking at the British Museum data is that it would have been great to be able to ‘go’ somewhere that others looking at/using the data would also be, to discuss the issues. The British Museum provide an email address for feedback (which I’ve used), but what I wanted to do was say things like “am I being stupid?” and “anyone else find this?” etc. As a result of discussion at the Pelagios meeting, and on twitter, Mia Ridge has set up a wiki page for just such a discussion.

A final thought. The potential of ‘linked data’ is to bring together data from multiple sources, and combine it to give something that is more than the sum of its parts. At the moment the British Museum data sits in isolation. How amazing would it be to join up the British Museum ‘people’ records such as http://collection.britishmuseum.org/id/person-institution/39629 with the VIAF (http://viaf.org/viaf/32197206/) or Library of Congress (http://id.loc.gov/authorities/names/n80022788) identifier for the same person, and start to produce searches and results that build on the best of all this data?
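Just to illustrate what that join might look like (a sketch of the kind of statement that could be published somewhere – I’m not suggesting this is how the British Museum would model it), in Turtle it could be as simple as:

@prefix owl: <http://www.w3.org/2002/07/owl#> .

<http://collection.britishmuseum.org/id/person-institution/39629>
    owl:sameAs <http://viaf.org/viaf/32197206> ,
               <http://id.loc.gov/authorities/names/n80022788> .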