Counting the cost

The context of this post and related work is that our local public library service (Warwickshire Library and Information Service) is facing a cut of approximately 27% of its budget over the next 3 years (£2 million from an overall budget of £7.4 million). For more information see Facing the Challenge on libraries on the Warwickshire County Council website.

One canard that comes up repeatedly in discussions around public libraries and their role is that ‘books are cheap’ and that this means there isn’t the need for public libraries that there once was. I thought an interesting experiment might be to work out what our family’s use of the library service would equate to, based on the cost of buying the books we have borrowed from the library. So I set about writing a script to achieve this.

The first stage was to retrieve details of a loan history – luckily the system used by Warwickshire libraries (Vubis) does have a loan history which seems to go back 18 months (although with an absolute minimum of information – more on this later). Unluckily there is no way of querying this via an API, only via a web browser. This is not so unusual, and there are good ways of scraping data out of HTML which I’ve used before. However, in this case the page I needed to get the data from was protected by a login – as it was a loan history linked to an individual library account, a barcode and PIN had to be entered into a form on a webpage before the loan history page could be accessed.

To help with this I used a Ruby library called ‘Mechanize’ (inspired by the Perl Mechanize library, which does the same sort of thing). This essentially allows you to simulate web browser behaviour within a script – it can fill in forms for you and deal with things like cookies which the web page might try to write to the browser. (Incidentally, if you are interested in doing things like automating checks for authenticated library resources online, Mechanize is worth a look.)
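To give a flavour of the approach, here is a minimal sketch of the login step. The URLs and form field names here are placeholders – the real Vubis pages would need to be inspected to find the actual ones:

```ruby
require 'mechanize'

agent = Mechanize.new

# Hypothetical login URL and form field names - illustrative only
login_page = agent.get('https://library.example.gov.uk/vubis/login')
form = login_page.forms.first          # assumes the login form is the first form on the page
form['barcode'] = 'MY_LIBRARY_CARD_NUMBER'
form['pin']     = 'MY_PIN'
form.submit

# Mechanize keeps hold of the session cookies, so later requests are authenticated
history_page = agent.get('https://library.example.gov.uk/vubis/loan_history')
```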

Once Mechanize has navigated through the login form and then to the loan history page, the content of the page can be parsed as usual – luckily Mechanize supports parsing page content with another Ruby library called Nokogiri, which is my preferred HTML/XML parser in Ruby anyway. So grabbing the details from the loan history page was fine once Mechanize had retrieved it.
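As a sketch of that parsing step (the table and cell selectors here are purely illustrative – the real markup of the loan history page would need to be checked first):

```ruby
# Mechanize pages expose Nokogiri's search methods directly
loans = history_page.search('table.loan-history tr').map do |row|
  cells = row.search('td')
  next if cells.empty?                 # skip any header rows
  { title: cells[0].text.strip, date: cells[1].text.strip }
end.compact
```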

Unfortunately the amount of information available in the loan history is minimal – just a title and a date. Since at a minimum I need a price, I need more information. My first thought was that I’d get prices from Amazon, but with only the title available I quickly realised I was going to get a lot of duplicate items – especially since my son (3 and a half) borrows books with titles like ‘Goldilocks and the three bears’ and ‘Shapes’ – not exactly unique.

So I needed to get more details – preferably an ISBN – before I started looking up prices. Any method of looking up books given only the title is going to suffer from the same problem of duplicate titles. The way to minimise the number of duplicates is to search for the titles from the loan history in the most limited pool of items possible – which in this case means searching the Warwickshire library catalogue and limiting the results to books in our local branch (Leamington Spa), as this is the only branch we use.

The software that Warwickshire use to provide their library catalogue interface on the web is called ‘Aquabrowser’ (as opposed to the system that stores administrative information like the loan history, which is called Vubis). Happily, unlike Vubis, Aquabrowser does have some APIs that can be used to get information out of the system in a machine-readable format. I have to admit that this functionality isn’t well advertised (neither by Warwickshire Library service nor by the vendor of the software), but luckily Chris Keene and Ed Chamberlain have both done some work documenting the Aquabrowser API (here and here) as it is used at their places of work (the University of Sussex and the University of Cambridge respectively).

Aquabrowser supports what is meant to be a sector-wide standard interface called SRU. The SRU interface for each Aquabrowser installation seems to live in a standard place – http://aquabrowser-url/aquabrowser-path/sru.ashx? (in this case librarycatalogue.warwickshire.gov.uk/abwarwick/sru.ashx?) – followed by the relevant SRU parameters (e.g. operation=searchRetrieve&version=1.1&query=dogs&maximumRecords=100&recordSchema=dc – see http://www.loc.gov/standards/sru/ for full documentation of SRU). However, there are several issues with the Aquabrowser implementation of SRU. I’m not clear which issues are Aquabrowser issues and which are down to local configuration decisions, but the result is a rather limited interface. In this case the main problems for me were that it didn’t allow me to limit a search by title, and it didn’t include the ISBN in the results.

As well as the ‘standard’ SRU interface, Aquabrowser also supports a proprietary interface which will provide search results as XML. This is pretty much undocumented as far as I can tell, but it seems you can take some of the URLs from the HTML interfaces and add the parameter ‘output=xml’ (I’m not 100% sure, but my guess is that the HTML interface is actually built over this XML API). A basic request looks like http://aquabrowser-url/aquabrowser-path/result.ashx?q=emily%20gravett&output=xml (in this case http://librarycatalogue.warwickshire.gov.uk/abwarwick/result.ashx?q=emily%20gravett&output=xml).

The default HTML interface uses frames, which obscures some of what is happening – however, if you use the ‘Accessible’ version, the syntax of the URLs can be seen easily. Using this method I was able to copy the syntax for limiting the results to stock available in a specific branch – in this case the parameter required is ‘branch=LEA’.
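Putting these together, a search against the XML interface limited to the Leamington Spa branch looks something like the sketch below (error handling and result paging are left out, and the parameter handling is only my best guess from poking at the URLs):

```ruby
require 'open-uri'
require 'nokogiri'
require 'cgi'

base  = 'http://librarycatalogue.warwickshire.gov.uk/abwarwick/result.ashx'
title = 'Goldilocks and the three bears'

# Build the query URL: search terms, branch limit and XML output
url = "#{base}?q=#{CGI.escape(title)}&branch=LEA&output=xml"
doc = Nokogiri::XML(open(url))
```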

The results returned by the XML interface in the Warwickshire instance include information lacking from the SRU interface – and the ISBN is one of the elements included. However, searching by title is still not possible as far as I can tell – although I came up with a workaround for this.

So, once I have the list of titles from the loan history, I use each title to run a search against the Aquabrowser XML interface, limiting the results to stock in the Leamington Spa branch. Since I can’t limit the search to titles, I instead check the contents of the <title> tag in each record contained in the XML response (note that in the XML response Aquabrowser wraps matches to your search terms in <exact> tags, so you have to ignore these when comparing search phrases). This technique also means I can do more exact matching (e.g. including case sensitivity) than a search would allow.
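A sketch of that matching step is below. The record and title element names follow the response structure described above (the selectors may need adjusting), and the <exact> markup disappears naturally because Nokogiri’s text method concatenates all the text inside a node:

```ruby
# Keep only records whose title (with any <exact> markup stripped) exactly
# matches the title we searched for, including case
matches = doc.search('record').select do |record|
  title_node = record.at('title')
  next false unless title_node
  # .text concatenates text inside and outside the <exact> tags, effectively
  # removing the markup before the comparison
  title_node.text.strip == title
end
```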

I decided that if I found duplicates after this filtering, I’d take each duplicate record and include it in my results for later manual tweaking.

So now I have a set of bibliographic records, each generally with a title, an author (or author statement) and an ISBN. An added bonus is that the response also includes a URL for a cover image for the book, so I grab that as well (it’s an Amazon cover image). What I don’t have is the price (the price may well be in the library record somewhere, but it isn’t included in the XML response). So I then use the ISBN to look up the price on Amazon – I tried a couple of Ruby libraries designed to work with the Amazon ‘Product Advertising’ API (which is the one you need for things like prices), but the only one I really liked was ‘Sucker’ – it is very lightweight, doesn’t hide the Amazon API that much, and again has Nokogiri somewhere underneath (along with some other stuff) – which means dealing with the results is relatively straightforward if you’ve used Nokogiri.

So, from the Amazon response I grab the price (checking it is in GBP first) and also a URL for the Amazon product page (so I can provide a link back to Amazon). (This isn’t perfect – despite the library catalogue searching above, I don’t get ISBNs for every item and so can’t search them all on Amazon.)

Now I have all the information I need, but I don’t have a way to present it (and I still have some unwanted duplicate entries in the list of books). For this I went back to a tactic I used previously for the (currently defunct) ‘What to Watch’ application – importing the results of a script into WordPress as a series of blog posts. In this case I created a CSV file, which allowed me to easily add in a few tags – such as names to indicate which account (mine, Damyanti’s or our son’s) each item had been borrowed on. The ‘text’ of each blog post was some simple HTML to display the cover image and a few links – including an option to add the item to a LibraryThing account in a single click. I also included some WordPress ‘custom fields’ to record the title, ISBN and price – to enable additional functions in the blog.
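A rough sketch of writing that CSV is below. The column names are illustrative – the real file has to match whatever the csv importer plugin expects for the post title, body, tags and custom fields – and I’m assuming each book has already been gathered into a hash of the details collected in the earlier steps:

```ruby
require 'csv'

# 'books' is assumed to be an array of hashes built earlier, e.g.
# { title: ..., isbn: ..., price: ..., cover_url: ..., amazon_url: ..., borrower: ... }
CSV.open('loans.csv', 'w') do |csv|
  csv << %w[csv_post_title csv_post_post csv_post_tags title isbn price]
  books.each do |book|
    post_html = %(<img src="#{book[:cover_url]}" alt="cover"/> ) +
                %(<a href="#{book[:amazon_url]}">View on Amazon</a>)
    csv << [book[:title], post_html, book[:borrower],
            book[:title], book[:isbn], book[:price]]
  end
end
```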

Using the ‘csv importer’ WordPress plugin (a great plugin) I imported the CSV file, and the content was now in the blog. This is the point at which we removed the duplicate book entries (which did require us to remember what we’d borrowed!). The final step was to actually do the calculation I originally wanted to do – the cost of the books we’d borrowed. This was achieved by writing a WordPress plugin to display a widget which uses SQL to get values from the ‘custom field’ storing the price information. Since I haven’t got prices for all books, I use those I have got, work out an average per book, and multiply this up by the total number of loans.
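The calculation itself lives in the WordPress widget, but the logic is simple enough to sketch in Ruby (again assuming the hypothetical ‘books’ array from above): average the prices we do have, then scale up by the total number of loans:

```ruby
# Estimate the total cost: books without a price still count towards the
# number of loans, but only priced books feed into the average
prices   = books.map { |book| book[:price] }.compact
average  = prices.reduce(:+) / prices.size
estimate = average * books.size

puts format('Estimated cost of %d loans: £%.2f', books.size, estimate)
```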

The final result is the ‘Overdue Books’ blog showing every book we have borrowed as a family since September 2009, at http://www.meanboyfriend.com/overdue_books/. The headline figure is that if we’d bought the items we’ve borrowed in this time, the cost would have been approximately £3,428.26 – or around £190 per month, or £44 a week. I think it is fair to say that even a relatively well-off family would find this steep!

Finally, the code is available at https://github.com/ostephens/Savelibs – there are all kinds of things I’m unhappy about and want to improve, including modifying it so it can regularly update the Overdue Books blog without any manual intervention from me. If anyone wants to take it and make something that any library user could use to do the same type of calculation, feel free (I’d like to do this myself, but I’m not sure where I’m going to find the time).

Addendum 08/03/2011

In an oversight last night I forgot to add my apologies and thanks to the IT people at Warwickshire council and library – firstly they chose a system for their library catalogue with an API (despite some shortcomings, this is a big step!), and secondly they didn’t shout at me when I repeatedly ran hundreds of searches against their catalogue in quick succession, but rather got in touch politely so we could work together to ensure the scripts weren’t going to cause a problem on the system. If you are thinking of hacking stuff around like this, it’s probably as well to get in touch with the system owners first!

What I do hope is that there might be some opportunity for developers like me to work with the library system and data – it would be really good if we could see applications built on top of the library systems APIs (such as this one which won the first ‘Hack Warwickshire’ competition) – the attitude Warwickshire have to open data and wanting to get developers involved is really refreshing 🙂

CORE

CORE (COnnecting REpositories) (Presented by Petr Knoth from the Open University)

Working with content and metadata from Open Access institutional repositories – approx 167 repositories in the UK. Mainly interested in full-text items (approx 10 per cent of metadata records in repositories have full-text items attached).

Will use OAI-PMH to harvest metadata, and then use this to grab the PDF (or other full-text) representations of resources. Will then analyse content, find ‘similarities’ between items, and express these as RDF. Will then make this available via a triple store.

Have started working with the Open University repository (ORO) – finding about 30% have full-text. Will focus on extracting relationships – specifically ‘semantic similarity’ based on content… (rather than on metadata)

Use cases – demonstrator client that can be integrated into any repository – which will provide links to papers in other repositories based on similarity relationships – will be open to any institution to use.

COMET

(Ed Chamberlain presenting) – COMET (Cambridge Open Metadata) – a follow-on from the Open Bibliography project – collaborating with CARET at Cambridge, with support from OCLC.

Want to publish lots of Open Bibliographic data – engaging with RDF, enriching records with FAST and VIAF linking – and will document experiences

Will be taking MARC21-encoded records from the library catalogue – data taken from the main University Library catalogue – a historical mix of quality and origin built up over time, with some mixture of standards as cataloguing practice has changed.

Data will be made available as MARC21 (bulk download) and also RDF/XML (bulk download and triple-store with SPARQL endpoint etc.) Will publish where possible under PDDL, but will need discussions with OCLC where it includes data from OCLC derived records…

Contextual Wrappers

(Presented by David Scruton) Contextual Wrappers is a project from the Fitzwilliam Museum at the University of Cambridge – working with Knowledge Integration Ltd and Culture Grid.

Looking at how ‘Collection Level Descriptions’ interact/add value to ‘item level descriptions’

Will be providing metadata for 160k items and 50 collection level descriptions (think I got that right)

Will be producing draft update for collection level descriptions (and related APIs) in the Culture Grid

Discovering Babel

(Martin Wynne presenting) Discovering Babel is based on the Oxford Text Archive – looking at c.1400 metadata records and c.1400 electronic literary and linguistic datasets – electronic texts, text corpora, lexicons, audio data, etc.

Also including British National Corpus and an archive of central and East European language resources (TRACTOR)

Metadata will be made available via TEI (Text Encoding Initiative) XML headers; Dublin Core – but will use DC extensions from the Open Language Archives Community (OLAC); CLARIN Metadata Initiative (CMDI); RDF linked data (‘may as well’!)

Will provide an OAI-PMH target, and will be harvested by OLAC, CLARIN, etc.

Example use cases … want to provide data/metadata and allow others to build services on top of this…
Aim to make it easier for end-users to find and access resources; will also produce a ‘How to make your language resources discoverable’ manual.

Key technical challenges – establishing sensible and standards conformant architecture for resource file locations (persistent URLs) …

OpenART

(Presented by Julie Allinson) OpenART is based at the University of York, working in partnership with the Tate and Acuity Unlimited.

Taking ‘The London Art World 1660-1735’ – an AHRC-funded project which has so far produced data in a series of inter-linked spreadsheets – about people, places and sales (of art works) – and also about the art works themselves (where they are currently – expected to be mainly in Tate Britain) and the sources of the information about the sales etc. (bibliographic information)

Can see a rich set of inter-relationships – e.g. even between people there can be relations like ‘was patron of’ – so you can see all artists with the same patron, etc.

Data will be made available via Fedora – will be normalising formats and vocabs. Probably released in some flavour of RDF. OpenART will focus on the data, but funding is already in place to produce ‘end user’ focussed interfaces – which will follow on from the OpenART project.

Example use cases:
Answering research questions like ‘how many paintings were imported annually into England during this period’
Looking at where art works are now

For the institution, the hope is to gain a mechanism for releasing open data

Blog at http://yorkdl.wordpress.com/

Open Metadata Pathfinder

‘Open Metadata Pathfinder: optimising and enhancing access to AIM25’ – a project based on AIM25 (descriptions of archives in the M25 area) – 16,000 collection level descriptions – 2 million hits per month – enquiries from around the world.

AIM25 covers a very wide range of material, of interest to a very wide audience.

Use UKAT (UK Archival Thesaurus) for searching – many users come in via Thesaurus terms rather than keyword search.

Going to take a subset of data – from five institutions that are new members of AIM25 – so starting from a clean slate. Will be using a Linked Data approach, which they can then test and compare against the approach they have taken to date.

Jerome

I’m at the ‘startup’ meeting for JISC ‘Resource Discovery Infrastructure’ projects funded under the JISC 15/10 call

Quick descriptions of the projects follow, starting with …

Project at the University of Lincoln funded by JISC (Paul Stainthorp presenting)

Jerome is a project to ‘liberate data’ – in essence to build a ‘quick and dirty’ index of data that can be made available openly from the University of Lincoln library. They feel confident they can release catalogue records as the overall volumes are small, and they have traditionally invested in local cataloguing.

Where a full record can’t be released, they can still release the fact that something belongs to the collection, and can then enhance this with data from elsewhere that is open.

They are then including repository data and archives data, and where possible e-journal holdings from the KnowledgeBase (again, using the ISSN to draw in descriptive data from open sources, due to issues around releasing descriptive metadata from the KnowledgeBase).

Technical architecture is a ‘NoSQL’ index – very very fast. The first and foremost point of access for the data is an API. Any services created by the University of Lincoln will use this API (‘eat our own dog food’)

Aiming for ‘radical personalisation’ – what do we know about users – their past activity, their location, the weather(!) etc. – use that to deliver relevant services.

Going for ‘incredibly fast’ – “if we can measure it, it is too slow” – literally looking for response times that they can’t measure

Example service – allow users to produce their own library discovery tools.

Presentation available at http://paulstainthorp.com/2011/03/01/jisc-rdtf-meeting-birmingham-jerome/

Dev8D Lightning Talk: Hacking with MS Kinect

Ben O’Steen
Doing work today on using the Kinect as a controller for an AR Drone
The Kinect is not just a camera – it also uses IR projection, which allows it to calculate things like distance – darker stuff is further away

Open Sound Control library (OSC) – allows you to stream bytes from one place, and pick them up at another
Ben is using OSC to broadcast tracking info from the Kinect – just the data points – for example tracking hands and shoulders

Has used this to make flapping motions map to up and down motion of AR Drone 🙂

But the idea is that, based on the byte stream from OSC, you can easily consume the data without having to have the Kinect etc. – so it provides a lightweight way of getting the data out and using it. E.g. someone today has built a Spotify control based on the byte stream

Ben demoing ‘raw capture’ from the Kinect – pointing out issues with very thin items (e.g. chair legs) and reflective items

Ben says – if you want to hack the Kinect, don’t buy one that comes bundled with the Xbox, because the bundled version doesn’t come with a separate power supply

OpenNI is the place to start with Kinect hacking – but it’s still not straightforward at the moment – be prepared to spend quite a bit of time on this!