I’ve been promising a blog post of my entry into the JISC MOSAIC competition for a while now, so here goes.
The JISC MOSAIC competition was basically about demonstrating different ways in which library usage data could be exploited. The data made available for the competition is from the University of Huddersfield, where Dave Pattern has led the way in putting this type of data to work. I was also keen to dust off my rather rusty coding skills. I have to admit that when I first saw the large XML files that the project was offering, I was slightly worried – doing any kind of analysis on the files looked like it was going to be a bit of work. Luckily very soon after the competition was announced, Dave offered a simple API to the data which definitely looked more my kind of thing – a relatively simple XML format, with nice summary information available.
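To give a flavour of why the API felt more approachable than the raw dumps, here is a minimal sketch of parsing that kind of summary XML with XML::Simple. The element and attribute names here are made up for illustration; the real MOSAIC response format will differ.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Simple;

# Hypothetical example of the kind of per-ISBN summary the API might
# return - the actual MOSAIC XML format is not reproduced here.
my $xml = <<'END_XML';
<response>
  <item isbn="9780140328721">
    <courseCode>Q300</courseCode>
    <courseCode>V100</courseCode>
  </item>
</response>
END_XML

# ForceArray keeps single elements as arrays; KeyAttr => [] stops
# XML::Simple folding items into a hash keyed on an attribute.
my $data = XMLin($xml, ForceArray => [ 'item', 'courseCode' ], KeyAttr => []);

for my $item (@{ $data->{item} }) {
    print "ISBN $item->{isbn}: @{ $item->{courseCode} }\n";
}
```

A few lines like this get you from the XML to a plain Perl data structure, which is most of the battle.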
I had originally thought that working on the competition might give me the push I needed to learn a new programming language – trying to get up to speed with Python or Ruby has been on my todo list for a while. However I ended up falling back on the language I’ve used most in the past – Perl. Several years ago I wrote some Perl scripts to parse various XML files so I was confident I could pick this up again. I was also slightly surprised that Perl still seemed to have some of the most extensive XML parsing options (although this may simply be due to my pre-existing knowledge – I’d be interested to hear what other languages I should be looking at)
I wanted to come at the data from a slightly different angle. I had two ideas:
- Generate purchase recommendations for libraries by finding the items they already owned in the usage data, and finding those linked items (in the usage data) that are not already owned
- Get people to upload lists of books they owned/liked, find which courses they were linked to by the usage data, and suggest courses the person might be interested in
I’d have liked to do both (and at one point thought I might pull this off with some help), but in the end I went with the second of these.
The idea was that if we know what books students on a specific course use, then someone who really likes those books may well find the course interesting. I’m still unsure whether this assumption would be borne out in practice, and I’d be interested in comments on this. My program basically needed to:
- Allow you to upload a list of books (I went for a list of ISBNs for simplicity)
- Check which course codes those books were related to
- Find where courses matching those course codes were available
- Display this information back to you
The first thing I realised was how much Perl I’d forgotten – it took me quite a while to get back into it, and even now looking at the script I can see things that I would do quite differently if I were to start over.
I was able to pinch quite a few bits from existing tutorials and examples on the web (this is one of the great things about using Perl – lots of existing code to use). Things like uploading a file of ISBNs were relatively trivial. I’m not going to run through the whole thing here, but the bits I want to highlight are:
Dealing with UCAS
UCAS really don’t make it easy to get information out of their website on a machine-to-machine basis. I’ve done an entire post on scraping information from UCAS, which I’m not going to rehash here, but honestly, if we are going to see people developing applications which help individuals build personalised learning pathways through Higher Education courses, this has got to improve.
How much overlap is significant?
The first set of test data I used was the ISBNs from my own LibraryThing account. This is a free account, so limited to 200 items – which came to roughly 200 ISBNs. I realise that most people are not going to have a list of 200 ISBNs to hand (a major issue with what I’m proposing here), but it seemed like a good place to start. However, I found that only 2 of these 200 items matched items in the usage data from Huddersfield. Initially these two items resulted in several course recommendations – because I’d assumed that any overlap was a ‘recommendation’. However it was immediately apparent that the fact I owned ‘The Amber Spyglass’ by Philip Pullman didn’t really imply I’d be interested in studying History with English Language Teaching, or that owning Jane Eyre meant I’d be interested in Community Development and Social Work – these were just single data points, and amounted to ‘coincidence’.
Given this, I introduced the idea of ‘close matches’ which meant that you owned/read at least 1% of all the items associated with a course code. However, this led to my own data generating zero matches – not a good start. For the purposes of demonstration I basically faked some sets of ISBNs which would give results. I have no idea whether 1% is a realistic level to set for ‘close matches’ – it could well be this is too low, but it seemed like a good place to start, and it can easily be adjusted within the script.
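The close-match test itself is simple. This sketch uses invented counts; the only thing it shares with the real script is the idea of a single adjustable threshold:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A course is a 'close match' if you own at least this fraction of
# all items linked to its course code - easy to adjust in one place
my $THRESHOLD = 0.01;

my %course_total = ( 'Q300' => 150, 'V100' => 400 );  # items per course code
my %course_owned = ( 'Q300' => 3,   'V100' => 1 );    # of which you own

my %close;
for my $code (keys %course_total) {
    my $ratio = $course_owned{$code} / $course_total{$code};
    $close{$code} = 1 if $ratio >= $THRESHOLD;
}

# Q300: 3/150 = 2%, a close match; V100: 1/400 = 0.25%, not
print "$_ is a close match\n" for sort keys %close;
```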
I think it is really important to stress that the only usage data the competition worked against was that from the University of Huddersfield. This was bound to give limited results – any single institution’s data would suffer from the same problem. However, if we were to see usage data brought together from universities across the UK, I still think there are some possibilities here (and who knows what might turn up if you added public library information into the mix somehow?).
So – the result is at ReadToLearn and you are welcome to give it a go – I’m very interested in comments and feedback. I’m hoping to at least partially rewrite the application to use the UCAS screenscraping utility I’ve since developed. Although I’m rather embarrassed by the code as it definitely leaves a lot to be desired, if you want to you can download the ReadtoLearn code here.