Do you Read to Learn?

I’ve been promising a blog post of my entry into the JISC MOSAIC competition for a while now, so here goes.

The JISC MOSAIC competition was basically about demonstrating different ways in which library usage data could be exploited. The data made available for the competition is from the University of Huddersfield, where Dave Pattern has led the way in putting this type of data to work. I was also keen to dust off my rather rusty coding skills. I have to admit that when I first saw the large XML files that the project was offering, I was slightly worried – doing any kind of analysis on the files looked like it was going to be a bit of work. Luckily very soon after the competition was announced, Dave offered a simple API to the data which definitely looked more my kind of thing – a relatively simple XML format, with nice summary information available.

I had originally thought that working on the competition might give me the push I needed to learn a new programming language – trying to get up to speed with Python or Ruby has been on my todo list for a while. However I ended up falling back on the language I’ve used most in the past – Perl. Several years ago I wrote some Perl scripts to parse various XML files, so I was confident I could pick this up again. I was also slightly surprised that Perl still seemed to have some of the most extensive XML parsing options (although this may simply be due to my pre-existing knowledge – I’d be interested to hear which other languages I should be looking at).

I wanted to come at the data from a slightly different angle. I had two ideas:

  • Generate purchase recommendations for libraries by finding the items they already owned in the usage data, and finding those linked items (in the usage data) that are not already owned
  • Get people to upload lists of books they owned/liked, find which courses those books were linked to by the usage data, and suggest those courses to the person

I’d have liked to do both (and at one point thought I might pull this off with some help), but in the end I went with the second of these.

The idea was that if we know what books students on a specific course use, then someone who really likes those books may well find the course interesting. I’m still unsure whether this assumption would be borne out in practice, and I’d be interested in comments on this. My program basically needed to do the following (a rough sketch of the core loop follows the list):

  • Allow you to upload a list of books (I went for a list of ISBNs for simplicity)
  • Check which course codes those books were related to
  • Find where courses matching those course codes were available
  • Display this information back to you
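To give a flavour of that core loop, here is a rough sketch (not the actual competition code) – it assumes a hypothetical MOSAIC API URL and a hypothetical courseCode element in the response, both of which would need replacing with the real details from Dave’s API:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(get);
use XML::Simple qw(XMLin);

# Hypothetical endpoint and element name - substitute the real MOSAIC API details
my $api = 'http://example.org/mosaic/api?isbn=';

my %course_counts;                      # how many uploaded ISBNs map to each course code

while ( my $isbn = <STDIN> ) {          # one ISBN per line
    chomp $isbn;
    next unless $isbn =~ /^[0-9Xx-]+$/; # very rough ISBN sanity check
    my $xml = get( $api . $isbn ) or next;
    my $data = XMLin( $xml, KeyAttr => [], ForceArray => ['courseCode'] );
    $course_counts{$_}++ for @{ $data->{courseCode} || [] };
}

# Most frequently matched course codes first
for my $code ( sort { $course_counts{$b} <=> $course_counts{$a} } keys %course_counts ) {
    print "$code\t$course_counts{$code}\n";
}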

The first thing I realised was how much Perl I’d forgotten – it took me quite a while to get back into it, and even now looking at the script I can see things that I would do quite differently if I were to start over.

I was able to pinch quite a few bits from existing tutorials and examples on the web (this is one of the great things about using Perl – lots of existing code to use). Things like uploading a file of ISBNs were relatively trivial. I’m not going to run through the whole thing here, but the bits I want to highlight are:

Dealing with UCAS
UCAS really don’t make it easy to get information out of their website on a machine-to-machine basis. I’ve done an entire post on scraping information from UCAS, which I’m not going to rehash here, but honestly if we are going to see people developing applications which help individuals build personalised learning pathways through Higher Education courses this has got to improve.

How much overlap is significant?
The first set of test data I used was the ISBNs from my own LibraryThing account. This is a free account, so limited to 200 items – roughly 200 ISBNs. I realise that most people are not going to have a list of 200 ISBNs to hand (a major issue with what I’m proposing here), but it seemed like a good place to start. However, I found that only 2 of these 200 items matched items in the usage data from Huddersfield. Initially these two items resulted in several course recommendations – because I’d assumed that any overlap was a ‘recommendation’. However it was immediately apparent that the fact I owned ‘The Amber Spyglass’ by Philip Pullman didn’t really imply I’d be interested in studying History with English Language Teaching, or that owning Jane Eyre meant I’d be interested in Community Development and Social Work – these were just single data points, and amounted to ‘coincidence’.

Given this, I introduced the idea of ‘close matches’ which meant that you owned/read at least 1% of all the items associated with a course code. However, this led to my own data generating zero matches – not a good start. For the purposes of demonstration I basically faked some sets of ISBNs which would give results. I have no idea whether 1% is a realistic level to set for ‘close matches’ – it could well be this is too low, but it seemed like a good place to start, and it can easily be adjusted within the script.
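For what it’s worth, the ‘close match’ test itself is trivial – something along these lines, with made-up counts standing in for the real usage data:

use strict;
use warnings;

# Made-up example counts: %matched = uploaded ISBNs linked to each course code,
# %total = all items linked to that code in the usage data
my %matched = ( L390 => 3, Q300 => 1 );
my %total   = ( L390 => 120, Q300 => 450 );

my $threshold = 0.01;    # the 1% 'close match' level - easy to adjust

for my $code ( keys %matched ) {
    next unless $total{$code};                     # avoid dividing by zero
    my $overlap = $matched{$code} / $total{$code};
    printf "%s: %.1f%% overlap%s\n", $code, 100 * $overlap,
        $overlap >= $threshold ? ' (close match)' : '';
}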

I think it is really important to stress that the only usage data the competition worked against was that from the University of Huddersfield. This was bound to give limited results – any single institution’s data would suffer from the same problem. However, if we were to see usage data brought together from universities across the UK I still think there are some possibilities here (and who knows what might turn up if you added public library information into the mix somehow?).

So – the result is at ReadToLearn and you are welcome to give it a go – I’m very interested in comments and feedback. I’m hoping to at least partially rewrite the application to use the UCAS screenscraping utility I’ve since developed. Although I’m rather embarrassed by the code, as it definitely leaves a lot to be desired, you can download the ReadtoLearn code here if you want to.

Accessing Sconul Access

This is a very quick lunchtime post to document a script I’ve been working on over the last week or so. SCONUL Access is a scheme that offers reciprocal access to various university libraries across the UK.

The SCONUL Access website allows you to enter details of a UK university affiliation, and then will list details of those libraries which you can use via the reciprocal agreement scheme (you have to apply for a SCONUL access card at your ‘home’ institution before you can use the other libraries).

I’ve occasionally thought it would be nice to do something like map the results of a SCONUL Access enquiry on a Google map, or integrate the question of ‘which libraries can I use’ with ‘where can I get a book’ – so that users could potentially do a search of all the libraries they can access (perhaps limited by a geographical radius?). Aside from these ideas, the SCONUL Access directory actually contains quite a bit of useful information on each library it lists – including the institution website, the library website and the library catalogue URL.

Further, I was recently inspired by Philip Adams from Leicester (@Fulup) on Twitter who pointed me at http://www.library.dmu.ac.uk/Resources/OPAC/index.php?page=366 which combines information from SCONUL access with the Talis Silkworm directory to show SCONUL Access libraries (relevant to those at the University of Leicester I guess) on Google Maps.

Unfortunately the SCONUL Access website doesn’t provide an API to query the data it has on the libraries, so I thought I’d start writing something. I haven’t (yet anyway) tried to replicate the function that SCONUL Access provides of taking user details and giving a list of available libraries – to get this function you still have to go to the SCONUL Access website and fill in their forms. What my script does is simply provide SCONUL Access member library details in an XML format. The script lives at:

http://www.meanboyfriend.com/sconulaccess

It supports three modes of use:

1. Summary of all SCONUL Access libraries
URL: http://www.meanboyfriend.com/sconulaccess
Function: returns a summary of all institutions participating in SCONUL Access from their A-Z Listing. This XML (see below for format) only includes the SCONUL Access (internal) code for the library, the name of the institution and the URL for the full SCONUL Access record

2. Full records for specified SCONUL Access libraries
URL: http://www.meanboyfriend.com/sconulaccess/?institution=[comma-separated IDs] e.g. http://www.meanboyfriend.com/sconulaccess/?institution=2,3,4
Function: returns full records for each institution specified by its SCONUL Access ID in the URL (see full XML structure below)

3. Full records for all SCONUL Access libraries
URL: http://www.meanboyfriend.com/sconulaccess/?institution=all
Function: similar to 2 but returns full records for all institutions that are obtained via 1. This takes some time to return results as it retrieves over 180 records from the SCONUL Access website – so it isn’t recommended for general use.
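Calling the script needs nothing more than an HTTP GET – for example, mode 2 from Perl (a minimal sketch):

use strict;
use warnings;
use LWP::Simple qw(get);

# Full records for SCONUL Access institutions 2, 3 and 4 (mode 2 above)
my $url = 'http://www.meanboyfriend.com/sconulaccess/?institution=2,3,4';
my $xml = get($url) or die "Could not retrieve $url\n";
print $xml;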

XML Structure

<sconul_access_results>
 <institution code="4" name="Aston University">
  <inst_sconul_url>
    http://www.access.sconul.ac.uk/members/institution_html?ins_id=4
  </inst_sconul_url>
  <website>http://www.aston.ac.uk/</website>
  <library_website>http://www1.aston.ac.uk/lis/</library_website>
  <library_catalogue>http://library.aston.ac.uk/</library_catalogue>
  <contact_name>Anne Perkins</contact_name>
  <contact_title>Public Services Coordinator</contact_title>
  <contact_email>a.v.perkins@aston.ac.uk</contact_email>
  <contact_telephone>01212044492</contact_telephone>
  <contact_postcode>B4 7ET</contact_postcode>
 </institution>
 <source>
  <source_url>http://www.access.sconul.ac.uk/</source_url>
  <rights>Copyright SCONUL. SCONUL, 102 Euston Street, London, NW1 2HS. </rights>
 </source>
</sconul_access_results>

The <institution> element is repeatable.
For (1) above the only elements returned are:
<institution>
<inst_sconul_url>
<source> (and subelements)
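As a quick sketch of consuming the full-record XML (assuming the structure above, and XML::Simple at the client end), something like this would print each institution’s name and library catalogue URL:

use strict;
use warnings;
use LWP::Simple qw(get);
use XML::Simple qw(XMLin);

my $xml = get('http://www.meanboyfriend.com/sconulaccess/?institution=2,3,4')
    or die "No response from the SCONUL Access script\n";

# KeyAttr => [] stops XML::Simple folding <institution> elements into a hash;
# ForceArray keeps a single institution as a one-element list
my $data = XMLin( $xml, KeyAttr => [], ForceArray => ['institution'] );

for my $inst ( @{ $data->{institution} } ) {
    printf "%s (%s): %s\n", $inst->{name}, $inst->{code},
        $inst->{library_catalogue} || 'no catalogue URL listed';
}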

Anyway, I’d be interested in comments, and would be happy to look at alternative functions and formats – let me know if there is anything you’d like to see.

UCAS Course code lookup

While I was writing my entry for the JISC MOSAIC competition (which I will write up more thoroughly in a later post I promise – honest), one of the problems I encountered was retrieving details of courses and institutions from the UCAS website. Unfortunately UCAS don’t seem to provide a nice API to their catalogue of course/institution data. To extract the data I was going to have to scrape it out of their HTML pages. Even more unfortunately they require a session ID before you can successfully get back search results – this means you essentially have to start a session on the website and retrieve the session ID before you can start to do a search.
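In outline the workaround looks something like the sketch below. The entry URL and the pattern used to pick out the StateId are placeholders rather than the real values (the actual UCAS pages need inspecting to get these right); the institution search URL follows the pattern described later in this post:

use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new( cookie_jar => {} );   # keep any session cookies UCAS sets

# 1. Start a session on the UCAS search site (placeholder URL)
my $start = $ua->get('http://search.ucas.com/');
die 'Could not reach UCAS: ', $start->status_line, "\n" unless $start->is_success;

# 2. Pull the session StateId out of the returned HTML (placeholder pattern)
my ($state_id) = $start->decoded_content =~ m{/StateId/([^/"]+)/};
die "No StateId found - has the UCAS HTML changed?\n" unless $state_id;

# 3. Use the StateId in subsequent search requests, then scrape the results
my $search_url = 'http://search.ucas.com/cgi-bin/hsrun/search/search/StateId/'
               . $state_id . '/HAHTpage/search.HsInstDetails.run?i=P80';
my $results = $ua->get($search_url);
print $results->decoded_content if $results->is_success;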

I hacked together something to get what I needed for the MOSAIC competition. However, I wasn’t the only person who had this problem – in a blog entry on his MOSAIC entry Tony Hirst notes the same issue. At the time Tony asked if I would be making what I’d done available, and I was very happy to – unfortunately, the way I’d built it meant I couldn’t expose just the UCAS course code search. I started to re-write the code, but writing something that I could share with other people, with appropriate error checking and feedback, proved more challenging than my original dirty hack.

I’ve finally got round to it – it works as follows:

The service is at http://www.meanboyfriend.com/readtolearn/ucas_code_search?
The service currently accepts two parameters:

  • course_code
  • catalogue_year

The course_code parameter simply accepts a UCAS course code. I haven’t been able to find out exactly what the course code format is restricted to – but it looks like it is a maximum of 4 alphanumeric characters, so this is what the script accepts. Assuming the code meets this criterion, the script passes it directly to the UCAS catalogue search. The UCAS catalogue doesn’t seem to care whether alpha characters are upper or lower case and treats them as equivalent. For some examples of UCAS codes, you can see this list provided by Dave Pattern. (see Addendum 2 for more information on UCAS course codes and JACS)

The catalogue_year parameter takes the year in the format yyyy. If no value is given then the UCAS catalogue seems to default to the current year (2010 at the moment). If an invalid year is given the UCAS catalogue also seems to default to the current year. It seems that at most only two years are valid at a single time. However the script doesn’t check any of this – as long as it gets a valid four digit year, it passes it on to the UCAS catalogue search.
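Inside the script the checks are nothing more sophisticated than a couple of regular expressions – roughly along these lines (a sketch, not the actual code):

use strict;
use warnings;

sub validate_params {
    my ( $course_code, $catalogue_year ) = @_;

    # Up to 4 alphanumeric characters; UCAS treats upper and lower case as equivalent
    die "Invalid course code\n"
        unless defined $course_code && $course_code =~ /^[A-Za-z0-9]{1,4}$/;

    # Optional four-digit year; anything missing is left for UCAS to default
    die "Invalid catalogue year\n"
        if defined $catalogue_year && length $catalogue_year
           && $catalogue_year !~ /^\d{4}$/;

    return ( uc $course_code, $catalogue_year );
}

my ( $code, $year ) = validate_params( 'r901', '2010' );
print "Searching for $code in ", $year || 'the default year', "\n";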

An example is http://www.meanboyfriend.com/readtolearn/ucas_code_search/?course_code=R901&catalogue_year=2010

The script’s output is XML of the form:

<xml>
<ucas_course_results course_code="" catalogue_year="" ucas_stateid="">
<institution code="" name="">
<course_name>xxxx</course_name> (repeatable)
</institution>
</ucas_course_results>

(I’ve made a slight change to the output structure since the original publication of this post)
(Finally I’ve added a couple of extra elements inst_ucas_url and course_ucas_url which provide links to the institution and course records on the UCAS website respectively)

<xml>
<ucas_course_results course_code="" catalogue_year="" ucas_stateid="">
<institution code="" name="">
<inst_ucas_url>[URL for Institution record on UCAS website]</inst_ucas_url>
<course ucas_catalogue_id=""> (repeatable) (the ucas_catalogue_id is not currently populated – see Addendum 1)
<course_ucas_url>[URL for course record on UCAS website]</course_ucas_url>
<name>xxxx</name>
</course>
</institution>
</ucas_course_results>

For example:

<ucas_course_results course_code="R901" catalogue_year="2010" ucas_stateid="DtDdAozqXysV4GeQbRbhP3DxTGR2m-3eyl">
<institution code="P80" name="University of Portsmouth">
<course_name>Combined Modern Languages</course_name>
</institution>
</ucas_course_results>

(With the revised structure described above – including the extra inst_ucas_url and course_ucas_url elements – the same search returns:)

<ucas_course_results course_code="R901" catalogue_year="2010" ucas_stateid="DtDdAozqXysV4GeQbRbhP3DxTGR2m-3eyl">
<institution code="P80" name="University of Portsmouth">
<inst_ucas_url>
http://search.ucas.com/cgi-bin/hsrun/search/search/StateId/DtGJmwzptIwV4rADbR8xUfafCk6nG-Ur61/HAHTpage/search.HsInstDetails.run?i=P80
</inst_ucas_url>
<course ucas_catalogue_id=""> (the ucas_catalogue_id is not currently populated – see Addendum 1)
<course_ucas_url>
http://search.ucas.com/cgi-bin/hsrun/search/search/StateId/DtGJmwzptIwV4rADbR8xUfafCk6nG-Ur61/HAHTpage/search.HsDetails.run?n=989628
</course_ucas_url>
<name>Combined Modern Languages</name>
</course>
</institution>
</ucas_course_results>

The values passed to the script, along with the StateID for the UCAS website, are returned in the response.

If there is an error at some point in the process, an error message will be included in the response in an <error> tag.
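To give a feel for consuming the output, here is a rough client-side sketch – it assumes the revised XML structure above, and that any <error> element appears directly under the root:

use strict;
use warnings;
use LWP::Simple qw(get);
use XML::Simple qw(XMLin);

my $url = 'http://www.meanboyfriend.com/readtolearn/ucas_code_search/'
        . '?course_code=R901&catalogue_year=2010';
my $xml = get($url) or die "No response from the course code search\n";

my $data = XMLin( $xml, KeyAttr => [], ForceArray => [ 'institution', 'course' ] );
die "Error from script: $data->{error}\n" if $data->{error};   # assumes <error> under the root

for my $inst ( @{ $data->{institution} || [] } ) {
    for my $course ( @{ $inst->{course} || [] } ) {
        printf "%s (%s): %s\n", $inst->{name}, $inst->{code}, $course->{name};
    }
}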

Addendum 1
The script relies on the HTML returned by UCAS remaining consistent. If this changes, my script will probably break.

Having done the hard work I’d be happy to offer alternative formats for the data returned by the script – just let me know in the comments. I’d also be happy to look at different XML structures for the data so again just leave a comment.

Something I should have mentioned in the original post. Given the data returned by the script you should be able to form a URL which links to an institution on the UCAS website using a URL of the form:
http://search.ucas.com/cgi-bin/hsrun/search/search/StateId/<insert state ID from xml here>/HAHTpage/search.HsInstDetails.run?i=<insert institution code here>
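In Perl that is just string concatenation, using the values returned in the XML (the StateID and institution code here are the ones from the example above):

use strict;
use warnings;

my $state_id  = 'DtDdAozqXysV4GeQbRbhP3DxTGR2m-3eyl';   # ucas_stateid returned in the XML
my $inst_code = 'P80';                                   # institution code returned in the XML

my $inst_url = 'http://search.ucas.com/cgi-bin/hsrun/search/search/StateId/'
             . $state_id
             . '/HAHTpage/search.HsInstDetails.run?i='
             . $inst_code;
print "$inst_url\n";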

Since finishing this work last night I’ve realised that I’ve left out one important piece of data which is an identifier that would let you form a link to a specific course from a specific institution. I have slightly restructured the XML to leave a space for the ucas_catalogue_id in the XML. I’ll add this in as soon as I can.
This has now been added.

Addendum 2
I’ve just found quite a bit more detail on the format and structure of the UCAS ‘course codes’. UCAS now uses JACS (Joint Academic Coding System) for course codes (see JACS documentation from HESA). JACS codes consist of 4 characters, the first being an uppercase letter and the remaining three characters being digits. JACS codes are essentially hierarchical, with the first character representing a general subject area and the digits representing subdivisions (with increasing granularity). The codes in the UCAS catalogue are a mixture of JACS 1.7 and JACS 2.0 codes. A full listing of JACS v2.0 codes is available from HESA, and a listing of JACS v1.7 codes is available from UCAS as a pdf.

UCAS have an explanation of why and where they use both JACS v2.0 and JACS v1.7.

However because UCAS need to code courses which cover more than one subject area, they have rules for representing these courses while sticking to codes with a total length of 4 characters. These rules are summarised on the UCAS website, but a fuller description is available in pdf format. This last document is most interesting because it indicates how you might create the UCAS code from a HESA Student Record which could be of interest for future mashups.

The implications of all this for my script are relatively small as I currently assume that there is a 4 character alpha-numeric code. On the basis of this documentation I could refine this to check for 3 alpha-numeric characters followed by a single digit I guess – perhaps I will at some point.

Finally it looks like UCAS and HESA are currently looking at JACS v3.0 which could introduce further changes I guess, although it looks unlikely that this will affect the code format, but rather the possible values, and maybe the meaning of some values. While this isn’t a problem for my script, it would mean that historical course codes from datasets such as MOSAIC could not be assumed to represent the same subject areas in the current UCAS course catalogue as they did when the data was recorded – which is, to say the least, a pain.

Addendum 3
A final set of changes (I hope):

  • The ucas_catalogue_id is now populated
  • Added inst_ucas_url element which contains the URL linking to the Institution record in the UCAS catalogue
  • Added course_ucas_url element which contains the URL linking to the Course record in the UCAS catalogue

Everyone’s a winner?

The results of the JISC MOSAIC competition were announced this week. The winning entries were great, and I think their prizes were well deserved. The only downside in this was that my entry didn’t make the cut. I will admit to having a moment of disappointment over this, but this passed in about 5 seconds – after all, I’d really enjoyed the challenge of writing my entry and was relatively pleased with the result.

Later in the week I fell into conversation with a couple of people on Twitter about how there hadn’t been much collaboration in the competition. With one notable exception none of the contestants had published early thoughts online, and all the entries had been from individuals rather than teams.

During the course of this conversation I managed to both insult and upset someone I greatly like, admire and respect. For this I am truly sorry. This post is in the way of an apology as well as an attempt to express my own thoughts around the nature of ‘developer competitions’ such as JISC MOSAIC.

The idea of a developer competition is that you set a challenge, aimed at computer programmers and interested others, and offer prizes to the best entries – the criteria can vary wildly. Perhaps the biggest prize of this type we’ve seen is the $1million NetFlix prize, but in the UK HE community where I work there have been a few smaller prizes on offer, and more widely in the UK community there have been prizes for ideas about using government data, and we are about to see one launched on the use of Museum data. The JISC MOSAIC competition offered a 1st prize of £1000 for work on library usage data.

One of the amazing things about the web, and perhaps particularly about the communities I’m engaged in, is the incredible personal commitment made in terms of time and resource by individuals to what many would regard as ‘work’. Both of the people I was talking to put a great deal of effort into contributing to and developing ideas that many might think of as ‘the day job’ – and they do so with no thought of reward.

So – given this tendency to be self-motivated to solve problems, contribute, take part and so on, why do we need developer competitions?

My starting point is to look at my own motivation for entering the JISC MOSAIC competition. Would I have done this work without the competition? Trying to be completely honest here – probably not. However, I would almost certainly have done other things instead – perhaps blogged more, perhaps done some other development (like this). So the competition focussed my energy on a particular area of work. Was I motivated by the cash prize? I’m not sure – at the end of the day it isn’t that relevant to me (although no doubt I could have found something to treat myself to). I think it was just the idea of the ‘competition’ that gave me the focus. I’m the kind of person who works relatively well with clear deadlines – so having a date by which a set of work was to be done definitely gave me something to aim at.

So – the competition was one element. However, I was also looking for ways to dust off my scripting skills. I used to script in Perl as part of my job, but I haven’t done this for several years – I had been looking for ways of picking this up again as it was something I always enjoyed doing. I am also extremely interested in the ideas behind the competition – I believe libraries should be exploiting their usage data more, and I was keen to show the community how valuable that usage data might be.

I don’t assume that others are motivated in the same way as me. When the usage data that was part of the JISC MOSAIC competition was first put online somebody immediately took it and transformed it into RDF – they weren’t motivated by a competition, they just did it.

My conclusion is that such competitions harness existing energy in the community and focus it on a particular problem for a particular time period. It won’t generally work where people aren’t inclined to do the work anyway. You need an interesting problem or proposition to engage people.

So far, so good? I’m not sure. The problem with a competition is that it is, well, competitive. Again trying to be honest about my own situation (and I’m not particularly proud of this, so don’t take it as an endorsement of my own approach), the truth is that I immediately became more protective of my ideas. The competition had put a ‘value’ on them that they hadn’t previously had. I should say I actually started work on two entries to the competition – one was in collaboration with someone else, which unfortunately we weren’t able to pull together in time – so it wasn’t all about ‘me’. However, I didn’t announce my own entry until I was ready to submit. This isn’t how I usually work – I’m usually happy to share half baked ideas (as readers of this blog will know only too well!).

Again I think the factors around this are complex. It wasn’t just that I didn’t want to give away my idea. The truth is that I’m not a very good programmer. I wanted to take this chance to develop my programming skills (or at least get myself back to my previous level of incompetence). I am under no illusions – any developer worth their salt could take my idea and do a better job with it. In general this would be great – if my idea is good enough to inspire other people to do it much much better than I can I’d be very happy. But for the period of the competition this suddenly seemed like a bad idea.

Reflecting on this now, this shows a pretty rubbish (on my part) attitude to others – the ‘fear’ that my idea would be ‘stolen’ (and of course the egoism that says my idea was worth stealing). I’m pretty confident in retrospect that the only possible outcome of publishing early would have been a better entry (possibly in collaboration with others). However, I would say that my guess is it would have resulted in me not doing the coding – which I would have been sorry about.

I am going to blog my entry in detail, and release all the work I’ve done – which others are more than welcome to use and abuse.

So although I think developer competitions work in terms of focussing people on a problem, I think there are some possible downsides, perhaps chief of which is that competitions may discourage collaboration. I don’t think this is a given though, and so in closing here are some thoughts that future developer competitions might want to consider:

  • Is there an element in your competition that encourages team entries above or as well as individual entries?
  • Can you reward collaboration either within or outside the competition structure?
  • How are you going to ensure that the whole community can share and benefit from the competition outcomes? Plan this from day 1!

Perhaps consider splitting the prizes in different ways to achieve this – not one ‘big winner’, but rather judging and rewarding contributions as you go along. Perhaps consider having a ‘collaboration’ environment where ideas can be submitted (and judged separately) and where teams can form and work together.

A final thought – I really enjoyed entering the JISC MOSAIC competition – it stretched my skills and scratched an itch for me. I am in no way disappointed I didn’t win – the winning entries were very deserving. I fully intend to do more scripting/programming going forward. And sharing.

Would you recommend your recommender?

We are starting to see software and projects emerging that utilise library usage data to make recommendations to library users about things they might find useful. Perhaps the most famous example of this type of service is the Amazon ‘people who bought this also bought’ recommendations.

In libraries we have just had the results of the JISC MOSAIC project announced, which challenged developers to show what they could do with library usage data. This used usage data from Huddersfield, where Dave Pattern has led the way both in exploiting the usage data within the Huddersfield OPAC, and also in making the data available to the wider community.

On the commercial side we now have the bX software from Ex Libris, which takes usage data from SFX installations across the world (SFX is an OpenURL resolver which essentially helps make links between descriptions of bibliographic items and the full text of the items online). By tracking what fulltext resources a user accesses in a session, and looking at behaviour over millions of transactions, this can start to make associations between different fulltext resources (usually journal articles).

I was involved in trialling bX, and I talked to some of the subject librarians about the service; the first question they wanted answered was “how does it come up with the recommendations?”. There is a paper on some of the work that led to the bX product, although a cursory reading doesn’t tell me exactly how the recommendations are made. Honestly, I actually hope that there is some reasonably clever mathematical/statistical analysis going on behind the recommendation service that I’m not going to understand. For me the question shouldn’t be “how does it work?” but “does it work?” – that is, are the recommendations any good?

So we have a new problem – how do we measure the quality of the recommendations we get from these services?

Perhaps the most obvious approach is to get users to assess the quality of the recommendations. This is the approach that perhaps most libraries would take if assessing a new resource. It’s also an approach that Google take. However, when looking at a recommender service that goes across all subject areas, getting a representative sample of people from across an institution to test the service thoroughly might be difficult.

Another approach is to use a recommendation service and then do a longitudinal study of user behaviour and try to draw conclusions about the success of the service. This is how I’d see Dave Pattern’s work at Huddersfield, which he recently presented on at ILI09. Dave’s analysis is extremely interesting and shows some correlations between the introduction of the recommender service and user behaviour. However, it may not be economic to do this where there is a cost to the recommender service.

The final approach, and one that appeals to me, is that taken by the NetFlix Prize competition. The NetFlix Prize was an attempt by the DVD/movie lending company NetFlix to improve their recommendation algorithm. They offered a prize of $1million to anyone who could improve on their existing algorithm by 10% or more. The NetFlix Prize actually looked at how people rated (1-5) movies they had watched – based on previous ratings the goal was to predict how individuals might rate other movies. The way the competition was structured was that a data set with ratings was given to contestants, along with a set of ratings where the actual values of the ratings had been removed. The challenge was to find an algorithm that would fill in these missing ratings accurately (or more accurately than the existing algorithm). This is a typical approach when looking at machine-based predictions – you have a ‘training set’ of data which you feed into the algorithms, and a ‘testing set’ of real-life data against which you compare the machine ‘predictions’.

The datasets are available at the UCI Machine Learning Repository. The Netflix prize was finally won in September 2009 after almost 3 years.

What I find interesting about this approach is that it tests the recommendation algorithm against real data. Perhaps this is an approach we could look at with recommendation services for libraries – to feed in a partial set of data from our own systems and see whether the recommendations we get back match the rest of our data. As we start to see competition in this marketplace, we are going to want to know which services best suit our institutions.
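A very rough sketch of what that evaluation might look like: split your own usage data into a part you feed to the recommender and a part you hold back, then count how often the recommendations come back containing items from the held-back part. The get_recommendations() routine below is a stand-in for a call to whichever service is being tested, and the data is made up:

use strict;
use warnings;
use List::Util qw(sum);

# Stand-in for the recommender being evaluated: given an item identifier,
# return a list of recommended item identifiers
sub get_recommendations {
    my ($item) = @_;
    return ();    # replace with a real call to the service under test
}

# For each seed item, the set of items our own (held-back) data says were
# actually used alongside it - made-up identifiers for illustration
my %held_back = (
    'isbn:0000000000' => { 'isbn:1111111111' => 1, 'isbn:2222222222' => 1 },
);

my @hit_rates;
for my $seed ( keys %held_back ) {
    my @recs = get_recommendations($seed);
    next unless @recs;
    my $hits = grep { $held_back{$seed}{$_} } @recs;
    push @hit_rates, $hits / @recs;    # fraction of recommendations confirmed by our own data
}

printf "Mean hit rate: %.2f\n", @hit_rates ? sum(@hit_rates) / @hit_rates : 0;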

Middlemash, Middlemarch, Middlemap

The next Mashed Library event was announced a few months ago, but now more details are available. Middlemash is happening at Birmingham City University on 30th November 2009. I hope to see you there.

In discussion with Damyanti Patel, who is organising Middlemash, we thought it would be nice to do a little project in advance of Middlemash. When we brainstormed what we could do I originally suggested that maybe someone had drawn a map of the fictional geography of Middlemarch, and if we could find one, we could make it interactive in some way. Unfortunately a quick search turned up no such map. However, what it did turn up was something equally interesting – this map of relationships between characters in Middlemarch on LibraryThing.

This inspired a new idea – whether this could be represented in RDF somehow. My first thought was FOAF, but initially this seemed limited as it doesn’t allow for the expression of different types of relationship. However, I then came across this post from Ian Davis (the first in a series of 3), which used the Relationship vocabulary in addition to FOAF to express more of the kind of thing I was looking for.
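To give a flavour of the approach, a minimal hand-written fragment combining FOAF with the Relationship vocabulary looks something like this (a simplified illustration rather than an extract from my actual file):

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/"
         xmlns:rel="http://purl.org/vocab/relationship/">

  <foaf:Person rdf:about="#dorothea">
    <foaf:name>Dorothea Brooke</foaf:name>
    <rel:siblingOf rdf:resource="#celia"/>
    <rel:spouseOf rdf:resource="#casaubon"/>
  </foaf:Person>

  <foaf:Person rdf:about="#celia">
    <foaf:name>Celia Brooke</foaf:name>
  </foaf:Person>

  <foaf:Person rdf:about="#casaubon">
    <foaf:name>Edward Casaubon</foaf:name>
  </foaf:Person>

</rdf:RDF>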

The resulting RDF is at http://www.meanboyfriend.com/overdue_ideas/middlemash.rdf. However, if you want to explore this in a more user-friendly manner, you probably want to use an RDF viewer. Although there are several you could use, the one I found easiest as a starting point was the Zitgist dataviewer. You should be able to browse the file directly with Zitgist via this link. There are however a couple of issues:

  • Zitgist doesn’t seem to display the whole file, although if you browse through relationships you can view all records eventually
  • At time of posting I’m having some problems with Zitgist response times, but hopefully these are temporary

This is the first time I’d written any RDF, and I did it by hand, and I was learning as I went along. So I’d be very glad to know what I’ve done wrong, and how to improve it – leave comments on this post please.

I did find some problems with the Relationship vocabulary. It still only expresses a specific range of relationships. It also seems to rely on inferred relationships in some cases. The relationships uncle/aunt/nephew/niece aren’t expressed directly in the Relationship vocabulary – presumably on the basis that they could be inferred through other relationships of ‘parentOf’, ‘childOf’ and ‘siblingOf’ (i.e. your uncle is your father’s brother etc.). However, in Middlemarch there are a few characters who are described as related in this manner, but to my knowledge no mention of the intermediary relationships is made. So we know that Edward Casaubon has an Aunt Julia, but it is not stated whether she is his father’s or mother’s sister, and further his parents are not mentioned (as far as I know – I haven’t read Middlemarch for many years, and I went from SparkNotes and the relationship map on LibraryThing).

Something that seemed odd is that the Relationship vocabulary does allow you to relate grandparents to grandchildren explicitly, without relying on the inference from two parentOf relationships.

Another problem, which is one that Ian Davis explores at length in his posts on representing Einstein’s biography in RDF, is the time element. The relationships I express here aren’t linked to time – so where someone has remarried it is impossible to say from the work I have done here whether they are polygamous or not! I suspect that at least some of this could have been dealt with by adding details like dates of marriages via the Bio vocabulary Ian uses, but I think this would be a problem in terms of the details available from Middlemarch itself (I’m not confident that dates would necessarily be given). It also looked like hard work 🙂

So – there you have it, my first foray into RDF – a nice experiment, and potentially an interesting way of developing representations of literary works in the future?

Preserving bits

Just as I posted that last post, including some stuff on preservation of the digital, this piece from Robert Scoble dropped into my Twitter stream. I thought a quick sharing of my approach to digital preservation (such as it is) might be interesting:

Photos
When we copy these from our digital camera, they go straight onto our NAS (network attached storage), in date-labelled folders (named as YYYYMMDD) – one for each day we do a download. I then copy them into iPhoto on our MacBook Pro – which is our primary tool for organising the photos – we might delete some of the pictures we import, but I don’t go back and remove these from the NAS. In iPhoto I take advantage of the various organisation tools to split the photos into ‘events’, and have recently started adding ‘place’ and ‘face’ information (where a photo was taken, and who is in it) using the built-in tools.

We then may select some of these to be published on our website. We used to do this in custom software built into our then blogging platform, but now we use Flickr.

The photos on the NAS are backed up to online storage (using JungleDisk, which layers over Amazon S3) on a weekly basis. So that is essentially two local copies, and one remote.

Pictures are taken as JPEGs and stored in that format. I haven’t got a plan for what happens when the standard image format moves away from JPEG – I guess we’ll have to wait and see what happens.

Music
Also on our NAS, and backed up online once a week. Organised by iTunes, but this time on our Mac Mini rather than the MacBook Pro. Files are a mix of AAC and MP3.

Video
Also on the NAS and backed up online once a week. Organised by iMovie on the MacBook Pro again. I think this is an area I’m going to have to revisit, as neither the MBP nor the Mac Mini really has enough disk space to comfortably store large amounts of video.

Sometimes I get round to producing some actual films from the video footage, and these are published to our website (just as video files) – I think I’ve only put one on YouTube. I have to admit I’m a bit fuzzy about the format – the camera records MPEG2, but I’m not sure what iMovie does to this on import. I tend to export any finished films as MPEG4.

Documents
Simply stored on the NAS with weekly online backups. Stuff obviously gets put on the MacBook Pro at various times, but I’m pretty good at making sure anything key goes back on the NAS.

I guess that this blog is the other key ‘document’ store – and at the moment I only have a very vague backup policy for this – I do have a snapshot from a couple of months ago stored as web pages on our NAS (and therefore backed up online).

Conclusions

In some ways the video and photos are our biggest problem. However the fact we are already doing some selection should actually make preservation easier I think. We are already ‘curating’ our collections when we decide what goes online, or in a film. It would make sense to focus preservation activities on these parts of the collection – and much cheaper to do as well.

Probably the least ‘curated’ part of our collection is Documents – this contains just about everything I’ve done over the last 10 years – including huge email archives, and documents on every project I’ve been involved in since about 1998. I haven’t deleted much, and every time I think about pruning it, I realise I don’t know where to start, and besides, compared to the video it hardly takes up any space.

The areas I feel I need to look at are:

  • File formats – are we using the most sensible file formats, check what we use for video
  • Migration strategies – how would I move file formats in the future
  • Curation strategies – should we focus only on the parts of the collection we really care about?
  • What to do about blogs?

What I really don’t believe to be the answer is (as Robert Scoble suggests, and as came up in Giles Turnbull’s Bathcamp talk) ‘print it out’.

Bathcamp

Last weekend, I went to Bathcamp, a barcamp style event, but slightly unusual as it actually included camping. Although I don’t live particularly close to Bath, I knew several of the people involved – mainly via Twitter (at least initially).

After I booked, I suddenly had the idea that rather than drive down to Bath, I could instead do a combination of cycling and taking the train. I had one day’s holiday to take before the end of September, so I decided to set off on Friday morning aiming to get to the campsite in time to get my tent up before the sun went down.

I set off slightly late after a last minute search for the keys to my bike lock, and headed from Leamington Spa down to Moreton-in-Marsh. I was aiming to get to Moreton in time to get the 10:48 train – I had just under 3 hours to do about 25 miles. As I went along I tweeted – starting with this tweet. What I wasn’t aware of was a whole other twitter conversation going on around me.

Unfortunately I made it to Moreton-in-Marsh just in time to see the train I wanted pulling out. So, I stopped for an early lunch (BLT and Chips) in the local pub, and got the 12:48 train to Bath. I’d originally intended to go to Chippenham by train and cycle from there, but I decided I might not make it to the campsite before sundown, and that going to Bath was a safer bet. I tweeted that I was going that way, and got an offer of some company for a bit of the way from Andy Powell – which was extremely welcome as he was able to show me a canal-side route that avoided the huge hill outside Bath.

The weekend included a huge variety of talks – from an introduction to jQuery to Libraries (me), from HTML Email to making music with Ableton Live, as well as films, live music, barbecued dinner and breakfast and the odd sip of cider.

A couple of the talks I managed to make some reasonable notes about – and it surprised me they were both very relevant to my work. The first one was by Giles Turnbull, and was about the use of URL shorteners – Giles said that he was responsible for the original idea which, with some help from other people, became makeashorterlink.com. Giles described how they really didn’t anticipate the level of abuse that the service would get from spammers. However, despite this they kept it going for a couple of years. Then for various reasons – changes in lives and locations – they decided they could no longer maintain the service – they asked if anyone wanted to take over the service, and the fledgling service TinyURL took it over.

The issue that Giles wanted to highlight was that really the service relied on the enthusiasm of a few individuals – and he felt that this was essentially true of all online services. This, combined with the experience of finding old papers belonging to his step father (I think), made him realise how ephemeral what he put online was compared to paper. He said he was excited by the idea of Newspaperclub, which is a service (currently in Alpha) to create a printed ‘newspaper’ from your online content – something you can keep, or give as a gift.

I’m not convinced by this – the solution to digital preservation can’t ultimately be to print it all out – and as Cameron Neylon pointed out, this is a form of caching rather than preservation – online content isn’t like printed content.

Giles’ talk provoked some discussion – but mainly about the longevity and economic viability of various Internet companies – which for me isn’t the heart of the problem. Even if companies survive, the question of how my grandchildren will access, say, my photos saved as JPEGs is far more of an issue.

The second talk I took notes from was by Chris Leonard from BioMedCentral (a bit hard to believe at this point, but this really wasn’t a library conference!). Chris spoke about how scientific publishing was gradually creeping outside the journal – to blogs, video and other media – but that it was difficult to keep track, and also difficult for scientists to be ‘rewarded’ for these routes of publication (in the way they are recognised and/or cited when they publish in journals).

Chris suggested an approach like that taken by Friendfeed or Faculty of 1000, which I’ve not come across before. He listed some pros and cons of these different services and suggested that what was needed was a service:

  • that is open and free
  • uses metrics to motivate contributors (RAE-worthy metrics)
  • rewards contributors for their efforts
  • archives contribution and discussions – making them citable

Chris suggested this approach would mean:

  • Scientists whose work is not suited to being shoehorned into a pdf may no longer need to write an article
  • The interconnected web of data could lead to new ‘article’ types
  • Unpublished research could reach a wider audience [where it is merited] and discredit crackpots

He suggested that “Peer-review Lite” should be able to sort the wheat from the chaff – if not replace the usefulness of traditional peer-review.

I think Chris is right there is a need to look at new forms of publication and how the effort put into these is recognised and rewarded. However, I also think this is a big challenge – it means changing attitudes towards how academic discourse is conducted, which will be hard to do.

On Sunday morning I skipped out early to cycle back to Bath – a beautiful ride across country, and then along the Kennet and Avon canal – and took the train home. Thanks again to all who organised, especially Mike Ellis, and all those who sponsored an excellent event.

Presenting Telstar

A few years ago at the Hay-on-Wye literary festival I went to see Lawrence Lessig present on copyright law (I know how to have a good time!). It was a transformational experience – not in my view of copyright and intellectual property (although he had very interesting things to say about that), but in my understanding of how you could use Powerpoint to illustrate a speech. As you can see from my later comment on the eFoundation’s blog – I was left both amazed and jealous. If you want to see a version of this presentation by Lessig (which is well worth it for the content alone) you can see his TED talk.

I think I was an OK presenter, and I don’t think I was particularly guilty of just reading out the slides – but I would definitely say my slides tended to be text and bulletpoint heavy. To illustrate – this is a reasonably typical presentation from that time:

Lessig’s example really made me want to change how I approached using slides. Going back to my desk, and browsing the web, I came across the Presentation Zen blog, and from there Garr Reynolds’ tips on using slides. On the latter site I remember particularly being struck by the example under tip 2 (Limit bullet points and text), where the point that the presenter wants to communicate is “72% of part-time workers in Japan are women” (I have no idea if this is true by the way). The immediate impact of the slide that simply had the characters 72% on it in a huge font was something I really noticed. This led to my style evolving, and you can hopefully see the difference in a more recent presentation I did on ‘Resource Discovery Infrastructure’

I’m definitely happier with this latter set of slides, but there are some issues. Without me actually talking, the second set of slides have a lot less meaning than the first. I’ve also found that sometimes I end up stretching for a visual metaphor, and end up with pictures that only tangentially relate to what I’m saying (I find signposts particularly flexible as a visual metaphor). In some cases the pictures became just something to look at while I talked.

So, when I had the opportunity to present a paper on the project I’m currently working on (Telstar) at ALT-C, and they actually mentioned Lawrence Lessig in their ‘guidelines for speakers’, I decided I wanted to try something slightly more ambitious (actually the guidelines for speakers wound me up a bit, since they included a suggested limit of 6 slides for a 12 minute talk – this may have influenced what happened next).

I wanted to really have a slideshow that would punctuate my talk, give emphasis to the things I wanted to say, catch the attention of the audience, and try out a few things I’d had floating around my head for a while. So I went to town. I ended up with 159 slides to deliver in 12 minutes (it actually took me more like 10 minutes on the day).

The whole process of putting together the slideshow was extremely frustrating and took a long time – for a 12 minute talk it took several days to put the presentation together – and writing the talk was not more than half that. Powerpoint is simply not designed to work well in this way – all kinds of things frustrated me. An integration with Flickr would be nice for a start. Then the ability to standardise a size and position for inserted pictures. Positioning guides when dragging elements around the slide (Keynote has had this for years, and I think the latest version of Powerpoint does as well). Basic things like the ability to give a title to a slide (so it shows in the outline view) without having to actually add text to the slide itself. A much better ‘notes’ editing interface.

I also realised how closely I was going to have to script the talk. This isn’t how I’ve normally worked in the past. Although I’d have a script for rehearsal, by the time I spoke I would be down to basic notes and extemporise around these. This works if you basically have a ‘point per slide’ approach – but not when you have slides that are intended (for example) to display the word you are saying right at the moment you say it – in that instance if you use a synonym, the whole effect is lost (or mislaid).

So, after I’d got my script, and my slides, I started to rehearse. Again, syncing the slides so closely to what I was saying was an issue – I had to get it exactly right. I had a look at various ‘presenter’ programs available for the iPhone, thinking this could help, and came across some ‘autocue’ apps. I tried one of these, and after a bit of a struggle, got the text of my talk loaded (with indicators of where I was to move on the slides, using the word [click]). The autocue worked well, although I found having to control the speed, pause it etc. could be distracting – so I had to play around with the speed, and put in extra spacing, to try to make it as close to my natural pace of delivery as possible.

I recorded myself giving the presentation so I could load it on my iPod, listen to it, and rehearse along with it in the car. (I started recording myself presenting a few years ago and do find it really helpful in pointing up the places where I don’t actually know what I’m saying.)

Finally I was ready, and I gave the presentation to a polite audience in Manchester. How did it go? I’m not sure – I got some good questions, which I guess is a good sign. However, I did feel the tightly scripted talk, delivered with autocue, resulted in a much less relaxed and engaging presentation style – I didn’t really feel I connected with the audience, as I was too busy worrying about getting all the words right, making sure the autocue didn’t run away with me, and that I was clicking the mouse in all the right places! If you were there, I’d be interested in some honest feedback – was it all too much? Did it come across I was reading a script? What did you think? (I hope, at least, I managed to avoid falling foul of Sarah Horrigan’s 10 Powerpoint Commandments – although it may have been bad in several other ways)

I knew that when I came to put this presentation online it would be completely useless without the accompanying narration – so I decided I should record a version of the talk, with slides, to put online. This was a complete nightmare! Firstly I tried the built-in function in Powerpoint to ‘record a narration’. Unfortunately when you do this, Powerpoint ignores any automatic slide timings you have set – which were essential to some of the effects I wanted to achieve.

I then decided I’d do an ‘enhanced podcast’ – this is basically a podcast with pictures. I used GarageBand (on a Mac) to record my narration, while running the powerpoint on a separate machine. Once I’d done this, I exported all the slides from powerpoint to JPEG, and imported into GarageBand, and by hand, synced them to the presentation. This worked well, and I was really happy – right up until the point that I realised GarageBand automatically cropped all the images into a square – losing bits of the slides, including some of the branding I absolutely had to have on there. So that was another 2 hours down the drain.

I then thought about using ‘screen capture’ software to capture the slideshow while it played on the screen, and my narration at the same time. The first one I tried couldn’t keep up with the rapidly changing slides, and the second crashed almost before I started.

I finally decided that iMovie would be the easiest thing to do – I’d re-record the narration with GarageBand, and use the ability of iMovie to import stills and use them instead of video, syncing their duration with the narration track. It took several attempts (not least because the shortest time iMovie will display any image seems to be 0.2s – and I had some images that were timed to display for only 0.1s – I eventually had to give up on this, and settle for the 0.2s for each image, which means that there is a slightly long pause at one point in the presentation)

Overall I’m much more pleased with this recorded version than with the live performance – which I think lacked any ‘performance’. The autocue application worked really well when sitting in front of a computer talking into the microphone. There are still some issues – you may notice some interference on the track, which comes from my mobile phone interacting with some speakers I forgot to turn off. However I think it works well, and as a video (as opposed to a ‘slidecast’) it is more portable and distributable. It’s on YouTube, and there is also a downloadable version you can use on your PC, or your portable device.

Finally, once I’d put the video on YouTube, I was able to add Closed Captioning (using the free CaptionTube app – although not bug free) – and here, having the script written out was very helpful, and it wasn’t too difficult to add the subtitles (although I do worry whether some of them are on the screen just a bit too briefly).

Would I do it again? I suspect that I was a little guilty this time of putting style before substance – I’m pleased with the video output, but I felt the live presentation left something to be desired. Perhaps if I’d known the script better, and hadn’t been relying on the autocue to make sure I was keeping to it, it might have been better. But I guess it isn’t surprising that something that works on screen is going to be different to something that works on stage.

I think the other thing that I’ve realised is that although my powerpoint may be prettier, I’m probably still just an OK presenter. If I’ve got good content I do an OK job. Perhaps what I need is to look at how I present – my writing, and what you might call my stage presence I guess – after all, if I get that right, who is going to care about the slides?

Anyway, after all that, here it is – if you are interested…

I’d be interested to hear what you think …

Twitter – a walk in the park?

This week I’ve been at the ALT-C conference in Manchester. One of the most interesting and thought provoking talks I went to was by David White (@daveowhite) from Oxford, who talked about the concept of visitors and residents in the context of technology and online tools.

The work David and colleagues have done (the ISTHMUS project) suggests that, moving on from Prensky’s idea of ‘digital natives and immigrants’ (which David said had sadly been boiled down in popular thought to ‘old people just can’t do stuff’, even if that wasn’t exactly what Prensky said), it is useful to think in terms of visitors and residents.

Residents are those who live parts of their life online – their presence is persistent over time, even when they aren’t logged in. On the other hand, Visitors tend to log on, complete a task, and then log off, leaving no particular trace of their identity.

The Resident/Visitor concept isn’t meant to be a binary one – it is a continuum – we all display some level of both types of behaviour. Also, it may be that you are more ‘resident’ in some areas of your life or in some online environments, but more a ‘visitor’ in others.

I think the most powerful analogy David drew was to illustrate ‘resident’ behaviour as people milling round and picnicking in a park. They were ‘inhabiting’ the space – not solving a particular problem, or doing a particular task. It might be that they would talk to others, learn stuff, experience stuff etc., but this probably wasn’t their motivation in going to the park.

On the other hand, a visitor would treat an online environment in a much more functional manner – like a toolbox – they would go there to do a particular thing, and then get out.

David suggested that some online environments were more ‘residential’ than others – perhaps Twitter and Second Life both being examples – and that approaching these as a ‘visitor’ wasn’t likely to be a successful strategy. That wasn’t to pass judgement on the use or not of these tools – there’s nothing to say you have to use them.

David also noted that moving formal education into a residential environment wasn’t always easy – you can’t just turn up in a pub as a teacher and start teaching people (even if those same people are your students in a formal setting) – and the same is true online. An example was the different attitudes from two groups of students to their tutors when working in Second Life – in the first example the tutor had worked continually in SL with the students, and had successfully established their authority in the space. In the second example a tutor had only ‘popped in’ to SL occasionally, and tried to act with the same authority – which grated on the students.

At the heart of the presentation was the thesis that we need to look much more at the motivations and behaviours of people, not focus on the technology – a concept that David and others are trying to frame – currently under the phrase ‘post-technical’. Ian Truelove has done quite a good post on what post-technical is about.

Another point made was that setting up ‘residential’ environments could be extremely cheap – and you should think about this when both planning what to do and what your measures of ‘success’ are – think about the value you get in terms of your investment.

The points that David made came back to me in a session this morning on Digital Identity (run by Frances Bell, Josie Fraser, James Clay and Helen Keegan). I joined a group discussing Twitter, and some of the questions were about ‘how can I use Twitter in my teaching/education’. For me, a definite ‘resident’ on Twitter, this felt like an incongruous question. I started to think about it a bit more and realised there are ‘tool’-like aspects to Twitter:

  • Publication platform (albeit in a very restrictive format)
  • Ability to publish easily from mobile devices (with or without internet access)
  • Ability to repurpose outputs via RSS

This probably needs breaking down a bit more. But you can see that if you wanted to create a ‘news channel’ that you could easily update from anywhere, you could use Twitter, and push an RSS version of the stream to a web page etc. In this way, you can exploit the tool like aspects of Twitter – a very ‘visitor’ approach.
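As a sketch of that ‘tool’ use – turning an RSS feed of a Twitter stream (or any other feed) into an HTML fragment for embedding in a web page – something like this would do, with the feed URL as a placeholder:

use strict;
use warnings;
use LWP::Simple qw(get);
use XML::RSS;

# Placeholder - substitute the RSS feed for the Twitter account (or any other feed)
my $feed_url = 'http://example.org/newschannel.rss';

my $content = get($feed_url) or die "Could not fetch $feed_url\n";
my $rss = XML::RSS->new;
$rss->parse($content);

print "<ul>\n";
for my $item ( @{ $rss->{items} } ) {
    printf qq{  <li><a href="%s">%s</a></li>\n}, $item->{link}, $item->{title};
}
print "</ul>\n";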

However, I’d also say that if you want to do this kind of thing, there are probably better platforms than Twitter (or at least, equally good platforms) – perhaps the WordPress Microblog plugin that Joss Winn mentioned in his session on WordPress (another very interesting session).

For me, the strength of Twitter in particular is the network I’ve built up there (something reinforced by the conference as I met some of my Twitter contacts for the first time – such as @HallyMk1, who has posted a great reflection on the conference – although I should declare an interest – he says nice things about me). I can’t see that you can exploit this side of Twitter without accepting the need to become ‘resident’ to some degree. Of course, part of the issue then becomes whether there is any way you can exploit this type of informal environment for formal learning – my instinct is that this would be very difficult – but what you can do is facilitate for the community both informal learning and access to formal learning.

As an aside, one of the things that also came out of the Digital Identities session was that even ‘visitors’ have an online life – sometimes one they aren’t aware of – as friends/family/strangers post pictures of them (or write about them). We all leave traces online, even if we don’t behave as residents.

The final thread I want to pull on here is a phrase that was used and debated (especially, I think, in the F-ALT sessions): “it’s not about the technology”. This was certainly part of the point that David White made – that people’s motivations were much more important than any particular technology they would use to achieve their goals. He made the point that people who don’t use Twitter don’t avoid doing so because they aren’t capable, or don’t understand, they just don’t have the motivation to use it.

Martin Weller has posted on this and I think I agree with him when he says “I guess it depends on where you are coming from” – and I think the reason that the phrase got debated so much is that the audience at ALT-C is coming from many different places.

I’m guilty of liking the ‘shiny shiny’ stuff as much as any other iPhone-owning geek – but the thing that interests me in this context is what the impact is likely to be on education (or, more broadly to be honest, society) – I’m not in the position of being immediately concerned about how Twitter or iPhones or whatever else should be used in the classroom.

I do think that we need to keep an eye on how technology continues to change, because I think a very few technologies impact society to the extent that our answers need to change – but the question remains the same whatever: how are we going to (need to) change the way we educate to deal with the demands and requirements of society in the 21st Century?
