Last year I blogged about my entry into the JISC MOSAIC competition which I called ‘ReadtoLearn’. The basic idea of the application was that you could upload a list of ISBNs, and by using the JISC MOSAIC usage data the application would generate a list of course codes, search for those codes on the UCAS web catalogue, and return a list of institutions and courses that might be of interest to you, based on the ISBNs you had uploaded.
While I managed to get enough done to enter the competition, I had quite a long ‘to do’ list at the point I submitted the entry.
The key issues I had were:
- You could only submit ISBNs by uploading a file (using HTTP POST)
- The results were only available as an ugly HTML page
- It was slow
Recently I’ve managed to find some time to go back to the application, and have now added some extra functionality, and also managed to speed up the application slightly (although it still takes a while to process larger sets of ISBNs).
Another issue I noted at the time was that because “the MOSAIC data set only has information from the University of Huddersfield, the likelihood of matching any particular ISBN is relatively low”. I’m happy to say that the usage data that the application uses (via an API provided by Dave Pattern) has been expanded by a contribution from the University of Lincoln.
One of the biggest questions for the application is where a potential user would get a relevant list of ISBNs from in the first place (if they even know what an ISBN is). I’m still looking at this, but I’ve updated the application so there are now three ways of getting ISBNs into it. The previous file upload still works. A comma-separated list of ISBNs can now also be submitted to the application (using HTTP GET). Finally, the URL of a webpage (or RSS feed etc.) containing ISBNs can be submitted, and the ISBNs will be extracted using regular expressions (slower, but it gives a very generic way of getting ISBNs into the application). I would like to look at further mechanisms such as harvesting ISBNs from an Amazon wishlist or order history, or a LibraryThing account, but for the moment you could submit a URL and the regular expression should do the rest.
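As a rough illustration of the regular-expression approach, here is a minimal Python sketch. The pattern and the function name extract_isbns are assumptions made for this example, not the code the application actually uses, and it only handles ISBNs written without hyphens or spaces.

```python
import re

# Assumed pattern (not taken from the real script): a 13-digit ISBN starting
# 978 or 979, or a 10-character ISBN whose final character may be X.
# The lookarounds stop us matching part of a longer run of digits or letters.
ISBN_PATTERN = re.compile(
    r"(?<![0-9A-Za-z])(97[89][0-9]{10}|[0-9]{9}[0-9Xx])(?![0-9A-Za-z])"
)


def extract_isbns(text):
    """Return the unique ISBN-like strings in text, in order of first appearance."""
    found = []
    for match in ISBN_PATTERN.finditer(text):
        isbn = match.group(1).upper()
        if isbn not in found:
            found.append(isbn)
    return found


if __name__ == "__main__":
    page = "Reading list: 0003271331, 0003271323 and 9780552770880 (0003271331 again)"
    print(extract_isbns(page))
```

A pattern like this will also pick up any other 10 or 13 digit number on the page, so checking the ISBN check digit would be a sensible extra step.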
Rather than the old HTML output, I’ve now made the results available as XML instead. Although this is not pretty (obviously), it does mean that others can use the application to generate lists of institutions/courses if they want. On my to do list now is to use my own XML to generate a nice HTML page (eating your own dog food I think they call it!).
I also restructured the application a little, and split it into two scripts (which allowed me to also provide a UCAS code lookup script separately).
Finally, one issue with the general idea of the application was the question of how much of an overlap with the books borrowed by users on a specific course should lead to a recommendation. For example, if 4 ISBNs from your uploaded list turned out to all have been borrowed by users on courses with the code ‘W300’, should this constitute a recommendation to take a W300 course? My solution was to offer two ‘match’ options. The first was to find ‘all’ matches – this meant that even a single ISBN related to a course code would result in you getting a recommendation for that course code. The second option was to ‘find close matches only’ – this only recommended a course code to you if the number of ISBNs you matched was at least 1% of the total ISBNs related to that course code in the usage data. I decided I would generalise this a bit, so you can now specify the percentage of overlap you are looking for (although experience suggests that this is going to be low based on the current data – perhaps less than 1%).
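The overlap test described above can be sketched in a few lines of Python. The function name recommend and the two dictionaries are hypothetical, chosen for illustration; the real script may well structure this differently.

```python
def recommend(your_related, total_related, match="all"):
    """Return the course codes that meet the requested overlap.

    your_related:  course code -> number of submitted ISBNs related to that code
    total_related: course code -> total ISBNs related to that code in the usage data
    match:         'all', or a percentage greater than 0 and up to 100
    (All three names are assumptions made for this sketch.)
    """
    match_all = str(match).lower() == "all"
    recommended = []
    for code, matched in your_related.items():
        if matched < 1:
            continue  # no submitted ISBN is related to this course code
        if match_all:
            recommended.append(code)
            continue
        overlap = 100.0 * matched / total_related[code]
        if overlap >= float(match):
            recommended.append(code)
    return recommended


if __name__ == "__main__":
    # 3 matched ISBNs out of 385 related to the code is roughly a 0.78% overlap
    print(recommend({"V1X1": 3}, {"V1X1": 385}, match=0.5))
    print(recommend({"V1X1": 3}, {"V1X1": 385}, match=1))
```

With 3 matched ISBNs out of 385, a course code would pass a 0.5% threshold but not the original 1% ‘close matches’ threshold, which is why low values are likely to be needed with the current data.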
So, the details are:
Application URL:
http://www.meanboyfriend.com/readtolearn/studysuggest
GET Parameters:
match
Values: ‘All’, or a number greater than 0 and no more than 100
Definition: The percentage overlap that will constitute a ‘recommendation’, measured as the ISBNs in the submitted list related to a course code against the total ISBNs related to that course code. ‘All’ will retrieve all courses where at least one ISBN has been matched.
isbns
Values: a comma-separated list of 10 or 13 digit ISBNs
url
Values: a URL-encoded URL (include ‘http etc.’) of a page/feed which includes ISBNs. ISBNs will be extracted using a regular expression. (See http://www.blooberry.com/indexdot/html/topics/urlencoding.htm for information on URL encoding)
If both isbns and url parameters are submitted, all ISBNs from the list and the specified webpage will be used.
Example:
An example request to the script could be:
http://www.meanboyfriend.com/readtolearn/studysuggest?match=0.5&isbns=0722177755,0722177763,0552770884,043999358,0070185662,0003271323,0003271331,0003272788
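If you are building requests like this from code, something along the following lines should do it. This is only a sketch: build_request_url is a hypothetical helper, and it relies on Python’s standard urllib.parse to take care of URL-encoding the url parameter.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.meanboyfriend.com/readtolearn/studysuggest"


def build_request_url(match="All", isbns=None, url=None):
    """Build a GET request URL from the documented match, isbns and url parameters."""
    params = {"match": str(match)}
    if isbns:
        params["isbns"] = ",".join(isbns)
    if url:
        # urlencode escapes the ?, & and = inside the submitted URL, so they
        # are not mistaken for part of the application's own query string
        params["url"] = url
    # safe="," keeps the commas in the ISBN list readable rather than %2C
    return BASE_URL + "?" + urlencode(params, safe=",")


if __name__ == "__main__":
    print(build_request_url(0.5, isbns=["0722177755", "0722177763", "0552770884"]))
    print(build_request_url("all", url="http://example.org/list.aspx?page=1&q=java"))
```

The example.org address is just a placeholder for whatever page or feed holds your ISBNs.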
Response:
The response is XML with the following structure (this is an example with a single course code):
<study_recommendations>
  <course type="ucas" code="V1X1" ignore="No" total_related="385" your_related="3">
    <items>
      <item isbn="0003271331"></item>
      <item isbn="0003271323"></item>
      <item isbn="0003272788"></item>
    </items>
    <catalog>
      <provider>
        <identifier>S84</identifier>
        <title>University of Sunderland</title>
        <url>http://www.ucas.com/students/choosingcourses/choosinguni/instguide/s/s84</url>
        <course>
          <identifier>997677</identifier>
          <title>History with TESOL</title>
          <url>http://search.ucas.com/cgi-bin/hsrun/search/search/StateId/Dhh-QG8Bhe33Egpbb227I8OPTGQUw-VTyY/HAHTpage/search.HsDetails.run?n=997677</url>
        </course>
      </provider>
      <provider>
        <identifier>H36</identifier>
        <title>University of Hertfordshire</title>
        <url>http://www.ucas.com/students/choosingcourses/choosinguni/instguide/h/h36</url>
        <course>
          <identifier>971629</identifier>
          <title>History with English Language Teaching (ELT)</title>
          <url>http://search.ucas.com/cgi-bin/hsrun/search/search/StateId/Dhh-QG8Bhe33Egpbb227I8OPTGQUw-VTyY/HAHTpage/search.HsDetails.run?n=971629</url>
        </course>
      </provider>
    </catalog>
  </course>
</study_recommendations>
The ‘catalog’ element essentially copies the data structure from XCRI-CAP which I’ve documented in my previous post – I’m not using this namespace at the moment, but I may come back to this when I have time. The ‘course’ and ‘provider’ elements can both be repeated.
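For anyone wanting to consume the XML, here is a minimal sketch of reading the structure above with Python’s standard xml.etree.ElementTree. The function name parse_recommendations and the shape of the returned dictionaries are assumptions for illustration; only the element and attribute names come from the example response.

```python
import xml.etree.ElementTree as ET


def parse_recommendations(xml_text):
    """Turn a study_recommendations document into a list of plain dictionaries."""
    root = ET.fromstring(xml_text)
    results = []
    # Only direct <course> children of the root are recommendations; the
    # <course> elements nested under <provider> are the individual offerings.
    for course in root.findall("course"):
        entry = {
            "code": course.get("code"),
            "total_related": int(course.get("total_related")),
            "your_related": int(course.get("your_related")),
            "isbns": [item.get("isbn") for item in course.findall("items/item")],
            "offerings": [],
        }
        for provider in course.findall("catalog/provider"):
            for offered in provider.findall("course"):
                entry["offerings"].append({
                    "provider": provider.findtext("title"),
                    "course": offered.findtext("title"),
                    "url": offered.findtext("url"),
                })
        results.append(entry)
    return results
```

Each returned entry carries the course code, the two counts needed to work out the overlap, the matched ISBNs, and one offering per provider/course pair.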
If you are interested in using it please do, and drop me a comment here if you have examples, or suggestions for further improvements.
Hi Owen
I’ve been trying this with our public library catalogue, using the url method, rather than isbns. I tried the following –
http://www.meanboyfriend.com/readtolearn/studysuggest?match=0.5&url=http://www.surreylibraries.org/02_Catalogue/02_004_TitleResults.aspx?page=1&searchTerm=java&searchType=99&searchTerm2=&media=&branch=&authority=&language=&junior=&referrer=02_001_Search.aspx
I didn’t get any recommendations. I suppose this could just be because the stock in our libraries is understandably less academic than Huddersfield and Lincoln and therefore can’t match based on the ISBNs in the original dataset. I’d really like to see how this works with our catalogue, so I’ll have to think of another plan of attack for identifying which ISBNs in the dataset you’re using match ISBNs in our system. Think I’ll have to do some comparisons.
I couldn’t try it with any items I’ve got on order, or with a list of titles I borrowed, as the url just presents a generic aspx url, without details.
It is an interesting idea and I would like to see it working with some of our real world data.
OK – realised that this is because the URL needs to be encoded – otherwise the ampersands (and other characters) in the submitted URL are mistaken for part of the application’s URL. I’ve now updated the script – you have to submit an encoded URL. If you try:
http://www.meanboyfriend.com/readtolearn/studysuggest?match=all&url=http://www.surreylibraries.org/02_Catalogue/02_004_TitleResults.aspx%3Fpage%3D1%26searchTerm%3Djava%26searchType%3D99%26searchTerm2%3D%26media%3D%26branch%3D%26authority%3D%26language%3D%26junior%3D%26referrer%3D02_001_Search.aspx
You should find this works (note I had to use ‘match=all’, as a 0.5% overlap doesn’t get you anything).
I’ll look at adding some better error checking/responses so you can more easily see whether it was that no ISBNs were found in the page, or if they were found, but there were no matches.
Thanks Owen.
I’ll have a look at which other parts of our catalogue this will work on.
I’ve added in a bit more feedback in the XML file so you can (hopefully) see what is happening, even when no courses have been recommended. At the top of the XML file is a new element, which contains a list of all the items (with ISBNs) that were used by the script. This list includes any errors encountered when trying to retrieve information on the ISBN from the MOSAIC usage data.
I’ve also included any errors from attempts to search the UCAS course catalogue for course codes in an element (within the top level element).
Hi Owen
Following on from our recent chat here’s some thoughts on the results Read To Learn is returning. If I was someone considering a course, I’d be interested in where the course was and a brief description of it. I wouldn’t be worried about course/provider codes at this stage. It’s handy having the link to the course.
I tried it with this url. It worked on the 2nd attempt, but was very slow. First attempt timed out. Not sure where the speed problem is.
http://www.meanboyfriend.com/readtolearn/studysuggest?match=all&url=http://www.surreylibraries.org/05_ReadingLists/05_004_ReadingListTitleResults.aspx%3Ftype%3DLIST%26code%3DEATING%20DISORDERS
I like the idea of this. It could be very useful.
Thanks Gary,
The speed problem you are seeing is because one of the courses matched by the ISBNs is course code L500. In the MOSAIC dataset there are over 40,000 rows related to this. My script gets the information from the MOSAIC dataset using the URL:
http://library.hud.ac.uk/mosaic/api.pl?isbn=&ucas=L500&show=minimum&prog=&years=&rows=
If you follow this, you’ll find it is slow to respond. However, what I’ve found testing this evening is that the speed I’m seeing from the server on which I’m hosting the script is much much slower than I get when running it off my laptop at home! At home it takes about 15 seconds to respond, whereas on the server it takes around a minute! I’ve no idea why – I’ll see if Dave Pattern at Huddersfield has any ideas.
In terms of what information about each course is returned by my script: at the moment I grab this information from the course search results page on the UCAS website, and to be honest there is little point in me not returning any of it, as it doesn’t cost anything to do. I absolutely agree that a short description of the course would be useful, but the UCAS website doesn’t actually include any useful information here (if there is a link somewhere to more course information it is back on the institution’s website)!
With some extra effort I could grab the ‘mode of study’ (that is full-time or part-time), the level of qualification (BA Hon etc.) and the length (in years) from the UCAS results page if that would be of any use? I think that’s about it though.
I know there is some work going around using a format called XCRI-CAP to enable institutions to publish their course catalogues in a ‘harvestable’ format, and there is a demonstration ‘aggregator’ which harvests XCRI-CAP data and offers a unified course catalogue – which would make it easier to do all of this – but at the moment only a handful of institutions are doing this 🙁
Thanks Owen.
It seems bizarre that the server runs slower than the laptop, but that’s the fun of computers, isn’t it? Not entirely predictable.
If L500 has so many rows I can see why it’s so slow.
If it doesn’t add any overheads, then I agree it makes sense to just keep on returning what you are already returning in the XML. I was just thinking from the point of view of a possible student. Shame about the course description – some of the links to the courses do provide a bit of extra info, but it’s not consistent. If XCRI-CAP took off that would definitely help.
Thanks.