Scraping, scripting and hacking your way to API-less data

Mike Ellis from eduserv talking about getting data out of web pages.

Scraping – basically allows you to extract data from web pages – and then you can do stuff with it! Some helpful tools for scraping:

  • Yahoo!Pipes
  • Google Docs – use of the importHTML() function to bring in data, and then manipulate it
  • dapper.net (also mentioned by Brendan Dawes)
  • YQL
  • httrack – copy an entire website so you can do local processing
  • hacked search – use Yahoo! search to search within a domain – essentially allows you to crawl a single domain and then extract data via search
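The hosted tools above do the heavy lifting for you, but the underlying idea is simple enough to sketch in a few lines of Python (my sketch, not from the talk – the markup and cell contents are invented):

```python
import re

def extract_cells(html):
    """Return the text content of every <td> cell in an HTML string.

    Crude regex-based scraping -- fine for tidy pages, fragile otherwise.
    """
    cells = re.findall(r"<td[^>]*>(.*?)</td>", html, re.S)
    return [re.sub(r"<[^>]+>", "", c).strip() for c in cells]

page = "<table><tr><td>Title</td><td><b>Loans</b></td></tr></table>"
print(extract_cells(page))  # ['Title', 'Loans']
```

For anything beyond tidy tables you'd want a real parser, which is exactly the gap tools like YQL and dapper.net fill.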

So, once you’ve scraped your data, you need some tools to ‘mung’ it (i.e. manipulate it)

  • regex – regular expressions are hugely powerful, although can be complex – see some examples at http://mashedlibrary.ning.com/forum/topics/extracting-isbns-from-rss
  • find/replace – can use any scripting language, but you can even use Word (I like to use Textpad)
  • mail merge (!) – if you have data in Excel, Access, CSV etc. you can use mail merge to output it with other information – e.g. HTML
  • html removal – various functions available
  • html tidy – http://tidy.sourceforge.net – can chuck in ‘dirty’ html – e.g. cut and pasted from Word – and tidy it up
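The ISBN-extraction example linked above is a nice illustration of regex munging – here is a rough Python equivalent (my sketch; the sample strings are invented and checksums aren't validated):

```python
import re

def find_isbns(text):
    """Pull candidate ISBN-10/13 strings out of free text.

    Two-step approach: grab runs of digits, hyphens and a possible
    trailing X, then keep those that are 10 or 13 digits long once
    hyphens are stripped.  No checksum validation -- just a sketch.
    """
    candidates = re.findall(r"\d[\d\-]*[\dX]", text)
    return [c for c in candidates if len(c.replace("-", "")) in (10, 13)]

feed = "New items: 978-0-297-84540-2 (hbk) and 0-571-08989-5 (pbk), 2009."
print(find_isbns(feed))  # ['978-0-297-84540-2', '0-571-08989-5']
```

The length check weeds out years and other stray numbers without needing a cleverer (and harder to read) single regex.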

Processing data:

  • Open Calais – service from Reuters that analyses a block of text for ‘meaning’ – e.g. if it recognises the name of a city it can give information about the city such as latitude/longitude etc.
  • Yahoo!Term Extraction – similar to Open Calais – submit text/data and get back various terms – also allows tuning so that you can get back more relevant results
  • Yahoo!geo – a set of Yahoo tools for processing geographic data – http://developer.yahoo.com/geo
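Open Calais and Term Extraction are hosted services, but the simplest version of the idea – pulling ‘interesting’ terms out of a block of text – can be faked locally. A toy stand-in (mine, not either service's API; the stopword list is deliberately tiny):

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "for"}

def extract_terms(text, n=5):
    """Return the n most frequent non-stopword terms in a block of text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [term for term, _ in counts.most_common(n)]

print(extract_terms("The library in the city is a library", 2))  # ['library', 'city']
```

The real services add entity recognition and tuning on top, which is why you'd use them rather than word counts for anything serious.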

The ugly sisters:

  • Access and Excel – don’t dismiss these! They are actually pretty powerful

Last resorts:

Somewhere I have never travelled

This presentation by Brendan Dawes – http://www.brendandawes.com/ (powered by WordPress)

Brendan quite into data – “data porn” – visualising data. Saying that much of the web is still designed as if it’s in print.

Making ‘weird creatures’ out of keywords http://www.brendandawes.com/?s=redux – the ‘creatures’ size indicates popularity, and the speed they move depends on age – but this stuff doesn’t come with an instruction manual – there is nowhere that these links between data and behaviour are documented for the ‘end user’ – Brendan just puts it out there, and tries it out.

‘Interfaces’ are important – Brendan likes to collect ideas in ‘Field Notes’ books – http://fieldnotesbrand.com/. Also has a firewire drive full of ‘doodles’ as his ‘digital notebook’ – just bits and pieces of stuff that may do one thing – e.g. a drawing app that allows you to draw things in black ink – that sat there for ages, he did nothing with it. Then had an idea that he wanted to be able to put stuff on lines that he had drawn – found something that someone else had done online – and he put that in his digital notebook.

Brendan wanted to do something with http://www.daylife.com/

(aside – When you design stuff for people, avoid colours – as people can dump a perfectly good idea if you’ve done it in the wrong colour! Use black and white, because it doesn’t upset anyone 🙂)

What would happen if we removed interfaces completely? Allowed people to build their own interface?

So – all of these bits and pieces came together as http://doodlebuzz.com/ – allows you to do a search – then you draw a line to see the results displayed.

Memoryshare – a BBC project to share memories. Original version had a rather dull interface – didn’t engage people, so not very good usage – although the content is very compelling when you start reading. Brendan and team did a range of prototypes – very open brief – basically do anything you want.

Took ideas done with the Daylife example – displaying time-based events on a spiral line – great ‘wow’ moment when you see the spiral on the screen, and then as you zoom in it becomes obvious that it is a 3d environment – very, very pretty! The original demo was in Flash, which couldn’t cope with the amount of data in Memoryshare – but the BBC really liked the design, so figured out how to do it – see the results at http://www.bbc.co.uk/dna/memoryshare/ – compare this to the old design at the Internet Archive Wayback Machine.

Brendan now moving onto using data to produce physical objects – mentioned a site I didn’t get (Update: thanks to @nicoleharris got this now http://www.ponoko.com/make-and-sell/how-to-make) that allows you to upload a design and get it made – so for example Brendan has had some wooden luggage tags made with data displayed on them. Moo.com has an API – you can pump data in and get physical objects out. Brendan has written something that takes data from wefeelfine.org and pushes it to moo.com to make cards – transferring transient digital data into a less transient physical form.

Visualisation

Iman Moradi is talking about how we organise library stock and spaces – he’s going through at quite a pace, so very brief notes again.

Finding things is complex

It’s a cliché that library users often remember the colour of a book more than the title – but why don’t we respond to this? Organise books by colour – example from Huddersfield town library.

Iman did a demonstrator – building a ‘quotes’ base for a book – use a pen scanner to scan a chunk of text from the book, and associate it with the book via ISBN – this starts to build a set of quotes from the book that people found ‘of interest’

Think about libraries in terms of games – users are ‘players’, the library is the ‘game environment’. Using libraries is like a game:

  • Activities = Finding, discovery, collection
  • Points/levels = acquiring knowledge

Mash Oop North

Today I’m at Mash Oop North aka #mashlib09 – and kicking off with a presentation from Dave Pattern – some very brief notes:

Making Library Data Work Harder

Dave Pattern – www.slideshare.net/daveyp/

Keyword suggestions – about 25% of keyword searches on Huddersfield OPAC give zero results.
Look at what people are typing in the keyword search – Huddersfield found ‘renew’ was a common search term – so can pop up an information box with information about renewing your books.

Looking at common keyword combinations can help people refine their searches
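A sketch of how keyword co-occurrence could drive refinement suggestions – the search log here is invented, and this is my guess at the approach, not Huddersfield's actual code:

```python
from collections import Counter

# Invented search log: one keyword search per entry
log = [
    "climate change policy",
    "climate change",
    "climate model",
    "renew",
]

def refine_suggestions(term, searches, n=3):
    """Suggest keywords that other users combined with the given term."""
    co = Counter()
    for s in searches:
        words = s.lower().split()
        if term in words:
            co.update(w for w in words if w != term)
    return [w for w, _ in co.most_common(n)]

print(refine_suggestions("climate", log))
```

The same log would also surface terms like ‘renew’ that deserve a help box rather than a result list.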

Borrowing suggestions – people who borrowed this item, also borrowed …
Tesco collects and exploits this data. Libraries sometimes assume we know what is best for our users – but perhaps we need to look at data to prove or disprove our assumptions

Because borrowing is driven by reading lists, this perhaps helps suggestions stay on-topic
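The co-occurrence idea behind ‘people who borrowed this also borrowed…’ fits in a few lines – this is a toy version with invented loan data, not Huddersfield's implementation:

```python
from collections import Counter

# Invented loan data: borrower -> set of items they borrowed
loans = {
    "u1": {"isbn-a", "isbn-b", "isbn-c"},
    "u2": {"isbn-a", "isbn-b"},
    "u3": {"isbn-b", "isbn-d"},
}

def also_borrowed(item, loans, n=2):
    """'People who borrowed this item also borrowed...' by co-occurrence."""
    co = Counter()
    for items in loans.values():
        if item in items:
            co.update(i for i in items if i != item)
    return [i for i, _ in co.most_common(n)]

print(also_borrowed("isbn-a", loans))  # ['isbn-b', 'isbn-c']
```

A production version would normalise for item popularity, but the raw counts already show why reading-list-driven borrowing keeps the suggestions on-topic.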

Course specific ‘new books’ list – based on what people on specific courses borrow
Able to do Amazon-style personalised suggestions

Borrowing profile for Huddersfield – the average number of books borrowed shows a very high peak in October and a lull during the summer – can now see the use of the suggestions following this, with a peak in November.

Seems to be a correlation between introduction of suggestions/recommendations with increase in borrowing – how could this be investigated further?

Started collecting e-journal data via SFX – starting to do journal recommendations based on usage.

Suggested scenario – can start seeding new students’ experience – the first time a student accesses the website, use the ‘average’ behaviour of students on the same course – so highly personalised. Also, if information is delivered via widgets it could be dragged and dropped to other environments.

JISC Mosaic project, looking at usage data (at National level I think?)

So – some ideas of stuff that you might do with usage data:

#1 Basic library account info:
Just your bog standard library options
– view items on loan, hold requests etc.
– renew items
Configure alerting options
– SMS, Facebook, Google Telepathy
Convert Karma
– rewards for sharing information/contributing to pool of data – perhaps swap karma points for free services/waiving fines etc.

#2 Discovery service
Single box for search

#3 Book recommendations
Students like book covers
Primarily a ‘we think you might be interested in’ service
Uses database of circulation transactions, augmented with Mosaic data
Time-relevant to the modules the student is taking
Adapts to choices the student makes over time

#4 New books
Data-mining of books borrowed by students on a course
Provide new books lists based on this information (already doing this at Huddersfield I think)

#5 Relevant Journals

#6 Relevant articles
– Whenever student interacts with library services e.g. keywords etc. – refines their profile

Quick Reference

The Telstar Project is looking at how to integrate references to resources into a VLE, making it as easy as possible for students to access the referenced resources, while encouraging students (and teachers?) to adopt good practice in referencing and citations – e.g. using an appropriate reference/citation style.

If you are immersed in the world of Higher education, and especially HE libraries, the above probably makes some kind of sense to you. However, as I have started to look at the problem I’ve realised that I’m not particularly consistent in the way I talk about references and resources, and that I sometimes want to make subtle distinctions between (what I see as) different types of references/resources. I want to try to establish some definitions, and air some of the distinctions I make in my own mind to see if they are really important, or whether I’m guilty of over complicating things.

To start with some definitions:

Resource

I started with a rather narrow view of a resource, but after discussion on Twitter I was easily persuaded that a ‘resource’ was essentially anything. The only caveat I’d add in this context is that you must be able to reference it – although I’m not sure if this is a necessary caveat (is there anything that can’t be referenced?). So my definition is this:

A resource is something that can be referenced.

In the context of teaching and learning materials common resources will be:

Books (print or electronic)
Journal articles (print or electronic)
Book Chapters (print or electronic)
Websites
Databases

Reference

I think my definition of a reference is relatively straightforward.

A Reference is a description of a resource to the extent that the resource could be discovered on the basis of the description.

Essentially a reference has to be enough for ‘the reader’ to be able to go and find the relevant resource.

Citation

I struggled a bit more with the definition of a citation. This was because I was actually trying to find a word for a different concept – something I’ll expand on below. This was clearly using the term citation in a way that wasn’t consistent with the common use. So, my current definition of a citation is:

A citation is an in-context pointer to a reference.

A citation would usually appear in a body of text where you might put a reference, but for the purposes of readability you simply put a pointer to a reference usually in a footnote or endnote to the text.
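One way to keep the three definitions straight is to model them directly – a sketch in Python (my framing of the definitions above, nothing to do with Telstar's actual data model; the identifier is a placeholder):

```python
from dataclasses import dataclass

@dataclass
class Resource:
    """Something that can be referenced -- deliberately open-ended."""
    kind: str        # 'book', 'article', 'website', ...
    identifier: str  # e.g. an ISBN, DOI or URL

@dataclass
class Reference:
    """Enough description of a resource for a reader to go and find it."""
    resource: Resource
    description: str  # author, title, year etc. in some citation style

@dataclass
class Citation:
    """An in-context pointer to a reference (e.g. a footnote marker)."""
    reference: Reference
    marker: str  # what appears in the text, e.g. '[3]'

book = Resource(kind="book", identifier="urn:isbn:0000000000")
ref = Reference(resource=book, description="Doe, J. (2009) Some Title.")
cite = Citation(reference=ref, marker="[1]")
print(cite.marker, "->", cite.reference.description)
```

The point of the three layers is that many citations can point at one reference, and many references can describe one resource.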

Other concepts

There is another distinction I find myself wanting to make, but I’m not sure if making these fine grained distinctions is useful or necessary – I’d be interested in comments on this concept:

Something that refers to a specific part (or aspect) of the thing that is referenced. A reference would tend to point at say a book or a chapter – would it be useful to have a term for when you refer to a specific part of a resource, when the reference points only at the general resource? If I directly quote from a resource, then I’m not just citing that resource, but a very specific bit of that resource. Does this make a difference?

A similar but slightly different thing is that there is a difference between wanting to point to a website as a general resource, and pointing to a website for the purposes of citation – in the latter case you would want to include the date that the website was accessed for the particular piece of information you are using.

Comments on the definitions, and any discussion of the latter points welcomed!

At one remove

You will have seen from my previous post that I’ve moved this blog recently. There were a few challenges associated with this which I want to document here, but perhaps the first thing to tackle is why I was moving the blog in the first place.

The domain www.meanboyfriend.com came about through a joke between me and my girlfriend (now wife), Damyanti, about what a mean boyfriend I was, and how she would publish a list of my misdemeanours on a website dedicated to this – meanboyfriend.com (at least, I think it was a joke). When we decided to set up a blog, buying the meanboyfriend.com domain seemed like a good punchline. I can’t remember now whether choosing our own domain name was the result of clear thinking about wanting to own the domain on which our stuff lived or not – but I think in retrospect it was a good decision (rather than simply using the URL provided by Typepad).

At the time (6 years ago) the Typepad blogging platform was getting good reviews, and Movable Type (which powered Typepad) was one of, if not the, leading blogging platforms. We set up a joint account with Typepad – because of the type of account we have, there is a single user – our joint account – and all entries on our blogs appeared to be by our amalgamated personality, damyantiandowen. The FOAF file that Typepad automatically creates for you was also for this joint identity. We are also limited to three blogs on the account.

Having sorted out the technical side, we set up our first blog – Overdue. This was a personal blog, aimed primarily at friends and family. We also used the Typepad photo facility to put up photos from holidays etc. To be honest, we’ve never been that great at updating the blog, although we use the photos a lot (and as a result have never really invested in Flickr, Picasa or other similar photo sharing services). After a hiatus of over a year covering the whole of 2008, we decided that we would try to refocus the blog on food/drink stuff – see our explanation at Foods for thought. As this is truly a ‘joint’ blog, entries appearing as authored by damyantiandowen are fine – and although each entry is generally written by one or the other of us, it feels like a joint venture.

Shortly after this, I decided that a professional blog would be useful to record thoughts and ideas relating to my work. I set this up as Overdue Ideas (see what I’m doing here?). This was mapped to a URL we owned, still under the meanboyfriend.com domain (http://www.meanboyfriend.com/overdue_ideas). Although in theory I would have been happy for this to be a joint blog (and I should acknowledge that many of my ideas and posts come out of conversations with Damyanti), in practice I was the only author. This made the joint account a bit of an issue – not on a day to day basis, but just occasionally. Last week I was contacted by someone wanting to quote Overdue Ideas, who was unsure whether the quote should be attributed to me or Damyanti. This confusion has happened more than once.

Some years passed with this being the basic situation – 1 account, 2 blogs, some photos and a few bits and pieces hosted on Typepad and appearing under http://www.meanboyfriend.com. As our account supports 3 blogs, we are currently using the 3rd blog as a protected file space (as far as I can tell, on Typepad you can only protect at the blog level, not individual files or pages) for stuff we only want to share with specific people.

So why move?

Over time, I feel that the Typepad application hasn’t quite kept up with the state of the art in blogging, and as I saw what others were achieving with WordPress I got some tech envy. WordPress supports a huge array of plugins, and Akismet seems to be state of the art as far as catching comment spam goes – although this wasn’t a massive problem on my blog, it was an irritant, with one or two spam messages a week to clear out (I should say that this was the stuff that got through – Typepad’s own spam filters caught a lot of spam that never made it to the blog).

Also, the issue of our con-fused (see Neal Stephenson) identities was an occasional issue – especially as discussions about online identity moved on I realised that we had a bit of a problem here. As well as the confusion for readers – who was actually writing this blog, and who had authored which post – there were other issues. Typepad automatically creates FOAF files – but for us, this was for our joint identity. Typepad also supports OpenID, but again we got one OpenID between the two of us.

The final push came when Damyanti wanted to set up her own blog – which would have taken us beyond our 3 blog limit.

One solution to much of this would have been to upgrade our Typepad account (from ‘Plus’ to ‘Pro’). This would have allowed us to have unlimited blogs, and unlimited authors. But in the end my techno-lust won the day – I wanted a bit more flexibility, and the ability to do other things (e.g. install other software).

It looked like it was time to move blogging platforms to support our separate identities and multiple blogs, and to satisfy my techno-lust. Having seen a number of people I know on Twitter mentioning Dreamhost, and getting some good feedback when I asked how it was working, I decided to go with them as a host. As I’ve already mentioned, I’d been admiring what people could do with WordPress – I was blown away by the iPhone theme that Joss Winn has on his blog (when viewed with an iPhone).

So – you are now reading this blog powered by WordPress, and hosted by Dreamhost. The move was slightly traumatic, but if I can, I’ll document this separately. If you are thinking of making a similar move (and are of the tech inclination) I’d recommend Rob Styles’ post on moving from Typepad to WordPress for information on dealing with redirecting URLs etc. – something I struggled with (and still haven’t completely dealt with).

Moving Type

I’m in the process of moving this blog. It is now powered by WordPress (rather than Typepad/Movable Type previously). Although I have migrated all the content, links to posts may currently be broken – I’m in the process of fixing these, but it may take me a couple of days – please be patient!

In the meantime I think all the RSS/Atom feeds are working OK and you shouldn’t need to do anything if you subscribe via one of the feeds.

Super! Mashing! Great!

If you like playing around with bibliographic and other library data (and let’s face it, who doesn’t?) then you are in for a good summer.

Two events to get into your diary now are the WorldCat Mashathon in Amsterdam on 13/14 May, and Mash Oop North (tag is mashlib09) in Huddersfield, UK, on 7th July.

The Worldcat Mashathon is an event organised by OCLC which promises access to data derived from the 1.2 billion records in Worldcat via a variety of web services. This event follows on from the previous Worldcat Hackathon held in New York City last year – to get a flavour of the event you can see a video summary on the Hackathon on YouTube. The OCLC Developer Network wiki has further details and registration.

Mash Oop North is the 2009 incarnation of Mashed Libraries UK. As the organiser of the previous mashlib08 event I can’t really comment on exactly how excellent it was, but Mash Oop North is being organised by Dave Pattern and others, so I can objectively say it is bound to be a brilliant event. To see the kind of thing that happened at mashlib08, and to keep up to date with news of Mash Oop North, keep an eye on http://mashedlibrary.ning.com. Mash Oop North is being sponsored by Talis, although if you are interested in supporting the event, you may want to consider donating a prize (if you aren’t sure what prize to offer, may I suggest a speedboat?)

[For a guide to the cultural references in this post, see http://en.wikipedia.org/wiki/Bullseye_(UK_game_show)]


JISC09 Closing Keynote – Ewan McIntosh

The closing keynote is from Ewan McIntosh, who is Digital Commissioner for 4iP – Channel 4’s Innovation for the Public Fund.

Ewan mentioning The Guardian’s Datastore (and reflecting that he wished ‘they’ (presumably Channel 4) had done it first!) – this is a collection of data which the Guardian compiles, and is now making available in ways that encourage reuse (although you have to understand the data to make sensible mashups) – you can see some examples from Tony Hirst on OUseful.info

Now mentioning ‘MySociety‘ and ‘Theyworkforyou‘ – noting how making data reusable opens up ways of interacting with the data and combining it to uncover new information. However, opening up data is difficult – example of European newspapers accusing Google of ‘stealing’ their information because it uses headlines from their websites – but Ewan noting that Google is driving traffic to the newspapers via this route.

“Free is a hard price to beat”

Mentioning John Houghton and Charles Oppenheim report on economic impact of Open Access – if you rethink the model then there are savings to be made.

“Destination anywhere”

4iP funding lots of projects. But lots of proposals start “X is a site which…” – they are thinking in terms of ‘destinations’ – and Universities are the ‘ultimate destination’. But most people visit only about 6 websites in a day – if you see your website as a destination, then you are saying you are going to compete with those 6 top websites – you are really going to struggle with this.

The VLE is a destination. The only reason students go there is because Universities ‘compel’ them to – it is the only place they can get the information they need. However, this results in students visiting and leaving as soon as they can.

“Participation culture”

Higher Education is not a participatory culture. On the web the current ‘top’ participatory environment is probably Facebook. Example of ‘Who has the biggest brain?’ – like brain training – but you play against others. 50 million players (in 6 months)

iMob – an iPhone game, a text-based strategy game.

Battlefront – a Channel 4 education project – via MySpace and Bebo – encourages young people to get involved in campaigning on issues they care about.

Ewan just said “Hands up if you are not currently twittering” (most of the room) – “you are doing nothing!”. Those twittering are participating – being much more cognitively active.

Ewan describing different ‘spaces’:

  • Watching spaces (tv, theatre, gigs)
  • Participation spaces (marches, meetings, markets)
  • Performing spaces (Second Life, WoW, Home)
  • Publishing spaces (Blogging, Flickr)
  • Group spaces (Bebo, Facebook)
  • Secret Spaces (Mobile, SMS, IM) – sounds like the ‘backchannel’?

The mobile phone is one of the most exciting developments in learning – Google Android and iPhone incredible platforms. But Universities not realising that students are leapfrogging tethered screens to go for mobile. Ewan suggests that the vast majority of students have mobile devices that access the internet – but does your university provide mobile services?

Ewan showing how if you represent his Facebook contacts graphically you can see how the contacts in Academia tend only to be connected to each other – it is a closed world.

“People don’t just do stuff because it’s in your business plan”

Ewan says “I don’t buy the Gen-Y stuff – the Google Generation, the Digital Natives”. It has nothing to do with being ‘young’ – but being ‘youthful’.

Parents think that young people spend about 18.8 hours per week online – but actually they spend an average 43.5 hours per week online – where is this missing time?

Don’t romanticise creativity – it isn’t easy. 90% of the audio-visual output that people consume comes from LA-based corporations – this is not ‘building on the shoulders of giants’.

Access to creative technology comes far too late for most children. Higher Education and JISC can apply pressure to the school sector to give access to, and make use of creative technology.

Ewan says anonymity is not a bad thing. Some examples where anonymity does not work – School of Everything, Landshare (both with money from Channel 4). However, some services only work with anonymity – e.g. Embarrassing Teenage Illnesses (also C4)

Ewan showing a grid that helps you think about startups – but he suggests it could also be used for University web services, or even other activities:

 

The grid has three columns – Visitor (just looks at stuff), Fan (will sign up but not create content), Contributor (uploads content, comments etc.) – and three rows, each with an associated timescale, with the cells left blank to fill in:

  • Grab the attention (+ timescale)
  • Keep the attention again and again (+ timescale)
  • Turn the value into a tangible asset (+ timescale)

Ewan encourages us to think about applying the grid to your own online offerings (wonder what this would look like for an OPAC?)


JISC09 – Moving from print to digital: e-theses highlight the issues

I’m chairing this session, so may be a bit difficult to blog (since I can’t see the screen from the front). The session goes from the international (DART), to the national (EThOS/EThOSNet), to the institutional (the From Entry to EThOS project at Kings College London)

First up, Chris Pressler (from the University of Nottingham) talking about DART:

DART-Europe – started as an 18 month project between a small group of academic institutions and Proquest. The first phase focussed on the creation of a simple search service for e-theses.

In the first phase the technology wasn’t too difficult, but some question about the business model. Proquest have a commercial service in the USA – but it didn’t seem suitable in Europe.

DART-Europe is now in the second phase, administered by Nottingham and UCL – it is no longer a project, but an ongoing service. All partners have a seat on the DART board (really, there is a DART board). Although a UK-led project, it has partners (and potential partners) from across Europe.

  • DART now providing access to over 100,000 full-text e-theses. The thesis records come from:
    • 34 data sources (national, consortial or institutional)
    • 13 countries
    • 150 institutions
  • Daily updates
  • Data collection using simple OAI Dublin Core – but MODS and MARC also supported. Took an extremely simple approach to metadata – just 5 pieces of information per thesis.
  • Takes a pragmatic outlook
    • aims to keep things simple – minimise barriers
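DART's ‘simple OAI Dublin Core’ approach means each thesis record is just a handful of dc: elements – the kind of thing the standard library can parse. A sketch with an inline record (the record is invented; a real harvest would fetch records like this over OAI-PMH):

```python
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"

# An invented simple Dublin Core record, inlined rather than fetched
record = """\
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>A Study of Something</dc:title>
  <dc:creator>Doe, Jane</dc:creator>
  <dc:date>2008</dc:date>
  <dc:type>Thesis</dc:type>
  <dc:identifier>http://example.org/thesis/123</dc:identifier>
</oai_dc:dc>
"""

root = ET.fromstring(record)
# Map '{namespace}title' tags back to readable 'dc:title' keys
thesis = {child.tag.replace(DC, "dc:"): child.text for child in root}
print(thesis["dc:title"])  # A Study of Something
```

Keeping to a handful of fields like this is what makes aggregating 34 heterogeneous data sources tractable.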

DART exposes theses to Google (wasn’t very clear how though?)

Although DART takes a simple approach, metadata still needs work.

DART now supports RSS, alerts, export results, multilingual interfaces, and provides usage statistics

How much does it cost to run DART? Not clear – need to look at this, and also benefits. Need to answer the question of whether this can run as an institutionally supported service.

DART-Europe has other technical interests – digital preservation, retrodigitisation…

Conclusions:

  • No dedicated funding means progress incremental – but has produced tangible results
  • Time to start marketing portal to academic community
  • DART-Europe provides a networking organisation for partners – not just about thesis issues

Next up EThOS/EThOSNet (declaration of interest, I’m the Project Director for EThOSNet):

EThOS aims

  • single point of access for UK HE Doctoral theses
  • Support HEIs in transition from print to electronic theses (via a toolkit)
  • digitise existing paper theses

Different participation options supported by EThOS

  • Open Access Sponsor – institution makes ‘up front’ payment to cover digitisation of a set number of theses
  • Associate Member Level 1 – institutions pays as it goes – each time a thesis is digitised, billed monthly
  • Associate Member Level 2 – the first researcher pays, then the digitised version available free
  • Associate Member Level 3 – EThOS simply routes the requester to the awarding institution (where the institution does not want EThOS to digitise theses)

EThOS takes an ‘opt-out’ approach – will put up theses without seeking author permission, but have strong rapid takedown policy so that if an author does not wish their thesis to be made available via EThOS it can be removed immediately.

98 UK HE institutions have signed up for EThOS.

Now Tracy Kent from University of Birmingham talking about the impact of EThOS on Birmingham.

  • University of Birmingham – is an Open Access Sponsor
  • From old ‘microfilm’ service, Birmingham used to supply 5-6 theses per week. In the first few weeks of EThOS going into public beta, providing 5-10 per day
  • University of Birmingham already had some theses in its institutional repository UBIRA – these are harvested by EThOS in order that they can be supplied via EThOS
  • Costs shifted from handling document supply requests to converting and loading e-theses into the repository, to facilitate ‘front loading’ of e-thesis content
  • University of Birmingham took the decision that if one of their users wanted a thesis from EThOS from a ‘Level 2’ member (i.e. the equivalent of ILL) then this would have to be covered from researchers’ budgets, not from the library ILL budget

Birmingham contacted about 500 authors – only 5 got in touch to say that they would not want to be part of EThOS. A further 10 said they’d like to be included but couldn’t because of publisher restrictions (i.e. they had published, or were going to publish)

Birmingham have a number of procedures in place to check theses before they go to be digitised, and believe that this due diligence approach combined with the EThOS rapid takedown policy means that they are acting in a responsible way – and so far they have had no requests for takedown from authors.

Birmingham have seen that once a thesis is on EThOS it is usually downloaded many times.

The service means that

  • Birmingham University thesis content is being seen and accessed
  • There is a changing role for document supply staff
  • There is a need to train authors to seek out necessary permissions and to ensure that submitted theses have the necessary permissions

Finally in the EThOS section Anthony Troman from the British Library. British Library run the EThOS service – they use a digitisation suite to digitise the paper theses, and make available to the end user by download, or (for additional payment) in other formats such as CD-ROM or paper.

Some questions that have come up:

  • Why not continue with microfilm service?
    • Requests for this service have been declining over the last few years – and was costing the BL large amounts of money
    • The system was not economically viable or sustainable
    • In 2 months of usage, 8,517 individual theses were requested for digitisation – well over a year’s worth under the microfilm service
    • In 2 months 17000 downloads
  • Popularity causing some problems with demand
    • New scanner installed
    • Double shifts – digitisation running 8am-midnight every day
  • Increase in quality between microfilm and digitised

Unfortunately this all costs money! However, a fundamental principle was that ideally theses should be free at the point of use. Unfortunately the popularity means that some institutions who have made an upfront contribution are already running short of funds – but there are several options for institutions in this situation and they should contact the BL to discuss them.

Once a thesis is digitised – no one has to pay again – not the institution or the researcher.

Finally (running late which as chair is my fault!) Patricia Methven and Vikas Deora from Kings talking about Entry to EThOS:

Patricia reflecting on how many different parts of the institution needed to be involved in the move to e-theses. Now Vikas saying that Entry to EThOS is about ‘born digital’ theses rather than digitisation.

At Kings e-thesis submission is not mandatory. The Exam Office was keen to test student takeup and to streamline administration. The library was keen to see born-digital deposit due to storage issues and EThOS participation as important drivers. Vikas says with feeling (as a PhD) “The last thing you want to do once you have finished your thesis is to go to a website and fill out hundreds of pieces of information”!

The project looked at creating an e-thesis submission workflow
– how to capture the metadata, integrate with existing workflows, integrate with the repository (Fedora in this case) etc.

Found the student record system to be a key source of data – this captures a lot of information such as the title of the thesis, names of tutors, and status of the student (e.g. writing up) – and the status of the student was seen as the driver for the workflow. Because the data is coming from within the institution, the Exam Office don’t need to do further checking – so there were real benefits for the Exam Office which came out of the project – you need to convince them that this is going to save them work!

Bibliographic services had concerns about the metadata – assigning subject headings and keywords etc. So the project tried to integrate this into the workflow, so that the library could still classify the theses. They harvest information back from the library system (e.g. subject headings) – they weren’t allowed to write into the library system (sounds like there is double entry going on here?)

Student doesn’t have to enter any information when they upload the thesis – just upload the pdf, check the information and it is submitted to the repository.

Kings recommend that the file the student submits is the ‘source’ file – e.g. Word doc or LaTeX etc. They can also submit PDF, or the conversion will be done for them – this allows for more flexibility in terms of long term preservation.

Literally takes 25-30 seconds for a student to submit an e-thesis. Vikas sees this as absolutely key.

What’s next?

  • Move from e-thesis to Virtual Research Environment
  • Policy decision with exam board – does e-submission become mandatory? (Vikas sees this as key to adoption)
  • Embargoes

Q: Has EThOS considered changing approaches to Intellectual Property rights after 2 months?

A: No – there are lots of issues around IP, but they must be managed. Some institutions are taking a ‘trial’ approach where they agree with legal advisors to try it out for a short period, subject to review, as a way of starting out, and hopefully getting agreement for a long term commitment if no legal problems come up. Also mention that institutions may well be insured against legal action.
