I’m briefly at the Open Repositories 2012 conference in Edinburgh, and this morning I’m in a session about ‘repository services’ – which sounds like a nice easy way to ease into the morning, but is actually diving into some pretty hard technical detail pretty quickly!
There are three papers in this session.
Built to scale?
Edwin Shin is describing using Hydra (a repository stack built on Fedora, Solr, Blacklight). I missed the start, but the presentation is about dealing with very large numbers of digital objects – from millions to hundreds of millions. It’s a pretty technical talk – optimisation of Solr through sharding, taking a ‘sharded’ approach to Fedora (in the ActiveFedora layer).
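To make the sharding idea a bit more concrete for myself, here is a minimal sketch in Python. It is not the code from the talk: the host names, the `shard_for` and `distributed_query_url` helpers and the choice of hashing the object ID are all my own assumptions. The only Solr feature it leans on is the `shards` request parameter used for distributed search.

```python
import hashlib
from urllib.parse import urlencode

# Hypothetical shard locations, invented for illustration (not from the talk).
SHARDS = [
    "solr1.example.org:8983/solr/objects",
    "solr2.example.org:8983/solr/objects",
    "solr3.example.org:8983/solr/objects",
]


def shard_for(object_id, shards=SHARDS):
    """Pick a shard by hashing the object ID, so the same ID always maps to the same shard."""
    digest = hashlib.md5(object_id.encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]


def distributed_query_url(query, shards=SHARDS):
    """Build a select URL asking the first node to fan the query out to every shard."""
    params = urlencode({"q": query, "shards": ",".join(shards)})
    return "http://" + shards[0] + "/select?" + params
```

Indexing would send each document to `shard_for(its_id)`, while searches go to one node with the full shard list. How the sharding was actually done in the ActiveFedora layer I didn't catch.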
Perhaps the high-level lesson to pull out is that you ought to look at how people use a system when planning quite technical aspects of the repository. For example – they reworked their disaster recovery strategy based on the knowledge that a vast number of requests were for the current year – since a full system recovery takes days (or weeks?) they now deposit objects from the current year so that they can be restored first and quickly.
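The talk didn't go into how the current-year objects are kept apart, so the following is only a sketch of the ordering idea, with a made-up `StoredObject` type and `restore_order` function: restore whatever belongs to the current year first, then work backwards through older years.

```python
from dataclasses import dataclass


@dataclass
class StoredObject:
    pid: str   # object identifier (hypothetical field name)
    year: int  # year the object was deposited


def restore_order(objects, current_year):
    """Return objects with current-year items first, then the rest from newest to oldest."""
    return sorted(objects, key=lambda o: (o.year != current_year, -o.year))
```

For objects deposited in 2010, 2012 and 2011, `restore_order(objects, 2012)` puts the 2012 object at the front of the queue, followed by 2011 and then 2010.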
Similarly with Solr optimisation – having done a lot of generic optimisation they were still finding performance (query response times) far too slow on very large sets of documents. By analysing how the system was used they were able to perform some very specific optimisations (I think this was around increasing the filterCache settings) to achieve a significant reduction in query response times.
Inter-repository Linking of Research Objects with Webtracks
This session is being presented by Shirley Ying Crompton. Shirley is describing how the research process leads to research data and outputs being stored in different places with no links between them. So they decided to use RDF/linked data to add structured citation links between research objects (and people – e.g. creators).
However, different objects are created in different systems – so how to make sure objects are linked as they are created? They looked at existing protocols for enabling links to be created:
- Trackbacks – used for blogs/comments
- Semantic pingback – an RPC protocol to form semantic links between objects
- Salmon – RSS protocol
- …
Decided to take the ‘webtracks’ approach – this is an inter-repository communication protocol. The Webtracks InteRCom protocol allows links to be formed between objects in two different repositories. InteRCom is a two-stage protocol – the first stage is a ‘harvest’ to get links, and the second stage is a ‘request’ for a link between two objects.
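As I understood it, the two stages could be sketched roughly as below. This is an in-memory toy in Python, not the real InteRCom API (the actual implementation is in Java): the `Repository` class, its method names and the `isCitedBy` relation are all invented for illustration.

```python
class Repository:
    """In-memory stand-in for a repository; a hypothetical sketch, not the InteRCom API."""

    def __init__(self, name):
        self.name = name
        self.links = {}  # object id -> set of (relation, remote object URI)

    def add_object(self, object_id):
        """Register an object with no links yet."""
        self.links.setdefault(object_id, set())

    def harvest(self, object_id):
        """Stage 1 ('harvest'): report the links currently known for an object."""
        return sorted(self.links[object_id])

    def request_link(self, object_id, relation, remote_uri):
        """Stage 2 ('request'): a peer asks for a link to be recorded; True if accepted."""
        if object_id not in self.links:
            return False  # unknown object, so the request is refused
        self.links[object_id].add((relation, remote_uri))
        return True
```

A peer repository would first call `harvest("dataset-1")` to see what is already linked, then `request_link("dataset-1", "isCitedBy", "papers:paper-9")` to ask for a new link. There is no central service in the middle, which matches the peer-to-peer point made in the talk.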
The InteRCom implementation has been done in Java and is available as open source – download from http://sourceforge.net/projects/webtracks/.
Shirley says: Webtracks facilitates propagation of citation links to provide a linked web of data – it uses the emerging linked data environment and supports linking between diverse types of digital research objects. There are no constraints on link semantics or metadata. Importantly (for the project), it does not rely on a centralised service – it is peer-to-peer.
Webtracks has been funded by JISC and is a collaboration between the University of Southampton and the STFC – more information at http://www.jisc.ac.uk/whatwedo/programmes/mrd/clip/webtracks.aspx
ResourceSync: Web-based Resource Synchronization
This session is of particular interest to me, and I took more extensive notes – so I’ve put these into a separate post http://www.meanboyfriend.com/overdue_ideas/2012/07/resourcesync-web-based-resource-synchronization/