Andy Powell – Web 2.0 and Repositories

Andy has been live blogging at efoundations but clearly won’t be doing this for his own session.

Andy is describing repositories.

What we do with repositories:

manage
deposit
disclose
make openly available
curate
preserve

What is in repositories:

scholarly publications
learning objects
research data

Andy is going to focus on scholarly publications today, although much of what he says will be applicable across the board.

In terms of how HE is building an infrastructure, the focus is mainly ‘institutional’, although not exclusively: exceptions include arXiv, RePEc and JORUM. But Andy believes that the ‘political’ agenda is around institutional approaches.

Interoperability is done via centralised aggregators – usually national, sometimes global – Intute (national), OAIster (global). Interoperability is essentially at the level of harvesting metadata (usually simple Dublin Core) using OAI-PMH – not usually harvesting content.
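To make the harvesting model concrete, here is a minimal sketch in Python of what an aggregator does: build an OAI-PMH ListRecords request for simple Dublin Core, then pull the title and identifiers out of the response. The endpoint https://repository.example.org/oai and the sample record are made up for illustration; only the verb, the metadataPrefix parameter and the XML namespaces come from the protocol itself.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical repository endpoint (an assumption, not a real service).
BASE_URL = "https://repository.example.org/oai"

# Namespaces used by OAI-PMH 2.0 and simple Dublin Core (oai_dc).
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build the URL for an OAI-PMH ListRecords request."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def parse_records(xml_text):
    """Return a (title, identifiers) pair for each Dublin Core record."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        dc = record.find(".//oai_dc:dc", NS)
        if dc is None:  # e.g. a deleted record that carries no metadata
            continue
        title = dc.findtext("dc:title", default="", namespaces=NS)
        identifiers = [e.text for e in dc.findall("dc:identifier", NS)]
        records.append((title, identifiers))
    return records


# Invented sample response, trimmed to the parts the parser reads.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repository.example.org:123</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example eprint</dc:title>
          <dc:identifier>https://repository.example.org/123</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

print(list_records_url(BASE_URL))
print(parse_records(SAMPLE_RESPONSE))
```

Note what the aggregator ends up with: a title and a link back to the repository, not the full text itself.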

Content in repositories is very often PDF.

Andy just noting SWORD, recently developed as a deposit API.

So – having given the background, Andy wants to look at the issues (now 5 issues – having grown from 3 when Andy first did a similar presentation – the issues grow as he looks at it more – oh dear):

#1 We talk about ‘repositories’

There is a real issue with terminology. The term ‘repository’ is pretty woolly. Whereas a focus on ‘making content available on the Web’ would be more intuitive to researchers.

I agree this is an issue – although going by Peter Murray-Rust’s talk we might be better off saying ‘backup systems’ rather than either of these 🙂

Andy noting we don’t talk about ‘Content Management Systems’ – which may be a good thing, but we need to acknowledge that in general terms ‘repositories’ are ‘content management systems’. If we started thinking about this, then we might start talking about ‘surfacing content’ on the web, rather than focussing on specific protocols (e.g. OAI-PMH).

#2 We don’t emphasise

Google indexing – where is the discussion about ‘Search Engine Optimisation’?
RSS Feeds
‘Widgets’

#3 Our focus is on sharing metadata

Even though we have full-text to share – and what we do share is PDF rather than a ‘native web’ format. Also the metadata we do share tends to be simple Dublin Core, inconsistently applied. Andy arguing that simple DC is too simple to build compelling discovery services, but too complicated for users, who are put off adding metadata.

#4 We ignore the Web Architecture

We have tended to adopt service oriented approaches, in line with the tradition of digital library approaches.

The focus is on building services that give access to data, rather than the ‘resource oriented’ approach which is being adopted in the more general web world. We don’t tend to adopt REST (an architectural style with a focus on resources and a simple set of operations).
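A rough sketch of what ‘resource oriented’ means in practice, in Python: every item has its own URI, and the only operations are the standard HTTP methods applied to that URI. The /items/ path layout and the in-memory store are assumptions made up for this illustration, not how any particular repository platform works.

```python
# In-memory stand-in for a repository's item store (hypothetical data).
ITEMS = {"123": {"title": "An example eprint"}}


def handle(method, path, body=None):
    """Dispatch on (HTTP method, resource URI) and return (status, body)."""
    parts = path.strip("/").split("/")
    if len(parts) != 2 or parts[0] != "items":
        return 404, None  # not a resource we know about
    item_id = parts[1]

    if method == "GET":  # read the resource
        if item_id in ITEMS:
            return 200, ITEMS[item_id]
        return 404, None
    if method == "PUT":  # create or replace the resource
        status = 200 if item_id in ITEMS else 201
        ITEMS[item_id] = body
        return status, body
    if method == "DELETE":  # remove the resource
        if ITEMS.pop(item_id, None) is None:
            return 404, None
        return 204, None
    return 405, None  # method not allowed


print(handle("GET", "/items/123"))
print(handle("PUT", "/items/456", {"title": "A new deposit"}))
```

Compare this with OAI-PMH, where there is a single endpoint and the operation travels in the request itself (verb=GetRecord plus an identifier parameter): that is the service oriented style Andy is describing.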

#5 We are antisocial

‘We’ (presumably the HE environment?) tend to treat content in isolation from the social networks that need to grow around that content.

Successful repositories in a more generic sense (Flickr, YouTube, Slideshare, etc.) tend to promote the social activity that takes place around content as well as the content management and disclosure activity.

One thing that occurs to me here is a question about what services Flickr etc. are providing – what are their terms of service (clearly I need to go read these!) – as they may not have the same values that HE institutions and their repository services have.

Andy addressing my last point there: the institutional approach has a fundamental mismatch with the real-life social networks adopted by researchers, which tend to be subject-based, cross-institutional and global. So while institutional approaches have some strengths – preservation etc. – they don’t get any ‘network’ effect.

We are ending up with ‘empty’ repositories, having to ‘mandate’ deposit to get content, rather than making a compelling offering that researchers want to use.

So, Andy is suggesting we need to look at moving back to subject based, global repositories that concentrate content so that we can take advantage of the ‘network’ effect etc. This is where we started with arXiv.

There is a question of why in other examples – e.g. blogs – we can work with a distributed network of content, and services that aggregate this content (e.g. Technorati). Perhaps the reason this works for blogs is that they are under individual control, and the ‘glue’ (RSS) is lightweight and easy to apply. This doesn’t seem to be the case with repositories, although Andy admits he isn’t altogether sure why repositories haven’t been successful in this way.
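To give a sense of how lightweight that ‘glue’ is, here is a small Python sketch that builds an RSS 2.0 feed of recent deposits using only the standard library. The repository name, URLs and deposit titles are invented for the example.

```python
import xml.etree.ElementTree as ET

# Invented sample data: two recent deposits in a hypothetical repository.
DEPOSITS = [
    {"title": "An example eprint", "url": "https://repository.example.org/123"},
    {"title": "Another eprint", "url": "https://repository.example.org/124"},
]


def build_rss(channel_title, channel_link, deposits):
    """Return an RSS 2.0 document (as a string) listing the given deposits."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # RSS 2.0 requires a title, link and description for the channel.
    ET.SubElement(channel, "title").text = channel_title
    ET.SubElement(channel, "link").text = channel_link
    ET.SubElement(channel, "description").text = "Recent deposits"
    for deposit in deposits:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = deposit["title"]
        ET.SubElement(item, "link").text = deposit["url"]
        # Reuse the item's URL as its globally unique identifier.
        ET.SubElement(item, "guid").text = deposit["url"]
    return ET.tostring(rss, encoding="unicode")


print(build_rss("Example repository", "https://repository.example.org/", DEPOSITS))
```

That is the whole of the publishing side: a title and a link per item, which any feed reader or aggregator can consume.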

Andy noting that having this kind of challenge to repository ‘received wisdom’ is very difficult – it is very political, and those involved are reluctant to engage in discussion (possibly as it distracts from the Open Access agenda).

So – where do we go from here?

Andy suggests that we need to look at examples like Slideshare (a service that shares presentations). This might be what a ‘Web 2.0’ repository looks like:

a high quality browser-based document viewer (not a ‘helper’ application like Acrobat)
tagging, commentary, more-like-this, favourites
persistent (cool) URIs to content
ability to form simple social groups
ability to embed documents in other web sites
high visibility to Google
use of ‘the cloud’ (Amazon S3?) to provide scalability

Andy suggesting we need to ‘go simple’. Develop simple(ish) repositories with complex services (search/aggregation) overlaying them.

Alternatively, we could go more ‘complex’ – the ‘Semantic Web’ approach – creating richer metadata about scholarly publications than we currently do, we explicitly adopt a complex data model etc.

Examples of this ‘complex’ approach include SWAP (Scholarly Works Application Profile), which captures relationships between works, expressions, manifestations, items and agents, and ORE (OAI Object Reuse and Exchange), which captures relationships between objects.

Andy throwing up diagrams of SWAP and ORE – these work well together, but it is a much more complex approach.
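For a feel of what a work / expression / manifestation / item / agent model involves, here is a rough Python sketch. The class and field names are my own shorthand for the entities listed above, not SWAP’s actual vocabulary, and the sample data is invented.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Agent:
    name: str  # a person or organisation, e.g. an author


@dataclass
class Item:
    url: str  # one concrete copy at one location


@dataclass
class Manifestation:
    media_type: str  # one format of an expression, e.g. PDF or HTML
    items: List[Item] = field(default_factory=list)


@dataclass
class Expression:
    version: str  # e.g. author's draft vs. published version
    manifestations: List[Manifestation] = field(default_factory=list)


@dataclass
class Work:
    title: str
    creators: List[Agent] = field(default_factory=list)
    expressions: List[Expression] = field(default_factory=list)


def all_copies(work):
    """List (version, media type, URL) for every copy of a work."""
    return [
        (expression.version, manifestation.media_type, item.url)
        for expression in work.expressions
        for manifestation in expression.manifestations
        for item in manifestation.items
    ]


# Invented example: one paper, two versions, three copies in two places.
paper = Work(
    title="An example eprint",
    creators=[Agent("A. Researcher")],
    expressions=[
        Expression("draft", [
            Manifestation("application/pdf", [Item("https://repository.example.org/123.pdf")]),
        ]),
        Expression("published", [
            Manifestation("application/pdf", [Item("https://publisher.example.com/a1.pdf")]),
            Manifestation("text/html", [Item("https://publisher.example.com/a1.html")]),
        ]),
    ],
)

for copy in all_copies(paper):
    print(copy)
```

Even this toy version needs five classes to describe one paper, which gives some idea of the extra complexity compared with a flat simple DC record.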

Andy’s main points:

Look to Web 2.0
Need global concentration to get network effect
Simple DC too simple and too complex
SWAP and ORE may point to new approaches, but come with extreme complexity

In conclusion:

Flickr was a response to digital photography – it wasn’t an attempt to create an ‘online photo album’

We need an approach to digital research that is not an attempt to recreate paper based scholarly communication – we need to re-think (‘re-envision’ in Andy’s words) scholarly communication in the digital age.

Question: Someone in the audience saying “when you said repositories were about ‘making content available on the web’ suddenly I understood why we had repositories – something that I had previously not understood”. I’m afraid I missed the next point, but it garnered a response from Andy along the lines of:

Answer: We are still working in a print/electronic hybrid environment for scholarly communication. We have a situation where you can have lots of copies of the same thing in different places and in different formats. Users want to get access to the most available copy. Is this the difference? (I’m not entirely sure this is the whole story – there is an issue about business models and charging for access – this is why it’s important which copy you are looking at)

Comment: From Phil Casey from BMJ – saying it is not about managing the delivery mechanism, but it is about the data.
