Now the second speaker – Reagan Moore from the San Diego Supercomputer Center (SDSC) – is talking about the implications for institutions managing scholarly communications. His first point is the sheer scale of the data involved: SDSC holds, overall, more than 500 terabytes of data, in the order of 68 million files!
To put this in context, it is estimated that the text of all the books in the Library of Congress amounts to only around 20 terabytes – so the amount of information here is absolutely huge. We need to realise quickly that we are dealing with information on a new scale, and start to get to grips with both the problems and the solutions.
At SDSC they have a ‘data grid’, which is essentially a middleware layer sitting between the data access methods (the user or machine interfaces to the data) and the data storage, which may be in a variety of formats across a variety of systems.
All of this stuff is better illustrated by diagrams!
Essentially, what they call the ‘Data Grid’ abstracts the user interface away from the storage. On the ‘digital library’ side, they have integrated it with DSpace, which lets them manage information spread across different file systems in different locations.
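To make the idea concrete, here is a minimal sketch (in Python, with entirely made-up class and method names – this is not SDSC’s actual data-grid or SRB API) of how a data grid decouples the access layer from the storage backends:

```python
# Illustrative only: a catalogue maps logical names to whichever backend
# actually holds the bytes, so a front end (e.g. DSpace) never needs to
# know where the file physically lives.
from abc import ABC, abstractmethod


class StorageBackend(ABC):
    """One physical store: a local file system, a tape silo, a remote site..."""

    @abstractmethod
    def get(self, logical_name: str) -> bytes: ...

    @abstractmethod
    def put(self, logical_name: str, data: bytes) -> None: ...


class LocalFileBackend(StorageBackend):
    def __init__(self, root: str):
        self.root = root

    def get(self, logical_name: str) -> bytes:
        with open(f"{self.root}/{logical_name}", "rb") as f:
            return f.read()

    def put(self, logical_name: str, data: bytes) -> None:
        with open(f"{self.root}/{logical_name}", "wb") as f:
            f.write(data)


class DataGrid:
    """Resolves logical names to backends, hiding storage location and format."""

    def __init__(self) -> None:
        self.catalogue: dict[str, StorageBackend] = {}

    def register(self, logical_name: str, backend: StorageBackend) -> None:
        self.catalogue[logical_name] = backend

    def fetch(self, logical_name: str) -> bytes:
        return self.catalogue[logical_name].get(logical_name)
```

The point of the sketch is simply that access methods talk to one interface, while the objects themselves can sit in any number of systems underneath.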
Talking about the National Science Digital Library (which uses Fedora), Reagan is saying that they needed to crawl and harvest material from web sites, and found that the majority of URLs being entered into their system became invalid after 3 months. So they have been looking at how to build a digital archive of this material. However, I think we need to ask some hard questions about what we preserve. Should we not ask why the material is disappearing so quickly from the original source?
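As an aside, the kind of link rot Reagan describes is easy to measure in principle. The sketch below (the URL list and the HEAD-request approach are purely illustrative, not how NSDL actually harvests) just re-checks previously harvested URLs and counts how many still resolve:

```python
# Hypothetical link-rot check over a list of harvested URLs.
import urllib.error
import urllib.request


def still_alive(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL still answers with an HTTP success or redirect."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, OSError):
        return False


harvested = ["http://example.org/article1", "http://example.org/dataset2"]
dead = [u for u in harvested if not still_alive(u)]
print(f"{len(dead)} of {len(harvested)} harvested URLs no longer resolve")
```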
Perhaps more interestingly, huge amounts of data are being produced by research centres (astronomy data, seismic studies, and so on) – and the question is how we store, and provide access to, all of this.
An interesting question is what the role of the library is in a world of data grids and huge amounts of data. Reagan indicates that the library is still seen as the ‘indexer’ and, perhaps more importantly, as the provider of the user interface to these collections (phew – that’s a relief – no chance of librarians being out of a job then). However, perhaps we need to look at the skill sets required. Paul Ayris suggests that we need to look to archivists and records managers for skills – which links back to my question above about what we preserve and what we don’t. Stephen Pinfield, from the University of Nottingham, who is chairing the session, has also prompted Ex Libris to think about the role of the library system software supplier in a world where the ‘data grid’ is a key part of the information landscape.
Some interesting questions followed about the distributed-data approach, and about access to information. The need to keep certain data confidential (to protect IPR) means controlling access to every ‘object’ in a data grid – so we are talking about controlling access to 68 million individual files. However, there is also a move towards making the data available to the wider community: the raw data should be open, because the analysis is where the ‘value’ is added.
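As a purely illustrative sketch of that access-control point (not SDSC’s actual policy mechanism), one way to keep 68 million objects manageable is to attach permissions to collections rather than to individual files, and check them per request:

```python
# Hypothetical collection-level access check; collection names are invented.
EMBARGOED_COLLECTIONS = {"seismic-raw-2006"}          # confidential raw data (IPR)
PUBLIC_COLLECTIONS = {"astronomy-survey-release-1"}   # openly shared data


def can_read(user_groups: set[str], collection: str) -> bool:
    """Allow public collections for everyone; embargoed ones only for project members."""
    if collection in PUBLIC_COLLECTIONS:
        return True
    if collection in EMBARGOED_COLLECTIONS:
        return "project-members" in user_groups
    return False


print(can_read({"anonymous"}, "astronomy-survey-release-1"))  # True
print(can_read({"anonymous"}, "seismic-raw-2006"))            # False
```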