Talis Research Day – Codename Xiphos

I’m at a Talis Research Day today looking at a number of issues that are ‘hot topics’ in education and research at the moment. The program for the day looks great, starting with a presentation by Peter Murray-Rust from the University of Cambridge, who is a proponent of opening up research and research data – I’d recommend his blog to catch up with the latest work in this area. Peter is talking about ‘Data-driven research’. Following this, Andy Powell from Eduserv is talking about Web 2.0 and repositories.

First up – Peter M-R:

Peter presents using HTML – although it’s hard work, he believes that the common alternatives (PowerPoint, PDF) destroy data. I think the question of ‘authoring tools’ – not just for presentations, but in a more general sense of tools that help us capture data/information – is going to come to the fore in the next few years.

Peter has a go at publishers – claiming that publishers are in the business of preventing access to data, rather than facilitating it (at this point he asks if there are any publishers in the audience – two sheepish hands are raised). Peter also mentions that Chemistry is particularly bad as a discipline in terms of making data accessible – with the American Chemical Society being a real offender.

Peter’s talks tend to be pretty impromptu – so he is just listing some topics he may (or may not) touch on today:

  • Why data matters
  • What is Open Data
  • Differences between Open Access and Open data
  • Demos
  • Repositories
  • eTheses
  • OpenNoteBook Science
  • Semantic data and the evils of PDF
  • Science Commons, Talis and the OKF
  • Possible Collaborations

Peter demonstrating how a graph without metadata is meaningless – showing a graph on the levels of Atmospheric Carbon Dioxide. If this was in paper form and we wanted to do some further analysis – it would take a lot of effort to take measurements off the graph – but if we have the data from behind the graph, we can immediately leap to doing further work.
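
To make Peter's point concrete, here is a minimal sketch of the kind of follow-up analysis that becomes trivial once the numbers behind a graph are available. Everything in it is an assumption for illustration: the column names, the helper functions `load_series` and `linear_trend`, and above all the values, which are made-up placeholders and not real atmospheric measurements.

```python
import csv
import io

# Hypothetical placeholder readings, invented for illustration only.
# These are NOT real atmospheric CO2 measurements.
RAW_CSV = """year,co2_ppm
2000,100.0
2001,102.0
2002,104.0
2003,106.0
"""


def load_series(text):
    """Parse CSV text into a list of (year, value) tuples."""
    reader = csv.DictReader(io.StringIO(text))
    return [(int(row["year"]), float(row["co2_ppm"])) for row in reader]


def linear_trend(points):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


series = load_series(RAW_CSV)
slope, intercept = linear_trend(series)
print(f"trend: {slope:.2f} units per year")
```

With only a printed graph, the first step would be reading values off the axes by hand; with the data, the trend is a few lines of code.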

Peter now noting that a scholarly publication looks very much the same today as it would have done 200 years ago. Showing a PDF of an article from Nature – and making the point that it all looks great (illustrations of molecules, proteins and reactions etc.) but is completely inaccessible to machines.

Peter noting that most important bio-data that is published is publicly accessible and reusable – but this is not true in chemistry. This means in the article, the data about the proteins is publicly accessible, but the information on chemical molecules is not – although covered in the same article.

Peter illustrating how there is a huge industry based on moving and repurposing data (e.g. taking publicly available patent data, and re-distributing in other formats etc.)

Peter now showing how a data rich graph is reduced to a couple of data points to ‘save space’ in journals – a real paper-based paradigm – we need to get away from this. Similarly experimental protocols are reduced to condensed text strings.

Peter now showing ‘JoVE’ – Journal of Visualised Experiments. This is an online publication where scientific protocols are published in both textual and audio-visual format – so much richer in detail than the type of summarisation that journals currently support. Peter notes that this is really important stuff – failure to provide enough detail to recreate an experiment can have a huge impact on your reputation and career.

Peter now moving onto ‘big science’ – relating his visit to CERN – how the enormous amounts of data generated by the Large Hadron Collider are captured, along with the relevant metadata. However, most science is not like this – not on this scale. Peter is relating the idea of ‘long tail’ science (coined by Jim Downing) – this is the small-scale science that, taken across all activity, still generates large amounts of data – but each piece comes from a small activity. This is really relevant to me, as this is exactly the discussion I was having at Imperial yesterday – looking at the approach taken by ‘big science’ and wondering if it is applicable to most of the research at Imperial.

So in long-tail science, you may have a ‘lab’ that will have a reasonably ‘loose’ affiliation to the ‘department’ and ‘institution’. Peter noting that most researchers have experienced data loss – and this can be a real selling point for data and publication repositories.

Peter showing a thesis with many diagrams of molecules, graphs etc. Noting there is no way to effectively extract the information about molecules from the paper, as it is a PDF. He is demonstrating a piece of software which extracts data from a chemical thesis – demonstrating this with a thesis authored in Word, and using OSCAR (a text-mining tool tuned to work in Chemistry) – and shows how it can extract relevant chemical data, display it in a table, and reconstruct spectra (from the available data in the text – although these are not complete).

Peter asking (rhetorically) what are the major barriers – e.g. Wiley threatened legal action against a student who put graphs on their website.

Peter now demonstrating ‘CrystalEye’ – a system which spiders the web for crystals – reads the raw data, draws a ‘Jmol’ view (3D visualisation) of the structure, links to the journal article etc. This brings together many independent publications in a single place showing crystal structures. Peter saying this could be done across chemistry – but data is not open, and there are big interests that lobby to keep things this way (specifically mentioning Chemical Abstracts lobbying the US Government).

Peter now talking about development of authoring tools – pointing out that this is much more important than a deposition tool – if the document/information is authored appropriately, it is trivial to deposit (it occurs to me that as long as it is on the open web, then deposit is not the point – although there is some question of preservation etc – but you could start to take a ‘wayback machine’ type approach). Peter is demonstrating how an animated illustration of chemical synthesis can be created from the raw data.

Peter now coming on to Repositories. Using ‘Sourceforge’ (a computer code repository) as an example. Stressing the importance of ‘versioning’ within Sourceforge – it is trivial to go back to previous versions of code. We need to look at introducing these tools for science. He is involved in a project called ‘Bioclipse’ – a free, open-source workbench for chemo- and bioinformatics using a ‘sub versioning’ approach (based on Eclipse, which is an open-source software development platform) – Bioclipse stores things like spectra, proteins, sequences, molecules etc.

Peter mentioning the issue of researchers not wanting to share data straight away – we need ‘escrow’ systems that can store information which is only published more openly at a later date. The selling point is keeping the data safe.
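
As a rough illustration of the escrow idea (my own sketch, not anything Peter showed), a repository could hold each dataset safely from the day it is deposited but only expose it once an agreed release date has passed. The class `EscrowedDataset`, its fields, the function `visible_titles` and the example titles and dates are all hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class EscrowedDataset:
    """A deposited dataset that stays private until its release date."""
    title: str
    deposited_on: date
    release_on: date

    def is_public(self, today: date) -> bool:
        # The data is stored from deposit onwards, but only becomes
        # visible to others on or after the agreed release date.
        return today >= self.release_on


def visible_titles(datasets, today):
    """Return the titles that an outside reader may see on a given day."""
    return [d.title for d in datasets if d.is_public(today)]


# Hypothetical example deposits.
holdings = [
    EscrowedDataset("spectra-run-1", date(2008, 1, 10), date(2008, 6, 1)),
    EscrowedDataset("spectra-run-2", date(2008, 2, 5), date(2009, 2, 5)),
]

print(visible_titles(holdings, date(2008, 7, 1)))
```

The researcher gets the benefit of safe storage immediately, while the decision about openness is just a date attached to the record.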

Peter dotting around during the last few minutes of the talk, mentioning:

  • Science Commons (about customising Creative Commons philosophy for Science)
    • how to license data to make it ‘open’ under appropriate conditions – this is something that Talis has been working on with Creative Commons.
    • Peter saying that, for example, there should be a trivial way of watermarking images so that researchers can say ‘this is open’ – and then if it is published, it will be clear that the publisher does not ‘own’ or have copyright over the image.

Questions:

Me: Economic costs of capturing data outside ‘big science’

PMR: If we try to retro-fit, the costs are substantial. However, data capture can be a marginal cost if done as part of the research. Analogy of building motorways and cyclepaths – very expensive to add cyclepaths to existing motorways, but trivial to build both at the same time.

Some interesting discussion of economics…
