Achieving total Finance Management

This may not sound like the most thrilling session (especially straight after lunch), but I’m hoping they are going to talk about integration with corporate finance systems. Talis Keystone seems to be the main ‘integration’ product – we are going to get a case study from Liverpool Hope University.

What is Talis Keystone? Uses standards (IT and Library standards) to allow integration.

At Liverpool Hope – small budget, with purchasing from consortium approved suppliers, as well as credit card purchases from Amazon. They use the ‘Agresso’ finance system, recently changed from the ‘Opera’ finance system.

The drivers were to avoid double data entry, getting up-to-date financial records that match on both systems, ability to search Finance system with standard data (e.g. use same order numbers on both systems). They decided to deal only with One-off purchases to start with, and couldn’t deal with purchase card in the first instance.

Need to be able to deal with new orders, goods received, cancellations, part receipting, part cancellations etc.
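A minimal sketch of how those order events might be tracked against a single order line. Everything here (the `OrderLine` class, its field names, the status labels) is hypothetical and for illustration only; the talk did not describe how Talis or Agresso actually model this.

```python
from dataclasses import dataclass


@dataclass
class OrderLine:
    """One line of a one-off purchase order (hypothetical model)."""
    order_number: str
    ordered: int
    received: int = 0
    cancelled: int = 0

    @property
    def outstanding(self) -> int:
        # Copies still expected from the supplier.
        return self.ordered - self.received - self.cancelled

    @property
    def status(self) -> str:
        if self.outstanding > 0:
            return "open"
        return "cancelled" if self.received == 0 else "complete"

    def receive(self, quantity: int) -> None:
        # Covers both full receipting and part receipting.
        self._check(quantity)
        self.received += quantity

    def cancel(self, quantity: int) -> None:
        # Covers both full cancellation and part cancellation.
        self._check(quantity)
        self.cancelled += quantity

    def _check(self, quantity: int) -> None:
        if quantity <= 0 or quantity > self.outstanding:
            raise ValueError(
                f"cannot process {quantity}; {self.outstanding} outstanding"
            )
```

For example, an order for 5 copies that has 2 received and 1 cancelled still has 2 outstanding and stays open until those are either received or cancelled.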

Started by having a very detailed meeting with all the relevant players – Talis, Library, Finance, IT. They flow diagrammed the Library acquisitions process and the Finance process, matching the two together. Clearly identified what was going to be included, and what excluded in the project. Also identified limitations – e.g. Agresso could not accept changed information such as price or quantity (this has to be amended manually) – this sounds like quite a serious limitation to me!

They then had a follow-up meeting looking at data fields in both systems, and how they mapped to each other.
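As a rough illustration of what a field mapping between the two systems could look like, here is a minimal sketch. The field names on both sides are invented for the example; the real Talis and Agresso fields were not given in the talk.

```python
# Hypothetical mapping: library system field -> finance system field.
LIBRARY_TO_FINANCE = {
    "order_number": "po_number",   # same order number used on both systems
    "supplier_code": "vendor_id",
    "fund_code": "cost_centre",
    "price": "amount",
    "quantity": "quantity",
}


def to_finance_record(library_order: dict) -> dict:
    """Translate a library order into the finance system's field names."""
    missing = [name for name in LIBRARY_TO_FINANCE if name not in library_order]
    if missing:
        # Fail loudly rather than send an incomplete record to Finance.
        raise ValueError(f"library order is missing fields: {missing}")
    return {
        finance_name: library_order[library_name]
        for library_name, finance_name in LIBRARY_TO_FINANCE.items()
    }
```

Keeping the mapping in one table like this makes it easy for library and finance staff to review it together, which is essentially what the follow-up meeting was for.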

After getting the technical side sorted, they did structured testing, with both ‘standard’ scenarios, and some ‘try to break the system’ unexpected but realistic scenarios. Load testing. Testing was a time consuming part of the project.

Now at the point of implementation – need to sort out who does what (finance or library), especially for problem solving, need to automate some procedures, need to put in appropriate monitoring, and look at working practices.

Although not live yet, it sounds like a great project. I’ve been looking at how we handle financial transactions, and I think we might want to look at running a similar project.

Integrating Library Services

Just dashed from one presentation to another in a different room – unfortunately the start and end times don’t quite match, so I’ve come in halfway through a presentation from Queen’s University Belfast about integrating library services (apologies to Nicole Harris who was presenting on Federated Access Management, but I’ve seen quite a lot on this before…)

OK – so Queen’s Online seems to be the Queen’s portal. They have a ‘My Library’ channel which shows overdues, holds, etc. with single sign on. This is similar to a project we are involved in at Imperial at the moment.

Queen’s have introduced a strict fines policy, so they looked to make it as easy as possible for students to manage their account – to keep their loans current etc. Also looked at the easiest way to pay. They established a project group with people from the library and the portal project. Implemented an e-pay solution, so fines can be paid online (use WorldPay to process credit card transactions).

They are describing how they set up the online payment service, using Talis Keystone, Queen’s Online and WorldPay, using Web Services – I won’t go into the detail, but the important thing here is the use of Web Services making it easy to integrate the library account with other systems.

The success of the work with Queen’s Online team is opening doors (really pays to work well with the University IT dept – they can help make things happen!). They are looking at increasing their presence in the portal, as well as smartcard epayments.

Andy Latham from Talis is summing up, basically saying that Talis Keystone is the solution for integration – I guess it is a Talis conference 🙂 Just finishing with the IT/Library video that is a take on the Mac/PC adverts – but I can’t track down the URL – if anyone knows it, drop it into the comments.

eScience, Scholarly Communication and the Transformation of Research Libraries

This talk by Tony Hey – Corporate VP for External Research, Microsoft Research.

So, Tony is saying that we are seeing an ‘emergence of a new Data-Centric paradigm for research’, and that Web 2.0 students won’t use the library in the traditional way – so there is a need to redefine the role of the research library.

We have seen (and continue to see) an explosion in the amount of data being produced in scientific research – huge amounts of data being produced by instruments, simulations, sensor networks – we are able to ‘measure’ stuff to an overwhelming degree. Tony sees management and ‘curation’ of this data as a huge challenge for the research community – he says the scale of the challenge is one of the reasons he joined MS.

The ‘Scientific Data Deluge’ – data collection, data processing, digital preservation.

An example – ‘Fighting HIV with Computer Science’:
Research from ‘Spam Blocking’ machine learning project, which then moved to use of machine learning in tools that scientists can use. The original project was aimed to analyse huge amounts of data as to whether it was spam or not – led to drawing out correlations in huge data sets on HIV.

Cyberinfrastructure – this is the real problem, the ‘calculation’ bit is easy, it is the infrastructure needed (both technical and organisational) that is the problem. Tony references the NSF report on this (http://www.nsf.gov/pubs/2007/nsf0728/index.jsp).

Tony makes the point that it isn’t just about e-Science, but e-Research – the same issue applies to arts and humanities.

Tony says research today is:

  • Data intensive
  • Compute intensive
  • Collaborative
  • Multi-disciplinary

Today – web users are using tools that could really help here, but typically Researchers are using custom standalone tools, the ‘sharing’ process is still via long publication process, physical meetings etc.

In eResearch data is easily accessible and shareable (e.g. http://cas.sdss.org/dr5/en), services expose functionality (e.g. BLAST from the NLM, http://www.ncbi.nlm.nih.gov/BLAST/Blast.cgi?CMD=Web&PAGE_TYPE=BlastHome), and services are in the cloud rather than installed locally (e.g. Amazon Web Services – S3, EC2 – this is also used for home storage solutions such as JungleDisk).

Researchers can be seen as ‘extreme information workers’ – looking for subtle signs in the information available.

Publications as live documents – starting to see examples of figures in electronic publications that are based on ‘live’ data – so the reader can change aspects of a graph, plot different scales, overlay other data etc.

Just discovered that quite a few of the slides that Tony is using are available at http://research.microsoft.com/workshops/CEfS2007/presentations/TonyHey.pdf (although this is from a different talk, many of the slides seem to be the same).

Microsoft are building a Virtual Research Environment (VRE) with the British Library – looks like a web portal with stuff like RSS feeds, funding opportunity alerts, saved searches, integration with MS tools (e.g. OneNote) for bibliography, Word and Excel 2007 – could add external tools to the ‘ribbon’, e.g. library research tools.

Tony is going through his slides quite quickly so hard to capture. Now onto Scholarly publishing – the rules are changing – comparing to the Music Industry and music downloads – scholarly publishing industry (publishers and libraries/universities/academics) need to adjust.

Funding bodies now starting to make deposit of research results (publications, data and primary materials) mandatory as part of funding agreement (e.g. ERC).

Referencing article by Paul Ginsparg, ‘As We May Read’, published in the Journal of Neuroscience, Sept 20, 2006. Ginsparg was the driving force behind arXiv – he sees this model being adopted across all research areas. Also, sees a role for libraries and societies – perhaps reclaiming roles they fulfilled in the 19th century. Tony suggests that libraries are not necessarily fulfilling this function – I would argue that universities are not clear they want this…

If you look at ranking of universities on Google Scholar – University of Southampton is the top ranking UK University in this measure – which isn’t a ‘quality’ judge, but think about how available this information is – this means that papers from UoS get more visibility, more citations, more influence.

All the tools to support this need to be completely straightforward for the researcher – no extra effort.

The EU PLANETS Project – Digital Preservation – use of XML – specifically Office Open XML (OOXML) – now an ECMA Standard – but also an open source ODF to OOXML converter – ODF is the ‘Open Document Format’.

Tony Hey leaves us with a challenge – once eResearch is ‘in the Cloud’, where is the Research Library?

Question: Will commercial publishers be destroyed by OA?
Answer: No – MS working with publishers. Tony thinks the ‘big’ ones will be fine – Science, Nature etc. But smaller publications may be more challenged – however Tony is keen to work with smaller publications to see how this can work – he doesn’t want them to go out of business but he believes the business model has to change.

Question: Where does payment come in?
Answer: Tony seems not particularly in favour of Author pays – sees problems with the model.

Question: Who curates data in ‘mashups’?
Answer: It’s a problem – if data is coming from different sources, are they all conforming to the same curation standards? Seems unlikely – perhaps this is where there is more commercial opportunity.

Question (from me): Do researchers want to share their data – data is valuable?
Answer: Tony’s personal opinion is that they should have to share their data, but perhaps after a certain amount of time – keen to stress this is his personal view.

Euan Semple – keynote

The opening keynote is from Euan Semple (http://www.euansemple.com/). Euan is at the BBC as head of Knowledge Management, and has had to help the BBC adapt to ‘Web 2.0’. When faced with a manager who said ‘if I gave my staff access to that kind of tool, they would just end up wasting their time’ – Euan’s reply was ‘have you thought that your recruitment policy might not be working?’.

So Euan’s opening question is what will ‘Businesslike’ look like when business isn’t like business any more?

Euan’s talking about tools he has used or seen used in the process of implementing technology in the area of KM. Firstly ‘Talk.Gateway’ – a discussion/chat board. He draws a distinction between this approach and ‘document management’ systems "where information goes to die gracefully". He suggests that using something like a discussion board allows you to access all the collective knowledge of the organisation (in the case of the BBC accessing the collective knowledge of 23k employees). An example where someone asked about a policy and got 6 different answers, as well as a link to the official policy document. Euan’s point is that the discussion board didn’t create this situation, but surfaced it – don’t blame the system for surfacing inconsistencies or problems.

Second tool, Connect.Gateway – a place where you can post details about yourself – expertise, interests, contact details etc. plus ability to join ‘interest groups’ to bring together people with common interests – especially the ‘new’ stuff that wasn’t captured in the corporate structure.

Euan is pretty sceptical about structure in these systems – taxonomies etc. He says that with the discussion board, originally there were just two sections. Eventually he came under pressure to provide more structure to the boards – however, as soon as he did this, usage dropped. He draws a parallel to a ‘Cotswold village’ that grows up gradually over time with no particular plan, compared to organised ‘new towns’ like Milton Keynes which are ‘planned’ to be systematic, but end up being very easy to get lost in. I’m not sure this completely holds up – there are definite advantages and disadvantages to both approaches, but the point with Milton Keynes is that once you understand the layout, it becomes quite easy. Also, with a systematic approach you can apply the same system to different places – once you understand the system for numbering/naming streets in one US city, you can apply it in others. However, each Cotswold village is different. To make this more concrete, the point is that once you understand LCSH you can apply that to each library catalogue you use, but if we all used local terminology then this would not be possible. On the other hand this perhaps means you don’t get the advantage of localisation which leads to ease of use for regular users. I think a dual approach can work, and there is no doubt that libraries have traditionally taken a very structured approach, and haven’t yet exploited the ‘organic growth’ approach to any extent.

Euan has just covered blogs as a communication and discussion tool, and is now mentioning wikis – these are all tools that have been used at the BBC.

Just as an aside – one of my reasons for blogging (especially conferences) is to share information with colleagues. However, I also want to engage in a discussion with a wider community online. At Imperial they have recently introduced the ‘Confluence’ system for blogging and wikis, which I think is great, and some of my teams are already using, or investigating. However, at the moment the blogs we can set up on Confluence are only available internally – which wouldn’t support me in engaging with the wider community – hence I’m blogging on my own site instead. I hope that this might change…

So, wikis – Euan making the point that they are highly auditable, and to some extent self-correcting.

The BBC now have guidelines on blogging etc – again, something I asked about at Imperial before I started blogging as an Imperial employee – but at the moment there doesn’t seem to be anything in Imperial policies or guidelines relating to this.

Euan now coming onto tagging etc. name checking David Weinberger and his book ‘Everything is Miscellaneous’ (http://www.amazon.com/Everything-Miscellaneous-Power-Digital-Disorder/dp/0805080430). Euan is covering use of del.icio.us – with use of tags and his ‘network’ of trusted people who use del.icio.us. Also use of RSS to track this – and telling a story of how he did a talk, and when he came off stage his RSS aggregator picked up a new item tagged with his name, and found that it was someone blogging the talk he had just given (wonder if he will pick up this post?)

Euan mentioning the use of the Google blog search – different type of content to what you would get in response to a normal Google search – he argues more useful.

Now mentioning last.fm – I still haven’t got into this (probably don’t listen to enough music!) – but the point is the power of the ‘network’ – harnessing the knowledge of a network of people. Suggesting that something like this for TV is on its way – why watch a programmed ‘channel’ when you can choose to watch something that your ‘trusted’ network is recommending.

Now mentioning ‘Plazes’ which I haven’t come across – once you connect to a wireless network, it works out where you are and shows it on a map – so people can see where you are, and you can see if you are near to people you want to meet etc.

Twitter – the ‘intensity of the mundane’ – what about ‘on my way to meeting with CEO’
Facebook – making contact
Dracos.co.uk – tracks changes to BBC News homepage – allows you to see stuff that has been changed – so can’t hide stuff that you’ve said…
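
The talk did not say how the change tracking works. As a minimal sketch of the general idea (comparing two saved snapshots of a page and listing what was removed or added), assuming nothing about the real site's implementation, and using made-up headlines:

```python
import difflib


def changed_lines(old_text: str, new_text: str) -> list:
    """Return lines removed ('- ') or added ('+ ') between two snapshots."""
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    # ndiff also emits unchanged ('  ') and hint ('? ') lines; drop those.
    return [line for line in diff if line.startswith(("- ", "+ "))]


# Invented snapshots, purely for illustration.
earlier = "Minister says policy is final\nWeather: rain expected"
later = "Minister says policy is under review\nWeather: rain expected"

for line in changed_lines(earlier, later):
    print(line)
```

Here the first headline shows up as removed and its reworded version as added, while the unchanged weather line is ignored.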

Some final examples – InnoCentive, from a pharmaceutical company, where questions can be posted, and people can bid for answers – story of a member of staff at an Indian university who set the questions for students, and posted answers – one student got £75k for an answer.
A final lighthearted example of an online application – Meeting Miser – works out how much a meeting has cost the organisation based on time and salaries of those involved – the point being, don’t value physical meetings over virtual collaboration.
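
The calculation behind a tool like this can be sketched in a few lines. This is only a guess at the approach, assuming each attendee's cost is an hourly rate multiplied by the meeting length; the function name and the rates are invented, and the real Meeting Miser may work differently.

```python
def meeting_cost(hourly_rates, duration_minutes):
    """Total cost of a meeting: each attendee's hourly rate times its length."""
    hours = duration_minutes / 60
    return sum(rate * hours for rate in hourly_rates)


# Three attendees on made-up hourly rates, in a 90 minute meeting.
print(meeting_cost([30.0, 45.0, 60.0], 90))  # 202.5
```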

Coffee time!

Talis Insight 2007

http://www.talis.com/applications/news_and_events/talis_insight.shtml

Over the next 2 days I’m at the Talis Insight conference in Birmingham (UK). Although Imperial don’t use any Talis products (and don’t have any specific plans to either), I’m hoping that the conference is still very relevant – the programme is varied, and although, as you might expect, it covers a number of Talis products, it also picks up on a number of trends in the Library technology sector.

I’m particularly looking forward to talks by David Patten, and Marshall Breeding about ‘Next Gen’ library catalogues, as well as looking at the Talis approaches to ‘Next Gen’, systems integration, ERM, and Resource/Reading list management.

I’m also hoping to use the event to restart my sporadic blogging career – usually this is limited to conferences, but I’m hoping that I might manage something a bit more frequent from now on.

If anyone else is blogging or tracking this, I’m going to use Insight07 to tag these posts.

Google Books – Balderdash and Piffle?

Balderdash and Piffle

The BBC are currently showing a series called Balderdash and Piffle which encourages viewers to help track the origins of words or phrases, and to identify the earliest usage in print or recorded media – this is done in collaboration with the OED.

I was watching this on Friday evening, and was surprised that the earliest recorded printed occurrence of the phrase "the dog’s bollocks" to describe something really great (cf. bee’s knees, cat’s whiskers) was 1989. So, I thought (in my usual slightly headstrong way) that I might find something earlier if I did some searching online.

Google Books

I quickly found myself at Google books, and for the first time used it in anger. As usual Google allows me to use inverted commas to indicate a phrase, but I almost immediately found that the basic search didn’t allow me to limit by publication date, so I moved onto the advanced search options. This did let me limit by publication date which was great – I could now only look for items that were published before 1989.

This turned up two hits, one from a "Dictionary of Jargon" apparently published in 1987, and one from "Vision of Thady Quinlan" from 1974. I’ll deal with these one at a time below.

Vision of Thady Quinlan

In the brief results this gives the context for the use of the phrase as follows:

"I don’t give a dog’s bollocks who he is, or who you might be, or what you think you can do. You stay. He goes." Finn dropped the cases. …

This is clearly not the usage of the phrase I was looking for.

Oddly, when I look at the detailed record, this extract is not present, and the ‘snippet’ which should show the context is missing, replaced by a rather distorted "Image not available". This is irritating, but because the context is so clear in the brief view it doesn’t hamper my research – more on this later.

Dictionary of Jargon

This is dated from 1987. However, in this case there is no extract in the brief view. Going through to the full record, there is no snippet. There is some basic metadata – author, publication date, publisher, subject areas, number of pages, where the scan has come from, digitization date.

One issue with the metadata is that the author name is listed as "Jonathon. Green" – note the full stop in the middle of this – I don’t think this changes the meaning, but it points to the quality of the metadata, and this type of issue could lead to ambiguity in other contexts.

I can’t take this any further without seeing the book – without getting into the rights and wrongs of digitising, this is where I regret the lack of the full text available. There is a link to ‘Find this book in a library’, which links me through to Open Worldcat – and I find that the nearest library (that Worldcat knows about) is 6 miles away – that’s not bad going. I’d need to go to check the actual book and usage – but if it bears out its promise, that’s about 10 minutes’ work to out-research the OED and BBC!

Dodgy Metadata?

I moved onto other phrases in the BBC’s/OED’s list and found what seemed to be an earlier than recorded usage of "mucky pup" meaning a habitually messy or dirty child or adult. In this case it is in "From a Pitman’s Notebook".

In the brief display this is listed as by Arthur Archibald Eaglestone from 1925 – pre-dating the evidence found by the BBC programme, which had dated it to a popular song in 1934. In the brief display, it also puts the phrase into context "Tha mucky pup! Ah’ll bet tha’s ‘ad ter coom doon’t chimbley this mornin’ ‘ is accepted with a sheepish grin" which confirms that the usage is correct.

When I go through to the full record, finally I get a ‘snippet’ of the book displayed – but the actual usage is clearly in the line before the snippet starts – so I still don’t get a view of the phrase I’m looking for in context.

In the full record I also get a thumbnail of the digitised book cover – and immediately notice that in the thumbnail it says "BY ROGER DATALLER" – which contradicts the metadata (as noted above, this says the author is Arthur Archibald Eaglestone). Intrigued by this, I search for the book in the British Library, which seems to confirm Roger Dataller as the author. I then check the University of Michigan record, as this is where Google says the book was from. Failing to find the item on a title search, I search for both Dataller and Eaglestone as authors – and eventually find the record listed under Eaglestone – so it looks like Google’s metadata simply reflects that from the University of Michigan.

Now all this is fine – and again, my best bet is clearly to go and get the book from the BL, or perhaps even contact the University of Michigan to see if they can confirm the item details. But along with some of the other things I’ve found, it leads me to start distrusting the quality of the metadata I’m seeing.

Journals

I moved onto searching for an occurrence of "codswallop" from before 1959, and ideally something that linked it to its origin. I find 16 records – and the second one is dated 1869 – I’m very excited by this – almost 100 years earlier than the OED has recorded. However, as soon as I start to look at the entries in detail I notice that many seem to be journals rather than books. The problem here is that the date Google records as the ‘publication’ date seems to be the date the journal was first published. So journals are not listed by issue, but just as a single record for all the articles from the journal. Unfortunately it seems to be impossible to tell which issue or date a specific piece of text is from. As an example, the search for "codswallop" finds a reference to this in (appropriately) "Library Review" – this has a use of "Load of Codswallop" dated as 1927. Looking at the full record, the snippet reveals that the usage is followed by the reference "Evening News, 4 Aug. 1970)" – clearly indicating that this particular article is much later than 1927 – but there is nothing further to date the actual usage. The other results for codswallop have a similar problem – but without the helpful glossing to give any indication of date.
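The check applied by eye in the "Library Review" example can be sketched in a few lines: pull any four-digit years out of a snippet and flag those later than the date on the record. This is only an illustration under stated assumptions; the function name is hypothetical, the year pattern (1500 to 2099) is an arbitrary choice, and it has nothing to do with how Google Books itself handles dates.

```python
import re


def years_after(snippet, record_year):
    """Return the years mentioned in a snippet that are later than record_year.

    A later year inside the text suggests the record's date belongs to the
    journal as a whole rather than to the issue the snippet came from.
    """
    # Match plausible four-digit years only (1500-2099), as whole words.
    found = re.findall(r"\b(1[5-9]\d\d|20\d\d)\b", snippet)
    return sorted({int(year) for year in found if int(year) > record_year})


# The reference quoted in the snippet, checked against the record date of 1927.
print(years_after("Evening News, 4 Aug. 1970)", 1927))  # [1970]
```

An empty result proves nothing, of course: most snippets carry no date at all, which is exactly the problem with the other codswallop results.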

Summary

In summary I found Google Books brilliant but ultimately frustrating. The ability to search full-text was invaluable and discovered references that (it seems) have not been found before. On the other hand, the lack of full-text display meant that it wasn’t possible to check the context, and even when a snippet displayed it far too often didn’t actually display the relevant snippet (often a line or two out).

The fact that I found errors in the metadata in a few cases made me suspicious of the quality in general. To be fair, these errors may have come from the original library metadata – and I wouldn’t have realised the error if I had simply seen the bibliographic record in the original library catalogue.

Finally, the inability to narrow searching of journal/serial content down to more than the original publication date of the journal – and the inability to restrict searching to just monographs, or just serials – meant that it was often impossible to tell whether what I’d found was useful or not.

Google Books and other digitisation projects have the potential to unlock information that might not otherwise ever be found. However, the implementation isn’t quite there yet, and is limited by the inability to display full text for many items.

We are a little way off understanding how full-text searching can be successfully combined with the more traditional structured searching that library catalogues offer (and systems offering faceted searching, such as Endeca, are in the process of exploiting). However, what is clear to me is that searching for information from digitised printed material is different to searching ‘the web’ (although this may simply be a function of the youth and lack of sophistication of the web I guess) and it would be great to see Google and Libraries collaborating on improving this service by combining the best of both.

UPDATED:
Just a few more observations:
You can’t limit by language of material
The OCR used doesn’t seem to work so well with foreign language materials
Quite a lot of OCR problems – the digit ‘1’ mistaken for lowercase ‘L’ and vice versa
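One possible workaround for the ‘1’/‘l’ confusion is to search for every variant of a term with the two characters swapped. The sketch below is a minimal illustration of that idea; the function name is hypothetical, and it only handles this one pair of characters, not OCR errors in general.

```python
from itertools import product


def ocr_variants(term):
    """Return every spelling of term with '1' and 'l' interchanged."""
    # At each position offer both characters if it is a '1' or an 'l',
    # otherwise keep the original character.
    choices = [("1", "l") if ch in "1l" else (ch,) for ch in term]
    return sorted({"".join(combo) for combo in product(*choices)})


print(ocr_variants("1970"))  # ['1970', 'l970']
```

The number of variants doubles with each ‘1’ or ‘l’ in the term, so this is only practical for short words and dates.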
