Google Books – Balderdash and Piffle?

Balderdash and Piffle

The BBC are currently showing a series called Balderdash and Piffle which encourages viewers to help track the origins of words or phrases, and to identify the earliest usage in print or recorded media – this is done in collaboration with the OED.

I was watching this on Friday evening, and was surprised that the earliest recorded printed occurrence of the phrase "the dog’s bollocks" to describe something really great (cf. bee’s knees, cat’s whiskers) was 1989. So, I thought (in my usual slightly headstrong way) that I might find something earlier if I did some searching online.

Google Books

I quickly found myself at Google Books, and for the first time used it in anger. As usual Google allows me to use inverted commas to indicate a phrase, but I almost immediately found that the basic search didn’t allow me to limit by publication date, so I moved on to the advanced search options. This did let me limit by publication date, which was great – I could now look only for items that were published before 1989.

This turned up two hits, one from a "Dictionary of Jargon" apparently published in 1987, and one from "Vision of Thady Quinlan" from 1974. I’ll deal with these one at a time below.

Vision of Thady Quinlan

In the brief results this gives the context for the use of the phrase as follows:

"I don’t give a dog’s bollocks who he is, or who you might be, or what you think you can do. You stay. He goes." Finn dropped the cases. …

This is clearly not the usage of the phrase I was looking for.

Oddly, when I look at the detailed record, this extract is not present, and the ‘snippet’ which should show the context is missing, replaced by a rather distorted "Image not available". This is irritating, but because the context is so clear in the brief view it doesn’t hamper my research – more on this later.

Dictionary of Jargon

This is dated from 1987. However, in this case there is no extract in the brief view. Going through to the full record, there is no snippet. There is some basic metadata – author, publication date, publisher, subject areas, number of pages, where the scan has come from, digitization date.

One issue with the metadata is that the author name is listed as "Jonathon. Green" – note the full stop in the middle of this. I don’t think this changes the meaning, but it points to the quality of the metadata, and this type of issue could lead to ambiguity in other contexts.

I can’t take this any further without seeing the book – without getting into the rights and wrongs of digitising, this is where I regret the lack of the full text. There is a link to ‘Find this book in a library’, which links me through to Open Worldcat – and I find that the nearest library (that Worldcat knows about) is 6 miles away – that’s not bad going. I’d need to go and check the actual book and usage – but if it bears out its promise, that’s about 10 minutes’ work to out-research the OED and BBC!

Dodgy Metadata?

I moved on to other phrases in the BBC’s/OED’s list and found what seemed to be an earlier than recorded usage of "mucky pup", meaning a habitually messy or dirty child or adult. In this case it is in "From a Pitman’s Notebook".

In the brief display this is listed as by Arthur Archibald Eaglestone from 1925 – pre-dating the evidence found by the BBC programme, which had dated it to a popular song in 1934. In the brief display, it also puts the phrase into context "Tha mucky pup! Ah’ll bet tha’s ‘ad ter coom doon’t chimbley this mornin’ ‘ is accepted with a sheepish grin" which confirms that the usage is correct.

When I go through to the full record, finally I get a ‘snippet’ of the book displayed – but the actual usage is clearly in the line before the snippet starts – so I still don’t get a view of the phrase I’m looking for in context.

In the full record I also get a thumbnail of the digitised book cover – and immediately notice that in the thumbnail it says "BY ROGER DATALLER" – which contradicts the metadata (as noted above, this says the author is Arthur Archibald Eaglestone). Intrigued by this, I search for the book in the British Library, which seems to confirm Roger Dataller as the author. I then check the University of Michigan record, as this is where Google says the book was from. Failing to find the item on a title search, I search for both Dataller and Eaglestone as authors – and eventually find the record listed under Eaglestone – so it looks like Google’s metadata simply reflects that from the University of Michigan.

Now all this is fine – and again, my best bet is clearly to go and get the book from the BL, or perhaps even contact the University of Michigan to see if they can confirm the item details. But along with some of the other things I’ve found, it leads me to start distrusting the quality of the metadata I’m seeing.

Journals

I moved on to searching for an occurrence of "codswallop" from before 1959, and ideally something that linked it to its origin. I find 16 records – and the second one is dated 1869 – I’m very excited by this – almost 100 years earlier than the OED has recorded. However, as soon as I start to look at the entries in detail I notice that many seem to be journals rather than books. The problem here is that the date Google records as the ‘publication’ date seems to be the date the journal was first published. So journals are not listed by issue; there is just a single record for all the articles from the journal. Unfortunately it seems to be impossible to tell which issue or date a specific piece of text is from. As an example, the search for "codswallop" finds a reference to this in (appropriately) "Library Review" – this has a use of "Load of Codswallop" dated as 1927. Looking at the full record, the snippet reveals that the usage is followed by the reference "Evening News, 4 Aug. 1970)" – clearly indicating that this particular article is much later than 1927 – but there is nothing further to date the actual usage. The other results for codswallop have a similar problem – but without the helpful glossing to give any indication of date.
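The manual check I did on the "Library Review" hit could be roughed out in code. This is only a sketch in Python: the function name, the year pattern and the sample string (pieced together from the fragments quoted above) are my own assumptions, not anything Google Books provides. It flags any year mentioned in a snippet that is later than the date on the record.

```python
import re

# Matches four-digit years from 1500 to 2099 (an arbitrary, assumed range).
YEAR_PATTERN = re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b")


def years_after(record_year, snippet):
    """Return the years mentioned in the snippet that are later than record_year."""
    found = {int(year) for year in YEAR_PATTERN.findall(snippet)}
    return sorted(year for year in found if year > record_year)


# Illustrative snippet assembled from the fragments quoted in the post.
snippet = '"Load of Codswallop" (Evening News, 4 Aug. 1970)'

suspect_years = years_after(1927, snippet)
if suspect_years:
    print("Record dated 1927 mentions later years:", suspect_years)
```

A non-empty result doesn’t date the usage, it just says the record’s date can’t be trusted for that piece of text.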

Summary

In summary I found Google Books brilliant but ultimately frustrating. The ability to search full-text was invaluable and discovered references that (it seems) have not been found before. On the other hand, the lack of full-text display meant that it wasn’t possible to check the context, and even when a snippet was displayed, far too often it wasn’t the relevant one (often a line or two out).

The fact that I found errors in the metadata in a few cases made me suspicious of the quality in general. To be fair, these errors may have come from the original library metadata – and I wouldn’t have realised the error if I had simply seen the bibliographic record in the original library catalogue.

Finally, the inability to narrow searches of journal/serial content beyond the original publication date of the journal – and the inability to restrict searching to just monographs, or just serials – meant that it was often impossible to tell whether what I’d found was useful or not.

Google Books and other digitisation projects have the potential to unlock information that might not otherwise ever be found. However, the implementation isn’t quite there yet, and is limited by the inability to display full text for many items.

We are a little way off understanding how full-text searching can be successfully combined with the more traditional structured searching that library catalogues offer (and systems offering faceted searching, such as Endeca, are in the process of exploiting). However, what is clear to me is that searching for information from digitised printed material is different to searching ‘the web’ (although this may simply be a function of the youth and lack of sophistication of the web I guess) and it would be great to see Google and Libraries collaborating on improving this service by combining the best of both.

UPDATED:
Just a few more observations:
  • You can’t limit by language of material
  • The OCR used doesn’t seem to work so well with foreign-language materials
  • There are quite a lot of OCR problems, e.g. the digit one mistaken for lowercase ‘L’ and vice versa

Classifying the catalogue

Lorcan Dempsey has posted on What is the Catalog?, and also refers to his unhappiness with the word ‘Catalogue’ in his recent Ariadne piece.

This is an interesting intersection with the recent presentations I attended at IGeLU on ‘Libraries, OPACs and a changing discovery landscape’. Both speakers talked about the fact that the traditional view of the library catalogue as the ‘centre’ of the library user’s information discovery behaviour was no longer valid in the modern environment. One of the questions in the discussion that followed these talks was ‘What do you mean by the library catalogue?’

The idea that started to emerge out of these talks was that libraries would need to focus more on the ‘local’ or ‘unique’ collections that they had stewardship of, rather than trying to catalogue the whole world (the problem is not building the Alexandrian library, but trying to do it thousands of times over?).

I remember having a discussion of what should and shouldn’t be in the catalogue about 5 years ago with a colleague, in the context of the growing number of electronic resources we were subscribing to. Currently what we refer to as our ‘library catalogue’ (when talking to our users) contains:

  • a record of our physical stock (or at least aims to – there is a fair amount of error here)
  • our e-journal titles (paid for and free, aggregations and individual titles, actually imported on a monthly basis from our SFX installation)
  • some, but not all, e-books we pay for access to (e.g. we don’t load individual MARC records for books in EEBO or ECO, but we do for Oxford Reference; we don’t track books available in aggregated databases such as Business Source Premier; we don’t load Project Gutenberg details)
  • some digital objects (online exam papers where available)

This odd mixture has some logic behind it (I won’t go into it here, but we do actually discuss this stuff and make decisions about what goes on in a very general way, if not for specific items), but it seems inevitable that there is no obvious consistency for the library user about what they should or should not expect to find if they search the catalogue.

 

So, if the catalogue is not a list of what we have physically, or what we provide access to physically and virtually, what does it become? My guess is that we are heading towards realigning the ‘catalogue’ towards the physical collection – i.e. this is what we have in the building. This is essentially where we started. We can expect our users to start in a wider world of information, and only reach the ‘catalogue’ when they get close to the ‘delivery’ phase.

If this is the case, what will it mean for the development of the catalogue? Definitely integration of inventory information with the wider world – if the user starts with a ‘big picture’ they will want to narrow it down to stuff they can get their hands on pretty quickly (just today I was frustrated in my local library at not being able to narrow my search to ‘this branch, on the shelf, only’). Perhaps a focus on finding the item on the shelf – on a recent visit to Seattle, I was impressed by how the layout of the non-fiction stock in the library (in a continuous Dewey sequence covering several sloping floors – so you can walk continuously from 001 to 999 without any stairs etc.) made it easy to navigate the stock – I especially liked the floor tiles with the Dewey numbers on them for instant orientation.

This needs more thought, so hopefully I can come back to it in a future post…

‘Personal’ researcher

The introduction of federated search engines (e.g. MetaLib) seems to open up an opportunity for some kind of ‘automatic researcher’. I’m thinking of a piece of software that would do sequential searches on a variety of sources, and put together a ‘reading list’ of relevant references.

Just to describe how this might work:

The researcher puts together a list of keywords, and defines a starting point (e.g. a list of databases).
The federated search engine does a search on the databases specified by the researcher.
From the results, the software could compile a number of different search ‘facets’, and then continue to search on these facets. These facets could be, for example, subject words not specified in the original list, and author names.
Alternatively it could do something like find all the papers which cite, or are cited by, a paper retrieved by the original search.

The effectiveness of this kind of functionality would depend on the databases available for cross-searching, how effectively the results can be ‘relevance’ ranked, and how much structure there is in the retrieved records (the more context available for the retrieved records, the better I guess).

In combination with a link-server (e.g. SFX) and local library catalogue, you can even see this being able to prioritise the material easily available to the researcher…
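To make the loop described above a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the in-memory RECORDS list and the federated_search function stand in for a real federated search engine (this is not MetaLib’s interface), and ‘facets’ are simply subject words and author names counted by how often they turn up.

```python
from collections import Counter

# Hypothetical stand-in for the databases being cross-searched.
RECORDS = [
    {"title": "Paper A", "authors": ["Smith"], "subjects": ["metadata", "ocr"]},
    {"title": "Paper B", "authors": ["Smith", "Jones"], "subjects": ["ocr", "digitisation"]},
    {"title": "Paper C", "authors": ["Jones"], "subjects": ["digitisation", "copyright"]},
    {"title": "Paper D", "authors": ["Lee"], "subjects": ["cataloguing"]},
]


def federated_search(term):
    """Return records whose subjects or authors match the term (case-insensitive)."""
    term = term.lower()
    return [
        record for record in RECORDS
        if term in (s.lower() for s in record["subjects"])
        or term in (a.lower() for a in record["authors"])
    ]


def build_reading_list(keywords, rounds=2, facets_per_round=3):
    """Search on the keywords, then keep searching on the most common new facets."""
    seen_terms = {k.lower() for k in keywords}
    queue = list(keywords)
    reading_list = {}  # title -> record; a dict keeps first-seen order
    for _ in range(rounds):
        facet_counts = Counter()
        for term in queue:
            for record in federated_search(term):
                reading_list.setdefault(record["title"], record)
                # Collect subject words and author names not searched on yet.
                for facet in record["subjects"] + record["authors"]:
                    if facet.lower() not in seen_terms:
                        facet_counts[facet] += 1
        # The next round searches on the most frequent new facets.
        queue = [facet for facet, _ in facet_counts.most_common(facets_per_round)]
        if not queue:
            break
        seen_terms.update(facet.lower() for facet in queue)
    return list(reading_list)


print(build_reading_list(["metadata"]))
```

The rounds and facets_per_round limits are there because, without them, the search would happily wander off across every database it could reach. Following citations instead of facets would use the same loop with a different way of generating the next queue.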

I think all the pieces are actually already in place for this, but the functionality isn’t quite there yet. I wonder if anyone would be interested in funding a bit of research in this area… – a couple of months’ work with a federated search engine supplier should really be enough to get this up and running.

(I’ve used the Ex Libris products as examples here, just because these are the ones that I am familiar with, so I can kind of see how it could work using them. I’m sure the similar products from other vendors could do the same kind of thing)

Web service integration

Several useful web services are starting to emerge, and it’s time to start thinking about integrating them into library management products. The most obvious, and possibly most useful, one that springs to mind is the Google spellchecker. Misspelt searches are still a major source of failed catalogue searches, and if we could do a ‘did you mean xxx’ like Google, it would help our users.
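As a rough idea of what a ‘did you mean’ could look like, here is a small Python sketch. To be clear about the assumptions: it does not call the Google spellchecker at all, but swaps in the standard library’s difflib to match a failed search against a list of known terms, and the CATALOGUE_TERMS list is made up (in practice it might come from the catalogue’s own title and author indexes).

```python
import difflib

# Hypothetical vocabulary of terms known to the catalogue.
CATALOGUE_TERMS = ["balderdash", "codswallop", "catalogue", "metadata", "digitisation"]


def did_you_mean(query, vocabulary=CATALOGUE_TERMS, cutoff=0.8):
    """Return the closest known term, or None if the query is known or nothing is close."""
    query = query.lower()
    if query in vocabulary:
        return None  # spelt correctly as far as we can tell
    matches = difflib.get_close_matches(query, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None


suggestion = did_you_mean("codswalop")
if suggestion:
    print(f"Did you mean {suggestion}?")
```

The cutoff of 0.8 is a guess; set it too low and users get offered nonsense, too high and genuine typos slip through.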

The sad thing is that although I can see this as relatively easy to do outside our LMS, I can’t see any way of integrating it into the web OPAC. My alternative is to start writing an interface outside the LMS, which just seems pointless.

Another useful service is the OCLC xISBN lookup – you submit an ISBN, and get back a list of ISBNs of related works (e.g. paperback vs hardback, previous editions etc.) – again, a neat way of extending the user’s search for them. As above, integration into the library catalogue interface would seem like the ideal, but I’m currently talking to the team at SFX to see if we can get it integrated into their product.
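Here is a sketch of how the ISBN lookup described above might extend a catalogue search, in Python. None of this is the OCLC interface: lookup_related is a stub standing in for the remote call, and the identifiers and holdings are made-up placeholders rather than real ISBNs or shelfmarks.

```python
# Made-up placeholder identifiers, not real ISBNs.
RELATED_ISBNS = {
    "isbn-hardback-1st": ["isbn-hardback-1st", "isbn-paperback-1st", "isbn-paperback-2nd"],
}

# Hypothetical local holdings: identifier -> shelf location.
HOLDINGS = {"isbn-paperback-2nd": "Level 2, 025.3 GRE"}


def lookup_related(isbn):
    """Stand-in for the remote lookup; always includes the ISBN asked for."""
    return RELATED_ISBNS.get(isbn, [isbn])


def find_any_edition(isbn):
    """Return (isbn, location) pairs for every related edition the library holds."""
    return [(i, HOLDINGS[i]) for i in lookup_related(isbn) if i in HOLDINGS]


print(find_any_edition("isbn-hardback-1st"))
```

The point is that a search for an edition we don’t hold can still lead the user to one that we do.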

These are not new ideas to the area, and Art Rhyno has already done this anyway, but it’s something that I’d like to do if I get the time. It would also be nice to see these as part of standard LMS functionality…

Referencing

We are working with e-learning a lot at the moment, mainly on integrating reading/resource lists into the learning environment. We are making some progress on this (although unfortunately it’s all behind closed doors, so I can’t demonstrate here).

Anyway, one idea which came up was that quite often references are not saved for the ‘reading list’, but rather put into the course material at the appropriate point. We thought it would be a great idea if it was possible to just ‘drag’ the reference from the library system by a very short, simple, snippet of code, which could be embedded in any webpage.

This isn’t 100% ideal (it still requires some ability to edit HTML), and after seeing the stuff that Art Rhyno has just come up with for his Lookup Helper, I’m quite jealous. However, since I’m talking about web material here, it doesn’t seem much of a stretch to say ‘and now paste in these 2 lines of code’.

So, how about a solution similar to the one Andy Powell and Pete Cliff came up with for RSS-xpress Lite? This seems really neat – just very simple.

Speed could be an issue, but I’ll worry about that later…

Your search found no hits

An idea from elsewhere (I don’t want to be seen plagiarising, but also don’t know the etiquette of referencing someone without asking). Google offer a ‘did you mean to search for xxx’ prompt – which is a great idea for a start (why doesn’t our library system offer this?) – but how about a ‘did you mean to search in’ function? Often our users search for information in the wrong place (e.g. searching for journal article details in the library catalogue). Why not run a background search for the same terms in some other systems (e.g. a federated search engine) and, as well as giving the search results from the system they started with, let them know how many hits they would have got if they had searched elsewhere?
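A minimal Python sketch of the ‘did you mean to search in’ idea, under some loud assumptions: the two search functions are hypothetical stand-ins that just return canned hit counts, and I’ve chosen to suggest another system only when it would have returned more hits than the one the user started in.

```python
# Hypothetical stand-ins for real search targets; each returns a hit count.
def search_catalogue(query):
    return {"library review": 1}.get(query.lower(), 0)


def search_federated(query):
    return {"library review": 1, "load of codswallop": 12}.get(query.lower(), 0)


# Systems to try in the background, keyed by the name shown to the user.
OTHER_SYSTEMS = {"federated search": search_federated}


def search_with_suggestions(query, primary=search_catalogue, others=OTHER_SYSTEMS):
    """Run the main search, then report other systems that would have found more."""
    hits = primary(query)
    suggestions = []
    for name, search in others.items():
        try:
            count = search(query)
        except Exception:
            continue  # a failing background search should never break the main one
        if count > hits:
            suggestions.append(f"Did you mean to search in {name}? ({count} hits)")
    return hits, suggestions


hits, suggestions = search_with_suggestions("Load of Codswallop")
print(f"Your search found {hits} hits")
for line in suggestions:
    print(line)
```

In a real system the background searches would presumably need to run in parallel and with a timeout, so that a slow remote target doesn’t hold up the results the user actually asked for.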

Google Deskbar

Google Labs have released a beta of the Google Deskbar. This sits in your Windows taskbar and will do Google searches for you. What’s neat is that it can be easily adapted to search other resources such as a library catalogue or a federated search tool if you know the syntax.
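‘Knowing the syntax’ really just means knowing the URL that the target system expects for a search. As a sketch of the general idea in Python (this is not the Deskbar’s own configuration format, and the example.org addresses and parameter names are invented), each target is a URL template with a slot for the search terms:

```python
from urllib.parse import quote_plus

# Hypothetical URL templates; "{terms}" marks where the search terms go.
SEARCH_TEMPLATES = {
    "catalogue": "https://catalogue.example.org/search?q={terms}",
    "federated": "https://metasearch.example.org/find?query={terms}&scope=all",
}


def build_search_url(target, terms):
    """Fill the chosen template with URL-encoded search terms."""
    return SEARCH_TEMPLATES[target].format(terms=quote_plus(terms))


print(build_search_url("catalogue", "mucky pup"))
```

Encoding the terms matters more than it might seem: phrase searches with spaces and apostrophes are exactly the kind of thing users type.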

It’s not unique – there are other toolbars around, notably Dave’s Quick Search Deskbar, which is open source. However, I found that one more difficult to configure than the Google version. I also really like the fact the Google one comes with a stripped-down browser included – something that has been (I think) overlooked by many commentators.

One idea that has kicked around the SFX community is an ‘OpenURL’ toolbar/deskbar to help locate full-text online from a citation. Ex Libris, the company which sells SFX, have actually created one for their development team to use, but seem unwilling or unable to distribute it to SFX customers. There is also the question of whether our library users will actually be interested…
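For anyone wondering what such a toolbar would actually do with a citation, here is a rough Python sketch. It has nothing to do with the Ex Libris tool, which I haven’t seen: the resolver address is hypothetical, the citation values in the example are placeholders, and the field names follow the older key/value style of OpenURL (genre, issn, volume and so on) as I understand it.

```python
from urllib.parse import urlencode

# Hypothetical resolver address; a real one would be your institution's link server.
RESOLVER_BASE = "https://resolver.example.org/sfx"

# Citation fields passed through to the resolver, in a fixed order.
OPENURL_FIELDS = ("genre", "issn", "title", "atitle", "aulast", "volume", "issue", "spage", "date")


def citation_to_openurl(citation, base=RESOLVER_BASE):
    """Build a resolver link from whichever citation fields are present."""
    params = [(key, citation[key]) for key in OPENURL_FIELDS if citation.get(key)]
    return f"{base}?{urlencode(params)}"


# Placeholder citation values, for illustration only.
print(citation_to_openurl({"genre": "article", "issn": "0000-0000", "volume": "12", "spage": "34"}))
```

The hard part of a real toolbar would be getting from a pasted citation string to those separate fields, which this sketch simply assumes has already been done.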