In what will eventually be a series of 5 posts (I think) I’m going to walk through a real-life example of some problematic MARC records I’ve been working with, using a combination of three tools (the Notepad++ text editor, MarcEdit and OpenRefine). I want to document this process partly because I hope it will be useful to others (including future me) and partly because I’m interested to know if I’m missing some tricks here. I’d like to thank the Polytechnic of Namibia Library for giving me permission to share this example.
This is the first post in the series, and describes the problem I was faced with…
I was recently contacted by a library that was migrating to a new library system but had hit a problem. When they came to export MARC records from their existing system, it turned out that what they got wasn’t valid MARC, and wouldn’t import into the new system.
I agreed to take a look and, based on a sample of 2,000 records, found the following problems:
- Missing indicators / indicators added in the incorrect place within the field, rather than at the start of the field
- Incorrect characters used to indicate ‘not coded/no information’ in MARC field indicators
- Subfields appearing in fixed-length fields
- Use of invalid subfield codes (in particular ‘_’)
- System number incorrectly placed in the 002 field, rather than the 001 field
- Several issues with the MARC record leader (LDR) – see the sketch after this list – including:
- Incorrect characters used to indicate ‘not coded/no information’
- Incorrect characters in “Record status” (LDR/05)
- Incorrect characters in “Bibliographic level” (LDR/07)
- Incorrect character encoding information (LDR/09)
- Incorrect characters in “Encoding level” (LDR/17)
- Incorrect characters in “Descriptive cataloging form” (LDR/18)
- Incorrect characters in “Multipart resource record level” (LDR/19)
- Incorrect characters in “Length of the implementation-defined portion” and “Undefined” (LDR/22 and LDR/23)
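To make the leader problems above concrete: the leader is a fixed 24-character string at the start of each MARC record, and each of the positions listed only permits a small set of values. Below is a minimal sketch, in Python using PyMARC, of the kind of position-by-position check involved – this isn’t the process I actually used, the filename is hypothetical, and the allowed-value sets are simplified from the MARC 21 specification.

```python
from pymarc import MARCReader

# Simplified allowed values for selected leader positions (MARC 21).
# Check the specification before relying on these sets.
LEADER_CHECKS = {
    5: set("acdnp"),        # Record status
    7: set("abcdims"),      # Bibliographic level
    9: set(" a"),           # Character coding scheme (blank = MARC-8, a = Unicode)
    17: set(" 1234578uz"),  # Encoding level
    18: set(" acinu"),      # Descriptive cataloging form
    19: set(" abc"),        # Multipart resource record level
    22: set("0"),           # Length of the implementation-defined portion
    23: set("0"),           # Undefined (always 0)
}

with open("records.mrc", "rb") as fh:  # hypothetical filename
    for i, record in enumerate(MARCReader(fh)):
        if record is None:  # PyMARC couldn't parse this record at all
            print(f"record {i}: unreadable")
            continue
        leader = str(record.leader)
        for pos, allowed in sorted(LEADER_CHECKS.items()):
            if leader[pos] not in allowed:
                print(f"record {i}: LDR/{pos:02d} is {leader[pos]!r}")
```

Of course, a check like this only works once the file can be parsed as MARC at all – which, as it turned out, was the first hurdle.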
At this point I felt I had a pretty good view of the issues, and agreed to fix the records to the point they could be successfully loaded into the new library system – making it clear that:
- It wouldn’t be possible to improve the MARC records beyond the data provided to me
- Where there was insufficient data in the export to make the MARC records valid, I’d use a ‘best guess’ at the appropriate values in order to make the records valid MARC
- I wouldn’t be trying to improve the cataloguing data itself, but only to correct the records to the point they were valid MARC records
At this point the library sent me the full set of records they needed correcting – just under 50k records. Unfortunately this new file turned up an additional problem: incorrect ‘delimiter’, ‘field terminator’ and ‘record terminator’ characters had been used in the MARC file, which meant (as far as I could tell) that MarcEdit (or code libraries like PyMARC) wouldn’t recognise the file as MARC at all.
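In valid MARC 21 these structural characters are fixed byte values: 0x1F for the subfield delimiter, 0x1E for the field terminator and 0x1D for the record terminator. One quick way to confirm this kind of problem is to count those bytes in the raw export – a minimal sketch, again with a hypothetical filename:

```python
# Count the standard MARC 21 structural bytes in a raw export. If a file
# containing tens of thousands of records has few or none of these, the
# export must have substituted non-standard characters for them.
STRUCTURAL_BYTES = [
    ("subfield delimiter (0x1F)", 0x1F),
    ("field terminator (0x1E)", 0x1E),
    ("record terminator (0x1D)", 0x1D),
]

with open("export.mrc", "rb") as fh:  # hypothetical filename
    data = fh.read()

for name, byte in STRUCTURAL_BYTES:
    print(f"{name}: {data.count(byte)} occurrences")
```

If those counts come back at or near zero, the next job is working out which characters were used instead – which is where a decent text editor comes in.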
So I set to work – my first task was to get to the point where MarcEdit could understand the file as MARC records, and for that I was going to need a decent text editor as I’ll describe in Part 2…