Though there haven’t been many requests for the information in this post, we felt it would be a good idea to keep everyone up to date on our internal direction and progress, much as Jason’s earlier post regarding the Staff Client did. What follows is an overview of where we are with regard to the storage and retrieval of library-specific data. We will be focusing on the domain-specific stuff, especially bibliographic data in the form of MARC records.
First, a little background. After working with the data for the last 3.5 months, I have come to realize that binary MARC is just not suited to being stored in a relational database. The main problem is that when we normalize, or flatten out, the data so that we can have well-structured tables, the layout just doesn’t contain enough information to search quickly. What this means, in plain English, is that MARC is far too flexible to impose the rules we need (this is where you find a title; this is where you find an author; field ‘x’ always means ‘y’) in order to look up a particular piece of information quickly. There is just too much cross-field interpretation that must happen to represent these questions in SQL.
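To make that concrete, here is a minimal sketch of what even a simple title lookup turns into against fully normalized MARC data. The schema and the tag rules here are hypothetical, chosen just to illustrate the problem, not anything from our actual design:

```python
import sqlite3

# Hypothetical fully-normalized layout: one row per MARC subfield.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE marc_subfield (
        record_id INTEGER,
        tag       TEXT,   -- e.g. '245'
        ind1      TEXT,
        ind2      TEXT,
        code      TEXT,   -- subfield code, e.g. 'a'
        value     TEXT
    );
""")

# Even "find records whose title contains 'dune'" needs cross-field
# rules baked into the query: title proper (245$a/$b), uniform titles
# (130/240), varying titles (246), and so on. Every rule we leave out
# is a record the search silently misses.
rows = conn.execute("""
    SELECT DISTINCT record_id
      FROM marc_subfield
     WHERE ((tag = '245' AND code IN ('a', 'b'))
            OR (tag IN ('130', '240', '246') AND code = 'a'))
       AND value LIKE '%dune%'
""").fetchall()
```

And that is one of the easy cases; multiply it by every searchable concept and the queries become unmanageable.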
There are several approaches that we could take to solve this problem. The first one we looked at was to have our catalogers (thanks, Elaine!) define exactly which tags and subfields we should be indexing for display purposes. This initially looked very promising, but when it came time to extract all these chunks of data we saw that 1) the formatting seemed rather haphazard, 2) we had far too many special cases to represent succinctly, and 3) we ended up having to define a brand new search language that no one else would recognize, and that was very nonstandard by library software … er … standards.
Then we stepped back and took a look at the other technologies we were using. The main thread we saw running through the entire system was XML. The network protocol is based on it (Jabber), our software internals use it to work with the data, and our client interface will now be written in it. (See here.)
We started by investigating the LOC and Dublin Core standards for representing MARC data as XML. The LOC MARC21Slim standard is a lossless XML wrapper for MARC data. Basically, it rewrites binary MARC records in XML, and these XML documents can be turned back into the original binary MARC format. Dublin Core, on the other hand, cannot be used to recreate original MARC records, because it does not capture all of the information they contain. So, from that point of view, the choice was easy: MARC21Slim it is.
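As a quick illustration of that losslessness, here is a minimal round-trip sketch using the third-party pymarc Python library (our choice of library and the file names are assumptions for the example; any MARC toolkit with MARCXML support would do):

```python
from pymarc import MARCReader, XMLWriter, parse_xml_to_array

# Read one binary MARC record (hypothetical file name).
with open("records.mrc", "rb") as fh:
    record = next(MARCReader(fh))

# Binary MARC -> MARC21Slim XML: nothing is thrown away.
with open("record.xml", "wb") as out:
    writer = XMLWriter(out)
    writer.write(record)
    writer.close()

# MARC21Slim XML -> binary MARC: the original record comes back.
round_tripped = parse_xml_to_array("record.xml")[0]
assert record.as_marc() == round_tripped.as_marc()
```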
But MARC21Slim is essentially the same as binary MARC, and it has all the same shortcomings when it comes to use in a relational database. Well, the Library of Congress saves the day again! Enter the MODS XML format. MODS can be derived directly from MARC21Slim using an XSLT stylesheet (more XML!), and it structures the data into a simple layout that is easy to search. The stylesheet does this by applying all the MARC rules for display formatting and producing a simple document describing the MARC record. The one drawback to MODS is that it, like Dublin Core, is a lossy format: none of the undisplayed fields from the MARC data make it into the MODS document. This is not a big problem, however, as we still have access to the MARC21Slim document that was used to create the MODS document.
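In code, that transformation is a single stylesheet application. A minimal sketch using Python’s lxml library, assuming a locally downloaded copy of the LOC MARC-to-MODS stylesheet (LOC publishes versioned copies, e.g. MARC21slim2MODS3.xsl, under http://www.loc.gov/standards/mods/; the exact file name may vary):

```python
from lxml import etree

# MARC21Slim input and the LOC MARC-to-MODS stylesheet
# (both file names are assumptions for this sketch).
marcxml = etree.parse("record.xml")
stylesheet = etree.parse("MARC21slim2MODS3.xsl")

# Compile the stylesheet once, then apply it to each record.
to_mods = etree.XSLT(stylesheet)
mods = to_mods(marcxml)

print(etree.tostring(mods, pretty_print=True).decode())
```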
So here is what we have now:
- A table in the database that will store the original MARC21Slim documents. This will be used by catalogers to update the MARC data. (Though catalogers will NOT have to learn XML. We will present this data in an interface that hides those details.) See the schema sketch after this list.
- An intermediate format (MODS) that will be used to extract useful (from the end user’s perspective) data from MARC records stored in MARC21Slim.
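To make the plan concrete, here is a rough sketch of the storage side: one table for the master MARC21Slim documents and one for the extracted fields described below. The names and columns are hypothetical, and the real schema will certainly differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Master copy of each record, as a MARC21Slim document.
    CREATE TABLE biblio_record (
        id      INTEGER PRIMARY KEY,
        marcxml TEXT NOT NULL          -- the lossless MARC21Slim XML
    );

    -- One row per extracted, searchable field from the MODS view.
    CREATE TABLE metadata_field (
        record_id  INTEGER REFERENCES biblio_record (id),
        field_name TEXT,               -- friendly name, e.g. 'ISBN'
        content    TEXT                -- extracted value, to be full-text indexed
    );
""")
```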
You might be asking yourself, “Self, how will I get the relevant fields out of the MODS documents?” Good question. The answer, as if you couldn’t guess, is more XML technology. We will be using XPath to extract these fields. XPath is a query language used to find the specific parts of an XML document that are interesting in a particular context. It is fairly simple, and it allows us to fine-tune which parts of the MODS document are of interest to us for the purpose of text-based searching. Once we get the content from the fields we want, we can then store it in a metadata table. This table will hold a friendly name for the field, such as “ISBN”, the actual content, such as “067985858X”, and a link back to the original MARC21Slim document from which the information was extracted. We will then be able to apply full-text indexing against the content column of that table, and quickly retrieve the original record whose ISBN is equal to 067985858X.
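Here is a minimal sketch of that extraction step, again using lxml. The XPath expressions, field names, and file name are illustrative, not our final index definitions:

```python
from lxml import etree

MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}

# Friendly field name -> XPath into the MODS document (illustrative).
FIELD_PATHS = {
    "title": "//mods:titleInfo/mods:title/text()",
    "ISBN":  "//mods:identifier[@type='isbn']/text()",
}

mods = etree.parse("record.mods.xml")   # hypothetical file name
for name, path in FIELD_PATHS.items():
    for value in mods.xpath(path, namespaces=MODS_NS):
        # In the real system each (name, value) pair would be inserted
        # into the metadata table, keyed back to the source record.
        print(name, "=>", value.strip())
```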
Through all of this we are making use of technology and formats accepted by both the library community and the software development industry. This adherence to standards will really pay off when we need to extend the system, and when it comes time to integrate with any other modern software our libraries may need.
You may have noticed that everything so far only pertains to bibliographic records. Because authority records are much simpler, and because there is no finalized equivalent to the MODS format for authority records, we will need to extract the relevant fields by hand. I’ll leave that discussion for another post.