A Tale of Two Encodings


In my last post I explained the basics of how we will be handling MARC records as XML. To recap, we will store our bibliographic records as MARC21slim XML documents. This format is recommended and maintained by the Library of Congress, so we feel confident using it. Now we just need to morph our binary records into XML! (WARNING: What follows is pretty technical, so hang on tight.)

As before, I’ll start with some background on the technology, and why it is important.

Normally, binary MARC records are stored in a special extended encoding called MARC8. The majority of text that English-speaking librarians see and put into their bibliographic records is made up of 26 uppercase letters, 26 lowercase letters, ten numerals, and some punctuation. But that leaves every language not based on Latin out in the cold. That is where MARC8 comes in. Seven bits are enough to encode languages with small character sets: 62 alphanumeric characters and a dozen or so punctuation marks in the case of Western languages. Since MARC was developed in the West, the 7 low bits in each byte are reserved for these characters, and for some control characters that are used to mark data field boundaries. That leaves us with one special bit, and in MARC8 encoding it is used to signal that a group of bytes should be interpreted as a single character. The details of this aren’t really important, but the brave can check out some more info here and here.
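
As a rough illustration of that eighth bit, here is a minimal Perl sketch that reports whether a byte string stays within the 7-bit range. The helper name has_high_bit is hypothetical and is not part of any MARC module; it only inspects bytes and makes no attempt to interpret MARC8.

```perl
use strict;
use warnings;

# Hypothetical helper: true when any byte in the string has its
# eighth (high) bit set, i.e. falls in the range 0x80-0xFF.
sub has_high_bit {
    my ($bytes) = @_;
    return $bytes =~ /[\x80-\xFF]/ ? 1 : 0;
}

print has_high_bit("Rowling"),       "\n";    # 0: plain 7-bit text
print has_high_bit("Grandpr\xE2e"),  "\n";    # 1: contains the byte 0xE2
```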

MARC8 was originally conceived in 1968, and of course there have been great (good and large) advances in character encoding since then. The industry standard for multi-lingual encoding is known as Unicode, and its most flexible form is called UTF-8. UTF-8 works in much the same way as MARC8, but has more widespread acceptance in general purpose software than does MARC8. And (here we go again…) UTF-8 is the native dialect of XML.

So our goal is to take binary MARC records encoded in MARC8 and turn them into UTF-8 encoded XML documents. Thankfully, the MARC8 to UTF-8 translation routines we need already exist. And since the core of Evergreen (the now official name of the open-ils.org end product) is written in Perl, we will use the MARC::Charset module to handle this transcoding. There also exists a Perl module for transforming MARC records to XML records, MARC::XML, but until now these two pieces of the MARCXML puzzle were not integrated. That’s what we will do today.

First, let’s take a look at the important part of the XML output module MARC::File::XML:

sub record {
    my $record = shift;
    my @xml = ();
    push( @xml, "<record>" );
    push( @xml, "  <leader>" . escape($record->leader()) . "</leader>" );
    foreach my $field ( $record->fields() ) {
        my $tag = $field->tag();
        if ( $field->is_control_field() ) {
            my $data = $field->data();
            push( @xml, qq(  <controlfield tag="$tag">) .
                escape($data). qq(</controlfield>) );
        } else {
            my $i1 = $field->indicator( 1 );
            my $i2 = $field->indicator( 2 );
            push( @xml, qq(  <datafield tag="$tag" ind1="$i1" ind2="$i2">) );
            foreach my $subfield ( $field->subfields() ) {
                my ( $code, $data ) = @$subfield;
                push( @xml, qq(    <subfield code="$code">).
                    escape($data).qq(</subfield>) );
            }
            push( @xml, "  </datafield>" );
        }
    }
    push( @xml, "</record>\n" );
    return( join( "\n", @xml ) );
}

This code will walk through a MARC record and push the data in each field into an XML element. It’s fairly straightforward, and works well for creating the structure of an XML document. But consider the MARC8 encoded text “J.K. Rowling ; illustrations by Mary Grandprâe.” This lives in tag 245 subfield c. If you didn’t know better you might think everything was fine, and that the illustrator’s last name was supposed to contain that funny “a”. After checking this against several sources, though, we find that Mary’s last name is “Grandpré”. That is not an apostrophe after the e, it is an accent mark embedded in the UTF-8 encoded character. To get from “âe” to “é” we use the MARC::Charset module:


use MARC::Charset;
my $converter = new MARC::Charset;

sub record {
    my $record = shift;
    my @xml = ();
    push( @xml, "<record>" );
    push( @xml, "  <leader>" . escape($record->leader()) . "</leader>" );
    foreach my $field ( $record->fields() ) {
        my $tag = $field->tag();
        if ( $field->is_control_field() ) {
            my $data = $field->data();
            push( @xml, qq(  <controlfield tag="$tag">) .
                escape( $converter->to_utf8( $data ) ). qq(</controlfield>) );
        } else {
            my $i1 = $field->indicator( 1 );
            my $i2 = $field->indicator( 2 );
            push( @xml, qq(  <datafield tag="$tag" ind1="$i1" ind2="$i2">) );
            foreach my $subfield ( $field->subfields() ) {
                my ( $code, $data ) = @$subfield;
                push( @xml, qq(    <subfield code="$code">).
                    escape( $converter->to_utf8( $data ) ).qq(</subfield>) );
            }
            push( @xml, "  </datafield>" );
        }
    }
    push( @xml, "</record>\n" );
    return( join( "\n", @xml ) );
}

We’re done, right? Nope. The MARC21 standard states that binary MARC records can be encoded in Unicode, and that Leader position 9 contains an ‘a’ in this case. We’ll get an error if we try to pass UTF-8 encoded text through MARC::Charset::to_utf8, so we need to test for that first, and only use the UTF-8 conversion if the record is not already Unicode encoded:

use MARC::Charset;
my $converter = new MARC::Charset;

# Convert MARC8 data to UTF-8, unless the record is already Unicode.
sub _perhaps_encode {
    my $data = shift;
    my $already_utf8 = shift;
    $data = $converter->to_utf8($data) unless ($already_utf8);
    return $data;
}

sub record {
    my $record = shift;
    my $unicode_marker = substr($record->leader, 9, 1);
    my $_is_unicode = 0;
    $_is_unicode++ if ($unicode_marker eq 'a');
    my @xml = ();
    push( @xml, "<record>" );
    push( @xml, "  <leader>" . escape($record->leader()) . "</leader>" );
    foreach my $field ( $record->fields() ) {
        my $tag = $field->tag();
        if ( $field->is_control_field() ) {
            my $data = $field->data;
            push( @xml, qq(  <controlfield tag="$tag">) .
                escape( _perhaps_encode($data, $_is_unicode) ). qq(</controlfield>) );
        } else {
            my $i1 = $field->indicator( 1 );
            my $i2 = $field->indicator( 2 );
            push( @xml, qq(  <datafield tag="$tag" ind1="$i1" ind2="$i2">) );
            foreach my $subfield ( $field->subfields() ) {
                my ( $code, $data ) = @$subfield;
                push( @xml, qq(    <subfield code="$code">).
                    escape( _perhaps_encode($data, $_is_unicode) ).qq(</subfield>) );
            }
            push( @xml, "  </datafield>" );
        }
    }
    push( @xml, "</record>\n" );
    return( join( "\n", @xml ) );
}
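
To see the leader test in isolation, here is a small sketch that runs it against two made-up 24-character leaders. The helper name is_unicode_leader and both sample leaders are assumptions for illustration; only the check of position 9 comes from the code above.

```perl
use strict;
use warnings;

# Hypothetical helper: true when Leader position 9 (counting from
# zero) is 'a', which is how MARC21 flags a Unicode-encoded record.
sub is_unicode_leader {
    my ($leader) = @_;
    return substr( $leader, 9, 1 ) eq 'a' ? 1 : 0;
}

# Made-up sample leaders, identical except for position 9.
my $marc8_leader   = '00925cam  2200277 a 4500';    # blank: MARC8
my $unicode_leader = '00925cam a2200277 a 4500';    # 'a':   Unicode

print is_unicode_leader($marc8_leader),   "\n";     # 0
print is_unicode_leader($unicode_leader), "\n";     # 1
```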

So now we’re done, RIGHT? Not exactly. In modern versions of Perl, those from 5.6 on, UTF-8 is the lingua franca. Because of this, Perl tries to “help” us by taking the bytes we just translated from MARC8 to UTF-8 and pushing them into our native system character set and encoding. In the US this is normally LATIN1, also known as ISO-8859-1. LATIN1 is an 8-bit code page, a sort of “poor man’s” extended character set. It uses that special eighth bit we mentioned before to give you an extra 128 characters, but to the exclusion of Unicode character sets. A specific example of this “help” is the character ø, which has both a multi-byte UTF-8 encoding and is represented by character number 248 (0xF8) in the LATIN1 code page.
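
The difference between the two encodings is easy to see with the core Encode module. The sketch below takes the é from the earlier example and prints its bytes in each encoding, then shows what happens when the UTF-8 bytes are wrongly read back as LATIN1. The helper hex_bytes is hypothetical and exists only for this illustration.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

my $char = "\x{E9}";    # é, U+00E9

my $latin1 = Encode::encode( 'ISO-8859-1', $char );
my $utf8   = Encode::encode( 'UTF-8',      $char );

print 'LATIN1: ', hex_bytes($latin1), "\n";    # E9
print 'UTF-8:  ', hex_bytes($utf8),   "\n";    # C3 A9

# Reading the UTF-8 bytes as LATIN1 yields two characters, not one.
my $misread = Encode::decode( 'ISO-8859-1', $utf8 );
print 'Misread length: ', length($misread), "\n";    # 2
```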

What we need is some way of grabbing the raw bytes that make up our shiny new Unicode string so that we can stop Perl from “helping” us. Enter the Encode module. This module is now standard with Perl, but if you don’t have a copy you can grab it at CPAN. Now we have all the tools we need in order to create our MARC8 encoded MARC21 to UTF-8 encoded XML translator. The final example version of our modified MARC::File::XML module now looks like this:

use Encode ();
use MARC::Charset;
my $converter = new MARC::Charset;

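# Convert MARC8 data to UTF-8 and hand back the raw bytes, unless the
# record is already Unicode.
sub _perhaps_encode {
    my $data = shift;
    my $already_utf8 = shift;
    $data = Encode::encode( 'utf8', $converter->to_utf8($data) ) unless ($already_utf8);
    return $data;
}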

sub record {
    my $record = shift;
    my $unicode_marker = substr($record->leader, 9, 1);
    my $_is_unicode = 0;
    $_is_unicode++ if ($unicode_marker eq 'a');
    my @xml = ();
    push( @xml, "<record>" );
    push( @xml, "  <leader>" . escape($record->leader()) . "</leader>" );
    foreach my $field ( $record->fields() ) {
        my $tag = $field->tag();
        if ( $field->is_control_field() ) {
            my $data = $field->data;
            push( @xml, qq(  <controlfield tag="$tag">) .
                escape( _perhaps_encode($data, $_is_unicode) ). qq(</controlfield>) );
        } else {
            my $i1 = $field->indicator( 1 );
            my $i2 = $field->indicator( 2 );
            push( @xml, qq(  <datafield tag="$tag" ind1="$i1" ind2="$i2">) );
            foreach my $subfield ( $field->subfields() ) {
                my ( $code, $data ) = @$subfield;
                push( @xml, qq(    <subfield code="$code">).
                    escape( _perhaps_encode($data, $_is_unicode) ).qq(</subfield>) );
            }
            push( @xml, "  </datafield>" );
        }
    }
    push( @xml, "</record>\n" );
    return( join( "\n", @xml ) );
}

Conclusion

I have only presented the MARC to XML direction here, but very similar techniques are used to go the other way, from an XML document to a MARC8 encoded MARC21 record.

Working with internationalization standards is not easy, especially for Westerners like myself who are used to small character sets. But it is something that needs to be done, and should be done well, especially by Westerners like myself who want their software to be used by as many people around the world as possible. It certainly isn’t any easier when you are working with multiple multi-byte encodings, but thankfully a lot of the hard work has already been done, and I get to “stand on the shoulders of giants,” so to speak.

I’d like to thank Ed Summers for inviting me to help maintain the MARC::XML Perl modules. I will be cleaning up my changes and submitting them soon, so be on the lookout for an updated version. If you have any comments or suggestions please feel free to drop me a line.