Repairing the Nebraska Newspapers Space-Time Continuum

Nebraska Newspapers

Since 2007, the CDRH has been cultivating our web-based time machine, Nebraska Newspapers, in partnership with the Library of Congress’s National Digital Newspapers Program (NDNP) and funded by grants from the National Endowment for the Humanities (NEH). At the time of writing, we have published 45 newspapers with full text and high-resolution scan images, and more are on the way after receiving our third NDNP grant in 2016.

National Endowment for the Humanities

Plattsmouth Papers

Newspaper with headline fading from "Emmet Brown Condemned" to "Emmet Brown Commended"

Upon ingest of the Plattsmouth, Nebraska newspapers, CDRH staff noted signs of a rift in the space-time continuum: problems with the dates assigned to some of the newspaper issues. Papers from the early 1900s were miskeyed as being from the early 1800s — a time when few white settlers were living in Nebraska and certainly when no newspapers were being published here. These errors were not caught in the validation process, because the validator only checks whether the date value has a valid date format, not whether the date is actually the correct one. Typos like this sound simple to fix, but changing the date across hundreds of filenames and within the XML of those files would take hours and hours to complete manually.

Laura’s timeline of Plattsmouth papers’ title shifts, publication dates, and LCCNs

The Plattsmouth Public Libraries soon also noted that title shifts had not been identified. Plattsmouth has had two major newspapers, The Plattsmouth Herald and The Plattsmouth Journal, but both of these newspapers have undergone title changes throughout their history. According to standard cataloging practice, each newspaper title shift requires a unique Library of Congress Control Number (LCCN). Additionally, the NDNP guidelines require the use of print LCCNs, not those identifying the microfilm versions. Although the Plattsmouth papers had not been selected for inclusion in the NDNP, CDRH staff decided they should comply with the Library of Congress standards in hopes of being able to include the papers in Chronicling America in the future.

Given the information from the Plattsmouth Public Libraries, and armed with the US Newspaper Directory, our Metadata Encoding Specialist Laura Weakly investigated and sorted out what had been mixed up with the aid of her trusty whiteboard. Laura documented a few date ranges where LCCNs had been misapplied when the paper’s name had changed and identified where the microfilm LCCN had been used. Like correcting the dates, repairing these LCCN problems would involve similarly monumental amounts of manual work. Rather than undertake this time-consuming process, Laura approached the dev team about writing scripts to save time and minimize the possibility of new typos.

Flux capacitor

Python Scripts

To write the needed scripts, I decided to use our flux capacitor, Python, because it powers Chronam, the software behind the Library of Congress’s Chronicling America and other NDNP websites, including Nebraska Newspapers. We usually use Bash or Ruby scripts for our projects, but I wanted this to be written in the language familiar to others working on NDNP projects. I had only written a small handful of Python scripts in the past, but I looked forward to refreshing my time-saving skills with it.

Fix Dates by LCCN

Source: fix_dates_by_lccn.py

I chose to start by focusing on the incorrect dates as it seemed a little less complex than updating LCCNs.

Time travel machine date selector

The script begins by using argparse to implement common command line options and the necessary arguments to control the script, similar to how I had begun using getopts for my Bash scripts and OptionParser for my Ruby scripts.

# Arguments
# ---------
parser = argparse.ArgumentParser()

# Optional args
parser.add_argument("-d", "--dry_run", action="store_true",
                    help="don't make any changes to preview outcome")
parser.add_argument("-q", "--quiet", action="store_true",
                    help="suppress output")
parser.add_argument("-s", "--search_dir",
                    help="directory to search (default: /batches)")
parser.add_argument("-v", "--verbose", action="store_true",
                    help="extra processing information")

# Positional args
parser.add_argument("lccn", help="LCCN to be fixed")
parser.add_argument("bad_date", help="incorrect date")
parser.add_argument("new_date", help="corrected date")

args = parser.parse_args()

This allows us to execute the script as a dry run first, with verbose output, to verify which files it finds from our arguments without modifying anything, and then run it again in quiet mode and let it silently work its magic for us:

./fix_dates_by_lccn.py -dv sn95069723 1821 1921
Searching /batches/

Search for bad dates in batch_pm@delivery2_ver01/data/sn95069723

  Search for bad dates in /00000000036/1821111001
    Fix date in file 0045.xml
    Fix date in file 0042.xml
    Fix date in file 0047.xml
    Fix date in file 1821111001.xml
      Replace 1821 in file name with 1921
    Fix date in file 0044.xml
    Fix date in file 0043.xml
    Fix date in file 0046.xml
    Replace 1821 in dir name with 1921
    Update dates in batch XML covering sn95069723
...
./fix_dates_by_lccn.py -q sn95069723 1821 1921

The script reformats the date arguments and stores them in each of the formats necessary for matching filenames and XML strings. Then the script calls functions to find and collect the directory paths corresponding to the LCCN argument, along with the subdirectories within them that match the incorrect date argument. It accomplishes this by calling os.walk.

Moonwalking
Python’s os.walk doing its thing
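As a rough sketch of that approach (the function and variable names here are illustrative, not the script’s actual internals), a walk like this could collect the issue directories that belong to the affected LCCN and carry the incorrect date:

import os

def find_bad_date_dirs(search_dir, lccn, bad_date):
    # Walk the batch directories, keeping issue directories that belong
    # to the given LCCN and whose names begin with the incorrect date
    matches = []
    for root, dirs, files in os.walk(search_dir):
        if lccn not in root.split(os.sep):
            continue
        for dir_name in dirs:
            if dir_name.startswith(bad_date):
                matches.append(os.path.join(root, dir_name))
    return matches

# e.g. find_bad_date_dirs("/batches", "sn95069723", "1821")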

After gathering the desired directory paths, the script identifies which files are ALTO XML or METS XML and uses the ElementTree XML module to read and modify them. Special care is needed here to preserve some parts of the XML documents.

XML Namespaces

In our original XML file, we see the list of namespaces at the top of the file:

<mets TYPE="urn:library-of-congress:ndnp:mets:newspaper:issue" PROFILE="urn:library-of-congress:mets:profiles:ndnp:issue:v1.5" LABEL="The Plattsmouth Journal, 1821-10-20" 
 xmlns:mix="http://www.loc.gov/mix/" xmlns:ndnp="http://www.loc.gov/ndnp" xmlns:premis="http://www.oclc.org/premis" xmlns:mods="http://www.loc.gov/mods/v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://www.loc.gov/METS/"
 xsi:schemaLocation="
 http://www.loc.gov/METS/ http://www.loc.gov/standards/mets/version17/mets.v1-7.xsd 
 http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-3.xsd" >

For the script to understand and retain namespaces, one must register all of the namespaces used within a file with ElementTree before it parses the XML:

# Set namespaces before parsing
ET.register_namespace("", "http://www.loc.gov/METS/")
ET.register_namespace("mix", "http://www.loc.gov/mix/")
ET.register_namespace("ndnp", "http://www.loc.gov/ndnp")
ET.register_namespace("premis", "http://www.oclc.org/premis")
ET.register_namespace("mods", "http://www.loc.gov/mods/v3")
ET.register_namespace("xsi", "http://www.w3.org/2001/XMLSchema-instance")
ET.register_namespace("xlink", "http://www.w3.org/1999/xlink")
ET.register_namespace("np", "urn:library-of-congress:ndnp:mets:newspaper")

The namespace for the <structMap> element further down in the file is not preserved by ElementTree though, so we manually re-add it:

# Restore structmap namespace that ET doesn't write
file = fileinput.FileInput(file_path, inplace=1)
for line in file:
    print line.replace('<structMap>', '<structMap xmlns:np="urn:library-of-congress:ndnp:mets:newspaper">'),

XML Comments

There are a few comments scattered throughout each XML document. I learned that retaining XML comments with ElementTree requires defining a custom XML parser:

# XML parser to retain comments
class CommentRetainer(ET.XMLTreeBuilder):

    def __init__(self):
        ET.XMLTreeBuilder.__init__(self)
        # assumes ElementTree 1.2.X
        self._parser.CommentHandler = self.handle_comment

    def handle_comment(self, data):
        self._target.start(ET.Comment, {})
        self._target.data(data)
        self._target.end(ET.Comment)

# Parse XML using custom parser
tree = ET.parse(file_path, parser=CommentRetainer())

XML Rewriting and Name Updates

With ElementTree, XML element attributes are usually rewritten in a different order than in the original file, but thankfully this is one area where XML is flexible and remains valid. Once the XML is rewritten, the script updates file and directory names.
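A minimal, hypothetical sketch of that renaming step (not the script’s exact code) might look like this, swapping the incorrect date for the corrected one wherever it appears in a file or directory name:

import os

def rename_bad_date(path, bad_date, new_date):
    # Replace the incorrect date in a file or directory name, if present
    dir_name, base_name = os.path.split(path)
    if bad_date not in base_name:
        return path
    new_path = os.path.join(dir_name, base_name.replace(bad_date, new_date))
    os.rename(path, new_path)
    return new_path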

Date and LCCN String Management

I didn’t realize until I had spent a fair amount of time testing the functionality above that it was also necessary to update the issue records within the batch XML files. The tricky part of that was extracting the tail of the incorrect date path, including the LCCN, to match against the corresponding strings in the XML. There are a lot of date and LCCN strings to manage throughout this process.

# Determine batch file and bad date paths
batch_path_re = re.compile('^(.+)\/sn[0-9]{8}\/')
batch_path = batch_path_re.match(bad_date_path).group(1)
batch_files = [os.path.join(batch_path, "batch.xml"), os.path.join(batch_path, "batch_1.xml")]

bad_date_path_tail_re = re.compile(".+\/({0}\/.+)$".format(lccn))
bad_date_path_tail = bad_date_path_tail_re.match(bad_date_path).group(1)
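To make that concrete, assuming a directory layout like the one in the dry-run output above, the two expressions would yield roughly:

# bad_date_path      = /batches/batch_pm@delivery2_ver01/data/sn95069723/00000000036/1821111001
# batch_path         = /batches/batch_pm@delivery2_ver01/data
# bad_date_path_tail = sn95069723/00000000036/1821111001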

Fix LCCN by Date

Source: fix_lccn_by_date.py

Doc Brown wearing his mind reading helmet looks shocked to see Marty
Wearing Doc Brown’s mind reading helmet is necessary for keeping track of all this information

Very similarly, this script begins by handling command line options and the necessary arguments, then finds the paths to the directories and files that are out of order. Updating the METS XML files doesn’t differ much either.

Things get interesting and complicated, though, when the script needs to track which related reel files are copied and deleted, and whether the changes empty the directory of the incorrect LCCN. References to removed reel files and emptied directories have to be dropped when updating the batch XML file as well, and moved reels have to be inserted into the batch XML file in the correct order.

# Add copied reel to batch reels
if not copied_reel_added:
    if not args.quiet and not batch_file[-6:] == "_1.xml":
        print " Adding copied reel {0} to batch XML".format(reel_copied)

    reel_number = reel_copied.split('/')[1]
    reel_element = ET.Element("reel", {"reelNumber": reel_number})
    reel_element.text = reel_copied +'/'+ reel_number +'.xml' 
    reel_element.tail = '\n\t'

    if copied_reel_index:
        root.insert(copied_reel_index, reel_element)
    else:
        root.append(reel_element)

Testing the Scripts

While writing the scripts, I manually identified and downloaded a small handful of XML files for the affected newspapers from the server, without their accompanying high resolution scan images. I kept an unmodified set of the files and repeatedly copied them to another location, ran the scripts on them, evaluated the output, and deleted them. In hindsight, I could have just made a temporary Git repository and reset or checked out the unmodified files each time after running the scripts on them. But this slip of the mind didn’t cost me much time.
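For the record, that hypothetical Git workflow would have looked something like this (a sketch of what I had in mind, not what I actually ran, with an illustrative directory name):

cd test_files
git init
git add -A
git commit -m "Unmodified newspaper XML"
# run the scripts, inspect the output, then restore the originals
git checkout -- .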

Downloading Bulk Newspaper Files

When I felt the scripts were ready to test on the entire corpus of Plattsmouth papers, I researched how to exclude the image files so I could save time by downloading approximately 4GB of files instead of 300GB. I found an rsync option that provided exactly what I was looking for:

rsync -ahu --info=progress2 --exclude '*.jp2' server_name:/batches/pm_delivery* .

Logical Volume Manager (LVM) Snapshots

Doc Brown sitting in the flying DeLorean

Even after excluding the image files, re-copying the 4GB of XML files would be too time consuming to do each time after running the scripts on them. If I had thought to use Git, this may have been simpler to accomplish. I write that hesitantly because with this much data, even Git may have been significantly slower. I had been looking for a good reason to learn to use LVM snapshots since reading about them a few years ago, and I have partitioned my disks with that possibility in mind ever since. More specifically, I have been partitioning with LVM thin provisioning, which requires LVM2.

If you’re unfamiliar, I recommend reading the Red Hat documentation on Snapshot Volumes and Creating Snapshot Volumes. Here is a brief description of snapshots and their particular use in this situation:

The LVM snapshot feature provides the ability to create virtual images of a device at a particular instant without causing a service interruption.

Because the snapshot is read/write, you can test applications against production data by taking a snapshot and running tests against the snapshot, leaving the real data untouched.

This feature requires a relatively modern Linux system and the necessary underlying LVM2 partitioning scheme. I created my thin-provisioned snapshot and mounted it with the following commands:

lvcreate -s -kn --name newspapers_snapshot data/newspapers
mount /dev/data/newspapers_snapshot /data/newspapers_snapshot

I then ran my scripts on the files in /data/newspapers_snapshot. After the scripts finished, I could nearly instantly delete the modified snapshot and create another, repeating as needed. To be even more thorough, I later downloaded the full 300GB data set. Applying the LVM snapshot process was still lightning fast with all the full-size page images.
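The reset cycle between runs, using the same names as the commands above, was roughly:

umount /data/newspapers_snapshot
lvremove -y data/newspapers_snapshot
lvcreate -s -kn --name newspapers_snapshot data/newspapers
mount /dev/data/newspapers_snapshot /data/newspapers_snapshot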

A cursory search shows that Windows provides a similar snapshot feature via Shadow Copy, and Mac OS users may gain the capability with the third-party software Paragon Snapshot. I haven’t read much about or tried them, though, so I can’t say whether they work as well.

Updating Production Files

Doc Brown shocked, wearing dark sunglasses
My face upon seeing errors on the server command line

After I was convinced that everything was working correctly, I copied the scripts to the production server and tried to run them. But I had provided the wrong fuels for the flux capacitor!

Python 2.6

I developed the scripts on my Fedora Workstation desktop, but the production server runs the latest CentOS 6 release, which uses Python 2.6. Thankfully, the only change I had to make was to add positional argument specifiers to the replacement fields in the formatted strings I was printing.

print "  Could not find batches with bad date {}".format(bad_date)   # Python 2.7
print "  Could not find batches with bad date {0}".format(bad_date)  # Python 2.6

ElementTree Module

The version of the ElementTree module on CentOS 6 differed from the version on my desktop as well, so some of the functions the script calls didn’t exist there. I learned that Python first searches the directory of the script being executed for imported modules, so I tried copying the module files from my workstation to the scripts’ directory on the server. Great Scott! It worked!
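One quick way to confirm which copy of a module Python actually picked up (a check I’m adding here for illustration, not part of the original troubleshooting) is to print the module’s file path:

python -c "import xml.etree.ElementTree as ET; print(ET.__file__)"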

DeLorean struck by lightning and disappearing leaving only fire trails

And that was how we prevented a paradox that could have caused a chain reaction that would unravel the very fabric of the space-time continuum and destroy the entire universe… or at least could have left some inaccuracies in the Plattsmouth newspapers’ history.

Many thanks to Laura Weakly, Karin Dalziel, and Jessica Dussault for their proofreading and feedback on this post.