{"id":232,"date":"2017-06-23T17:19:34","date_gmt":"2017-06-23T22:19:34","guid":{"rendered":"https:\/\/cdrhdev.unl.edu\/log\/?p=232"},"modified":"2017-06-26T15:45:42","modified_gmt":"2017-06-26T20:45:42","slug":"repairing-nebraska-newspapers-space-time-continuum","status":"publish","type":"post","link":"https:\/\/cdrhdev.unl.edu\/log\/2017\/repairing-nebraska-newspapers-space-time-continuum\/","title":{"rendered":"Repairing the Nebraska Newspapers Space-Time Continuum"},"content":{"rendered":"<h2><a href=\"http:\/\/nebnewspapers.unl.edu\"><img loading=\"lazy\" decoding=\"async\" class=\"alignright size-full wp-image-233\" src=\"https:\/\/cdrhdev.unl.edu\/log\/wp-content\/uploads\/2017\/05\/neb_newspapers.jpg\" alt=\"Nebraska Newspapers\" width=\"206\" height=\"245\" \/><\/a>Nebraska Newspapers<\/h2>\n<p>Since 2007, the CDRH has been cultivating our web-based time machine, <a href=\"http:\/\/nebnewspapers.unl.edu\">Nebraska Newspapers<\/a>, in partnership with the Library of Congress&#8217;s <a href=\"http:\/\/www.neh.gov\/projects\/ndnp.html\">National Digital Newspapers Program (NDNP)<\/a> and funded by grants from the <a href=\"http:\/\/www.neh.gov\">National Endowment for the Humanities (NEH)<\/a>. At time of writing, we currently have published 45 newspapers with full text and high resolution scan images. 
More are on the way, too, after we received our third NDNP grant in 2016.<\/p>\n<p><a href=\"http:\/\/www.neh.gov\"><img loading=\"lazy\" decoding=\"async\" class=\"size-medium wp-image-234 aligncenter\" src=\"https:\/\/cdrhdev.unl.edu\/log\/wp-content\/uploads\/2017\/05\/neh_logo-300x74.png\" alt=\"National Endowment for the Humanities\" width=\"300\" height=\"74\" \/><\/a><\/p>\n<h3 class=\"clearfix\">Plattsmouth Papers<\/h3>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignright wp-image-258 size-full\" title=\"Source: https:\/\/giphy.com\/gifs\/back-to-the-future-bttf-backtothefuture-xVywpN1j2PrzO\" src=\"https:\/\/cdrhdev.unl.edu\/log\/wp-content\/uploads\/2017\/06\/bttf-paper.gif\" alt=\"Newspaper with headline fading from &quot;Emmet Brown Condemned&quot; to &quot;Emmet Brown Commended&quot;\" width=\"360\" height=\"198\" \/><\/p>\n<p>Upon ingest of the Plattsmouth, Nebraska newspapers, CDRH staff noted signs of a rift in the space-time continuum: problems with the dates assigned to some of the newspaper issues. Papers from the early 1900s were miskeyed as being from the early 1800s &#8212; a time when few white settlers were living in Nebraska and certainly when no newspapers were being published here. These errors were not caught in the validation process because the validator only checks whether a date value has a valid format, not whether it is the correct date. 
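As a toy illustration (my own sketch, not the NDNP validator's actual code, and `has_valid_date_format` is a hypothetical name), a format-only check happily accepts a date that is syntactically valid but historically impossible:

```python
from datetime import datetime

def has_valid_date_format(value):
    """Format-only check: accepts any string shaped like YYYY-MM-DD,
    passing no judgment on whether the date is plausible."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False

# "1821-11-10" passes, even though no Nebraska newspaper existed in 1821
assert has_valid_date_format("1821-11-10")
assert not has_valid_date_format("1821-13-45")
```

The point is only that format validation cannot catch a miskeyed century.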
Typos like these sound simple to fix, but changing the date across hundreds of filenames and within the XML of those files would take many hours to complete manually.<\/p>\n<figure id=\"attachment_238\" aria-describedby=\"caption-attachment-238\" style=\"width: 225px\" class=\"wp-caption alignright\"><img loading=\"lazy\" decoding=\"async\" class=\"size-medium wp-image-238\" src=\"https:\/\/cdrhdev.unl.edu\/log\/wp-content\/uploads\/2017\/05\/plattsmouth-plan-225x300.jpg\" alt=\"\" width=\"225\" height=\"300\" srcset=\"https:\/\/cdrhdev.unl.edu\/log\/wp-content\/uploads\/2017\/05\/plattsmouth-plan-225x300.jpg 225w, https:\/\/cdrhdev.unl.edu\/log\/wp-content\/uploads\/2017\/05\/plattsmouth-plan-768x1024.jpg 768w, https:\/\/cdrhdev.unl.edu\/log\/wp-content\/uploads\/2017\/05\/plattsmouth-plan-1280x1707.jpg 1280w, https:\/\/cdrhdev.unl.edu\/log\/wp-content\/uploads\/2017\/05\/plattsmouth-plan.jpg 1536w\" sizes=\"auto, (max-width: 225px) 100vw, 225px\" \/><figcaption id=\"caption-attachment-238\" class=\"wp-caption-text\">Laura&#8217;s timeline of Plattsmouth papers&#8217; title shifts, publication dates, and LCCNs<\/figcaption><\/figure>\n<p>The <a href=\"https:\/\/www.plattsmouth.org\/index.php?option=com_content&amp;view=article&amp;id=94&amp;Itemid=44\">Plattsmouth Public Libraries<\/a> soon also noted that title shifts had not been identified. Plattsmouth has had two major newspapers, <em>The Plattsmouth Herald<\/em> and <em>The Plattsmouth Journal<\/em>, but both of these newspapers have undergone title changes throughout their history. According to standard cataloging practice, each newspaper title shift requires a unique Library of Congress Control Number (LCCN). Additionally, the NDNP guidelines require the use of print LCCNs, not those identifying the microfilm versions. 
Although the Plattsmouth papers had not been selected for inclusion in the NDNP, CDRH staff decided they should comply with the Library of Congress standards in hopes of being able to include the papers in <a href=\"http:\/\/chroniclingamerica.loc.gov\/\">Chronicling America<\/a> in the future.<\/p>\n<p>Given the information from the Plattsmouth Public Libraries, and armed with the <a href=\"http:\/\/chroniclingamerica.loc.gov\/search\/titles\/\">US Newspaper Directory<\/a>, our Metadata Encoding Specialist <a href=\"https:\/\/twitter.com\/lweakly\">Laura Weakly<\/a> investigated and sorted out what had been mixed up with the aid of her trusty whiteboard. Laura documented a few date ranges where LCCNs had been misapplied when the paper&#8217;s name had changed and identified where the microfilm LCCN had been used. Like correcting the dates, repairing these LCCN problems would involve a monumental amount of manual work. Rather than undertake this time-consuming process, Laura approached the dev team about writing scripts to save time and minimize the possibility of new typos.<\/p>\n<h2><img loading=\"lazy\" decoding=\"async\" class=\"alignright wp-image-264 size-medium\" title=\"Source: http:\/\/backtothefuture.wikia.com\/wiki\/Flux_capacitor\" src=\"https:\/\/cdrhdev.unl.edu\/log\/wp-content\/uploads\/2017\/06\/bttf-flux-capacitor-300x300.jpg\" alt=\"Flux capacitor\" width=\"300\" height=\"300\" srcset=\"https:\/\/cdrhdev.unl.edu\/log\/wp-content\/uploads\/2017\/06\/bttf-flux-capacitor-300x300.jpg 300w, https:\/\/cdrhdev.unl.edu\/log\/wp-content\/uploads\/2017\/06\/bttf-flux-capacitor-150x150.jpg 150w, https:\/\/cdrhdev.unl.edu\/log\/wp-content\/uploads\/2017\/06\/bttf-flux-capacitor.jpg 720w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/h2>\n<h2>Python Scripts<\/h2>\n<p>To write the needed scripts, I decided to use our flux capacitor <a href=\"https:\/\/www.python.org\/\">Python<\/a>, because it powers the Library of Congress&#8217;s Chronicling 
America and other NDNP websites via the <a href=\"https:\/\/github.com\/LibraryOfCongress\/chronam\">Chronam software<\/a>, which also runs Nebraska Newspapers. We usually use Bash or Ruby scripts for our projects, but I wanted this to be written in a language familiar to others working on NDNP projects. I had only written a small handful of Python scripts in the past, but I looked forward to refreshing my time-saving skills with it.<\/p>\n<h3>Fix Dates by LCCN<\/h3>\n<p>Source: <a href=\"https:\/\/github.com\/open-oni\/open-oni-scripts\/blob\/master\/nebraska\/fix_dates_by_lccn.py\">fix_dates_by_lccn.py<\/a><\/p>\n<p>I chose to start by focusing on the incorrect dates since it seemed a little less complex than updating LCCNs.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-267 aligncenter\" title=\"Source: http:\/\/backtothefuture.wikia.com\/wiki\/Time_circuits\" src=\"https:\/\/cdrhdev.unl.edu\/log\/wp-content\/uploads\/2017\/06\/bttf-time-circuits-e1498246266617.png\" alt=\"Time travel machine date selector\" width=\"380\" height=\"230\" \/><\/p>\n<p>The script begins by using <a href=\"https:\/\/docs.python.org\/2\/howto\/argparse.html\">argparse<\/a> to implement common command line options and the necessary arguments to control the script, similar to how I had begun using <a href=\"http:\/\/wiki.bash-hackers.org\/howto\/getopts_tutorial\">getopts<\/a> for my Bash scripts and <a href=\"http:\/\/ruby-doc.org\/stdlib-2.4.1\/libdoc\/optparse\/rdoc\/OptionParser.html\">OptionParser<\/a> for my Ruby scripts.<\/p>\n<pre># Arguments\r\n# ---------\r\nparser = argparse.ArgumentParser()\r\n\r\n# Optional args\r\nparser.add_argument(\"-d\", \"--dry_run\", action=\"store_true\",\r\n help=\"preview the outcome without making any changes\")\r\nparser.add_argument(\"-q\", \"--quiet\", action=\"store_true\",\r\n help=\"suppress output\")\r\nparser.add_argument(\"-s\", \"--search_dir\",\r\n help=\"directory to search (default: 
\/batches)\")\r\nparser.add_argument(\"-v\", \"--verbose\", action=\"store_true\",\r\n help=\"extra processing information\")\r\n\r\n# Positional args\r\nparser.add_argument(\"lccn\", help=\"LCCN to be fixed\")\r\nparser.add_argument(\"bad_date\", help=\"incorrect date\")\r\nparser.add_argument(\"new_date\", help=\"corrected date\")\r\n\r\nargs = parser.parse_args()<\/pre>\n<p>This allows us to execute the script and do a <a href=\"https:\/\/en.wikipedia.org\/wiki\/Dry_run_(testing)\">dry run<\/a> to verify what files it finds from our arguments with more verbose output without modifying any files before we run it again in quiet mode and let it silently work its magic for us:<\/p>\n<pre>.\/fix_dates_by_lccn.py -dv sn95069723 1821 1921<\/pre>\n<blockquote>\n<pre>Searching \/batches\/\r\n\r\nSearch for bad dates in batch_pm@delivery2_ver01\/data\/sn95069723\r\n\r\n  Search for bad dates in \/00000000036\/1821111001\r\n    Fix date in file 0045.xml\r\n    Fix date in file 0042.xml\r\n    Fix date in file 0047.xml\r\n    Fix date in file 1821111001.xml\r\n\u00a0     Replace 1821 in file name with 1921\r\n\u00a0   Fix date in file 0044.xml\r\n\u00a0   Fix date in file 0043.xml\r\n \u00a0  Fix date in file 0046.xml\r\n \u00a0  Replace 1821 in dir name with 1921\r\n\u00a0   Update dates in batch XML covering sn95069723\r\n...<\/pre>\n<\/blockquote>\n<pre>.\/fix_dates_by_lccn.py -q sn95069723 1821 1921<\/pre>\n<p>The script modifies the date arguments and stores them in each of the formats necessary to use in matching filenames and XML strings. Then the script calls functions to find and collect the directory paths corresponding to the LCCN argument and the directories within which match the incorrect date argument. 
It accomplishes this by calling <a href=\"https:\/\/docs.python.org\/2\/library\/os.html#os.walk\">os.walk<\/a>.<\/p>\n<figure id=\"attachment_266\" aria-describedby=\"caption-attachment-266\" style=\"width: 339px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-266 \" title=\"Source: https:\/\/www.buzzfeed.com\/sabrinabarr\/20-reasons-why-marty-mcfly-is-the-flyest-guy-1qpyv\" src=\"https:\/\/cdrhdev.unl.edu\/log\/wp-content\/uploads\/2017\/06\/bttf-moonwalk.gif\" alt=\"Moonwalking\" width=\"339\" height=\"254\" \/><figcaption id=\"caption-attachment-266\" class=\"wp-caption-text\">Python&#8217;s os.walk doing its thing<\/figcaption><\/figure>\n<p>After gathering the desired directory paths, the script identifies which files are ALTO XML or METS XML and uses the <a href=\"https:\/\/docs.python.org\/2\/library\/xml.etree.elementtree.html\">ElementTree XML module<\/a> to read and modify them. Special care is needed here to preserve some parts of the XML documents.<\/p>\n<h4>XML Namespaces<\/h4>\n<p>In our original XML file, we see the list of namespaces at the top of the file:<\/p>\n<pre>&lt;mets TYPE=\"urn:library-of-congress:ndnp:mets:newspaper:issue\" PROFILE=\"urn:library-of-congress:mets:profiles:ndnp:issue:v1.5\" LABEL=\"The Plattsmouth Journal, 1821-10-20\" \r\n xmlns:mix=\"http:\/\/www.loc.gov\/mix\/\" xmlns:ndnp=\"http:\/\/www.loc.gov\/ndnp\" xmlns:premis=\"http:\/\/www.oclc.org\/premis\" xmlns:mods=\"http:\/\/www.loc.gov\/mods\/v3\" xmlns:xsi=\"http:\/\/www.w3.org\/2001\/XMLSchema-instance\" xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" xmlns=\"http:\/\/www.loc.gov\/METS\/\"\r\n xsi:schemaLocation=\"\r\n http:\/\/www.loc.gov\/METS\/ http:\/\/www.loc.gov\/standards\/mets\/version17\/mets.v1-7.xsd \r\n http:\/\/www.loc.gov\/mods\/v3 http:\/\/www.loc.gov\/standards\/mods\/v3\/mods-3-3.xsd\" &gt;<\/pre>\n<p>For the script to understand and retain namespaces, one must register all of the namespaces used within a 
file with ElementTree before it parses the XML:<\/p>\n<pre># Set namespaces before parsing\r\nET.register_namespace(\"\", \"http:\/\/www.loc.gov\/METS\/\")\r\nET.register_namespace(\"mix\", \"http:\/\/www.loc.gov\/mix\/\")\r\nET.register_namespace(\"ndnp\", \"http:\/\/www.loc.gov\/ndnp\")\r\nET.register_namespace(\"premis\", \"http:\/\/www.oclc.org\/premis\")\r\nET.register_namespace(\"mods\", \"http:\/\/www.loc.gov\/mods\/v3\")\r\nET.register_namespace(\"xsi\", \"http:\/\/www.w3.org\/2001\/XMLSchema-instance\")\r\nET.register_namespace(\"xlink\", \"http:\/\/www.w3.org\/1999\/xlink\")\r\nET.register_namespace(\"np\", \"urn:library-of-congress:ndnp:mets:newspaper\")<\/pre>\n<p>The namespace for the &lt;structMap&gt; element further down in the file is not preserved by ElementTree though, so we manually re-add it:<\/p>\n<pre># Restore structmap namespace that ET doesn't write\r\nfile = fileinput.FileInput(file_path, inplace=1)\r\nfor line in file:\r\n    print line.replace('&lt;structMap&gt;', '&lt;structMap xmlns:np=\"urn:library-of-congress:ndnp:mets:newspaper\"&gt;'),<\/pre>\n<h4>XML Comments<\/h4>\n<p>There are a few comments scattered throughout each XML document. 
I learned that retaining XML comments with ElementTree requires <a href=\"https:\/\/stackoverflow.com\/questions\/4474754\/how-to-keep-comments-while-parsing-xml-using-python-elementtree\/27333347#27333347\">defining a custom XML parser<\/a>:<\/p>\n<pre># XML parser to retain comments\r\nclass CommentRetainer(ET.XMLTreeBuilder):\r\n\r\n    def __init__(self):\r\n        ET.XMLTreeBuilder.__init__(self)\r\n        # assumes ElementTree 1.2.X\r\n        self._parser.CommentHandler = self.handle_comment\r\n\r\n    def handle_comment(self, data):\r\n        self._target.start(ET.Comment, {})\r\n        self._target.data(data)\r\n        self._target.end(ET.Comment)\r\n\r\n# Parse XML using custom parser\r\ntree = ET.parse(file_path, parser=CommentRetainer())<\/pre>\n<h4>XML Rewriting and Name Updates<\/h4>\n<p>With ElementTree, XML element attributes are usually rewritten in a different order than in the original file, but thankfully this is one area where XML is flexible and remains valid. Once the XML is rewritten, the script updates file and directory names.<\/p>\n<h4>Date and LCCN String Management<\/h4>\n<p>I didn&#8217;t realize until I had spent a fair amount of time testing the functionality above that it was also necessary to update the issue records within the batch XML files. The tricky part of that was extracting the tail of the incorrect date path including the LCCN to match against the corresponding strings in the XML. 
There are a lot of date and LCCN strings to manage throughout this process.<\/p>\n<pre># Determine batch file and bad date paths\r\nbatch_path_re = re.compile('^(.+)\\\/sn[0-9]{8}\\\/')\r\nbatch_path = batch_path_re.match(bad_date_path).group(1)\r\nbatch_files = [os.path.join(batch_path, \"batch.xml\"), os.path.join(batch_path, \"batch_1.xml\")]\r\n\r\nbad_date_path_tail_re = re.compile(\".+\\\/({0}\\\/.+)$\".format(lccn))\r\nbad_date_path_tail = bad_date_path_tail_re.match(bad_date_path).group(1)<\/pre>\n<h3>Fix LCCN by Date<\/h3>\n<p>Source: <a href=\"https:\/\/github.com\/open-oni\/open-oni-scripts\/blob\/master\/nebraska\/fix_lccn_by_date.py\">fix_lccn_by_date.py<\/a><\/p>\n<figure id=\"attachment_273\" aria-describedby=\"caption-attachment-273\" style=\"width: 400px\" class=\"wp-caption alignright\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-273\" title=\"Source: robotsfuture.blogspot.com\" src=\"https:\/\/cdrhdev.unl.edu\/log\/wp-content\/uploads\/2017\/06\/bttf-doc-brown-helmet.jpg\" alt=\"Doc Brown wearing his mind reading helmet looks shocked to see Marty\" width=\"400\" height=\"217\" srcset=\"https:\/\/cdrhdev.unl.edu\/log\/wp-content\/uploads\/2017\/06\/bttf-doc-brown-helmet.jpg 700w, https:\/\/cdrhdev.unl.edu\/log\/wp-content\/uploads\/2017\/06\/bttf-doc-brown-helmet-300x163.jpg 300w\" sizes=\"auto, (max-width: 400px) 100vw, 400px\" \/><figcaption id=\"caption-attachment-273\" class=\"wp-caption-text\">Wearing Doc Brown&#8217;s mind reading helmet is necessary for keeping track of all this information<\/figcaption><\/figure>\n<p>This script begins very similarly, handling command line options and necessary arguments and finding the paths to the misfiled directories and files. 
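At its heart, repairing a misapplied LCCN means relocating issue directories into the correct LCCN's tree. A minimal sketch, with a hypothetical helper name and example LCCNs rather than the script's exact code:

```python
import os
import shutil

def move_issue(data_dir, reel, issue, wrong_lccn, correct_lccn):
    """Move one issue directory from the wrong LCCN's tree into the
    correct LCCN's tree, creating the destination reel dir if needed."""
    src = os.path.join(data_dir, wrong_lccn, reel, issue)
    dst_reel = os.path.join(data_dir, correct_lccn, reel)
    if not os.path.isdir(dst_reel):
        os.makedirs(dst_reel)
    shutil.move(src, os.path.join(dst_reel, issue))
    # If the wrong LCCN's reel directory is now empty, clean it up too
    src_reel = os.path.join(data_dir, wrong_lccn, reel)
    if not os.listdir(src_reel):
        os.rmdir(src_reel)
```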
Updating the METS XML files doesn&#8217;t differ much either.<\/p>\n<p>Things get interesting and complicated, though, when the script needs to track which related reel files are copied and deleted and whether the changes empty the incorrect LCCN&#8217;s directory. Removed reel files and emptied directories also have to be dropped when updating the batch XML file, and moved reels have to be inserted into it in the correct order.<\/p>\n<pre># Add copied reel to batch reels\r\nif not copied_reel_added:\r\n    if not args.quiet and not batch_file[-6:] == \"_1.xml\":\r\n        print \" Adding copied reel {0} to batch XML\".format(reel_copied)\r\n\r\n    reel_number = reel_copied.split('\/')[1]\r\n    reel_element = ET.Element(\"reel\", {\"reelNumber\": reel_number})\r\n    reel_element.text = reel_copied + '\/' + reel_number + '.xml'\r\n    reel_element.tail = '\\n\\t'\r\n\r\n    if copied_reel_index:\r\n        root.insert(copied_reel_index, reel_element)\r\n    else:\r\n        root.append(reel_element)<\/pre>\n<h2>Testing the Scripts<\/h2>\n<p>While writing the scripts, I manually identified and downloaded from the server a small handful of XML files from the affected newspapers, without their accompanying high resolution scan images. I kept an unmodified set of the files and repeatedly copied them to another location, ran the scripts on them, evaluated the output, and deleted them. In hindsight, I could have just made a temporary Git repository and reset or checked out the unmodified files after each script run. But this slip of the mind didn&#8217;t cost me much time.<\/p>\n<h3>Downloading Bulk Newspaper Files<\/h3>\n<p>When I felt the scripts were ready to test on the entire corpus of Plattsmouth papers, I researched how to exclude the image files so I could save time by downloading approximately 4GB of files instead of 300GB. 
I found an rsync option that provided exactly what I was looking for:<\/p>\n<pre><tt>rsync -ahu --info=progress2 <strong>--exclude '*.jp2'<\/strong> server_name:\/batches\/pm_delivery* .\r\n<\/tt><\/pre>\n<h3>Logical Volume Manager (LVM) Snapshots<\/h3>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-274 alignright\" title=\"Source: http:\/\/backtothefuture.wikia.com\/wiki\/DeLorean_time_machine\" src=\"https:\/\/cdrhdev.unl.edu\/log\/wp-content\/uploads\/2017\/06\/bttf-delorean.png\" alt=\"Doc Brown sitting in the flying DeLorean\" width=\"400\" height=\"206\" srcset=\"https:\/\/cdrhdev.unl.edu\/log\/wp-content\/uploads\/2017\/06\/bttf-delorean.png 846w, https:\/\/cdrhdev.unl.edu\/log\/wp-content\/uploads\/2017\/06\/bttf-delorean-300x155.png 300w, https:\/\/cdrhdev.unl.edu\/log\/wp-content\/uploads\/2017\/06\/bttf-delorean-768x396.png 768w\" sizes=\"auto, (max-width: 400px) 100vw, 400px\" \/>Even after excluding the image files, re-copying the 4GB of XML files after each script run would be too time-consuming. If I had thought to use Git, this <em>might<\/em> have been simpler to accomplish. I write that hesitantly because with this much data, even Git might have been significantly slower. I had been looking for a good reason to learn to use LVM snapshots ever since reading about them a few years ago, and I have partitioned my disks with that possibility in mind. More specifically, I have been partitioning with LVM thin provisioning, a feature of LVM2.<\/p>\n<p>If you&#8217;re unfamiliar, I recommend reading the Red Hat documentation on <a href=\"https:\/\/access.redhat.com\/documentation\/en-US\/Red_Hat_Enterprise_Linux\/7\/html\/Logical_Volume_Manager_Administration\/lv_overview.html#snapshot_volumes\">Snapshot Volumes<\/a> and <a href=\"https:\/\/access.redhat.com\/documentation\/en-US\/Red_Hat_Enterprise_Linux\/7\/html\/Logical_Volume_Manager_Administration\/LV.html#snapshot_command\">Creating Snapshot Volumes<\/a>. 
Here is a brief description of the feature and its particular use in this situation:<\/p>\n<blockquote><p>The LVM snapshot feature provides the ability to create virtual images of a device at a particular instant without causing a service interruption.<\/p>\n<p>&#8230;<\/p>\n<p>Because the snapshot is read\/write, you can test applications against production data by taking a snapshot and running tests against the snapshot, leaving the real data untouched.<\/p><\/blockquote>\n<p>This feature requires a relatively modern Linux system and the necessary underlying LVM2 partitioning scheme. I created my thin-provisioned snapshot and mounted it with the following commands:<\/p>\n<pre><tt>lvcreate -s -kn --name newspapers_snapshot data\/newspapers\r\nmount \/dev\/data\/newspapers_snapshot \/data\/newspapers_snapshot<\/tt><\/pre>\n<p>I then ran my scripts on the files in \/data\/newspapers_snapshot. After the scripts finished, I could nearly instantly delete the modified snapshot and create another, repeating as needed. To be even more thorough, I later downloaded the full 300GB data set. Applying the LVM snapshot process was still lightning fast with all the full-size page images.<\/p>\n<p>A cursory search shows that Windows provides a similar snapshot feature via <a href=\"https:\/\/en.wikipedia.org\/wiki\/Shadow_Copy\">Shadow Copy<\/a>, and Mac OS users may gain the capability with the third-party software <a href=\"https:\/\/www.paragon-software.com\/technologies\/backup-solutions\/snapshot-mac\/\">Paragon Snapshot<\/a>. 
I haven&#8217;t read much about or tried them, though, so I can&#8217;t say whether they work as well.<\/p>\n<h2>Updating Production Files<\/h2>\n<figure id=\"attachment_275\" aria-describedby=\"caption-attachment-275\" style=\"width: 200px\" class=\"wp-caption alignright\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-275 size-full\" title=\"Source: http:\/\/backtothefuture.wikia.com\/wiki\/Great_Scott\" src=\"https:\/\/cdrhdev.unl.edu\/log\/wp-content\/uploads\/2017\/06\/bttf-great-scott.jpg\" alt=\"Doc Brown shocked, wearing dark sunglasses\" width=\"200\" height=\"197\" \/><figcaption id=\"caption-attachment-275\" class=\"wp-caption-text\">My face upon seeing errors on the server command line<\/figcaption><\/figure>\n<p>After I was convinced that everything was working correctly, I copied the scripts to the production server and tried to run them. But I had provided the wrong fuel for the flux capacitor!<\/p>\n<h3>Python 2.6<\/h3>\n<p>I developed the scripts on my Fedora Workstation desktop, but the production server runs the latest CentOS 6 release, which uses Python 2.6. Thankfully, the only change I had to make was to add positional argument specifiers to the replacement fields in the <a href=\"https:\/\/docs.python.org\/2\/library\/string.html#formatstrings\">formatted strings<\/a> I was passing to the print statement.<\/p>\n<pre>print \"  Could not find batches with bad date {}\".format(bad_date)   # Python 2.7\r\nprint \"  Could not find batches with bad date {0}\".format(bad_date)  # Python 2.6<\/pre>\n<h3>ElementTree Module<\/h3>\n<p>The version of the ElementTree module on CentOS 6 differed from the version on my desktop as well, so some of the functions the script calls don&#8217;t exist there. 
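When two machines disagree like this, a module's __file__ attribute shows which copy of the module was actually imported:

```python
import xml.etree.ElementTree as ET

# A module's __file__ attribute reveals where Python loaded it from,
# which helps diagnose version drift between machines (or confirm
# that a local copy is shadowing the system-installed one).
print(ET.__file__)
```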
I learned that <a href=\"https:\/\/docs.python.org\/2\/tutorial\/modules.html#the-module-search-path\">Python will first search in the directory of the script being executed for modules imported<\/a>, so I tried copying the module files from my workstation to the scripts&#8217; directory on the server. Great Scott! It worked!<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-265 size-full\" title=\"Source: https:\/\/giphy.com\/gifs\/bttf-GzotTsHXPAdyw\" src=\"https:\/\/cdrhdev.unl.edu\/log\/wp-content\/uploads\/2017\/06\/bttf-fire-trails.gif\" alt=\"DeLorean struck by lightning and disappearing leaving only fire trails\" width=\"500\" height=\"266\" \/><\/p>\n<p>And that was how we prevented a paradox that could have caused a chain reaction that would unravel the very fabric of the space-time continuum and destroy the entire universe&#8230; or at least could have left some inaccuracies in the Plattsmouth newspapers&#8217; history.<\/p>\n<p>Many thanks to Laura Weakly, Karin Dalziel, and Jessica Dussault for their proofreading and feedback on this post.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Nebraska Newspapers Since 2007, the CDRH has been cultivating our web-based time machine, Nebraska Newspapers, in partnership with the Library of Congress&#8217;s National Digital Newspapers Program (NDNP) and funded by grants from the National Endowment for the Humanities (NEH). 
At time of writing, we currently have published 45 newspapers with full text and high resolution&hellip;<\/p>\n <a href=\"https:\/\/cdrhdev.unl.edu\/log\/2017\/repairing-nebraska-newspapers-space-time-continuum\/\" title=\"Repairing the Nebraska Newspapers Space-Time Continuum\" class=\"entry-more-link\"><span>Read More<\/span> <span class=\"screen-reader-text\">Repairing the Nebraska Newspapers Space-Time Continuum<\/span><\/a>","protected":false},"author":3,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"Layout":"","footnotes":""},"categories":[40],"tags":[44,41,42,43],"class_list":["entry","author-techgique","post-232","post","type-post","status-publish","format-standard","category-utilities","tag-lvm","tag-newspapers","tag-python","tag-xml"],"_links":{"self":[{"href":"https:\/\/cdrhdev.unl.edu\/log\/wp-json\/wp\/v2\/posts\/232","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/cdrhdev.unl.edu\/log\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/cdrhdev.unl.edu\/log\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/cdrhdev.unl.edu\/log\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/cdrhdev.unl.edu\/log\/wp-json\/wp\/v2\/comments?post=232"}],"version-history":[{"count":41,"href":"https:\/\/cdrhdev.unl.edu\/log\/wp-json\/wp\/v2\/posts\/232\/revisions"}],"predecessor-version":[{"id":289,"href":"https:\/\/cdrhdev.unl.edu\/log\/wp-json\/wp\/v2\/posts\/232\/revisions\/289"}],"wp:attachment":[{"href":"https:\/\/cdrhdev.unl.edu\/log\/wp-json\/wp\/v2\/media?parent=232"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/cdrhdev.unl.edu\/log\/wp-json\/wp\/v2\/categories?post=232"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/cdrhdev.unl.edu\/log\/wp-json\/wp\/v2\/tags?post=232"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}