


    THE ARCHIVE RAIDER: PROOF OF CONCEPT & OCR TESTING

    Collection & Goals

    The Lyndon Baines Johnson Library and Museum (LBJ) plays host to a collection of recordings and transcripts of the President’s telephone conversations and meetings. These recordings are available online in WAV and MP3 format at the Miller Center, a nonpartisan public policy institution affiliated with the University of Virginia.1 However, the transcripts associated with the recordings are behind a paywall and, to my knowledge, are not synced up with the available audio. Furthermore, the LBJ will not permit scanning or copying of the thousands of pages of transcripts without payment of what would amount to a substantial fee.

    The goal of this project was to capture a large volume of transcript pages as efficiently as possible. The captured images needed to be of high enough quality for OCR to render the transcripts into machine-readable text, which could then be synced with audio recordings in GLIFOS.

    Set-Up

    Using a scanner was out of the question due to LBJ policy, so Konrad Lawson’s guide to building an ultra-portable copy stand was used to create an “archive raider.”2 The archive raider attaches to the researcher’s table with a heavy-duty clamp; this project used Manfrotto’s Super Clamp, which can fit a table edge up to 2.17 inches wide and bear a load of 33.07 pounds.3 We attached Manfrotto’s aluminum 2-Section Articulated Arm with Camera Bracket to this clamp. The arm has a maximum length of 23.82 inches and can manage 3.31 pounds at maximum extension, which is sufficient to hold a professional-quality DSLR camera.4 (See Figure 1 for an image of this setup.)

    Attached to the camera bracket was my one-year-old Canon PowerShot SX130 IS, a compact zoom camera with 12 MP and a 12x optical zoom range that starts at 28mm. It does well in low light and features continuous autofocus, image stabilization, and a self-timer that can delay the shot by 2 or 10 seconds. The lack of support for remote capture on this camera was a major hindrance; I would suggest that remote capture is necessary for any long-term, high-volume project for reasons of ergonomics and efficiency. Future work could include testing the Canon Hack Development Kit (CHDK), a third-party firmware enhancement that enables supported camera models to use a USB remote.5 Since remote capture is standard on higher-end camera models but mostly unavailable on budget models, CHDK could potentially reduce the cost of similar projects.

    Franny Gaede
    INF 385R: Survey of Digitization
    Quinn Stewart
    May 10, 2010

    1 Miller Center, http://millercenter.org/.
    2 Lawson, Konrad. The Chronicle of Higher Education, "The Articulated Arm of an Archive Raider." Last modified December 07, 2010. http://chronicle.com/blogs/profhacker/the-articulated-arm-of-an-archive-raider/29243.
    3 Manfrotto Super Clamp, http://www.manfrotto.us/super-clamp-with-2908-standard-stud.
    4 Manfrotto Articulated Arm, http://www.manfrotto.us/2-section-single-articulated-arm-w-camera-bracket-143bkt

    Figure 1: The archive raider in action

    Though I spent considerable time practicing assembling the portable copy stand at home, configuring both it and the camera in the LBJ Library’s Reading Room took nearly 45 minutes. The differences in table width and lighting conditions played havoc with my intended set-up. (For the list of camera settings ultimately used for image capture, please see Figure 2.) Images were captured against a piece of white foam board, which helped align transcript pages and provided a neutral background. I used the camera’s two-second self-timer to prevent blurring caused by pressing the shutter button and disturbing the articulated arm. I was not permitted to use the flash in the reading room, but the automatic white balance and macro mode functions produced bright, crisp images without it. In an effort to create consistent image sets, I left the zoom alone for most shots.

    Testing & Results

    The actual photography proceeded at a brisk pace, taking, on average, 17 seconds per page. This was sometimes interrupted by the need to remove staples and re-staple after capture was complete, which took about 30 seconds per trip. I only had the archivist on duty remove staples for particularly long transcripts (more than six pages); I used a bean bag to weigh down the shorter multi-page transcripts. Adjusting the bean bag for each page added about 7 seconds to capture time; single-page transcripts took only about 10 seconds, compared to the 17-second average. In 90 minutes, I captured 157 transcript pages that cover the period between June 1967 and December 1967, corresponding to 15 audio recordings.
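As a rough check on the capture figures above, the short sketch below works through the arithmetic. It uses only the numbers reported in the text (157 pages, a 17-second average, a 90-minute session); the function names are illustrative, and the text does not itemize how the remaining session time was spent.

```python
# Figures reported in the text above.
PAGES_CAPTURED = 157
AVG_SECONDS_PER_PAGE = 17
SESSION_MINUTES = 90


def capture_minutes(pages: int, seconds_per_page: float) -> float:
    """Return the time spent on page capture alone, in minutes."""
    return pages * seconds_per_page / 60


def pages_per_hour(pages: int, session_minutes: float) -> float:
    """Return the overall throughput across the whole session."""
    return pages / session_minutes * 60


if __name__ == "__main__":
    pure = capture_minutes(PAGES_CAPTURED, AVG_SECONDS_PER_PAGE)
    rate = pages_per_hour(PAGES_CAPTURED, SESSION_MINUTES)
    print(f"Capture alone: {pure:.1f} minutes")
    print(f"Rest of the session: {SESSION_MINUTES - pure:.1f} minutes")
    print(f"Overall rate: {rate:.0f} pages per hour")
```

At the reported average, capture alone accounts for roughly half of the 90 minutes, so anything that trims the non-capture time could matter as much as a faster shutter routine.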

    Captured images were transferred to a MacBook and exported in TIFF format, creating files roughly 36 MB in size with dimensions of 3000 x 4000 pixels. (See Figure 3 for an example image.) I spent two hours exporting, downloading, and organizing the images, audio files from the Miller Center, and machine-encoded text files created in the second half of this project into folders by month and sub-folders by transcript number.

    5 Canon Hack Development Kit, http://chdk.wikia.com/wiki/CHDK.

    Program Mode
    Automatic White Balance
    Macro Mode
    2 second self-timer
    No flash
    Left zoom alone for nearly all images
    Figure 2: Camera settings used for image capture

    Figure 3: Excerpt from transcript number 12009, July 1967
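The reported file size is consistent with uncompressed 24-bit color, which is worth knowing when planning storage for a larger project. The sketch below is a back-of-the-envelope estimate only: it assumes 8 bits per channel, three channels, no compression, and no header overhead, none of which the text states explicitly.

```python
# Figures reported in the text above.
WIDTH_PX = 3000
HEIGHT_PX = 4000
PAGES_CAPTURED = 157


def uncompressed_bytes(width: int, height: int,
                       channels: int = 3, bytes_per_channel: int = 1) -> int:
    """Estimate raw pixel data size (ASSUMES uncompressed 8-bit RGB)."""
    return width * height * channels * bytes_per_channel


if __name__ == "__main__":
    per_image = uncompressed_bytes(WIDTH_PX, HEIGHT_PX)
    total = per_image * PAGES_CAPTURED
    print(f"Per image: {per_image / 1e6:.1f} MB")
    print(f"All {PAGES_CAPTURED} pages: {total / 1e9:.2f} GB")
```

Under those assumptions, a single afternoon of capture already produces several gigabytes of TIFF files, so storage and transfer time should be budgeted before a high-volume visit.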

    OCR & Re-Speaking

    After creating these high-quality TIFFs, the next step was to create a machine-readable version to sync the transcripts with the digitized audio in GLIFOS. Since the transcripts used in this project came from an artifact-ridden service set, I wanted to determine whether OCR or re-speaking would produce the more accurate and efficient results.

    I used ABBYY FineReader 11 for Windows 7 to OCR my TIFF files. I spent one hour training ABBYY to use a custom user pattern to analyze the images and was impressed with the results, considering the idiosyncratic typewriter and diverse formatting of the various transcripts, as well as the previously mentioned artifacts. ABBYY did very well with letters and spacing, but had difficulty with punctuation, particularly the full stop, which it rendered as a non-standard character. After that hour of training, the initial read took 90 seconds and correction took 2 minutes per page. Correcting the resulting text file took 90 seconds per page. Please note that this work was done with a virtualization of Windows on a MacBook, which introduced minor lag.

    I used Dragon Dictate for Mac OS X Lion to re-speak from the images to a text file. I spent about five minutes doing the initial training of my Dragon Dictate profile and another five minutes teaching the program some of the common names appearing in the transcripts. Dragon accurately transcribed what I was saying and performed particularly well on transcripts that contained summaries of conversations rather than formatted dialogue. A dialogue-formatted transcript page took a little more than seven minutes to re-speak, while a paragraph-formatted transcript page took about five minutes. I had particular trouble getting Dragon to transcribe the umms and uhhs that litter the directly transcribed dialogues.
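To put the two approaches side by side, the sketch below projects the per-page timings onto the full set of 157 captured pages. Several things here are assumptions rather than findings: the three OCR figures (90-second read, 2-minute correction, 90-second text-file correction) are treated as sequential steps that add up, "a little more than seven minutes" is rounded down to 7, and the `Method` class and its field names are purely illustrative.

```python
from dataclasses import dataclass

PAGES_CAPTURED = 157  # reported in the text


@dataclass(frozen=True)
class Method:
    name: str
    setup_minutes: float      # one-time training reported in the text
    minutes_per_page: float   # per-page effort (see assumptions in comments)

    def total_hours(self, pages: int) -> float:
        """Project setup plus per-page effort onto a page count, in hours."""
        return (self.setup_minutes + self.minutes_per_page * pages) / 60


# ASSUMPTION: the three OCR timings are additive steps (1.5 + 2.0 + 1.5 min).
OCR = Method("ABBYY FineReader OCR", 60, 1.5 + 2.0 + 1.5)
# Re-speaking training: about 5 minutes of profile setup plus 5 of names.
RESPEAK_PARAGRAPH = Method("Re-speaking, paragraph pages", 10, 5.0)
# ASSUMPTION: "a little more than seven minutes" rounded down to 7.
RESPEAK_DIALOGUE = Method("Re-speaking, dialogue pages", 10, 7.0)

if __name__ == "__main__":
    for method in (OCR, RESPEAK_PARAGRAPH, RESPEAK_DIALOGUE):
        hours = method.total_hours(PAGES_CAPTURED)
        print(f"{method.name}: {hours:.1f} hours for {PAGES_CAPTURED} pages")
```

If the OCR steps do add up this way, OCR and paragraph-style re-speaking come out close to each other per page, and dialogue-formatted pages are where re-speaking falls behind. A different reading of the OCR timings would change that comparison.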

    I also attempted to re-speak from an audio file that only had a summary transcript available; this was an exercise in frustration caused by poor audio quality and thick accents. If there is no extant transcript for a conversation, I recommend creating one with a keyboard rather than a microphone.

    Conclusions and Recommendations

    The archive raider efficiently created good-quality, consistent, OCR-worthy results for a very small sum of money compared to a traditional copy stand. In that sense, this was a highly successful experiment. For anyone interested in using a similar set-up, I would like to make the following recommendations:

    Remote capture is vital: obtain a camera that comes standard with this feature or see if the Canon Hack Development Kit (or a similar project) will enable your camera to support a USB remote. This will reduce the time needed per page, since you will no longer need a time delay to prevent blurring, and it will save you a great deal of back pain.

    Take the time to become extremely familiar with your camera: you will not be able to control lighting conditions in your archive and must be able to adapt to natural, fluorescent, incandescent, and low light.

    Prepare your file structure ahead of time: exporting your large photos and organizing them can be quite time intensive. Research your collection and create your file directory before taking your first photograph; drag and drop on site is far easier than trying to remember what belongs where after the fact.
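One way to prepare the folder structure in advance is a small script that builds the month and transcript-number folders before the visit. The sketch below is a minimal example, not the procedure used in this project: the `build_tree` function, the root folder name, and the month labels are hypothetical, and 12009 (July 1967) is the only transcript number taken from the text.

```python
from pathlib import Path


def build_tree(root: Path, transcripts_by_month: dict[str, list[str]]) -> list[Path]:
    """Create root/<month>/<transcript number>/ folders and return the
    transcript folders. Existing folders are left untouched."""
    created = []
    for month, numbers in transcripts_by_month.items():
        month_dir = root / month
        # Create the month folder even if no transcript numbers are known yet.
        month_dir.mkdir(parents=True, exist_ok=True)
        for number in numbers:
            folder = month_dir / number
            folder.mkdir(exist_ok=True)
            created.append(folder)
    return created


# Example call (hypothetical root folder and month labels):
# build_tree(Path("lbj_transcripts"), {"1967-06": [], "1967-07": ["12009"]})
```

Naming month folders in year-month order keeps them sorted chronologically in any file browser, which makes dragging images into place on site quicker.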

    There are a number of options for creating machine-readable versions of these images, each of which has potential benefits and disadvantages, depending on the document. With some time spent on training, ABBYY FineReader 11 does very well with more complicated formatting. However, it is susceptible to artifacts (as seen in my issue with the full stop) and has difficulty with handwriting. If you have documents with significant handwritten annotation, you may wish to consider re-speaking instead, though the formatting may not be preferred. Dragon Dictate performed very well with summaries that were mostly text and contained little formatting other than standard punctuation. Re-speaking from audio is to be avoided; a good typist will be much faster and encounter less frustration.
