Reanimating Dead PDFs for CAT Tool Use - American Translators Association (ATA)

BY ANNETT BROWN

Basic “low-tech” tips for addressing OCR conversion issues commonly encountered with complexly formatted files (PC and Mac)

Those who know me might be baffled to see an article of mine in the Resource Review section, of all places. After all, far from being an innovator or early adopter of the latest and greatest in the tech world, I need to be nudged quite a bit before I give any new utility a serious look. But speaking to quite a few colleagues over the past 18 years about being a freelance translator, I know I’m not the only tech-averse translator out there.

Fortunately, I tend to make up for some of my lack of software savvy in other ways (e.g., by being a high-speed alphanumeric typist, keyboard short-cutter, and formatting wizard of Word documents). This has allowed me to devise my own alternative approach to addressing various frustrating issues we translators often encounter: image-based texts (i.e., dead or flat PDFs) that were poorly converted by optical character recognition (OCR) utilities.

The scope of this article will be confined to complexly formatted files. In other words, if you handle nothing but plain, single-column text files, you might not have reached the level of annoyance needed to keep reading. Conversely, many legal, medical, and technical translators will know what I’m talking about: dead PDFs containing text spread over multiple columns along with various headers, footers, footnotes, side margins, embedded images, graphics, tables, etc. What about files full of stamps, stickers, and handwritten insertions? And for some additional sanity-robbing kicks, the files are often scanned crookedly.

Generally speaking, OCR conversion issues become most infuriating when they completely thwart the benefits of CAT tool use. The good news is that misspelled words, hyphenation issues, incorrect line breaks, cut-off sentences, rogue tags, and other nuisances can be addressed quite well at various levels of technical sophistication. However, providing an exhaustive review here would go beyond the scope of this article. Much has already been written about the good, the bad, and the ugly of OCR technology by authors far more tech-savvy than I am. My own method hovers at the lower end of the sophistication continuum and involves:

Cutting/pasting text blocks from the PDF into a blank Word document
Saving the Word document as a PDF file
Converting the PDF file back into a Word file using OCR technology
Cleaning up the converted text
CAT tool translation and reformatting

Step 1

If OCR tools worked perfectly, this step would hardly be needed. But the reality is that a PDF fed directly into OCR often yields text that is not very conducive to CAT tool use. So I spend an extra couple of minutes and use Acrobat Reader (PC/Mac) or Preview (Mac) to take screenshots (aka snapshots) of the PDF text blocks. Then I paste them into a blank Word file in the order in which I will want to handle them later. I typically start with headers and footers, omitting any duplicates, followed by side margins and then the main text, while omitting handwritten and other barely legible text. Tables come with their own learning curve. The converted outcomes are hard to predict and there’s no one-size-fits-all fix. More often than not, I take individual screenshots of each table column and continue handling the text column by column (see below).

One major advantage of this approach is that flowing text will end up as flowing text. This is particularly helpful when dealing with journal articles where the main text is frequently interrupted by page/section breaks, images, graphics, footnotes, etc.

Another advantage is that I can straighten crookedly scanned text at this point. While this is not an issue for the human eye, OCR software tends to be able to handle no more than a few degrees of “crookedness.” You can adjust the pasted-in screenshot objects either by turning the green rotation handle at the top of the object or by right-clicking (whether on a Mac¹ or PC²) → Format Picture → 3-D Rotation, and then adjusting the Z-axis by however many degrees upward or downward as necessary.
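If eyeballing the angle proves tedious, the number of degrees can be estimated from two points on the same line of text. The following is only a minimal Python sketch, not part of my usual routine: the function name and the pixel coordinates are hypothetical, and it assumes image coordinates in which y increases downward.

```python
import math


def skew_angle_degrees(x1, y1, x2, y2):
    """Estimate how far a scanned line of text is tilted, in degrees.

    (x1, y1) and (x2, y2) are hypothetical pixel coordinates of two points
    on the same text baseline, left point first. Image coordinates are
    assumed, so y grows downward. A positive result means the line rises
    toward the right; a negative result means it falls toward the right.
    """
    if x1 == x2:
        raise ValueError("the two points must be horizontally separated")
    # Negate the vertical difference because y grows downward in images.
    return math.degrees(math.atan2(-(y2 - y1), x2 - x1))


# Example: a baseline that rises 35 pixels over a width of 1,000 pixels.
angle = skew_angle_degrees(0, 0, 1000, -35)
print(round(angle, 1))
```

A baseline that rises 35 pixels across 1,000 pixels works out to roughly two degrees, so rotating the pasted screenshot by about that amount in the opposite direction should level the text.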

Step 2

Save the Word file as a PDF file. On the “Save As” screen, Mac users click on “Format” in the bottom section and PC users click on “Save as type:”; then, both Mac and PC users select “PDF” from the drop-down menu.

Step 3

Convert the newly obtained PDF file back into a Word file using your OCR utility. I wholeheartedly agree with Jost Zetzsche that the best OCR conversion solution out there may be the one offered through Wordfast Anywhere, which is available for free online. Wordfast Anywhere actually uses the server version of ABBYY, but benefits from some additional proprietary cleanup routines that ultimately produce better outcomes than ABBYY per se. Unfortunately, this comes with a big caveat. Due to clauses in their non-disclosure agreements with clients or agents, many translators find themselves unable to use this tool. The same caveat exists, of course, for other popular cloud-based services.

Instead, I routinely use either ABBYY FineReader Express Edition for Mac or ABBYY FineReader 10 Professional Edition for PC. Neither of these products delivers top results, but by tweaking the inputs and outputs it’s possible to eventually end up with CAT-friendly text files.

Step 4

Now that you have your Word file, let’s do some cleanup work. This is where knowing a multitude of keyboard shortcuts by heart will come in handy. For a summary of some pertinent shortcuts and additional links, see the table below.

Substep 4.1

In the past, I foolishly decided to work with converted text that looked great, only to find out later that it took me longer to fix the remaining issues than to reformat everything from scratch. Unless some of the more complex text was converted particularly well, my preference these days is to get rid of most of the formatting. This will also leave you with far fewer tags to deal with and can be done by various methods. Mine is to select the text and then copy, delete, and re-insert it as unformatted text, all by using shortcuts that allow me to keep my hands on the keyboard.

Mac: Command+A → Command+C → Command+X → Control+Command+V → Select “Unformatted text”

PC: Ctrl+A → Ctrl+C → Ctrl+X → Ctrl+Alt+V → Select “Unformatted text”

Alternatively, if you wish to force the text you’re pasting in to match the text around it, use Option+Shift+Command+V on a Mac, or right-click “Paste Options,” “Merge formatting icon” on a PC.

Substep 4.2

PC users can rejoice over the following options, which are not (yet?) available for Mac users. Much of the post-conversion cleanup can be done with utilities such as TransTools (aka Translator Tools³) or CodeZapper⁴. A small fee may apply. These utilities simplify the cleanup by providing consistent spacing, removing incorrect line breaks, hyphens, or rogue tags, and handling a number of other annoyances created by OCR utilities. I have TransTools installed on my PC and find it quite useful. However, since I work mostly on a Mac, I perform some of these tasks “semi-manually.” Here are a few tips if you find yourself in a similar situation.

Creating Consistent Spacing: My formatting philosophy prescribes that the final document should never contain more than one space in a row. If I need larger gaps, I use tabs. If my OCR-converted document ends up containing numerous consecutive spaces, I remove those with the help of the Find/Replace command by entering the maximum number of consecutive spaces found in the OCR-converted file (let’s say, nine spaces) and replacing them with a single space. Then I remove one space at a time from the top field (i.e., eight spaces, seven spaces … down to two spaces), each time replacing them with one space. Too cumbersome? The process takes only a few seconds, but can be replaced by a relatively simple macro (Mac: click Tools → Macro → Record New Macro; PC: click View → Macro → Record Macro).
Removing Line Breaks (Hard Returns): Go to Find/Replace by pressing
Mac: Shift+Command+H
PC: Ctrl+H
Then type “^p” (Word’s code for a paragraph mark) into the top field and a single space into the bottom field. Then perform Find/Replace as needed, checking each instance individually. (Make sure you stay away from the “Replace All” option!)
Managing Line Spacing: Irrespective of what the OCR tool might produce in terms of line spacing, I generally find managing line spacing in Word cumbersome. If I have the time, I go to the paragraph settings by pressing Option+Command+M (Mac) or right-clicking the paragraph (PC) and adjusting the spacing.
If I’m under serious deadline pressure, I simply insert an empty line between paragraphs and increase/decrease its font size as needed by pressing
Mac: Command+“[” or Command+“]”
PC: Ctrl+“[” or Ctrl+“]”
Hyphenation: I’m not aware of any magic trick that will solve all hyphenation problems instantly. The thing is, of course, that some hard hyphens will be legitimate, others not so much. Technically, a spell check of the source language will flag the affected words, but with a long document it might be too time-consuming to fix all instances this way. It may be faster to use “Find/Replace” to look for hard hyphens and replace any inappropriate ones with “nothing.” Soft hyphens (aka optional hyphens) shouldn’t pose any issues in your CAT tool.
Tables: As I alluded to earlier, the conversion of tables can be a bit dicey and may well deserve an article of its own. Sometimes the results are great, other times a lot of doctoring is required. In the latter case, I tend to convert each column separately and then piece the table back together, occasionally converting the table to text and back to a table. (See Table below.)
Spell Check: If you haven’t done one yet, now is a good time.
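
For readers who are comfortable running a small script on a plain-text export of the converted file, the spacing, line-break, and hyphenation tips above can be sketched in Python. This is only a minimal illustration under stated assumptions, not the routine used by TransTools or CodeZapper: the function names are hypothetical, and the line-break rule (join only when the break does not follow sentence-ending punctuation or a hyphen and the next line starts with a lowercase letter) is a rough heuristic that will miss cases, especially in languages such as German that capitalize nouns.

```python
import re


def collapse_spaces(text):
    """Replace every run of two or more spaces with a single space."""
    return re.sub(r" {2,}", " ", text)


def line_end_hyphen_candidates(text):
    """List words split by a hyphen at a line break, for manual review.

    Nothing is changed here, because some of these hyphens are legitimate.
    """
    return re.findall(r"\w+-\n\w+", text)


def join_broken_lines(text):
    """Join line breaks that appear to fall in the middle of a sentence.

    Heuristic (an assumption, not a rule): a break is treated as spurious
    only if the character before it is not '.', '!', '?', ':', '-', or
    another line break, and the next line starts with a lowercase letter.
    """
    return re.sub(r"(?<![.!?:\-\n])\n(?=[a-z])", " ", text)


sample = (
    "The patient   received the medi-\n"
    "cation twice\n"
    "daily.  No adverse\n"
    "events were reported.\n"
    "Follow-up is planned."
)

cleaned = collapse_spaces(sample)
print(line_end_hyphen_candidates(cleaned))  # hyphenated line ends to check by hand
print(join_broken_lines(cleaned))
```

In the sample, the break inside “medi-cation” is only reported rather than repaired, in keeping with the advice to check each instance individually instead of relying on a blanket “Replace All.”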

Step 5

At this point, your file should be ready to be imported into your CAT tool. Once your translation is done, you’ll need to reformat the text to match the source. But unencumbered by a lot of underlying formatting issues, I find this process to be rather smooth. I typically prefer using tables (with or without visible gridlines) over text boxes or columns. And once again, my ability to resort to an arsenal of shortcuts saves me a lot of time, but that’s another topic altogether.

Limitations

Needless to say, there are instances where neither the most sophisticated software nor the smartest tricks will produce a meaningful outcome within a reasonable timeframe. In my experience, the worst culprits here are documents that have been faxed more than twice (usually discernible by multiple fax transmission headers). The time it takes to fix, for example, a plethora of incorrectly converted vowels might just not be worth it. The same goes for fax transmissions that have vertical lines cutting through the main text. If the problems affect only a paragraph or so and you’re set up for speech recognition, it might be worthwhile to dictate the problematic text into your Word file. I have done so successfully several times, but I generally try to stay away from the most awful fax transmissions out there.

You Don’t Have To Be a Geek!

I have been using this approach for nearly two years now. It virtually allows me to have it all. In most cases, there is no need to forgo the use of your vast CAT memories/glossaries and other CAT tool benefits just because you’re dealing with dead PDFs. You will easily make up the extra time spent up front by leveraging the many advantages of your CAT tool! Who says you have to be a geek? Even novices can reanimate dead PDFs and boost productivity, income, and job satisfaction.

Notes

1. Using Mac software: Word® for Mac 2011.
2. Using PC software: Microsoft Office Professional Plus 2010.
3. Translator Tools, www.translatortools.net.
4. CodeZapper, http://bit.ly/CodeZapper.

Annett Brown is an ATA-certified German>English translator. Academically trained in linguistics in Germany, Russia, and the U.S., she started out as a part-time freelancer in 1999. In the intervening years, she earned a BA in psychology and, while employed with a pharmaceutical company, an executive MBA in project management. Armed with her in-depth knowledge of various functions within a pharmaceutical company (e.g., production, packaging, quality control/assurance, regulatory/medical affairs, marketing, etc.) and special knowledge regarding the handling of controlled substances, she has been a successful full-time pharma/medical freelance translator since 2009. Contact: transbrown@aol.com.

Jost Zetzsche is chair of ATA’s Translation and Interpreting Resources Committee. He writes the “Geekspeak” column for The ATA Chronicle. He is also the co-author of Found in Translation: How Language Shapes Our Lives and Transforms the World, a robust source for replenishing your arsenal of information about how human translation and machine translation each play an important part in the broader world of translation. Contact: jzetzsche@internationalwriters.com.