Quick and dirty OCR, included with Office 2003 and 2007

At work we have a document scanner that saves its scans as PDFs and e-mails them to us, but the PDFs are really just full-page images bundled together, because the scanner doesn’t have OCR capability.

Here’s how to extract the text using Microsoft Office 2003 or 2007. It’s imperfect, but it works with the tools you already have.

1. Open Microsoft Office Document Imaging, which is buried in the Microsoft Office Tools folder in the Start menu.

2. Open your PDF in Acrobat.

3. Click on Page 1 and select Copy.

4. Switch to Microsoft Office Document Imaging and go to Page, Paste Page.

5. Repeat steps 3 and 4 for every remaining page of the document.

6. Select Tools, Recognize Text Using OCR.

7. Select Tools, Send Text to Word. You may have to play with the options to see what gives you the best results.

8. Switch over to Word, clean up the text by hand, and save.

The process won’t perfectly preserve your formatting and you’ll get the standard suite of OCR errors, but it’s usually better than retyping the whole thing from scratch. For documents with complex formatting, you may find it better to copy and paste one column at a time rather than a full page at a time. That also lets you crop out the portions of the page you don’t care about, such as headers, footers and logos.
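If you end up with a lot of pages, some of the hand cleanup is mechanical enough to script. What follows is only a rough sketch in Python, not part of the Office workflow above: it assumes you have saved the OCR output as plain text, and `tidy_ocr_text` is a made-up helper name. It rejoins words that were hyphenated across line breaks, collapses runs of spaces, and reflows each paragraph onto one line. Misread characters still have to be fixed by eye.

```python
import re


def tidy_ocr_text(raw: str) -> str:
    """Apply a few conservative, mechanical fixes to OCR output.

    Hypothetical helper for illustration; it does not correct misread
    characters, only line-break and spacing damage.
    """
    # Normalize Windows and old Mac line endings to "\n".
    text = raw.replace("\r\n", "\n").replace("\r", "\n")

    # Rejoin words split across a line break: "scan-\nner" -> "scanner".
    # Caveat: this also merges genuine compounds ("e-\nmail" -> "email").
    text = re.sub(r"(?<=[a-z])-\n(?=[a-z])", "", text)

    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)

    # Trim stray whitespace at the start and end of every line.
    text = "\n".join(line.strip() for line in text.split("\n"))

    # Treat one or more blank lines as a paragraph break, then join the
    # lines inside each paragraph so Word can re-wrap them itself.
    paragraphs = re.split(r"\n{2,}", text)
    paragraphs = [" ".join(p.split("\n")).strip() for p in paragraphs]
    return "\n\n".join(p for p in paragraphs if p)


if __name__ == "__main__":
    sample = "The scan-\nner does not   have\nOCR capability.\r\n\r\n\r\nSecond  para-\ngraph here. "
    print(tidy_ocr_text(sample))
```

The hyphen rule is the risky one, since it can’t tell a line-break hyphen from a real one, so proofread the result before trusting it.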

If you have text that’s been scanned into images like GIFs, PNGs, JPEGs (shudder) or TIFFs, the same trick works. You’ll just have to use an image viewer rather than Acrobat.

Buying proper OCR software like OmniPage is probably worth it if you do a lot of scanning. But for your occasional OCR needs, this solution can get the job done without your having to buy or install anything.
