Quick and dirty OCR, included with Office 2003 and 2007

At work we have a document scanner that saves its output as a PDF and e-mails it to us, but the PDFs are really just full-page images bundled together, because the scanner doesn’t have OCR capability.

Here’s how to extract the text using Microsoft Office 2003 or 2007. It’s imperfect, but it works with the tools you already have.

1. Open Microsoft Office Document Imaging, which is buried in the Microsoft Office Tools folder in the Start menu.

2. Open your PDF in Acrobat.

3. Click on Page 1 and select Copy.

4. Switch to Microsoft Office Document Imaging and go to Page, Paste Page.

5. Repeat steps 3 and 4 for every page of the document.

6. Select Tools, Recognize Text Using OCR.

7. Select Tools, Send Text to Word. You may have to play with the options to see what gives you the best results.

8. Switch over to Word, clean up the text by hand, and save.

The process won’t perfectly preserve your formatting and you’ll get the standard suite of OCR errors, but it’s usually better than retyping the whole thing from scratch. For documents with complex formatting, you may find it better to copy and paste one column at a time rather than a page at a time, and then you can crop out the portions of the page you don’t want or care about, such as headers and footers and logos.
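Some of the hand cleanup can be scripted before you start proofreading. The following is a minimal Python sketch, not part of Office, that handles three mechanical problems OCR output tends to have: words hyphenated across line breaks, hard line wraps inside paragraphs, and runs of extra spaces. The function name `tidy_ocr_text` is made up for this illustration, and the rules are assumptions about typical output rather than anything specific to Document Imaging.

```python
import re


def tidy_ocr_text(raw: str) -> str:
    """Apply a few mechanical cleanups to text pasted out of an OCR tool."""
    # Normalize Windows and old Mac line endings to plain newlines.
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Rejoin words hyphenated across a line break: "scan-\nner" -> "scanner".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Treat one or more blank lines as a paragraph boundary.
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = []
    for para in paragraphs:
        # Unwrap the paragraph: line breaks and space runs become one space.
        flat = re.sub(r"\s+", " ", para).strip()
        if flat:
            cleaned.append(flat)
    return "\n\n".join(cleaned)


if __name__ == "__main__":
    sample = "The scan-\nner has no  OCR\ncapability.\n\n\nSecond   paragraph."
    print(tidy_ocr_text(sample))
```

The hyphen rule is deliberately blunt: it will also close up a genuine compound such as "well-known" if the line happened to break at the hyphen, so you still need to proofread the result. It does nothing about misrecognized characters, which is where the real hand editing goes.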

If you have text that’s been scanned into images like GIFs, PNGs, JPEGs (shudder) or TIFFs, the same trick works. You’ll just have to use an image viewer rather than Acrobat.
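If the scan is already a TIFF, the clicking can in principle be skipped, because Document Imaging exposes a COM automation object (`MODI.Document`) that scripts can drive. Here is a hedged Python sketch of that approach. It assumes Windows, an Office 2003 or 2007 install that includes Document Imaging, and the third-party pywin32 package. The function name and the `make_document` parameter are inventions for this example; the parameter exists only so the logic can be exercised without Office present.

```python
MI_LANG_ENGLISH = 9  # value of miLANG_ENGLISH in MODI's MiLANGUAGES enum


def ocr_image_with_modi(path, make_document=None):
    """Return the recognized text of each page in a TIFF or MDI file."""
    if make_document is None:
        # Imported lazily so this module still loads where pywin32 is absent.
        import win32com.client
        make_document = lambda: win32com.client.Dispatch("MODI.Document")

    doc = make_document()
    doc.Create(path)  # despite the name, this opens an existing file
    try:
        # Arguments: language, auto-rotate the page, straighten skewed scans.
        doc.OCR(MI_LANG_ENGLISH, True, True)
        # Each page is an image whose Layout holds the recognized text.
        return [image.Layout.Text for image in doc.Images]
    finally:
        doc.Close(False)  # False = do not save changes back to the file
```

Calling `ocr_image_with_modi(r"C:\scans\letter.tif")` (a hypothetical path) would return one string per page. This does not help with PDFs, since Document Imaging cannot open them directly, which is why the copy-and-paste steps exist in the first place.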

Buying proper OCR software like OmniPage is probably worth it if you do a lot of scanning. But for occasional OCR needs, this solution can get the job done without you having to buy or install anything.

Dave Farquhar

David Farquhar is a computer security professional, entrepreneur, and author. He has written professionally about computers since 1991, so he was writing about retro computers when they were still new. He has been working in IT professionally since 1994 and has specialized in vulnerability management since 2013. He holds Security+ and CISSP certifications. Today he blogs five times a week, mostly about retro computers and retro gaming covering the time period from 1975 to 2000.

The Silicon Underground
