From paper to searchable PDF on the cheap

By Becky Waring

You don’t have to shell out $500 for software that converts scanned paper documents into searchable PDF or Office files.

One of the three programs I tested is the clear winner in turning all your scanned images into fully indexed documents.

Ditch the software bundled with your scanner

I’m addicted to the quick and streamlined (and free) Copernic Desktop Search utility. The program finds past articles to help me research new ones, it locates that six-year-old e-mail from my sister, and it looks up forgotten serial numbers. Basically, Copernic unlocks all the data floating around on my hard drive.


But I have a ton of data sitting untouched and forgotten simply because it resides only on paper: my print magazine articles, financial statements, letters from friends and family, etc.

I’ve been looking for good, low-cost software that can turn my scanned paper into searchable PDF or Word documents. I’m not willing to drop a cool $400 on ABBYY FineReader Professional or $500 for Nuance’s OmniPage Professional, the two leading optical-character recognition (OCR) programs for translating scans into text. Nor can I afford $300 for Adobe Acrobat Standard.

My five-year-old scanner — which was top-of-the-line when I bought it and is still very good — came with third-party OCR software, but the program doesn’t work with Vista and lacks upgrade privileges. I can scan documents, but if I want to translate those scans into searchable text, I’m on my own.

Even newer scanners, particularly cheap multifunction models, may lack OCR software. Or worse, you might be stuck with a poor-quality recognition engine. OCR is all about accuracy. If you have to open every page to correct a lot of scan errors manually, it will cost you a great deal of time and aggravation. You’re better off spending a little money on a program that does the job right the first time.

The best all-round scan-conversion tool

For all-round accuracy, formatting prowess, and depth of features, Nuance PDF Converter Pro ($99.99) is the clear winner.

Nuance is also the developer of OmniPage, so PDF Converter Pro’s OCR chops are to be expected. What did surprise me were the extremely powerful additional features packed into this relatively low-cost tool, such as PDF editing and creation (including fillable forms support). It also offers 128-bit encryption and password protection of converted files.

The program even lets you create a searchable PDF archive of Outlook e-mails in one step. PDF Converter Pro is not only the best conversion tool, it’s the best value to boot.

Nuance also sells PDF Create, a $50 program that can do OCR on scanned PDFs, but the program can save the results only as PDFs, not as Office files (.doc, .xls, etc.). Also, PDF Create lacks PDF Converter Pro’s powerful editing features, so it’s useful only for packaging existing documents as PDFs. Unless you have very basic scanning needs, PDF Converter Pro is well worth the extra expense.

PDF Converter Pro has a free 30-day trial version. The software comes as both a standalone program and as plug-ins for Word, Excel, PowerPoint, Outlook, and Internet Explorer, so you can open and convert a document on the fly right within Word, for example.

If you right-click a file in Windows Explorer, you can convert it by choosing the option the program adds to the context menu. There’s no need to open the program first. I like to be very selective about my toolbar add-ins, so I used PDF Converter Pro’s Custom install option to choose exactly which Office and Windows integration features I wanted.

After I ran my test scan file through the converter, I output the results as a Word document (Office 97, 2000, 2003, and 2007 are all supported). The test scan consisted of three pages, all pretty severe tests of OCR capability:

• A complex page from a Consumer Reports review, complete with headlines, photos, captions, white-on-black text, and both two- and three-column material on the same page;

• A bank statement with varying format tables and some areas of shaded background;

• A business letter with graphic logo and signature plus some bold, italic, and underlined text.

PDF Converter Pro delivered the best overall results in converting my test scan to Word. The text was nearly error-free, and the program did an amazing job of reproducing the document’s format, including columns, justification, font selection, and bold and italic text.

The program handled white-on-black text without a hitch and kept graphics in the right places. The conversion was speedy, taking less than a minute. As with all the programs I tried, conversions are so fast you’ll spend more of your time doing the actual scanning.

The one place PDF Converter Pro fell short was the bank-statement scan, where it had problems reproducing the text in shaded areas. A more OCR-friendly scan would probably solve the problem.

Bonus tip: For best results with text recognition, scans should be done in black and white (or grayscale, if you want to keep graphics) at 300 or 400 dpi. Higher-resolution scans just slow things down and require more storage space. Also, keep scans straight so the document is not skewed on the page, and adjust contrast and brightness so that the background is bright white (eliminating the lines in ruled paper, for example).
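To put rough numbers on the storage point in the tip above, here is a minimal Python sketch that estimates the uncompressed size of one scanned page at different resolutions. The page size (US Letter, 8.5 x 11 inches), the bit depths (1 bit per pixel for black and white, 8 for grayscale), and the 600 dpi comparison point are assumptions for illustration. Real scan files are usually compressed, so actual sizes will be smaller.

```python
# Rough estimate of raw (uncompressed) scan size per page.
# Assumptions (not from the article): US Letter paper, 1-bit black and
# white, 8-bit grayscale, and 600 dpi as a "higher resolution" example.

PAGE_WIDTH_IN = 8.5
PAGE_HEIGHT_IN = 11.0

MODES = {
    "black and white": 1,  # bits per pixel
    "grayscale": 8,
}


def uncompressed_scan_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Return the approximate raw size in bytes of one scanned page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


if __name__ == "__main__":
    for dpi in (300, 400, 600):
        for mode, bpp in MODES.items():
            size = uncompressed_scan_bytes(
                PAGE_WIDTH_IN, PAGE_HEIGHT_IN, dpi, bpp
            )
            print(f"{dpi} dpi, {mode}: {size / 1_000_000:.1f} MB")
```

Because the pixel count grows with the square of the resolution, doubling from 300 to 600 dpi quadruples the raw data per page, which is why going beyond 300 or 400 dpi mostly costs time and disk space.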

Figure 1. PDF Converter Pro did an amazing job of converting this scan of a Consumer Reports page (top) to Microsoft Word (bottom).

Since all three of the programs I tested have the ability to convert standard nonscanned PDFs — such as those you might download from the Internet or receive from a colleague via e-mail — to Word, Excel, or PowerPoint formats, I also tested conversion of a three-page tutorial PDF from FileMaker.

PDF Converter Pro did a beautiful job here, as you would expect given the “perfect” source material. The one flaw was a scattering of extra spaces in the middle of words. Interestingly, the second-place performer, ABBYY PDF Transformer, put extra spaces in exactly the same spots, despite being based on the FineReader OCR engine, OmniPage’s direct competition.

You can find and correct these errors easily using Word’s spell-check tool, but they should have been avoided entirely — as the third-place converter was able to do.

The quickest and easiest tool is also one of the best

While PDF Converter Pro was the overall winner, ABBYY PDF Transformer Pro ($99.99) was a close second on accuracy. Also, ABBYY’s program has the best user interface by far: All options are clearly laid out in one window, and two big buttons are helpfully labeled 1 and 2 for opening your scan file and then transforming it. Tips for better conversion results are prominent as well.

Installation of PDF Transformer’s 15-day/50-page free trial went smoothly. Again I used the Custom option to select exactly which Office integration add-ons I wanted to install. Embedded conversion buttons are available for Word, Excel, and Outlook.

You can also use PDF Transformer to create searchable PDFs from Word, Excel, PowerPoint, and Visio documents, including password-protected and permission-controlled PDFs. However, the program has fewer “extra” features, such as PDF editing, than PDF Converter Pro.

ABBYY also makes a $50 program called ScanTo Office that converts scans to PDFs, but this program can’t convert PDFs to Word format and doesn’t create searchable PDFs from nonscanned PDFs. Again, unless you have very basic scanning needs, the extra money is worth spending for the Pro version.

PDF Transformer did a good job with my test file, handling the bank statement better than PDF Converter Pro did, but the program tripped on some of the images and captions on the Consumer Reports page. Another flaw was that some phrases that should have been entirely in bold were formatted with just a few characters in bold here and there.

The business letter converted nearly perfectly, however, and the conversion from regular PDF had only the same problem with scattered extra spaces as PDF Converter Pro experienced.

One other peeve was that PDF Transformer does not ask you to rename your files when you convert them, as the other two programs do. But overall, I really like this program’s ease of use and might have chosen it over PDF Converter Pro if not for the extra features PDF Converter Pro offers for the same price.

The least-capable converter is the accuracy champ

If not for one redeeming quality, I would have passed completely on Investintech’s Able2Extract Pro ($129.95). The program lacks Office integration, fails badly in formatting and graphic retention, and has virtually no extra features. It’s also more expensive than the other two converters I looked at. However, Able2Extract delivered the most accurate text conversion of the three.

In my test documents, the financial statement was reproduced almost perfectly, with layout in place. The Consumer Reports page and business letter had a single small error on each page (in areas the other two programs did much worse on).

All these errors can be corrected in just a few minutes. However, bold and italic formatting was lost in the conversion, graphics did not come through at all, and all the text was in a single font. Basically, if you need only accurate text rather than fully formatted pages, Able2Extract Pro wins.

Similarly, in the FileMaker PDF conversion test, Able2Extract was the only one of the three programs that did not insert random spaces into the middle of words (proving that it’s possible to avoid this problem). Annoyingly, the program makes you click through two dialog boxes for each conversion.

Since Able2Extract is based on Nuance OCR technology, it’s hard to understand why the results were so dramatically different from Nuance’s own PDF Converter Pro, but they were.

Able2Extract has a seven-day free trial that converts a maximum of three pages at a time, so you can try it out and see whether the accuracy overcomes its other limitations. There’s also a $69.95 Able2Doc version that converts only to Word/XPS and can’t output to Excel, PowerPoint, or PDF formats.

Becky Waring has worked as a writer and editor for PC World, NewMedia Magazine, CNET, The San Francisco Chronicle, Technology Review, Upside Magazine, and many other news sources. She alternates the Best Software column with Windows Secrets contributing editor Scott Spanbauer.

All Windows Secrets articles posted on 2008-07-03:
