  1. #1
    5 Star Lounger
    Join Date
    Jan 2011
    Location
    Seattle, WA
    Posts
    1,070
    Thanks
    42
    Thanked 132 Times in 86 Posts

    Moving documents from paper to bits to text




    BEST SOFTWARE



    By Lincoln Spector

    Remember the paperless office? Never happened; we still have piles of paper that take up too much room, can be difficult to search, and can't be encrypted. OCR software lets you scan important documents and turn them into searchable PDFs. But the technology is still far from perfect.
    On the other hand, you'll miss a lot of junk, too.

    The full text of this column is posted at windowssecrets.com/best-software/moving-documents-from-paper-to-bits-to-text/ (paid content, opens in a new window/tab).

    Columnists typically cannot reply to comments here, but do incorporate the best tips into future columns.

  2. #2
    3 Star Lounger
    Join Date
    Dec 2009
    Location
    Courtenay, BC
    Posts
    244
    Thanks
    9
    Thanked 16 Times in 15 Posts
    Thanks, Lincoln.
    Another option you didn't mention is getting the software bundled with a scanner. I got Fujitsu's ScanSnap S1300i. It's a full duplex scanner (both sides) that makes scanning office documents, from business cards to 8.5x14 pages, fast and easy. Much easier than a flatbed. It auto-corrects and adjusts all the typical things (colour or not, duplex or not, straightening, etc.). You can set it to default to various formats, with or without OCR. I typically have it scan to PDF, then run OCR as a separate step on those documents that would benefit from it. (I also scan photos, notes, etc.) It came with ABBYY; the only limitation is that it checks the meta tags to ensure the document was scanned with the ScanSnap.

    I've now processed thousands of pages with it: the old file cabinets, archives, shurlock books, and binders of grad work. Vastly easier than scanning page by page on a flatbed. And now it's all fully searchable and quotable.

    I've also used ABBYY and Fujitsu scanners professionally in a shop that processed thousands of pages to PDF daily, so I knew both were excellent, high-quality products.

  3. #3
    WS Lounge VIP mrjimphelps's Avatar
    Join Date
    Dec 2009
    Location
    USA
    Posts
    3,396
    Thanks
    445
    Thanked 404 Times in 376 Posts
    How accurate are your scans? In my experience, you don't get perfect accuracy from OCR, so you always need to proofread after scanning.

    I haven't done it in a good while, so maybe things have improved.
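
For anyone who wants a rough number for "how accurate", here is a minimal sketch, assuming you have proofread one sample page by hand and saved both versions as plain text. It uses Python's standard difflib module; the function names are made up for illustration and have nothing to do with ABBYY or any other OCR product.

```python
from difflib import SequenceMatcher


def _normalize(text: str) -> str:
    """Collapse runs of whitespace so line-wrap differences are not counted as errors."""
    return " ".join(text.split())


def ocr_similarity(ocr_text: str, reference_text: str) -> float:
    """Return a 0..1 similarity ratio between OCR output and a proofread reference.

    This is a rough character-level score, not a formal OCR error rate.
    """
    matcher = SequenceMatcher(
        None, _normalize(ocr_text), _normalize(reference_text), autojunk=False
    )
    return matcher.ratio()


def mismatches(ocr_text: str, reference_text: str) -> list:
    """List (ocr_fragment, reference_fragment) pairs where the two texts differ."""
    a, b = _normalize(ocr_text), _normalize(reference_text)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        (a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

Running mismatches() on one sample page shows which characters the OCR engine tends to confuse, which tells you what to look for when proofreading the rest.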

  4. #4
    New Lounger
    Join Date
    Dec 2009
    Location
    Golden, CO
    Posts
    19
    Thanks
    3
    Thanked 2 Times in 2 Posts
    I think you're wrong about "Window's own search tool" not finding words inside PDFs. I just checked that with Windows 7 SP1 by typing an unusual word, which occurs in a dozen of my PDFs, into the Start button's search box. It instantly popped up all of those PDFs in the search results window under "documents". (That said, I still greatly prefer X1 Search over Windows search. It is far more flexible, and it instantly displays the files it finds, with the search term highlighted, if they are a common file type such as .doc, .xls, or .pdf.) It will also index and search Outlook, although I don't use that feature.

    Also, for those applications that require proofing, editing, etc., you might also consider OmniPage, from Nuance. I haven't used ABBYY FineReader, but from reading your description, it sounds like OmniPage is about the same price ($150 list) and will do everything you describe, plus more. OmniPage may be a bit more complicated to use because of the extra capabilities. It includes the ability to scan multipage documents, including large books, where it is worthwhile to optimize the recognition accuracy for the peculiarities of the font used in the specific document being scanned. I've found this particularly valuable when scanning old genealogical documents found in libraries or on the web, where the documents may have already been copied or scanned in a sub-optimal manner, with relatively poor quality.

    Did you check how ABBYY FineReader handles multi-column documents? If a document is to undergo further editing, it's really important (and not easy) for the OCR app to properly interpret the column-to-column flow.

    And a final comment: If a document is to undergo further editing after OCR, I recommend using plain text output from the OCR application, rather than .doc. I've found that, while the .doc file may look just fine, it is very difficult to do additional formatting, because the styles applied by OmniPage (and I assume also by ABBYY FineReader, since I think this is a fundamental problem) are very complex and also somewhat haphazard, changing to fit the local context.

  5. #5
    New Lounger
    Join Date
    Jan 2010
    Location
    Torbay, England, UK
    Posts
    4
    Thanks
    0
    Thanked 0 Times in 0 Posts
    I also think Lincoln is wrong about "Window's own search tool" not finding words inside PDFs.
    However, getting it to work may require installing the "Adobe PDF iFilter".
    I think the latest is version 11, and there are 32-bit and 64-bit variants, but a Google search will reveal all.
