  1. #1
    New Lounger
    Join Date
    Jan 2011
    Posts
    8
    Thanks
    1
    Thanked 0 Times in 0 Posts

    OCR recognizes table as two printed columns one after another

    I am a member of a small, local, non-profit genealogy society (http://cfgs.org). One of our functions is to publish old birth and death records online where others can find them. Years ago someone typed up many pages of tombstone records from local cemeteries. These tables contain each person’s name along with a birth and/or death date. They are in an unlined table format with a large space between the names and the dates.

    We want to put these records online so people can find them. Several of us have tried scanning these records using the built-in OCR with our all-in-one printers. We have also scanned them into PDF format and used Acrobat Pro to recognize the text. In both cases the result is two columns, like newspaper columns, instead of a table. This disassociates the dates from the names, and when the text is placed in a word processor or spreadsheet we get one column with the dates below the names.

    We really would like to avoid having volunteers manually enter these records into Excel or a database like Access. Does anyone have a suggestion on how to get these typed pages into a data table format?

    Another problem, but a totally different one, is that neither Excel nor Access recognizes a value as a date if it is earlier than Jan 1, 1900, and many of our dates are older than this. This makes sorting or searching difficult.
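
    One possible workaround for the pre-1900 limitation, sketched here in Python rather than in Excel or Access, is to store each date as ISO 8601 text (YYYY-MM-DD), which sorts chronologically even when treated as plain text. Python's datetime module accepts years from 1 through 9999. The input format "12 Mar 1887", the helper name to_iso, and the sample values are all assumptions for illustration; they are not taken from the actual records.

```python
from datetime import datetime


def to_iso(text, fmt="%d %b %Y"):
    """Convert a date string such as '12 Mar 1887' to '1887-03-12'.

    The default format is an assumption; adjust fmt to match the records.
    """
    return datetime.strptime(text.strip(), fmt).date().isoformat()


# Made-up sample values, not taken from the cemetery records.
dates = ["4 Jul 1901", "12 Mar 1887", "30 Nov 1852"]

# ISO strings sort chronologically with an ordinary text sort.
iso_dates = sorted(to_iso(d) for d in dates)
print(iso_dates)
```

    The resulting text values can be kept in an ordinary text column and sorted or filtered as text, so the spreadsheet or database never has to interpret them as dates.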

    Thanks,
    Walter

  3. #2
    Super Moderator
    Join Date
    May 2002
    Location
    Canberra, Australian Capital Territory, Australia
    Posts
    3,775
    Thanks
    0
    Thanked 162 Times in 150 Posts
    Hi Walter,

    To generate a table format with the data you have, someone will need to add the table gridlines to each page (on photocopies, perhaps) before doing the OCR. As for sorting, once OCR'd into a Word table, Word's table tools may allow you to do that where Excel & Access won't.
    Cheers,

    Paul Edstein
    [MS MVP - Word]

  5. #3
    New Lounger
    Join Date
    Jan 2011
    Posts
    8
    Thanks
    1
    Thanked 0 Times in 0 Posts
    Thanks, Paul, but we have around 100 pages. It would be easier to retype them than to add lines under every record on each page and then still have to mine the data from a Word or other file.

    A sample of what we have in a readable PDF can be seen at https://dl.dropboxusercontent.com/u/...as_Cem_all.pdf.

    Walter

  6. #4
    Super Moderator
    Join Date
    May 2002
    Location
    Canberra, Australian Capital Territory, Australia
    Posts
    3,775
    Thanks
    0
    Thanked 162 Times in 150 Posts
    Hi Walter,

    I've taken a look at the PDF in the link and I have to say that, even if it were in a table format (i.e. with gridlines), it would still require considerably more work to make it sortable & searchable. Most of the given-name entries don't have surnames on the same lines, plus some rows have extra information about age at death.

    That said, you can at least get a start by taking the exported names list for each page and turning that into a single-column table. You can then split the table into two columns and drag the corresponding date entries from lower down the document into the second column. Whilst this will still be a lot of work, it should be much faster than re-typing the lot.

    PS: You could even split the table into three columns and put the birthdates and demise dates into separate columns.
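
    The same pairing could also be scripted instead of dragged by hand. Below is a minimal Python sketch, assuming each OCR'd page comes out as a list of text lines with all the names first and all the dates after them, and that you know how many names are on the page. The function name pair_columns and the sample lines are hypothetical, and real pages with missing surnames or extra age notes would still need manual clean-up.

```python
import csv
import io
from itertools import zip_longest


def pair_columns(lines, n_names):
    """Pair the first n_names lines (names) with the remaining lines (dates)."""
    names = lines[:n_names]
    dates = lines[n_names:]
    # zip_longest keeps unmatched entries visible instead of silently dropping them.
    return list(zip_longest(names, dates, fillvalue=""))


# Made-up sample of one OCR'd page: the names block, then the dates block.
ocr_lines = [
    "Smith, John",
    "Smith, Mary",
    "Jones, Ann",
    "1821 - 1890",
    "1825 - 1902",
    "1850 - 1851",
]
rows = pair_columns(ocr_lines, n_names=3)

# Write the paired rows as CSV text, which Excel or Access can import.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Name", "Dates"])
writer.writerows(rows)
print(buffer.getvalue())
```

    A row with an empty name or an empty date in the output signals that the two blocks did not line up, so that page needs checking against the scan.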
    Cheers,

    Paul Edstein
    [MS MVP - Word]

  7. #5
    New Lounger
    Join Date
    Nov 2011
    Posts
    18
    Thanks
    2
    Thanked 1 Time in 1 Post
    Hi, Walter. I've only just seen this thread so I hope my comment will still be useful!

    I have found that, even in a pdf, you can highlight & copy a section of text and paste it into Excel. You will probably have to do this in two stages, one for each column, but it has worked for me.

    Good Luck!
    Williss

  8. #6
    Super Moderator
    Join Date
    May 2002
    Location
    Canberra, Australian Capital Territory, Australia
    Posts
    3,775
    Thanks
    0
    Thanked 162 Times in 150 Posts
    Quote, originally posted by Williss:
    "I have found that, even in a pdf, you can highlight & copy a section of text and paste it into Excel. You will probably have to do this in two stages, one for each column, but it has worked for me."

    Did you even look at the PDF? Have you tried this? I doubt it for, if you had, you'd know it won't produce the desired output. As I've already said, a lot of post-processing will be required.
    Cheers,

    Paul Edstein
    [MS MVP - Word]

  9. #7
    New Lounger
    Join Date
    Dec 2009
    Location
    New Jersey, USA
    Posts
    2
    Thanks
    0
    Thanked 0 Times in 0 Posts
    I think one of the ABBYY OCR products will allow you to define pages as tables rather than columns. I used this at work a couple of years ago for a project. I don't remember the exact name of the product, and I know they have a few different OCR products. It wasn't too expensive.

    Hope this helps.
