  1. #1
    4 Star Lounger
    Join Date
    Jun 2003
    Location
    Utah, USA
    Posts
    406
    Thanks
    35
    Thanked 5 Times in 5 Posts

    OCR from Source That Is Always Upside Down

    Not sure if this is the right thread for this.

    I am using Windows 8.1, Word 2013, and Acrobat XI.

    Ultimately, I want this in Word, or exported from Word to TXT. I am not sure if my issue is best addressed in Acrobat before I get into Word, or in Word after I get it in.

    My problem starts with a PDF from a scan of an original with an even number of columns, with a header centered across each pair.


    Except that adjacent columns are upside-down with respect to each other (content is different in adjacent columns). This is intentional in the source.

    So my scan looks something like this (although there are 8 columns on a page):


    ------ ------

    ▲ ▼ ▲ ▼


    I ultimately want text from this. If I, say, export to Word, the first and third columns come through OK, but the OCR fails badly on the second and fourth columns. If you think about the arrangement of the icons above ... my scan is always partially upside down.

    Obviously, I would not have a problem if my source had been arranged this way:

    ------ ------

    ▲ ▲ ▲ ▲


    FWIW: So that I would have something reasonably human readable, I rotated each page 180 degrees when doing the scanning. So I have a duplicate of the above in this form:

    ------ ------

    ▼ ▲ ▼ ▲



    Thus a normal human can read columns 1 and 3 from page 1, then columns 1 and 3 from page 2 ... and get the whole thing.

    Suggestions anyone?

    P.S. I have tried the low-tech version of cutting a hard copy of the scan into strips and converting those one strip at a time. This works fine. But ... I also timed it, and it takes about the same amount of time as just retyping the original into Word by hand.

  2. #2
    Super Moderator
    Join Date
    Jan 2001
    Location
    Melbourne, Victoria, Australia
    Posts
    3,852
    Thanks
    4
    Thanked 259 Times in 239 Posts
    I would scan it all one way and then scan it all the other. Then, with the two scan results in Word, delete the 'upside down' columns and paste the remainder back together.

    If there are rows of data as well as columns then the scan is likely to need to reverse the row order on the inverted columns so you might need to include a numbered column and sort the table before pasting the columns back into the main doc.
    Andrew Lockton, Chrysalis Design, Melbourne Australia
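
A minimal Python sketch of the "two scans, keep the readable half of each" idea above. It assumes you have already split each OCR pass into a list of per-column strings, left to right. The function name `merge_passes` and the placeholder column contents are made up for illustration. It also assumes the first column is upright in the first scan. Because turning the page 180 degrees reverses the left-to-right order, original column k of n shows up at position n + 1 - k in the rotated scan.

```python
def merge_passes(upright, rotated):
    """Combine two OCR passes of the same page into one clean column list.

    upright: columns from the normal scan, left to right
             (columns 1, 3, 5, ... are readable).
    rotated: columns from the scan turned 180 degrees, left to right
             (again positions 1, 3, 5, ... are readable, but they hold the
             original columns n, n-2, ... because the order is mirrored).
    """
    n = len(upright)
    if n != len(rotated) or n % 2 != 0:
        raise ValueError("expected two passes with the same even column count")

    merged = []
    for k in range(n):  # k is the 0-based index of the original column
        if k % 2 == 0:
            merged.append(upright[k])          # already readable in pass 1
        else:
            merged.append(rotated[n - 1 - k])  # readable only in pass 2
    return merged


# Placeholder data: "??" marks the garbage OCR output from inverted columns.
pass_one = ["col1", "??", "col3", "??"]
pass_two = ["col4", "??", "col2", "??"]
print(merge_passes(pass_one, pass_two))
```

Within each readable column of the rotated scan the text already reads top to bottom, so the sketch does not reverse rows. If the source is a true table whose rows must line up across columns, the row reversal mentioned above would still need handling.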

  3. The Following User Says Thank You to Andrew Lockton For This Useful Post:

    boobounder (2016-09-30)

  4. #3
    4 Star Lounger
    Join Date
    Jun 2003
    Location
    Utah, USA
    Posts
    406
    Thanks
    35
    Thanked 5 Times in 5 Posts
    Quote Originally Posted by Andrew Lockton View Post
    I would scan it all one way and then scan it all the other.
    OK. I did that.
    Quote Originally Posted by Andrew Lockton View Post
    Then with the two scan results in Word delete the 'upside down' columns and paste the remainder back together.
    I am noting that you said to do this after I am in Word, not in Adobe.

    So it's not clear to me how I should delete the problem columns. This is an image when it's in Adobe, but by the time I get to Word it is in characters organized into paragraphs.

    The thing is, I do not know how Word has ordered those paragraphs (or even how to find paragraph numbers).

    They are not organized this way:

    1 2 3 4
    5 6 7 8

    Nor are they organized this way:

    1 3 5 7
    2 4 6 8

    They are all mixed up, something like this:

    1 4 2 7
    3 6 5 8

    So even if I highlight and use the arrow keys to progress through the paragraphs, I end up with a very choppy selection.

  5. #4
    Super Moderator
    Join Date
    Jan 2001
    Location
    Melbourne, Victoria, Australia
    Posts
    3,852
    Thanks
    4
    Thanked 259 Times in 239 Posts
    What are you using to perform the OCR? That package may have the ability to recognise columns or tabular data but you will need to specify this. Check to see if there are any options before doing the OCR.
    Andrew Lockton, Chrysalis Design, Melbourne Australia

  6. The Following User Says Thank You to Andrew Lockton For This Useful Post:

    boobounder (2016-09-30)

  7. #5
    Silver Lounger Charles Kenyon's Avatar
    Join Date
    Jan 2001
    Location
    Sun Prairie, Wisconsin, Wisconsin, USA
    Posts
    2,049
    Thanks
    124
    Thanked 119 Times in 116 Posts
    Adobe Acrobat or Acrobat Reader?

    Acrobat can do the OCR and usually does a pretty good job.
    Charles Kyle Kenyon
    Madison, Wisconsin

  8. The Following User Says Thank You to Charles Kenyon For This Useful Post:

    boobounder (2016-09-30)

  9. #6
    4 Star Lounger
    Join Date
    Jun 2003
    Location
    Utah, USA
    Posts
    406
    Thanks
    35
    Thanked 5 Times in 5 Posts
    Quote Originally Posted by Charles Kenyon View Post
    Adobe Acrobat or Acrobat Reader?
    Acrobat XI. I wrote that in the second line of the thread.
    Quote Originally Posted by Charles Kenyon View Post
    Acrobat can do the OCR and usually does a pretty good job.
    I am not clear whether the OCR is done in Acrobat before the export, or if an image is sent to Word that then does the OCR.

    Whichever program does the OCR, it does a fine job on the columns that are right-side up. But it does a terrible job on the columns that are upside-down (go figure ;-). And that terrible job mixes up the paragraphs, which ruins Word's ability to export any of the content to TXT in a reasonable order.

  10. #7
    Super Moderator
    Join Date
    Jan 2001
    Location
    Melbourne, Victoria, Australia
    Posts
    3,852
    Thanks
    4
    Thanked 259 Times in 239 Posts
    In that case, Acrobat is where the OCR is performed. Word is not capable of performing OCR (as far as I know).

    Try activating the 'layout' setting when doing the OCR. This is what should allow it to process columns of text rather than reading all the way across the page.
    File > Save As > Microsoft Word > Word Document > Settings > Retain Page Layout
    Andrew Lockton, Chrysalis Design, Melbourne Australia

  11. The Following User Says Thank You to Andrew Lockton For This Useful Post:

    boobounder (2016-09-30)

  12. #8
    4 Star Lounger
    Join Date
    Jun 2003
    Location
    Utah, USA
    Posts
    406
    Thanks
    35
    Thanked 5 Times in 5 Posts
    OK. So this does ... something.

    What I get is a Word document in which the alignment is better.

    But now each block of text appears to be its own little image, with dozens of them on the page now. I believe these are all text boxes.

    So now there is a new problem: how do I export the contents of all text boxes from a Word document? Just doing a save as (to TXT) gets me a blank file.
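
One possible way to pull the text out of every text box, sketched in Python under some assumptions: the document is saved as .docx (not .doc), and the text boxes are stored as `w:txbxContent` elements inside `word/document.xml`, which is how WordprocessingML normally represents them. Word commonly writes each box twice (a modern version plus an `mc:Fallback` copy), so the sketch drops the fallback copies first. The function names are made up for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
MC = "http://schemas.openxmlformats.org/markup-compatibility/2006"


def textbox_paragraphs(document_xml):
    """Return one list of paragraph strings per text box found in the XML."""
    root = ET.fromstring(document_xml)

    # Remove mc:Fallback branches so each text box is only counted once.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == f"{{{MC}}}Fallback":
                parent.remove(child)

    boxes = []
    for box in root.iter(f"{{{W}}}txbxContent"):
        paragraphs = []
        for para in box.iter(f"{{{W}}}p"):
            # A paragraph's text is split across one or more w:t runs.
            paragraphs.append("".join(t.text or "" for t in para.iter(f"{{{W}}}t")))
        boxes.append(paragraphs)
    return boxes


def read_docx_textboxes(path):
    """Open a .docx (a zip archive) and extract its text-box paragraphs."""
    with zipfile.ZipFile(path) as archive:
        return textbox_paragraphs(archive.read("word/document.xml"))
```

Usage would be something like `read_docx_textboxes("cards.docx")`, where the file name is a placeholder. The boxes come back in the order they appear in the XML, which is not necessarily reading order on the page, so some re-sorting may still be needed.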

  13. #9
    WS Lounge VIP mrjimphelps's Avatar
    Join Date
    Dec 2009
    Location
    USA
    Posts
    3,401
    Thanks
    447
    Thanked 404 Times in 376 Posts
    Now that you have a separate image for each column, you should be able to rotate those images which are "upside down", to make all images right side up. Once you have done that, you might be able to OCR the results, since everything will be oriented the correct way.
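
A rough Python sketch of the "rotate the upside-down pieces" idea, applied to the page image itself rather than to anything inside Word. To stay self-contained it treats the page as a plain grid of rows (strings or lists of pixels) and assumes equal-width columns, with the first column upright. The function names are made up for illustration. A real pipeline would do the same cropping and rotating with an imaging library (for example Pillow's `Image.crop` and `Image.rotate`) before handing each strip to the OCR step.

```python
def rotate_180(grid):
    """Rotate a grid by 180 degrees: reverse the row order and each row."""
    return [row[::-1] for row in grid[::-1]]


def split_columns(grid, n_columns):
    """Cut a grid into n_columns vertical strips of equal width."""
    width = len(grid[0])
    if width % n_columns != 0:
        raise ValueError("page width is not a multiple of the column count")
    strip_width = width // n_columns
    return [
        [row[i * strip_width:(i + 1) * strip_width] for row in grid]
        for i in range(n_columns)
    ]


def upright_strips(grid, n_columns):
    """Split the page and turn every second strip right side up."""
    strips = split_columns(grid, n_columns)
    return [rotate_180(s) if i % 2 else s for i, s in enumerate(strips)]


# Tiny stand-in "page": two columns, the second one printed upside down.
page = ["abWX",
        "cdYZ"]
print(upright_strips(page, 2))
```

This is the digital equivalent of cutting the hard copy into strips, except that every strip ends up the right way round before OCR.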

  14. The Following User Says Thank You to mrjimphelps For This Useful Post:

    boobounder (2016-09-30)

  15. #10
    4 Star Lounger
    Join Date
    Jun 2003
    Location
    Utah, USA
    Posts
    406
    Thanks
    35
    Thanked 5 Times in 5 Posts
    Quote Originally Posted by mrjimphelps View Post
    Now that you have a separate image for each column
    I don't have this at all. I seem to have a bunch of arranged text boxes. It's not clear at all that I have columns in anything but appearance.

  16. #11
    Super Moderator
    Join Date
    Jan 2001
    Location
    Melbourne, Victoria, Australia
    Posts
    3,852
    Thanks
    4
    Thanked 259 Times in 239 Posts
    Can you post a one page sample of the source image (PDF) so we can see what options you might have?
    Andrew Lockton, Chrysalis Design, Melbourne Australia

  17. #12
    4 Star Lounger
    Join Date
    Jun 2003
    Location
    Utah, USA
    Posts
    406
    Thanks
    35
    Thanked 5 Times in 5 Posts

    Sample Scan Attached

    Yeah sure. File attached.

    ********************

    I'm digitizing and coding an old board game. For love not money :-) My version will have different bells and whistles than other versions.

    Anyway, it was card based. The attached file shows the front face of 7 cards on 2 pages. On the first page they are turned one way, and on the second page I turned the original upside down.

    I'd need to hand type about 30 files like this. Not impossible or expensive, but why not try the digital route, right?

    I borrowed a fairly mint copy of a 35 year old game, made scans, saved to PDF, and returned the original. So this is what I've got.

    My end goal is just to get the text out in a reasonably organized format that I could slice and dice using regular expressions.
    Attached Files

  18. #13
    Super Moderator
    Join Date
    Jan 2001
    Location
    Melbourne, Victoria, Australia
    Posts
    3,852
    Thanks
    4
    Thanked 259 Times in 239 Posts
    I see your dilemma. Acrobat does a pretty good job on the right way up columns and I would be inclined to cut your losses and live with half of it OCR'd and then manually data enter the variables from the upside down bits. There is a lot of static content in those cards so the data entry is not as bad as it initially seems.

    But if you are turning it into a digitised game, wouldn't you want the variable data put into a spreadsheet or database rather than simply formatted in Word? In that case, it is probably quicker to key in all of the content by hand.
    Andrew Lockton, Chrysalis Design, Melbourne Australia

  19. #14
    4 Star Lounger
    Join Date
    Jun 2003
    Location
    Utah, USA
    Posts
    406
    Thanks
    35
    Thanked 5 Times in 5 Posts
    Thanks for your opinion. That's the conclusion I've been coming to.

    Yes, you are right about ultimately getting this data into a database. The process I was trying was original > scan > PDF > Word > TXT > database. Perhaps I will just jump to the end and start typing.

  20. #15
    Super Moderator
    Join Date
    Jan 2001
    Location
    Melbourne, Victoria, Australia
    Posts
    3,852
    Thanks
    4
    Thanked 259 Times in 239 Posts
    The problem with OCR is that it is not 100% accurate, so everything has to be checked anyway. If your scan were of normal prose, a spell check and grammar check would help spot OCR errors, but your content doesn't have this luxury, so it will rely heavily on visual checks of pretty much every character (so that 1 doesn't become I, etc.).
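
One way to narrow the visual check a little, sketched in Python: flag tokens that mix digits with letters OCR often confuses with digits. The look-alike set (`I`, `l`, `O`, `o`, `S`, `B`) and the sample text are assumptions for illustration, not taken from the actual cards.

```python
import re

# A token is suspicious if it contains at least one digit AND at least one
# letter that is commonly misread for a digit (I/l for 1, O/o for 0, ...).
SUSPECT = re.compile(r"\b(?=\w*\d)(?=\w*[IlOoSB])\w+\b")


def suspicious_tokens(text):
    """Return tokens that look like numbers with a misread character."""
    return SUSPECT.findall(text)


sample = "Gain I2 yards, lose 10, roll l5 or O7"  # made-up OCR output
print(suspicious_tokens(sample))  # ['I2', 'l5', 'O7']
```

A lone `I` standing in for `1` contains no digit and would slip past this check, so it only reduces the manual proofreading rather than replacing it.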

    If it were me, I would scan and OCR the cards you have and delete all the resulting content from the upside down chunks. Then strip out the static text with some search and replaces. Once that is done, it should be simple enough to convert it all to a table in Word and then paste it across into a spreadsheet. In the spreadsheet you can add the missing fields that would have been the upside down text that was deleted from the scan.
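
The "strip out the static text, then turn it into a table" step could also be scripted. Below is a minimal Python sketch. The labels `Play:`, `Roll:` and `Outcome:` and the sample card text are hypothetical stand-ins, since the real static text on the cards is not shown here. Each card becomes one CSV row that can be opened in a spreadsheet, with empty cells left where the upside-down content was deleted.

```python
import csv
import io
import re

# Hypothetical static labels; replace with the fixed text printed on the cards.
STATIC_LABELS = ["Play:", "Roll:", "Outcome:"]
LABEL_PATTERN = re.compile("|".join(re.escape(label) for label in STATIC_LABELS))


def card_to_row(card_text):
    """Split one card's text on the static labels, keeping only the variables.

    Anything before the first label is discarded. Empty fields are kept so
    the columns stay aligned from card to card.
    """
    parts = LABEL_PATTERN.split(card_text)
    return [part.strip() for part in parts[1:]]


def cards_to_csv(cards):
    """Render a list of card texts as CSV, one row per card."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for card in cards:
        writer.writerow(card_to_row(card))
    return buffer.getvalue()


cards = [
    "Play: Alpha Roll: 7 Outcome: +4",
    "Play: Beta Roll: Outcome: -2",  # Roll value missing (deleted chunk)
]
print(cards_to_csv(cards))
```

The sketch assumes every card carries the same labels in the same order. If that does not hold, the rows will not line up and a per-label lookup would be the safer design.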

    Or perhaps you could work out algorithms that do all the plays from a smart randomiser. In card form the game looks like it relies on a large number of cards to describe plays. On a digital game, these plays could potentially be randomly generated to provide a much larger set of scenarios. It looks like an interesting project - have fun.
    Andrew Lockton, Chrysalis Design, Melbourne Australia
