Thread: Cleaning scanned text (Word XP)
2004-02-21, 03:16 #1
Cleaning scanned text (Word XP)
I would appreciate some suggestions on "cleaning up" scanned text. The scanner is a 3-4 year old HP (6000 series, I think; I'm not at my office at this writing) and I'm using the generic software that came with it. I'm using Windows and Office XP Pro. The font and clarity of the original document make a huge difference in the scanned result: some scans are fairly successful, some are not. Smaller fonts and legible originals usually scan fairly well without much "cleaning up." My problem is with large fonts such as Courier. My most recent effort was a 28-page document in a Courier font. And yes, there were two spaces after each period! After scanning, the document was over 60 pages of text, with a paragraph mark at the end of every line and a blank line before the next one. And with all the spacing in the Courier font, the result included some unnecessary tables and sentence fragments in the wrong places. Does anyone know of a quick, easy way to delete all those paragraph marks without the tedium of doing it one at a time?
2004-02-21, 03:44 #2
Re: Cleaning scanned text (Word XP)
Cleaning up documents like yours is fairly straightforward, using Search & Replace (Edit > Replace, or Ctrl+H).
Assuming each line has a para mark at the end, and that there is an empty line (or a line with just a space) with a para mark separating true paras:
First: Do a Search&Replace to replace all para marks (^p) with a pair of tildes (~~), or some other character combination that isn't found in the document.
Second: Do a Search&Replace to replace every occurrence of a pair of tildes followed by a space, followed by a pair of tildes (~~ ~~) with a para mark (^p)
Third: Do a Search&Replace to replace every occurrence of a pair of tildes followed by another pair of tildes (~~~~) with a para mark (^p)
Fourth: Do a Search&Replace to replace every remaining pair of tildes (~~) with a single space ( ). These are the line ends that fell inside a true para.
Fifth: Do a Search&Replace to replace every occurrence of a pair of spaces (  ) with a single space ( ). Repeat until none is found.
By now, your document should be fairly 'clean'.
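For anyone who would rather see the logic of the five steps spelled out, here is a minimal sketch in Python. It is not Word code: it works on a plain text string, with "\n" standing in for the para mark (^p), and the function name `clean_scanned_text` and its `marker` parameter are made up for illustration. It assumes, like the steps above, that true paras are separated by an empty line or a line holding just a space.

```python
def clean_scanned_text(text: str, marker: str = "~~") -> str:
    """Rejoin hard-wrapped lines while keeping blank-line paragraph breaks.

    Hypothetical helper: "\n" plays the role of Word's para mark (^p).
    """
    # The marker must not already occur in the document.
    if marker in text:
        raise ValueError(f"marker {marker!r} already occurs in the text")

    # Step 1: turn every para mark into the marker.
    text = text.replace("\n", marker)
    # Step 2: marker, space, marker (a line with just a space) is a true break.
    text = text.replace(marker + " " + marker, "\n")
    # Step 3: two markers in a row (an empty line) is a true break.
    text = text.replace(marker + marker, "\n")
    # Step 4: any marker left was a line end inside a para; make it a space.
    text = text.replace(marker, " ")
    # Step 5: collapse pairs of spaces, repeating until none is found.
    while "  " in text:
        text = text.replace("  ", " ")
    return text


if __name__ == "__main__":
    scanned = (
        "First line of a\nparagraph.  Two spaces.\n\n"
        "Second paragraph\ncontinues here.\n \nThird."
    )
    print(clean_scanned_text(scanned))
```

The order matters in the same way it does in Word: the true breaks have to be restored (steps 2 and 3) before the leftover markers are turned into spaces (step 4).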
If your document doesn't have a spare empty line between true paras, put one in before doing the above. If it has a spare empty line within true paras, first try a Search&Replace to replace all double para marks (^p^p) with a single para mark (^p).
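To show what a single ^p^p to ^p pass does to a document where every line is followed by an empty line, here is a small Python sketch. Again "\n" stands in for the para mark, the function name `halve_para_marks` is hypothetical, and it assumes Replace All behaves like one left-to-right, non-overlapping pass.

```python
def halve_para_marks(text: str) -> str:
    """One pass replacing each pair of para marks ("\n\n") with a single one.

    Hypothetical helper; str.replace works left to right without
    overlapping, so a run of four marks becomes two, not one.
    """
    return text.replace("\n\n", "\n")


if __name__ == "__main__":
    # Lines inside a para are split by one empty line (two marks);
    # the true para break has an extra empty line (four marks).
    scanned = "line one\n\nline two\n\n\n\nnew para"
    print(repr(halve_para_marks(scanned)))
```

Because the pass is done only once, the wider gaps between true paras come out as a single empty line, which is the layout the five steps expect.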
[MS MVP - Word]