  1. #1
    5 Star Lounger Lugh's Avatar
    Join Date
    Jun 2010
    Location
    Indy
    Posts
    619
    Thanks
    166
    Thanked 75 Times in 66 Posts

    Finding 'weird' characters in Word

    Is there anywhere a table or list of characters which are “known good” for use in modern ebook readers? The idea being that such a resource could be used to compare against text files [mostly Word], to provide a warning if exceptions are found.

    I know it's a broad and vague question/quest, and a longshot, but maybe one of you experts here knows of such.

    Ideally a resource produced by a sub-group of the IDPF, BISG or similar standards org, which would contain only the visible characters, and ideally a ‘common usage’ subset of those--who needs a table of over 1 million code points [full Unicode] to check against?

    I have been thru the IDPF and BISG sites, but not 100% of course. My web searches on this get drowned out by explanations of how to hide or display the formatting marks [eg pilcrow etc] in Word, so I haven't been able to find anything useful.

    Background:

    We take in English language input [mostly Word files of varying vintage] from worldwide, which has typically gone thru a few people before reaching us. So I have no idea what OSs or OS languages or WPs or character encodings or fonts have been used in processing any file.

    We output ebooks which need to be technically and aesthetically in good readable condition on as many electronic reading devices as possible. Of course, no way of keeping up with the hundreds of such devices, and the various software versions of each device, so we concentrate on EPUB2 file format [don't need EPUB3 atm], Kindle formats KF7 and KF8, and PDF--I'm happy if we get those right and pass associated validation checks.

    Over a decade ago, I spent about a year developing VBA cleaning routines [almost all are Search & Replace] for incoming files. Maintained since, it's worked very well, eliminating almost all the 'dirty file' problems which plagued our early efforts. But there is still the occasional glitch which makes it thru to the final output files, where we catch it during a visual check. Then it becomes a case of identifying the unwanted character so we can include it in the Search & Replace.

    I don't lose sleep over the possibility there are invisible oddities which occasionally make it thru. I figure our EpubCheck and external 3rd parties checks of our product will catch anything bad.

    In the early days, I limited us to using ASCII encoding, to ensure compatibility with 90s-era reading devices [Palm, RocketBook etc]. I moved to UTF-8 ~7 years ago, which I still 'use'--but I don't have file-opening and file-saving steps which specify UTF-8 explicitly.
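
    For what it's worth, making the encoding explicit at open and save time can be sketched as below. This is Python rather than VBA, and the function names `read_utf8_strict` and `write_utf8` are made up for illustration; it assumes the Word content has already been exported to a plain text file.

```python
def read_utf8_strict(path):
    """Read a text file as UTF-8; raises UnicodeDecodeError on invalid bytes."""
    # errors="strict" is the default; it is spelled out to show the intent.
    # newline="" keeps the line endings exactly as stored in the file.
    with open(path, "r", encoding="utf-8", errors="strict", newline="") as handle:
        return handle.read()


def write_utf8(path, text):
    """Write text as UTF-8 (no byte order mark), leaving line endings alone."""
    with open(path, "w", encoding="utf-8", newline="") as handle:
        handle.write(text)
```

    A file that was really saved in a legacy code page will usually fail the strict read, which is itself a useful early warning.
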
    Lugh.
    ~
    Windows 10 Pro x64 1607; Office 2016 (365 Home) x32; Win Defender, MBAM Pro

    ASRock H97 Anniversary; Xeon E3-1231V3 (like i7)
    Gigabyte GeForce GTX 970; 12GB Crucial DDR3 1600
    Logitech MX Master mouse; Roccat Isku kb

  2. #2
    5 Star Lounger Lugh's Avatar

    Best ways to 'Clean' Word files

    What better way to spend the holidays than reviewing old work processes before launching a new project?

    To clean a Word file of all possible detritus from old and different OSs, WP software, languages etc--what are the best ways? I can't boil it in ASCII, as I need to preserve foreign words within the mostly English texts. UTF-8 is the desired encoding. Note that these are very simple text documents, no fancy stuff required--just the potentially buggy glitch stuff removed.

    Ways I've tried or contemplated:
    Save As a Plain Text file in Word, close, paste the result into new DOCM based on template;
    Copy file into UTF-8 text editor [Editplus atm], paste the result into new DOCM based on template;
    Make an HTML file, copy and paste the resulting source into new DOCM based on template--the HTML would be ultra clean, made via VBA routines, not Save As Web in Word menu. Then strip the HTML via wildcard S&R.

    These would be steps taken before running a bunch of VBA Search & Replace routines to clean up the wanted content, after which styling would be applied before further processing.

    Thanks for any insights and suggestions.
    Last edited by Lugh; 2015-12-25 at 17:54. Reason: removing info contained in first post
    Lugh.

  3. #3
    Super Moderator
    Join Date
    May 2002
    Location
    Canberra, Australian Capital Territory, Australia
    Posts
    5,054
    Thanks
    2
    Thanked 417 Times in 346 Posts
    Given how closely related the issues are, I've merged your two threads.

    If you can specify the characters the various formats/standards don't support, it would be a fairly easy task to delete/replace them via a Find/Replace macro.
    Cheers,

    Paul Edstein
    [MS MVP - Word]

  4. #4
    5 Star Lounger Lugh's Avatar
    Quote, originally posted by macropod:
    "Given how closely related the issues are, I've merged your two threads."

    Cool, thanks, I had them as one post originally.

    Quote, originally posted by macropod:
    "If you can specify the characters the various formats/standards don't support, it would be a fairly easy task to delete/replace them via a Find/Replace macro."

    Sure, and I would be doing that already if I could specify the characters. I'm hoping for a broad magic bullet, similar to what applying ASCII encoding did a decade ago.

    Ideally, a small subset of Unicode for the English reading world. It's not practical to work with the 1,000,000+ code points in the full Unicode range.
    Lugh.

  5. #5
    Super Moderator
    And what would you want done with the non-conforming characters? Delete them - and bear whatever consequences that has for your document - or replace them with something else? If the latter, what would you replace them with - a single character (e.g. *) or something from a lookup list?
    Cheers,

    Paul Edstein
    [MS MVP - Word]

  6. #6
    5 Star Lounger Lugh's Avatar
    Quote, originally posted by macropod:
    "And what would you want done with the non-conforming characters?"

    A warning is the ideal outcome: "This file contains characters not in the approved list". If it can identify the characters via their Unicode code points or similar, great; but mainly just a heads-up to scrutinize the file in detail.
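
    A rough sketch of that kind of warning, in Python rather than VBA. The approved list here is purely hypothetical (printable ASCII, a handful of typographic marks, and accented Latin letters); it is not taken from the IDPF, the BISG or any other standard, and the names `APPROVED`, `is_approved` and `find_unapproved` are illustrative only.

```python
import unicodedata

# Hypothetical approved list: printable ASCII, basic whitespace, and a few
# typographic characters (curly quotes, en/em dash, ellipsis).
APPROVED = (
    set(map(chr, range(0x20, 0x7F)))
    | set("\n\r\t")
    | set("\u2018\u2019\u201C\u201D\u2013\u2014\u2026")
)


def is_approved(ch):
    """True for listed characters and for letters in U+00C0..U+017F."""
    if ch in APPROVED:
        return True
    # Accented Latin letters, so foreign words in English text survive.
    return 0x00C0 <= ord(ch) <= 0x017F and unicodedata.category(ch).startswith("L")


def find_unapproved(text):
    """Return (code point, Unicode name, count) for each unapproved character."""
    counts = {}
    for ch in text:
        if not is_approved(ch):
            counts[ch] = counts.get(ch, 0) + 1
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), n)
        for ch, n in sorted(counts.items())
    ]


if __name__ == "__main__":
    report = find_unapproved("Caf\u00e9 \u201cquote\u201d end\u00a0")
    if report:
        print("This file contains characters not in the approved list:")
        for label, name, count in report:
            print(f"  {label} {name} x{count}")
```

    An empty list means nothing outside the approved set was found; anything else is the heads-up to scrutinize the file.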

    This could of course be achieved by replacing the 'unapproved' characters with some unique string like #$%, and then running a Find routine for that string. That's probably the simplest way to deal with them.
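
    A variant of that idea, sketched in Python: instead of replacing the character, put the marker in front of it, so the original is still there to inspect. The function name `mark_unapproved` and the `approved` parameter are assumptions for illustration; the marker defaults to the `#$%` string mentioned above.

```python
def mark_unapproved(text, approved, marker="#$%"):
    """Insert `marker` before every character that is not in `approved`."""
    return "".join(ch if ch in approved else marker + ch for ch in text)


if __name__ == "__main__":
    approved = set("abcdefghijklmnopqrstuvwxyz ")
    # The no-break space (U+00A0) is not in the approved set, so it gets flagged.
    print(repr(mark_unapproved("end of\u00a0chapter", approved)))
```

    A plain Find for the marker then jumps straight to each suspect character.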

    The main problem is identifying their presence in the first place.

    Deleting or replacing as you mention wouldn't be a good idea, since the 'unapproved' characters might be intended to be 'normal' characters. Eg there are 10-12 different flavors of hyphens and dashes in larger character sets like Unicode, so an 'unapproved' one would need to be replaced by a 'normal' one.
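
    To see which flavors of dash a text actually contains, something along these lines could work. It is a Python sketch that leans on the Unicode general category "Pd" (Punctuation, dash), plus the minus sign U+2212, which is classed as a math symbol rather than a dash; the function name `find_dashes` is made up for the example.

```python
import unicodedata


def find_dashes(text):
    """Return (index, code point, Unicode name) for each dash-like character."""
    hits = []
    for index, ch in enumerate(text):
        # "Pd" covers hyphen-minus, en dash, em dash and their relatives.
        # U+2212 MINUS SIGN is category "Sm", so it is checked separately.
        if unicodedata.category(ch) == "Pd" or ch == "\u2212":
            hits.append((index, f"U+{ord(ch):04X}", unicodedata.name(ch)))
    return hits


if __name__ == "__main__":
    for hit in find_dashes("a-b\u2013c\u2014d\u2212e"):
        print(hit)
```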

    The most recent example is a couple of occurrences over the last 3 months where a degree symbol has 'magically' appeared at the end of chapters. This MVP article says:
    Quote Originally Posted by Suzanne S. Barnhill and Dave Rado
    A degree symbol ° represents a nonbreaking space (Ctrl+Shift+Spacebar), which you can use to prevent words from being separated at the end of a line.

    This is useful for keeping dates together (so you don't end up with September
    5, 2000), as well as initials such as J. P.
    V. D. Balsdon.

    En and em spaces (on the Special Characters tab of the Symbol dialog) are also represented by the degree symbol, but there is extra space to the left of the symbol for an en space °and extra space both left and right for an em ° space.
    As you can see, the unwanted character could have intended legit usages as a nonbreaking space or as an en or em space. In the recent occurrences there was no legit use; they just produced blank squares at the end of chapters, which were easily caught by visual inspection.

    Visual inspection is of course not foolproof in long documents with 50-100K words, hence my quest for some more reliable way for identifying them, or at least getting a warning. Another main concern is where 'bad' characters might not appear visually, but could have a text-breaking effect in the final user's reading device.
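
    For the characters that don't show up visually, one possible check is to go by Unicode general category rather than by a fixed list. The Python sketch below flags control, format, private-use and unassigned characters, plus any space or separator other than the ordinary space; the choice of categories and the name `find_invisible` are my assumptions, not an established rule.

```python
import unicodedata

# Cc = control, Cf = format (zero-width characters, soft hyphen, BOM),
# Co = private use, Cn = unassigned, Zs/Zl/Zp = space and line separators.
SUSPECT_CATEGORIES = {"Cc", "Cf", "Co", "Cn", "Zs", "Zl", "Zp"}


def find_invisible(text):
    """Return (index, code point, Unicode name) for hard-to-see characters."""
    hits = []
    for index, ch in enumerate(text):
        if ch in " \n\r\t":
            continue  # ordinary space and line breaks are expected
        if unicodedata.category(ch) in SUSPECT_CATEGORIES:
            hits.append((index, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")))
    return hits


if __name__ == "__main__":
    for hit in find_invisible("end\u200b of\u00a0chapter\ufeff"):
        print(hit)
```

    Nonbreaking spaces are flagged too, so legit uses would still need a human decision.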
    Lugh.

  7. #7
    Super Moderator
    The so-called degree symbol you referred to is just Word's on-screen marker for a non-breaking space, character 160 (U+00A0). That sits just outside 7-bit ASCII, in the Latin-1/ANSI range, so it's not something I'd expect to be classified as 'weird'. In any event, the solution for them, if one is needed, is to replace them (along with em-spaces & en-spaces) with an ordinary space (ASCII decimal 32), something quite easily done via Find/Replace. Likewise, em-dashes & en-dashes, etc. could be replaced with an ordinary hyphen/minus. Conversely, you might want to delete Word's optional hyphens. Similarly, if unwanted characters are appearing at paragraph ends, they can be deleted. A fairly straightforward Find/Replace macro could handle all of that.
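
    Outside Word, the replacements described above could be sketched as a single translation table. This is Python, not a Find/Replace macro, and it assumes the text has been exported so that optional hyphens arrive as U+00AD SOFT HYPHEN, which may not hold for every export route. The names `CLEANUP` and `apply_cleanup` are illustrative.

```python
# Map of code point -> replacement; None means "delete the character".
CLEANUP = {
    0x00A0: " ",   # no-break space -> ordinary space
    0x2002: " ",   # en space -> ordinary space
    0x2003: " ",   # em space -> ordinary space
    0x2013: "-",   # en dash -> hyphen-minus
    0x2014: "-",   # em dash -> hyphen-minus
    0x00AD: None,  # soft (optional) hyphen -> removed
}


def apply_cleanup(text):
    """Apply every replacement in CLEANUP in one pass over the text."""
    return text.translate(CLEANUP)


if __name__ == "__main__":
    print(apply_cleanup("5\u00a0km\u2014co\u00adop"))
```

    Extending the table is a matter of adding one line per character once it has been identified.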

    Ultimately, though, nothing is going to happen unless you can specify which characters are 'weird' and what you want done with them. Whatever they are, a fairly straightforward Find macro can be used to test for them, just as a fairly straightforward Find/Replace macro can be used to replace/delete them.
    Cheers,

    Paul Edstein
    [MS MVP - Word]

  8. The Following 2 Users Say Thank You to macropod For This Useful Post:

    Charles Kenyon (2015-12-30),Lugh (2015-12-29)

  9. #8
    5 Star Lounger Lugh's Avatar
    Thanks for your help with this Macropod, much appreciated.
    Lugh.
