  1. #1
    5 Star Lounger
    Join Date
    Oct 2002
    Location
    Wellington, Wellington, New Zealand
    Posts
    621
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Non-ascii chars in unicode (or UTF-8 converter) (2003)

    I have code that takes a Word document's Unicode content and puts it (as raw, reformatted HTML) onto a web page hosted on a Unix box that speaks UTF-8.

    (For many reasons, Word's HTML is not the correct answer.)

    Anyway - the document contains Unicode characters such as smart quotes, macron characters and user-defined bullets. My ideal solution is an automated conversion of Unicode to UTF-8 that I can drive from VBA. My fallback is to write code that detects any character that is going to give me grief. This set will be small because we are really only dual-language, and I already detect all the Māori macron characters.

    For instance, my current method of handling 'known' special characters such as the em dash is to replace them with the equivalent HTML entity (&mdash; for —).

  3. #2
    Super Moderator jscher2000's Avatar
    Join Date
    Feb 2001
    Location
    Silicon Valley, USA
    Posts
    23,112
    Thanks
    5
    Thanked 93 Times in 89 Posts

    Re: Non-ascii chars in unicode (or UTF-8 converter

    How do you currently access the "UNICODE content"? Is this something other than Range.Text?

    Can you use HTML Tidy (someone created a COM wrapper for it)?

  4. #3
    5 Star Lounger

    Yes, it comes from Range.Text. I hadn't thought of HTML Tidy for this because the HTML isn't the problem (but I'm on the case now if it can do code conversion).

    I guess my problem is still "How do I detect that Range.Text contains a non-ASCII character?" or how do I auto-convert it. Either works.

    Andrew

  5. #4
    Super Moderator jscher2000's Avatar

    Here's one way to fairly quickly find characters above the basic 255:

    <code>Sub SubstituteNonAnsiChars()
        Dim p As Word.Paragraph, r As Word.Range, b() As Byte
        Dim lngCount As Long, lngPos As Long, strNew As String
        ' Loop through paragraphs in document
        For Each p In ActiveDocument.Paragraphs
            Set r = p.Range
            ' Do processing of formatting
            ' === YOUR CODE HERE ===
            ' Create a byte array of characters (two slots each: low byte, then high byte)
            b = r.Text
            ' Check for non-Ansi characters and replace with entities.
            ' Walk the high bytes from the end so earlier positions stay valid.
            For lngCount = UBound(b) To 1 Step -2
                If b(lngCount) <> 0 Then
                    ' Above 255, create entity with Unicode as hex
                    lngPos = (lngCount - 1) \ 2
                    strNew = "&#" & "x" & Right("00" & Hex(b(lngCount)), 2) & _
                        Right("00" & Hex(b(lngCount - 1)), 2) & ";"
                    ' Replace original range content
                    r.Text = Left(r.Text, lngPos) & strNew & Mid(r.Text, lngPos + 2)
                End If
            Next
            ' Clean up
            Set r = Nothing
        Next
        ' Clean up
        If Not (p Is Nothing) Then Set p = Nothing
    End Sub</code>

    I only tested it on a simple document, so if you find cases where it does not give the correct results, please post a sample document for testing.

    (Added: if you insert a Stop statement after b = r.Text, you can see why the For loop is set up the way it is.)
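
    An alternative sketch that leaves the document untouched: it walks a string with AscW and returns a copy in which every character above 255 is written as a hex entity, so the result can go straight into the generated HTML. The function name EntitiesForNonAnsi is made up for illustration, and the code assumes characters in the Basic Multilingual Plane (a surrogate pair would come out as two separate entities).

    ```vba
    ' Hypothetical helper, not part of the macro above.
    Function EntitiesForNonAnsi(ByVal strIn As String) As String
        Dim i As Long, lngCode As Long, strCh As String, strOut As String
        For i = 1 To Len(strIn)
            strCh = Mid$(strIn, i, 1)
            ' AscW returns a signed Integer, so code points from &H8000 up come back negative
            lngCode = AscW(strCh)
            If lngCode < 0 Then lngCode = lngCode + 65536
            If lngCode > 255 Then
                ' Above 255: emit a four-digit hex entity
                strOut = strOut & "&#x" & Right$("0000" & Hex$(lngCode), 4) & ";"
            Else
                ' 0 to 255: copy the character through unchanged
                strOut = strOut & strCh
            End If
        Next i
        EntitiesForNonAnsi = strOut
    End Function
    ```

    For example, EntitiesForNonAnsi("M" & ChrW(&H101) & "ori") returns M&#x0101;ori, and a paragraph can be passed in as EntitiesForNonAnsi(p.Range.Text).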

  6. #5
    WS Lounge VIP
    Join Date
    Mar 2006
    Location
    Maryland, USA
    Posts
    677
    Thanks
    17
    Thanked 57 Times in 50 Posts

    I'm not so sure you need to change the Unicode characters. A computer set for UTF-8 encoding should be able to display most Unicode characters if they are in the font that is being used. But you may need to change ANSI characters (decimal 0128 to 0159) and any character formed by changing the font to, say, Symbol or Wingdings.

    If you don't have access to a Unix computer, try changing your browser's character encoding to UTF-8 and then viewing your file.

    PamC
    Pam Caswell
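
    If the characters are left as they are, the file itself has to be written as UTF-8. One way to do that from VBA is sketched below, assuming ADODB.Stream (part of Microsoft ActiveX Data Objects) is available on the machine. The procedure name and the path in the usage line are placeholders. Note that ADODB.Stream puts a UTF-8 byte order mark at the start of the file, which some Unix tools may not expect.

    ```vba
    ' Hypothetical helper: save a VBA (UTF-16) string as a UTF-8 file.
    Sub SaveAsUtf8(ByVal strText As String, ByVal strPath As String)
        Const adTypeText As Long = 2              ' StreamTypeEnum: text stream
        Const adSaveCreateOverWrite As Long = 2   ' SaveOptionsEnum: overwrite if present
        Dim stm As Object
        Set stm = CreateObject("ADODB.Stream")    ' late bound, no project reference needed
        With stm
            .Type = adTypeText
            .Charset = "utf-8"
            .Open
            .WriteText strText
            .SaveToFile strPath, adSaveCreateOverWrite
            .Close
        End With
        Set stm = Nothing
    End Sub
    ```

    Usage (placeholder path): SaveAsUtf8 ActiveDocument.Content.Text, "C:\Temp\page.txt"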

  7. #6
    5 Star Lounger

    Cool - that's my kind of code - I'll give it a whirl thanks. (I anticipate correct results)

    I'm over most of my troubles now.
    The simple one that caught me was 3/4 (¾), which Word expresses as a single Arial character. In UTF-8 it seems to appear as an A umlaut followed by the 3/4 as a single character.
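
    That symptom is what you would typically see if the two UTF-8 bytes for the 3/4 character (U+00BE) were each shown as a separate single-byte character. The bytes are C2 BE, and in Latin-1 those are an accented capital A (strictly a circumflex rather than an umlaut) followed by the 3/4 character itself. A rough sketch that reproduces the effect follows. The function name is made up, it only handles single characters in the Basic Multilingual Plane, and it illustrates the encoding rather than fixing anything.

    ```vba
    ' Hypothetical helper: show how one character's UTF-8 bytes look
    ' when each byte is displayed as its own Latin-1 character.
    Function Utf8SeenAsLatin1(ByVal strCh As String) As String
        Dim lngCode As Long
        ' AscW reads the first character; negative results are code points from &H8000 up
        lngCode = AscW(strCh)
        If lngCode < 0 Then lngCode = lngCode + 65536
        If lngCode < 128 Then
            ' One byte, identical to ASCII
            Utf8SeenAsLatin1 = ChrW(lngCode)
        ElseIf lngCode < 2048 Then
            ' Two bytes: 110xxxxx 10xxxxxx
            Utf8SeenAsLatin1 = ChrW(192 + lngCode \ 64) & ChrW(128 + (lngCode And 63))
        Else
            ' Three bytes: 1110xxxx 10xxxxxx 10xxxxxx (surrogates are not handled)
            Utf8SeenAsLatin1 = ChrW(224 + lngCode \ 4096) & _
                ChrW(128 + ((lngCode \ 64) And 63)) & ChrW(128 + (lngCode And 63))
        End If
    End Function
    ```

    Utf8SeenAsLatin1(ChrW(190)) returns ChrW(194) & ChrW(190), that is, Â followed by ¾.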

  8. #7
    5 Star Lounger

    Thanks for the response, Pam. Our service provider uses Unix, and it was those rare few cases that caused the trouble. Many of our users love those Wingdings bullets, and we publish documents created by others (e.g. researchers), so we have little control over content or style. Personally I think it's all Word's fault, and detecting the area the problem is in will work well for me.
    (The browser's encoding had been set to UTF-8).

  9. #8
    WS Lounge VIP

    You are very welcome. And you are very right about Microsoft being the cause of much of this confusion.

    If you plan to use find and replace to change the ANSI characters, you should know that find and replace works only with decimal (not hex) numbers, and that the leading zero is very important.
    * 3 digits up to 255 (or 4 digits up to 0255, excluding the range 0128 to 0159) gives ASCII and Extended ASCII characters.
    * 4 digits from 0128 to 0159 gives Windows ANSI characters. For example, Alt+151 gives ù, Alt+0151 gives —
    * 4 digits or more greater than 0255 gives Unicode characters.
    Pam Caswell
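
    For reference, the 0128 to 0159 range is where Windows ANSI (code page 1252 on Western European systems) departs from Unicode's first 256 code points, so those are the byte values that need translating. A partial lookup is sketched here, listing only a handful of common punctuation characters. The function name is made up, and code page 1252 is an assumption about the system in use.

    ```vba
    ' Hypothetical helper: Unicode code point for a Windows-1252 byte value.
    ' Only a few entries from the 128-159 range are listed; extend as needed.
    Function Cp1252ToUnicode(ByVal bytCode As Byte) As Long
        Select Case bytCode
            Case 133: Cp1252ToUnicode = &H2026&   ' horizontal ellipsis
            Case 145: Cp1252ToUnicode = &H2018&   ' left single quotation mark
            Case 146: Cp1252ToUnicode = &H2019&   ' right single quotation mark
            Case 147: Cp1252ToUnicode = &H201C&   ' left double quotation mark
            Case 148: Cp1252ToUnicode = &H201D&   ' right double quotation mark
            Case 149: Cp1252ToUnicode = &H2022&   ' bullet
            Case 150: Cp1252ToUnicode = &H2013&   ' en dash
            Case 151: Cp1252ToUnicode = &H2014&   ' em dash
            ' Not listed here: returned as-is (only correct for 0-127 and 160-255)
            Case Else: Cp1252ToUnicode = bytCode
        End Select
    End Function
    ```

    Cp1252ToUnicode(151) returns 8212 (&H2014), the em dash that Alt+0151 produces.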
