    HTML and MSWord2000

    Woody (or whomever :-);

    I am part of a joint editing team for an International Standard. That
    standard will be released in HTML that must pass the W3C Validator; it will
    be multi-page, use Unicode (MS Arial Unicode), and include extensive
    internal hyperlinking.

    Group-development tools for HTML-based documents appear to be
    nonexistent, whereas the MSWord tools for the same purpose (revision
    marking, comments, styling, equation editing, etc.) are much better.

    We have therefore adopted a production strategy based on MSWord,
    followed by a post-processing step to generate valid HTML from the initial
    "messy" HTML that MSWord (both 97 and 2000) is capable of generating. The
    former clearly does little in this regard; the latter imposes a *lot* of
    XML overhead that doesn't pass the W3C Validator. The latter also doesn't
    seem to do a very good job of producing hyperlinks consistently.

    Fortunately, there is HTML Tidy (see: )
    and it does a great number of things right (although handling Unicode doesn't
    seem to be one as yet).

    And MS also provides a separate capability to clean up its MSWord output:

    If you run the MS Filter, and then HTML Tidy, you can almost pass validation!
    But there is a problem (feature?). We are not (at this time) prepared to switch to
    another document production environment -- especially as we are a mixed
    PC/Mac team. So our problem/feature is significant, to us. And here's where
    your team comes in ...

    I really want (among other more minor issues) the MSWord built-in
    linking (autoreferencing to section header numbers, caption numbers,
    equation numbers) that internally maintains consistency of references despite
    reorganizations to a document, to get converted to hyperlinks in HTML. This does
    *not* seem to be done by anyone (that I've tracked down, yet) -- they uniformly
    throw away this information, and assume that you will have built specific
    hyperlinks to accomplish the same.

    Not only is this painful, but it also doesn't offer the same functionality. While
    it is nice that a hyperlink can have a different label than the target text, if the
    target text changes (say, a section number) I want the label to change too.
    However, the default labels that make this work not only include underscores
    but are really based on MSWord internal tag-names and *not* the name/text
    that actually appears at the target location (e.g., you get "Figure_4_2_1_caption"
    instead of "Figure 4-1 Caption"). Using explicit hyperlinks doesn't provide me
    the functionality that good-old-references do (inside of MSWord). And only the
    former seem to get exported into the HTML; the latter are dropped.

    This still doesn't seem like an inappropriate expectation on my part, so what
    am I missing?

    That said, a lot of tools appear to handle CSS, and a host of other niceties.
    But it seems that if I want autoreferences to get converted to hyperlinks, I will
    need to build my own post-processor to do this.
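
    To make the idea concrete, here is a minimal sketch of what such a
    post-processor might look like, in Python. It rests on assumptions that are
    not confirmed by anything above: that captions come out of the export as
    <p class="caption"> paragraphs (a hypothetical class name), that references
    survive as plain visible text such as "Figure 4-1", and that a simple
    regular-expression pass is good enough for the cleaned-up HTML. The names
    anchor_id and link_references are made up for illustration.

```python
import re

# Assumed label conventions (hypothetical): "Figure 4-1", "Table 3.2", "Equation 7".
NUM = r"\d+(?:[.-]\d+)*"
KIND = r"Figure|Table|Equation"

# Assumed export format: captions appear as <p class="caption">Figure 4-1 ...</p>.
CAPTION = re.compile(rf'<p class="caption">\s*({KIND})\s+({NUM})')

# Matches either a whole caption paragraph (to be skipped) or a bare label.
MENTION = re.compile(
    rf'<p class="caption"[^>]*>.*?</p>|\b({KIND})\s+({NUM})', re.S
)


def anchor_id(kind, number):
    """Derive an id such as 'figure-4-1' from the visible label text."""
    return f"{kind.lower()}-{re.sub(r'[.-]', '-', number)}"


def link_references(html):
    """Give each caption an id, then hyperlink other mentions of its label."""
    targets = set()

    def mark(m):
        # Pass 1: record the caption's label and attach an id to its paragraph.
        ident = anchor_id(m.group(1), m.group(2))
        targets.add(ident)
        return f'<p class="caption" id="{ident}">{m.group(1)} {m.group(2)}'

    html = CAPTION.sub(mark, html)

    def link(m):
        # Pass 2: leave caption paragraphs alone; link labels with a known target.
        if m.group(1) is None:
            return m.group(0)
        ident = anchor_id(m.group(1), m.group(2))
        if ident not in targets:
            return m.group(0)
        return f'<a href="#{ident}">{m.group(0)}</a>'

    return MENTION.sub(link, html)
```

    Because the link text and the anchor are both derived from the label that
    is visible at the target, rerunning the script after each export keeps them
    in step with renumbering inside MSWord. The sketch does not guard against
    labels that appear inside tag attributes or existing links, so a real tool
    would probably want a proper HTML parser.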

    I can provide additional detail, and test-case files, if it turns out that you may
    have a possible partial/complete solution.



    Re: HTML and MSWord2000

    I would bet that if you mention this problem on the VB/VBA board, someone will write you a crude but functional post-processor in about 24 hours. Microsoft, on the other hand, has no incentive to add features to Office 2000 that it can use to generate sales for the next version.

