  1. #1
    3 Star Lounger
    Join Date
    Jan 2007
    Location
    Massachusetts, USA
    Posts
    272
    Thanks
    3
    Thanked 0 Times in 0 Posts
    Hello,

    I am saving MS Word documents out as XML files and would like to find a way
    to remove all of the MSO and other proprietary Microsoft tags from the document.

    In the end, I would like to have a valid and clean (pure) XML file. Does anyone know if this is truly possible?

    If yes, then can a macro be created to strip out the MSO and other MS related tags that are injected when saving from MS Word to XML?

    I am using MS Word 2003.

    Thanks in advance.

    -J

  2. #2
    Super Moderator jscher2000
    Join Date
    Feb 2001
    Location
    Silicon Valley, USA
    Posts
    23,112
    Thanks
    5
    Thanked 93 Times in 89 Posts
    Originally posted by jamesm067:
    "I would like to have a valid and clean (pure) XML file. Does anyone know if this is truly possible?"

    I can't help produce such a file, but I would note that XML data files typically are laid out by reference to an external schema definition. Word has its own schema definitions (I think there have been different ones over time), and it also allows users to attach a custom schema. In fact, that was the feature that triggered the patent lawsuit in Texas that Microsoft lost, so perhaps this feature will be disabled by an Office update. But until then, you could look into it. More info: search Google for "microsoft-word 2003 xml custom-schema".

  3. #3
    Super Moderator
    Join Date
    Jan 2001
    Location
    Melbourne, Victoria, Australia
    Posts
    3,852
    Thanks
    4
    Thanked 259 Times in 239 Posts
    How are you tagging the XML? If you are using content controls, then you could possibly use the Word 2007 Content Control Toolkit to show the XML and export only the XML. This is an external tool that you use to open the Word document; however, you will likely need to save it in docx format first. Can you do that using 2003 with the compatibility add-in?

    You may need to post a sample document if you are tagging the content some other way.

    Assuming you have a valid XML document (albeit with the massive overhead added by Word), you should be able to construct an XSL transform to obtain only the XML tags you require.
    Andrew Lockton, Chrysalis Design, Melbourne Australia
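
For readers who would rather script the clean-up than write XSLT, here is a minimal sketch of the same idea in Python: keep only the paragraph text and emit a small, clean XML file. It is only an illustration under stated assumptions. The namespace URI is the one Word 2003 normally writes for its `w:` prefix (check the root element of your own file), and the output element names `doc` and `para` are made up for this example.

```python
import xml.etree.ElementTree as ET

# Assumption: the w: prefix is bound to the Word 2003 WordprocessingML namespace.
W = "{http://schemas.microsoft.com/office/word/2003/wordml}"


def wordml_to_clean_xml(wordml: str) -> str:
    """Return a plain <doc><para>...</para></doc> string from Word 2003 XML.

    All formatting, proofing marks and other Word-specific elements are
    dropped; only the text of each non-empty paragraph survives.
    """
    root = ET.fromstring(wordml)
    doc = ET.Element("doc")  # hypothetical output vocabulary
    for paragraph in root.iter(W + "p"):
        # A paragraph's visible text is spread over many w:t elements.
        text = "".join(t.text or "" for t in paragraph.iter(W + "t"))
        if text.strip():
            ET.SubElement(doc, "para").text = text
    return ET.tostring(doc, encoding="unicode")
```

This throws away tables, graphics and all character formatting, so it only suits documents where the text itself is what matters. Anything richer needs a mapping from Word's elements to your own schema, which is where an XSL transform or a longer script comes in.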

  4. #4
    3 Star Lounger
    Hello Jefferson and Andrew,

    Thanks for the insight, links and tool tips (e.g. Word 2007 Content Control Toolkit). Yes, it will be quite interesting to see what is offered in the finalized version of MS Office 2010.

    I remember using a plugin one time to remove MSO tags, but the problem was that it did not get everything. Further research is needed, as I am using MS Word 2003 for this exercise.

    Thanks once again for your comments.

    -J

  5. #5
    Super Moderator jscher2000
    I saved a Word 2003 document to XML and got a huge mess. In Word, the first two paragraphs look like this (entire doc attached at the end):

    Best Reposado Tequila/Double Gold Medal
    Blue Head Tequila Reposado Tequila, Jalisco, Mexico [40%] $40. Importer: Blue Head Tequila - Tampa, FL www.blueheadtequila.com


    This was the XML generated for those two paragraphs:

    <w:p><w:pPr><w:rPr><w:b/><w:u w:val="single"/></w:rPr></w:pPr><w:r><w:rPr><w:b/><w:u w:val="single"/></w:rPr><w:t>Best </w:t></w:r><w:proofErr w:type="spellStart"/><w:r><w:rPr><w:b/><w:u w:val="single"/></w:rPr><w:t>Reposado</w:t></w:r><w:proofErr w:type="spellEnd"/><w:r><w:rPr><w:b/><w:u w:val="single"/></w:rPr><w:t> Tequila/Double Gold Medal</w:t></w:r></w:p>

    <w:p><w:r><w:rPr><w:b/></w:rPr><w:t>Blue Head Tequila </w:t></w:r><w:proofErr w:type="spellStart"/><w:r><w:rPr><w:b/></w:rPr><w:t>Reposado</w:t></w:r><w:proofErr w:type="spellEnd"/><w:r><w:t> Tequila, Jalisco, Mexico [40%] $40. Importer: Blue Head Tequila - Tampa, FL www.blueheadtequila.com</w:t></w:r></w:p>


    I know Word understands this, but it hurts my eyes.
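
Most of the bulk in a paragraph like the ones above comes from Word splitting the text into many runs that carry identical formatting, with spelling markers in between. As a rough sketch (not anything Word or this forum provides), the Python below merges neighbouring runs that share the same bold/underline state and writes them with simple `<b>` and `<u>` tags. The namespace URI and the output tag names are assumptions for illustration, and only bold and underline are looked at.

```python
import xml.etree.ElementTree as ET
from itertools import groupby
from xml.sax.saxutils import escape

# Assumption: the w: prefix is bound to the Word 2003 WordprocessingML namespace.
W = "{http://schemas.microsoft.com/office/word/2003/wordml}"


def run_style(run):
    """Return the run's formatting as a tuple such as ('b', 'u').

    Simplification: the mere presence of w:b or w:u counts as "on";
    a w:val attribute that switches the property off is not checked.
    """
    props = run.find(W + "rPr")
    if props is None:
        return ()
    return tuple(name for name in ("b", "u") if props.find(W + name) is not None)


def simplify_paragraph(paragraph):
    """Collapse one w:p element into a short <p> string with <b>/<u> tags."""
    # Only w:r children are read, so the spelling markers are skipped.
    runs = [
        (run_style(r), "".join(t.text or "" for t in r.iter(W + "t")))
        for r in paragraph.findall(W + "r")
    ]
    pieces = []
    # groupby merges consecutive runs whose style tuples are equal.
    for style, group in groupby(runs, key=lambda pair: pair[0]):
        text = escape("".join(chunk for _, chunk in group))
        for name in reversed(style):
            text = f"<{name}>{text}</{name}>"
        pieces.append(text)
    return "<p>" + "".join(pieces) + "</p>"
```

Fed the second paragraph of the sample, this produces a single bold span followed by plain text, which is much closer to what a person would write by hand.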

    I wonder whether OpenOffice.org creates more streamlined XML output?

    [attachment=88304:SFWSC09.doc]

  6. #6
    Super Moderator
    If your document is not tagged with XML, then perhaps these articles would be of assistance:
    http://www.devx.com/dotnet/Article/17358
    http://www.xml.com/pub/a/2003/12/31/qa.html

    I haven't investigated these to see how they deal with graphics or tables, but they might give you enough to work on. We have seen code on this forum in the past that converts tables to HTML. Perhaps that can be easily adapted to convert them into XHTML.
    Andrew Lockton, Chrysalis Design, Melbourne Australia

  7. #7
    3 Star Lounger
    Hello Andrew and Jefferson,

    Thanks for the replies. I appreciate it.
    Yes, Jefferson, that is exactly what MS Word does (Word 2003 and Word 2007) upon conversion to XML. Will the court order change that?

    I am sure there is some tool or macro out there that can be used to clean up the code during or after conversion to SGM or XML output.
    Speaking of SGM, I am still trying to find a way to save out to SGM from MS Word, but that is yet another discussion...

    Thanks for the links Andrew, I will look into it further.

    -J
