  1. #1
    New Lounger
    Join Date
    Oct 2007
    Copenhagen, Denmark

    Internals of doc (2003 SP3)

    Hi all,

    To avoid the overhead of developing, testing and running a heavy, time-consuming COM/automation solution for parsing a large number of documents, I was wondering whether it is possible to parse a .doc file directly. Is there any way to read and decode the internals of a .doc *without* bringing Word into the game?

    Can I find a list of tags for such parsing anywhere? Say the job is to find where tables begin and end in a .doc (without Word): I expect I would have to translate what I find in the binary data into something with reasonable human meaning, as if it were HTML: <table>... </table>.

    Another way is to convert the .doc (again without Word in between) into another standard open format such as XML, HTML or RTF and then parse the output. That calls for a doc2html-type tool, I suppose, and I don't have one of those either.

    Any pointers to _any_ ideas or sources for performing such a delicate Office task would be most appreciated.

    Kind regards,
    Michael Mogensen, Denmark.

  2. #2
    Plutonium Lounger
    Join Date
    Mar 2002

    Re: Internals of doc (2003 SP3)

    The specification of the .doc file format is available from Microsoft under "Microsoft Office Binary (doc, xls, ppt) File Formats", but it is not easy stuff! Opening a document in Word and processing it is much easier.
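
    To give a feel for what "not easy" means: a Word 97-2003 .doc is not a tagged text file but an OLE2 compound file, a small file system of sectors and streams, with the document text and formatting stored in binary structures inside a stream named WordDocument. The sketch below, in Python using only the standard library, reads just the outermost layer, the 512-byte compound-file header. The function name read_cfb_header and the dictionary keys are made up for this illustration; getting from here to "where does a table start" still means walking the sector chains and then decoding the Word structures described in the specification.

```python
import struct

# Every OLE2 compound file (the container used by .doc) starts with these 8 bytes.
CFB_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


def read_cfb_header(data: bytes) -> dict:
    """Parse a few fields of the compound-file header (illustrative helper)."""
    if len(data) < 512:
        raise ValueError("too short to hold a compound-file header")
    if data[:8] != CFB_SIGNATURE:
        raise ValueError("not an OLE2 compound file")

    # Offsets 24..33: minor version, major version, byte-order mark,
    # sector shift, mini-sector shift (five little-endian 16-bit values).
    minor, major, byte_order, sector_shift, mini_shift = struct.unpack_from(
        "<5H", data, 24
    )
    # Offsets 44..51: number of FAT sectors, first directory sector.
    num_fat_sectors, first_dir_sector = struct.unpack_from("<2I", data, 44)

    return {
        "major_version": major,
        "minor_version": minor,
        "little_endian": byte_order == 0xFFFE,
        "sector_size": 1 << sector_shift,
        "mini_sector_size": 1 << mini_shift,
        "num_fat_sectors": num_fat_sectors,
        "first_dir_sector": first_dir_sector,
    }
```

    To try it, read the first 512 bytes of a file opened in binary mode and pass them in, for example read_cfb_header(open(path, "rb").read(512)). A ValueError means the file is not in the binary container format at all (it might be RTF or plain text saved with a .doc extension).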
