Thread: Internals of doc (2003 SP3)
2008-03-31, 10:36 #1
- Join Date
- Oct 2007
- Copenhagen, Denmark
Internals of doc (2003 SP3)
To avoid the overhead of developing, testing, and running a heavy, time-consuming COM/automation solution for parsing a large number of documents, I was wondering whether it is possible to parse a doc file directly. Is there any way to read and decode the internals of a doc *without* bringing Word into the game?
Can I find a list of tags for such parsing anywhere? Say the job is to find where tables begin and end in a doc (without Word). I expect I would have to translate what I find in the binary data into something with reasonable human meaning, as if it were HTML: <table>... </table>.
Another way is to translate the doc (again without Word in between) into another standard open format like XML, HTML, or RTF and then parse the output. This calls for a doc2html-thingy, I suppose, and I don't have that either.
Any pointers to _any_ ideas or sources for performing such a delicate Office task would be most appreciated.
Michael Mogensen, Denmark.
2008-03-31, 10:58 #2
- Join Date
- Mar 2002
Re: Internals of doc (2003 SP3)
The specification of the .doc file format is available at Microsoft Office Binary (doc, xls, ppt) File Formats, but it is not easy stuff! Opening a document in Word and processing it there is much easier.
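To give a feel for what reading a doc without Word involves, here is a minimal sketch in Python of the very first step only. It assumes the file is an OLE2 compound file, which is the container that binary .doc files from Word 97-2003 use. It checks the signature and decodes a few fixed header fields; it does not attempt to locate tables, which would mean walking the WordDocument stream as the specification describes. The function name `read_ole2_header` and the dictionary keys are made up for illustration.

```python
import struct

# Signature that opens every OLE2 compound file, the container used by
# binary .doc files written by Word 97-2003.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def read_ole2_header(data: bytes) -> dict:
    """Decode a few fixed fields from the 512-byte compound file header.

    Illustrative helper (not part of any library); raises ValueError if
    the data does not look like an OLE2 compound file.
    """
    if len(data) < 512:
        raise ValueError("header is shorter than 512 bytes")
    if data[:8] != OLE2_MAGIC:
        raise ValueError("not an OLE2 compound file")

    # Offsets 24..33: minor version, major version, byte-order mark,
    # sector shift, mini-sector shift (all little-endian 16-bit values).
    minor, major, byte_order, sector_shift, mini_shift = struct.unpack_from(
        "<5H", data, 24
    )
    # Offset 48: sector number where the directory (list of streams) starts.
    (first_dir_sector,) = struct.unpack_from("<I", data, 48)

    return {
        "major_version": major,
        "minor_version": minor,
        "little_endian": byte_order == 0xFFFE,
        "sector_size": 1 << sector_shift,
        "mini_sector_size": 1 << mini_shift,
        "first_directory_sector": first_dir_sector,
    }
```

It could be tried on a real file with something like `read_ole2_header(open("example.doc", "rb").read(512))`, where `example.doc` is a placeholder path. Everything past this point (the sector allocation table, the directory, the streams inside it, and then Word's own structures) still has to be decoded by hand from the specification.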