  1. #1
    Platinum Lounger
    Join Date
    Feb 2001
    Location
    Yilgarn region of Toronto, Ontario
    Posts
    5,453

    identifying files by content (word97/sr2 et al.)

    Has anyone stumbled across THE definitive work on identifying files by content, rather than by file extension association?


    The "Microsoft Word 97 Binary File Format" tells us (buried deep within "File Information Block (FIB)") that at offset X'220' we will find a "unique number Identifying the File's creator 0x6A62 is the creator ID for Word and is reserved. Other creators should choose a different value."

    Spotting that 0x6A62 at that offset seems to tell me "This is a Word document", regardless of how an errant user may have renamed the file or changed its extension.



    Presumably other MSOffice and other Office Suite applications (Lotus, Corel et al.) have similar code-strings in their documents.


    I'm pretty sure Phil Katz has a PKZip signature code in ZIP files.

    COM and EXE must have unique identification strings near the head of the file, as must DLL files.



    Armed with a table of references (such as the "0x6A62 at 220" above), it ought to be fairly easy to determine a file's type by its content.
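A minimal sketch of such a table in Python, assuming each entry is an (offset, magic bytes, label) triple. The names `SIGNATURES`, `identify` and `identify_file` are made up for illustration. The Word entry uses the offset exactly as quoted in this post and assumes the 16-bit value is stored little-endian, so treat it as unverified. "PK\x03\x04" and "MZ" are the usual leading bytes of ZIP archives and of EXE/DLL files; plain COM files are raw program images with no header, so they cannot be spotted this way.

```python
# Hypothetical signature table: (offset, magic bytes, label).
# More specific entries come first, because the first match wins.
SIGNATURES = [
    # Offset 0x220 is the figure quoted in the post above (unverified);
    # 0x6A62 is assumed to be stored little-endian, i.e. bytes 62 6A.
    (0x220, b"\x62\x6A", "Word document"),
    (0, b"PK\x03\x04", "ZIP archive"),
    (0, b"MZ", "DOS/Windows executable (EXE or DLL)"),
    (0, b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1", "OLE2 compound file"),
]


def identify(data: bytes) -> str:
    """Return the label of the first signature found in data, else 'unknown'."""
    for offset, magic, label in SIGNATURES:
        if data[offset:offset + len(magic)] == magic:
            return label
    return "unknown"


def identify_file(path: str, probe_size: int = 1024) -> str:
    """Read only the first probe_size bytes of the file and identify them."""
    with open(path, "rb") as handle:
        return identify(handle.read(probe_size))
```

Only the head of the file is read, so the probe size has to cover the largest offset in the table (0x220 plus two bytes here).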

    Where all the references had failed, one might fall back on "If all bytes in the file are keyboard characters, it's an ASCII text file".
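The fallback could be sketched like this in Python, assuming "keyboard characters" means printable ASCII plus the usual whitespace (tab, CR, LF and so on). The name `looks_like_ascii_text` is made up for illustration, and treating an empty file as "not text" is an arbitrary choice.

```python
import string

# Assumption: "keyboard characters" = printable ASCII plus common whitespace.
TEXT_BYTES = frozenset(string.printable.encode("ascii"))


def looks_like_ascii_text(data: bytes) -> bool:
    """True if data is non-empty and every byte is a printable ASCII character."""
    return len(data) > 0 and all(byte in TEXT_BYTES for byte in data)
```

A check like this only makes sense as a last resort, after every signature test has failed, since some binary formats begin with readable text.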



    I'm amazed that I can't find a reference to this need on the web.

  2. #2
    Gold Lounger
    Join Date
    Dec 2000
    Location
    Hollywood (sorta), California, USA
    Posts
    2,759

    Re: identifying files by content (word97/sr2 et al.)

    Chris,
    The program called "Outside In" is a file viewer suite that makes use of a file's identifying signature. The Windows "Quick View" feature is licensed from Stellent (the makers of Outside In).

    Obviously, these developers know the unique file signature for lots of file types. They open a file, read the signature, then call the appropriate viewer .dll with which to view the file. (I remember the old WordPerfect files had 57 50 43 -- WPC -- at offset 1 hex.)
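The read-the-signature-then-dispatch idea might look roughly like this in Python. This is only a sketch of the general pattern, not how Outside In or Quick View actually work; every function name here is hypothetical, and the stand-in viewers just return a label.

```python
# Hypothetical stand-ins for the real viewer modules.
def view_zip(data: bytes) -> str:
    return "zip viewer"


def view_wordperfect(data: bytes) -> str:
    return "wordperfect viewer"


def view_fallback(data: bytes) -> str:
    return "hex dump viewer"


# (offset, magic bytes, viewer): the WPC entry is the one recalled above.
VIEWERS = [
    (0, b"PK\x03\x04", view_zip),
    (1, b"WPC", view_wordperfect),
]


def open_with_viewer(data: bytes) -> str:
    """Hand the data to the first viewer whose signature matches."""
    for offset, magic, viewer in VIEWERS:
        if data.startswith(magic, offset):
            return viewer(data)
    return view_fallback(data)
```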

    You might want to see what you can learn from Stellent. This ain't much, but who knows... http://www.stellent.com/intradoc-cgi/nph-i...ocName=p2000408
    Kevin

  3. #3
    Platinum Lounger
    Join Date
    Feb 2001
    Location
    Yilgarn region of Toronto, Ontario
    Posts
    5,453

    Re: identifying files by content (word97/sr2 et al.)

    Thanks, Kevin. I'll see what I can dig out.

    It still strikes me as odd that there's no definitive list, as we have for just about everything else ....
