  1. #1
    Platinum Lounger
    Join Date
    Feb 2001
    Location
    Yilgarn region of Toronto, Ontario
    Posts
    5,453
    Thanks
    0
    Thanked 0 Times in 0 Posts

    recovering text from spool files (Word97SR2 et al.

    Will I be the first in this forum to write a generic application to recover useful and useable text from a spool file?

    Client sends me a WPS file which appears to be generated from Quill96 (???) for an HP PSC 1200 laser printer (I'm looking at it with Vernon Buerg's List.COM DOS utility).

    I can see interesting stuff, peppered with, of course, printer control characters.

    An algorithm creeps over me:

    I ship the WPS file via NotePad to Word, and perform a few crude tricks to obtain good stuff:

    1) I strip out all but alphabetics, digits, space and common punctuation (,.;:?!-) using my utility function strOnly(strIn,strReferenceSet). That gets rid of a LOT of garbage.

    2) I note that there is a space between each character. Once I find and delimit the words, I can replace double spaces with a tab, replace single spaces with nothing, and flip the tabs back to single spaces.

    3) I can make a stab at words by isolating strings at upper-case letters, i.e. assume that an upper-case letter is the start of a word. For each such string I could spell-check successively longer substrings of that string, starting at the left, and splitting off such "words" as they are found.
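
    Steps 1 and 2 above can be sketched as follows. This is a minimal Python illustration rather than VBA; str_only is a hypothetical stand-in for the strOnly utility named in step 1 (its real implementation is not shown in this thread), and the sample input is invented.

```python
import string

# Reference set from step 1: alphabetics, digits, space and common punctuation.
KEEP = set(string.ascii_letters + string.digits + " ,.;:?!-")


def str_only(text, reference_set=KEEP):
    """Keep only characters found in reference_set (assumed behaviour of strOnly)."""
    return "".join(ch for ch in text if ch in reference_set)


def collapse_spacing(text):
    """Step 2: double space -> tab, single space -> nothing, tab -> single space."""
    return text.replace("  ", "\t").replace(" ", "").replace("\t", " ")


# Invented sample: an escape byte, letter-spaced text, then a form feed.
raw = "\x1bT h e  c a t  s a t .\x0c"
cleaned = collapse_spacing(str_only(raw))
print(cleaned)
```

    The order of the three replacements matters: the tab is only a temporary marker for real word breaks, so it has to be planted before the single spaces are deleted.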


    At this point I think I'll have a readable mess. I could ship it back to client for them to edit, or I could hand-tweak it myself.
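
    The word-splitting idea in step 3 might look something like this. It is a sketch in Python, not VBA, and the tiny WORDS set is an assumption standing in for a real spell-checker; the function names are made up for illustration.

```python
import re

# Toy dictionary standing in for a spell-checker (assumption).
WORDS = {"the", "cat", "sat", "on", "mat"}


def is_word(candidate):
    return candidate.lower() in WORDS


def split_at_capitals(text):
    """Break text into chunks that each start at an upper-case letter."""
    return re.findall(r"[A-Z][^A-Z]*|[^A-Z]+", text)


def split_words(chunk):
    """Try successively longer prefixes; split off the first one that is a word."""
    words, start = [], 0
    while start < len(chunk):
        for end in range(start + 1, len(chunk) + 1):
            if is_word(chunk[start:end]):
                words.append(chunk[start:end])
                start = end
                break
        else:
            # No prefix is a known word: keep the remainder untouched.
            words.append(chunk[start:])
            break
    return words


run_together = "ThecatsatOnthemat"
recovered = " ".join(
    word for chunk in split_at_capitals(run_together) for word in split_words(chunk)
)
print(recovered)
```

    Taking the shortest matching prefix follows the description in step 3, but it will split "a" off the front of "and" if both are in the dictionary; trying the longest prefix first is one possible variation.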

    Thoughts? Comments? Waste of time? Interesting VBA exercise?


    If I write it, would you be willing to test it for me (by sending a text document to spool and having me decipher it)?

  2. #2
    Super Moderator
    Join Date
    Jan 2001
    Location
    Melbourne, Victoria, Australia
    Posts
    3,853
    Thanks
    4
    Thanked 259 Times in 239 Posts

    Re: recovering text from spool files (Word97SR2 et al.

    Where exactly is your market coming from Chris?

    I think it would be an interesting VBA project with the same intrinsic value as completing a crossword - it fills the time, makes me feel good if I happen to complete it but ultimately performs no useful purpose. I might even be tempted to short-circuit the project by taking postscript spool files only and turning them into Acrobat.

    I would either find a client to pay for that up front or prepare a prospectus and float it on the stockmarket before embarking on the project. (evil grin)

    That way, I can get back to serious work - you don't happen to know a seven letter word that means procrastinate?
    Andrew Lockton, Chrysalis Design, Melbourne Australia

  3. #3
    Bronze Lounger
    Join Date
    Nov 2001
    Location
    Arlington, Virginia, USA
    Posts
    1,394
    Thanks
    0
    Thanked 3 Times in 3 Posts

    Re: recovering text from spool files (Word97SR2 et al.

    > Will I be the first in this forum to write a generic application to recover useful and useable text from a spool file?
    A: If so, yes, and probably also the last.
    > An algorithm creeps over me
    A: Call the exterminator!
    > Thoughts? Comments? Waste of time? Interesting VBA exercise?
    A: Item c, Waste of time.

    Please excuse sarcasm, but cannot envision any useful purpose for this. A VBA book I own has an example of using VBA to programmatically create a crossword puzzle in Word using tables, columns, etc. (no joke). That would be a more useful employment of one's time than the project envisioned here -- at least people like to do crossword puzzles....

    HTH

  4. #4
    Bronze Lounger
    Join Date
    Nov 2001
    Location
    Arlington, Virginia, USA
    Posts
    1,394
    Thanks
    0
    Thanked 3 Times in 3 Posts

    Re: recovering text from spool files (Word97SR2 et al.

    PS: Here is link to site where you can download Crossword Puzzle VBA demo files:

    Office VBA Developer - Download Page

    If you go to this page, click "Accept Terms" option, the download links will appear. Click link:
    Chapter 2: Word Solution Development (72 KB).

    This link will allow you to download file: Chapter_2_Source_Files.zip.

    After unzipping, see Crossword_Demo.doc. Other required files included in zip file. If in need of a new "project" to occupy your time, this may be a more useful one to pursue.

    HTH

  5. #5
    Super Moderator jscher2000
    Join Date
    Feb 2001
    Location
    Silicon Valley, USA
    Posts
    23,112
    Thanks
    5
    Thanked 93 Times in 89 Posts

    Re: recovering text from spool files (Word97SR2 et al.

    I once evaluated a product that was designed to mine data from mainframe print files. The product might have been called Monarch (?). I think those files probably were cleaner than the ones you are describing, so it's a different problem. Have you been following the thread (somewhere on the board!) about converting PCL for printing on a non-PCL-speaking printer? I think there are a few stabs at it on SourceForge.

  6. #6
    Platinum Lounger
    Join Date
    Feb 2001
    Location
    Yilgarn region of Toronto, Ontario
    Posts
    5,453
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Re: recovering text from spool files (Word97SR2 et

    Andrew, thanks for the response, which provoked yet more thoughts within me.

    My market (in terms of cash) for this is zero, but then so is much of what I do on a "whim", and it turns out to be a useful tool down the road. From a marketing viewpoint this is a no-go, but I often lay up an inventory of useful techniques in my utility library.

    I had spent about one hour de-scrambling this particular text file, and figured that four hours spent cobbling together a crude de-scrambler would be an ace up my sleeve for the next cry-for-help.

    It does pass the time, and it is "interesting", but I tend to lump it in as part of the R&D that I feel I need to do to expand my skill set.

    I hadn't thought of using Acrobat to extract text; I have the 6.0 reader and that didn't work, but perhaps the full version does the trick.

    The greatest value to me was further insight into a topic "Powerful Engines", identifying and using existing applications to drive newer applications; I'd mentioned in my first post that I use NotePad primarily as a text filter. I've added Acrobat to that list, and will essay with that at the first opportunity.

    I've also had time to mull over other sources of text - perhaps corrupted files, or those files protected by password systems that do NOT encrypt the text.

    Anyway, thanks again; your responses always do one of two things (1) help me dig myself out of a hole or (2) make me see new horizons. Same thing, I suppose!

  7. #7
    Platinum Lounger
    Join Date
    Feb 2001
    Location
    Yilgarn region of Toronto, Ontario
    Posts
    5,453
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Re: recovering text from spool files (Word97SR2 et

    > If so, yes, and probably also the last.

    Mark, you already know that I'm used to that!

    > but cannot envision any useful purpose for this

    Nor could I until I'd spent an hour manually sifting through a text file. Now I can.

    I tend to be "down" on programming text books. After the obligatory "Hello World!" exercise, the code samples seem to be useless. That is, of course, one of the advantages of classroom training - the instructor can respond to specific needs.

    FWIW I've also moved away from teaching the syntax up front "This is a FOR loop, this is an IF statement, this is ... ".

  8. #8
    Platinum Lounger
    Join Date
    Feb 2001
    Location
    Yilgarn region of Toronto, Ontario
    Posts
    5,453
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Re: recovering text from spool files (Word97SR2 et

    > mine data from mainframe print files.

    http://monarch.datawatch.com/

    Thanks for this, Jonothan, I found the site.

    Yes, they seem to be de-formatting text (in this case print files and spreadsheets), rather than obtaining text. I'm looking at a two-tiered approach.

    My first step (lexical) is to obtain strings, the second step would be to interpret them as postal codes, telephone numbers and so on. I think Monarch is closer to the second step.
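
    The second tier could be sketched roughly as below, in Python rather than VBA. The two patterns (a Canadian-style postal code and a North American telephone number) and the names PATTERNS and interpret are assumptions chosen for illustration; a real tool would need a much longer table.

```python
import re

# Assumed patterns, checked in order; anything unmatched is plain text.
PATTERNS = [
    ("postal_code", re.compile(r"[A-Z]\d[A-Z] ?\d[A-Z]\d")),
    ("telephone", re.compile(r"\(?\d{3}\)?[ -]?\d{3}-\d{4}")),
]


def interpret(candidate):
    """Label a string produced by the lexical step."""
    for label, pattern in PATTERNS:
        if pattern.fullmatch(candidate):
            return label
    return "text"


# Invented sample strings, as if handed over by the first (lexical) step.
for item in ["M5V 2T6", "416-555-0199", "Dear client"]:
    print(item, "->", interpret(item))
```

    Keeping the interpretation step separate means the lexical step can stay ignorant of what the strings mean, which matches the two-tier split described above.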


    >thread about converting PCL for printing

    I had a quick look for this this morning but will go hunt further. I hadn't seen it.

    As you will see from my response to Andrew, although Monday's effort turned out to be a print file, I'm considering this low-level approach to be applicable to just about any file.

    I wrote out the state transition table Monday night - it's very small. I already have library code to do most of the work. If I get a version working I'll post it here.
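
    The actual table is not reproduced in this thread, so the following is only a guess at what a very small one could look like: two states, with a transition on whether each byte is printable ASCII. It is a Python sketch, and every name in it (TRANSITIONS, classify, extract_runs, min_len) is hypothetical.

```python
TEXT, JUNK = "text", "junk"

# (current state, input class) -> next state. Assumed table, not the author's.
TRANSITIONS = {
    (JUNK, "printable"): TEXT,
    (JUNK, "other"): JUNK,
    (TEXT, "printable"): TEXT,
    (TEXT, "other"): JUNK,
}


def classify(byte):
    """Input class for one byte: printable ASCII or anything else."""
    return "printable" if 32 <= byte <= 126 else "other"


def extract_runs(data, min_len=4):
    """Collect runs of printable bytes that are at least min_len long."""
    runs, current, state = [], [], JUNK
    for byte in data:
        next_state = TRANSITIONS[(state, classify(byte))]
        if next_state == TEXT:
            current.append(chr(byte))
        elif state == TEXT:
            # Leaving a text run: keep it only if it is long enough.
            if len(current) >= min_len:
                runs.append("".join(current))
            current = []
        state = next_state
    if len(current) >= min_len:
        runs.append("".join(current))
    return runs


sample = b"\x1b\x00Hello world\x0c\x01ab\x02Bye!"
print(extract_runs(sample))
```

    The min_len threshold is there to drop short accidental runs such as the "ab" in the sample; the right value would depend on the files involved.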

  9. #9
    Super Moderator
    Join Date
    Jan 2001
    Location
    Melbourne, Victoria, Australia
    Posts
    3,853
    Thanks
    4
    Thanked 259 Times in 239 Posts

    Re: recovering text from spool files (Word97SR2 et

    Chris

    Acrobat reader won't do the job. The full version of Acrobat works by taking a Postscript printer file and converting it into an Acrobat file (it is my understanding that Acrobat files are really postscript files with minor changes). But you don't need to spend big bucks to do this with Acrobat - it can be done with freeware or shareware such as Ghostscript, CutePDF, JAWS.

    In writing this I just had an epiphany and thought 'I wonder if PCL to PostScript converters exist' - Google tells me many others have thought of this before and all the work has already been done to convert PCL into Acrobat. E.g. http://www.lincolnco.com/lincpdf.htm has software they will sell for $1,500, but Google will return many others as well.

    I still don't understand what you intend to do with these new tools in your toolkit. If you want to do some R&D work then I would be looking at XML and applying VBA to create and parse XML. There is enough there to entertain me for years and I see a lot more market there in the future.
    Andrew Lockton, Chrysalis Design, Melbourne Australia

  10. #10
    Platinum Lounger
    Join Date
    Feb 2001
    Location
    Yilgarn region of Toronto, Ontario
    Posts
    5,453
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Re: recovering text from spool files (Word97SR2 et

    > taking a Postscript printer file and converting it

    These sound like specialised solutions - using PostScript as the source; I'm interested in a generic solution - the ability to determine, somewhat intelligently, what constitutes English-language text, and what doesn't, from any machine-readable stream.


    > what you intend to do with these new tools in your toolkit.

    The application I'm considering will be built from existing library code, and some new library routines. Those new library routines will become my tools of the future. I think that one of the first VBA routines I wrote, while I was still running WordBasic in Word97, was dear old strSplitAt, to parse a string at each delimiter. That routine figures in just about every string application I get asked to write.
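
    For readers who have not met a routine of this kind, here is a rough Python analogue of splitting a string at each delimiter. The original strSplitAt is not shown in this thread, so the signature and behaviour of str_split_at below are assumptions.

```python
def str_split_at(text, delimiter):
    """Return the pieces of text found between occurrences of delimiter."""
    if not delimiter:
        raise ValueError("delimiter must be non-empty")
    parts, start = [], 0
    while True:
        pos = text.find(delimiter, start)
        if pos == -1:
            # No further delimiter: the rest of the string is the last piece.
            parts.append(text[start:])
            return parts
        parts.append(text[start:pos])
        start = pos + len(delimiter)


fields = str_split_at("Smith;Toronto;;Ontario", ";")
print(fields)
```

    In this sketch an empty piece between two adjacent delimiters is kept rather than dropped, so field positions stay stable; whether the VBA routine does the same is not stated.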

    To me the distinction between, say, a medical doctor reading a journal and me developing my knowledge in VBA is blurred. In both cases we are extending our abilities with no direct payment in sight.


    The "conversion application" will be a fun hobby that ties together existing routines and new routines.

    I agree that I could, as well, delve into XML, but it's quite possible that my rules-based Document Cleanser can already do that. The DocCleanser interprets a table of rules for massaging text, and includes the facility to splice in end-user macros with each rule. Back in the WordBasic days of Word6.0 I was happily reading and converting AmiPro 3.0 documents when Microsoft couldn't/wouldn't do it.
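
    The rules-table idea can be illustrated with a short Python sketch. This is not the DocCleanser itself, whose rule format is not given here; the three-field rule layout (pattern, replacement, optional macro) and the sample rules are assumptions.

```python
import re

# Each rule: (regex pattern, replacement, optional macro run after the rule).
# The layout and the rules themselves are illustrative assumptions.
RULES = [
    (r"\r\n", "\n", None),            # normalise line endings
    (r"[ \t]+\n", "\n", None),        # drop trailing whitespace
    (r"\n{3,}", "\n\n", str.strip),   # squeeze blank lines, then trim the ends
]


def cleanse(text, rules=RULES):
    """Apply each rule in table order, calling its macro if one is attached."""
    for pattern, replacement, macro in rules:
        text = re.sub(pattern, replacement, text)
        if macro is not None:
            text = macro(text)
    return text


messy = "Title  \r\n\r\n\r\n\r\nBody \r\n"
tidy = cleanse(messy)
print(repr(tidy))
```

    Because the rules live in a table rather than in code, adding a new conversion is a matter of adding a row, and the macro slot gives an end user a place to hook in anything a plain find-and-replace cannot express.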

  11. #11
    Super Moderator jscher2000
    Join Date
    Feb 2001
    Location
    Silicon Valley, USA
    Posts
    23,112
    Thanks
    5
    Thanked 93 Times in 89 Posts

    Re: recovering text from spool files (Word97SR2 et

    > I'm interested in a generic solution - the ability to determine, somewhat intelligently, what constitutes
    > English-language text, and what doesn't, from any machine-readable stream.

    Have you seen the "Recover Text From Any File (*.*)" File>Open filter in MS Word? Maybe you can use it as your preprocessor and try to clean up its output.

  12. #12
    Platinum Lounger
    Join Date
    Feb 2001
    Location
    Yilgarn region of Toronto, Ontario
    Posts
    5,453
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Re: recovering text from spool files (Word97SR2 et

    > try to clean up its output.

    No need. Aaaaaaaaaaaaargh. It's lovely! And so are your green lenses!
