  1. #1
    Platinum Lounger
    Join Date
    Feb 2001
    Location
    Yilgarn region of Toronto, Ontario
    Posts
    5,453
    Thanks
    0
    Thanked 0 Times in 0 Posts

    algorithms - random sampling of files (Word/all)

    You can safely ignore this posting if you are not interested in a discussion of algorithms, especially those regarding the two basic methods of obtaining file names from the system (DIR and FSO).

    The attached document outlines some of my thoughts on two methods of obtaining random samples of files from a directory tree. I'd be interested to hear comments, especially as they concern the choice between DIR and FSO.

    I have a working program (Files Processor) that builds a table of file names, selecting n/p items from the entire tree (thus 1/1 means all files, 1/10 means one out of every ten).

    I am now faced with a new constraint - that there must be at least one file from each non-empty folder.

    I'm mulling over whether to continue using the FSO or to revert to the DIR method, and would appreciate comments about high-level concerns.
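
    The n/p rule combined with the new at-least-one-per-folder constraint can be sketched as a small helper. This is only an illustration: the name SampleSize and the decision to apply n/p folder by folder (rather than across the whole tree, as Files Processor does today) are assumptions, not part of the existing program.

```vba
' Hypothetical helper: how many files to take from one folder
' when sampling n out of every p, with a floor of one file
' for any non-empty folder.
Function SampleSize(ByVal nFiles As Long, ByVal n As Long, ByVal p As Long) As Long
    ' An empty folder (or a nonsensical ratio) contributes nothing
    If nFiles <= 0 Or n <= 0 Or p <= 0 Then Exit Function

    ' Integer division, so 1/10 of 25 files is 2
    SampleSize = (nFiles * n) \ p

    ' New constraint: at least one file from each non-empty folder
    If SampleSize < 1 Then SampleSize = 1

    ' Never ask for more files than the folder holds
    If SampleSize > nFiles Then SampleSize = nFiles
End Function
```

    With this sketch, a folder of 3 files sampled at 1/10 still yields one file, and 1/1 yields every file.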

  2. #2
    Star Lounger
    Join Date
    Jan 2001
    Posts
    71
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Re: algorithms - random sampling of files (Word/all)

    Interesting challenge! The following doesn't use FSO or Dir, but works with the Windows Shell.

    <pre>Option Explicit

    'Make a reference to:
    'Microsoft Scripting Runtime (scrrun.dll)
    'Microsoft Shell Controls and Automation (shell32.dll)

    Dim DicResult As New Scripting.Dictionary


    Sub CreateRandomFileList()
    Dim StartDir As String, i As Long
    'specify where to start
    StartDir = "C:"
    GetFolderItems StartDir
    'display the results in a new Word doc
    Documents.Add
    For i = 1 To DicResult.Count
    Selection.TypeText CStr(i) & vbTab & DicResult(i)
    Selection.TypeParagraph
    Next
    DicResult.RemoveAll
    Set DicResult = Nothing
    End Sub

    Sub GetFolderItems(Folder As String)
    On Error Resume Next
    Dim FI As Shell32.FolderItem, i As Long
    Dim DicFolder As New Scripting.Dictionary
    With New Shell
    'evaluate the namespace
    With .NameSpace(Folder)
    For Each FI In .Items
    If Not (FI Is Nothing) Then
    If FI.IsFolder Then
    'evaluate this namespace
    GetFolderItems FI.Path
    Else
    'add to dictionary
    i = i + 1
    DicFolder(i) = FI.Name
    End If
    End If
    Next
    End With
    End With
    'see what we've got
    GetRandomFiles Folder, DicFolder, 10
    Set FI = Nothing
    Set DicFolder = Nothing
    End Sub

    Sub GetRandomFiles(Path As String, DicFolder As Scripting.Dictionary, OneOutOfX As Long)
    Dim nRandom As Long, nRepeat As Long, nFiles As Long, i As Long
    'dictionary content
    nFiles = DicFolder.Count
    'this may be zero
    If nFiles = 0 Then Exit Sub
    'repeat random selection
    If nFiles > OneOutOfX Then
    nRepeat = nFiles \ OneOutOfX
    For i = 1 To nRepeat
    nRandom = Int((nFiles) * Rnd + 1)
    'add to dictionary
    DicResult(DicResult.Count + 1) = BuildPath(Path, DicFolder(nRandom))
    'remove this one, so we don't get it more than once
    DicFolder.Remove nRandom
    'work with the remaining files
    nFiles = DicFolder.Count
    Next
    Else
    'get just one random file
    nRandom = Int((nFiles) * Rnd + 1)
    DicResult(DicResult.Count + 1) = BuildPath(Path, DicFolder(nRandom))
    End If
    End Sub

    Function BuildPath(Path As String, File As String) As String
    If Not (Path Like "*\") Then Path = Path & "\"
    BuildPath = Path & File
    End Function

    </pre>


  3. #3
    Platinum Lounger
    Join Date
    Feb 2001
    Location
    Yilgarn region of Toronto, Ontario
    Posts
    5,453
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Re: algorithms - random sampling of files (Word/al

    > doesn't use FSO or Dir, but works with the Windows Shell
    Oh great. Just GREAT! Now I have to choose from THREE different methods .... (Grin!).

    Don, thanks for the feedback. I have run your code, and enjoy it. I am particularly interested in your method of selection - GetRandomFiles. I think that task was causing me to go back to DIR, as I can deal with folders on an independent basis.

  4. #4
    Platinum Lounger
    Join Date
    Feb 2001
    Location
    Yilgarn region of Toronto, Ontario
    Posts
    5,453
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Re: algorithms - random sampling of files (Word/al

    If I've understood your code, the following line would take my various tests for suitability of each file. (I have criteria that include the extent, size, date last modified, content="WPC", and so on).

    I have located references to "Scripting.Dictionary" in the lounge and downloaded the Scripting help file. It looks like a job for a complete 12-cup pot of coffee.

  5. #5
    Star Lounger
    Join Date
    Jan 2001
    Posts
    71
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Re: algorithms - random sampling of files (Word/al

    I'm not sure I understand this. Which line are you referring to?
    I didn't read anything like that in the original specs.
    About coffee: my day starts with four very strong espressos.
    I hate coffee out of a pot... :-)

  6. #6
    Platinum Lounger
    Join Date
    Feb 2001
    Location
    Yilgarn region of Toronto, Ontario
    Posts
    5,453
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Re: algorithms - random sampling of files (Word/al

    > Which line are you referring to?
    <pre>DicFolder(i) = FI.Name</pre>


    Sorry. I forgot to paste it, after all that.

    At the point of "DicFolder(i) = FI.Name", I have the full name of the file.

    It is at that point, is it not, that I should start testing it against masks, extents, date/time, size, whether or not it is a WordPerfect file, and any other constraints of my particular application?

  7. #7
    Star Lounger
    Join Date
    Jan 2001
    Posts
    71
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Re: algorithms - random sampling of files (Word/al

    > At the point of "DicFolder(i) = FI.Name", I have the full name of the file.

    Yes, but not the full path name.

    > It is at that point, is it not, that I should start testing it against masks, extents, date/time, size whether or not it is a Wordperfect file, and any other constraints of my particular application?

    Yes, use FI.ModifyDate, FI.Size and FI.Type to get additional info.
    Shell32 doesn't give you extensions, but you can use Split() or InStrRev() to get the substring after the last period.
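
    The InStrRev approach might look like the sketch below. The function name FileExtension is made up for illustration. One caveat, stated as an assumption rather than something tested here: FI.Name is a display name, so if Explorer is set to hide extensions for known file types it may come back without one, in which case passing FI.Path is probably the safer choice.

```vba
' Hypothetical helper: return the text after the last period of a
' file name (or full path), or "" if there is no extension.
Function FileExtension(ByVal FileName As String) As String
    Dim dotPos As Long, slashPos As Long

    dotPos = InStrRev(FileName, ".")      ' last period, 0 if none
    slashPos = InStrRev(FileName, "\")    ' last backslash, 0 if none

    ' The period must belong to the file name, not to a folder name,
    ' and must not be the very last character
    If dotPos > slashPos And dotPos < Len(FileName) Then
        FileExtension = Mid$(FileName, dotPos + 1)
    Else
        FileExtension = ""
    End If
End Function
```

    A test such as LCase$(FileExtension(FI.Path)) = "doc" could then sit next to the FI.Size and FI.ModifyDate checks.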

  8. #8
    Platinum Lounger
    Join Date
    Feb 2001
    Location
    Yilgarn region of Toronto, Ontario
    Posts
    5,453
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Re: algorithms - random sampling of files (Word/al

    Don, I think I've found a problem, and, strapped for time, I *don't* want you to give away the answer - I'll tackle it this evening, but I *would* appreciate if you can confirm that the .Remove is merely setting an entry to "", rather than deleting it completely.

    I am testing with a "one out of one" sample: <pre> 'see what we've got
    GetRandomFiles Folder, DicFolder, 1
    </pre>


    I wrote a dump-to-print debug routine to inspect the dictionary contents: <pre>Public Function DumpDictionary(DicFolder As Scripting.Dictionary)
    Debug.Print
    Dim j As Integer
    For j = 1 To DicFolder.Count
    Debug.Print j & " " & DicFolder(j)
    Next j
    End Function</pre>



    And I called it before the loop, and after each .Remove: <pre> If nFiles = 0 Then Exit Sub
    'repeat random selection
    ''' DEBUG: dump dictionary
    Call DumpDictionary(DicFolder)
    If nFiles > OneOutOfX Then
    nRepeat = nFiles \ OneOutOfX
    For i = 1 To nRepeat
    nRandom = Int((nFiles) * Rnd + 1)
    'add to dictionary
    DicResult(DicResult.Count + 1) = BuildPath(Path, DicFolder(nRandom))
    'remove this one, so we don't get it more than once
    DicFolder.Remove nRandom
    ''' DEBUG: dump dictionary
    Call DumpDictionary(DicFolder)
    'work with the remaining files
    nFiles = DicFolder.Count
    Next
    Else
    </pre>


    Here are two blocks of data, the first before entering the loop <pre>1 000005.bmp
    2 000006.bmp
    3 000007.bmp
    4 000008.bmp
    5 000009.bmp
    6 000010.bmp
    7 000001.bmp
    8 000002.bmp
    9 000003.bmp
    10 000004.bmp
    </pre>

    and the second after the first .Remove within the loop:<pre>1 000005.bmp
    2 000006.bmp
    3 000007.bmp
    4 000008.bmp
    5 000009.bmp
    6 000010.bmp
    7 000001.bmp
    8
    9 000003.bmp</pre>


    I was puzzled when I ran early tests on nested folders, seeing "paths" rather than files appear from time to time. I suspect the path was an "empty file string", prefaced by path in the BuildPath routine.


    Again, I'd like to tackle this as a learning exercise, but would be pleased to receive confirmation that you can duplicate the problem at your end. I'm using WordXP/SP3.

  9. #9
    Platinum Lounger
    Join Date
    Feb 2001
    Location
    Yilgarn region of Toronto, Ontario
    Posts
    5,453
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Re: algorithms - random sampling of files (Word/al

    I think I've got it, mainly through the help of Steve's post 446450, in which he points out the merits of the KEY.

    <pre> 'add to dictionary
    '''''' ' This fails because .Remove removes the key but not the entry.
    '''''' DicResult(DicResult.Count + 1) = BuildPath(Path, DicFolder(nRandom))
    '''''' ' This fails because .Remove removes the key but not the entry.
    ' There are only DicFolder.Count KEYS at any one time.
    ' 1 <= nRandom <= DicFolder.Count
    ' So we really want the item that is identified by the "nRandom"th key.
    Dim lng As Long
    lng = 0
    Dim varKey
    For Each varKey In DicFolder
    lng = lng + 1
    If lng = nRandom Then ' we have the item that is identified by the "nRandom"th key.
    Debug.Print lng & " " & varKey & " " & BuildPath(Path, DicFolder(varKey))
    DicResult(DicResult.Count + 1) = BuildPath(Path, DicFolder(varKey))
    Exit For
    Else ' Continue looking
    End If
    Next varKey
    'remove this one, so we don't get it more than once
    DicFolder.Remove varKey
    'work with the remaining files
    nFiles = DicFolder.Count</pre>



    I based this on Steve's reasoning: the set of keys maps only to the existing (non-Removed) items, whereas my integer pointers range over the whole set of items, removed and non-removed.

    I was initially enthralled with your solution to the selection-without-replacement problem, but also by the ability to "remove" and not perform the housekeeping (which for me usually involves packing the array and ReDimensioning it).

    It seems that regardless, I'm going to pay the overhead for that, and it may be slightly higher with the Shell method.

    Since I build a list of files only once, but run through the list several times, time in creation is not a big issue.

    As a bonus I have started to play with Dictionaries. Thanks again.

  10. #10
    5 Star Lounger st3333ve's Avatar
    Join Date
    May 2003
    Location
    Los Angeles, California, USA
    Posts
    705
    Thanks
    0
    Thanked 2 Times in 2 Posts

    Re: algorithms - random sampling of files (Word/al

    An end-of-the-day post, so apologies in advance if I'm missing something, but I think I may have figured out the following:

    You shouldn't have to move to that loop-through-the-dictionary varKey approach in your post 447413.

    I think the problem you were having in post 447076 resulted from what VBNutshell characterizes as a piece of "strange behavior" on the part of dictionaries: If you refer to a nonexistent key, the dictionary creates that key (with a blank item). That blank line 8 that turned up in your 2nd dump resulted (I think) from the fact that you had just removed the item with key 8, but then your DumpDictionary function referred to key 8, so a new item (with key 8) got created. Meanwhile, when the dump loop started, the dictionary only had 9 items, so it stopped after 9, but by the time the loop ended the dictionary had 10 items (because of the blank), so the last item (000004.bmp) didn't get dumped (even though it was there in the dictionary).
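
    The behaviour described above can be reproduced in a few lines. This is a minimal sketch, assuming ten dummy names keyed 1 to 10; the procedure name DemoImplicitKeyCreation and the late-bound CreateObject call are choices made for the illustration, not part of the code posted earlier in the thread.

```vba
' Demonstration only: reading a key that is not in a
' Scripting.Dictionary silently adds that key with an empty item.
Sub DemoImplicitKeyCreation()
    Dim dic As Object
    Dim k As Long
    Dim s As String

    Set dic = CreateObject("Scripting.Dictionary")
    For k = 1 To 10
        dic(k) = "file" & k & ".bmp"
    Next k

    k = 8
    dic.Remove k
    Debug.Print dic.Count       ' 9: the member is really gone

    s = dic(k)                  ' merely READING the missing key...
    Debug.Print dic.Count       ' 10: ...has created it again
    Debug.Print Len(s)          ' 0: the new item is empty
End Sub
```

    A dump loop of the form For j = 1 To dic.Count : Debug.Print dic(j) does the same thing by accident as soon as j hits a removed key.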

    You may want to try my dictionary-dumper:

    <pre>Sub ShowMeTheDictionary(dctTarget As Scripting.Dictionary)

    Debug.Print DictionaryInfoString(dctTarget)

    End Sub

    Function DictionaryInfoString(dctTarget As Scripting.Dictionary, _
    Optional strSep As String = " | ") As String

    Dim strInfo As String
    Dim lngEntry As Long

    For lngEntry = 0 To dctTarget.Count - 1
    strInfo = strInfo & lngEntry & vbTab & _
    dctTarget.Keys(lngEntry) & strSep & _
    dctTarget.Items(lngEntry) & vbCrLf
    Next lngEntry

    DictionaryInfoString = strInfo

    End Function</pre>

    Also note that, under different circumstances, you can avoid the unintended creation of blank items by using the .Exists method before making any key reference that might be to a nonexistent key.
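
    As a sketch of that .Exists guard, a lookup could be wrapped as follows. SafeItem is a hypothetical name, and the function assumes the items are strings (as the file names in this thread are).

```vba
' Hypothetical wrapper: read an item without creating a new,
' blank member when the key is missing.
Function SafeItem(dic As Object, ByVal lookupKey As Variant) As String
    If dic.Exists(lookupKey) Then
        SafeItem = dic(lookupKey)
    Else
        SafeItem = ""            ' the dictionary itself is left untouched
    End If
End Function
```

    Unlike a bare dic(lookupKey), calling SafeItem on a removed key leaves dic.Count unchanged.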

  11. #11
    Platinum Lounger
    Join Date
    Feb 2001
    Location
    Yilgarn region of Toronto, Ontario
    Posts
    5,453
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Re: algorithms - random sampling of files (Word/al

    Steve, thanks for the fast response.
    >a nonexistent key, the dictionary creates that key (with a blank item

    This behaviour suggests that for my purposes, the dictionary is behaving rather like a regular string array that I might prepare by using a DIR function.

    In the past I'd use DIR to load a string array with all file names in a folder, then set to "" those names that got used. Any function trying to obtain the "next random item", or obtaining "the next item in sequence", would extract the string, then set the array item to "", to indicate that it no longer existed (for purposes of selection). My code would then have to check for "" in looking for the next candidate, which corresponds to an Exists test.

    "Costing" of code then becomes a matter of offsetting the cost of ever more passes looking for a non-empty item (in the case of our random selection-without-replacement) against the cost of packing down the array (to remove empty items) after each use.
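
    For a plain string array there is a third option that avoids both costs. The sketch below is not the method described above; it swaps in a partial Fisher-Yates shuffle, where each chosen name is swapped to the front of the array so that no "" markers and no packing are needed. The function name and parameters are invented for the illustration, and note that it reorders the caller's array.

```vba
' Hypothetical alternative: pick howMany distinct names at random
' by a partial Fisher-Yates shuffle. No tombstones, no packing.
' WARNING: the names() array is reordered in place.
Function SampleWithoutReplacement(names() As String, ByVal howMany As Long) As String()
    Dim firstIdx As Long, lastIdx As Long
    Dim i As Long, j As Long
    Dim tmp As String
    Dim result() As String

    firstIdx = LBound(names)
    lastIdx = UBound(names)
    If howMany > lastIdx - firstIdx + 1 Then howMany = lastIdx - firstIdx + 1
    If howMany < 1 Then Exit Function     ' returns an unallocated array

    ReDim result(0 To howMany - 1)
    For i = 0 To howMany - 1
        ' choose j among the not-yet-used slots firstIdx + i .. lastIdx
        j = firstIdx + i + Int((lastIdx - firstIdx - i + 1) * Rnd)
        ' swap the chosen name into slot firstIdx + i
        tmp = names(firstIdx + i)
        names(firstIdx + i) = names(j)
        names(j) = tmp
        result(i) = names(firstIdx + i)
    Next i

    SampleWithoutReplacement = result
End Function
```

    Each pick costs one random number and one swap, regardless of how many names have already been used.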

    Thanks for the Dictionary-Dumper. No desktop should be without one!

  12. #12
    5 Star Lounger st3333ve's Avatar
    Join Date
    May 2003
    Location
    Los Angeles, California, USA
    Posts
    705
    Thanks
    0
    Thanked 2 Times in 2 Posts

    Re: algorithms - random sampling of files (Word/al

    Edited by HansV to provide links to posts - see Help 19

    Just to make sure there's no misunderstanding (and, again, assuming I correctly figured out the source of your blank line 8), I note that the .Remove method of dictionaries completely deletes the removed member (automatically resulting in a "packed" dictionary with one less member) rather than blanking the member's Item.

    You got the blank line 8 because of the code in your DumpDictionary function (which caused the dictionary to create a new member with key 8), rather than your main code. The reason my dictionary-dumper won't cause the same result is that it doesn't use keys. It loops through the dictionary's members from 0 to .Count - 1 by referring to the indexes (OK, OK, indices) of the members in the arrays returned by the .Keys and .Items methods, ensuring (I think) that it will never process a member that doesn't exist.

    So I believe you can go back to the main code you were using in your post 447076 (as long as you don't use the same DumpDictionary function for debugging) without having to worry about non-packed dictionaries or doing the extra looping in your post 447413.

  13. #13
    Platinum Lounger
    Join Date
    Feb 2001
    Location
    Yilgarn region of Toronto, Ontario
    Posts
    5,453
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Re: algorithms - random sampling of files (Word/al

    > Just to make sure there's no misunderstanding

    There was a misunderstanding, and it was on my part. I'll revisit your original code again tonight and resolve this.

    I much prefer to have inbuilt code deal with the management of sets of objects rather than have it visible in my code.

    Thanks for posting back.

  14. #14
    Platinum Lounger
    Join Date
    Feb 2001
    Location
    Yilgarn region of Toronto, Ontario
    Posts
    5,453
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Re: algorithms - random sampling of files (Word/al

    All right, I'm baffled.

    I've attached a BAS module with a test subroutine at the very foot. I place ten files in c:\temp and run the function and observe the resulting contents of the array. I'll see that strAr(10) is empty, but that's because I haven't ReDim'd it down one from the ultimate add. I can deal with that.

    I'll generally see three empty items in strAr, and I believe that even with an .Exists test, the dictionary items either (a) aren't really being removed or (b) they are being removed and then re-inserted.

    The only ways that I have found around this problem to date are to include a test for an empty string, and to sample continually from the original set (in my example 10 files) until my quota is filled.

    <pre> 'repeat random selection
    If nFiles > lngPopulation Then
        nRepeat = nFiles * (lngSample / lngPopulation)
        While i < nRepeat
            '''' For i = 1 To nRepeat
            nRandom = Int((nFiles) * Rnd + 1)
            'add to dictionary
            '''' DicResult(DicResult.Count + 1) = BuildPath(Path, DicFolder(nRandom))
            If DicFolder(nRandom) <> "" Then
                i = i + 1
                strAr(UBound(strAr)) = BuildPath(Path, DicFolder(nRandom))
                ReDim Preserve strAr(UBound(strAr) + 1)
                'remove this one, so we don't get it more than once
                DicFolder.Remove nRandom
            Else
            End If
            ''''' 'work with the remaining files
            ''''' nFiles = DicFolder.Count
            '''' Next
        Wend
    Else
        'get just one random file
        nRandom = Int((nFiles) * Rnd + 1)
        '''' DicResult(DicResult.Count + 1) = BuildPath(Path, DicFolder(nRandom))
        strAr(UBound(strAr)) = BuildPath(Path, DicFolder(nRandom))
        ReDim Preserve strAr(UBound(strAr) + 1)
    End If</pre>


  15. #15
    5 Star Lounger st3333ve's Avatar
    Join Date
    May 2003
    Location
    Los Angeles, California, USA
    Posts
    705
    Thanks
    0
    Thanked 2 Times in 2 Posts

    Re: algorithms - random sampling of files (Word/al

    If you remove a member from a dictionary, the dictionary automatically gets "packed", but the keys for the remaining members stay the same. To keep it simple, let's say you're using consecutive integers for the keys, starting with 0 -- so the key for each member will be the same (initially) as its ordinal position in the dictionary (the ordinal position being the index into the arrays returned by the .Keys and .Items methods, not to be confused with the .Key and .Item properties).

    Let's say you start by adding 10 members to dicFolder, and then you remove dicFolder(6), which is the 7th member. dicFolder(7) will now be the 7th member, but its key will not be changed to 6. It will still be dicFolder(7), and at that point there won't be a dicFolder(6).

    But dicFolder(7)'s ordinal position in the dictionary will have changed, so if you want to retrieve its key or item using the .Keys or .Items method, you'd refer to dicFolder.Items(6) rather than dicFolder.Items(7). In other words, at this point you have a mismatch between the key and the ordinal position for each member from dicFolder(7) on. dicFolder.Items(6) will return the Item for dicFolder(7), whereas originally (before the removal), dicFolder.Items(7) returned the Item for dicFolder(7).

    At this point if you try to refer to dicFolder(6), a new dicFolder(6) will be added to the dictionary, with 6 as its Key and a blank Item.
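
    The walk-through above can be checked with a short test. This is a sketch under the same assumptions as the text (ten members keyed 0 to 9, key 6 removed); the procedure and variable names are invented for the illustration, and the .Keys and .Items arrays are copied into Variants first so the code does not depend on early or late binding.

```vba
' Demonstration only: after a Remove, keys keep their values
' but ordinal positions shift down by one.
Sub DemoKeyVersusPosition()
    Dim dic As Object
    Dim k As Long
    Dim arrKeys As Variant, arrItems As Variant

    Set dic = CreateObject("Scripting.Dictionary")
    For k = 0 To 9
        dic(k) = "file" & k
    Next k

    k = 6
    dic.Remove k                    ' remove the 7th member (key 6)

    arrKeys = dic.Keys              ' packed arrays, indexed 0..Count-1
    arrItems = dic.Items

    Debug.Print dic.Count           ' 9
    Debug.Print arrKeys(6)          ' 7: position 6 now holds key 7
    Debug.Print arrItems(6)         ' file7
    Debug.Print dic.Exists(k)       ' False: there is no key 6 any more
End Sub
```

    Note that the check uses .Exists rather than dic(k), because reading dic(k) would itself re-create key 6 with a blank item.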

    I think the solution you're looking for (and I admit I didn't look at your attachment) will involve using the .Items method to retrieve the filenames from dicFolder, rather than the keys. The problem with using the keys is that, as you remove members, you get a discontinuous sequence of keys. It's not that some of the keys are still there but with blank items, but that's what you end up with when the line with BuildPath refers to dicFolder(nRandom) if nRandom was one of the previously-removed keys -- because at that point (not earlier) a new dicFolder(nRandom) will be created (simply by reason of the reference to it) with a blank item.

    By contrast, if you refer to dicFolder.Items(nRandom), you can be sure you're picking from a continuous (no gaps) number sequence from 0 to dicFolder.Count - 1.

    Remember that keys can be any kind of variable. So, for example, a dictionary member's key might be "color" and its item might be "blue". It certainly wouldn't work in this case if the member's key got changed (from "color" to something else) if the member ahead of it was removed. The fact that you're not really storing any substantive data in the keys of dicFolder (just a numbering sequence) doesn't change this behavior.
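
    Putting the suggestion into code, the selection loop could pick by ordinal position instead of by key. This is a sketch rather than a drop-in replacement for GetRandomFiles: the procedure name PickRandomMembers and its parameters are hypothetical, and it assumes dicResult is keyed 1, 2, 3... as in the original code.

```vba
' Hypothetical selection-without-replacement by POSITION.
' Moves howMany randomly chosen items from dicFolder to dicResult.
Sub PickRandomMembers(dicFolder As Object, ByVal howMany As Long, dicResult As Object)
    Dim arrKeys As Variant
    Dim pos As Long, i As Long

    If howMany > dicFolder.Count Then howMany = dicFolder.Count

    For i = 1 To howMany
        ' A fresh copy of the keys: gap-free, indexed 0..Count-1
        arrKeys = dicFolder.Keys
        ' Random position in that array, 0 <= pos <= Count - 1
        pos = Int(dicFolder.Count * Rnd)
        ' arrKeys(pos) is always an existing key, so no blank
        ' member can be created by this read
        dicResult(dicResult.Count + 1) = dicFolder(arrKeys(pos))
        ' Remove by the real key so it cannot be picked again
        dicFolder.Remove arrKeys(pos)
    Next i
End Sub
```

    Re-reading .Keys on every pass costs a copy of the key array each time, which seems acceptable when the file list is built only once.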
