  1. #1
    New Lounger
    Join Date
    Aug 2001
    Posts
    11
    Thanks
    0
    Thanked 0 Times in 0 Posts

    huge duplicate problem (2000 sr1)

    Hi!
    I have a spreadsheet with 21 columns of data. I need to discover a pattern of duplication within 37000 records. I've exported this data in chunks from Access (too big to do all at once) using a duplicate query with 2 of the columns used as criteria.

    Now I need to know how many of each duplicate I have in my Excel spreadsheet. For a record to be unique, each of the 21 columns must check out.

    Other solutions I've seen involve sorting (I don't think I can sort on 21 columns at a time) and pivot tables, which can't handle all the columns I'm trying to analyze at once.
    I'm thinking maybe a programming solution is the way to go, or maybe a formula, but I'm not sure where to begin with either.

    Many thanks in advance for any help you can give me!

  2. #2
    Bronze Lounger
    Join Date
    Jun 2001
    Location
    New York, New York, Lebanon
    Posts
    1,449
    Thanks
    1
    Thanked 1 Time in 1 Post

    Re: huge duplicate problem (2000 sr1)

    elisap

    I would STRONGLY suggest you use SQL for this kind of job. Check what Access can do for you.

    If you don't have a strong SQL helper (like I have in Satish), do the following:

    1) Use the Filter command on all 21 columns. Filter each column and, where there is more than one matching record, copy only the first one to another worksheet. You will have to work through all 21 columns to get down to the unique records.

    2) Sort by the first 3 columns and add a column numbered 1, 2, 3, 4, ... 37000.
    Sort by the next 3 columns and add another column numbered 1, 2, 3, 4, ... 37000.
    Sort by the next 3 columns and add another column numbered 1, 2, 3, 4, ... 37000,
    and so on. When you have done this 7 times over, add up the numbers in the new columns: a total of 7 would be your first record and a total of 259000 (7 x 37000) your last.

    I really hope you use SQL.
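
    For instance, a grouping query along these lines would return each duplicated combination of values together with how many times it occurs. The table and field names here are placeholders only, and you would list all 21 fields in both the SELECT list and the GROUP BY clause:

    <pre>SELECT Field1, Field2, Field3, COUNT(*) AS HowMany
    FROM MyTable
    GROUP BY Field1, Field2, Field3
    HAVING COUNT(*) > 1;
    </pre>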

    Wassim

  3. #3
    Uranium Lounger
    Join Date
    Jan 2001
    Location
    South Carolina, USA
    Posts
    7,295
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Re: huge duplicate problem (2000 sr1)

    The code below might give you a start on what you want to do. It will look down the rows in Sheet1 for duplicates. It will put the number of duplicates (not counting the first row of the set of duplicates) in column V of the first row of each set of duplicates (zero if there are no duplicates). For all of the duplicate rows, it will put the row number of the first row of the set of duplicates into column W.

    <pre>Public Sub FindDups()
        Dim I As Long, J As Long, K As Long, lMax As Long, lCount As Long
        With Worksheets("Sheet1")
            ' Find the last used row in column A
            lMax = .Range("A65536").End(xlUp).Row
            ' Clear the two output columns (V and W)
            For I = 0 To lMax - 1
                .Range("A1").Offset(I, 21).Value = ""
                .Range("A1").Offset(I, 22).Value = ""
            Next I
            For I = 0 To lMax - 1
                lCount = 0
                ' Skip rows already marked as duplicates of an earlier row
                If .Range("A1").Offset(I, 22) = "" Then
                    For J = I + 1 To lMax - 1
                        If .Range("A1").Offset(J, 22) = "" Then
                            ' Compare all 21 columns of row I and row J
                            For K = 0 To 20
                                If .Range("A1").Offset(I, K) <> .Range("A1").Offset(J, K) Then
                                    Exit For
                                End If
                            Next K
                            ' K > 20 means every column matched
                            If K > 20 Then
                                lCount = lCount + 1
                                ' Column W: row number of the first row of this duplicate set
                                .Range("A1").Offset(J, 22).Value = I + 1
                            End If
                        End If
                    Next J
                    ' Column V: number of duplicates found for this row
                    .Range("A1").Offset(I, 21).Value = lCount
                End If
            Next I
        End With
    End Sub
    </pre>

    Legare Coleman

  4. #4
    5 Star Lounger
    Join Date
    Feb 2001
    Location
    Dallas plus 20 miles or so, Texas, USA
    Posts
    876
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Re: huge duplicate problem (2000 sr1)

    Well, this may not be what you are looking for, but I gathered from your post that each of the 21 columns constitutes part of a record (row). If you are checking rows to see which ones are unique, you might try something similar to the following:

    <pre>=CONCATENATE(A1,B1,C1,D1,E1,F1,G1,H1)
    </pre>


    If you put that in a free column, say in J1, it will join the contents of A1 through H1 into one text string. If I am thinking correctly, duplicate records would have to produce exactly the same string of characters. You might then sort on this one column instead of all 21.
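
    If the 21 columns actually run A through U, the same idea can be stretched to cover the whole record; CONCATENATE accepts up to 30 arguments (or you can chain the & operator). The formula would then need to sit in a spare column to the right of the data rather than in J1. A sketch only; adjust the references to where your data really is:

    <pre>=CONCATENATE(A1,B1,C1,D1,E1,F1,G1,H1,I1,J1,K1,L1,M1,N1,O1,P1,Q1,R1,S1,T1,U1)
    </pre>

    If false matches are a worry (for example "AB"&"C" and "A"&"BC" give the same string), put a separator character such as "|" between each field.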

    HTH,

  5. #5
    5 Star Lounger
    Join Date
    Feb 2001
    Location
    Youngstown, Ohio, USA
    Posts
    705
    Thanks
    1
    Thanked 0 Times in 0 Posts

    Re: huge duplicate problem (2000 sr1)

    That's part of the solution that I was thinking of, but you may want to take it another step. Sort on the concatenated field (think of it as a huge index field) and then do an IF comparison:
    <pre>=IF(J2=J1,"duplicate","unique")</pre>

    Use the autofilter to select all records that are flagged as "unique". Copy this selection to a new workbook.
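
    If you also need to know how many of each duplicate you have (the original question), a COUNTIF on the concatenated column will give a per-row count without any sorting. A sketch, assuming the concatenated key sits in column J and the data runs from row 1 to row 37000:

    <pre>=COUNTIF($J$1:$J$37000,J1)
    </pre>

    Fill it down the sheet; a result of 1 means the row is unique, and anything higher is the number of rows sharing that exact combination.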

  6. #6
    2 Star Lounger
    Join Date
    Jan 2001
    Location
    Perth, Western Australia, Australia
    Posts
    190
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Re: huge duplicate problem (2000 sr1)

    It could be done with an array formula like
    =SUM((A1=(range of column A))*(B1=(range of column B))*(C1=(range of column C))* etc. for the whole 21 columns).

    Don't forget to use Ctrl+Shift+Enter when you enter the formula, otherwise Excel will not recognise it as an array formula.

    The formula will return the number of times that the row appears in the data, so 1 means it is unique. I have assumed that the data commences in cell A1; you will have to adjust it to suit.

    Enter the formula in the top row. Make sure that the ranges are absolute references and the cell references are relative. Then just double-click the bottom right corner of the cell for the formula to autofill down the sheet.
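
    For example, with 21 columns of data in A1:U37000, the first-row formula might look like the line below, entered in a spare column to the right of the data and confirmed with Ctrl+Shift+Enter. A sketch only; adjust the ranges to your actual extent:

    <pre>=SUM(($A$1:$A$37000=A1)*($B$1:$B$37000=B1)*($C$1:$C$37000=C1)*($D$1:$D$37000=D1)*($E$1:$E$37000=E1)*($F$1:$F$37000=F1)*($G$1:$G$37000=G1)*($H$1:$H$37000=H1)*($I$1:$I$37000=I1)*($J$1:$J$37000=J1)*($K$1:$K$37000=K1)*($L$1:$L$37000=L1)*($M$1:$M$37000=M1)*($N$1:$N$37000=N1)*($O$1:$O$37000=O1)*($P$1:$P$37000=P1)*($Q$1:$Q$37000=Q1)*($R$1:$R$37000=R1)*($S$1:$S$37000=S1)*($T$1:$T$37000=T1)*($U$1:$U$37000=U1))
    </pre>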

    It may take a minute or two to calculate, so be patient.

  7. #7
    New Lounger
    Join Date
    Aug 2001
    Posts
    11
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Re: huge duplicate problem (2000 sr1)

    Thank you so much for your help. I think I'm going to try the concatenation, sort, comparison approach first since I understand it. The macro _almost_ makes sense, as does the array formula. I'll look into them further if concatenation doesn't do the trick.

    Really appreciate it!
