# Thread: huge duplicate problem (2000 sr1)

1. ## huge duplicate problem (2000 sr1)

Hi!
I have a spreadsheet with 21 columns of data. I need to discover a pattern of duplication within 37000 records. I've exported this data in chunks from Access (too big to do all at once) using a duplicate query with 2 of the columns used as criteria.

Now I need to know how many of each duplicate I have in my Excel spreadsheet. For a record to be unique, each of the 21 columns must check out.

Other solutions I've seen involve sorting (I don't think I can sort on 21 columns at a time) and pivot tables, which can't handle all the columns I'm trying to analyze at once.
I'm thinking maybe a programming solution is the way to go, or maybe a formula, but I'm not sure where to begin with either.

<img src=/S/exclamation.gif border=0 alt=exclamation width=15 height=15>

2. ## Re: huge duplicate problem (2000 sr1)

elisap

I would <font color=red> STRONGLY </font color=red> suggest you use SQL for this kind of a deal. Check what Access can do for you.

If by all means you don't have a strong SQL helper, Like I have in Satish <img src=/S/bravo.gif border=0 alt=bravo width=16 height=30>, do the following:

1) Use the Filter command on all 21 Columns. Filter each column and copy the first record, if there are more than one, of the dataset to another worksheet. The you will have to do all 21 columns to get to the unique records.

2) Sort by the first 3 columns. Make a column with numbers 1,2,3,4,...37000
Sort by the next 3 columns, Make another column with numbers 1,2,3,4,...37000
Sort by the next 3 columns, Make another column with numbers 1,2,3,4,...37000
and so on. When you are done 7 times over add the numbers in the new columns, and number 7 should be your first record and number 777000 will be your last record.

I really hope you use SQL.

Wassim

3. ## Re: huge duplicate problem (2000 sr1)

The code below might give you a start on what you want to do. It will look down the rows in Sheet1 for duplicates. It will put the number of duplicates (not counting the first row of the set of duplicates) in column V of the first row of each set of duplicates (zero if there are no duplictes). For all of the duplicate rows, it will put the row number of the first row of the set of duplicates into column W.

<pre>Public Sub FindDups()
Dim I As Long, J As Long, K As Long, lMax As Long, lCount As Long
With Worksheets("Sheet1")
lMax = .Range("A65536").End(xlUp).Row
For I = 0 To lMax - 1
.Range("A1").Offset(I, 21).Value = ""
.Range("A1").Offset(I, 22).Value = ""
Next I
For I = 0 To lMax - 1
lCount = 0
If .Range("A1").Offset(I, 22) = "" Then
For J = I + 1 To lMax - 1
If .Range("A1").Offset(J, 22) = "" Then
For K = 0 To 20
If .Range("A1").Offset(I, K) <> .Range("A1").Offset(J, K) Then
Exit For
End If
Next K
If K > 20 Then
lCount = lCount + 1
.Range("A1").Offset(J, 22).Value = I + 1
End If
End If
Next J
.Range("A1").Offset(I, 21).Value = lCount
End If
Next I
End With
End Sub
</pre>

4. ## Re: huge duplicate problem (2000 sr1)

Well, this may not be what you are looking for, but I gathered from your post that each of the 21 columns constitute part of a record (row). If you are checking out rows to see which row is unique you might try something similar to the following:

<pre>=CONCATENATE(A1,B1,C1,D1,E1,F1,G1,H1)
</pre>

If you put that in a new workbook, say in J1, it will add the contents of A1 to H1 as text. If I am thinking correctly, a unique record would have to have the same characters in the same order. You might then sort on this column instead of all 21.

HTH,

5. ## Re: huge duplicate problem (2000 sr1)

That's part of the solution that I was thinking of, but you may want to take it another step. Sort on the concatenated field (think of it as a huge index field <img src=/S/smile.gif border=0 alt=smile width=15 height=15>) and then do an IF comparison:
<pre>If(j2=j1,"duplicate","unique")</pre>

Use the autofilter to select all records that are flagged as a "unique". Copy this selection to a new workbook.

6. ## Re: huge duplicate problem (2000 sr1)

It could be done with an array formula like
=sum((A1=(range of column a))*(B1=(range of column [img]/forums/images/smilies/cool.gif[/img])*(C1=(range of column C))* etc for the whole 31 columns ).

Dont forget to use control-shift-enter when you enter the formula as Excel will not recognise it as an array formula.

The formula will return the number of time that the row is duplicated. I have assumed that the data commences is in Cell A1, you will have to adjust it to suit.

Enter the formula in the top row. Make sure that the the ranges are absolute references and the cell reference is relative. Then just double-click the bottom right corner of the cell for the formula to autofill down the sheet.

it may take a minute or two to calculate so be patient.

7. ## Re: huge duplicate problem (2000 sr1)

Thank you so much for your help. I think I'm going to try the concatenation, sort, comparison approach first since I understand it. The macro _almost_ makes sense, as does the array formula. I'll look into them further if concatenation doesn't do the trick.

Really appreciate it!

#### Posting Permissions

• You may not post new threads
• You may not post replies
• You may not post attachments
• You may not edit your posts
•