Results 1 to 7 of 7
Thread: number of duplicates

20130612, 18:51 #1
 Join Date
 Jan 2001
 Location
 West Long Branch, New Jersey, USA
 Posts
 1,931
 Thanks
 6
 Thanked 9 Times in 7 Posts
number of duplicates
Hi All,
Through lots of convoluted work, I have cobbled together the following array formula to count the number of duplicate entries in a column. That is I want to count the number of entries that appear more than once; I don't care if the entry appears two times, three times, or whatever. If an entry appears more than once, that entry counts as 1 more for the count. For example, if one entry appears 2 times and another 3 times, I want 2, not 5. If an entry appears only once, it does not add to the count.
Code:{=IF(MAX(COUNTIF(D4:D78,D4:D78))>1,"# Class Names Entered More Than Once=" & ( SUM(1/COUNTIF(D4:OFFSET(D4,0,0,COUNTA(D4:D78),1),D4:OFFSET(D4,0,0,SUM(IF(LEN(D4:D78)>0,1,0)),1)))  SUM(IF(COUNTIF(D4:OFFSET(D4,0,0,COUNTA(D4:D78),1),D4:OFFSET(D4,0,0,SUM(IF(LEN(D4:D78)>0,1,0)),1))=1,1,0))), "No Class Name Entered More Than Once")}
A few things:
 The range D4 to D78 has a bunch of class names, which may be duplicated and this is what's being counted. The class name is a result of looking at col B to see if there's an entry: if so, the name is determined by using the RIGHT most characters of col B (not important how that's done but it works fine); if there's no entry, the corresponding row of col D has "". This is to avoid showing a 0 if the test on col B gives a FALSE. This could be part of the reason for the complexity of the formula.
 the outermost test (MAX(...)>1) tests to see if there are ANY duplicates. If there are, the cell contains # Class Names Entered...= a number (the computation of which is what I'm trying to simplify per below). If not, the cell has No Class Names Entered...
The red part of the formula computes the number of unique values in col D  regardless if the value appears once or more than once. The problem I had is that the "" for the Class Name gets counted as a value if I just use COUNTA. So I resorted to using the SUM(IF(LEN(..)>0,1,0)) in the formula to return the number of entries in col D that are not null (not "").
The blue part of the formula computes the number of values in col D that occurs only once. By subtracting this from the red part, I get the number of entries that appear more than once. Again, this uses the SUM(IF(LEN(...)>0,1,0)) to avoid the null entries.
But there's got to be an easier way.
TIA
Fred

20130612, 19:55 #2
 Join Date
 Jan 2001
 Location
 La Jolla, CA
 Posts
 1,495
 Thanks
 33
 Thanked 63 Times in 59 Posts
The first thought that comes to my mind is to fill down the following in, say, column E.
=IF(COUNTIF($D$4 : D4,D4)=1,COUNTIF(D : D,D4),"")
Then, =COUNTIF(E:E,">1")
But, I suspect you want this all in one fell swoop.
Or, further thought, how about this?
=SUM(IF(FREQUENCY(D4 : D78,D4 : D78)>1,1))Last edited by kweaver; 20130612 at 20:02.

20130613, 11:22 #3
 Join Date
 Jan 2001
 Location
 West Long Branch, New Jersey, USA
 Posts
 1,931
 Thanks
 6
 Thanked 9 Times in 7 Posts
Thanks much. I like both solutions even though they both need a little work for my purposes. So let me ask a few questions and make some comments.
1. Countif solution. Very neat but, yes, I'd like it all in one fell swoop so I can cancatenate the number as I showed in my original posting (or just put the result, with the intermediate calculations that you have in col E, in one cell). Certainly something I can play with although perhaps not obvious after quickly trying a few things. However, the 2nd COUNTIF has a small problem: when examining col D in the first cell that does NOT have any entry (yet or ever), the formula gives a count for how many cells in col D have nothing in them; that's all the rows in the sheet except for the first 16 or so that actually do have something in them (so I got a count of something like 65522). This is easily avoidable in several ways: put in a first IF to see if there's anything in the cell (eg, IF(LEN(cell)>0, then your formula); using a more explicitly defined range instead of all col D (eg, D4:end of range calculated dynamically), or even just subtracting 1 to account for that one extra cell in col E that has the 65522.
So maybe this is something that can be played around with  maybe some array formula to give the answer in 1 fell swoop.
2. Your further thought on FREQUENCY also can work. Why didn't I think of it? Because I've use the FREQUENCY function maybe once in my life. However, my col D has class names in it, so when I tried it (either as a regular formula or as an array formula), I got 0 as the result. It doesn't seem that FREQUENCY works on text values (although the HELP description seems to suggest it could). But sitting in MY col E is a unique Class Number for each Class Name. So by changing your FREQUENCY to look at col E instead, I do get the desired result (and in 1 fell swoop).
So a little more work is needed. Either way, a much better formula than my convoluted original.
Fred

20130613, 11:48 #4
 Join Date
 Jan 2001
 Location
 La Jolla, CA
 Posts
 1,495
 Thanks
 33
 Thanked 63 Times in 59 Posts
Unfortunately, FREQUENCY ignores blank cells and text.
Your other option, #2, with the E column might do the trick for you, I hope.

20130613, 11:50 #5
 Join Date
 Jan 2001
 Location
 West Long Branch, New Jersey, USA
 Posts
 1,931
 Thanks
 6
 Thanked 9 Times in 7 Posts
already tried and seems to be working like a charm. I'll have to keep FREQUENCY in mind. Your first option wasn't bad but 1 fell swoop wins the day.
Fred

20130613, 15:39 #6
 Join Date
 Jan 2001
 Location
 West Long Branch, New Jersey, USA
 Posts
 1,931
 Thanks
 6
 Thanked 9 Times in 7 Posts
A question on how FREQUENCY works the way you used it  out of curiosity:
 the first arg tells where FREQUENCY should look for data while the second arg tells it what the buckets are that should be "filled" depending on the data that's encountered. By defining the buckets the same as the data, aren't you defining a bucket that may be the same as a previouslydefined bucket? For example, if D4=10 (hypothetically), D8=10, and D23=10, then aren't there 3 buckets that 10 could go into? But when the function encounters a 10 in the data, it increments the first bucket used for 10, hence any subsequent bucket for 10 remains "unfilled" (=0). I tried a small experiment and it seems to work this way unless I'm missing something.
In effect, this looks just like your first suggestion with the COUNTIF with YOUR col E serving as the buckets and only the first bucket for a given entry containing a nonzero value (the count of how many of the item there are) while any identical buckets contain 0.
Nevertheless, I liked the COUNTIF approach but have decided to implement the FREQUENCY approach.

20130613, 17:48 #7
 Join Date
 Jan 2001
 Location
 La Jolla, CA
 Posts
 1,495
 Thanks
 33
 Thanked 63 Times in 59 Posts
Suppose you have the data per my first clip.
Then, the FREQ portion of =SUM(IF(FREQUENCY(J4:J18,J4:J18)>1,1)) becomes
=SUM(IF({2;3;0;1;1;1;0;1;1;0;3;0;0;1;0}>1,1))
(my second clip)
Each cell of the first range is compared "slotted" with the bins in the second (increments).

The Following User Says Thank You to kweaver For This Useful Post:
fburg (20130613)