Thread: Record Quality (XP)

1. Record Quality (XP)

Hi All

First, may I apologise: this is not strictly an Excel question; it is just that I am doing the exercise in Excel and I know a lot of statisticians lurk around this part of the Lounge.

I have been doing a massive data migration exercise from a database that is way over 20 years old and have been reconciling two sets of data. The final output was to an Excel workbook.

When doing quality checks of these records against the old data, what is a reasonable percentage of records to check in order to say that the data is "correct" within tolerance? I have some 40,000 to 45,000 records in total.

Jerry

2. Re: Record Quality (XP)

What percentage is acceptable to "miss"? 1 in 10, 1 in 100, 1 in 1000, 1 in 50,000, 1 in what?
Approximately how many inaccuracies do you expect? 1 in 10, 1 in 100, 1 in 1000, 1 in 50,000, 1 in what?

If you want to be able to find 1 error in 50,000 and you actually have 50% inaccurate, you won't need to sample many to show that you are not within tolerance. Also, if you only have about 1 error in 50,000 and it is acceptable to miss 1 in 10, you won't have to sample a lot to show that you are within tolerance.

BUT, if you have 1 in 50,000 and want to be able to find it, you will have to check them all.
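To put rough numbers on this, here is a quick sketch in Python (not from the original post; the error rates are illustrative only). It computes the smallest sample that gives, say, a 95% chance of catching at least one error at a given true error rate:

```python
import math

def detect_sample_size(p: float, confidence: float = 0.95) -> int:
    """Smallest sample size giving `confidence` chance of seeing at least
    one error when the true error rate is p.
    Solves 1 - (1 - p)**n >= confidence for n (sampling with replacement)."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - p))

# Hypothetical rates echoing the examples above:
for p in (0.5, 0.01, 1 / 50_000):
    print(f"true error rate {p:.5%}: sample at least {detect_sample_size(p)}")
```

Note that for a true rate of 1 in 50,000 the required sample comes out larger than the whole 50,000-record data set, which is exactly Steve's point: at that rate you effectively have to check them all.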

Steve

3. Re: Record Quality (XP)

Hi Steve

You have given me food for thought, thanks.

Due to the age of the data and the horrific data entry faults that I have encountered and hopefully corrected, I am unsure what the outputs will be like. I am entering a very dark room here and could not say how many are correct.

Would you suggest that I do it in increments: say, start with a 10% check, see how many are wrong, and then adjust depending on the results?

Jerry

4. Re: Record Quality (XP)

Ideally, you would decide in advance:
1. How many errors in the data are acceptable (as a percentage of the total)?
2. What probability of passing the data as acceptable (based on the sample) when it isn't will you tolerate?
For example: say that 5 % errors in the data is acceptable. If you take a sample of 100 records and you detect 4 errors, that is within the tolerance you specified. But that could be a fluke of the sample: there is a chance that the total data set has over 5 % errors even if your sample has only 4 %. It is possible to calculate the probability of this happening. The larger the sample, the smaller the probability of this kind of "wrong" outcome. How confident do you want to be?
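That "fluke" probability can be computed directly from the binomial distribution. A minimal Python sketch (not from the original post), using Hans's numbers of 4 errors in a sample of 100 with a 5 % tolerance:

```python
from math import comb

def prob_at_most(k: int, n: int, p: float) -> float:
    """Binomial CDF: probability of seeing at most k errors in a sample
    of n records when the true error rate is p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# Even if the whole data set really has 5 % errors, a sample of 100 still
# shows 4 or fewer errors a substantial fraction of the time (roughly 44 %),
# so passing on this sample could easily be a fluke.
print(prob_at_most(4, 100, 0.05))

# A tenfold larger sample at the same 4 % observed rate is far less
# likely to be a fluke:
print(prob_at_most(40, 1000, 0.05))
```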

5. Re: Record Quality (XP)

Hi Hans

It seems to be making sense.

I work in a company that manages about 39,000 properties. The basic precis of the migration exercise was to match up the unique property reference from one table with a set of tables that did not have this reference number, matching on the addresses.

The resulting merged table will have to be uploaded to a new system that will use the unique property reference as the key to all correspondence sent between us and the tenant. This correspondence has been scanned via a DIPS programme. As some of these documents are legal I would prefer a 0% error rate, but I know this would be impractical, so I will go for a 2% error margin and then get someone to manually check the errors. Do you think this margin of error is a reasonable expectation?

Jerry

6. Re: Record Quality (XP)

I can't really judge that. I know that scanning companies claim accuracy in that range, but it's up to you to decide. Let's say that 2 % errors are acceptable. The next step is: if you take a sample, you can't be sure that you make the right decision. There are two kinds of wrong decisions:
1) You decide that the lot is OK (less than 2 % errors) while it isn't.
2) You decide that the lot is not OK (over 2 % errors) while it is.
Usually, a wrong decision of the first kind is more serious than a wrong decision of the second kind. So you must decide how high the probability of taking a wrong decision of the first kind can be. If you decide that the lot is OK, you want to be, say, 95 % confident, i.e. the probability that your decision is wrong can be at most 5 %.
If you take a sample of 1, your confidence level is very low - a sample of one is not very conclusive. If you investigate ALL records, your confidence level is 100 %, since you KNOW how many errors there are. Somewhere in between is the minimum sample size that will allow you to decide with the required confidence level. Again, I can't tell you what that confidence level should be.
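Hans's "somewhere in between" can be made concrete with a small calculation (my sketch, not part of the original post; the 2 % tolerance and 95 % confidence are the figures assumed in this thread). For the simplest plan, where the lot is accepted only if the sample contains zero errors, the minimum sample size is the smallest n with (1 - 0.02)^n ≤ 0.05:

```python
import math

# Zero-error acceptance plan: accept the lot only if a sample of n records
# shows no errors. If the true error rate were 2 % or worse, an all-clean
# sample of size n has probability at most (1 - 0.02)**n, so requiring that
# to be <= 5 % gives 95 % confidence in an "accept" decision.
tolerance = 0.02   # maximum acceptable error rate
risk = 0.05        # acceptable chance of wrongly passing a bad lot

n = math.ceil(math.log(risk) / math.log(1 - tolerance))
print(n)  # minimum sample size for this plan
```

So under these assumptions a clean sample of about 150 records is the smallest that supports an "accept" decision at 95 % confidence; tolerating some errors in the sample, or demanding higher confidence, pushes the sample size up.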

7. Re: Record Quality (XP)

Thanks Hans

Have a good weekend.

Jerry

8. Re: Record Quality (XP)

You can create an "operating characteristic curve" to look at various sampling plans and see the probability of acceptance at various actual error rates, to judge whether the plan is good enough.

Hans' comments are a single point from an OC Curve.

Here is an example; the errors were modelled using a Poisson distribution.

I am not sure what you mean by a 2% "error margin"; 2% of 40,000 is 800 errors.

If that is an acceptable number of errors, then if you find no errors in a random sample of 500 you can be over 99% confident that you will have <2% (800) errors overall, since

=POISSON(0, 2%*500, TRUE)

=0.0045%
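As a cross-check of Steve's Excel formula (my addition, not part of the original post): the expected error count is 2% × 500 = 10, and the Poisson probability of zero errors is e^(-10).

```python
import math

# Equivalent of Excel's =POISSON(0, 2%*500, TRUE): probability of zero
# errors in a sample of 500 if the true error rate is 2 % (lambda = 10).
lam = 0.02 * 500
p_zero = math.exp(-lam)       # Poisson P(X = 0) = e**(-lambda)
print(f"{p_zero:.4%}")        # about 0.0045 %

# The exact binomial equivalent is very close at these numbers:
exact = (1 - 0.02) ** 500
print(f"{exact:.4%}")
```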

Post back if you have further questions on this
Steve
