# Thread: Statistical flaws in Excel (Excel 97 / 2000 / XP)

1. ## Re: Statistical flaws in Excel (Excel 97 / 2000 / XP)

Thanks very much - it is (amazingly) a topic which has been on my mind lately. Just don't expect me to read it quickly <img src=/S/grin.gif border=0 alt=grin width=15 height=15>

2. ## Re: Statistical flaws in Excel (Excel 97 / 2000 / XP)

Thanks for sharing this. It's useful to have an overview.

3. ## Re: Statistical flaws in Excel (Excel 97 / 2000 / XP)

Hi Hans,

Excellent article!

Would it be OK if I posted this article in the Microsoft Excel MVP newsgroup?

A message I sometimes refer to with regards to Excel's poor Stat qualities:

==========
From: Jerry W. Lewis (JWLewis53@mediaone.net)
Subject: Re: LINEST with r2 = -1.18 ???
Newsgroups: microsoft.public.excel.worksheet.functions
View: Complete Thread (35 articles) | Original Format
Date: 2001-09-26 06:40:28 PST

LINEST() (also SLOPE(), INTERCEPT(), VAR(), STDEV(), LOGEST(), TREND(),
FORECAST(), etc.) uses a numerically unstable algorithm. With
challenging data (such as yours), rounding error has accumulated to the
point that none of its calculations (slope, intercept, etc.) can be
believed. In your case, you were lucky enough to get an impossible R^2,
so that it was obvious that there was a problem. There may still be a
problem even with data that give more reasonable R^2 values. These
problems with Excel's algorithms have been well documented for years
(cf. Sawitzki, 1994, "Report on the reliability of data analysis
systems" Comput. Statist. Data Anal. 18:289-301) yet Microsoft continues
to ignore them.

Harlan Grove's matrix formulation simply recreates the same problem.

DEVSQ(), COVAR(), and CORREL() are the only 2nd moment functions in
Excel that are numerically reliable. For simple linear regression, use
the following formulas instead of LINEST(), SLOPE(), INTERCEPT(), RSQ(),
STEYX(), etc.

slope = COVAR(y,x)/DEVSQ(x)*COUNT(y)
intercept = AVERAGE(y) - slope*AVERAGE(x)
rsq = CORREL(y,x)^2
SSreg = rsq*DEVSQ(y)
SSresid = (1-rsq)*DEVSQ(y)
df = COUNT(y)-2
F = SSreg/SSresid*df
steyx = SQRT(SSresid/df)
se1 = steyx/SQRT(DEVSQ(x))
seb = steyx/SQRT(1/COUNT(y)+AVERAGE(x)^2/DEVSQ(x))

This approach has the added advantage over LINEST that it allows missing
values in the data range. However that cuts both ways, because they
will give a wrong answer if there are data pairs where only x or y (but
not both) are missing.

Similarly, for univariate statistics use the following formulas instead
of VAR(), VARP(), STDEV(), and STDEVP()

var = DEVSQ(x)/(COUNT(x)-1)
varp = DEVSQ(x)/COUNT(x)
stdev = SQRT(var)
stdevp = SQRT(varp)

Since Microsoft has already programmed routines that would be superior
to their unstable routines, it is puzzling why they continue to maintin
redundant inferior code. The unstable formulas that Excel programed are
mathematically exact (with infinite precision), so my formulas will
agree with the Excel functions for non-challenging data sets. When they
disagree, the dedicated Excel functions are wrong.

There is no DEVSQA function, there is no hel for VARA(), VARPA(),
STDEVA(), or STDEVPA() other than doing those calculations manually.

If you are wedded to using LINEST(), then test to see if

STDEV(x) = SQRT(DEVSQ(x)/COUNT(x))
STDEV(y) = SQRT(DEVSQ(y)/COUNT(y))
PEARSON(y,x) = CORREL(y,x)

If all three of these are approximately true (say to at least 12 figures
each), then LINEST() can probably be believed for simple linear
regression. Figuring out when LINEST() can be believed for more complex
models is not so simple.

Jerry

Richard Nolan wrote:

> Having used LINEST for Linear regression, I think
> successfully a few times, I now have a data set that
> returns an r2 value of -1.18, which is not possible. I can
> look at the data and tell r2 must be +, not negative.
>
> Are there two logic problems with LINEST. (a) r2 can never
> greater than +/- 1, and ([img]/forums/images/smilies/cool.gif[/img] I can see the relationship is
> +, not -.

5. ## Re: Statistical flaws in Excel (Excel 97 / 2000 / XP)

OK, on it's way then.

6. ## Re: Statistical flaws in Excel (2007)

I'm wondering if Excel 2007 has addressed any of these issues.

7. ## Re: Statistical flaws in Excel (2007)

Nope, but some of the issues were addressed in Excel 2003 - see Description of improvements in the statistical functions in Excel 2003 and in Excel 2004 for Mac and also Statistics:Numerical Methods/Numerics in Excel As far as I know, no significant changes were made in this area in Excel 2007.
For an exhaustive review of the weaknesses of statistics in all versions of Excel, including Excel 2007, see Errors, Faults and Fixes for Excel Statistical Functions and Routines (as of May/21/2007).

8. ## Re: Statistical flaws in Excel (2007)

Just ran across this
http://www.robweir.com/blog/2007/07/...r-failure.html
Am I right in thinking that these problems only come to light if the spreadsheet is saved in XML format?

9. ## Re: Statistical flaws in Excel (2007)

That article refers to the Office Open XML format, which is used as default in Office 2007 for workbooks without macros; the file extension is .xlsx. I don't know whether the implementation in Excel 2007 actually includes the errors mentioned in the article, or whether they are merely errors in the description of the format.

10. ## Re: Statistical flaws in Excel (Excel 97 / 2000 / XP)

Hans P.,

I'm having trouble trying to download your attachment. Maybe it disappeared when the Lounge went down? Would you or someone else please reattach it?

Thank you!

11. ## Statistical flaws in Excel (Excel 97 / 2000 / XP)

Edited by HansV to re-attach the zipped document, it was lost in the server crash of August 2007

Many people, including myself, use Excel's statistical functions and the Analysis Toolpak. However, Excel has some flaws that you should know when you are using these tools. I summarized many of these flaws in one document, which I wanted to present to the forum. I would appreciate your comments and suggestions.
It's a zipped pdf document to reduce it to the acceptable file size for uploading.

12. ## Re: Statistical flaws in Excel (Excel 97 / 2000 / XP)

I had a copy of Hans Pottel's PDF file; I have re-attached it to the original post.

13. ## Re: Statistical flaws in Excel (Excel 97 / 2000 /

It worked. Thanks!

#### Posting Permissions

• You may not post new threads
• You may not post replies
• You may not post attachments
• You may not edit your posts
•