1. Bell Curve/Trend (2K3)

I am out of my comfort zone when it comes to statistics so my apologies in advance.

I have a series where the X axis is minutes and the Y axis is the count of people who ran that time. I have charted these figures and the result "vaguely" looks like a bell curve. The sample contains 1,560 people, so it is a reasonable sample. What is the method to depict a smoothed-out curve following this data, in the form of a trend line?

Hoping this is enough information, data available if required.

2. Re: Bell Curve/Trend (2K3)

This is not really symmetrical, so a distribution other than the normal distribution is probably a better fit, for example a binomial distribution.

Do you have a statistics program such as SPSS or SAS at your disposal?

3. Re: Bell Curve/Trend (2K3)

Looks like it could be fit with a Weibull, lognormal, or even a gamma distribution.

Perhaps "Distribution Fitting" or "8.2.1. How do you choose an appropriate life distribution model?" can help you out...

You can do some fitting of the curves using Solver (similar to what HansV did for exponential fitting in post 679,974).

Steve

4. Re: Bell Curve/Trend (2K3)

I had a chance to play with this.

As I alluded to, here is an example (using an estimate of data points from your picture) with the lognormal and Weibull distributions. I included the normal (bell curve) distribution to show you the differences. The Solver is set up to minimize the squared deviations of all three curves together. The lognormal appears to fit your data the best: it captures the right tail and the leftmost points best, in addition to having the lowest sum of squares.
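For anyone curious about the mechanics outside Excel, here is a minimal Python sketch of the same idea: minimize the sum of squared deviations between observed counts and a scaled lognormal curve. The data points and seed values below are made up for illustration (not the actual posted times), and the crude nudge-and-keep loop is only a stand-in for what Solver does.

```python
import math

def lognormal_curve(x, scale, mu, sigma):
    # Scaled lognormal density: 'scale' times the lognormal pdf at x.
    return (scale / (x * sigma * math.sqrt(2 * math.pi))
            * math.exp(-(math.log(x) - mu) ** 2 / (2 * sigma ** 2)))

def sum_sq(params, data):
    # The quantity Solver minimizes: the sum of squared deviations.
    scale, mu, sigma = params
    return sum((y - lognormal_curve(x, scale, mu, sigma)) ** 2
               for x, y in data)

# Hypothetical (minutes, runners) pairs, right-skewed like the posted chart.
data = [(15, 5), (17, 30), (19, 80), (21, 120), (23, 100),
        (25, 70), (28, 40), (32, 20), (38, 8), (45, 3)]

seed = [2000.0, 3.1, 0.25]   # initial guesses, like Solver's seed values
params = seed[:]
step = [100.0, 0.05, 0.02]
# Crude stand-in for Solver: nudge each parameter up/down and keep any
# change that lowers the sum of squares.
for _ in range(200):
    for i in range(3):
        for delta in (step[i], -step[i]):
            trial = params[:]
            trial[i] += delta
            if sum_sq(trial, data) < sum_sq(params, data):
                params = trial
```

In the workbook, Solver's GRG engine does this adjustment far more intelligently; the point is only that "fit" means "parameters that minimize the sum of squared deviations".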

Steve

5. Re: Bell Curve/Trend (2K3)

Wow

That is very impressive, Steve, thank you. That has whetted my appetite to investigate this further. This model is fantastic, as I want to introduce it in the results of some of the time trials we do. Thank you!

6. Re: Bell Curve/Trend (2K3)

Thanks for the kind words.

This method of calculating the minimum in the squares of the deviations is general and works with any model, whether or not it can be linearized. (The LINEST function in Excel only works with linearized equations.) It is a type of non-linear regression. If you have a particular theoretical model, I would start with that.

In your example, as Hans also mentioned, the data do not appear to be "normal" (bell-shaped), since they have significant skewness to the right. This occurs a lot in nature, especially when there is a natural minimum (the best case, when everything goes well) but some things can "delay it", causing small numbers of extreme cases. For example, the time to travel from home to work: no traffic, 15 minutes; normal traffic, 17-25 minutes; but some days (snow, traffic accidents, etc.) it can be hours!

This type of shape is why "average salaries" are generally reported as the median (50th percentile) rather than the mean. The mean is affected dramatically by the few extreme cases of multi-millionaires and billionaires.
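The commute/salary point is easy to check numerically. A quick stdlib-only sketch with made-up commute times, one of which is a snow day:

```python
import statistics

# Ten ordinary commutes plus one extreme day, in minutes (made-up numbers).
commutes = [15, 16, 17, 18, 18, 19, 20, 21, 22, 25, 120]

mean_val = statistics.mean(commutes)      # dragged up by the 120
median_val = statistics.median(commutes)  # stays at a "typical" 19
```

One extreme value pulls the mean well above anything a typical day looks like, while the median barely notices it.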

One caveat about this method and Solver: the final results may depend on the initial "seed values"/approximations you use. I tend to "play" with the parameters until the fitted chart looks "close" and then run Solver. Sometimes manually tweaking the result and running Solver again can also help. Changing the Solver options for the iterations can also affect the results.
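The seed-value caveat is easy to demonstrate: a downhill-only search (roughly what a local optimizer like Solver does) can settle in different minima depending on where it starts. A toy illustration on a curve with two dips:

```python
def f(x):
    # Quartic with local minima near x = -1.77 and x = +1.69.
    return x ** 4 - 6 * x ** 2 + x

def descend(x, step=0.01, iters=5000):
    # Move downhill in small steps until neither neighbour is lower,
    # mimicking a local optimizer that only ever improves the objective.
    for _ in range(iters):
        if f(x + step) < f(x):
            x += step
        elif f(x - step) < f(x):
            x -= step
        else:
            break
    return x

left = descend(-3.0)    # seeded on the left: lands near -1.77
right = descend(3.0)    # seeded on the right: lands near +1.69
```

Both runs report "converged", but only one has found the better (lower) minimum, which is why eyeballing the chart first and re-running with tweaked seeds is good practice.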

Also, even though I based my fitted line on the X values from the data in this example, that is not required, nor is it always a good idea. The data's X values are needed for the calculations, but if the dataset is limited or you want to extrapolate outside the range, I typically create a separate X range from start to finish, spaced with enough points to give a nice smooth curve.
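The "separate X range" is just an evenly spaced helper column. A minimal stand-in for filling one (the endpoints and spacing here are illustrative, not from the workbook):

```python
def x_range(start, stop, points):
    # Evenly spaced values from start to stop inclusive, like a helper
    # column used only for drawing the fitted curve smoothly.
    step = (stop - start) / (points - 1)
    return [start + i * step for i in range(points)]

# e.g. minutes 10 to 60 in 0.5-minute steps: plenty for a smooth line
xs = x_range(10.0, 60.0, 101)
```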

Good Luck. Post back if you need further assistance.

Steve

7. Re: Bell Curve/Trend (2K3)

Hi Steve,

Have you ever seen a way to compute R-squared or, better yet, the F-value of fits like these from their squared differences? Sort of an ANOVA?

8. Re: Bell Curve/Trend (2K3)

Hi Steve,

Good explanation. If I find time I will implement that in a more generic approach I have taken in the attached workbook, which also allows you to solve problems like these (also using Solver).

9. Re: Bell Curve/Trend (2K3)

Thanks Steve and Jan Karel, this is very useful.

10. Re: Bell Curve/Trend (2K3)

"Do you have a statistics program such as SPSS or SAS at your disposal?"

You might want to try the freeware, open-source R. It does have a steep learning curve, but it looks quite promising. I'm studying it on my own, and I can give some references I've found regarding distribution fitting if you are interested.

11. Re: Bell Curve/Trend (2K3)

R is very useful, but it's geared towards programmers. It's definitely not suitable for casual use. (I know, one might argue that this is true of any statistics package, but it's more so for R).

12. Re: Bell Curve/Trend (2K3)

Yes.
I mentioned it in case someone was interested in taking the plunge, but a fair amount of "exposure" is required before being able to do something worth considering. Even importing the information is a real chore. Just in case, those interested should check the R Commander package, which makes many simple tasks (e.g., importing data), well, simpler.

I've never used SAS / SPSS / Minitab / Statistica, so I cannot tell how R compares to these.

13. Re: Bell Curve/Trend (2K3)

(Edited by sdckapr on 06-Dec-07 06:22. Modified my post; I was a little mistaken on the calculation of degrees of freedom.)

I see some discrepancies in what I would calculate in the ANOVA table (though I am not a statistician, so I may be mistaken). I have been thinking about the discrepancies, and the primary difference is in the "degrees of freedom". After further reading I am editing this and my earlier post, since I missed some of the more general detail (my lack of formal statistical training shows!).

in B2:
=B5-B3

in B3:
=COUNT(yValues)-COUNT(Constants)

in C3:
=SUM(yDelta)

You could also add the "prob" in G2.
=FDIST(F2,B2,B3)

The formula I am not sure how to make general is B5. The "-1" is not always there: it comes from the use of an "intercept" (an independent constant; whether it counts depends on its use in the equation, not merely its presence as a constant). You may want to create an "independent parameter" (essentially an "offset" in the Y direction).

This will then match the numbers (especially the F) found using LINEST, as well as other stats software (our company uses JMP). Your method does not match their F-value, but it will with the above changes.

Some remarks:
The residual/error degrees of freedom (DFres) is the count of the Y values minus the number of parameters (N - k). (LINEST reports this value if used.)

The regression or model degrees of freedom (DFreg) is always the total degrees of freedom minus the residual degrees of freedom (DFtotal - DFres).

DFtotal is NOT always the count of Y values minus 1! The "-1" comes from the use of the intercept (a parameter independent of X). If there is a parameter independent of X, then DFtotal is N - 1; otherwise it is N.

[This is related to why the estimate of the sample standard deviation is "n-1" weighted: it loses a degree of freedom by using the sample mean as the estimate of the population mean.]

So if there are 10 y-values and you are fitting to an equation:
y = mx + b

DFres = 8 [= 10 - 2 (m, b)]
DFtotal = 9 [= 10 - 1 (b)]
DFreg = 1 [= 9 - 8]

But if you fit to the equation:
y = mx

DFres = 9 [= 10 - 1 (m)]
DFtotal = 10 [no "-1"; no independent constant]
DFreg = 1 [= 10 - 9]
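The two bookkeeping cases above can be wrapped in a small helper for checking (Python; "has_offset" marks whether the model contains a parameter independent of X):

```python
def dof(n_points, n_params, has_offset):
    # DFres = N - k; DFtotal = N - 1 only when the model has a parameter
    # independent of X (an intercept/offset); DFreg = DFtotal - DFres.
    df_res = n_points - n_params
    df_total = n_points - 1 if has_offset else n_points
    df_reg = df_total - df_res
    return df_res, df_total, df_reg

with_intercept = dof(10, 2, True)    # y = mx + b  -> (8, 9, 1)
no_intercept = dof(10, 1, False)     # y = mx      -> (9, 10, 1)
```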

I apologize for any confusion I may have created.

Steve

14. Re: Bell Curve/Trend (2K3)

Well, whichever package you use, any of them is useless without proper statistical insight/training.

I have used StatGraphics (do they even exist anymore?) and Statistica (very complete, but a bit overwhelming) quite a bit, along with Design Expert, which is a DOE software package (the best around IMO). I have worked a bit with (I think it was called) RS/1, which was very good, but back in the mainframe days, so text based and cumbersome to work with.

15. Re: Bell Curve/Trend (2K3)

(Edited by sdckapr on 06-Dec-07 07:47. Changed some info on degrees of freedom after doing more investigation.

Edited by sdckapr on 05-Dec-07 09:05. Corrected a mistake. Brainfart; I was using the SS for the F instead of the MS.)

Yes, they are both readily calculated.

This is probably more than you want or need, but here is some FYI in case people try to adapt/expand this technique and want to crunch some of the numbers:

The value we are minimizing is the sum of squares of the deviations: the sum of (Yobs - Ypred)², which is the "sum of squares of the residuals" (SSres). [We are presuming that the Ys are the dependent variables and the Xs are independent, so all the error is in the Y.] The total sum of squares of the data is calculated from the Y values using the VARP and COUNT functions [SStotal = VARP(Ys)*COUNT(Ys)].
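In Python terms, the two worksheet quantities look like this (with small made-up vectors standing in for the example file's columns):

```python
def ss_total(ys):
    # Excel's VARP(Ys)*COUNT(Ys): population variance times the count,
    # which is simply the total sum of squares about the mean.
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def ss_res(y_obs, y_pred):
    # Sum of squared deviations between observed and predicted Ys.
    return sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))

# Illustrative values only, not the workbook's B7:B61 data.
y_obs = [5.0, 30.0, 80.0, 120.0, 100.0, 70.0]
y_pred = [7.0, 28.0, 85.0, 115.0, 102.0, 66.0]
```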

For the Example file:
SStotal = VARP($B$7:$B$61)*COUNT($B$7:$B$61) = 10,134.91

For the lognormal example, the fitting total is in H6:
SSres = SUM(H7:H61) = 3,327.18

The R² value is the ratio of the regression sum of squares (SSreg) to SStotal [R² = SSreg/SStotal]. It is essentially the percentage of the total variation explained by the chosen model. Since SSres is the amount not explained by the model, SSreg = SStotal - SSres, and therefore:

R² = SSreg/SStotal
R² = (SStotal - SSres)/SStotal
R² = 1- SSres/SStotal
R² = 1- 3,327.18/10,134.91
R² = 0.672

This indicates that the model explains only 67.2% of the total variation in the data.
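Checking the arithmetic above:

```python
ss_total = 10134.91          # VARP(B7:B61)*COUNT(B7:B61)
ss_res = 3327.18             # SUM(H7:H61), lognormal fit
r_squared = 1 - ss_res / ss_total   # about 0.672
```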

The degrees of freedom for the residual (DFres) is "N - k": the total points minus the number of parameters in the model (in all the Jezza examples, 55 - 3 = 52). The total degrees of freedom is in many models "N - 1" = 54; however, in these examples there is no "offset" parameter (like the intercept in y = mx + b) which is independent of X, so the total is N. DFreg = DFtotal - DFres = 3. DFreg is typically the number of parameters minus 1 ("k - 1"), since most models include the "offset" (y = A0 + A1X1 + A2X2). The models in the Jezza examples could include an offset (if desired); the data could be suggesting that one might be needed, since the "tail" seems to start and end at 2 instead of zero.

The F-value is the ratio of the mean squares (MSreg to MSres). The MS are the SS/DF:
F = MSreg/MSres
F = ((SStotal - SSres) / (DFtotal - DFres)) / (SSres / DFres)
F = ((10,134.91 - 3,327.18) / (55 - 52)) / (3,327.18 / 52)
F = 35.5

The F probability at 3 and 52 degrees of freedom for that value is:
=FDIST(F, DFreg, DFres)
=FDIST(35.5, 3, 52) << 0.0001
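The F-value arithmetic, step by step (FDIST itself needs Excel or a stats library, so only the F is computed here):

```python
ss_total = 10134.91
ss_res = 3327.18
df_total, df_res = 55, 52    # no offset parameter, so DFtotal = N = 55
df_reg = df_total - df_res   # 3
ms_reg = (ss_total - ss_res) / df_reg
ms_res = ss_res / df_res
f_value = ms_reg / ms_res    # about 35.5
```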

[Note: since SSres generally decreases as more parameters are added, an adjusted R² can also be calculated if you are comparing "correlation coefficients" for various models instead of the F-value (the F accounts for this, since it explicitly uses the degrees of freedom):

Adj-R² = 1 - (SSres/DFres) / (SStotal / DFtotal)
Adj-R² = 1 - (3,327.18/(55-3)) / (10,134.91 / 55)
Adj-R² = 1 - (3,327.18/52) / (10,134.91 / 55)

Thus SSres must decrease enough for the increase in DF to get the best "bang for the buck", and not add parameters which do not contribute anything significant to the model (the presumption being: the simpler, the better).]
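The adjusted R² for the same figures:

```python
ss_total = 10134.91
ss_res = 3327.18
df_total, df_res = 55, 52
adj_r_squared = 1 - (ss_res / df_res) / (ss_total / df_total)
# about 0.653, a bit below the plain R-squared of 0.672
```

As expected, the adjustment penalizes the fit slightly for the degrees of freedom the three parameters consume.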

Steve
