Page 1 of 2
Results 1 to 15 of 17
  1. #1
    Platinum Lounger
    Join Date
    Feb 2002
    Location
    A Magic Forest in Deepest, Darkest Kent
    Posts
    5,681
    Thanks
    0
    Thanked 1 Time in 1 Post

    Bell Curve/Trend (2K3)

    I am out of my comfort zone when it comes to statistics so my apologies in advance.

    I have a series where the X axis is minutes and the Y axis is the count of people who ran that time. I have charted these figures and the result "vaguely" looks like a bell curve. The sample contains 1,560 people, so it is a reasonable sample. What is the method for depicting a smoothed-out curve following this data, in the form of a trend line?

    Hoping this is enough information, data available if required.
    Jerry

  2. #2
    Plutonium Lounger
    Join Date
    Mar 2002
    Posts
    84,353
    Thanks
    0
    Thanked 29 Times in 29 Posts

    Re: Bell Curve/Trend (2K3)

    This is not really symmetrical, so a distribution other than the normal distribution is probably a better fit, for example a binomial distribution.

    Do you have a statistics program such as SPSS or SAS at your disposal?

  3. #3
    WS Lounge VIP sdckapr's Avatar
    Join Date
    Jul 2002
    Location
    Pittsburgh, Pennsylvania, USA
    Posts
    11,225
    Thanks
    14
    Thanked 342 Times in 335 Posts

    Re: Bell Curve/Trend (2K3)

    Looks like it could be fit with Weibull or lognormal or even the Gamma function.

    Perhaps "Distribution Fitting" or "8.2.1. How do you choose an appropriate life distribution model?" can help you out...

    You can do some fitting of the curves using Solver (similar to what HansV did for exponential fitting in post 679,974).

    Steve

  4. #4
    WS Lounge VIP sdckapr's Avatar
    Join Date
    Jul 2002
    Location
    Pittsburgh, Pennsylvania, USA
    Posts
    11,225
    Thanks
    14
    Thanked 342 Times in 335 Posts

    Re: Bell Curve/Trend (2K3)

    I had a chance to play with this.

    As I alluded to, here is an example (using an estimate of data points from your picture) with the lognormal and Weibull distributions. I included the normal (bell curve) distribution to show you the differences. Solver is set up to minimize the squared deviations of all three equations together. The lognormal appears to fit your data the best: it captures the right tail and the leftmost points best, in addition to having the lowest sum of squares.
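    As a rough Python analog of the Solver setup, here is a sketch of the same idea: minimize the sum of squared deviations between observed counts and a scaled lognormal curve. The data, parameter names, and starting values below are made up for illustration (they are not from the attached workbook), and scipy's minimize stands in for Solver.

    ```python
    # Sketch: least-squares fit of a scaled lognormal curve to (minutes, count)
    # data by minimizing the sum of squared deviations, as Solver does.
    import numpy as np
    from scipy.optimize import minimize

    x = np.arange(1, 61, dtype=float)        # minutes (hypothetical range)
    true_params = (1500.0, 3.0, 0.4)         # area, mu, sigma (assumed, for demo)

    def lognormal_curve(x, area, mu, sigma):
        """Scaled lognormal: area * lognormal PDF evaluated at x."""
        return area * np.exp(-(np.log(x) - mu) ** 2 / (2 * sigma ** 2)) / (
            x * sigma * np.sqrt(2 * np.pi))

    y = lognormal_curve(x, *true_params)     # stand-in "observed" counts

    def ss_residuals(params):
        """Sum of squared deviations -- the quantity Solver minimizes."""
        return np.sum((y - lognormal_curve(x, *params)) ** 2)

    # Start from rough guesses, much as you would seed Solver by eye.
    fit = minimize(ss_residuals, x0=[1000.0, 2.5, 0.6], method="Nelder-Mead")
    area, mu, sigma = fit.x                  # fitted parameters
    ```

    As with Solver, the result depends on reasonable starting values; a bad seed can leave the optimizer in a poor local minimum.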

    Steve
    Attached Files Attached Files

  5. #5
    Platinum Lounger
    Join Date
    Feb 2002
    Location
    A Magic Forest in Deepest, Darkest Kent
    Posts
    5,681
    Thanks
    0
    Thanked 1 Time in 1 Post

    Re: Bell Curve/Trend (2K3)

    Wow

    That is very impressive Steve, thank you. That has whetted my appetite to investigate this further. This model is fantastic, as I want to introduce it in the results of some of the time trials we do...
    Jerry

  6. #6
    WS Lounge VIP sdckapr's Avatar
    Join Date
    Jul 2002
    Location
    Pittsburgh, Pennsylvania, USA
    Posts
    11,225
    Thanks
    14
    Thanked 342 Times in 335 Posts

    Re: Bell Curve/Trend (2K3)

    Thanks for the kind words.

    This method of minimizing the sum of the squared deviations is general and works with any model, whether or not it can be linearized. (The LINEST function in Excel only works with linearized equations.) It is a type of non-linear regression. If you have a particular theoretical model, I would start with that. In your example, as Hans also mentioned, the data do not appear to be "normal" (bell-shaped), since they have significant skewness to the right. This occurs a lot in nature, especially when there is a natural minimum (the best case, when everything goes well) but some things can "delay" it, causing small numbers of extreme cases (e.g. the time to travel from home to work: no traffic, 15 minutes; normal traffic, 17-25 minutes; but some days, with snow or traffic accidents, it can be hours!). This type of shape is why "average salaries" are generally reported as the median (50th percentile) rather than the mean: the mean is affected dramatically by the few extreme cases of multi-millionaires and billionaires.
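    The commute-time point can be illustrated with a small simulated example (the distribution and numbers below are my own invention, not real data):

    ```python
    # For right-skewed data the mean is dragged upward by a few extreme
    # values, while the median stays near the "typical" case.
    import numpy as np

    rng = np.random.default_rng(0)
    # Simulated commute times, right-skewed: most near 17 min, a long tail.
    commute = rng.lognormal(mean=np.log(17), sigma=0.35, size=10_000)
    typical = np.median(commute)   # close to the everyday commute
    average = commute.mean()       # pulled up by the rare very long days
    ```

    The same asymmetry between mean and median is why salary summaries use the median.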

    One caveat about this method and Solver: the final results may depend on the initial "seed" values/approximations you use. I tend to "play" with the parameters until the fitted chart looks close, and then run Solver. Sometimes manually tweaking the result and running Solver again can also help. Changing Solver's iteration options can also affect the results.

    Also, even though I based my fitted line on the X values from the data in this example, that is not required, nor always a good idea. The data's X values are needed for the fit calculations, but if the dataset is limited, or you want to extrapolate outside its range, I typically create a separate X range from start to finish, spaced with enough points to give a nice smooth curve.

    Good Luck. Post back if you need further assistance.

    Steve

  7. #7
    Platinum Lounger
    Join Date
    Feb 2001
    Location
    Weert, Limburg, Netherlands
    Posts
    4,812
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Re: Bell Curve/Trend (2K3)

    Hi Steve,

    Have you ever seen a way to compute R squared or, better yet, the F-value of fits like these from their squared differences? Sort of an ANOVA?
    Jan Karel Pieterse
    Microsoft Excel MVP, WMVP
    www.jkp-ads.com
    Professional Office Developers Association

  8. #8
    Platinum Lounger
    Join Date
    Feb 2001
    Location
    Weert, Limburg, Netherlands
    Posts
    4,812
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Re: Bell Curve/Trend (2K3)

    Hi Steve,

    Good explanation. If I find time I will implement that in a more generic approach I have taken in the attached workbook, which also allows you to solve problems like these (also using Solver).
    Attached Files Attached Files
    Jan Karel Pieterse
    Microsoft Excel MVP, WMVP
    www.jkp-ads.com
    Professional Office Developers Association

  9. #9
    Plutonium Lounger
    Join Date
    Mar 2002
    Posts
    84,353
    Thanks
    0
    Thanked 29 Times in 29 Posts

    Re: Bell Curve/Trend (2K3)

    Thanks Steve and Jan Karel, this is very useful.

  10. #10
    5 Star Lounger
    Join Date
    Mar 2002
    Location
    Buenos Aires, Argentina
    Posts
    877
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Re: Bell Curve/Trend (2K3)

    > Do you have a statistics program such as SPSS or SAS at your disposal?
    You might want to try the freeware, open-source R. It does have a steep learning curve, but it looks quite promising. I'm studying it on my own, and I can give some references I've found regarding distribution fitting if you are interested.
    Diegol

  11. #11
    Plutonium Lounger
    Join Date
    Mar 2002
    Posts
    84,353
    Thanks
    0
    Thanked 29 Times in 29 Posts

    Re: Bell Curve/Trend (2K3)

    R is very useful, but it's geared towards programmers. It's definitely not suitable for casual use. (I know, one might argue that this is true of any statistics package, but it's more so for R).

  12. #12
    5 Star Lounger
    Join Date
    Mar 2002
    Location
    Buenos Aires, Argentina
    Posts
    877
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Re: Bell Curve/Trend (2K3)

    Yes.
    I mentioned it in case someone was interested in taking the plunge, but a fair amount of "exposure" is required before being able to do something worth considering. Even importing the information is a real chore. Just in case, those interested should check the R Commander package, which makes many simple tasks (e.g., importing data), well, simpler.

    I've never used SAS / SPSS / Minitab / Statistica, so I cannot tell how R compares to these.
    <img src=/w3timages/blue3line.gif width=33% height=2>
    <img src=/S/flags/Argentina.gif border=0 alt=Argentina width=30 height=18> <big><font color=4682b4><font face="Comic Sans MS">Diegol</font face=comic></font color=4682b4> </big>

  13. #13
    WS Lounge VIP sdckapr's Avatar
    Join Date
    Jul 2002
    Location
    Pittsburgh, Pennsylvania, USA
    Posts
    11,225
    Thanks
    14
    Thanked 342 Times in 335 Posts

    Re: Bell Curve/Trend (2K3)

    (Edited by sdckapr on 06-Dec-07 06:22. Modified my post; I was a little mistaken on the calculation of degrees of freedom.)

    I see some discrepancies in what I would calculate in the ANOVA table (though I am not a statistician, so I may be mistaken). I have been thinking about the discrepancies, and the primary difference is in the "degrees of freedom". After further reading I am editing this and my earlier post, since I missed some of the more general detail (my lack of formal statistical training shows!).

    in B2:
    =B5-B3

    in B3:
    =COUNT(yValues)-COUNT(Constants)

    in C3:
    =SUM(yDelta)

    You could also add the "prob" in G2.
    =FDIST(F2,B2,B3)

    The formula I am not sure how to make general is B5. The "-1" is not always there: it comes from the use of an "intercept" (a constant independent of X; what matters is its use in the equation, not its mere presence as a constant). You may want to create an "independent parameter" (essentially an "offset" in the Y direction).

    This will then match the numbers (especially the F) found using LINEST, as well as other stat software (our company uses JMP). Your method does not match their F-value; with the above changes it will.

    Some remarks:
    The degrees of freedom of the residual/error (DFres) is the count of the Y values minus the number of parameters (N - k). (LINEST reports this value if used.)

    The regression or model degrees of freedom (DFreg) is always the total degrees of freedom minus the residual degrees of freedom (DFtotal - DFres).

    The DFtotal is NOT always the count of Y values minus 1! The "-1" comes from the use of the INTERCEPT (a parameter independent of X). If there is a parameter independent of X, then DFtotal is N - 1; otherwise it is N.

    [This is related to why the estimate of the sample standard deviation is "n - 1" weighted: it loses a degree of freedom by using the sample mean as an estimate of the population mean.]

    So if there are 10 y-values and you are fitting to an equation:
    y = mx + b

    DFres = 8 [= 10 - 2 (m, b)]
    DFtotal = 9 [= 10 - 1 (b)]
    DFreg = 1 [= 9 - 8]

    But if you fit to the equation:
    y = mx

    DFres = 9 [=10-1 (m)]
    DFtotal = 10 [No "-1", no constant]
    DFreg = 1 [10 - 9]
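    The two worked cases above can be captured in a small helper function. This is just a sketch of the rules as Steve states them (nothing from the attached workbook), in Python for concreteness:

    ```python
    # Degrees-of-freedom rules: DFres = N - k; DFtotal = N - 1 when the model
    # contains an intercept/offset parameter, otherwise N; DFreg = DFtotal - DFres.
    def degrees_of_freedom(n_points, n_params, has_intercept):
        df_res = n_points - n_params
        df_total = n_points - 1 if has_intercept else n_points
        df_reg = df_total - df_res
        return df_res, df_total, df_reg

    # y = mx + b with 10 points: two parameters, one of them an intercept.
    print(degrees_of_freedom(10, 2, True))    # (8, 9, 1)
    # y = mx with 10 points: one parameter, no intercept.
    print(degrees_of_freedom(10, 1, False))   # (9, 10, 1)
    ```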

    I apologize for any confusion I may have created.

    Steve

  14. #14
    Platinum Lounger
    Join Date
    Feb 2001
    Location
    Weert, Limburg, Netherlands
    Posts
    4,812
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Re: Bell Curve/Trend (2K3)

    Well, whichever package you use, any of them is useless without proper statistical insight/training.

    I have used StatGraphics (do they even exist anymore?) and Statistica (very complete, but a bit overwhelming) quite a bit, along with Design Expert, which is a DOE software package (the best around, IMO). I have also worked a bit with (I think it was called) RS/1, which was very good, but that was back in the mainframe days, so it was text based and cumbersome to work with.
    Jan Karel Pieterse
    Microsoft Excel MVP, WMVP
    www.jkp-ads.com
    Professional Office Developers Association

  15. #15
    WS Lounge VIP sdckapr's Avatar
    Join Date
    Jul 2002
    Location
    Pittsburgh, Pennsylvania, USA
    Posts
    11,225
    Thanks
    14
    Thanked 342 Times in 335 Posts

    Re: Bell Curve/Trend (2K3)

    (Edited by sdckapr on 06-Dec-07 07:47. Changed some info on degrees of freedom after doing more investigation.

    Edited by sdckapr on 05-Dec-07 09:05. Corrected a mistake. Brainfart: I was using the SS for the F instead of the MS.)

    Yes, they are both readily calculated.

    This is probably more than you want or need, but some FYI in case people try to adapt or expand this technique and want to crunch some of the numbers:

    The value we are minimizing is the sum of squares of the deviations: the sum of (Yobs - Ypred)^2, the "sum of squares of the residuals" (SSres). [We are presuming that the Ys are the dependent variables and the Xs are independent, so all the error is in the Y.] The total sum of squares of the data is calculated from the Y values using the VARP and COUNT functions [SStotal = VARP(Ys)*COUNT(Ys)].

    For the Example file:
    SStotal = VARP($B$7:$B$61)*COUNT($B$7:$B$61) = 10,134.91

    For the lognormal example, the fit's sum of squares is in H6:
    SSres = SUM(H7:H61) = 3,327.18

    The R² value is the ratio of the regression sum of squares (SSreg) to SStotal [R² = SSreg/SStotal]. It is essentially the fraction of the total variation explained by the chosen model. Since SSres is the amount not explained by the model, SSreg = SStotal - SSres and therefore:

    R² = SSreg/SStotal
    R² = (SStotal - SSres)/SStotal
    R² = 1 - SSres/SStotal
    R² = 1 - 3,327.18/10,134.91
    R² = 0.672

    This indicates that the model explains only 67.2% of the total variation in the data.

    The degrees of freedom for the residuals (DFres) is "N - k": the total points minus the number of parameters in the model (in all of the Jezza examples, 55 - 3 = 52). The total degrees of freedom is in many models "N - 1" = 54, though in these examples there is no "offset" parameter (like the intercept in y = mx + b) which is independent of X, so the total is N = 55. DFreg = DFtotal - DFres = 3. DFreg is typically the number of parameters minus 1 ("k - 1"), since most models include the "offset" (y = A0 + A1X1 + A2X2). The models in the Jezza examples could include an offset (if desired); the data could be suggesting that one is needed, since the "tail" seems to start and end at 2 instead of zero.

    The F-value is the ratio of the mean squares (MSreg to MSres). The MS are the SS/DF:
    F = MSreg/MSres
    F = ((SStotal - SSres)/(DFtotal - DFres)) / (SSres/DFres)
    F = ((10,134.91 - 3,327.18)/(55 - 52)) / (3,327.18/52)
    F = 35.5

    The F probability at 3 and 52 degrees of freedom for that value is:
    =FDIST(F, DFreg, DFres)
    =FDIST(35.5, 3, 52) << 0.0001


    [Note: since SSres generally decreases as more parameters are added, an adjusted R² can also be calculated if you are comparing "correlation coefficients" for various models instead of the F-value (the F accounts for this, since it explicitly uses the degrees of freedom):

    Adj-R² = 1 - (SSres/DFres) / (SStotal/DFtotal)
    Adj-R² = 1 - (3,327.18/(55 - 3)) / (10,134.91/55)
    Adj-R² = 1 - (3,327.18/52) / (10,134.91/55)
    Adj-R² = 0.653

    Thus the SSres must decrease enough relative to the degrees of freedom lost to get the best "bang for the buck", and not add parameters which do not contribute anything significant to the model (the presumption being: the simpler, the better).]
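    The whole calculation can be replayed outside Excel. Here is a sketch using the example's numbers, with scipy.stats standing in for Excel's FDIST; the SStotal and SSres values are copied from the worked example above:

    ```python
    # Replaying the worked ANOVA-style numbers: R-squared, F, its p-value,
    # and adjusted R-squared, from SStotal and SSres.
    from scipy.stats import f as f_dist

    ss_total, ss_res = 10134.91, 3327.18
    df_total, df_res = 55, 52            # no offset parameter, so DFtotal = N
    df_reg = df_total - df_res           # 3

    r_squared = 1 - ss_res / ss_total                              # ~0.672
    f_value = ((ss_total - ss_res) / df_reg) / (ss_res / df_res)   # ~35.5
    p_value = f_dist.sf(f_value, df_reg, df_res)   # = FDIST(F, DFreg, DFres)
    adj_r_squared = 1 - (ss_res / df_res) / (ss_total / df_total)  # ~0.653
    ```

    The survival function `f_dist.sf` gives the upper-tail probability, which is exactly what FDIST reports.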

    Steve

