The last time Hackerfall tried to access this page, it returned a not found error.
A cached version of the page is below, or click here
to continue anyway
Statistics With Excel?
Is Microsoft Excel an Adequate Statistics Package? Changes in Excel 2010 and following have improved its use for statistical analysis. The primary deficiency is still that Excel does not contain many of the most useful and modern methods for environmental statistics. So unfortunately, the answer is generally ‘No’. Excel is available to many people as part of Microsoft Office. People often use Excel as their everyday statistics software because they have already purchased it. Excel’s limitations make this a very questionable practice for scientific applications. For business applications where questions might be simpler and precision not as necessary, Excel may be just fine. For comments on versions of Excel prior to 2010, see our Pre-2010 Excel page.What has improved from our earlier review?
Several Excel functions have been changed to be more correct, and/or useful:
- The PERCENTILE.EXC function was added, which uses the Weibull formula [ i/(n+1) ] as the plotting position for the observation with rank i. This is appropriate when describing a subset (sample) of all possible conditions that are out in the real world, leaving a finite probability that the current maximum of your data might later be exceeded. The previous percentile function used the plotting position [i/n], placing the largest observation at the 100th percentile, so without possibility of being exceeded. This is only appropriate when the entire population (real world) has been observed — a census. Using the PERCENTILE.EXC function, computed values will now agree with standard statistics software. The less-applicable "census of all data" formula continues as the PERCENTILE.INC function, which should be avoided for scientific applications.
- The RANK.AVG function was added, which assigns the average rank to all tied values. The previous RANK function assigned the lowest rank to all tied values. Using RANK.AVG, the user can build their own nonparametric procedures such as Wilcoxon and Kruskal-Wallis tests by programming, even though these tests are not available as built-in commands within Excel.
- Two-tailed probability functions such as T.DIST.2T were added to clearly provide a two-tailed probability of exceedance for hypothesis tests. Confusion often occurred with earlier t-test functions in Excel. People were commonly computing one-sided test results but believing they were 2-sided tests, or vice-versa. The old functions were inconsistent and unclear in whether they returned one- or two-tailed probabilities. The newer function names make it clearer as to what is being computed.
- Boxplots (called Box and Whisker charts) have been added to the Chart menu in Excel 2016. These are the simplest form, drawing whiskers out to the max and min of the data. But much better than nothing.
- The CONFIDENCE function previously computed confidence intervals using a normal distribution (z-scores), only applicable when the true variance is known, which in practice is never. The t-interval should have been used instead. Excel 2016 now has the CONFIDENCE.T function for computing confidence intervals correctly. The old function continues as CONFIDENCE.NORM, which should be avoided for scientific applications.
Limitations of Excel1. Many useful functions or tests are not available in Excel and its Analysis Tools, including:
- p-values for the correlation coefficient
- Spearman’s and Kendall’s rank correlation coefficients
- 2-way ANOVA with unequal sample sizes (unbalanced data)
- Multiple comparison tests (post-hoc tests following ANOVA)
- Levene’s test for equal variance (the older F-test used in Excel is far less accurate)
- Nonparametric tests, including the rank-sum, Kruskal-Wallis and Friedman tests
- Regression diagnostics, such as Mallow’s Cp and PRESS (Excel does compute adjusted r-squared and standardized residuals)
Some of these can be programmed by combining several steps and functions within Excel (see ref #1 for many formulas).
2. Advanced but regularly-used methods in environmental statistics are not available in Excel, including:
- Principal components or other multivariate methods
- GLM (generalized linear models)
- Survival analysis methods (for nondetects)
- Tests for serial correlation
- LOESS smooths
Excel's lack of functionality limits its use to simple statistical tasks. For alternatives, see the Possible Solutions presented below.
3. Several Excel procedures are misleading/wrong
- Probability plots are a standard way of judging the adequacy of the normality assumption in regression. In statistics packages, residuals from the regression are easily, or in some cases automatically, plotted on a normal probability plot. Excel’s regression routine provides a Normal Probability Plot option. However, it produces a uniform distribution probability plot for Y (Y versus a linear scale of probability instead of the nonlinear normal probability scale), even though it is labelled as a Normal Probability Plot. The plot is therefore useless and misleading for judging the adequacy of regression residuals.
- Excel's regression residuals plots use the original data rather than predicted values on the X axis. This is acceptable for simple regression with one X variable, but not for multiple regression.
4. Previous problems not yet evaluated
- The SOLVER function is used to solve nonlinear equations, including approximate maximum-likelihood solutions. SOLVER unfortunately frequently returns a solution prior to finding a global minimum or maximum, and so presents incorrect slopes and other coefficients as if they were correct (ref 2). In short, it can give back the wrong answers! In testing by McCullough and Heiser (ref 2 on our Pre-2010 Excel page), the wrong answer was returned 41% of the time.
- Statistical distributions used by Excel do not agree with better algorithms for those distributions at the third digit and beyond. So p-values are approximately correct, but not as exact as would be desired by a statistician. As of 2010 the probability distribution routines in Excel were not judged to be sufficiently accurate to carry out monte carlo methods (many repetitions from a distribution). See reference #2
5. Regression routines are incorrect for multicollinear data.
This affects multiple regression. A good statistics package will report errors due to correlations among the X variables. The Variance Inflation Factor (VIF) is one measure of collinearity. Excel does not compute collinearity measures, does not warn the user when collinearity is present, and reports parameter estimates that may be nonsensical. Are multicollinear data of concern in ‘practical’ problems? I think so -- I find many examples of collinearity in environmental data sets.
Possible SolutionsSolution #1: Write your own proceduresIn "Statistical Analysis with Excel for Dummies", Schmuller provides the code to put into an Excel cell to compute Spearman's rho, Scheffe's multiple comparison test, the rank-sum and Kruskal-Wallis tests. The Levene's test could be programmed easily using the one-factor ANOVA tool. Writing your own procedures will help for a few but not most of the above problems.Solution #2: Excel add-ins
A number of limitations can be overcome by using add-in packages that avoid using Excel’s computation procedures. Several of these were evaluated in our September 2013 newsletter reviewing statistical software for under $600
. These programs use Excel for data handling, but their own routines for data analysis. They vary widely in which routines are included - again, see our newsletter. NOTE: some of these add-ins alter your copy of Excel. Perform a complete backup prior to installing them, in case you don't find them helpful and want to go back to the pristine version. Uninstalling add-ins is not always simple.
Solution #3: Purchase statistics software, or use R (industry standard and free)
Our November 2007 newsletter
reviewed the capabilities of commercial statistics software. Our Summer 2006 and January 2010 newsletters
described the free R statistical software environment. All of our courses
are taught with either commercial software, or with R. We couldn't get through the routines presented in the first day of our introductory Applied Environmental Statistics course if using Excel.
Note: All opinions other than those cited as coming from others are my own.References and LinksArticles:
(1) Statistical Analysis with Excel for Dummies. J. Schmuller, 2016. Wiley, 527 p.
(2) On the numerical accuracy of spreadsheets
by Almiron and others, 2010, in the Journal of Statistical Software (American Statistical Association).
Continue reading on www.practicalstats.com