NAME

stats - a summary statistics program

SYNOPSIS

stats [flags], where flags are:

    -h or -help   this message
    -c#           probability confidence 0.5 < c <= 1.0
    -f#           field (use twice for (x,y) pairs)
    -xy           read (x,y) from field 1 and field 2
    -gp           output lines for gnuplot
    -GP           output only lines for gnuplot
    -s#           number of future samples (0 for infinite)
    -v            print out extra information

Defaults are: -c0.00 -f1

version 2.80, by Mark Claypool
send bugs, suggestions to claypool@cs.wpi.edu

DESCRIPTION

Stats is designed to be a quick, simple-to-use program that generates the following summary statistics:

- confidence intervals
- mean
- variance
- standard deviation
- min, max
- sum
- linear regression fits (for two-fielded inputs)

Input is read from standard input as numbers in distinct fields, one entry per line. Fields are groups of non-whitespace characters separated by whitespace. The first field is numbered 1 and is selected with the flag -f1 on the command line. Input lines with a "#" as the first character are treated as comments and are ignored.

For a single stream of numbers, you may specify the field on which to calculate statistics. By default, stats reports mean, variance, standard deviation, min, max and sum. If confidence intervals are requested (-c# flag), stats reports a confidence interval of # around the mean.

In the case of two-fielded input (specified by -fnum1 -fnum2), results for each field are reported separately, as above. In addition, stats performs a simple least-squares fit of a straight line to the data. Stats also reports the total sum of squares (SST) and the sum of squares explained by regression (SSR). The fraction of the variation that is explained by the regression (SSR/SST) determines the goodness of the regression and is called the coefficient of determination; stats reports it as well. If confidence intervals are requested (-c# flag), stats reports confidence intervals around the slope and y intercept.
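The regression quantities described above can be sketched in a few lines of Python. This is only an illustration of the textbook least-squares formulas, not the code inside stats; the function name linear_fit and the dictionary keys are hypothetical. The data are fields 2 and 4 of example.2.data from the EXAMPLES section.

```python
import math


def linear_fit(xs, ys):
    """Least-squares fit of y = A*x + B, with SSE, SSR, SST and R^2.

    Illustrative only: uses centered sums over stored data, which is
    not necessarily how stats accumulates its results.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # A
    intercept = mean_y - slope * mean_x    # B
    # Total variation of y around its mean.
    sst = sum((y - mean_y) ** 2 for y in ys)
    # Variation left unexplained by the line ("error squared").
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ssr = sst - sse                        # variation explained by the line
    r2 = ssr / sst                         # coefficient of determination
    corr = math.copysign(math.sqrt(r2), slope)
    return {"A": slope, "B": intercept, "SSE": sse,
            "SSR": ssr, "SST": sst, "R2": r2, "corr": corr}


if __name__ == "__main__":
    fit = linear_fit([20, 40, 31], [10.2, 19.3, 15.4])
    for name, value in fit.items():
        print(f"{name}: {value:.12f}")
```

Run on these three points, the sketch should agree with the line fit that stats prints for "stats -f2 -f4" in the EXAMPLES section, up to floating-point rounding.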
For those who use gnuplot, the -gp flag produces additional output in a format that is easily incorporated into gnuplot scripts for the "plot" command. This includes a format for error bars for the individual fields and confidence parabolas around the line fits. The size of the interval depends upon the number of future samples. There are two extreme cases, 1 and infinity. You can specify the number of samples with the -s# flag; -s0 indicates an infinite number of samples.

In order to speed up processing, all results are calculated in one pass. This involves keeping the sum of the squares of the numbers for calculating, among other things, the variance. This "on the fly" technique has the potential to cause the sum of the squares to overflow. As a safeguard, stats checks for an overflow and reports it if one occurs. However, in pilot tests with LOTS of numbers, this never happened.

Note that confidence intervals use an approximation formula for the t tables for over 30 values. For fewer than 30 values, only confidence intervals of 95% will be accurate.

All formulas and calculations used in stats can be found in any decent statistics book. However, the author relied especially on "The Art of Computer Systems Performance Analysis" by Raj Jain, copyright 1991, published by John Wiley and Sons, Inc.
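The one-pass technique mentioned above can be sketched as follows. This is a minimal Python illustration, not the stats source: the function name one_pass_summary, the choice to skip lines that lack the requested field, and the way overflow is detected (the running sum of squares becoming infinite) are all assumptions made for the example.

```python
import math


def one_pass_summary(lines, field=1):
    """Summarize one whitespace-separated field in a single pass.

    Keeps only running totals (count, sum, sum of squares, min, max),
    so no input value has to be stored.
    """
    n = 0
    total = 0.0
    total_sq = 0.0
    lo = math.inf
    hi = -math.inf
    for line in lines:
        if line.startswith("#"):
            continue                  # comment line, as described above
        parts = line.split()
        if len(parts) < field:
            continue                  # assumption: skip lines lacking the field
        value = float(parts[field - 1])
        n += 1
        total += value
        total_sq += value * value
        lo = min(lo, value)
        hi = max(hi, value)
    if math.isinf(total_sq):
        # Assumption: treat an infinite running sum as the overflow signal.
        raise OverflowError("sum of squares overflowed")
    if n < 2:
        raise ValueError("need at least two values for a variance")
    mean = total / n
    # Sample variance (divisor n - 1) from the running sums.
    variance = (total_sq - n * mean * mean) / (n - 1)
    return {"lines": n, "mean": mean, "variance": variance,
            "std dev": math.sqrt(variance), "sum": total,
            "min": lo, "max": hi}


if __name__ == "__main__":
    print(one_pass_summary(["# sample input", "5", "10", "15"]))
```

The divisor n - 1 is chosen because it reproduces the variance of 25 that stats reports for the values 5, 10 and 15 in the EXAMPLES section. Besides overflow, the sum-of-squares form can lose precision when the mean is large compared with the spread of the data, which is the usual price of computing the variance in a single pass this way.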
EXAMPLES

    mark% cat example.data
    5
    10
    15

    mark% cat example.data | stats
    Field: 1
    lines: 3
    mean: 10.000000000000
    variance: 25.000000000000
    std dev: 5.000000000000
    sum: 30.000000000000
    min: 5.000000000000
    max: 15.000000000000

    mark% cat example.data | stats -c.95
    Field: 1
    lines: 3
    mean: 10.000000000000
    variance: 25.000000000000
    std dev: 5.000000000000
    sum: 30.000000000000
    min: 5.000000000000
    max: 15.000000000000
    confidence: 95%
    left endpoint: 3.207474082984
    right endpoint: 16.792525917016

    mark% cat example.2.data
    value: 20 response: 10.2
    value: 40 response: 19.3
    value: 31 response: 15.4

    mark% cat example.2.data | stats -f2 -f4
    Field: 2
    lines: 3
    mean: 30.333333333333
    variance: 100.333333333333
    std dev: 10.016652800878
    sum: 91.000000000000
    min: 20.000000000000
    max: 40.000000000000
    Field: 4
    lines: 3
    mean: 14.966666666667
    variance: 20.843333333333
    std dev: 4.565449959570
    sum: 44.900000000000
    min: 10.200000000000
    max: 19.300000000000
    line: y = Ax + B
    A: 0.455647840532
    B: 1.145348837209
    error squared: 0.025265780731
    SSR: 41.661400885936
    SST: 41.686666666667
    coeff. of det.: 0.999393912185
    correlation: 0.999696910161

    mark% cat example.2.data | stats -f2 -f4 -GP -c.95
    30.333333 16.725659 43.941008
    14.966667 8.764479 21.168854
    (0.455647840532*x + 1.145348837209) + ((0.158952133458*sqrt(1.000000000000 + 0.333333333333+(x-30.333333333333)*(x-30.333333333333)/(2961.000000000000-3*920.111111111111))))*2.920000000000 title 'max fit' with lines 2, \
    0.455647840532*x + 1.145348837209 title 'best fit', \
    (0.455647840532*x + 1.145348837209) - ((0.158952133458*sqrt(1.000000000000 + 0.333333333333+(x-30.333333333333)*(x-30.333333333333)/(2961.000000000000-3*920.111111111111))))*2.920000000000 title 'min fit' with lines 2

The first two lines above (beginning with 30.3 and 14.9) are points that you want to plot. Say, the first one is data that came at some X value (indeterminable by stats).
You cut and paste the first one into file1 so that it looks like:

    1 30.333333 16.725659 43.941008

To gnuplot, the first number is the X coordinate, the second is the Y coordinate, the third is the low end of the confidence interval, and the fourth is the high end of the confidence interval. For gnuplot, you then have the command:

    plot \
    "file1" title 'my data' with lines, \
    "file1" title '95% confidence interval' with errorbars

You do something similar with the second line.

The second bunch of lines (beginning with 0.4) are gnuplot plot commands that generate a line fit and parabolic confidence line fits. Cut and paste the lines into a gnuplot plot command like:

    plot \
    (0.455647840532*x + 1.145348837209) + ((0.158952133458*sqrt(1.000000000000 + 0.333333333333+(x-30.333333333333)*(x-30.333333333333)/(2961.000000000000-3*920.111111111111))))*2.920000000000 title 'max fit' with lines 2, \
    0.455647840532*x + 1.145348837209 title 'best fit', \
    (0.455647840532*x + 1.145348837209) - ((0.158952133458*sqrt(1.000000000000 + 0.333333333333+(x-30.333333333333)*(x-30.333333333333)/(2961.000000000000-3*920.111111111111))))*2.920000000000 title 'min fit' with lines 2

BUGS

For more than 30 samples, an approximation is used for the T table values when computing confidence intervals. For fewer than 30 samples, the numbers must be looked up in a table. Because the author of this program is lazy, only table values for 95% have been recorded. Fortunately, 95% confidence intervals are quite common.

This man page could do a lot more towards describing the relevance and meaning of the statistics reported by stats. For example, what significance does a correlation of 0.60 have? How are confidence intervals to be interpreted? What information can be gathered from just the standard deviation?

FUTURE WORK

To do:

- Make stats read command line flags from an environment variable.
- Add histogram capabilities.
- Add more T-Table values for fewer than 30 samples and non-95% confidence intervals.
  90% and 99% are good candidates.
- Add a test for normality.
- Add a Perl script package that helps with parsing input files.
- Add hypothesis testing, including P-Values.
- It might be nice if stats generated a gnuplot file, both for the data and for the control variables.
- Add summary statistics for paired data.