1. Summarizing Data

The trouble with this is that the total deviation from the mean is always 0 because the negative deviations and the positive deviations always exactly cancel out. (See Exercise 1.10.) To fix this problem, we might consider taking the absolute value of the deviations from the mean:

$$\text{total absolute deviation from the mean} = \sum |x - \bar{x}| .$$

This number will only be 0 if all of the data values are equal to the mean. Even better would be to divide by the number of data values:

$$\text{mean absolute deviation} = \frac{1}{n} \sum |x - \bar{x}| .$$

Otherwise large data sets will have large sums even if the values are all close to the mean. The mean absolute deviation is a reasonable measure of the dispersion in a distribution, but we will not use it very often. There is another measure that is much more common, namely the variance, which is defined by

$$\text{variance} = \operatorname{Var}(x) = \frac{1}{n-1} \sum (x - \bar{x})^2 .$$

You will notice two differences from the mean absolute deviation. First, instead of using an absolute value to make things positive, we square the deviations from the mean. The chief advantage of squaring over the absolute value is that it is much easier to do calculus with a polynomial than with functions involving absolute values. Because the squaring changes the units of this measure, the square root of the variance, called the standard deviation, is commonly used in place of the variance.

The second difference is that we divide by $n - 1$ instead of by $n$. There is a very good reason for this, even though dividing by $n$ probably would have felt much more natural to you at this point. We'll get to that very good reason later in the course (in Section 4.6). For now, we'll settle for a less good reason. If you know the mean and all but one of the values of a variable, then you can determine the remaining value, since the sum of all the values must be the product of the number of values and the mean. So once the mean is known, there are only $n - 1$ independent pieces of information remaining.
That is not a particularly satisfying explanation, but it should help you remember to divide by the correct quantity.

All of these quantities are easy to compute in R.

    # intro-dispersion02
    > x <- c(1, 3, 5, 5, 6, 8, 9, 14, 14, 20)
    > mean(x)
    [1] 8.5
    > x - mean(x)
     [1] -7.5 -5.5 -3.5 -3.5 -2.5 -0.5  0.5  5.5  5.5 11.5
    > sum(x - mean(x))
    [1] 0
    > abs(x - mean(x))
     [1]  7.5  5.5  3.5  3.5  2.5  0.5  0.5  5.5  5.5 11.5
    > sum(abs(x - mean(x)))
    [1] 46
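The remaining quantities defined in this section can be computed from the same data in the same way. The following is a minimal sketch rather than part of the original example: the variable names (`mean_abs_dev`, `variance_by_hand`, `sd_by_hand`, `hidden`) are chosen here for illustration only. It also checks the hand computations against R's built-in `var()` and `sd()`, both of which divide by n - 1, and illustrates the "less good reason" by recovering one value from the mean and the other n - 1 values.

```r
# Same data as in the example above
x <- c(1, 3, 5, 5, 6, 8, 9, 14, 14, 20)
n <- length(x)

# Mean absolute deviation: average size of the deviations from the mean
mean_abs_dev <- sum(abs(x - mean(x))) / n
mean_abs_dev                                   # 4.6

# Variance: sum of squared deviations, divided by n - 1 (not n)
variance_by_hand <- sum((x - mean(x))^2) / (n - 1)
variance_by_hand                               # 34.5

# Standard deviation: square root of the variance, in the units of x
sd_by_hand <- sqrt(variance_by_hand)

# The built-in var() and sd() use the same n - 1 divisor
all.equal(variance_by_hand, var(x))            # TRUE
all.equal(sd_by_hand, sd(x))                   # TRUE

# Knowing the mean and all but the last value determines the last value:
# the sum of all values must equal n * mean
hidden <- n * mean(x) - sum(x[-n])
hidden                                         # 20
```

Note that R's built-in `mad()` computes the median absolute deviation (scaled by a constant by default), not the mean absolute deviation defined above, which is why the sketch computes the latter directly.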