12 1. Summarizing Data

distribution is not symmetric, however, the mean and median may be very different, and one measure may provide a more useful summary than the other. For example, if we begin with a symmetric distribution and add in one additional value that is very much larger than the other values (an outlier), then the median will not change very much (if at all), but the mean will increase substantially. We say that the median is resistant to outliers while the mean is not.

A similar thing happens with a skewed, unimodal distribution. If a distribution is positively skewed, the large values in the tail of the distribution increase the mean (as compared to a symmetric distribution) but not the median, so the mean will be larger than the median. Similarly, the mean of a negatively skewed distribution will be smaller than the median.

Whether a resistant measure is desirable or not depends on context. If we are looking at the income of employees of a local business, the median may give us a much better indication of what a typical worker earns, since there may be a few large salaries (the business owner’s, for example) that inflate the mean. This is also why the government reports median household income and median housing costs.

On the other hand, if we compare the median and mean of the value of raffle prizes, the mean is probably more interesting. The median is probably 0, since typically the majority of raffle tickets do not win anything. This is independent of the values of any of the prizes. The mean will tell us something about the overall value of the prizes involved. In particular, we might want to compare the mean prize value with the cost of the raffle ticket when we decide whether or not to purchase one.

The trimmed mean compromise

There is another measure of central tendency that is less well known and represents a kind of compromise between the mean and the median.
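A small numerical sketch may make this compromise concrete. The data below are made up for illustration (they are not from the text), and the trim argument of mean() used in the last line is R's built-in way of computing a trimmed mean.

```r
# Made-up, roughly symmetric sample (illustrative values only)
x <- c(6, 7, 8, 9, 10, 10, 11, 12, 13, 14)

mean(x)     # mean of the symmetric sample
median(x)   # median of the symmetric sample

# Add one value far larger than the rest (an outlier)
y <- c(x, 200)

mean(y)                # pulled strongly upward by the outlier
median(y)              # unchanged by the outlier
mean(y, trim = 0.10)   # trims 10% of the values from each end before averaging
```

With these values, the mean and median of x are both 10. After the outlier is added, the mean of y rises to about 27.3 and the median stays at 10, while the 10% trimmed mean is about 10.4, because trimming discards the outlier along with the smallest value.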
In particular, it is more sensitive to the extreme values of a distribution than the median is, but less sensitive than the mean. The idea of a trimmed mean is very simple. Before calculating the mean, we remove the largest and smallest values from the data. The percentage of the data removed from each end is called the trimming percentage. A 0% trimmed mean is just the mean; a 50% trimmed mean is the median; a 10% trimmed mean is the mean of the middle 80% of the data (after removing the largest and smallest 10%). A trimmed mean is calculated in R by setting the trim argument of mean(), e.g., mean(x, trim=0.10).

Although a trimmed mean in some sense combines the advantages of both the mean and median, it is less common than either the mean or the median. This is partly due to the mathematical theory that has been developed for working with the median and especially the mean of sample data.

1.2.4. Measures of Dispersion

It is often useful to characterize a distribution in terms of its center, but that is not the whole story. Consider the distributions depicted in the histograms in Figure 1.7. In each case the mean and median are approximately 10, but the distributions clearly have very different shapes. The difference is that distribution B is much