1.2. Graphical and Numerical Summaries of Univariate Data

[Figure 1.9. Boxplots for iris sepal length and Old Faithful eruption times. Panels: Sepal.Length by species (setosa, versicolor, virginica), drawn vertically and horizontally; eruptions.]

        0%    25%    50%    75%   100%
    1.6000 2.1627 4.0000 4.4543 5.1000

The latter of these provides what is commonly called the five-number summary. The 0-quantile and 1-quantile (at least in the default scheme) are the minimum and maximum of the data set. The 0.5-quantile gives the median, and the 0.25- and 0.75-quantiles (also called the first and third quartiles) isolate the middle 50% of the data. When these numbers are close together, most (well, half, to be more precise) of the values are near the median. When they are farther apart, much (again, half) of the data is far from the center. The difference between the first and third quartiles is called the interquartile range and is abbreviated IQR. This is our first numerical measure of dispersion.

The five-number summary can also be presented graphically using a boxplot (also called a box-and-whisker plot) as in Figure 1.9. These plots were generated using

    # iris-bwplot
    library(lattice)    # bwplot() is provided by the lattice package
    bwplot(Sepal.Length ~ Species, data = iris)
    bwplot(Species ~ Sepal.Length, data = iris)
    bwplot(~ eruptions, data = faithful)

The size of the box reflects the IQR. If the box is small, then the middle 50% of the data are near the median, which is indicated by a dot in these plots. (Some boxplots, including those made by the boxplot() function, use a line drawn across the box to indicate the median.) Outliers (values that seem unusually large or small) can be indicated by a special symbol. The whiskers are then drawn from the box to the largest and smallest non-outliers. One common rule for automating outlier detection for boxplots is the 1.5 IQR rule. Under this rule, any value that lies more than 1.5 times the IQR beyond the box (below the first quartile or above the third quartile) is marked as an outlier.
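The 1.5 IQR rule can be checked by hand. The following is a minimal sketch, not the code used to draw Figure 1.9: the helper name iqr_outliers is hypothetical, and it assumes the default quantile() scheme. Plotting functions may define the quartiles slightly differently (for example, using hinges), so the points they flag need not match exactly.

```r
# Five-number summary and IQR for the Old Faithful eruption times.
# `faithful` and `iris` ship with base R (package datasets).
x <- faithful$eruptions
five <- quantile(x)                    # 0%, 25%, 50%, 75%, 100% (default scheme)
iqr <- five[["75%"]] - five[["25%"]]   # third quartile minus first quartile

# Hypothetical helper (not from the text): return the values that the
# 1.5 IQR rule would flag as outliers.
iqr_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  spread <- q[2] - q[1]
  lower_fence <- q[1] - 1.5 * spread
  upper_fence <- q[2] + 1.5 * spread
  x[x < lower_fence | x > upper_fence]
}

print(five)
print(iqr)
print(iqr_outliers(x))

# The same rule applied to sepal length within each iris species.
print(tapply(iris$Sepal.Length, iris$Species, iqr_outliers))
```

For the eruption times, the fences fall outside the observed minimum and maximum, so no value is flagged and the whiskers run all the way to 1.6 and 5.1.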
Indicating outliers in this way is useful since it allows us to see whether a whisker is long only because of one extreme value.

Variance and standard deviation

Another important way to measure the dispersion of a distribution is by comparing each value with the mean of the distribution. If the distribution is spread out, these differences will tend to be large; otherwise, these differences will be small. To get a single number, we could simply add up all of the deviations from the mean:

\[ \text{total deviation from the mean} = \sum (x - \overline{x}) . \]
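To see what this sum looks like on real data, here is a small sketch that evaluates it for the Old Faithful eruption times. The variable names are illustrative only. Values above the mean contribute positive deviations and values below it contribute negative ones, so the two groups offset each other, and the total comes out at zero up to floating-point rounding.

```r
# Total deviation from the mean for the Old Faithful eruption times.
x <- faithful$eruptions

deviations <- x - mean(x)            # one signed deviation per observation
total_deviation <- sum(deviations)   # positive and negative terms cancel

print(head(deviations))
print(total_deviation)
```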