will discuss ways to summarize the location of the "center" of unimodal distributions numerically. But first we point out that some distributions have other shapes that are not characterized by a strong central tendency. One famous example is eruption times of the Old Faithful geyser in Yellowstone National Park.

    plot <- histogram(~eruptions, faithful, n=20)

produces the histogram in Figure 1.5, which shows a good example of a bimodal distribution. There appear to be two groups or kinds of eruptions, some lasting about 2 minutes and others lasting between 4 and 5 minutes.

1.2.3. Measures of Central Tendency

Qualitative descriptions of the shape of a distribution are important and useful. But we will often desire the precision of numerical summaries as well. Two aspects of unimodal distributions that we will often want to measure are central tendency (what is a typical value? where do the values cluster?) and the amount of variation (are the data tightly clustered around a central value, or more spread out?).

Two widely used measures of center are the mean and the median. You are probably already familiar with both. The mean is calculated by adding all the values of a variable and dividing by the number of values. Our usual notation will be to denote the $n$ values as $x_1, x_2, \ldots, x_n$, and the mean of these values as $\bar{x}$. Then the formula for the mean becomes

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} .$$

The median is a value that splits the data in half: half of the values are smaller than the median and half are larger. By this definition, there could be more than one median (when there is an even number of values). This ambiguity is removed by taking the mean of the "two middle numbers" (after sorting the data). See the exercises for some problems that explore aspects of the mean and median that may be less familiar.

The mean and median are easily computed in R.
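To make the two definitions concrete, here is a minimal sketch that computes the mean and the median by hand. The data vector x and the helper median_by_hand() are hypothetical names introduced only for this illustration; they do not come from the text or from any package.

```r
# Hypothetical toy data (not from the text), small enough to check by eye.
x <- c(3, 1, 4, 1, 5, 9)
n <- length(x)

# Mean: add all the values and divide by the number of values.
mean_by_hand <- sum(x) / n

# Median: sort the data, then take the middle value (odd n) or the
# mean of the "two middle numbers" (even n).
median_by_hand <- function(v) {
  s <- sort(v)
  m <- length(s)
  if (m %% 2 == 1) {
    s[(m + 1) / 2]
  } else {
    (s[m / 2] + s[m / 2 + 1]) / 2
  }
}

mean_by_hand             # same value as mean(x)
median_by_hand(x)        # even number of values
median_by_hand(c(x, 2))  # odd number of values
```

In practice there is no need to write these by hand, since the built-in mean() and median() functions do this work.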
For example,

    mean(iris$Sepal.Length)
    median(iris$Sepal.Length)
    [1] 5.8433
    [1] 5.8

Of course, we have already seen (by looking at histograms) that there are some differences in sepal length between the various species, so it would be better to compute the mean and median separately for each species. While one can use the built-in aggregate() function, we prefer to use the summary() function from the Hmisc package. This function uses the same kind of formula notation that the lattice graphics functions use.

    require(Hmisc)                       # load Hmisc package
    summary(Sepal.Length~Species, iris)  # default function is mean
    Sepal.Length    N=150
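For comparison, the built-in aggregate() alternative mentioned above can be sketched as follows. This is one possible way to get per-species summaries using base R only, not the approach the text prefers; the names species_means and species_medians are hypothetical, chosen for this illustration.

```r
# Mean and median sepal length for each species, using base R only.
# aggregate() accepts the same kind of formula: response ~ grouping variable.
species_means   <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
species_medians <- aggregate(Sepal.Length ~ Species, data = iris, FUN = median)

species_means    # one row per species
species_medians
```

Each call returns a data frame with one row per species, so the two results can be compared side by side or merged on the Species column.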