1. Summarizing Data

    $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
    $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
    $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

The fastR package includes data sets and other utilities to accompany this text. Instructions for installing fastR appear in the preface. We will use data sets from a number of other R packages as well. These include the CRAN packages alr3, car, DAAG, Devore6, faraway, Hmisc, MASS, and multcomp. Appendix A includes instructions for reading data from various file formats, for entering data manually, for obtaining documentation on R functions and data sets, and for installing packages from CRAN.

1.2. Graphical and Numerical Summaries of Univariate Data

Now that we can get our hands on some data, we would like to develop some tools to help us understand the distribution of a variable in a data set. By distribution we mean answers to two questions: What values does the variable take on? With what frequency? Simply listing all the values of a variable is not an effective way to describe a distribution unless the data set is quite small. For larger data sets, we require some better methods of summarizing a distribution.

1.2.1. Tabulating Data

The types of summaries used for a variable depend on the kind of variable we are interested in. Some variables, like iris$Species, are used to put individuals into categories. Such variables are called categorical (or qualitative) variables to distinguish them from quantitative variables which have numerical values on some numerically meaningful scale. iris$Sepal.Length is an example of a quantitative variable. Usually the categories are either given descriptive names (our preference) or numbered consecutively. In R, a categorical variable is usually stored as a factor.
The possible categories of an R factor are called levels, and you can see in the output above that R not only lists values of iris$Species but also provides a list of all the possible levels for this variable. A more useful summary of a categorical variable can be obtained using the table() function.

    table(iris$Species)    # make a table of values

        setosa versicolor  virginica
            50         50         50

From this we can see that there were 50 of each of three species of iris.
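
As a small supplementary sketch, the tabulation can also be expressed as relative frequencies, which answers the "with what frequency?" question on a common scale. The use of prop.table() and the variable names species_counts and species_props are additions for illustration and do not come from the text; the sketch assumes only the iris data set that ships with base R.

```r
# iris is part of base R's datasets package, so no extra packages are needed.

# The possible categories (levels) of the factor.
species_levels <- levels(iris$Species)

# Counts of each level, as produced by table().
species_counts <- table(iris$Species)

# Relative frequencies: each count divided by the total number of observations.
# (prop.table() is an illustrative addition, not used in the text.)
species_props <- prop.table(species_counts)

print(species_levels)
print(species_counts)
print(species_props)
```

Counts are the natural summary when group sizes matter; proportions are easier to compare across data sets of different sizes.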