Chapter 1

Summarizing Data

It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.
Sherlock Holmes [Doy27]

Graphs are essential to good statistical analysis.
F. J. Anscombe [Ans73]

Data are the raw material of statistics.

We will organize data into a 2-dimensional schema, which we can think of as rows and columns in a spreadsheet. The rows correspond to the individuals (also called cases, subjects, or units depending on the context of the study). The columns correspond to variables. In statistics, a variable is one of the measurements made for each individual. Each individual has a value for each variable. Or at least that is our intent. Very often some of the data are missing, meaning that values of some variables are not available for some of the individuals.

How data are collected is critically important, and good statistical analysis requires that the data were collected in an appropriate manner. We will return to the issue of how data are (or should be) collected later. In this chapter we will focus on the data themselves. We will use R to manipulate data and to produce some of the most important numerical and graphical summaries of data. A more complete introduction to R can be found in Appendix A.

1.1. Data in R

Most data sets in R are stored in a structure called a data frame that reflects the 2-dimensional structure described above. A number of data sets are included with the basic installation of R. The iris data set, for example, is a famous data
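As a rough illustration of the rows-and-columns layout described above, the following sketch inspects the iris data frame with a few standard base R functions (dim, names, head, str). The small data frame toy, with its columns id and height, is invented here only to show how a missing value is recorded as NA; it does not come from the text.

```r
# iris ships with base R (package 'datasets'), so nothing needs to be loaded.
dim(iris)      # number of rows (individuals) and columns (variables)
names(iris)    # the variable names
head(iris, 3)  # the first three individuals
str(iris)      # compact overview, including the type of each variable

# A missing value is recorded as NA. 'toy' is a made-up data frame
# used only for illustration; it is not part of the text.
toy <- data.frame(
  id     = c(1, 2, 3),
  height = c(170.2, NA, 165.0)
)
is.na(toy$height)       # which individuals lack a height
sum(is.na(toy$height))  # how many values are missing
```

Each row of the output of head(iris, 3) is one individual and each column is one variable, which is the structure a data frame is meant to capture.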

Copyright 2011 American Mathematical Society.