Chapter 1

Summarizing Data

It is a capital mistake to theorize before one has data. Insensibly one
begins to twist facts to suit theories, instead of theories to suit facts.

Sherlock Holmes [Doy27]

Graphs are essential to good statistical analysis.

F. J. Anscombe [Ans73]

Data are the raw material of statistics.

We will organize data into a 2-dimensional schema, which we can think of as
rows and columns in a spreadsheet. The rows correspond to the individuals (also
called cases, subjects, or units depending on the context of the study). The
columns correspond to variables. In statistics, a variable is one of the
measurements made for each individual. Each individual has a value for each
variable. Or at least that is our intent. Very often some of the data are
missing, meaning that values of some variables are not available for some of
the individuals.
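
As a minimal sketch of this layout in R, the language used throughout this
chapter, the following builds a tiny table by hand. The data set and its
variable names (students, name, height, major) are hypothetical and invented
purely for illustration; R's NA marks a value that is not available.

```r
# Hypothetical data: three individuals (rows) measured on three variables (columns).
students <- data.frame(
  name   = c("Ana", "Ben", "Chen"),
  height = c(162.5, NA, 171.0),    # NA: Ben's height is missing
  major  = c("biology", "math", "statistics")
)

dim(students)             # number of rows (individuals), then columns (variables)
names(students)           # the variable names
is.na(students$height)    # TRUE wherever a height is missing
colSums(is.na(students))  # number of missing values in each variable
```

Each row describes one individual and each column holds one variable, so the
missing height shows up as a single NA cell rather than as a missing row.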

How data are collected is critically important, and good statistical analysis
requires that the data were collected in an appropriate manner. We will return to
the issue of how data are (or should be) collected later. In this chapter we will
focus on the data themselves. We will use R to manipulate data and to produce
some of the most important numerical and graphical summaries of data. A more
complete introduction to R can be found in Appendix A.

1.1. Data in R

Most data sets in R are stored in a structure called a data frame that reflects
the 2-dimensional structure described above. A number of data sets are included
with the basic installation of R. The iris data set, for example, is a famous data
