It is a capital mistake to theorize before one has data. Insensibly one
begins to twist facts to suit theories, instead of theories to suit facts.
Sherlock Holmes [Doy27]
Graphs are essential to good statistical analysis.
F. J. Anscombe [Ans73]
Data are the raw material of statistics.
We will organize data into a 2-dimensional schema, which we can think of as
rows and columns in a spreadsheet. The rows correspond to the individuals (also
called cases, subjects, or units depending on the context of the study). The
columns correspond to variables. In statistics, a variable is one of the measurements
made for each individual. Each individual has a value for each variable, or
at least that is our intent. Very often some of the data are missing, meaning that
values of some variables are not available for some of the individuals.
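As a minimal sketch of this layout, the following R code builds a small table by hand. The data set patients and its variables id, height, and sex are hypothetical, invented only for illustration; R marks a missing value with NA.

```r
# Hypothetical data: three individuals (rows) and three variables (columns).
patients <- data.frame(
  id     = c(1, 2, 3),
  height = c(172.5, NA, 160.2),   # NA marks a missing value
  sex    = c("F", "M", "F")
)

dim(patients)              # number of rows (individuals) and columns (variables)
is.na(patients$height)     # which individuals are missing a height value
colSums(is.na(patients))   # number of missing values in each variable
```

Each row holds one individual and each column one variable, so a missing measurement shows up as a single NA cell rather than as a missing row or column.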
How data are collected is critically important, and good statistical analysis
requires that the data were collected in an appropriate manner. We will return to
the issue of how data are (or should be) collected later. In this chapter we will
focus on the data themselves. We will use R to manipulate data and to produce
some of the most important numerical and graphical summaries of data. A more
complete introduction to R can be found in Appendix A.
1.1. Data in R
Most data sets in R are stored in a structure called a data frame that reflects
the 2-dimensional structure described above. A number of data sets are included
with the basic installation of R. The iris data set, for example, is a famous data