Contemporary Mathematics

Volume 622, 2014

http://dx.doi.org/10.1090/conm/622/12430

Principal Component Analysis (PCA) for high-dimensional data. PCA is dead. Long live PCA

Fan Yang, Kjell Doksum, and Kam-Wah Tsui

Abstract. Sample covariances and eigenvalues are famously inconsistent when the number d of variables is at least as large as the sample size n. However, when d ≫ n, genomewide association studies (GWAS) that apparently are based on principal component analysis (PCA) and use sample covariances and eigenvalues are famously successful in detecting genetic signals while controlling the probability of false discoveries. To reiterate: “PCA is dead, long live PCA,” or “PCA is the worst of methods, PCA is the best of methods.” We outline recent work (Yang, 2013) that reconciles the worst/best dichotomy by acknowledging that PCA is indeed inconsistent in many classical statistical settings, but that in settings natural to genomic studies, PCA produces effective methods. The dichotomy can in part be explained by how models are viewed and by the goal of the study being carried out. We compare the effectiveness of three PCA methods for testing the association between covariates and a response in a framework with continuous variables. These methods are based on adjusting the data using PCA and then applying Pearson, Spearman, and normal scores correlation tests.
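The inconsistency claim above can be illustrated with a minimal simulation. Assuming x ~ N(0, I_d) with d = n, every population eigenvalue equals 1, yet by the Marchenko–Pastur law the sample eigenvalues spread over approximately [0, 4]; the sketch below (not from the paper) checks this numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
n = d = 500  # dimension equal to sample size

# Population covariance is the identity: every true eigenvalue is 1.
X = rng.standard_normal((n, d))
S = (X.T @ X) / n  # sample covariance (mean known to be zero)

# Sample eigenvalues scatter far from 1: the smallest is near 0
# and the largest is near 4, as Marchenko-Pastur predicts for d/n = 1.
eigvals = np.linalg.eigvalsh(S)
print(f"min = {eigvals.min():.3f}, max = {eigvals.max():.3f}")
```

The average sample eigenvalue is still close to 1 (it equals trace(S)/d), so the failure is in the individual eigenvalues, not in the overall scale.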

1. Introduction

Because of the importance of the covariance matrix Σ and its eigenvalues to statistical analysis, their accurate estimation is an important goal in statistics. With high-dimensional data, where the dimension d of the random vector x is at least as large as the sample size n, the sample covariance matrix S may fail to be consistent. Because large data sets are becoming common, this is an important problem. A number of recent articles that address the problem of constructing consistent estimates of Σ in the d ≥ n case start by referring to the inconsistency of the sample covariance S. A typical example is “It is now well understood that in such a setting the standard sample covariance matrix does not provide satisfactory performance and regularization is needed.” (Cai and Zhou, 2012). Other articles that start by referring to S as unsatisfactory and address the large-d problem using regularization methods such as banding, tapering, thresholding, shrinking, and penalization are by Wu and Pourahmadi (2003), Zou, Hastie, and Tibshirani (2006), Bickel and Levina (2008a, 2008b), El Karoui (2008), Amini and Wainwright (2009), Cai,

2010 Mathematics Subject Classification. Primary 62H25.

Key words and phrases. Eigenstrat, Eigensoft, GWAS, rank tests, stratification, dual principal components.

© 2014 American Mathematical Society
