Contemporary Mathematics
Volume 622, 2014
Principal Component Analysis (PCA) for high-dimensional
data. PCA is dead. Long live PCA
Fan Yang, Kjell Doksum, and Kam-Wah Tsui
Abstract. Sample covariances and eigenvalues are famously inconsistent when the number d of variables is at least as large as the sample size n. However, when d ≫ n, genomewide association studies (GWAS) that apparently are based on principal component analysis (PCA) and use sample covariances and eigenvalues are famously successful in detecting genetic signals while controlling the probability of false discoveries. To reiterate: "PCA is dead, long live PCA," or "PCA is the worst of methods, PCA is the best of methods." We outline recent work (Yang, 2013) that reconciles the worst/best dichotomy by acknowledging that PCA is indeed inconsistent in many classical statistical settings, but that in settings natural to genomic studies, PCA produces effective methods. The dichotomy can in part be explained by how models are viewed and by the goal of the study being carried out. We compare the effectiveness of three PCA methods for testing the association between covariates and a response in a framework with continuous variables. These methods are based on adjusting the data using PCA and then applying Pearson, Spearman, and normal-scores correlation tests.
1. Introduction
Because of the importance of the covariance matrix Σ and its eigenvalues to statistical analysis, their accurate estimation is a central goal in statistics. With high-dimensional data, where the dimension d of the random vector x is at least as large as the sample size n, the sample covariance matrix S may fail to be consistent.
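As an illustration (not taken from the paper), a short simulation shows what inconsistency looks like in the simplest case: with Σ = I_d and d = n, every true eigenvalue equals 1, yet the eigenvalues of S spread over roughly the interval (0, 4), the Marchenko–Pastur limit for this aspect ratio. The sample size, dimension, and seed below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 500                      # sample size equal to the dimension
X = rng.standard_normal((n, d))      # rows are i.i.d. N(0, I_d) vectors
S = X.T @ X / n                      # sample covariance (mean known to be 0)
eigvals = np.linalg.eigvalsh(S)      # eigenvalues in ascending order

print("true eigenvalues: all 1.0")
print(f"sample eigenvalues: min {eigvals.min():.3f}, max {eigvals.max():.3f}")
# Typically the largest sample eigenvalue is near 4 and the smallest near 0,
# even though every eigenvalue of Sigma equals 1; increasing n with d = n
# fixed in proportion does not remove the spread.
```

The spread does not shrink as n grows when d grows with it, which is the sense in which S and its eigenvalues are inconsistent in the high-dimensional regime.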
Because large data sets are becoming common, this is an important problem. A
number of recent articles that address the problem of constructing consistent estimates of Σ in the d ≥ n case start by referring to the inconsistency of the sample
covariance S. A typical example is “It is now well understood that in such a setting
the standard sample covariance matrix does not provide satisfactory performance
and regularization is needed.” (Cai and Zhou, 2012). Other articles that start
by referring to S as unsatisfactory and address the large-d problem using regularization methods such as banding, tapering, thresholding, shrinking, and penalization
are by Wu and Pourahmadi (2003), Zou, Hastie, and Tibshirani (2006), Bickel and
Levina (2008a, 2008b), El Karoui (2008), Amini and Wainwright (2009), Cai,
2010 Mathematics Subject Classification. Primary 62H25.
Key words and phrases. Eigenstrat, Eigensoft, GWAS, rank tests, stratification, dual principal components.
© 2014 American Mathematical Society