Zhang, and Zhou (2010), Lam and Fan (2009), Johnstone and Lu (2009), Ahmed
and Raheem (2012), Ma (2012), and Deng and Tsui (2013), among others.
On the other hand, in genomics, PCA based on sample covariance matrices
and their eigenvalues has been used to construct effective tests of association
between genetic marker scores and disease indicators. One collection of genome-wide
association studies (GWAS) based on the methodology “Eigenstrat,” or its updated
and expanded version “Eigensoft,” started with the papers by Price et al. (2006)
and Patterson et al. (2006). For a statistical examination of GWAS for case-control
studies, see Lin and Zeng (2011).
The discrepancy between PCA in High Dimensional Data Analysis (HDDA)
being “unsatisfactory” in statistics and “effective” in genomics can be explained
by the phrase “in such a setting” in the Cai and Zhou quote above. Here we
examine settings where PCA is effective. In particular, we show that PCA is
effective in HDDA when (i) the response vector is split into a low-dimensional
vector containing the responses of initial interest and a high-dimensional vector of
potentially confounding covariates, and (ii) the sample is drawn from a population
made up of unknown subpopulations or strata and this population stratification
has the potential to create confounding variables that lead to spurious correlation
between a response and predictors.
Sections 2, 3, and 4 provide a summary of our framework taken from
Yang (2013). Section 5 uses simulations to demonstrate and compare the effectiveness
of these PCA methods.
2. Association regression models based on PCA
2.1. Principal components. Population PCA for the random vector $\mathbf{x} = (X_1, \ldots, X_d)^T$ first produces a measure of the variability of $\mathbf{x}$ by finding the linear combination $\mathbf{e}^T \mathbf{x}$ that has maximal normalized variance $\operatorname{Var}(\mathbf{e}^T \mathbf{x}) / \|\mathbf{e}\|^2$. Let $\Sigma$ denote the covariance matrix of $\mathbf{x}$; then $\mathbf{e}_1$, the first eigenvector, is
$$(2.1)\qquad \mathbf{e}_1 = \operatorname*{argmax}\{\mathbf{e}^T \Sigma \mathbf{e} : \|\mathbf{e}\| = 1\},$$
and the first eigenvalue and the first principal component (PC1) are
$$\lambda_1 = \mathbf{e}_1^T \Sigma \mathbf{e}_1, \qquad \mathrm{PC}_1 = \mathbf{e}_1^T \mathbf{x}.$$
The second eigenvector $\mathbf{e}_2$, second eigenvalue $\lambda_2$, and second PC are obtained in the same way except $\mathbf{e}_2$ is found by maximizing (2.1) over $\mathbf{e}$ orthogonal to $\mathbf{e}_1$. To obtain $\mathbf{e}_k$, $\lambda_k$, and $\mathrm{PC}_k$, (2.1) is maximized over $\mathbf{e}$ orthogonal to $\mathbf{e}_1, \ldots, \mathbf{e}_{k-1}$. This process produces the principal components $\mathrm{PC}_1, \ldots, \mathrm{PC}_d$, which capture much of the variability of $\mathbf{x}$ in the sense that $\operatorname{Var}(\mathrm{PC}_j) = \lambda_j$ and $\sum_{j=1}^{d} \lambda_j = \sum_{j=1}^{d} \operatorname{Var}(X_j)$.
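To make the construction concrete, the eigenvectors, eigenvalues, and principal components above can be computed by an eigendecomposition of a covariance matrix. The following minimal NumPy sketch, with simulated data standing in for $\mathbf{x}$, is our own illustration (not code from the paper); it also checks the identity $\sum_j \lambda_j = \sum_j \operatorname{Var}(X_j)$:

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated n x d sample standing in for observations of x
X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))

# Sample covariance matrix as an estimate of Sigma
Sigma_hat = np.cov(X, rowvar=False)

# eigh returns eigenvalues in ascending order; reverse so that
# lambda_1 >= lambda_2 >= ... >= lambda_d
eigvals, eigvecs = np.linalg.eigh(Sigma_hat)
order = np.argsort(eigvals)[::-1]
lambdas = eigvals[order]   # lambda_j = Var(PC_j)
E = eigvecs[:, order]      # column j is the eigenvector e_j

# Principal components: PC_j = e_j^T x for each (centered) observation
PCs = (X - X.mean(axis=0)) @ E

# The eigenvalues sum to the total variance: sum_j lambda_j = sum_j Var(X_j)
assert np.isclose(lambdas.sum(), np.trace(Sigma_hat))
```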
2.2. Regression and association studies. Suppose $Y$ is a response variable and that $\mathbf{x} \in \mathbb{R}^d$ is a random predictor. In association studies, the null hypothesis $H_{0k}$ that $Y$ and $X_k$ are independent is tested for one $X_k$ at a time. Thus what is needed is a test statistic $T_k$ whose null distribution is known, at least asymptotically, when the null hypothesis $H_{0k}$ holds, $k = 1, \ldots, d$. In this framework, $\mathbf{x}_{-k} = \{X_j : 1 \le j \le d,\ j \ne k\}$ are confounding variables that could lead to spurious correlation between $X_k$ and $Y$. Linear analysis based on the linear model
$$(2.2)\qquad Y = \alpha_k + \beta_k X_k + \sum_{j \ne k} \beta_j X_j + \varepsilon$$
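One standard choice of $T_k$ in this setting is the t-statistic for $\beta_k$ from an ordinary least squares fit of model (2.2); under $H_{0k}$ (with Gaussian errors) it has a $t_{n-d-1}$ null distribution. The sketch below is our own NumPy/SciPy illustration with simulated data, not the authors' implementation:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, d = 200, 5
X = rng.standard_normal((n, d))
y = 0.5 * X[:, 0] + rng.standard_normal(n)   # only X_1 is truly associated

# Fit model (2.2) by ordinary least squares: intercept plus all d predictors
Z = np.column_stack([np.ones(n), X])
beta, _, _, _ = np.linalg.lstsq(Z, y, rcond=None)
resid = y - Z @ beta
sigma2 = resid @ resid / (n - d - 1)          # residual variance estimate
cov_beta = sigma2 * np.linalg.inv(Z.T @ Z)    # covariance of the OLS estimates

# Test H_0k: beta_k = 0 for each predictor, one X_k at a time
for k in range(d):
    t_k = beta[k + 1] / np.sqrt(cov_beta[k + 1, k + 1])   # statistic T_k
    p_k = 2 * stats.t.sf(abs(t_k), df=n - d - 1)          # t null distribution
    print(f"X_{k + 1}: t = {t_k:.2f}, p = {p_k:.3f}")
```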