does not provide stable estimates unless a sample of size $n \gg d$ is available. This
has led to the introduction of shrinkage methods, penalty methods and methods
based on models with sparse covariance matrices. Some of the references can be
found in Section 1.
In this paper we consider the case where confounding is due to population
stratification and use PCA applied to $x_{-k}$ to correct for such stratification. In
particular, the $\sum_{j \neq k} \beta_j X_j$ term in (2.2) will be replaced by a sum $\sum_{j=1}^{q} \eta_j Z_j$, where
the $Z_j$ represent principal components based on $x_{-k}$ and $q \leq 10$. To find the $Z_j$, we
use dual PCA, which we introduce in the next section. Under certain assumptions,
these $Z_j$'s are effective indicators of which stratum an individual belongs to.
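As a concrete illustration, the sketch below is not the authors' implementation; it anticipates the dual covariance matrix $\Sigma_D$ of Section 3, and the names `y`, `X`, `k`, `q` and both helper functions are assumptions made here. It forms the scores $Z_1,\ldots,Z_q$ from $x_{-k}$ and enters them in a least-squares fit alongside $X_k$.

```python
import numpy as np

# Minimal sketch (illustrative names, not the authors' code): replace the
# unobserved confounder term with the top-q dual-PCA scores Z_1, ..., Z_q
# built from x_{-k}, then fit the adjusted regression by least squares.

def dual_pca_scores(X_mk, q):
    """Top-q eigenvectors of the n x n dual covariance matrix of X_mk."""
    d = X_mk.shape[1]
    Xc = X_mk - X_mk.mean(axis=0)                 # column-centre
    Sigma_D = (Xc @ Xc.T) / d                     # dual covariance, see (3.1)
    vals, vecs = np.linalg.eigh(Sigma_D)          # eigenvalues in ascending order
    return vecs[:, np.argsort(vals)[::-1][:q]]    # row i holds Z_1(x_i), ..., Z_q(x_i)

def adjusted_effect(y, X, k, q=10):
    """Coefficient of X_k when sum_j eta_j Z_j replaces the X_{-k} term."""
    Z = dual_pca_scores(np.delete(X, k, axis=1), q)
    D = np.column_stack([np.ones(len(y)), X[:, k], Z])
    beta, *_ = np.linalg.lstsq(D, y, rcond=None)
    return beta[1]
```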
3. Dual eigenanalysis and models for stratified populations
3.1. Stratified populations. A stratified population with $K$ strata or subpopulations
$S_1,\ldots,S_K$ is such that when one member of the population is selected, the
probability that the member is from subpopulation $S_k$ is $\pi_k$, where
$\sum_{k=1}^{K} \pi_k = 1$, $\pi_k > 0$, $1 \leq k \leq K$. Consider $n$ independent draws and let $N_k$ be
the number of draws from $S_k$; then $N = (N_1,\ldots,N_K)^T$ follows the multinomial
distribution $\mathrm{MN}(n, \pi_1,\ldots,\pi_K)$, where $\sum_{k=1}^{K} N_k = n$. This strata information is not
available. Instead we have $n$ independent draws from a population that contains $K$
unknown strata. That is, $K$ and $N$ are unobservable.
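As a small illustration only (an assumed simulation, not taken from the paper; the strata means `mu` are hypothetical), such data can be generated as follows; note that only `X` would be available to the analyst.

```python
import numpy as np

# Assumed illustrative setup (not from the paper): n draws from K strata with
# probabilities pi; the strata labels and counts N_1, ..., N_K are latent.
rng = np.random.default_rng(0)
n, K, d = 200, 3, 50
pi = np.array([0.5, 0.3, 0.2])                 # pi_k > 0, summing to 1
mu = rng.normal(size=(K, d))                   # hypothetical strata mean vectors

labels = rng.choice(K, size=n, p=pi)           # stratum of each independent draw
N = np.bincount(labels, minlength=K)           # N = (N_1, ..., N_K) ~ MN(n, pi)
X = mu[labels] + rng.normal(size=(n, d))       # what we actually observe
```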
Consider a random vector $(X_1,\ldots,X_d)^T$ whose covariance matrix $\Sigma$ is assumed
to exist. We have available an $n \times d$ random data matrix $X = (X_{ij})_{n \times d}$ where the
random vectors $x_i = (X_{i1},\ldots,X_{id})^T$, $1 \leq i \leq n$, are independent and identically
distributed. When the $x_i$ are drawn from a stratified population, the major variability
of $X = (X_{ij})$ as we change $i$ is due to this stratification, and this variability
can be examined by considering the $n \times n$ dual covariance matrix defined by
$$(3.1) \qquad \Sigma_D = d^{-1}(X - \bar{X})(X - \bar{X})^T,$$
where $X - \bar{X}$ is the $n \times d$ matrix with entries $(X_{ij} - \bar{X}_j)$, and $\bar{X}_j = n^{-1}\sum_{i=1}^{n} X_{ij}$.
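A minimal sketch of (3.1), assuming a stand-in data matrix `X`:

```python
import numpy as np

# Direct transcription of (3.1); X is an assumed stand-in data matrix.
rng = np.random.default_rng(1)
n, d = 100, 5000                   # HDDA setting: d much larger than n
X = rng.normal(size=(n, d))

X_bar = X.mean(axis=0)             # column means bar{X}_j
Xc = X - X_bar                     # entries X_ij - bar{X}_j
Sigma_D = (Xc @ Xc.T) / d          # n x n dual covariance matrix
assert Sigma_D.shape == (n, n)     # much smaller than the d x d covariance matrix
```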
To interpret $\Sigma_D$, let $W^{(d)} = (W_1^{(d)},\ldots,W_n^{(d)})^T$ be the result of one random
draw from the collection of $n$-vectors $\{(X_{1j},\ldots,X_{nj})^T : 1 \leq j \leq d\}$. Then
$$\Sigma_D = \mathrm{Cov}\bigl(W^{(d)} - \bar{W}^{(d)}\bigr) = E\bigl[(W^{(d)} - \bar{W}^{(d)})(W^{(d)} - \bar{W}^{(d)})^T\bigr],$$
where $\bar{W}^{(d)} = \bigl(n^{-1}\sum_{i=1}^{n} W_i^{(d)}\bigr)\mathbf{1}$ and $\mathbf{1}$ is an $n$-vector of 1's.
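This identity can be checked numerically; in the sketch below (assumed data), the uniform draw over columns turns the expectation into an average of the $d$ centred columns' outer products:

```python
import numpy as np

# Check of the identity above with an assumed data matrix: drawing the column
# index uniformly from {1, ..., d} makes the expectation an average over the
# d centred columns of their outer products, which reproduces Sigma_D exactly.
rng = np.random.default_rng(2)
n, d = 30, 400
X = rng.normal(size=(n, d))

Xc = X - X.mean(axis=0)             # column j of Xc is one realisation of W - W_bar
Sigma_D = (Xc @ Xc.T) / d           # dual covariance from (3.1)
avg_outer = np.mean([np.outer(Xc[:, j], Xc[:, j]) for j in range(d)], axis=0)
assert np.allclose(Sigma_D, avg_outer)
```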
It is known that $\Sigma_D$ has the same nonzero eigenvalues, up to a constant $d/n$,
as the usual covariance matrix $\Sigma = n^{-1}(X - \bar{X})^T(X - \bar{X})$.
There is another simple relationship between PCA of $\Sigma$ and $\Sigma_D$: let $\hat{\lambda}_q > 0$ be the
$q$th largest eigenvalue of $\Sigma$; then the $q$th principal component of $\Sigma$ evaluated at $x_i$
equals the $i$th entry of the $q$th eigenvector of $\Sigma_D$, up to a constant.
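Both facts are easy to verify numerically; the sketch below (assumed Gaussian data, not from the paper) compares the nonzero eigenvalues of $\Sigma_D$ and $\Sigma$ and checks the first principal component against the leading eigenvector of $\Sigma_D$:

```python
import numpy as np

# Numerical check of both relationships on an assumed Gaussian data matrix.
rng = np.random.default_rng(3)
n, d = 40, 300
X = rng.normal(size=(n, d))
Xc = X - X.mean(axis=0)

Sigma = (Xc.T @ Xc) / n                        # usual covariance matrix, d x d
Sigma_D = (Xc @ Xc.T) / d                      # dual covariance matrix, n x n

# Nonzero eigenvalues agree up to the constant d/n.
ev = np.sort(np.linalg.eigvalsh(Sigma))[::-1][:n]
ev_D = np.sort(np.linalg.eigvalsh(Sigma_D))[::-1][:n]
assert np.allclose(ev, (d / n) * ev_D)

# The first principal component evaluated at x_1, ..., x_n is proportional,
# entry by entry, to the leading eigenvector of Sigma_D.
v = np.linalg.eigh(Sigma)[1][:, -1]            # top eigenvector of the covariance
u = np.linalg.eigh(Sigma_D)[1][:, -1]          # top eigenvector of Sigma_D
scores = Xc @ v                                # PC scores (x_i - x_bar)^T v
c = scores @ u                                 # proportionality constant
assert np.allclose(scores, c * u)
```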
One advantage of ΣD is that in HDDA its dimension n × n is much smaller
than the dimension d × d of Σ. Another advantage is that if we explicitly model
stratification, then we find that even though ΣD is computed without using strata
information, a conditional eigenanalysis of ΣD reveals the unknown population