PCA FOR HIGH-DIMENSIONAL DATA. PCA IS DEAD. LONG LIVE PCA

does not provide stable estimates unless a sample of size n ≫ d is available. This has led to the introduction of shrinkage methods, penalty methods, and methods based on models with sparse covariance matrices. Some of the references can be found in Section 1.

In this paper we consider the case where confounding is due to population stratification and use PCA applied to x−k to correct for such stratification. In particular, the ∑_{j≠k} βjXj term in (2.2) will be replaced by a sum ∑_{j=1}^q ηjZj, where the Zj represent principal components based on x−k and q ≤ 10. To find the Zj, we use dual PCA, which we introduce in the next section. Under certain assumptions, these Zj are effective indicators of which stratum an individual belongs to.
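As a rough illustration of this substitution, the top q principal component scores of x−k can be computed and used as additional regressors. This is only a sketch: the sample sizes, the toy response, and all variable names below are hypothetical, and the paper's model (2.2) is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, q, k = 100, 500, 5, 0                # illustrative sizes, q <= 10
X = rng.normal(size=(n, d))
y = X[:, k] + rng.normal(size=n)           # toy response depending on X_k

X_minus_k = np.delete(X, k, axis=1)        # the data for x_{-k}: drop column k
Xc = X_minus_k - X_minus_k.mean(axis=0)    # center each column

# Principal component scores via SVD (equivalent to an eigenanalysis
# of the covariance matrix of the centered data)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = U[:, :q] * s[:q]                       # n x q matrix of scores Z_1, ..., Z_q

# Regress y on X_k and the q scores; the eta_j coefficients on Z
# absorb the stratification effect
design = np.column_stack([np.ones(n), X[:, k], Z])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
```

The choice of q is discussed later in the paper; here it is fixed at an arbitrary small value for illustration.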

3. Dual eigenanalysis and models for stratified populations

3.1. Stratified populations. A stratified population with K strata or subpopulations S1, ..., SK is such that when one member of the population is selected, the probability that the member is from subpopulation Sk is πk, where ∑_{k=1}^K πk = 1 and πk > 0, 1 ≤ k ≤ K. Consider n independent draws and let Nk be the number of draws from Sk; then N = (N1, ..., NK)^T follows the multinomial distribution MN(n, π1, ..., πK), where ∑_{k=1}^K Nk = n. This strata information is not available. Instead we have n independent draws from a population that contains K unknown strata. That is, K and N are unobservable.
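For concreteness, the sampling scheme above is easy to simulate. The values of K, n, and the πk below are illustrative only; in the actual problem the counts N are unobserved.

```python
import numpy as np

rng = np.random.default_rng(3)
pi = np.array([0.5, 0.3, 0.2])   # hypothetical stratum probabilities, K = 3
n = 1000                         # number of independent draws

# N = (N_1, ..., N_K)^T follows MN(n, pi_1, ..., pi_K)
N = rng.multinomial(n, pi)
```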

Consider a random vector (X1, ..., Xd)^T whose covariance matrix Σ is assumed to exist. We have available an n × d random data matrix X = (Xij)_{n×d} where the random vectors xi = (Xi1, ..., Xid)^T, 1 ≤ i ≤ n, are independent and identically distributed. When the xi are drawn from a stratified population the major variability of X = (Xij) as we change i is due to this stratification, and this variability can be examined by considering the n × n dual covariance matrix defined by

(3.1)  ΣD = d^{-1}(X − X̄)(X − X̄)^T,

where X − X̄ is the n × d matrix with entries (Xij − X̄j), and X̄j = n^{-1} ∑_{i=1}^n Xij.
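A minimal sketch of (3.1) in code, with illustrative sizes n = 50 and d = 1000; the centering subtracts each column mean X̄j.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 1000                  # illustrative: many more variables than samples
X = rng.normal(size=(n, d))

Xc = X - X.mean(axis=0)          # entries X_ij minus the column means
Sigma_D = (Xc @ Xc.T) / d        # n x n dual covariance, d^{-1}(X - Xbar)(X - Xbar)^T
```

Note that ΣD is n × n regardless of how large d is, which is the computational point made at the end of this section.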

To interpret ΣD, let W^{(d)} = (W1^{(d)}, ..., Wn^{(d)})^T be the result of one random draw from the collection of n-vectors {(X1j, ..., Xnj)^T : 1 ≤ j ≤ d}. Then

ΣD = Cov(W^{(d)} − W̄^{(d)}) ≡ E[(W^{(d)} − W̄^{(d)})(W^{(d)} − W̄^{(d)})^T],

where W̄^{(d)} = (n^{-1} ∑_{i=1}^n Wi^{(d)}) 1 and 1 is an n-vector of 1's.
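The interpretation above says that ΣD is the average, over the d column n-vectors, of the centered outer products; this is easy to check numerically (a sketch with illustrative sizes).

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 20, 80                        # illustrative sizes
X = rng.normal(size=(n, d))
Xc = X - X.mean(axis=0)
Sigma_D = (Xc @ Xc.T) / d            # dual covariance as in (3.1)

# Average over the d columns of (w - wbar 1)(w - wbar 1)^T, where w is
# one column n-vector and wbar its mean -- the expectation in the display
S = np.zeros((n, n))
for j in range(d):
    wc = X[:, j] - X[:, j].mean()
    S += np.outer(wc, wc)
S /= d
```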

It is known that ΣD has the same nonzero eigenvalues, up to the constant d/n, as the usual sample covariance matrix

Σ̂ = n^{-1}(X − X̄)^T (X − X̄).

There is another simple relationship between PCA of Σ̂ and ΣD: let λ̂q > 0 be the qth largest eigenvalue of Σ̂; then the qth principal component of Σ̂ evaluated at xi equals the ith entry of the qth eigenvector of ΣD, up to a constant.
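Both facts can be verified numerically. The sketch below is not from the paper; the sizes are illustrative, and the centered matrix has rank n − 1, so only the top n − 1 eigenvalues are compared.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 30, 200
X = rng.normal(size=(n, d))
Xc = X - X.mean(axis=0)

Sigma_hat = (Xc.T @ Xc) / n          # usual d x d sample covariance
Sigma_D = (Xc @ Xc.T) / d            # n x n dual covariance

# Nonzero eigenvalues agree up to the factor d/n
ev_hat = np.linalg.eigvalsh(Sigma_hat)[::-1][:n - 1]
ev_D = np.linalg.eigvalsh(Sigma_D)[::-1][:n - 1]

# The first PC of Sigma_hat evaluated at the x_i is, up to a constant,
# the leading eigenvector of Sigma_D
_, V = np.linalg.eigh(Sigma_hat)
_, U = np.linalg.eigh(Sigma_D)
scores = Xc @ V[:, -1]               # PC scores (eigh sorts ascending)
u1 = U[:, -1]                        # leading eigenvector of Sigma_D, unit norm
cos = abs(scores @ u1) / np.linalg.norm(scores)
```

Here `cos` is the absolute cosine between the score vector and the eigenvector, which equals 1 when the two are proportional.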

One advantage of ΣD is that in HDDA its dimension n × n is much smaller

than the dimension d × d of Σ. Another advantage is that if we explicitly model

stratification, then we find that even though ΣD is computed without using strata

information, a conditional eigenanalysis of ΣD reveals the unknown population