### 1 Introduction

Principal Components Analysis (PCA) and random matrix theory (RMT) have become widespread tools for data analysis. PCA (Jolliffe (2002) [6]) provides a mathematical and objective approach to extract economic information from the correlation matrices of asset returns. In this approach, the analyst extracts common risk factors from the eigenvectors and eigenvalues of the correlation matrix.

The first eigenvector of the correlation matrix of stock returns corresponds to the solution of the variational problem

Here, $R$ is the correlation matrix of daily returns and $\|\cdot\|$ is the Euclidean norm in $\mathbb{R}^{n}$, $n$ being the total number of assets. Eq. (1) shows that the principal eigenvector represents the direction (line) which “captures the most variance” as described by the correlation matrix. The first eigenvector satisfies

PCA also recursively finds additional (orthogonal) directions beyond ${V}^{\left(1\right)}$ which capture the most variance. The other eigenvectors and eigenvalues are computed in the same way as in Eq. (1), with the maximization restricted to the subspace orthogonal to the space spanned by the eigenvectors computed previously, i.e.,

The eigenvalues satisfy ${\lambda}^{\left(1\right)}>{\lambda}^{\left(2\right)}\ge ...\ge {\lambda}^{\left(n\right)}$ . Assume that the data corresponds to the daily returns of a group of stocks. The Karhunen-Loeve representation of the standardized returns is

where

By construction, the ${F}^{\left(k\right)}$ are uncorrelated and have variance 1. Since these random variables are linear combinations of the daily standardized returns of the assets, we call them (standardized) “eigenportfolio (EP) returns”, with the caveat that the actual portfolio “weights” are obtained by dividing each entry of the eigenvector by the volatility of the asset (Avellaneda and Lee 2008, 2010) [1].^{2}
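The displays referred to in the text up to this point can be sketched as follows. This is a reconstruction from the surrounding definitions, not the authors' typesetting: the symbols $\phi$ (a unit test vector) and $X_j$ (the standardized return of asset $j$) are assumed names, the transpose is written $\phi^{t}$ as in the proof of Proposition 1, and the numbering is inferred from the references to Eq. (1) and Eq. (4) in the text.

$$
\begin{aligned}
V^{(1)} &= \underset{\|\phi\|=1}{\arg\max}\ \phi^{t} R\,\phi, && (1)\\
R\,V^{(1)} &= \lambda^{(1)} V^{(1)}, && (2)\\
V^{(k)} &= \underset{\|\phi\|=1,\ \phi\perp V^{(1)},\ldots,V^{(k-1)}}{\arg\max}\ \phi^{t} R\,\phi, && (3)\\
X_j &= \sum_{k=1}^{n} \sqrt{\lambda^{(k)}}\, V_j^{(k)} F^{(k)}, && (4)\\
F^{(k)} &= \frac{1}{\sqrt{\lambda^{(k)}}} \sum_{j=1}^{n} V_j^{(k)} X_j. && (5)
\end{aligned}
$$

With this normalization (which presumes $\lambda^{(k)}>0$), $\mathrm{Var}(F^{(k)}) = V^{(k)t} R\, V^{(k)} / \lambda^{(k)} = 1$ and $\mathrm{Cov}(F^{(k)},F^{(l)})=0$ for $k\neq l$, which is the property stated in the text.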

PCA is a framework for learning about the common factors which affect the returns of a given group of assets. The first eigenportfolio, associated with the r.v. ${F}^{\left(1\right)}$ , is a common risk factor which explains the maximum variability. We can write a one-factor model for each asset, namely

where ${\beta}_{j}$ is the regression coefficient of the standardized return on the first EP. The “residuals” ${\epsilon}_{j}$ in Eq. (6) are uncorrelated with ${F}^{\left(1\right)}$. However, they are generally correlated across different stocks.

The regression coefficients satisfy

In the case of economic data, which are noisy, the consensus is to disregard EPs which correspond to low eigenvalues. In a celebrated paper, Laloux *et al.* (2000) [7] proposed to use random matrix theory (RMT) to establish a cutoff in the number of EPs used to model the standardized returns, namely

where ${\beta}_{j}^{\left(k\right)}$ are “factor loadings” and (with a slight abuse of notation) ${\epsilon}_{j}$ are residuals obtained after “defactoring” relative to the $m$ eigenportfolios. The number $m$ is a cutoff which is to be determined from the context.
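The one-factor and $m$-factor models described above can be written out as a sketch. The symbol $X_j$ for the standardized return of asset $j$ is an assumed name, the numbering is inferred from the references to Eq. (6) and Eq. (8) in the text, and the relation shown as (7) is one identity that the regression coefficients satisfy under the unit-variance normalization of $F^{(1)}$; the original display may have stated a different one.

$$
\begin{aligned}
X_j &= \beta_j F^{(1)} + \epsilon_j, && (6)\\
\beta_j &= \mathrm{Corr}\big(X_j, F^{(1)}\big) = \sqrt{\lambda^{(1)}}\, V_j^{(1)}, && (7)\\
X_j &= \sum_{k=1}^{m} \beta_j^{(k)} F^{(k)} + \epsilon_j, \qquad \beta_j^{(k)} = \sqrt{\lambda^{(k)}}\, V_j^{(k)}. && (8)
\end{aligned}
$$

The identity in (7) follows from $\mathrm{Cov}(X_j,F^{(1)}) = \frac{1}{\sqrt{\lambda^{(1)}}}\sum_i R_{ji}V_i^{(1)} = \sqrt{\lambda^{(1)}}\,V_j^{(1)}$ together with $\mathrm{Var}(F^{(1)})=1$.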

According to [7], the eigenvalues of a pure noise matrix follow the Marcenko-Pastur distribution and have a spectrum which, for large matrices, is asymptotically bounded from above by ${\lambda}^{+,MP}={(1+\sqrt{n/T})}^{2}$, where $T$ is the number of observations. The asymptotics hold in the limit $n/T\to \gamma$ (a constant) as $n$ and $T$ both tend to infinity. To use RMT to calculate the cutoff, one constructs the residual correlation matrix ${R}_{i,j}^{\left(m\right)}=Corr({\epsilon}_{i},{\epsilon}_{j})$ for $m$ large enough and verifies that its top eigenvalue is of the order of ${\lambda}^{+,MP}$. One can also compare the empirical distribution of eigenvalues with the Marcenko-Pastur probability distribution.
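The bound can be illustrated numerically. The following is a minimal sketch, assuming NumPy and i.i.d. Gaussian noise; the function names `mp_upper_edge` and `top_noise_eigenvalue` and the sizes used are hypothetical choices for illustration, not part of the original study.

```python
import numpy as np


def mp_upper_edge(n_assets: int, n_obs: int) -> float:
    """Upper edge (1 + sqrt(n/T))^2 of the Marcenko-Pastur spectrum."""
    return (1.0 + np.sqrt(n_assets / n_obs)) ** 2


def top_noise_eigenvalue(n_assets: int, n_obs: int, seed: int = 0) -> float:
    """Largest eigenvalue of the sample correlation matrix of pure noise."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((n_obs, n_assets))  # T x n matrix of i.i.d. returns
    corr = np.corrcoef(noise, rowvar=False)         # n x n sample correlation
    return float(np.linalg.eigvalsh(corr)[-1])      # eigvalsh sorts ascending


# Illustrative sizes (assumed): n = 100 assets, T = 1000 observations.
n_assets, n_obs = 100, 1000
edge = mp_upper_edge(n_assets, n_obs)
top = top_noise_eigenvalue(n_assets, n_obs)
print(f"MP upper edge: {edge:.3f}, top eigenvalue of simulated noise: {top:.3f}")
```

In a denoising workflow, the same comparison would be applied to the residual correlation matrix for successive values of $m$; at finite $n$ and $T$ the top eigenvalue only approximates the asymptotic edge.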

PCA aided by RMT is an elegant approach to analyzing correlation matrices of financial data and can also be applied to many areas of science. The main strength of the method is that it can detect common risk factors based on a matrix of asset returns, without any additional information. In other words, PCA “lets the data speak for itself”. Generally speaking, PCA explains the most variability with the smallest number of factors. Most studies tend to justify the PCA approach by recognizing that it produces some factors which have *ex-post* economic interpretations, such as equating $E{P}^{\left(1\right)}$ with the Sharpe Market Portfolio (Boyle 2017) [2], or attempt to interpret higher-order EPs in terms of industry sectors [1]. In the case of fixed income, the EPs are often identified with “parallel shifts”, or with long-term vs short-term oscillations of the yield curve (Litterman and Scheinkman, 1991) [8].

### 2 The identification problem

One of the frequent criticisms of PCA in finance is that the common risk factors generated by higher-order eigenportfolios (those other than the first eigenportfolio) are difficult to interpret and appear to be unstable across time. We call this the *identification problem*. Because of it, many portfolio managers favor traditional factor models such as Barra; see Shkolnik et al. (2016) [9] for alternative approaches to model financial correlations.

The identification problem in PCA reflects the uncertainty, or unreliability, of cross-asset correlations. From a practical point of view, as the size of the trading universe increases, the correlations of assets which are not economically related (a tech stock with an energy stock, or with a foreign stock) are difficult to quantify and may be noisy. This could be due to several reasons: the lack of an “explanation” for the relation between the stocks, prices that are not sampled simultaneously (e.g. end-of-day prices in different time zones), or a number of observations that is not large compared to the number of assets considered. For example, empirical correlations of price changes of out-of-the-money options with different underlying assets may not be as reliable or significant as the data would suggest.

To mitigate the identification problem, we should seek a factor model which can recognize the economic nature or function of the asset as well as the statistical properties of returns. This leads us to the model described hereafter.

### 3 Hierarchical PCA

Hierarchical PCA (HPCA) applies to markets which can be partitioned into several sectors or asset classes. Consider first an abstract market, in which the empirical data matrix of asset returns, with dimensions $T\times n$, can be partitioned into “blocks of columns” labeled $k=1,2,\ldots,b$. Block $k$ has dimensions $T\times {n}_{k}$. Each block represents data sampled from a sector. For simplicity, we assume that the indices of the securities are organized so that the blocks are adjacent to one another in the matrix and do not overlap. We have a few concrete situations in mind:

- The blocks represent data of industry sectors for equities in the same economy (e.g. sectors associated with the 500 or so stocks in the S&P 500 index). In this case, the columns of a block correspond to the historical standardized returns of the stocks in the sector observed over $T$ consecutive dates.
- Each block represents a stock or index and all of the derivatives written on it. In this case, the columns in a block represent the returns of the stock and the changes of the implied volatilities of options with different strikes and tenors written on the stock (Dobi 2015 [4]).
- In the context of credit derivatives, the data represents changes in credit spreads for CDS. The blocks correspond to CDS referencing the same obligor (issuer) but with different tenors (Cont and Kan (2011) [3], Ivanov (2017) [5]).

Define the function $I\left(j\right)=k$ if asset $j$ is in block $k$ . According to Eq. (4) we can write, for each asset in the “big universe”,

where ${\beta}_{j}$ is the regression coefficient of the returns of asset $j$ on the first factor of block $I\left(j\right)$ and ${\epsilon}_{j}$ is the residual.

We shall make the following assumption (“HPCA assumption”):

The assumption states that residuals are uncorrelated if their assets belong to different sectors. Eq. (9) defines the asset statistics within each block exactly, and the model is completed by specifying the joint statistics of the factors ${F}^{(1,k)}$, $k=1,2,\ldots,b$. The HPCA assumption says nothing new regarding intra-block correlations, which are set equal to the empirical correlations between asset returns within the same sector or block. Of course, the intra-block correlations could be further denoised using RMT if necessary [4].

Using the HPCA assumption Eq. (10), the proposed model has the modified correlation matrix for asset returns:

where ${\overline{\rho}}^{k,k'}=Corr({F}^{(1,k)},{F}^{(1,k')})$.
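The block-level model, the HPCA assumption and the resulting correlation matrix can be sketched as follows. This is a reconstruction from the surrounding text and from the decomposition used in the proof of Proposition 1, not the authors' typesetting; $X_j$ is an assumed symbol for the standardized return of asset $j$, and the numbering follows the references to Eqs. (9), (10) and (11).

$$
\begin{aligned}
X_j &= \beta_j F^{(1,I(j))} + \epsilon_j, && (9)\\
\mathrm{Corr}(\epsilon_i,\epsilon_j) &= 0 \quad \text{if } I(i)\neq I(j), && (10)\\
\tilde{R}_{ij} &= \begin{cases} R_{ij}, & I(i)=I(j),\\ \beta_i\,\beta_j\,\overline{\rho}^{\,I(i),I(j)}, & I(i)\neq I(j). \end{cases} && (11)
\end{aligned}
$$

Since $\overline{\rho}^{\,k,k}=1$, the first case can equivalently be written $R_{ij}-\beta_i\beta_j+\beta_i\beta_j\,\overline{\rho}^{\,k,k}$, which is the form that makes the two non-negative pieces in the proof visible.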

**Proposition 1** *Eq. (11) corresponds to a symmetric non-negative definite matrix with* ${\tilde{R}}_{ii}=1$ *for all* $i$. *In particular, it corresponds to the correlation matrix of a system of standardized random variables.*

**Proof**. To check non-negative definiteness, note that for all $\theta \in \mathbb{R}^{n}$ we have

For any $k$, the matrix ${R}_{ij}-{\beta}_{i}{\beta}_{j}$ restricted to sector $k$ is identical to the sector correlation matrix, except that the eigenvalue corresponding to ${V}^{(1,k)}$ is set to zero. In particular, it is non-negative definite. Moreover, the matrix ${\overline{\rho}}^{k,k'}$ is also a correlation matrix, so it is non-negative definite. Since both summands are non-negative, it follows that ${\theta}^{t}\tilde{R}\theta \ge 0$ for all $\theta \in \mathbb{R}^{n}$.
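The two summands referred to in the proof can be written out explicitly. The display below is a reconstruction consistent with the argument above (an intra-sector term built from $R_{ij}-\beta_i\beta_j$ and an inter-sector term built from $\overline{\rho}$), under the assumption that $\tilde{R}$ equals $R_{ij}$ within sectors and $\beta_i\beta_j\overline{\rho}^{\,I(i),I(j)}$ across sectors:

$$
\theta^{t}\tilde{R}\,\theta \;=\; \sum_{k=1}^{b}\ \sum_{i,j:\,I(i)=I(j)=k} \left(R_{ij}-\beta_i\beta_j\right)\theta_i\theta_j \;+\; \sum_{k,k'=1}^{b} \overline{\rho}^{\,k,k'} \Big(\sum_{i:\,I(i)=k}\beta_i\theta_i\Big)\Big(\sum_{j:\,I(j)=k'}\beta_j\theta_j\Big).
$$

The terms with $k=k'$ in the second sum use $\overline{\rho}^{\,k,k}=1$ and restore exactly the $\beta_i\beta_j\theta_i\theta_j$ contributions subtracted in the first sum, so the right-hand side agrees with $\theta^{t}\tilde{R}\,\theta$ entry by entry.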

A concrete implementation of the data model is achieved as follows: let ${\psi}_{1},\ldots,{\psi}_{b}$ be Gaussian random variables with mean zero and covariance matrix $\overline{\rho}$, and let ${\zeta}_{ik}$, $i:I\left(i\right)=k$, $k=1,\ldots,b$, be i.i.d. standardized Gaussian random variables which are independent of the $\psi$’s. The data model is

The random variables need not be Gaussian: they can be multivariate Student-t, or they can be transforms of arbitrary distributions connected by a Gaussian or t-copula; see for instance [5].

The multivariate distribution associated with HPCA presents an alternative model to the classical PCA (Eq. (8)). It has a tree structure: in the equity example discussed below, the top vertex corresponds to the “market”; there are 11 branches corresponding to industry sectors, and each of the 11 vertices has branches corresponding to the stocks in each sector.

Hierarchical models with more than two layers arise naturally. For instance, HPCA can be used to model “world portfolios”, in which the first layer consists of countries or regions, the second of industry sector indices in each country, and the third of the individual securities in each region/sector.

For another useful example, consider a stock market in which stocks belong to different industry sectors, and then include columns associated with equity option returns. In this case, the tree has three layers because we can associate to each stock an additional sub-group: the block consisting of the returns of implied volatilities (on a constant delta/time-to-maturity grid) and the stock returns. Now the root corresponds to the full market, the first layer corresponds to industry sectors, the second layer corresponds to stocks and the third layer represents an individual name with all the associated option-implied volatilities.

A similar approach works for credit derivatives. In this case, the returns of the CDS with different tenors referencing each obligor constitute a block associated with that obligor. These blocks can be grouped by industry sectors or, alternatively, blocks could be generated according to membership in a credit index (CDX.IG, CDX.HY, CDX.HV), or both [5].

In summary, if financial data can be grouped into blocks or sectors with clear economic interpretation, with multiple instruments associated with each block, we can generate a data model with tree-like structure from the HPCA assumption in Eq. (10). This approach combines information available for each asset (sector, sub-sector, reference obligor, option underlying asset) with the explanatory power of PCA. For simplicity, we will consider the analysis of a two-layer HPCA. Adding more layers is mathematically straightforward.
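A two-layer HPCA correlation matrix can be assembled from a return matrix and a vector of sector labels along the following lines. This is a minimal sketch under stated assumptions: it uses NumPy, the function name `hpca_correlation` and the simulated data are hypothetical, and it takes the inter-sector entries to be $\beta_i\beta_j\overline{\rho}^{\,I(i),I(j)}$ with $\beta_i=\sqrt{\lambda^{(1,k)}}V_i^{(1,k)}$, while intra-sector entries are kept equal to the empirical correlations.

```python
import numpy as np


def hpca_correlation(returns: np.ndarray, sector: np.ndarray) -> np.ndarray:
    """Assemble a two-layer HPCA correlation matrix from T x n returns and n sector labels."""
    corr = np.corrcoef(returns, rowvar=False)                   # empirical n x n correlation
    z = (returns - returns.mean(axis=0)) / returns.std(axis=0)  # standardized returns
    labels = np.unique(sector)
    beta = np.zeros(corr.shape[0])
    factors = np.zeros((returns.shape[0], labels.size))
    for col, k in enumerate(labels):
        idx = np.flatnonzero(sector == k)
        lam, vec = np.linalg.eigh(corr[np.ix_(idx, idx)])       # sector-level PCA
        v1, lam1 = vec[:, -1], lam[-1]                          # top eigenpair of the sector
        beta[idx] = np.sqrt(lam1) * v1                          # loadings on the sector factor
        factors[:, col] = z[:, idx] @ v1 / np.sqrt(lam1)        # first sector eigenportfolio
    rho_bar = np.corrcoef(factors, rowvar=False)                # b x b factor correlation
    block_of = np.searchsorted(labels, sector)                  # block index of each asset
    r_tilde = np.outer(beta, beta) * rho_bar[np.ix_(block_of, block_of)]
    same = sector[:, None] == sector[None, :]
    r_tilde[same] = corr[same]                                  # keep intra-sector correlations
    return r_tilde


# Simulated example (assumed data): 3 sectors of 4 assets sharing a market factor.
rng = np.random.default_rng(0)
T, sector = 500, np.repeat([0, 1, 2], 4)
market = rng.standard_normal((T, 1))
sector_factors = 0.7 * market + rng.standard_normal((T, 3))
returns = sector_factors[:, sector] + rng.standard_normal((T, sector.size))
r_tilde = hpca_correlation(returns, sector)
print(np.allclose(np.diag(r_tilde), 1.0), np.linalg.eigvalsh(r_tilde).min())
```

The sign of each sector eigenvector is arbitrary, but it enters both the loadings and the factor, so the product $\beta_i\beta_j\overline{\rho}^{\,I(i),I(j)}$ does not depend on it.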

### 4 Spectral analysis

The HPCA assumption Eq. (10) gives rise to explicitly computable eigenvalues and eigenvectors for the matrix $\tilde{R}$ defined in Eq. (11).

**Proposition 2.**

*1. For each sector* $k=1,\ldots,b$, *let* ${\lambda}^{(1,k)}\ge {\lambda}^{(2,k)}\ge \ldots\ge {\lambda}^{({n}_{k},k)}$ *denote the* ${n}_{k}$ *eigenvalues of the sector correlation matrix, ordered from largest to smallest, and let* ${V}^{\left(i,k\right)}$ *be the corresponding eigenvectors. Define the $n$-dimensional vectors*

*which correspond to the embedding of the sector-level eigenvectors* ${V}^{(i,k)}\in \mathbb{R}^{{n}_{k}}$ *into the large space* $\mathbb{R}^{n}$. *The vectors* ${W}^{(i,k)}$, $i=1,\ldots,{n}_{k}$, $k=1,\ldots,b$, *form an orthogonal basis of* $\mathbb{R}^{n}$.

*2. The subspace* $\Omega$ *of* $\mathbb{R}^{n}$ *generated by the vectors* ${W}^{(1,k)}$, $k=1,\ldots,b$, *viz.*

*is invariant under the action of* $\tilde{R}$ *viewed as an operator from* $\mathbb{R}^{n}$ *to* $\mathbb{R}^{n}$.

*3. Consider the* $b\times b$ *matrix*

*Let* ${\mu}^{(1)},\ldots,{\mu}^{(b)}$ *denote the eigenvalues of* $M$, *ranked in decreasing order, and let* ${\alpha}^{(k)}=\left({\alpha}_{1}^{(k)},\ldots,{\alpha}_{b}^{(k)}\right)$, $k=1,\ldots,b$, *represent the corresponding normalized eigenvectors (defined up to sign). The vectors*

*are eigenvectors of* $\tilde{R}$, *with corresponding eigenvalues* ${\mu}^{(k)}$, *for* $k=1,\ldots,b$.

*4. For each sector* $k$ *and each* $j$, $2\le j\le {n}_{k}$, *the vector* ${W}^{(j,k)}$ *is an eigenvector of* $\tilde{R}$, *with eigenvalue* ${\lambda}^{(j,k)}$.
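The objects named in Proposition 2 can be sketched in formulas as follows. This is a reconstruction rather than the authors' typesetting: the symbol $Z^{(k)}$ and the coefficients $a_k$ are assumed names, and the expression for $M$ is the one obtained by applying $\tilde{R}$ to $W^{(1,k')}$ when the inter-sector entries of $\tilde{R}$ are $\beta_i\beta_j\overline{\rho}^{\,I(i),I(j)}$ with $\beta_i=\sqrt{\lambda^{(1,k)}}\,V_i^{(1,k)}$.

$$
\begin{aligned}
W^{(i,k)} &= \big(0,\ldots,0,\ V^{(i,k)},\ 0,\ldots,0\big)\in\mathbb{R}^{n} \quad \text{(the entries of } V^{(i,k)} \text{ placed in the positions of block } k\text{)},\\
\Omega &= \Big\{\, \sum_{k=1}^{b} a_k\, W^{(1,k)} \;:\; a\in\mathbb{R}^{b} \Big\},\\
M_{k k'} &= \sqrt{\lambda^{(1,k)}\,\lambda^{(1,k')}}\ \overline{\rho}^{\,k,k'}, \qquad k,k'=1,\ldots,b,\\
Z^{(k)} &= \sum_{l=1}^{b} \alpha^{(k)}_{l}\, W^{(1,l)}.
\end{aligned}
$$

With these definitions $\tilde{R}\,W^{(1,k')}=\sum_{k} M_{kk'}\,W^{(1,k)}$, which gives the invariance of $\Omega$ in item 2, and for $j\ge 2$ the orthogonality of $V^{(j,k)}$ to $V^{(1,k)}$ makes all inter-sector contributions vanish, which gives item 4.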

This proposition completely characterizes the eigenvalues and eigenvectors of the HPCA correlation matrix, relating them to the eigenvalues and eigenvectors of sector PCAs.^{3} Thus, the HPCA assumption eliminates the identification problem for common factors: “eigenportfolios” have concrete meanings attached to the information about the correlations of sectors. In the examples to follow, we shall compare HPCA with PCA and show that the former is an excellent substitute for the full empirical correlation matrices when we model multivariate financial data.
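The spectral characterization can be checked numerically on a small synthetic example. The sketch below assumes NumPy; the helper `random_corr`, the sector sizes and the form $M_{kk'}=\sqrt{\lambda^{(1,k)}\lambda^{(1,k')}}\,\overline{\rho}^{\,k,k'}$ are assumptions made for illustration and are not taken from the original display.

```python
import numpy as np

rng = np.random.default_rng(1)
sizes = [4, 3, 5]                          # assumed sector sizes n_k
b, n = len(sizes), sum(sizes)
sector = np.repeat(np.arange(b), sizes)    # block index I(j) of each asset


def random_corr(dim: int, rng: np.random.Generator) -> np.ndarray:
    """Hypothetical helper: a random full-rank correlation matrix of size dim."""
    samples = rng.standard_normal((dim, 3 * dim))  # rows are variables
    return np.corrcoef(samples)


rho_bar = random_corr(b, rng)              # correlation of the first sector factors
r_tilde = np.zeros((n, n))
beta = np.zeros(n)
lam1 = np.zeros(b)
intra = []                                 # sector eigenvalues of order 2 and higher
for k in range(b):
    idx = np.flatnonzero(sector == k)
    block = random_corr(idx.size, rng)     # sector correlation matrix
    lam, vec = np.linalg.eigh(block)       # ascending eigenvalues
    lam1[k] = lam[-1]
    beta[idx] = np.sqrt(lam[-1]) * vec[:, -1]
    intra.extend(lam[:-1])
    r_tilde[np.ix_(idx, idx)] = block      # intra-sector entries: sector correlations

# Inter-sector entries: beta_i * beta_j * rho_bar[I(i), I(j)]
off = sector[:, None] != sector[None, :]
r_tilde[off] = (np.outer(beta, beta) * rho_bar[np.ix_(sector, sector)])[off]

m = np.sqrt(np.outer(lam1, lam1)) * rho_bar  # b x b matrix M
predicted = np.sort(np.concatenate([np.linalg.eigvalsh(m), intra]))
actual = np.linalg.eigvalsh(r_tilde)
print(np.allclose(predicted, actual))
```

The comparison merges the $b$ eigenvalues of $M$ with the $n-b$ higher-order sector eigenvalues and sorts them, mirroring the sorted eigenvalue listing used in the application below.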

### 5 Application: S&P 500 constituents

We consider data for $n=434$ equities which are constituents of the S&P 500 index. The data ranges from February 22, 2012 to February 16, 2018. We consider the correlation matrix of standardized stock returns, and define the sectors as Global Industry Classification Standard sectors (GICs), so $b=11$; see Table 1.

Table 1.

GIC sectors and number of companies in each sector.

#### 5.1 Eigenvalues

We considered the full empirical correlation matrix^{4} and the HPCA correlation matrix $\tilde{R}$ (“HPCA matrix”). The spectrum of the HPCA matrix is very similar to that of the empirical correlation matrix $R$, with the difference that the eigenvalues at the top of the spectrum of $R$ are slightly larger than those of the HPCA matrix. This is due to the fact that PCA explains more variance with fewer common factors (see Figure 1). On the other hand, the sum of the eigenvalues is equal to $n=434$ in both cases, which means that, for high enough rank, the higher-order eigenvalues of HPCA are larger than those of PCA. The lowest eigenvalues of $R$ are infinitesimal, and $R$ is degenerate. At the bottom of the spectrum (not shown here), the HPCA spectrum has much higher eigenvalues (separated from zero) than PCA, since they are bounded from below by the lowest eigenvalue from all the sectors. Thus, the HPCA matrix is better conditioned than the full empirical matrix.

Figure 1.

X-axis: rank ( $k$ ) of the eigenvalues, sorted in decreasing order. Y-axis: sum of the first $k$ eigenvalues divided by $n=434$. The PCA curve rises faster than the HPCA curve, due to the nature of the PCA algorithm.

Table 2.

Top 25 eigenvalues of PCA and HPCA, sorted in decreasing order.

The column “Eigenportfolio” gives an interpretation of the corresponding HPCA eigenportfolio. “Multi-sector” corresponds to a ${\mu}^{\left(k\right)}$-eigenvalue and eigenvector, which are combinations of the *first* eigenportfolios for each of the 11 sectors (space $\Omega$). The other eigenvalues/eigenvectors correspond to higher-order eigenvalues/eigenvectors for individual GIC sectors. Notice that, after sorting, some of the GIC eigenportfolios are more important in terms of explaining variability than multi-sector portfolios.

#### 5.2 Eigenvectors

We turn to the empirical analysis of the eigenvectors of the HPCA and the empirical correlation matrices, *i.e.* to the identification problem for PCA/HPCA. The first eigenvectors for HPCA and PCA are plotted in Figures 2 and 3. Since the first eigenvector of $M$ has positive entries and the first eigenvectors of the sector correlations also have positive entries, due to the positive correlations of stocks ([1], [2]), the EV1 loadings are positive for both HPCA and PCA. Figure 3 superimposes both eigenvectors. The ordering of the X-axis is alphabetical within each sector, and sectors are displayed in increasing order of GIC according to Table 1. The two eigenvectors are practically indistinguishable, in the sense that their average difference is of order $1.0\times {10}^{-5}$ and the standard deviation (centered RMS distance) is $5.3\times {10}^{-3}$. The RMS error is one order of magnitude smaller than the average size of each entry in the eigenvectors, which is approximately equal to $4.7\times {10}^{-2}$ in both cases.

This identifies the first eigenportfolio of the market as a “portfolio of first eigenportfolios” of different sectors (GICs). The difference in explanatory power between the two eigenvectors is the difference between the corresponding eigenvalues, divided by the number of stocks, namely $\left(138.87-137.19\right)/434=0.39\%$, which is negligible in this context. In particular, this suggests that using the first HPCA eigenportfolio as a proxy for the market portfolio gives rise to a better description of the market portfolio and an easier way to allocate to each stock. For instance, the first EV could be proxied by a capitalization-weighted sector ETF.^{5}

For eigenvectors 2 through 5 (Figures (5.2) through (5.2)), we find that the PCA eigenvectors correspond to “noisy versions” of the corresponding HPCA eigenvectors. The latter are essentially long-short sector eigenportfolios. The discrepancy increases when we consider higher-order eigenvalues, beyond the fifth. The sixth eigenvectors are not similar, as shown in Figure (5.2): the PCA eigenvector contains both positive and negative signs within the Consumer Discretionary sector. Eigenvector 7 in HPCA is the first which is concentrated in a single sector, namely Consumer Discretionary (Fig. (5.2)). The remaining eigenvectors up to rank 10 are displayed in Figures (5.2) to (5.2).

The main conclusions are: (a) Most of the top eigenvalues and corresponding eigenvectors are related to the inter-sector correlation $\overline{\rho}$. This provides an interpretation for these eigenportfolios, or common risk factors, as “portfolios of long-only sector portfolios”. (b) The remaining eigenvectors may be quite different. HPCA separates the factors into “sector-sector” and “long-short intra-sector” factors. PCA eigenvectors, in contrast, become increasingly difficult to interpret as simple sector-sector interactions or intra-sector interactions.

Figure 2.

First eigenvector of HPCA. Variance explained = 30%.

Figure 3.

Comparison of the first eigenvectors of HPCA and PCA, which have approximately the same explanatory value. Their Euclidean distance (RMS error) is $5.5\times {10}^{-3}$ , which is an order of magnitude smaller than the average entry size.

Figure 4.

Second eigenvector of HPCA. The variance explained is 4.7% for HPCA and 6.1% for PCA.