1 Introduction
Principal Components Analysis (PCA) and random matrix theory (RMT) have become widespread tools for data analysis. PCA (Jolliffe (2002) [6]) provides a mathematical and objective approach to extract economic information from the correlation matrices of asset returns. In this approach, the analyst extracts common risk factors from the eigenvectors and eigenvalues of the correlation matrix.
The first eigenvector of the correlation matrix of stock returns corresponds to the solution of the variational problem
$$V^{(1)} = \arg\max_{\|V\| = 1} V^{\top} C V. \tag{1}$$
Here, $C$ is the correlation matrix of daily returns and $\|\cdot\|$ is the Euclidean norm in $\mathbb{R}^n$, $n$ being the total number of assets. Equation 1 shows that the principal eigenvector represents the direction (line) which “captures the most variance” as described by the correlation matrix. The first eigenvector satisfies
$$C V^{(1)} = \lambda_1 V^{(1)}, \qquad \lambda_1 = (V^{(1)})^{\top} C V^{(1)}. \tag{2}$$
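PCA also recursively finds additional (orthogonal) directions beyond $V^{(1)}$ which capture the most variance. The other eigenvectors and eigenvalues are computed in the same way as in Eq. (1), with the maximization restricted to the subspace orthogonal to the space spanned by the ones computed previously, i.e.,
$$V^{(k)} = \arg\max \left\{ V^{\top} C V : \|V\| = 1,\ V \perp V^{(j)} \text{ for } j < k \right\}, \qquad \lambda_k = (V^{(k)})^{\top} C V^{(k)}. \tag{3}$$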
The eigenvalues satisfy $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n \ge 0$. Assume that the data corresponds to the daily returns of a group of stocks. The Karhunen-Loeve representation of the standardized returns $R_i$ is
$$R_i = \sum_{k=1}^{n} \sqrt{\lambda_k}\, V^{(k)}_i F_k, \tag{4}$$
where
$$F_k = \frac{1}{\sqrt{\lambda_k}} \sum_{i=1}^{n} V^{(k)}_i R_i. \tag{5}$$
By construction, the $F_k$ are uncorrelated and have variance 1. Since these random variables are linear combinations of the daily standardized returns of the assets, we call them (standardized) “eigenportfolio (EP) returns”, with the caveat that the actual portfolio “weights” are obtained by dividing each entry of the eigenvector by the volatility of the asset (Avellaneda and Lee 2008, 2010) [1].2
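The construction above can be sketched numerically. The following is a minimal illustration, not the authors' implementation: the function name `eigenportfolios` and the input array `returns` (one row per date, one column per asset) are hypothetical, and sample statistics with `ddof=1` are an assumption.

```python
import numpy as np

def eigenportfolios(returns):
    """PCA of the correlation matrix of a (T, n) array of daily returns.

    Hypothetical helper: returns eigenvalues (decreasing), eigenvectors
    (columns), standardized eigenportfolio returns and portfolio weights.
    """
    returns = np.asarray(returns, dtype=float)
    vol = returns.std(axis=0, ddof=1)                    # per-asset volatility
    std_returns = (returns - returns.mean(axis=0)) / vol  # standardized returns
    corr = np.corrcoef(returns, rowvar=False)             # n x n correlation matrix

    eigvals, eigvecs = np.linalg.eigh(corr)               # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]                     # re-sort in decreasing order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Standardized EP returns: project on each eigenvector, scale by 1/sqrt(eigenvalue).
    # Assumes all eigenvalues are strictly positive (more dates than assets).
    factors = std_returns @ eigvecs / np.sqrt(eigvals)

    # Actual portfolio weights: eigenvector entries divided by asset volatility.
    weights = eigvecs / vol[:, None]
    return eigvals, eigvecs, factors, weights
```

With more observations than assets, the sample covariance of the columns of `factors` is the identity matrix up to rounding, which is the "uncorrelated with variance 1" property stated above.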
PCA is a framework for learning about the common factors which affect the returns of a given group of assets. The first eigenportfolio, associated with the r.v. $F_1$, is a common risk factor which explains the maximum variability. We can write a one-factor model for each asset, namely
$$R_i = \beta_i F_1 + \epsilon_i, \tag{6}$$
where $\beta_i$ is the regression coefficient of the standardized return $R_i$ on the first EP. The “residuals” $\epsilon_i$ in equation 6 are uncorrelated with $F_1$, which is convenient. However, they are generally correlated for different stocks.
The regression coefficients satisfy
$$\beta_i = \sqrt{\lambda_1}\, V^{(1)}_i. \tag{7}$$
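As a quick numerical check of the one-factor model, the sketch below compares the ordinary least-squares slope of each standardized return on the first EP return with the square root of the top eigenvalue times the corresponding eigenvector entry. The synthetic data (one common "market" driver plus noise) and all variable names are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n = 1000, 6

# Synthetic returns: one common driver plus idiosyncratic noise (illustrative only).
market = rng.standard_normal((T, 1))
returns = 0.7 * market + rng.standard_normal((T, n))

# Standardize and form the sample correlation matrix.
R = (returns - returns.mean(axis=0)) / returns.std(axis=0, ddof=1)
C = R.T @ R / (T - 1)

# Top eigenpair (eigh sorts eigenvalues in ascending order).
lam, V = np.linalg.eigh(C)
lam1, v1 = lam[-1], V[:, -1]

# Standardized return of the first eigenportfolio.
F1 = R @ v1 / np.sqrt(lam1)

# OLS slope of each asset on F1 versus the PCA expression sqrt(lam1) * v1.
beta_ols = R.T @ F1 / (F1 @ F1)
beta_pca = np.sqrt(lam1) * v1

# Residuals of the one-factor model.
resid = R - np.outer(F1, beta_pca)
```

Here `beta_ols` and `beta_pca` agree up to rounding and `resid.T @ F1` vanishes, while the off-diagonal entries of the residual correlation matrix are in general non-zero.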
In the case of economic data, which is noisy, the consensus is to disregard EPs which correspond to low eigenvalues. In a celebrated paper, Laloux et al. (2000) [7] proposed to use random matrix theory (RMT) to establish a cutoff in the number of EPs used to model the standardized returns, namely
$$R_i = \sum_{k=1}^{m} \beta_{ik} F_k + \epsilon_i, \tag{8}$$
where $\beta_{ik}$ are “factor loadings” and (with a slight abuse of notation) $\epsilon_i$ are residuals obtained after “defactoring” relative to the first $m$ eigenportfolios. The number $m$ is a cutoff which is to be determined from the context.
According to [7], the eigenvalues of a pure noise matrix follow the Marcenko-Pastur distribution and have a spectrum which, for large matrices, is asymptotically bounded from above by $(1+\sqrt{n/T})^2$, where $T$ is the number of observations. The asymptotics should hold in the limit $n/T \to \gamma$ (a constant) as $n$ and $T$ both tend to infinity. The way to use RMT to calculate the cutoff is to construct the correlation matrix of the residuals for $m$ large enough and verify that its top eigenvalue is of the order of $(1+\sqrt{n/T})^2$. One can also compare the empirical distribution of eigenvalues with the Marcenko-Pastur probability distribution.
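A simplified version of this cutoff rule can be sketched as follows. The function names are hypothetical, and counting the eigenvalues above the Marcenko-Pastur upper edge is a common shortcut rather than the residual-based procedure described above; it also assumes unit-variance (standardized) data.

```python
import numpy as np

def mp_upper_edge(n_assets, n_obs):
    """Upper edge of the Marcenko-Pastur spectrum for standardized pure noise."""
    return (1.0 + np.sqrt(n_assets / n_obs)) ** 2

def count_significant(eigvals, n_assets, n_obs):
    """Number of eigenvalues strictly above the Marcenko-Pastur upper edge."""
    eigvals = np.asarray(eigvals, dtype=float)
    return int(np.sum(eigvals > mp_upper_edge(n_assets, n_obs)))

# Illustrative use on synthetic data with one common factor.
rng = np.random.default_rng(0)
n_obs, n_assets = 1500, 100
data = 0.5 * rng.standard_normal((n_obs, 1)) + rng.standard_normal((n_obs, n_assets))
spectrum = np.linalg.eigvalsh(np.corrcoef(data, rowvar=False))
m_candidate = count_significant(spectrum, n_assets, n_obs)
```

In finite samples the largest noise eigenvalue fluctuates around the edge, so `m_candidate` is best read as a starting point for the cutoff rather than a definitive value.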
PCA aided by RMT is an elegant approach to analyzing correlation matrices of financial data and can also be applied to many areas of science. The main strength of the method is that it can detect common risk factors based on a matrix of asset returns, without any additional information. In other words, PCA “lets the data speak for itself”. Generally speaking, PCA explains the most variability with the smallest number of factors. Most studies tend to justify the PCA approach by recognizing that it produces some factors which have ex-post economic interpretations, such as equating the first eigenportfolio with the Sharpe Market Portfolio (Boyle 2017) [2], or attempt to interpret higher-order EPs in terms of industry sectors [1]. In the case of fixed-income, the EPs are often identified with “parallel shifts”, or with long-term vs short-term oscillations of the yield curve (Litterman and Scheinkman, 1991) [8].
2 The identification problem
One of the frequent criticisms of PCA in finance is that the common risk factors generated by higher-order eigenportfolios (all but the first) are difficult to interpret and appear to be unstable across time. We call this the identification problem. Because of it, many portfolio managers favor traditional factor models such as Barra; see Shkolnik et al. (2016) [9] for alternative approaches to modeling financial correlations.
The identification problem in PCA reflects the uncertainty, or unreliability, of cross-asset correlations. From a practical point of view, as the size of the trading universe increases, the correlations of assets which are not economically related (a tech stock with an energy stock, or with a foreign stock) are difficult to quantify and may be noisy. This could be due to several reasons: the lack of an “explanation” for the relation between the stocks, prices that are not sampled simultaneously (e.g. end-of-day prices in different time zones), or a number of observations that is not large compared to the number of assets considered. For example, empirical correlations of price changes of out-of-the-money options with different underlying assets may not be as reliable or significant as the data would suggest.
To mitigate the identification problem, we should seek a factor model which can recognize the economic nature or function of the asset as well as the statistical properties of returns. This leads us to the model described hereafter.
3 Hierarchical PCA
Hierarchical PCA (HPCA) applies to markets which can be partitioned into several sectors or asset classes. Consider first an abstract market, in which the empirical data matrix of asset returns, with dimensions $T \times n$, can be partitioned into “blocks of columns” labeled $B_1, \dots, B_b$. These blocks have dimensions $T \times n_k$ with $\sum_{k=1}^{b} n_k = n$. Each block represents data sampled from a sector. For simplicity, we assume that the indices of the securities are organized so that the blocks are adjacent to one another in the matrix and do not overlap. We have a few concrete situations in mind:
The blocks represent data of industry sectors for equities in the same economy (e.g. sectors associated with the 500 or so stocks in the S&P 500 index). In this case, the columns of a block correspond to the historical standardized returns of the stocks in the sector observed over consecutive dates.
Each block represents a stock or index and all of the derivatives written on it. In this case, the columns in a block represent the returns of the stock and the changes of the implied volatilities of options with different strikes and tenors written on the stock (Dobi 2015 [4]).
In the context of credit derivatives, the data represents changes in credit spreads for CDS. The blocks correspond to CDS referencing the same obligor (issuer) but with different tenors (Cont and Kan (2011) [3], Ivanov (2017) [5]).
Define the function $s(i) = k$ if asset $i$ is in block $B_k$. According to Eq. (4) we can write, for each asset in the “big universe”,
$$R_i = \beta_i F^{(s(i))} + \epsilon_i, \tag{9}$$
where $F^{(k)}$ denotes the standardized return of the first eigenportfolio of block $B_k$, $\beta_i$ is the regression coefficient of the returns of asset $i$ on the first factor of block $s(i)$, and $\epsilon_i$ is the residual.
We shall make the following assumption (“HPCA assumption”):
$$\mathbb{E}\left[\epsilon_i \epsilon_j\right] = 0 \quad \text{if } s(i) \neq s(j). \tag{10}$$
The assumption states that residuals are uncorrelated if their assets belong to different sectors. Equation (9) defines the asset statistics within each block exactly, and the model is completed by specifying the joint statistics of the factors $F^{(1)}, \dots, F^{(b)}$. The HPCA assumption says nothing new regarding intra-block correlations, which are set equal to the empirical correlations between asset returns within the same sector or block. Of course, the intra-block correlations could be further denoised using RMT if necessary ([4]).
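Using the HPCA assumption Eq. (10), the proposed model has the modified correlation matrix for asset returns:
$$\hat{C}_{ij} = \begin{cases} C_{ij} & \text{if } s(i) = s(j), \\ \beta_i \beta_j\, \rho_{s(i)s(j)} & \text{if } s(i) \neq s(j), \end{cases} \tag{11}$$
where $\rho_{kl}$ is the correlation between the first-eigenportfolio returns of sectors $k$ and $l$, and $\beta_i$ is the regression coefficient of asset $i$ on the first eigenportfolio of its own sector.
Proposition 1 Eq. (11) corresponds to a symmetric non-negative definite matrix with $\hat{C}_{ii} = 1$ for all $i$. In particular, it corresponds to the correlation matrix of a system of standardized random variables.
Proof. To check non-negative definiteness, note that for all $x \in \mathbb{R}^n$ we have
$$x^{\top} \hat{C} x = \sum_{k=1}^{b} \sum_{i,j \in B_k} \left( C_{ij} - \beta_i \beta_j \right) x_i x_j + \sum_{i,j=1}^{n} \beta_i \beta_j\, \rho_{s(i)s(j)}\, x_i x_j .$$
For any $k$, the matrix $C_{ij} - \beta_i \beta_j$ restricted to sector $k$ is identical to the sector correlation matrix, except for the fact that the eigenvalue corresponding to the first sector eigenvector is set to zero. In particular, it is non-negative definite. Moreover, the matrix $\rho$ is also a correlation matrix, so it is non-negative definite, and the second sum equals $\sum_{k,l} \rho_{kl}\, y_k y_l$ with $y_k = \sum_{i \in B_k} \beta_i x_i$. Since both summands are non-negative it follows that $x^{\top} \hat{C} x \ge 0$ for all $x$.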
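The construction of the modified correlation matrix can be illustrated with a short sketch. It assumes the reading of the model given in the text: empirical correlations inside each block, and the product of the two first-eigenportfolio loadings times the correlation of the two sector factors across blocks. The function name `hpca_correlation` and its arguments are hypothetical.

```python
import numpy as np

def hpca_correlation(returns, sectors):
    """Sketch of an HPCA correlation matrix.

    returns : (T, n) array of asset returns.
    sectors : length-n array of sector labels, one per column.
    """
    returns = np.asarray(returns, dtype=float)
    sectors = np.asarray(sectors)
    T, n = returns.shape

    # Standardized returns and full empirical correlation matrix.
    Z = (returns - returns.mean(axis=0)) / returns.std(axis=0, ddof=1)
    C = Z.T @ Z / (T - 1)

    labels = np.unique(sectors)
    beta = np.zeros(n)
    F = np.zeros((T, labels.size))
    for k, lab in enumerate(labels):
        idx = np.flatnonzero(sectors == lab)
        lam, V = np.linalg.eigh(C[np.ix_(idx, idx)])   # ascending eigenvalues
        v1 = V[:, -1]
        if v1.sum() < 0:                               # sign convention: positive loadings
            v1 = -v1
        beta[idx] = np.sqrt(lam[-1]) * v1              # loading on the sector factor
        F[:, k] = Z[:, idx] @ v1 / np.sqrt(lam[-1])    # sector first-EP return

    # Correlation matrix of the sector factors.
    rho = np.atleast_2d(np.corrcoef(F, rowvar=False))

    # Cross-sector entries: beta_i * beta_j * rho; same-sector entries: empirical.
    pos = np.searchsorted(labels, sectors)
    C_hat = np.outer(beta, beta) * rho[np.ix_(pos, pos)]
    same = sectors[:, None] == sectors[None, :]
    C_hat[same] = C[same]
    return C_hat, beta, rho
```

On any data set the output can be checked against Proposition 1: unit diagonal, symmetry, and a smallest eigenvalue that is non-negative up to rounding.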
A concrete implementation of the data model is achieved as follows: let the sector factors $F^{(1)}, \dots, F^{(b)}$ be Gaussian random variables with mean zero and covariance matrix $\rho$, and let $G_j$ be i.i.d. standardized Gaussian random variables which are independent of the $F^{(k)}$’s. The data model is
The random variables need not be Gaussian: they can be multivariate Student-t, or they can be transforms of arbitrary distributions connected by a Gaussian or t-Copula; see for instance [5].
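One way to draw Gaussian samples with the HPCA correlation structure is sketched below. This is an assumed construction, not necessarily the one intended in the text: each sector block is generated as its first-eigenportfolio loading times a correlated sector factor, plus independent Gaussian noise along the remaining sector eigenvectors. The function name `simulate_hpca` and its arguments are hypothetical.

```python
import numpy as np

def simulate_hpca(sector_corrs, rho, n_samples, seed=0):
    """Draw standardized Gaussian returns with HPCA correlations.

    sector_corrs : list of sector correlation matrices (one per block).
    rho          : correlation matrix of the sector factors.
    """
    rng = np.random.default_rng(seed)
    rho = np.asarray(rho, dtype=float)

    # Correlated sector factors, one column per sector.
    F = rng.multivariate_normal(np.zeros(len(rho)), rho, size=n_samples)

    blocks = []
    for k, Ck in enumerate(sector_corrs):
        lam, V = np.linalg.eigh(np.asarray(Ck, dtype=float))  # ascending order
        lam = np.clip(lam, 0.0, None)                         # guard against tiny negatives
        v1 = V[:, -1] if V[:, -1].sum() >= 0 else -V[:, -1]   # positive-loading convention
        beta = np.sqrt(lam[-1]) * v1

        # Independent noise along the higher-order sector eigenvectors.
        G = rng.standard_normal((n_samples, len(lam) - 1))
        residual = (G * np.sqrt(lam[:-1])) @ V[:, :-1].T

        blocks.append(np.outer(F[:, k], beta) + residual)
    return np.hstack(blocks)
```

By construction, the within-block covariance reproduces each sector correlation matrix, and the cross-block covariance is the product of the loadings and the factor correlation, so large samples should recover the HPCA matrix.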
The multivariate distribution associated with HPCA presents an alternative model to the classical PCA (Eq. (8)). It has a tree structure: in the equity example discussed below, the top vertex corresponds to the “market”; there are 11 branches corresponding to industry sectors, and each of the 11 vertices has branches corresponding to the stocks in each sector.
Hierarchical models with more than two layers arise naturally. For instance, HPCA can be used to model “world portfolios”, in which the first layer consists of countries or regions, the second of industry sector indices in each country, and the third layer could describe the individual securities in each region/sector.
For another useful example, consider a stock market in which stocks belong to different industry sectors, and then include columns associated with equity option returns. In this case, the tree has three layers because we can associate to each stock an additional sub-group: the block consisting of the returns of implied volatilities (on a constant delta/time-to-maturity grid) and the stock returns. Now the root corresponds to the full market, the first layer corresponds to industry sectors, the second layer corresponds to stocks and the third layer represents an individual name with all the associated option-implied volatilities.
A similar approach works for credit derivatives. In this case, the returns of the CDS with different tenors referencing each obligor constitute a block associated with an obligor. These blocks can be grouped by industry sectors or, alternatively, blocks could be generated according to membership in a credit index (CDX.IG, CDX.HY, CDX.HV), or both [5].
In summary, if financial data can be grouped into blocks or sectors with clear economic interpretation, with multiple instruments associated with each block, we can generate a data model with tree-like structure from the HPCA assumption in Eq. (10). This approach combines information available for each asset (sector, sub-sector, reference obligor, option underlying asset) with the explanatory power of PCA. For simplicity, we will consider the analysis of a two-layer HPCA. Adding more layers is mathematically straightforward.
4 Spectral analysis
The HPCA assumption Eq. (10) gives rise to explicitly computable eigenvalues and eigenvectors for the matrix defined in Eq. (11).
Proposition 2.
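1. For each sector $k$, let $\lambda^{(k)}_j$, $j = 1, \dots, n_k$, denote the eigenvalues of the sector correlation matrix, ordered from largest to smallest, and let $v^{(k,j)}$ be the corresponding eigenvectors. Define the $n$-dimensional vectors
$$W^{(k,j)}_i = \begin{cases} v^{(k,j)}_i & \text{if } s(i) = k, \\ 0 & \text{otherwise,} \end{cases}$$
which correspond to the embedding of the sector-level eigenvectors, $v^{(k,j)}$, into the large space $\mathbb{R}^n$.
2. The vectors $W^{(k,j)}$ form an orthogonal basis of $\mathbb{R}^n$.
3. Let $\mu_1 \ge \dots \ge \mu_b$ denote the eigenvalues of the $b \times b$ matrix $M$ with entries $M_{kl} = \sqrt{\lambda^{(k)}_1 \lambda^{(l)}_1}\, \rho_{kl}$, where $\rho_{kl}$ is the correlation between the first-eigenportfolio returns of sectors $k$ and $l$, ranked in decreasing order, and let $u^{(l)}$ represent the corresponding normalized eigenvectors (defined up to sign). The vectors
$$\sum_{k=1}^{b} u^{(l)}_k\, W^{(k,1)}$$
are eigenvectors of $\hat{C}$, with corresponding eigenvalues $\mu_l$, for $l = 1, \dots, b$.
4. For each sector $k$ and each $j \ge 2$, the vector $W^{(k,j)}$ is an eigenvector of $\hat{C}$, with eigenvalue $\lambda^{(k)}_j$.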
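The spectral characterization can be checked numerically on a small example. The sketch below assumes the structure described in the text (empirical sector blocks on the diagonal, loadings times factor correlations off the diagonal) and takes the small inter-sector matrix to be the factor correlation matrix scaled on both sides by the square roots of the top sector eigenvalues; the function names and inputs are hypothetical.

```python
import numpy as np

def hpca_matrix(sector_corrs, rho):
    """Assemble an HPCA correlation matrix from sector blocks and factor correlations."""
    rho = np.asarray(rho, dtype=float)
    betas, labels = [], []
    for k, Ck in enumerate(sector_corrs):
        lam, V = np.linalg.eigh(np.asarray(Ck, dtype=float))  # ascending eigenvalues
        v1 = V[:, -1] if V[:, -1].sum() >= 0 else -V[:, -1]   # positive-loading convention
        betas.append(np.sqrt(lam[-1]) * v1)
        labels.append(np.full(len(v1), k))
    beta, labels = np.concatenate(betas), np.concatenate(labels)

    # Cross-sector entries first, then overwrite the diagonal blocks.
    C_hat = np.outer(beta, beta) * rho[np.ix_(labels, labels)]
    start = 0
    for Ck in sector_corrs:
        stop = start + len(Ck)
        C_hat[start:stop, start:stop] = Ck
        start = stop
    return C_hat

def predicted_spectrum(sector_corrs, rho):
    """Eigenvalues predicted from the sector PCAs and the small inter-sector matrix."""
    top, higher = [], []
    for Ck in sector_corrs:
        lam = np.linalg.eigvalsh(np.asarray(Ck, dtype=float))
        top.append(lam[-1])          # largest eigenvalue of the sector
        higher.extend(lam[:-1])      # higher-order sector eigenvalues
    s = np.sqrt(np.array(top))
    M = s[:, None] * np.asarray(rho, dtype=float) * s[None, :]
    eig = np.concatenate([np.linalg.eigvalsh(M), np.array(higher)])
    return np.sort(eig)[::-1]
```

Comparing `np.linalg.eigvalsh(hpca_matrix(blocks, rho))` with `predicted_spectrum(blocks, rho)` for a few hand-built inputs is a cheap way to confirm that the large spectrum splits into "multi-sector" eigenvalues and higher-order sector eigenvalues.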
This proposition completely characterizes the eigenvalues and eigenvectors of the HPCA correlation matrix relating them to the eigenvalues and eigenvectors of sector PCAs.3 Thus, the HPCA assumption eliminates the identification problem for common factors: “eigenportfolios” have concrete meanings attached to the information about the correlations of sectors. In the examples to follow, we shall compare HPCA with PCA and show that the former is an excellent substitute for the full empirical correlation matrices when we model multivariate financial data.
5 Application: S&P 500 constituents
We consider data for equities which are constituents of the S&P 500 index. The data ranges from February 22, 2012 to February 16, 2018. We consider the correlation matrix of standardized stock returns, and define the sectors as Global Industry Classification Standard (GICS) sectors (GICs), so there are 11 sectors; see Table 1.
Table 1.
GIC sectors and number of companies in each sector.
5.1 Eigenvalues
We considered the full empirical correlation matrix4 and the HPCA correlation matrix (“HPCA matrix”). The spectrum of the HPCA matrix is very similar to that of the empirical correlation matrix, with the difference that the eigenvalues of the latter at the top of the spectrum are slightly larger than the eigenvalues of the HPCA matrix. This is due to the fact that PCA explains more variance with fewer common factors (see Figure 1). On the other hand, the sum of eigenvalues is equal to $n$ in both cases, which means that for high enough rank, the higher-order eigenvalues of HPCA are larger than those of PCA. The lowest eigenvalues of the empirical correlation matrix are infinitesimal, and that matrix is degenerate. At the bottom of the spectrum (not shown here) the HPCA spectrum has much higher eigenvalues (separated from zero) than PCA, since they are bounded from below by the lowest eigenvalue from all the sectors. Thus, the HPCA matrix is better conditioned than the full empirical matrix.
Figure 1.
X-axis: rank ($k$) of the eigenvalues, sorted in decreasing order. Y-axis: sum of the first $k$ eigenvalues divided by $n$. The PCA curve rises faster than HPCA, due to the nature of the PCA algorithm.
Table 2.
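The curve described in this caption is straightforward to compute from a spectrum. A minimal sketch, with a hypothetical function name; dividing by the sum of the eigenvalues is the same as dividing by the number of assets when the input comes from a correlation matrix:

```python
import numpy as np

def cumulative_explained_variance(eigvals):
    """Fraction of total variance explained by the top-k eigenvalues, for every k."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]  # decreasing order
    return np.cumsum(lam) / lam.sum()
```

Applying it to the spectra of the empirical and HPCA matrices and plotting both outputs against rank would give curves of the kind shown in Figure 1.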
Top 25 eigenvalues of PCA and HPCA, sorted in decreasing order.
The column “Eigenportfolio” gives an interpretation of the corresponding HPCA eigenportfolio. “Multi-sector” corresponds to an eigenvalue and eigenvector of the inter-sector matrix, which are combinations of the first eigenportfolios of the 11 sectors. The other eigenvalues/eigenvectors correspond to higher-order eigenvalues/eigenvectors for individual GIC sectors. Notice that, after sorting, some of the GIC eigenportfolios are more important in terms of explaining variability than multi-sector portfolios.
5.2 Eigenvectors
We turn to the empirical analysis of the eigenvectors of the HPCA and the empirical correlation matrices, i.e. to the identification problem for PCA/HPCA. The first eigenvectors for HPCA and PCA are plotted in Figures (5.2) and (5.2). The first eigenvector of the empirical correlation matrix has positive entries, and the first eigenvectors of the sector correlations also have positive entries due to the positive correlations of stocks ([1], [2]); hence EV1 loadings are positive for both PCA and HPCA. Figure (5.2) superimposes both eigenvectors. Within each sector the ordering along the X-axis is alphabetical, and sectors are displayed in increasing order of GIC according to Table 1. The two eigenvectors are practically indistinguishable: the average difference of their entries and its standard deviation (centered RMS distance) are both small, and the RMS error is one order of magnitude smaller than the average size of each entry in the eigenvectors.
This identifies the first eigenportfolio of the market as a “portfolio of first eigenportfolios” of different sectors (GICs). The difference in explanatory power between the two eigenvectors is the difference between the corresponding eigenvalues, divided by the number of stocks, which is negligible in this context. In particular, this suggests that using the first HPCA eigenportfolio as a proxy for the market portfolio gives rise to a better description of the market portfolio and an easier way to allocate to each stock. For instance, the first EV could be proxied by a capitalization-weighted sector ETF.5
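The comparison statistics quoted for the two first eigenvectors can be computed as in the sketch below. The function name is hypothetical, and aligning the sign of the second vector with the first is an assumption needed because eigenvectors are only defined up to sign.

```python
import numpy as np

def compare_eigenvectors(u, v):
    """Mean difference, centered RMS difference and average entry size of two eigenvectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    if u @ v < 0:            # eigenvectors are defined up to sign: align v with u
        v = -v
    diff = u - v
    mean_diff = diff.mean()              # average difference of the entries
    rms_centered = diff.std()            # standard deviation (centered RMS distance)
    avg_entry = np.abs(u).mean()         # typical size of an entry
    return mean_diff, rms_centered, avg_entry
```

Comparing `rms_centered` with `avg_entry` gives the "order of magnitude" statement made above for the first PCA and HPCA eigenvectors.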
For eigenvectors 2 through 5 (Figures (5.2) through (5.2)), we find that the PCA eigenvectors correspond to “noisy versions” of the corresponding HPCA eigenvectors. The latter are essentially long-short sector eigenportfolios. The discrepancy increases when we consider higher-order eigenvalues, beyond 5. The sixth eigenvectors are not similar, as shown in Figure (5.2). The PCA eigenvector contains both positive and negative signs within the Consumer Discretionary sector. Eigenvector 7 in HPCA is the first which is concentrated in a single sector, namely Consumer Discretionary (Fig. (5.2)). The remaining eigenvectors up to rank 10 are displayed in Figures (5.2) to (5.2).
The main conclusions are: (a) most of the top eigenvalues and corresponding eigenvectors are related to the inter-sector correlation matrix. This provides an interpretation for these eigenportfolios, or common risk factors, as “portfolios of long-only sector portfolios”. (b) The remaining eigenvectors may be quite different. HPCA classifies the factors into “sector-sector” and “long-short intra-sector”. PCA eigenvectors, in contrast, become increasingly difficult to interpret as simple sector-sector interactions or intra-sector interactions.
Figure 2.
First eigenvector of HPCA. Variance explained = 30%.
Figure 3.
Comparison of the first eigenvectors of HPCA and PCA, which have approximately the same explanatory value. Their Euclidean distance (RMS error) is an order of magnitude smaller than the average entry size.
Figure 4.
Second eigenvector of HPCA. The variance explained is 4.7% for HPCA and 6.1% for PCA.