Authors: Sepideh Sadeghi and Thomas Wiecki
Correlation is a critical concept in assessing and minimizing risk in a portfolio. Often the goal is to find a diversified portfolio by minimizing the pairwise correlations between a selection of stocks. One problem with correlations is that they can't be directly observed, unlike prices, exchange rates and so on. We thus have to estimate them. Unfortunately, this is much harder than it might appear at first. In addition to the observability problem, correlations are known to change over time, i.e. they are non-stationary. The usual solution in quant finance is to estimate rolling correlations, which use a fixed-size rolling window over the time series to compute the correlation at each point in time. Selecting the right size for this window is not always straightforward. Larger windows lead to more stable estimates at the cost of not capturing recent changes in correlation structure, while short windows lead to very noisy estimates.
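To make the window-size trade-off concrete, here is a minimal sketch of a rolling correlation using pandas. The data are synthetic and the column names `a` and `b`, the helper `rolling_corr`, and the window sizes are illustrative assumptions, not taken from our analysis.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Two synthetic return series that share a common component
# (illustrative data only; the true correlation here is 0.5).
common = rng.normal(size=n)
returns = pd.DataFrame({
    "a": common + rng.normal(size=n),
    "b": common + rng.normal(size=n),
})

def rolling_corr(df, window):
    """Pearson correlation of columns 'a' and 'b' over a trailing window."""
    return df["a"].rolling(window).corr(df["b"])

short = rolling_corr(returns, 20)    # reacts quickly, but noisy
long = rolling_corr(returns, 200)    # smoother, but slow to react

print(short.dropna().std(), long.dropna().std())
```

The first `window - 1` entries of each result are NaN because the window is not yet full. On data like these, the short-window estimate should fluctuate far more than the long-window one, even though the underlying correlation never changes.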
In this blog post, we will demonstrate a different, more elegant approach to the non-stationarity problem. Probabilistic programming allows us to easily build a Bayesian model tailored to the problem at hand. As we demonstrate below, we can easily extend a classic linear regression to have time-varying parameters. Specifically, we will achieve this by placing a random-walk prior on the linear-regression slope parameter. We will compare four different correlation recovery models, then list the relevant issues that may arise, arguing for the benefits of taking a Bayesian approach. Finally, we close by showing one application of our Bayesian correlation recovery model: estimating the beta of the Amazon stock.
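One way to write down a regression with a random-walk prior on the slope is sketched below. The symbols (intercept $\alpha$, slope $\beta_t$, noise scales $\sigma$ and $\sigma_\beta$) are notation assumed here for illustration, and the exact priors used in our models may differ.

```latex
\begin{aligned}
y_t &= \alpha + \beta_t \, x_t + \varepsilon_t, & \varepsilon_t &\sim \mathcal{N}(0, \sigma^2) \\
\beta_t &= \beta_{t-1} + \eta_t, & \eta_t &\sim \mathcal{N}(0, \sigma_\beta^2)
\end{aligned}
```

The link to correlation comes from the standard identity for a regression slope, $\beta = \rho \, \sigma_y / \sigma_x$: when both series are scaled to unit variance, the slope equals the correlation $\rho$. A small $\sigma_\beta$ only lets the slope drift slowly, while a large one lets it follow the data more closely.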
Correlation risk refers to the error in the estimated correlation values as well as the model's confidence in its estimated values. Here we explore several models and contrast their ability to recover the actual correlation values used for data generation.
| Assumption | Point estimate | Bayesian |
|---|---|---|
| Static correlation | Linear regression | Bayesian linear regression |
| Time-varying correlation | Linear rolling regression | Stochastic regression |
As illustrated in the table above, each correlation recovery model makes its own assumption about whether the correlation values are static or dynamic over time. Furthermore, Bayesian models have a very natural quantification of uncertainty associated with them. Otherwise we have to make use of point estimates or use Frequentist statistics. Here, we contrast the Bayesian estimation with point estimates, which are still very common. A mismatch between the model assumptions and the actual time dependence of the correlations is usually the main source of estimation error, but not the only one. Over-fitting is another important issue, which becomes more and more relevant with smaller training datasets.
Case 1: Overfitting. Correlations are static and the correlation models make the same assumption.
In the picture below you see a scatter plot of two return series that are drawn from a multivariate normal with correlation close to 0.
We used both classic correlation computation as well as our Bayesian linear regression model (static over time), on two sets of 100 data points each from the same distribution (train and test). The picture below illustrates the differences across train and test set as well as models.
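The train/test experiment described above can be sketched for the classic (point-estimate) case with NumPy. The seed, the unit variances, and a true correlation of exactly 0 are assumptions made for this illustration. The Bayesian fit is not shown here.

```python
import numpy as np

rng = np.random.default_rng(42)

true_corr = 0.0  # assumed value; the text only says "close to 0"
cov = np.array([[1.0, true_corr],
                [true_corr, 1.0]])

def sample_corr(n):
    """Draw n pairs from the bivariate normal and return their sample correlation."""
    draws = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n)
    return np.corrcoef(draws[:, 0], draws[:, 1])[0, 1]

train_corr = sample_corr(100)
test_corr = sample_corr(100)

# Rough standard error of a sample correlation when the true value is near 0.
std_err = 1 / np.sqrt(100)

print(train_corr, test_corr, std_err)
```

With 100 points the rough standard error is about 0.1, so two point estimates from the same distribution can easily differ by 0.1 or 0.2. That gap between train and test is exactly what a point estimate hides.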
As can be seen, the linear regression estimates highly different correlations in the two sets. Taking uncertainty into account with the Bayesian model shows the same mean difference, but seeing the full posterior distribution gives a sense of the uncertainty in our estimation. Looking at the width of the distribution shows that we really can't say that the correlations of the train and test sets are different from 0, or from one another. As we gather more and more data, the posterior distribution will get narrower and narrower. For more information on Bayesian statistics, see the excellent book Probabilistic Programming and Bayesian Methods for Hackers.
Case 2: Correlations are dynamic but correlation models assume static correlations.
The picture below illustrates two return series drawn from a multivariate normal whose correlation varies over time (time is color coded).
Now, let's look at the regression inferred by the static Bayesian model. In green are the actual correlation coefficients used to generate the data. These follow a random walk and can thus slowly drift over time. As neither the classic nor the Bayesian linear regression has a sense of time, the estimated correlation is static over time, as represented by the blue and red lines. The blue lines have a width which represents the width of the posterior and thus our uncertainty about its value.
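The setup in this case can be sketched by simulating a correlation that follows a random walk and comparing it with a single static estimate. The step size, the seed, and the use of `tanh` to keep the random walk inside (-1, 1) are assumptions made for this sketch, not a description of how the data in the figures were generated.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
step = 0.05  # assumed random-walk step size

# Random walk on an unbounded scale, squashed into (-1, 1) with tanh
# so that every value is a valid correlation (an illustrative choice).
latent = np.cumsum(rng.normal(scale=step, size=n))
true_corr = np.tanh(latent)

# Build unit-variance pairs (x_t, y_t) whose correlation at time t is true_corr[t].
z1 = rng.normal(size=n)
z2 = rng.normal(size=n)
x = z1
y = true_corr * z1 + np.sqrt(1 - true_corr**2) * z2

# A static model collapses the whole path into one number.
static_corr = np.corrcoef(x, y)[0, 1]
static_mse = np.mean((true_corr - static_corr) ** 2)

print(static_corr, static_mse)
```

Because the static estimate is one constant, its squared error against the drifting true correlation cannot fall below the variance of that path over time, no matter how much data is available.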
Case 3: Correlations are dynamic and correlation models assume dynamic correlations as well.
Next, let's estimate the rolling and stochastic correlation models on the same data set. Below, we compare classic rolling correlation (with different window sizes) with Bayesian stochastic regression, taking the mean squared error (MSE) as the performance measure. In red you can see the actual correlations we used to generate the data, and how they change over time. Each subplot then shows a different estimation technique, with the top one being our Bayesian stochastic regression model.
As can be seen, the Bayesian model equipped with the right assumption about the time dependence of correlations improves significantly over the previous one with the static assumption. The gradual change in the slope of the regression lines is a reflection of the gradual change in the estimated correlations. Moreover, the Bayesian model gives by far the lowest error (measured as the mean squared error, MSE, against the actual correlations).
Next we can look at the regression lines on top of the data and how they slowly adapt to maximally explain the current data.
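The rolling-window half of this comparison can be sketched as follows. The simulated data, the helper `rolling_mse`, and the window sizes are assumptions for illustration. The Bayesian stochastic regression is not reproduced here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 1000

# Synthetic data with a slowly drifting correlation (assumed generator).
true_corr = np.tanh(np.cumsum(rng.normal(scale=0.05, size=n)))
x = rng.normal(size=n)
y = true_corr * x + np.sqrt(1 - true_corr**2) * rng.normal(size=n)

def rolling_mse(x, y, truth, window):
    """MSE between a trailing-window correlation and the true correlation."""
    est = pd.Series(x).rolling(window).corr(pd.Series(y))
    valid = est.notna().to_numpy()  # skip the warm-up period of each window
    return float(np.mean((est.to_numpy()[valid] - truth[valid]) ** 2))

mse_by_window = {w: rolling_mse(x, y, true_corr, w) for w in (20, 50, 100, 200)}

for window, mse in mse_by_window.items():
    print(window, round(mse, 4))
```

Each window size is scored only on the time points where its window is full, so the errors are computed over slightly different ranges. Which window gives the lowest error depends on how fast the true correlation drifts, which is unknown in practice.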
Application: Estimating the correlation between Amazon and benchmark using Bayesian stochastic regression.
Below we compare the correlation estimates from classic rolling correlations and from the Bayesian stochastic regression.
We demonstrated the relevant issues that may arise in the context of correlation recovery due to 1) model assumptions, 2) a lack of data leading to over-fitting and 3) the inability of some models to yield uncertainty estimates. We expected the Bayesian approach to be superior to the other models due to 1) a more principled quantification of uncertainty in the form of posterior distributions and 2) the ability to flexibly build models tailored to the structure of the data. We demonstrated some of the benefits of taking a Bayesian approach in addressing the problem of correlation estimation. Our Bayesian models are all developed in PyMC3 (a probabilistic programming package for Python) and estimated using Markov Chain Monte Carlo.
The current post only addresses the problem of estimating the pairwise correlation between two return series. Extending the current model to accommodate more than two return series, moving towards covariance matrix estimation, would also allow this approach to be used as part of a portfolio optimization that takes uncertainty into account.