
Bayesian Correlation Estimation: Dealing with Uncertainty and Non-Stationarity

Authors: Sepideh Sadeghi and Thomas Wiecki
Correlation is a critical concept in assessing and minimizing risk in a portfolio. Often the goal is to find a diversified portfolio by means of minimizing the pairwise correlations between a selection of stocks. One of the problems of correlations is the that they can't be directly observed, unlike prices, exchange rates and so on. We thus have to estimate them. Unfortunately, this is much harder than it might appear at first. In addition to the observability problem, the correlations are known to change over time -- i.e. they are non-stationarity. The usual solution in quant finance is to then estimate rolling correlations which uses a fixed size rolling window over the time series to compute the correlations at each point in time. Selecting the right size for this window is not always straight forward. Larger windows lead to more stable estimations at the cost of not capturing recent changes in correlation structure while short windows lead to very noisy correlation estimation.

In this blog post, we will demonstrate a different, more elegant approach to the non-stationarity problem. Probabilistic programming allows us to easily build a Bayesian model tailored to the problem at hand. As we demonstrate below, we can easily extend a classic linear regression to have time-varying parameters. Specifically, we achieve this by placing a random-walk prior on the linear-regression slope parameter. We will compare four different correlation recovery models, discuss the issues that may arise, and argue for the benefits of taking a Bayesian approach. Finally, we close by applying our Bayesian correlation recovery model to estimating the beta of the Amazon stock.


Correlation risk refers to the error in the estimated correlation values as well as the model's confidence in those estimates. Here we explore several models and contrast their ability to recover the actual correlation values used for data generation.

                           Classic                      Bayesian
Static correlation         Linear regression            Bayesian linear regression
Time-varying correlation   Linear rolling regression    Stochastic regression

All Bayesian models are implemented in PyMC3. For more details on the stats, the model and its implementation, see here.


As illustrated in the table above, each correlation recovery model makes its own assumption about whether the correlation values are static or dynamic over time. Furthermore, Bayesian models come with a very natural quantification of uncertainty; otherwise we have to make do with point estimates or Frequentist statistics. Here, we contrast the Bayesian estimates with point estimates, which are still very common. A mismatch between a model's assumptions and the actual time dependence of the correlations is usually the main source of estimation error, but not the only one: overfitting is another important issue, which becomes more and more relevant with smaller training datasets.

Case 1: Overfitting. Correlations are static and the correlation models make the same assumption.

In the picture below you see a scatter plot of two return series drawn from a multivariate normal with correlation close to 0.
We applied both the classic correlation computation and our (static) Bayesian linear regression model to two sets of 100 data points each from the same distribution (train and test). The picture below illustrates the differences across the train and test sets as well as across models.
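The sampling variability driving this experiment is easy to reproduce. The sketch below uses synthetic stand-ins for the two return series (the post's actual data is not available here): it draws 200 points from a bivariate normal with near-zero correlation and compares the classic point estimates on two 100-point halves:

```python
import numpy as np

rng = np.random.default_rng(1)
# Two return series with true correlation close to 0 (synthetic stand-ins)
data = rng.multivariate_normal([0, 0], [[1, 0.02], [0.02, 1]], size=200)
train, test = data[:100], data[100:]

# Classic (point-estimate) Pearson correlation on each half
r_train = np.corrcoef(train[:, 0], train[:, 1])[0, 1]
r_test = np.corrcoef(test[:, 0], test[:, 1])[0, 1]
# With only 100 points each, the two estimates can differ noticeably
print(r_train, r_test)
```

The Bayesian model sees the same data but reports a full posterior instead of a single number, which is what lets us judge whether such a train/test difference is meaningful.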

As can be seen, the linear regression estimates quite different correlations in the two sets. The Bayesian model shows the same difference in means, but seeing the full posterior distribution gives a sense of the uncertainty in our estimates. Looking at the width of the distributions shows that we really can't say that the correlations of the train and test sets are different from 0, or from one another. As we gather more and more data, the posterior distribution will get narrower and narrower. For more information on Bayesian statistics, see the excellent book Probabilistic Programming and Bayesian Methods for Hackers.

Case 2: Correlations are dynamic but correlation models assume static correlations.

The picture below illustrates two return series drawn from a multivariate normal whose correlation varies over time (time is color coded), giving us a time-varying correlation between the two series.
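Data of this kind can be simulated by letting the correlation coefficient itself follow a random walk and sampling each pair conditionally. This is our own sketch of one way to generate such data, not necessarily the exact procedure used for the plots:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500
# The correlation coefficient follows a clipped Gaussian random walk
rho = np.clip(np.cumsum(rng.normal(0, 0.02, n)), -0.95, 0.95)

# Conditional sampling from a bivariate normal with unit variances:
# y_t | x_t ~ N(rho_t * x_t, 1 - rho_t**2)
x = rng.normal(size=n)
y = rho * x + np.sqrt(1 - rho**2) * rng.normal(size=n)
```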


Now, let's look at the regression inferred by the static Bayesian model. In green are the actual correlation coefficients used to generate the data; these follow a random walk and thus drift slowly over time. As neither the classic nor the Bayesian linear regression has a sense of time, the estimated correlation is static, as represented by the blue and red lines. The width of the blue lines represents the width of the posterior and thus our uncertainty about its values.


Case 3: Correlations are dynamic and the correlation models assume dynamic correlations as well.

Next, let's estimate the rolling and stochastic correlation models on the same data set. Below, we compare classic rolling correlation (with different window sizes) with Bayesian stochastic regression, taking the mean squared error (MSE) as the performance measure. In red you can see the actual correlations we used to generate the data and how they change over time. Each subplot then shows a different estimation technique, with the top one being our Bayesian stochastic regression model.

As can be seen, the Bayesian model equipped with the right assumption about the time dependence of the correlations improves significantly over the previous one with the static assumption. The gradual change in the slope of the regression lines reflects the gradual change in the estimated correlations. Moreover, the Bayesian model gives by far the lowest error (measured as the mean squared error -- MSE -- against the actual correlations).

Next, we can look at the regression lines on top of the data and how they slowly adapt to best explain the current data.
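The post fits the stochastic regression by MCMC in PyMC3. As an aside, under the same Gaussian random-walk-slope assumptions the filtered slope can also be computed in closed form with a scalar Kalman filter; the sketch below is our own minimal numpy illustration of that idea (with made-up noise parameters `q` and `r`), not the post's implementation:

```python
import numpy as np

def kalman_slope(x, y, q=1e-3, r=0.1):
    """Filtered estimate of a regression slope that follows a random walk.

    Model: y_t = beta_t * x_t + eps_t,  beta_t = beta_{t-1} + eta_t,
    with Var(eps) = r and Var(eta) = q.
    """
    beta, P = 0.0, 1.0              # prior mean and variance of the slope
    betas = np.empty(len(y))
    for t in range(len(y)):
        P = P + q                   # predict: random-walk step adds variance q
        S = x[t] ** 2 * P + r       # innovation variance
        K = P * x[t] / S            # Kalman gain
        beta = beta + K * (y[t] - beta * x[t])   # update with new observation
        P = (1.0 - K * x[t]) * P
        betas[t] = beta
    return betas

# Synthetic data with a drifting true slope
rng = np.random.default_rng(3)
n = 400
true_beta = np.clip(np.cumsum(rng.normal(0, 0.03, n)), -1, 1)
x = rng.normal(size=n)
y = true_beta * x + rng.normal(0, np.sqrt(0.1), n)

est = kalman_slope(x, y)
mse = np.mean((est - true_beta) ** 2)
```

Unlike the fully Bayesian treatment, this filter conditions on fixed noise variances rather than inferring them, which is one reason to prefer the PyMC3 model when those are unknown.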


Application: Estimating the correlation between Amazon and the benchmark using Bayesian stochastic regression.

Here, we apply the Bayesian stochastic regression model to two return series, Amazon and the S&P 500, from 2007 to the end of 2008:

The regression lines inferred by the Bayesian model reflect the gradual change of the correlation between Amazon and the S&P 500 during 2007 and 2008, as expected due to the market crash.
Below we compare the correlation estimates of classic rolling correlation and the Bayesian stochastic regression.
As you can see, even with a window length of 50 we get quite abrupt changes in the estimated correlation. These are caused by extreme market moves that strongly affect the estimate as they enter and leave the window. Bayesian stochastic correlation gives a much smoother estimate and also keeps the correlation lower up until it rises in late 2008. At the same time, the relatively high uncertainty, encoded as the width of the line, tells us we should not assume we can estimate a precise number.
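The classic rolling baseline is a one-liner in pandas. Since we cannot ship the actual Amazon and S&P 500 data here, the sketch below uses synthetic stand-in return series; the window-size trade-off it exposes is the same one discussed above:

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for the stock and benchmark return series
rng = np.random.default_rng(0)
n = 500
benchmark = pd.Series(rng.normal(0, 0.01, n))
stock = pd.Series(0.8 * benchmark + rng.normal(0, 0.01, n))

# Classic rolling correlation: shorter windows react faster but are noisier
corr_50 = stock.rolling(window=50).corr(benchmark)
corr_250 = stock.rolling(window=250).corr(benchmark)
```

Each estimate weights all 50 (or 250) points in its window equally, which is exactly why a single extreme day causes a jump both when it enters and when it leaves the window.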


We demonstrated the relevant issues that may arise in the context of correlation recovery due to 1) model assumptions, 2) lack of sufficient data leading to overfitting, and 3) a model's inability to yield uncertainty estimates. We expected the Bayesian approach to be superior to the other models due to 1) a more principled quantification of uncertainty in the form of posterior distributions and 2) the ability to flexibly build models tailored to the structure of the data. We demonstrated some of the benefits of taking a Bayesian approach to the problem of correlation estimation. Our Bayesian models are all developed in PyMC3 (a probabilistic programming package for Python) and estimated using Markov Chain Monte Carlo.

Future work

The current post only addresses the problem of estimating the pairwise correlation between two return series. Extending the model to accommodate more than two return series -- moving towards covariance matrix estimation instead of a single pairwise correlation -- would also allow this approach to be used as part of portfolio optimization that takes uncertainty into account.

The material on this website is provided for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation or endorsement for any security or strategy, nor does it constitute an offer to provide investment advisory services by Quantopian.

In addition, the material offers no opinion with respect to the suitability of any security or specific investment. No information contained herein should be regarded as a suggestion to engage in or refrain from any investment-related course of action as none of Quantopian nor any of its affiliates is undertaking to provide investment advice, act as an adviser to any plan or entity subject to the Employee Retirement Income Security Act of 1974, as amended, individual retirement account or individual retirement annuity, or give advice in a fiduciary capacity with respect to the materials presented herein. If you are an individual retirement or other investor, contact your financial advisor or other fiduciary unrelated to Quantopian about whether any given investment idea, strategy, product or service described herein may be appropriate for your circumstances. All investments involve risk, including loss of principal. Quantopian makes no guarantees as to the accuracy or completeness of the views expressed in the website. The views are subject to change, and may have become unreliable for various reasons, including changes in market conditions or economic circumstances.
