
Using Machine Learning to Predict Out-Of-Sample Performance of Trading Algorithms

By Dr. Thomas Wiecki, Lead Data Scientist at Quantopian

Earlier this year, we used DataRobot to test a large number of preprocessing, imputation and model combinations to predict out-of-sample performance. In this blog post, I’ll first explain the results from a unique data set assembled from strategies run on Quantopian. These results made clear that the Sharpe ratio of a backtest is a very weak predictor of a trading strategy’s future performance, but that a model trained with DataRobot on a variety of features can predict out-of-sample performance with much higher accuracy.

What is Backtesting?

Backtesting is ubiquitous in algorithmic trading. Quants run backtests to assess the merit of a strategy, academics publish papers showing phenomenal backtest results, and asset allocators at hedge funds take backtests into account when deciding where to deploy capital and who to hire. This makes sense if (and only if) a backtest is predictive of future performance. But is it?

This question is of critical importance to Quantopian. As we announced at QuantCon 2016, we have started to deploy some of our own capital to a select number of trading strategies. We found these by analyzing thousands of algorithms using only pyfolio, our open source finance analysis library. As a data scientist, it is my job to come up with better (and more automatic) methods to identify promising trading algorithms.

What makes this possible (and a whole lot of fun) is that we have access to a unique dataset of millions of backtests being run on the Quantopian platform. While our terms-of-use prohibit us from looking at code, we do know when code was last edited. Knowing this, we can rerun the strategy up until yesterday and evaluate how it continued to perform after the last code edit, meaning how it performed on data the quant did not have access to at the time of writing the code. Using this approach, we were able to assemble a data set of 888 strategies, selected based on various criteria, each with at least 6 months of out-of-sample data.
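The split described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the synthetic returns, the cutoff date and the zero risk-free rate are all hypothetical and not taken from Quantopian's actual pipeline.

```python
import numpy as np
import pandas as pd


def split_in_and_out_of_sample(daily_returns: pd.Series, last_edit: str):
    """Split a daily returns series at the date of the last code edit."""
    cutoff = pd.Timestamp(last_edit)
    in_sample = daily_returns[daily_returns.index <= cutoff]
    out_of_sample = daily_returns[daily_returns.index > cutoff]
    return in_sample, out_of_sample


def sharpe_ratio(daily_returns: pd.Series, periods_per_year: int = 252) -> float:
    """Annualized Sharpe ratio, assuming a zero risk-free rate."""
    std = daily_returns.std(ddof=1)
    if std == 0 or np.isnan(std):
        return float("nan")
    return float(np.sqrt(periods_per_year) * daily_returns.mean() / std)


# Synthetic daily returns standing in for one rerun strategy (assumption).
rng = np.random.default_rng(0)
dates = pd.bdate_range("2014-01-01", "2016-06-30")
returns = pd.Series(rng.normal(0.0005, 0.01, len(dates)), index=dates)

# Hypothetical last-edit date; everything after it is out-of-sample.
is_returns, oos_returns = split_in_and_out_of_sample(returns, "2015-12-31")
print("IS Sharpe:", sharpe_ratio(is_returns))
print("OOS Sharpe:", sharpe_ratio(oos_returns))
```

The only information this needs about a strategy is its return series and the timestamp of its last edit, which is consistent with not looking at the code itself.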

The first thing I looked at was how predictive the backtest or in-sample (IS) Sharpe ratio was of the out-of-sample (OOS) Sharpe ratio:


With a disappointing R²=0.02, this shows that we can’t learn much about how well a strategy will continue to perform from a backtest. This is a real problem and might cause us to throw our hands in the air and give up. Instead, we wondered if perhaps all the other information we have about an algorithm (like other performance metrics, or position and transactions statistics) would allow us to do a better job.
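For readers who want to run this kind of check on their own strategies, here is a minimal sketch of computing R² for a straight-line fit of OOS Sharpe on IS Sharpe. The data is synthetic and only loosely mimics a weak relationship; it is not the Quantopian data set, and the helper name `r_squared` is my own.

```python
import numpy as np


def r_squared(x, y):
    """R² of an ordinary least-squares line fit of y on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Synthetic stand-in: 888 strategies whose OOS Sharpe depends only
# weakly on the IS Sharpe (the 0.1 slope is an arbitrary assumption).
rng = np.random.default_rng(42)
is_sharpe = rng.normal(1.0, 1.0, 888)
oos_sharpe = 0.1 * is_sharpe + rng.normal(0.0, 1.0, 888)

print("R^2:", round(r_squared(is_sharpe, oos_sharpe), 3))
```

For a single predictor with an intercept, this R² equals the squared Pearson correlation between the two Sharpe ratios.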

Using Machine Learning to Predict Out-Of-Sample Performance

Our team created 58 features based on backtest results, including:

  • Tail-ratio (the 95th percentile divided by the 5th percentile of the daily returns distribution)
  • Average monthly turnover
  • How many backtests a user ran (which is indicative of overfitting)
  • The minimum, maximum and standard deviation of the rolling Sharpe ratio
  • …and many more (for a full list, see our SSRN manuscript).
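To make two of the features listed above concrete, here is a rough sketch of how they could be computed from a daily returns series. Several details are assumptions on my part: taking absolute values in the tail-ratio so it stays positive, the 126-day rolling window, the zero risk-free rate, and the feature names in the returned dictionary.

```python
import numpy as np
import pandas as pd


def tail_ratio(daily_returns: pd.Series) -> float:
    """95th percentile of daily returns over the 5th percentile.

    Absolute values are used so the ratio is positive (assumption).
    """
    upper = np.percentile(daily_returns, 95)
    lower = np.percentile(daily_returns, 5)
    return float(abs(upper) / abs(lower))


def rolling_sharpe_stats(daily_returns: pd.Series, window: int = 126,
                         periods_per_year: int = 252) -> dict:
    """Minimum, maximum and standard deviation of the rolling Sharpe ratio."""
    roll = daily_returns.rolling(window)
    rolling_sharpe = (np.sqrt(periods_per_year) * roll.mean() / roll.std()).dropna()
    return {
        "rolling_sharpe_min": float(rolling_sharpe.min()),
        "rolling_sharpe_max": float(rolling_sharpe.max()),
        "rolling_sharpe_std": float(rolling_sharpe.std()),
    }


# Synthetic backtest returns (assumption), two years of business days.
rng = np.random.default_rng(7)
dates = pd.bdate_range("2014-01-01", periods=504)
returns = pd.Series(rng.normal(0.0004, 0.01, len(dates)), index=dates)

features = {"tail_ratio": tail_ratio(returns), **rolling_sharpe_stats(returns)}
print(features)
```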

We then tried various machine learning techniques to predict the out-of-sample Sharpe ratio based on these features. This is where DataRobot came in really handy. We uploaded our data set to DataRobot and from there, a huge number of combinations of preprocessing, feature extraction, and regressors were tried in parallel, in a cross-validated manner. Most of this could be done using scikit-learn, but DataRobot is a huge time saver and produced better results than my own humble attempts.

Using an ExtraTrees regressor (similar to a Random Forest), DataRobot achieved an R² of 0.17 on the 20% hold-out set not used for training. The lift-chart below shows that the regressor does a pretty good job at predicting OOS Sharpe ratio.
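For readers without access to DataRobot, a comparable experiment can be approximated in scikit-learn. The sketch below is not DataRobot's pipeline: the data is synthetic, the target function is invented, and the hyperparameters (`n_estimators=200`, `min_samples_leaf=5`) are arbitrary assumptions. It only shows the shape of the workflow: a 20% hold-out set, an ExtraTrees regressor, and R² measured on the held-out strategies.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for 888 strategies with 58 features each (assumption).
rng = np.random.default_rng(0)
n_strategies, n_features = 888, 58
X = rng.normal(size=(n_strategies, n_features))

# Invented target with a nonlinear interaction plus noise, standing in
# for the out-of-sample Sharpe ratio.
y = (np.sin(X[:, 0]) * X[:, 1]
     + 0.5 * X[:, 2] ** 2
     + rng.normal(scale=1.0, size=n_strategies))

# Hold out 20% of the strategies; they are never used for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = ExtraTreesRegressor(n_estimators=200, min_samples_leaf=5, random_state=0)
model.fit(X_train, y_train)

holdout_r2 = r2_score(y_test, model.predict(X_test))
# Indices of the five features the fitted model weights most heavily.
top_features = np.argsort(model.feature_importances_)[::-1][:5]

print("Hold-out R^2:", round(holdout_r2, 3))
print("Most important feature indices:", top_features)
```

The `feature_importances_` attribute gives a ranking of features by how much the trees relied on them, which is one way to inspect what a tree ensemble has picked up.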

[Figure: Lift chart]

While it is generally difficult to intuit what the machine learning regressor has learned, we can look at which features it identified as most predictive. Of note, an important feature does not have to be predictive on its own, nor does it have to relate to the OOS Sharpe ratio in a linear way. These methods are so powerful because they can learn extremely complex, nonlinear interactions of multiple features that are predictive in the aggregate.

Below are the most informative features.


It is interesting to point out that tail-ratio and kurtosis are both measures that assess the tails of the returns distribution. Moreover, the number of backtests a user ran (“user_backtest_days”) is also very informative of how a strategy will perform going forward.

While we showed that we can do a much better job at predicting the OOS Sharpe ratio than by using only the backtest Sharpe ratio, it is not yet clear whether this makes a difference in practice. After all, no one earns money from a good R² value. So we asked whether a portfolio constructed from the 10 strategies ranked highest by our machine learning regressor would do well. We compared this to selecting the top 10 strategies ranked by their in-sample Sharpe ratio, as well as to many simulations of choosing 10 strategies at random.
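The comparison just described can be sketched as follows. Everything here is illustrative: the returns and scores are random numbers, so no selection method should be expected to win, and equal weighting of the 10 strategies is my assumption since the weighting scheme is not spelled out above.

```python
import numpy as np


def portfolio_sharpe(returns: np.ndarray, periods_per_year: int = 252) -> float:
    """Annualized Sharpe ratio of an equal-weight portfolio.

    `returns` has shape (days, strategies); equal weighting is an assumption.
    """
    portfolio = returns.mean(axis=1)
    return float(np.sqrt(periods_per_year) * portfolio.mean() / portfolio.std(ddof=1))


def top_k(scores: np.ndarray, k: int = 10) -> np.ndarray:
    """Indices of the k highest scores, best first."""
    return np.argsort(scores)[::-1][:k]


# Synthetic hold-out set: 126 OOS days for 178 strategies (assumption).
rng = np.random.default_rng(1)
n_days, n_strategies, k = 126, 178, 10
oos_returns = rng.normal(0.0003, 0.01, size=(n_days, n_strategies))
predicted_sharpe = rng.normal(size=n_strategies)   # stand-in for regressor output
in_sample_sharpe = rng.normal(size=n_strategies)   # stand-in for backtest Sharpe

ml_sharpe = portfolio_sharpe(oos_returns[:, top_k(predicted_sharpe, k)])
is_sharpe = portfolio_sharpe(oos_returns[:, top_k(in_sample_sharpe, k)])

# Baseline: many portfolios of k strategies chosen uniformly at random.
random_sharpes = np.array([
    portfolio_sharpe(oos_returns[:, rng.choice(n_strategies, size=k, replace=False)])
    for _ in range(1000)
])

print("ML-selected portfolio Sharpe:", round(ml_sharpe, 2))
print("IS-Sharpe-selected portfolio Sharpe:", round(is_sharpe, 2))
print("Share of random portfolios beaten by ML selection:",
      float(np.mean(random_sharpes < ml_sharpe)))
```

Comparing against the distribution of random portfolios, rather than a single random draw, shows how unusual a given selection's performance is.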

The image below shows our results when applying this portfolio construction method to the hold-out set.

[Figure: Sharpe ratio]

Future Outlook

We found it interesting that even at highly quantitative hedge funds, asset allocation is still done in a discretionary way, at least for the most part. I believe that a long-term goal for this research would be to further automate this aspect using Machine Learning, as depicted in the figure below.

[Figure: Future outlook]

It would also be useful to look more at the OOS data. Usually, deployment decisions aren’t made on backtest performance alone. At Quantopian, we require at least 6 months of OOS data. This number is rather arbitrary, however, and there might be strategies where we gain confidence more quickly, and ones where we would need to wait longer.

For more on how we compared backtest and out-of-sample performance on a cohort of algorithms, and how we achieved the best results with DataRobot, you can view the full report of our test here. You can also view my presentation on this subject from QuantCon here.

The material on this website is provided for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation or endorsement for any security or strategy, nor does it constitute an offer to provide investment advisory services by Quantopian.

In addition, the material offers no opinion with respect to the suitability of any security or specific investment. No information contained herein should be regarded as a suggestion to engage in or refrain from any investment-related course of action as none of Quantopian nor any of its affiliates is undertaking to provide investment advice, act as an adviser to any plan or entity subject to the Employee Retirement Income Security Act of 1974, as amended, individual retirement account or individual retirement annuity, or give advice in a fiduciary capacity with respect to the materials presented herein. If you are an individual retirement or other investor, contact your financial advisor or other fiduciary unrelated to Quantopian about whether any given investment idea, strategy, product or service described herein may be appropriate for your circumstances. All investments involve risk, including loss of principal. Quantopian makes no guarantees as to the accuracy or completeness of the views expressed in the website. The views are subject to change, and may have become unreliable for various reasons, including changes in market conditions or economic circumstances.


Can this be run within the online Quantopian platform? Or would zipline need to be used?
