9 Mistakes Quants Make that Cause Backtests to Lie by Tucker Balch, Ph.D.

Below is a follow-up article to the talk Dr. Balch gave at QuantCon 2015.

"I’ve never seen a bad backtest” -- Dimitris Melas, head of research at MSCI.

A backtest is a simulation of a trading strategy used to evaluate how effective the strategy might have been had it been traded historically. Backtesting is used by hedge funds and other researchers to test strategies before real capital is applied. Backtests are valuable because they enable quants to quickly test and reject trading strategy ideas.

All too often strategies look great in simulation but fail to live up to their promise in live trading. There are a number of reasons for these failures, some of which are beyond the control of a quant developer. But other failures are caused by common, insidious mistakes.

An over-optimistic backtest can cause a lot of pain. I’d like to help you avoid that pain by sharing 9 of the most common pitfalls in trading strategy development and testing that can result in overly optimistic backtests:

1. In-sample backtesting

Many strategies require refinement, or model training of some sort. As one example, a regression-based model that seeks to predict future prices might use recent data to build the model. It is perfectly fine to build a model in that manner, but it is not OK to test the model over that same time period. Such models are doomed to succeed.

Don’t trust them.

Solution: Best practices are to build procedures to prevent testing over the same data you train over. As a simple example you might use data from 2007 to train your model, but test over 2008-forward.

By the way, even though it could be called “out-of-sample” testing it is not a good practice to train over later data, say 2014, then test over earlier data, say 2008-2013. This may permit various forms of lookahead bias.
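
As a minimal sketch of that discipline, assuming daily data in a pandas DataFrame indexed by date (the function and names are my own, not from the talk):

```python
import pandas as pd

def chronological_split(data: pd.DataFrame, split_date: str):
    """Train strictly before split_date, test from split_date forward.

    Keeping the test period after the training period avoids both
    in-sample testing and the lookahead risk of training on later data.
    """
    split = pd.Timestamp(split_date)
    train = data[data.index < split]
    test = data[data.index >= split]
    return train, test

# Example: build the model on data through 2007, test on 2008 forward.
# train, test = chronological_split(prices, "2008-01-01")
```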

2. Using survivor-biased data

Suppose I told you I have created a fantastic new blood pressure medicine, and that I had tested it using the following protocol:

a. Randomly select 500 subjects
b. Administer my drug to them every day for 5 years
c. Measure their blood pressure each day

At the beginning of the study the average blood pressure of the participants was 160/110; at the end of the study the average BP was 120/80 (significantly lower and better).

Those look like great results, no? What if I told you that 58 of the subjects died during the study? Maybe it was the ones with high blood pressure who died! The study is clearly inaccurate because it reports statistics only for the survivors.

This same sort of bias is present in backtests that use later lists of stocks (perhaps members of the S&P 500) as the basis for historical evaluations over earlier periods. A common example is to use the current S&P 500 as the universe of stocks for testing a strategy.

Why is this bad? See the two figures below for illustrative examples.

Figure: The green lines show historical performance of stocks that were members of the S&P 500 in 2012. Note that all of these stocks came out of the 2008/2009 downturn very nicely.

Figure: What really happened: If, instead, we use the members of the S&P 500 starting in 2008, we find that more than 10% of the listed companies failed.

In our work at Lucena Research, we see an annual 3% to 5% performance “improvement” with strategies using survivor-biased data.

Solution: Find datasets that include historical index membership, then draw your trading universe for each backtest date from the stocks that were members at that time.
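
A sketch of how such a list might be used, assuming a hypothetical point-in-time membership table; real vendor schemas will differ:

```python
import pandas as pd

def universe_as_of(membership: pd.DataFrame, date: str) -> list:
    """Return the tickers that were index members on `date`.

    `membership` is assumed to have columns 'ticker', 'entry_date', and
    'exit_date' (NaT meaning still a member). The schema is hypothetical;
    adapt it to whatever your data vendor provides.
    """
    d = pd.Timestamp(date)
    active = (membership["entry_date"] <= d) & (
        membership["exit_date"].isna() | (membership["exit_date"] > d)
    )
    return membership.loc[active, "ticker"].tolist()

# Pick the universe as of the backtest date, not as of today:
# tickers = universe_as_of(sp500_membership, "2008-01-02")
```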

3. Observing the close & other forms of lookahead bias

In this failure mode, the quant assumes he can observe market closing prices in order to compute an indicator, and then also trade at the close. As an example, one might use closing price/volume to calculate a technical factor used in the strategy, then trade based on that information.

This is a specific example of lookahead bias in which the strategy is allowed to peek a little bit into the future. In my work I have seen time and again that even a slight lookahead bias can provide fantastic (and false) returns.

Other examples of lookahead bias involve incorrectly time-stamped data such as earnings reports or news: for instance, assuming that one can trade on the same day earnings are announced, even though earnings are usually announced after the close.

Solution: Don’t trade until the open of the next day after information becomes available.
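
In a vectorized daily backtest, one common way to enforce this is to lag the signal by one bar, so information computed from today's close can only drive tomorrow's position. A sketch on synthetic data (the 20-day moving-average signal is purely illustrative):

```python
import numpy as np
import pandas as pd

# Synthetic daily closes, for illustration only.
rng = np.random.default_rng(0)
dates = pd.date_range("2008-01-01", periods=500, freq="B")
close = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 500))),
                  index=dates)

# A signal computed FROM the close of day t...
signal = (close > close.rolling(20).mean()).astype(int)

# ...may only set the position held on day t+1. The shift(1) is the fix;
# dropping it would let the strategy peek at the close it trades on.
position = signal.shift(1).fillna(0)

# Each day's close-to-close return is earned on the prior day's signal.
strategy_returns = close.pct_change() * position
```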

4. Ignoring market impact

The very act of trading affects price. Historical pricing data does not include your trades and is therefore not an accurate representation of the price you would get if you were trading.

Consider the chart below, which describes the performance of a real strategy I helped develop. Region A, the first part of the upwardly sloping orange line, shows the performance of our backtest. The strategy had a Sharpe Ratio over 7.0! Based on the information we had up to that time (the end of A), it looked great, so we started trading it.

When we began live trading we saw the real performance illustrated by the green “live” line in region B: essentially flat. The strategy was not working, so we halted trading it after a few weeks. After we stopped trading it, the strategy started performing well again in paper trading (region C, argh!).

Figure: Ignoring market impact illustration.

How can this be? We thought perhaps that the error was in our predictive model, so we backtested again over the “live” area and the backtest showed that same flat area. The only difference between the nice 7.0 Sharpe Ratio sections and the flat section was that we were engaged in the market in the flat region.

What was going on? The answer, very simply, is that by participating in the market we were changing the prices to our disadvantage. We were not modeling market impact in our market simulation. Once we added that feature more accurately, our backtest appropriately showed a flat, no-return result for region A. If we had had that in the first place we probably would never have traded the strategy.

Solution: Be sure to anticipate that price will move against you at every trade. For trades that are a small part of overall volume, a rule of thumb is about 5 bps for S&P 500 stocks and up to 50 bps for more thinly traded stocks. It depends of course on how much of the market your strategy is seeking to trade.
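
A minimal sketch of that rule of thumb; the bps figures come from the paragraph above, while the function and its interface are my own simplification:

```python
def fill_price(quote: float, side: int, liquid: bool = True) -> float:
    """Move the quoted price against the trader before filling.

    The 5 bps / 50 bps figures are the rule-of-thumb numbers from the
    text; the function itself is a simplification of real market impact.
    side: +1 for a buy (pay more), -1 for a sell (receive less).
    """
    impact_bps = 5 if liquid else 50
    return quote * (1 + side * impact_bps / 10_000)

# A buy of a liquid S&P 500 name quoted at $50.00 fills near $50.025:
# fill_price(50.00, side=+1)
```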

5. Buy $10M of a $1M company

Naïve backtesters will allow a strategy to buy or sell as much of an asset as it likes. This may provide a misleadingly optimistic backtest because large allocations to small companies are allowed.

There often is real alpha in thinly traded stocks, and data mining approaches are likely to find it. Consider for a moment why it seems there is alpha there. The reason is that the big hedge funds aren’t playing there because they can’t execute their strategy with illiquid assets. There are perhaps scraps of alpha to be collected by the little guy, but check to be sure you’re not assuming you can buy $10M of a $1M company.

Solution: Have your backtester limit the strategy’s trading to a percentage of the daily dollar volume of the equity. Another alternative is to filter potential assets to a minimum daily dollar volume.
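
A sketch of the first alternative; the function name and the 1% participation default are illustrative assumptions, not from the article:

```python
def cap_order_dollars(desired_dollars: float,
                      avg_daily_dollar_volume: float,
                      max_participation: float = 0.01) -> float:
    """Clip an order to a fraction of the asset's daily dollar volume.

    The 1% participation cap is an illustrative default, not a standard.
    """
    return min(desired_dollars, max_participation * avg_daily_dollar_volume)

# Trying to put $10M into a stock trading $1M a day gets capped at $10k:
# cap_order_dollars(10_000_000, 1_000_000)  # -> 10000.0
```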

6. Overfit the model

An overfit model is one that models in-sample data very well. It predicts the data so well that it is likely modeling noise rather than the underlying principle or relationship in the data that you are hoping it will discover.

Here’s a more formal definition of overfitting: As the degrees of freedom of the model increase, overfitting occurs when in-sample prediction error decreases and out-of-sample prediction error increases.

What do we mean by “degrees of freedom”? Degrees of freedom can take many forms depending on the type of model being created: the number of factors used, the number of parameters in a parameterized model, and so on.

Solution: Don’t repeatedly “tweak” and “refine” your model using in-sample data. And always compare in-sample error versus out-of-sample error.
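
To see the definition above in action, here is a toy sketch (curve fitting on noisy data, not a trading model): in-sample error keeps shrinking as the polynomial degree grows, while out-of-sample error typically starts rising once the model begins fitting noise.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 60)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, x.size)

# Hold out half the points; fit only on the rest.
idx = rng.permutation(x.size)
tr, te = idx[:30], idx[30:]

for degree in (1, 3, 9):  # polynomial degree = degrees of freedom
    coeffs = np.polyfit(x[tr], y[tr], degree)
    err_in = np.mean((np.polyval(coeffs, x[tr]) - y[tr]) ** 2)
    err_out = np.mean((np.polyval(coeffs, x[te]) - y[te]) ** 2)
    print(f"degree {degree}: in-sample MSE {err_in:.3f}, "
          f"out-of-sample MSE {err_out:.3f}")
```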

7. Trust complex models

Complex models are often overfit models. Simple approaches that arise from a basic idea that makes intuitive sense lead to the best models. A strategy built from a handful of factors combined with simple rules is more likely to be robust and less sensitive to overfitting than a complex model with lots of factors.

Solution: Limit the number of factors considered by a model, and use simple logic to combine them.

8. Trusting stateful strategy luck

A stateful strategy is one whose holdings over time depend on which day in history it was started. As an example, if the strategy rapidly accrues assets, it may quickly become fully invested and therefore miss later buying opportunities. If the strategy had started one day later, its holdings might be completely different.

The success of such strategies can vary widely when they are started on different days. I’ve seen, for instance, a 50% difference in return for the same strategy started on two days in the same week.

Solution: If your strategy is stateful, be sure to test it starting on many different days. Evaluate the variance of the results across those days. If it is large, you should be concerned.
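
A sketch of that check; `run_backtest` is a hypothetical stand-in for your own backtester:

```python
import pandas as pd

def start_date_sensitivity(run_backtest, start_dates):
    """Run the same stateful strategy from many start dates and report
    the spread of outcomes.

    `run_backtest` is a hypothetical interface: it takes a start date
    and returns the strategy's total return. Substitute your own.
    """
    results = pd.Series({d: run_backtest(d) for d in start_dates})
    print(f"mean {results.mean():.2%}, std {results.std():.2%}, "
          f"range {results.max() - results.min():.2%}")
    return results

# Example: every business day in one month as a starting point.
# dates = pd.date_range("2012-01-02", "2012-01-31", freq="B")
# start_date_sensitivity(run_backtest, dates)
```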

9. Data mining fallacy

Even if you avoid all of the pitfalls listed above, if you generate and test enough strategies you’ll eventually find one that works very well in a backtest. However, such a strategy cannot be distinguished from a lucky random stock picker.

How can this pitfall be avoided? It can’t be avoided. However, you can and should forward test before committing significant capital.

Solution: Forward test (paper trade) a strategy before committing capital.

Summary

It is best to view backtesting as a method for rejecting strategies rather than as a method for validating them. One thing is for sure: if it doesn’t work in a backtest, it won’t work in real life. The converse is not true: just because it works in a backtest does not mean you can expect it to work in live trading.

However, if you avoid the pitfalls listed above, your backtests stand a better chance of more accurately representing real life performance.

===
Live Webinar: Dr. Balch will present a webinar on this topic on April 24, 2015 at 11AM. You can register to watch the webinar live by following this link.

QuantCon 2015 Replay: Missed QuantCon 2015? Watch Tucker's talk and view his presentation slides from the event.

About the author
Tucker Balch, Ph.D. is a professor of Interactive Computing at Georgia Tech. He is also CTO of Lucena Research, Inc., a financial decision support technology company. You can read more essays by Tucker at http://augmentedtrader.com.

The material on this website is provided for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation or endorsement for any security or strategy, nor does it constitute an offer to provide investment advisory or other services by Quantopian. In addition, the content of the website offers no opinion with respect to the suitability of any security or any specific investment.

Quantopian makes no guarantees as to accuracy or completeness of the views expressed in the website. The views are subject to change, and may have become unreliable for various reasons, including changes in market conditions or economic circumstances. All investments involve risk – including loss of principal. You should consult with an investment professional before making any investment decisions.

Good post. However, if one reads enough of my posts, one will find some bad backtests--though only of others' strategies =P

That said, I display all my code so anyone can feel free to dispute my results.

I agree with everything except "Complex models are often overfit models". Whilst it is certainly true that models with more parameters can overfit the data more easily, it is also true that simple models with too few parameters will never fit the data (underfit). Also, there is a large body of research dedicated to ways of checking that more complex models (regression trees, automatons, neural networks, etc.) are not over-fitting. All one can really say is that generally speaking Ockham's Razor holds - 'given two models with equal in-sample performance, the model with fewer parameters will generalize better'. The key point is that you have two models of equal in-sample performance, not just any two models with different numbers of parameters. Another reason for 'overfitting' is actually that markets are dynamic, but that's a different story.

For me the most important point is #9. Churning through strategies but using a 'correct' testing approach is no guarantee of success as this in itself can be a form of over optimisation.

I would say that paper trading is overrated and would suggest very small position sizing to start, as this allows for 'skin in the game' and can focus a trader's attention much better than a demo account.

@Stuart Reid, I don't consider models built using regression trees to be complex. The complexity I'm talking about concerns how many factors are used in the model.

@Ryan -- yes, skin in the game makes it real.

@Tucker, thanks for the clarification. It depends on your definition of 'complex model'. In your point regarding over-fitting you hit on the definition I am familiar with, namely that model complexity is proportional to the number of free variables (parameters) in a model.

Under this definition, regression trees are as complex as the number of decision boundaries learnt by the tree (which can become very large) and neural networks are as complex as the number of connections between the perceptrons ... which is the same as for linear regression models, where the model is as complex as the number of coefficients (inputs).

Assuming that the number of free variables grows with the number of factors, then we are saying the same thing and I stand by my point, "there is a large body of research dedicated to ways of checking that more complex models are not over-fitting". It just seems one-sided to mention how complex models over-fit without mentioning that simple models under-fit.

As I said, the only conclusion you can really make is that generally speaking Ockham's Razor holds - 'given two models with equal in-sample performance, the model with fewer [parameters] will generalize better'. [parameters] :- variables, factors, etc. In my line of work we use complex hedging models with many factors & parameters which do not over-fit.

P.S. I'm actually quite a big fan of your Coursera Computational Finance course. Sorry to give you a hard time. I just disagree with the quant status quo regarding complexity 🙂

james

At what volume of trading should one consider market impact? Surely for small individual investors the impact is little to none?

Grant Kiehne

Hello Tucker,

There's a flip side to "4. Ignoring market impact" in that it implies the market could possibly be manipulated to one's advantage, rather than taking it on the chin. If my algo can affect the market such that I lose money unexpectedly, shouldn't I be able to devise an algo that affects the market in such a way that I produce predictable gains (a kind of "inverse slippage" effect)? If one has an accurate model for how the market will react to a given perturbation, then naively it would seem possible to devise profitable schemes, particularly if the market over-reacts to my actions.

Grant

This is an interesting piece of advice and surely something that newbies should look at. I have a couple of comments and more advice at the bottom.

1. I like to use walk-forward optimization with an expanding window. Some people prefer sliding windows; I don't see any real value there. In my opinion the best method is a foreign market that has similar standards (US and German markets). I disagree with the lookahead bias when testing backwards.

2. This is one of the most overlooked errors. We have spent over a year producing clean data for the major US stocks, free of OHLC issues and of course considering which symbols were in an index at the time. Results that we have will surprise the general consensus.

3. That error is too simple to comment on. However, your solution has a flaw: the open price. This is the most wrongly reported price!

4. and 5. Any good software takes this into account: you limit the number of shares you can buy to a certain percentage of the volume. At least that works for backtesting, and it is still no guarantee that your order does not have an impact.
Depending on your average profit per trade (in percent), you should know the limit of your account and your position size.

6. That's a given, and you will probably see this when using my method described in #1.

7. Right on the spot. KISS is the right thing.

8. Must be your money management method you are using. I haven't seen such a difference on strategies traded on stocks but I believe I have seen it on strategies trading futures with certain money management methods.

9. I think you covered that in one of your above statements. Forward testing on the computer is IMO not helpful, you should trade it with a small position size. I guess your methods don't allow that?

Summary:
There are many more things to look at than the ones stated: commission, position size, slippage, average profit per trade, exposure… You need the right software and data to consider all of that. Most software and data providers aren't set up for that.

Gene

Just one observation.

Forward testing is just as faulty as backtesting, and here is why.
Imagine you have a universe of 100 strategies and you did tuning, optimizing, and filtering (all that backtesting optimization stuff), and 10 of them did well on what is now in-sample data. So you, being smart, ran those 10 on out-of-sample data (forward optimization) and found the one strategy that did well both in-sample and out-of-sample. Should your confidence in that one be any higher than in those previous 10? Not really, because had you run that universe of 100 strategies on the aggregated in-sample and out-of-sample data, you would have gotten exactly the same one survivor strategy. And in the latter case all your data would have been in-sample -- exactly what you tried to avoid by doing out-of-sample testing.

Gene asked

>> Forward testing is just as faulty as backtesting... Should your confidence in that one be any higher than in those previous 10?

@Gene, eventually you need to invest in something. I view these methods primarily as ways to eliminate strategies. So backtesting is a first filter, then forward testing is another. One reason forward testing is important is that it culls strategies with lookahead bias. It validates the backtest in this way.

>> Forward testing on the computer is IMO not helpful, you should trade it with a small position size. I guess your methods don't allow that?

@Volker, I agree that trading forward with a small amount of money is an excellent approach, even better than forward trading on the computer. It is just more complex. It is fine if you have the resources to trade several strategies forward.

>> shouldn't I be able to devise an algo that affects the market in such a way that I produce predictable gains (a kind of "inverse slippage" effect)?

@Grant, I believe that's illegal. But that doesn't mean that people don't do it.

>> at what volume of trading should one consider market impact? surely for small individual investors the impact is little to none?

If your trade is going to be more than 5% of the dollar volume of the market over the period that you will trade, you should be considering this.

>> regression trees are as complex as the number of decision boundaries learnt by the tree

@Stuart, I concur, the learning method can include its own complexity in addition to the number of factors used. FWIW, when I use decision trees, I usually use them in the context of an ensemble or forest of learners. This tends to reduce overfitting.
