
Bayesian Optimization of a Technical Trading Algorithm with Zipline+SigOpt

Authors: Justin Lent (Quantopian), Thomas Wiecki (Quantopian), Scott Clark (SigOpt)

Parameter optimization of trading algorithms is quite different from most other optimization problems. Specifically, the optimization problem is non-convex, non-linear, stochastic, and can include a mix of integers, floats, and enums as parameters. Moreover, most optimizers assume that the objective function is quick to evaluate, which is definitely not the case for a trading algorithm run over multiple years of financial data. That immediately disqualifies 95% of optimizers, including those offered by scipy or cvxopt. At Quantopian we have long been, and continue to be, interested in robust methods for parameter optimization of trading algorithms.

Bayesian optimization is a rather novel approach to parameter selection that is very well suited to optimizing trading algorithms. This blog post will first provide a short introduction to Bayesian optimization, with a focus on why it is well suited for quantitative finance. We will then show how you can use SigOpt to perform Bayesian optimization on your own trading algorithms running with zipline.

This blog post originally resulted from a collaboration with Alex Wiltschko where we used Whetlab for Bayesian optimization. Whetlab, however, has since been acquired by Twitter and the Whetlab service was discontinued. Scott Clark from SigOpt helped in porting the code to their service which is comparable in functionality and API. Scott also co-wrote the blog post.

Introduction to Bayesian optimization

Bayesian Optimization is a powerful tool that is particularly useful when optimizing anything that is both time consuming and expensive to evaluate (like trading algorithms). At the core, Bayesian Optimization attempts to leverage historical observations to make optimal suggestions on the best variation of parameters to sample (maximizing some objective like expected returns). This field has been actively studied in academia for decades, from the seminal paper "Efficient Global Optimization of Expensive Black-Box Functions" by Jones et al. in 1998 to more recent contributions like "Practical Bayesian Optimization of Machine Learning Algorithms" by Snoek et al. in 2012.

Many of these approaches take a similar route to solving the problem: they map the observed evaluations onto a Gaussian Process (GP), fit the best possible GP model, perform optimization on this model, then return a new set of suggestions for the user to evaluate. At the core, these methods balance the tradeoff between exploration (learning more about the model, the length scales over which they vary, and how they combine to influence the overall objective) and exploitation (using the knowledge already gained to return the best possible expected result). By efficiently and automatically making these tradeoffs, Bayesian Optimization techniques can quickly find the global optima of these difficult to optimize problems, often much faster than traditional methods like brute force or localized search.

At every Bayesian Optimization iteration, the point of highest Expected Improvement is returned; this is the set of parameters that, in expectation, will most improve upon the best objective value observed thus far. SigOpt wraps these powerful optimization techniques behind a simple API so that any expert in any field can optimize their models without resorting to expensive trial and error. More information about how SigOpt works, as well as other examples of using Bayesian Optimization to perform parameter optimization, can be found on our research page.
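To make the mechanics concrete, here is a minimal sketch of that suggest/evaluate loop using a Gaussian Process and the expected improvement criterion. This is an illustrative toy (a scikit-learn GP, random candidate sampling, and a stand-in 1-D objective), not SigOpt's implementation:

```python
# A minimal sketch of the GP + expected-improvement loop described above.
# The toy objective stands in for an expensive backtest.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):
    # Stand-in for an expensive evaluation (e.g. a multi-year backtest).
    return -np.sin(3 * x) - x ** 2 + 0.7 * x

def expected_improvement(candidates, gp, best_y):
    # EI for maximization: how much each candidate is expected to
    # improve on the best objective value observed so far.
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - best_y) / sigma
    return (mu - best_y) * norm.cdf(z) + sigma * norm.pdf(z)

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(3, 1))  # a few seed evaluations
y = objective(X).ravel()

for _ in range(20):
    # Fit the GP model to all observations so far.
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5),
                                  normalize_y=True).fit(X, y)
    # Suggest the candidate with the highest expected improvement.
    candidates = rng.uniform(-2, 2, size=(1000, 1))
    ei = expected_improvement(candidates, gp, y.max())
    x_next = candidates[np.argmax(ei)]
    # Evaluate it and add the result to the history.
    X = np.vstack([X, x_next])
    y = np.append(y, objective(x_next))

print("best x:", X[np.argmax(y)], "best value:", y.max())
```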

A Generalized Framework for Evaluating Trading Signals

Historically, quant traders have used many price-based signals to define an investment strategy. Many of these signals have been implemented in the popular TA-lib library, with an available Python library here. Typically, a price-based signal takes historical prices as input to compute the signal's value. For example, RSI (Relative Strength Index) is commonly used for mean-reversion and momentum trading. To compute RSI, one must first choose the number of pricing days over which to compute the signal. Then the trader chooses a range of valid values to trigger trade entry. Valid RSI values range from 0 to 100. If a stock has undergone a sharp selloff recently, RSI values will trend towards 0; after a strong rally, they trend towards 100. A common trading strategy is to go long a stock when RSI is below 20, betting on mean-reversion, and similarly to go short when RSI is above 80. However, other groups of traders have found success betting on persistent momentum in the stock (rather than mean-reversion) when RSI reaches an extreme reading. In that case, the trader might go long when RSI is above 80, for example, betting that the stock will continue going higher.
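As a concrete illustration, here is how the RSI signal and the classic 20/80 entry thresholds described above might be checked with the TA-Lib Python bindings (the price series below is a synthetic stand-in):

```python
import numpy as np
import talib

# Synthetic stand-in for a real closing-price history.
rng = np.random.default_rng(1)
closes = 100 + rng.normal(0, 1, 200).cumsum()

# The lookback window (14 days here) is itself a free parameter.
rsi = talib.RSI(closes, timeperiod=14)

if rsi[-1] < 20:
    print("mean-reversion entry: go long, RSI = %.1f" % rsi[-1])
elif rsi[-1] > 80:
    print("momentum entry (the alternate interpretation): go long, RSI = %.1f" % rsi[-1])
```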

This poses several questions:

  • So which is right? Should we use RSI as a mean-reversion indicator or a momentum indicator?
    • What range of RSI values shall we choose to determine if the momentum or mean-reversion condition is met?
    • How many lookback days of prices should we use to compute the stock's RSI?
  • Besides RSI, can we integrate a second indicator into our investment strategy?

Each decision we make when specifying a trading signal (e.g. RSI), or when interpreting the signal's value to make a trading decision, is effectively a free parameter in our system. Depending on the range of reasonable values each free parameter can take, the total number of possible parameter settings can explode quickly. Each signal added to the strategy can push the total into the many millions or even billions of combinations; we'll see this firsthand in the example below, which incorporates only 2 signals.

A discussion of strategy model overfitting, and evaluating how overfit a trading backtest may be, will not be addressed here, and will be the topic of a future blog post. An excellent introduction for how to address trading strategy overfitting in your own algorithms can be found here and here.

This large parameter space of millions, or billions, of combinations, over which our strategy must be tested to determine the subspace where the global maximum is likely located, is why Bayesian optimization can be so effective at quickly evaluating potentially profitable trading strategies. Brute-force grid search over a billion-combination parameter space is often intractable, even if each combination takes only 30 seconds to evaluate. Bayesian optimization decreases the number of evaluations needed to cover the global search space by an order of magnitude, as described in the previous section.

The trading algorithm we implement below creates a simple structure for passing free parameters into simple price-based trading signals (simplified to work easily with ta-lib functions). Each signal is then evaluated every trading day, and when all the conditions are true, a trade is entered and held until the next signal evaluation period (where the evaluation period is yet another free parameter).
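A sketch of what that structure might look like; the signal tuple layout and names here are illustrative, not the exact ones from the notebook:

```python
import numpy as np
import talib

SIGNALS = [
    # (ta-lib function, lookback, lower_bound, range_width)
    (talib.RSI, 14, 20, 20),   # RSI signal is True in [20, 40]
    (talib.ROC, 10, 0, 10),    # ROC signal is True in [0, 10]
]

def all_signals_true(closes):
    """Evaluate every signal on a closing-price history; a trade is
    entered only when all signals fall inside their configured ranges."""
    for func, lookback, lower, width in SIGNALS:
        value = func(np.asarray(closes, dtype=float), timeperiod=lookback)[-1]
        if not (lower <= value <= lower + width):
            return False
    return True
```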

For the purposes of this optimization, our objective function will be the Sharpe ratio of the strategy, a broadly accepted industry metric for evaluating trading strategy performance. The framework implemented below, however, makes it easy to swap in any objective function the analyst desires.
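For reference, a minimal version of this objective, assuming daily strategy returns and a zero risk-free rate:

```python
import numpy as np

def sharpe_ratio(daily_returns, periods_per_year=252):
    """Annualized Sharpe ratio of a daily return series (risk-free
    rate assumed to be zero for simplicity)."""
    daily_returns = np.asarray(daily_returns)
    return np.sqrt(periods_per_year) * daily_returns.mean() / daily_returns.std()
```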

Example Zipline algorithm

To illustrate how you can use Bayesian optimization on your zipline trading algorithms, and how it compares to other naive approaches (e.g. grid search), we will use a rather simple algorithm built around a trading trigger based on two commonly used signals from the ta-lib library: RSI (Relative Strength Index) and ROC (Rate of Change).

The trading algorithm implements what might be considered a simple sector-rotation strategy, searching for trades across these Select Sector SPDR ETFs:

  • XLV, XLF, XLP, XLE, XLK, XLB, XLU, XLI.


-----------------------------------------------------------------------------------------------------
Strategy Specification

Buy an ETF only if it meets both the RSI signal and the ROC signal criteria. When an ETF is bought long, an equivalent dollar amount of SPY is shorted, creating a hedged, dollar-neutral portfolio.

-----------------------------------------------------------------------------------------------------

Hedging all of our trades serves to "tease apart" the actual usefulness provided by these signals (RSI, ROC), since it strips out upward movement in the stock that occurs simply because the rest of the stock market is going up. As a result, the profit achieved by this hedged strategy can be perceived as more "pure alpha," rather than being highly correlated with the direction of the broad stock market.
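In zipline terms, the hedged entry might look like the sketch below; `context.spy` and the dollar sizing are assumptions for illustration, not the notebook's exact code:

```python
from zipline.api import order_value

def enter_hedged_position(context, etf, dollars):
    """Go long the sector ETF and short an equal dollar amount of SPY,
    keeping the position dollar-neutral."""
    order_value(etf, dollars)           # long the sector ETF
    order_value(context.spy, -dollars)  # short the SPY hedge
```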

The 7 free parameters of our trading strategy are as follows:

  • RSI
    • (1) Lookback window for # of prices used in the RSI calculation
    • (2) Lower_bound value defining the trade entry condition
    • (3) Range_width, which will be added to the Lower_bound
      • Lower_bound to Lower_bound + Range_width is the range of values over which our RSI signal will be considered True
  • ROC
    • (4) Lookback window for # of prices used in the ROC calculation
    • (5) Lower_bound value defining the trade entry condition
    • (6) Range_width, which will be added to the Lower_bound
      • Lower_bound to Lower_bound + Range_width is the range of values over which our ROC signal will be considered True
  • Signal evaluation frequency
    • (7) Number of days between evaluations of our signals
      • Do we evaluate them every day, every week, every month, etc.

It's worth noting that even with just 2 price based signals, we have 7 free parameters in this system!

Reasonable Ranges for each of the 7 free parameters above (assuming each is an integer, with integer steps):

  1. 115 values: 5 to 120
  2. 90 values: 0 to 90
  3. 20 values: 10 to 30
  4. 61 values: 2 to 63
  5. 30 values: 0 to 30
  6. 195 values: 5 to 200
  7. 18 values: 3 to 21

Multiplying the valid ranges of each yields a total combination count of:

  • 1,329,623,100,000 theoretical combinations
    • 115 x 90 x 20 x 61 x 30 x 195 x 18
    • Imagine how many combinations are possible if 3, 4, 5... 10 signals are added to a strategy
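For reference, the total above is just the product of the per-parameter counts listed (using the post's max-minus-min counting convention):

```python
import numpy as np

# Per-parameter value counts from the list above.
counts = [115, 90, 20, 61, 30, 195, 18]
total = np.prod(counts, dtype=np.int64)
print(f"{total:,}")   # 1,329,623,100,000
```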

Obviously, grid-searching through all of those combinations is unreasonable, though a skilled practitioner can prune the search space significantly by grid-searching each parameter with wider steps, based on their intuition about the model they are building. But even if a skilled practitioner can reduce the grid search to 10,000 combinations, even that number may be unwieldy if the objective function (e.g. a trading strategy backtest) takes 1 minute to evaluate, which is quite frequently the case with trading strategies. This is where having access to a Bayesian optimizer becomes extremely helpful.
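The experiment below drives SigOpt through its suggest/observe loop. This sketch follows the SigOpt Python client's API as of this writing; the parameter names and the `run_backtest()` helper are hypothetical stand-ins for the zipline backtest:

```python
from sigopt import Connection

conn = Connection(client_token="YOUR_API_TOKEN")

# The 7 free parameters with the ranges listed above.
experiment = conn.experiments().create(
    name="RSI/ROC sector rotation",
    parameters=[
        dict(name="rsi_lookback", type="int", bounds=dict(min=5, max=120)),
        dict(name="rsi_lower",    type="int", bounds=dict(min=0, max=90)),
        dict(name="rsi_width",    type="int", bounds=dict(min=10, max=30)),
        dict(name="roc_lookback", type="int", bounds=dict(min=2, max=63)),
        dict(name="roc_lower",    type="int", bounds=dict(min=0, max=30)),
        dict(name="roc_width",    type="int", bounds=dict(min=5, max=200)),
        dict(name="eval_days",    type="int", bounds=dict(min=3, max=21)),
    ],
)

for _ in range(300):
    # Ask SigOpt for the next parameter combination to try...
    suggestion = conn.experiments(experiment.id).suggestions().create()
    # ...run the backtest (hypothetical helper returning the Sharpe ratio)...
    sharpe = run_backtest(**suggestion.assignments)
    # ...and report the observed objective value back to SigOpt.
    conn.experiments(experiment.id).observations().create(
        suggestion=suggestion.id, value=sharpe)
```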

Below is the result of an analysis, accomplished in an IPython notebook, comparing the results of 3 independent SigOpt Bayesian optimization experiments of only 300 trials each (determined intelligently by the optimizer) against a "smart" grid search of approximately 3500 combinations chosen intuitively from reasonable interpretations of sensible RSI and ROC values. Only 3500 trials were chosen for the grid-search approach because even those few combinations took 48 hours to evaluate. 300 trials were chosen for the SigOpt approach because, in practice, the SigOpt optimizer is able to find good optima within 10 to 20 times the dimensionality of the parameter space being optimized. This linear number of evaluations with respect to the number of parameters allows for optimization of algorithms that would otherwise be intractable using standard methods like grid search, which grow exponentially with the number of parameters.

Conclusion

The SigOpt Bayesian optimizer discovered a better global maximum when testing only 300 combinations than what was discovered by the ~3500-combination grid search.

(This was seen in 4 out of 5 runs with SigOpt, and the 1 run that returned a worse objective value fell short only marginally.)

In practice, SigOpt is able to find good optima in a linear number of evaluations with respect to the number of parameters being tuned (usually between 10 and 20 times the dimensionality of the parameter space). Grid search, even an expertly tuned grid search, grows exponentially with the number of parameters. As the model being tuned becomes more complex, it can quickly become completely intractable to tune using these traditional methods.

An example of how quickly SigOpt discovered a parameter combination near our expected global maximum is shown below, along with a comparison versus the extremely coarse (and slow) grid search:


Deeper Dive

Next we will inspect aspects of the optimization further. (If you wish to view this entire blog post within the context of the IPython notebook in which it was created, you can view it here on Quantopian's public research repo.)

First, a look at the distribution of the objective values returned from each iteration of each optimizer. We notice that SigOpt's Bayesian approach tests more points in the parameter space that are near the expected global maximum, as the mean of all its trials is much closer to the maximum value discovered. The mean is also firmly above zero, so its suggested parameter combinations are targeting a region of the space with the desired outcome.

Next, let's inspect the distribution of values of each parameter attempted by each method, to get a sense of how the Bayesian approach is able to home in on the specific region more precisely. The coarse grid search simply sets a min/max range for each parameter, along with a discrete step to traverse the grid, which is fairly common practice among industry practitioners running parameter optimizations. A more sophisticated search could be implemented via random sampling, or perhaps particle swarm optimization, but that complexity is commonly out of reach for non-programmers.

Grid-Search Parameter Combination Distributions.

SigOpt Parameter Combination Distributions.

Here we can see the pairwise relationships among the SigOpt-suggested parameter combinations, and which regions of each parameter intersection were found to produce the best objective function values. Along the diagonal is simply the distribution of each individual parameter (as a KDE line plot, similar to the histograms shown directly above).

Now let's apply the optimal parameters chosen by each method to our held-out, out-of-sample data. (We trained our model using market data from 2004-2009; the held-out data is from 2010-2015.)
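As a sketch of this split (newer zipline releases expose `run_algorithm`; the empty `initialize` stub and the data bundle here are assumptions, standing in for the full strategy setup):

```python
import pandas as pd
from zipline import run_algorithm

def initialize(context):
    # Strategy setup would go here (signal parameters, ETF universe, etc.).
    pass

common = dict(initialize=initialize, capital_base=1e6,
              bundle="quantopian-quandl")

# In-sample period used for the parameter optimization.
in_sample = run_algorithm(start=pd.Timestamp("2004-01-02", tz="utc"),
                          end=pd.Timestamp("2009-12-31", tz="utc"), **common)
# Held-out period for out-of-sample evaluation.
out_of_sample = run_algorithm(start=pd.Timestamp("2010-01-04", tz="utc"),
                              end=pd.Timestamp("2015-12-31", tz="utc"), **common)
```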

Grid-Search Results

In-Sample, Grid-Search Optimal Strategy.
Out-of-Sample, Grid-Search Optimal Parameters.

SigOpt Results

In-Sample, SigOpt Optimal Strategy.
Out-of-Sample, SigOpt Optimal Parameters.

Takeaway: Recognizing how poorly the strategy performs out-of-sample shows how important it is to perform additional analysis (cross-validation, out-of-sample testing, etc.) after using parameter optimization to discover a global maximum.


On a positive note, however, by increasing the speed of the optimization with the Bayesian approach versus grid search, we were able to assess our out-of-sample performance much more rapidly; the grid search took over 2 days to finish! The Bayesian optimization via SigOpt allowed us to continue our research process 10x faster, in a matter of hours rather than days.

For completeness, and to put the entire analysis together across backtest and out-of-sample, below is a pyfolio tearsheet allowing visual inspection of the strategy as it transitions from in-sample to out-of-sample.
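The tearsheet can be generated with a single pyfolio call, where `returns` is the strategy's daily return series (e.g. the `returns` column of a zipline performance DataFrame) and `live_start_date` marks the in-sample/out-of-sample boundary:

```python
import pyfolio as pf

# `returns` is assumed to be a pandas Series of daily strategy returns
# spanning both the in-sample and out-of-sample periods.
pf.create_full_tear_sheet(returns, live_start_date="2010-01-01")
```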



If you wish to work on this analysis, or view the code used to accomplish the above, feel free to clone our research repo on GitHub.



The material on this website is provided for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation or endorsement for any security or strategy, nor does it constitute an offer to provide investment advisory services by Quantopian.

In addition, the material offers no opinion with respect to the suitability of any security or specific investment. No information contained herein should be regarded as a suggestion to engage in or refrain from any investment-related course of action as none of Quantopian nor any of its affiliates is undertaking to provide investment advice, act as an adviser to any plan or entity subject to the Employee Retirement Income Security Act of 1974, as amended, individual retirement account or individual retirement annuity, or give advice in a fiduciary capacity with respect to the materials presented herein. If you are an individual retirement or other investor, contact your financial advisor or other fiduciary unrelated to Quantopian about whether any given investment idea, strategy, product or service described herein may be appropriate for your circumstances. All investments involve risk, including loss of principal. Quantopian makes no guarantees as to the accuracy or completeness of the views expressed in the website. The views are subject to change, and may have become unreliable for various reasons, including changes in market conditions or economic circumstances.

Simon Thornington

Comparisons to grid search are not terribly fair; for a 7-parameter optimization, one would normally try particle swarm or differential evolution?

Simon,
I actually mentioned particle swarm in the post. However, there is a bit of a hurdle there IMHO, in that it requires a relatively strong programming background/understanding. As you're probably aware, grid search is incredibly prevalent in the industry primarily because it's easy to implement, and that is partly the secondary takeaway of this post: grid search is ineffective for anything beyond a 2- or 3-parameter optimization.

Nice extract, and the analysis is brilliant.

I suggest you start the analysis with the "distribution of number of months" chart on top. The chart shows the system is set to lose in the long run, as the left area minus the right area would be negative.

So first show the algo results after modding, then the mods applied to the algo: "we reached this... that was the cause" sounds more "Bayesian" to me.

John

This is a great analysis of why the Quantopian fund has not made any money out of sample.

David Anderson

Well done for publishing negative results. You have shown that the primary challenge is not parameter search but overfitting. Can you apply Bayesian techniques to this problem?

Simon, while there is a large swath of global optimization techniques out there like Particle Swarm, Differential Evolution, Simulated Annealing, Genetic Algorithms, or even Random Search (which papers show fares well against grid search), all of these methods require a large number of total evaluations to find an optimum. Bayesian Optimization techniques are very valuable when the underlying evaluation is time consuming and expensive. For this relatively simple trading strategy it took ~1 min per evaluation, but as this increases with the complexity of the model, many of these other techniques can quickly become completely intractable. While an expertly tuned grid search made that approach somewhat feasible in this case, it still took 10x longer (not counting expert time) to find worse results, and this only gets worse for this and other methods as the complexity of the problem and the dimensionality of the parameter space increase.

David, you can use these same techniques to optimize any objective function. In practice we recommend either using cross validation or regularization to help counter over-fitting. For an example of using a cross validated metric that performed well in a holdout dataset check out our blog on tuning betting algorithms: http://blog.sigopt.com/post/136340340198/sigopt-for-ml-using-model-tuning-to-beat-vegas
