(c) 2016 by Dr. Thomas Wiecki.

Originally published at http://twiecki.github.io/blog/2016/06/01/bayesian-deep-learning/

There are currently three big trends in machine learning: **Probabilistic Programming**, **Deep Learning** and "**Big Data**". Inside of PP, a lot of innovation is in making things scale using **Variational Inference**. In this blog post, I will show how to use **Variational Inference** in PyMC3 to fit a simple Bayesian Neural Network. I will also discuss how bridging Probabilistic Programming and Deep Learning can open up very interesting avenues to explore in future research.

**Probabilistic Programming** allows very flexible creation of custom probabilistic models and is mainly concerned with **insight** and learning from your data. The approach is inherently **Bayesian** so we can specify **priors** to inform and constrain our models and get uncertainty estimation in the form of a **posterior** distribution. Using MCMC sampling algorithms we can draw samples from this posterior to very flexibly estimate these models. PyMC3 and Stan are the current state-of-the-art tools for constructing and estimating these models. One major drawback of sampling, however, is that it's often very slow, especially for high-dimensional models. That's why more recently, **variational inference** algorithms have been developed that are almost as flexible as MCMC but much faster. Instead of drawing samples from the posterior, these algorithms fit a distribution (e.g. normal) to the posterior, turning a sampling problem into an optimization problem. ADVI -- Automatic Differentiation Variational Inference -- is implemented in PyMC3 and Stan, as well as a new package called Edward which is mainly concerned with Variational Inference.

Unfortunately, when it comes to traditional ML problems like classification or (non-linear) regression, Probabilistic Programming often plays second fiddle (in terms of accuracy and scalability) to more algorithmic approaches like ensemble learning (e.g. random forests or gradient boosted regression trees).

Now in its third renaissance, deep learning has been making headlines repeatedly by dominating almost any object recognition benchmark, kicking ass at Atari games, and beating the world-champion Lee Sedol at Go. From a statistical point of view, Neural Networks are extremely good non-linear function approximators and representation learners. While mostly known for classification, they have been extended to unsupervised learning with AutoEncoders and in all sorts of other interesting ways (e.g. Recurrent Networks, or MDNs to estimate multimodal distributions). Why do they work so well? No one really knows, as the statistical properties are still not fully understood.

A large part of the innovation in deep learning is the ability to train these extremely complex models. This rests on several pillars:

- Speed: utilizing the GPU allowed for much faster processing.
- Software: frameworks like Theano and TensorFlow allow flexible creation of abstract models that can then be optimized and compiled to CPU or GPU.
- Learning algorithms: training on sub-sets of the data -- stochastic gradient descent -- allows us to train these models on massive amounts of data. Techniques like drop-out avoid overfitting.
- Architectural: A lot of innovation comes from changing the input layers, like for convolutional neural nets, or the output layers, like for MDNs.

On one hand we have Probabilistic Programming, which allows us to build rather small and focused models in a very principled and well-understood way to gain insight into our data; on the other hand we have deep learning, which uses many heuristics to train huge and highly complex models that are amazing at prediction. Recent innovations in variational inference allow probabilistic programming to scale model complexity as well as data size. We are thus at the cusp of being able to combine these two approaches to hopefully unlock new innovations in Machine Learning. For more motivation, see also Dustin Tran's recent blog post.

While this would allow Probabilistic Programming to be applied to a much wider set of interesting problems, I believe this bridging also holds great promise for innovations in Deep Learning. Some ideas are:

- **Uncertainty in predictions**: As we will see below, the Bayesian Neural Network informs us about the uncertainty in its predictions. I think uncertainty is an underappreciated concept in Machine Learning as it's clearly important for real-world applications. But it could also be useful in training. For example, we could train the model specifically on samples it is most uncertain about.
- **Uncertainty in representations**: We also get uncertainty estimates of our weights, which could inform us about the stability of the learned representations of the network.
- **Regularization with priors**: Weights are often L2-regularized to avoid overfitting; this very naturally becomes a Gaussian prior for the weight coefficients. We could, however, imagine all kinds of other priors, like spike-and-slab to enforce sparsity (this would be more like using the L1-norm).
- **Transfer learning with informed priors**: If we wanted to train a network on a new object recognition data set, we could bootstrap the learning by placing informed priors centered around weights retrieved from other pre-trained networks, like GoogLeNet.
- **Hierarchical Neural Networks**: A very powerful approach in Probabilistic Programming is hierarchical modeling, which allows pooling of things that were learned on sub-groups to the overall population (see my tutorial on Hierarchical Linear Regression in PyMC3). Applied to Neural Networks, in hierarchical data sets we could train individual neural nets to specialize on sub-groups while still being informed about representations of the overall population. For example, imagine a network trained to classify car models from pictures of cars. We could train a hierarchical neural network where a sub-neural network is trained to tell apart models from only a single manufacturer. The intuition is that all cars from a certain manufacturer share certain similarities, so it would make sense to train individual networks that specialize on brands. However, because the individual networks are connected at a higher layer, they would still share information with the other specialized sub-networks about features that are useful to all brands. Interestingly, different layers of the network could be informed by various levels of the hierarchy -- e.g. early layers that extract visual lines could be identical in all sub-networks while the higher-order representations would be different. The hierarchical model would learn all that from the data.
- **Other hybrid architectures**: We can more freely build all kinds of neural networks. For example, Bayesian non-parametrics could be used to flexibly adjust the size and shape of the hidden layers to optimally scale the network architecture to the problem at hand during training. Currently, this requires costly hyper-parameter optimization and a lot of tribal knowledge.
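The regularization-with-priors point can be made concrete with a small numpy sketch (function names are my own): the negative log-density of a zero-mean Gaussian prior on a weight is an L2 penalty plus a constant, while a zero-mean Laplace prior yields an L1 penalty.

```python
import numpy as np

def neg_log_gaussian(w, sigma=1.0):
    # Negative log-density of N(0, sigma^2): 0.5*w^2/sigma^2 + const
    return 0.5 * w**2 / sigma**2 + 0.5 * np.log(2 * np.pi * sigma**2)

def neg_log_laplace(w, b=1.0):
    # Negative log-density of Laplace(0, b): |w|/b + const
    return np.abs(w) / b + np.log(2 * b)

w = np.array([-2.0, 0.0, 2.0])
# Subtracting the value at w=0 removes the constant, leaving the pure penalty.
l2_term = neg_log_gaussian(w) - neg_log_gaussian(0.0)  # equals 0.5 * w**2
l1_term = neg_log_laplace(w) - neg_log_laplace(0.0)    # equals |w|
```

So maximizing the posterior under a Gaussian prior is the same optimization as minimizing an L2-regularized loss, which is why the prior-based view generalizes regularization so naturally.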

First, let's generate some toy data -- a simple binary classification problem that's not linearly separable.

In [1]:

```
%matplotlib inline
import pymc3 as pm
import theano.tensor as T
import theano
import sklearn
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white')
from sklearn import datasets
from sklearn.preprocessing import scale
from sklearn.cross_validation import train_test_split
from sklearn.datasets import make_moons
```

In [2]:

```
X, Y = make_moons(noise=0.2, random_state=0, n_samples=1000)
X = scale(X)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.5)
```

In [3]:

```
fig, ax = plt.subplots()
ax.scatter(X[Y==0, 0], X[Y==0, 1], label='Class 0')
ax.scatter(X[Y==1, 0], X[Y==1, 1], color='r', label='Class 1')
sns.despine(); ax.legend()
ax.set(xlabel='X', ylabel='Y', title='Toy binary classification data set');
```

A neural network is quite simple. The basic unit is a perceptron which is nothing more than logistic regression. We use many of these in parallel and then stack them up to get hidden layers. Here we will use 2 hidden layers with 5 neurons each which is sufficient for such a simple problem.
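Before expressing this in PyMC3/Theano, here is the deterministic forward pass in plain numpy (a sketch with random weights; the variable names mirror the model code below):

```python
import numpy as np

rng = np.random.RandomState(0)
n_hidden = 5
X = rng.randn(10, 2)               # 10 toy inputs with 2 features
w_in_1 = rng.randn(2, n_hidden)    # input -> 1st hidden layer
w_1_2 = rng.randn(n_hidden, n_hidden)  # 1st -> 2nd hidden layer
w_2_out = rng.randn(n_hidden)      # 2nd hidden layer -> output

act_1 = np.tanh(X.dot(w_in_1))     # (10, 5)
act_2 = np.tanh(act_1.dot(w_1_2))  # (10, 5)
# Sigmoid output: probability of class 1 for each input
p_class_1 = 1.0 / (1.0 + np.exp(-act_2.dot(w_2_out)))  # (10,)
```

The Bayesian version below is this exact computation, except the weight matrices become random variables with priors instead of fixed arrays.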

In [17]:

```
# Trick: Turn inputs and outputs into shared variables.
# It's still the same thing, but we can later change the values of the shared variable
# (to switch in the test-data later) and pymc3 will just use the new data.
# Kind-of like a pointer we can redirect.
# For more info, see: http://deeplearning.net/software/theano/library/compile/shared.html
ann_input = theano.shared(X_train)
ann_output = theano.shared(Y_train)

n_hidden = 5

# Initialize random weights between each layer
init_1 = np.random.randn(X.shape[1], n_hidden)
init_2 = np.random.randn(n_hidden, n_hidden)
init_out = np.random.randn(n_hidden)

with pm.Model() as neural_network:
    # Weights from input to hidden layer
    weights_in_1 = pm.Normal('w_in_1', 0, sd=1,
                             shape=(X.shape[1], n_hidden),
                             testval=init_1)

    # Weights from 1st to 2nd layer
    weights_1_2 = pm.Normal('w_1_2', 0, sd=1,
                            shape=(n_hidden, n_hidden),
                            testval=init_2)

    # Weights from hidden layer to output
    weights_2_out = pm.Normal('w_2_out', 0, sd=1,
                              shape=(n_hidden,),
                              testval=init_out)

    # Build neural-network using tanh activation function
    act_1 = T.tanh(T.dot(ann_input, weights_in_1))
    act_2 = T.tanh(T.dot(act_1, weights_1_2))
    act_out = T.nnet.sigmoid(T.dot(act_2, weights_2_out))

    # Binary classification -> Bernoulli likelihood
    out = pm.Bernoulli('out', act_out, observed=ann_output)
```

The `Normal` priors help regularize the weights. Usually we would add a constant `b` to the inputs but I omitted it here to keep the code cleaner.

We could now just run an MCMC sampler like `NUTS`, which works pretty well in this case, but as I already mentioned, this will become very slow as we scale our model up to deeper architectures with more layers.

Instead, we will use the brand-new ADVI variational inference algorithm which was recently added to `PyMC3`. This is much faster and will scale better. Note that this is a mean-field approximation, so we ignore correlations in the posterior.

In [34]:

```
%%time
with neural_network:
    # Run ADVI which returns posterior means, standard deviations, and the evidence lower bound (ELBO)
    v_params = pm.variational.advi(n=50000)
```

Under 40 seconds on my older laptop. That's pretty good considering that NUTS has a really hard time with this model. Further below we make this even faster. To make it really fly, we would probably want to run the Neural Network on the GPU.

As samples are more convenient to work with, we can very quickly draw samples from the variational posterior using `sample_vp()` (this is just sampling from Normal distributions, so not at all the same as MCMC):

In [35]:

```
with neural_network:
    trace = pm.variational.sample_vp(v_params, draws=5000)
```
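Under a mean-field approximation, `sample_vp()` is conceptually nothing more than drawing from independent Normals, one per weight. A numpy sketch (the means and standard deviations here are made-up stand-ins for what ADVI returns):

```python
import numpy as np

rng = np.random.RandomState(42)
means = np.array([0.1, -0.3, 0.7])  # variational posterior means for 3 weights
sds = np.array([0.05, 0.2, 0.1])    # variational posterior standard deviations

draws = 5000
# Each row is one joint "posterior sample" of all 3 weights.
trace = means + sds * rng.randn(draws, len(means))  # shape (5000, 3)
```

The empirical mean and standard deviation of these draws recover the variational parameters, which is why sampling here is so cheap compared to MCMC.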

In [36]:

```
plt.plot(v_params.elbo_vals)
plt.ylabel('ELBO')
plt.xlabel('iteration')
```


Now that we trained our model, let's predict on the hold-out set using a posterior predictive check (PPC). We use `sample_ppc()` to generate new data (in this case class predictions) from the posterior (sampled from the variational estimation).

In [7]:

```
# Replace shared variables with testing set
ann_input.set_value(X_test)
ann_output.set_value(Y_test)
# Create posterior predictive samples
ppc = pm.sample_ppc(trace, model=neural_network, samples=500)
# Use probability of > 0.5 to assume prediction of class 1
pred = ppc['out'].mean(axis=0) > 0.5
```

In [8]:

```
fig, ax = plt.subplots()
ax.scatter(X_test[pred==0, 0], X_test[pred==0, 1])
ax.scatter(X_test[pred==1, 0], X_test[pred==1, 1], color='r')
sns.despine()
ax.set(title='Predicted labels in testing set', xlabel='X', ylabel='Y');
```

In [9]:

```
print('Accuracy = {}%'.format((Y_test == pred).mean() * 100))
```

Accuracy = 94.19999999999999%

Hey, our neural network did all right!

To see what the classifier has learned, we evaluate the class probability predictions on a grid over the whole input space.

In [10]:

```
grid = np.mgrid[-3:3:100j,-3:3:100j]
grid_2d = grid.reshape(2, -1).T
dummy_out = np.ones(grid.shape[1], dtype=np.int8)
```

In [11]:

```
ann_input.set_value(grid_2d)
ann_output.set_value(dummy_out)
# Create posterior predictive samples
ppc = pm.sample_ppc(trace, model=neural_network, samples=500)
```

In [26]:

```
cmap = sns.diverging_palette(250, 12, s=85, l=25, as_cmap=True)
fig, ax = plt.subplots(figsize=(10, 6))
contour = ax.contourf(*grid, ppc['out'].mean(axis=0).reshape(100, 100), cmap=cmap)
ax.scatter(X_test[pred==0, 0], X_test[pred==0, 1])
ax.scatter(X_test[pred==1, 0], X_test[pred==1, 1], color='r')
cbar = plt.colorbar(contour, ax=ax)
_ = ax.set(xlim=(-3, 3), ylim=(-3, 3), xlabel='X', ylabel='Y');
cbar.ax.set_ylabel('Posterior predictive mean probability of class label = 0');
```

So far, everything I showed we could have done with a non-Bayesian Neural Network. The mean of the posterior predictive for each class-label should be identical to maximum likelihood predicted values. However, we can also look at the standard deviation of the posterior predictive to get a sense for the uncertainty in our predictions. Here is what that looks like:

In [27]:

```
cmap = sns.cubehelix_palette(light=1, as_cmap=True)
fig, ax = plt.subplots(figsize=(10, 6))
contour = ax.contourf(*grid, ppc['out'].std(axis=0).reshape(100, 100), cmap=cmap)
ax.scatter(X_test[pred==0, 0], X_test[pred==0, 1])
ax.scatter(X_test[pred==1, 0], X_test[pred==1, 1], color='r')
cbar = plt.colorbar(contour, ax=ax)
_ = ax.set(xlim=(-3, 3), ylim=(-3, 3), xlabel='X', ylabel='Y');
cbar.ax.set_ylabel('Uncertainty (posterior predictive standard deviation)');
```

So far, we have trained our model on all data at once. Obviously this won't scale to something like ImageNet. Moreover, training on mini-batches of data (stochastic gradient descent) avoids local minima and can lead to faster convergence.

Fortunately, ADVI can be run on mini-batches as well. It just requires some setting up:

In [43]:

```
# Set back to original data to retrain
ann_input.set_value(X_train)
ann_output.set_value(Y_train)

# Tensors and RV that will be using mini-batches
minibatch_tensors = [ann_input, ann_output]
minibatch_RVs = [out]

# Generator that returns mini-batches in each iteration
def create_minibatch(data):
    rng = np.random.RandomState(0)
    while True:
        # Return random data samples of size 50 each iteration
        ixs = rng.randint(len(data), size=50)
        yield data[ixs]

minibatches = [
    create_minibatch(X_train),
    create_minibatch(Y_train),
]

total_size = len(Y_train)
```

While the above might look a bit daunting, I really like the design. In particular, the fact that you define a generator allows for great flexibility. In principle, we could just pull from a database there and not have to keep all the data in RAM.
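A usage sketch of that generator pattern (restated with a batch-size argument, my addition, and fake data so the snippet stands alone): each `next()` call yields a fresh random batch, so the data could just as well be streamed from a database rather than held in memory.

```python
import numpy as np

def create_minibatch(data, batch_size=50, seed=0):
    rng = np.random.RandomState(seed)
    while True:
        # Sample batch_size row indices with replacement each iteration
        ixs = rng.randint(len(data), size=batch_size)
        yield data[ixs]

fake_X = np.random.randn(500, 2)  # stand-in for X_train
gen = create_minibatch(fake_X)
batch = next(gen)                 # shape (50, 2)
```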

Let's pass those to `advi_minibatch()`:

In [48]:

```
%%time
with neural_network:
    # Run advi_minibatch
    v_params = pm.variational.advi_minibatch(
        n=50000, minibatch_tensors=minibatch_tensors,
        minibatch_RVs=minibatch_RVs, minibatches=minibatches,
        total_size=total_size, learning_rate=1e-2, epsilon=1.0
    )
```

In [49]:

```
with neural_network:
    trace = pm.variational.sample_vp(v_params, draws=5000)
```

In [50]:

```
plt.plot(v_params.elbo_vals)
plt.ylabel('ELBO')
plt.xlabel('iteration')
sns.despine()
```

As you can see, mini-batch ADVI's running time is much lower. It also seems to converge faster.

For fun, we can also look at the trace. The point is that we also get uncertainty estimates of our Neural Network weights.

In [51]:

```
pm.traceplot(trace);
```

Hopefully this blog post demonstrated a very powerful new inference algorithm available in PyMC3: ADVI. I also think bridging the gap between Probabilistic Programming and Deep Learning can open up many new avenues for innovation in this space, as discussed above. Specifically, a hierarchical neural network sounds pretty bad-ass. These are really exciting times.

`Theano`, which is used by `PyMC3` as its computational backend, was mainly developed for estimating neural networks, and there are great libraries like `Lasagne` that build on top of `Theano` to make construction of the most common neural network architectures easy. Ideally, we wouldn't have to build the models by hand as I did above, but use the convenient syntax of `Lasagne` to construct the architecture, define our priors, and run ADVI.

While we haven't successfully run `PyMC3` on the GPU yet, it should be fairly straightforward (this is what `Theano` does, after all) and would further reduce the running time significantly. If you know some `Theano`, this would be a great area for contributions!

You might also argue that the above network isn't really deep, but note that we could easily extend it to have more layers, including convolutional ones to train on more challenging data sets.

I also presented some of this work at PyData London, view the video below:

Finally, you can download this NB here. Leave a comment below, and follow me on twitter.

Taku Yoshioka did a lot of work on ADVI in PyMC3, including the mini-batch implementation as well as the sampling from the variational posterior. I'd also like to thank the Stan guys (specifically Alp Kucukelbir and Daniel Lee) for deriving ADVI and teaching us about it. Thanks also to Chris Fonnesbeck, Andrew Campbell, Taku Yoshioka, and Peadar Coyle for useful comments on an earlier draft.

Quantopian provides this website to help people write trading algorithms - the website is not intended to provide investment advice.

More specifically, the material on this website is provided for informational purposes only and does not constitute an offer to sell, a solicitation to buy, or a recommendation or endorsement for any security or strategy, nor does it constitute an offer to provide investment advisory or other services by Quantopian.

In addition, the content of the website neither constitutes investment advice nor offers any opinion with respect to the suitability of any security or any specific investment. Quantopian makes no guarantees as to accuracy or completeness of the views expressed in the website. The views are subject to change, and may have become unreliable for various reasons, including changes in market conditions or economic circumstances.

Tag: Bayesian, Big Data, Deep Learning, Probabilistic Programming

Posted in Quantopian | 2 Comments »

By Dr. Thomas Wiecki, Lead Data Scientist at Quantopian

Earlier this year, we used DataRobot to test a large number of preprocessing, imputation and classifier combinations to predict out-of-sample performance. In this blog post, I’ll take some time to first explain the results from a unique data set assembled from strategies run on Quantopian. From these results, it became clear that while the Sharpe ratio of a backtest was a very weak predictor of the future performance of a trading strategy, we could instead use DataRobot to train a classifier on a variety of features to predict out-of-sample performance with much higher accuracy.

Backtesting is ubiquitous in algorithmic trading. Quants run backtests to assess the merit of a strategy, academics publish papers showing phenomenal backtest results, and asset allocators at hedge funds take backtests into account when deciding where to deploy capital and who to hire. This makes sense if (and only if) a backtest is predictive of future performance. But is it?

This question is actually of critical importance to Quantopian. As we announced at QuantCon 2016, we have started to deploy some of our own capital to a select number of trading strategies. We found these algorithms by analyzing thousands of algorithms using only pyfolio, our open source finance analysis library. As a data scientist, it is my job to come up with better (and more automatic) methods to identify promising trading algorithms.

What makes this possible (and a whole lot of fun) is that we have access to a unique dataset of millions of backtests run on the Quantopian platform. While our terms-of-use prohibit us from looking at code, we do know when code was last edited. Knowing this, we can rerun the strategy up until yesterday and evaluate how it continued to perform after the last code edit, meaning how it performed on data the quant did not have access to at the time of writing the code. Using this approach, we were able to assemble a data set of 888 strategies based on various criteria with at least 6 months of out-of-sample data.

The first thing I looked at was how predictive the backtest or in-sample (IS) Sharpe ratio was of the out-of-sample (OOS) Sharpe ratio:

With a disappointing R²=0.02, this shows that we can’t learn much about how well a strategy will continue to perform from a backtest. This is a real problem and might cause us to throw our hands in the air and give up. Instead, we wondered if perhaps all the other information we have about an algorithm (like other performance metrics, or position and transactions statistics) would allow us to do a better job.

Our team created 58 features based on backtest results, including:

- Tail-ratio (the 95th percentile divided by the 5th percentile of the daily returns distribution)
- Average monthly turnover
- How many backtests a user ran (which is indicative of overfitting)
- The minimum, maximum and standard deviation of the rolling Sharpe ratio
- …and many more (for a full list, see our SSRN manuscript).
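The tail-ratio feature from the list above can be sketched in a few lines of numpy (taking absolute values so the ratio stays positive is my assumption; the exact convention isn't spelled out here). Values above 1 mean the right tail of the returns distribution is fatter than the left.

```python
import numpy as np

def tail_ratio(returns):
    # 95th percentile of daily returns divided by the 5th percentile
    return np.abs(np.percentile(returns, 95)) / np.abs(np.percentile(returns, 5))

rng = np.random.RandomState(0)
symmetric = rng.randn(100000) * 0.01          # roughly symmetric daily returns
skewed = symmetric + 0.02 * (symmetric > 0)   # fatten the right tail

ratio_sym = tail_ratio(symmetric)   # close to 1
ratio_skew = tail_ratio(skewed)     # well above 1
```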

We then tried various machine learning techniques to try and predict out-of-sample Sharpe ratio based on these features. **This is where DataRobot came in really handy**. We uploaded our data set to DataRobot and from there, a huge combination of preprocessing, feature extraction, and regressors were tried in a cross-validated manner in parallel. Most of these could be done using scikit-learn, but it is a huge time saver and produced results better than my own humble attempts.

Using an ExtraTrees regressor (similar to a Random Forest), **DataRobot achieved an R² of 0.17 on the 20% hold-out set not used for training.** The lift-chart below shows that the regressor does a pretty good job at predicting OOS Sharpe ratio.

While it is generally difficult to intuit what the machine learning regressor has learned, we can look at which features it identified as most predictive. Of note, an important feature does not have to be predictive by itself or in relation to an OOS Sharpe ratio in a linear way. These methods are so powerful because they can learn extremely complex, nonlinear interactions of multiple features that are predictive in the aggregate.

Below you can see the most informative features.

It is interesting to point out that tail-ratio and kurtosis are both measures that assess the tails of the returns distribution. Moreover, the number of backtests a user ran (“user_backtest_days”) is also very informative of how a strategy will perform going forward.

While we showed that we can do a much better job at predicting an OOS Sharpe ratio than using only the backtest Sharpe ratio, it is not clear yet if this really makes a difference in reality. After all, no one earns money from a good R² value. We then asked if a portfolio constructed from the 10 strategies ranked highest by our machine learning regressor would do well. We compared this to selecting the top 10 strategies ranked by their in-sample Sharpe ratio as well as many simulations of choosing 10 strategies randomly.

The below image shows our results when applying this portfolio construction method on the hold-out set.

We found it interesting that even at highly quantitative hedge funds, asset allocation is still done in a discretionary way, at least for the most part. I believe that a long-term goal for this research would be to further automate this aspect using Machine Learning, as depicted in the figure below.

It would also be useful to look more at the OOS data. Usually, deployment decisions aren’t done on backtest performance alone. At Quantopian, we require at least 6 months of OOS data. This number is rather arbitrary, however, and there might be strategies where we gain confidence more quickly, and ones where we would need to wait longer.

For more on how we approached comparing backtest and out-of-sample performance on cohort algorithms and how we found the best results were achieved with DataRobot, you can view the full report of our test here. You can also view my presentation on this subject from Quantcon here.


Posted in Quantopian | 4 Comments »

Quantopian is committed to being a company and community of substance. We want to ask hard questions, and invest the effort to find detailed answers. Everything we do demands our best performance: our platform, our algorithm portfolio, our events, our office, our lectures. Everything requires thoughtful design. Including our logo.

We want a logo that is honest, scalable, and special to Quantopian. It should convey simplicity and ease, just as our product does. It should be open and encouraging, like our community.

We often refer to ourselves as “Q,” and I think it's because we have a quiet determination to make an amazing product. We build for quants. We have talented engineers and designers who know that quants are always where we start. Our logo should reflect that aspiration to keep growing and to keep solving important problems.

We believe great design boils down to determination, the willpower to keep working, and iterating until we have the best answer. We’ve been testing our new logo internally for over a year. It gave us the opportunity to experience and improve it before it was worthy of becoming part of Quantopian.

Our logo is not merely “new,” it’s a smarter identity as a whole. It’s concise and specific to certain uses, which makes it easy to fit into a variety of mediums. This flexibility is a technical win for Q on the front end.

Our logo is not purely technical, it is also a reflection of our community and our company culture. We have an amazing community of quants, students, statisticians, professors, research scientists, mathematicians, and professional investors.

People around the world make up Quantopian, and we now have a mark to back them up.


Posted in Quantopian | 1 Comment »

Our community reached a watershed moment in Q4 of 2015: the first community authored algorithms were chosen and deployed with capital. We analyzed tens of thousands of algorithms, and then we made six-figure allocations to 3 of them. We are writing to the whole community to say thank you — this is something we accomplished together. We also want to share some details, and to encourage and inspire you to keep writing great algorithms.

**Allocation and Compensation**

The allocation process worked out just as we had hoped that it would. We approached each of the chosen authors with a contract. The terms of the contract promise that a portion of the returns be paid to the author as compensation. The contract also covers the other things you'd expect - length of the contract, maintaining the intellectual property as the author's, etc.

We then did some additional diligence with the authors and worked with them to make the algos more robust for live trading. For instance, one of our author's algorithms had some brittle rebalancing logic that failed during stress testing, and we asked him to correct that before we deployed it with real money. Once the contract was signed and diligence was complete, we deployed the algorithm using our capital.

Algorithm authors were paid for their 2015 performance, and we look forward to writing checks again.

These allocations are in addition to, and separate from, the real-money trading we have done for Quantopian Open winners. These new allocations were made on the basis of analysis by our quant research team, using tools like pyfolio tearsheets, and not the contest rules.

**More Allocations Coming**

We are making additional allocations this quarter and going forward. We expect our allocations to increase in size over time.

All algorithms with at least 6 months of out-of-sample data are being considered. We evaluate the algorithms by looking at their in-sample and out-of-sample performance; we never look at the code itself. As we have outlined, we are looking for well-hedged, low-beta, low-volatility algorithms. Obviously, they also have to be profitable. Finally, they must have low correlation with other algorithms we have allocated to.

Many algorithms we have considered were promising but fell short in one vital area or another. We've been contacting these community members and have been working with them on improving their algorithms. After additional out-of-sample time passes we hope to provide them with allocations.

**Capital**

These allocations are all being made from Quantopian's own balance sheet - we believe in this model enough to risk our own capital. This is proprietary trading, or "prop trading." We are using prop trading to refine our algorithm selection process, our risk management, and our algorithm portfolio construction. We are very pleased with the progress in all of these areas.

Our business growth depends on making much more capital available to your algorithms. More capital also means you can earn larger royalties by licensing to Quantopian. We are working to ramp both the number and the size of capital allocations to your algorithms by transitioning from prop trading to a hedge fund.

**Keep Writing Algorithms!**

These allocations are the latest step in the progress of our community and in our business, but it is far from the last step. The platform is improving and expanding. There is much more capital still to allocate.

Most importantly, there are more algorithms for you, our community, to write. These allocations are how we can convert your hard work and insight into your profit. The great algorithms you're writing today are the allocations we will be making tomorrow.

Posted in Quantopian | 1 Comment »

Authors: Justin Lent (Quantopian), Thomas Wiecki (Quantopian), Scott Clark (SigOpt)

Parameter optimization of trading algorithms is quite different from most other optimization problems. Specifically, the optimization problem is non-convex, non-linear, stochastic, and can include a mix of integers, floats and enums as parameters. Moreover, most optimizers assume that the objective function is quick to evaluate, which is definitely not the case for a trading algorithm run over multiple years of financial data. That immediately disqualifies 95% of optimizers, including those offered by `scipy` or `cvxopt`. At Quantopian we have long been, and continue to be, interested in robust methods for parameter optimization of trading algorithms.

**Bayesian optimization** is a rather novel approach to the selection of parameters that is very well suited to optimizing trading algorithms. This blog post will first provide a short introduction to Bayesian optimization with a focus on why it is well suited for quantitative finance. We will then show how you can use SigOpt to perform Bayesian optimization on your own trading algorithms running with `zipline`.

This blog post originally resulted from a collaboration with Alex Wiltschko where we used Whetlab for Bayesian optimization. Whetlab, however, has since been acquired by Twitter and the Whetlab service was discontinued. Scott Clark from SigOpt helped in porting the code to their service which is comparable in functionality and API. Scott also co-wrote the blog post.

Bayesian Optimization is a powerful tool that is particularly useful when optimizing anything that is both time-consuming and expensive to evaluate (like trading algorithms). At the core, Bayesian Optimization attempts to leverage historical observations to make optimal suggestions on the best variation of parameters to sample (maximizing some objective like expected returns). This field has been actively studied in academia for decades, from the seminal paper "Efficient Global Optimization of Expensive Black-Box Functions" by Jones et al. in 1998 to more recent contributions like "Practical Bayesian Optimization of Machine Learning Algorithms" by Snoek et al. in 2012.

Many of these approaches take a similar route to solving the problem: they map the observed evaluations onto a Gaussian Process (GP), fit the best possible GP model, perform optimization on this model, then return a new set of suggestions for the user to evaluate. At the core, these methods balance the tradeoff between exploration (learning more about the model, the length scales over which they vary, and how they combine to influence the overall objective) and exploitation (using the knowledge already gained to return the best possible expected result). By efficiently and automatically making these tradeoffs, Bayesian Optimization techniques can quickly find the global optima of these difficult to optimize problems, often much faster than traditional methods like brute force or localized search.

At every Bayesian Optimization iteration, the point of highest Expected Improvement is returned; this is the set of parameters that, in expectation, will most improve upon the best objective value observed thus far. SigOpt wraps these powerful optimization techniques behind a simple API so that any expert in any field can optimize their models without resorting to expensive trial and error. More information about how SigOpt works, as well as other examples of using Bayesian Optimization to perform parameter optimization, can be found on our research page.
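The loop just described (fit a GP to the evaluations so far, maximize Expected Improvement over candidates, evaluate the suggestion) can be sketched in a few lines. This is an illustrative toy, not SigOpt's implementation: the 1-D `objective` and all function names here are our own, and we use scikit-learn's `GaussianProcessRegressor` as the GP.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def expected_improvement(X_cand, gp, y_best):
    """EI acquisition: expected amount by which each candidate beats y_best."""
    mu, sigma = gp.predict(X_cand, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - y_best) / sigma
    return (mu - y_best) * norm.cdf(z) + sigma * norm.pdf(z)

def objective(x):
    # Stand-in for an expensive backtest: a bumpy 1-D function.
    return -np.sin(3 * x) - x ** 2 + 0.7 * x

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(3, 1))   # a few initial random evaluations
y = objective(X).ravel()

for _ in range(15):                   # Bayesian optimization loop
    gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
    cand = np.linspace(-2, 2, 400).reshape(-1, 1)
    x_next = cand[np.argmax(expected_improvement(cand, gp, y.max()))]
    X = np.vstack([X, [x_next]])
    y = np.append(y, objective(x_next))

print("best x found:", round(float(X[np.argmax(y), 0]), 2))
```

Even on this toy problem, the EI-guided loop typically concentrates its handful of evaluations near the objective's peak rather than spreading them uniformly, which is the whole appeal when each evaluation is a multi-year backtest.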

Historically, quant traders have used many price-based signals to define an investment strategy. Many of these signals are implemented in the popular TA-lib library, with an available Python wrapper here. Typically, a price-based signal takes historical prices as input to compute the signal's value. For example, RSI (Relative Strength Index) is commonly used for both mean-reversion *and* momentum trading. To compute RSI, one must first choose the number of pricing days over which to compute the signal. Then, the trader chooses a range of valid values to trigger trade entry. Valid RSI values range from 0 to 100. If a stock has undergone a sharp selloff recently, RSI values will trend towards 0, and after a strong rally, towards 100. A common trading strategy is to go long a stock when RSI is below 20, betting on mean-reversion, and similarly to go short when RSI is above 80. However, other groups of traders have found success betting on persistent momentum in the stock (rather than mean-reversion) when RSI reaches an extreme reading. In that case, the trader will go long when RSI is above 80, for example, betting that the stock will continue going higher.
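For concreteness, the RSI calculation described above can be sketched in plain NumPy. This is a simplified version of the standard formula; TA-lib's `RSI` applies Wilder smoothing, so its values will differ slightly.

```python
import numpy as np

def rsi(prices, lookback=14):
    """Simple (non-Wilder) RSI over the trailing `lookback` price changes."""
    deltas = np.diff(prices[-(lookback + 1):])
    gains = deltas[deltas > 0].sum()
    losses = -deltas[deltas < 0].sum()
    if losses == 0:                    # all up-moves: maximally overbought
        return 100.0
    rs = gains / losses                # relative strength
    return 100.0 - 100.0 / (1.0 + rs)

# A steady rally pushes RSI toward 100; a steady selloff toward 0.
rally = np.linspace(100, 120, 15)
selloff = np.linspace(120, 100, 15)
print(rsi(rally), rsi(selloff))  # -> 100.0 0.0
```

The two free choices visible here, the `lookback` window and the entry thresholds applied to the returned value, are exactly the free parameters the optimization below will search over.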

This poses several questions:

- So which is right? Should we use RSI as a *mean-reversion* indicator or a *momentum* indicator?
- What range of RSI values should we choose to determine whether the momentum or mean-reversion condition is met?
- How many lookback days of prices should we use to compute the stock's RSI?
- Besides RSI, can we integrate a second indicator into our investment strategy?

Each decision made in specifying our trading signal (e.g. RSI), or in interpreting the signal's value for making a trading decision, is effectively a free parameter in our system. Depending on the range of reasonable values each free parameter can take, the total number of possible parameter settings can quickly explode. Each signal added to the strategy can push the total combinations into the many millions and even *billions*; we'll see this firsthand in the example below, which incorporates only 2 signals.

**A discussion of strategy model overfitting, and evaluating how overfit a trading backtest may be, will not be addressed here, and will be the topic of a future blog post.** An excellent introduction for how to address trading strategy overfitting in your own algorithms can be found here and here.

The large parameter space of millions, or billions, of combinations over which our strategy must be tested to determine the subspace where the global maximum is likely located is why Bayesian optimization can be so effective at quickly evaluating potentially profitable trading strategies. A brute-force grid search over a billion-combination parameter space is often intractable, even if each combination takes only 30 seconds to complete. Bayesian optimization reduces the number of evaluations needed to cover the global search space by an order of magnitude, as described in the previous section.

The trading algorithm we implement below creates a simple structure for passing free parameters into any simple price-based trading signal (simplified to work more easily with ta-lib functions). Each signal is then evaluated every trading day, and when *all* the conditions are true, a trade is entered and held until the next signal evaluation period (where the evaluation period is yet another free parameter).

For the purposes of this optimization, our objective function will be the Sharpe Ratio of the strategy, a broadly accepted industry metric for evaluating trading strategy performance. However, the framework implemented below makes it easy to swap in any objective function the analyst desires.

To illustrate how you can use Bayesian optimization on your zipline trading algorithms, and how it compares to other naive approaches (e.g. grid search), we will use a rather simple algorithm built around a trading trigger based on two commonly used signals from the ta-lib library: RSI (Relative Strength Index) and ROC (Rate of Change).

The trading algorithm will implement what might be considered a sector rotation strategy, searching for trades across these Select Sector SPDR ETFs:

- XLV, XLF, XLP, XLE, XLK, XLB, XLU, XLI.

By running the trading logic across all of these ETFs, we will be implementing a simple sector-rotation strategy.

*Buy the ETF only if it meets both the RSI signal and ROC signal criteria. When an ETF is bought long, then an equivalent dollar amount of SPY will be shorted, creating a hedged, dollar-neutral portfolio.*

Hedging all of our trades serves to "tease apart" the actual usefulness of these signals (RSI, ROC), since it removes upward movement in the stock that occurs simply because the rest of the stock market is going up. As a result, the profit achieved by this hedged strategy can be viewed as more "pure alpha," rather than being highly correlated with the direction of the broad stock market.

The 7 free parameters of our trading strategy are as follows:

- RSI
  - **(1)** Lookback window for the number of prices used in the RSI calculation
  - **(2)** Lower_bound value defining the trade entry condition
  - **(3)** Range_width, which will be added to the Lower_bound
    - Lower_bound to Lower_bound + Range_width is the range of values over which our RSI signal will be considered `True`
- ROC
  - **(4)** Lookback window for the number of prices used in the ROC calculation
  - **(5)** Lower_bound value defining the trade entry condition
  - **(6)** Range_width, which will be added to the Lower_bound
    - Lower_bound to Lower_bound + Range_width is the range of values over which our ROC signal will be considered `True`
- Signal evaluation frequency
  - **(7)** Number of days between evaluations of our signals (do we evaluate them every day, every week, every month, etc.)

It's worth noting that even with just 2 price-based signals, we have 7 free parameters in this system!

Reasonable Ranges for each of the 7 free parameters above (assuming each is an integer, with integer steps):

- 115 values: 5 to 120
- 90 values: 0 to 90
- 20 values: 10 to 30
- 61 values: 2 to 63
- 30 values: 0 to 30
- 195 values: 5 to 200
- 18 values: 3 to 21

Multiplying the valid ranges of each yields a total combination count of:

- **1,329,623,100,000** theoretical combinations (115 x 90 x 20 x 61 x 30 x 195 x 18)
- Imagine how many combinations are possible if 3, 4, 5... 10 signals are added to a strategy
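The total above is simply the product of the seven per-parameter range sizes, which is easy to sanity-check:

```python
from math import prod

# Range sizes for the 7 free parameters listed above.
sizes = [115, 90, 20, 61, 30, 195, 18]
total = prod(sizes)
print(f"{total:,}")  # -> 1,329,623,100,000
```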

Obviously, grid-searching through all those combinations is unreasonable, though a skilled practitioner can prune the search space significantly by grid-searching across each parameter with wider steps, based on intuition about the model being built. But even if the skilled practitioner can reduce the grid search to 10,000 combinations, even that number may be unwieldy if the objective function (e.g. a trading strategy) takes one minute to evaluate, which is quite frequently the case with trading strategies. This is where having access to a Bayesian optimizer becomes *extremely* helpful.

Below is the result of an analysis performed in an IPython notebook, comparing the results of SigOpt's Bayesian optimizer across 3 independent experiments of only 300 trials each (each trial determined intelligently by their optimizer) against a "smart" grid search of approximately 3,500 combinations chosen intuitively from a reasonable interpretation of sensible RSI and ROC values. Only 3,500 trials were chosen for the grid-search approach because even those few combinations took 48 hours to evaluate. 300 trials were chosen for the SigOpt approach because in practice the SigOpt optimizer is able to find good optima within 10 to 20 times the dimensionality of the parameter space being optimized. This linear number of evaluations with respect to the number of parameters allows for optimization of algorithms that would otherwise be intractable using standard methods like grid search, which grows exponentially with the number of parameters.

(This was seen in 4 out of 5 runs with SigOpt, and the 1 run that returned a worse objective value fell short only slightly.)

In practice SigOpt is able to find a good optimum in a linear number of evaluations with respect to the number of parameters being tuned (usually between 10 and 20 times the dimensionality of the parameter space). Grid search, even an expertly tuned grid search, grows exponentially with the number of parameters. As the model being tuned becomes more complex, it can quickly become completely intractable to tune using these traditional methods.
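To make the linear-versus-exponential contrast concrete, here is a toy comparison assuming 15 evaluations per parameter for the Bayesian approach (the midpoint of the 10-20x rule of thumb above) and a modest 10 grid points per parameter:

```python
def bayes_evals(n_params, factor=15):
    """Rule-of-thumb Bayesian optimization budget: linear in dimensionality."""
    return factor * n_params

def grid_evals(n_params, points_per_param=10):
    """Grid search budget: exponential in dimensionality."""
    return points_per_param ** n_params

for d in (2, 7, 10):
    print(d, bayes_evals(d), grid_evals(d))
```

At the 7 parameters of our strategy this is roughly a hundred evaluations versus ten million; at 10 parameters the grid is already in the billions.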

An example of how quickly SigOpt discovered a parameter combination near our expected global maximum is shown below, as well as a comparison against the extremely coarse (and slow) grid search:

Next we will inspect aspects of the optimization further. (If you wish to view the entirety of this post within the context of the IPython notebook in which it was created, you can view it here on Quantopian's public research repo.)

First, a look at the distribution of the objective values returned from each iteration of each optimizer. We notice that SigOpt's Bayesian approach tests more points in the parameter space that are nearer to the expected global maximum, as the mean of all trials is much closer to the maximum value discovered. Moreover, the mean is firmly above zero, so its suggested parameter combinations are targeting a region of the space with the desired outcome.

Next, let's inspect the distribution of values *of each parameter* attempted by each method, to get a sense of how the Bayesian approach is able to home in on the specific region more precisely. The coarse grid search simply sets a min/max range for each parameter along with a discrete step to traverse the grid, which is fairly common practice among industry practitioners running parameter optimizations. A more complex search could be implemented via random sampling, or perhaps particle swarm optimization, but that complexity is commonly out of reach for non-programmers.

Here we can see the pairwise relationships among the SigOpt-suggested parameter combinations, and which regions of each parameter intersection were found to produce the best objective function values. Along the diagonal is simply the distribution of each individual parameter (as a KDE fit line plot, similar to the histograms shown directly above).

Now let's apply the optimal parameters chosen by each method to our held-out, out-of-sample data. (We trained our model using market data from 2004-2009; the held-out data is from 2010-2015.)

**Takeaway: Recognizing how poorly the strategy performs out-of-sample shows how important it is to perform additional analysis (cross-validation, out-of-sample testing, etc.) after using parameter optimization to discover a global maximum.**

On a positive note, however, the increased speed of the Bayesian approach versus grid search meant we were able to assess our out-of-sample performance much more rapidly, since the grid search took over two days to finish. Bayesian optimization via SigOpt allowed us to continue our research process 10x faster: in a matter of hours rather than days.

For completeness, and to put the entire analysis together across the backtest and out-of-sample periods, below is a pyfolio tear sheet allowing visual inspection of the strategy as it transitions from in-sample to out-of-sample.

If you wish to work on this analysis, or view the code used to accomplish the above, feel free to clone our research repo on GitHub.

Posted in Optimization, Quantopian | 6 Comments »

Algorithmic trading used to be a very difficult and expensive process. The time and cost of system setup, maintenance, and commission fees made programmatic trading almost impossible for the average investor. That’s all changing now.

We’re excited to announce that Quantopian has integrated with Robinhood, **a zero commission brokerage**. This partnership has made the process of algorithmic trading, from start-to-finish, completely free.

From initial brainstorming with research, to testing and optimizing with backtesting, and finally, commission-free execution with Robinhood, algorithmic trading has never been easier.

**Here’s What Users Get**

- **Data** - Data is the lifeblood of algorithmic trading. But most data is costly and dirty. Quantopian solves that for you with clean, integrated data sources. Some data sources are entirely free (traded price and volume, corporate fundamentals), and some are freemium (news sentiment, earnings estimates, and more).
- **Platform** - Quantopian provides you with a platform to do your free-form research, to write and test your algorithm, to paper trade, and even trade real money. You don't have to set up, maintain, or monitor - we do it all for you.
- **Execution** - Robinhood provides order execution, holding your funds and filling your orders.
- **Capital** - You can trade your own money, or you can seek an allocation and trade with our money. One way to get an allocation is to win our contest, trade $100,000, and keep 100% of the profits. Other algorithms get larger discretionary allocations through our fund.

**How To Get Started**

If you have an existing Robinhood account, you can begin trading today. If you’d like to open an account, you can sign up directly at Robinhood - the process takes less than five minutes to complete. For more information and video tutorials, our community post has you covered.

P.S. Attached is a sample algorithm that's geared up and ready for live trading. It's based on Mebane Faber's Tactical Asset Allocation. The allocation Faber proposes is designed to be "a simple quantitative method that improves the risk-adjusted returns across various asset classes." You can read the original academic paper from Meb Faber, or the previous discussion of the strategy on Quantopian.

Posted in Quantopian | Comments Closed

When I give talks about probabilistic programming and Bayesian statistics, I usually gloss over the details of how inference is actually performed, treating it essentially as a black box. The beauty of probabilistic programming is that you actually don't *have* to understand how the inference works in order to build models, but it certainly helps.

When I presented **a new Bayesian model** to Quantopian's CEO, Fawce, who wasn't trained in Bayesian stats but is eager to understand it, he started to ask about the part I usually gloss over: "Thomas, how does the inference actually work? How do we get these magical samples from the posterior?".

Now I could have said: "Well that's easy, MCMC generates samples from the posterior distribution by constructing a reversible Markov-chain that has as its equilibrium distribution the target posterior distribution. Questions?".

**That statement is correct, but is it useful?** My pet peeve with how math and stats are taught is that no one ever tells you about the intuition behind the concepts (which is usually quite simple) but only hands you some scary math. This is certainly the way I was taught, and I had to spend countless hours banging my head against the wall until that eureka moment came about. Usually things weren't as scary or seemingly complex once I deciphered what they meant.

**This blog post is an attempt to explain the intuition behind MCMC sampling (specifically, the Metropolis algorithm).** Critically, we'll be using code examples rather than formulas or math-speak. Eventually you'll need the math, but I personally think it's better to start with an example and build the intuition before you move on to it.

**Table of Contents**

- The problem and its unintuitive solution
- Setting up the problem
- Explaining MCMC sampling with code
- Visualizing MCMC
- Proposal width
- Extending to more complex models
- Conclusion
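To give a flavor of the code-first approach the post takes, here is a compact, generic sketch of the Metropolis algorithm (our own illustrative code, not taken from the post): it samples from a standard normal target using a symmetric normal proposal.

```python
import numpy as np

def metropolis(log_post, n_samples=20000, proposal_width=1.0, seed=0):
    """Metropolis sampler: propose a move, accept with prob min(1, p_new/p_cur)."""
    rng = np.random.default_rng(seed)
    x = 0.0
    samples = np.empty(n_samples)
    for i in range(n_samples):
        x_new = x + rng.normal(0, proposal_width)   # symmetric proposal
        # Accept with probability exp(log_post(x_new) - log_post(x)), capped at 1.
        if np.log(rng.uniform()) < log_post(x_new) - log_post(x):
            x = x_new
        samples[i] = x                              # keep current state either way
    return samples

# Target: standard normal, via its log density up to an additive constant.
samples = metropolis(lambda x: -0.5 * x ** 2)
print(round(samples.mean(), 2), round(samples.std(), 2))
```

Note that the target only needs to be known up to a normalizing constant, since the constant cancels in the acceptance ratio; that cancellation is exactly what makes the method work for intractable posteriors.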

Tag: Bayes Formula, Bayesian, Bayesian model, bayesian statistics, computation, intro datascience, Markov chain, MCMC, Metropolis Algorithm, Quantopian

Posted in Quantopian | 3 Comments »

A new month brings a new contest winner. Meet Andreas, the winner of Contest 9 (also known as the October Prize)!

Andreas originally hails from Sweden, then moved to the United Kingdom for his university studies. In the UK, he studied mathematics and then stayed to pursue a career in the finance industry, before embarking on a graduate degree in mathematical physics. He is currently a PhD student in Spain continuing his journey in mathematics. Andreas stumbled across Quantopian while traversing the web, and was immediately hooked. With no previous background in Python, he started learning how to create trading algorithms. He shares, "I started coding up some basic algorithms and was impressed by how easy it was to get going. There was also a great community forum and tutorials that had answers to most questions." His Python skills improved and Andreas began coding a variety of algorithms and trying different strategies.

Andreas was focused on the data. "For me, quant research is all about the data. Analysing and understanding the data always comes first (and backtesting last!). Quantopian has a number of interesting data feeds (that I hope will continue growing!). My algo uses some of these data feeds to select baskets of stocks to trade." Quantopian provides 13 years of pricing data and fundamental data, along with 22 (and growing) datasets in the store.

Andreas continues to improve his current ideas and test new strategies using the research environment and backtester. "I know how much work goes into creating a proper backtesting and research environment, and that Quantopian makes one available to you for free is quite amazing!" He is currently in the first phase of his prize, undergoing a quant consultation session. Afterward, he will enter the second phase and begin trading a $100,000 brokerage account for 6 months, keeping all the profits. We'll write him a check monthly for his earnings!

We've already paid out over $2300 in contest earnings. Are you the next contest winner? If so, submit your algorithm by the next deadline on **Nov. 2** at **9:30AM ET** to start your 6 months of paper trading.

Posted in Interview, Quantopian | 2 Comments »

**Every student in every school should have the opportunity to learn computer science.**

**Code.org** is a non-profit dedicated to expanding access to computer science, and increasing participation by women and underrepresented students of color in this field. They believe computer science should be part of the core curriculum, alongside other courses such as biology, chemistry, or algebra.

We at Quantopian believe in Code.org's vision to bring computer science to every student. To help them achieve this goal, we have decided to donate all revenue generated by our live stream ticket sales for **QuantCon 2016** to them.

**QuantCon 2016** will feature a stellar lineup including: Dr. Emanuel Derman, Dr. Marcos López de Prado, Dr. Ernie Chan, and more. It will be a full day of expert speakers and in-depth tutorials. Talks will focus on innovative trading strategies, unique data sets, and new programming tips and tools. The goal? To give you all the support you need to craft and trade outperforming strategies.

A live stream purchase will also include **first-access to all QuantCon recordings and presentation decks.** For tickets or more information, please visit **www.quantcon.com**.

Posted in Quantopian | Comments Closed

**Strata**, the conference where cutting-edge science and new business fundamentals intersect, will take place September 29th to October 1st in New York City.

The conference is a deep-immersion event where data scientists, analysts, and executives explore the latest in emerging techniques and technologies.

**Quantopian Talks & Tutorials**

Our team will be **presenting** several talks and tutorials at Strata. The topics range from how global-sourcing is flattening finance, to a Blaze tutorial, to a review of pyfolio and how it can improve your portfolio and risk analytics, to an out-performing investment algorithm on women-led companies in the Fortune 1000.

To see our entire lineup, please click **here**.

**Join Us!**

If you would like to attend the conference, RSVP **here** and enter discount code **QUANT** for a 20% discount on any pass.

We hope to see you there!

Posted in Quantopian | Comments Closed
