Last week someone attacked our website. We were able to stop the attack before any data — ours or our members’ — was compromised.
We shut down our application when the attack was detected, and put up a site-down message explaining that we had been attacked. We kept the site down for about a day while we made the most urgently needed security improvements. When we brought the site back up, we published a summary of what happened and what we did in response. We also promised a detailed post mortem of the attack, our response, and our future plans. Although we were able to detect and stop the attack quickly with the tools on hand, we learned a lot about necessary improvements in our security. We’ve already made many changes, and we’ll make more in the near future. The (lengthy!) post mortem detailing what we learned is below.
We are sharing this post mortem to provide as much transparency as possible. We want our members to trust us with their intellectual property. That trust can’t be bought or demanded — it must be earned. We believe transparency will help us earn trust. We also know there are lessons to be learned from our experience for others who maintain web applications, and this post mortem will help them.
Your comments, ideas, and criticisms are welcome in comments below or email to firstname.lastname@example.org.
Overview of our architecture
To understand what happened last week and how we responded, it is helpful to know a bit about our architecture.
Our application has three tiers: a front-end built with Ruby on Rails and hosted by Heroku, a pool of algorithm processing servers hosted by Amazon EC2, and several MongoDB databases hosted by MongoHQ.
What makes our application unusual is that we allow members to write arbitrary Python code to be executed on our algorithm processing servers. Most attacks that involve running hostile code require the attacker to first find a vulnerability to use to get the code to run. However, we fling that particular door wide open and tell our members, “Here you go! Write some code and we’ll run it on our servers for you!” This makes securing our app more challenging than most.
We knew from the start that our Python interpreter was the most likely target for any attacker, and we had taken steps to secure it long before last week’s attack. However, in retrospect, it’s clear we hadn’t done enough.
Detecting and shutting down the attack
Our application logs a lot of information, and we capture and filter it all in real-time through Papertrail. We get frequent emails from Papertrail about log entries that look out of the ordinary. We’ve tuned our filters to ignore most of the noise, so what’s left is sufficiently low-volume that the important stuff doesn’t get drowned out.
At 20:09 US/Eastern last Wednesday night, we started getting email alerts about log entries we hadn’t seen before. There wasn’t anything about them that specifically cried out, “Attack! Attack!” so we didn’t start looking into them until an hour or so later, at around 21:30.
A few minutes after that, this message showed up in the log emails:
Dec 12 21:31:43 ip-10-87-29-60 zipline.log: IOError: [Errno 2] No such file or directory: '/proc/22374/cmdline'
We were now certain we were being attacked, because there’s no reason for anything in our code base to be attempting to read the command line of a process from /proc. Almost immediately, the entire team was on HipChat working together to respond to the attack.
At this point we were faced with a critical decision: should we (a) allow the attack to proceed, watching it in real-time to learn more about it and shut it down if needed; (b) block the attacker’s access to the site while keeping it up for everyone else; or (c) shut down the entire site? We quickly ruled out option (a), because it would have taken us too long to figure out how successful the attacker had been thus far, and that was too risky. We then had to rule out option (b), because of what we must admit was a serious gap in our preparedness: we hadn’t actually implemented a way to ban a specific account or IP address from the site. That left option (c), so at 21:59, we enabled Heroku’s “maintenance mode.” The site was down and the attack was stopped.
Was the attack stopped when we shut down the site? If the attacker had not successfully breached our security, then yes, it was the end of the attack. But if he had already breached our security, and managed to either gain access to resource credentials stored on our middle-tier servers or to create a back-door he could use to log into them, we weren’t out of the woods. We didn’t know yet how far he had gotten, and it was going to take a while for us to figure it out. On the one hand, we didn’t want to shut everything down, change all of our resource credentials, and so on, because doing so would take time away from analyzing the attack and make that analysis more difficult. On the other hand, we didn’t want to leave things wide open in case he had managed to gain some access before we shut down the site.
Therefore, as a compromise, we decided to immediately change the authentication credentials for all of our production MongoDB databases, and to disable the SSH key used by our servers to pull updates from GitHub. Even if the attacker had managed to steal any of those credentials from one of our servers, the stolen credentials would be useless once we changed them.
We decided to hold off for the time being on shutting down our middle-tier servers altogether because we might need to do forensic analysis with the servers still running. Nevertheless, we knew that we could burn the servers at any time since our EC2 instances are designed to be disposable and can be easily and automatically rebuilt from scratch as needed.
Analyzing the attack
Our next step was to do a detailed analysis of the code the attacker had run through our Python interpreter to determine what he was trying to accomplish and how far he got. Fortunately, we had a complete record of his actions, since not only do we log the timing of every algorithm run through our backtester, we also save a copy of the executed code in our database (this is necessary not only for security purposes, but also so members can see the code associated with their previously executed backtests). We were therefore able to pull the code for all of the attacker’s backtests — there were about 60 of them — out of the database in execution order and walk through them one by one.
We were not able to see any code the attacker attempted to run that was blocked by our existing security checks, since our app only stores copies of code that actually gets run; this is a gap we have since addressed.
From our analysis, we learned that the attacker’s basic strategy was to find some way to use Python code in a backtest to get access to system-related data, and then to use the algorithm logging mechanism built into our framework to send that data to his browser. This was a straightforward strategy which was limited by the fact that our logging mechanism truncates individual log entries to 1024 characters, and much of the data he attempted to print into the log was longer than that.
If we had properly secured our Python interpreter environment to prevent access to all file I/O functions, the attacker would have been completely stymied in his efforts. Unfortunately, we missed some: when we added third-party libraries like numpy and pandas, we did not carefully audit them for functions that would give the interpreter filesystem access. Our attacker settled upon numpy’s genfromtxt() to read strings from text files. For example:
data = genfromtxt("/etc/passwd", delimiter="\n", dtype="|S255")
Using his capture + log strategy, the attacker was able to gain access to the following data:
- a list of all the variables and their values accessible from within the interpreter as returned by locals(), quite incomplete because of the truncation to 1024 characters;
- approximately the first 1024 bytes of /etc/passwd;
- approximately the first 1024 bytes of /etc/mtab;
- our Linux version string (i.e., the contents of /proc/version); and
- the process IDs of some processes running on our servers.
The information he captured from /etc/passwd, /etc/mtab, and /proc/version was useless to him; our Linux security is sufficiently locked down that none of that gave him any insight into how to break into our servers.
He used a somewhat clever technique to get process IDs: in a loop, he iterated through the range of possible process IDs and attempted to read /proc/&lt;pid&gt; using genfromtxt(). For invalid PIDs, he got back a “No such file or directory” exception, but for valid PIDs, he got back “Is a directory”, i.e., genfromtxt() couldn’t read the contents of a directory as a text file. He was therefore able to detect the valid process IDs by checking which type of exception he got back. His intention was then to examine /proc/&lt;pid&gt;/environ and /proc/&lt;pid&gt;/cmdline to try to extract sensitive information about our system from the command lines and environment variables associated with running processes.
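The error-based probing he used can be reconstructed in a few lines. The following is a hypothetical re-creation, not the attacker's actual code, using plain open() in place of genfromtxt(); the principle, distinguishing valid PIDs by the type of error returned, is the same:

```python
# Hypothetical re-creation of the attacker's PID-scanning trick: trying
# to read /proc/<pid> as a text file fails with ENOENT ("No such file
# or directory") for unused PIDs and EISDIR ("Is a directory") for live
# ones, which leaks the set of valid process IDs.
import errno

def pid_exists(pid):
    """Return True if /proc/<pid> exists, i.e. the PID is in use."""
    try:
        with open("/proc/%d" % pid) as f:
            f.read()
    except OSError as e:
        return e.errno == errno.EISDIR
    return True
```

Blocking file I/O functions entirely, as we have now done, closes off this probe along with the rest of the filesystem.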
Unfortunately for him, his strategy was ineffective at capturing information about processes running on our system, for several reasons:
- we shut down the site shortly after he started trying to read the environment variables and command lines of processes;
- our application runs as an unprivileged account, i.e., not as root, so he was unable to read the details of most processes running on the servers;
- we don’t specify sensitive information on the command line or in the environment variables of any of our application’s processes; and
- he did the “find valid process IDs” step in a separate backtest from the “read environment variables and command line of a specific process ID” step, apparently not realizing that backtests are distributed among servers in a pool, and a process ID which is valid on one server in the pool is unlikely to be valid on another.
After completing this analysis, we knew for certain that the attacker had not gained access to any of our data or left behind any back-doors. We breathed a collective sigh of relief, chatted for a bit about what we were going to do the next day to harden the site before bringing it back on-line, and then signed off and decided to look at things with fresh eyes in the morning.
Tracing the origin of the attack
We are still working on tracing the attack back to its origin. We will update this section later when those investigations are complete.
Critical application hardening
As a result of the attack, we identified and implemented the following improvements to our security before bringing the site back on-line the next evening.
First and foremost, we dramatically increased the security checks we perform on a member’s Python code before it is executed. We are now much more restrictive about which functions — both built-ins and imported from modules — may be used. We did a complete audit of all the third-party modules accessible from algorithm code for functions that perform I/O of any sort and blocked them all. It’s possible that we missed something, but we are confident that it will be much, much harder for anyone to repeat the trick used in the recent attack.
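As an illustration of this kind of static check, consider the minimal sketch below. This is not our actual implementation, and the whitelist is invented for illustration; the idea is that member code is parsed into an AST and rejected if it references any name outside an approved set:

```python
# Minimal sketch of whitelist-based pre-execution checking of member
# code. The ALLOWED_NAMES set is illustrative only; a real whitelist
# would cover the approved builtins, framework API, and audited
# third-party functions.
import ast

ALLOWED_NAMES = {"len", "range", "sum", "log", "data"}

def find_violations(source):
    """Parse `source` and return the names/constructs that are not allowed."""
    tree = ast.parse(source)
    # Names the code itself assigns are allowed when read back later.
    defined = {n.id for n in ast.walk(tree)
               if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Store)}
    violations = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            violations.append("import")
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load):
            if node.id not in ALLOWED_NAMES and node.id not in defined:
                violations.append(node.id)
    return violations
```

Run against code like the attacker's, this flags both the import statement and the unapproved genfromtxt call. Static checks alone can always be bypassed by something sufficiently clever, which is why we treat them as only one layer of defense.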
Even before the attack, our logging was quite extensive and enabled us to detect and shut down the attack quickly, but we felt we needed to do still more. Therefore, we set up real-time alerts to the entire team any time someone runs afoul of our newly hardened security checks.
We added the ability to dynamically (no application restart required) lock out individual users and IP addresses.
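Conceptually, the per-request check looks something like the sketch below. Here fetch_blocklist is a hypothetical stand-in; in our case the lists live in a MongoDB collection, which is why updating them requires no application restart:

```python
# Sketch of a dynamic block check performed on each request. Because
# the blocklist is re-read from the database rather than baked into the
# app at startup, bans take effect immediately. `fetch_blocklist` is a
# hypothetical stand-in for a MongoDB query.
def is_blocked(account, ip, fetch_blocklist):
    blocked = fetch_blocklist()
    return account in blocked["accounts"] or ip in blocked["ips"]
```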
Prior to the attack, we were storing our database and GitHub credentials in files on disk that were accessible to the server role that executes backtests. That was for our convenience when restarting servers or troubleshooting application issues, but it was clear after the attack that the negative security risk of keeping these files on disk was too great. Therefore, we modified our deployment process so that these files are on disk for only as long as they are needed during deployments and are deleted immediately afterwards.
Prior to the attack, we were storing database credentials in a configuration file in the source code of our front-end application. We’ve removed the credentials from that file and are now storing them only in memory to reduce our vulnerability from either someone compromising our front-end application or someone compromising our GitHub repository.
We noticed during the attack that our servers were accessing GitHub through the account of one of our developers, with full push / pull / admin access to all our repositories (bad idea!). We fixed this by redirecting the servers to a newly created GitHub account within our organization, one with pull-only access to just the necessary repositories.
For maximum paranoia, we destroyed and rebuilt all of our server instances from scratch. Though we were completely convinced then — and remain so now — that our servers were not actually compromised, the cost of replacing our EC2 instances is so low that we figured what the heck, go ahead and do it.
Also for maximum paranoia, we created a new SSH key pair for our servers to use to pull updates from GitHub. Though we had no reason to believe that the key pair had been compromised, recreating it was easy so we went ahead and did so.
Additional application hardening
In addition to the steps we took to harden the application before bringing it back on-line the day after the attack, we’ve done more since then to increase the security of the application.
All algorithm code is now executed within an empty chroot jail.
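Conceptually, the jail works like the following sketch (assuming root privileges on a Unix host; our production implementation also drops privileges and restricts the interpreter, which is omitted here):

```python
# Sketch, not our production code: run untrusted code in a forked child
# that is chroot()ed into a freshly created empty directory, so even if
# the code obtains file I/O primitives, there is nothing to read.
# Requires root (CAP_SYS_CHROOT); returns the child's exit code.
import os
import tempfile

def run_jailed(code):
    jail = tempfile.mkdtemp()      # empty directory serving as the jail
    pid = os.fork()
    if pid == 0:                   # child process
        try:
            os.chroot(jail)        # "/" is now the empty directory
            os.chdir("/")
            exec(code, {"__builtins__": {}})  # restricted globals
            os._exit(0)
        except BaseException:
            os._exit(1)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)
```

Even with the jail in place, we keep the static code checks described above; each layer covers gaps in the other.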
We now automatically lock out users who run afoul of our code security policy too many times within a preset period of time.
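The mechanism amounts to a sliding-window count of violations per user. In the sketch below, the threshold and window values are invented for illustration, not our real configuration:

```python
# Sketch of automatic lockout after repeated security-check violations.
# MAX_VIOLATIONS and WINDOW_SECONDS are illustrative values only.
import time
from collections import defaultdict, deque

MAX_VIOLATIONS = 5
WINDOW_SECONDS = 600

_violations = defaultdict(deque)

def record_violation(user_id, now=None):
    """Record one violation; return True if the user should be locked out."""
    now = time.time() if now is None else now
    window = _violations[user_id]
    window.append(now)
    # Drop violations that have aged out of the window.
    while window and window[0] < now - WINDOW_SECONDS:
        window.popleft()
    return len(window) >= MAX_VIOLATIONS
```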
We added a user-friendly admin-only page within our application for controlling which accounts and IP addresses are blocked, in contrast to the change we implemented immediately after the attack, which required us to edit a MongoDB collection directly. We did this because when we are in the midst of an attack, the tools we use to fend off the attack should be as user-friendly and error-resistant as possible so that we can do what needs to be done quickly and accurately. Manually editing a database collection is neither user-friendly nor error-resistant, so it had to go.
We’ve added our own “maintenance mode” implementation to our application to allow us to continue to use the application internally, while blocking all other visitors. During the recent attack we used Heroku’s maintenance mode to take down the application; this was effective at blocking the attacker, but it also blocked us from using our application to investigate the attack.
We’ve always recognized that the security of our application is critically important. Our development roadmap has been a mix of features, improvements and security measures. A product is never “done,” and neither is security; there is always more that can be done. This attack provided a lot of information and quite a bit of urgency, so our roadmap today has more security investment planned than it did before the attack.
Here are some of the things that are in the works or under consideration for further improvements to the security of our application:
- Incident response plan — we fended off the recent attack using our wits and the wisdom collected over years of experience. That was good enough this time, but we need a more rigorous solution. After publishing this post mortem, the next thing we will be writing is a detailed, comprehensive plan for responding appropriately to the various kinds of attacks we expect to face in the future. This will also provide us with a framework for responding methodically and comprehensively, even to as-yet-unforeseen attack vectors. Formulating an incident response plan will also help us identify gaps in our incident response “toolbelt” so that we can build the necessary tools before they are needed.
- More granular GitHub access — our new configuration for accessing GitHub from our servers during updates is more secure than it was before, but it’s still not as secure as it could be. We will make the access granted to various automated components within our infrastructure even more granular to limit the exposure from the compromise of a single component.
- Better code isolation — the chroot jail we added provides a high level of algorithm code isolation, but we are considering going even further. The technologies we’re evaluating include LXC, seccomp, and SELinux. However, we’re not convinced that the additional security provided by any of these will be worth the effort to build and maintain the required jail implementation, which will be much more complicated than our simple chroot implementation.
In addition, we still need to implement several of the security features from the list we mentioned the last time we wrote about security:
- Use SSL encryption for data traveling between our application and our MongoDB servers.
- Encrypt all member data, including algorithm code, in our MongoDB databases.
- Deploy vendor security patches faster and with more regularity than we do now.
- Arrange for regular third-party audits of our application security, and promptly remediate any issues uncovered by those audits.
Lessons learned
With the attack and its aftermath behind us, it’s useful to look back and ask what went right and what went wrong.
Log everything. Discard nothing.
Too many software developers seem to think that if you don’t expect an error at a particular location in your code, or you don’t know what to do about any errors that might occur there, you should just silently throw them away. This is incredibly dangerous. With disk space so cheap nowadays and with services like Papertrail that make it easy to capture and filter logs in real-time, you should log everything. We were able to detect and stop the attack on our application quickly only because of our extensive logging and real-time filtering. Similarly, we were able to determine with certainty exactly what the attacker was able to accomplish only because of our extensive user activity logging.
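In Python terms, the anti-pattern is a bare except that discards the error. At minimum, record the full traceback before deciding what to do. Here is a minimal sketch using the standard logging module (shipping the output to a service like Papertrail is a separate configuration concern, e.g. a syslog handler):

```python
# Don't silently swallow unexpected errors: log the full traceback,
# then let the error propagate (or handle it deliberately).
import logging

logger = logging.getLogger("app")

def guarded(fn, *args, **kwargs):
    try:
        return fn(*args, **kwargs)
    except Exception:
        # logger.exception records the message AND the traceback.
        logger.exception("unhandled error in %s", fn.__name__)
        raise
```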
We could have been logging even more than we were. As noted above, we weren’t aware of the attack until the attacker started successfully executing backtests, because we weren’t logging backtests whose code was rejected, a gap we have since remedied. If you ever find yourself saying, “Gee, I wish our application were logging <x>,” listen to yourself: add the logging now, before you really regret not having it!
Have communication channels at the ready.
Everyone on our team was involved in responding to the attack. We were all able to collaborate effectively because we already had in place a private communication channel, HipChat, familiar and accessible to all of us.
We have some employees in far-flung locations and others who work from home on a regular basis, so HipChat and Skype are essential tools that we use every single day. If you run a web application, you need to have something like HipChat set up for your first responders to use in the midst of an emergency, and they need to know how to use it.
Prepare an incident response plan.
When you are in the midst of responding to an attack is not the right time to be planning how to respond to attacks; you need an incident response plan in place which covers the entire process. If you’re not sure what needs to be in your plan, Google “incident response plan” and start reading.
Build countermeasures before you are attacked.
One of the side-effects of preparing a detailed incident response plan is that it will help you identify what tools and application enhancements you need to respond effectively to attacks. For example, as noted above, it didn’t occur to us until we were in the midst of an attack that we had no way to block individual accounts or IP addresses.
Don’t store credentials in the filesystem.
Everybody knows you’re not supposed to store passwords on disk, but your application needs the passwords to connect to your databases and other resources on the network, so where can you put them?
Two commonly used solutions to this problem are command-line options and environment variables. However, the attack we experienced makes it clear that these are both bad ideas: if our attacker had succeeded, he would have been able to read the command-line options and environment variables of all the processes on our servers.
We’ve opted instead to put the necessary credentials on disk only for long enough for the application to read them on start-up; after that, they are only available in the application’s memory. That doesn’t make them invulnerable to attack, since a sufficiently competent attacker with sufficient penetration into our servers could read the application’s memory, but security isn’t about making things impossible, it’s about making them hard enough to deter or prevent attacks.
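A sketch of that start-up step is below; the file format is hypothetical, and the point is simply that the on-disk copy is destroyed as soon as it has been read:

```python
# Sketch: credentials are placed on disk just before start-up, read
# once into memory, and the file is deleted immediately afterwards.
# The JSON format here is hypothetical.
import json
import os

def load_credentials(path):
    with open(path) as f:
        creds = json.load(f)
    os.remove(path)  # from here on, the credentials exist only in memory
    return creds
```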
Build multiple layers of security.
We relied on our security checks of algorithm code to prevent attackers from running nefarious code. However, our security checks weren’t catching everything, and once the attacker figured that out he was able to access the filesystem on our servers.
Good application security implementations ensure that sensitive resources are protected with multiple, independent layers of security whenever possible. In our case, we’ve both beefed up the layer of security that checks algorithms for suspicious code and added a completely new, independent layer since the attack — executing algorithm code in an empty chroot jail.
Use granular access controls.
For example, if your application servers need to access your code repository during deployments, make sure the credentials used to access the repository only have access to the code they need to, and only have read-only access.
Similarly, if your application only needs read-only access to some databases, make sure the credentials used to access those databases are read-only.
This shouldn’t be news to anyone with experience securing web applications and it wasn’t news to us either, but the attack prompted us to take a good, hard look at our application and realize that we weren’t following these best practices. If you take a good, hard look at your application, you may find the same thing.
Changing passwords isn’t always enough.
If you change your database passwords during an attack, or even if you delete your database users entirely, an attacker who has already established a connection to your database may be able to continue accessing it. This goes for other resources with persistent connections (e.g., NFS) as well. If you need to change credentials when responding to an attack, then remember to kill all persistent connections!
Don’t put off security.
It’s trite and clichéd but true all the same. When you’re a start-up working on an exciting, new web application with all sorts of new functionality that you want to push out the door Now! Now! Now! to keep your existing members happy and draw new members to the site, it’s easy to convince yourself, “We’re not big enough yet to be an attractive target. We can deal with the security stuff later.”
Presumably you are hoping that your application will be a success or you wouldn’t be working on it. If it’s successful, then it will be targeted by attackers.
You have to keep plugging away at security. We had invested enough in security that we were able to fend off this attack before it became a disaster, but we hadn’t invested enough to avoid 18 hours of site downtime. For a company in our stage of development, that’s not too bad of a price to pay. Still, in retrospect, we should have invested in security more. You should evaluate your roadmap in the same light — are you investing enough?
Trust and Transparency
As Fawce said in his post last week, the protection of our members’ intellectual property is very important to us. It’s not just something we should do, it’s something we must do. We want our members to trust us with their intellectual property. That trust can’t be bought or demanded — it must be earned. We are sharing this post mortem so that we can provide as much transparency as possible. We believe transparency will help us earn trust.
I know we haven’t thought of everything. If you have any advice, questions or criticisms regarding our security measures and how we handled this incident, we welcome them in comments or email to email@example.com.