HACKER Q&A
📣 mooreds

How do you roll back production?


Once you have pushed code to production and realized there is an issue, how do you roll back?

Do you roll forward? Flip DNS back to the old deployment? Click the button in Heroku that takes you back to the previous version?


  👤 aasasd Accepted Answer ✓
A place I worked at had a symlink pointing to the app directory, and a new version went to a new dir. This allowed us to do atomic deployments: code wasn't replaced while it was being run. A rollback, consequently, meant pointing that symlink back at the older version.
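
Roughly, that kind of cutover looks like this (paths and the service name here are hypothetical, just to illustrate the pattern):

  # Stage the new symlink, then rename it over the old one; rename is atomic on the same filesystem
  ln -sfn /srv/app/releases/20190601 /srv/app/current.tmp
  mv -T /srv/app/current.tmp /srv/app/current

  # Rolling back is the same operation, pointed at the previous release
  ln -sfn /srv/app/releases/20190528 /srv/app/current.tmp
  mv -T /srv/app/current.tmp /srv/app/current
  systemctl reload myapp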

For the database, during a migration we didn't tie the code to a single version of the db. The database structure was modified to add new fields or tables, and the data was migrated, all while the site was online. The code expected to find either version of the db, usually signaled by a flag for the shard. If the changes were too difficult to do online, relatively small shards of the db were put into maintenance. Errors were normally caught after the migration of one shard, so switching it back wasn't too painful. Database operations for the migrations were a mix of automatic updates to the list of table fields and data-moving queries written by hand in the migration scripts; the db structure didn't really look like your regular ORM anyway.

This approach served us quite well for the roughly five years I was there, with a lot of visitors and data, dozens of servers, and multiple deployments a day.


👤 robbya
Specifically, I tell Jenkins to deploy the commit hash that was last known good. Jenkins just deploys, and doesn't really know that it's a "roll back."

Generally, going back to a known clean state should be easier, safer and relatively quick (DNS flip is fast, redeploy of old code is fast if your automation works well).

In some cases, changes to your data mean that rolling back would cause even more problems. I've seen that happen, and we were stuck doing a rapid hotfix for a bug, which was ugly. Afterwards we did a lot more review to make sure we didn't break the ability to roll back. So I'd advise code review and developer education about that risk.


👤 karka91
Assuming this is about web dev.

Nowadays - flip a toggle in the admin. Deployments and releases are separated.

Made a major blunder? In the Kubernetes world we do "helm rollback". Takes seconds. This allows for a super fast pipeline, and a team of 6 devs pushes out something like 50 deployments a day.

Pre-Kubernetes it was an AWS pipeline that would start up servers with old commits. We'd catch most of the issues in the blue/green phase, though. Same team, maybe 10 deployments a day, but I think this was still pretty good for a monolith.

Pre-AWS we used deployment tools like Capistrano. Most of the tools in this category keep multiple releases on the servers and a symlink to the live one. If you made a mistake: run a command to delete the symlink, ln -s the old release, restart the web server. Even though this is the fastest rollback of the bunch, the ecosystem was still young and we'd do 0-2 releases a day.


👤 DoubleGlazing
My almost universal experience has been to simply do a Git revert and let the CI pipeline do its thing. Pros: it's simple. Cons: it's slow, especially in an emergency.

My last job had an extra layer of safety. As a .NET house, all new deployments were sent to Azure in a zip file. We backed those up and maintained FTP access to the Azure App Service. If a deployment went really wrong and we couldn't wait the 10-20 mins for the CI pipeline to process a revert, we'd just switch off the CI process and FTP upload the contents of the previous last-good version.
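
The manual upload was roughly the following (hostname and paths are placeholders; the real FTP endpoint comes from the App Service's publish profile):

  # Mirror the unpacked last-good build over FTPS into the App Service web root, bypassing CI
  lftp -u "$FTP_USER,$FTP_PASS" ftps://waws-prod-xyz.ftp.azurewebsites.windows.net \
       -e "mirror -R ./last-good-build /site/wwwroot; quit"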

Of course, if there were database migrations to deal with then all hell could break loose. Reverting a DB migration in production is easier said than done, especially if a new table or column has already started being filled with live data.

To be fair though, most of the problems I encountered were usually the result of penny-pinching by management who didn't want to invest in proper deployment infrastructure.


👤 drubenstein
Depends on the issue - if there's a code/logic bug that doesn't affect the data schema, we use Elastic Beanstalk versions to go back to the previous version (usually this is a rolling deploy backwards), and then clean up the data manually if necessary. Otherwise, we roll forward (fix the bug, clean up the data, etc.).
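
With the AWS CLI, going back to a previously uploaded application version is roughly this (environment and version label names are made up):

  # Re-point the environment at the last known-good application version
  aws elasticbeanstalk update-environment \
      --environment-name my-app-prod \
      --version-label v42-last-known-good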

It's more often been the case for us that issues are caused by mistaken configuration / infrastructure updates. We do a lot of IaC (Chef, CloudFormation), so with those, it's usually a straight git revert and then a normal release.


👤 EliRivers
Tell people we need to roll back, clone the repo to my hard drive, open up git, undo the commit that merged the bad code in, push it. All done.
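
On the command line that amounts to something like the following (the SHA being whatever merge brought in the bad code):

  # Revert a merge commit, keeping the first parent (master) as the mainline
  git revert -m 1 <merge-commit-sha>
  git push origin master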

"Production"? Does that mean something that goes to the customers? Very few of our customers keep up with releases so it's generally not a big deal. We can have a release version sitting around for weeks before any customer actually installs it; some customers are happy with a five year old version and occasional custom patches.

I bet it's a bigger problem for those for whom the product is effectively a running website, but those of us operating a different software deployment model have a different set of problems.


👤 ksajadi
Since we run on Kubernetes, rolling back the code is a matter of redeploying the older images. Rolling back database changes is more challenging, but usually we have "down" scripts as well as up scripts for all DB changes, allowing us to roll database changes back too.

We use Cloud 66 Skycap for deployment, which gives us a version-controlled repository for our Kubernetes configuration files and takes care of image tags for each release.


👤 MaxGabriel
For our backend, we deploy it as a Nix package on NixOS, so we can atomically roll back the deployed code, as well as any dependencies like system libraries. Right now this requires SSHing into each of our two backend servers and running a command.
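
That per-server command is roughly the following (hostnames are placeholders):

  # Switch each server back to its previous NixOS system generation
  ssh backend-1 'sudo nixos-rebuild switch --rollback'
  ssh backend-2 'sudo nixos-rebuild switch --rollback'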

If it’s not urgent we’d just revert with a PR though and let the regular deploy process handle it.

The frontend we deploy with Heroku, so we roll back with the rollback button or the Heroku CLI. Unfortunately we don't have anything set up where the frontend checks whether it's on the correct version, so people will get the bad code until they refresh.
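
The CLI version is roughly this (app name and release number are placeholders):

  # List recent releases, then roll back to a known-good one
  heroku releases --app my-frontend
  heroku rollback v101 --app my-frontend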


👤 folkhack
I have multiple strategies because I've got one foot in Docker and one foot in the "old-school" realm of simple web servers.

Code rollbacks are simple as heck - I just keep the previous Docker container(s) up as a potential rollback target, and/or have a symlink cutover strategy for the webservers. I use GitLab CI/CD for the majority of what I do, so the SCM is not on the server; it's deployed as artifacts (either a clean tested container and/or a .tar.gz). If I need to roll back it's a manual operation for the code, but I want to keep it that way because I am a strong believer in not automating edge cases, which is what running rollbacks through your CI/CD pipeline amounts to.

Also for code I've been known to even cut a hot image of the running server just in case something goes _really_ sideways. Never had to use it though, and I will only go this far if I'm making actual changes to the CI/CD pipeline (usually).

The biggest concern for me is database changes. You may think I'm nuts but I have been burnt _sooooo_ bad on this (we were all young and dumb at one time, right?)... I have multiple points of "oh %$&%" solutions. The first is good migrations - yeah yeah, yell at me if you wish... I run things like Laravel for my APIs and their migration rollbacks can take care of simple things. TEST YOUR ROLLBACK MIGRATIONS! The second solution is that I cut an actual read slave for each and every update of the application and then segregate it so that I have a "snapshot" that is at most 1-2 hours out of date.
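
For the Laravel piece, "test your rollback migrations" boils down to something like this, run against a staging copy before each release:

  # Run the new migrations, then immediately exercise their down methods
  php artisan migrate
  php artisan migrate:rollback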

Having redundancy for your redundancy is my motto... and although my deployments take 1-3 hours for big changes (cutting hot images of a running server, building/isolating an independent DB slave, shuffling containers, etc.), I've never had a major "lights out" issue that's lasted more than 1 hr.


👤 dxhdr
Push different code to production, either the last-known-good commit, or new code with the issue fixed.

I imagine that much larger operations likely do feature flags or a rolling release so that problems can be isolated to a small subset of production before going wide. But still the same principle, redeploy with different code.


👤 technological
You can do a rolling deployment.

Set up an environment with the previous version of the production code (which does not have the issue) and then use the load balancer to switch traffic to this new environment.


👤 KaiserPro
We have two things that could need to be rolled back: the app/API and the dataset.

The app is Docker, so we have a tag called app-production, and app-production-1 (up to 5), which are all the previous production versions. If anything goes wrong, we can flip over to the last known good version.
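
Flipping back is roughly a retag-and-push of the previous image (registry and image names here are made up):

  # Promote the last known-good image back to the app-production tag
  docker pull registry.example.com/app:app-production-1
  docker tag registry.example.com/app:app-production-1 registry.example.com/app:app-production
  docker push registry.example.com/app:app-production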

We are multi-region, so we don't update all at once.

The dataset is a bit harder, because it's > 100 GB and for speed purposes it lives on EFS (it's lots of 4 MB files, we might need to pull in 60 or so files at once, and access time is rubbish using S3). Manually syncing it takes a couple of hours.

To get round this, we have a copy-on-write system, with "dataset-prod" and "dataset-prod-1" up to 6. Changing the symlink of the top-level directory takes almost no time.


👤 jekrb
At the agency I used to work for, we used GitLab CI/CD.

We were able to do a manual rollback for each deployment from the GitLab UI.

https://docs.gitlab.com/ee/ci/environments.html#retrying-and...

Disclaimer: I work at GitLab now, but my old agency was also using GitLab and their CI/CD offering for client projects for a couple years while I was there.

At that agency they have even open sourced their GitLab CI configs :) https://gitlab.com/digitalsurgeons/gitlab-ci-configs


👤 ryanthedev
I mean it's not rolling back or rolling forward.

It's just doing another deployment. It doesn't matter what version you are deploying.

That's the whole point.

My teams go into their CI/CD platform and just cherry pick which build they want to release.


👤 atemerev
Blue-green deployment is the only way to fly: https://martinfowler.com/bliki/BlueGreenDeployment.html

There are two identical prod servers/cloud configurations/datacenters: blue and green. Each new version is deployed alternately to the blue and green areas: if version N is on blue, version N-1 is on green, and vice versa. If some critical issue happens, rolling back is just switching the front router/balancer to the other area, which can be done instantly.
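
With, say, nginx as the front balancer, the switch can be as simple as swapping which upstream config is active (file names here are hypothetical):

  # Point the active upstream at the green stack, validate, and reload without dropping connections
  ln -sfn /etc/nginx/upstreams/green.conf /etc/nginx/upstreams/active.conf
  nginx -t && nginx -s reload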


👤 ericol
We have a rather simple app that we manage with GitHub. When a PR is merged into our main repo's master branch it gets automatically deployed into production.

Whenever we need to roll something back, we just use the corresponding GitHub feature to revert the merge, and that is automatically shoved into production using GH hooks and such.

Again, we have a rather easy and ancient deploy system, and it just works.

We do several updates a week if needed. We try to avoid late Friday afternoon merges, but with a couple of alerts here and there (mostly New Relic) we have good coverage for finding out about problems.


👤 helloguillecl
Before implementing CI with containers I used to deploy using Capistrano. One thing I loved about this setup was that if I needed to roll back, I would just run a command which would change a symlink to point at the previous deploy and restart. All usually done in a couple of seconds.
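
In Capistrano that is presumably the built-in rollback task (the stage name depends on your config):

  # Re-point the "current" symlink at the previous release and restart
  cap production deploy:rollback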

👤 m00dy
I roll back by deploying the previously tagged Docker image.

👤 emptysea
Ideally we'd have the problematic code behind a feature flag and we'd turn the flag off.

For other issues we press the rollback button in the Heroku dashboard.

Heroku has its problems: buildpacks, reliability, cost, etc., but the dashboard deploy setup is pretty nice.


👤 perlgeek
We use https://gocd.io/ for our build + deployment pipelines. A rollback is just re-running the deployment stage of the last known-good version.

Since the question of database migrations came up: We take care to break up backwards incompatible changes into multiple smaller ones.

For example, instead of introducing a new NOT NULL column, we first introduce it as NULLable, wait until we are confident that we won't want to roll back to a software version that leaves the column empty, and only then change it to NOT NULL.
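
Spelled out in SQL, the sequence is roughly the following, spread over separate releases (table and column names are made up):

  # Release 1: add the column without the constraint
  psql "$DATABASE_URL" -c "ALTER TABLE orders ADD COLUMN shipped_at timestamptz NULL;"

  # ...deploy code that populates it; wait until rolling back past this point is no longer a concern...

  # Release 2: backfill old rows, then tighten the constraint
  psql "$DATABASE_URL" -c "UPDATE orders SET shipped_at = created_at WHERE shipped_at IS NULL;"
  psql "$DATABASE_URL" -c "ALTER TABLE orders ALTER COLUMN shipped_at SET NOT NULL;"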

It requires more manual tracking than I would like, but so far, it seems to work quite well.


👤 insulfrable
Flip DNS back, keep the old stack around for a few days. The only case where that doesn't just work is DB schema updates that no longer work with the previous version of prod, but this is true for any rollback.

👤 Cofike
We use Elastic Beanstalk, so we just deploy whatever application version we'd like. Honestly not the biggest fan of that strategy, because there is at least a 5-minute period you just have to wait through while the new instances are provisioned and health-checked.

When compared to our Fastly deploys which are global in seconds, it leaves me wanting a faster solution.


👤 nine_k
100% of my rollbacks were like this:

* Deploy new code in new VMs.

* Route some prod traffic to the new nodes.

* Watch the nodes misbehave somehow.

* Route 100% of the prod traffic back to old nodes (which nobody tore down).

Rollback complete.

In the case of a normal deployment, 100% of prod traffic would eventually be directed to the new nodes. After a few hours of everything running smoothly, the old nodes would be spun down.


👤 lelabo_42
We started using Docker and Kubernetes not long ago. Every deployment in production must have a release number as a tag. If one element of our environment needs a rollback, I redeploy an old image on Kubernetes. It's very fast, only a few seconds to roll back, and you can do rolling updates to avoid downtime.

👤 WrtCdEvrydy
Depends on the issues.

1) Issues that cause a complete failure to start containers will fail health checks and are auto-rolled back in our new CI/CD flow.

2) Issues that are more subtle are manually rolled back one hash at a time until the issue goes away (then we create a revert branch from the diff between HEAD and WORKING).


👤 cddotdotslash
We actually just implemented something like this. Our entire environment is AWS CodeBuild, CodePipeline, and Lambda-based, but the process would be similar for more traditional environments:

1. Developer creates a PR. To be mergeable, it must pass code review, be based on master, and be up-to-date with master (GitHub recently made this really easy by adding a one-click button to resync master into the PR).

2. Each commit runs a build system that installs dependencies, runs tests, and ZIPs the final code to an S3 bucket.

3. Once the developer is ready to deploy, and the PR passes the above checks, they type "/deploy" as a GitHub comment.

4. A Lambda function performs validation and then updates our dev Lambda functions with the ZIP file from S3 (roughly the CLI call sketched at the end of this comment). Once complete, it leaves a comment on the PR with a link to the dev site to review.

5. The developer can now comment "/approve" or "/reject". Reject reverts the last Lambda deploy in dev. Approve moves the code to stage.

6. The above steps repeat for stage --> prod.

7. Once the code is in prod, the developer must approve or reject. If rejected, the Lambdas are reverted all the way back through dev. If approved, the PR is merged by the bot (we have some additional automation here, such as monitoring CloudWatch metrics for API stability, end-to-end tests, etc.).

TL;DR - Don't merge PRs until the code is in production and reviewed. If a rollback is needed afterwards, create a rollback (roll-forward) PR and repeat.
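
The deploy/revert step above boils down to something like this (function, bucket, and key names are placeholders):

  # Point the dev function at a specific build artifact; a revert is the same call with the prior key
  aws lambda update-function-code \
      --function-name my-service-dev \
      --s3-bucket my-build-artifacts \
      --s3-key builds/<commit-sha>.zip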


👤 sbmthakur
At my workplace, we use GitLab. We pick up the old (stable) job ID and ask DevOps to deploy it.

👤 yellow_lead
For our Kubernetes apps, every merge to master creates a tag in GitHub. If there's an issue that doesn't result in a failed health check (those would be rolled back automatically), we can roll back by passing the older tag into a Jenkins job.

👤 bdcravens
Our app is in ECS, our database in RDS. I'd roll back to a prior task definition, and if absolutely necessary do a point-in-time restore in RDS. (I tend to leave the database alone unless absolutely necessary, but our schema is pretty mature.)
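
Rolling back to a prior task definition is roughly one CLI call (cluster, service, and family:revision names are made up):

  # Point the service back at an earlier task definition revision; ECS handles the rolling replacement
  aws ecs update-service --cluster prod --service web-api --task-definition web-api:41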

👤 savrajsingh
Google App Engine: just set the previous deploy as the live version (one click).

👤 markbnj
We revert the bad commit and redeploy. We do this for our workloads on vms as well as those on kubernetes, but it is both easier and faster for the kubernetes workloads.

👤 wickedOne
Interesting question.

We roll forward, and thus far we've never run into a situation where that wasn't possible in a reasonable amount of time.

Nevertheless, I've wondered more than once what would happen if we ran into such a situation and there was a substantial database migration in the process (i.e. with table drops).

Curious to learn what the different strategies are on that point: do you put your table contents in the down migration, do you revert to the last backup, etc.?


👤 sunasra
We use the Chef framework. Each deployment has a specific tag. If something goes wrong, we revert to the most stable Chef tag and redeploy to all the tiers.

👤 sodosopa
Depends. For base code moves we restore from the previous release. For larger things like websites we use blue-green deployments.

👤 mister_hn
I simply build a new install package and send it to the customer,

or build a new VM and send it to them.

Not everything is web-based.


👤 cdumler
You are asking a very generic question without really stating what environment you are using. The ability to "rollback" is really a statement of how you have defined your deployments. A good environment really should have a few things:

* A version control system (i.e. git) that has a methodology for controlling what is tested and then released (i.e. feature releases). If you want the ability to revert a feature, you need to use your version control to group (i.e. squash) code into features that can be easily reverted. Look up the Git branching model [1]. It's a good place to start when thinking about organizing your versioning to control releases.

* You should be able to deploy from any point in your version control. Make sure your deployment system is able to deploy from a hash, tag, or branch. This gives you the option of "reverting" by deploying from a previously known good position. I would highly suggest automating your deployment to push timestamp tags to the repo for each deployment so you can see the history of deployments.

* Try to make your deployments idempotent and/or separate your state changes so they can be independently controlled. If you have migrations, make sure they can withstand being deployed again, i.e. "DROP TABLE IF EXISTS" then "CREATE TABLE", so redeploying doesn't blow up. If you need to roll back, you can roll back as far as you need to the point you want to deploy. A trait of a well-designed system is that it needs few state changes to add new features, and/or those state changes can be easily controlled.

* Have a staging system (or several). You should be able to deploy to a staging system to verify the behavior of a deployment. It should replicate production in every way except data content. Ideally, you should also build it from scratch every time so that you can guarantee that if production dies a hard death you can completely reproduce it. A great system will also do this for production: bring the new environment up for final testing, and then switch over to it once tested.

Notice the trend here is to break up the dependencies between how, what, and where code is deployed so that you have many ways to respond to issues. Maybe the problem is small enough to just fix in a future release. Maybe it calls for creating an emergency patch, testing it on a new production deployment, and then switching over. Maybe it is so bad you want to immediately deploy a previous version and get things running again. All of these abilities depend on building your system such that you have these choices.

[1] https://nvie.com/posts/a-successful-git-branching-model/


👤 nodesocket
With Kubernetes and deployments simply:

kubectl rollout undo deployment/$DEPLOYMENT
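
If the immediately previous revision isn't the one you want, you can inspect the history and target a specific revision:

  kubectl rollout history deployment/$DEPLOYMENT
  kubectl rollout undo deployment/$DEPLOYMENT --to-revision=3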


👤 SkyLinx
I deploy to Kubernetes with Helm, so I just do a Helm rollback.
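
That's roughly the following (release name and revision number are placeholders):

  # List the release's revisions, then roll back to a known-good one
  helm history my-release
  helm rollback my-release 41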

👤 rvdmei
Rolling back is usually a bad practice and can get quite challenging, if not impossible, in distributed environments.

If you can pinpoint a specific commit that is causing the issue, revert that commit and go through your standard release process.


👤 elliotec
Click the button, revert master, then fast follow with a fix.

👤 billconan
Each of my deployments is a zip file. I just redeploy with an older zip.

👤 crb002
Liquibase

👤 faissaloo
I don't rollback unless it's for work, I just take the time to fix it, the site can stay down for as long as it needs to.

👤 rolltiide
Roll back the master branch and deploy that again.

Similar to "clicking the button in Heroku".


👤 mnm1
Build the previous working version. Deploy to Elastic Beanstalk. Roll back any migrations. Fix the issue and redeploy at leisure.

👤 rinchik
Rolling back (stepping back) is an anti-pattern inherited from waterfall.

Now we should only march forward with small, on-demand releases; this way we know exactly where the issue is and can fix it forward quickly.

Rollbacks were a strategy for monthly (or even quarterly [insane, huh?]) giant, stinky release dumps, where there was no way we could quickly identify and deploy the fix. AKA: let's throw production 3 months back and take another 2 months to figure out where the issue from the last release is.

And to finally answer your question: we never roll back. We always march forward.