HACKER Q&A
📣 bradwood

Do you test in production?


There are a lot of blog posts arguing that testing in prod shouldn't be the taboo it may have been in the 90s. I've read some of these [1] [2], I get the arguments in favour of it, and I want to try some experiments.

My question is -- how does one go about doing it _safely_? In particular, I'm thinking about data. Is it common practice to inject fabricated data into a prod system to run such tests? What's the best practice or prior art on doing this well?

Ultimately, I think this will end up looking like implementing SLIs and SLOs in PROD, but for some of my SLOs I think I need to actually _fake_ the data in order to get the SLIs I need -- so how do I do this?

Suggestions appreciated -- thanks.

[1] https://increment.com/testing/i-test-in-production/

[2] https://segment.com/blog/we-test-in-production-you-should-too/


  👤 paxys Accepted Answer ✓
Lots of ways to test in production. IMO the way you are suggesting – injecting synthetic data into prod – is the worst of both worlds. You aren't actually testing real world use cases, and end up polluting your prod environment.

Some common ways to go about it:

- Feature flags: every new change goes into your codebase behind a flag. You can flip the flag for a limited set of users and do a broader rollout when ready (see the sketch after this list).

- Staged rollouts: have staging/canary etc. environments and roll out new deployments to them first. Observe metrics and alerts to check if something is wrong.

- Beta releases: have a group of internal/external power users test your features before they go out to the world.
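
A minimal sketch of the feature-flag option, assuming a hypothetical in-process allowlist (FLAG_ALLOWLIST and is_enabled are illustrative names, not any particular flag library; real setups usually delegate to a flag service):

  # Hypothetical flag config: which users see which in-progress features.
  FLAG_ALLOWLIST = {
      "new_checkout": {"alice@example.com", "bob@example.com"},  # internal testers
  }

  def is_enabled(flag: str, user: str) -> bool:
      """Return True if this user should see the flagged code path."""
      return user in FLAG_ALLOWLIST.get(flag, set())

  def checkout(user: str) -> str:
      if is_enabled("new_checkout", user):
          return "new checkout flow"      # the change being tested in prod
      return "legacy checkout flow"       # everyone else stays on the old path

  print(checkout("alice@example.com"))    # new checkout flow
  print(checkout("someone@else.com"))     # legacy checkout flow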


👤 csours
Everyone tests in production. Some people also test before production!

Some people try to NOT test in production, but everyone does test in prod in a very real sense because dependencies and environments are different in prod.

I think the question was "Do you INTENTIONALLY test in production?"


👤 cuuupid
I work for a B2E company that has a structure similar to Salesforce. We test in production all the time even for our secure environments where the data is highly sensitive.

Re: data, it’s a somewhat common practice to notionalize data (think isomorphically faking data). We regularly do this and will often designate rows as notional to hide them from users who aren’t admins. I’ve found this to work exceptionally well; we do this 1-2 times a week, ensure there’s a closed circuit for notional data, and for more critical systems we’ll inform our customers that testing will occur.
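
A rough sketch of the notional-row idea, assuming a hypothetical is_notional column rather than the actual schema described above:

  import sqlite3

  # Hypothetical schema: an is_notional flag marks synthetic rows injected for tests.
  conn = sqlite3.connect(":memory:")
  conn.execute("CREATE TABLE orders (id INTEGER, total REAL, is_notional INTEGER)")
  conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                   [(1, 100.0, 0), (2, 1.0, 1)])  # row 2 is notional test data

  def list_orders(conn, is_admin: bool):
      # Non-admins never see notional rows; admins (and the tests) see everything.
      query = "SELECT id, total FROM orders"
      if not is_admin:
          query += " WHERE is_notional = 0"
      return conn.execute(query).fetchall()

  print(list_orders(conn, is_admin=False))  # [(1, 100.0)]
  print(list_orders(conn, is_admin=True))   # [(1, 100.0), (2, 1.0)]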

I’m sure there are more complex and automated solutions but when it comes to testing, simple and flexible is often the way to go.


👤 brianwawok
Anytime you need to talk to a third party API, you need to test in prod.

Some people have sandbox APIs. They are generally broken and not worth it. See eBay for a super in-depth sandbox API that never works.

You can read the docs 100 times over. At the end of the day, the API is going to work like it works. So you kind of “have to” test in prod for these guys.


👤 dmitriid
A/B tests and feature flags are basically testing in prod. And yes, some of those features sometimes run as a "well, it should work, but we're not entirely sure until we get a significant number of users using the system". It could be an edge case failing or scalability requirements being wrong.

Another variation on the same theme is rewriting systems, where you run production data through both the old and the new system. Quite often that's the only way of doing migrations to a new platform, or a new database, or, yes, a newly re-written system.

> Is it common practice to inject fabricated data into a prod system to run such tests? What's the best practice or prior art on doing this well?

A very common practice is to run a snapshot of prod data (e.g. last hour, or last 24 hours, or even a week/month/year) through a system in staging (or cooking, or pre-cooking, or whatever name you give the system that's just about to be released). However, doing it properly may not be easy, and depends on the systems involved.
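
A rough sketch of the dual-run/replay idea; the old_pipeline/new_pipeline functions are placeholders for whatever system is being migrated:

  # Feed the same snapshot of prod data through the old and the new system
  # and diff the outputs; any mismatch is a regression in the rewrite.
  def old_pipeline(record: dict) -> dict:
      return {"id": record["id"], "total": round(record["amount"] * 1.2, 2)}

  def new_pipeline(record: dict) -> dict:
      return {"id": record["id"], "total": round(record["amount"] * 1.2, 2)}

  def replay(snapshot: list) -> list:
      """Return the records on which the two systems disagree."""
      mismatches = []
      for record in snapshot:
          old, new = old_pipeline(record), new_pipeline(record)
          if old != new:
              mismatches.append({"record": record, "old": old, "new": new})
      return mismatches

  snapshot = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 99.99}]
  print(replay(snapshot))  # [] means the rewrite matches the legacy behaviour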


👤 fleekonpoint
We run canaries in prod. They aren't as extensive as the integration tests that run in our test stages, but they still cover the happy paths for most of our APIs.

👤 cloudking
I think it depends on how your application works. If you have the concept of customers, then you can, for example, have a test customer in production with test data that doesn't affect real customers. You can reset the test customer's data each time you want to test.

👤 HereBeBeasties
Good testing is an exercise in pushing I/O to the fringes, as that's what has stateful side-effects. (Some might even argue that anything that tests I/O is an integration test. The term "integration test" is not well defined and not worth getting hung up over IME.)

Once you're into testing I/O, which is ultimately unavoidable no matter how hard you try to avoid it, you either need cooperative third parties who can give you truly representative test systems (rare) or a certain amount of test-in-prod.

Testing database stuff remains hard. You either wrap things in some kind of layer you can mock out, or dupe prod or some subset of it into a staging environment with a daily snapshot or similar and hope any differences (scale, normally) aren't too bad.

Copy-on-write systems or those with time-travel and/or immutability help immensely with test-in-prod, especially if you can effectively branch your data. If it's your own systems you are testing against, things like lakefs.io look pretty useful in this regard.

And yes, feature flags, good metrics, and load balancers that let you send a small percentage of traffic through a new version (if your traffic/system allows such things) all help.


👤 adra
My org has done a bunch of what's already covered here. We have a bunch of customers (SaaS), and though we have a good idea of what's going well in aggregate through observability, it's hard to gauge whether any single org is getting exactly the repeatable results they should expect vs. a statistically acceptable volume for everyone. Because of this, we also set up synthetic accounts for test customers and regularly drive test scenarios through them to make sure a single customer doing the same old boring workloads is also doing alright. It tends to catch large issues caused by changes that affect outputs without changing the volumes/latency. It's like end-to-end testing of very common hot paths, running forever in a real customer account flagged to skip billing. It tends to catch regressions way more often than it rightly should.

👤 atemerev
In electronic trading, most new systems are tested in production by running with a smaller capital allocation first. It is hard to shake out all the bugs unless you are on the real market with real money and real effects (of course, simulation testing and unit testing are heavily employed too).


👤 caust1c
Hi! I wrote the referenced Segment post! Happy to answer any questions.

The way we did it safely is just as you say: creating fabricated users/organizations/configurations with data generators and injecting them into the system.

Faking data to look realistic is always challenging, but we used this cool library written by an early segment engineer: https://github.com/yields/phony

Not perfect but works well enough! And it's super simple. :-)
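
For flavour, a toy generator in the same spirit (this is not phony itself, which is a small standalone Go tool; just an illustrative sketch of generating clearly tagged synthetic users):

  import random
  import uuid

  FIRST_NAMES = ["amira", "bo", "chen", "dara"]
  DOMAINS = ["example.com", "example.org"]   # reserved domains, safe for fake data

  def fake_user() -> dict:
      """Generate one synthetic user, clearly tagged so prod code can filter it."""
      name = random.choice(FIRST_NAMES)
      return {
          "id": f"test-{uuid.uuid4()}",              # prefix marks it as synthetic
          "name": name,
          "email": f"{name}.{random.randint(1, 9999)}@{random.choice(DOMAINS)}",
          "is_synthetic": True,                      # never billed, never emailed
      }

  print(fake_user())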


👤 turtleyacht
Sometimes one cannot get the exact same specs on test hardware versus production, yet a rollout depends on simulating system load to shake out issues.

Performance testing needs a schedule, visibility, timebox, known scope, backout plan, data revert plan, pre- and post-graphs.

- Schedule: folks are clearly tagged in a table with times down the side.

- Visibility: folks who should know, know when it's going to happen, are invited to the session, and are mentioned in the distributed schedule.

- Timebox: it's going to start at a defined time and end at a defined time.

- Known scope: is it going to fulfill an order? How many accounts will be created?

- Backout plan: DBA and DevOps on standby to stop the test.

- Data revert plan: we know what rows to delete or update after testing.

- Pretty pictures: you want to show graphs during the test, so that you know what to improve and everyone's time wasn't wasted.

Reference: observing successful runs that didn't result in problems later.

👤 nmstoker
With certain kinds of reporting/BI tools, I've generally found it's not that risky to test in production, provided certain conditions apply, and it comes with a number of advantages where the QA environments don't truly mimic what happens in production (or the time for updates in QA is way too slow, so you don't see varied output cases appearing fast enough to give a good test).

A common dev concern (usually raised by people who have no idea how users actually use stuff!) is that someone might pick up the report and then do something awful based on it, which would be awful^2 - I then explain that users can't find/use the updated reports till we tell them where they are and grant access permissions etc etc, so it's going to be fine and there's no need to panic, which calms them down till they forget by the time this comes up again!

On a side note, these people seem to get much more wound up about principle-based worries (it would be bad to test in Prod being a prime example) compared to concerns based on their own weaknesses (i.e. they rush, forget a whole section of requirements, make mistakes, can't spot obvious bugs), which they seem to imagine are way less likely to cause problems than experience demonstrates.


👤 Msurrow
As others have said, injecting fabricated data into prod won't give you any value. The only reason to test in prod these days is to try your new feature on data with the breadth and level of detail that prod data has and that you can never fabricate. (Hardware differences between prod and other envs really should not be a problem these days.)

In almost every case you cannot test new functionality on actual prod data, at least not anything that's not strictly "read only" functionality. If you have a new feature that sends automated mail to someone foreclosing on their property, you just do not test that on a real live system.

What you can do is set up a staging environment that is as close to the prod config as possible, and then copy the prod database to the staging env. Do your tests in staging. It doesn't matter if data in stg gets messed up. There may well be legal, company policy, or security restrictions preventing you from doing this, but it's the only way to test on real-life data without the risk of f**ing up data in the live system.

Then there are integration tests (to other systems, that is), which are a much harder problem.


👤 nitwit005
What I've done in the past is to write a test that runs every five minutes in production, accessing the APIs like a user for the most common app flows. It provided a great way to be sure the app was genuinely working.

That did require having multi-tenancy support, and there was a need to suppress some security features by whitelisting the IP of the test app.
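
A minimal sketch of that kind of probe; the URL, tenant name, and alerting are placeholders, not the actual setup described above:

  import time
  import urllib.request

  BASE = "https://api.example.com"      # hypothetical endpoint
  TENANT = "synthetic-monitoring"       # dedicated test tenant, whitelisted by IP

  def probe() -> bool:
      """Exercise the most common flow like a real user; True means healthy."""
      try:
          req = urllib.request.Request(f"{BASE}/v1/items?tenant={TENANT}")
          with urllib.request.urlopen(req, timeout=5) as resp:
              return resp.status == 200
      except Exception:
          return False

  while True:
      if not probe():
          print("ALERT: synthetic probe failed")    # page someone / emit a metric
      time.sleep(300)                               # every five minutes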


👤 alkonaut
If you are lucky you have internal users who use the product in production. Then if you are lucky you have a group of external power users who appreciate getting features first and understand that there is a risk of bugs.

Most software probably falls in this category of “we could test more but at some point our users would rather get the product with bugs than wait”.

Whether it's staged rollouts or feature flags, it's the same thing: mitigating risk when testing in prod. It's the best bet.

Some software obviously falls into the category that can’t have serious bugs for any users. Then you just have to keep the software so simple that you can be confident it works.


👤 j_kao
Testing in production will trend upwards among companies because everybody's workload is shifting towards the cloud and/or making use of external SaaS services. There are a number of cloud services that are not open source, can't be run locally, or can't be run at the same scale as production.

It is not a good use of time to mock everything, because you have no control over external systems. The only case where I'd see it being important is if these external systems are tightly coupled to complex local logic that should be tested locally. However, there are a number of strategies to deal with such "tight coupling".


👤 yellowapple
One approach I recall taking as the in-house developer for a company's warehouse management system was to designate the warehouse I was in to be the "experimental" warehouse: new features (or new systems entirely!) would be developed/configured in coordination with that warehouse's team, and they were generally comfortable with the idea that they would get those new features first (with the risks that entails). Once my local site had used the new feature/system for a few weeks without major incident, it would then get rolled out to the other sites.

👤 Pamar
In most of my projects I made sure that a recent (1) copy of the whole production DB was always available - this is mostly used to be able to replicate erroneous behaviour in a controlled environment.

But it is also useful to get very close to "test in prod" without actually risking anything.

Actually executing data-changing code for testing is actively discouraged, though.

1) The current system takes a snapshot of the production DB at the end of the day and uses it to repopulate this "staging" environment from scratch. In past cases I had to accept less frequent updates, though.


👤 raptorraver
Currently not, but there was a project where we had to "develop" in production. I was coding an IoT adapter for a building automation system. We did have development machines, but when we first tested our code in the real env we noticed they were a slightly different version. So there was no other way than to ssh into our machines and use Vim to make the code work, then replicate the changes on our own computers. Fun times, but I don't really miss the stress of messing something up in a real building.

👤 sethammons
Note: if you plan on accurate financial planning and metrics (esp. if going public), you need to be able to separate your test prod stats from the real prod stats for reporting.
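
One way to sketch that separation is to tag synthetic traffic at the source and report the two buckets independently (the flag and metric names here are illustrative):

  # Keep synthetic/test traffic out of business metrics by tagging it when the
  # event is recorded, so finance only ever sees the "real" bucket.
  def record_order(metrics: dict, amount: float, is_synthetic: bool) -> None:
      bucket = "revenue.synthetic" if is_synthetic else "revenue.real"
      metrics[bucket] = metrics.get(bucket, 0.0) + amount

  metrics = {}
  record_order(metrics, 49.99, is_synthetic=False)
  record_order(metrics, 1.00, is_synthetic=True)    # canary purchase, not revenue
  print(metrics)   # {'revenue.real': 49.99, 'revenue.synthetic': 1.0}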

👤 paulryanrogers
I tried this for a while, marking such tests as production compatible. They relied on test records made for the purpose, sometimes copied to other environments so the same tests could run there.

For 3rd parties with test modes, like Stripe, you can get E2E coverage; the same goes if the cost of a real test transaction is low.

Some safety controls to avoid running non-prod-safe tests are wise.

Another alternative is using anonymized prod copies outside prod. Possibly even mocking 3rd parties to behave like prod, happy, sad, etc.


👤 user68858788
One box testing works well for some scenarios. It's not completely safe but the risks are low. If there is an issue then it only impacts a very small number of customers. If they retry they'll likely hit one of the thousands of stable instances. Comparing metrics between the one box and normal instances is helpful and can be tied in to CI/CD for automatic rollbacks if necessary.

👤 ScrexyScroo
At my workplace, which deals with millions of customers, what we have is essentially a segmented clone of prod. It's a 1:1 copy of prod with real data flowing through.

We use feature flags to enable/disable features. This way, when our devs ship code to prod, it first lands in this segmented clone.

Then changes are incrementally propagated from this segmented area into the actual prod-prod.


👤 mabbo
I put my new feature behind a beta flag or experiment flag. If the flag is off for a user, they don't see it.

Then I turn it on just for the user I test with in prod. Then I test in prod.

When it's time to enable the feature for the rest of the users, the same system lets me slowly dial up which users can see the feature. This separates deployment from launch, which is also a great best practice.
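
A sketch of the "dial up" part: a deterministic hash buckets each user into 0-99 and the rollout percentage decides who sees the feature (illustrative only, not any specific flag service):

  import hashlib

  def sees_feature(flag: str, user_id: str, rollout_percent: int) -> bool:
      """Deterministic per-user bucketing, so a user keeps the same experience."""
      digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
      return int(digest, 16) % 100 < rollout_percent

  # Deploy with rollout_percent=0 (nobody sees it), then dial up 1 -> 10 -> 100.
  print(sees_feature("new_editor", "user-42", rollout_percent=0))    # False
  print(sees_feature("new_editor", "user-42", rollout_percent=100))  # True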


👤 dec0dedab0de
The taboo is when you only test in production. At the very least you should manually try out your app after deploying a change. As for automated integration tests in production, it is as simple as identifying which tests are prod-safe and marking them. That really depends on the app, but in a web app it generally means all the GET requests, plus some of the others.
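
A sketch of the marking approach using pytest markers (the prod_safe marker name and URL are illustrative; you'd register the marker in pytest.ini and run pytest -m prod_safe against prod):

  import urllib.request

  import pytest

  BASE = "https://app.example.com"    # hypothetical prod URL

  @pytest.mark.prod_safe              # read-only GET, safe to run against prod
  def test_homepage_loads():
      with urllib.request.urlopen(f"{BASE}/") as resp:
          assert resp.status == 200

  @pytest.mark.prod_safe              # another read-only check
  def test_pricing_page_loads():
      with urllib.request.urlopen(f"{BASE}/pricing") as resp:
          assert resp.status == 200

  def test_delete_account():          # unmarked: only ever runs pre-prod
      ...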

👤 quickthrower2
In a multi-tenant system one of the accounts can be a test account. Within that you can run integration tests. You might need special cases: test payment accounts and credit cards, test pricing plans, and so on.

Some basic ping tests and other checks before swapping a new version into production (as in preparing, initiating, and pointing the load balancer at it) would be smart.


👤 labarilem
I've actually never seen a team that hasn't tested their system in prod. Just be careful with fake data: you might be testing something that does not mirror actual application usage. Feature flags, betas, etc. can be safer than fake data.

👤 hk1337
Generally, no. I have been known to point my local instance at the production database where I work now, as it's easier to get the dataset where an error occurs. I don't do anything that requires changing the data, strictly selects and views. I make a point of switching it off production ASAP.

I would prefer not having to do that at all though.


👤 throw1138
A significant source of frustration at $dayjob recently has been the _inability_ to test in production. We've just deployed Stripe, and if you're using prod API keys, there's no testing possible without spending real money. Deploy to production and pray to the tech gods I guess.

👤 jdougan
In most of the systems I worked with the ACID database is the source of truth. So I carefully (there is framework support to reduce errors) run tests without committing the open transaction. Not recommended, but sometimes database copies don't surface the actual problem.
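
A minimal sketch of the rollback-only pattern, with sqlite3 standing in for the production database:

  import sqlite3

  conn = sqlite3.connect(":memory:")
  conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
  conn.execute("INSERT INTO accounts VALUES (1, 100.0)")
  conn.commit()

  try:
      cur = conn.cursor()
      # Exercise the code under test inside the open transaction...
      cur.execute("UPDATE accounts SET balance = balance - 25 WHERE id = 1")
      cur.execute("SELECT balance FROM accounts WHERE id = 1")
      assert cur.fetchone()[0] == 75.0       # ...and assert on its behaviour
  finally:
      conn.rollback()                        # never persist the test's writes

  cur = conn.execute("SELECT balance FROM accounts WHERE id = 1")
  print(cur.fetchone()[0])                   # 100.0 -- prod data untouched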

👤 casualwriter
Yes, I usually do testing in production. I feel much more comfortable if I can perform enough testing in production before informing end users to start using it.

In my experience, it starts in the DESIGN PHASE, where you should take care to make it possible to test in production without impacting end users. The CODING PHASE should make arrangements to allow more testing in the production environment. The DEPLOYMENT process should provide a time gap, like a "pre-launch". Then I am happy to test during the "pre-launch" period in production, and feel confident informing end users about the release.


👤 bradwood
I see a lot of suggestions in the comments for feature flags -- we've been using these from the beginning, to very good effect.

However flags turn on/off code, not data, and my main area of interest here is how to deal with the test data problem in prod.


👤 shashurup
What a coincidence! Just right now. I definitely don't consider this a normal situation. However, the crisis in progress leaves no other way to find the root cause of a sudden production database degradation. So here I am.

👤 logicalmonster
Just to play Devil's Advocate and be argumentative, what is the point of testing in production when your development/staging environment is guaranteed to be identical to your production environment?

👤 __s
Yes

Just because you have staging doesn't mean you don't need unit tests. Similarly, test in stage, then test in prod. Ideally in a way isolated from real prod users (eg, in an insurance system we had fake dealer accounts for testing)


👤 rr808
Depends a lot on your application and how big the changes are. If you're an online store and you're pushing out incremental changes to a subset of users, it's a good strategy. If it's an aircraft autopilot, not so much.

👤 kojeovo
I wouldn't personally inject fabricated data into prod just for testing. I use feature flags and test internally in prod before rolling out to real users.

👤 revskill
It's more about handling production errors quickly than about testing in production. Feature flags are a good way.

👤 jedberg
I've been testing in prod for 20+ years, here are the best practices I suggest:

tl;dr: Safety comes in the form of confidence that you will know right away when something has gone wrong and can quickly recover from it back to the last known good state.

1) Observability is key. You can't test in prod unless you have really good metrics and monitoring in case you break something. It's also the only way you'll know the test worked. So that has to come first.

2) Automated deployment and rollback. You need your deployments to be fully automated as well as rollback. That way if something goes wrong you can quickly back out the change. It also means that devs can roll out smaller changes, because they don't have to amortize any deployment overhead. If a dev knows it will take 30 minutes minimum to deploy, they won't do it as often. Smaller deployments more often mean smaller blast radii.

3) Automated canaries. Once you have 1 and 2, you can fairly easily build 3. When code is checked in, have it automatically deploy and receive a small portion of traffic. Then have it automatically monitored and compare metrics. If the metrics are worse on the canary, roll it back.

You don't need to automate step 3, it's just a lot easier. But you can totally do step 3 by hand as long as you have 1 and 2.
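
A toy sketch of the comparison in step 3; the thresholds and metric sources are illustrative, not a prescription:

  def should_rollback(baseline_errors: int, baseline_reqs: int,
                      canary_errors: int, canary_reqs: int,
                      tolerance: float = 1.5) -> bool:
      """Roll back if the canary's error rate is meaningfully worse than the fleet's."""
      baseline_rate = baseline_errors / max(baseline_reqs, 1)
      canary_rate = canary_errors / max(canary_reqs, 1)
      return canary_rate > baseline_rate * tolerance + 0.001   # small absolute floor

  print(should_rollback(baseline_errors=50, baseline_reqs=100_000,
                        canary_errors=30, canary_reqs=1_000))  # True -> roll back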

These steps apply to stateless systems, but they can easily be applied to stateful systems with some small changes. With stateful systems you can still do canaries. But you have to add an abstraction layer between your business functions and their datastore (but you're doing that already right?). In that abstraction layer is where you add the coordination to keep data in sync during transitions from one data store to another (when doing schema changes for example). Or if you're changing the way you write to the data store in any way, so that you can write to both new and old and read from new and old without the code being different between them.

And then lastly you start adding in chaos engineering [0]. If your systems can automatically recover from errors in production, then they can automatically recover from bad deployments.

[0] https://principlesofchaos.org


👤 tonymet
Yes, you can do so with a canary tier. Assuming your code is well instrumented to distinguish performance and quality regressions, a canary tier served to customers will catch more regressions than synthetic testing.

👤 mlhpdx
Anyone with CI/CD is testing each deployment in production, right?

👤 satisfice
I don’t recall testing in production ever being taboo in the 90’s.

👤 msh
Testing is called verification when done in production ;)

👤 tiku
Is there another way?

👤 natoliniak
Feature flags.

👤 i_have_an_idea
I also code in production

👤 jrockway
I've never done much testing in production. A long time ago I was too lazy to put my website into git, so I would just ssh into the webserver and edit the HTML files with mg. Not particularly productive or enjoyable, honestly. I am sure the search engines also liked index.html~ being very similar to index.html; I was also too lazy to turn off backup files ;)

My priorities with production are getting as much information recorded as possible; if there is ever a bug that occurs and isn't detected by monitoring and debuggable by looking at the telemetry, that's a big problem that is a priority to fix. It is always a work in progress, but something that you can chip away at gradually over time. (Add them as postmortem action items.)

The provided articles mention weird quirks that only happen in production, like network card firmware issues that drop a particular bit pattern. I've definitely seen things like this (at a higher level); I add the bit patterns to my test suite and make the test suite runnable as an "application" in the production environment and then collect my data. As for straight-up hardware problems, that's happened exactly once in my career. I used to maintain a several-thousand replica application; one day one replica was crash looping. I looked at the stack traces, different each time, and couldn't figure out what was possibly wrong with the code. A nearby coworker suggested "just restart that replica with --avoid_parent to schedule it on a different machine". The problem went away and never came back. Shrug. Sometimes the computer doesn't faithfully run the instructions that you put into memory, but it is pretty rare. Detect it and remove the faulty computer, I guess.

For less quirky things, I like the ability to simulate resource constraints, rather than trying to run into them with physical hardware. For example, it's pretty hard to write a load test that makes S3 slow, but it's pretty easy to hack up Minio to sleep for a second every MB of data and now your load tests can see what blows up when S3 is slow. Then you can edit your code to be resilient against that. (etcd on low iops disks has also been a problem in my work; that is easy enough to simulate without changing the code, cgroups provides a mechanism. Now you don't actually have to generate enough load to make your disk slow.) Adjusting network latency with "tc qdisc add dev X netem ..." has also been useful for debugging slow file uploads over high-latency links without actually going through the hassle of renting a server far away to upload things to. I will say the disadvantage there is that the less you know about the full stack, the less you trust your simulations. You'll end up with a lot of pushback along the lines of "that's not a real scenario", and it is true that calling Write() slowly versus the OS not returning from the write() syscall because the disk is busy is a slightly different codepath and there can always be side effects that you're missing. But often the black box model is a worthwhile tradeoff for improved development cycle times; just make sure you add the instrumentation to real production so you can get data about how good your simulation is.

I'm willing to use error/latency budget for unusual production deployments to collect real-world data, for example, running 10% of requests through a build with the race detector enabled. That now accounts for your worst 10% response latency (and errors if you do have data races in hot paths!), but if it's within the budget, it's worth it because you get a stack trace pointing at a critical correctness error in your code, and you can go add that case to your unit tests and never have the problem again. Sometimes you can't think of everything, which is why telemetry from production is so important to me. (This kind of data is important for more than just the mechanics of the code, of course. Talk to your users and see if they like the new icon set. If they don't, your test in production failed and you should fix your app.) Finally, I also like fuzz testing on top of all of this; have a beefy computer generating the most corrupt possible data billions of times a second and see how your app behaves. Every fuzz test I've ever written has exposed a boneheaded subtle mistake in the code, even in code with 100% test coverage.