My question is -- how does one go about doing it _safely_? In particular, I'm thinking about data. Is it common practice to inject fabricated data into a prod system to run such tests? What's the best practice or prior art on doing this well?
Ultimately, I think this will end up looking like implementing SLIs and SLOs in PROD, but for some of my SLOs, I think I need to actually _fake_ the data in order to get the SLIs I need, so how to do this?
Suggestions appreciated -- thanks.
[1] https://increment.com/testing/i-test-in-production/
[2] https://segment.com/blog/we-test-in-production-you-should-too/
Some common ways to go about it:
- Feature flags: every new change goes into your codebase behind a flag. You can flip the flag for a limited set of users and do a broader rollout when ready (see the sketch after this list).
- Staged rollouts: have staging/canary etc. environments and roll out new deployments to them first. Observe metrics and alerts to check if something is wrong.
- Beta releases: have a group of internal/external power users test your features before they go out to the world.
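A minimal sketch of the feature-flag / percentage-rollout mechanics from the first bullet (the flag store, flag names, and hash-based bucketing are illustrative assumptions, not any particular flag system):

```python
import hashlib

# Hypothetical in-memory flag store: flag name -> rollout percentage (0-100).
# In a real system this would live in a config service or database.
FLAGS = {
    "new_checkout_flow": 5,   # on for roughly 5% of users
    "dark_mode": 100,         # fully rolled out
}

def is_enabled(flag: str, user_id: str) -> bool:
    """Return True if `flag` is on for this user.

    Hashing (flag, user_id) puts each user in a stable bucket in [0, 100),
    so a user sees consistent behaviour across requests, and dialing the
    percentage up only ever adds users.
    """
    rollout = FLAGS.get(flag, 0)
    bucket = int(hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest(), 16) % 100
    return bucket < rollout

if __name__ == "__main__":
    print(is_enabled("new_checkout_flow", "user-42"))
    print(is_enabled("dark_mode", "user-42"))
```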
Some people try to NOT test in production, but everyone does test in prod in a very real sense because dependencies and environments are different in prod.
I think the question was "Do you INTENTIONALLY test in production?"
Re: data, it’s a somewhat common practice to notionalize data (think isomorphically faking data). We regularly do this and will often designate rows as notional to hide them from users who aren’t admins. I’ve found this to work exceptionally well; we do this 1-2 times a week, ensure there’s a closed circuit for notional data, and for more critical systems we’ll inform our customers that testing will occur.
I’m sure there are more complex and automated solutions but when it comes to testing, simple and flexible is often the way to go.
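As a rough sketch of what the notional-row idea can look like at the query layer (the table, the is_notional column, and the admin check are hypothetical, not the commenter's actual schema):

```python
import sqlite3

# Toy schema: rows carry an `is_notional` flag so fabricated test data can
# live alongside real data but stay invisible to non-admin users.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, is_notional INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "real customer", 0), (2, "NOTIONAL test order", 1)],
)

def list_orders(is_admin: bool):
    """Admins see everything; everyone else only sees real rows."""
    if is_admin:
        query = "SELECT id, customer FROM orders"
    else:
        query = "SELECT id, customer FROM orders WHERE is_notional = 0"
    return conn.execute(query).fetchall()

print(list_orders(is_admin=False))  # only the real order
print(list_orders(is_admin=True))   # real + notional rows
```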
Some people have sandbox APIs. They are generally broken and not worth it. See eBay for a super in-depth sandbox API that never works.
You can read the docs 100 times over. At the end of the day, the API is going to work like it works. So you kind of “have to” test in prod for these guys.
Another variation on the same theme is rewriting systems, where you run production data through both the old and the new system and compare. Quite often that's the only way of doing migrations to a new platform, or a new database, or yes, a newly re-written system.
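A small sketch of that dual-run idea, assuming a hypothetical pricing function being rewritten: serve the old implementation's answer, shadow-run the new one on the same production input, and log any divergence.

```python
import logging

logger = logging.getLogger("shadow_compare")

def old_price(order: dict) -> float:
    # Existing implementation that currently serves production.
    return order["qty"] * order["unit_price"]

def new_price(order: dict) -> float:
    # Rewritten implementation under test.
    return round(order["qty"] * order["unit_price"], 2)

def price(order: dict) -> float:
    """Serve the old system's answer, but shadow-run the rewrite on the
    same production input and record any divergence."""
    old = old_price(order)
    try:
        new = new_price(order)
        if new != old:
            logger.warning("mismatch for order %s: old=%r new=%r",
                           order.get("id"), old, new)
    except Exception:
        logger.exception("rewrite failed for order %s", order.get("id"))
    return old

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    print(price({"id": 1, "qty": 3, "unit_price": 9.99}))
```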
> Is it common practice to inject fabricated data into a prod system to run such tests? What's the best practice or prior art on doing this well?
A very common practice is to run a snapshot of prod data (e.g. last hour, or last 24 hours, or even a week/month/year) through a system in staging (or cooking, or pre-cooking, or whatever name you give the system that's just about to be released). However, doing it properly may not be easy, and depends on the systems involved.
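A hedged sketch of the replay half of that, assuming you already have a snapshot of recent production requests as JSON lines and an HTTP staging endpoint (both the snapshot format and the staging hostname are made up for illustration):

```python
import json
import urllib.error
import urllib.request

# Replay a snapshot of recent production requests against staging.
STAGING = "https://staging.example.internal"

def replay(snapshot_path: str) -> None:
    with open(snapshot_path) as f:
        for line in f:
            event = json.loads(line)
            body = json.dumps(event["body"]).encode() if event.get("body") else None
            req = urllib.request.Request(
                STAGING + event["path"],
                data=body,
                method=event["method"],
                headers={"Content-Type": "application/json"},
            )
            try:
                with urllib.request.urlopen(req, timeout=10) as resp:
                    status = resp.status
            except urllib.error.HTTPError as e:
                status = e.code
            print(event["method"], event["path"], "->", status)

# replay("prod_requests_last_hour.jsonl")
```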
Once you're into testing I/O, which is ultimately unavoidable no matter how hard you try not to, you either need cooperative third parties who can give you truly representative test systems (rare) or a certain amount of test-in-prod.
Testing database stuff remains hard. You either wrap things in some kind of layer you can mock out, or dupe prod (or some subset of it) into a staging environment with a daily snapshot or similar and hope any differences (scale, normally) aren't too bad.
Copy-on-write systems or those with time-travel and/or immutability help immensely with test-in-prod, especially if you can effectively branch your data. If it's your own systems you are testing against, things like lakefs.io look pretty useful in this regard.
And yes, feature flags, good metrics, and load balancers that let you send a small percentage of traffic to a new version (if your traffic/system allows such things) all help.
The way we did it safely is just as you say: creating fabricated users/organizations/configurations with data generators and injecting them into the system.
Faking data to look realistic is always challenging, but we used this cool library written by an early segment engineer: https://github.com/yields/phony
Not perfect but works well enough! And it's super simple. :-)
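Faking realistic-looking values is mostly template expansion; here's a tiny standalone sketch of that idea in Python (the field names and word lists are invented, and this is not phony's actual interface):

```python
import random

# Tiny template-style fake-data generator: each field name maps to a function
# that produces a plausible-looking value.
FIRST = ["Alice", "Bob", "Carol", "Dai", "Esi"]
LAST = ["Nguyen", "Smith", "Garcia", "Okafor", "Larsen"]
DOMAINS = ["example.com", "example.org"]

GENERATORS = {
    "name": lambda: f"{random.choice(FIRST)} {random.choice(LAST)}",
    "email": lambda: f"user{random.randint(1000, 9999)}@{random.choice(DOMAINS)}",
    "order_total": lambda: round(random.uniform(1, 500), 2),
}

def fake_record(fields: list[str]) -> dict:
    """Build one fabricated record for the requested fields."""
    return {field: GENERATORS[field]() for field in fields}

if __name__ == "__main__":
    for _ in range(3):
        print(fake_record(["name", "email", "order_total"]))
```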
Performance testing needs a schedule, visibility, timebox, known scope, backout plan, data revert plan, pre- and post-graphs.
- Schedule. Folks are clearly tagged in a table with times down the side.
- Visibility. Folks who should know know when it's going to happen, are invited to the session, and are mentioned in the distributed schedule.
- Timebox. It's going to start at a defined time and end at a defined time.
- Known scope. Is it going to fulfill an order? How many accounts created?
- Backout plan. DBA and DevOps on standby for stopping the test.
- Data revert plan. We know what rows to delete or update after testing.
- Pretty pictures. You want to show graphs during the test, so that you know what to improve and everyone's time wasn't wasted.
Reference: observing successful runs that didn't result in problems later.
A common dev concern (usually raised by people who have no idea how users actually use stuff!) is that someone might pick up the report and then do something awful based on it, which would be awful^2. I then explain that users can't find or use the updated reports until we tell them where they are and grant access permissions etc., so it's going to be fine and there's no need to panic. That calms them down until they've forgotten by the next time this comes up!
On a side note, these people seem to get much more wound up about principle-based worries (it would be bad to test in Prod being a prime example) than about concerns based on their own weaknesses (i.e. they rush, forget a whole section of requirements, make mistakes, can't spot obvious bugs), which they seem to imagine are far less likely to cause problems than experience demonstrates.
In almost every case you cannot test new functionality on actual prod data, at least not anything that isn't strictly "read only" functionality. If you have a new feature that sends automated mail to someone foreclosing on their property, you just do not test that on a real live system.
What you can do is set up a staging environment that is as close to prod config as possible, and then copy the prod database to the staging env. Do your tests in staging. It doesn't matter if data in staging gets messed up. There may well be legal, company policy, or security restrictions preventing you from doing this, but it's the only way to test on real-life data without the risk of f**ing up data in the live system.
Then there are integration tests (to other systems, that is), which are a much harder problem.
That did require having multi-tenancy support, and there was a need to suppress some security features by whitelisting the IP of the test app.
Most software probably falls in this category of “we could test more but at some point our users would rather get the product with bugs than wait”.
Whether it’s staged rollouts or feature flags, it’s the same thing: mitigating risk while testing in prod. It’s the best bet.
Some software obviously falls into the category that can’t have serious bugs for any users. Then you just have to keep the software so simple that you can be confident it works.
It is not a good use of time to mock everything, because you have no control over external systems. The only reason I'd see it being important is if these external systems are tightly coupled to complex local logic that should be tested locally. However, there are a number of strategies to deal with that tight coupling.
But it is also useful to get very close to "test in prod" without actually risking anything.
Actually executing data-changing code for testing is actively discouraged, though.
1) the current system takes a snapshot of the production db at the end of the day and uses it to repopulate this "staging" environment from scratch. In past cases I had to accept less frequent updates, though.
For 3rd parties with test modes, like Stripe, you can get E2E coverage, or you can hit the real thing if the cost of the test is low.
Some safety controls to avoid running non-prod-safe tests are wise.
Another alternative is using anonymized prod copies outside prod. Possibly even mocking 3rd parties to behave like prod: happy paths, sad paths, etc.
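A hedged sketch of the anonymized-copy approach: deterministically pseudonymize PII columns in a prod export before loading it into staging, so joins still line up but real values don't leave prod (the column names, file format, and secret handling are assumptions):

```python
import csv
import hashlib
import hmac

# HMAC with a secret key keeps pseudonyms consistent across tables (so joins
# on email still work) without being reversible by staging users.
SECRET = b"rotate-me-and-keep-out-of-version-control"
PII_COLUMNS = {"email", "full_name", "phone"}

def pseudonymize(value: str) -> str:
    return hmac.new(SECRET, value.encode(), hashlib.sha256).hexdigest()[:16]

def anonymize_csv(src: str, dst: str) -> None:
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            for col in PII_COLUMNS & set(row):
                row[col] = pseudonymize(row[col])
            writer.writerow(row)

# anonymize_csv("prod_customers.csv", "staging_customers.csv")
```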
We use feature flags to enable or disable features. This way, when our devs ship code to prod, it first lands in this segmented clone.
Then incrementally changes are propagated from this segmented area into actual prod-prod.
Then I turn it on just for the user I test with in prod. Then I test in prod.
When it's time to enable the feature for the rest of the users, the same system lets me slowly dial up which users can see the feature. This separates deployment from launch, which is also a great best practice.
Some basic ping tests and other checks before swapping a new version into production (as in preparing it, initiating the swap, and pointing the load balancer at it) would be smart.
I would prefer not having to do that at all though.
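That said, if you do go the swap route, a minimal sketch of those pre-swap checks might look like this (the internal address and the /healthz and /version paths are assumptions):

```python
import sys
import urllib.error
import urllib.request

# Gate the load-balancer swap on a few basic checks against the new
# version's internal address before any real traffic is pointed at it.
NEW_VERSION = "http://10.0.0.12:8080"
CHECKS = ["/healthz", "/version"]

def healthy(base_url: str) -> bool:
    for path in CHECKS:
        try:
            with urllib.request.urlopen(base_url + path, timeout=5) as resp:
                if resp.status != 200:
                    print(f"{path} returned {resp.status}")
                    return False
        except urllib.error.URLError as e:
            print(f"{path} failed: {e}")
            return False
    return True

if __name__ == "__main__":
    if not healthy(NEW_VERSION):
        sys.exit("new version failed pre-swap checks; not pointing the load balancer at it")
    print("checks passed; safe to start the swap")
```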
In my experience, it starts from the DESIGN PHASE, which should be aware of the need to make testing in production possible without impacting end users. The CODING PHASE should make arrangements that provide more opportunities for testing in the production environment. The DEPLOYMENT process should be able to provide a time gap like a "pre-launch" period. Then you can happily test during that "pre-launch" period in production, and feel confident informing end users of a release.
However, flags turn code on/off, not data, and my main area of interest here is how to deal with the test data problem in prod.
Just because you have staging doesn't mean you don't need unit tests. Similarly, test in staging, then test in prod. Ideally in a way isolated from real prod users (e.g., in an insurance system we had fake dealer accounts for testing).
tl;dr: Safety comes in the form of confidence that you will know right away when something has gone wrong and can quickly recover from it back to the last known good state.
1) Observability is key. You can't test in prod unless you have really good metrics and monitoring in case you break something. It's also the only way you'll know the test worked. So that has to come first.
2) Automated deployment and rollback. You need your deployments to be fully automated as well as rollback. That way if something goes wrong you can quickly back out the change. It also means that devs can roll out smaller changes, because they don't have to amortize any deployment overhead. If a dev knows it will take 30 minutes minimum to deploy, they won't do it as often. Smaller deployments more often mean smaller blast radii.
3) Automated canaries. Once you have 1 and 2, you can fairly easily build 3. When code is checked in, have it automatically deploy and receive a small portion of traffic. Then have it automatically monitored and compare metrics. If the metrics are worse on the canary, roll it back.
You don't need to automate step 3, it's just a lot easier. But you can totally do step 3 by hand as long as you have 1 and 2.
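A rough sketch of the comparison in step 3, whether done by hand or automated (where the numbers come from, e.g. your metrics system, is left out, and the thresholds are illustrative assumptions):

```python
from dataclasses import dataclass

@dataclass
class Metrics:
    error_rate: float      # fraction of requests that failed
    p99_latency_ms: float

def should_rollback(baseline: Metrics, canary: Metrics,
                    max_error_delta: float = 0.005,
                    max_latency_ratio: float = 1.2) -> bool:
    """Roll back if the canary is meaningfully worse than the current fleet."""
    if canary.error_rate > baseline.error_rate + max_error_delta:
        return True
    if canary.p99_latency_ms > baseline.p99_latency_ms * max_latency_ratio:
        return True
    return False

if __name__ == "__main__":
    baseline = Metrics(error_rate=0.002, p99_latency_ms=180.0)
    canary = Metrics(error_rate=0.011, p99_latency_ms=190.0)
    print("rollback" if should_rollback(baseline, canary) else "promote")
```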
These steps apply to stateless systems, but they can be applied to stateful systems with some small changes. With stateful systems you can still do canaries, but you have to add an abstraction layer between your business functions and their datastore (you're doing that already, right?). That abstraction layer is where you add the coordination to keep data in sync during transitions from one data store to another (when doing schema changes, for example), or whenever you change the way you write to the data store, so that you can write to both new and old and read from new and old without the business code differing between them.
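As a sketch of that abstraction layer during a store transition (the store interface here is invented): writes go to both stores, reads are served from the old one, and the new store's answer is shadow-compared.

```python
import logging

logger = logging.getLogger("store_migration")

class DictStore:
    """Stand-in for a real datastore; the interface is an assumption."""
    def __init__(self):
        self._rows = {}
    def put(self, key, value):
        self._rows[key] = value
    def get(self, key):
        return self._rows.get(key)

class MigratingStore:
    """Keep old and new stores in sync during a transition: write to both,
    serve reads from the old store, and log any mismatch with the new one."""
    def __init__(self, old_store, new_store):
        self.old = old_store
        self.new = new_store
    def put(self, key, value):
        self.old.put(key, value)
        self.new.put(key, value)
    def get(self, key):
        value = self.old.get(key)
        shadow = self.new.get(key)
        if shadow != value:
            logger.warning("store mismatch for %r: old=%r new=%r", key, value, shadow)
        return value

if __name__ == "__main__":
    store = MigratingStore(DictStore(), DictStore())
    store.put("user:1", {"plan": "pro"})
    print(store.get("user:1"))
```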
And then lastly you start adding in chaos engineering [0]. If your systems can automatically recover from errors in production, then they can automatically recover from bad deployments.
My priorities with production are getting as much information recorded as possible; if there is ever a bug that occurs and isn't detected by monitoring and debuggable by looking at the telemetry, that's a big problem that is a priority to fix. It is always a work in progress, but something that you can chip away at gradually over time. (Add them as postmortem action items.)
The provided articles mention weird quirks that only happen in production, like network card firmware issues that drop a particular bit pattern. I've definitely seen things like this (at a higher level); I add the bit patterns to my test suite and make the test suite runnable as an "application" in the production environment and then collect my data. As for straight-up hardware problems, that's happened exactly once in my career. I used to maintain a several-thousand replica application; one day one replica was crash looping. I looked at the stack traces, different each time, and couldn't figure out what was possibly wrong with the code. A nearby coworker suggested "just restart that replica with --avoid_parent to schedule it on a different machine". The problem went away and never came back. Shrug. Sometimes the computer doesn't faithfully run the instructions that you put into memory, but it is pretty rare. Detect it and remove the faulty computer, I guess.
For less quirky things, I like the ability to simulate resource constraints, rather than trying to run into them with physical hardware. For example, it's pretty hard to write a load test that makes S3 slow, but it's pretty easy to hack up Minio to sleep for a second every MB of data and now your load tests can see what blows up when S3 is slow. Then you can edit your code to be resilient against that. (etcd on low iops disks has also been a problem in my work; that is easy enough to simulate without changing the code, cgroups provides a mechanism. Now you don't actually have to generate enough load to make your disk slow.) Adjusting network latency with "tc qdisc add dev X netem ..." has also been useful for debugging slow file uploads over high-latency links without actually going through the hassle of renting a server far away to upload things to. I will say the disadvantage there is that the less you know about the full stack, the less you trust your simulations. You'll end up with a lot of pushback along the lines of "that's not a real scenario", and it is true that calling Write() slowly versus the OS not returning from the write() syscall because the disk is busy is a slightly different codepath and there can always be side effects that you're missing. But often the black box model is a worthwhile tradeoff for improved development cycle times; just make sure you add the instrumentation to real production so you can get data about how good your simulation is.
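The "make the dependency slow on purpose" idea can often be approximated purely in the client, without patching Minio or renting hardware; a hedged sketch, assuming an object-store client with a simple get/put interface:

```python
import time

class SlowBlobStore:
    """Wrap any object-store client (interface assumed: get/put by key) and
    inject latency proportional to payload size, so load tests can observe
    what breaks when storage is slow."""
    def __init__(self, inner, seconds_per_mb: float = 1.0):
        self.inner = inner
        self.seconds_per_mb = seconds_per_mb
    def put(self, key: str, data: bytes) -> None:
        time.sleep(len(data) / 1_000_000 * self.seconds_per_mb)
        self.inner.put(key, data)
    def get(self, key: str) -> bytes:
        data = self.inner.get(key)
        time.sleep(len(data) / 1_000_000 * self.seconds_per_mb)
        return data

class InMemoryStore:
    def __init__(self):
        self._blobs = {}
    def put(self, key, data):
        self._blobs[key] = data
    def get(self, key):
        return self._blobs[key]

if __name__ == "__main__":
    store = SlowBlobStore(InMemoryStore(), seconds_per_mb=1.0)
    store.put("report.csv", b"x" * 2_000_000)  # ~2s injected on the write
    print(len(store.get("report.csv")))        # ~2s more on the read
```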
I'm willing to use error/latency budget for unusual production deployments to collect real-world data, for example, running 10% of requests through a build with the race detector enabled. That now accounts for your worst 10% response latency (and errors if you do have data races in hot paths!), but if it's within the budget, it's worth it because you get a stack trace pointing at a critical correctness error in your code, and you can go add that case to your unit tests and never have the problem again. Sometimes you can't think of everything, which is why telemetry from production is so important to me. (This kind of data is important for more than just the mechanics of the code, of course. Talk to your users and see if they like the new icon set. If they don't, your test in production failed and you should fix your app.) Finally, I also like fuzz testing on top of all of this; have a beefy computer generating the most corrupt possible data billions of times a second and see how your app behaves. Every fuzz test I've ever written has exposed a boneheaded subtle mistake in the code, even in code with 100% test coverage.
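For the fuzzing point, the shape of a minimal harness is roughly this (a real setup would use a coverage-guided fuzzer; the toy parser here is invented for illustration):

```python
import random

def parse_record(data: bytes) -> dict:
    """Toy parser under test: expects 'key=value' pairs separated by ';'."""
    record = {}
    for pair in data.decode("utf-8", errors="strict").split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        record[key] = value
    return record

def fuzz(iterations: int = 100_000) -> None:
    rng = random.Random(0)  # seeded so failures are reproducible
    for _ in range(iterations):
        blob = bytes(rng.randrange(256) for _ in range(rng.randrange(64)))
        try:
            parse_record(blob)
        except UnicodeDecodeError:
            pass  # expected for random bytes; anything else is a real bug
        except Exception as e:
            raise AssertionError(f"parser blew up on input {blob!r}") from e

if __name__ == "__main__":
    fuzz()
    print("no unexpected exceptions")
```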