HACKER Q&A
📣 c89X

How do you deal with atomicity in microservice environments?


We run a small SaaS where users are able to create accounts, submit billing information and upload/call ML artefacts.

In order to not reinvent the wheel we use external services where possible: Auth0 for authentication, Stripe for handling billing, etc. For this question, I am considering this a 'microservices' architecture. I am aware that this definition will spark its own discussion, but I believe that the problem generalises to a lot of the other (better, more complete, etc.) microservice definitions. So please, bear with me.

Now, in the lifecycle of a customer (createAccount, addBilling, deleteAccount, ...), at various points we expect operations to occur atomically. By which I mean (simplified) that upon creating a new account, I also need to ensure a customer is created in Stripe and the user is registered in Auth0 - but if either of these subtasks fails, the operation (createAccount) should fail completely and in fact 'undo' any state changes already performed. If not, I risk high-impact bugs such as double-charging customers.

Now, in a 'conventional' setup (without external services), I would resolve a lot of this by ensuring transactional operations on a single source-of-truth database. I understand that 'idempotency' comes up a lot here, but however I try to apply it, it always seems to explode into a fragile, brittle spaghetti of calls, error handling and follow-up calls.

Surely this has been resolved by now, but I'm having a hard time finding any good resources on how to approach this in an elegant manner. Concretely:

Do you recognise the problem of atomicity in microservices architecture?

And,

How do you deal with guaranteeing this atomicity in your microservices?


  👤 mooted1 Accepted Answer ✓
Having built similar applications in microservice environments, I think there are usually simpler answers than distributed transactions. And if you do need distributed transactions, this is often a sign that your service boundaries are too granular.

In fact, since the services you're describing don't know about each other, distributed transactions aren't an option.

I think the only solution to this problem is idempotency. Idempotency is a distributed systems Swiss Army knife - you can tackle 90% of use cases by combining retries and idempotency and never have to worry about ACID. Yes, it adds complexity. No, you don't have a choice.

I'm also not sure why this requires a lot of complexity. Can you explain how you're implementing idempotency? The simplest approach is to initialize an idempotency key on the browser side and thread it across your call graph. Stripe has built-in support for idempotency keys, so in that case no additional logic is required. For providers without idempotency support, you'll need a way to track idempotency keys atomically, but this is usually trivial to implement. When a particular provider fails, you can ask users to retry.
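For example, with the Stripe Python client, the key rides along as a parameter, and a retried call with the same key returns the original result instead of creating a duplicate. A minimal sketch (the key generation and the helper are my own; the idempotency_key parameter is real Stripe API):

  import stripe

  stripe.api_key = "sk_test_..."  # assumed test key

  def create_customer(email, idempotency_key):
      # Retrying with the same key cannot create a second customer,
      # which is what prevents the double-charging class of bugs.
      return stripe.Customer.create(
          email=email,
          idempotency_key=idempotency_key,  # generated browser-side, threaded through
      )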

* If you need a particular operation to succeed only if another succeeds (creating a Stripe charge, for example), make sure that it runs after its dependencies.

* If you don't like the idea of forcing users to retry, you can ensure "eventual consistency" using a durable queue + idempotency.

I'm not a fan of HN comments that trivialize problems, but if you have to build complex distributed systems machinery to solve the problem you're describing, I feel strongly that something's going wrong.


👤 siliconc0w
You often don't need or want cross-service atomicity, just eventual consistency. Each microservice should be an idempotent state machine: it expects atomic commits to its own state but never expects to conduct a transaction across the service boundary. However, services can conduct local transactions on their own state joined against a read-only cached copy of another service's state - you can implement this with an event bus or a shared caching tier. This lets you avoid writing your own joining logic and use standard ORMs. Ensuring the queues are flowing and retries happen is very important, though: you need to monitor queue lengths and job errors to make sure the "eventual" part of eventual consistency is actually happening.

For this particular example, createAccount would make a local commit with a state of CREATING, return an account_id, and asynchronously create jobs to complete the billing, the auth, whatever. You then have a job polling in the background to move the account to CREATED once all (or enough) of the dependencies have succeeded (i.e. you may have a slow third-party provider you don't want to block on). Your front end polls the state of the account and displays a pretty animation to distract the user while you do the work.
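A minimal sketch of that flow, assuming a DB-API style handle and a hypothetical queue object (the job names and dependencies_done are placeholders, not real APIs):

  import uuid

  def dependencies_done(db, account_id):
      ...  # hypothetical: check that the Stripe/Auth0 jobs reported success

  def create_account(db, queue, email):
      # Local commit first: the account exists immediately, in state CREATING.
      account_id = str(uuid.uuid4())
      db.execute(
          "INSERT INTO accounts (id, email, state) VALUES (?, ?, 'CREATING')",
          (account_id, email),
      )
      db.commit()
      # Fire off jobs; each must be idempotent so it can be retried.
      queue.enqueue("create_stripe_customer", account_id)
      queue.enqueue("create_auth0_user", account_id)
      return account_id  # the front end polls this account's state

  def promote_completed_accounts(db):
      # Background poller: flip CREATING -> CREATED once dependencies succeed.
      for (account_id,) in db.execute(
          "SELECT id FROM accounts WHERE state = 'CREATING'"
      ).fetchall():
          if dependencies_done(db, account_id):
              db.execute(
                  "UPDATE accounts SET state = 'CREATED' WHERE id = ?",
                  (account_id,),
              )
      db.commit()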


👤 jjanyan
"Sagas", or distributed transactions, are what you're looking for. These are APIs/functions/methods that know how to complete every step of your atomic operation and how to roll it back if any step fails. They more or less recreate what would have been a single database transaction pre-microservice.

👤 mh8h
The real business world is not atomic. The idea is to keep track of what you've done and perform compensating actions to roll back the effects if something goes wrong. See the Saga pattern for an example. I also found Gregor Hohpe's blog post [0], titled "Starbucks Does Not Use Two-Phase Commit", very informative.

[0] https://www.enterpriseintegrationpatterns.com/ramblings/18_s...


👤 perlgeek
Not everything that uses lots of services needs to be a classical "microservice" architecture, where every service can call every other service.

You can have a component that coordinates workflows (can be a different component for each workflow, or can be one central component, depending on your need to scale vs. simplicity).

In the OpenStack project, they developed TaskFlow [0] for that. It lets you build graphs of tasks and handles parallelization, persistence of metadata (if desired) and, importantly for your use case, rollback functions in case of errors.

[0]: https://docs.openstack.org/taskflow/latest/
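From memory, the basic shape is roughly the sketch below: tasks declare an execute and a revert, and the engine invokes the reverts automatically when a later task raises (the task bodies are hypothetical; check the linked docs for exact signatures).

  import taskflow.engines
  from taskflow import task
  from taskflow.patterns import linear_flow

  class CreateStripeCustomer(task.Task):
      def execute(self, email):
          print("creating Stripe customer for", email)

      def revert(self, email, **kwargs):
          # Called automatically if a later task in the flow fails.
          print("rolling back Stripe customer for", email)

  class CreateAuth0User(task.Task):
      def execute(self, email):
          raise RuntimeError("Auth0 is down")  # forces revert of earlier tasks

  flow = linear_flow.Flow("create-account").add(
      CreateStripeCustomer(),
      CreateAuth0User(),
  )
  taskflow.engines.run(flow, store={"email": "user@example.com"})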


👤 phyrex
"Designing Data-Intensive Applications" by Martin Kleppmann touches on this subject: http://dataintensive.net

👤 AlexITC
I know two not-so-complex ways of dealing with this:

1. Use a persistent queue where you store what needs to be done. This allows you to ensure that an operation will eventually succeed by leveraging the powers of the queue; you'll need to keep track of the stages where jobs failed in order to not repeat operations.

2. Rely on the clients retrying, and use idempotency - there is a nice and very detailed blog post [0] about it. This works very well within your own microservices, but most external services won't have idempotency support, so you'll need to plan how to emulate it for those; sometimes the queue helps.

[0]: https://brandur.org/idempotency-keys
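A minimal sketch of option 1, assuming a relational job store (the schema and step functions are placeholders): each completed stage is recorded, so a retried job skips the work it already finished.

  def create_stripe_customer(job_id): ...  # hypothetical step
  def create_auth0_user(job_id): ...       # hypothetical step
  def activate_account(job_id): ...        # hypothetical step

  STAGES = [
      ("stripe_created", create_stripe_customer),
      ("auth0_created", create_auth0_user),
      ("activated", activate_account),
  ]

  def run_job(db, job_id):
      # Stages completed on a previous attempt are skipped on retry.
      done = {row[0] for row in db.execute(
          "SELECT stage FROM job_stages WHERE job_id = ?", (job_id,)
      ).fetchall()}
      for stage_name, step in STAGES:
          if stage_name in done:
              continue
          # A crash between step() and the INSERT means the step re-runs,
          # so each step must itself be idempotent or verifiable.
          step(job_id)
          db.execute(
              "INSERT INTO job_stages (job_id, stage) VALUES (?, ?)",
              (job_id, stage_name),
          )
          db.commit()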


👤 gregoryl
Just don't bother. At the scale you're working at, if something tries to run and, e.g., the user doesn't exist, just throw an exception and email the team for a manual fix. If (if!) it becomes too common, throw in a retry or two.

You have better things to be doing in a startup than distributed transactions (and honestly, microservices...)


👤 namelosw
This is a hard problem to tackle.

An obvious approach is 2-phase commit (2PC) or, more generally, N-phase commit. These are generally not recommended because they sacrifice too much performance.

* Pros: It is more like a transaction so that you don't have to write rollback logic.

* Cons: It seriously sacrifices system availability since the distributed transaction happens across networks.

A commonly recommended approach is the Saga pattern, which is basically compensation.

* Pros: Does not sacrifice system availability, since it's an asynchronous/optimistic design.

* Cons: You have to write a lot of compensation logic and invoke it correctly.

* Cons: Some operations cannot be compensated at all; those should be arranged as the last steps.

* Cons: What happens if compensation fails?

* Cons: It's eventual consistency, not strong atomicity.

I have to admit, I don't try to tackle this problem in a microservices environment unless it's really important for the business.

If your business requires atomicity in most places, it's highly recommended to build a carefully designed monolith so you can easily benefit from an RDBMS, with your domains modeled as separate modules instead of services.


👤 jbob2000
Microservices are good for large, complex systems with many developers making many changes. It sounds like your system is rather simple; I don't think you need to pursue this architecture, especially just starting out - you'll need to write too much infrastructure and you probably don't need the scalability.

👤 moondev
This may not be the solution you are seeking, but I would do this with an asynchronous job model. Kick off a job where each stage depends on the previous one. A failure of any stage kicks off a cleanup job. The account creation service would be responsible for creating the job and monitoring its status. You could run jobs with whatever you see fit - a Kafka consumer would be an interesting option. For a quick POC you could even do it in Jenkins via the API.

👤 steven2012
Calling external services "microservices" is fundamentally wrong. They are "services". All that will do is create a lot of confusion, because you are utterly misusing terminology. You can't just decide to misuse terminology; that's like calling "red" "blue". You don't gain information by misusing well-defined terms - you create confusion and misinformation.

That said, your ideas on distributed systems are wrong. Any point in a workflow can and will fail, and you need to account for all of it. You can successfully create an account on Stripe, and then when you bill that account, it could return an error. Or, even worse, it can time out, meaning you don't know whether or not the user was charged.

You have to take all of these failure situations into consideration. There is no atomicity in the way you expect. Whenever things deviate from the happy path, fail quickly and decisively so that everyone knows where they stand. That gives people the option to retry or call support.


👤 quadrature
Hate to shill, but this is a problem we faced where I work, and we wrote a great article about our solution: https://engineering.shopify.com/blogs/engineering/building-r...

👤 BurningFrog
Simplest Thing That Could Possibly Work:

I would look at keeping track of what substeps have completed, and only treat the user as created if they're all done.

You could roll back the successful ones if not everything worked, but you could also just let them be.


👤 myvoiceismypass
Google "sagas" - it's a pattern that helps with this: rollbacks across boundaries, etc.

👤 z3t4
There is no correct answer that applies to all architectures. But generally speaking, you have a core that uses the services as black boxes through their APIs. The core/caller needs to handle the errors.

👤 quintes
2-phase commit, and transaction commit/rollback. Think about boundaries and domains.

👤 jrockway
There is no one-size-fits-all approach to atomicity in microservices. Microservices typically want to store as little state as possible and push the coordination up a level; i.e., the calling service sends all the relevant information to the service that is going to handle this tiny piece of the work. Eventually there is going to be one service that owns some sort of workflow and pushes the rest of the world in the direction of its desired outcome.

A year or so ago I needed to write a service to rotate accounts on network equipment every day or so. The control channel and the network hardware tended to be flaky, so I designed a state machine that could make as much progress as possible and pick up where it left off. Each newly-created account was a row in a transactional database, and the successful completion of each operation was noted by changing a column from NULL to the current time. The flow was: if a new account is needed, generate the credential (generated_at). Find the old credentials and log into the device. Add the new account (applied_at). Try logging in with the new account (verified_at). Wait until account expiration (2 days later). Delete the account (deleted_at). Verify that we can't log in with the old account (delete_verified_at).

From this data, we could see what state every account was in, and query for a list of operations that needed to be performed. (Or if nothing needed to be done, how long to sleep for so that we would re-run the loop at the exact instant when work needed to be done, not that it was critical.)

I believe that your account creation and account deletion should follow a similar workflow. Accounts that fail to be created after retrying for X length of time should just move into the deletion workflow.

The user creation service and the day-to-day user operation service should probably be the same thing. The "user" row in your database should be what the state machine uses to figure out what operations need to be performed.

A user record would probably look something like:

  username, password, email, ... = 
  created_at = 
  stripe_account_id = 
  stripe_account_verified_at = 
Now you can do queries to figure out what work needs to be done. "select username from users where delete_requested_at is null and stripe_account_verified_at is null" -> create stripe accounts for those users. "select username from users where delete_requested_at is not null and auth0_account_delete_verified_at is null" -> delete auth0 accounts for those users.

The last bit of complexity is preventing two copies of your user service from running the query at the exact same instant and deciding to create a Stripe account twice. I would personally just run one replica of that job. This makes deployments slightly more irritating, since no user creation can occur during a rollout (when the number of replicas is 0), but it sure is simple. I worked on a user-facing feature at Google that did that, and although I hated it, it worked fine. It is also possible to add a column to lock the row: check that stripe_account_creation_started_at is null, change it to the current time, and commit. Only one replica will be able to successfully commit that and progress to the next step, but then you need a way of finding deadlocks (the app could crash right after the commit) and breaking them.
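A sketch of that row-lock step, assuming a DB-API style connection and the column names above (create_stripe_account is a placeholder):

  def claim_stripe_creation(db, username):
      cur = db.execute(
          "UPDATE users"
          " SET stripe_account_creation_started_at = CURRENT_TIMESTAMP"
          " WHERE username = ?"
          " AND stripe_account_creation_started_at IS NULL",
          (username,),
      )
      db.commit()
      if cur.rowcount == 1:
          # This replica won the row and is the only one allowed to call Stripe.
          create_stripe_account(username)  # hypothetical step
      # rowcount == 0 means another replica claimed it first - or a crashed
      # worker left a stale lock, which is the deadlock-breaking problem above.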

It is a little complicated, but I personally would rather have exact accounting of what is happening and why than be left guessing when some user logs in and doesn't have a Stripe account.

Edit to add: one last bit... I like running this all in a background loop that wakes up periodically to see what needs to be done, but your createAccount RPC should also be able to immediately wake this loop up. That way if everything goes well, you don't add any latency by introducing this state machine / workflow manager. If something happens like Stripe being down... progress will resume when the loop wakes up again. For that reason, I think you should be explicit and provide the end user with a system that lets them request an account and lets them check the status of the request. (Maybe not the user directly, but your signup UI will be consuming this stuff and providing them with appropriate feedback. You don't want the createAccount RPC to just hang while you're talking to Stripe and Auth0 in the background. Probably. The happy case might take a second, but the worst case could take much longer. Design your UI to be good for the worst case, and it will be good for the happy case too.)


👤 metapsj
Event sourcing and the notion of compensating transactions go a long way toward solving these types of problems.