In order to not reinvent the wheel we use external services where possible: Auth0 for authentication, Stripe for handling billing, etc. For this question, I am considering this a 'microservices' architecture. I am aware that this definition will spark its own discussion, but I believe that the problem generalises to a lot of the other (better, more complete, etc.) microservice definitions. So please, bear with me.
Now, in the lifecycle of a customer (createAccount, addBilling, deleteAccount, ...) at various points we expect operations to occur atomically. By which I mean (simplified) that upon creating a new account, I also need to ensure a customer is created in Stripe as well as register the user in Auth0 - but if either of these subtasks fails, the operation (createAccount) should fail completely and in fact 'undo' any state changes already performed. If not, I risk high-impact bugs such as double charging customers.
Now, in a 'conventional' setup (without external services), I would resolve a lot of this by ensuring transactional operations on a single source-of-truth database. I understand that 'idempotency' comes up a lot here, but whichever way I try to apply that here - it always seems to explode in a (fragile/brittle) spaghetti of calls, error handling and subsequent calls.
Surely this has been resolved by now, but I'm having a hard time finding any good resources on how to approach this in an elegant manner. Concretely:
Do you recognise the problem of atomicity in microservices architecture?
And,
How do you deal with guaranteeing this atomicity in your microservices?
In fact, since the services you're describing don't know about each other, distributed transactions aren't an option.
I think the only solution to this problem is idempotency. Idempotency is the Swiss Army knife of distributed systems: you can tackle 90% of use cases by combining retries and idempotency and never have to worry about ACID. Yes, it adds complexity. No, you don't have a choice.
I'm also not sure why this requires a lot of complexity. Can you explain how you're implementing idempotency? The simplest approach is to initialize an idempotency key on the browser side and thread it across your call graph. Stripe has built-in support for idempotency keys, so in that case no additional logic is required. For providers without idempotency support, you'll need a way to track idempotency keys atomically, but this is usually trivial to implement. When a particular provider fails, you can ask users to retry.
* If you need a particular operation to succeed only if another succeeds (creating a stripe charge, for example), make sure that it runs after its dependencies.
* If you don't like the idea of forcing users to retry, you can ensure "eventual consistency" using a durable queue + idempotency.
I'm not a fan of HN comments that trivialize problems, but if you have to build complex distributed systems machinery to solve the problem you're describing, I feel strongly that something's going wrong.
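A minimal sketch of tracking idempotency keys atomically for a provider that has no built-in support. Everything here is an assumption for illustration: the `idempotency_keys` table, the `run_once` helper, and the use of SQLite as the store. The primary-key constraint is what makes the claim atomic, so only one caller per key gets to run the operation.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open a (hypothetical) idempotency-key store backed by SQLite."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS idempotency_keys ("
        " key TEXT PRIMARY KEY,"
        " status TEXT NOT NULL,"
        " result TEXT)"
    )
    return conn


def run_once(conn, key, operation):
    """Run operation() at most once per key and replay its result on retries."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO idempotency_keys (key, status) VALUES (?, 'in_progress')",
                (key,),
            )
    except sqlite3.IntegrityError:
        # Someone already claimed this key: replay the result if it is finished.
        status, result = conn.execute(
            "SELECT status, result FROM idempotency_keys WHERE key = ?", (key,)
        ).fetchone()
        if status == "done":
            return result
        raise RuntimeError("operation still in progress; retry later")

    try:
        result = operation()
    except Exception:
        # Release the claim so that a retry with the same key can run again.
        with conn:
            conn.execute("DELETE FROM idempotency_keys WHERE key = ?", (key,))
        raise

    with conn:
        conn.execute(
            "UPDATE idempotency_keys SET status = 'done', result = ? WHERE key = ?",
            (result, key),
        )
    return result
```

A caller would wrap each provider call, for example `run_once(conn, key, create_auth_user)`, where `create_auth_user` is a hypothetical function returning the provider's user id as a string. A crash between the claim and the final update leaves the key stuck as `in_progress`, so a real version would also need a timeout for such rows.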
For this particular example, createAccount would create a local commit with a state of CREATING, return an account_id, and asynchronously create jobs to complete the billing, the auth, whatever. You then have a job polling in the background that moves the account to CREATED once all, or enough, of the dependencies have succeeded (i.e. you may have a slow third-party provider you don't want to block on). Your front end polls the state of the account and displays a pretty animation to distract the user while you do the work.
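The flow above could be sketched roughly like this. It is an in-memory illustration only: the `accounts` dict and `jobs` list stand in for a real database and job queue, and every function name apart from the CREATING/CREATED states is an assumption.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Account:
    account_id: str
    state: str = "CREATING"
    # Dependencies that still have to succeed (hypothetical names).
    pending: set = field(default_factory=lambda: {"billing", "auth"})


accounts = {}  # stand-in for the local database
jobs = []      # stand-in for an asynchronous job queue


def create_account():
    """Commit the account locally as CREATING and enqueue the follow-up jobs."""
    account = Account(account_id=str(uuid.uuid4()))
    accounts[account.account_id] = account
    for dependency in sorted(account.pending):
        jobs.append((account.account_id, dependency))
    return account.account_id


def complete_job(account_id, dependency):
    """Called by a worker once a dependency (e.g. billing) has succeeded."""
    accounts[account_id].pending.discard(dependency)


def promote_ready_accounts():
    """Background poller: move accounts to CREATED once nothing is pending."""
    for account in accounts.values():
        if account.state == "CREATING" and not account.pending:
            account.state = "CREATED"


def get_state(account_id):
    """What the front end polls while it shows the animation."""
    return accounts[account_id].state
```

To allow "enough" rather than "all" dependencies, `promote_ready_accounts` could instead check only a required subset of `pending`.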
[0] https://www.enterpriseintegrationpatterns.com/ramblings/18_s...
You can have a component that coordinates workflows (can be a different component for each workflow, or can be one central component, depending on your need to scale vs. simplicity).
In the OpenStack project, they developed TaskFlow[0] for that. It allows you to build graphs of tasks, handles parallelization, persistence of meta information (if desired) and, important for your use case, rollback functions in case of errors.
1. Use a persistent queue where you store what needs to be done. This lets you ensure that an operation will eventually succeed by leveraging the powers of the queue. You'll need to keep track of the stage at which each job failed so that you don't repeat operations that already succeeded.
2. Rely on the clients retrying and use idempotency. There is a nice and very detailed blog[0] about it. This works very well with your own microservices, but most external services won't have idempotency support, so you'll need to plan how to emulate it for those; sometimes the queue helps.
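As a rough illustration of option 1, here is a sketch of a persistent queue that records how far each job got, so a retry resumes at the failed stage instead of repeating earlier ones. The `jobs` table, the stage names, and the `handlers` mapping are all hypothetical, and SQLite is only a stand-in for whatever durable queue you actually use.

```python
import sqlite3

# Hypothetical stage names, executed in this order for every job.
STAGES = ["create_auth_user", "create_billing_customer"]


def open_queue(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        " id INTEGER PRIMARY KEY,"
        " payload TEXT NOT NULL,"
        " next_stage INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def enqueue(conn, payload):
    with conn:
        cursor = conn.execute("INSERT INTO jobs (payload) VALUES (?)", (payload,))
    return cursor.lastrowid


def work(conn, handlers):
    """Advance every unfinished job as far as possible.

    Returns the number of jobs that finished during this pass.
    """
    finished = 0
    rows = conn.execute(
        "SELECT id, payload, next_stage FROM jobs WHERE next_stage < ?",
        (len(STAGES),),
    ).fetchall()
    for job_id, payload, next_stage in rows:
        try:
            for index in range(next_stage, len(STAGES)):
                handlers[STAGES[index]](payload)
                # Persist progress so this stage is not repeated on a retry.
                with conn:
                    conn.execute(
                        "UPDATE jobs SET next_stage = ? WHERE id = ?",
                        (index + 1, job_id),
                    )
        except Exception:
            continue  # job stays queued; a later pass resumes at the failed stage
        finished += 1
    return finished
```

A crash between a handler succeeding and its progress being saved can still repeat that one stage, which is why the handlers themselves should be idempotent where the provider allows it.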
You have better things to be doing in a startup vs distributed transactions (and honestly, microservices...)
An obvious approach is 2-phase commit or N-phase commit. 2PC and NPC are generally not recommended because they sacrifice too much performance.
* Pros: It is more like a transaction so that you don't have to write rollback logic.
* Cons: It seriously sacrifices system availability since the distributed transaction happens across networks.
A commonly recommended approach is Saga, it's basically compensation.
* Pros: Does not sacrifice system availability since it's an asynchronous/optimal design.
* Cons: Have to write a lot of compensation logic and call them correctly.
* Cons: There are a lot of operations that cannot be compensated. These operations should be arranged in the last steps.
* Cons: What happens if compensation fails?
* Cons: It's eventual consistency, it's not strong atomicity.
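To make the Saga trade-offs above concrete, here is a minimal sketch of a compensation runner. The `run_saga` helper and its `(action, compensation)` step format are assumptions for illustration, not an established API. A step that cannot be compensated is given `None` and should be placed last, and failed compensations are returned rather than hidden, since someone has to deal with them.

```python
def run_saga(steps):
    """Run steps in order; on failure, compensate completed steps in reverse.

    steps: list of (action, compensation) pairs. compensation may be None for
    a step that cannot be undone, which is why such steps belong at the end.
    Returns (succeeded, failed_compensations).
    """
    completed = []
    try:
        for action, compensation in steps:
            action()
            completed.append(compensation)
    except Exception:
        failed_compensations = []
        for compensation in reversed(completed):
            if compensation is None:
                continue  # nothing we can undo for this step
            try:
                compensation()
            except Exception as error:
                # Compensation itself failed: keep going, but report it so it
                # can be retried or handled manually.
                failed_compensations.append(error)
        return False, failed_compensations
    return True, []
```

A caller would treat a non-empty `failed_compensations` list as work for a retry queue or for support, which is the "what happens if compensation fails?" problem in practice.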
I have to admit, I don't try to tackle this problem unless it's really important for the business in a microservices environment.
If your business requires atomicity in most of the places, it's highly recommended to have a carefully designed monolith so you can easily benefit from RDBMS, with your domains modeled in different modules instead of services.
That said, your ideas on distributed systems are wrong. Any point in a workflow can and will fail. You need to account for all of this. You can successfully create an account on Stripe, and then when you bill that account, it could return an error. Or even worse, it can time out, meaning you don't know whether or not a user was charged.
You have to take into consideration all of these failure situations. There is no atomicity in the way that you expect. Whenever things deviate off the happy path, you fail quickly and decisively so that everyone knows where they stand. That gives people the option to retry or call support.
I would look at keeping track of what substeps have completed, and only treat the user as created if they're all done.
You could roll back the successful ones if not everything worked, but you could also just let them be.
A year or so ago I needed to write a service to rotate accounts on network equipment every day or so. The control channel and the network hardware tended to be flaky, so I designed a state machine to make as much progress as possible, and be able to pick up where it left off. Each newly-created account was a row in a transactional database, and the successful completion of each operation was noted by changing a column from NULL to the current time. The flow was: if a new account is needed, generate the credential (generated_at). Find the old credentials and log into the device. Add the new account (applied_at). Try logging in with the new account (verified_at). Wait until account expiration (2 days later). Delete the account (deleted_at). Verify that we can't log in with the old account (delete_verified_at).
From this data, we could see what state every account was in, and query for a list of operations that needed to be performed. (Or if nothing needed to be done, how long to sleep for so that we would re-run the loop at the exact instant when work needed to be done, not that it was critical.)
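A small sketch of how the next operation can be derived purely from those timestamp columns. The column names come from the description above; the `next_operation` function, the operation labels, and the choice to measure the 2-day expiration from `verified_at` are assumptions for illustration.

```python
from datetime import datetime, timedelta

# "2 days later" in the description; measuring it from verified_at is an assumption.
LIFETIME = timedelta(days=2)


def next_operation(row, now):
    """Return the next step for one account row (a dict of nullable timestamps)."""
    if row.get("generated_at") is None:
        return "generate"
    if row.get("applied_at") is None:
        return "apply"
    if row.get("verified_at") is None:
        return "verify"
    if row.get("deleted_at") is None:
        if now < row["verified_at"] + LIFETIME:
            return "wait"
        return "delete"
    if row.get("delete_verified_at") is None:
        return "verify_delete"
    return "done"
```

Because each step only fills in its own column after it succeeds, re-running the loop after a crash simply picks up at the first NULL column.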
I believe that your account creation and account deletion should follow a similar workflow. Accounts that fail to be created after retrying for X length of time should just move into the deletion workflow.
The user creation service and the day-to-day user operation service should probably be the same thing. The "user" row in your database should be what the state machine uses to figure out what operations need to be performed.
A user record would probably look something like:
username, password, email, ... =
created_at =
stripe_account_id =
stripe_account_verified_at =
Now you can do queries to figure out what work needs to be done. "select username from users where delete_requested_at is null and stripe_account_verified_at is null" -> create stripe accounts for those users. "select username from users where delete_requested_at is not null and auth0_account_delete_verified_at is null" -> delete auth0 accounts for those users.
The last bit of complexity is to prevent two copies of your user service from running the query at the same instant and both deciding to create a stripe account. I would personally just run one replica of that job. This makes deployments slightly more irritating, since no user creation can occur during a rollout (when the number of replicas is 0), but it sure is simple. I worked on a user-facing feature at Google that did that, and although I hated it, it worked fine.
It is also possible to add a column to lock the row: check that stripe_account_creation_started_at is null; change it to the current time; commit. Only one replica will be able to successfully commit that and progress to the next step, but then you need a way of finding deadlocks (the app could crash right after the commit) and breaking them.
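A hedged sketch of the row-lock variant, using SQLite as a stand-in database. The column names come from the comment above; `make_db`, `users_needing_stripe`, `claim_stripe_creation`, `release_stale_claims`, and the use of numeric epoch timestamps are assumptions for illustration. The conditional UPDATE is what guarantees that only one replica wins the claim.

```python
import sqlite3


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users ("
        " username TEXT PRIMARY KEY,"
        " delete_requested_at REAL,"
        " stripe_account_id TEXT,"
        " stripe_account_verified_at REAL,"
        " stripe_account_creation_started_at REAL)"
    )
    return conn


def users_needing_stripe(conn):
    """The 'what work needs to be done' query for Stripe account creation."""
    rows = conn.execute(
        "SELECT username FROM users"
        " WHERE delete_requested_at IS NULL"
        " AND stripe_account_verified_at IS NULL"
        " ORDER BY username"
    ).fetchall()
    return [username for (username,) in rows]


def claim_stripe_creation(conn, username, now):
    """Try to lock the row; True only for the single caller that wins."""
    with conn:
        cursor = conn.execute(
            "UPDATE users SET stripe_account_creation_started_at = ?"
            " WHERE username = ?"
            " AND stripe_account_creation_started_at IS NULL",
            (now, username),
        )
    return cursor.rowcount == 1


def release_stale_claims(conn, now, timeout_seconds):
    """Break locks left behind by a crashed worker; returns how many were released."""
    with conn:
        cursor = conn.execute(
            "UPDATE users SET stripe_account_creation_started_at = NULL"
            " WHERE stripe_account_verified_at IS NULL"
            " AND stripe_account_creation_started_at IS NOT NULL"
            " AND stripe_account_creation_started_at < ?",
            (now - timeout_seconds,),
        )
    return cursor.rowcount
```

Releasing a stale claim means the creation step may run a second time, so the call to Stripe should still carry an idempotency key to avoid creating a duplicate account.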
It is a little complicated but I personally would rather have exact accounting of what is happening and why, rather than guessing when some user logs in and doesn't have a stripe account.
Edit to add: one last bit... I like running this all in a background loop that wakes up periodically to see what needs to be done, but your createAccount RPC should also be able to immediately wake this loop up. That way if everything goes well, you don't add any latency by introducing this state machine / workflow manager. If something happens like Stripe being down... progress will resume when the loop wakes up again. For that reason, I think you should be explicit and provide the end user with a system that lets them request an account and lets them check the status of the request. (Maybe not the user directly, but your signup UI will be consuming this stuff and providing them with appropriate feedback. You don't want the createAccount RPC to just hang while you're talking to Stripe and Auth0 in the background. Probably. The happy case might take a second, but the worst case could take much longer. Design your UI to be good for the worst case, and it will be good for the happy case too.)
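One way to sketch a background loop that wakes up periodically but can also be woken immediately by the createAccount RPC. The `Reconciler` class and the `do_pending_work` callback are hypothetical names; the callback stands for "query the database and perform whatever operations are due".

```python
import threading


class Reconciler:
    """Runs do_pending_work() periodically, or right away when wake() is called."""

    def __init__(self, do_pending_work, interval_seconds=30.0):
        self._do_pending_work = do_pending_work
        self._interval = interval_seconds
        self._wake = threading.Event()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def wake(self):
        """Call this from the createAccount handler after committing the new row."""
        self._wake.set()

    def stop(self):
        self._stop.set()
        self._wake.set()  # interrupt the sleep so the thread exits promptly
        self._thread.join()

    def _run(self):
        while not self._stop.is_set():
            self._do_pending_work()
            # Sleep until the interval elapses or someone calls wake().
            self._wake.wait(timeout=self._interval)
            self._wake.clear()
```

In the happy case the RPC commits the row, calls `wake()`, and returns a request id that the signup UI can poll; if Stripe is down, the same loop simply retries on its next periodic pass.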