What are best patterns for events in distributed transactions?

Question

I am amid researching events/messaging patterns to use in a microservice architecture with distributed transactions. The goal is to reliably process events published by multiple services - potentially with different data stores.We've already decided to use Sagas to orchestrate the distributed transactions, and now I want to find an event framework that complements Sagas well.For context, we are building an open-source commerce engine striving to be highly extensible. We provide developers with the primitives to create sophisticated commerce experiences and allow them to replace the different domains (PIM, Inventory, etc.) with their own microservices. Hence, maintaining a great developer experience is paramount to us in technical decision-making.It seems the industry standard is to use a Transactional Outbox Pattern (unless you go all in on event sourcing), but I believe this increases the complexity for external developers building microservices as they would need to understand the outbox pattern and expose one themselves to ensure the events system of our engine is fully functional.I've been working on a solution that uses a shared cache across microservices to store events in an ongoing distributed transaction - effectively a shared outbox. The transaction orchestrator of the Saga would be responsible for processing the cached events upon successfully committing - or invalidate in case an error is thrown. This solution reduces complexity and boilerplate code for developers. Has anyone heard of a similar pattern? I haven't been able to find anything.I am curious to learn how others have solved the problem of events in distributed transactions.

atombender · Accepted Answer

I'm not sure what your application is, but Temporal [1] could be a good fit. The nice thing is that it can cover all three aspects in one conceptual model: Sagas, distributed transactions, and events.
With Temporal, you write your various complex, potentially long-running actions as normal, synchronous code, which Temporal "magically" turns into distributed, asynchronous, automatically retried, idempotent jobs. The design is really elegant, and removes all of the hard work from writing code that must keep things in sync and handle failures through queueing and retrying. In one sense, Temporal is a way to develop sagas — but also anything else.
Part of the beauty is that you can use it to build the transactional outbox pattern in a much simpler way. You simply emit the event at the end of your workflow. There's no need to make sure the "outbox" is maintained in a database transaction, because Temporal itself is transactional and ensures that the outbox event happens, no matter what failures occur.
Whether that gels with your needs to cater to external developers, I don't know. Are those developers actually working in your codebase, or are you just providing an API?
Either way, you may want to look at Temporal for inspiration, if nothing else. It solves some complex problems in an extremely elegant way. I consider it a must-have in any modern application stack.
[1] https://temporal.io/

tonyhb · Answer

We've built something similar at https://www.inngest.com — we allow you to build durable serverless functions that react to events (or run on a schedule).
The good thing about us is that you don't need to learn about events, subscribers, backoffs, idempotency etc. Instead, you write a single line and can deploy reliable event-driven functions to any provider, even at the edge. Most people are afraid of tackling event-driven architectures because it's too hard to set up, debug, build, make reliable, deploy - all the standard stuff that's frustrating.
When I say durable, I mean:
- Functions get their own state, retries, idempotency, throttling, etc.
- You can create steps within functions which have their own retries, similar to Temporal
- You can eg `sleep`, `sleepUntil` within a serverless function to handle queues or delayed jobs
- You can `waitForEvent` within functions to wait for other events which match arbitrary expressions. This lets you do magic things like "when a user adds something to their cart, wait for the checkout event from this user for 1 day. If the event is null/not received, send a reminder email". This is hard to do out of the box, but with us it's built in and takes a single line of code.
- We also store each event so you can do things like local replay, event versioning, schema management & governance, etc.
When you mention "distributed transactions", there's many ways to go. For us, we're working on CDC so that you can atomically react to actual DB transactions. For everything else, durability and retries becomes important.
There's a lot to talk about here. I'd love to discuss more in depth - shoot me an email if it's of interest! :)

survirtual · Answer

Have you looked into CRDTs (Conflict Free Replicated Data Types)? Here is a paper on them: https://arxiv.org/pdf/1805.06358.pdf
With CRDTs, it becomes much easier to perform concurrent updates in a distributed system and decouples each replica from each other, since state can always merge to reach the same state.
This Rust crate has good documentation with more explanation (even if you don’t use Rust): https://crates.io/crates/crdts
Take a look, should be clear how they fit in to a distributed event based system w microservices.

nivertech · Answer

I recommend this course:Distributed data patterns in a Microservice architecturehttp://chrisrichardson.net/virtual-bootcamp-distributed-data...

nitwit005 · Answer

The boring question people tend to overlook is "How do I do database backup and restore?".
If you did everything with distributed transactions, and had a single data store that supported it, you could have all backups be at the same transaction checkpoint. When you restored, everything would be in a consistent state.
If everyone is managing their own data in different data stores, and just listening to Kafka or similar, you have to deal with the inconsistent state after a DB restore somehow.

mountainreason · Answer

I dont have an answer, I'm just curious about something. I thought that the point of an outbox was that it was local, and could therefore be updated atomically along with any local db changes. What would happen if the orchestrator was unable to update the remote shared cache outbox after a local exception/rolled back transaction?

stevesimmons · Answer

Do you have any references to learn more? I've looked at [1][1] https://learn.microsoft.com/en-us/azure/architecture/best-pr...

olivermrbl · Answer

On the alternative solution, it's worth stressing, that it does not have to be a cache service. Any key/value store would do.

What are best patterns for events in distributed transactions?

I recommend this course:
Distributed data patterns in a Microservice architecture
http://chrisrichardson.net/virtual-bootcamp-distributed-data...

Do you have any references to learn more? I've looked at [1]
[1] https://learn.microsoft.com/en-us/azure/architecture/best-pr...

On the alternative solution, it's worth stressing, that it does not have to be a cache service. Any key/value store would do.