HACKER Q&A
📣 shineDaPoker

How do you handle duplicate side effects when jobs or workflows retry?


Quick context: I'm building background job automation and keep hitting this pattern:

1. Job calls external API (Stripe, SendGrid, AWS)
2. API call succeeds
3. Job crashes before recording success
4. Job retries → calls API again → duplicate

Example: process refund, send email notification, crash. Retry does both again. Customer gets duplicate refund email (or worse, duplicate refund).

I see a few approaches:

Option A: Store processed IDs in a database
Problem: Race between "check DB" and "call API" can still duplicate

Option B: Use API idempotency keys (Stripe supports this)
Problem: Not all APIs support it (legacy systems, third-party)
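For APIs that do support idempotency keys, the key should be derived from the job's own identity rather than generated randomly, so every retry of the same job sends the same key. A minimal sketch (the `job_id`/`step` naming and the commented Stripe call are illustrative, not from any particular codebase):

```python
import hashlib

def idempotency_key(job_id: str, step: str) -> str:
    """Derive a stable key from the job's identity, so every retry
    of the same job step presents the same key to the API."""
    return hashlib.sha256(f"{job_id}:{step}".encode()).hexdigest()

# Same job + step always yields the same key, no matter how often it retries:
key = idempotency_key("job-42", "refund")
# e.g. stripe.Refund.create(..., idempotency_key=key)  # hypothetical usage
```

The point is that a random UUID generated at call time defeats the purpose: a retry would produce a fresh key and the provider would happily execute the side effect twice.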

Option C: Build a deduplication layer that checks the external system first
Problem: Extra latency, extra complexity

What do you do in production? Accept some duplicates? Only use APIs with idempotency? Something else?

(I built something for Option C, but trying to understand if this is actually a common-enough problem or if I'm over-engineering.)


  👤 moomoo11 Accepted Answer ✓
You proxy those API calls yourself and have idempotency to cover you for the APIs that don't have it. If you architect it right you won't have more than a millisecond of added latency. You can avoid the race condition issues by using atomic records, so if something else tries it sees the record is in progress and exits.
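The "atomic records" idea can be sketched with a unique constraint: claim the job by inserting a row *before* calling the API, and let the database reject a second claim. A minimal sketch assuming a relational store (SQLite here purely for illustration; table and column names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE side_effects (
        job_id TEXT PRIMARY KEY,   -- one row per job: the atomic claim
        status TEXT NOT NULL       -- 'in_progress' or 'done'
    )
""")

def try_claim(job_id: str) -> bool:
    """Atomically claim the job. A second worker inserting the same
    job_id hits the PRIMARY KEY constraint and backs off."""
    try:
        conn.execute(
            "INSERT INTO side_effects (job_id, status) VALUES (?, 'in_progress')",
            (job_id,),
        )
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        return False  # someone else holds it: in progress or already done

first = try_claim("refund-123")   # we own the side effect
second = try_claim("refund-123")  # a retry sees the claim and exits
```

Because the check and the claim are a single INSERT, there is no window between "check DB" and "call API" for a duplicate to slip through; the remaining failure mode is a claim whose owner crashed mid-call, which usually needs a timeout/reconciliation step.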

👤 babelfish
Use something like Temporal

👤 stephenr
I think the answer is probably like most things: it depends.

- If the external service supports idempotent operations, use that option.

- If the external service doesn't, but has a "retrieval" feature (i.e. you can look up whether the thing already exists, e.g. fetch the refunds on a given payment), use that first.

- If the system has neither, assess how critical it is to avoid duplicates.
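The "retrieval first" option above can be sketched as a lookup-before-create wrapper. The gateway class below is an in-memory stand-in for whatever SDK you actually use, and the method names are illustrative:

```python
class FakeGateway:
    """In-memory stand-in for a payment API that can list and create refunds."""
    def __init__(self):
        self.refunds = {}

    def list_refunds(self, payment_id):
        return self.refunds.get(payment_id, [])

    def create_refund(self, payment_id, amount):
        refund = {"payment_id": payment_id, "amount": amount}
        self.refunds.setdefault(payment_id, []).append(refund)
        return refund

def ensure_refund(client, payment_id, amount):
    """Look up existing refunds before creating one, for APIs that offer
    retrieval but no idempotency keys."""
    for refund in client.list_refunds(payment_id):
        if refund["amount"] == amount:
            return refund          # already refunded: reuse, don't duplicate
    return client.create_refund(payment_id, amount)

gw = FakeGateway()
ensure_refund(gw, "pay_1", 500)   # creates the refund
ensure_refund(gw, "pay_1", 500)   # retry finds it and skips the create
```

Note this is a weaker guarantee than a real idempotency key: two workers racing through the lookup at the same moment can still both create, so it pairs well with an atomic claim on your side.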


👤 codebitdaily
Idempotency is the only sustainable answer here. Whether it's at the database level using unique constraints or implementing idempotency keys in your API headers, you have to design for the "at-least-once" delivery reality. I usually implement a `processed_requests` table that stores the unique ID of the job. Before the worker executes any side effect (like a payment or email), it checks if that ID exists. If it does, it skips the execution and returns the previous result. It adds a bit of latency, but it's much cheaper than dealing with double-billing or corrupted data.
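The `processed_requests` pattern described above might look like this (a sketch: SQLite stands in for the real database, and the table/column names follow the comment, not any particular schema):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE processed_requests (
        request_id TEXT PRIMARY KEY,  -- unique job/step ID
        result     TEXT NOT NULL      -- JSON of the recorded outcome
    )
""")

def run_once(request_id, side_effect):
    """Skip the side effect if this request was already processed,
    returning the recorded result instead."""
    row = conn.execute(
        "SELECT result FROM processed_requests WHERE request_id = ?",
        (request_id,),
    ).fetchone()
    if row:
        return json.loads(row[0])     # already done: return previous result
    result = side_effect()            # e.g. refund, charge, send email
    conn.execute(
        "INSERT INTO processed_requests (request_id, result) VALUES (?, ?)",
        (request_id, json.dumps(result)),
    )
    conn.commit()
    return result

calls = []
effect = lambda: calls.append(1) or {"ok": True}
run_once("job-7", effect)
run_once("job-7", effect)   # retry: side effect not re-executed
```

As the OP notes under Option A, the window between the SELECT and the side effect is still a race on its own; inserting the row up front with an in-progress status (so the INSERT itself is the claim) closes it.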