HACKER Q&A
📣 todaysAI

How do you update multiple disparate databases?


Talked to a CTO of a small bank. He has 4 disparate legacy databases which need to be updated when a mortgage rate changes, which is obviously a problem.

What solutions does the HN community have for dealing with this issue?


  👤 redis_mlc Accepted Answer ✓
Banks use an ancient but powerful architecture called Source of Truth (SoT), or system of record. It's one of the first techniques developed to manage heterogeneous distributed databases.

One master is picked as the final authority, then changes flow to the other databases, replicas, etc., using either database tools or applications.

Awareness of deadlines and latency tolerances (i.e. when the updates are needed by each downstream pipeline) is helpful.

A practical example that solves a simple case is the recent release of a tool by Netflix.

Pro tip: when you interview people for a project like this, they should already know the above. :)

Source: DBA.

https://en.wikipedia.org/wiki/Single_source_of_truth

https://en.wikipedia.org/wiki/System_of_record


👤 Misdicorl
Step 1) Generate a GUID

Step 2) Send a message to all the `downstream` systems (i.e. not the Source of Truth systems) with the change description and GUID

Step 3a) Update source of truth with the new info, write the GUID to a log table in source of truth in the same transaction

Step 3b,c,d) Downstream systems poll the source of truth for existence of the GUID in the log table indicating the change has been committed to source of truth. Load latest value from source of truth and commit the change to the downstream system.

Step 4) If the GUID never shows up, the transaction was rolled back in the source of truth (or an error occurred trying to send to one of the other downstreams, aborting before the transaction): do something relevant for your use case. Alert somebody, carry on, whatever.
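
A minimal sketch of that flow, assuming sqlite3-style connections for the source of truth and one downstream system; the rates/change_log tables, the notification callables, and the polling timeout are hypothetical placeholders, not anything specified above.

    import time
    import uuid

    def publish_rate_change(sot, notify_downstreams, new_rate):
        change_id = str(uuid.uuid4())                        # Step 1: generate a GUID
        for send in notify_downstreams:                      # Step 2: tell the downstream systems first
            send({"change_id": change_id, "rate": new_rate})
        with sot:                                            # Step 3a: SoT update + GUID log in one transaction
            sot.execute("UPDATE rates SET mortgage_rate = ?", (new_rate,))
            sot.execute("INSERT INTO change_log (change_id) VALUES (?)", (change_id,))
        return change_id

    def apply_when_committed(sot, downstream, change_id, timeout_s=60):
        # Steps 3b-d: poll the SoT log table; only apply once the GUID is visible.
        deadline = time.time() + timeout_s
        while time.time() < deadline:
            row = sot.execute("SELECT 1 FROM change_log WHERE change_id = ?",
                              (change_id,)).fetchone()
            if row:
                rate = sot.execute("SELECT mortgage_rate FROM rates").fetchone()[0]
                with downstream:
                    downstream.execute("UPDATE rates SET mortgage_rate = ?", (rate,))
                return True
            time.sleep(1)
        return False  # Step 4: GUID never appeared, so the SoT transaction rolled back; alert someone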


👤 gwbas1c
Honestly, the solution depends highly on your circumstances. Microsoft SQL has a concept of a distributed transaction, but I'm going to assume that "4 disparate legacy databases" aren't Microsoft SQL.

Here's the high-level approach I would take:

1: Understand which databases serve what, and what the needs for accuracy are. (For example, a database serving their public-facing website can probably be a few seconds behind compared to a database handling real money.)

2: Figure out which database is the source of truth. Update that database first.

3: Update the remaining three databases in a loop: update each database, then check to make sure the update actually happened.

You will probably need to add some state management where you put yourself into an "updating" state, so that if the update breaks in the middle, you can go back and recover.
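
A rough sketch of that state management, assuming a small JSON state file and hypothetical per-database update/verify callables; a real system would want to persist this somewhere more durable than local disk:

    import json
    import os

    STATE_FILE = "rate_update_state.json"   # hypothetical location for the "updating" marker

    def propagate(new_rate, update_fns, verify_fns):
        # Record that an update is in flight before touching the downstream databases.
        state = {"rate": new_rate, "pending": list(update_fns)}
        with open(STATE_FILE, "w") as f:
            json.dump(state, f)

        for name in list(state["pending"]):
            update_fns[name](new_rate)              # update the database...
            if not verify_fns[name](new_rate):      # ...then check the update actually happened
                raise RuntimeError(f"{name} did not accept the new rate")
            state["pending"].remove(name)
            with open(STATE_FILE, "w") as f:        # checkpoint progress after each database
                json.dump(state, f)

        os.remove(STATE_FILE)                       # nothing pending: the update is complete

    def recover(update_fns, verify_fns):
        # On startup, finish any update that broke in the middle.
        if os.path.exists(STATE_FILE):
            with open(STATE_FILE) as f:
                state = json.load(f)
            remaining = {name: update_fns[name] for name in state["pending"]}
            propagate(state["rate"], remaining, verify_fns)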

----

Another approach is to refactor the applications that are critical (those that handle money, not the public-facing web site) to always read from the database you decide is the source of truth.


👤 perlgeek
So, the first thing to consider is whether this needs a technical solution at all.

If it happens infrequently enough, you can simply have a procedure "change mortgage rate" that touches the 4 legacy databases. Done.

If it does need a technical solution, you have to establish the leading system / source of truth / whatever you call it. This can be either an existing system or a new one. And then all the others need to follow the leading system.

A more distributed approach (each of the 4 legacy systems can generate events that the other 3 consume) likely won't work here. It's often hard enough to get high-fidelity events/deltas out of a single system.


👤 segmondy
Change Data Capture. Share the consulting loot when you get yours. https://en.wikipedia.org/wiki/Change_data_capture

👤 henriquez
It would be hard to formulate a one-size-fits-all solution even if you were talking about relatively modern systems.

There’s a trade-off between having a manual vs. automated process in the best of circumstances. You might think intuitively that an automated process would be easier and more reliable, but depending on local factors for each of the databases, there might be issues like availability or locking that could make an automated solution more “fragile” than simply having an employee with domain expertise doing an update on a schedule. Reliably automating a process like this would require a lot of careful QA.

So throwing legacy systems into the mix would just aggravate this further. It may be that the cost/benefit of automating this just isn’t there. If it’s cheaper in terms of salary cost to do it by hand for N years than the one-time expense to automate, it may just not make sense to automate. (Obviously over an indefinite time scale it would almost always make sense to automate)


👤 cryptonector
There are no standard mechanisms for keeping heterogeneous databases in sync, or even DBs of the same types but with different schemas.

The standard thing to do is to have a source of truth, as described in various comments here like https://news.ycombinator.com/reply?id=22690991&goto=item%3Fi...

Sometimes you'll have multiple sources of truth for (hopefully non-overlapping) subsets of the data in question.

These things are messy. You have to do the hard work of writing sync tools. You want sync tools that work off of incremental / delta feeds, but you also need (slower) tools that can make-it-right based on the totality of your data -- at least the first time you sync two systems, and often also for auditing purposes to make sure that the incremental (fast) sync is working correctly.
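
A sketch of that slower make-it-right pass, assuming both sides can be dumped to lists of dicts keyed on a shared identifier (the loan_id key is made up):

    def reconcile(sot_rows, downstream_rows, key="loan_id"):
        # Compare a full extract of the source of truth against a downstream copy
        # and return the rows that need fixing.
        sot = {r[key]: r for r in sot_rows}
        other = {r[key]: r for r in downstream_rows}

        missing  = [sot[k] for k in sot.keys() - other.keys()]       # in SoT, absent downstream
        stale    = [sot[k] for k in sot.keys() & other.keys()
                    if sot[k] != other[k]]                           # present in both but different
        orphaned = [other[k] for k in other.keys() - sot.keys()]     # downstream rows with no SoT source
        return missing, stale, orphaned

The incremental (delta) feed handles the day-to-day changes; a pass like this is what you run for the initial load and the periodic audits.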


👤 oweiler
CDC via Kafka Connect is one very powerful option.
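
A minimal sketch of the consuming side, assuming Debezium is already publishing row changes from the source-of-truth database to Kafka; the topic name, the event shape, and the downstream handler are assumptions, not part of the comment above:

    import json
    from kafka import KafkaConsumer   # pip install kafka-python

    def apply_to_downstream(row):
        # Placeholder: push the new rate into whichever legacy database this worker owns.
        print("would update downstream with", row)

    consumer = KafkaConsumer(
        "sot.public.rates",                           # hypothetical Debezium topic name
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
        auto_offset_reset="earliest",
    )

    for message in consumer:
        change = message.value
        after = change.get("payload", {}).get("after")   # Debezium puts the new row state under "after"
        if after:
            apply_to_downstream(after)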

👤 edoceo
Oh, I've done this. The 4 DBs are for forward origination and servicing, and reverse origination and servicing.

We had a prime app, with a fifth DB driving it. It had scripts to pull rates and a UI to adjust them, and it logged all changes.

These rates were pulled by the other systems (e.g. the AS400 pulled them over FTP). Additional internal apps could query over SMB and HTTP.

The prime updated slowly, and each of the services would poll it frequently (every 4 minutes), plus on-demand when needed.


👤 ransom1538
1) Create an internal endpoint for each of the databases to change the mortgage rate. Each one is different, so you will need to do this 4 times, e.g. setMortgageRate(). This is push, not pull. While you're at it, create getMortgageRate() too.

2) Then once this is standardized, write a for loop.
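
For illustration, that standardize-then-loop approach might look like the sketch below; the adapter class and the four system names are hypothetical.

    class DbAdapter:
        """Hypothetical per-database wrapper; each of the 4 systems gets its own implementation."""
        def __init__(self, name):
            self.name = name
            self._rate = None

        def setMortgageRate(self, rate):
            self._rate = rate        # in reality: whatever SQL/API call that legacy system needs

        def getMortgageRate(self):
            return self._rate

    ADAPTERS = [DbAdapter(n) for n in ("core_banking", "servicing", "reporting", "website")]

    def set_rate_everywhere(new_rate):
        for db in ADAPTERS:                        # the for loop from step 2
            db.setMortgageRate(new_rate)           # push, not pull
        return all(db.getMortgageRate() == new_rate for db in ADAPTERS)   # verify with the getter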


👤 ahasani
You need to watch out for update failure: some sort of transaction management/atomic commit[1] is needed such that if one update fails, the other updated DBs roll back.

1. https://en.wikipedia.org/wiki/Atomic_commit


👤 timgebrally
I researched this pretty extensively and documented my findings in a long article here: https://www.rallyhealth.com/devops/database-updates-across-t...

👤 quickthrower2
The simplest way possible: if the rate change can be processed via an endpoint, that endpoint then updates the 4 databases. If talking to ACID databases, do a two-phase commit to reduce the chance of them getting out of sync. Send me 10% of your consulting fee please :-)

👤 mooreds
Surprised no one has mentioned two phase commit:

https://en.m.wikipedia.org/wiki/Two-phase_commit_protocol
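
For what it's worth, a sketch of 2PC via the DB-API TPC extensions that psycopg2 implements, assuming every participant is PostgreSQL with max_prepared_transactions > 0 (a big assumption with 4 disparate legacy systems); the DSNs and table name are made up:

    import uuid
    import psycopg2

    def update_rate_2pc(dsns, new_rate):
        conns = [psycopg2.connect(dsn) for dsn in dsns]
        gtrid = str(uuid.uuid4())
        xids = [conn.xid(0, gtrid, f"branch-{i}") for i, conn in enumerate(conns)]
        try:
            # Phase 1: do the work on every participant, then ask each one to prepare.
            for conn, xid in zip(conns, xids):
                conn.tpc_begin(xid)
                with conn.cursor() as cur:
                    cur.execute("UPDATE rates SET mortgage_rate = %s", (new_rate,))
                conn.tpc_prepare()
            # Phase 2: everyone prepared successfully, so commit everywhere.
            for conn in conns:
                conn.tpc_commit()
        except Exception:
            # Any failure before phase 2 finishes: roll back every branch we touched.
            for conn in conns:
                try:
                    conn.tpc_rollback()
                except Exception:
                    pass
            raise
        finally:
            for conn in conns:
                conn.close()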


👤 pooktrain
Debezium plus Kafka would be good here - see https://www.youtube.com/watch?v=fU9hR3kiOK0

👤 mappu
Exactly how small?

A little planned downtime goes a long way.


👤 perl4ever
Oh...you don't want to know that... (in the tone of J Frank Parnell explaining you don't want to look in his Malibu's trunk)

👤 kennyangel
this is just a comment test

👤 lonelappde
Send 4 API calls to update the 4 databases.

Name the bank so we know not to patronize it.