HACKER Q&A
📣 flightster

What's the best way to join on numerous fields with fuzzy matching?


I've been tasked with creating an API to look up a customer based on a variety of fields. A non-exhaustive list of fields could look something like:

- customer_id - customer_first_name - customer_last_name - customer_loan_number - customer_phone_number - customer_email - customer_address - customer_date_of_birth - customer_zip_code

What I ultimately want is customer_id. But what I have is any combination of the subsequent fields. They might be all lowercase, they might not. There might be type-os. The phone number might have parentheses and dashes, it might not.

The address and zip are also wonky - for example customer_zip_code may look like XXXXX-YYYY, or it may just be XXXXX.

And example query would be: given customer_first_name and customer_last_name and customer_address, find the customer_id. The customer_last_name might be lowercase in the query and uppercase in the main database. The customer_address in the query might be "123 Main St" but in the database it might be "123 MAIN STREET".

And the data is in the 100m-1000m rows size, so this has to run reasonably quickly with that in mind. Can't just do a "select * where [x = y] and [p = q] and [t = s]". Would like to pre-load the big table in RAM, and then somehow look up the matches.

Because this almost definitely a common problem in our industry, I was wondering, is there some kind of good database already pre-optimized for this kind of fuzzy lookup?


  👤 kmgrassi Accepted Answer ✓