HACKER Q&A
📣 Gyanangshu

What's your strategy for inconsistent date formats?


Working on a data cleaning tool and running into a problem I haven't found a clean solution for: user-uploaded CSVs where a single date column has multiple formats. Not just ISO vs US vs European - but also relative formats ("Jan-23"), written formats ("15th January 2023"), and partial dates all mixed together.

The naive approach of trying a list of strptime format strings in order breaks on ambiguous dates like "01/02/03" - is that January 2nd 2003, or February 1st 2003, or 1st February 2003? The answer depends on the locale and context that we often don't have.

Our current approach: scan the first 50 non-null values, rank format candidates by match frequency, flag ambiguous dates for user confirmation, and store the detected format alongside the column metadata for future imports. We handle about 94% of real-world mixed-format columns automatically, but the remaining 6% need user input.

Some patterns that came up more often than expected in real datasets: dates with ordinal suffixes ("1st", "2nd", "3rd"), fiscal quarter notation ("Q3 2024"), and Unix timestamps stored as strings.

Curious what approaches others have used, particularly around the ambiguity resolution step.


  👤 stop50 Accepted Answer ✓
Localization for ui only. Im- and exported data only in standard formats

👤 freakynit
A known issue.. have faced this personally long back (had to let it go back then since the use-cases was no more valid) ... but, will this help?

https://github.com/freakynit/smart-date-parser

This does maintain context based on past successful parses.

Disclaimer: This is fully opus generated, but do have test cases (in usage.js ... i know.. it's not what it's for.. but it is what it is).