HACKER Q&A
📣 mahalel

What is the best tool to infer data type of tabular data?


Hi HN, I am looking for the most accurate tool which I can use to infer Data Types from Tabular data (csv,tsv,excel)

I need to be able to perform some small customization, if possible, to the detection algorithm. For example if I have a 9 digit number, starting with 0, then treat it as a String.

So far - I have found Frictionless Framework [0] which seems good, but I can't see any way of specifying customizations to the profiling algorithm, and Data Profiler [1] which uses ML for type detection, and it seems I should be able to train some new rules but I need a CUDA capable machine, which at the moment I do not have.

Hoping the collective HN brain can point me to something better if it exists.

[0] - https://framework.frictionlessdata.io/ [1] - https://github.com/capitalone/DataProfiler


  👤 switch007 Accepted Answer ✓
What kind of types?

https://pandas.pydata.org/pandas-docs/stable/reference/api/p... is pretty powerful (see also "parse_dates" and "converters" parameters). See also parse_excel()

You can also use procedural code to look at the column data and change the type:

    # if all values in col c2 when converted to string begin with "0" and and values are of length 9, convert to int64
    if df["col"].str.match("^0").all() and set(df["col"].str.len()) == {9}:
        df["col"] = df["col"].astype("int64")

👤 quickthrower2
I’d probably knock one up in nodejs ad follows:

Import a csv reader library that can stream.

Read each line and apply a series of regex, each one classifying on match.

Eg

    ^0\d{8}$
Means string

Then have a reduction rule e.g.

If so far we think it’s a numeric column and we get a string then treat as string.

If so far we think it’s a numeric column and we get a number it is still a numeric column.

Then doing the regex and reduce in a loop will give you the final answers.

Happy to knock up some example code if you wish.