I need to be able to perform some small customizations to the detection algorithm, if possible. For example, if I have a 9-digit number starting with 0, treat it as a string.
So far I have found Frictionless Framework [0], which seems good, but I can't see any way of customizing the profiling algorithm; and Data Profiler [1], which uses ML for type detection and seems to let me train new rules, but that needs a CUDA-capable machine, which at the moment I do not have.
Hoping the collective HN brain can point me to something better if it exists.
[0] https://framework.frictionlessdata.io/
[1] https://github.com/capitalone/DataProfiler
https://pandas.pydata.org/pandas-docs/stable/reference/api/p... is pretty powerful (see the "parse_dates" and "converters" parameters); see also read_excel().
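For example, a minimal read_csv call that pins down column types up front (the file name and column names here are made up for illustration):

    import pandas as pd

    df = pd.read_csv(
        "data.csv",                      # hypothetical file
        dtype={"account_id": "string"},  # keep the leading-zero column as strings
        converters={"amount": lambda v: float(v) if v else None},  # per-value custom conversion
        parse_dates=["created_at"],      # parse these columns as datetimes
    )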
You can also use procedural code to look at the column data and change the type:
    # assuming the column was read in as strings (e.g. with dtype=str):
    # if every value in "col" begins with "0" and is 9 characters long,
    # convert the column to int64
    if df["col"].str.match("^0").all() and (df["col"].str.len() == 9).all():
        df["col"] = df["col"].astype("int64")
Import a csv reader library that can stream.
Read each line and apply a series of regexes, each one classifying the value on a match. E.g.
    ^0\d{8}$
means "string". Then have a reduction rule, e.g.:
If so far we think it’s a numeric column and we get a string, treat it as a string column.
If so far we think it’s a numeric column and we get a number, it’s still a numeric column.
Running the regex classification and the reduction in a loop over the rows will give you the final type for each column.
Happy to knock up some example code if you wish.
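As a starting point, here is a minimal sketch of that classify-and-reduce loop using the standard library csv module (the rule set and the reduction order are illustrative assumptions, not a fixed spec):

    import csv
    import re

    # ordered classification rules: first match wins
    RULES = [
        ("string", re.compile(r"^0\d{8}$")),  # 9 digits with a leading zero -> keep as string
        ("int", re.compile(r"^-?\d+$")),
        ("float", re.compile(r"^-?\d+\.\d+$")),
    ]

    def classify(value):
        for name, pattern in RULES:
            if pattern.match(value):
                return name
        return "string"

    def reduce_types(current, new):
        # once any value forces "string", the whole column is a string;
        # a mix of int and float reduces to float
        if current is None:
            return new
        if current == new:
            return current
        if "string" in (current, new):
            return "string"
        return "float"

    def infer_column_types(path):
        with open(path, newline="") as f:
            reader = csv.DictReader(f)  # streams the file row by row
            types = {name: None for name in reader.fieldnames}
            for row in reader:
                for name, value in row.items():
                    if value is None:
                        continue
                    types[name] = reduce_types(types[name], classify(value))
        return types

    print(infer_column_types("data.csv"))  # hypothetical file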