But what if I have data that is not clean, in the form of files or databases? Then I have to read the files, clean them up, and structure them in data structures according to my needs (is THIS called parsing?). It is this preprocessing part that I am asking about.
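To make concrete what I mean, here is a toy sketch of the read/clean/structure step in Python (the CSV content and field names are made up):

    import csv
    import io
    from datetime import datetime

    # Toy stand-in for a messy input file; real inputs are far messier.
    raw = io.StringIO(
        "name, signup_date ,score\n"
        "  Alice ,2021-03-05, 42 \n"
        "Bob,, 17\n"          # missing date
        "  ,2021-04-01,  \n"  # missing name and score
    )

    records = []
    for row in csv.DictReader(raw):
        # Normalise stray whitespace in both column names and values.
        row = {k.strip(): (v or "").strip() for k, v in row.items()}
        if not row["name"]:
            continue  # drop rows without a usable key
        # Structure: convert raw strings into typed fields.
        records.append({
            "name": row["name"],
            "signup_date": (datetime.strptime(row["signup_date"], "%Y-%m-%d").date()
                            if row["signup_date"] else None),
            "score": int(row["score"]) if row["score"] else 0,
        })

    print(records)

The parsing is the DictReader plus the string-to-type conversions; the cleaning is the stripping, the dropped row, and the defaults for missing values.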
What CS or programming subjects should I study to become somewhat of an expert in data cleaning, preprocessing, and structuring large numbers of files in batches?
I am also interested in the second part of the pipeline, where I analyse the data and produce output, both as good visualisations and as data to be stored in files.
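For that second part, something like the following is what I picture; pandas and matplotlib are just one common choice, and the data and column names here are invented:

    import pandas as pd
    import matplotlib.pyplot as plt

    # Invented example data standing in for the cleaned records.
    df = pd.DataFrame({
        "city": ["Oslo", "Oslo", "Bergen", "Bergen"],
        "score": [42, 17, 30, 25],
    })

    # Analyse: aggregate per group.
    summary = df.groupby("city")["score"].mean()

    # Output 1: a visualisation.
    summary.plot(kind="bar", title="Mean score per city")
    plt.savefig("scores.png")

    # Output 2: structured data written back to a file.
    summary.to_csv("scores.csv")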
Any books, courses, or other resource pointers would be appreciated.
P.S.: Files can be anything. They are just streams of bytes: images, audio, video, text, CSV.
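Since the input could be any of these, one common first step is to sniff the leading "magic" bytes and dispatch to the right reader. A rough sketch (the signatures are the standard PNG/JPEG/PDF/ZIP ones; the dispatch table itself is just illustrative):

    # Map well-known leading byte signatures to a file kind.
    MAGIC = {
        b"\x89PNG\r\n\x1a\n": "png",
        b"\xff\xd8\xff": "jpeg",
        b"%PDF": "pdf",
        b"PK\x03\x04": "zip",
    }

    def sniff(path):
        with open(path, "rb") as f:
            head = f.read(8)
        for sig, kind in MAGIC.items():
            if head.startswith(sig):
                return kind
        return "unknown"  # fall back to extension or a text heuristic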
There are probably useful free data sets out on the internet. Learning Python is useful.
I know AWS has a heap of services catering to data pipelines... maybe see if there's a free tier on any of them.
I've most commonly heard the fixing of bad data called "data cleansing" or "data scrubbing".