ML, and data science more broadly, are very useful, but even some of the most recent "easy" data science tools (e.g. Google AutoML Tables) have too high a learning curve to be useful to the average consumer.
Normally, if you were learning a new tool, you might learn through a combination of study and trial and error. However, many people don't have the time to sit down and learn something complex in this manner. (They need a bit of an extra push to minimize their errors and guide them toward reasonable use cases.) The result is something like this:
1 - get excited to try something easy and get new value out of their data
2 - get frustrated because the tool is not easy enough, or they don't know what questions are answerable with the available algorithms
3 - search the internet for guidance on what the algorithms do, get overwhelmed
4 - abandon tool
Solution:
1 - send us your data (probably a spreadsheet/CSV/Excel file)
2 - we analyze the data, and send you a list of questions that we can answer/insights that we can derive
3 - you select which of the questions you want answered
4 - we run our analyses and send you the results, including an explanation of the algorithms that were used to derive the results
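Step 2 of this workflow could start as simple data profiling. The sketch below is purely illustrative (the helper names and question templates are my own invention, not a description of any real service): infer a rough type for each CSV column, then map type combinations to question templates.

```python
import csv
import io

def infer_column_types(csv_text):
    """Classify each CSV column as 'numeric' or 'categorical' by trying float()."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    types = {}
    for col in rows[0].keys():
        values = [r[col] for r in rows if r[col] != ""]
        try:
            [float(v) for v in values]
            types[col] = "numeric"
        except ValueError:
            types[col] = "categorical"
    return types

def suggest_questions(types):
    """Map inferred column types to question templates the user could pick from."""
    numeric = [c for c, t in types.items() if t == "numeric"]
    categorical = [c for c, t in types.items() if t == "categorical"]
    questions = []
    for num in numeric:
        for cat in categorical:
            questions.append(f"What is the average {num} per {cat}?")
    if len(numeric) >= 2:
        questions.append(f"Is {numeric[0]} correlated with {numeric[1]}?")
    return questions

sample = "region,units,revenue\nEast,10,1000\nWest,7,850\n"
print(suggest_questions(infer_column_types(sample)))
```

A real service would need far richer profiling (dates, IDs, cardinality, missing data), but even this crude version shows how a question list can be generated before the user has to learn anything about the underlying algorithms.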
The key here is that the "learning" takes place after value is delivered to the user. Even though a tool may allow you to do things with the click of a button, the hidden complexity still presents a learning curve to the user.
Footnotes:
- I'm not claiming to have a large amount of data to back this up, hence why I said this is a "hypothesis". I'm offering the idea up for feedback and am interested in hearing what people say!
- This certainly does not apply to people who are used to self-directed learning and enjoy a healthy challenge
> 1 - send us your data (probably a spreadsheet/CSV/Excel file)
> 2 - we analyze the data, and send you a list of questions that we can answer/insights that we can derive
> 3 - you select which of the questions you want answered
> 4 - we run our analyses and send you the results, including an explanation of the algorithms that were used to derive the results
This order of points isn't quite right. The overwhelming majority of companies will already have a question that needs answering or a problem that needs solving. They will then want to know which parts of which datasets are relevant. If the existing datasets aren't enough, they consider collecting and/or purchasing more. Knowing what to collect and/or buy is another problem. Then you need to set up systems for extracting any 'insight' from what they have and for continuously managing and processing data from multiple sources, and so on.
No one is really sitting with a single csv file open in front of them, thinking "Hmm if only someone would tell me what I can do with this".
In my case I would require (1) confidence in the security of our data, (2) some way to continue using the service without it being manual (e.g. the latest month's data is reflected somewhere I can log into), and (3) where a model is being created, a way for existing systems to interact with it via an API.
PS: Personally, I love the insertion of step 2 in the solution. Yes, I would have questions going in, but I would appreciate validation that they can be answered effectively, and would appreciate a list of potential questions that I may have missed myself.
- Google Sheets suggests pivot tables / charts that can be built on your worksheet data
- with our BI tool (https://www.seektable.com) anyone can upload even a rather large (up to 500MB) CSV file; the engine then suggests dimensions/measures automatically and even suggests some typical reports (the suggestions are very simple for now, just a set of heuristic rules based on CSV column names). More than this, for a CSV file the user can 'ask' the data with search-like queries and get an answer in the form of a pivot table.
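Name-based heuristics like the ones described above can be sketched in a few lines. The hint lists and default below are my own guesses at the kind of rule such an engine might use (the actual rules aren't public):

```python
# Hypothetical name-based heuristics for classifying CSV columns as
# dimensions (things to group by) or measures (things to aggregate).
MEASURE_HINTS = ("amount", "price", "qty", "quantity", "total", "count", "revenue", "cost")
DIMENSION_HINTS = ("id", "name", "date", "category", "region", "type")

def classify_column(name):
    lower = name.lower()
    if any(hint in lower for hint in MEASURE_HINTS):
        return "measure"
    if any(hint in lower for hint in DIMENSION_HINTS):
        return "dimension"
    # Default to dimension: grouping by an unknown column is safer
    # than summing something that may not be additive.
    return "dimension"

columns = ["OrderDate", "Region", "UnitPrice", "Quantity"]
print({c: classify_column(c) for c in columns})
```

In practice you'd combine name hints with inspection of the actual values (numeric vs. text, cardinality), since column names alone are often misleading.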
The biggest problem I find is that this sets expectations _high_ from the beginning, which is a bad idea.
Most data projects are about managing expectations.
Also, it is _very_ hard to demonstrate that your model works if the customer does not yet have some graphics to compare against.
Your first step should be _showing_ the data, just to have a visual baseline to compare against.