Why AI and Data Science projects have a high risk of failure?

Question

shoo · Accepted Answer

Maybe one problem is that some projects naively try to start at "fit a model to whatever data we've got" when perhaps they would be better off starting with a statistical experimental design perspective, and thinking about how exactly the data would need to be collected https://en.wikipedia.org/wiki/Design_of_experiments
Here's one detailed anecdote of how projects can fail:
> Imagine you want to design some algorithm to detect cancer. You get data of healthy and sick people; you train your algorithm; it works fine giving you high accuracy and you conclude that you’re ready for a successful career in medical diagnostics.
> Not so fast …
> Many things could go wrong. In particular, the distributions that you work with for training and those in the wild might differ considerably. This happened to an unfortunate startup I had the opportunity to consult for many years ago. They were developing a blood test for a disease that affects mainly older men and they’d managed to obtain a fair amount of blood samples from patients. It is considerably more difficult, though, to obtain blood samples from healthy men (mainly for ethical reasons). To compensate for that, they asked a large number of students on campus to donate blood and they performed their test. Then they asked me whether I could help them build a classifier to detect the disease. I told them that it would be very easy to distinguish between both datasets with probably near perfect accuracy. After all, the test subjects differed in age, hormone level, physical activity, diet, alcohol consumption, and many more factors unrelated to the disease. This was unlikely to be the case with real patients: Their sampling procedure had caused an extreme case of covariate shift that couldn’t be corrected by conventional means. In other words, training and test data were so different that nothing useful could be done and they had wasted significant amounts of money.
-- https://blog.smola.org/post/4110255196/real-simple-covariate...

nomizygous · Answer

AI is an evolving technology and it's somewhat true that AI projects have a high risk of failure attached to it. Last year, many big sites predicted that major data science projects would face failure in the future. According to a report, 87% of ongoing projects will fail in delivering the desired results. What I suggest is that it is imperative to do continuous in-depth research on a particular use case and the supported models before starting working on it.
Data Science technologies are much improved and advanced now compared to 10 years ago but there is a lot more to improve when it comes to meeting end-user expectations and real-life implementation of an Enterprise AI project. AI operations and processes is one factor but there are many other reasons that lead to failure of data science projects. These include:
- Absence of comprehension about AI tools and methodology. - Poor Data Quality - Not opting the right tool - Bad Strategy from top management. - Lack of investment in employees who know data very well
We know that data science uses statistical concepts and theories that exist since ages and most of them have been successfully built by the numerical scientists from the big tech giants. What’s important for the organizations now is to understand the use case and hire the resources with the right set of skills. Many organizations are confused about the required skill set because the field of data science is new for the top management and they are just trapped by the hype created around. Most of them are naturally thinking of hiring a statistician having a PhD in statistics, which is not required most of the time. As per my experience, this is true only if the task in hand is to do research in advanced statistical models and algorithms.
With the abundant supply of the off-the-shelf modeling tools and technologies in the market, 99% of the times organizations just require a well-equipped resource who is
1) skilled in the right tool that is compatible with existing technologies. 2) have good understanding of certain statistical models and techniques.
This article has done an analysis on why major AI projects failed, and summaries it very well https://thinkml.ai/five-biggest-failures-of-ai-projects-reas...

whyhow · Answer

I think there is a huge risk that the data being collected just isn't meaningful to the problem being asked.
For example, if you are trying to predict the weather you'd really like information on how big clouds are moving but all you have is wind speed and humidity.
Everyone acts like having fancier machine learning methods will solve their problem, but often the data just isn't good enough and getting better data is impossible.

logicslave · Answer

1.) Most data scientists are not good enough at math to take an existing model and use it for a different purpose. Usually there needs to a model whos mathematical properties match almost exactly to their needs (this is usually the case).2.) The data is bad

Skaven · Answer

Most methods easily find a local optimum, but it is very hard to tell if it is a global optimum.

streetcat1 · Answer

Because they are measured by thier performance on unseen data.

osipov · Answer

due lack of clear and measurable success criteria

Jugurtha · Answer

I'll address this from the perspective of a company that's been delivering complete custom data applications for enterprise for the past seven years. More context about problems here[0] and here[1]. The risk is so high that we're building an ML platform to address those.
First, the initial conversations take time and depend on the maturity of the organization data wise, and all our clients are huge organizations. You keep exchanging until you find a problem. Sometimes they come to us with a problem in mind, ut it's almost never formulated. We have to extract it and formulate it.
Then we have to understand the problem, why is it a problem, what's the desired state. What success looks like. This is not trivial, as many things are qualitative and how do you mesure that? Many people must be involved.
You must also address the data question early on. Is there data to work with ? What kind of data? Is it enough ? Do you have to work with several sources, etc. But for now, you want 'some' data. Who has it ? You'll iterate over this multiple time as they send you data you can't work with. Is it sensitive data ? What regulations exist around that ? More people get involved (legal, security, etc). You also have to address ethical questions and what you will and will not do. These are long, long conversations.
One aspect not to dismiss is that there may be people who feel threatened by the project. If the project is to reduce menial work, or reduce the skill level required to do something, some people might try to undermine it.
Then in the rare cases where you got these right, and the data is good, you'll explore and iterate to train models. This often is the easiest part, but there are challenges for teams on collaboration, tracking, deployment, and including domain expertise in this step which can drag the project further. That's what we're building to solve.
Once you have satisfying results, you'll then want a system to make this operational. We built applications to allow clients to use our models under the hood to do something. The deliverable isn't just a model that spits back predictions. This is a software project. You also need to make it possible to adapt: data changes and you have to take that into account (data pipelines, model training and management, concept drift, etc.)
All this takes time and money and the effort of many people who are committed to do it, with diverse backgrounds and roles: you need to be able to build relationships and trust and get the help of domain experts. It's an all inclusive or nothing. Frictions, or people moving at different speeds.
The ticket price for a data project can be hefty, and there are too many reasons for them to fail.
[0]: https://iko.ai/docs/
[1]: https://iko.ai/docs/story/