I imagine with various privacy scandals it fell out of favour since your data should be /your/ data only.
And many have talked about data being the ‘new oil’ when really it should be reframed as radioactive waste.
What happened to using this term to hype up your brand: ‘We use Big Data to infer information about how to improve and go forward’?
Was it just a hyped up buzzword?
A similar language history happened with terms like "dynamic web" or "interactive web". In the late 1990s, when Javascript started to be heavily used, we called attention to that new trend with the phrase "dynamic web". Today, the "interactive web" phrase has mostly gone away. But that doesn't mean that Javascript-enabled web pages were a fad. On the contrary, we're using more Javascript than ever before. We just take it as a given that the web is interactive.
Examples of the rise and fall of "dynamic web" and "interactive web" in language use, which peaked around 2004:
https://books.google.com/ngrams/graph?content=dynamic+web&ye...
https://books.google.com/ngrams/graph?content=interactive+we...
Then Machine Learning comes along, and these same people think it means you can just feed the beast your big data and it will be clever enough to tell you what you want to know. Then the same companies realise that they still don't have the skills to work out what to tell the ML algorithm to do.
Now it is in many places. Enterprises use it constantly.
A laptop hard disk is now capable of holding databases with tens of millions of rows.
Traditional "Data Science" and modern Deep Learning rely entirely on it. Millions of data points are used to create models every day.
A sensor on a human wrist collects and stores thousands of data points each day.
So do refrigerators, cars, and your washing machine, given the ubiquity of IoT.
Giant tech cos use billions of rows each day to show users products, or sell their attention as products.
Big Data became ubiquitous. And it became so common that nobody calls it that anymore.
Tools like BigQuery, Dask, and even Pandas and SQL can handle hundreds of thousands to hundreds of millions of rows (or other structures) with normal, everyday programming and commands.
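As a rough sketch of what that looks like in practice (the file and column names here are invented), a query over tens of millions of rows is routine in Pandas on one machine, and Dask runs the same thing out-of-core when the files outgrow RAM:

    import pandas as pd
    import dask.dataframe as dd

    # Tens of millions of event rows are fine in plain Pandas on one machine.
    events = pd.read_parquet("events.parquet")          # hypothetical file
    daily_users = events.groupby(events["timestamp"].dt.date)["user_id"].nunique()

    # If the data no longer fits in memory, Dask runs the same groupby
    # out-of-core across many files, still on a single machine.
    events_big = dd.read_parquet("events-*.parquet")    # hypothetical glob
    daily_users_big = (
        events_big.groupby(events_big["timestamp"].dt.date)["user_id"]
        .nunique()
        .compute()
    )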
There are still tons of folks out there using Hadoop (ew), Snowflake, etc. New technologies coming out include things like Trino, Apache Iceberg, etc. So it's there ... just no one cares about the moniker anymore ... they're just getting things done.
I guess it is similar to other technologies that most companies or developers would never really need because of their limited scale, like distributed databases, NoSQL, or microservices: it is interesting technology, and engineers want to get their hands on it because that's what the big boys play with, even if they don't really need it. In the meantime the industry hypes it because the technology is difficult, so they know they can make money doing consulting.
I'm not saying that it is not useful technology; I work at a company where we had the need to go from Postgres to "Big Data" tooling. But for tons of businesses it just doesn't make any sense. And even in our case, one of the questions I ask most frequently is: what business decision are you taking based on processing this enormous amount of data? Could we not make the same decision based on less data?
Storing data on S3 or using BigQuery removes a lot of the challenges compared with doing this stuff in your own data centre. You then also have services such as EMR, Databricks, and Snowflake to acquire the tooling and platforms as IaaS/SaaS. The actual work then moves up the stack.
Businesses are doing more with data than ever before and the volumes are growing. I just think the challenge moved on from managing large datasets as a result of new tooling, infrastructure, and practices.
It's crazy how much you can do with one machine these days. Hence you often just have "data". And then snowflake/bigquery/redshift if it literally can't fit on a machine (which is rare).
Yes.
I can't tell you how many meetings I've been in where someone was pitching a big data idea and the meeting ended when we all realized that if it fits on a $50 thumb drive, it isn't big data.
You can call data the new oil when someone invades a country to secure a data center.
I don't think anything fell out of favor, and things are a long way from data being "your data only", although you have been given some rights in that regard.
Nothing happened to it. Big data always represented pushing the boundaries of what could be done when dealing with large amounts of data. After a while the technology matured to the point where working with large datasets just became something you did. There was a lot of hype to it, and many organizations unnecessarily went along for the ride. It's also a balance between current technology and the economics of compute, storage, networking, etc. As the balance changes, what you do and how you do it also changes.
The true "big data" became so ubiquitous and accessible that there became no reason for anyone to care about it outside the bubble of Silicon Valley. It's just data, and really was all along.
It's amazing to see that the persistent volume claim used for logging is nowadays, on average, much bigger than the average dedicated machine was about 10 years ago.
It's true that privacy regulations have made personally identifiable information (PII) into something that is challenging to store, like radioactive waste.
But most of the world's big data is not PII. For example, the huge amount of data being produced by modern telescopes and particle physics labs is about things like stars and subatomic particles, not people.
The world has less than 8e9 people, but there are around 1e11 stars in our galaxy, and there are more than 1e11 galaxies in the observable universe.
People tried to define big data in terms of the size of the data set. The best definition of big data I've heard is "data whose storage and/or processing cannot be handled by one physical machine and therefore needs distributed storage and/or processing".
That's a lot of data. Most people and companies are not dealing with big data.
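A quick back-of-the-envelope check makes the point (the row count, record size, and machine specs below are just illustrative assumptions, not from any real system):

    # Rough check of the "fits on one physical machine" definition.
    rows = 200_000_000            # e.g. a few years of event data (assumed)
    bytes_per_row = 200           # rough average for a wide-ish record (assumed)
    dataset_bytes = rows * bytes_per_row

    ram_bytes = 64 * 1024**3      # a 64 GB workstation
    disk_bytes = 2 * 1024**4      # a 2 TB SSD

    print(f"dataset: {dataset_bytes / 1024**3:.1f} GiB")   # ~37 GiB
    print("fits in RAM: ", dataset_bytes < ram_bytes)      # True
    print("fits on disk:", dataset_bytes < disk_bytes)     # True
    # By the definition above, 200 million rows is not big data.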
Kind of like everything being "blockchain" at one point. Eventually people realized that the word has a specific meaning that does not apply to many things.
It doesn't make much sense to me, as I've never seen anyone use anything you'd find in an AI book that you wouldn't also find in a machine learning book.
For big data, I think the terminology waned, but data engineers internalized the desire to scale everything they make to handle big data. So data engineering teams are still using things like Spark (or Databricks) even if their datasets aren't big enough to need it.
For example: https://www.horizon-europe.gouv.fr/extreme-data-mining-aggre...
But most people work on small to medium data.
- catch-all excuse to record everything forever without having an idea how to use that data
- actually hard problems
Otherwise known as "you tend to find what you're looking for": hidden biases in the query will ignore data that doesn't support it.
"The advances in computing have made it easier to accomplish tasks that were completely unnecessary before"
1. It morphed into ML, as the dirty secret of most ML projects is that they're predominantly about data. Put another way, you can't derive a model from nothing.
2. You mentioned privacy scandals, but things like CCPA and GDPR legitimately did make larger corporations pause and ask "Do we actually need this information?", whereas prior to that everyone was a hoarder "just in case".
A lot of companies didn't even need to go the Hadoop route. CSVs, Jupyter notebooks, and SQL databases are very powerful tools for most companies.
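As a small sketch of that workflow (the file, table, and column names are invented for the example): load a CSV export into SQLite and answer a business question with plain SQL.

    import sqlite3
    import pandas as pd

    # Load a CSV export into a local SQLite database.
    orders = pd.read_csv("orders.csv")                  # hypothetical export
    conn = sqlite3.connect("analytics.db")
    orders.to_sql("orders", conn, if_exists="replace", index=False)

    # Answer a business question with plain SQL.
    top_customers = pd.read_sql_query(
        """
        SELECT customer_id, SUM(amount) AS revenue
        FROM orders
        GROUP BY customer_id
        ORDER BY revenue DESC
        LIMIT 10
        """,
        conn,
    )
    print(top_customers)
    conn.close()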
We nonchalantly spin up massive 1 TB+ RAM clusters to process our data without really appreciating how much data that actually is.
Big oil. Big bad. Big lie. Big brother. Big apple. Etc...
Hype.