Traditional ML would do a lot of feature extraction and engineering -- scoped to a very specific problem space -- before throwing training compute at it. I think good pattern detection, prediction, and anomaly detection models come out of this.
What happens if we just scrape all data (say metrics like weather, flights, population stats, GPS locations, web-page clicks, Deep Space Network observations, and kindergarten grades), and suppose it were possible to build a model with enough weights for all this diverse data ...
What kind of use cases might such a ... Large DATA model ... open up?
Firstly, the concept you're hinting at is not purely traditional ML. In traditional machine learning, we often prioritize feature extraction and engineering specific to a given problem space before training.
What you're describing -- and what we've been working on at Gretel.ai -- is leveraging the power of models like Large Language Models (LLMs) to understand and extrapolate from vast amounts of diverse data without time-consuming feature engineering. Here's a link to our open-source library for synthetic data generation (currently supporting GAN- and RNN-based language models): https://github.com/gretelai/gretel-synthetics. See also our recent announcement of a Tabular LLM we're training to help people build with data: https://gretel.ai/tabular-llm
A few areas where we've found tabular or Large Data Models to be really useful are:

* Creating privacy-preserving versions of sensitive data

* Creating additional labeled examples for ML training (much less expensive than traditional data collection/ML techniques)

* Augmenting existing datasets with new fields, cleaning data, filling in missing values
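The last use case can be illustrated with a toy stand-in. A real tabular model would condition on the full row and dataset context when predicting a missing cell; this minimal sketch (my own illustration, not Gretel's method) just imputes each numeric column's mean, but the interface -- rows in, completed rows out -- is the same shape:

```python
# Toy stand-in for "filling in missing values" with a tabular model.
# A real model conditions on row/dataset context; this uses column means.

def impute_missing(rows, columns):
    """rows: list of dicts; columns: numeric column names to impute."""
    means = {}
    for col in columns:
        values = [r[col] for r in rows if r.get(col) is not None]
        means[col] = sum(values) / len(values) if values else 0.0
    filled = []
    for r in rows:
        r = dict(r)  # don't mutate the caller's data
        for col in columns:
            if r.get(col) is None:
                r[col] = means[col]
        filled.append(r)
    return filled

data = [
    {"age": 30, "income": 50000},
    {"age": None, "income": 70000},
    {"age": 40, "income": None},
]
completed = impute_missing(data, ["age", "income"])
print(completed)
```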
Lots of mentions of RLHF in these threads; one area where I think RLHF will be super helpful is ensuring that large data models return diverse and ethically fair results (hopefully better than the data they were trained on). Cheers!
So the answer is: if you just feed it a bunch of data, it might get good at spitting out similar-looking data, but that ultimately has little use. If you get some experts to sit with it, give it instructions, and explain the data -- i.e., relate the data to concepts -- then you may get a model that can generate interesting things from natural language, because it was taught how to interpret the data with natural language.
That is the next crop of AI, I believe. For example, in biotech, people are feeding models protein sequences and teaching them what each protein is and how to relate certain concepts to the sequences. The result is an AI that can generate new proteins with particular characteristics and functions.
https://medium.com/@jordan_volz/who-owns-the-future-looking-...
"I won’t claim it’s simple, but as we’ve built AI that can understand human language, we can similarly build AI that understands data...Instead of building a large language model, we instead can build large data models (LDMs?)."
See for example Burfoot https://arxiv.org/abs/1104.5466 or Schmidhuber https://arxiv.org/abs/1511.09249#schmidhuber + https://arxiv.org/abs/1802.08864#schmidhuber (among many others) or (if I may shill my own work) https://gwern.net/aunn
I worked for a startup a while ago that was building foundation models for tabular data, but it looked at the header and the cell of a CSV file to assign semantics to an individual cell. It competed with more traditional profiling tools that looked at columns as a whole. With that system we'd sometimes treat tabular data using ordinary ML algorithms from scikit-learn.
[0] https://www.earthdata.nasa.gov/news/impact-ibm-hls-foundatio...
I founded approximatelabs essentially to chase this down and show how much rich fruit there is in this space. See approximatelabs.com
It is not traditional ML to me, which has a "row-wise" way of thinking. Instead, we are thinking columnar-ly (like you would if you were an analyst), representing entities and attributes as tokens rather than as individual features. To me, describing "data" requires creating tokens that represent the information of the whole of the data.

There are a few ways of approaching this, but technically, we represent columns via aggregations, both exact and approximate (sketching algorithms like HyperLogLog, t-digest, etc.). These columns then have representations that transformers can work with the same way they treat "tokens" of language -- now we have "tokens of data" (also think multi-modal patch embeddings for images, but with columns of your data).
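To make the "column as token" idea concrete, here is a minimal sketch: each column collapses into a fixed-length vector of summary statistics. I use exact aggregates (including an exact distinct count) where a production system would substitute approximate sketches like HyperLogLog or t-digest; the function names are illustrative, not from any released library:

```python
import statistics

# Sketch of "tokens of data": each column becomes one fixed-size vector
# of aggregates, which a transformer could then treat like a token
# embedding. Exact stats stand in for approximate sketches (HLL, t-digest).

def column_token(values):
    """Return a fixed-size summary vector for one numeric column."""
    n = len(values)
    distinct = len(set(values))  # exact; HyperLogLog would approximate this
    return [
        n,
        distinct,
        min(values),
        max(values),
        statistics.fmean(values),
        statistics.pstdev(values),
    ]

def table_tokens(table):
    """table: dict of column name -> list of numbers. One token per column."""
    return {name: column_token(vals) for name, vals in table.items()}

tokens = table_tokens({
    "age":    [30, 40, 40, 50],
    "income": [50_000, 70_000, 60_000, 80_000],
})
print(tokens["age"])  # [4, 3, 30, 50, 40.0, 7.0710...]
```

The key property is that the vector's length is independent of the column's row count, so tables of any size map to the same "token" shape.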
To power this, we are actively scraping "all of the data," as you say, and have over 100M tables in our store; hopefully we'll push this to over 1B in the next week or so (about to run our next big parse job). We plan to open this dataset to the public and research community very soon, so looking forward to that. Shortly after, we plan to publish papers on the actual details of our multi-modal tabular-data foundation models and the techniques we have tried and learned from!
Towards the last part of your question: what use cases would this model open up? We believe the entire data stack (the current "modern data stack") is about to get the kind of revolution that LLMs brought to "symbolic" language models. Think catalogs that are actually smart and semantic (based not just on the metadata, but on natively understanding the content of the data), observability that lives in embedding space and can self-explain deviations, and most importantly: accessibility. Since LLMs have proven they can lower the accessibility barrier to technical concepts, we are planning on our product being "the new Excel," where everyday people can actually leverage the analytic power of things like pandas and SQL to answer their questions and trust the results. Heavy bias, but I think the space is about to get very, very hot. Early competitors are starting to pop up around the chat-bot use case, and we believe we have a huge edge because we're making our own models and starting at the fundamentals.
Happy to answer any other questions too!
[1] https://en.wikipedia.org/wiki/Self-supervised_learning
[2] https://ai.meta.com/blog/self-supervised-learning-the-dark-m...
Now, why do LLMs work? They are exploiting structure. Language has structure, and language represents other structures. At a large enough scale, language models can learn and use that structure. There's quite a bit of consistency in language. You can train a model to be fluent (grammatically correct) without massive scale, but the output often lacks meaning or coherency. It takes a lot of training to learn that "reading" a book and "reading" from a spinning hard-drive platter are conceptually similar yet extremely different in their details.
So, can you use a transformer with raw numerical data? Yes. Can you train a very large model on a very large amount of numerical data? Yes. Would you expect that training across a mixed corpus of raw numerical data would lead to the kind of 'understanding' we've seen in LLMs? No -- but it is not impossible.
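For the "yes, you can feed a transformer raw numbers" part, the usual first step is turning continuous values into a discrete vocabulary the model can embed. A hedged sketch of one simple scheme -- equal-width binning into token ids (real systems also use quantile binning, digit-level tokenization, or learned continuous embeddings):

```python
# Discretize raw numeric data into a small vocabulary of bin tokens,
# so a transformer can embed them like word tokens. Equal-width bins
# are the simplest choice; quantile bins are common in practice.

def fit_bins(values, n_bins=8):
    """Compute equal-width interior bin edges over the observed range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant data
    return [lo + i * width for i in range(1, n_bins)]

def tokenize(values, edges):
    """Map each number to a token id: the index of the bin it falls in."""
    return [sum(v >= e for e in edges) for v in values]

series = [0.1, 0.2, 3.5, 7.9, 4.4, 0.0]
edges = fit_bins(series, n_bins=4)
token_ids = tokenize(series, edges)
print(token_ids)  # [0, 0, 1, 3, 2, 0]
```

The token ids would then index an embedding table exactly as word-piece ids do; what's lost is within-bin precision, which is one reason numeric 'understanding' is harder to get than linguistic fluency.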
Recall that language is just a layer on top of sensory processing. It helps a great deal but there is plenty of "intelligence" in the animal kingdom without it.
My personal opinion is that helping models build their own internal representations via curriculum learning -- making sure they are sent data that is generally related to, or correlated with, other data -- is very useful. If you don't have the aid of a universal semantic representation scheme like language, why not try your best to make it easy for the model to build one?
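The mechanics of curriculum learning are simple to sketch: score each training sample by some difficulty heuristic and present easier samples first. The heuristic below (length and number of distinct values) is purely illustrative, not from any particular paper:

```python
# Minimal curriculum-learning sketch: order training samples from
# "easy" to "hard" before feeding them to the model, so early training
# sees simpler structure. The difficulty heuristic is illustrative only.

def curriculum_order(samples, difficulty):
    """Stable sort of training samples from easiest to hardest."""
    return sorted(samples, key=difficulty)

sequences = [[5, 9, 1, 7], [2], [3, 3], [8, 8, 8]]

# Heuristic: longer sequences with more distinct values are "harder".
ordered = curriculum_order(sequences, lambda s: (len(s), len(set(s))))
print(ordered)  # [[2], [3, 3], [8, 8, 8], [5, 9, 1, 7]]
```

In practice the curriculum also groups correlated data together, per the point above, rather than sorting on a single scalar.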
Using this type of model on data came first, from quants; applying it to language -- THAT was the new idea. Now it sounds like you are asking "what if" we took a step backwards, but again, this is exactly how we already use these models: trained on data.
More on it here: https://amplitude.com/blog/AI-powered-product-development
Imagine large models trained on ad-targeting and click data, CCTV, WiFi and location data, sensor and audio crap from phones, geospatial/streetview, wide-area satellite photos, etc. ... and you combine that with LLMs.
Yeah, very powerful -- but not only can it be abused, it can't be scrutinized. And unlike LLM training data, other types of data (especially the private-dataset kind) can't be publicly and independently corroborated in most cases, I'd imagine.
1. Build up a set of instructions on how to interpret a particular type of data in the system prompt
2. Run a set of analyses and, instead of outputting the data in tabular form, output natural-language versions of the metrics (if they are worth thinking about)
3. Pass the results to the LLM to interpret based on the system prompt
The theory is you can run a number of these analyses and feed the results into another system prompt to do an overall analysis.
It’s a theory at this point but it could work!
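The steps above can be sketched as code. Everything here is hypothetical -- the prompt wording, function names, and metrics are mine -- and only the metric-to-sentence step is concrete; an actual run would pass the assembled `prompt` to whatever LLM client you use:

```python
# Hedged sketch of the three-step pipeline: (1) instructions in a system
# prompt, (2) metrics rendered as natural language, (3) combined prompt
# handed to an LLM. The LLM call itself is left out; names are illustrative.

SYSTEM_PROMPT = (
    "You are a data analyst. The metrics below describe weekly sales. "
    "Flag anomalies and summarize the trend."
)

def metrics_to_sentences(metrics):
    """Step 2: render tabular metrics as natural-language statements."""
    return " ".join(f"The metric '{name}' is {value}." for name, value in metrics.items())

def build_prompt(metrics):
    """Step 3: combine the system instructions with the verbalized metrics."""
    return SYSTEM_PROMPT + "\n\n" + metrics_to_sentences(metrics)

prompt = build_prompt({"week-over-week growth": "4.2%", "churn": "1.1%"})
print(prompt)
```

For the "overall analysis" step, you'd collect several of these per-analysis outputs and concatenate them under a second system prompt the same way.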
Deep learning generally brings a lot less to tabular data because the underlying thing being modeled is much simpler; language models are effectively modeling the human mind plus culture, so there's a lot more for the model to fit to.
If you go in with a trimmed and parsed dataset, it's harder to get (what I call) introspective pattern recognition from it, because you haven't got enough disparate samples to build a decent universe to compare to. What you assume going in stamps its shape on the whole thing.