HACKER Q&A
📣 albert_e

If we train an LLM with “data” instead of “language” tokens


Is it just traditional ML?

Traditional ML would do a lot of feature extraction and engineering -- from a very specific problem space -- before throwing the training compute at it. I think there are good pattern detection, prediction, and anomaly detection models that come out of this.

What happens if we just scrape all data (say, metrics like weather, flights, population stats, GPS locations, web page clicks, deep space network observations, and kindergarten grades) and if it were possible to build a model with enough weights for all this diverse data ...

What kind of use case might such a ... Large DATA model ... open up?


  👤 alexwatson405 Accepted Answer ✓
Hey there! Co-founder of Gretel.ai here, and I think I can provide some insights on this topic.

Firstly, the concept you're hinting at is not purely traditional ML. In traditional machine learning, we often prioritize feature extraction and engineering specific to a given problem space before training.

What you're describing, and what we've been working on at Gretel.ai, is leveraging the power of models like Large Language Models (LLMs) to understand and extrapolate from vast amounts of diverse data without the need for time-consuming feature engineering. Here's a link to our open-source library for synthetic data generation (currently supporting GAN and RNN-based language models): https://github.com/gretelai/gretel-synthetics. And here's our recent announcement around a Tabular LLM we're training to help people build with data: https://gretel.ai/tabular-llm

A few areas where we've found tabular or Large Data Models to be really useful are:

* Creating privacy-preserving versions of sensitive data

* Creating additional labeled examples for ML training (much less expensive than traditional data collection/ML techniques)

* Augmenting existing datasets with new fields, cleaning data, and filling in missing values (see the sketch after this list for a classical baseline on that last point)
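For that last point, a classical baseline looks something like this. It's a minimal sketch using scikit-learn's IterativeImputer with made-up numbers -- generic tooling, not our method -- just to show the shape of model-based imputation:

    # Minimal sketch: model-based imputation of missing values.
    # Generic scikit-learn baseline, not Gretel's approach.
    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer

    X = np.array([[25.0, 50000.0],
                  [32.0, np.nan],     # missing income
                  [np.nan, 61000.0],  # missing age
                  [41.0, 72000.0]])

    # Each feature with missing values is modeled as a function of the others.
    imputer = IterativeImputer(random_state=0)
    X_filled = imputer.fit_transform(X)
    print(X_filled)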

There are lots of mentions of RLHF in the threads here. One area where I think RLHF will be super helpful is in ensuring that LLM data models return diverse and ethically fair results (hopefully better than the data they were trained on). Cheers!


👤 tomohelix
RLHF is a very important part of the training process. Without it, the LLMs would definitely behave much more like a glorified autocomplete than something resembling "intelligence".

So the answer is that if you just feed it a bunch of data, it might be good at spitting out some similar-looking data but ultimately has little use. If you manage to get some experts to sit with it, give it instructions, and explain the data, i.e., relate data to concepts, then you may get a model that can generate interesting things based on natural language, because it was taught how to interpret the data with natural language.

That is the next crop of AI, I believe. For example, in biotech, people are feeding models protein sequences and teaching them what each protein is and how to relate certain concepts to the sequences. The result is an AI that can generate new proteins with certain characteristics and functions.


👤 jmbiven
Jordan Volz wrote an article speculating about this earlier this year. First time I heard the term Large Data Model, which I love and plan on using.

https://medium.com/@jordan_volz/who-owns-the-future-looking-...

"I won’t claim it’s simple, but as we’ve built AI that can understand human language, we can similarly build AI that understands data...Instead of building a large language model, we instead can build large data models (LDMs?)."


👤 v9v
There are some robotics researchers trying to frame robotics tasks as sequences of action and sensor tokens: give the model the sensor data the robot collected and the actions it took, so the model can learn to predict the action tokens from the sensor info. Here's a blog post from a researcher in this field reviewing a relevant paper: https://evjang.com/2023/06/22/robocat.html
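As a toy illustration of that framing (bin sizes and value ranges here are made up, not from the paper), you can discretize continuous sensor and action readings into one shared token vocabulary and interleave them into a single sequence:

    import numpy as np

    N_BINS = 256  # tokens 0..255 for sensors, 256..511 for actions

    def to_tokens(values, lo, hi, offset=0):
        """Discretize continuous values into integer token ids."""
        bins = ((values - lo) / (hi - lo) * N_BINS).astype(int)
        return np.clip(bins, 0, N_BINS - 1) + offset

    sensors = np.array([0.12, 0.57, 0.91])   # e.g. joint angles, normalized
    actions = np.array([-0.3, 0.8])          # e.g. motor commands

    # One flat token sequence; a sequence model is then trained to
    # predict the action tokens that follow the sensor tokens.
    seq = np.concatenate([to_tokens(sensors, 0.0, 1.0),
                          to_tokens(actions, -1.0, 1.0, offset=N_BINS)])
    print(seq)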

👤 gwern
This is an old idea dating back to the earliest algorithmic information theory work in the '50s/60s like Ray Solomonoff on universal induction: given a powerful learning algorithm, create AI by encoding all possible data as a binary stream (with no further preprocessing or engineering), and simply predict each successive bit. This leads to AIXI, the compression-as-intelligence paradigm (and online learning / meta-learning), and most recently, the success of self-supervised learning & generative modeling.

See for example Burfoot https://arxiv.org/abs/1104.5466 or Schmidhuber https://arxiv.org/abs/1511.09249#schmidhuber + https://arxiv.org/abs/1802.08864#schmidhuber (among many others) or (if I may shill my own work) https://gwern.net/aunn


👤 PaulHoule
Current LLMs seem to struggle with tabular data, but there's got to be some answer.

A while ago I worked for a startup that was building foundation models for tabular data, but its approach was to look at the header and the cell of a CSV file to assign some semantics to an individual cell. It competed with more traditional profiling tools that looked at columns as a whole. With that system we'd sometimes treat tabular data using ordinary ML algorithms from scikit-learn.
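For reference, that "ordinary ML" route is roughly this (a generic sketch on synthetic data, not the startup's actual pipeline):

    # Classical tabular ML: engineered features in, one model per task.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Stand-in for a problem-specific, feature-engineered table.
    X, y = make_classification(n_samples=500, n_features=10, random_state=0)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
    print(clf.score(X_test, y_test))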


👤 blovescoffee
There's a lot to unpack here. First, consider that to the LLM/computer, each token is just "data". Second, consider that LLMs are mostly matmuls, dot products, etc., so there has to be some structure to the data, and the dimensions of the data have to make some sense -- but I suppose you could just torch.cat a lot of points (kind of). Anyways, if you want to read about real-world examples of what you're suggesting, check out something like the following foundation model [0] and expand from there.

[0] https://www.earthdata.nasa.gov/news/impact-ibm-hls-foundatio...
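To make the torch.cat point concrete, here's a minimal sketch (dimensions invented) that projects two differently-shaped numeric "modalities" to a common width so they can be concatenated into one sequence a transformer could consume:

    import torch
    import torch.nn as nn

    d_model = 64

    # Two "modalities" with different feature widths.
    weather = torch.randn(10, 3)   # 10 timesteps x 3 metrics
    clicks  = torch.randn(25, 5)   # 25 events x 5 features

    # Per-modality linear projections to a common dimension...
    proj_w = nn.Linear(3, d_model)
    proj_c = nn.Linear(5, d_model)

    # ...so the pieces can be concatenated into a single sequence.
    seq = torch.cat([proj_w(weather), proj_c(clicks)], dim=0)  # (35, d_model)
    print(seq.shape)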


👤 monkeydust
It's funny how quickly generative AI has cast 'numeric' ML into the shadows as 'traditional ML'.

👤 bluecoconut
I love this question, because this is exactly what I'm currently focused on doing!

I founded approximatelabs essentially to chase this down and show how much rich fruit there is in this space. See approximatelabs.com

It is not traditional ML to me, which has a "row-wise" way of thinking. Instead, we are thinking columnar-ly (like you would if you were an analyst), and representing entities and attributes as tokens (not individual features). To me, describing "data" requires creating tokens that represent the information of the whole of the data. There are a few different ways of approaching this, but technically, the way we are approaching it is to represent columns via aggregations, both exact and approximate (sketching algorithms like HLL, TDigest, etc.). These columns now have representations that transformers can work with the same way they look at "tokens" of language; now we have "tokens of data" (also think multi-modal patch embeddings with images, but with columns of your data).
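A very rough sketch of that idea (my own simplification with made-up numbers, not our actual model): summarize each column with fixed-size aggregates, then project each summary into the model's embedding space so it behaves like a token:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"age": [25, 32, 41, 19],
                       "income": [50_000, 61_000, 72_000, 28_000]})

    def column_summary(col: pd.Series) -> np.ndarray:
        """Fixed-size 'sketch' of a column: exact aggregates here; a real
        system would add approximate sketches (HLL, TDigest, ...)."""
        return np.array([col.mean(), col.std(), col.min(),
                         col.max(), col.nunique()], dtype=np.float64)

    rng = np.random.default_rng(0)
    W = rng.normal(size=(5, 64))  # a learned projection in a real model

    # One "data token" (a 64-d embedding) per column.
    tokens = np.stack([column_summary(df[c]) @ W for c in df.columns])
    print(tokens.shape)  # (2, 64)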

To power this, we are actively scraping "all of the data", as you say, and have over 100M tables in our store, and hopefully will push this to over 1B in the next week or so (about to run our next big parse job). We plan to open this dataset to the public / researchers very soon, so looking forward to that. Shortly after that we plan to publish some papers about the actual details of our multi-modal tabular-data foundation models and the techniques we have tried and learned from!

Towards the last part of your question: what use cases would this model open up? We believe the entire data stack (currently the "modern data stack") is about to be revolutionized in the way that LLMs revolutionized "symbolic" language models. Think catalogs that are actually smart and semantic (not just based on the semantics of the metadata, but natively understanding the content of the data), observability that lives in embedding space and can self-explain deviations, and most importantly: accessibility. Since LLMs have proven they can lower the accessibility barrier to technical concepts, we are expecting and planning on our product being "the new Excel", where everyday people can actually leverage the analytic power of things like pandas and SQL to answer their questions and trust the results.

Heavy bias, but I think the space is about to get very, very hot. Early competitors are starting to pop up around the chat-bot use case, and we believe we have a huge edge because we're making our own models + starting at the fundamentals.

Happy to answer any other questions too!


👤 boberoni
(Disclaimer: I am not an AI expert.) To me, LLMs are a special form of "self-supervised learning"[1], which is a special form of unsupervised learning. While supervised learning aims to use input data to predict labels, self-supervised learning uses the context around the input data to predict future data. The key part here is context. This excellent article [2] by Yann LeCun et al. explains how self-supervised learning uses temporal and spatial context to extract patterns, which can then be used to predict future data or generate new data. In fact, prediction and generation go hand in hand. So, you could definitely apply the techniques of LLMs to arbitrary streams of data with meaningful temporal or spatial patterns.

[1] https://en.wikipedia.org/wiki/Self-supervised_learning

[2] https://ai.meta.com/blog/self-supervised-learning-the-dark-m...
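As a bare-bones example of that on a numeric stream (a sketch using a linear predictor instead of a deep net): the "label" is just the next value, and the "input" is the recent context, so no human labeling is needed:

    import numpy as np

    series = np.sin(np.linspace(0, 20, 500))  # any stream with temporal structure
    W = 8  # context window

    # Self-supervision: inputs are windows of the stream, targets are
    # the very next value.
    X = np.stack([series[i:i + W] for i in range(len(series) - W)])
    y = series[W:]

    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    pred = X[-1] @ coef  # predict the step after the last window
    print(pred, y[-1])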


👤 iandanforth
Let's break this down. Large Language Models are almost all transformer models. Transformer models are sequence prediction models. There are other kinds of sequence prediction models, like RNNs. Transformers are not tied to tokens, and neither are RNNs. You can use them to predict sequences of values directly; you don't need high-dimensional representations as input.

Now, why do LLMs work? They are exploiting structure. Language has structure, and language represents other structures. At a large enough scale, language models can learn and use that structure. There's quite a bit of consistency in language. You can train a model to be fluent (grammatically correct) without massive scale, but the output often lacks meaning or coherency. It takes a lot of training to learn that "reading" a book and "reading" from a spinning hard drive platter are conceptually similar and extremely different in their details.

So, can you use a transformer with raw numerical data? Yes. Can you train a very large model on a very large amount of numerical data? Yes. Would you expect that training across a mixed corpus of raw numerical data would lead to a similar kind of 'understanding' to what we've seen in LLMs? No, but it is not impossible.
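Concretely, the raw-values version can look like this (a minimal PyTorch sketch; the hyperparameters are invented and positional encoding is omitted for brevity). The token embedding becomes a linear projection and cross-entropy becomes MSE:

    import torch
    import torch.nn as nn

    class ValueTransformer(nn.Module):
        """Next-value prediction over raw floats instead of tokens."""
        def __init__(self, d_model=64, nhead=4, layers=2):
            super().__init__()
            self.inp = nn.Linear(1, d_model)   # raw values in, no vocab
            enc = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            self.encoder = nn.TransformerEncoder(enc, num_layers=layers)
            self.out = nn.Linear(d_model, 1)   # raw values out

        def forward(self, x):                  # x: (batch, seq, 1)
            mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
            h = self.encoder(self.inp(x), mask=mask)  # causal self-attention
            return self.out(h)                 # per-position next-value guess

    model = ValueTransformer()
    x = torch.randn(2, 16, 1)                  # any numeric sequences
    loss = nn.functional.mse_loss(model(x)[:, :-1], x[:, 1:])  # shift by one
    loss.backward()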

Recall that language is just a layer on top of sensory processing. It helps a great deal but there is plenty of "intelligence" in the animal kingdom without it.

My personal opinion is that helping models build their own internal representations via curriculum learning, and making sure a model is sent data that is generally related to, or correlated with, other data, is very useful. If you don't have the aid of a universal semantic representation scheme like language, why not try your best to make it easy for the model to build one?


👤 kylebenzle
We (myself included) have been doing this EXACT thing for a long time with the stock market. We take ALL the data, build a model, then ask for an output buy/sell signal.

The idea for using this type of model came from quants; applying it to language was the new idea. Now it sounds like you are saying "what if" we took a step backwards, but again, this is exactly how we use these models now: trained on data.


👤 sskates
I am thinking about this problem at Amplitude. The holy grail of the martech space has always been to predict someone's next action based on what they've done before. What's exciting is that we have one of the largest datasets to be able to do that. Done right, you could have products proactively take those actions, or change themselves to make those actions easier. If anyone is interested in this problem, please reach out!

More on it here: https://amplitude.com/blog/AI-powered-product-development
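As a toy baseline for next-action prediction (my own example, not Amplitude's stack), even a bigram model over event streams gets at the shape of the problem:

    from collections import Counter, defaultdict

    # Hypothetical per-user event streams.
    sessions = [["open", "search", "add_to_cart", "checkout"],
                ["open", "search", "search", "add_to_cart"],
                ["open", "browse", "add_to_cart", "checkout"]]

    follows = defaultdict(Counter)
    for s in sessions:
        for prev, nxt in zip(s, s[1:]):
            follows[prev][nxt] += 1

    def predict_next(action):
        """Most common action observed after `action`."""
        return follows[action].most_common(1)[0][0]

    print(predict_next("add_to_cart"))  # -> "checkout"

A large sequence model is essentially this idea with far more context and far more parameters.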


👤 badrabbit
I hope this doesn't become a thing; ML is already being abused to track people. I have no problem with the technology itself, but with the lack of laws to address its abuse.

Imagine large models trained on ad targeting + click data, CCTV, wifi + location data, sensor and audio crap from phones, geospatial/streetview, wide-area satellite photos, etc. ... and then you combine that with LLMs.

Yeah, very powerful, but not only can it be abused, it can't be scrutinized. And unlike LLM training data, other types of data (especially the private-dataset kind) can't be publicly and independently corroborated in most cases, I'd imagine.


👤 mattew
I’m thinking about data for an LLM from a different angle for ecommerce analytics.

1. Build up a set of instructions on how to interpret a particular type of data in the system prompt

2. Run a set of analyses and, instead of outputting the data in tabular form, output natural-language versions of the metrics (if they are worth thinking about)

3. Pass those natural-language results to the LLM to interpret based on the system prompt

The theory is you can run a number of these analyses and feed the results into another system prompt to do an overall analysis.

It’s a theory at this point but it could work!
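A sketch of steps 1-3 (the prompt text, thresholds, and function names here are hypothetical, and the final LLM call is left as a stub):

    import pandas as pd

    # Step 1: instructions on how to interpret this type of data.
    SYSTEM_PROMPT = """You are an ecommerce analyst.
    A conversion rate below 2% is worth flagging.
    Week-over-week revenue drops above 10% are worth flagging."""

    def metrics_to_sentences(df: pd.DataFrame) -> list[str]:
        """Step 2: turn tabular metrics into natural-language statements."""
        out = []
        for _, row in df.iterrows():
            out.append(f"In week {row['week']}, conversion rate was "
                       f"{row['conv_rate']:.1%} and revenue was ${row['revenue']:,}.")
        return out

    df = pd.DataFrame({"week": [1, 2],
                       "conv_rate": [0.031, 0.017],
                       "revenue": [120_000, 98_000]})

    prompt = SYSTEM_PROMPT + "\n\n" + "\n".join(metrics_to_sentences(df))
    # Step 3 would send `prompt` to an LLM, e.g. response = call_llm(prompt),
    # where call_llm is whatever client you use.
    print(prompt)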


👤 version_five
Look up autoregressive models. There are already forecasting models that do something like this. I don't know if any use attention; it seems like a pretty obvious thing to have done. Overall this isn't a new idea.

Deep learning generally brings a lot less to tabular data because the underlying thing being modelled is much simpler. Language models, by comparison, are effectively modeling the human mind + culture, so there's a lot more to fit the model to.
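For reference, the simplest version is a linear autoregression on lagged values; here's a minimal sketch with statsmodels (toy data, invented lag order):

    import numpy as np
    from statsmodels.tsa.ar_model import AutoReg

    rng = np.random.default_rng(0)
    # Toy series: a random walk, stand-in for any metric stream.
    y = np.cumsum(rng.normal(size=200)) * 0.1

    res = AutoReg(y, lags=3).fit()  # fit an AR(3) model
    fcast = res.predict(start=len(y), end=len(y) + 4)  # 5 steps ahead
    print(fcast)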


👤 h2odragon
Take as an example "bytes" as input, with no further parsing or interpretation given to them. With enough input data and enough time spent iterating on it, you might well come out with decent wordlists, instruction sets, and things like that. If you work the statistics on the wordlists, they give you some table-based stemmer and thesaurus implications too.

If you go in with a trimmed and parsed dataset, it's harder to get (what I call) introspection pattern recognition from it, because you haven't got enough disparate samples to build a decent universe to compare to. What you assume going in stamps its shape on the whole thing.
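In that spirit, a toy sketch of pulling word-like statistics out of raw bytes with nothing but frequency counts:

    from collections import Counter

    data = b"the cat sat on the mat the cat ran"  # stand-in for a raw byte stream

    # Count every byte 4-gram with no parsing or interpretation.
    grams = Counter(data[i:i + 4] for i in range(len(data) - 3))

    # Frequent grams start to look like word fragments.
    for gram, n in grams.most_common(5):
        print(gram, n)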


👤 DanHulton
...we'd get a model that hallucinates weather forecasts?