HACKER Q&A
📣 dark7

How big of a problem is unstructured data for companies?


I read somewhere that 90% of companies have unstructured data such as documents, PDFs, videos, images, audio clips, and other content, and that this will be a big obstacle for AI.

Are there companies already in this space?

Trying to see if there's something here before I possibly create an MVP.


  👤 danjl Accepted Answer ✓
The trick is finding a problem where providing structure to the data increases revenue or decreases costs. Sure, it would be great to bring structure to assets, but you can't just provide search or labeling. You have to figure out how providing the structure actually brings value to those companies. You'd hope they would do that for you, but you need to figure it all out, at least for one set of customers who will pay you, before you build the MVP. The details of how it benefits the company have a profound effect on the design of the MVP, including how to access the assets and how to expose the structure in the UX.

👤 edmundsauto
I work for big tech. Our problem is not the unstructured nature of the data; it is the ratio of noise to signal. Basic ranking and information retrieval are implemented; we have LLM/RAG systems that can be queried. However, it’s hard to evaluate what is good, up-to-date information: 98% of the documents people kick out are not useful.

👤 evanjrowley
It's an even bigger obstacle for data management, particularly classification and loss prevention. Comparatively, it's less of an obstacle for AI, which will most likely be a game changer for addressing those other issues.

👤 datadrivenangel
For most companies, the unstructured 'data' is barely information, let alone data, let alone valuable.

Most companies have internal training videos, recordings of meetings, and PDFs of policies, and basically all of these are worthless within a year from a business perspective. Some things are useful for longer, or for reasons of historical interest, but the half-life is short. The real value lies in what those could potentially represent, like decisions or events that would benefit from action. If a meeting has a decision that some executive would want to know about, maybe a summary of the transcript could be useful?

Turning the random memoranda generated as part of business into valuable insights without process redesign is a pipe dream in most scenarios. Not all, though.


👤 constantinum
Unstract is trying to solve this problem by fully leveraging the LLM stack. It is open source: https://github.com/Zipstack/unstract

👤 aworks
Ben Thompson suggests Palantir as a company that can leverage deep enterprise data with AI.

https://stratechery.com/2024/enterprise-philosophy-and-the-f...


👤 muzani
I would think that the main purpose of multimodal AI like GPT-4o is to be able to process data from videos, images, audio, etc.

👤 theGnuMe
There are companies, but it is a wide-open space. A legal AI startup was sold last year for a billion or so...