HACKER Q&A
📣 calculito

RAG and unstructured data from several docs


I don't have a deep understanding of RAG and related concepts. My approach is always to start by assuming no solution exists and then to work out how I would solve the problem myself. When it comes to RAG, I ask myself the following questions:

When I have two paragraphs from one document, say p1 and p2, should I analyze them individually or first determine if they share some common information? If they do share information, should I then evaluate the common information (p1|2), p1 excluding the common information (p1-p1|2), and p2 excluding the common information (p2-p1|2)? This approach aims to reduce redundancy, but does it make sense?
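To make that concrete, here is a rough sketch of the splitting I have in mind, done at the sentence level with plain token overlap (the threshold and helper names are placeholders, not a recommendation):

    # Rough sketch of the p1/p2 splitting idea: sentence-level comparison
    # using plain token overlap, no NLP or LLM involved.
    import re

    def sentences(paragraph):
        return [s.strip() for s in re.split(r"[.!?]+", paragraph) if s.strip()]

    def jaccard(a, b):
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

    def split_common(p1, p2, threshold=0.6):
        s1, s2 = sentences(p1), sentences(p2)
        pairs = [(a, b) for a in s1 for b in s2 if jaccard(a, b) >= threshold]
        shared1 = {a for a, _ in pairs}
        shared2 = {b for _, b in pairs}
        return {
            "p1|2": pairs,                                   # shared information
            "p1-p1|2": [s for s in s1 if s not in shared1],  # unique to p1
            "p2-p1|2": [s for s in s2 if s not in shared2],  # unique to p2
        }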

Additionally, I would assign a 'label' to each paragraph. For example, p1 could have the label 'llm', p2 the label 'RAG', and p3 the label 'llm'. Using these labels as a primary filter might increase the system's speed. The label could also be an array of relevant words representing the essence of the paragraph. By finding related paragraphs through labels, I would know that p1 and p3 are somehow related. Again, does this approach make sense?
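In code, the label idea would look roughly like this (the labels here are hand-assigned placeholders):

    # Each paragraph gets an array of labels; an inverted index maps labels
    # to paragraph ids, so p1 and p3 (both labelled 'llm') can be found
    # without scanning every paragraph.
    from collections import defaultdict

    labels = {
        "p1": ["llm"],
        "p2": ["rag"],
        "p3": ["llm"],
    }

    index = defaultdict(set)
    for pid, tags in labels.items():
        for tag in tags:
            index[tag].add(pid)

    def related(pid):
        # paragraphs sharing at least one label with pid
        return {other for tag in labels[pid] for other in index[tag]} - {pid}

    print(related("p1"))  # {'p3'}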

Furthermore, regarding re-ranking or re-chunking, should the database, whether it's a vector DB, a knowledge graph, or a hybrid, be highly dynamic or rather static?

Another question: when comparing two paragraphs, p1 and p2, should the comparison be at the paragraph level, or should it also go word by word? For example, consider the sentences s1 = "Dog sits here", s2 = "Dog sits there", and s3 = "Dog, sit!". Without using NLP or an LLM, simply comparing the words shows that s1 and s2 are closer to each other than to s3. Would an additional layer of comparison be helpful? How should punctuation be handled, for example?
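This is the kind of word-level comparison I mean, with punctuation stripped before comparing (purely illustrative):

    import string

    def tokens(sentence):
        # drop punctuation, lowercase, split on whitespace
        table = str.maketrans("", "", string.punctuation)
        return set(sentence.translate(table).lower().split())

    def overlap(a, b):
        ta, tb = tokens(a), tokens(b)
        return len(ta & tb) / len(ta | tb)

    s1, s2, s3 = "Dog sits here", "Dog sits there", "Dog, sit!"
    print(overlap(s1, s2))  # 0.5  -> 'dog' and 'sits' are shared
    print(overlap(s1, s3))  # 0.25 -> only 'dog' survives; 'sit' != 'sits'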


  👤 newpeak Accepted Answer ✓
Tagging each paragraph is not a good approach. Instead, using an LLM to generate a summary based on clustering of paragraphs could be a good alternative. That's what RAPTOR (https://arxiv.org/html/2401.18059v1) suggests.
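A minimal sketch of that cluster-then-summarize step (embed() and llm_summarize() are placeholders for whatever embedding model and LLM you use; RAPTOR itself uses soft clustering and builds a recursive tree, which this skips):

    from sklearn.cluster import KMeans
    import numpy as np

    def build_summary_layer(paragraphs, embed, llm_summarize, n_clusters=4):
        # embed every paragraph, cluster the embeddings, summarize each cluster
        vectors = np.array([embed(p) for p in paragraphs])
        assignments = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(vectors)
        summaries = []
        for c in range(n_clusters):
            members = [p for p, a in zip(paragraphs, assignments) if a == c]
            summaries.append(llm_summarize("\n\n".join(members)))
        return summaries  # index these alongside the original paragraphs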

Regarding the reranker: it is a purely dynamic step, unlike re-chunking, which requires all data to be reindexed.

As for comparing two paragraphs: in most cases it's based on embeddings, meaning each paragraph corresponds to a single embedding, so the comparison does not take individual words into account. However, if you adopt hybrid search, which uses full-text search as an additional recall path, words do factor into the ranking. In that case, the scores are computed from TF/IDF metrics within the paragraph, as an accumulated score over all hit tokens.
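Roughly, that hybrid scoring looks like this (embed() is a placeholder; production systems usually use BM25 rather than raw TF-IDF, and tune or learn the weight):

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    def hybrid_scores(query, paragraphs, embed, alpha=0.5):
        # dense score: cosine similarity between query and paragraph embeddings
        q = np.array(embed(query))
        P = np.array([embed(p) for p in paragraphs])
        dense = P @ q / (np.linalg.norm(P, axis=1) * np.linalg.norm(q) + 1e-9)

        # lexical score: accumulated TF-IDF weight of the query tokens that hit
        tfidf = TfidfVectorizer()
        doc_matrix = tfidf.fit_transform(paragraphs)
        q_vec = tfidf.transform([query])
        lexical = (doc_matrix @ q_vec.T).toarray().ravel()

        return alpha * dense + (1 - alpha) * lexical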


👤 calculito
Thanks for the comments. Regarding re-chunking: I assume the models, strategies, and dependencies are continuously checked, improved, and adapted, so I would expect re-chunking to become necessary as well. Not every day, of course, but how can I assess the quality of my chunking? Running benchmarks on two different chunking strategies in parallel? That would mean double work, continuously building and using two different chunk systems. Any ideas on how to evaluate the quality and performance of my chunking strategy?
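The only concrete evaluation I can picture so far is an offline one along these lines, so I don't have to run two live systems (retrieve() and the eval set are placeholders for my own retriever and labeled questions):

    def hit_rate_at_k(eval_set, retrieve, k=5):
        """eval_set: list of (question, expected_passage) pairs.
        retrieve(question, k) -> list of retrieved chunk strings."""
        hits = 0
        for question, expected in eval_set:
            chunks = retrieve(question, k)
            # count a hit if any retrieved chunk contains the expected passage
            if any(expected in chunk for chunk in chunks):
                hits += 1
        return hits / len(eval_set)

    # score_a = hit_rate_at_k(eval_set, retriever_for_chunking_a)
    # score_b = hit_rate_at_k(eval_set, retriever_for_chunking_b)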

👤 vissidarte_choi
There are numerous strategies and methods for improving RAG performance, particularly when parsing large amounts of unstructured data, and different scenarios call for different parsing techniques. I would suggest exploring a RAG project that excels at document parsing: https://github.com/infiniflow/ragflow