When I have two paragraphs from one document, say p1 and p2, should I analyze them individually or first determine if they share some common information? If they do share information, should I then evaluate the common information (p1|2), p1 excluding the common information (p1-p1|2), and p2 excluding the common information (p2-p1|2)? This approach aims to reduce redundancy, but does it make sense?
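To make the decomposition concrete, here is a minimal sketch of the idea, assuming (for illustration only) that sentence-level set operations are a good-enough stand-in for real overlap detection; in practice you would detect shared information with embeddings or fuzzier matching:

```python
# Hypothetical sketch: split two paragraphs into shared vs. unique
# sentences using plain set operations.

def decompose(p1: str, p2: str):
    s1 = {s.strip() for s in p1.split(".") if s.strip()}
    s2 = {s.strip() for s in p2.split(".") if s.strip()}
    common = s1 & s2      # p1|2: information shared by both paragraphs
    only1 = s1 - common   # p1 - p1|2: unique to p1
    only2 = s2 - common   # p2 - p1|2: unique to p2
    return common, only1, only2

common, only1, only2 = decompose(
    "Dogs bark. Cats meow.",
    "Dogs bark. Birds sing.",
)
# common == {"Dogs bark"}, only1 == {"Cats meow"}, only2 == {"Birds sing"}
```

The three resulting pieces could then be indexed separately, so the shared information is stored once instead of twice.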
Additionally, I would assign a 'label' to each paragraph. For example, p1 could have the label 'llm', p2 could have the label 'RAG', and p3 could have the label 'llm'. Using these labels as primary parameters might increase the system's speed. The label could also be an array of relevant words, representing the essence of the paragraph. By finding related paragraphs through labels, I would know that p1 and p3 are somehow related. Again, does this approach make sense?
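The label idea amounts to an inverted index: finding related paragraphs becomes a dictionary lookup instead of an all-pairs comparison. A minimal sketch, with the p1/p2/p3 labels from above (the structure here is an illustration, not a prescribed design):

```python
from collections import defaultdict

# id -> label, using the example labels from the question
paragraphs = {"p1": "llm", "p2": "RAG", "p3": "llm"}

# Inverted index: label -> set of paragraph ids
by_label = defaultdict(set)
for pid, label in paragraphs.items():
    by_label[label].add(pid)

def related(pid: str) -> set:
    """Paragraphs sharing pid's label, excluding pid itself."""
    return by_label[paragraphs[pid]] - {pid}

related("p1")  # -> {"p3"}: p1 and p3 share the label "llm"
```

If the label is an array of words instead of a single tag, each word would get its own entry in the index and two paragraphs would count as related when their label sets overlap.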
Furthermore, regarding re-ranking or re-chunking, should the database, whether it's a vector DB, a knowledge graph, or a hybrid, be highly dynamic or rather static?
Another question: When comparing two paragraphs, p1 and p2, should the comparison be at the paragraph level, or should it also be word by word? For example, consider the sentences "Dog sits here", "Dog sits there", and "Dog, sit!". Without using NLP and LLM, simply comparing the words shows that s1 and s2 are closer to each other than to s3. Would an additional layer of comparison be helpful? How should punctuation be handled, for example?
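One simple way to do this word-level comparison without NLP or an LLM is Jaccard similarity over token sets, with punctuation stripped and casing normalized first. A minimal sketch using the three example sentences:

```python
import string

def tokens(s: str) -> set:
    """Lowercase, strip punctuation, split on whitespace."""
    cleaned = s.lower().translate(str.maketrans("", "", string.punctuation))
    return set(cleaned.split())

def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)

s1, s2, s3 = "Dog sits here", "Dog sits there", "Dog, sit!"
jaccard(s1, s2)  # 0.5  -- {dog, sits} shared out of 4 words
jaccard(s1, s3)  # 0.25 -- only {dog} shared ("sit" != "sits")
```

This reproduces the intuition that s1 and s2 are closer to each other than to s3, and handles punctuation by removing it before tokenizing. Note that without stemming, "sit" and "sits" count as different words; whether that is acceptable depends on the use case.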
Regarding the reranker: it is a purely dynamic solution, unlike re-chunking, which requires all the data to be reindexed.
Comparing two paragraphs is, in most cases, based on embeddings: each paragraph corresponds to a single embedding, so the comparison does not take individual words into account. However, if you adopt hybrid search, which uses full-text search as an additional recall approach, words do factor into ranking. In that case, the scores are computed from TF/IDF metrics within the paragraph, accumulated over all matched tokens.
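A toy sketch of that accumulated TF/IDF scoring, to show how each matched token contributes to the total (real full-text engines use more refined formulas such as BM25; this is only the basic idea):

```python
import math

# Toy corpus of "paragraphs"
docs = ["dog sits here", "dog sits there", "dog sit"]

def idf(term: str) -> float:
    """log(N / document frequency); 0 if the term appears nowhere."""
    df = sum(term in d.split() for d in docs)
    return math.log(len(docs) / df) if df else 0.0

def score(query: str, doc: str) -> float:
    """Accumulate tf * idf over every query token that hits the doc."""
    words = doc.split()
    total = 0.0
    for term in set(query.split()):
        tf = words.count(term) / len(words)  # term frequency in this doc
        total += tf * idf(term)              # each hit token adds its share
    return total
```

Note that a term appearing in every document (like "dog" here) has an IDF of zero and contributes nothing, which is exactly why TF/IDF rewards distinctive matches over common ones.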