I want to run queries against this matrix, e.g. "what are the rows most similar to row #2?", but I don't have a machine with 100GB of RAM.
What would be the best strategy for running queries on this matrix? My current plan is to use duckdb, but I am wondering if there are more elegant alternatives.
Also consider https://www.pinecone.io/
The obvious way to do a similarity query is a full scan, which is straightforward to run against data on disk if the data is appropriately packaged. It may be slow, but it works.
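As a minimal sketch of that, here is a chunked full scan with numpy's memmap, assuming the matrix was written to disk as a flat, row-major float32 array ("matrix.dat" and the shape are placeholders for your actual data):

    import numpy as np

    N_ROWS, N_COLS = 1_000_000, 300  # placeholder dimensions
    X = np.memmap("matrix.dat", dtype=np.float32, mode="r", shape=(N_ROWS, N_COLS))

    def most_similar(query_idx, k=10, chunk=10_000):
        """Indices of the k rows with highest cosine similarity to row query_idx."""
        q = np.array(X[query_idx])
        q /= np.linalg.norm(q)
        sims = np.empty(N_ROWS, dtype=np.float32)
        for start in range(0, N_ROWS, chunk):
            block = X[start:start + chunk]        # only this slice gets paged in
            norms = np.linalg.norm(block, axis=1)
            norms[norms == 0] = 1.0               # guard against all-zero rows
            sims[start:start + block.shape[0]] = (block @ q) / norms
        top = np.argsort(-sims)[:k + 1]           # k+1 so we can drop the query row
        return [i for i in top if i != query_idx][:k]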
There are n-dimensional indexes that can accelerate similarity queries; Pinecone uses them. They are not as effective as 1-d and 2-d (geospatial) indexes, but they do help.
Sparse similarity search is its own problem, solved by full-text search engines like Lucene.
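A rough brute-force equivalent in scipy shows why the sparse case is different: with CSR storage, each score only involves features the query row actually has, which is the same trick an inverted index (Lucene) exploits. This assumes the compressed sparse matrix fits in RAM; "sparse.npz" is a placeholder.

    import scipy.sparse as sp

    X = sp.load_npz("sparse.npz").tocsr()    # rows x 300K features
    q = X[2]                                 # query: row #2, itself sparse
    scores = (X @ q.T).toarray().ravel()     # dot products against every row
    top10 = scores.argsort()[::-1][1:11]     # best matches, skipping the row itself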
Dimensionality reduction, say going from 300K features to 100 features, might improve the quality of your results as well as compacting your data. It could be something you do once on a monster machine and then do many lookups on a smaller machine. That reduction might take a lot of resources; see
https://scikit-learn.org/stable/modules/decomposition.html
but in your case you might get embarrassingly good results with random projections.
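A sketch of that one-time reduction step with scikit-learn's random projections; n_components=100 and the file names are illustrative, not recommendations. Run this once on the big machine and ship the small matrix to wherever queries happen.

    import numpy as np
    import scipy.sparse as sp
    from sklearn.random_projection import SparseRandomProjection

    X = sp.load_npz("sparse.npz")                    # original 300K-feature matrix
    rp = SparseRandomProjection(n_components=100, dense_output=True, random_state=0)
    X_small = rp.fit_transform(X).astype("float32")  # now 100 features per row
    np.save("reduced.npy", X_small)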
There are libraries for efficient vector similarity search like https://github.com/facebookresearch/faiss
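A hedged faiss example, assuming the rows were already reduced to dense float32 vectors as above ("reduced.npy" is a placeholder). IndexFlatIP is an exact scan; swap in one of faiss's approximate indexes (e.g. IndexIVFFlat, IndexHNSWFlat) for the n-dimensional indexing mentioned earlier.

    import numpy as np
    import faiss

    xb = np.load("reduced.npy")
    faiss.normalize_L2(xb)                   # normalize so inner product == cosine
    index = faiss.IndexFlatIP(xb.shape[1])   # exact inner-product search
    index.add(xb)
    D, I = index.search(xb[2:3], 10)         # top-10 matches for row #2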
The straight on-disk scan works as long as you control the file format (so you can ensure that rows are stored contiguously) and as long as you don't need too many rows in memory at the same time.
https://scikit-learn.org/stable/modules/neighbors.html#unsup...
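A minimal sketch of the scikit-learn route from that link; for cosine distance on sparse input it falls back to a brute-force scan ("sparse.npz" is a placeholder).

    import scipy.sparse as sp
    from sklearn.neighbors import NearestNeighbors

    X = sp.load_npz("sparse.npz").tocsr()
    nn = NearestNeighbors(n_neighbors=11, metric="cosine").fit(X)
    dist, idx = nn.kneighbors(X[2])    # nearest rows to row #2 (includes itself)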