HACKER Q&A
📣 vsroy

Disk-backed NumPy for data analysis?


I have a large similarity matrix. It is 300K by 300K entries, and is ~ 100GB in size. To clarify, this is the result of multiplying a 300K by 512 matrix with its transpose.

I want to make queries on this matrix like (what are the rows most similar to row #2), but I don't have a machine with 100GB of RAM.

What would be the best strategy for running queries on this matrix? My current plan is to use duckdb, but I am wondering if there are more elegant alternatives.


  👤 PaulHoule Accepted Answer ✓
Try https://www.dask.org/

Also consider https://www.pinecone.io/

The obvious way to do a similarity query is a full scan, which can be done in a straightforward way against data on disk if the data is appropriately packaged. It may be slow, but it works.
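A minimal sketch of what that full scan could look like, working from the underlying 300K x 512 embedding matrix rather than the precomputed product, and assuming it is stored as a raw row-major float32 file (file name, dtype, and chunk size are all assumptions):

    import numpy as np

    N, D = 300_000, 512          # number of vectors and their dimensionality (assumed)
    K = 10                       # how many neighbours to return

    # Memory-map the raw float32 embedding matrix instead of loading it into RAM.
    emb = np.memmap("embeddings.f32", dtype=np.float32, mode="r", shape=(N, D))

    query = np.array(emb[2])     # copy row #2 into memory as the query vector

    best_scores = np.full(K, -np.inf, dtype=np.float32)
    best_ids = np.full(K, -1, dtype=np.int64)

    chunk = 20_000               # rows per chunk; tune to your RAM budget
    for start in range(0, N, chunk):
        block = np.asarray(emb[start:start + chunk])   # read one chunk from disk
        scores = block @ query                         # dot-product similarity
        # merge this chunk's candidates with the running top-K
        ids = np.concatenate([best_ids, np.arange(start, start + len(block))])
        vals = np.concatenate([best_scores, scores])
        top = np.argsort(vals)[-K:][::-1]
        best_ids, best_scores = ids[top], vals[top]

    # row #2 itself will come out on top, since it has maximal similarity to itself
    print(best_ids, best_scores)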

There are n-dimensional indexes that can accelerate similarity queries; Pinecone uses them. They are not as effective as 1-d and 2-d (geospatial) indexes, but they do help.

Sparse similarity search is its own problem, solved by full-text search engines like Lucene.

Dimensionality reduction, say going from 300K features to 100 features, might improve the quality of your results as well as compacting your data. It could be something you do once on a monster machine and then do many lookups on a smaller machine. That reduction might itself take a lot of resources, see

https://scikit-learn.org/stable/modules/decomposition.html

but in your case you might get embarrassingly good results with random projections

https://en.wikipedia.org/wiki/Random_projection
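If you go the random-projection route, scikit-learn has it built in. A rough sketch (the matrix here is a synthetic stand-in for your data, and the shapes and component count are illustrative):

    import numpy as np
    from sklearn.random_projection import GaussianRandomProjection

    rng = np.random.default_rng(0)
    X = rng.standard_normal((10_000, 512)).astype(np.float32)  # stand-in for your data

    # Project down to 100 components; pairwise distances are roughly preserved
    # (Johnson-Lindenstrauss), so similarity queries on X_small approximate those on X.
    proj = GaussianRandomProjection(n_components=100, random_state=0)
    X_small = proj.fit_transform(X)

    print(X_small.shape)   # (10000, 100)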


👤 prirun
You can use swap / virtual memory. The tricky part with that (or with any db) is making sure your array accesses are as sequential as possible, because in the worst case each memory reference turns into a disk access. Make sure your swap is on a fast NVMe/SSD device to minimize access time. See mkswap.
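To make the access pattern concrete: in a row-major (C-order) array on disk, a row occupies one contiguous span of the file, while a column is scattered across the whole file. A toy illustration with np.memmap (file name, dtype, and shape are assumptions):

    import numpy as np

    # Row i occupies one contiguous span of the file in row-major layout.
    sim = np.memmap("similarity.f32", dtype=np.float32, mode="r",
                    shape=(300_000, 300_000))

    row = np.asarray(sim[2])       # fast: one contiguous sequential read
    # col = np.asarray(sim[:, 2])  # slow: 300K scattered reads, one per row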

👤 jstx1
> I want to make queries on this matrix like (what are the rows most similar to row #2)

There are libraries for efficient vector similarity search like https://github.com/facebookresearch/faiss
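For example, an exact inner-product index over the underlying 300K x 512 vectors (rather than the precomputed product) fits comfortably in RAM. A rough sketch, with the file name assumed:

    import numpy as np
    import faiss

    vecs = np.load("embeddings.npy").astype(np.float32)  # assumed shape (300000, 512)

    index = faiss.IndexFlatIP(vecs.shape[1])  # exact (brute-force) inner-product index
    index.add(vecs)                           # ~600 MB for 300K x 512 float32
    scores, ids = index.search(vecs[2:3], 10) # 10 rows most similar to row #2

    print(ids[0], scores[0])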


👤 yorwba
Use mmap to load only the region of the file that corresponds to row #2, interpret it as an array using np.frombuffer, and proceed as with normal in-memory data.

This works as long as you control the file format (so you can ensure that rows are stored contiguously) and as long as you don't need too many rows in memory at the same time.
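A minimal sketch of that approach, assuming the similarity matrix is stored as a raw row-major float32 file (file name and dtype are assumptions):

    import mmap
    import numpy as np

    N = 300_000                            # entries per row
    ROW_BYTES = N * np.dtype(np.float32).itemsize
    row_id = 2

    with open("similarity.f32", "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        start = row_id * ROW_BYTES
        # Only the pages backing this slice are actually read from disk.
        row = np.frombuffer(mm[start:start + ROW_BYTES], dtype=np.float32)

    top10 = np.argsort(row)[-10:][::-1]    # indices of the most similar rows
    print(top10, row[top10])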


👤 tacosbane
You didn't provide any hints about why it wouldn't work for you, so I recommend looking at scikit-learn's neighbors module (i.e., construct a NearestNeighbors object and query it).

https://scikit-learn.org/stable/modules/neighbors.html#unsup...
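A rough sketch of that, again working from the 300K x 512 matrix rather than the precomputed similarity matrix (file name and metric are assumptions):

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    X = np.load("embeddings.npy")       # assumed shape (300000, 512), fits in RAM

    nn = NearestNeighbors(n_neighbors=10, metric="cosine")
    nn.fit(X)

    dist, ids = nn.kneighbors(X[2:3])   # 10 nearest neighbours of row #2
    print(ids[0], dist[0])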