HACKER Q&A
📣 vsroy

Disk-backed NumPy for data analysis?


I have a large similarity matrix. It is 300K by 300K entries, and is ~ 100GB in size. To clarify, this is the result of multiplying a 300K by 512 matrix with its transpose.

I want to make queries on this matrix like (what are the rows most similar to row #2), but I don't have a machine with 100GB of RAM.

What would be the best strategy for running queries on this matrix? My current plan is to use duckdb, but I am wondering if there are more elegant alternatives.


  👤 PaulHoule Accepted Answer ✓
Try https://www.dask.org/

Also consider https://www.pinecone.io/

The obvious way to do a similarity query is a full scan, which can be done in a straightforward way against data on disk if the data is appropriately packaged. It may be slow, but it works.
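A minimal sketch of what that full scan could look like, working from the underlying 300K x 512 embedding matrix rather than the precomputed product, and assuming it is stored as a raw row-major float32 file (file name, dtype, and chunk size are all assumptions):

    import numpy as np

    N, D = 300_000, 512          # number of vectors and their dimensionality (assumed)
    K = 10                       # how many neighbours to return

    # Memory-map the raw float32 embedding matrix instead of loading it into RAM.
    emb = np.memmap("embeddings.f32", dtype=np.float32, mode="r", shape=(N, D))

    query = np.array(emb[2])     # copy row #2 into memory as the query vector

    best_scores = np.full(K, -np.inf, dtype=np.float32)
    best_ids = np.full(K, -1, dtype=np.int64)

    chunk = 20_000               # rows per chunk; tune to your RAM budget
    for start in range(0, N, chunk):
        block = np.asarray(emb[start:start + chunk])   # read one chunk from disk
        scores = block @ query                         # dot-product similarity
        # merge this chunk's candidates with the running top-K
        ids = np.concatenate([best_ids, np.arange(start, start + len(block))])
        vals = np.concatenate([best_scores, scores])
        top = np.argsort(vals)[-K:][::-1]
        best_ids, best_scores = ids[top], vals[top]

    # row #2 itself will come out on top, since it has maximal similarity to itself
    print(best_ids, best_scores)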

There are n-dimensional indexes that can accelerate similarity queries; Pinecone uses them. They are not as effective as 1-d and 2-d (geospatial) indexes, but they do help.

Sparse similarity search is its own problem, solved by full-text search engines like Lucene.

Dimensionality reduction, say going from 300K features to 100 features, might improve the quality of your results as well as compacting your data. It could be something you do once on a monster machine and then do many lookups on a smaller machine. That reduction might itself take a lot of resources, see

https://scikit-learn.org/stable/modules/decomposition.html

but in your case you might get embarrassingly good results with random projections

https://en.wikipedia.org/wiki/Random_projection
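If you go the random-projection route, scikit-learn has it built in. A rough sketch (the matrix here is a synthetic stand-in for your data, and the shapes and component count are illustrative):

    import numpy as np
    from sklearn.random_projection import GaussianRandomProjection

    rng = np.random.default_rng(0)
    X = rng.standard_normal((10_000, 512)).astype(np.float32)  # stand-in for your data

    # Project down to 100 components; pairwise distances are roughly preserved
    # (Johnson-Lindenstrauss), so similarity queries on X_small approximate those on X.
    proj = GaussianRandomProjection(n_components=100, random_state=0)
    X_small = proj.fit_transform(X)

    print(X_small.shape)   # (10000, 100)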


👤 prirun
You can use swap / virtual memory. The tricky part with that (or with any db) is making sure your array accesses are as sequential as possible, because in the worst case each memory reference turns into a disk access. Make sure your swap is on a fast NVMe/SSD device to minimize access time. See mkswap.
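To make the access pattern concrete: in a row-major (C-order) array on disk, a row occupies one contiguous span of the file, while a column is scattered across the whole file. A toy illustration with np.memmap (file name, dtype, and shape are assumptions):

    import numpy as np

    # Row i occupies one contiguous span of the file in row-major layout.
    sim = np.memmap("similarity.f32", dtype=np.float32, mode="r",
                    shape=(300_000, 300_000))

    row = np.asarray(sim[2])       # fast: one contiguous sequential read
    # col = np.asarray(sim[:, 2])  # slow: 300K scattered reads, one per row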

👤 jstx1
> I want to make queries on this matrix like (what are the rows most similar to row #2)

There are libraries for efficient vector similarity search like https://github.com/facebookresearch/faiss
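For example, an exact inner-product index over the underlying 300K x 512 vectors (rather than the precomputed product) fits comfortably in RAM. A rough sketch, with the file name assumed:

    import numpy as np
    import faiss

    vecs = np.load("embeddings.npy").astype(np.float32)  # assumed shape (300000, 512)

    index = faiss.IndexFlatIP(vecs.shape[1])  # exact (brute-force) inner-product index
    index.add(vecs)                           # ~600 MB for 300K x 512 float32
    scores, ids = index.search(vecs[2:3], 10) # 10 rows most similar to row #2

    print(ids[0], scores[0])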


👤 yorwba
Use mmap to load only the region of the file that corresponds to row #2, interpret it as an array using np.frombuffer, and proceed as with normal in-memory data.

This works as long as you control the file format (so you can ensure that rows are stored contiguously) and as long as you don't need too many rows in memory at the same time.
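A minimal sketch of that approach, assuming the similarity matrix is stored as a raw row-major float32 file (file name and dtype are assumptions):

    import mmap
    import numpy as np

    N = 300_000                            # entries per row
    ROW_BYTES = N * np.dtype(np.float32).itemsize
    row_id = 2

    with open("similarity.f32", "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        start = row_id * ROW_BYTES
        # Only the pages backing this slice are actually read from disk.
        row = np.frombuffer(mm[start:start + ROW_BYTES], dtype=np.float32)

    top10 = np.argsort(row)[-10:][::-1]    # indices of the most similar rows
    print(top10, row[top10])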


👤 tacosbane
You didn't provide any hints about why it wouldn't work for you, so I recommend looking at scikit-learn's neighbors module (i.e., construct a NearestNeighbors object and query it).

https://scikit-learn.org/stable/modules/neighbors.html#unsup...
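A rough sketch of that, again working from the 300K x 512 matrix rather than the precomputed similarity matrix (file name and metric are assumptions):

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    X = np.load("embeddings.npy")       # assumed shape (300000, 512), fits in RAM

    nn = NearestNeighbors(n_neighbors=10, metric="cosine")
    nn.fit(X)

    dist, ids = nn.kneighbors(X[2:3])   # 10 nearest neighbours of row #2
    print(ids[0], dist[0])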