HACKER Q&A
📣 siyud

Seeking a Vector Database for ClickHouse Users – Suggestions Appreciated


I am currently working on a project that requires me to store and efficiently query large amounts of multidimensional data, and I believe a vector database could provide the perfect solution. However, I am unsure which one would best integrate with ClickHouse. If any of you have had experience using a vector database in conjunction with ClickHouse, I would be immensely grateful for your recommendations and insights.


  👤 tomhamer Accepted Answer ✓
You should check out https://github.com/marqo-ai/marqo for an end-to-end vector search database with batteries included.

Disclaimer, I'm from the Marqo team.


👤 zX41ZdbW
How large is the data size (the number of vectors, and their dimensions?), what are the type of queries (N nearest neighbors to a target vector according to L2 distance, or something else?), from where the queries are sent (reccomendation system for a user request; internal requests from a ML system), the throughput and the latency requirements (how many queries per second it should serve and how quickly it should answer)?

ClickHouse already works good for vector search.

For example, if you have one million of vectors of 1024 dimensions, and you search nearest vectors to a vector by brute force search, the query will take 150 ms, which is good for a reccomendation system scenario for e-commerce, food-tech, and similar applications.

Example:

    CREATE TABLE vectors (id UInt64, vector Array(Float32)) ENGINE = Memory;
    SET max_block_size = 16; -- 64 KB per row
    INSERT INTO vectors SELECT number, arrayMap(x -> randNormal(0.0, 1.0, x), range(1024)) FROM numbers_mt(1000000); -- 4 GiB

    WITH (SELECT vector FROM vectors LIMIT 1) AS target
    SELECT count() FROM vectors WHERE NOT ignore(L2SquaredDistance(vector, target)); -- 0.113

    SELECT count() FROM vectors WHERE NOT ignore(L2Norm(vector)); -- 0.110

    WITH (SELECT vector FROM vectors LIMIT 1) AS target
    SELECT count() FROM vectors WHERE NOT ignore(arraySum((x, y) -> x * y, vector, target)); -- 0.150

    WITH (SELECT vector FROM vectors LIMIT 1) AS target
    SELECT id, L2SquaredDistance(vector, target) AS distance FROM vectors ORDER BY distance LIMIT 10; -- 0.144

👤 sebawita
I would recommend https://weaviate.io/ Disclaimer, I work at Weaviate.

It is open source, super fast and really easy to work with. Plus, it can easily handle huge volumes, we even have it running with a billion objects.


👤 bobvanluijt
Ha, great question! Coincidentally, at Weaviate, we are thinking about this too! There are a few ways to do this with AirByte (https://docs.airbyte.com/integrations/destinations/weaviate/) or (potentially) with Spark (https://github.com/weaviate/spark-connector). Would love to collaborate on this; feel free to reach out over Slack or so.

👤 lqhl
You can explore MyScale at https://myscale.com/. It's a SaaS platform built on ClickHouse, offering more sophisticated vector indexing options like HNSW and IVF compared to the open-source version. It also provides a free tier for beta users.

Disclaimer: I work for MyScale.


👤 ryadh
ClickHouse can actually store vectors as tuples or arrays. It also comes with some handy distance functions

https://clickhouse.com/docs/en/sql-reference/functions/dista...


👤 jeadie
I've been using https://github.com/jdagdelen/hyperDB and it's been really easy to use. I think Clickhouse support is on the short-term roadmap.

👤 andre-z
You could check Qdrant, a dedicated Vector Database with advanced features. https://github.com/qdrant/qdrant Disclaimer: I'm from the Qdrant Team.