HACKER Q&A
📣 queuebert

How to properly build a multi-terabyte DuckDB database?


I have an initial data dump of ~2 TB of structured text, with a further 20-30 GB to be added each week, reaching a final size of ~10 TB. I only need a single node, and I need serious analytics, so I'm considering DuckDB. Disk space is not a concern, but incremental backup speed is.

How do I do this properly so that it is performant and scales without breaking? Are there any gotchas?

There are so many ways to import -- which is the fastest?

Is this a definite case for partitioning?

Should I create adaptive radix tree (ART) indices? The docs say, "ART indexes must currently be able to fit in-memory. Avoid creating ART indexes if the index does not fit in memory."

What else am I missing here? Can DuckDB even handle databases of this size?

Any guidance would be greatly appreciated!


  👤 szarnyasg Accepted Answer ✓
Disclaimer: DuckDB Labs employee here.

What does your 'serious analytics' entail? Using the full-text search extension's macros (stemming, match_bm25, ...), running regexes, computing aggregates? Are you doing highly selective lookups on some columns that you'd like to index? What would be your partitioning key?
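
If the full-text search route is what you mean, the flow is a PRAGMA that builds the index plus the match_bm25 macro it generates. A minimal sketch in Python, where the database file, the docs table, and its id/body columns are only placeholders for your data:

    import duckdb

    con = duckdb.connect("analytics.duckdb")  # hypothetical database file
    con.execute("INSTALL fts")
    con.execute("LOAD fts")

    # Toy stand-in for your structured-text table.
    con.execute("""
        CREATE TABLE docs AS
        SELECT * FROM (VALUES
            (1, 'duckdb handles single-node analytics'),
            (2, 'partitioning helps with pruning')
        ) t(id, body)
    """)

    # Build the FTS index; this generates a schema fts_main_docs
    # containing a match_bm25 macro.
    con.execute("PRAGMA create_fts_index('docs', 'id', 'body')")

    # Rank documents against a query string with BM25.
    print(con.execute("""
        SELECT id, score
        FROM (
            SELECT *, fts_main_docs.match_bm25(id, 'analytics') AS score
            FROM docs
        ) sq
        WHERE score IS NOT NULL
        ORDER BY score DESC
    """).fetchall())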

> There are so many ways to import -- which is the fastest?

Loading from Parquet is great if you already have Parquet files... but for your use case, CSV import is the best bet. It is also very fast (>1 GB/s on uncompressed CSVs) and works fine as long as the CSVs are reasonably well-formatted.
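
For reference, a rough sketch of that path in Python: the initial bulk load through the parallel CSV reader, then the weekly increments as plain INSERTs. The paths and the events table name are placeholders:

    import duckdb

    con = duckdb.connect("analytics.duckdb")  # hypothetical database file

    # Initial ~2 TB load: glob all dump files and let the sniffer infer the schema.
    con.execute("""
        CREATE TABLE events AS
        SELECT * FROM read_csv_auto('dump/*.csv.gz')
    """)

    # Weekly 20-30 GB increment: append into the same table.
    con.execute("""
        INSERT INTO events
        SELECT * FROM read_csv_auto('weekly/2024-w01/*.csv.gz')
    """)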


👤 vgt
Check out MotherDuck!

(co-founder and head of produck, feel free to reach out)


👤 ryadh
Have you considered ClickHouse?