HACKER Q&A
📣 queuebert

How to properly build a multi-terabyte DuckDB database?


I have an initial data dump of ~2 TB of structured text, with a further 20-30 GB to be added each week, reaching a final size of ~10 TB. I only need a single node, and I need serious analytics, so I'm considering DuckDB. Disk space is not a concern, but incremental backup speed is.

How do I do this properly so that it is performant and scales without breaking? Are there any gotchas?

There are so many ways to import -- which is the fastest?

Is this a definite case for partitioning?

Should I create adaptive radix tree (ART) indices? The docs say, "ART indexes must currently be able to fit in-memory. Avoid creating ART indexes if the index does not fit in memory."

What else am I missing here? Can DuckDB even handle databases of this size?

Any guidance would be greatly appreciated!


  👤 szarnyasg Accepted Answer ✓
Disclaimer: DuckDB Labs employee here.

What does your 'serious analytics' entail? Using the full-text search extension's macros (stemming, match_bm25, ...), running regexes, computing aggregates? Are you doing highly selective lookups on some columns that you'd like to index? What would be your partitioning key?
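
If the full-text search route is what you mean, the flow is a PRAGMA that builds the index plus the match_bm25 macro it generates. A minimal sketch in Python, where the database file, the docs table, and its id/body columns are only placeholders for your data:

    import duckdb

    con = duckdb.connect("analytics.duckdb")  # hypothetical database file
    con.execute("INSTALL fts")
    con.execute("LOAD fts")

    # Toy stand-in for your structured-text table.
    con.execute("""
        CREATE TABLE docs AS
        SELECT * FROM (VALUES
            (1, 'duckdb handles single-node analytics'),
            (2, 'partitioning helps with pruning')
        ) t(id, body)
    """)

    # Build the FTS index; this generates a schema fts_main_docs
    # containing a match_bm25 macro.
    con.execute("PRAGMA create_fts_index('docs', 'id', 'body')")

    # Rank documents against a query string with BM25.
    print(con.execute("""
        SELECT id, score
        FROM (
            SELECT *, fts_main_docs.match_bm25(id, 'analytics') AS score
            FROM docs
        ) sq
        WHERE score IS NOT NULL
        ORDER BY score DESC
    """).fetchall())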

> There are so many ways to import -- which is the fastest?

Loading from Parquet is great if you already have Parquet files... but for your use case, CSV import is the best bet. It is also very fast (>1 GB/s on uncompressed CSVs) and works fine as long as the CSVs are reasonably well-formatted.
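
For reference, a rough sketch of that path in Python: the initial bulk load through the parallel CSV reader, then the weekly increments as plain INSERTs. The paths and the events table name are placeholders:

    import duckdb

    con = duckdb.connect("analytics.duckdb")  # hypothetical database file

    # Initial ~2 TB load: glob all dump files and let the sniffer infer the schema.
    con.execute("""
        CREATE TABLE events AS
        SELECT * FROM read_csv_auto('dump/*.csv.gz')
    """)

    # Weekly 20-30 GB increment: append into the same table.
    con.execute("""
        INSERT INTO events
        SELECT * FROM read_csv_auto('weekly/2024-w01/*.csv.gz')
    """)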


👤 vgt
Check out MotherDuck!

(co-founder and head of produck, feel free to reach out)


👤 ryadh
Have you considered ClickHouse?