HACKER Q&A
📣 craggyjaggy

Tools for exploratory analysis of 10-100GB graphs


I've never had to work with any dataset bigger than L3 cache, so I'm somewhat out of my depth here. I have a sample of the (relational) data that's about 10GB, with another 80GB available that may or may not be mostly garbage.

In the end I would like to have the graph in a visual interface to zoom and pan through it, and a way to experiment with different clustering algorithms based on some proximity measure (I have an idea for what those might look like).

I'm not a data scientist so I have no overview of the tooling landscape here and find it difficult to filter through endless pages of marketing for vaguely ML/Big Data related products. I'm not looking for an expensive ready-made solution, I do like to hack on things after all :)


  👤 PaulHoule Accepted Answer ✓
At scales far below what you're talking about, people experience grave difficulties making sense of big graphs.

https://cambridge-intelligence.com/how-to-fix-hairballs/

One of my favorite examples is this guy

https://en.wikipedia.org/wiki/Mark_Lombardi

I saw an art exhibit that showed some of the sketches that he made and it was clear that he worked really hard drawing and redrawing each graph and they went from being hairballish to telling a clear story.

You're also very insightful to be talking about the specific scale you're working at, because it matters. Graph workloads can drive you batty because they frequently defeat caches by being very nonlocal.
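One common way to claw back some locality is to renumber nodes so that neighbors end up with nearby IDs. Here is a minimal sketch that relabels nodes in breadth-first visit order. The function name `bfs_relabel` and the dict-of-lists input are my own assumptions for illustration, and whether this helps in practice depends a lot on the shape of your graph.

```python
from collections import deque

def bfs_relabel(adjacency):
    """Map each old node ID to a new ID assigned in BFS visit order.

    adjacency: dict mapping node -> iterable of neighbor nodes
    (a hypothetical input format chosen for this sketch).
    """
    new_id = {}
    for start in adjacency:
        if start in new_id:
            continue  # already reached from an earlier start node
        new_id[start] = len(new_id)
        queue = deque([start])
        while queue:
            u = queue.popleft()
            # Nodes that appear only as targets have no adjacency entry.
            for v in adjacency.get(u, ()):
                if v not in new_id:
                    new_id[v] = len(new_id)
                    queue.append(v)
    return new_id

# Two small components with scattered original IDs.
graph = {0: [5], 5: [0, 9], 9: [5], 3: [7], 7: [3]}
mapping = bfs_relabel(graph)
```

After relabeling, nodes that are close in the graph tend to sit close together in any array indexed by node ID, which is about as much as you can hope for without knowing more about the data.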

For your small data set you are in the range where you can get a "big" computer with, say, 64GB or 128GB of RAM and work entirely in RAM. You might be a little disappointed with the performance (it takes a while to touch every memory address in a 128GB machine), but it will be good enough if you're efficient and disciplined.
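To make "efficient and disciplined" a bit more concrete: a compact adjacency layout such as compressed sparse row (CSR) keeps the per-edge cost to a few bytes. The sketch below uses only the standard library; `build_csr`, `neighbors`, and `csr_bytes` are hypothetical helper names, and it assumes nodes have already been mapped to integers 0..n-1 and that node IDs fit in a 32-bit int.

```python
from array import array

def build_csr(num_nodes, edges):
    """Build CSR arrays from (src, dst) pairs with 0 <= src, dst < num_nodes.

    Returns (offsets, targets); the neighbors of node u are
    targets[offsets[u]:offsets[u + 1]].
    """
    edges = list(edges)
    # Count the out-degree of each node, shifted by one slot.
    offsets = array("q", [0]) * (num_nodes + 1)
    for src, _ in edges:
        offsets[src + 1] += 1
    # A prefix sum turns the counts into start offsets.
    for i in range(num_nodes):
        offsets[i + 1] += offsets[i]
    # Fill the targets using one moving write cursor per node.
    targets = array("i", [0]) * len(edges)
    cursor = array("q", offsets[:num_nodes])
    for src, dst in edges:
        targets[cursor[src]] = dst
        cursor[src] += 1
    return offsets, targets

def neighbors(offsets, targets, u):
    return targets[offsets[u]:offsets[u + 1]]

def csr_bytes(num_nodes, num_edges, offset_size=8, target_size=4):
    """Rough memory estimate, assuming 8-byte offsets and 4-byte targets."""
    return offset_size * (num_nodes + 1) + target_size * num_edges

offsets, targets = build_csr(3, [(0, 1), (0, 2), (1, 2), (2, 0)])
```

Under those assumptions, `csr_bytes(1_000_000, 1_000_000_000)` comes to roughly 4GB for a billion directed edges, which is why a 64GB machine is plausible here. Pure Python loops will be slow at that size, so treat this as a picture of the layout rather than something to run on the full data.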

As an RDF fanatic I'll share that I have handled data sets on the small end of your scale with

https://virtuoso.openlinksw.com/


👤 icsa
How many nodes and edges are in your graphs?

10,000 nodes is an upper limit for most graph visualization tools that I have used.
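If your graph is well above that, one crude way to get something a visualization tool can open is to keep only the highest-degree nodes and the edges between them. This is just a sketch of that idea; `top_degree_subgraph` is a made-up name, the degree cutoff is an assumption rather than a recommendation, and it will hide any structure that lives in low-degree nodes.

```python
from collections import Counter

def top_degree_subgraph(edges, max_nodes=10_000):
    """Keep the max_nodes highest-degree nodes and the edges among them."""
    edges = list(edges)
    # Degree here counts both endpoints, i.e. the graph is treated as undirected.
    degree = Counter()
    for src, dst in edges:
        degree[src] += 1
        degree[dst] += 1
    keep = {node for node, _ in degree.most_common(max_nodes)}
    # Induced subgraph: an edge survives only if both endpoints were kept.
    kept_edges = [(s, d) for s, d in edges if s in keep and d in keep]
    return keep, kept_edges

sample = [("a", "b"), ("a", "c"), ("a", "d"), ("a", "e"), ("b", "c"), ("b", "d")]
keep, kept_edges = top_degree_subgraph(sample, max_nodes=2)
```

The resulting edge list can then be exported in whatever format your tool reads.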