HACKER Q&A
📣 chucky_z

How do you detect oddly performing machines in a large cluster?


Hi HN,

I have a problem where random nodes in a large cluster perform well out of spec. What's the correct way to find them? The workloads are hugely diverse and the boxes are large, so I was considering running something trivial that eats a consistent percentage of CPU (computing Fibonacci up to some low-ish number, for instance), then graphing the machines that land more than some n standard deviations from the norm.

Is there a good, clean way to do this that solves the problem elegantly?
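
To make the idea concrete, here is a rough sketch of the probe-and-compare approach described above, in Python. Everything in it is an assumption for illustration: the function names (`fib`, `probe_seconds`, `find_outliers`), the workload size, and the default threshold of 3 standard deviations are all made up, and how the timings get shipped from each node to one place is left out entirely.

```python
import statistics
import time


def fib(n: int) -> int:
    """Iterative Fibonacci: a small, fixed amount of CPU work."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def probe_seconds(n: int = 30, repeats: int = 2000) -> float:
    """Time a fixed workload. The idea is to run this on every node."""
    start = time.perf_counter()
    for _ in range(repeats):
        fib(n)
    return time.perf_counter() - start


def find_outliers(timings: dict[str, float], n_sd: float = 3.0) -> list[str]:
    """Return hosts whose timing is more than n_sd standard deviations
    away from the fleet mean (a plain z-score test)."""
    values = list(timings.values())
    if len(values) < 2:
        return []
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        # Every node reported the same timing, so nothing stands out.
        return []
    return sorted(h for h, t in timings.items() if abs(t - mean) / sd > n_sd)


if __name__ == "__main__":
    # Hypothetical host names and made-up timings, just to show the call shape.
    timings = {f"node{i:02d}": 1.0 for i in range(19)}
    timings["node99"] = 2.0
    print(find_outliers(timings))
```

One caveat with this sketch: a plain mean and standard deviation are themselves pulled around by the bad nodes, so on a small fleet a single outlier can only reach a limited z-score. A median-based spread measure may hold up better, but that is a separate design choice.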


  👤 verdverm Accepted Answer ✓
Is the problem collecting the data and detecting outliers in the time series, or is it more about being able to reproduce the behavior so you can work toward a fix?