How do you detect oddly performing machines in a large cluster?

Question

Hi HN,

I have a problem where random nodes in a large cluster perform pretty far out of spec. What's the correct way to find them? There's a huge diversity of workloads, and the boxes are large, so I was considering something really trivial: run a small task that eats a consistent percentage of CPU (calculating Fibonacci up to some low-ish number, for instance), then graph the machines that land some number of standard deviations away from normal.

Is there a really good, clean way to solve this elegantly?
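To make the idea concrete, here is a minimal Python sketch of the probe described above, not a tested fleet tool. It assumes each node runs the same fixed CPU task and reports one wall-clock timing to a central place. The names `fib`, `run_probe`, and `find_outliers`, the default workload size, and the 3-SD threshold are all hypothetical choices for illustration.

```python
import statistics
import time


def fib(n: int) -> int:
    """Iteratively compute the n-th Fibonacci number (fib(0) == 0)."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def run_probe(n: int = 20_000, repeats: int = 5) -> float:
    """Time fib(n) `repeats` times and return the fastest run in seconds.

    Taking the minimum reduces noise from whatever else the box is doing.
    The workload size is an arbitrary assumption; tune it per fleet.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fib(n)
        best = min(best, time.perf_counter() - start)
    return best


def find_outliers(timings: dict[str, float], n_sd: float = 3.0) -> list[str]:
    """Return hosts whose timing is more than n_sd SDs from the fleet mean."""
    values = list(timings.values())
    if len(values) < 2:
        return []
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population SD across the fleet
    if sd == 0:
        return []
    return sorted(host for host, t in timings.items() if abs(t - mean) > n_sd * sd)
```

Each node would call `run_probe()` and ship the result; the collector would then call `find_outliers()` on a mapping of hostname to timing. One caveat with the mean/SD approach: a handful of very slow nodes inflates the SD and can hide milder outliers, so a median-based threshold may be worth trying if that happens.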

verdverm · Accepted Answer

Is the problem collecting the data and detecting outliers in the time series, or is it more about being able to reproduce the behavior so you can work towards a fix?