And, occasionally, one of the disks on my machines fills up, and then my Vagrant machine suspends.
Or I reboot a machine and the VM does not come up automatically, because I forgot that the service isn't a Docker container with a restart policy set.
What are people using to make sure homelab services are running correctly?
My goal would be something that:
* Has a simple dashboard
* Should it be an agent on the VM, or an external process that checks from afar?
* Good backup and restore story so I can rebuild the service and move to another server if I need to.
* Polyglot checks: I want to see CPU, memory, and disk space usage. But I also want to check services on the VM, like Docker, and use SSH to run manual checks.
* Lightweight learning curve
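To make the "agent on the VM" option concrete, here is a minimal sketch of a local check script in Python using only the standard library. The function names (`evaluate`, `check_disk`, `check_load`, `run_checks`) and the thresholds are hypothetical, chosen for illustration; this is not tied to any particular monitoring tool, and `os.getloadavg` assumes a Unix-like host.

```python
import os
import shutil


def evaluate(name, value, limit):
    """Compare a measured value to a limit and return (ok, message)."""
    ok = value <= limit
    state = "OK" if ok else "FAIL"
    return ok, f"{state} {name}: {value:.1f} (limit {limit:.1f})"


def check_disk(path="/", max_used_pct=90.0):
    """Check the used percentage of the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    used_pct = usage.used / usage.total * 100
    return evaluate("disk " + path, used_pct, max_used_pct)


def check_load(max_per_cpu=2.0):
    """Check the 1-minute load average, normalized by CPU count (Unix only)."""
    load1, _, _ = os.getloadavg()
    per_cpu = load1 / (os.cpu_count() or 1)
    return evaluate("load per CPU", per_cpu, max_per_cpu)


def run_checks():
    """Run all checks and return (all_ok, list_of_messages)."""
    results = [check_disk("/"), check_load()]
    return all(ok for ok, _ in results), [msg for _, msg in results]
```

Something like this could run from cron or a systemd timer on each VM and print or forward the result of `run_checks()`; how the result reaches a dashboard is left open here.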
I'm using Uptime Kuma for monitoring HTTP and SSH services and I love it. But Uptime Kuma cannot SSH into a server; it only checks whether the SSH daemon is up, and I would love to actually log in to the machine and run a health check that could be customized. Is there a single service that I could use for this? Or should I wire together a bunch of smaller tools and put them into a centralized dashboard somehow?
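For the "check from afar" variant, a rough sketch of what a customized SSH health check could look like: run `df -P` on the remote host over SSH and parse the used percentage. The helper names `parse_df` and `remote_disk_used` are made up for this example, and it assumes key-based SSH login already works for the host and that the remote `df` supports the POSIX `-P` output format.

```python
import subprocess


def parse_df(output):
    """Parse `df -P <path>` output and return the used percentage as an int."""
    lines = output.strip().splitlines()
    if len(lines) < 2:
        raise ValueError("unexpected df output")
    # POSIX columns: Filesystem, 1024-blocks, Used, Available, Capacity, Mounted on.
    # Assumes the filesystem name contains no spaces.
    fields = lines[-1].split()
    return int(fields[4].rstrip("%"))


def remote_disk_used(host, path="/", timeout=15):
    """Run `df -P` on `host` over SSH and return the used percentage.

    BatchMode=yes makes ssh fail instead of prompting for a password,
    so an unattended check cannot hang waiting for input.
    """
    cmd = [
        "ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5",
        host, "df", "-P", path,
    ]
    result = subprocess.run(
        cmd, capture_output=True, text=True, timeout=timeout, check=True
    )
    return parse_df(result.stdout)
```

Calling `remote_disk_used("myvm.local")` (a placeholder hostname) would raise `subprocess.CalledProcessError` if the SSH login or the command fails, which a wrapper could treat as a failed check. The same pattern should extend to other commands, such as checking whether a container is running.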
I'm looking at Dashy and it looks great. But, as of yet, I'm unsure how I can just "add SSH check for this hostname" and "add HTTP status check for this hostname" and "check disk space for this hostname" and get it working without a lot of confusion.
https://live.dashy.to/
https://prometheus-operator.dev/
Kubernetes is not for everyone and is far from perfect, but you already use Docker, and you seem to be looking for many of the features Kubernetes offers.
Bit of a longer dump for an answer...
I have been running services at home for way too long now, and my day job is running the cloud for large businesses, so I am tired of hand-rolled, special-snowflake solutions that are prone to breaking. My idea of a well-run infrastructure is that I should be able to walk away from it, hands off, for extended periods of time and it just continues running and self-heals. To that effect, this is what I've come down to:
- 3-node k8s cluster on a bunch of random mini NUCs
- GitHub repo with Helm charts/manifests hooked to ArgoCD (runs on the cluster) for CD. All changes get checked into the repo and auto-deploy to the cluster. https://www.argonaut.dev/ is an option if you don't want to run your own ArgoCD
- Grafana Cloud free tier for shipping machine/cluster metrics and monitoring. Alerting is via Pushover; you can use email too
- Uptime Kuma on a fly.io free instance for inbound HTTP/DNS/cert etc. monitoring from the outside, hooked to Techulus Push/Pushover for alerting
- Terraform for DNS/Cloudflare management, via the Terraform Cloud offering for automated deploys again