I'm looking for advice and insight on what y'all might use for an internally hosted logging solution in Kubernetes. Currently we use a self-hosted Graylog setup, but are finding it difficult to maintain as our system grows.
Here's our current setup:
- Multiple clusters
- Logs aggregated to a single Graylog setup, itself running in Kubernetes
- Logs are sent to Graylog via Fluentbit
Some problems we've had are:
- Index management in Graylog's Elasticsearch cluster is a PITA when you have many differently shaped log forms going to shared indices (managing separate indices per source is also a pain)
- Management of MongoDB in Kubernetes is frustrating and has been a reliability challenge
I'd love for us to be able to use a hosted logging solution, but $$$ obviously. I'm aware of many other alternatives, but one of the things I've painfully learned is that a basic feature matrix only tells a very small piece of any story. The real education comes from running this type of tech and living with it through scale and various lifecycle events.
Some questions I have:
- What logging solutions are you using in your Kubernetes environment, and how has your experience been?
- How do you handle log retention and storage costs?
TIA
We've been storing all the logs from all the containers running in our Kubernetes clusters in VictoriaLogs for the last year. It works smoothly and uses very small amounts of RAM and disk space. For example, one of our VictoriaLogs instances holds 2 terabytes of logs while using 35 gigabytes of disk space and 200MB of RAM on average.
> How do you handle log retention and storage costs?
VictoriaLogs provides a single clear command-line flag for limiting disk space usage - `-retention.maxDiskSpaceUsageBytes`. It automatically removes the oldest logs when disk space usage reaches the configured limit. See https://docs.victoriametrics.com/victorialogs/#retention-by-... .
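In a Kubernetes deployment that boils down to one extra argument. A minimal sketch, where the image tag, the 500GiB cap and the data path are my own placeholders - only the retention flag itself is the documented one:

```yaml
containers:
  - name: victoria-logs
    image: victoriametrics/victoria-logs:latest   # pin a real tag in practice
    args:
      - -storageDataPath=/victoria-logs-data
      - -retention.maxDiskSpaceUsageBytes=500GiB  # oldest logs are dropped once this cap is hit
    volumeMounts:
      - name: data
        mountPath: /victoria-logs-data
```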
P.S. I may be biased, because I'm the core developer of VictoriaLogs. I recommend trying VictoriaLogs in production alongside other log management solutions and then choosing the one that best fits your particular needs from an operations, cost, usability and performance PoV.
When I worked at Postmates we had a proprietary log search built on ClickHouse which was excellent. The same idea was also implemented concurrently at Uber (yay, multiple discovery) and is documented at a relatively high level here: https://www.uber.com/blog/logging/
If a gun were placed to my head, I would rebuild that over running the existing logging solutions.
I also spent several months building my own purpose-built logging storage and indexing engine based on trigram bitmap indices for accelerated regex searches, a la CodeSearch, but I ran out of motivation to finish it and commercialisation seemed very difficult - too much competition, even if that competition is bad. I really should get around to finishing it enough that it can at least be OSSed.
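For context, the CodeSearch trick is roughly: index every line under its trigrams, intersect the posting sets (bitmaps, in a real engine) for the trigrams of the query's literal part, and only run the full regex over the surviving candidates. A toy illustration in Python - not my engine's actual code:

```python
import re
from collections import defaultdict

def trigrams(s: str) -> set[str]:
    return {s[i:i + 3] for i in range(len(s) - 2)}

lines = ["error: disk full", "request ok", "disk error on node-3"]

# Build the index: trigram -> set of line ids (a real engine would use bitmaps).
index = defaultdict(set)
for i, line in enumerate(lines):
    for t in trigrams(line):
        index[t].add(i)

def search(literal: str, pattern: str) -> list[str]:
    # Narrow to lines containing every trigram of the literal, then run the regex.
    candidates = set(range(len(lines)))
    for t in trigrams(literal):
        candidates &= index.get(t, set())
    return [lines[i] for i in sorted(candidates) if re.search(pattern, lines[i])]

print(search("disk", r"disk\s+(full|error)"))  # -> ['error: disk full', 'disk error on node-3']
```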
Apart from that, ClickHouse provides various other features:
- The Dynamic data type [0], which is very useful for the semi-structured fields that logs very often contain.
- Column- and table-level TTLs [1], which provide an efficient way to configure retention.
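A minimal sketch of what that looks like in DDL - table and column names are made up, and Dynamic may need to be explicitly enabled depending on your ClickHouse version:

```sql
-- Toy logs table: a Dynamic column for free-form attributes, TTL-based retention.
CREATE TABLE logs
(
    timestamp  DateTime64(3),
    service    LowCardinality(String),
    message    String,
    attributes Dynamic                          -- semi-structured fields of varying types
)
ENGINE = MergeTree
ORDER BY (service, timestamp)
TTL toDateTime(timestamp) + INTERVAL 30 DAY;    -- rows older than 30 days are dropped
```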
At my previous job (Cloudflare), we migrated from Elasticsearch to ClickHouse and got nearly a 10x reduction in data size and a 5x performance improvement. You can read more about it [2] and watch the recording here [3].
Recently, ClickHouse engineers published a wonderfully detailed blog post about their logging pipeline [4].
[0] https://clickhouse.com/docs/en/sql-reference/data-types/dyna...
[1] https://clickhouse.com/docs/en/engines/table-engines/mergetr...
[2] https://blog.cloudflare.com/log-analytics-using-clickhouse
[3] https://vimeo.com/730379928
[4] https://clickhouse.com/blog/building-a-logging-platform-with...
I think Elasticsearch had its day when it was used to derive metrics from logs and to perform aggregate searches. But now that logging is often paired with metrics from Prometheus or a similar TSDB, we don't run such complex log queries anymore, and so we find ourselves questioning whether it's worth running such an intensive and complex Elasticsearch installation.
I was a big Elasticsearch user for several years. I wasn't convinced by Grafana Loki, which is far less expensive because the data is stored in object storage, but has poor read performance because it's not a real search engine.
Then I discovered Quickwit, which combines the advantages of both worlds. With Quickwit you can ingest logs the way you're already used to: through OTLP/gRPC, or with a log collector like Fluent Bit or Vector that reads the stdout of your pods and forwards it to Quickwit over the HTTP API, etc.
And you can then use Grafana with pretty much the same features available for the Elasticsearch datasource.
https://quickwit.io/docs/log-management/send-logs/using-flue...
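For example, the Fluent Bit side is just an http output pointed at Quickwit's ingest endpoint. A minimal sketch in Fluent Bit's YAML config format - the service address and index id are placeholders, and in practice you'd add the usual Kubernetes parsers/filters; the doc above has the authoritative version:

```yaml
pipeline:
  inputs:
    - name: tail
      path: /var/log/containers/*.log
      tag: kube.*
  outputs:
    - name: http
      match: '*'
      host: quickwit.logging.svc    # placeholder: your Quickwit service
      port: 7280                    # Quickwit's default REST port
      uri: /api/v1/k8s-logs/ingest  # placeholder index id 'k8s-logs'
      format: json_lines
```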
Quickwit's read performance is incredible because of its amazing indexing engine (kind of like Lucene rewritten in Rust), and the storage is very cheap, with no limits other than your cloud provider's capabilities.
That being said, if you want set-it-and-forget-it, Loki is about as resource-friendly as you can get without spending big $$$ to maintain it.
ELK is a massive resource hog and is best kept in the cloud, but if storage and compute matter less to you than the search experience, then ELK is unbeatable.
We do log collection and per service log rate limiting via https://github.com/NinesStack/logtailer to make sure we don’t blow out the budget because someone deployed with debug logging enabled. Fluentbit doesn’t support that per service. Logs are primarily for debugging and we send metrics separately. Rate limiting logs encourages good logging practices as well, because people want to be sure they have the valuable logs when they need them. We dashboard which services are hitting the rate limit. This usually indicates something more deeply wrong that otherwise didn’t get caught.
This logging setup gives us everything we’ve needed in seven years of production on two stacks.
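The per-service limiting itself is a simple idea - roughly a token bucket keyed by service. A rough illustration (not how logtailer actually implements it):

```python
import time
from collections import defaultdict

class ServiceRateLimiter:
    """Toy per-service token bucket: each service gets `rate` log lines/sec plus a burst."""

    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst
        self.tokens = defaultdict(lambda: burst)   # service -> available tokens
        self.last = defaultdict(time.monotonic)    # service -> last refill time

    def allow(self, service: str) -> bool:
        now = time.monotonic()
        self.tokens[service] = min(self.burst,
                                   self.tokens[service] + (now - self.last[service]) * self.rate)
        self.last[service] = now
        if self.tokens[service] >= 1:
            self.tokens[service] -= 1
            return True
        return False  # over budget: drop the line (and count it for the dashboard)

limiter = ServiceRateLimiter(rate=100, burst=200)  # ~100 lines/sec per service
if limiter.allow("checkout"):
    print("forward this log line")
```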
idk ... imho. - as always
* keep things "stupid-simple" ~ rsyslog to some centralized (linux)system
* i want something "more modern" & with a GUI ~ grafana loki
* "more capable" but still FOSS ~ ELK
* i'm enterprisy, i want "more comfort" and i want to pay for my logging-solution / for the "peace of mind" ~ splunk
* i'm making a "hell of money" with that system so it better performs well, provides a lot of insight etc. and i don't care what i pay for it ~ dynatrace
did i miss something!? ;))
just my 0.02€
Request logging should be done in a structured form. You don't need an indexing solution for this kind of request logging - it's vaguely timestamp-ordered, and that's about it. If you need to search it, it gets loaded into a structured data query engine - Spark, or BigQuery/Athena.
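A minimal sketch of what that looks like in practice - field names here are arbitrary; the point is one JSON object per request on stdout, which Spark/BigQuery/Athena can load as structured data:

```python
import json
import sys
import time

def log_request(method: str, path: str, status: int, duration_ms: float) -> None:
    # One JSON object per line, roughly timestamp-ordered - no indexing needed.
    record = {
        "ts": time.time(),
        "method": method,
        "path": path,
        "status": status,
        "duration_ms": duration_ms,
    }
    sys.stdout.write(json.dumps(record) + "\n")

log_request("GET", "/healthz", 200, 1.3)
```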
Audit logging belongs in a durable database, and it requires being written and committed before the request finishes serving - logging frameworks that dump to disk or stdout obviously fail this requirement.
Otherwise, Loki. I've also seen it used and I think it's fine. That's more "pure" logging, whereas ELK has more advanced searching/indexing/dashboards etc.