HACKER Q&A
📣 ryanisnan

What's your preferred logging stack in Kubernetes?


Hi HN,

I'm looking for advice and insight on what y'all might use for an internally hosted logging solution in Kubernetes. Currently we use a self-hosted Graylog setup, but are finding it difficult to maintain as our system grows.

Here's our current setup:

  - Multiple clusters
  - Logs aggregated to a single Graylog setup, itself running in Kubernetes
  - Logs are sent to Graylog via Fluentbit

Some problems we've had are:

  - Index management in Graylog's Elasticsearch cluster is a PITA when you have many differently shaped logs going to shared indices (managing separate indices per source is also a pain)
  - Management of MongoDB in Kubernetes is frustrating and has been a reliability challenge

I'd love for us to be able to use a hosted logging solution, but $$$, obviously. I'm aware of many other alternatives, but one of the things I've painfully learned is that a basic feature matrix only tells a very small piece of any story. The real education comes from running this type of tech and living with it through scale and various lifecycle events.

Some questions I have:

  - What logging solutions are you using in your Kubernetes environment, and how has your experience been?
  - How do you handle log retention and storage costs?

TIA


  👤 valyala Accepted Answer ✓
> What logging solutions are you using in your Kubernetes environment, and how has your experience been?

We have been storing all the logs from all the containers running in our Kubernetes clusters in VictoriaLogs for the last year. It works smoothly and uses very little RAM and disk space. For example, one of our VictoriaLogs instances holds 2 terabytes of logs while using 35 gigabytes of disk space and 200MB of RAM on average.

> How do you handle log retention and storage costs?

VictoriaLogs provides a single clear command-line flag for limiting disk space usage - `-retention.maxDiskSpaceUsageBytes`. It automatically removes the oldest logs when disk space usage reaches the configured limit. See https://docs.victoriametrics.com/victorialogs/#retention-by-... .
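
For reference, here is a rough Python sketch of pulling logs back out over HTTP. It assumes a single-node instance on localhost:9428 (the default HTTP port) and the LogsQL query endpoint; adjust the address and the query to your own setup and field names:

    # Rough sketch: querying VictoriaLogs over HTTP with LogsQL.
    # Assumptions: single-node instance on localhost:9428 (default HTTP port)
    # and the /select/logsql/query endpoint; field names depend on your shipper.
    import json
    import requests

    VLOGS_URL = "http://localhost:9428/select/logsql/query"

    def query_logs(logsql, max_lines=10):
        """Run a LogsQL query and return up to max_lines matching entries."""
        resp = requests.get(VLOGS_URL, params={"query": logsql}, timeout=30)
        resp.raise_for_status()
        # The response body is newline-delimited JSON, one log entry per line.
        lines = [json.loads(l) for l in resp.text.splitlines() if l]
        return lines[:max_lines]

    for entry in query_logs("error"):
        print(entry)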

P.S. I may be biased, because I'm the core developer of VictoriaLogs. I recommend trying VictoriaLogs in production alongside other log management solutions and then choosing the one that best fits your particular needs from an operations, cost, usability and performance PoV.


👤 nikolay_sivko
Take a look at Coroot [0], which stores logs in ClickHouse with configurable TTL. Its agent can discover container logs and extract repeated patterns from logs [1].

[0] https://github.com/coroot/coroot

[1] demo: https://community-demo.coroot.com/p/qcih204s/app/default:Dep...


👤 jpgvm
They all sort of suck, to be honest. The least bad has actually been hosted Google Cloud Logging of late; it's just "not bad" enough to get the job done.

When I worked at Postmates we had a proprietary log search built on ClickHouse which was excellent. The same idea was also implemented concurrently at Uber (yay, multiple discovery) and is documented at a relatively high level here: https://www.uber.com/blog/logging/

If a gun were placed to my head, I would rebuild that rather than run any of the existing logging solutions.

I also spent several months building my own purpose-built log storage and indexing engine based on trigram bitmap indices for accelerated regex searches, a la CodeSearch, but I ran out of motivation to finish it, and commercialisation seemed very difficult: too much competition, even if that competition is bad. I really should get around to finishing it enough that it can at least be open-sourced.
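
For the curious, the core trick is small enough to sketch in a few lines of Python. This is just the textbook trigram idea (index every 3-character substring, intersect posting sets, then verify candidates), not my actual engine, which used bitmap postings:

    # Toy sketch of trigram indexing for accelerated substring search,
    # in the spirit of CodeSearch. Real engines use compressed bitmaps for
    # the posting sets; plain Python sets are enough to show the idea.
    from collections import defaultdict

    def trigrams(s):
        return {s[i:i + 3] for i in range(len(s) - 2)}

    class TrigramIndex:
        def __init__(self):
            self.postings = defaultdict(set)  # trigram -> set of line ids
            self.lines = []

        def add(self, line):
            line_id = len(self.lines)
            self.lines.append(line)
            for t in trigrams(line):
                self.postings[t].add(line_id)

        def search(self, literal):
            # Queries shorter than 3 chars can't use the index: full scan.
            ts = trigrams(literal)
            if not ts:
                return [l for l in self.lines if literal in l]
            # Intersect posting sets, then verify candidates (trigram hits
            # are necessary but not sufficient for a real match).
            candidates = set.intersection(*(self.postings[t] for t in ts))
            return [self.lines[i] for i in sorted(candidates) if literal in self.lines[i]]

    idx = TrigramIndex()
    idx.add("GET /healthz 200")
    idx.add("connection refused to upstream")
    print(idx.search("refused"))  # -> ['connection refused to upstream']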


👤 pradeepchhetri
I would recommend using ClickHouse which provides very efficient compression and thus reduces the size of data drastically.

Apart from that, it provides various other features:

- Dynamic data type [0], which is very useful for the semi-structured fields that logs very often contain.

- Column- and table-level TTLs [1], which provide an efficient way to configure retention (rough sketch below).
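
A rough sketch of what that looks like in practice, via the clickhouse-connect Python client. The table and column names are made up, and the sketch uses a plain Map column for attributes since the Dynamic/JSON types above may need a recent ClickHouse version:

    # Rough sketch: a ClickHouse log table with a 30-day retention TTL.
    # Hypothetical table/column names; attributes use Map(String, String)
    # here, but on recent versions the Dynamic/JSON types linked above are
    # a better fit for semi-structured log fields.
    from datetime import datetime, timezone
    import clickhouse_connect

    client = clickhouse_connect.get_client(host="localhost")  # default HTTP port 8123

    client.command("""
        CREATE TABLE IF NOT EXISTS logs (
            ts      DateTime,
            service LowCardinality(String),
            message String,
            attrs   Map(String, String)
        )
        ENGINE = MergeTree
        PARTITION BY toDate(ts)
        ORDER BY (service, ts)
        TTL ts + INTERVAL 30 DAY
    """)

    client.insert(
        "logs",
        [[datetime.now(timezone.utc), "api", "request failed",
          {"status": "500", "path": "/v1/users"}]],
        column_names=["ts", "service", "message", "attrs"],
    )

    print(client.query("SELECT count() FROM logs WHERE service = 'api'").result_rows)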

At my previous job (Cloudflare), we migrated from Elasticsearch to ClickHouse, got nearly a 10x reduction in data size and a 5x performance improvement. You can read more about it [2] and watch the recording here [3].

Recently, ClickHouse engineers published a wonderfully detailed blog post about their logging pipeline [4].

[0] https://clickhouse.com/docs/en/sql-reference/data-types/dyna...

[1] https://clickhouse.com/docs/en/engines/table-engines/mergetr...

[2] https://blog.cloudflare.com/log-analytics-using-clickhouse

[3] https://vimeo.com/730379928

[4] https://clickhouse.com/blog/building-a-logging-platform-with...


👤 Atreiden
Loki backed by S3 and queried via Grafana is a good, mostly FOSS solution. Installs pretty easily via helm and S3 gives a reasonable balance between cost, ease, and durability if you're in AWS already.
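
If you go this route, the query side is plain HTTP (the same API Grafana talks to). A rough Python sketch, assuming the default port 3100 and an `app` label that your collector actually attaches:

    # Rough sketch: querying Loki's query_range HTTP API directly.
    # Assumes Loki on localhost:3100 and an `app` stream label; adjust the
    # LogQL selector to whatever labels your collector really sets.
    import time
    import requests

    LOKI_URL = "http://localhost:3100/loki/api/v1/query_range"

    def recent_errors(app, minutes=15, limit=100):
        now_ns = time.time_ns()
        params = {
            "query": f'{{app="{app}"}} |= "error"',  # LogQL: selector + line filter
            "start": now_ns - minutes * 60 * 10**9,  # nanosecond timestamps
            "end": now_ns,
            "limit": limit,
        }
        resp = requests.get(LOKI_URL, params=params, timeout=30)
        resp.raise_for_status()
        for stream in resp.json()["data"]["result"]:
            for ts, line in stream["values"]:
                print(ts, line)

    recent_errors("my-service")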

👤 wczekalski
Loki is not bad, but PSA: don't use it with anything other than real AWS S3. The performance with MinIO is awful (and can't be good, because of how MinIO works). It might be a bit better with SeaweedFS.


👤 nullify88
I'm currently migrating from Elasticsearch to Loki; it's much simpler to run and still meets our requirements.

I think Elasticsearch had its day when it was used to derive metrics from logs and perform aggregate searches. But now that logging is often paired with metrics from Prometheus or a similar TSDB, we don't run such complex log queries anymore, and we find ourselves questioning whether it's worth running such an intensive and complex Elasticsearch installation.


👤 ineumann
Hi.

I was a big Elasticsearch user for several years. I wasn't convinced by Grafana Loki, which is far less expensive because the data is stored in object storage, but has poor read performance because it's not a real search engine.

Then I discovered Quickwit, which combines the advantages of both worlds. With Quickwit you can ingest logs the way you're already used to: through OTLP/gRPC, with a log collector like Fluentbit or Vector that can pick up the stdout of your pods and forward it to Quickwit using the HTTP API, etc.

And you can then use Grafana with pretty much the same features available for the Elasticsearch datasource.

https://quickwit.io/docs/log-management/send-logs/using-flue...
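
As a concrete example of the HTTP path, here is a rough Python sketch of pushing NDJSON documents into a Quickwit index. The index name is hypothetical, and in practice Fluentbit/Vector would do this for you:

    # Rough sketch: pushing log documents into Quickwit's ingest API.
    # Assumptions: Quickwit on localhost:7280 and an existing index called
    # "k8s-logs"; normally a collector like Fluentbit or Vector does this.
    import json
    from datetime import datetime, timezone
    import requests

    QUICKWIT = "http://localhost:7280"
    INDEX = "k8s-logs"  # hypothetical index id

    docs = [
        {"timestamp": datetime.now(timezone.utc).isoformat(), "level": "error",
         "pod": "api-7f9c", "message": "upstream timeout"},
        {"timestamp": datetime.now(timezone.utc).isoformat(), "level": "info",
         "pod": "api-7f9c", "message": "request served"},
    ]

    # The ingest endpoint takes newline-delimited JSON, one document per line.
    body = "\n".join(json.dumps(d) for d in docs)
    resp = requests.post(f"{QUICKWIT}/api/v1/{INDEX}/ingest", data=body, timeout=30)
    resp.raise_for_status()
    print(resp.json())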

The read performance of Quickwit is incredible because of its amazing indexing engine (which is kind of like Lucene rewritten in Rust), and the storage is very cheap, with no limitation other than your cloud provider's capabilities.

https://quickwit.io/blog/quickwit-binance-story


👤 n_ary
Careful with Loki if you ever plan to export logs out of it. It can only export a limited amount of logs over a very limited time range, and searching across a long time range is a PITA if your log volume is medium to high.

That being said, if it's set-up-and-forget, Loki is about as resource-friendly as you can get without spending big $$$ to maintain it.

ELK is a massive resource hog and is best kept in the cloud, but if storage and compute matter less to you than search experience, then ELK is unbeatable.


👤 relistan
Since 2017, at two different companies, I’ve sent logs via UDP to Sumo Logic, via their collector hosted in our cluster. Sumo Logic is reasonably priced, super powerful, easy to use, and really flexible. Can’t recommend it enough.

We do log collection and per-service log rate limiting via https://github.com/NinesStack/logtailer to make sure we don’t blow out the budget because someone deployed with debug logging enabled. Fluentbit doesn’t support that per service. Logs are primarily for debugging, and we send metrics separately. Rate limiting logs encourages good logging practices as well, because people want to be sure they have the valuable logs when they need them. We dashboard which services are hitting the rate limit; this usually indicates something more deeply wrong that otherwise didn’t get caught.
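
The idea is simple enough to sketch. This toy per-service token bucket is not logtailer’s code, just the general shape of the thing:

    # Toy sketch of per-service log rate limiting (token bucket), illustrating
    # the idea rather than logtailer's implementation: each service gets a
    # lines-per-second budget, and anything over it is dropped and counted.
    import time
    from collections import defaultdict

    class LogRateLimiter:
        def __init__(self, lines_per_sec, burst):
            self.rate = lines_per_sec
            self.burst = burst
            self.tokens = defaultdict(lambda: burst)  # service -> available tokens
            self.last = defaultdict(time.monotonic)   # service -> last refill time
            self.dropped = defaultdict(int)           # service -> dropped line count

        def allow(self, service):
            now = time.monotonic()
            elapsed = now - self.last[service]
            self.last[service] = now
            self.tokens[service] = min(self.burst, self.tokens[service] + elapsed * self.rate)
            if self.tokens[service] >= 1:
                self.tokens[service] -= 1
                return True
            self.dropped[service] += 1  # dashboard this to spot noisy services
            return False

    limiter = LogRateLimiter(lines_per_sec=100, burst=200)
    for _ in range(500):
        if limiter.allow("my-service"):
            pass  # forward the line to the collector
    print("dropped:", limiter.dropped["my-service"])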

This logging setup gives us everything we’ve needed in seven years of production on two stacks.


👤 t312227
hello,

idk... imho, as always:

* keep things "stupid-simple" ~ rsyslog to some centralized (Linux) system

* i want something "more modern" & with a GUI ~ grafana loki

* "more capable" but still FOSS ~ ELK

* i'm enterprisy, i want "more comfort" and i want to pay for my logging-solution / for the "peace of mind" ~ splunk

* i'm making a "hell of money" with that system so it better performs well, provides a lot of insight etc. and i don't care what i pay for it ~ dynatrace

did i miss something!? ;))

just my 0.02€


👤 GauntletWizard
Stop using logging. You're using logging wrong, and there is no using it right. Logging (unqualified) is for temporary debugging data. It shouldn't go anywhere or be aggregated unless you need to be debugging, and then it should go to the machine of the developer doing the debugging.

Request logging should be done in a structured form. You don't need an indexing solution for this kind of request logging - it's vaguely timestamp-ordered, and that's about it. If you need to search it, it gets loaded into a structured data query engine - Spark, or BigQuery/Athena.
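
For what it's worth, "structured form" can be as simple as one JSON object per request on stdout, which any of those engines can load later without parsing. A minimal sketch with made-up field names:

    # Minimal sketch of structured (JSON-per-line) request logging to stdout.
    # Field names are arbitrary; the point is that each line is already a
    # flat, queryable record for Spark/BigQuery/Athena to load.
    import json
    import sys
    import time
    import uuid

    def log_request(method, path, status, duration_ms, **extra):
        record = {
            "ts": time.time(),
            "request_id": str(uuid.uuid4()),
            "method": method,
            "path": path,
            "status": status,
            "duration_ms": round(duration_ms, 2),
            **extra,
        }
        sys.stdout.write(json.dumps(record) + "\n")

    log_request("GET", "/v1/users/42", 200, 12.7, user_agent="curl/8.5")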

Audit logging belongs in a durable database, and it needs to be written and committed before the request finishes serving - logging frameworks that dump to disk or stdout obviously fail this requirement.


👤 YZF
Filebeat + ELK stack is pretty good. You can easily run filebeat as a daemonset and have it detect all your pods and logs. This is what I'm using right now.

Otherwise Loki. I've also seen it used and I think it's fine. That's more "pure" logging, whereas ELK has more advanced searching/indexing/dashboards etc.


👤 cyberpunk
EFK. Loki just sucks if you’re used to Kibana searches.

👤 loosescrews
I'm not entirely sold on it yet, but Quickwit seems to be the current trendy solution.

👤 mikeshi42
We pull in everything across our AWS infra via OTel collectors (metrics/logs/traces) and forward it to HyperDX (self-hosted, with storage backed by ClickHouse). You'll find that ClickHouse is a ton more efficient than Elastic when it comes to observability use cases, which helps a lot with keeping costs under control. There's typically less schema management needed on ClickHouse since it has more flexible map types for chaotic structured logs. The OTel collector is also very flexible in adding filtering rules to throw out noisy messages.

👤 vbezhenar
I'm using Loki with S3 storage (not AWS; OpenStack Swift that my hoster sells). Can't say I'm amazed: logcli is not pleasant to use, the Grafana integration is very bare-bones, and I spent more time than I'd like making it work smoothly, but in the end it works, so no big issues either. Log retention is configured in Loki and storage costs are low; it compresses things well.

👤 wanderingmind
Can someone ELI5 why logging with Loki is better than using a database like SQLite or PostgreSQL?

👤 nijave
Has anyone tried running SigNoz? I see a lot of comments about ClickHouse, and they offer an observability stack built on ClickHouse.

👤 karmajunkie
is anyone else using cloudwatch? we aren’t logging huge amounts but i can’t tell if i’m missing something from the rest of this thread…

👤 rootsu
Self-hosted Loki + Traces using Tempo + Grafana.

👤 endre
one word: axoflow