HACKER Q&A
📣 tusharsoni

How do you manage logs for your backend services?


I have been working on a few side projects that usually have Go backend services. Right now, they log straight to a file and it works just fine since I only have a couple of instances running.

If I were to scale this to more instances/servers, I would need a centralized logging service. I've considered ELK, but it's fairly expensive whether I self-host or buy a managed subscription.

Any advice?


  👤 viraptor Accepted Answer ✓
People jumped to recommending things before asking:

- What's the volume of logs?

- What's the expected usage? (Debugging? Audit?)

- Are you using text or structured logs? (Or can you switch to structured easily?)

- How far back do you want to keep the logs? How far back do you want to query them easily?

- Are you in an environment with an existing simple solution (GCP, AWS)?
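
On the structured-logs question: since the OP's backends are in Go, switching from plain text to structured output can be a one-line handler change using the standard library's log/slog (Go 1.21+). A minimal sketch:

  package main

  import (
      "log/slog"
      "os"
  )

  func main() {
      // Text logs: human-readable, but harder to query centrally.
      text := slog.New(slog.NewTextHandler(os.Stderr, nil))
      text.Info("request handled", "path", "/search", "status", 200)

      // Structured JSON logs: what most aggregators want. Swapping the
      // handler is the only change the call sites need.
      json := slog.New(slog.NewJSONHandler(os.Stdout, nil))
      json.Info("request handled", "path", "/search", "status", 200)
  }

JSON output like this also feeds neatly into most of the tools recommended below.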


👤 stickfigure
Google's Stackdriver.

I've been using Google App Engine for some ten years now, and I'm dumbfounded that this is still an ongoing struggle on other platforms. It collates logs from a variety of sources, presents requests as a single unit, has sophisticated searching capabilities, and the UI doesn't suck.

Best of all, it just works... there's zero configuration on GAE. Such a time saver.


👤 WinonaRyder
We love JSON logs and previously just sent most of them to systemd's journald, using a custom tool to view them. But maybe a year ago Grafana released https://github.com/grafana/loki and we've been using it on https://oya.to/ ever since.

IIRC, the recommended way to integrate it with Grafana is via promtail, but we weren't too keen on the added complexity of yet another service in the middle, so we developed a custom client library in Go to just send the logs straight to Loki (which we should probably open source at some point).

I don't think there's any fancy graph integration yet, but the Grafana Explore tab with log level/severity and label filtering works well enough, especially since they introduced support for pretty-printed JSON log payloads.
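
For a sense of what "send the logs straight to Loki" involves, here's a minimal sketch against Loki's HTTP push API (POST /loki/api/v1/push). The URL and labels are placeholders, and a real client would batch and retry:

  package main

  import (
      "bytes"
      "encoding/json"
      "fmt"
      "log"
      "net/http"
      "time"
  )

  // pushToLoki sends a single log line to Loki's HTTP push endpoint.
  func pushToLoki(lokiURL, line string, labels map[string]string) error {
      payload := map[string]any{
          "streams": []map[string]any{{
              "stream": labels, // index labels, e.g. {"app": "myapp"}
              "values": [][]string{
                  {fmt.Sprintf("%d", time.Now().UnixNano()), line},
              },
          }},
      }
      body, err := json.Marshal(payload)
      if err != nil {
          return err
      }
      resp, err := http.Post(lokiURL+"/loki/api/v1/push", "application/json", bytes.NewReader(body))
      if err != nil {
          return err
      }
      defer resp.Body.Close()
      if resp.StatusCode >= 300 {
          return fmt.Errorf("loki push failed: %s", resp.Status)
      }
      return nil
  }

  func main() {
      err := pushToLoki("http://localhost:3100", `{"level":"info","msg":"hello"}`,
          map[string]string{"app": "myapp"})
      if err != nil {
          log.Fatal(err)
      }
  }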


👤 jjjensen90
I highly recommend Datadog's logging platform. One important lesson I've learned in my career is never to run your own observability platform if you can pay someone else (whose entire product is observability) to do it for you.

I've used ELK (managed and hosted), Splunk, New Relic, Loki, and home-grown local/cloud file logs, and nothing has been as cheap, easy, and powerful as Datadog. They charge per million log events indexed, but also allow you to exclude log events by pattern/source/etc.; they ingest but ignore those rows (you pay $0.10/GB for those ignored logs).

The 12-factor way to do logging is very easy with Datadog: you can tell the agent to ingest from stdout, containers, or file sources, and the application stays agnostic to the log aggregator, since the agent collects and sends logs to the platform.

Not only is it cheap and easy to set up, it also gives you the option to take advantage of the other Datadog features that can be built on your log data. Metrics based on log parsing, alerts, configuration via Terraform, etc. become possible when you ship your logs to their platform.

I've seen our production apps log 10-20k messages per second without Datadog breaking a sweat, but I'm not sure if they have any limits.


👤 johntash
I don't have a lot of experience with it yet, but Loki looks promising for small projects. You'd still use it as a centralized logging server, but it's not as resource-hungry as something like self-hosted ELK.

I've only been using it for my homelab, and haven't even moved everything to it yet - but I like it so far. I already use Grafana+influxdb for metrics so having logs in the same interface is nice.

https://grafana.com/oss/loki/


👤 JdeBP
ELK is expensive, in terms of hardware and time to configure/manage. But it does scale to large volumes in a way that your current ad hoc 1970s logging will not.

I had a good experience with a local decentralized logging system, essentially daemontools-style service logs, that then fed into ELK. ELK provided the bulk storage and analysis, while the local daemontools logs provided immediate on-machine access to each individual service's recent logs and decoupled logging from the network connection to Logstash.

* http://jdebp.uk./Softwares/nosh/guide/commands/export-to-rsy...

* http://jdebp.uk./Softwares/nosh/guide/commands/follow-log-di...

One of the advantages of this approach is that one can do the daemontools-style logging first, very simply, without centralization, and with comparatively minor expense; and then tack on ELK later, when volume or number of services gets large enough, without having to alter the daemontools-style logging when doing so.

Of course, it can be something other than ELK, fed from the daemontools-style logs.

One thing that I recommend against, ELK or no, is laboriously re-treading the path of the late 1970s and the 1980s, starting from that "logging straight to a file". Skip straight through to the 1990s. (-:

* http://jdebp.uk./FGA/do-not-use-logrotate.html


👤 iDemonix
Graylog is a really nice free product, and although it can look a bit scary, it's not that hard to get set up - especially since the introduction of the Elasticsearch REST API, meaning you no longer have to make Graylog join the ES cluster as a non-data node.

You can spin it up on a single machine with ES and start playing with it. I usually forward all of my logs to rsyslog, which then duplicates the logs out - they go to flat-file storage, and to Graylog for analysis.


👤 defanor
For POSIX systems, syslog is the standard way. For systemd-based ones, journald may be preferable because of its additional features; both support sending logs to a remote server. I'd suggest avoiding custom logging facilities (including just writing to a file) whenever possible, since maintaining many services that each use custom logging becomes quite painful.
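
In Go this is a small change: the standard library's log/syslog can write to the local daemon or a remote collector. A minimal sketch, with a placeholder address:

  package main

  import (
      "log"
      "log/syslog"
  )

  func main() {
      // Connect to a remote syslog collector over UDP.
      w, err := syslog.Dial("udp", "logs.example.com:514",
          syslog.LOG_INFO|syslog.LOG_DAEMON, "myservice")
      if err != nil {
          log.Fatal(err)
      }
      defer w.Close()

      w.Info("service started")
      w.Err("something went wrong")
  }

(Pass an empty network and address to syslog.Dial to talk to the local daemon instead.)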

👤 thraxil
We run on GCP and I have to say that Stackdriver Logging with the google fluentd agent is actually pretty good and relatively cheap. I don't like Stackdriver's metrics at all, but Logging feels more like part of GCP. Fluentd has given me far fewer problems than logstash or filebeat did when we were running an ELK setup. The search UI is obviously nowhere near as nice as something like Kibana, but it gets the job done. If you aren't on GCP, it's not worth it, but if you are, the whole setup is "good enough" that you might not need to set something more sophisticated up (I'm still looking at Loki though because I can't help myself).

👤 particlesplus
Papertrail/Timber.io - the cheapest way to aggregate logs, with simple search functionality.

Scalyr - my personal favorite. Just a little more costly than Papertrail, but can do as much as any full service SaaS - powerful queries, dashboards, and alerts. Takes some practice to learn, but their support is very helpful.

Sumologic - a fully featured log aggregator. It works pretty well, but their UI is super duper annoying: you have to use the tabbing system on their page; you can't open multiple dashboards/logs in multiple browser tabs. For the money, I personally preferred Scalyr, but this is a reasonable option.

Splunk - a great place to $plunk down your $$$$. I think their cheapest plan was $60k/yr, but I will admit that it was easy to get going and use and also had the most features. It's not a bad bang for your buck as long as you have lots of bucks to spend.


👤 natebutler
We use Flume forwarding to S3 and then Athena to query the logs. Flume processes each logfile with Morphline (which is akin to Logstash) and parses each raw log into JSON before pushing to S3.

We used to run an ELK stack but hit a bottleneck crunching logs with Logstash. We found Flume's Morphline to be performant enough, and a nice property of Flume is that you can fan out and write to multiple datasources.

It's ironic, but because Athena is kind of a flaky product (lots of weird Hive exceptions when querying lots of data), and because it's really only good at searching if you know what you're looking for, we're considering having Flume write to an Elasticsearch cluster (while still persisting logs to long-term storage on S3).


👤 l0b0
IMO ELK is worth it. All the filtering, sorting, and graphing means you can easily do post-mortems (which makes you more likely to do them at all), you can get detailed performance and other metrics without spending days setting up robust testing, and it makes it easy to correlate events from your entire infrastructure. Just make sure NTP is enabled everywhere :)

I would budget some time after setting it up to weed out uselessly verbose logging and to rotate old logs out of RAM and onto cheap storage. You'll love it.


👤 zygy
We use and like https://www.scalyr.com/.

👤 rubyn00bie
I was just about to start looking into doing this myself, and for the foreseeable future I'll probably just use `dsh`... since I'm a cheapskate, have been trying to reduce my usage of cloud tools, and I just found out about it today:

https://www.netfort.gr.jp/~dancer/software/dsh.html.en

Once installed, change the default from rsh to ssh in the config file wherever it's installed, e.g. `/usr/local/Cellar/dsh/0.25.10/etc/dsh.conf`

Then set up a group of machines; in this case I'm calling it "web":

  mkdir -p .dsh/group
  echo "some-domain-or-ip" >> .dsh/group/web
  echo "some-domain-or-ip-2" >> .dsh/group/web

Then fire off a command:

  dsh -M -g web -c -- tail -f /var/log/nginx/access.log

  some-domain-or-ip [... log message here ...]
  some-domain-or-ip-2 [... log message here ...]

The flags I used are:

- -M: show machine name

- -g: group to use

- -c: run concurrently

That's about as easy as I can think of... /shrug. While I like the idea of centralized logging services, I haven't really found one I actually cared for... most just run rampant with noise, slow UIs, and strange query languages no one wants to learn. I guess I could start a machine up with `dsh` on it in my cluster and then write the output from dsh to a file... easy centralized logging on the cheap, ha!


👤 peterwwillis
Presumably you're running your service as a container, and presumably you only run one master process per container. Print your logs to stdout/stderr, and then you can use any log-capturing mechanism (fluent bit?) to stream them to any log-indexing system.

Tweak your system so you can selectively turn on log collection, and collect useful metrics 24/7. Logs almost never matter until you're diagnosing an active problem, anyway, and then only a few logs will be necessary. Differentiating the type and quality of particular log messages is also very useful.
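
One way to get that "selectively turn on log collection" behavior in a Go service is a runtime-adjustable level via log/slog's LevelVar (Go 1.21+); how you trigger the change (signal, admin endpoint, config watch) is up to you. A sketch:

  package main

  import (
      "log/slog"
      "os"
  )

  func main() {
      // The zero value of LevelVar is Info, so Debug lines are dropped by default.
      var level slog.LevelVar
      logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{Level: &level}))

      logger.Debug("dropped at the default Info level")

      // Flip to Debug while diagnosing an active problem...
      level.Set(slog.LevelDebug)
      logger.Debug("now visible")
  }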


👤 newjobseeker
I used Papertrail at my last job; the search function works well, and it was easy to use.

👤 lokar
You should first consider what you need logging for.

Is it just ad-hoc text "debug" logs? Structured transaction or request logs to feed some sort of analytics? This impacts the backend trade-offs quite a bit.

Are you trying to do monitoring via logs? Don't. Export metrics via Prometheus (or something like it); it's much cheaper and more dependable than extracting metrics via a log collection system.
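
For reference, exporting a metric with the official Go client (github.com/prometheus/client_golang) takes only a few lines; the metric name below is illustrative:

  package main

  import (
      "log"
      "net/http"

      "github.com/prometheus/client_golang/prometheus"
      "github.com/prometheus/client_golang/prometheus/promhttp"
  )

  var requestsTotal = prometheus.NewCounterVec(
      prometheus.CounterOpts{
          Name: "myapp_requests_total",
          Help: "Requests handled, by status code.",
      },
      []string{"status"},
  )

  func main() {
      prometheus.MustRegister(requestsTotal)
      requestsTotal.WithLabelValues("200").Inc() // instead of grepping logs for 200s

      http.Handle("/metrics", promhttp.Handler())
      log.Fatal(http.ListenAndServe(":9100", nil))
  }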


👤 whatsmyusername
If you haven't read the chapter of 12factor on logging, I highly recommend it: https://www.12factor.net/logs

This is coming from an ops person: do that and I'll be happy. Essentially, the goal is to externalize all your log routing - write to stdout, then wrap tooling around your application to route the logs wherever you want them to go. It's geared toward Heroku, but the same rules apply in Docker land and in more traditional VM environments.

We either send logs ECS -> CloudWatch for our AWS stuff, or Docker Swarm -> fluentd -> S3 for an onsite appliance (also, anything syslog ends up in Sumologic through a collector we host in the environment). From there the logs get consumed by our SIEM (in our case Sumologic, which is a great service). We keep logs hot for 90 days and ship archives back to S3 for a year. Set up the lifecycle management stuff; keeping log files forever is not only a waste but can actively hurt you if they ever get exposed in a breach.

I highly recommend formatting your logs as JSON (even if you have to do it by hand, like in Apache). If you do that and go to Splunk, Sumologic, or ELK, all your fields will be populated either automatically or with a single filter. It saves writing or buying your own parsing, and if you add a field there's no action for you to take.

nginx/apache default logging is complete trash. Look at the variables they expose; there's a lot of stuff in there you'll want to add to the log format to make your life loads easier. I have a format I use that I'd be willing to send you if you want it.

I don't recommend Logstash (the L in ELK), ever (except maybe if you're Java across the board). It's way too damn heavy to run on a workload host; fluentd is much lighter (and not Java - why would I ever deploy Java for a system tool?). Maybe as a network collector you throw syslog at, but that would be it.

For your use case, Sumologic's free service would be great. You can get, I think, up to 200m a day with a week's retention for free, and you'll get exposed to what a SIEM can do for you (ingestion rate and retention period are typically how hosted solutions are billed; you'll need an email address on a non-free email domain to get a free account from them). IMO you have to get to some fairly insane log rates before I'd ever recommend running the ELK stack yourself; it needs way too much care and feeding if you want to run it correctly with good security.


👤 deepersprout
Keep logs local, and send only errors and noteworthy events to a central server like sentry.io, rollbar, or something else.

If you log everything to a single server, as you noticed yourself, it will become very expensive, and it gets difficult to filter for the stuff that you really need when looking for errors or trying to hunt down a specific problem.
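
With the official Go SDK (github.com/getsentry/sentry-go), "send only the errors" can look like the sketch below; the DSN and the failing call are placeholders:

  package main

  import (
      "errors"
      "time"

      "github.com/getsentry/sentry-go"
  )

  func doWork() error { return errors.New("payment provider timeout") }

  func main() {
      // DSN is a placeholder; routine logs stay in local files.
      err := sentry.Init(sentry.ClientOptions{Dsn: "https://key@o0.ingest.sentry.io/0"})
      if err != nil {
          panic(err)
      }
      defer sentry.Flush(2 * time.Second)

      if err := doWork(); err != nil {
          sentry.CaptureException(err) // only noteworthy events leave the box
      }
  }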


👤 bradstewart
When running on AWS, I generally just use CloudWatch. For non-AWS hosts, and/or when I need something more feature rich: DataDog is a solid hosted service with reasonable pricing.

👤 sethammons
It is expensive. We forward logs to Splunk (we run our own instances). Splunk is really solid. All the logs are JSON and require certain fields. We use it for trend analysis, alerting, graphs, reports, and digging into production issues. It digs through terabytes of data relatively quickly.

👤 cmclaughlin
AWS CloudWatch Logs has come a long way. The new Insights UI is great. No need for us to manage ELK for logs anymore.

👤 lycidas
For cost, Stackdriver works the best. I don't know if they have a custom agent, but Fluentd works great to ship logs to your platform of choice.

AWS CloudWatch is also good for cost but has much slower query speeds (the slowness makes me think it's not Lucene-based?).

If you can splurge on hosted services, my favorite for logging is Datadog, and it has all the other bits of observability built in for down the road.

https://landscape.cncf.io/ is usually my go-to if you wanna find best-in-class solutions to host yourself.


👤 geewee
I've had the misfortune of setting up Application Insights for logging across a distributed system.

It's awful. The integration with most (C#) logging frameworks is horrible, and the adaptive sampling, which is hard to turn off, means that Application Insights randomly drops logs, making any sort of distributed tracing of events really difficult.

To top it off, there's a delay of 5-10 minutes from when the logs are written until they're queryable, which is a huge pain when debugging your setup.


👤 mamcx
Also:

- Is there a tool that allows navigating structured logs easily, without bringing in heavy machinery like the ELK stack, and that works in the terminal?

- One that also filters to only ERRORS and a few lines above/below?


👤 ariel_coralogix
DISCLAIMER - I'm the CEO of Coralogix.com.

ELK can be pretty expensive and is kind of a pain to manage. For simple use cases, though, hosted ELK on AWS should do the trick and wouldn't cost too much. Small startups and dev shops should choose a SaaS logging tool, since most of them start at $10-$30, which is cheaper than anything you'll spin up on your own.

Looking at the market right now, it looks like logs, metrics, and SIEM are going to combine in the next 2-3 years.


👤 lmeyerov
for the endpoint itself, consider switching to syslog so you get a bunch of stuff for free (auto-rotation, docker logging, ...) and it's easier to change decisions later (pipe to splunk/elk/...). it's thought out and pretty easy!

👤 zMiller
fluentd is great. You can set up forwarding nodes that relay logs to one or multiple masters, which then persist into whatever layer(s) you want. Tolerance and failover are baked in. There are tons of connectors, and best of all, the fluentd log driver ships with Docker, so it's almost zero setup to get your container logs to fluentd. It also works nicely with Kubernetes!
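
If you'd rather push from the app than go through the Docker log driver, there's also an official Go client (github.com/fluent/fluent-logger-golang); a minimal sketch, assuming a forwarder on the default port:

  package main

  import (
      "log"

      "github.com/fluent/fluent-logger-golang/fluent"
  )

  func main() {
      f, err := fluent.New(fluent.Config{FluentHost: "127.0.0.1", FluentPort: 24224})
      if err != nil {
          log.Fatal(err)
      }
      defer f.Close()

      // Tag and fields are illustrative; fluentd routes on the tag.
      err = f.Post("myapp.access", map[string]string{
          "method": "GET",
          "path":   "/search",
      })
      if err != nil {
          log.Fatal(err)
      }
  }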

👤 ekimekim
My main advice is avoid ELK. I have no clue how Elastic managed to convince the world that Elasticsearch should be the default log database when it is _terrible_ for logs.

If you're logging structured JSON, then you'll hit a ton of issues - Elasticsearch can't handle, say, one record with {foo: 123} and another with {foo: "abc"} - it'll choke on the different types and 400 error on ingest.

Even if you try to coerce values to string, you'll hit problems with nested values like {foo: "abc"} vs {foo: {bar: "baz"}}. So now you have to coerce to something like {"foo.bar": "baz"}, and then you have to escape the dot when querying...
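
To illustrate the kind of coercion being described, here's a hypothetical flatten helper that stringifies nested values into dotted keys - which, as noted, just moves the problem to query-time escaping:

  package main

  import "fmt"

  // flatten collapses arbitrarily nested JSON-ish maps into one level of
  // string-valued fields ("foo.bar": "baz") to avoid type conflicts.
  func flatten(prefix string, in map[string]any, out map[string]string) {
      for k, v := range in {
          key := k
          if prefix != "" {
              key = prefix + "." + k
          }
          if nested, ok := v.(map[string]any); ok {
              flatten(key, nested, out)
          } else {
              out[key] = fmt.Sprintf("%v", v)
          }
      }
  }

  func main() {
      out := map[string]string{}
      flatten("", map[string]any{"foo": map[string]any{"bar": "baz"}, "n": 123}, out)
      fmt.Println(out) // map[foo.bar:baz n:123]
  }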

Finally, if you solve all the above, you'll hit problems with high cardinality of unique fields. Especially if you are accepting arbitrary nested values, at some point someone is going to log something like {users_by_email: {<email>: <user>}} and now you have one field for every unique email...

These problems are tractable but a massive hassle to get right without majorly locking down what applications are allowed to log in terms of structured data.

As a separate issue, Elasticsearch does fuzzy keyword matching by default, e.g. if you search for "foo" you'll get results for "fooing" and "fooed" but not necessarily for "foobar" (because it's splitting by word - and the characters it considers part of one "word" aren't obvious). This is great if you want to search a product catalog, but horrible when you're trying to find an exact string in a technical context. Yes, you can rephrase your query to avoid it, but that's not the default, and most people won't know how to structure their query perfectly to avoid all the footguns.

Finally, as others are saying here, Elasticsearch is just painful and heavy to manage.

As for what to use instead...I don't have good answers. I haven't exhaustively checked out all the other products being mentioned, but in my experience a lot of them will have similar issues around field cardinality, which means it'll always be possible to cripple your database with bad data. This is less of an issue if you're just running a few services, but in larger orgs it can be nigh impossible to keep ahead of.

For smaller scale deployments, don't underestimate just shipping everything to timestamp+service named files as newline-delimited JSON, and using jq and grep for search and a cronjob to delete/archive old files.
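
A sketch of that low-tech approach in Go (file naming and fields are hypothetical); jq, grep, and a deletion cron job handle the rest:

  package main

  import (
      "encoding/json"
      "fmt"
      "os"
      "time"
  )

  // writeEvent appends one newline-delimited JSON record to a
  // service+date named file, ready for jq/grep later.
  func writeEvent(service string, event map[string]any) error {
      name := fmt.Sprintf("%s-%s.ndjson", service, time.Now().UTC().Format("2006-01-02"))
      f, err := os.OpenFile(name, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
      if err != nil {
          return err
      }
      defer f.Close()

      event["ts"] = time.Now().UTC().Format(time.RFC3339Nano)
      return json.NewEncoder(f).Encode(event) // Encode adds the trailing newline
  }

  func main() {
      if err := writeEvent("api", map[string]any{"level": "info", "msg": "request handled"}); err != nil {
          panic(err)
      }
  }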

When it comes to the "read from local source and ship elsewhere" component, I've had the best luck with filebeat (specifically for files -> kafka). Most others tend to read as fast as they can then buffer in memory or disk if they can't write immediately, whereas filebeat will only read the source file as fast as it can write downstream.

Note, however, that all such components are awful to configure, as they provide an (often Turing-complete) configuration language for transforming your logs before shipping them, and like most Turing-complete configuration scripts, the result is less readable and more buggy than the equivalent would've been in any real programming language.

Ok, rant over. Sorry, a good logging system is kind of my white whale.


👤 markpapadakis
Our (https://bestprice.gr/) services/“programs” generate three different types of events:

- Short events (no longer than ~1k in size) where the cost to generate and transmit is very low (sent to a dedicated service via UDP). We can generate dozens of them before we need to care about the cost of doing so. They are binary-encoded. The service that receives those datagrams generates JSON representations of those events and forwards them to all connected clients (firehose), and also publishes them to a special TANK partition.

- Events of arbitrary size and complexity, often JSON-encoded but not always -- they are published to TANK (https://github.com/phaistos-networks/TANK) partitions.

- Timing samples. We capture timing traces for requests that take longer than expected to process, and random samples from various requests for other reasons. They capture the full context of a request (complete with annotations, hierarchies, etc.). These are also persisted to TANK topics.

So, effectively, everything’s available on TANK. This has all sorts of benefits. Some of them include:

- We can, and do, have all sorts of consumers that process those generated events, looking for interesting/unexpected events and reacting to them (e.g. notifying whoever needs to know about them, etc.)

- It’s trivial to query those TANK topics using `tank-cli`, like so:

  tank-cli -b <broker> -t apache/0 get -T T-2h+15m -f "GET /search" | grep -a Googlebot

This will fetch all events starting 2 hours ago, for up to 15 minutes later, that include "GET /search".

All told, we are very happy with our setup, and if we were to start over, we’d do it the same way again.


👤 solatic
Allow me to rep the tool I help build: Coralogix, which is a managed log analytics service. You haven't said what your budget is, but our pricing starts at $15/month to handle 5 GB/month of logs - certainly cheaper than running ELK yourself.

https://coralogix.com


👤 james_s_tayler
I've recently spent a few days adding proper instrumentation to a .NET Core side project using CloudWatch Logs.

I don't know anything about the Go ecosystem, but if there is a nice solution for structured logging, then CloudWatch Logs is very easy to implement and very cheap, you can easily make decent dashboards with it, and if needed in the future you can forward the logs on to Elasticsearch.

I'm using a library called Serilog in my project to log everything as consistent structured logs that get all kinds of metadata automatically appended, with the JSON payload winding up in CloudWatch Logs. Then I've got a couple of custom metrics to measure throughput, as well as some log filter metrics to track latency of the service and its downstream dependencies.

It works very well and was surprisingly quick to put together. I believe cost wise my usage level is covered pretty much for free. Can't complain!


👤 fiveguys94
We used to use an ELK cluster, but it was always breaking - I'm sure this stuff can be reliable, but we just wanted an easy way to search ~300 GB of logs (10 GB/day).

Somehow I came across Scalyr, and it's just phenomenally fast - and it cost less than our ELK cluster. Definitely worth trying if it provides the features you need.


👤 gingerlime
I used Logentries for a while. It's essentially a hosted ELK. I think they got bought out or something, and the service went downhill. I don't remember exactly, but it was slow and clunky, and support wasn't great either.

Then I discovered Scalyr. It was awesome. The UI isn't pretty, but it's super powerful. It's fast. You have to format your logs to make the most of it, but it's worth it.

Unfortunately their alerting wasn't as flexible at the time (for example, it wouldn't include contextual info from the matching logs that triggered the alert). Besides that, we decided to consolidate things and move to Datadog.

Datadog is pretty great. The monitoring and alerting features are solid. They then added logging, APM and keep adding more. It's not that cheap, but overall works great for us.


👤 enobrev
I'm pretty late to this party, but I've been using rsyslog and "EK" (skip the L - it's way too slow and resource hungry).

rsyslog / syslog-ng handles shipping logs to a central server, and it's dead simple to keep local logs and a central log at the same time. Every language can spit logs to syslog very quickly. And then you can use plugins to inject your logs from rsyslog directly into Elasticsearch, which is incredibly fast.

Other critiques of ES still apply, especially when it comes to managing conflicting keys in structured logs, but most complaints about fragility and scaling are because of Logstash, which, I agree, is terrible for logging.

I've written this up in detail if anyone is interested.


👤 jorgelbg
At trivago, we rely on the ELK stack to support our logging pipeline. We use a mix of filebeat and Gollum (github.com/trivago/gollum) to write logs into Kafka (usually encoded using Protocol Buffers); these are later read by Logstash and written into our ELK cluster.

For us, ELK has scaled quite well, both in terms of queries/s and writes/s, and we ingest ~1.5 TB of logs daily just in our on-premise infrastructure. The performance/monitoring team (where I work) consists of only 4 people, and although we take care of our ELK cluster, it is not our only job (nor a component that requires constant attention).


👤 indigodaddy
Nobody mentioning Splunk? Too obvious or am I not understanding the ask?

👤 keyle
If it's small, text files that you rotate per day and delete after 1-3 month(s).

If it's big, Graylog is great.

If it's too big, /dev/null, best logs gathering since 1971.


👤 sumedh
I have used Sumologic: your servers send the logs to Sumo, and you can set up your alerts/monitoring/dashboards etc. on their side.

👤 nkobber
We use https://www.humio.com/ and we love it

👤 reacharavindh
I have a TODO on my list to explore Toshi + Tantivy (Rust projects, as a replacement for Elasticsearch), using them to supplement a simpler (ripgrep + AGrind) file-based search on logs centralized via rsyslog. Haven't gotten around to playing with them yet. Hopefully sometime this year.

I could not find an equivalent to Kibana though :-(


👤 dunnotbh
Check out the free version of Graylog.

👤 exabrial
Graylog - it's a purpose-built package for log management on top of Elasticsearch! We transport our logs over ActiveMQ from our apps, and they're read off the broker via an OpenWire input. The setup can handle several thousand writes per second on modest hardware.

👤 gesman
How much data is being generated by your logs?

If it’s less than 500 MB/day, you can use Splunk for free forever.


👤 SeriousM
We're using Seq as a log server, which you must host yourself. This fine and free product is from the same folks as Serilog, which is well known in the .NET world.

https://datalust.co/seq


👤 franzwong
In the old days, I had some daily cron jobs to upload the logs to a centralized place. Then you use grep to find which file contains the log you want. Since you are fine with files, I guess you don't need to make it too complicated.

👤 narnianal
The less experience you have, the more you should pay for a SaaS. As you gain experience, you can start using frameworks like a self-managed ELK stack. If that is not enough, at some point you can roll your own.

👤 hemantv
LogDNA is best and it's cheap enough you don't have to worry about it.

👤 cpach
Here’s some inspiration from Avery Pennarun: https://apenwarr.ca/log/20190216

👤 winrid
We used Sumologic for a long time and still do. You can query your logs in an SQL-like way (you can do joins, for instance), and last I checked they have a free tier.

👤 niftylettuce
I built CabinJS after I was frustrated with all existing solutions.

https://cabinjs.com


👤 stephenboyd
We use DataDog. We run their logging agent in a container which tails the log file and syncs with their cloud service.

👤 weitzj
I used logdna in a previous company. This is a really nice hosted, cheap logging service (like papertrail)

👤 paulmendoza
CloudWatch. Insights makes it so much easier to query, but be sure to log JSON.

👤 jaequery
What do you guys do with docker logs? They seem to just accumulate over time.

👤 bribri
CloudWatch Logs + Insights, or Fluentd -> Kinesis Firehose -> Elasticsearch

👤 dillonmckay
Every developer for themselves because the devops guy.

👤 jmakov
ClickHouse + Grafana, or Prometheus + VictoriaMetrics

👤 badrabbit
Try Graylog!!!!!

👤 janpieterz
We use getseq.net, a pretty smooth aggregator, and it keeps everything on your own infra, so fewer (or no) GDPR concerns. It has proven really powerful with tons of services pushing logs to it.

👤 toomuchtodo
Graylog

👤 duelingjello
syslog / stderr -> Kafka -> centralized ELK

Files / log rotation is completely the wrong approach, because log entries are mostly innately structured, rich data that occurs at a specific time. Serializing and then parsing log lines again is wasted effort. Messaging is a better fit than lines in files, which create log-management headaches: not rotating, losing messages on rotation, compressing/decompressing, and a lengthy soup of destructured data that fills up local disks.

Logging to files on local disks is wrong and often creates privacy problems. Logging to cloud services is also expensive, a legal quagmire, and raises data portability concerns.
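
For what the Kafka leg can look like from a Go service, here's a minimal sketch using one of the common Go Kafka clients (github.com/segmentio/kafka-go); the broker address and topic are placeholders:

  package main

  import (
      "context"
      "log"
      "time"

      "github.com/segmentio/kafka-go"
  )

  func main() {
      // Assumes a reachable broker and an existing "logs" topic.
      w := &kafka.Writer{
          Addr:  kafka.TCP("localhost:9092"),
          Topic: "logs",
      }
      defer w.Close()

      ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
      defer cancel()

      err := w.WriteMessages(ctx, kafka.Message{
          Key:   []byte("api-1"), // e.g. instance ID, for partitioning
          Value: []byte(`{"level":"error","msg":"boom"}`),
      })
      if err != nil {
          log.Fatal(err)
      }
  }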