Would also love to hear if you’ve tried it and hit roadblocks, or if you’re planning to add OpenTelemetry to part of your stack in the next few months.
Sometimes it feels like what unites the CNCF projects Kubernetes and OpenTelemetry is that everyone talks about them but not everyone actually uses them.
Roadblocks:

- Some backend providers still don't support the OpenTelemetry format, though notably the number that do has increased dramatically over the past year. For those that don't support it natively, you'll likely want to use the OpenTelemetry Collector to translate to the format of your choice (see the sketch after this list).
- There's a fair amount of difference in maturity depending on the language you're using. For example, Python's OpenTelemetry ecosystem is very mature, whereas Rust's ecosystem was lagging behind.
- Tracing is probably the strongest aspect. While metrics and logs are supported, they definitely lag behind in terms of the TLC they're given.
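Here's a minimal sketch of what the app side of that setup can look like with the Python SDK (the endpoint and service name are illustrative assumptions, not anything from the comment above): the application speaks only OTLP to a nearby Collector, and the Collector's exporter, configured separately, handles translation to whatever format the backend wants.

```python
# Sketch: the application exports plain OTLP to a local Collector; the
# Collector (configured separately, in its own YAML) translates/exports to
# a backend that lacks native OpenTelemetry support. Endpoint and service
# name are illustrative assumptions.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(
    resource=Resource.create({"service.name": "demo-service"})
)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

with trace.get_tracer(__name__).start_as_current_span("demo-request"):
    pass  # application work; spans end up wherever the Collector routes them
```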
Tidbits of note:

- The OpenTelemetry Collector has rapidly become the go-to agent of choice. It has leapfrogged vector.dev in terms of features, and I'm seeing some enterprises choose to use it as their customer-facing agent (e.g. GCP's agents are clearly OpenTelemetry plus some extras).
- As more companies support OpenTelemetry as a format, there's less of a need for OpenTelemetry's format-translation feature.
I think the fundamental issue with the DX is that it’s trying to do everything everyone might want out of all of the observability signals. That’s a useful and laudable goal, but it means that everything is configurable, and it sometimes feels like there are too many knobs that you can, and must, adjust to get it to do what you want.
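As one concrete illustration of the knobs (a sketch using the Python SDK; the values shown happen to be the documented defaults, and the endpoint is an illustrative assumption), even the batching span processor alone exposes several:

```python
# Sketch: the BatchSpanProcessor in the OpenTelemetry Python SDK alone
# exposes queueing, batching, and timeout knobs -- and that's before you
# touch samplers, resource detectors, propagators, or exporter settings.
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

processor = BatchSpanProcessor(
    OTLPSpanExporter(endpoint="localhost:4317", insecure=True),
    max_queue_size=2048,          # spans buffered before drops begin
    schedule_delay_millis=5000,   # how often a batch is flushed
    max_export_batch_size=512,    # spans per export call
    export_timeout_millis=30000,  # give up on an export after this long
)
```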
This was for a large email platform. We receive billions of API calls to send mail. Any particular call may send out to a thousand recipients. Recipient inboxes may not respond for 3 days, and we let you schedule your send 3 days out. We wanted all traces so any customer could contact us and we could debug their issue. OpenTracing is not a good fit for this problem: it favors smaller time windows, does best with one request mapping to one response, and often requires sampling.
We made some trade-offs and were able to instrument a lot of the system. From the instrumentation, we cut out retries, delayed responses, multiplexed requests, and the like. But running this was still wildly expensive, so we started sampling more and more aggressively. It was an uphill battle the entire time to implement it and get data from the traces. Multiple teams had to work together to connect traces from start to finish across different services and stacks. It didn't help that management would only give us token amounts of time to keep implementing it, as other projects would be prioritized.
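For readers unfamiliar with what "sampling more and more aggressively" looks like concretely, here is a minimal sketch of head-based sampling with the OpenTelemetry Python SDK (illustrative only; the 1% ratio is an invented value, not what this system used):

```python
# Sketch: head-based sampling keeps a fixed fraction of traces, decided at
# the root span. Each step tighter on the ratio trades debugging coverage
# ("any customer could contact us") for cost. The 1% ratio is an
# illustrative assumption.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# ParentBased honors the root's decision, so a trace isn't half-recorded
# when it crosses the multiple services and stacks mentioned above.
sampler = ParentBased(root=TraceIdRatioBased(0.01))
trace.set_tracer_provider(TracerProvider(sampler=sampler))
```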
We ended up keeping our older system. Each message has metadata, including a list of times and/or durations of internal actions. This metadata is recorded to our metric systems after we are done handling the request. We don't get the cool flame graphs, the interesting relationships exposed, or a whole lot of the other things that OTel can do. But we know what parts of our system cause a bottleneck, and when, and for whom, though sometimes we have to dig to see it.
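A hypothetical sketch of that approach in Python (every name here is invented for illustration; the actual system and its metrics backend aren't described beyond the paragraph above):

```python
# Hypothetical sketch: record a list of (action, duration) pairs in
# per-message metadata, then flush it to the metrics system once the
# request is fully handled. All names are invented for illustration.
import time
from contextlib import contextmanager

class MessageMetadata:
    def __init__(self, message_id: str):
        self.message_id = message_id
        self.timings: list[tuple[str, float]] = []

    @contextmanager
    def timed(self, action: str):
        start = time.monotonic()
        try:
            yield
        finally:
            self.timings.append((action, time.monotonic() - start))

def flush_to_metrics(meta: MessageMetadata) -> None:
    # Stand-in for "recorded to our metric systems after we are done
    # handling the request" -- swap in a real metrics client here.
    for action, seconds in meta.timings:
        print(f"{meta.message_id} {action} {seconds:.3f}s")

meta = MessageMetadata("msg-123")
with meta.timed("spam-check"):
    time.sleep(0.01)  # placeholder for real work
with meta.timed("smtp-delivery"):
    time.sleep(0.02)
flush_to_metrics(meta)
```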
Overall I am very happy with it. I would recommend it.
Good:

- I like that with OTel you can build a vendor-agnostic architecture. By that I mean you could easily swap visualization tools, and you are not depending on a vendor's own agent (see the sketch after this list).
- It works.
- Lots of plugins: receivers, processors, and exporters.
- You can run it as a deployment, or also as an agent so you can get Kubernetes metrics.
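To make the vendor-agnostic point concrete, here is a minimal sketch with the Python SDK (the endpoint and the console/OTLP exporter pair are illustrative assumptions): the instrumentation code never changes; only the exporter wiring does.

```python
# Sketch: swapping backends means swapping the exporter, not the
# instrumentation. Both exporters below implement the same SpanExporter
# interface; the endpoint is an illustrative assumption.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

use_console = True  # flip this to route spans elsewhere; app code is untouched

exporter = (
    ConsoleSpanExporter() if use_console
    else OTLPSpanExporter(endpoint="localhost:4317", insecure=True)
)
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

# Instrumentation is identical regardless of where spans end up.
with trace.get_tracer(__name__).start_as_current_span("demo"):
    pass
```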
Cons:

- Documentation can be improved. There is the website, and then there is the GitHub repo. To me it feels like the website only explains the architecture and some components, but to implement it you need to read the README and values.yml in the GitHub repo for the various receivers and exporters.
- For some this might be a con: it doesn’t have a UI, like Calyptia Core does.
I haven’t tried the auto instrumentation. That is still on the to-do list.
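For readers who also haven't seen it, auto instrumentation in the Python ecosystem typically looks something like this (Flask is chosen as an illustrative framework; the app and route are invented for the example). There is also an opentelemetry-instrument CLI wrapper that patches supported libraries at startup without any code changes.

```python
# Sketch: OpenTelemetry's library instrumentation for Python, using Flask
# as an illustrative example. The instrumentor patches the framework so
# incoming requests produce spans without hand-written tracing code.
from flask import Flask
from opentelemetry.instrumentation.flask import FlaskInstrumentor

app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)  # one line; no manual spans needed

@app.route("/hello")
def hello():
    return "hello"  # served inside an automatically created span

if __name__ == "__main__":
    app.run()
```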