HACKER Q&A
📣 millon

How do you handle WebSocket connections reconnect problem?


There are a few examples on HN of how to handle 1M to 10M WebSocket connections, so I think that part is a solved problem. Most of my connections will be idle most of the time.

Now the real problem is how to make them production ready. If we add TLS, accepting new connections becomes super slow: I think a single core can handle only a few hundred new TLS connections per second. Reconnects (with TLS session resumption) can be faster.

How did you solve the TLS-with-WebSocket problem? What happens when 1 million connections get disconnected and all try to reconnect at the same time? What reconnection rate per core do you see?


  👤 austin-cheney Accepted Answer ✓
I wrote my own WebSocket library so that I could integrate authentication into the connection handshake. Since I have my own library, I also wrote conventions to automatically attempt reconnects at 15-second intervals when the connection drops.

https://github.com/prettydiff/share-file-systems/blob/master...
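
The pattern is simple to sketch. Below is a minimal, hypothetical version of fixed-interval reconnection in browser TypeScript; the names, interval, and URL are placeholders, not the linked library's actual API:

  // Reconnect at a fixed 15-second interval whenever the socket closes.
  const RETRY_MS = 15_000;

  function connect(url: string): void {
    const socket = new WebSocket(url);

    socket.addEventListener("open", () => {
      // Connection (re)established; application-level auth could go here.
    });

    socket.addEventListener("close", () => {
      // Try again in 15 seconds until the connection comes back.
      setTimeout(() => connect(url), RETRY_MS);
    });
  }

  connect("wss://example.com/socket");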


👤 mindcrash
Instead of letting clients directly interface with your services over websockets, consider using Pushpin [1], which allows you to completely isolate realtime communication from your services.

As a bonus, it also gives you the ability to cycle (redeploy/restart) your services without your clients having to reconnect (that's where the name comes from). And as you can imagine, because communication with your services is entirely stateless, it scales like crazy.

[1] https://pushpin.org/


👤 hayst4ck
> What happens when 1 million connections get disconnected and try to reconnect at the same time?

I think the distributed-systems term for your problem is the 'thundering herd problem'; a search for "thundering herd websockets" would likely be fruitful.

From a reliability perspective, implement exponential backoff with jitter on the client. This is a core necessity in every client. I only skimmed this article, but it looked right: https://aws.amazon.com/blogs/architecture/exponential-backof...

When Signal had outages from the increased load during the WhatsApp exodus, it was due to this not being implemented in their clients.
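
Here is a minimal client-side sketch of the "full jitter" variant described in write-ups like the one above: the delay before attempt n is a random value between 0 and min(cap, base * 2^n). The constants and URL are placeholders:

  // Full jitter: delay = random(0, min(cap, base * 2^attempt)).
  const BASE_MS = 1_000;
  const CAP_MS = 60_000;

  function backoffDelay(attempt: number): number {
    return Math.random() * Math.min(CAP_MS, BASE_MS * 2 ** attempt);
  }

  function connectWithBackoff(url: string, attempt = 0): void {
    const socket = new WebSocket(url);

    socket.addEventListener("open", () => {
      attempt = 0; // reset the counter after a successful connection
    });

    socket.addEventListener("close", () => {
      // Jitter spreads a mass disconnect across the whole window, so a
      // million clients don't all retry in lockstep.
      setTimeout(() => connectWithBackoff(url, attempt + 1), backoffDelay(attempt));
    });
  }

  connectWithBackoff("wss://example.com/socket");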

Additionally, consider your load-balancing architecture. If one machine goes down, do all of its clients reconnect to a single machine, or do the reconnects get distributed across all the machines? Can you administratively drain a machine? Can you quickly allocate spare capacity?

Lastly, you can get into situations where your entire infrastructure is overloaded, so you will need a throttling mechanism. That throttling mechanism can work synergistically with your load balancer or clients. If you benchmark your server and it can only handle 500 concurrent reconnections, then that is a hard limit you know you can enforce fail-fast behavior with (see the sketch after the summary).

Summary:

  Clients implement exponential backoff with jitter
  Load-balancer architecture
  Defensive "fail fast" throttling, or the ability to administratively throttle
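
To make the fail-fast point concrete, here is a minimal sketch of gating WebSocket upgrades behind a concurrency limit, assuming a Node server using the ws package. The 500 figure is the hypothetical benchmark number from above, and TLS is assumed to be terminated upstream (e.g. at the load balancer):

  import { createServer } from "node:http";
  import { WebSocketServer } from "ws";

  const MAX_CONCURRENT_HANDSHAKES = 500; // from benchmarking
  let inFlight = 0;

  const wss = new WebSocketServer({ noServer: true });
  const server = createServer();

  server.on("upgrade", (req, socket, head) => {
    if (inFlight >= MAX_CONCURRENT_HANDSHAKES) {
      // Fail fast: refuse before doing any expensive handshake work,
      // and tell the client when to retry.
      socket.write("HTTP/1.1 503 Service Unavailable\r\nRetry-After: 5\r\n\r\n");
      socket.destroy();
      return;
    }
    inFlight++;
    wss.handleUpgrade(req, socket, head, (ws) => {
      inFlight--; // handshake finished; it no longer counts against the limit
      wss.emit("connection", ws, req);
    });
  });

  server.listen(8080);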

👤 gizmo
You can have a separate load-check endpoint (it doesn't even need TLS) that clients hit to see whether they can go ahead with a WebSocket (re)connect. The load-status check can be served directly by the web server from memory, and the connection can be closed immediately after responding, so it's super fast.

And if the servers are so overloaded that the load endpoint fails to respond? That's fine, because that's an answer too.
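
As a sketch, assuming a hypothetical /load endpoint that returns 200 when the servers can take more connections and 503 (or nothing at all) when they can't:

  // Ask the load endpoint before attempting the expensive TLS + WebSocket
  // handshake. No answer is an answer: treat it as "overloaded".
  async function canConnect(): Promise<boolean> {
    try {
      const res = await fetch("http://example.com/load", {
        signal: AbortSignal.timeout(2_000), // don't wait on an overloaded box
      });
      return res.ok;
    } catch {
      return false;
    }
  }

  async function maybeReconnect(url: string): Promise<void> {
    if (await canConnect()) {
      new WebSocket(url);
    } else {
      // Back off with jitter and ask again later.
      setTimeout(() => maybeReconnect(url), 5_000 + Math.random() * 5_000);
    }
  }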