HACKER Q&A
📣 ravirajx7

How do you scale WebSocket?


Hi All,

I have a screen that displays a QR code in the browser and, at the same time, opens a WebSocket connection to my backend (Spring). Once the payment is done, a webhook callback arrives at one of my backend endpoints with the payment state (success/failure) and other order details.

The application works fine when it runs locally or with only a single instance. But since we have our own Auto Scaling group and load balancer configured over DNS, the connection is not getting established all the time.

So how exactly should this be architected so that it scales horizontally? I don't have any DB configured as of now. I have thought of using SNS with SQS, but that seems like too much overhead. How do big companies like WhatsApp scale?


  👤 toast0 Accepted Answer ✓
From what you've commented elsewhere, it sounds like the immediate question is how do you route an external event (webhook) to the right websocket.

You need to include information that comes back in the event that identifies the websocket, either directly (server id + an id the server would understand) or indirectly (something that you can look up in a table/hash to get direct information).

More likely than not, you'll actually want to do the indirect option with something that identifies the browser (session id?) in case the browser reconnects to the websocket before the event comes in. Connectivity is fragile, and clients roam across wifi access points or between wifi and lte, or sometimes between lte boundaries that necessitate different IPs, or sometimes their modem is reset while they wait and they get a new IP, etc. Or something in their network path hates the world and closes idle connections on a very aggressive timeout, or closes connections after a short timeout regardless of idle. Lots of scenarios where a reconnection is likely.
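
A minimal sketch of the indirect option, assuming a single process and a session id that the payment provider echoes back in the webhook. `SessionRegistry` and its method names are hypothetical, not a Spring API. With several instances, the same table would have to live in a shared store, or the webhook would have to be forwarded to the server that owns the connection.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical helper: maps a session id to whichever WebSocket connection
// currently belongs to it, and keeps results that arrive while it is offline.
class SessionRegistry {
    // session id -> function that writes a message to the current connection
    private final Map<String, Consumer<String>> connections = new HashMap<>();
    // session id -> payment state received while no connection was registered
    private final Map<String, String> pending = new HashMap<>();

    // Call on every (re)connect. A newer connection replaces the older one,
    // and any result that was missed in the meantime is flushed to it.
    synchronized void register(String sessionId, Consumer<String> sender) {
        connections.put(sessionId, sender);
        String missed = pending.remove(sessionId);
        if (missed != null) {
            sender.accept(missed);
        }
    }

    // Call on disconnect. Removes the entry only if it still points at this
    // connection, so a stale close cannot evict a fresh reconnect.
    synchronized void unregister(String sessionId, Consumer<String> sender) {
        connections.remove(sessionId, sender);
    }

    // Call from the webhook endpoint. Returns true if pushed immediately,
    // false if the result was parked until the browser reconnects.
    synchronized boolean deliver(String sessionId, String paymentState) {
        Consumer<String> sender = connections.get(sessionId);
        if (sender == null) {
            pending.put(sessionId, paymentState);
            return false;
        }
        sender.accept(paymentState);
        return true;
    }
}
```

Keying on the session rather than on the socket is what lets a reconnect pick up a result that arrived while the browser was between networks.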

Finally, since you asked about horizontal scaling, my best advice is to scale vertically first. It's usually simpler to manage one server that can do 1M connections than 1k servers that can each do 1k connections. Depending on details, fewer, larger servers can be less expensive than many smaller ones, although that changes when your large servers start getting exotic. More than two CPU sockets is a big inflection point: past that, you most likely want to scale horizontally rather than get a quad-socket monster (but they exist, and so do eight-socket machines).


👤 unraveller
https://medium.com/@14domino/using-nats-to-build-a-very-func...

No database: just have the user's websocket reach a simple websocket server that always forwards requests to a fuller API server. The API server can speak back to the NATS server, which triggers a push to the user, since the websocket server is coupled to NATS. This gives horizontal scale to both the API servers (if they ignore users/work not meant for them) and the websocket servers.
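
A rough sketch of that fan-out. `Bus` is an in-process stand-in for the broker, not the NATS client API, and `WsServer` is a hypothetical name. Every websocket server sees every message and drops the ones for users it does not hold.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.BiConsumer;

// In-process stand-in for a message broker (hypothetical, not the NATS API).
class Bus {
    private final List<BiConsumer<String, String>> subscribers = new CopyOnWriteArrayList<>();

    void subscribe(BiConsumer<String, String> handler) {
        subscribers.add(handler);
    }

    // Broadcast: every subscriber receives every (userId, payload) pair.
    void publish(String userId, String payload) {
        for (BiConsumer<String, String> s : subscribers) {
            s.accept(userId, payload);
        }
    }
}

// One websocket server instance. The per-user list stands in for a socket.
class WsServer {
    private final Map<String, List<String>> sockets = new ConcurrentHashMap<>();

    WsServer(Bus bus) {
        bus.subscribe(this::onMessage);
    }

    void connect(String userId) {
        sockets.put(userId, new CopyOnWriteArrayList<>());
    }

    private void onMessage(String userId, String payload) {
        List<String> socket = sockets.get(userId);
        if (socket != null) {
            socket.add(payload); // user is connected here: push
        }
        // otherwise the user is on another instance: ignore the message
    }

    // Messages pushed to this user by this instance so far.
    List<String> sent(String userId) {
        return sockets.getOrDefault(userId, List.of());
    }
}
```

An API server that handles the webhook would only call `bus.publish(userId, state)`. It never needs to know which instance holds the socket.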


👤 monroewalker
I could be wrong, but I wouldn't think the autoscaling or load balancing would affect the websocket connection. There may be another aspect of the infrastructure that's preventing the connection though. Can you share more about the setup? Strange that the connection would succeed sometimes but not others. This could be related to the configuration of some intermediate network layer. E.g. if Nginx is used, you may have to look into the settings that are needed to ensure websockets work well. Take a look at these pages:

https://stackoverflow.com/questions/12102110/nginx-to-revers...

https://stackoverflow.com/questions/10550558/nginx-tcp-webso...


👤 timebomb0
Socket.io has a great article on how you can set up your architecture to scale WebSocket servers: https://socket.io/docs/v4/using-multiple-nodes/

👤 blablablub
"the connection is not getting established all the time" Which connection does not get established? The webhook connection to your backend or the websocket connection to your backend? Or do you get the webhook call but fail to send a response via websocket?

👤 matt321
I'd use a database and have the websocket just periodically check the database for payment confirmation. This way, when one instance gets the webhook activity, it puts it in the database and then.... ??? .... profit
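
Roughly, as a sketch: `PaymentStore` is an in-memory stand-in for the shared database table, and all names here are made up for illustration. The instance that receives the webhook calls `save`, and the instance holding the websocket polls `find` until a row shows up or it gives up.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Stand-in for a shared database table keyed by order id (hypothetical).
class PaymentStore {
    private final Map<String, String> rows = new ConcurrentHashMap<>();

    // Called by whichever instance receives the webhook.
    void save(String orderId, String state) {
        rows.put(orderId, state);
    }

    Optional<String> find(String orderId) {
        return Optional.ofNullable(rows.get(orderId));
    }
}

class PaymentPoller {
    // Checks the store up to maxAttempts times, sleeping between checks.
    // Returns the payment state if it appeared, or empty on timeout/interrupt.
    static Optional<String> await(PaymentStore store, String orderId,
                                  int maxAttempts, long intervalMillis) {
        for (int i = 0; i < maxAttempts; i++) {
            Optional<String> state = store.find(orderId);
            if (state.isPresent()) {
                return state;
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return Optional.empty();
            }
        }
        return Optional.empty();
    }
}
```

The websocket handler would push the result to the browser once `await` returns a value. The polling interval trades database load against how quickly the screen updates.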