Are there any books, code, or other resources you would recommend?
I'm trying to understand the actual environment. When you say "deployment", what is changed, where does it start, and how far does it propagate?
For example, would one option for zero downtime be to have replicated (2 or more) "control systems" beyond some "layer" (sorry, it's hard to be precise without knowing more), enforcing synchronization between them while having only one actually in control at any time? Then, when you are patching or updating, you keep one frozen and in control, update the other, then switch control over to the updated one? Not advocating a solution, just trying to understand the situation by throwing out an example to talk around.
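To make the freeze/update/switch idea concrete, here is a rough Python sketch. Everything in it (the `Controller` and `RedundantPair` names, the `state` dict, the version strings) is made up for illustration and says nothing about how your actual environment works; a real system would also need health checks and a rollback path before switching.

```python
from dataclasses import dataclass, field


@dataclass
class Controller:
    """Hypothetical stand-in for one replicated control system."""
    name: str
    version: str
    state: dict = field(default_factory=dict)


class RedundantPair:
    """Two controllers kept in sync; only `active` is in control."""

    def __init__(self, active: Controller, standby: Controller):
        self.active = active
        self.standby = standby

    def sync(self) -> None:
        # Copy the active controller's state to the standby so that
        # a switchover does not lose anything.
        self.standby.state = dict(self.active.state)

    def rolling_update(self, new_version: str) -> None:
        # 1. The active controller stays frozen on the old version
        #    and keeps controlling while the standby is updated.
        self.standby.version = new_version
        # 2. Bring the updated standby up to date with current state.
        self.sync()
        # 3. Switch control to the updated controller.
        self.active, self.standby = self.standby, self.active
        # 4. Update the old controller, which is now the standby.
        self.standby.version = new_version
        self.sync()


pair = RedundantPair(
    Controller("A", "1.0", {"setpoint": 42}),
    Controller("B", "1.0"),
)
pair.rolling_update("1.1")
print(pair.active.name, pair.active.version, pair.active.state)
```

At every step exactly one controller is in control, which is the property the question is really about.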
I'm not an expert in this at all, but if what I'm talking about above is even close to being on track, I'd recommend this book for starters: https://www.amazon.com/Introduction-Embedded-Systems-Cyber-P...
The hard part is being very strict about ensuring that every change is safe, and/or being able to rapidly and automatically restore a working state, so that you stay within a very low error budget. Each additional '9' in 99.9...% uptime is an order of magnitude (or more) harder.
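To put numbers on the error budget, a quick back-of-the-envelope calculation helps. This is only a sketch: it assumes a 365-day year and counts all downtime equally, and the function name `allowed_downtime_minutes` is made up here.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget for an availability with `nines` nines.

    For example, nines=3 means 99.9% availability, so the allowed
    unavailable fraction is 10**-3.
    """
    return MINUTES_PER_YEAR * 10 ** (-nines)


for n in range(2, 6):
    print(f"{n} nines: {allowed_downtime_minutes(n):.2f} minutes/year")
```

Three nines leaves roughly 525.6 minutes (under 9 hours) per year, while five nines leaves only about 5.3 minutes, which is why a slow manual rollback alone can use up the whole budget.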
12 Factor Applications
Site Reliability Engineering
Chaos Engineering
Kubernetes