The service has a current uptime of 55 months. The general consensus, therefore, is that as long as it is never restarted it will continue to perform its function until a replacement can be put in place. Which seems a little fatalistic to me...
Has anyone experience of doing something sensible in a similar situation?
1. Suggest that the work to replace it be prioritized commensurate with the business impact of re-establishing service if it went down right now.
2. Remind them that it will go down at the worst possible time.
3. Ensure your name is attached to these two warnings.
4. Promise yourself you wouldn't run your business this way.
5. Get on with your life.
The problem is that anything you do that's potentially destructive in service of making the system more sustainable is going to be met with heavy criticism. So you must be careful; the company has accepted the risk it's in, and most likely you'll have to contend with that.
First things first: is it a VM or a physical machine? Things get a little easier if it's a VM, because there might be vMotion or some kind of live migration in place, meaning hardware failure might not actually bring the service down.
Next, you absolutely have to plan for failure, because the one thing I keep learning as I learn more about computers is that they're basically working "by accident": so much error correction goes into systems that it absolutely boggles my mind, and failures still get through. So it's certainly not a question of "if" but "when", and you should plan for it being soon.
Now, the obvious technical things are:
* Dump the memory
* Grab any files it has open from its file descriptors (/proc/<pid>/fd); see the sketch at the end of this comment
* Be sure to cleanly detach: https://ftp.gnu.org/old-gnu/Manuals/gdb/html_node/gdb_22.htm...
* Don't use breakpoints! They will obviously halt execution.

If it was my service I would also capture packets and see what kind of traffic it receives most commonly, and I would make some kind of stub component to replace it before working on the machine, just in case I break it and everything that depends on it goes down. But this is a horrible situation; reverse engineering this thing is going to be a pain. Good luck.
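To expand on the file-descriptor bullet, here's a minimal sketch of what I mean, assuming Linux and root access; the PID and destination directory are placeholders passed on the command line:

```python
#!/usr/bin/env python3
# Copy the targets of a process's open file descriptors without touching the process.
# Assumes Linux and root; the PID and destination directory are placeholders.
import os
import shutil
import sys

pid, dest = sys.argv[1], sys.argv[2]
os.makedirs(dest, exist_ok=True)

fd_dir = f"/proc/{pid}/fd"
for fd in os.listdir(fd_dir):
    target = os.readlink(os.path.join(fd_dir, fd))
    # Only copy regular files that still exist; skip sockets, pipes and anon inodes.
    if target.startswith("/") and os.path.isfile(target):
        out = os.path.join(dest, f"fd{fd}_{os.path.basename(target)}")
        shutil.copy2(target, out)
        print(f"fd {fd}: {target} -> {out}")
    else:
        print(f"fd {fd}: skipped ({target})")
```

If a file shows up as "(deleted)" but is still open, you can usually still recover its contents by reading /proc/<pid>/fd/<n> directly rather than the original path.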
We started working on a solution to run in parallel to the existing one, which would receive shadow traffic so we could observe behaviour, find all the edge cases, and put them in a test suite for the new solution. After we were confident that our test suite contained most of the important behaviour, we switched traffic to our new solution, keeping the old one online just in case we needed to switch back.
The key is monitoring and learning the expected behaviour of the connected systems, so you can sense when that behaviour deviates and act on it as soon as possible.
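As a rough illustration of that shadow-traffic comparison (not our actual code; the hostnames, ports and request paths are hypothetical, and it assumes both services speak plain HTTP):

```python
#!/usr/bin/env python3
# Replay the same request against the old and the new service and log any divergence.
# Hostnames, ports and paths are placeholders; assumes both services speak plain HTTP.
import urllib.request

OLD = "http://old-service:8080"   # existing black-box service (hypothetical)
NEW = "http://new-service:8080"   # replacement under test (hypothetical)

def fetch(base, path):
    with urllib.request.urlopen(base + path, timeout=5) as resp:
        return resp.status, resp.read()

def compare(path):
    old_status, old_body = fetch(OLD, path)
    new_status, new_body = fetch(NEW, path)
    if (old_status, old_body) != (new_status, new_body):
        print(f"DIVERGENCE on {path}: old={old_status}, new={new_status}")
    else:
        print(f"match on {path}")

# In practice the paths would come from captured shadow traffic, not a hard-coded list.
for path in ["/health", "/v1/orders/42"]:
    compare(path)
```

Every divergence either becomes a test case for the new solution or reveals behaviour you didn't know the old one had.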
Hope this helps!
[0] https://github.com/volatilityfoundation/volatility
[1] https://www.andreafortuna.org/2017/07/10/volatility-my-own-c...
As is, this is all relying on the HW not giving up and the UPS working in a power failure!
The main challenge was that nothing could be resent if a transmission failed, unless there was an actual change on the shipyard side, or via a highly complex and manual method. There was no source code for anything, and while the enterprisey system was a fairly standard off-the-shelf thing, the API was completely customized. Nobody knew anything about how it talked to the system, its dependencies, or anything else. Changing anything was out of the question, as everything had been defined in contracts and processes that made waterfall seem agile.
The team I was in was basically there to set up a new application in a separate environment, so that we could migrate and replace the existing setup once the shipyard had handed over all the information. In the end everything was kind of anti-climactic, with everything working as expected for as long as it needed to.
First of all, you are there to build a replacement service. As long as you do that in the allotted time, what happens to the old one should not bother you. Focus on your part of the solution.
The fact that you do seem to worry suggests you need something from the old service. Do you have a good design for the new service? Are there details of the old service that you do not know about? It won't be the first time a company has an application that runs complex calculations, but nobody knows how it is done. We actually built black-box applications in the past to replace calculation modules on obsolete hardware.
If you need information from the old system you may have to resort to cloning the input and output, preferably at the network level, so you don't have to mess with the existing service.
If the replacement service is years out, you might suggest the company put in a red team: another person or persons who work on finding out how the service works by running a duplicate on another system (rented if necessary) and poking at it, to see if they can get comparable results quicker than the replacement effort. But that is not you.
There are a couple of things you need to have in place before going live. Make sure you can clone input and output so you can run the old service next to the new one for a while. You don't want to stop the old service the moment you go live, because if anything goes wrong you need to be able to switch back. Even then, starving the old service of input might have adverse effects, so cloning is better. If the old service uses external resources, like a database, files, etc., make sure you do not interfere with file locks, sequence numbers, etc., when running in parallel.
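To make the "clone the input" part concrete, here is a rough sketch of a TCP tee: clients connect to the proxy, which forwards their traffic to the old service (whose responses remain authoritative) and mirrors the same bytes to the new one. All hosts and ports are made up, and a real version needs proper error handling and protocol awareness.

```python
#!/usr/bin/env python3
# Minimal TCP "tee" proxy: forward client traffic to the old service (authoritative)
# and mirror the same input to the new service, discarding the new service's replies.
# Hosts and ports are placeholders; no production-grade error handling.
import socket
import threading

LISTEN = ("0.0.0.0", 9000)       # where clients now connect
OLD = ("old-service", 9001)      # authoritative backend (hypothetical)
NEW = ("new-service", 9002)      # shadow backend (hypothetical)

def pump(src, dst, mirror=None):
    # Copy bytes from src to dst; optionally mirror them to the shadow backend.
    while True:
        data = src.recv(4096)
        if not data:
            break
        dst.sendall(data)
        if mirror is not None:
            try:
                mirror.sendall(data)
            except OSError:
                mirror = None    # shadow failures must never hurt real traffic

def drain(sock):
    # Read and discard the shadow service's responses.
    try:
        while sock.recv(4096):
            pass
    except OSError:
        pass

def handle(client):
    old = socket.create_connection(OLD)
    new = socket.create_connection(NEW)
    threading.Thread(target=pump, args=(client, old, new), daemon=True).start()
    threading.Thread(target=pump, args=(old, client), daemon=True).start()
    threading.Thread(target=drain, args=(new,), daemon=True).start()

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(LISTEN)
srv.listen()
while True:
    conn, _ = srv.accept()
    threading.Thread(target=handle, args=(conn,), daemon=True).start()
```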
Find an expert who has masses of experience who can consult on it.
This isn't a good time to be learning and testing those lessons.
Prioritise a replacement and don't decide to try smart things with the running one.
Surely there are options; have you tried decompiling the source from the binary?
>but the physical config was accidentally overwritten and there are no backups
Any old dev PCs lying around somewhere? It's worth reaching out to the old developers to see if they have a copy; with really old legacy systems it's fairly likely someone has a copy somewhere.
1. That gives you a pretty good chance of attaching a debugger (to the offline memory file) and extracting the config from memory without touching the running system (see the sketch after this list).
2. You are safe in case things turn awful, which seems likely.
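To illustrate point 1: once you have an offline memory image (core file, VM snapshot, /proc dump), even a crude scan for printable strings can recover a lot of config-looking material before you reach for volatility or a debugger. A minimal sketch; the image path and the "db_host" marker are purely hypothetical:

```python
#!/usr/bin/env python3
# Scan an offline memory image for printable ASCII strings that look config-like.
# The image path and the "db_host" marker are hypothetical examples.
import re
import sys

image = sys.argv[1]                        # e.g. a core file or raw memory dump
with open(image, "rb") as f:
    data = f.read()                        # fine for a sketch; stream it for huge images

# Printable ASCII runs of at least 6 characters.
for match in re.finditer(rb"[\x20-\x7e]{6,}", data):
    s = match.group().decode("ascii")
    if "=" in s or "db_host" in s:         # crude filter for key=value style config lines
        print(f"{match.start():#010x}  {s}")
```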
As for the source code, if it is a bytecode language like Java or C#, or an interpreted one like Python, your chances of recovering a fully working source code tree are pretty good. At the very least, it'll help with understanding what the system does.
For C / C++, buy Hex-Rays IDA and the Decompiler. That's what the pros use for reverse engineering video game protections, breaking DRM, etc., so that tool set is excellent for getting an overview of what a binary does, with the ability to go into all the nitty-gritty details if you need help re-implementing one piece. Plus, Hex-Rays can actually propagate types and variable names through your disassembly and then export compilable C source code for functions from your binary.
If it is not on your head, don't fuck with it. I understand the instinct to fix a Big Problem and look great to management. However, this is too high risk. If you solve the problem, you are a hero. If you fuck it up, you are an idiot and fired and cost the company $$.
Why run the risk at all? Just cash the paychecks and fix other things that can't go catastrophically wrong.
Don't touch the server, if anything happens, you'll be held responsible.
My experience with migration projects is that they can drag on in time, and all the while this system is just itching to go down due to a power failure or some other issue.
Welcome to legacy.
So for a situation like this, there are several things that you need to think about. First and foremost... what is the impact when (not if) this process finally stops? This isn't just for technical people. You need a business impact assessment. You need the users involved. They're your lever for fighting the inevitable fear-based political hurdles. Is it an annoyance? Or does the company go out of business? The potential severity of the impact matters a great deal. If it's putting the entire business at risk, you should be able to get support from the highest levels of management to do whatever is necessary.
Second... how do you recover? There are a variety of ways off the top of my head. The most obvious would be to reconstruct the physical config. The "obvious to others" option, which is probably a stupid idea, is rewriting the application. Let's ignore the stupid one and start dealing with reconstructing the configuration.
Do you have the source code for the system? If so, you can probably reconstruct the configuration architecture from reading source, at least. It may suck, but it's something.
Is there a test environment with its own running copy of the app? If so, it will have its own configuration, which will make reconstruction much easier, as then you differ only by values and don't have to figure out what the fields are.
Now, what kind of data is in the configuration that makes it difficult? Resource locations? Authentication credentials? Something else? If it's connecting to external systems, you can look at logs, packet-sniff, etc, to at least figure out where it's going. Credentials can be reset for a new version - a painful one-way trip, but it can work. Do you own any external systems, or are they outside your control?
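One low-risk way to see where it's connecting without sniffing packets: on Linux you can map the process's socket file descriptors to entries in /proc/net/tcp. A rough sketch (IPv4 only; the PID is a placeholder and you need root):

```python
#!/usr/bin/env python3
# List the remote TCP endpoints of a running process by matching its socket fds
# against /proc/net/tcp. Linux only, IPv4 only (check /proc/net/tcp6 as well);
# run as root. The PID is a placeholder passed on the command line.
import os
import sys

pid = sys.argv[1]

# 1. Collect the socket inode numbers held by the process.
inodes = set()
for fd in os.listdir(f"/proc/{pid}/fd"):
    target = os.readlink(f"/proc/{pid}/fd/{fd}")
    if target.startswith("socket:["):
        inodes.add(target[8:-1])

# 2. Decode the little-endian hex "ADDR:PORT" format used by /proc/net/tcp.
def decode(addr_port):
    addr_hex, port_hex = addr_port.split(":")
    octets = [str(int(addr_hex[i:i + 2], 16)) for i in range(6, -2, -2)]
    return ".".join(octets) + ":" + str(int(port_hex, 16))

# 3. Match the inodes against the kernel's TCP table.
with open("/proc/net/tcp") as f:
    next(f)                                # skip the header line
    for line in f:
        fields = line.split()
        local, remote, inode = fields[1], fields[2], fields[9]
        if inode in inodes:
            print(f"local {decode(local)} -> remote {decode(remote)}")
```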
Now, all systems have inputs and outputs. What is the output of this? Is it going to a database? If so, are you backing up that data? Make sure any locally stored data is getting backed up!
If there isn't a duplicate test system, what would need to be done to create one? Are there licensing restrictions? Specialized hardware/OS? Are you building from source code? Do you have the source? Do whatever it takes to create a parallel system that you can test configuration on, make it run elsewhere.
I can just go on and on with this, but the important thing is to be able to duplicate as much as possible before you try to replace. And find out what the cost is - that buys you authority.
If you are in a sound organisation, which I would say most Swedish IT organisations are, you need to think about the company and the clients first.
Someone advised you: “don’t do anything until explicitly asked to”. I think that’s just bad advice. You obviously know this is a major problem and risk. You also have some ideas about how to proceed in mitigating and solving it. You should jump at the chance to help your company with this. Highlight this directly to your managers, talk with as many as you can, gather the information you need, get the approvals, and fix the problem ASAP or find someone who can.
Your clients and customers might be badly affected when hell breaks loose.
I think anyone advising being passive, or plainly hiding from the problem, is totally wrong. Those are not the kind of colleagues I would want.
Get to work.
If no: turn it off and on, and see. The simple fact of the matter is that sooner or later it will happen anyway, and it's better to bring that about, learn, and solve. And if it results in large financial losses from extended downtime, then the management that allowed it to get to that state is already at fault (not you), and some better, safer alternative will arise from your efforts. Don't sweat it.
I've not looked for a long time but you could potentially snapshot the running process and restore it. Obviously do your reading extensively first though.
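On Linux, CRIU (Checkpoint/Restore In Userspace) is the usual tool for snapshotting a running process. A hedged sketch of the checkpoint side, assuming the process holds nothing CRIU can't handle; the PID and image directory are placeholders, and the restore should be rehearsed on a separate machine, never the live one:

```python
#!/usr/bin/env python3
# Checkpoint a running process with CRIU while leaving it running.
# The PID and image directory are placeholders; rehearse the restore elsewhere first.
import os
import subprocess
import sys

pid = sys.argv[1]
images = "/var/backups/zombie-checkpoint"    # hypothetical destination
os.makedirs(images, exist_ok=True)

subprocess.run(
    [
        "criu", "dump",
        "-t", pid,                 # target process
        "-D", images,              # where the image files go
        "--leave-running",         # do NOT kill the process after dumping
        "--tcp-established",       # needed if it has live TCP connections
        "--shell-job",             # needed if it was started from a terminal
        "-v4", "-o", "dump.log",   # verbose log in the image dir for post-mortems
    ],
    check=True,
)
print(f"checkpoint of pid {pid} written to {images}")
```

Even with --leave-running, CRIU freezes the process briefly while dumping, so this still counts as "touching" the service; definitely do the reading (and a dry run on a clone) first.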
I sometimes have dreams about specific hex codes that were common / would mean bad things.
However, they were really good about documentation and updates / backups.
Best thing is to start working on folks in the business to develop a backup plan / make your voice heard about the potential fallout of a failure. Or even just ask "what would happen if" of the decision makers.
Config data structures usually don't change while a process is running. Often they are just values in global variables.
If you have the executable file for the process, it may be possible to run that with trial config files, and then compare the GDB dump from the running service with the GDB dump from a trial config, to compare the relevant data structures. That can provide more confidence than just figuring out what the config ought to be.
Getting a GDB dump of the running service will be quite disruptive if it's done manually, but that might not matter. It will depend on whether the service is heavily used or if it's only doing something occasionally.
If the service is in constant use, it could make sense to automate GDB so it captures the config data structures quickly then safely detaches, and only briefly pauses the service.
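What "automate GDB" could look like in practice: drive gdb in batch mode so it attaches, writes a core file, and detaches in one shot, keeping the pause as short as possible. The PID and output path here are placeholders:

```python
#!/usr/bin/env python3
# Attach gdb non-interactively, write a core file, and detach immediately, so the
# service is only paused for the duration of the dump. PID and path are placeholders.
import subprocess
import sys

pid = sys.argv[1]
core_path = f"/var/tmp/zombie-core.{pid}"    # hypothetical output location

subprocess.run(
    [
        "gdb", "--batch",          # run the -ex commands and exit, no prompts
        "-p", pid,                 # attach to the running process
        "-ex", f"generate-core-file {core_path}",
        "-ex", "detach",
    ],
    check=True,
)
print(f"core written to {core_path}; inspect it offline with gdb")
```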
Alternatively, if even automated GDB is too disruptive or difficult to use, or if the ptrace() syscalls used by GDB might cause a problem, it is often possible to capture a memory dump of the running process without affecting the process much, via /proc/PID/mem on Linux.
If necessary, write a tool in C to walk the heap by reading it from /proc/PID/mem to reconstruct the config data structures that way.
(All the above assumes Linux.)
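For illustration, here is the /proc/PID/mem idea as a small Python sketch rather than the C tool described above: it walks /proc/PID/maps and dumps the writable regions (where heap and global config data usually live) without attaching via ptrace. Linux only, run as root; the PID and output directory are placeholders, and reads can race with the live process, so treat the result as a best-effort snapshot.

```python
#!/usr/bin/env python3
# Dump the writable memory regions of a running process by reading /proc/<pid>/mem,
# guided by /proc/<pid>/maps. No ptrace attach is performed, so the process keeps
# running, but the snapshot is best-effort. Linux only; run as root; PID is a placeholder.
import os
import sys

pid = sys.argv[1]
out_dir = f"/var/tmp/zombie-mem.{pid}"        # hypothetical output location
os.makedirs(out_dir, exist_ok=True)

with open(f"/proc/{pid}/maps") as maps, open(f"/proc/{pid}/mem", "rb") as mem:
    for line in maps:
        fields = line.split()
        addr_range, perms = fields[0], fields[1]
        if "w" not in perms:                  # only writable regions (heap, data, stacks)
            continue
        start, end = (int(x, 16) for x in addr_range.split("-"))
        try:
            mem.seek(start)
            data = mem.read(end - start)
        except (OSError, ValueError):         # some special regions are unreadable
            continue
        name = fields[5] if len(fields) > 5 else "anon"
        fname = f"{start:016x}-{end:016x}_{os.path.basename(name)}.bin"
        with open(os.path.join(out_dir, fname), "wb") as out:
            out.write(data)
        print(f"dumped {addr_range} {perms} {name}")
```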
You said that you have been assigned the task of replacing the "zombie" program, which I assume means that you are to write a new one that interfaces with whatever depends on the zombie.
If the zombie were to die right now, how severe would the consequences be? On the scale of total disruption of business to minor inconvenience that could be worked around until it's fixed?
How complex is the zombie? What is the nature of the state that was supplied by the lost configuration file? Is it stuff like which port to listen on and how to connect to its database, or is it more like huge tables of magic values that won't be easy to figure out again?
Do you currently have enough information about the zombie to write a replacement? Do you have what you need to test your replacement before deploying it?
If you are confident you have sufficient information to write the replacement, how long do you think you need to write and test that?
You said the zombie runs on a physical server. What operating system? What language/runtime/stack is the zombie based on?
(Update re the downvote: You can't tell me that uptime isn't a little bit funny. I almost did a spit take at this. As someone who sees an uptime of 150 days and thinks that's already way too much, this number just shocked me, that's all.)
It was replaced by a Pentium with a virtualized MS-DOS due to the Y2K event, though I am quite sure nothing changed in the underlying MS-DOS program, which was more or less time independent.
Transitioning risky services which can't be restarted is clearly complicated, especially for complex systems. I wonder if there is a body of knowledge with principles that could apply not only to these massive nationwide-scale systems, but also to web services and the like. Does anybody have resources in this respect?
Sounds like you could try copying the binary, putting it into an isolated VM—or a virtual copy of your other services—and beating a file with a stick until the service accepts it as a config.
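If you go that route, the "beating with a stick" loop can at least be automated inside the sandbox: feed the copied binary candidate config files and keep the ones it doesn't immediately reject. Everything here (binary name, --config flag, the "still alive after ten seconds" heuristic) is an assumption about a hypothetical service:

```python
#!/usr/bin/env python3
# Try candidate config files against a copy of the binary in an isolated VM and
# report which ones it accepts. The binary name, the --config flag and the
# "still running after 10 seconds" heuristic are all assumptions.
import glob
import subprocess

BINARY = "./zombie-service"            # copied binary, hypothetical name

for conf in sorted(glob.glob("candidates/*.conf")):
    proc = subprocess.Popen(
        [BINARY, "--config", conf],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        proc.wait(timeout=10)          # exited quickly: probably rejected the config
        print(f"{conf}: rejected (exit code {proc.returncode})")
    except subprocess.TimeoutExpired:  # still running: config was at least parseable
        print(f"{conf}: accepted (still running), inspect its behaviour further")
        proc.terminate()
```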
Yes; any Unix init daemon.
What's "restarted"? Does accepting any new configuration and changing behavior in response to it count as a restart?
Or do we have to terminate the process entirely and start a new one for it to be a "restart"?
My colleagues told me that it was still there 2 years ago.
There was an active/passive cluster with some complicated RAID storage.
One day the passive node's storage was gone (there had been problems all along: corruption, physical storage changes/swaps, etc., which management had put on hold since a migration was coming), and since there were already plans to migrate the cluster elsewhere, no one from management wanted to build another set of servers for such a "short" period.
Guess what... that short period was 18 months :)..
So for a year and a half, those systems were never touched and never restarted, serving applications the whole time.
Sometimes we get lucky, sometimes we are out of job :)
Anyway, that's quite a situation you have got there :) Can you not suspend to disk and keep the suspend image safe and bootable after power failures or so? Maybe suspend to disk (hibernate) and then image the swap partition/file? Then some GRUB editing... or... what OS is this anyway?
The main RPC service had a dependency on the lock service to start up and vice versa. If both services went globally offline at the same time, they wouldn't be able to turn either (thus most of Google) back on again.
Someone came up with a wonderful hack to solve this, involving basically air-gapped laptops which, when connected, could help a datacenter bootstrap itself.
The organization in question paid for a team of forensic software experts to reverse engineer it to the best of their ability ahead of a full data center migration to a new facility.
I left the company while they were somewhere in the middle of this project.
https://unix.stackexchange.com/questions/268247/recover-file...
Actually, as it is a legacy system, chances are this will be easier.
To be honest, the path to follow mostly depends on the OS.
As it is a critical system, I would start with reverse engineering the binary, making sure the config is preserved across the lifetime of the application.
Seen that. And then one day, power failed. UPSes kicked in. In one hour, still no mains power. Batteries depleted, systems started shutting down...including that one.
Had to build a replacement in a real hurry after that.
The Internet, I think, is restartable as long as layer 0 were available. It’s not really clear what it would mean for the Internet to need to be restarted — perhaps some attack/failure of the BGP infrastructure?
Then there will be more and more interesting work for you and others to do, either rediscovering and properly documenting the config, or, hopefully, architecting and coding its replacement! In the aftermath, the organization will be more robust. If it actually collapses because of this, then it deserved to die anyway; you only helped accelerate the outcome and reduce the suffering.
Some things and processes need to be "helped to fail faster", everyone will benefit from the renewal in the end, even if most will hate it ;)
If it’s a VM, take a snapshot periodically. I suspect it’s not, though.
Try a P2V conversion, which converts it to a VM on the fly; leave the VM off and periodically rerun the P2V.
The modern solution would be to capture the incoming packets as a training set then apply machine learning to create a model that can perfectly recreate the outgoing packets.
It’s still an inexplicable black box, of course (that is the nature of ML), but at least you can run it in the cloud now.