HACKER Q&A
📣 scandox

Ever worked with a service that can never be restarted?


I'm currently working with a legacy system. One element of it has a loaded config in memory, but the physical config was accidentally overwritten and there are no backups. In addition the source code for this compiled binary has also been lost in the mists of time.

The service has a current uptime of 55 months. The general consensus, therefore, is that as long as it is never restarted it will continue to perform its function until a replacement can be put in place. Which seems a little fatalistic to me...

Has anyone experience of doing something sensible in a similar situation?


  👤 verytrivial Accepted Answer ✓
My advice:

1. Suggest that the work to replace it be prioritized commensurate with the business impact that would be incurred while re-establishing service, as if it went down right now.

2. Remind them that it will go down at the worst possible time.

3. Ensure your name is attached to these two warnings.

4. Promise yourself you wouldn't run your business this way.

5. Get on with your life.


👤 dijit
Actually, I have experience here.

The problem is that anything you do that's potentially destructive in service of making the system more sustainable is going to be met with heavy criticism. So you must be careful: the company has accepted the risk it is in, and most likely you'll have to contend with that.

First things first: is it a VM or a physical machine? Things get a little easier if it's a VM, because there might be vMotion or some kind of live migration in place, meaning hardware failure might not actually bring the service down.

Next, you absolutely have to plan for failure, because the one thing I learn as I learn more about computers is that they basically work "by accident": so much error correction goes into systems that it absolutely boggles my mind, and failures still get through. So it's certainly not a question of "if" but "when"; plan for it being soon.

Now, the obvious technical things are:

* Dump the memory
* Grab any files it has open from its file descriptors (/proc/<pid>/fd/). Your config file might be there... but somehow I doubt it.
* Attach a debugger and dump as much state as possible.

Be sure to cleanly detach: https://ftp.gnu.org/old-gnu/Manuals/gdb/html_node/gdb_22.htm...

Don't use breakpoints! They will obviously halt execution.
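
If it helps, here's a minimal sketch of the file-descriptor salvage idea above (Linux only, run as root; the PID argument and output filenames are just placeholders):

    #!/usr/bin/env python3
    # Sketch: list everything the process still holds open via /proc/<pid>/fd
    # and copy out anything whose on-disk name shows as deleted/overwritten.
    import os
    import shutil
    import sys

    pid = sys.argv[1]
    fd_dir = f"/proc/{pid}/fd"

    for fd in sorted(os.listdir(fd_dir), key=int):
        link = os.path.join(fd_dir, fd)
        target = os.readlink(link)  # e.g. "/etc/app.conf (deleted)" or "socket:[1234]"
        print(fd, "->", target)
        if "(deleted)" in target:
            # The bytes are still reachable through the fd even though the
            # directory entry is gone; copy them somewhere safe.
            shutil.copyfile(link, f"recovered-fd-{fd}")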

If it were my service I would also capture packets and see what kind of traffic it receives most commonly, and I would make some kind of stub component to replace it before working on the machine, just in case I break it and everything that depends on it goes down.
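
For the packet-capture part, a rough sketch with scapy (pip install scapy, needs root; the address is a placeholder for the legacy box, not anything known from this thread):

    from collections import Counter
    from scapy.all import sniff, TCP, Raw

    ports = Counter()

    def record(pkt):
        # Count destination ports to see which endpoints are hit most often,
        # and peek at payloads to start guessing the protocol.
        if TCP in pkt:
            ports[pkt[TCP].dport] += 1
        if Raw in pkt:
            print(bytes(pkt[Raw].load)[:80])

    sniff(filter="host 10.0.0.5", prn=record, count=1000)  # 10.0.0.5 = legacy box
    print(ports.most_common(10))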

But this is a horrible situation; reverse engineering this thing is going to be a pain. Good luck.


👤 aequitas
At a previous job I once SSHed into a legacy system which had a motd in the vein of: "Don't ever reboot this system; if you do, I will find you and do horrible things to you". No explanation of why. My opinion was that the person who dared to leave a system in this state was the one who needed the horrible things done to them instead. Needless to say, he had already been fired.

We started working on a solution to run parallel to the existing one, which would receive shadow traffic so we could observe behaviour, find all the edge cases and put them in a test suite for the new solution. After we were confident that our test suite contained most of the important behaviour, we switched traffic to our new solution, keeping the old one online just in case we needed to switch back.

The key is monitoring and learning the expected behaviour of the connected systems, so you can sense when that behaviour deviates from what is expected and act on it as soon as possible.
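
For what it's worth, the shadow-traffic setup can be tiny if the service speaks plain TCP (an assumption; the hostnames and port below are placeholders). The client is pointed at the proxy, the legacy service stays authoritative, and every request/reply pair from the replacement gets logged for the test suite:

    import socket
    import threading

    LEGACY = ("legacy-host", 9000)        # hypothetical addresses/port
    SHADOW = ("replacement-host", 9000)
    LISTEN = ("0.0.0.0", 9000)

    def ask_legacy(data):
        # One request/response round trip against the real service.
        with socket.create_connection(LEGACY, timeout=5) as s:
            s.sendall(data)
            return s.recv(65536)

    def mirror(data):
        # Send the same request to the candidate replacement and log both
        # sides; failures here must never affect the real path.
        try:
            with socket.create_connection(SHADOW, timeout=5) as s:
                s.sendall(data)
                reply = s.recv(65536)
            with open("shadow.log", "ab") as log:
                log.write(b"REQ " + data + b" SHADOW-REPLY " + reply + b"\n")
        except OSError:
            pass

    def handle(conn):
        with conn:
            data = conn.recv(65536)
            if not data:
                return
            threading.Thread(target=mirror, args=(data,), daemon=True).start()
            conn.sendall(ask_legacy(data))  # client only sees the legacy answer

    srv = socket.socket()
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(LISTEN)
    srv.listen()
    while True:
        c, _ = srv.accept()
        threading.Thread(target=handle, args=(c,), daemon=True).start()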


👤 sdmike1
There is a tool used in malware analysis and computer forensics called Volatility[0]. It has some very powerful analysis tools and works on Linux, Mac and Windows. In your case its ability to dump the memory of a running process without messing with the process state[1] may be very helpful! It also has the ability to run a Yara scan against the dumped memory, which could let you find the region of memory containing the config file (so long as you know some of the strings in it).

Hope this helps!

[0] https://github.com/volatilityfoundation/volatility

[1] https://www.andreafortuna.org/2017/07/10/volatility-my-own-c...
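
If a full Yara rule feels like overkill, even a crude scan of the offline dump for a string you expect inside the config can work. A sketch (the dump filename and keyword are assumptions):

    import sys

    KEYWORD = b"listen_port"   # any string you expect inside the config
    CONTEXT = 200              # bytes of context to show around each hit

    # Fine for modest dumps; stream in chunks for multi-GB ones.
    with open(sys.argv[1] if len(sys.argv) > 1 else "dump.raw", "rb") as f:
        blob = f.read()

    pos = blob.find(KEYWORD)
    while pos != -1:
        chunk = blob[max(0, pos - CONTEXT):pos + CONTEXT]
        # Replace non-printable bytes with '.' so surrounding text is readable.
        printable = bytes(b if 32 <= b < 127 or b in (9, 10) else 46 for b in chunk)
        print(f"offset {pos:#x}")
        print(printable.decode("ascii"))
        print("-" * 60)
        pos = blob.find(KEYWORD, pos + 1)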


👤 phillipseamore
Dump memory and see if you can retrieve the config that's running. If someone has a clue about what the config should look like, you could probably get all the elements of it from simply running 'strings' on the binary. You could also disassemble the binary. Possibly the deleted config file could be salvaged from disk.

As it is, this all relies on the hardware not giving up and the UPS working through a power failure!


👤 chha
I once worked with a client in such a situation. They were in the process of building a huge oil rig, costing somewhere north of $5.5bn USD. The basic premise was that a previous vendor had configured a big documentation system running on a soon-to-be-outdated Windows Server version ages ago, along with a "kind of" API allowing the shipyard to send information in the form of equipment/construction metadata, documents of various kinds and similar stuff.

The main challenge was that nothing could be resent if a transmission failed, unless there was an actual change on the shipyard side, or through a highly complex and manual method. There was no source code to anything, and while the enterprisey system was a fairly standard off-the-shelf type thing, the API was completely customized. Nobody knew anything about how it talked to the system, its dependencies or anything else. Changing anything was out of the question, as everything had been defined in contracts and processes that made waterfall seem agile.

The team I was in was basically there to set up a new application in a separate environment, so that we could migrate and replace the existing setup once the shipyard had handed over all the information. In the end everything was kind of anti-climactic, with everything working as expected for as long as it needed to.


👤 ratel
I have some experience in this field.

First of all, you are there to build a replacement service. As long as you do that in the allotted time, what happens to the old one should not bother you. Focus on your part of the solution.

The fact that you do seem to worry suggests you need something from the old service. Do you have a good design for the new service? Are there details of the old service that you do not know about? It won't be the first time a company has an application that runs complex calculations, but nobody knows how it is done. We actually built black-box applications in the past to replace calculation modules on obsolete hardware.

If you need information from the old system you may have to resort to cloning the input and output preferably on the network level so you don't have to mess with the existing service.

If the replacement service is years out, you might suggest the company put in a red team: another person or persons who work on finding out how the service works by running a duplicate on another system (rented if necessary) and poking at it, to see if they can get comparable results quicker than the replacement effort. But that is not you.

There are a couple of things you need to have in place before going live. Make sure you can clone input and output so you can run the old service next to the new one for a while. You don't want to stop the old service the moment you go live, because if anything goes wrong you need to be able to switch back. Even then, starving the old service of input might have adverse effects, so cloning is better. If the old service uses external resources, like a database, files, etc., make sure you do not interfere with file locks, sequence numbers, etc. when running in parallel.


👤 ThomasRedstone
A lot of talk here about what to do yourself; I'd say do nothing yourself.

Find an expert who has masses of experience who can consult on it.

This isn't a good time to be learning and testing those lessons.


👤 weego
My experience in this situation is that the more you decide to be the hero and take on things everyone else is washing their hands of, when you have no more reason to than anyone else, the more likely it is that you'll also be shouldered with responsibility for things going wrong that similarly had nothing to do with you in the first place.

Prioritise a replacement and don't decide to try smart things with the running one.


👤 jacquesm
If it is running you can still get at the binary file through /proc and recover it. The file system will only really delete a file when there are no more users, and a running process counts as the file being in use (so that pages can be paged back in from it if needed).
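
Something like this, for instance (Linux, run as root; the PID argument is a placeholder):

    import shutil
    import sys

    # While the process is alive, /proc/<pid>/exe still points at the
    # (possibly deleted) binary, so a plain copy recovers it.
    pid = sys.argv[1]
    shutil.copyfile(f"/proc/{pid}/exe", f"recovered-binary-{pid}")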

👤 jomkr
I'm not sure that 55 months of uptime indicates it's more or less likely to go down in the next month, but I'd guess more likely.

Surely there are options. Have you tried decompiling the binary back into source?

>but the physical config was accidentally overwritten and there are no backups

Any old dev PCs lying around somewhere? It's worth reaching out to the old developers to see if they have a copy; with really old legacy systems it's fairly likely someone will have a copy somewhere.


👤 fxtentacle
Hire a consultant to do a full-memory dump.

1. That gives you a pretty good chance of attaching a debugger (to the offline memory file) and extracting the config from memory without touching the running system.

2. You are safe in case things turn awful, which seems likely.

As for the source code, if it is a bytecode-based language like Java or C#, or an interpreted one like Python, your chances of recovering a fully working source tree are pretty good. At the very least, it'll help with understanding what the system does.

For C / C++, buy Hex-Rays IDA and the Decompiler. That's what the pros use for reverse engineering video game protections, breaking DRM, etc. So that tool set is excellent for getting an overview of what a binary does, with the ability to go into all the nitty-gritty details if you need help re-implementing one piece. Plus, Hex-Rays can actually propagate types and variable names through your disassembly and then export compilable C source code for functions from your binary.


👤 honkycat
My mother had a saying: choose your battles. I feel like calling things "not my problem" in software is undervalued. You cannot fix every problem, you cannot train every engineer, and you cannot control everything. Learn to prioritize.

If it is not on your head, don't fuck with it. I understand the instinct to fix a Big Problem and look great to management. However, this is too high risk. If you solve the problem, you are a hero. If you fuck it up, you are an idiot, you are fired, and you cost the company $$.

Why run the risk at all? Just cash the paychecks and fix other things that can't go catastrophically wrong.


👤 d--b
I personally would look for that code; it must be somewhere... in old code repos or on hard drives. Maybe you guys have tape backups or things of that nature.

Don't touch the server; if anything happens, you'll be held responsible.


👤 Tilian
The configuration still exists in memory somewhere, so you could extract it from a dump... whether you should want to do this is an entirely different question though.

👤 letharion
This is also mentioned by dijit, but here are more concrete instructions on how to potentially recover the config from /proc/: https://web.archive.org/web/20171226131940/https://www.linux...

👤 PaulHoule
In case you want to replace or upgrade the UPS: note that they make devices, used by law enforcement to move a computer that is being impounded, which can cut into the power cord without depowering the machine.

👤 swalsh
You guys may not have the best processes in place, but whoever developed that app really deserves some credit. A 55-month uptime without a restart. Not bad.

👤 INTPenis
Rather than focus on a replacement, they should focus on reading the system's memory out and creating a replacement virtual machine, or perhaps even trying to decipher the config structure from memory.

My experience with migration projects is that they can drag on, and all the while this system is just itching to go down due to a power failure or some other issue.


👤 beat
"The physical config was accidentally overwritten and there are no backups".

Welcome to legacy.

So for a situation like this, there are several things that you need to think about. First and foremost... what is the impact when (not if) this process finally stops? This isn't just for technical people. You need a business impact assessment. You need the users involved. They're your lever for fighting the inevitable fear-based political hurdles. Is it an annoyance? Or does the company go out of business? The potential severity of the impact matters a great deal. If it's putting the entire business at risk, you should be able to get support from the highest levels of management to do whatever is necessary.

Second... how do you recover? There are a variety of ways off the top of my head. The most obvious would be to reconstruct the physical config. The option that is "obvious to others", and probably a stupid idea, is rewriting the application. Let's ignore the stupid one and start dealing with reconstructing the configuration.

Do you have the source code for the system? If so, you can probably reconstruct the configuration architecture from reading source, at least. It may suck, but it's something.

Is there a test environment with its own running copy of the app? If so, it will have its own configuration, which will make reconstruction much easier, as then you differ only by values and don't have to figure out what the fields are.

Now, what kind of data is in the configuration that makes it difficult? Resource locations? Authentication credentials? Something else? If it's connecting to external systems, you can look at logs, packet-sniff, etc, to at least figure out where it's going. Credentials can be reset for a new version - a painful one-way trip, but it can work. Do you own any external systems, or are they outside your control?

Now, all systems have inputs and outputs. What is the output of this? Is it going to a database? If so, are you backing up that data? Make sure any locally stored data is getting backed up!

If there isn't a duplicate test system, what would need to be done to create one? Are there licensing restrictions? Specialized hardware/OS? Are you building from source code? Do you have the source? Do whatever it takes to create a parallel system that you can test configuration on, make it run elsewhere.

I can just go on and on with this, but the important thing is to be able to duplicate as much as possible before you try to replace. And find out what the cost is - that buys you authority.


👤 surfsvammel
Damn. Everyone here seems to advise you to just protect your own ass as a top priority. Is that a product of US work culture?

If you were in a sound organisation, which I would say most Swedish IT organisations are, you would need to think about the company and the clients first.

Someone advised you to: "don't do anything until explicitly asked to". I think that's just bad advice. You obviously know this is a major problem and risk. You also have some ideas about how to proceed in mitigating and solving it. You should jump at the chance to help your company with this. Highlight this directly to your managers, talk with as many as you can, gather the information you need, get the approvals, and fix the problem ASAP, or find someone who can.

Your clients and customers might be badly affected when hell breaks loose.

I think anyone advising being passive, or plainly hiding from the problem, is totally wrong. Those are not the kind of colleagues I would want.

Get to work.


👤 guiriduro
Are you working on a Nuclear Reactor or something that will cause loss of life if rebooted accidentally? If yes, then you have a truly critical system that needs very careful uptime management, despite huge costs to carefully derisk and duplicate it, and there's plenty of good advice here already. But too many systems are 'super pets' like this and are mistakenly considered critical at exorbitant cost.

If no: turn it off and on, and see. The simple fact of the matter is that sooner or later it will happen anyway, and it's better to bring that about, learn and solve. And if it results in large financial losses from extended downtime, then the management that allowed it to get to that state is already at fault (not you), and some better, safer alternative will arise from your efforts. Don't sweat it.


👤 hkt
Have a look at Checkpoint and Restore In Userspace: https://criu.org/Main_Page

I've not looked at it in a long time, but you could potentially snapshot the running process and restore it. Obviously do your reading extensively first, though.


👤 duxup
My first "real" job was working with some old equipment related to some big old IBM (and related) mainframes. To view or change anything you dialed in with a modem and then via a terminal application displayed and modified memory directly.

I sometimes have dreams about specific hex codes that were common / would mean bad things.

However, they were really good about documentation and updates / backups.

The best thing is to start working on folks in the business to develop a backup plan and make your voice heard about the potential fallout of a failure. Or even just ask "what would happen if" of the decision makers.


👤 jlokier
I have used GDB to look at a process and get a dump of particular run-time data structures, which is usually enough to reconstruct a config file.

Config data structures usually don't change while a process is running. Often they are just values in global variables.

If you have the executable file for the process, it may be possible to run that with trial config files, and then compare the GDB dump from the running service with the GDB dump from a trial config, to compare the relevant data structures. That can provide more confidence than just figuring out what the config ought to be.

Getting a GDB dump of the running service will be quite disruptive if it's done manually, but that might not matter. It will depend on whether the service is heavily used or if it's only doing something occasionally.

If the service is in constant use, it could make sense to automate GDB so it captures the config data structures quickly then safely detaches, and only briefly pauses the service.
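
As a sketch of what that automation could look like: GDB can be scripted with its built-in Python. The global name g_config below is purely an assumption about where the parsed config might live, and the output paths are placeholders.

    # dump_config.py -- run as: gdb -p <PID> -batch -x dump_config.py
    import gdb  # only available inside GDB's embedded Python

    gdb.execute("set pagination off")
    # Write a full core file we can inspect offline without re-attaching.
    gdb.execute("generate-core-file /tmp/service.core")
    try:
        # If symbols survived, print a specific structure directly.
        val = gdb.parse_and_eval("g_config")  # hypothetical global
        with open("/tmp/g_config.txt", "w") as f:
            f.write(str(val))
    except gdb.error:
        pass  # no symbols: fall back to the offline core file
    gdb.execute("detach")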

Alternatively, if even automated GDB is too disruptive or difficult to use, or if the ptrace() syscalls used by GDB might cause a problem, it is often possible to capture a memory dump of the running process without affecting the process much, via /proc/PID/mem on Linux.

If necessary, write a tool in C to walk the heap by reading it from /proc/PID/mem to reconstruct the config data structures that way.
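
The comment above suggests C; the same idea sketched in Python (Linux, run as root, and note the heap is a moving target while the process keeps running):

    import sys

    pid = int(sys.argv[1])

    # Pair /proc/<pid>/maps (to find the heap's address range) with
    # /proc/<pid>/mem (to read those addresses) without stopping the process.
    with open(f"/proc/{pid}/maps") as maps, \
         open(f"/proc/{pid}/mem", "rb") as mem, \
         open(f"heap-{pid}.bin", "wb") as out:
        for line in maps:
            if "[heap]" not in line:
                continue
            start, end = (int(x, 16) for x in line.split()[0].split("-"))
            mem.seek(start)
            out.write(mem.read(end - start))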

(All the above assumes Linux.)


👤 pezo1919
May I ask about the business you are in? What happens to the world if your service goes down? It's just curiosity. Hopefully it's nothing of breathing-machine-like importance. :)

👤 Reith
A memory dump will probably be what you want to do, but are you sure the process closed the file? If it's a config file, it's sensible that the process has closed the file descriptor, but if it didn't (probably because of developer error), it's possible the file is still accessible (through its inode, for example, on Linux). Actually, I faced the same problem today: it was a virtual machine disk file that had been overwritten while the virtual machine was running, so the (deleted) file was still open and I could get it back.

👤 Canada
Interesting problem. There's some good ideas in here, and I think you might get some better ones if you add more details.

You said that you have been assigned the task of replacing the "zombie" program, which I assume means that you are to write a new one that interfaces with whatever depends on the zombie.

If the zombie were to die right now, how severe would the consequences be? On the scale of total disruption of business to minor inconvenience that could be worked around until it's fixed?

How complex is the zombie? What is the nature of the state that was supplied by the lost configuration file? Is it stuff like which port to listen on and how to connect to its database, or is it more like huge tables of magic values that won't be easy to figure out again?

Do you currently have enough information about the zombie to write a replacement? Do you have what you need to test your replacement before deploying it?

If you are confident you have sufficient information to write the replacement, how long do you think you need to write and test that?

You said the zombie runs on a physical server. What operating system? What language/runtime/stack is the zombie based on?


👤 abbadadda
I'm at the airport in Dublin. On a layover. I was enjoying my Oreo Fusion at BK and when I read, "The service has current up time of 55 Months" I burst out laughing hysterically.

(Update re the downvote: You can't tell me that uptime isn't a little bit funny. I almost did a spit take at this. As someone who thinks an uptime of 150 days is already way too much, this number just shocked me, that's all.)


👤 miohtama
My father worked in a district heating plant. The control PC was an 80286 and had an uptime of over 10 years, though in theory the plant could be run without it, with knobs and levers.

It was replaced by a Pentium running a virtualized MS-DOS due to the Y2K event, though I am quite sure nothing changed in the underlying MS-DOS program, which was more or less time-independent.


👤 pschastain
I remember reading, some time ago, an old story about folks recovering accidentally deleted system files from the currently running system. I can't find it now; I think it was Ritchie who told it, but I'm not sure. The long and short of it is that it might be possible to recover the lost config from what's currently running.

👤 swayson
In South Africa the citizens are facing a crisis with exactly such a service, and it's our electricity system. It is made up predominantly of coal power plants that are at end of life, and rolling blackouts are a common occurrence. They are a necessary action to avoid a total blackout, given it would take weeks to get the system going again. Furthermore, high inequality and unemployment make it paramount to grow the economy, so these rolling blackouts have a huge socio-economic impact. It is a service that can't simply be restarted.

Transitioning risky services which can't be restarted is clearly complicated, especially for complex systems. I wonder if there is a body of knowledge with principles that could apply not only at this massive nationwide scale, but also to web services and the like. Does anybody have resources in this respect?


👤 farseer
The raptors escaped because the Jurassic Park security system had never been restarted before in its history :)

👤 aasasd
> In addition the source code for this compiled binary has also been lost in the mists of time.

Sounds like you could try copying the binary, putting it into an isolated VM—or a virtual copy of your other services—and beating a file with a stick until the service accepts it as a config.


👤 kazinator
> Ever worked with a service that can never be restarted?

Yes; any Unix init daemon.

What's "restarted"? Does accepting any new configuration and changing behavior in response to it count as a restart?

Or do we have to terminate the process entirely and start a new one for it to be a "restart"?


👤 lormayna
At my previous-previous job we had a core switch that had been on for 5 years. No one wanted to reload it because everyone was scared it couldn't restart anymore. We didn't have maintenance or spare parts, and the cabling was a mess.

My colleagues told me that it was still there 2 years ago.


👤 totaldude87
Been there.. short story goes like..

There was an active-passive cluster with some complicated RAID storage.

One day the passive node's storage was gone (there had been problems all along: corruption, physical storage changes/swaps, etc.; management put fixing it on hold since a migration was coming), and since there were already plans to migrate the cluster somewhere else, no one from management wanted to build another set of servers for such a "short" period.

Guess what... that short period was 18 months :)..

So for a year and a half, those systems were never touched and never restarted, all the while serving some applications.

Sometimes we get lucky, sometimes we are out of job :)


👤 teekert
I once overwrote my own partition table. The next week was spent keeping the laptop on or in standby and investigating recovery. I never found a good solution, so I started investigating alternatives and ended with the so-called nuke-and-pave: I reinstalled Ubuntu.

Anyway, that's quite a situation you have got there :) Can you not suspend to disk and keep the suspend image safe and bootable after power failures or so? Maybe suspend to disk (hibernate) and then image the swap partition/file? Then some GRUB editing... or... what OS is this anyway?


👤 jasonmar
If it's a critical system and you don't have the ability to fix it but you know it's only a matter of time before it fails, the sensible thing to do is change jobs.

👤 throwme9876
As of a few years ago, Google had something vaguely similar. The details escape me, but it was something like --

The main RPC service had a dependency on the lock service to start up and vice versa. If both services went globally offline at the same time, they wouldn't be able to turn either (thus most of Google) back on again.

Someone came up with a wonderful hack to solve this, involving basically air-gapped laptops which, when connected, could help a datacenter bootstrap itself.


👤 withinboredom
If the process still has the file open, the config file is still there, if it's Linux. You can find it among the process's file handles in /proc.

👤 asickness231
I have seen a scenario where a bespoke application, which had also been running for some time, had been lost to the miasma (the consultants took the source code with them).

The organization in question paid for a team of forensic software experts to reverse engineer it to the best of their ability ahead of a full data center migration to a new facility.

I left the company while they were somewhere in the middle of this project.


👤 noonespecial
You might try something like this, hoping that the process still has the gone file "open".

https://unix.stackexchange.com/questions/268247/recover-file...


👤 bluesign
From my experience, dumping memory and most likely reverse engineering the binary is a must.

Actually, as it is a legacy system, chances are this will be easier.

To be honest, the path to follow mostly depends on the OS.

As it is a critical system, I would start with reverse engineering the binary and making sure the config is preserved across the lifetime of the application.


👤 villgax
You could try to make every possible request programmatically & save the responses & then rebuild it.

👤 Piskvorrr
"as long as it is never restarted"

Seen that. And then one day, the power failed. UPSes kicked in. One hour in, still no mains power. Batteries depleted, systems started shutting down... including that one.

Had to build a replacement in a real hurry after that.


👤 gumby
I have made a phone call. It’s not clear the phone system could be restarted. For example the control plane (SS7) is simply transmitted over the links it controls. It was built incrementally as a continually running system for more than a century and certain bootstraps/incremental elements were long since optimized out. Note that this is unrelated to the fanatical backwards compatibility at the edge which you might think would imply that it is restartable.

The Internet, I think, is restartable as long as layer 0 were available. It’s not really clear what it would mean for the Internet to need to be restarted — perhaps some attack/failure of the BGP infrastructure?


👤 nnq
Unpopular advice: find a subtle way to make it crash, preferably stealthy, but if not possible, at least in a way that can be attributable to mild innocent incompetence instead of malice!

Then there will be more and more interesting work to do for you and others, either rediscovering and properly documenting the config, or, hopefully, architecting and coding its replacement! In the aftermath, the organization will be more robust. If it actually collapses because of this, then it deserved to die anyway; you only helped accelerate the outcome and reduce the suffering.

Some things and processes need to be "helped to fail faster", everyone will benefit from the renewal in the end, even if most will hate it ;)


👤 lnanek2
We have a job running on the back end servers here that is expected to take two weeks in order to finish :) Hearing 55 months for your backend makes me more optimistic, haha.

👤 raverbashing
My first step would be to try and "undelete" the file from the filesystem; even if it was overwritten, it is possible that some of it is still salvageable.

👤 joseph2342
Surprised by the scenario of losing the code for a binary that is still running. In which software domains is this common?

👤 ohiovr
Is it possible to run queries against the program in an effort to reconstruct the config, or would that also be too dangerous?

👤 blinkingled
What OS? On Linux with Checkpoint/Restore you can checkpoint the process to disk and restore it when needed.

👤 sitkack
dd the disk and pipe it over scp to another machine; the config file could still be in there.

👤 alinspired
check https://github.com/checkpoint-restore/criu - designed to dump a set of processes to a file and restore them elsewhere

👤 trcarney
One of the source control companies should take this story to use for marketing.

👤 eru
Can you take a core dump while it's running?

👤 lonelappde
This is how most systems work, at some scale.

👤 arrty88
Is it running in a docker container? Can you copy the suspended state of that container?

👤 zaphirplane
You know how with the flu vaccine, 1 in some large number gets sick? With this, it may end up worse.

If it’s vm take a snapshot periodically, I suspect it’s not.

Try a P2V (physical-to-virtual) conversion, which converts it to a VM on the fly; leave the VM off and periodically rerun the P2V.


👤 goatinaboat
> capture packets and see what kind of traffic it receives most commonly

The modern solution would be to capture the incoming packets as a training set then apply machine learning to create a model that can perfectly recreate the outgoing packets.

It’s still an inexplicable black box of course, that is the nature of ML, but at least you can run it in the cloud now.