HACKER Q&A
📣 ryan_j_naughton

How long till outages if Big Tech employees suddenly got Thanos-snapped?


The recent mass layoffs and resignations at Twitter raise an interesting question: at what point will critical systems start failing and Twitter’s Fail Whale return?[1] When will there be an outage?

While the layoffs at Facebook, Google, Amazon, Stripe, etc. are certainly being managed far better than Twitter’s situation, it is still interesting to consider the relationship between tech workers and the production infrastructure of modern tech companies.

One would hope that most companies have used the equivalent of Chaos Monkey plus modern DevOps engineering to build automated resiliency into their systems without requiring much human intervention. But...

So how long do you think it would be until systems/products at any given Big Tech company degrade in service or experience an outage if the company lost all its workers?

[1] https://www.theatlantic.com/technology/archive/2015/01/the-story-behind-twitters-fail-whale/384313/


  👤 runlevel1 Accepted Answer ✓
I'm surprised I haven't yet seen anyone mention the risk of insider threats.

It only takes one pissed off person with the right knowledge to do a disproportionate amount of damage to the company.

Twitter is in an especially risky position right now:

1. The sheer size of the layoffs increases the odds that at least one former employee wants to retaliate.

2. The manner of the layoffs increases the chances that former employees feel slighted.

3. The source of the layoffs is an individual (i.e. not something abstract like a falling economy), giving them an articulable target.

4. No longer having equity means former employees have less interest in its ongoing success.

5. The scale of the departures means many former employees will no longer know anyone still there, so harm wouldn't befall someone they know.

6. The layoffs include roles with access to sensitive information. That could be anything from trade secrets, to credentials, to where the proverbial bodies are buried.

7. Security teams that would normally mitigate some of this risk might no longer be fully functioning.

I'm not confident enough to say it will happen -- only that the risk is much, much higher than normal.


👤 matt_s
I think Twitter is especially vulnerable to an outage of some kind just from the massive amount of institutional knowledge that has left. Things will start falling apart when whoever is left starts making changes; it could be 2 months, it could be next summer. The other element is whether they have any infrastructure engineers (SREs) left: if nobody is monitoring and keeping tabs on the various things that always fail, those failures will pile up until something fails publicly.

People may casually look at Twitter and think it's a 280-character text field and about 4 buttons to like/retweet/reply/share (and reply is a recursive feature, in a way) and assume you don't need a lot of technical complexity for that. They're correct, you don't; someone could probably build that flow in a weekend or less. The major feature of any public social media company is content moderation. That is invisible to end users, and I imagine a lot of back-end processing and systems are required for it. It doesn't matter what the content is, it needs to be moderated, and that usually falls on human judgement at some point for the more nuanced content. Things that need to be moderated start with content like the spam you get in your email or low-effort promotional content anywhere online, then work your way up through content that will bring lawsuits against Twitter, etc.

My guess is once Musk starts initiating changes they will start discovering how complex it is to make changes, maybe backing out changes initially when BadThingsHappen™. Then he will get tired of it and put someone in charge and find another toy company to play with.


👤 zaphod12
Depends... you could rebalance to stay up forever, but anything new would slow to a crawl.

Your biggest issue is the data center. A lot of folks forget about this stuff, but hard drives are constantly dying, servers go bad or crash, heck, even the AC needs maintenance (though that's contracted out, I'm sure). None of that is glamorous, but it's critical. The major systems are very, very reliable in the face of a few hardware failures, but give it a couple of months of falling behind on maintenance and it would all crumble.


👤 ldjkfkdsjnv
The key point not discussed enough is that outages happen as the code is changing. If you stop deploying new changes, the big FAANGs basically won't go down. Obviously they are so complex that's hard to do, but slowing the rate of feature development will slow the rate of failure. And it's probably not a linear relationship.

👤 ozzythecat
I’m convinced this is a bit overblown.

I don’t know enough about Twitter’s infrastructure, so I’m only speaking at the application layer.

If the code isn’t changing, things should be extremely stable and resilient. Presumably, Twitter had already made significant investments in resilient, fault-tolerant services that function independently at scale.

I’d think the riskier parts are server/hardware failures, hardware load balancers, etc.

One of the key services my big tech org owned was in support-only mode with no active feature development. Despite 500k requests per second, it had just one person on pager duty.

The majority of support issues were OS-level updates and application-level dependency updates/fixing vulnerabilities. But not doing that work wouldn’t take the service down so much as create corporate policy violations for not keeping software up to date. You could also definitely get exceptions approved.


👤 gsatic
Just par for the course with tech. There are far more critical systems than Twitter running all over the world that no one has updated or fixed in a long time.

I worked with a big telco a while back. The software/hardware we maintained for the telco exchanges was used in pretty much every country. That "stack" had been in development for 30+ years. Hundreds of companies and thousands of devs had contributed to it, using as many languages and tools as you can imagine. Many don't exist anymore. Large chunks of the source code and the tooling to build/fix it just got lost with time, relocations, layoffs, mergers, etc. And things would break all the time. All we did was cook up hacks and workarounds to keep things running. No real fixes or updates were possible.


👤 DevKoala
Twitter can continue running with 100 engineers or fewer. That said, can they iterate fast enough on moderation and fraud prevention? Apply security fixes responding to the newest threats? Deliver on advertising customer demands? Stay competitive on features versus other social network platforms? I doubt it, but I bet they can do a decent job with fewer engineers than they used to have.

👤 akomtu
Something like Twitter can run with 100 employees and a few thousand offshore content moderators.

👤 makeitrain
Something better crash. Otherwise, why not cut 20% more headcount?

👤 faangiq
Can easily keep all these places running with 10% headcount.

👤 hayst4ck
Outages are primarily proportional to change.

Problem one: Increasing scale

If a company is growing, increasing scale forces change. It forces change to core systems, like upsharding database clusters. It pushes the limits of various systems in ways that require architecture change. If a company is not growing, a major motivator of change disappears, and with it a whole class of outages that will never happen.

Problem two: Adding features

If a company stops adding new features, new code doesn't really need to be pushed all that often. Bad code pushes are by far the number one cause of outages, although those outages generally don't have the kind of blast radius that architecture changes do.

Problem three: Rot/maintenance/upkeep

Now we get to the crux of the issue, which is something on the order of 3 machine failures per 1000 machines per day (my empirical estimate based on experience). Hard drives fail, circuits fry, network interfaces become finicky, hard drives fill up. A good portion of this can be resolved via blind auto-remediation: there's a problem on a machine? Wipe it clean and reconfigure it for its task. Assuming there are functioning auto-remediation systems and no SPOFs, and that database systems can handle master failures etc., the most major "people need to handle this" problem becomes hardware failure. There must be someone actively procuring new hardware and replacing old hardware.
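A minimal sketch of what that kind of blind auto-remediation loop might look like (the fleet model, the 80/20 failure split, and every name below are hypothetical, for illustration only, not any company's real tooling):

    # Hypothetical auto-remediation sketch; all names and numbers are invented.
    import random
    from dataclasses import dataclass

    @dataclass
    class Machine:
        name: str
        role: str                    # e.g. "web", "cache", "db-replica"
        healthy: bool = True
        needs_human: bool = False    # set when the hardware is truly dead

    def detect_unhealthy(fleet):
        """Machines whose health checks failed (full disk, flapping NIC, ...)."""
        return [m for m in fleet if not m.healthy and not m.needs_human]

    def remediate(machine):
        """'Wipe it clean and reconfigure it for its task.'"""
        # Assume most failures are soft and fixable by reimaging; the rest are
        # dead hardware that a human has to physically replace.
        if random.random() < 0.8:
            machine.healthy = True       # drained, reimaged, reconfigured, rejoined
        else:
            machine.needs_human = True   # escalate: procurement / physical swap

    def remediation_pass(fleet):
        for m in detect_unhealthy(fleet):
            remediate(m)
        return [m for m in fleet if m.needs_human]   # the queue nobody is working

    fleet = [Machine(f"host{i:04}", "web") for i in range(1000)]
    for idx in (3, 42, 777):             # ~3 failures per 1000 machines per day
        fleet[idx].healthy = False
    print(len(remediation_pass(fleet)), "machines waiting on a human")

With no one left to work that "waiting on a human" queue, it only grows, which is where the next point comes in.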

Systems can run at up to 70% of peak capacity, so with roughly 0.3% of the fleet failing each day that's likely on the order of 100 days of unaddressed machine rot before consequences will be seen, depending on how capacity is allocated.
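Making that back-of-the-envelope estimate explicit (these are the rough figures from the comment, not measured data):

    # Rough figures from the comment above, not measured data.
    daily_failure_rate = 3 / 1000        # ~0.3% of the fleet lost per day
    peak_utilization = 0.70              # fleet normally runs at ~70% of capacity
    headroom = 1.0 - peak_utilization    # ~30% spare capacity to burn through

    days_of_unaddressed_rot = headroom / daily_failure_rate
    print(days_of_unaddressed_rot)       # -> 100.0 days before capacity runs out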

Problem four: Context change

While most change is done by the company itself, the company exists in a certain context. Governments can come down on companies via regulation like GDPR, which will definitely require the company to make changes. Security problems can require major or minor changes to be made. When the context a company exists within changes, the company must adapt, and these forced changes can result in outages. Depending on the change, the level of expertise of the remaining employees would likely dictate the severity of the outage.

So, attempting a concrete estimate, I would guess something on the order of months, maybe 3-6 months, with the caveat of good auto-remediation and no SPOFs.


👤 joshxyz
they fired some overhead and hired nerds, should be good right?

👤 theCrowing
Netflix. 7 days.