HACKER Q&A
📣 samvher

What (almost) company-sinking engineering mistakes have you witnessed?


See title - I think a collection of failure stories would be a useful learning resource.

Edit: I’m thinking primarily about technical decisions, though of course these are made in a wider context (eg choosing tech that’s impossible to hire for).


  👤 spiffytech Accepted Answer ✓
I know of a company that was pretty successful with a VB6 app they wrote in the late 90s, but twenty years later they needed to modernize. They spent two years on a from-scratch rewrite that was never completed (second system syndrome).

The company abandoned the rewrite, acquired a competitor, and rebranded the competitor's product as the company's new product. It turned out the competitor's code base had serious deficiencies (couldn't scale to the company's size, had important bugs), and was very unpleasant to develop against. In the 6 months following the acquisition, well over half of the engineering team left the company (heavily skewed towards the highest-caliber engineers) because they couldn't stand what their jobs had become. Many long-time customers threatened to leave because of user-facing technical issues in the acquired product.

A telling incident: the company needed to deploy an emergency fix to the mobile app. The only machine authorized to publish to the app store was a laptop that the owner of the acquired company had left at home overseas. To complete the deploy, he needed to call his wife and walk her through the publication process over the phone.

The company managed to fix enough of the problems that they're still around today, but for a while there was a lot of uncertainty around the company's future.


👤 softwarebeware
I think you're probably asking about mistakes like deleting a database? But more often companies, orgs, or teams dying has to do with engineering management in my experience. I'll name a few I have experienced and you all can decide whether or not these count as "engineering mistakes?"

- Saying nothing about it, but "secretly" moving the team to another country by changing the manager and director, and not hiring any more in this country, even when team members quit. It took a while, but I figured out that the plan was to let natural attrition take over and, for every person who quit here, hire their replacement there.

- Different company: after being acquired, the acquiring company communicated for eight months that no changes in staffing would take place, and that once the acquisition closed, everyone would be put on exciting new projects and a bright new future would emerge. Then, after the acquisition closed, they laid off a third of the company immediately. Another third chose the exit soon after. Eventually, the company died.

- Third company: after being acquired by a large competitor, the director of technology for this third company promised them a new product for four years. I'm led to understand by those closer to the situation than myself that this director then proceeded to coast on vaporware demos for those four years, claiming the need to "pivot" or "reboot" the product as necessary, and promising more and more pie-in-the-sky fantasies until finally the jig was up and he was fired. That subsidiary also officially closed its doors eventually.

Just a few from my own experience!


👤 h2odragon
Saw someone buy a lot of bumper stickers to promote a political issue, without paying sufficient attention to the printing details. The ink was not UV tolerant. Used outside, the stickers became blank white within a week.

As I recall, part of the slogan was "long term thinking".


👤 nostrademons
When I've seen companies sink for technology reasons, it's usually been because of the cumulative effect of tech debt accumulating over a couple of years. Engineers need to hit a launch deadline, so they throw in some global variables or make a few private APIs public in the interest of expediency, figuring that they'll clean it up later. Later comes, and the engineers move on to new projects instead of cleaning up the previous mess. New code gets written that depends on the previous hacks. Eventually you end up with a big ball of mud that breaks whenever anybody touches it, and nobody can launch any new features. Now the company is forced into the position of rewriting everything from scratch, but can't do so strategically; they just need to do it now because they don't have any other options.

Companies failing for "big" single decisions - like rewriting their code from scratch, or a poor tech stack choice, or a poor initial architecture - are much more rare, for the simple reason that most experienced technical leadership knows that these are risky decisions and tends to put a lot of thought into them, and then there's a lot more organizational commitment to following through on the decision once it's been made. Also, if done from a position of market leadership, you can usually recover from them - many startups start with a poor tech stack and poor architecture, and just assume that they'll rewrite everything once they get lots of funding and a suitable lead over competitors. The insidiousness of the "creeping tech debt" scenario is that you often don't realize you're screwed until you've fallen behind the competitors, which means you enter the "rewrite and tech switch" scenario from the position of being market laggards, which can kill you.


👤 irvingprime
Well, it didn't end the company but only because the company was bought and bought again - then killed.

The new head of engineering had no experience in software. He developed a dislike for the language the product had originally been built in for his own reasons. He ordered a complete rewrite of the software in a different language. He hired contractors since we had no in house skills in the new language.

This lack of in house skills also meant that oversight of the code the contractors produced was poor. Several of us raised alarms over this but we were ignored. Eventually, the day came when the first paying customer was signed for the new platform. It was a complete disaster. The code was very unstable and full of bugs. The deadline kept getting pushed farther and farther out.

The new owners concluded the entire division was a failure. They fired the head of engineering who had been in charge of the disaster and his boss too. Then they started work on a new version of the platform using entirely different people. The ones left behind on the old platform (in the old language) had nothing to do but provide support for the dwindling number of open contracts.

Millions of dollars went down the drain, years of time were wasted, customers were badly served and a lot of people were incredibly frustrated, all because one executive decided to ignore the in-house talent and experience and follow his own inflated ego instead.


👤 coldcode
In the early 2000s the consulting firm I worked for got a customer who was building a "revolutionary" (their words) new medical software suite. They were having difficulties shipping the product, so we were hired to help. Turns out they were insanely paranoid about IP theft, so they had 20-30 programmers working on the code, each with zero access to any other programmer's slice (i.e. source code) of the app and only allowed to interact with it through libraries and APIs. No one except the execs had access to the entire codebase, thus no code reviews, no unified architecture or design agreements; basically it was an app made up of 20-30 independent apps, all doing things differently without any coordination. After we got hired they fired all of their programmers and gave us the entire source code. It took months just to be able to build the app at all from source, and it was such a mess that it was impossible to make something remotely shippable.

One day they vanished and owed us nearly $800,000. It was enough that our parent company just shut us down a few months later. Oddly enough we had a really good group of developers and likely could have rebuilt the whole app ourselves, but our parent insisted on trying to recover the money instead of just taking the IP.


👤 anonymousDan
A friend used to work for Knight Capital, which basically went bankrupt by inadvertently hooking up their test trading system to the live market.

👤 foobiekr
Not limiting access to production.

It’s both an operational and technical decision because it results in inadequate tooling and specials.

A service I know well had an issue, and as a result they started doing what modern services do: nuking things, figuring the issue will clear as clean versions come up.

But unknown to the ops team, they nuked a bunch of custom stuff that no one knew how to build or really what they contained. Developers with direct access to production had rolled them out.

7 days of partial outage later they covered this up by hiding parts of the app until they could straighten things out.


👤 slg
I think this community and the tech world in general overvalues engineering and I say this as an engineer myself. In my experience engineering or technology rarely seem to be the reason for a company's success or failure. There are certainly outliers in which a tech is so incredible that it alone can build a company. There are also some engineering mistakes that can be too costly to fix later and people can always be negligent and not do something obvious like proper backups. However those situations are usually extremely rare. The most common engineering mistakes can be fixed with more money. Typically some other factors are what make or break a company. That is what leads to not having the money to fix those engineering mistakes.

👤 felizuno
- Telling the engineers it's an MVP and then, when they're done, releasing it and selling against it not even as beta, but as v1 production code. I see this all the time and I take a lot of money from these companies to rework and retrofit these systems to the quality they should have been before going live.

- Choosing load-bearing tech that you're unsure will meet 100% of your needs based on hype. For example there are tons of companies with marketing websites that for one reason or another can't have useful user analytics software attached or run A/B tests.

- Letting the engineers define the entire product. This leaves you with a "perfect" solution which then can't be explained let alone sold because the customer/user perspective was not properly considered. I've seen more than one innovative (and desperately needed) startup with patented tech fail this way despite having a groundbreaking solution at hand. Product design matters.

- Dividing your org into "good" teams and "bad" teams by funneling productive engineers towards important problems and not redistributing them once those problems are solved. This "good" team eventually spends all their time fixing the broken parts of the system that had been relegated to the "bad" teams (because they are so degraded that they are now the most important problems). This then causes those "good" engineers to quit because they don't like pure maintenance work, and the resulting rapid loss of knowledge cripples the business.

- Wasting large quantities of dev hours on things that won't ever make your cost of labor back. Obvious examples include companies that spend more to support IE than they bring in from IE users (I've observed this regularly doubling implementation times on a per-task basis).

- Native code avoidance. Everybody I know that has spent > 2y on a React Native project eventually switches to native code and wishes they had started that way. This is a sample size of 10+ real $MM projects. I've seen the same for many Electron-style apps. The resulting "stop the presses" rewrite is almost always started too late to save the day thanks to simple sunk cost fallacy.

... the list goes on and on. Statistics don't lie, the road to failure is wide and welcoming ;-)


👤 logosmonkey
I worked for a company that did an SAP modernization project. The IBM consultants did a large part of converting all the custom ABAP stuff. The idea was to get back to as vanilla SAP as possible, and it included a ton of Business Objects and data warehouse work as well to convert old reporting etc. They were constantly behind and decided to just push the load testing off the road map to hit the CIO's arbitrary go-live date. Within three days the data volume got large enough to grind the entire system to a halt, and the company couldn't take, bill or fulfill orders. Of course the consultants were well out the door by that point. I spent months unwinding the stupid crap they did on the Business Objects reporting side.

👤 blablabla123
Worked at a startup that built a 3D modelling tool in the browser. They did a major refactoring that was still ongoing. At this point in the core team of 5 they churned through 10 engineers in 1 year. So when I started my job I learned that you could neither create a working build nor would any of the tests run (not to mention the CI pipeline). Eventually the build could be fixed, as well as the few tests. But as it turned out, every basic user functionality had to be rewritten and the structural refactoring was still in progress.

IMHO the refactoring sounded very reasonable since the old code base was unmanageable. But there was a culture of extreme pressure from the CTO, making everyone rush through tasks and preventing anyone from doing "proper engineering". At the time the CTO was in some sort of permanent absence already. Afterwards the team lead left because of burnout, I also left, and later on I heard they closed down the company.

Actually I've seen similar destructive refactorings at other places. At one they also had a lot of subtle problems leading to many user complaints. It could be fixed but by then it was already too late.

IMHO refactorings are great but it's always necessary to keep regressions in check all the time and really understand the design decisions of existing code.


👤 hakunin
Falling into the CMS trap[1] at a sensitive time in a startup can kill the business. In brief: when you try to build out too much complexity up front, everything easy becomes hard, everything hard becomes impossible, and generally any change takes too long. And since sometimes you don't have the luxury/runway to retry, you are stuck with it.

[1]: http://max.engineer/cms-trap


👤 LinuxBender
To protect their identity I won't go into specifics, but: not implementing anti-tampering on local and remote backups, i.e. protection from root. Backups only residing on live systems. Not protecting systems from bad automation. Not deprecating old automation frameworks and continually adding new automation frameworks. I am intentionally excluding specific incidents.

👤 electrondood
Not engineering, but CEO banking the entire future of the company on a partnership, where we were dependent on the partner company for future financing, but the success metrics of the partnership were entirely in the hands of their sales department.

Never bet the future of your company on metrics that are entirely out of your control.


👤 AnimalMuppet
Telling the entire engineering team that they need to move to a different city 1500 miles away.

👤 cyberge99
Not sure if it will sink the company, but moving from very good native apps to a single PWA/Electron-style app that was poorly written.

The business justification seemingly makes sense: to have a single development train. But the execution was horrible. Bugs, horrible performance, inconsistent features copied over from the various native apps, etc.

It was released as a major upgrade, but it was almost alpha grade in reality.

That org probably saved a lot in engineering, but the customer backlash was immense and likely cost them more.

(It was a very popular and beloved note-taking app that tried to become an all-in-one day planner/calendar/todo/note platform.)


👤 0xbadcafebee
Attempting to rewrite a multi-million dollar application at the core of the business. The rewrite was never designed to follow an embrace-extend-extinguish pattern, so either it would all work perfectly, or the whole business would be sunk.

👤 anyfactor
I have seen a couple of startups that wanted to build and launch their product as fast as they could. The product was supposed to be a SaaS, but in that rush they chose technologies that made adding features very difficult. While they were considering and procrastinating on adding features, competitors started to pop up and take over the niche.

The lesson is that, even if you are just planning to make an MVP, make sure you have your scaling figured out. The blue ocean gets red faster than you can imagine.


👤 asychro1611
Company gave the Product team free rein and always "deferred" paying off tech debt to chase the next quarterly goal or business pivot. Eventually the tech stack was a giant, messy tarball with no test coverage and layers of hacks, but everyone could still deliver features because the original team was still there (this also provided justification to the Product team that tech debt payoff was unnecessary or could be "deferred" again).

After a while, business stagnation from constant pivots and new initiatives resulted in attrition and a downward spiral of budget cuts and reduced morale. Attrition and knowledge loss caused velocity to drop, which caused more of the team to leave, which made the business stagnation worse, which caused morale to go down more and budgets further cut, etc. Eventually the company couldn't recruit the same level of talent to replace people who left, hiring standards dropped dramatically, and it was now impossible to pay off tech debt or even really run the tarball reliably anymore (a Ruby monolith running an ancient, unsupported version of Rails with a million security holes and bugs).

Engineering leadership made a mass exodus, and the few people who were left ended up on a death watch as the move to an outsourced engineering org from India was implemented on the way towards a full migration to a 3rd party vendor platform. Software engineering was completely eliminated from the company, and the lesson the company leadership took from this was that "we never should have built the platform in the first place" along with a dose of "external business factors outside of our control caused the decline in revenue, forcing us to make hard decisions".


👤 acquacow
Running McAfee products on all production servers.

Lots of issues historically, and recently too.


👤 gitfan86
Palantir is operating under the idea that "The AI will figure it out once we can access all the data"

This is a mistake because:

1. A lot of the data at big orgs is garbage or only understandable within a certain context by specific people.

2. Internal politics within lots of organizations prevents access to this data.

3. The AI cannot just figure it out. You need tons of humans in the mix, which brings you back to #1 and #2.


👤 rurban
We recently saw the near-downfalls of Intel, Apple, and Boeing because of grave engineering mistakes, ultimately caused by grave management mistakes.

Apple seems to have jumped ship successfully by switching to ARM, though their keyboard and OS are still unusable. I wonder how their China adventure will turn out.

Intel could be bought by Nvidia or AMD, but since the government invested so much into their backdoors there, they will be kept alive. But no chance that their architectural problems can be solved at all.

Boeing has been unfixable since its unfriendly McDonnell Douglas management takeover and the technical decline that followed. Now even their stable flagship product line fell down in a straight line.


👤 Jemaclus
In my experience, most successful people (and by extension, people who run successful companies) don't really know why they were successful and often attribute it to their own actions (eg, I write great code), rather than luck or networking or having solid code reviewers or some external factor. Because of this, when they are successful, they often double down on whatever it is that they think they did the first time, and that frequently doesn't work a second time.

Maybe the first idea hit a niche that wasn't being satisfied by the market, and the second attempt tried to break into a heavily saturated market. Maybe they just lucked out by being in the right place at the right time, or having the right connection that could bring in a multi-million dollar contract, or...

From a broader perspective, I would say that the biggest and most common mistake that people make (engineers too!) is not to spend time examining the hows and whys of the success that you've had so far. Were you successful because you had brilliant ideas, or was it because you had a mediocre idea that filled an underserved niche market? Were you successful because you used Postgres or Ruby or Kafka or Elasticsearch? Were you successful because you created a culture of innovation and learning and team players? Or were you successful because you happened upon a fantastic solution for the specific problem at hand but can't generalize it to larger problems?

If you don't know why you were successful in the first place, it's hard to continue to be successful.

TL;DR: lack of introspection and evaluation of success criteria over time


👤 Mizza
Building/pivoting on the unverified assumption that a market demand exists.

👤 k__
Ignoring the cloud for too long.

👤 have_faith
A colleague accidentally swapped every product photo on one Premier League football team's shop website with a kit photo of another Premier League team. He pushed up the change (with his testing image...) and went home without realising until angry phone calls started coming in.

👤 xnx
1) Elective rewrite of a working and performant system without an understanding or any testing mechanism to know if the rewrite was producing similar output.

2) Solving theoretical scaling problems with every possible technology du-jour.
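
For point 1, one common safeguard is a differential test: feed the same inputs to the old and new code and report every disagreement. The following is only a minimal sketch under assumptions. `compare_implementations`, `legacy_price`, and `rewritten_price` are hypothetical names invented for illustration and do not come from any system described in this thread.

```python
from typing import Any, Callable, Iterable


def compare_implementations(
    legacy: Callable[[Any], Any],
    rewrite: Callable[[Any], Any],
    inputs: Iterable[Any],
) -> list[tuple[Any, Any, Any]]:
    """Return (input, legacy_output, rewrite_output) for every mismatch."""
    mismatches = []
    for item in inputs:
        expected = legacy(item)  # the working system is treated as the oracle
        actual = rewrite(item)
        if expected != actual:
            mismatches.append((item, expected, actual))
    return mismatches


# Hypothetical stand-in for the old system (prices in cents).
def legacy_price(quantity: int) -> int:
    # Old behaviour: 10% bulk discount from 10 units up.
    total = quantity * 500
    return total * 9 // 10 if quantity >= 10 else total


# Hypothetical stand-in for the rewrite, with a subtle off-by-one.
def rewritten_price(quantity: int) -> int:
    # Discount is only applied above 10 units.
    total = quantity * 500
    return total * 9 // 10 if quantity > 10 else total


if __name__ == "__main__":
    for item, old, new in compare_implementations(
        legacy_price, rewritten_price, range(1, 21)
    ):
        print(f"input={item}: legacy={old}, rewrite={new}")
```

In practice the inputs would more likely be recorded production requests than a simple range, but the idea is the same: the rewrite is not done until the mismatch list is empty or every entry is an intentional change.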


👤 biorach
1) Employing only the cheapest contracting shops for several years and being baffled when all progress ground to a halt due to a mountain of spaghetti.

2) Then deciding to rewrite.

👤 rkk3
Unintentionally making sensitive enterprise information publicly accessible & indexable/searchable thru a cloud service

👤 tawan
[Meta] I'd love to read the book with all the stories mentioned in the comments.

👤 Apreche
Building the wrong thing.