HACKER Q&A
📣 exctaticraz

How do solo SaaS founders handle monitoring/PagerDuty?


Can you ever take a break? What if you go on vacation — or simply out for dinner with your friends — and the server goes down?

I guess for less complex apps this can be mitigated with something like Heroku, but still... do they hire freelancers to “watch the shop” when they want a break or are they chained to PagerDuty 24/7?


  👤 smoe Accepted Answer ✓
You can't get out of it completely, but you can reduce the risk of it actually occurring and, maybe more importantly, reduce the constant paranoia about whether the system is OK or not.

What has helped me as the only technical founder, as a freelancer, or in very small teams in general:

- Choose boring technology. Especially when alone, I prefer reliability and a large body of established practice on how to operate it over shiny features.

- Choose technology and infrastructure that you know. It is a whole lot easier to maintain a stable system with something you have ample experience with.

- Keep system complexity roughly aligned with team size. E.g. when alone, it might not be the best idea to maintain 5 very different database systems, although on paper each is "the best tool for the job".

- You don't need a super advanced, well-thought-out architecture, but if you are constantly firefighting while at work, what you have might not even be good enough.

- Set up basic automation so the system can recover itself from the unavoidable but benign hiccup every now and then (see the sketch after this list).

- Don't deploy before going for lunch, coffee break, dinner, weekends, etc.

- While working, observe your system's behaviour over time, and especially the impact of changes on it. If you see a degradation, fix it or at least put it in the backlog. Otherwise it will eventually bite you out of nowhere.

- Have nice error pages and messaging to show users when the system fails. In my experience at early-stage companies, crashes suck but aren't actually that bad; users are quite lenient as long as they can see the system is down, rather than having the worse experience of it silently misbehaving.
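
On the automation point a few bullets up: even a process supervisor's restart policy covers a lot of benign hiccups. A minimal sketch, assuming a systemd-managed app (unit name, paths, and thresholds are illustrative):

  # Hypothetical /etc/systemd/system/myapp.service
  [Unit]
  Description=My SaaS app
  # Give up only on a real crash loop: more than 5 restarts in 2 minutes.
  StartLimitIntervalSec=120
  StartLimitBurst=5

  [Service]
  ExecStart=/usr/local/bin/myapp
  # Restart automatically after a crash, waiting 5s between attempts.
  Restart=on-failure
  RestartSec=5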


👤 fxtentacle
Monit for automated restarts.
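
For reference, a Monit check only takes a few lines. A hypothetical stanza (process name, pidfile, port, and memory limit are placeholders):

  check process myapp with pidfile /var/run/myapp.pid
    start program = "/usr/sbin/service myapp start"
    stop program = "/usr/sbin/service myapp stop"
    # restart if the health endpoint fails three checks in a row
    if failed port 8080 protocol http request "/health" for 3 cycles then restart
    # restart on a slow memory leak
    if totalmem > 1024 MB for 5 cycles then restart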

Hardware RAID cards.

Plus an architecture that is robust.

In my experience, good dedicated servers practically never crash. You might lose an HDD every few years, but that is not urgent to fix if you have a good RAID array.

Avoid most cloud services. Heroku, Rackspace, and AWS have all had many more outages than Hetzner. Plus they'll sometimes force-reboot or force-migrate (= pause) your instances.

So if you go cloud, you'll need failover, distributed database, all that messy and complicated stuff. If you go dedicated, it's much easier and you only need to keep that one box running.

Plus, honestly, would your customers really mind if you're offline for 5 minutes? My dedicated hosting provider also has a service where they will monitor standard services like Apache, PostgreSQL, and Rails for you and restart them as needed. They have a 5-10 minute response time in my experience, and I believe it's good enough :)

Also, going dedicated makes it affordable to overprovision 10x the hardware you need, so you will practically never have a traffic spike high enough to cause issues.

With Heroku / AWS on the other hand, everyone else will also be scaling up when their cloud has hiccups, so your on-demand instances might not start when you need them.

Anyway, Hetzner dedicated + RAID + Monit is how I've been running my SaaS company for 10+ years. And I don't even remember which year I last had an issue that was both urgent and required my attention. The Hetzner people can exchange HDDs just fine without me. C++ core, Ruby website, PostgreSQL and RabbitMQ. 100GB database, 5TB customer data.


👤 dazbradbury
OpenRent[1] founder here. I was the only technical person at our company until we hit 1m users (certainly the only person who could restart/switch servers).

I guess the question is, what happens if the server goes down whilst you're at work? The answer is that if you're constantly fighting fires 9-6, your software is probably severely broken. I'd suggest this is pretty unusual, or at least, I've never heard of software being held together like that at a company that still exists.

You wouldn't want the servers to go down whilst you're at work, in a meeting, or out to dinner with friends. So you design things to be as redundant as reasonably possible.

Then when you make a mistake, you fix it so it never happens again.

Server fear should be the least of your worries. As a founder, lots of things can go wrong that will interrupt a holiday or downtime. In my experience, it's rarely, if ever, software or hardware issues.

[1] - https://www.openrent.co.uk


👤 MattyMc
Solo, technical founder here. I started a very niche EdTech company 5 years ago, non-venture funded, grew it (code+users) while having reasonably demanding full-time jobs, and now operate it FT.

In short: it's tough; you're never off. Our errors surface either by way of user emails or monitoring (shoutout to BugSnag), and to this day I still have anxiety going places without my laptop for fear of a critical error coming up and not being able to fix it. I can recall running out of conference talks, being at a shopping mall with my wife, and SO many other incidents where I'd hop onto the floor of a hallway, pull out my laptop, and frantically try to figure out what was wrong (and fix it).

On the support side, we have a small number of large clients. In this regard, there's no such thing as completely disconnecting. I have a shortlist where if I get an email from _____, it doesn't matter what I'm doing, I'm responding within an hour. Outsourcing to "watch the shop" is quite difficult; I find that some businesses can do this more easily than others. For something highly niche, it's more challenging.

On the tech side, I use managed services wherever possible. Heroku is wonderful (IMO), BugSnag is fantastic, we recently switched to Postmark which helped with deliverability of emails.

I've loved building this business. Control over my time each day is a reasonable trade for having to occasionally (rarely now) drop everything. At the same time, I miss big tech and the community of being at a larger company.

Hope that helps :)


👤 jurajmasar
Full disclosure: I'm the CEO of BetterUptime.com.

One thing you can do is to properly configure your monitoring software.

1. Pick the right alert sensitivity + notification channel: if your app is well-built and never goes down, 30-second checks and getting alerted after the very first failed request work well. However, if another, legacy app is unreliable and often goes down for ~5 minutes when making DB backups, configure your monitoring so that you only get alerted when the legacy service is down for at least 10 minutes (see the sketch after this list).

2. Get phone calls for high-urgency alerts (e.g. homepage is down).

3. Use push notifications/Slack messages for low-urgency alerts (e.g. the background processing queue has too many tasks enqueued). If you're at dinner with friends and you get a low-urgency alert, you can just ignore it.

4. Don't take it too seriously! Odds are it's not a life/death situation when your app goes down. Downtime happens to everyone!

5. Pick a reliable uptime monitoring provider so that you never get a false incident at 4 AM (shameless plug! :)
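
To make point 1 concrete, here is a hypothetical per-service alert policy (pseudo-config, not any vendor's actual format):

  monitors:
    - name: homepage            # well-built, rarely down
      check_interval: 30s
      alert_after: 1 failed check
      channel: phone_call       # high urgency
    - name: legacy-app          # flaky for ~5 min during DB backups
      check_interval: 60s
      alert_after: 10m of failures
      channel: push_notification   # low urgency, safe to ignore at dinner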


👤 nojvek
Many people here are advocating against cloud, but I'm a huge convert to serverless.

Google Firestore + Cloud Run + Cloud Storage really work well together. There aren't any servers to maintain, and it autoscales to zero.

Compared to some droplet VMs on DigitalOcean, which got restarted every now and then, Cloud Run has given me four nines of reliability according to my updown.io monitor.

It’s fast, it’s cheap, it’s low effort once you get the continuous deploy bits setup.


👤 mikesabbagh
Design your architecture for the acceptable downtime. We all want 0 downtime, but downtime happens; you really need to understand what you are building for. Calculate the downtime you are fine with: per month, 99.5% allows more than 3 hours, 99.9% allows 43 minutes, and 99.95% goes down to about 20 minutes (see the sketch below). The less downtime you allow, the less time you have to react to a problem. How long will it take you to turn on your PC on a weekend and figure out the source of an outage? If you plan to go above 99.95%, things will really get tough, and you will have to do major restructuring to get there, as you can't allow failures.

So a design for 98% availability will be very different from one for 99.99%. Once you start designing for >99.97%, stuff gets complicated.
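
The arithmetic is worth doing yourself: a 30-day month has 43,200 minutes, and the downtime budget is just (1 - availability) times that. A quick sketch:

  # Downtime budget per 30-day month for a given availability target.
  def downtime_budget_minutes(availability_pct):
      month_minutes = 30 * 24 * 60  # 43,200
      return month_minutes * (1 - availability_pct / 100)

  for target in (99.5, 99.9, 99.95, 99.99):
      print(f"{target}%: {downtime_budget_minutes(target):.1f} min/month")
  # 99.5%: 216.0 min/month (~3.6 h)
  # 99.9%: 43.2 min/month
  # 99.95%: 21.6 min/month
  # 99.99%: 4.3 min/month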


👤 piou
Step one, as others have said, is make sure things are rock-solid enough that you're not nervous about being offline. When nothing critical has gone down in 6 months, you can start relaxing. If things are going down once a month, you need to work on your infrastructure and processes.

I do make sure I'm always available to fix things within a reasonable time. Practically, I try not to do anything where I would be physically unable to get to a computer with Internet access within 30 minutes, though pre-pandemic I would set my phone to silent when I went out to see a movie or was at the gym. Sometimes this also means bringing the laptop in the car when going places where I don't plan to actually work, just in case.

One side effect of needing to watch for incoming notifications: I have East Coast relatives who insist on texting pre-7am Pacific (sometimes 20-message text chains), and wouldn't lay off when I told them it was too early and I couldn't just turn off my phone because I need to check for work notifications. Texts and calls from them are now muted 24/7, at least until I eventually get a work-specific cell phone.


👤 leesalminen
I was the only technical employee at a SaaS company for years. I made the mistake of building on Rackspace's OpenStack Cloud. Their managed MySQL database would crash, seemingly at random, about 4x/yr for 1+ hour. Pingdom alerts ruled my life. I actually bought a satellite phone that could receive the alerts for when I was on vacation (I tend to vacation in places with no cell service). It really wore me down after a while, to the point where we decided to migrate to AWS. We've been using Aurora ever since and have had exactly 1 instance of DB-related downtime, and it was because of DNS (it's always DNS). My life has improved considerably and I no longer get PTSD from the Pingdom sound at 4 AM. My advice? Choose your infra wisely.

👤 start123
As a solo-founder/developer of [1], this is what I have been doing for the past year or so.

1. I searched for basic monitoring solutions for actively monitoring the backend and settled on New Relic. They provide a free plan that is good enough for most startups. I have added a bunch of graphs for system, infrastructure, and application monitoring. It keeps me sane and well-informed before things go wrong.

2. On my DigitalOcean droplets and database, I have set up Slack alerts that page me in case there is a spike. I created a free Slack workspace just for this and added a different alert ringtone so as not to get confused with other workspaces.

3. I use Freshping to monitor uptime and, again, if things go down, I get email and Slack alerts within a couple of minutes.

4. I have the Rollbar agent running for log monitoring. I get an email alert when there is an exception or error.

5. If I am out for more than half a day, I take my laptop with me.

6. I keep my phone on. Always.

In the last year, things have rarely gone down; maybe a couple of times.

Things I do so I can sleep properly:

1. I do not deploy before heading out, or on Fridays or at bedtime.

2. My infrastructure has a lot of headroom, meaning a larger instance than required, to handle a spike in case I am unavailable.

3. Databases usually break down, so I have recently migrated to DigitalOcean's managed database.

Things I am planning to do:

1. Try out Monit to automate some of these tasks.

2. Write down a list of steps or a runbook in case things go wrong. It is easy to forget steps when the production system is down.

[1] https://blanq.io


👤 Jugurtha
Write good issue templates for features, bugs, and incidents. Do after-incident reports, and fix underlying issues by working to automate recovery or, at the very least, document the root cause and the recovery so you know how to do it manually really fast if you don't yet know how to automate it.
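
A minimal incident template might look something like this (the fields are just a suggestion):

  ## Incident report
  - Date / duration:
  - Impact (who and what was affected):
  - How it was detected (alert, user email, luck):
  - Root cause:
  - Recovery steps taken:
  - Follow-up (automation, tests, docs):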

Having clearly written incident reports tends to surface patterns that help you solve for a more general problem family or type, as opposed to playing whack-a-mole with individual issues. The culprits tend to become clear; they're the "usual suspects": some module, part of the code base, or piece of functionality that causes more crashes or outages, which will nudge you to write better tests for it, find a better implementation, or add better exception handling or validation, etc.

Doing this will either prevent future incidents, automatically recover from them, or speed up manual recovery while you figure out ways to automate it. All of these amortize the pain, as you extract every bit of knowledge from the incidents and "institutionalize" it. You're a "solo founder", but there's no need for future team members or "future you" to go through all that: they'll have a knowledge base at their disposal when they join.

Apologize and explain things to your users.

Consistent, systematic effort.


👤 jblake
- Monitor (with Bugsnag) and fix all errors/exceptions like it's your religion. No new feature dev if there's an open bug. Write tests.

- Use Heroku. Monitor metrics to ensure you don't have major performance issues

- Use Datadog. Datadog can monitor and fix many things (web request queue too big -> trigger a Lambda function to scale up Heroku dynos; worker queue latency too high -> same thing, scale up worker dynos; memory swapping -> restart the dyno). A sketch of that scaling hook follows this list.

- Spend a lot of time fine-tuning your logging and custom metrics in Datadog. It makes investigating much more pleasant.

- Any issue or exception notification routes to a #devops channel in my Slack. Other Slack channels include signups, business metrics, daily revenue reports, etc.

- If something ever happens where you had to intervene to fix it, do a real post-mortem with yourself and try to come up with a way for that to never be a problem again.
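
The Datadog-to-Lambda hook mentioned above is small. A sketch, assuming a Datadog monitor with a webhook notification pointed at the Lambda and the requests library bundled with it; the app name, dyno count, and env var are illustrative, while the formation endpoint is Heroku's Platform API:

  # Hypothetical AWS Lambda handler: a Datadog monitor fires a webhook,
  # and the Lambda scales Heroku web dynos up via the Platform API.
  import os

  import requests

  def handler(event, context):
      resp = requests.patch(
          "https://api.heroku.com/apps/my-app/formation/web",
          headers={
              "Accept": "application/vnd.heroku+json; version=3",
              "Authorization": f"Bearer {os.environ['HEROKU_API_KEY']}",
          },
          json={"quantity": 4},  # target web dyno count
          timeout=10,
      )
      resp.raise_for_status()
      return {"scaled_to": 4}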

I also do a lot of remote camping & off-roading without internet. I'm working on a simple little app where I can get paged on my satellite messenger (Garmin Inreach) if something is wrong, and key clients can also ping me. Only trusted contacts can SMS the Garmin Inreach, so I would use Twilio as the communication pipe. And I've pre-ordered Starlink. My off road truck has an elaborate electrical system (Lithium battery, solar, etc) and I plan to find a way to run the Starlink dish off 12v.

Currently working on my home backup plan, which includes a hot-standby Mac mini, Time Machine and cloud backups, a home battery backup (Ecoflow Delta), Starlink, a portable generator, etc.


👤 nickjj
I've never operated a SaaS app at large scale (i.e. millions of customers with 50-100+ machines, etc.) but for smaller deploys I must say that things haven't ever gotten that bad.

In some of my own projects I've only been bitten by little things a few times over the last 5 years. Like an SSL cert not getting renewed successfully; that one could have been prevented if I had registered the Let's Encrypt account with an email address so I'd get notified it wasn't being renewed in time.
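
That particular failure mode is cheap to guard against yourself, e.g. with a cron-able check along these lines (the domain is a placeholder):

  # Warn if the cert served by example.com expires within 14 days.
  echo | openssl s_client -connect example.com:443 -servername example.com 2>/dev/null \
    | openssl x509 -noout -checkend $((14 * 24 * 3600)) \
    || echo "TLS cert for example.com expires within 14 days"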

If you put in your due diligence with writing tests, run them automatically as part of your CI pipeline, stick with stable software/tools, and keep things as simple as possible until simple no longer works, then you'll set yourself up with a strong base to work from. Then, as you encounter issues, you automate fixing them as soon as possible.

Having monitoring in place to prevent disasters helps too, like getting notified of unusual CPU/memory/disk usage and being warned before it becomes a real problem. Sure, this requires being messaged, but it also means you probably have at least a day's notice before you need to take action. That means you don't need to be glued to a pager and respond in 5 minutes because your site is down. Big difference.

This sort of applies to customer support too. I currently do personal customer support for 30,000+ folks who take one of my programming-related courses. From the outside you would think I'd be slammed with requests to the point where every day involves answering questions for 2 hours, but really it's nothing like that. With a strong base (a working course that stays updated) it's a handful of emails most days, and quite often nothing.


👤 mooreds
I wasn't solo, but I was the sole technical founder of a startup; I was there for two years before I transitioned out (the startup is still going strong).

My take: Lean on managed services as much as you can. This will help ensure that you have other experts to reach out to if you have issues with a component of your system. We were on Heroku + AWS RDS (the latter because at the time the MySQL offerings in Heroku were problematic, and we were using MySQL). Even if you don't pay for Heroku support, they were pretty good.

Make sure you set your SLA to something reasonable. For the startup, I am not sure we even committed to an SLA, but we were handling people's money and a crucial part of their operations. So I tried to be responsive within a few hours, especially if the app was down.

As far as actually taking vacations, I did that a few times. If I was close to internet service, I took my laptop and made sure I had cell coverage. I remember freaking out a bit because a camping area I was at had spotty coverage.

One time I was going to take a trip to the Canadian wilds. I had a friend who was running a larger company and who had oncall set up for his product. I documented the heck out of the system and asked them to be oncall for the 10ish days I would be out of touch. I don't recall if we paid them (might have been a 'friend deal' where we would pay them if there were any incidents), but I do recall nothing happened.

To answer your question:

> do they hire freelancers to “watch the shop” when they want a break or are they chained to PagerDuty 24/7?

If I had to pick the category I was in, it was "chained to PagerDuty 24/7".


👤 onion2k
Be nice to your customers, be open about the difficulties of running a tech business on your own, and they won't abandon you if there's a bit of downtime.

👤 axisK
I worked at a smallish startup. While we had around 20 devs employed, we shared on-call between 3 people.

We invested a lot into availability, especially for the DBs. Most of our issues were internal-DNS related, which we at one point worked around by generating hosts files that updated every hour.

On-call was shared between the 3 of us, with all 3 paged at once; we'd get on WhatsApp to 1. diagnose and 2. fix. Most of the time only 1 of us was close to a laptop, but all 3 of us would assist as best we could.

None of us was tethered at every point in time, but for the most part one of us could get to a laptop within 30 minutes at most. I now work at FAAMG and find on-call especially stressful, but it's only once every ~6 weeks.


👤 omneity
There is no single answer, but the general idea is to make your infra resilient and self-healing.

That means health checks with auto-restarts at every level of abstraction, stateless services...
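
What that looks like depends on the layer. At the container level, for example, a liveness probe, assuming Kubernetes (name, path, and port are illustrative):

  # The kubelet restarts the container when /health stops answering.
  containers:
    - name: api
      image: myapp:latest
      livenessProbe:
        httpGet:
          path: /health
          port: 8000
        periodSeconds: 15
        failureThreshold: 3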

And yeah, on top of all that we have monitoring set up with a few alerts.

With that said, we've only had one severe outage since we set up our infra as described above.


👤 katzgrau
Solo founder for years, but eventually grew to the point where I hired a small team that can handle 95% of issues. I was at the point where I had to or I'd lose my mind with the control it had over my personal life. Hiring yourself out of that role is a journey in itself.

Anyway, yes, you're the one wearing all the hats, so it's on you. There is no real break, because even if you had someone watching the shop, many times the thing that breaks is the thing only you have deep insight into.

I've been on cross-country drives, woken in the middle of the night, at family parties, and hanging out with friends when I've gotten paged, and I immediately stop what I'm doing to fix the issue, even if it takes a while and ruins said occasion. My platform is ad-related, so every second of downtime pisses off a lot of people, because it's directly linked to their revenue. Thankfully that never happened while I was on a plane. I did have to buy a ridiculously expensive WiFi package on a cruise ship, twice, to monitor things.

I've mitigated most potential issues with better infrastructure, tests, and early warnings, but the occasional unexpected item slips in, maybe once or twice yearly. Luckily I have a staffer with deep knowledge of the platform to handle that now. It took a while to get to that point.


👤 jasonkester
I'm a solo SaaS guy, and my "shop" doesn't need "watching". By design.

Most of this was accomplished by simply picking a stack that doesn't ever fall down on me, and the rest was by watching for things that might flake out and either fixing the flake or replacing them with less flaky things.

As such, I get maybe one incident a year where I'll walk briskly across to the office to fix something that could do with addressing today rather than next week. But it's never anything as dramatic as the entire site being down. Most often it's the result of Google having shipped a new version of Chrome that breaks some 10 year old feature of their own browser.

The whole goal of the SaaS business stuff was to maximize my vacation time, so anything that got in the way of, say, taking an entire month off to crew a sailboat across the Darien Gap was a non-starter.

So I don't have a pager. Mostly because I've gone out of my way to ensure there will never be anything to page me about.

I've written at length about this all here:

https://www.expatsoftware.com/articles/happiness-is-a-boring...


👤 forgotmypw17
I architect everything around queues and run at least two redundant, independent processors (written in different languages, even) for each queue.

I try to have as few services involved as possible, which basically means the web server.


👤 ghiculescu
All the advice in this thread is good. All I’ll add is, don’t let it get you down. Remember the upside of working solo. I have many not-fond memories of SSHing into servers while in bathrooms at bars. But the freedom of working for yourself is worth it.

👤 iamgopal
I understand Google App Engine is a failed product as far as the HN crowd is concerned. But many of my projects (~10,000 requests a day) still run year after year with almost no maintenance. I.e. select a platform where you don't need to do pager duty.

👤 bsenftner
I had a solo SaaS I ran from '06 to '15, operating on a 17-server cluster at, of all places, the former Enron data center in Los Angeles. In addition to the "traditional" 3 tiers of dev, staging, and production, we (the startup was 2 people, me and one other) had production set up with redundancy. If some hardware failed, other portions of the cluster would re-route and/or assume the failed hardware's duties. The only single point of failure we had was a Federal Reserve-quality hardware firewall; that was the best investment I made, as it sustained massive DDoS attacks and more without breaking a sweat.

👤 speleding
In my setup I have everything on a (large) server in a colo, with an exact copy on a second server, and the databases in master/slave. Every two years I buy a new server and swap the oldest one out.

When the master server fails, I can run a script to cut over with very minimal manual interaction. I have not had to use the script in 10 years, and I've only experienced one outage, when the datacenter had a blip.

But... it's really hard to not worry about it occasionally, even after 10 years.


👤 o-__-o
I set up Zabbix[0] on a dedicated Atom server and did all of the heavy legwork once (created templates, triggers, dashboards, auto-discovery IP ranges, etc). Then I sit back and build my systems as usual, and they all become monitored based on the tags applied to the VM. Notification is managed by Zabbix, which sends email alerts and has a tie-in to Twilio for SMS notification, and there are a few third-party mobile apps for remote monitoring.

This also means I am on call 24/7. I have Rundeck [1] (the real star of this automated show) running on another host to tackle most common tasks for me, like restarting services or backing up DBs. But sometimes I do have to phone a friend and ask for help, or direct them through tasks to get things running (this has happened once in 12 years).

My buddy and I are putting the finishing touches on a service-monitoring SaaS which is just an HTML5 front end to the above system. If there is interest I will make a note to have a release party here on HN.

[0] - https://www.zabbix.com

[1] - https://www.rundeck.com


👤 gwbas1c
Why are you so worried?

IMO, the best way to answer your question is to ask yourself why you're so worried about downtime. Then ask yourself what you can do to fix it.

Also: a mistake I see is businesses being so feature-focused that they never go back and fix their technical debt. Make sure that your SaaS product is resilient enough for your lifestyle before you add new features or grow.


👤 nkristoffersen
How often does your server go down? If this is a common occurrence that you are stressed about, try to solve that first.

My advice will be a little controversial in this thread, but cloud providers are really perfect for building durable products. Any situation where you can trade dollars for durability is well worth the ROI as a solo tech owner. Load balancers, auto scaling, Aurora clusters, S3: these are all services that help me sleep like a baby even though my SaaS needs near-perfect uptime. Expect instance outages, so keep your servers stateless and run at least 2 instances, as small as possible, and go horizontal.

Another good idea is to learn how your product can die. Load test, try to break your app, and then fix those weak spots.

These are my opinions and experiences, and they have continued to serve me well.


👤 kureikain
Yes, I'm a solo founder and I can take a break and sleep etc.

The key for me is to keep things simple and have them fail in a predictable way by not mixing server roles.

I run an email forwarding service, so I clearly say: this is the incoming mail server, this is the outgoing server. For each of them we have a pair with automatic failover.

Keeping the stack simple means that if something fails, only a portion of it fails. For example, it's OK if the landing page is down; people can still send and receive email.

If the mail service is down, we have a check on our homepage that says our mail service is down and we're working on it.

In other words, try to design the system with clear boundaries between components, so when something fails you know exactly what failed and can do things like restart it, or scale up the server (CPU/mem), to fix it.


👤 c0nrad
I've been solo running/building a startup (csper.io) for the last year or so; it hit profitability a few months ago.

It's easier said than done, but if you can prevent issues in the first place, things will be much more enjoyable.

Some things that worked well for me:

  * GKE on GCP is pretty smooth. When there's a spike in traffic everything autoscales up, so I don't have to do anything. Nice observability, things just work. Just make sure to set container cpu/mem limits (see the sketch after this list).
  * Along that same note, I use MongoDB Atlas, which also autoscales very nicely. It autoscales both up and down very well, saving money and making my infra resilient.
  * GCP has a lot of monitoring/alerting/dashboards that I take advantage of. Health checks around the world, easy integration of logs/metrics. I find structured logging (JSON) makes setting up alerts pretty easy.
  * Good consolidated logging, so when there is an issue you know exactly what went wrong.
  * GCP also supports application tracing, which can make timing issues easy to debug (although it requires a bit of work to set up) (for example if you are missing an index on some DB table).
  * Automatic deployments (thanks to k8s): there's no checklist for doing a deploy, I just run a single make command. I can't screw that up.
  * A staging environment that matches production. Plenty of times I've crashed staging; it's worth every penny. It also makes life much less stressful.
  * Lots of tests. The tests aren't important for when I'm writing the code, but for months later when I make changes and want to know I didn't mess something else up. I find a good test suite can really help you sleep at night, especially if it covers the critical paths.
  * An easy way for users to contact you if there is an issue. No one is perfect, but issues are usually forgiven if you respond quickly.
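
The cpu/mem limits from the first bullet are only a few lines of pod spec. An illustrative fragment (the values are arbitrary):

  resources:
    requests:
      cpu: "250m"       # what the scheduler reserves
      memory: "256Mi"
    limits:
      cpu: "500m"
      memory: "512Mi"   # the container is killed/restarted above this
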
Also "stay-cations" are also pretty nice. I try to do one a quarter. I'm still at home if something does break, but I don't do any work for the week. Just load up a new video game and relax for a week. I call it my "monitoring" week.

Hope that helps!


👤 taf2
I always made sure I had Monit to keep services alive, and init.d scripts that boot all necessary services when the box starts up. Avoid single points of failure as much as possible. Minimize unbounded queries and always set a reasonable request timeout. Have a way to collect stats (statsd is nice).

The reality is, yeah, you probably are not gonna have many restful nights or peaceful dinners... 10 years later for me and I still avoid activities that don't allow me to quickly access a computer. I still always have multiple MiFis in my backpack so that if one cell network is no good, maybe the other one is good enough for me to fix a server... you kind of have to enjoy it.


👤 vptr
I'm a solo founder. I don't have much monitoring, except for http://status.simpleokr.com/ which gives me high-level insight via email into the api/app being unavailable. But I run everything on GCP Cloud Run, which ensures that my app is up. The database is also in HA mode. So everything is handled by the cloud provider. No outages in the past 3 years. I had one early on due to traffic and DB load, and had to scale the DB server. But my business/SaaS is pretty small, so I might be an outlier.

👤 cuu508
I've set up tons of monitoring, and automated what I could: if an app server goes down, it gets removed from load balancer rotation; if a load balancer goes down, it gets removed from DNS. I haven't automated DB failover, because it's just too hard a problem for me, with too many edge cases.
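
The app-server half of that can be as simple as active health checks at the load balancer. A hypothetical HAProxy fragment (addresses and thresholds are placeholders):

  backend app
    option httpchk GET /health
    # 3 failed checks take a server out of rotation; 2 passes bring it back
    server app1 10.0.0.11:8000 check fall 3 rise 2
    server app2 10.0.0.12:8000 check fall 3 rise 2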

For critical notifications I use Pushover with the emergency setting – a repeating full-volume alert on the phone, regardless of volume settings or Do Not Disturb mode.
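
Pushover's emergency priority is set per message. A minimal sketch in Python (the message and env var names are placeholders; priority, retry, and expire are Pushover's documented parameters):

  import os

  import requests

  # Emergency priority (2) re-alerts every 60s for up to an hour,
  # bypassing quiet hours, until the alert is acknowledged.
  requests.post(
      "https://api.pushover.net/1/messages.json",
      data={
          "token": os.environ["PUSHOVER_APP_TOKEN"],
          "user": os.environ["PUSHOVER_USER_KEY"],
          "message": "DB primary unreachable",
          "priority": 2,
          "retry": 60,
          "expire": 3600,
      },
      timeout=10,
  ).raise_for_status()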

I do have a "go bag" with a dedicated, prepared laptop that I take with me on longer trips (not that there have been many in the past year).


👤 effie
Being on call 24/7 is a sure way to end up in a mental facility. You get a partner (or employees) and set up shifts, or make the services redundant enough that an outage isn't a big deal.

👤 tchock23
It sounds like you're pre-idea, so one suggestion is to avoid building a "mission critical" type of product if your desire is to stay solo and not be chained to your laptop...

I run a small SaaS where I made a decision early on between a live chat-based approach to the UX and an asynchronous approach. A big reason I chose the latter was to avoid the need for 24/7 "real-time" support in favor of a better lifestyle, even though the live version likely would have garnered more customers.


👤 nwilkens
Shameless plug..

This is exactly what we do at MNX Solutions. We are a team of Linux engineers, and provide 24x7 monitoring and response to outages for your cloud based infrastructure.

https://www.mnxsolutions.com/it-services/managed-aws-cloud

Even if we're not a good fit, I'd be happy to chat with anyone about ways to improve their site reliability. It's something we're good at, and love to talk about!


👤 davidbanham
I put a lot of effort into making sure things don’t explode. I write tests. I think about the failure modes.

I use a simple tech stack. Golang monolith, Postgres database.

I pay a little extra for good managed services that auto-recover. I run my database on Cloud SQL and my web servers on Cloud Run on GCP.

As a last line of defence, I have a remote development environment I can access from my phone. I can make fixes and deploy from there. I also have a Garmin InReach satellite communicator that I can be contacted on if I’m out of phone range.


👤 michaelbuckbee
Your question seems a little dismissive of Heroku, but I work in that space, and it is managed and reliable in a way beyond what you would get piecing your own infrastructure together.

👤 ezekg
I use Heroku for my own sanity as a solo founder, and I use Papertrail with email notifications to monitor logs. Nothing fancy. I bring my laptop on vacation, but I don't really ever have to use it. I freeze releases a week before vacation to try to ensure that I don't break anything, so that I have room to relax. I agree with others here: use boring tech that you are confident in. I have good integration test coverage, so that also helps. :)

👤 ernsheong
If you use Google Cloud, use Cloud Monitoring (formerly Stackdriver) to set up policies and alerts. There is a Google Cloud Console mobile app that pushes the alerts to you. If you don't use Google Cloud, you can still use Cloud Monitoring (Stackdriver supported AWS, and probably still does).

In addition to that, use managed services as much as possible. On Google Cloud I use a lot of Cloud Run and Cloud SQL, and infrastructure work is kept to a minimum.


👤 sillycube
I am a solo founder working on my Shopify apps. There was a memory issue and my app was going down for a while. Tbh it's quite hard to reboot the service when I'm out; I could only restart my app when I went home.

I don't hire any freelancer to watch my app. But I am using a monitoring service and Django notification emails for when there's an outage.

For solo founders, it is better to pick a product that can tolerate a small amount of down time.


👤 soulchild37
I run a small SaaS making ~$100 MRR. I set up automated pings from UptimeRobot and use Sentry to log exceptions, and I'm currently trying to set up Monit to restart the service if it goes down or overuses memory.

The SaaS provides just a small feature; if it goes down, users probably won't see a noticeable impact. Most issues are solvable by just restarting the web server, and I have an SSH app installed on my phone for fixes on the go lol


👤 lazyant
With 1) database backups and PITR (from a managed service like RDS) so you don't lose customers' data, 2) basic monitoring so you are alerted to downtime even if you're not responding right away, and 3) infra as code so you can redeploy things pretty quickly, you will get very close to what companies with dedicated teams do.

👤 admissionsguy
People nowadays put so much funny crap into their infrastructure; no wonder it's brittle.

The service suddenly going down shouldn't be a serious risk for the vast majority of online businesses (unless you are doing something exceptional, operating at an exceptional scale, or are an amateur).


👤 simplerman
Automation and high quality all around.

Also look for projects that don't need to be 100% available all the time. I use Zenfolio; they send out emails that the site will be down for a few hours on a random night, and I don't mind it. It is just a portfolio site.


👤 comprev
It's interesting reading the comments and seeing a fairly clear divide between developers and operators. Tooling exists to make horizontal scaling much easier these days without going all-out with k8s.

👤 deforciant
Mostly good test coverage, UptimeRobot, sentry.io, and Node-RED to continuously run various scenarios :) Also, get infra from well-known cloud providers.

👤 acoyfellow
I have a master reboot script that I can access from the Google Cloud Console (iOS app). I open it, run the script, and things are OK.

👤 s1k3s
I carry my laptop with me everywhere I go.

👤 ajawee
Try one of the below

1. NewRelic.com

2. Datadoghq.com

3. Atatus.com


👤 devops000
Try Cloud66

👤 forgotmysn
they get a co-founder