I was wondering: does anyone self-host databases on their own infra these days?
At work, I never self-host. Life is so much easier when the blame for whatever unrelated reason can be outsourced to some cloud provider or internal datacenter team.
Because if we self-host the database, we'll be the ones responsible when it's not reachable, even if it's because someone in the datacenter changed some seemingly unrelated firewall rule. I want to avoid having to explain anything to our customers. Just saying "no idea, the datacenter team is working on it, you can ask them" has done wonders for my mental health and job satisfaction.
We're doing this for multiple reasons: a) As our DB grew, the service became very expensive, one of the biggest items on our AWS invoice; b) Keeping the PG servers up to date is a pain, and we simply don't have time for it; c) We wanted to be able to migrate to other clouds and even offer a self-hosted version of our platform.
I think the problem is entirely caused by the US having absolutely abysmal private internet speeds and capacity. Since you can't realistically have your own server at home, you're forced to host it elsewhere with a sensible internet connection.
It's as if, in an alternate reality, no private residences had parking space for cars; no garages, no street parking. Everyone would be forced to use public transport, taxis, or chauffeur services to get anywhere. Having a private vehicle would be an expensive hobby for the rich and/or enthusiasts, just like having a personal server is in our world.
— Me, 2019-10-13: https://news.ycombinator.com/item?id=21235957#21240357
You have control over the version.
You have control over features.
You have control over performance.
It’s tons cheaper for greater performance - especially when you go over a few hundred gigs.
Yes, the hosted ones have built in replication - but my data is far too valuable to put in the hands of a third party. If they lost it - they could only shrug and say sorry and that’s it. The TOS indemnifies them.
Honestly, I think it's kind of lazy not to take the management of your data into your own hands if you run a database of any significant size.
That being said, we do replicate and back up nine ways to Sunday.
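For what it's worth, here's a minimal sketch of the kind of belt-and-suspenders dump job that implies, assuming Postgres; the database name and backup directory are hypothetical, and a physical backup tool would sit alongside this rather than be replaced by it:

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path

BACKUP_DIR = Path("/var/backups/pg")   # hypothetical location
DB_NAME = "appdb"                      # hypothetical database name

def dump() -> Path:
    """Take a timestamped logical backup with pg_dump (custom format)."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = BACKUP_DIR / f"{DB_NAME}-{stamp}.dump"
    subprocess.run(["pg_dump", "-Fc", DB_NAME, "-f", str(target)], check=True)
    return target

def prune(keep: int = 14) -> None:
    """Keep only the newest `keep` dumps; timestamped names sort correctly."""
    dumps = sorted(BACKUP_DIR.glob(f"{DB_NAME}-*.dump"))
    for old in dumps[:-keep]:
        old.unlink()

if __name__ == "__main__":
    dump()
    prune()
```

Run it from cron, ship the dumps off-machine, and you've covered the "oops, someone dropped a table" class of failures that replication alone won't save you from.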
I do this for cost reasons: self-hosting the database is so much cheaper than the managed options that, for the same cost, I can run the entire app stack, load-balanced web servers, etc.
No containers anywhere, but maybe I'll add some in the future.
- Version upgrades don't require migrations.
- Replication is pretty easy, well understood, and tolerates version differences between the two ends (a minimal setup sketch follows below). Nowadays they've even got crash resilience right.
- It doesn't require VACUUM or any other regular maintenance.
- XtraBackup is awesome.
I've been running a pretty large and rather critical MariaDB database for about 15 years (of course it has migrated to a different machine a couple of times), without anything 'interesting' happening. (Except for power failures messing up replication consistency - but that seems to be a thing of the past.)
My experiences managing PostgreSQL were... not as great. (Though with regard to most other aspects, PostgreSQL is a lot nicer than MariaDB.)
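To illustrate how little ceremony that replication setup involves, here's a minimal sketch, assuming MariaDB with GTIDs; the hostnames and credentials are hypothetical, and the primary must already have binary logging and a replication user configured:

```python
import pymysql

# Connect to a fresh replica; host and credentials are hypothetical.
replica = pymysql.connect(host="replica.internal", user="root", password="...")

with replica.cursor() as cur:
    # Point the replica at the primary. With MariaDB GTIDs
    # (MASTER_USE_GTID = slave_pos) there's no need to pin a
    # binlog file name and position by hand.
    cur.execute("""
        CHANGE MASTER TO
          MASTER_HOST = 'primary.internal',
          MASTER_USER = 'repl',
          MASTER_PASSWORD = '...',
          MASTER_USE_GTID = slave_pos
    """)
    cur.execute("START SLAVE")
    # Sanity check: inspect replication state after starting.
    cur.execute("SHOW SLAVE STATUS")
    print(cur.fetchone())
```

In practice you'd run the same statements from the mysql client; the point is that the whole dance is a handful of statements, not an orchestration project.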
In general, this is implemented as two main flows:
- Data collected from "the edge": Web servers that serve ads, or receive form fill-outs for lead generation, that do nothing but record this information in a logfile
- Configuration and Reporting pushed from "the hub": A central processing node (usually in a hot/warm configuration, but I've been known to use things like raft here) that receives information from flow 1, makes a decision and writes the result to flow 2.
Because the "data" I want to materialise to clients always either fits into ram, or is in a coherent stream ready to deliver to a client exactly, my applications are very fast: I can always just add another node (in the right geographic place), however I also noticed another really interesting benefit: It is much easier to develop for than using an external database.
I have a general purpose tool I wrote decades ago that moves logfiles around reliably (and at low latency), merges them together, implements failover, etc, so I really don't have very much to write to make a new application. Sometimes I experiment with the encoding of the payload, but sometimes not.
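I obviously can't show the author's tool, but the core merge step is simple enough to sketch; this assumes line-oriented records prefixed with a sortable timestamp, and the file names are hypothetical:

```python
import heapq

# Toy illustration of the hub's "merge logfiles" step: each edge node
# ships an append-only file of tab-separated (timestamp, payload) lines,
# and the hub folds them into one time-ordered stream.
def records(path):
    with open(path) as f:
        for line in f:
            ts, _, payload = line.rstrip("\n").partition("\t")
            yield (ts, payload)

def merged(paths):
    # heapq.merge requires each input to already be sorted, which holds
    # for append-only logs written with monotonic timestamps.
    return heapq.merge(*(records(p) for p in paths))

if __name__ == "__main__":
    for ts, payload in merged(["edge-us.log", "edge-eu.log"]):
        print(ts, payload)
```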
On larger projects (typically once k8s gets involved) I'm running on a cloud provider anyway and I might as well use a hosted version like RDS for the main database.
It comes down to the importance/budget of the project. I'm not a Postgres expert by any means, but I'm confident enough that for simple use cases I can manage self-hosted. And if I need more, hosted versions are available at a cost.
However, any hosted DB product I use has to be open source and, in theory, easily replaceable with a self-hosted version. After the Parse.com fiasco I'm averse to closed/proprietary components in my infrastructure.
Outside of RDS, you're afforded a bit more creativity in your performance profile: multiple disks in RAID 0, tiered storage (burstable and provisioned-IOPS volumes together, rather than one or the other), and instance store, to name a few. I actually once witnessed one of our databases riding out a 400K IOPS storm for a few minutes.
Before I get downvoted: yes, these introduce risk into your data architecture. It's crucial to understand what these choices will do to your failure profile, and plan accordingly. In our case, fast-failover to a hot spare database, combined with total recovery within two hours, was enough of a compensating factor to manage that risk.
IMHO, a managed solution isn't that much more reliable than running your own with the right planning. The failover strategy of AWS RDS is just absurd: it literally just promotes another instance and switches DNS over. And I have seen issues where the new master was a few transactions behind compared with the replica...
A managed database mostly just moves the responsibility to somebody else when the database is down.
Self-hosting gives you a lot of flexibility to mix and experiment with new technologies and tooling.
So yeah, I'm running my own Postgres, one on each of my web servers actually. Mostly because it's simply sooooooo much cheaper, and I hate monetary growth limitations for my projects.
> The good news is that the pain caused by many of the issues brought up in this post can be reduced or eliminated by using a managed database service like Heroku PostgreSQL, Compose PostgreSQL, Amazon RDS for PostgreSQL, or Google Cloud SQL for PostgreSQL. If you can use one of these services, for the love of all that is holy, please do!
And I am not the only one hosting it on my own; many people are using self-hosting PaaS like Dokku, Flynn, or CapRover. All these solutions share a common problem: they each need to reinvent the database-as-a-service layer. What's really missing is a good open-source PostgreSQL-as-a-service appliance, something "simple" like Heroku. There have been attempts like Elephant Shed, but they all try to do too much and therefore fail in adoption because they never reach stability. Or people are forced to use complex solutions like Patroni, which does many things, but if something fails you have no clue what to do.
So what is really missing?
1. A simple old-school PostgreSQL VM image.
2. Built on a copy-on-write file system, so production environments can be cloned for development (like https://postgres.ai/ and Heroku offer) without really needing any extra storage space.
3. A built-in backup solution like pgBackRest, storing the files encrypted in some cloud storage (and restore options!).
4. It does not need a complex UI or inspection/monitoring software; there are many solutions you can run on a different machine.
5. Replication, auto-failover, etc. are hard; just make it a single server that you scale vertically. If you need horizontal scaling you have very specific needs, and there can't be a one-size-fits-all solution, so don't even try it. A real bare-metal server with NVMe disks has a lot of power; it's insane how fast it is compared to cloud-hosted databases.
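For point 3, the glue such an appliance needs is genuinely thin; a rough sketch, assuming a hypothetical stanza named "main" that is already configured (the encrypted cloud repository lives in pgbackrest.conf, not here):

```python
import subprocess

# Thin wrapper over the pgBackRest CLI. "main" is a hypothetical stanza
# that must already exist in pgbackrest.conf.
def backup(kind: str = "diff") -> None:
    assert kind in ("full", "diff", "incr")
    subprocess.run(
        ["pgbackrest", "--stanza=main", f"--type={kind}", "backup"],
        check=True,
    )

def restore() -> None:
    # Restores the latest backup into the configured data directory;
    # the PostgreSQL server must be stopped first.
    subprocess.run(["pgbackrest", "--stanza=main", "restore"], check=True)

if __name__ == "__main__":
    backup("full")  # e.g. weekly full from cron, nightly "diff"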
One example of many: America's Frontline Doctors, famous for their dissemination of factual information about Hydroxychloroquine and Ivermectin for Covid treatment, had their hosting pulled by AWS just a few weeks back, despite the ever-growing evidence that their stance is scientifically supported. (And I don't care if they were lying their butts off, AWS still shouldn't have pulled their hosting.)
Cloud services offer awesome leverage, which is a huge advantage, but anyone relying on them for their business or organization to operate is being completely irresponsible.
We also host 2 separate PostgreSQL instances, each with an asynchronous replica (which serves some of the read queries) and with Barman for PITR.
The PG instances used to be on RDS. The performance difference (and cost savings) is... at least 10x.
We have no devops/sysadmins.
As many have noted, DaaS is very convenient but not always completely flexible with configuration, and the pricing is not comparable, since there is usually a minimum price even if you only want 1 table with 10 rows, so it won't scale in the same way.
For all of my home and work projects, we have servers hosted in the cloud to get better internet bandwidth, and we install and run our own SQL Server and MySQL instances. A bare VM is pretty cheap in the cloud and installing databases is fairly easy, although I have never personally set up clustering or failover in MySQL or SQL Server; I think that is quite involved.
Would a hosted service be a better idea, in your opinion? (That machine also has a LAMP stack for some of my web stuff.)
Then I needed to host that WP instance somewhere. That was also a pain.
I decided last year to try Node/Express/Mongo for my backend instead and stay fully in JS land. But then I needed to host a Mongo DB. Instead of that, I tried Atlas for hosted Mongo, but... it feels like it's from corporate America circa 2004.
Then came Supabase.
I randomly came across it, gave it a try, and it's heaven. It's absolutely dead simple to get working (for my simple needs). It can work as a regular DB or as a realtime thing like Firebase.
Strongly recommend trying Supabase.
I use a small library that encapsulates the use-case (including some schema migration stuff that should probably be replaced with flyway).
https://github.com/mschaef/sql-file
There's a lot that can be said about this approach, both for and against, but I find that it brings a lot of power, is easy to set up, and generally stays out of the way. Given that the fundamental abstraction presented to user code is essentially SQL with connection pooling, there are also good escape strategies to get to architectures that would support however much capacity I'm ever likely to need.
Hosting is not only about speed and price. We use Atlas (= hosted MongoDB) because we get that nice dashboard with all those statistics and performance-tuning hints. Also, if we screw something up, Mongo engineers are available immediately to help us. That was nice when we broke production with index creation in the foreground (which pymongo defaults to, even though MongoDB itself defaults to background creation). We also consulted them for tuning a frequently running "slow" query, and that helped.
In short: hosted solution for support and less maintenance. Backups are arranged. We can choose which version to run on and upgrades are also arranged.
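For anyone who wants to dodge that particular foot-gun, the fix is one keyword argument; a minimal sketch with a hypothetical collection (note that MongoDB 4.2+ dropped the foreground/background distinction in favor of a hybrid build, so this mainly matters on older servers):

```python
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")  # hypothetical URI
events = client.mydb.events                        # hypothetical collection

# On MongoDB < 4.2, create_index builds in the foreground by default,
# locking the collection for the duration; background=True avoids
# blocking reads and writes while the index is built.
events.create_index([("created_at", ASCENDING)], background=True)
```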
Question: is there a comparable service for PostgreSQL, like MongoDB Atlas?
The main reasons we self-host are privacy and cost. Postgres costs almost nothing, because it's part of the cluster we need anyway (also self-hosted), and ClickHouse can be scaled as needed. Hetzner has some really cheap VMs; our whole setup, including the cluster, costs about 45€ a month.
You can use a cheap virtual server + attached storage. In my mind, it's not as difficult as people would have you believe (until you hit a scaling issue and want horizontal scaling, for example).
I also teach how to do it in my upcoming book[0] where I even have a scripted demo to create your first cluster including SSL, SELinux, attached storage,...[1].
For work stuff, I would just use managed offering in the cloud the company already has. So far, that was AWS and Azure.
[0] https://gumroad.com/l/deploymentfromscratch [1] https://gist.github.com/strzibny/4f38345317a4d0866a35ede5aba...
I know better now, but also all of those other things are easier when they’re a terraform config than when they’re manually managed. I use RDS now.
On the plus side we can do ad hoc tests and experiments by creating a new Aurora cluster with a recent snapshot and try things out.
For smaller projects it's too expensive. I just deploy a dockerized database next to the application.
For bigger projects every customer so far wanted to have the data physically in their data center. So we just installed databases on VMs.
Sounds crazy until you learn that you can expose user-defined functions to SQL and eliminate all network overhead, because the database runs in the same process. SQLite operations are effectively a direct method invocation across a DLL boundary. If you want queries to reliably resolve within microseconds, this is a great path to go down.
Not for every application, but it fits ours extremely well. Deploying our software to customer environments is trivial because we only have to worry about the one binary image. No external databases, Docker hosts, etc. are required to be installed.
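The UDF trick is easy to demonstrate; here's a minimal sketch using Python's built-in sqlite3 module (the scoring function and schema are made up for illustration):

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")

# A made-up ranking function, registered so SQL can call it directly.
# Because SQLite runs in-process, invoking it from a query is an
# ordinary function call: no network round trip, no serialization.
def score(weight: float, age_days: float) -> float:
    return weight * math.exp(-age_days / 30.0)

conn.create_function("score", 2, score)

conn.execute("CREATE TABLE items (name TEXT, weight REAL, age_days REAL)")
conn.executemany("INSERT INTO items VALUES (?, ?, ?)",
                 [("a", 1.0, 10.0), ("b", 2.0, 60.0)])

for name, s in conn.execute(
        "SELECT name, score(weight, age_days) FROM items ORDER BY 2 DESC"):
    print(name, round(s, 3))
```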
In my experience, it is rare to have 100+ TPS applications. As for FAANG-scale infra: YAGNI!
If y'all don't want to be on pager duty, though, pay someone else to manage it.
First is an MVP stage, when you do it on a single VPS for $5/month + $1 for backups.
Second is multifold scaling stage, when you rent or even buy a couple of cabinets of bare metal hardware.
BTW, AWS Postgres does not support master-master replication, which makes no-downtime migration pretty hard. Just remember those small quirks that make vendor lock-in stronger.
I work in the Data Infrastructure group and am more than happy to answer any questions about it.
Postgres hot-hot with 2 servers.
For clients I always recommend PaaS.
Might consider researching using pg_bouncer later (but really not needed right now).
Pricing and support for vendor solutions are too bad for a small-scale app.
Besides this, I also run a self-hosted production-grade Elasticsearch cluster, which I manage via Ansible.
Both of these would be insanely expensive if I chose any of the managed solutions from aws or gcp.
For anything in the big clouds, also not anymore.
For everything else: "it depends", but hyperlocal and datacenter stuff is generally a mix of self-hosted and third-party-managed-self-hosted.
We are running MariaDB on a Debian VM in Azure and using both VM backups and mariabackup.
Works fine.
At work I would always go with a managed DB.
The main reason being that the Azure Postgres offering was not fit for (our) purpose.
We did start with the Azure offering, and our default is generally to use the cloud offering.