Just properly setting up and maintaining a small k8s cluster looks almost like a fulltime job.
I wonder what your CI and deployment workflows look like?
Any particular tools, practices, guidelines, architecture patterns to reduce pain & timewaste?
How do you CI => deploy/rollback stateless instances? How do you manage your DBs and backups? Etc.
k8s, ci, etc are all really useful and solve a lot of hard problems, but god are they a bitch to set up and then monitor, fix when something weird happens, secure from nasty people outside, patch when some update becomes necessary, resist the urge to “improve”, etc. This doesn’t include the time it takes to work these tools into your project; k8s, for example, really requires you to grok its model of how the stuff it runs is supposed to look. There’s a reason entire companies exist that simply offer managed versions of these services!
If you’re using fewer than like 10 boxes and don’t work with many people, you can get away with spooky arcane oldhat sysadmin deploy techniques (scp release to prod hosts, run deploy script, service or cron to keep running) and honestly it’s pretty fun to just run a command and poof you’re deployed. Then you can just wait until it starts making sense to use fancier tooling (more load, more sites/DCs, more services, more people/teams, etc). Monitoring is always good to start with, for other stuff a good litmus test is that price points for hosted k8s & ops stuff will start to make sense, you’ll be paying in time in any case. Who knows, maybe when that happens something new will pop up that makes even more sense. Until then, reducing the amount of stuff you need to worry about is always a good strategy.
The problems of a solo dev are very different than a dev on a team. Knowledge silos don't exist. Distributed expertise doesn't exist. There's no one to mentor, no shared vision to maintain, no inertia to combat.
I consult on big complicated team projects. I also manage multiple solo projects.
On solo projects, deployment is a script I run from my dev machine. I'm the only person who deploys; anything else would be solving a problem I don't have.
The only "CI" is running the tests before I run the deployment script. I'm the only one who needs to see the outcome of the tests. Anything more would be solving a problem I don't have.
Architecture is whatever makes the most sense to me personally -- which is often miles away from what I would recommend to a client, who needs something any new hire will recognize and be able to work with.
I pay a service to manage backups so I can walk away from a solo project for months and know it's ticking away.
The point is: solve problems you actually have. Don't try to run a "professional" operation by doing what big teams do all by yourself. Big teams have different problems.
Yes, you'll pay more for a Heroku Postgres instance than you would for a VPS on Digital Ocean with Postgres running on it, likewise you'll pay more for Heroku Dynos than another VPS to run your application server. On the other side though, your backup, rollback, deploy, and scaling strategies can all be "Heroku does that", and you can focus on the much more valuable job of building a business.
Here is what works for me:
CI: From the terminal, I run my tests and commit to git.
Deployments: rsync
Rollbacks: Never did one. If something breaks, I fix it and rsync the fix to production.
DB: MariaDB
k8s: I don't use it. Computers are very fast these days. A cheap single VPS will get you a long way.
Nightmare: Not really. I spend about 30 minutes per week on DevOps.
I think if you’re a solo founder you have to accept on day one you’re at a time disadvantage when it comes to the actual hours you’ll spend on moving the business forward vs admin, ops, legal stuff etc etc and so you have to really fight smart!
E.g. k8s? Definitely don’t need it. Complex CI/deployment workflow? Nope, just the minimal shell scripts to do clean deploys and rollbacks. Pay for an HA RDS so you never need to worry about DB replication etc.
If you just work with the minimal set of tools you’re comfortable with and relentlessly keep things simple it’s possible. I’ve built high availability stuff as a solo founder and still found (at least some...) time to work on sales and the business.
- CI on github actions
- Project management on post-its (short) and READMEs (long term)
- deployment is done by a github action trigger on the `main` branch
- Hosting is on AKS, GKE, or on-prem k3s on a Raspberry Pi. Want to restart a service? Just kill the pods.
- DevOps took about 2 days of work initially and is now shared by every project, at less than 30 minutes per project.
- Deploying a test cluster (or a test k3s node) is reproducible on whatever hardware / cloud provider you get; I often create a full-blown dummy cluster for running test scenarios
- Certificate renewal is automatic (cert-manager)
- DNS record creation to services/ingress is automatic (ExternalDNS)
- Authentication using OAuth/OIDC is also set up automatically on whatever identity provider you have
- Database backup is automatic
- Load Balancing is built-in
- Job scheduling is built-in (and -fuck- logs are much more accessible on a failed job container than in a custom cron + sh solution)
- Service discovery is not needed
- Monitoring and alerting is not built in, but cloud providers often have something for you
Note: this is highly effective because I already had significant K8s experience, so, if you're still learning about what Ingress-Controller to choose for your OIDC proxy, then don't go that route.
- Use NFS (EFS), put builds in a /releases folder
- Bash scripts for build, deploy, release
- Bash scripts to wrap apps (write pidfile, start, stop, etc)
- Cron those scripts
- Check crontabs into a repo along with all other config
- Cron a bash script to pull from that repo and update crontabs on every box (see the sketch after this list)
- Have a staging environment, deploy stuff there (esp DB migrations) and do smoke tests.
- Have some kind of monitoring (I have a slackbot that sends me messages when things break)
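A rough sketch of that crontab-sync script; the repo path and file layout are assumptions:

    #!/usr/bin/env bash
    # Pull the checked-in config repo and reinstall this box's crontab.
    set -euo pipefail

    REPO=/opt/ops-config
    cd "$REPO"
    git pull --ff-only

    # crontab(1) replaces the whole table, so each box runs exactly
    # what's checked in for its hostname.
    crontab "crontabs/$(hostname).cron"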
In general I follow the principle of proportional response, and n=2 (or 3 depending on how trivial the task is to automate).
Proportional response means you invest in automation / tooling proportional to the pain you have suffered. Burning a day to build a bulletproof script only makes sense to solve a big problem. For a smaller issue, maybe adding a debug line or writing a line somewhere that you can copy-paste later is good enough.
N=2 means you don’t solve stuff until the second time you need it. This stops you from burning hours building clueless complex garbage to solve unnecessary problems - at least by the time you are automating, you have solved it once or twice by hand and you know where you are going.
Elon’s 5 principles have been very useful to me as a solo dev:
1. Make requirements less stupid
2. Delete the part/requirement
3. Simplify/Improve the design
4. Accelerate cycle time
5. Automate
So in particular you might try 1-4 to see if you can magic problems away before you invest in building devops automation you will then need to maintain.
> CI
GitHub Actions
> deployments/rollbacks
Docker. Scaleway offers a container registry that's ridiculously cheap[1]. Deployments are infrequent and executed manually.
> DBs
Again, Scaleway's managed RDS[2].
Outside these, we have set up Grafana + Loki cloud[3] for monitoring and alerting. They have a generous free plan.
For product analytics that can be derived from our database we have a self-hosted instance of Metabase[4].
[1]: https://www.scaleway.com/en/container-registry/
[2]: https://www.scaleway.com/en/database/
[3]: https://grafana.com/
[4]: https://www.metabase.com/
P.S. We were a 1 person company when most of this was set up. We're 3 now; it works just as well.
It sounds like you have a great set of intuitions, and you're right, all that infrastructure is a nightmare to set up and manage.
Step 1: Question every requirement
Step 2: Drop everything that is not critical
Step 3: Profit!
Git hooks and Makefiles are great!
I have an HP workstation I bought from eBay under my desk that has 32GB of RAM, 2x 12 core CPUs and a static IP address. It's probably 8 years old, but fast enough to serve whatever we need for a long while.
The machine (Old Blue) hosts our Git repos, web app, database and integration service (a git hook that calls 'make publish').
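A hypothetical post-receive hook for that kind of box; all paths are placeholders and the real hook may differ:

    #!/usr/bin/env bash
    # On push to the bare repo, check the code out into a work tree and publish.
    set -euo pipefail

    GIT_DIR=/srv/git/webapp.git
    WORK_TREE=/srv/webapp

    git --git-dir="$GIT_DIR" --work-tree="$WORK_TREE" checkout -f master
    cd "$WORK_TREE"
    make publish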
We're not serving Google scale traffic, so we don't need Google scale infrastructure.
Keep it simple whenever possible and don't let the modern stack complexity creep in until you absolutely need it.
Even going to the cloud, when you've done it 10 times and know how, is way more work than you need when just starting out.
Take on those costs and complexities only when your traffic requires it, and you may just find out that you never have to pay rent for compute.
If we're talking a plain SaaS type deal, I'd keep it simple: Elastic Beanstalk, or a Heroku or Render.com-like setup until you grow to the point of hiring a team. If it's just a basic SaaS, I don't see how a 1-man team could really outgrow this setup. I've seen 100-person teams using Heroku.
K8s is just way too much work. Even CloudFormation is too much for my tiny show.
Use the automated backups setup by your host for your db. If you need to roll back, just redeploy your previous git hash. I typically use GitHub actions to deploy, so rolling back is just a matter of force pushing the prod branch to the sha I want to deploy
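That rollback amounts to something like the following; the branch name and the known-good SHA are placeholders:

    # Point the prod branch at a previously deployed commit and let the normal
    # GitHub Actions deploy workflow run again.
    GOOD_SHA=$(git rev-parse HEAD~1)   # or any known-good commit
    git push --force origin "$GOOD_SHA:prod"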
Skip micro services, they are too much work for a small time thing, and don't really provide much benefit.
I don't have dev ops experience (I still don't know much) so a lot of what I did was to exhaust the limits of the setup before implementing any new infrastructure, sort of learning as I go along, with many 'oh, so that's why they do this' moments. I've read about k8s but still can't see when I would need it.
A lot of what saved effort for me was to use off-the-shelf components whenever I can. I was quoted $10~20k to build a Django website for user & API management to connect to my back-end, but it was much easier to use the Azure API Management service (~$400/month), which also came with a simple, reasonable-looking web front-end that I was able to launch within a week.
Before moving things to cloud (Digital Ocean), I've also exhausted what I was able to do with simple NAS servers (overall I processed about 200TB of raw data with ~1TB/day on-going using what I call a 'poor man's HPC' setup) - luckily I'm based out of Hong Kong so had access to fiber-optic internet at home.
DBs run side by side and are simply backed up regularly with whatever backup solution the VPS offers.
That's it.
Out of school I put way more effort into building my infrastructure, but in reality a good VPS, something like Cloudflare, and some app-based caching can run millions of daily views over 20+ different apps without issues.
Edit:// it's mostly Rails apps I host, btw. For my WordPress installs I use dedicated VPSes because those scale horribly.
For:
* Database: PostgreSQL installed through apt in the same server: https://github.com/sirodoht/mataroa/blob/master/docs/server-...
* Backups: MinIO-upload to an S3-compatible object storage: https://github.com/sirodoht/mataroa/blob/master/backup-datab...
* CI: Github Actions + sr.ht builds: https://github.com/sirodoht/mataroa/blob/master/.github/work... + https://github.com/sirodoht/mataroa/blob/master/.build.yml
* CD: (not exactly CD but...) ssh + git pull + uWSGI reload: https://github.com/sirodoht/mataroa/blob/master/deploy.sh
* Rollbacks: git revert HEAD + ./deploy.sh
* Architecture pattern: stick to the monolith; avoid deploying another service at all costs. E.g. we need to send multiple emails? Not Celery, since that would mean hosting Redis/RabbitMQ. We already have a database, so let's use that. We can also use Django management commands and cron: https://github.com/sirodoht/mataroa/blob/5bb46e05524d99c346c... + https://github.com/sirodoht/mataroa/blob/master/main/managem...
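The cron + management command piece of that last item could look roughly like this crontab entry; the command name and paths here are made up for illustration, the real ones are in the linked repo:

    # Every five minutes, a hypothetical management command drains an outbox
    # table in the existing database; no Celery, no broker.
    */5 * * * * cd /srv/app && ./venv/bin/python manage.py send_pending_emails >> /var/log/app-mail.log 2>&1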
I personally manage the CI/CD infra as well as the data infrastructure for my startup.
Here is my setup:
1. Self-hosted Drone.io instance connected to github(the instance is behind a firewall and can only be accessed via a self-hosted VPN. Github IPs are whitelisted so that auto build/deployment can be initiated via github hooks)
I have a drone.io config that helps me manage versioning, deployment to stage/production, running tests, doing rollbacks, etc.
For example: merging to master on GitHub auto-builds the master branch on Drone and deploys it to staging, along with pulling the latest DB backup from S3 and loading it into stage for testing.
2. Self-hosted production and staging server that hosts my web/django app. It has a local postgres db and is encrypted and backed up on S3 every hour.
3. Self-hosted Elasticsearch cluster using Ansible playbooks with daily snapshot backups to S3.
beyond this, I self-host 3 VPNs to protect my entire infrastructure. read my old HN comment about it: https://news.ycombinator.com/item?id=28671753
All this is one time infrastructure setup and it all has been running smoothly for more than a year without any hiccups.
I would be happy to help you setup a similar setup if you want. hit me up at vikash@quantale.io
I've been running a Python-based, highly custom web store solo since 2007, and it supports multiple people doing fulfillment. I host on Opalstack to outsource patching, email config, database maintenance, etc. I run directly in a Git repo (actually Hg, it's that old) and releases are "git pull && ./restart.sh". Rollbacks are "git checkout ...".
I've had to migrate/rebuild the VM about every 5 years. Tech changes enough in that time that no automation will still work unmodified. So I just keep good notes about what I did last time, and figure out what the new equivalents are when I finally have to do it again (updating the notes, of course). Database and Conda are easy to port. It's usually DNS and email integrations that are a pain.
As others have said, KISS is key. Industry DevOps is for a work setting with a decent-sized team, where you can afford the overhead of maintaining it all in order to make the overall team more efficient.
Use Heroku (or any other PaaS) and throw in some serverless stuff here and there.
Again DO NOT geek out on the tech. Geek out on getting customers.
In short, I lean heavily on serverless and Github Actions.
Each story covers a different pattern in my stack and I apply these principles:
- I only want to pay for what I use
- I don’t have a lot of time available for learning or building
- I don’t have time for maintenance activities
- I’m not a good UI designer or front end engineer
The DevOps side of things is trivial in the beginning and takes very little time compared to actual development.
Ubuntu server, Nginx, Mysql or Postgres, PHP, Go, Redis. Configured reasonably, it's a ridiculously reliable stack where things very rarely go wrong.
I prefer DigitalOcean these days. Takes <15 minutes to configure a base new setup - a template - from scratch, make a few adjustments and double check everything. From there I can pop up a lot of servers as needed. I usually tweak things based on the project, although that doesn't take a ton of time.
And that's it. Back-ups on eg a database focused droplet are automated by DigitalOcean. Occasionally I have specialized back-up needs, and I'll do a bit of custom work for that. Most of this could be offloaded by using DigitalOcean's database service, I just prefer to limit cost by not.
Under no circumstances would I use Kubernetes for a smaller to medium size service.
Both have a similar setup:
- 1 droplet with docker pre-installed from DigitalOcean
- clone directly the repo from github
- together with the code, I have a folder with a bunch of docker images (caddy, mysql, php, redis, etc.) that I can easily spin up.
- for any release, I manually ssh, git pull and run the migrations (if any) and manually rebuild any docker image if needed
- I have daily jobs that dump and zip the entire DB to S3 (see the sketch after this list)
- if I have some deployment that I know will break any functionality during deployment, I warn the users before and accept that downtime.
- never had to handle a "hard" rollback till now.
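A sketch of what that daily dump job might look like; the container name, database, credentials handling and bucket are placeholders:

    #!/usr/bin/env bash
    # Dump the database from inside the mysql container, compress, ship to S3.
    set -euo pipefail

    STAMP=$(date +%F)
    DB=myapp

    # The mysql container already has the root password in its environment.
    docker exec mysql sh -c 'exec mysqldump -uroot -p"$MYSQL_ROOT_PASSWORD" --single-transaction '"$DB" \
      | gzip > "/tmp/${DB}-${STAMP}.sql.gz"

    aws s3 cp "/tmp/${DB}-${STAMP}.sql.gz" "s3://my-backups/${DB}/"
    rm "/tmp/${DB}-${STAMP}.sql.gz"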
I've planned to change this setup for a while, but until now haven't found any reason that justifies the effort. I spend $20 ($10 per droplet) plus a few cents on S3 per month with them.
I’ve been running my small company for 4 years so far. It’s a podcast search engine & api.
Product & infra & DevOps & all kinds of manual processes have evolved over the past 4 years. Things were added / improved on demand.
4 years ago: 3 digital ocean instances + ssh to manually deploy code. No users. So no devops.
Now:
- ~20 EC2 instances, provisioned via ansible
- to deploy code: run a bash script on my MacBook, which really just runs Ansible. The script can also roll back to any specific git sha
- use rollbar + Datadog + PagerDuty for monitoring & alerting
- 3 Postgres database EC2 instances (1 master, 2 slaves). Daily cron job to automatically upload db dump to aws s3 and google cloud storage
- no CI. No Docker / Kubernetes
- a few interesting devops events are sent to Slack, so I’m aware of what’s going on for the entire system on my phone. E.g. when a db dump was successfully uploaded to s3, when a Django request is slower than 5 seconds, when an important cron job fails…
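A sketch of that last item, the Slack notifications for devops events; the webhook URL, bucket and dump path are placeholders:

    #!/usr/bin/env bash
    # Upload the nightly dump, then post a one-liner to a Slack incoming webhook.
    set -euo pipefail

    WEBHOOK="https://hooks.slack.com/services/XXX/YYY/ZZZ"

    notify() {
      curl -fsS -X POST -H 'Content-type: application/json' \
        --data "{\"text\": \"$1\"}" "$WEBHOOK" > /dev/null
    }

    if aws s3 cp "/backups/db-$(date +%F).dump" s3://my-backups/; then
      notify "db dump uploaded to s3"
    else
      notify "db dump upload FAILED"
    fi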
Not very sophisticated, right? But in practice, we rarely have outages and spend very little time on devops.
The most annoying thing was to upgrade Postgres to a new version. We had to prepare for an entire week and minimize downtime (in practice, disable db write for < 1 minute or so).
I’ve got a couple of old blog posts on engineering for a solo founder:
- https://www.listennotes.com/blog/the-boring-technology-behin...
- https://www.listennotes.com/blog/good-enough-engineering-to-...
Use tech that you understand and minimize the layers of complexity between deploying, starting and troubleshooting an instance.
For me, I like digital ocean droplet(s). Stupid simple IAM. SSH scripts for CI deployments. Everything sits behind a Cloudflare cache.
The backend sends key events like restarts and fatal errors into a dedicated Slack channel. It's my confirmation a deployment worked and first alert if something crashed.
My business partner and I are currently running a $1m/arr business with this setup on a single digital ocean instance, Python3 and Nuxtjs.
What I have now is terraform on Github which handles Networking (VPC, subnets), ECS/ECR, ALB for my backend and S3/CloudFront for frontend.
Everything is on AWS because I'm familiar with it, so it's faster. And I compensate for costs with AWS Activate + Free Tier. This combination is usually enough to understand whether the project will get traction or not.
Actually, spending a little time on IaC was one of the best recent investments. But I guess it holds more value when you have a couple of projects, are still exploring, and these tasks get repetitive.
I've done this since writing my first CGI app in 2003. And it hasn't changed other than the commands I call. Back then it was manual through GUIs (visual source safe and FTP) and now it is git and scp, but 80% of the process hasn't changed in almost 2 decades.
If it breaks (which it shouldn't, but inevitably does), I just git checkout the previous commit and scp again, and try and fix the issue.
If I have a "high availability" product (which I hate doing), I will release onto a clone of the production server, then test it, swap to that clone, deploy to the others, swap back and kill the clone. It is a much more involved process that I really don't like doing, so I've stopped making high availability applications.
But that is web, so it's super simple. Also to avoid issues regarding data schemas, I always try to make sure my changes are additive. And on startup, my web server checks the database version then runs any upgrade scripts if needed. If the deploy fails, the db changes stay, and the code just gets updated. This is sloppy, but it's simple.
When a real schema change needs to happen, I abandon the project (just kidding). I usually do it in multiple updates. New table with the schema changes, upgrade script transfers the data, and code to point to new table. Manually verify that all of the data copied. Backup the old table and drop it. Then another update to rename the new table to the old name (if needed) and it's done with minimal disruption in service.
Envoyer and Forge for PHP stuff.
Render/Heroku for node.js stuff. Netlify for static stuff.
For the more complex bits I have, I use Pulumi to manage everything through IaC. It's still complex but at least it's robust and I can sense-check changes to infra through a PR to myself.
I create my small/bootstrap projects the following way:
1- create a free tier AWS account - can work with any VPS/server really, but with AWS you can get it 100% free :)
2- create some Ansible provision/setup/deploy scripts
3- create some bash scripts to wrap this all
Create the Ansible scripts:
1- Provision script, consisting of 1 EC2 instance, 1 RDS and the security groups. Store the IPs/addresses of the newly created instances.
2- Setup script, basic instance config, packages, app/project repo, you name it
3- Deploy script, to run every time I want to update the app, mainly just doing a git pull on the instance and some database backup/migrations (although on RDS backups aren't always necessary)
I get these Ansible scripts wrapped into basic bash scripts, provide my AWS credentials via the AWS CLI to keep them safe, a few extra creds with Ansible Vault, SSH keys, publish to a GitHub [private] repo and I'm all set.
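A sketch of what such a wrapper might look like; the inventory, playbook and vault file names are placeholders:

    #!/usr/bin/env bash
    # Thin wrapper around the deploy playbook: credentials stay in the AWS CLI
    # profile and the Ansible vault, nothing secret lives in the repo.
    set -euo pipefail

    export AWS_PROFILE=${AWS_PROFILE:-personal}

    ansible-playbook \
      -i inventory/production \
      --vault-password-file ~/.vault_pass.txt \
      deploy.yml "$@"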
It took me a couple of days to get fully operational and I was learning Ansible at the same time so it can really be done with basic features. Now it's done, I can reuse the same skeleton for every new project in a couple of hours!
I find this solution extremely resilient. That's basically free hosting for personal projects. Every year, I just need to run the script again when the free tier expires, and I can change hosting provider anytime or upgrade the instance type if I decide to release a project publicly. A small extra Ansible task to write for migrating the data and that's it!
I keep it super simple. I have a git repo where I directly commit into master. When I want to take the changes live I ssh onto my VPS, I git pull the changes and restart the website service.
I use an SQLite DB that I automatically copy (backup) to my local machine daily via the Windows equivalent of a cron job. Once it's on my local machine it's also backed up to other places as part of my local backup process.
I run my tests manually when I think I've made significant changes.
Otherwise I just react to errors ASAP (I get an email notification from sentry when a new error occurs)
A lot of this doesn't work with a bigger team of course.
It's a balance between risk and time investment. The simpler your dev ops stuff, the higher your risk of introducing bugs into production. But it saves you a lot of time not dealing with that stuff.
We're building [1] something similar to solve this problem, but just on the infrastructure end. I really believe in the power of templates. In our future release, we're planning on launching an "infrastructure" catalog of commonly used setups so that you can just get the foundation in place. Why re-invent the wheel every time when you can just spin up a boilerplate that gets you 90% of the way there?
It isn't that hard, for someone with decent experience, to setup TeamCity/Jenkins/Octopus/Whatever but why spend the time? If you change something, you then have to go and spend time changing the automatic deployment.
As someone else said, Devops is about velocity with quality, if velocity is OK and quality won't improve with tooling, don't bother.
    services:
      certbot
      haproxy
      db
      api
      client

And gets deployed by a GitLab config file that looks like this:

    build:
      stage: build
      script:
        - docker-compose -f docker-compose.yml -f docker-compose.release.yml up -d --remove-orphans --build
Unless you're doing something very slow / processor-intensive, you'll probably never even need an autoscaler before you can afford to hire an expert to do it for you. You'd be surprised how far you can get with a small VM running your whole stack.
I am not sure how it would work on a larger scale, but for my use is perfect.
I’m a solo-founder, with many years working in IT, and I focused on DevOps for part of it; I know what best practice looks like and it would be easy to fall down that rabbit hole doing an unnecessarily complex buildout. Currently I’m doing a small private beta, I’m avoiding 99% of standard practice. I’ve got a single EC2 node running Redis as the only datastore and NodeJS + Nginx + certbot, a cron to do backups to S3. No CI (there are no other devs to integrate code with) I run all tests locally and push/rollback with rsync. All code, assets, and server config (except creds) are in a monorepo.
If the server goes offline, I will have a little downtime, that’s fine. If I run out of memory for Redis (not likely to happen soon), I’ll change to a different datastore or scale up the node. If I lose data, I can restore from S3, and additionally the architecture is such that clients will re-push their latest changes.
Do the bare minimum to support the business, stick with what you know, outsource what you can, and properly value your time.
Dedicated servers, specced to provide lots of capacity for spikes. A 20-node k8s cluster could fit on 2 beefy dedicated servers for about the same cost. Decreased infrastructure redundancy but massively increased operational stability through simplicity.
Everything runs in a docker-compose project: one for the API, another for the everything else monolith. I've worked for a couple of small companies that ran with docker-compose, so have a good sense of the weaknesses and footguns (breaking your firewall, log rotation, handling secrets, etc).
CI is running `make test` on my dev machine. Deployment is `git pull && docker-compose up --build`. Everything sits behind haproxy or nginx which is set to hold and retry requests while the backend is down, so there aren't any failed requests in the few seconds a deployment takes, just increased latency. I only deploy to the API once or so per week, that stability reduces headaches.
DB backups are done with cron: every hour a pgdump is encrypted then uploaded to backblaze. Customer subscription data is mirrored from stripe anyway so an out of date DB backup isn't the end of the world. Error if the backup fails after a few retries.
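That hourly job could be sketched like this, assuming B2's S3-compatible endpoint and a symmetric GPG passphrase; the database name, passphrase file, bucket and endpoint are placeholders:

    #!/usr/bin/env bash
    # pg_dump, encrypt, push to Backblaze via its S3-compatible API.
    set -euo pipefail

    STAMP=$(date +%FT%H%M)
    OUT="/tmp/db-${STAMP}.dump.gpg"

    pg_dump -Fc myapp \
      | gpg --batch --pinentry-mode loopback --symmetric \
            --passphrase-file /root/.backup_pass -o "$OUT"

    aws s3 cp "$OUT" "s3://my-backups/hourly/" \
      --endpoint-url https://s3.us-west-002.backblazeb2.com
    rm "$OUT"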
Sentry on everything for error alerting. All logs go into New Relic.
Don't:
- Manage your own DBs for production. I used to manage my own, but now I use IaaS for it. It's worth the extra cost even if your budget is low, since it lowers your risk and lets you move faster. AWS Aurora is awesome but expensive. There's a SaaS/IaaS out there for most DBs, and they handle the backups!
- Use Kubernetes. k8s is awesome, I love it, but it is a beast and probably overkill for your solo ops team that is also multitasking.
- Manually SSH / rsync / scp directly onto prod.
Do:
- Version control (Git) everything
- Use containers where it makes sense (I use Docker when I'm not using Serverless). It keeps your dev and prod in sync and helps with immutable deployments (see later points)
- Use Infrastructure as Code
- Use Github actions to do your deploys
- Use immutable deployments. More time to setup but makes rollbacks easier and if you never hot patch production you can't create a unique state that can't be restored.
Finally, (and possibly controversially): use serverless where you can (I personally use AWS Lambda, S3, and Cloudfront).
My personal flow:
- All my infrastructure is in a parameterized CloudFormation template.
- The template is checked into Git with the service code.
- After CI runs the tests, CD builds my serverless functions and uploads the zip.
- Then it runs the cloudformation template.
The same flow also works for Docker except "uploads the zip" becomes "pushes the container"
Using this setup I can tear down and build up my setup in minutes, making it easy (and cheaper) to separate dev and production.
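A sketch of the CD step under those assumptions; the bucket, stack, template and parameter names are placeholders:

    #!/usr/bin/env bash
    # Zip the function code, upload the artifact, then deploy the parameterized template.
    set -euo pipefail

    BUCKET=my-artifacts-bucket
    STACK=my-service
    VERSION=$(git rev-parse --short HEAD)

    (cd src && zip -qr "../function-${VERSION}.zip" .)
    aws s3 cp "function-${VERSION}.zip" "s3://${BUCKET}/"

    aws cloudformation deploy \
      --template-file template.yaml \
      --stack-name "$STACK" \
      --capabilities CAPABILITY_IAM \
      --parameter-overrides CodeBucket="$BUCKET" CodeKey="function-${VERSION}.zip"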
As for DBs / engines and such, I try to go for hosted services as long as they're not stupidly priced. I'm currently using Elastic cloud. If I need a relational DB I'll probably deploy Postgres on AWS and install some scripts to back up the data to an S3 bucket every day or so.
I chose NOT to use k8s or Docker because, like you said, it's basically a fulltime job to maintain. I don't think it's needed at all unless you're a large organization.
So basically: a mixture of roll-your-own, streamlining platforms, and hosted services. Do whatever works and is easiest, and don't worry so much about having ALL the goodies.
Also, I review how much time each product / feature consumed each year and then decide what to deprecate. Each April I then have my own little party where I ceremoniously turn off last year's time wasters.
I'd say it all boils down to "know your customer lifetime value"
Here is the list of DevOps flows I've built:
* One project is a website. I have a pipeline, triggered when something is merged to master, that builds Docker images, pushes them to a Dokku instance and deploys them.
* One project is a JS library. I have a pipeline, triggered when a git tag is created, that packages the library and uploads it to npm. I use Netlify to host the documentation website of it.
* For this same project, I'm planning to have a pipeline triggered for each PR to run the linter and tests.
I work as a DevOps/MLOps, and I'm quite comfortable with CI tools in general. I hate doing things manually and I love automating, so it comes quite naturally for me to do it. Once it's set up, it rarely needs to be changed.
GitHub Actions -> docker build -> docker push to GitHub package repo,
then on the VM a 2-line script to docker pull && docker-compose up -d.
Database all by hand.
Extremely easy: I just push, wait for the build, and restart.
New services like render.com make it really easy to get robust, cheap hosting up. Much cheaper than Heroku.
For simple monoliths it's a non-CI method plus scp, for infrastructure-only it's SaltStack and Terraform. For something where the DevOps part is the deliverable, it can go wild pretty fast: GitLab CI with various CI steps (static analysis, building, unit tests, integration tests, browser tests, blob storage of the result), then CD steps depend on what it is that was built; docker images just get pushed to a registry and another async job listens for pushes and does the deploy in Fargate or Kubernetes. For some jobs with larger needs you get into ArgoCD and Kubernetes resource deployments that way.
Essentially it depends on the zoom level:
Zoomed in to the max: just a build.sh or makefile, optionally called by a CI or IDE of your choice so the building is exactly the same no matter what invoked it.
Zoomed out a little: the results of the build need to end up somewhere, sometimes a step in between like a package registry, docker image registry etc.
Zoomed out to level 3: once the deployment gets actually done, the runtime itself is responsible for checking for database migrations, locks, maintenance mode, so that has to be part of the startup. If it's a multi-instance workload it depends on database migrations if there has to be downtime.
Zoomed out to level 4: if it also needs to deploy 'other things', like rules and records to Cloudflare, resources in AWS, resources in K8S and even update some stuff in buckets or DynamoDB, that requires some orchestration and DSL like Terraform, SaltStack, Ansible and application-specific CD like ArgoCD or Flux or even just the aws cli tools to trigger a refresh.
Customer decides how much money they want to spend and that scopes what zoom level they get. Usually depends on how many developers or ops people they have themselves and how involved they want to be and what continuity or guarantees they are looking for. The only thing that is not optional is that everything has to be in Git. If a customer doesn't want that, they cannot be my customer. I'm done living in the 90's.
This was also how things began at a previous startup I worked at until we grew and we could hire more resources to fix things.
I have a few old boxes that run VMs. Some of these VMs make up a virtual k8s cluster, but almost everything I need runs on bare VMs rather than on k8s.
When I find something particularly annoying to administer, I move it off a bare VM and onto k8s.
The main thing this gives me is a migration path off of "a bunch of custom shell scripts" toward something I believe will be useful in the long run.
A bunch of shell scripts and rsync will work for a long time until they don't. And when they stop working you'll be in for some pain.
I didn’t do CI/CD. I just reviewed my own code the next day, improved it and put it live. As the only dev I knew more or less how my code worked and what risk factors to prepare for in each deploy, and serious, business-impacting production issues were rare.
A simple stack of Meteor on Galaxy + Mongo on Atlas meant all my tools integrated well together and a whole lot of devops things were easy enough that I rarely had to think about them after they were set up once.
I can think of many cases where this would not work but it worked well for my context.
I have multiple projects, none of them making money currently, so $100/month for each basic app is not good.
I switched to using Dokku on Vultr, $12 a month. You can easily create Postgres databases and link them to Docker apps. I haven't bothered to setup CD yet but it looks like it should be simple, for now I just push to it when I want to deploy. Liking this setup so far.
For frontend I use Netlify, I use their redirects to proxy the backend.
SVN checkout development, meld into testing. Copy the production DB into testing. Run a DB upgrade script in testing. Test the new function for a day. If all is OK, stop the production HAProxy at 3am, run a production DB backup, check out testing onto the production server, mount the new dir into the NGINX path, upgrade and start the DB, start the production HAProxy.
Except for the meld and testing part, this is all automated.
A rollback on the app servers is pretty simple, just remount to the previous location. Even though there is a backup of the DB, a rollback there is not really feasible.
Everything is running in LXC containers, which I treat as cattle. Creating new ones is either automatic based on time/demand/failover or a few clicks in a custom web interface. HAProxy automatically picks them all up via DNS.
SVN and production DB are always backed up nightly and copied to a S3 compatible storage. Everything older than 6 months gets deleted. All the free space on the servers is globbed up with glusterfs to give a decent amount of fast,free storage.
Using Bash as the scripting language, there is literally no problem in devops which hasn't been solved with Bash, and the solution is somewhere on the internet.
1) Server is high performance native C++ application running on rented native hardware on Linux. It processes thousands of requests/s in sustained manner hence no real need to use any of that k8s / docker / etc. There is also separate rented standby server.
2) I maintain a reusable single script that can rebuild complete system from the scratch with a single command. I periodically test it on local VM. It installs all needed compilers, tools, PostgreSQL, restores database from the latest backup, checks out from VCS and builds that C++ server, registers it as a daemon and refreshes Javascript Frontend.
3) I also wrote a small server that subscribes to webhooks from the VCS and can build artifacts. I only use it to pull and build the Javascript part. It could do the main server as well, but being super cautious I trigger that step manually, fingers crossed.
4) For the DB I use automatic scheduled backups.
All in all, after I've debugged and tested a new version locally, the release takes seconds. The script was written years ago and does not require much maintenance other than updating artifacts to new versions (replace apt-get install XXX-31 with apt-get install XXX-32) and registering and building the ZZZ server instead of the YYY server. Compared to the insanity I saw in some orgs, my setups are a piece of cake.
Provose is an open source project written in pure Terraform, so all of the high level abstractions are computed locally, and in general do not raise your AWS bill.
I started working on Northflank to make it more simple for developers either solo or in a team to manage and automate DevOps complexity away.
We support CI for GitHub, Bitbucket & Gitlab (SaaS and self-hosted) with either Dockerfiles or Buildpacks. We have a robust and feature rich platform via UI, API and CLI. Out of the box you get end-to-end DevOps: CI/CD/CD, Horizontal and vertical scaling, persistent workloads with managed (MongoDB, Redis, Postgres, MySQL, Minio), backups, restores, real-time observability and metrics, DNS, TLS, mTLS, Teams & RBAC and more…
Let me know what you think of our offering + site!
Platform: https://northflank.com
Application Documentation: https://northflank.com/docs/v1/application/overview
API Documentation: https://northflank.com/docs/v1/api/introduction
On that it was a Django product with deploy scripts for staging and production, i.e. `./deploy_production` ran:
#!/bin/bash
ssh web@[production ip] << 'ENDSSH'
cd /home/web/
./pg.sh
ENDSSH
Where `pg.sh` pulled from `master`, ran static and migration management commands, and restarted the web server.

Last summer I moved away from the spookies and now do all deployment using Github Actions.
Commits to either a feature branch or `develop` run tests and all necessary deployment steps to a staging instance (my stuff so far does not require k8s), and all containers are put up using docker compose. Commits to `main` (or `master`, depending on the project) do the same against production.
I blended a few guides written by Michael Herman to build my GA-based CI, and it took several weeks to work on the scripts and learn Github Actions. But it was time absolutely worth investing because my DevOps is modern enough now.
No more bare metal stuff. No more scripts. Nice visual display of deployment success, "free" VMs for the deployment and CI UI integrated alongside my issues and code all on Github.
Importantly, I still use git flow. I tried moving to trunk-based, but the macros inherent in git flow (including the PyCharm plugin) make this production/staging vs. master/develop (or feature/*) mapping work very well (I'm not making PRs to myself).
On every stage (dev, test, prod) I can deploy it automatically up to two times (blue and green) by running a simple shell script. So when I update the application I just deploy it another time on prod, test it one last time by adding the IP of the load balancer in my local hosts file and if I'm satisfied with the result switching the DNS entry to the new version using a weighted DNS record with the ability to switch back until I shut down the old version.
Doing anything continuous doesn't feel worth it in such a small setup with one update every 1-3 months.
What I like most about the approach is that I'm free to change any aspect in the main template without any risk to break something in the live application. Only changes to the elements shared between versions need to be handled carefully.
The main template includes VPC, network, Fargate service, load balancer, firewall, KMS, DNS records, access rights, database tables, queues, monitoring metrics, email alerts.
It excludes everything that is shared from update to update, which is defined separately and just referenced from the main template, such as some database tables, persistent storage, container registry, and user groups (only the groups, not the rights assigned to them).
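The DNS switch could be sketched with the CLI like this, though the setup above drives it through the CloudFormation template; the zone ID, record names and targets are placeholders:

    #!/usr/bin/env bash
    # Shift all weight to the "green" record and zero out "blue" on a weighted entry.
    set -euo pipefail

    aws route53 change-resource-record-sets --hosted-zone-id Z0000000EXAMPLE --change-batch '{
      "Changes": [
        {"Action": "UPSERT", "ResourceRecordSet": {
          "Name": "app.example.com", "Type": "CNAME", "SetIdentifier": "green",
          "Weight": 100, "TTL": 60,
          "ResourceRecords": [{"Value": "green-lb.example.com"}]}},
        {"Action": "UPSERT", "ResourceRecordSet": {
          "Name": "app.example.com", "Type": "CNAME", "SetIdentifier": "blue",
          "Weight": 0, "TTL": 60,
          "ResourceRecords": [{"Value": "blue-lb.example.com"}]}}
      ]
    }'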
If you aren’t already comfortable with k8s I wouldn’t use it; there is a learning curve there for sure and it isn't necessary.
I wouldn’t even worry about containers at all unless you are planning on using a managed container service like Elastic Beanstalk.
My CI pipeline is almost nothing, just run tests on main and dev commits. I have no deploy pipeline, I choose to actively deploy.
My deployment workflow is a powershell script since I’m on Azure, if it was AWS and I was using plain EC2 instead of a container service I would probably use ansible to avoid having to repeat commands 2-3 times when I deploy something behind a load balancer, but it would be super minimal.
DB backups are super easy to schedule with RDS or similar offerings. For rollback with database changes I would schedule downtime, let your customers know, replace your page with a maintenance banner, do upgrade, test upgrade then repoint site to your running instance.
Overall I echo what many have said, use familiar tools as much as you can, don’t worry about whether what you are doing will scale to even 2-3 people if it costs significant time. Use managed services where possible to save you time, usually the cost isn’t high vs doing it yourself.
Basically, building a Docker image and serving it using Docker Compose. Of course you'll need NGINX as a reverse proxy running on the server.
If I did it again I'd use a service like Render. Not worth managing this myself.
At the start of this year I joined hands with Chris on https://hatchbox.io to work on cost effective deployment service. We are putting together all the deployment best practices we've learned over the years into this product in a cost effective way.
As for rollback, I don't really use rollback. I have a roll-forward strategy. My DB deployments are migrations and so long as there's no data-loss caused by a migration (my migrations are always backwards compatible with the previous version) there's no need to roll back. Azure brings point-in-time restore which can be triggered if necessary.
I think the key is like with anything. There's a learning curve on your toolset. Once you've overcome that hurdle and you know to set up automation for everything from the outset. You build your pipeline shell. You never do anything manually. If you need infrastructure of any form, it's always code-first, included in source control and done the right way. Never do anything manually and rely on your automation to carry you. It seems labour intensive up front, but once you're there, it's sustainable.
It takes discipline not to cut corners. The minute you start to cut corners and get lazy is the path leading to your doom.
We are all in on AWS.
0. Tech stack: NodeJS, React, RDS Aurora, Redis, S3, Route 53, SES, Lambda and a few more.
1. 3 EC2 machines behind an ELB. Deployment is scp + cd.
2. Aurora takes care of everything around the DB for us: backups, read/write replicas, etc.
3. No CI; our development cycle isn't rapid enough to warrant investing in CI.
4. We recently moved some long-running tasks to Lambda. There's definitely big value here so we'll invest further in it.
5. Also, recently, we experimented with an internal service on "ALB + AWS Lambda"; it's a sweet combination so we'll invest further.
6. Frontend is hosted on AWS Amplify. It's an underrated, nifty service; highly recommend it.
7. Datadog is for monitoring. It's quite good but a bit expensive.
I can tell you from my experience that CI, auto-rollback, etc. are overrated for smallish teams given their development velocity. Even if you have to do one or two deployments per week you can do them manually; no need to wade into CI territory. And unless you have prior experience, steer clear of k8s.
There are a few rough edges you face once and then forget about, but the resource impact is minimal and it was fairly simple. I have another project that is using hacky scripts and it's a pain when something is off or I need to run something else on top of the machine.
Some stuff is not terribly maintained but everything works.
If you're small, it's definitely not a full time job. I didn't pick k8s because I initially deployed on a single machine and didn't want overhead. Things kept working afterward and I'm reconsidering whether I need k8s at all.
Rollbacks are just deploying an old version + optional down migrations if needed. I try to minimise them and fix the code and redeploy anyway.
For migrations, I guess it's very application specific, but most tools in most languages are fine these days. I use typeorm, a node.js library.
There are some practices you can follow that you probably picked up working at a normal company. E.g. I don't break the DB structure immediately; I deprecate columns and then delete them after a while.
1. common (Google leap second smearing, increase max open connections, misc security and performance changes)
2. infra (install Nginx, install RabbitMQ, install Postgres, install SSDB)
3. product (run common stuff, then infra stuff, then any stuff specific to a single server)
Sometimes it breaks after upgrading Linux, but it's easy to fix. For CI, I use GitHub Actions. Deployments are mostly built with GH Actions. If a deployment fails it just keeps the old server code, so there is no need for rollbacks; I just re-deploy with working code. If a deployment passes but there's still a prod bug, I don't roll back; I just commit a revert and deploy that, or commit a fix. LetsEncrypt renewal, log aggregation, and DB backups are all custom, with a combination of RabbitMQ and cron + shell scripts.
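For comparison, a generic stand-in for the renewal piece, assuming certbot + nginx; the actual setup above is custom and may differ:

    # Nightly cron entry: renew any certificates close to expiry, then reload nginx.
    17 3 * * * certbot renew --quiet --deploy-hook "systemctl reload nginx"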
I miss out on autoscaling, so someday I'll invest the time to migrate web servers to hosted k8s. With limited time as a solo-founder I have to prioritize for impact meaning incremental improvements usually never get done.
Not having k8s and autoscaling isn't causing any problems, won't increase revenue, and won't save any meaningful expenses so it's currently only an incremental improvement for me.
Finally, I have a small Slack community for bootstrapped founders. Let me know if you want to chat in there about specifics.
Currently going with a container image as the minimal deployable unit that gets put on top of a clean, up-to-date OS. For me that's created with a Dockerfile using Alpine image variants. In a way I could see someone's rsync as an OK equivalent, but I'd do versioned, symlinked directories so I could easily roll back if I went with this method. Something like update-alternatives or UIUC Encap/Epkg: https://www.ks.uiuc.edu/Development/Computers/docs/sysadmin/.... Anyone remember that? I guess the modern version of Epkg with dependencies these days is https://docs.brew.sh/Homebrew-on-Linux. :-) Or maybe Nixpkgs: https://github.com/NixOS/nixpkgs?
Deployment-wise I've already done the Bash script writing thing to help a friend automate his deployment to EC2 instance. For myself I was going to start using boto3, but just went ahead and learned Terraform instead. So now my scripts are just simple wrappers for Docker/Terraform that build, push, or deploy that work with AWS ECS Fargate or DigitalOcean Kubernetes.
No CI/CD yet. DBs/backups I'll tackle next as I want to make sure I can install or failover to a new datacenter without much difficulty.
1. Tricky and infrequent changes (mail server, load balancer, database, etc.): for these I just write a bootstrap.sh, rsync it to the server and run it. It's literally just a `for server in foo bar; do scp ...; ssh ...; done`
2. Deployment / frequent changes: docker-compose to spin up everything. It's super nice. Again, the deployment is done with an `rsync` then `docker-compose -f docker-compose-prod.yml up`
Eventually, when deployments changed very frequently and I needed scale/HA, I added Kubernetes. K8s is way easier to set up than you think and it handles all the other stuff (load balancer, environment variables, etc.).
And my deploy now become: `kubectl apply -f`
One trick I used is to use `sed` or `envsubst` to replace the image hash.
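That trick looks roughly like this; the registry, tag and file names are placeholders:

    #!/usr/bin/env bash
    # The manifest template has an ${IMAGE} placeholder; substitute the freshly
    # pushed tag and pipe the result straight to kubectl.
    set -euo pipefail

    export IMAGE="registry.example.com/myapp:$(git rev-parse --short HEAD)"

    envsubst '${IMAGE}' < deployment.tpl.yml | kubectl apply -f -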
For backups I, again, literally set up a cron job on an external server that `ssh`es into the database server and runs `pg_dump`.
I also have a nice NFS server to centralize config and sync back to our git repo.
I used this whole setup to operate https://hanami.run an email forwarding service for the first 3 months before I added Kubernetes.
> I self-host three VPNs to protect my infrastructure heavy startup https://quantale.io
I am a big fan of Pritunl, which is open source and makes network security easy. I am in no way affiliated with them. I use Pritunl to limit access to servers and web applications for my different teams. For each user, you can generate a profile and assign the servers and ports they have access to. For example:
- Only the dev team can access the SSH port (22) on the stage server, and it's not open to the internet.
- Anyone on the team can access the stage version (port 443) for testing purposes (not open to the internet).
- Only I can access all ports on all prod servers (only 443 is open to the public).
What hackers can't see, they can't attack. Especially port 22 on your servers should only be accessible to you, not the internet.
I self-host one instance each of OpenVPN and WireGuard with Pi-hole, which is then used to access my Pritunl server, adding an extra layer of security.
Each of these 3 servers can be hosted on Hetzner $2/month instance. With a mere $6, you can add an extra layer of security to your infrastructure. Pritunl itself also provides subscription so that is also an option.
https://news.ycombinator.com/item?id=28671753
If you want to discuss more about this or security in general, feel free to reach out to me on my email on profile.
Databases:
Use AuroraDB on AWS. Pros: no server maintenance, no backups to manage, no maintenance at all really. Automatic snapshots set to 30 days. I back it up once a week to an S3 bucket for longer-term storage. That S3 bucket is "aws s3 sync"'d to a local drive on a Debian machine that has a tape drive in it. I do the tape backups just after the weekly DB backup.
Code deployment: For static assets I use GitHub actions. For the containerised servers I haven't migrated to GitHub actions yet (.NET Core app running in an Ubuntu container) but since the system is usually deployed with the AWS docker tools that should be easy to automate. Lambda functions just use the AWS dotnet command line tools.
This is all much easier when you use a debian machine as your main dev/integration box. You can use WSL2 as a reasonable substitute. The command line AWS and dotnet tools are much nicer in a unix shell as you can do things like use a Makefile or a shell script to capture the more complicated actions. PowerShell is an abomination.
Thankfully lots of tools do most of the heavy lifting. I use k8s, GKE does most of the work for me. It's very nice to have autoscaling for traffic spikes. Same with database (MongoDB Altas), dead simple autoscaling. I would never run my own k8s nor database.
I wrote more details about some of the Ops stuff I do here in a previous similar question: https://news.ycombinator.com/item?id=26204402
Coincidentally I also wrote some architecture notes about a new product last night: https://blog.c0nrad.io/posts/slack-latex/
I think everyone's mileage will vary, but as general principles: staging is nice, reading docs saves time overall, tests help you sleep at night and make it easier to make changes 6 months in the future, and simple health checks (or anything on a critical path) help you catch the real issues that need immediate attention.
Good luck!
I tried to optimize for cost and very little devops at all. I still mull about it but so far, here is what my architecture would look like.
- no servers to manage
- static website hosted on a CDN, and even dynamic user-specific pages would be built statically every 8 minutes
- all writes go to an SQS queue
- every 8 minutes, a lambda is spun up to batch-read the SQS queue, and all writes go through a single writer process that writes to a SQLite file on EFS
- every 8 minutes, a new static version of the site is built using the above (hopefully updated) SQLite file
The one thing I hate about this is that I can't in good conscience say to the user that their write request succeeded, because no state has actually changed on the backend yet. But I might be okay with this tradeoff and just show them only their most recent write.
The other tradeoff is when things go wrong, it'll probably go terribly wrong.
If it were a private repo I'd still shell out for Github Actions or Circle CI most likely. I'd also consider buying a chunky-enough minipc for ~$500 and an older mac mini and set up runners on them.
For the moment private runners isn't a problem. But soon I'll need to start integration-testing proprietary code paths like querying Oracle or MS SQL Server. In that case I probably need to set up a dedicated box with all the right licenses so I can run CI jobs on it.
[0] https://github.com/multiprocessio/datastation/blob/master/.g...
The server accepts .jars over HTTP with code (and files) so I can hotdeploy while developing on live on the entire cluster in real time. My turnaround is about 1 second.
The JSON database allows for schema-less simplicity, and it has all the features you need like indexes, relations, multi crossref and security (and then some, like global realtime distributed while still being performant) in 2000 lines of code.
I have zero pain developing the most scalable (and energy efficient) backend in the world, yet very few seem to care or use it: https://github.com/tinspin/rupy
It has been proven on a real project with 5 years uptime and 350.000 users: https://store.steampowered.com/app/486310/Meadow/
Try https://treblle.com/how-it-works
Treblle allows you to test API requests with one click. Because Treblle knows what data was sent and what was returned it can replicate any call that the user has made and quickly allow you to re-run specific requests. Treblle also provides you with a way of running manual tests from our platform.
It also has an auto-generated documentation feature. Treblle can generate the required documentation for each endpoint just after 1 API call. It understands your JSON responses, can detect various authentication methods, group variable URLs in a single endpoint, support different documentation versions, and similar.
https://www.youtube.com/watch?v=3_W7zHrsM7E&ab_channel=Trebl... Here is a quick overview.
Let me know what you think.
DevOps is not a synonym for "operations work" or "server work" or "devs doing operations work".
DevOps literally means "dev teams and ops teams working together". That's all. You don't "do DevOps" unless you are "doing the work of collaborating between two different teams of people".
[0] https://anthonynsimon.com/blog/one-man-saas-architecture/
I simply run "make deploy" and a new production build is made and rsync'd to production. Then the script executes a "deploy.sh" on the production server which reloads the system process to pick up the new binary.
Poof.
With that said, I run a semi resource intensive operation, so I've invested a bit into dev ops to keep our costs down. My setup is currently on AWS, primarily using ECS, RDS, Elasticache. Infra is managed via Terraform.
I felt ECS was a nice balance vs K8s, as it's much simpler to manage, while getting the benefit of maximizing resource utilization.
For CI / deployment, I use Github Actions to build an image and push, then start a rolling refresh to update the containers to the new version. It was pretty easy to setup.
On DBs, RDS handles all the backups and maintenance. For migrations, I use https://github.com/amacneil/dbmate.
Happy to answer any other questions you have, as I've learned a lot through trial and error.
If it’s just you, you can afford to use a simple shell script to run tests locally before deploying or use a Travis/Circle/Cloudbees CI option too.
You have enough to focus on with the application itself.
I've got a big client who went from 6 infrastructure engineers to zero. They survived for half a year without any real problem. You either pay a company to manage your infrastructure for you or you pay a person. Devops people are expensive.
Things I've done to automate DevOps: make sure you pay attention to automated testing and deploys with CI/CD. I like automated deploys on GitLab: trigger a staging release on merge into main, and deploy prod on merge from main into a deploy branch. Infrastructure as code with Terraform. K8s is more trouble than it is worth (though it is robust after you invest 3 months into getting it tuned right; a lot of people make critical mistakes in their config, and I've seen a half dozen insane ones).
Monitoring - Loads of cron jobs pinging slack (monitoring disk, CPU, Network I/O etc), Healthchecks.io, TurboAPI, Nginx Amplify. Mostly built up over time
DB - Done entirely via migrations. Most of the time they are small migrations so I can do it by running the command directly on the instance
Deployments - GitHub actions to build a docker image (there was a prebuilt template). Watchtower then runs on my server to pick up the new image and that's then deployed automatically. There is no rollback system as I've not needed that so far (and 99% of the time you won't).
Backups - DB backup is done via a cron pushing to AWS S3 Glacier. The app code is stored on GitHub already so no need to back that up. Happy to share the script if you like :)
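For reference, such a script can be just a few lines. A minimal sketch assuming a Postgres database and the AWS CLI (bucket and database names are made up):

```
#!/usr/bin/env bash
# nightly DB dump straight to S3 Glacier
set -euo pipefail

STAMP=$(date +%F)
DUMP="/tmp/appdb-$STAMP.sql.gz"

pg_dump appdb | gzip > "$DUMP"
aws s3 cp "$DUMP" "s3://my-backups/db/appdb-$STAMP.sql.gz" --storage-class GLACIER
rm "$DUMP"

# crontab entry: 17 3 * * * /usr/local/bin/backup-db.sh
```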
Lead time to change, deployment frequency, mean time to restore (MTTR), change failure rate.
Though it often is, I do not believe it should be used interchangeably to refer to a suite of tools designed to enable CI/CD and the SDLC.
A cron job which saves a copy of the DB to S3.
That’s it.
If you’re focused on anything more with no employees you’re probably wasting your time.
I have no idea what you are building or planning it to be but for my projects I have a single VPS with Hetzner per project running linux and hosting the applications. Deployment is done by a powershell script that copies files from my laptop to the server - that's it.
If I were to think about clusters, cloud instances, load balancers and what not then I would be scared away before having delivered anything of value. Cross that bridge when you get there.
> Just properly setting up and maintaining a small k8s cluster looks almost like a fulltime job.
ok.
The managed offerings floating around these days (DigitalOcean, AWS, and Google all have them, probably more) are insanely easy to set up and maintain. A managed cluster at DigitalOcean starts at $10/month. I definitely wouldn't recommend standing up your own cluster.
For new projects, I actually find it easier to whip up a deployment.yaml and toss it in my shared k8s cluster (DigitalOcean) than setting up a new VPS where I have to manage updates (including migration when the support window for my OS ends), service restarts, etc.
Github Actions for CI. Makefiles for deployment (`docker build+push` and `kubectl apply`).
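The whole deploy target can stay tiny. A sketch with made-up registry/app names, assuming the manifest uses an `__IMAGE__` placeholder for the tag:

```
#!/usr/bin/env bash
set -euo pipefail

TAG="registry.digitalocean.com/my-registry/my-app:$(git rev-parse --short HEAD)"

docker build -t "$TAG" .
docker push "$TAG"

# point the manifest at the freshly pushed tag, apply it, wait for the rollout
sed "s|__IMAGE__|$TAG|" deployment.yaml | kubectl apply -f -
kubectl rollout status deployment/my-app
```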
If you're not very comfortable with k8s yet I really wouldn't recommend you to setup a cluster for a small project.
You can find more about our setup and ops here: https://pirsch.io/blog/techstack/
Some things can be simpler when there’s only one of you. Take those opportunities when they arise.
You don’t need remote CI when there’s only one dev. Just make running the test suite part of your deploy script.
Use hosted services. Sure it would be cheaper to rent a server from OVH and do it all yourself. You don’t have time for that. You’re saving money on staff, spend a little on a hosted database, k8s, etc.
I love Google Cloud Run. Love it. Seamless deploys, dead easy rollbacks, inbuilt secret management, etc.
Micro services are about scaling teams, not technology. You’re a team of one, that’s probably the number of services you should have.
Hatchbox manages the deployment, backups, rollback etc.
Some people probably don't realize that you can have a full-featured git-push deployment including a PostgreSQL database and Redis in a few hundred lines of simple Bash code. The only exception to keeping it bare is security (proper SSL, SELinux, etc.).
- 1 cheap VM per project (I prefer Digital Ocean at this time) + snapshots
- No CI (run tests locally), no staging until later
- Little bit of Bash to configure everything, no IaC
- Simple systemd services (+ maybe systemd socket activation)
- git-push deploy... bad release? repush previous version
- Automatic system updates and log rotation
- Few auxiliary scripts for Rails console (./railsc.sh), backups (./backup.sh), etc.
- External error and performance monitoring, but keeping raw logs on server (rarely need them)
- Stable CentOS/Rocky Linux with long time support, rootless access
I teach all of that in my book https://deploymentfromscratch.com/ and I basically run my oldest side-project on the book demo which is similar to this. Some people might not like Bash, but it's surprisingly refreshing, and I keep my script flexible and idempotent:
Set everything up after providing IP address and domain name in settings.sh: $ ./setup.sh
Change the database configuration after changing the config file: $ ./setup.sh -u postgresql
Deploy a new version: $ git push production master:master
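I don't know exactly how the book wires it up, but a classic `post-receive` hook on the server is one way a git-push deploy like this can work. A rough sketch (paths, service name and Rails commands are assumptions):

```
#!/usr/bin/env bash
# hooks/post-receive inside the bare repo that the "production" remote points at
set -euo pipefail

APP_DIR=/srv/app
BARE_REPO=/home/deploy/app.git

# check the pushed master branch out into the app directory
git --work-tree="$APP_DIR" --git-dir="$BARE_REPO" checkout -f master

# migrate and restart (Rails assumed, per the scripts above)
cd "$APP_DIR"
bundle exec rake db:migrate RAILS_ENV=production
sudo systemctl restart app.service
```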
Later on I would separate the database and have staging, but it's overkill for my projects right now.
The funny thing is that, apart from having CI and staging, we could have had the same setup at 2 early-stage startups I worked on...and it would probably have lasted 3-5 years with maybe an in-place server spec upgrade, no kidding. People really like to overdo operations, and you know where overdoing it leads? Mistakes. You maintain K8s and then forget to do something basic security-wise that almost kills the company (I've seen it happen).
There are so many different options available, but they vary for different needs.
Regardless, I would assume that you are going to pay more for a managed solution than for managing everything yourself.
I know the founders at https://releasehub.com/ and it sounds like it might hit that sweet spot for “outgrowing heroku” but “don’t want to do full time devops”.
You can do that with some shell scripts on an instance and an e-mail service and manually copy the whole mess for backup if it makes sense. The how of implementation is very case specific.
Thinking more on it I'd sum up as "provide a service that doesn't require that you be Google". If you start by specifying an impossible amount of infrastructure maybe there's an issue with management :).
The more I do this, the more I learn, the more I automate, the less I worry.
You can run almost ANYTHING inside dokku. I use it to run a private/public docker registry, minio (amazon S3 clone), analytics via matomo, and a ton of mini apps that share a bunch of databases. It's incredible.
With dokku, deploying your app is just "git push dokku main"
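For anyone who hasn't tried it, the first-time setup is also only a few commands. A sketch with made-up host/app names (the postgres plugin is assumed to be installed):

```
# one-time setup on the dokku host
ssh dokku@my-server apps:create myapp
ssh dokku@my-server postgres:create myapp-db
ssh dokku@my-server postgres:link myapp-db myapp

# every deploy after that, from the project repo:
git remote add dokku dokku@my-server:myapp
git push dokku main
```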
GitLab plus CI is amazing. And, other than hosting, I don't pay a penny for it. If you self-host, you can put it behind a wireguard VPN. It's incredible. You can run as many runners as you want on your personal laptops, or on the cloud, and bring them up and down when needed.
For CI GitLab is a bliss. Lots of examples and articles around the web. I made a test-build-dockerize-tag pipeline once and now just taking it from one project to another with minimal changes.
For deploy - Docker Compose is enough. Maybe Docker Swarm some time in the future. Gitlab hosts docker images, rollback is a simple change of container version.
For Postgres backups I am using WAL archive + daily snapshots from replica.
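A rough sketch of that scheme with a made-up bucket name; the archive settings live in postgresql.conf, and the daily base backup runs from cron against the replica:

```
# postgresql.conf on the primary (assumed settings):
#   archive_mode = on
#   archive_command = 'aws s3 cp %p s3://my-backups/wal/%f'

# daily snapshot taken from the replica (WAL excluded, since it is archived separately)
pg_basebackup -h replica.internal -U backup -D /backups/base-$(date +%F) -Ft -z -X none
```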
After I run tests locally, I've got a makefile that deploys to either Google Cloud Run or fly.io.
Database is Cloud SQL which works well enough for me.
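The makefile targets are essentially one command per platform. A sketch with made-up project/service/region names:

```
# Google Cloud Run: build the image with Cloud Build, then deploy it
gcloud builds submit --tag gcr.io/my-project/my-app
gcloud run deploy my-app --image gcr.io/my-project/my-app --region us-central1

# fly.io: deploy from the fly.toml in the repo
fly deploy
```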
Whenever there is a task that I have to do manually when building (e.g. committing build hashes to a git repo, or manually replacing the version number in the UI), I create a gulp task to do it for me. Not only does this save time when deploying, it also dramatically reduces the mistakes I can possibly make.
I use Gitlab for CI, commits to master create a new docker image in the Gitlab registry. I manually trigger deployment on k8s with `kubectl rollout restart XXX`
- Linode VPS
- Nginx
- SSH to deploy code
- Google Cloud hosted (& backed up) DB
Why would I need K8 and some crazy architecture when I'm just trying to get my product to my customers in the fastest / most efficient way possible? I'm not FB or Amazon and if I do get that size, I'll have plenty of time to adapt on the way there
Finally, I was going to quote the Pieter Levels case, but someone else beat me to it.
nginx ingress (I would probably go Contour+Envoy if I was to set it up now).
Kotlin w/Gradle. Build + Deploy single command `gradle deployStage`.
jib for container images, don't need Docker, reproducible super fast builds.
Prometheus + Grafana for metrics and alerting.
Things I like about it:
- JVM is a super low-maintenance runtime: tune it once and you're pretty much good to go. Very easy to diagnose memory or performance problems in production. I even have the debugger agent loaded in prod, listening on localhost, so I can use kubectl port-forward to access it and diagnose bugs that only occur in production, hung threads, etc.
- Prometheus seems like overkill but it's just so much less effort because everything already has a prom exporter.
- Using k8s API saves a bunch of time again. No need to screw around with docker-compose or some godawful config management from the 2000s like Ansible (ick).
Things I don't like:
- It's not HA. The DB pods push their WALs to S3 and I have a script to spawn a new PG from S3 using PITR. This is how I test new code, so it's known working, but it still means that if my server goes down it's down until I do something about it.
- The box is a bit snowflaky. I didn't set it up using a script, it was hand configured up until the k8s level.
- I rarely upgrade k3s because I have to do it manually. If I was using EKS or GKE I would probably press the upgrade button more often.
- Grafana is a chore. grafonnet helps (I also use Tanka w/jsonnet for k8s manifests).
CI/Deployment: hosted CI, merged PRs to main branch auto-deploy (I've used both Codeship and Github Actions with ECS)
DB: managed RDS
No k8s for sure.
I host my apps and databases in Heroku. Yes it's more expensive than alternatives, but it's easy and I trust their database backups.
For everything else, I find managed services.
Basically, I'm a developer and I let other companies handle most of my DevOps for me.
- use tech-stacks that I know
- even simpler – use node.js / express (i.e. use only one programming language)
- Standard AWS EC2, S3, Postgres instances
- scale vertically before scaling horizontally
- use GitHub Actions as much as possible
- deploys and rollbacks are done manually
- I keep a list of useful CLI commands in Notion
If you're a small team you shouldn't need to think about kubernetes tbh.
Both their Shared and Dedicated servers are fully managed. You can run PHP, NodeJS, etc on it. They also fully manage MySQL too.
All you have to do is drop your code onto their servers and it just runs. It's also hard to beat their pricing too.
Deploy with scp/sftp.
Also it's very useful to have DEV.md file for each repo. It helps to remember how to build and deploy your project. Documentation for future me :)
- Easy Postgres clusters (but not managed)
- Heroku-like deploys (but less magic)
- Easily geodistributed (deploy near your users)
- Reasonable pricing to get started (and not too bad beyond that)
In my case, I have tried MetaCall: https://metacall.io
A single makefile command is enough to build .deb files, rsync and install them. Rollbacks are trivial.
Put configuration files in git and in the .deb as well.
git tags give you a history of releases and also deployments.
Deployments can be automated
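One way the makefile target might look, assuming a prepared `pkg/` tree with a DEBIAN/control file (package and host names are made up):

```
#!/usr/bin/env bash
set -euo pipefail

VERSION=$(git describe --tags)
DEB="myapp_${VERSION}_amd64.deb"

dpkg-deb --build pkg "$DEB"                      # build the package from pkg/
rsync -avz "$DEB" deploy@prod.example.com:/tmp/
ssh deploy@prod.example.com "sudo dpkg -i /tmp/$DEB"

# rollback = install the previous .deb the same way
```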
I usually work with PHP, so I already have playbooks to configure/install PHP/nginx and others (just change a few variables).
Deployment only needs a few extra steps like a git checkout (easy to reuse since you're probably using the same framework across projects).
Some examples here:
https://github.com/MalpraveCorp/ansible
and specifically:
https://github.com/MalpraveCorp/ansible/blob/master/mal-be-p...
* Database: CloudSQL/PostgreSQL (with Query Plan for analysis)
* Services: Kubernetes (Services in Go)
* CI: Cloud Build
It's pretty easy to maintain.
Deployed onto two off-lease Dell servers, one for staging and one for production, via docker-compose.
I'm fairly sure I can maintain up to 50 servers as a solo developer with the above-mentioned tools.
P.S. I previously did quite a bit of AWS, Google Cloud, and Firebase, but the above has been working best for about a year.
Use a PaaS like Heroku, hook it up to your GitHub and a CI platform. Get your database hosted on Heroku or somewhere similar - plenty exists for anything you might need (Postgres, MySQL, Redis, Mongo...).
Don't do things that increase your operational overhead like using microservices or running unusual databases for which a good hosted solution doesn't exist. Stay away from anything k8s.
It will be a bit more expensive but well worth the extra money, compared to the time you'd spend on operations.
- cheap VPS for hosting, postgres for the db, nginx as a proxy, redis for everything else, including caching.
- deploy python projects by packing them as a zipapp with shiv, then use fabric to ssh in and perform any migrations necessary. No, not even ansible. (See the sketch after this list.)
- build, lint, format and test, like all automatable stuff, are done with pydoit. If you are solo, you don't need a CI service. Your laptop is the CI machine.
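A minimal sketch of that flow, assuming a console-script entry point named `myapp`, pydoit tasks named `lint`/`test`, and a made-up VPS hostname (fabric would wrap the same ssh steps in Python):

```
#!/usr/bin/env bash
set -euo pipefail

# the laptop is the CI machine: run the checks first (hypothetical task names)
doit lint test

# pack the project and its dependencies into a single executable zipapp
shiv -c myapp -o myapp.pyz .

# ship it, run migrations, restart (all names hypothetical)
scp myapp.pyz deploy@vps.example.com:/srv/myapp/myapp.pyz
ssh deploy@vps.example.com '/srv/myapp/myapp.pyz migrate && sudo systemctl restart myapp'
```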
Scaling is always the same story:
1 - start with the cheapest VPS possible. This will force you to code with some constraints. Not crazy, just enough to balance it out.
2 - once you start seeing your load average rising, just move the db onto a second cheap VPS. You just tripled the load you can sustain. Not doubled, tripled. Most solo projects will stop here actually.
3 - wait again, collect some perf data when you see the site starting to slow down. Check for slow queries, code hot paths, etc. Optimize that, add some cache, set up task queues. Use cron to purge / regenerate the cache if you need to smooth out the curve. Maybe slap some varnish on top for extra pep, or just Cloudflare.
4 - now you've got what looks like a real load, and a more realistic code base, so when you peak again, request more resources for your server, or migrate to a bigger one. You can scale vertically to crazy heights nowadays. Really, really crazy. You can get terabytes of RAM, 64-core servers, etc. At your current scale, your service should generate enough money to pay 10 times for it anyway. But you probably won't need to. Even cheap servers are beasts; look at the current Leaseweb offer: https://www.leaseweb.com/dedicated-servers#NL
For €320.09 per month, you get:
2x 16 cores 2.30GHz
28GB DDR4
4x960GB SSD
30 TB traffic
To put it in context, MySpace used to serve all its users with only 2 servers, until they reached 500,000 accounts. Only then was it too much. And that was with the hardware (and prices!) of the 2000s.
5 - you will probably never reach this point. This is the point where k8s, load balancers, sharding, etc. start to be interesting.
How to preserve your data:
- Raid
- Dump the db with a _randomized_ cron
- rsync the dump and all assets. You can do it to another server, or just your laptop at the beginning.
You can get fancy with database replication if you want, or use backups that stream in real time.
But there is one trick: not all data is equal. Identify the data in your db that you can't afford to lose, and make sure it is saved separately and very regularly, with priority. For plenty of other data, a hole doesn't matter much: if 1 tweet out of millions from 10 years ago is missing, do you think it affects the service?
Monitoring:
- Sentry
That's all. It's free (or cheap), and for a small service, you don't need real time. It's ok to be down a few hours once a month for most services at first. I have a service with 700k unique users a day, it still goes down sometimes. It's a blip.
Sometimes, log into the servers, run htop and check what's up. You can install OpenTelemetry later if you really need to, but for now, even an HN hug of death is not going to kill you.
Summary:
- Modern software and hardware are amazing. Max them out. Horizontal scaling is hard, and expensive.
- You don't need a perfect service. Unless you are handling patient cancer data, that is. Don't worry about perfect uptime, 0 data loss, etc. If you are a solo dev, the cost for that is huge. Just do 97% right.
- What worked 20 years ago still works today, and will likely be there tomorrow. And you can move from that to the cloud later. The reverse is not that nice.
If you need a bit more than Heroku:
1. Try Render[0].
2. If it is not enough, go with any Kubernetes you can afford, any docker registry and Pulumi or Pulumi+Helm[1]. For monitoring you can use either NewRelic[2] or Weave Cloud[3].
3. If you find big cloud kubernetes too expensive, need more bandwidth, or are just willing to try more affordable alternative new players, maybe Civo[4] fits the need.
In my case[5], since I've been bootstrapping for three years, a low-budget setup is key to keeping the lights on. Here are a few constraints I need to satisfy permanently-ish:
- My workload is bandwidth-intensive, so EC2 egress is too expensive for my case and my situation.
- It is not that cheap to run 12 micro-services in any of the existing Heroku-ish platforms, also private networking and routing of my app are a bit above "12 factor" at this point.
- I can't afford the same availability as everyone else: I'm in the business of supporting other apps but I don't want to pass down the cost of AWS, so I can't afford to be "down because everyone else in AWS is down".
- I don't know many things, and until I find PMF, keeping a low-cost setup gives me the resilience that I need as a founder.
Therefore to solve my problem I use an unholy combination of Ansible and Pulumi to manage 3 clusters of Kubernetes for both stateful and stateless workloads.
Why am I doing this to myself? I have mad respect for people who can run things in a VPS, use the hosting daily backup, and don't see the need to apply patches because they're behind Cloudflare; I just can't.
More details:
- Dedicated servers, 32GB RAM, 8 Cores, 500GB SSD(RAID 1), /29 ipv4, 30TB of Bandwidth for $70/mo each. XCP-NG[6] as hypervisor.
- To assign public IP to virtual machines, I run a tiny DHCP ubuntu VM managed by Ansible.
- On top of XCP virtual machines, k3s cluster managed with Ansible.
- To manage XCP and virtual machines, I use Xen-Orchestra[7] managed with Ansible.
- Pulumi+AWS ECR builds and pushes docker images, secret and configuration management using paraphrase secret provider.
- Hashicorp Vault backed by Postgres+AWS KMS managed by Ansible.
- Ansible+Pulumi+Helm to self-host Apache Pulsar with ZFS snapshots, a cron-job and rsync.net alerts. WAL-G+AWS S3 to backup and archive Postgres/TimescaleDB managed with Ansible.
- All logs aggregated by Gelf and Fluentbit into a self-hosted instance of Seq[8] managed with Pulumi+Helm.
- Tailscale between all virtual machines
- DNSimple for domains
- Cloudflare for all things except the self-hosted docker registry.
- Github Actions for CI, but not a hard requirement; I can run ops from my local computer when Github Actions is down.
Cost: $210 /mo.
Estimated cost in AWS: $2,800 /mo
[1] https://www.pulumi.com/docs/reference/pkg/kubernetes/helm/v3...
[2] https://newrelic.com/platform
[3] https://www.weave.works/product/cloud/
- docker
- dokku
- sqlite
- litestream
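For context on the last two: litestream continuously replicates the SQLite file to object storage and can restore it onto a fresh box. A sketch with a made-up bucket and database path:

```
# replicate the live database to S3 (normally run as a long-lived service)
litestream replicate /var/lib/myapp/app.db s3://my-backups/myapp

# disaster recovery on a new machine: restore the latest replicated copy
litestream restore -o /var/lib/myapp/app.db s3://my-backups/myapp
```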
I don't know what everyone is doing with kubernetes that takes up so much time and effort, especially in an SME environment.
I'm currently operating three separate kubernetes clusters for three different clients. And this is just me, there are no other tech people involved day-to-day. These clusters require minimal maintenance, just the occasional upgrade.
These clusters are small (1-10 nodes), and are on either GKE or bare metal. The biggest is around 100-200 running pods. I have a CI/CD pipeline setup in GitHub actions for each client. Development & CI uses docker-compose.
I used to do the old-hat 'scp your code to the server' kind of deployment, and this was a lot more pain. Kubernetes is big, but it solves so many problems. Ensuring everything is actually running, networking, SSL certs, persistence, configuration, etc etc.
I've been running this setup for 3+ years now. If I cast my mind back, I think I spent a weekend reading the k8s docs concepts section, then pretty quickly deployed a cluster on GKE. Given it was a team of one, I ditched RBAC which removed some complexity. Getting the CI/CD pipeline setup took some time (maybe a few more days), and I built more tooling as a went.
I really think kubernetes can be really useful for tiny teams, but there is a lot in there that is aimed at big teams. Use the bare minimum possible at first, and grow from there.
---
Now here is a bonus unsolicited contrary comment.
Deploying k8s used to really require a cloud provider, particularly for persistent volumes and load balancing. But I think that has changed over the last few years. OpenEBS and MetalLB look like they are really addressing this.
Also, has anyone noticed that dedicated servers are cheap as chips now? I think the equivalent AWS instance would cost around 10x as much as a dedicated server. (I'm looking at Hetzner in particular, but OVH isn't far behind.)
Given what is currently available from the k8s ecosystem, there seems to be a really strong case for deploying on bare metal these days.
For example, at a ballpark you can get a 400GB+/40core cluster for around $300/month (which also includes another 80GB/10cores of ancillary gubbins). The cost for that easily comes in at around $3,000 on AWS (ex any storage).
OpenEBS gives you replicated storage, MetalLB gives you your load balancing, and each instance has a couple of terabytes of NVMe too. And, if you're worried about reliability of physical servers, at that price you could replicate your setup across datacenters at minimal cost.
---
I should add a caveat here. I've been working freelance in tech for 15 years and I am a massive generalist. I do everything from business consulting to design, devops to development, and I also run an ISP so know my way around networking/BGP etc. That being said, setting up a quick cluster on GKE doesn't require many of those skills.
Having written this ambling comment I guess I'll also say that I am available if anyone wants a cluster set up. I'm also about to start providing managed k8s services.
https://gitlab.com/northscaler-public/release-management
It's a fairly low-tech set of shell scripts that implement a release management strategy that is based on one release branch per minor version. All it does is manage version strings, release commits, release branches & release tags. You can hook your CI/CD into it whenever you're ready for that.
We've used it to great effect on many client projects.
The workflow is pretty simple for a new release (assume a Node.js project in this example):
0. Your main ("main", "master", "trunk", "dev") branch is where all new features go. Assume our next version is going to be "2.3.0", so the version in the main branch starts out at "2.3.0-pre.0". If you need dev prereleases, issue them any time you'd like with `./release nodejs pre`. This will bump the version to "2.3.0-pre.1", "2.3.0-pre.2", etc each time.
1. Ceremony: decide that you're feature complete for your next release.
2. Use the release script to cut a release candidate ("rc"), say, with `./release nodejs rc`. You'll end up with a new branch of the form vmajor.minor, so v2.3 in this example, and the version in that branch will be 2.3.0-rc.0. Appropriate git tags will also be created. The version in the main branch is bumped to 2.4.0-pre.0, for the next minor release.
3. Test your release candidate, releasing more release candidates to your heart's content with `./release nodejs rc`. Meanwhile, developers can start working on new features off of the main branch.
4. Ceremony: decide you're bug-free enough to perform a "generally available" (GA) release.
5. Perform a GA release with `./release nodejs ga`. This will tag a release commit as "2.3.0", push the tag, then bump the version in the release branch (v2.3) to "2.3.1-rc.0".
6. If you find a bug in production, fix it in the release branch, issue as many RCs as you need until it's fixed, then finally release your patch with `./release nodejs patch`. You'll get a release commit & tag "2.3.1", and the version will be bumped to "2.3.2-rc.0". Lastly, cherry-pick the fix back into the main branch (often literally `git cherry-pick -x ...`).
7. Repeat ad nauseam.
This allows you to at least manage your versions, branches & git tags in a sane & portable way that's low-tech enough for anyone to work on and understand. It's also got plenty of idiot-proofing in it so that it's hard to shoot yourself in the foot. Further, it's very customizable. After years of use across lots & lots of projects, we recommend using "dev" as your main branch name and as your main branch's prerelease suffix, and using "qa" as your release branch's prerelease suffix. The defaults are "pre" & "rc", and too many folks are using these scripts nowadays for us to change the defaults.
I'd say the meta-level answer here should be to timebox your investment based on the tools you know, and an honest assessment of your uptime requirements. Also, know that what feels like a janky solution will usually get you much further than you think it will.
For an initial PoC you should probably not be spending more than an hour or two setting up your infra, so Heroku or a single VM is probably the way to go. But you still _might_ want to start with k8s if it's the tool you know best; you can set up a k8s cluster in GKE with about as many clicks as it takes to set up a single node, and if you have a full "django k8s yaml" or equivalent in your toolkit then you could be up and running in an hour or two from scratch. I wouldn't recommend this to most folks though.
Once you have some traction, beyond PoC stage, then you should be thinking about failure modes, their impact to your business, and how much time it's worth spending to avoid them. If your users won't notice a few hours a week of downtime, then you don't need a multi-node setup like k8s or nodes behind a LB; a single node should suffice. If you have simple user data (like, say, comments) where in the worst case losing everything since last night's backup would not sink your business, then running your DB with no standby replica and nightly backups might suffice. If you're storing financial transactions that's obviously not an option, and your first DB with customer data needs replication (or however else you want to get your RPO to zero).
Given all that, here are my suggestions for your specific questions:
For CI/CD I'd say don't bother with CD, just set up Github/Gitlab to run your UTs on your PRs and on merges to master. Honestly you can just run UTs manually while building at PoC stage, but when you have paying customers I'd say you should take the 15 mins to wire up simple UTs. Getting a full Continuous Delivery pipeline set up won't save you much time vs. just manually deploying your artifacts, and at the beginning your tests are unlikely to be good enough to continuously deploy safely. At the point where you start having quality issues that your customers are complaining about, you might want to push for CD as a discipline thing; it forces you to write good-enough tests that you're happy to deploy if the tests are green. But it's not mandatory to deploy everything if your tests are good enough to do so.
For rollback on your stateless instances, I like the pattern of having your single node run your apps in Docker containers (I had one for Django and one for Nginx); that way the upgrade/downgrade is simply a `docker pull` of the desired image tag followed by a container restart.

For running your DB, I see a few folks suggesting running your DB locally on your node -- if you are an experienced DB operator that might be a good option, but if you're not, I would personally recommend using whatever managed SQL offering your cloud provider has; RDS or Cloud SQL will give you a durable instance with backups & easy restore with a few clicks in the UI, which is hard to beat in terms of ROI.

Possibly-controversial recommendation: in general, unless you've spent a lot of time working with Terraform/Ansible/etc., I would recommend against using infrastructure-as-code in the early days. It's hard to beat the ROI of a couple clicks in your cloud provider's UI to get your infrastructure set up; you might spend 10-100x more time getting your repeatable build implemented, even if it's just a couple minutes vs. a couple hours. Clicking in the UI is not reproducible, but it will get you further than you think. As above though, if you're fluent in these tools and the difference in time-to-provision is very small, then absolutely do this now instead of later. But don't feel like you MUST do it "the right way" from the beginning.
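To make that rollback pattern concrete, a sketch with made-up image, tag and port names:

```
#!/usr/bin/env bash
set -euo pipefail

PREVIOUS=registry.example.com/myapp:1.4.2   # the last known-good tag

docker pull "$PREVIOUS"
docker stop myapp && docker rm myapp
docker run -d --name myapp --restart unless-stopped -p 8000:8000 "$PREVIOUS"
```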