Managed Kubernetes cluster such as GKE for each environment, set up in the cloud provider UI since this is not done often. If you automate it with Terraform, chances are that the next time you run it the cloud provider has subtly changed some options and your automation is out of date.
Cluster services repository with Helm charts for the ingress controller, centralized logging and monitoring, etc. Use a values-${env}.yaml for environment differences. Deploy with a CI service such as Jenkins.
Configuration repository for each application, containing its Helm chart. If the app has one service, or all of its services live in a single repo, this can go in the same repo; if its services are spread across multiple repos, create a new one. Use a values-${env}.yaml for environment differences. Deploy with a CI service such as Jenkins.
Store secrets in the cloud secrets manager and interpolate them into Kubernetes secrets at deploy time (see the sketch below).
The cloud provider keeps the cluster and VMs up to date, and CI pipelines do the builds and deployments. No Terraform/Ansible/other tooling required. Again, this only works for "cloud native" models.
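As a rough sketch, a per-environment deploy step along those lines might look like the following, assuming GKE plus Google Secret Manager; the release, chart, and secret names are hypothetical:

    #!/bin/sh
    # Hypothetical CI deploy step; ENV is set per pipeline (dev/staging/prod).
    ENV="${ENV:-dev}"

    # Interpolate a secret from the cloud secrets manager into a Kubernetes secret.
    DB_PASSWORD="$(gcloud secrets versions access latest --secret="myapp-db-password-${ENV}")"
    kubectl create secret generic myapp-db \
      --from-literal=password="${DB_PASSWORD}" \
      --dry-run=client -o yaml | kubectl apply -f -

    # Layer the environment-specific values file on top of the chart defaults.
    helm upgrade --install myapp ./chart -f chart/values.yaml -f "chart/values-${ENV}.yaml"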
https://www.ansible.com/ is surely a good solution for bootstrapping Linux cloud machines and can be quite flexible. I personally feel that its use of YAML manifests instead of a domain-specific language can make complex playbooks harder to read and maintain.
If all you do is deploy containers on managed Kubernetes or a similar platform, you might get away with some YAML templating solution (jsonnet et al.) and some shell glue.
I am keeping an eye on https://github.com/purpleidea/mgmt, which is a newer contender with many interesting features but lacks more complex examples.
Others like saltstack and chef still see some usage as far as I know, but I've got no personal experience with them.
- If you have SSH access, you can use it. No matter what environment or company you work for, there's no agent to install and no need to get approval to use the tool. It's easy to build up a reproducible library of your shell habits that works locally or remotely, where each step can avoid being repeated if you need to rerun things.
- If you get into an environment where performance across many machines matters more, you can switch to pull-based execution. Because of that, I see very little advantage in any of the other tools that outweighs the advantages of Ansible.
What I prefer to do is use Terraform to create immutable infrastructure from code. CoreOS and most Linux variants can be configured at boot time (cloud-config, Ignition, etc.) to start and run a certain workload. Ideally, all of your workloads would be containerised, so there's no room for configuration drift and no need for any management software to be running on the box. If you need to update something, create the next version of your immutable machine and replace the existing ones.
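A minimal sketch of that boot-time configuration, assuming a cloud-init user-data script wired in via Terraform (CoreOS would use Ignition instead); the image and names are hypothetical:

    #!/bin/sh
    # Hypothetical user-data baked into the Terraform launch configuration.
    # The node exists to run exactly one containerised workload.
    docker run -d --restart=always --name myapp myorg/myapp:1.2.3
    # To "update", bump the image tag in the next launch configuration version
    # and replace the machines instead of mutating them in place.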
When working solo I use Guix; both Guix and Nix are _seriously_ amazing.
So it's a very good tool to gradually get a legacy system under configuration management and thus source control.
I also use (in order of frequency): Terraform, Invoke (sometimes there is no substitute for a full programming language like Python), and SaltStack (thousands of machines in a heterogeneous environment).
If I were going to deploy a new app on k8s today, I would probably use something like https://github.com/fluxcd/flux.
I haven't really had a pleasant time with the tooling around the serverless ecosystem once you get beyond hello worlds and canned code examples.
I really don't see why so many weird, unreadable languages like jsonnet or CUE were created when there is already a type-safe, script-like language (Go compiles in milliseconds, and there is even the go run command) with full-fledged IDE autocompletion, abstraction and templating capabilities, mature dependency management, and much more. Please tell me why we keep inventing thousands of weird things when we already have tools that help with configuration! (:
Alternatively, if you have the option of choosing the whole stack, Nix/NixOS and their deployment tools.
I would recommend staying away from large systems like k8s.
0. Self-hosted Gitlab and Gitlab CI.
1. Chef. I'd hardly mention it because its use is so minimal, but we have it set up for our base images for the nitpicky stuff like connecting to LDAP/AD.
2. Terraform for setting up base resources (network, storage, allocating infrastructure VMs for Grafana).
3. Kubernetes. We use a bare minimum of manually maintained configuration files; basically only for the long-lived services hosted in the cluster plus the resources they need (i.e. databases + persistent volumes) and ACL configuration.
4. Spinnaker for managing deployments into Kubernetes. It really simplifies a lot of the day-to-day headaches; we have it poll our Gitlab container repository and deploy automatically when new containers are available. Works tremendously well and is super responsive.
Or Dockerfile/compose for container images.
Cloud resources are managed by Terraform/Terragrunt.
No orchestration either, FWIW; we usually have Ansible configuring Docker to run and pulling the images...
As for the future, I have been meaning to explore Terraform and some orchestration platforms (Nomad).
config-package-dev is a tool for building site-specific Debian packages that override the config files in other Debian packages. It's useful when you have machines that are easy to reimage / you have some image-based infrastructure, but you do want to do local development too, since it integrates with the dpkg database properly and prevents upgraded distro packages from clobbering your config.
My current team uses it - and started using it before I joined the company (I didn't know we were using it when I joined, and they didn't know I was applying, I discovered this after starting on another team and eventually moved to this team). I take that as a sign that it's objectively useful and I'm not biased :) We also use some amount of CFEngine, and we're generally shifting towards config-package-dev for sitewide configuration / things that apply to a group of machines (e.g. "all developer VMs") and CFEngine or Ansible for machine-specific configuration. Our infrastructure is large but not quite FAANG-scale, and includes a mix of bare metal, private cloud and self-run Kubernetes, and public cloud.
I've previously used it for
- configuring Kerberos, AFS, email, LDAP, etc. for a university, both for university-run computer labs where we owned the machines and could reimage them easily and for personal machines that we didn't want to sysadmin and only wanted to install some defaults
- building an Ubuntu-based appliance where we shipped all updates to customers as image-based updates (a la CrOS or Bottlerocket) but we'd tinker with in-place changes and upgrades on our test machines to keep the edit/deploy/test cycle fast
All of the configuration management tools (Ansible, Puppet, Chef, Salt, etc.) are bloated.
We already have a perfectly fine shell. Why do we need a crappy, ugly DSL or weird YAML??
These days, newbies write Ansible playbooks without even basic knowledge of the Unix shell and its commands. What the hell?
I like the ssh + pure POSIX shell approach, like:
Show HN: Posixcube, a shell script automation framework alternative to Ansible https://news.ycombinator.com/item?id=13378852
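In that spirit, a minimal sketch of the ssh + POSIX shell style, where each step checks state before acting so re-runs are cheap; the host name and file paths are hypothetical, and a Debian-based target is assumed:

    #!/bin/sh
    set -eu
    host="web1.example.com"   # hypothetical target

    # Install nginx only if it is not already present.
    ssh "$host" 'dpkg -s nginx >/dev/null 2>&1 || sudo apt-get install -y nginx'

    # Copy the config only when it differs, then install it with the right mode and reload.
    rsync -c nginx.conf "$host:/tmp/nginx.conf"
    ssh "$host" 'sudo install -m 0644 /tmp/nginx.conf /etc/nginx/nginx.conf && sudo systemctl reload nginx'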
http://madhadron.com/posts/choosing_your_base_stack.html
Don't be distracted by FAANG scale. It's not relevant to most software and is usually dictated by what they started using and then poured lots of engineering time into making work.
My suggestion is to figure out how you will manage your database server and monitoring for it. If you can do that, almost everything else can fall into line as needed.
I still think there's too much setup to get started - but I'm somewhat convinced Ansible does a better job than a bunch of bespoke shell would, partly because Ansible comes with some "primitives"/concepts such as "make sure this version of this file is in this location on that server", which is quick to get wrong across heterogeneous distributions (see the sketch below).
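For instance, that file primitive as a one-off ad-hoc run of Ansible's copy module - the group name and paths here are hypothetical:

    # Ensure this exact file, ownership and mode on every host in the "webservers" group;
    # hosts where nothing differs simply report "ok" instead of being changed again.
    ansible webservers -m ansible.builtin.copy \
      -a "src=files/ntp.conf dest=/etc/ntp.conf owner=root group=root mode=0644" \
      --become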
We're moving towards managed Kubernetes (for applications currently largely deployed with Docker and docker-compose on individual VMs).
I do think the "make an appliance; run an appliance; replace the appliance" life cycle makes a lot of sense - I'm not sure if k8s does yet.
I think we could be quite happy on a Docker Swarm style setup - but apparently everything but k8s is being killed, or at least left for dead, by the various upstreams.
And k8s might be expensive to run in the cloud (a VM per pod?) - but it comes with abstractions that we (everyone) need.
Trying to offload to SaaS that which makes sense as SaaS - primarily a managed DB (we're trying out ElephantSQL) - and some file storage (PDF files hundreds of MB in size).
For bespoke servers we lean a bit on etckeeper in order to at least keep track of changes. If we were to invest in something beyond k8s (it's such a big hammer that one becomes a bit reluctant to put it down once picked up..) I'd probably look at GNU Guix.
It's like Terraform, except you can't review things for mistakes until it's already in the process of nuking something, which is terrible when you're inheriting an environment.
I find Jenkins X really interesting for my applications. It seems to solve a lot of issues related to CI/CD and automation in Kubernetes. However, it still lacks multi-cluster support.
I very much dislike ansible's YAML-based language and would hate to use it for configuration management beyond tiny systems, but it's pretty decent as a replacement for clusterssh and custom scripts.
I've tried Puppet and SaltStack, and I consistently find they are harder and more complex than Ansible. I can get something going in Ansible in short order.
Ansible really is my hammer.
A small startup shouldn't use any configuration management (assuming configuration management means software like Puppet, Chef, Salt, and Ansible). That is because small startups shouldn't be running anything on VMs (or bare metal). There are so many fully managed solutions out there. There is no reason to be running on VMs, SSHing to servers, etc. App Engine, Heroku, GKE, Cloud Run, whatever.
Once you get to the point where you need to run VMs (or bare metal), there are many options. A lot of systems are going to a more image + container based solution. Think something like Container-Optimized OS[1] or Bottlerocket[2], where most of the file system is read-only, it is updated by swapping images (no package updates), and everything runs in containers.
If you are actually interested in config management, I'll give my opinions, and a bit of history. I've used all four of the current major config management systems (Puppet, Chef, Salt, and Ansible).
Puppet was the first of the bunch; it had its issues, but it was better than previous config management systems. Twitter was one of the first big tech companies to use Puppet, and AFAIK they still do.
Chef was next; it was created by people who did Puppet consulting for a living. It follows a very similar model to Puppet and solves most of the problems with Puppet, while introducing some problems of its own (mainly complexity in getting started). In my opinion Chef is a clear win over Puppet, and I don't think there is a good reason to pick Puppet anymore. One of the biggest advantages is that the config language is an actual programming language (Ruby). All the other systems started with a language that was missing things like loops, and they have slowly grafted on programming language features. It is so much nicer to use an actual programming language. Facebook is a huge Chef user.
Salt was next. It was created by someone who wanted to run commands on a bunch of servers, and it grew into a configuration management system. The underlying architecture of Salt is very nice; it is basically a bunch of nodes communicating over a message bus. Salt has different "renderers"[3], which are the languages you write the config in, including ones that use a real programming language (Python). I'll come back to Salt in a minute.
Ansible... it is very popular. This is going to sound harsh, but I'm just going to say it. I think it is popular with people who don't know how to use configuration management systems. You know how the Flask framework started as an April Fool's joke[4], where the author created something with what he thought were obviously bad ideas, but people liked some of them? Ansible is so obviously bad, at its core, that I actually went and read the first dozen Git commits to see if there were any signs that it was an April Fool's joke.
There was a time a few years ago when Ansible's website said things like "agentless", "masterless", "fast", "secure", "just YAML". They are all a joke.
Ansible isn't agentless. It has a network agent that you have to install and configure (SSH). Yes, to do it correctly you have to actually configure SSH, a user, keys, etc. It also has a runtime agent that you have to install (Python). You have to install Python, and all the Python dependencies your Ansible code needs. Then it has the actual code of the agent, which it copies to the machine each time it runs, which is stupidly inefficient. It is actually easier to install and configure the agents of all the other config management systems than it is to properly install, configure, and secure Ansible's agent(s).
Masterless isn't a good thing, and a proper Ansible setup wouldn't be masterless. The way Ansible is designed is that developers run the Ansible code from their laptops. That means anyone making code changes needs to be able to SSH to every single server in production, with root permissions. And it also risks them running code that hasn't been committed to Git or approved. Any reasonable Ansible setup will have a server from which it runs, Tower, a CI system, etc.
Fast. Ha! I benchmarked it against Salt, writing the same code in both, managing the exact same things, and using local execution so Ansible wouldn't have an SSH disadvantage. Ansible was 9 times slower for a run with no changes (which is important because 99.9% of runs have no or few changes). It is even slower in real life. Why is it so slow? Well, SSH is part of it. SSH is wonderful, but it isn't a high performance RPC system. But an even bigger part of the slowness is the insane code execution. You'd think that when you use the `package` or `apt` modules to ensure a package is installed, it would internally call some `package.installed` function/method, and that the arguments you pass are passed to the function. That is what all the other configuration management systems do. But not Ansible. No, it execs a script, passing the arguments as args to the script. That means every time you want to ensure a package is still installed (it is, you just want to make sure it is), Ansible execs a whole new Python VM to run the "function". It is incredibly inefficient.
Secure. Having a network that allows anyone to SSH to any machine in production and get root isn't the first step I'd take in making servers secure.
It isn't just YAML. It is a programming language that happens to sort of look like YAML. It has its own loop and variable syntax, in YAML. Then it has Jinja templating on top of that. "Just YAML" isn't a feature. To do config management correctly you need actual programming language features, so use an actual programming language.
If I had to pick one again, I'd pick Salt. Specifically I'd use Salt with PyObjects[5] and PillarStack[6].
But I'll reiterate, you shouldn't start with a config management system. Start with something fully managed. Once you need a config management system, take the time to do it correctly. Like it should be a six week project, not a thing you do in an hour. Chef and Salt will take more time to get started, but if setup correctly they will be much better than any Ansible setup. If you don't have the time or knowledge to do Chef or Salt correctly, then you don't have the time or knowledge to manage VMs correctly, so don't.
[1] https://cloud.google.com/container-optimized-os
[2] https://aws.amazon.com/bottlerocket/
[3] https://docs.saltstack.com/en/latest/ref/renderers/
[4] https://en.wikipedia.org/wiki/Flask_(web_framework)#History
[5] https://docs.saltstack.com/en/latest/ref/renderers/all/salt....
[6] https://docs.saltstack.com/en/master/ref/pillar/all/salt.pil...
It is silly to ask "what should be used at FAANG scale", because either you are working at a FAANG and you are using what they use, or you are very unlikely to ever be at that scale -- and somewhere along the journey to getting there, you will either find or write the system that you need.
For teams that have a large IaaS footprint: Chef (agent-less actually adds complexity in this environment.)
Puppet for a polished production setup. Puppet has a robust and stable ecosystem and infrastructure, and it has been a client-server model from the beginning. It is easy to create a production library of all your Puppet modules, and it has Hiera for central config values and secrets management. At the same time, I hate Puppet's resource relations, and Puppet's architecture feels like something developed in 1991: an ugly monster monolith, and extremely heavy.
Terraform for actual low-level infrastructure management. And I don't like to put the whole high-level host configuration into IaC! IaC has only minimal host-configuration responsibilities: set the hostname, set the IP, register with Puppet or call Ansible - just a few lines in user-data or a bash script on boot, which then hands off to the actual configuration management (see the sketch below).
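A rough sketch of that hand-off, assuming a Puppet server; the hostnames here are purely illustrative:

    #!/bin/sh
    # Hypothetical user-data: the bare minimum before configuration management takes over.
    hostnamectl set-hostname "web-01.internal.example.com"
    # Point the agent at the Puppet server and do the first run; Puppet owns the host from here.
    puppet config set server puppet.internal.example.com --section agent
    puppet agent --test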
GitLab CI - switched from Jenkins. Concourse CI looks extremely interesting! Also reviewing some GitOps frameworks. Kubernetes - bare metal runs a self-made, Puppet-based pure k8s; also kops and EKS for AWS. Applications in k8s are managed via Helm.
Compared to Puppet, Ansible is less enterprisey; it is more like a hipster tool. I would like to replace Puppet with Ansible, but maybe I need help from all of YOU who have voted for Ansible. How do you achieve Puppet's level of management with Ansible? How do you achieve a client-server setup with Ansible - somehow I do not see lots of people using ansible-pull (without using Tower!)? Do you create a cron job with ansible-pull on node boot? :D Or is your whole Ansible usage limited to running ansible-playbook from your console manually? OK, maybe you sometimes put it in the last action of your CI/CD pipeline ;) Node classification and review? Central config values management for everything?
I use HashiCorp Vault and lots of other things too. Some of these questions are rhetorical; I've just expressed my mistrust of Ansible, which doesn't feel complete. :(
How do you manage a fleet of 1000, 500, or even 200 hosts with Ansible, when after provisioning you need to review your fleet, count groups, list groups, and check states? Ah, you want to suggest Consul for that role? :)
Kubernetes for the win. It will replace config management diversity. It gives you node discovery, state review, and much much much more.