But in the majority of cases, developers are very much aware of the environments their code runs on: they know that their containers are stored in ECR and run in ECS, and that their data is stored in S3 and RDS.
It is trivial to build a container, upload it to ECR and then deploy it to ECS from a shell script. And it is a lot more readable and comprehensible for a person not familiar with the tool.
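For example, a minimal sketch of that flow with the AWS CLI and Docker might look like the following (the account ID, region, repository, cluster and service names are made-up placeholders, and it assumes the ECS task definition references the :latest tag):

#!/usr/bin/env bash
set -euo pipefail

# Made-up placeholders - substitute your own values.
REGION="eu-west-1"
ACCOUNT_ID="123456789012"
REPO="my-app"
REGISTRY="${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com"
IMAGE="${REGISTRY}/${REPO}:latest"

# Log in to ECR, then build and push the image.
aws ecr get-login-password --region "$REGION" | docker login --username AWS --password-stdin "$REGISTRY"
docker build -t "$IMAGE" .
docker push "$IMAGE"

# Force a new deployment so the service pulls the image we just pushed.
aws ecs update-service --cluster my-cluster --service my-app --force-new-deployment --region "$REGION"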
Maybe I am the problem and I just don't get the declarative style, where you only describe the wanted state, not the steps you take to achieve that state?
If we assumed by default that our cloud infrastructure provider is AWS, wouldn't it be simple to write a shell script that would call `aws-cli` a few times?
I came to that question when discussing a problem my friend, a DevOps engineer, was having - he wasn't able to get the Azure Resource Group from his `foobar.tf` files and ended up with something like the following:
cat << EOF
{
"aksvnet": "$(az resource list --resource-group $1 | grep -i \"aks-vnet- | cut -d ":" -f2 | tr -d '", ')"
}
EOF
And what is it? It is a shell script inside a JSON document that was itself created in the shell! What are these layers of abstraction for? Why does he have to wrap the Resource Group name in JSON? Why couldn't it just be piped in plain text, as all the tools that try to be POSIX-compatible do?

UPD: This question is for environments where all the engineers are (to some extent) familiar with %CloudProviderSDK% and bash. And, in my opinion, it's a lot easier to pick up bash and %CloudProviderSDK%, as those are imperative and therefore closer to engineers' daily routine, as opposed to Terraform's declarative style. Shell scripts, in my opinion, are just more intuitive by default.
I don't think having a unified interface is the motivation behind Terraform. You still need to understand the underlying resources you are dealing with, Terraform doesn't abstract that at all. The big idea behind Terraform is procedural vs declarative. You can write scripts to bring up all of your infrastructure but what if one of your scripts fails in the middle? What parts of it actually went into effect and which didn't? Can you just re-run it or will the first part now fail because the infrastructure already exists? What if you have several engineers working on the same environment applying scripts that may interfere with one another? What if there was an incident and you made some manual changes and now production is out of sync with what is represented in the script? What if you made some complicated infrastructure changes and you broke something and want to bring everything back to exactly how it was before?
Declarative infrastructure answers all of those questions. It lets you keep track of what the current state of your infrastructure is, and what you want it to be. It automatically identifies areas where the two don't match up and serves as a forcing function for documenting changes to your infrastructure. Declarative infrastructure is more complicated than procedural, because bringing up infrastructure is a procedural process, so you need a tool to turn it into a declarative one, and that is not always easy. But if your team's needs get complex enough, the tradeoff is well worth it. I honestly can't even imagine life without it.
As a bonus it makes it easy to ship complete infrastructure solutions as re-usable modules that you can compose.
But a lot of applications have infrastructure far, far more complex than a single service running in a container and S3/RDS. It may involve a large number of lambdas, networks, API gateways, firewalls, proxies, certificates, etc.
Past a certain point, you need a way of managing all that complexity, keeping things consistent across environments/regions, ensuring all infrastructure changes are tracked and audited, and making it easier to update lots of resources at once, among other things. That's where Terraform helps.
This has absolutely not been my experience. I've worked with a few devs who might be curious to know how everything worked. Most devs I've worked with focus solely on the code they write.
I've also inherited many systems over the years and I'd take the ones managed with tf over bash every single time.
A non-exhaustive list of what tf helps with:
1. Being able to know what has changed and what needs to change before you run
2. Managing infra outside of the large cloud providers and being able to combine the two
3. Quickly being able to add a new environment or region to an existing cluster
4. Some requirement has changed and some new policy/tool needs to be stitched in across all your environments
Terraform gives you a common language to make sense of it all that can grow as your cloud infra does.
When combined with git and CI/CD it's also an amazing self-service experience. For example, you can put the Terraform code that describes your environment in a git repo, allow any employee to open pull requests, deploy changes on merge automatically, and require IT approval to merge. Now any engineer can self-service request access to a prod environment (by modifying IAM in Terraform), or configure a production deployment without ever needing actual access to prod. IT gets an audit log, they get a control gate (the code review), and engineers get to self-service changes, which reduces the load on IT.
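A rough sketch of the commands such a pipeline might run (the CI system itself, the backend configuration and how the saved plan is passed between jobs are left out):

# On a pull request: show reviewers exactly what would change.
terraform init -input=false
terraform plan -input=false -out=tfplan

# On merge: apply exactly the plan that was reviewed.
terraform apply -input=false tfplan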
You have to be careful not to run your bash script twice or you get another instance/vpc/loadbalancer or whatever.
You run "terraform apply" twice and it does nothing on the second run.
If you start implementing that in your shell scripts, you start implementing terraform in bash.
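As a taste of what that means, here is a rough sketch of the existence check you would need for a single tagged VPC (the tag value and CIDR are made up), before you even get to updates or deletions:

# Only create the VPC if one with this Name tag doesn't already exist.
vpc_id="$(aws ec2 describe-vpcs --filters "Name=tag:Name,Values=my-vpc" --query 'Vpcs[0].VpcId' --output text)"
if [ "$vpc_id" = "None" ]; then
  vpc_id="$(aws ec2 create-vpc --cidr-block 10.0.0.0/16 --query 'Vpc.VpcId' --output text)"
  aws ec2 create-tags --resources "$vpc_id" --tags Key=Name,Value=my-vpc
fi
# ...and you'd need the same dance, plus change detection, for every other resource.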
Terraform gives you a unified way to manage your resources. Sure, the bash scripts work for you, but what happens if you leave the company? Somebody else has to maintain your shell scripts.
What happens if somebody else needs to change the infrastructure and they're not familiar with your shell script? They need time to dig in to figure things out, then update it, and in the best case test it.
And you need to keep your scripts up to date, build in fault tolerance, and think about how you're going to deploy new resources. How are you going to handle destroying resources?
And on top of that you also need to learn the cloud provider's CLI tools or API to know what kind of calls to execute.
It just provides a standardised way to manage your infra.
Having said that, I (so far) like terraform for the same reason you noted: it's more readable and there is great tooling around it. I like the state management and the ability to invoke lower-level components (like the shell breakout in your example) when you really have to.
edit: > the declarative style, where you only describe the wanted state, not the steps you take to achieve that state
That is a good and useful thing. It's called "desired configuration management"; Ansible works the same way. When the underlying tool works well, it decides on its own how to implement what you want. If you ever watch terraform deploy a complex (10+ dissimilar resources) infrastructure, it comes close to magic how it discovers what has already been done, what still needs to be done, and in which order.
So, if you have terraformed a load balancer balancing load between 2 machines, and change your terraform to declare a load balancer balancing load between 3 machines, it won’t destroy two machines, destroy the load balancer, create a new load balancer, and then create 3 machines.
Instead, it will create a new machine and change the load balancer to know about it, so that your service is uninterrupted.
Problem is that the above isn’t quite true.
Firstly, comparing with the current state is slow, so terraform has a cache of what it thinks the current state is. If they get out of sync, things can get interesting.
Secondly, all changes are done by plugins of varying quality. Your cloud provider may, for example, support reconfiguration of a load balancer, but if the plug-in doesn't, terraform will destroy it and create a new one.
Based on what I've read, while Hashicorp tools may look like their only contribution is platform-agnostic tooling, a deep dive into the docs reveals a focus on dynamically changing architectures and on tooling to scale them, on short time scales, to any number of resources (i.e. not just machines/VMs/containers/compute resources, but resources like users, user-generated resources, user-generated secrets, etc.).
My impression thus far is that Hashicorp is aware of the variety of alternative tools; that's why their certifications / training / professional services are only available for the tools truly core to supporting dynamic architectures: Terraform, Consul and Vault.
https://www.hashicorp.com/customer-success/professional-serv...
https://www.hashicorp.com/customer-success/enterprise-academ...
In case you need to get the metadata of a resource group you can use this: https://registry.terraform.io/providers/hashicorp/azurerm/la...
I am a very happy Terraform user, here are the benefits for me:
* Very simple workflow that helps prevent unintended consequences - first you write your code, generate a plan, inspect it carefully and only then apply. It is easy to work in a team setting where you can have one person write modules and others supply variables to them.
* I personally don't want to burden myself with Azure Resource Manager, CloudFormation or any other vendor specific IAC tool.
* I don't like other people's bash; there are tools like ShellCheck, but usually a larger infra codebase becomes an awful ad-hoc mess of ENV variables and clever hacks. And infrastructure code is nasty to test and refactor.
Try to keep it as simple as possible; anytime you are fighting Terraform, it usually means there is a much simpler way to do it. And if there is inherent complexity, it could be the wrong thing to do.
In case you need very dynamic behaviour (basically a part of an application), I advise the following: put in terraform the things that are not likely to change often, or where the cost of breakage is higher - your virtual networks, DNS configuration, load balancers, VPNs, autoscaling groups, important alerts, etc. Manage more ephemeral workloads in a more general-purpose language if there is no straightforward way to do it in the official APIs. I am also a very happy user of the AWS CLI in some cases, plus the Cognitect AWS libraries for Clojure. However, if you need to do something very dynamic, it is also likely to be wrong.
At Nimbus[1], we have been trialling Terraform for template definitions, as users are mostly familiar with it and it allows them to integrate more easily with existing CI/CD processes. They can easily just add a new Nimbus Workspace to their Terraform and have it spin up a new development environment when their CI requires it.
[1]: https://usenimbus.com - Easy remote development infra for teams
You still need to write down how to do it, how to bootstrap it, and why you do things.
We have a basic tf layer, which does make it well documented, easily extendable and repeatable.
Yes, we do destroy the whole setup and recreate it. Not often, but still.
After the tf layer, there is only k8s which is also 100% IaC.
Also, sorry to say, but we are experts; learning something like tf should not be a big hurdle.
What I saw in old sysadmin setups: tons of snowflake VMs nobody knows why they exist, random setups with different security versions on them.
If you don't have any tool to automate things, you will not do it.
Feel free to create a small infra setup manually if you prefer; I prefer to codify it once and be able to recreate it, instead of documenting it in some Word doc.
Tf is not perfect btw.
It was also made for non-developers to be able to deploy what someone else built "anywhere"
The shell script needs to determine, for each resource, whether it exists; if it does, what changes to make and how to translate those into API calls; if it doesn't, how to create it; and it also needs to clean up any resources no longer in the desired state.
Attributes of some resource that might exist only after creation need to be fed into other resources…
For even a single resource, over the lifetime of the many changes and adjustments to the resource, that is extremely complicated to do correctly in shell alone.
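As a sketch (made-up CIDRs), even a two-resource dependency means capturing IDs by hand and keeping the ordering in your head:

# The subnet can only be created once the VPC ID is known, so the script
# has to capture the ID and thread it through explicitly.
vpc_id="$(aws ec2 create-vpc --cidr-block 10.0.0.0/16 --query 'Vpc.VpcId' --output text)"
subnet_id="$(aws ec2 create-subnet --vpc-id "$vpc_id" --cidr-block 10.0.1.0/24 --query 'Subnet.SubnetId' --output text)"
# Terraform derives this ordering from references between resources; here you maintain it yourself.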
The declarative "desired state" style is more useful since the steps required to be undertaken often depend on the state of the infrastructure that exists, or doesn't exist.
(Additionally, you'll also need to record what infra exists and what doesn't, store that somewhere, and transmit that state to coworkers … and TF handles that, too. While "it's obvious" for some infra — i.e., the resource has a natural key — not all resources do, and often you have to deal with unmanaged resources and not decide to delete them simply because they're not part of your desired state.)
Lastly, you have to handle bugs and design flaws in the APIs. I've worked with a number of platforms where two valid calls to the API in a shell script are a race condition because the API doesn't support read-your-writes.
All this reinvents the wheel that is TF.
There's also "why does this infra exist?": I can comment TF, I get a commit history and rationales for why infra exists. Shell scripts really push people towards "I'll just #yolo this small change to the infra" … and now, I don't know why the infra is the way it is. Often, I find dev/prd have drifted, or two prod instances of the "same" thing are really different. Comments cut down on this, TF modules really cut down on it, etc.
> And what is it? It is a shell script inside of a JSON that was created in the shell! What for are these layers of abstraction? Why does he have to wrap the Resource Group name in a JSON? Why couldn't it be just piped in plaintext format, as all the tools that try to be POSIX-compatible do?
JSON is a text format. Your shell scripter has piped that into what amounts to a buggy, broken, 5% reimplementation of a JSON parser. Pipe that to `jq`, instead. (You can also use --query on az to reduce the output to something that will be more easily handled by `jq`, but anything --query can do, jq can too, pretty much, and it might be better to have all the code in one language.)
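For example (a sketch; the aks-vnet- prefix filter is assumed from the original grep), either of these replaces the grep/cut/tr chain:

# Parse the JSON with jq instead of grep/cut/tr.
az resource list --resource-group "$1" -o json | jq -r '.[] | select(.name | startswith("aks-vnet-")) | .name'

# Or let az filter with JMESPath and skip jq entirely.
az resource list --resource-group "$1" --query "[?starts_with(name, 'aks-vnet-')].name | [0]" -o tsv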
Or just request that data from terraform, by accessing the appropriate attribute of that resource.
While there are platform-specific alternatives like CloudFormation, learning a new system for each platform would be a pain, and frankly, things like CloudFormation just aren't as nice to work with compared to Terraform.