With Xen Orchestra you manage one or more XCP-ng hosts, combine them into pools, live-migrate between hosts, patch your hosts, and schedule VM backups to S3. It's a killer combination.
I feel Nomad is a bit resource-intensive for my use case, so I've ended up writing basically this:
- Groovy for a lot of syntactic power, decent performance, ability to make workflow-ish scripts
- GPars for orchestration primitives to do stuff either serially or in parallel.
- a configuration service so that configuration can be scoped at the global/environment/cluster/datacenter/rack/individual-node levels
- an access layer for executing remote commands and getting stdout/stderr back, and for sending and receiving files. The transport can be SSH, kubectl, AWS SSM, docker run, etc.
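For a rough idea of how an access layer plus serial/parallel orchestration primitives fit together, here's a minimal Python sketch. The real thing is Groovy/GPars; every class and function name below is made up for illustration:

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class Result:
    node: str
    stdout: str
    rc: int

class Transport:
    """Abstract access layer: could be SSH, kubectl exec, AWS SSM, docker run..."""
    def run(self, node: str, cmd: str) -> Result:
        raise NotImplementedError

class LocalEcho(Transport):
    """Stand-in transport for this sketch: just echoes the command back."""
    def run(self, node, cmd):
        return Result(node, f"{node}: {cmd}", 0)

def serial(transport, nodes, cmd):
    # Run a command on each node, one after the other.
    return [transport.run(n, cmd) for n in nodes]

def parallel(transport, nodes, cmd, workers=8):
    # Run the same command on all nodes concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda n: transport.run(n, cmd), nodes))

results = parallel(LocalEcho(), ["db1", "db2", "db3"], "uptime")
```

Swapping in an SSH- or SSM-backed Transport is what makes the same orchestration script work across access methods.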
I've had to rewrite the core framework four times, but it has enabled me to do complicated orchestrations: spawning clustered databases, fleets of load generators, data loaders, or API servers, then tearing them all down again, etc.
I can code distributed backups and restores, and other cluster-wide "applications" with this as the basis.
I can run things from the IDE, locally or from remote servers. I can cron them or whatever, and debug them from the IDE. I can do ad-hoc/arbitrary cluster-wide operations and commands using the Groovy Console.
I will probably open-source it at some point, but it's not a finished product. After three years, though, I can't imagine doing what I do with anything else.
It can be called by workflow engines, and it can call out to Terraform and other tools. It can do multi-cloud (with enough code :-) and mixed access/deployments (with enough code :-).
I guess the other limitation is the basic problem of a single orchestration node: it probably won't scale to thousands of VMs/nodes, or as efficiently as some distributed coordinator could. But those things all come with the one-orchestration/integration/workflow-to-rule-them-all products, and I don't want to be that.
However, for VM orchestration, OpenStack has been my choice for a long time, though I'm about to check out KubeVirt and (old and underappreciated?) Apache CloudStack.
In my case we use Ansible and Puppet to orchestrate VMs: Puppet for the stuff all VMs need, and Ansible for deploying custom software. Given that we have at most 8 VMs that are completely identical, Nomad makes very little sense.
Here comes a long one, but tl;dr: make sure you know your goal before you select your tools.
Orchestration is a somewhat broad concept and can be interpreted at many levels. From a management perspective it is about what processes are acting to the benefit of the organisation to fulfil a business need, regardless of the amount of automation. Having a room full of humans vs. a couple of binaries and event logic doesn't matter if the orchestrated result is the same.
Now, when it comes to your extra information, you seem to be using microservices, but deploying them as individual virtual machines instead of just containers. While I applaud the choice of better isolation, virtual machines aren't the only way to get it: Firecracker does this too, but much more efficiently, with the same protections (https://firecracker-microvm.github.io). So before going all-in on virtual machines, think about your goal and check whether this is really the best way to achieve it. For general business applications, this level of isolation usually isn't the main factor in safeguarding them; there are many other, more fragile layers where the actual problems occur. A well-isolated application with no AAA on its interfaces might as well not be isolated at all from a security perspective.
The selection of tools is probably going to be influenced by what else you are going to do. If you have automated infrastructure you'll probably want to tie in to that. Some automations have to trigger at deploy-time to (de-)provision at the right moment, while others have to simply react to the fact that something is needed and autonomously scale with the requirements.
If your (micro)service needs a database, it probably also needs configuration to access it, and the database needs to be configured as well. Perhaps there are some networking components, and storage if you have object or block persistence needs.
If you are running on AWS for example, your 'orchestrator' is EC2, but you still need to 'orchestrate the orchestrator' to start the correct configuration with all its dependencies prepared ahead of time. You'll probably want to prepare the virtual machine image so it actually contains everything it needs during build-time so it can just do the running at run-time. Packer comes to mind for such a job.
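To make the bake-at-build-time idea concrete, here's a small sketch of driving Packer from a build script. The template filename and the variable name are assumptions for illustration, not anything standard:

```python
import subprocess

def build_image(template: str, app_version: str, dry_run: bool = True):
    """Bake the application into a VM image at build time via Packer.

    The template path and the 'app_version' variable are hypothetical;
    they would be whatever your own Packer template defines.
    """
    cmd = [
        "packer", "build",
        "-var", f"app_version={app_version}",
        template,
    ]
    if dry_run:
        return cmd  # just show what would run
    subprocess.run(cmd, check=True)
    return cmd

cmd = build_image("app-image.pkr.hcl", "1.4.2")
```

The point is that everything version-specific is stamped in at build time, so the instance only has to boot and run at run time.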
Most systems that are 'good' got there because they had a lot of usage, because that is what generates data (feedback) on how well it does the job it needs to do. With that larger amount of usage comes a higher load from stakeholders and with that there tends to be a growth of commercial incentive turning a lot of products into some collection of commercial side-options or main-options.
To get a bit closer to a possible answer, I'll sketch out two scenarios, a managed one and an unmanaged one (from a Layer 1 and 2 perspective, hardware).
Scenario 1: Mostly managed primitives.
Depending on what you already have you are likely to have an API that interfaces with the storage, networking and compute resources that are available to you. Depending on the systems, Terraform and Ansible can configure and prepare almost anything to your liking and make sure it stays in the desired state.
When deploying an application as a virtual machine image, you'll probably have the following events happen:
1. SCM change triggers application build (Git and some CI solution)
2. Successful build triggers next integration step (CI)
3. Integration takes your build output and wraps it in a virtual machine image (Packer)
4. Successful integration triggers deployment step (CI)
5. Deployment step contacts your API of choice to let it know about the new image (Salt/Ansible/Terraform/FaaS/Webhook)
6. API registers the new image and you can start rolling over instances until all the old ones are gone
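Step 6, the roll-over, can be sketched as a simple loop that is independent of which API actually launches and terminates instances. The callables here are stand-ins you'd wrap around EC2 or your own control plane:

```python
import itertools

_ids = itertools.count(101)

def roll_over(instances, new_image, launch, terminate, healthy):
    """Replace instances running an old image, one at a time:
    launch a replacement, wait for it to become healthy, and only
    then terminate the old instance."""
    for old in list(instances):
        if old["image"] == new_image:
            continue  # already rolled over
        new = launch(new_image)
        if not healthy(new):
            terminate(new)  # keep the old instance running; fail loudly
            raise RuntimeError("replacement never became healthy")
        terminate(old)
        instances.remove(old)
        instances.append(new)
    return instances

# Stub API for the sketch; real code would call your cloud/control-plane API.
fleet = [{"id": 1, "image": "v1"}, {"id": 2, "image": "v1"}]
fleet = roll_over(fleet, "v2",
                  launch=lambda img: {"id": next(_ids), "image": img},
                  terminate=lambda inst: None,
                  healthy=lambda inst: True)
```

Launching before terminating keeps capacity constant during the roll, at the cost of briefly running one extra instance.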
Scenario 2: You have to DIY everything. The steps below are essentially things you have to do before you can even start thinking about Scenario 1, which is what follows after this.
1. Make your hardware boot from the network
2. Setup a network boot server
3. Prepare self-starting hypervisor images of your choice (KVM-based, Xen-based, Hyper-V-based, VMware-based)
4. Make sure your DHCP and/or TFTP servers have logs you can access/parse
5. Once a server boots, it loads a self-starting image and depending on how you configured the images they will start out 'empty' with a well-known key or other credential
6. Use something like Ansible or SaltStack to discover any servers and select the 'empty'/'fresh' ones
7. Depending on your selected hypervisor, you'll now have to connect to a management API, control plane or simply write out a list of nodes somewhere (useful for small-scale SaltStack setups!)
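Steps 4 through 6 above boil down to "parse the boot logs, subtract the machines you already know about". A small sketch, assuming dnsmasq-style DHCPACK log lines (your DHCP server's log format will differ):

```python
import re

# Hypothetical dnsmasq-style DHCP log excerpt; format is an assumption.
LOG = """\
dnsmasq-dhcp: DHCPACK(eth0) 10.0.0.21 52:54:00:aa:bb:01 node-a
dnsmasq-dhcp: DHCPACK(eth0) 10.0.0.22 52:54:00:aa:bb:02
"""

KNOWN = {"52:54:00:aa:bb:01"}  # MACs of machines already provisioned

def fresh_nodes(log: str, known: set):
    """Return (ip, mac) pairs for machines that network-booted
    but have not been provisioned yet."""
    acks = re.findall(r"DHCPACK\(\S+\) (\S+) ([0-9a-f:]{17})", log)
    return [(ip, mac) for ip, mac in acks if mac not in known]

print(fresh_nodes(LOG, KNOWN))
```

The output of something like this is exactly what you'd feed to Ansible/SaltStack as the list of 'empty'/'fresh' targets.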
At this point, the 'compute' is done and ready for deployments. A big issue here is that we've made many assumptions, and the number of new variables is so large that it's hard to take the steps much further. Next, you'll need storage, which is much too broad to cover here, but it generally ends up being some sort of redundant NFS, iSCSI, or a more modern setup like Ceph. Those are all configurable via SaltStack or Ansible.
You'll want networking after that, something like a vswitch with VLANs maybe. A very simple way to do this is to start an OpenWRT or OPNsense VM with a bunch of pre-loaded virtual interfaces, subnets, and firewall rules. How exactly this works depends very much on what hypervisor and control plane you have. VMware with NSX is very different from Proxmox with Open vSwitch (but they do the same thing).
When you now want to 'deploy' a VM, you use your Ansible/Salt/Terraform-enabled infrastructure where you can 'update' the base configuration of a VM-based application and let Ansible/Salt/Terraform create new VMs and destroy the old ones. Configurations might be flat files written out to the network-booted compute nodes, or some configuration database in a controlplane behind an API.
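That 'update the base configuration and let the tool create/destroy VMs' step is, at its core, a desired-state diff. A minimal sketch (the data shapes are made up; real tools carry far more per-VM detail):

```python
def reconcile(desired: dict, actual: dict):
    """Diff the desired VM configuration against what's running and return
    the actions an Ansible/Salt/Terraform run would then carry out.
    Keys are VM names, values are their configuration."""
    create  = {n: c for n, c in desired.items() if n not in actual}
    destroy = [n for n in actual if n not in desired]
    replace = {n: c for n, c in desired.items()
               if n in actual and actual[n] != c}
    return create, destroy, replace

desired = {"web1": {"image": "v2"}, "web2": {"image": "v2"}}
actual  = {"web1": {"image": "v1"}, "db0":  {"image": "v1"}}
create, destroy, replace = reconcile(desired, actual)
```

Whether the resulting actions land as flat files on network-booted nodes or as API calls to a control plane is then just a transport detail.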
Now, all of this can be partially homelab-automated with things like Proxmox, XCP-ng, VMware, etc. But as soon as you actually need more facilities, like actual API-configured load balancers, databases, metrics, and discovery, you'll end up home-cooking a complete OpenStack replacement. At that point, just get OpenStack, or realise that the economies of scale you get with Azure/GCP/AWS let you do what makes you money instead of busywork that, in many cases, doesn't actually matter in the end.