Good Infrastructure as Code needs a lot of investigation
How to manage fixed IP addresses with AWS Autoscaling
Disposable infrastructure has its charm - we provision and configure cloud infrastructure for our customers with Terraform and Ansible, which allows a high degree of automation. An individual machine’s characteristics matter less than its definition and the agent that keeps it available.
Sometimes we have to deal with systems that were not designed for this paradigm from scratch. In a current project, Apache Zookeeper is one of those systems: the distributed coordination service relies on fixed IP addresses for all of its cluster members, and every machine has to know them in advance. Of course, the cluster itself is fault-tolerant, and a broken machine can be replaced without interrupting the service, but the replacement must have the same IP address as before. DNS is not an option, since addresses are only resolved at startup time.
In a Zookeeper scenario, the number of cluster members is fixed, so we don’t need elastic scaling from the underlying virtualization. But we wanted to automate the replacement of failed machines, so we decided to use AWS Autoscaling for that purpose.
Using Autoscaling for dropout recovery
That choice has one disadvantage: the configuration is split between the Autoscaling Manager (the entity that monitors the machines and creates replacements or scales down) and the Launch Template that defines the characteristics of the virtual machine to be created. There are some subtle, hidden semantics between the two, e.g. AWS Autoscaling doesn’t allow you to define multiple network interfaces with fixed addresses.
In our first approach, we tried to attach an Elastic Network Interface as a second interface at machine boot time, which worked “somehow”. However, we discovered that a second interface in the same subnet led to asymmetric routing on the machine, which was not what we wanted.
Then we tried to work with secondary IP addresses on the main interface, eth0. This works fine, except that AWS doesn’t allow the use of ENI IP addresses for that value, so you would have to steal IP addresses from the AWS-managed CIDR range - again, not as stable as we would have liked.
The third round was a weird engineering nightmare that used lifecycle hooks, SNS and CloudWatch to start a Lambda function, which changed the network bindings with Python code in a very early phase of the virtual machine life cycle. Once we had all the asynchronicity under control, this worked remarkably well, but the number of resources and dependencies was barely maintainable.
We got it!
Finally, we arrived at an efficient solution architecture with a very simple approach: instead of managing all Zookeepers in one autoscaling group, we switched to an array of autoscaling groups, each containing only one machine. In that case, the AWS Launch Template allows us to set a predefined Elastic Network Interface, which has a fixed IP address.
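The layout of one group per machine can be sketched as follows. This is only an illustration in Python, not the Terraform code from the project: the dictionaries are shaped like the request parameters of boto3's create_launch_template and create_auto_scaling_group calls, but no AWS call is made here, and all names, IDs and zones are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ZkNode:
    """One Zookeeper member; every value is a hypothetical placeholder."""
    index: int              # 1-based position in the cluster
    eni_id: str             # pre-created Elastic Network Interface holding the fixed IP
    availability_zone: str  # assumed to be the zone the interface lives in


def launch_template_data(node: ZkNode, image_id: str, instance_type: str) -> dict:
    """Launch template body that binds the pre-created interface as device 0."""
    return {
        "ImageId": image_id,
        "InstanceType": instance_type,
        "NetworkInterfaces": [
            {"DeviceIndex": 0, "NetworkInterfaceId": node.eni_id},
        ],
    }


def autoscaling_group_params(node: ZkNode, name_prefix: str = "zookeeper") -> dict:
    """Parameters for a group that holds exactly one machine."""
    name = f"{name_prefix}-{node.index}"
    return {
        "AutoScalingGroupName": name,
        # Assumes a launch template with the same name exists for this node.
        "LaunchTemplate": {"LaunchTemplateName": name, "Version": "$Latest"},
        "MinSize": 1,
        "MaxSize": 1,
        "DesiredCapacity": 1,
        "AvailabilityZones": [node.availability_zone],
    }


# Example cluster of three members, one group each (made-up identifiers).
nodes = [
    ZkNode(1, "eni-0aaa", "eu-central-1a"),
    ZkNode(2, "eni-0bbb", "eu-central-1b"),
    ZkNode(3, "eni-0ccc", "eu-central-1c"),
]
groups = [autoscaling_group_params(node) for node in nodes]
templates = [launch_template_data(node, "ami-0example", "t3.small") for node in nodes]
```

Because each group has a minimum and maximum size of one, it can replace a failed machine but never add a second one, so a single predefined interface per launch template is sufficient.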
Our dropout recovery works with one machine in the group just as well as with three, and the code to manage the configuration is simple: we use Terraform’s capability to create more than one instance from a single resource definition. The scripts that distribute the IP information into the Zookeeper configuration files are managed with Ansible, which is called after Terraform has finished.
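As an illustration of what such a script has to produce, the sketch below renders the server list for a Zookeeper configuration file from the fixed IP addresses. It is written in Python rather than as an Ansible template, the addresses are made-up examples, and the ports 2888 and 3888 are the ones conventionally shown in the Zookeeper documentation, assumed here rather than taken from the project.

```python
from typing import List


def render_server_lines(ips: List[str], peer_port: int = 2888,
                        election_port: int = 3888) -> List[str]:
    """Build one 'server.N=ip:peer_port:election_port' line per cluster member."""
    return [
        f"server.{number}={ip}:{peer_port}:{election_port}"
        for number, ip in enumerate(ips, start=1)
    ]


def my_id(ips: List[str], own_ip: str) -> int:
    """Return the number a member writes into its myid file (1-based list position)."""
    return ips.index(own_ip) + 1


# Hypothetical fixed addresses, one per autoscaling group.
fixed_ips = ["10.0.1.10", "10.0.2.10", "10.0.3.10"]

config_fragment = "\n".join(render_server_lines(fixed_ips))
print(config_fragment)
print("myid for 10.0.2.10:", my_id(fixed_ips, "10.0.2.10"))
```

Since the addresses never change, the same fragment can be written to every member, and a replacement machine only needs to derive its own number from the address it was given.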
Good Infrastructure as Code needs a lot of investigation: we needed the deep-dive experience to arrive at a simple and well-understood solution.