Multiple AWS Accounts and Golden Image Management with Ansible



1. The Ansible EC2 dynamic inventory script is a “must” when playing with AWS.

1.1. Leverage custom groups

If you use Amazon Web Services and especially EC2, maintaining an inventory file might not be the best approach, because hosts may come and go over time, be managed by external applications, or you might even be using AWS autoscaling. For this reason, you can use the EC2 external inventory script.

Each EC2 instance can have a variety of key/value pairs associated with it called Tags. The most common tag key is Name, though anything is possible.

This inventory script creates groups based on instance tags, in the format tag_KEY_VALUE. You will be able to target hosts using this kind of syntax:

  • tag_Name_redis_master_001 -> Target a host based on its Name tag
  • security_group_webservers -> Target all hosts in the security group named “webservers”

Running an Ansible playbook will look like:

ansible-playbook tomcat.yaml -l tag_Name_webapps

Wouldn't it be nice to be able to run:

ansible-playbook tomcat.yaml -l webapps

To make things easier, the EC2 external inventory script was extended with custom grouping logic based on the following tags:

  • AnsibleRole -> The service the server is running (tomcat, elasticsearch, nginx).

  • AnsibleClusterId -> A unique ID which identifies the cluster (ad-server, elasticsearch-logstash, gumgum-webserver).

  • AnsibleClusterRole -> The role of the server in the cluster. For example, an Apache Storm cluster is made of a Nimbus server and one or more Supervisors, so the tag will be respectively set to either nimbus or supervisor.

The change we made in the EC2 external inventory script groups by:

  • <AnsibleRole>

  • <AnsibleRole>-<AnsibleClusterId>

  • <AnsibleRole>-<AnsibleClusterRole>-<AnsibleClusterId> (when AnsibleClusterRole is defined)

The disk status can now be easily checked on 3 Elasticsearch clusters with one command:

### Target all servers running Elasticsearch
ansible <AnsibleRole> -s -i <path_to_custom_ec2.py> -m shell -a "df -ah /mnt"

# Example
ansible elasticsearch -s -i <path_to_custom_ec2.py> -m shell -a "df -ah /mnt"

If you need to dig into the main Elasticsearch cluster, the following command can be used:

### Target all servers running Elasticsearch that are part of the “Main” cluster
ansible <AnsibleRole>-<AnsibleClusterId> -s -i <path_to_custom_ec2.py> -m shell -a "df -ah /mnt"

# Example
ansible elasticsearch-elasticsearch-main -s -i <path_to_custom_ec2.py> -m shell -a "df -ah /mnt"

Here is the git diff between the custom ec2.py and the ec2.py from the devel branch:

+        # Inventory: Group by custom GumGum tags
+        # Group by <AnsibleRole>: cron, elasticsearch.
+        # Group by <AnsibleRole>-<AnsibleClusterRole>: druid-overlord, storm-nimbus.
+        # Group by <AnsibleRole>-<AnsibleClusterRole>-<AnsibleClusterId>: druid-overlord-va-druid-0-8, storm-nimbus-storm-rtb.
+        # ..... or <AnsibleRole>-<AnsibleClusterId>: cron-sope, elasticsearch-main.
+
+        if self.group_by_gumgum_custom:
+            if 'AnsibleRole' in instance.tags.keys():
+                gg_ansible_role = self.to_safe(instance.tags['AnsibleRole'])
+                self.push(self.inventory, gg_ansible_role, dest)
+
+                gg_ansible_cluster_role = None
+                if 'AnsibleClusterRole' in instance.tags.keys():
+                    gg_ansible_cluster_role = self.to_safe(instance.tags['AnsibleRole'] + '-' + instance.tags['AnsibleClusterRole'])
+                    self.push(self.inventory, gg_ansible_cluster_role, dest)
+
+                if 'AnsibleClusterId' in instance.tags.keys():
+                    if gg_ansible_cluster_role is None:
+                        gg_ansible_cluster_id = self.to_safe(gg_ansible_role + '-' + instance.tags['AnsibleClusterId'])
+                    else:
+                        gg_ansible_cluster_id = self.to_safe(gg_ansible_cluster_role  + '-' + instance.tags['AnsibleClusterId'])
+
+                    self.push(self.inventory, gg_ansible_cluster_id, dest)
+
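As a standalone illustration (a simplified re-implementation for this post, not the actual patched ec2.py), the grouping logic above boils down to this:

```python
import re


def to_safe(word):
    # Simplified stand-in for ec2.py's to_safe(): replace characters that are
    # unsafe in Ansible group names with underscores. Hyphens are kept here so
    # the resulting group names match the examples above.
    return re.sub(r"[^A-Za-z0-9\-]", "_", word)


def custom_groups(tags):
    """Return the custom group names derived from an instance's tags."""
    groups = []
    if "AnsibleRole" not in tags:
        return groups
    role = to_safe(tags["AnsibleRole"])
    groups.append(role)

    cluster_role = None
    if "AnsibleClusterRole" in tags:
        cluster_role = to_safe(tags["AnsibleRole"] + "-" + tags["AnsibleClusterRole"])
        groups.append(cluster_role)

    if "AnsibleClusterId" in tags:
        base = cluster_role if cluster_role is not None else role
        groups.append(to_safe(base + "-" + tags["AnsibleClusterId"]))
    return groups


print(custom_groups({"AnsibleRole": "elasticsearch", "AnsibleClusterId": "elasticsearch-main"}))
# ['elasticsearch', 'elasticsearch-elasticsearch-main']
```

An instance tagged AnsibleRole=storm, AnsibleClusterRole=nimbus, AnsibleClusterId=storm-rtb ends up in the groups storm, storm-nimbus and storm-nimbus-storm-rtb, exactly as in the diff comments.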

This custom grouping was not chosen randomly. You will see in the next section how to leverage the custom groups with multiple inventories and the concept of group_vars that Ansible offers.


2. Multi AWS account (or region) management

2.1 The concept of inventory

While working with configuration management software like Ansible, you want to be able to run your automations against all kinds of servers, whether or not they live in the same AWS account.

Even though you want to use the same roles and playbooks against all your servers, you may want to keep settings and variables related to each AWS account separate. It’s quite similar to the MVC pattern: playbooks and roles can be considered “controllers”, whereas variables and specific configurations are “models”. Controllers (Ansible roles) are fed the right models (variables and settings) depending on which AWS account they run in.

Here is the list of AWS accounts and regions managed with the same Ansible repository at GumGum:

[Table: AWS accounts and regions managed at GumGum]

Here is how inventories are being organized in the source code repository:

~/workspace/ops/ansible/inventories (master)$ tree -L 1 *
--------------------------------------------------------------------------------------------
gumgum              ### AWS ACCOUNT: GUMGUM - AWS_REGION: US_EAST_1 - EC2_CLASSIC
├── ec2.ini        # Provides region specific settings to run Ansible in US_EAST_1 region
├── ec2.py         # Custom EC2 external inventory (Mentioned earlier)
├── group_vars/    # Contains US_EAST_1 cluster specific variables
└── localhost      # Static inventory with one entry for localhost
--------------------------------------------------------------------------------------------
virginia            ### AWS ACCOUNT: GUMGUM - AWS_REGION: US_EAST_1 - VPC
├── ec2.ini        # Provides region specific settings to run Ansible in US_EAST_1 region
├── ec2.py         # ...
├── group_vars/    # Contains US_EAST_1 cluster specific variables
└── localhost      # ...
--------------------------------------------------------------------------------------------
ireland             ### AWS ACCOUNT: GUMGUM - AWS_REGION: EU_WEST_1 - VPC
├── ec2.ini        # Provides region specific settings to run Ansible in EU_WEST_1 region
├── ec2.py         # ...
├── group_vars/    # Contains EU_WEST_1 cluster specific variables
└── localhost      # ...
--------------------------------------------------------------------------------------------
california          ### AWS ACCOUNT: GUMGUM - AWS_REGION: US_WEST_1 - VPC
├── ec2.ini        # Provides region specific settings to run Ansible in US_WEST_1 region
├── ec2.py         # ...
└── group_vars/    # Contains US_WEST_1 cluster specific variables
└── localhost      # ...
--------------------------------------------------------------------------------------------
mantii              ### AWS ACCOUNT: MANTII - AWS_REGION: US_EAST_1 - VPC
├── ec2.ini        # ...
├── ec2.py         # ...
├── group_vars/    # ...
└── localhost      # ...
--------------------------------------------------------------------------------------------
sandbox             ### AWS ACCOUNT: SANDBOX - AWS_REGION: US_EAST_1 - VPC
├── ec2.ini        # ...
├── ec2.py         # ...
├── group_vars/    # ...
└── localhost      # ...

For latency and account isolation reasons, a dedicated Ansible server is used per inventory. To keep the ansible-playbook command as easy to use as before, the following alias was created:

### Alias command
alias play="ansible-playbook \
    --vault-password-file ~/.ansible-vault-secret \
    --inventory <PATH_TO_REPO>/ansible/inventories/<AWS_ACCOUNT>"

### Where:
#    PATH_TO_REPO : Path where the root "ops" repository is deployed on the server
#    AWS_ACCOUNT : Name of the account for example: "virginia", "sandbox", "mantii"

Each Ansible server is properly configured based on where it is located. The play alias offers a way to specify the path for the Vault password file.

2.2 The concept of group_vars

As mentioned before, the group_vars folder is the place to put your group variables. This folder can contain files or subfolders. You are free to use it any way you want.

2.2.1 The default group called “all”

The file (or folder) called group_vars/all is a place where you can define variables that will be accessible from all your remote hosts. For example, the EC2 VPC ID can be put there, as well as other account-level variables like the Java version you want to deploy on all your hosts.

Besides being a place to drop universal variables, it’s also a parent of all other groups. You can easily override variables that are defined at the all group level.

Example:

### The default value for `java_default_version` is 8 for the entire GumGum account
# file: gumgum/group_vars/all
---
java_default_version: 8


### But for some reason we want all servers running Tomcat to use Java 7
# file: gumgum/group_vars/tomcat
---
java_default_version: 7

2.2.2 Organization of the group_vars folder

As an advanced use case, you can create directories under group_vars, and Ansible will read all the files in these directories. The files should either have no extension or be suffixed with .json, .yaml or .yml. Variable files encrypted with Ansible Vault can also sit in the group_vars folder.
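As a quick sketch of that loading rule (a hypothetical helper written for this post, not part of Ansible), here is which files a group_vars subdirectory would contribute:

```python
from pathlib import Path


def loadable_var_files(group_dir):
    """Sketch of which files in a group_vars/<group>/ directory Ansible will
    read: files with no extension or a .json/.yaml/.yml suffix (hidden files
    skipped). Illustrative only -- Ansible's real loader has more rules."""
    exts = {"", ".json", ".yaml", ".yml"}
    return sorted(
        p.name
        for p in Path(group_dir).iterdir()
        if p.is_file() and not p.name.startswith(".") and p.suffix in exts
    )
```

For a directory containing vars.yaml, vault.yml and notes.txt, only the first two would be picked up.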

Most GumGum playbooks leverage multiple roles (commonly: common packages, user configuration, service installation). To keep things organized, the following structure was adopted:

$ tree gumgum/group_vars/

group_vars/
└── group_name          # Name of the group (mysql, tomcat, ...)
--------------------------------------------------------------------------------------------
    └── role_A          # Namespace where role_A variables stand and can be overridden
         ├── vars.yaml  # Non-sensitive variables (Software version, Log path, ...)
         └── vault.yaml # Sensitive secrets (AWS credentials, passwords, ...)
--------------------------------------------------------------------------------------------
    └── role_B
         ├── vars.yaml
         └── vault.yaml
--------------------------------------------------------------------------------------------
    └── role_C
        ├── vars.yaml
        └── vault.yaml

Here is an example for the mysqldb group:

$ tree gumgum/group_vars/mysqldb

group_vars/mysqldb/
    └── users
         └── vars.yaml    # SSH Users allowed on the system
--------------------------------------------------------------------------------------------
    └── mysql
         ├── vars.yaml    # MySQL users and tables that need to be created
         └── vault.yaml   # MySQL passwords
--------------------------------------------------------------------------------------------
    └── nagios
         └── vars.yaml    # Specific Nagios configuration for MySQL monitoring

2.2.3 Extending the concept of parent / children

Ansible does not offer the concept of “grand-parent”, but there is a workaround to implement this.

Think about having universal variables in the all group, then your service-related group (mysql, tomcat, elasticsearch) and finally the cluster group (elasticsearch-main, elasticsearch-logstash).

The only thing you need to know is how Ansible resolves variables that are defined in multiple places. As seen above, specific group variables override those defined in the all group. Now, if an instance is part of two groups, say tomcat and nginx, what will be the value of a variable that is defined in both groups?

Well, Ansible simply uses alphabetical order, so the variable will be overridden with the value from group_vars/tomcat/vars.yaml. This constraint needs to be taken into account while implementing the concept of “grand-parent”.
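Here is a toy model of that precedence rule (illustrative only, not Ansible's real variable engine, which also accounts for group depth and other sources):

```python
def resolve_var(var_name, host_groups, group_vars):
    """Toy resolution: start from the 'all' group, then apply the host's
    groups in alphabetical order, so the last group alphabetically wins."""
    value = group_vars.get("all", {}).get(var_name)
    for group in sorted(host_groups):
        if var_name in group_vars.get(group, {}):
            value = group_vars[group][var_name]
    return value


group_vars = {
    "all":    {"java_default_version": 8},
    "nginx":  {"java_default_version": 8},
    "tomcat": {"java_default_version": 7},
}

# 'tomcat' sorts after 'nginx', so its value wins for a host in both groups.
print(resolve_var("java_default_version", ["tomcat", "nginx"], group_vars))  # 7
```

This is exactly why the naming convention below puts the most specific group last alphabetically within a service's groups.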

Now it is time to remember the custom grouping seen in the first part of this post. Here are the three custom groups (from most general to most specific):

  • <AnsibleRole>

  • <AnsibleRole>-<AnsibleClusterId>

  • <AnsibleRole>-<AnsibleClusterRole>-<AnsibleClusterId>

Thanks to this naming convention, the concept of “grand-parent” can be implemented. It allows you to define variables at the service level (cassandra, druid, zookeeper), then define more specific variables at the cluster level (cassandra-realtime, cassandra-analytics) or at the cluster-role level (storm-nimbus-storm-rtb, storm-supervisor-storm-rtb) when this concept applies to the technology (for example, it is needed for Storm and Druid).


Let’s see how to use it for the Cassandra service. Here is an example of the cassandra group (the parent group of all Cassandra clusters):

group_vars/cassandra
└── ec2-001         # Defines common EC2 cassandra vars (Tags like AnsibleRole
    └── vars.yaml   # will be set to `cassandra`).
--------------------------------------------------------------------------------------------
└── ganglia-001     # Cassandra clusters are monitored with Ganglia. This file defines
    └── vars.yaml   # which node is going to be the master node which collects metrics.
--------------------------------------------------------------------------------------------
└── logstash-client-001 # Logstash is deployed on our Cassandra clusters so that alerts
     └── vars.yaml      # based on error thresholds can be configured. The logstash config
                         # is the same for all Cassandra clusters:
                         # - Parsing /var/log/cassandra/system.log
--------------------------------------------------------------------------------------------
└── raid0         # Cassandra clusters use by default a RAID 0 over 4 disks to store
    └── vars.yaml # their data and a separate disk for their commit logs (Instance storage
                   # disk). This role ensures everything is formatted and mounted properly.

Now that common Cassandra variables are defined, the focus can be placed on specific cluster variables without having to redefine all the previous set.

group_vars/cassandra-cassandra-analytics
├── cassandra-001 # Provides specific vars like cassandra.yaml configuration file content,
│   └── vars.yaml # the cassandra version, the opscenter node it should refer to.
│
├── ec2-001       # The Cassandra analytics cluster is a 6-node r3.2xlarge cluster; all
│   └── vars.yaml # these settings go in this file, as well as the VPC subnet.
│
├── ganglia-001   # For some reason we want to hardcode the IP of the Ganglia master node;
│   └── vars.yaml # we can override the parent configuration in this file.
│
├── raid0         # The analytics cluster uses one more disk than other Cassandra clusters;
│   └── vars.yaml # we need to make sure the RAID 0 will not include this disk reserved for Spark.
│
└── spark-001     # Spark is deployed on the analytics Cassandra cluster. This file
    └── vars.yaml # overrides variables used to template the spark env configuration file.
--------------------------------------------------------------------------------------------
group_vars/cassandra-va-cassandra-realtime
├── cassandra-001 # Same as above but with different settings
│   └── vars.yaml # …
│
└── ec2-001       # The Cassandra realtime cluster is a 21-node r3.2xlarge cluster. Again,
    └── vars.yaml # the settings go in this file.

This example concludes the second part of this post. It shows how to leverage structured groups within group_vars in order to make things manageable at a higher level. The concept of grand-parent is really powerful and gives you a way to decrease the size of your variable files by splitting them into logical entities.


3. From nothing to the “Golden Image” exclusively with Ansible

It is a pretty common use case for Ops people to maintain “Golden Images”. By definition, a golden image is a disk image template (or baseline) used to spawn servers. In AWS, they are called Amazon Machine Images (AMIs).

GumGum uses AMIs for Auto Scaled servers. You can use AWS Auto Scaling to detect impaired Amazon EC2 instances and unhealthy applications, and replace the instances without manual intervention. This ensures that your application has the compute capacity that you expect. An Auto Scaling group uses a launch configuration to determine which AMI to use, what instance type to start, and a few other settings like the size of the disk and the security group you want to put your servers in.

Thanks to the completeness of Ansible's AWS support, you can easily find a module that will build an AMI from an EC2 server: ec2_ami. Instead of showing an example of its usage (which I think is not interesting, as you can find examples on docs.ansible.com), I will focus on integrating AMI creation into the automation process with Ansible.

It is important to think about what matters when you build a Golden Image. Baking this kind of image can sometimes take 15 minutes or more, which is why the build process has to ensure that what is baked is exactly what you wanted to bake.

The following steps were introduced in all GumGum playbooks (thanks to the use of specific tags), so that an automation can be run up to a specific step or go full-stack, from nothing all the way to a test server launched from the previously built AMI.


This workflow gives the following Ansible playbook template:

######################################################################
- name: <SERVICE_NAME> installation and configuration
  hosts: all
  become: yes

  roles:
    ##################################################################
    # Playbook dependencies
    - { role: common-001, tags: ['configure', 'test', 'cleanup', 'create-ami', 'test-ami', 'common'] }
    - { role: nagios-001, tags: ['configure', 'test', 'cleanup', 'create-ami', 'test-ami', 'nagios'] }

    - { role: aws-cli-001, tags: ['configure', 'test', 'cleanup', 'create-ami', 'test-ami', 'aws-cli'] }
    - { role: user-001, tags: ['configure', 'test', 'cleanup', 'create-ami', 'test-ami', 'user'] }

    ##################################################################
    # Main Role that can be called with 'configure', 'test' and 'cleanup'
    - { role: YOUR_MAIN_ROLE, tags: ['create-ami', 'test-ami', 'YOUR_MAIN_ROLE_TAG'] }

    ##################################################################
    # These two following roles allow you to create and test an AMI of the automated system
    - { role: ec2-ami-001, tags: ['create-ami', 'test-ami', 'ec2-ami'] }
    - { role: ec2-001, tags: ['test-ami'] }


######################################################################
### with: roles/YOUR_MAIN_ROLE/tasks/main.yaml
######################################################################
---
- include: configure.yaml
  tags: configure

- include: test.yaml
  tags: test

- include: cleanup.yaml
  tags: cleanup
######################################################################
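A toy Python model (not Ansible internals) helps see how the tag inheritance plays out: each include's effective tags are its role's tags plus its own tag, and a task runs when it shares at least one tag with the requested set. The role and tag names below mirror the template above.

```python
def tasks_to_run(requested_tags, tagged_tasks):
    """Toy model of Ansible tag filtering: a task runs if it shares at least
    one tag with the requested set (real Ansible has more rules, e.g. the
    special 'always' and 'never' tags)."""
    requested = set(requested_tags)
    return [name for name, tags in tagged_tasks if requested & set(tags)]


# Effective tags per task: role-level tags plus the include's own tag.
playbook = [
    ("common-001/all",      ["configure", "test", "cleanup", "create-ami", "test-ami", "common"]),
    ("main-role/configure", ["create-ami", "test-ami", "main", "configure"]),
    ("main-role/test",      ["create-ami", "test-ami", "main", "test"]),
    ("main-role/cleanup",   ["create-ami", "test-ami", "main", "cleanup"]),
    ("ec2-ami-001/all",     ["create-ami", "test-ami", "ec2-ami"]),
    ("ec2-001/all",         ["test-ami"]),
]

print(tasks_to_run(["configure"], playbook))
# ['common-001/all', 'main-role/configure']
```

With -t configure, the dependency roles run fully while the main role runs only its configure include; with -t test-ami, everything runs, including the AMI build and the test server launch, matching the call results listed next.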

Here is a combination of calls with specific tags with their associated results:

############################################################################################
# Usual call to the playbook. Gets the server ready to start (for example, an Elasticsearch server ready to be started and join a cluster).
############################################################################################

$ ansible-playbook playbook-template.yaml -l <TARGET> -t configure
### 1) Run Common         - Full role including configure, test and cleanup includes
### 2) Run Nagios         - Full role including configure, test and cleanup includes
### 3) Run Aws-Cli        - Full role including configure, test and cleanup includes
### 4) Run User           - Full role including configure, test and cleanup includes
### 5) Run Your_Main_Role - Configure include only

############################################################################################
### Automate the service and get the server ready for a snapshot (helps you do a final manual check on the server while debugging your playbooks).
############################################################################################

$ ansible-playbook playbook-template.yaml -l <TARGET> -t configure,test,cleanup
### 1) Run Common         - Full role including configure, test and cleanup includes
### 2) Run Nagios         - Full role including configure, test and cleanup includes
### 3) Run Aws-Cli        - Full role including configure, test and cleanup includes
### 4) Run User           - Full role including configure, test and cleanup includes
### 5) Run Your_Main_Role - Full role including configure, test and cleanup includes

############################################################################################
### Create an AMI of the targeted server and start a new EC2 server from this AMI; great for getting the staging server running right away.
############################################################################################
$ ansible-playbook playbook-template.yaml -l <TARGET> -t test-ami
### 1) Run Common         - Full role including configure, test and cleanup includes
### 2) Run Nagios         - Full role including configure, test and cleanup includes
### 3) Run Aws-Cli        - Full role including configure, test and cleanup includes
### 4) Run User           - Full role including configure, test and cleanup includes
### 5) Run Your_Main_Role - Full role including configure, test and cleanup includes
### 6) Run Ec2-AMI        - Steps to determine how to name the AMI and build it
### 7) Run Ec2            - Starts an EC2 server from the previously built AMI

These last examples conclude this blog post. Now you know how easy it is to introduce the AWS AMI build process into your workflow. You can push the process further by setting up a Jenkins server with your Ansible repository on it, and have Jenkins trigger the AMI creation either when the main automation code changes, or on demand as a “push to build” UI. I am pretty sure you can come up with more exotic CI/CD jobs, and I will be glad to hear how you do it!