Make your Ansible Playbooks flexible, maintainable, and scalable

In the years since, I've learned a lot of tricks to help ease the maintenance burden for my work. It's important to me to have maintainable projects, because many of my projects---like Hosted Apache Solr---have been in operation for over a decade! If it's hard to maintain the project or it's hard to make major architecture changes, then I can lose customers to more nimble competitors, I can lose money, and---most importantly---I can lose my sanity!

I'm presenting a session at AnsibleFest Austin this year, "Make your Ansible Playbooks flexible, maintainable, and scalable", and I thought I'd summarize some of the major themes here.

Stay Organized

I love photography and automation, and so I spend a lot of time building electronics projects that involve Raspberry Pis and cameras. Without the organization system I use, it would be very frustrating putting together the right components for my project.

Similarly, in Ansible, I like to have my tasks organized so I can compose them more easily, test them, and manage them without too much effort.

I generally start a playbook with all the tasks in one file. Once I hit around 100 lines of YAML, I'll work to break related groups of tasks into separate files and include them in the playbook with include_tasks.

After the playbook starts becoming more complete, I often notice sets of tasks that are related and can be isolated---like installing a piece of software, copying a configuration for that software, then starting (or restarting) a daemon. So I create a new role using ansible-galaxy init ROLE_NAME, and then put those tasks into that role.

If the role is generic enough, I'll either put it on GitHub and submit it to Ansible Galaxy, or put it into a separate, private Git repository. Now I can add a generic set of tests for the role (with Molecule or some other testing setup), and I can share the role with many projects---even with projects managed by completely separate teams!

Then I include the external roles into my project via a requirements.yml file. For some projects, where stability is the most important trait, I will also define the version (a git ref or tag) for each included Ansible role. For other projects, where I can afford to sacrifice stability a little for easier maintenance over time (like test playbooks, or one-off server configurations), I'll just put the role name (and repo details if it's not on Galaxy).

For most projects, I don't commit the external roles (those defined in requirements.yml) to the repository---I have a task in my CI system which installs the roles fresh on every run. However, there are some cases where it's best to commit all the roles to the codebase. For example, since developers can run my Drupal VM playbook on a daily basis, and these developers often don't live near where Ansible Galaxy's servers are located, they had trouble installing the large number of Ansible Galaxy roles required. So I committed the roles to the codebase, and now they don't have to wait for all the roles to be installed every time they build a new Drupal VM instance.

If you do commit the roles to your codebase, you need to have a thorough process for updating roles---make sure you don't let your requirements.yml file go out of sync with the installed roles! I often run ansible-galaxy install -r requirements.yml --force to force-replace all the required roles in the codebase, and keep myself honest!

Simplify and Optimize

> YAML is not a programming language.
>
> ---Jeff Geerling

One of the reasons people enjoy using Ansible is because it uses YAML, and has a declarative syntax. You want a package installed, so you have the task package: name=httpd state=present. You want a service running, so you have the task service: name=httpd state=started.

There are many cases where you need to add a little more intelligence, though. For example, if you're using the same role to build both VMs and containers and you don't want the service started in the container, you need to add a when condition, like:

- name: Ensure Apache is started.
  service:
    name: httpd
    state: started
  when: 'server_type != "container"'

This kind of logic is simple, and makes sense when reading a task and figuring out what it does. But some may try to stuff tons of fancy logic inside when conditions or other places where Ansible gives a little exposure to Jinja2 and Python, and that's when things can get off the rails.

As a rule of thumb, if you've spent more than 10 minutes wrestling with escaping quotes in a when condition in your playbook, it's probably time to consider writing a separate module to perform the logic you need to do for the task. Python should generally be in a separate module, not inline with the rest of the YAML. There are exceptions to this (e.g. when comparing more complex dicts and strings), but I try to avoid writing any complex code in my Ansible playbooks.

Besides avoiding complex logic, it's also helpful to have your playbooks run faster. Many times, I'll profile a playbook timer in the ansible.cfg file defaults section and run the playbook, and find that one or two tasks or roles takes a really long time, compared to the rest of the playbook.

For example, one playbook used the copy module for a large directory with dozens of files. Because of the way Ansible performs a file copy internally, this meant there were many seconds wasted waiting for Ansible to ferry each file across the SSH connection.

Converting that task to use synchronize instead saved many seconds per playbook run. For one run, this doesn't seem like much; but when the playbook is run on a schedule (e.g. to enforce a certain configuration on a server), or run as part of your CI suite, it's important to help make it efficient. Otherwise this can burn extra CPU cycles on inefficient code, and developers often hate waiting a long time for CI tests to pass before they can know if their code broke something or not.