Control with Ansible Tower, Part 1

March 29, 2016 by Bill Nottingham


This is the first in a series of posts about how Ansible and Ansible Tower enable you to manage your infrastructure simply, securely, and efficiently.

When we talk about Tower, we often talk in terms of control, knowledge, and delegation. But what does that mean?  In this series of blog posts, we'll describe some of the ways you can use Ansible and Ansible Tower to manage your infrastructure.

CONTROL - THE BASICS

The first step of controlling your infrastructure is to define what it is actually supposed to be. For example, you may want to apply available updates - here's a basic Playbook that does that.

---
- hosts: all
  gather_facts: true
  become: true
  become_method: sudo
  become_user: root
  tasks:
    - name: Apply any available updates
      yum:
        name: "*"
        state: latest
        update_cache: yes

Or you may have a more detailed configuration. Here's an example Playbook for basic system configuration. This Playbook:

  • Configures some users

  • Installs and configures chrony, sudo, and rsyslog remote logging

  • Sets some SELinux parameters

Normally, we’d organize our configuration into Ansible roles for reusability, but for the purpose of this exercise we're just going to use one long Playbook.
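
For reference, if we did break this out into a role - say, a hypothetical role named 'common' - the layout would typically look something like this, with the tasks, the restart_syslog handler, and the copied files in their conventional directories:

roles/
  common/
    tasks/main.yml        # the tasks shown below
    handlers/main.yml     # the restart_syslog handler
    files/                # chrony.conf, sudoers, syslog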

We'd want to apply this as part of our standard system configuration.

---
- hosts: all
  gather_facts: true
  become: true
  become_method: sudo
  become_user: root
  tasks:

    - name: Ensure users are present
      user:
        name: "{{ item.name }}"
        groups: wheel
        state: present
        uid: "{{ item.uid }}"
      with_items:
        - { name: "apone", uid: 1200 }
        - { name: "gorman", uid: 1201 }
        - { name: "hicks", uid: 1202 }

    - name: Install needed software
      yum:
        name: "{{contact.item}}"
        state: latest
      with_items:
        - chrony
        - sudo
        - rsyslog

    - name: Ensure standard chrony config
      copy:
        src: files/chrony.conf
        dest: /etc/chrony.conf
        mode: 0644
        owner: root
        group: root

    - name: Ensure standard sudo config
      copy:
        src: files/sudoers
        dest: /etc/sudoers
        mode: 0640
        owner: root
        group: root

    - name: Ensure log forwarding is configured
      copy:
        src: files/syslog
        dest: /etc/rsyslog.d/forward.conf
        mode: 0644
        owner: root
        group: root
      notify: restart_syslog

    - name: Ensure SELinux is enabled
      selinux:
        policy: targeted
        state: enforcing

    - name: Ensure SELinux booleans are set properly
      seboolean:
        name: "{{contact.item}}"
        persistent: true
        state: false
      with_items:
        - httpd_execmem
        - selinuxuser_execstack
        - selinuxuser_execheap
      
    - name: Ensure proper services are running
      service:
        name: "{{contact.item}}"
        state: running
        enabled: yes
      with_items:
        - rsyslog
        - chronyd

  handlers:
    - name: restart_syslog
      service:
        name: rsyslog
        state: restarted

The first thing to do is to ensure that this configuration is applied whenever we provision a new machine. We can do that with Tower's provisioning callbacks. Provisioning callbacks are a mechanism by which a machine can trigger a job template in Tower via Tower's REST API.

To set up provisioning callbacks, we first enable them in the job template in Tower. Note the Provisioning Callback URL and the Host Config Key.

callback.png

Then, we’ll modify our provisioning Playbook to call this via the ‘user_data’ feature in Amazon. Here's a sample EC2 provisioning Playbook that calls this job template on launch. In it, you can see that the user_data contains a modified version of Tower's request_tower_configuration.sh script, which calls the Provisioning Callback URL with the Host Config Key from above.

---
- hosts: localhost
  connection: local
  gather_facts: False
  vars_files:
  - vars/ec2-vars

  tasks:

    - name: Launch some instances
      ec2:
        access_key: "{{contact.ec2_access_key}}"
        secret_key: "{{contact.ec2_secret_key}}"
        keypair: "{{contact.ec2_keypair}}"
        group: "{{contact.ec2_security_group}}"
        type: "{{contact.ec2_instance_type}}"
        image: "{{contact.ec2_image}}"
        region: "{{contact.ec2_region}}"
        instance_tags: "{'type':'{{contact.ec2_instance_type}}', 'group':'{{contact.ec2_security_group}}', 'Name':'demo_''{{contact.demo_tag_name}}'}"
        count: "{{contact.ec2_instance_count}}"
        wait: true
        user_data: |
                   #!/bin/bash
                   TOWER=tower-test.local
                   JOB=457
                   KEY=7acb361f01ca00414e8623433d49150b
                   
                   retry_attempts=10
                   attempt=0
                   while [[ $attempt -lt $retry_attempts ]]
                   do
                       status_code=`curl -s -i --data "host_config_key=${KEY}" http://${TOWER}/api/v1/job_templates/${JOB}/callback/ | head -n 1 | awk '{print $2}'`
                       if [[ $status_code == 202 ]]
                           then
                           exit 0
                       fi
                       attempt=$(( attempt + 1 ))
                       echo "${status_code} received... retrying in 1 minute. (Attempt ${attempt})"
                       sleep 60
                   done
                   exit 1
      register: ec2

    - name: Wait for SSH to come up
      become: false
      connection: local
      wait_for:
        host: "{{ item.public_dns_name }}"
        port: 22
        delay: 60
        timeout: 320
        state: started
      with_items:
        - "{{ ec2.instances }}"

This ensures that each new provisioned instance will have this configuration applied.

CONTROL - SCHEDULED JOBS

One of Tower's basic features is the ability to schedule jobs. For example, you may have a maintenance window when you apply changes. Let's take our earlier example playbook for applying updates. We want to ensure we only apply updates during our update window at 8am on Tuesdays.

For any job in Tower, it is easy to configure a schedule. When editing a job template, you can add one under the Schedules expander. You can also go to the list of job templates, click the schedule icon, and then click ‘+’ to add a new schedule.

schedule-updates.png

Now this job will apply updates automatically on a schedule - and if you ever need to pause or stop the schedule, you can.
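
Schedules can also be created programmatically through Tower's REST API. Here's a minimal sketch using Ansible's uri module, assuming a POST to the job template's schedules endpoint with an iCal-style RRULE; the Tower hostname, job template ID, and credentials below are placeholders, so check your Tower version's API documentation for the exact endpoint and fields.

---
- hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: Create a weekly Tuesday 8am (UTC) schedule for the update job template
      uri:
        url: "https://tower-test.local/api/v1/job_templates/457/schedules/"
        method: POST
        user: admin                        # placeholder credentials
        password: "{{ tower_password }}"
        force_basic_auth: yes
        validate_certs: no                 # assumes a self-signed certificate on the Tower host
        body_format: json
        body:
          name: "Tuesday 8am update window"
          rrule: "DTSTART:20160329T080000Z RRULE:FREQ=WEEKLY;INTERVAL=1;BYDAY=TU"
          enabled: true
        status_code: 201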

CONTROL - CONTINUOUS REMEDIATION

Of course, applying a configuration only at machine boot is not sufficient in many cases. Changes can come later, whether via operating system updates, application changes, or even sysadmins logging in locally and making changes. (No one would do that!) This is a phenomenon known as configuration drift.

Hence the concept of continuous remediation - applying your configuration on a regular basis to ensure it doesn't drift away from its baseline. Ansible makes continuous remediation efficient. Tower's job scheduling makes it easy.

Here, we configure a schedule for the configuration Playbook we defined earlier.

schedule-config.png

You can schedule this remediation to run as often as is convenient, although we would not recommend every ten seconds.

Our simple configuration here only scratches the surface of what's possible with continuous remediation. If you’re looking for more advanced examples of configuration and remediation, check out the Ansible RHEL 6 STIG role at https://github.com/MindPointGroup/RHEL6-STIG.

I'VE REMEDIATED... NOW WHAT?

Once you're running your configuration remediation, there's the matter of interpreting the results.

After all, while it is good that your configuration is consistently applied, it can be a sign of a problem if configuration constantly needs to be reset on a system. To determine what to do about it, we first need to know what changes were made.

Ansible's nature is that it only makes a change if it has to; otherwise the task is reported as OK. This is often referred to as desired state configuration or idempotency. Combine this with Tower's auditing and logging of all Ansible runs, and finding these cases of configuration drift becomes simple.

In Tower's UI, pick out any of the runs for this job. In the job detail view, you’ll see the individual tasks on the left. Any task that applied a change is noted as ‘changed’ in yellow. If you click on a task, you can see the individual hosts on which changes were made, and by clicking on a host with changes (again, noted in yellow), you can see the individual event and what was changed.

event1.png

We can also query Tower's REST API to get this information in even more detail.

First, we'll need to know the job template we're using for configuration remediation. In this case, I know that this configuration job is job template 457 in my Tower instance.

We then query Tower's API to get the last successful configuration run.

https://tower-test.local/api/v1/jobs/?job_template=457&status=successful&order_by=-started

The Tower API returns JSON of recent successful runs:

{
    "count": 4, 
    "next": null, 
    "previous": null, 
    "results": [
        {
            "id": 98, 
            "type": "job", 
            "url": "/api/v1/jobs/98/", 
...

So we know the last successful run was job run 98.

Now we can pull out the data for anything that this job changed during its run. We query the API for all results marked 'changed', filtering on the 'runner_on_ok' event that denotes a task finishing on a host.

https://tower-test.local/api/v1/jobs/98/job_events/?changed=true&event=runner_on_ok

{
    "count": 2, 
    "next": null, 
    "previous": null, 
    "results": [
        {
            "id": 1150, 
            "type": "job_event", 
            "url": "/api/v1/job_events/1150/", 
            "related": {
                "job": "/api/v1/jobs/98/", 
                "parent": "/api/v1/job_events/1142/", 
                "host": "/api/v1/hosts/240/"
            }, 
            "summary_fields": {
                "job": {
                    "name": "Apply configuration", 
                    "description": "", 
                    "status": "successful", 
                    "failed": false, 
                    "job_template_id": 457, 
                    "job_template_name": "Apply configuration"
                }, 
                "host": {
                    "name": "ec2-54-162-71-113.compute-1.amazonaws.com", 
                    "description": "imported", 
                    "has_active_failures": false, 
                    "has_inventory_sources": true
                }
            }, 
            "created": "2016-03-23T19:53:38.868Z", 
            "modified": "2016-03-23T19:53:38.868Z", 
            "job": 98, 
            "event": "runner_on_ok", 
            "counter": 6, 
            "event_display": "Host OK", 
            "event_data": {
                "res": {
                    "comment": "", 
                    "createhome": true, 
                    "group": 1201, 
                    "name": "gorman", 
                    "changed": true, 
                    "system": false, 
                    "item": {
                        "name": "gorman", 
                        "uid": 1201
                    }, 
                    "state": "present", 
                    "shell": "/bin/bash", 
                    "stderr": "useradd: warning: the home directory already exists.\nNot copying any file from skel directory into it.\nCreating mailbox file: File exists\n", 
                    "invocation": {
                        "module_name": "user", 
                        "module_complex_args": {
                            "state": "present", 
                            "name": "gorman", 
                            "groups": "wheel", 
                            "uid": "1201"
                        }, 
                        "module_args": ""
                    }, 
                    "home": "/home/gorman", 
                    "groups": "wheel", 
                    "uid": 1201
                }, 
                "host": "ec2-54-162-71-113.compute-1.amazonaws.com", 
                "task": "Ensure users are present", 
                "play": "all"
            }, 
            "event_level": 3, 
            "failed": false, 
            "changed": true, 
            "host": 240, 
            "host_name": "ec2-54-162-71-113.compute-1.amazonaws.com", 
            "parent": 1142, 
            "play": "all", 
            "task": "Ensure users are present", 
            "role": ""
        }, 
        {
            "id": 1187, 
            "type": "job_event", 
            "url": "/api/v1/job_events/1187/", 
            "related": {
                "job": "/api/v1/jobs/98/", 
                "parent": "/api/v1/job_events/1182/", 
                "host": "/api/v1/hosts/242/"
            }, 
            "summary_fields": {
                "job": {
                    "name": "Apply configuration", 
                    "description": "", 
                    "status": "successful", 
                    "failed": false, 
                    "job_template_id": 457, 
                    "job_template_name": "Apply configuration"
                }, 
                "host": {
                    "name": "ec2-54-81-177-65.compute-1.amazonaws.com", 
                    "description": "imported", 
                    "has_active_failures": false, 
                    "has_inventory_sources": true
                }
            }, 
            "created": "2016-03-23T19:53:57.204Z", 
            "modified": "2016-03-23T19:53:57.204Z", 
            "job": 98, 
            "event": "runner_on_ok", 
            "counter": 10, 
            "event_display": "Host OK", 
            "event_data": {
                "res": {
                    "changed": true, 
                    "state": "enforcing", 
                    "policy": "targeted", 
                    "configfile": "/etc/selinux/config", 
                    "invocation": {
                        "module_name": "selinux", 
                        "module_complex_args": {
                            "policy": "targeted", 
                            "state": "enforcing"
                        }, 
                        "module_args": ""
                    }, 
                    "msg": "runtime state changed from 'permissive' to 'enforcing'"
                }, 
                "host": "ec2-54-81-177-65.compute-1.amazonaws.com", 
                "task": "Ensure SELinux is enabled", 
                "play": "all"
            }, 
            "event_level": 3, 
            "failed": false, 
            "changed": true, 
            "host": 242, 
            "host_name": "ec2-54-81-177-65.compute-1.amazonaws.com", 
            "parent": 1182, 
            "play": "all", 
            "task": "Ensure SELinux is enabled", 
            "role": ""
        }
    ]
}

This JSON output describes the configuration changes in detail, down to the modules used and the full Ansible output. With this data, it is easy to generate custom reports with just a few API queries and a little bit of text processing.
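
As a sketch of what that could look like, here's a small Playbook that uses Ansible's uri module to run the same two queries and print which tasks had to make changes on which hosts; the Tower hostname, job template ID, and credentials are placeholders.

---
- hosts: localhost
  connection: local
  gather_facts: false
  vars:
    tower_host: tower-test.local          # placeholder Tower hostname
    config_template: 457                  # our configuration job template
  tasks:
    - name: Find the most recent successful configuration run
      uri:
        url: "https://{{ tower_host }}/api/v1/jobs/?job_template={{ config_template }}&status=successful&order_by=-started"
        user: admin                       # placeholder credentials
        password: "{{ tower_password }}"
        force_basic_auth: yes
        validate_certs: no                # assumes a self-signed certificate
        return_content: yes
      register: job_list

    - name: Fetch the changed events from that run
      uri:
        url: "https://{{ tower_host }}/api/v1/jobs/{{ job_list.json.results[0].id }}/job_events/?changed=true&event=runner_on_ok"
        user: admin
        password: "{{ tower_password }}"
        force_basic_auth: yes
        validate_certs: no
        return_content: yes
      register: drift_events

    - name: Report configuration drift
      debug:
        msg: "{{ item.host_name }}: task '{{ item.task }}' had to make a change"
      with_items: "{{ drift_events.json.results }}"

Run by hand or scheduled in Tower itself, this gives a lightweight drift report without leaving Ansible.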

AUTOMATED SAFETY ENFORCEMENT

Now that you know where and how your configuration needs to be reapplied, you can start investigating whether someone is manually making changes, whether software is misbehaving, or whether some other issue is causing the problem.

In certain cases, simply reapplying the configuration may not be the desired behavior for machines that are out of specification. Take our example - if the remote syslog configuration isn't there, we could be missing important system events. Similarly, if SELinux is misconfigured, something could be very wrong.

How to remedy this? If we want to ensure our machines retain their pristine configuration, one option is to automatically terminate instances that are out of spec and replace them with fresh ones. Thanks to Ansible's flexible nature and Ansible 2.0's new block support, we can easily take care of these situations. Here's a modified version of our earlier configuration Playbook that does so.

---
- hosts: all
  gather_facts: true
  become: true
  become_method: sudo
  become_user: root
  vars_files:
  - vars/ec2-vars

  tasks:

    - name: Ensure users are present
      user:
        name: "{{ item.name }}"
        groups: wheel
        state: present
        uid: "{{ item.uid }}"
      with_items:
        - { name: "apone", uid: 1200 }
        - { name: "gorman", uid: 1201 }
        - { name: "hicks", uid: 1202 }

    - name: Install needed software
      yum:
        name: "{{contact.item}}"
        state: latest
      with_items:
        - chrony
        - sudo
        - rsyslog

    - name: Ensure standard chrony config
      copy:
        src: files/chrony.conf
        dest: /etc/chrony.conf
        mode: 0644
        owner: root
        group: root

    - name: Ensure standard sudo config
      copy:
        src: files/sudoers
        dest: /etc/sudoers
        mode: 0640
        owner: root
        group: root

    - block:
      - name: Ensure log forwarding is configured
        copy:
          src: files/syslog
          dest: /etc/rsyslog.d/forward.conf
          mode: 0644
          owner: root
          group: root
        register: rsyslog_state

      - name: Ensure SELinux is enabled
        selinux:
          policy: targeted
          state: enforcing
        register: selinux_state

      - name: Ensure SELinux booleans are set properly
        seboolean:
          name: "{{contact.item}}"
          persistent: true
          state: false
        with_items:
          - httpd_execmem
          - selinuxuser_execstack
          - selinuxuser_execheap
        register: sebool_state
      
      - name: Ensure proper services are running
        service:
          name: "{{contact.item}}"
          state: running
          enabled: yes
        with_items:
          - rsyslog
          - chronyd

      - name: Abort if we made changes
        fail:
          msg: "Required configuration was not set"
        when: rsyslog_state|changed or selinux_state|changed or sebool_state|changed

      rescue:
        - name: Get EC2 instance information
          ec2_facts:

        - name: Terminate instance
          connection: local
          become: false
          ec2:
            region: "us-east-1"
            instance_ids: "{{contact.hostvars[inventory_hostname]['ansible_ec2_instance_id']}}"
            state: absent
            wait: true

        - name: Relaunch instance
          connection: local
          become: false
          ec2:
            access_key: "{{contact.ec2_access_key}}"
            secret_key: "{{contact.ec2_secret_key}}"
            keypair: "{{contact.ec2_keypair}}"
            group: "{{contact.ec2_security_group}}"
            type: "{{contact.ec2_instance_type}}"
            image: "{{contact.ec2_image}}"
            region: "{{contact.ec2_region}}"
            instance_tags: "{'type':'{{contact.ec2_instance_type}}', 'group':'{{contact.ec2_security_group}}', 'Name':'demo_''{{contact.demo_tag_name}}'}"
            count: "{{contact.ec2_instance_count}}"
            wait: true
            user_data: |
                       #!/bin/bash
                       TOWER=tower-test.local
                       JOB=457
                       KEY=7acb361f01ca00414e8623433d49150b
                       
                       retry_attempts=10
                       attempt=0
                       while [[ $attempt -lt $retry_attempts ]]
                       do
                           status_code=`curl -s -i --data "host_config_key=${KEY}" http://${TOWER}/api/v1/job_templates/${JOB}/callback/ | head -n 1 | awk '{print $2}'`
                           if [[ $status_code == 202 ]]
                               then
                               exit 0
                           fi
                           attempt=$(( attempt + 1 ))
                           echo "${status_code} received... retrying in 1 minute. (Attempt ${attempt})"
                           sleep 60
                       done
                       exit 1
          register: ec2

        - name: Wait for SSH to come up
          become: false
          connection: local
          wait_for:
            host: "{{ item.public_dns_name }}"
            port: 22
            delay: 60
            timeout: 320
            state: started
          with_items:
            - "{{ ec2.instances }}"

        - name: New instance
          debug:
            msg: "Instance relaunched due to configuration drift; new instance is {{ item.public_dns_name }}."
          with_items:
            - "{{ ec2.instances }}"

Scheduling this Playbook in Tower will then automatically refresh systems that are significantly out of spec, including calling back into Tower to apply our basic configuration once new instances are spun up. Of course, you can take less drastic steps in the rescue block if you don’t need to completely regenerate the instance - you can remove it from a load balancer, or notify your ops team via e-mail, as in the sketch below.
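
For instance, a gentler rescue might just alert the team rather than terminating the instance. Here's a minimal sketch using Ansible's mail module in place of the rescue block above; the mail relay and addresses are placeholders.

      rescue:
        - name: Notify the ops team about configuration drift
          connection: local
          become: false
          mail:
            host: smtp.example.com         # placeholder mail relay
            port: 25
            from: tower@example.com
            to: ops@example.com
            subject: "Configuration drift detected on {{ inventory_hostname }}"
            body: "Required configuration on {{ inventory_hostname }} had drifted and was corrected (or could not be applied). Please investigate."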

GO NOW, AND HELP THE OTHERS!

We'll be back with more examples of how you can use Ansible and Tower to manage your infrastructure in the near future. In the meantime, all the playbook examples for this blog entry are available at http://github.com/ansible/ansible-blog-examples/.

 

Bill Nottingham

Bill Nottingham is the Director of Product at Ansible. He came to Ansible from Red Hat, where he spent 15+ years building and architecting Red Hat’s Linux products. His days are spent chatting with users and customers about Ansible and Tower. He can be found on twitter at @bill_nottingham, and occasionally doing a very poor impersonation of a soccer player.

