Tolerable Ansible

June 17, 2020 by John Westcott

Ansible Playbooks are very easy to read and their linear execution makes it simple to understand what will happen while a playbook is executing. Unfortunately, in some circumstances, the things you need to automate may not function in a linear fashion. For example, I was once asked to perform the following tasks with Ansible:

  • Notify an external patching system to patch a Windows target
  • Wait until the patching process was completed before moving on with the remaining playbook tasks

While the request sounded simple, upon further investigation it would prove more challenging for the following reasons:

  • The system patched the server asynchronously from the call; that is, the call into the patching system would simply put the target node into a queue to be patched
  • The patching process itself could last for several hours
  • As part of the patching process, the system would reboot at least twice, with no fixed maximum; the number of reboots depended on the patches being applied
  • Due to the specific implementation of the patching system, the only reliable way to tell whether patching had completed was to interrogate a registry entry on the client
  • If the patching took too long to complete, additional actions needed to be taken

Because the patching system was asynchronous, long running and rebooted the target an unspecified number of times, it was challenging to make Ansible correctly initiate and monitor the process. For example, if the machine was rebooted while Ansible tried to check the registry, the playbook would fail. Or, if the reboot took too long, Ansible might continue with the playbook prematurely and fail to connect to the client. Fortunately, as you will see in this blog post, there are features within Ansible which make it possible to handle these error conditions and achieve the desired effect in cases like this.

 

Setting Up A Simple Test

In the following examples, we are going to write some Ansible playbooks which will:

  • Look for the presence of a specific file on a target machine
  • Handle any number of non-Ansible initiated reboots
  • Timeout after some number of retries (which we will keep low for these tests)
  • Perform one of two actions based on the results of the monitoring (in our cases we will either fail or print a debug message)

These examples are designed to run on Linux machines but the concepts we are using would apply the same way for Windows tasks. 
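
For comparison, the original Windows check might look roughly like the sketch below. It is only an illustration: the registry key and value name are hypothetical placeholders, and it uses the win_reg_stat module (also available as ansible.windows.win_reg_stat) rather than anything confirmed from the original project.

```yaml
# Sketch only: the key and value name below are made up for illustration.
- name: Check the registry entry that marks patching as complete
  win_reg_stat:
    path: HKLM:\SOFTWARE\ExamplePatching   # hypothetical registry key
    name: PatchComplete                    # hypothetical value name
  register: patch_status
```

Unlike the file check used in the rest of this post, win_reg_stat does not fail when the entry is missing; it reports the result in patch_status.exists, so any condition on this task would need to test that value instead of the task's success.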

At the root of our test we will see if a file exists. So let's start with a playbook like this:

---
- name: Try to survive and detect a reboot
  hosts: target_node
  gather_facts: False
  tasks:
    - name: Check for the file
      file:
        path: "/tmp/john"
        state: file
      register: my_file
 
    - name: Post Task
      debug:
        msg: "This task is the end of our monitoring"

With the file present, our playbook completes successfully:

PLAY [Test file] *******************************************************************
 
TASK [Check for the file] *******************************************************************
ok: [192.168.0.26]
 
TASK [Post Task] *******************************************************************
ok: [192.168.0.26] => {
    "msg": "This task is the end of our monitoring"
}
 
PLAY RECAP *******************************************************************
192.168.0.26               : ok=4    changed=0    unreachable=0    failed=0

But when we run this playbook with the file missing the task fails with an error:

TASK [Check for the file] *******************************************************************
fatal: [192.168.0.26]: FAILED! => {"changed": false, "msg": "file (/tmp/john) is absent, cannot continue", "path": "/tmp/john", "state": "absent"}

 

Using a Loop to Wait

The first modification to our playbook will be to use a loop to allow the file check to wait for the file to show up. To do this, we will add some parameters to the “Check for the file” task:

    - name: Check for the file
      file:
        path: "/tmp/john"
        state: file
      register: my_file
      # Keep trying until we found the file
      until: my_file is succeeded
      retries: 2
      delay: 1

This tells Ansible that we want to retry this step up to two times with a one-second delay between each check and that, once the registered my_file variable reports success, we are done with this step. This time, when we run our playbook without a file, the task output already looks different:

TASK [Check for the file] *******************************************************************
FAILED - RETRYING: Check for the file (2 retries left).
FAILED - RETRYING: Check for the file (1 retries left).
fatal: [192.168.0.31]: FAILED! => {"attempts": 2, "changed": false, "msg": "file (/tmp/john) is absent, cannot continue", "path": "/tmp/john", "state": "absent"}

We can see Ansible try twice to find the file and then die. Now, if we run our playbook and create the file it's looking for while it's looping, the task output becomes:

TASK [Check for the file] *******************************************************************
FAILED - RETRYING: Check for the file (2 retries left)
ok: [192.168.0.31]

After that task, the final task will run which means our monitoring is now working as expected. However, note what happens in our task if I simulate a reboot of our server while the loop is monitoring for the file:

TASK [Check for the file] *******************************************************************
FAILED - RETRYING: Check for the file (2 retries left)
fatal: [192.168.0.31]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: Shared connection to 192.168.0.31 closed.\r\n", "unreachable": true}

In this case we get a connection failure and Ansible terminates the execution on this node. Since we are only running on one node, this also ends our play. To help with this, Ansible 2.7 added a task keyword called ignore_unreachable. This allows us to continue a playbook even if a host has become unreachable. Let's modify our check task to include this parameter:

    - name: Check for the file
      file:
        path: "/tmp/john"
        state: file
      register: my_file
      # Keep trying until we found the file
      until: my_file is succeeded
      retries: 2
      delay: 1
      # It is ok if we can’t connect to the server during this task
      ignore_unreachable: True

If we run the last test with the machine reboot in the loop we now get these results:

TASK [Check for the file] *******************************************************************
FAILED - RETRYING: Check for the file (2 retries left)
fatal: [192.168.0.31]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: ssh: connect to host 192.168.0.31 port 22: Connection refused", "skip_reason": "Host 192.168.0.31 is unreachable", "unreachable": true}
 
TASK [Post Task] *******************************************************************
ok: [192.168.0.31] => {
    "msg": "This task is the end of our monitoring"
}
 
PLAY RECAP *******************************************************************
192.168.0.31               : ok=2    changed=0    unreachable=1    failed=0    skipped=1    rescued=0    ignored=0

Note the skipped and the unreachable in the play recap at the end of the output.

 

Looping Over A Loop

With the configuration above, Ansible will now ignore the connection error for that step and continue to run the playbook. This is both good and bad for us. It’s bad because we made it to the final task indicating a successful patch (or in our example a found file) but we never actually put the file on the system (so we got a false positive). However, this is good because it means we can keep writing Ansible to further handle the error conditions. 
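
One way to make the false positive visible is to print what the registered variable reports right after the check. This is a minimal sketch and not part of the original playbook; the task name is made up, and it relies only on the built-in succeeded and unreachable tests.

```yaml
# Sketch only: a temporary diagnostic task placed after "Check for the file".
- name: Show how the check ended
  debug:
    msg: >-
      succeeded={{ my_file is succeeded }}
      unreachable={{ my_file is unreachable }}
```

As far as I can tell, an unreachable result carries no failed flag, so it can still pass the succeeded test. If that holds in your Ansible version, any condition that decides whether the file was really found has to check for unreachable explicitly.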

Next, we are going to make Ansible retry the logic to find the file after a reboot. To do this, we have to loop over the “Check for the file” task... but we already are. To perform a loop of a loop we are going to leverage the include_tasks module. Our main playbook will now look like this:

---
- name: Try to survive and detect a reboot
  hosts: target_node
  gather_facts: False
  tasks:
    - include_tasks: run_check_test.yml
 
    - name: Post Task
      debug:
        msg: "This task is the end of our monitoring"

And the included file (run_check_test.yml) will perform our check and then conditionally include the same file again:

---
# Check for my file in a loop
- name: Check for the file
  file:
    path: "/tmp/john"
    state: file
  register: my_file
  # Keep trying until we found the file
  until: my_file is succeeded
  retries: 2
  delay: 1
  # It is ok if we can’t connect to the server during this task
  ignore_unreachable: True
 
 
# if I didn’t find the file or my target host was unreachable
# run again
- include_tasks: run_check_test.yml
  when:
    - my_file is not succeeded or my_file is unreachable

If you are familiar with programming you may realize that this could potentially make an infinite loop if the file is never found. To prevent an infinite loop we will add another task and an additional check on the include within our include file to limit the total number of tries we have:

---
- name: Check for the file
  file:
    path: "/tmp/john"
    state: file
  register: my_file
  # Keep trying until we found the file
  until: my_file is succeeded
  retries: 2
  delay: 1
  # It is ok if we can’t connect to the server during this task
  ignore_unreachable: True
 
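# Decrement a safety counter (starting at 2 for this test) so the include cannot recurse forever
- set_fact:
    safety_counter: "{{ (safety_counter | default(2) | int) - 1 }}"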
 
- include_tasks: run_check_test.yml
  when:
    - (safety_counter | int > 0)
    - my_file is not succeeded or my_file is unreachable

With this change we are guaranteed to never run our "outer loop" more than two times. Again, this is a low number specifically for our testing. Let's run our updated code without a reboot and without the file:

TASK [include_tasks] *******************************************************************
included: run_check_test.yml for 192.168.0.31
 
TASK [Check for the file] *******************************************************************
FAILED - RETRYING: Check for the file (2 retries left).
FAILED - RETRYING: Check for the file (1 retries left).
fatal: [192.168.0.31]: FAILED! => {"attempts": 2, "changed": false, "msg": "file (/tmp/john) is absent, cannot continue", "path": "/tmp/john"}
 
PLAY RECAP *******************************************************************
192.168.0.31               : ok=2    changed=0    unreachable=0    failed=1    skipped=0    rescued=0    ignored=0

Did you expect that to happen? We got a failure in the until loop, and Ansible failed out instead of running the include file again. To prevent Ansible from failing on our loop step, we will add the ignore_errors parameter to our file check task:

- name: Check for the file
  file:
    path: "/tmp/john"
    state: file
  register: my_file
  # Keep trying until we found the file
  until: my_file is succeeded
  retries: 2
  delay: 1
  # It is ok if we can’t connect to the server during this task
  ignore_unreachable: True
  # If this step fails we want to continue processing so we loop
  ignore_errors: True

With this modification, running again with a reboot and no file now gives us this:

TASK [include_tasks] *******************************************************************
included: run_check_test.yml for 192.168.0.31
 
TASK [Check for the file] *******************************************************************
FAILED - RETRYING: Check for the file (2 retries left).
FAILED - RETRYING: Check for the file (1 retries left).
fatal: [192.168.0.31]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: ssh: connect to host 192.168.0.31 port 22: Connection refused", "skip_reason": "Host 192.168.0.31 is unreachable", "unreachable": true}
 
TASK [set_fact] *******************************************************************
ok: [192.168.0.31]
 
TASK [include_tasks] *******************************************************************
included: run_check_test.yml for 192.168.0.31
 
TASK [Check for the file] *******************************************************************
fatal: [192.168.0.31]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: ssh: connect to host 192.168.0.31 port 22: Connection refused", "skip_reason": "Host 192.168.0.31 is unreachable", "unreachable": true}
 
TASK [set_fact] *******************************************************************
ok: [192.168.0.31]
 
TASK [include_tasks] *******************************************************************
skipping: [192.168.0.31]
 
TASK [Post Task] *******************************************************************
ok: [192.168.0.31] => {
    "msg": "This task is the end of our monitoring"
}
 
PLAY RECAP *******************************************************************
192.168.0.31               : ok=6    changed=0    unreachable=2    failed=0    skipped=3    rescued=0    ignored=0

 

Slowing Things Down

You can't see the timing in this blog post, but once the server went down for the reboot, the loop executed extremely fast. Because the node was down, we looped back around to the file check while the node was still rebooting, which immediately triggered the ignore_unreachable flag and burned through our safety_counter variable in no time at all. To prevent this from happening, we can add a sleep step to our include file and only execute it if the node is down:

- name: Sleep if the host was unreachable
  pause:
    seconds: 3
  when: my_file is unreachable
  delegate_to: localhost

This will tell Ansible to pause for three seconds if it found the target node was unreachable. A server in a production environment may take more time than three seconds to start so this number should be adjusted accordingly for your environment. 
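
If you expect to tune that delay per environment, one option is to read it from a variable. The sketch below assumes a variable named unreachable_pause_seconds, which is a made-up name and not part of the original playbook; when it is not set, the task falls back to the same three seconds.

```yaml
# Sketch only: unreachable_pause_seconds is a hypothetical variable name.
- name: Sleep if the host was unreachable
  pause:
    seconds: "{{ unreachable_pause_seconds | default(3) }}"
  when: my_file is unreachable
  delegate_to: localhost
```

The value could then be set in inventory or group variables for slower hosts without editing the include file.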

Another issue you might have spotted in our last output is that our last task ran, indicating that we successfully found our file (which we did not). To handle this condition, we want an error message to indicate that the patching was not completed within our retries. To achieve this, we are going to add a conditional fail step in our main playbook after the loop. This could be an include_role or anything else you need to handle a failure in patching.

---
- name: Try to survive and detect a reboot
  hosts: target_node
  gather_facts: False
  tasks:
    - include_tasks: run_check_test.yml
 
    # Fail if we:
    #     never registered my_file or
    #     we failed to find the file or
    #     our loop ended with the server in an unreachable state
    - name: Fail if we didn't find the file
      fail:
        msg: "Patching failed within the timeframe specified"
      when: my_file is not defined or my_file is not succeeded or my_file is unreachable

    - name: Post Task
      debug:
        msg: "This task is the end of our monitoring"
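
As a rough illustration of the include_role alternative mentioned above, the fail task could be swapped for something like the following. The role name is hypothetical; only the when condition is taken from the playbook above.

```yaml
# Sketch only: patch_failure_handler is a made-up role name.
- name: Handle a patching failure
  include_role:
    name: patch_failure_handler
  when: my_file is not defined or my_file is not succeeded or my_file is unreachable
```

Note that, unlike fail, this does not stop the play on its own, so the role (or a task after it) would need to decide whether the remaining tasks should still run.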

 

The Completed Product

With all of these steps in this order, Ansible can now meet all our initial requirements:

  • It can launch a patching system
  • It can monitor something on the target node
  • It can survive an unspecified number of reboots
  • We can take additional actions if patching failed

Here are our final files with some additional in-line comments.

main.yml:

---
- name: Try to survive and detect a reboot
  hosts: all
  gather_facts: False
 
  tasks:
    # inside the include we will call the same include to perform a loop
    - include_tasks: run_check_test.yml
 
    # The my_file variable comes from the include; it's a registered variable from a task
    # Here we are going to force a failure if:
    #     we didn't get to run for some reason
    #     we didn't succeed
    #     we were unable to reach the target
    - name: Fail if we didn't get the file
      fail:
        msg: "It really failed"
      when: my_file is not defined or my_file is not succeeded or my_file is unreachable
 
    # Otherwise we can move on to any remaining tasks we have
    - name: Post task
      debug:
        msg: "This is the post task"

And our run_check_test.yml:

# Perform our sample file check
- name: Check for the file
  file:
    path: "/tmp/john"
    state: file
  register: my_file
  # As long as we are connected, keep trying until we find the file
  until: my_file is succeeded
  # If the machine is up, the retries keep looking for the file, pausing for the delay between attempts
  retries: 2
  delay: 1
  # This setting will not mark the machine as failed if it's unreachable
  ignore_unreachable: True
  # We also need to ignore errors in case the file just does not exist on the server yet and we run out of retries
  ignore_errors: True
 
# If the machine was not available for the last step pause so we don't just cruise through the number of retries
- name: Sleep if the host was unreachable
  pause:
    seconds: 3
  when: my_file is unreachable
  delegate_to: localhost
 
# Decrement a safety counter so we don't end up in an infinite loop
- set_fact:
    safety_counter: "{{ (safety_counter | default(6) | int) - 1 }}"
 
# Loop if:
#    we still have safety retries
#    we didn't get the file or the host was unreachable
- include_tasks: run_check_test.yml
  when:
    - (safety_counter | int > 0)
    - my_file is not succeeded or my_file is unreachable
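
If you would rather not hard-code the starting value of the counter, a possible variant of the set_fact task is sketched below. The variable name max_outer_tries is an assumption introduced here for illustration; when it is not set, the task behaves like the original and starts from 6.

```yaml
# Sketch only: max_outer_tries is a hypothetical variable name.
# The first pass starts from max_outer_tries (or 6); later passes
# keep decrementing the existing safety_counter fact.
- set_fact:
    safety_counter: "{{ (safety_counter | default(max_outer_tries | default(6)) | int) - 1 }}"
```

The limit could then be supplied at run time, for example with -e max_outer_tries=20. Avoid passing safety_counter itself as an extra variable: extra variables take precedence over set_fact, so the counter would never decrease.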

To see an example of running our completed playbook while our target machine is rebooted twice before the file is created, check out this listing.

 

Takeaways and where to go next

Ansible is a great platform for performing automation. By putting together several fundamental concepts, we can make Ansible extremely tolerant of unpredictable behaviour from client machines caused by processes running outside of our Ansible automation.


John Westcott

John Westcott is a Senior Consultant at Red Hat

