Ansible-Blog-Tower-Workflow-Convergence

In Red Hat Ansible Tower 3.1 we released a feature called Workflows. The feature allowed users to compose job templates into arbitrary tree-shaped graphs. A simple workflow we saw users creating was a linear pipeline, similar to the workflow below.

image4-1

The workflow feature also allowed branching. Each branch can run in parallel.

image1-2

But something was missing: the ability to wait for previous parallel operations to finish before proceeding. If this existed, you could simplify the above workflow (see below).

image3-2

In Red Hat Ansible Tower 3.4, the above workflow is now possible with the introduction of the Workflow Convergence feature.

For you computer-sciencey folks: workflows are no longer restricted to a tree; you can now create a directed acyclic graph (DAG). More simply, we call this convergence: two nodes are allowed to point to the same downstream node. The concept is best shown through an example. Above, we have a workflow with three nodes. The first two job templates run in parallel. When they both finish, the third node, the downstream convergence node, will trigger.
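To make the tree-versus-DAG distinction concrete, here is a minimal Python sketch. It is not Tower code: the `workflow` dictionary, the node names, and the helper functions are illustrative assumptions. It models a workflow as a mapping from each node to its downstream nodes, checks that the graph has no cycles, and lists the nodes that have more than one parent.

```python
from collections import deque

# Hypothetical three-node workflow: node name -> downstream node names.
workflow = {
    "Create EC2 Instance": ["Install App"],
    "Create GCE Instance": ["Install App"],
    "Install App": [],
}


def parent_counts(graph):
    """Count how many upstream nodes point at each node."""
    counts = {node: 0 for node in graph}
    for children in graph.values():
        for child in children:
            counts[child] += 1
    return counts


def is_acyclic(graph):
    """Kahn's algorithm: the graph is a DAG if every node can be ordered."""
    remaining = parent_counts(graph)
    ready = deque(node for node, count in remaining.items() if count == 0)
    visited = 0
    while ready:
        node = ready.popleft()
        visited += 1
        for child in graph[node]:
            remaining[child] -= 1
            if remaining[child] == 0:
                ready.append(child)
    return visited == len(graph)


def convergence_nodes(graph):
    """Nodes with more than one parent; a tree never has any."""
    return sorted(n for n, count in parent_counts(graph).items() if count > 1)


print(is_acyclic(workflow))
print(convergence_nodes(workflow))
```

In a tree every node has at most one parent, so `convergence_nodes` would return an empty list. A node with two parents is exactly what makes this graph a DAG rather than a tree.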

In this blog post we will cover the changes to workflow failure scenarios, how workflow node failure and success propagate, how this affects the runtime graph, and how to create a workflow that responds to ALL failures rather than ANY failure. After reading this blog post you will understand how the Workflows feature chooses to execute the graph and how to switch between the ANY and ALL behaviors using the utility playbook method.

Workflow Execution Scenario

Prior to the release of Ansible Tower 3.4, a workflow’s failure or success was determined by the final node’s resulting status. The workflow would be marked failed if either of the Install App jobs failed.

image1-2

Now workflow success or failure works more like exception handling. If one or more jobs spawned from a workflow result in a failure WITHOUT a failure handler, then the workflow is marked failed; otherwise it’s marked successful. A failure handler is an always or failure path emanating from a node. Below is an example of how a common failure scenario would be handled using a workflow.

image6

The above workflow creates an EC2 instance and deploys our application code. If either of these two jobs fails, the EC2 instance will be cleaned up by the Delete EC2 Instance job. The workflow job is not marked as failed, because the Delete EC2 Instance job is the failure handler. If the Delete EC2 Instance job itself fails, the workflow will be marked failed, and the user knows that manual intervention is required to inspect why the cleanup process failed and to manually delete the effectively orphaned EC2 instance.
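The exception-handling rule described above can be sketched in a few lines of Python. This is a simplified model, not Tower's implementation: the `NodeResult` class, its field names, the status strings, and the "Deploy App" job name are assumptions made for illustration. Only jobs that actually ran appear in the list.

```python
from dataclasses import dataclass, field


@dataclass
class NodeResult:
    """One job spawned by the workflow (hypothetical model)."""
    name: str
    status: str  # "successful" or "failed" in this sketch
    failure_nodes: list = field(default_factory=list)
    always_nodes: list = field(default_factory=list)

    def has_failure_handler(self):
        # A failure handler is a failure or always path leaving the node.
        return bool(self.failure_nodes or self.always_nodes)


def workflow_status(results):
    """Failed if any job failed without a failure handler, else successful."""
    for node in results:
        if node.status == "failed" and not node.has_failure_handler():
            return "failed"
    return "successful"


# The deployment fails, but the cleanup job handles the failure.
handled = [
    NodeResult("Create EC2 Instance", "successful",
               failure_nodes=["Delete EC2 Instance"]),
    NodeResult("Deploy App", "failed",
               failure_nodes=["Delete EC2 Instance"]),
    NodeResult("Delete EC2 Instance", "successful"),
]

# Same run, but the cleanup job itself fails and has no handler.
unhandled = handled[:2] + [NodeResult("Delete EC2 Instance", "failed")]

print(workflow_status(handled))
print(workflow_status(unhandled))
```

The first run is reported as successful because the failed deployment has a failure path. The second is reported as failed because nothing handles the failed cleanup job.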

image3-2

Now let’s look at a slightly more complex example. In the above workflow, our intent is to run the "Install App" Job Template ONLY if both instance creation jobs succeed. However, the workflow doesn’t work like that. Instead, the "Install App" Job Template will run if either parent succeeds. In the next section I will show you how to get the wanted AND behavior instead of OR.

AND vs. OR

Let’s take a look at what I’ve said above, but in more general terms. Consider a child node with multiple parents connected by a success relationship. If any one of the parents succeeds, the child node will run. Sometimes this is the wanted behavior; at other times you may want to run a child node only if all parents succeed. So how do we do that? We create a utility Job Template and put it before the convergence node that we want to make an ALL case (i.e. before Install App in the previous example).
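The difference between the two behaviors comes down to `any` versus `all` over the parents' results. The following Python sketch is only a model of that decision; the function names and status strings are assumptions for illustration, not part of Tower.

```python
def runs_with_any(parent_statuses):
    """Default convergence: run if ANY success-linked parent succeeded."""
    return any(status == "successful" for status in parent_statuses)


def runs_with_all(parent_statuses):
    """Wanted AND behavior: run only if ALL parents succeeded."""
    # Guard the empty case, because all([]) is True in Python.
    return bool(parent_statuses) and all(
        status == "successful" for status in parent_statuses
    )


mixed = ["successful", "failed"]
print(runs_with_any(mixed))
print(runs_with_all(mixed))
```

With one parent successful and one failed, the ANY rule lets the child run while the ALL rule does not. That gap is what the utility Job Template is meant to close.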

AND Utility Job Template

image2-3

Again, our goal is to have “Install App” run only if both “Create GCE Instance” and “Create EC2 Instance” succeed. To accomplish this goal we have created a new Job Template named “Utility All” and made it the convergence node in place of “Install App”, which now runs after “Utility All” succeeds. The playbook associated with “Utility All” is shown below. The playbook gets the parent jobs and loops over them; if any of the parent jobs failed, the playbook itself fails. This achieves our wanted behavior of only running “Install App” when all parent nodes succeed.

# and_util.yml
---
- hosts: localhost
  gather_facts: false
  vars:
    this_playbook_should_fail: false
    job_id: "{{ lookup('env', 'JOB_ID') }}"
    tower_base_url: "https://{{ lookup('env', 'TOWER_HOST') }}/api/v2"
    tower_username: "{{ lookup('env', 'TOWER_USERNAME') }}"
    tower_password: "{{ lookup('env', 'TOWER_PASSWORD') }}"
    tower_verify_ssl: "{{ lookup('env', 'TOWER_VERIFY_SSL') }}"
  tasks:
    - name: "Get Workflow job id to which this job belongs"
      shell: tower-cli job get {{ job_id }} -f json | jq ".related.source_workflow_job" | sed 's/\/"$//' | sed 's/.*\///'
      register: workflow_job_id

    - name: "Get Workflow node id for this job"
      uri:
        url: "{{ tower_base_url }}/workflow_job_nodes/?job_id={{ job_id }}"
        validate_certs: "{{ tower_verify_ssl }}"
        force_basic_auth: true
        user: "{{ tower_username }}"
        password: "{{ tower_password }}"
      register: result

    - name: "Get parent workflow nodes for this workflow node"
      uri:
        url: "{{ tower_base_url }}/workflow_job_nodes/?success_nodes={{ result.json.results[0].id }}"
        validate_certs: "{{ tower_verify_ssl }}"
        force_basic_auth: true
        user: "{{ tower_username }}"
        password: "{{ tower_password }}"
      register: result

    - name: "Fail this playbook if a parent node failed"
      fail:
        msg: "Parent workflow node {{ item }} failed"
      when: "item.summary_fields.job.status == 'failed'"
      loop: "{{ result.json.results }}"

image5-1
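For readers who find the final task easier to follow outside of YAML, here is a rough Python equivalent of the loop in the playbook above. It is a sketch only: it operates on an already-fetched response body and reads just the fields the playbook reads (`results`, `id`, and `summary_fields.job.status`). The `failed_parents` helper and the sample payload are hypothetical; a real response contains many more fields.

```python
def failed_parents(response_json):
    """Return the ids of parent workflow nodes whose job status is 'failed'."""
    failed = []
    for node in response_json["results"]:
        # Tolerate nodes that have no job summary attached.
        job = node.get("summary_fields", {}).get("job", {})
        if job.get("status") == "failed":
            failed.append(node["id"])
    return failed


# Hypothetical payload shaped like the fields the playbook accesses.
sample = {
    "results": [
        {"id": 11, "summary_fields": {"job": {"status": "successful"}}},
        {"id": 12, "summary_fields": {"job": {"status": "failed"}}},
    ]
}

failures = failed_parents(sample)
if failures:
    print("Parent workflow nodes failed:", failures)
```

Like the playbook, this check matches the status `failed` only. If other non-successful statuses should also block the downstream node, the comparison would need to be widened.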

Conclusion

In our own testing we found this Workflow Convergence feature mapped better to actual working practices, so we hope you find it as useful for your own needs. In this blog post we have gone over how workflow failure scenarios have changed to accommodate the new convergence node feature, how the run-time graph is created, and how to use and change the default workflow convergence method. I invite you to check out the other workflow enhancements we added in Ansible Tower.


About the author

Chris is a Principal Software Engineer, Ansible, contributing to the Red Hat Ansible Tower backend APIs. Outside of work Chris hones his skills as an amateur carpenter on his house. To learn more, you can follow him on Twitter at @oldmanmeyers85.
