While you sleep, automate resolving Dynatrace problem alerts and report them to ServiceNow!

December 7, 2023 by Ben Forrester

Integrating observability tools with automation is paramount in the realm of modern IT operations, as it fosters a symbiotic relationship between visibility and efficiency. Observability tools provide deep insights into the performance, health, and behavior of complex systems, enabling organizations to proactively identify and rectify issues before they escalate. 

When seamlessly integrated with automation frameworks, these tools empower businesses to not only monitor but also respond to dynamic changes in real time. This synergy between observability and automation enables IT teams to swiftly adapt to evolving conditions, minimize downtime, and optimize resource utilization. By automating responses based on observability data, organizations can enhance their agility, reduce manual intervention, and maintain a robust and resilient infrastructure. In essence, using observability with automation is indispensable for achieving a proactive, responsive, and streamlined operational environment in the fast-paced and complex landscape of today’s technology.

In this blog post, we will look at a common use case: monitoring processes on both bare metal and virtual machines. We will use Dynatrace's OneAgent, a binary deployed on each host that bundles a set of specialized monitoring services. These services gather telemetry metrics about your hosts, including hardware, operating systems, and application processes.

Within this use case, our objective is to establish a host-level monitor specifically for the NGINX web server process. I will guide you through the implementation of Event-Driven Ansible, a framework that links event sources with corresponding actions through defined rules. In this instance, the event source is Dynatrace.

Once the configuration is complete, we will simulate the following scenario:

  1. The NGINX web server experiences an unplanned downtime on the server.
  2. The process monitor, facilitated by Dynatrace OneAgent, promptly detects the failed NGINX process and generates a problem alert in the Dynatrace platform.
  3. The Dynatrace source plugin, as defined in the rulebook employed by Event-Driven Ansible, actively polls for failure events.
  4. Event-Driven Ansible, in response to the event, executes a job template that undertakes the following actions:
    • Initiates the creation of a ServiceNow incident ticket.
    • Attempts to restart the NGINX process.
    • Updates the incident ticket status to "In Progress."
    • Closes the ticket only if the NGINX process is successfully restored.

The flow chart below illustrates the interactions between these integrated components.

 

Before we begin, let’s review some terminology for the core concepts of Event-Driven Ansible:

Terminology:

An Ansible Rulebook encompasses both the event source and detailed rule-based instructions on the actions to take when specific conditions are met, offering a high degree of flexibility.

A decision environment is a container image crafted to execute Ansible Rulebooks used in the Event-Driven Ansible controller. 

Event source plugins are commonly constructed in Python and serve the purpose of gathering events from your specified event source. Additionally, plugins are distributed through Red Hat Ansible Certified Content Collections.

You will need to build a decision environment. Below is an example of the build definition file:

---
version: 3

images:
  base_image:
    name: registry.redhat.io/ansible-automation-platform-24/de-minimal-rhel8:latest

dependencies:
  galaxy:
    collections:
      - ansible.eda
      - dynatrace.event_driven_ansible
  system:
    - pkgconf-pkg-config [platform:rpm]
    - systemd-devel [platform:rpm]
    - gcc [platform:rpm]
    - python39-devel [platform:rpm]

options:
  package_manager_path: /usr/bin/microdnf

Refer to the provided documentation for additional guidance on constructing decision environments. Following the creation of your decision environment, proceed to push the container image to your designated image repository and subsequently pull the image into your Event-Driven Ansible controller.

With your decision environment in place, you are ready to create a rulebook activation in the Event-Driven Ansible controller. This activation will encapsulate the rulebook, which defines your event source along with detailed instructions on the actions to be executed under specific conditions. Just as automation controller organizes playbooks within projects, Event-Driven Ansible controller uses projects to manage and contain rulebooks.

Below, you'll find a standard directory hierarchy for organizing and storing your rulebooks and playbooks in your Git repository.

Once we have a project created in Event-Driven Ansible controller, we will need to create a rulebook activation, which is a background process defined by a rulebook running within a decision environment.

For this use case, we will build a rulebook that uses the Dynatrace plugin as the Ansible event source and we will specify what we want to do when a condition is met. 

In general, there are three integration patterns for source plugins: 

  1. polling
  2. webhook
  3. messaging
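
For comparison, here is a minimal sketch of how each pattern might appear as a source entry in a rulebook. It uses the url_check, webhook, and kafka plugins from the ansible.eda collection as stand-ins for the three patterns. The URL, broker host, and topic are hypothetical placeholders, and the single rule only prints whatever arrives.

```yaml
---
# Illustrative only: one source per integration pattern.
- name: Compare event source integration patterns
  hosts: all
  sources:
    # Polling: the plugin calls an endpoint on a fixed interval.
    - ansible.eda.url_check:
        urls:
          - https://www.example.com   # hypothetical URL
        delay: 30
    # Webhook: the plugin listens for inbound HTTP POST requests.
    - ansible.eda.webhook:
        host: 0.0.0.0
        port: 5000
    # Messaging: the plugin consumes events from a message bus topic.
    - ansible.eda.kafka:
        host: kafka.example.com       # hypothetical broker
        port: 9092
        topic: eda-events             # hypothetical topic
  rules:
    - name: Print every event received
      condition: event.meta is defined
      action:
        debug:
```

A real rulebook would normally use only the source that matches how the monitoring tool delivers events.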

In our use case, the Dynatrace source plugin retrieves events by polling the Dynatrace API, and the conditions in the rulebook decide which of those events trigger an action. The polling interval is controlled by the plugin's delay variable, which is set in the rulebook (refer to the delay variable setting).

This delay throttles the plugin: it makes API calls at the configured interval and generates a new event based on each response it receives. Pacing the API calls this way reduces the risk of running into rate limits.

See the rulebook below:

---
- name: Watching for Problems on Dynatrace
  hosts: all
  sources:
    - dynatrace.event_driven_ansible.dt_esa_api:
        dt_api_host: "{{ dynatrace_host }}"
        dt_api_token: "{{ dynatrace_token }}"
        delay: "{{ dynatrace_delay }}"

  rules:
    - name: Look for open Process monitor problem
      condition: event.title == "No process found for rule Nginx Host monitor"
      action:
        run_job_template:
          name: Fix Nginx and update all
          organization: "Default"
          job_args:
            extra_vars:
              problemID: "{{ event.displayId }}"
              reporting_host: "{{ event.impactedEntities[0].name }}"

Note: Red Hat provides no expressed support claims to the correctness of this code. All content is deemed unsupported unless otherwise specified.
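
The rulebook above references three variables: dynatrace_host, dynatrace_token, and dynatrace_delay. As a minimal sketch, they could be supplied as YAML variables on the rulebook activation (or in a file passed to ansible-rulebook with --vars). The values below are hypothetical placeholders, and the delay is assumed to be expressed in seconds.

```yaml
---
# Hypothetical values: substitute your own tenant URL, token, and interval.
dynatrace_host: https://abc12345.live.dynatrace.com
dynatrace_token: "<your-dynatrace-access-token>"
dynatrace_delay: 60   # assumed to be seconds between polls
```

Because the token is a secret, you may prefer to keep it out of your Git repository and supply it only where the activation is defined.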

In the above rulebook, there are two keys in the YAML that require attention. One is the condition key in the rules section. Notice that the event.title equals “No process found for rule Nginx Host monitor”. But where did I get that string? 

Second, take a look at the reporting_host variable under the action section where we call the job template to be run in automation controller. Where do I get event.impactedEntities[0].name?

Later in this post, we'll see where these values come from and how they are used in our event-driven automation.

Once you have installed the Dynatrace OneAgent on your target host, you will need to create an access token. Ensure the token has the following permissions:

You’ll need to configure a host level process availability monitor rule for the NGINX process. Ensure the hostname running NGINX aligns with the hostname specified in your inventory within automation controller.
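
The remediation playbook targets whatever host name arrives in the event, so that name has to resolve to an inventory entry. A minimal sketch of a YAML inventory, assuming a hypothetical host called web01.example.com, might look like this:

```yaml
---
all:
  hosts:
    # Hypothetical name: it must equal the host name that Dynatrace reports.
    web01.example.com:
```

If the inventory is maintained directly in the automation controller UI instead of a file, the same rule applies to the host name entered there.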

After you’ve set up the monitor, you can test the payload polled by the rulebook activation in Event-Driven Ansible by killing the NGINX process on your managed host.

See the example rule audit of what the payload will look like in Event-Driven Ansible:

In the example above, you see that the payload event data from Dynatrace is in JSON format. We use the string set in event.title for our condition, and the reporting_host variable is dynamically set from the event.impactedEntities[0].name value. Note that event.impactedEntities is a list and could contain more than one host; the index [0] selects only the first entry.
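
Rendered as YAML for readability, a trimmed event containing only the fields this rulebook reads might look like the sketch below. Every value is a placeholder rather than real Dynatrace output, a real payload carries many more fields, and the second list entry is included only to show that the list can hold several hosts.

```yaml
# Illustrative, trimmed event: all values are placeholders.
title: No process found for rule Nginx Host monitor   # matched by the rule condition
displayId: P-000000                                    # passed as problemID
impactedEntities:
  - name: web01.example.com                            # index 0, passed as reporting_host
  - name: web02.example.com                            # index 1, ignored by this rulebook
```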

Now that we know how the condition key value and reporting_host variables are set, what’s next? 

This is a good moment to review the playbook that automation controller will run as a job template. The playbook runs when Event-Driven Ansible matches the Dynatrace problem event reporting that the NGINX process is down:

---
- name: Restore nginx service and create, update, and close the ServiceNow ticket
  hosts: "{{ reporting_host }}"
  gather_facts: false
  become: true
  vars:
    incident_description: Nginx Web Server is down
    sn_impact: medium
    sn_urgency: medium
  tasks:
    - name: Create an incident in ServiceNow
      servicenow.itsm.incident:
        state: new
        description: "Dynatrace reported {{ problemID }}"
        short_description: "Nginx is down per {{ problemID }} on {{ reporting_host }} reported by Dynatrace nginx monitor."
        caller: admin
        urgency: "{{ sn_urgency }}"
        impact: "{{ sn_impact }}"
      register: new_incident
      delegate_to: localhost

    - name: Display incident number
      ansible.builtin.debug:
        var: new_incident.record.number

    - name: Pass incident number
      ansible.builtin.set_fact:
        ticket_number: "{{ new_incident.record.number }}"

    - name: Try to restart nginx
      ansible.builtin.service:
        name: nginx
        state: restarted
      register: chksrvc

    - name: Update incident in ServiceNow
      servicenow.itsm.incident:
        state: in_progress
        number: "{{ ticket_number }}"
        other:
          comments: "Ansible automation is working on {{ problemID }} on host {{ reporting_host }}."
      delegate_to: localhost

    - name: Validate service is up and update/close SNOW ticket
      block:
        - name: Close incident in ServiceNow
          servicenow.itsm.incident:
            state: closed
            number: "{{ ticket_number }}"
            close_code: "Solved (Permanently)"
            close_notes: "Go back to bed. Ansible fixed problem {{ problemID }} on host {{ reporting_host }} reported by Dynatrace."
          delegate_to: localhost

      when: chksrvc.state == "started"

Note: Red Hat provides no expressed support claims to the correctness of this code. All content is deemed unsupported unless otherwise specified.
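
The playbook above decides whether to close the ticket from the state returned by the restart task. If you want a stricter check, one option is to ask the service manager directly before closing. The tasks below are a minimal sketch of that idea, not part of the original playbook. They assume a systemd host where the unit is named nginx.service.

```yaml
# Illustrative tasks: assumes a systemd unit named nginx.service.
- name: Collect the current state of services on the host
  ansible.builtin.service_facts:

- name: Confirm nginx is running before closing the incident
  ansible.builtin.assert:
    that:
      - ansible_facts.services['nginx.service'] is defined
      - ansible_facts.services['nginx.service'].state == 'running'
    fail_msg: "nginx is not running on {{ reporting_host }}"
```

Placed just before the closing block, a failed assertion would stop the play and leave the incident open for a person to pick up.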

It is important to emphasize that the job template name to be established in automation controller must align with the name specified in the run_job_template section of the rulebook. In the context of this use case example, I opted to incorporate surveys within my job template, enabling prompts on launch for the problemID and reporting_host variables, as passed from the rulebook.
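
If you manage automation controller as code, the job template could be defined with the ansible.controller collection, as in the sketch below. It uses ask_variables_on_launch instead of a survey to let the template accept problemID and reporting_host. The project, inventory, playbook path, and execution environment names are hypothetical, and the connection details for automation controller are assumed to be supplied separately (for example through environment variables).

```yaml
---
- name: Define the remediation job template (illustrative)
  hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: Create a job template that accepts extra_vars at launch
      ansible.controller.job_template:
        name: Fix Nginx and update all        # must match run_job_template.name in the rulebook
        organization: Default
        project: Dynatrace Remediation        # hypothetical project name
        inventory: Web Servers                # hypothetical inventory name
        playbook: playbooks/fix_nginx.yml     # hypothetical playbook path
        execution_environment: ServiceNow EE  # hypothetical execution environment name
        ask_variables_on_launch: true
        state: present
```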

For our use case to work, your automation controller must be integrated with ServiceNow, and the job template must use an automation execution environment that includes the ServiceNow ITSM collection. You also need a project in automation controller that contains the remediation playbook, and the host running the NGINX web server must be included in your automation controller inventory. Finally, make sure that you have successfully integrated your Event-Driven Ansible controller with your automation controller.
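
As a minimal sketch, an execution environment definition that adds the ServiceNow ITSM collection could mirror the decision environment build file shown earlier. The base image below is an assumption. Use whichever execution environment base image your subscription provides.

```yaml
---
version: 3

images:
  base_image:
    # Assumed base image: substitute the one available to you.
    name: registry.redhat.io/ansible-automation-platform-24/ee-minimal-rhel8:latest

dependencies:
  galaxy:
    collections:
      - servicenow.itsm

options:
  package_manager_path: /usr/bin/microdnf
```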

Now that everything is set up properly, you should test this by killing your NGINX process on your host and observe the following:

  1. A generated alert in Dynatrace:

  2. A Rule audit event in Event-Driven Ansible:

  3. A job event in automation controller where the remediation job template runs:

  4. An incident ticket is opened, updated, and closed if the NGINX process is restored on the problem host that was reported by Dynatrace.

 

In this example use case, we introduced the Dynatrace plugin and decision environments, exploring sample payload event data from our source to demonstrate how to dynamically populate variables. We implemented a host-level process monitor for the NGINX process in Dynatrace. Alternatively, we could have employed a synthetic monitor in Dynatrace for application-level monitoring.

Emphasizing the importance of adaptability, our remediation playbook remains dynamic and is specifically scoped to run solely on the host(s) reported as problematic by Dynatrace. While this example covers a breadth of topics, it's essential to recognize that automating complex tasks doesn't necessitate an all-encompassing approach from the outset. Instead, consider a gradual integration of remediation tasks into your playbook over time as you learn to automate fixes. Initially, you might opt to open an incident ticket before implementing any remediation action, gradually transitioning to the automation of common problems. Applying standard agile principles to your automation journey allows for an iterative and flexible approach. It's worth noting that the era of manually addressing common issues at 3 AM is evolving, offering you the opportunity to reclaim your sleep through effective enterprise level automation practices.

Happy automating! 

 

Additional resources and next steps

Want to learn more about Event-Driven Ansible? 

 


Topics:
ServiceNow, Automation controller, Automation execution environments, Event-Driven Ansible


 

Ben Forrester

Ben is a Senior Ansible Solution Architect with a wealth of expertise spanning more than 25 years in the realm of constructing and automating UNIX and Linux based distributed computing systems. During his distinguished career, Ben devoted two decades to the Federal Reserve, where he led the design and construction of pivotal infrastructure for crucial systems, including bank payment systems, Treasury Auction Application systems, and Federal Automated Clearing House infrastructure. He currently calls Atlanta, GA, home, where he resides with his wife and children, enjoying southern living.

