The Power of AI and the Science of Operations (Part 1)

November 4, 2021 by Anthony Lin

A variety of industry experts cite Artificial Intelligence and Automation as key emerging trends. If you look around your own organization, you will likely see evidence of AI projects as well as an increasing focus on using automation in a variety of ways. IBM and Red Hat together can help you build on these trends and apply them to your IT operations.

In this article, the first of two, we will show how complex application environments produce more data than the humans tasked with running them can feasibly understand, and how the combination of an AIOps platform like Instana with an enterprise automation platform like Ansible Automation Platform can give human operators the edge they need to keep business-critical applications running and users satisfied.


So much data, so little time

Having worked as an operations engineer, I am familiar with the challenge of receiving a storm of alerts and trying to locate the root cause of an anomaly so as to isolate the problem and recover services in the shortest possible time. Conventional monitoring tools are often only able to raise alarms based on predefined performance metrics and thresholds, and do little to pinpoint the root cause of the problem. These limitations make it difficult for operations teams to scale as organizations continue to embrace digital transformation and generate more IT operations data from increasingly complex systems.

This is especially so given the dynamic nature of hybrid cloud environments and microservices architectures, as well as the advent of edge computing. Adding to this is the adoption of DevOps practices, where developers have growing influence over operational tasks. Coupled with an ever-increasing demand for stricter SLAs (service-level agreements) and faster turnaround for root cause analysis (RCA), it is increasingly hard to rely on manual correlation of events and data alone to resolve an issue within the stipulated SLA.


Can artificial intelligence help? 

Artificial Intelligence (AI) combined with automation promises autonomous operations that can radically change the game for IT Operations and Site Reliability Engineering (SRE) teams. According to Gartner’s “Market Guide for AIOps Platforms”, AIOps (Artificial Intelligence for IT Operations) platform adoption is growing rapidly across enterprises, and organizations are increasing their usage of AIOps across various aspects of IT operations management (ITOM) and maturing their use cases across DevOps and SRE practices.

An AIOps platform consumes large volumes of operational data from different services and applications, and applies Machine Learning (ML) algorithms to that data to improve analysis and provide insights for event correlation, anomaly detection and causality determination. While the AI in AIOps cannot yet completely replace humans, especially for complex issues, it helps to relieve the tedious work of correlating events and information coming from a large number of data streams. Coupled with automation, especially in the area of auto-remediation, AIOps seeks to augment IT service management, improve SLA compliance and reduce unplanned service downtime.
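To make the ideas of anomaly detection and event correlation more concrete, here is a minimal Python sketch. It is not how Instana or any other AIOps product works internally; it only illustrates the two concepts with simple statistics. The function names, the window size, the z-score threshold and the 30-second gap are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def find_anomalies(samples, window=5, z_threshold=3.0):
    """Return indices of samples that deviate strongly from the preceding window."""
    anomalies = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        # Guard against a perfectly flat baseline (sigma == 0).
        if sigma == 0:
            if samples[i] != mu:
                anomalies.append(i)
            continue
        if abs(samples[i] - mu) / sigma > z_threshold:
            anomalies.append(i)
    return anomalies


def correlate_events(events, max_gap_seconds=30):
    """Group (timestamp, service, message) events that occur close together in time."""
    groups = []
    for event in sorted(events, key=lambda e: e[0]):
        # Extend the current group if this event follows the previous one closely.
        if groups and event[0] - groups[-1][-1][0] <= max_gap_seconds:
            groups[-1].append(event)
        else:
            groups.append([event])
    return groups


if __name__ == "__main__":
    # Hypothetical latency samples in milliseconds.
    latency_ms = [100, 102, 98, 101, 99, 100, 250, 101]
    print(find_anomalies(latency_ms))  # -> [6]

    # Hypothetical alert events: (seconds since start, service, message).
    events = [
        (0, "db", "slow query"),
        (10, "api", "timeout"),
        (25, "web", "5xx errors"),
        (300, "batch", "retry"),
    ]
    for group in correlate_events(events):
        print([service for _, service, _ in group])
```

Unlike a fixed threshold, the rolling baseline adapts to what is normal for each metric, and grouping nearby events turns a storm of alerts into a handful of candidate incidents for a human to review.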

The nature of AIOps platforms means that they need to integrate with a large number of information sources in order to ingest the required data and analyze it. In general, AIOps tools are split into two categories, domain-agnostic and domain-centric, depending on how they collect the data. Gartner notes that the market is shifting towards domain-agnostic AIOps platforms, as there is a need for flexibility in processing highly diverse datasets. Some of the domain-centric AIOps platforms in the market today include Datadog APM, Dynatrace and AppDynamics, while domain-agnostic AIOps platforms include Elasticsearch, IBM Watson AIOps and Splunk Enterprise, amongst others.


Ansible Automation Platform and Instana

In this first part of the blog, we look at how the Red Hat Ansible Automation Platform can be integrated with Instana (part of IBM’s AIOps portfolio, which includes Cloud Pak for Watson AIOps and its ecosystem). Instana provides fully automated application observability and the context needed to take intelligent actions to ensure optimum application performance, while Red Hat Ansible Automation Platform provides an enterprise framework for building and operating IT automation at scale, from hybrid cloud to the edge.

As can be seen from the architectural diagram below, Instana can be used to monitor different platforms and environments, such as public clouds, Red Hat OpenShift Container Platform, Kubernetes, and Virtual Machines (VMs) and bare-metal servers running operating systems such as Red Hat Enterprise Linux (RHEL) and Windows. It can also be used to monitor various applications, middleware and programming languages.

The Ansible Automation Platform can be integrated with Instana and used in areas such as agent deployment and configuration, as well as for performing the required remediation actions. Instana supports GitOps, so the relevant configuration files are stored in Git repositories, alongside the playbooks that are used for Ansible automation. The usage of source control ensures consistency in deployment, configuration and auto-remediation.
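As a rough illustration of the remediation hand-off, the Python sketch below builds the HTTP request that would launch a job template through the automation controller's REST API (`/api/v2/job_templates/<id>/launch/`). This is only a sketch under stated assumptions, not the integration described in this series: the alert dictionary is a simplified, hypothetical stand-in rather than Instana's actual webhook payload, and the template IDs, the `target_host` variable and the controller URL are made up. The request is built but deliberately not sent.

```python
import json
import urllib.request

# Hypothetical mapping from an alert type to a job template ID in the
# automation controller; real IDs depend on your own installation.
REMEDIATION_TEMPLATES = {
    "service_down": 42,
    "disk_full": 43,
}


def build_launch_request(controller_url, token, alert):
    """Build (but do not send) a job-template launch request for an alert.

    Returns None when no remediation is mapped for the alert type.
    """
    template_id = REMEDIATION_TEMPLATES.get(alert["type"])
    if template_id is None:
        return None
    url = f"{controller_url.rstrip('/')}/api/v2/job_templates/{template_id}/launch/"
    # "target_host" is an illustrative variable name that a playbook could read.
    body = json.dumps({"extra_vars": {"target_host": alert["host"]}}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    request = build_launch_request(
        "https://controller.example.com",
        "REPLACE_WITH_TOKEN",
        {"type": "service_down", "host": "web01"},
    )
    print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would actually launch the job. Note that the controller typically honors `extra_vars` only when the job template is configured to prompt for variables on launch.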



Different expectations, different requirements

AIOps platforms are used by different teams within the organization, and each of these personas has different expectations and requirements. For instance, DevOps teams are mainly focused on log ingestion and analytics, while business leaders are focused on user engagement. Some of the requirements from the different personas who use the AIOps platform include:


  • Application SRE
    • I want to define my concerns and focus on the desired components so that I can monitor the applications’ performance in real-time with golden signals
    • I want to be alerted if there are any potential issues and simplify the RCA with a better context of the issue
  • Developer
    • I want to monitor my applications in real-time and perform the RCA in an easy and straightforward way
  • Business Owner
    • I want to understand my applications’ performance with the golden signals from the end-user perspective 
  • Infrastructure Operator
    • I want to quickly roll out the monitoring capability to my hybrid cloud environments, which comprise tens of clusters and thousands of VMs, so that I can observe my VMs and clusters in real-time
    • I also want to simplify the Day 2 Operational User Experience (UX) for future configuration, patches and upgrades


Next Steps

We have gone through the need for AIOps platforms and automation in today's world. We have also touched on how the Ansible Automation Platform can be integrated with Instana, as well as the requirements of the various personas across an organization. In Part 2, we will go into the details of the joint PoC (proof-of-concept) that Bright Zheng (my counterpart at IBM) and I carried out.




Anthony Lin

Anthony Lin is a Senior Specialist Solution Architect, Cloud Automation, covering the Asian Growth & Emerging Markets (GEMs). Anthony previously worked for Ericsson as a DevOps engineer on the AT&T Integrated Cloud/Network Cloud Project in the United States. In that role, he used Ansible automation to deploy and upgrade more than 50 customized OpenStack cloud clusters across U.S. and international production sites. He was also a core member and developer of the OpenStack Airship project prior to joining Red Hat.
