Ansible for Disaster Recovery

June 8, 2023 | Ajay Chenampara

Overview

When we get into the nuts and bolts of implementing a disaster recovery (DR) plan, an important step is to evaluate the tech stack that hosts the critical applications. The tech stack often determines the order of operations and execution needed to effect the DR. Most organizations follow a common, layered tech stack pattern in their data centers.

Each of these layers has its own SMEs (Subject Matter Experts), who will need to work in tandem to address complexities and challenges during a DR event and to create a plan that ensures business continuity.

Challenges in creating a disaster recovery plan

“Everybody has a plan until they get punched in the face.” - Mike Tyson

Cyber attacks, natural disasters, human error, server failure: any number of potential events can bring on the need for disaster recovery. While the risk of experiencing a disaster event won’t go away, the negative impact of such an event can be drastically reduced with the right planning.

The following is a sample standard operating procedure (SOP) for recovering an application during a disaster. Depending on the needs of the organization, DR procedures could be simpler or more complex than the examples shown here. After monitoring systems have detected conditions that trigger a DR event, a typical DR sequence might flow through the following stages:

Primary site shutdown procedure

Run shutdown sequences
Ensure data backup snapshots are current
Create tickets for admins
Safely stop existing code and applications
Push existing workloads to disaster recovery environment
Ensure business data integrity
Follow existing procedures and policies with regard to emergency response
Initiate DR provisioning

Failover procedure

Seamless transition to failover systems
Provisioning of needed cloud or DR site resources
Storage backups, restorations, and migration of critical systems
Updating application connections
Mitigating security risks
Connecting users to DR site
Pushing out messaging to users
Updating DNS and ALBs

Finally, when the event is over and the threat is no longer present, normal operations can resume.

Return to normal operations procedure

Bringing systems back online
Merging versions of stored data
Restoring original connections
Scaling down disaster recovery environment
Evaluating damage and losses during emergency
Setting up or updating existing tickets for necessary action items from administrators
Reducing overlapping environment costs

Unplanned downtime can have a huge financial impact: analysts estimate the cost to an organization at over $500K per hour of unplanned outage¹. For the public sector, the impact can be even more crippling. Beyond the financial implications, system outages can affect public safety and citizen well-being, which can have longer-term effects on public trust in government.

Disasters may be unavoidable, but their negative impact can be minimized. Disaster recovery planning distills down to two things:

How quickly the services can be restored - the Mean Time To Recovery (MTTR)
Level of confidence in the DR plan - this comes from regularly scheduled successful testing of the plan
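
To make the first point concrete, here is a minimal sketch of how MTTR and the associated outage cost could be calculated from recorded recovery times. The incident timestamps are hypothetical, and the per-hour figure is simply the analyst estimate cited above used as a default; substitute your organization's own numbers.

```python
from datetime import datetime, timedelta

# Assumption: default cost is the "over $500K per hour" analyst estimate cited above.
COST_PER_HOUR_USD = 500_000


def mean_time_to_recovery(incidents):
    """Average time between outage start and service restoration.

    `incidents` is a list of (started, restored) datetime pairs.
    """
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((restored - started for started, restored in incidents), timedelta())
    return total / len(incidents)


def estimated_outage_cost(duration, cost_per_hour=COST_PER_HOUR_USD):
    """Rough cost of an outage lasting `duration`, in USD."""
    return duration.total_seconds() / 3600 * cost_per_hour


# Hypothetical results from two DR exercises (not real data).
incidents = [
    (datetime(2023, 1, 10, 9, 0), datetime(2023, 1, 10, 13, 0)),  # 4 hours
    (datetime(2023, 3, 2, 22, 0), datetime(2023, 3, 3, 0, 0)),    # 2 hours
]

mttr = mean_time_to_recovery(incidents)
print(mttr)                         # 3:00:00
print(estimated_outage_cost(mttr))  # 1500000.0
```

Tracking this number across regularly scheduled DR tests is one simple way to see whether changes to the plan are actually shortening recovery.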

How Red Hat Ansible Automation Platform can support DR planning

An automated disaster recovery plan is a safer disaster recovery plan. Red Hat Ansible Automation Platform can automate disaster recovery plans by using a feature called workflows. Workflows tie individual pieces of automation, each created by the relevant SMEs, into a cohesive orchestration process. For example:

Automate - Detection of DR event and kick off DR process

Automate - Primary site shut down

Automate - Failover

Automate - Return to Normal Operations

With Ansible, it becomes easy not only to visualize the steps but also to build in automated failure handling if any step does not go as expected. Once the process is captured in a workflow, the DR plan can be tested repeatedly with little effort.
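
The idea of chaining steps with success and failure paths can be illustrated with a small conceptual sketch. This is plain Python, not the Ansible Automation Platform API or its workflow format; the `Node` class, the `run_workflow` function, and the step names are all hypothetical and exist only to show how a failed step can be routed to a handling path instead of halting the whole process.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Node:
    """One step in a hypothetical DR workflow."""
    name: str
    action: Callable[[], bool]        # returns True on success
    on_success: Optional[str] = None  # next node if the action succeeds
    on_failure: Optional[str] = None  # next node if the action fails


def run_workflow(nodes, start):
    """Walk the graph from `start`, following success or failure edges.

    Returns the names of the nodes that ran, in order.
    """
    visited = []
    current = start
    while current is not None:
        if current in visited:
            raise RuntimeError(f"cycle detected at {current!r}")
        node = nodes[current]
        visited.append(node.name)
        try:
            ok = node.action()
        except Exception:
            ok = False  # an unexpected error is treated as a failed step
        current = node.on_success if ok else node.on_failure
    return visited


# Hypothetical steps; each lambda stands in for a real piece of SME automation.
nodes = {
    "detect": Node("detect", lambda: True, "shutdown_primary", "notify_admins"),
    "shutdown_primary": Node("shutdown_primary", lambda: True, "failover", "notify_admins"),
    "failover": Node("failover", lambda: False, "update_dns", "notify_admins"),
    "update_dns": Node("update_dns", lambda: True),
    "notify_admins": Node("notify_admins", lambda: True),
}

print(run_workflow(nodes, "detect"))
# ['detect', 'shutdown_primary', 'failover', 'notify_admins']
```

Because the simulated failover step fails here, the run diverts to the notification step rather than updating DNS, which is the kind of built-in failure handling a workflow makes explicit and testable.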

In addition to the workflow capabilities, Red Hat Ansible Automation Platform provides powerful abstractions for working with the tech stack. The figure below represents a small sample of the certified and supported abstractions that are available as part of the platform.

No matter how complex the DR process is, when it comes to implementation, IT operators have to interact with the tech stack on premises or in a cloud. If these operations are manual, they directly increase the time to recover.

An automated DR plan allows teams to schedule DR testing often, rather than once a year, and so build confidence in the DR process.
Automated steps reduce the time it takes to effect the changes at the endpoints. This allows for a faster return to operations.

Automation directly impacts how efficiently and accurately teams can deliver a disaster response, allowing organizations to save money and maintain trust.¹

Red Hat Ansible Automation Platform - a trusted solution

Red Hat Ansible Automation Platform has been a go-to tool in the IT operator’s arsenal when it comes to Day 1/Day 2 operations. The 2023 Forrester Wave named Red Hat the leader among infrastructure automation vendors. According to Forrester’s evaluation, “Red Hat sets the pace of the market by addressing operational challenges, skill gaps, and budgetary pressures.”

¹ Application Downtime, According to IDC, Gartner, and Others. StatusCast.

This blog post is co-authored with Sean Anderson.

About the author

Ajay Chenampara

Acme Products

Ajay is an IT industry veteran with over two decades in this space. He is the automation strategy leader for Red Hat's North America Public Sector, focused on helping customers achieve their business outcomes by using Ansible to automate their Day 0/1/2 challenges. Previously, he was the global datacenter architect for a top 10 Fortune 500 enterprise, where he led the network automation efforts. He also worked for a community-focused network automation startup, helping network engineers across the globe adopt DevOps tools and methodologies. Read his blog on termlen0.github.io.
