
Overview

When we get into the nuts and bolts of implementing a disaster recovery (DR) plan, an important step is to evaluate the tech stack that hosts the critical applications. The tech stack often determines the order of operations needed to carry out the DR. Most organizations have the following tech stack pattern for their data centers:

Each of these layers has its own subject matter experts (SMEs), who need to work in tandem to address complexities and challenges during a DR event and to create a plan that ensures business continuity.

 

Challenges in creating a disaster recovery plan

“Everybody has a plan until they get punched in the face.” - Mike Tyson

Cyber attacks, natural disasters, human error, server failure: any number of potential events can bring on the need for disaster recovery. While the risk of experiencing a disaster event won’t go away, the negative impact of such an event can be drastically reduced with the right planning.

The following is a sample standard operating procedure (SOP) to recover an application during a disaster. Depending on the needs of the organization, DR procedures could be simpler or more complex than the examples shown here. After monitoring systems have detected conditions that trigger a DR event, a typical DR sequence might flow through the following stages:

Primary site shutdown procedure

  • Run shutdown sequences
  • Ensure current data backup snapshots
  • Create tickets for admins
  • Safely stop existing code and applications
  • Push existing workloads to disaster recovery environment
  • Ensure business data integrity
  • Follow existing emergency response procedures and policies
  • Initiate DR provisioning

Failover procedure

  • Seamless transition to failover systems
  • Provisioning of needed cloud or DR site resources
  • Storage backups, restorations, and migration of critical systems
  • Updating application connections
  • Mitigating security risks
  • Connecting users to DR site
  • Pushing out messaging to users
  • Updating DNS and ALBs

Finally, when the event is over and the threat is no longer present, normal operations can resume.

Return to normal operations procedure

  • Bringing systems back online
  • Merging versions of stored data
  • Restoring original connections
  • Scaling down disaster recovery environment
  • Evaluating damage and losses during emergency
  • Setting up or updating existing tickets for necessary action items from administrators
  • Reducing overlapping environment costs
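
Because the three procedures above are ordered, one way to make them reviewable and testable is to capture them as data before automating anything. The following is a minimal Python sketch, not an Ansible artifact: the `RUNBOOK` structure and the `checklist` helper are hypothetical names introduced here for illustration, and the steps are abbreviated from the lists above.

```python
from typing import Iterator

# Hypothetical, abbreviated encoding of the sample SOP above.
# Phase order matters: dicts preserve insertion order in Python 3.7+.
RUNBOOK: dict[str, list[str]] = {
    "Primary site shutdown": [
        "Run shutdown sequences",
        "Ensure current data backup snapshots",
        "Safely stop existing code and applications",
        "Initiate DR provisioning",
    ],
    "Failover": [
        "Provision cloud or DR site resources",
        "Restore storage backups",
        "Update application connections",
        "Update DNS and ALBs",
    ],
    "Return to normal operations": [
        "Bring systems back online",
        "Merge versions of stored data",
        "Restore original connections",
        "Scale down disaster recovery environment",
    ],
}


def checklist(runbook: dict[str, list[str]]) -> Iterator[str]:
    """Yield numbered lines such as '2.3 Update application connections'."""
    for phase_no, (phase, steps) in enumerate(runbook.items(), start=1):
        yield f"{phase_no}. {phase}"
        for step_no, step in enumerate(steps, start=1):
            yield f"{phase_no}.{step_no} {step}"


if __name__ == "__main__":
    for line in checklist(RUNBOOK):
        print(line)
```

Keeping the runbook as plain data means each SME team can review its own phase, and the same structure can later drive whatever automation tool carries out the steps.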

 

Unplanned downtime can have a huge financial impact, with analysts estimating the cost to an organization at over $500K per hour of unplanned outage.1 For the public sector, the impact can be even more crippling. Beyond the financial implications, system outages can affect public safety and citizen well-being, which can have longer-term effects on public trust in government.

Disasters may be unavoidable, but their negative impact can be minimized. Disaster recovery planning distills down to two things:

  1. How quickly services can be restored: the Mean Time To Recovery (MTTR)
  2. The level of confidence in the DR plan, which comes from regularly scheduled, successful testing of the plan
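
To make the first measure concrete: MTTR is commonly computed as the total time spent recovering divided by the number of incidents. The sketch below is a minimal Python illustration under stated assumptions. The recovery durations are made-up sample values, the function names are hypothetical, and the flat hourly rate simply reuses the $500K per hour estimate cited above.

```python
from datetime import timedelta

# Assumption: reuses the per-hour estimate cited above; substitute your own figure.
COST_PER_HOUR_USD = 500_000


def mean_time_to_recovery(outages: list[timedelta]) -> timedelta:
    """MTTR = total time spent recovering / number of incidents."""
    if not outages:
        raise ValueError("at least one outage is required")
    return sum(outages, timedelta()) / len(outages)


def estimated_cost(outage: timedelta, rate: float = COST_PER_HOUR_USD) -> float:
    """Rough downtime cost for a single outage at a flat hourly rate."""
    return outage.total_seconds() / 3600 * rate


# Made-up sample data: recovery times observed in three DR tests.
manual_runs = [timedelta(hours=4), timedelta(hours=6), timedelta(hours=5)]
mttr = mean_time_to_recovery(manual_runs)
print(f"MTTR: {mttr}, estimated cost per event: ${estimated_cost(mttr):,.0f}")
```

Tracking this number across repeated DR tests gives a simple way to see whether changes to the plan are actually shortening recovery.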

 

How Red Hat Ansible Automation Platform can support DR planning

An automated disaster recovery plan is a safer disaster recovery plan. Red Hat Ansible Automation Platform can automate disaster recovery plans through a capability called workflows. Workflows tie individual pieces of SME-created automation together into a cohesive orchestration process. For example:

Automate - Detection of DR event and kick off DR process

Automate - Primary site shut down

Automate - Failover

Automate - Return to Normal Operations

 

With Ansible, it becomes easy not only to visualize the steps but also to build in automated failure handling if any step does not go as expected. Once the process is defined in a workflow, the DR plan is easy to test repeatedly.
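
As a rough illustration of that idea, the Python sketch below models a workflow as nodes that each have a success path and a failure path. It is an analogy in plain Python, not the Ansible Automation Platform API: the `Node` class, the `run_workflow` function, and the step names are all hypothetical, and each step is a stub standing in for automation that an SME would supply.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Node:
    """One workflow step with a branch for each outcome."""
    name: str
    action: Callable[[], bool]            # returns True on success
    on_success: Optional["Node"] = None
    on_failure: Optional["Node"] = None


def run_workflow(start: Node) -> list[str]:
    """Walk the graph from `start`, following success or failure branches."""
    history: list[str] = []
    node: Optional[Node] = start
    while node is not None:
        try:
            ok = node.action()
        except Exception:
            ok = False  # treat an unexpected error as a failed step
        history.append(f"{node.name}: {'ok' if ok else 'failed'}")
        node = node.on_success if ok else node.on_failure
    return history


# Stub steps (hypothetical); real steps would launch SME-owned automation.
notify = Node("open ticket for admins", lambda: True)
failover = Node("failover", lambda: True)
shutdown = Node("primary site shutdown", lambda: False,  # simulated failure
                on_success=failover, on_failure=notify)
detect = Node("detect DR event", lambda: True, on_success=shutdown)

print(run_workflow(detect))
```

In this run the simulated shutdown failure routes the workflow to the ticketing step instead of failover, which is the kind of branch that is hard to exercise reliably in a manual process.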

In addition to workflows, Red Hat Ansible Automation Platform provides powerful abstractions. The figure below represents a small sample of the certified and supported abstractions that are available as part of the platform.

No matter how complex the DR process is, when it comes to implementation, IT operators have to interact with the tech stack on premises or in a cloud. If these operations are manual, they add directly to the time to recover.

  • Having an automated DR plan allows teams to schedule DR testing often, rather than once a year, and so build confidence in the DR process.
  • Automated steps reduce the time it takes to effect the changes at the endpoints. This allows for a faster return to operations.

Automation directly impacts how efficiently and accurately teams can deliver a disaster response, allowing organizations to save money and maintain trust.1

 

Red Hat Ansible Automation Platform - a trusted solution

Red Hat Ansible Automation Platform has been a trusted tool in the IT operator’s arsenal when it comes to Day 1 and Day 2 operations. The 2023 Forrester Wave named Red Hat a Leader among infrastructure automation vendors. According to Forrester’s evaluation, “Red Hat sets the pace of the market by addressing operational challenges, skill gaps, and budgetary pressures.”

 

1 Application Downtime, According to IDC, Gartner, and Others. Statuscast.

 

This blog post is co-authored with Sean Anderson.

 


About the author

Ajay is an IT industry veteran with over two decades in this space. He is the Automation strategy leader for Red Hat's North America Public Sector. He is focused on helping customers achieve their business outcomes using Ansible to automate their Day 0, Day 1, and Day 2 challenges. Previously he was the global datacenter architect for a top 10 Fortune 500 enterprise, leading the network automation efforts there. He also worked for a community-focused network automation startup, helping network engineers across the globe adopt DevOps tools and methodologies. Read his blog at termlen0.github.io
