When we get into the nuts and bolts of implementing a disaster recovery (DR) plan, an important step is to evaluate the tech stack that hosts the critical applications. The tech stack often determines the order of operations needed to carry out the DR. Most organizations have the following tech stack pattern for their data centers:
Each of these layers has its own subject matter experts (SMEs), who need to work in tandem to address complexities and challenges during a DR event and to create a plan that ensures business continuity.
Challenges in creating a disaster recovery plan
“Everybody has a plan until they get punched in the face.” - Mike Tyson
Cyber attacks, natural disasters, human error, server failure: any number of potential events can bring on the need for disaster recovery. While the risk of experiencing a disaster event won’t go away, the negative impact of such an event can be drastically minimized with the right planning.
The following is a sample SOP to recover an application during a disaster. Depending on the needs of the organization, DR procedures could be simpler or more complex than the examples shown here. After monitoring systems have detected conditions to trigger a DR event, a typical DR sequence might flow through the following stages:
Primary site shutdown procedure
- Run shutdown sequences
- Ensure current data backup snapshots
- Create tickets for admins
- Safely stop existing code and applications
- Push existing workloads to disaster recovery environment
- Ensure business data integrity
- Follow existing emergency response procedures and policies
- Initiate DR provisioning
- Seamless transition to failover systems
- Provisioning of needed cloud or DR site resources
- Storage backups, restorations, and migration of critical systems
- Updating application connections
- Mitigating security risks
- Connecting users to DR site
- Pushing out messaging to users
- Updating DNS and ALBs
Finally, when the event is over and the threat is no longer present, normal operations can resume.
Return to normal operations procedure
- Bringing systems back online
- Merging versions of stored data
- Restoring original connections
- Scaling down disaster recovery environment
- Evaluating damage and losses during emergency
- Setting up or updating existing tickets for necessary action items from administrators
- Reducing overlapping environment costs
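As a rough illustration, the stages of an SOP like this one can be written down as ordered data, which makes the order of operations explicit before any automation is built. The sketch below is plain Python, not Ansible content. The `Phase` class, the `RUNBOOK` list, and the `next_step` helper are hypothetical names, and the steps are an abbreviated selection from the lists above.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass
class Phase:
    """One stage of the DR sequence with its ordered checklist."""
    name: str
    steps: List[str]


# Abbreviated, illustrative runbook; a real SOP would carry every step.
RUNBOOK = [
    Phase("primary site shutdown", [
        "run shutdown sequences",
        "ensure current data backup snapshots",
        "safely stop existing code and applications",
    ]),
    Phase("failover", [
        "provision DR site resources",
        "restore storage backups",
        "update DNS and ALBs",
    ]),
    Phase("return to normal operations", [
        "bring systems back online",
        "merge versions of stored data",
        "scale down DR environment",
    ]),
]


def next_step(runbook: List[Phase],
              done: Set[Tuple[str, str]]) -> Optional[Tuple[str, str]]:
    """Return the first (phase, step) pair not yet completed, in runbook order.

    Returns None once every step in every phase is done.
    """
    for phase in runbook:
        for step in phase.steps:
            if (phase.name, step) not in done:
                return (phase.name, step)
    return None


# With nothing completed, the runbook starts at the first shutdown step.
print(next_step(RUNBOOK, set()))
```

Keeping the sequence as data rather than in people's heads is one way to give each SME team a shared view of what runs next.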
Unplanned downtime can have a huge financial impact: analysts estimate the cost to an organization at over $500K per hour of unplanned outage [1]. For the public sector, the impact can be even more crippling. Beyond the financial implications, system outages can affect public safety and citizen well-being, which can have longer-term effects on public trust in government.
Disasters may be unavoidable, but their negative impact can be minimized. Disaster recovery planning distills down to two things:
- How quickly the services can be restored - the Mean Time To Recovery (MTTR)
- Level of confidence in the DR plan - this comes from regularly scheduled successful testing of the plan
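To make the first point concrete, here is a minimal sketch of how MTTR could be computed from the results of DR tests and turned into a rough cost figure. The drill timestamps are invented for illustration, `mttr_hours` is a hypothetical helper, and the hourly cost simply reuses the $500K per hour estimate cited above. Treat it as back-of-the-envelope arithmetic, not a costing model.

```python
from datetime import datetime
from statistics import mean

# Hypothetical DR drill records: (outage detected, service restored).
drills = [
    (datetime(2024, 1, 10, 9, 0), datetime(2024, 1, 10, 11, 30)),   # 2.5 h
    (datetime(2024, 4, 12, 14, 0), datetime(2024, 4, 12, 15, 30)),  # 1.5 h
    (datetime(2024, 7, 9, 8, 0), datetime(2024, 7, 9, 10, 0)),      # 2.0 h
]

# Assumption: the per-hour outage cost quoted in the text.
COST_PER_HOUR = 500_000


def mttr_hours(records):
    """Mean Time To Recovery in hours: average of (restored - detected)."""
    return mean(
        (restored - detected).total_seconds() / 3600
        for detected, restored in records
    )


mttr = mttr_hours(drills)
estimated_cost = mttr * COST_PER_HOUR

print(f"MTTR: {mttr:.1f} hours")
print(f"Estimated cost of one outage at that MTTR: ${estimated_cost:,.0f}")
```

With these sample numbers, the MTTR is 2.0 hours, so a single outage would cost roughly $1,000,000. Every hour shaved off the MTTR lowers that estimate by the hourly cost.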
How Red Hat Ansible Automation Platform can support DR planning
An automated disaster recovery plan is a safer disaster recovery plan. Red Hat Ansible Automation Platform can automate disaster recovery plans by using a feature called workflows. Workflows can tie individual pieces of SME-created automation together into a cohesive orchestration process. For example:
- Automate - Detection of DR event and kickoff of DR process
- Automate - Primary site shutdown
- Automate - Failover
- Automate - Return to normal operations
With Ansible, it becomes easy not only to visualize the steps but also to build in automated failure handling if any step does not go as expected. Once the process is captured in a workflow, the DR plan is easy to test repeatedly.
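The idea of chaining steps with built-in failure handling can be sketched in a few lines of plain Python. This is not the Ansible Automation Platform API or its workflow format. It is a toy model in which each node names the node to run next on success and on failure. `Node`, `run_workflow`, `build_workflow`, and the node names are all hypothetical, and the actions are stand-ins that simply return `True` or `False`.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Node:
    name: str
    action: Callable[[], bool]        # returns True on success, False on failure
    on_success: Optional[str] = None  # next node when the action succeeds
    on_failure: Optional[str] = None  # next node when the action fails


def run_workflow(nodes: Dict[str, Node], start: str) -> List[str]:
    """Run nodes from `start`, following success/failure links.

    Returns the names of the nodes that ran, in order.
    """
    visited: List[str] = []
    current: Optional[str] = start
    while current is not None:
        node = nodes[current]
        visited.append(node.name)
        current = node.on_success if node.action() else node.on_failure
    return visited


def build_workflow(failover_ok: bool) -> Dict[str, Node]:
    """Build a small DR workflow; `failover_ok` simulates the failover result."""
    nodes = [
        Node("detect_event", lambda: True, on_success="shutdown_primary"),
        Node("shutdown_primary", lambda: True,
             on_success="failover", on_failure="open_ticket"),
        Node("failover", lambda: failover_ok,
             on_success="notify_users", on_failure="open_ticket"),
        Node("notify_users", lambda: True),
        Node("open_ticket", lambda: True),
    ]
    return {node.name: node for node in nodes}


# Simulate a failed failover: the run ends on the failure branch.
print(run_workflow(build_workflow(failover_ok=False), "detect_event"))
```

When the simulated failover fails, the run ends at `open_ticket` instead of `notify_users`. Because the whole path is code, the same run can be repeated as often as needed, which is what makes frequent DR testing practical.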
In addition to the workflow capabilities, Red Hat Ansible Automation Platform provides powerful abstractions. The figure below represents a small sample of the certified and supported abstractions that are available as part of the platform.
No matter how complex the DR process is, implementing it requires IT operators to interact with the tech stack on premises or in a cloud. If these operations are manual, they directly lengthen the time to recover.
- Having an automated DR plan allows teams to schedule DR testing often, rather than once a year, and so build confidence in the DR process.
- Automated steps reduce the time it takes to effect the changes at the endpoints. This allows for a faster return to operations.
Automation directly impacts how efficiently and accurately teams can deliver a disaster response, allowing organizations to save money and maintain trust [1].
Red Hat Ansible Automation Platform - a trusted solution
Red Hat Ansible Automation Platform has been the silver bullet in the IT operator’s arsenal when it comes to Day 1 and Day 2 operations. The 2023 Forrester Wave named Red Hat a Leader among infrastructure automation vendors. According to Forrester’s evaluation, “Red Hat sets the pace of the market by addressing operational challenges, skill gaps, and budgetary pressures.”
[1] Application Downtime, According to IDC, Gartner, and Others. Statuscast.
This blog post is co-authored with Sean Anderson.