Defining DR, HA, RTO, and RPO + BCP
[VMware BCP Training Slides]
If you’re responsible for making your IT server environment highly available, you need to hand the worksheet shown in Figure 1 to the business stakeholders for each application. Ask them to fill it in. Here’s how to explain the worksheet to them:
Figure 1: Worksheet for the Business Stakeholders (SLA objectives and time frames)
Disaster recovery (DR) is what you do when your data center is reduced to a smoking crater because Superman’s archenemy Lex Luthor launched a missile that hit Metropolis. This scenario gets all the hype, but in reality, businesses suffer far more often from smaller hiccups that are less widespread.
High availability (HA) means that your databases won’t disappear completely when your primary database server goes bump in the night. HA refers to smaller, localized outages like a server’s motherboard flaking out, a memory chip coming down with Alzheimer’s, or Daily Planet photojournalist Jimmy Olsen tripping over the power cord.
For both DR and HA scenarios, you need to come up with two numbers: RPO and RTO. Well, actually, you won’t be the one coming up with the numbers. The business stakeholders need to make that decision, but you’re the one who has to ask the tough questions.
Recovery time objective (RTO) sets the countdown clock. It’s the amount of time the business is willing to be down. The clock starts at the moment the problem happens, not the time you find out about it. The clock keeps ticking until the end users are able to do their jobs again. Common RTOs are zero, one minute, one hour, and one day.
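Here’s a minimal sketch of how that clock works, written in Python. The function names and the incident timestamps are made up for illustration; the only rule taken from the text is that the clock runs from the moment the problem happens until users can work again.

```python
from datetime import datetime, timedelta


def downtime(problem_started: datetime, users_working_again: datetime) -> timedelta:
    """Outage length as the RTO clock counts it: failure to users back at work."""
    if users_working_again < problem_started:
        raise ValueError("recovery cannot precede the failure")
    return users_working_again - problem_started


def meets_rto(problem_started: datetime, users_working_again: datetime,
              rto: timedelta) -> bool:
    """True when the whole outage fits inside the agreed RTO."""
    return downtime(problem_started, users_working_again) <= rto


# Hypothetical incident: the server fails at 02:00, nobody notices until 02:40,
# and users are working again at 03:10.
started = datetime(2024, 1, 1, 2, 0)
detected = datetime(2024, 1, 1, 2, 40)
restored = datetime(2024, 1, 1, 3, 10)
rto = timedelta(hours=1)

# Measured correctly, the outage lasted 70 minutes and missed a one-hour RTO.
within_rto = meets_rto(started, restored, rto)

# Measured from detection, the same outage looks like 30 minutes.
# That is the mistake the text warns about.
looks_fine_from_detection = (restored - detected) <= rto
```

The 40 minutes between failure and detection count against you, which is why monitoring and alerting are part of any RTO conversation.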
The business stakeholders will probably want to pick a number somewhere between one hour and one day, but that’s not really how technology works. A one-hour solution can include some manual work to fail over between systems, but the standby system must already be built and the data must already be restored. If you have to start the restore from scratch, you need a one-day threshold. You might think you can execute a restore from scratch in an hour, but what happens when the restore fails, or someone runs a RESTORE WITH RECOVERY too early, which brings the database online and forces you to start the restore sequence over? You’ll blow your timeline. RTO isn’t about the best-case scenario. It’s about the worst-case scenario, when things have already started to go wrong. Remember, you’ll be in a failure situation. This won’t be your finest hour.
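One way to handle an in-between request is to snap it to a common tier. The sketch below, in Python, rounds a requested objective down to the nearest common value so that the chosen tier still satisfies the request. The tier list comes from the text; the rounding rule and the name `tier_for` are assumptions for illustration, not a prescribed method.

```python
from datetime import timedelta

# The common objectives named in the text: zero, one minute, one hour, one day.
COMMON_TIERS = [
    timedelta(0),
    timedelta(minutes=1),
    timedelta(hours=1),
    timedelta(days=1),
]


def tier_for(requested: timedelta) -> timedelta:
    """Return the loosest common tier that is no longer than the request.

    Assumption: a request that falls between two tiers has to be met by the
    stricter one, because the looser one would overshoot the request.
    """
    if requested < timedelta(0):
        raise ValueError("an objective cannot be negative")
    eligible = [tier for tier in COMMON_TIERS if tier <= requested]
    return max(eligible)  # never empty: the zero tier always qualifies


# A stakeholder who asks for four hours needs the one-hour solution
# (standby already built, data already restored), not the one-day one.
four_hour_request = tier_for(timedelta(hours=4))
```

Under this rule a four-hour request lands on the one-hour tier, which is a useful way to show stakeholders what their number really costs.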
Recovery point objective (RPO) represents how much data the business is willing to lose in the event of an incident. Like RTO, RPO is measured in terms of time (and not data size). Common RPOs are zero (no data loss), one minute’s worth of data loss, one hour’s worth of data loss, and one day’s worth of data loss.
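Because RPO is measured in time, checking it comes down to subtracting two timestamps. Here’s a minimal Python sketch. The function names and the example times are hypothetical, and "last recoverable point" stands for whatever your technology can actually restore to, such as the last good backup.

```python
from datetime import datetime, timedelta


def data_loss_window(last_recoverable_point: datetime,
                     incident_time: datetime) -> timedelta:
    """Span of work lost: everything written after the last recoverable point."""
    if incident_time < last_recoverable_point:
        raise ValueError("the recoverable point cannot be after the incident")
    return incident_time - last_recoverable_point


def meets_rpo(last_recoverable_point: datetime, incident_time: datetime,
              rpo: timedelta) -> bool:
    """True when the lost span of time fits inside the agreed RPO."""
    return data_loss_window(last_recoverable_point, incident_time) <= rpo


# Hypothetical incident: the last good backup finished at 01:00
# and the server died at 01:45.
last_backup = datetime(2024, 1, 1, 1, 0)
incident = datetime(2024, 1, 1, 1, 45)

lost = data_loss_window(last_backup, incident)                      # 45 minutes
ok_for_one_hour = meets_rpo(last_backup, incident, timedelta(hours=1))
ok_for_one_minute = meets_rpo(last_backup, incident, timedelta(minutes=1))
```

The same 45 minutes of lost work passes a one-hour RPO and fails a one-minute RPO, no matter how many megabytes were written in that window.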
After you have the RTOs and RPOs from the business stakeholders, you can pick the right technology to meet their needs—and give them a rough idea of what it’ll cost.
When you’re just starting to build your documentation, stick with the common RTO and RPO values described above: zero, one minute, one hour, and one day. After you’ve rehearsed, polished, and optimized your recovery plan, you can start getting fancy with goals such as an RTO of 3 hours and 45 minutes.