Exploring BCP/DRP and Disaster Avoidance / Continuous Availability (Active/Active) Design


If a disaster or IT outage hit your business RIGHT NOW, how confident would you be in your DRP? In today’s networked economy, downtime is not an option.

DRP and BCP strategy is governed by the CEO and CIO, who are responsible for defining the business impact, requirements and investment for the solution. IT has to provide guidance in developing the BCP and DR strategy and get buy-in from the business.


  1. BCP (Business Continuity Plan) – Planning to continue your business operations in case of a disaster.
  2. DRP (Disaster Recovery Plan) – Planning to recover from disaster situations: how IT (information technology) should recover in case of a disaster.

Other Important Definitions

  1. Continuous Data Protection – Replication solutions can be either synchronous or asynchronous, meaning that data is transferred to a remote copy either immediately or after a short delay. Both methods create a secondary copy of data identical to the primary copy, with synchronous solutions achieving this in real time. This means that any data corruption or user file deletion is immediately (or very quickly) replicated to the secondary copy, which makes replication on its own ineffective as a backup method.
    1. Copy-on-write snapshot – Most snapshot implementations use a technique called copy-on-write, which makes an initial snapshot then further updates as data is changed. Restoration to a specific point in time is possible as long as all iterations of the data have been kept. For that reason, snapshots can protect against data corruption, unlike replication.
    2. Clone/split-mirror snapshot – Another common snapshot variant is the split-mirror, where reference pointers are made to the entire contents of a mirrored set of drives, file system or LUN every time a snapshot is made. Clones take longer to create than copy-on-write snapshots because all data is physically copied when the clone is created. There is also the risk of some impact to production performance when the clone is created because the copy process has to access primary data at the same time as the host.
    3. Continuous data protection (CDP) – CDP is a method of snapshotting that tracks and stores all updates to data as they occur. Theoretically, this means CDP solutions can roll back to any point in time, down to the smallest granularity of update. But there is a price to pay with CDP in terms of the cost of storage needed to keep every changed block copy and the performance impact of storing the data. As a result, some vendors implement what they call near-CDP, taking snapshots of changed data at set times and consolidating changes over a longer time period. This means heavily updated data doesn’t overwhelm the capacity of the CDP system. In virtual environments, APIs such as vSphere’s VADP enable CDP solutions to be implemented by third-party software vendors.
  2. Clustering and Availability 
    1. Fault Tolerant 
    2. Highly Available
    3. Metro/GeoClusters
    4. Clustering for Performance / Load Balancing (Scale-out)
  3. Backup – Backup is the process of making a secondary copy of data that can be restored to use if the primary copy becomes lost or unusable. Backups usually comprise a point-in-time copy of primary data taken on a repeated cycle – daily, monthly or weekly.
  4. Archival – Storing copies of data, including all versions, for long retention periods: 7 years or more, or, under legal hold requirements, for the lifetime of the organisation.
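
To make the copy-on-write snapshot idea above more concrete, here is a minimal Python sketch. It is a toy model, not how any particular storage array implements snapshots: the `Volume` class and its method names are my own assumptions for illustration. The point it shows is that a snapshot costs nothing when it is taken, and an old block is only preserved the first time it is overwritten, which is what makes point-in-time restores possible.

```python
class Volume:
    """Toy block volume with copy-on-write snapshots (hypothetical, illustration only)."""

    def __init__(self, blocks):
        self.blocks = list(blocks)  # live (primary) data
        self.snapshots = []         # one dict per snapshot: block index -> preserved old value

    def snapshot(self):
        """Take a snapshot. No data is copied at this point."""
        self.snapshots.append({})
        return len(self.snapshots) - 1

    def write(self, index, value):
        """Preserve the old block in the newest snapshot (first overwrite only), then write."""
        if self.snapshots:
            self.snapshots[-1].setdefault(index, self.blocks[index])
        self.blocks[index] = value

    def read_snapshot(self, snap_id):
        """Rebuild the volume as it looked when snapshot snap_id was taken."""
        view = list(self.blocks)
        # Undo changes from the newest snapshot back to the requested one.
        for saved in reversed(self.snapshots[snap_id:]):
            for index, old in saved.items():
                view[index] = old
        return view


vol = Volume(["a", "b", "c"])
before_change = vol.snapshot()
vol.write(1, "corrupted")
print(vol.blocks)                        # live data now contains the bad block
print(vol.read_snapshot(before_change))  # the snapshot still shows the original data
```

Note that restoring an older snapshot walks back through every later snapshot, which is why all iterations have to be kept. A replica, by contrast, would simply have received the corrupted block.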

Questions to ask YOURSELF

Business Disaster Recovery Questionnaire-2014_201412231116104837

  1. Does your business/organization have a business continuity or disaster recovery plan? (Not started, In progress, Complete)
  2. What are your most important business functions and how quickly can you resume following a disaster?
  3. How often do you review and test your disaster recovery plan?
  4. Have you established an alternative location where employees can work on key functions off site?
  5. Do you have access to multiple, reliable methods of communicating with your employees (emergency phone numbers, pagers, radios or website)?

Disaster AVOIDANCE continuous availability – NO DR RUN BOOKS

  • “40% of all companies that experience a major disaster will go out of business if they cannot gain access to their data within 24 hours.” -Gartner
  • “43% of companies experiencing disasters never re-open and 29% close within two years.” -McGladrey and Pullen
  • “93% of businesses that lost their datacenter for 10 days went bankrupt within one year.” -National Archives & Records Administration

More statistics on DR failures – http://www.continuitycentral.com/feature0660.html

Almost all businesses inside the Twin Towers that lacked proper DR technology solutions went bankrupt, while a number of companies that had developed continuous availability survived and prospered.

BUSINESS CONTINUITY Regulatory Requirements

  • ISO 22301 Business Continuity – http://www.iso.org/iso/catalogue_detail?csnumber=50038
  • ISO – ISO 22301:2012, “Societal security — Business continuity management systems — Requirements”, specifies a management system to manage an organization’s business continuity arrangements. It is formal in style in order to facilitate compliance auditing and certification.
  • It is supported by ISO 22313:2012, “Societal security — Business continuity management systems – Guidance” which provides more pragmatic advice concerning business continuity management.
  • ISO/IEC 27031:2011, “Information security – Security techniques — Guidelines for information and communication technology [ICT] readiness for business continuity” offers guidance on the ICT aspects of business continuity management.
  • United Kingdom – British Standard BS 25999 was a two-part business continuity management standard. “BS 25999-1:2006 Business Continuity Management. Code of Practice” offered pragmatic implementation guidance, but was withdrawn in 2012 when ISO 22313 effectively superseded it. “BS 25999-2:2007 Specification for Business Continuity Management” formally specified a set of requirements for a business continuity management system. It too was withdrawn in 2012 when it was (in effect) replaced by ISO 22301.
  • North America – Published by the National Fire Protection Association: NFPA 1600, Standard on Disaster/Emergency Management and Business Continuity Programs.
  • North America – ASIS/BSI BCM.01:2010, published Dec 2010.
  • North America – The ANSI/ASIS SPC.1-2009 Organizational Resilience: Security, Preparedness, and Continuity Management Systems – Requirements with Guidance for Use American National Standard is under consideration for inclusion in the DHS PS-Prep, a voluntary program designed to enhance national resilience in an all hazards environment by improving private sector preparedness.
  • Australia – Published by Standards Australia: HB 292-2006, A practitioners guide to business continuity management, and HB 293-2006, Executive guide to business continuity management. In 2010, Standards Australia introduced their Standard AS/NZS 5050, which connects far more closely with traditional risk management practices. This interpretation is designed to be used in conjunction with AS/NZS ISO 31000 covering risk management.
  • APRA


The complexity of maintaining DR Run Books, together with complex and expensive DR technology, means that most of the time DR is a wasted investment that fails. It is better to develop a continuous availability technology solution.

x – represents human effort and human error!

  1. Business impact analysis xxxx
  2. Business case xxxx
  3. BCP
  4. DRP
  5. DR Reference Architecture
  6. DR High Level Design
  7. DR Detailed Design
  8. DR Design for Applications
  9. Implementation and Configuration
  10. DR RUN Book development and maintenance
  11. Change Management and DR RUN book Maintenance
  12. Yearly DR Test xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Usually DR RUN books are updated at the time of the DR Test, not during BAU. Changes to the environment also affect the DR RUN book, and Change Management usually neglects to update the DR RUN books due to the Human Factor.

Continuous Availability can eliminate the most time-consuming, error-prone areas and maintain DR posture by optimising, or even eliminating, Stages 11 and 12.
Continuous availability is achieved simply by virtualising the Network and Storage layers. This reduces the overall complexity of DR RUN books and the operational cost of testing and maintaining them in a separate environment, and it puts otherwise cold or passive datacenters to use. The initial investment in setting up continuous availability is recovered quickly when you can maintain 100% DR posture; compare that to the cost of maintaining DR, isolated testing, exposure and failed compliance. How much does it cost you to maintain and test DR RUN Books? All of these soft opex costs could instead be invested in building a Continuous Availability solution. Continuous Availability will lower cost and complexity.

Define RPO/RTO

In order to develop a proper DR solution for any customer, it is imperative to document the RPO (Recovery Point Objective) and RTO (Recovery Time Objective) and to classify and tier applications. This needs to be documented and signed off in a business case that defines the BCP. Read more here – https://virtualizationandstorage.wordpress.com/2014/06/23/defining-dr-ha-rto-and-rpo-bcp/



In this business case you will need to allocate an investment to this DR design:

Identify Application Tiers and Uptime

You will also need to identify and classify key applications and place them into tiers of importance. A good reference for this classification is the Uptime Institute tier model. The classification should also contain uptime requirements for each application tier.

  • Tier 4 = Tier 1 + Tier 2 + Tier 3 + all components are fully fault-tolerant including uplinks, storage, chillers, HVAC systems, servers etc. Everything is dual-powered.
  • Tier 3 = Tier 1 + Tier 2 + Dual-powered equipment and multiple uplinks.
  • Tier 2 = Tier 1 + Redundant capacity components.
  • Tier 1 = Non-redundant capacity components (single uplink and servers).
  • Tier 1: Guaranteeing 99.671% availability.
  • Tier 2: Guaranteeing 99.741% availability.
  • Tier 3: Guaranteeing 99.982% availability.
  • Tier 4: Guaranteeing 99.995% availability.
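
Availability percentages are easier to discuss with the business when they are turned into hours of downtime. The short Python sketch below does that arithmetic for the four tiers listed above. The function name is my own, and it assumes a 365-day year and that all downtime counts against the target.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def annual_downtime_hours(availability_percent):
    """Maximum downtime per year implied by an availability percentage."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


# Availability figures taken from the tier list above.
TIER_AVAILABILITY = {1: 99.671, 2: 99.741, 3: 99.982, 4: 99.995}

for tier, pct in TIER_AVAILABILITY.items():
    hours = annual_downtime_hours(pct)
    print(f"Tier {tier}: {pct}% -> {hours:.1f} hours of downtime per year")
```

Tier 1 works out to roughly 28.8 hours a year, while Tier 4 allows well under an hour.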


  • MISSION CRITICAL – Your highest priority workloads with instantaneous recovery
  • BUSINESS CRITICAL – Your high priority workloads with prioritised failover
  • COMPLIANCE – Your compliant workloads that must meet regulations
  • GENERAL PURPOSE – Your non-critical workloads with a restart 
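
Once applications are classified, the classes above can drive the order in which workloads are brought back. The Python sketch below is only an illustration under my own assumptions: the application names are made up, and treating COMPLIANCE as ahead of GENERAL PURPOSE simply follows the order of the list above rather than any standard.

```python
from dataclasses import dataclass
from enum import IntEnum


class WorkloadClass(IntEnum):
    """Workload classes; a lower number means it is recovered earlier (assumed ordering)."""
    MISSION_CRITICAL = 1
    BUSINESS_CRITICAL = 2
    COMPLIANCE = 3
    GENERAL_PURPOSE = 4


@dataclass
class Application:
    name: str
    workload_class: WorkloadClass


def recovery_order(apps):
    """Sort applications by workload class, then by name for a stable result."""
    return sorted(apps, key=lambda app: (app.workload_class, app.name))


# Hypothetical applications, for illustration only.
apps = [
    Application("intranet wiki", WorkloadClass.GENERAL_PURPOSE),
    Application("payments", WorkloadClass.MISSION_CRITICAL),
    Application("audit archive", WorkloadClass.COMPLIANCE),
    Application("crm", WorkloadClass.BUSINESS_CRITICAL),
]

for app in recovery_order(apps):
    print(f"{app.workload_class.name}: {app.name}")
```

In a real business case each class would also carry its agreed RPO and RTO, signed off by the business.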




Define Budget

The following formula can be used to highlight the revenue lost due to an outage:

Lost revenue due to outage = ($ annual revenue / 365 days) * (RTO + RPO), with RTO and RPO expressed in days
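
As a worked example of the formula above, here is a small Python sketch. RTO and RPO are usually quoted in hours, so the function converts them to days before applying the formula; that conversion, the function name and the sample figures are my own assumptions. Like the formula, it assumes revenue is spread evenly across the year.

```python
def lost_revenue(annual_revenue, rto_hours, rpo_hours):
    """Estimate revenue lost to an outage: (revenue / 365) * (RTO + RPO in days)."""
    daily_revenue = annual_revenue / 365
    outage_days = (rto_hours + rpo_hours) / 24  # convert hours to days
    return daily_revenue * outage_days


# Hypothetical business: $36.5M annual revenue, 24-hour RTO and 24-hour RPO.
estimate = lost_revenue(36_500_000, rto_hours=24, rpo_hours=24)
print(f"Estimated lost revenue: ${estimate:,.0f}")
```

With these sample numbers the business earns $100,000 a day, so two days of combined RTO and RPO comes to about $200,000. That figure gives a first upper bound for how much it is worth investing to shorten RTO and RPO.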

The business case will always need to identify trade-offs between price, performance and cost. You might not be able to achieve all of them, so you need to be realistic.


Define Success Criteria

I would class developing a DR solution as a spaghetti problem, and it requires a method suited to solving this type of problem.

  • In ‘spaghetti situations’, in which everything is connected to everything and everything influences everything, it is far from obvious what the best solution is. All the people involved have a different idea about what the problem is. And if you ask them, all these people have different ideas about what the solution could be. If you, as engineer, consultant, manager or analyst, are in a situation like this, then what do you do?
  • In order to solve complex problems, it is necessary to define an overall strategy and method that will guide the development of the complex solution, following a set critical path, budget and timeframe.
  • In order to design and accelerate the implementation of a solution, customers must commit to design decisions and to any acceptable risks. Multi-component solutions add to the complexity; thorough validation and support from vendors is an absolute requirement, and the vendors must be invested in the success of the solution.

Define Plan

  1. Planning your availability transformation
  2. Analyzing your current state
  3. Assessing your continuous availability readiness
  4. Identifying infrastructure requirements
  5. Designing the solution architecture
  6. Performing a cost/benefit analysis

DR Design Options

I wanted to explore a number of options for DR design:

  1. Active/Active
  2. Active/Hot DR
  3. Active/Warm Passive DR (Standby)
  4. Active/Cold DR Recovery from Disk
  5. DR to Cloud
  6. Azure
  7. Cloud Only

 Disaster Recovery vs Disaster Avoidance

Disaster Recovery Technology Options

  • Network
    • Standard IP LAN
    • Load Balancer
    • WAN Optimisation/QoS
  • Storage
    • Synchronous / Asynchronous
    • SnapMirror
    • SRDF
    • FlashCopy
  • Application
    • SQL Replication/Mirror
  • OS
    •  Veritas
    • Microsoft Always On Clustering
  • Hypervisor
    • HA
    • FT
    • Zerto
    • EverRun
    • Manual
    • SRM
  • Data protection
    • Brick and Storage Level Backups
    • CommVault
    • Avamar/Data Domain
    • Veeam
    • Symantec
    • Tivoli Storage Manager
  • Compute
    • IBM System p PowerHA
    • Stratus
    • HP Serviceguard
  • Switch
  • Legacy 

Continuous Availability and Disaster Avoidance technology options

  • Network Virtualisation
    • Stretched VLAN – OTV / VPLS – Overlay Transport Virtualization (OTV) can be used for L2 extension between the customer’s data center and the cloud. L2 connectivity allows customers to keep the same IP addresses from the enterprise network in the cloud, so nothing has to change to access workloads in the cloud after recovery.
  • Storage Virtualisation
    • IBM SVC, EMC VPLEX, NetApp V-Series – Synchronous / Asynchronous
  • Data protection
    • Storage  Level Backup from Site 2
  • Virtualisation
    • Software Defined Networking – DMZ, Firewalls L2-L7 All virtualised 
    • VMware Metro Storage Cluster

VMware Metro Storage Cluster Requirements


  • LAN Extension Deployment Scenarios
    • Ethernet Based Solutions
      • Cisco Nexus 7000 Series vPC (Virtual Port Channel) and Cisco Catalyst 6500 Series VSS (Virtual Switching System) for MAN distances
    • MPLS Based Solutions
      • EoMPLS (Ethernet over MPLS)
      • VPLS
      • A-VPLS
      • EVPN (Ethernet VPN)
    • IP Based Solutions
      • Overlay Transport Virtualisation (OTV)
    • Fabric Solutions
      • Application Centric Infrastructure (ACI): VXLAN and Spine-Leaf architectures

