How to Create the Optimal Disaster Recovery Architecture
We have worked through designing and configuring a backup strategy. We have dispatched others on a quest to define their needs and roles in a disaster situation.
Now we need to focus on the critical task for the IT admin: architecting a solution that will carry technology systems through a disaster. The most appropriate label for this portion is “business continuity”.
We want to enable our systems to maintain enough functionality to support business processes through adverse situations. This expands well beyond simple backup.
Using Secondary Sites for Business Continuity and Disaster Recovery
Geographically distant alternative operating sites are the most direct way to achieve business continuity in a disaster situation. Larger businesses may have additional locations that they can designate as secondary to others.
To qualify as a secondary or alternative site, the location must have sufficient computing capability and space for dislocated personnel to stand in for the primary site. Logistical constraints may force operations to run at diminished capacity, but set your goal at a reasonable continuation of business functions.
As we will see in the “Business Process for Disaster Recovery” section, recent technology advances have alleviated the pressure for secondary sites to match primaries.
Evaluating Secondary Site Viability
Merely owning another building does not automatically mean that you can use it as a disaster recovery location. Most importantly, the sites must have enough geographical distance between them that a single disaster won’t disable both.
A secondary site must also not require a great deal of effort to make into a functional workspace. An empty warehouse will not replace a datacenter and call center in short order, for example.
If you don’t already have business justification for a secondary site, then you might need the same employees to operate out of both locations. In that case, they must be close enough that traveling to the alternative site doesn’t constitute a hardship.
If the most probable types of concerns in your region are building fires and tornadoes, then a few miles should suffice. If hurricanes, tsunamis, wildfires, or earthquakes threaten you, then you might face a greater challenge.
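As a rough way to sanity-check separation, you could compare the great-circle distance between two sites against the footprint of the hazard you are planning for. The sketch below uses the standard haversine formula. The coordinates and the hazard radius are hypothetical placeholders, and choosing a realistic radius for your region remains a judgment call.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def far_enough(primary, secondary, hazard_radius_km):
    """True if the sites are farther apart than the assumed hazard footprint."""
    return haversine_km(*primary, *secondary) > hazard_radius_km


# Hypothetical (latitude, longitude) pairs and hazard radius, for illustration.
primary_site = (41.8781, -87.6298)
secondary_site = (41.9500, -87.6500)
print(round(haversine_km(*primary_site, *secondary_site), 1), "km apart")
print("Survives a 5 km event:", far_enough(primary_site, secondary_site, 5))
```

Straight-line distance is only a first filter. It says nothing about shared utilities, flood plains, or travel time for employees.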
Always consider the primary business function of a site. Sadly, a secondary site simply cannot save some of them. If you distribute widgets from your main warehouse and a fire eliminates all the inventory, would a backup site accomplish anything meaningful? If you can file insurance claims against the loss and redirect suppliers and carriers to another warehouse, then you can answer, “yes”. If you cannot find a way for a backup site to continue the business functions of its primary, then it adds overhead without value.
Handling Split Responsibilities
With the limitless variety of configurations, one article series cannot cover all possibilities. This section is written as if all sites perform all roles (operations, finance, computing, etc.). Reality ranges somewhere between that and locations dedicated to specific activities.
In your documentation, rather than pairing one site to another, you can match a function of a site to the location that can act as its secondary. For example, a site that is just a datacenter might fail over to a building that houses server hardware and sales staff.
The computing equipment at the alternative location could use the dedicated datacenter as its secondary, but the sales functions would need to be targeted somewhere else.
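One way to record this kind of per-function pairing is a simple lookup keyed by site and function, with a check for functions that have no usable secondary. This is only a sketch. The site names, function names, and the `find_gaps` helper are hypothetical and not part of any particular tool.

```python
# Hypothetical sites and the functions each one hosts.
site_functions = {
    "datacenter-a": {"computing"},
    "office-b": {"computing", "sales"},
    "office-c": {"sales"},
}

# (site, function) -> site that takes over that function in a disaster.
failover_map = {
    ("datacenter-a", "computing"): "office-b",
    ("office-b", "computing"): "datacenter-a",
    ("office-b", "sales"): "office-c",
}


def find_gaps(site_functions, failover_map):
    """Return (site, function) pairs that have no secondary, or whose
    listed secondary does not host that function."""
    gaps = []
    for site, functions in site_functions.items():
        for function in sorted(functions):
            target = failover_map.get((site, function))
            if target is None or function not in site_functions.get(target, set()):
                gaps.append((site, function))
    return gaps


print(find_gaps(site_functions, failover_map))
```

In this example, the sales function at `office-c` is reported because nothing has been designated to take it over.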
Planning Hot Secondary Sites
A hot site can take over for a primary site very quickly. It has sufficient hardware onsite, is already operational, and receives regular data updates directly from the primary site. Enabling such a feat requires detailed planning, high-quality equipment, frequent maintenance, and constant monitoring.
You will need regular, perhaps permanent, onsite staff. That staff must know how to keep the inter-site replication operating and how to fail over from the primary and back.
As you might expect, this level of functionality carries a significant cost. It works best in companies that have enough resources and volume to justify multiple locations even before considering business continuity. To operate as a hot secondary site, the location must have:
- Sufficient connectivity to the primary site during normal operations to support replication; measure speed and stability
- Server hardware powerful enough to operate in the absence of the primary site
- Physical space for personnel
- Suitable connectivity for failover conditions; think of computer and voice networking
An upcoming article dives deeper into replication. For now, understand that a hot site needs nearly constant data updates from its main site. That means that you need a fast and sturdy data connection between them. At the high end, you can order direct fiber runs between locations.
Research the available options for point-to-point services. If you cannot find or afford such services, you will need to use general Internet connectivity instead. If possible, utilize two different providers per site. For maximum value, separate providers should use different infrastructure.
It greatly reduces your redundancy if both follow the same circuits to your building or if they route through the same intermediate facilities. Rural installations have great susceptibility to outages from ditch-digging accidents. Identify such concerns and plan mitigations and workarounds.
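To get a first estimate of how fast the inter-site link needs to be, you can work backward from how much data changes per day. The sketch below is plain arithmetic, not a sizing method from any vendor. The daily change volume, the replication window, and the `headroom` multiplier are all assumptions that you would replace with your own measurements.

```python
def required_mbps(changed_gb_per_day, replication_hours=24.0, headroom=1.5):
    """Estimate the sustained link speed needed, in megabits per second.

    changed_gb_per_day: data that changes each day, in decimal gigabytes
    replication_hours:  hours per day available for replication traffic
    headroom:           assumed multiplier for protocol overhead and bursts
    """
    megabits = changed_gb_per_day * 8 * 1000  # gigabytes -> megabits
    seconds = replication_hours * 3600
    return megabits / seconds * headroom


# Hypothetical workload: 500 GB of changed data per day.
print(round(required_mbps(500), 1), "Mbps around the clock")
print(round(required_mbps(500, replication_hours=8), 1), "Mbps in an 8-hour window")
```

A figure like this only covers replication. User traffic during a failover needs to be estimated separately.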
Place a premium on data security implementation. If you can afford point-to-point technology, then you have a lower risk profile for data interception.
For the greatest protection, encrypt traffic as it traverses sites. Have devices under your control perform the encryption and decryption. Even lower-end equipment frequently supports site-to-site VPN technology. Forcing all traffic that crosses the line through an encrypted tunnel prevents the need to police all communications separately.
As a bonus, you can alleviate the CPU load on computing equipment by allowing your replication software to skip its own encryption functions.
Be mindful of the computing and data storage needs of a hot site. It will require at least as much as the primary, and perhaps more. It may become a “data dump” for archival purposes.
As a secondary site ages without handling a catastrophe, it might find some of its resources “temporarily” repurposed. You will probably not have any real power to stop that from happening, and these “temporary” activities tend to become permanent.
Make certain to maintain a minimum level of functionality and capacity at each secondary site.
Employee spaces need to be ready to accept personnel at any time. Prepare them like any other work site. They need:
- Power
- Water
- Lighting
- Seating
- Desktop computing
- Voice support
- Air handling
You might face some struggles acquiring maintenance support to make this viable. While the data recovery portions of a plan obviously fall to IT, these types of business continuity responsibilities fall outside its purview.
Your business managers will be reluctant to devote resources like this to any building that does not have a continuous personnel presence.
Even if you get sign-off in the first year, that does not preclude someone from looking back in a few years and deciding that it was wasteful and that the resources should go somewhere else. In those cases, you might lose your alternative site entirely.
If you are uncertain that you can maintain a hot site into perpetuity, strongly consider implementing a warm or cold site instead.
Planning Warm Secondary Sites
Warm sites mainly differ from hot sites in the lack of continuous data updates. We only treat that as a convention, not an unbreakable definition. In practical usage, a warm site may simply mean the closest that an organization gets to having a hot site. A warm site has two major distinctions from a hot site:
- A warm site needs more than a few minutes’ effort to resume operations from the primary
- The inter-site network connection between a primary site and a warm site does not need to pass any special quality tests
Because a warm site does not receive continuous updates, you must have a plan in place to transfer data to the site when needed.
You can achieve that by having employees transport backup tapes or drives to the site and restoring them on the hardware there. You can relay data through a cloud provider. Since your plan cannot depend on the presence of any specific individual, use the most generic descriptions and instructions possible.
Anyone that the task might fall to must understand their responsibilities before needing to undertake them.
A warm site still needs to meet all the other tests that apply to a hot site. If it can’t function as an alternative location, then it fails entirely.
However, you have more flexibility, because the architecture and definition of a warm site include an expectation that it will take some time to spin up. To distinguish itself from a cold site, it must have adequate onsite computing capability to resume business functions from the primary site.
Planning Cold Secondary Sites
Cold sites have the widest definition of the three alternative sites. Anything that could replace primary site functionality can qualify. Like warm sites, they lack an active replication scheme. They differ from warm sites in that they do not contain enough computing hardware. Such a site requires significantly less cost and effort to maintain, especially at hardware refresh intervals.
These savings come with a risk trade-off. If you lose your primary site due to a localized building fire, then you can probably get replacement hardware quickly. If the calamity is widespread and affects a large number of businesses, you might face significant supply and delivery challenges.
At the same time, if both your primary and secondary sites exist within the danger zone, you might work from the odds that one of the buildings remains usable. In that situation, it might make sense to only gamble with the contents of one facility.
A cold site must pass most of the non-computing tests of a hot site without the always-on restrictions. The time waiting for computers to arrive and data restoration to be completed also gives you time for office furniture delivery.
The power and environmental systems must function before people start, so find out if your utility companies can make that happen quickly enough that you do not need to maintain them when not in use.
Cold sites require a meaningful amount of time to begin work. They reduce your ability to continually conduct business. Another upcoming article will explore the technologies that can greatly mitigate these shortcomings.
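The conventions used in the three site descriptions above can be summarized with two yes/no questions: does the site receive continuous data updates, and does it hold enough computing hardware to take over? The sketch below encodes that convention only. As noted earlier, these labels are not unbreakable definitions, and treating a site that replicates but lacks hardware as cold is an assumption made here for completeness.

```python
from enum import Enum


class SiteTier(Enum):
    HOT = "hot"
    WARM = "warm"
    COLD = "cold"


def classify_site(continuous_replication, sufficient_onsite_hardware):
    """Classify a secondary site using the conventions in this article."""
    if sufficient_onsite_hardware and continuous_replication:
        return SiteTier.HOT
    if sufficient_onsite_hardware:
        return SiteTier.WARM
    # Assumption: without enough hardware, the site is cold either way.
    return SiteTier.COLD


print(classify_site(True, True).value)
print(classify_site(False, True).value)
print(classify_site(False, False).value)
```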
Ongoing Maintenance for Secondary Sites
If all goes well, you will never need to use a secondary site. Unfortunately, such good fortune can also cause a loss of interest and long-term unwillingness to sink further funds into it. You must include all secondary locations in the regular updates of your plan.
Ask:
- Does the site still have sufficient hardware to take over for the primary?
- Do we know that power, water, lighting, and environmental systems function?
- Do current employees know how to get to the site?
- Do current employees understand their role in transitioning to the alternative site?
- Do we have monitoring in place that guarantees the quality of replication?
- Are we replicating everything? Have we added any systems since we last answered this question?
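The last question in the list lends itself to a mechanical check: compare the systems you know about against the systems that are actually being replicated. The sketch below assumes you can export both lists from your own inventory and replication tooling. The system names are hypothetical.

```python
def unreplicated_systems(inventory, replicated):
    """Systems present in the inventory but missing from replication."""
    return sorted(set(inventory) - set(replicated))


def orphaned_replicas(inventory, replicated):
    """Replication jobs for systems that no longer appear in the inventory."""
    return sorted(set(replicated) - set(inventory))


# Hypothetical exports from an inventory list and a replication job list.
inventory = ["erp-db", "file-server", "mail", "crm"]
replicated = ["erp-db", "file-server", "mail", "old-intranet"]

print("Not replicated:", unreplicated_systems(inventory, replicated))
print("No longer in inventory:", orphaned_replicas(inventory, replicated))
```

Running a comparison like this at each plan review catches systems that were added since the question was last answered.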
Site maintenance goes well beyond the functions of IT. Keep the relevant departments invested. When you perform reviews of the disaster recovery plan, invite them to provide updates.
Analyzing Disaster Recovery Hardware Needs
In a perfect world (perfect except for disasters, anyway), you would establish secondary sites as complete mirrors of their primaries. Budgets and managerial tolerances rarely make that possible. So, you’ll need to document hardware needs to enable disaster recovery.
If you can afford secondary sites, then determine acceptable hardware levels. Whether you have only one site or many, you must have access to the necessary hardware to make disaster recovery possible.
End-User Infrastructure and Systems
Some things have no room for reduction. Every knowledge worker will need a computing device. Every one of those devices will need some way to attach to the network. On the non-technical side, each person will need a chair and a work surface.
The business managers responsible for related operations will need to participate in planning. They can provide headcounts and need assessments relevant to replacement and secondary site concerns.
End-user networking will require a physical survey of any secondary sites. You can determine port counts easily, but even a cold site should not wait for cabling.
You might also uncover conditions that dictate a different deployment strategy, such as per-floor local hardware instead of home runs from each endpoint.
Inter-site and Internet connectivity need planning as well. If you want the secondary to act as a hot site, you will need enough bandwidth, reliability, and security to safely transmit data from the primary.
If the site has another use when not in business continuity mode, then its current Internet connection may not have sufficient bandwidth to accommodate overflow employees from the primary. Consult employees and have them think through what they need to conduct a normal day’s business. Plan for printing, faxing, and other needs.
Server Infrastructure and Systems
For single-site recovery planning, usually you only need the specifications of your hardware. If you buy using any sort of account with a vendor, they probably maintain a purchase history. However, they probably don’t know the purpose of any of it.
For the best results, include a hardware catalog in your disaster recovery planning. Specify the hardware’s purpose, then its specifications. For general purpose equipment that exists to extend coverage, such as end-user aggregator switches and printers, you can use locations instead.
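A hardware catalog along these lines can be kept in any format. As one possible shape, the sketch below records purpose first, then specifications, with a location used for general-purpose equipment. The field names and sample entries are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CatalogEntry:
    purpose: str          # what the hardware is for, listed first
    specifications: dict  # free-form specifications
    location: str = ""    # used for general-purpose coverage equipment


def label(entry):
    """Identify an entry by purpose, falling back to its location."""
    return entry.purpose or entry.location


def unidentified(catalog):
    """Entries with neither a purpose nor a location recorded."""
    return [e for e in catalog if not label(e)]


# Hypothetical entries, for illustration only.
catalog = [
    CatalogEntry("Primary virtualization host", {"cpu_cores": 32, "ram_gb": 512}),
    CatalogEntry("", {"ports": 48}, location="Floor 2 wiring closet"),
    CatalogEntry("", {"ports": 24}),
]

for entry in catalog:
    print(label(entry) or "UNIDENTIFIED", entry.specifications)
```

The `unidentified` check flags hardware that a vendor purchase history would list but that nobody could replace with confidence.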
If you use hot or warm secondaries, they will need to have server systems onsite. Take care when configuring standby server hardware. There will be temptation to purchase lower-powered equipment than what you have in the primary site.
Since you may need to run at reduced functionality, that seems logical. However, you may fail to receive sufficient funding for the secondary site when it comes time to refresh the hardware at the primary site. If that’s a concern, then consider using somewhat higher-end equipment than you strictly need.
You might need to add switching, routing, firewall, and load-balancing equipment for any servers that will only operate in a failover condition. Having enough network capacity between sites to enable replication does not mean that the secondary can also suddenly take on the load of dozens or hundreds of users performing their daily roles.
Inter-Site Hardware
Beginning the use of multiple sites for disaster recovery requires additional equipment. You have many architectural decisions to make, and some might challenge your networking teams’ current knowledge levels. Among the things to consider:
- Will you use point-to-point networking to enable replication?
- Will you maintain a constant direct Internet connection at the secondary, or will you have it physically connected but only have your provider turn it on when needed?
- If you have a constant point-to-point connection, will you also have constant Internet connections at the secondary?
- Will you require the remote sites to tunnel through the primary for their Internet access and only enable direct-connect in the event of an emergency?
- Do you have the necessary hardware to perform the desired functions at each location?
- Does your staff have the networking knowledge to configure this as desired?
- Will you use temporary consultants to configure things and repair them on-demand or train your staff?
Much of your decision-making will depend on how much of a shift these secondary sites represent for your organization. If you already have multiple sites and deal with these problems today, then you likely also have the expertise on hand. If you have always used a single site with simple networking, adding even one connected site can greatly complicate everything.
While the staff that you have today can certainly learn the additional functionality, you have no guarantees that they will stay. If you cannot afford to hire that level of talent into perpetuity, then consider hiring a professional networking firm to architect and maintain the inter-site links.
Disaster Recovery Hardware
With all of the talk of remote sites, networking tends to dominate the discussion. Don’t neglect the systems that will truly make disaster recovery possible. If you use tapes, make certain that you have access to tape drives that can read them. For tapes recorded last week, that’s easy. For tapes recorded in 2002, you might have to work harder.
As things transition more to using commodity hardware and online services, this concern shrinks. Look through your backup systems for anything that might require special handling if you lose the entire facility where the backup was taken. Make sure that you can restore its contents on alternative hardware.
Maximizing Disaster Recovery Architecture
The hardest question in business continuity planning: “What are we missing?” Even comprehensive guides don’t prepare you for everything. Sometimes, after going over a prepared checklist or write-up, we have a hard time thinking beyond it.
For help, review your brainstorming sessions from earlier articles. Reach out to colleagues that have a stake but have not seen what you’ve already come up with. Take a physical walk through your primary site and look for anything that wasn’t brought up in meetings.
At no point should you claim that you have “finished” your planning. Always leave a few blank lines, at least metaphorically, for more information. Add disaster recovery tie-ins to any formalized processes for starting new or updating existing projects of any kind. Start up a system for employees to suggest items that didn’t make the initial plans.
Another recent “wrinkle” in this planning is the adoption of work-from-home or work-from-anywhere schemes. Depending on your industry vertical, not everyone may need to be in an office to perform their duties.
However, this presents other challenges to include in your planning. If your solution to a burnt-out office is “everyone just works from home”, do you have the security, networking, and systems infrastructure to facilitate this? And if you do, what if the fire was larger and many of those homes have also been destroyed?
Conclusion
Establishing the optimal disaster recovery architecture involves a multifaceted approach. Having navigated through backup strategies and defined roles in disaster situations, the focus shifts to business continuity. Geographically distant secondary sites play a crucial role, requiring thorough evaluation for viability.
The distinction between hot, warm, and cold secondary sites necessitates careful planning, considering factors like hardware needs, inter-site connectivity, and ongoing maintenance. Analyzing end-user and server infrastructure, as well as maximizing disaster recovery architecture, ensures a comprehensive strategy.
The ever-evolving nature of business demands continuous review and adaptation, leaving room for improvement and innovation in disaster recovery planning.
FAQ
Steps in disaster recovery:
- Risk Assessment: Identify potential risks and assess their impact.
- Business Impact Analysis (BIA): Determine critical business functions and acceptable downtime.
- Planning: Develop a comprehensive disaster recovery plan.
- Data Backup: Regularly back up critical data offsite.
- Redundancy: Implement redundant systems and infrastructure.
- Testing: Regularly test the disaster recovery plan to ensure effectiveness.
- Training: Educate employees on their roles during a disaster.
- Documentation: Maintain up-to-date documentation of systems and procedures.
The best method for disaster recovery involves a combination of:
- Data Backups: Regularly back up critical data.
- Cloud Services: Leverage cloud platforms for data storage and application deployment.
- Redundancy: Implement redundant systems and infrastructure.
- Regular Testing: Regularly test the disaster recovery plan to identify and address potential issues.
- Automation: Use automation tools for faster recovery processes.
The duration of disaster recovery varies based on factors like the extent of the disaster, IT infrastructure complexity, and the effectiveness of the recovery plan. Organizations set a Recovery Time Objective (RTO), which can range from minutes to hours or even days, depending on business priorities and criticality. RTOs differ for each system or service, with the aim of restoring each one quickly enough to keep downtime within acceptable limits.
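Because each system carries its own RTO, a recovery test is easier to review when measured recovery times are compared against those objectives system by system. The sketch below shows one way to do that. The system names, objectives, and measured times are hypothetical.

```python
from datetime import timedelta


def rto_misses(objectives, measured):
    """Return systems whose measured recovery time exceeded the RTO.

    Systems with no measurement are included with a value of None,
    since an untested system cannot be assumed to meet its objective.
    """
    misses = {}
    for system, objective in objectives.items():
        actual = measured.get(system)
        if actual is None or actual > objective:
            misses[system] = actual
    return misses


# Hypothetical objectives and results from a recovery test.
objectives = {
    "payments": timedelta(minutes=15),
    "file-share": timedelta(hours=8),
    "intranet": timedelta(hours=24),
}
measured = {
    "payments": timedelta(minutes=40),
    "file-share": timedelta(hours=3),
}

for system, actual in rto_misses(objectives, measured).items():
    print(system, "missed its RTO:", actual if actual is not None else "not tested")
```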