Designing Active-Active and Disaster Recovery Data Centers


This webinar covers typical design scenarios encountered when building a disaster recovery data center or deploying multiple data centers in an active-active configuration.

Last modified on 2024-03-12 (release notes)



13:57 Introduction
In the first section of this webinar we'll try to figure out why we'd want to migrate application workloads between data centers, and define a few useful terms like RTO, RPO, MTTR, and MTTI.
Introduction and Definitions	13:57	2017-03-29
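To make two of those terms a bit more tangible: RPO (recovery point objective) is the maximum amount of data, measured in time, that you can afford to lose, and RTO (recovery time objective) is the maximum time you can afford to be without the service. The Python sketch below is not taken from the webinar. The Incident class, its field names, and the function names are hypothetical and only illustrate how you could compare what was actually achieved during a failure against both objectives.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical record of a single site failure (illustration only)."""
    last_replicated: datetime  # last data safely copied to the recovery site
    failure: datetime          # moment the primary site failed
    restored: datetime         # moment service resumed at the recovery site


def achieved_recovery_point(incident: Incident) -> timedelta:
    # Data written after the last replication is lost.
    return incident.failure - incident.last_replicated


def achieved_recovery_time(incident: Incident) -> timedelta:
    # Time the service was unavailable.
    return incident.restored - incident.failure


def meets_objectives(incident: Incident, rpo: timedelta, rto: timedelta) -> bool:
    # Both objectives are upper bounds; missing either one is a failure.
    return (achieved_recovery_point(incident) <= rpo
            and achieved_recovery_time(incident) <= rto)


incident = Incident(
    last_replicated=datetime(2017, 3, 29, 11, 50),
    failure=datetime(2017, 3, 29, 12, 0),
    restored=datetime(2017, 3, 29, 13, 30),
)
print(achieved_recovery_point(incident))
print(achieved_recovery_time(incident))
print(meets_objectives(incident, rpo=timedelta(minutes=15), rto=timedelta(hours=1)))
```

In this made-up incident the data loss stays within a 15-minute RPO, but the 90-minute outage misses a one-hour RTO, so the check fails.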
49:45 Typical Challenges (free items)
There are four typical reasons why you'd want to migrate application servers between data centers: data center migration, disaster recovery, disaster avoidance, and load balancing across data centers.
Disaster Recovery	6:50	2017-03-29
Disaster Avoidance	29:46	2017-03-29
Data Center Migration	2:47	2017-03-29
Load Balancing Across Data Centers and Cloudbursting	10:22	2017-03-29
28:17 Limitations and Considerations
A number of factors limit our ability to deploy servers across multiple data centers: latency, bandwidth limitations, and data gravity.
Latency	8:09	2017-03-29
Limited Bandwidth	10:55	2017-03-29
Storage Considerations	9:13	2017-03-29
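A quick back-of-the-envelope calculation shows why latency is such a hard limit. The Python sketch below assumes that light covers roughly 200 km of fiber per millisecond and that every synchronously replicated write has to wait for at least one round trip to the other data center. The function names and the sample numbers are hypothetical, and the model ignores equipment delays and the fact that fiber routes are longer than the geographic distance, so real numbers will be worse.

```python
# Assumption: signal propagation in fiber is about 200 km per millisecond.
FIBER_KM_PER_MS = 200.0


def round_trip_ms(distance_km: float) -> float:
    """Best-case round-trip propagation delay between two sites."""
    return 2 * distance_km / FIBER_KM_PER_MS


def max_serial_sync_writes_per_second(distance_km: float,
                                      local_write_ms: float) -> float:
    """Upper bound on back-to-back synchronous writes from a single writer.

    Each write takes the local write time plus one round trip to the
    remote site before it can be acknowledged.
    """
    return 1000.0 / (local_write_ms + round_trip_ms(distance_km))


for km in (10, 100, 1000):
    print(km, round_trip_ms(km), max_serial_sync_writes_per_second(km, 0.5))
```

The point of the exercise: no amount of bandwidth changes the round-trip time, so an application issuing many sequential synchronous writes slows down as the data centers move apart.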
16:12 Typical Solutions
Well-designed active-active applications use "swimlanes", a concept where multiple copies of an application stack reside in different locations.
Parallel Application Stacks (Swimlanes)	16:12	2017-03-29
Describing Fault Domains
A great introduction to fault domains, fault levels, cascading failures, and fault hierarchy.
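One way to picture the swimlane idea is a function that pins every user to one copy of the application stack and moves the user to another copy only when the preferred one is unavailable. The Python sketch below is an illustration under that assumption, not the design described in the webinar. The swimlane names, the hash-based assignment, and the function names are all hypothetical.

```python
import hashlib


def home_swimlane_index(user_id: str, lane_count: int) -> int:
    # A stable hash keeps the same user in the same swimlane across requests.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % lane_count


def home_swimlane(user_id: str, swimlanes: list) -> str:
    if not swimlanes:
        raise ValueError("at least one swimlane is required")
    return swimlanes[home_swimlane_index(user_id, len(swimlanes))]


def pick_swimlane(user_id: str, swimlanes: list, healthy: set) -> str:
    """Return the user's home swimlane, or the next healthy one after it."""
    if not swimlanes:
        raise ValueError("at least one swimlane is required")
    start = home_swimlane_index(user_id, len(swimlanes))
    for offset in range(len(swimlanes)):
        lane = swimlanes[(start + offset) % len(swimlanes)]
        if lane in healthy:
            return lane
    raise RuntimeError("no healthy swimlane available")


lanes = ["dc-a", "dc-b", "dc-c"]  # hypothetical swimlane names
print(pick_swimlane("alice", lanes, healthy=set(lanes)))
print(pick_swimlane("alice", lanes, healthy=set(lanes) - {home_swimlane("alice", lanes)}))
```

A request handled this way stays within a single swimlane, which keeps the failure of one application stack from spreading to the others.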
31:34 Long-Distance VM Mobility Challenges (free items)
Instead of redesigning applications to make them work across multiple data centers, enterprise environments typically try to solve the challenges within the infrastructure, sometimes even moving running servers between data centers. This section describes the most obvious drawbacks of that idea.
Inter-DC vMotion Bandwidth	6:04	2017-05-03
Large Layer-2 Domains	9:27	2017-05-03
Ingress and Egress Traffic Flows	16:03	2017-05-03
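To get a feeling for the bandwidth side of moving running servers, you could estimate how long it takes to copy the memory of a virtual machine across the inter-DC link. The Python sketch below uses an idealized pre-copy model, which is an assumption and not something taken from the webinar: memory is copied while the VM keeps dirtying pages at a constant rate, so the copy converges only when the link is faster than the dirty rate. The function name and the sample numbers are hypothetical, and the model ignores protocol overhead and the final stop-and-copy phase.

```python
import math


def precopy_seconds(memory_gb: float, link_gbps: float,
                    dirty_gbps: float = 0.0) -> float:
    """Idealized time to copy VM memory over a link.

    memory_gb  -- VM memory in gigabytes
    link_gbps  -- usable link bandwidth in gigabits per second
    dirty_gbps -- rate at which the VM rewrites memory, in gigabits per second

    Repeated copy rounds form a geometric series whose sum is
    memory / (link - dirty); it diverges when dirty >= link.
    """
    if dirty_gbps >= link_gbps:
        return math.inf
    return memory_gb * 8 / (link_gbps - dirty_gbps)


print(precopy_seconds(64, 10))       # idle VM, 10 Gbps link
print(precopy_seconds(64, 10, 2))    # VM dirtying memory at 2 Gbps
print(precopy_seconds(64, 1, 2))     # link slower than the dirty rate
```

Even in this optimistic model a single busy VM can keep a 10 Gbps link occupied for about a minute, and the move never completes if the link is slower than the rate at which the VM changes its memory.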
42:37 Summary & Questions
Time for a wrap-up. We'll discuss the right way of doing things, surviving infrastructure failures, and typical real-life designs.
Surviving the Failures	15:37	2017-05-03
The Right Way of Doing Things	10:12	2017-05-03
Typical Real-Life Designs	9:05	2017-05-03
Summary and Questions	7:43	2017-05-03
1:27:00 Lessons Learned Operating Active-Active Data Centers
Networking and virtualization vendors keep proposing crazier and crazier ideas that are supposed to allow you to run active-active data centers without touching the application architecture. Not surprisingly, most of them fail disastrously under the right failure conditions. If you want to have a highly available application, there's simply no substitute for good design, including global and local load balancing. In his presentation, Ethan Banks described the architecture he used when running multiple data centers for a large credit card payment processor, and the lessons he learned while operating them.
Definitions and Typical Setup	7:44	2016-10-09
Internet Edge, DNS, and BGP	16:08	2016-10-09
Firewalls	11:15	2016-10-09
Load Balancers	14:07	2016-10-09
Core Network	20:22	2016-10-09
High-Level Comments and Conclusions	17:24	2016-10-09
Slide Deck
Designing Active-Active and Disaster Recovery Data Centers	11M	2015-11-07
36:07 From the ipSpace.net Design Clinic
Migrating Application Stacks into Public Clouds	16:36	2021-12-27
Running Applications in Multi-Cloud Environment	19:31	2022-05-30
Additional Resources
The blog posts, articles, and books collected in this section might help you get a broader perspective on high-availability application architectures.
Application Design and Operations
Scalability Rules: Principles for Scaling Web Sites (2nd Edition)
A must-read book for anyone interested in robust high-availability application design.
Systems Design for Advanced Beginners
Site Reliability Engineering: How Google Runs Production Systems
More Site Reliability Engineering (SRE) resources
Disaster Recovery in AWS
High availability concepts don't change just because you're deploying your workloads in a public cloud. If anything, public clouds require cleaner architectures as they don't support enterprise kludges like layer-2 DCI. It's therefore worth reading the series of articles describing disaster recovery solutions within AWS.
Strategies
Architecture and Patterns
Backup and Restore
Pilot Light and Warm Standby
Multi-site Active/Active
Implementing Multi-Region Disaster Recovery Using Event-Driven Architecture
Disaster Recovery with AWS Services
AWS published several blog posts describing how you could use AWS services in a disaster recovery process. These documents are obviously self-serving, but you might find them valuable should you decide to deploy your workload on AWS, or you could use the same concepts when implementing disaster recovery in a different environment.
Disaster Recovery with AWS Managed Services (Single Region)
Multi-Region Backup and Restore
AWS Multi-Region Application Architecture with AWS Services
Part 1: Compute, Networking, and Security
Part 2: Data and Replication
Part 3: Application Management and Monitoring
Minimizing Dependencies in a Disaster Recovery Plan
Load Balancing and Service Discovery
Load balancing in Google network
Building a billion user load balancer (Facebook)
Ananta: Cloud Scale Load Balancing (Microsoft Azure)
GitHub Load Balancer
A quick intro to Consul
DNS-based Load Balancing with NSONE (podcast)
Redundancy and Resiliency
Redundant network designs usually use 1+1 redundancy. Applications (at least the database layer) are usually no better. However, 1+1 redundancy might not be good enough, and too much redundancy might decrease the overall availability.
1+1 Redundancy Just Isn’t Good Enough
Gray failures: the Achilles’ heel of cloud-scale systems
Why Shared Mutable State Is the Root of All Evil
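The arithmetic behind those claims is easy to sketch. The Python example below uses the textbook formulas for components in parallel (the service works when at least one copy works) and in series (the service works only when every component works). It assumes that failures are independent, which gray failures and shared state tend to violate. The function names and the availability figures are hypothetical and chosen only for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def parallel_availability(component_availability: float, copies: int) -> float:
    """Availability of N redundant copies, assuming independent failures."""
    return 1 - (1 - component_availability) ** copies


def serial_availability(availabilities) -> float:
    """Availability of a chain in which every component must work."""
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


def downtime_minutes_per_year(availability: float) -> float:
    return (1 - availability) * MINUTES_PER_YEAR


single = 0.9995                            # hypothetical single component
pair = parallel_availability(single, 2)    # 1+1 redundancy on paper
# Assumption: the redundant pair depends on a failover mechanism that is
# itself only 99.9% available, so it sits in series with the pair.
pair_with_failover = serial_availability([pair, 0.999])

for label, value in (("single", single), ("1+1", pair),
                     ("1+1 with failover", pair_with_failover)):
    print(label, round(downtime_minutes_per_year(value), 1))
```

With these made-up numbers the redundant pair looks great on paper, but once the less reliable failover mechanism is included the overall availability drops below that of the single component.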
Testing Resilient Application Stacks
Resilience Engineering: Learning to Embrace Failure
The Netflix Simian Army
Simian Army source code on GitHub
Testing in Production: Yes, You Can
AWS Fault Injection Simulator
Toxiproxy: a Framework for Simulating Network Conditions