AWS’s October 20 outage reminded us of a simple truth: resilience can’t be improvised. High availability (HA) within a single region helps, but it doesn’t replace a plan B that keeps services running when the common element too many components depend on fails (region, control plane, DNS, queues, identity, etc.).
This guide lays out a practical path from “HA” to business continuity, with patterns we see working in production—and that we at Stackscale implement with two active-active data centers. When required, these architectures combine smoothly with other providers. It all starts with the basics: define RTO/RPO and rehearse failover.
David Carrero (Stackscale co-founder): “HA is necessary, but if everything hinges on a single common point, HA fails. The difference between a scare and a crisis comes down to a rehearsed plan B.”
1) Align business and tech: RTO/RPO and a dependency map
- RTO (Recovery Time Objective): the maximum time a service can be down.
- RPO (Recovery Point Objective): the maximum data loss, measured in time, you can accept when restoring.
With those targets signed off by the business, draw the dependency map: what breaks what? Where do identity, DNS, queues, catalogs, or global tables really anchor? The output is your list of services that can’t go down—and what they depend on.
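That map is most useful when it lives as data you can check automatically, not as a diagram in a slide deck. Below is a minimal sketch (service names, targets, and dependencies are illustrative) that flags services whose RTO/RPO targets are tighter than those of something they depend on:

```python
from dataclasses import dataclass, field

@dataclass
class Service:
    name: str
    rto_minutes: int                     # maximum tolerated downtime
    rpo_minutes: int                     # maximum tolerated data loss (in time)
    depends_on: list[str] = field(default_factory=list)

# Illustrative catalogue: names, targets, and dependencies are placeholders.
catalogue = {
    "checkout":    Service("checkout",    rto_minutes=0,  rpo_minutes=0,
                           depends_on=["identity", "payments-db", "dns"]),
    "identity":    Service("identity",    rto_minutes=15, rpo_minutes=5),
    "payments-db": Service("payments-db", rto_minutes=0,  rpo_minutes=0),
    "dns":         Service("dns",         rto_minutes=5,  rpo_minutes=0),
}

def weakest_links(catalogue: dict[str, Service]) -> list[str]:
    """Report dependencies whose RTO/RPO are looser than the services built on them."""
    issues = []
    for svc in catalogue.values():
        for dep_name in svc.depends_on:
            dep = catalogue[dep_name]
            if dep.rto_minutes > svc.rto_minutes or dep.rpo_minutes > svc.rpo_minutes:
                issues.append(f"{svc.name} (RTO {svc.rto_minutes}m) depends on "
                              f"{dep_name} (RTO {dep.rto_minutes}m, RPO {dep.rpo_minutes}m)")
    return issues

for issue in weakest_links(catalogue):
    print("WEAK LINK:", issue)
```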
2) Continuity patterns that actually work
Active–active across two data centers (RTO=0 / RPO=0)
For payments, identity, or core transactions:
- Synchronous storage replication between both DCs → RTO=0 / RPO=0.
- Distributed / quorum databases, with conflict handling (CRDTs / sagas).
- DNS/GTM with real service health checks (not just pings).
- Pre-defined degrade modes (read-only, feature flags); see the sketch below.
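Degrade modes only help if they are wired into the application before the incident. A minimal sketch of a read-only switch, assuming a hypothetical DEGRADE_MODE flag toggled by the runbook or GTM automation:

```python
import os

# Hypothetical flag source: an environment variable set by the runbook or GTM automation.
def degrade_mode() -> str:
    return os.environ.get("DEGRADE_MODE", "normal")   # "normal" | "read_only"

MUTATING_METHODS = {"POST", "PUT", "PATCH", "DELETE"}

def handle_request(method: str, path: str) -> tuple[int, str]:
    """Reject writes predictably while in read-only mode instead of failing at random."""
    if degrade_mode() == "read_only" and method in MUTATING_METHODS:
        return 503, "Service temporarily read-only; please retry later"
    return 200, f"{method} {path} processed"

print(handle_request("GET", "/catalogue"))    # served in any mode
print(handle_request("POST", "/orders"))      # rejected when DEGRADE_MODE=read_only
```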
At Stackscale we offer active–active across two European data centers with synchronous replication as a mission-critical foundation. You can complement that core with third parties (hyperscalers or other DCs) if you need a third continuity path or sovereignty/localization guarantees.
Multi-site warm standby (a passive site kept warm)
For the majority of important workloads:
- Asynchronous data replication to a second site.
- Pre-provisioned infrastructure (templates / IaC) to promote site B in minutes.
- Automatic failover via DNS/GTM and rehearsed runbooks.
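Promoting site B should be a scripted, rehearsed sequence rather than an improvised procedure. A simplified sketch of the order of operations, where each function stands in for a real runbook step (IaC apply, replica promotion, DNS/GTM change):

```python
import time

# Each step is a placeholder for the real action in your runbook.
def apply_infrastructure_templates(site: str) -> None:
    print(f"[{site}] applying pre-provisioned IaC templates")

def promote_database_replica(site: str) -> None:
    print(f"[{site}] promoting asynchronous replica to primary")

def smoke_test(site: str) -> bool:
    print(f"[{site}] running business-level smoke tests")
    return True

def point_dns_to(site: str) -> None:
    print(f"[{site}] switching DNS/GTM records to this site")

def fail_over_to(site: str) -> float:
    """Run the rehearsed promotion sequence and return the measured time in seconds."""
    start = time.monotonic()
    apply_infrastructure_templates(site)
    promote_database_replica(site)
    if not smoke_test(site):
        raise RuntimeError("smoke tests failed; do not move traffic")
    point_dns_to(site)
    return time.monotonic() - start

print(f"measured failover time: {fail_over_to('site-b'):.1f}s")
```

The measured time from each rehearsal is what you compare against the signed RTO, not the number in the design document.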
Minimum footprint at an alternate provider (also known as “pilot light”)
When you want to lower concentration risk at a reasonable cost:
- Keep the bare minimum running outside your primary provider (e.g., DNS, observability, immutable backups, break-glass identity, status page).
- Promote to active-passive or active-active only what’s truly critical, based on RTO/RPO.
Carrero: “It’s not about abandoning hyperscalers—it’s about balancing. The European ecosystem—Spain included—is mature and a strong complement for resilience and sovereignty.”
3) Three layers that move the needle (and how to handle them)
Data
- Transactional: quorum-based distribution + conflict control.
- Objects: versioning + inter-site replication, immutability (WORM/Object Lock) and, where needed, an air-gapped copy; see the upload sketch after this list.
- Catalogs/queues: avoid “global” services anchored in a single external region if your RTO/RPO can’t tolerate it.
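For the object layer, immutability means the backup copy cannot be altered or deleted until its retention period expires. A minimal upload sketch using an S3-compatible client (boto3); the endpoint, bucket, key, and credentials are placeholders, and the bucket must be created with Object Lock enabled:

```python
import base64
import hashlib
from datetime import datetime, timedelta, timezone

import boto3

# Illustrative endpoint, bucket, and credentials: replace with your S3-compatible
# object storage. The bucket must have Object Lock enabled at creation time.
s3 = boto3.client(
    "s3",
    endpoint_url="https://object-storage.example.net",
    aws_access_key_id="BACKUP_KEY_ID",
    aws_secret_access_key="BACKUP_SECRET",
)

def upload_immutable_backup(bucket: str, key: str, path: str, retention_days: int = 30) -> None:
    """Upload a backup object with a compliance-mode retention lock (WORM)."""
    with open(path, "rb") as f:
        data = f.read()
    s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=data,
        ContentMD5=base64.b64encode(hashlib.md5(data).digest()).decode(),  # required with Object Lock
        ObjectLockMode="COMPLIANCE",  # retention cannot be shortened or removed, even by admins
        ObjectLockRetainUntilDate=datetime.now(timezone.utc) + timedelta(days=retention_days),
    )

upload_immutable_backup("backups-immutable", "db/nightly/full.dump", "/backups/full.dump")
```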
Network / DNS / CDN
- Two DNS providers and GTM with business-level probes: a real “health transaction,” not a ping (see the probe sketch after this list).
- Multi-CDN and alternate origins (A/A or A/P) with origin shielding.
- Redundant private connectivity (overlay SD-WAN) between cloud and DC.
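A “health transaction” verifies that the service can actually do its job end to end, not merely answer a ping. A minimal probe sketch; the base URL and the synthetic write/read endpoints are hypothetical and should map to a real business flow (for example, creating a draft order and reading it back):

```python
import sys
import time
import uuid

import requests

BASE_URL = "https://app.example.com"   # illustrative origin behind the GTM

def health_transaction(timeout: float = 3.0) -> bool:
    """Write a marker, read it back, and verify the round trip within the timeout."""
    marker = str(uuid.uuid4())
    try:
        start = time.monotonic()
        # Hypothetical synthetic-check endpoints; adapt to a real business flow.
        w = requests.post(f"{BASE_URL}/healthcheck/write", json={"marker": marker}, timeout=timeout)
        r = requests.get(f"{BASE_URL}/healthcheck/read/{marker}", timeout=timeout)
        elapsed = time.monotonic() - start
        return w.ok and r.ok and r.json().get("marker") == marker and elapsed < timeout
    except requests.RequestException:
        return False

# Exit code consumed by the GTM or external monitor: 0 = healthy, 1 = fail over.
sys.exit(0 if health_transaction() else 1)
```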
Identity & access
- IdP with signing-key (JWKS) caching and contextual re-authentication (see the caching sketch after this list).
- Break-glass accounts outside the failure domain, protected by strong MFA (hardware keys).
- App governance to stop consent phishing and OAuth abuse.
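Caching the IdP’s signing keys (JWKS) locally lets token validation keep working through a short IdP outage. A simplified sketch with an illustrative JWKS URL and a file-based cache that falls back to stale keys when the IdP is unreachable:

```python
import json
import time

import requests

JWKS_URL = "https://idp.example.com/.well-known/jwks.json"   # illustrative IdP endpoint
CACHE_PATH = "/var/cache/app/jwks.json"                       # illustrative cache location
MAX_AGE_SECONDS = 3600                                        # refresh hourly under normal conditions

def load_jwks() -> dict:
    """Return the JWKS, preferring a fresh fetch but falling back to the stale cache."""
    try:
        with open(CACHE_PATH) as f:
            cached = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        cached = None

    if cached is not None and time.time() - cached["fetched_at"] < MAX_AGE_SECONDS:
        return cached["jwks"]                 # cache still fresh, no IdP call needed

    try:
        resp = requests.get(JWKS_URL, timeout=3)
        resp.raise_for_status()
        jwks = resp.json()
        with open(CACHE_PATH, "w") as f:
            json.dump({"fetched_at": time.time(), "jwks": jwks}, f)
        return jwks
    except requests.RequestException:
        if cached is not None:                # IdP unreachable: keep validating with stale keys
            return cached["jwks"]
        raise                                  # no cache at all: fail closed
```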
4) Observability, backups, and drills: without these, there’s no plan B
- Observability outside the same failure domain: at least one mirror of metrics/logs and a status page that don’t depend on the primary provider.
- Immutable backups and timed restore drills, with a recent successful restore on record (see the drill sketch after this list).
- Quarterly gamedays: region/IdP/DNS/queue/DB failures. Measure real RTO, dwell time, and MTTR.
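Restore drills only count if they are timed and compared against the agreed targets. A minimal sketch where restore_backup() stands in for the real restore procedure and the RTO target is illustrative:

```python
import time
from datetime import datetime, timezone

RTO_TARGET_MINUTES = 60   # illustrative target agreed with the business

def restore_backup(backup_id: str) -> None:
    """Placeholder for the real restore (e.g. re-hydrating the DB from an immutable copy)."""
    time.sleep(1)   # simulate work

def timed_restore_drill(backup_id: str) -> dict:
    """Run a restore, time it, and record whether the agreed target was met."""
    start = time.monotonic()
    restore_backup(backup_id)
    minutes = (time.monotonic() - start) / 60
    result = {
        "backup_id": backup_id,
        "restored_at": datetime.now(timezone.utc).isoformat(),
        "restore_minutes": round(minutes, 1),
        "meets_rto": minutes <= RTO_TARGET_MINUTES,
    }
    print(result)   # feed this into the drill log and the executive metrics
    return result

timed_restore_drill("nightly-full")
```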
5) Two reference architectures (and how they fit with Stackscale)
All-in with Stackscale continuity
- DC A + DC B (Stackscale) in active–active with synchronous replication (RTO=0/RPO=0).
- Immutable backups in a third domain (another DC or isolated object storage).
- Multi-provider DNS/DNSSEC + GTM with business health checks.
- Observability mirrored out (dedicated provider or a second Stackscale site).
- Runbooks and regular gamedays with local, 24/7 support.
Hybrid continuity (Stackscale + another location/provider)
- DC A + DC B (Stackscale) active–active for the core.
- Minimum footprint at another provider for DNS, status, logs/SIEM, and immutable object storage (or vice versa).
- Non-critical workloads at the other location, with DR back to Stackscale if sovereignty or cost requires it.
- Private connectivity and portable policies (identity, logging, backup).
What Stackscale brings (without the sales pitch):
- Two European data centers with low latency, redundant power & networks, and nearby 24/7 support.
- Synchronous replication for mission-critical apps (RTO=0/RPO=0).
- High-performance storage (block/object) with versioning and WORM lock (Object Lock) for immutable backups.
- Bare-metal and private cloud to consolidate workloads, plus dedicated connectivity with carriers and public clouds via partners.
- Easy integration with external providers (DNS, CDN, observability, hyperscalers) for hybrid or selective multicloud strategies.
6) A realistic 30-60-90 day roadmap
Days 1–30
- Get RTO/RPO approved per service.
- Build the dependency & global-anchor map.
- Immutable backups and your first timed restore.
Days 31–60
- Multi-provider DNS/GTM, multi-CDN, and observability outside the same failure domain.
- Minimum footprint (emergency identity, status, SIEM).
- First gameday (DNS + DB/queues).
Days 61–90
- Active–active or warm standby between Stackscale’s two DCs.
- Integrations with third parties (if needed).
- Full-region failure gameday and metrics review.
7) Executive metrics that actually say something
- % of services with signed RTO/RPO and met in drills.
- Restore OK (last 30 days) and average restore time.
- Failover time (gamedays) and dwell time by scenario.
- Observability coverage “outside” the failure domain (yes/no by domain).
- “Global” dependencies with alternatives (yes/no).
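These figures can be derived directly from drill records instead of being reported by hand. A small sketch; the drill-log structure and values are illustrative:

```python
# Illustrative drill log: one entry per service and gameday/restore drill.
drills = [
    {"service": "checkout",  "rto_signed": True,  "rto_met": True,  "failover_minutes": 4},
    {"service": "identity",  "rto_signed": True,  "rto_met": False, "failover_minutes": 22},
    {"service": "reporting", "rto_signed": False, "rto_met": False, "failover_minutes": 95},
]

signed_and_met = [d for d in drills if d["rto_signed"] and d["rto_met"]]
pct_signed_and_met = 100 * len(signed_and_met) / len(drills)
avg_failover = sum(d["failover_minutes"] for d in drills) / len(drills)

print(f"services with signed RTO/RPO met in drills: {pct_signed_and_met:.0f}%")
print(f"average failover time across drills: {avg_failover:.0f} min")
```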
Final word: resilience with a clear head (and without lock-in)
The lesson isn’t to “flee the cloud,” but to design for failure and reduce the concentration of single points of failure. HA is necessary but not sufficient: you need a complete alternate route to the same outcome. With two active–active DCs (RTO=0/RPO=0) as the base and continuity layers (DNS, immutable copies, observability, and break-glass identity) kept outside the same failure domain, your platform stays up when a provider or region stumbles.
At Stackscale, we support that transition every day. And when continuity goals or regulations call for it, we combine our two-data-center infrastructure with other providers. That way, plan B isn’t a PDF in a drawer—it’s a path you’ve already walked.