Backup, DR & Ransomware Resilience

The discipline that ransomware turned from an ops concern into a security control. Vendor-neutral guide to the 3-2-1-1-0 rule, RTO and RPO, immutability via Object Lock / Vault Lock, virtual air-gapping, KMS key custody, the cloud ransomware kill chain, the native AWS / Azure / GCP backup landscape, restoration drills, and the cyber-insurance reality of 2026.

The 30-second version: Cloud ransomware doesn't just encrypt your production data - it encrypts your backups first. The discipline of backup and DR moved from operations to security the day attackers learned that destroying the recovery path is what makes a ransom non-negotiable. Immutability is the new MFA: a backup that can be deleted by any production identity is not a backup, it is a liability with a retention policy.

The cloud-native pattern that holds: 3-2-1-1-0 (three copies, two media types, one offsite, one immutable, zero errors verified by restore), a pull-based backup architecture into a separate account / subscription / project with no admin trust to production, Compliance-mode Object Lock / Vault Lock with KMS keys in a different trust domain, and a quarterly restoration drill that proves you can actually do this under stress. Everything else is documentation.

On this page

  1. Why this is a security control, not just ops
  2. The 3-2-1 rule (and the 3-2-1-1-0 update)
  3. RTO and RPO
  4. Backup vs replication vs DR vs HA
  5. Immutability - what cloud providers offer
  6. Air-gapping in cloud
  7. Encryption & KMS key custody
  8. The cloud ransomware kill chain
  9. Cloud-specific ransomware techniques
  10. Backup architectures
  11. AWS backup landscape
  12. Azure backup landscape
  13. GCP backup landscape
  14. Database DR patterns
  15. Restoration drills
  16. The Encryption Key Custody pattern
  17. Cyber insurance
  18. Third-party backup/DR platforms
  19. Tabletop exercises
  20. Maturity stages
  21. AWS / Azure / GCP comparison
  22. Common pitfalls
  23. Further reading
  24. FAQ

Why this is a security control, not just ops

For most of computing's history, backup was an operations problem. The threats were tape failures, disk failures, building fires, accidental rm -rf. The control objective was availability - can we resume serving customers? Backup teams reported to infrastructure leaders, ran on quarterly tape-rotation cadences, and were measured on whether the nightly job ran clean.

Ransomware changed all of it. Once attackers learned that a victim with viable backups will not pay, the rational economic move for the attacker became to destroy or encrypt the backups before the production data. Modern ransomware operators now spend days to weeks in a target environment specifically to find, enumerate, and corrupt every backup destination they can reach. The attack on production is the last step, not the first.

That single shift moves backup squarely into the security org. The control objective is no longer "can we recover from a server fire?" - it is "can we recover from a thinking adversary who has spent two weeks inside our cloud account learning where our backups live, who has access to delete them, and how to break the chain of custody on the encryption keys?"

The practical implications:

If your org still has backup reporting into pure infrastructure with no security review of the destination architecture, you have a 2018 program defending a 2026 threat model. Fixing it is among the highest-leverage moves any cloud security team can make.

The 3-2-1 rule (and the 3-2-1-1-0 update)

The classic 3-2-1 rule, coined by photographer Peter Krogh in 2005 for protecting digital negatives, became the lingua franca of backup practitioners because it has survived every technology change since:

  • 3 - copies of the data
  • 2 - different storage media
  • 1 - copy offsite

It held up well for two decades of tape and disk. Cloud and ransomware exposed two gaps: nothing in the rule prevented an attacker from deleting all three copies if they had the right credentials, and nothing required that the copies actually be readable. Veeam popularized the updated formulation, now widely adopted:

3-2-1-1-0:

  • 3 - copies of the data
  • 2 - different storage media (or in cloud, different storage classes / services)
  • 1 - copy offsite (cross-region, cross-account, cross-cloud, or on-prem)
  • 1 - copy immutable or air-gapped - cannot be modified or deleted within retention
  • 0 - errors. Every restore is verified. If you don't restore, you don't have a backup.

The "1" for immutability and the "0" for verification are the security additions. Both are non-negotiable in a ransomware-aware architecture. The remaining three (copies / media / offsite) translate cleanly to cloud - though the meaning of "different media" deserves a footnote.

Different media in cloud, practically: the spirit of the original rule was to avoid a single technology failure wiping all copies. In cloud, the equivalent is to avoid a single service or account failure / compromise wiping all copies. Two S3 buckets in the same account are not "two media." S3 in one account + S3 in a separate account via cross-region replication + an archive copy in S3 Glacier Deep Archive under Compliance-mode Object Lock (or in a Glacier vault under Vault Lock) is closer in spirit - distinct storage tiers, distinct trust boundaries, distinct failure modes.
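The rule translates into a check you can run against a backup inventory. A minimal sketch, assuming a hypothetical inventory where each copy records its service, account, region, lock status, and whether a test restore from it last succeeded. "Different media" is read as a distinct (service, account) pair, following the cloud interpretation above, and "offsite" is simplified to a different region or account from the primary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    """One copy of a dataset. Hypothetical inventory record, for illustration."""
    service: str                    # e.g. "s3", "s3-deep-archive", "azure-blob"
    account: str                    # account / subscription / project holding the copy
    region: str
    primary: bool = False           # True for the live production copy
    immutable: bool = False         # under a locked / Compliance-mode retention policy
    restore_verified: bool = False  # last test restore from this copy succeeded


def check_3_2_1_1_0(copies: list[Copy]) -> list[str]:
    """Return the 3-2-1-1-0 rules the copy set violates; an empty list means compliant."""
    primaries = [c for c in copies if c.primary]
    if len(primaries) != 1:
        raise ValueError("expected exactly one primary copy")
    primary = primaries[0]
    backups = [c for c in copies if not c.primary]

    gaps = []
    if len(copies) < 3:
        gaps.append("3: fewer than three copies")
    # Two buckets in one account are one "medium": count (service, account) pairs.
    if len({(c.service, c.account) for c in copies}) < 2:
        gaps.append("2: all copies share one service and account")
    if not any(c.region != primary.region or c.account != primary.account for c in backups):
        gaps.append("1: no offsite copy")
    if not any(c.immutable for c in backups):
        gaps.append("1: no immutable copy")
    # Zero errors: every backup copy must have a verified restore behind it.
    if not backups or not all(c.restore_verified for c in backups):
        gaps.append("0: a backup copy has no verified restore")
    return gaps
```

Fed a primary plus one more bucket in the same account and region, the check reports all five gaps, which is the "two S3 buckets are not two media" case in code form.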

RTO and RPO

The two numbers every backup conversation eventually returns to. They are business decisions disguised as engineering parameters; the security team's job is to translate the business intent into a defensible target, then deliver against it.

RTO - Recovery Time Objective

How long after a disaster can the service be down before the impact becomes unacceptable? Measured in elapsed time from outage to restoration. A trading platform may demand an RTO of seconds; a quarterly batch job can live with an RTO of days. RTO drives the architecture of recovery - hot standby vs warm vs cold, replicated databases vs nightly restore.

RPO - Recovery Point Objective

How much data, measured in time, can you afford to lose? An RPO of 5 minutes means you need snapshots or replication at minute-level granularity. An RPO of 24 hours means a nightly backup is sufficient. RPO drives the frequency of backup - and is the budget line most often quietly relaxed when the math gets uncomfortable.
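The arithmetic behind RPO is simple but easy to get wrong: a failure just before the next scheduled backup loses a full interval, and if the copy you must restore from is an isolated one, the time it takes the latest backup to land there adds to the loss. A minimal sketch; the copy-lag parameter is an illustrative assumption, not a provider setting, and backup run time is ignored.

```python
from datetime import timedelta


def worst_case_data_loss(interval: timedelta, copy_lag: timedelta = timedelta(0)) -> timedelta:
    """Worst-case data loss for a periodic backup schedule.

    A failure just before the next backup runs loses a whole interval; if you must
    restore from a copy that lags the source (e.g. a cross-account copy), that lag adds on.
    """
    return interval + copy_lag


def meets_rpo(interval: timedelta, rpo: timedelta, copy_lag: timedelta = timedelta(0)) -> bool:
    """True if the schedule's worst-case data loss fits inside the RPO."""
    return worst_case_data_loss(interval, copy_lag) <= rpo
```

A nightly backup meets a 24-hour RPO for the local copy, but add two hours of copy lag to the isolated account and the worst case becomes 26 hours, which no longer does.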

Setting RTO / RPO per data class

One number for the whole org is an anti-pattern: paying for an RPO of seconds on marketing brochures is money wasted. The pragmatic approach is to classify systems into tiers, define RTO / RPO per tier, and document which systems belong in each:

| Tier | Typical RTO | Typical RPO | Pattern | Relative cost (vs single-region) |
|---|---|---|---|---|
| Tier 0 - life safety / regulated | Seconds to minutes | Zero to seconds | Active-active multi-region, synchronous replication | 2-3× |
| Tier 1 - revenue critical | Minutes to 1 hour | Minutes | Active-passive multi-region, async replication + immutable backup | 1.5-2× |
| Tier 2 - business critical | 4-24 hours | 1 hour | Cross-region snapshots, restore on demand | ~1.2× |
| Tier 3 - important | 1-3 days | 24 hours | Daily backups, standard recovery | ~1.05× |
| Tier 4 - bulk / archive | 1 week+ | 7 days | Weekly backup to cold storage | ~1.01× |

The honest conversation with the business is about the cost column: every step toward Tier 0 multiplies the bill. An RTO of minutes for everything sounds great until the bill arrives; an RTO of days for everything is cheap until a tier-1 system has an actual incident. Tiering is the answer; the security team is responsible for forcing the conversation, not for dictating the answer.

Backup vs replication vs DR vs HA - four distinct disciplines

These four terms get conflated constantly. They are different things, each with a different threat model. The mature program runs all four; the immature program runs one of them and calls it the others.

| Discipline | Protects against | Does NOT protect against | Typical mechanism |
|---|---|---|---|
| High availability (HA) | Single-instance / single-AZ failure | Region failure, data corruption, ransomware | Multi-AZ deployment, load balancer, auto-scaling |
| Replication | Single-region infrastructure failure | Corruption, deletion, ransomware - all faithfully copied | Cross-region async, storage-tier replication |
| Disaster recovery (DR) | Loss of a region / site / data center | Logical corruption that replicated before failover | Warm or hot standby in second region, ASR-style orchestration |
| Backup | Corruption, deletion, ransomware, time-travel needs | Real-time failures (RTO measured in hours, not seconds) | Point-in-time snapshots, immutable retention |

The conflations that cause real outages:

Immutability - what cloud providers actually offer

Immutability is the property that, once written, an object cannot be modified or deleted until a defined retention period expires - even by the account root, even by a global admin. It is the single most important technical control against cloud-native ransomware. The major clouds each implement it differently; the details matter.

AWS - S3 Object Lock

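S3 Object Lock enforces write-once-read-many (WORM) at the object-version level. It is configured per object version or through a default retention setting on the bucket, and requires versioning on the bucket. Two modes:

  • Governance mode - protected versions cannot be overwritten or deleted by most users, but principals granted s3:BypassGovernanceRetention can override or shorten retention. Useful for testing; not a ransomware control.
  • Compliance mode - no user, including the account root user, can delete a protected version or shorten its retention until the retention period expires.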

Pair Object Lock with Glacier Vault Lock for archive tiers - Vault Lock policies are write-once and cannot be changed after the lock is engaged (even by root). The combination is the AWS gold standard for immutable archive.
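On the S3 side, the settings above come down to two API calls. A minimal sketch, assuming a boto3 S3 client is passed in; the bucket name, region, and 30-day default are placeholders. `create_bucket(ObjectLockEnabledForBucket=True)` turns on Object Lock (and with it versioning) at creation, and `put_object_lock_configuration` sets a default Compliance-mode retention applied to new object versions.

```python
def object_lock_configuration(days: int, mode: str = "COMPLIANCE") -> dict:
    """Build the ObjectLockConfiguration payload for a default bucket retention rule."""
    if mode not in ("COMPLIANCE", "GOVERNANCE"):
        raise ValueError("mode must be COMPLIANCE or GOVERNANCE")
    if days < 1:
        raise ValueError("retention must be at least one day")
    return {
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": mode, "Days": days}},
    }


def create_locked_bucket(s3, bucket: str, region: str, days: int = 30) -> None:
    """Create a bucket with Object Lock on and a default Compliance-mode retention.

    `s3` is a boto3 S3 client, e.g. boto3.client("s3", region_name=region).
    """
    kwargs = {"Bucket": bucket, "ObjectLockEnabledForBucket": True}
    if region != "us-east-1":  # us-east-1 is the default and takes no LocationConstraint
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    s3.create_bucket(**kwargs)
    s3.put_object_lock_configuration(
        Bucket=bucket,
        ObjectLockConfiguration=object_lock_configuration(days),
    )
```

Try it in a throwaway account with Governance mode and a one-day retention first: objects written under Compliance mode cannot be deleted, even by root, until retention expires, and the bucket cannot be deleted while they exist.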

Azure - Blob Immutable Storage

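Azure offers two flavors of immutability at the blob storage layer:

  • Time-based retention policies - blobs can be created and read but not modified or deleted for a set interval. A policy starts unlocked (testable, and can be shortened or removed) and must be locked to be tamper-proof; once locked, it can only be extended.
  • Legal holds - data stays immutable until the hold is explicitly cleared, with no fixed expiry.

Both can be scoped to a whole container or, with version-level immutability, to individual blob versions.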

Azure Backup Recovery Services Vaults also support soft delete (default 14 days, can be extended to 180 days and made non-bypassable) and multi-user authorization (MUA) - a Resource Guard pattern that requires approval from a second tenant for destructive operations on the vault. MUA is the Azure-native answer to "we need two-person rule on backup deletion."

GCP - Cloud Storage Bucket Lock

GCS Bucket Lock applies a retention policy to a bucket; objects within cannot be deleted or overwritten until they exceed the retention period. The retention policy itself can be locked - once locked, neither the project owner nor org admin can shorten or remove it. The lock is permanent for the lifetime of the bucket; the only way out is bucket deletion, which requires the bucket to be empty (impossible while objects are within retention).

Pair with object versioning for protection against malicious overwrites, and use VPC Service Controls to draw a perimeter around the backup project preventing data exfiltration even by privileged identities.
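With the `google-cloud-storage` Python client, setting and then locking a retention policy looks roughly like this. A minimal sketch, assuming a `Bucket` object obtained from `storage.Client().bucket(name)`; the 30-day period is a placeholder. `retention_period` is in seconds, and `lock_retention_policy()` is irreversible, so run it against a test bucket first.

```python
SECONDS_PER_DAY = 86_400


def lock_bucket_retention(bucket, days: int) -> None:
    """Apply a retention policy to a GCS bucket, then lock it permanently.

    `bucket` is a google.cloud.storage.Bucket. Locking cannot be undone: afterwards the
    period can only be increased, and the bucket cannot be deleted while it holds
    objects still under retention.
    """
    bucket.retention_period = days * SECONDS_PER_DAY
    bucket.patch()   # push the new retention policy to the bucket
    bucket.reload()  # refresh metageneration, which the lock request is conditioned on
    bucket.lock_retention_policy()
```

Called as `lock_bucket_retention(storage.Client().bucket("example-backup-bucket"), days=30)`; the bucket name is hypothetical.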

The compliance-mode guarantee

The crucial property all three providers offer in their strongest modes: once configured, the cloud platform itself enforces the retention, and not even the account root / subscription owner / project owner can override it. This is fundamentally different from "we set up IAM so admins can't delete" - which any sufficiently privileged attacker who escalates to admin can undo. Compliance-mode immutability is enforced by the provider's control plane, outside the customer's IAM domain.

The practical implication: configure immutability with retention longer than your worst-case dwell-time-plus-recovery window. A common floor is 30 days; many regulated environments run 90+. Ransomware operators can dwell for weeks to months; a 7-day immutability window is theater.
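The sizing rule above is one line of arithmetic. A minimal sketch, assuming you have an estimate of attacker dwell time and of how long detection, scoping, and restore take once the attack goes loud; the 30-day floor mirrors the common floor cited above, and the example numbers are illustrative.

```python
def min_immutable_retention_days(dwell_days: int, recovery_days: int, floor_days: int = 30) -> int:
    """Shortest lock period that still leaves a pre-compromise copy restorable.

    The last clean backup was taken just before the attacker arrived, so it must
    outlive the dwell period plus the time it takes to detect, scope, and restore.
    """
    return max(floor_days, dwell_days + recovery_days)
```

A 90-day dwell assumption with two weeks to scope and restore gives 104 days, well past the 30-day floor.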

Air-gapping in cloud - the virtual air gap

Classical air-gapping means a physical separation - backup tape stored in a vault no network can reach. Cloud has no physical air gap to offer, but it can offer a virtual one: a destination so isolated by IAM, network, and KMS boundaries that the path from production to backup is effectively one-way.

The patterns that approximate an air gap

The common thread: production cannot reach the backup destination through a normal data path. The backup service, running under its own identity in the backup account, reaches into production with read-only credentials; production never holds credentials with write or delete access to the destination.

Encryption of backups & KMS key custody

Every cloud backup is encrypted. That is necessary but not sufficient. The threat model that matters is not "the storage service was breached" - it's "the attacker has credentials in your environment and is looking for ways to render your backups unrecoverable." Key compromise is one such way.

The trust-domain principle

The KMS keys that encrypt your backups must be in a separate trust domain from the keys that encrypt production. If the same key encrypts production data and the backup of production data, an attacker who compromises that key wins twice - they can read or destroy both copies. Worse, an attacker who can delete the key can render every backup encrypted by it permanently unrecoverable without ever touching the backup data itself. AWS KMS enforces a waiting period of 7 to 30 days before a scheduled deletion completes, which buys time only if someone is watching; on Azure, a Key Vault without purge protection lets a sufficiently privileged identity purge a soft-deleted key immediately.

The pattern

This is the discipline most cloud organizations get partially right and then live with the gap until it bites. Closing the gap is rarely expensive - it is mostly a matter of moving keys into the right accounts and editing IAM. The leverage is enormous.
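A first audit for the trust-domain principle is mechanical: list the keys protecting backup resources and flag any that live in the production account or also encrypt production data. A minimal sketch, assuming you have already exported a mapping of backup resource to KMS key ARN (for example from your backup inventory); it relies only on the standard ARN layout `arn:partition:kms:region:account-id:key/key-id`.

```python
def kms_key_account(key_arn: str) -> str:
    """Account ID field of a KMS key ARN (arn:partition:kms:region:account:key/id)."""
    parts = key_arn.split(":")
    if len(parts) < 6 or parts[2] != "kms":
        raise ValueError(f"not a KMS key ARN: {key_arn}")
    return parts[4]


def shared_trust_findings(
    backup_keys: dict[str, str], prod_account: str, prod_key_arns: set[str]
) -> list[str]:
    """Flag backup resources whose encryption key sits in production's trust domain."""
    findings = []
    for resource, key_arn in sorted(backup_keys.items()):
        if key_arn in prod_key_arns:
            findings.append(f"{resource}: key also encrypts production data")
        elif kms_key_account(key_arn) == prod_account:
            findings.append(f"{resource}: key lives in the production account")
    return findings
```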

The cloud ransomware kill chain

Modern cloud ransomware doesn't look like the on-prem ransomware of 2017. The kill chain has adapted to the cloud's architecture - most importantly, to the fact that the cloud's control plane and data plane are both API-accessible to anyone with credentials.

The seven steps

  1. Initial access. Compromised credentials (most often), exposed access keys in a public git repo, a vulnerable web app providing a path to an instance profile, a phished cloud-console session, a third-party SaaS connector with broader scopes than needed. The path varies; the outcome is the same - the attacker has some cloud identity.
  2. Discovery & enumeration. aws sts get-caller-identity (or its equivalent in other clouds), then enumeration of IAM, accounts, organizations, S3 buckets, KMS keys, EBS snapshots, and RDS instances. Tools like Pacu, ScoutSuite, and ROADtools automate it. The attacker is now building a map of your environment - and crucially, finding your backups.
  3. Persistence & privilege escalation. Create additional IAM users / access keys / service principals as backdoors. Modify trust policies. Add federation. The goal is to outlive the inevitable detection of the initial compromise.
  4. Exfiltration. Modern ransomware is double-extortion - encrypt for ransom and leak for additional pressure. Data is copied out to attacker-controlled storage. In cloud, this often goes through legitimate-looking egress points to evade DLP.
  5. Encrypt the backups FIRST. This is the step that makes cloud-native ransomware brutal. The attacker - with the credentials and environment map built up in steps 2 and 3 - destroys, encrypts, or sabotages every backup destination they can reach. Snapshots deleted. KMS keys scheduled for deletion. S3 versions deleted (or versioning suspended and objects overwritten). Backup retention shortened. Cross-region replication destinations encrypted with attacker keys. Recovery Services Vaults soft-deleted, then hard-deleted after the soft-delete window. Whatever the attacker can reach, they break.
  6. Encrypt production. Now the production data is encrypted, dropped, or held. EBS volumes re-encrypted with attacker keys. RDS instances dropped. S3 objects overwritten with encrypted versions. The blast radius depends on the privilege the attacker has accumulated.
  7. Ransom note. Often delivered via the compromised cloud console, sometimes via email or Slack. Demand is sized to the perceived recovery cost; modern operators research the target's revenue and insurance posture and price accordingly.

The order of step 5 before step 6 is the most important technical fact on this entire page. If your backup architecture cannot survive step 5 - if the attacker can reach your backups with the credentials they accumulate during dwell time - then no amount of preparation in step 6 helps. The whole defensive strategy is to make step 5 fail.
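Because step 5 runs through the same control-plane APIs as legitimate administration, it leaves an audit trail. A minimal sketch of a detector over AWS CloudTrail records, assuming events arrive as parsed JSON dicts (for example from a trail or an EventBridge rule); the event-name list is illustrative, not exhaustive, and the `VersioningConfiguration` request-parameter shape is an assumption to verify against your own logs.

```python
from typing import Optional

# (eventSource, eventName) pairs that touch recovery paths. Illustrative, not exhaustive.
BACKUP_TAMPERING = {
    ("kms.amazonaws.com", "ScheduleKeyDeletion"),
    ("kms.amazonaws.com", "DisableKey"),
    ("ec2.amazonaws.com", "DeleteSnapshot"),
    ("rds.amazonaws.com", "DeleteDBSnapshot"),
    ("rds.amazonaws.com", "DeleteDBClusterSnapshot"),
    ("backup.amazonaws.com", "DeleteRecoveryPoint"),
    ("backup.amazonaws.com", "DeleteBackupVault"),
    ("backup.amazonaws.com", "UpdateRecoveryPointLifecycle"),
    ("s3.amazonaws.com", "PutBucketLifecycle"),
    ("s3.amazonaws.com", "DeleteBucket"),
}


def tampering_reason(event: dict) -> Optional[str]:
    """Return why a CloudTrail record looks like backup tampering, or None."""
    source, name = event.get("eventSource"), event.get("eventName")
    if (source, name) in BACKUP_TAMPERING:
        return f"{name} via {source}"
    if (source, name) == ("s3.amazonaws.com", "PutBucketVersioning"):
        # Assumed shape: requestParameters.VersioningConfiguration.Status
        params = event.get("requestParameters") or {}
        status = (params.get("VersioningConfiguration") or {}).get("Status")
        if status == "Suspended":
            return "versioning suspended on an S3 bucket"
    return None


def alerts(events: list[dict]) -> list[tuple[str, str]]:
    """(caller ARN, reason) for every suspicious record, in input order."""
    out = []
    for event in events:
        reason = tampering_reason(event)
        if reason:
            arn = (event.get("userIdentity") or {}).get("arn", "unknown")
            out.append((arn, reason))
    return out
```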

Cloud-specific ransomware techniques

Specific TTPs (tactics, techniques, procedures) seen across reported incidents and red-team engagements in cloud environments:

Map your detection coverage against these techniques explicitly. The detection engineering page covers the rule-writing side; the Cloud SOC page covers the response side.

Backup architectures

Two architectural choices dominate the design. Get these right and most other decisions become local optimizations; get them wrong and no amount of tool selection will save the program.

Pull-from-prod, not push-to-backup

The single most important pattern: the backup destination is the active party in the relationship, not the production environment. The backup service, running in a separate account, assumes a read-only role into production, reads the data, writes to its own immutable storage. Production identities have no credentials with write or delete on the backup destination. Compromise of production does not yield access to backups.

Why this is the right shape:

Anti-pattern: production IAM roles with write permissions on a "backup bucket" in the same account. Common, easy to set up, and useless against ransomware.
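The invariant behind pull-from-prod (no production identity can write to or delete from the backup destination) can be spot-checked against exported IAM policy documents. A minimal sketch, assuming identity-based policies as parsed JSON and a hypothetical backup bucket name. It is deliberately simplified: it ignores Deny statements, NotAction / NotResource, conditions, permission boundaries, and SCPs, and tests object actions against one representative key, so a clean result is a floor, not a proof.

```python
import fnmatch

# Actions that would let a principal destroy or weaken backups, keyed by resource type.
DESTRUCTIVE_S3_ACTIONS = {
    "s3:DeleteObject": "object",
    "s3:DeleteObjectVersion": "object",
    "s3:PutObject": "object",
    "s3:BypassGovernanceRetention": "object",
    "s3:DeleteBucket": "bucket",
    "s3:PutBucketPolicy": "bucket",
    "s3:PutBucketVersioning": "bucket",
    "s3:PutLifecycleConfiguration": "bucket",
    "s3:PutBucketObjectLockConfiguration": "bucket",
}


def _as_list(value):
    return value if isinstance(value, list) else [value]


def destructive_grants(policy: dict, backup_bucket: str) -> list[str]:
    """Destructive S3 actions that Allow statements grant on the backup bucket."""
    targets = {
        "bucket": f"arn:aws:s3:::{backup_bucket}",
        "object": f"arn:aws:s3:::{backup_bucket}/any-key",  # representative object ARN
    }
    found = set()
    for stmt in _as_list(policy.get("Statement", [])):
        if stmt.get("Effect") != "Allow":
            continue
        actions = [a.lower() for a in _as_list(stmt.get("Action", []))]
        resources = _as_list(stmt.get("Resource", []))
        for action, kind in DESTRUCTIVE_S3_ACTIONS.items():
            # IAM actions match case-insensitively; resource ARNs case-sensitively.
            if any(fnmatch.fnmatchcase(action.lower(), a) for a in actions) and any(
                fnmatch.fnmatchcase(targets[kind], r) for r in resources
            ):
                found.add(action)
    return sorted(found)
```

Run it over every policy attached to production roles and users; any non-empty result is the anti-pattern described above.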

Separate backup account / subscription / project

The destination lives in its own organizational unit. No admin trust from production. Separate IAM admins, separate KMS admins, separate logging, separate monitoring, separate billing if practical. Treat it as a tenant of one.

Sketch of the canonical architecture:

Production Account (or Sub, or Project)
   │
   ├─ Workloads, data, primary KMS keys (production trust domain)
   │
   ▼  (one-way pull, read-only role)
Backup Account (separate trust domain)
   ├─ Backup service (AWS Backup / Azure Backup / GCP Backup & DR)
   ├─ Immutable destination buckets/vaults (Object Lock / Vault Lock)
   ├─ Backup KMS keys (separate from prod keys, HSM-backed for tier-0)
   ├─ Separate IAM admins, separate KMS admins
   ├─ MFA-enforced break-glass restore role (normally disabled)
   └─ Alerting: every privileged action logged + alerted
        │
        ▼  (optional further isolation)
Out-of-cloud / cross-cloud / on-prem destination
        for tier-0 data

Layer DR on top of this - the backup architecture is independent of (and often orthogonal to) the DR architecture. A multi-region active-passive DR setup with replication serves the "regional failure" RTO; the backup architecture serves the "logical corruption" RTO. Both are needed.

AWS backup landscape

AWS has consolidated most of its backup story under AWS Backup, with several adjacent service-specific capabilities still very much in use.

The native services

Reference architecture

A solid AWS pattern for a regulated org:

Azure backup landscape

Azure splits backup and DR more explicitly than AWS does - Azure Backup for backup, Azure Site Recovery (ASR) for DR, and several service-specific protections sitting around them.

The native services

Reference architecture

Azure pattern for a regulated org:

GCP backup landscape

GCP's backup story consolidated around the Backup and DR Service (originating from Google's 2021 acquisition of Actifio), with the storage-native primitives sitting underneath.

The native services

Reference architecture

GCP pattern for a regulated org:

Database DR patterns

Databases have their own RTO / RPO economics because the data volume is usually large, the change rate is high, and consistency requirements are strict. The native patterns each cloud provides:

Read replicas across regions

Async replication of a primary to one or more read-only secondaries, optionally in other regions. Failover promotes a replica to primary. RPO is the replication lag (seconds to minutes); RTO depends on whether the promotion is automatic or manual.

Point-in-time recovery (PITR)

Continuous capture of database changes (transaction logs / change feed) enabling restore to any second within the retention window. The right primitive for "we made a bad schema change three hours ago" or "an attacker dropped a table at 3am."
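The operational decision in a PITR restore is choosing the target timestamp: just before the destructive event, and inside the window the service can still restore to. A minimal sketch, assuming you know the time of the bad change (from audit logs) and the restorable window (for Amazon RDS, `describe_db_instance_automated_backups` reports it as a `RestoreWindow`); the one-minute safety margin is an arbitrary illustration.

```python
from datetime import datetime, timedelta


def pitr_target(
    bad_event: datetime,
    earliest: datetime,
    latest: datetime,
    margin: timedelta = timedelta(minutes=1),
) -> datetime:
    """Pick a restore time just before a destructive event, inside the PITR window."""
    target = min(bad_event - margin, latest)
    if target < earliest:
        raise ValueError(
            f"clean point {target.isoformat()} is older than the retention window "
            f"(earliest {earliest.isoformat()}); fall back to an older snapshot"
        )
    return target
```

The returned value is what you would pass as `RestoreTime` to `restore_db_instance_to_point_in_time`. The error branch is the dwell-time case: when the clean point predates the PITR window, only an independent snapshot gets you back.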

Cross-region snapshots

For the cold-start scenario - recovery from a complete region loss or full corruption that propagated through replication. The snapshot is independent of the live database; restoration is slower but the trust boundary is cleaner.

Native multi-region replication

Some managed databases offer synchronous or near-synchronous multi-region replication - Spanner is the clearest example, with externally consistent transactions across continents. Aurora Global Database, Azure SQL geo-replication, and Cosmos DB multi-region writes offer similar multi-region reach with weaker consistency guarantees. The right choice when RTO must be measured in seconds.

The combination that holds across providers: native replication for HA / DR, plus immutable snapshots in a separate trust domain for ransomware resilience. The replica protects against infrastructure failure; the immutable snapshot protects against logical corruption.

Restoration drills

The single most underinvested practice in cloud backup programs. The gap between "we have backups" and "we can restore at scale within RTO under stress" is enormous and only revealed by doing the restore.

Why drills are the control, not the documentation

Every cloud provider's documentation makes restoration look like a button-click. In practice, restoration at scale runs into:

The cadence

Tabletop scenarios worth running

Capture the times, the friction points, and the missing tooling. Each drill should produce a list of runbook fixes; the program matures by closing those gaps before the next drill.
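Scoring each drill against the tier targets keeps findings comparable from quarter to quarter. A minimal sketch; the targets are the upper ends of the Tier 2 and Tier 3 rows in the RTO / RPO tier table, and the drill record fields are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta

# (max RTO, max RPO) per tier: upper ends of the Tier 2 and Tier 3 rows in the tier table.
TIER_TARGETS = {
    2: (timedelta(hours=24), timedelta(hours=1)),
    3: (timedelta(days=3), timedelta(hours=24)),
}


@dataclass
class DrillResult:
    """Hypothetical record of one restore drill."""
    system: str
    tier: int
    time_to_restore: timedelta  # outage declared -> service verified working
    data_loss: timedelta        # gap between newest restored data and the simulated failure
    manual_steps: int


def drill_findings(result: DrillResult) -> list[str]:
    """Turn one drill result into runbook findings against its tier's targets."""
    max_rto, max_rpo = TIER_TARGETS[result.tier]
    findings = []
    if result.time_to_restore > max_rto:
        findings.append(f"{result.system}: RTO missed by {result.time_to_restore - max_rto}")
    if result.data_loss > max_rpo:
        findings.append(f"{result.system}: RPO missed by {result.data_loss - max_rpo}")
    if result.manual_steps:
        findings.append(f"{result.system}: {result.manual_steps} manual steps to automate or document")
    return findings
```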

The Encryption Key Custody pattern

An advanced control increasingly seen in regulated and high-target organizations. The principle: the encryption keys for backups should be HSM-backed and accessible only via a break-glass process, not the same identity that runs daily backups.

The pattern

This pattern adds operational friction; it is reserved for the highest-impact data. Not every workload needs it. For tier-0 systems handling regulated, financial, or otherwise irreplaceable data, the friction is the point - even a fully compromised cloud account cannot decrypt the backups without independent approval.
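The independent-approval requirement can be written down as a small policy check, which helps when wiring a break-glass workflow. A minimal sketch of hypothetical rules, not a provider feature: the requester cannot approve their own request, at least two distinct approvers are required, and at least one must authenticate against a different identity tenant than production.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Approval:
    """Hypothetical approval record for a break-glass key-use request."""
    approver: str
    tenant: str  # identity tenant / IdP the approver authenticated against


def break_glass_allowed(
    requester: str,
    approvals: list[Approval],
    prod_tenant: str,
    required: int = 2,
) -> bool:
    """Allow use of the custody key only with independent, out-of-domain approval."""
    independent = {a for a in approvals if a.approver != requester}  # no self-approval
    distinct_people = {a.approver for a in independent}
    out_of_domain = any(a.tenant != prod_tenant for a in independent)
    return len(distinct_people) >= required and out_of_domain
```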

Cyber insurance - what carriers ask for now

The cyber insurance market reset between 2022 and 2025. Premiums tripled or worse, sublimits were introduced for ransomware specifically, and underwriters now ask detailed technical questions on the application. The 2026 reality is that minimum hygiene is a precondition for coverage - and certain failures lead to non-payment after a loss.

What carriers now require

What carriers will and will not pay

Common patterns in 2024-2026 carrier behavior:

The pragmatic takeaway: build to the carrier's expected controls regardless of whether you're seeking coverage; they encode the floor of reasonable cloud-ransomware hygiene in 2026. If you cannot answer "yes" to immutability, isolation, MFA, EDR, and tested restore - you have program gaps, not insurance gaps.

Third-party backup & DR platforms

Native cloud backup is excellent for within-cloud, single-cloud scenarios. Third-party platforms add cross-cloud, on-prem, SaaS coverage, centralized management, and frequently a better restore UX. The major categories:

Enterprise backup platforms

Cloud-native / SaaS backup platforms

When to add a third party on top of native

Many mature programs run both - native services for within-cloud day-to-day backup, a third party as the cross-environment, cyber-resilience-focused overlay. The cost is real; the right answer depends on where your data actually lives and what your restore scenarios look like.

Tabletop exercises

Tabletop exercises sit alongside restoration drills - the drill validates the technical capability; the tabletop validates the decision-making, coordination, and communications under pressure. Both are necessary; neither is sufficient.

What makes a tabletop useful

Cross-reference to Incident Response for full IR program design. The backup-specific tabletop is one of several scenarios a mature IR program rotates through annually.

Maturity stages

A staging model for cloud backup and DR programs:

Stage 1 - Backups exist

Some backups are running. Retention is set. Configuration may have drifted from policy. Restoration has not been tested at scale. Backups likely share the same trust domain as production. The default state of most cloud programs at year 1-2.

Stage 2 - Restored once

At least one successful restore has been completed. The team has discovered the IAM, KMS, and runbook gaps that surface only when you actually do it. Documented runbooks exist for the most critical systems. Still no immutability and still no separation.

Stage 3 - Tested quarterly

Regular drill cadence in place. Per-system, per-tier RTO / RPO defined and tested against. Time-to-restore measured and trending. Cross-region snapshots in use. Some immutability via Object Lock / Bucket Lock / immutable Blob policies, but possibly not in compliance mode.

Stage 4 - Immutable + isolated

Compliance-mode immutability on tier-0 and tier-1 data. Separate account / subscription / project for backup destinations. KMS keys in a separate trust domain. Cross-account, cross-region replication. Cyber-insurance-grade controls. Annual tabletop with cross-functional participation.

Stage 5 - Continuous validation

Backup integrity continuously verified by an isolated process. Cross-cloud copies for the most critical data. HSM-backed custody keys with break-glass access. Isolated recovery environment pre-staged. Restore-time measured continuously. The program is a competitive advantage in regulated sales.

As with most security maturity models, skipping stages is a recipe for cost and complexity without resilience. Immutability without restoration testing leaves you with tamper-proof backups you have never proven you can restore under pressure.

AWS, Azure, and GCP backup & DR capabilities side-by-side

The native services each cloud ships, reduced to a one-screen reference:

| Capability | AWS | Azure | GCP |
|---|---|---|---|
| Centralized backup service | AWS Backup | Azure Backup (Recovery Services Vault, Backup Vault) | Backup and DR Service (Actifio) |
| Object-store immutability | S3 Object Lock (Governance / Compliance), MFA delete | Blob immutable storage policies (locked / unlocked, version-level) | GCS Bucket Lock (locked retention policy) |
| Archive-tier immutability | Glacier Vault Lock | Archive tier with immutability policy | Archive storage class with Bucket Lock |
| Vault-level immutability | AWS Backup Vault Lock (Compliance mode) | Immutable vault + MUA Resource Guard | Backup and DR retention enforcement |
| Two-person rule on delete | SCP + KMS grants + MFA; no single built-in feature | Multi-User Authorization (MUA) via Resource Guard | Org policy + IAM separation; no single MUA primitive |
| DR (replication-based) | Elastic Disaster Recovery (DRS), CloudEndure heritage | Azure Site Recovery (ASR) | Backup and DR Service, custom replication patterns |
| DB point-in-time recovery | RDS / Aurora PITR (35 days), DynamoDB PITR | Azure SQL PITR (35 days) + LTR (10 years) | Cloud SQL PITR, Spanner backups, BigQuery time-travel |
| Cross-region copy | AWS Backup copy, S3 CRR, snapshot copy | GRS / RA-GRS storage, ASR cross-region | Multi-regional GCS, cross-region snapshots, Backup and DR |
| Cross-account isolation | Multi-account via Organizations + SCPs | Multi-subscription, optionally multi-tenant for MUA | Multi-project + VPC Service Controls perimeter |
| Air-gapped vault offering | AWS Backup logically air-gapped vaults | Immutable RSV + MUA across tenants | Backup and DR appliance isolation |
| HSM-backed keys | AWS CloudHSM, KMS custom key store backed by CloudHSM | Azure Managed HSM | Cloud HSM, External Key Manager |
| Ransomware-specific detection | GuardDuty Malware Protection (EC2 / EBS and S3) | Defender for Cloud (CSPM + workload), Defender for Storage | Security Command Center Premium / Enterprise findings |
| SaaS / Microsoft 365 backup | Not native - third-party (Veeam, Druva, etc.) | Native is durability, not backup - third-party recommended | Workspace native + third-party for fuller restore options |

The native services are excellent for within-cloud scenarios. They do not natively cross clouds; multi-cloud organizations still typically deploy a third-party overlay for unified visibility and cross-environment recovery. The native capabilities and the third-party platforms are complements, not substitutes, at scale.

Common pitfalls

  1. Backups in the same account / subscription / project as production. Single compromise destroys both. The most common and most consequential mistake. Move them.
  2. No immutability - or immutability in Governance mode only. Bypassable retention is theater. Compliance mode (or equivalent locked policy) is the floor.
  3. Restoration never tested at scale. "We have backups" without "we have restored" is documentation, not a control. Quarterly drills minimum.
  4. Encryption key reused for prod and backup. A single key compromise destroys both copies. Keys for backup live in a separate trust domain, period.
  5. "We use replication, so we have backup." Replication faithfully copies deletions and encryption. It is not a backup. You need both.
  6. Backup retention shorter than ransomware dwell time. A 7-day retention against a 90-day dwell-time attacker means you have nothing to restore. Size retention against the threat model, not against the storage budget.
  7. Cross-region replication into an account the attacker also controls. If the destination shares trust with the source, it's not isolation. Cross-region + cross-account, both.
  8. No alerting on backup-destination changes. Retention shortening, vault deletion attempts, KMS deletion scheduling - all high-fidelity ransomware indicators. Alert on each.
  9. SaaS data assumed to be backed up by the SaaS vendor. Microsoft 365, Salesforce, GitHub - vendors guarantee durability, not customer-specified restoration. Use a third-party SaaS backup for any business-critical SaaS.
  10. Runbook stored only in the system that's down. The restore runbook lives in Confluence; Confluence runs on the compromised cloud account; the runbook is unreachable. Copy runbooks to an out-of-band location - a printed binder, an offline wiki, a separate SaaS the IR team can reach.

Further reading

Provider documentation

Standards & guidance

Third-party platforms

Related CSOH pages

FAQ

Is replication a backup?

No. Replication copies the current state - including deletions and corruption - from primary to secondary, usually in seconds. If ransomware encrypts your primary, replication faithfully encrypts the secondary. A backup is a point-in-time copy retained on a schedule, ideally immutable, in a trust boundary the production identities cannot reach. Replication is for availability and disaster recovery from infrastructure failure; backup is for recovery from corruption, accidental deletion, malicious deletion, and ransomware. You need both; neither substitutes for the other.

Should I pay the ransom?

Engage law enforcement (FBI / your country's equivalent) and counsel before any decision; in some jurisdictions paying certain sanctioned actors is illegal. The practical reality: paying does not reliably restore data - published industry data puts full recovery after payment well under 50%, decryption tools are often slow and buggy, and paid victims are statistically more likely to be hit again. Insurance carriers increasingly will not reimburse a ransom without documented exhaustion of alternatives. The single best way to avoid having to pay is to have tested, immutable, isolated backups you can restore from within a known RTO.

How long should I keep backups?

Drive retention from three inputs: the legal / regulatory minimum (HIPAA 6 years for some records, SOX 7 years for financial, GDPR limits maxima not minima), the business RPO and recovery scenario (a ransomware dwell time of 60-90 days means a backup younger than that may already be compromised), and cost. Common pattern: hourly snapshots for 24-48 hours, daily for 30 days, weekly for 90 days, monthly for 1 year, annual for 7 years, all with immutability locks matching retention. Move long-tail copies to cold storage (Glacier Deep Archive, Azure Archive, GCS Archive) to control cost.
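That tiered schedule is a grandfather-father-son (GFS) rotation: keep the newest backup in each hour, day, week, month, and year bucket for as long as that tier's retention lasts. A minimal sketch of the pruning logic, assuming backups are taken at least hourly and approximating a year as 365 days; the tier ages come from the schedule above, using the 48-hour upper bound for hourly snapshots.

```python
from datetime import datetime, timedelta

# (bucket key, max age) per tier, following the schedule in the answer above.
GFS_TIERS = [
    (lambda t: (t.date(), t.hour), timedelta(hours=48)),         # hourly
    (lambda t: t.date(), timedelta(days=30)),                    # daily
    (lambda t: tuple(t.isocalendar())[:2], timedelta(days=90)),  # weekly (ISO year, week)
    (lambda t: (t.year, t.month), timedelta(days=365)),          # monthly
    (lambda t: t.year, timedelta(days=7 * 365)),                 # annual
]


def gfs_keep(backups: list[datetime], now: datetime) -> set[datetime]:
    """Backups a GFS policy retains; everything else is eligible for expiry."""
    keep: set[datetime] = set()
    for bucket_key, max_age in GFS_TIERS:
        newest: dict = {}
        for t in backups:
            if now - t > max_age:
                continue  # too old for this tier
            k = bucket_key(t)
            if k not in newest or t > newest[k]:
                newest[k] = t  # newest backup wins its bucket
        keep.update(newest.values())
    return keep
```

Expiry here means "no longer required by policy"; the immutability lock on each copy should still match the tier it was kept under.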

Does S3 Object Lock prevent ransomware?

It prevents deletion and overwrite of locked object versions for the duration of the lock - which is the most important property - but only if you use Compliance mode (Governance mode can be bypassed by IAM principals granted s3:BypassGovernanceRetention). Object Lock does not protect data written before the lock was configured or without a retention setting, does not prevent KMS key disablement or deletion that renders objects unreadable, and does not prevent retention being set too short before an attack. The full pattern is: Compliance-mode Object Lock (which requires bucket versioning) + replication to a separate account + KMS keys whose admin and usage roles are in a different trust domain + MFA delete on the source bucket.

How often should we test restoration?

Quarterly at minimum for any tier-1 system; monthly or continuous validation for the most critical. The gap between "we have backups" and "we can restore at scale within RTO" is enormous and only revealed by doing the restore. A useful cadence: per-system small restore monthly, full-stack tabletop quarterly, full-service restore annually (or after any major architecture change). The numbers to capture from each test: time to restore, data loss vs RPO, manual steps required, missing IAM, missing keys, missing tooling. Each finding feeds the runbook.

Do I need a third-party backup tool if my cloud provider has one?

It depends on three things: whether you're multi-cloud (native tools don't cross clouds), whether you need recovery into a different cloud or on-prem (cyber-resilience may require an out-of-cloud copy), and whether your RPO / RTO needs exceed what native tooling offers. Native services are excellent at within-cloud recovery and integrate with the rest of the cloud's IAM, KMS, and logging. Third parties (Veeam, Rubrik, Cohesity, Commvault, Druva, Clumio, HYCU) add cross-cloud, on-prem, and SaaS coverage, central management, and often a better restore UX. Many enterprises run both - native for the day-to-day, third party for cross-environment and cyber-resilience.

Is the backup account itself a target?

Yes - and the assumption that it is must drive the architecture. A modern ransomware operator's first move after gaining persistence in your production environment is to enumerate trust relationships and look for the backup destination. The backup account or subscription should have: no inbound trust from production, separate IAM admins from production, separate KMS keys with separate key admins, MFA enforced on every role with write or delete capability, alerts on every privileged action, and a break-glass restore-only role that is normally disabled.
