Incident Response & Cloud Forensics

The discipline that turns "we think something is wrong" into "we know what happened, we stopped it, we proved it to regulators, and it can't happen the same way again." A vendor-neutral practitioner's guide to the IR lifecycle in cloud, forensic readiness, evidence collection by workload type, memory and container forensics, runbooks, tabletops, retainers, and the native tooling on AWS, Azure, and GCP.



The 30-second version: Cloud incident response is the same six-phase loop you already know - Prepare → Identify → Contain → Eradicate → Recover → Lessons Learned - but every phase changes when the workload lives behind an API in someone else's data center. The instance you want to image may already be gone. The credential the attacker is using leaves a perfect log if you're capturing it, and zero log if you're not. The blast radius is whatever scope an over-privileged role has, which is usually larger than anyone thinks.

The work that makes IR survivable in cloud is done before any incident: an immutable cross-account log archive, a dedicated forensics account with pre-staged tooling, snapshot pipelines, SCPs that block evidence destruction, and runbooks practiced in tabletop. Without those, you investigate blind. With them, the playbook is largely automatable.

On this page

  1. The IR lifecycle in cloud
  2. What cloud changes about IR
  3. Forensic readiness
  4. The "log everything that matters" baseline
  5. Evidence collection by workload type
  6. Memory forensics on cloud VMs
  7. Container forensics
  8. Isolating compromised workloads
  9. Credential rotation under incident
  10. IR runbooks
  11. Tabletop exercises
  12. DFIR retainers
  13. Communication during incident
  14. Post-incident
  15. AWS, Azure, and GCP side-by-side
  16. Specialized cloud-IR tooling
  17. Maturity stages
  18. Common pitfalls
  19. Further reading
  20. FAQ

The IR lifecycle in cloud

Two reference lifecycles dominate: NIST SP 800-61r2 (Preparation, Detection & Analysis, Containment / Eradication / Recovery, Post-Incident Activity) and the SANS PICERL model (Preparation, Identification, Containment, Eradication, Recovery, Lessons Learned). They're functionally the same loop with slightly different boundaries. Either works; pick one, use its vocabulary consistently across your runbooks and your retros, and don't get religious about the seams.

Preparation

The 80% of the work. Immutable log archive, forensics account, snapshot pipelines, IR runbooks, on-call rotation, retainer contract, tabletop calendar, communications plan. If preparation is weak, every other phase is harder; if it's strong, the rest is mostly procedure.

Identification

Triage the alert. Is it real? What's the scope? What's the blast radius if the worst case is true? In cloud this is largely a question of which principal, which resources, what API activity, what data - answerable from logs if you collected them.

Containment

Stop the bleeding without destroying evidence. Quarantine the workload (don't stop it), revoke active sessions, narrow IAM, isolate the network. Two flavors: short-term ("buy us an hour") and long-term ("safe to operate while we eradicate").

Eradication

Remove the attacker's persistence - implanted IAM principals, modified Lambda code, malicious container images, backdoored AMIs, scheduled tasks, rogue OAuth grants. Cloud-specific: an attacker who got into your IdP has persistence options well beyond a single host.

Recovery

Restore service safely. Redeploy from known-good IaC, rotate keys and secrets, validate the environment matches the pre-incident known-good baseline, monitor for re-emergence. Cloud helps here - rebuild-from-Terraform is faster and cleaner than reimaging an on-prem box.

Lessons Learned

Blameless retro. What broke? What was missed? Which detections fired late? Which runbooks were wrong? Outputs: detection backfill, runbook updates, control improvements, tabletop scenarios. The phase most often skipped; the one with the highest long-term return.

The phases are not strictly sequential. You'll loop between identification and containment as scope expands; you'll re-eradicate when you discover new persistence; you'll start lessons-learned work the moment the incident commander declares the active phase over. The lifecycle is a frame, not a Gantt chart.

What cloud changes about IR

The lifecycle survives the move to cloud. The execution looks meaningfully different in four ways that show up in every cloud incident.

| Dimension | Traditional IR | Cloud IR |
|---|---|---|
| Workload persistence | A physical host is still there next week | Instances scale-in, containers restart, functions are gone in milliseconds - capture evidence at the moment of detection or lose it |
| Primary attack surface | OS-level: shell, malware, lateral movement | API-level: stolen credential calling AWS / Azure / GCP APIs from somewhere it shouldn't be |
| Primary evidence | Disk image, memory dump, endpoint telemetry | CloudTrail / Activity Log / Cloud Audit Logs, plus disk and memory when the workload is the target |
| Containment unit | Pull the cable, block at firewall, isolate the host | Detach IAM, swap to quarantine security group, revoke OAuth grants, kill role sessions |
| Blast radius | Whatever's on the network segment | Whatever the compromised role can call - often cross-account, cross-region, cross-service |
| Shared responsibility | You own everything below the rack | Provider owns hypervisor / network / storage layer; you escalate when an incident crosses it |
| Speed | Hours to days to spread | Minutes to seconds - automated exploitation can run thousands of API calls before a human is even paged |

The single biggest practical shift is that most cloud breaches are control-plane incidents, not data-plane incidents. Someone stole a credential - from a developer laptop, a leaked git commit, a compromised CI runner, a phished session - and is now making API calls. The shell on the box, if there's one at all, is incidental. Your IR program has to be biased toward identity, log analysis, and API-call correlation; the host-forensics muscle that on-prem IR teams grew over twenty years is necessary but no longer sufficient.

Forensic readiness

The hardest lesson of cloud IR is that almost everything that matters has to be set up before the incident. The attacker who has compromised your environment is not going to wait while you deploy a logging pipeline. Treat the items below as preconditions to running a viable IR program, not as nice-to-haves.

1. Immutable, cross-account log archive

The foundational control. The attacker who lands in your production account will try to disable logging and delete logs; the log archive has to be somewhere they cannot reach.

The answer to "what's the retention?" is "long enough for the longest regulatory clock that applies to you" - for most orgs, that's 365 days minimum, often 7 years for financial-services contexts.

2. Dedicated forensics account / subscription / project

A clean environment, on the same cloud as production, but in a separate account boundary. The contents:

3. Snapshot / AMI / image pipelines for evidence preservation

Scripts or runbooks that, given a workload ID, will: snapshot the disk, copy the snapshot to the forensics account, capture metadata (instance config, IAM role, network state, tags) and write it to immutable storage, all in under a few minutes. Idempotent - running twice doesn't double the work. Tested - runbooks that have never run for real don't run for real on the worst day.
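The idempotency requirement is the part most often skipped, so here is a minimal sketch of it for a single EBS volume. It assumes a client shaped like the boto3 EC2 client is passed in; the `CaseId` and `SourceVolume` tag names and the function name are hypothetical, and the cross-account copy and metadata capture are left out.

```python
from typing import Any


def preserve_volume(ec2: Any, volume_id: str, case_id: str) -> str:
    """Snapshot one EBS volume for a case, reusing an earlier snapshot if one exists.

    `ec2` is anything shaped like a boto3 EC2 client (describe_snapshots,
    create_snapshot). The CaseId / SourceVolume tag names are hypothetical.
    """
    existing = ec2.describe_snapshots(
        OwnerIds=["self"],
        Filters=[
            {"Name": "volume-id", "Values": [volume_id]},
            {"Name": "tag:CaseId", "Values": [case_id]},
        ],
    )["Snapshots"]
    if existing:
        # Idempotent path: a re-run returns the snapshot taken the first time.
        return existing[0]["SnapshotId"]

    created = ec2.create_snapshot(
        VolumeId=volume_id,
        Description=f"IR evidence, case {case_id}",
        TagSpecifications=[{
            "ResourceType": "snapshot",
            "Tags": [
                {"Key": "CaseId", "Value": case_id},
                {"Key": "SourceVolume", "Value": volume_id},
            ],
        }],
    )
    return created["SnapshotId"]
```

Passing the client in, rather than creating it inside the function, is what lets the runbook be exercised against a stub long before the worst day.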

4. Service control / organization policies that block evidence destruction

The attacker with admin in your production account should still be unable to: disable CloudTrail / Activity Log / Audit Logs, delete the log archive bucket, modify forensics IAM trust policies, change the immutable storage configuration. SCPs (AWS), management group policies (Azure), and organization policies (GCP) enforce this above the production account, so production admin doesn't grant the necessary scope to undo the controls. This is the second-most-important control after the immutable log archive itself.
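On AWS, a control like this is usually expressed as a deny-only SCP. The sketch below builds one as a plain dictionary; the action list is a starting point, not a complete inventory, and the bucket name, the exempted role ARN pattern, and the function name are all hypothetical.

```python
import json


def evidence_protection_scp(log_bucket: str, exempt_role_arn: str) -> dict:
    """Build an SCP-style document that denies log and archive tampering.

    Both arguments are placeholders: the log archive bucket name and the ARN
    pattern of the one role allowed to change these settings.
    """
    exempt = {"ArnNotLike": {"aws:PrincipalArn": exempt_role_arn}}
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyCloudTrailTampering",
                "Effect": "Deny",
                "Action": [
                    "cloudtrail:StopLogging",
                    "cloudtrail:DeleteTrail",
                    "cloudtrail:UpdateTrail",
                    "cloudtrail:PutEventSelectors",
                ],
                "Resource": "*",
                "Condition": exempt,
            },
            {
                "Sid": "DenyLogArchiveDestruction",
                "Effect": "Deny",
                "Action": [
                    "s3:DeleteBucket",
                    "s3:DeleteBucketPolicy",
                    "s3:PutBucketPolicy",
                    "s3:PutLifecycleConfiguration",
                    "s3:PutBucketObjectLockConfiguration",
                ],
                "Resource": f"arn:aws:s3:::{log_bucket}",
                "Condition": exempt,
            },
        ],
    }


if __name__ == "__main__":
    scp = evidence_protection_scp(
        "example-log-archive", "arn:aws:iam::*:role/ExampleBreakGlass"
    )
    print(json.dumps(scp, indent=2))
```

The bucket statement only helps if the policy is attached high enough in the organization to cover the account that owns the archive, so test the attachment point as well as the document.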

5. The break-glass account, ready to use

A separately credentialed identity, used only in emergencies, with permissions broad enough to investigate any account but narrow enough that abuse is detectable. MFA-required, monitored, with use that pages the security team automatically. If you can't get into the compromised account because the attacker rotated the credentials of every other admin, you need this; if you never set it up, your IR program stalls during the incident.

The "log everything that matters" baseline

Logs are not retroactive. If they weren't being captured when the incident began, no clever query recovers them. The minimum baseline that pays for itself the first time you have an actual incident:

AWS

Azure

GCP

The unifying principle: log what changes, log who did it, log from where, log against what data. The volume looks alarming on the bill the first month. The first time an incident requires you to reconstruct three months of activity for a specific compromised credential, the cost pays for itself many times over.
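A baseline is only useful if something checks it. As a hedged illustration for the AWS side, the sketch below inspects one entry shaped like an item of the `trailList` returned by CloudTrail's `describe_trails` call and reports what is missing. Which settings count as the baseline is an assumption here (all regions, organization-wide, log file validation on), and the function name is hypothetical.

```python
def trail_gaps(trail: dict) -> list[str]:
    """Return human-readable gaps for one CloudTrail trail description.

    `trail` is shaped like one item of describe_trails()["trailList"].
    The chosen checks are an assumed baseline, not a complete one.
    """
    checks = {
        "IsMultiRegionTrail": "trail does not cover all regions",
        "IsOrganizationTrail": "trail is not an organization trail",
        "LogFileValidationEnabled": "log file integrity validation is off",
    }
    # A missing field is treated the same as False: absence is a gap.
    return [message for field, message in checks.items() if not trail.get(field)]
```

An empty list means the trail meets the assumed baseline; data-event coverage lives in the trail's event selectors and would need a separate check.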

Evidence collection by workload type

Different workloads need different evidence captures. Cloud IR runbooks split by workload class because the mechanics genuinely differ.

EC2 / Azure VM / GCE instance

Closest to traditional host forensics, with the cloud adding capabilities you didn't have on-prem.

EKS / AKS / GKE pods

Containers are designed to be ephemeral and identical. Treat the container as evidence about the deployment, not as the state itself.

See also the Kubernetes page for the deeper detection and runtime-security context.

Lambda / Cloud Functions / Azure Functions

Serverless evidence is mostly side-channel. The function instance is gone; you investigate what it did, not the function itself.

S3 / Azure Blob / GCS objects

IAM credentials

Often the primary evidence. For any credential - IAM user access key, IAM role session, service principal secret, GCP service account key - the questions are the same:

Tools like Netflix Dispatch, cloudgrep, and AWS-native tools help here, but a well-indexed CloudTrail / Activity / Audit Log corpus in your SIEM is the foundation.

Memory forensics on cloud VMs

Memory forensics on cloud VMs is technically possible, useful for a narrow class of incidents, and constrained by cloud realities most on-prem tooling didn't anticipate.

The cloud-specific constraints to know:

For most cloud incidents, memory forensics is not the highest-leverage activity. Save the muscle for the cases that genuinely warrant it (fileless attacks, in-memory implants, suspected rootkits); for the rest, CloudTrail and a disk image will tell you more, faster.

Container forensics

Container forensics is its own discipline. The container is short-lived, the filesystem is layered, the logs scroll past quickly, and most of the interesting state is somewhere the container layer doesn't preserve.

Isolating compromised workloads

Containment without evidence loss is the hardest single skill in cloud IR. The instinct to "shut it down" is wrong; the right move is to isolate the workload from anything it can damage or call, while keeping it intact enough to interrogate.

The quarantine security group pattern (AWS)

A pre-built security group with no ingress rules and egress restricted to (a) your forensics VPC peering, (b) your logging endpoints, (c) Systems Manager / SSM endpoints for remote command execution. Replacing the instance's existing SGs with this one isolates the workload in seconds while leaving it running for evidence capture.
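A minimal sketch of the swap, assuming a client shaped like the boto3 EC2 client and a quarantine group that already exists. The function name is hypothetical, and the sketch only touches the groups reported on the instance itself; instances with several network interfaces may need the same change applied per interface.

```python
from typing import Any


def quarantine_instance(ec2: Any, instance_id: str, quarantine_sg: str) -> list[str]:
    """Swap an instance's security groups for the quarantine group.

    Returns the original group IDs so they can be written to the case record
    before anything else happens to the instance.
    """
    reservations = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"]
    instance = reservations[0]["Instances"][0]
    original = [group["GroupId"] for group in instance["SecurityGroups"]]

    # Skip the write if the instance is already isolated (safe to re-run).
    if original != [quarantine_sg]:
        ec2.modify_instance_attribute(InstanceId=instance_id, Groups=[quarantine_sg])
    return original
```

Recording the original groups matters twice: they are evidence of the pre-incident network posture, and they are what you restore if the alert turns out to be a false positive.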

NSG / firewall isolation (Azure / GCP)

Same pattern, different primitives. Azure: a "quarantine" NSG associated with the NIC, replacing existing NSGs. GCP: tag the instance with a quarantine network tag that matches a firewall rule denying all ingress and egress except to logged forensic endpoints.

IAM containment

Often more important than network containment in cloud breaches. If a credential is compromised, you want to:

The "running but cannot reach anything" state

The goal is a workload that is alive enough to capture from, dead enough not to cause further harm. For most cloud incidents this is achievable in under 5 minutes if the runbooks and IAM are pre-built; if they aren't, you'll spend 30 minutes deciding what to do while the attacker continues.

Credential rotation under incident

Credential rotation is necessary but rarely the first thing to do. The discipline is to capture evidence and contain first, then rotate - and to rotate correctly so you don't leave windows open.

AWS

Azure

GCP

The cross-cloud principle: disable, don't delete, until evidence is captured. A deleted credential's last-used and policy-version history is harder to recover than a disabled one.
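One window that is easy to leave open on AWS: disabling a long-term key does nothing to role sessions that were already issued, which stay valid until they expire. The usual fix is a deny policy conditioned on token issue time. The sketch below builds such a document, modeled on the AWSRevokeOlderSessions inline policy; the function name is hypothetical, and the exact document should be confirmed against current AWS documentation before use.

```python
import json
from datetime import datetime, timezone


def revoke_older_sessions_policy(cutoff: datetime) -> dict:
    """Deny every action for role sessions whose token was issued before `cutoff`."""
    if cutoff.tzinfo is None:
        raise ValueError("cutoff must be timezone-aware")
    stamp = cutoff.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Deny",
            "Action": ["*"],
            "Resource": ["*"],
            "Condition": {"DateLessThan": {"aws:TokenIssueTime": stamp}},
        }],
    }


if __name__ == "__main__":
    # Attach the result to the compromised role as an inline policy.
    print(json.dumps(revoke_older_sessions_policy(datetime.now(timezone.utc)), indent=2))
```

Sessions issued after the cutoff keep working, so legitimate workloads recover as soon as they re-assume the role, while anything the attacker minted earlier is denied.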

IR runbooks

Runbooks are the predefined sequences your team executes for the common cloud incident classes. The right set covers the 80% of incidents that look the same time after time. Each runbook should answer: who's on call, what to capture, how to contain, how to verify, who to notify.

Compromised IAM credential

The single most common cloud incident. Trigger: GuardDuty / Defender / SCC finding of unusual API activity, or a credential surfacing in a public leak (GitHub, paste sites). Sequence: identify the principal, query the credential's full CloudTrail / Activity Log history, capture and store the activity trail in immutable storage, disable the credential, revoke active sessions, scope the blast radius from the API calls made, then notify the workload owner, rotate, and audit downstream resources for attacker persistence (new IAM users, new roles, modified policies).
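The "scope the blast radius from the API calls made" step can be sketched as a small summarizer over CloudTrail records that have already been exported as dictionaries. It relies on standard CloudTrail record fields (`userIdentity.accessKeyId`, `eventSource`, `eventName`, `sourceIPAddress`, `awsRegion`, `eventTime`); the function name and the shape of the returned summary are hypothetical.

```python
from collections import Counter
from typing import Iterable


def scope_credential(records: Iterable[dict], access_key_id: str) -> dict:
    """Summarize what one access key did across a set of CloudTrail records."""
    calls, ips, regions = Counter(), Counter(), Counter()
    first_seen = last_seen = None
    for rec in records:
        if rec.get("userIdentity", {}).get("accessKeyId") != access_key_id:
            continue
        calls[f'{rec.get("eventSource", "unknown")}:{rec.get("eventName", "unknown")}'] += 1
        ips[rec.get("sourceIPAddress", "unknown")] += 1
        regions[rec.get("awsRegion", "unknown")] += 1
        stamp = rec.get("eventTime")
        if stamp:
            # CloudTrail timestamps are ISO 8601 UTC strings, so they sort as text.
            first_seen = stamp if first_seen is None or stamp < first_seen else first_seen
            last_seen = stamp if last_seen is None or stamp > last_seen else last_seen
    return {
        "event_count": sum(calls.values()),
        "first_seen": first_seen,
        "last_seen": last_seen,
        "calls": dict(calls),
        "source_ips": dict(ips),
        "regions": dict(regions),
    }
```

Calls such as `iam.amazonaws.com:CreateUser` in the `calls` summary are the ones to chase first, since they point at the attacker persistence the runbook's final step audits for.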

Exposed S3 / Storage bucket

Trigger: external researcher report, internal CSPM alert, or a finding from data-exposure scanners (GreyNoise, BinaryEdge, internal). Sequence: confirm exposure (don't trust the alert alone), enumerate accessed objects from server access logs, identify the sensitive data classes involved, fix the bucket ACL / policy / public-access-block, capture an immutable snapshot of the access logs, then escalate to the data-owner team and Legal for breach-notification analysis.
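For the "confirm exposure" step, one quick first pass is to look for wildcard-principal Allow statements in the bucket policy. The sketch below does only that, on a policy that has already been parsed into a dictionary; it ignores conditions, ACLs, and public-access-block settings, so treat a hit as a lead rather than proof. The function name is hypothetical.

```python
def public_allow_statements(policy: dict) -> list[dict]:
    """Return Allow statements in a bucket policy whose principal is a wildcard."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may appear unwrapped
        statements = [statements]

    def is_wildcard(principal) -> bool:
        if principal == "*":
            return True
        if isinstance(principal, dict):
            aws = principal.get("AWS")
            values = aws if isinstance(aws, list) else [aws]
            return "*" in values
        return False

    return [
        stmt for stmt in statements
        if stmt.get("Effect") == "Allow" and is_wildcard(stmt.get("Principal"))
    ]
```

An empty result does not clear the bucket: exposure can also come from ACLs or from a policy elsewhere in the account, which is why the runbook says not to trust any single signal.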

Crypto-mining EC2 / VM

Trigger: GuardDuty CryptoCurrency finding, billing spike alert, unusual outbound network volume to mining-pool IP ranges. Usually downstream of a compromised IAM credential that launched the workload. Sequence: snapshot the workload, capture network traffic, identify the launching principal (often a forgotten access key in a code repo), shut down the workload, then move to the credential-compromise runbook to find the root cause.

Ransomware in storage

Trigger: customer reports of inaccessible files, S3 object versioning chain showing mass-deletion, ransom note found in a bucket. Sequence: pause all auto-replication so the attacker's deletions don't propagate to backups, verify object versioning is intact, restore from versioning history or backups, identify the credential used for the deletions, follow the credential-compromise runbook for the source, and engage Legal for ransom-decision and law-enforcement notification.
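When the damage is mass deletion in a versioned bucket, each deleted key has a delete marker as its latest version, and removing that marker brings the previous version back. The sketch below turns one page of an S3 `list_object_versions` response into a restore plan; the function name is hypothetical, and it assumes the caller handles pagination and performs the actual marker removal.

```python
def restorable_deletions(listing: dict) -> dict[str, str]:
    """Map each deleted-but-recoverable key to the VersionId of its delete marker.

    `listing` is shaped like one page of an S3 list_object_versions response.
    A key is recoverable only if at least one real version of it still exists.
    """
    surviving = {version["Key"] for version in listing.get("Versions", [])}
    return {
        marker["Key"]: marker["VersionId"]
        for marker in listing.get("DeleteMarkers", [])
        if marker.get("IsLatest") and marker["Key"] in surviving
    }
```

Keys that carry a delete marker but no surviving version are the ones that have to come from backups, which is why pausing replication comes first in the sequence.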

Exfiltration via egress

Trigger: VPC Flow Logs / NSG Flow Logs anomaly, GuardDuty UnusualNetworkActivity or OutboundDataTransferAnomaly, DNS-tunneling indicators in Route 53 / Cloud DNS query logs. Sequence: contain the workload via the quarantine SG / NSG pattern, capture the workload, identify the data being transferred (via S3 access logs, database query logs, application logs), determine the principal and the credential, then run the credential-compromise + ransomware-style data-loss runbooks in parallel.
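To get a first read on where the data went, it helps to total accepted outbound bytes per destination for the suspect workload. The sketch below does this for AWS VPC Flow Logs in the default (version 2) space-separated format; custom log formats would need a different field list, and the function name is hypothetical.

```python
from collections import Counter
from typing import Iterable

# Field order of the default VPC Flow Logs format.
FIELDS = (
    "version account_id interface_id srcaddr dstaddr srcport dstport "
    "protocol packets bytes start end action log_status"
).split()


def egress_bytes_by_destination(lines: Iterable[str], src_ip: str) -> list[tuple[str, int]]:
    """Total accepted bytes sent from `src_ip` to each destination, largest first."""
    totals: Counter = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != len(FIELDS):
            continue  # header, blank line, or a custom format this sketch does not handle
        record = dict(zip(FIELDS, parts))
        if record["srcaddr"] != src_ip or record["action"] != "ACCEPT":
            continue
        if not record["bytes"].isdigit():
            continue  # NODATA / SKIPDATA records carry "-" placeholders
        totals[record["dstaddr"]] += int(record["bytes"])
    return totals.most_common()
```

The top destinations from this list are what you correlate against S3 access logs and database query logs to work out which data actually left.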

Insider / privileged-user misuse

Trigger: HR notification of an investigation, anomalous behavior from a privileged principal, or a finding that an admin's account is doing something the role permits but the human shouldn't. Sequence: do NOT alert the subject; preserve evidence; engage HR and Legal before any technical action; capture the full activity trail; coordinate any rotation / disablement with HR's timing.

Reference playbooks worth reading

Tabletop exercises

An IR program that has never practiced will not perform under stress. Tabletops are the cheapest way to discover that your runbooks reference an account that no longer exists, that the on-call rotation hasn't been updated since the last reorg, or that nobody has paged Legal in 18 months and the contact is stale.

Cadence

Who attends

Incident commander (rotating), security engineering, on-call SRE / platform team, cloud-account owners for the affected workloads, communications / PR, Legal / privacy counsel, executive sponsor (often CTO or CISO), and a designated observer who takes notes and runs the retro. For tabletops with a DFIR retainer, the firm's account team should join - they should know your environment before the real call.

Scenarios worth running

DFIR retainers

A retainer is a contracted relationship with a DFIR firm that buys you a response SLA, pre-negotiated rates, and (most importantly) a team that knows your environment before the call. The major firms in 2026:

When the math works

Retainers pay off when:

Below those thresholds, an established phone-and-email relationship with one or two firms - without a paid retainer - is often enough. The worst case is having no one to call and shopping during the incident; the second-worst is having a retainer with a firm that has never looked at your environment.

Communication during incident

The communications layer of IR is where most programs reveal their immaturity. The technical containment can go well and the public reception can still be terrible, or vice versa.

Internal

External

Your IR runbook should include a one-page summary of the regulatory clocks that apply to you - by data type, by jurisdiction, by sector. Your General Counsel should own it; your security team should know where to find it at 2am.

Post-incident

The phase most programs underinvest in, and the one with the highest long-term return.

Blameless retro

The principle, borrowed from SRE: people made the best decisions they could with the information they had. Retros that hunt for fault produce defensive participants who hide details; blameless retros produce honest accounts that surface the systemic causes. Have a designated facilitator (not the IC, not the executive sponsor), follow a written agenda, and produce a written document with the timeline, the decisions, the lessons, and the action items.

Action item categories

The owner of action items is not Security alone. Detection backfill is Security; control improvements often live with platform / engineering teams; comms updates live with Legal and Marketing. The retro that produces action items only for Security is a retro that has misdiagnosed the incident.


AWS, Azure, and GCP side-by-side

The native IR-supporting capabilities each cloud ships, reduced to a one-screen reference:

| Capability | AWS | Azure | GCP |
|---|---|---|---|
| Activity audit log | CloudTrail (org trail, mgmt + data events) | Activity Log + Diagnostic Settings | Cloud Audit Logs (Admin + Data Access) |
| Network flow logs | VPC Flow Logs | NSG Flow Logs / VNet Flow Logs | VPC Flow Logs |
| Threat detection | GuardDuty (incl. EKS, S3, RDS, runtime) | Defender for Cloud (CSPM + CWPP) | Security Command Center (Premium / Enterprise) |
| Investigation / graph | Detective | Sentinel investigation graph | SCC Investigation, Chronicle |
| SIEM / SOAR | Security Lake + partner SIEM, Security Hub | Microsoft Sentinel (SIEM + SOAR) | Chronicle SIEM + SOAR |
| Disk evidence | EBS snapshot, cross-account share | Managed disk export, snapshot copy | Persistent disk snapshot, cross-project share |
| Memory acquisition | SSM Session Manager + LiME/AVML; no hypervisor-level | Azure Run Command + AVML/WinPmem; no hypervisor-level | OS Login / Cloud Shell + AVML; no hypervisor-level |
| Container runtime | GuardDuty EKS Runtime / ECS Fargate Runtime | Defender for Containers (eBPF) | GKE Threat Detection, Workload Vuln Scanning |
| Isolation primitive | Quarantine Security Group | Quarantine NSG / NIC isolation | Network tag + firewall rule |
| Session revocation | AWSRevokeOlderSessions inline policy | Entra ID refresh-token revocation | Workspace session reset, IAM key disable |
| IR-specific service | AWS Security Incident Response (CIRT-as-a-service) | Microsoft Incident Response (paid engagement) | Mandiant (Google Cloud) consulting + retainers |
| Vendor IR contact | AWS Customer Incident Response Team (CIRT) via Support | Microsoft Detection & Response Team (DART) | Google Cloud TAM / Security Response |

Native tools are necessary but not sufficient. None of the three clouds ships a complete forensics workbench, and the specialized cloud-IR tooling covered in the next section fills the gap that all three leave open.

Specialized cloud-IR tooling

Beyond the native services, a category of tooling specifically targets cloud forensics and IR workflows:

Maturity stages

A useful staging model for a cloud IR program:

Stage 1 - Reactive

An incident happens; the team scrambles. Logs are partially captured. Runbooks are tribal knowledge. The forensics work happens in the production account. Detection is mostly post-hoc - billing alerts, customer complaints, external researcher tips. Survives small incidents, breaks on big ones.

Stage 2 - Documented

Immutable log archive is live. Forensics account exists. The most common runbooks (compromised credential, exposed bucket, crypto-mining) are written down. On-call rotation defined. A DFIR phone-and-email relationship with one firm. First tabletop completed.

Stage 3 - Practiced

Quarterly tabletops with multiple scenarios. Annual full-day exercise with Legal and execs. DFIR retainer in place; firm has been through an environment-familiarization engagement. Detection engineering matures alongside IR; detections tied to MITRE ATT&CK Cloud. SLA-tracked time-to-detect, time-to-contain.

Stage 4 - Automated

High-confidence playbooks auto-execute: GuardDuty finding → snapshot the workload, apply quarantine SG, page the on-call, file a ticket - all in seconds. Continuous tabletop via purple-team exercises. Forensics-as-code; evidence collection is reproducible from a runbook commit hash. The IR program is a competitive advantage in enterprise sales conversations about security maturity.
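The decision logic behind such a playbook is small enough to sketch. The function below takes a dictionary shaped like the detail of a GuardDuty finding event (lower camelCase fields such as `severity` and `resource.instanceDetails.instanceId`) and returns the ordered steps to run. The step names, the function name, and the 7.0 severity threshold are assumptions; the actions themselves would be implemented elsewhere.

```python
def plan_response(finding: dict, min_severity: float = 7.0) -> list[str]:
    """Return ordered playbook steps for one finding.

    Containment steps run only when the finding names an instance and meets
    the severity bar; humans are always paged and a ticket is always filed.
    """
    instance_id = (
        finding.get("resource", {})
        .get("instanceDetails", {})
        .get("instanceId")
    )
    steps = ["page_on_call", "file_ticket"]
    if instance_id and finding.get("severity", 0) >= min_severity:
        # Evidence first, isolation second, so the snapshot reflects the live state.
        steps = ["snapshot_workload", "apply_quarantine_sg"] + steps
    return steps
```

Keeping the decision separate from the actions makes the playbook testable against recorded findings, which is one way to earn the trust that automation needs before it is allowed to act on its own.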

The skip-stage cost is real. An org trying to automate IR without runbooks is automating against nothing; the automation will be wrong, and people will distrust it. Sequence matters.

Common pitfalls

Further reading

Standards & frameworks

Provider IR documentation

Tooling

Related CSOH pages

FAQ

How is cloud incident response different from on-prem IR?

Three big shifts. First, ephemerality - the instance, container, or function that you want to image may not exist by the time you reach for it; if you didn't snapshot disk and capture logs proactively, that evidence is gone. Second, the API is the attack surface - most cloud breaches are not "shell on a box" but "stolen credential calling APIs", so CloudTrail / Activity Log / Cloud Audit Logs are the primary evidence source, not memory dumps. Third, shared responsibility - the hypervisor, the physical network, the underlying storage hardware are not yours to investigate; you work with what the provider exposes, and you escalate to the provider's IR team when an incident crosses that line.

What's the single highest-leverage thing to do before an incident happens?

Build an immutable, cross-account log archive that an attacker who compromises your production environment cannot tamper with. In AWS, that's CloudTrail (org trail, all regions, management + data events) writing to an S3 bucket in a dedicated Log Archive account, with bucket policies and SCPs that prevent deletion or alteration. In Azure, it's Activity Log + Diagnostic Settings exported to a Log Analytics workspace and a locked storage account in a separate subscription. In GCP, it's organization-aggregated Cloud Audit Logs sinking to a retention-locked bucket in a dedicated logging project. Without this, the attacker rotates credentials, disables CloudTrail, deletes logs, and you investigate blind.

Do I really need a separate forensics account?

Yes, and you need it provisioned before the incident, not during. A forensics account / subscription / project is a clean environment with pre-staged tooling (a forensics AMI or VM image, scripting, disk-imaging utilities, packet-capture tools, Velociraptor / GRR collectors), an isolated VPC with no route to production, IAM trust policies that allow the IR team to assume roles into compromised accounts in read-only or evidence-collection modes, and a budget. Doing forensics in the prod account contaminates evidence, exposes the IR team to whatever the attacker is still doing, and gives auditors and lawyers heartburn.

Should I rotate credentials immediately when I detect a compromise?

Not before you've captured what you need from the credential's audit trail and decided on a containment strategy. The instinct to "kill the access key" is right eventually, but rotating credentials prematurely (a) tips off the attacker, (b) destroys live session state that may be the best evidence of what's happening right now, and (c) can lock out legitimate workloads if the credential is shared. The order is: identify the compromised principal, query CloudTrail / Activity / Audit Logs for the credential's full activity history, snapshot any associated workloads, then rotate - and use AssumeRole session revocation (the AWSRevokeOlderSessions inline policy) to kill active sessions, not just the long-term credential.

How long do we have to disclose a breach?

It depends on jurisdiction and data type. GDPR requires notification to the supervisory authority within 72 hours of becoming aware of a personal-data breach, and to affected individuals "without undue delay" if there's high risk. U.S. state breach-notification laws vary (most 30-60 days, some shorter for specific data types). SEC Form 8-K Item 1.05 requires public companies to disclose material cybersecurity incidents within four business days of the materiality determination. HIPAA requires breach notification without unreasonable delay and no later than 60 days after discovery. Sector-specific rules (NYDFS Part 500, FFIEC, NERC CIP, NIS2, DORA) add their own clocks. Your IR runbook should include a decision tree that maps incident type to the regulatory clocks that apply - and your General Counsel should be in the war room from the first hour.
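Two of those clocks are simple enough to compute, and having the arithmetic scripted saves an argument at 2am. The sketch below covers only the GDPR 72-hour window and the SEC four-business-day window as described above. It ignores public holidays, the dictionary keys and function names are hypothetical, and it is a convenience for the war room, not legal advice.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def add_business_days(start: datetime, days: int) -> datetime:
    """Advance `start` by a number of weekdays (holidays are not considered)."""
    current = start
    while days > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            days -= 1
    return current


def notification_deadlines(aware_at: datetime,
                           material_at: Optional[datetime] = None) -> dict:
    """Compute the GDPR and (if materiality was determined) SEC deadlines."""
    deadlines = {"gdpr_supervisory_authority": aware_at + timedelta(hours=72)}
    if material_at is not None:
        deadlines["sec_item_1_05"] = add_business_days(material_at, 4)
    return deadlines


if __name__ == "__main__":
    aware = datetime(2026, 3, 2, 9, 0, tzinfo=timezone.utc)
    for name, due in notification_deadlines(aware, material_at=aware).items():
        print(name, due.isoformat())
```

Which clocks apply at all is a legal determination, so the useful output here is a timestamp that counsel can confirm or correct early.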

Is memory forensics still relevant on cloud VMs?

Yes, for workloads that meaningfully run anything in memory beyond what's on disk - long-running services with in-memory state, JIT-compiled malware, fileless attacks that hide in process memory. The mechanics on cloud VMs work: LiME or AVML can capture RAM from a running Linux instance, MAGNET RAM Capture or similar from Windows. The constraint is that ephemeral cloud VMs may be gone by the time you reach for them, and capturing memory requires the instance to still be running (or the hypervisor to support live snapshots, which AWS / Azure / GCP don't generally expose to tenants). For containers and serverless, memory forensics is largely impractical; you fall back on runtime-security telemetry (Falco, Tracee, eBPF-based sensors) captured before the workload terminates.

When does a DFIR retainer pay off?

When you have meaningful cloud presence, sensitive data, and an internal IR team that's small enough to be overwhelmed by a serious incident. A retainer (Mandiant, CrowdStrike, Kroll, Unit 42, Arctic Wolf, and similar) buys you a contracted response SLA - typically 1-4 hours to first responder - and pre-negotiated rates instead of emergency premium pricing. It also forces you to do a tabletop with the firm in advance so they know your environment when the call comes in. The math works above roughly 100 employees, or earlier if you're in a regulated industry. Below that, a "best-efforts" relationship with one or two firms you've talked to in calm times is often enough.

Where next