Incident Response & Cloud Forensics

Q: How is cloud incident response different from on-prem IR?

Three big shifts. First, ephemerality - the instance, container, or function that you want to image may not exist by the time you reach for it; if you didn't snapshot disk and capture logs proactively, that evidence is gone. Second, the API is the attack surface - most cloud breaches are not 'shell on a box' but 'stolen credential calling APIs', so CloudTrail / Activity Log / Cloud Audit Logs are the primary evidence source, not memory dumps. Third, shared responsibility - the hypervisor, the physical network, the underlying storage hardware are not yours to investigate; you work with what the provider exposes, and you escalate to the provider's IR team when an incident crosses that line.

Q: Do I really need a separate forensics account?

Yes, and you need it provisioned before the incident, not during. A forensics account / subscription / project is a clean environment with pre-staged tooling (a forensics AMI or VM image, scripting, disk-imaging utilities, packet-capture tools, Velociraptor / GRR collectors), an isolated VPC with no route to production, IAM trust policies that allow the IR team to assume roles into compromised accounts in read-only or evidence-collection modes, and a budget. Doing forensics in the prod account contaminates evidence, exposes the IR team to whatever the attacker is still doing, and gives auditors and lawyers heartburn.

Last updated 2026-05-17 · By Shawn Nunley · Vendor-neutral

The 30-second version: Cloud incident response is the same six-phase loop you already know - Prepare → Identify → Contain → Eradicate → Recover → Lessons Learned - but every phase changes when the workload lives behind an API in someone else's data center. The instance you want to image may already be gone. The credential the attacker is using leaves a perfect log if you're capturing it, and zero log if you're not. The blast radius is whatever scope an over-privileged role has, which is usually larger than anyone thinks.

The work that makes IR survivable in cloud is done before any incident: an immutable cross-account log archive, a dedicated forensics account with pre-staged tooling, snapshot pipelines, SCPs that block evidence destruction, and runbooks practiced in tabletop. Without those, you investigate blind. With them, the playbook is largely automatable.

The IR lifecycle in cloud
What cloud changes about IR
Forensic readiness
The "log everything that matters" baseline
Evidence collection by workload type
Memory forensics on cloud VMs
Container forensics
Isolating compromised workloads
Credential rotation under incident
IR runbooks
Tabletop exercises
DFIR retainers
Communication during incident
Post-incident
AWS, Azure, and GCP side-by-side
Specialized cloud-IR tooling
Maturity stages
Common pitfalls
Further reading
FAQ

The IR lifecycle in cloud

Two reference lifecycles dominate: NIST SP 800-61r2 (Preparation, Detection & Analysis, Containment / Eradication / Recovery, Post-Incident Activity) and the SANS PICERL model (Preparation, Identification, Containment, Eradication, Recovery, Lessons Learned). They're functionally the same loop with slightly different boundaries. Either works; pick one, use its vocabulary consistently across your runbooks and your retros, and don't get religious about the seams.

Preparation

This is 80% of the work. Immutable log archive, forensics account, snapshot pipelines, IR runbooks, on-call rotation, retainer contract, tabletop calendar, communications plan. If preparation is weak, every other phase is harder; if it's strong, the rest is mostly procedure.

Identification

Triage the alert. Is it real? What's the scope? What's the blast radius if the worst-case is true? In cloud this is largely a question of which principal, which resources, what API activity, what data - answerable from logs if you collected them.

Containment

Stop the bleeding without destroying evidence. Quarantine the workload (don't stop it), revoke active sessions, narrow IAM, isolate the network. Two flavors: short-term ("buy us an hour") and long-term ("safe to operate while we eradicate").

Eradication

Remove the attacker's persistence - implanted IAM principals, modified Lambda code, malicious container images, backdoored AMIs, scheduled tasks, rogue OAuth grants. Cloud-specific: an attacker who got into your IdP has persistence options well beyond a single host.

Recovery

Restore service safely. Redeploy from known-good IaC, rotate keys and secrets, validate the environment matches the pre-incident known-good baseline, monitor for re-emergence. Cloud helps here - rebuild-from-Terraform is faster and cleaner than reimaging an on-prem box.

Lessons Learned

Blameless retro. What broke? What was missed? Which detections fired late? Which runbooks were wrong? Outputs: detection backfill, runbook updates, control improvements, tabletop scenarios. This is the phase most often skipped, and the one with the highest long-term return.

The phases are not strictly sequential. You'll loop between identification and containment as scope expands; you'll re-eradicate when you discover new persistence; you'll start lessons-learned work the moment the incident commander declares the active phase over. The lifecycle is a frame, not a Gantt chart.

What cloud changes about IR

The lifecycle survives the move to cloud. The execution looks meaningfully different along several dimensions that show up in every cloud incident.

Dimension	Traditional IR	Cloud IR
Workload persistence	A physical host is still there next week	Instances scale-in, containers restart, functions are gone in milliseconds - capture evidence at the moment of detection or lose it
Primary attack surface	OS-level: shell, malware, lateral movement	API-level: stolen credential calling AWS / Azure / GCP APIs from somewhere it shouldn't be
Primary evidence	Disk image, memory dump, endpoint telemetry	CloudTrail / Activity Log / Cloud Audit Logs, plus disk and memory when the workload is the target
Containment unit	Pull the cable, block at firewall, isolate the host	Detach IAM, swap to quarantine security group, revoke OAuth grants, kill role sessions
Blast radius	Whatever's on the network segment	Whatever the compromised role can call - often cross-account, cross-region, cross-service
Shared responsibility	You own everything from the rack up	Provider owns hypervisor / network / storage layer; you escalate when an incident crosses it
Speed	Hours to days to spread	Minutes to seconds - automated exploitation can run thousands of API calls before a human is even paged

The single biggest practical shift is that most cloud breaches are control-plane incidents, not data-plane incidents. Someone stole a credential - from a developer laptop, a leaked git commit, a compromised CI runner, a phished session - and is now making API calls. The shell on the box, if there's one at all, is incidental. Your IR program has to be biased toward identity, log analysis, and API-call correlation; the host-forensics muscle that on-prem IR teams grew over twenty years is necessary but no longer sufficient.

Forensic readiness

The hardest lesson of cloud IR is that almost everything that matters has to be set up before the incident. The attacker who has compromised your environment is not going to wait while you deploy a logging pipeline. Treat the items below as preconditions to running a viable IR program, not as nice-to-haves.

1. Immutable, cross-account log archive

The foundational control. The attacker who lands in your production account will try to disable logging and delete logs; the log archive has to be somewhere they cannot reach.

AWS. Organization CloudTrail (all regions, management + S3 data events at minimum) writing to an S3 bucket in a dedicated Log Archive account. S3 Object Lock in compliance mode, MFA delete, a bucket policy denying delete from any principal except a break-glass role, and a Service Control Policy at the org root that prevents anyone in production from disabling the trail. Pair with a separate Security tooling account that has read-only access for the SOC.
Azure. Activity Log + Diagnostic Settings exported to a Log Analytics workspace in a separate subscription, with immutable blob storage as the long-term archive. Management locks at the subscription level prevent deletion.
GCP. Organization-aggregated Cloud Audit Logs sinking to a dedicated logging project, with retention-locked Cloud Storage as the immutable archive. Organization policies prevent disabling of audit logs on production projects.

The "what's the retention?" answer is "long enough for the longest regulatory clock that applies to you" - for most orgs, that's 365 days minimum, often 7 years for financial-services contexts.

2. Dedicated forensics account / subscription / project

A clean environment, on the same cloud as production, but in a separate account boundary. The contents:

Pre-built forensics AMI / VM image / instance template with disk-imaging utilities (dc3dd, guymager), packet capture (tcpdump, zeek), memory tooling (LiME, AVML), and a Velociraptor / GRR / OSQuery collector pre-installed.
Isolated VPC / VNet with no route to production. Public egress only via a logged, monitored gateway.
IAM trust policies that let the IR team assume a read-only or evidence-capture role into compromised accounts - but no trust the other direction.
A logging archive of its own, separate from production's, so the IR team's actions are themselves auditable.
A small budget pool that won't trip cost alerts during a real incident.

3. Snapshot / AMI / image pipelines for evidence preservation

Scripts or runbooks that, given a workload ID, will: snapshot the disk, copy the snapshot to the forensics account, capture metadata (instance config, IAM role, network state, tags) and write it to immutable storage, all in under a few minutes. Idempotent - running twice doesn't double the work. Tested - runbooks that have never run for real don't run for real on the worst day.
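As a rough illustration of the pipeline described above, here is a minimal Python sketch for the AWS case. It assumes `ec2` is a boto3 EC2 client passed in by the caller; the function name `preserve_instance`, the tag key `ir-case-id`, and the shape of the returned metadata are hypothetical choices for this example, not an established convention. It does not wait for snapshot completion or handle KMS re-encryption, both of which a real pipeline would likely need.

```python
from datetime import datetime, timezone

EVIDENCE_TAG = "ir-case-id"  # hypothetical tag key, used to make reruns idempotent


def preserve_instance(ec2, instance_id, case_id, forensics_account_id):
    """Snapshot each EBS volume on an instance and share it with the forensics account.

    `ec2` is assumed to be a boto3 EC2 client. Returns a metadata dict that the
    caller is expected to write to immutable storage.
    """
    reservation = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"][0]
    instance = reservation["Instances"][0]

    snapshots = []
    for mapping in instance.get("BlockDeviceMappings", []):
        volume_id = mapping["Ebs"]["VolumeId"]

        # Idempotency: reuse a snapshot already taken for this case and volume.
        existing = ec2.describe_snapshots(
            OwnerIds=["self"],
            Filters=[
                {"Name": "volume-id", "Values": [volume_id]},
                {"Name": f"tag:{EVIDENCE_TAG}", "Values": [case_id]},
            ],
        )["Snapshots"]
        if existing:
            snapshot_id = existing[0]["SnapshotId"]
        else:
            snapshot_id = ec2.create_snapshot(
                VolumeId=volume_id,
                Description=f"IR evidence {case_id} {instance_id}",
                TagSpecifications=[{
                    "ResourceType": "snapshot",
                    "Tags": [{"Key": EVIDENCE_TAG, "Value": case_id}],
                }],
            )["SnapshotId"]

        # Let the forensics account create volumes from this snapshot.
        ec2.modify_snapshot_attribute(
            SnapshotId=snapshot_id,
            Attribute="createVolumePermission",
            OperationType="add",
            UserIds=[forensics_account_id],
        )
        snapshots.append({"volume_id": volume_id, "snapshot_id": snapshot_id})

    # Cloud-side metadata captured alongside the disks.
    return {
        "case_id": case_id,
        "instance_id": instance_id,
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "ami_id": instance.get("ImageId"),
        "instance_profile_arn": instance.get("IamInstanceProfile", {}).get("Arn"),
        "security_groups": [g["GroupId"] for g in instance.get("SecurityGroups", [])],
        "tags": {t["Key"]: t["Value"] for t in instance.get("Tags", [])},
        "snapshots": snapshots,
    }
```

Running it twice for the same case reuses the tagged snapshot rather than creating a second one. One caveat worth checking in your own environment: snapshots encrypted with the default AWS-managed EBS key cannot be shared across accounts, so evidence volumes generally need a customer-managed key that the forensics account is allowed to use.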

4. Service control / organization policies that block evidence destruction

The attacker with admin in your production account should still be unable to: disable CloudTrail / Activity Log / Audit Logs, delete the log archive bucket, modify forensics IAM trust policies, change the immutable storage configuration. SCPs (AWS), management group policies (Azure), and organization policies (GCP) enforce this above the production account, so production admin doesn't grant the necessary scope to undo the controls. This is the second-most-important control after the immutable log archive itself.
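To make the AWS case concrete, the sketch below builds one possible SCP document that denies CloudTrail tampering to every principal except a break-glass role. The function name and the role ARN pattern are hypothetical; the action names and the `aws:PrincipalArn` condition key are standard IAM elements, but the exact action list you need depends on your environment and should be treated as a starting point rather than a complete policy.

```python
import json


def evidence_protection_scp(break_glass_arn_pattern):
    """Build an SCP that denies trail tampering to everyone except break-glass.

    `break_glass_arn_pattern` is an ARN pattern such as
    "arn:aws:iam::*:role/ir-break-glass" (a hypothetical role name).
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyCloudTrailTampering",
                "Effect": "Deny",
                "Action": [
                    "cloudtrail:StopLogging",
                    "cloudtrail:DeleteTrail",
                    "cloudtrail:UpdateTrail",
                    "cloudtrail:PutEventSelectors",
                ],
                "Resource": "*",
                # The deny applies to any principal whose ARN does not match.
                "Condition": {
                    "ArnNotLike": {"aws:PrincipalArn": break_glass_arn_pattern}
                },
            }
        ],
    }


def render(policy):
    """Serialize the policy for review or for attaching via your IaC tooling."""
    return json.dumps(policy, indent=2)
```

Calling `render(evidence_protection_scp("arn:aws:iam::*:role/ir-break-glass"))` yields JSON that can be reviewed and attached at the organization root or an OU. Note that SCPs do not restrict the organization's management account, which is one more reason to keep workloads out of it.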

5. The break-glass account, ready to use

A separately credentialed identity, used only in emergencies, with permissions broad enough to investigate any account but narrow enough that abuse is detectable. MFA-required, monitored, with use that pages the security team automatically. If you can't get into the compromised account because the attacker rotated the credentials of every other admin, you need this; if you never set it up, your IR program stalls during the incident.

The "log everything that matters" baseline

Logs are not retroactive. If they weren't being captured when the incident began, no clever query recovers them. The minimum baseline that pays for itself the first time you have an actual incident:

AWS

CloudTrail - org trail, all regions, management events + S3 data events at a minimum. Add Lambda invocation events for sensitive functions. Confirm that log file integrity validation is enabled.
VPC Flow Logs - every production VPC, accept + reject, custom format that includes pkt-srcaddr / pkt-dstaddr for NAT-traversed flows.
Route 53 query logs - query logging for public hosted zones, plus Resolver query logging in critical VPCs. DNS is where exfil and C2 frequently show up.
S3 server access logging for sensitive buckets; CloudTrail data events as the structured alternative.
GuardDuty findings - enabled in every account, every region. Runtime monitoring for EKS / ECS / EC2 where worth the spend.
IAM Access Analyzer - surfaces unused access and external sharing; useful as a baseline that an incident's findings can be diffed against.
WAF logs, ALB / NLB access logs, CloudFront logs for internet-facing services.

Azure

Activity Log - every subscription, exported via Diagnostic Settings to Log Analytics and immutable storage.
Diagnostic logs - for every relevant resource type (Key Vault, Storage, SQL, App Service, AKS, Function Apps). Defaults are not enough; enable explicitly per service.
Entra ID sign-in logs and audit logs - the cloud-identity equivalent of CloudTrail's identity events. Export them for retention beyond the built-in window (7 days on the free tier, 30 days with P1 / P2 licensing).
NSG flow logs - to Storage and Traffic Analytics.
Microsoft Sentinel as the SIEM ingestion layer when you have the spend; the connectors do most of the work.
Defender for Cloud recommendations and alerts - surface compliance posture and active findings.

GCP

Cloud Audit Logs - Admin Activity (always on) and Data Access (must be enabled, costs more, and worth it for sensitive services like BigQuery, Cloud Storage, Cloud KMS, IAM).
VPC Flow Logs - every subnet in scope.
Cloud DNS query logging.
Cloud Logging sinks - aggregated at the organization level into the dedicated logging project, with retention-locked Cloud Storage as the long-term archive.
Security Command Center Premium / Enterprise for threat detection findings; Chronicle for SIEM-grade ingestion if you've adopted it.
Access Transparency / Access Approval - provider-personnel access logs, for the regulated workloads where that matters.

The unifying principle: log what changes, log who did it, log from where, log against what data. The volume looks alarming on the bill the first month. The single time an incident requires you to reconstruct three months of activity for a specific compromised credential, the cost has paid itself off many times over.

Evidence collection by workload type

Different workloads need different evidence captures. Cloud IR runbooks split by workload class because the mechanics genuinely differ.

EC2 / Azure VM / GCE instance

Closest to traditional host forensics, with the cloud adding capabilities you didn't have on-prem.

Snapshot first, stop second. The disk snapshot of a running instance captures state including some pagefile and journal contents; stopping the instance first risks losing in-memory state and may trigger anti-forensics in malware that watches for shutdown.
Share the snapshot to the forensics account (cross-account EBS snapshot sharing on AWS; Azure managed disk export to the forensics subscription; GCE disk snapshot in the forensics project).
Create a fresh forensics volume from the snapshot, attach to the analysis instance in an isolated VPC, mount read-only. Never attach the forensics volume to anything that has internet egress until you've confirmed what you have.
Capture metadata alongside the disk: IAM role and its policies, security groups, IMDS history if available, instance tags, launch template, AMI ID, user data. Cloud-side metadata is often more useful than anything on the disk itself.
If memory matters, capture it before snapshot - see the memory forensics section.

EKS / AKS / GKE pods

Containers are designed to be ephemeral and identical. Treat the container as evidence about the deployment, not as the state itself.

kubectl debug ephemeral container - attach a debug container to a running pod in the same namespace and PID space; capture /proc, environment, network state, running processes. Pre-build a debug image with your forensics tools.
Pin the image SHA before anything else. A pod compromised via image vulnerability is meaningless to investigate without knowing precisely which image layer was running.
Node-level disk snapshot for the underlying VM if the pod is suspected of having escaped or written to host paths.
Container runtime telemetry - Falco, Tracee, eBPF-based sensors. These have to have been running before the incident; standing them up after is too late.
Kubernetes audit logs for the API-server events that produced and changed the pod. These should be flowing to the same log archive as the rest of your audit logs.
kubectl cp for any specific files of interest, then capture stdout / stderr logs before pod termination.

See also the Kubernetes page for the deeper detection and runtime-security context.

Lambda / Cloud Functions / Azure Functions

Serverless evidence is mostly side-channel. The function instance is gone; you investigate what it did, not the function itself.

Function code + configuration at the time of the incident. CloudTrail / Activity Log / Audit Logs record the deployment events; capture the version that was running, including environment variables (they often contain secrets that need rotating).
Invocation logs - CloudWatch Logs (AWS), Application Insights / Log Analytics (Azure), Cloud Logging (GCP). Export anything covering the suspected incident window to immutable storage before retention windows expire.
Distributed tracing - X-Ray, Application Insights, Cloud Trace. Reconstructs the call graph of a request that triggered the function; useful for showing whether the function was the attacker's pivot or just a downstream caller.
IAM role activity from the function's execution role - what API calls did it actually make, when, against what.

S3 / Azure Blob / GCS objects

Object versioning and MFA delete (AWS) / soft delete + versioning (Azure) / object versioning + retention policies (GCS) - turn these on by default for sensitive buckets so that an attacker can't simply delete objects to cover tracks or to ransom them.
Server access logs / data events for the bucket - every GET, PUT, DELETE with the principal, source IP, user agent. Without this, the answer to "what did they exfiltrate?" is a guess.
Object hash verification - for any objects suspected of modification, compare the current MD5 / SHA256 with whatever last-known-good baseline you have. S3 supports SHA-256 checksums natively if you opt in.
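The hash comparison in the last item can be sketched in a few lines of Python. This assumes you have already downloaded the suspect objects (or can stream them) and that you hold a last-known-good baseline as a mapping of object key to SHA-256 hex digest; the function names and the baseline format are illustrative assumptions, not part of any provider API.

```python
import hashlib


def sha256_of_stream(stream, chunk_size=1024 * 1024):
    """Hash a binary file-like object in chunks so large objects fit in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def diff_against_baseline(current, baseline):
    """Compare {object_key: sha256_hex} mappings and classify the differences."""
    return {
        # Present in both, but the content hash changed.
        "modified": sorted(
            key for key in current.keys() & baseline.keys()
            if current[key] != baseline[key]
        ),
        # In the baseline but no longer present.
        "missing": sorted(baseline.keys() - current.keys()),
        # Present now but never recorded in the baseline.
        "unexpected": sorted(current.keys() - baseline.keys()),
    }
```

The three buckets map onto the questions an investigator usually asks of a bucket: what was tampered with, what was deleted, and what was planted.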

IAM credentials

Often the primary evidence. For any credential - IAM user access key, IAM role session, service principal secret, GCP service account key - the questions are the same:

Who used it? CloudTrail's userIdentity, Activity Log's caller, Cloud Audit Logs' protoPayload.authenticationInfo.
From where? Source IP, user agent. Anomalous geos and unexpected user agents (cli tools when the workload uses the SDK; Python when the workload is Go) are classic indicators.
When? First-use timestamp, last-use, full activity timeline.
What API calls? The full list, with read / write / sensitive breakdown.

Tools like Netflix Dispatch, cloudgrep, and AWS-native tools help here, but a well-indexed CloudTrail / Activity / Audit Log corpus in your SIEM is the foundation.
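For the AWS case, the four questions above reduce to a small aggregation over CloudTrail records. The sketch below assumes `records` is a list of already-parsed CloudTrail event dicts (for example, the `Records` array from delivered log files); the function name and the shape of the summary are hypothetical. The field names it reads (`userIdentity.accessKeyId`, `sourceIPAddress`, `userAgent`, `eventTime`, `eventSource`, `eventName`, `readOnly`) are standard CloudTrail record fields.

```python
from collections import Counter


def summarize_access_key(records, access_key_id):
    """Answer who / from where / when / what for one access key ID."""
    hits = sorted(
        (r for r in records
         if r.get("userIdentity", {}).get("accessKeyId") == access_key_id),
        # CloudTrail eventTime is ISO 8601 in UTC, so string order is time order.
        key=lambda r: r["eventTime"],
    )
    if not hits:
        return None

    all_calls = Counter()
    write_or_unknown = Counter()
    for r in hits:
        call = f'{r["eventSource"]}:{r["eventName"]}'
        all_calls[call] += 1
        # Treat anything not explicitly marked read-only as worth a closer look.
        if r.get("readOnly") is not True:
            write_or_unknown[call] += 1

    return {
        "principals": sorted({r["userIdentity"].get("arn", "unknown") for r in hits}),
        "source_ips": dict(Counter(r.get("sourceIPAddress", "unknown") for r in hits)),
        "user_agents": dict(Counter(r.get("userAgent", "unknown") for r in hits)),
        "first_seen": hits[0]["eventTime"],
        "last_seen": hits[-1]["eventTime"],
        "event_count": len(hits),
        "calls": dict(all_calls),
        "write_or_unknown_calls": dict(write_or_unknown),
    }
```

In practice the same aggregation is usually expressed as a SIEM or Athena query; the point of the sketch is the shape of the answer, which is the same regardless of where the query runs.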

Memory forensics on cloud VMs

Memory forensics on cloud VMs is technically possible, useful for a narrow class of incidents, and constrained by cloud realities most on-prem tooling didn't anticipate.

Linux: LiME - Loadable Kernel Module that dumps physical RAM to a file (or over the network); requires a kernel-version-matched build. AVML (Microsoft) - userspace, no kernel module; works on most cloud Linux distros without recompilation. Both can be pre-staged via a forensics user-data script, ready to run if needed.
Windows: MAGNET RAM Capture, FTK Imager, WinPmem. Run via Systems Manager Session Manager / Azure Run Command / GCP OS Login to avoid the need for inbound SSH/RDP.
Analysis: Volatility 3 is the de-facto open-source analysis framework, plus commercial tooling (Magnet Axiom Cyber, Cellebrite Inseyets, MAGNET IEF) for richer workflows.

The cloud-specific constraints to know:

You cannot ask the hypervisor for memory. AWS / Azure / GCP do not expose hypervisor-level memory acquisition to tenants. You're capturing from inside the guest, which means the malware can see you doing it.
Ephemeral instances may be gone by the time you decide to capture. Autoscaling, spot reclamation, container scheduling, and even just StopInstances calls by the attacker erase live state.
Encrypted memory features (AWS Nitro Enclaves, Azure Confidential Compute, GCP Confidential VMs) may prevent memory capture even from inside the VM. Plan around them rather than against them.
Network egress to ship the memory image may not be available if the host is being contained. Capture to local storage, then move the storage.

For most cloud incidents, memory forensics is not the highest-leverage activity. Save the muscle for the cases that genuinely warrant it (fileless attacks, in-memory implants, suspected rootkits); for the rest, CloudTrail and a disk image will tell you more, faster.

Container forensics

Container forensics is its own discipline. The container is short-lived, the filesystem is layered, the logs scroll past quickly, and most of the interesting state is somewhere the container layer doesn't preserve.

Image SHA capture. First action. Without the precise image digest, you cannot say what code was running, and the registry tag may move under you.
Runtime layer snapshot. The container's writable layer (the diff from the read-only image) holds whatever the malware wrote at runtime. docker commit on the container creates an image of the current state; crictl / nerdctl equivalents for containerd / CRI-O. Save the image to your forensics registry.
Ephemeral debug container (kubectl debug) for live state - process list, network sockets, mounted filesystems, environment.
Runtime telemetry from before the incident. The most valuable evidence for container investigations is what was being captured continuously: Falco rules firing on syscalls, Tracee traces, eBPF-based sensors (Cilium Tetragon, Sysdig, Datadog runtime security, Wiz Runtime Sensor). Without these, the container's history is largely unrecoverable after the fact.
Kubernetes audit log - every exec, every port-forward, every cp, every secret read. Often the attacker's clearest trail.
Node-level filesystem and process state for suspected container escape; capture the underlying node disk snapshot as if it were a VM compromise.
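The image-digest capture in the first item is easy to script. The sketch below assumes `pod` is the parsed JSON from `kubectl get pod <name> -n <namespace> -o json`; the function name is hypothetical. It reads the `imageID` field that Kubernetes reports in each container status, which normally carries the resolved digest after an `@`.

```python
def image_digests(pod):
    """Extract the resolved image digest for every container in a pod object."""
    status = pod.get("status", {})
    found = []
    for field in (
        "initContainerStatuses",
        "containerStatuses",
        "ephemeralContainerStatuses",
    ):
        for container in status.get(field, []):
            image_id = container.get("imageID", "")
            # imageID looks like "registry/repo@sha256:..." (sometimes with a
            # runtime prefix); keep only the digest after the last "@".
            digest = image_id.rsplit("@", 1)[1] if "@" in image_id else None
            found.append({
                "container": container["name"],
                "image": container.get("image"),
                "digest": digest,
            })
    return found
```

Record the output in the case notes before any containment step that might restart or reschedule the pod, since a rescheduled pod may pull a different image under the same tag.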

Isolating compromised workloads

Containment without evidence loss is the hardest single skill in cloud IR. The instinct to "shut it down" is wrong; the right move is to isolate the workload from anything it can damage or call, while keeping it intact enough to interrogate.

The quarantine security group pattern (AWS)

A pre-built security group with no ingress rules and egress restricted to (a) your forensics VPC peering, (b) your logging endpoints, (c) Systems Manager / SSM endpoints for remote command execution. Replacing the instance's existing SGs with this one isolates the workload in seconds while leaving it running for evidence capture.
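A minimal sketch of the swap, assuming `ec2` is a boto3 EC2 client and that the quarantine security group already exists. The function name and the tag key `ir-original-security-groups` are hypothetical. The sketch records the original groups on the instance before changing anything, so the pre-containment network state is itself preserved as evidence.

```python
import json

ORIGINAL_SGS_TAG = "ir-original-security-groups"  # hypothetical tag key


def quarantine_instance(ec2, instance_id, quarantine_sg_id):
    """Replace every security group on the instance's ENIs with the quarantine SG.

    Returns {eni_id: [original_sg_ids]} for the interfaces that were changed,
    or an empty dict if the instance was already quarantined.
    """
    reservation = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"][0]
    instance = reservation["Instances"][0]

    original = {
        eni["NetworkInterfaceId"]: [g["GroupId"] for g in eni.get("Groups", [])]
        for eni in instance.get("NetworkInterfaces", [])
    }
    # Skip interfaces that are already isolated so a rerun changes nothing.
    to_change = {
        eni_id: groups for eni_id, groups in original.items()
        if groups != [quarantine_sg_id]
    }
    if not to_change:
        return {}

    # Record the pre-containment state before touching the network.
    ec2.create_tags(
        Resources=[instance_id],
        Tags=[{"Key": ORIGINAL_SGS_TAG, "Value": json.dumps(to_change)}],
    )
    for eni_id in to_change:
        ec2.modify_network_interface_attribute(
            NetworkInterfaceId=eni_id,
            Groups=[quarantine_sg_id],
        )
    return to_change
```

Two caveats to verify in your own environment: tag values are limited in length, so an instance with many interfaces may need the original state written elsewhere; and security group changes do not necessarily cut connections that are already established and tracked, so a network ACL or similar control may be needed to sever live sessions.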

NSG / firewall isolation (Azure / GCP)

Same pattern, different primitives. Azure: a "quarantine" NSG associated with the NIC, replacing existing NSGs. GCP: tag the instance with a quarantine network tag that matches a firewall rule denying all ingress and egress except to logged forensic endpoints.

IAM containment

Often more important than network containment in cloud breaches. If a credential is compromised, you want to:

Detach all policies from the user / role, replacing with an explicit deny on everything except the calls your IR team will use to investigate.
Revoke active sessions for IAM roles using the AWSRevokeOlderSessions inline policy - the credential continues to exist (so the audit trail of attempted reuse is preserved) but no API call succeeds.
For SSO / federated identities, disable the user at the IdP (Entra ID, Okta, Google Workspace) and revoke all refresh tokens.

The "running but cannot reach anything" state

The goal is a workload that is alive enough to capture from, dead enough not to cause further harm. For most cloud incidents this is achievable in under 5 minutes if the runbooks and IAM are pre-built; if they aren't, you'll spend 30 minutes deciding what to do while the attacker continues.

Credential rotation under incident

Credential rotation is necessary but rarely the first thing to do. The discipline is to capture evidence and contain first, then rotate - and to rotate correctly so you don't leave windows open.

AWS

IAM user access keys: aws iam update-access-key --status Inactive first (preserves the key for audit lookup); then aws iam delete-access-key after the active phase. Create a new key only after confirming the workload that uses it has a containment plan.
IAM role sessions: deny based on session-issue-time. The AWSRevokeOlderSessions inline policy applies a condition denying all actions where aws:TokenIssueTime is before a chosen timestamp; this kills active sessions without affecting the long-term credential.
STS federation: rotate the IdP-side credential (SAML signing key, OIDC trust) if the trust relationship itself is suspected.
Root credentials: rotate, re-MFA, audit the root API key usage history. The root key should never be in regular use; any use is an event in itself.
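The session-revocation policy described above is small enough to generate on the fly. The sketch below builds the deny statement keyed on `aws:TokenIssueTime`; the function name is hypothetical, and the cutoff you choose (usually "now") is an incident-time decision.

```python
from datetime import datetime, timezone


def revoke_older_sessions_policy(cutoff):
    """Build an inline policy denying all actions for sessions issued before `cutoff`.

    `cutoff` must be a timezone-aware datetime; it is rendered in UTC.
    """
    if cutoff.tzinfo is None:
        raise ValueError("cutoff must be timezone-aware")
    stamp = cutoff.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Deny",
                "Action": "*",
                "Resource": "*",
                # Only sessions whose token was issued before the cutoff match.
                "Condition": {"DateLessThan": {"aws:TokenIssueTime": stamp}},
            }
        ],
    }
```

With boto3, the resulting document could be attached via `iam.put_role_policy(RoleName=..., PolicyName="AWSRevokeOlderSessions", PolicyDocument=json.dumps(policy))`. Sessions issued after the cutoff are unaffected, so this step only helps once the attacker's path to obtaining new sessions has been closed.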

Azure

Service principal secrets and certificates: rotate via Entra ID. Old credentials remain in audit logs for the retention window.
Managed identities: cannot be "rotated" in the user-managed sense; instead, remove the role assignment and re-grant if you suspect the identity itself is compromised.
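User sessions: Revoke-AzureADUserAllRefreshToken (or, now that the AzureAD PowerShell module is deprecated, Revoke-MgUserSignInSession in Microsoft Graph PowerShell, or the Entra ID portal equivalent) forces reauth on every device.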

GCP

Service account keys: disable first (preserves audit trail) via gcloud iam service-accounts keys disable; delete after the active phase. The presence of long-lived service account keys at all is increasingly a finding - prefer Workload Identity Federation.
OAuth tokens: revoke refresh tokens via Admin SDK or the IdP integration. Active access tokens expire on their own clock but cannot be retroactively killed in all cases.
User sessions: Workspace admin console offers session reset on a per-user basis.

The cross-cloud principle: disable, don't delete, until evidence is captured. A deleted credential's last-used and policy-version history is harder to recover than a disabled one.

IR runbooks

Runbooks are the predefined sequences your team executes for the common cloud incident classes. The right set covers the 80% of incidents that look the same time after time. Each runbook should answer: who's on call, what to capture, how to contain, how to verify, who to notify.

Compromised IAM credential

The single most common cloud incident. Trigger: GuardDuty / Defender / SCC finding of unusual API activity, or a credential surfacing in a public leak (GitHub, paste sites). Sequence: identify the principal, query the credential's full CloudTrail / Activity Log history, capture and store the activity trail in immutable storage, disable the credential, revoke active sessions, scope the blast radius from the API calls made, then notify the workload owner, rotate, and audit downstream resources for attacker persistence (new IAM users, new roles, modified policies).

Exposed S3 / Storage bucket

Trigger: external researcher report, internal CSPM alert, or a finding from data-exposure scanners (GreyNoise, BinaryEdge, internal). Sequence: confirm exposure (don't trust the alert alone), enumerate accessed objects from server access logs, identify the sensitive data classes involved, fix the bucket ACL / policy / public-access-block, capture an immutable snapshot of the access logs, then escalate to the data-owner team and Legal for breach-notification analysis.

Crypto-mining EC2 / VM

Trigger: GuardDuty CryptoCurrency finding, billing spike alert, unusual outbound network volume to mining-pool IP ranges. Usually downstream of a compromised IAM credential that launched the workload. Sequence: snapshot the workload, capture network traffic, identify the launching principal (often a forgotten access key in a code repo), shut down the workload, then move to the credential-compromise runbook to find the root cause.

Ransomware in storage

Trigger: customer reports of inaccessible files, S3 object versioning chain showing mass-deletion, ransom note found in a bucket. Sequence: pause all auto-replication so the attacker's deletions don't propagate to backups, verify object versioning is intact, restore from versioning history or backups, identify the credential used for the deletions, follow the credential-compromise runbook for the source, and engage Legal for ransom-decision and law-enforcement notification.
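The "versioning chain showing mass-deletion" trigger can be checked mechanically. The sketch below assumes `delete_markers` is the combined `DeleteMarkers` list from paginated S3 `list_object_versions` responses, where each entry has `Key`, `IsLatest`, and a `LastModified` datetime. The function name and the default threshold of 100 deletions per minute are arbitrary illustrative choices.

```python
from collections import Counter


def deletion_bursts(delete_markers, threshold=100):
    """Return [(minute, count)] for minutes where current delete markers spike.

    Only markers that are the latest version are counted, since those are the
    objects that currently appear deleted.
    """
    per_minute = Counter()
    for marker in delete_markers:
        if marker.get("IsLatest"):
            minute = marker["LastModified"].replace(second=0, microsecond=0)
            per_minute[minute] += 1
    return sorted(
        (minute, count) for minute, count in per_minute.items()
        if count >= threshold
    )


def keys_to_restore(delete_markers, bursts):
    """List object keys whose current delete marker falls inside a burst minute."""
    burst_minutes = {minute for minute, _ in bursts}
    return sorted(
        marker["Key"] for marker in delete_markers
        if marker.get("IsLatest")
        and marker["LastModified"].replace(second=0, microsecond=0) in burst_minutes
    )
```

On a versioned bucket, removing the delete marker makes the previous version current again, so the second function's output is effectively the restore worklist. That only holds if the attacker did not also delete the underlying versions, which is why verifying the versioning chain comes before the restore step.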

Exfiltration via egress

Trigger: VPC Flow Logs / NSG Flow Logs anomaly, GuardDuty UnusualNetworkActivity or OutboundDataTransferAnomaly, DNS-tunneling indicators in Route 53 / Cloud DNS query logs. Sequence: contain the workload via the quarantine SG / NSG pattern, capture the workload, identify the data being transferred (via S3 access logs, database query logs, application logs), determine the principal and the credential, then run the credential-compromise + ransomware-style data-loss runbooks in parallel.

Insider / privileged-user misuse

Trigger: HR notification of an investigation, anomalous behavior from a privileged principal, or a finding that an admin's account is doing something the role permits but the human shouldn't. Sequence: do NOT alert the subject; preserve evidence; engage HR and Legal before any technical action; capture the full activity trail; coordinate any rotation / disablement with HR's timing.

Reference playbooks worth reading

AWS Incident Response Playbooks - the AWS-published reference set; technical, opinionated, useful even if you adapt them.
CISA Federal Cybersecurity Incident & Vulnerability Response Playbooks - federal but broadly applicable.
AWS Customer Playbook Framework - the structure for writing your own.
Microsoft Incident Response Playbooks - phishing, password spray, app consent grants.

Tabletop exercises

An IR program that has never practiced will not perform under stress. Tabletops are the cheapest way to discover that your runbooks reference an account that no longer exists, that the on-call rotation hasn't been updated since the last reorg, or that nobody has paged Legal in 18 months and the contact is stale.

Cadence

Quarterly - small, scenario-specific. 90 minutes. Run the runbook on paper, identify what breaks.
Annually - major, multi-team. A full day. Multiple scenarios, including ones that overlap (a credential compromise that becomes a ransomware event).
Ad-hoc - after a major architectural change, a leadership change, or a near-miss incident.

Who attends

Incident commander (rotating), security engineering, on-call SRE / platform team, cloud-account owners for the affected workloads, communications / PR, Legal / privacy counsel, executive sponsor (often CTO or CISO), and a designated observer who takes notes and runs the retro. For tabletops with a DFIR retainer, the firm's account team should join - they should know your environment before the real call.

Scenarios worth running

Leaked AWS access key in a public GitHub commit, with crypto-mining activity within 4 minutes.
Compromised Entra ID admin account during a session-hijack attack against the CFO.
Public S3 / GCS bucket discovered by an external researcher with media interest.
Ransomware on a subset of S3 buckets via a compromised CI/CD pipeline.
Insider exfiltration: a departing engineer with broad access to customer-data tables.
Supply-chain compromise: a malicious update to a widely-used base image discovered by your runtime sensor.
Cross-tenant cloud-provider incident: the provider notifies you of a vulnerability that affected your data.

DFIR retainers

A retainer is a contracted relationship with a DFIR firm that buys you a response SLA, pre-negotiated rates, and (most importantly) a team that knows your environment before the call. The major firms in 2026:

Mandiant (Google Cloud) - long history, deep nation-state experience, Mandiant Managed Defense for ongoing detection.
CrowdStrike Services - strong on endpoint-led investigations, paired with their EDR.
Kroll - broad coverage including breach-coach legal coordination; common in privacy-driven incidents.
Unit 42 (Palo Alto Networks) - strong on cloud and ransomware engagements.
Arctic Wolf - mid-market focus, broad retainer base.
Secureworks (now Sophos), Optiv, Coalfire, regional specialists.
Mitiga, Cado - cloud-native IR specialists, particularly useful for orgs whose incident surface is heavily AWS / Azure / GCP rather than endpoint.

When the math works

Retainers pay off when:

You have meaningful cloud presence and sensitive data, and an internal IR team too small to cover a serious incident without help.
You're in a regulated industry where regulatory notification timelines (GDPR 72h, NYDFS 72h, etc.) leave no slack for shopping for a firm during the incident.
You have cyber insurance - most policies require a panel-approved DFIR firm and offer rate concessions for retainer customers.
You want the firm to run an annual tabletop or environment-familiarization engagement, so that it already knows your environment when the real call comes.

Below those thresholds, an established phone-and-email relationship with one or two firms - without a paid retainer - is often enough. The worst case is having no one to call and shopping during the incident; the second-worst is having a retainer with a firm that has never looked at your environment.

Communication during incident

The communications layer of IR is where most programs reveal their immaturity. The technical containment can go well and the public reception can still be terrible, or vice versa.

Internal

The war room. A dedicated Slack / Teams channel (e.g. #inc-2026-05-17-credential-compromise) for the active incident. The IC, technical leads, comms, Legal, and executive sponsor live in it during the active phase. No side-channel discussion; everything in the channel is evidence.
RACI for the incident. Incident Commander (one, rotating), Technical Lead, Communications Lead, Legal Lead, Executive Sponsor. Each role has a named human; if that human is unavailable, the runbook says who replaces them.
Status updates on a cadence. Every 30 minutes during active response - even "no change" is a status. Drift to 60 minutes once containment is solid.
Decision log. Major decisions (containment scope, who to notify, when to rotate keys, when to declare the active phase over) written down with the reasoning at the time. Saves the retro from "why did we do that?" amnesia.

External

Customers. If their data is affected - even potentially - they get notified per the contract you signed with them and the law that applies to where they are. Vague language about "an investigation is ongoing" is acceptable initially; silence is not.
Regulators. GDPR Article 33 requires notification to the supervisory authority within 72 hours of becoming aware of a personal-data breach. U.S. state laws vary (most 30-60 days, some require law-enforcement consultation first). HIPAA breach-notification: 60 days. SEC 8-K Item 1.05: four business days from materiality determination for public companies. NIS2, DORA, and sector regulators add their own clocks.
Cyber insurance carrier. Notify per policy - often within 24-72 hours of awareness, sometimes before engaging counsel or a DFIR firm not on their panel.
Law enforcement. FBI / Secret Service / NCSC / local cybercrime unit. Usually optional, sometimes valuable for nation-state attribution or extortion cases.
Public. Press statement, status page update, blog post. The principle: tell the truth, tell it clearly, tell it before someone else does. Late or evasive comms are the durable reputational damage.

Your IR runbook should include a one-page summary of the regulatory clocks that apply to you - by data type, by jurisdiction, by sector. Your General Counsel should own it; your security team should know where to find it at 2am.
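As a rough illustration of what "pre-mapped clocks" can look like, here is a minimal Python sketch that turns two timestamps into deadlines for three of the clocks named above. The function names and the dictionary labels are hypothetical, the business-day helper ignores public holidays, and the HIPAA figure is treated as a flat 60-day outer limit; your General Counsel's summary remains the source of truth, not this arithmetic.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional


def add_business_days(start: datetime, days: int) -> datetime:
    """Advance `start` by `days` weekdays (Mon-Fri). Public holidays are NOT handled."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday-Friday
            remaining -= 1
    return current


def notification_deadlines(
    aware_at: datetime,
    material_at: Optional[datetime] = None,
) -> Dict[str, datetime]:
    """Map the moment of awareness (and, optionally, the materiality
    determination) to illustrative regulatory deadlines."""
    deadlines = {
        # GDPR Article 33: 72 hours from becoming aware of the breach.
        "GDPR Art. 33": aware_at + timedelta(hours=72),
        # HIPAA: treated here as a flat 60-day outer limit (assumption).
        "HIPAA": aware_at + timedelta(days=60),
    }
    if material_at is not None:
        # SEC 8-K Item 1.05: four business days from materiality determination.
        deadlines["SEC 8-K Item 1.05"] = add_business_days(material_at, 4)
    return deadlines


if __name__ == "__main__":
    aware = datetime(2026, 5, 14, 10, 0, tzinfo=timezone.utc)
    for name, due in sorted(notification_deadlines(aware, material_at=aware).items()):
        print(f"{name}: {due.isoformat()}")
```

The point of the sketch is the shape, not the numbers: one input per triggering event, one output per clock, so the war room can read deadlines off a screen instead of recomputing them at 2am.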

Post-incident

The phase most programs underinvest in, and the one with the highest long-term return.

Blameless retro

The principle, borrowed from SRE: people made the best decisions they could with the information they had. Retros that hunt for fault produce defensive participants who hide details; blameless retros produce honest accounts that surface the systemic causes. Have a designated facilitator (not the IC, not the executive sponsor), follow a written agenda, and produce a written document with the timeline, the decisions, the lessons, and the action items.

Action item categories

Control improvements. The actual prevention or limit that would have stopped or reduced this incident. Specific, owned, dated.
Detection backfill. The alert that should have fired earlier. New SIEM rules, new GuardDuty / Defender / SCC enablements, new detection-engineering work. Track from "we noticed" to "we noticed within an SLA."
Runbook updates. Every place the runbook was wrong, missing, or required improvisation. Updated, reviewed, tested in the next tabletop.
Tooling gaps. The capabilities you wished you had. Often: better evidence capture, better cross-account access for the IR team, better network isolation, better identity-side controls.
Process gaps. The places the org-chart let you down. New on-call agreements, new RACI, new escalation paths.

The owner of action items is not Security alone. Detection backfill is Security; control improvements often live with platform / engineering teams; comms updates live with Legal and Marketing. The retro that produces action items only for Security is a retro that has misdiagnosed the incident.


AWS, Azure, and GCP side-by-side

The native IR-supporting capabilities each cloud ships, reduced to a one-screen reference:

Capability	AWS	Azure	GCP
Activity audit log	CloudTrail (org trail, mgmt + data events)	Activity Log + Diagnostic Settings	Cloud Audit Logs (Admin + Data Access)
Network flow logs	VPC Flow Logs	NSG Flow Logs / VNet Flow Logs	VPC Flow Logs
Threat detection	GuardDuty (incl. EKS, S3, RDS, runtime)	Defender for Cloud (CSPM + CWPP)	Security Command Center (Premium / Enterprise)
Investigation / graph	Detective	Sentinel investigation graph	SCC Investigation, Chronicle
SIEM / SOAR	Security Lake + partner SIEM, Security Hub	Microsoft Sentinel (SIEM + SOAR)	Chronicle SIEM + SOAR
Disk evidence	EBS snapshot, cross-account share	Managed disk export, snapshot copy	Persistent disk snapshot, cross-project share
Memory acquisition	SSM Session Manager + LiME/AVML; no hypervisor-level capture	Azure Run Command + AVML/WinPmem; no hypervisor-level capture	OS Login / Cloud Shell + AVML; no hypervisor-level capture
Container runtime	GuardDuty EKS Runtime / ECS Fargate Runtime	Defender for Containers (eBPF)	GKE Threat Detection, Workload Vuln Scanning
Isolation primitive	Quarantine Security Group	Quarantine NSG / NIC isolation	Network tag + firewall rule
Session revocation	AWSRevokeOlderSessions inline policy	Entra ID refresh-token revocation	Workspace session reset, IAM key disable
IR-specific service	AWS Security Incident Response (CIRT-as-a-service)	Microsoft Incident Response (paid engagement)	Mandiant (Google Cloud) consulting + retainers
Vendor IR contact	AWS Customer Incident Response Team (CIRT) via Support	Microsoft Detection & Response Team (DART)	Google Cloud TAM / Security Response

Native tools are necessary but not sufficient. None of the three clouds ships a complete forensics workbench; the specialized cloud-IR tooling section below covers the gap that all three leave open.

Specialized cloud-IR tooling

Beyond the native services, a category of tooling specifically targets cloud forensics and IR workflows:

Cado Security - cloud-native investigation platform; automated evidence capture for AWS / Azure / GCP, containers, serverless. The closest thing to a turnkey cloud-forensics workbench.
Mitiga - cloud IR-as-a-service, plus a platform for cloud-incident readiness and investigation.
Magnet Axiom Cyber - broader DFIR suite with cloud collectors; strong on Microsoft 365, Google Workspace, and SaaS-side evidence.
Mandiant Managed Defense - outsourced detection + IR; particularly valuable for orgs that can't staff 24/7 internally.
Velociraptor - open-source endpoint visibility and digital-forensics tool; deploys agents to cloud VMs for query-based collection.
GRR Rapid Response - Google's open-source remote live forensics framework; mature, scriptable, free.
Timesketch - collaborative forensic timeline analysis tool. Plays well with Plaso for evidence ingestion.
cloud-forensics-utils (Google) - Python library that automates disk-acquisition workflows across AWS, Azure, and GCP.
AWS Incident Response Playbooks repo - reference playbooks, not tooling, but the structure many teams adopt.
CISA playbooks & advisories - federal but generally useful; the Cybersecurity Incident & Vulnerability Response Playbooks are good baseline reading.

Maturity stages

A useful staging model for a cloud IR program:

Stage 1 - Reactive

An incident happens; the team scrambles. Logs are partially captured. Runbooks are tribal knowledge. The forensics work happens in the production account. Detection is mostly post-hoc - billing alerts, customer complaints, external researcher tips. Survives small incidents, breaks on big ones.

Stage 2 - Documented

Immutable log archive is live. Forensics account exists. The most common runbooks (compromised credential, exposed bucket, crypto-mining) are written down. On-call rotation defined. A DFIR phone-and-email relationship with one firm. First tabletop completed.

Stage 3 - Practiced

Quarterly tabletops with multiple scenarios. Annual full-day exercise with Legal and execs. DFIR retainer in place; firm has been through an environment-familiarization engagement. Detection engineering matures alongside IR; detections tied to MITRE ATT&CK Cloud. SLA-tracked time-to-detect, time-to-contain.
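To make "SLA-tracked" concrete, here is a small Python sketch of how the two metrics might be computed from an incident timeline. The class and field names are hypothetical, and the definitions are assumptions: time-to-detect is measured from the first known malicious activity to detection, and time-to-contain from detection to containment. Your program may draw those lines differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class IncidentTimeline:
    """Three timestamps taken from the retro's written timeline."""
    first_malicious_activity: datetime
    detected: datetime
    contained: datetime

    @property
    def time_to_detect(self) -> timedelta:
        return self.detected - self.first_malicious_activity

    @property
    def time_to_contain(self) -> timedelta:
        return self.contained - self.detected


def sla_breaches(
    timeline: IncidentTimeline,
    detect_sla: timedelta,
    contain_sla: timedelta,
) -> List[str]:
    """Return the names of the SLAs this incident missed (empty list if none)."""
    breaches = []
    if timeline.time_to_detect > detect_sla:
        breaches.append("time_to_detect")
    if timeline.time_to_contain > contain_sla:
        breaches.append("time_to_contain")
    return breaches
```

Feeding every closed incident through something like this gives a trend line per quarter, which is usually more persuasive in a budget conversation than any single retro.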

Stage 4 - Automated

High-confidence playbooks auto-execute: GuardDuty finding → snapshot the workload, apply quarantine SG, page the on-call, file a ticket - all in seconds. Continuous tabletop via purple-team exercises. Forensics-as-code; evidence collection is reproducible from a runbook commit hash. The IR program is a competitive advantage in enterprise sales conversations about security maturity.
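A minimal sketch of the first two steps of such a playbook (snapshot, then quarantine) might look like the following. It assumes the finding arrives in the EventBridge shape, with the instance ID under `detail.resource.instanceDetails.instanceId`, and that `ec2` is a boto3 EC2 client passed in by the caller. The function name, the severity threshold, and the quarantine security group ID are hypothetical; paging and ticketing are left out.

```python
from typing import Any, Dict, List


def contain_instance(
    ec2: Any,
    finding: Dict[str, Any],
    quarantine_sg_id: str,
    min_severity: float = 7.0,
) -> List[str]:
    """Snapshot every EBS volume on the flagged instance, then swap its
    security groups for the quarantine group. Returns a log of actions taken."""
    detail = finding["detail"]
    if detail["severity"] < min_severity:
        return []  # Below the auto-execute threshold: leave it to a human.

    instance_id = detail["resource"]["instanceDetails"]["instanceId"]
    reservations = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"]
    instance = reservations[0]["Instances"][0]

    actions: List[str] = []
    # Evidence first: snapshot before changing anything about the workload.
    for mapping in instance.get("BlockDeviceMappings", []):
        volume_id = mapping["Ebs"]["VolumeId"]
        snapshot = ec2.create_snapshot(
            VolumeId=volume_id,
            Description=f"IR evidence: finding {detail['id']}, instance {instance_id}",
        )
        actions.append(f"snapshot:{snapshot['SnapshotId']}")

    # Containment second: replace all security groups with the quarantine group.
    ec2.modify_instance_attribute(InstanceId=instance_id, Groups=[quarantine_sg_id])
    actions.append(f"quarantine:{instance_id}")
    return actions
```

Passing the client in rather than creating it inside the function keeps the playbook testable with a stub, which matters if the automation is going to be rehearsed in tabletops. Note that `modify_instance_attribute` with `Groups` only covers the simple single-interface case; instances with several network interfaces would need each interface handled separately.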

The skip-stage cost is real. An org trying to automate IR without runbooks is automating against nothing; the automation will be wrong, and people will distrust it. Sequence matters.

Common pitfalls

No immutable log archive. The single most common - and most consequential - IR readiness failure. The attacker disables logging on day zero; without a tamper-resistant archive in a separate account, the investigation goes blind.
Doing forensics in the production account. Contaminates evidence, exposes the IR team to whatever the attacker is still doing, and gives auditors and Legal heartburn. The dedicated forensics account is a precondition, not a nice-to-have.
No forensics account at all. Adjacent to the above. "We'll figure it out when it happens" is a Stage 1 program; setting it up takes a week and pays for itself the first time.
Rotating credentials before capturing evidence. Destroys the live session state that's often the best evidence of what the attacker is doing right now. Disable first, capture, then rotate.
No tabletop, ever. Runbooks that have never been practiced fail on the worst day. Quarterly is the minimum cadence that produces lasting muscle memory.
Conflating containment with eradication. Quarantining the workload buys time; it doesn't end the incident. Eradication - finding and removing the attacker's full set of footholds - is the work that prevents the second incident an hour later.
Forgetting identity-side persistence. An attacker who compromised an IdP admin can plant trust relationships, app consent grants, IdP-side users, and federation backdoors that survive every workload-side cleanup. Audit the IdP after every credential-compromise incident.
"We'll buy a tool when we need it." The tool that needs to be deployed and configured under incident pressure won't be useful in that incident. Pre-stage the forensics AMI, the collector agents, the network rules.
Underestimating regulatory clocks. GDPR's 72-hour window, SEC's four-business-day clock, state breach laws, sector regulators - these don't pause for your investigation. The runbook needs the clocks pre-mapped; the response needs Legal in the room from hour one.
Skipping the retro. The team is exhausted; everyone wants the incident to be over; the retro slips. Six months later, the same incident class recurs because the action items were never identified. Retro within two weeks, no exceptions.

FAQ

How is cloud incident response different from on-prem IR?

Three big shifts. First, ephemerality - the instance, container, or function that you want to image may not exist by the time you reach for it; if you didn't snapshot disk and capture logs proactively, that evidence is gone. Second, the API is the attack surface - most cloud breaches are not "shell on a box" but "stolen credential calling APIs", so CloudTrail / Activity Log / Cloud Audit Logs are the primary evidence source, not memory dumps. Third, shared responsibility - the hypervisor, the physical network, the underlying storage hardware are not yours to investigate; you work with what the provider exposes, and you escalate to the provider's IR team when an incident crosses that line.

What's the single highest-leverage thing to do before an incident happens?

Build an immutable, cross-account log archive that an attacker who compromises your production environment cannot tamper with. In AWS, that's CloudTrail (org trail, all regions, management + data events) writing to an S3 bucket in a dedicated Log Archive account, with bucket policies and SCPs that prevent deletion or alteration. In Azure, it's Activity Log + Diagnostic Settings exported to a Log Analytics workspace and a locked storage account in a separate subscription. In GCP, it's organization-aggregated Cloud Audit Logs sinking to a retention-locked bucket in a dedicated logging project. Without this, the attacker rotates credentials, disables CloudTrail, deletes logs, and you investigate blind.
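For the AWS case, the SCP half of that setup can be sketched as a policy document built in Python. This is an illustration under assumptions, not a complete guardrail: the function name, the statement IDs, and the bucket ARN in the usage note are hypothetical, and a real policy would normally carve out an exception for a break-glass or log-pipeline role.

```python
import json
from typing import Any, Dict


def build_log_protection_scp(log_bucket_arn: str) -> Dict[str, Any]:
    """Build a deny-only SCP that blocks CloudTrail tampering and
    deletion or re-policying of the log archive bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyCloudTrailTampering",
                "Effect": "Deny",
                "Action": [
                    "cloudtrail:StopLogging",
                    "cloudtrail:DeleteTrail",
                    "cloudtrail:UpdateTrail",
                    "cloudtrail:PutEventSelectors",
                ],
                "Resource": "*",
            },
            {
                "Sid": "DenyLogArchiveDestruction",
                "Effect": "Deny",
                "Action": [
                    "s3:DeleteObject",
                    "s3:DeleteObjectVersion",
                    "s3:DeleteBucket",
                    "s3:PutBucketPolicy",
                    "s3:DeleteBucketPolicy",
                    "s3:PutLifecycleConfiguration",
                ],
                # Both the bucket itself and every object inside it.
                "Resource": [log_bucket_arn, f"{log_bucket_arn}/*"],
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_log_protection_scp("arn:aws:s3:::example-org-log-archive"), indent=2))
```

Keep in mind that SCPs do not restrict the organization's management account, so the trail and the archive bucket are better placed in member accounts where the policy actually applies.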

Do I really need a separate forensics account?

Yes, and you need it provisioned before the incident, not during. A forensics account / subscription / project is a clean environment with pre-staged tooling (a forensics AMI or VM image, scripting, disk-imaging utilities, packet-capture tools, Velociraptor / GRR collectors), an isolated VPC with no route to production, IAM trust policies that allow the IR team to assume roles into compromised accounts in read-only or evidence-collection modes, and a budget. Doing forensics in the prod account contaminates evidence, exposes the IR team to whatever the attacker is still doing, and gives auditors and lawyers heartburn.

Should I rotate credentials immediately when I detect a compromise?

Not before you've captured what you need from the credential's audit trail and decided on a containment strategy. The instinct to "kill the access key" is right eventually, but rotating credentials prematurely (a) tips off the attacker, (b) destroys live session state that may be the best evidence of what's happening right now, and (c) can lock out legitimate workloads if the credential is shared. The order is: identify the compromised principal, query CloudTrail / Activity / Audit Logs for the credential's full activity history, snapshot any associated workloads, then rotate - and use AssumeRole session revocation (the AWSRevokeOlderSessions inline policy) to kill active sessions, not just the long-term credential.
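The session-revocation step is simple enough to sketch. The policy denies every action for any session whose token was issued before a cutoff time, using the `aws:TokenIssueTime` condition key. The function name below is hypothetical; the document shape follows what the AWSRevokeOlderSessions inline policy looks like.

```python
from datetime import datetime, timezone
from typing import Any, Dict


def build_revoke_sessions_policy(cutoff: datetime) -> Dict[str, Any]:
    """Deny all actions for role sessions whose credentials were issued
    before `cutoff`. `cutoff` must be timezone-aware."""
    if cutoff.tzinfo is None:
        raise ValueError("cutoff must be timezone-aware")
    stamp = cutoff.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Deny",
                "Action": ["*"],
                "Resource": ["*"],
                "Condition": {"DateLessThan": {"aws:TokenIssueTime": stamp}},
            }
        ],
    }
```

With boto3, the result would typically be attached with `iam.put_role_policy(RoleName=..., PolicyName="AWSRevokeOlderSessions", PolicyDocument=json.dumps(policy))`. Because `aws:TokenIssueTime` only exists on temporary credentials, this handles assumed-role sessions; a long-term access key still has to be deactivated separately.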

How long do we have to disclose a breach?

It depends on jurisdiction and data type. GDPR requires notification to the supervisory authority within 72 hours of becoming aware of a personal-data breach, and to affected individuals "without undue delay" if there's high risk. U.S. state breach-notification laws vary (most 30-60 days, some shorter for specific data types). SEC Form 8-K Item 1.05 requires public companies to disclose material cybersecurity incidents within four business days of materiality determination. HIPAA breach notification is 60 days. Sector-specific rules (NYDFS Part 500, FFIEC, NERC CIP, NIS2, DORA) add their own clocks. Your IR runbook should include a decision tree that maps incident type to the regulatory clocks that apply - and your General Counsel should be in the war room from the first hour.

Is memory forensics still relevant on cloud VMs?

Yes, for workloads that hold meaningful state in memory beyond what's on disk - long-running services with in-memory state, JIT-compiled malware, fileless attacks that hide in process memory. The mechanics on cloud VMs work: LiME or AVML can capture RAM from a running Linux instance, MAGNET RAM Capture or similar from Windows. The constraint is that ephemeral cloud VMs may be gone by the time you reach for them, and capturing memory requires the instance to still be running; hypervisor-level live snapshots would remove that requirement, but AWS / Azure / GCP don't generally expose them to tenants. For containers and serverless, memory forensics is largely impractical; you fall back on runtime-security telemetry (Falco, Tracee, eBPF-based sensors) captured before the workload terminates.

When does a DFIR retainer pay off?

When you have meaningful cloud presence, sensitive data, and an internal IR team that's small enough to be overwhelmed by a serious incident. A retainer (Mandiant, CrowdStrike, Kroll, Unit 42, Arctic Wolf, and similar) buys you a contracted response SLA - typically 1-4 hours to first responder - and pre-negotiated rates instead of emergency premium pricing. It also forces you to do a tabletop with the firm in advance so they know your environment when the call comes in. The math works above roughly 100 employees, or earlier if you're in a regulated industry. Below that, a "best-efforts" relationship with one or two firms you've talked to in calm times is often enough.

Where next

Detection engineering - the discipline that produces the alerts IR responds to.
Cloud SOC - the operating model that wraps detection and IR together.
Threat research - the attacker techniques your IR program defends against.
Breach kill chains - real incidents, end-to-end, with timelines and lessons.
Friday Zoom - IR readiness, tabletops, and post-incident retros come up regularly. Drop in.
