The 30-second version: Cloud incident response is the same six-phase loop you already know - Prepare → Identify → Contain → Eradicate → Recover → Lessons Learned - but every phase changes when the workload lives behind an API in someone else's data center. The instance you want to image may already be gone. The credential the attacker is using leaves a perfect log if you're capturing it, and zero log if you're not. The blast radius is whatever scope an over-privileged role has, which is usually larger than anyone thinks.
The work that makes IR survivable in cloud is done before any incident: an immutable cross-account log archive, a dedicated forensics account with pre-staged tooling, snapshot pipelines, SCPs that block evidence destruction, and runbooks practiced in tabletop. Without those, you investigate blind. With them, the playbook is largely automatable.
On this page
- The IR lifecycle in cloud
- What cloud changes about IR
- Forensic readiness
- The "log everything that matters" baseline
- Evidence collection by workload type
- Memory forensics on cloud VMs
- Container forensics
- Isolating compromised workloads
- Credential rotation under incident
- IR runbooks
- Tabletop exercises
- DFIR retainers
- Communication during incident
- Post-incident
- AWS, Azure, and GCP side-by-side
- Specialized cloud-IR tooling
- Maturity stages
- Common pitfalls
- Further reading
- FAQ
The IR lifecycle in cloud
Two reference lifecycles dominate: NIST SP 800-61r2 (Preparation, Detection & Analysis, Containment / Eradication / Recovery, Post-Incident Activity) and the SANS PICERL model (Preparation, Identification, Containment, Eradication, Recovery, Lessons Learned). They're functionally the same loop with slightly different boundaries. Either works; pick one, use its vocabulary consistently across your runbooks and your retros, and don't get religious about the seams.
Preparation
The 80% of the work. Immutable log archive, forensics account, snapshot pipelines, IR runbooks, on-call rotation, retainer contract, tabletop calendar, communications plan. If preparation is weak, every other phase is harder; if it's strong, the rest is mostly procedure.
Identification
Triage the alert. Is it real? What's the scope? What's the blast radius if the worst-case is true? In cloud this is largely a question of which principal, which resources, what API activity, what data - answerable from logs if you collected them.
Containment
Stop the bleeding without destroying evidence. Quarantine the workload (not stop it), revoke active sessions, narrow IAM, isolate the network. Two flavors: short-term ("buy us an hour") and long-term ("safe to operate while we eradicate").
Eradication
Remove the attacker's persistence - implanted IAM principals, modified Lambda code, malicious container images, backdoored AMIs, scheduled tasks, rogue OAuth grants. Cloud-specific: an attacker who got into your IdP has persistence options well beyond a single host.
Recovery
Restore service safely. Redeploy from known-good IaC, rotate keys and secrets, validate the environment matches the pre-incident known-good baseline, monitor for re-emergence. Cloud helps here - rebuild-from-Terraform is faster and cleaner than reimaging an on-prem box.
Lessons Learned
Blameless retro. What broke? What was missed? Which detections fired late? Which runbooks were wrong? Outputs: detection backfill, runbook updates, control improvements, tabletop scenarios. The phase most often skipped; the one with the highest long-term return.
The phases are not strictly sequential. You'll loop between identification and containment as scope expands; you'll re-eradicate when you discover new persistence; you'll start lessons-learned work the moment the incident commander declares the active phase over. The lifecycle is a frame, not a Gantt chart.
What cloud changes about IR
The lifecycle survives the move to cloud. The execution looks meaningfully different along seven dimensions that show up in every cloud incident.
| Dimension | Traditional IR | Cloud IR |
|---|---|---|
| Workload persistence | A physical host is still there next week | Instances scale-in, containers restart, functions are gone in milliseconds - capture evidence at the moment of detection or lose it |
| Primary attack surface | OS-level: shell, malware, lateral movement | API-level: stolen credential calling AWS / Azure / GCP APIs from somewhere it shouldn't be |
| Primary evidence | Disk image, memory dump, endpoint telemetry | CloudTrail / Activity Log / Cloud Audit Logs, plus disk and memory when the workload is the target |
| Containment unit | Pull the cable, block at firewall, isolate the host | Detach IAM, swap to quarantine security group, revoke OAuth grants, kill role sessions |
| Blast radius | Whatever's on the network segment | Whatever the compromised role can call - often cross-account, cross-region, cross-service |
| Shared responsibility | You own everything below the rack | Provider owns hypervisor / network / storage layer; you escalate when an incident crosses it |
| Speed | Hours to days to spread | Minutes to seconds - automated exploitation can run thousands of API calls before a human is even paged |
The single biggest practical shift is that most cloud breaches are control-plane incidents, not data-plane incidents. Someone stole a credential - from a developer laptop, a leaked git commit, a compromised CI runner, a phished session - and is now making API calls. The shell on the box, if there's one at all, is incidental. Your IR program has to be biased toward identity, log analysis, and API-call correlation; the host-forensics muscle that on-prem IR teams grew over twenty years is necessary but no longer sufficient.
Forensic readiness
The hardest lesson of cloud IR is that almost everything that matters has to be set up before the incident. The attacker who has compromised your environment is not going to wait while you deploy a logging pipeline. Treat the items below as preconditions to running a viable IR program, not as nice-to-haves.
1. Immutable, cross-account log archive
The foundational control. The attacker who lands in your production account will try to disable logging and delete logs; the log archive has to be somewhere they cannot reach.
- AWS. Organization CloudTrail (all regions, management + S3 data events at minimum) writing to an S3 bucket in a dedicated Log Archive account. S3 Object Lock in compliance mode, MFA delete, a bucket policy denying delete from any principal except a break-glass role, and a Service Control Policy at the org root that prevents anyone in production from disabling the trail. Pair with a separate Security tooling account that has read-only access for the SOC.
- Azure. Activity Log + Diagnostic Settings exported to a Log Analytics workspace in a separate subscription, with immutable blob storage as the long-term archive. Management locks at the subscription level prevent deletion.
- GCP. Organization-aggregated Cloud Audit Logs sinking to a dedicated logging project, with retention-locked Cloud Storage as the immutable archive. Organization policies prevent disabling of audit logs on production projects.
The "what's the retention?" answer is "long enough for the longest regulatory clock that applies to you" - for most orgs, that's 365 days minimum, often 7 years for financial-services contexts.
2. Dedicated forensics account / subscription / project
A clean environment, on the same cloud as production, but in a separate account boundary. The contents:
- Pre-built forensics AMI / VM image / instance template with disk-imaging utilities (`dc3dd`, `guymager`), packet capture (`tcpdump`, `zeek`), memory tooling (LiME, AVML), and a Velociraptor / GRR / OSQuery collector pre-installed.
- Isolated VPC / VNet with no route to production. Public egress only via a logged, monitored gateway.
- IAM trust policies that let the IR team assume a read-only or evidence-capture role into compromised accounts - but no trust the other direction.
- A logging archive of its own, separate from production's, so the IR team's actions are themselves auditable.
- A small budget pool that won't trip cost alerts during a real incident.
3. Snapshot / AMI / image pipelines for evidence preservation
Scripts or runbooks that, given a workload ID, will: snapshot the disk, copy the snapshot to the forensics account, capture metadata (instance config, IAM role, network state, tags) and write it to immutable storage, all in under a few minutes. Idempotent - running twice doesn't double the work. Tested - runbooks that have never run for real don't run for real on the worst day.
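The idempotency requirement can be sketched as below. This is a minimal illustration, not a complete pipeline: it assumes `ec2` behaves like a boto3 EC2 client, and the `CaseId` and `Purpose` tag names are hypothetical conventions chosen for the example. Copying the snapshot to the forensics account and capturing metadata would be separate steps.

```python
from typing import Any


def preserve_volume(ec2: Any, volume_id: str, case_id: str) -> str:
    """Snapshot a volume for a case, reusing an earlier snapshot if one exists.

    `ec2` is assumed to expose the boto3 EC2 client methods
    describe_snapshots and create_snapshot.
    """
    # Idempotency check: has this volume already been snapshotted for this
    # case? The CaseId tag is a hypothetical naming convention.
    existing = ec2.describe_snapshots(
        OwnerIds=["self"],
        Filters=[
            {"Name": "volume-id", "Values": [volume_id]},
            {"Name": "tag:CaseId", "Values": [case_id]},
        ],
    )["Snapshots"]
    if existing:
        return existing[0]["SnapshotId"]

    # No prior snapshot: create one and tag it at creation time so a
    # re-run of the runbook finds it.
    created = ec2.create_snapshot(
        VolumeId=volume_id,
        Description=f"IR evidence for case {case_id}",
        TagSpecifications=[
            {
                "ResourceType": "snapshot",
                "Tags": [
                    {"Key": "CaseId", "Value": case_id},
                    {"Key": "Purpose", "Value": "forensic-evidence"},
                ],
            }
        ],
    )
    return created["SnapshotId"]
```

Passing the client in as a parameter keeps the function testable against a stub, which makes it easier to exercise the runbook before the worst day.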
4. Service control / organization policies that block evidence destruction
The attacker with admin in your production account should still be unable to: disable CloudTrail / Activity Log / Audit Logs, delete the log archive bucket, modify forensics IAM trust policies, change the immutable storage configuration. SCPs (AWS), management group policies (Azure), and organization policies (GCP) enforce this above the production account, so production admin doesn't grant the necessary scope to undo the controls. This is the second-most-important control after the immutable log archive itself.
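As an illustration of the AWS side of this control, the sketch below builds a Service Control Policy document that denies CloudTrail tampering to every principal except a break-glass role. The role ARN pattern and the `Sid` are hypothetical, and the action list is deliberately short; a real policy would likely also cover the log archive bucket and other detection services.

```python
import json


def build_evidence_protection_scp(break_glass_role_arn: str) -> dict:
    """Return an SCP document that blocks CloudTrail tampering.

    The break-glass role ARN is an assumption of this sketch; substitute
    whatever identity your organization reserves for emergencies.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyAuditLogTampering",  # hypothetical identifier
                "Effect": "Deny",
                "Action": [
                    "cloudtrail:StopLogging",
                    "cloudtrail:DeleteTrail",
                    "cloudtrail:UpdateTrail",
                    "cloudtrail:PutEventSelectors",
                ],
                "Resource": "*",
                # Exempt only the break-glass role from the deny.
                "Condition": {
                    "ArnNotLike": {"aws:PrincipalArn": break_glass_role_arn}
                },
            }
        ],
    }


# Example: render the policy as JSON for review or deployment via IaC.
policy = build_evidence_protection_scp("arn:aws:iam::*:role/ir-break-glass")
print(json.dumps(policy, indent=2))
```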
5. The break-glass account, ready to use
A separately credentialed identity, used only in emergencies, with permissions broad enough to investigate any account but narrow enough that abuse is detectable. MFA-required, monitored, with use that pages the security team automatically. If you can't get into the compromised account because the attacker rotated the credentials of every other admin, you need this; if you never set it up, your IR program stalls during the incident.
The "log everything that matters" baseline
Logs are not retroactive. If they weren't being captured when the incident began, no clever query recovers them. The minimum baseline that pays for itself the first time you have an actual incident:
AWS
- CloudTrail - org trail, all regions, management events + S3 data events at a minimum. Add Lambda invocation events for sensitive functions. Confirm that log file integrity validation is enabled.
- VPC Flow Logs - every production VPC, accept + reject, custom format that includes `pkt-srcaddr` / `pkt-dstaddr` for NAT-traversed flows.
- Route 53 query logs - query logging for public hosted zones, plus Resolver query logging in critical VPCs. DNS is where exfil and C2 frequently show up.
- S3 server access logging for sensitive buckets; CloudTrail data events as the structured alternative.
- GuardDuty findings - enabled in every account, every region. Runtime monitoring for EKS / ECS / EC2 where worth the spend.
- IAM Access Analyzer - surfaces unused access and external sharing; useful as a baseline that an incident's findings can be diffed against.
- WAF logs, ALB / NLB access logs, CloudFront logs for internet-facing services.
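The CloudTrail items in this baseline are easy to check mechanically. The sketch below takes one trail description, shaped like an entry in the `trailList` returned by CloudTrail's DescribeTrails API, and reports which baseline properties are missing. The function name and gap messages are illustrative only.

```python
def trail_baseline_gaps(trail: dict) -> list[str]:
    """Return human-readable gaps for one CloudTrail trail description.

    `trail` is assumed to look like an entry from DescribeTrails'
    `trailList`; a missing key is treated as "not enabled".
    """
    checks = {
        "IsOrganizationTrail": "not an organization trail",
        "IsMultiRegionTrail": "does not cover all regions",
        "LogFileValidationEnabled": "log file integrity validation is off",
    }
    return [message for key, message in checks.items() if not trail.get(key, False)]


# Example: a single-region trail with validation turned on.
example = {
    "Name": "org-trail",
    "IsOrganizationTrail": True,
    "IsMultiRegionTrail": False,
    "LogFileValidationEnabled": True,
}
print(trail_baseline_gaps(example))
```

Event selectors (management versus data events) are configured separately from these trail properties, so this check covers only part of the baseline.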
Azure
- Activity Log - every subscription, exported via Diagnostic Settings to Log Analytics and immutable storage.
- Diagnostic logs - for every relevant resource type (Key Vault, Storage, SQL, App Service, AKS, Function Apps). Defaults are not enough; enable explicitly per service.
- Entra ID sign-in logs and audit logs - the cloud-identity equivalent of CloudTrail's identity events. Export them so retention outlasts the built-in window (7 days on the Free tier, 30 days with P1 / P2 licensing).
- NSG flow logs - to Storage and Traffic Analytics.
- Microsoft Sentinel as the SIEM ingestion layer when you have the spend; the connectors do most of the work.
- Defender for Cloud recommendations and alerts - surface compliance posture and active findings.
GCP
- Cloud Audit Logs - Admin Activity (always on) and Data Access (must be enabled, costs more, and worth it for sensitive services like BigQuery, Cloud Storage, Cloud KMS, IAM).
- VPC Flow Logs - every subnet in scope.
- Cloud DNS query logging.
- Cloud Logging sinks - aggregated at the organization level into the dedicated logging project, with retention-locked Cloud Storage as the long-term archive.
- Security Command Center Premium / Enterprise for threat detection findings; Chronicle for SIEM-grade ingestion if you've adopted it.
- Access Transparency / Access Approval - provider-personnel access logs, for the regulated workloads where that matters.
The unifying principle: log what changes, log who did it, log from where, log against what data. The volume looks alarming on the bill the first month. The single time an incident requires you to reconstruct three months of activity for a specific compromised credential, the cost has paid itself off many times over.
Evidence collection by workload type
Different workloads need different evidence captures. Cloud IR runbooks split by workload class because the mechanics genuinely differ.
EC2 / Azure VM / GCE instance
Closest to traditional host forensics, with the cloud adding capabilities you didn't have on-prem.
- Snapshot first, stop second. The disk snapshot of a running instance captures state including some pagefile and journal contents; stopping the instance first risks losing in-memory state and may trigger anti-forensics in malware that watches for shutdown.
- Share the snapshot to the forensics account (cross-account EBS snapshot sharing on AWS; Azure managed disk export to the forensics subscription; GCE disk snapshot in the forensics project).
- Create a fresh forensics volume from the snapshot, attach to the analysis instance in an isolated VPC, mount read-only. Never attach the forensics volume to anything that has internet egress until you've confirmed what you have.
- Capture metadata alongside the disk: IAM role and its policies, security groups, IMDS history if available, instance tags, launch template, AMI ID, user data. Cloud-side metadata is often more useful than anything on the disk itself.
- If memory matters, capture it before snapshot - see the memory forensics section.
EKS / AKS / GKE pods
Containers are designed to be ephemeral and identical. Treat the container as evidence about the deployment, not as the state itself.
- `kubectl debug` ephemeral container - attach a debug container to a running pod in the same namespace and PID space; capture `/proc`, environment, network state, running processes. Pre-build a debug image with your forensics tools.
- Pin the image SHA before anything else. A pod compromised via image vulnerability is meaningless to investigate without knowing precisely which image layer was running.
- Node-level disk snapshot for the underlying VM if the pod is suspected of having escaped or written to host paths.
- Container runtime telemetry - Falco, Tracee, eBPF-based sensors. These have to have been running before the incident; standing them up after is too late.
- Kubernetes audit logs for the API-server events that produced and changed the pod. These should be flowing to the same log archive as the rest of your audit logs.
- `kubectl cp` for any specific files of interest, then capture stdout / stderr logs before pod termination.
See also the Kubernetes page for the deeper detection and runtime-security context.
Lambda / Cloud Functions / Azure Functions
Serverless evidence is mostly side-channel. The function instance is gone; you investigate what it did, not the function itself.
- Function code + configuration at the time of the incident. CloudTrail / Activity Log / Audit Logs record the deployment events; capture the version that was running, including environment variables (they often contain secrets that need rotating).
- Invocation logs - CloudWatch Logs (AWS), Application Insights / Log Analytics (Azure), Cloud Logging (GCP). Export anything covering the suspected incident window to immutable storage before retention windows expire.
- Distributed tracing - X-Ray, Application Insights, Cloud Trace. Reconstructs the call graph of a request that triggered the function; useful for showing whether the function was the attacker's pivot or just a downstream caller.
- IAM role activity from the function's execution role - what API calls did it actually make, when, against what.
S3 / Azure Blob / GCS objects
- Object versioning and MFA delete (AWS) / soft delete + versioning (Azure) / object versioning + retention policies (GCS) - turn these on by default for sensitive buckets so that an attacker can't simply delete objects to cover tracks or to ransom them.
- Server access logs / data events for the bucket - every GET, PUT, DELETE with the principal, source IP, user agent. Without this, the answer to "what did they exfiltrate?" is a guess.
- Object hash verification - for any objects suspected of modification, compare the current MD5 / SHA256 with whatever last-known-good baseline you have. S3 supports SHA-256 checksums natively if you opt in.
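The hash comparison in the last bullet can be sketched with the standard library alone. The example assumes you already hold a baseline mapping of object key to SHA-256 hex digest and have a way to stream each object's current bytes; both function names are hypothetical.

```python
import hashlib
import io
from typing import BinaryIO


def sha256_of(stream: BinaryIO, chunk_size: int = 1024 * 1024) -> str:
    """Hash a binary stream in chunks so large objects never sit fully in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def modified_objects(baseline: dict[str, str], current: dict[str, str]) -> list[str]:
    """Return keys whose current hash differs from the baseline or is missing."""
    return sorted(key for key, known in baseline.items() if current.get(key) != known)


# Usage with an in-memory stream standing in for a downloaded object body.
body = io.BytesIO(b"example object contents")
current_hashes = {"reports/q1.csv": sha256_of(body)}
baseline_hashes = {"reports/q1.csv": "0" * 64}  # placeholder baseline digest
print(modified_objects(baseline_hashes, current_hashes))
```

Objects present now but absent from the baseline are not flagged here; whether new objects count as suspicious depends on the bucket.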
IAM credentials
Often the primary evidence. For any credential - IAM user access key, IAM role session, service principal secret, GCP service account key - the questions are the same:
- Who used it? CloudTrail's `userIdentity`, Activity Log's caller, Cloud Audit Logs' `protoPayload.authenticationInfo`.
- From where? Source IP, user agent. Anomalous geos and unexpected user agents (cli tools when the workload uses the SDK; Python when the workload is Go) are classic indicators.
- When? First-use timestamp, last-use, full activity timeline.
- What API calls? The full list, with read / write / sensitive breakdown.
Tools like Netflix Dispatch, cloudgrep, and AWS-native tools help here, but a well-indexed CloudTrail / Activity / Audit Log corpus in your SIEM is the foundation.
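For AWS, the four questions above map onto a handful of CloudTrail record fields. The sketch below summarizes already-parsed CloudTrail records (a list of dicts) for one access key. It is an illustration rather than a replacement for a SIEM query, and the function name and output keys are invented for the example.

```python
from collections import Counter


def summarize_key_activity(events: list[dict], access_key_id: str) -> dict:
    """Answer who / from where / when / what for one access key.

    `events` are assumed to be CloudTrail records already parsed from JSON.
    """
    mine = [
        e for e in events
        if e.get("userIdentity", {}).get("accessKeyId") == access_key_id
    ]
    # CloudTrail eventTime is ISO 8601 in UTC, so string order is time order.
    times = sorted(e["eventTime"] for e in mine)
    return {
        "principals": sorted({e["userIdentity"].get("arn", "unknown") for e in mine}),
        "source_ips": Counter(e.get("sourceIPAddress", "unknown") for e in mine),
        "user_agents": Counter(e.get("userAgent", "unknown") for e in mine),
        "first_seen": times[0] if times else None,
        "last_seen": times[-1] if times else None,
        "api_calls": Counter(f'{e["eventSource"]}:{e["eventName"]}' for e in mine),
        # readOnly is a boolean in the record; False marks a mutating call.
        "write_calls": sum(1 for e in mine if e.get("readOnly") is False),
    }
```

Sorting `api_calls` by count and looking at the rare entries first is often a quick way to spot the calls that do not belong to the workload's normal pattern.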
Memory forensics on cloud VMs
Memory forensics on cloud VMs is technically possible, useful for a narrow class of incidents, and constrained by cloud realities most on-prem tooling didn't anticipate.
- Linux: LiME - Loadable Kernel Module that dumps physical RAM to a file (or over the network); requires kernel-version-matched build. AVML (Microsoft) - userspace, no kernel module; works on most cloud Linux distros without recompilation. Both can be pre-staged on a forensics user-data script ready to run if needed.
- Windows: MAGNET RAM Capture, FTK Imager, WinPmem. Run via Systems Manager Session Manager / Azure Run Command / GCP OS Login to avoid the need for inbound SSH/RDP.
- Analysis: Volatility 3 is the de-facto open-source analysis framework, plus commercial tooling (Magnet Axiom Cyber, Cellebrite Inseyets, MAGNET IEF) for richer workflows.
The cloud-specific constraints to know:
- You cannot ask the hypervisor for memory. AWS / Azure / GCP do not expose hypervisor-level memory acquisition to tenants. You're capturing from inside the guest, which means the malware can see you doing it.
- Ephemeral instances may be gone by the time you decide to capture. Autoscaling, spot reclamation, container scheduling, and even just `StopInstances` calls by the attacker erase live state.
- Encrypted memory features (AWS Nitro Enclaves, Azure Confidential Compute, GCP Confidential VMs) may prevent memory capture even from inside the VM. Plan around them rather than against them.
- Network egress to ship the memory image may not be available if the host is being contained. Capture to local storage, then move the storage.
For most cloud incidents, memory forensics is not the highest-leverage activity. Save the muscle for the cases that genuinely warrant it (fileless attacks, in-memory implants, suspected rootkits); for the rest, CloudTrail and a disk image will tell you more, faster.
Container forensics
Container forensics is its own discipline. The container is short-lived, the filesystem is layered, the logs scroll past quickly, and most of the interesting state is somewhere the container layer doesn't preserve.
- Image SHA capture. First action. Without the precise image digest, you cannot say what code was running, and the registry tag may move under you.
- Runtime layer snapshot. The container's writable layer (the diff from the read-only image) holds whatever the malware wrote at runtime. `docker commit` on the container creates an image of the current state; `crictl` / `nerdctl` equivalents for containerd / CRI-O. Save the image to your forensics registry.
- Ephemeral debug container (`kubectl debug`) for live state - process list, network sockets, mounted filesystems, environment.
- Runtime telemetry from before the incident. The most valuable evidence for container investigations is what was being captured continuously: Falco rules firing on syscalls, Tracee traces, eBPF-based sensors (Cilium Tetragon, Sysdig, Datadog runtime security, Wiz Runtime Sensor). Without these, the container's history is largely unrecoverable after the fact.
- Kubernetes audit log - every `exec`, every `port-forward`, every `cp`, every secret read. Often the attacker's clearest trail.
- Node-level filesystem and process state for suspected container escape; capture the underlying node disk snapshot as if it were a VM compromise.
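To show how the audit-log bullet translates into a query, the sketch below labels parsed Kubernetes audit events that represent interactive access or secret reads. It assumes events follow the `audit.k8s.io` Event shape (`verb`, `objectRef`, `user`, `requestReceivedTimestamp`). Note that `kubectl cp` runs over the exec subresource, so it appears here as `exec`. The function names and labels are illustrative.

```python
from typing import Optional

# Pod subresources that indicate interactive access to a container.
INTERACTIVE_SUBRESOURCES = {"exec": "exec", "portforward": "port-forward"}
SECRET_READ_VERBS = {"get", "list", "watch"}


def classify_audit_event(event: dict) -> Optional[str]:
    """Return a label for events of investigative interest, else None."""
    ref = event.get("objectRef") or {}
    resource = ref.get("resource")
    subresource = ref.get("subresource")
    if resource == "pods" and subresource in INTERACTIVE_SUBRESOURCES:
        return INTERACTIVE_SUBRESOURCES[subresource]
    if resource == "secrets" and event.get("verb") in SECRET_READ_VERBS:
        return "secret-read"
    return None


def suspicious_trail(events: list[dict]) -> list[tuple]:
    """Build a time-ordered list of (timestamp, user, label, namespace/name)."""
    rows = []
    for event in events:
        label = classify_audit_event(event)
        if label is None:
            continue
        ref = event.get("objectRef") or {}
        target = f'{ref.get("namespace", "")}/{ref.get("name", "")}'
        user = (event.get("user") or {}).get("username")
        rows.append((event.get("requestReceivedTimestamp"), user, label, target))
    # Missing timestamps sort first rather than raising on None comparison.
    return sorted(rows, key=lambda row: row[0] or "")
```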
Isolating compromised workloads
Containment without evidence loss is the hardest single skill in cloud IR. The instinct to "shut it down" is wrong; the right move is to isolate the workload from anything it can damage or call, while keeping it intact enough to interrogate.
The quarantine security group pattern (AWS)
A pre-built security group with no ingress rules and egress restricted to (a) your forensics VPC peering, (b) your logging endpoints, (c) Systems Manager / SSM endpoints for remote command execution. Replacing the instance's existing SGs with this one isolates the workload in seconds while leaving it running for evidence capture.
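A minimal sketch of the swap follows, assuming `ec2` behaves like a boto3 EC2 client and that the quarantine security group already exists. It records each network interface's original groups before replacing them, so the pre-containment state is preserved as evidence and for any later rollback. The function name is hypothetical.

```python
from typing import Any


def quarantine_instance(ec2: Any, instance_id: str, quarantine_sg_id: str) -> dict:
    """Replace all security groups on an instance's ENIs with the quarantine SG.

    Returns a mapping of ENI ID -> original security group IDs, which
    should be written to immutable storage alongside the other evidence.
    """
    response = ec2.describe_instances(InstanceIds=[instance_id])
    instance = response["Reservations"][0]["Instances"][0]

    original_groups = {}
    for eni in instance["NetworkInterfaces"]:
        eni_id = eni["NetworkInterfaceId"]
        # Record the pre-containment state before changing anything.
        original_groups[eni_id] = [group["GroupId"] for group in eni["Groups"]]
        ec2.modify_network_interface_attribute(
            NetworkInterfaceId=eni_id,
            Groups=[quarantine_sg_id],
        )
    return original_groups
```

Security groups are stateful, so connections that were already established and tracked may persist after the swap; verifying isolation from flow logs rather than assuming it is a reasonable follow-up step.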
NSG / firewall isolation (Azure / GCP)
Same pattern, different primitives. Azure: a "quarantine" NSG associated with the NIC, replacing existing NSGs. GCP: tag the instance with a quarantine network tag that matches a firewall rule denying all ingress and egress except to logged forensic endpoints.
IAM containment
Often more important than network containment in cloud breaches. If a credential is compromised, you want to:
- Detach all policies from the user / role, replacing with an explicit deny on everything except the calls your IR team will use to investigate.
- Revoke active sessions for IAM roles using the `AWSRevokeOlderSessions` inline policy - the credential continues to exist (so the audit trail of attempted reuse is preserved) but no API call succeeds.
- For SSO / federated identities, disable the user at the IdP (Entra ID, Okta, Google Workspace) and revoke all refresh tokens.
The "running but cannot reach anything" state
The goal is a workload that is alive enough to capture from, dead enough not to cause further harm. For most cloud incidents this is achievable in under 5 minutes if the runbooks and IAM are pre-built; if they aren't, you'll spend 30 minutes deciding what to do while the attacker continues.
Credential rotation under incident
Credential rotation is necessary but rarely the first thing to do. The discipline is to capture evidence and contain first, then rotate - and to rotate correctly so you don't leave windows open.
AWS
- IAM user access keys: `aws iam update-access-key --status Inactive` first (preserves the key for audit lookup); then `aws iam delete-access-key` after the active phase. Create a new key only after confirming the workload that uses it has a containment plan.
- IAM role sessions: deny based on session-issue-time. The `AWSRevokeOlderSessions` inline policy applies a condition denying all actions where `aws:TokenIssueTime` is before a chosen timestamp; this kills active sessions without affecting the long-term credential.
- STS federation: rotate the IdP-side credential (SAML signing key, OIDC trust) if the trust relationship itself is suspected.
- Root credentials: rotate, re-MFA, audit the root API key usage history. The root key should never be in regular use; any use is an event in itself.
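The session-revocation policy described above is small enough to generate on the fly. The sketch below builds a deny-all policy document keyed on `aws:TokenIssueTime`, following the pattern the text describes; the function name is hypothetical, and attaching the policy to the role is left out.

```python
import json
from datetime import datetime, timezone


def revoke_sessions_policy(cutoff: datetime) -> dict:
    """Build an inline policy denying every call from sessions issued before `cutoff`."""
    if cutoff.tzinfo is None:
        # Refuse naive datetimes: a wrong timezone silently leaves sessions alive.
        raise ValueError("cutoff must be timezone-aware")
    stamp = cutoff.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Deny",
                "Action": "*",
                "Resource": "*",
                "Condition": {"DateLessThan": {"aws:TokenIssueTime": stamp}},
            }
        ],
    }


# Example: revoke everything issued before "now".
print(json.dumps(revoke_sessions_policy(datetime.now(timezone.utc)), indent=2))
```

Sessions issued after the cutoff are unaffected, so this buys time only if the attacker's path to obtaining new sessions has also been closed.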
Azure
- Service principal secrets and certificates: rotate via Entra ID. Old credentials remain in audit logs for the retention window.
- Managed identities: cannot be "rotated" in the user-managed sense; instead, remove the role assignment and re-grant if you suspect the identity itself is compromised.
- User sessions: `Revoke-AzureADUserAllRefreshToken` (legacy AzureAD module; `Revoke-MgUserSignInSession` in Microsoft Graph PowerShell, or the Entra ID portal equivalent) forces reauth on every device.
GCP
- Service account keys: disable first (preserves audit trail) via `gcloud iam service-accounts keys disable`; delete after the active phase. The presence of long-lived service account keys at all is increasingly a finding - prefer Workload Identity Federation.
- OAuth tokens: revoke refresh tokens via Admin SDK or the IdP integration. Active access tokens expire on their own clock but cannot be retroactively killed in all cases.
- User sessions: Workspace admin console offers session reset on a per-user basis.
The cross-cloud principle: disable, don't delete, until evidence is captured. A deleted credential's last-used and policy-version history is harder to recover than a disabled one.
IR runbooks
Runbooks are the predefined sequences your team executes for the common cloud incident classes. The right set covers the 80% of incidents that look the same time after time. Each runbook should answer: who's on call, what to capture, how to contain, how to verify, who to notify.
Compromised IAM credential
The single most common cloud incident. Trigger: GuardDuty / Defender / SCC finding of unusual API activity, or a credential surfacing in a public leak (GitHub, paste sites). Sequence: identify the principal, query the credential's full CloudTrail / Activity Log history, capture and store the activity trail in immutable storage, disable the credential, revoke active sessions, scope the blast radius from the API calls made, then notify the workload owner, rotate, and audit downstream resources for attacker persistence (new IAM users, new roles, modified policies).
Exposed S3 / Storage bucket
Trigger: external researcher report, internal CSPM alert, or a finding from data-exposure scanners (GreyNoise, BinaryEdge, internal). Sequence: confirm exposure (don't trust the alert alone), enumerate accessed objects from server access logs, identify the sensitive data classes involved, fix the bucket ACL / policy / public-access-block, capture an immutable snapshot of the access logs, then escalate to the data-owner team and Legal for breach-notification analysis.
Crypto-mining EC2 / VM
Trigger: GuardDuty CryptoCurrency finding, billing spike alert, unusual outbound network volume to mining-pool IP ranges. Usually downstream of a compromised IAM credential that launched the workload. Sequence: snapshot the workload, capture network traffic, identify the launching principal (often a forgotten access key in a code repo), shut down the workload, then move to the credential-compromise runbook to find the root cause.
Ransomware in storage
Trigger: customer reports of inaccessible files, S3 object versioning chain showing mass-deletion, ransom note found in a bucket. Sequence: pause all auto-replication so the attacker's deletions don't propagate to backups, verify object versioning is intact, restore from versioning history or backups, identify the credential used for the deletions, follow the credential-compromise runbook for the source, and engage Legal for ransom-decision and law-enforcement notification.
Exfiltration via egress
Trigger: VPC Flow Logs / NSG Flow Logs anomaly, GuardDuty network-anomaly or exfiltration findings (for example Behavior:EC2/TrafficVolumeUnusual or Trojan:EC2/DNSDataExfiltration), DNS-tunneling indicators in Route 53 / Cloud DNS query logs. Sequence: contain the workload via the quarantine SG / NSG pattern, capture the workload, identify the data being transferred (via S3 access logs, database query logs, application logs), determine the principal and the credential, then run the credential-compromise + ransomware-style data-loss runbooks in parallel.
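A first-pass look at egress volume can be sketched from raw flow log lines. The example below assumes the default 14-field VPC Flow Logs format (a custom format would need different field positions) and totals accepted bytes sent from private addresses to non-private destinations. The function name is hypothetical, and the private / non-private split is a rough heuristic rather than a real egress classification.

```python
import ipaddress
from collections import Counter

# Field positions in the default VPC Flow Logs format (version 2).
SRCADDR, DSTADDR, BYTES, ACTION = 3, 4, 9, 12


def external_egress_bytes(lines: list[str]) -> Counter:
    """Sum accepted bytes per external destination address."""
    totals: Counter = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) != 14 or fields[ACTION] != "ACCEPT":
            continue  # skips REJECT records and NODATA / SKIPDATA placeholders
        try:
            src = ipaddress.ip_address(fields[SRCADDR])
            dst = ipaddress.ip_address(fields[DSTADDR])
            nbytes = int(fields[BYTES])
        except ValueError:
            continue  # malformed or placeholder ("-") values
        if src.is_private and not dst.is_private:
            totals[fields[DSTADDR]] += nbytes
    return totals
```

Calling `external_egress_bytes(lines).most_common(10)` gives a shortlist of destinations to compare against known-good endpoints before digging into application and storage logs.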
Insider / privileged-user misuse
Trigger: HR notification of an investigation, anomalous behavior from a privileged principal, or a finding that an admin's account is doing something the role permits but the human shouldn't. Sequence: do NOT alert the subject; preserve evidence; engage HR and Legal before any technical action; capture the full activity trail; coordinate any rotation / disablement with HR's timing.
Reference playbooks worth reading
- AWS Incident Response Playbooks - the AWS-published reference set; technical, opinionated, useful even if you adapt them.
- CISA Federal Cybersecurity Incident & Vulnerability Response Playbooks - federal but broadly applicable.
- AWS Customer Playbook Framework - the structure for writing your own.
- Microsoft Incident Response Playbooks - phishing, password spray, app consent grants.
Tabletop exercises
An IR program that has never practiced will not perform under stress. Tabletops are the cheapest way to discover that your runbooks reference an account that no longer exists, that the on-call rotation hasn't been updated since the last reorg, or that nobody has paged Legal in 18 months and the contact is stale.
Cadence
- Quarterly - small, scenario-specific. 90 minutes. Run the runbook on paper, identify what breaks.
- Annually - major, multi-team. A full day. Multiple scenarios, including ones that overlap (a credential compromise that becomes a ransomware event).
- Ad-hoc - after a major architectural change, a leadership change, or a near-miss incident.
Who attends
Incident commander (rotating), security engineering, on-call SRE / platform team, cloud-account owners for the affected workloads, communications / PR, Legal / privacy counsel, executive sponsor (often CTO or CISO), and a designated observer who takes notes and runs the retro. For tabletops with a DFIR retainer, the firm's account team should join - they should know your environment before the real call.
Scenarios worth running
- Leaked AWS access key in a public GitHub commit, with crypto-mining activity within 4 minutes.
- Compromised Entra ID admin account during a session-hijack attack against the CFO.
- Public S3 / GCS bucket discovered by an external researcher with media interest.
- Ransomware on a subset of S3 buckets via a compromised CI/CD pipeline.
- Insider exfiltration: a departing engineer with broad access to customer-data tables.
- Supply-chain compromise: a malicious update to a widely-used base image discovered by your runtime sensor.
- Cross-tenant cloud-provider incident: the provider notifies you of a vulnerability that affected your data.
DFIR retainers
A retainer is a contracted relationship with a DFIR firm that buys you a response SLA, pre-negotiated rates, and (most importantly) a team that knows your environment before the call. The major firms in 2026:
- Mandiant (Google Cloud) - long history, deep nation-state experience, Mandiant Managed Defense for ongoing detection.
- CrowdStrike Services - strong on endpoint-led investigations, paired with their EDR.
- Kroll - broad coverage including breach-coach legal coordination; common in privacy-driven incidents.
- Unit 42 (Palo Alto Networks) - strong on cloud and ransomware engagements.
- Arctic Wolf - mid-market focus, broad retainer base.
- Secureworks (now Sophos), Optiv, Coalfire, regional specialists.
- Mitiga, Cado - cloud-native IR specialists, particularly useful for orgs whose incident surface is heavily AWS / Azure / GCP rather than endpoint.
When the math works
Retainers pay off when:
- You have meaningful cloud presence and sensitive data, and an internal IR team too small to cover a serious incident without help.
- You're in a regulated industry where regulatory notification timelines (GDPR 72h, NYDFS 72h, etc.) leave no slack for shopping for a firm during the incident.
- You have cyber insurance - most policies require a panel-approved DFIR firm and offer rate concessions for retainer customers.
- You want the firm to do an annual tabletop or environment-familiarization engagement to be useful when the real call comes.
Below those thresholds, an established phone-and-email relationship with one or two firms - without a paid retainer - is often enough. The worst case is having no one to call and shopping during the incident; the second-worst is having a retainer with a firm that has never looked at your environment.
Communication during incident
The communications layer of IR is where most programs reveal their immaturity. The technical containment can go well and the public reception can still be terrible, or vice versa.
Internal
- The war room. A dedicated Slack / Teams channel (e.g. #inc-2026-05-17-credential-compromise) for the active incident. The IC, technical leads, comms, Legal, and executive sponsor live in it during the active phase. No side-channel discussion; everything in the channel is evidence.
- RACI for the incident. Incident Commander (one, rotating), Technical Lead, Communications Lead, Legal Lead, Executive Sponsor. Each role has a named human; if that human is unavailable, the runbook says who replaces them.
- Status updates on a cadence. Every 30 minutes during active response - even "no change" is a status. Drift to 60 minutes once containment is solid.
- Decision log. Major decisions (containment scope, who to notify, when to rotate keys, when to declare the active phase over) written down with the reasoning at the time. Saves the retro from "why did we do that?" amnesia.
External
- Customers. If their data is affected - even potentially - they get notified per the contract you signed with them and the law that applies to where they are. Vague language about "an investigation is ongoing" is acceptable initially; silence is not.
- Regulators. GDPR Article 33 requires notification to the supervisory authority within 72 hours of becoming aware of a personal-data breach. U.S. state laws vary (most 30-60 days, some require law-enforcement consultation first). HIPAA breach notification: without unreasonable delay, and no later than 60 days after discovery. SEC Form 8-K Item 1.05: four business days from materiality determination for public companies. NIS2, DORA, and sector regulators add their own clocks.
- Cyber insurance carrier. Notify per policy - often within 24-72 hours of awareness, sometimes before engaging counsel or a DFIR firm not on their panel.
- Law enforcement. FBI / Secret Service / NCSC / local cybercrime unit. Usually optional, sometimes valuable for nation-state attribution or extortion cases.
- Public. Press statement, status page update, blog post. The principle: tell the truth, tell it clearly, tell it before someone else does. Late or evasive comms are the durable reputational damage.
Your IR runbook should include a one-page summary of the regulatory clocks that apply to you - by data type, by jurisdiction, by sector. Your General Counsel should own it; your security team should know where to find it at 2am.
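A one-page clock summary is easier to trust at 2am if the deadlines are computed rather than counted on fingers. The following is a minimal sketch, not legal guidance: the `CLOCKS` table and the `deadline` helper are hypothetical names, the durations are the ones quoted on this page, and each clock is assumed to start from its own trigger (awareness, discovery, or materiality determination) that Legal supplies. Business-day arithmetic here skips weekends only and ignores holidays.

```python
from datetime import datetime, timedelta, timezone

# Illustrative clocks taken from this page; confirm each one with counsel.
# Each value is (unit, amount), where unit is "hours", "days", or "business_days".
CLOCKS = {
    "GDPR Art. 33": ("hours", 72),
    "NYDFS Part 500": ("hours", 72),
    "SEC 8-K Item 1.05": ("business_days", 4),
    "HIPAA": ("days", 60),
}


def add_business_days(start: datetime, days: int) -> datetime:
    """Advance by N weekdays (Mon-Fri). Holidays are ignored in this sketch."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday through Friday
            remaining -= 1
    return current


def deadline(clock: str, trigger: datetime) -> datetime:
    """Return the notification deadline for one clock, given its trigger time."""
    unit, amount = CLOCKS[clock]
    if unit == "hours":
        return trigger + timedelta(hours=amount)
    if unit == "days":
        return trigger + timedelta(days=amount)
    return add_business_days(trigger, amount)


if __name__ == "__main__":
    aware = datetime(2026, 5, 15, 9, 0, tzinfo=timezone.utc)  # a Friday
    for name in CLOCKS:
        print(f"{name}: notify by {deadline(name, aware).isoformat()}")
```

Keeping the trigger as an explicit argument matters because the clocks rarely start at the same moment: the SEC clock starts at the materiality determination, which can be days after the security team first became aware.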
Post-incident
The phase most programs underinvest in, and the one with the highest long-term return.
Blameless retro
The principle, borrowed from SRE: people made the best decisions they could with the information they had. Retros that hunt for fault produce defensive participants who hide details; blameless retros produce honest accounts that surface the systemic causes. Have a designated facilitator (not the IC, not the executive sponsor), follow a written agenda, and produce a written document with the timeline, the decisions, the lessons, and the action items.
Action item categories
- Control improvements. The actual prevention or limit that would have stopped or reduced this incident. Specific, owned, dated.
- Detection backfill. The alert that should have fired earlier. New SIEM rules, new GuardDuty / Defender / SCC enablements, new detection-engineering work. Track the move from "we noticed eventually" to "we noticed within an SLA."
- Runbook updates. Every place the runbook was wrong, missing, or required improvisation. Updated, reviewed, tested in the next tabletop.
- Tooling gaps. The capabilities you wished you had. Often: better evidence capture, better cross-account access for the IR team, better network isolation, better identity-side controls.
- Process gaps. The places the org-chart let you down. New on-call agreements, new RACI, new escalation paths.
The owner of action items is not Security alone. Detection backfill is Security; control improvements often live with platform / engineering teams; comms updates live with Legal and Marketing. The retro that produces action items only for Security is a retro that has misdiagnosed the incident.
AWS, Azure, and GCP side-by-side
The native IR-supporting capabilities each cloud ships, reduced to a one-screen reference:
| Capability | AWS | Azure | GCP |
|---|---|---|---|
| Activity audit log | CloudTrail (org trail, mgmt + data events) | Activity Log + Diagnostic Settings | Cloud Audit Logs (Admin + Data Access) |
| Network flow logs | VPC Flow Logs | NSG Flow Logs / VNet Flow Logs | VPC Flow Logs |
| Threat detection | GuardDuty (incl. EKS, S3, RDS, runtime) | Defender for Cloud (CSPM + CWPP) | Security Command Center (Premium / Enterprise) |
| Investigation / graph | Detective | Sentinel investigation graph | SCC Investigation, Google Security Operations (formerly Chronicle) |
| SIEM / SOAR | Security Lake + partner SIEM, Security Hub | Microsoft Sentinel (SIEM + SOAR) | Google Security Operations (formerly Chronicle) SIEM + SOAR |
| Disk evidence | EBS snapshot, cross-account share | Managed disk export, snapshot copy | Persistent disk snapshot, cross-project share |
| Memory acquisition | SSM Session Manager + LiME/AVML; no hypervisor-level | Azure Run Command + AVML/WinPmem; no hypervisor-level | OS Login / Cloud Shell + AVML; no hypervisor-level |
| Container runtime | GuardDuty EKS Runtime / ECS Fargate Runtime | Defender for Containers (eBPF) | GKE Threat Detection, Workload Vuln Scanning |
| Isolation primitive | Quarantine Security Group | Quarantine NSG / NIC isolation | Network tag + firewall rule |
| Session revocation | AWSRevokeOlderSessions inline policy | Entra ID refresh-token revocation | Workspace session reset, IAM key disable |
| IR-specific service | AWS Security Incident Response (CIRT-as-a-service) | Microsoft Incident Response (paid engagement) | Mandiant (Google Cloud) consulting + retainers |
| Vendor IR contact | AWS Customer Incident Response Team (CIRT) via Support | Microsoft Detection & Response Team (DART) | Google Cloud TAM / Security Response |
Native tools are necessary but not sufficient. None of the three clouds ships a complete forensics workbench; the specialized cloud-IR tooling in the next section fills the gap that all three leave open.
Specialized cloud-IR tooling
Beyond the native services, a category of tooling specifically targets cloud forensics and IR workflows:
- Cado Security - cloud-native investigation platform; automated evidence capture for AWS / Azure / GCP, containers, serverless. The closest thing to a turnkey cloud-forensics workbench.
- Mitiga - cloud IR-as-a-service, plus a platform for cloud-incident readiness and investigation.
- Magnet Axiom Cyber - broader DFIR suite with cloud collectors; strong on Microsoft 365, Google Workspace, and SaaS-side evidence.
- Mandiant Managed Defense - outsourced detection + IR; particularly valuable for orgs that can't staff 24/7 internally.
- Velociraptor - open-source endpoint visibility and digital-forensics tool; deploys agents to cloud VMs for query-based collection.
- GRR Rapid Response - Google's open-source remote live forensics framework; mature, scriptable, free.
- Timesketch - collaborative forensic timeline analysis tool. Plays well with Plaso for evidence ingestion.
- cloud-forensics-utils (Google) - Python library that automates disk-acquisition workflows across AWS, Azure, and GCP.
- AWS Incident Response Playbooks repo - reference playbooks, not tooling, but the structure many teams adopt.
- CISA playbooks & advisories - federal but generally useful; the Cybersecurity Incident & Vulnerability Response Playbooks are good baseline reading.
Maturity stages
A useful staging model for a cloud IR program:
Stage 1 - Reactive
An incident happens; the team scrambles. Logs are partially captured. Runbooks are tribal knowledge. The forensics work happens in the production account. Detection is mostly post-hoc - billing alerts, customer complaints, external researcher tips. Survives small incidents, breaks on big ones.
Stage 2 - Documented
Immutable log archive is live. Forensics account exists. The most common runbooks (compromised credential, exposed bucket, crypto-mining) are written down. On-call rotation defined. A DFIR phone-and-email relationship with one firm. First tabletop completed.
Stage 3 - Practiced
Quarterly tabletops with multiple scenarios. Annual full-day exercise with Legal and execs. DFIR retainer in place; firm has been through an environment-familiarization engagement. Detection engineering matures alongside IR; detections tied to MITRE ATT&CK Cloud. SLA-tracked time-to-detect, time-to-contain.
Stage 4 - Automated
High-confidence playbooks auto-execute: GuardDuty finding → snapshot the workload, apply quarantine SG, page the on-call, file a ticket - all in seconds. Continuous tabletop via purple-team exercises. Forensics-as-code; evidence collection is reproducible from a runbook commit hash. The IR program is a competitive advantage in enterprise sales conversations about security maturity.
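To make the Stage 4 flow concrete, here is a minimal sketch of the first two steps (snapshot, then quarantine) as they might run in a Lambda function triggered by a GuardDuty finding delivered through EventBridge. The function names, the severity threshold of 7.0, and the quarantine security group ID are assumptions for illustration; `ec2` is expected to be a boto3 EC2 client, and the sketch assumes an instance whose security groups can be replaced with a single `modify_instance_attribute` call. Paging and ticketing are left to whatever consumes the returned summary.

```python
def parse_finding(event: dict) -> dict:
    """Extract the fields the playbook needs from an EventBridge GuardDuty event."""
    detail = event["detail"]
    return {
        "finding_id": detail["id"],
        "severity": detail["severity"],
        "instance_id": detail["resource"]["instanceDetails"]["instanceId"],
    }


def contain_instance(ec2, instance_id: str, quarantine_sg_id: str, finding_id: str) -> list:
    """Snapshot every attached EBS volume, then swap in the quarantine SG."""
    reservations = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"]
    instance = reservations[0]["Instances"][0]

    snapshot_ids = []
    for mapping in instance.get("BlockDeviceMappings", []):
        snapshot = ec2.create_snapshot(
            VolumeId=mapping["Ebs"]["VolumeId"],
            Description=f"IR evidence {finding_id} {instance_id}",
        )
        snapshot_ids.append(snapshot["SnapshotId"])

    # Evidence first, isolation second: replace all SGs with the quarantine SG.
    ec2.modify_instance_attribute(InstanceId=instance_id, Groups=[quarantine_sg_id])
    return snapshot_ids


def run_playbook(event: dict, ec2, quarantine_sg_id: str, min_severity: float = 7.0) -> dict:
    """Auto-contain only findings at or above the (assumed) severity threshold."""
    finding = parse_finding(event)
    if finding["severity"] < min_severity:
        return {"action": "none", **finding}
    snapshots = contain_instance(
        ec2, finding["instance_id"], quarantine_sg_id, finding["finding_id"]
    )
    return {"action": "contained", "snapshots": snapshots, **finding}
```

Passing the client in as an argument keeps the playbook testable against a stub, which is what lets the same code be exercised in a tabletop before it is trusted to act on production.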
The skip-stage cost is real. An org trying to automate IR without runbooks is automating against nothing; the automation will be wrong, and people will distrust it. Sequence matters.
Common pitfalls
- No immutable log archive. The single most common - and most consequential - IR readiness failure. The attacker disables logging on day zero; without a tamper-resistant archive in a separate account, the investigation goes blind.
- Doing forensics in the production account. Contaminates evidence, exposes the IR team to whatever the attacker is still doing, and gives auditors and Legal heartburn. The dedicated forensics account is a precondition, not a nice-to-have.
- No forensics account at all. Adjacent to the above. "We'll figure it out when it happens" is a Stage 1 program; setting up the account takes about a week and pays for itself the first time it is used.
- Rotating credentials before capturing evidence. Destroys the live session state that's often the best evidence of what the attacker is doing right now. Disable first, capture, then rotate.
- No tabletop, ever. Runbooks that have never been practiced fail on the worst day. Quarterly is the minimum cadence that produces lasting muscle memory.
- Conflating containment with eradication. Quarantining the workload buys time; it doesn't end the incident. Eradication - finding and removing the attacker's full set of footholds - is the work that prevents the second incident an hour later.
- Forgetting identity-side persistence. An attacker who compromised an IdP admin can plant trust relationships, app consent grants, IdP-side users, and federation backdoors that survive every workload-side cleanup. Audit the IdP after every credential-compromise incident.
- "We'll buy a tool when we need it." The tool that needs to be deployed and configured under incident pressure won't be useful in that incident. Pre-stage the forensics AMI, the collector agents, the network rules.
- Underestimating regulatory clocks. GDPR's 72-hour window, SEC's four-business-day clock, state breach laws, sector regulators - these don't pause for your investigation. The runbook needs the clocks pre-mapped; the response needs Legal in the room from hour one.
- Skipping the retro. The team is exhausted; everyone wants the incident to be over; the retro slips. Six months later, the same incident class recurs because the action items were never identified. Retro within two weeks, no exceptions.
Further reading
Standards & frameworks
- NIST SP 800-61r2 - Computer Security Incident Handling Guide
- NIST SP 800-86 - Guide to Integrating Forensic Techniques into IR
- SANS - Incident Handler's Handbook (PICERL)
- MITRE ATT&CK Cloud Matrix
- CISA - Federal IR & Vulnerability Response Playbooks
Provider IR documentation
- AWS Security Incident Response Guide
- AWS Incident Response Playbooks (GitHub)
- Microsoft Incident Response documentation
- Microsoft Incident Response Playbooks
- Google Cloud - Incident Response Best Practices
- Google - Data Incident Response Process
Tooling
- Volatility 3 - memory forensics framework
- Velociraptor - endpoint visibility & live forensics
- GRR Rapid Response
- cloud-forensics-utils
- AVML - Linux memory acquisition
- Falco - runtime security for containers
Related CSOH pages
- Detection engineering - the discipline that produces the alerts IR investigates.
- Cloud SOC - the operating model around detection and response.
- Threat research - the threats your IR program is defending against.
- Breach kill chains - anatomies of real cloud incidents, end-to-end.
- Shared responsibility - where your IR scope ends and the provider's begins.
- Glossary - every term on this page, defined.
FAQ
How is cloud incident response different from on-prem IR?
Three big shifts. First, ephemerality - the instance, container, or function that you want to image may not exist by the time you reach for it; if you didn't snapshot disk and capture logs proactively, that evidence is gone. Second, the API is the attack surface - most cloud breaches are not "shell on a box" but "stolen credential calling APIs", so CloudTrail / Activity Log / Cloud Audit Logs are the primary evidence source, not memory dumps. Third, shared responsibility - the hypervisor, the physical network, the underlying storage hardware are not yours to investigate; you work with what the provider exposes, and you escalate to the provider's IR team when an incident crosses that line.
What's the single highest-leverage thing to do before an incident happens?
Build an immutable, cross-account log archive that an attacker who compromises your production environment cannot tamper with. In AWS, that's CloudTrail (org trail, all regions, management + data events) writing to an S3 bucket in a dedicated Log Archive account, with bucket policies and SCPs that prevent deletion or alteration. In Azure, it's Activity Log + Diagnostic Settings exported to a Log Analytics workspace and a locked storage account in a separate subscription. In GCP, it's organization-aggregated Cloud Audit Logs sinking to a retention-locked bucket in a dedicated logging project. Without this, the attacker rotates credentials, disables CloudTrail, deletes logs, and you investigate blind.
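As an illustration of the AWS half of that answer, the sketch below builds a service control policy that denies the most obvious evidence-destruction calls. It is a starting point under stated assumptions, not a complete guardrail: the function name and the bucket name are hypothetical, the action list is deliberately short, and a real policy would normally carve out a break-glass role with a condition on `aws:PrincipalArn`.

```python
import json


def evidence_protection_scp(log_bucket: str) -> dict:
    """Build an SCP document that blocks CloudTrail tampering and log deletion."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyCloudTrailTampering",
                "Effect": "Deny",
                "Action": [
                    "cloudtrail:StopLogging",
                    "cloudtrail:DeleteTrail",
                    "cloudtrail:UpdateTrail",
                    "cloudtrail:PutEventSelectors",
                ],
                "Resource": "*",
            },
            {
                "Sid": "DenyLogArchiveDeletion",
                "Effect": "Deny",
                "Action": [
                    "s3:DeleteBucket",
                    "s3:DeleteObject",
                    "s3:DeleteObjectVersion",
                    "s3:PutBucketPolicy",
                    "s3:DeleteBucketPolicy",
                    "s3:PutLifecycleConfiguration",
                ],
                # Bucket-level and object-level ARNs for the archive bucket.
                "Resource": [
                    f"arn:aws:s3:::{log_bucket}",
                    f"arn:aws:s3:::{log_bucket}/*",
                ],
            },
        ],
    }


if __name__ == "__main__":
    # "example-org-log-archive" is a placeholder bucket name.
    print(json.dumps(evidence_protection_scp("example-org-log-archive"), indent=2))
```

Generating the document from code rather than hand-editing JSON makes it straightforward to keep the policy in version control and to review changes to the deny list like any other change.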
Do I really need a separate forensics account?
Yes, and you need it provisioned before the incident, not during. A forensics account / subscription / project is a clean environment with pre-staged tooling (a forensics AMI or VM image, scripting, disk-imaging utilities, packet-capture tools, Velociraptor / GRR collectors), an isolated VPC with no route to production, IAM trust policies that allow the IR team to assume roles into compromised accounts in read-only or evidence-collection modes, and a budget. Doing forensics in the prod account contaminates evidence, exposes the IR team to whatever the attacker is still doing, and gives auditors and lawyers heartburn.
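The cross-account trust described above can be sketched as a trust policy on an evidence-collection role that lives in each workload account and can be assumed only by a named IR role in the forensics account. The helper name, role name, and account ID below are hypothetical; the permissions attached to the role (read-only or snapshot-only) are a separate policy and are not shown.

```python
import json
import re


def evidence_role_trust_policy(forensics_account_id: str, ir_role_name: str) -> dict:
    """Trust policy letting one IR role in the forensics account assume this role."""
    if not re.fullmatch(r"\d{12}", forensics_account_id):
        raise ValueError("AWS account IDs are 12 digits")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "AWS": f"arn:aws:iam::{forensics_account_id}:role/{ir_role_name}"
                },
                "Action": "sts:AssumeRole",
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder account ID and role name, for illustration only.
    print(json.dumps(evidence_role_trust_policy("111122223333", "ir-responder"), indent=2))
```

Trusting a specific role rather than the whole forensics account keeps the set of principals that can reach into a compromised account small and auditable.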
Should I rotate credentials immediately when I detect a compromise?
Not before you've captured what you need from the credential's audit trail and decided on a containment strategy. The instinct to "kill the access key" is right eventually, but rotating credentials prematurely (a) tips off the attacker, (b) destroys live session state that may be the best evidence of what's happening right now, and (c) can lock out legitimate workloads if the credential is shared. The order is: identify the compromised principal, query CloudTrail / Activity / Audit Logs for the credential's full activity history, snapshot any associated workloads, then rotate - and use AssumeRole session revocation (the AWSRevokeOlderSessions inline policy) to kill active sessions, not just the long-term credential.
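The session-revocation step at the end of that sequence can be sketched as follows. The policy shape mirrors what the IAM console attaches when you revoke active sessions for a role: a deny on everything for any token issued before a cutoff time. The function names are hypothetical, `iam` is assumed to be a boto3 IAM client, and the cutoff must be a timezone-aware datetime.

```python
import json
from datetime import datetime, timezone


def revoke_older_sessions_policy(cutoff: datetime) -> dict:
    """Deny all actions for role sessions whose token was issued before cutoff."""
    if cutoff.tzinfo is None:
        raise ValueError("cutoff must be timezone-aware")
    stamp = cutoff.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Deny",
                "Action": ["*"],
                "Resource": ["*"],
                "Condition": {"DateLessThan": {"aws:TokenIssueTime": stamp}},
            }
        ],
    }


def revoke_role_sessions(iam, role_name: str, cutoff: datetime) -> None:
    """Attach the deny policy inline, under the name the console uses."""
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName="AWSRevokeOlderSessions",
        PolicyDocument=json.dumps(revoke_older_sessions_policy(cutoff)),
    )
```

Sessions issued after the cutoff keep working, so legitimate workloads recover as soon as they re-assume the role, while any session the attacker minted earlier is denied on every call.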
How long do we have to disclose a breach?
It depends on jurisdiction and data type. GDPR requires notification to the supervisory authority within 72 hours of becoming aware of a personal-data breach, and to affected individuals "without undue delay" if the breach is likely to pose a high risk to them. U.S. state breach-notification laws vary (most 30-60 days, some shorter for specific data types). SEC Form 8-K Item 1.05 requires public companies to disclose material cybersecurity incidents within four business days of materiality determination. HIPAA requires breach notification without unreasonable delay and no later than 60 days after discovery. Sector-specific rules (NYDFS Part 500, FFIEC, NERC CIP, NIS2, DORA) add their own clocks. Your IR runbook should include a decision tree that maps incident type to the regulatory clocks that apply - and your General Counsel should be in the war room from the first hour.
Is memory forensics still relevant on cloud VMs?
Yes, for workloads whose in-memory state goes meaningfully beyond what's on disk - long-running services with in-memory state, JIT-compiled malware, fileless attacks that hide in process memory. The mechanics on cloud VMs work: LiME or AVML can capture RAM from a running Linux instance, MAGNET RAM Capture or similar from Windows. The constraint is that ephemeral cloud VMs may be gone by the time you reach for them, and capturing memory requires the instance to still be running (hypervisor-level live memory snapshots would remove that constraint, but AWS / Azure / GCP don't generally expose them to tenants). For containers and serverless, memory forensics is largely impractical; you fall back on runtime-security telemetry (Falco, Tracee, eBPF-based sensors) captured before the workload terminates.
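For the Linux case on AWS, one way to trigger an AVML capture without opening an interactive shell is an SSM Run Command invocation. The sketch below only builds the request; the helper name, the AVML path `/opt/ir/avml`, and the output directory `/var/ir` are assumptions (the binary is presumed to be pre-staged on the instance), and in practice you would snapshot the disk first, since writing the image locally changes disk state.

```python
import shlex


def avml_capture_request(
    instance_id: str,
    case_id: str,
    avml_path: str = "/opt/ir/avml",  # assumed pre-staged location
    out_dir: str = "/var/ir",         # assumed evidence directory
) -> dict:
    """Build keyword arguments for an SSM send_command call that runs AVML."""
    image = f"{out_dir}/{case_id}-{instance_id}.lime"
    q_image = shlex.quote(image)
    return {
        "InstanceIds": [instance_id],
        "DocumentName": "AWS-RunShellScript",
        "Comment": f"IR memory capture {case_id}",
        "Parameters": {
            "commands": [
                f"mkdir -p {shlex.quote(out_dir)}",
                # AVML takes the output file path as its positional argument.
                f"{shlex.quote(avml_path)} {q_image}",
                # Hash immediately so later copies can be verified.
                f"sha256sum {q_image} > {q_image}.sha256",
            ]
        },
    }


# Usage, assuming `ssm` is a boto3 SSM client:
#   ssm.send_command(**avml_capture_request("i-0abc", "case-42"))
```

Separating request construction from the API call means the exact commands that will run on the instance can be reviewed, logged in the decision record, and tested without touching a live system.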
When does a DFIR retainer pay off?
When you have meaningful cloud presence, sensitive data, and an internal IR team that's small enough to be overwhelmed by a serious incident. A retainer (Mandiant, CrowdStrike, Kroll, Unit 42, Arctic Wolf, and similar) buys you a contracted response SLA - typically 1-4 hours to first responder - and pre-negotiated rates instead of emergency premium pricing. It also forces you to do a tabletop with the firm in advance so they know your environment when the call comes in. The math works above roughly 100 employees, or earlier if you're in a regulated industry. Below that, a "best-efforts" relationship with one or two firms you've talked to in calm times is often enough.
Where next
- Detection engineering - the discipline that produces the alerts IR responds to.
- Cloud SOC - the operating model that wraps detection and IR together.
- Threat research - the attacker techniques your IR program defends against.
- Breach kill chains - real incidents, end-to-end, with timelines and lessons.
- Friday Zoom - IR readiness, tabletops, and post-incident retros come up regularly. Drop in.