Cloud SOC & Threat Monitoring

Detection and response when the network is abstracted away and the perimeter is API-shaped. Vendor-neutral guide to the log sources that actually matter, the native cloud detectors, SIEM and SOAR choices, detection engineering as a discipline, incident response specifics, and the tools cloud SOC teams actually use.

The 30-second version: A cloud SOC is the team and tooling responsible for detecting, investigating, and responding to security events across your cloud accounts. Where a traditional SOC watches packets and endpoints, a cloud SOC watches logs - control-plane audit trails (CloudTrail / Activity Log / Cloud Audit Logs), identity-provider logs, VPC flow, DNS queries, and application telemetry - because cloud abstracts the network.

The stack: native cloud detectors (GuardDuty, Defender for Cloud, Security Command Center) for quick wins, a SIEM for cross-source correlation and retention, detection engineering as the practice that writes the rules, SOAR / autonomous response for the repetitive work, and threat intel wired in to enrich alerts. The work is more like data engineering than like staring at an IDS console - and the mental shift is the hardest part of the transition.

On this page

  1. What cloud threat monitoring is
  2. Cloud SOC vs traditional SOC
  3. The cloud-native detection model
  4. Log sources you actually need
  5. Native cloud detection
  6. SIEM and the cloud
  7. Detection engineering as a practice
  8. Detection categories that matter
  9. Threat intel in cloud
  10. Incident response specifics
  11. SOC team structure & roles
  12. AWS, Azure, and GCP side-by-side
  13. Maturity stages
  14. Common pitfalls
  15. Further reading
  16. FAQ

What cloud threat monitoring is

Threat monitoring is the continuous collection of security-relevant signals from your environment, the rules and models that find suspicious activity in those signals, the alerts that surface what's worth a human's attention, and the analyst workflow that decides what to do about each one. Done well, it's the difference between "we got breached and learned about it from the FBI" and "we got an alert at 02:14, contained it by 02:31, and the customer never noticed."

Cloud threat monitoring is the same discipline applied to cloud environments - where the threat surface is API-shaped, the telemetry is log-shaped, and the dominant attack patterns chain identity compromise into resource manipulation rather than dropping a beacon on a Windows host.

The SOC (Security Operations Center) is the team that runs it. The name comes from the dedicated rooms full of screens of the early 2000s; the practice today is mostly distributed, mostly remote, mostly working in Slack and Jira with one or two big dashboards on the wall.

Cloud SOC vs traditional SOC

The mental-model shift when moving from a traditional SOC into cloud is the biggest hurdle for most analysts. It's also the one that takes the longest to internalize, because the muscle memory of network-centric monitoring is deep.

| Dimension | Traditional SOC | Cloud SOC |
|---|---|---|
| Primary signal | Packets, endpoint EDR telemetry, firewall logs | Control-plane audit logs, identity events, app telemetry |
| Perimeter shape | Network - choke points, DMZ, firewall rules | Identity - every API call has an actor and a permission check |
| Sensor placement | Span ports, taps, IDS sensors on the network | Logs come for free from the provider; you only choose what to keep |
| Asset inventory | Hard problem - devices walk in and out | API-queryable, but counted in tens of thousands of ephemeral resources |
| Forensics target | Disk image of a host, memory capture, packet pcap | Log timeline reconstructed from CloudTrail + identity logs + app logs |
| Contain action | Isolate host on the network | Revoke session, rotate credential, detach IAM policy, quarantine resource |
| Dominant attack pattern | Phish → endpoint → lateral → exfil | Phish or token theft → IAM → resource manipulation → exfil |
| Required skillset | Packet analysis, EDR queries, malware reversing | Query languages (KQL, SPL, ESQL), cloud API knowledge, IAM understanding |

The skills overlap is real - incident response judgment, alert triage discipline, the ability to read a log line and understand what an attacker is doing - but the specific knowledge is different enough that "I ran a SOC for 10 years" doesn't translate to "I can run a cloud SOC" without 6-12 months of cloud-specific learning. The reverse is also true.

The cloud-native detection model

Almost every credible cloud detection eventually reduces to the same shape: this identity did this action on this resource from this context at this time, and that combination is suspicious.

The fields that matter:

Detections then ask combination questions:

The vocabulary maps cleanly onto MITRE ATT&CK Cloud - every detection should map to one or more techniques, both for shared language across the team and for coverage analysis ("we have no detections in the Persistence column").
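
The "identity, action, resource, context, time" shape can be sketched in a few lines of Python. This is a minimal illustration, not a production rule: the `CloudEvent` class, the `KNOWN_PREFIXES` baseline, and the choice of `SENSITIVE_ACTIONS` are all assumptions made for the example (the action names themselves are real AWS API calls).

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class CloudEvent:
    identity: str     # who acted, e.g. a user or role name
    action: str       # API call name
    resource: str     # target of the call
    source_ip: str    # one piece of "context"
    time: datetime    # when it happened (UTC)


# Hypothetical baseline: network prefixes each identity normally calls from.
KNOWN_PREFIXES = {"ci-deployer": ("10.0.",)}

# Illustrative subset of actions worth extra scrutiny.
SENSITIVE_ACTIONS = {"CreateAccessKey", "AttachUserPolicy", "PutBucketPolicy"}


def is_suspicious(event: CloudEvent) -> bool:
    """Flag a sensitive action made from a network the identity does not normally use."""
    if event.action not in SENSITIVE_ACTIONS:
        return False
    # Identities with no baseline get an empty tuple, so every source IP counts as unusual.
    prefixes = KNOWN_PREFIXES.get(event.identity, ())
    return not event.source_ip.startswith(prefixes)


event = CloudEvent(
    identity="ci-deployer",
    action="CreateAccessKey",
    resource="user/alice",
    source_ip="203.0.113.7",
    time=datetime(2026, 1, 5, 2, 14, tzinfo=timezone.utc),
)
print(is_suspicious(event))
```

Neither the action nor the source IP is alarming on its own; the rule fires only on the combination, which is the point of the model.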

Log sources you actually need

The full list of logs each cloud can emit is enormous. The list you actually need on day one is smaller. Working priority order:

1. Control-plane audit

CloudTrail (AWS), Activity Log (Azure), Cloud Audit Logs (GCP). Every API call, every account, every region. The single most important log. Org-wide, multi-region, no exceptions.

2. Identity-provider logs

Entra ID sign-in & audit, Okta system log, Cognito events, Cloud Identity. Captures auth events before they reach cloud APIs - phishing, MFA bypass, suspicious sessions live here.

3. Network telemetry

VPC Flow Logs, NSG flow logs, GCP VPC Flow. Less interesting than CloudTrail for most cloud-native attacks, essential when a workload starts talking to known-bad IPs.

4. DNS query logs

Route 53 Resolver logs, Azure DNS, Cloud DNS. C2 beacons, DGA traffic, and data exfiltration via DNS are all visible here and invisible in flow logs.

5. Data-plane logs

S3 access logs / data events, Azure Storage diagnostics, GCS data access logs. Required to investigate "what data was actually read?" - control-plane logs show changes to a bucket and its policies, not which objects were read.

6. Workload & app logs

Lambda / Functions logs, container logs, app stdout / stderr. The application-layer evidence trail; required for app-vulnerability investigations.

7. Configuration history

AWS Config, Azure Resource Graph history, Cloud Asset Inventory. Answers "what did this resource look like before the attacker changed it?" without re-running detections.

8. SaaS & productivity

Workspace / M365 audit logs, GitHub audit log, Slack, etc. Auth events outside the cloud's own log feed; often the first signal of a compromised user.

Retention

Practical floor: 90 days hot in the SIEM (you query directly), 365+ days cold in cheap object storage (you re-hydrate for investigations). CIS / compliance frameworks often require longer cold retention - set policies once at the bucket / storage layer rather than per-source.
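
As a rough sketch of the hot / cold split, the function below classifies a log record by age. The 90- and 365-day thresholds come from the floor above; treating anything older than the cold window as "expired" is an assumption for illustration, since many compliance regimes require keeping logs longer.

```python
from datetime import date

HOT_DAYS = 90    # queryable directly in the SIEM
COLD_DAYS = 365  # minimum kept in object storage; raise to match your compliance needs


def retention_tier(log_date: date, today: date) -> str:
    """Return which storage tier a log from `log_date` belongs to as of `today`."""
    age = (today - log_date).days
    if age <= HOT_DAYS:
        return "hot"
    if age <= COLD_DAYS:
        return "cold"
    return "expired"


today = date(2026, 3, 1)
for d in (date(2026, 1, 1), date(2025, 6, 1), date(2024, 1, 1)):
    print(d, retention_tier(d, today))
```

In practice the same thresholds would live in a storage lifecycle policy rather than in application code; the function just makes the boundaries explicit.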

Native cloud detection

Each cloud ships its own threat-detection service. They share more than they differ: each is managed, ML-driven, fed by the cloud's own internal telemetry, and delivers findings to a central console with minimal setup. The pragmatic answer is: turn them on, all of them, on day one. They catch a meaningful percentage of real attacks for very little operational cost.

AWS GuardDuty

Continuous threat detection across CloudTrail, VPC Flow Logs, DNS, S3, EKS audit, Lambda, Runtime Monitoring, and Malware Protection. The UnauthorizedAccess:IAMUser/InstanceCredentialExfiltration finding catches the exact pattern that produced the Capital One breach (see the kill chain). Org-wide via delegated admin to your audit account.

Microsoft Defender for Cloud

CSPM + workload protection, with plans for servers, app services, databases, storage, key vault, DNS, containers, APIs. Recommendations + alerts; integrates with Sentinel. The "everything" plan is expensive at scale; most orgs enable the workload-protection plans on production subscriptions only.

Google Security Command Center

SCC Premium / Enterprise includes Event Threat Detection (analysis of Cloud Audit Logs, GCP's CloudTrail equivalent), Container Threat Detection, Virtual Machine Threat Detection, and a CNAPP-style posture layer. Standard tier is free and covers basic findings; Premium / Enterprise are paid.

What native detection is good for

What native detection is bad at

The right pattern: native detectors enabled everywhere as first-line; SIEM as the aggregation, correlation, and customization layer on top.

SIEM and the cloud

SIEM = Security Information & Event Management. The platform where logs land, get parsed, get queried, get correlated, and produce alerts. Once you have more than one cloud (or one cloud and SaaS, or one cloud and on-prem), the SIEM becomes the only place an analyst can ask cross-source questions.
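
A cross-source question usually looks like a join on identity and time. The sketch below pairs risky identity-provider sign-ins with cloud API calls made by the same user shortly afterwards. The record shapes (`user`, `time`, `risky`, `action`) are a hypothetical normalized schema, not any vendor's format; a real SIEM would express the same join in its own query language.

```python
from datetime import datetime, timedelta


def correlate(signins, api_calls, window=timedelta(minutes=30)):
    """Pair each risky sign-in with API calls by the same user within `window` after it."""
    hits = []
    for s in signins:
        if not s["risky"]:
            continue
        for c in api_calls:
            delta = c["time"] - s["time"]
            if c["user"] == s["user"] and timedelta(0) <= delta <= window:
                hits.append((s["user"], c["action"], delta))
    return hits


signins = [
    {"user": "alice", "time": datetime(2026, 1, 5, 2, 14), "risky": True},
    {"user": "bob", "time": datetime(2026, 1, 5, 2, 0), "risky": False},
]
api_calls = [
    {"user": "alice", "time": datetime(2026, 1, 5, 2, 20), "action": "CreateAccessKey"},
    {"user": "alice", "time": datetime(2026, 1, 5, 4, 0), "action": "ListBuckets"},
    {"user": "bob", "time": datetime(2026, 1, 5, 2, 5), "action": "DeleteTrail"},
]
hits = correlate(signins, api_calls)
print(hits)
```

Neither log source can answer this on its own: the identity provider knows the sign-in was risky, and the control-plane log knows what the session did next.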

The current landscape (vendor-neutral)

How to pick

What's not a great differentiator anymore: "ingests CloudTrail." They all do.

Detection engineering as a practice

Detection engineering is the discipline of writing, testing, deploying, and maintaining detection rules - treated as code, with the same review, versioning, and CI/CD as application code. Cloud SOC depends on it more than traditional SOC did, because so many cloud-specific detections are inherently custom (your IAM model, your service accounts, your business hours).

The detection-engineering loop

  1. Threat model the technique you want to catch. Where in MITRE ATT&CK does it sit? What's the prerequisite for an attacker to do it?
  2. Find or generate evidence in your logs that the technique would leave. Atomic Red Team, Stratus Red Team, and similar tools simulate cloud attacks specifically so you can see what they look like in CloudTrail.
  3. Write the detection - a query in your SIEM's language or a portable rule (Sigma).
  4. Tune for noise. The first iteration always over-fires. Whitelist legitimate sources, add context conditions, adjust thresholds.
  5. Test continuously. Re-run the atomic emulation periodically - when an upstream log schema changes, your rule may silently break.
  6. Measure efficacy. True positive rate, false positive rate, time-to-alert. Detection content that fires once a quarter and is always a true positive is good; detection content that fires 50 times a day and is always benign is worse than no detection.
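
Step 6 can start from nothing more than analyst dispositions on closed alerts. The sketch below computes alert volume and precision (the share of alerts that were true positives) per rule. The rule names and the `"tp"` / `"fp"` verdict labels are assumptions for illustration. A true detection rate would also need false negatives, which only emulation or incident review can supply.

```python
from collections import Counter


def efficacy(dispositions):
    """Summarize (rule_name, verdict) pairs, where verdict is "tp" or "fp"."""
    counts = {}
    for rule, verdict in dispositions:
        counts.setdefault(rule, Counter())[verdict] += 1
    report = {}
    for rule, c in counts.items():
        total = c["tp"] + c["fp"]
        report[rule] = {
            "alerts": total,
            "precision": c["tp"] / total if total else 0.0,
        }
    return report


dispositions = [
    ("root-console-login", "tp"),
    ("unusual-region-api-call", "fp"),
    ("unusual-region-api-call", "fp"),
    ("unusual-region-api-call", "tp"),
]
report = efficacy(dispositions)
for rule, stats in report.items():
    print(rule, stats)
```

Sorting this report by alert count and then by precision is a quick way to find the rules that need tuning first.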

Rule sources

Treat all of these as starting points. Every detection needs tuning to your environment before it earns a place in production.

Detection categories that matter

The classes of detection every cloud SOC should aim to cover, mapped roughly to MITRE ATT&CK Cloud:

For each category, the SOC should have at least one detection live, at least one quarterly emulation that exercises it, and at least one documented runbook. The first time a SOC enumerates "we have zero coverage in Defense Evasion" is usually the day after they needed it.
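
Coverage gaps are easy to compute once every detection carries its ATT&CK tactic tags. A minimal sketch, assuming a hypothetical detection inventory held as dictionaries; the `TACTICS` list is an illustrative subset of ATT&CK tactic names, not the full matrix.

```python
# Illustrative subset of MITRE ATT&CK tactics; extend to the full matrix in practice.
TACTICS = [
    "Initial Access",
    "Persistence",
    "Privilege Escalation",
    "Defense Evasion",
    "Credential Access",
    "Discovery",
    "Exfiltration",
    "Impact",
]


def coverage_gaps(detections, tactics=TACTICS):
    """Return the tactics that no detection in the inventory maps to."""
    covered = {t for d in detections for t in d["tactics"]}
    return [t for t in tactics if t not in covered]


# Hypothetical detection inventory.
detections = [
    {"name": "root-console-login", "tactics": ["Initial Access"]},
    {"name": "access-key-created-for-other-user", "tactics": ["Persistence", "Privilege Escalation"]},
    {"name": "audit-logging-stopped", "tactics": ["Defense Evasion"]},
]
gaps = coverage_gaps(detections)
print(gaps)
```

Running a check like this in CI against the detection repository turns "zero coverage in a column" into a failing build rather than a post-incident discovery.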

Threat intel in cloud

Cloud-specific threat intel is younger than network threat intel and the data quality varies. The actionable pieces:

Where to wire it in

Incident response specifics

Cloud IR differs from on-prem IR in two practical ways: the evidence is log-based (no disk image), and containment actions are API calls.

The cloud-IR playbook shape

  1. Triage the alert. Identity, action, resource, context. Decide: noise, suspicious-not-yet-confirmed, confirmed-compromise.
  2. Pivot from one event to the full session. What else did this identity do? Same source IP? Same session token? Query CloudTrail filtered on the identity's principal ARN, the time window, and the identity-provider session ID.
  3. Determine blast radius. What resources did the compromised identity have access to? What of those were actually accessed? IAM Access Analyzer / Permissions Analyzer + CloudTrail data events.
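  4. Contain. Cut off the attacker's credentials. In AWS, deactivate or delete access keys (aws iam update-access-key with --status Inactive, or aws iam delete-access-key) and invalidate active role sessions by attaching a deny policy conditioned on aws:TokenIssueTime, which is what the console's "Revoke active sessions" action does; AWS has no single revoke-session CLI command. In Azure, revoke the user's sign-in sessions in Entra ID. In GCP, gcloud iam service-accounts disable. Detach IAM policies. Quarantine compromised workloads (security-group-of-one). Disable user accounts.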
  5. Eradicate. Remove persistence - newly created keys, new IAM users, new SSH keys, modified Lambda functions, new federated IdPs.
  6. Recover. Restore from known-clean snapshots if data integrity is in question. Rotate the secrets the attacker may have seen.
  7. Post-incident. Timeline, root cause, what detection missed it, what would have detected it earlier, what's the runbook change.
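
Steps 2 and 4 can be sketched against exported CloudTrail records. The field names used (`userIdentity.arn`, `eventTime`, `eventName`) are standard CloudTrail record fields; the sample records, the account ID, and the helper names are made up for illustration. The policy builder produces the deny-by-token-issue-time document commonly used to invalidate older role sessions in AWS; attaching it to a role is left out, and it should be read as a sketch rather than a drop-in runbook step.

```python
import json
from datetime import datetime, timezone

TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # CloudTrail eventTime is ISO 8601 in UTC


def parse_time(value: str) -> datetime:
    return datetime.strptime(value, TIME_FORMAT).replace(tzinfo=timezone.utc)


def pivot_session(records, principal_arn, start, end):
    """Return one principal's events inside a time window, oldest first."""
    hits = [
        r for r in records
        if r.get("userIdentity", {}).get("arn") == principal_arn
        and start <= parse_time(r["eventTime"]) <= end
    ]
    return sorted(hits, key=lambda r: r["eventTime"])


def revoke_older_sessions_policy(cutoff: datetime) -> str:
    """Build a policy that denies everything to tokens issued before `cutoff`."""
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Deny",
            "Action": "*",
            "Resource": "*",
            "Condition": {
                "DateLessThan": {"aws:TokenIssueTime": cutoff.strftime(TIME_FORMAT)}
            },
        }],
    }, indent=2)


ALICE = "arn:aws:iam::111122223333:user/alice"  # hypothetical principal
BOB = "arn:aws:iam::111122223333:user/bob"
records = [
    {"eventTime": "2026-01-05T02:20:00Z", "eventName": "CreateAccessKey", "userIdentity": {"arn": ALICE}},
    {"eventTime": "2026-01-05T02:14:00Z", "eventName": "ConsoleLogin", "userIdentity": {"arn": ALICE}},
    {"eventTime": "2026-01-05T02:16:00Z", "eventName": "ListBuckets", "userIdentity": {"arn": BOB}},
    {"eventTime": "2026-01-06T09:00:00Z", "eventName": "GetObject", "userIdentity": {"arn": ALICE}},
]
session = pivot_session(
    records, ALICE,
    start=parse_time("2026-01-05T02:00:00Z"),
    end=parse_time("2026-01-05T03:00:00Z"),
)
print([r["eventName"] for r in session])
policy = revoke_older_sessions_policy(parse_time("2026-01-05T02:31:00Z"))
print(policy)
```

The same pivot is normally run as a query in the SIEM or in Athena; doing it over exported JSON is mainly useful when the investigation has to work from cold storage.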

Tools of the cloud-IR trade

SOC team structure & roles

The classic "tier 1 / tier 2 / tier 3" structure persists but is shifting. AI-assisted triage is eating tier-1 work faster than other layers, which is pushing organizations toward flatter SOC structures with stronger detection-engineering and IR capabilities at the senior end.

The roles that actually exist in 2026

For more on these roles and how to break into them, see the Cloud Security Careers page.

AWS, Azure, and GCP side-by-side

The native detection + logging story on each cloud, reduced to a one-screen reference:

| Building block | AWS | Azure | GCP |
|---|---|---|---|
| Control-plane audit | CloudTrail (org trail) | Activity Log (subscription / mgmt group) | Cloud Audit Logs (org-aggregated) |
| Identity events | IAM Identity Center, federated IdP | Entra ID sign-in & audit logs | Cloud Identity audit, Cloud Logging |
| Network flow | VPC Flow Logs | NSG flow logs, VNet flow logs | VPC Flow Logs |
| DNS | Route 53 Resolver query logs | Azure DNS analytics | Cloud DNS query logs |
| Native threat detection | GuardDuty (org-wide, all features) | Defender for Cloud (workload plans) | SCC Premium / Enterprise (Event & Container Threat Detection) |
| Native SIEM | Security Lake + Athena, OpenSearch | Microsoft Sentinel | Google Security Operations (Chronicle) |
| SOAR / response automation | Security Hub automation rules, Step Functions, EventBridge → Lambda | Sentinel Logic Apps, Automation Rules | SecOps SOAR (formerly Siemplify) |
| Posture / inventory | Security Hub, AWS Config | Defender for Cloud, Azure Resource Graph | SCC, Cloud Asset Inventory |
| IR workflow | AWS Security Incident Response (managed) | Sentinel Incidents, Defender XDR | SCC Cases |

Cross-cloud reality: most SOCs end up running one SIEM across all their clouds (Splunk, Sentinel, Chronicle, Elastic, CrowdStrike, Datadog) as the unified plane, with the native cloud detectors enabled below it as sensors. A pure single-cloud native SIEM works for organizations genuinely living in one cloud; that's rarer than it sounds.

Maturity stages

SOC capability grows over time. A useful staging model:

Stage 1 - Visibility

Native detection enabled in every account/sub/project. Control-plane logs aggregated to a central destination. 90-day retention live. Alerts route to a defined channel. Coverage measured against MITRE ATT&CK Cloud.

Stage 2 - Triage

SIEM stood up, identity + network + DNS logs flowing. Defined alert taxonomy. Tier-1 analysts (or LLM-assisted automation) triaging within minutes. Documented runbooks for the top 20 alert types. MTTR measured.

Stage 3 - Engineering

Detection-as-code with CI/CD and tests. Adversary emulation (Stratus, Atomic Red Team) running monthly. SOAR automating the repetitive contain/enrich steps. Threat hunting on a regular cadence. Cross-cloud correlation working.

Stage 4 - Resilience

Tabletop exercises quarterly, full red-team / purple-team annually. SLO-driven SOC metrics. Threat intel feeding both detection and proactive hardening. Post-incident reviews drive durable changes in the platform.

Skipping stages is expensive in the same way it is for landing zones - a team trying to stand up SOAR before they have working alert triage just automates noise. The honest sequencing matters.

Common pitfalls

Further reading

Frameworks & guidance

Detection content sources

Vendor docs

Related CSOH pages

FAQ

How big does a cloud SOC need to be?

Smaller than most people assume, if it's well-tooled. A 2026 cloud SOC running native detection + a modern SIEM with good content + LLM-assisted triage can credibly cover a mid-size enterprise with 5-8 humans. Larger orgs scale the detection-engineering and IR functions more than tier-1 triage. The traditional "20-seat tier-1 room" is not the right model anymore.

Should the SOC sit inside platform / SRE or separate?

Both work. The orgs that ship cloud security well have SOC and platform close to each other - same chat channel at minimum, often the same on-call rotation. If they're separate, the friction shows up as "we asked for that log to flow last quarter" delays.

Is MDR (managed detection & response) a substitute for an in-house SOC?

It's a credible complement, not a substitute. MDR handles 24x7 monitoring and tier-1 triage; you still need someone in-house who owns detection engineering, IR judgment, and the relationship with the rest of the security org. Pure-MDR with no internal expertise is a known anti-pattern.

What about open-source SIEMs?

Wazuh, OpenSearch, Loki + Grafana - credible options for cost-sensitive shops. Detection content, ML correlation, and operational tooling are typically less mature than commercial SIEMs. Best fit when you have engineers happy to build and maintain the missing parts.

How do you measure if the SOC is working?

MTTD (detection), MTTA (acknowledge), MTTR (respond), MTTC (contain). Coverage against MITRE ATT&CK Cloud. False-positive rate per detection. Number of true-positive incidents detected by the SOC vs reported externally. Post-incident-review follow-through rate. None of these alone tells the story; together they do.
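
The time-based metrics fall out of four timestamps per incident. A minimal sketch, assuming a hypothetical incident record with `occurred`, `detected`, `acknowledged`, and `contained` fields. Definitions vary between teams; here MTTD runs from occurrence to detection, while MTTA and MTTC run from detection.

```python
from datetime import datetime
from statistics import mean


def mean_minutes(incidents, start_key, end_key):
    """Average elapsed minutes between two timestamps across incidents."""
    return mean(
        (i[end_key] - i[start_key]).total_seconds() / 60 for i in incidents
    )


# Hypothetical incident records.
incidents = [
    {
        "occurred": datetime(2026, 1, 5, 2, 5),
        "detected": datetime(2026, 1, 5, 2, 14),
        "acknowledged": datetime(2026, 1, 5, 2, 18),
        "contained": datetime(2026, 1, 5, 2, 31),
    },
    {
        "occurred": datetime(2026, 2, 9, 10, 0),
        "detected": datetime(2026, 2, 9, 10, 30),
        "acknowledged": datetime(2026, 2, 9, 10, 40),
        "contained": datetime(2026, 2, 9, 11, 15),
    },
]
mttd = mean_minutes(incidents, "occurred", "detected")
mtta = mean_minutes(incidents, "detected", "acknowledged")
mttc = mean_minutes(incidents, "detected", "contained")
print(f"MTTD {mttd:.1f} min, MTTA {mtta:.1f} min, MTTC {mttc:.1f} min")
```

Means hide outliers, so it is worth tracking a median or 90th percentile alongside them once there are enough incidents to make that meaningful.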

How does this relate to zero trust?

Zero trust says "assume breach and verify continuously." The SOC is the verification function - the team that closes the loop on the assumption. Without monitoring, zero-trust deployments are unverified claims; with monitoring, the principle becomes operationally measurable.
