The 30-second version: Detection engineering is the discipline of building, testing, deploying, and retiring the rules that decide what an alert is. It's the build side; the Cloud SOC page covers the consume side. The job is mostly editing rule files in a Git repo, replaying historical logs against changes, and confirming the rule fires when the matching technique runs in a test environment.
In cloud the work hinges on three things: knowing what each platform emits (CloudTrail, Activity Log, Cloud Audit Logs, and the data-plane stream that's often off by default); writing rules portably (Sigma as the canonical format, compiled to the SIEM's native query language); and validating against tools like Stratus Red Team that emulate real cloud attacker techniques mapped to MITRE ATT&CK Cloud.
On this page
- What detection engineering is
- The detection engineering lifecycle
- Cloud logging fundamentals
- The data-access log gotcha
- Sigma - the lingua franca
- Vendor detection languages
- MITRE ATT&CK Cloud Matrix
- Detection-as-code workflow
- Native threat-detection services
- SIEM vs data lake vs XDR
- Log retention & cost
- Signal sources beyond audit logs
- Building a detection - walkthrough
- Tuning & noise reduction
- Validation & purple teaming
- AWS / Azure / GCP side-by-side
- Maturity stages
- Common pitfalls
- Further reading
- FAQ
What detection engineering is
Detection engineering is the discipline that produces, tests, and maintains the rules a SOC depends on. It is not the same as SOC analysis. The analyst sits in front of a queue of alerts and decides whether each one represents real malicious activity. The detection engineer sits in front of a code repo and a sample-log corpus and decides what the rule should match in the first place - and what it should ignore.
The two roles think differently. A SOC analyst's currency is triage minutes per alert and the false-positive rate of their queue; a detection engineer's currency is rules-deployed-per-quarter, mean time from threat-research finding to detection in production, and ATT&CK technique coverage. They cooperate constantly - the analyst's "I'm seeing this pattern again" feeds the engineer's backlog, and the engineer's new rule lands in the analyst's queue.
The reason the role exists separately is that modern detection looks much more like software engineering than like operations. Rules live in Git. They have unit tests. They go through code review. They deploy via CI/CD to one or more SIEM or data-lake backends. They have version history, owners, ATT&CK mappings, severity ratings, and an explicit retirement criterion. Programs that treat detection as a side-of-desk activity for the SOC team accumulate stale, untested rules and never close the coverage gaps that matter.
This page is the practitioner's view of the build side. For the consume side - alert triage, SOC structure, incident playbooks - see the Cloud SOC page.
The detection engineering lifecycle
The widely-cited model is Palantir's: research → develop → tune → deploy → validate → retire. Each phase has explicit inputs, outputs, and exit criteria. The lifecycle isn't ceremony - it's the only thing that keeps a rule library from becoming a graveyard.
1. Research
Start with a threat - an ATT&CK technique, a CTI report, a vendor advisory, an internal red-team finding, or a real incident at your own organization or a peer's. Understand the technique end-to-end: what API calls it generates, what log fields are diagnostic, what benign activity looks similar. Output: a written hypothesis of "we should be able to detect X by looking for Y in Z."
2. Develop
Write the rule. Canonical form in Sigma where possible, or directly in the backend's query language. Include metadata - ATT&CK technique IDs, severity, owner, references. Run the rule against a corpus of historical logs and a corpus of test attack telemetry. Iterate. Output: a rule file in the Git repo with passing unit tests.
3. Tune
Run the rule against 7-30 days of historical production logs (a "backtest"). Measure the false-positive rate, identify benign sources, add suppressions or refine the logic. The acceptance bar is the analyst's tolerance - a rule that fires 50 times a day is dead on arrival unless every fire is an emergency. Output: a backtest report and a tuned rule.
4. Deploy
Merge to main; CI/CD pushes the rule to the SIEM / data-lake backend(s) via API. Stage in a "test" or "low-severity" mode for 1-2 weeks; promote to alerting only after the false-positive rate holds. Output: an active production rule with a documented owner and an SLA-tagged severity.
5. Validate
Execute the matching technique with Stratus Red Team (cloud) or Atomic Red Team (general). Confirm the rule fires, the alert reaches the right queue, and the metadata is intact. Re-run on a schedule (monthly or quarterly). Output: a validation log entry tied to the rule.
6. Retire
Every rule needs a retirement criterion. Typical triggers: the underlying API is deprecated, the technique no longer applies, the false-positive rate has crept past what analysts will tolerate, or a better rule supersedes it. Without an explicit retirement step, dead rules accumulate and the analyst queue silently degrades. Output: an archived rule with a stated reason.
Two things distinguish mature programs from immature ones in this lifecycle. First, every step produces an artifact that lives in version control - the hypothesis, the backtest, the validation result. Second, the loop closes: a missed-detection incident in production triggers a research item, not just a postmortem action that quietly never ships.
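The "every step produces an artifact" discipline can be enforced mechanically. The following is a minimal sketch, not a reference implementation: the `RuleRecord` class, the artifact paths, and the backward edges in `ALLOWED` (tune back to develop, validate back to research) are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    RESEARCH = "research"
    DEVELOP = "develop"
    TUNE = "tune"
    DEPLOY = "deploy"
    VALIDATE = "validate"
    RETIRE = "retire"


# Forward path plus assumed loops: tuning can send a rule back to development,
# validation repeats on a schedule, and a failed validation reopens research.
ALLOWED = {
    Stage.RESEARCH: {Stage.DEVELOP},
    Stage.DEVELOP: {Stage.TUNE},
    Stage.TUNE: {Stage.DEVELOP, Stage.DEPLOY},
    Stage.DEPLOY: {Stage.VALIDATE},
    Stage.VALIDATE: {Stage.VALIDATE, Stage.RESEARCH, Stage.RETIRE},
    Stage.RETIRE: set(),
}


@dataclass
class RuleRecord:
    rule_id: str
    stage: Stage = Stage.RESEARCH
    artifacts: dict = field(default_factory=dict)  # completed stage -> artifact path

    def advance(self, to: Stage, artifact: str) -> None:
        """Leave the current stage, refusing skipped steps and missing artifacts."""
        if to not in ALLOWED[self.stage]:
            raise ValueError(f"{self.stage.value} -> {to.value} is not allowed")
        if not artifact:
            raise ValueError("every stage transition must record an artifact")
        self.artifacts[self.stage.value] = artifact
        self.stage = to
```

A CI check built on something like this would reject a PR that moves a rule to deploy without a backtest report on file.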
Cloud logging fundamentals
You can't detect what you can't see. Every detection rule starts from a log source; every log source has costs, gotchas, and a default-on or default-off state you need to know. The catalog below is the practitioner's working set for each major cloud.
AWS
- CloudTrail - management events. The control-plane audit log. Records control-plane API calls against AWS services: CreateUser, AssumeRole, PutBucketPolicy, StartInstances. The console's Event history shows the last 90 days by default; you must create an organization trail to ship the events to S3 and aggregate across accounts. This is the single most important AWS log source.
- CloudTrail - data events. The data-plane audit log: GetObject on S3, Invoke on Lambda, queries on DynamoDB. Off by default and per-resource scoped. Charged per event. Without it, you cannot see who read sensitive data - only who changed its permissions.
- CloudTrail - Insights events. Anomaly detection on the management-event stream. Surfaces unusual write-API rates and error rates. Useful as a cheap "something is up" signal.
- CloudTrail Lake. A managed event data store on top of CloudTrail with SQL query, retention controls, and federation. Reduces the need to ship trails into a separate SIEM for the simplest cases.
- VPC Flow Logs. Network metadata (5-tuple + accept/reject) at the ENI, subnet, or VPC level. Choose v5 with custom fields for the useful ones (TCP flags, traffic path, region). The default maximum aggregation interval is 10 minutes; set it to 1 minute when detections or forensics need finer time resolution.
- Route 53 query logs. DNS resolution events from your VPCs. Crucial for C2 / DGA detection.
- S3 server access logs / S3 access logs via CloudTrail data events. Two ways to get S3 read/write logs; CloudTrail data events is structured JSON and is the modern choice. S3 server access logs are legacy text format.
- CloudWatch Logs. The umbrella destination for application logs, Lambda execution logs, EKS control plane logs, and many service-specific streams. Subscriptions to Kinesis Firehose / S3 / SIEM are the standard egress pattern.
- GuardDuty findings. Managed threat-detection findings as a structured stream - see the native section below.
- Security Hub findings. Aggregated findings from GuardDuty, Inspector, Macie, IAM Access Analyzer, Config Rules, and third-party CSPMs, in a normalized ASFF (AWS Security Finding Format) shape.
- IAM Identity Center / IAM Access Analyzer. Sign-in events, permission analyses, last-used reports. Identity-side detection signal.
Azure
- Activity Log. The Azure control-plane audit log at the subscription / management-group level. Equivalent of CloudTrail management events. Configure Diagnostic Settings to ship to Log Analytics / Event Hub / storage.
- Entra ID Sign-in Logs. Every authentication against an Entra-protected resource: user, app, location, conditional-access result. The single most-queried log in any Microsoft environment.
- Entra ID Audit Logs. Directory-change events - user creation, group membership, role assignment, conditional-access policy edits. Pair with Sign-in for identity-side detection.
- Diagnostic Settings (per-resource). The mechanism by which resource-specific logs leave the resource and go somewhere queryable. Many high-value logs (Key Vault audit, Storage data-plane, SQL audit) are off by default until a Diagnostic Setting is configured.
- NSG Flow Logs / VNet Flow Logs (v2). Network metadata at the network-security-group or vnet level. VNet flow logs are the newer, recommended option.
- Microsoft Defender for Cloud - Recommendations & Alerts. The posture and threat-detection findings stream from Defender for Cloud's CWPP plans.
- Microsoft Sentinel. Microsoft's SIEM - runs on top of Log Analytics workspaces and ingests Activity Log, Entra logs, Defender alerts, and third-party connectors. KQL is the query language.
- Microsoft 365 Unified Audit Log. SaaS-side audit events for Exchange, SharePoint, Teams, Purview. Critical if M365 is in scope - and it usually is.
GCP
- Cloud Audit Logs - Admin Activity. Control-plane operations. Free, always on, 400-day retention by default at the project level. Aggregate to an org-level log sink for cross-project queryability.
- Cloud Audit Logs - System Event. Google-initiated events affecting your resources (live migration, automated maintenance, defender actions). Free, always on.
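- Cloud Audit Logs - Policy Denied. Logs when a security policy such as VPC Service Controls denies an action. Generated by default and billed like other ingested logs; you can exclude them with a filter but not switch them off. Excellent attack-surface signal.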
- Cloud Audit Logs - Data Access. The data-plane stream - reads of data in BigQuery, Cloud Storage, Pub/Sub, etc. Off by default for most services; paid. See the data-access gotcha section.
- VPC Flow Logs. Network metadata at the subnet level. Granular sampling controls. Don't forget Firewall Rules Logging - a separate enablement.
- Cloud DNS query logs. DNS-side detection signal, enabled per-policy.
- Security Command Center (SCC). The aggregator: Event Threat Detection, Container Threat Detection, VM Threat Detection, Security Health Analytics. Findings stream out to Pub/Sub for SIEM ingestion. See the native section.
- Access Transparency & Access Approval. Logs and approvals for Google-personnel access to your data - an attestation source that's unusually well-developed on GCP.
- Google Workspace Admin SDK / Reports API. SaaS-side audit events. Pull via API or stream to BigQuery.
Identity providers (the often-skipped layer)
- Okta - System Log API, every authentication and admin action.
- Entra ID - covered above; the cloud-provider audit logs don't show identity-provider activity that didn't reach the cloud.
- JumpCloud, OneLogin, Ping, Auth0 - every modern IdP exposes an event stream, usually as webhook or API.
- Workspace / M365 - the SaaS-platform audit logs that record what humans did in productivity tools.
If you ingest only one extra source beyond cloud control planes, ingest the IdP. The majority of cloud incidents start at identity - an MFA-fatigue push, a stolen session token, a service-account key leaked to GitHub - and the IdP log sees the first event in the kill chain.
The data-access log gotcha
Each cloud splits its audit log into "what someone did to the resource" (management / admin activity) and "what someone read or wrote inside the resource" (data events / data access). The split has cost and privacy reasons; it has detection consequences that catch programs off guard.
| Cloud | Control-plane log | Data-plane log | Default state of data-plane |
|---|---|---|---|
| AWS | CloudTrail management events (free for one trail) | CloudTrail data events (paid, per-resource scoped) | Off; enable per S3 bucket, Lambda function, DynamoDB table, etc. |
| Azure | Activity Log (free) | Per-resource Diagnostic Settings (paid storage) | Off; configure Diagnostic Settings on each resource |
| GCP | Cloud Audit Logs - Admin Activity (free) | Cloud Audit Logs - Data Access (paid) | Off for most services; explicit opt-in in IAM audit-config |
The single most common cloud-detection blind spot is "we have CloudTrail / Activity Log / Cloud Audit Logs enabled" without realizing that the data-plane stream is a separate enablement. The result: you can see that an attacker assumed a role, but not that they then listed every object in a sensitive bucket. You can see who changed a Key Vault access policy, but not whose secret was retrieved. You can see who granted BigQuery dataset permissions, but not who exported the data.
The pragmatic posture: enable data-plane logging on your crown-jewel data resources, scoped tightly. The cost of logging every S3 GetObject across thousands of buckets is real; the cost of logging the dozen buckets that hold customer data is trivial. Inventory first, scope deliberately, accept the bill.
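For AWS, "scoped tightly" usually means CloudTrail advanced event selectors that name specific buckets. The sketch below only builds the selector payload; the bucket names and the selector name are hypothetical, and the resulting list is the shape that boto3's `put_event_selectors(TrailName=..., AdvancedEventSelectors=...)` accepts. Treat it as a starting point to check against the CloudTrail documentation, not a finished configuration.

```python
def s3_data_event_selectors(bucket_names, read_only=None):
    """Build CloudTrail advanced event selectors limited to the named buckets.

    read_only=None logs reads and writes; True or False narrows to one side.
    """
    if not bucket_names:
        raise ValueError("refusing to build an unscoped selector")
    field_selectors = [
        {"Field": "eventCategory", "Equals": ["Data"]},
        {"Field": "resources.type", "Equals": ["AWS::S3::Object"]},
        {
            "Field": "resources.ARN",
            # The trailing slash keeps "customer-pii" from also matching
            # a sibling bucket such as "customer-pii-dev".
            "StartsWith": [f"arn:aws:s3:::{name}/" for name in sorted(bucket_names)],
        },
    ]
    if read_only is not None:
        field_selectors.append({"Field": "readOnly", "Equals": [str(read_only).lower()]})
    return [{"Name": "crown-jewel-s3-data-events", "FieldSelectors": field_selectors}]


# Hypothetical inventory of the buckets that hold customer data.
CROWN_JEWELS = {"customer-pii", "billing-exports"}
selectors = s3_data_event_selectors(CROWN_JEWELS)
```

Driving the selector from an inventory list keeps the bill proportional to the number of crown-jewel buckets rather than to the size of the estate.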
Sigma - the lingua franca
Sigma is a YAML-based, vendor-neutral format for describing log-based detections. It exists because every SIEM has its own query language (Splunk SPL, KQL for Sentinel and Defender, Elastic ES|QL / EQL, Sumo Logic, Datadog, Panther's Python) and writing the same rule in five places is the worst possible use of detection-engineer time.
A Sigma rule names the log source, declares match conditions, and tags ATT&CK techniques, severity, references, and an owner. A converter (the original Sigmac, the modern pySigma, or a vendor-provided one) compiles it to the target SIEM's query language. The Sigma source is the canonical artifact in your detection-as-code repo; the compiled output is a build artifact.
What Sigma is good at
- Portability. One rule, multiple backends. Useful if you run a SIEM and a data lake, or migrate between SIEMs.
- Shared catalogs. The SigmaHQ public rule library is thousands of community-maintained rules covering Windows, Linux, AWS, Azure, GCP, Okta, GitHub, etc. A useful starter library and benchmark.
- Readability. Sigma is more readable than most native query languages - easier to review in a PR.
- Versioning. Stable rule IDs and explicit revision tracking are part of the spec.
What Sigma is not good at
- Complex logic. Multi-event correlation, joins, and stateful behavior either don't translate well or compile to inefficient native queries. For the trickiest rules, write directly in the target language.
- Performance tuning. The converter doesn't know your indexes, your data model, or your time-window economics. Expect to hand-tune the compiled output for high-volume rules.
- Cloud-specific field schemas. Sigma's cloud-log taxonomies are still maturing relative to its Windows / endpoint coverage. Expect to define custom log-source mappings.
The practical workflow: write the rule in Sigma where the logic fits cleanly; write directly in the backend's query language where it doesn't; treat the choice as a per-rule decision, not a religious one.
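To make the shape of a Sigma rule concrete, here is one written as a Python dict (the YAML maps one-to-one) together with a toy evaluator. This is an illustration of the format only, not pySigma: it handles exact match plus the `contains` modifier and two fixed condition strings, it compares case-sensitively where real Sigma matching is case-insensitive by default, and the rule ID and role name are placeholders.

```python
RULE = {
    "title": "IAM access key created",
    "id": "00000000-0000-0000-0000-000000000001",  # placeholder ID
    "status": "experimental",
    "logsource": {"product": "aws", "service": "cloudtrail"},
    "detection": {
        "selection": {
            "eventSource": "iam.amazonaws.com",
            "eventName": "CreateAccessKey",
        },
        # Hypothetical role name used as a rule-level exception.
        "filter": {"userIdentity.arn|contains": ":assumed-role/platform-break-glass/"},
        "condition": "selection and not filter",
    },
    "level": "medium",
    "tags": ["attack.persistence", "attack.t1098.001"],
}


def get_field(event, dotted):
    """Resolve 'a.b.c' against nested dicts; None if any hop is missing."""
    current = event
    for part in dotted.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def block_matches(block, event):
    """All fields in a block must match (AND); a list of values means OR."""
    for key, expected in block.items():
        name, _, modifier = key.partition("|")
        actual = get_field(event, name)
        values = expected if isinstance(expected, list) else [expected]
        if modifier == "":
            ok = actual in values
        elif modifier == "contains":
            ok = isinstance(actual, str) and any(v in actual for v in values)
        else:
            raise NotImplementedError(f"modifier {modifier!r} not handled in this sketch")
        if not ok:
            return False
    return True


def rule_matches(rule, event):
    detection = rule["detection"]
    condition = detection["condition"]
    if condition == "selection":
        return block_matches(detection["selection"], event)
    if condition == "selection and not filter":
        return (block_matches(detection["selection"], event)
                and not block_matches(detection["filter"], event))
    raise NotImplementedError(f"condition {condition!r} not handled in this sketch")
```

Anything beyond this (aggregation, correlation across events) is where the "write it in the backend language" advice starts to apply.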
Vendor detection languages
Each backend has its own query language. You will end up reading and writing the ones in your stack; familiarity with all of them is a strong differentiator for the role.
- Splunk SPL. Pipeline-style syntax (search ... | stats ... | where ...). The most widely-known SIEM language. Splunk Cloud and Splunk Enterprise Security are still industry-standard at large enterprises.
- KQL (Kusto Query Language). Microsoft's language for Sentinel, Defender XDR, Defender for Cloud, and Azure Monitor / Log Analytics. Also pipeline-style; more SQL-like than SPL. Cross-workspace queries are first-class.
- Elastic ES|QL & EQL. Elastic's newer pipeline language (ES|QL) and its older sequence-aware language (EQL) for endpoint-style chained detections. Lucene / KQL (the Kibana flavor, not Microsoft's) for ad-hoc.
- Panther (Python). Detections-as-code platform where rules are Python functions. Strong story for testing - every detection comes with sample-event tests. Plays well with Snowflake and data-lake-style architectures.
- Chronicle / Google SecOps - YARA-L 2.0. Google's detection language, derived from YARA's pattern syntax. Sequence-aware, hyper-fast against Chronicle's columnar store. Built-in ATT&CK and IOC enrichment.
- Sumo Logic. Pipeline language similar to SPL.
- Datadog (Cloud SIEM). Tag-based query syntax aligned with Datadog's broader observability model.
- Snowflake / BigQuery / Athena (SQL). When the backend is a data lake, the detection language is just SQL - with scheduled queries or a security-data-lake layer (Anvilogic, Hunters, Query.ai) on top.
- Falco rules (YAML). The detection language for runtime / syscall events from Falco. Different shape - closer to OPA than SIEM.
Pick the language your stack actually runs. Then read other languages' rule libraries - Splunk's public detections, Microsoft's Sentinel content hub, Elastic's detection-rules repo, Panther's analysis pack, and Chronicle's content packs are all valuable cross-reference even when you can't run their rules directly.
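To see how the same match condition looks in two of these languages, the toy renderer below turns one field/value selection into an SPL search and a KQL query. It is a sketch for reading practice, not a converter: the `index` value is made up, and the `sourcetype`, table name, and column mapping are assumptions to verify against your own Splunk and Sentinel schemas.

```python
SELECTION = {"eventSource": "iam.amazonaws.com", "eventName": "CreateAccessKey"}

# Assumed column mapping for a Sentinel table; verify against your workspace.
KQL_FIELDS = {"eventSource": "EventSource", "eventName": "EventName"}


def quote(value):
    """Double-quote a string literal, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_spl(selection, index="aws", sourcetype="aws:cloudtrail"):
    """Render an AND of field=value terms as an SPL search."""
    terms = [f"{name}={quote(value)}" for name, value in sorted(selection.items())]
    return f"search index={index} sourcetype={sourcetype} " + " ".join(terms)


def to_kql(selection, table="AWSCloudTrail"):
    """Render the same AND as a KQL where clause."""
    terms = [f"{KQL_FIELDS[name]} == {quote(value)}"
             for name, value in sorted(selection.items())]
    return f"{table}\n| where " + " and ".join(terms)


print(to_spl(SELECTION))
print(to_kql(SELECTION))
```

The two outputs differ mostly in field naming and operator syntax, which is why reading another vendor's rule library is usually tractable even without running it.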
MITRE ATT&CK Cloud Matrix
MITRE ATT&CK is the threat-model taxonomy detection engineers and SOC analysts share. The Enterprise matrix has cloud-specific sub-matrices that matter directly to the role:
- IaaS - AWS, Azure, GCP infrastructure techniques: cloud account discovery, instance metadata abuse, modification of trust policies, snapshot exfiltration.
- SaaS - generic SaaS-platform techniques: API exploitation, OAuth-app abuse, session-token theft.
- Office 365 - Microsoft 365 / Exchange / SharePoint techniques: inbox rules, mailbox forwarding, eDiscovery abuse.
- Google Workspace - Workspace-specific techniques: app password abuse, Drive sharing exfiltration, Workspace admin role abuse.
- Azure AD / Entra ID - identity-platform techniques: device-code phishing, Primary Refresh Token theft, conditional-access bypass, golden SAML.
Every detection rule in a mature program tags one or more ATT&CK techniques. The mappings are what produce the coverage report - "we have detections for 142 of the 213 cloud techniques relevant to our environment; here's the prioritized backlog of the rest." That coverage report is the single most-requested artifact when a CISO or a customer asks about the detection program's maturity.
Cross-reference ATT&CK with the open-source Center for Threat-Informed Defense security stack mappings, which connect ATT&CK techniques to specific AWS, Azure, GCP service controls. Useful when planning preventative coverage alongside detective.
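The coverage report is a small set computation once rules carry ATT&CK tags. A minimal sketch, assuming Sigma-style tags such as `attack.t1098.001` and a hand-maintained list of techniques relevant to the environment; the rule IDs and technique list here are illustrative, and matching is exact (a sub-technique tag does not count as covering its parent).

```python
def coverage_report(rules, relevant_techniques):
    """Return (covered, missing, ratio) for the techniques the program cares about."""
    tagged = set()
    for rule in rules:
        for tag in rule.get("tags", []):
            # Technique tags look like "attack.t1098.001". Tactic tags such as
            # "attack.persistence" carry no technique ID and are skipped.
            if tag.lower().startswith("attack.t") and tag[8:9].isdigit():
                tagged.add(tag[7:].upper())
    relevant = set(relevant_techniques)
    covered = sorted(tagged & relevant)
    missing = sorted(relevant - tagged)
    ratio = len(covered) / len(relevant) if relevant else 0.0
    return covered, missing, ratio


rules = [
    {"id": "rule-a", "tags": ["attack.persistence", "attack.t1098.001"]},
    {"id": "rule-b", "tags": ["attack.collection", "attack.t1530"]},
]
covered, missing, ratio = coverage_report(rules, ["T1098.001", "T1078.004", "T1530"])
```

The `missing` list is the prioritized backlog in raw form; ordering it by threat relevance is a judgment call the code does not make.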
Detection-as-code workflow
Detection-as-code treats detection rules with the engineering rigor any other production code gets - branching, code review, automated testing, CI/CD deployment, observability of failures. The mechanics:
- The repo. One Git repository for the detection content. Top-level folders by log source or by ATT&CK tactic; per-rule files contain the rule, sample events for tests, ATT&CK tags, owner, severity, validation references. Schemas vary by tool (Panther, Splunk Content Control Tower / Splunk SOAR Content, Sentinel content hub, Elastic detection-rules, custom).
- The pipeline. On PR: lint the YAML / Python, validate against the rule schema, run unit tests (replay sample events, assert fire/no-fire), optionally compile Sigma → backend language, optionally backtest against a sample log corpus. On merge: deploy via the backend's API to staging; promote to production on a separate workflow.
- The tests. Every rule ships with at least one positive test (an event that should fire) and at least one negative test (an event that looks similar but shouldn't). Modern tools (Panther, Elastic detection-rules, Splunk Security Content) bake the test harness in.
- Versioning. Every rule has a stable ID; every revision is a commit; every deploy carries the commit SHA. When the SOC analyst asks "when did this rule change?" the answer is in the Git log.
- The build. Sigma-source rules compile to backend-specific output at build time, not runtime. The output is a deployment artifact.
- The deployment. SIEM APIs (Splunk REST, Sentinel ARM templates, Elastic Kibana Detection Engine API, Chronicle's API, Panther's deployment) are the destination. Manual rule edits in the SIEM UI are forbidden by policy - drift detection in the pipeline catches them.
Reference open-source content repos for the shape: Elastic detection-rules, Azure Sentinel, Splunk Security Content, Panther Analysis. Each has its own conventions; the patterns rhyme.
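The positive/negative test requirement is easy to picture with a rule written as a plain Python function, in the style Panther uses. The sketch below is not Panther's harness: `TESTS` and `run_tests` are hypothetical names, and the rule (console login without MFA) is just a compact example with an obvious lookalike.

```python
def rule(event):
    """Fire on console logins that succeeded without MFA."""
    if event.get("eventName") != "ConsoleLogin":
        return False
    # CloudTrail may send null for these objects, hence the "or {}".
    if (event.get("responseElements") or {}).get("ConsoleLogin") != "Success":
        return False
    return (event.get("additionalEventData") or {}).get("MFAUsed") == "No"


TESTS = [
    {
        "name": "successful login without MFA fires",
        "expect": True,
        "event": {
            "eventName": "ConsoleLogin",
            "responseElements": {"ConsoleLogin": "Success"},
            "additionalEventData": {"MFAUsed": "No"},
        },
    },
    {
        "name": "successful login with MFA does not fire",
        "expect": False,
        "event": {
            "eventName": "ConsoleLogin",
            "responseElements": {"ConsoleLogin": "Success"},
            "additionalEventData": {"MFAUsed": "Yes"},
        },
    },
    {
        "name": "failed login without MFA does not fire",
        "expect": False,
        "event": {
            "eventName": "ConsoleLogin",
            "responseElements": {"ConsoleLogin": "Failure"},
            "additionalEventData": {"MFAUsed": "No"},
        },
    },
]


def run_tests(rule_fn, tests):
    """Return the names of failing tests; an empty list means the rule may merge."""
    return [t["name"] for t in tests if bool(rule_fn(t["event"])) != t["expect"]]
```

In a pipeline, a non-empty return value from `run_tests` fails the PR check.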
Native threat-detection services
Each major cloud ships a managed threat-detection service. Detection engineers use them as signal sources - high-confidence findings that flow into the SIEM as one input among many - rather than as a complete detection program. The tradeoff is the same in every cloud: the managed service catches the well-known stuff cheaply; everything else still needs custom rules.
AWS GuardDuty
GuardDuty analyzes CloudTrail, VPC Flow Logs, DNS queries, S3 data events, EKS audit logs, Lambda invocation logs, EBS volume snapshots, and RDS login activity. Findings are categorized by attack stage (Reconnaissance, Discovery, CredentialAccess, etc.) and map to ATT&CK. Strong areas: instance compromise, S3 anomalies, mining/crypto activity, credential exfiltration. Weak areas: low-and-slow insider activity, anything that requires environment-specific context (your "this user shouldn't be in this region" rule). Cost scales with API-call volume - large environments need budgeting.
Microsoft Defender for Cloud
Defender for Cloud bundles CSPM and CWPP plans, each producing findings. Defender for Servers, Containers, Storage, SQL, Key Vault, App Service, ARM, DNS, Cloud Database, and APIs each shed detections. Tightly integrated with Sentinel (the SIEM) and Defender XDR (the unified XDR plane). Strong on Microsoft-platform context - Entra, M365, Windows - and on integrating identity-side and infrastructure-side signal. Cost is per-plan and per-resource; the bill is non-trivial at scale.
Google Security Command Center
Security Command Center comes in Standard, Premium, and Enterprise tiers. Premium and Enterprise add Event Threat Detection (audit-log-based), Container Threat Detection, VM Threat Detection (memory scanning), Web Security Scanner, and Security Health Analytics. Enterprise (formerly Mandiant Hunt / Chronicle integration) layers Chronicle SecOps and Mandiant threat intelligence on top. Findings stream to Pub/Sub for SIEM ingestion. Strong areas: GCP-native API anomalies, container runtime threats. Weak areas: cross-cloud correlation, custom detection logic - for which you'd use Chronicle SecOps directly.
What native services don't replace
Every native service is built around the patterns its vendor has seen across all customers. Environment-specific detections - your access patterns, your service accounts, your geo footprint, your business-hours norms - only you can write. The native finding stream is one of many signal sources in your SIEM; treat it as a high-priority queue, not as the whole detection program.
SIEM vs data lake vs XDR
Three architectural patterns compete for "where the logs go and where the detections run": the SIEM (traditional or cloud-native), the security data lake, and XDR. Most large programs run more than one.
Traditional SIEM
Splunk Enterprise Security, QRadar, ArcSight, Elastic Security. Mature ecosystem, deep correlation engines, packaged dashboards and incident workflows. Cost model historically scales with ingest volume - the structural reason data-lake patterns are eating the low-end of this market.
Cloud-native SIEM
Microsoft Sentinel, Google Chronicle / SecOps, Sumo Logic, Datadog Cloud SIEM, Panther. Cloud-hosted, usually cheaper-per-GB than traditional SIEM, tight integration with the vendor's broader platform. Chronicle's flat-fee-per-employee model is unusual and worth modeling for large environments.
Security data lake
Snowflake, Databricks, BigQuery, S3 + Iceberg + Athena. Cheap, schema-on-read, long retention, arbitrary analytics. Pair with a security-analytics layer - Anvilogic, Hunters, Query.ai, Panther, or Snowflake's own Horizon - to do the SIEM-shaped work on top.
XDR
Microsoft Defender XDR, CrowdStrike Falcon, SentinelOne Singularity, Palo Alto Cortex XDR. Endpoint-centric platforms extended to identity, cloud, and email. The vendor owns the detection content for their own telemetry; you write custom content in their query language. Often complementary to a SIEM, not a replacement.
The 2026 reality
Most mature programs run a hybrid:
- An EDR / XDR covers endpoint and (increasingly) identity / cloud telemetry with vendor-curated detections.
- A SIEM hosts the real-time, high-value correlations and the SOC's primary alerting workflow.
- A data lake holds the long-tail and high-volume logs (flow logs, DNS, full audit) cheaply for forensics, threat hunting, and slower detections. A security analytics layer queries it.
- The detection-as-code repo compiles rules to both the SIEM and the data lake; the SOC has one front-end UI.
The cost model conversation is rarely about which tool is cheapest in isolation. It's about where each log lands: high-cost SIEM for the 20% of logs that drive 80% of real-time detections, cheap data lake for the rest, with the detection engineer choosing per source.
Log retention & cost
Retention has three drivers: regulatory floors, detection needs, and incident-response needs. The three want different things.
- Regulatory floors. PCI DSS requires at least one year of audit-log retention, with the most recent three months immediately available. HIPAA's Security Rule, via "documentation retention," implies 6 years for audit-log evidence. SOX, FedRAMP Moderate, NYDFS Part 500, and similar each set their own. Map your applicable regulations to a floor.
- Detection needs. Most real-time detections look back hours or days. Some hunt queries look back months. Trend-based detections (anomalous baseline shifts) need 30-90 days of history.
- Incident-response needs. If you don't know when the intrusion started, you need enough history to bracket it. Dwell times for cloud incidents range from days to months; "we'd like a year" is a reasonable IR ask.
Tiering
- Hot. Indexed, queryable in seconds. 7-30 days is typical. This is where the SIEM rules run.
- Warm. Indexed but slower / cheaper. 30-180 days. Splunk SmartStore, Sentinel's Auxiliary Logs / Basic Logs tier, Chronicle's hot/cold tiers.
- Cold / archive. Object storage (S3 Glacier, Azure Archive, GCS Coldline). 1-7 years. Rehydrate-on-demand for IR.
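The tiers above reduce to an age-based lookup. A minimal sketch, assuming the upper end of each range quoted in the list as the boundary; the constants are placeholders to replace with whatever your regulatory floor and budget dictate.

```python
from datetime import date

# Upper bounds taken from the ranges in the text; adjust to policy.
HOT_DAYS = 30
WARM_DAYS = 180
ARCHIVE_DAYS = 7 * 365


def tier_for(event_day: date, today: date) -> str:
    """Return the storage tier an event of this age belongs in."""
    age = (today - event_day).days
    if age < 0:
        raise ValueError("event is dated in the future")
    if age <= HOT_DAYS:
        return "hot"
    if age <= WARM_DAYS:
        return "warm"
    if age <= ARCHIVE_DAYS:
        return "cold"
    return "expired"
```

A lifecycle job would call this per partition (for example, per day of logs) rather than per event.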
Raw vs aggregated
Keep raw events in the lake; aggregate / summarize what you keep in the hot SIEM. The aggregation patterns: hourly counters of CloudTrail events per principal, daily summaries of VPC Flow Logs per VPC, sessionized identity logins. Aggregated indices answer the trend questions; raw archives answer "what exactly happened on day X?"
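The "hourly counters of CloudTrail events per principal" pattern is a group-by on a truncated timestamp. The sketch below assumes CloudTrail's standard `eventTime`, `eventName`, and `userIdentity.arn` fields and a second-resolution UTC timestamp; the function name is illustrative.

```python
from collections import Counter
from datetime import datetime


def hourly_counts(events):
    """Count events per (hour bucket, principal ARN, eventName)."""
    counts = Counter()
    for event in events:
        # CloudTrail eventTime is ISO 8601 UTC, e.g. "2025-03-01T14:07:55Z".
        ts = datetime.strptime(event["eventTime"], "%Y-%m-%dT%H:%M:%SZ")
        hour = ts.strftime("%Y-%m-%dT%H:00Z")
        principal = (event.get("userIdentity") or {}).get("arn", "unknown")
        counts[(hour, principal, event["eventName"])] += 1
    return counts
```

The counter rows go to the hot tier for trend queries; the raw events they summarize stay in the lake.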
Signal sources beyond audit logs
Audit logs cover the API plane; the workload plane and the data plane have their own signal sources, and a mature detection program incorporates them all.
- EDR / XDR - CrowdStrike Falcon, SentinelOne, Cortex XDR, Defender XDR. Process, file, network, and registry telemetry from endpoints and (increasingly) VMs. Vendor-curated detections cover most known endpoint attacker behaviors; the custom-content surface is for environment-specific patterns.
- CNAPP runtime - Wiz, Sysdig, Aqua, Orca, Prisma Cloud, Upwind. Runtime visibility inside containers and VMs - syscalls, process trees, network flows, file modifications - correlated with the posture findings from the same platform.
- eBPF runtime tools - Falco, Tracee, Tetragon. Open-source kernel-level observability. Falco rules are YAML and live in your detection-as-code repo alongside Sigma.
- Application logs. The custom-written telemetry your services emit - authentication outcomes, sensitive-action audits, business-logic events. Often the most diagnostic signal for application-layer attacks and the least-instrumented in immature programs.
- SaaS audit APIs. GitHub Enterprise audit log, GitLab audit events, Slack audit, Okta system log, Salesforce event monitoring, Snowflake account usage. The kill chain often starts in SaaS before reaching the cloud.
- Network IDS / TLS-decrypting proxies. Suricata / Zeek for L7 detection inside transit VPCs, secure web gateways for egress, mTLS-inspecting service meshes. Less universal in cloud than in on-prem, but invaluable where deployed.
- DNS. Route 53 query logs, Cloud DNS query logs, Azure DNS analytics, plus secure-DNS services (Umbrella, Quad9). C2 and DGA detection still works there.
- WAF / edge logs. CloudFront / Cloudflare / Azure Front Door / Google Cloud Armor. First-touch visibility into web-side attacks.
Building a detection - walkthrough
A concrete example illustrates the lifecycle better than the abstract version. Threat: an attacker who has compromised an IAM principal in AWS creates a new access key for an existing IAM user as a persistence mechanism, then uses that access key from outside the org's normal geography.
Research
The ATT&CK technique is T1098.001 - Account Manipulation: Additional Cloud Credentials. The AWS API call is CreateAccessKey on IAM. The diagnostic fields in CloudTrail: eventName=CreateAccessKey, userIdentity (who did it), requestParameters.userName (the target user), responseElements.accessKey.accessKeyId (the new key ID). Benign cases: an admin onboarding a new service integration, a CI system rotating its own key.
Develop
Two rules, not one. Rule A: a high-signal-low-volume rule that fires on any CreateAccessKey against a user with the service-account tag (those keys should be rotated by the platform team's automation, not manually). Rule B: a correlation rule that fires when a key created in the last 24 hours is used from a country outside the operator's list. The second rule requires joining CloudTrail with IP-geolocation enrichment - easier in KQL / SPL than in pure Sigma.
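Expressed as plain Python over CloudTrail-shaped events, the two rules might look like the sketch below. Several pieces are assumptions: CloudTrail does not carry the target user's tags, so Rule A takes a set of service-account user names from an inventory you maintain; `country_of` is a caller-supplied IP-to-country lookup standing in for whatever geolocation enrichment you use; and an unknown country is treated as outside the list.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=24)


def parse_time(value):
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")


def rule_a(event, service_account_users):
    """Rule A: any CreateAccessKey whose target is a tagged service account."""
    if event.get("eventName") != "CreateAccessKey":
        return False
    target = (event.get("requestParameters") or {}).get("userName")
    return target in service_account_users


def rule_b(events, country_of, allowed_countries):
    """Rule B: a key created in the last 24 hours is used from an unexpected country.

    `events` must be sorted by eventTime.
    """
    created = {}  # accessKeyId -> creation time
    alerts = []
    for event in events:
        when = parse_time(event["eventTime"])
        if event.get("eventName") == "CreateAccessKey":
            access_key = (event.get("responseElements") or {}).get("accessKey") or {}
            key_id = access_key.get("accessKeyId")
            if key_id:
                created[key_id] = when
            continue
        used_key = (event.get("userIdentity") or {}).get("accessKeyId")
        born = created.get(used_key)
        if born is None or when - born > WINDOW:
            continue  # not a tracked key, or the key is older than the window
        country = country_of(event.get("sourceIPAddress"))
        if country not in allowed_countries:
            alerts.append({
                "accessKeyId": used_key,
                "country": country,
                "eventTime": event["eventTime"],
            })
    return alerts
```

Rule B keeps state across events (the `created` map), which is the part that does not translate cleanly to a single-event Sigma rule.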
Tune
Backtest Rule A against 30 days of CloudTrail. Discover that the platform team's emergency-rotation runbook also fires the rule. Add a suppression: userIdentity.arn matching the platform-team break-glass role. Backtest again - 2 fires/month, both genuine investigations. Acceptable. Backtest Rule B against the same window. Discover that traveling executives generate false positives. Add a per-user "approved geo" allowlist sourced from HRIS travel data.
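A backtest with suppressions is a replay loop that separates "would have fired" from "fired but excepted". The sketch below is one way to structure it; `backtest` and `break_glass` are illustrative names, and the role name in the ARN check is hypothetical.

```python
def backtest(rule_fn, events, suppressions=()):
    """Replay historical events; return (fired, suppressed) event lists."""
    fired, suppressed = [], []
    for event in events:
        if not rule_fn(event):
            continue
        if any(suppress(event) for suppress in suppressions):
            suppressed.append(event)
        else:
            fired.append(event)
    return fired, suppressed


def break_glass(event):
    """Suppression: the caller is a session of the platform team's break-glass role."""
    arn = (event.get("userIdentity") or {}).get("arn", "")
    return ":assumed-role/platform-break-glass/" in arn  # hypothetical role name
```

Keeping the suppressed list, rather than dropping those events, makes the later suppression review possible: a suppression that has matched nothing for months is a candidate for removal.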
Deploy
Merge the PR. CI compiles the Sigma source to KQL (for Sentinel) and to Panther Python (for the data-lake side). Both deploy. Severity: Medium for Rule A, High for Rule B. SLA: 1 hour for High, 4 hours for Medium.
Validate
Run the Stratus Red Team technique aws.persistence.iam-backdoor-user (which creates an access key on an existing IAM user) in the test AWS account. Confirm Rule A fires in Sentinel within 5 minutes. For Rule B, follow up by exercising the new key from a non-allowlisted IP via a test runner. Confirm the alert. Log both validations against the rule IDs.
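The "fires within 5 minutes" check is worth automating so the validation log entry is produced the same way every time. A minimal sketch, assuming alerts have already been pulled from the SIEM into dicts with `rule_id` and `created_at` keys; that record shape and the function name are hypothetical.

```python
from datetime import datetime, timedelta


def detection_validated(detonated_at, alerts, rule_id, max_delay=timedelta(minutes=5)):
    """True if `rule_id` alerted within `max_delay` after the emulated technique ran."""
    for alert in alerts:
        if alert["rule_id"] != rule_id:
            continue
        delay = alert["created_at"] - detonated_at
        # Alerts from before the detonation belong to some other activity.
        if timedelta(0) <= delay <= max_delay:
            return True
    return False
```

Alerts that predate the detonation are ignored on purpose, so a stale alert from an earlier run cannot mark a broken rule as validated.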
Retire
Add an explicit retirement criterion: if the org migrates fully to short-lived federated credentials (no more long-lived IAM access keys), the rule becomes meaningless and should be archived.
That entire workflow lives in a PR with a written hypothesis, two rule files, four test cases, a backtest report, a validation log entry, and a documented retirement criterion. Multiply across 200 rules and you have a detection program.
Tuning & noise reduction
Detection programs die of false positives. An analyst queue full of low-precision alerts trains the human to dismiss everything, and the real incident sits in the noise. The mechanics of keeping that from happening:
- Set a precision target per severity. Critical and High alerts should be at least ~90% true-positive when they fire - the analyst expects to act, not investigate. Medium can be lower. Low severity is for high-volume signal that batches into a daily review.
- Backtest before deploy. Run the rule against 7-30 days of historical production logs before any rule reaches the analyst queue. Most false-positive sources are obvious in the backtest.
- Baselines. "Anomalous" is a comparison to a baseline. Per-principal baselines (this user never assumed this role before), per-service baselines (this Lambda has never made an outbound call to this country), and per-environment baselines (production never sees PutBucketAcl) all narrow the rule.
- Suppression vs allowlist. A suppression is a rule-level exception ("don't fire when the platform-team break-glass role is the principal"). An allowlist is a data-level exception ("this IP is approved"). Both belong in the rule file under version control, not in the SIEM UI.
- Decay-and-revisit. Suppressions accumulate; review them on a cadence. A suppression added for a long-gone integration is a silent gap.
- Alert-fatigue metrics. Track per-rule fire rate, per-rule TP/FP, per-analyst dismiss rate. Rules that get dismissed without investigation are either broken or noise - both need fixing.
- Stacking signals. A single weak signal becomes a strong one in combination. The "create access key" rule is medium-confidence alone, high-confidence when stacked with "first use from new geography" and "principal has not used this region in 90 days." Risk-based alerting (Splunk RBA, Sentinel UEBA, custom risk scores) is the modern pattern.
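The signal-stacking idea from the list above can be sketched as a per-principal risk score. This is a toy model rather than how any particular RBA product works: the signal names, weights, and threshold are hypothetical and would need calibrating against backtest data.

```python
from collections import defaultdict

# Hypothetical weights: no single signal crosses the threshold by itself.
SIGNAL_WEIGHTS = {
    "access_key_created": 30,
    "first_use_new_geo": 40,
    "region_unused_90d": 30,
}
ALERT_THRESHOLD = 70


def score_principals(signals) -> dict:
    """Sum the weights of distinct signals per principal; keep scores at or above the threshold.

    `signals` is an iterable of (principal, signal_name) pairs.
    """
    seen = defaultdict(set)
    for principal, name in signals:
        seen[principal].add(name)  # a repeated signal counts once
    scores = {
        principal: sum(SIGNAL_WEIGHTS.get(name, 0) for name in names)
        for principal, names in seen.items()
    }
    return {p: s for p, s in scores.items() if s >= ALERT_THRESHOLD}
```

Counting each signal once per principal is a deliberate simplification: it stops one noisy rule from pushing a principal over the threshold unaided, which is the failure mode stacking is meant to avoid.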
Validation & purple teaming
A rule library you've never tested is a rule library you have to assume is broken. Validation falls into three flavors that complement each other.
Atomic-style automated emulation
- Atomic Red Team. Open-source library of small attack-technique scripts, organized by ATT&CK. General-purpose; strong on Windows / Linux endpoint. Run them on a schedule against test hosts and verify the corresponding detections fire.
- Stratus Red Team. The Datadog-built cloud-specific equivalent. Dozens of AWS, Azure, GCP, and Kubernetes techniques, each invokable with one command (e.g. stratus detonate aws.persistence.iam-backdoor-user). Maps every technique to ATT&CK. The single most useful cloud-detection validation tool.
- MITRE CALDERA. Automated adversary emulation framework. Heavier-weight than Atomic; chains techniques into full operations. Good for end-to-end kill-chain testing rather than per-rule validation.
- SkyArk, Pacu, MicroBurst, awspx. Cloud-attack toolkits with overlapping coverage; useful for adversary-shaped testing of specific techniques.
Purple teaming
An offensive team (internal red team or external engagement) runs realistic operations against the live environment with the detection team watching. For each technique, record whether the detection fired, at what severity, with what fidelity, and how quickly. Purple teaming is dense - a one-day exercise can surface a quarter's worth of detection-engineering backlog.
Breach & attack simulation (BAS)
Commercial platforms (AttackIQ, SafeBreach, Picus, XM Cyber, Cymulate) automate the purple-team cadence with broad technique libraries and built-in reporting. The justification compared to OSS (Stratus + Atomic) is the reporting layer, the technique breadth, and the integration with the SIEM and ticketing.
Continuous validation
Whichever stack you pick, run validation continuously - not just at the end of a quarter. A weekly Stratus run hitting every cloud-detection rule, with the result piped to a dashboard, is a sustainable cadence for a 1-2 person detection team. Detections decay silently (an API schema changes, a log field renames, a SIEM tuning regresses); continuous validation is the only way to catch the decay before an attacker does.
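One way to turn a weekly run into a dashboard feed is to compare the techniques that were detonated with the rule IDs that actually alerted. The sketch below assumes both lists have already been collected (for example, from the emulation tool's run log and a SIEM alert export); the rule IDs and the technique-to-rule mapping are hypothetical, and the second technique ID is included only for illustration.

```python
# Hypothetical mapping: emulated technique -> rule ID expected to fire.
EXPECTED_RULE = {
    "aws.persistence.iam-backdoor-user": "rule-a-service-account-key",
    "aws.defense-evasion.cloudtrail-stop": "rule-cloudtrail-stopped",
}


def validation_report(detonated: list, fired_rule_ids: set) -> dict:
    """Classify each detonated technique as pass, fail, or unmapped."""
    report = {}
    for technique in detonated:
        rule_id = EXPECTED_RULE.get(technique)
        if rule_id is None:
            report[technique] = "no-rule-mapped"  # coverage gap, not a failure
        elif rule_id in fired_rule_ids:
            report[technique] = "pass"
        else:
            report[technique] = "fail"  # the rule exists but stayed silent
    return report


def decayed(report: dict) -> list:
    """Techniques whose mapped rule did not fire; these need an owner this week."""
    return sorted(t for t, status in report.items() if status == "fail")
```

Separating "fail" from "no-rule-mapped" matters for triage: the first is a detection that has silently decayed, the second is backlog.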
AWS, Azure, and GCP side-by-side
The detection-relevant native primitives each cloud ships, reduced to a one-screen reference:
| Capability | AWS | Azure | GCP |
|---|---|---|---|
| Control-plane audit log | CloudTrail (management events) | Activity Log (subscription / mgmt group) | Cloud Audit Logs - Admin Activity |
| Data-plane audit log | CloudTrail data events (paid, off by default) | Per-resource Diagnostic Settings (paid, off by default) | Cloud Audit Logs - Data Access (paid, off by default) |
| Identity sign-in log | IAM Identity Center sign-ins; CloudTrail for AssumeRole | Entra ID Sign-in Logs & Audit Logs | Cloud Audit Logs for IAM; Workspace Reports API |
| Network flow logs | VPC Flow Logs (ENI / subnet / VPC) | NSG Flow Logs / VNet Flow Logs (v2) | VPC Flow Logs (subnet-level) |
| DNS query logs | Route 53 Resolver query logs | Azure DNS analytics | Cloud DNS query logs |
| Managed threat detection | GuardDuty (10+ feature sets) | Defender for Cloud (per-resource plans) | Security Command Center Premium / Enterprise (ETD, CTD, VMTD) |
| Finding aggregator | Security Hub (ASFF) | Defender for Cloud / Sentinel | Security Command Center |
| Native SIEM | (none; CloudTrail Lake covers limited cases) | Microsoft Sentinel | Google SecOps / Chronicle |
| SaaS audit (vendor's own) | (N/A - IAM Identity Center only) | Microsoft 365 Unified Audit Log | Google Workspace Reports API |
| Default audit retention | 90 days (console); indefinite if shipped to S3 | 90 days for Activity Log; configurable for Log Analytics | 400 days for Admin Activity; configurable for others |
The structural difference: Microsoft and Google ship their own SIEM (Sentinel, Chronicle / SecOps); AWS does not, and most large AWS shops run Splunk, Sentinel, Chronicle, or Panther on top of CloudTrail. AWS's CloudTrail Lake is closing the gap on the simplest cases but isn't a full SIEM replacement.
Maturity stages
A useful staging model for a cloud detection-engineering program:
Stage 1 - Wired
Control-plane audit logs (CloudTrail / Activity Log / Cloud Audit Logs) shipping to a SIEM. Native threat-detection services on (GuardDuty / Defender / SCC). Alerts route to one queue. Rules are mostly vendor-default. No detection-as-code yet; rule edits happen in the SIEM UI.
Stage 2 - Authored
Custom rules written for the environment's specific patterns. Sigma adopted for portable rules; vendor-language for the rest. ATT&CK tags on every rule. A coverage dashboard exists. Identity-provider logs ingested. Data-plane logging enabled on crown-jewel resources.
Stage 3 - Engineered
Detection-as-code repo with CI/CD to one or more SIEMs. Unit tests for every rule. Stratus Red Team running on a schedule against the cloud detection set. Per-rule precision targets tracked. Risk-based alerting stacking signals. Coverage report visible to the CISO.
Stage 4 - Adversarial
Purple-team cadence quarterly or better. Threat-intel-driven research backlog. New ATT&CK techniques (post-publication) have detections within an SLA. Detection-engineering team separate from SOC. Data-lake + SIEM hybrid with cost-aware log routing. Validation results feed engineering OKRs.
The skip-stage cost: trying to adopt detection-as-code without an alert queue anyone trusts is automating against an unloved artifact. Each stage builds on the credibility of the prior one.
Common pitfalls
- Alerting on everything. The "if it's worth logging, it's worth alerting" instinct produces a queue no human can act on. Every alert should have a documented analyst action; if none exists, the rule belongs in the data lake for hunting, not in the SIEM for alerting.
- No version control on rules. Editing detections in the SIEM UI without a Git repo behind them is the equivalent of editing production config by SSH. Drift is silent, history is lost, and the post-incident "when did this rule change?" question has no answer.
- No validation. A rule library nobody has exercised is a list of confident assumptions. Stratus + Atomic + a quarterly purple team is table stakes; below that line the program is theoretical.
- Ignoring data-plane logs. Control-plane-only logging tells you who changed permissions, not who read the data. Most cloud-breach incidents involve data access. Enable data-plane on the crown jewels.
- Skipping the IdP. The first event of a typical cloud kill chain is in Entra, Okta, or Workspace - not in CloudTrail. Ingest the identity provider.
- Treating native threat detection as a complete program. GuardDuty / Defender / SCC each catch the well-known attacker behaviors well. They do not write your environment-specific detections. Use them as one signal source among many.
- Rule sprawl without ATT&CK tags. A library of 500 rules with no consistent technique mapping cannot produce a coverage report and cannot be reasoned about by anyone but the author. Tag every rule.
- Compliance-driven detection. "We need to satisfy this control" produces rules that pass an auditor but don't catch attackers. Write detections from threat-research findings; map them to compliance controls after the fact, not the other way around.
- Logging volume budget set by Finance without security context. Cutting log ingest to save money sometimes saves money; sometimes blinds the detection program. The conversation is "which logs, at which retention, at which tier?" - not "cut 20%."
- No retirement. Rules accumulate; dead rules persist. Every rule needs a retirement criterion at deploy time, not as an afterthought.
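Several of these pitfalls (missing ATT&CK tags, no documented analyst action, no retirement criterion) can be caught mechanically in CI. The following is a minimal sketch of such a lint, assuming each rule file has already been parsed into a dictionary. The `tags` field and the `attack.tNNNN` tag style follow Sigma convention; `analyst_action` and `retirement_criterion` are hypothetical custom fields, not part of the Sigma specification.

```python
import re

# Sigma-style ATT&CK technique tag, e.g. attack.t1098 or attack.t1098.001
ATTACK_TECHNIQUE_TAG = re.compile(r"^attack\.t\d{4}(\.\d{3})?$")

# Hypothetical custom metadata fields this repo requires on every rule.
REQUIRED_FIELDS = ("analyst_action", "retirement_criterion")


def lint_rule(rule: dict) -> list:
    """Return a list of human-readable problems; an empty list means the rule passes."""
    problems = []
    tags = rule.get("tags") or []
    if not any(ATTACK_TECHNIQUE_TAG.match(str(tag)) for tag in tags):
        problems.append("missing ATT&CK technique tag")
    for field in REQUIRED_FIELDS:
        if not str(rule.get(field) or "").strip():
            problems.append(f"missing {field}")
    return problems


def lint_library(rules: dict) -> dict:
    """Map rule ID -> problems, listing only the rules that fail the lint."""
    results = {rule_id: lint_rule(rule) for rule_id, rule in rules.items()}
    return {rule_id: probs for rule_id, probs in results.items() if probs}
```

Failing the build on a non-empty `lint_library` result makes the metadata a merge requirement rather than a cleanup task.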
Further reading
Foundational
- Palantir - Alerting and Detection Strategy Framework
- SpecterOps - Detection-as-Code
- MITRE ATT&CK Cloud Matrix
- Center for Threat-Informed Defense - Stack Mappings
- awesome-threat-detection
Sigma & rule languages
Open-source detection content
- Elastic detection-rules
- Azure Sentinel content
- Splunk Security Content
- Panther Analysis
- Chronicle / Google SecOps detection rules
Validation
Provider documentation
- AWS CloudTrail User Guide
- AWS GuardDuty docs
- Azure Monitor docs
- Microsoft Sentinel docs
- GCP Cloud Audit Logs
- Security Command Center docs
Related CSOH pages
- Cloud SOC - the consume side of detection (analysts, queues, IR).
- Incident response - what happens after the alert.
- Threat research - where the detection-engineer's backlog comes from.
- Breach kill chains - real cloud incidents, organized by ATT&CK.
- GRC - preventative controls upstream of detection.
- CI/CD - where the detection-as-code pipeline runs.
- Glossary - every term on this page, defined.
FAQ
What's the difference between a SOC analyst and a detection engineer?
The analyst consumes alerts; the engineer builds the rules that produce them. The analyst's day is a queue and a clock - triage minutes per alert and time-to-acknowledge. The engineer's day is a Git repo, an ATT&CK coverage map, and a CI/CD pipeline pushing rule changes to one or more SIEMs. The roles cooperate constantly - the analyst's "this rule's noisy" or "I'm seeing this pattern again" is the engineer's backlog - but they think differently and the disciplines benefit from being staffed separately at any reasonable scale.
Which cloud logs do I actually need to enable?
The non-negotiable set: an organization-level audit trail (CloudTrail org trail / Activity Log Diagnostic Settings / Cloud Audit Logs at the org node); identity-provider sign-in and audit logs (Entra, Okta, IAM Identity Center, Workspace); VPC / network flow logs on production VPCs; and the platform-native threat-detection findings (GuardDuty, Defender for Cloud, Security Command Center Premium). The expensive one - data-plane / data-access logs - should be enabled deliberately on resources holding real customer data, scoped tightly. Skipping data-plane is the single most common cloud detection blind spot.
Is Sigma worth learning if my SIEM has its own query language?
Yes - for portability. Sigma is the closest the industry has to a vendor-neutral detection format. Writing the canonical rule in Sigma and compiling to your SIEM's native language with pySigma protects you from SIEM migrations and gives you a portable detection library. The vendor language is still where final performance tuning happens; the Sigma source is where the rule lives in your repo.
How is detection-as-code different from compliance-as-code?
Both put rules in Git and deploy through CI/CD. The difference is the input data: compliance-as-code (see the GRC page) evaluates configuration state - is this S3 bucket configured correctly right now? Detection-as-code evaluates streaming events - did this CloudTrail event indicate malicious activity? The workflows look almost identical and the team skills transfer; the test harnesses and the evaluation engines differ.
Why does GCP Data Access logging matter so much?
GCP's Cloud Audit Logs split into Admin Activity (free, always on), System Event (free, always on), Policy Denied (generated by default and billable; it can be filtered out with exclusion filters), and Data Access (paid, off by default for most services; BigQuery is the notable exception). Data Access is the stream that records reads of customer data - a service account listing objects in a sensitive bucket, querying a sensitive BigQuery table, decrypting with a KMS key. Most cloud breaches involve data access; leaving the stream off saves money and blinds the detection program. Enable it deliberately on the projects that hold real data; budget for the volume.
Should I build on a SIEM or a data lake?
Most large 2026 programs run both: a SIEM (Sentinel, Splunk, Chronicle, Elastic) for the real-time, high-value correlations the SOC depends on, and a data lake (Snowflake, Databricks, BigQuery, S3 + Iceberg) with a security-analytics layer (Anvilogic, Hunters, Query.ai, Panther) for the cheaper long-tail and forensic querying. Small programs pick one and accept the trade-off - usually a cloud-native SIEM for speed-of-stand-up.
How do I validate that my detections actually work?
Three layers, complementary. Unit tests: replay a sample event against the rule and assert it fires (or doesn't). Stratus Red Team: execute real cloud attack techniques on a schedule and verify the corresponding detection lights up. Purple teaming: a red team operates in the live environment with the detection team watching, on a quarterly cadence. Without at least one of these running continuously, you have a rule library, not a detection program.
Where next
- Cloud SOC - the consume side: alert triage, SOC structure, IR playbooks.
- Incident response - what happens after the detection fires.
- Threat research - where the detection backlog comes from.
- Breach kill chains - real cloud incidents, mapped to ATT&CK techniques.
- GRC - the preventative-control discipline upstream of detection.
- Friday Zoom - detection engineering and Stratus Red Team come up regularly. Drop in.