Detection Engineering & Cloud Logging

The build side of cloud threat detection - the logs each cloud emits and where, the lifecycle that turns a threat-research finding into a tuned production rule, Sigma and detection-as-code, MITRE ATT&CK Cloud coverage, and how to validate that your detections actually fire when the technique runs. Vendor-neutral, opinionated, AWS / Azure / GCP.



The 30-second version: Detection engineering is the discipline of building, testing, deploying, and retiring the rules that decide what an alert is. It's the build side; the Cloud SOC page covers the consume side. The job is mostly editing rule files in a Git repo, replaying historical logs against changes, and confirming the rule fires when the matching technique runs in a test environment.

In cloud the work hinges on three things: knowing what each platform emits (CloudTrail, Activity Log, Cloud Audit Logs, and the data-plane stream that's often off by default); writing rules portably (Sigma as the canonical format, compiled to the SIEM's native query language); and validating against tools like Stratus Red Team that emulate real cloud attacker techniques mapped to MITRE ATT&CK Cloud.

On this page

  1. What detection engineering is
  2. The detection engineering lifecycle
  3. Cloud logging fundamentals
  4. The data-access log gotcha
  5. Sigma - the lingua franca
  6. Vendor detection languages
  7. MITRE ATT&CK Cloud Matrix
  8. Detection-as-code workflow
  9. Native threat-detection services
  10. SIEM vs data lake vs XDR
  11. Log retention & cost
  12. Signal sources beyond audit logs
  13. Building a detection - walkthrough
  14. Tuning & noise reduction
  15. Validation & purple teaming
  16. AWS / Azure / GCP side-by-side
  17. Maturity stages
  18. Common pitfalls
  19. Further reading
  20. FAQ

What detection engineering is

Detection engineering is the discipline that produces, tests, and maintains the rules a SOC depends on. It is not the same as SOC analysis. The analyst sits in front of a queue of alerts and decides whether each one represents real malicious activity. The detection engineer sits in front of a code repo and a sample-log corpus and decides what the rule should match in the first place - and what it should ignore.

The two roles think differently. A SOC analyst's currency is triage minutes per alert and the false-positive rate of their queue; a detection engineer's currency is rules-deployed-per-quarter, mean time from threat-research finding to detection in production, and ATT&CK technique coverage. They cooperate constantly - the analyst's "I'm seeing this pattern again" feeds the engineer's backlog, and the engineer's new rule lands in the analyst's queue.

The reason the role exists separately is that modern detection looks much more like software engineering than like operations. Rules live in Git. They have unit tests. They go through code review. They deploy via CI/CD to one or more SIEM or data-lake backends. They have version history, owners, ATT&CK mappings, severity ratings, and an explicit retirement criterion. Programs that treat detection as a side-of-desk activity for the SOC team accumulate stale, untested rules and never close the coverage gaps that matter.

This page is the practitioner's view of the build side. For the consume side - alert triage, SOC structure, incident playbooks - see the Cloud SOC page.

The detection engineering lifecycle

The widely-cited model is Palantir's: research → develop → tune → deploy → validate → retire. Each phase has explicit inputs, outputs, and exit criteria. The lifecycle isn't ceremony - it's the only thing that keeps a rule library from becoming a graveyard.

1. Research

Start with a threat - an ATT&CK technique, a CTI report, a vendor advisory, an internal red-team finding, a real incident of your own or a peer's. Understand the technique end-to-end: what API calls it generates, what log fields are diagnostic, what benign activity looks similar. Output: a written hypothesis of "we should be able to detect X by looking for Y in Z."

2. Develop

Write the rule. Canonical form in Sigma where possible, or directly in the backend's query language. Include metadata - ATT&CK technique IDs, severity, owner, references. Run the rule against a corpus of historical logs and a corpus of test attack telemetry. Iterate. Output: a rule file in the Git repo with passing unit tests.
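As a rough illustration of "a rule file with passing unit tests", the sketch below expresses one rule as a Python function plus metadata and replays sample events against it. The `Rule` dataclass, the rule ID, the owner name, and the test-case layout are hypothetical conventions, not any particular SIEM's format. The CloudTrail field names (`eventSource`, `eventName`, `errorCode`) are real, and T1562.008 is ATT&CK's "Disable or Modify Cloud Logs".

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    rule_id: str
    title: str
    attack: tuple[str, ...]        # ATT&CK technique IDs
    severity: str
    owner: str
    references: tuple[str, ...]
    match: Callable[[dict], bool]  # returns True when the event should alert


def cloudtrail_logging_disabled(event: dict) -> bool:
    """Match successful StopLogging / DeleteTrail calls in CloudTrail."""
    return (
        event.get("eventSource") == "cloudtrail.amazonaws.com"
        and event.get("eventName") in {"StopLogging", "DeleteTrail"}
        and "errorCode" not in event  # denied or failed calls carry an errorCode
    )


RULE = Rule(
    rule_id="aws-cloudtrail-logging-disabled",  # hypothetical ID
    title="CloudTrail logging stopped or trail deleted",
    attack=("T1562.008",),
    severity="high",
    owner="detection-eng",                      # hypothetical owner
    references=(),
    match=cloudtrail_logging_disabled,
)

# Each case: (description, sample event, expected match result).
TEST_CASES = [
    ("stop logging fires",
     {"eventSource": "cloudtrail.amazonaws.com", "eventName": "StopLogging"}, True),
    ("denied call is ignored",
     {"eventSource": "cloudtrail.amazonaws.com", "eventName": "DeleteTrail",
      "errorCode": "AccessDenied"}, False),
    ("unrelated event is ignored",
     {"eventSource": "s3.amazonaws.com", "eventName": "GetObject"}, False),
]


def run_unit_tests(rule: Rule, cases) -> list[str]:
    """Return the descriptions of failing cases; an empty list means all pass."""
    return [name for name, event, expected in cases if rule.match(event) != expected]
```

In CI, a non-empty result from `run_unit_tests` would fail the build before the rule reaches any backend.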

3. Tune

Run the rule against 7-30 days of historical production logs (a "backtest"). Measure the false-positive rate, identify benign sources, add suppressions or refine the logic. The acceptance bar is the analyst's tolerance - a rule that fires 50 times a day is dead on arrival unless every fire is an emergency. Output: a backtest report and a tuned rule.
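A backtest can be as simple as replaying exported events through the rule's match function and summarizing the hits. The sketch below assumes CloudTrail-shaped records already loaded as dicts; the `backtest` helper, its report keys, and the tolerance threshold are illustrative choices rather than a standard.

```python
from collections import Counter


def backtest(match, events, max_fires_per_day):
    """Replay historical events through `match` and summarize the result."""
    hits = [e for e in events if match(e)]
    # CloudTrail eventTime is ISO 8601, so the first 10 characters are the date.
    days = {e["eventTime"][:10] for e in events}
    fires_per_day = len(hits) / max(len(days), 1)
    sources = Counter(
        (e.get("userIdentity") or {}).get("arn", "unknown") for e in hits
    )
    return {
        "days": len(days),
        "fires": len(hits),
        "fires_per_day": fires_per_day,
        "top_sources": sources.most_common(3),  # candidates for suppression review
        "within_tolerance": fires_per_day <= max_fires_per_day,
    }


# Tiny synthetic corpus; a real backtest would read 7-30 days of exported logs.
SAMPLE_EVENTS = [
    {"eventTime": "2024-05-01T09:00:00Z", "eventName": "CreateAccessKey",
     "userIdentity": {"arn": "arn:aws:iam::111122223333:user/ci-bot"}},
    {"eventTime": "2024-05-01T09:05:00Z", "eventName": "GetObject",
     "userIdentity": {"arn": "arn:aws:iam::111122223333:user/alice"}},
    {"eventTime": "2024-05-02T10:00:00Z", "eventName": "CreateAccessKey",
     "userIdentity": {"arn": "arn:aws:iam::111122223333:user/ci-bot"}},
    {"eventTime": "2024-05-03T11:00:00Z", "eventName": "CreateAccessKey",
     "userIdentity": {"arn": "arn:aws:iam::111122223333:user/bob"}},
]

REPORT = backtest(
    lambda e: e.get("eventName") == "CreateAccessKey",
    SAMPLE_EVENTS,
    max_fires_per_day=2,
)
```

The `top_sources` list is the useful part for tuning: a single principal dominating the hits is usually the first suppression candidate to investigate.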

4. Deploy

Merge to main; CI/CD pushes the rule to the SIEM / data-lake backend(s) via API. Stage in a "test" or "low-severity" mode for 1-2 weeks; promote to alerting only after the false-positive rate holds. Output: an active production rule with a documented owner and an SLA-tagged severity.
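The staged-promotion step can be made explicit as a small gate that CI or a scheduled job evaluates. This is a sketch under assumed thresholds: the `StagedRule` record, the 7-day minimum, and the 20% false-positive ceiling are hypothetical values a team would set for itself.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class StagedRule:
    rule_id: str
    staged_on: date
    alerts: int            # alerts raised while in test / low-severity mode
    false_positives: int   # of those, how many analysts marked benign


def ready_to_promote(rule: StagedRule, today: date,
                     min_days: int = 7, max_fp_rate: float = 0.2) -> bool:
    """Promote only after the staging window has passed and the FP rate held."""
    if (today - rule.staged_on).days < min_days:
        return False
    if rule.alerts == 0:
        # A rule that stayed silent has no observed false positives; whether it
        # can fire at all is a question for the validation step, not this gate.
        return True
    return rule.false_positives / rule.alerts <= max_fp_rate
```

Keeping the gate in code means the promotion decision is reviewable in the same repo as the rule itself.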

5. Validate

Execute the matching technique with Stratus Red Team (cloud) or Atomic Red Team (general). Confirm the rule fires, the alert reaches the right queue, and the metadata is intact. Re-run on a schedule (monthly or quarterly). Output: a validation log entry tied to the rule.

6. Retire

Every rule needs a retirement criterion. The underlying API is deprecated; the technique no longer applies; the false-positive rate has crept past tolerable; a better rule supersedes it. Without an explicit retirement step, dead rules accumulate and the analyst queue silently degrades. Output: an archived rule with a stated reason.

Two things distinguish mature programs from immature ones in this lifecycle. First, every step produces an artifact that lives in version control - the hypothesis, the backtest, the validation result. Second, the loop closes: a missed-detection incident in production triggers a research item, not just a postmortem action that quietly never ships.

Cloud logging fundamentals

You can't detect what you can't see. Every detection rule starts from a log source; every log source has costs, gotchas, and a default-on or default-off state you need to know. The catalog below is the practitioner's working set for each major cloud.

AWS

Azure

GCP

Identity providers (the often-skipped layer)

If you ingest only one extra source beyond cloud control planes, ingest the IdP. The majority of cloud incidents start at identity - an MFA-fatigue push, a stolen session token, a service-account key leaked to GitHub - and the IdP log sees the first event in the kill chain.
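To make the MFA-fatigue case concrete, here is a minimal sliding-window sketch over IdP events. It assumes the logs have already been normalized into dicts with hypothetical `user`, `timestamp`, and `outcome` fields; real IdP schemas differ, and the threshold of 5 denials in 10 minutes is an arbitrary starting point.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


def mfa_fatigue_users(events, threshold=5, window=timedelta(minutes=10)):
    """Return users who denied `threshold` or more MFA pushes inside `window`."""
    recent = defaultdict(deque)  # user -> timestamps of recent denied pushes
    flagged = set()
    for ev in sorted(events, key=lambda e: e["timestamp"]):
        if ev["outcome"] != "push_denied":
            continue
        q = recent[ev["user"]]
        q.append(ev["timestamp"])
        # Drop denials that have aged out of the window.
        while ev["timestamp"] - q[0] > window:
            q.popleft()
        if len(q) >= threshold:
            flagged.add(ev["user"])
    return flagged
```

A production version would typically also correlate a subsequent successful push, since a burst of denials followed by one approval is the pattern that matters most.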

The data-access log gotcha

Each cloud splits its audit log into "what someone did to the resource" (management / admin activity) and "what someone read or wrote inside the resource" (data events / data access). The split has cost and privacy reasons; it has detection consequences that catch programs off guard.

| Cloud | Control-plane log | Data-plane log | Default state of data-plane |
| --- | --- | --- | --- |
| AWS | CloudTrail management events (free for one trail) | CloudTrail data events (paid, per-resource scoped) | Off; enable per S3 bucket, Lambda function, DynamoDB table, etc. |
| Azure | Activity Log (free) | Per-resource Diagnostic Settings (paid storage) | Off; configure Diagnostic Settings on each resource |
| GCP | Cloud Audit Logs - Admin Activity (free) | Cloud Audit Logs - Data Access (paid) | Off for most services; explicit opt-in in IAM audit-config |

The single most common cloud-detection blind spot is "we have CloudTrail / Activity Log / Cloud Audit Logs enabled" without realizing that the data-plane stream is a separate enablement. The result: you can see that an attacker assumed a role, but not that they then listed every object in a sensitive bucket. You can see who changed a Key Vault access policy, but not which secret was retrieved or by whom. You can see who granted BigQuery dataset permissions, but not who exported the data.

The pragmatic posture: enable data-plane logging on your crown-jewel data resources, scoped tightly. The cost of logging every S3 GetObject across thousands of buckets is real; the cost of logging the dozen buckets that hold customer data is trivial. Inventory first, scope deliberately, accept the bill.
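On AWS, "scoped tightly" usually means CloudTrail advanced event selectors that name the specific buckets. The sketch below only builds the selector structure from an inventory of crown-jewel bucket names. The helper name and the bucket names are hypothetical, while the `FieldSelectors` shape follows CloudTrail's advanced event selector format.

```python
def s3_data_event_selectors(bucket_names):
    """Build CloudTrail advanced event selectors for S3 object-level events
    limited to the given buckets."""
    if not bucket_names:
        raise ValueError("refusing to build a selector with no buckets")
    return [{
        "Name": "Crown-jewel S3 object-level events",
        "FieldSelectors": [
            {"Field": "eventCategory", "Equals": ["Data"]},
            {"Field": "resources.type", "Equals": ["AWS::S3::Object"]},
            # The trailing slash scopes the match to objects inside each bucket.
            {"Field": "resources.ARN",
             "StartsWith": [f"arn:aws:s3:::{b}/" for b in sorted(bucket_names)]},
        ],
    }]


# Hypothetical inventory of buckets holding customer data.
SELECTORS = s3_data_event_selectors({"customer-exports", "billing-records"})

# Applying it would look roughly like this (trail name is an assumption):
#   import boto3
#   boto3.client("cloudtrail").put_event_selectors(
#       TrailName="org-trail", AdvancedEventSelectors=SELECTORS)
```

Generating the selectors from an inventory keeps the logging scope reviewable in Git, so adding a new sensitive bucket is a one-line pull request rather than a console change.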

Sigma - the lingua franca

Sigma is a YAML-based, vendor-neutral format for describing log-based detections. It exists because every SIEM has its own query language (Splunk SPL, KQL for Sentinel and Defender, Elastic ESQL/EQL, Sumo Logic, Datadog, Panther's Python) and writing the same rule in five places is the worst possible use of detection-engineer time.

A Sigma rule names the log source, declares match conditions, and tags ATT&CK techniques, severity, references, and an owner. A converter (the legacy sigmac, the modern pySigma, or a vendor-provided one) compiles it to the target SIEM's query language. The Sigma source is the canonical artifact in your detection-as-code repo; the compiled output is a build artifact.
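To show the shape of a Sigma rule without leaving Python, the sketch below holds the rule's structure as a dict (in a repo it would be a YAML file) and evaluates it with a deliberately tiny matcher. This is a toy for illustration, not pySigma: it supports only exact matches, list-as-OR values, and two condition strings, whereas real Sigma adds modifiers, wildcards, and case-insensitive string matching. The filter ARN is a hypothetical automation role.

```python
SIGMA_RULE = {
    "title": "IAM access key created",
    "logsource": {"product": "aws", "service": "cloudtrail"},
    "detection": {
        "selection": {
            "eventSource": "iam.amazonaws.com",
            "eventName": ["CreateAccessKey"],
        },
        "filter": {
            # Hypothetical rotation automation that is expected to create keys.
            "userIdentity.arn":
                "arn:aws:sts::111122223333:assumed-role/platform-rotation/ci",
        },
        "condition": "selection and not filter",
    },
    "tags": ["attack.persistence", "attack.t1098.001"],
    "level": "medium",
}


def get_field(event: dict, dotted: str):
    """Resolve a dotted field path such as 'userIdentity.arn'."""
    cur = event
    for part in dotted.split("."):
        if not isinstance(cur, dict) or part not in cur:
            return None
        cur = cur[part]
    return cur


def matches_block(event: dict, block: dict) -> bool:
    """All fields in a block must match (AND); a list of values means OR."""
    for field, expected in block.items():
        allowed = expected if isinstance(expected, list) else [expected]
        if get_field(event, field) not in allowed:
            return False
    return True


def evaluate(rule: dict, event: dict) -> bool:
    det = rule["detection"]
    cond = det["condition"]
    if cond == "selection":
        return matches_block(event, det["selection"])
    if cond == "selection and not filter":
        return (matches_block(event, det["selection"])
                and not matches_block(event, det["filter"]))
    raise ValueError(f"unsupported condition in this toy evaluator: {cond}")
```

The point of the exercise is the separation: the rule is data, and the evaluator (or, in practice, the converter) is replaceable per backend.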

What Sigma is good at

What Sigma is not good at

The practical workflow: write the rule in Sigma where the logic fits cleanly; write directly in the backend's query language where it doesn't; treat the choice as a per-rule decision, not a religious one.

Vendor detection languages

Each backend has its own query language. You will end up reading and writing the ones in your stack; familiarity with all of them is a strong differentiator for the role.

Pick the language your stack actually runs. Then read other languages' rule libraries - Splunk's public detections, Microsoft's Sentinel content hub, Elastic's detection-rules repo, Panther's analysis pack, and Chronicle's content packs are all valuable cross-reference even when you can't run their rules directly.

MITRE ATT&CK Cloud Matrix

MITRE ATT&CK is the threat-model taxonomy detection engineers and SOC analysts share. The Enterprise matrix has cloud-specific sub-matrices that matter directly to the role:

Every detection rule in a mature program tags one or more ATT&CK techniques. The mappings are what produce the coverage report - "we have detections for 142 of the 213 cloud techniques relevant to our environment; here's the prioritized backlog of the rest." That coverage report is the single most-requested artifact when a CISO or a customer asks about the detection program's maturity.
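The coverage report itself is a set operation over rule tags. A minimal sketch, assuming each rule exposes an `attack` list of technique IDs and that the team maintains its own list of techniques relevant to the environment (both structures are hypothetical):

```python
def coverage_report(rules, relevant_techniques):
    """Compare ATT&CK tags on rules against the techniques that matter here."""
    relevant = set(relevant_techniques)
    covered = {t for rule in rules for t in rule["attack"]} & relevant
    return {
        "covered": len(covered),
        "total": len(relevant),
        "percent": round(100 * len(covered) / len(relevant), 1) if relevant else 0.0,
        "backlog": sorted(relevant - covered),  # techniques with no detection yet
    }


# Illustrative inputs; the rule IDs are made up, the technique IDs are real.
RULES = [
    {"id": "aws-iam-access-key-created", "attack": ["T1098.001"]},
    {"id": "aws-cloudtrail-logging-disabled", "attack": ["T1562.008"]},
]
RELEVANT = ["T1098.001", "T1562.008", "T1530", "T1078.004"]

REPORT = coverage_report(RULES, RELEVANT)
```

Counting a technique as "covered" because one rule carries its tag is generous; many teams refine this by requiring a passing validation run before a tag counts.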

Cross-reference ATT&CK with the open-source security stack mappings from the Center for Threat-Informed Defense, which connect ATT&CK techniques to specific AWS, Azure, and GCP service controls. This is useful when planning preventative coverage alongside detective coverage.

Detection-as-code workflow

Detection-as-code treats detection rules with the engineering rigor any other production code gets - branching, code review, automated testing, CI/CD deployment, observability of failures. The mechanics:

Reference open-source content repos for the shape: Elastic detection-rules, Azure Sentinel, Splunk Security Content, Panther Analysis. Each has its own conventions; the patterns rhyme.
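One piece of the CI stage that is easy to sketch is a metadata lint that blocks a merge when a rule lacks an owner, a valid severity, well-formed ATT&CK IDs, or a retirement criterion. The field names (`retire_when` in particular) and the severity vocabulary below are assumptions about a repo's conventions, not a standard.

```python
import re

REQUIRED_FIELDS = ("id", "title", "owner", "severity", "attack", "retire_when")
ALLOWED_SEVERITIES = {"low", "medium", "high", "critical"}
ATTACK_ID = re.compile(r"T\d{4}(\.\d{3})?")  # e.g. T1098 or T1098.001


def lint_rule(rule: dict) -> list[str]:
    """Return a list of metadata problems; an empty list means the rule passes."""
    errors = [f"missing {key}" for key in REQUIRED_FIELDS if not rule.get(key)]
    severity = rule.get("severity")
    if severity and severity not in ALLOWED_SEVERITIES:
        errors.append(f"unknown severity: {severity}")
    for technique in rule.get("attack") or []:
        if not ATTACK_ID.fullmatch(technique):
            errors.append(f"malformed ATT&CK ID: {technique}")
    return errors
```

Run over every rule file on each pull request, a check like this is what keeps owner and retirement metadata from silently rotting.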

Native threat-detection services

Each major cloud ships a managed threat-detection service. Detection engineers use them as signal sources - high-confidence findings that flow into the SIEM as one input among many - rather than as a complete detection program. The tradeoff is the same in every cloud: the managed service catches the well-known stuff cheaply; everything else still needs custom rules.

AWS GuardDuty

GuardDuty analyzes CloudTrail, VPC Flow Logs, DNS queries, S3 data events, EKS audit logs, Lambda network activity logs, EBS volume snapshots, and RDS login activity. Findings are categorized by threat purpose (Recon, Discovery, CredentialAccess, etc.) and map to ATT&CK. Strong areas: instance compromise, S3 anomalies, mining/crypto activity, credential exfiltration. Weak areas: low-and-slow insider activity, anything that requires environment-specific context (your "this user shouldn't be in this region" rule). Cost scales with API-call volume - large environments need budgeting.

Microsoft Defender for Cloud

Defender for Cloud bundles CSPM and CWPP plans, each producing findings. Defender for Servers, Containers, Storage, SQL, Key Vault, App Service, ARM, DNS, Cloud Database, and APIs each emit detections. Tightly integrated with Sentinel (the SIEM) and Defender XDR (the unified XDR plane). Strong on Microsoft-platform context - Entra, M365, Windows - and on integrating identity-side and infrastructure-side signal. Cost is per-plan and per-resource; the bill is non-trivial at scale.

Google Security Command Center

Security Command Center comes in Standard, Premium, and Enterprise tiers. Premium and Enterprise add Event Threat Detection (audit-log-based), Container Threat Detection, VM Threat Detection (memory scanning), Web Security Scanner, and Security Health Analytics. Enterprise (formerly Mandiant Hunt / Chronicle integration) layers Chronicle SecOps and Mandiant threat intelligence on top. Findings stream to Pub/Sub for SIEM ingestion. Strong areas: GCP-native API anomalies, container runtime threats. Weak areas: cross-cloud correlation, custom detection logic - for which you'd use Chronicle SecOps directly.

What native services don't replace

Every native service is built around the patterns its vendor has seen across all customers. Environment-specific detections - your access patterns, your service accounts, your geo footprint, your business-hours norms - only you can write. The native finding stream is one of many signal sources in your SIEM; treat it as a high-priority queue, not as the whole detection program.
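Treating the native stream as "one of many signal sources" generally means normalizing findings into the same alert shape as custom rules. The sketch below does that for a GuardDuty finding as returned by the GetFindings API (`Id`, `Type`, `Severity`, `Title`, `AccountId`, `Region`). The output schema and the queue names are hypothetical; the severity bands follow GuardDuty's documented low (1.0-3.9), medium (4.0-6.9), and high (7.0 and above) ranges, with anything higher folded into "high" here.

```python
def normalize_guardduty(finding: dict) -> dict:
    """Map a GuardDuty finding into a common alert shape for the SIEM queue."""
    # Finding types look like "ThreatPurpose:ResourceType/ThreatFamily...".
    threat_purpose = finding["Type"].split(":", 1)[0]
    score = float(finding["Severity"])
    if score >= 7.0:
        severity = "high"
    elif score >= 4.0:
        severity = "medium"
    else:
        severity = "low"
    return {
        "source": "guardduty",
        "id": finding["Id"],
        "title": finding["Title"],
        "account": finding["AccountId"],
        "region": finding["Region"],
        "threat_purpose": threat_purpose,
        "severity": severity,
        # Hypothetical routing: high-severity native findings jump the queue.
        "queue": "priority" if severity == "high" else "standard",
    }
```

Once findings share a schema with custom alerts, the same enrichment, deduplication, and suppression logic can apply to both.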

SIEM vs data lake vs XDR

Three competing architectural patterns for "where the logs go and where the detections run." Most large programs run more than one.

Traditional SIEM

Splunk Enterprise Security, QRadar, ArcSight, Elastic Security. Mature ecosystem, deep correlation engines, packaged dashboards and incident workflows. Cost model historically scales with ingest volume - the structural reason data-lake patterns are eating the low end of this market.

Cloud-native SIEM

Microsoft Sentinel, Google Chronicle / SecOps, Sumo Logic, Datadog Cloud SIEM, Panther. Cloud-hosted, usually cheaper-per-GB than traditional SIEM, tight integration with the vendor's broader platform. Chronicle's flat-fee-per-employee model is unusual and worth modeling for large environments.

Security data lake

Snowflake, Databricks, BigQuery, S3 + Iceberg + Athena. Cheap, schema-on-read, long retention, arbitrary analytics. Pair with a security-analytics layer - Anvilogic, Hunters, Query.ai, Panther, or Snowflake's own Horizon - to do the SIEM-shaped work on top.

XDR

Microsoft Defender XDR, CrowdStrike Falcon, SentinelOne Singularity, Palo Alto Cortex XDR. Endpoint-centric platforms extended to identity, cloud, and email. The vendor owns the detection content for their own telemetry; you write custom content in their query language. Often complementary to a SIEM, not a replacement.

The 2026 reality

Most mature programs run a hybrid:

The cost model conversation is rarely about which tool is cheapest in isolation. It's about where each log lands: high-cost SIEM for the 20% of logs that drive 80% of real-time detections, cheap data lake for the rest, with the detection engineer choosing per source.
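The per-source routing decision can be derived from the rule inventory rather than maintained by hand. In this sketch, a source lands in the SIEM only if at least one real-time rule reads it; everything else goes to the lake. The source names and the `realtime` / `sources` rule fields are hypothetical.

```python
def plan_routing(sources, rules):
    """Send a log source to the SIEM only if a real-time rule depends on it."""
    realtime_sources = {
        src for rule in rules if rule["realtime"] for src in rule["sources"]
    }
    return {src: ("siem" if src in realtime_sources else "lake") for src in sources}


# Illustrative inventory.
SOURCES = ["cloudtrail_management", "idp_signin", "vpc_flow", "s3_data_events"]
RULES = [
    {"id": "iam-key-created", "realtime": True, "sources": ["cloudtrail_management"]},
    {"id": "mfa-fatigue", "realtime": True, "sources": ["idp_signin"]},
    {"id": "monthly-egress-hunt", "realtime": False, "sources": ["vpc_flow"]},
]

ROUTING = plan_routing(SOURCES, RULES)
```

In practice SIEM-bound sources are usually also archived to the lake for retention; this sketch only decides where detections need the data hot.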

Log retention & cost

Retention has three drivers: regulatory floors, detection needs, and incident-response needs. The three want different things.

Tiering

Raw vs aggregated

Keep raw events in the lake; aggregate / summarize what you keep in the hot SIEM. The aggregation patterns: hourly counters of CloudTrail events per principal, daily summaries of VPC Flow Logs per VPC, sessionized identity logins. Aggregated indices answer the trend questions; raw archives answer "what exactly happened on day X?"
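The first aggregation pattern, hourly counters of CloudTrail events per principal, reduces to a keyed count. A minimal sketch over CloudTrail-shaped dicts; the function name and the tuple key layout are illustrative choices.

```python
from collections import Counter


def hourly_counts(events):
    """Count CloudTrail events per (hour, principal ARN)."""
    counts = Counter()
    for event in events:
        # eventTime is ISO 8601 UTC, e.g. "2024-05-01T12:34:56Z";
        # the first 13 characters identify the hour ("2024-05-01T12").
        hour = event["eventTime"][:13]
        principal = (event.get("userIdentity") or {}).get("arn", "unknown")
        counts[(hour, principal)] += 1
    return counts
```

The resulting counters are what the hot SIEM index keeps for trend and baseline questions, while the raw events they were computed from stay in the lake.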

Signal sources beyond audit logs

Audit logs cover the API plane; the workload plane and the data plane have their own signal sources, and a mature detection program incorporates them all.

Building a detection - walkthrough

A concrete example illustrates the lifecycle better than the abstract version. Threat: an attacker who has compromised an IAM principal in AWS creates a new access key for an existing IAM user as a persistence mechanism, then uses that access key from outside the org's normal geography.

Research

The ATT&CK technique is T1098.001 - Account Manipulation: Additional Cloud Credentials. The AWS API call is CreateAccessKey on IAM. The diagnostic fields in CloudTrail: eventName=CreateAccessKey, userIdentity (who did it), requestParameters.userName (the target user), responseElements.accessKey.accessKeyId (the new key ID). Benign cases: an admin onboarding a new service integration, a CI system rotating its own key.

Develop

Two rules, not one. Rule A: a high-signal-low-volume rule that fires on any CreateAccessKey against a user with the service-account tag (those keys should be rotated by the platform team's automation, not manually). Rule B: a correlation rule that fires when a key created in the last 24 hours is used from a country outside the organization's approved-country list. The second rule requires joining CloudTrail with IP-geolocation enrichment - easier in KQL / SPL than in pure Sigma.
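The logic of the two rules can be sketched in Python to make the correlation explicit. Several things here are assumptions: `service_accounts` stands in for a lookup of IAM users carrying the service-account tag (CloudTrail events do not include the target user's tags), and `geo_country` is a hypothetical field added by an IP-geolocation enrichment step. The CloudTrail paths (`requestParameters.userName`, `responseElements.accessKey.accessKeyId`, `userIdentity.accessKeyId`) are the real ones.

```python
from datetime import datetime, timedelta


def parse_time(value: str) -> datetime:
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")


def rule_a(event: dict, service_accounts: set) -> bool:
    """Fire on CreateAccessKey targeting a tagged service-account user."""
    target = (event.get("requestParameters") or {}).get("userName")
    return (
        event.get("eventSource") == "iam.amazonaws.com"
        and event.get("eventName") == "CreateAccessKey"
        and target in service_accounts
    )


def rule_b(events, allowed_countries, max_age=timedelta(hours=24)):
    """Return key IDs used outside the approved countries within max_age of creation."""
    created = {}  # access key ID -> creation time
    alerts = []
    for event in sorted(events, key=lambda e: e["eventTime"]):
        when = parse_time(event["eventTime"])
        if event.get("eventName") == "CreateAccessKey":
            new_key = ((event.get("responseElements") or {})
                       .get("accessKey") or {}).get("accessKeyId")
            if new_key:
                created[new_key] = when
            continue
        used_key = (event.get("userIdentity") or {}).get("accessKeyId")
        born = created.get(used_key)
        # A missing geo_country is treated as outside the list (fail closed).
        if (born is not None and when - born <= max_age
                and event.get("geo_country") not in allowed_countries):
            alerts.append(used_key)
    return alerts
```

Rule B keeps state across events, which is the part that does not map cleanly onto a single-event Sigma rule and is why the text suggests writing it in the backend's native language.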

Tune

Backtest Rule A against 30 days of CloudTrail. Discover that the platform team's emergency-rotation runbook also fires the rule. Add a suppression: userIdentity.arn matching the platform-team break-glass role. Backtest again - 2 fires/month, both genuine investigations. Acceptable. Backtest Rule B against the same window. Discover that traveling executives generate false positives. Add a per-user "approved geo" allowlist sourced from HRIS travel data.

Deploy

Merge the PR. CI compiles the Sigma source to KQL (for Sentinel) and to Panther Python (for the data-lake side). Both deploy. Severity: Medium for Rule A, High for Rule B. SLA: 1 hour for High, 4 hours for Medium.

Validate

Run the Stratus Red Team technique aws.persistence.iam-backdoor-user (which creates an access key on an existing IAM user) in the test AWS account. Confirm Rule A fires in Sentinel within 5 minutes. For Rule B, follow up by exercising the new key from a non-allowlisted IP via a test runner. Confirm the alert. Log both validations against the rule IDs.

Retire

Add an explicit retirement criterion: if the org migrates fully to short-lived federated credentials (no more long-lived IAM access keys), the rule becomes meaningless and should be archived.

That entire workflow lives in a PR with a written hypothesis, two rule files, four test cases, a backtest report, a validation log entry, and a documented retirement criterion. Multiply across 200 rules and you have a detection program.

Tuning & noise reduction

Detection programs die of false positives. An analyst queue full of low-precision alerts trains the human to dismiss everything, and the real incident sits in the noise. The mechanics of keeping that from happening:
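One mechanic that is simple to sketch is per-rule precision tracking from analyst dispositions, which gives an objective list of rules to tune or retire. The disposition labels and the 50% target below are hypothetical; the input is assumed to be (rule ID, verdict) pairs exported from the case-management system.

```python
from collections import defaultdict


def precision_by_rule(dispositions):
    """Compute true-positive share per rule from (rule_id, verdict) pairs."""
    true_positives = defaultdict(int)
    totals = defaultdict(int)
    for rule_id, verdict in dispositions:
        totals[rule_id] += 1
        if verdict == "true_positive":
            true_positives[rule_id] += 1
    return {rule_id: true_positives[rule_id] / totals[rule_id] for rule_id in totals}


def below_target(precisions, target=0.5):
    """Rules whose precision has dropped under the target, for the tuning backlog."""
    return sorted(rule_id for rule_id, p in precisions.items() if p < target)
```

Reviewing the `below_target` list on a fixed cadence turns "this rule is noisy" from an anecdote into a backlog item with a number attached.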

Validation & purple teaming

A rule library you've never tested is a rule library you have to assume is broken. Validation falls into three flavors that complement each other.

Atomic-style automated emulation

Purple teaming

An offensive team (internal red team or external engagement) runs realistic operations against the live environment with the detection team watching. For each technique, record whether the detection fired, at what severity, with what fidelity, and how quickly. Purple teaming is dense - a one-day exercise can surface a quarter's worth of detection-engineering backlog.

Breach & attack simulation (BAS)

Commercial platforms (AttackIQ, SafeBreach, Picus, XM Cyber, Cymulate) automate the purple-team cadence with broad technique libraries and built-in reporting. The justification compared to OSS (Stratus + Atomic) is the reporting layer, the technique breadth, and the integration with the SIEM and ticketing.

Continuous validation

Whichever stack you pick, run validation continuously - not just at the end of a quarter. A weekly Stratus run hitting every cloud-detection rule, with the result piped to a dashboard, is a sustainable cadence for a 1-2 person detection team. Detections decay silently (an API schema changes, a log field renames, a SIEM tuning regresses); continuous validation is the only way to catch the decay before an attacker does.
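A scheduled validation run boils down to: detonate a technique, poll the SIEM for the expected alert, record the result. The harness below is a sketch with the SIEM lookup injected as a callable, since that part is backend-specific. The rule IDs and the `alert_fired` function are hypothetical; `stratus detonate` and `stratus cleanup` are the Stratus Red Team CLI commands, and the technique IDs in the usage note are from its AWS catalog.

```python
import subprocess
import time


def detonate_with_stratus(technique_id: str) -> None:
    """Run a Stratus Red Team technique, then clean up its infrastructure.
    Requires the stratus CLI and credentials for a dedicated test account."""
    subprocess.run(["stratus", "detonate", technique_id], check=True)
    subprocess.run(["stratus", "cleanup", technique_id], check=True)


def run_validation(mapping, detonate, alert_fired,
                   timeout_s=300, poll_s=15, sleep=time.sleep):
    """For each technique -> rule pair, detonate and wait for the alert.

    mapping:     {technique_id: rule_id}
    detonate:    callable(technique_id) that executes the technique
    alert_fired: callable(rule_id) -> bool, a backend-specific SIEM query
    """
    results = {}
    for technique_id, rule_id in mapping.items():
        detonate(technique_id)
        waited = 0
        fired = alert_fired(rule_id)
        while not fired and waited < timeout_s:
            sleep(poll_s)
            waited += poll_s
            fired = alert_fired(rule_id)
        results[rule_id] = fired
    return results
```

A weekly job might call `run_validation({"aws.persistence.iam-backdoor-user": "rule-a", "aws.defense-evasion.cloudtrail-stop": "rule-b"}, detonate_with_stratus, alert_fired)` and push any `False` results to the dashboard as failed validations.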

AWS, Azure, and GCP side-by-side

The detection-relevant native primitives each cloud ships, reduced to a one-screen reference:

| Capability | AWS | Azure | GCP |
| --- | --- | --- | --- |
| Control-plane audit log | CloudTrail (management events) | Activity Log (subscription / mgmt group) | Cloud Audit Logs - Admin Activity |
| Data-plane audit log | CloudTrail data events (paid, off by default) | Per-resource Diagnostic Settings (paid, off by default) | Cloud Audit Logs - Data Access (paid, off by default) |
| Identity sign-in log | IAM Identity Center sign-ins; CloudTrail for AssumeRole | Entra ID Sign-in Logs & Audit Logs | Cloud Audit Logs for IAM; Workspace Reports API |
| Network flow logs | VPC Flow Logs (ENI / subnet / VPC) | NSG Flow Logs / VNet Flow Logs (v2) | VPC Flow Logs (subnet-level) |
| DNS query logs | Route 53 Resolver query logs | Azure DNS analytics | Cloud DNS query logs |
| Managed threat detection | GuardDuty (10+ feature sets) | Defender for Cloud (per-resource plans) | Security Command Center Premium / Enterprise (ETD, CTD, VMTD) |
| Finding aggregator | Security Hub (ASFF) | Defender for Cloud / Sentinel | Security Command Center |
| Native SIEM | (none; CloudTrail Lake for limited) | Microsoft Sentinel | Google SecOps / Chronicle |
| SaaS audit (vendor's own) | (N/A - IAM Identity Center only) | Microsoft 365 Unified Audit Log | Google Workspace Reports API |
| Default audit retention | 90 days (console); indefinite if shipped to S3 | 90 days for Activity Log; configurable for Log Analytics | 400 days for Admin Activity; configurable for others |

The structural difference: Microsoft and Google ship their own SIEM (Sentinel, Chronicle / SecOps); AWS does not, and most large AWS shops run Splunk, Sentinel, Chronicle, or Panther on top of CloudTrail. AWS's CloudTrail Lake is closing the gap on the simplest cases but isn't a full SIEM replacement.

Maturity stages

A useful staging model for a cloud detection-engineering program:

Stage 1 - Wired

Control-plane audit logs (CloudTrail / Activity Log / Cloud Audit Logs) shipping to a SIEM. Native threat-detection services on (GuardDuty / Defender / SCC). Alerts route to one queue. Rules are mostly vendor-default. No detection-as-code yet; rule edits happen in the SIEM UI.

Stage 2 - Authored

Custom rules written for the environment's specific patterns. Sigma adopted for portable rules; vendor-language for the rest. ATT&CK tags on every rule. A coverage dashboard exists. Identity-provider logs ingested. Data-plane logging enabled on crown-jewel resources.

Stage 3 - Engineered

Detection-as-code repo with CI/CD to one or more SIEMs. Unit tests for every rule. Stratus Red Team running on a schedule against the cloud detection set. Per-rule precision targets tracked. Risk-based alerting stacking signals. Coverage report visible to the CISO.

Stage 4 - Adversarial

Purple-team cadence quarterly or better. Threat-intel-driven research backlog. New ATT&CK techniques (post-publication) have detections within an SLA. Detection-engineering team separate from SOC. Data-lake + SIEM hybrid with cost-aware log routing. Validation results feed engineering OKRs.

The skip-stage cost: attempting detection-as-code without an alert queue anyone trusts is automating against an unloved artifact. Each stage builds on the credibility of the prior one.

Common pitfalls

Further reading

Foundational

Sigma & rule languages

Open-source detection content

Validation

Provider documentation

Related CSOH pages

FAQ

What's the difference between a SOC analyst and a detection engineer?

The analyst consumes alerts; the engineer builds the rules that produce them. The analyst's day is a queue and a clock - triage minutes per alert and time-to-acknowledge. The engineer's day is a Git repo, an ATT&CK coverage map, and a CI/CD pipeline pushing rule changes to one or more SIEMs. The roles cooperate constantly - the analyst's "this rule's noisy" or "I'm seeing this pattern again" is the engineer's backlog - but they think differently and the disciplines benefit from being staffed separately at any reasonable scale.

Which cloud logs do I actually need to enable?

The non-negotiable set: an organization-level audit trail (CloudTrail org trail / Activity Log Diagnostic Settings / Cloud Audit Logs at the org node); identity-provider sign-in and audit logs (Entra, Okta, IAM Identity Center, Workspace); VPC / network flow logs on production VPCs; and the platform-native threat-detection findings (GuardDuty, Defender for Cloud, Security Command Center Premium). The expensive one - data-plane / data-access logs - should be enabled deliberately on resources holding real customer data, scoped tightly. Skipping data-plane is the single most common cloud detection blind spot.

Is Sigma worth learning if my SIEM has its own query language?

Yes - for portability. Sigma is the closest the industry has to a vendor-neutral detection format. Writing the canonical rule in Sigma and compiling to your SIEM's native language with pySigma protects you from SIEM migrations and gives you a portable detection library. The vendor language is still where final performance tuning happens; the Sigma source is where the rule lives in your repo.

How is detection-as-code different from compliance-as-code?

Both put rules in Git and deploy through CI/CD. The difference is the input data: compliance-as-code (see the GRC page) evaluates configuration state - is this S3 bucket configured correctly right now? Detection-as-code evaluates streaming events - did this CloudTrail event indicate malicious activity? The workflows look almost identical and the team skills transfer; the test harnesses and the evaluation engines differ.

Why does GCP Data Access logging matter so much?

GCP's Cloud Audit Logs split into Admin Activity (free, always on), System Event (free, always on), Policy Denied (on by default, billable, removable only via exclusion filters), and Data Access (paid, off by default for most services). Data Access is the stream that records reads of customer data - a service account listing objects in a sensitive bucket, querying a sensitive BigQuery table, decrypting with a KMS key. Most cloud breaches involve data access; leaving the stream off saves money and blinds the detection program. Enable it deliberately on the projects that hold real data; budget for the volume.

Should I build on a SIEM or a data lake?

Most large 2026 programs run both: a SIEM (Sentinel, Splunk, Chronicle, Elastic) for the real-time, high-value correlations the SOC depends on, and a data lake (Snowflake, Databricks, BigQuery, S3 + Iceberg) with a security-analytics layer (Anvilogic, Hunters, Query.ai, Panther) for the cheaper long-tail and forensic querying. Small programs pick one and accept the trade-off - usually a cloud-native SIEM for speed-of-stand-up.

How do I validate that my detections actually work?

Three layers, complementary. Unit tests: replay a sample event against the rule and assert it fires (or doesn't). Stratus Red Team: execute real cloud attack techniques on a schedule and verify the corresponding detection lights up. Purple teaming: a red team operates in the live environment with the detection team watching, on a quarterly cadence. Without at least one of these running continuously, you have a rule library, not a detection program.
