The honest version: Cloud detection engineering is one of the most technically demanding roles in cloud security, and it never stops being so. Your job is to write rules that catch real attackers - but the log sources keep changing, the attack techniques keep evolving, the SIEM never stops needing tuning, and every new managed service your engineering org adopts is a new class of telemetry you've never written a rule for. The engineers who thrive here genuinely enjoy the adversarial puzzle. The ones who burn out thought it was mostly a SIEM-admin job.
This page is the deep version of the summary card on the careers overview. Numbers are US-centric, 2026, and approximate.
On this page
- What a cloud detection engineer actually does
- Why the cloud version is a different job
- Cloud log sources, in depth
- Detecting identity-based attacks
- The learning treadmill, in detail
- A week in the life
- A day in the life: Wednesday at a fintech
- The skill stack
- Detection-as-code in practice
- Tools of the trade
- The multi-cloud dimension
- How the role changes by company stage
- Salary & compensation
- The interview loop for this role
- Portfolio projects that prove the role
- How to break in and pivot from adjacent roles
- Where this role leads
- Common mistakes
- Who this role is not for
- How AI is changing the role
- Quick answers
- Open-source content and community resources
- Where next
What a cloud detection engineer actually does
Strip away the job-description language and the work is this: you are responsible for making sure that when an attacker is operating inside your cloud environment, something fires. Not just a canned vendor alert that everyone ignores - a tuned, well-contextualized detection that routes to the right analyst, contains enough supporting data to investigate without a second log query, and was validated against a real attack simulation before it shipped. That sounds straightforward. It isn't.
On a normal week the work breaks into several threads running in parallel:
- Writing new detections. A threat intelligence report describes a new technique. A purple-team run exposed a coverage gap. A new managed service just reached general availability in the cloud your engineers use. Each of those is a new rule. You research the relevant log source, understand the event schema, write a Sigma rule or native SIEM query, test it against both benign and malicious samples, peer review it, and ship it through the pipeline.
- Tuning existing detections. The SIEM always needs tuning. A rule that fires 400 times a day in an environment where a developer legitimately calls AssumeRole 200 times an hour is a noise machine, not a detection. You analyze the false-positive population, scope the exclusion precisely, document why the exclusion is safe, and re-test that the real attack still fires.
- Coverage mapping. You maintain a living map of which MITRE ATT&CK Cloud techniques have detection coverage, which have partial coverage, and which have none. The map drives prioritization. It also drives the conversation with leadership about what "covered" means - a noisy, never-tuned rule that would technically fire is not coverage.
- Purple-teaming. Roughly monthly (more often if you have a dedicated red team), you or a partner runs atomic attack simulations against a realistic environment - Stratus Red Team, Atomic Red Team, or custom scripts - and validates that your detections actually fire. The simulations that don't fire are the ones that matter most.
- Rule lifecycle management. Cloud provider event schemas change. A rule written against the 2024 CloudTrail structure for sts:AssumeRole may silently stop working after a provider schema update. You own the monitoring that detects when rules go stale, and you own fixing them.
- Detection-as-code pipeline work. You maintain the CI/CD system that tests, deploys, and versions detection rules. When the pipeline breaks - or when you're rebuilding it on a new SIEM - that's engineering work, not analyst work.
- Threat-intel consumption and translation. Vendor reports, CISA advisories, and threat intelligence feeds describe attacker techniques in narrative form. You translate them into detection logic: what specific API calls, what field combinations, what sequences of events would this technique produce in your log sources? The gap between "understanding a technique" and "having a rule that fires on it" is where most of the craft lives.
- SOC enablement. You are not the person who responds to your alerts - but you need to make sure the people who do can act on them effectively. That means writing good alert descriptions, mapping to relevant playbooks, providing investigation context (what other events to look for), and being available when an analyst has a question about why a rule fired. A detection without context is just noise with documentation.
Notice that "responding to alerts" is not on that list. At most orgs, the detection engineer builds the detections and hands off to the SOC or IR team who respond to them. The overlap is real - you need to understand how analysts use your rules to make them useful - but the primary output is code: rules, pipelines, coverage documentation, and purple-team reports.
The ratio of new-rule-writing to tuning-existing-rules shifts heavily toward tuning as the detection library matures. A team with 50 rules spends most of its time writing new ones. A team with 500 rules spends most of its time keeping those 500 working. Neither phase is more important than the other, but candidates who've only ever worked in one phase often underestimate the other.
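The tuning loop described above (scope the exclusion precisely, then re-test that the real attack still fires) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production rule: the `Exclusion` shape, the `ci-deployer` role, the account ID, and the CIDR are all hypothetical, and the events are trimmed CloudTrail-style records.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Exclusion:
    """A narrowly scoped exclusion: one caller ARN prefix seen from one network."""
    principal_arn_prefix: str  # hypothetical CI role session prefix
    source_cidr: str           # hypothetical CI runner address range
    reason: str                # the documented justification travels with the rule


def fires(event: dict, exclusions: list) -> bool:
    """Return True if an AssumeRole event should raise an alert."""
    if event.get("eventName") != "AssumeRole":
        return False
    arn = event.get("userIdentity", {}).get("arn", "")
    try:
        ip = ipaddress.ip_address(event.get("sourceIPAddress", ""))
    except ValueError:
        ip = None  # the field can hold a service name instead of an IP
    for ex in exclusions:
        same_principal = arn.startswith(ex.principal_arn_prefix)
        same_network = ip is not None and ip in ipaddress.ip_network(ex.source_cidr)
        if same_principal and same_network:
            return False  # both conditions must hold to suppress
    return True


CI_EXCLUSION = Exclusion(
    principal_arn_prefix="arn:aws:sts::111111111111:assumed-role/ci-deployer/",
    source_cidr="10.20.0.0/16",
    reason="CI runners assume deployment roles at high volume",
)

benign = {
    "eventName": "AssumeRole",
    "sourceIPAddress": "10.20.3.4",
    "userIdentity": {"arn": "arn:aws:sts::111111111111:assumed-role/ci-deployer/build-42"},
}
# Same principal, but the credentials are used from an unexpected network.
attack = {**benign, "sourceIPAddress": "203.0.113.50"}

print(fires(benign, [CI_EXCLUSION]), fires(attack, [CI_EXCLUSION]))
```

Requiring both the principal and the network to match is the design choice that matters here: an exclusion keyed on the role alone would also suppress stolen CI credentials used from elsewhere.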
The scope of "what you own" also expands with seniority in ways that aren't always obvious from the outside. A junior detection engineer owns their individual rules. A senior detection engineer owns a coverage domain (identity techniques, data access techniques) end-to-end, including the purple-team validation schedule for that domain and the relationship with the IR team that uses those detections. A staff detection engineer owns the detection program's measurement framework - how coverage is quantified, how it's communicated to leadership, how prioritization decisions are made across the team. Each level adds a different kind of responsibility, not just more of the same thing.
Why the cloud version is a different job
Traditional detection engineering matured in environments where you had endpoint agents on every host, network sensors watching east-west traffic, and a relatively stable attack surface. Cloud removes most of that and replaces it with something fundamentally different. These are the twists that separate cloud detection from the version in your SANS blue-team courses.
1. You detect from the control plane, not from hosts
The most important telemetry in cloud is not what happened on a VM's filesystem - it's what happened to the cloud API. CloudTrail, Azure Activity Logs, and GCP Audit Logs record the API calls made against the cloud fabric: who assumed which role, which IAM policy was changed, and, where data-level logging is enabled, which bucket was listed or which Lambda was invoked. On an ephemeral Fargate task or a serverless function, there may be no host to put an EDR agent on. The control plane is not backup telemetry - it is often the only telemetry. That means learning to read API-call logs fluently is as fundamental to this role as learning to read process-creation events is to a traditional endpoint detection engineer.
2. Every cloud provider structures logs differently - and so does every service
A Sigma rule that fires on a CloudTrail AssumeRole event is not automatically portable to an Azure Activity Log Microsoft.Authorization/roleAssignments/write event or a GCP Audit Log SetIamPolicy call. The fields have different names, different semantics, and different enrichment. At the service level the heterogeneity is even more extreme: S3 data events and CloudTrail management events use different schemas; RDS and DynamoDB emit logs through different channels; container workloads on ECS vs. EKS vs. Lambda each generate different telemetry. Every new managed service that reaches general availability is a new schema you haven't seen. You can't write a detection for it until you understand its log structure - and that understanding requires hands-on work, not just reading documentation.
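One common way to cope with this heterogeneity is a thin normalization layer that maps each provider's fields onto a shared vocabulary before rules run. The sketch below is illustrative only: the four normalized names are invented for this example, and the field paths reflect commonly seen export shapes that should be checked against your own data (Azure in particular varies by export route).

```python
from typing import Any


def get_path(record: dict, path: str) -> Any:
    """Walk a dotted path through nested dicts; return None if any hop is missing."""
    cur: Any = record
    for key in path.split("."):
        if not isinstance(cur, dict) or key not in cur:
            return None
        cur = cur[key]
    return cur


# Where the same four concepts live in each provider's audit record (assumed paths).
FIELD_MAP = {
    "aws": {
        "action": "eventName",
        "actor": "userIdentity.arn",
        "source_ip": "sourceIPAddress",
        "time": "eventTime",
    },
    "azure": {
        "action": "operationName",
        "actor": "caller",
        "source_ip": "callerIpAddress",
        "time": "time",
    },
    "gcp": {
        "action": "protoPayload.methodName",
        "actor": "protoPayload.authenticationInfo.principalEmail",
        "source_ip": "protoPayload.requestMetadata.callerIp",
        "time": "timestamp",
    },
}


def normalize(provider: str, record: dict) -> dict:
    """Project a provider-specific record onto the shared field names."""
    out = {name: get_path(record, path) for name, path in FIELD_MAP[provider].items()}
    out["provider"] = provider
    return out


gcp_record = {
    "timestamp": "2026-01-14T10:00:00Z",
    "protoPayload": {
        "methodName": "SetIamPolicy",
        "authenticationInfo": {"principalEmail": "dev@example.com"},
        "requestMetadata": {"callerIp": "198.51.100.7"},
    },
}
print(normalize("gcp", gcp_record))
```

Normalization makes rules portable in form only. The semantics still differ per provider, so each mapping needs its own test events.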
3. Identity-based attacks dominate over malware execution
In cloud environments, the dominant attack patterns are not "malware executed on a host" - they are "credential was obtained and then used to call APIs." Role assumption chaining, IAM key exfiltration and abuse, OAuth consent phishing, token replay, and cross-account role assumption are the techniques showing up in real breaches at scale. These attacks often leave no trace in any endpoint telemetry. They are entirely visible in control-plane logs - but only if you're querying the right fields, at the right time, with rules tuned to distinguish the attack pattern from legitimate developer and automation behavior. The false-positive surface for identity-based detections in active engineering environments is enormous, which is why this is genuinely hard to do well.
The implication for your rule library: roughly 60-70% of your cloud detection coverage should be on identity techniques (credential access, privilege escalation through IAM, lateral movement via role assumption, and persistence via policy attachment), not on execution and exfiltration techniques the way an endpoint detection library would be weighted. This is a significant rebalancing from what most SANS blue-team training prepares you for, and it requires building a detailed mental model of what "normal" IAM activity looks like in your specific environment before you can reliably distinguish attacker behavior from legitimate use.
4. Rules silently rot when event schemas change
Cloud providers change API behavior, add new event fields, deprecate old ones, and sometimes restructure entire event schemas with minimal announcement. A rule that worked last quarter may silently stop firing this quarter - not because the attack technique disappeared, but because the field it was matching on no longer exists, or has moved to a nested JSON structure, or is now empty by default. This rot is insidious because it reduces coverage without generating any visible alerts. Detection engineers in cloud need a systematic monitoring approach to detect when their own rules have gone stale - and that monitoring is itself a detection problem.
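One simple form of that monitoring is a field-presence check: for each rule, confirm that the fields it matches on still carry values in a recent sample of the events it targets. The sketch below assumes you can pull such a sample from your log store; the function names and the sample records are hypothetical.

```python
from typing import Any


def has_value(event: dict, path: str) -> bool:
    """True if the dotted path exists in the event and holds a non-empty value."""
    cur: Any = event
    for key in path.split("."):
        if not isinstance(cur, dict) or key not in cur:
            return False
        cur = cur[key]
    return cur not in (None, "", [], {})


def stale_fields(required_paths: list, recent_events: list) -> list:
    """Return the required paths that never hold a value in the sample.

    An empty sample reports every path as stale, which is deliberate:
    a rule whose target events have vanished also deserves a look.
    """
    return [
        path
        for path in required_paths
        if not any(has_value(ev, path) for ev in recent_events)
    ]


# Fields a hypothetical role-assumption rule depends on.
rule_fields = [
    "requestParameters.roleArn",
    "userIdentity.sessionContext.sessionIssuer.arn",
]

# A recent sample in which one field has silently disappeared.
sample = [
    {"requestParameters": {"roleArn": "arn:aws:iam::111111111111:role/deploy"},
     "userIdentity": {"sessionContext": {}}},
    {"requestParameters": {"roleArn": "arn:aws:iam::111111111111:role/admin"},
     "userIdentity": {}},
]

print(stale_fields(rule_fields, sample))
```

Run on a schedule against every rule's declared fields, a check like this surfaces rot as a ticket rather than as a missed detection.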
5. Ephemeral workloads mean your evidence window is narrow
A container that ran for ninety seconds and exfiltrated a credential is gone before the analyst opens the ticket. You either captured the evidence while it ran - in the control-plane logs, in container runtime telemetry, in VPC flow logs - or you didn't. Post-incident forensics on ephemeral workloads often consists of "here is what the API logs recorded" and nothing more. This puts pressure on detection quality before incidents happen, not just during them. Every technique gap in your ATT&CK coverage map is a place where you'll have no evidence if an attacker uses it.
This is also why detection latency matters in a way that on-prem security often didn't. A detection that fires in thirty minutes on an ephemeral workload may fire after the workload and its evidence are gone. The log-to-detection latency of your pipeline - the time from an API call happening to a SIEM alert firing - is an operational measurement worth tracking. Sub-five-minute end-to-end latency is achievable with streaming ingestion; thirty-plus minutes is common with batch ingestion pipelines, and the difference is significant for ephemeral workload attacks.
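Measuring that latency needs nothing more than the event timestamp and the alert timestamp for a sample of detections. A minimal sketch follows, assuming ISO 8601 UTC timestamps like those in CloudTrail's eventTime; the five-minute target and the sample values are illustrative.

```python
from datetime import datetime
from statistics import median


def parse_ts(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def latency_seconds(event_time: str, alert_time: str) -> float:
    """Seconds between the API call and the alert that reported it."""
    return (parse_ts(alert_time) - parse_ts(event_time)).total_seconds()


def summarize(pairs: list, target_seconds: float = 300.0) -> dict:
    """Summarize log-to-detection latency for (event_time, alert_time) pairs."""
    latencies = [latency_seconds(e, a) for e, a in pairs]
    return {
        "median_s": median(latencies),
        "max_s": max(latencies),
        "within_target": sum(1 for x in latencies if x <= target_seconds),
        "total": len(latencies),
    }


pairs = [
    ("2026-01-14T10:00:00Z", "2026-01-14T10:03:00Z"),  # streaming path
    ("2026-01-14T10:00:00Z", "2026-01-14T10:04:00Z"),  # streaming path
    ("2026-01-14T10:00:00Z", "2026-01-14T10:40:00Z"),  # batch path
]
print(summarize(pairs))
```

Tracking the maximum alongside the median matters because a single batch-ingested source can hide behind a healthy-looking median.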
6. Detection-as-code is the only approach that scales
A cloud environment spanning dozens of accounts, multiple regions, and several managed service categories cannot be covered by rules that someone clicked into a SIEM console. Detection-as-code - rules in Git, CI pipeline that runs unit tests against sample events, automated deployment to the SIEM - is how the coverage stays current and how changes get reviewed rather than going directly to production. This is not a "nice to have for mature teams" pattern; it's the baseline. Detection engineers who arrive from on-prem environments and treat the SIEM as a configure-in-the-UI tool run into the scale wall quickly.
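The core of such a pipeline is small: a rule stored as data next to sample events that state what it must and must not match, plus a runner that CI can fail on. The sketch below uses a deliberately simplified rule format (exact field equality only) as a stand-in; it is not Sigma and not any SIEM's native format, and the rule ID is hypothetical.

```python
from typing import Any


def get_path(event: dict, path: str) -> Any:
    """Walk a dotted path through nested dicts; return None if any hop is missing."""
    cur: Any = event
    for key in path.split("."):
        if not isinstance(cur, dict) or key not in cur:
            return None
        cur = cur[key]
    return cur


# A rule and its test cases live together in version control.
RULE = {
    "id": "iam-create-access-key",  # hypothetical rule ID
    "selection": {
        "eventSource": "iam.amazonaws.com",
        "eventName": "CreateAccessKey",
    },
    "tests": [
        {"event": {"eventSource": "iam.amazonaws.com", "eventName": "CreateAccessKey"},
         "expect": True},
        {"event": {"eventSource": "iam.amazonaws.com", "eventName": "ListUsers"},
         "expect": False},
    ],
}


def matches(rule: dict, event: dict) -> bool:
    """True if every selection field equals the expected value."""
    return all(get_path(event, p) == want for p, want in rule["selection"].items())


def run_tests(rule: dict) -> list:
    """Return a list of failure messages; an empty list means the rule passes."""
    failures = []
    for i, case in enumerate(rule["tests"]):
        got = matches(rule, case["event"])
        if got != case["expect"]:
            failures.append(f"{rule['id']} case {i}: expected {case['expect']}, got {got}")
    return failures


if __name__ == "__main__":
    problems = run_tests(RULE)
    print("ok" if not problems else "\n".join(problems))
```

In a real pipeline the runner would exit non-zero on any failure so that a pull request with a broken rule cannot merge.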
7. The ATT&CK Cloud matrix keeps growing
MITRE adds new techniques and sub-techniques to the Cloud matrix as threat researchers document real-world attacker behavior. The 2026 matrix is materially larger than the 2022 version, and it will be larger still in 2028. Unlike endpoint ATT&CK, where the technique catalog is relatively stable, cloud ATT&CK reflects an attack surface that is genuinely expanding as providers ship new services that attackers learn to exploit. Your coverage map is never finished - it only has different-sized gaps over time.
Traditional detection asks "did malware execute?" Cloud detection asks "did an identity do something it shouldn't have, in a service it may not have existed two quarters ago, generating logs you've never seen before?"
8. The absence of network layer telemetry changes investigation
In data-center environments, network captures and proxy logs are often the richest forensic source after a compromise. In serverless and managed-service architectures, there may be no useful network telemetry at all - just API calls and the data they touched. This makes the control-plane-and-identity focus not just a detection choice but an investigation constraint: if your detections don't fire on the attack, there may be no other way to reconstruct what happened. The detective pressure that traditionally existed at both the network and the host layer now exists primarily in the API call log - which is why those logs need to be complete, consistently enriched, and queryable at low latency.
Cloud log sources, in depth
If there is one thing that separates a practitioner from someone who has read blog posts about cloud detection, it is depth on the actual log sources. You cannot write good detections for events you haven't spent time reading in volume. This section covers the log sources you'll use most and the things about each that matter for detection engineering specifically.
AWS: CloudTrail and beyond
CloudTrail is the spine of AWS detection. Management events record control-plane API calls, both writes and reads - CreateRole, PutRolePolicy, DescribeInstances, GetSecretValue. Data events require explicit opt-in per service (S3 object-level, Lambda invocations, DynamoDB item-level) and generate orders of magnitude more volume; ingesting all of them into the SIEM without filtering will break your budget. Key CloudTrail nuances for detection: userIdentity.type distinguishes a human IAM user from an assumed role from a federated identity from an AWS service; sourceIPAddress may be an AWS service name rather than an IP when an AWS service makes the call on your behalf; errorCode values like AccessDenied on reconnaissance calls are often the earliest attacker signal; requestParameters is where the operationally important details live (which bucket, which role, which instance profile) and is the field most often absent from surface-level rule descriptions.
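The errorCode nuance above lends itself to a short example: a principal that is denied on several different API calls in a short window is often enumerating its own permissions. The sketch below groups denied calls by caller; the set of error codes and the threshold are assumptions to tune, since services do not all report denials with the same code.

```python
from collections import defaultdict

# Error codes treated as "denied" (an assumed, incomplete set).
DENIED_CODES = {"AccessDenied", "Client.UnauthorizedOperation"}


def recon_candidates(events: list, min_distinct_calls: int = 3) -> dict:
    """Map each principal to its denied API calls, if it was denied on enough distinct ones."""
    denied = defaultdict(set)
    for ev in events:
        if ev.get("errorCode") in DENIED_CODES:
            principal = ev.get("userIdentity", {}).get("arn", "unknown")
            denied[principal].add(ev.get("eventName", "unknown"))
    return {
        principal: sorted(calls)
        for principal, calls in denied.items()
        if len(calls) >= min_distinct_calls
    }


def ev(arn: str, name: str, error=None) -> dict:
    """Build a trimmed CloudTrail-style record for the example."""
    record = {"eventName": name, "userIdentity": {"arn": arn}}
    if error:
        record["errorCode"] = error
    return record


SCANNER = "arn:aws:iam::111111111111:user/build-bot"
DEV = "arn:aws:iam::111111111111:user/alice"

window = [
    ev(SCANNER, "ListUsers", "AccessDenied"),
    ev(SCANNER, "ListRoles", "AccessDenied"),
    ev(SCANNER, "DescribeInstances", "Client.UnauthorizedOperation"),
    ev(DEV, "GetSecretValue", "AccessDenied"),  # one denial is ordinary noise
    ev(DEV, "DescribeInstances"),
]
print(recon_candidates(window))
```

Counting distinct API names rather than raw denials keeps a single misconfigured job that retries one call in a loop from looking like reconnaissance.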
GuardDuty sits on top of CloudTrail and VPC Flow Logs and provides ML-enriched findings - but the finding rate is low by design, and GuardDuty findings are not a substitute for custom CloudTrail detections. The techniques GuardDuty doesn't cover are the ones that are most environment-specific or most recently documented; both categories are where custom rules add the most value. CloudWatch Logs, VPC Flow Logs, Route53 resolver logs, S3 server access logs, and WAF logs complete the picture for network-layer visibility. The services that emit their own log formats - RDS audit logs, Lambda CloudWatch logs, EKS audit logs, MSK access logs - each require a separate ingestion pipeline and a separate understanding of their schema.
Azure: Activity Logs, Diagnostic Logs, and Entra ID sign-in logs
Azure Activity Logs are the control-plane equivalent of CloudTrail: resource writes, role assignment changes, policy modifications at the subscription and management group level. Entra ID (formerly Azure AD) sign-in logs and audit logs are the identity plane and are often the most detection-rich source for identity attacks - conditional access failures, MFA bypass attempts, service principal credential usage from unexpected locations, and token issuance events. These are separate streams that require separate ingestion into Sentinel; a common gap is teams that ingest Activity Logs but not Entra ID audit logs, which means they have resource control-plane coverage but no identity-plane coverage.
Diagnostic Logs vary per service and must be explicitly enabled per resource type; the schema differs materially between, say, Azure SQL and Azure Kubernetes Service. Microsoft Defender for Cloud and Microsoft Sentinel integrate tightly with these sources and provide UEBA, fusion rules, and native KQL analytics. KQL is the query language throughout the Microsoft ecosystem, and deep KQL fluency - including joins across multiple tables, time-series operators, and the make-series operator for behavioral analytics - is the differentiating skill in Azure-heavy detection environments. Microsoft 365 Defender / Defender XDR provides a separate hunting surface that overlaps with Azure detections when the attacker pivots between cloud resources and M365 identities.
GCP: Cloud Audit Logs and beyond
GCP Audit Logs have three types: Admin Activity (always on, control-plane write operations), Data Access (must be enabled per API, high volume), and System Event (automated GCP actions). The separation between Admin Activity and Data Access is the GCP equivalent of CloudTrail management vs. data events - and the same ingestion-cost dynamic applies. GCP's resource hierarchy (organization, folder, project) means you can aggregate audit logs at the org level with a Log Router sink that exports to BigQuery or to Pub/Sub, and from there to a SIEM. This flexibility is genuinely useful; the engineering cost of building the pipeline correctly is real and often underestimated.
Chronicle (now Google SecOps) has native GCP integration and YARA-L as its rule language, which is distinct from both KQL and SPL. YARA-L is purpose-built for security analytics over time-series event data and handles complex temporal correlations (events that happen within N minutes of each other, sequences across multiple entities) more naturally than SPL or KQL. The Google-published detection content library for GCP is less mature than the community content available for AWS in Sigma or Splunk ES, which means GCP-primary detection engineers write more from scratch. GCP IAM events are in Audit Logs under SetIamPolicy; service-account key creation (CreateServiceAccountKey) is among the highest-fidelity attack signals in a GCP environment. GCP also emits VPC Flow Logs, Cloud DNS logs, and Cloud Armor WAF logs for network-layer coverage.
Kubernetes and container logs
Kubernetes audit logs are an often-overlooked and extremely high-value detection source for cloud environments running EKS, AKS, or GKE. API server audit logs record every kubectl command, RBAC change, pod creation, service account token request, and exec into a running pod. These events are the k8s equivalent of CloudTrail management events - they tell you who called what API against the cluster control plane. The challenge is twofold: k8s audit log schemas differ between managed providers (EKS adds AWS ARN context that GKE doesn't), and the volume from active CI/CD pipelines is high enough that naively alerting on anything unusual generates constant noise. The detection engineering work here is in building baseline models of what legitimate pipeline behavior looks like and alerting on deviation.
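As a concrete case, an exec into a running pod appears in the API server audit log as a request against the pods resource with the exec subresource. The sketch below flags those requests unless the caller is on an allowlist. The allowlisted service account is hypothetical, and accepting both the create and get verbs is an assumption made because the verb can differ with the client's connection method.

```python
def is_unexpected_exec(event: dict, allowed_users=frozenset()) -> bool:
    """True if the audit event is a pod exec by a user outside the allowlist."""
    ref = event.get("objectRef") or {}
    is_exec = (
        event.get("verb") in {"create", "get"}
        and ref.get("resource") == "pods"
        and ref.get("subresource") == "exec"
    )
    if not is_exec:
        return False
    username = (event.get("user") or {}).get("username", "")
    return username not in allowed_users


# Hypothetical pipeline identity that legitimately execs into pods.
ALLOWED = frozenset({"system:serviceaccount:ci:deployer"})

human_exec = {
    "verb": "create",
    "user": {"username": "alice@example.com"},
    "objectRef": {"resource": "pods", "subresource": "exec",
                  "namespace": "payments", "name": "api-7d9f"},
}
pipeline_exec = {**human_exec, "user": {"username": "system:serviceaccount:ci:deployer"}}
pod_create = {
    "verb": "create",
    "user": {"username": "alice@example.com"},
    "objectRef": {"resource": "pods", "namespace": "payments"},
}

print(is_unexpected_exec(human_exec, ALLOWED),
      is_unexpected_exec(pipeline_exec, ALLOWED),
      is_unexpected_exec(pod_create, ALLOWED))
```

A static allowlist is the crudest possible baseline. It is shown here only to make the event shape concrete, not as a recommendation over a learned model of pipeline behavior.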
Container runtime telemetry from tools like Falco or Amazon GuardDuty Runtime Monitoring adds process-level visibility where it doesn't exist natively in control-plane logs. Falco rules are written in their own YAML-based format and provide signals like unexpected process execution in a container, filesystem writes to sensitive paths, and outbound network connections to uncommon destinations. These complement control-plane detections rather than replacing them - control-plane logs tell you the resource was created; runtime telemetry tells you what happened inside it.
Detecting identity-based attacks: the cloud detection core skill
Because identity-based attacks dominate cloud intrusions, this section goes deeper on the specific detection patterns that matter most. These are the techniques you'll encounter in purple-team exercises, in incident investigations, and in threat intelligence reports covering cloud breaches. Understanding them at the detection level - not just the conceptual level - is what separates the practitioners from the blog-post readers.
Role assumption chaining
Attackers who start with a compromised low-privilege IAM identity often chain sts:AssumeRole calls to progressively higher-privilege roles: start with the developer's IAM user, assume a deployment role, assume an admin role in the same or a different account. The challenge: legitimate CI/CD pipelines do exactly the same thing at high volume. The detection logic must distinguish a human assuming roles at unusual hours, from unusual source IPs, in an unusual sequence - not just "AssumeRole happened." Key fields: userIdentity.type, userIdentity.sessionContext.sessionIssuer, source IP, and the requestParameters.roleArn to identify which role was assumed. Cross-account role assumptions (where requestParameters.roleArn belongs to a different account than the caller) are particularly high-signal.
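The cross-account check in particular is easy to express: compare the caller's account with the account embedded in the requested role ARN. A minimal sketch follows, assuming trimmed CloudTrail AssumeRole records; the `known_pairs` allowlist of expected caller-to-target account relationships is a hypothetical construct for this example.

```python
from typing import Optional, Tuple


def arn_account(arn: str) -> Optional[str]:
    """Extract the account ID (the fifth colon-separated field) from an ARN."""
    parts = arn.split(":")
    if len(parts) >= 6 and parts[0] == "arn" and parts[4]:
        return parts[4]
    return None


def cross_account_assume(event: dict, known_pairs=frozenset()) -> Optional[Tuple[str, str]]:
    """Return (caller_account, target_account) for an unexpected cross-account AssumeRole."""
    if event.get("eventName") != "AssumeRole":
        return None
    caller = event.get("userIdentity", {}).get("accountId")
    role_arn = (event.get("requestParameters") or {}).get("roleArn", "")
    target = arn_account(role_arn)
    if not caller or not target or caller == target:
        return None
    if (caller, target) in known_pairs:
        return None
    return (caller, target)


# Hypothetical: the tooling account is expected to assume roles in production.
KNOWN = frozenset({("111111111111", "222222222222")})

expected = {
    "eventName": "AssumeRole",
    "userIdentity": {"accountId": "111111111111"},
    "requestParameters": {"roleArn": "arn:aws:iam::222222222222:role/deploy"},
}
unexpected = {
    "eventName": "AssumeRole",
    "userIdentity": {"accountId": "111111111111"},
    "requestParameters": {"roleArn": "arn:aws:iam::333333333333:role/admin"},
}

print(cross_account_assume(expected, KNOWN), cross_account_assume(unexpected, KNOWN))
```

This covers only a single hop. Following a chain means joining each event's issued session to the next caller, which is better done in the SIEM's query language than in a per-event function.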
IAM key and token exfiltration
Access keys used from novel geographic locations, unusual ASN ranges (hosting providers vs. residential ISPs), or with API call patterns inconsistent with the key's normal usage are the most common first-signal of key exfiltration. Detection: establish a baseline of normal source IP ASN, normal API call mix, and normal time-of-day usage for each access key or role; alert on significant deviation. The false-positive surface is legitimate developer travel and new CI runners - manage it with baseline-building windows and suppression for known-good patterns rather than hard IP allowlists that expire.
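The baseline-and-deviation approach can be sketched with nothing more than a set of previously seen ASNs per access key. The example assumes each event has already been enriched with an `asn` field by an upstream lookup. That field is hypothetical rather than part of CloudTrail, and the ASN values are illustrative.

```python
from collections import defaultdict


def build_baseline(events: list) -> dict:
    """Collect the set of source ASNs observed for each access key."""
    baseline = defaultdict(set)
    for ev in events:
        key_id = ev.get("userIdentity", {}).get("accessKeyId")
        if key_id and "asn" in ev:
            baseline[key_id].add(ev["asn"])
    return dict(baseline)


def verdict(event: dict, baseline: dict) -> str:
    """Classify an event as 'known', 'novel_asn', or 'no_baseline'."""
    key_id = event.get("userIdentity", {}).get("accessKeyId")
    seen = baseline.get(key_id)
    if seen is None:
        return "no_baseline"  # new key: handle separately, do not alert blindly
    return "known" if event.get("asn") in seen else "novel_asn"


def ev(key_id: str, asn: int) -> dict:
    """Build a trimmed, pre-enriched record for the example."""
    return {"userIdentity": {"accessKeyId": key_id}, "asn": asn}


# Baseline window: the key is used from two networks.
history = [ev("AKIAEXAMPLE1", 64500), ev("AKIAEXAMPLE1", 64500), ev("AKIAEXAMPLE1", 64501)]
baseline = build_baseline(history)

print(verdict(ev("AKIAEXAMPLE1", 64501), baseline),
      verdict(ev("AKIAEXAMPLE1", 64999), baseline),
      verdict(ev("AKIAEXAMPLE2", 64500), baseline))
```

Returning a separate verdict for keys with no baseline keeps newly created keys and new CI runners out of the alert stream until a baseline-building window has passed.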
OAuth consent and service principal abuse
In Azure and GCP environments, attackers increasingly abuse OAuth consent flows to obtain persistent access to Microsoft Graph, Exchange Online, or GCP APIs without ever touching managed cloud resources. The detection surface is in Entra ID audit logs (consent grant events, especially for high-privilege Graph permissions like Mail.ReadWrite or Directory.Read.All) and in GCP Audit Logs (SetIamPolicy on service accounts). Consent grants to applications outside the tenant's verified publisher list, or to applications that immediately begin reading high-value resources, are the highest-fidelity signals. This technique is frequently absent from detection libraries built primarily around CloudTrail because it is an Azure-first attack pattern.
Instance metadata service abuse
EC2 instance metadata service (IMDS) abuse - extracting temporary credentials from the metadata endpoint, then using them from outside the instance - was the core technique in the Capital One breach and remains common. IMDSv2 mitigates the most common SSRF-based exploitation path, but detection coverage for IMDS credential abuse is still valuable. The signal is API calls made with instance-profile credentials (userIdentity.type = AssumedRole with sessionContext.sessionIssuer.type = Role, typically with the instance ID as the role session name) from a source IP outside the known EC2 address range for the instance's region. The false-positive rate is low when the IP matching is precise; imprecise IP matching generates substantial noise.
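A sketch of that check follows. It relies on one convention, that instance-profile sessions carry the instance ID as the role session name, and on a list of expected egress networks that you would have to maintain yourself. The networks and ARNs here are hypothetical.

```python
import ipaddress


def instance_creds_used_elsewhere(event: dict, expected_networks: list) -> bool:
    """True if instance-profile credentials are used from outside the expected networks."""
    ident = event.get("userIdentity", {})
    if ident.get("type") != "AssumedRole":
        return False
    # For assumed roles the ARN ends in .../<role-name>/<session-name>.
    session_name = ident.get("arn", "").rsplit("/", 1)[-1]
    if not session_name.startswith("i-"):
        return False  # not an instance-profile session under this convention
    try:
        ip = ipaddress.ip_address(event.get("sourceIPAddress", ""))
    except ValueError:
        return False  # the field holds a service name, not an address
    return not any(ip in net for net in expected_networks)


# Hypothetical: the NAT gateway range this workload egresses through.
EXPECTED = [ipaddress.ip_network("198.51.100.0/24")]

inside = {
    "eventName": "ListBuckets",
    "sourceIPAddress": "198.51.100.10",
    "userIdentity": {
        "type": "AssumedRole",
        "arn": "arn:aws:sts::111111111111:assumed-role/web-role/i-0abc123def4567890",
    },
}
outside = {**inside, "sourceIPAddress": "203.0.113.50"}

print(instance_creds_used_elsewhere(inside, EXPECTED),
      instance_creds_used_elsewhere(outside, EXPECTED))
```

The precision of `expected_networks` decides the noise level, which is the point the paragraph above makes about IP matching.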
S3 and data store reconnaissance and exfiltration
Bucket enumeration (ListBuckets, ListObjects on unfamiliar prefixes), followed by data access from new principals or source IPs, is the canonical cloud data exfiltration pattern. The data event tier (S3 server access logs, CloudTrail data events) is required for the access signal; many teams skip it for cost reasons and then have no visibility into the exfiltration phase even if they detect the reconnaissance. Detection: alert on high-volume GetObject calls from identities that have never previously accessed the bucket, especially when the source IP belongs to a hosting provider ASN.
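The first-time, high-volume reader logic can be sketched as a count of GetObject calls per principal and bucket within a window, filtered against the pairs already seen in history. The threshold, the window contents, and the `prior_readers` set are assumptions for illustration; the events are trimmed CloudTrail S3 data events.

```python
from collections import Counter


def first_time_bulk_readers(window_events: list, prior_readers: set, threshold: int = 100) -> dict:
    """Return {(principal_arn, bucket): count} for new readers at or above the threshold."""
    counts = Counter()
    for ev in window_events:
        if ev.get("eventName") != "GetObject":
            continue
        arn = ev.get("userIdentity", {}).get("arn")
        bucket = (ev.get("requestParameters") or {}).get("bucketName")
        if arn and bucket:
            counts[(arn, bucket)] += 1
    return {
        pair: n
        for pair, n in counts.items()
        if n >= threshold and pair not in prior_readers
    }


def get_object(arn: str, bucket: str) -> dict:
    """Build a trimmed S3 data event for the example."""
    return {
        "eventName": "GetObject",
        "userIdentity": {"arn": arn},
        "requestParameters": {"bucketName": bucket},
    }


ETL = "arn:aws:sts::111111111111:assumed-role/etl/nightly"
NEW = "arn:aws:iam::111111111111:user/contractor"

# (principal, bucket) pairs seen during the baseline period.
prior = {(ETL, "customer-exports")}

window = (
    [get_object(ETL, "customer-exports")] * 500   # known bulk reader
    + [get_object(NEW, "customer-exports")] * 150  # never seen before
    + [get_object(NEW, "public-assets")] * 5       # new, but low volume
)
print(first_time_bulk_readers(window, prior))
```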
Lambda and serverless function abuse
Serverless functions running with overprivileged IAM roles are attack targets for privilege escalation via code injection or environment variable extraction. Detection relies on CloudTrail management events (UpdateFunctionCode, UpdateFunctionConfiguration, AddPermission) and Lambda invoke data events. CloudTrail records Lambda API calls with version-suffixed event names (for example, UpdateFunctionCode20150331v2), so match on the prefix rather than the bare action name. Swapping a Lambda's execution role for a higher-privilege one is detectable via the UpdateFunctionConfiguration event when requestParameters.role changes. Injecting malicious values into a Lambda's environment variables is detectable via UpdateFunctionConfiguration when requestParameters.environment is modified unexpectedly.
The learning treadmill, in detail
The treadmill is real in every cloud security role, but detection engineering has its own especially relentless version. The problem is structural: your job is to have detection coverage for the attack techniques that matter, but the attack surface and the telemetry that covers it are both moving at the same time, independently, in directions you don't control.
Here is what the treadmill looks like from the inside:
- Every new managed service is a new detection gap until you close it. When your engineers adopt Amazon Bedrock, Azure OpenAI Service, Google Cloud Run, or any other managed service, that service emits telemetry you may have never seen. Attackers learn to exploit new services months before defenders have written rules for them. The gap between GA and detection coverage is the window.
- Event schemas change without warning. CloudTrail fields get added, renamed, or restructured. Azure Activity Log formats evolve. GCP adds new audit event types. Rules that matched on a specific field value may start matching on nothing - or worse, start matching on something different than intended. You need a way to detect when your own rules have stopped working.
- MITRE ATT&CK Cloud adds new techniques. The cloud matrix grows faster than the enterprise matrix because the research community is still catching up to the breadth of cloud attack surface. Each new technique version is a prompt to check whether you have coverage - and if not, to either write it or consciously accept the gap.
- Attacker tradecraft evolves specifically to evade cloud detections. Threat actors who attack cloud environments read the same detection-engineering blog posts you do. Living-off-the-land techniques that blend into legitimate API usage are growing more common precisely because they evade behavior-based rules. The rule you wrote for CreateRole with inline policy creation doesn't catch the attacker who waits 48 hours between steps.
- SIEM platforms change. Splunk adds new SPL functions. Sentinel adds new KQL operators. Chronicle changes YARA-L syntax. When your org migrates SIEMs - which happens more often than anyone plans - you're re-writing or re-translating the entire detection library.
How practitioners actually keep up - the detection engineers who stay current don't try to read everything. They build a system. Weekly: skim the ATT&CK navigator for new additions, check the SigmaHQ repository for new rules relevant to their environment, read the provider release notes for log format changes. Monthly: run a purple-team exercise against one ATT&CK technique category and close the gaps that surface. Quarterly: do a rule-staleness audit - replay historical benign and attack traffic through every detection and confirm the results match expectations. Community: fwd:cloudsec, SANS CloudSecNext, the CTID cloud analytics project, and the CSOH Friday sessions are where you find out what other practitioners are seeing before it hits the blog posts.
The treadmill is also why this role rewards practitioners who build systems over those who rely on personal heroics. A detection engineer who manually reviews every provider release note and every threat intel report will burn out. The ones who survive long-term have automated the triage layer: RSS feeds that filter for cloud-security-relevant content, automated staleness monitoring that flags rules for review before they silently fail, and a library structure that makes it easy to trace "what log source does this rule depend on" so that when a log format changes you can find all affected rules in seconds rather than hours.
One more element of the treadmill specific to detection engineering: the adversary learns too. As the community publishes more cloud detection content - through conference talks, Sigma rules, threat intelligence reports - sophisticated threat actors adapt their tradecraft to avoid the published detections. The timing delays between attack steps, the use of legitimate-looking source IPs, the blending of attack API calls with high-volume legitimate API traffic in the same session - these are deliberate adaptations to known detection approaches. Staying current means reading not just the defense-side literature but also the offensive research: the Permiso threat reports, the Wiz threat intelligence team findings, the cloud-attack CTF writeups, and the DFIR case studies from practitioners who handled real cloud breaches. Understanding what attackers know about your detections is the only way to reason about whether your rules would catch a prepared adversary, not just an opportunistic one.
A week in the life
This is a representative week for a senior detection engineer at a scale-up running primarily on AWS with Splunk as the SIEM. Your stack and org size will change the proportions, but the shape repeats.
Monday. Start with the weekly ATT&CK gap review. A new sub-technique was added to the cloud matrix last Thursday: T1578.005, Modify Cloud Compute Configurations. You read the technique description, pull up your CloudTrail logs and search for the relevant API calls (ModifyInstanceAttribute, ModifyNetworkInterfaceAttribute, and a handful more), and assess whether your existing instance-modification rules provide coverage or just partial coverage. They're partial - they don't catch the EBS snapshot case. Add it to the backlog.
Tuesday. Purple-team day. You're running a Stratus Red Team scenario for "Exfiltrate CloudTrail logs via S3." You detonate it in the lab account, wait to see whether the detection fires, and analyze the results. It fires - but only 40 minutes after the event, because the CloudTrail-to-S3-to-Splunk pipeline has indexing lag you didn't account for. You write up the finding: coverage exists, but detection latency is 40 minutes, not the 10 you're targeting. Flag it to the SIEM team to discuss S3 notification triggers versus scheduled polling.
Wednesday. Heads-down rule work. You're writing a detection for OAuth consent grant abuse in Azure - a technique that generated two real incidents in the industry this quarter and that your Entra ID audit logs can cover. You spend the morning drafting the KQL, testing it against the last 30 days of audit logs in your Sentinel workspace, and counting the false-positive population. Twelve benign app registrations match the query; you add scoping criteria for known-good app display names, re-run, get to two, and document both as accepted baseline. The rule goes into the detection repo via PR; a colleague reviews by end of day.
Thursday. A platform team is adopting Amazon EKS for a new service. They've asked you to review the audit logging configuration before go-live. You review their Terraform, find that k8s audit logs are configured but not shipped to Splunk, and write a short requirements doc: ship audit logs, enable GuardDuty EKS Protection, and suppress these three known-noisy API paths in the Splunk transform. It's a one-hour engagement that prevents a three-month coverage gap.
Friday. Rule-maintenance pass. Automated monitoring flagged one detection that hasn't fired in 31 days - unusual for a rule that historically fires several times weekly. You investigate: an IAM field rename in a CloudTrail update from three weeks ago silently broke the match. You fix the field name, test in the lab, confirm the fix works, and ship the update. Afternoon: read the week's provider release notes (AWS: new ECS task metadata endpoint version; Azure: new Entra ID audit event for Privileged Identity Management activations; GCP: Cloud Spanner audit log schema update). One of them - the PIM activation event - is a new log type you don't have a detection for. Write the ticket.
What doesn't show up much: responding to live alerts (that's the SOC), writing compliance reports, or building out dashboards. What shows up every week without fail: reading code, writing code, reviewing code, and running simulations. The craft is in the details - a rule with a wrong field name provides zero coverage no matter how smart the logic is.
One thing that surprises people entering the role: the calendar looks much more like a software engineer's than like a SOC analyst's. You have blocks of heads-down time for rule development, code review for rules someone else wrote, engineering conversations with the platform team about log pipeline architecture, and structured purple-team sessions. The reactive alert-response cadence of SOC work is mostly absent. That's a feature for some people and a surprise for others; know which camp you're in before you interview.
A day in the life: Wednesday at a fintech running AWS and Sentinel
The weekly breakdown above is an aggregate. Here is the texture - an illustrative, composite Wednesday in the calendar of a senior detection engineer at a mid-size fintech running primarily on AWS with Sentinel as the SIEM. The specific incidents, tickets, and Slack messages are fictionalized. Treat it as a representative archetype.
7:45 - morning read. Coffee and the provider digest. AWS released two new API actions for SageMaker Unified Studio overnight. Neither is high-risk on its own, but one of them - a new role-chaining endpoint for model deployment - is the kind of thing that creates a privilege escalation path nobody has written a detection for yet. Add to the investigation queue.
8:30 - staleness alert. The automated monitoring system flagged a detection for CreateServiceLinkedRole that hasn't fired in 22 days. That's suspicious - this environment generates those events regularly. Pull up the rule, trace the field references against yesterday's CloudTrail sample. Found it: AWS changed the capitalization of the serviceLinkedRoleCreationContext field in a schema update two weeks ago. The rule is matching on the old casing. Fix, test against three known-malicious and five known-benign events, update the tuning history in the rule's metadata block, open the PR.
9:15 - rule review. A colleague opened a PR yesterday for a new detection covering Azure role assignment to privileged built-in roles from outside the tenant. You review the KQL: the logic is solid, but the exclusion for the CI/CD service principal is too broad - it excludes by display name, which is mutable, rather than by object ID, which is stable. Leave a comment, explain the risk, suggest the fix. The conversation takes three messages; the colleague updates, you approve.
10:00 - purple-team session. Monthly run with a contractor red teamer. Today's scope: T1548.005, Abuse Elevation Control Mechanism - Temporary Elevated Cloud Access. You've agreed to test whether your detection for abnormal PIM activations in Entra ID fires reliably. The red teamer activates the privileged role from a suspicious location; you watch Sentinel in real time. The rule fires in eight minutes - longer than the five-minute target but within SLA. Write up the result: rule fires, timing lag noted, recommend a priority upgrade to near-real-time evaluation. File the ticket with the log pipeline team.
11:00 - threat intel translation. A partner ISAC published a new advisory on a threat actor targeting financial services cloud environments. The TTPs section lists three techniques you don't have specific cloud-adapted coverage for. Two are straightforward translations of existing endpoint ATT&CK rules. The third - exfiltration via S3 pre-signed URLs generated from a compromised Lambda - requires new detection logic that correlates CreateFunction, InvokeFunction, and S3 data event logs within a short time window. Write the research ticket; this one will take a day to build and validate properly.
1:00 - SOC office hours. Monthly check-in where the detection team and SOC leadership review which detections are generating the most work for analysts. Three rules in the top-10 most-investigated list are producing 65% of their alerts from a known-benign automated workflow. You agree to add a suppression for that workflow and schedule a tuning pass for the following week. One rule in the bottom-10 (rarely fired) is actually the most important to maintain - it covers a high-impact technique with a low base rate. Document the rationale explicitly so it doesn't get pruned in the next library review.
2:30 - SageMaker follow-up. Dig into the new SageMaker API actions from the morning. Read the AWS documentation, pull up an account with SageMaker enabled, call the new API, and watch what CloudTrail generates. The log structure is new enough that there's no Sigma rule for it. The privilege escalation path you suspected is real - the new role-chaining endpoint creates an sts:AssumeRole event with a service-specific ARN pattern that differs from standard developer-initiated assumptions. Draft an initial Sigma rule and drop a note in the Slack channel for the platform team that owns SageMaker to validate the normal usage patterns before you tune.
4:00 - documentation pass. Update the coverage navigator for the rules shipped this month. Three new techniques moved from "no coverage" to "partial coverage"; one moved from "partial" to "high confidence" after last week's successful purple-team validation. Write the coverage report for the monthly security metrics deck - leadership gets a trend line and a prioritized gap list.
5:30 - close. Log the open loops: the SageMaker rule needs production tuning after platform team input, the pre-signed URL exfiltration detection is on the backlog with a research ticket, the PIM timing lag goes to the pipeline team. Tomorrow's calendar has a detection design session for the new container workload they're spinning up. Read the CTID cloud analytics bulletin over coffee before logging off.
Total focused coding/rule-writing time: about 4 hours. Collaboration and review: about 2.5 hours. Research and reading: about 1.5 hours. Administration and documentation: about 1 hour. Every Wednesday is different in detail; the rhythm of building, reviewing, simulating, and monitoring repeats.
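The 11:00 entry describes a detection that correlates several event types inside a short window. A minimal sketch of that idea in Python, assuming simplified events that already carry a shared correlation key (the `key` field, the 10-minute window, and the sample events are all hypothetical; the event names follow the text above, and real CloudTrail eventName values may differ, so confirm them against captured logs):

```python
from datetime import datetime, timedelta

# Ordered steps to look for. Names follow the article text; verify the real
# CloudTrail eventName values before using anything like this.
SEQUENCE = ("CreateFunction", "InvokeFunction", "GetObject")
WINDOW = timedelta(minutes=10)  # hypothetical window


def find_sequences(events, sequence=SEQUENCE, window=WINDOW):
    """Return correlation keys that show every step, in order, inside the window."""
    by_key = {}
    for ev in sorted(events, key=lambda e: datetime.fromisoformat(e["time"])):
        by_key.setdefault(ev["key"], []).append(ev)

    hits = []
    for key, evs in by_key.items():
        for i, start in enumerate(evs):
            if start["eventName"] != sequence[0]:
                continue
            t0 = datetime.fromisoformat(start["time"])
            step = 1  # index of the next step we still need to see
            for ev in evs[i + 1:]:
                if step == len(sequence):
                    break
                if datetime.fromisoformat(ev["time"]) - t0 > window:
                    break  # the window, measured from the first step, has closed
                if ev["eventName"] == sequence[step]:
                    step += 1
            if step == len(sequence):
                hits.append(key)
                break  # one hit per key is enough to raise an alert
    return hits


# Hypothetical sample data: "key" stands in for whatever ties the events
# together in practice (for example, the function's execution role).
EVENTS = [
    {"time": "2026-01-14T08:00:00", "eventName": "CreateFunction", "key": "role-c"},
    {"time": "2026-01-14T08:30:00", "eventName": "InvokeFunction", "key": "role-c"},
    {"time": "2026-01-14T08:31:00", "eventName": "GetObject", "key": "role-c"},
    {"time": "2026-01-14T09:00:00", "eventName": "InvokeFunction", "key": "role-b"},
    {"time": "2026-01-14T09:01:00", "eventName": "GetObject", "key": "role-b"},
    {"time": "2026-01-14T11:00:00", "eventName": "CreateFunction", "key": "role-a"},
    {"time": "2026-01-14T11:02:00", "eventName": "InvokeFunction", "key": "role-a"},
    {"time": "2026-01-14T11:03:30", "eventName": "GetObject", "key": "role-a"},
]

print(find_sequences(EVENTS))
```

In a real SIEM this would be a join or sequence query in the native language; the sketch only shows the shape of the logic you would prototype before committing to one.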
The skill stack
Detection engineering has a stable core that takes years to build and a moving edge that never stops. The ratio of core to edge shifts as you advance - junior engineers spend most of their energy on core fluency; senior engineers spend it on the edge and on building systems that help the team keep up.
The stable core
Build these deliberately. They don't expire, they compound, and they are what interviewers are actually testing even when the interview question sounds like it's about a specific tool.
- One cloud at operational depth. AWS first by default for market breadth, but Azure-primary or GCP-primary engineers are fully hireable. Depth means knowing the IAM model, the logging architecture, and the ten services that show up in 80% of production environments well enough to write detections without looking up the field names every time.
- Sigma and at least one native query language. Sigma is the portable rule format; KQL is dominant in Microsoft shops; SPL in Splunk-heavy shops; YARA-L in Chronicle/Google SecOps environments. You need Sigma plus the language of your primary SIEM, and enough reading fluency in the others to review and translate rules.
- The ATT&CK Cloud matrix. Not just knowing it exists - knowing each technique category well enough to reason about what telemetry would cover it, what a realistic false-positive rate looks like in a busy engineering environment, and what a rule that fires reliably without drowning the SOC actually requires.
- Log source internals. CloudTrail field semantics, Azure Activity Log structure, GCP Audit Log types. The engineers who treat these as black boxes write rules with subtle holes. The ones who've read thousands of real events can spot an attacker who's deliberately shaping their calls to blend in.
- Detection-as-code fundamentals. Git, code review, CI testing, deployment pipelines. If you haven't worked in a detection-as-code workflow, the CI/CD page and the home lab guide are the fastest way to build the intuition.
- Scripting. Python or Go for event manipulation, detection pipeline tooling, purple-team automation, and one-off investigation queries that the SIEM can't do efficiently. SQL for SIEM backends that expose it.
- Attacker technique depth. You can't write a reliable detection for something you haven't simulated. Stratus Red Team, CloudGoat, and Atomic Red Team are the tools. Running them yourself and reading the resulting logs is worth more than any certification.
The moving edge
Accept that this list has no fixed length. Every new managed service your org adopts, every new SIEM version that ships, and every new cloud-specific attack technique documented in public threat research extends this list. The skill is not "master the current list" - it's "have a reliable method for getting current on new items fast."
- New log sources as providers add managed services and your engineers adopt them.
- AI/ML workload telemetry - model API calls, data pipeline access, prompt injection indicators, credential use from agentic systems. See AI/ML security.
- Container and Kubernetes audit log analysis as workloads shift - see containers and Kubernetes.
- Supply-chain attack telemetry: CI/CD pipeline events, artifact registry access patterns, build system credential abuse.
- SIEM platform updates: new query operators, new ML-based analytics, new data connectors.
The detection lifecycle, step by step
Detection engineering has a lifecycle that most job descriptions underspecify. Understanding each step - and where the hard parts live - is more useful than a skills checklist:
- Technique selection. Not every ATT&CK technique deserves a rule. The ones that do have a high probability of appearing in your threat model, reliable telemetry in your environment, and a true-positive-to-false-positive ratio that the SOC can sustain. Technique selection is a risk prioritization exercise, not a completeness exercise.
- Telemetry research. Which log source captures this technique? What fields are populated? Are they populated consistently, or only under specific conditions? What does a benign event that triggers the same fields look like? This phase requires hands-on log analysis, not just reading documentation.
- Rule drafting. Write the initial logic. In Sigma first if the team uses a detection-as-code workflow; in native SIEM query language if you're prototyping quickly. Document the detection rationale, the ATT&CK technique mapping, and the expected false-positive classes.
- Attack simulation. Run a Stratus Red Team scenario or equivalent, confirm the rule fires, and examine the alert content. Does it contain enough context for an analyst to investigate without a second log query? Is the severity calibration correct?
- False-positive analysis. Run the rule against 30 days of production log data (or production-representative synthetic data). Measure the false-positive rate. Identify the benign use cases that match the rule and scope exclusions that are provably safe - that is, exclusions where the benign pattern cannot overlap with the attack pattern.
- Peer review. Another engineer reviews the rule, the simulation results, and the false-positive analysis. The review catches logic errors, missing edge cases, and exclusions that are too broad. This is the step most commonly skipped under time pressure and the step most commonly responsible for rules that fail in production.
- CI pipeline and deployment. The rule passes automated tests (syntax validation, schema validation, unit test against sample events) and deploys through the pipeline to the SIEM. The deployment is version-controlled; if the rule breaks, you can roll back.
- Production monitoring and feedback. Track alert volume, SOC feedback on quality, and analyst-applied exclusions. An alert that analysts are consistently dismissing as a false positive is a signal to tune the rule, not to accept the analyst behavior.
- Periodic re-validation. Quarterly or after major provider changes, re-run the simulation and confirm the rule still fires. Check that exclusions are still valid. Update the rule if the log schema has changed.
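The false-positive analysis step can be prototyped outside the SIEM. The sketch below replays a hypothetical rule over sample events, counts the matching population per caller, and applies an exclusion keyed on a stable object ID rather than a mutable display name. Every field name, the rule logic, and the IDs are assumptions for illustration, not a real log schema.

```python
from collections import Counter

# Hypothetical object ID for a known-good CI/CD service principal. Display
# names can be changed by an attacker, so the exclusion keys on the ID.
CI_OBJECT_ID = "11111111-aaaa-4bbb-8ccc-222222222222"
LOOKALIKE_ID = "99999999-dddd-4eee-8fff-000000000000"


def rule_matches(event):
    """Hypothetical rule logic: privileged role assignment by an external caller."""
    return (
        event.get("operation") == "role_assignment_create"
        and event.get("privileged_role") is True
        and event.get("caller_external") is True
    )


def match_population(events, excluded_ids=frozenset()):
    """Count rule matches per caller object ID, after applying exclusions."""
    return Counter(
        e.get("caller_object_id", "<missing>")
        for e in events
        if rule_matches(e) and e.get("caller_object_id") not in excluded_ids
    )


def make_event(object_id, display_name, privileged=True):
    return {
        "operation": "role_assignment_create",
        "privileged_role": privileged,
        "caller_external": True,
        "caller_object_id": object_id,
        "caller_display_name": display_name,
    }


# Five benign CI/CD events, one event from a different principal that reuses
# the same display name, and one event the rule should not match at all.
events = [make_event(CI_OBJECT_ID, "ci-deployer") for _ in range(5)]
events.append(make_event(LOOKALIKE_ID, "ci-deployer"))
events.append(make_event(CI_OBJECT_ID, "ci-deployer", privileged=False))

before = match_population(events)
after = match_population(events, excluded_ids={CI_OBJECT_ID})
print("before exclusions:", dict(before))
print("after exclusions:", dict(after))
```

The lookalike principal survives the exclusion because the match is on object ID; an exclusion on display name would have hidden it.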
Detection-as-code in practice
Detection-as-code is not a philosophy - it is the specific set of engineering practices that makes a cloud detection library maintainable at scale. The term is used loosely enough that it's worth being precise about what a mature detection-as-code workflow actually contains.
The repository structure
A detection library in Git has: rules organized by ATT&CK tactic or by log source, a metadata schema for each rule (ATT&CK mapping, severity, log source, author, date, tuning history, false-positive classes), a test fixtures directory (sample events for unit tests, both positive and negative), and deployment configuration that maps rules to the target SIEM. The structure is opinionated and team-specific; the important thing is that it exists and is enforced, because ad-hoc organization accumulates technical debt at a rate that eventually makes the library unmanageable.
The CI pipeline
Every pull request runs: syntax validation (Sigma schema compliance, or native query parsing); unit tests that replay known-malicious and known-benign sample events and confirm correct classification; schema validation against the expected log source fields (a rule that references a field that doesn't exist in the target log source fails the check); and coverage diff (a report that shows which ATT&CK techniques gained or lost coverage). Optionally, a cost estimate for the new rule's expected query volume. The CI pipeline is what allows peer review to focus on logic and rationale rather than catching typos.
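Two of those checks, schema validation and fixture replay, fit in a few lines. This is a sketch with a deliberately simple rule shape (exact-match on top-level fields); the rule format, the field subset, and the fixtures are assumptions, not any particular platform's test framework.

```python
# A subset of top-level CloudTrail record fields, used as the schema to check
# against. A real check would load the full field list for the log source.
CLOUDTRAIL_FIELDS = {"eventSource", "eventName", "sourceIPAddress", "userAgent", "errorCode"}

# Hypothetical rule format: every field in "match" must equal the given value.
RULE = {
    "id": "aws_cloudtrail_stop_logging",
    "match": {"eventSource": "cloudtrail.amazonaws.com", "eventName": "StopLogging"},
}

# Same rule with a field-name casing mistake, the kind CI should catch.
BROKEN_RULE = {
    "id": "aws_cloudtrail_stop_logging_broken",
    "match": {"eventSource": "cloudtrail.amazonaws.com", "EventName": "StopLogging"},
}


def unknown_fields(rule, schema=CLOUDTRAIL_FIELDS):
    """Schema validation: fields the rule references that the log source lacks."""
    return sorted(set(rule["match"]) - schema)


def matches(rule, event):
    return all(event.get(name) == value for name, value in rule["match"].items())


def fixture_failures(rule, positives, negatives):
    """Unit test: replay known-malicious and known-benign sample events."""
    failures = [("missed", e) for e in positives if not matches(rule, e)]
    failures += [("false_alarm", e) for e in negatives if matches(rule, e)]
    return failures


POSITIVES = [{"eventSource": "cloudtrail.amazonaws.com", "eventName": "StopLogging"}]
NEGATIVES = [{"eventSource": "cloudtrail.amazonaws.com", "eventName": "DescribeTrails"}]

for rule in (RULE, BROKEN_RULE):
    print(rule["id"], unknown_fields(rule), len(fixture_failures(rule, POSITIVES, NEGATIVES)))
```

The broken rule fails both checks: the schema check names the bad field directly, and the fixture replay shows the consequence, a missed known-malicious event.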
The deployment pipeline
Approved rules deploy automatically through a pipeline - to a staging SIEM environment first (where they can run for 24-48 hours against production-shaped traffic without alerting the SOC), then to production after a quality gate. The deployment pipeline also handles SIEM-specific compilation: Sigma rules compile to KQL, SPL, YARA-L, or EQL before deployment to the appropriate SIEM. Some teams maintain a secondary "archive" SIEM for historical queries separate from the primary alerting SIEM; the deployment pipeline handles routing.
Staleness monitoring
The most sophisticated element of a mature detection-as-code setup is automated monitoring for rule staleness. Implementation options: a daily query that checks alert volume for every active rule and flags any rule whose volume has dropped more than 80% from its trailing 30-day average; a weekly run that replays a sample of historical attack events through each rule and checks that the expected alerts fire; a provider-change monitor that watches AWS, Azure, and GCP release notes RSS feeds and tags rules whose referenced log sources or field names appear in change announcements. Not every team has all three. Every team should have at least the first one.
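The first option is simple enough to sketch. The version below assumes you can already export per-rule daily alert counts (oldest first) from the SIEM; the input shape, the function name, and the choice to compare only the most recent day against the preceding 30 are assumptions, and a longer recent window would likely be less noisy for low-volume rules.

```python
from statistics import mean


def stale_rules(daily_counts, drop=0.8, baseline_days=30, recent_days=1):
    """Flag rules whose recent alert volume fell more than `drop` below baseline.

    daily_counts maps rule ID to a list of daily alert counts, oldest first.
    """
    flagged = []
    for rule_id, counts in daily_counts.items():
        if len(counts) < baseline_days + recent_days:
            continue  # not enough history to judge
        recent = mean(counts[-recent_days:])
        baseline = mean(counts[-(baseline_days + recent_days):-recent_days])
        # Rules with a zero baseline never "drop"; they need the replay-based
        # check instead, because silence is their normal state.
        if baseline > 0 and recent < baseline * (1 - drop):
            flagged.append(rule_id)
    return sorted(flagged)


# Hypothetical history: 30 baseline days followed by the most recent day.
history = {
    "went_silent": [5] * 30 + [0],
    "slightly_quieter": [5] * 30 + [4],
    "always_quiet": [0] * 31,
    "too_new": [3, 0],
}
print(stale_rules(history))
```

Note what this check cannot see: a low-base-rate rule that has always been quiet looks the same whether it is healthy or broken, which is why the replay-based option exists.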
Rule metadata and documentation
Each rule should carry: the ATT&CK technique it covers; the log source(s) it depends on; the expected false-positive classes and why the scoping exclusions are safe; the simulation evidence (Stratus Red Team scenario name, date run, link to run log); the tuning history (what was changed, when, why); and the owner. The documentation overhead feels painful when you have 20 rules. It saves enormous time when you have 300 rules and a provider schema change requires you to find every rule that depends on the affected log source. The metadata is also what lets a new detection engineer onboard into the library and understand why rules are structured the way they are, rather than having to reverse-engineer the reasoning from the query logic.
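As a rough illustration of why the metadata pays off, here is a sketch of a metadata record with a completeness check and a lookup by log source. The class, field names, rule IDs, and owners are hypothetical; the simulation evidence string is only an example of what a team might record.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RuleMeta:
    rule_id: str
    technique: str               # ATT&CK technique ID
    log_sources: tuple           # log sources the rule depends on
    false_positive_classes: tuple
    simulation_evidence: str     # e.g. scenario name and run date
    owner: str
    tuning_history: tuple = ()   # may be empty for a brand-new rule


OPTIONAL = {"tuning_history"}


def missing_fields(meta):
    """Names of required metadata fields that are empty."""
    return [
        f.name for f in fields(meta)
        if f.name not in OPTIONAL and not getattr(meta, f.name)
    ]


def rules_for_log_source(library, log_source):
    """Every rule that must be re-checked when this log source changes."""
    return sorted(m.rule_id for m in library if log_source in m.log_sources)


LIBRARY = [
    RuleMeta("aws_cloudtrail_stop_logging", "T1562.008", ("cloudtrail",),
             ("planned trail migration",), "aws.defense-evasion.cloudtrail-stop, 2026-01-10", "team-a"),
    RuleMeta("aws_iam_admin_policy_attach", "T1098.003", ("cloudtrail",),
             ("break-glass procedure",), "manual test, 2026-01-12", "team-a"),
    RuleMeta("entra_pim_activation_anomaly", "T1548.005", ("entra_audit",),
             ("on-call rotation",), "", "team-b"),
]

print(rules_for_log_source(LIBRARY, "cloudtrail"))
print({m.rule_id: missing_fields(m) for m in LIBRARY if missing_fields(m)})
```

With 300 rules, the first lookup is the difference between a five-minute impact assessment and a day of reading queries.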
If you are evaluating detection-as-code platforms or building one, the four questions worth asking about any candidate platform are: how does it handle schema validation against actual log source field definitions? what does the test framework look like? how does rule deployment get authorized and audited? and does it support multi-SIEM compilation from a single rule source? Platforms that answer all four well are genuinely enabling; platforms that answer none of them are YAML-in-Git with a deployment script, which is better than nothing but is not a detection engineering platform.
Tools of the trade
You will not use all of these everywhere. The specific products vary by org; the categories are stable. Know the category before you know the product - you'll change products more often than you change categories.
SIEM and analytics platforms
- Microsoft Sentinel - KQL, deep Azure/Entra ID integration, fusion alerts, UEBA. Dominant in Microsoft-shop and hybrid environments.
- Splunk Enterprise Security - SPL, largest ecosystem of detection content, the most common SIEM in large enterprises. ES is a premium layer on top of core Splunk.
- Google Chronicle / Google SecOps - YARA-L, petabyte-scale at flat pricing, native GCP integration, strong for orgs already in the Google ecosystem.
- Elastic Security - EQL (Event Query Language) and KQL (here meaning Kibana Query Language, not the Kusto Query Language used in Sentinel), open ecosystem, strong Kubernetes and endpoint integration, self-hosted option.
- Panther - detection-as-code-first, Python-based rules, native S3/CloudTrail/Snowflake integration, popular with engineering-centric security teams.
- Matano, Anvilogic - detection-as-code oriented; the emerging tier.
Rule authoring and management
- SigmaHQ - the community Sigma rule repository; the first stop for "does a rule for this technique already exist?"
- pySigma / sigma-cli - the conversion toolchain that compiles Sigma to SPL, KQL, YARA-L, and others.
- Detection-as-code frameworks - custom or open-source (Sublime Security's approach, Panther's Python rules, Elastic's detection rules repo) - the plumbing for CI-based rule deployment.
Purple-team and simulation tooling
- Stratus Red Team - the de-facto cloud attack simulation tool; granular AWS, Azure, GCP, and Kubernetes techniques with clean cleanup. If you write cloud detections and haven't used Stratus, fix that this week.
- Atomic Red Team - broader ATT&CK coverage including cloud techniques; YAML-defined atomic tests.
- Pacu - AWS exploitation framework used for offensive simulation and for understanding what legitimate post-exploitation API sequences look like.
- ROADtools - Azure/Entra ID enumeration and attack simulation; essential for identity-based detection development in Microsoft environments.
Provider-native detection and telemetry
- AWS: CloudTrail, GuardDuty, Security Hub, CloudWatch Logs, VPC Flow Logs, Config.
- Azure: Activity Logs, Entra ID sign-in and audit logs, Microsoft Defender for Cloud, Defender XDR.
- GCP: Cloud Audit Logs, Security Command Center, Chronicle.
- Kubernetes: API server audit logs, Falco runtime detection, cloud-provider EKS/AKS/GKE audit logging.
Coverage mapping and documentation
- ATT&CK Navigator - the coverage map tool; use it to track which techniques have rules, which have partial coverage, which have none.
- CTID Cloud Analytics - the Center for Threat-Informed Defense's cloud-specific detection analytics; higher-quality than most community Sigma rules for cloud techniques.
- Detection documentation tooling - whatever your team uses to document detection rationale, false-positive population, tuning history, and owner. This sounds boring; its absence is how coverage drift goes undetected for six months.
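Coverage maps are easier to keep current when they are generated from the rule library rather than edited by hand. The sketch below builds a minimal ATT&CK Navigator-style layer from a coverage dictionary. The score scale and coverage labels are assumptions, and real layer files also carry version metadata and other optional fields that are omitted here, so check the layer format for the Navigator version you run.

```python
import json

# Hypothetical score scale for three coverage levels.
SCORES = {"none": 0, "partial": 50, "high": 100}


def build_layer(name, coverage):
    """Build a minimal Navigator-style layer dict from {technique_id: level}."""
    return {
        "name": name,
        "domain": "enterprise-attack",
        "techniques": [
            {"techniqueID": tid, "score": SCORES[level], "comment": level}
            for tid, level in sorted(coverage.items())
        ],
    }


layer = build_layer(
    "Cloud detection coverage",
    {"T1562.008": "high", "T1098.003": "partial", "T1578.005": "none"},
)
print(json.dumps(layer, indent=2))
```

Feeding the coverage dictionary from rule metadata keeps the map and the library from drifting apart.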
Emerging and specialist tooling
- Sublime Security - email and cloud threat detection platform built on a detection-as-code model; growing adoption in environments where email-to-cloud attack paths (OAuth phishing chains) matter.
- Retool for Security / custom dashboards - coverage visualization beyond ATT&CK Navigator; some teams build internal tooling that cross-references detection coverage against the CSPM's known asset inventory to show "this technique is detectable in theory but your coverage doesn't extend to this particular service."
- OpenTelemetry / eBPF runtime agents (Falco, Tetragon, Aqua's Tracee) - process-level telemetry for container workloads; fills the host-EDR gap for teams defending containerized environments where endpoint agents aren't viable.
- Cloud-native SOAR - Microsoft Sentinel Automation Rules, Splunk SOAR, Chronicle SOAR; automates the response side of high-confidence detections (automatic IP block, account suspension, bucket policy revert). Detection engineers who understand SOAR playbook design can close the loop between writing a detection and having an automated response, which changes the conversation about detection quality from "did it fire" to "did the response happen in time."
The multi-cloud dimension
Most detection engineers specialize in one cloud platform, but multi-cloud environments are common enough that you'll encounter cross-cloud detection requirements even if you're primarily an AWS person. The differences across providers matter operationally:
- AWS: CloudTrail is the primary control-plane source; it is mature and well-documented, and it has the largest community of Sigma rules. GuardDuty provides ML-enriched findings you can use to supplement custom rules without writing every detection from scratch. The biggest AWS detection challenge is the sheer breadth of services - over 300 - each with distinct event types. See AWS security.
- Azure: The identity plane is richer and more complex than AWS - Entra ID, conditional access, PIM, hybrid identities via Microsoft Entra Connect (formerly Azure AD Connect). Activity Logs cover the resource control plane, but Entra ID sign-in logs are a separate stream requiring separate ingestion. Microsoft's native fusion analytics in Sentinel correlate across both, which reduces rule-writing burden but requires you to understand what the fusion rules actually cover (and don't). See Azure security.
- GCP: GCP Audit Logs have the cleanest structure of the three providers and GCP IAM events are often the highest-fidelity signal (especially service account key creation). Chronicle's YARA-L is purpose-built for security analytics over log data and performs well at scale, but it has a smaller community of published detection content than Splunk or Sentinel. See GCP security.
In multi-cloud environments, the investment in Sigma pays off most: write once against the abstract schema, compile to each native language. The abstraction leaks at the edges - you'll still need to understand provider-specific field semantics to write accurate Sigma - but the compilation saves most of the translation work. The AWS vs Azure vs GCP comparison maps the conceptual equivalents across providers.
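The "abstract schema" idea can be made concrete with a small field map. The sketch below normalizes raw events from each provider into three common fields. The abstract names (`action`, `actor`, `source_ip`) are hypothetical, and the provider field paths reflect the raw JSON forms as I understand them; SIEM tables often rename or flatten these, so verify against your own ingested events.

```python
# Abstract field name -> dotted path in the provider's raw event (assumed paths).
FIELD_MAP = {
    "aws": {
        "action": "eventName",
        "actor": "userIdentity.arn",
        "source_ip": "sourceIPAddress",
    },
    "azure": {
        "action": "operationName",
        "actor": "caller",
        "source_ip": "callerIpAddress",
    },
    "gcp": {
        "action": "protoPayload.methodName",
        "actor": "protoPayload.authenticationInfo.principalEmail",
        "source_ip": "protoPayload.requestMetadata.callerIp",
    },
}


def get_path(event, path):
    """Follow a dotted path through nested dicts; return None if any part is missing."""
    current = event
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def normalize(provider, event):
    return {name: get_path(event, path) for name, path in FIELD_MAP[provider].items()}


aws_event = {
    "eventName": "StopLogging",
    "userIdentity": {"arn": "arn:aws:iam::123456789012:user/alice"},
    "sourceIPAddress": "198.51.100.7",
}
gcp_event = {
    "protoPayload": {
        "methodName": "google.iam.admin.v1.CreateServiceAccountKey",
        "authenticationInfo": {"principalEmail": "alice@example.com"},
        # requestMetadata deliberately absent to show the missing-field case
    }
}

print(normalize("aws", aws_event))
print(normalize("gcp", gcp_event))
```

The missing-field case is where the abstraction leaks: a rule that filters on `source_ip` silently matches nothing for events that never populate it.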
A practical multi-cloud prioritization: most detection engineers should be fluent in one cloud and have working reading knowledge of the other two. "Fluent" means you can look at 50 events from the primary provider and immediately identify which are suspicious without documentation. "Working reading knowledge" means you understand the conceptual equivalents and can research specifics quickly. The detection engineer who claims deep fluency in all three simultaneously is usually shallower in all three than one who went deep in one first.
How the role changes by company stage
- Startup / early-stage (you're also the SOC). You write the detections and you respond to them. The SIEM is whatever the previous person set up, the rule library is sparse, and you'll spend your first six months understanding what "normal" looks like in your environment before you can tune anything reliably. High leverage, low support, a lot of first-principles learning. If you're early enough, you get to choose the SIEM and build the detection-as-code pipeline from scratch - which is both an opportunity and a multi-month investment.
- Scale-up (detection team of 2-6). This is where detection engineering as a practice develops. You have peers to review rules, a detection-as-code pipeline worth maintaining, and enough attack surface to specialize (one person takes AWS IAM, another takes container workloads). The coverage gaps are large and visible; closing them is satisfying work. You interact regularly with the SOC and with IR, because the feedback loop between "rule fires" and "was this useful" is tight enough to act on.
- Enterprise / large tech (specialized teams, thousands of rules). The detection library is large and the primary challenge shifts from "write more rules" to "keep existing rules current and effective." Rule lifecycle management, coverage measurement, and false-positive reduction are as important as new rule development. Purple-teaming is more formalized - often a dedicated red team that runs quarterly campaigns against specific ATT&CK technique sets. Compensation is highest here; autonomy on individual rule decisions is lower because every rule change affects the SOC workflow of dozens of analysts.
Vendor vs. in-house detection teams
Beyond company size, the in-house vs. vendor distinction matters for this role more than it does for most cloud security specializations. In-house detection engineers write rules specifically for their own environment, which means they can build precise knowledge of what "normal" looks like in their particular cloud footprint. The false-positive calibration is always environment-specific and gets better over time. The trade-off is scope: you're defending one environment with one set of log sources.
Detection engineers at MSSP and MDR vendors write rules that must work across dozens or hundreds of different customer environments, which requires different design principles - rules that are robust across diverse configurations, well-documented enough for junior analysts to use in environments they've never seen, and tunable by customers without requiring deep engineering expertise. The breadth of exposure to different attack patterns and environments is genuinely educational, but the inability to deeply tune for any single environment is a real constraint. Some practitioners do a tour at an MSSP early in their career for the breadth, then move in-house for the depth.
Salary & compensation
US, 2026, base salary. Big-tech total comp runs 1.5-2x via equity and bonus. The detection engineering specialty commands a 10-15% premium over the generalist cloud security engineer at equivalent levels, driven by the narrower skill set and harder hiring market. MSSP and MDR roles typically pay 10-20% below in-house rates. Financial services and healthcare pay a meaningful premium for detection engineers who understand compliance-relevant cloud telemetry. Adjust down outside major tech hubs and well down outside the US - halve the number and add a question mark for a rough non-US estimate.
- Junior / associate (0-2 yrs): $100K-$140K. Often titled "SOC Analyst Tier 2/3," "Security Analyst," or "Threat Detection Analyst." Building the rule-writing and log-analysis fundamentals.
- Mid-level (2-5 yrs): $145K-$200K. Writing and owning detections independently. Starting to own coverage areas and run purple-team exercises.
- Senior (5-8 yrs): $195K-$260K. Owns one or more ATT&CK domains, leads purple-team campaigns, architects the detection-as-code pipeline.
- Staff / principal (8+ yrs): $250K-$350K base, $400K+ total comp at large tech. Sets detection strategy, drives cross-team purple-team programs, contributes to industry via Sigma community or conference talks.
For live data, cross-check levels.fyi (filter on "security engineer" at comparable companies), the BLS information security analysts data, and recent r/cybersecurity compensation threads. The careers salary section has the broader context across roles.
What "senior" actually means in detection engineering
The distinction between mid and senior in detection engineering is not primarily about years or the number of rules you've written. It's about systems thinking. A mid-level detection engineer writes good rules for known techniques and tunes them based on feedback. A senior detection engineer thinks about coverage as a program: they design the measurement system that tells you where the gaps are, build the purple-team cadence that validates coverage continuously, and make architectural decisions about the detection-as-code pipeline that affect the whole team's productivity. The seniors who get promoted to staff are the ones who made the team's detection capability better, not just their own rule library larger.
The interview loop for this role
Detection engineering loops are heavy on craft and simulation. Unlike the generalist loop that samples breadth, this one goes deep on a few specific skills. Expect some combination of these:
Log analysis and rule-writing exercise
The most common format: they give you a set of CloudTrail (or Activity Log, or GCP Audit Log) events and ask you to write a detection. The assessment is not just whether your query is syntactically correct - it's whether you understand the false-positive surface, whether your rule handles edge cases (what if the field is null? what if the same API call has a legitimate use at high volume?), and whether you can explain the detection rationale in terms of attacker behavior rather than just "this field equals this value."
ATT&CK coverage mapping
Walk me through your current coverage against MITRE ATT&CK Cloud. Which techniques do you have high-confidence detection for? Which are partially covered? Which are gaps, and why did you accept the gaps? This question is not testing whether you've memorized the matrix - it's testing whether you think systematically about coverage as a continuous measurement problem rather than a one-time project.
Purple-team design exercise
Design a purple-team exercise to validate detection coverage for credential-based lateral movement in AWS. Walk through: which techniques you'd simulate, which tools you'd use, what success looks like, and what you'd do with the results. Strong answers include specific Stratus Red Team scenarios, a discussion of the lab environment setup, and a plan for closing the gaps that surface.
Detection-as-code and pipeline questions
How do you manage your detection rules? How does a new rule go from idea to production in your environment? What tests run in CI? How do you detect when a deployed rule stops working? This surfaces whether you operate at engineering-team quality or SOC-analyst quality - and most detection engineering teams are looking for the former.
Behavioral and incident walk-through
Walk me through a detection you built that required significant tuning before it was useful. What was the false-positive population? How did you scope the exclusions? How did you validate you didn't break the true-positive case? This is looking for the candidate who understands that the first version of a detection is almost never the right version.
Take-home labs are common and often the highest-signal part of the loop: "Here are 48 hours of CloudTrail events from a compromised test account. Find the attack, write the detection, and explain how you'd tune it." Treat the take-home as the best single opportunity to show craft.
One underrated form of interview preparation: read ten Sigma rules for cloud ATT&CK techniques that you haven't written yourself, and work through the logic of each one. Ask yourself: what's the false-positive surface? What benign behavior would trigger this? What attacker behavior would not trigger this? The ability to critique an existing rule is at least as important as writing a new one, and it's something you can practice before any interview.
What interviewers are actually looking for
Three things, broadly. First, technical fluency: can you read a cloud log event and identify what happened, and can you write a detection query that finds the pattern you're looking for? Second, operational judgment: do you understand that detection is a tuning problem as much as a logic problem, and have you actually calibrated rules against real traffic rather than just writing theoretically correct queries? Third, the treadmill posture: do you have a practice for keeping current, and can you demonstrate that you learn new log sources and new attack techniques quickly when you encounter them? The candidates who perform best are the ones who can answer "walk me through a detection you wrote, from first reading about the technique to the rule being in production, including the tuning it required." If you don't have a real example, build one before you interview.
Portfolio projects that prove the role
Detection engineering portfolios are specific: they show detections you've written, attacks you've simulated, and coverage gaps you've measured and closed. "Built a security dashboard" is not a portfolio for this role. These are:
- Build a detection lab with a real SIEM and real attack simulations. Set up Splunk Free, Elastic, or a Sentinel trial. Ingest CloudTrail from a personal AWS account. Run Stratus Red Team against the account and write rules for each technique you simulate. Publish the rules, the ATT&CK coverage map, and the false-positive analysis for each. This is the single most effective portfolio artifact for this role.
- Walk CloudGoat scenarios and write detections for the attack path. CloudGoat is an intentionally vulnerable AWS environment. Walk the IAM privilege escalation scenarios, capture the CloudTrail events the attack generates, and write Sigma rules that would catch each step. Publish the write-up. This demonstrates both attacker understanding and detection craft.
- Contribute to SigmaHQ. Write a cloud detection rule for a MITRE ATT&CK Cloud technique that has no existing Sigma coverage. Open a pull request. A merged Sigma rule in the community repository is a public, permanent credential.
- Document detection coverage for an AWS Organization. Build the multi-account setup, turn on CloudTrail org-wide, and write the coverage documentation that maps organizational telemetry to ATT&CK techniques. Shows operational understanding of enterprise-scale cloud detection, not just lab-scale.
- Recreate a public breach kill chain. Take a public cloud breach (Capital One, Twitch, etc.) and build the detections that would have caught each step, using the technique categories the breach exposed. Publish the detection rules and the retrospective.
- Contribute to an open-source detection project. The SigmaHQ cloud rules, Elastic detection-rules, or Panther community rules all accept contributions. Contributing a cloud-specific detection rule - with proper ATT&CK mapping, accurate field references, test cases, and false-positive documentation - is a public, durable credential that signals not just technical skill but professional engagement with the community. Reviewers of your PR comment on your logic in public; use that feedback to improve and re-submit if needed. A merged contribution to a major open-source detection project is worth more in a detection engineering interview than most certifications.
- Map CNAPP findings to detection gaps. Take a CNAPP tool's finding categories and map each one to the corresponding ATT&CK technique and the detection that should cover it. This demonstrates both posture and detection thinking, shows you understand the relationship between preventive and detective controls, and produces a coverage document that looks like real work product.
The portfolio projects playbook has the full list with time estimates and how to talk about each artifact in interviews. Write up each project as a blog post, not just a GitHub repository - the write-up forces you to articulate your reasoning, surfaces gaps in your analysis, and becomes a permanent reference you can point interviewers toward.
How to talk about portfolio projects in interviews
The standard "tell me about a project you're proud of" question for detection engineers has a specific structure worth practicing. Interviewers want to hear: what attack technique you were covering (ATT&CK technique ID is good to know), what log source you used and why, what the false-positive surface was and how you scoped it, how you validated the rule with simulation, and what you'd do differently now. That's a five-part story, and rehearsing it for each portfolio artifact before the interview is worth more than any amount of additional studying. The candidate who can narrate an imperfect rule's tuning history demonstrates more craft than the one who describes a theoretically elegant rule that they never ran against real traffic.
How to break in and pivot from adjacent roles
Almost nobody enters cloud detection engineering cold. Almost everyone arrives from one of a few adjacent roles, each of which transfers a specific subset of the skills:
- SOC analyst (Tier 2/3) who has written or tuned SIEM queries. This is the most natural pivot and the one hiring managers look for first. You already understand the detection lifecycle from the response side; the gaps are cloud log source depth and detection-as-code practice. Close them with the detection lab and some Stratus Red Team runs. If you've written Splunk correlation searches or KQL analytics in Sentinel, you're already doing the technical work of detection engineering - you just need to frame it that way.
- Threat hunter. Hunters and detection engineers use overlapping skills - both read logs looking for attacker behavior, both reason from ATT&CK techniques. The gap is that hunting is ad-hoc investigation while detection engineering is systematic rule production. If you've converted a successful hunt into a repeatable alert, you've done detection engineering. Add cloud log source fluency and detection-as-code discipline.
- Cloud security engineer (generalist). If you've spent time in the detection-and-response part of the generalist role - building SIEM detections, triaging GuardDuty alerts, doing purple-team runs - you have the cloud context that takes SOC analysts months to build. The gaps are usually query language depth and rule-lifecycle discipline. Both are buildable in a few months with deliberate practice.
- Developer or data engineer comfortable in SQL and event-stream pipelines. The detection-as-code angle makes this pivot more viable than it sounds. If you're fluent in streaming data pipelines and comfortable reasoning about large-scale event data, the technical barrier to writing detection logic is lower than it seems. The gap is attacker knowledge and cloud log semantics. Fill it with CloudGoat, Stratus Red Team, and the ATT&CK matrix.
- Threat intelligence analyst. You understand attacker techniques and the threat landscape better than most defenders. The gap is turning that knowledge into operational detection rules. Learn one SIEM query language, build the detection lab, and start writing Sigma rules for the techniques you already know intimately.
The careers pivot guide covers the mechanics of the job search. The learning path and certifications guide list the credentials worth pursuing. GCIA, GCDA, and GCFE are the most relevant blue-team GIAC certifications (GIAC is the SANS-affiliated certification body); the CDIA (Certified Detection and Investigation Associate from SANS) maps most directly to this role. Cloud certifications (AWS Security Specialty, SC-200 for Sentinel, Google Professional Cloud Security Engineer) demonstrate the provider-specific context. The combination of a blue-team cert plus a cloud provider cert plus a public portfolio is the strongest resume package for this role - none of the three alone is sufficient.
One path worth naming explicitly: the detection lab-first approach. Before applying anywhere, build the lab described in the portfolio section. Spend three months running Stratus Red Team scenarios, writing Sigma rules, tuning them, publishing them, and writing up the results publicly on a blog or GitHub. That artifact is worth more in an interview than most certs, because it demonstrates you can actually do the job rather than that you've studied for a test about it. Hiring managers for detection engineering roles are practiced at distinguishing candidates who understand detection from candidates who can write about detection - and the lab is the most reliable separator.
The timeline to hireable
A realistic timeline for someone pivoting from an adjacent role (SOC analyst, cloud security engineer, or threat hunter) with dedicated part-time effort: three to four months to build the detection lab, run the key Stratus Red Team scenarios, publish two or three write-ups, and contribute one Sigma rule to a community repository. After that, you have a portfolio that can get you through the first resume screen at most organizations hiring at the mid-level. The full senior-level ramp - where you can own coverage strategy, lead purple-team programs, and design detection-as-code pipelines - typically takes two to three years of in-role experience after the initial hire. The good news is that this ramp is visible and measurable: you can track your own ATT&CK coverage improvements and purple-team validation rates as a proxy for seniority progression, which is unusual in security where skill progress is often opaque.
Where this role leads
Detection engineering is a deep specialist track with a clear IC progression, a natural branch into management for those who want it, and strong demand for the skills in adjacent roles.
- Senior / staff / principal detection engineer. The IC track extends to principal at large tech companies. Staff-level engineers set detection strategy for entire product lines or business units; principals drive industry-level work through conference talks, Sigma contributions, and threat-intel partnerships. The comp is good and the work stays technical.
- Cloud IR lead. The detection-to-response boundary blurs at senior levels. Many detection engineers who lean into the "what happens when the alert fires" question move into IR leadership roles, especially as cloud-native IR (no disk image, no memory dump, just API logs and timeline reconstruction) becomes its own discipline.
- Threat intelligence / threat research. Detection engineers who develop a strong attacker-research practice often move into threat intel roles, threat research teams at vendors, or adversary simulation teams. The output shifts from "rules that fire today" to "understanding of what attackers will do next."
- Detection engineering manager. Running a team of detection engineers is a natural management track for senior detection engineers. The job shifts from writing rules to setting coverage priorities, building the detection-as-code platform, and developing junior engineers. Compensation is comparable to staff IC at most companies.
- Cloud security generalist or architect. Detection engineers who want broader scope often transition into generalist or architect roles, bringing their deep telemetry and attacker-knowledge to the posture and design-review sides of the job.
- Security data engineering. An underappreciated exit path. Detection engineers who develop deep expertise in the data pipeline and SIEM architecture side - log collection, normalization, enrichment, and query performance - can transition into security data engineering or SecDataOps roles, which blur the line between data engineering and security and are in growing demand as SIEM data volumes scale.
One honest observation about the trajectory: detection engineering is a role where the IC track stays technically interesting well into the staff and principal levels in a way that not all security specializations do. At staff level you're setting coverage strategy for a large organization, building the measurement infrastructure that makes the strategy visible, and driving the industry-level conversation about cloud detection techniques. It's a career path where going deep pays off for a long time.
The other sibling roles worth noting for detection engineers who want adjacent exposure without leaving the specialty: CNAPP analyst, which is the preventive complement to the detection engineer's detective function, and GRC engineer, which is the compliance framing around the same telemetry. Detection engineers who develop fluency in all three - detection, posture, and compliance context - become the rare "full-spectrum cloud security practitioner" that senior IC roles at large companies are often looking for.
Common mistakes
- Writing rules without simulating attacks first. A rule that looks correct may not fire on a real attack. The only way to know is to run the attack and watch the logs. Detection engineers who skip simulation end up with a rule library that provides comfort, not coverage.
- Accepting false positives because they're technically correct detections. A rule that fires 50 times a day for a technique an attacker uses once a quarter will be disabled by the SOC within a week. False-positive discipline is not optional - it's what makes the difference between a rule that catches attackers and a rule that exists in Git.
- Treating coverage as a checklist. "We have a rule for T1078.004" is not the same as "we will detect T1078.004 in our environment at realistic attacker dwell times." Coverage is a continuous measurement, not a box-checking exercise.
- Ignoring the control-plane logs and only working from GuardDuty / Defender findings. Provider-native ML findings are useful enrichment; they are not a substitute for custom detections. The techniques that matter most are often the ones the provider's ML doesn't model, because they're too environment-specific or too new.
- Not monitoring your own rules for staleness. If you don't have automated monitoring that alerts when a rule has suspiciously low (or zero) event volume, you don't know when provider schema changes have silently broken your coverage. The rules that stop working invisibly are the most dangerous.
- Building a detection library in isolation from the SOC. The best detection engineers have a tight feedback loop with the analysts who respond to their rules. If your alerts are being disabled because they're too noisy, you're not doing detection engineering - you're generating work. Build the relationship; adjust based on what you hear.
- Avoiding the purple-team work because it's uncomfortable to find gaps. The point of purple-teaming is to find the gaps. A detection engineer who hasn't run Stratus Red Team against their own SIEM in the last quarter has a coverage map they cannot trust.
- Treating every cloud as identical in log structure. A rule that fires correctly on CloudTrail events will not automatically translate to Azure Activity Logs, because the field names, event types, and entity models are entirely different. The abstraction that Sigma provides is valuable precisely because the underlying schemas are not interchangeable.
- Skipping the data-access log tier for cost reasons. CloudTrail data events and GCP Data Access logs are expensive to ingest at volume, so many teams skip them. But S3 object reads, DynamoDB item fetches, and Lambda invocations are where a significant portion of cloud data exfiltration occurs. Skipping the data-access tier creates a coverage gap that attackers know about. The right answer is selective ingestion (high-value buckets, sensitive tables), not zero ingestion.
- Not owning the CI pipeline for detections. Detection rules that deploy without automated testing are infrastructure without tests. The CI pipeline that validates Sigma syntax, checks field name correctness against a sample schema, and runs a unit test against known-malicious and known-benign event sets is what separates a production-grade detection library from a folder of YAML files.
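The CI checks in the last item can be sketched in a few lines. This is a minimal illustration, not a Sigma implementation: it assumes a simplified, hypothetical rule format (a dict of dotted field paths to expected values) and hand-written CloudTrail-style fixtures. The helper names `get_field`, `matches`, `missing_fields`, and `run_ci_checks` are invented for the example.

```python
def get_field(event, dotted_path):
    """Resolve a dotted path like 'userIdentity.type'; return None if absent."""
    value = event
    for key in dotted_path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def matches(rule, event):
    """True when every field in the rule's selection equals the event's value."""
    return all(get_field(event, f) == v for f, v in rule["selection"].items())


def missing_fields(rule, sample_event):
    """Fields the rule references that are absent from a sample of the schema."""
    return [f for f in rule["selection"] if get_field(sample_event, f) is None]


def run_ci_checks(rule, sample_event, malicious, benign):
    """Return a list of failure messages; an empty list means the rule passes."""
    failures = [f"unknown field: {f}" for f in missing_fields(rule, sample_event)]
    failures += ["missed malicious event" for e in malicious if not matches(rule, e)]
    failures += ["fired on benign event" for e in benign if matches(rule, e)]
    return failures


# Hypothetical rule: the root user creating an IAM access key.
RULE = {
    "name": "aws_root_creates_access_key",
    "selection": {
        "eventSource": "iam.amazonaws.com",
        "eventName": "CreateAccessKey",
        "userIdentity.type": "Root",
    },
}

# Hand-written fixtures standing in for captured events.
MALICIOUS = [{"eventSource": "iam.amazonaws.com", "eventName": "CreateAccessKey",
              "userIdentity": {"type": "Root"}}]
BENIGN = [{"eventSource": "iam.amazonaws.com", "eventName": "CreateAccessKey",
           "userIdentity": {"type": "IAMUser"}}]

if __name__ == "__main__":
    print(run_ci_checks(RULE, MALICIOUS[0], MALICIOUS, BENIGN) or "rule passes")
```

In a real pipeline the fixtures would come from captured simulation output rather than hand-written dicts, and the rule format would be whatever your SIEM or Sigma backend consumes. The structure is the point: schema check, true-positive check, false-positive check, and a build that fails when any of them does.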
How AI is changing the role
Two things are happening simultaneously, and they point in different directions.
On the "AI as tool" side, the gains are real and accelerating. LLMs are competent at drafting initial Sigma rules from a technique description, translating between query languages (Sigma to KQL, KQL to SPL), explaining unfamiliar log event structures, and generating synthetic benign-event samples for rule testing. The detection engineer who uses AI tools to accelerate the mechanical parts of rule writing gets more done. But the judgment about whether a rule actually fires correctly, whether the false-positive analysis is complete, and whether the detection logic handles real attacker variation - that judgment is still yours. A confident but subtly wrong AI-generated rule is a coverage gap that looks like coverage. Review everything.
On the "AI as attack surface" side, agentic AI systems introduce new credential patterns, new data access paths, and new lateral movement techniques that don't fit neatly into existing ATT&CK categories. A model-as-a-service endpoint in your cloud environment is a new log source, a new IAM principal, and a new data exfiltration path - all at once. The detection engineer who understands how AI workloads authenticate and access data will be ahead of the curve as these workloads proliferate. The ones who wait will be writing rules for AI-specific attack techniques in response to incidents. See AI/ML security for the technical foundation.
What is not changing: the adversarial core of the job. AI can draft a rule; it cannot simulate an attack to validate the rule. It can translate a query; it cannot tell you whether the translated query handles the edge cases in your specific environment. It can suggest coverage gaps; it cannot own the decision about which gaps are acceptable. The detection engineer's judgment about what catches real attackers in your environment is not a task that automates away - it compounds over years of building the muscle.
The medium-term trajectory, honestly assessed: AI tools will make it feasible for smaller teams to maintain larger rule libraries. A three-person detection team in 2028 will likely be able to maintain coverage that a five-person team maintains today, because the mechanical translation and first-draft work will be automated. This is good news for the people in the role (they get leverage) and bad news for teams hoping to staff junior detection engineers primarily on translation and maintenance tasks (those tasks will shrink). The engineers who thrive will be those who use AI to go deeper on validation, simulation, and technique research rather than those who resist it.
One specific near-term shift worth calling out: AI-assisted detection is moving from generating rule drafts to generating behavioral analytics. LLM-based anomaly detection over cloud API call sequences - detecting that an IAM principal's API call behavior "looks different" this week - is in production at several large cloud security vendors and in early trials at in-house teams. Understanding how these ML-based analytics complement (and don't replace) rule-based detection becomes part of the senior detection engineer's mental model. The rules catch known techniques; the behavioral analytics surface unknown deviations; the detection engineer's job is to understand which is which and tune accordingly. This is new, it is evolving fast, and it is the direction the discipline is heading.
Quick answers
What does a cloud detection engineer actually do?
Writes and maintains the rules that catch attackers in cloud environments: Sigma/KQL/SPL detections, ATT&CK Cloud coverage mapping, purple-team simulations with Stratus Red Team or Atomic Red Team, detection-as-code pipelines, and rule lifecycle management. The work is code-first, not console-first.
How is it different from traditional detection?
No EDR agents on most infrastructure. Detection is from control-plane API logs, not process-creation events. Attacks are identity-based (role assumption, key abuse, OAuth consent) rather than malware-based. Log schemas differ per provider and per service. Rules silently rot when providers change event schemas. The MITRE ATT&CK Cloud matrix keeps expanding.
What query language should I learn first?
Sigma - it's vendor-neutral and compiles to everything else. Then learn the native language of your primary SIEM: KQL for Sentinel/Microsoft environments, SPL for Splunk, YARA-L for Chronicle. The investment in Sigma compounds across every SIEM migration your career will include.
Is purple-teaming required or optional?
Required, if you want to know whether your detections actually work. A coverage map without simulation evidence is a guess, not a measurement. Even a monthly thirty-minute Stratus Red Team run against a few ATT&CK techniques is better than operating on assumption.
Do I need to know how to code?
Yes, at the scripting level. Python for event manipulation, detection pipeline tooling, and lab automation. You also need Git fluency for detection-as-code workflows - rules that live only in the SIEM console don't get reviewed, versioned, or systematically maintained. You don't need to be a software engineer; you need to be comfortable shipping code.
How is this different from a SOC analyst role?
A SOC analyst responds to detections that fire; a detection engineer writes the detections that fire. SOC analysts triage alerts, investigate incidents, and escalate what they can't handle alone. Detection engineers design the rules that make that work possible - and design them so that the SOC workload is as high-signal and low-noise as possible. The feedback loop runs in both directions: detection engineers need SOC feedback on which rules are useful, and SOC analysts benefit from working directly with the people who can fix the rules that waste their time. At smaller orgs the two roles often overlap in the same person; at larger orgs they're distinct teams with a formal interface.
What's the hardest part of the job?
Tuning. Writing a rule that catches an attack is satisfying and takes maybe a few hours. Tuning that same rule so it fires at a sustainable rate in a production environment - where developers legitimately do things that look like attacks at scale - can take days, and the result is never perfect. The hardest tuning problems are identity-based techniques in active engineering organizations: an AssumeRole call from an unusual source IP is highly suspicious in some environments and completely normal in others. The judgment about where to draw the line, and the discipline to document why the line is where it is, is where most of the craft actually lives. People who expect detection engineering to be mostly "write clever rules" are often surprised by how much of the job is "understand your environment well enough to know what normal looks like."
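The AssumeRole example shows why "unusual" has to be defined per environment. The sketch below is a deliberately simple illustration of baselining, assuming CloudTrail-style events with `eventName`, `sourceIPAddress`, and `requestParameters.roleArn`. The helper names and sample events are hypothetical, and a production version would more likely baseline on networks or ASNs than on exact IPs.

```python
from collections import defaultdict


def build_baseline(events):
    """Map each assumed role ARN to the source IPs seen during a learning window."""
    baseline = defaultdict(set)
    for event in events:
        if event.get("eventName") == "AssumeRole":
            role = event["requestParameters"]["roleArn"]
            baseline[role].add(event["sourceIPAddress"])
    return baseline


def unusual_assume_role(event, baseline):
    """True when an AssumeRole call comes from an IP not yet seen for that role."""
    if event.get("eventName") != "AssumeRole":
        return False
    role = event["requestParameters"]["roleArn"]
    return event["sourceIPAddress"] not in baseline.get(role, set())


def make_event(ip, role="arn:aws:iam::111122223333:role/deploy"):
    """Build a minimal, hypothetical AssumeRole event for the example."""
    return {
        "eventName": "AssumeRole",
        "sourceIPAddress": ip,
        "requestParameters": {"roleArn": role},
    }


# Learning window: the deploy role is normally assumed from one build host.
BASELINE = build_baseline([make_event("198.51.100.10")])

if __name__ == "__main__":
    print(unusual_assume_role(make_event("198.51.100.10"), BASELINE))
    print(unusual_assume_role(make_event("203.0.113.7"), BASELINE))
```

Note that a role with no history at all is also flagged under this logic. Whether that is the right behavior depends on how often new roles appear in your environment, and that is exactly the kind of decision worth documenting alongside the rule.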
Who this role is not for
Cloud detection engineering is a genuinely great role if you love the adversarial puzzle, enjoy code, and have the disposition to maintain a system that's never "done." It is a frustrating role if:
- You want closure. The SIEM is never fully tuned. The ATT&CK coverage map is never complete. Provider schemas change, new services appear, and attacker tradecraft evolves. If you need to finish things before moving on, this role will feel like Sisyphus work. If you treat the ongoing incompleteness as the interesting part, it's one of the more intellectually stimulating jobs in security.
- You want to respond to incidents, not just build for them. The detection engineer's direct product is a rule in a pipeline; the incident investigation that rule enables is someone else's job. If you want to be in the room when the breach is live, look at cloud IR. Many practitioners do a hybrid role that covers both; few pure detection engineering roles include deep IR work.
- You prefer breadth to depth. This is a specialist track. You will get very good at a specific class of problem - cloud log analysis, rule writing, purple-teaming - and that depth is what makes you valuable. If you want to own IAM, posture, detection, and design review all at once, the generalist cloud security engineer role is a better fit for the first several years, with detection as a future specialization.
- You don't enjoy code. The detection-as-code workflow, the CI pipeline, the scripting for lab automation and telemetry analysis - these are engineering tasks. The engineers who thrive write code naturally; the ones who avoid it end up with less maintainable rule libraries and slower triage workflows. You don't need to be a software engineer, but you need to be comfortable in Git and Python as daily tools, not occasional ones.
Open-source content and community resources
Cloud detection engineering has an active open-source and community ecosystem. These are the resources worth knowing specifically for the cloud-focused practitioner - beyond the vendor documentation and formal training that the certifications guide covers.
Rule repositories and detection content
- SigmaHQ - the canonical community Sigma rule library. The cloud/ directory contains AWS, Azure, and GCP rules contributed by practitioners across the industry. Browsing and critiquing existing rules is one of the fastest ways to build detection pattern intuition.
- CTID Cloud Analytics - the Center for Threat-Informed Defense's cloud-specific analytic library; higher confidence and more rigorous ATT&CK mapping than most community Sigma rules. Start here for high-priority techniques before checking SigmaHQ.
- Elastic Detection Rules - Elastic's production rule library for EQL and KQL, including cloud-native rules. Well-documented with unit tests; useful even if you're not running Elastic.
- Microsoft Sentinel Community - the official Sentinel GitHub repository contains community analytics rules, hunting queries, and workbooks. The cloud-native rules for Azure and M365 are particularly useful for Azure-primary environments.
Attack simulation and research tooling
- Stratus Red Team - Datadog's cloud attack simulation tool; the standard for detection validation in AWS, Azure, GCP, and Kubernetes. Each technique is a standalone, reversible atomic that generates realistic logs. Run them all at least once against your lab SIEM.
- Atomic Red Team - Red Canary's atomics; broader ATT&CK coverage including cloud techniques. YAML-defined tests that can be executed with the Invoke-AtomicRedTeam framework.
- CloudGoat - intentionally vulnerable AWS environments for learning attack paths and building detection coverage. See the CloudGoat portfolio project for the structured approach to using it for detection work.
Community and continuing education
- fwd:cloudsec - the dedicated cloud security conference; detection engineering and threat research talks from the people writing the papers. The talk archive is a curriculum in itself.
- SANS CloudSecNext Summit - SANS's cloud-specific track with practitioner-led content on detection, IR, and offense.
- Blue team Slacks and Discords - the Sigma community Discord, the BlueTeamLabs Discord, and the Detection Engineering Discord are the places where practitioners share new rules, ask tuning questions, and announce schema changes faster than blog posts.
- CSOH Friday Zoom - cloud security practitioners from both customer and vendor sides; detection engineering comes up regularly. The archive of past sessions in the meeting recaps has several detection-focused discussions.
Where next
Cloud detection engineering connects deeply with several adjacent topics and roles. The links below are the highest-leverage next reads depending on which part of this page you found most interesting.
- Cloud security careers overview - the full role map this page sits inside.
- Detection engineering - the topic page with techniques, tooling, community resources, and the broader practice beyond cloud-specific telemetry.
- Cloud incident responder path - the role that uses your detections on the other side of the alert. Understanding how IR uses your detections makes you a better detection engineer.
- Cloud security engineer - the generalist role that often feeds into this specialty, and the page with the foundational cloud security context this role builds on.
- Cloud penetration tester - the offensive counterpart; understanding their techniques makes your detections sharper. The best detection engineers think like attackers.
- Portfolio projects - the detection lab and CloudGoat write-ups are the portfolio for this role. Start here if you're building toward your first detection engineering job.
- Home lab guide - how to build the free-tier SIEM + attack simulation environment this role requires for portfolio work and continuous learning.
- Certifications guide - GCIA, GCDA, AWS Security Specialty, SC-200 - which ones matter and when in a detection engineering career path.
- IAM and identity - identity is the perimeter in cloud; IAM fluency is a prerequisite for writing effective cloud detections.
- Breach kill chains - real cloud breach reconstructions with the specific API call sequences attackers used. The closest thing to a detection engineering training set available publicly.
- Friday Zoom sessions - practitioners doing this work weekly. Bring rule-writing questions, ATT&CK coverage questions, and log source questions.
- Mentorship - if you're considering the pivot into cloud detection engineering, a thirty-minute conversation with someone who's done it is the highest-leverage half hour you'll spend.