The Cloud Detection Engineer Role

Builds the detections that catch attackers in cloud - Sigma/KQL/SPL rules, MITRE ATT&CK Cloud coverage, purple-teaming, and a SIEM that never stops needing tuning.

The honest version: Cloud detection engineering is one of the most technically demanding roles in cloud security, and it never stops being so. Your job is to write rules that catch real attackers - but the log sources keep changing, the attack techniques keep evolving, the SIEM never stops needing tuning, and every new managed service your engineering org adopts is a new class of telemetry you've never written a rule for. The engineers who thrive here genuinely enjoy the adversarial puzzle. The ones who burn out thought it was mostly a SIEM-admin job.

This page is the deep version of the summary card on the careers overview. Numbers are US-centric, 2026, and approximate.

$145-260K: base, mid to senior (US)
3 providers: separate log schemas to master
ATT&CK Cloud: the coverage map that never finishes
0 EDR agents: on most cloud infrastructure you defend

On this page

  1. What a cloud detection engineer actually does
  2. Why the cloud version is a different job
  3. Cloud log sources, in depth
  4. Detecting identity-based attacks
  5. The learning treadmill, in detail
  6. A week in the life
  7. A day in the life: Wednesday at a fintech
  8. The skill stack
  9. Detection-as-code in practice
  10. Tools of the trade
  11. The multi-cloud dimension
  12. How the role changes by company stage
  13. Salary & compensation
  14. The interview loop for this role
  15. Portfolio projects that prove the role
  16. How to break in and pivot from adjacent roles
  17. Where this role leads
  18. Common mistakes
  19. Who this role is not for
  20. How AI is changing the role
  21. Quick answers
  22. Open-source content and community resources
  23. Where next

What a cloud detection engineer actually does

Strip away the job-description language and the work is this: you are responsible for making sure that when an attacker is operating inside your cloud environment, something fires. Not just a canned vendor alert that everyone ignores - a tuned, well-contextualized detection that routes to the right analyst, contains enough supporting data to investigate without a second log query, and was validated against a real attack simulation before it shipped. That sounds straightforward. It isn't.

On a normal week the work breaks into several threads running in parallel:

Notice that "responding to alerts" is not on that list. At most orgs, the detection engineer builds the detections and hands off to the SOC or IR team who respond to them. The overlap is real - you need to understand how analysts use your rules to make them useful - but the primary output is code: rules, pipelines, coverage documentation, and purple-team reports.

The ratio of new-rule-writing to tuning-existing-rules shifts heavily toward tuning as the detection library matures. A team with 50 rules spends most of its time writing new ones. A team with 500 rules spends most of its time keeping those 500 working. Neither phase is more important than the other, but candidates who've only ever worked in one phase often underestimate the other.

The scope of "what you own" also expands with seniority in ways that aren't always obvious from the outside. A junior detection engineer owns their individual rules. A senior detection engineer owns a coverage domain (identity techniques, data access techniques) end-to-end, including the purple-team validation schedule for that domain and the relationship with the IR team that uses those detections. A staff detection engineer owns the detection program's measurement framework - how coverage is quantified, how it's communicated to leadership, how prioritization decisions are made across the team. Each level adds a different kind of responsibility, not just more of the same thing.

Why the cloud version is a different job

Traditional detection engineering matured in environments where you had endpoint agents on every host, network sensors watching east-west traffic, and a relatively stable attack surface. Cloud removes most of that and replaces it with something fundamentally different. These are the twists that separate cloud detection from the version in your SANS blue-team courses.

1. You detect from the control plane, not from hosts

The most important telemetry in cloud is not what happened on a VM's filesystem - it's what happened to the cloud API. CloudTrail, Azure Activity Logs, and GCP Audit Logs record every API call made against the cloud fabric: who assumed which role, which bucket was listed, which Lambda was invoked with what parameters, which IAM policy was changed. On an ephemeral Fargate task or a serverless function, there may be no host to put an EDR agent on. The control plane is not backup telemetry - it is often the only telemetry. That means learning to read API-call logs fluently is as fundamental to this role as learning to read process-creation events is to a traditional endpoint detection engineer.

2. Every cloud provider structures logs differently - and so does every service

A Sigma rule that fires on a CloudTrail AssumeRole event is not automatically portable to an Azure Activity Log Microsoft.Authorization/roleAssignments/write event or a GCP Audit Log SetIamPolicy call. The fields have different names, different semantics, and different enrichment. At the service level the heterogeneity is even more extreme: S3 data events and CloudTrail management events use different schemas; RDS and DynamoDB emit logs through different channels; container workloads on ECS vs. EKS vs. Lambda each generate different telemetry. Every new managed service that reaches general availability is a new schema you haven't seen. You can't write a detection for it until you understand its log structure - and that understanding requires hands-on work, not just reading documentation.
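The portability problem is easier to see in code. Here is a minimal sketch, assuming trimmed sample events, of mapping a role-grant event from each provider into one common shape. The `normalize` function and the common keys (`actor`, `action`, `time`) are hypothetical, and the Azure field names in particular depend on how the Activity Log is exported, so treat them as assumptions to check against your own data.

```python
from typing import Any, Dict


def normalize(provider: str, event: Dict[str, Any]) -> Dict[str, str]:
    """Map a provider-specific audit event to a small common shape.

    The common keys are illustrative, not a standard schema.
    """
    if provider == "aws":
        # CloudTrail record
        return {
            "actor": event["userIdentity"]["arn"],
            "action": event["eventName"],
            "time": event["eventTime"],
        }
    if provider == "azure":
        # Activity Log entry; field names assumed, they vary by export path
        return {
            "actor": event["caller"],
            "action": event["operationName"],
            "time": event["eventTimestamp"],
        }
    if provider == "gcp":
        # Cloud Audit Log entry
        payload = event["protoPayload"]
        return {
            "actor": payload["authenticationInfo"]["principalEmail"],
            "action": payload["methodName"],
            "time": event["timestamp"],
        }
    raise ValueError(f"unknown provider: {provider}")
```

Even in this toy version, the actor lives at a different depth in each provider's record, and the action strings (`AssumeRole`, `Microsoft.Authorization/roleAssignments/write`, `SetIamPolicy`) have nothing in common. A rule written against one of them needs a mapping layer like this, or a rewrite, before it can run against the others.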

3. Identity-based attacks dominate over malware execution

In cloud environments, the dominant attack patterns are not "malware executed on a host" - they are "credential was obtained and then used to call APIs." Role assumption chaining, IAM key exfiltration and abuse, OAuth consent phishing, token replay, and cross-account role assumption are the techniques showing up in real breaches at scale. These attacks often leave no trace in any endpoint telemetry. They are entirely visible in control-plane logs - but only if you're querying the right fields, at the right time, with rules tuned to distinguish the attack pattern from legitimate developer and automation behavior. The false-positive surface for identity-based detections in active engineering environments is enormous, which is why this is genuinely hard to do well.

The implication for your rule library: roughly 60-70% of your cloud detection coverage should be on identity techniques (credential access, privilege escalation through IAM, lateral movement via role assumption, and persistence via policy attachment), not on execution and exfiltration techniques the way an endpoint detection library would be weighted. This is a significant rebalancing from what most SANS blue-team training prepares you for, and it requires building a detailed mental model of what "normal" IAM activity looks like in your specific environment before you can reliably distinguish attacker behavior from legitimate use.

4. Rules silently rot when event schemas change

Cloud providers change API behavior, add new event fields, deprecate old ones, and sometimes restructure entire event schemas with minimal announcement. A rule that worked last quarter may silently stop firing this quarter - not because the attack technique disappeared, but because the field it was matching on no longer exists, or has moved to a nested JSON structure, or is now empty by default. This rot is insidious because it reduces coverage without generating any visible alerts. Detection engineers in cloud need a systematic monitoring approach to detect when their own rules have gone stale - and that monitoring is itself a detection problem.
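One way to approach that monitoring, sketched under simple assumptions: track when each rule last fired and how often it normally fires, then flag rules that have been silent for several multiples of their usual interval. The `stale_rules` function, the rule IDs, and the `factor` threshold are all hypothetical. A real system would pull these values from the SIEM's alert history.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def stale_rules(
    last_fired: Dict[str, datetime],
    expected_interval: Dict[str, timedelta],
    now: datetime,
    factor: float = 3.0,
) -> List[str]:
    """Return rule IDs that have been silent for longer than
    `factor` times their usual firing interval, or have never fired."""
    stale = []
    for rule_id, interval in expected_interval.items():
        last = last_fired.get(rule_id)
        if last is None or (now - last) > interval * factor:
            stale.append(rule_id)
    return sorted(stale)
```

Silence is only a hint that something changed. A flagged rule still needs a human to check whether the schema moved or the behavior genuinely stopped.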

5. Ephemeral workloads mean your evidence window is narrow

A container that ran for ninety seconds and exfiltrated a credential is gone before the analyst opens the ticket. You either captured the evidence while it ran - in the control-plane logs, in container runtime telemetry, in VPC flow logs - or you didn't. Post-incident forensics on ephemeral workloads often consists of "here is what the API logs recorded" and nothing more. This puts pressure on detection quality before incidents happen, not just during them. Every technique gap in your ATT&CK coverage map is a place where you'll have no evidence if an attacker uses it.

This is also why detection latency matters in a way that on-prem security often didn't. A detection that fires in thirty minutes on an ephemeral workload may fire after the workload and its evidence are gone. The log-to-detection latency of your pipeline - the time from an API call happening to a SIEM alert firing - is an operational measurement worth tracking. Sub-five-minute end-to-end latency is achievable with streaming ingestion; thirty-plus minutes is common with batch ingestion pipelines, and the difference is significant for ephemeral workload attacks.
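Log-to-detection latency can be measured with very little code once you have pairs of event time and alert time. The sketch below assumes ISO 8601 timestamps in UTC. The helper names and the ten-minute SLA are illustrative, not taken from any particular SIEM.

```python
from datetime import datetime
from statistics import median
from typing import List, Tuple


def parse_ts(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def latencies_seconds(pairs: List[Tuple[str, str]]) -> List[float]:
    """Each pair is (event_time, alert_time); returns alert lag in seconds."""
    return [(parse_ts(alert) - parse_ts(event)).total_seconds() for event, alert in pairs]


def sla_breaches(lags: List[float], sla_seconds: float = 600.0) -> int:
    """Count alerts that fired later than the target latency."""
    return sum(1 for lag in lags if lag > sla_seconds)


pairs = [
    ("2026-01-07T10:00:00Z", "2026-01-07T10:04:00Z"),  # streaming-style lag
    ("2026-01-07T10:00:00Z", "2026-01-07T10:40:00Z"),  # batch-style lag
]
lags = latencies_seconds(pairs)
print(median(lags), sla_breaches(lags))
```

Tracking this per log source rather than as one global number makes it clearer which pipeline is responsible when the lag grows.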

6. Detection-as-code is the only approach that scales

A cloud environment spanning dozens of accounts, multiple regions, and several managed service categories cannot be covered by rules that someone clicked into a SIEM console. Detection-as-code - rules in Git, CI pipeline that runs unit tests against sample events, automated deployment to the SIEM - is how the coverage stays current and how changes get reviewed rather than going directly to production. This is not a "nice to have for mature teams" pattern; it's the baseline. Detection engineers who arrive from on-prem environments and treat the SIEM as a configure-in-the-UI tool run into the scale wall quickly.
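The unit-test step is the piece people most often skip, so here is a minimal sketch of it. It assumes a rule is stored as a dictionary of field-to-allowed-values selections, which is a simplification and not the Sigma specification or any SIEM's rule format. `get_path` and `matches` are hypothetical helpers.

```python
from typing import Any, Dict, Optional


def get_path(event: Dict[str, Any], dotted: str) -> Optional[Any]:
    """Follow a dotted path like 'userIdentity.type' through nested dicts."""
    cur: Any = event
    for part in dotted.split("."):
        if not isinstance(cur, dict) or part not in cur:
            return None
        cur = cur[part]
    return cur


def matches(rule: Dict[str, Any], event: Dict[str, Any]) -> bool:
    """True when every selected field holds one of its allowed values."""
    return all(
        get_path(event, field) in allowed
        for field, allowed in rule["selection"].items()
    )


# Illustrative rule: someone disabling or deleting a CloudTrail trail.
RULE = {
    "id": "cloudtrail-logging-disabled",
    "selection": {
        "eventSource": ["cloudtrail.amazonaws.com"],
        "eventName": ["StopLogging", "DeleteTrail"],
    },
}

# Sample events checked in CI: one that must fire, one that must not.
POSITIVE = {"eventSource": "cloudtrail.amazonaws.com", "eventName": "StopLogging"}
NEGATIVE = {"eventSource": "cloudtrail.amazonaws.com", "eventName": "DescribeTrails"}
```

In a pipeline, the positive and negative samples live next to the rule in Git, and a merge is blocked when a rule stops matching its own positive sample. That is also a cheap early warning for a renamed field.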

7. The ATT&CK Cloud matrix keeps growing

MITRE adds new techniques and sub-techniques to the Cloud matrix as threat researchers document real-world attacker behavior. The 2026 matrix is materially larger than the 2022 version, and it will be larger still in 2028. Unlike endpoint ATT&CK, where the technique catalog is relatively stable, cloud ATT&CK reflects an attack surface that is genuinely expanding as providers ship new services that attackers learn to exploit. Your coverage map is never finished - it only has different-sized gaps over time.

Traditional detection asks "did malware execute?" Cloud detection asks "did an identity do something it shouldn't have, in a service that may not have existed two quarters ago, generating logs you've never seen before?"

8. The absence of network layer telemetry changes investigation

In data-center environments, network captures and proxy logs are often the richest forensic source after a compromise. In serverless and managed-service architectures, there may be no useful network telemetry at all - just API calls and the data they touched. This makes the control-plane-and-identity focus not just a detection choice but an investigation constraint: if your detections don't fire on the attack, there may be no other way to reconstruct what happened. The detective pressure that traditionally existed at both the network and the host layer now exists primarily in the API call log - which is why those logs need to be complete, consistently enriched, and queryable at low latency.

Cloud log sources, in depth

If there is one thing that separates a practitioner from someone who has read blog posts about cloud detection, it is depth on the actual log sources. You cannot write good detections for events you haven't spent time reading in volume. This section covers the log sources you'll use most and the things about each that matter for detection engineering specifically.

AWS: CloudTrail and beyond

CloudTrail is the spine of AWS detection. Management events record control-plane API calls, both writes that modify resources (CreateRole, PutRolePolicy) and reads (DescribeInstances, GetSecretValue). Data events require explicit opt-in per service (S3 object-level, Lambda invocations, DynamoDB item-level) and generate orders of magnitude more volume; ingesting all of them into the SIEM without filtering will break your budget. Key CloudTrail nuances for detection: userIdentity.type distinguishes a human IAM user from an assumed role from a federated identity from an AWS service; sourceIPAddress may be an AWS service DNS name rather than an IP when an AWS service makes the call on your behalf; errorCode values like AccessDenied on reconnaissance calls are often the earliest attacker signal; requestParameters is where the operationally important details live (which bucket, which role, which instance profile) and is the field most often absent from surface-level rule descriptions.
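As an illustration of the errorCode point, the sketch below groups denied calls by principal and flags principals that were denied on several distinct APIs, which is what permission enumeration tends to look like. The set of error codes and the threshold of three are assumptions for the example. `denied_recon` is a hypothetical helper working on already-parsed CloudTrail records.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List

# Error codes treated as "permission denied"; extend for your environment.
DENIED = {"AccessDenied", "Client.UnauthorizedOperation"}


def denied_recon(
    events: Iterable[Dict[str, Any]], min_distinct_calls: int = 3
) -> Dict[str, List[str]]:
    """Return principals denied on at least `min_distinct_calls` different APIs."""
    seen = defaultdict(set)
    for e in events:
        if e.get("errorCode") in DENIED:
            seen[e["userIdentity"]["arn"]].add(e["eventName"])
    return {
        arn: sorted(names)
        for arn, names in seen.items()
        if len(names) >= min_distinct_calls
    }
```

Counting distinct API names rather than raw denials keeps a single misconfigured automation job, which retries one call repeatedly, from drowning out the pattern you care about.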

GuardDuty sits on top of CloudTrail and VPC Flow Logs and provides ML-enriched findings - but the finding rate is low by design, and GuardDuty findings are not a substitute for custom CloudTrail detections. The techniques GuardDuty doesn't cover are the ones that are most environment-specific or most recently documented; both categories are where custom rules add the most value. CloudWatch Logs, VPC Flow Logs, Route53 resolver logs, S3 server access logs, and WAF logs complete the picture for network-layer visibility. Services that emit their own log formats - RDS audit logs, Lambda CloudWatch logs, EKS audit logs, MSK access logs - each require a separate ingestion pipeline and a separate understanding of their schema.

Azure: Activity Logs, Diagnostic Logs, and Entra ID sign-in logs

Azure Activity Logs are the control-plane equivalent of CloudTrail: resource writes, role assignment changes, policy modifications at the subscription and management group level. Entra ID (formerly Azure AD) sign-in logs and audit logs are the identity plane and are often the most detection-rich source for identity attacks - conditional access failures, MFA bypass attempts, service principal credential usage from unexpected locations, and token issuance events. These are separate streams that require separate ingestion into Sentinel; a common gap is teams that ingest Activity Logs but not Entra ID audit logs, which means they have resource control-plane coverage but no identity-plane coverage.

Diagnostic Logs vary per service and must be explicitly enabled per resource type; the schema differs materially between, say, Azure SQL and Azure Kubernetes Service. Microsoft Defender for Cloud and Microsoft Sentinel integrate tightly with these sources and provide UEBA, fusion rules, and native KQL analytics. KQL is the query language throughout the Microsoft ecosystem, and deep KQL fluency - including joins across multiple tables, time-series operators, and the make-series operator for behavioral analytics - is the differentiating skill in Azure-heavy detection environments. Microsoft 365 Defender / Defender XDR provides a separate hunting surface that overlaps with Azure detections when the attacker pivots between cloud resources and M365 identities.

GCP: Cloud Audit Logs and beyond

GCP Audit Logs have four types: Admin Activity (always on, control-plane write operations), Data Access (must be enabled per API, high volume), System Event (automated GCP actions), and Policy Denied (requests rejected by a security policy). The separation between Admin Activity and Data Access is the GCP equivalent of CloudTrail management vs. data events - and the same ingestion-cost dynamic applies. GCP's resource hierarchy (organization, folder, project) means you can aggregate audit logs at the org level with a Log Router sink and send them to BigQuery, or through Pub/Sub to a SIEM. This flexibility is genuinely useful; the engineering cost of building the pipeline correctly is real and often underestimated.

Chronicle (now Google SecOps) has native GCP integration and YARA-L as its rule language, which is distinct from both KQL and SPL. YARA-L is purpose-built for security analytics over time-series event data and handles complex temporal correlations (events that happen within N minutes of each other, sequences across multiple entities) more naturally than SPL or KQL. The Google-published detection content library for GCP is less mature than the community content available for AWS in Sigma or Splunk ES, which means GCP-primary detection engineers write more from scratch. GCP IAM events are in Audit Logs under SetIamPolicy; service-account key creation (CreateServiceAccountKey) is among the highest-fidelity attack signals in a GCP environment. GCP also emits VPC Flow Logs, Cloud DNS logs, and Cloud Armor WAF logs for network-layer coverage.

Kubernetes and container logs

Kubernetes audit logs are an often-overlooked and extremely high-value detection source for cloud environments running EKS, AKS, or GKE. API server audit logs record every kubectl command, RBAC change, pod creation, service account token request, and exec into a running pod. These events are the k8s equivalent of CloudTrail management events - they tell you who called what API against the cluster control plane. The challenge is twofold: k8s audit log schemas differ between managed providers (EKS adds AWS ARN context that GKE doesn't), and the volume from active CI/CD pipelines is high enough that naively alerting on anything unusual generates constant noise. The detection engineering work here is in building baseline models of what legitimate pipeline behavior looks like and alerting on deviation.
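A minimal sketch of the baseline-and-deviation idea for one high-value event, exec into a running pod. It assumes parsed Kubernetes audit events with the standard `verb`, `objectRef`, and `user` fields. The baseline set of usernames is hypothetical and would come from observing your own pipelines. Both `create` and `get` verbs are accepted here because the verb recorded for exec can differ by client and transport, which is an assumption worth verifying on your clusters.

```python
from typing import Any, Dict, Iterable, List, Set

EXEC_VERBS = {"create", "get"}


def is_pod_exec(event: Dict[str, Any]) -> bool:
    """True for audit events that target the pods/exec subresource."""
    ref = event.get("objectRef") or {}
    return (
        event.get("verb") in EXEC_VERBS
        and ref.get("resource") == "pods"
        and ref.get("subresource") == "exec"
    )


def unexpected_execs(
    events: Iterable[Dict[str, Any]], baseline_users: Set[str]
) -> List[Dict[str, Any]]:
    """Return exec events from users outside the observed baseline."""
    return [
        e
        for e in events
        if is_pod_exec(e)
        and (e.get("user") or {}).get("username") not in baseline_users
    ]
```

The same shape works for other sensitive operations, such as RBAC changes or service account token requests, by swapping the resource and subresource checks.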

Container runtime telemetry from tools like Falco or GuardDuty Runtime Monitoring adds process-level visibility where it doesn't exist natively in control-plane logs. Falco rules are written in their own YAML-based format and provide signals like unexpected process execution in a container, filesystem writes to sensitive paths, and outbound network connections to uncommon destinations. These complement control-plane detections rather than replacing them - control-plane logs tell you the resource was created; runtime telemetry tells you what happened inside it.

Detecting identity-based attacks: the cloud detection core skill

Because identity-based attacks dominate cloud intrusions, this section goes deeper on the specific detection patterns that matter most. These are the techniques you'll encounter in purple-team exercises, in incident investigations, and in threat intelligence reports covering cloud breaches. Understanding them at the detection level - not just the conceptual level - is what separates the practitioners from the blog-post readers.

Role assumption chaining

Attackers who start with a compromised low-privilege IAM identity often chain sts:AssumeRole calls to progressively higher-privilege roles: start with the developer's IAM user, assume a deployment role, assume an admin role in the same or a different account. The challenge: legitimate CI/CD pipelines do exactly the same thing at high volume. The detection logic must distinguish a human assuming roles at unusual hours, from unusual source IPs, in an unusual sequence - not just "AssumeRole happened." Key fields: userIdentity.type, userIdentity.sessionContext.sessionIssuer, source IP, and the requestParameters.roleArn to identify which role was assumed. Cross-account role assumptions (where requestParameters.roleArn belongs to a different account than the caller) are particularly high-signal.
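The cross-account check can be sketched by comparing the caller's account with the account embedded in the target role ARN. This assumes parsed CloudTrail records and the standard IAM role ARN layout. The `trusted_accounts` allowlist is a hypothetical tuning input, and the account IDs used in examples are placeholders.

```python
from typing import Any, Dict, FrozenSet, Iterable, List, Optional, Tuple


def role_account(role_arn: str) -> Optional[str]:
    """Extract the account ID from 'arn:aws:iam::<account>:role/<name>'."""
    parts = role_arn.split(":")
    return parts[4] if len(parts) >= 6 and parts[4] else None


def cross_account_assumptions(
    events: Iterable[Dict[str, Any]],
    trusted_accounts: FrozenSet[str] = frozenset(),
) -> List[Tuple[str, str]]:
    """Return (caller_account, target_account) for AssumeRole calls that
    cross an account boundary into an account that is not allowlisted."""
    hits = []
    for e in events:
        if e.get("eventName") != "AssumeRole":
            continue
        caller = (e.get("userIdentity") or {}).get("accountId")
        target = role_account((e.get("requestParameters") or {}).get("roleArn", ""))
        if caller and target and caller != target and target not in trusted_accounts:
            hits.append((caller, target))
    return hits
```

This only finds the boundary crossing. Deciding whether it is unusual still depends on the hour, source IP, and sequence context described above.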

IAM key and token exfiltration

Access keys used from novel geographic locations, unusual ASN ranges (hosting providers vs. residential ISPs), or with API call patterns inconsistent with the key's normal usage are the most common first-signal of key exfiltration. Detection: establish a baseline of normal source IP ASN, normal API call mix, and normal time-of-day usage for each access key or role; alert on significant deviation. The false-positive surface is legitimate developer travel and new CI runners - manage it with baseline-building windows and suppression for known-good patterns rather than hard IP allowlists that expire.
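A small sketch of the baseline-then-deviate approach, limited to source ASN. It assumes events have already been flattened and enriched with an `asn` field by your pipeline; both `access_key_id` and `asn` are hypothetical field names, since CloudTrail itself does not carry ASN data.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Set


def build_baseline(events: Iterable[Dict[str, Any]]) -> Dict[str, Set[int]]:
    """Collect the set of source ASNs seen for each access key
    during the baseline-building window."""
    baseline = defaultdict(set)
    for e in events:
        baseline[e["access_key_id"]].add(e["asn"])
    return dict(baseline)


def classify(baseline: Dict[str, Set[int]], event: Dict[str, Any]) -> str:
    """Label an event 'known', 'novel', or 'unbaselined' (key never seen)."""
    seen = baseline.get(event["access_key_id"])
    if seen is None:
        return "unbaselined"
    return "known" if event["asn"] in seen else "novel"
```

Keeping "unbaselined" separate from "novel" is a deliberate choice: a key with no history needs a learning window, not an alert, while a well-known key appearing from a new ASN is the interesting case.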

OAuth consent and service principal abuse

In Azure and GCP environments, attackers increasingly abuse OAuth consent flows to obtain persistent access to Microsoft Graph, Exchange Online, or GCP APIs without ever touching managed cloud resources. The detection surface is in Entra ID audit logs (consent grant events, especially for high-privilege Graph permissions like Mail.ReadWrite or Directory.Read.All) and in GCP Audit Logs (SetIamPolicy on service accounts). Consent grants to applications outside the tenant's verified publisher list, or to applications that immediately begin reading high-value resources, are the highest-fidelity signals. This technique is frequently absent from detection libraries built primarily around CloudTrail because it is an Azure-first attack pattern.
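The consent-grant logic can be sketched as a set intersection. This example assumes grants have already been extracted from Entra ID audit logs into a flat, hypothetical shape with `app_id` and `scopes` keys. The real audit records nest this information more deeply, and the scope list and verified-app list here are placeholders you would maintain yourself.

```python
from typing import Any, Dict, Iterable, List, Set, Tuple

# Example high-privilege Microsoft Graph scopes; not an exhaustive list.
HIGH_PRIV_SCOPES = {"Mail.ReadWrite", "Directory.Read.All"}


def risky_consents(
    grants: Iterable[Dict[str, Any]], verified_app_ids: Set[str]
) -> List[Tuple[str, List[str]]]:
    """Return (app_id, risky_scopes) for grants of high-privilege scopes
    to applications that are not on the verified list."""
    out = []
    for g in grants:
        risky = sorted(HIGH_PRIV_SCOPES & set(g["scopes"]))
        if risky and g["app_id"] not in verified_app_ids:
            out.append((g["app_id"], risky))
    return out
```

Matching on application ID rather than display name matters here for the same reason it does in exclusions: display names are mutable and attackers choose them.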

Instance metadata service abuse

EC2 instance metadata service (IMDS) abuse - extracting temporary credentials from the metadata endpoint, then using them from outside the instance - was the core technique in the Capital One breach and remains common. IMDSv2 mitigates the most common SSRF-based exploitation path, but detection coverage for IMDS credential abuse is still valuable. The signal is API calls made with instance-profile credentials (userIdentity.type = AssumedRole with sessionIssuer.type = Role) from a source IP outside the known EC2 address range for the instance's region. The false-positive rate is low when the IP matching is precise; imprecise IP matching generates substantial noise.
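A sketch of the IP-matching half of this detection, using Python's standard `ipaddress` module. It relies on one assumption worth verifying in your own logs: that sessions created from an instance profile carry the instance ID (beginning with `i-`) as the session name at the end of `userIdentity.principalId`. The CIDR list is supplied by the caller, and the addresses in any examples are documentation ranges, not real AWS ranges.

```python
import ipaddress
from typing import Any, Dict, Iterable


def is_instance_session(event: Dict[str, Any]) -> bool:
    """True when the caller looks like an EC2 instance-profile session."""
    ident = event.get("userIdentity") or {}
    if ident.get("type") != "AssumedRole":
        return False
    session_name = ident.get("principalId", "").rpartition(":")[2]
    return session_name.startswith("i-")


def outside_known_ranges(source_ip: str, known_cidrs: Iterable[str]) -> bool:
    """True when source_ip is a real IP that falls outside every known CIDR.

    Non-IP values (for example an AWS service DNS name) return False,
    because they do not indicate use from outside the instance.
    """
    try:
        ip = ipaddress.ip_address(source_ip)
    except ValueError:
        return False
    return not any(ip in ipaddress.ip_network(cidr) for cidr in known_cidrs)


def imds_credential_abuse(event: Dict[str, Any], known_cidrs: Iterable[str]) -> bool:
    """Combine both checks for a single CloudTrail record."""
    return is_instance_session(event) and outside_known_ranges(
        event.get("sourceIPAddress", ""), list(known_cidrs)
    )
```

The precision of `known_cidrs` drives the false-positive rate. NAT gateway addresses, VPC endpoints, and any egress proxies all need to be in the list before this is worth alerting on.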

S3 and data store reconnaissance and exfiltration

Bucket enumeration (ListBuckets, ListObjects on unfamiliar prefixes), followed by data access from new principals or source IPs, is the canonical cloud data exfiltration pattern. The data event tier (S3 access logs, CloudTrail data events) is required for the access signal; many teams skip it for cost reasons and then have no visibility into the exfiltration phase even if they detect the reconnaissance. Detection: alert on high-volume GetObject calls from identities that have never previously accessed the bucket, especially when the caller's source IP belongs to a hosting provider ASN.
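The "never previously accessed the bucket" condition is a set-membership check against history. The sketch below assumes parsed CloudTrail S3 data events and an arbitrary volume threshold. `new_bulk_readers` is a hypothetical helper, and a production version would keep the history in a lookup table rather than replaying events.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List, Tuple

Pair = Tuple[str, str]  # (bucket name, principal ARN)


def _pair(event: Dict[str, Any]) -> Pair:
    return (event["requestParameters"]["bucketName"], event["userIdentity"]["arn"])


def new_bulk_readers(
    history: Iterable[Dict[str, Any]],
    window: Iterable[Dict[str, Any]],
    min_objects: int = 100,
) -> List[Pair]:
    """Return (bucket, principal) pairs that read at least `min_objects`
    objects in the current window and never appear in the history."""
    known = {_pair(e) for e in history if e.get("eventName") == "GetObject"}
    counts = Counter(_pair(e) for e in window if e.get("eventName") == "GetObject")
    return sorted(
        pair for pair, n in counts.items() if n >= min_objects and pair not in known
    )
```

None of this works without data events enabled on the buckets that matter, which is the cost trade-off the paragraph above describes.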

Lambda and serverless function abuse

Serverless functions running with overprivileged IAM roles are attack targets for privilege escalation via code injection or environment variable extraction. Detection relies on CloudTrail management events (UpdateFunctionCode, UpdateFunctionConfiguration, AddPermission) and Lambda invoke data events. The technique of updating a Lambda's execution role to a higher-privilege role is detectable via CloudTrail's UpdateFunctionConfiguration event when requestParameters.role changes. The technique of tampering with a Lambda's environment variables is detectable via UpdateFunctionConfiguration when requestParameters.environment is modified unexpectedly.
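A minimal sketch of the UpdateFunctionConfiguration check. It matches on the event name prefix because CloudTrail can record Lambda API calls with an API-version suffix appended to the name. The `requestParameters` keys are taken from the description above and should be confirmed against real records from your account. `risky_lambda_changes` is a hypothetical helper.

```python
from typing import Any, Dict, List

# requestParameters keys that indicate a security-relevant change.
SENSITIVE_KEYS = ("role", "environment")


def risky_lambda_changes(event: Dict[str, Any]) -> List[str]:
    """Return which sensitive settings an UpdateFunctionConfiguration
    call touched, or an empty list for any other event."""
    if not event.get("eventName", "").startswith("UpdateFunctionConfiguration"):
        return []
    params = event.get("requestParameters") or {}
    return [key for key in SENSITIVE_KEYS if key in params]
```

Presence of the key only says the caller set that field. To alert on an actual change you would compare the new value with the function's previous configuration, and exclude the deployment roles that legitimately do this on every release.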

The learning treadmill, in detail

The treadmill is real in every cloud security role, but detection engineering has its own especially relentless version. The problem is structural: your job is to have detection coverage for the attack techniques that matter, but the attack surface and the telemetry that covers it are both moving at the same time, independently, in directions you don't control.

Here is what the treadmill looks like from the inside:

How practitioners actually keep up - the detection engineers who stay current don't try to read everything. They build a system. Weekly: skim the ATT&CK navigator for new additions, check the SigmaHQ repository for new rules relevant to their environment, read the provider release notes for log format changes. Monthly: run a purple-team exercise against one ATT&CK technique category and close the gaps that surface. Quarterly: do a rule-staleness audit - replay historical benign and attack traffic through every detection and confirm the results match expectations. Community: fwd:cloudsec, SANS CloudSecNext, the CTID cloud analytics project, and the CSOH Friday sessions are where you find out what other practitioners are seeing before it hits the blog posts.

The treadmill is also why this role rewards practitioners who build systems over those who rely on personal heroics. A detection engineer who manually reviews every provider release note and every threat intel report will burn out. The ones who survive long-term have automated the triage layer: RSS feeds that filter for cloud-security-relevant content, automated staleness monitoring that flags rules for review before they silently fail, and a library structure that makes it easy to trace "what log source does this rule depend on" so that when a log format changes you can find all affected rules in seconds rather than hours.

One more element of the treadmill specific to detection engineering: the adversary learns too. As the community publishes more cloud detection content - through conference talks, Sigma rules, threat intelligence reports - sophisticated threat actors adapt their tradecraft to avoid the published detections. The timing delays between attack steps, the use of legitimate-looking source IPs, the blending of attack API calls with high-volume legitimate API traffic in the same session - these are deliberate adaptations to known detection approaches. Staying current means reading not just the defense-side literature but also the offensive research: the Permiso threat reports, the Wiz threat intelligence team findings, the cloud-attack CTF writeups, and the DFIR case studies from practitioners who handled real cloud breaches. Understanding what attackers know about your detections is the only way to reason about whether your rules would catch a prepared adversary, not just an opportunistic one.

A week in the life

This is a representative week for a senior detection engineer at a scale-up running primarily on AWS with Splunk as the SIEM. Your stack and org size will change the proportions, but the shape repeats.

Monday. Start with the weekly ATT&CK gap review. A new sub-technique was added to the cloud matrix last Thursday: T1578.005, Modify Cloud Compute Configurations. You read the technique description, pull up your CloudTrail logs and search for the relevant API calls (ModifyInstanceAttribute, ModifyNetworkInterfaceAttribute, and a handful more), and assess whether your existing instance-modification rules provide coverage or just partial coverage. They're partial - they don't catch the EBS snapshot case. Add it to the backlog.

Tuesday. Purple-team day. You're running a Stratus Red Team scenario for "Exfiltrate CloudTrail logs via S3." You detonate it in the lab account, wait to see whether the detection fires, and analyze the results. It fires - but only 40 minutes after the event, because the CloudTrail-to-S3-to-Splunk pipeline has indexing lag you didn't account for. You write up the finding: coverage exists, but the observed detection latency is 40 minutes, not the 10 you're targeting. Flag it to the SIEM team to discuss S3 notification triggers versus scheduled polling.

Wednesday. Heads-down rule work. You're writing a detection for OAuth consent grant abuse in Azure - a technique that generated two real incidents in the industry this quarter and that your Entra ID audit logs can cover. You spend the morning drafting the KQL, testing it against the last 30 days of audit logs in your Sentinel workspace, and counting the false-positive population. Twelve benign app registrations match the query; you add scoping criteria for known-good app display names, re-run, get to two, and document both as accepted baseline. The rule goes into the detection repo via PR; a colleague reviews by end of day.

Thursday. A platform team is adopting Amazon EKS for a new service. They've asked you to review the audit logging configuration before go-live. You review their Terraform, find that k8s audit logs are configured but not shipped to Splunk, and write a short requirements doc: ship audit logs, enable GuardDuty EKS Protection, and suppress these three known-noisy API paths in the Splunk transform. It's a one-hour engagement that prevents a three-month coverage gap.

Friday. Rule-maintenance pass. Automated monitoring flagged one detection that hasn't fired in 31 days - unusual for a rule that historically fires several times weekly. You investigate: an IAM field rename in a CloudTrail update from three weeks ago silently broke the match. You fix the field name, test in the lab, confirm the fix works, and ship the update. Afternoon: read the week's provider release notes (AWS: new ECS task metadata endpoint version; Azure: new Entra ID audit event for Privileged Identity Management activations; GCP: Cloud Spanner audit log schema update). One of them - the PIM activation event - is a new log type you don't have a detection for. Write the ticket.

What doesn't show up much: responding to live alerts (that's the SOC), writing compliance reports, or building out dashboards. What shows up every week without fail: reading code, writing code, reviewing code, and running simulations. The craft is in the details - a rule with a wrong field name provides zero coverage no matter how smart the logic is.

One thing that surprises people entering the role: the calendar looks much more like a software engineer's than like a SOC analyst's. You have blocks of heads-down time for rule development, code review for rules someone else wrote, engineering conversations with the platform team about log pipeline architecture, and structured purple-team sessions. The reactive alert-response cadence of SOC work is mostly absent. That's a feature for some people and a surprise for others; know which camp you're in before you interview.

A day in the life: Wednesday at a fintech running AWS and Sentinel

The weekly breakdown above is a statistical view. Here is the texture - an illustrative, composite Wednesday in the calendar of a senior detection engineer at a mid-size fintech running primarily on AWS with Sentinel as the SIEM. The specific incidents, advisories, and Slack messages are fictionalized. Treat it as a representative archetype.

7:45 - morning read. Coffee and the provider digest. AWS released two new API actions for SageMaker Unified Studio overnight. Neither is high-risk on its own, but one of them - a new role-chaining endpoint for model deployment - is the kind of thing that creates a privilege escalation path nobody has written a detection for yet. Add to the investigation queue.

8:30 - staleness alert. The automated monitoring system flagged a detection for CreateServiceLinkedRole that hasn't fired in 22 days. That's suspicious - this environment generates those events regularly. Pull up the rule, trace the field references against yesterday's CloudTrail sample. Found it: AWS changed the capitalization of the serviceLinkedRoleCreationContext field in a schema update two weeks ago. The rule is matching on the old casing. Fix, test against three known-malicious and five known-benign events, update the tuning history in the rule's metadata block, open the PR.

9:15 - rule review. A colleague opened a PR yesterday for a new detection covering Azure role assignment to privileged built-in roles from outside the tenant. You review the KQL: the logic is solid, but the exclusion for the CI/CD service principal is too broad - it excludes by display name, which is mutable, rather than by object ID, which is stable. Leave a comment, explain the risk, suggest the fix. The conversation takes three messages; the colleague updates, you approve.

10:00 - purple-team session. Monthly run with a contractor red teamer. Today's scope: T1548.005, Abuse Elevation Control Mechanism - Temporary Elevated Cloud Access. You've agreed to test whether your detection for abnormal PIM activations in Entra ID fires reliably. The red teamer activates the privileged role from a suspicious location; you watch Sentinel in real time. The rule fires in eight minutes - longer than the five-minute target but within SLA. Write up the result: rule fires, timing lag noted, recommend a priority upgrade to near-real-time evaluation. File the ticket with the log pipeline team.

11:00 - threat intel translation. A partner ISAC published a new advisory on a threat actor targeting financial services cloud environments. The TTPs section lists three techniques you don't have specific cloud-adapted coverage for. Two are straightforward translations of existing endpoint ATT&CK rules. The third - exfiltration via S3 pre-signed URLs generated from a compromised Lambda - requires new detection logic that correlates CreateFunction, InvokeFunction, and S3 data event logs within a short time window. Write the research ticket; this one will take a day to build and validate properly.
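
The correlation described above (three event types, in order, within a short window) can be prototyped in Python before committing to a SIEM query. This is a sketch under stated assumptions: the event dictionaries, the `key` field that ties events to one function, the simplified event names, and the 15-minute window are all hypothetical. Real CloudTrail event names and fields differ and need to be confirmed during telemetry research.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=15)  # assumed correlation window
# Simplified event names for illustration, not exact CloudTrail eventName values.
SEQUENCE = ["CreateFunction", "InvokeFunction", "GetObject"]


def find_sequences(events: list[dict]) -> list[str]:
    """Return correlation keys whose events contain SEQUENCE in order within WINDOW."""
    by_key: dict[str, list[dict]] = {}
    for ev in sorted(events, key=lambda e: e["time"]):
        by_key.setdefault(ev["key"], []).append(ev)

    hits = []
    for key, evs in by_key.items():
        for i, start in enumerate(evs):
            if start["name"] != SEQUENCE[0]:
                continue
            step = 1  # index of the next event name we are waiting for
            for later in evs[i + 1:]:
                if later["time"] - start["time"] > WINDOW:
                    break  # window measured from the first event has expired
                if later["name"] == SEQUENCE[step]:
                    step += 1
                    if step == len(SEQUENCE):
                        break
            if step == len(SEQUENCE):
                hits.append(key)
                break  # one hit per key is enough for an alert
    return hits
```

Choosing the correlation key is the hard part in practice, since the function's creator and the role the function runs as are usually different principals. The sketch leaves that decision to whoever populates `key`.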

1:00 - SOC office hours. Monthly check-in where the detection team and SOC leadership review which detections are generating the most work for analysts. Three rules in the top-10 most-investigated list are producing 65% of their alerts from a known-benign automated workflow. You agree to add a suppression for that workflow and schedule a tuning pass for the following week. One rule in the bottom-10 (rarely fired) is actually the most important to maintain - it covers a high-impact technique with a low base rate. Document the rationale explicitly so it doesn't get pruned in the next library review.

2:30 - SageMaker follow-up. Dig into the new SageMaker API actions from the morning. Read the AWS documentation, pull up an account with SageMaker enabled, call the new API, and watch what CloudTrail generates. The log structure is new enough that there's no Sigma rule for it. The privilege escalation path you suspected is real - the new role-chaining endpoint creates an sts:AssumeRole event with a service-specific ARN pattern that differs from standard developer-initiated assumptions. Draft an initial Sigma rule and drop a note in the Slack channel for the platform team that owns SageMaker, asking them to validate the normal usage patterns before you tune.

4:00 - documentation pass. Update the coverage navigator for the rules shipped this month. Three new techniques moved from "no coverage" to "partial coverage"; one moved from "partial" to "high confidence" after last week's successful purple-team validation. Write the coverage report for the monthly security metrics deck - leadership gets a trend line and a prioritized gap list.

5:30 - close. Log the open loops: the SageMaker rule needs production tuning after platform team input, the pre-signed URL exfiltration detection is on the backlog with a research ticket, the PIM timing lag goes to the pipeline team. Tomorrow's calendar has a detection design session for the new container workload they're spinning up. Read the CTID cloud analytics bulletin over coffee before logging off.

Total focused coding/rule-writing time: about 4 hours. Collaboration and review: about 2.5 hours. Research and reading: about 1.5 hours. Administration and documentation: about 1 hour. Every Wednesday is different in detail; the rhythm of building, reviewing, simulating, and monitoring repeats.

The skill stack

Detection engineering has a stable core that takes years to build and a moving edge that never stops. The ratio of core to edge shifts as you advance - junior engineers spend most of their energy on core fluency; senior engineers spend it on the edge and on building systems that help the team keep up.

The stable core

Build these deliberately. They don't expire, they compound, and they are what interviewers are actually testing even when the interview question sounds like it's about a specific tool.

The moving edge

Accept that this list has no fixed length. Every new managed service your org adopts, every new SIEM version that ships, and every new cloud-specific attack technique documented in public threat research extends this list. The skill is not "master the current list" - it's "have a reliable method for getting current on new items fast."

The detection lifecycle, step by step

Detection engineering has a lifecycle that most job descriptions underspecify. Understanding each step - and where the hard parts live - is more useful than a skills checklist:

  1. Technique selection. Not every ATT&CK technique deserves a rule. The ones that do share three properties: a high probability of appearing in your threat model, reliable telemetry in your environment, and a true-positive-to-false-positive ratio that the SOC can sustain. Technique selection is a risk prioritization exercise, not a completeness exercise.
  2. Telemetry research. Which log source captures this technique? What fields are populated? Are they populated consistently, or only under specific conditions? What does a benign event that triggers the same fields look like? This phase requires hands-on log analysis, not just reading documentation.
  3. Rule drafting. Write the initial logic. In Sigma first if the team uses a detection-as-code workflow; in native SIEM query language if you're prototyping quickly. Document the detection rationale, the ATT&CK technique mapping, and the expected false-positive classes.
  4. Attack simulation. Run a Stratus Red Team scenario or equivalent, confirm the rule fires, and examine the alert content. Does it contain enough context for an analyst to investigate without a second log query? Is the severity calibration correct?
  5. False-positive analysis. Run the rule against 30 days of production log data (or production-representative synthetic data). Measure the false-positive rate. Identify the benign use cases that match the rule and scope exclusions that are provably safe - that is, exclusions where the benign pattern cannot overlap with the attack pattern.
  6. Peer review. Another engineer reviews the rule, the simulation results, and the false-positive analysis. The review catches logic errors, missing edge cases, and exclusions that are too broad. This is the step most commonly skipped under time pressure and the step most commonly responsible for rules that fail in production.
  7. CI pipeline and deployment. The rule passes automated tests (syntax validation, schema validation, unit test against sample events) and deploys through the pipeline to the SIEM. The deployment is version-controlled; if the rule breaks, you can roll back.
  8. Production monitoring and feedback. Track alert volume, SOC feedback on quality, and analyst-applied exclusions. An alert that analysts are consistently dismissing as a false positive is a signal to tune the rule, not to accept the analyst behavior.
  9. Periodic re-validation. Quarterly or after major provider changes, re-run the simulation and confirm the rule still fires. Check that exclusions are still valid. Update the rule if the log schema has changed.
Figure: Detection lifecycle: from technique to production and back again. The eight stages shown are technique selection, telemetry research, rule drafting, attack simulation, FP analysis, peer review, CI/CD deploy, and production monitoring, closed by a quarterly re-validation loop (schema change? re-simulate, update, re-deploy).
Every step after "attack simulation" is as important as writing the rule itself. Most coverage failures happen in the last two steps - rules that were never re-validated after a schema change.
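
Step 5 is easier to reason about with numbers in hand. The sketch below shows one way to summarize a false-positive pass in Python, assuming a sample of events that has already been labeled during triage. The rule logic, the flattened `policy` field, and the `label` and `actor` fields are hypothetical; production logs carry no such labels and a real CloudTrail record nests these values differently.

```python
from collections import Counter


def rule_matches(event: dict) -> bool:
    """Hypothetical rule: an AttachUserPolicy call that attaches AdministratorAccess."""
    return (
        event.get("eventName") == "AttachUserPolicy"
        and event.get("policy") == "AdministratorAccess"
    )


def false_positive_report(events: list[dict]) -> dict:
    """Summarize rule hits over a labeled sample of events.

    Each event carries a triage-assigned 'label' ('malicious' or 'benign')
    and an 'actor'; both fields are assumptions made for this sketch.
    """
    hits = [e for e in events if rule_matches(e)]
    false_positives = [e for e in hits if e["label"] == "benign"]
    return {
        "hits": len(hits),
        "false_positives": len(false_positives),
        # Share of alerts that were benign; 0.0 when the rule never fired.
        "fp_share": len(false_positives) / len(hits) if hits else 0.0,
        # Grouping by actor shows which benign workflow dominates the noise.
        "fp_by_actor": Counter(e["actor"] for e in false_positives),
    }
```

The per-actor breakdown is the useful output: it shows which benign workflow dominates the noise, and that workflow is the candidate for a narrowly scoped exclusion.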

Detection-as-code in practice

Detection-as-code is not a philosophy - it is the specific set of engineering practices that makes a cloud detection library maintainable at scale. The term is used loosely enough that it's worth being precise about what a mature detection-as-code workflow actually contains.

The repository structure

A detection library in Git has: rules organized by ATT&CK tactic or by log source, a metadata schema for each rule (ATT&CK mapping, severity, log source, author, date, tuning history, false-positive classes), a test fixtures directory (sample events for unit tests, both positive and negative), and deployment configuration that maps rules to the target SIEM. The structure is opinionated and team-specific; the important thing is that it exists and is enforced, because ad-hoc organization accumulates technical debt at a rate that eventually makes the library unmanageable.

The CI pipeline

Every pull request runs: syntax validation (Sigma schema compliance, or native query parsing); unit tests that replay known-malicious and known-benign sample events and confirm correct classification; schema validation against the expected log source fields (a rule that references a field that doesn't exist in the target log source fails the check); and coverage diff (a report that shows which ATT&CK techniques gained or lost coverage). Optionally, a cost estimate for the new rule's expected query volume. The CI pipeline is what allows peer review to focus on logic and rationale rather than catching typos.
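
The schema-validation check in particular is small enough to sketch. The Python below assumes a Sigma-style rule already parsed into a dictionary, with a simplified string `logsource` and a hand-written field inventory. Both are illustrative assumptions: real Sigma rules use a `logsource` mapping and can nest selections in lists, which this sketch ignores.

```python
# Hypothetical field inventory per log source; in practice this would be
# generated from real sample events rather than maintained by hand.
KNOWN_FIELDS = {
    "aws_cloudtrail": {
        "eventName",
        "eventSource",
        "sourceIPAddress",
        "userIdentity.arn",
    },
}


def referenced_fields(detection: dict) -> set[str]:
    """Collect the field names used by the selections of a Sigma-style detection."""
    fields = set()
    for name, selection in detection.items():
        if name == "condition" or not isinstance(selection, dict):
            continue
        for key in selection:
            # Drop value modifiers such as "|contains" or "|startswith".
            fields.add(key.split("|", 1)[0])
    return fields


def unknown_fields(rule: dict) -> set[str]:
    """Return referenced fields missing from the log source inventory (case-sensitive)."""
    return referenced_fields(rule["detection"]) - KNOWN_FIELDS[rule["logsource"]]


rule = {
    "logsource": "aws_cloudtrail",
    "detection": {
        "selection": {"eventname": "CreateServiceLinkedRole"},  # wrong casing
        "condition": "selection",
    },
}
print(unknown_fields(rule))
```

Because the comparison is case-sensitive, a field whose casing has drifted from the log source fails the check before the rule reaches production.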

The deployment pipeline

Approved rules deploy automatically through a pipeline - to a staging SIEM environment first (where they can run for 24-48 hours against production-shaped traffic without alerting the SOC), then to production after a quality gate. The deployment pipeline also handles SIEM-specific compilation: Sigma rules compile to KQL, SPL, YARA-L, or EQL before deployment to the appropriate SIEM. Some teams maintain a secondary "archive" SIEM for historical queries separate from the primary alerting SIEM; the deployment pipeline handles routing.

Staleness monitoring

The most sophisticated element of a mature detection-as-code setup is automated monitoring for rule staleness. Implementation options: a daily query that checks alert volume for every active rule and flags any rule whose volume has dropped more than 80% from its trailing 30-day average; a weekly run that replays a sample of historical attack events through each rule and checks that the expected alerts fire; a provider-change monitor that watches AWS, Azure, and GCP release notes RSS feeds and tags rules whose referenced log sources or field names appear in change announcements. Not every team has all three. Every team should have at least the first one.
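
The first option, the daily volume check, can be sketched in a few lines of Python. The input shape is an assumption: `daily_counts` maps a hypothetical rule ID to its last 30 daily alert counts, which a real implementation would pull from the SIEM.

```python
from statistics import mean

DROP_THRESHOLD = 0.80  # flag rules whose volume fell by more than 80%


def stale_rules(daily_counts: dict[str, list[int]], today: dict[str, int]) -> list[str]:
    """Flag rules whose alert count today is more than 80% below the trailing average."""
    flagged = []
    for rule_id, history in daily_counts.items():
        baseline = mean(history) if history else 0
        if baseline == 0:
            # No baseline to compare against; volume checks cannot judge
            # rules that rarely fire, so they need replay tests instead.
            continue
        drop = 1 - today.get(rule_id, 0) / baseline
        if drop > DROP_THRESHOLD:
            flagged.append(rule_id)
    return sorted(flagged)
```

This check deliberately skips rules with no baseline, so it says nothing about high-impact rules with a low base rate. Those are what the replay-based second option is for.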

Rule metadata and documentation

Each rule should carry: the ATT&CK technique it covers; the log source(s) it depends on; the expected false-positive classes and why the scoping exclusions are safe; the simulation evidence (Stratus Red Team scenario name, date run, link to run log); the tuning history (what was changed, when, why); and the owner. The documentation overhead feels painful when you have 20 rules. It saves enormous time when you have 300 rules and a provider schema change requires you to find every rule that depends on the affected log source. The metadata is also what lets a new detection engineer onboard into the library and understand why rules are structured the way they are, rather than having to reverse-engineer the reasoning from the query logic.
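
The "find every rule that depends on the affected log source" lookup becomes trivial once the metadata is structured. A minimal Python sketch follows, assuming a hypothetical `RuleMetadata` record with a subset of the fields listed above. The field names and log source labels are illustrative, not a standard schema.

```python
from dataclasses import dataclass, field


@dataclass
class RuleMetadata:
    """Hypothetical per-rule metadata record; field names are illustrative."""
    rule_id: str
    attack_technique: str
    log_sources: list[str]
    owner: str
    false_positive_classes: list[str] = field(default_factory=list)
    tuning_history: list[str] = field(default_factory=list)


def rules_depending_on(library: list[RuleMetadata], log_source: str) -> list[str]:
    """Return the IDs of all rules that read from the given log source."""
    return sorted(r.rule_id for r in library if log_source in r.log_sources)


library = [
    RuleMetadata("pim-abnormal-activation", "T1548.005", ["entra_audit"], "team-a"),
    RuleMetadata("cloud-account-misuse", "T1078.004", ["aws_cloudtrail"], "team-a"),
    RuleMetadata("s3-bulk-read", "T1530", ["aws_cloudtrail", "s3_data_events"], "team-b"),
]
print(rules_depending_on(library, "aws_cloudtrail"))
```

With 300 rules, the same lookup turns a provider schema change into a short review list instead of a search through query text.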

If you are evaluating detection-as-code platforms or building one, there are four questions worth asking about any candidate platform. How does it handle schema validation against actual log source field definitions? What does the test framework look like? How does rule deployment get authorized and audited? Does it support multi-SIEM compilation from a single rule source? Platforms that answer all four well are genuinely enabling; platforms that answer none of them are YAML-in-Git with a deployment script, which is better than nothing but is not a detection engineering platform.

Tools of the trade

You will not use all of these everywhere. The specific products vary by org; the categories are stable. Know the category before you know the product - you'll change products more often than you change categories.

SIEM and analytics platforms

Rule authoring and management

Purple-team and simulation tooling

Provider-native detection and telemetry

Coverage mapping and documentation

Emerging and specialist tooling

The multi-cloud dimension

Most detection engineers specialize in one cloud platform, but multi-cloud environments are common enough that you'll encounter cross-cloud detection requirements even if you're primarily an AWS person. The differences across providers matter operationally:

In multi-cloud environments, the investment in Sigma pays off most: write once against the abstract schema, compile to each native language. The abstraction leaks at the edges - you'll still need to understand provider-specific field semantics to write accurate Sigma - but the compilation saves most of the translation work. The AWS vs Azure vs GCP comparison maps the conceptual equivalents across providers.

A practical multi-cloud prioritization: most detection engineers should be fluent in one cloud and have working reading knowledge of the other two. "Fluent" means you can look at 50 events from the primary provider and immediately identify which are suspicious without documentation. "Working reading knowledge" means you understand the conceptual equivalents and can research specifics quickly. The detection engineer who claims deep fluency in all three simultaneously is usually shallower in all three than one who went deep in one first.

How the role changes by company stage

Vendor vs. in-house detection teams

Beyond company size, the in-house vs. vendor distinction matters for this role more than it does for most cloud security specializations. In-house detection engineers write rules specifically for their own environment, which means they can build precise knowledge of what "normal" looks like in their particular cloud footprint. The false-positive calibration is always environment-specific and gets better over time. The trade-off is scope: you're defending one environment with one set of log sources.

Detection engineers at MSSP and MDR vendors write rules that must work across dozens or hundreds of different customer environments, which requires different design principles - rules that are robust across diverse configurations, well-documented enough for junior analysts to use in environments they've never seen, and tunable by customers without requiring deep engineering expertise. The breadth of exposure to different attack patterns and environments is genuinely educational, but the inability to deeply tune for any single environment is a real constraint. Some practitioners do a tour at an MSSP early in their career for the breadth, then move in-house for the depth.


Salary & compensation

US, 2026, base salary. Big-tech total comp runs 1.5-2x via equity and bonus. The detection engineering specialty commands a 10-15% premium over the generalist cloud security engineer at equivalent levels, driven by the more specialized skill set and harder hiring market. MSSP and MDR roles typically pay 10-20% below in-house rates. Financial services and healthcare pay a meaningful premium for detection engineers who understand compliance-relevant cloud telemetry. Adjust down outside major tech hubs and well down outside the US - halve the number and add a question mark for a rough non-US estimate.

For live data, cross-check levels.fyi (filter on "security engineer" at comparable companies), the BLS information security analysts data, and recent r/cybersecurity compensation threads. The careers salary section has the broader context across roles.

What "senior" actually means in detection engineering

The distinction between mid and senior in detection engineering is not primarily about years or the number of rules you've written. It's about systems thinking. A mid-level detection engineer writes good rules for known techniques and tunes them based on feedback. A senior detection engineer thinks about coverage as a program: they design the measurement system that tells you where the gaps are, build the purple-team cadence that validates coverage continuously, and make architectural decisions about the detection-as-code pipeline that affect the whole team's productivity. The seniors who get promoted to staff are the ones who made the team's detection capability better, not just their own rule library larger.

The interview loop for this role

Detection engineering loops are heavy on craft and simulation. Unlike the generalist loop that samples breadth, this one goes deep on a few specific skills. Expect some combination of these:

Log analysis and rule-writing exercise

The most common format: they give you a set of CloudTrail (or Activity Log, or GCP Audit Log) events and ask you to write a detection. The assessment is not just whether your query is syntactically correct - it's whether you understand the false-positive surface, whether your rule handles edge cases (what if the field is null? what if the same API call has a legitimate use at high volume?), and whether you can explain the detection rationale in terms of attacker behavior rather than just "this field equals this value."

ATT&CK coverage mapping

Walk me through your current coverage against MITRE ATT&CK Cloud. Which techniques do you have high-confidence detection for? Which are partially covered? Which are gaps, and why did you accept the gaps? This question is not testing whether you've memorized the matrix - it's testing whether you think systematically about coverage as a continuous measurement problem rather than a one-time project.
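
Treating coverage as a measurement means being able to compute it and diff it between two points in time. The Python sketch below assumes a hypothetical three-level confidence scale and a simple mapping from ATT&CK technique ID to level. A real program would derive this mapping from rule metadata and simulation evidence rather than from a hand-written dictionary.

```python
from collections import Counter

LEVELS = ["none", "partial", "high"]  # assumed three-level confidence scale


def coverage_summary(coverage: dict[str, str]) -> dict[str, int]:
    """Count techniques at each coverage level."""
    counts = Counter(coverage.values())
    return {level: counts.get(level, 0) for level in LEVELS}


def coverage_diff(
    before: dict[str, str], after: dict[str, str]
) -> dict[str, tuple[str, str]]:
    """Return techniques whose level changed, as {technique: (old, new)}."""
    changed = {}
    for technique in sorted(set(before) | set(after)):
        old = before.get(technique, "none")  # untracked counts as no coverage
        new = after.get(technique, "none")
        if old != new:
            changed[technique] = (old, new)
    return changed


last_month = {"T1078.004": "partial", "T1530": "none", "T1548.005": "partial"}
this_month = {"T1078.004": "partial", "T1530": "partial", "T1548.005": "high"}
print(coverage_summary(this_month))
print(coverage_diff(last_month, this_month))
```

The diff is the part that answers the interview question: it shows movement over time, including regressions, rather than a single snapshot.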

Purple-team design exercise

Design a purple-team exercise to validate detection coverage for credential-based lateral movement in AWS. Walk through: which techniques you'd simulate, which tools you'd use, what success looks like, and what you'd do with the results. Strong answers include specific Stratus Red Team scenarios, a discussion of the lab environment setup, and a plan for closing the gaps that surface.

Detection-as-code and pipeline questions

How do you manage your detection rules? How does a new rule go from idea to production in your environment? What tests run in CI? How do you detect when a deployed rule stops working? This surfaces whether you operate at engineering-team quality or SOC-analyst quality - and most detection engineering teams are looking for the former.

Behavioral and incident walk-through

Walk me through a detection you built that required significant tuning before it was useful. What was the false-positive population? How did you scope the exclusions? How did you validate you didn't break the true-positive case? This is looking for the candidate who understands that the first version of a detection is almost never the right version.

Take-home labs are common and often the highest-signal part of the loop: "Here are 48 hours of CloudTrail events from a compromised test account. Find the attack, write the detection, and explain how you'd tune it." Treat the take-home as the best single opportunity to show craft.

One underrated form of interview preparation: read ten Sigma rules for cloud ATT&CK techniques that you haven't written yourself, and work through the logic of each one. Ask yourself: what's the false-positive surface? What benign behavior would trigger this? What attacker behavior would not trigger this? The ability to critique an existing rule is at least as important as writing a new one, and it's something you can practice before any interview.

What interviewers are actually looking for

Three things, broadly. First, technical fluency: can you read a cloud log event and identify what happened, and can you write a detection query that finds the pattern you're looking for? Second, operational judgment: do you understand that detection is a tuning problem as much as a logic problem, and have you actually calibrated rules against real traffic rather than just writing theoretically correct queries? Third, the treadmill posture: do you have a practice for keeping current, and can you demonstrate that you learn new log sources and new attack techniques quickly when you encounter them? The candidates who perform best are the ones who can answer "walk me through a detection you wrote, from first reading about the technique to the rule being in production, including the tuning it required." If you don't have a real example, build one before you interview.

Portfolio projects that prove the role

Detection engineering portfolios are specific: they show detections you've written, attacks you've simulated, and coverage gaps you've measured and closed. "Built a security dashboard" is not a portfolio for this role. These are:

  1. Build a detection lab with a real SIEM and real attack simulations. Set up Splunk Free, Elastic, or a Sentinel trial. Ingest CloudTrail from a personal AWS account. Run Stratus Red Team against the account and write rules for each technique you simulate. Publish the rules, the ATT&CK coverage map, and the false-positive analysis for each. This is the single most effective portfolio artifact for this role.
  2. Walk CloudGoat scenarios and write detections for the attack path. CloudGoat is an intentionally vulnerable AWS environment. Walk the IAM privilege escalation scenarios, capture the CloudTrail events the attack generates, and write Sigma rules that would catch each step. Publish the write-up. This demonstrates both attacker understanding and detection craft.
  3. Contribute to SigmaHQ. Write a cloud detection rule for a MITRE ATT&CK Cloud technique that has no existing Sigma coverage. Open a pull request. A merged Sigma rule in the community repository is a public, permanent credential.
  4. Document detection coverage for an AWS Organization. Build the multi-account setup, turn on CloudTrail org-wide, and write the coverage documentation that maps organizational telemetry to ATT&CK techniques. Shows operational understanding of enterprise-scale cloud detection, not just lab-scale.
  5. Recreate a public breach kill chain. Take a public cloud breach (Capital One, Twitch, etc.) and build the detections that would have caught each step, using the technique categories the breach exposed. Publish the detection rules and the retrospective.
  6. Contribute to an open-source detection project. The SigmaHQ cloud rules, Elastic detection-rules, or Panther community rules all accept contributions. Contributing a cloud-specific detection rule - with proper ATT&CK mapping, accurate field references, test cases, and false-positive documentation - is a public, durable credential that signals not just technical skill but professional engagement with the community. Reviewers of your PR comment on your logic in public; use that feedback to improve and re-submit if needed. A merged contribution to a major open-source detection project is worth more in a detection engineering interview than most certifications.
  7. Map CNAPP findings to detection gaps. Take a CNAPP tool's finding categories and map each one to the corresponding ATT&CK technique and the detection that should cover it. This demonstrates both posture and detection thinking, shows you understand the relationship between preventive and detective controls, and produces a coverage document that looks like real work product.

The portfolio projects playbook has the full list with time estimates and how to talk about each artifact in interviews. Write up each project as a blog post, not just a GitHub repository - the write-up forces you to articulate your reasoning, surfaces gaps in your analysis, and becomes a permanent reference you can point interviewers toward.

How to talk about portfolio projects in interviews

The standard "tell me about a project you're proud of" question for detection engineers has a specific structure worth practicing. Interviewers want to hear: what attack technique you were covering (ATT&CK technique ID is good to know), what log source you used and why, what the false-positive surface was and how you scoped it, how you validated the rule with simulation, and what you'd do differently now. That's a five-part story, and rehearsing it for each portfolio artifact before the interview is worth more than any amount of additional studying. The candidate who can narrate an imperfect rule's tuning history demonstrates more craft than the one who describes a theoretically elegant rule that they never ran against real traffic.

How to break in and pivot from adjacent roles

Almost nobody enters cloud detection engineering cold. Almost everyone arrives from one of a few adjacent roles, each of which transfers a specific subset of the skills:

The careers pivot guide covers the mechanics of the job search. The learning path and certifications guide have the credentials worth pursuing. GCIA, GCDA, and GCFE are the most relevant blue-team GIAC certifications (GIAC is the certification body affiliated with SANS); the CDIA (Certified Detection and Investigation Associate from SANS) maps most directly to this role. Cloud certifications (AWS Security Specialty, SC-200 for Sentinel, Google Professional Cloud Security Engineer) demonstrate the provider-specific context. The combination of a blue-team cert plus a cloud provider cert plus a public portfolio is the strongest resume package for this role - none of the three alone is sufficient.

One path worth naming explicitly: the detection lab-first approach. Before applying anywhere, build the lab described in the portfolio section. Spend three months running Stratus Red Team scenarios, writing Sigma rules, tuning them, publishing them, and writing up the results publicly on a blog or GitHub. That artifact is worth more in an interview than most certs, because it demonstrates you can actually do the job rather than that you've studied for a test about it. Hiring managers for detection engineering roles are practiced at distinguishing candidates who understand detection from candidates who can write about detection - and the lab is the most reliable separator.

The timeline to hireable

A realistic timeline for someone pivoting from an adjacent role (SOC analyst, cloud security engineer, or threat hunter) with dedicated part-time effort: three to four months to build the detection lab, run the key Stratus Red Team scenarios, publish two or three write-ups, and contribute one Sigma rule to a community repository. After that, you have a portfolio that can get you through the first resume screen at most organizations hiring at the mid-level. The full senior-level ramp - where you can own coverage strategy, lead purple-team programs, and design detection-as-code pipelines - typically takes two to three years of in-role experience after the initial hire. The good news is that this ramp is visible and measurable: you can track your own ATT&CK coverage improvements and purple-team validation rates as a proxy for seniority progression, which is unusual in security where skill progress is often opaque.

Where this role leads

Detection engineering is a deep specialist track with a clear IC progression, a natural branch into management for those who want it, and strong demand for the skills in adjacent roles.

One honest observation about the trajectory: detection engineering is a role where the IC track stays technically interesting well into the staff and principal levels in a way that not all security specializations do. At staff level you're setting coverage strategy for a large organization, building the measurement infrastructure that makes the strategy visible, and driving the industry-level conversation about cloud detection techniques. It's a career path where going deep pays off for a long time.

The other sibling roles worth noting for detection engineers who want adjacent exposure without leaving the specialty: CNAPP analyst, which is the preventive complement to the detection engineer's detective function, and GRC engineer, which is the compliance framing around the same telemetry. Detection engineers who develop fluency in all three - detection, posture, and compliance context - become the rare "full-spectrum cloud security practitioner" that senior IC roles at large companies are often looking for.

Common mistakes

How AI is changing the role

Two things are happening simultaneously, and they point in different directions.

On the "AI as tool" side, the gains are real and accelerating. LLMs are competent at drafting initial Sigma rules from a technique description, translating between query languages (Sigma to KQL, KQL to SPL), explaining unfamiliar log event structures, and generating synthetic benign-event samples for rule testing. The detection engineer who uses AI tools to accelerate the mechanical parts of rule writing gets more done. But the judgment about whether a rule actually fires correctly, whether the false-positive analysis is complete, and whether the detection logic handles real attacker variation - that judgment is still yours. A confident but subtly wrong AI-generated rule is a coverage gap that looks like coverage. Review everything.

On the "AI as attack surface" side, agentic AI systems introduce new credential patterns, new data access paths, and new lateral movement techniques that don't fit neatly into existing ATT&CK categories. A model-as-a-service endpoint in your cloud environment is a new log source, a new IAM principal, and a new data exfiltration path - all at once. The detection engineer who understands how AI workloads authenticate and access data will be ahead of the curve as these workloads proliferate. The ones who wait will be writing rules for AI-specific attack techniques in response to incidents. See AI/ML security for the technical foundation.

What is not changing: the adversarial core of the job. AI can draft a rule; it cannot simulate an attack to validate the rule. It can translate a query; it cannot tell you whether the translated query handles the edge cases in your specific environment. It can suggest coverage gaps; it cannot own the decision about which gaps are acceptable. The detection engineer's judgment about what catches real attackers in your environment is not a task that automates away - it compounds over years of building the muscle.

The medium-term trajectory, honestly assessed: AI tools will make it feasible for smaller teams to maintain larger rule libraries. A three-person detection team in 2028 will likely be able to maintain coverage that a five-person team maintains today, because the mechanical translation and first-draft work will be automated. This is good news for the people in the role (they get leverage) and bad news for teams hoping to staff junior detection engineers primarily on translation and maintenance tasks (those tasks will shrink). The engineers who thrive will be those who use AI to go deeper on validation, simulation, and technique research rather than those who resist it.

One specific near-term shift worth calling out: AI-assisted detection is moving from generating rule drafts to generating behavioral analytics. LLM-based anomaly detection over cloud API call sequences - detecting that an IAM principal's API call behavior "looks different" this week - is in production at several large cloud security vendors and in early trials at in-house teams. Understanding how these ML-based analytics complement (and don't replace) rule-based detection becomes part of the senior detection engineer's mental model. The rules catch known techniques; the behavioral analytics surface unknown deviations; the detection engineer's job is to understand which is which and tune accordingly. This is new, it is evolving fast, and it is the direction the discipline is heading.

Quick answers

What does a cloud detection engineer actually do?

Writes and maintains the rules that catch attackers in cloud environments: Sigma/KQL/SPL detections, ATT&CK Cloud coverage mapping, purple-team simulations with Stratus Red Team or Atomic Red Team, detection-as-code pipelines, and rule lifecycle management. The work is code-first, not console-first.

How is it different from traditional detection?

No EDR agents on most infrastructure. Detection draws on control-plane API logs, not process-creation events. Attacks are identity-based (role assumption, key abuse, OAuth consent) rather than malware-based. Log schemas differ per provider and per service. Rules silently rot when providers change event schemas. The MITRE ATT&CK Cloud matrix keeps expanding.

What query language should I learn first?

Sigma - it's vendor-neutral and converts to the query languages of most major SIEMs. Then learn the native language of your primary SIEM: KQL for Sentinel/Microsoft environments, SPL for Splunk, YARA-L for Chronicle (now Google Security Operations). The investment in Sigma compounds across every SIEM migration your career will include.

Is purple-teaming required or optional?

Required, if you want to know whether your detections actually work. A coverage map without simulation evidence is a guess, not a measurement. Even a monthly thirty-minute Stratus Red Team run against a few ATT&CK techniques is better than operating on assumption.
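The gap between a guess and a measurement can be shown in a few lines. This sketch assumes you track two things yourself: which ATT&CK techniques have at least one rule, and which techniques had a rule actually fire during a simulation run. The rule names and results below are invented for illustration.

```python
def coverage_report(rules_by_technique, validated_techniques):
    """Split claimed coverage into simulation-backed and unverified techniques."""
    claimed = set(rules_by_technique)
    measured = claimed & set(validated_techniques)
    return {
        "claimed": sorted(claimed),
        "measured": sorted(measured),
        "unverified": sorted(claimed - measured),
    }


# Hypothetical rule inventory keyed by ATT&CK technique ID.
rules_by_technique = {
    "T1078.004": ["console_login_without_mfa"],
    "T1098.001": ["new_access_key_for_other_user"],
    "T1530": ["bulk_object_reads_from_bucket"],
}

# Techniques where a simulation produced an alert in the last run (invented).
validated_techniques = ["T1098.001", "T1530"]

report = coverage_report(rules_by_technique, validated_techniques)
print(report)
```

The "unverified" list is the useful output: it is the backlog for the next simulation run, and it keeps the coverage map from counting rules nobody has seen fire.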

Do I need to know how to code?

Yes, at the scripting level. Python for event manipulation, detection pipeline tooling, and lab automation. You also need Git fluency for detection-as-code workflows - rules that live only in the SIEM console don't get reviewed, versioned, or systematically maintained. You don't need to be a software engineer; you need to be comfortable shipping code.

How is this different from a SOC analyst role?

A SOC analyst responds to detections that fire; a detection engineer writes the detections that fire. SOC analysts triage alerts, investigate incidents, and escalate what they can't handle alone. Detection engineers design the rules that make that work possible - and design them so that the SOC workload is as high-signal and low-noise as possible. The feedback loop runs in both directions: detection engineers need SOC feedback on which rules are useful, and SOC analysts benefit from working directly with the people who can fix the rules that waste their time. At smaller orgs the two roles often overlap in the same person; at larger orgs they're distinct teams with a formal interface.

What's the hardest part of the job?

Tuning. Writing a rule that catches an attack is satisfying and takes maybe a few hours. Tuning that same rule so it fires at a sustainable rate in a production environment - where developers legitimately do things that look like attacks at scale - can take days, and the result is never perfect. The hardest tuning problems are identity-based techniques in active engineering organizations: an AssumeRole call from an unusual source IP is highly suspicious in some environments and completely normal in others. The judgment about where to draw the line, and the discipline to document why the line is where it is, is where most of the craft actually lives. People who expect detection engineering to be mostly "write clever rules" are often surprised by how much of the job is "understand your environment well enough to know what normal looks like."
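A minimal sketch of "document why the line is where it is": keep expected source ranges next to the reason each one exists, and have the rule return that reason when it stays quiet. The CIDR ranges, reasons, and function name are assumptions for illustration; `sourceIPAddress` is the CloudTrail field name, and it can hold a service DNS name instead of an IP when an AWS service makes the call.

```python
import ipaddress

# Hypothetical documented exceptions: every range carries its justification.
EXPECTED_SOURCES = [
    {"cidr": "10.20.0.0/16", "reason": "CI runners in the build VPC"},
    {"cidr": "203.0.113.0/24", "reason": "Corporate VPN egress"},
]


def classify_assume_role(event, expected=EXPECTED_SOURCES):
    """Decide whether an AssumeRole event should alert, and say why."""
    try:
        source = ipaddress.ip_address(event["sourceIPAddress"])
    except ValueError:
        # Not an IP (for example a service DNS name); route to separate review.
        return {"alert": True, "reason": "source is not an IP address"}
    for entry in expected:
        if source in ipaddress.ip_network(entry["cidr"]):
            return {"alert": False, "reason": entry["reason"]}
    return {"alert": True, "reason": "source IP outside documented ranges"}


from_ci = classify_assume_role({"sourceIPAddress": "10.20.5.9"})
from_unknown = classify_assume_role({"sourceIPAddress": "198.51.100.7"})
print(from_ci)
print(from_unknown)
```

Keeping the reason beside the range means the next engineer can tell whether an exception is still justified, rather than inheriting an allowlist nobody dares to prune.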

Who this role is not for

Cloud detection engineering is a genuinely great role if you love the adversarial puzzle, enjoy code, and have the disposition to maintain a system that's never "done." It is a frustrating role if:

Open-source content and community resources

Cloud detection engineering has an active open-source and community ecosystem. These are the resources worth knowing specifically for the cloud-focused practitioner - beyond the vendor documentation and formal training that the certifications guide covers.

Rule repositories and detection content

Attack simulation and research tooling

Community and continuing education

Where next

Cloud detection engineering connects deeply with several adjacent topics and roles. The links below are the highest-leverage next reads depending on which part of this page you found most interesting.