Cloud Vulnerability Management

Q: Why is CVSS a bad priority score on its own?

CVSS v3.1's base score measures severity assuming a vulnerability is exploited - it does not measure the probability of exploitation, the reachability of the vulnerable code in your environment, or the criticality of the affected asset. A CVSS 9.8 in a library you import but never call is operationally less urgent than a CVSS 7.5 on an internet-exposed service in your auth path. The base score also ignores temporal and environmental context, and the Temporal and Environmental metrics that could supply that context are almost never populated. Use CVSS as a severity input, not as a queue order.

Q: What is EPSS and how does it differ from CVSS?

EPSS (Exploit Prediction Scoring System), run by FIRST.org, produces a probability between 0 and 1 that a given CVE will be exploited in the wild in the next 30 days. Scores are published daily from a model trained on observed exploitation telemetry - honeypots, IDS signatures, exploit databases, social media, code repositories. CVSS asks 'how bad would this be if exploited?'; EPSS asks 'how likely is it to be exploited at all?'. The combination is far more useful than either alone - high CVSS + high EPSS is the patch-now bucket; high CVSS + EPSS near zero is what most of your backlog is.

Q: What is the CISA KEV catalog and why does it matter?

KEV (Known Exploited Vulnerabilities) is CISA's authoritative catalog of CVEs with confirmed in-the-wild exploitation. Inclusion is evidence-based, not predictive. Under Binding Operational Directive 22-01, U.S. federal civilian executive-branch agencies must remediate KEV-listed CVEs by the due date CISA sets for each entry (commonly three weeks after the entry is added). For private-sector orgs it's not mandatory, but it is the single highest-signal prioritization input there is - if a CVE is on KEV, it has actually been used by attackers somewhere, full stop. Most mature programs fast-track KEV findings ahead of normal SLAs.

Q: What is reachability analysis and why does it change the queue?

Reachability analysis examines whether the vulnerable function in a dependency is actually called from your application's executable code paths - and ideally whether those paths are reachable from an externally exposed entry point. A typical Node.js or Python project imports hundreds of transitive dependencies, and SCA flags vulnerabilities in any of them. Published studies, however, typically find that only 10-30% of those CVEs are reachable from the application's runtime. Reachable + internet-exposed is a smaller set still. Tools like Endor Labs, Backslash, Aikido, Snyk, Apiiro, and Wiz Code now compute reachability; vendors report that it collapses SCA queues by 70-90%.

Q: What is an SBOM and what does VEX add to it?

An SBOM (Software Bill of Materials) is a machine-readable inventory of every component that makes up a piece of software - direct and transitive dependencies, versions, licenses, hashes. The two dominant formats are CycloneDX (OWASP) and SPDX (Linux Foundation). Generation tools include Syft, cdxgen, and most build systems' native exporters. VEX (Vulnerability Exploitability eXchange) is a companion artifact that asserts, per CVE, whether a given product is actually affected - 'not_affected because the vulnerable function is never called' or 'affected, fix available in v2.1.4'. SBOM tells you what's inside; VEX tells you what actually matters. Together they're the foundation for any modern vendor-disclosure or downstream-consumer workflow.

Q: Agentless or agent-based vulnerability scanning for cloud hosts?

Agentless (snapshot-based) scanning - used by Wiz, Orca, Lacework, Sysdig agentless mode - takes block-storage snapshots of running VMs and scans them out-of-band. Pros: zero impact on workloads, no install friction, complete coverage of every VM the API can see, including shadow workloads. Cons: results are point-in-time (typical refresh 6-24h), and there is no runtime context (process listing, network behavior). Agent-based - CrowdStrike, Tenable Cloud Security, Qualys VMDR, Defender for Cloud, Rapid7, Sysdig agent - runs continuously and gives runtime context, but has install/upgrade overhead and only covers VMs where the agent actually runs. Most mature programs use agentless for breadth and an agent on tier-0 or high-blast-radius systems for depth.

Q: What are realistic vulnerability remediation SLAs?

A common, defensible SLA structure for internet-exposed and production systems: P0 / Critical (on KEV, OR public weaponized exploit, OR CVSS 9.0+ with high EPSS and reachable) - patch or mitigate within 24 hours; P1 / High (CVSS 7.0+ with high EPSS, OR reachable + exposed) - within 7 days; P2 / Medium - within 30 days; P3 / Low - within 90 days. Internal-only systems get longer windows. KEV additions trigger a fast-track regardless of CVSS. The SLA only works if you have an exception workflow with documented compensating controls - and the means to actually deploy patches at that cadence (which is where immutable-infrastructure golden-image pipelines pay for themselves).


Last updated 2026-05-17 · By Shawn Nunley · Vendor-neutral

The 30-second version: Vulnerability management in cloud is not about finding CVEs - every modern scanner finds tens of thousands. It is about deciding which handful to fix this week, deploying the fix at cloud-native speed, and proving - to auditors and to your own SLAs - that you actually did. The modern prioritization stack is CVSS (severity) → EPSS (exploit probability) → KEV (confirmed in-the-wild) → reachability (is the vulnerable code path actually called) → asset criticality (does it touch tier-0). Run all five and you collapse a 50,000-finding queue by 99%+.

Coverage spans SAST (your code), SCA (your dependencies), DAST/IAST (your running app), container image scanning (your artifacts), IaC scanning (your Terraform / CloudFormation), cloud host scanning (agentless or agent), runtime detection (eBPF-based, in-memory), and SBOM + VEX for what's actually exploitable. An ASPM layer increasingly consolidates all of it with reachability, owner mapping, and SLA enforcement on top.

The cloud VM problem
The CVE lifecycle
CVSS is not a priority score
The prioritization stack
SCA - software composition analysis
SAST - static application security testing
DAST & IAST
Container image scanning
IaC scanning
Cloud host vulnerability scanning
SBOM and VEX
Runtime vulnerability detection
Patch management in cloud
ASPM - Application Security Posture Management
Vulnerability disclosure & bug bounty
SLAs by severity
AWS, Azure, and GCP side-by-side
Maturity stages
Common pitfalls
Further reading
FAQ

The cloud VM problem

Hook up a modern CNAPP or vulnerability scanner to a real cloud estate and the first scan returns somewhere between five and fifty thousand findings. Most are CVEs in operating-system packages, language runtimes, transitive npm/pip/Go dependencies, and base container images that ship with hundreds of pre-installed binaries you'll never invoke. A meaningful fraction are CVSS 7.0 or higher. A much smaller fraction have public exploits. A smaller fraction still are reachable from your application code. A vanishingly small subset are reachable, sit on an asset exposed to the internet, and sit on an asset that matters.

The discipline of vulnerability management is everything between that 50,000-finding firehose and the prioritized list of fixes that will actually reduce risk in the next sprint. The legacy version of the discipline - "patch everything CVSS 7.0+ within 30 days" - does not survive contact with cloud-native workloads. It produces a backlog so large that nothing meaningful gets done, while real exploitation continues against the small subset of findings that mattered.

What changed in cloud:

Ephemeral workloads. A container with a CVE may live for 10 minutes. Patching it in place is irrelevant; rebuilding the image is what actually fixes it.
Software supply chain depth. A typical Node.js app has 1,200+ transitive dependencies. Most CVEs are in code you never call.
Shared responsibility. The cloud provider patches the hypervisor and managed-service substrate. Your scope shrinks to images, runtimes, and dependencies - but doesn't disappear.
API-driven inventory. Every resource is enumerable. Every CVE in every container in every cluster, every host AMI, every Lambda runtime - all queryable. Coverage went from "what we found" to "everything that exists."
Continuous deployment. Some teams ship hundreds of times a day. A vulnerability-management program that runs on monthly patch windows is broken by definition.

The orgs that do this well are doing risk-based vulnerability management - finding the small set of findings that matter, automating their remediation, and ignoring the rest with intent.

The CVE lifecycle

Understanding where a CVE comes from - and how long after issuance the information you depend on actually shows up - is foundational. The stages, roughly in order:

Discovery. A security researcher (internal, external, or automated) finds a flaw. They may report to the vendor directly, to a coordinator like CERT/CC, or via a bug-bounty platform.
CVE assignment. A CVE Numbering Authority (CNA) - most major vendors, plus MITRE as root - assigns a CVE-YYYY-NNNN identifier (the sequence number has four or more digits). The CVE record lives at CVE.org. As of late 2023 this is the authoritative source; NVD enrichment lags increasingly.
Vendor advisory. The affected vendor publishes an advisory with affected versions, mitigation guidance, and (ideally) a patched version. GitHub Security Advisories, Red Hat Errata, Microsoft Security Update Guide, Cisco PSIRT, etc.
NVD enrichment. NIST's National Vulnerability Database historically added CVSS scoring, CPE (affected-product) matching, and references. Since early 2024, NVD enrichment has slowed dramatically - many CVEs now wait weeks or months for full enrichment, which has pushed scanners and feeds to rely on alternative sources (vendor advisories directly, GitHub Advisory Database, OSV.dev, VulnCheck NVD++).
Scanner ingestion. Trivy, Grype, Snyk, Wiz, Tenable, Qualys, and the rest pull from a blend of NVD, GHSA, OSV, ecosystem registries (npm advisory DB, RustSec, PyPA, etc.), and vendor advisory feeds. Coverage and timeliness vary by source.
Exploit observed. Proof-of-concept code appears on GitHub or Exploit-DB; mass scanning starts; ransomware crews adopt it. CISA adds it to KEV. EPSS scores spike.
Patch deployment. Your job. The clock that matters is between exploitation-observed and patch-deployed-in-your-environment, not between CVE-assigned and patch-available.

The practical takeaway: don't depend on NVD as the single source of CVE truth in 2026. Modern scanners that blend NVD with GHSA, OSV, and vendor advisories produce materially more current results. The OSV schema in particular is becoming a de facto interchange format for ecosystem-level vulnerability data.

CVSS is not a priority score

CVSS - the Common Vulnerability Scoring System from FIRST - is the most-cited and most-misused number in vulnerability management. The 0-10 base score measures the severity of a vulnerability if it is exploited, against a hypothetical default environment. It does not measure the probability of exploitation, the reachability of the vulnerable code in your environment, or the criticality of the affected asset.

What the base score actually encodes (CVSS v3.1):

Attack vector - network, adjacent, local, physical.
Attack complexity - low or high.
Privileges required, user interaction - how much the attacker needs to already have.
Scope - whether exploitation can pivot beyond the vulnerable component.
Impact - confidentiality, integrity, availability - none / low / high.
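The sketch below implements the CVSS v3.1 base-score equations in Python. It shows how little context the base score carries. The metric weights and the Roundup rule come from the v3.1 specification; the function names are illustrative. Note that none of the inputs describe exploitation likelihood, reachability, or the affected asset.

```python
import math

# CVSS v3.1 base-metric weights, as published in the FIRST v3.1 specification.
AV = {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2}
AC = {"L": 0.77, "H": 0.44}
PR_UNCHANGED = {"N": 0.85, "L": 0.62, "H": 0.27}
PR_CHANGED = {"N": 0.85, "L": 0.68, "H": 0.5}  # privileges weigh more when scope changes
UI = {"N": 0.85, "R": 0.62}
CIA = {"H": 0.56, "L": 0.22, "N": 0.0}


def roundup(value: float) -> float:
    """v3.1 Roundup: smallest one-decimal number >= value, robust to float noise."""
    int_input = round(value * 100_000)
    if int_input % 10_000 == 0:
        return int_input / 100_000
    return (math.floor(int_input / 10_000) + 1) / 10


def base_score(vector: str) -> float:
    """Base score for a vector like 'CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H'."""
    m = dict(part.split(":") for part in vector.split("/")[1:])  # skip the 'CVSS:3.1' prefix
    changed = m["S"] == "C"
    iss = 1 - (1 - CIA[m["C"]]) * (1 - CIA[m["I"]]) * (1 - CIA[m["A"]])
    if changed:
        impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15
    else:
        impact = 6.42 * iss
    pr = (PR_CHANGED if changed else PR_UNCHANGED)[m["PR"]]
    exploitability = 8.22 * AV[m["AV"]] * AC[m["AC"]] * pr * UI[m["UI"]]
    if impact <= 0:
        return 0.0
    if changed:
        return roundup(min(1.08 * (impact + exploitability), 10))
    return roundup(min(impact + exploitability, 10))
```

For example, base_score("CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H") returns 9.8. It returns the same value whether the library sits on a tier-0 auth service or lies unused in a dev sandbox.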

CVSS also defines Temporal metrics (exploit code maturity, remediation level, report confidence) and Environmental metrics (your own modifiers for the vector and impact in your environment). In practice almost no one populates either. Every CVSS score you see in a scanner is the base score, on the default assumption that there's no patch available and the attacker has no impediments - which is wrong in both directions for most real findings.

CVSS v4.0 (released late 2023) addresses several of v3.1's worst problems. It splits the score into Base, Threat (replaces Temporal), Environmental, and Supplemental groups; introduces a more granular attack-requirements metric; and explicitly recognizes that base-only scores are incomplete. Adoption has been gradual - most scanners still emit v3.1 - but v4 vectors will become the norm over 2025-2026. Recommended reading: CVSS v4.0 specification.

The right way to use CVSS is as the severity input to a multi-factor prioritization stack, never as the queue-order on its own. Treating it as priority produces the classic failure mode: teams patch CVSS 9.8 issues in unreachable code paths while CVSS 7.5 issues in their internet-exposed auth service sit untouched.

The prioritization stack

A modern, risk-based vulnerability management program runs every finding through a stack of filters. Each one is independently useful; the combination collapses noise by 99%+ on a real estate.

CVSS

Severity - if exploited, how bad. v3.1 base score or v4.0 base/threat/environmental. The starting filter, never the only one. A CVSS 9.8 in code you never run is not a CVSS 9.8 problem.

EPSS

Exploit Prediction Scoring System - a probability (0-1) that the CVE will be exploited in the next 30 days. Published daily from a model trained on real exploitation telemetry. High CVSS + low EPSS is most of your backlog; high CVSS + high EPSS is the patch-now bucket.
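Here is a sketch of pulling EPSS scores from FIRST's public API. It assumes three things, all true at the time of writing:

- The endpoint https://api.first.org/data/v1/epss accepts a comma-separated cve parameter.
- The response carries a data array.
- The epss and percentile fields in that array are strings.

Fetching and parsing are split so the parser can be tested offline. fetch_epss and parse_epss are hypothetical helper names.

```python
import json
import urllib.parse
import urllib.request

EPSS_API = "https://api.first.org/data/v1/epss"  # FIRST's public EPSS endpoint (assumed current)


def fetch_epss(cves: list[str]) -> dict:
    """Query EPSS for a batch of CVEs via a comma-separated 'cve' parameter."""
    url = f"{EPSS_API}?{urllib.parse.urlencode({'cve': ','.join(cves)})}"
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def parse_epss(payload: dict) -> dict[str, tuple[float, float]]:
    """Map CVE ID -> (probability, percentile); the API returns both as strings."""
    return {
        row["cve"]: (float(row["epss"]), float(row["percentile"]))
        for row in payload.get("data", [])
    }
```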

KEV

CISA Known Exploited Vulnerabilities - confirmed in-the-wild exploitation. Evidence-based, not predictive. The only mandatory list for U.S. federal civilian agencies under BOD 22-01. The single highest-signal source private-sector orgs have.
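KEV is also published as a JSON feed, so the membership check is easy to automate. This is a minimal sketch with two assumptions:

- The feed URL below is still current.
- Each entry carries the cveID, dateAdded, and dueDate fields it has at the time of writing.

federal_window_days, a hypothetical helper, shows the BOD 22-01 remediation window CISA set for a given entry.

```python
import json
import urllib.request
from datetime import date

# CISA's KEV JSON feed (URL assumed current at time of writing).
KEV_URL = "https://www.cisa.gov/sites/default/files/feeds/known_exploited_vulnerabilities.json"


def fetch_kev() -> dict:
    with urllib.request.urlopen(KEV_URL, timeout=30) as resp:
        return json.load(resp)


def kev_index(catalog: dict) -> dict[str, dict]:
    """Index KEV entries by CVE ID for constant-time membership checks during triage."""
    return {entry["cveID"]: entry for entry in catalog.get("vulnerabilities", [])}


def federal_window_days(entry: dict) -> int:
    """Days between CISA adding the CVE and the BOD 22-01 due date it assigned."""
    added = date.fromisoformat(entry["dateAdded"])
    due = date.fromisoformat(entry["dueDate"])
    return (due - added).days
```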

Vendor risk scores

Tenable VPR, Qualys VRR / TruRisk, Rapid7 Real Risk, ZeroFox / vendor-proprietary scores. Composite metrics that blend CVSS, EPSS, threat-intel, and asset context. Useful inside the vendor's ecosystem; not portable across tools.

Reachability

Is the vulnerable function in this dependency actually called by your application code? Is the calling path reachable from an internet-exposed entry point? Computed by SCA tools (Endor Labs, Backslash, Snyk, Aikido, Apiiro) and CNAPPs (Wiz Code) via call-graph analysis. Collapses SCA queues by 70-90%.
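Once a call graph exists, reachability reduces to a graph-search question. This toy sketch assumes the graph has already been extracted. Real tools do that extraction with language-specific static analysis, which is the hard part. The sketch then runs a breadth-first search from externally exposed entry points to the vulnerable function. All function names in the example graph are hypothetical.

```python
from collections import deque
from typing import Optional


def find_call_path(
    call_graph: dict[str, list[str]],
    entry_points: list[str],
    vulnerable_fn: str,
) -> Optional[list[str]]:
    """Shortest call chain from any entry point to vulnerable_fn, or None if unreachable."""
    parent: dict[str, Optional[str]] = {ep: None for ep in entry_points}
    queue = deque(entry_points)
    while queue:
        fn = queue.popleft()
        if fn == vulnerable_fn:
            # Walk parent links back to the entry point to reconstruct the chain.
            path = []
            node: Optional[str] = fn
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for callee in call_graph.get(fn, []):
            if callee not in parent:  # visit each function once
                parent[callee] = fn
                queue.append(callee)
    return None
```

Take a hypothetical graph where app.handle_login calls auth.verify_token, which calls jwtlib.decode. The search returns that chain. A vulnerable function reachable only from an internal admin CLI returns None from the same entry point. Returning the path, not just a boolean, is what lets a reviewer verify the finding.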

Asset criticality

Tier-0 (auth, payments, customer data) vs internal tooling vs ephemeral worker. Sourced from CMDB, tag taxonomy, or graph-based exposure analysis. A CVE on a tier-0 system at the right CVSS/EPSS gets fast-tracked; the same finding on a dev sandbox does not.

The pragmatic combination: CVSS ≥ 7 AND (on KEV OR EPSS ≥ 0.5 OR reachable-and-exposed) is a defensible patch-now rule for most cloud estates. Add asset criticality as a tie-breaker for the resulting list. The rest of the backlog either gets normal-cadence remediation or is risk-accepted with a documented exception.
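The patch-now rule translates directly into a filter plus a sort. This minimal sketch assumes each finding has already been enriched with:

- CVSS and EPSS scores
- KEV membership
- reachability and internet exposure
- a numeric asset tier (0 = tier-0)

The Finding fields, the tier convention, and the 0.5 EPSS default are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    cve: str
    cvss: float             # base score
    epss: float             # 0-1 exploitation probability
    on_kev: bool
    reachable: bool
    internet_exposed: bool
    asset_tier: int         # hypothetical convention: 0 = tier-0 (auth, payments, customer data)


def is_patch_now(f: Finding, epss_threshold: float = 0.5) -> bool:
    """CVSS >= 7 AND (on KEV OR EPSS >= threshold OR reachable-and-exposed)."""
    return f.cvss >= 7.0 and (
        f.on_kev or f.epss >= epss_threshold or (f.reachable and f.internet_exposed)
    )


def patch_now_queue(findings: list[Finding]) -> list[Finding]:
    """Apply the rule, then order by asset criticality as the tie-breaker."""
    selected = [f for f in findings if is_patch_now(f)]
    # Most critical tier first; within a tier, KEV first, then higher EPSS, then higher CVSS.
    return sorted(selected, key=lambda f: (f.asset_tier, not f.on_kev, -f.epss, -f.cvss))
```

As written, the rule drops a KEV-listed CVE scored below 7.0. The KEV fast-track in the SLA section overrides CVSS, so a production version would likely check KEV before the CVSS gate.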

For broader threat context on what attackers are doing in cloud, see Threat Research and Breach Kill Chains.

SCA - software composition analysis

SCA scans your application's dependencies - direct and transitive - for known vulnerabilities, license compliance issues, and (increasingly) typosquats and malicious packages. The dominant input is your lockfile (package-lock.json, yarn.lock, Pipfile.lock, go.sum, Cargo.lock, etc.) or your built artifact.

The tooling landscape:

Developer-first - Snyk, Dependabot, GitHub Advanced Security, Renovate. Auto-fix PRs, IDE plugins, fast feedback in the dev loop.
Enterprise SCA - Sonatype Lifecycle / Nexus IQ, Mend (WhiteSource), Black Duck, FOSSA, Veracode SCA. Deeper license and policy controls, artifact-repository integration, broad enterprise feature surface.
Reachability-aware SCA - Endor Labs, Backslash, Aikido, Apiiro, Snyk reachability, Wiz Code. Compute whether vulnerable code paths are actually called from your application.
Malicious-package detection - Socket, Phylum, GitGuardian, Sonatype Firewall. Behavioral analysis of new package versions for install scripts, network calls, credential access - catches the supply-chain attacks that don't have a CVE yet.
Open source - Trivy (also scans images), Grype, OWASP Dependency-Check, OSV-Scanner from Google.

SCA without reachability is mostly noise on a modern JavaScript or Python codebase. The ratio of "CVE flagged in lockfile" to "CVE actually reachable" is typically 5:1 to 10:1; for transitive-only dependencies it can be 20:1 or worse. Reachability is the single most leveraged feature in modern SCA.
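Enumerating the lockfile shows why finding counts balloon: every transitive package appears, not just the ones you declared. This is a sketch for npm lockfileVersion 2/3 files, which list installed packages under a packages map keyed by node_modules path. It produces a request body for OSV.dev's /v1/querybatch endpoint. The helper names are hypothetical, and the sketch does no reachability filtering.

```python
import json


def load_lock(path: str = "package-lock.json") -> dict:
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def npm_lock_packages(lock: dict) -> list[tuple[str, str]]:
    """(name, version) for every installed package in a lockfileVersion 2/3 file.
    Keys look like 'node_modules/a' or 'node_modules/a/node_modules/@scope/b'."""
    found = set()
    for path, meta in lock.get("packages", {}).items():
        # Skip the root project (key ""), workspace folders, and version-less link entries.
        if "node_modules/" not in path or "version" not in meta:
            continue
        name = path.rsplit("node_modules/", 1)[-1]  # innermost package, keeps @scope/
        found.add((name, meta["version"]))
    return sorted(found)


def osv_batch_query(packages: list[tuple[str, str]]) -> dict:
    """Request body for OSV.dev's POST /v1/querybatch endpoint."""
    return {
        "queries": [
            {"package": {"name": name, "ecosystem": "npm"}, "version": version}
            for name, version in packages
        ]
    }
```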

SAST - static application security testing

SAST scans your own source code for security flaws - SQL injection, XSS, insecure deserialization, hardcoded secrets, weak crypto, IDOR patterns, and so on. The technique is taint analysis (track untrusted input to sensitive sinks) plus pattern matching plus, increasingly, LLM-assisted triage.

The tooling:

Rules-first, developer-friendly - Semgrep (open core; community + enterprise rules). Fast, customizable, low false-positive rate when rules are tuned. The pragmatic default.
Deep dataflow - CodeQL (GitHub). Treats code as a database; queryable with QL. Powers GitHub Advanced Security. Highest-quality dataflow analysis available.
Enterprise SAST - Checkmarx, Fortify (OpenText), Veracode, Coverity. Broad language support, compliance reporting, deep dataflow, slow scans, traditional procurement.
Code-quality crossover - SonarQube / SonarCloud, Snyk Code. Lighter-weight than the enterprise tier, deeper than linters.
Language-specific - Bandit (Python), gosec (Go), ESLint security plugins (JS/TS), cargo-audit / cargo-deny (Rust), Brakeman (Rails).
Secrets in code - Gitleaks, TruffleHog, GitHub secret scanning, GitGuardian. Run pre-commit, in CI, and on the historical git log.

SAST's failure mode is false positives. A scanner that flags 500 issues per PR teaches developers to ignore the tool. Two practical levers: tune the ruleset to your stack, and put the SAST gate in CI on changed files only - not on the whole codebase on every run. Modern tools (Semgrep especially) make incremental scanning the default.
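One way to implement the changed-files-only gate is sketched below in Python, around git and the Semgrep CLI. It makes three assumptions:

- The base branch is origin/main.
- The extension list matches your stack.
- p/default stands in for your tuned ruleset.

Semgrep also offers a baseline-commit mode that reports only findings introduced since a given commit; that may be simpler where it is available.

```python
import subprocess

SCANNED_EXTENSIONS = (".py", ".js", ".ts", ".go", ".java")  # assumption: tune to your stack


def changed_files(base_ref: str = "origin/main") -> list[str]:
    """Files added, copied, modified, or renamed on this branch since the merge base."""
    out = subprocess.run(
        ["git", "diff", "--name-only", "--diff-filter=ACMR", f"{base_ref}...HEAD"],
        check=True, capture_output=True, text=True,
    ).stdout
    return [line for line in out.splitlines() if line]


def scannable(paths: list[str]) -> list[str]:
    """Keep only source files the SAST ruleset covers."""
    return [p for p in paths if p.endswith(SCANNED_EXTENSIONS)]


def run_sast(paths: list[str], config: str = "p/default") -> int:
    """Run Semgrep on just these files; --error makes findings fail the CI step."""
    if not paths:
        return 0  # nothing relevant changed
    return subprocess.run(["semgrep", "scan", "--config", config, "--error", *paths]).returncode
```

In CI, the step would call run_sast(scannable(changed_files())) and exit with its return code.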

DAST & IAST

DAST (Dynamic Application Security Testing) probes a running application from the outside - sends crafted requests, looks for injection, broken auth, XXE, misconfigurations, exposed admin endpoints. It finds things SAST can't see (the actual deployed config, the runtime behavior of the assembled system) and misses things SAST does see (unreachable code paths, business-logic flaws).

Open source - OWASP ZAP, Nikto, Wapiti. ZAP is the pragmatic default for in-pipeline DAST.
Manual + automated - Burp Suite. The pentester's tool of choice; Burp Enterprise scales to automation.
Enterprise DAST - Invicti (Netsparker), Acunetix, Veracode DAST, Synopsys WhiteHat.
Modern, CI-friendly - StackHawk, Bright Security, Detectify. Built to run in pipelines against staging or preview environments.
API-specialized - Salt Security, Noname (now Akamai), 42Crunch, Traceable. API discovery + DAST + runtime protection.

IAST (Interactive Application Security Testing) instruments the running app (typically with an agent or runtime hook) and observes how requests flow through the code. Hybrid of SAST and DAST. Vendors include Contrast Security, Veracode IAST, Synopsys Seeker. Niche but valuable for high-stakes apps where the SAST/DAST overlap leaves blind spots.

Container image scanning

Containers concentrate the cloud's vulnerability surface - every base image carries an OS package set, every layer adds binaries, every COPY pulls application dependencies. Image scanning is mostly OS-package CVE matching plus SCA on detected language manifests, with optional malware, secrets, and misconfiguration checks layered on.

Where to scan:

Build-time - in CI, after the image is built but before it's pushed. Fail the build on policy violations. The earliest, cheapest place to catch a problem.
Registry - scan every image stored in your registry, re-scan when vulnerability DBs update. The catch-net for images already in flight.
Admission controller - block deploy of non-conformant images at the Kubernetes API. Final gate before production.
Runtime - continuously assess what's actually running, including images that bypassed earlier gates.
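Trivy can already fail a build natively with --exit-code 1 and --severity CRITICAL,HIGH. A custom gate is useful when the policy needs extra context. This minimal sketch assumes a report produced by trivy image --format json --output report.json. It blocks on CRITICAL/HIGH findings that have a fixed version available, plus anything in a caller-supplied KEV set. The policy itself is an illustrative choice.

```python
import json
from collections.abc import Collection

BLOCKING_SEVERITIES = {"CRITICAL", "HIGH"}


def blocking_findings(report: dict, kev: Collection[str] = frozenset()) -> list[str]:
    """Findings that should fail the build: CRITICAL/HIGH with a fix available,
    or any CVE in the supplied KEV set regardless of severity."""
    hits = []
    for result in report.get("Results") or []:
        for vuln in result.get("Vulnerabilities") or []:  # key is absent when clean
            vuln_id = vuln["VulnerabilityID"]
            fixable = bool(vuln.get("FixedVersion"))
            if (vuln.get("Severity") in BLOCKING_SEVERITIES and fixable) or vuln_id in kev:
                hits.append(f"{vuln_id} {vuln['PkgName']}@{vuln['InstalledVersion']}")
    return hits


def gate(report_path: str, kev: Collection[str] = frozenset()) -> int:
    """Print blocking findings and return a CI exit code (1 = fail)."""
    with open(report_path, encoding="utf-8") as fh:
        hits = blocking_findings(json.load(fh), kev)
    for hit in hits:
        print("BLOCKING:", hit)
    return 1 if hits else 0
```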

The tooling:

Open source - Trivy (Aqua), Grype (Anchore), Clair (Red Hat / Quay). Trivy is the de facto default in CI pipelines.
Commercial CNAPP - Wiz, Orca, Lacework, Sysdig, Aqua, Prisma Cloud. Scan registries plus running clusters, correlate with cloud context.
SCA-led with container coverage - Snyk Container, Anchore Enterprise.
Cloud-native - AWS Inspector for ECR, Defender for Containers, GCP Artifact Analysis. Built into the registry; lowest-friction starting point.

The single highest-leverage move in container vulnerability management is base-image discipline. Switching from a fat base (Ubuntu, Debian full) to a minimal base (distroless, Alpine, Chainguard / Wolfi) typically drops the CVE count per image by 80-95%. The fix is structural, not patch-by-patch. See the Containers page for the broader practice.

IaC scanning

Infrastructure-as-code (Terraform, CloudFormation, ARM, Bicep, Kubernetes manifests, Helm charts, Pulumi, CDK) describes cloud configurations before they exist. Scanning IaC is preventive vulnerability management - catching the misconfiguration that would otherwise produce a CSPM finding 20 minutes after apply.

Open source - Checkov (Bridgecrew / Prisma), KICS (Checkmarx), tfsec (now part of Trivy), Terrascan (Tenable, formerly Accurics), Conftest + Rego (OPA).
Commercial - Snyk IaC, Prisma Cloud IaC (formerly Bridgecrew), Wiz IaC, Tenable Cloud Security.
Policy-as-code runtimes - OPA / Conftest (run Rego on Terraform plans, K8s manifests, anything JSON/YAML), Sentinel for Terraform Cloud / Enterprise.

The most useful integration point is the pull request - run Checkov / tfsec / Snyk against the Terraform plan and post inline comments on changed resources. Failures block merge for high-severity findings; lower-severity is informational. This shifts the cloud-misconfig fight from CSPM (post-deploy) to PR review (pre-deploy), which is where the cheapest fix lives. See CI/CD for where this sits in the broader pipeline.
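Checkov, tfsec/Trivy, and OPA ship hundreds of policies like this one; the sketch below hand-writes a single policy to show the mechanism. It reads the JSON form of a Terraform plan (terraform show -json plan.out). It flags aws_security_group resources that would open SSH or RDP to 0.0.0.0/0. It inspects only inline ingress blocks, not standalone rule resources, and the function names are hypothetical.

```python
import json

ADMIN_PORTS = {22, 3389}  # SSH and RDP


def load_plan(path: str) -> dict:
    """Load the output of `terraform show -json plan.out` saved to a file."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def open_admin_ingress(plan: dict) -> list[str]:
    """Flag planned aws_security_group resources that expose SSH/RDP to 0.0.0.0/0."""
    problems = []
    for rc in plan.get("resource_changes", []):
        if rc.get("type") != "aws_security_group":
            continue
        after = (rc.get("change") or {}).get("after") or {}  # None for deletes
        for rule in after.get("ingress") or []:
            if "0.0.0.0/0" not in (rule.get("cidr_blocks") or []):
                continue
            if rule.get("protocol") == "-1":
                exposed = sorted(ADMIN_PORTS)  # protocol -1 means all traffic
            else:
                lo, hi = rule.get("from_port", 0), rule.get("to_port", 0)
                exposed = sorted(p for p in ADMIN_PORTS if lo <= p <= hi)
            if exposed:
                problems.append(f"{rc['address']}: ports {exposed} open to 0.0.0.0/0")
    return problems
```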

Cloud host vulnerability scanning

VMs and EC2 / Azure VM / GCE instances still exist, still run Linux distributions full of packages, and still need CVE scanning. The architectural choice is agentless vs agent-based - and most mature programs run both.

Dimension	Agentless (snapshot-based)	Agent-based
Mechanism	Block-storage snapshot of running VM, scanned out-of-band by the platform	Agent installed in the VM image / via configuration management; continuously reports
Workload impact	Zero - runs in the security platform's account	Some - CPU, memory, network for the agent
Coverage	Every VM the cloud API can see, including shadow workloads	Only where the agent is installed and healthy
Freshness	Point-in-time, typical refresh 6-24h	Near-real-time
Runtime context	None - no process list, no live network behavior	Full - running processes, listening ports, in-memory state
Install friction	Zero (one cross-account role per cloud account)	Real - golden-image work, configuration management, deployment validation
Representative tools	Wiz, Orca, Lacework, Sysdig agentless, Defender for Cloud (agentless), Inspector v2 (agentless)	CrowdStrike Falcon, Tenable Cloud Security agent, Qualys VMDR Cloud Agent, Rapid7 InsightVM, Defender for Servers, Wazuh

The 2026 pragmatic split: agentless for breadth (every VM, every account, every region - coverage that was previously unattainable), and agents on the systems where runtime context matters most (tier-0, high-blast-radius, regulatory scope). The agent-only era is over. Agentless-only programs are rare outside the most disciplined organizations, and even those benefit from agents on specific tiers.

SBOM and VEX

An SBOM (Software Bill of Materials) is a machine-readable inventory of every component of a piece of software - direct and transitive dependencies, versions, hashes, licenses, and (with the right tooling) the build provenance. EO 14028 made SBOMs a federal procurement requirement; the rest of the industry has followed.

The two dominant formats:

CycloneDX - OWASP-stewarded. Strong support for vulnerability and license data inline; the de facto choice in security tooling. Supported as an output format by Trivy, Syft, Grype, Snyk, and most modern scanners.
SPDX - Linux Foundation, ISO standard (ISO/IEC 5962:2021). Stronger licensing focus; often required by procurement. Most tools emit both.

Generation tools:

Syft (Anchore) - language-agnostic, scans images and filesystems. The pragmatic default for container SBOMs.
cdxgen - broad language support, used heavily for application SBOMs.
Native build tools - Maven plugins, Gradle plugins, the npm sbom command (npm 9.5+), go version -m, etc.
CI integrations - GitHub's dependency graph, GitLab dependency scanning, Buildkite/Jenkins plugins.

Exchange and signing - in-toto attestations, Sigstore (cosign + Rekor + Fulcio) for signing SBOMs and other build artifacts, and the SLSA framework for build-provenance levels.

VEX (Vulnerability Exploitability eXchange) is the companion artifact that solves SBOM's biggest weakness: an SBOM tells you the software contains a vulnerable library, but doesn't say whether that particular product is actually affected. VEX statements assert, per CVE, one of: not_affected, affected, fixed, under_investigation, with rationale (vulnerable_code_not_in_execute_path, inline_mitigations_already_exist, etc.). Formats include OpenVEX, CycloneDX VEX, and CSAF VEX. The industry is converging on shipping SBOM + VEX together so downstream consumers can filter the SBOM's CVE list by what's actually exploitable in the specific product.
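On the consumer side, scanner findings that a supplier's VEX marks as not_affected or fixed can be dropped. This sketch assumes:

- an OpenVEX-shaped document, with statements carrying vulnerability, products, and status
- findings expressed as (CVE, product identifier) pairs, with exact-match product IDs

Real tooling also has to resolve package URLs and version ranges, which this omits. The helper names are hypothetical.

```python
SUPPRESSING_STATUSES = {"not_affected", "fixed"}


def vex_suppressions(vex_doc: dict) -> set[tuple[str, str]]:
    """(CVE, product @id) pairs the supplier says need no action. Accepts the object
    form {'name': ...} for vulnerability/products as well as a plain-string form."""
    pairs = set()
    for stmt in vex_doc.get("statements", []):
        if stmt.get("status") not in SUPPRESSING_STATUSES:
            continue  # affected / under_investigation stay in the queue
        vuln = stmt.get("vulnerability")
        cve = vuln.get("name") if isinstance(vuln, dict) else vuln
        for product in stmt.get("products", []):
            pid = product.get("@id") if isinstance(product, dict) else product
            pairs.add((cve, pid))
    return pairs


def apply_vex(findings: list[tuple[str, str]], vex_doc: dict) -> list[tuple[str, str]]:
    """Drop (CVE, product) findings covered by a not_affected or fixed statement."""
    suppressed = vex_suppressions(vex_doc)
    return [f for f in findings if f not in suppressed]
```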

Runtime vulnerability detection

The newest layer. Image scanning tells you the binary on disk has a CVE; runtime detection tells you whether the vulnerable code path is actually being executed, and whether someone is exploiting it right now. Most modern runtime detection uses eBPF for low-overhead kernel-level instrumentation.

Falco (CNCF graduated, originally Sysdig) - the reference open-source runtime security project. Rule-based detection of suspicious syscalls, file access, and network behavior.
Tracee (Aqua) - eBPF-based runtime tracing with built-in signatures for known attack patterns.
Commercial CNAPP runtime - Sysdig Secure, Aqua Runtime Protection, Wiz Runtime Sensor, CrowdStrike Falcon Cloud, Lacework, Prisma Cloud Defender.
Kubernetes-native - ARMO Kubescape (CNCF), Tetragon (Isovalent / Cilium project) for eBPF-based observability and enforcement.
In-memory exploit detection - Wiz, Sysdig, Aqua, and the EDR vendors increasingly hook the runtime to detect exploit payloads in memory, regardless of whether the binary on disk is patched or even patchable. This is where CWPP (Cloud Workload Protection Platform) converges into CNAPP.

Runtime data feeds back into prioritization: a CVE in a library that is actually loaded and called at runtime is a different priority than the same CVE in a library never touched. Wiz, Sysdig, and others now show runtime-validated reachability - the strongest signal available for vulnerability deprioritization. See Detection Engineering for the broader runtime-detection practice.

Patch management in cloud

Finding the vulnerability matters less than deploying the fix. The cloud-native patch model is fundamentally different from on-prem.

Immutable infrastructure

The dominant pattern for VMs and containers: never patch in place; rebuild the image with patched components and redeploy. Three primitives:

Golden image pipelines. A regularly built base AMI / managed image with current packages, hardened to CIS Benchmarks, signed, and published to a private registry. Application teams consume the latest tag. Packer or a cloud-native builder (EC2 Image Builder, Azure VM Image Builder, Google Cloud Build) typically drives the pipeline.
Auto-rotation. Auto-scaling groups, managed instance groups, and Kubernetes node pools rotate to the new image automatically on a schedule or when a newer version is published. Workloads must tolerate node replacement (which they should anyway).
Cattle, not pets. No VM lives long enough to drift. The vulnerability remediation is the next rotation, not an in-place yum update.

Managed-service patching

For services where you don't own the OS - Lambda, App Service, Cloud Run, RDS, Aurora, Cloud SQL, MSK, EKS / GKE / AKS control planes - the provider patches. Your responsibility is staying on supported runtime/engine versions and consuming auto-upgrade options when they exist:

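EKS managed node groups can be moved to the latest patched EKS-optimized AMI with a single node-group update, which replaces nodes on a rolling basis. AWS applies platform-version patches to the EKS control plane automatically, but Kubernetes minor-version upgrades are yours to initiate.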
GKE auto-upgrade handles control-plane upgrades on every cluster and node upgrades on Autopilot. On Standard clusters, node auto-upgrade is enabled by default for node pools.
AKS auto-upgrade channels for control plane and node images.
Azure automatic VM guest patching (and Azure Update Manager for scheduled patching) applies guest-OS patches on managed schedules.
AWS Systems Manager Patch Manager for explicit fleet patching where immutable rebuild isn't possible.

The fastest-patching org I have seen rebuilds and re-rolls its entire fleet every 7 days, regardless of whether any new CVE was published. The patch-deployment latency for any new CVE was bounded by the rebuild cadence, not by a triage process. This is the operating-model end-state, and it requires immutable infrastructure, ephemeral workloads, and runbook-quality deployment automation. Most orgs don't get there immediately; getting closer to it is the highest-leverage investment for VM speed.

ASPM - Application Security Posture Management

ASPM is the consolidation layer for everything in this page. SAST findings, SCA findings, IaC findings, secrets findings, container scan findings, DAST findings, runtime detections - each tool produces its own queue, with its own severity model, its own ticketing integration, its own remediation guidance. Mature engineering orgs find themselves running 8-15 separate AppSec tools, each generating findings that nobody can prioritize across.

ASPM platforms ingest from all of them, normalize, deduplicate, correlate to the code owner and the deployed asset, layer reachability on top, and produce a single prioritized queue per team. The category emerged around 2022; it is now the standard top-of-stack consolidation layer.

Notable vendors:

Apiiro - strong on code-to-cloud risk graph, including code-change risk scoring.
Cycode - broad scanner consolidation, native scanners plus third-party.
Backslash - reachability-first, strong on SCA prioritization.
ArmorCode - vendor-neutral aggregation, strong workflow features.
Phoenix Security - risk-based prioritization across app and cloud findings.
Wabbi - SSCM/ASPM focused on release-pipeline integration.
Legit Security - pipeline + scanner-output correlation.
Snyk ASPM - Snyk's consolidation layer atop its scanner suite plus third-party.
OX Security - pipeline-and-PR-centric AppSec governance.

The right time to adopt an ASPM is once you are running three or more first-class AppSec scanners with no consolidation, or once finding ownership has become the dominant remediation bottleneck. Below that bar, a well-instrumented CNAPP or a single broad vendor (Snyk, Wiz, Aikido) covers most of the value.

Vulnerability disclosure & bug bounty

The inbound side of VM. Even with perfect internal scanning, external researchers find things first. A coordinated vulnerability disclosure (CVD) program is table stakes; a bug-bounty program adds financial incentive and structured engagement.

ISO/IEC 29147 - guidelines for vulnerability disclosure (the external-facing side).
ISO/IEC 30111 - guidelines for vulnerability handling (the internal-process side).
security.txt (RFC 9116) - the well-known location researchers check first. Host one at /.well-known/security.txt. Include a real contact, an Expires date (the RFC requires both), an encryption key, a policy URL, and an acknowledgments link.
Bug-bounty platforms - HackerOne, Bugcrowd, Intigriti, YesWeHack, Synack. Public or private programs; triage, payout, and researcher-management services.
CVE Numbering Authorities (CNAs) - if you ship software others depend on, become a CNA so you can assign CVE IDs for your own products without going through MITRE.
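A small sketch that renders a security.txt body; the function name and its defaults are hypothetical. Per RFC 9116, Contact (a URI such as mailto: or https:) and Expires (an RFC 3339 timestamp, recommended to be less than a year out) are the required fields. This sketch also insists on Policy because the advice above calls for one.

```python
from datetime import datetime, timedelta, timezone

def build_security_txt(contact, policy, encryption=None, acknowledgments=None, valid_days=180):
    """Render a minimal RFC 9116 security.txt body (hypothetical helper)."""
    if not contact.startswith(("mailto:", "https://", "tel:")):
        raise ValueError("Contact must be a URI, e.g. mailto: or https:")
    if valid_days > 365:
        raise ValueError("RFC 9116 recommends Expires be less than a year out")
    expires = datetime.now(timezone.utc) + timedelta(days=valid_days)
    lines = [
        f"Contact: {contact}",                                  # required
        f"Expires: {expires.strftime('%Y-%m-%dT%H:%M:%SZ')}",   # required
    ]
    if encryption:
        lines.append(f"Encryption: {encryption}")  # URI of your public key
    lines.append(f"Policy: {policy}")
    if acknowledgments:
        lines.append(f"Acknowledgments: {acknowledgments}")
    return "\n".join(lines) + "\n"
```

Regenerating the file from a scheduled job keeps Expires from quietly lapsing, which is the most common way a published security.txt goes stale.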

The relationship between disclosure and bounty is sequencing: a CVD program with safe-harbor language is a precondition for a bounty. Researchers will not engage with a program that leaves open the option of suing them for good-faith research.

SLAs by severity

What mature programs commit to, internally and contractually. These are defensible defaults; tune for your sector and exposure.

Severity	Criteria	Internet-exposed / production	Internal-only
P0 / Critical	On KEV, OR public weaponized exploit, OR CVSS 9.0+ with high EPSS and reachable	24 hours (mitigate or patch)	72 hours
P1 / High	CVSS 7.0+ with high EPSS, OR reachable + exposed, OR confirmed via DAST/pentest	7 days	30 days
P2 / Medium	CVSS 4.0-6.9, reachable but not exposed, OR exposed but low EPSS	30 days	90 days
P3 / Low	CVSS < 4, OR unreachable, OR informational	90 days	Next planned cycle
KEV fast-track	Any newly-added KEV CVE on any in-scope asset	Override normal SLA - 48-72 hours target regardless of prior triage
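The table can be sketched as a first-match-wins classifier. Its criteria overlap (a KEV-listed CVE often also clears CVSS 7.0), so the sketch checks P0 down to P3 and returns the first tier that matches. The HIGH_EPSS cutoff of 0.1 is an assumption: the table only says "high EPSS", and programs pick their own probability or percentile threshold.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

HIGH_EPSS = 0.1  # assumed cutoff for "high EPSS"; tune for your program

@dataclass
class Triage:
    cvss: float
    epss: float                      # FIRST EPSS probability, 0..1
    on_kev: bool
    weaponized_exploit: bool
    reachable: bool
    internet_exposed: bool
    confirmed_by_test: bool = False  # confirmed via DAST or pentest

# (internet-exposed / production, internal-only); None = next planned cycle.
WINDOWS = {
    "P0": (timedelta(hours=24), timedelta(hours=72)),
    "P1": (timedelta(days=7), timedelta(days=30)),
    "P2": (timedelta(days=30), timedelta(days=90)),
    "P3": (timedelta(days=90), None),
}

def tier(t: Triage) -> str:
    """Evaluate the table top-down; the first tier whose criteria match wins."""
    high_epss = t.epss >= HIGH_EPSS
    if t.on_kev or t.weaponized_exploit or (t.cvss >= 9.0 and high_epss and t.reachable):
        return "P0"
    if (t.cvss >= 7.0 and high_epss) or (t.reachable and t.internet_exposed) or t.confirmed_by_test:
        return "P1"
    if t.cvss < 4.0 or not t.reachable:
        return "P3"
    return "P2"

def due(t: Triage, found_at: datetime):
    """Deadline for the finding, or None when it rides the next planned cycle."""
    window = WINDOWS[tier(t)][0 if t.internet_exposed else 1]
    return None if window is None else found_at + window
```

Under these rules an unreachable, internal CVSS 9.8 with near-zero EPSS lands in P3, while a CVSS 5.0 that appears on KEV jumps straight to P0. That is the queue inversion the rest of this page argues for.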

The SLAs only function with an exception workflow. Every miss requires a documented business justification, a compensating control, an approver, and an expiration date. Exceptions accumulate silently in programs that don't enforce expiration; the right cadence is a monthly review of all exceptions older than 30 days, escalating any that haven't moved.
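A sketch of the exception record and the monthly sweep, assuming a hypothetical SlaException type. validate() rejects records missing any of the four required elements, and monthly_review() flags exceptions that have expired or are older than 30 days. The escalation itself is left to your ticketing system.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class SlaException:
    finding_id: str
    justification: str
    compensating_control: str
    approver: str
    created: date
    expires: date

def validate(exc):
    """Return the names of required elements that are missing or invalid."""
    missing = [name for name in ("justification", "compensating_control", "approver")
               if not getattr(exc, name).strip()]
    if exc.expires <= exc.created:
        missing.append("expires")  # an exception must expire after it is granted
    return missing

def monthly_review(exceptions, today):
    """Exceptions that have expired or are older than 30 days, oldest first."""
    flagged = [e for e in exceptions
               if e.expires <= today or today - e.created > timedelta(days=30)]
    return sorted(flagged, key=lambda e: e.created)
```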

AWS, Azure, and GCP side-by-side

The native vulnerability-management capabilities each cloud ships. Useful as a baseline; most mature programs supplement with a CNAPP or specialty tooling.

Capability	AWS	Azure	GCP
Host / VM scanning	Inspector v2 (agentless + SSM agent)	Defender for Servers (agentless + MDE)	Security Command Center - VM Threat Detection, OS Inventory
Container image scanning	Inspector for ECR, ECR enhanced scanning	Defender for Containers, ACR vulnerability assessment	Artifact Analysis (Container Analysis API)
Lambda / Function scanning	Inspector for Lambda	Defender for App Service / Functions	Limited native; Cloud Functions runtime version checks
Code scanning (SAST/SCA)	CodeGuru Security, Inspector code scanning (via CodeCommit / CI)	GitHub Advanced Security for Azure DevOps, Defender CSPM code integrations	Cloud Build vulnerability scanning, source-code link in SCC
IaC scanning	CloudFormation Guard, Inspector IaC scans (preview), third-party in CodePipeline	Defender for DevOps (IaC scanning of GitHub / ADO repos)	Limited native; Cloud Build with third-party scanners
Patch management	Systems Manager Patch Manager, EC2 Image Builder	Azure Update Manager, VM Auto-Patching, Image Builder	OS Patch Management, Cloud Build for image pipelines
CVE feeds & enrichment	Inspector uses NVD + vendor advisories + GHSA	Defender uses MSRC, NVD, Qualys engine	Artifact Analysis uses CVE feeds + OSV.dev
SBOM generation	Inspector SBOM export (CycloneDX, SPDX)	Defender SBOM (via Microsoft.SBOM, CycloneDX/SPDX)	Artifact Analysis SBOM (SPDX), GUAC integration
Findings aggregation	Security Hub (CSPM + vuln findings)	Defender for Cloud (unified)	Security Command Center Enterprise
Runtime threat detection	GuardDuty (Runtime Monitoring for EKS/ECS/EC2)	Defender for Cloud runtime sensor	Security Command Center - Container Threat Detection

The native tools have caught up dramatically. Inspector v2, Defender for Cloud, and SCC Enterprise are credible single-cloud VM stacks for most organizations. The case for a third-party CNAPP gets stronger as you go multi-cloud (one console across AWS + Azure + GCP), need deep reachability or graph-based exposure analysis, or want consolidated remediation workflow with code-owner mapping. See CSPM vs CNAPP for the broader trade-off.


Maturity stages

A useful staging model for a cloud vulnerability-management program:

Stage 1 - Reactive

A scanner exists and runs on a schedule. Findings land in a spreadsheet or ticket queue. Prioritization is by CVSS. Backlog grows faster than remediation. Patch deployment is manual and infrequent. No reachability, no runtime, no KEV awareness. The default starting state.

Stage 2 - Risk-based

EPSS and KEV integrated into prioritization. SLAs by severity defined and tracked. SCA + SAST + container scanning + IaC scanning all running in CI. CSPM/CNAPP catches drift. Exception workflow exists. Patch-deployment cadence is bounded. The minimum bar for a credible 2026 program.

Stage 3 - Reachability-aware

SCA findings filtered by reachability - call-graph analysis collapses backlog by 70-90%. ASPM layer consolidates findings across tools, maps to code owners. Immutable-infrastructure pipelines bound patch latency to a known rebuild cadence. SBOMs generated and stored for every artifact; VEX statements published for shipped products.

Stage 4 - Automated

Auto-remediation for high-confidence findings (Dependabot / Renovate auto-merge on green CI, image rebuild auto-deploys on base-image update). Runtime-validated reachability further refines priority. Exposure-graph analysis drives queue order. The vulnerability backlog is bounded and shrinks faster than it grows. VM is now the platform team's product, not just the security team's job.

Most organizations are somewhere between Stage 1 and Stage 2 in 2026. The orgs at Stage 3+ are visibly different - they don't carry six-figure CVE backlogs, and their MTTR for KEV-listed findings is days, not quarters.
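MTTR for KEV findings is easy to compute once detection and remediation timestamps are recorded. A sketch, assuming each finding is a dict with hypothetical detected, remediated, and on_kev keys. It averages closed findings only, so track the age of the oldest open KEV finding alongside it; otherwise a stuck ticket never shows up in the number.

```python
from datetime import datetime
from statistics import mean

def days_between(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 86400

def mttr_days(findings, kev_only=True):
    """Mean days from detection to remediation across closed findings.

    Each finding is a dict with hypothetical keys 'detected', 'remediated'
    (None while still open) and 'on_kev'. Open findings are excluded.
    """
    durations = [
        days_between(f["detected"], f["remediated"])
        for f in findings
        if f["remediated"] is not None and (f["on_kev"] or not kev_only)
    ]
    return mean(durations) if durations else None
```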

Common pitfalls

Scanning without prioritization. Turning on the scanner is the easy 5% of the work. Without EPSS, KEV, and reachability filters, the output is unactionable noise. The team learns to ignore the tool, and the next breach lands on a finding nobody triaged.
Fixing P3s while P1s sit. The classic anti-pattern of audit-driven VM. Easier fixes get done because they're easier; harder, higher-risk fixes accumulate. Risk-rank before assigning, never the reverse.
No reachability analysis on SCA. SCA on a modern JS/Python codebase is 70-90% noise without reachability. Adopting one of the reachability-aware tools is the single highest-leverage investment in SCA-heavy stacks.
No runtime feedback into prioritization. Image scanning tells you what's on disk; runtime tells you what's actually loaded. Until runtime closes the loop, CVEs in libraries you import but never call sit at the same priority as CVEs in your hot path.
No SLAs, or SLAs without exception workflow. Unenforced SLAs are worse than none - they teach the org that security commitments are aspirational. Define them with executive backing and a real exception process; review compliance weekly.
No patch automation. Immutable-infrastructure pipelines exist precisely so the patch-deployment latency is bounded by the rebuild cadence, not by ticket triage. Manual EC2 yum updates are a Stage-1 pattern.
Ignoring KEV. The single highest-signal list in vulnerability management. If a CVE is on KEV, real attackers have been confirmed exploiting it in the wild. Treat KEV additions as a fire drill regardless of where they sit in the normal triage queue.
Treating CVSS as priority. The most common misuse of the score. CVSS is severity, not exploitability, not reachability, not asset criticality. Use it as one input in a stack, never the queue-order.
No SBOM, or SBOM without VEX. When the next Log4Shell / xz-utils-style supply-chain event lands, the orgs that can grep their SBOMs for "is this library in any of our products" answer in hours. The ones without SBOMs spend weeks. VEX is what stops every downstream consumer from having to redo the affected-ness analysis.
One scanner monoculture. Every scanner has blind spots. The pragmatic combination is at minimum: SCA + SAST + container scanning + IaC scanning + cloud-host scanning, ideally with an ASPM or CNAPP layer correlating. A single-vendor "we do everything" pitch always leaves gaps somewhere.

FAQ

Why is CVSS a bad priority score on its own?

CVSS v3.1's base score measures severity assuming a vulnerability is exploited - it does not measure the probability of exploitation, the reachability of the vulnerable code in your environment, or the criticality of the affected asset. A CVSS 9.8 in a library you import but never call is operationally less urgent than a CVSS 7.5 on an internet-exposed service in your auth path. The base score also excludes temporal and environmental context by design, and the Temporal and Environmental metric groups that could supply it are almost never populated. Use CVSS as a severity input, not as a queue-order.
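To see why the base score can't carry likelihood or asset context, here is the CVSS v3.1 base-score equation as given in the FIRST specification. Its only inputs are the eight base metrics in the vector string: nothing about EPSS, KEV, reachability, or which asset is affected. This is a sketch for base vectors only, with no validation of malformed input.

```python
import math

# CVSS v3.1 base metric weights (FIRST specification).
AV = {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2}
AC = {"L": 0.77, "H": 0.44}
PR_UNCHANGED = {"N": 0.85, "L": 0.62, "H": 0.27}
PR_CHANGED = {"N": 0.85, "L": 0.68, "H": 0.5}  # PR weighs more when scope changes
UI = {"N": 0.85, "R": 0.62}
CIA = {"H": 0.56, "L": 0.22, "N": 0.0}

def roundup(x):
    """v3.1 Roundup: smallest one-decimal value >= x, tolerant of float error."""
    i = round(x * 100000)
    if i % 10000 == 0:
        return i / 100000.0
    return (math.floor(i / 10000) + 1) / 10.0

def base_score(vector):
    # "CVSS:3.1/AV:N/..." -> {"AV": "N", ...}, skipping the version prefix.
    m = dict(part.split(":") for part in vector.split("/")[1:])
    changed = m["S"] == "C"
    iss = 1 - (1 - CIA[m["C"]]) * (1 - CIA[m["I"]]) * (1 - CIA[m["A"]])
    if changed:
        impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15
    else:
        impact = 6.42 * iss
    pr = (PR_CHANGED if changed else PR_UNCHANGED)[m["PR"]]
    exploitability = 8.22 * AV[m["AV"]] * AC[m["AC"]] * pr * UI[m["UI"]]
    if impact <= 0:
        return 0.0
    if changed:
        return roundup(min(1.08 * (impact + exploitability), 10))
    return roundup(min(impact + exploitability, 10))
```

For example, base_score("CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H") comes out at 9.8 whether the library is called in your auth path or never loaded at all.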

What is EPSS and how does it differ from CVSS?

EPSS (Exploit Prediction Scoring System), run by FIRST.org, produces a probability between 0 and 1 that a given CVE will be exploited in the wild in the next 30 days. Scores are refreshed daily, and the model is built on observed exploitation telemetry - honeypots, IDS signatures, exploit DBs, social media, code repos. CVSS asks "how bad would this be if exploited?"; EPSS asks "how likely is it to be exploited at all?". The combination is far more useful than either alone - high CVSS + high EPSS is the patch-now bucket; high CVSS + EPSS near zero is what most of your backlog is.
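A sketch of pulling EPSS scores in batch. It assumes FIRST's public API at api.first.org/data/v1/epss, which accepts a comma-separated cve parameter and returns a data list whose epss and percentile values are strings; check the current API documentation before relying on that exact shape. The network call is left out so the parsing can be tested offline. In practice you would fetch the URL with urllib.request.urlopen or your HTTP client of choice.

```python
import json
from urllib.parse import urlencode

EPSS_API = "https://api.first.org/data/v1/epss"

def epss_query_url(cves):
    """Build one query for a batch of CVE IDs (commas left unescaped)."""
    return f"{EPSS_API}?{urlencode({'cve': ','.join(cves)}, safe=',')}"

def parse_epss(payload):
    """Map CVE -> (probability, percentile) from an EPSS API JSON response.

    The API returns both values as strings, so they are converted here.
    """
    data = json.loads(payload)["data"]
    return {row["cve"]: (float(row["epss"]), float(row["percentile"])) for row in data}
```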

What is the CISA KEV catalog and why does it matter?

KEV (Known Exploited Vulnerabilities) is CISA's authoritative list of CVEs that have confirmed in-the-wild exploitation. Inclusion is evidence-based, not predictive. Under Binding Operational Directive 22-01, U.S. federal civilian executive-branch agencies must remediate KEV-listed CVEs by the due date CISA sets on each entry (typically a few weeks after it is added). For private-sector orgs it's not mandatory but is the single highest-signal prioritization input there is - if a CVE is on KEV, it has actually been used by attackers somewhere, full stop. Most mature programs fast-track KEV findings ahead of normal SLAs.
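Fast-tracking KEV starts with intersecting the catalog with your open findings. A sketch, assuming the CISA KEV JSON feed layout: a top-level vulnerabilities list whose entries carry cveID, dateAdded, and dueDate as YYYY-MM-DD strings. Passing added_since (for example, the date of your last sync) narrows the result to new additions, which are the ones that override normal triage.

```python
import json
from datetime import date

def kev_hits(feed_json, open_cves, added_since=None):
    """KEV entries matching CVEs currently open in your environment."""
    hits = []
    for entry in json.loads(feed_json)["vulnerabilities"]:
        if entry["cveID"] not in open_cves:
            continue
        added = date.fromisoformat(entry["dateAdded"])
        if added_since is not None and added < added_since:
            continue  # already handled in an earlier sync
        hits.append({"cve": entry["cveID"], "added": added,
                     "due": date.fromisoformat(entry["dueDate"])})
    # Newest additions first: these need a fast-track ticket today.
    return sorted(hits, key=lambda h: h["added"], reverse=True)
```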

What is reachability analysis and why does it change the queue?

Reachability analysis examines whether the vulnerable function in a dependency is actually called from your application's executable code paths - and ideally whether those paths are reachable from an externally exposed entry point. A typical Node.js or Python project imports hundreds of transitive dependencies; SCA flags vulnerabilities in any of them, but vendor studies commonly report that only 10-30% of those CVEs are reachable from the application's runtime. Reachable + internet-exposed is a smaller set still. Tools like Endor Labs, Backslash, Aikido, Snyk, Apiiro, and Wiz Code now compute reachability and use it to collapse SCA queues by 70-90%.
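A toy model of the two questions reachability answers: is the vulnerable function called at all, and is it reachable from an exposed entry point? Real tools build the call graph from source or bytecode and have to handle dynamic dispatch, reflection, and callbacks; here the graph is a hand-written dict and every function name is hypothetical. A breadth-first search returns one call path as evidence, or None.

```python
from collections import deque

def reachable_from(call_graph, entry_points, target):
    """BFS over a call graph; return one call path to target, or None.

    call_graph maps a function name to the functions it calls.
    """
    queue = deque([ep] for ep in entry_points)
    seen = set(entry_points)
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for callee in call_graph.get(path[-1], ()):
            if callee not in seen:
                seen.add(callee)
                queue.append(path + [callee])
    return None
```

Passing every entry point answers "called at all"; passing only internet-facing handlers answers the narrower "reachable + exposed" question.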

What is an SBOM and what does VEX add to it?

An SBOM (Software Bill of Materials) is a machine-readable inventory of every component that makes up a piece of software - direct and transitive dependencies, versions, licenses, hashes. The two dominant formats are CycloneDX (OWASP) and SPDX (Linux Foundation). Generation tools include Syft, cdxgen, and most build systems' native exporters. VEX (Vulnerability Exploitability eXchange) is a companion artifact that asserts, per CVE, whether a given product is actually affected - "not_affected because the vulnerable function is never called" or "affected, fix available in v2.1.4". SBOM tells you what's inside; VEX tells you what actually matters. Together they're the foundation for any modern vendor-disclosure or downstream-consumer workflow.
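A sketch of the two everyday operations: asking an SBOM whether a library ships in a product, and using VEX to drop findings the supplier has marked not affected. It assumes CycloneDX JSON, with components in the top-level components list and VEX data as vulnerabilities entries carrying an analysis.state field. To keep it short, it ignores nested components and the per-component affects scoping that real VEX statements carry.

```python
import json

def components_named(sbom_json, name):
    """Every (name, version) of a library in a CycloneDX JSON SBOM."""
    sbom = json.loads(sbom_json)
    return [(c["name"], c.get("version")) for c in sbom.get("components", [])
            if c["name"] == name]

def still_relevant(findings, vex_json):
    """Drop CVE IDs that a CycloneDX VEX document marks as not_affected.

    Other states (exploitable, in_triage, ...) are kept in the queue.
    """
    vex = json.loads(vex_json)
    suppressed = {v["id"] for v in vex.get("vulnerabilities", [])
                  if v.get("analysis", {}).get("state") == "not_affected"}
    return [cve for cve in findings if cve not in suppressed]
```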

Agentless or agent-based vulnerability scanning for cloud hosts?

Agentless (snapshot-based) scanning - used by Wiz, Orca, Lacework Polygraph, Sysdig agentless mode - takes block-storage snapshots of running VMs and scans them out-of-band. Pros: zero impact on workloads, no install friction, complete coverage of every VM the API can see, including shadow workloads. Cons: results are point-in-time (typical refresh 6-24h), no runtime context (process listing, network behavior). Agent-based - CrowdStrike, Tenable Cloud Security, Qualys VMDR, Defender for Cloud, Rapid7, Sysdig agent - runs continuously, gives runtime context, but has install/upgrade overhead and only covers VMs where the agent actually runs. Most mature programs use agentless for breadth and an agent on tier-0 or high-blast-radius systems for depth.
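The breadth-versus-depth split is easy to audit by comparing inventories. A sketch, assuming three hypothetical lists of instance IDs: everything the cloud API reports (roughly what agentless scanning covers), instances whose agent has checked in recently, and the tier-0 set you have decided must carry an agent.

```python
def coverage_gaps(cloud_inventory, agent_checkins, tier0):
    """Set differences between cloud inventory and agent check-ins."""
    inventory = set(cloud_inventory)
    agents = set(agent_checkins)
    return {
        # Covered only by snapshot scans: breadth, no runtime context.
        "no_agent": sorted(inventory - agents),
        # Tier-0 hosts that should have an agent but don't: the depth gap.
        "tier0_missing_agent": sorted((set(tier0) & inventory) - agents),
        # Agents reporting from hosts the API doesn't list: stale or out-of-account.
        "agent_not_in_inventory": sorted(agents - inventory),
    }
```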

What are realistic vulnerability remediation SLAs?

A common, defensible SLA structure for internet-exposed and production systems: P0 / Critical (on KEV, OR a public weaponized exploit, OR CVSS 9.0+ with high EPSS and reachable) - patch or mitigate within 24 hours; P1 / High (CVSS 7.0+ with high EPSS, OR reachable + exposed, OR confirmed via DAST/pentest) - within 7 days; P2 / Medium - within 30 days; P3 / Low - within 90 days. Internal-only systems get longer windows. KEV additions trigger a fast-track regardless of CVSS. The SLA only works if you have an exception workflow with documented compensating controls - and the means to actually deploy patches at that cadence (which is where immutable-infrastructure golden-image pipelines pay for themselves).

Where next

Containers - image hardening, base-image strategy, and the structural fix for most CVE noise.
CI/CD - where pre-deploy SAST, SCA, IaC scanning, and image gates actually live.
CSPM vs CNAPP - the broader posture and exposure-management layer that VM sits inside.
Detection Engineering - runtime detection of exploitation attempts against vulnerabilities you haven't patched yet.
Friday Zoom - reachability, ASPM, and "we have 50,000 findings, now what" come up almost every week. Drop in.