Why this page exists. csoh.org is a community for cloud security practitioners. We host it ourselves - one static site served active/active from three clouds at once (AWS, GCP, and Azure) behind Cloudflare - and we treat the deployment as a teaching artifact: every choice we made is on this page, with a plain-English explanation of why it's there and what attack it stops. The Terraform and GitHub Actions YAML that actually runs in production is linked throughout so you can read the real thing.
Who this is for. If you've ever deployed a website to a shared web host (cPanel, Netlify, GitHub Pages) and you want to understand what a "real" cloud deployment looks like - this is your page. We assume you know HTML/HTTP basics. We don't assume you know GCP, IAM, CI/CD, or container security. Every term we use links to the glossary on first use.
How to read this page. Top-to-bottom for the full tour, or jump straight to the section you care about via the table of contents below. Each layer of defense gets its own section that opens with "the attack we're stopping" in plain language before we touch the technical detail.
On this page
- The big picture: defense in depth
- What attacks are we actually defending against?
- Architecture diagram
- Layer 1 - Cloudflare: the one edge in front of everything
- Layer 2 - Three cloud origins, active/active
- Layer 3 - TLS (the lock icon in the browser)
- Layer 4 - The origins themselves (S3, Cloud Run, Blob)
- Layer 5 - The bytes we ship to each origin
- Layer 6 - How we deploy to three clouds without a saved password
- Layer 7 - Protecting the deploy pipeline itself
- Layer 8 - Logging and what we'd see during an attack
- What we didn't do (and why)
- What this costs to run
- If you want to copy this for your own site
- Further reading
The big picture: defense in depth
The single most important idea on this page is called defense in depth: instead of relying on one strong wall, we stack many imperfect ones. If an attacker bypasses one layer, the next one is still in their way. None of the individual controls below are unbreakable - but together, an attacker has to be lucky on every layer at once, while we only have to be lucky on one.
Our stack has eight layers. Each one is a section on this page:
- Cloudflare in front of everything - terminates TLS, caches at the edge, runs the WAF, sets security headers, applies legacy redirects, and load-balances across our three origins with health checks. This one edge does all of it, in front of three interchangeable origins.
- Three cloud origins, active/active - the same site lives on AWS (S3 + CloudFront), GCP (Cloud Run), and Azure (Blob static website). Cloudflare spreads live traffic across all three and pulls any unhealthy one out of rotation automatically.
- TLS end-to-end (browser → Cloudflare, Cloudflare → each origin at Full strict) - encrypted, modern ciphers, certificates that auto-renew, no unauthenticated hop anywhere.
- The origins themselves - each locked down: S3 is private (reachable only via CloudFront), Cloud Run runs as a zero-permission identity, Azure serves only its public $web container.
- The bytes we ship - the GCP container is pinned to known-good bytes and vulnerability-scanned; the object-storage origins get only an allowlisted public file set (sensitive files are never uploaded).
- The deploy identity - GitHub Actions deploys to all three clouds without a single stored password, using keyless OIDC federation that grants ~1 hour of narrowly-scoped access per workflow run, per cloud.
- The pipeline itself - Code Owners review on the deploy workflow, branch protection on main, secret scanning, push protection.
- Logging - Cloudflare zone analytics at the edge, plus per-cloud origin + IAM/audit logs (GCP's kept for 400 days).
Read on for the plain-English version of each layer, what it's defending against, and what the actual config looks like.
What attacks are we actually defending against?
Before designing controls, you need a threat model - a list of "what could go wrong, and roughly how likely is it?" For a public, static site, the surface is smaller than you might think:
- There's no database to inject into.
- There's no login form to brute-force.
- There are no user accounts, sessions, or cookies to hijack.
- There's no per-user data to steal.
That eliminates most of the OWASP Top 10 right off the bat. What's actually left, ranked from most-to-least likely:
- Someone tampers with the build. An attacker compromises the base image we use, a GitHub Action we depend on, or a CI token, and slips malicious bytes into our deploy. The site visibly looks normal but ships malware to readers. Mitigated by: pinning the base image to its content hash, pinning every GitHub Action to a specific commit, scanning the built image for vulnerabilities, and refusing to overwrite image tags after they're published.
- Someone steals the deploy credentials. Historically the worst single way to compromise a website: leak the CI's deploy password and now anyone with that password can publish whatever they want. Mitigated by: not having a deploy password at all (see Layer 6 - keyless deploys).
- Someone messes with our DNS or TLS. Misissued certificate, an on-path attacker downgrading HTTPS to HTTP, DNS hijack. Mitigated by: two-factor on Cloudflare, registrar lock on the domain, HSTS preload (browsers refuse plain HTTP for our domain, period), modern TLS only.
- Someone defaces the site. An attacker gets into the build pipeline somehow and pushes an embarrassing change. Mitigated by: everything listed under the first two threats above, plus a one-command rollback (every Cloud Run revision is pinned to a specific image hash; "go back to yesterday's deploy" is a single CLI call).
- Volumetric attack (DDoS) or an origin/region outage. Someone tries to take the site offline by sending an enormous amount of traffic - or one cloud simply has a bad day. Mitigated by: Cloudflare absorbing the bulk of traffic at its edge, edge rate limiting, the CDN serving cached content even if an origin goes down, and - new in the multi-cloud design - three independent origins behind a health-checked load balancer, so an entire cloud can fail and the site keeps serving from the other two.
- Bot scraping and probing. Bots constantly throw classic attack patterns (?id=1' OR 1=1--) at every endpoint on the public internet. Mitigated by: Cloudflare's WAF (the free Managed Ruleset) plus a rate-limit rule, silently dropping those requests at the edge so they never reach any of our three origins.
What we explicitly do not defend against, because none of it applies to a static site: authenticated session theft, broken access control, business-logic abuse, privilege escalation from an application server. If you're reading this page to copy the design for a site that does have logged-in users - you need more than what's here.
Architecture diagram
Here's the whole system in one picture. Don't worry if some of the labels are unfamiliar - every box is explained in its own section below.
Everything in that diagram - every origin setting, every Cloudflare rule, every IAM permission across all three clouds - is defined as code in infra/terraform/, which has one directory per cloud (aws/, gcp/, azure/, cloudflare/). We never click around in any cloud console to make changes; the consoles are read-only for normal operation. This is called infrastructure as code, and it's how you keep a real multi-cloud setup from drifting into three different snowflakes nobody can rebuild.
Layer 1 - Cloudflare: the one edge in front of everything
What it stops: volumetric attacks (DDoS), known-bad bots, exposing our origins to the public internet - and a whole cloud going down.
The big idea. Cloudflare isn't just a CDN sitting in front of a cloud load balancer - it is the load balancer, the WAF, the TLS terminator, the redirect engine, and the security-header layer, all at once. Running those same controls again on a per-cloud load balancer would be paying twice for them, so we don't: one edge (Cloudflare's free plan plus the ~$5/mo Load Balancing add-on) does all of it, in front of three interchangeable origins.
MITRE ATT&CK mitigated: T1498 (Network Denial of Service), T1499 (Endpoint Denial of Service), T1595 (Active Scanning).
How it works in plain English. When you type csoh.org in your browser, the DNS lookup returns a Cloudflare IP - not ours. Your browser opens a TLS connection to Cloudflare. Cloudflare looks at the request and one of three things happens:
- The page is cached at Cloudflare's edge. Cloudflare returns the cached response directly. We never see this request. (For a static site like ours, this is most traffic.)
- The page isn't cached. Cloudflare opens its own connection to one of our healthy origins and fetches it on your behalf, then caches the result so the next reader gets the cached version.
- The request is bad. Cloudflare's bot mitigation, rate limiting, or threat intelligence flags it; the request is blocked before it ever reaches us.
This pattern is called a reverse proxy or CDN. The security wins are big:
- Our origin IP is hidden. Public DNS only ever points to Cloudflare. An attacker who wants to attack us has to attack Cloudflare first.
- Floods get absorbed at the edge. Cloudflare's network is much bigger than ours. A DDoS that would take us offline is unnoticed at their scale.
- The browser sees Cloudflare's certificate. Cloudflare manages its own TLS cert with its own auto-renewal. We don't have to hand-feed it our domain.
- One origin failing doesn't take the site down. Cloudflare's Load Balancer (Layer 2) health-checks all three origins and routes only to healthy ones. AWS, GCP, and Azure would all have to be down at once for the site to go dark.
The trade-off, worth being honest about: Cloudflare's free-plan WAF is a lighter rule set than a tunable OWASP Core Rule Set. For a static site with no database or login that's an easy trade (see the "What we didn't do" section for how we'd restore parity). Our origins only ever see requests from Cloudflare IPs, not real readers - which is exactly why per-IP rate limiting belongs at Cloudflare's edge, where the real client IP is visible, rather than at the origin.
Layer 2 - Three cloud origins, active/active
What it stops: a single cloud (or region) outage taking the site down; vendor lock-in; the cost of running a dedicated cloud load balancer just to get an HTTPS front door.
MITRE ATT&CK mitigated: T1499 (Endpoint Denial of Service, via failover), T1498 (Network Denial of Service), T1195 (Supply Chain Compromise, via not depending on one vendor's pipeline).
The shape. The exact same static site lives on three clouds at once. Cloudflare's Load Balancer holds all three in one pool and uses random origin steering to spread live requests across every healthy origin - this is what "active/active" means: they all serve real traffic simultaneously, not "one live, two on standby." A health monitor probes each origin every minute; any that fails is pulled out of rotation automatically and slipped back in when it recovers.
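The steering behaviour described above can be sketched in a few lines of Python. This is an illustration of the idea only, not Cloudflare's implementation: the origin labels and the `pick_origin` helper are hypothetical, and the health map stands in for whatever the health monitor last reported.

```python
import random

# Hypothetical labels for the three origins; the real pool is defined in Cloudflare.
ORIGINS = ["aws-cloudfront", "gcp-cloud-run", "azure-blob"]


def pick_origin(health, rng=random):
    """Pick one origin uniformly at random from those passing health checks."""
    healthy = [name for name in ORIGINS if health.get(name, False)]
    if not healthy:
        raise RuntimeError("no healthy origin left in the pool")
    return rng.choice(healthy)


# Azure just failed its probe, so live traffic only goes to the other two.
health = {"aws-cloudfront": True, "gcp-cloud-run": True, "azure-blob": False}
print(pick_origin(health))
```

Because every healthy origin is eligible on every request, there is no "promotion" step when one fails: the unhealthy origin simply stops being a candidate until its probe passes again.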
Why each origin is shaped the way it is
The one hard requirement: every origin must answer over HTTPS with a valid certificate, so the Cloudflare→origin leg can run at Full (strict) - no unencrypted or unauthenticated hop anywhere. That requirement quietly drives each choice:
- AWS - private S3 bucket behind CloudFront (with Origin Access Control). The cheap, obvious option - the S3 "static website" endpoint - is HTTP-only, which would force an unencrypted origin hop. So instead the bucket stays fully private and CloudFront serves it over HTTPS with a valid *.cloudfront.net cert. CloudFront's free tier covers our egress; the bucket has no public access at all.
- GCP - Cloud Run (scale-to-zero). Its *.run.app URL is already HTTPS with a Google-managed cert and costs ~nothing when idle. This is why we kept Cloud Run but deleted the load balancer in front of it - the run.app URL is a perfectly good HTTPS origin on its own. (Google Cloud Storage's website endpoint, like S3's, is HTTP-only, so Cloud Run is actually the cheaper path to an HTTPS origin on GCP.)
- Azure - Storage Account "static website" ($web). Azure serves the special $web container over a built-in *.web.core.windows.net HTTPS endpoint with a managed cert - no load balancer, no CDN, just static hosting. The simplest of the three.
Notice the pattern: none of the three needs a cloud load balancer, a managed-cert dance, or a WAF product, because Cloudflare does all of that once at the edge. Each origin is reduced to "the cheapest way this vendor will hand me an HTTPS URL for a folder of files."
The three stacks, side by side
Each cloud ends up with a deliberately different shape, because each vendor's cheapest path to a valid-HTTPS origin is different. Here is the full stack on each, end to end - what serves the bytes, what's exposed, how it gets a cert, how CI publishes to it, and what (if anything) runs code:
| Aspect | AWS | GCP | Azure |
|---|---|---|---|
| Serves the bytes | Private S3 bucket behind a CloudFront distribution | Cloud Run running our nginx container (scale-to-zero) | Storage Account static website ($web container) |
| Public surface | Only the CloudFront URL; bucket blocks all public access (OAC-keyed to the distribution) | The *.run.app URL (ingress = all) | Only the $web endpoint; every other blob stays private |
| Origin TLS cert | *.cloudfront.net (AWS-managed) | *.run.app (Google-managed) | *.web.core.windows.net (Azure-managed) |
| Keyless deploy auth | OIDC → sts:AssumeRoleWithWebIdentity → IAM role csoh-site-publisher | OIDC → WIF → impersonate csoh-deployer service account | OIDC → Entra federated credential on an app registration (no client secret) |
| Deploy permission scope | Write the one bucket + invalidate the one distribution | Push to Artifact Registry + deploy Cloud Run revisions | Storage Blob Data Contributor on the one account |
| How CI publishes | aws s3 sync --delete + CloudFront invalidate | docker build → Trivy scan → push immutable tag → gcloud run deploy | az storage blob sync into $web |
| Runs code? | No - static objects, no runtime identity to abuse | Yes (nginx) - runs as a zero-IAM service account | No - static objects, no runtime identity to abuse |
| Why this shape | S3's own website endpoint is HTTP-only, so CloudFront is the cheapest way to get a valid-HTTPS, private origin | Cloud Run gives HTTPS + a cert for free and idles to ~$0, so it needs no load balancer in front | The $web endpoint is HTTPS out of the box - the simplest valid origin of the three, no compute at all |
The throughline: we let each cloud do the one thing it does cheapest, and pushed everything else (TLS to the browser, caching, WAF, redirects, headers, failover) up to the single Cloudflare edge. That's why two origins run zero code and the third runs a zero-permission container - the less each origin is trusted to do, the smaller the blast radius if any one of them is ever compromised.
One subtlety: the Host header
Each origin answers on its own hostname (…cloudfront.net, …run.app, …web.core.windows.net). If Cloudflare forwarded the public Host: csoh.org to them, each would reject the request - it doesn't recognize that name. So every origin in the Cloudflare pool sets a Host-header override to its own hostname. Small detail, but it's the thing that most often trips people up the first time they put object storage behind a proxy.
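As a rough sketch of what the override does, the snippet below rewrites the Host header before a request is forwarded. The hostnames are placeholders (the real ones are whatever each cloud assigned), and `headers_for_origin` is a hypothetical helper, not a Cloudflare API.

```python
# Placeholder hostnames; assumptions for illustration only.
ORIGIN_HOSTS = {
    "aws": "example-distribution.cloudfront.net",
    "gcp": "example-service.run.app",
    "azure": "exampleaccount.web.core.windows.net",
}


def headers_for_origin(origin, incoming):
    """Copy the reader's headers, replacing Host with the origin's own hostname."""
    outgoing = dict(incoming)  # leave the incoming request untouched
    outgoing["Host"] = ORIGIN_HOSTS[origin]
    return outgoing


reader_request = {"Host": "csoh.org", "Accept": "text/html"}
print(headers_for_origin("azure", reader_request))
```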
WAF, rate limiting, redirects, and caching - set once at the edge
All of it lives at Cloudflare, set once and applied no matter which origin serves the response:
- WAF - Cloudflare's free Managed Ruleset plus a rate-limit rule. It's a light rule set, but a static site has no SQL to inject or login to brute-force, so the rules mostly just eat bot-probe noise (see "What we didn't do" for restoring full-CRS parity).
- Legacy redirects - the /conc8/* and /csoh/* 301 maps are Cloudflare Redirect Rules, so they fire at the edge for every origin identically.
- Caching - Cloudflare Cache Rules set the edge and browser TTLs (HTML 1h, assets 1y immutable, search.html 60s). The object-storage origins don't emit consistent Cache-Control headers, so setting it at the edge gives uniform caching regardless of which cloud answered.
- HTTP → HTTPS - Cloudflare's "Always Use HTTPS" handles the port-80 redirect; there's no plain-HTTP path to any origin.
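The caching tiers above amount to a small decision function. The sketch below is one way to express them in Python, under stated assumptions: search.html is taken to live at the site root, directory-style URLs are treated as HTML, and everything else is treated as an "asset". The real matching lives in Cloudflare Cache Rules and may differ.

```python
HOUR = 60 * 60
YEAR = 365 * 24 * HOUR


def edge_ttl_seconds(path):
    """Return the cache TTL, in seconds, for a request path."""
    if path == "/search.html":
        return 60  # search page: refreshed every minute
    if path.endswith("/") or path.endswith(".html"):
        return HOUR  # ordinary HTML pages
    return YEAR  # assumed: every non-HTML file is a long-lived asset


for path in ["/", "/search.html", "/css/site.css"]:
    print(path, edge_ttl_seconds(path))
```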
Layer 3 - TLS (the lock icon in the browser)
What it stops: someone reading or modifying the page in transit between the user's browser and us.
MITRE ATT&CK mitigated: T1557 (Adversary-in-the-Middle), T1040 (Network Sniffing), T1565.002 (Transmitted Data Manipulation).
What's TLS? TLS (Transport Layer Security) is the modern name for what people called SSL - the encryption layer that makes URLs https:// instead of http://. Two computers establish a TLS connection, prove identity to each other with certificates, agree on a shared secret, and from there everything is encrypted. The lock icon in the browser is the user-facing signal.
Our setup has TLS at two separate layers, which often confuses people the first time they see it:
- Browser ↔ Cloudflare. Cloudflare's "Universal SSL" certificate, valid for csoh.org and www.csoh.org. This is the cert your browser actually validates and shows the lock icon for. It auto-renews on Cloudflare's normal cadence; we don't manage it.
- Cloudflare ↔ each origin, at Full (strict). A separate TLS connection on the back side of Cloudflare to whichever origin it picked. Each origin presents its own provider-managed cert (CloudFront's *.cloudfront.net, Cloud Run's *.run.app, Azure's *.web.core.windows.net), and Cloudflare's SSL/TLS mode is set to Full (strict), meaning it validates that origin cert rather than blindly trusting it. There is no unencrypted or unauthenticated hop anywhere in the path.
Why two layers and not just one? Because each connection terminates at a different party, and neither holds the other's private key. Cloudflare obtains its own cert for csoh.org from a publicly trusted certificate authority, which is why the browser accepts it; each origin presents a cert its own cloud provider issued and renews. Each cert covers what its owner can prove they control, and neither side has to share secrets.
The hardening details
- Modern TLS floor. Cloudflare is configured to refuse TLS 1.0 and 1.1 - only 1.2 and 1.3. Old TLS versions have known weaknesses; refusing them is the easiest "free" hardening you can do.
- HSTS with preload. Every response from our site includes an HTTP header telling the browser "always use HTTPS for csoh.org for the next year." With preload we get added to a list browsers ship with by default, so the protection is active on the very first visit too - even before any of our HTTPS responses have been seen.
- Auto-renewing certificates, everywhere. The edge cert (Cloudflare) and all three origin certs (CloudFront, Cloud Run, Azure) renew themselves on their own provider's schedule. We don't have a calendar reminder for any of them. Expired certs cause more outages than they prevent attacks; eliminating manual renewal eliminates that risk - and with three origins, that's three fewer certs to forget about.
Other security headers - set once at the edge
HSTS is the highest-impact header but not the only one. With three different origins, having each one set headers identically would be three places to drift out of sync - so we set them once at Cloudflare (a response-header Transform Rule). Whichever cloud serves the bytes, the response carries the same headers:
- Content Security Policy (CSP) - a strict policy: only first-party scripts, no inline JS, no eval(), only specific image and frame sources allowed. The single highest-impact defense against XSS, even if an attacker were able to inject a <script> tag into our HTML.
- X-Frame-Options: DENY + frame-ancestors 'none' in CSP - prevents anyone from embedding our pages in an iframe on their site. Stops clickjacking, where an attacker invisibly overlays our page under their own UI.
- X-Content-Type-Options: nosniff - tells the browser to trust the content-type we declared, instead of guessing from the first bytes. Closes some old MIME-confusion attacks.
- Referrer-Policy: strict-origin-when-cross-origin - when a reader clicks an external link from our site, the destination only sees that they came "from csoh.org," not the specific page or query parameters.
- Permissions-Policy - explicitly disables camera, microphone, geolocation, payment, USB, and motion-sensor APIs. We don't use any of them, so we deny them.
- Cross-Origin-Opener-Policy + Cross-Origin-Resource-Policy - limit how other origins can interact with windows or load resources from ours. Defends against newer-class side-channel attacks.
You can see all of these by running curl -I https://csoh.org/ from any terminal. They're public; that's the point.
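If you'd rather check the headers mechanically than read them by eye, a small script can do it. The sketch below parses a pasted `curl -I` response and lists which of the headers named in this section are absent. The helper names are hypothetical, and the sample response is made up for illustration; it is not real output from csoh.org.

```python
# Header names taken from the list in this section.
REQUIRED = [
    "Strict-Transport-Security",
    "Content-Security-Policy",
    "X-Frame-Options",
    "X-Content-Type-Options",
    "Referrer-Policy",
    "Permissions-Policy",
    "Cross-Origin-Opener-Policy",
    "Cross-Origin-Resource-Policy",
]


def parse_header_block(text):
    """Turn raw 'Name: value' lines into a dict keyed by lowercase name."""
    headers = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep:  # the status line has no colon and is skipped
            headers[name.strip().lower()] = value.strip()
    return headers


def missing_headers(headers):
    """Return the required header names that the response did not carry."""
    return [name for name in REQUIRED if name.lower() not in headers]


# Made-up sample: only two of the eight headers are present.
sample = "HTTP/2 200\nx-frame-options: DENY\nx-content-type-options: nosniff\n"
print(missing_headers(parse_header_block(sample)))
```

Header names are compared case-insensitively because HTTP/2 responses arrive in lowercase.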
Layer 4 - The origins themselves (S3, Cloud Run, Blob)
What it stops: reaching data an origin shouldn't expose; over-permissive cloud access if an origin were ever compromised.
MITRE ATT&CK mitigated: T1530 (Data from Cloud Storage), T1078.004 (Valid Accounts: Cloud Accounts), T1098.003 (Account Manipulation: Additional Cloud Roles).
Each origin is just "the cheapest HTTPS front door this cloud offers for a folder of static files" - but each is locked down so the only thing reachable is the site itself.
Each origin exposes only the site, nothing else
- AWS - the bucket is private. The S3 bucket blocks all public access; only this CloudFront distribution can read it, enforced by an Origin Access Control policy keyed to the distribution's ARN. There's no public S3 URL to find and poke at - the bucket simply isn't reachable except through the front door we built.
- GCP - public URL, zero-permission identity (below). Cloud Run's ingress is all (Cloudflare reaches the run.app URL directly). Public reachability is fine because the container only serves static files and its identity can touch nothing else in the project.
- Azure - only $web is public. The static-website feature exposes exactly one container ($web); every other blob in the storage account stays private. Deploys write to $web through a data-plane role (below), not by making the account public.
The compute origin's identity has zero permissions
The GCP origin is the one that runs code (nginx in a container), so it's the one that needs an identity. Every workload in GCP runs as a service account - an identity that holds the cloud permissions for whatever code is using it. A misconfigured workload SA is one of the most common cloud security mistakes: people grant "Editor" to the application's identity "to make it work," and now every CVE in the application is potentially also a path to "rewrite all the GCP resources in this project."
Our application is static nginx that makes no GCP API calls. So we created a dedicated service account (csoh-run-runtime) with zero IAM roles and run the container as that identity. If the container were ever compromised - RCE in nginx, malicious bytes in the image, anything - the attacker gets a foothold in a process that can't talk to anything else in the cloud project. Its blast radius is the container itself; that's the whole point. (The object-storage origins, AWS and Azure, run no code at all, so there's no runtime identity to abuse there - the attack surface is just "static files served read-only.")
This is a practical example of zero trust applied to your own application: don't grant your code anything you can't justify, and "I might need it later" is not a justification.
Layer 5 - The bytes we ship to each origin
What it stops: shipping malicious or vulnerable bytes to production by accident - or accidentally publishing a file that should never be public.
MITRE ATT&CK mitigated: T1195.002 (Compromise Software Supply Chain), T1525 (Implant Internal Image), T1552.001 (Unsecured Credentials in Files).
Two origin types, two flavors of "what we ship." The GCP origin ships a container image (nginx + the site), so it gets the full container supply-chain treatment below. The object-storage origins (AWS, Azure) ship a folder of files - so for them, "supply chain" means making sure that folder contains only what's meant to be public.
0. The object-storage origins: an allowlist, not request-time blocking
An nginx origin can keep sensitive files (dotfiles, .py scripts, internal .json, anything with a key in it) present in the container but blocked at request time by nginx rules. Object storage has no request-time rules - whatever you upload is world-readable. So for the object-storage origins we flip the model: a single build step (stage_site.sh) stages a dist/ directory containing only the public file set, and that's what gets synced to S3 and Azure. The allowlist (site-publish.filter) mirrors the nginx block rules exactly, and the build fails loudly if a secret-shaped file ever slips into dist/. Not uploading a file is a stronger guarantee than serving it and hoping a deny rule catches every request for it.
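To make the allowlist-then-guard idea concrete, here is a minimal Python sketch. It is not the real stage_site.sh: the pattern lists are invented for illustration (the actual list lives in site-publish.filter), and `is_public`, `looks_secret`, and `stage` are hypothetical names.

```python
import fnmatch
import shutil
from pathlib import Path

# Hypothetical patterns; assumptions, not the contents of site-publish.filter.
ALLOW = ["*.html", "*.css", "*.js", "*.png", "*.svg"]
SECRET_SHAPED = [".*", "*.pem", "*.key", "*.py"]


def is_public(rel_path):
    """True if the file name matches the allowlist."""
    name = Path(rel_path).name
    return any(fnmatch.fnmatch(name, pat) for pat in ALLOW)


def looks_secret(rel_path):
    """True if any path component looks like a dotfile, key, or script."""
    parts = Path(rel_path).parts
    return any(fnmatch.fnmatch(part, pat) for part in parts for pat in SECRET_SHAPED)


def stage(src, dist):
    """Copy only allowlisted files from src into dist; fail loudly on a leak."""
    src, dist = Path(src), Path(dist)
    staged = []
    for path in sorted(src.rglob("*")):
        rel = path.relative_to(src)
        if not path.is_file() or not is_public(rel):
            continue  # not on the allowlist: never copied, never uploaded
        if looks_secret(rel):
            raise ValueError(f"secret-shaped file matched the allowlist: {rel}")
        target = dist / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)
        staged.append(rel.as_posix())
    return staged


print(is_public("index.html"), is_public(".env"), looks_secret(".git/index.html"))
```

The second check exists for files that slip past the allowlist by accident, such as an HTML file sitting inside a dot-directory.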
What's a container image?
A container image is a packaged-up filesystem snapshot - your application code, plus the operating system files it needs to run, frozen as one shippable unit. For the GCP origin we push the image to a registry (Google's Artifact Registry); Cloud Run pulls it from there and runs it. The three controls below protect that image.
What's pinning? Pinning is naming a dependency by something the publisher cannot quietly redefine. A version like nginx:1.27-alpine or actions/checkout@v4 looks specific, but it's just a label - whoever owns it can repoint that label at different bytes tomorrow, and your build will pull the new bytes the next time it runs. Pinning means replacing that label with an immutable identifier - for container images, the SHA-256 digest of the exact bytes (@sha256:65645c…); for GitHub Actions, the full commit SHA (@a1b2c3d…). The label can move; the hash can't. If anyone tampers with the artifact upstream, the hash no longer matches, and the build fails closed instead of silently shipping the new bytes. We pin every external thing our deploy depends on (base image, every GitHub Action, the Cloud Run revision we route traffic to) for exactly this reason: it makes our deploy tamper-evident and reproducible. The trade-off is friction - somebody has to manually update the pin when we want a newer version - but that friction is the feature: an automated supply-chain attack can't propagate to us silently.
The supply chain for our container has three places where an attacker could substitute "what we meant to ship" with "what they preferred to ship." Each one gets a control:
1. The base image we start from
A typical Dockerfile starts with a line like FROM nginx:1.27-alpine, meaning "use whatever the nginx 1.27-alpine image is right now." But "right now" is whatever the registry returns. If someone compromised the registry, or the image owner's account, or any link in their build chain - that FROM line ships a malicious base layer into your image, and you'd never know unless you bought tooling specifically to detect it.
We pin to the content hash instead:
FROM nginx:1.27-alpine@sha256:65645c7bb6a0661892a8b03b89d0743208a18dd2f3f17a54ef4b76fb8e2f2a10
That long string after the @ is the cryptographic hash of the exact bytes we expect. Docker downloads the image, computes the hash, and refuses to use it if the hash doesn't match. The registry can't substitute different bytes without changing the hash, and the changed hash would make our build fail. Tamper-evident.
The trade-off: we have to manually update the hash when we want a newer base image. That's friction by design - it means an automated supply-chain attack doesn't propagate to us silently.
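The check a digest pin performs is simple to sketch. The Python below is a simplified illustration, not what Docker runs internally: real registries compute the digest over the image manifest, whereas this example hashes an arbitrary byte string, and `verify_digest` is a hypothetical helper.

```python
import hashlib


def verify_digest(blob, pinned):
    """Raise ValueError unless blob hashes to the pinned 'sha256:<hex>' value."""
    algo, _, expected = pinned.partition(":")
    if algo != "sha256":
        raise ValueError(f"unsupported digest algorithm: {algo}")
    actual = hashlib.sha256(blob).hexdigest()
    if actual != expected:
        raise ValueError(f"digest mismatch: expected {expected}, got {actual}")


original = b"bytes we reviewed and pinned"
pin = "sha256:" + hashlib.sha256(original).hexdigest()

verify_digest(original, pin)  # passes silently

try:
    verify_digest(b"bytes someone swapped in upstream", pin)
except ValueError as err:
    print("build fails closed:", err)
```

A mutable tag has no equivalent check: the registry decides what the name means at pull time, and the client has nothing to compare against.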
2. Stale packages on top of the pinned base
Pinning the hash freezes the base image's bytes. But the OS packages inside that base image (libssl, libxml2, libpng, etc.) keep getting new security fixes upstream. A digest pinned 6 months ago has 6 months of accumulated CVEs baked in.
We solve this by running apk upgrade immediately after the pinned base, in our Dockerfile:
RUN apk upgrade --no-cache && \
rm -rf /var/cache/apk/*
That tells the package manager: "fetch the current versions of every installed package, in this build." We start from a known good snapshot (the digest pin) and end with current security patches (the upgrade). The next layer (Trivy scanning) verifies we haven't missed anything.
3. The CI build artifact
Even with both controls above, a package we just included might turn out to carry a newly disclosed CVE. So every container we build gets scanned, in CI, by Trivy - an open-source vulnerability scanner. The scan walks every package in the image, cross-references it against public CVE databases, and the build fails if anything HIGH or CRITICAL shows up:
trivy image \
--exit-code 1 \
--ignore-unfixed \
--severity HIGH,CRITICAL \
"${IMAGE}"
The --ignore-unfixed flag filters CVEs that don't have a fix available yet - those are noise we can't act on, and including them would just train people to ignore the scan output.
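The gate those flags describe can be modelled in a few lines. This Python sketch uses a simplified, hypothetical record shape (`severity`, `fixed_version`) rather than Trivy's actual report format, and the CVE identifiers are placeholders.

```python
BLOCKING = {"HIGH", "CRITICAL"}


def blocking_findings(findings, ignore_unfixed=True):
    """Return the findings that should fail the build."""
    blocked = []
    for finding in findings:
        if finding["severity"] not in BLOCKING:
            continue  # below the severity threshold
        if ignore_unfixed and not finding.get("fixed_version"):
            continue  # no fix available yet, so nothing we can act on
        blocked.append(finding)
    return blocked


def exit_code(findings):
    """Mirror the CI behaviour: non-zero exit if anything blocks."""
    return 1 if blocking_findings(findings) else 0


# Placeholder findings for illustration only.
findings = [
    {"id": "CVE-EXAMPLE-1", "severity": "MEDIUM", "fixed_version": "1.2.4"},
    {"id": "CVE-EXAMPLE-2", "severity": "CRITICAL", "fixed_version": ""},
    {"id": "CVE-EXAMPLE-3", "severity": "HIGH", "fixed_version": "3.0.9"},
]
print(exit_code(findings), [f["id"] for f in blocking_findings(findings)])
```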
If the scan passes, the image is pushed to Artifact Registry (Google's container registry) with two important properties:
- Immutable tags. Once we push csoh-site:abc123, the tag abc123 can't be overwritten or moved. Nobody - not an attacker who got into our deploy account, not a careless engineer - can silently change what bytes that tag refers to.
- Hash-based, not :latest. Each Cloud Run deploy points to a specific image hash. Rollback is one CLI command: gcloud run services update-traffic --to-revisions <old-revision>=100, and there's zero ambiguity about what bytes the rolled-back revision is running.
The Artifact Registry repo also has a retention policy: keep the 30 most recent images, delete untagged ones older than 7 days. We keep enough history for any sane rollback without paying for unbounded storage.
Layer 6 - How we deploy to three clouds without a saved password
What it stops: credential theft from the deploy pipeline. The most common single vector of website compromise - and with three clouds, three times the credentials that don't exist to steal.
MITRE ATT&CK mitigated: T1552.001 (Credentials In Files), T1552.004 (Private Keys), T1528 (Steal Application Access Token).
This is the most consequential design choice on this page. Read it carefully. The naïve way to deploy to three clouds would be to store three sets of long-lived credentials (an AWS access key, a GCP service-account JSON, an Azure client secret) in GitHub Secrets - tripling the blast radius of a leaked secrets store. We store none of them. Every cloud is reached with keyless OIDC federation: the same idea, implemented three times. We'll walk through the GCP version in detail because it's representative, then show how AWS and Azure do the identical dance.
The traditional approach (don't do this)
Most CI/CD pipelines deploy by storing a long-lived credential in a secret manager. For GCP, that means a service account key - a JSON file with cryptographic material that proves "I am this service account." Each workflow run reads the JSON from secrets, presents it to GCP, and uses the resulting access. This works. It's also the source of countless real breaches, because:
- The JSON key never expires unless someone manually rotates it.
- Anyone who reads the secrets store gets it.
- It survives the engineer who created it leaving the company, until someone notices.
- If a leaked secret ends up on GitHub or a Pastebin, it's still useful to attackers months later.
What we do instead: Workload Identity Federation
Workload Identity Federation (WIF) replaces "stored credential" with "prove who you are at the moment you ask for access."
Walking the diagram step-by-step:
- Our GitHub Actions workflow runs. As part of starting up, GitHub mints a short-lived OIDC token for that specific workflow run. The token is signed by GitHub's identity service and includes verifiable claims: this is repo CloudSecurityOfficeHours/csoh.org, on branch main, in workflow deploy.yml, workflow run #12345.
- The workflow hands that token to Google Cloud's STS (Security Token Service), saying "exchange this for an access token, please."
- Google Cloud's STS checks: do I trust GitHub's identity service as a token issuer? Yes (we configured it to). And does this token's repository claim match the policy I've set? Our policy says it must equal exactly CloudSecurityOfficeHours/csoh.org. Yes.
- STS returns a 1-hour Google Cloud access token, scoped to impersonating one specific service account - csoh-deployer, our deploy-only identity.
- The workflow uses that token to push containers and deploy revisions. After 1 hour, the token expires.
Crucially: there is no JSON key anywhere in this flow. There's nothing for a leaked GitHub secret to reveal - the deploy auth is created on-demand, scoped to one workflow run, and discarded. If a workflow log somehow leaked, an attacker would get an access token that's valid for at most one hour, scoped to "deploy to this one project," and they'd have to use it before it expired. There's nothing to rotate, because there's nothing stored.
The Terraform that wires this up is in gcp/wif.tf - about 30 lines.
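To make the exchange concrete, here is a minimal Python sketch of the decision STS makes in the steps above. It is a toy model, not Google's implementation and not our Terraform: signature verification is left out, the exchange function and AccessToken class are hypothetical names, and the issuer URL is GitHub's public OIDC issuer rather than a value copied from gcp/wif.tf.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Illustrative values only; the real policy lives in gcp/wif.tf.
TRUSTED_ISSUER = "https://token.actions.githubusercontent.com"
ALLOWED_REPOSITORY = "CloudSecurityOfficeHours/csoh.org"
TOKEN_LIFETIME = timedelta(hours=1)


@dataclass(frozen=True)
class AccessToken:
    service_account: str
    expires_at: datetime

    def is_valid(self, now: datetime) -> bool:
        return now < self.expires_at


def exchange(claims: dict, now: datetime) -> AccessToken:
    """Mimic the STS decision: trust the issuer, match the repository claim,
    then hand back a short-lived token for one service account.

    Verifying the OIDC token's signature is deliberately omitted here.
    """
    if claims.get("iss") != TRUSTED_ISSUER:
        raise PermissionError("untrusted token issuer")
    if claims.get("repository") != ALLOWED_REPOSITORY:
        raise PermissionError("repository claim does not match policy")
    return AccessToken("csoh-deployer", now + TOKEN_LIFETIME)


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    token = exchange({"iss": TRUSTED_ISSUER, "repository": ALLOWED_REPOSITORY}, start)
    print(token.is_valid(start + timedelta(minutes=59)))  # True
    print(token.is_valid(start + timedelta(minutes=61)))  # False
```

The point of the sketch is the shape of the check: nothing is looked up from storage, and the only thing that comes back is a token with an expiry already attached.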
The same pattern, on AWS and Azure
The exchange above isn't a GCP feature - it's the OIDC federation standard, and every major cloud speaks it. So the AWS and Azure publish jobs do the identical dance, just with each cloud's nouns:
- AWS - GitHub's OIDC token is presented to AWS STS via sts:AssumeRoleWithWebIdentity. An IAM role (csoh-site-publisher) trusts GitHub's issuer, with a condition that the token's sub claim must equal repo:CloudSecurityOfficeHours/csoh.org:ref:refs/heads/main - the exact analogue of the GCP repo condition. The role can write the S3 bucket and invalidate the one CloudFront distribution, nothing more. (aws/oidc.tf)
- Azure - an Entra ID app registration carries a federated credential whose subject is that same repo:…:ref:refs/heads/main string and whose issuer is GitHub. The app's service principal holds one data-plane role, "Storage Blob Data Contributor," scoped to the one storage account. No client secret is ever created. (azure/identity.tf)
Three clouds, three short-lived tokens minted on demand, zero stored credentials. A leaked GitHub secrets store would reveal nothing useful, because the deploy auth for every cloud is created per-run and discarded.
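Why insist on an exact match for the subject string? A small Python sketch shows the difference between an exact comparison and a wildcard one. The wildcard pattern and the sample subjects (a feature branch, a different owner) are hypothetical and exist only for contrast; they are not taken from aws/oidc.tf or azure/identity.tf.

```python
from fnmatch import fnmatchcase

EXPECTED_SUB = "repo:CloudSecurityOfficeHours/csoh.org:ref:refs/heads/main"
# A looser, hypothetical pattern shown only to illustrate what exact matching rules out.
LOOSE_PATTERN = "repo:CloudSecurityOfficeHours/csoh.org:*"


def strict_trust(sub: str) -> bool:
    """Exact comparison: only the main branch of the one repo passes."""
    return sub == EXPECTED_SUB


def loose_trust(sub: str) -> bool:
    """Wildcard comparison: any ref in the repo passes."""
    return fnmatchcase(sub, LOOSE_PATTERN)


if __name__ == "__main__":
    samples = [
        EXPECTED_SUB,
        "repo:CloudSecurityOfficeHours/csoh.org:ref:refs/heads/feature-x",
        "repo:someone-else/csoh.org:ref:refs/heads/main",
    ]
    for sub in samples:
        print(sub, strict_trust(sub), loose_trust(sub))
```

Under the wildcard, a token minted for the hypothetical feature-x branch would be accepted; under the exact match it is rejected, so only code that has already landed on main can ask for deploy access.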
The deploy identities have narrow permissions
Each cloud's deploy identity can do exactly what it needs to publish, and nothing else. The GCP csoh-deployer service account can do exactly three things:
- Push container images to our Artifact Registry repo.
- Create Cloud Run revisions and shift traffic between them.
- Set the runtime identity on a Cloud Run revision (so deploys can specify which service account the running container will use).
It can't read other GCP projects, disable logging, or escalate to admin. The AWS role and Azure principal are scoped just as tightly: write-one-bucket-and-invalidate-one-distribution, and write-one-storage-account, respectively.
And remember from Layer 4: the GCP runtime identity (csoh-run-runtime) the deployer sets on the running container has zero permissions. So even an attacker who compromises the deploy identity AND uses it to ship a malicious container to production… ends up with a malicious container that has no cloud access. The blast radius is bounded at every layer.
Layer 7 - Protecting the deploy pipeline itself
What it stops: a malicious or accidental change to the workflow that does the deploying.
MITRE ATT&CK mitigated: T1195.001 (Compromise Software Dependencies and Development Tools), T1199 (Trusted Relationship), T1078 (Valid Accounts).
The deploy workflow file can ship code to three production clouds. That makes the file itself as sensitive as a production secret - anyone who can change it can change what gets shipped, everywhere. We protect it with multiple layers on the GitHub side:
- CODEOWNERS - a special file that says "any change to .github/workflows/, infra/, the Dockerfile, or security docs requires review from @Nunley." A pull request touching those paths can't merge without that explicit approval.
- Branch protection on main - pull requests are required (no direct push to main), each PR needs at least 1 approving review from a code owner, and three required status checks (ruff, actionlint, yamllint) must pass before merging is allowed.
- "Production" environment gate - the deploy job declares environment: production. GitHub's environment configuration says only the main branch can deploy to it. A pull request from a fork can't run this workflow even if the fork's author is sneaky about it.
- Secret scanning + push protection are on. If anyone ever commits a credential-shaped string (an AWS key, a GitHub token, anything GitHub knows the pattern of), the push is blocked at the moment of git push.
- Dependabot security updates are on. Anything we depend on that ships a CVE generates an automatic PR for us to review.
- Every third-party GitHub Action is pinned to a specific commit SHA, not a version tag. Tags can be moved silently. SHAs can't.
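The SHA-pinning rule in the last item is easy to check mechanically. The following is a minimal Python sketch of such a check, assuming a line-based scan is good enough; it is not a tool from our repo, the unpinned_actions helper is a hypothetical name, and a real linter would parse the YAML rather than match lines.

```python
import re

# Matches "uses: owner/repo@ref" with or without a leading list dash.
USES_RE = re.compile(r"^\s*(?:-\s*)?uses:\s*([^\s#]+)")
FULL_SHA_RE = re.compile(r"^[0-9a-f]{40}$")


def unpinned_actions(workflow_text: str) -> list[str]:
    """Return every action reference that is not pinned to a full 40-character commit SHA.

    Local actions (./path) and docker:// references are skipped.
    """
    findings = []
    for line in workflow_text.splitlines():
        match = USES_RE.match(line)
        if not match:
            continue
        ref = match.group(1).strip("'\"")
        if ref.startswith("./") or ref.startswith("docker://"):
            continue
        _, _, version = ref.partition("@")
        if not FULL_SHA_RE.match(version):
            findings.append(ref)
    return findings


if __name__ == "__main__":
    sample = """
steps:
  - uses: example-org/tagged-action@v4
  - uses: example-org/pinned-action@0123456789abcdef0123456789abcdef01234567
  - name: local
    uses: ./.github/actions/local
"""
    print(unpinned_actions(sample))  # ['example-org/tagged-action@v4']
```

The action names in the sample are made up. A check like this could run as one more required status check, so a tag-pinned action never reaches main in the first place.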
Walking the workflow (deploy.yml) - it builds once, then fans out to three publish jobs in parallel:
- Triggers on pushes to main matching specific paths (HTML, CSS, JS, Dockerfile, nginx.conf, the staging script, the workflow file itself), plus workflow_dispatch for manual runs.
- Concurrency group deploy with cancel-in-progress: true - if a newer commit lands while an older deploy is mid-flight, the older one is cancelled. Prevents the stale-content race where an older deploy publishes after a newer one finishes.
- Permissions block scopes the auto-injected GITHUB_TOKEN to contents: read + id-token: write (the latter is what lets each cloud's OIDC exchange happen). Nothing else.
- build job - regenerates the search index, runs stage_site.sh to produce the public dist/ folder, and uploads it as an artifact. One build, so all three origins serve byte-identical content.
- publish-aws - assumes the IAM role via OIDC, aws s3 sync --delete of dist/ to the bucket, then a CloudFront invalidation.
- publish-azure - logs in via the Entra federated credential, az storage blob sync of dist/ into $web (sync handles deletions too).
- publish-gcp - the container path: WIF auth, docker build with org.opencontainers.image.* labels, Trivy scan (fails on HIGH/CRITICAL), push to Artifact Registry under a single immutable hash-based tag, then gcloud run deploy. The push is idempotent (it skips if the tag already exists) so a rerun doesn't trip over immutable_tags=true. There's no cache-invalidation step in the deploy - Cloudflare caches at the edge and is purged separately.
- Every job declares environment: production, so a fork PR can't run any of them even if it could mint an OIDC token - protected-environment rules only apply on main.
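The idempotent push in the publish-gcp step is worth a closer look, because immutable tags turn a naive rerun into an error. Below is a toy Python model of the idea. ImmutableRegistry is a hypothetical in-memory stand-in for a registry with immutable tags, and deriving the tag from a SHA-256 of the image bytes is an assumption made for the sketch; the real workflow may hash something else.

```python
import hashlib


class ImmutableRegistry:
    """Toy stand-in for a registry with immutable tags: re-pushing a tag is an error."""

    def __init__(self) -> None:
        self._tags: dict[str, bytes] = {}

    def has_tag(self, tag: str) -> bool:
        return tag in self._tags

    def push(self, tag: str, image: bytes) -> None:
        if tag in self._tags:
            raise ValueError(f"tag {tag!r} already exists and is immutable")
        self._tags[tag] = image


def hash_tag(image: bytes) -> str:
    # Assumption: the tag is a truncated content hash of the image bytes.
    return hashlib.sha256(image).hexdigest()[:12]


def idempotent_push(registry: ImmutableRegistry, image: bytes) -> str:
    """Push under a hash-based tag, skipping the push when the tag is already present."""
    tag = hash_tag(image)
    if registry.has_tag(tag):
        return f"skipped {tag}"
    registry.push(tag, image)
    return f"pushed {tag}"


if __name__ == "__main__":
    registry = ImmutableRegistry()
    print(idempotent_push(registry, b"site-image"))  # first run pushes
    print(idempotent_push(registry, b"site-image"))  # rerun skips instead of failing
```

Because the tag is a function of what is being shipped, "the tag already exists" means "this exact image is already there," which is why skipping is safe.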
Layer 8 - Logging and what we'd see during an attack
What it stops: nothing directly - but it's how we'd notice if any of the layers above failed.
MITRE ATT&CK detection coverage: T1190 (Exploit Public-Facing Application, via WAF block logs), T1098 (Account Manipulation, via IAM change logs), T1078.004 (Valid Accounts: Cloud Accounts, via the audit log stream).
Even with everything above, you should assume something will eventually go wrong. A vulnerability in nginx, a leaked credential we didn't anticipate, a configuration drift no one caught - there's always a possibility. Logging is what turns "an attack happened" into "an attack happened, here's exactly when, here's exactly what they did, and here's what we need to fix." Without logs, you have no idea.
With three clouds, logging lives in two places. The edge - where total traffic is visible - is Cloudflare's zone analytics and Load Balancer health dashboards: requests, cache hit ratio, WAF blocks, and which origins are healthy. The origins log only what got past Cloudflare's cache and actually reached them. The GCP origin keeps the deepest forensic trail, because that's where the IAM and audit story lives.
By default, Google Cloud Logging keeps logs for 30 days. That's not enough for security work - supply-chain attacks specifically are often discovered months after the fact, and the logs you'd need for forensics are gone. We define a custom 400-day retention bucket and a log sink that routes the security-relevant events into it (see gcp/logging.tf). The filter captures three categories:
(resource.type="cloud_run_revision" AND httpRequest.status>=400) OR protoPayload.serviceName="iam.googleapis.com" OR protoPayload.@type="type.googleapis.com/google.cloud.audit.AuditLog"
- Every 4xx and 5xx from the Cloud Run origin. Tells us about errors and probes that reached this origin (rather than being served from cache or handled at the edge). Helpful for both performance triage and abuse detection.
- Every IAM change. If anyone modifies a permission, grants a role, creates a service account - we have it. The single highest-leverage admin event in any cloud project; you almost always want to know about it before the audit happens.
- The full audit log stream. Every API call against this project, who made it, when, with what outcome.
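To see how the three clauses of the filter combine, here is a minimal Python predicate that mirrors its boolean structure over log entries represented as nested dictionaries. It is a reading aid, not how Cloud Logging evaluates filters; the get helper, the function name, and the sample entries are hypothetical.

```python
AUDIT_LOG_TYPE = "type.googleapis.com/google.cloud.audit.AuditLog"


def get(entry: dict, path: str):
    """Walk a dotted path through nested dicts; return None when any part is missing."""
    node = entry
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def routed_to_security_bucket(entry: dict) -> bool:
    """Mirror the sink filter: (Cloud Run AND status >= 400) OR IAM service OR audit log."""
    status = get(entry, "httpRequest.status")
    origin_error = (
        get(entry, "resource.type") == "cloud_run_revision"
        and isinstance(status, int)
        and status >= 400
    )
    iam_change = get(entry, "protoPayload.serviceName") == "iam.googleapis.com"
    audit_log = get(entry, "protoPayload.@type") == AUDIT_LOG_TYPE
    return origin_error or iam_change or audit_log


if __name__ == "__main__":
    probe = {"resource": {"type": "cloud_run_revision"}, "httpRequest": {"status": 404}}
    normal = {"resource": {"type": "cloud_run_revision"}, "httpRequest": {"status": 200}}
    print(routed_to_security_bucket(probe))   # True
    print(routed_to_security_bucket(normal))  # False
```

Note the parentheses: the status threshold applies only to the Cloud Run clause, so successful page loads stay out of the long-retention bucket while every IAM and audit entry goes in regardless of outcome.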
WAF blocks and per-request edge logs live in Cloudflare, at the layer that does the blocking. (Cloudflare's free tier keeps less log history than a paid plan or our GCP sink would, which is part of the trade noted in "What we didn't do.")
What we don't have (yet): a SIEM, real-time alerting, or anomaly detection. For a community site the cost/benefit doesn't justify it; for a production SaaS workload, you'd want this same sink plus an export to BigQuery for long-term analytics or Pub/Sub for streaming detection. We also keep a Cloud Monitoring dashboard for the GCP origin's day-to-day metrics - request rate, latency percentiles, instance count (defined in gcp/monitoring.tf, "csoh.org Origin" in the GCP console) - and watch Cloudflare's analytics for the whole-site view.
What we didn't do (and why)
Listing controls we considered and rejected is more honest than pretending the design is finished. Each of these is a defensible choice for a small static site and a less-defensible choice as the threat surface grows. If you're copying this design for something bigger, this list is your homework.
- Binary Authorization (signed-image enforcement). Google Cloud has a feature where you can configure Cloud Run to refuse to start a container unless its image has been cryptographically signed by an approved party. We've enabled the API but don't enforce a policy yet - Trivy + immutable image tags + WIF-restricted push already cover most of what this would add for a single-image, single-deployer setup. The day we have multiple environments (staging vs. production) or multiple services, we'll turn it on.
- Real client IP visibility through Cloudflare. Cloudflare proxy hides the real reader's IP from us. There's a standard way to surface it (Cloudflare adds an X-Forwarded-For header; you configure your origin to trust that header from Cloudflare's IP ranges). We haven't wired this up yet because it requires keeping a current allowlist of Cloudflare egress IPs in Terraform, which is a non-zero maintenance ask. On the to-do list.
- SLSA provenance attestation. A standard for cryptographically attesting "this image was built by this pipeline from this source commit." The slsa-github-generator action makes this fairly easy to add. On the to-do list; would deepen the supply-chain story above.
- Image signing with cosign. Same shape as SLSA - would let us reject unsigned images at deploy time. Natural follow-up to SLSA provenance.
- Distroless or scratch base image. A truly minimal container has only the application binary, with no shell, no package manager, no system utilities. nginx-on-alpine is bigger than that - but it gives us the URL-rewriting, header-injection, and config flexibility we need today. The Trivy scan + apk upgrade keeps the alpine package surface honest.
- Full OWASP CRS at the edge. Cloudflare's free-plan WAF is the lighter free Managed Ruleset rather than a tunable OWASP Core Rule Set. For a static site with no database or login, a full CRS would be mostly demonstrative anyway. To restore parity you'd move to Cloudflare's paid WAF, or attach AWS WAF to the CloudFront origin - both real options if the threat surface grows.
- Edge-level WebP conversion (Cloudflare Polish). We do serve WebP, the origin-agnostic way: every <img> with a generated .webp sibling is wrapped in a <picture> with a WebP <source> (see wrap_img_webp.py), so capable browsers fetch the smaller file and everything else falls back to the original. What we don't do is Cloudflare Polish - transparent edge conversion with no markup - because that's a Pro-plan feature and we run the free plan plus the Load Balancing add-on.
- Per-region failover within a cloud. Each origin is single-region (one S3 region, one Cloud Run region, one Azure region). We don't need per-cloud multi-region because the cross-cloud failover already covers the realistic outage: Cloudflare health-checks all three and routes around any that's down. A region outage in one cloud just shifts traffic to the other two. Adding multi-region inside each cloud on top of that is cost + complexity that doesn't pay off at our scale.
- SIEM integration / real-time alerting. Logs are retained; we'd notice an attack on the next dashboard check. We don't get woken up at 3am. For a community site, that's the right call. For something with real users and revenue, you'd want at least PagerDuty integration on the most-critical filters.
- Custom VPC-SC service perimeter. GCP's heaviest network-isolation feature - useful when you have services holding sensitive data and want to forbid all data egress outside a defined boundary. Our runtime SA has access to nothing, so there's nothing to perimeterize.
- An enterprise-style landing zone / cloud foundation. Everything on this page lives in one GCP project, one AWS account, and one Azure subscription - flat, single-tenant. An enterprise foundation is a different shape: a folder hierarchy of dozens of projects separated by environment and business unit, a Shared VPC owned by a platform team, Cloud Identity sync from your IdP, aggregated org-wide log sinks into a dedicated security project, Cloud KMS and Secret Manager owned by separate teams, org-policy constraints that pre-block risky configs everywhere, and a Cloud Build / Terraform pipeline that's the only thing allowed to write to prod. We're a single static site with one deployer (Shawn) - we don't need any of that. If you're standing up GCP at company scale, the patterns we did not use are documented in the landing zones & cloud foundations guide.
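The "real client IP" item in the list above describes trusting X-Forwarded-For only when the request arrives from a known proxy range. A minimal Python sketch of that rule follows, under loud assumptions: the range below is a placeholder from the RFC 5737 documentation space, not Cloudflare's published list; the client_ip function is hypothetical; and the sketch assumes exactly one trusted proxy hop, so it reads the right-most header entry (the one the trusted proxy appended).

```python
import ipaddress
from typing import Optional

# Placeholder range from the RFC 5737 documentation space, NOT Cloudflare's real list.
TRUSTED_PROXY_RANGES = [ipaddress.ip_network("203.0.113.0/24")]


def client_ip(peer_ip: str, x_forwarded_for: Optional[str]) -> str:
    """Return the forwarded client IP only when the direct peer is a trusted proxy.

    Otherwise the header is ignored, because any client can send a forged one.
    """
    peer = ipaddress.ip_address(peer_ip)
    if x_forwarded_for and any(peer in net for net in TRUSTED_PROXY_RANGES):
        # Assume a single trusted hop: the last entry is what that proxy saw.
        candidate = x_forwarded_for.split(",")[-1].strip()
        try:
            return str(ipaddress.ip_address(candidate))
        except ValueError:
            pass  # Malformed header value: fall back to the peer address.
    return str(peer)


if __name__ == "__main__":
    print(client_ip("203.0.113.7", "198.51.100.9"))  # 198.51.100.9
    print(client_ip("192.0.2.50", "198.51.100.9"))   # 192.0.2.50
```

The maintenance cost mentioned in that item is visible here: the whole check is only as good as the range list, which has to track whatever the proxy provider currently publishes.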
What this costs to run
Approximate monthly bill at our traffic level (low - we're a community site):
| Component | Approx / month |
|---|---|
| Cloudflare Load Balancing add-on (Free plan + LB) | ~$5-7 |
| AWS S3 + CloudFront (free-tier egress) | ~$0-1 |
| GCP Cloud Run (scale-to-zero) + Artifact Registry | ~$0-1 |
| Azure Blob static website | ~$0-1 |
| Terraform state (GCS) + GCP logging | < $2 |
| Total | ~$8-12/mo |
A few honest notes on cost:
- The Cloudflare Load Balancing add-on is the dominant line item (~$5-7); it buys the active/active failover across all three clouds. Everything below it is rounding error at our traffic.
- All three origins essentially scale to zero. S3 and Azure Blob bill for storage (pennies for a few hundred files) plus egress that mostly never happens because Cloudflare caches it; Cloud Run bills per request and is free when idle.
- For a security community, the deployment is the value proposition - and it's a multi-cloud one. We get the full security story and three-cloud resilience for about the price of a couple coffees.
If you want to copy this for your own site
This isn't a step-by-step tutorial - it's a checklist. Each item links to the actual file we use, so you can read the working version. If you want to copy this for your own static site, the path is roughly:
- Pick how many clouds you actually want. The design degrades gracefully: one origin is a normal "static site behind Cloudflare," two gives you failover, three is what we run. Start with one and add origins later - the Cloudflare pool just grows.
- Stand up each origin from its Terraform directory. infra/terraform/ has one dir per cloud (aws/, gcp/, azure/, cloudflare/). Each is self-contained: terraform apply in aws/ builds the private bucket + CloudFront + the OIDC role, and so on. See infra/README.md for the per-cloud bootstrap.
- Wire the origins into the Cloudflare dir. Feed each origin's hostname (from its terraform output) into cloudflare/ and apply - that creates the Load Balancer, pool, health monitor, security-header + redirect + cache rules.
- Copy .github/workflows/deploy.yml and tools/stage_site.sh, and set the per-cloud resource IDs as repo Variables (the README lists exactly which terraform output feeds each one). Every cloud authenticates keyless via OIDC - no secrets to paste.
- On GitHub: create a production environment scoped to main; create a CODEOWNERS file requiring your review on workflow + infra paths; turn on branch protection, secret scanning, and Dependabot.
- Cut over safely. Verify each origin directly, add them to the Cloudflare LB with the old origin kept as a fallback, then flip DNS - the README has the staged cutover + rollback runbook.
Realistic time investment if you've never used Terraform before: a weekend to read everything carefully and a day to stand up all three clouds. If you only want one origin to start: an evening. If you hit a wall, bring it to Friday Zoom.
Further reading
- The Terraform itself - infra/terraform/, one directory per cloud. Start with cloudflare/load_balancer.tf for the active/active pool; gcp/wif.tf, aws/oidc.tf, and azure/identity.tf for the three keyless-deploy stories side by side.
- The deploy workflow - deploy.yml. Build once, fan out to three clouds; heavily commented.
- Our broader security writeup - SECURITY.md. Covers the security headers, the CI auth model, every secret in the repo, and rotation cadence.
- How CSOH uses GitHub Actions - the companion learning page walks through every workflow file in the repo.
- Glossary - every acronym we used here is in the CSOH glossary with cross-links.
- External documentation:
  - GCP - Workload Identity Federation overview
  - google-github-actions/auth - the action that does the OIDC exchange
  - Cloudflare - Load Balancing - the active/active pool, health monitors, and origin steering this site's edge runs on
  - GCP - Cloud Run ingress controls
  - GCP - Enterprise foundations blueprint - what this same defense-in-depth model looks like at organization scale: folder hierarchy, Shared VPC, aggregated log sinks, Cloud KMS, Security Command Center, and the org-policy constraints we don't need for a single static site
  - OWASP Top 10 - the canonical list of common web app vulnerabilities
  - SLSA - the supply-chain provenance framework we'll add next
Questions?
Bring them to Friday Zoom. Several of our regulars run nontrivial GCP setups (multi-project orgs, Binary Authorization in production, signed build artifacts) and are happy to walk through specifics for your environment. The meeting recaps often include cloud-deployment war stories.