The 30-second version: A service mesh is a dedicated infrastructure layer for service-to-service communication inside a Kubernetes cluster. It gives every east-west call three things - encryption (mTLS), identity (a cryptographic workload identity, usually SPIFFE-based), and policy (authorization decisions on every hop) - without changing application code.
The four meshes that matter in 2026: Istio (most powerful, most complex; CNCF graduated; sidecar and ambient modes), Linkerd (smallest operational surface; CNCF graduated; Rust micro-proxy), Cilium Service Mesh (no sidecar; eBPF; bundled with the CNI), Consul Connect (HashiCorp; strong multi-runtime story including VMs). The major architecture shift since 2024 is the move away from per-pod sidecars toward ambient (Istio) and sidecarless (Cilium) data planes - same outcomes, dramatically less operational tax.
What a service mesh is
A service mesh is a dedicated infrastructure layer that intercepts every service-to-service network call in a Kubernetes cluster and adds encryption, identity, policy, observability, and traffic management - all without requiring the application code to know it's there. The interception happens either via a sidecar proxy injected into each pod, or via a node-level data plane (ambient or eBPF) that performs the same job without modifying the pod.
The problems the mesh solves are the problems every microservices architecture eventually develops:
- East-west encryption. Service-to-service traffic inside the cluster is usually plaintext by default. Once any workload is breached, lateral movement is trivial. The mesh gives every hop mTLS without an SDK change.
- Workload identity. "Service A is calling service B" is normally inferred from network position - an IP, a port, a NetworkPolicy. The mesh gives every workload a cryptographic identity that the receiving service can verify, regardless of network position.
- Policy that follows the workload. Authorization decisions on every call ("can the orders service call the payments service?") in a uniform language, applied at the proxy, not buried in each application's middleware.
- Observability of east-west traffic. Distributed traces, golden-signal metrics, and connection-level logs for every call, automatically - without instrumenting each service.
- Traffic management. Retries, timeouts, circuit breakers, canary releases, fault injection - all decoupled from application code.
The cost of the mesh is real: a new control plane to operate, a new failure mode to understand, latency added at every hop, and CPU/memory consumed by every sidecar (mitigated heavily in ambient and sidecarless modes). Whether it's worth that cost depends on how many services you have, how regulated the traffic is, and how much your application code already does the mesh's jobs poorly.
For Kubernetes fundamentals that underpin this discussion, see Kubernetes security; for the layer beneath the mesh, see network security; for the philosophy the mesh implements, see zero trust.
The sidecar pattern vs sidecar-less
The original service mesh architecture put a proxy container (almost always Envoy) into every application pod as a sidecar. Every packet leaving the application container was redirected (via iptables rules installed by an init container) into the local sidecar, which then handled mTLS, policy, observability, and traffic management before sending the packet out to its destination - where another sidecar received it and reversed the process.
The sidecar pattern works, and it shipped to production for the better part of a decade. It has well-known costs:
- Resource overhead. A sidecar Envoy adds 50-200 MB of memory and a fraction of a CPU per pod. Across a 5,000-pod cluster, that's hundreds of GB of RAM and dozens of cores spent on proxies.
- Latency. Two extra proxy hops on every call. Usually 1-3 ms per direction; sometimes worse under load.
- Lifecycle coupling. The sidecar should start before the app and shut down after it, but Kubernetes originally gave no ordering guarantee between the two containers; mismatches produced the famous "job pod never terminates because the sidecar is still running" class of bugs. Native sidecar support (KEP-753; beta and enabled by default in Kubernetes 1.29, GA in 1.33) finally fixed this.
- Upgrade churn. Every Envoy CVE means rolling every workload to pick up a new sidecar image. The cadence is unrelenting.
- Blast radius. A misconfigured sidecar can take down the application pod it's attached to.
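The resource-overhead figure above is simple multiplication, and it is worth redoing with your own pod count. A minimal sketch, assuming decimal gigabytes (1 GB = 1000 MB) and the 50-200 MB per-sidecar range quoted in the list; the function name is illustrative, not part of any mesh tooling:

```python
def sidecar_memory_gb(pods: int, mb_per_sidecar: float) -> float:
    """Total memory spent on sidecars, in decimal GB (1 GB = 1000 MB)."""
    return pods * mb_per_sidecar / 1000


if __name__ == "__main__":
    pods = 5_000
    low = sidecar_memory_gb(pods, 50)    # 5,000 pods * 50 MB each
    high = sidecar_memory_gb(pods, 200)  # 5,000 pods * 200 MB each
    print(f"{pods} pods: {low:.0f}-{high:.0f} GB of RAM spent on sidecars")
```

For the 5,000-pod cluster in the example this gives a range of 250 GB to 1,000 GB, which is where "hundreds of GB" comes from.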
The 2024-2026 architectural shift is to move the data plane out of the pod:
- Istio Ambient. Replaces sidecars with a per-node L4 proxy (ztunnel) for mTLS and identity, plus an optional per-namespace L7 proxy (waypoint) for advanced features. Beta in Istio 1.22, GA in Istio 1.24. Same control plane, different data plane.
- Cilium Service Mesh. Uses eBPF programs in the kernel to do the work a sidecar would. No proxy in the pod; the L4 path runs in eBPF and L7 features run in an Envoy that lives at the node (one per node, not one per pod).
- Linkerd still uses sidecars, but its Rust micro-proxy (linkerd2-proxy) is so much smaller than Envoy that the overhead argument has different weight. Linkerd has been clear it's evaluating ambient-style architectures and has shipped early experiments.
The sidecar pattern isn't dead, but it's no longer the default new deployments reach for unless they have a specific reason. The new question is "sidecar-based or sidecar-less" rather than "which sidecar".
The service mesh landscape
The meshes you will encounter, in rough order of mindshare and production adoption:
Open-source, CNCF projects
- Istio - CNCF graduated (2023). Originated at Google and IBM; control plane is istiod; data plane is Envoy in either sidecar or ambient mode. The most feature-complete mesh - JWT auth, request authentication, ingress / egress gateway, multi-cluster, Gateway API. Operational complexity is the trade-off.
- Linkerd - CNCF graduated (2021). Originated at Buoyant. Lightweight Rust micro-proxy as the sidecar, small control plane, opinionated defaults. Easiest mesh to operate. Strong on the "boring is good" axis. Note: Linkerd stable releases moved to a commercial offering in 2024 (Buoyant Enterprise for Linkerd); edge releases remain freely available.
- Cilium - CNCF graduated (2023). Originated at Isovalent (acquired by Cisco in 2024). Not just a service mesh - a CNI, NetworkPolicy engine, service mesh, and observability tool, all backed by eBPF. No sidecar. The most "rethink the stack" of the four; rapidly becoming the default for new clusters that want one project to own the network and security path.
- Kuma - CNCF sandbox. Originated at Kong. Envoy-based; multi-zone / multi-cluster as first-class concepts. Less production scale than the above three, but a real contender in multi-cloud, multi-cluster environments.
Vendor-led, open-source
- Consul Connect - HashiCorp. Service mesh capability built into Consul; strong story for mixed Kubernetes + VM + multi-cloud workloads where Consul is already the service catalog and KV store. Envoy-based.
- NGINX Service Mesh - F5/NGINX. NGINX-based data plane; integrates with NGINX Plus Ingress Controller.
- Traefik Mesh (formerly Maesh) - Traefik Labs. Different model: per-node mesh proxy, not per-pod sidecar. Simple to operate.
Managed and provider-integrated
- AWS App Mesh - Envoy-based managed mesh; works across EKS, ECS, EC2, and Fargate. Less feature-rich than vanilla Istio; tightly integrated with AWS networking and IAM.
- Cloud Service Mesh (the consolidated successor to Anthos Service Mesh) - Google's managed Istio. Both sidecar and ambient modes; cross-cluster federation across GKE, on-prem, and other clouds.
- Azure Service Mesh add-on for AKS - Microsoft retired Open Service Mesh (OSM) in 2024 and now ships managed Istio as the AKS add-on (Istio-based service mesh). Linkerd is also commonly run on AKS unmanaged.
- Red Hat OpenShift Service Mesh - productized Istio for OpenShift, with Maistra components on top.
- Solo.io Gloo Mesh - commercial Istio distribution and federation control plane; widely used in regulated enterprises.
- Buoyant Enterprise for Linkerd - commercial Linkerd from the maintainers; FIPS builds, longer-support stable channel.
Notable transitions
- Open Service Mesh (OSM) - Microsoft-led CNCF project; archived in 2024. The CNCF-graduated alternatives absorbed the use cases.
- AWS App Mesh - AWS has announced end of support for App Mesh on September 30, 2026, and points customers toward Amazon ECS Service Connect and Amazon VPC Lattice. Treat it as a migration source, not a target for new deployments.
mTLS - encryption everywhere
Mutual TLS is the most-cited single feature of a service mesh, and the easiest one to grasp. Every connection between two mesh-enrolled workloads is encrypted with TLS, and both sides present and verify certificates. The mesh control plane runs the certificate authority that issues these certs and rotates them on a tight schedule (every 24 hours is typical) - short-lived enough that revocation isn't even a meaningful operation.
What changes about your network when every hop is mTLS
- Eavesdropping in the cluster fails. A compromised node, a misconfigured CNI, or a packet capture from a sidecar process all return ciphertext, not credentials and PII.
- Spoofing by IP/DNS fails. An attacker who has redirected DNS or hijacked an IP can't impersonate a service because they don't hold the right certificate.
- "Encryption in transit" boxes light up. SOC 2, PCI DSS, HIPAA, FedRAMP, and most modern privacy regulations all want encrypted-in-transit assertions for sensitive data, including between internal services. The mesh produces this evidence trivially.
- Network-based debugging gets harder. tcpdump stops being useful at the application layer; you need to debug at the proxy admin endpoint or use Hubble/Kiali to see L7 context.
- Some connections to legacy systems break. Anything that expects plaintext on the wire - a legacy database client, a TCP-based health check - needs to be configured to bypass the mesh or upgraded to participate in it.
Permissive transition mode is mandatory
Every production mesh supports a mode where mTLS is offered but not required - the receiving side accepts both encrypted and plaintext connections. This exists precisely because flipping a real cluster to strict mTLS in one move is how you cause an outage. The pattern is: enable mesh-wide in permissive mode, observe telemetry until you can prove every connection is mTLS, then move namespace-by-namespace to strict. Anything else is courage masquerading as engineering.
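The permissive-then-strict rollout can be sketched as a small planning function. This is an illustration under assumptions, not Istio tooling: the manifests are built as plain Python dicts in the shape of Istio's PeerAuthentication resource (clusters on older Istio releases may need the v1beta1 API version), the root namespace is assumed to be istio-system, and the per-namespace plaintext counts are assumed to come from your own telemetry. The function names are hypothetical.

```python
import json

ROOT_NAMESPACE = "istio-system"  # assumption: Istio's default root namespace


def peer_authentication(namespace: str, mode: str) -> dict:
    """Build a PeerAuthentication manifest as a plain dict."""
    if mode not in {"PERMISSIVE", "STRICT"}:
        raise ValueError(f"unsupported mTLS mode: {mode}")
    return {
        "apiVersion": "security.istio.io/v1",  # older clusters: v1beta1
        "kind": "PeerAuthentication",
        "metadata": {"name": "default", "namespace": namespace},
        "spec": {"mtls": {"mode": mode}},
    }


def rollout_plan(plaintext_by_namespace: dict[str, int]) -> list[dict]:
    """Mesh-wide PERMISSIVE first, then STRICT only where no plaintext was seen."""
    plan = [peer_authentication(ROOT_NAMESPACE, "PERMISSIVE")]
    for namespace, plaintext_connections in sorted(plaintext_by_namespace.items()):
        if plaintext_connections == 0:
            plan.append(peer_authentication(namespace, "STRICT"))
    return plan


if __name__ == "__main__":
    # Hypothetical telemetry: plaintext connections observed per namespace.
    observed = {"payments": 0, "legacy-batch": 12}
    print(json.dumps(rollout_plan(observed), indent=2))
```

In this example only payments is moved to STRICT; legacy-batch stays under the mesh-wide permissive policy until its plaintext count reaches zero.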
Certificate-authority architecture
Where the mesh's CA roots are anchored matters for security and for compliance. Common patterns:
- Self-signed cluster-local CA. The mesh control plane generates its own root. Easiest; the audit conversation about "where does the trust come from" is harder.
- Cluster CA chained to a corporate intermediate. The mesh's intermediate is issued by your existing PKI. Aligns with corporate certificate governance and federation across clusters.
- External CA via cert-manager. Istio (and Linkerd) can be configured to issue from cert-manager, which can in turn be backed by Vault, AWS Private CA, or an internal PKI.
- SPIRE as the CA. SPIRE issues SPIFFE SVIDs that the mesh data plane uses directly. The cleanest architecture for cross-mesh federation, and the most demanding to operate.
Authentication & workload identity
"Authentication" in a mesh has two distinct meanings. Peer authentication is the workload-to-workload identity verified by mTLS - service A is provably service A because it presents the cert for service A's identity. Request authentication is the end-user (or upstream service) credential carried in the request, almost always a JWT, that the mesh validates at the proxy.
Workload identity (peer authentication)
Every mesh-enrolled workload gets a cryptographic identity. In SPIFFE-conformant meshes (Istio, Cilium, Consul), that identity is a SPIFFE URI like spiffe://cluster.local/ns/payments/sa/checkout - encoding the trust domain, namespace, and Kubernetes ServiceAccount. The certificate the workload presents includes this URI in its SAN extension; the receiving sidecar verifies the signature and reads off the identity, no IP address required.
Linkerd uses an analogous identity model bound to the ServiceAccount; the identity format differs (a DNS-style name rather than a SPIFFE URI) but the semantic is the same.
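To make the structure of that identity concrete, here is a minimal parsing sketch. It assumes the /ns/<namespace>/sa/<service-account> path layout shown in the example above, which is a Kubernetes mesh convention rather than something every SPIFFE ID must follow; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class WorkloadIdentity:
    trust_domain: str
    namespace: str
    service_account: str


def parse_spiffe_id(spiffe_id: str) -> WorkloadIdentity:
    """Split a SPIFFE ID of the form spiffe://<trust-domain>/ns/<ns>/sa/<sa>."""
    parts = urlsplit(spiffe_id)
    if parts.scheme != "spiffe" or not parts.netloc:
        raise ValueError(f"not a SPIFFE ID: {spiffe_id!r}")
    segments = parts.path.strip("/").split("/")
    well_formed = (
        len(segments) == 4
        and all(segments)
        and segments[0] == "ns"
        and segments[2] == "sa"
    )
    if not well_formed:
        raise ValueError(f"unexpected path layout: {parts.path!r}")
    return WorkloadIdentity(parts.netloc, segments[1], segments[3])


if __name__ == "__main__":
    print(parse_spiffe_id("spiffe://cluster.local/ns/payments/sa/checkout"))
```

Nothing in the parsed result depends on an IP address or port, which is the point of the model: policy can be written against the namespace and ServiceAccount instead.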
Request authentication (end-user / upstream JWT validation)
For HTTP requests, the mesh can additionally validate a JWT presented by the caller - checking signature, issuer, audience, and expiration, and exposing claims to authorization policy. This pushes authentication out of every application and into the data plane.
- Istio - the RequestAuthentication resource defines JWT issuers and JWKS endpoints; it is combined with an AuthorizationPolicy that references JWT claims.
- Cilium - L7 NetworkPolicy can match JWT-aware rules via Envoy filters; policy syntax is Cilium's native YAML.
- Linkerd - historically lighter on request-level features; JWT validation typically happens at the ingress (e.g., Linkerd's authorization policy with an ingress-level OIDC proxy).
- Consul - JWT auth via the API gateway or Consul-aware proxy filters.
Pushing JWT validation to the mesh has a non-obvious benefit: the application no longer holds the JWKS, doesn't rotate its own validation keys, and doesn't ship a JWT library in every language. The mesh does it once.
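The claim checks named above (issuer, audience, expiration) can be sketched as follows. This is not how any particular proxy implements them, and it deliberately leaves out signature verification against the JWKS, which must happen before any claim is trusted. The function name and the sample issuer URL are assumptions for illustration.

```python
import time
from typing import Any, Optional


def claim_failures(
    claims: dict[str, Any],
    issuer: str,
    audience: str,
    now: Optional[float] = None,
) -> list[str]:
    """Return the list of failed checks; an empty list means the claims pass.

    Assumes the token's signature has already been verified elsewhere.
    """
    now = time.time() if now is None else now
    failures = []
    if claims.get("iss") != issuer:
        failures.append("issuer mismatch")
    aud = claims.get("aud")
    # The aud claim may be a single string or a list of strings.
    audiences = [aud] if isinstance(aud, str) else list(aud or [])
    if audience not in audiences:
        failures.append("audience mismatch")
    exp = claims.get("exp")
    if not isinstance(exp, (int, float)) or now >= exp:
        failures.append("token expired or missing exp")
    return failures


if __name__ == "__main__":
    claims = {"iss": "https://idp.example.com", "aud": ["orders"], "exp": 2_000}
    print(claim_failures(claims, "https://idp.example.com", "orders", now=1_000))
    print(claim_failures(claims, "https://idp.example.com", "orders", now=3_000))
```

Doing this once in the data plane is what removes the need for a JWT library in each application language.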
The three pillars - encryption, identity, policy
The whole mesh security model rests on three things working together. Each is necessary; none is sufficient.
Encryption (mTLS)
Every byte on the wire is encrypted, every connection is mutually authenticated. Without it, identity assertions can be eavesdropped or replayed; with it, a node-level compromise doesn't immediately yield credentials and PII. Cheap and necessary - there is no good reason in 2026 not to have it.
Identity (SPIFFE / workload identity)
Every workload presents a cryptographic identity, not a network position. Without it, the policy engine has only IPs and ports to reason about - useless in a Kubernetes cluster where both change constantly. With it, policy follows the workload.
Policy (authorization)
Explicit allow / deny decisions on every call, based on identity and request content. Without it, encryption protects the channel but does not prevent a compromised workload from calling anything it can reach. With all three, lateral movement is bounded by the policy.
What breaks if you skip one:
- Encryption without identity - TLS that doesn't verify the peer is just an obfuscated channel. The breach reads exactly the same; the auditor finding is slightly different. Don't ship this.
- Identity without policy - every service knows who is calling, but every service is also still allowed to be called by anyone. Common starting state, common stopping state. The identity is doing nothing useful without policy enforcing it.
- Policy without identity - IP-based authorization in a cluster where pods churn. Either too permissive (broad CIDRs) or constantly breaking. The reason NetworkPolicy alone is hard.
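How identity and policy combine can be shown with a deny-by-default check. A minimal sketch, assuming rules keyed on caller and callee SPIFFE IDs plus HTTP method; the Rule type and the identities are hypothetical and do not correspond to any mesh's policy schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    source: str               # caller's verified identity
    destination: str          # callee's identity
    methods: frozenset[str]   # HTTP methods the caller may use


def is_allowed(rules: list[Rule], source: str, destination: str, method: str) -> bool:
    """Deny by default: a call passes only if some rule matches it exactly."""
    return any(
        rule.source == source
        and rule.destination == destination
        and method in rule.methods
        for rule in rules
    )


if __name__ == "__main__":
    orders = "spiffe://cluster.local/ns/shop/sa/orders"
    payments = "spiffe://cluster.local/ns/payments/sa/checkout"
    rules = [Rule(orders, payments, frozenset({"POST"}))]
    print(is_allowed(rules, orders, payments, "POST"))    # matching rule exists
    print(is_allowed(rules, orders, payments, "DELETE"))  # no rule for this method
```

The check is only meaningful because the source value comes from a verified certificate; fed with IP addresses instead, the same function reproduces the "policy without identity" failure described above.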
Observability for security
The mesh is also one of the best places in the stack to see east-west traffic. Every connection generates structured signal - golden-signal metrics (request rate, error rate, latency), distributed-trace spans, and connection-level logs - without instrumenting application code.
Tools that consume mesh telemetry
- Jaeger and Tempo - distributed-trace backends. The mesh emits spans for every hop; Jaeger / Tempo reconstruct the call graph.
- Prometheus - every mesh emits Prometheus-format metrics out of the box. The standard dashboards for request rate, error rate, latency, and TLS-cert expiry.
- Kiali - Istio-specific observability UI; visualizes the service graph, traffic policies in effect, and certificate health.
- Hubble - Cilium's observability companion. Real-time flow logs (including encrypted-flow metadata), policy decisions, and a service-map UI.
- Linkerd Viz - Linkerd's bundled dashboards; tap-style live request inspection.
What security teams use the telemetry for
- Anomalous east-west traffic. A service that suddenly starts calling a service it has never called before; an unexpected geo or namespace appearing in the source identity list; a request rate spike from a single identity. SIEM rules over Prometheus exemplars or Hubble flows light these up.
- Policy violation alerts. Every DENY decision is logged with identity, source pod, target service, and HTTP method. A burst of denies is either a misconfiguration or an active attempt.
- Certificate expiry monitoring. A mesh whose certs are rotating correctly is invisible; one that isn't fails dramatically. Surface the "% of workloads with cert age greater than half the rotation interval" metric.
- Lateral movement detection. Compare the actual service graph (what's calling what right now) against the expected service graph (what should be calling what per the architecture). Drift is signal.
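The expected-versus-actual comparison in the last item reduces to set differences over call edges. A minimal sketch, assuming each edge is a (caller, callee) pair of service names that you have already extracted from mesh telemetry; the function name and sample services are hypothetical.

```python
Edge = tuple[str, str]  # (caller, callee)


def graph_drift(expected: set[Edge], observed: set[Edge]) -> dict[str, set[Edge]]:
    """Compare the observed service graph against the expected one."""
    return {
        # Calls seen in telemetry that the architecture does not describe.
        "unexpected": observed - expected,
        # Declared calls that never appear; candidates for tightening policy.
        "unused": expected - observed,
    }


if __name__ == "__main__":
    expected = {("orders", "payments"), ("orders", "inventory")}
    observed = {("orders", "payments"), ("reports", "payments")}
    drift = graph_drift(expected, observed)
    print("unexpected:", sorted(drift["unexpected"]))
    print("unused:", sorted(drift["unused"]))
```

Unexpected edges are the ones to alert on; unused edges are lower priority but show where an allow rule could be removed.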
For the broader detection question, see detection engineering; for SOC workflows, cloud SOC.
Zero-trust east-west
The service mesh is the most direct implementation of zero-trust principles inside a Kubernetes cluster. The five common phrasings of zero trust map cleanly to mesh primitives:
| Zero-trust principle | How the mesh implements it |
|---|---|
| Never trust, always verify | Every call presents and verifies an mTLS certificate. Network position confers no trust. |
| Verify explicitly | Workload identity (SPIFFE SVID) plus, where applicable, request-level JWT. Two layers of explicit verification. |
| Least privilege | AuthorizationPolicy / ServerAuthorization / L7 NetworkPolicy restrict each service to the calls it actually needs. |
| Assume breach | Continuous mTLS, short-lived certs, deny-by-default policy. A breached pod's blast radius is bounded by policy, not by network reachability. |
| Continuous validation | Every call is re-evaluated against policy. Identity rotates on a 24-hour-or-less cadence. |
What the mesh doesn't do for zero trust: it doesn't reach the user-to-service edge (that's the API gateway and identity-provider layer), it doesn't reach the host kernel (that's eBPF, runtime security tools, and CNI), and it doesn't reach the data store (that's database-level access controls and KMS). Zero trust is end-to-end; the mesh is the east-west chapter.
Service mesh vs API gateway
The two get confused because both are proxies, often Envoy-based, that handle authentication, authorization, and observability - they just sit at different layers.
| Dimension | API gateway (north-south) | Service mesh (east-west) |
|---|---|---|
| Traffic direction | External clients into the cluster | Service to service inside the cluster |
| Caller identity | End users, partner APIs, mobile apps - usually JWTs / API keys | Workloads - SPIFFE / mesh-issued identity |
| Primary auth | OAuth 2 / OIDC, API keys, JWTs from the IdP | mTLS workload identity; secondary JWT validation |
| Typical concerns | Rate limiting, WAF, API key management, transformation, schema validation | Retry / circuit-break, traffic split, observability, deny lateral movement |
| Topology | Centralized at the edge - a few gateway pods | Distributed - a proxy at every pod (sidecar) or every node (sidecarless) |
| Examples | Kong, Apigee, Tyk, AWS API Gateway, Azure APIM, Envoy Gateway, Istio Gateway | Istio, Linkerd, Cilium, Consul Connect |
The two often overlap at the cluster boundary. Istio's Gateway API integration uses the same control plane to operate the ingress gateway and the mesh; Kong's Mesh product is a Kuma-derived mesh; Envoy Gateway (the CNCF project) and the various meshes share their data-plane lineage. Pick the one(s) whose strengths match what you actually need at each layer.
Service mesh vs Kubernetes NetworkPolicy
Both restrict service-to-service traffic; they operate at different layers and complement each other rather than substitute.
- NetworkPolicy is L3/L4 (IP, port, namespace, pod label). Enforced by the CNI (Calico, Cilium, others). Allows or denies the TCP connection; doesn't inspect HTTP method or path; doesn't authenticate the caller.
- Service mesh policy is L7 (identity, HTTP method, path, header, JWT claim). Enforced by the data plane. Operates after the TCP connection has already been allowed.
The right pattern is layered:
- NetworkPolicy as the floor. Default-deny network policy at every namespace; only allow the egress and ingress the namespace actually needs. Limits the blast radius of any sidecar or mesh-control-plane compromise.
- Mesh policy as the L7 ceiling. Within the allowed L3/L4 paths, the mesh enforces method-level and identity-level rules.
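The "floor" in this pattern is a standard Kubernetes default-deny NetworkPolicy. The sketch below builds one as a plain Python dict and prints it as JSON, which kubectl also accepts; the policy name and the namespace are placeholders chosen for illustration.

```python
import json


def default_deny(namespace: str) -> dict:
    """Build a NetworkPolicy that selects every pod and allows no traffic."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            "podSelector": {},                    # empty selector = all pods
            "policyTypes": ["Ingress", "Egress"],  # no rules listed = deny both
        },
    }


if __name__ == "__main__":
    print(json.dumps(default_deny("payments"), indent=2))
```

A policy like this also blocks DNS lookups and traffic to the mesh control plane, so in practice it is paired with explicit allow rules for those paths before the mesh's L7 rules come into play.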
Cilium L7 NetworkPolicy is the most interesting cross-over: it extends the NetworkPolicy API itself to L7 semantics, so a single policy resource specifies both "this service may connect" and "only via these HTTP routes". The bridge between the two paradigms - useful in clusters where one team owns both layers.
For the wider topic see Kubernetes security and network security.
Multi-cluster mesh
One cluster is the easy case. Most real production environments end up with multiple clusters - per environment, per region, per blast-radius boundary, per acquisition. A multi-cluster mesh extends identity, encryption, and policy across those clusters so a service in cluster A can call a service in cluster B with the same primitives it would use locally.
The common architectures
- Istio multi-cluster - two control-plane models, primary-remote (one control plane serving remote clusters) and multi-primary (a control plane per cluster, sharing a root CA), each deployable on a single network or across multiple networks connected by east-west gateways. All variants rely on a shared trust domain or federated trust domains via SPIFFE.
- Linkerd multi-cluster - explicit "mirroring" model. Services in cluster A are mirrored as virtual services in cluster B; traffic to the mirror goes through a gateway. Simpler model, fewer modes.
- Cilium ClusterMesh - shared cluster identity and global service abstraction; pods in different clusters appear in each other's service discovery and can connect with mTLS and policy enforced uniformly. Works particularly well with Cilium's eBPF data plane.
- Consul WAN federation - long-standing multi-datacenter pattern; mesh gateways forward traffic between Consul-managed clusters / datacenters.
What's hard about it
- Shared trust. All clusters must trust a common root CA (or federate via SPIFFE). The PKI conversation gets real.
- Service-discovery semantics. A name that resolves to one service in cluster A and a different one in cluster B is the surface area where bugs hide.
- Failure-mode reasoning. What happens when the link between clusters partitions? Each mesh has answers; they're not the same answers.
- Cost. East-west traffic is now in some cases cross-region or cross-cloud, with egress charges to match. Multi-cluster mesh without traffic-locality awareness can quietly become expensive.
Egress control via the mesh
Most cluster security stories focus on what comes in. Egress - what leaves the cluster - is just as important, and the mesh is a useful enforcement point for it.
Why mesh-managed egress
Without an egress policy, any compromised workload can reach the internet, exfiltrate data, and pull down second-stage payloads. Network egress restrictions at the firewall help, but they're coarse-grained (CIDR-based) and not workload-aware. The mesh adds identity-aware egress: "the payments service may call api.stripe.com:443 over HTTPS; nothing else may."
The mechanisms
- Istio egress gateway. A dedicated set of pods that all egress traffic routes through, with policies governing which workloads may reach which external destinations and under what TLS configuration. Provides logging, mTLS termination toward external SaaS, and a single audit point.
- Cilium egress NAT gateway. Pin egress traffic from selected workloads to a specific source IP so external firewalls / SaaS allowlists can be tightened. Combined with FQDN-aware policies that allow specific external hostnames.
- Linkerd - handles external HTTPS naturally via its policy model; egress filtering typically layered with NetworkPolicy and a separate egress proxy.
- Consul terminating gateway. The Consul-side equivalent of the egress gateway; bridges to external services not on the mesh.
Common egress controls
- Allow only the FQDNs the service has declared in its config.
- Force TLS termination at the egress gateway so the security team can attest to TLS configuration.
- Log every egress connection with workload identity, destination FQDN, bytes transferred - feed to SIEM.
- Block egress to known-bad indicators from threat intel feeds.
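The first and third controls (FQDN allow-listing and per-connection logging) can be combined in one decision function. A minimal sketch, assuming an allow-list keyed by workload identity and an audit record shaped for a SIEM; the field names, the function name, and the sample identities are hypothetical rather than any gateway's actual log format.

```python
import json

EgressTarget = tuple[str, int]  # (lower-case FQDN, port)


def egress_record(
    allowlist: dict[str, set[EgressTarget]],
    workload: str,
    host: str,
    port: int,
    bytes_sent: int,
) -> dict:
    """Decide one egress connection and return the audit record for it."""
    target = (host.lower(), port)
    return {
        "workload": workload,
        "destination": f"{target[0]}:{port}",
        "bytes": bytes_sent,
        # Deny by default: unknown workloads have an empty allow-list.
        "allowed": target in allowlist.get(workload, set()),
    }


if __name__ == "__main__":
    checkout = "spiffe://cluster.local/ns/payments/sa/checkout"
    allowlist = {checkout: {("api.stripe.com", 443)}}
    print(json.dumps(egress_record(allowlist, checkout, "api.stripe.com", 443, 2048)))
    print(json.dumps(egress_record(allowlist, checkout, "example.net", 443, 0)))
```

Emitting the record for denied connections as well as allowed ones is a deliberate choice here: the denies are usually the more interesting signal.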
See also CSPM / CNAPP for the wider data-exfiltration detection story.
Sidecar attack surface
The mesh is infrastructure; like all infrastructure, it is itself a target. Treating the data plane as automatically secure because it is "the security layer" misses several real risks.
Envoy CVEs
Envoy is one of the most widely deployed proxies in the world, and like any complex C++ codebase it ships CVEs. The Envoy security advisories page publishes them; the cadence is monthly-to-quarterly with occasional critical fixes. Every Istio, Cilium, Consul, and App Mesh release ships a specific Envoy version; staying current is part of the operational tax of running the mesh. Linkerd's Rust micro-proxy has a meaningfully smaller CVE surface but is not immune - Rust eliminates whole bug classes, not all bugs.
Sidecar config injection
If an attacker can edit the EnvoyFilter (Istio) or equivalent resource in your cluster, they can rewrite parts of the data plane - redirect traffic, downgrade TLS, exfiltrate logs. Treat EnvoyFilter and similar resources as production-critical RBAC subjects; review every change.
Admin-port exposure
Every Envoy sidecar exposes an admin interface on a local port (15000 in Istio) that lets the operator inspect configuration and change runtime state. By default it's bound only to localhost in the pod, but historical misconfigurations have exposed it cluster-wide. The admin port should never be reachable from outside the pod's network namespace.
Resource exhaustion
A sidecar without resource limits can be DoSed by an attacker who can generate enough traffic to or from the pod. Set memory and CPU limits on every sidecar; alert on sustained near-limit utilization.
Control-plane compromise
If istiod (or the Linkerd / Cilium / Consul control plane) is compromised, every certificate in the mesh becomes untrustworthy. RBAC the control plane like the secret it administers; isolate it to a hardened namespace with deny-by-default NetworkPolicy; rotate its root CA on a defined cadence; alert on every API change to it.
Sidecar lifecycle bugs
Before native sidecar support, sidecars had no formal lifecycle relationship with the application container - they started in arbitrary order and could outlive the app, causing the "job pod that never terminates" class of bugs. Kubernetes native sidecars (KEP-753; beta and enabled by default in 1.29, GA in 1.33) finally fixed this, but old clusters may still hit it.
Ambient mode & sidecarless
The biggest architectural shift in service mesh in 2024-2026 is the move away from per-pod sidecars. The reasons are operational: thousands of sidecars consume tens or hundreds of GB of memory, every Envoy CVE means rolling everything, and the per-pod attribution of cost / latency / failures is a real burden.
Istio Ambient
Two-tier data plane:
- ztunnel - a per-node L4 proxy (one DaemonSet pod per node) that handles mTLS, identity, and L4 authorization for every pod on the node. Written in Rust; HBONE (HTTP-Based Overlay Network Environment) protocol for inter-node transport.
- Waypoint proxy - optional, per-namespace or per-service-account Envoy that handles L7 features (HTTP-level authorization, JWT validation, request transformation). Deployed only where the L7 features are actually needed.
Beta in Istio 1.22 (May 2024), GA in Istio 1.24 (November 2024), and steadily maturing through 2025-2026. The migration story from sidecar to ambient is supported by the Istio operator; ambient and sidecar pods can coexist during transition.
Cilium Service Mesh (sidecarless from the start)
Cilium's mesh has never used sidecars. The L4 path (mTLS, identity, encryption, NetworkPolicy enforcement) runs in eBPF programs attached to network hooks in the Linux kernel; there is no proxy in the pod. L7 features (HTTP-level routing, authorization with HTTP-aware predicates) run in a per-node Envoy that Cilium manages. The result is one of the leanest data planes in the ecosystem - most of the work happens in the kernel, on the packet, without a userspace proxy round-trip.
What sidecarless costs you
- Different blast-radius model. A sidecar fails per pod; a node-level proxy fails per node. Different failure modes need different SLOs.
- Smaller per-pod attribution. "Which sidecar is using all the memory?" no longer makes sense; "which node-level proxy is hot?" replaces it.
- Different troubleshooting. kubectl exec into the sidecar to dump Envoy stats no longer applies; you query the node-level proxy or the eBPF flow store.
- Some advanced features still trail. A handful of Istio sidecar features take a release or two longer to ship in ambient mode; the same is true for any Cilium feature that requires L7 Envoy involvement.
The direction is clear: by the end of 2026, new mesh deployments default to ambient or sidecarless unless they have a specific need that the sidecar pattern meets better. Existing sidecar deployments are migrating at the pace of their willingness to validate the new failure modes.
SPIFFE / SPIRE deep-dive
SPIFFE (Secure Production Identity Framework For Everyone) is the open standard for workload identity, hosted by CNCF (graduated). SPIRE is the reference implementation. If you only learn one cross-mesh concept, this is the one - every modern mesh's identity model is SPIFFE-aligned in spirit if not always in name.
The core concepts
- Trust domain. A namespace for identities; typically one per organization or one per high-level boundary (per cloud, per cluster of clusters). Written as spiffe://example.org.
- SPIFFE ID. A URI that identifies a workload: spiffe://example.org/ns/payments/sa/checkout. Encodes the trust domain and a hierarchical path.
- SVID (SPIFFE Verifiable Identity Document). The cryptographic proof of identity. Two formats: X.509-SVID (a cert with the SPIFFE ID in the SAN) and JWT-SVID (a JWT with the SPIFFE ID as the sub claim). Meshes use X.509-SVIDs for mTLS; JWT-SVIDs are used for token-style authentication.
- Trust bundle. The set of public keys / certificates that verify SVIDs from a given trust domain. Exchanged between trust domains to enable federation.
- Workload API. The interface a workload uses to fetch its SVID. Implemented as a Unix-domain socket: the workload requests its identity, and the SPIRE Agent (or a mesh data plane acting as one) attests the workload and delivers the SVID.
- Attestation. The process of verifying "is this really the workload it claims to be?" before issuing an SVID. SPIRE supports many attestors: Kubernetes (ServiceAccount + pod label + node identity), AWS (IAM role, instance metadata), GCP (instance identity), Docker (container properties), and more.
Federation
Two organizations (or two clusters with different roots) can federate their trust domains - each trusts the other's CA, identities from one are accepted by services in the other. This is the unlock for cross-cloud and cross-organization workload identity: a service in your trust domain can authenticate to a service in your partner's trust domain without either side having to share secrets or terminate at an API gateway.
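One way to picture federation is as a lookup: the receiver picks the trust bundle that matches the peer's trust domain, and rejects the peer outright if no such bundle has been exchanged. The sketch below shows only that selection step; verifying the SVID's signature against the chosen bundle is out of scope, the bundle values are opaque placeholders, and the function name and domains are hypothetical.

```python
from typing import Optional
from urllib.parse import urlsplit


def select_bundle(peer_spiffe_id: str, bundles: dict[str, str]) -> Optional[str]:
    """Return the trust bundle for the peer's trust domain, or None if untrusted."""
    parts = urlsplit(peer_spiffe_id)
    if parts.scheme != "spiffe" or not parts.netloc:
        return None
    # The trust domain is the authority component of the SPIFFE ID.
    return bundles.get(parts.netloc)


if __name__ == "__main__":
    bundles = {
        "example.org": "<local bundle>",        # our own trust domain
        "partner.example": "<partner bundle>",  # exchanged via federation
    }
    print(select_bundle("spiffe://partner.example/ns/billing/sa/invoicer", bundles))
    print(select_bundle("spiffe://unknown.example/ns/x/sa/y", bundles))
```

Removing the partner's entry from the bundle map is all it takes to end the federation from the receiving side, with no shared secrets to revoke.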
SPIFFE in each mesh
- Istio - native SPIFFE identities; certs include SPIFFE URIs in the SAN. Trust domain configurable. SPIRE integration documented for orgs that want SPIRE as the CA instead of istiod's built-in CA.
- Consul - SPIFFE-based identities since Connect launched; explicitly designed around the standard.
- Cilium - added SPIFFE identity support in 1.14 (2023); workloads can have SPIFFE SVIDs for mTLS with SPIRE as the issuer.
- Linkerd - uses an analogous identity model bound to the Kubernetes ServiceAccount; not strictly SPIFFE-formatted by default, though the conceptual model is the same.
For a cluster running just one mesh, SPIFFE may feel like a detail. For an environment with multiple meshes, multiple clusters, or partners - SPIFFE is the standard that lets workload identity travel.
Istio vs Linkerd vs Cilium vs Consul Connect
The four meshes that most production teams actually choose between, on the dimensions that matter:
| Dimension | Istio | Linkerd | Cilium | Consul Connect |
|---|---|---|---|---|
| CNCF status | Graduated (2023) | Graduated (2021) | Graduated (2023) | Not CNCF; HashiCorp |
| Data plane | Envoy (sidecar or ambient) | linkerd2-proxy (Rust, sidecar) | eBPF + Envoy (no sidecar) | Envoy (sidecar) |
| mTLS | Built-in; PERMISSIVE / STRICT modes | On by default; trivial to configure | Built-in; per-policy | Built-in via Connect |
| Identity | SPIFFE (native) | ServiceAccount-bound (SPIFFE-conceptual) | Label-derived; SPIFFE-capable | SPIFFE (native) |
| Policy language | AuthorizationPolicy (rich, L7, JWT) | Server / ServerAuthorization (simple) | CiliumNetworkPolicy (L3-L7, Kubernetes CRD) | Intentions + API gateway policies |
| Sidecar vs sidecarless | Both (Ambient Beta at 1.22, GA at 1.24) | Sidecar (Rust micro-proxy) | Sidecarless from day one | Sidecar |
| Multi-cluster | Primary-remote, multi-primary, multi-network | Mirroring model | ClusterMesh | WAN federation across datacenters |
| Beyond Kubernetes | VMs supported, GKE / on-prem federation | Kubernetes-focused | Kubernetes-focused | First-class VM and bare-metal support |
| Operational complexity | High (rich features, many knobs) | Low (deliberately small surface) | Medium (one big project replaces several) | Medium-high (Consul itself is the dependency) |
| Best fit | Large platforms with dedicated mesh team | Teams that want mTLS / policy fast, minimal ops tax | Teams replacing CNI + NetPol + mesh together | Mixed Kubernetes + VM, Consul-shop environments |
The honest choice criteria:
- If you don't yet know what you need - pick Linkerd. Smallest surface, hardest to misconfigure.
- If you're replacing more than one networking layer - pick Cilium. The whole-stack consolidation is the value.
- If you need rich L7 features, multi-cluster federation, or already have a platform team - pick Istio. In ambient mode, the historical operational tax is much lower.
- If you have significant VM and bare-metal alongside Kubernetes - pick Consul. It's the only one that treats VMs as first-class.
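As a rough illustration, the four criteria above can be written as a small Python function. The function name, the flag names, and especially the order in which the checks are applied are assumptions made for this sketch; the list itself does not rank the criteria against each other.

```python
def suggest_mesh(
    *,
    significant_vms: bool = False,
    replacing_multiple_network_layers: bool = False,
    needs_rich_l7_or_multicluster: bool = False,
    has_platform_team: bool = False,
) -> str:
    """Map the choice criteria to a starting suggestion (precedence is an assumption)."""
    if significant_vms:
        return "Consul"
    if replacing_multiple_network_layers:
        return "Cilium"
    if needs_rich_l7_or_multicluster or has_platform_team:
        return "Istio"
    # Default when requirements are still unknown: smallest surface.
    return "Linkerd"


if __name__ == "__main__":
    print(suggest_mesh())
    print(suggest_mesh(needs_rich_l7_or_multicluster=True))
```

Treat the result as a conversation starter for a pilot, not a verdict; real selections weigh several of these factors at once.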
AWS / Azure / GCP managed mesh offerings
The hyperscaler managed-mesh story has consolidated significantly since 2024. The current state:
| Capability | AWS | Azure | GCP |
|---|---|---|---|
| Primary offering | AWS App Mesh; Istio via EKS add-ons | Managed Istio add-on for AKS | Cloud Service Mesh (managed Istio, both sidecar and ambient) |
| Underlying data plane | Envoy | Envoy | Envoy |
| Identity binding | App Mesh integrates with AWS IAM via SDS | Entra workload identity via federated tokens | Workload Identity Federation; SPIFFE-native |
| Multi-cluster / multi-cloud | Mesh per cluster; multi-cluster requires unmanaged Istio | Managed Istio multi-cluster on roadmap; commonly run unmanaged for federation | Cross-cluster federation across GKE, on-prem, other clouds (Anthos lineage) |
| Retired predecessors | App Mesh itself: end of support announced for September 30, 2026 | Open Service Mesh (OSM) retired 2024 | Anthos Service Mesh consolidated into Cloud Service Mesh |
| Pricing model | No additional charge for App Mesh; pay for Envoy compute | Add-on free; pay for AKS compute | Per-Envoy / per-call pricing on managed plane |
| Documentation | App Mesh docs | AKS Istio add-on | Cloud Service Mesh docs |
The trend is unambiguous: the hyperscalers have largely converged on managed Istio (with Envoy underneath) as the right answer, and the proprietary offerings (App Mesh, OSM) are either de-emphasized or retired. Cilium is widely run unmanaged on all three clouds and is the basis for some managed CNI-plus-mesh offerings (GKE Dataplane V2, EKS Cilium add-on, AKS Advanced Container Networking Services).
Maturity stages
A staging model for mesh adoption in a real organization:
Stage 1 - Pilot
One mesh chosen, deployed in one non-production cluster. A handful of services enrolled. mTLS in permissive mode. The team learns the failure modes, the upgrade story, the observability story. Three to six months at this stage is normal.
Stage 2 - Production
Mesh running in the production cluster(s). Permissive mTLS cluster-wide; strict mTLS in selected namespaces. Basic AuthorizationPolicy (deny-by-default at the namespace level for new services). Dashboards live; on-call runbooks for mesh failures exist. The team has hit, and survived, at least one mesh-related incident.
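For an Istio-based mesh, the Stage 2 posture might look like the manifests generated by the following Python sketch: a mesh-wide PERMISSIVE PeerAuthentication, a STRICT override in one namespace, and an empty AuthorizationPolicy that denies everything in that namespace until explicit allow rules are added. The payments namespace, the istio-system root namespace, and the helper names are assumptions for illustration; the security.istio.io/v1 API version assumes a recent Istio release (older releases serve v1beta1).

```python
import json


def peer_authentication(namespace: str, mode: str) -> dict:
    """Build an Istio PeerAuthentication that sets the mTLS mode for a namespace."""
    if mode not in {"PERMISSIVE", "STRICT"}:
        raise ValueError(f"unsupported mTLS mode for this sketch: {mode!r}")
    return {
        "apiVersion": "security.istio.io/v1",
        "kind": "PeerAuthentication",
        "metadata": {"name": "default", "namespace": namespace},
        "spec": {"mtls": {"mode": mode}},
    }


def allow_nothing(namespace: str) -> dict:
    """Build an AuthorizationPolicy with an empty spec, which denies all requests."""
    return {
        "apiVersion": "security.istio.io/v1",
        "kind": "AuthorizationPolicy",
        "metadata": {"name": "allow-nothing", "namespace": namespace},
        "spec": {},
    }


# Assumed layout: "istio-system" is the mesh root namespace, "payments" is
# one namespace that has already been verified as fully mTLS.
manifests = [
    peer_authentication("istio-system", "PERMISSIVE"),  # mesh-wide default
    peer_authentication("payments", "STRICT"),          # selected namespace
    allow_nothing("payments"),                          # deny-by-default baseline
]

if __name__ == "__main__":
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": manifests}, indent=2))
```

The JSON output can be reviewed and applied with kubectl like any other manifest. Other meshes express the same posture with their own resources, so read this as one concrete instance rather than a universal recipe.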
Stage 3 - Default
Strict mTLS cluster-wide. Every service has an AuthorizationPolicy. JWT validation at the mesh for end-user-bearing services. Egress gateway operational. Multi-cluster mesh either deployed or actively planned. Sidecar / sidecarless decision made on basis of operational fit.
Stage 4 - Platform
Mesh is part of the platform team's product; new services get mesh enrollment automatically via templates / golden-path tooling. SPIFFE-based federation either live or evaluated for partner integration. The mesh is invisible to most developers because it just works. Observability and policy are the abstractions developers see; the mesh is a load-bearing implementation detail.
The skip-stage cost is real: an org that goes straight from Stage 1 pilot to Stage 3 default usually finds the operational failures it didn't learn at Stage 2 - and they happen in production.
Common pitfalls
- Enabling strict mTLS without a permissive transition. The most common production outage. Strict mTLS rejects every non-mesh client; CI runners, legacy probes, third-party integrations all break simultaneously. Always: PERMISSIVE first, validate via telemetry, then STRICT namespace by namespace.
- Ignoring proxy CVEs. Envoy and the mesh control planes ship CVEs on an ongoing basis. Treat the mesh upgrade cadence as a security obligation, not a feature backlog item. Subscribe to the project security advisories.
- No policy at all. Encryption without authorization is a partial answer. A recurring theme in breach demos from 2020-2024 is "the mTLS was on; the attacker just took the path the policy hadn't restricted". Configure deny-by-default policy at the namespace level once you have basic mTLS coverage.
- Forgetting egress. A mesh that perfectly governs east-west traffic but allows any pod to curl the internet has not closed the lateral-movement loop. Egress gateways or FQDN-aware egress policy belong in every production mesh.
- Mesh-only ACLs instead of layered defense. If the mesh control plane is the only thing standing between a compromised workload and the rest of the cluster, control-plane compromise is total. Layer with NetworkPolicy, namespace isolation, and host-level controls.
- Choosing the wrong mesh for the team. Istio on a team without dedicated platform engineering tends to fail; Linkerd on a team that needs rich L7 features will outgrow the mesh quickly. Match the mesh to the team, not to the slide.
- Per-pod sidecar resource cost not budgeted. 100 MB × 5,000 pods is 500 GB of RAM dedicated to proxies. Either budget for it, or move to ambient / sidecarless.
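- Sidecar lifecycle bugs on older Kubernetes. Before native sidecars (KEP-753, enabled by default from Kubernetes 1.29), Job pods can hang forever because the sidecar proxy keeps running after the main container has finished, so the pod never completes. Upgrade Kubernetes and adopt native sidecars.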
- Confusing identity with authorization. The mesh telling you "this is workload X" is identity; the mesh deciding whether X is allowed to call Y is authorization. Both must be configured. Most "we have a service mesh" claims stop at identity.
- Skipping observability. A mesh whose telemetry isn't piped to Prometheus, Jaeger / Tempo, Hubble, Kiali, or some equivalent is a mesh that will fail silently. The observability layer is the operational SLI / SLO surface for the mesh; if it's missing, you'll only notice problems when customers do.
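The sidecar budget arithmetic from the list above is worth running against your own numbers before committing to a data plane. A minimal Python sketch follows; the 64 GB node size is a hypothetical figure, and the calculation uses decimal units (1 GB = 1,000 MB) to match the 500 GB figure quoted above.

```python
import math


def sidecar_memory_gb(pods: int, mb_per_sidecar: float) -> float:
    """Total proxy memory in decimal GB for one sidecar per pod."""
    return pods * mb_per_sidecar / 1000


def nodes_consumed(total_gb: float, node_memory_gb: float) -> int:
    """Whole nodes' worth of RAM taken up by proxies alone."""
    return math.ceil(total_gb / node_memory_gb)


if __name__ == "__main__":
    total = sidecar_memory_gb(pods=5000, mb_per_sidecar=100)
    print(f"{total:.0f} GB of proxy memory")
    # Hypothetical 64 GB nodes: how many are effectively reserved for sidecars?
    print(nodes_consumed(total, node_memory_gb=64), "nodes")
```

Actual per-proxy memory varies with configuration size and traffic, so measure it in the pilot cluster rather than relying on a round number.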
Further reading
Projects & specs
- Istio documentation
- Linkerd documentation
- Cilium documentation
- Consul Connect documentation
- Kuma documentation
- SPIFFE / SPIRE documentation
- Envoy proxy documentation
- Kubernetes Gateway API
Practitioner background
- CNCF landscape - service mesh
- Service Mesh Comparison (community)
- Envoy security advisories
- Buoyant - service mesh research
Related CSOH pages
- Kubernetes security - the platform the mesh runs on.
- Network security - the L3/L4 layer beneath the mesh.
- Zero trust - the philosophy the mesh implements east-west.
- API security - the north-south complement at the gateway.
- IAM & identity - where workload identity ties to cloud IAM.
- Containers - the workload layer the mesh secures.
- Detection engineering - turning mesh telemetry into alerts.
- Glossary - every term on this page, defined.
FAQ
Do I need a service mesh?
Probably not on day one, and possibly never. A mesh earns its operational cost when you have enough services that pinning identity, encrypting east-west traffic, applying L7 authorization, and observing service-to-service behavior all become important at the same time. Below ten or so services that talk to each other, the same outcomes are cheaper to reach with NetworkPolicy, application-level mTLS via SDK or cert-manager, and per-service tracing. Once you have fifty or more services, multi-team ownership, and any meaningful regulatory requirement around encryption-in-transit between services, the math flips. Pilot one before you commit, pick the smallest mesh that fits, and budget for the operational tax it adds.
Istio vs Linkerd - which should I pick?
Linkerd if you want the smallest, fastest path to mTLS and basic policy with the least operational surface. Istio if you need everything - rich L7 traffic management, JWT validation, sophisticated authorization, multi-cluster federation, the ambient data plane option. Both are CNCF graduated; pick on operational appetite, not on feature checklist marketing.
Is sidecar-less ready for production?
Yes, with caveats. Cilium Service Mesh has been in production at significant scale for several years and is GA. Istio ambient mode reached Beta in 1.22, was declared GA in 1.24, and continues to mature; many teams have moved production workloads to it specifically to escape sidecar operational tax. The caveats: ambient's L7 features require a separate waypoint proxy (typically one per namespace); some advanced Istio features still lag in ambient; the failure modes are different from sidecar (node-level rather than pod-level blast radius).
Does Cilium replace my mesh AND my CNI?
Yes, and that's the pitch. Cilium is a CNI, a NetworkPolicy engine with L7 enforcement, a service mesh (mTLS, identity, L7 routing, no sidecar), and a deep observability tool (Hubble) - one eBPF data plane underneath all of it. The trade-off is concentration: one project owns more of the stack than was traditionally true. The upside is fewer moving parts. For most new clusters in 2026, the simplification is worth it.
Service mesh vs API gateway - when do I need which?
API gateway lives at the edge of the cluster (north-south traffic). Service mesh lives inside the cluster (east-west traffic). The right question is rarely "which one" - it's "what concerns live at what layer", and the answer usually involves both. See the comparison above.
What is SPIFFE/SPIRE and why does it matter?
SPIFFE is the open standard for workload identity - every workload gets a cryptographically verifiable identity instead of relying on a shared secret or a network position. SPIRE is the reference implementation. Every modern mesh's mTLS is grounded in a workload-identity model, and SPIFFE makes those identities portable across mesh implementations, cloud providers, and even organizations (federation). See the deep-dive above.
Doesn't strict mTLS break legacy services?
Yes, which is why every production-grade mesh ships a permissive transition mode. Enable mesh-wide PERMISSIVE; verify via telemetry that all traffic into a namespace is using mTLS; flip to STRICT namespace by namespace once each is clean. Skipping the permissive step is one of the most common causes of mesh outages.
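The "verify via telemetry, then flip" step can be sketched as a simple gate in Python. The NamespaceTraffic record and the sample counts are hypothetical; in practice the numbers would come from the mesh's own metrics (in Istio, for example, request metrics carry a connection_security_policy label that separates mutual-TLS from plaintext traffic).

```python
from dataclasses import dataclass


@dataclass
class NamespaceTraffic:
    namespace: str
    mtls_requests: int       # inbound requests seen over mTLS in the window
    plaintext_requests: int  # inbound requests seen in plaintext in the window


def ready_for_strict(sample: NamespaceTraffic) -> bool:
    """A namespace is ready only if it saw traffic and none of it was plaintext."""
    total = sample.mtls_requests + sample.plaintext_requests
    return total > 0 and sample.plaintext_requests == 0


def rollout_plan(samples: list[NamespaceTraffic]) -> dict[str, str]:
    """Decide the mTLS mode per namespace from an observation window."""
    return {
        s.namespace: "STRICT" if ready_for_strict(s) else "PERMISSIVE"
        for s in samples
    }


if __name__ == "__main__":
    observed = [
        NamespaceTraffic("payments", mtls_requests=120_000, plaintext_requests=0),
        NamespaceTraffic("legacy-batch", mtls_requests=8_000, plaintext_requests=42),
        NamespaceTraffic("idle", mtls_requests=0, plaintext_requests=0),
    ]
    print(rollout_plan(observed))
```

A namespace with no observed traffic stays PERMISSIVE in this sketch because an empty window proves nothing; pick an observation window long enough to include batch jobs and other infrequent callers.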
Where next
- Kubernetes security - the platform under the mesh; admission control, RBAC, pod security.
- Network security - the L3/L4 layer; NetworkPolicy, segmentation, the layered defense.
- Zero trust - the philosophy the mesh implements east-west.
- API security - the north-south complement at the gateway.
- Friday Zoom - service mesh, eBPF, and ambient-mode migrations come up regularly. Drop in.