An introduction to service meshes.

In this blog post, expert Koen De Waele introduces the key benefits of service meshes, explains common challenges, and offers guidance on implementation strategies.

Koen De Waele
11 Feb 2025

The evolution from traditional networking to service mesh architecture

In the early days of microservices, traditional networking approaches were sufficient. However, as applications grew in complexity and were rebuilt as microservices, the need for more sophisticated traffic management, security, and observability in these dynamic, distributed environments became apparent. This led to the development of service meshes, which provide a dedicated infrastructure layer for handling service-to-service communication.

Before the advent of service meshes, application teams were responsible for implementing traffic policies, error handling, and observability within their applications. These concerns were implemented in application code, which meant that different teams were re-implementing the same functionality over and over again, creating fragmentation and security risks for the organisation.

What is a service mesh?

A service mesh manages network communication between bare metal servers, containers or virtual machines, taking care of things like routing traffic, securing communication, and providing observability.

When a service mesh is applied, all communication between services is routed through sidecar proxies. These proxies have built-in networking capabilities like load balancing and encryption, decoupling the network logic from the application or business logic of your services.

In short: it addresses one of the main challenges of distributed architectures, namely that default network communication is inherently unreliable, insecure, and unlogged.

As your microservices grow in number, managing these interactions can get complicated. A service mesh automates many of these tasks, so developers can focus on building the application instead of managing service-to-service communication. It is, in a sense, also a scalability problem: 10 microservices can still be managed manually, but for hundreds of microservice deployments you need tools like a service mesh to handle operational network management at scale.

What challenges does a service mesh address?

A service mesh gives you control over how traffic flows between services. You can split traffic between different versions of a service, reroute requests during deployments, or set up retry and timeout policies. This is useful for canary deployments, for example.
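In practice the split is declared in the mesh's configuration and enforced by the sidecar proxies. As a rough illustration of the mechanism only (the function and names below are hypothetical, not any mesh's API), a weighted canary split boils down to mapping each request to a version according to the configured weights:

```python
import hashlib

def route(request_id: str, weights: dict[str, int]) -> str:
    """Pick a service version for a request using weighted routing.

    Hashing the request id makes the choice stable per request, while
    the overall traffic distribution follows the configured weights.
    """
    total = sum(weights.values())
    # Map the request id to a stable bucket in [0, total).
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % total
    for version, weight in weights.items():
        if bucket < weight:
            return version
        bucket -= weight
    raise ValueError("weights must be positive")

# Canary split: roughly 90% of traffic to v1, 10% to the new v2.
weights = {"v1": 90, "v2": 10}
counts = {"v1": 0, "v2": 0}
for i in range(10_000):
    counts[route(f"req-{i}", weights)] += 1
```

With a service mesh, this logic lives in the proxy and is changed by updating configuration, not by redeploying your services.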

Service meshes also give you advanced load-balancing capabilities like dynamic traffic distribution (for scaling up or down) and automated service discovery, reducing the operational load of managing endpoints.

A service mesh makes it easy to enable mutual TLS (mTLS). This ensures that all communication between services (data in transit) is encrypted and that both sides are authenticated (the ‘m’ in mTLS), keeping unauthorized services out. This fits nicely with the principles of zero-trust security.
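Under the hood, the sidecars terminate and originate TLS for you. As a sketch of what “mutual” means at the socket level, using Python's standard `ssl` module (the certificate paths in the comments are hypothetical), it comes down to one extra requirement on the server side:

```python
import ssl

# A server-side TLS context that also *requires* a valid client
# certificate -- the mutual part of mTLS. In a mesh, the sidecar proxy
# sets this up and the control plane rotates the certificates for you.
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
ctx.minimum_version = ssl.TLSVersion.TLSv1_2
ctx.verify_mode = ssl.CERT_REQUIRED  # plain TLS would leave this at CERT_NONE

# The mesh loads mesh-issued certificates automatically (hypothetical paths):
# ctx.load_cert_chain("/etc/certs/svc.crt", "/etc/certs/svc.key")
# ctx.load_verify_locations("/etc/certs/mesh-ca.crt")
```

The point of the mesh is that no application ever writes this code: encryption, peer verification, and certificate rotation all happen in the proxy.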

Policy enforcement is another key capability of service meshes. A service mesh provides a centralized way to define and enforce policies related to security, traffic management, and observability, ensuring consistent behavior across all the services.

Last but equally important is the capability of gaining visibility into service interactions, a.k.a. observability. A service mesh automatically collects metrics, logs, and traces, giving you real-time visibility into your services. This helps with monitoring, troubleshooting, and performance tuning.

Core components of a service mesh

A service mesh basically consists of a central control plane and sidecar proxies in the data plane.

The control plane manages the configuration and policies for the service mesh. It distributes rules and policies to the sidecar proxies, collects telemetry data, and provides a centralised point for managing the mesh.

In the data plane, each (micro)service instance is paired with a sidecar proxy, typically deployed as a separate container on Kubernetes. These proxies handle all incoming and outgoing network traffic for the service, providing features like load balancing, traffic routing, and security.

Service mesh type segmentation

Service meshes can be broadly categorized into Kubernetes-based and non-Kubernetes-based implementations. Kubernetes-based service meshes, such as Istio and Linkerd, integrate seamlessly with Kubernetes, leveraging its native capabilities. Non-Kubernetes-based implementations, like Consul, offer more flexibility for environments that do not use Kubernetes.

In Kubernetes-based implementations, the sidecar proxy pattern is used within Kubernetes clusters. Envoy is currently the most widely used service mesh proxy; all service meshes based on Istio use Envoy as their default proxy.

In non-Kubernetes-based implementations, the proxy capabilities are deployed within the service instances themselves. Service discovery is handled by external applications like Consul or Eureka. A disadvantage of this type of solution is that you need specific application code to register with these centralized servers. Another approach is an agent setup, where a proxy is deployed alongside each service instance, but this kind of setup usually still requires some specific network configuration.

Service mesh topology types

The multi-deployment topologies below are usually only found in commercial offerings, as open-source products tend to focus on single-cluster/single-environment deployments.

Multi-tenant: Supports multiple tenants within a single cluster, ensuring isolation and security. Typically, if you have multiple teams building and deploying microservices on a container platform, each team is a tenant and can deploy and operate its services in isolation from the other teams.


Multi-zone: Operates across multiple availability zones within a single region. This is a crucial capability for cloud-based deployments. The built-in load-balancing capabilities of this type of mesh can optimize inter-zone traffic, potentially reducing traffic costs in hyperscaler deployments.


Multi-location: Extends across different geographical locations, providing global reach. This also allows a kind of hybrid setup where a service mesh is deployed across both cloud and on-premises locations.

Multi-cluster: Specific to Kubernetes-based service meshes, where one service mesh is deployed across multiple Kubernetes clusters. This is basically the same setup as a multi-zone service mesh.

Performance implications

Service meshes can introduce additional latency due to the extra network hops and processing required by the sidecar proxies. This can be particularly noticeable in high-throughput environments. The sidecar proxies also consume additional CPU and memory resources; this overhead can be substantial in large-scale deployments where each service instance has its own proxy. The added processing by the service mesh can impact the overall throughput of the system if the proxies are not managed properly.

Overhead considerations

Managing a service mesh involves configuring and maintaining multiple components, which increases operational complexity. This includes setting up the proxies as well as managing the service mesh control plane. The additional layer of the service mesh makes monitoring and debugging more complex, requiring specialised tools and expertise to effectively manage and troubleshoot issues.

While service meshes enhance security by providing features like mutual TLS and fine-grained access control, they also add to the computational overhead.

Some popular open-source meshes

Istio
The original service mesh and today the most widely used. It provides a very extensive feature set, which also makes it the most complex mesh to manage. Most commercial service mesh offerings are based on Istio.

Linkerd
Linkerd is known for its simplicity and performance. It was designed to be lightweight and easy to use, making it a great choice for those starting out with a service mesh. Although it has fewer features than Istio, it still provides all the core service mesh capabilities.

Kuma
Kuma is known as a universal service mesh, meaning it can support both Kubernetes and traditional bare metal/virtual machine environments, making it a good choice for mixed deployment scenarios.

Consul
Like Kuma, Consul is also a universal mesh. Consul is part of HashiCorp’s product line, so if you’re already using some of HashiCorp’s products, this service mesh would be a good choice.


A quick comparison of open source service meshes

| Feature     | Istio         | Linkerd          | Kuma     | Consul   |
|-------------|---------------|------------------|----------|----------|
| Ease of use | Complex       | Simple           | Moderate | Moderate |
| Maturity    | Mature        | Mature           | Emerging | Mature   |
| Complexity  | High          | Low              | Low      | Medium   |
| Security    | Advanced      | Minimal          | Advanced | Advanced |
| Performance | Medium        | High             | High     | Medium   |
| Feature set | Comprehensive | Focused          | Flexible | Limited  |
| Proxy       | Envoy         | linkerd2-proxy   | Envoy    | Envoy    |
| Community   | Large, active | Smaller, focused | Growing  | Moderate |

All commercial vendors base their offerings on one of these four open-source products. Istio is by far the most popular, but each of the other open-source variants has at least one commercial offering.

The odd one out here is AWS App Mesh, which also uses Envoy as its proxy but is not based on any of the four meshes mentioned above.


Secure Production Identity Framework For Everyone (SPIFFE)

SPIFFE is a set of open-source standards for securely identifying software systems in dynamic and heterogeneous environments. Systems that adopt SPIFFE can easily and reliably mutually authenticate wherever they are running.

The core of these specifications is the one defining short-lived cryptographic identity documents, called SPIFFE Verifiable Identity Documents (SVIDs). Workloads can then use these identity documents when authenticating to other workloads, for example by establishing a TLS connection or by signing and verifying a JWT.
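The identity inside an SVID is a SPIFFE ID: a URI of the form `spiffe://<trust-domain>/<workload-path>`. A minimal sketch of splitting and validating one (the helper function is my own illustration, not part of any SPIFFE library):

```python
from urllib.parse import urlparse

def parse_spiffe_id(uri: str) -> tuple:
    """Split a SPIFFE ID into (trust_domain, workload_path).

    A SPIFFE ID is a URI like spiffe://<trust-domain>/<path>; an SVID
    carries such an ID, e.g. in an X.509 certificate's SAN field.
    """
    parts = urlparse(uri)
    if parts.scheme != "spiffe" or not parts.netloc:
        raise ValueError(f"not a valid SPIFFE ID: {uri!r}")
    return parts.netloc, parts.path

# e.g. the identity of a 'billing' workload in the 'example.org' trust domain
domain, path = parse_spiffe_id("spiffe://example.org/ns/prod/sa/billing")
```

A workload trusts a peer not because of its IP address, but because the peer presents an SVID whose SPIFFE ID belongs to an allowed trust domain and path.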

All major service meshes support the SPIFFE standard, usually implemented via a SPIRE server. This helps automate the identity and certificate management needed for mTLS.

Before SPIFFE, certificate management largely had to be done manually.

So both SPIFFE/Spire (identity management) and mTLS configuration (authentication and encryption) are needed to achieve zero-trust principles in a service mesh.

Resilience patterns in service meshes

Resilience patterns are critical strategies for maintaining system reliability and performance under various failure conditions. In a service mesh, as in application code, these patterns help manage distributed-system challenges like network failures, latency, and service unavailability.

Key resilience patterns in service meshes

Circuit breaker

In a service mesh like Istio or Linkerd, circuit breakers are implemented at the infrastructure level. They monitor service health by tracking request success rates, latency, and error responses. If a service consistently fails or becomes unresponsive, the circuit breaker temporarily stops routing requests to that service, preventing cascading failures.
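The underlying state machine is small enough to sketch. This is a simplified, in-process illustration of the idea (not any mesh's actual implementation, and with made-up thresholds):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker sketch, as a mesh proxy applies per upstream.

    CLOSED: requests flow; consecutive failures are counted.
    OPEN:   requests are rejected immediately until a cooldown passes.
    """
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # timestamp when the breaker tripped

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.reset_after:
            # Half-open: let one probe request through.
            self.opened_at = None
            self.failures = self.max_failures - 1
            return True
        return False

    def record(self, success: bool) -> None:
        if success:
            self.failures, self.opened_at = 0, None
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()

breaker = CircuitBreaker(max_failures=3)
for _ in range(3):  # three straight failures trip the breaker
    breaker.record(success=False)
```

With a mesh, this bookkeeping happens in the sidecar for every upstream service, driven by configuration rather than code.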

Retry mechanism

Service meshes provide automatic retry capabilities with configurable parameters like:

  • Maximum retry attempts
  • Retry budgets (total retries across the system)
  • Backoff strategies (exponential, jittered)

These are typically configured through mesh-wide policies or per-service configurations.
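The mesh computes these delays for you. As an illustration of what “exponential, jittered” means (a standalone sketch, not a mesh API), a full-jitter schedule looks like this:

```python
import random

def backoff_schedule(max_attempts: int, base: float = 0.1, cap: float = 5.0):
    """Yield a sleep duration for each retry: exponential growth, full jitter.

    Attempt n waits a random amount in [0, min(cap, base * 2**n)], which
    avoids thundering herds when many clients retry at the same time.
    """
    for attempt in range(max_attempts):
        yield random.uniform(0, min(cap, base * 2 ** attempt))

delays = list(backoff_schedule(max_attempts=5))
```

A retry budget then caps the total number of such retries across the system, so retries cannot themselves overload a struggling service.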

Timeout management
Service meshes can enforce global or service-specific timeout policies. They can automatically terminate requests that exceed predefined time limits, preventing resource exhaustion and improving overall system responsiveness.
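At the proxy, a timeout is simply a deadline on each upstream call. A minimal in-process equivalent (an illustrative sketch; in a real mesh the sidecar enforces this, not your code):

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

pool = ThreadPoolExecutor(max_workers=4)

def call_with_timeout(fn, timeout: float):
    """Run fn, but give up after `timeout` seconds, like a proxy deadline."""
    future = pool.submit(fn)
    try:
        return future.result(timeout=timeout)
    except FutureTimeout:
        raise TimeoutError(f"upstream call exceeded {timeout}s")

result = call_with_timeout(lambda: "ok", timeout=0.5)
```

The benefit of doing this in the mesh is uniformity: one policy bounds every caller of a service, instead of each team picking its own timeout values.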

Bulkhead pattern
Service meshes isolate service communication by applying traffic management rules. They can limit concurrent requests, allocate specific resources to services, and prevent a single service’s failure from impacting the entire system.
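Proxies implement this with connection pools and pending-request limits; the core idea fits in a few lines (an illustrative sketch, not mesh code):

```python
import threading

class Bulkhead:
    """Cap concurrent in-flight requests to one upstream (bulkhead pattern).

    Rejecting excess requests immediately keeps one slow dependency from
    soaking up every worker in the process.
    """
    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn):
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("bulkhead full: request rejected")
        try:
            return fn()
        finally:
            self._slots.release()

bulkhead = Bulkhead(max_concurrent=2)
```

Failing fast like this is deliberate: a quick rejection is easier to handle (and retry elsewhere) than a thread stuck waiting on a saturated service.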

Recommendation
For modern microservices architectures, service mesh resilience patterns provide the most robust solution. The service mesh handles all network-level concerns, while application code should only implement specific business logic around failures.

When implementing a service mesh in an existing architecture, a thorough application code assessment is necessary to discover possible overlap in network-handling capabilities between the application code and the service mesh.

Conclusion

Service meshes provide robust security features like mutual TLS, fine-grained access control and traffic encryption which can significantly improve the security of your services. They also offer advanced observability tools, including metrics, tracing and logging, which makes the inter-service communication much more visible.

Lastly, they enable sophisticated traffic management capabilities such as load balancing and traffic splitting, which can enhance resilience.

But there are also a few challenges, like the additional latency and resource consumption introduced by the sidecar proxies. It’s crucial to assess whether your infrastructure can handle this overhead.

Managing a service mesh adds some complexity to your operations. It requires expertise in configuring and maintaining the mesh components, as well as monitoring and troubleshooting.

It is true that operating a service mesh means more work, but this has to be weighed against the operational benefits: for example, it provides a platform with uniform transport security for which no certificates have to be managed manually. The more complex the network topology and the greater the number of services to be managed, the more benefits a service mesh will provide.

Another point to consider is the type of service mesh: Kubernetes-based or not. If you have a mix of deployment environments and need to roll out a service mesh on bare metal or virtual machines, meshes like Kuma or Consul are a better fit. If you already have one or more Kubernetes clusters, any service mesh is a good choice, depending on your particular use case and needs.

In summary, while a service mesh can provide significant benefits in terms of security, observability and traffic management, it also introduces performance and operational challenges that need to be carefully managed.
