Using Microsoft’s Retina to monitor Kubernetes networks

By Simon Bisson

Kubernetes plays an important role at Microsoft. The container management system is a foundational piece of the company’s many clouds, from Microsoft 365 and Xbox, to Azure, to partners like OpenAI that use Microsoft’s Kubernetes to host their own services.

As a result, Microsoft has invented many of its own Kubernetes management tools. These include Kaito for deploying AI inferencing workloads and Fleet for large-scale management of Kubernetes clusters. All of Microsoft’s various tools sit underneath its two managed Kubernetes services, Azure Kubernetes Service and Azure Container Service, allowing you to deploy and orchestrate your container-based applications without needing to build the necessary management framework. It all comes for free, with APIs, portals, and command line interfaces.

In the old days, that would have been it. Microsoft would have used these features to differentiate itself from its competitors and their Kubernetes clouds. But Microsoft has taken the open-source model to heart, with many of the leaders of its Kubernetes initiatives coming from an open-source background. Instead of keeping its Kubernetes tools to itself, Microsoft releases them as open-source projects, where anyone can use them, and where anyone can contribute new code.

Introducing the Retina observability platform

One of the latest Azure tools to become an open-source project is Retina, a network observability tool designed to help you understand network traffic in all of your clusters, no matter how they’re configured or what OS they use. There’s no tie to Azure functionality, either. You can run Retina in any Kubernetes instance, on-premises or in AWS, Azure, or GCP.

At the heart of Retina, much like the Falco security tool, are extended Berkeley Packet Filters (eBPF). These let you run code in the kernel of the host OS, outside your application containers, so you can use eBPF probes without significantly affecting the code you’re running. There’s no need to add agents to your containers or add monitoring libraries to your code, and one eBPF probe can monitor all the pods running on a host, whether it’s a cloud VM or on-premises physical hardware.

Running Retina probes in-kernel simplifies network monitoring. You don’t need to know what network cards are installed on the host server, or how your Kubernetes install uses a service mesh. Instead, you get a look at how the host OS’s networking stack is handling packets. You can track packet types, latency, and packet loss, taking advantage of low-level TCP/IP features that may not be accessible at a higher level.

By focusing on making cloud-native networking observable, Retina is designed to fit into any monitoring tool set and any Kubernetes install. There’s support for both Linux and Windows, which should help you monitor and debug hybrid applications that mix Linux and Windows services. As eBPF probes are code, you can think of them as customizable plugins, allowing Retina to evolve with new Kubernetes features and to support the metrics you need for your monitoring requirements.

Data is delivered to the familiar Prometheus monitoring service at a node level. The data gathered includes DNS, layer 4 operations, and packet captures. Because the data is labelled, you can build a map of operations in your Kubernetes environment, helping track down issues like a blocking microservice as Retina records the pattern of flows in and around your Kubernetes instances.
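To make the idea of a label-driven flow map concrete, here is a minimal Python sketch. It assumes samples arrive in the Prometheus text exposition format and carry source and destination labels. The metric name (example_forward_bytes) and the label names (source_pod, destination_pod) are hypothetical placeholders rather than Retina's actual schema, so check the metric names your own installation exposes before adapting it.

```python
import re
from collections import defaultdict

# Hypothetical sample in Prometheus text exposition format. The metric name
# and label names are illustrative assumptions, not Retina's real schema.
SAMPLE = """\
# TYPE example_forward_bytes counter
example_forward_bytes{source_pod="web-1",destination_pod="api-1"} 1200
example_forward_bytes{source_pod="web-1",destination_pod="api-2"} 300
example_forward_bytes{source_pod="api-1",destination_pod="db-0"} 900
example_forward_bytes{source_pod="web-2",destination_pod="api-1"} 800
"""

# name{labels} value
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Yield (metric_name, labels_dict, value) for each labelled sample line."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # this sketch ignores unlabelled or malformed lines
        labels = dict(LABEL_RE.findall(match.group("labels")))
        yield match.group("name"), labels, float(match.group("value"))


def build_flow_map(text, src_label="source_pod", dst_label="destination_pod"):
    """Sum sample values for every (source, destination) pair."""
    flows = defaultdict(float)
    for _name, labels, value in parse_samples(text):
        if src_label in labels and dst_label in labels:
            flows[(labels[src_label], labels[dst_label])] += value
    return dict(flows)


def inbound_totals(flows):
    """Total traffic arriving at each destination, to spot busy services."""
    totals = defaultdict(float)
    for (_src, dst), value in flows.items():
        totals[dst] += value
    return dict(totals)


if __name__ == "__main__":
    flows = build_flow_map(SAMPLE)
    for (src, dst), value in sorted(flows.items()):
        print(f"{src} -> {dst}: {value:.0f}")
    print(inbound_totals(flows))
```

Summing inbound traffic per destination is one simple way to see which service sits at the center of a traffic pattern. In practice you would more likely express the same aggregation as a query in Prometheus itself.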

Getting started with Retina

Start by cloning the Retina GitHub repo, then use the bundled Helm charts to install. You may need to configure Prometheus as well, to ensure that Retina is logging data. If you want to use the Retina CLI, you need to be running on a Linux-hosted Kubernetes. The CLI runs as a kubectl plugin, so it should be easy to use alongside your other Kubernetes CLI tools. Alternatively, you can use YAML custom resource definitions to configure and run a network capture.

On Linux the eBPF network capture plugin is a version of the open source Inspektor Gadget tool. This was originally developed by the Kinvolk team, now part of Azure and still focused on container engineering. Inspektor Gadget is a library of Kubernetes eBPF tools that works with Kubernetes applications of any size, from single nodes to large clusters. Retina uses Inspektor Gadget trace gadgets to observe network system events.

Observing container networks

The Retina website provides detailed instructions for working with the tool. Retina offers three different operating modes: basic metrics at a per-node level, more detailed “remote context” metrics with support for aggregating by source and destination pod, and a “local context” option that allows you to choose which pods to monitor.

It’s important to note that you don’t see everything by default, as that could be overwhelming. Instead, different metrics are enabled by different plugins. For example, if you want to track DNS calls, start by enabling the DNS plugin. All the metrics include cluster and instance metadata, so you can filter and report using labels to identify specific target nodes and pods. Local and remote context options add labels that track source and destination.

Configuring Retina also requires setting up a Prometheus target for the data, along with an appropriate Grafana dashboard. Microsoft provides sample configurations for both on GitHub in the Retina repository. The defaults display networking and DNS data for your cluster. Having the data in Prometheus allows you to use other tools to work with Retina data, for example feeding data into a policy engine to trigger alerts or automate specific operations.
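As a rough illustration of feeding Prometheus data into a policy check, the Python sketch below builds an instant-query URL for the Prometheus HTTP API (the /api/v1/query endpoint) and scans a response for series above a threshold. The server address, the metric name in the sample, and the threshold are all hypothetical. The canned response only mimics the shape of a Prometheus instant-vector result so the example runs without a live server.

```python
import json
from urllib.parse import urlencode

# Hypothetical Prometheus address; replace with your own server.
PROMETHEUS_URL = "http://prometheus.example.internal:9090"

# Canned response in the shape of a Prometheus instant query result.
# The metric name and label values are made up for illustration.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "example_drop_rate", "node": "node-a"},
             "value": [1700000000.0, "3.0"]},
            {"metric": {"__name__": "example_drop_rate", "node": "node-b"},
             "value": [1700000000.0, "42.5"]},
        ],
    },
})


def build_query_url(base_url, promql):
    """Return the instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def find_violations(response_body, threshold):
    """Return (labels, value) for every series whose value exceeds threshold."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    violations = []
    for series in payload["data"]["result"]:
        _timestamp, raw_value = series["value"]  # sample values arrive as strings
        value = float(raw_value)
        if value > threshold:
            violations.append((series["metric"], value))
    return violations


if __name__ == "__main__":
    print(build_query_url(PROMETHEUS_URL, "example_drop_rate"))
    for labels, value in find_violations(SAMPLE_RESPONSE, threshold=10.0):
        print(f"ALERT {labels.get('node', '?')}: {value}")
```

To use it against a real server, you would fetch the URL with an HTTP client such as urllib.request and pass the response body to find_violations. For production alerting, Prometheus alerting rules are usually a better fit than a polling script.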

With Retina installed and Prometheus and Grafana configured, you can now go beyond the defaults, configuring the Retina agent and plugins via YAML. Additional metrics configuration is via Kubernetes custom resource definitions.

Measuring Kubernetes network operations

Retina isn’t really a tool for continuous monitoring at a packet level, as it will generate a lot of data in a busy cluster, unless of course you use it with a policy-based tool to identify exceptions from normal operation. In practice, it’s perhaps best to use Retina to identify the root causes of issues with a running cluster. Perhaps nodes are failing to communicate with each other, or you suspect that errors may be due to latency in a specific service interaction. Here you can trigger the required packet capture with a single command that collects all of the data you need to run a diagnosis.

Continuous operation is reported via metrics that give you statistical information about key network issues. These can be managed using Prometheus to generate alerts, with Grafana dashboards to give you an overview of the overall performance of your cluster, along with data from other observability tools.

One useful metric offered by Retina is one that’s often ignored: API latency. In cloud-native development, you’re often working with third-party APIs. Some might be platform services from a cloud provider, while others could be essential line-of-business data sources, like Salesforce or SAP HANA. Here you can use Retina’s API server latency to get metrics that help track server response times.

Having this data lets you start a diagnostic process with your API provider, helping track down the source of any latencies. Delays in API access can be a significant blocker in your applications, so having this data can help you deliver a more reliable and responsive application.
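One way to turn latency data into a number you can take to an API provider is a percentile. The sketch below assumes the latency metric is exported as a Prometheus-style cumulative histogram, which is an assumption about the data's shape rather than something confirmed here. The bucket boundaries and counts are invented. It estimates a quantile by linear interpolation within the matching bucket, the same general approach PromQL's histogram_quantile function takes.

```python
import math

# Hypothetical cumulative histogram: (upper bound in seconds, cumulative count).
# Boundaries and counts are invented for illustration.
BUCKETS = [
    (0.1, 50),
    (0.5, 90),
    (1.0, 98),
    (math.inf, 100),
]


def estimate_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative histogram buckets.

    Buckets must be sorted by upper bound and end with an infinite bound.
    The value is linearly interpolated inside the bucket containing the rank.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in the range (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations recorded
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the overflow bucket: report the highest
                # finite bound, since nothing better can be estimated.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


if __name__ == "__main__":
    for q in (0.5, 0.95, 0.99):
        print(f"p{int(q * 100)}: {estimate_quantile(q, BUCKETS):.4f}s")
```

The accuracy of the estimate depends on how the bucket boundaries were chosen. A tail that lands in the overflow bucket can only be reported as "at least" the highest finite bound.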

A maturing Kubernetes ecosystem

Microsoft has made a preview version of a Retina-based observability tool available for Azure Kubernetes Service as the Network Observability add-on. This works with Azure’s managed Prometheus and Grafana. You can find a list of the pre-configured metrics in its documentation, but it currently offers only a subset of Retina’s capabilities, delivering only node-level metrics.

One key point to consider with Retina is that it builds on Azure’s experience with Kubernetes. The metrics captured out of the box are what the Azure team considers important, and you’re building on the knowledge that supports one of the largest and most active Kubernetes environments anywhere. If you need alternative metrics, you can build your own eBPF probes for Retina, which then can be shared with the wider Kubernetes community.

Open source requires shared expertise to be successful. By opening up the code base, Microsoft is encouraging Retina developers to bring their knowledge to the platform, with the hope that AWS, GCP, and other at-scale Kubernetes operators will share the networking lessons they have learned with the world. As Kubernetes matures, eBPF-based tools like Retina and Falco will become increasingly important, providing the data we need to deliver secure and reliable cloud-native applications at scale.