Grafana: Shining a light into Kubernetes clusters

By Travis Van

Back in 2014, when the wave of containers, Kubernetes, and distributed computing was breaking over the technology industry, Torkel Ödegaard was working as a platform engineer at eBay Sweden. Like other DevOps pioneers, Ödegaard was grappling with the new form factor of microservices and containers and struggling to climb the steep learning curve of Kubernetes operations and troubleshooting.

As an engineer striving to make continuous delivery both safe and easy for developers, Ödegaard needed a way to visualize the production state of the Kubernetes system and the behavior of users. Unfortunately, there was no specific playbook for how to extract, aggregate, and visualize the telemetry data from these systems. Ödegaard’s search eventually led him to a nascent monitoring tool called Graphite, and to another tool called Kibana that simplified the experience of creating visualizations.

“With Graphite you could with very little effort send metrics from your application detailing its internal behaviors, and for me, that was so empowering as a developer to actually see real-time insight into what the applications and services were doing and behaving, and what the impact of a code change or new deployment was,” Ödegaard told InfoWorld. “That was so visually exciting and rewarding and made us feel so much more confident about how things were behaving.”
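To make the "very little effort" concrete, here is a minimal sketch of what sending a metric to Graphite can look like. It assumes a Carbon listener accepting Graphite's plaintext protocol (one line per data point: metric path, value, Unix timestamp) on its default TCP port 2003. The host, the helper names, and the metric name `checkout.orders.count` are hypothetical and chosen only for illustration.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    # One data point in Graphite's plaintext protocol:
    # "<metric.path> <value> <unix_timestamp>" terminated by a newline.
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_metric(path, value, host="localhost", port=2003, timestamp=None):
    # Assumption: a Carbon plaintext listener is reachable at host:port.
    line = format_metric(path, value, timestamp)
    with socket.create_connection((host, port), timeout=2) as sock:
        sock.sendall(line.encode("ascii"))
    return line
```

With a listener running, an application could call `send_metric("checkout.orders.count", 1)` at the point where an order completes, and the value would become available to graph against deployments and code changes.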

Ödegaard started his own side project because Graphite, for all its power, was very difficult to use. It required learning a complicated query language and working through clunky processes for building out frameworks. But Ödegaard realized that, if you could combine the monitoring power of Graphite with the ease of Kibana, you could make visualizations for distributed systems much more accessible and useful for developers.

And that’s how the vision for Grafana was born. Today Grafana and other observability tools fill not a niche in the monitoring landscape but a gaping chasm that traditional network and systems monitoring tools never anticipated.

A cloud operating system

Recent decades have seen two major jumps in infrastructure evolution. First, we went from beefy “scale-up” servers to “scale-out” fleets of commodity Linux servers running in data centers. Then we made another leap to even higher levels of abstraction, approaching our infrastructure as an aggregation of cloud resources that are accessed through APIs.

Throughout this distributed systems evolution driven by aggregations, abstractions, and automation, the “operating system” analogy has been repeatedly invoked. Sun Microsystems had the slogan, “The network is the computer.” UC Berkeley AMPLab’s Matei Zaharia, creator of Apache Spark, co-creator of Apache Mesos, and now CTO and co-founder at Databricks, said “the data center needs an operating system.” And today, Kubernetes is increasingly referred to as a “cloud operating system.”

Calling Kubernetes an operating system draws quibbles from some, who are quick to point out the differences between Kubernetes and actual operating systems.

But the analogy is reasonable. You do not need to tell your laptop which core to fire up when you launch an application. You do not need to tell your server which resources to use every time an API request is made. Those processes are automated through operating system primitives. Similarly, Kubernetes (and the ecosystem of cloud-native infrastructure software in its orbit) provides OS-like abstractions that make distributed systems possible by masking low-level operations from the user.

The flip side to all this wonderful abstraction and automation is that understanding what’s going on under the hood of Kubernetes and distributed systems requires a ton of coordination that falls back to the user. Kubernetes never shipped with a pretty GUI that automagically rolls up system performance metrics, and traditional monitoring tools were never designed to aggregate all of the telemetry data being emitted by these vastly complicated systems.

From zero to 20 million users in 10 years

Developers most commonly associate Grafana with dashboard creation and visualization. Its power as a visualization tool and its ability to work with just about any type of data made it a hugely popular open-source project, well beyond distributed computing and cloud-native use cases.

Hobbyists use Grafana for everything from visualizing bee colony activity inside the hive to tracking carbon footprints in scientific research. Grafana was used in the SpaceX control center for the Falcon 9 launch in 2015, and later by the Japan Aerospace Exploration Agency for its lunar landing. This is a technology that turns up almost anywhere there is a visualization use case.

But the real story is Grafana’s impact on an observability domain that, prior to its arrival, was defined by proprietary back-end databases and query languages that locked users into specific vendor offerings, major switching costs for users who wanted to migrate to other vendors, and walled gardens of supported data sources.

Ödegaard attributes much of the early success of Grafana to the plugin system that he created in its early days. After he personally wrote the InfluxDB and Elasticsearch data sources for Grafana, community members contributed integrations with Prometheus and OpenTSDB, setting off a wave of community plugins to Grafana. Today the project supports more than 160 external data sources—what it calls a “big tent” approach to observability.

The Grafana project continues to work with other open-source projects like OpenTelemetry to provide simple, standard semantic models for all telemetry data types and to unify the “pillars” of observability telemetry data (logs, metrics, traces, profiling). The Grafana community is connected by an “own your own data” philosophy that continues to attract connectors and integrations with every possible database and telemetry data type.

Grafana futures: New visualizations and telemetry sources

Ödegaard says that Grafana’s visualization capabilities have been a big personal focus for the evolution of the project. “There’s been a long journey of creating a new React application architecture where third-party developers can build dashboard-like applications in Grafana,” Ödegaard said.

But beyond enriching the ways that third parties can create visualizations on top of this application architecture, the dashboards themselves are getting a big boost in intelligence.

“One big trend is that dashboard creation should eventually be made obsolete,” said Ödegaard. “Developers shouldn’t have to build them manually, they should be intelligent enough to generate automatically based on data types, team relationships, and other criteria. By knowing the query language, libraries detected, the programming languages you are writing with, and more. We are working to make the experience much more dynamic, reusable and composable.”

Ödegaard also sees Grafana visualization capabilities evolving towards new de-aggregation methods—being able to go backward from charts to how graphs are composed and break down the data into component dimensions and root causes.

The cloud infrastructure observability journey will continue to see new layers of abstraction and telemetry data. eBPF, a kernel-level abstraction, is rewriting the rules for how kernel primitives become programmable for platform engineers. Cilium, a project that recently graduated from Cloud Native Computing Foundation incubation, has created a network abstraction layer that allows for even more aggregations and abstractions across multi-cloud environments.

This is only the beginning. Artificial intelligence is introducing new considerations every day for the intersection of programming language primitives, specialized hardware, and the need for humans to understand what’s happening inside the highly dynamic AI workloads that are so computationally expensive to run.

You write it, you monitor it

As Kubernetes and related projects continue to stabilize the cloud operating model, Ödegaard believes that the health monitoring and observability considerations will continue to fall to human operators to instrument, and that observability will be one of the superpowers that distinguish the most sought-after talent.

“If you write it, you run it, and you should be on call for the software you write—that’s a very important philosophy,” Ödegaard said. “And in that vein, when you write software you should be thinking about how to monitor it, how to measure its behavior, not only from a performance and stability perspective but from a business impact perspective.”

For a cloud operating system that’s evolving at breakneck speed, who better than Ödegaard to champion humans’ need to reason about the underlying systems? Besides loving to program, he has a passion for natural history and evolution, and reads every book he can get his hands on about those subjects and about evolutionary psychology.

“If you don’t think evolution is amazing, something’s wrong with you. It’s the way nature programs. How much more awesome can it get?”