kube-state-metrics: the past, the present, and the future
The kube-state-metrics project is a service you can deploy in your Kubernetes cluster. It watches Kubernetes objects in order to expose metrics in the Prometheus format. The project lives under the umbrella of the Kubernetes organization and was started by Sam Ghods at Box in May 2016. The project has been growing in popularity, so I wanted to share its current state and what I believe the future holds. This post focuses in particular on the architectural changes the project has undergone and explains the reasoning behind the more significant ones.
Let’s start by highlighting the architecture of the project in the beginning. kube-state-metrics was designed to build an in-memory cache of the Kubernetes objects: deployments, nodes, pods, and so on. The in-memory cache was created using the informer framework, which is widely used within Kubernetes. An informer started by listing all available objects and then listened for events on those resources: add, update, and delete. Using these events, a best-effort in-memory representation was maintained. Network partitions or other failures can cause events to never reach a listening informer, so the cache was regularly re-created to ensure that at least once every “resync interval” its content was consistent with the state in etcd.
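A minimal sketch of that pattern, using only the standard library (the real implementation uses client-go's informer framework; the names here are illustrative, not the actual API):

```go
package main

import (
	"fmt"
	"sync"
)

// EventType mirrors the three informer event kinds.
type EventType int

const (
	Add EventType = iota
	Update
	Delete
)

// Event carries a change notification for a single object.
type Event struct {
	Type EventType
	Key  string      // e.g. "namespace/name"
	Obj  interface{} // the object's current state
}

// Cache is a best-effort in-memory view of the watched objects.
type Cache struct {
	mu    sync.RWMutex
	store map[string]interface{}
}

func NewCache() *Cache {
	return &Cache{store: map[string]interface{}{}}
}

// Handle applies a single watch event to the cache.
func (c *Cache) Handle(e Event) {
	c.mu.Lock()
	defer c.mu.Unlock()
	switch e.Type {
	case Add, Update:
		c.store[e.Key] = e.Obj
	case Delete:
		delete(c.store, e.Key)
	}
}

// Resync replaces the cache content with a full relist, restoring
// consistency if events were missed (e.g. during a network partition).
func (c *Cache) Resync(list map[string]interface{}) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.store = list
}

func main() {
	c := NewCache()
	c.Handle(Event{Type: Add, Key: "default/pod-a", Obj: "running"})
	c.Handle(Event{Type: Delete, Key: "default/pod-a"})
	// A periodic relist brings back anything the event stream missed.
	c.Resync(map[string]interface{}{"default/pod-b": "pending"})
	fmt.Println(len(c.store)) // prints 1
}
```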
The problem was that kube-state-metrics iterated over the objects in the cache every ten seconds and created metrics from all of them. If a metric already existed, it was updated. For example, the container_restarts metric is created per container in a cluster; however, metrics never disappear from the metrics registry, meaning the number of metrics, and therefore the memory usage of kube-state-metrics, grows without bound over the lifetime of a Kubernetes cluster. Additionally, this conflicts with Prometheus’ staleness requirements, as metrics continue to be exposed even after the deletion of an object, and the exposed information can be up to ten seconds old.
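The unbounded growth can be illustrated with a toy registry (hypothetical names, not the actual kube-state-metrics code): each cycle creates or updates metrics per cached object, but nothing ever removes them, so series for deleted containers linger forever.

```go
package main

import "fmt"

// registry maps a metric's label set (here just the container name)
// to its current value, e.g. container_restarts{container="c1"}.
var registry = map[string]float64{}

// scrapeCycle updates the registry from the current cache content.
// Note: it only ever writes; metrics for objects that have vanished
// from the cache are never deleted.
func scrapeCycle(cache map[string]float64) {
	for container, restarts := range cache {
		registry[container] = restarts
	}
}

func main() {
	scrapeCycle(map[string]float64{"c1": 0, "c2": 3})
	// c1's pod is deleted; the next cycle sees only c2 ...
	scrapeCycle(map[string]float64{"c2": 4})
	// ... yet the stale c1 series is still exposed.
	fmt.Println(len(registry)) // prints 2
}
```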
Around half a year later my colleague Fabian Reinartz and I picked up its development. After continued engagement with the architecture and features, we became the maintainers of the project. We were aware of the limitations of the existing implementation and decided to adjust the architecture to work similarly to many of the exporters within the Prometheus ecosystem. Rather than populating the metrics every ten seconds, they are created on demand, whenever Prometheus requests the /metrics endpoint. Most importantly, this fixed the problems mentioned above, but it also opened the architecture up to be more modular. Every type of object has its own collector that is registered with a Prometheus metrics registry, and each of those collectors lives in a separate file.
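Roughly, the on-demand model looks like this (a simplified, standard-library-only sketch; the real project implements the prometheus.Collector interface from the client_golang library, and the type names below are illustrative):

```go
package main

import "fmt"

// Metric is a single rendered sample in the Prometheus text format.
type Metric string

// Collector turns one kind of cached Kubernetes object into metrics.
// Collect is only invoked when /metrics is scraped, so nothing is
// precomputed on a timer and no metric outlives its source object.
type Collector interface {
	Collect() []Metric
}

// podCollector is an illustrative collector over a pod cache.
type podCollector struct {
	pods map[string]int // pod name -> restart count
}

func (p podCollector) Collect() []Metric {
	var out []Metric
	for name, restarts := range p.pods {
		out = append(out, Metric(fmt.Sprintf(
			"container_restarts{pod=%q} %d", name, restarts)))
	}
	return out
}

// scrape renders the /metrics response from all registered collectors.
func scrape(collectors []Collector) []Metric {
	var out []Metric
	for _, c := range collectors {
		out = append(out, c.Collect()...)
	}
	return out
}

func main() {
	reg := []Collector{podCollector{pods: map[string]int{"pod-a": 2}}}
	for _, m := range scrape(reg) {
		fmt.Println(m)
	}
}
```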
Because of this work the structure of the project was simplified, which opened it up for new contributions. Surprisingly, kube-state-metrics ended up getting a lot more attention than we expected. Initially we were just scratching our own itch, and today the project has:
- 349 stars on GitHub
- 47 total contributors
- 351 commits
- 114 metrics (7 CronJob, 6 DaemonSet, 11 Deployments, 5 HPA, 12 Job, 2 LimitRange, 3 Namespace, 14 Node, 3 PVC, 22 Pod, 7 ReplicaSet, 8 ReplicationController, 2 ResourceQuota, 3 Service, 9 StatefulSet)
The project has grown organically: Fabian and I only put in an initial effort to make it approachable for contributions, and the majority of these metrics were not developed by the maintainers but through many community contributions. For comparison, the project initially exposed four metrics, whereas after many contributions it now exposes 114. Most of our own work has revolved around ensuring that the architecture of kube-state-metrics continues to scale and enable contributions.
In early 2017, Kubernetes sig-instrumentation decided that kube-state-metrics had become an important enough component that we wanted to create more formal compatibility guarantees and, more importantly, scalability guarantees. Thanks to a couple of Google employees, scalability tests were executed on 100, 500, and 1000 node clusters. The exact findings are documented in a Google doc, but in short the project scaled a lot better than we had expected. After a couple of rounds of refinement, ensuring that the metrics are consistent and in good shape (although we know today that we missed several), we were comfortable releasing 1.0 of kube-state-metrics in August 2017.
Since then the project has received even more contributions, and an additional maintainer has joined the team: @andyxning. I want to give him a huge shout out, as he has been a very active part of the project and continuously improves and maintains it.
We have recently released version 1.1.0 of kube-state-metrics and have also hit some new scalability concerns. Extremely high-churn environments, meaning environments in which Pods quickly appear and disappear, particularly when a lot of Kubernetes Jobs are run, inevitably put heavy load on Prometheus itself as well as on the exporter exposing these metrics. While Prometheus 2.0 aims to solve this on the Prometheus time-series database side, kube-state-metrics still has plenty of optimization opportunities.
While the number of metrics is going to grow further in the future, there are two primary factors that can be optimized in the architecture of kube-state-metrics: the size of the in-memory state it needs to keep, and the effort it must go through to produce the metrics output for Prometheus when requested.
The memory consumption cannot be lowered by an order of magnitude, as kube-state-metrics will always need to keep the current metric values in memory; its consumption is therefore bounded below by the size of that state.
Today, kube-state-metrics needs to iterate over all objects stored in the cache and create metrics from them whenever Prometheus requests the /metrics endpoint, and this still has potential for optimization. The question is why the intermediate step of caching the objects has to be taken at all, only to then generate the metrics on demand; the metrics could just as well be populated directly. In that case a property of the first architecture would return, however: metrics could never be unregistered. Since the cache is replaced with a metrics registry in this architecture, we can apply the same consistency strategy to the registry as is currently applied to the cache. When an object is deleted, the metrics associated with it are unregistered, and to ensure that the registry is consistent at least once per interval, it is regularly re-created and populated from a clean start.
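Under these assumptions, the proposed design could be sketched as an event-driven metric registry (the names below are hypothetical, not the final implementation): watch events write rendered metrics directly, deletes unregister them, and a periodic relist rebuilds the registry from scratch, mirroring the resync strategy used for the object cache today.

```go
package main

import (
	"fmt"
	"sync"
)

// MetricRegistry stores ready-to-serve metric strings keyed by object,
// replacing the intermediate object cache entirely.
type MetricRegistry struct {
	mu      sync.RWMutex
	metrics map[string]string // object key -> rendered metric line
}

func NewMetricRegistry() *MetricRegistry {
	return &MetricRegistry{metrics: map[string]string{}}
}

// OnAddOrUpdate stores the rendered metrics for an object as soon as
// its watch event arrives; no generation work is left for scrape time.
func (r *MetricRegistry) OnAddOrUpdate(key, rendered string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.metrics[key] = rendered
}

// OnDelete unregisters an object's metrics, avoiding the unbounded
// growth of the original architecture.
func (r *MetricRegistry) OnDelete(key string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.metrics, key)
}

// Rebuild repopulates the registry from a full relist, just as the
// object cache is periodically re-created today.
func (r *MetricRegistry) Rebuild(all map[string]string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.metrics = all
}

// Scrape serves /metrics by concatenation only; no per-object metric
// generation happens on the request path anymore.
func (r *MetricRegistry) Scrape() []string {
	r.mu.RLock()
	defer r.mu.RUnlock()
	out := make([]string, 0, len(r.metrics))
	for _, m := range r.metrics {
		out = append(out, m)
	}
	return out
}

func main() {
	reg := NewMetricRegistry()
	reg.OnAddOrUpdate("default/pod-a", `kube_pod_info{pod="pod-a"} 1`)
	reg.OnDelete("default/pod-a")
	fmt.Println(len(reg.Scrape())) // prints 0
}
```

The design choice here is to move the rendering cost from the scrape path to the (much cheaper, incremental) event path, which matters most in exactly the high-churn scenarios described above.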
Clayton Coleman has led some discussions in this direction and provided useful input on how Red Hat has solved scalability issues in OpenShift similar to those present in kube-state-metrics today. The architecture was definitely not my idea; I merely contributed my knowledge of the Prometheus ecosystem, and the architecture described above is the result of a number of discussions. I just happen to have decided to document it here.
As you have probably gathered, the project has come a long way; however, the next steps will affect large portions of the code base. This is necessary to solve the existing scalability issues and will allow kube-state-metrics to continue to scale as expected for large Kubernetes clusters.