Prometheus Snippets

This page is a wild collection of useful Prometheus snippets I have collected over time. Most of them are somewhat related to Kubernetes and/or OpenShift.

Average node temperature

This query collects the temperatures of all available sensors on hosts running node_exporter, joins in the regular node name, and averages them per node.

You can easily drop the average if you want to monitor the exact temperature per host, which might be useful for an alarm when your servers run too hot. The average should give you an impression of how well each node is doing in terms of temperature.

avg(label_replace(node_hwmon_temp_celsius, "internal_ip", "$1", "instance", "(.*):.*") * on (internal_ip) group_left(node) kube_node_info) by (node)
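
As a sketch of such an alarm: replacing the average with the maximum catches the hottest sensor per node, which can then be compared against a threshold. The 75 °C threshold here is only an example value and needs tuning for your hardware.

max(label_replace(node_hwmon_temp_celsius, "internal_ip", "$1", "instance", "(.*):.*") * on (internal_ip) group_left(node) kube_node_info) by (node) > 75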

Container registry ranking

This query helps you find out which container registries are dominant in your clusters. It's especially useful because you can run it with a sum_over_time function to also pick up registries that might only show up as part of jobs and other periodic pods.

count(label_replace(kube_pod_container_info, "image", "$1", "image", "([^/]+)/.*")) by (image)
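
For example, wrapping the info metric in sum_over_time picks up everything that was running at some point during the window; the one-week range here is just an example value:

count(label_replace(sum_over_time(kube_pod_container_info[1w]), "image", "$1", "image", "([^/]+)/.*")) by (image)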

This query should be useful for detecting and maintaining a list of mirrored repositories for your clusters.

CPU Request utilisation

This query helps to analyse and optimise the CPU resource requests of your pods. It generates a percentage overview of how much of the requested CPU resources your pods are actually utilising. Since the Kubernetes scheduler uses the CPU requests to find a node that fits your workload, it's quite important that the majority of your CPU time is spent within the requests, to make sure that your applications never starve for CPU due to overloaded nodes.

(sum(rate(container_cpu_usage_seconds_total{image!="",job="kubelet",metrics_path="/metrics/cadvisor"}[1m])) by (namespace, pod)) / sum(kube_pod_container_resource_requests{resource="cpu"}) by (namespace, pod) * 100
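
If you just want to spot pods currently running above their request, the same expression can be filtered; the threshold of 100 (percent) is an example value:

(sum(rate(container_cpu_usage_seconds_total{image!="",job="kubelet",metrics_path="/metrics/cadvisor"}[1m])) by (namespace, pod)) / sum(kube_pod_container_resource_requests{resource="cpu"}) by (namespace, pod) * 100 > 100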

You can also use the first part on its own to measure the number of mCores you want to request:

sum(rate(container_cpu_usage_seconds_total{image!="",job="kubelet",metrics_path="/metrics/cadvisor"}[1m])) by (namespace, pod) * 1000

Now you can check this metric over a few weeks, or calculate an average using Prometheus, and this way figure out a good request value.
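
One way to let Prometheus calculate that average is a subquery; this sketch averages the per-pod mCore usage over one week at a 5-minute resolution, where both the range and the resolution are just example values:

avg_over_time((sum(rate(container_cpu_usage_seconds_total{image!="",job="kubelet",metrics_path="/metrics/cadvisor"}[1m])) by (namespace, pod) * 1000)[1w:5m])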

Note: You might want to adjust the requests upwards if, for example, booting up your application takes even more resources and your nodes are tightly packed.

Critical CVEs per host

This query utilises starboard, starboard-exporter and kube-prometheus-stack to give a rough estimate of the risk on a host.

sum by (node) (max by (namespace, pod) (max by (image)(label_join(label_join(starboard_exporter_vulnerabilityreport_image_vulnerability_severity_count{severity="CRITICAL"}, "image", "/","image_registry", "image_repository"), "image", ":", "image", "image_tag")) * on (image) group_right kube_pod_container_info) * on(namespace, pod) group_right kube_pod_info)

First it re-creates the image name (registry/repository:tag), then identifies the pods which run this image, and finally maps those pods to the host they are running on. The result is that one can count the CVEs on a host and assess some basic risks.

Note: This will double-count CVEs if e.g. 2 pods of the same deployment run on the same host. That's something to be aware of when trying to trust this metric. Modifying the query to look into specifics is therefore recommended.
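
As one possible modification, and only as a sketch: counting each image just once per node avoids the double-counting from identical pods. The joins are restructured so that the image label survives until the final aggregation; the inner label_join part is unchanged from the query above.

sum by (node) (max by (node, image) (max by (image)(label_join(label_join(starboard_exporter_vulnerabilityreport_image_vulnerability_severity_count{severity="CRITICAL"}, "image", "/", "image_registry", "image_repository"), "image", ":", "image", "image_tag")) * on (image) group_right kube_pod_container_info * on (namespace, pod) group_left(node) kube_pod_info))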