Unmountable EBS volumes
Today I investigated an interesting phenomenon: AWS EBS volumes refused to mount for some workloads in a Kubernetes cluster. There was no clear reason why: the attachments had been attempted, the error message didn’t indicate that something was fundamentally broken, and after a while the volumes usually got mounted anyway. But diving into the problem, it turned out to be a bug in Kubernetes, or rather the result of unmaintained code.
A colleague of mine hinted that the EBS volume attachment numbers looked odd, but the docs and the general behaviour suggested that everything was working fine: the scheduler takes EBS volume attachments into consideration these days, and all nodes had attachable-volumes-aws-ebs set in their node object.
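That value is visible in each node’s allocatable resources. As a minimal sketch (not part of the original debugging session), the following client-go program prints it, assuming a kubeconfig at the default location and with error handling reduced to panics for brevity:

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Print the attachable EBS volume count each node advertises.
	nodes, err := client.CoreV1().Nodes().List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, node := range nodes.Items {
		if quantity, ok := node.Status.Allocatable["attachable-volumes-aws-ebs"]; ok {
			fmt.Printf("%s: %s attachable EBS volumes\n", node.Name, quantity.String())
		}
	}
}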
The error message
As mentioned, the volumes were not attaching for the affected Pods for quite a while, and the events regarding this read: AttachVolume.Attach failed for volume "pvc-…" : rpc error: code = Internal desc = Could not attach volume "vol-…" to node "…": attachment of disk "vol-…" failed, expected device to be attached but was attaching
aws-ebs-csi-driver vs in-tree provider
First of all, there is the aws-ebs-csi-driver these days, a CSI (Container Storage Interface) driver that is installed in a cluster to provide storage and can be updated “independently” of the general Kubernetes version, providing better flexibility and a common, universal standard.
The in-tree provider is a volume provider that pre-dates the CSI standard and lives, as the name indicates, in the Kubernetes source tree as part of the controller manager; it is deprecated nowadays.
The first thing to check was therefore that the deployed PVCs (PersistentVolumeClaims) didn’t mix the two, since the Kubernetes scheduler doesn’t handle that well unless the experimental volume migration mode is used. In this cluster, only the in-tree provider was in use, for historic reasons.
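One quick way to see which side a storage class belongs to is to look at its provisioner: kubernetes.io/aws-ebs is the in-tree provisioner, ebs.csi.aws.com the CSI driver. A minimal client-go sketch (again assuming a kubeconfig at the default location, not the exact check that was run at the time):

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// List all storage classes and report which provisioner backs them.
	scs, err := client.StorageV1().StorageClasses().List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, sc := range scs.Items {
		switch sc.Provisioner {
		case "kubernetes.io/aws-ebs":
			fmt.Printf("%s: in-tree EBS provisioner\n", sc.Name)
		case "ebs.csi.aws.com":
			fmt.Printf("%s: EBS CSI driver\n", sc.Name)
		default:
			fmt.Printf("%s: other provisioner (%s)\n", sc.Name, sc.Provisioner)
		}
	}
}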
Checking metrics
In order to check the current EBS volume allocation on the cluster, the following Prometheus query was used to find the percentage of the attachable EBS volume capacity in use per node:
sum by (node) (
  sum by (namespace, persistentvolumeclaim) (kube_persistentvolumeclaim_info{storageclass=~"ebs-.*"})
  * on (namespace, persistentvolumeclaim) group_right sum by (namespace, persistentvolumeclaim, pod) (kube_pod_spec_volumes_persistentvolumeclaims_info)
  * on (namespace, pod) group_left (node) (kube_pod_info * on (namespace, pod) group_left sum by (namespace, pod) (kube_pod_status_phase{phase!~"(Pending|Succeeded|Unknown)"} > 0))
) / sum by (node) (kube_node_status_allocatable{resource="attachable_volumes_aws_ebs"})
This query takes the kube_persistentvolumeclaim_info metric, which holds the information about the storageclass of a PVC, filters for the PVCs that are part of the EBS storage classes1 and sums them up so that only the namespace and the name of the PVC remain in the vector. As a next step, this is multiplied, matched on namespace and PVC name, with the kube_pod_spec_volumes_persistentvolumeclaims_info metric, which contains the information about which Pod mounts which PVC and is reduced to namespace, PVC name and Pod name in the same way as the previous metric. The group_right adds the pod label to the resulting vectors.
To prepare the last multiplication, the kube_pod_info metric, which contains the information about which Pod runs on which Node, is multiplied by the kube_pod_status_phase metric, which is reduced to the phases that are relevant for the actual volume allocation, like Running and Failed. This way, Pods from Jobs and the like are filtered out. This information is then multiplied with the result of the previous steps, again matched on namespace and Pod name, which adds the node label to the resulting vectors.
Finally, this entire construct is summed by the node label to get the exact count of EBS PVCs per node, and divided by the number of allocatable EBS volume attachments per node to provide an easier-to-comprehend percentage.
Another thing to note here is that this query only works for querying the in-tree provider usage; CSI drivers announce their allocation capacity by providing a CSINode object corresponding to each Node object. This can be seen in this function of the scheduler.
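For clusters that already use the CSI driver, the per-node limit could be read from those CSINode objects instead. A minimal client-go sketch of that, under the same kubeconfig assumption as above:

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Each CSINode mirrors a Node and lists, per installed CSI driver,
	// how many volumes that driver can attach to the node.
	csiNodes, err := client.StorageV1().CSINodes().List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, cn := range csiNodes.Items {
		for _, driver := range cn.Spec.Drivers {
			if driver.Allocatable != nil && driver.Allocatable.Count != nil {
				fmt.Printf("%s: %s can attach %d volumes\n", cn.Name, driver.Name, *driver.Allocatable.Count)
			}
		}
	}
}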
Docs and the suspicious numbers
When collecting the metrics, it was a bit suspicious that most instances had an attachable_volumes_aws_ebs count of 25, while a certain instance type had one of 39. This is not completely unexpected, since non-Nitro types can attach more than 25 EBS volumes.2
The AWS documentation indicated that all Nitro instances should only get 27 at most. Consulting the OpenShift documentation, however, made it quite clear that OpenShift is instance-type aware and will select the correct values automatically, as long as one doesn’t mix the CSI and in-tree drivers.
Docs are liars, only the code knows the truth
After going back and forth with no proper explanation for the situation and this mismatch between AWS documentation, OpenShift documentation and reality, it was time to dig into the source code.
The first target was obviously the code that sets the allocation in the node object, and the problem became apparent immediately: the code detects the volume limit by matching the instance type against a regular expression and, based on that, decides whether to set 25 or 39 as the limit. Nitro instances get 25, everything else 39.
The regular expression reads ^[cmr]5.*|t3|z1d, but the instance type in the cluster started with m6 and therefore could not match this regular expression, despite being an AWS Nitro instance. As a result, it got the non-Nitro limit of 39. Bummer.
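To see the mismatch in isolation, here is a small Go sketch that runs a few instance types through that exact regular expression (m6i.2xlarge is just an example of a newer Nitro generation, not necessarily the type from the cluster, and the 25/39 output mirrors the limits described above rather than the actual Kubernetes code):

package main

import (
	"fmt"
	"regexp"
)

func main() {
	// The pattern used by the in-tree limit detection to recognise Nitro instances.
	nitro := regexp.MustCompile("^[cmr]5.*|t3|z1d")

	for _, instanceType := range []string{"m5.2xlarge", "t3.medium", "m6i.2xlarge"} {
		if nitro.MatchString(instanceType) {
			fmt.Printf("%s: matched as Nitro, volume limit 25\n", instanceType)
		} else {
			fmt.Printf("%s: treated as non-Nitro, volume limit 39\n", instanceType)
		}
	}
	// m6i.2xlarge falls through to the non-Nitro branch even though it is a Nitro type.
}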
The future will be brighter
This is what one gets for using deprecated software with modern instance types, right? Well, almost. This detection code was actually the same one used by the aws-ebs-csi-driver up until March 2022, when someone else noticed the same problem and fixed it by replacing the regular expression with a whole module for parsing AWS instance types.
The good news is that all that needs to be done is switching to the CSI driver, and things are fixed! Sadly, this CSI driver fix will only be available in OpenShift 4.12, which isn’t released yet.
Conclusion
Docs are good, code is better. It’s always worth reading the code if you want to figure out a problem in infrastructure. This also highlights why, independent of the license, the availability of code makes life much easier for operations, because it takes quite some guesswork out of the debugging process.3
Don’t trust the docs, validate claims and read the code; only this way can you understand and be sure about the problem.
PS: For those wondering: yes, there were also logs involved in the debugging process, as well as other steps; these were skipped to keep this article short and useful.
- Be aware that in this example all storage classes for EBS start with the prefix ebs-; this might not be the case in your environment. ↩
- Technically speaking, AWS Nitro types are actually limited to 27 attachments in general, which includes NICs and system volumes. ↩
- Greetings to all Windows admins! ↩