Sheogorath's Blog

Hetzner and Cloudflare

Today I learned that Hetzner relies on Cloudflare to work. The Hetzner API suddenly stopped responding and returned an HTTP 503, and all three of their nameservers, 213.133.99.99, 213.133.100.100 and 213.133.98.98, went down, because it seems they are not proper recursive nameservers but just forwarders to Cloudflare’s 1.1.1.1. Generally speaking, things fell apart all over the place. The good news is that my servers continued running. Sadly, my entire CI environment went down, as both DigitalOcean and, as it now turns out, Hetzner rely heavily on the availability of Cloudflare to function.

Cloudflare had an outage late this evening. To my personal annoyance, this resulted in various services of mine no longer working, even though they were entirely independent of Cloudflare. And that resulted in this post.

Lessons learned: Don’t rely on your hoster’s DNS servers. Run your own, so you know they’ll work, or simply use 1.1.1.1, 8.8.8.8 and 9.9.9.9. That is still more reliable than what Hetzner provides by default, which is 3(!) DNS servers set up as forwarders to 1(!) DNS provider for literally no reason.
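If you want to see for yourself which resolvers are answering during an incident like this, a small check along these lines might help. This is only a rough sketch using the Python standard library: the function names (`build_query`, `resolver_answers`, `check_all`), the test hostname `example.com` and the 2 second timeout are my own assumptions, and it only tests whether a resolver returns a successful answer to a single A query over UDP.

```python
import socket
import struct

# Resolver addresses mentioned in this post; adjust the list to your setup.
RESOLVERS = ["213.133.99.99", "213.133.100.100", "213.133.98.98",
             "1.1.1.1", "8.8.8.8", "9.9.9.9"]


def build_query(hostname: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query packet asking for an A record."""
    # Header: ID, flags (recursion desired), QDCOUNT=1, other counts 0
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: length-prefixed labels, terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=A (1), QCLASS=IN (1)
    return header + qname + struct.pack("!HH", 1, 1)


def resolver_answers(server: str, hostname: str = "example.com",
                     timeout: float = 2.0) -> bool:
    """Return True if `server` replies over UDP with RCODE 0 (no error)."""
    query = build_query(hostname)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, 53))
            data, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts and unreachable networks
            return False
    # The reply must echo our query ID; the low 4 bits of byte 3 are the RCODE
    return len(data) >= 12 and data[:2] == query[:2] and (data[3] & 0x0F) == 0


def check_all(servers=RESOLVERS) -> dict:
    """Map each resolver address to True (answering) or False."""
    return {server: resolver_answers(server) for server in servers}
```

Calling `check_all()` returns a dictionary of resolver address to `True` or `False`. A resolver that times out or returns SERVFAIL shows up as `False`, so this tells you that something is broken, not why.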

Correction: After the DNS servers came back, I did some further testing. It seems my assumption that they forward DNS requests to 1.1.1.1 was wrong. The unavailability during the Cloudflare outage had other reasons. I still think it somehow originated in the Cloudflare outage, which I think might make the whole situation even worse.

Update 2020-07-20: Today Hetzner published a follow-up on the situation:

“We have since found the underlying cause of the DNS issue. Because of a faulty router configuration at Cloudflare, a large number of domains on Cloudflare’s authoritative DNS servers could not be resolved. Our system, which includes active and standby recursive name servers, is designed to optimize availability. But the fault originating at Cloudflare overwhelmed our system. The load on our recursive name servers skyrocketed by approximately 1000 %, and this caused our recursive name servers to go down.”