Zombie Data Came Back to Life to Kill Me 🧟‍♂️

So, you're minding your business during an upgrade window. After everything comes back up with green lights, your SIEM starts firing alerts of 100% SERVFAIL for your internal authoritative zones (ouch). Indeed, all ~1,000 of your locations are now completely cut off from internal applications including authentication. How did we find ourselves here?

In July 2024, ISC shipped a security fix for BIND to mitigate CVE-2024-1737. The root cause of this denial-of-service vulnerability was a single owner name accumulating more than 100 resource records (RRs). The database serving those queries slows to a crawl when it has to pack 100+ RRs into a response. A client that gets a truncated UDP answer retries the query over TCP, and depending on how many records there are and what data they carry, you can rapidly exceed what fits in a single DNS message (65,535 bytes, even over TCP).

ISC's remediation of this vulnerability introduced two configurable limits:

max-records-per-type, default of 100 (records per owner name, per RR type)
max-types-per-name, default of 100 (distinct RR types per owner name)

What happens should you exceed these defaults? Well, ISC's Knowledge Base article describes, "For authoritative servers, what this means in practice is that (by default) BIND will now not load zones containing RRsets where a single owner name has greater than 100 records of the same type." The same is true for types per name. The zone doesn't load, so the server fails to answer for every record in that zone, not just the offending name.
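To make those two limits concrete, here is a rough sketch of a pre-upgrade check that counts records per owner name and type. It is not an ISC or Infoblox tool. It assumes the zone has already been dumped to one fully expanded record per line (owner, TTL, class, type, rdata), and the function name find_limit_violations is made up for illustration.

```python
from collections import defaultdict

# Defaults described above; adjust if your servers override them.
MAX_RECORDS_PER_TYPE = 100
MAX_TYPES_PER_NAME = 100


def find_limit_violations(lines, max_records=MAX_RECORDS_PER_TYPE,
                          max_types=MAX_TYPES_PER_NAME):
    """Return (oversized_rrsets, overloaded_names).

    Assumption: each line is '<owner> <ttl> <class> <type> <rdata...>'.
    Anything that doesn't fit that shape is skipped, not parsed.
    """
    rrset_sizes = defaultdict(int)    # (owner, rrtype) -> record count
    types_by_name = defaultdict(set)  # owner -> distinct RR types

    for line in lines:
        line = line.strip()
        if not line or line.startswith(";"):
            continue  # blank line or comment
        fields = line.split()
        if len(fields) < 5:
            continue  # not a full record in the assumed format
        owner, rrtype = fields[0].lower(), fields[3].upper()
        rrset_sizes[(owner, rrtype)] += 1
        types_by_name[owner].add(rrtype)

    oversized = {key: n for key, n in rrset_sizes.items() if n > max_records}
    overloaded = {owner: len(types) for owner, types in types_by_name.items()
                  if len(types) > max_types}
    return oversized, overloaded


if __name__ == "__main__":
    # Synthetic data shaped like the incident: 2,001 A records on one name.
    sample = [f"myhost.example.com. 3600 IN A 10.0.{i // 256}.{i % 256}"
              for i in range(2001)]
    sample.append("ok.example.com. 3600 IN A 10.1.0.1")
    oversized, overloaded = find_limit_violations(sample)
    print(oversized)  # {('myhost.example.com.', 'A'): 2001}
```

Anything this reports is a zone that would refuse to load under the new defaults, which is the kind of thing you want to learn in a lab rather than in production.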

For this customer, an automation was created in 2021 to bridge the gap between public cloud assets (not DHCP-eligible) and their on-prem DNS zone, because a native mechanism for DDNS didn't yet exist. As assets were created, a script added them to the zone.

The script's error handling lacked a process lock, so one day a single host needed over 2,000 runs before it completed, and each run left a record behind. The result: 2,000+ A records for myhost.example.com. Nobody touched that data afterward.
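The customer's script isn't shown here, so the following is only a sketch of the two guards that were missing: a lock so runs can't pile up, and a "replace, don't append" update so retries can't stack records. The names single_run_lock and set_address are hypothetical, and the in-memory dict stands in for the real zone. A real version would send a dynamic update that deletes the existing RRset before adding the new address.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def single_run_lock(path):
    """Hold an exclusive lock file for the duration of one run.

    Raises FileExistsError if another run already holds the lock.
    """
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        os.write(fd, str(os.getpid()).encode())
        yield
    finally:
        os.close(fd)
        os.remove(path)


def set_address(zone, owner, address):
    """Replace the A RRset for owner with a single address (never append)."""
    zone[(owner, "A")] = {address}


if __name__ == "__main__":
    zone = {}  # hypothetical stand-in for the real DNS zone
    with tempfile.TemporaryDirectory() as tmp:
        lock_path = os.path.join(tmp, "cloud-dns-sync.lock")
        with single_run_lock(lock_path):
            # Simulate 2,000 retries for the same host.
            for attempt in range(2000):
                addr = f"10.0.{attempt // 256}.{attempt % 256}"
                set_address(zone, "myhost.example.com.", addr)
    print(len(zone[("myhost.example.com.", "A")]))  # 1
```

One caveat with a plain lock file: if the process is killed outright, the file stays behind and blocks the next run until someone removes it.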

The customer opted (accidentally) to deploy the software upgrade to all servers at once. If the record count was the gasoline-soaked rags made of dynamite, this was the match. There were two ways out. One was to modify the new options, which requires a CLI change on each server and is difficult to script on short notice. The other was to delete a massive amount of data, which is risky when you can't tell mid-outage whether any of it is still in use.

Infoblox created DNS Advisor (free) to reveal a number of operational errors in DNS configurations, including zones where scavenging isn't in use.

It took two hours or so to get the fix in place. That's about $9M in lost productivity for a problem a free lab would've revealed pre-upgrade. I'll never forget meeting the CXO for the first time during the RCA presentation. His opening remark: "Folks, this better be good. This is the worst outage I've experienced in my 40-year IT career." Indeed.