The DNS Negative Cache
Consider the DNS query chain:
- A host queries a local recursive server to find out about banana.example
- The server queries the root server, then recursively the authoritative server, looking for this domain name
- banana.example does not exist
There are actually two possible responses in this chain of queries. The .example domain might not exist at all; in this case, the root server will return a server not found error. On the other hand, .example might exist while banana.example does not; in this case, the authoritative server will return an NXDOMAIN response indicating the subdomain does not exist.
Assume another host, a few moments later, also queries for banana.example. Should the recursive server request the same information all over again for this second query? It will unless it caches the failure of the first query; this is the negative cache. The negative cache reduces load on the overall system, but it can also be considered a bug.
Take, for instance, the case where you set up a new server, assign it banana.example, then jump to a host and try to connect to the new server before the new DNS information has propagated through the system. On the first query, the local recursive server will cache the nonexistence of banana.example, and you will need to wait until this negative cache entry times out before you can reach the newly configured server. If the time required to propagate the new DNS information is two seconds, you query after one second, and the negative cache timeout is sixty seconds, the entry will not expire until sixty-one seconds after the change, costing you fifty-nine seconds beyond the point at which the new record became available.
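The behavior described above can be sketched as a toy resolver. This is only an illustration, not how any particular DNS server is implemented: the class names, the `upstream` callable, the fixed sixty-second timeout, the injected clock, and the 192.0.2.10 address are all assumptions made for the example.

```python
import time


class NegativeCache:
    """Remembers names that recently failed to resolve."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._expires_at = {}  # name -> time at which the entry expires

    def remember_failure(self, name):
        self._expires_at[name.lower()] = self.clock() + self.ttl_seconds

    def is_known_missing(self, name):
        key = name.lower()
        expires = self._expires_at.get(key)
        if expires is None:
            return False
        if self.clock() >= expires:
            del self._expires_at[key]  # entry timed out; forget it
            return False
        return True


class RecursiveResolver:
    """Toy resolver; `upstream` is a hypothetical callable returning an address or None."""

    def __init__(self, upstream, negative_ttl=60, clock=time.monotonic):
        self.upstream = upstream
        self.negative = NegativeCache(negative_ttl, clock)
        self.upstream_queries = 0

    def resolve(self, name):
        if self.negative.is_known_missing(name):
            return None  # answered from the negative cache, no upstream query
        self.upstream_queries += 1
        answer = self.upstream(name)
        if answer is None:
            self.negative.remember_failure(name)
        return answer


now = [0.0]  # fake clock so the demo does not need to sleep
zone = {}    # banana.example has not been published yet
resolver = RecursiveResolver(zone.get, negative_ttl=60, clock=lambda: now[0])

resolver.resolve("banana.example")         # miss goes upstream and is cached
zone["banana.example"] = "192.0.2.10"      # the record now exists upstream
print(resolver.resolve("banana.example"))  # None: still negatively cached
now[0] = 60.0
print(resolver.resolve("banana.example"))  # 192.0.2.10 once the entry expires
```

The second lookup never reaches the upstream server, which is both the load reduction and the stale negative answer described above.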
How long will a recursive server keep a negative cache entry? The answer depends on the kind of response it received in its initial attempt to resolve the name. If server not found is the response, then the negative cache timeout is locally configured. If an NXDOMAIN response is returned, the negative cache timeout is taken from the SOA record included in that response (per RFC 2308, the lesser of the SOA record's own TTL and its MINIMUM field).
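As a rough sketch of that decision, the function below picks a negative cache timeout. The `SOA` dataclass, the `local_default` value, and the optional `local_cap` parameter are hypothetical names introduced for the example. The minimum-of-two-values rule follows RFC 2308, and the cap reflects the common practice of resolvers bounding negative TTLs locally.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SOA:
    ttl: int      # TTL of the SOA record as returned with the NXDOMAIN response
    minimum: int  # the SOA MINIMUM field


def negative_ttl(soa: Optional[SOA], local_default: int = 60,
                 local_cap: Optional[int] = None) -> int:
    """Return how many seconds a negative answer may be cached."""
    if soa is None:
        # No SOA to go on, so fall back to the locally configured timeout.
        return local_default
    ttl = min(soa.ttl, soa.minimum)  # RFC 2308: lesser of SOA TTL and MINIMUM
    if local_cap is not None:
        ttl = min(ttl, local_cap)    # assumed local upper bound
    return ttl


print(negative_ttl(SOA(ttl=3600, minimum=300)))  # 300
print(negative_ttl(None))                        # 60
```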
So, the first point about negative caching in DNS: if you are dealing with a local DNS server for internal lookups on a data center fabric or campus network, turning off negative caching for the local domains might improve the performance of applications and the network in general. DNS turnaround times can be a major bottleneck in application performance. In turning off negative caching for local resources, you are trading processing power on your DNS server against reduced turnaround times, particularly when a new server or service is brought up.
The way a negative cache is built, however, seems to allow for a measure of inefficiency. Assume three subdomains exist as part of .example.
A host queries for banana.example, and the recursive server, on receiving an NXDOMAIN response indicating this subdomain does not exist, builds a negative cache entry for banana.example. A few moments later, some other host (or the same host) queries for cantaloupe.example. Once again, the recursive server discovers this subdomain does not exist, and builds a negative cache entry. If the point of the negative cache is to reduce the workload on the DNS system, it does not seem to be doing its job. A given host, in fact, could consume a good deal of processing power by requesting one nonexistent domain after another, forcing the recursive server to discover whether or not each subdomain exists.
RFC 8198 proposes a way to resolve this problem by having the recursive server make fuller use of the information included in the response. Specifically, given DNSSEC-signed zones (to ensure no one is poisoning the cache to force the building of a large negative cache in the recursive server), an answering DNS server provides, in NSEC records, the two existing domain names on either side of the missing queried domain name.
In this case, a host queries for banana.example, and the server responds with the pair of existing subdomains surrounding the requested subdomain, the upper of which is orange.example. Now when the recursive server receives a request for cantaloupe.example, it can look into its negative cache and immediately see there is no such domain in the place where it should exist. The recursive server can now respond that the name does not exist without sending queries to any other upstream server.
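A minimal sketch of this range-based lookup follows, under several simplifying assumptions: every name is a direct child of .example, ordering is approximated by comparing the lowercased leftmost label, and DNSSEC validation is not modeled. The class and function names are hypothetical, and apple.example is an assumed name for the lower neighbor, chosen only so the example has a complete range.

```python
def leftmost_label(name):
    """Lowercased first label, e.g. 'Banana.example.' -> 'banana'."""
    return name.lower().rstrip(".").split(".")[0]


class AggressiveNegativeCache:
    """Stores gaps between existing names rather than individual missing names."""

    def __init__(self):
        self._gaps = []  # list of (lower_label, upper_label) pairs

    def remember_gap(self, lower_name, upper_name):
        self._gaps.append((leftmost_label(lower_name), leftmost_label(upper_name)))

    def proves_missing(self, name):
        label = leftmost_label(name)
        return any(lower < label < upper for lower, upper in self._gaps)


def make_authoritative(zone):
    """Toy signed zone: returns (address, None) or (None, (previous_name, next_name))."""
    names = sorted(zone, key=leftmost_label)

    def answer(name):
        if name in zone:
            return zone[name], None
        label = leftmost_label(name)
        before = [n for n in names if leftmost_label(n) < label]
        after = [n for n in names if leftmost_label(n) > label]
        if before and after:
            return None, (before[-1], after[0])
        return None, None  # the wrap-around at the ends of the zone is not modeled

    return answer


class Resolver:
    def __init__(self, authoritative):
        self.authoritative = authoritative
        self.cache = AggressiveNegativeCache()
        self.upstream_queries = 0

    def resolve(self, name):
        if self.cache.proves_missing(name):
            return None  # the name falls inside a cached gap
        self.upstream_queries += 1
        address, gap = self.authoritative(name)
        if address is None and gap is not None:
            self.cache.remember_gap(*gap)
        return address


# apple.example is an assumed lower neighbor; the addresses are placeholders.
zone = {"apple.example": "192.0.2.1", "orange.example": "192.0.2.2"}
resolver = Resolver(make_authoritative(zone))
resolver.resolve("banana.example")      # goes upstream and learns the gap
resolver.resolve("cantaloupe.example")  # answered from the cached gap
print(resolver.upstream_queries)        # 1
```

One cached gap answers every name that sorts between its two endpoints, which is why a stream of made-up names inside that range no longer reaches the upstream servers.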
This aggressive form of negative caching can reduce the workload of upstream servers and close an attack surface that might be used for denial-of-service attacks.