[dnstap] dnstap with auth/recursive servers

Sat Sep 12 01:21:51 UTC 2015

Hi, Evan:

This email is a bit long, sorry about that.  I try to go into some
detail below about what I was thinking when originally developing the
dnstap schema.  Thanks for making me write this down.

Evan Hunt wrote:
> I'm working on a BIND implementation of dnstap (targeted for BIND 9.11.0,
> early 2016), and have run into a problem.  How should I differentiate
> between AUTH_{QUERY,RESPONSE} and CLIENT_{QUERY,RESPONSE} when the server
> is configured to be both authoritative and recursive?
> 
> If a query arrives with RD=1, I can log it as a CQ, but then it might
> be answered authoritatively, in which case I might log it as AR, but it
> seems strange for the query and response to be unbalanced like that.

This is a good question, and one that hasn't come up before in previous
server implementations of dnstap in Unbound and Knot, since Unbound is
caching/forwarding only, and Knot is authoritative only.

There isn't a really good reason to enforce that the query and its
corresponding response be "paired" in terms of Message.Type values
(other than symmetry, I guess).  Adopting Joe's "message_tag" proposal
might make it slightly easier to locate a query/response pair from a
dnstap log.

How "malleable" is the runtime configuration of BIND with regard to
whether authoritative, recursive, or mixed mode service is being
provided?  (IIRC, weren't there some rndc "addzone" and "delzone"
commands added at some point?)

Your hypothetical here is a server that's been configured for mixed-mode
service.  What about the other two cases, where a server is configured
only for recursive service, or only for authoritative service?  Is there
a global variable that indicates whether the server has been configured
for recursive-only vs authoritative-only service?  (That is, is it
straightforward for BIND to make good use of the AUTH_QUERY and
CLIENT_QUERY values when it's not running in mixed mode?)

> I could postpone logging the query until I've determined whether we have
> an authoritative answer, but by that time I'd already be sending a
> response, and AQ and AR messages would be emitted almost simultaneously.

Yeah, ideally a DNS server should emit its dnstap log messages as early
as possible (but in the case of responses, *after* the response has been
sent, because logging should take a secondary priority to providing name
service).  For instance, the Unbound dnstap implementation generates CQs
before even doing basic sanity and ACL checks on the message, but this
is because we can make the simplification that all inbound queries
processed by Unbound will be marked as CLIENT_QUERYs.

But in BIND's case, you might end up traversing a fair number of data
structures before being able to determine how the query should be
classified, right?  That strikes me as less than optimal, but as long
as you can emit the log message without waiting on cache misses to be
filled, it seems that it would still be desirable to be able to
accurately classify the inbound query.

> It seems the best solution would be to log all RD=1 queries as CQ and
> their responses as CR, and all RD=0 queries as AQ and their responses as
> AR, and to extend the CR message to indicate whether the response was
> authoritative.

Hm, so, I intentionally tried to not define the Message.Type enum's
*_QUERY values based solely on the RD bit in the query message, because
of the corner cases:

(1) A recursive-only server receiving an RD=0 query is processing a
"cache snooping" request.  It might answer from cache without performing
recursion, or REFUSE it based on policy (e.g. Unbound without the
"allow_snoop" ACL set), etc.  A mixed-mode server might also process
these queries via the cache if they don't match an authoritative zone.
So, such a query shouldn't be classified as an AUTH_QUERY based solely on
the RD bit, because it's not necessarily being processed as if it were a
request for authoritative service.

(2) An authoritative-only server receiving an RD=1 query is
processing...  well, I don't think there's a cute name for it, but you
usually get back a response without the RA bit set that's identical to
what you would have received if the RD bit were cleared.  (The most
common cause of this is probably people running something like "dig
@<AUTH-SERVER> ..." without setting +norec, because after all, it still
works even if you don't set +norec, right?)  So, it shouldn't be
classified as a CLIENT_QUERY based solely on the RD bit, because it's
not being processed as if it were a recursion-desired query.
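
To make the two corner cases concrete, here is a minimal Python sketch
of classifying an inbound query by the service that will handle it
rather than by its RD bit.  The function name, its parameters, and the
rule that an authoritative-only server always yields AUTH_QUERY are
assumptions made for illustration; they are not part of the dnstap
schema or of any server's code.

```python
from enum import Enum


class MessageType(Enum):
    # Only the two values discussed here; names follow the dnstap schema.
    AUTH_QUERY = "AUTH_QUERY"
    CLIENT_QUERY = "CLIENT_QUERY"


def classify_query(rd: bool, serves_recursion: bool,
                   matches_auth_zone: bool) -> MessageType:
    """Classify an inbound query by how the server will process it.

    `rd` is accepted only to make the point that it is never consulted.
    """
    del rd  # intentionally ignored (hypothetical parameter)
    # Assumption: a query for a locally served zone, or any query to a
    # server that offers no recursion, takes the authoritative path.
    if matches_auth_zone or not serves_recursion:
        return MessageType.AUTH_QUERY
    # Otherwise it is answered from the cache or by recursing.
    return MessageType.CLIENT_QUERY


# Case (1): cache snooping, RD=0 sent to a recursive-only server.
snoop = classify_query(rd=False, serves_recursion=True,
                       matches_auth_zone=False)

# Case (2): RD=1 sent to an authoritative-only server.
dig_without_norec = classify_query(rd=True, serves_recursion=False,
                                   matches_auth_zone=True)
```

In this sketch, case (1) comes out as CLIENT_QUERY and case (2) as
AUTH_QUERY, the opposite of what an RD-only rule would produce.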

I originally thought of each Message.Type value as representing a unique
code site inside the nameserver implementation, and corresponding to
separate dnstap logging config knobs that could be independently enabled
or disabled.  (So, for instance, suppose you were interested in passive
DNS replication.  You could enable logging RESOLVER_RESPONSEs but leave
RESOLVER_QUERYs disabled, since the query is largely redundant for that
use case, anyway.)
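
The "one knob per Message.Type" idea might look roughly like the Python
sketch below.  TapConfig, its log() method, and the sink callback are
hypothetical names invented for this example; no existing dnstap
implementation is claimed to expose this interface.

```python
from typing import Callable, Iterable, Set, Tuple


class TapConfig:
    """Hypothetical per-type logging switches, one per Message.Type."""

    def __init__(self, enabled: Iterable[str]) -> None:
        self.enabled: Set[str] = set(enabled)

    def log(self, message_type: str, wire: bytes,
            sink: Callable[[Tuple[str, bytes]], None]) -> bool:
        """Pass the message to `sink` only if its type is enabled."""
        if message_type not in self.enabled:
            return False
        sink((message_type, wire))
        return True


# Passive DNS replication: keep resolver responses, skip resolver queries.
captured = []
config = TapConfig(["RESOLVER_RESPONSE"])
config.log("RESOLVER_QUERY", b"query bytes", captured.append)
config.log("RESOLVER_RESPONSE", b"response bytes", captured.append)
```

After these two calls, only the RESOLVER_RESPONSE entry has reached the
sink.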

This concept of Message.Type values corresponding to specific code sites
broke down a bit when I actually implemented dnstap in Unbound and found
that the same code paths were used for both RESOLVER_* and FORWARDER_*,
and there wasn't a good way to distinguish between the two cases, other
than by actually inspecting the RD bit [0,1] of the query that Unbound
was sending out.  I think it's OK to make this classification (compared
to the corner cases above) because the specific RD bit being inspected
here is always under the control of the server and it's correct 100% of
the time; there aren't any corner cases, AFAIK.

[0] https://github.com/jedisct1/unbound/blob/cbe0bdb67691fb8bfa9fa869e1da61389479c150/dnstap/dnstap.c#L420-L429

[1] https://github.com/jedisct1/unbound/blob/cbe0bdb67691fb8bfa9fa869e1da61389479c150/dnstap/dnstap.c#L471-L480
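
As an illustration of that RD-bit check on outbound queries, here is a
minimal Python sketch (not the Unbound C code linked above).  It assumes
that an outgoing query with RD=1 is headed to a forwarder and one with
RD=0 is an iterative query to an authoritative server; the function name
is made up for the example.

```python
import struct

RD_FLAG = 0x0100  # RD bit within the 16-bit flags word of the DNS header


def classify_outbound_query(wire: bytes) -> str:
    """Pick a Message.Type name from the RD bit of a query we send out."""
    if len(wire) < 12:
        raise ValueError("DNS message shorter than the 12-byte header")
    # The flags word follows the 16-bit message ID, so it sits at offset 2.
    (flags,) = struct.unpack_from("!H", wire, 2)
    return "FORWARDER_QUERY" if flags & RD_FLAG else "RESOLVER_QUERY"


# Two bare 12-byte headers that differ only in the RD bit.
to_forwarder = struct.pack("!HHHHHH", 0x1234, RD_FLAG, 1, 0, 0, 0)
to_authority = struct.pack("!HHHHHH", 0x1234, 0x0000, 1, 0, 0, 0)
forwarder_type = classify_outbound_query(to_forwarder)
resolver_type = classify_outbound_query(to_authority)
```

Because the server builds these queries itself, the bit it reads back
is one it set on purpose, which is why the check is reliable here and
not for inbound queries.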

I think we should try to accurately classify the response messages (AR
vs CR) according to how they're actually processed in the server, and
not based on what the query header bits look like.  So I think I'm
leaning towards recommending postponing AQ/CQ logging until you know A
vs C, or possibly introducing an indeterminate "QUERY" type that just
represents a generic query received by a responder.
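
Postponed logging could be sketched in Python as below: capture the
query and its arrival time immediately, and pick the type only once the
server knows how it will answer.  PendingQuery, emit(), and the sink
callback are hypothetical, and the generic "QUERY" fallback is the
indeterminate type floated above, not a value in the current schema.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Optional, Tuple


@dataclass
class PendingQuery:
    """An inbound query held until its classification is known."""

    wire: bytes
    # Record the arrival time now so that a late emit keeps it.
    received_at: float = field(default_factory=time.time)

    def emit(self, is_authoritative: Optional[bool],
             sink: Callable[[Tuple[str, float, bytes]], None]) -> str:
        """Log the query with the type chosen after classification."""
        if is_authoritative is None:
            message_type = "QUERY"  # hypothetical indeterminate type
        elif is_authoritative:
            message_type = "AUTH_QUERY"
        else:
            message_type = "CLIENT_QUERY"
        sink((message_type, self.received_at, self.wire))
        return message_type


log = []
pending = PendingQuery(wire=b"\x00" * 12)  # captured on receipt
# ... zone lookup happens here; suppose the name is in a local zone ...
pending.emit(is_authoritative=True, sink=log.append)
```

The log entry is written late, but it carries the time at which the
query was received rather than the time at which it was classified.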

> A suggestion was already made in
> http://lists.redbarn.org/pipermail/dnstap/2015-February/000017.html
> to extend the CR message to differentiate between cache hits and misses.
> I'd like to piggyback on that suggestion, and propose this, to be added
> as an optional field in the Message type.
> 
>         enum DataSource {
>                 // all data used to generate this response
>                 // are from local authoritative sources.
>                 AUTH_DATA = 1;
> 
>                 // this response was generated from a 
>                 // cache of previously-sent whole DNS responses.
>                 MESSAGE_CACHE = 2;
> 
>                 // this response was generated by consulting
>                 // a cache of DNS records, but without sending
>                 // iterative queries
>                 RECORD_CACHE = 3;
> 
>                 // at least one iterative query was sent in
>                 // the construction of this response
>                 RECURSION = 4;
>         };
> 
> Thoughts?

Is that a replacement for the original CacheStatus enum in the message
you reference?  Where did the "cache miss" value go?

-- 
Robert Edmonds