[dnstap] message tagging (was: Re: suggested optional fields for DNSTAP)

Thu Mar 12 01:44:37 UTC 2015

Hi, Joe:

Joseph Gersch wrote:
> > On Feb 27, 2015, at 3:52 PM, Robert Edmonds <edmonds at mycre.ws> wrote:
> > Joseph Gersch wrote:
> >>   The second one is to generate a unique GUID for each CLIENT_QUERY and store it with that query.  This GUID would also be stored with each RESOLVER_QUERY and RESOLVER_RESPONSE.  This would allow an analysis of a DNS TRACE to determine operational issues with long recursive resolutions.  It is insufficient to just have bailiwick or domain name, because once the recursive resolution starts chasing a CNAME or chain of NS delegations, the domain name changes.  Some recursions can take 10-70 lookups to get full resolution.  Having a GUID to tie them all together would be very useful.
> > 
> > This is a great use case, but it sounds like it might be a bit hard to
> > implement, at least in the recursive DNS server I'm most familiar with
> > (Unbound).
> 
> I don’t think you should limit the schema based on what’s easy or hard.  Some resolvers will be able to do this.  And since the field is optional, it won’t add any overhead to the byte stream if a particular DNS Server implementation decides to not do it.  But in general, if a DNS Server can do it, this would be very useful information.

Hm, that is a fair point.  If a new field is theoretically implementable
in more than one DNS server, and reasonable to implement in at least
one, and there's a good use case for it, it ought to be fair game for
dnstap.

I say "theoretically implementable in more than one", because I want to
exclude from the scope of dnstap (or, at least relegate behind the
'extra' field) the types of log messages that would more accurately be
called "software tracing" rather than "event logging", e.g.:

    http://en.wikipedia.org/wiki/Tracing_%28software%29#Event_logging_versus_tracing

At least, keeping "software tracing" out of the mainline schema
theoretically keeps us more implementation-neutral :-)

> If not specified in the schema, I would have two choices:  use the EXTRA field, or extend the schema myself to add the additional field.  I’m nervous about this latter approach.  I don’t see an easy way for multiple vendors to extend the schema without stepping on each other.  For example, if I add a field #17  and say it is to be used as a proprietary extension for some purpose, and another vendor also decides to extend the schema and use field #17, then a mess will ensue.  I don’t think there is a way for multiple vendors to do independent extensions like they do with OID’s in SNMP.   So I would prefer that the first official version of the schema contain fields that are general, and we have an agreement method for extensions.  And since all fields are optional, it won’t hurt any particular DNS SERVER to simply choose not to implement particular fields.

Technically, there *is* an "extensions" mechanism in protobuf that
sort of covers this particular use case, see:

    https://developers.google.com/protocol-buffers/docs/proto#extensions

However, there are some major caveats:

    - It's never actually been implemented in protobuf-c.

    - nanopb supports extensions, though IIRC, with a different
      interface than is used for regular message fields.

    - The upcoming "protobuf 3" release from Google comes with a major
      new revision to the protobuf data model, and that revision removes
      extensions.  (It will apparently continue to support protobuf 2
      semantics with a mode switch.)

So I think protobuf extensions are probably not a viable solution here.

Unused optional message fields aren't entirely free; an optional 'bytes'
field adds 24 bytes to the protobuf-c generated struct on amd64.  If we
wanted to add a *lot* of new optional fields, it would probably be best
to break the schema down into optional sub-messages.  (The Unbound
dnstap implementation allocates protobuf-c structs on the stack to
reduce the number of malloc()'s in the fast path, so it would be nice to
keep the base structs to a "reasonable" size.)
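
To make the size argument concrete, here is a rough sketch of where the
24 bytes come from.  The types below are local stand-ins that mirror
what protobuf-c generates for an optional 'bytes' field (an int presence
flag plus a length/pointer pair); they are not the real protobuf-c
headers, and "msg_base" / "msg_tagged" are hypothetical messages
invented for the illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins for the protobuf-c types (assumed layout, for illustration). */
typedef int pb_boolean;                 /* protobuf_c_boolean is an int */
typedef struct {
    size_t   len;
    uint8_t *data;
} pb_binary_data;                       /* same shape as ProtobufCBinaryData */

/* Hypothetical message without the new field. */
struct msg_base {
    uint64_t query_time_sec;
};

/* The same message with one optional bytes field added. */
struct msg_tagged {
    uint64_t       query_time_sec;
    pb_boolean     has_message_tag;     /* presence flag for the optional field */
    pb_binary_data message_tag;         /* length + pointer */
};

/* Bytes that the optional field adds to the struct, padding included. */
static size_t optional_bytes_overhead(void)
{
    assert(sizeof(struct msg_tagged) >= sizeof(struct msg_base));
    return sizeof(struct msg_tagged) - sizeof(struct msg_base);
}
```

On amd64 the flag takes 4 bytes, alignment padding takes another 4, and
the length/pointer pair takes 16, so the function should report 24.  The
cost is paid whether or not the field is ever set, which is why a large
number of rarely used fields would be better off in a sub-message.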

There are a few interesting models for extensions like this:

    - SNMP OIDs, like you mentioned.

    - DHCP options.

    - Pcap linktypes: http://www.tcpdump.org/linktypes.html.  The
      interesting thing about pcap linktypes is that there's very little
      formal process for allocating new types beyond discussing them on
      the tcpdump mailing list and having someone commit an update to
      the header file.  (It appears to be entirely consensus based.)

    - DNS RRtypes, for which there's a formal registry and "expert
      review" process (RFC 6895).  There is also a "private use" range.

    - IPv6 ULAs (RFC 4193), for which there's a voluntary registry
      (https://www.sixxs.net/tools/grh/ula/).

Avoiding the uncoordinated re-use problem that you note is pretty
important, and private use ranges perhaps make it a bit too easy to have
uncoordinated re-use, so I lean towards an actual registry (whether
that's a voluntary or consensus or formal registry).  Better still to
have consensus/specification on the fields and add new fields to the
mainline schema rather than vendor-specific fields/schemas.
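
As a toy illustration of what even a minimal registry buys us, here is a
sketch of an allocator that refuses to hand out the same field number
twice.  Everything in it (the "field_registry" type, the
"registry_allocate" function, the vendor names) is hypothetical and only
meant to show the collision check, not to propose an actual process:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define REG_MAX 64                      /* arbitrary capacity for the sketch */

/* Hypothetical registry of allocated protobuf field numbers. */
struct field_registry {
    unsigned    numbers[REG_MAX];       /* allocated field numbers */
    const char *owners[REG_MAX];        /* who each number was allocated to */
    size_t      count;
};

/*
 * Try to allocate a field number to an owner.  Returns true on success,
 * false if the number is already taken or the registry is full.
 */
static bool registry_allocate(struct field_registry *reg,
                              unsigned number, const char *owner)
{
    assert(reg != NULL && owner != NULL);
    for (size_t i = 0; i < reg->count; i++) {
        if (reg->numbers[i] == number)
            return false;               /* uncoordinated re-use is rejected */
    }
    if (reg->count == REG_MAX)
        return false;
    reg->numbers[reg->count] = number;
    reg->owners[reg->count] = owner;
    reg->count++;
    return true;
}
```

With this, a first vendor asking for field #17 succeeds and a second
vendor asking for #17 is turned away and has to pick another number,
which is exactly the check that a private use range cannot provide.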

> > [...]
> > it's potentially a mapping to a set of multiple unrelated outstanding
> > CLIENT_QUERYs:
> > 
> >    RESOLVER_RESPONSE -> { CLIENT_QUERY, CLIENT_QUERY, CLIENT_QUERY }
> 
> My suggestion is that only the first CLIENT QUERY gets mapped to the “query tag”.  The subsequent queries are usually put on a “pending list” and they get a response sent when the original 1st query gets an answer.  So a series of fast queries for xyz.com (and I’ve seen client queries come in immediately on the same IP address and port and even the same transaction ID!!) would look like this:
> 
>     CQ xyz.com  (query tag 1)
>     CQ xyz.com  (query tag 2)
>     CQ xyz.com  (query tag 3)
>     RQ xyz.com  (query tag 1)
>     RR xyz.com  (query tag 1)
>     RQ xyz.com  (query tag 1)
>     RR xyz.com  (query tag 1)
>     CR xyz.com  (query tag 1)
>     CR xyz.com  (query tag 2)
>     CR xyz.com  (query tag 3)
> 
> In other words, each client query generates its own query tag which rides along until that query is finally answered.

> > [...]
> 
> No, you don’t need a lot of tags, just one per new client query.  BTW, I have seen client queries that end up in over a hundred resolver queries.  Each one of these would be tagged with just one “query tag” associated with the original client query.

> > [...]
> 
> Again, don’t get stuck on whether it’s hard or not.  There are other DNS resolvers besides BIND and UNBOUND.

Ah, OK, I didn't like the complexity on the protobuf side that the full
generality of mapping resolver-side messages to all possible client-side
messages would have required, but that's not what you're proposing here;
you're explicitly limiting the scope of the tag to being able to
identify only the first client query that triggered the resolution
process, if there was one.  I like that a lot better, and it's basically
just:

    optional bytes message_tag = ...;

I think I'd call this a "message tag" rather than a "query tag", just to
make it a bit clearer.  (E.g., an RR and a CR can be related to each
other through their tag value, even though neither is a query.)
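
For what it's worth, here is a small sketch of how an analysis tool
might use such a tag, run against the example trace you gave.  The
"tap_msg" struct and "count_tagged" function are hypothetical (they are
not part of the dnstap schema or any existing library), and the tag is
modelled as a NUL-terminated string for brevity even though the field
itself would be opaque bytes:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Client/resolver query/response, abbreviated as in the example trace. */
enum msg_type { CQ, CR, RQ, RR };

/* Hypothetical decoded dnstap message, reduced to what the sketch needs. */
struct tap_msg {
    enum msg_type type;
    const char   *tag;                  /* message_tag; NULL if not set */
};

/* Count the messages of a given type that carry a given tag. */
static size_t count_tagged(const struct tap_msg *msgs, size_t n,
                           enum msg_type type, const char *tag)
{
    assert(msgs != NULL && tag != NULL);
    size_t hits = 0;
    for (size_t i = 0; i < n; i++) {
        if (msgs[i].type == type && msgs[i].tag != NULL &&
            strcmp(msgs[i].tag, tag) == 0)
            hits++;
    }
    return hits;
}

/* The example trace: three client queries, resolution tagged with the first. */
static const struct tap_msg example_trace[] = {
    { CQ, "1" }, { CQ, "2" }, { CQ, "3" },
    { RQ, "1" }, { RR, "1" }, { RQ, "1" }, { RR, "1" },
    { CR, "1" }, { CR, "2" }, { CR, "3" },
};
```

Counting RQ messages with tag "1" gives the number of upstream lookups
that the first client query triggered (two in this trace), while tags
"2" and "3" have none, since those queries waited on the pending list.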

Tracking issue here:

    https://github.com/dnstap/dnstap.pb/issues/3

> > BTW, have you looked at Casey Deccio's dnsviz tool [1] ?  Not the
> > dnsviz.net service, but the Python tool that drives it.  I wonder if
> > there might be an alternate approach to the problem of analyzing the
> > root-cause of resolution failures that involves active probing, that
> > could be triggered by a hypothetical new "timeout" dnstap payload.
> > (This is a lot easier, we just need to enumerate and define the
> > different kinds of timeouts that can occur.)
> 
> I know DNSVIZ (as well as Casey), but I have never looked at the Python tool.  
> 
> 
> Now, regardless of whether you think these ideas are good or not, I think the base implementation of DNSTAP is already awesome and useful in more ways than we can imagine.  I have some interesting HEAT MAPS that I’ve generated.  These correlate the domain name to an ontological data base to categorize the types of queries being made by a user base (shopping, searching, travel, news, ……into multiple sub-levels).  This and other types of BIG DATA analyses can now go real-time.  Yay!

Great, having dnstap support in many DNS implementations is very
gratifying :-)

-- 
Robert Edmonds