Skip to content

Probably should not return ServFail for unknown zones #8061

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
iximeow opened this issue Apr 28, 2025 · 1 comment
Open

Probably should not return ServFail for unknown zones #8061

iximeow opened this issue Apr 28, 2025 · 1 comment
Labels
good first issue Issues that are good for learning the codebase

Comments

@iximeow
Copy link
Member

iximeow commented Apr 28, 2025

if our DNS servers get a query for an unknown zone, we return SERVFAIL. while working on #8047 i was writing a test that brushed up against this behavior, and i left some notes: https://github.com/oxidecomputer/omicron/pull/8047/files#diff-c0d9697cfa638ea3c770968ab4bd7fd28692d676e50055101339740a1688bd4aR415-R432

reproduced here in case future diffs or GitHub navigation breaks that link,

    let lookup_err = resolver
        .soa_lookup(TEST_ZONE)
        .await
        .expect_err("test zone should not exist");
    // I think we really should answer with ResponseCode::Refused. We are not
    // authoritative for the .internal TLD, so we don't know that some *other*
    // server would have records for `oxide.internal`. It is not a failure of
    // our server to not know what that domain is, we should just refuse to
    // answer.
    //
    // One may imagine we should return at least NXDomain without the
    // authoritative bit set. RFC 1035 says that "Name Error - Meaningful only
    // from an authoritative name server, ...". Does that mean that recursive
    // resolvers and clients would faithfully ignore our error in that case? Is
    // there a risk that something would miss the non-authoritative nature of
    // such an NXDomain and incorrectly cache the non-existence of some other
    // domain? Hopefully not! Answering `Refused` would side-step this question.
    expect_no_records_error_code(&lookup_err, ResponseCode::ServFail);

the behavior now is pretty simple and intentional, this is from dns-server/src/dns_server.rs:

impl From<QueryError> for RequestError {
    fn from(source: QueryError) -> Self {
        match &source {
            QueryError::NoName(_) => RequestError::NxDomain(source),
            // Bail with servfail when this query is for a zone that we don't
            // own (and other server-side failures) so that resolvers will look
            // to other DNS servers for this query.
            QueryError::NoZone(_)
            | QueryError::QueryFail(_)
            | QueryError::ParseFail(_) => RequestError::ServFail(source.into()),
        }
    }
}

from reading RFC 1035 and related (though i'm not certain that no later RFC refines it), "Server failure" means that there was an issue with the name server in processing a request. we obviously should not return "Name Error" (aka NXDOMAIN), as we don't know if a zone we're not authoritative for does not exist. this suggests to me that a more accurate error would be to return Refused, since we can neither say there are nor are not records for the queried domain. as it turns out, this also the response i get from dig a foo. $nameserver against a dozen nameservers i tried.

realistically i think this is unlikely to matter. if we're getting queries for zones we don't know about, something else is pretty horked. but i could imagine some intermediary seeing SERVFAIL as an implication that the server(s) are unhealthy for other queries that we're perfectly happy to answer. (and yes, i do wonder if we should better justify SERVFAIL for QueryFail and ParseFail errors. answers for queries like dig a $(printf "hello\x7fworld.$zone") ns1.$zone get a mix of NXDOMAIN and REFUSED, for comparison)

@iximeow iximeow added the good first issue Issues that are good for learning the codebase label Apr 28, 2025
@iximeow
Copy link
Member Author

iximeow commented Apr 28, 2025

@rmustacc notes that the particular error code may be important to some clients that have disparate DNS servers - one example here might be NTP on the rack, using the system libresolv2, where internal DNS may be consulted along with whatever upstream resolver the rack also queries. SERVFAIL here is one way we'd ensured that an error is not taken fatal to the query and instead resulted in the client trying other DNS servers that may be able to actually provide an answer.

so if this behavior is changed, we'll need to be particularly careful that changing Omicron's DNS servers doesn't result in issues for DNS clients outside Omicron (both in the rack and obviously on customer networks).

i'm a bit optimistic here - in some cases SERVFAIL is considered in the same way as REFUSED. but in other cases SERVFAIL seems less fatal than REFUSED. this deserves some more careful reading and testing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Issues that are good for learning the codebase
Projects
None yet
Development

No branches or pull requests

1 participant