The
meme
is
real,
but
I
think
this
particular
case
is
sort
of
interesting,
because
it
turned
out,
ultimately,
to
not
be
due
to
DNS
configuration,
but
an
honest-to-goodness
bug
in
glibc.
As
previously
mentioned,
I
heavily
rely
on
email-oauth2-proxy
for
my
work
email.
Every
now
and
then,
I’d
see
a
failure
like
this:
Email OAuth 2.0 Proxy: Caught network error in IMAP server at [::]:1993 (unsecured) proxying outlook.office365.com:993 (SSL/TLS) - is there a network connection? Error type <class 'socket.gaierror'> with message: [Errno -2] Name or service not known
This
always
coincided
with
a
change
in
my
network,
but
-
and
this
is
the
issue
-
the
app
never
recovered.
Even
though
other
processes
-
even
Python
ones
-
could
happily
resolve
outlook.office365.com
-
this
long-running
daemon
remained
stuck,
until
it
was
restarted.
A
bug
in
the
proxy?
My
first
suspect
here
was
this
bit
of
code:
1761 def create_socket(self, socket_family=socket.AF_UNSPEC, socket_type=socket.SOCK_STREAM):
1762 # connect to whichever resolved IPv4 or IPv6 address is returned first by the system
1763 for a in socket.getaddrinfo(self.server_address[0], self.server_address[1], socket_family, socket.SOCK_STREAM):
1764 super().create_socket(a[0], socket.SOCK_STREAM)
1765 return
We’re
looping
across
the
gai
results,
but
returning
after
the
first
one,
and
there’s
no
attempt
to
account
for
the
first
address
result
being
unreachable,
but
later
ones
being
fine.
Makes
no
sense,
right?
My
guess
was
that
somehow
getaddrinfo()
was
returning
IPv6
results
first
in
this
list,
as
at
the
time,
the
IPv6
configuration
on
the
host
was
a
little
wonky.
Perhaps
I
needed
to
tweak
gai.conf
?
However,
while
this
was
a
proxy
bug,
it
was
not
the
cause
of
my
issue.
DNS
caching?
Perhaps,
then,
this
is
a
local
DNS
cache
issue?
Other
processes
work
OK,
even
Python
test
programs,
so
it
didn’t
seem
likely
to
be
the
system-level
resolver
caching
stale
results.
Python
itself
doesn’t
seem
to
cache
results.
This
case
triggered
(sometimes)
when
my
VPN
connection
died.
The
openconnect
vpnc
script
had
correctly
updated
/etc/resolv.conf
back
to
the
original
configuration,
and
as
there’s
no
caching
in
the
way,
then
the
overall
system
state
looked
correct.
But
somehow,
this
process
still
had
wonky
DNS?
A
live
reproduction
I
was
not
going
to
get
any
further
until
I
had
a
live
reproduction
and
the
spare
time
to
investigate
it
before
restarting
the
proxy.
The
running
proxy
in
this
state
could
be
triggered
easily
by
waking
up
fetchmail,
which
made
it
much
easier
to
investigate
what
was
happening
each
time.
So
what
was
the
proxy
doing
on
line
:1763
above?
Here’s
an
strace
snippet:
[pid 1552] socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 7
[pid 1552] setsockopt(7, SOL_IP, IP_RECVERR, [1], 4) = 0
[pid 1552] connect(7, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("ELIDED")}, 16) = 0
[pid 1552] poll([{fd=7, events=POLLOUT}], 1, 0) = 1 ([{fd=7, revents=POLLOUT}])
[pid 1552] sendto(7, "\250\227\1 \0\1\0\0\0\0\0\1\7outlook\toffice365\3c"..., 50, MSG_NOSIGNAL, NULL, 0) = 50
[pid 1552] poll([{fd=7, events=POLLIN}], 1, 5000) = 1 ([{fd=7, revents=POLLERR}])
[pid 1552] close(7) = 0
As
we
might
expect,
we’re
opening
a
socket,
connecting
over
UDP
to
port
53,
and
sending
out
a
request
to
the
DNS
server.
This
indicated
the
proximal
issue:
the
DNS
server
IP
address
was
wrong
-
the
DNS
servers
used
were
the
ones
originally
set
up
by
openconnect
still.
The
process
wasn’t
incorrectly
caching
DNS
results
but
the
DNS
servers.
Forever.
Nameserver
configuration
itself
is
not
something
that
applications
typically
control,
so
the
next
question
was
-
how
does
this
work
normally?
When
I
update
/etc/resolv.conf,
or
the
thousand
other
ways
to
configure
name
resolution
in
modern
Linux
systems,
what
makes
getaddrinfo()
continue
to
work,
normally?
/etc/resolv.conf
and
glibc
So,
how
does
glibc
account
for
changes
in
resolver
configuration?
The
contents
of
the
/etc/resolv.conf
file
are
the
canonical
location
for
DNS
server
addresses
for
processes
(like
Python
ones)
using
the
standard
glibc
resolver.
Logically
then,
there
must
be
a
way
for
updates
to
the
file
to
affect
running
processes.
In
glibc,
such
configuration
is
represented
by
struct
resolv_context.
This
is
lazily
initialized
via
__resolv_context_get()->maybe_init(),
which
looks
like
this:
68 /* Initialize *RESP if RES_INIT is not yet set in RESP->options, or if
69 res_init in some other thread requested re-initializing. */
70 static __attribute__ ((warn_unused_result)) bool
71 maybe_init (struct resolv_context *ctx, bool preinit)
72 {
73 struct __res_state *resp = ctx->resp;
74 if (resp->options & RES_INIT)
75 {
76 if (resp->options & RES_NORELOAD)
77 /* Configuration reloading was explicitly disabled. */
78 return true;
79
80 /* If there is no associated resolv_conf object despite the
81 initialization, something modified *ctx->resp. Do not
82 override those changes. */
83 if (ctx->conf != NULL && replicated_configuration_matches (ctx))
84 {
85 struct resolv_conf *current = __resolv_conf_get_current ();
86 if (current == NULL)
87 return false;
88
89 /* Check if the configuration changed. */
90 if (current != ctx->conf)
...
Let’s
take
a
look
at
__resolv_conf_get_current():
123 struct resolv_conf *
124 __resolv_conf_get_current (void)
125 {
126 struct file_change_detection initial;
127 if (!__file_change_detection_for_path (&initial, _PATH_RESCONF))
128 return NULL;
129
130 struct resolv_conf_global *global_copy = get_locked_global ();
131 if (global_copy == NULL)
132 return NULL;
133 struct resolv_conf *conf;
134 if (global_copy->conf_current != NULL
135 && __file_is_unchanged (&initial, &global_copy->file_resolve_conf))
This
is
the
file
change
detection
code
we’re
looking
for:
_PATH_RESCONF
is
/etc/resolv.conf,
and
__file_is_unchanged()
compares
the
cached
values
of
things
like
the
file
mtime
and
so
on
against
the
one
on
disk.
If
it
has
in
fact
changed,
then
maybe_init()
is
supposed
to
go
down
the
“reload
configuration”
path.
Now,
in
my
case,
this
wasn’t
happening.
And
the
reason
for
this
is
line
83
above:
the
replicated_configuration_matches()
call.
Resolution
options
We
already
briefly
mentioned
gai.conf.
There
is
also,
as
the
resolver.3
man
page
says,
this
interface:
The resolver routines use configuration and state information
contained in a __res_state structure (either passed as the statep
argument, or in the global variable _res, in the case of the
older nonreentrant functions). The only field of this structure
that is normally manipulated by the user is the options field.
So
an
application
can
dynamically
alter
options
too,
outside
of
whatever
static
configuration
there
is.
And
(I
think)
that’s
why
we
have
the
replicated_configuration_matches()
check:
static bool
replicated_configuration_matches (const struct resolv_context *ctx)
{
return ctx->resp->options == ctx->conf->options
&& ctx->resp->retrans == ctx->conf->retrans
&& ctx->resp->retry == ctx->conf->retry
&& ctx->resp->ndots == ctx->conf->ndots;
}
The
idea
being,
if
the
application
has
explicitly
diverged
its
options,
it
doesn’t
want
them
to
be
reverted
just
because
the
static
configuration
changed.
Our
Python
application
isn’t
changing
anything
here,
so
this
should
still
work
as
expected.
In
fact,
though,
we
find
that
it’s
returning
false:
the
dynamic
configuration
has
somehow
acquired
the
extra
options
RES_SNGLKUP
and
RES_SNGLKUPREOP.
We’re
now
very
close
to
the
source
of
the
problem!
A
hack
that
bites
So
what
could
possibly
set
these
flags?
Turns
out
the
send_dg()
function
does:
999 {
1000 /* There are quite a few broken name servers out
1001 there which don't handle two outstanding
1002 requests from the same source. There are also
1003 broken firewall settings. If we time out after
1004 having received one answer switch to the mode
1005 where we send the second request only once we
1006 have received the first answer. */
1007 if (!single_request)
1008 {
1009 statp->options |= RES_SNGLKUP;
1010 single_request = true;
1011 *gotsomewhere = save_gotsomewhere;
1012 goto retry;
1013 }
1014 else if (!single_request_reopen)
1015 {
1016 statp->options |= RES_SNGLKUPREOP;
1017 single_request_reopen = true;
1018 *gotsomewhere = save_gotsomewhere;
1019 __res_iclose (statp, false);
1020 goto retry_reopen;
1021 }
Now,
I
don’t
believe
the
relevant
nameservers
have
such
a
bug.
Rather,
what
seems
to
be
happening
is
that
when
the
VPN
connection
drops,
making
the
servers
inaccessible,
we
hit
this
path.
And
these
flags
are
treated
by
maybe_init()
as
if
the
client
application
set
them,
and
has
thus
diverged
from
the
static
configuration.
As
the
application
itself
has
no
control
over
these
options
being
set
like
this,
this
seemd
like
a
real
glibc
bug.
The
fix
I
originally
reported
this
to
the
list
back
in
March;
I
was
not
confident
in
my
analysis
but
the
maintainers
confirmed
the
issue.
More
recently,
they
fixed
it.
The
actual
fix
was
pretty
simple:
apply
the
workaround
flags
to
statp->_flags
instead,
so
they
don’t
affect
the
logic
in
maybe_init().
Thanks
DJ
Delorie!