Developing for Debugging

The Oxide control plane operates two different DNS services: external DNS, exposed to our customers, provides names for locating the system’s externally-facing API and web console; internal DNS only operates on the system’s internal network and provides service discovery for components of the control plane. These use separate deployments of the same stack. Here are the basics:

  • The authoritative source of truth of DNS data (i.e., the names and records that the DNS servers should report) is the control plane database. This is a strongly consistent distributed database (CockroachDB).

  • The core of the control plane is a component called Nexus. When something happens that should trigger a DNS change (e.g., a component is added or removed that provides service either internally or externally):

    • Nexus first updates the DNS configuration in the database.

    • A series of background tasks in Nexus propagates the updated configuration to all in-service DNS servers. These tasks run in response to events that change DNS, and also periodically (to deal with transient failures, missed wakeups, etc.).

Diagram: DNS propagation

This whole flow is asynchronous with respect to the event that triggered the change, but it generally completes in a few seconds in a working system.
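
To make the shape of that flow concrete, here's a minimal sketch in Rust. Everything in it is illustrative: the Database type, the handle_dns_change function, and the use of a tokio Notify to stand in for task activation are assumptions made for this example, not Omicron's actual API.

// Minimal sketch of the flow above.  Everything here is illustrative:
// the Database type, handle_dns_change(), and the use of a tokio Notify to
// stand in for task activation are assumptions, not Omicron's actual API.
use std::sync::Arc;
use tokio::sync::Notify;

/// Stand-in for the strongly consistent control plane database (CockroachDB).
struct Database;

impl Database {
    /// Record a new DNS configuration generation and return its number.
    async fn update_dns_config(&self, _records: &[(&str, &str)]) -> u64 {
        // In the real system this is a database transaction; here we just
        // pretend the write produced generation 18.
        18
    }
}

/// Called when something happens that should change DNS.
async fn handle_dns_change(db: &Database, propagation: &Notify) {
    // Step 1: update the authoritative DNS data in the database.
    let generation = db.update_dns_config(&[("example", "fd00::1")]).await;
    println!("wrote DNS generation {generation}");

    // Step 2: activate the propagation background task.  The task also wakes
    // up on a timer, so a missed activation here is not fatal.
    propagation.notify_one();
}

#[tokio::main]
async fn main() {
    let db = Database;
    let propagation = Arc::new(Notify::new());

    // The propagation task: wait for an activation, then push the latest
    // configuration to every DNS server it knows about.
    let worker = {
        let propagation = Arc::clone(&propagation);
        tokio::spawn(async move {
            propagation.notified().await;
            println!("propagating latest DNS configuration to all DNS servers");
        })
    };

    handle_dns_change(&db, &propagation).await;
    worker.await.unwrap();
}

The important property is the ordering: the database write is the source of truth, and propagation is a separate, repeatable step that can be retried without re-deciding what the DNS data should be.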

But everyone knows that when infrastructure goes offline, it’s always DNS that’s at fault. So in working on this system, we spent some time thinking about: what can go wrong in this process? When it breaks, how will we know why?

To be specific: suppose a customer reports that a DNS name they expect to be there is missing. They get no records when they query the external DNS server. What could be wrong? Well:

  • The DNS data in the database could be wrong.

  • The propagation process could be broken:

    • Maybe Nexus hasn’t noticed that the DNS data has changed.

    • Maybe Nexus has the wrong list of DNS servers (e.g., maybe one was added to the system but not added to the propagation list).

    • Maybe the propagation of the DNS data to the DNS servers is broken.

    • Maybe the propagation process is stuck.

  • The DNS server may be serving the wrong data for its configuration.

If we got a report that DNS was doing the wrong thing, it seems obvious that what we'd want is a tool that could report all of these things:

  • What’s in the database?

    • What names are part of the DNS configuration?

    • When was it last updated? By which Nexus? Why?

  • For each Nexus instance, for the last propagation attempt:

    • When did it start? Is it still running?

    • What version of the DNS configuration is it trying to propagate?

    • What DNS servers did it find that it should propagate to? When did it determine that?

    • What was the result of the last completed attempt to propagate DNS configuration to each server?

  • For each DNS server, what version of the DNS data is it operating from?

This example was the catalyst for building omdb, the Omicron debugger. Omicron is the internal name for the control plane on Oxide systems. It’s the Omicron debugger, not the DNS debugger, because we figured that this wouldn’t be the only subsystem that we needed runtime observability for.[1]

Now, we built DNS propagation in terms of a new abstraction called background tasks. Background tasks have basic observability built in. They’re basically just asynchronous Rust functions with a few properties: they can be called (activated) in response to certain events; they’re also activated periodically; the complete set of them is statically defined; and they can return an arbitrary status payload (currently, a schemaless JSON object). With these pieces, omdb is able to generate a useful summary report about DNS propagation without omdb itself knowing too much about it. First, it can report basic documentation that comes from the server:

# omdb nexus background-tasks doc | grep -A1 "dns.*internal"
note: Nexus URL not specified.  Will pick one from DNS.
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:1122:3344:102::3]:12221
task: "dns_config_internal"
    watches internal DNS data stored in CockroachDB
--
task: "dns_propagation_internal"
    propagates latest internal DNS configuration (from "dns_config_internal"
    background task) to the latest list of DNS servers (from
    "dns_servers_internal" background task)

--
task: "dns_servers_internal"
    watches list of internal DNS servers stored in internal DNS

and here’s the status:

# omdb nexus background-tasks show dns_internal
note: Nexus URL not specified.  Will pick one from DNS.
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:1122:3344:104::3]:12221
task: "dns_config_internal"
  configured period: every 1m
  currently executing: no
  last completed activation: iter 4500, triggered by a periodic timer firing
    started at 2025-06-10T03:31:47.775Z (23s ago) and ran for 1303ms
    last generation found: 18

task: "dns_servers_internal"
  configured period: every 1m
  currently executing: no
  last completed activation: iter 4500, triggered by a periodic timer firing
    started at 2025-06-10T03:31:47.775Z (23s ago) and ran for 1ms
    servers found: 3

      DNS_SERVER_ADDR
      [fd00:1122:3344:1::1]:5353
      [fd00:1122:3344:2::1]:5353
      [fd00:1122:3344:3::1]:5353

task: "dns_propagation_internal"
  configured period: every 1m
  currently executing: no
  last completed activation: iter 4502, triggered by a periodic timer firing
    started at 2025-06-10T03:31:47.775Z (23s ago) and ran for 876ms
    attempt to propagate generation: 18

      DNS_SERVER_ADDR            LAST_RESULT
      [fd00:1122:3344:1::1]:5353 success
      [fd00:1122:3344:2::1]:5353 success
      [fd00:1122:3344:3::1]:5353 success

This is a tidy example of what I call developing for debugging: in building the system, we assume that any part of it can be unexpectedly broken in production. We think up front about what information we’ll want to debug it and then make sure we’re able to collect it from production systems.

Note that once we invested in some basic abstractions like background tasks, it was very little work to factor the pieces of DNS propagation so that we get these little status reports. Yet with this, in just a few seconds, we can determine exactly where the propagation pipeline broke down. The result is a system that’s not just functional, but debuggable.
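
To give a sense of what one of these tasks looks like in code, here's a minimal sketch: an async function that does one unit of work and returns a schemaless JSON status payload for the debugger to render. The function and driver here are hypothetical, not Omicron's actual interface.

// Illustrative sketch of the background-task shape described above: an async
// function that does one unit of work and returns a schemaless JSON status
// payload for the debugger to render.  The names here are hypothetical, not
// Omicron's actual interface.
use serde_json::{json, Value};

/// One activation of a hypothetical "dns_servers" task: find the current
/// list of DNS servers and report what was found as the task's status.
async fn dns_servers_activation() -> Value {
    // In the real system this would come from the database; we fake it here.
    let servers = vec![
        "[fd00:1122:3344:1::1]:5353",
        "[fd00:1122:3344:2::1]:5353",
        "[fd00:1122:3344:3::1]:5353",
    ];

    // The status is an arbitrary JSON object; omdb just formats whatever the
    // task returned, which is how it stays ignorant of the task's internals.
    json!({
        "servers_found": servers.len(),
        "addresses": servers,
    })
}

#[tokio::main]
async fn main() {
    // A real driver would activate this on events and on a timer and keep
    // the most recent status around; here we run a single activation.
    let status = dns_servers_activation().await;
    println!("{}", serde_json::to_string_pretty(&status).unwrap());
}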

Software should be able to exonerate itself

It’s great when software can tell you what’s wrong with it. It’s even better when it can convincingly tell you that it’s working.

It’s one thing for a system like this to report:

0 problems

or:

DNS propagation is up to date.

Okay. But is that really true? "0 problems" means "0 problems that this program knows about". "up to date" is vague at best and potentially wrong as soon as the words arrive at your screen.

The system above reports much stronger, affirmative evidence when it’s working correctly. It says: "I propagated DNS version 18, which I discovered at 2025-06-10T03:31:47.775Z, to these three DNS servers (which I discovered at 2025-06-10T03:31:47.775Z), and it had all finished by 2025-06-10T03:31:48.651000Z."

This is so much more information. Very often, this can tell you that DNS isn’t the problem.[2] And even when it’s something else, this information can help you build a timeline of events for figuring out what did happen. Maybe the user queried DNS before propagation completed?
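
One way to think about this in code: the status a task reports shouldn't be a bare verdict, it should carry the evidence behind it. The types below sketch that contrast; they're invented for illustration rather than taken from Omicron.

// A sketch of the contrast described above: a status report should carry the
// evidence, not just a verdict.  These types are invented for illustration,
// not taken from Omicron.
use chrono::{DateTime, Utc};

/// The weak form: nothing a reader can cross-check.
#[allow(dead_code)]
struct WeakStatus {
    ok: bool,
}

/// The affirmative form: what was propagated, to where, and when, so the
/// report can be lined up against the timeline of the problem being debugged.
struct PropagationStatus {
    /// DNS generation that was propagated.
    generation: u64,
    /// When this Nexus discovered that generation.
    config_found_at: DateTime<Utc>,
    /// Per-server outcome: when it succeeded, or why it failed.
    per_server_results: Vec<(String, Result<DateTime<Utc>, String>)>,
}

fn main() {
    let status = PropagationStatus {
        generation: 18,
        config_found_at: Utc::now(),
        per_server_results: vec![
            ("[fd00:1122:3344:1::1]:5353".to_string(), Ok(Utc::now())),
            ("[fd00:1122:3344:2::1]:5353".to_string(), Err("timeout".to_string())),
        ],
    };
    println!(
        "propagated generation {} (discovered {}) to {} servers",
        status.generation,
        status.config_found_at,
        status.per_server_results.len(),
    );
}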

Other benefits

There are so many other benefits of developing software like this:

  • You can use omdb while building the system to play with it and test it out before writing automated tests. We didn’t do that here, but you can imagine building the database part before the rest of the propagation pipeline and using omdb to see Nexus noticing when it gets updated. Debugger-driven development is the subject of a whole Oxide and Friends podcast episode.

  • You can use omdb while working on an existing system to understand how it behaves. In this case, you could imagine seeing what happens when some DNS servers are offline.

  • You can use omdb to teach people how the software works. Onboarding docs for the DNS subsystem (which don’t exist, but should!) can walk through an example change to DNS and the subsequent propagation. Mentors can show it to new team members. A demo (at this level of detail) is worth a thousand diagrams.

  • You can use omdb as a tool for communicating about software changes. Instead of telling people about the new DNS system, you can give a deep systems demo of the DNS subsystem in 10 minutes, showing not just that a DNS change got propagated as expected, but how, and even showing what happens when things aren’t working.

From past experience, I can say that when a team manages to do this across the system, the easy availability of debugging information can really change the emotional tenor of handling support calls. A lot of people naturally view debugging as a painful distraction from something else they’d rather be doing. It’s extra stressful when you’re under the gun because a customer’s waiting for a fix. A lot of the stress comes from the fear and uncertainty. When you’ve built out these tools, you can solve more problems quickly (and more confidently). It still sucks to know a customer’s in pain, but it’s a lot less scary when you know how to figure it out.

I also have this gut feeling that, ironically, software built this way is less likely to have runtime problems because this process often prompts you to think about them in development. I can’t say I’ve got any data to back that up.


1. Many team members have added commands to omdb, and it can now query quite a lot of the control plane.
2. I’m ignoring problems at both ends of the process here (the database having incorrect data or the DNS server serving the wrong thing even when its configuration is correct) just to keep the explanation simple. We have tools for observing these too, and they’re similarly detailed, not just "working correctly".