One of the considerations in designing our Oxide rack is which parts we expect to be accessible and by what means. The Oxide rack is designed to live in a data center, with access exclusively over the network. The only reason an engineer should ever need to physically visit a rack is to replace a failing part, such as a disk. Our Service Processor (SP) is accessible via the management network.
During some of our first attempts at putting our next-generation Cosmo sled into an Oxide rack, we would see the Service Processor drop off the network. This is a tricky situation to debug, as without network access we have limited insight into the state of the SP itself. Debugging started based on the state of the rest of the system (the original Hubris bug may contain spoilers for the blog post!):
- The AMD host CPU was still alive, meaning the full system itself still had power
- The SP itself was not broadcasting over the management network that it was alive
- There were no increases in network data counters coming from the SP
- The fans were spinning at a constant elevated rate. The service processor is responsible for fan control, so this was an indication the fan controller may have fallen back to emergency full power mode.
- This was not reproducible on a sled outside a rack
The Service Processor runs our custom operating system, Hubris. Each portion of the system (networking, thermal control, update, etc.) is written as a separate task. Hubris is not a true Real Time Operating System with deadline guarantees, but it does have the notion of task priorities. One of our working theories was that we had a software bug that was causing task starvation. If the networking task was unable to run because some other task was eating up all the CPU time, it would not be able to respond over the network. A likely culprit for task starvation would be a task that had gotten into an infinite crash loop, with all CPU time being spent restarting the task. We adjusted the task restart time to have a longer delay to catch this case. We also wanted to be able to observe whether the SP was still making progress even if we lacked network access, so we switched our chassis LED from "always on" to blinking.
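As a rough illustration of the idea (the helpers here are hypothetical stand-ins, not Hubris APIs), a blink loop doubles as a liveness indicator: if the task stops being scheduled, the LED freezes in whatever state it was last left in.

```rust
// Illustrative sketch only: `toggle_chassis_led` and `sleep_ms` are
// hypothetical stand-ins, not the actual Hubris task code.
fn toggle_chassis_led() { /* drive the chassis LED GPIO here */ }
fn sleep_ms(_ms: u32) { /* yield the CPU for the given time */ }

fn blink_task_main() -> ! {
    loop {
        // If this task stops being scheduled (for example, a higher-priority
        // task is stuck in a tight loop, or the whole system wedges), the LED
        // freezes in whichever state it was last left in: stuck on or stuck off.
        toggle_chassis_led();
        sleep_ms(500);
    }
}
```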
We were fortunate to be able to reproduce the issue with these debug changes, but the results were still confusing: in some cases we would see the LED stuck on, and in other cases the LED was stuck off. The task responsible for LED blinking was near the top of the priority order, which limited the number of places a task could be stuck.
One of the many advantages of writing Hubris in Rust is eliminating bug classes such as buffer overflows. A category of issues Hubris is still particularly prone to is stack overflows. This is because Hubris requires manual sizing of stacks for tasks, and calculating maximum stack size has proven tricky. Our ability to detect undersized stacks has improved with the addition of the emit-stack-sizes feature, but we can still hit some edge cases. When a stack overflow occurs in a task, the task safely restarts. A stack overflow in the kernel could potentially produce similar behavior: a system that looks like it isn't making progress. Unfortunately for us, the stack margins on the kernel were relatively large (512 bytes!), so this was an unlikely case.
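As a minimal illustration of why a static bound is hard to compute (this is a generic example, not Hubris code), any recursion or call through a function pointer makes the worst-case stack depth depend on runtime data, which analysis built on emit-stack-sizes data cannot bound on its own.

```rust
// Generic illustration: stack use here depends on how deep the runtime data
// structure is, so no static per-function stack-size table can bound it.
struct Node<'a> {
    children: &'a [Node<'a>],
}

fn count_nodes(node: &Node<'_>) -> usize {
    let mut total = 1;
    for child in node.children {
        // Each level of recursion adds another stack frame; the maximum
        // depth is only known at runtime.
        total += count_nodes(child);
    }
    total
}
```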
At this point, we really needed to get more debugging information out of the system. For manufacturing purposes, we have SWD debug headers. These are not expected to be used on a production system and especially not a system in a running rack. We had to do some creative cable pulling to get them attached with the assistance of coworkers in the Oxide office.
Fortunately, our cable attachment paid dividends: we reproduced the issue with the probe attached! This was not immediately fruitful: the debug probe was unable to actually halt the CPU via debug halt, which limited our ability to extract diagnostic information. Our Service Processor uses a Cortex-M7 STM32H7, and the number of ways to put the system in such a state is limited.
This put our focus on identifying what parts of the system could cause such behavior. A major change from our first-generation Gimlet system was the addition of an FPGA to control more parts of our system, such as host flash. This FPGA is connected using a simple, old-school parallel bus, like the sort you might use for RAM, and accessed via the STM32H7 Flexible Memory Controller. As stated in the manual (Section 22.1 of RM0433):
Its main purposes are:
* to translate AXI transactions into the appropriate external device protocol
* to meet the access time requirements of the external memory devices
One way a CPU can potentially get stuck is if it never receives a bus acknowledgement from an external device. A bug in the FPGA timing, for example, could result in the CPU hanging forever when attempting to read a register. To test this theory, we created an FPGA test image with a register that, when read, would intentionally hang the FMC bus. This produced very similar behavior to what we observed and was a strong indicator that we were looking at the right part of the system to find the issue.
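A minimal sketch of what such a read looks like from the SP's point of view (the base address is the STM32H7's default FMC bank 1 window; the register name and offset are made up for illustration):

```rust
// Hypothetical test-register read over the FMC. If the FPGA never completes
// the bus transaction, this volatile read never returns and the CPU stalls
// with no fault to handle.
const FMC_BANK1_BASE: usize = 0x6000_0000; // STM32H7 FMC NOR/PSRAM bank 1
const HANG_TEST_REG: usize = 0x40;         // made-up offset for the test register

fn read_hang_register() -> u32 {
    let addr = (FMC_BANK1_BASE + HANG_TEST_REG) as *const u32;
    // SAFETY: this address is a device register mapped through the FMC; the
    // read has no effect on Rust-visible memory.
    unsafe { core::ptr::read_volatile(addr) }
}
```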
We typically rely on full system dumps to debug Hubris problems. That is not possible unless we can halt the CPU. ARM CPUs do support vector catch, though: it's possible to configure the CPU so that on reset, it halts before executing the first instruction. Our hope was that a vector catch reset would unstick the CPU sufficiently without trampling over our existing state. This did work. We lost the running register state, including the program counter, but the rest of the Hubris state in RAM was preserved across the reset and looked reasonably consistent. We could see which Hubris task was running, but nothing there looked like it was accessing the FMC.
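For reference, "vector catch on reset" boils down to one bit in the ARMv7-M Debug Exception and Monitor Control Register. The debug probe sets it over SWD rather than the firmware doing so, but at the register level it looks roughly like this:

```rust
// Sketch based on the ARMv7-M architecture manual. DEMCR lives at
// 0xE000_EDFC; setting VC_CORERESET halts the core before the first
// instruction executes after a reset. In practice the debug probe writes
// this over SWD, and it only takes effect with halting debug enabled.
const DEMCR: *mut u32 = 0xE000_EDFC as *mut u32;
const VC_CORERESET: u32 = 1 << 0;

unsafe fn enable_reset_vector_catch() {
    let current = core::ptr::read_volatile(DEMCR);
    core::ptr::write_volatile(DEMCR, current | VC_CORERESET);
}
```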
Our hardware engineers did a review of the FPGA timings and found that we might not have been meeting the timing constraints required by the memory interface. We merged the fix and figured that the vector catch dumps were just inconsistent, most likely due to the cache. When we ran experiments with the cache turned off, the dumps were consistent, but we never reproduced the actual issue.
We continued Hubris development as usual over the next several weeks. One of the changes we worked on during this period was related to our measured boot work. Our Root of Trust (RoT) is responsible for taking a hash of the SP flash at bootup, which eventually gets used by higher-level software. To achieve the security properties we need, the SP may reset itself multiple times in a row at first bootup. While testing this change, we saw the same symptoms come back: the Cosmo SP would disappear from the network and appear dead. This change turned out to be incredibly good at reproducing the issue, cutting the reproduction time from potentially 24+ hours to approximately 10-20 minutes. The initial dumps still didn't show a smoking gun, but we remained highly suspicious of the FMC bus, since there were still only a limited number of cases that could produce such symptoms.
The high reproduction rate gave us a chance to try many experiments, none of which were fruitful:
- Adjusting the rate at which we reset and the number of resets before normally booting
- Clearing the FPGA bit stream an extra time
- Restricting tasks from accessing the FMC bus
- Removing whole tasks that seemed to be unrelated
Finally, staring at the STM32H7 manual provided an insight: maybe the processor itself was performing accesses on the FMC bus that we weren't expecting! Modern processors hold a large amount of internal state that isn't directly visible to the programmer. Outside of certain synchronization points or cache maintenance instructions, a programmer cannot know when the CPU will pull data into or out of the cache. A CPU writing data from the cache back to memory is a memory access, so the CPU can be making memory accesses to addresses unrelated to the current program counter.
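As a rough sketch (using the cortex-m crate; this is not code from Hubris), explicit cache maintenance is the only point at which software gets to choose when dirty lines are written back:

```rust
use cortex_m::peripheral::SCB;

// Rough sketch, not Hubris code: force dirty cache lines covering `buf` out
// to memory. Outside of explicit operations like this, the hardware evicts
// dirty lines whenever it chooses, at addresses unrelated to whatever the
// program counter happens to be doing.
fn flush_buffer(scb: &mut SCB, buf: &[u8]) {
    scb.clean_dcache_by_address(buf.as_ptr() as usize, buf.len());
    cortex_m::asm::dsb(); // make sure the write-backs have completed
}
```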
Hubris utilizes the Memory Protection Unit (MPU) to provide isolation between tasks and enforce privilege levels. Our configuration uses the MPU for the unprivileged tasks but uses the default memory map for the (privileged) kernel. In the tasks, the FMC is mapped as Uncached Device Memory. Based on our reading of the STM32H7 manual, it turned out our chosen base address for the FMC bus had a default memory type of Normal Cached. This means the FMC has different attributes depending on whether it’s being accessed from a task or the kernel.
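To make the mismatch concrete, here is a hedged sketch in ARMv7-M MPU terms (standard MPU_RASR attribute encodings from the architecture manual; this is not Hubris's actual MPU configuration):

```rust
// Standard ARMv7-M MPU_RASR attribute bits (architecture manual), shown only
// to illustrate the two views of the same addresses.
const RASR_B: u32 = 1 << 16;
const RASR_C: u32 = 1 << 17;
const RASR_TEX_SHIFT: u32 = 19;

// What tasks used for the FMC window: shareable Device memory
// (TEX=0b000, C=0, B=1), never cached.
const ATTR_DEVICE: u32 = RASR_B;

// What the architectural default map assigns to the external-RAM region
// around 0x6000_0000, where the FMC banks live by default: Normal memory,
// write-back write-allocate (TEX=0b001, C=1, B=1), i.e. cacheable.
const ATTR_NORMAL_WBWA: u32 = (0b001 << RASR_TEX_SHIFT) | RASR_C | RASR_B;
```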
Section A3.5.7 of the ARMv7-M reference manual is devoted to mismatched memory attributes and what properties are lost in this situation. Based on discussion with our hardware engineers, the line "Preservation of the size of accesses" was the most suspicious. Our FPGA interface was designed for 32-bit accesses, and 16-bit or 8-bit accesses could potentially cause problems.
It’s important to note that the kernel was never intentionally accessing the FMC through the Normal Cached mapping. The most likely scenario was:
- While running an unprivileged task that accesses the FMC, the CPU issues a store that makes it into the processor's store buffer
- An interrupt occurs, switching us into privileged mode, which uses the default memory map
- The store hits the cache, because the default memory map says that address is cacheable
- The cache attempts to write to memory in ways outside the expected Device Memory attributes
One of the last lines of section A3.5.7 is "Arm strongly recommends that software does not use mismatched attributes for aliases of the same location." The default ARM memory map (which the kernel relies on) assigns different attributes to different sections of the address space, and one of the sections is set up the way we want: device memory, no caching. It turns out the STM32H7 FMC supports changing its base address to appear in this section of address space, likely to avoid the specific problem we were facing. The final fix was changing the base address to the section with matching attributes. We’ve seen no instances of this issue since that fix was merged.
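For reference, the relevant architectural default-map regions look roughly like this (ARMv7-M default memory map; the specific address the FMC window was moved to is not shown here):

```rust
// ARMv7-M default memory map, abbreviated to the two regions that matter here:
// 0x6000_0000..=0x9FFF_FFFF  "External RAM"    -> Normal, cacheable by default
// 0xA000_0000..=0xDFFF_FFFF  "External device" -> Device memory, never cached
fn default_map_is_device(addr: u32) -> bool {
    (0xA000_0000..=0xDFFF_FFFF).contains(&addr)
}
```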
Transparency continues to be an Oxide value. Debugging modern CPUs often involves diving into areas with little transparency. "Under what circumstances will you be unable to access your memory bus?" is a tricky question to answer. Our debugging efforts this time were aided by documentation from ARM and STM that eventually explained our problem. Given the difficulty in debugging this issue, highlighting this potential problem in vendor documentation would benefit all customers. Oxide hopes all hardware vendors continue to document as much of their parts as possible for the benefit of their customers.