Skip to content

Conversation

@hawkw
Copy link
Member

@hawkw hawkw commented Dec 3, 2025

Currently, there is no way to programmatically access the panic message of a task which has faulted due to a Rust panic fron within the Hubris userspace. This branch adds a new read_panic_message kipc that copies the contents of a panicked task's panic message buffer into the caller. If the requested task has not panicked, this kipc returns an error indicating this. This is intended by use by supervisor implementations or other tasks which wish to report panic messages from userspace.

I've also added a test case that exercises this functionality.

Fixes #2311

@hawkw hawkw requested a review from cbiffle December 3, 2025 19:53
@hawkw hawkw force-pushed the eliza/read-panic-message branch 2 times, most recently from c4ca702 to 2c80a58 Compare December 3, 2025 19:56
Currently, there is no way to programmatically access the panic message
of a task which has faulted due to a Rust panic fron within the Hubris
userspace. This branch adds a new `read_panic_message` kipc that copies
the contents of a panicked task's panic message buffer into the caller.
If the requested task has not panicked, this kipc returns an error
indicating this. This is intended by use by supervisor implementations
or other tasks which wish to report panic messages from userspace.

I've also added a test case that exercises this functionality.

Fixes #2311
@hawkw hawkw force-pushed the eliza/read-panic-message branch from 2c80a58 to 5ba535a Compare December 3, 2025 20:27
Comment on lines 522 to 526
if index >= tasks.len() {
return Err(UserError::Unrecoverable(FaultInfo::SyscallUsage(
UsageError::TaskOutOfRange,
)));
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, something like

Suggested change
if index >= tasks.len() {
return Err(UserError::Unrecoverable(FaultInfo::SyscallUsage(
UsageError::TaskOutOfRange,
)));
}
let Some(task) = tasks.get(index) else {
return Err(UserError::Unrecoverable(FaultInfo::SyscallUsage(
UsageError::TaskOutOfRange,
)));
};

(then use task below instead of indexing repeatedly)

Co-authored-by: Matt Keeter <matt@oxide.computer>
@hawkw hawkw added kernel Relates to the Hubris kernel userlib Related to userlib, the fundamental library used by tasks fault-management Everything related to the Oxide's Fault Management architecture implementation labels Dec 4, 2025
@hawkw hawkw self-assigned this Dec 4, 2025
Comment on lines +529 to +533
let TaskState::Faulted {
fault: FaultInfo::Panic,
..
} = task.state()
else {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

huuuh, I've never seen a let/else used in a way that doesn't bind any names like this. I assume you prefer this over if matches!?

Copy link
Member Author

@hawkw hawkw Dec 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it just felt a little bit better than

if !matches!(task.state(), TaskState::Faulted { fault: FaultInfo::Panic, .. })) {
    return ...;
}

for some vague aesthetic reason. But it should be equivalent.

..
} = task.state()
else {
return Err(UserError::Recoverable(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have any code that uses this API yet? I suspect this error is going to be unwrap()'d, in which case we might want to eliminate it and have the kernel fault the caller if they got the task state wrong. (Which is to say, switch this to Unrecoverable.)

FWIW all the other errors in this implementation look right to me.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I was imagining we might see this error in the event that a task responsible for collecting panic messages (i.e., packrat) gets pinged about a panic, but doesn't actually get to run before a timer in jefe elapses and the task gets restarted. In that case, the task would no longer have a panic message to share, and packrat (or whomever wants the panic message) has just missed its chance to read the panic. I would definitely not want to panic packrat in that case, so I felt like this error ought to be handle-able without panicking.

If we don't end up implementing that behavior in the supervisor, and instead make it wait to be informed that a panic message has been collected before restarting the faulted task, it might make more sense to make this case a fault for the caller. But, the approach we discussed in #2309 (comment) made me feel inclined to go down the path of treating the restart cooldown in jefe as a timeout for collecting panic messages, so I was planning to do that (so that packrat or similar can't block other tasks from restarting indefinitely). And, even if we did decide to make jefe wait for panic messages to be recorded before restarting a task, making this a handleable error allows more flexibility for other (theoretical) supervisor implementations to do other things.

// using the `userlib::ipc::read_panic_message()` wrapper, then the caller's
// buffer will always be exactly 128 bytes long. However, we can't rely on
// that here, as either task *could* be an arbitrary binary that wasn't
// compiled with the Hubris userlib, so we need to be safe regardless.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(agreed)

///
/// Note that Hubris only preserves the first [`PANIC_MESSAGE_MAX_LEN`] bytes of
/// a task's panic message, and panic messages greater than that length are
/// truncated. Thus, this function accepts a buffer of that length.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wanted to note that we could most likely loosen this later, if required, by changing this to a &mut [u8] and not breaking callers.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, that was my thinking as well.

///
/// - [`Ok`]`(&[u8])` if the task is panicked. The returned slice is borrowed
/// from `buf`, and contains the task's panic message as a sequence of
/// UTF-8 bytes. Note that the slice may be empty, if the task has panicked
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically, nothing ensures that the contents of that slice are UTF-8, so I wouldn't advertise that. The panic message is normally expected to be UTF-8 but it may be truncated in the middle of a multi-byte sequence, and the task is panicking so strictly speaking it could have stomped all over its RAM, producing garbage.

This is another case where seeing a caller might clarify the API design -- it might make sense to call utf8_chunks on the buffer and return the iterator that results so that the caller can easily step over valid and invalid sections if desired. Or, callers may just byte-copy the result into CBOR and not parse it at all (in which case it's quite important that it not hit CBOR as UTF-8!).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. So, I was planning to probably just encode the valid regions as a CBOR string in the caller which I haven't written yet. But, I had a vague sense that the KIPC API ought not make too many decisions about what the caller wants; I had kind of been considering having a higher-level read_panic_message_utf8 or something that just gives you the valid UTF-8 regions. The sense I got from the other KIPC APIs is that they don't tend to do much beyond what's necessary to actually send and receive the IPC from the kernel, so I wanted to be consistent with that.

On the other hand, I suspect basically all callers are going to want to get a string out and not want anything that isn't UTF-8, so maybe this API should just do the higher-level behavior. I think you're starting to sell me on that.

.unwrap();
// it should look kinda like a panic message (but since the line number may
// change, don't make assertions about the entire contents of the string...
assert!(core::str::from_utf8(msg)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure the expect here is right -- assuming the assistant panics with an ASCII-only message (which it likely does) you could still get this test to fail by running it in a deep path containing non-ASCII characters.

Since in the end you're only checking the prefix of the string, you can get the valid UTF-8 prefix using

msg.utf8_chunks().next().unwrap().valid()

(the unwrap() there is an unfortunate detail of the utf8_chunks implementation, which is defined as returning an iterator yielding at least one chunk... but the type system can't see that.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that's a good point, thanks!

Co-authored-by: Cliff L. Biffle <cliff@oxide.computer>
};

let Ok(message) = task.save().as_panic_args().message else {
return Err(UserError::Recoverable(
Copy link
Collaborator

@cbiffle cbiffle Dec 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(In case anyone's curious, I was musing on how this could fail in practice. Because we're getting a USlice<u8>, it can't be misaligned. In practice, the only "invalid" combination that will trigger this specific check is: a slice that spans the end of the address space, such that base + len would overflow.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should probably write that down here, huh?

hawkw and others added 4 commits December 29, 2025 17:04
Co-authored-by: Cliff L. Biffle <cliff@oxide.computer>
Co-authored-by: Cliff L. Biffle <cliff@oxide.computer>
Co-authored-by: Cliff L. Biffle <cliff@oxide.computer>
Co-authored-by: Cliff L. Biffle <cliff@oxide.computer>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

fault-management Everything related to the Oxide's Fault Management architecture implementation kernel Relates to the Hubris kernel userlib Related to userlib, the fundamental library used by tasks

Projects

None yet

Development

Successfully merging this pull request may close these issues.

want some kinda way for Jefe to actually know things about panic messages

5 participants