-
Notifications
You must be signed in to change notification settings - Fork 216
Add read_panic_message kipc
#2313
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
c4ca702 to
2c80a58
Compare
Currently, there is no way to programmatically access the panic message of a task which has faulted due to a Rust panic fron within the Hubris userspace. This branch adds a new `read_panic_message` kipc that copies the contents of a panicked task's panic message buffer into the caller. If the requested task has not panicked, this kipc returns an error indicating this. This is intended by use by supervisor implementations or other tasks which wish to report panic messages from userspace. I've also added a test case that exercises this functionality. Fixes #2311
2c80a58 to
5ba535a
Compare
sys/kern/src/kipc.rs
Outdated
| if index >= tasks.len() { | ||
| return Err(UserError::Unrecoverable(FaultInfo::SyscallUsage( | ||
| UsageError::TaskOutOfRange, | ||
| ))); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly, something like
| if index >= tasks.len() { | |
| return Err(UserError::Unrecoverable(FaultInfo::SyscallUsage( | |
| UsageError::TaskOutOfRange, | |
| ))); | |
| } | |
| let Some(task) = tasks.get(index) else { | |
| return Err(UserError::Unrecoverable(FaultInfo::SyscallUsage( | |
| UsageError::TaskOutOfRange, | |
| ))); | |
| }; |
(then use task below instead of indexing repeatedly)
Co-authored-by: Matt Keeter <matt@oxide.computer>
| let TaskState::Faulted { | ||
| fault: FaultInfo::Panic, | ||
| .. | ||
| } = task.state() | ||
| else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
huuuh, I've never seen a let/else used in a way that doesn't bind any names like this. I assume you prefer this over if matches!?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it just felt a little bit better than
if !matches!(task.state(), TaskState::Faulted { fault: FaultInfo::Panic, .. })) {
return ...;
}for some vague aesthetic reason. But it should be equivalent.
| .. | ||
| } = task.state() | ||
| else { | ||
| return Err(UserError::Recoverable( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have any code that uses this API yet? I suspect this error is going to be unwrap()'d, in which case we might want to eliminate it and have the kernel fault the caller if they got the task state wrong. (Which is to say, switch this to Unrecoverable.)
FWIW all the other errors in this implementation look right to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I was imagining we might see this error in the event that a task responsible for collecting panic messages (i.e., packrat) gets pinged about a panic, but doesn't actually get to run before a timer in jefe elapses and the task gets restarted. In that case, the task would no longer have a panic message to share, and packrat (or whomever wants the panic message) has just missed its chance to read the panic. I would definitely not want to panic packrat in that case, so I felt like this error ought to be handle-able without panicking.
If we don't end up implementing that behavior in the supervisor, and instead make it wait to be informed that a panic message has been collected before restarting the faulted task, it might make more sense to make this case a fault for the caller. But, the approach we discussed in #2309 (comment) made me feel inclined to go down the path of treating the restart cooldown in jefe as a timeout for collecting panic messages, so I was planning to do that (so that packrat or similar can't block other tasks from restarting indefinitely). And, even if we did decide to make jefe wait for panic messages to be recorded before restarting a task, making this a handleable error allows more flexibility for other (theoretical) supervisor implementations to do other things.
| // using the `userlib::ipc::read_panic_message()` wrapper, then the caller's | ||
| // buffer will always be exactly 128 bytes long. However, we can't rely on | ||
| // that here, as either task *could* be an arbitrary binary that wasn't | ||
| // compiled with the Hubris userlib, so we need to be safe regardless. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(agreed)
| /// | ||
| /// Note that Hubris only preserves the first [`PANIC_MESSAGE_MAX_LEN`] bytes of | ||
| /// a task's panic message, and panic messages greater than that length are | ||
| /// truncated. Thus, this function accepts a buffer of that length. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just wanted to note that we could most likely loosen this later, if required, by changing this to a &mut [u8] and not breaking callers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, that was my thinking as well.
| /// | ||
| /// - [`Ok`]`(&[u8])` if the task is panicked. The returned slice is borrowed | ||
| /// from `buf`, and contains the task's panic message as a sequence of | ||
| /// UTF-8 bytes. Note that the slice may be empty, if the task has panicked |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically, nothing ensures that the contents of that slice are UTF-8, so I wouldn't advertise that. The panic message is normally expected to be UTF-8 but it may be truncated in the middle of a multi-byte sequence, and the task is panicking so strictly speaking it could have stomped all over its RAM, producing garbage.
This is another case where seeing a caller might clarify the API design -- it might make sense to call utf8_chunks on the buffer and return the iterator that results so that the caller can easily step over valid and invalid sections if desired. Or, callers may just byte-copy the result into CBOR and not parse it at all (in which case it's quite important that it not hit CBOR as UTF-8!).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. So, I was planning to probably just encode the valid regions as a CBOR string in the caller which I haven't written yet. But, I had a vague sense that the KIPC API ought not make too many decisions about what the caller wants; I had kind of been considering having a higher-level read_panic_message_utf8 or something that just gives you the valid UTF-8 regions. The sense I got from the other KIPC APIs is that they don't tend to do much beyond what's necessary to actually send and receive the IPC from the kernel, so I wanted to be consistent with that.
On the other hand, I suspect basically all callers are going to want to get a string out and not want anything that isn't UTF-8, so maybe this API should just do the higher-level behavior. I think you're starting to sell me on that.
| .unwrap(); | ||
| // it should look kinda like a panic message (but since the line number may | ||
| // change, don't make assertions about the entire contents of the string... | ||
| assert!(core::str::from_utf8(msg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure the expect here is right -- assuming the assistant panics with an ASCII-only message (which it likely does) you could still get this test to fail by running it in a deep path containing non-ASCII characters.
Since in the end you're only checking the prefix of the string, you can get the valid UTF-8 prefix using
msg.utf8_chunks().next().unwrap().valid()
(the unwrap() there is an unfortunate detail of the utf8_chunks implementation, which is defined as returning an iterator yielding at least one chunk... but the type system can't see that.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, that's a good point, thanks!
Co-authored-by: Cliff L. Biffle <cliff@oxide.computer>
| }; | ||
|
|
||
| let Ok(message) = task.save().as_panic_args().message else { | ||
| return Err(UserError::Recoverable( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(In case anyone's curious, I was musing on how this could fail in practice. Because we're getting a USlice<u8>, it can't be misaligned. In practice, the only "invalid" combination that will trigger this specific check is: a slice that spans the end of the address space, such that base + len would overflow.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should probably write that down here, huh?
Co-authored-by: Cliff L. Biffle <cliff@oxide.computer>
Co-authored-by: Cliff L. Biffle <cliff@oxide.computer>
Co-authored-by: Cliff L. Biffle <cliff@oxide.computer>
Co-authored-by: Cliff L. Biffle <cliff@oxide.computer>
Currently, there is no way to programmatically access the panic message of a task which has faulted due to a Rust panic fron within the Hubris userspace. This branch adds a new
read_panic_messagekipc that copies the contents of a panicked task's panic message buffer into the caller. If the requested task has not panicked, this kipc returns an error indicating this. This is intended by use by supervisor implementations or other tasks which wish to report panic messages from userspace.I've also added a test case that exercises this functionality.
Fixes #2311