Description
I troubleshot a system in mfg where, due to test-station SWD shenanigans, the SP was being interrupted in the critical window between enabling the Tofino power supplies and the point where it adjusts the voltage rails and acknowledges via the VidAck mechanism.
You end up with ring buffers that look like this:
humility: ring buffer drv_sidecar_seq_server::__RINGBUF in sequencer:
NDX LINE GEN COUNT PAYLOAD
16 91 1 1 FrontIOControllerIdent { fpga_id: 0x1, ident: 0x1deaa55 }
17 98 1 1 FrontIOControllerChecksum { fpga_id: 0x1, checksum: [ 0xd4, 0xaa, 0x2a, 0x16 ], expected: [ 0xd4, 0xaa, 0x2a, 0x16 ] }
18 344 1 1 TofinoSequencerTick(LatchOffOnFault, A2 { error: None })
19 154 1 1 FanModuleLedUpdate(Zero, On)
20 154 1 1 FanModuleLedUpdate(One, On)
21 154 1 1 FanModuleLedUpdate(Two, On)
22 154 1 1 FanModuleLedUpdate(Three, On)
23 344 1 3 TofinoSequencerTick(LatchOffOnFault, A2 { error: None })
24 245 1 1 FrontIOBoardPowerGood
25 328 1 1 FrontIOBoardPhyPowerEnable(true)
26 550 1 1 FrontIOBoardPhyOscGood
27 344 1 1 TofinoSequencerTick(LatchOffOnFault, A2 { error: None })
28 81 1 1 TofinoPowerUp
29 89 1 1 TofinoVidAttempt(0x0)
30 262 1 1 TofinoNoVid
31 89 1 1 TofinoVidAttempt(0x1)
0 262 2 1 TofinoNoVid
1 89 2 1 TofinoVidAttempt(0x2)
2 262 2 1 TofinoNoVid
3 89 2 1 TofinoVidAttempt(0x3)
4 262 2 1 TofinoNoVid
5 89 2 1 TofinoVidAttempt(0x4)
6 262 2 1 TofinoNoVid
7 89 2 1 TofinoVidAttempt(0x5)
8 262 2 1 TofinoNoVid
9 89 2 1 TofinoVidAttempt(0x6)
10 262 2 1 TofinoNoVid
11 89 2 1 TofinoVidAttempt(0x7)
12 262 2 1 TofinoNoVid
13 796 2 1 TofinoSequencerError(SequencerTimeout)
14 286 2 1 TofinoSequencerAbort { state: InPowerUp, step: AwaitVidAck, error: VidAckTimeout }
15 344 2 11 TofinoSequencerTick(LatchOffOnFault, A2 { error: VidAckTimeout })
This is confusing because the sequencer had already timed out when we entered iteration 0 of the loop. Better reporting for this case would have significantly sped up debugging here.
We try 8 times, but never check whether we have already timed out before entering the loop, so the trace leaves you thinking the timeout happened during the loop. In fact it occurred before the loop started; we just wait for all 8 attempts to finish before giving up and going down the error path.
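One possible shape for the improved reporting is sketched below. It is not the actual `drv_sidecar_seq_server` code: the `await_vid` function, the `already_timed_out` flag, the `VidError` type, and the `TofinoVidWindowMissed` trace entry are all hypothetical names invented for illustration, and a plain `Vec` stands in for the ring buffer. Only `TofinoVidAttempt`, `TofinoNoVid`, and the count of 8 attempts come from the trace above.

```rust
/// Trace entries. `TofinoVidAttempt` and `TofinoNoVid` mirror the names in the
/// ring buffer above; `TofinoVidWindowMissed` is a hypothetical new entry.
#[derive(Debug, Clone, PartialEq)]
enum Trace {
    TofinoVidAttempt(u8),
    TofinoNoVid,
    TofinoVidWindowMissed,
}

/// Hypothetical error type distinguishing the two failure cases.
#[derive(Debug, PartialEq)]
enum VidError {
    /// The sequencer had already timed out before the first poll.
    WindowMissedBeforePolling,
    /// All attempts were made and no VID showed up.
    VidAckTimeout,
}

/// Matches the 8 attempts (0x0 through 0x7) visible in the trace.
const MAX_VID_ATTEMPTS: u8 = 8;

/// Polls for a VID, but first checks whether the sequencer has already
/// timed out. `already_timed_out` and `read_vid` are assumptions standing in
/// for whatever the real driver reads from the sequencer.
fn await_vid(
    already_timed_out: bool,
    mut read_vid: impl FnMut() -> Option<u8>,
    trace: &mut Vec<Trace>,
) -> Result<u8, VidError> {
    // Report the missed window up front instead of polling 8 times first.
    if already_timed_out {
        trace.push(Trace::TofinoVidWindowMissed);
        return Err(VidError::WindowMissedBeforePolling);
    }

    for attempt in 0..MAX_VID_ATTEMPTS {
        trace.push(Trace::TofinoVidAttempt(attempt));
        match read_vid() {
            Some(vid) => return Ok(vid),
            None => trace.push(Trace::TofinoNoVid),
        }
    }

    Err(VidError::VidAckTimeout)
}
```

With something along these lines, the interrupted-SP case would leave a single distinct entry in the ring buffer rather than eight `TofinoVidAttempt`/`TofinoNoVid` pairs followed by a timeout, which makes it obvious that the window was missed before polling began.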