Open
Description
I found dpd hung on sled 16 on dublin. It seems to have been derailed by a PCI error during startup.
At the system level we see:
BRM23230018 # fmdump -v
TIME UUID SUNW-MSG-ID EVENT
Dec 28 00:01:37.7986 1a7d366c-fe31-48df-b247-21cc33c4ec1c SUNOS-8000-J0 Diagnosed
50% defect.sunos.eft.unexpected_telemetry
Problem in: dev:////pci@ab,0/pci1de,fff9@3,2
Affects: -
FRU: -
Location: -
50% fault.sunos.eft.unexpected_telemetry
Problem in: dev:////pci@ab,0/pci1de,fff9@3,2
Affects: -
FRU: -
Location: -
...
At the same time in the dendrite log, we see:
00:01:37.560Z DEBG dpd: Set 4ns pulse config to 0xc30c30c
module = Lld
unit = bf-sde
00:01:37.560Z DEBG dpd: Set global ts inc config to 0xc30c30c
module = Lld
unit = bf-sde
00:01:37.560Z DEBG dpd: Set global PSC inc config to 0xaab
module = Lld
unit = bf-sde
00:01:39.925Z INFO dpd: bf_device_add dev id 0, is_sw_model 0
module = Dvm
unit = bf-sde
The 2-second gap between the last two messages is unusual and corresponds with the timing of the PCI error. Interestingly, the first time dendrite reports any PCI-related issue is nearly a minute later:
00:02:37.118Z INFO dpd: Entering pipe_mgr_config_complete, dev 0
module = Pipe
unit = bf-sde
00:02:37.181Z DEBG dpd: LLD: FAULT: DMA error: dev_id=0, d0=00039a83cc00014f, d1=0000010000000016
module = Lld
unit = bf-sde
00:02:37.181Z DEBG dpd: FAULT: 3 : 0000000000000000 : 00039a83cc00014f : 0000010000000016
module = Lld
unit = bf-sde
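For anyone wanting to look for the same pattern in other dendrite logs, here is a minimal sketch that flags unusually long pauses between consecutive timestamped entries. It assumes the short `HH:MM:SS.mmmZ` timestamp format shown in the excerpts above; the `find_gaps` helper and the one-second threshold are hypothetical choices for illustration, not part of dpd or its tooling. It does not handle logs that wrap past midnight.

```python
import re
from datetime import datetime, timedelta

# Matches the short timestamp at the start of each log entry, e.g. "00:01:37.560Z".
TS_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{3})Z\s+(.*)$")


def find_gaps(lines, threshold=timedelta(seconds=1)):
    """Return (gap, previous_entry, next_entry) for each pause longer than threshold.

    Hypothetical helper: assumes every entry starts with an HH:MM:SS.mmmZ timestamp
    and that the log does not cross midnight.
    """
    gaps = []
    prev_ts = prev_msg = None
    for line in lines:
        m = TS_RE.match(line.strip())
        if not m:
            continue  # continuation lines such as "module = Lld"
        ts = datetime.strptime(m.group(1), "%H:%M:%S.%f")
        if prev_ts is not None and ts - prev_ts > threshold:
            gaps.append((ts - prev_ts, prev_msg, m.group(2)))
        prev_ts, prev_msg = ts, m.group(2)
    return gaps


# Entries copied from the log excerpts in this issue (trimmed).
SAMPLE = """\
00:01:37.560Z DEBG dpd: Set global PSC inc config to 0xaab
    module = Lld
00:01:39.925Z INFO dpd: bf_device_add dev id 0, is_sw_model 0
    module = Dvm
00:02:37.118Z INFO dpd: Entering pipe_mgr_config_complete, dev 0
00:02:37.181Z DEBG dpd: LLD: FAULT: DMA error: dev_id=0, d0=00039a83cc00014f, d1=0000010000000016
""".splitlines()

for gap, before, after in find_gaps(SAMPLE):
    print(f"{gap.total_seconds():.3f}s between {before!r} and {after!r}")
```

Run against the entries above, this flags two pauses: the roughly 2.4-second gap before `bf_device_add` and the roughly 57-second gap before `pipe_mgr_config_complete`.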