
tofino-related PCI error #173

@Nieuwejaar

I found dpd hung on sled 16 on dublin. It seems to have been derailed by a PCI error during startup.

At the system level we see:

BRM23230018 # fmdump -v
TIME                 UUID                                 SUNW-MSG-ID EVENT
Dec 28 00:01:37.7986 1a7d366c-fe31-48df-b247-21cc33c4ec1c SUNOS-8000-J0 Diagnosed
   50%  defect.sunos.eft.unexpected_telemetry

        Problem in: dev:////pci@ab,0/pci1de,fff9@3,2
           Affects: -
               FRU: -
          Location: -

   50%  fault.sunos.eft.unexpected_telemetry

        Problem in: dev:////pci@ab,0/pci1de,fff9@3,2
           Affects: -
               FRU: -
          Location: -
...

At the same time in the dendrite log, we see:

00:01:37.560Z DEBG dpd: Set 4ns pulse config to 0xc30c30c
    module = Lld
    unit = bf-sde
00:01:37.560Z DEBG dpd: Set global ts inc config to 0xc30c30c
    module = Lld
    unit = bf-sde
00:01:37.560Z DEBG dpd: Set global PSC inc config to 0xaab
    module = Lld
    unit = bf-sde
00:01:39.925Z INFO dpd: bf_device_add dev id 0, is_sw_model 0
    module = Dvm
    unit = bf-sde

The roughly 2-second gap between the last two messages is unusual and corresponds with the timing of the PCI error. Interestingly, the first time dendrite reports any PCI-related issue is nearly a minute later:

00:02:37.118Z INFO dpd: Entering pipe_mgr_config_complete, dev 0
    module = Pipe
    unit = bf-sde
00:02:37.181Z DEBG dpd: LLD: FAULT: DMA error: dev_id=0, d0=00039a83cc00014f, d1=0000010000000016
    module = Lld
    unit = bf-sde
00:02:37.181Z DEBG dpd: FAULT: 3 : 0000000000000000 : 00039a83cc00014f : 0000010000000016
    module = Lld
    unit = bf-sde
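
For anyone wanting to double-check the timing, here is a minimal Python sketch that computes the intervals from the timestamps quoted above. It assumes the `fmdump` timestamp and the dendrite log timestamps are on the same clock and in the same timezone, which the output above does not confirm. The helper `parse_hms` and the variable names are made up for illustration and are not part of dendrite or the SDE.

```python
from datetime import timedelta


def parse_hms(ts: str) -> timedelta:
    """Parse 'HH:MM:SS.frac' (optional trailing 'Z') into time since midnight."""
    hms, _, frac = ts.rstrip("Z").partition(".")
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    # Pad or truncate the fractional part to microsecond precision.
    micros = int(frac.ljust(6, "0")[:6]) if frac else 0
    return timedelta(hours=hours, minutes=minutes, seconds=seconds,
                     microseconds=micros)


# Timestamps copied from the fmdump output and the dendrite log above.
last_lld_msg = parse_hms("00:01:37.560Z")   # "Set global PSC inc config"
device_add = parse_hms("00:01:39.925Z")     # "bf_device_add dev id 0"
fma_diagnosed = parse_hms("00:01:37.7986")  # fmdump "Diagnosed" event
dma_fault = parse_hms("00:02:37.181Z")      # "LLD: FAULT: DMA error"

# Gap in the dendrite log during startup.
startup_gap = (device_add - last_lld_msg).total_seconds()
# Delay between the FMA diagnosis and the first fault dendrite reports.
fault_delay = (dma_fault - fma_diagnosed).total_seconds()
# Assumption: both sources share a clock, so they can be compared directly.
diagnosis_in_gap = last_lld_msg <= fma_diagnosed <= device_add

print(f"startup gap: {startup_gap:.3f}s")
print(f"fault reported after diagnosis: {fault_delay:.3f}s")
print(f"diagnosis falls inside the gap: {diagnosis_in_gap}")
```

Under that same-clock assumption, the gap is about 2.4 seconds, the FMA diagnosis falls inside it, and the first DMA fault in the dendrite log shows up about 59.4 seconds after the diagnosis.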
