
I2C hang on Berlin #2261

@mkeeter

This is summarizing work done by Alan, John, and Laura (among others).

After a Nexus-driven update on Berlin, one of the SPs is unreachable by the control plane (i.e. the control plane cannot talk to the SP's control_plane_agent task).

Other tasks are still reachable over the network, so Alan got a dump (at /staff/alan/berlin-sp-undiscovered/hubris.core.0).

Here's the task list:

john@castle ~ $ humility -d /staff/alan/berlin-sp-undiscovered/hubris.core.0 tasks
humility: attached to dump
system time = 10797477
ID TASK                       GEN PRI STATE
 0 jefe                         0   0 recv, notif: fault timer(T+23)
 1 net                          0   5 recv, notif: eth-irq(irq61) wake-timer(T+300)
 2 sys                          0   1 recv, notif: exti-wildcard-irq(irq6/irq7/irq8/irq9/irq10/irq23/irq40)
 3 spi2_driver                  0   3 recv
 4 i2c_driver                   0   3 notif: i2c2-irq(irq33/irq34)
 5 spd                          0   2 notif: i2c1-irq(irq31/irq32)
 6 packrat                      0   1 recv
 7 thermal                      0   5 wait: reply from i2c_driver/gen0
 8 power                        0   6 wait: send to i2c_driver/gen0
 9 hiffy                        0   5 notif: bit31(T+190)
10 gimlet_seq                   0   4 recv, notif: timer(T+28) vcore
11 gimlet_inspector             0   6 notif: socket
12 hash_driver                  0   2 recv
13 rng_driver                   0   6 recv
14 hf                           0   3 recv, notif: timer
15 update_server                0   3 recv
16 sensor                       0   4 recv
17 host_sp_comms                0   8 recv, notif: jefe-state-change usart-irq(irq82) multitimer control-plane-agent
18 udpecho                      0   6 notif: socket
19 udpbroadcast                 0   6 notif: bit31(T+350)
20 control_plane_agent          0   7 wait: reply from validate/gen0
21 sprot                        0   4 notif: rot-irq timer(T+993)
22 validate                     0   5 wait: send to i2c_driver/gen0
23 vpd                          0   4 recv
24 user_leds                    0   2 recv, notif: timer
25 dump_agent                   0   6 wait: reply from sprot/gen0
26 snitch                       0   6 notif: socket
27 sbrmi                        0   4 recv
28 idle                         0   9 RUNNING

Note that control_plane_agent is waiting on validate, which is waiting on i2c_driver. Indeed, three tasks (validate, thermal, power) are waiting on i2c_driver, which is in turn waiting for a hardware IRQ notification (i2c2-irq).
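
The wait chain described above can be traced mechanically from the task list. The following is a minimal sketch, not part of humility: it assumes the column layout shown in the `humility tasks` output above, and the helper names (`parse_tasks`, `wait_chain`, `blocked_on`) are hypothetical.

```python
import re

# A few rows copied verbatim from the `humility tasks` output above.
TASKS = """\
 4 i2c_driver                   0   3 notif: i2c2-irq(irq33/irq34)
 7 thermal                      0   5 wait: reply from i2c_driver/gen0
 8 power                        0   6 wait: send to i2c_driver/gen0
20 control_plane_agent          0   7 wait: reply from validate/gen0
21 sprot                        0   4 notif: rot-irq timer(T+993)
22 validate                     0   5 wait: send to i2c_driver/gen0
25 dump_agent                   0   6 wait: reply from sprot/gen0
"""

# Columns: ID, TASK, GEN, PRI, then the free-form STATE text.
ROW = re.compile(r"^\s*(\d+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(.*)$")
# A task blocked in send or awaiting a reply names the task it waits on.
WAIT = re.compile(r"wait: (?:reply from|send to) (\w+)/gen\d+")


def parse_tasks(text):
    """Return a dict mapping task name -> state string."""
    states = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            states[m.group(2)] = m.group(5)
    return states


def wait_chain(states, start):
    """Follow 'waiting on' edges from `start` until a task is not waiting."""
    chain, seen = [start], {start}
    while True:
        m = WAIT.search(states.get(chain[-1], ""))
        if not m or m.group(1) in seen:  # end of chain, or a cycle
            return chain
        chain.append(m.group(1))
        seen.add(m.group(1))


def blocked_on(states, target):
    """Return the sorted names of tasks waiting directly on `target`."""
    out = []
    for name, state in states.items():
        m = WAIT.search(state)
        if m and m.group(1) == target:
            out.append(name)
    return sorted(out)


STATES = parse_tasks(TASKS)
print(" -> ".join(wait_chain(STATES, "control_plane_agent")))
print(blocked_on(STATES, "i2c_driver"))
```

The chain ends at i2c_driver because its state is a notification wait (`notif: i2c2-irq`) rather than a wait on another task.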

This is the 1.0.47 release, git commit 87124929:

871249297 (tag: all-sp-v1.0.47) host-sp-messages: don't copy that floppy^H^H^H^H^H^HInventoryData (#2253)
674e868ef Update lockfile from #2250 (#2254)
e2ecf169e ADM1273 support (#2250)
9c40dcce1 Update sparse registry hash for new toolchain (#2248)
e744ccf5d stm32xx-gpio-common: fix another glitch
e2ad457b0 stm32xx-i2c: flush txdr on NACK
43e13b955 jefe: fix comparison inversion in timeout handling
7b66c2227 stm32xx-sys: fix glitch on initial pin configuration
2a5c2b9d4 Disable SWD pins when debugger is connected (#2228)
caf0cd888 Use `NotificationBits` in Idol (#2233)
ef8e9acbb Add `NotificationBits` type; use it in `sys_recv_notification` (#2232)
e19947055 stm32xx-i2c: remove soft timeout, LostInterrupt
c07b417b1 bump `toml` and `toml_edit` (#2234)
0d9a6615b Bump idol to incorporate lease count check fix.
6affe395a Fix code that assumes timer notification => timer fired (#2230)
74a9279a0 psc_seq: rectifier ereports (#2214)
7ccf45982 cpu_seq: standardize ereport naming (#2231)

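Notably, this release has a handful of I2C changes, e.g. e2ad457b0 (stm32xx-i2c: flush txdr on NACK) and e19947055 (stm32xx-i2c: remove soft timeout, LostInterrupt)!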

Here are the I2C ringbufs:

humility: ring buffer drv_stm32xx_i2c::__RINGBUF in i2c_driver:
   TOTAL VARIANT
   33588 Read
   21366 Write
   11308 WriteWait
    8606 Wait
    6036 ReadWait
       4 Reset
 NDX LINE      GEN    COUNT PAYLOAD
  33  578     1033        3 WriteWait(ISR, 0x8021)
  34  578     1033        1 WriteWait(ISR, 0x8061)
  35  645     1033        2 Read(ISR, 0x8021)
  36  645     1033        1 Read(ISR, 0x8025)
  37  691     1033        3 ReadWait(ISR, 0x8021)
  38  691     1033        1 ReadWait(ISR, 0x8061)
  39  461     1033        1 Wait(ISR, 0x21)
  40  546     1033        1 Write(ISR, 0x21)
  41  546     1033        2 Write(ISR, 0x8021)
  42  546     1033        1 Write(ISR, 0x8023)
  43  578     1033        1 WriteWait(ISR, 0x8020)
  44  578     1033        2 WriteWait(ISR, 0x8021)
  45  578     1033        1 WriteWait(ISR, 0x8061)
  46  461     1033        9 Wait(ISR, 0x8021)
  47  461     1033        1 Wait(ISR, 0x21)
   0  546     1034        3 Write(ISR, 0x21)
   1  546     1034        1 Write(ISR, 0x8023)
   2  578     1034        1 WriteWait(ISR, 0x8020)
   3  578     1034        2 WriteWait(ISR, 0x8021)
   4  578     1034        1 WriteWait(ISR, 0x8061)
   5  645     1034        2 Read(ISR, 0x8021)
   6  645     1034        1 Read(ISR, 0x8025)
   7  645     1034        2 Read(ISR, 0x8021)
   8  645     1034        1 Read(ISR, 0x8025)
   9  645     1034        2 Read(ISR, 0x8021)
  10  645     1034        1 Read(ISR, 0x8025)
  11  645     1034        2 Read(ISR, 0x8021)
  12  645     1034        1 Read(ISR, 0x8025)
  13  645     1034        2 Read(ISR, 0x8021)
  14  645     1034        1 Read(ISR, 0x8025)
  15  645     1034        2 Read(ISR, 0x8021)
  16  645     1034        1 Read(ISR, 0x8025)
  17  645     1034        2 Read(ISR, 0x8021)
  18  645     1034        1 Read(ISR, 0x8025)
  19  645     1034        2 Read(ISR, 0x8021)
  20  645     1034        1 Read(ISR, 0x8025)
  21  691     1034        2 ReadWait(ISR, 0x8021)
  22  691     1034        1 ReadWait(ISR, 0x8061)
  23  461     1034        1 Wait(ISR, 0x21)
  24  546     1034        1 Write(ISR, 0x21)
  25  546     1034        2 Write(ISR, 0x8021)
  26  546     1034        1 Write(ISR, 0x8023)
  27  578     1034        1 WriteWait(ISR, 0x8020)
  28  578     1034        2 WriteWait(ISR, 0x8021)
  29  578     1034        1 WriteWait(ISR, 0x8061)
  30  461     1034       10 Wait(ISR, 0x8021)
  31  461     1034        1 Wait(ISR, 0x21)
  32  546     1034        3 Write(ISR, 0x8021)
humility: ring buffer drv_stm32xx_i2c_server::__RINGBUF in i2c_driver:
 NDX LINE      GEN    COUNT PAYLOAD
   0  670        1        4 Wiggles(0x0)
   1  484        1        1 Error(0x6a, BusLocked)
   2  487        1        1 SegmentOnError((M1, S1))
   3  670        1        1 Wiggles(0x0)
   4  251        1        1 Reset((I2C2, PortIndex(0x0)))
   5  261        1        1 ResetMux(0x73)
   6  189        1        1 MuxUnknownRecover((I2C2, PortIndex(0x0)))
   7  484        1        1 Error(0x6a, BusLocked)
   8  487        1        1 SegmentOnError((M1, S2))
   9  670        1        1 Wiggles(0x0)
  10  251        1        1 Reset((I2C2, PortIndex(0x0)))
  11  261        1        1 ResetMux(0x73)
  12  189        1        1 MuxUnknownRecover((I2C2, PortIndex(0x0)))

Nothing is changing here; connecting remotely now (hours later) shows the same ringbuf values.
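
To make the ISR payloads in the drv_stm32xx_i2c ring buffer easier to read, here is a small decoding sketch. The bit positions are an assumption taken from the I2C_ISR register description in the STM32H7 reference manual, not from this dump or from the driver source, and `decode_isr` is a hypothetical helper.

```python
# Flag bit positions in the STM32 I2C_ISR register.
# ASSUMPTION: taken from the STM32H7 reference manual, not from the dump.
ISR_BITS = {
    0: "TXE", 1: "TXIS", 2: "RXNE", 3: "ADDR", 4: "NACKF", 5: "STOPF",
    6: "TC", 7: "TCR", 8: "BERR", 9: "ARLO", 10: "OVR", 11: "PECERR",
    12: "TIMEOUT", 13: "ALERT", 15: "BUSY", 16: "DIR",
}


def decode_isr(value):
    """Return the names of the flags set in `value`, lowest bit first."""
    return [name for bit, name in sorted(ISR_BITS.items()) if value & (1 << bit)]


# The distinct ISR values that appear in the ring buffer above.
for v in (0x21, 0x8020, 0x8021, 0x8023, 0x8025, 0x8061):
    print(f"{v:#06x}: {' | '.join(decode_isr(v))}")
```

Under that assumed layout, the final entry, `Write(ISR, 0x8021)`, decodes to TXE, STOPF, and BUSY, with TXIS clear.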

I'm not yet sure why the I2C task is hanging; more to come...
