  1. Core Server
  2. SERVER-75205

Deadlock between stepdown and restoring locks after yielding when all read tickets exhausted

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker - P1
    • Fix Version/s: 7.0.0-rc0, 4.4.20, 5.0.16, 6.0.6, 6.3.0-rc3
    • Affects Version/s: 6.0.0, 4.4.15, 5.0.10, 6.3.0-rc2
    • Component/s: None
    • None
    • Fully Compatible
    • ALL
    • v6.3
    • Execution Team 2023-04-03

      After yielding, operations restore their lock state via restoreLockState. This function iterates over each lock that was previously held and tries to reacquire it in sorted order. However, it never tries to reacquire the FCV (feature compatibility version) lock, which should be reacquired right after the PBWM (parallel batch writer mode) lock. When it then tries to reacquire the RSTL (replication state transition lock), the check fails because the next lock in the sorted list is actually the FCV lock, which was never checked for. As a result, it acquires the global lock (including acquiring a read ticket) without holding either the FCV lock or the RSTL.

      Once that is done, it reacquires all the other locks that were held, which in this case includes the RSTL (but now out of order, after the global lock and its read ticket).
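
      The ordering problem described above can be illustrated with a small model. This is only a sketch, not the server's restoreLockState code: the lock names, the LOCK_ORDER list, and the restore_order function are hypothetical, and the order PBWM, FCV, RSTL, Global is inferred from the description in this ticket.

```python
# Hypothetical model of the restore order; not the real restoreLockState.
# Assumed special-lock order, inferred from the ticket text.
LOCK_ORDER = ["PBWM", "FCV", "RSTL", "Global"]


def rank(name):
    """Sort key: special locks first in LOCK_ORDER, everything else after."""
    return LOCK_ORDER.index(name) if name in LOCK_ORDER else len(LOCK_ORDER)


def restore_order(held, account_for_fcv):
    """Return the order in which the saved locks would be reacquired.

    Assumes `held` contains "Global". With account_for_fcv=False the FCV
    lock is never checked for, so the RSTL check sees FCV and fails.
    """
    pending = sorted(held, key=rank)  # stable sort keeps other locks in order
    order = []
    i = 0
    for special in ("PBWM", "FCV", "RSTL"):
        if special == "FCV" and not account_for_fcv:
            continue  # the described bug: no check for the FCV lock
        if i < len(pending) and pending[i] == special:
            order.append(special)
            i += 1
    # The global lock (and with it a read ticket) is acquired next.
    order.append("Global")
    # Finally, every remaining saved lock is reacquired.
    order += [name for name in pending[i:] if name != "Global"]
    return order


held = ["Global", "RSTL", "FCV", "PBWM", "Collection"]
print(restore_order(held, account_for_fcv=False))
print(restore_order(held, account_for_fcv=True))
```

      In this model, skipping the FCV check puts the global lock ahead of both the FCV lock and the RSTL, while accounting for it restores the sorted order.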

      When the stepdown thread starts, it enqueues the RSTL in X mode, which jumps to the top of the queue. At the same time, there will be operations that hold the RSTL in IX mode but are waiting to acquire read tickets, which prevents the stepdown thread from proceeding. If all read tickets in the system are exhausted, these threads are stuck waiting while holding the RSTL, and the threads holding the read tickets cannot progress because they are queued behind the stepdown thread, waiting for the RSTL.
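
      The circular wait described above can be checked with a wait-for graph. The sketch below is illustrative only: the thread names and the find_cycle helper are hypothetical, and each edge means "this thread is blocked on something that thread holds or is queued ahead for".

```python
def find_cycle(wait_for):
    """Return a list of nodes forming a cycle in the wait-for graph, or None."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {}
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for nxt in wait_for.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GREY:
                # Back edge to a node on the current path: a cycle.
                return path[path.index(nxt):]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in list(wait_for):
        if color.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# Hypothetical threads modelling the scenario in this ticket.
buggy = {
    # Stepdown wants the RSTL in X mode; a reader holds it in IX mode.
    "stepdown": ["reader_with_rstl"],
    # That reader holds the RSTL but needs a read ticket; all are taken.
    "reader_with_rstl": ["reader_with_ticket"],
    # The ticket holder restored locks out of order and now waits for the
    # RSTL behind the stepdown thread's queued X request.
    "reader_with_ticket": ["stepdown"],
}

# If no operation holds a ticket while waiting for the RSTL, the last
# edge disappears and the ticket holder can finish and release its ticket.
fixed = dict(buggy, reader_with_ticket=[])

print(find_cycle(buggy))
print(find_cycle(fixed))
```

      In this model the cycle exists only because a ticket holder waits for the RSTL, which is the out-of-order acquisition described earlier.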

      There is also a variation of this that can happen on step up when we are holding the RSTL and waiting on ticket acquisition. 

      We should be accounting for the FCV lock when we restore locks.

            Assignee:
            matt.kneiser@mongodb.com Matt Kneiser
            Reporter:
            samy.lanka@mongodb.com Samyukta Lanka
            Votes:
            0
            Watchers:
            45

              Created:
              Updated:
              Resolved: