Snek Stall Post Mortem

How Basilisk stopped at block #2 145 599 and how we'll fix it

Basilisk is now live

On Nov 7, 2022, the Basilisk parachain stopped producing blocks at block #2 145 599. The core team was notified about this incident by multiple sources, including our own monitoring. We quickly gathered an investigation team and started digging. We initially assumed that the issue was specific to the relaychain database, because we have a multistep CI, review, and release process which aims to eliminate this kind of issue. Unfortunately, it was not bulletproof. So what happened, and why didn't we catch it?

Finding the issue

2022-11-07 18:45:00 [Parachain] Starting collation. relay_parent=0x1eb0805de9f0a253837741054f561147da3da2198612bc9804dc446a68bd045f at=0x992a3739d4ccd00a18897a79504f5ec9b2636ac7eca14433089778aafba0ea33
2022-11-07 18:45:00 [Parachain] 🙌 Starting consensus session on top of parent 0x992a3739d4ccd00a18897a79504f5ec9b2636ac7eca14433089778aafba0ea33
2022-11-07 18:45:00 [Parachain] assembling new collators for new session 1789 at #2145600
2022-11-07 18:45:00 [Parachain] Corrupted state at [21, 70, 76, 172, 51, 120, 212, 111, 17, 60, 213, 183, 164, 215, 28, 132, 85, 121, 41, 127, 77, 251, 150, 9, 231, 228, 194, 235, 171, 156, 228, 10]
2022-11-07 18:45:00 [Parachain] panicked at 'attempt to calculate the remainder with a divisor of zero', /home/mrq/.cargo/git/checkouts/substrate-f4423eebfa0046a3/cc370aa/frame/aura/src/lib.rs:254:36
2022-11-07 18:45:00 [Parachain] Proposing failed: Import failed: Error at calling runtime api: Execution failed: Execution aborted due to trap: wasm trap: wasm `unreachable` instruction executed
WASM backtrace:
     0: 0x3f15c5 - <unknown>!rust_begin_unwind
     1: 0x39bf63 - <unknown>!core::panicking::panic_fmt::hd28c67e025192d49
     2: 0x39bf23 - <unknown>!core::panicking::panic::he0ff776486355274
     3: 0x2dda5f - <unknown>!pallet_aura::<impl frame_support::traits::validation::FindAuthor<u32> for pallet_aura::pallet::Pallet<T>>::find_author::h9d1dc04fd02a7314
     4: 0x33fba8 - <unknown>!<pallet_session::FindAccountFromAuthorIndex<T,Inner> as frame_support::traits::validation::FindAuthor<<T as pallet_session::pallet::Config>::ValidatorId>>::find_author::h1abd210146362e9c
     5: 0x3814ef - <unknown>!<pallet_authorship::pallet::Pallet<T> as frame_support::traits::hooks::OnInitialize<<T as frame_system::pallet::Config>::BlockNumber>>::on_initialize::hed178f454f568958
     6: 0x238948 - <unknown>!<(TupleElement0,TupleElement1,TupleElement2,TupleElement3,TupleElement4,TupleElement5,TupleElement6,TupleElement7,TupleElement8,TupleElement9,TupleElement10,TupleElement11,TupleElement12,TupleEl>
     7: 0x2f3d14 - <unknown>!frame_executive::Executive<System,Block,Context,UnsignedValidator,AllPalletsWithSystem,COnRuntimeUpgrade>::initialize_block::hca26bd80267605ff
     8: 0x19c584 - <unknown>!Core_initialize_block

In this log, we can immediately see that a new session is starting and that we hit a panic in the Aura pallet. Upon closer investigation, we can see that this is caused by the empty storage of session.queuedKeys. But why didn't this happen on our testnet and on Rococo, where we tested this upgrade beforehand?
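
Before getting to that question, here is a minimal sketch of why an empty authority set produces exactly this panic. This is our paraphrase for illustration, not the verbatim pallet_aura code: Aura derives the block author as the current slot modulo the number of authorities, so an empty set means a remainder by zero.

// Paraphrased sketch of the author selection behind pallet_aura's FindAuthor.
fn author_index(slot: u64, authorities_len: u64) -> u64 {
    // Panics when authorities_len == 0, which is exactly the trap in the log.
    slot % authorities_len
}

fn main() {
    // With a healthy authority set this is fine...
    assert_eq!(author_index(2_145_600, 5), 0);
    // ...but with session.queuedKeys wiped the set is empty, and this call
    // panics with "attempt to calculate the remainder with a divisor of zero".
    let _ = author_index(2_145_600, 0);
}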

Part of the answer to why mainnet behaved differently can be found in this log line:

Corrupted state at [21, 70, 76, 172, 51, 120, 212, 111, 17, 60, 213, 183, 164, 215, 28, 132, 85, 121, 41, 127, 77, 251, 150, 9, 231, 228, 194, 235, 171, 156, 228, 10]

This translates to the storage key 0x15464cac3378d46f113cd5b7a4d71c845579297f4dfb9609e7e4c2ebab9ce40a, which refers to the storage of collatorSelection.invulnerables. This log line actually appeared earlier, right after the start of the first session following the upgrade to the latest version of Substrate.
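
As a sanity check, the raw bytes in the "Corrupted state" message are simply that storage key: a plain StorageValue lives under twox_128(pallet_name) ++ twox_128(storage_item) in Substrate, so the mapping can be verified with a few lines of Rust (using sp_core and the hex crate):

// Verifies that the "Corrupted state at [...]" bytes are the storage key of
// collatorSelection.invulnerables, using the standard Substrate key layout.
use sp_core::hashing::twox_128;

fn main() {
    let corrupted: [u8; 32] = [
        21, 70, 76, 172, 51, 120, 212, 111, 17, 60, 213, 183, 164, 215, 28, 132,
        85, 121, 41, 127, 77, 251, 150, 9, 231, 228, 194, 235, 171, 156, 228, 10,
    ];

    // twox_128("CollatorSelection") ++ twox_128("Invulnerables")
    let mut key = twox_128(b"CollatorSelection").to_vec();
    key.extend_from_slice(&twox_128(b"Invulnerables"));
    assert_eq!(key, corrupted.to_vec());

    // Prints 0x15464cac3378d46f113cd5b7a4d71c845579297f4dfb9609e7e4c2ebab9ce40a
    println!("0x{}", hex::encode(&key));
}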

Why was the storage corrupted on mainnet, but not in our CI, our testnet, or the Rococo parachain?

This pull request was part of a massive upgrade to the latest Substrate version, which included an update of the weights and the bounding of storage. We tried to be very rigorous, since this was a big change, and went through all the dependencies with multiple members of the team, but one small change slipped through.

The change that slipped through was in Cumulus, and it was the only change in the Substrate upgrade that could cause this issue: it changed the meaning of a variable we use in the runtime from a value used only in benchmarking to the actual hard limit on the number of elements in the vector.

We finally found the core issue. On mainnet we currently have 25 “invulnerable” collators, while the maximum limit for this vector was set to 10. This caused the corrupted state error in the logs on read, subsequently removing the queuedKeys of the session pallet and breaking block production in the next session.
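
To illustrate the failure mode, here is a small sketch with placeholder types rather than the actual Basilisk runtime: once the constant became a real bound, decoding the 25 entries already stored on-chain against a BoundedVec limited to 10 fails, and the runtime surfaces that failure as corrupted state.

// Sketch: a BoundedVec bounded to 10 elements refuses to decode a stored
// value holding 25 entries, which the runtime reports as corrupted state.
use frame_support::{traits::ConstU32, BoundedVec};
use parity_scale_codec::{Decode, Encode};

fn main() {
    // Stand-in for the 25 invulnerable collators stored on mainnet
    // (plain u8 placeholders instead of real AccountIds).
    let stored: Vec<u8> = (0u8..25).collect();
    let encoded = stored.encode();

    // Decoding against the new hard limit of 10 fails...
    assert!(BoundedVec::<u8, ConstU32<10>>::decode(&mut &encoded[..]).is_err());
    // ...while a large enough bound accepts the very same bytes.
    assert!(BoundedVec::<u8, ConstU32<32>>::decode(&mut &encoded[..]).is_ok());
}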

Once we found this, we immediately started working on a fix.

The fix

After consulting with our friends at Parity and the Substrate Builders Program, we realized the best way forward would be to increase this limit, update the chainspec of our nodes using codeSubstitutes, and propose a relaychain governance vote to force this runtime as the new valid runtime of our parachain.
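
For readers unfamiliar with codeSubstitutes: it is a field of the node's chain spec that maps a block number to a replacement runtime wasm, so nodes execute the substituted code instead of the broken on-chain runtime until a proper on-chain upgrade takes over. An illustrative fragment of a chain spec follows; the block number and the elided hex payload are placeholders, not the exact values we used:

"codeSubstitutes": {
  "2145599": "0x…<hex-encoded wasm of the fixed runtime>…"
}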

We tested this in our sandbox environment by bricking it first and then applying the fix successfully. After that, we bricked the Rococo parachain, as it runs the same runtime as our mainnet parachain, applied the same fix, and confirmed it works. We can now proceed to restart the chain.

Further prevention

While working on the fix, we also looked at how to make sure this or a similar error doesn't happen again. We are implementing CI workflows that use try-runtime to test the upgrade of the mainnet runtime to a new version. This CI will check the storage of the whole chain, failing if any error occurs. We already do this for migrations, and it should prevent similar mistakes from passing our checks in the future.
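
As a rough illustration of such a check, the invocation below is a sketch only: the exact try-runtime flags differ between Substrate releases, and the endpoint shown is a placeholder rather than our actual pipeline configuration.

cargo run --release --features try-runtime -- \
  try-runtime --chain basilisk on-runtime-upgrade live --uri wss://<our-rpc-endpoint>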

One more takeaway from this incident is that we could have detected it before it stalled the chain: we missed the corrupted storage error. Better alerting tooling could have prevented the stall if we had been very fast to act. That is why we would like to encourage parachain developers to discuss monitoring and alerting for Substrate nodes, and to attend an event we will be hosting at the sub0 barcamp, or alternatively at a side event after the sub0 conference in Lisbon.

Restarting the chain

Due to the nature of our chain, the market conditions, and the fact that unlocking takes some time, we will restart the chain with trading locked to prevent LPs from incurring losses to arbitrageurs. We will then allow LPs a grace period to withdraw liquidity. After this period, trading will be restarted and Basilisk can live again.

Update: the proposal for unbricking the chain is under discussion here

Update 2: the council motion for unbricking Basilisk has started here

Update 3: while discussing fast-tracking of the fix, we joined a broader discussion about how to recover a parachain quickly, since stalls can have serious implications, especially in DeFi. You can find it here

Update 4: fast-tracking of the fix has passed the technical committee after a long but valid discussion. This shows that people in the ecosystem care and are willing to support parachain teams when an unexpected event occurs. The fundamental issue of chain recovery remains to be fixed, but this incident has shown the need for it. The referendum can be seen here