On Thursday July 09, 2020, Bison Trails' Celo validator 1 stopped signing blocks due to a partial key rotation of the validator signing key. We quickly determined the root cause and remediated the issue; however, we were unable to get the validator signing again because of a bug in the Celo client implementation that we discovered in collaboration with cLabs. The validator was able to sign again after cLabs released a patch to the Celo client that fixed the bug. This post provides details of the event.
On Tuesday July 07, 2020, the cLabs team announced that they had discovered a critical security vulnerability that required an immediate update to the Celo client. Because a simple restart of the client induces downtime in excess of the 12-block (60s) limit, which would incur financial penalties for our validators, the upgrade path is to do a key rotation. This method ensures zero downtime and is endorsed by cLabs as the current best practice for upgrading network clients.
We decided to test the upgraded client on one of our own validators before rolling out the change to our customers. On Thursday July 09, 2020, shortly before the epoch boundary, we rotated the signing keys to those hosted by a new cluster with the security patch, using the celocli to do the authorizations. However, we left two flags off the command, including --blsPop. Without these two bits of information, the validator was put in a state where the ECDSA key authorized as the validator signer was different from the key used to derive the BLS key. The validator signing key had been “partially rotated.”
Because the BLS key is the one actually involved in signing blocks, when signing duties were handed over at the epoch boundary (77-78), signatures made with the new BLS key were rejected by the network (starting at block 1347841) because they did not match the BLS key registered by the validator.
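The mismatch described above can be expressed as a simple consistency check. The following is a minimal sketch, not Celo's actual data model: the ValidatorState type, its field names, and the placeholder key strings are all hypothetical, and it assumes the operator already knows which BLS public key the new signing key produces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidatorState:
    """Hypothetical view of what the network has on record for a validator."""
    authorized_signer: str      # ECDSA signer address authorized on-chain
    registered_bls_pubkey: str  # BLS public key registered on-chain


def is_partially_rotated(state: ValidatorState,
                         new_signer: str,
                         new_signer_bls_pubkey: str) -> bool:
    """Return True when the signer was rotated but the BLS key was not."""
    signer_rotated = state.authorized_signer.lower() == new_signer.lower()
    bls_rotated = (state.registered_bls_pubkey.lower()
                   == new_signer_bls_pubkey.lower())
    return signer_rotated and not bls_rotated


# Placeholder values for illustration only; these are not real keys.
state = ValidatorState(authorized_signer="0xNEW",
                       registered_bls_pubkey="0xOLD_BLS")
print(is_partially_rotated(state, "0xNEW", "0xNEW_BLS"))  # True
```

A check of this kind, run right after authorization and before the epoch boundary, would flag the state while it can still be corrected.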
Over the next two hours we discovered our omission and authorized the correct BLS key using a different celocli command. Unfortunately, per the Celo protocol design, once the authorized keys are registered, they are frozen for the entire epoch.
With the assistance of the Celo community, we discovered that this partial rotation was a feature of the client and that a flag passed to the client at boot, --blsbase, should allow the use of a BLS key derived from a different ECDSA private key than the one authorized and registered to the validator for the epoch. After hours of trying different ways of using this flag, our signatures were still failing.
On the evening of July 09, we hopped on a session with cLabs to try to get the validator successfully signing before the epoch boundary, more than 18 hours away. During the call we discovered a bug in the Celo core client code that prevented the successful use of --blsbase and thus made the use of a BLS key derived from another ECDSA private key impossible. cLabs quickly released a patch and a build, which we applied to our validator. Finally, at block 1353115, around 8:09 PM EDT, the cluster began successfully signing blocks again. In total, the validator cluster missed signatures for a little over 7h 15m within the epoch.
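The downtime figure can be cross-checked from the two block numbers given in this report. The sketch below assumes a constant 5-second block time (implied by "12 blocks (or 60s)") and treats block 1353115 as the first block signed after recovery; actual block times may vary slightly.

```python
from datetime import timedelta

BLOCK_TIME_SECONDS = 5          # assumption: 60s / 12 blocks, held constant

FIRST_MISSED_BLOCK = 1347841    # first rejected signature (from this report)
FIRST_SIGNED_BLOCK = 1353115    # signing resumed (from this report)


def missed_duration(first_missed: int, first_signed: int,
                    block_time: int = BLOCK_TIME_SECONDS):
    """Return (missed block count, approximate wall-clock duration)."""
    missed_blocks = first_signed - first_missed
    return missed_blocks, timedelta(seconds=missed_blocks * block_time)


blocks, duration = missed_duration(FIRST_MISSED_BLOCK, FIRST_SIGNED_BLOCK)
print(blocks, duration)  # 5274 7:19:30
```

Under these assumptions the validator missed 5274 blocks, or roughly 7h 19m, consistent with the "little over 7h 15m" stated above.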
Bison Trails now has mechanisms in place that will prevent partial key rotations. cLabs has fixed the --blsbase flag bug and the fix is slated for a future release; we will upgrade all of our clusters when it becomes available.
We see this event as a learning experience to help improve the Celo protocol and build upon cLabs’ good work to date. Our recommendations to improve the Celo client and node operator processes are below. While having a BLS signing key that is not derived from the ECDSA private key for a validator signer is a network feature, in practice it is not used. This is evidenced by the current bug in the Celo client software that makes the use of this feature impossible. Not only are there no validators in the active set using partial key rotation, but most consider it harmful or dangerous, to the extent that libraries have been released to monitor for this situation.
We believe that this split in validator signing keys adds too much complexity for node operators and thus increases the surface area for errors in the most critical aspect of participation. We understand that during validator registration, a validator may choose to authorize only an ECDSA key, which then allows that signing key to either register the BLS key or update it. This is a benefit when using HSMs that lack BLS support, for instance. But the downsides of a partial key rotation seem to greatly outweigh whatever benefits this feature may confer on a specific workflow.
One solution that would not require modifications to the contracts is to put some safeguards in the CLI. For key rotations using the celocli authorize functions, we endorse the suggestion made in the Discord channel by the community and cLabs to require the --blsPop flags when authorizing a new key via the CLI. Partial key rotations should not be the default, but rather the exception, to accurately reflect common practice.
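A safeguard of this kind could look roughly like the sketch below. It is a hypothetical wrapper written for illustration and is not celocli's implementation: the program name, the --signer and --allow-ecdsa-only options, and the validation function are invented here, and --blsKey is assumed to be the companion flag to --blsPop.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Argument parser for a hypothetical validator-signer authorization tool."""
    parser = argparse.ArgumentParser(prog="authorize-signer")
    parser.add_argument("--signer", required=True,
                        help="address of the new ECDSA signer")
    parser.add_argument("--blsKey",
                        help="BLS public key (assumed flag name)")
    parser.add_argument("--blsPop",
                        help="BLS proof of possession")
    parser.add_argument("--allow-ecdsa-only", action="store_true",
                        help="hypothetical explicit opt-in to a partial rotation")
    return parser


def validate(args: argparse.Namespace) -> None:
    """Reject partial rotations unless the operator explicitly opts in."""
    has_key = args.blsKey is not None
    has_pop = args.blsPop is not None
    if has_key != has_pop:
        raise ValueError("--blsKey and --blsPop must be supplied together")
    if not has_key and not args.allow_ecdsa_only:
        raise ValueError(
            "refusing ECDSA-only authorization; pass --blsKey and --blsPop, "
            "or --allow-ecdsa-only to rotate only the ECDSA key")


# Example: omitting the BLS flags is rejected by default.
args = build_parser().parse_args(["--signer", "0xNEW"])
try:
    validate(args)
except ValueError as err:
    print(f"error: {err}")
```

Making the partial rotation an explicit opt-in keeps the feature available for workflows that need it while making the common case safe by default.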
Finally, key rotations are a security mechanism that happens to allow for zero downtime. The primary reason key rotations are used to upgrade the Celo client, though, is that the validator/proxy pair typically cannot be restarted in a timeframe that would avoid network penalties. Celo has a long lifetime of upgrades ahead of it, and key rotations are an expensive process for operators, who must coordinate asset holders, custodians, and node operations. Relaxing the window that incurs downtime penalties would go a long way toward allowing node operators to efficiently update and restart validating clients without key rotations.
We submitted a PR, to add a safeguard for ECDSA-only authorization of validator signing keys, here.
Comments or questions about this report? Continue the discussion with BisonD#8301 in the Celo Discord.
Learn more about Celo and our support of the protocol.