Welcome to our eth2 Insights Series, presented by Elias Simos–Protocol Specialist at Bison Trails. In this series, Elias reveals insights he uncovered in his joint research effort with Sid Shekhar (Blockchain Research lead at Coinbase), eth2 Medalla—A journey through the underbelly of eth2’s final call before takeoff.
In this third post, Elias discusses the different parameters that govern validator effectiveness in eth2, proposes a methodology to rank validators, and examines how validators in Medalla were distributed along those parameters.
Validators are the key consensus party in eth2. They are required to perform a range of useful tasks, and, if they perform their duties correctly and add value to the network, they stand to earn rewards for it. More concretely, validators in eth2 are rewarded for proposing blocks, getting attestations included, and whistleblowing on protocol violations. In Phase 0, things are a little simpler as whistleblower rewards are folded into the proposer function.
While the block proposer rewards far exceed those of attesters per unit, as every active validator is called upon to attest once per epoch, the bulk of the predictable rewards in Phase 0 will come from attestations—at a ratio of approximately 7:1 vs proposer rewards.
Taking into account the protocol rules that govern rewards, it quickly becomes apparent that not all validators will be made equal—and therefore the distribution of rewards across validators will not be uniform!
It then follows that if validators are interested in optimizing for rewards, which is what an economically rational actor would do, there is merit to thinking of their existence in eth2 in terms of their “effectiveness.”
In the grand scheme of things, being a proposer in eth2 is like a “party round” for validators. The protocol picks 32 block proposers in every epoch, and tasks them with committing attestations (and later transaction data) on-chain and working towards finalizing a block.
The probability of becoming a proposer in Phase 0, all else equal, is then 32/n where n is the total number of active validators on the network. As the number of network participants grows, the probability of proposing a block diminishes; and so it did in Medalla!
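To make the 32/n relationship concrete, here is a minimal Python sketch. It assumes uniform proposer selection and ignores any weighting by effective balance; the function name and the sample validator counts are hypothetical and chosen only for illustration.

```python
SLOTS_PER_EPOCH = 32  # the protocol picks one proposer per slot, 32 per epoch


def expected_proposals_per_epoch(active_validators: int) -> float:
    """Expected proposer slots per epoch for a single validator (32 / n).

    Assumption: every active validator is equally likely to be picked.
    """
    if active_validators <= 0:
        raise ValueError("active_validators must be positive")
    return SLOTS_PER_EPOCH / active_validators


if __name__ == "__main__":
    # Hypothetical network sizes, to show how the figure shrinks as n grows.
    for n in (20_000, 40_000, 70_000):
        print(n, expected_proposals_per_epoch(n))
```

Under these assumptions, doubling the number of active validators halves each validator's expected number of proposals per epoch.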
In Phase 0, a proposer is rewarded by the protocol for (i) including pending attestations collected from the P2P network, and (ii) including proposer and attester slashing proofs. The more valuable the attestations that the proposer includes are, the higher their reward will be.
As a result, there are two key vectors along which we can score proposers on an effectiveness spectrum: (a) how often a proposer actually proposes a block when they are called upon to do so, and (b) how many valuable attestations they manage to include in the blocks they propose.
Of the two, what a proposer can more effectively control is how many blocks they actually propose from the slots the protocol allocates to them, which is a function of their overall uptime. This is because the number of valuable attestations included in a proposed block depends not only on the proposer, but also on attesters sending their votes in on time and on aggregators aggregating valuable attestations for the proposer to include.
A coarse way to then score for proposer effectiveness is the simple ratio proposed_slots : total_slots_attributed.
However, given that validators enter the network at different points in time, it is wise to control for the time each validator has been active (measured in epochs) and the diminishing probability of being selected as a proposer, as the total set of activated validators increases.
In eth2data.github.io we introduced a further optimization to the ratio in order to capture the difficulty factor, by dividing the time-weighted ratio proposed_slots : total_slots_attributed by the probability of proposing at least once, given a validator’s activation epoch in the Testnet. This is the complement of the probability that they were allocated 0 slots over the epochs they have been active for (n), such that:
Which, plotted over ~14,700 epochs, looks like this:
With this in mind we defined proposer effectiveness as:
n = epochs active
P_ratio = proposed_slots / total_slots_attributed
P_time_weight = 1 / epochs active
P_prob_weight = 1−P(0_proposals)*n
P_effectiveness = [P_ratio * P_time_weight] / P_prob_weight
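The definitions above can be sketched in Python as follows. This is an illustration under stated assumptions, not the exact computation behind eth2data.github.io: it reads P_prob_weight as one minus the probability of zero proposals over the n epochs active, and it approximates that probability with a constant validator set size and uniform, independent selection in every slot. The function and parameter names are hypothetical.

```python
SLOTS_PER_EPOCH = 32


def prob_at_least_one_proposal(epochs_active: int, active_validators: int) -> float:
    """1 - P(0 proposals) over `epochs_active` epochs.

    Assumption: the validator set size is constant and each slot picks a
    proposer uniformly at random, independently of other slots.
    """
    p_zero = (1 - 1 / active_validators) ** (SLOTS_PER_EPOCH * epochs_active)
    return 1 - p_zero


def proposer_effectiveness(
    proposed_slots: int,
    total_slots_attributed: int,
    epochs_active: int,
    active_validators: int,
) -> float:
    """P_effectiveness = (P_ratio * P_time_weight) / P_prob_weight."""
    if total_slots_attributed == 0 or epochs_active == 0:
        return 0.0  # nothing to score yet
    p_ratio = proposed_slots / total_slots_attributed
    p_time_weight = 1 / epochs_active
    p_prob_weight = prob_at_least_one_proposal(epochs_active, active_validators)
    return p_ratio * p_time_weight / p_prob_weight


if __name__ == "__main__":
    # Hypothetical validators on a 20,000-validator network.
    print(proposer_effectiveness(2, 2, 1_000, 20_000))  # long-lived, proposed every slot
    print(proposer_effectiveness(1, 2, 1_000, 20_000))  # long-lived, missed half
    print(proposer_effectiveness(1, 1, 100, 20_000))    # recently activated
```

In this sketch, a recently activated validator with a perfect record scores well above a long-lived one with the same record, because both weights push in that direction.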
Given that the probability of proposing at least once diminishes fast in the epochs closer to the present, dividing by P_prob_weight disproportionately boosts the overall score of the proposer.
A further set of optimizations to the P_effectiveness score, introduced here to control for the distortions described above, are:
This is what the distribution of ~70k validators that we tracked in Medalla looks like, when scored for P_effectiveness:
At the edges of the distribution, we observe a large concentration of really ineffective proposers (10% of the whole) that have missed virtually every opportunity they had to propose a block, and very effective proposers (5% of the whole) that have proposed in every slot they were allotted. Between those we find an exceedingly large representation of proposers at the 40% level of P_effectiveness. The remainder approximately follows a normal distribution with a mode of 0.35.
So far so good, but proposer effectiveness is only ~⅛ of the story...
Broadly, the two key variables for attesters to optimize along in eth2 are (i) getting valuable attestations included on-chain (a function of uptime) and (ii) the inclusion delay. In plain terms, from an attestation rewards point of view, to optimize for rewards the attester (a) must always participate in consensus and (b) must participate swiftly when their time comes.
For example, with 1 being the minimum possible delay, an inclusion delay of 2 slots leads to the attester only receiving half of the maximum reward available. The expected rewards “moderator” distribution is presented below.
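As a rough illustration of this moderator, the sketch below assumes the delay-dependent part of the reward scales with 1 / inclusion_delay, which is consistent with the example of a 2-slot delay yielding half of the maximum. The function name is hypothetical.

```python
def inclusion_delay_moderator(delay_slots: int) -> float:
    """Fraction of the maximum delay-dependent reward an attester keeps.

    Assumption: the reward scales with 1 / delay, where 1 slot is the
    minimum possible delay.
    """
    if delay_slots < 1:
        raise ValueError("the minimum possible inclusion delay is 1 slot")
    return 1 / delay_slots


if __name__ == "__main__":
    for delay in range(1, 6):
        print(delay, inclusion_delay_moderator(delay))
```

Under this assumption the penalty is steepest for the first slot of delay and flattens out after that.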
Given the above, the main categories we focused our attention on when scoring for attester effectiveness are the:
aggregate inclusion delay: measured as the average of the inclusion delays a validator has been subject to over time. A score of 1 means that the validator always got their attestations in without a delay, and maximized their rewards potential.
uptime ratio: measured as the number of valuable attestations vs the time a validator has been active (in epochs). A ratio of 1 implies that the validator has been responsive in every round they have been called to attest in.
We defined the attester effectiveness score as:
A_effectiveness = uptime ratio * 100 / aggregate inclusion delay
To make the score more generalizable, we normalize A_effectiveness so that it tops out at 100%.
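A minimal Python sketch of the attester score follows. The post does not spell out how the normalization is done, so capping the score at 100 is an assumption made here for illustration; the function and parameter names are also hypothetical.

```python
from typing import Sequence


def attester_effectiveness(
    valuable_attestations: int,
    epochs_active: int,
    inclusion_delays: Sequence[int],
) -> float:
    """A_effectiveness = uptime ratio * 100 / aggregate inclusion delay."""
    if epochs_active == 0 or not inclusion_delays:
        return 0.0  # no history to score
    uptime_ratio = valuable_attestations / epochs_active
    # Aggregate inclusion delay: average delay over included attestations.
    aggregate_delay = sum(inclusion_delays) / len(inclusion_delays)
    score = uptime_ratio * 100 / aggregate_delay
    return min(score, 100.0)  # assumption: normalize by capping at 100%


if __name__ == "__main__":
    # Hypothetical validators active for 100 epochs.
    print(attester_effectiveness(100, 100, [1] * 100))  # always on time
    print(attester_effectiveness(50, 100, [2] * 50))    # half uptime, delay of 2
```

A validator that attests in every epoch with the minimum delay of 1 reaches the top score, while half the uptime combined with an average delay of 2 slots cuts the score to a quarter.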
Here we find that 12% of the active validators in the first 14,500 epochs of Medalla scored below 10% in A_effectiveness. This is clearly a consequence of the fact that Medalla is a Testnet with no real value at stake, and was likely exaggerated by the roughtime incident.
The final step for arriving at a comprehensive validator effectiveness score is combining both the attester and proposer effectiveness scores into one “master” score.
To simplify the process we define the master validator effectiveness score as:
V_effectiveness = A_effectiveness * ⅞ + P_effectiveness * ⅛
V_effectiveness reflects the ecosystem-wide expectation of the distribution of ETH rewards that validators stand to achieve by performing their attester and proposer duties, respectively.
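The weighting can be sketched in a few lines of Python. The sketch assumes both input scores are expressed on the same 0 to 100 scale; the function name and the sample values are hypothetical.

```python
ATTESTER_WEIGHT = 7 / 8
PROPOSER_WEIGHT = 1 / 8


def validator_effectiveness(a_effectiveness: float, p_effectiveness: float) -> float:
    """V_effectiveness = A_effectiveness * 7/8 + P_effectiveness * 1/8.

    Assumption: both scores are on the same scale (0 to 100 here).
    """
    return a_effectiveness * ATTESTER_WEIGHT + p_effectiveness * PROPOSER_WEIGHT


if __name__ == "__main__":
    # Strong attester, weak proposer vs. the reverse (hypothetical values).
    print(validator_effectiveness(80, 40))
    print(validator_effectiveness(40, 80))
```

Swapping the two input scores changes the result substantially, which shows how strongly the attester component dominates the master score.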
Given the much heavier weighting of attester effectiveness in the score, the distribution of V_effectiveness ends up looking very much like that of A_effectiveness.
In order to test the accuracy of this approach to calculating V_effectiveness, we plotted the score against the total rewards that the validators achieved over their lifetime in Medalla and tested for how good a predictor of rewards V_effectiveness can be.
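One simple way to run such a test is a correlation coefficient between the score and the rewards. The post does not say which coefficient was used, so the Pearson correlation below is an assumption, and the sample numbers are toy values rather than Medalla data.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Toy values only: hypothetical scores and lifetime rewards.
    v_effectiveness = [12.0, 35.0, 58.0, 71.0, 90.0]
    rewards = [-0.4, 0.1, 0.3, 0.5, 0.6]
    print(pearson(v_effectiveness, rewards))
```

A value close to 1 would indicate that the score tracks rewards closely, while a value near 0 would indicate little linear relationship.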
To ensure that the comparison is “apples-to-apples,” for the purpose of the exercise we only selected the group of validators that were active at genesis (20k unique indices).
When broadening the correlations matrix to capture a wider range of variables of interest, we find that the correlation between V_effectiveness and rewards drops to 50%, likely because the different entry points, and exposure to varying network conditions, become a moderating factor.
A few more observations worth surfacing are (i) that uptime is a lot more closely correlated to V_effectiveness than the inclusion delay is, and is thus likely what operators should optimize for first; (ii) that V_effectiveness improved significantly as Medalla matured, precisely because the aggregate inclusion delay improved greatly, as demonstrated by the 70% positive correlation between the delay and epochs_active; and (iii) that, as far as the attestations surplus introduced on-chain is concerned, there is only a weak relationship between a validator’s effectiveness and the amount of surplus info they load the chain with, meaning that the top performers, from a rewards perspective, may only be marginally less net “pollutants.”
When testing for validator effectiveness vs rewards in groups of validators aggregated by their most common denominator (a mix of graffiti, self id name, common withdrawal keys, and common eth1 deposit address), the results are equally strong!
The correlation between rewards and V_effectiveness stands at 80%, while in the distribution view we can discern between three groups of operators that performed poorly (C), average (B), or well (A).
Zooming into the top performers (A), and filtering for client identifier, we find that, surprisingly, validators running Lodestar and Teku performed better than those running Lighthouse or Prysm. The big caveat is that the representation of validators running Lodestar and Teku in group (A) is nearly statistically insignificant.
Finally, when zooming out again at the population level and segmenting the view of V_effectiveness distributions by client choice, it becomes clearer why it is Prysm, and more recently Lighthouse, that are dominating client choice among validators in Medalla.
Prysm and Lighthouse appear to significantly differentiate from other clients with respect to the average validator performance recorded in Medalla. The result becomes even more pronounced when segmenting out the bottom performers (group C), which likely represents a type of negligence that we can’t possibly expect on Mainnet—where real ETH is at stake.
On the other end of the distribution, the Nimbus client appears to be the one with the most ground to cover.
What is perhaps the most valuable finding of all, however, is the fact that even among the top performing clients, there seems to be a significantly wide distribution in in-group validator effectiveness. This is a strong hint towards the fact that client choice has only so much to do with validator performance. The remainder likely leans on the strength of the operator’s design choices!
In this post, we developed a feature complete validator effectiveness methodology that is not only a strong predictor of ETH rewards achieved, but also takes into consideration a relative approach to scoring by normalizing scores.
Given that the majority of the factors governing rewards in eth2 are relative to the state of the network, we believe that, while computationally more demanding, the approach we introduce here is painting a more accurate picture compared to methodologies that take a nominal view.
We also found that the lead the Prysm and Lighthouse clients currently have in market adoption is well-founded, as they very likely enable better performance. It’s worth underlining, however, that seemingly a large chunk of the determinants of performance lie outside of client choice and more closely relate to the robustness of the operator’s set-up.
More broadly, taking the 10,000 ft view, and using the framing developed over the course of this post, we can liken client software development to a performance sport. This is what the protocol rewards for, after all. Clients will find adoption when they help users achieve maximum rewards: first by playing defense (e.g. helping improve a validator’s security and uptime), and then by playing offense (e.g. by getting attestations included on-chain fast).
But, as with most performance sports, the distribution of winners and not-winners is a power law! And when there are only a handful of categories to compete on, the power law distribution becomes less easily breakable—eventually yielding to a network concentrated around one client choice and all the systemic risks that that implies.
If client variety and a more equal distribution across the whole network is desired, protocol designers should perhaps consider adding to the parameters that the protocol rewards for (e.g. lean attestations, better aggregation, etc), so that different clients can specialize in different niches and cater to user groups with different preference sets.