Handicapping pub trivia
The following question was posted recently on Reddit’s statistics forum:
If there is a quiz of x questions with varying results between teams of different sizes, how could you logically handicap the larger teams to bring some sort of equivalence in performance measure?
[Suppose there are] 25 questions and a team of two scores 11/25. A team of 4 scores 17/25. Who did better […]?
One respondent suggested a binomial model, in which every player has the same probability of answering any question correctly.
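To get a sense of what that implies, here is a minimal sketch (my reconstruction, not the respondent's code) that inverts the binomial model to find the per-player probability each score implies, assuming the team rule described later in the post: a team answers a question if at least one member would.

```python
# Implied per-player probability p under the binomial model,
# assuming a team of n answers a question with probability
# 1 - (1 - p)**n. A hypothetical reconstruction, not necessarily
# the notebook's method.

def implied_p(score, n_players, n_questions=25):
    team_p = score / n_questions               # team's per-question success rate
    return 1 - (1 - team_p) ** (1 / n_players)

print(implied_p(11, 2))   # ~0.252 for the team of two
print(implied_p(17, 4))   # ~0.248 for the team of four
```

By this measure the team of two edges out the team of four, consistent with the result reported below.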
I suggested a model based on item response theory, in which each question has a level of difficulty, d, each player has a level of efficacy, e, and the probability that a player answers a question correctly is
expit(e-d+c)
where c is a constant offset for all players and questions and expit is the inverse of the logit function.
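In code, that probability might look like this; the distributions for d and e, and the value of c, are illustrative assumptions, not necessarily what the notebook uses.

```python
import numpy as np
from scipy.special import expit   # inverse of the logit function

rng = np.random.default_rng(17)

n_questions = 25
difficulty = rng.normal(0, 1, size=n_questions)  # d, one per question
efficacy = 0.5                                   # e, for one player
c = -1                                           # constant offset; shifts the base rate

# Probability that this player answers each question correctly
p_correct = expit(efficacy - difficulty + c)
```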
Another respondent pointed out that group dynamics will come into play. On a given team, it is not enough for one player to know the answer; they also have to persuade their teammates.

I wrote some simulations to explore this question. You can see a static version of my notebook here, or you can run the code on Colab.
I implement a binomial model and a model based on item response theory. Interestingly, for the scenario in the question, they yield opposite results: under the binomial model, we would judge that the team of two performed better; under the other model, the team of four was better.
In both cases I use a simple model of group dynamics: if anyone on the team gets a question, that means the whole team gets the question. So one way to think of this model is that “getting” a question means something like “knowing the answer and successfully convincing your team”.
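Here's a sketch of how that rule might combine with the item response model in simulation; again, the parameter distributions and the offset are my assumptions, not necessarily the notebook's.

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(17)

def team_score(n_players, n_questions=25, c=-1):
    """Simulate one quiz under the item response model:
    the team gets a question if any player gets it."""
    difficulty = rng.normal(0, 1, size=n_questions)
    efficacy = rng.normal(0, 1, size=n_players)
    # prob[i, j] is the probability that player i gets question j
    prob = expit(efficacy[:, None] - difficulty[None, :] + c)
    gets = rng.random((n_players, n_questions)) < prob
    return gets.any(axis=0).sum()

# Score distributions for teams of two and four
scores2 = [team_score(2) for _ in range(10_000)]
scores4 = [team_score(4) for _ in range(10_000)]
print(np.mean(scores2), np.mean(scores4))
```

One way to use a simulation like this is to compare each team's observed score to the simulated distribution for its size and see which is more surprising.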
Anyway, I’m not sure I really answered the question, other than to show that the answer depends on the model.