To ensure we didn’t declare a well-performing algorithm incorrect, each entry would be allowed to provide a rank-ordered list of five labels in total, making room for both “strawberry” and “apple” in a case like this. An entry counted as correct so long as the true label appeared anywhere among its five guesses, an evaluation metric we came to call the “top-5 error rate.” It encouraged entrants to hedge their bets intelligently, and ensured we were seeing the broadest, fairest picture of their algorithms’ capabilities.
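
For readers who want to see the idea concretely, here is a minimal sketch of how such a top-5 error computation might look. The function and variable names are mine, purely for illustration; the only assumption is the rule described above, that a prediction is correct whenever the true label appears among the five highest-ranked guesses.

```python
import numpy as np

def top5_error_rate(scores, true_labels):
    """Fraction of images whose true label is missing from the
    five highest-scoring guesses.

    scores:      (n_images, n_classes) array, higher = more confident
    true_labels: (n_images,) array of integer class indices
    """
    # Indices of the five highest-scoring classes for each image
    top5 = np.argsort(scores, axis=1)[:, -5:]
    # An image is "correct" if its true label appears anywhere in those five
    correct = np.any(top5 == true_labels[:, None], axis=1)
    return 1.0 - correct.mean()

# Tiny illustrative example: 2 images, 10 classes, random scores
rng = np.random.default_rng(0)
scores = rng.random((2, 10))
labels = np.array([3, 7])
print(f"top-5 error: {top5_error_rate(scores, labels):.2f}")
```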

