This involved some circular reasoning, of course—if algorithms were capable of recognizing objects accurately enough to help us label them, then we wouldn’t need ImageNet in the first place.
It made a kind of perverse, if debatable, sense, but we never got the balance right.
Our goal was to embed unalloyed human perception in every image, in the hopes that a computer vision model trained on the complete set would be imbued with some similar spark of acumen.
But the funds simply weren’t there. It was infuriating to think that after so much emotional investment, this would all come down to a question of money.
Vision is blurrier at seventy miles an hour, but no less rich in content.
The conference felt like the perfect excuse for an escape, and I looked forward to twelve hundred miles of blissfully monotonous driving that I could spend thinking about anything—anything—other than our work. I rented a van and filled it with a few students from the lab.
“So Fei-Fei, now that you’ve got a lab of your own, what are you working on these days?” It was a question I was dreading, but it came from Jitendra—Pietro’s advisor and my “academic grandfather”—the person I was most hoping to run into.
“Honestly, Jitendra, it’s a bit of a sore subject.”
A frightening idea was beginning to sink in: that I’d taken a bigger risk than I realized, and it was too late to turn back.
He’d entered the world of computer vision talented but naive, and he’d trusted me to guide him. Now, I could sense his frustration growing—justifiably—and I knew he was worried about his own path to a PhD.
I was running late for a faculty meeting when Min, a master’s student, popped up in front of me.
“I was hanging out with Jia yesterday,” he continued, “and he told me about your trouble with this labeling project. I think I have an idea you two haven’t tried yet—like, one that can really speed things up.”
It was a clever name, taken from the original Mechanical Turk, an eighteenth-century chess-playing automaton that toured the world for years as both a marvel of engineering and a formidable opponent, even for experienced players. The device was actually a hoax; concealed in its base was a human chess master, who controlled the machine to the delight and bewilderment of its audiences.
ImageNet owed the very possibility of its existence to so many converging technological threads: the internet, digital cameras, and search engines. Now crowdsourcing—delivered by a platform that had barely existed a year earlier—was providing the capstone. If I ever needed a reminder that the default position of any scientist should be one of absolute humility—an understanding that no one’s intellect is half as powerful as serendipity—this was it.
At the peak of ImageNet’s development, we were among the largest employers on the AMT platform, and our monthly bills for the service reflected it.
Although it wasn’t a world I’d aspired to join myself, I was impressed by the reach of Stanford’s influence on it, with companies like Hewlett-Packard, Cisco Systems, Sun Microsystems, Google, and so many others tracing their roots to the school.
Princeton felt like home, but I couldn’t deny that Stanford seemed like an even more hospitable backdrop for my research.
And so, in 2009, I made the decision to once again head west, with Jia and most of my students transferring along with me.
In spite of the many challenges we’d faced along the way, we’d actually done it: fifteen million images spread across twenty-two thousand distinct categories, culled from nearly a billion candidates in total, and annotated by a global team of more than forty-eight thousand contributors hailing from 167 countries. It boasted the scale and diversity we’d spent years dreaming of, all while maintaining a consistent level of precision: each individual image was not just manually labeled, but organized within a hierarchy and verified in triplicate.
But beyond the numbers lay the accomplishment that moved me most: the realization of a true ontology of the world, as conceptual as it was visual, curated from the ground up by humans for the sole purpose of teaching machines.
We fielded the usual questions and enjoyed a handful of pleasant conversations but left with little to show for our presence. It was soon clear that whatever was in store for ImageNet—whether it would be embraced as a resource of uncommon richness or written off as folly—it wasn’t going to get a boost at CVPR. On the bright side, people seemed to like the pens.
ImageNet was more than a data set, or even a hierarchy of visual categories. It was a hypothesis—a bet—inspired by our own biological origins, that the first step toward unlocking true machine intelligence would be immersion in the fullness of the visual world.
ImageNet’s slide toward obscurity was beginning to feel so inevitable that I’d resorted to an impromptu university tour to counteract it, delivering live presentations wherever I could to lecture halls filled with skeptical grad students and postdocs.
If image data sets can be thought of as the language of computer vision research—a collection of concepts that an algorithm and its developers can explore—ImageNet was a sudden, explosive growth in our vocabulary.
It massively broadened the range of possibilities our algorithms might face, presenting challenges that smaller data sets didn’t.
The PASCAL Visual Object Classes data set, generally known as PASCAL VOC, was a collection of about ten thousand images organized into twenty categories.
The collective power of collaboration, energized by the pressure of competition.
To ensure we didn’t declare a well-performing algorithm incorrect, each entry would be allowed to provide a rank-ordered list of five labels in total—making room for “strawberry” and “apple,” in this case—an evaluation metric we came to call the “top-5 error rate.” It encouraged submissions to intelligently hedge their bets, and ensured we were seeing the broadest, fairest picture of their capabilities.
To be sure we were providing novel tests to the algorithms, we recapitulated much of ImageNet's development process, downloading hundreds of thousands of new images and putting them through yet another round of crowdsourced labeling.
Along the way, Jia’s efforts were supported by a growing team that included newcomers like Olga Russakovsky, a smart, energetic grad student looking for something interesting to throw her weight behind.
She was already a solid choice on intellectual grounds, and she possessed a social adroitness that was rare in our department. I could tell she had the intellect to contribute to the project behind the scenes, but I began to wonder if, someday, she might tap into her natural savvy to represent it publicly as well.
Support vector machines, random forests, boosting, even the Bayesian network Pietro and I employed in our one-shot learning paper would buckle under its weight, forcing us to invent something truly new. “I don’t think ImageNet will make today’s algorithms better,” I said. “I think it will make them obsolete.”
Recognizing our lack of experience, not to mention ImageNet’s still-flagging name recognition, we reached out to Mark Everingham, a founding organizer of PASCAL VOC.
It was a fitting continuation of the biological influence that drove the entire project. ImageNet was based on the idea that algorithms need to confront the full complexity and unpredictability of their environments—the nature of the real world. A contest would imbue that environment with true competitive pressures.
The winning entrant, from a joint team composed of researchers at NEC Labs, Rutgers, and the University of Illinois, was an example of a support vector machine, or SVM—one of the algorithms I’d assumed ImageNet would overpower.
We’d dedicated years of our lives to a data set that was orders of magnitude beyond anything that had ever existed, orchestrated an international competition to explore its capabilities, and, for all that, accomplished little more than simply reifying the status quo.
Both facts freighted my offer with a seriousness I hadn’t appreciated until the words came out of my mouth. Silence. Then a sharp intake of breath. Faint, scratchy, and trembling. It couldn’t be what I thought it was. Is he … crying?
By chance, a contact I’d made through a fellowship program connected me to the neurobiology department of a nearby university hospital. The next day, he was transferred to one of the most advanced care units in the state.
Bob never realized his dream of being published in the sci-fi world, but he continued to write so prodigiously that he developed a habit of emailing me his personal journal entries at the end of each month.
By August 2012, ImageNet had finally been dethroned as the topic keeping me awake at night. I’d given birth, and a new reality of nursing, diapers, and perpetually interrupted sleep had taken over my life.
A twenty-first-century student using the word “ancient” to describe work from a couple of decades earlier was a testament to just how young our field was.
Our world evolved fast, and by the 2010s, most of us saw the neural network—that biologically inspired array of interconnected decision-making units arranged in a hierarchy—as a dusty artifact, encased in glass and protected by velvet ropes.
Dropping everything to attend ECCV had thrown my home life into chaos, but Jia's news didn't leave much choice. And I had to admit that there was pretty significant upside to living with one's parents when an infant needs last-minute babysitting.
The winner was dubbed AlexNet, in homage to both the technique and the project’s lead author, University of Toronto researcher Alex Krizhevsky.
AlexNet was an example of a convolutional neural network, or CNN. The name is derived from the mathematical operation of convolution, in which a series of filters is swept across an image in search of features corresponding to things the network recognizes. It's a uniquely organic design, drawing inspiration from Hubel and Wiesel's observation that mammalian vision occurs across numerous stages. As in nature, each layer of a CNN integrates further details into higher and higher levels of awareness until, finally, a real-world object comes fully into view.
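A rough sketch of the convolution step at the heart of such a network (illustrative only, not AlexNet's actual implementation; the edge filter shown here is hand-coded, whereas a CNN learns its filter values from data):

    import numpy as np

    def convolve2d(image, kernel):
        # Sweep one filter across a grayscale image, producing a feature
        # map that responds wherever the filter's pattern appears.
        kh, kw = kernel.shape
        h, w = image.shape
        out = np.zeros((h - kh + 1, w - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
        return out

    # A hand-coded vertical-edge filter; in a real CNN these values are
    # learned during training, and many filters are stacked in each layer,
    # building from edges to parts to whole objects.
    edge_filter = np.array([[1.0, 0.0, -1.0],
                            [1.0, 0.0, -1.0],
                            [1.0, 0.0, -1.0]])
    image = np.random.rand(8, 8)
    feature_map = convolve2d(image, edge_filter)  # shape (6, 6)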
Rather than arbitrarily deciding in advance which features the network should look for, the authors allowed each of its hundreds of thousands of neurons to learn their own sensitivities gradually, exclusively from the training data, without manual intervention. Like a biological intelligence, AlexNet was a natural product of its environment. Next, signals from those thousands of receptive fields travel deep into the network, merging and clustering into larger, clearer hints.
Finally, the few remaining signals that survive the trip through each layer, filtered and consolidated into a detailed picture of the object in question, collide with the final stage of the network: recognition.
Yann LeCun had remained astonishingly loyal to convolutional neural networks in the years since his success applying them to handwritten ZIP codes at Bell Labs.
The project was helmed by the eponymous Alex Krizhevsky and his collaborator, Ilya Sutskever, both of whom were smart but young researchers still building their reputations.
The same Hinton who’d made his name as an early machine learning pioneer with the development of backpropagation in the mid-1980s, the breakthrough method that made it possible to reliably train large neural networks for the first time.

