More on this book
Kindle Notes & Highlights
Thanks to large investments in the general area of data science, many major innovations (e.g., Hadoop, Voldemort, Cassandra, HBase, Pig, Hive, etc.) have made data products easier to build.
Smart data scientists don’t just solve big, hard problems; they also have an instinct for making big problems small.
Before investing in a big effort, you need to answer one simple question: Does anyone want or need your product? If no one wants the product, all the analytical work you throw at it will be wasted. So, start with something simple that lets you determine whether there are any customers. To do that, you’ll have to take some clever shortcuts to get your product off the ground. Sometimes, these shortcuts will survive into the finished version because they represent some fundamentally good ideas that you might not have seen otherwise; sometimes, they’ll be replaced by more complex analytic
...more
The key is to start simple and stay simple for as long as possible.
One of the biggest challenges of working with data is getting the data in a useful form. It’s easy to overlook the task of cleaning the data and jump to trying to build the product, but you’ll fail if getting the data into a usable form isn’t the first priority.
The point is to have a conversation rather than just a form. Engage the user to help you, rather than relying on analysis. You’re not just getting the user more involved (which is good in itself), you’re getting clean data that will simplify the work for your back-end systems. As a matter of practice, I’ve found that trying to solve a problem on the back end is 100-1,000 times more expensive than on the front end.
As technologists, we are predisposed to look for scalable technical solutions. We often jump to technical solutions before we know what solutions will work. Instead, see if you can break down the task into bite-size portions that humans can do, then figure out a technical solution that allows the process to scale.
Amazon’s Mechanical Turk is a system for posting small problems online and paying people a small amount (typically a couple of cents) for solutions.
Humans are also useful for separating valid input from invalid.
By using humans to solve the problem initially, we can learn a great deal about the problem at a very low cost.
The human solution not only made it clear what they needed to build, it proved that the technical solution was worth the effort and bought them the time they needed to build it.
The point is that technical solutions will always win in the long run; they’ll always be more efficient, and even a poor technical solution is likely to scale better than using humans to answer questions. But when you’re getting started, you don’t care about the long run. You just want to survive long enough to have a long run, to prove that your product has value. And in the short term, human solutions require much less work. Worry about scaling when you need to.
LinkedIn’s People You May Know embodies both Data Jujitsu and grounding the product in the real world.
By giving data back to the user, you can create both engagement and revenue. We’re far enough into the data game that most users have realized that they’re not the customer, they’re the product.
How do you give data back to the user? LinkedIn has a product called “Who’s Viewed Your Profile.” This product lists the people who have viewed your profile (respecting their privacy settings, of course), and provides statistics about the viewers.
In short, everyone reading this has probably spent the last year or more of their professional life immersed in data. But it’s not just us. Everyone, including users, has awakened to the value of data. Don’t hoard it; give it back, and you’ll create an experience that is more engaging and more profitable for both you and your company.
One of the biggest challenges of developing a data product is figuring out how to give data back to the user.
An “inverse interaction law” applies to most users: The more data you present, the less interaction.
The best way to avoid data vomit is to focus on actionability of data. That is, what action do you want the user to take?
What tools do we have to think about bad results — things like unfortunate recommendations and collaborative filtering gone wrong? Two concepts, precision and recall, let us describe the problem more precisely.
Precision — The ability to provide a result that exactly matches what’s desired.
Recall — The set of possible good recommendations. Recall is fundamentally about inventory: Good recall means that you have a lot of good recommendations, or a lot of advertisements that you can potentially show the user.
Unfortunately, precision and recall often work against each other: As precision increases, recall drops, and vice versa.
Low-precision search results yield a poor experience. On the other hand, low-precision ads are almost harmless
Another issue to contend with is subjectivity: How does the user perceive the results?
The most common guideline is to strive for a distribution in which there are many good results, a few great ones, and no bad ones.
We often focus on getting a limited set of data from a user. But done correctly, you can engage the user to give you more useful, high-quality data.
Take heed not just to demand data. You need to explain to the user why you’re asking for data; you need to disarm the user’s resistance to providing more information by telling him that you’re going to provide value (in this case, more valuable recommendations), rather than abusing the data. It’s essential to remember that you’re having a conversation with the user, rather than giving him a long form to fill out.

