Data Science
Kindle Notes & Highlights
5%
Machine learning (ML) focuses on the design and evaluation of algorithms for extracting patterns from data. Data mining generally deals with the analysis of structured data and often implies an emphasis on commercial applications. Data science takes all of these considerations into account but also takes up other challenges, such as the capturing, cleaning, and transforming of unstructured social media and web data; the use of big-data technologies to store and process big, unstructured data sets; and questions related to data ethics and regulation.
11%
The term data science came to prominence in the late 1990s in discussions relating to the need for statisticians to join with computer scientists to bring mathematical rigor to the computational analysis of large data sets.
11%
Breiman’s distinction between a statistical focus on models that explain the data and an algorithmic focus on models that can accurately predict the data highlights a core difference between statisticians and ML researchers. The debate between these approaches is still ongoing.
11%
Today most data science projects are more aligned with the ML approach of building accurate prediction models and less concerned with the statistical focus on explaining the data.
13%
We recommend two books, The Visual Display of Quantitative Information by Edward Tufte (2001) and Show Me the Numbers: Designing Tables and Graphs to Enlighten by Stephen Few (2012), as excellent introductions to the principles and techniques of effective data visualization.
15%
From a pure data science perspective, perhaps the most important aspect of the Moneyball story is that it highlights that sometimes the primary value of data science is the identification of informative attributes. A common belief is that the value of data science is in the models created through the process. However, once we know the important attributes in a domain, it is very easy to create data-driven models. The key to success is getting the right data and finding the right attributes.
15%
The reason data science is used in so many domains is that the specific domain doesn’t matter: if the right data are available and the problem can be clearly defined, then data science can help.
16%
One of the biggest myths is the belief that data science is an autonomous process that we can let loose on our data to find the answers to our problems. In reality, data science requires skilled human oversight throughout the different stages of the process.
16%
“Data mining lets computers do what they do best—dig through lots of data. This, in turn, lets people do what people do best, which is to set up the problem and understand the results”
16%
Human talent in data science is at a premium, and sourcing this talent is currently the main bottleneck in the adoption of data science.
17%
The second big myth of data science is that every data science project needs big data and needs to use deep learning. In general, having more data helps, but having the right data is the more important requirement.
17%
A third data science myth is that modern data science software is easy to use, and so data science is easy to do. It is true that data science software has become more user-friendly. However, this ease of use can hide the fact that doing data science properly requires both appropriate domain knowledge and expertise regarding the properties of the data and the assumptions underpinning the different ML algorithms.
17%
The last myth about data science we want to mention here is the belief that data science pays for itself quickly. The truth of this belief depends on the context of the organization.
20%
Data are generated through a process of abstraction, so any data are the result of human decisions and choices. For every abstraction, somebody (or some set of people) will have made choices with regard to what to abstract from and what categories or measurements to use in the abstracted representation. The implication is that data are never an objective description of the world. They are instead always partial and biased.
21%
Two characteristics of data science cannot be overemphasized: (a) for data science to be successful, we need to pay a great deal of attention to how we create our data (in terms of both the choices we make in designing the data abstractions and the quality of the data captured by our abstraction processes), and (b) we also need to “sense check” the results of a data science process; that is, we need to understand that just because the computer identifies a pattern in the data, this doesn’t mean that it is identifying a real insight into the processes we are trying to analyze; the pattern may simply be an artifact of how the data were created.
22%
It is frequently the case that the real value of a data science project is the identification of one or more important derived attributes that provide insight into a problem.
23%
Where is the wisdom we have lost in knowledge? Where is the knowledge we have lost in information?
26%
The iterative nature of data science projects is perhaps the aspect of these projects that is most often overlooked in discussions of data science.
26%
Data science veterans will spend more time on ensuring that the project has a clearly defined focus and that it has the right data.
27%
That around 80 percent of project time is spent on gathering and preparing data has been a consistent finding in industry surveys for a number of years. Sometimes this finding surprises people because they imagine data scientists spend their time building complex models to extract insight from the data. But the simple truth is that no matter how good your data analysis is, it won’t identify useful patterns unless it is applied to the right data.
31%
“Move the algorithms to the data instead of the data to the algorithms.”
37%
The real challenge in using ML is to find the algorithm whose learning bias is the best match for a particular data set.
37%
A challenge for clustering is figuring out how to measure similarity.
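To make the similarity question concrete, here is a minimal Python sketch (not from the book; the vectors are invented) contrasting two common measures:

    import numpy as np

    a = np.array([5.0, 1.0, 0.0])
    b = np.array([4.0, 0.0, 1.0])

    # Euclidean distance: smaller values mean more similar instances.
    euclidean = np.linalg.norm(a - b)

    # Cosine similarity: compares direction rather than magnitude;
    # values closer to 1 mean more similar instances.
    cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    print(euclidean, cosine)

Which measure is appropriate depends on the domain: Euclidean distance treats attribute magnitudes as meaningful, whereas cosine similarity ignores them.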
38%
The choice of which attributes to include in and exclude from a data set is a key task in data science.
41%
Attribute selection is a key task in data science. So is attribute design. Designing a derived attribute that has a strong correlation with an attribute we are interested in is often where the real value of data science is found. Once you know the correct attributes to use to represent the data, you are able to build accurate models relatively quickly. Uncovering and designing the right attributes is the difficult part.
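As an illustration of attribute design, here is a hypothetical pandas sketch (the column names are invented) that derives a single attribute, body mass index, from two raw measurements; such a derived attribute can correlate more strongly with an outcome of interest than either raw measurement does:

    import pandas as pd

    # Invented example data.
    df = pd.DataFrame({
        "height_m": [1.60, 1.75, 1.82],
        "weight_kg": [55.0, 80.0, 110.0],
    })

    # Derived attribute: BMI = weight / height^2.
    df["bmi"] = df["weight_kg"] / df["height_m"] ** 2
    print(df)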
43%
One consequence of this weighting (least squares minimizes the sum of the squared errors, so large errors count disproportionately) is that instances that have extreme values (outliers) can have a disproportionately large impact on the line-fitting process, resulting in the line being dragged away from the other instances. Thus, it is important to check for outliers in a data set prior to fitting a line to the data set (or, in other words, training a linear regression function on the data set) using the least squares algorithm.
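A minimal numpy sketch (with invented data) shows a single outlier dragging a least-squares fit:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([1.1, 2.0, 2.9, 4.2, 5.0])    # roughly y = x

    slope, intercept = np.polyfit(x, y, 1)      # least-squares fit
    print(slope, intercept)                     # close to 1 and 0

    y_outlier = y.copy()
    y_outlier[4] = 20.0                         # one extreme value
    slope2, intercept2 = np.polyfit(x, y_outlier, 1)
    print(slope2, intercept2)                   # slope dragged to about 4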
44%
Correlation and regression are similar concepts insofar as both are techniques that focus on the relationship across columns in the data set. Correlation is focused on exploring whether a relationship exists between two attributes, and regression is focused on modeling an assumed relationship between attributes with the purpose of being able to estimate the value of one target attribute given the values of one or more input attributes.
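The contrast can be seen in a few lines of Python (the data are invented): correlation summarizes the relationship as a single number, while regression yields a model we can use to predict:

    import numpy as np

    hours_study = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    exam_score = np.array([52.0, 58.0, 61.0, 70.0, 74.0])

    # Correlation: a single number in [-1, 1] describing the relationship.
    r = np.corrcoef(hours_study, exam_score)[0, 1]

    # Regression: a model that estimates the target from the input.
    slope, intercept = np.polyfit(hours_study, exam_score, 1)
    predicted = slope * 6.0 + intercept   # estimated score for 6 hours
    print(r, predicted)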
44%
At its core, a neuron is simply a multi-input linear-regression function. The only significant difference between the two is that in a neuron the output of the multi-input linear-regression function is passed through another function that is called the activation function.
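A minimal sketch of such a neuron (the weights and inputs are invented; the sigmoid is one common choice of activation function):

    import math

    def neuron(inputs, weights, bias):
        # The linear-regression part: a weighted sum of the inputs plus a bias.
        z = sum(w * x for w, x in zip(weights, inputs)) + bias
        # The activation function (here, the logistic/sigmoid function).
        return 1.0 / (1.0 + math.exp(-z))

    print(neuron([0.5, -1.2, 3.0], weights=[0.4, 0.1, -0.7], bias=0.2))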
47%
The difficulty in training a neural network is that the weight-update rule requires an estimate of the error at a neuron, and although it is straightforward to calculate the error for each neuron in the output layer of the network, it is difficult to calculate the error for the neurons in the earlier layers.
47%
After each training instance is presented to the network, the algorithm passes (or backpropagates) the network’s error back through the network, starting at the output layer; at each layer it calculates the error for the neurons in that layer before passing this error back to the neurons in the preceding layer.
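A compressed sketch of one backpropagation step for a tiny 2-2-1 network with sigmoid activations and squared error (all numbers are invented; bias terms, batching, and initialization details are omitted):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    x = np.array([0.5, -0.2])                   # one training instance
    t = 1.0                                     # its target output
    W1 = np.array([[0.1, 0.4], [-0.3, 0.2]])    # hidden-layer weights
    w2 = np.array([0.6, -0.1])                  # output-layer weights
    lr = 0.5                                    # learning rate

    # Forward pass.
    h = sigmoid(W1 @ x)                         # hidden activations
    y = sigmoid(w2 @ h)                         # network output

    # Backward pass: the output-layer error is computed directly...
    delta_out = (y - t) * y * (1 - y)
    # ...then backpropagated to the hidden layer through the output weights.
    delta_hidden = delta_out * w2 * h * (1 - h)

    # Weight updates step against the error gradient.
    w2 -= lr * delta_out * h
    W1 -= lr * np.outer(delta_hidden, x)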
48%
Deep-learning networks are simply neural networks that have multiple layers of hidden units; in other words, they are deep in terms of the number of hidden layers they have.
52%
Although decision trees work well with both nominal and ordinal data, they struggle with numeric data.
52%
Without a learning bias there can be no learning, and the algorithm will only be able to memorize the data.
53%
The golden rule for evaluating models is that models should never be tested on the same data they were trained on.
53%
The standard process for ensuring that the models aren’t able to peek at the test data during training is to split the data into three parts: a training set, a validation set, and a test set.
53%
It is crucial that the test set is not used during the process to select the best algorithm, nor should it be used to train this final model. If these caveats are followed, then the test set can be used to estimate the generalization performance of this final model on unseen data.
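A sketch of the three-way split using scikit-learn (assumed available; the data here are a synthetic stand-in):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, random_state=42)

    # First carve off the test set, then split the rest into train/validation.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=0.25, random_state=42)
    # Result: 60% train, 20% validation, 20% test.

The validation set steers algorithm and parameter selection; the test set is touched exactly once, to estimate the final model’s generalization performance.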
54%
The world changes, and models don’t. Implicit in the ML process of data set construction, model training, and model evaluation is the assumption that the future will be the same as the past. This assumption is known as the stationarity assumption: the processes or behaviors that are being modeled are stationary through time (i.e., they don’t change).
54%
Processes need to be put in place after model deployment to ensure that a model has not gone stale and, when it has, to retrain it. The majority of these decisions cannot be automated and require human insight and knowledge. A computer will answer the question it is posed, but unless care is taken, it is very easy to pose the wrong question.
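One hypothetical shape such a process might take (the names and threshold here are invented placeholders, not a prescription from the book): periodically score the deployed model on recent labelled data and flag it for retraining when accuracy drops below the level recorded at deployment:

    import numpy as np

    def is_stale(recent_predictions, recent_labels, baseline_accuracy, tolerance=0.05):
        # Accuracy on recent data versus the accuracy recorded at deployment.
        recent_accuracy = np.mean(np.array(recent_predictions) == np.array(recent_labels))
        return recent_accuracy < baseline_accuracy - tolerance

    if is_stale([1, 0, 1, 1, 0, 0], [1, 1, 0, 1, 1, 0], baseline_accuracy=0.90):
        print("Model has gone stale; schedule retraining.")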
58%
Because of its versatility, clustering is often used as a data-exploration tool during the data-understanding stage of many data science projects.
59%
Anomaly detection is the opposite of clustering: the goal of clustering is to identify groups of similar instances, whereas the goal of anomaly detection is to find instances that are dissimilar to the rest of the data in the data set.
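A minimal sketch of anomaly detection on a single attribute (the data are invented): flag instances that lie far from the rest, here more than two standard deviations from the mean:

    import numpy as np

    values = np.array([10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0])
    z_scores = (values - values.mean()) / values.std()
    anomalies = values[np.abs(z_scores) > 2.0]
    print(anomalies)   # the 25.0 stands out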
69%
The more consistent a prejudice is in a society, the stronger that prejudicial pattern will appear in the data about that society, and the more likely it is that a data science algorithm will extract and replicate that pattern of prejudice.
71%
The distinction between individuals who believe and act as though they are free of surveillance and individuals who self-discipline out of fear that they inhabit a Panopticon is the primary difference between a free society and a totalitarian state.