Data Science: The Executive Summary - A Technical Book for Non-Technical Professionals
You assume there is a particular mathematical formula that relates x and y – a line, an exponential decay, etc. That formula has a small number of parameters that characterize it (the slope and intercept of a line, the growth rate and initial value of an exponential growth curve, etc.). You then try to find the values of those parameters that...
Residuals measure the accuracy of a model. Here the gray points are our data and the line is a line‐of‐best‐fit through them. The residuals are how far the y‐values in the data are off from the y‐values predicted by the line.
There are two big questions to ask when fitting a curve: (i) how good is the fit, and (ii) was a line actually the right curve to fit? Residuals play a key role in answering...
The typical way to measure how far off we are from the line of best fit is by assigning a “cost” to each residual – 0 if the residual is zero and positive otherwise – and then adding up the cost of every residual. Traditionally the cost of a residual is just its square, so the total cost is the sum of the squared residuals. Typically we divide this by what the total cost would have been if we had just passed a flat line through the average height of the data, as in Figure 4.9. Passing a flat line like this is about the most naïve “curve fitting” possible, so it is used as the benchmark for the penalty that a very bad fit would incur.
Large residuals can come from two sources: either the data we are trying to fit a curve to is noisy, or we are fitting a type of curve that is a bad match for the data. If nearby residuals are highly correlated with each other, it means that we have large spans where we are under‐ or over‐estimating the function, suggesting that we've chosen a bad curve to fit. Uncorrelated residuals suggest that the data is simply noisy.
If our dataset is set in stone, then this total cost is determined by the slope and intercept of our fit line, and we can find the “best fit” by finding the slope and intercept that minimize the total cost. Because the cost of a residual is its square, this is called “least squares fitting,” and it is by far the most common way to fit lines and other curves.
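To make this concrete, here is a minimal sketch (not from the book) that fits a line with NumPy's polyfit, computes the residuals, and divides the total squared cost by the flat‐line benchmark described above. The data values are made up purely for illustration.

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])   # roughly y = 2x, plus noise

    # np.polyfit finds the slope and intercept that minimize the total squared cost.
    slope, intercept = np.polyfit(x, y, deg=1)

    predicted = slope * x + intercept
    residuals = y - predicted                        # how far each y-value is off the line

    total_cost = np.sum(residuals ** 2)              # sum of squared residuals
    baseline_cost = np.sum((y - y.mean()) ** 2)      # cost of a flat line at the average height

    print(f"slope={slope:.2f}, intercept={intercept:.2f}")
    print(f"normalized cost = {total_cost / baseline_cost:.3f}")  # near 0 means a good fit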
One popular alternative to least‐squares fitting is to say that the cost of a residual is simply its absolute value – outliers are still costly, but we don't amplify their impact by taking a square.
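For comparison, a tiny sketch (with hypothetical residuals, including one outlier) showing how the two costs treat that outlier differently:

    import numpy as np

    residuals = np.array([0.5, -1.2, 0.3, 4.0])   # hypothetical residuals; the last one is an outlier

    squared_cost = np.sum(residuals ** 2)          # 17.78 – the outlier alone contributes 16
    absolute_cost = np.sum(np.abs(residuals))      # 6.0  – the outlier contributes only its size, 4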
Unless there is a compelling reason to use something else, you should at least start with least squares – it is the easiest to compute and to communicate with people about.
The “p‐value” – the central concept of statistics – is the probability of getting a test statistic that is at least as extreme as the one we observe, assuming the null hypothesis is true.
It is a convention to take 5% as the cutoff for whether a pattern is “statistically significant,” but that cutoff is entirely a convention.
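As an illustration (not an example from the book), here is a simulation‐based sketch of a p‐value. The scenario is hypothetical: the null hypothesis is a fair coin, the test statistic is the number of heads in 100 flips, and we observed 61 heads.

    import numpy as np

    rng = np.random.default_rng(0)

    n_flips, observed_heads = 100, 61

    # Simulate the test statistic many times under the null hypothesis (a fair coin).
    simulated_heads = rng.binomial(n=n_flips, p=0.5, size=100_000)

    # Two-sided p-value: how often is a fair coin at least as far from 50 heads as we were?
    p_value = np.mean(np.abs(simulated_heads - 50) >= abs(observed_heads - 50))

    print(f"p-value ~ {p_value:.3f}")   # roughly 0.035, below the conventional 5% cutoff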
A probability distribution is a general mathematical way to say how likely each possible number is to occur.
Discrete distributions: The data can only come in discrete integers, and each one has a finite probability.
Continuous distributions: In this case the data are decimal numbers, not integers. Any single number has probability zero, but we can say how likely the data is to fall within any given range of values.
Most continuous distributions can be thought of as the limiting cases of some underlying discrete distribution.
The Bernoulli distribution can be used to describe any yes/no or other binary situation.
A binomial is the number of heads when you flip n biased coins – the sum of n independent Bernoulli distributions.
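A quick sketch of that relationship (the bias and number of flips here are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)

    p, n = 0.3, 10                        # a coin with a 30% chance of heads, flipped 10 times

    # One Bernoulli draw per flip (True = heads); their sum is one binomial sample.
    flips = rng.random(n) < p
    heads = flips.sum()

    # Equivalently, draw from the binomial distribution directly.
    heads_direct = rng.binomial(n=n, p=p)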
Central limit theorem. Say you have a probability distribution with mean μ and standard deviation σ. If you take n independent samples from it and take their average, that average is approximately a normal distribution with mean μ and standard deviation σ/√n.
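A simulation sketch of the theorem; the choice of an exponential source distribution (which has mean 1 and standard deviation 1) is mine, purely for illustration:

    import numpy as np

    rng = np.random.default_rng(0)

    n = 50                                              # samples per average
    mu, sigma = 1.0, 1.0                                # mean and standard deviation of the source

    # Take n independent samples, average them, and repeat many times.
    averages = rng.exponential(scale=mu, size=(100_000, n)).mean(axis=1)

    print(averages.mean())                              # close to mu
    print(averages.std())                               # close to sigma / sqrt(n) ~ 0.141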
In any application where multiple more‐or‐less independent numbers get added up, you should start thinking about a normal distribution.
The exponential distribution is characterized by a single number: the average time between events.
The exponential distribution is often used to estimate the length of time between consecutive events. Its most important property is that it is “memoryless”; if the first event happened T seconds ago and you are still waiting for the next event, the additional time you have left to wait still follows an exponential distribution.
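A simulation sketch of memorylessness, with a hypothetical average wait of 10 seconds:

    import numpy as np

    rng = np.random.default_rng(0)

    mean_wait = 10.0                                    # average time between events, in seconds
    waits = rng.exponential(scale=mean_wait, size=1_000_000)

    # Condition on having already waited 15 seconds with no event yet.
    remaining = waits[waits > 15.0] - 15.0

    print(remaining.mean())                             # still ~10 seconds: the clock effectively resets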
The key assumption about a Poisson distribution is that the events are independent of each other.
The primary goal of machine learning is to create software programs that make correct decisions with minimal human involvement (although human intuition can be crucial in crafting and vetting the models).
The more inputted features you have, the more likely it is that some of them will correlate with the target variable just by coincidence, and your model will be overfitted.
So having a large ratio of labeled datapoints to inputted features is the key to reducing overfitting. This is one of the reasons that feature extraction is important – it reduces the number of raw inputs.
We can't avoid overfitting completely, but what we can do is estimate how well our model performs, accounting for it. This is done via a process called “cross validation,” where we divide our labeled data into “training data” and “test data.” We tune the model to the training data, but only evaluate its performance on the test data.
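A minimal sketch of the simple hold‐out version of this idea, using scikit‐learn and synthetic data (both are my choices here, not the book's):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))                       # 200 labeled datapoints, 3 input features
    y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=200)

    # Hold out test data that the model never sees while it is being tuned.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    model = LinearRegression().fit(X_train, y_train)

    # Only the score on the held-out test data is an honest estimate of performance.
    print("train score:", model.score(X_train, y_train))
    print("test score: ", model.score(X_test, y_test))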
Probably the most standard data science library is called Pandas. It operates on tables of data called DataFrames, where each column can hold data of a different type. Conceptually DataFrames are like tables in a relational database, and they provide many of the same functions (in addition to others). Pandas is extremely handy for cleaning and processing data and for basic analyses.
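A small illustrative example (the table contents are made up):

    import pandas as pd

    # Each column of a DataFrame can hold a different type of data.
    df = pd.DataFrame({
        "city": ["Austin", "Boston", "Austin", "Chicago"],
        "temperature": [35.1, 12.4, 33.8, 18.9],
        "raining": [False, True, False, True],
    })

    # Relational-style operations: filter rows, then group and aggregate.
    warm = df[df["temperature"] > 20.0]
    print(warm)
    print(df.groupby("city")["temperature"].mean())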
As of this writing the standard visualization package is called matplotlib, but that may be changing. While it has all of the basic functions you would expect from a visualization library, it is widely regarded as the weakest link in the Python technical stack. Among other problems the figures look fairly cartoonish by default. It is losing ground to other libraries like Seaborn (which is actually built on matplotlib but arguably has better default behavior) and web‐based visualizations like Plotly.
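A minimal matplotlib example, with a comment showing the common pattern of letting Seaborn restyle it (assuming Seaborn is installed):

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.linspace(0.0, 10.0, 100)

    plt.plot(x, np.sin(x), label="sin(x)")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.legend()
    plt.savefig("sine.png")

    # Seaborn is built on matplotlib; a common pattern is simply:
    # import seaborn as sns
    # sns.set_theme()     # nicer defaults, then the same plt calls as above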
NumPy. This is a low‐level library that lets you store and process large arrays of numbers with performance comparable with a low‐level language like C.
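A small sketch of the whole‐array style that gives NumPy its speed (no Python‐level loop over the million numbers):

    import numpy as np

    values = np.random.default_rng(0).normal(size=1_000_000)

    # Each of these lines runs in compiled code over the whole array at once.
    shifted = values - values.mean()
    rms = np.sqrt(np.mean(shifted ** 2))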
Many data scientists develop their code using a tool called Jupyter Notebooks. Jupyter is a browser‐based software development tool that divides the code into cells, displaying the output (including any visualizations) of each cell next to it. It's wonderful for analytics prototyping and is similar in feel to Mathematica.
If you like Matlab's syntax but don't like paying for software, then you could also consider Octave. It is an open‐source version of Matlab. It doesn't capture all of Matlab's functionality, and certainly doesn't have the same support infrastructure and documentation, but it may be your best option if your team is used to using Matlab but the software needs to be free.
Mathematica is best known for crunching symbols – solving large systems of equations to derive reusable formulas.
6.3.2.6 Julia
Its main claim to fame is that while you can develop scripts in it like Python, you can also compile those scripts into lightning‐fast programs comparable to what you would get if you'd written the code in C.
JavaScript is the language that controls the behavior of a website in a browser, and D3.js (standing for “Data‐Driven Documents”) is a JavaScript library that allows the browser to feature beautiful, efficient graphs of data. D3.js is free and open‐source, but it gives very fine‐grained control and can be finicky to use.
Spark is the dominant big data processing technology these days, having largely replaced traditional Hadoop map‐reduce. It is usually more efficient, especially if you are chaining several operations together, and it's tremendously easier to use. From a user's perspective, Spark is just a library that you import when you are using either Python or Scala.
The central data abstraction in PySpark is a “resilient distributed dataset” (RDD), which is just a collection of Python objects.
A typical workflow consists of creating RDDs, performing various operations that turn them into other RDDs, and storing the results out appropriately. However, RDDs that are small enough to fit onto a single node can also be pulled down into local space and operated on with all of the tools available in Python and Scala.
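A minimal PySpark sketch of that workflow (assuming a local Spark installation; the numbers are arbitrary):

    from pyspark import SparkContext

    sc = SparkContext(appName="rdd_sketch")

    numbers = sc.parallelize(range(1_000_000))          # an RDD of Python ints

    squares = numbers.map(lambda n: n * n)              # a transformation: produces a new RDD
    small = squares.filter(lambda n: n < 100)           # another transformation

    print(small.collect())                              # pull the (small) result down to local space
    sc.stop()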
As a rule of thumb AI refers to technologies that do the sorts of tasks that you would normally require a human to do.
Strong AI, also called General AI, involves a computer having an honest‐to‐goodness mind, including self‐awareness and whatever else goes into a mind.
Weak AI means replicating human‐like behavior on a specific task or range of tasks. The range of tasks might be impressively large.
This is how AI's downsides work: the heuristics that it finds can be racist or sexist, they can (as in this case) be an idiosyncrasy of the training data that doesn't generalize, or they can be a numerical hash that is much more complicated than the real‐world phenomenon that it correlates with.
At its core a typical neural network is just a classifier. It differs from other classifiers only in the complexity of the model, and correspondingly in the sophistication of the problem that it can solve.
The “architecture” of the network – how many layers there are, how many neurons are in each layer, and which neurons take which others as input – is as important as the actual parameters that get tuned during training.
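A toy sketch of that distinction, with a hypothetical architecture of 3 inputs, one hidden layer of 4 neurons, and 1 output. The shapes of the weight matrices are the architecture; their values are the parameters that training would tune (here they are just random):

    import numpy as np

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # hidden layer: 4 neurons, each reading 3 inputs
    W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)   # output neuron reads all 4 hidden neurons

    def predict(x):
        hidden = np.tanh(W1 @ x + b1)
        score = W2 @ hidden + b2
        return 1.0 / (1.0 + np.exp(-score))         # squash to a 0-1 score, like a classifier

    print(predict(np.array([0.2, -1.0, 0.5])))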
WordNet: A popular lexical database that groups words into “synsets,” all of whose words have a similar meaning.
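A hedged sketch of querying WordNet through NLTK (this assumes nltk is installed and the wordnet corpus has been downloaded with nltk.download("wordnet")):

    from nltk.corpus import wordnet

    # Each synset is one sense of the word, grouped with its synonyms.
    for synset in wordnet.synsets("bank")[:3]:
        print(synset.name(), "-", synset.definition())
        print("  words in this synset:", [lemma.name() for lemma in synset.lemmas()])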