Data Science: The Executive Summary - A Technical Book for Non-Technical Professionals
You assume there is a particular mathematical formula that relates x and y – a line, an exponential decay, etc. That formula has a small number of parameters that characterize it (the slope and intercept of a line, the growth rate and initial value of an exponential growth curve, etc.). You then try to find the values of those parameters that...
Residuals measure the accuracy of a model. Here the gray points are our data and the line is a line‐of‐best‐fit through them. The residuals are how far the y‐values in the data are off from the y‐values predicted by the line.
There are two big questions to ask when fitting a curve: (i) how good is the fit, and (ii) was a line actually the right curve to fit? Residuals play a key role in answering...
The typical way to measure how far off we are from the line of best fit is by assigning a “cost” to each residual – 0 if the residual is zero and positive otherwise – and then adding up the cost of every residual. Traditionally the cost of a residual is just its square, so the total cost is the sum of the squared residuals. Typically we divide this by what the total cost would have been if we had just passed a flat line through the average height of the data, as in Figure 4.9. Passing a flat line like this is about the most naïve “curve fitting” possible, so it is used as the benchmark for the penalty that a very bad fit would incur.
Large residuals can come from two sources: either the data we are trying to fit a curve to is noisy, or we are fitting a type of curve that is a bad match for the data. If nearby residuals are highly correlated with each other, it means that we have large spans where we are under‐ or over‐estimating the function, suggesting that we've chosen a bad curve to fit. Uncorrelated residuals suggest that the data is simply noisy.
If our dataset is set in stone, then this total cost is determined by the slope and intercept of our fit line, and we can find the “best fit” by finding the slope and intercept that minimize the total cost. Because the cost of a residual is its square, this is called “least squares fitting,” and it is by far the most common way to fit lines and other curves.
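To make this concrete, here is a minimal sketch (not from the book) that fits a line with NumPy's polyfit, computes the residuals, and divides the total squared cost by the flat‐line benchmark described above. The data values are made up purely for illustration.

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])   # roughly y = 2x, plus noise

    # np.polyfit finds the slope and intercept that minimize the total squared cost.
    slope, intercept = np.polyfit(x, y, deg=1)

    predicted = slope * x + intercept
    residuals = y - predicted                        # how far each y-value is off the line

    total_cost = np.sum(residuals ** 2)              # sum of squared residuals
    baseline_cost = np.sum((y - y.mean()) ** 2)      # cost of a flat line at the average height

    print(f"slope={slope:.2f}, intercept={intercept:.2f}")
    print(f"normalized cost = {total_cost / baseline_cost:.3f}")  # near 0 means a good fit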
One popular alternative to least‐squares fitting is to say that the cost of a residual is simply its absolute value – outliers are still costly, but we don't amplify their impact by taking a square.
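For comparison, a tiny sketch (with hypothetical residuals, including one outlier) showing how the two costs treat that outlier differently:

    import numpy as np

    residuals = np.array([0.5, -1.2, 0.3, 4.0])   # hypothetical residuals; the last one is an outlier

    squared_cost = np.sum(residuals ** 2)          # 17.78 – the outlier alone contributes 16
    absolute_cost = np.sum(np.abs(residuals))      # 6.0  – the outlier contributes only its size, 4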
Unless there is a compelling reason to use something else, you should at least start with least squares – it is the easiest to compute and to communicate with people about.
The “p‐value” – the central concept of statistics – is the probability of getting a test statistic that is at least as extreme as the one we observe, assuming the null hypothesis is true.
It is a convention to take 5% as the cutoff for whether a pattern is “statistically significant,” but that cutoff is entirely a convention.
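As an illustration (not an example from the book), here is a simulation‐based sketch of a p‐value. The scenario is hypothetical: the null hypothesis is a fair coin, the test statistic is the number of heads in 100 flips, and we observed 61 heads.

    import numpy as np

    rng = np.random.default_rng(0)

    n_flips, observed_heads = 100, 61

    # Simulate the test statistic many times under the null hypothesis (a fair coin).
    simulated_heads = rng.binomial(n=n_flips, p=0.5, size=100_000)

    # Two-sided p-value: how often is a fair coin at least as far from 50 heads as we were?
    p_value = np.mean(np.abs(simulated_heads - 50) >= abs(observed_heads - 50))

    print(f"p-value ~ {p_value:.3f}")   # roughly 0.035, below the conventional 5% cutoff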
A probability distribution is a general mathematical way to say how likely each possible number is to occur.
Discrete distributions: The data can only come in discrete integers, and each one has a finite probability.
Continuous distributions: In this case the data are decimal numbers, not integers. Any single number has probability zero, but we can say how likely the data is to fall within any given range of values.
Most continuous distributions can be thought of as the limiting cases of some underlying discrete distribution.
The Bernoulli distribution can be used to describe any yes/no or other binary situation.
A binomial is the number of heads when you flip n biased coins – the sum of n independent Bernoulli distributions.
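A quick sketch of that relationship (the bias and number of flips here are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)

    p, n = 0.3, 10                        # a coin with a 30% chance of heads, flipped 10 times

    # One Bernoulli draw per flip (True = heads); their sum is one binomial sample.
    flips = rng.random(n) < p
    heads = flips.sum()

    # Equivalently, draw from the binomial distribution directly.
    heads_direct = rng.binomial(n=n, p=p)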
Central limit theorem. Say you have a probability distribution with mean μ and standard deviation σ. If you take n independent samples from it and take their average, that average is approximately a normal distribution with mean μ and standard deviation σ/√n.
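A simulation sketch of the theorem; the choice of an exponential source distribution (which has mean 1 and standard deviation 1) is mine, purely for illustration:

    import numpy as np

    rng = np.random.default_rng(0)

    n = 50                                              # samples per average
    mu, sigma = 1.0, 1.0                                # mean and standard deviation of the source

    # Take n independent samples, average them, and repeat many times.
    averages = rng.exponential(scale=mu, size=(100_000, n)).mean(axis=1)

    print(averages.mean())                              # close to mu
    print(averages.std())                               # close to sigma / sqrt(n) ~ 0.141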
In any application where multiple more‐or‐less independent numbers get added up, you should start thinking about a normal distribution.
The exponential distribution is characterized by a single number: the average time between events.
The exponential distribution is often used to estimate the length of time between consecutive events. Its most important property is that it is “memoryless”; if the first event happened T seconds ago and you are still waiting for the next event, the additional time you have left to wait still follows an exponential distribution.
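A simulation sketch of memorylessness, with a hypothetical average wait of 10 seconds:

    import numpy as np

    rng = np.random.default_rng(0)

    mean_wait = 10.0                                    # average time between events, in seconds
    waits = rng.exponential(scale=mean_wait, size=1_000_000)

    # Condition on having already waited 15 seconds with no event yet.
    remaining = waits[waits > 15.0] - 15.0

    print(remaining.mean())                             # still ~10 seconds: the clock effectively resets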
The key assumption about a Poisson distribution is that the events are independent of each other.
The primary goal of machine learning is to create software programs that make correct decisions with minimal human involvement (although human intuition can be crucial in crafting and vetting the models).
The more inputted features you have, the more likely it is that some of them will correlate with the target variable just by coincidence, and your model will be overfitted.
So having a large ratio of labeled datapoints to inputted features is the key to reducing overfitting. This is one of the reasons that feature extraction is important – it reduces the number of raw inputs.
We can't avoid overfitting completely, but what we can do is estimate how well our model performs, accounting for it. This is done via a process called “cross validation,” where we divide our labeled data into “training data” and “test data.” We tune the model to the training data, but only evaluate its performance on the test data.
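A minimal sketch of the simple hold‐out version of this idea, using scikit‐learn and synthetic data (both are my choices here, not the book's):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))                       # 200 labeled datapoints, 3 input features
    y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=200)

    # Hold out test data that the model never sees while it is being tuned.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    model = LinearRegression().fit(X_train, y_train)

    # Only the score on the held-out test data is an honest estimate of performance.
    print("train score:", model.score(X_train, y_train))
    print("test score: ", model.score(X_test, y_test))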
Probably the most standard data science library is called Pandas. It operates on tables of data called DataFrames, where each column can hold data of a different type. Conceptually DataFrames are like tables in a relational database, and they provide many of the same functions (in addition to others). Pandas is extremely handy for cleaning and processing data and for basic analyses.
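A small illustrative example (the table contents are made up):

    import pandas as pd

    # Each column of a DataFrame can hold a different type of data.
    df = pd.DataFrame({
        "city": ["Austin", "Boston", "Austin", "Chicago"],
        "temperature": [35.1, 12.4, 33.8, 18.9],
        "raining": [False, True, False, True],
    })

    # Relational-style operations: filter rows, then group and aggregate.
    warm = df[df["temperature"] > 20.0]
    print(warm)
    print(df.groupby("city")["temperature"].mean())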
As of this writing the standard visualization package is called matplotlib, but that may be changing. While it has all of the basic functions you would expect from a visualization library, it is widely regarded as the weakest link in the Python technical stack. Among other problems the figures look fairly cartoonish by default. It is losing ground to other libraries like Seaborn (which is actually built on matplotlib but arguably has better default behavior) and web‐based visualizations like Plotly.
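A minimal matplotlib example, with a comment showing the common pattern of letting Seaborn restyle it (assuming Seaborn is installed):

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.linspace(0.0, 10.0, 100)

    plt.plot(x, np.sin(x), label="sin(x)")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.legend()
    plt.savefig("sine.png")

    # Seaborn is built on matplotlib; a common pattern is simply:
    # import seaborn as sns
    # sns.set_theme()     # nicer defaults, then the same plt calls as above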
NumPy. This is a low‐level library that lets you store and process large arrays of numbers with performance comparable with a low‐level language like C.
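A small sketch of the whole‐array style that gives NumPy its speed (no Python‐level loop over the million numbers):

    import numpy as np

    values = np.random.default_rng(0).normal(size=1_000_000)

    # Each of these lines runs in compiled code over the whole array at once.
    shifted = values - values.mean()
    rms = np.sqrt(np.mean(shifted ** 2))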
Many data scientists develop their code using a tool called Jupyter Notebooks. Jupyter is a browser‐based software development tool that divides the code into cells, displaying the output (including any visualizations) of each cell next to it. It's wonderful for analytics prototyping and is similar in feel to Mathematica.
If you like Matlab's syntax but don't like paying for software, then you could also consider Octave. It is an open‐source version of Matlab. It doesn't capture all of Matlab's functionality, and certainly doesn't have the same support infrastructure and documentation, but it may be your best option if your team is used to using Matlab but the software needs to be free.
Mathematica is best known for crunching symbols – solving large systems of equations to derive reusable formulas.
6.3.2.6 Julia
Its main claim to fame is that while you can develop scripts in it like Python, you can also compile those scripts into lightning‐fast programs comparable to what you would get if you'd written the code in C.
JavaScript is the language that controls the behavior of a website in a browser, and D3.js (standing for “Data‐Driven Documents”) is a JavaScript library that allows the browser to feature beautiful, efficient graphs of data. D3.js is free and open‐source, but it gives very fine‐grained control and can be finicky to use.
Spark is the dominant big data processing technology these days, having largely replaced traditional Hadoop map‐reduce. It is usually more efficient, especially if you are chaining several operations together, and it's tremendously easier to use. From a user's perspective, Spark is just a library that you import when you are using either Python or Scala.
The central data abstraction in PySpark is a “resilient distributed dataset” (RDD), which is just a collection of Python objects.
A typical workflow consists of creating RDDs, performing various operations that turn them into other RDDs, and storing the results out appropriately. However, RDDs that are small enough to fit onto a single node can also be pulled down into local space and operated on with all of the tools available in Python and Scala.
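A minimal PySpark sketch of that workflow (assuming a local Spark installation; the numbers are arbitrary):

    from pyspark import SparkContext

    sc = SparkContext(appName="rdd_sketch")

    numbers = sc.parallelize(range(1_000_000))          # an RDD of Python ints

    squares = numbers.map(lambda n: n * n)              # a transformation: produces a new RDD
    small = squares.filter(lambda n: n < 100)           # another transformation

    print(small.collect())                              # pull the (small) result down to local space
    sc.stop()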
As a rule of thumb AI refers to technologies that do the sorts of tasks that you would normally require a human to do.
Strong AI, also called General AI, involves a computer having an honest‐to‐goodness mind, including self‐awareness and whatever else goes into a mind.
Weak AI means replicating human‐like behavior on a specific task or range of tasks. The range of tasks might be impressively large.
This is how AI's downsides work: the heuristics that it finds can be racist or sexist, they can (as in this case) be an idiosyncrasy of the training data that doesn't generalize, or they can be a numerical hash that is much more complicated than the real‐world phenomenon that it correlates with.
At its core a typical neural network is just a classifier. It differs from other classifiers only in the complexity of the model, and correspondingly in the sophistication of the problem that it can solve.
The “architecture” of the network – how many layers there are, how many neurons are in each layer, and which neurons take which others as input – is as important as the actual parameters that get tuned during training.
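A toy sketch of that distinction, with a hypothetical architecture of 3 inputs, one hidden layer of 4 neurons, and 1 output. The shapes of the weight matrices are the architecture; their values are the parameters that training would tune (here they are just random):

    import numpy as np

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # hidden layer: 4 neurons, each reading 3 inputs
    W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)   # output neuron reads all 4 hidden neurons

    def predict(x):
        hidden = np.tanh(W1 @ x + b1)
        score = W2 @ hidden + b2
        return 1.0 / (1.0 + np.exp(-score))         # squash to a 0-1 score, like a classifier

    print(predict(np.array([0.2, -1.0, 0.5])))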
WordNet: A popular lexical database that groups words into “synsets,” all of whose words have a similar meaning.
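A hedged sketch of querying WordNet through NLTK (this assumes nltk is installed and the wordnet corpus has been downloaded with nltk.download("wordnet")):

    from nltk.corpus import wordnet

    # Each synset is one sense of the word, grouped with its synonyms.
    for synset in wordnet.synsets("bank")[:3]:
        print(synset.name(), "-", synset.definition())
        print("  words in this synset:", [lemma.name() for lemma in synset.lemmas()])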