
Practical Statistics for Data Scientists: 50+ Essential Concepts Using R and Python

Statistical methods are a key part of data science, yet few data scientists have formal statistical training. Courses and books on basic statistics rarely cover the topic from a data science perspective. The second edition of this popular guide adds comprehensive examples in Python, provides practical guidance on applying statistical methods to data science, tells you how to avoid their misuse, and gives you advice on what’s important and what’s not.

Many data science resources incorporate statistical methods but lack a deeper statistical perspective. If you’re familiar with the R or Python programming languages and have some exposure to statistics, this quick reference bridges the gap in an accessible, readable format.

With this book, you'll learn:

Why exploratory data analysis is a key preliminary step in data science
How random sampling can reduce bias and yield a higher-quality dataset, even with big data
How the principles of experimental design yield definitive answers to questions
How to use regression to estimate outcomes and detect anomalies
Key classification techniques for predicting which categories a record belongs to
Statistical machine learning methods that "learn" from data
Unsupervised learning methods for extracting meaning from unlabeled data

632 pages, Kindle Edition

Published April 10, 2020


About the author

Peter Bruce

46 books, 6 followers

Community Reviews

5 stars: 110 (45%)
4 stars: 92 (37%)
3 stars: 34 (13%)
2 stars: 7 (2%)
1 star: 1 (<1%)
marc
10 reviews
June 8, 2020
Should probably be advertised as more of a 'primer'. Covers a large breadth of material, and is probably readable cover to cover in about a week. Inclusion of both Python and R for nearly all examples is a nice plus.

Despite the brevity of many sections, there are suggested textbook readings for nearly all major areas. Some sections have more depth to them - the Decision Tree / Random Forest area is a really impressive breakdown w/ detailed algorithmic pseudocode. Same goes for the linkage discussion w/ Agglomerative Clustering.

Only gripe is that now and again there are some errors that are a bit head-scratching - e.g., in the RF section it calls out in the recursive tree splitting process: "For the first split, sample p < P variables at random without replacement." 99% sure that's wrong and what's happening is that you randomly select any available p from P at each split step (w/ replacement), e.g., see ESLR p. 589 https://web.stanford.edu/~hastie/Elem...
Giulio Ciacchini
371 reviews, 12 followers
November 15, 2022
A really good, practical handbook about Data Science and Statistics.
Straight to the point and concise.
The fact that it shows all examples both in R and in Python can be seen as a positive or a negative depending on your prior knowledge.
I would highly recommend this text to a beginner who wants to have a general overview of the concepts you need to master in Data Science (sampling and hypothesis testing are explained very well).
Last note: the section on Machine Learning is a bit light.

NOTES
Mean: the sum of all values divided by the number of values.
Median: the value such that one half of the data lies above and one half below; the middle number on a sorted list of the data, a.k.a. the 50th percentile.
Percentile: the value such that P percent of the data lies below, a.k.a. quantile.
Outlier: data value that is very different from most of the data.
Variance: a measure of how far the observed values deviate from the estimate of location (a.k.a. dispersion); the sum of squared deviations from the mean divided by n-1, where n is the number of data values.
Sample bias: the sample is different in some meaningful and non-random way from the larger population it was meant to represent. Data quality often matters more than data quantity when the bias is a systematic error (for instance, reviews of restaurants or hotels are prone to bias because the people submitting them are not randomly selected; they have taken the initiative to write, which leads to self-selection bias).
Bias: measurement or sampling errors that are systematic and produced by the measurement or sampling process. Random selection helps to avoid the problem of sample bias.
The distribution of a sample statistic such as the mean is likely to be more regular and bell-shaped than the distribution of the data itself; the larger the sample the statistic is based on, the more this is true. This is another way to look at the regression to the mean phenomenon: extreme observations tend to be followed by more central ones.
Central Limit theorem: The means drawn from multiple samples will resemble the bell-shaped normal curve even if the source population is not normally distributed, provided that this sample size is large enough and the departure of the data from normality is not too great.
Standard Error: standard deviation of the sample values divided by the square root of the sample size SE=s/√n
As the sample size increases, the standard error decreases.
Do not confuse the standard deviation, which measures the variability of individual data points, with the standard error, which measures the variability of a sample metric.
The frequency distribution of a sample statistic tells us how the metric would turn out differently from sample to sample. This sampling distribution can be estimated via the bootstrap.
Bootstrap: a way to estimate the sampling distribution of a statistic or of model parameters by drawing additional samples with replacement from the sample itself and recalculating the statistic or model for each resample. Its power is that it does not necessarily involve any assumptions about the data or the sample statistic being normally distributed.
The number of iterations of the bootstrap is arbitrary; the more iterations you do, the more accurate the estimate of the standard error or the confidence interval.
Misconception: the bootstrap does not compensate for a small sample size; it does not create new data, nor does it fill in holes in an existing data set.
It merely informs us about how lots of additional samples would behave when drawn from a population like our original sample.
When applied to predictive models, aggregating multiple bootstrap sample predictions (bagging aka bootstrap aggregating) outperforms the use of a single model.
With decision trees, running multiple trees on bootstrap samples and then averaging the predictions or taking a majority vote generally performs better than using a single tree (this is the random forest).
Confidence interval: the interval that encloses the central X% of the bootstrap sampling distribution of a sample statistic; on average it should contain similar sample estimates X% of the time.
However, it does not answer the question: what is the probability that the true value lies within a certain interval?
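As a hedged illustration of these notes (synthetic data, numpy only; the sample size, distribution, and iteration count are arbitrary choices): resample with replacement, recompute the mean each time, compare the bootstrap standard error with the s/√n formula, and take percentiles of the bootstrap means as a confidence interval.

```python
import numpy as np

rng = np.random.default_rng(42)
sample = rng.exponential(scale=50_000, size=200)   # a skewed "income-like" sample

# Bootstrap: resample with replacement, recompute the mean each time
boot_means = np.array([
    rng.choice(sample, size=len(sample), replace=True).mean()
    for _ in range(5_000)
])

se_formula = sample.std(ddof=1) / np.sqrt(len(sample))   # SE = s / sqrt(n)
se_boot = boot_means.std(ddof=1)                         # bootstrap estimate of SE
ci_90 = np.percentile(boot_means, [5, 95])               # 90% percentile interval

print(f"formula SE:   {se_formula:.1f}")
print(f"bootstrap SE: {se_boot:.1f}")
print(f"90% CI for the mean: {ci_90.round(1)}")
```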

The Normal (Gaussian) distribution is so iconic because distributions of sample statistics are often normally shaped.
Standardised: subtract the mean and divide by the standard deviation.
Standard normal: a normal distribution with mean=0 and standard deviation=1
In a normal distribution 68% of the data lies within one standard deviation of the mean, and 95% lies within two standard deviations.
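A quick sanity check of the 68%/95% rule of thumb (a small sketch, assuming scipy is available):

```python
from scipy import stats

# Fraction of a standard normal distribution within ±1 and ±2 standard deviations
within_1sd = stats.norm.cdf(1) - stats.norm.cdf(-1)
within_2sd = stats.norm.cdf(2) - stats.norm.cdf(-2)
print(f"within 1 sd: {within_1sd:.3f}")   # ~0.683
print(f"within 2 sd: {within_2sd:.3f}")   # ~0.954
```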
Raw data is typically not normally distributed; errors often are, as are averages and totals in large samples. Despite the historical importance of the normal distribution in statistics, data is not generally normally distributed.
The distribution can be highly skewed (asymmetric), such as with income data, or the distribution can be discrete, as with binomial data. Both symmetric and asymmetric distributions may have long tails.
T-Distribution: a normally shaped distribution, except that it is a bit thicker and longer on the tails. Distributions of sample means are typically shaped like it, and there is a family of t-distributions that differ depending on how large the sample is. The larger the sample, the more normally shaped the t-distribution becomes.
Its accuracy in depicting the behaviour of a sample statistic requires that the distribution of that statistic for that sample be shaped like a normal distribution; in fact, sample statistics are often normally distributed even when the underlying population data is not.
It is used in classical statistical inference but is not as central to the purposes of data science, because empirical bootstrap sampling can answer most questions about sampling error.
With larger samples, and provided that the probability of success is not too close to 0 or 1, the binomial distribution can be approximated by the normal distribution.
The chi-squared distribution is typically concerned with counts of subjects or items falling into categories.
The F-distribution measures the extent to which differences among group means are greater than we might expect under normal random variation. This comparison is termed analysis of variance (ANOVA). It compares variation due to factors of interest to overall variation.
The Poisson distribution tells us the distribution of events per unit of time or space when we sample many such units, provided the rate remains constant over the period being considered. This is rarely reasonable in a global sense; however, the time periods or areas of space can usually be divided into segments that are sufficiently homogeneous so that analysis or simulation within those periods is valid. For events that occur at a constant rate, the number of events per unit of time or space can be modeled as a Poisson distribution.

The term inference reflects the intention to apply the experiment results, which involve a limited set of data, to a larger process or population.

Hypothesis tests (significance tests)
The alternative hypothesis is what we hope to prove; the null hypothesis is that chance is to blame. We need a null hypothesis because the human mind tends to underestimate the scope of natural random behaviour. One manifestation of this is the failure to anticipate extreme events, or to misinterpret random events as having patterns of some significance.
Statistical hypothesis testing was invented as a way to protect researchers from being fooled by random chance.
In an experiment we require proof that the difference between groups is more extreme than what chance might reasonably produce. This involves a baseline assumption that the treatments are equivalent and any difference between the groups is due to chance. This baseline assumption is termed the null hypothesis: nothing special has happened, and any effect you observe is due to random chance. Our hope is that we can in fact prove the null hypothesis wrong and show that the outcomes from groups A and B are more different than what chance might produce.
Resampling has the general goal of assessing random variability in a statistic; it can also be used to assess and improve the accuracy of some machine learning models. A permutation test is the procedure of combining two or more samples together and randomly reallocating the observations to resamples.
P-value: it is not simply the probability that the result is due to chance; it is the probability that, given a chance model, results as extreme as the observed results could occur.
In fact, a significant p-value does not carry you quite as far along the road to proof as it seems to promise. The logical foundation for the conclusion "statistically significant" is weaker when the real meaning of the p-value is understood.
It does not measure the probability that the studied hypothesis is true. Larger samples ensure that small, non-meaningful effects can nonetheless be big enough to rule out chance as an explanation.
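A minimal sketch of a permutation test on made-up two-group data (group sizes, effect size, and iteration count are invented for illustration): pool the samples, repeatedly shuffle and re-split them, and count how often a difference as extreme as the observed one arises under the chance model.

```python
import numpy as np

rng = np.random.default_rng(0)
group_a = rng.normal(10.0, 2.0, size=100)   # hypothetical session times, group A
group_b = rng.normal(10.6, 2.0, size=100)   # group B, slightly shifted by construction

observed = group_b.mean() - group_a.mean()
pooled = np.concatenate([group_a, group_b])

# Permutation test: shuffle the pooled data, reallocate to two groups of the
# original sizes, and record the difference in means under the chance model.
perm_diffs = []
for _ in range(10_000):
    rng.shuffle(pooled)
    perm_diffs.append(pooled[:100].mean() - pooled[100:].mean())

p_value = np.mean(np.abs(perm_diffs) >= abs(observed))   # two-sided p-value
print(f"observed difference: {observed:.3f}, p-value: {p_value:.4f}")
```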

Formal statistical tests are used only sparingly in data science, since the data size is usually large enough that it rarely makes a real difference whether the denominator has n or n-1.

Chi-square statistic: a measure of the extent to which some observed data departs from expectation (that is under the null hypothesis).
Intuitive resampling procedures (permutation and bootstrap) allow us to gauge the extent to which chance variation can play a role in data analysis.

Regression and Predictions
When the predictor variables are highly correlated it is difficult to interpret the individual coefficients.
An extreme case of correlated variables produces multicollinearity, a condition in which there is redundance among the predictor variables. Perfect multicollinearity occurs when one predictor variable can be expressed as a linear combination of others.
However, it is not such a problem for non-linear regression methods like trees, clustering, and nearest neighbours, and in such methods it may be advisable to retain P dummies instead of P-1.
The opposite problem is confounding variables: an important variable is not included in the regression equation, which can lead to a regression equation with spurious relationships.
Heteroscedasticity is the lack of constant residual variance across the range of predicted values.
Most often in data science the interest is primarily in predictive accuracy, even though a regression may violate one of the distributional assumptions. You may discover that there is some signal in the data that your model has not captured. However, satisfying distributional assumptions simply for the sake of validating formal statistical inference (p-values, F-statistics) is not that important for the data scientist.
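To make the multicollinearity point concrete, here is a rough sketch on synthetic housing-style data (variable names and coefficients are invented): fit an OLS model with statsmodels and inspect variance inflation factors for the deliberately correlated predictors.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
n = 500
sqft = rng.normal(1500, 300, n)
rooms = sqft / 500 + rng.normal(0, 0.2, n)          # deliberately correlated with sqft
age = rng.uniform(0, 50, n)
price = 50 * sqft + 2_000 * rooms - 500 * age + rng.normal(0, 10_000, n)

X = sm.add_constant(pd.DataFrame({"sqft": sqft, "rooms": rooms, "age": age}))
model = sm.OLS(price, X).fit()
print(model.params)   # individual coefficients are hard to interpret when predictors are correlated

# Variance inflation factors: large values flag redundancy among predictors
for i, col in enumerate(X.columns):
    print(col, round(variance_inflation_factor(X.values, i), 1))
```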

While outliers can cause problems for the small data sets, the primary interest with outliers is to identify problems with the data or locate anomalies.

One strategy for imbalanced data is undersampling, that is, downsampling the prevalent class so that the data to be modelled is more balanced between zeros and ones. The basic idea is that the data for the dominant class has many redundant records.
The opposite solution is oversampling, or data generation via bootstrapping.
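A small pandas sketch of both strategies on a toy imbalanced label column (the column name and class ratio are made up):

```python
import pandas as pd

# Hypothetical imbalanced outcome: far more 0s than 1s
df = pd.DataFrame({"outcome": [0] * 9_500 + [1] * 500})

minority = df[df["outcome"] == 1]
majority = df[df["outcome"] == 0]

# Undersample: downsample the prevalent class to the size of the rare class
under = pd.concat([minority, majority.sample(len(minority), random_state=0)])

# Oversample: bootstrap (sample with replacement) the rare class up to the majority size
over = pd.concat([majority, minority.sample(len(majority), replace=True, random_state=0)])

print(under["outcome"].value_counts(), over["outcome"].value_counts(), sep="\n")
```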

K-nearest neighbours
It is one of the simpler methods because there is no model to be fit.
Find the K records that have similar features; for classification, find out what the majority class is among those similar records and assign that class to the new record; for prediction (a.k.a. regression), find the average among those similar records and predict that average for the new record.
A neighbour is a record that has similar predictor values to another record.
Typically the predictor variables are standardized so that variables on a large scale do not dominate the distance metric.
When K is too low we may be overfitting, that is, including the noise in the data, whereas when it is too high the model loses the ability to capture the local structure in the data (oversmoothing).
KNN can also be useful as a feature engine, adding local knowledge in a staged process with other classification techniques: for each record a KNN classification is derived, and that result is added as a new feature to the record; then another classification method is run on the data. The original predictor variables are thus used twice, without the problem of multicollinearity, because the additional information is not redundant.
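A minimal scikit-learn sketch of the standardize-then-vote workflow described above (the dataset and K=5 are arbitrary choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Standardize first so large-scale variables do not dominate the distance metric,
# then take a majority vote among the K nearest records.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)
print(f"test accuracy: {knn.score(X_test, y_test):.3f}")
```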

Decision Trees
As the tree grows bigger, the splitting rules become more detailed, and the tree gradually shifts from big rules that identify real and reliable relationships in the data to tiny rules that reflect only noise.
A fully grown tree results in completely pure leaves and hence 100% accuracy in classifying the data it was trained on, which is illusory: it is overfitting (fitting the noise in the training data).
For continuous variables, impurity is measured by squared deviations from the mean in each subpartition.
Trees can capture non-linear relationships among predictor variables.

Bagging (Bootstrap Aggregating) is the basic algorithm for ensembles, but instead of fitting the various models to the same data each new model is fitted to a bootstrap resample.
Form a collection of models by bootstrapping data.
Random Forest applies Bagging to decision trees.
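A hedged scikit-learn sketch comparing a single tree, bagged trees, and a random forest by cross-validation (the dataset and number of trees are chosen only for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Bagging: fit each tree to a bootstrap resample and vote/average the predictions
bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=200, random_state=0)
# Random forest: bagging of trees plus random selection of variables at each split
forest = RandomForestClassifier(n_estimators=200, random_state=0)

for name, model in [("single tree", DecisionTreeClassifier(random_state=0)),
                    ("bagged trees", bagged),
                    ("random forest", forest)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f}")
```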

Boosting
It fits a series of models in which each successive model seeks to minimize the error of the previous model, by giving more weight to the records with large residuals in each successive round.
XGBoost is an implementation of stochastic gradient boosting, which incorporates resampling of records and columns in each round. Contrary to the random forest (which fits deep trees), this sampling is done without replacement, and it fits shallow trees, which avoids spurious complex interactions.
While many methods need little parameter tuning, blind application of XGBoost can lead to unstable models as a result of overfitting to the training data: the accuracy of the model on new data not in the training set will be degraded, and the predictions from the model are highly variable, leading to unstable results.
Regularisation tries to avoid the overfitting by adding a penalty term to the cost function that grows with the number of parameters in the model, ultimately penalizing the complexity of the model.
There are two parameters, alpha and lambda; increasing them penalizes more complex models and reduces the size of the trees that are fit.
Another useful tool is cross-validation, which randomly splits the data into K different groups, also called folds. For each fold, a model is trained on the data not in the fold and then evaluated on the data in the fold.
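A sketch tying these pieces together, assuming the xgboost package with its scikit-learn wrapper (the parameter values are illustrative, not recommendations):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)

# Shallow trees, subsampling of rows/columns, and alpha/lambda penalties
# all push back against overfitting.
model = XGBClassifier(
    n_estimators=300,
    max_depth=3,           # shallow trees
    learning_rate=0.1,
    subsample=0.8,         # sample rows each round
    colsample_bytree=0.8,  # sample columns each round
    reg_alpha=1.0,         # L1 (alpha) penalty
    reg_lambda=5.0,        # L2 (lambda) penalty
)

# K-fold cross-validation: train on the data outside each fold, evaluate on the fold
scores = cross_val_score(model, X, y, cv=5)
print(f"mean CV accuracy: {scores.mean():.3f}")
```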
Zhuldyzay Baki
32 reviews
December 27, 2024
Probably very informative for people with little DS and statistics background to get to the field. For those with academic degrees in statistics, quite useful to get into professional jargon that data scientists use.
Rick Sam
432 reviews, 155 followers
July 7, 2022
A Layman explanation for you:


A good scientist is able to condense and explain in the most simple way, such that even my Grandma can understand.

Pre-requisite: Thirst for Knowledge & insane Curiosity with questioning.

What do you want?

Depends on Who are you?

Engineer: You need to know, only how to use it or apply it

Do you want to build next-generation technology?

Research Scientist/Professor: You need to understand deeply, to create novel methods & contribute


I am wanting to write Math & Statistics, perhaps in my next writings.
In this, I condense to the core aspects.


What is Data-Science?

Applied-Statistics represented using Programming Languages for Making Meaning out of Data.

Outline of this work:

1. Exploratory Data Analysis
2. Data and Sampling Distribution
3. Statistical Experiment and Significance Testing
4. Regression and Prediction
5. Classification
6. Statistical Machine Learning
7. Unsupervised Learning

So, What is the meat of this Book?

Let's go through this

1. Exploratory Data Analysis

First Chapter gives clues for Data Analysis, from John Tukey's seminal paper. In short, Data Analysis is exploratory, through simple plots and summary statistics.

We have, numeric & categorical data

Numeric: Continuous & Discrete
Moreover, for two dimensional data, we have Rectangular Data [Row & Column]

In PANDAS, Google Colab et al - we have Data-frame.

In addition to the above, we have non-rectangular data structure, time-series & spatial data-structure.

Summary Statistics, What we want is to create a short summary of our data.

To do Summary Statistics: We commonly have, mean, median, outliers, anomaly detection, deviation, variance.

These are basic statistical measures for exploratory data.

For Visualization: We can use, Box-Plot, Histogram, Density Plot, Scatterplot, Contour Plots,

Gist of this chapter, Summarizing, Visualizing the data.
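A tiny pandas sketch of this summarize-and-visualize workflow (column names and values are invented; the plots assume matplotlib as the plotting backend):

```python
import pandas as pd

# Hypothetical numeric and categorical columns, for illustration only
df = pd.DataFrame({
    "price": [120, 95, 340, 180, 99, 410, 150],
    "rooms": [2, 1, 4, 3, 1, 5, 2],
    "city":  ["A", "B", "A", "C", "B", "A", "C"],
})

print(df.describe())                 # mean, std, quartiles for numeric columns
print(df["city"].value_counts())     # frequency table for a categorical column

# Basic exploratory plots
df["price"].plot.hist(bins=5)
df.plot.scatter(x="rooms", y="price")
df.boxplot(column="price", by="city")
```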

2. Data and Sampling Distribution

In applied work [industry], you'd probably be more concerned with data-quality, scale et al.

The Author goes into details with the following,

Population, in statistics describes defined large set of data.

Sample is subset of data, from larger set.
Random sampling, equal chance of being chosen.

Data quality consists of completeness, consistency, cleanliness, accuracy & representativeness.

Bias within statistics includes, difference between actual and predicted values.

Sometimes, bias might be because of random chance.
Sometimes, bias might be due to actual bias.

Size vs Quality: When does Size Matter?

The Author says, surprisingly, smaller is better.
Consider best predicted search destination for a query.

For Massive amounts, the author gives example of query in Google Search.

Imagine Google Search, for a query, "Tamil" Which one would come first?

The query has to pass through 100,000k documents and give you a relevant result.

So, how to mitigate bias?

We do it through, random shuffling.

Specifying a hypothesis,
Collecting data following randomization,
Random sampling principles ensure against bias.

Regression towards Mean,

Regression toward the mean simply says that, following an extreme random event, the next random event is likely to be less extreme.

Bootstrap is when, we take a sample taken with replacement from observed data-set.

And why do we do this? To assess variability of sample statistic.

Bootstrap is also a way to construct confidence interval.

Confidence Interval, is a way to represent uncertainty, gives us a range of interval.

Normal Distribution, the most famous distribution, imagine a nicely assorted Indian food, thali.

Contrary to what we believe, most of the data used in Data Science projects is long-tailed.

It is not normally distributed.

T-Distribution is shaped like normal, a bit thicker and longer on tails.

Binomial Distribution -- Well, you have discrete set of values within a random distribution.

We have two values, that is why Binomial [Yes/No]

We also have the Chi-Square Distribution; in short, we want to measure the extent of departure from what we expect under the null model.

We represent this as, null hypothesis.

We have F-Distribution, where we measure ratio of variability among groups.

Imagine a fertilizer applied to groups, and we want to find out the variability of its effectiveness.

Poisson Distribution, When we have time involved, which is, Average number of events per unit of time or space.

We could ask: how much capacity do we need to be confident that the internet traffic arriving every second can be handled?

Exponential Distribution, We want to estimate failure rate i.e aircraft engine failure.

Weibull Distribution, We extend further from Exponential, where event rate is allowed to change.
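As a sketch, the distributions listed above are all available in scipy.stats (the parameters below are made up for illustration):

```python
from scipy import stats

# Draw a few samples from the distributions mentioned above
binom_draws = stats.binom.rvs(n=10, p=0.3, size=5, random_state=0)     # successes out of 10 trials
poisson_draws = stats.poisson.rvs(mu=2, size=5, random_state=0)        # events per unit of time
expon_draws = stats.expon.rvs(scale=1 / 0.5, size=5, random_state=0)   # time to failure, rate 0.5
weibull_draws = stats.weibull_min.rvs(c=1.5, size=5, random_state=0)   # event rate allowed to change

print(binom_draws, poisson_draws, expon_draws.round(2), weibull_draws.round(2))
```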

3. Statistical Experiment and Significance Testing

Many Scientific Application requires experimentation.

Formulate Hypothesis, Experiment, Collect Data, Inference & Conclusions

A Popular one is A/B Testing, And Why do we do it?

Basically to find, which one is better?

Another popular method is Multi-Arm Bandit Algorithm.

So, Why use it? To optimize decision making, through m number of trials.

So, imagine, we want to design a policy that maximizes most returns.

Next, Hypothesis Tests consists of Significance test

Null Hypothesis: We take chance to blame
Alternative Hypothesis: Counterpoint to null
One-way test -- Hypothesis test that counts chance results in one direction
Two-way test -- Hypothesis test that counts chance results in two directions

T-Test: We want to find if there's a difference between the means of two populations.

Degrees of Freedom: Number of independent values

ANOVA: Analysis of Variance

Why use this? We have more than A/B; A/B/C/D with numeric data

F-Test: F value depends on this ratio of variance

Fisher's Test: Significance test, where, we use it for finding purposeful association between two categories

Multi-Arm Bandit Algorithm:

Basically, we have a hypothetical slot machine, where we try multiple attempts to make an optimal decision.
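A hedged sketch of one simple bandit policy, epsilon-greedy, which may or may not be the exact algorithm the book presents (the arm rates and epsilon are invented): mostly pull the arm with the best observed rate, but explore a random arm a small fraction of the time.

```python
import numpy as np

rng = np.random.default_rng(0)
true_rates = [0.04, 0.05, 0.07]          # hypothetical conversion rates per "arm"
counts = np.zeros(3)
rewards = np.zeros(3)
epsilon = 0.1                            # fraction of pulls spent exploring at random

for _ in range(10_000):
    if rng.random() < epsilon:
        arm = rng.integers(3)            # explore: pick a random arm
    else:                                # exploit: pick the arm with the best observed rate
        arm = int(np.argmax(np.divide(rewards, counts, out=np.zeros(3), where=counts > 0)))
    counts[arm] += 1
    rewards[arm] += rng.random() < true_rates[arm]   # 1 if this pull "converted"

print("pulls per arm:", counts)
print("estimated rates:", (rewards / np.maximum(counts, 1)).round(3))
```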

4. Regression and Prediction

Linear Regression:

Relationship between Magnitude of One Variable and Second Variable.

Multiple Linear Regression: Here, we find relationship between two or more independent variables to predict outcome of dependent variable.

Root Mean Squared Error:

We want a performance metric, meaning, we build a model.

And we want this model to predict something.

So, we find out difference between predicted and actual.

Many ways to reduce this error, and RMSE is a popular way.

So, we have square root, okay of what? Of averaged squared errors.
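In code, that definition is just a couple of lines (the numbers are invented):

```python
import numpy as np

actual = np.array([3.0, 5.0, 2.5, 7.0])       # observed outcomes (made-up values)
predicted = np.array([2.5, 5.0, 4.0, 8.0])    # what the model predicted

rmse = np.sqrt(np.mean((predicted - actual) ** 2))   # square root of the averaged squared errors
print(f"RMSE: {rmse:.3f}")
```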

Cross-Validation:

What are we trying to do with Cross-validation?

We are trying to sample data from here and there within the data to see, how it does on prediction.

Weighted Regression:

So - Regression, recall we get a scalar value.

Weighted Regression: we want to use it when our dataset displays heteroscedasticity, meaning non-constant variance.

Multi-Collinearity:

We say, two variables are multi-collinear, when there is high correlation between them.

5. Classification

Questions like,


a) "Is this customer likely to churn?"
b) "Is this person going to come back and read my review?"
c) "Is this person going to eat Tamil Dosai?"

These are all, Classification questions.

They come under, supervised learning.
Recall, supervised learning,

We have a label, Tamil Dosai for food, and then, we want to classify if the next food, say Idly is Dosai or not?

If it did classify Tamil Idly as Dosai, then we are not able to generalize.

That is not, what we are wanting.

Naive Bayes:

So, what is this?

We assume that, between the predictor variables, there is no relationship.

They are all independent, and it's a simple form of Bayes' Theorem.

Discriminant Analysis:

So, What are we wanting?

We are wanting groups, yes, groups.

Imagine a set of South Indian food [Dosai, Idly, Chutney, Rice, Sambar, Chicken]

Well, We eat it, but we keep them in groups, right?

And, what do we want to do with it?

We want a combination of those variables to predict, veg or non-veg?

Yes -- So, we have an assumption called the multivariate normal distribution.

Which means we assume the dataset is normally distributed.

We have independent variables & a dependent variable.

We use the dependent variable for forming groups and the independent variables as predictors.

We can use it for categorical or continuous predictors

Logistic Regression:

A Popular method, here the outcome is binary.

Mostly, it's simple and faster to use.

Generalized Linear Models

A Probability distribution or family.

6. Statistical Machine Learning

K-Nearest Neighbours:

Basically, this is non-parametric, meaning, there's no assumption in the dataset.

And it is supervised learning.

Distance Metrics:

Mahalanobis distance:

You'd come across frequently many distance metrics.

Mahalanobis is a distance metric that measures the distance between a point and the mean of a dataset, taking the covariance matrix into account.

One Hot Encoder:

We have, "Dosa" and we want to represent this into a way, computer understands.

In simple terms, we translate this into 0 and 1.

Normalization:

Again, we have, "Dosa" variable, and we are scaling to make sure to do computation easier.

Z-Scores:

How far is taste of yours from regular Tamil people in Food?

We want a normal distribution to represent taste buds.

And then, we have Z-score, if your Z-score is zero, it says to me, it's identical.
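A tiny pandas sketch of one-hot encoding plus z-scores, reusing the Dosai example (the column names are invented):

```python
import pandas as pd

df = pd.DataFrame({"dish": ["Dosai", "Idly", "Dosai"], "spice_level": [7, 3, 5]})

# One-hot encoding: turn the categorical "dish" column into 0/1 indicator columns
encoded = pd.get_dummies(df, columns=["dish"])

# Z-scores: subtract the mean and divide by the standard deviation
encoded["spice_z"] = (df["spice_level"] - df["spice_level"].mean()) / df["spice_level"].std()
print(encoded)
```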

Classification and Regression Trees:

CART:

I actually don't know tree names in America.

I'd stick with tree names in Tamil Nadu, India.

Oh, Coconut Trees.

So, imagine classifying each branch from coconut, tree based on some criteria [Dosai/non-Dosai]

So, we split the tree with criteria.

Recursive partitioning:

We are wanting to create a decision tree, based on a criteria.

Bagging and Random Forest:

We want to use Bagging alongside with Random Forest - Why?

To reduce variance.

Ah, recall aggregation & bootstrapping.

Variable Importance,

How much a variable contributes to the model's ability to make accurate predictions.

Hyper-parameter:

Basically, we have parameters to control, learning process of the models.

Ah! We want to tune our model.

Boosting:

We want to reduce errors in data analysis.

Usually, we apply where classifier has high bias.

Ah! Bias, recall, is when the difference between actual and predicted values is large.


XGBoost:

XGBoost, a type of regression and classification technique, mostly used in learning to rank.

Well, we have something called, mix of models, they say it ensemble models.

Basically, we combine few of the models, to create better performance and increase prediction.

Cross-Validation:

What we are wanting to do is a resampling method, so that we can test and validate on different portions of the data.

7. Unsupervised Learning

Frequently, in Machine Learning; We do come across ways to do learning.

So, Unsupervised Learning is basically a way to extract meaning without Labels.

Clustering:

Recall, we want to group them, we can use this for exploratory data.

We could use it when wanting to reduce dimensions.

Principal Component Analysis:

We want to do PCA, when we want to find, most important variables that give, most divergence.

Think this way, you have 100 columns of data, you want to find most important for your question.

So, PCA would give you the 3 or 4 that co-vary the most.

Issues:

We can't use PCA for categorical values
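A short scikit-learn sketch of reducing many numeric columns to the few components that explain most of the variance (the dataset and component count are chosen only for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)     # 30 numeric columns

# Scale first, then keep the few components that explain most of the variance
pca = PCA(n_components=4)
components = pca.fit_transform(StandardScaler().fit_transform(X))

print(components.shape)                        # (n_samples, 4)
print(pca.explained_variance_ratio_.round(3))  # share of variance per component
```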

Correspondence Analysis:

Ah, so we can't use PCA for categorical variables.

What do we do? We use Correspondence Analysis to find association between categories.

We get the output as a bi-plot.

K-Means Clustering:

Clustering, recall, we want groups.

And why do we want it?

Well, we want it because, we make groups to do some exploratory data or something like it.

Cluster - similar records, and k is number we want.
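A minimal scikit-learn sketch of k-means on synthetic data (k and the data are arbitrary):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic data with a known group structure, just for illustration
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)   # k = number of clusters we want
labels = kmeans.fit_predict(StandardScaler().fit_transform(X))

print(labels[:10])               # cluster assignment per record
print(kmeans.cluster_centers_)   # the k cluster means
```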

Hierarchical Clustering:

We usually have a tree-like cluster map.

It is usually built with the agglomerative algorithm.

Measure of Dissimilarity:

So, how do we measure dissimilarity?

Complete linkage, Single linkage, Average Linkage, Minimum variance.
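The four dissimilarity measures listed above map directly onto scipy's linkage methods (a sketch on synthetic data; 'ward' is the minimum-variance option):

```python
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=30, centers=3, random_state=0)

# Agglomerative clustering with different dissimilarity measures between clusters
for method in ["complete", "single", "average", "ward"]:   # ward = minimum variance
    Z = linkage(X, method=method)
    print(method, "top-level merge distance:", round(Z[-1, 2], 2))

# dendrogram(Z) would draw the tree of clusters with matplotlib
```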

Multivariate Normal Distribution:

There are a lot of distributions; the multivariate normal is the normal distribution extended to more dimensions.

Scaling and Categorical Variables:

Basically, we are squashing or expanding the data to bring multiple variables to the same scale.

Categorical variables: Yes/No

Gower's Distance: Similarity measure, Ah! We use it for binary or categorical variables

"Torture the Data long enough, and it will confess"

Deus Vult,
Gottfried
Xandra
10 reviews
July 16, 2025
Really good reference book, with practical code examples. Also really appreciated some of the side notes, especially the little "scorpion" ones on very specific details to look out for that might cause confusion. Reading it from cover to cover was a bit dry at times, but I was generally pleasantly surprised how doable it was for a book that packed in so much technical knowledge. It complemented the Machine Learning specialisation I had done on Coursera very well. I feel more confident knowing what I don't know; the book did a good job outlining a lot of concepts that might be relevant at some point, so at least I will know where to look next time.
5 reviews
September 26, 2021
'Praktische Statistik für Data Scientists' by Peter Bruce, Andrew Bruce, and Peter Gedeck.

I read this book after the one by Aurélien Géron titled 'Praxiseinstieg Machine Learning mit Scikit-Learn, Keras und TensorFlow' (also from O'Reilly), and in my opinion that was the right decision. Practical Statistics is about the essential information that you can find and refresh very quickly when necessary. That is why, from my point of view, you should already have certain basic knowledge of machine learning in order to better follow the content and the applications.

The authors wrote relatively short chapters, which gives an overview of each topic but does not go into depth. In return, other sources are listed as further reading. Originally the book was written for R; later, in the second edition, it was supplemented with Python. As a hobby Pythonista who had practically no contact with R, I could follow all the examples, and with a few exceptions I was able to get the same results with Python as the book shows in R. The complete code can, by the way, be found on GitHub, in case typing it in yourself leads to error messages.

Statistics as a subject can be somewhat boring, and unfortunately a few chapters were not entirely exciting (for example, the end of the second chapter, where various distributions are described, and the third chapter on statistical experiments and significance tests). That may also be because I do not have a university background in the field.

Overall, though, I have to say the book is well done, and I will keep it on the shelf nearby in case I need to quickly read up on a few statistics topics.
Jack Bodine
4 reviews
July 9, 2025
A quick tour of many statistical concepts. It presents topics intuitively rather than mathematically, which isn't inherently good or bad but something to be aware of.

I would've preferred more depth than the book offers. It feels more like a primer and quickly moves from one topic to another. I wouldn't recommend it as a reference, as there are scarcely any algorithms presented for each topic.

To illustrate: one of my favorite statistical topics, multi-armed bandits, only gets 3 pages. After reading, the reader has no idea what bandit algorithms exist out there, just a high-level overview of what a bandit problem is.

Despite its shallowness, I still like the book for what it is— an introduction for people who don't actually want to learn stats. Otherwise, it seems inferior in every way when compared to Gareth James's 'An Introduction to Statistical Learning.'
333 reviews, 3 followers
September 2, 2022
I found this really handy as a reference and a supplement for several of my classes. It didn't go in the same order as my class so I didn't read it cover to cover but pulled from most sections. It's a little too dense and too brief to be the only learning tool for beginners (to both programming and stats) like me, but they do a good job of summarizing the topics, defining vocabulary, and then giving useful code samples in both Python and R. Packages evolve fast though, so some places where they say "there's no package in Python for x" there actually are now. Covers everything from beginning stats through supervised learning, and clusters but not market basket in unsupervised learning.
126 reviews
December 9, 2024
Great overview focused heavily on data science, not traditional statistics

I got this book to get a better overview of traditional statistics while also having a reference on how to apply it in data science. To my surprise, data science takes a different angle and prefers to avoid assumptions about the mathematical distribution of the data, using Monte-Carlo-like approaches instead. This was an unexpected but interesting takeaway.
In general, the book covers the individual topics in quite a bit of detail, but not so much that it was overwhelming. Lots of resources for further reading are given.
Bjoern Rochel
398 reviews, 83 followers
September 5, 2025
Unfortunately not my cup of tea. Good overview of the statistical methods, models and techniques that are out there and used in the context of Data Science.

But very far away from "practical". For me practical doesn't mean "with code examples", but rather no fluff / no detours / straight to the point. This book contains several parts that are later abandoned à la "you'll probably do this instead nowadays", or alternatively where it's not clear in which use case you should prefer which tool. But I guess a large part was due to me potentially not being in the target audience for this book.

Anyway, it was OK to read to get a bit more background on the models used before modern data science and LLMs.
Dmitry Rizdvanetsky
2 reviews
March 14, 2021
The book gives a brief overview of various applications of statistics for data science purposes. The concept explanations are short and to the point. The examples of code are concise and generally well written - many of them are reusable straight out of the box. This book is especially good if you are looking for visually appealing methods to perform EDA (the graphical part of the book is just splendid). Recommended for beginners and established professionals as a refresher.
37 reviews, 4 followers
April 24, 2021
The content of the book matches the title. It gives practical knowledge about concepts which are generally and widely used in the world of Data Science. Although I felt that some topics require much more in-depth information, as they are just briefly touched upon, the authors give references which can be followed up on after every topic.
Overall a good book to learn without much concentration on theories and with practical examples.
Krishnan
202 reviews, 6 followers
December 1, 2022
Can't say the book was always easy to understand. I had to keep referring to YouTube videos to better understand the concepts mentioned in the book. On the bright side, I discovered a nice YouTube channel called StatQuest, which is probably one of the best teaching materials on the topic of statistics out there.
Aaron Schumacher
203 reviews, 11 followers
November 24, 2021
I was happy to do a pre-publication technical review of the upcoming second edition of this fine book by three folks including the founder of statistics.com. The book has a unique perspective enlivened by examples and historical detail, which I enjoyed.
Lorenzo Barberis Canonico
133 reviews, 5 followers
December 13, 2022
Such an accessible book that makes otherwise tedious material easy to follow.

Overall, this book is a great intro that helped me truly understand the difference between a statistician and a data scientist
YellowG
9 reviews, 7 followers
September 1, 2021
Excellent primer on a multitude of subjects relevant to acolytes of data science and machine learning, with a clear writing style, good examples and illustrative graphs made with R. Highly recommended.
87 reviews, 3 followers
March 21, 2022
The theory is well explained: you start by reviewing your statistics concepts - arguably the hardest part - and before you notice it you are already using supervised machine learning libraries lol
David McAtee
6 reviews, 1 follower
January 31, 2023
Excellent overview of many different topics. Great resource to have on the shelf for a quick reference.
Jimmy Pang
3 reviews
March 23, 2023
Have to say it is a pretty intensive one - it definitely takes some time & focus to read & understand the concepts.
Franklin Tan
30 reviews
August 18, 2023
I picked up this book as a refresher, and it did a great job of succinctly describing essential concepts you need to know in the field of data science.
Jiwon Kim
207 reviews, 3 followers
February 29, 2024
It felt like a good summary of what my graduate school studies were about.
JP
259 reviews, 3 followers
April 15, 2024
Good survey of the methods utilized
Tim
265 reviews, 2 followers
January 2, 2022
More R than Python. You can tell the different authors as you go through the sections
