This book is not as good as R for Data Science: Import, Tidy, Transform, Visualize, and Model Data, but if you are constrained or committed to using Python, it is the best available alternative as of 2018. Chapters 1 through 3 on IPython, Numpy, and Pandas are very well written, although they do suffer from using mostly small, made-up examples. Chapter 4 on Matplotlib is disappointing, but that's because Matplotlib is itself a weak and obsolete tool; the book acknowledges that fact and cannot fix it. I do not care for Chapter 5, which attempts too much and delivers too little (for example, the “in depth” treatment of linear regression is all of 2 pages). I suggest that you stop at the end of Chapter 4 and instead move on to Introduction to Machine Learning with Python: A Guide for Data Scientists.
As an alternative to this book, also consider Python for Data Analysis by Wes McKinney, which includes more verbose coverage of Pandas, at the expense of removing the ML section that you probably don't want to read anyway. The two are about equally good and share the same strengths (good writing) and weaknesses (dry references with mostly made-up data in the examples, and use of Matplotlib for graphics).
Update in late 2018: I now recommend Altair as the best native-Python graphics library, or plotnine, a clone of ggplot. Either way, you should skip most of Chapter 4 on matplotlib and learn one of these other libraries instead.
I've never met Pearl, but having read a couple of his books, I'm pretty sure he's an asshole. His anger and bitterness come through very clearly in his book — he spends as much space naming and vilifying his professional enemies, both living and dead, as he does explaining his work. This is a real shame, because his work is actually quite good and deserves a popular presentation; sadly the sanctimony in this book is almost unbearable, and there is no humor to lighten it.
Unfortunately I don't have an alternative to recommend instead. I think your best bet is to read chapters 1, 4, and 6, and keep a bottle of antacid handy.
This book is good, provided you do not believe the author's facile claims that it is the only book you need. This book explains almost nothing about how deep learning actually works, and is actually more like a user manual for Keras. Provided you actually want an instruction manual for Keras, it's an excellent book. If you want to *understand* something about Deep Learning, go read the book by Goodfellow et al. They make a nice set, in either order or alternating between the two.
* Practical guidance on data preprocessing, feature engineering, and handling class imbalance
* An introduction to the caret library, which offers a uniform interface to cross-validation and hyperparameter tuning
* An overview of a larger set of models and libraries than ISLR covers
This book is a well-written, verbose introduction to Pandas by the main author of that library. Don't expect to learn much besides Pandas - matplotlib gets a brief mention, and there is a short Numpy section, but broadcasting is relegated to an appendix.
This book is a peer of Python Data Science Handbook by Jake VanderPlas, and they are more alike than different. They both start with long sections on manipulating data in Numpy and Pandas, using mostly made-up examples of random numbers. This book is the more verbose of the two; it does have more complete coverage of Pandas functionality (albeit less coverage of Numpy), and it also takes longer to read. It's only 4 stars because it's not very engaging: I prefer a book like this to introduce some real data early and to motivate the learning of techniques by showing how they help answer questions about the data, like R for Data Science does.
I find that matplotlib is unusably low-level for modern data science, and you should skip that section in any of the books and learn either Altair or plotnine (a clone of ggplot) for your plotting work in Python.
This book is an excellent gentle introduction to data analysis and exploration in R. I especially recommend it as the 1st book for software engineers who want to move into data science.
Because the "tidyverse" libraries are very "magical", making extensive use of nonstandard syntactical features like unquoted column names, this book does not teach good general programming practices. Therefore I do not recommend it if you are unfamiliar with programming and want to learn.
I recommend skipping Part IV entirely, as I feel the attempt at introducing regression in a non-mathematical way is largely a failure. I like the book in spite of this shortcoming.
This book is for people who already understand machine learning or predictive modeling, and who already understand investment, and would like some guidance on applying the one to the other. It is an excellent book if and only if you meet these conditions.
The author has a hint of Taleb-style arrogance, wanting to be recognized for being the smartest person in the room, but not enough to impede enjoyment of the book; that arrogance also answers the question of why he published it at all in a field which is otherwise characterized by "those who know do not say."
This book is well written and packs a substantial amount of information into a small number of pages. It is best used to get a survey and overview of many of the facets of the domain of data science. This book will not teach you anything in enough depth to actually execute it well — it will teach you just enough to be dangerous and not realize when you've gone off the rails. I recommend it for managers who may never go into technical depth, for people considering whether or not they are interested in data science, or as a preview book to create a framework from which to hang more detailed understanding. Although this is an introductory book, it assumes you can already program in R. If you can't, either accept that you won't be able to follow the specifics of the examples, or read The Art of R Programming and/or R for Data Science.
I dislike that the authors make a number of categorical statements of the form "Data Scientists do this" or "Data Scientists don't need that". I disagree with many of these assertions and I think they have taken a definition of "data science" which is narrower than the prevailing consensus in the industry.
This book has some errors (see, for example, the confusion matrix on page 196) but overall the accuracy is acceptable relative to recent norms.
This should not be your first book on causality. Start with Kline, and if you finish that book and want more on SCM, then come back to this book. Another reasonable place to start would be Mostly Harmless Econometrics.
The problem is that Pearl, who is undeniably a significant contributor to the field, is not a good writer. He does not explain concepts clearly, and he cares more about promoting his own contributions than educating. Although this book has a general-sounding title, it makes no attempt to actually cover the whole field of causal inference; it's only about Pearl's work and that of his students.
What this book is really about is Pearl's mathematical "do-calculus", and how, given a complete causal graph, it can be used to rigorously state what it means to intervene or to assess a counterfactual.
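To make the adjustment idea concrete, here is a toy sketch of my own (not an example from the book): given the complete causal graph Z → X, Z → Y, X → Y, the backdoor formula P(y | do(x)) = Σ_z P(y | x, z) P(z) turns the made-up conditional tables below into an interventional quantity.

```python
# Toy backdoor adjustment on the graph Z -> X, Z -> Y, X -> Y.
# All numbers are invented for illustration; variables are binary.

P_z = {0: 0.6, 1: 0.4}          # P(Z = z)
P_y1_given_xz = {               # P(Y = 1 | X = x, Z = z)
    (0, 0): 0.10, (0, 1): 0.30,
    (1, 0): 0.40, (1, 1): 0.70,
}

def p_y1_do_x(x):
    """P(Y = 1 | do(X = x)) = sum_z P(Y = 1 | x, z) * P(z)."""
    return sum(P_y1_given_xz[(x, z)] * P_z[z] for z in (0, 1))

effect = p_y1_do_x(1) - p_y1_do_x(0)  # interventional effect of X on Y
print(effect)
```

Note that the sum uses the marginal P(z), not the conditional P(z | x); that substitution is exactly what distinguishes intervening with do(X = x) from ordinary conditioning on X = x.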
For a brief introduction to using causal graphs to select your controls, see Chapter 17 of "Statistical Modeling - A Fresh Approach". That chapter is available free from the author at http://www.mosaic-web.org/go/Statisti...
For more about inferring causal graphs from the data, look for a series of papers by Colombo and Maathuis at ETH Zurich.
This is the correct first book to read on causal inference. It covers structural equation modeling (SEM), confirmatory factor analysis (CFA), and Pearl's structural causal modeling (SCM). Adequate preparation for understanding this book would be a basic treatment of multivariate regression, such as Gelman and Hill. Introduction to Statistical Learning would also be sufficient. If you want to really understand confirmatory factor analysis, you should probably already know something about factor analysis as well; I liked Gorsuch.
Although this book claims to cover various software packages, the treatment is cursory and the code examples (online) are mostly uncommented; don't expect to really learn how to use the software from this book. Read this book for the principles and then also read the software manual for whatever tool you're going to use.
Ironically, this book, whose title claims to be about SEM only, actually covers most of modern causal inference, whereas Pearl's book, with the grand title "Causality", covers only his own narrow work. This is definitely the one you want.
This is my favorite introductory statistics book. I've gone through a lot of stats books, and I've found that most of them have either too much math or too little math. I don't want more equations than text, and I also don't want "This here is a NOR-MAL distribution. Can you say NOR-MAL?" This is the "goldilocks" book - just right. It gives a very gentle introduction to what a probability distribution is, but it also covers all the critical concepts and tools: hypothesis and power tests on both continuous quantities and proportions, ANOVA, and Chi-sq tests.
When I'm teaching with this book, I use chapters 1 through 6, but I don't use chapters 7 and 8 on regression, because I think their explanation is weaksauce; I move my students on to either Introduction to Statistical Learning or Gelman and Hill's multilevel book instead.
You should read one book from Nassim Taleb, and this one would be a reasonable choice. With each book he writes, he develops some new ideas, but he also gets more tiresome as an author, focusing as much on name-dropping his high-powered friends and trying to convince you that he's the smartest person in the room as on presenting his actual ideas. In The Black Swan, he was annoying but still worth tolerating for the sake of the ideas. One book prior to this, Fooled by Randomness, would also be a reasonable choice. I found the next book, Antifragile, too tiresome. 70% of the content is the same between these books, so you won't need to read all of them.
This is an excellent 2nd book on linear algebra, after a traditional book like Strang (or whatever your professor used in college). It presents an unusually intuitive and geometric interpretation of important operations like SVD and QR decomposition, which is just the thing for thinking about models based on these algorithms.
The second half of the book discusses conditioning and which algorithms are more susceptible to instability; data science focused readers who are not writing their own models may want to skim through that part.
Disclosure: One of the authors is a friend of mine, but he didn't get me a discount on the book. ;)
This is my recommended book on multilevel models, and it's also much more than that. Even if you have no interest whatsoever in multilevel models, the first part of this book has very useful things to say about designing and interpreting experiments for causal inference, a topic which is sorely neglected in many modeling and machine learning books.
One caveat is that all the MCMC examples in this book use Bugs, which was Windows-only and is now somewhat obsolete. You should not actually use Bugs; use JAGS instead, which is mostly syntax-compatible.
I found this book rather theoretical and inaccessible; it is written in a style where the equations are largely expected to speak for themselves without assistance from words. Do not read it unless you are comfortable reading pages of equations with no useful expository text and deriving value from them.
This book is well-written and I agree with almost all of the authors' advice about how analyses should be carried out.
The problem is that it's a survey: it shows you a bunch of techniques and sketches out how to implement them in R, but it does not teach you enough about any of them to conduct them proficiently (even graphing, at the beginning, is taught in base R, which almost nobody currently uses). So you invest in reading a 400-page technical book, and then you're still not actually skilled enough to execute anything at the end. For most people, I would recommend starting with R for Data Science, which has less breadth and more depth.
Like Data Science for Business, this book will be useful for managers who need to know about the skills without having the skills. It may also be useful to people who already understand statistics and want to see the R tools which implement them, or to people from very different fields who want to see what marketing problems look like.
Perhaps the best use is for people who already work in the field and would like an easy read refresher with some new tips here and there. I especially liked the authors' use of simulated data to illustrate techniques, a practice I recommend.
This book is an extensive intermediate-level survey of the literature in recommender systems, organized by topic. It is mathematically very accessible, and provided you have read an introductory book about predictive models, such as Introduction to Statistical Learning, you should be able to follow it.
Aggarwal presents the tradeoffs between purely collaborative models (using what other people think, treating the item as an opaque ID), content-based models (using meaningful properties of the item), and guided search models, and how to combine them. There is a short but valuable section on learning to rank, and then he extends to even more challenging cases such as location or time-dependent recommendations.
Like all of Aggarwal's books, this one has an extensive bibliography so you can find more details. Unlike his Data Mining, it presents few entirely new algorithms, and instead talks about how to apply and reconfigure tools you already have for the specific case of recommendations and collaborative filtering. Recommended.
If you already understand the concepts of frequentist statistics, this book will clearly show you how to apply them using R, and get you from zero to a place where you can comfortably learn more from the online documentation. The book is clearly written and has copious examples; the explanations of the meaning of the output are often better than the library documentation. The chapter on manipulating data in R is particularly strong, with both clear exposition and a good selection of what to cover to help statisticians become productive.
However, the explanation of the statistics is very brief and entirely unsuitable for novices. When I ran into a few things I was not already familiar with, the explanations were too condensed and I had to look them up elsewhere to understand. If you do not already know statistics well, reading this book will result only in frustration. For people who have no statistics background, I think I would recommend starting with Baclawski's "Introduction to Probability with R" or "OpenIntro Statistics" by Diez et al.
This book contains some excellent points which people of all political stripes should be able to agree with, regrettably mashed together with anti-market socialism. Overall worth reading, but it fails to live up to what it could have been if written from a different viewpoint.
The core point of this book is that algorithms must operate in a negative feedback cycle and be corrected when they diverge from predicting the true outcome. There are two main ways in which this can fail to happen:
1. The desired outcome was hard or impossible to measure, so proxies were used instead. In this case, if the algorithm ever becomes important, people will game the hell out of the proxies, which will then cease to actually be good proxies for what they were originally trying to approximate. Since the true result is not measured, nothing in the model will correct for this divergence, and it will continue indefinitely unless there is a policy change (example: U.S. News college rankings).
2. Even if a model is calibrated against the ground truth, it can start to change that truth if it becomes sufficiently widely adopted, becoming a self-fulfilling prophecy and creating positive feedback instead of the required negative feedback. (example: convict recidivism scoring)
I think almost everyone can agree that these failure modes are both plausible and undesirable.
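The second failure mode is easy to demonstrate numerically. The following toy simulation is entirely my own construction with arbitrary coefficients, not something from the book: when the outcome is partly caused by the score, recalibrating the score on observed outcomes drifts it away from the score-free baseline instead of correcting it.

```python
# Toy self-fulfilling prophecy: the observed outcome depends partly on the
# score itself, so "recalibrating" against observations amplifies the score
# (positive feedback) instead of anchoring it to the true baseline risk.
baseline = 0.3            # risk in a world where the score has no influence
score = baseline          # the model starts out perfectly calibrated
history = []
for _ in range(20):
    outcome = baseline + 0.5 * score      # the score feeds back into reality
    score = 0.5 * score + 0.5 * outcome   # update the model on what it sees
    history.append(score)

print(round(score, 3))    # drifts upward, well above the 0.3 baseline
```

With these made-up coefficients the score ratchets monotonically upward toward a fixed point of 0.6, double the baseline, and nothing in the update rule can ever pull it back down.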
Now for the bad parts. The author is openly anti-market: "...teachers and social workers make less money than engineers, chemists, and computer scientists. But they're no less valuable to society." (p66) Surely the author has taken a microeconomics class and is aware that other people believe prices convey information - does she believe that nobody with orthodox economic views is even reading the book? More egregiously, she herself uses bad math when trying to make an emotional point: "...black and Latino males between the ages of 14 and 24 made up only 4.7 percent of the city's population, they accounted for 40.6 percent of the stop-and-frisk checks by police. More than 90 percent of those stopped were innocent." (p25) Well, what fraction of the non-black-and-Latinos stopped were innocent? Was it also 90%? 80%? 99%? This information is critical! There's a separate argument that harassing people who are 90% innocent is a bad policy regardless of race, but muddling this together with an argument that the algorithm is racially biased serves neither case well. On page 209 she uses the terms positive and negative feedback backwards from the established definition. One is left wondering if the book was actually ghostwritten by an uncredited second author, since someone with O'Neil's supposed credentials should not make mistakes like this.
The ratio of new information and detail to warmed-over rhetoric is mediocre. There is little depth in most of the examples presented. At one point the author admits that "the trouble... is an oversupply of low wage labor" (p128) but does not develop this idea and brings little clarity to the question of when models are inherently broken and when they are successfully performing their intended function at increasing efficiency and profit. She says "does the model work against the subject's interest? In short, is it unfair?" (p29) but these are not at all the same thing - it is clearly in every criminal's interest to serve zero time in prison when caught, but this is not the interest of the rest of society. Adversarial relationships are sometimes necessary in a functioning civilization.
My overall assessment is that it's worth a quick read in spite of serious flaws.
This is an excellent book on causal inference for econometrics and related problems (many observations, some unobserved covariates, no randomized experiment or only partial compliance in the experiment). If you want to learn about propensity score matching, regression discontinuity, and a hell of a lot about instrumental variables, this is the book for you.
As a minimum preparation, I think you would need to read and clearly understand parts 1 and 2 of Data Analysis Using Regression and Multilevel/Hierarchical Models before reading this book. It is written in a style which is at times humorous, but don't let this mislead you into thinking that it is for a general audience; this is a highly technical book and at times it's tough going. You'll want a thorough understanding of the linear algebra behind OLS, and this book is often cursory on the derivations.
I found this book to be an excellent introduction and overview of deep neural networks for someone who already understands other types of statistical and machine learning models. It can be a challenging book, but it's clear and well written; the challenge is commensurate with the inherent complexity of the material, and not because the authors capriciously skip steps. In fact, rather the opposite is the case - the authors are quite explicit and put in more intermediate steps in their derivations than most books, let alone papers, which I quite like.
This is not the "here's some code, off you go" book. This is the book that explains what is going on and why, so that you will be able to make principled decisions and not just be an "appliance operator" when you then go read the book with the practical details and code samples. If your preferred style of learning is to understand the concepts before applying them, read this book first; if not, come back and read it afterwards.
This is a nontechnical book on data modeling and mining. It has a small chapter on specific algorithms and tools, but you shouldn't even read that chapter, because it's cursory and obsolete. The real value in this book is in understanding the business value of the things you can do as a data miner. It even has a survival guide in the back with concrete steps for each class of problem you might have, like "data is not suitable for requested purpose".
I recommend reading this book after you've learned and practiced the technical skills, when you're ready to start interfacing directly with the business stakeholders instead of a technical manager who defines and assigns tasks.
The biggest drawback to this book is that the author has used twice as many words as he should have for the number of ideas he conveyed. I'd give it 5 stars if it were half as long....more
Good book, but unless you have a substantial investment in R infrastructure that you can't afford to abandon, you should get the Python version of it instead. The deep neural network community has clearly standardized on Python, not R, and it is simply the better choice for any new project in that area if you get to pick.
Also, do not believe the author's facile claims that this is the only book you need. This book explains almost nothing about how deep learning actually works, and is actually more like a user manual for Keras. Provided you actually want an instruction manual for Keras, it's an excellent book. If you want to understand something about Deep Learning, go read the book by Goodfellow et al. They make a nice set, in either order or alternating between the two.
From the title, one expects this book to be comprehensive and encyclopedic, but I found the opposite to be the case. This is a very mathematical rapid survey of statistics which does not explain how to actually do any of the things that a working engineer or scientist would need to do.
I think the audience of this book is "mathematicians who find books with more equations than text to be comfortable and easy to learn from, who also know nothing about statistics and want a quick survey of the field, and who will use statistics to prove theorems and write papers instead of actually calculating anything." This book is completely unsuitable for engineers; for those I would recommend Baclawski and then Diez. Even Casella & Berger is much more accessible than this book.
This book is THE canonical reference for effect size and power testing; it coined the term “Cohen's d”. References in R functions, such as pwr.2p2n.test, are to this book, so it is a required reference.
That said, I feel it has not aged well as a book to actually read cover-to-cover. Written for the pre-computer era, it offers numerous tables and examples of their use, but little exposition on why certain measures were chosen. For example, the effect size transformation for proportions, phi=2*arcsin(sqrt(p)), is presented entirely without explanation of any kind, verbal or mathematical.
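For what it's worth, the usual rationale (my gloss, not the book's) is that the arcsine transform is variance-stabilizing: the sampling variance of phi is approximately 1/n regardless of p, so equal differences in phi are equally detectable anywhere on the scale. A short sketch of the transform and of Cohen's h, the effect size it induces for two proportions:

```python
import math

def phi(p):
    """Cohen's arcsine transform of a proportion: phi = 2 * arcsin(sqrt(p))."""
    return 2 * math.asin(math.sqrt(p))

def cohens_h(p1, p2):
    """Cohen's h: effect size for the difference between two proportions."""
    return phi(p1) - phi(p2)

# The same 0.05 raw gap counts as a larger effect near the boundary,
# where it is harder-won, than in the middle of the scale.
print(round(cohens_h(0.55, 0.50), 3))
print(round(cohens_h(0.99, 0.94), 3))
```

The second gap comes out roughly three times the first, which the raw difference p1 - p2 treats as identical.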
There are nuggets of wisdom here, but you have to sift for them.
You need at least one thorough book on linear regression, and Fox is arguably the gold standard, as he's the author of the R "car" package, derived from his Companion to Applied Regression book, which contains the canonical Type II and III Anova implementation, the variance inflation factor function, and other useful tools. I found the writing in this book to be pretty clear overall.
There are newer editions of this book, but frankly linear regression has not changed much in decades, and this edition is available for 1/10 the cost of the latest one.
This is a good introduction to time series forecasting for managers who need to talk about forecasting but who will not be personally doing forecasting. It is targeted at an audience with very little statistical or modeling background. It gives a good overview of the space and makes several points that I agree with about how problems should be formulated. It then goes on to introduce specific techniques and R functions to implement them, but given the assumed background of the reader, these techniques are not adequately explained. You could implement them by cutting and pasting the code, but if you got poor or unrealistic results, you would be helpless to understand why or how to address it.
If you actually need to do forecasting yourself, this book is neither adequate in itself nor useful preparation for a more advanced book. In that case you will need to first learn statistics and regression, and then read a more technical time series book like Cowpertwait and Metcalfe.
This book gets generally good reviews but I'm sorry to say I just didn't like it. The authors' ideas of what's simple and what's complicated just don't jibe with mine - I often found that they were using a very complicated derivation to "explain" a simple and intuitive result. I found the density of equations a rather hard slog with no commensurate reward, as other books covered the same material in a way that I found much easier and quicker going.
I would recommend Cowpertwait and Metcalfe instead, or Tsay.
This brief book is designed on the model of a practitioner's guide, with just enough theory to understand how to call and interpret the R functions. Unfortunately, it partially fails in this; the mathematical background it provides is too thin to explain several key concepts, and they are glossed over.
For example, when explaining why to use multilevel models, the book compares them to a strawman of not including the groupings at all; a more meaningful comparison would have been to a linear model which included the grouping variable as a factor. A better explanation is provided for free in one paragraph in section 2.2 of the lmer vignette documentation.
As another example, in describing the lme4 syntax, the book explains how to specify that random slopes are correlated or uncorrelated, but does not explain what that actually means, what it translates to in equations, or how it actually impacts the model fit. The term "shrinkage" is never mentioned.
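Since the term never appears in the book, here is a minimal sketch of what shrinkage means, using made-up numbers and the simplified partial-pooling formula with known variances: each group's estimate is a precision-weighted average of its own mean and the grand mean, so small groups are pulled in more.

```python
# Partial pooling with assumed known variances:
#   est_j = (n_j/sigma2 * ybar_j + grand/tau2) / (n_j/sigma2 + 1/tau2)
# sigma2 = within-group variance, tau2 = between-group variance (made up).
sigma2, tau2 = 4.0, 1.0
grand = 5.0                      # grand mean across all groups

def shrunk(n, ybar):
    """Shrink a group mean toward the grand mean; small n shrinks more."""
    w = n / sigma2               # precision of the group's own mean
    return (w * ybar + grand / tau2) / (w + 1 / tau2)

# Two groups with the SAME observed mean but very different sizes:
print(round(shrunk(100, 7.0), 2))   # large group: stays near its own mean
print(round(shrunk(2, 7.0), 2))     # small group: pulled toward the grand mean
```

This is the behavior that the correlated/uncorrelated random-slope syntax in lme4 ultimately controls, and it is exactly what a fixed-effects factor model cannot do.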
I would recommend the multilevel modeling book by Gelman and Hill instead, or the unfinished online PDF by Doug Bates.