Many texts are excellent sources of knowledge about individual statistical tools, but the art of data analysis is about choosing and using multiple tools. Instead of presenting isolated techniques, this text emphasizes problem-solving strategies that address the many issues arising when developing multivariable models with real data rather than standard textbook examples. It covers imputation methods for handling missing data, methods for modeling nonlinear relationships and for making the estimation of transformations a formal part of the modeling process, strategies for "too many variables to analyze and not enough observations," and powerful model validation techniques based on the bootstrap. The text deals realistically with model uncertainty and its effects on inference, aiming at "safe data mining".
An excellent practical introduction to all things regression-related. Harrell’s book will not tell you about the Gauss-Markov assumptions and the asymptotic properties of estimators. Instead, it is an opinionated guide to the realities of statistical modelling.
The introductory chapter addresses meta-questions about why models matter and what to consider before modelling. Chapter 2 gives a good overview of how to interpret and evaluate regression models, and compares them with other formulations such as decision trees and machine-learning methods. Chapter 4 provides several strategies for multivariable modelling; Chapter 5 suggests some very interesting techniques for validating models.
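The centrepiece of those validation techniques is the bootstrap mentioned in the blurb above: refit the whole modelling procedure on resamples and subtract the estimated optimism from the apparent performance. Below is a minimal sketch of that idea in Python, under assumptions of my own (an ordinary least-squares model, scikit-learn, and in-sample R² as the performance index); it illustrates the general recipe, not the book's implementation.

```python
# A rough sketch of optimism-corrected bootstrap validation, assuming
# (my choices, not the book's) an OLS model and in-sample R^2 as the index.
import numpy as np
from sklearn.linear_model import LinearRegression

def bootstrap_validate(X, y, n_boot=200, seed=0):
    """Return an optimism-corrected estimate of R^2."""
    rng = np.random.default_rng(seed)
    apparent = LinearRegression().fit(X, y).score(X, y)  # fit and score on the full data
    n = len(y)
    optimism = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)                      # bootstrap resample with replacement
        model = LinearRegression().fit(X[idx], y[idx])
        perf_boot = model.score(X[idx], y[idx])          # performance on the resample
        perf_orig = model.score(X, y)                    # same model scored on the original data
        optimism.append(perf_boot - perf_orig)
    return apparent - np.mean(optimism)                  # apparent performance minus average optimism

# Toy usage: the corrected R^2 comes out below the apparent in-sample R^2.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 10))
y = X[:, 0] + rng.normal(size=60)
print(LinearRegression().fit(X, y).score(X, y), bootstrap_validate(X, y))
```

The point of the correction is that the whole fitting procedure is repeated inside each resample, so the estimate reflects how much the model capitalises on noise.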
I derived a lot of value from the book. It contains a large toolbox of statistical techniques and it’s full of rules of thumb that academic statisticians are often loath to provide: for example, you need at least 15 data points per predictor.
I skimmed over some parts of the book as they aren’t relevant to me at the moment: survival analysis, logistic regression, longitudinal response. But I’m comfortable in the knowledge that I can flick back to those chapters when and if I need them.
This is a selection of advice that resonates with me:
- Prefer regression over classification. It allows you to separate the decision function from the actual model.
- Randomization checks:
  - Add noise to the y variable before doing feature selection to avoid overfitting.
  - Apply your pipeline to the same dataset with a shuffled y (no signal). If you still get a fit, you have overfit (see the sketch after this list).
- Be careful about “spending” degrees of freedom, e.g. by looking at scatter plots before regressing.
- Avoid stepwise feature selection.
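The shuffled-y check is easy to automate. Here is a minimal sketch in Python, under my own assumptions (scikit-learn, a pipeline of univariate feature selection followed by OLS, and in-sample R² as the fit statistic); any concrete pipeline would do in its place.

```python
# A minimal sketch of the shuffled-y overfitting check, assuming (my choices)
# a scikit-learn pipeline of univariate feature selection followed by OLS.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n, p = 100, 200                      # few observations, many candidate predictors
X = rng.normal(size=(n, p))
y = X[:, 0] + rng.normal(size=n)     # only the first predictor carries any signal

pipeline = make_pipeline(SelectKBest(f_regression, k=10), LinearRegression())

r2_real = pipeline.fit(X, y).score(X, y)      # fit and score on the real response
y_shuffled = rng.permutation(y)               # destroy the X-y relationship
r2_shuffled = pipeline.fit(X, y_shuffled).score(X, y_shuffled)

print(f"in-sample R^2, real y:     {r2_real:.2f}")
print(f"in-sample R^2, shuffled y: {r2_shuffled:.2f}")
# If the pipeline still "finds" a decent fit on the shuffled response, the
# feature selection is capitalising on noise, and the fit on the real data
# is suspect.
```

With 200 candidate predictors and only 100 observations, a run like this typically still reports a noticeably positive in-sample R² on the shuffled response, which is exactly the warning sign the check is designed to surface.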