Do you want to understand your models and mitigate risks associated with poor predictions using machine learning (ML) interpretation? Interpretable Machine Learning with Python can help you work effectively with ML models.
The first section of the book is a beginner's guide to interpretability, covering its relevance in business and exploring its key aspects and challenges. You'll focus on how white-box models work, compare them to black-box and glass-box models, and examine their trade-offs. The second section will get you up to speed with a vast array of interpretation methods, also known as Explainable AI (XAI) methods, and how to apply them to different use cases, be it for classification or regression, with tabular, time-series, image, or text data. In addition to step-by-step code, the book also helps the reader interpret model outcomes using worked examples. In the third section, you'll get hands-on with tuning models and training data for interpretability by reducing complexity, mitigating bias, placing guardrails, and enhancing reliability. The methods you'll explore here range from state-of-the-art feature selection and dataset debiasing methods to monotonic constraints and adversarial retraining.
By the end of this book, you'll be able to understand ML models better and enhance them through interpretability tuning.
Serg Masís has been at the confluence of the internet, application development, and analytics for the last two decades. Currently, he's a Climate and Agronomic Data Scientist at Syngenta, a leading agribusiness company with a mission to improve global food security. Before that role, he co-founded a search engine startup that combined the power of cloud computing and machine learning with principles of decision-making science to efficiently expose users to new places and events. Whether it pertains to leisure activities, plant diseases, or customer lifetime value, Serg is passionate about providing the often-missing link between data and decision-making — and machine learning interpretation helps bridge this gap more robustly. His book Interpretable Machine Learning with Python was published by UK-based publisher Packt in April 2021.
A wonderful cutting-edge book that explains the latest technologies and algorithms. Even though the neural network section is very advanced, it is well written and accessible.
Fairness: are predictions made without bias? Accountability: can we reliably trace predictions back to something or someone? Transparency: can we explain how and why predictions are made?

Interpretability is the extent to which humans, including non-subject-matter experts, can understand the cause and effect, and the inputs and outputs, of a machine learning model. To say your model has a high level of interpretability means you can describe its inference in a human-interpretable way. Explainability encompasses everything interpretability is. It goes deeper on the transparency requirement because it demands human-friendly explanations for the model's inner workings and the model training process, not just model inference:
- Model transparency
- Design transparency
- Algorithmic transparency

Models are opaque due to:
- Not being statistically grounded
- Uncertainty and non-reproducibility
- Overfitting and the curse of dimensionality
White-Box Algorithms

Linear Regression
Normality is the property that each feature is normally distributed. Non-normality can be corrected with a non-linear transformation; if a feature isn't normally distributed, its coefficient's confidence intervals will be invalid. Independence: observations are independent of each other, like different and unrelated events. Lack of multicollinearity is desirable; otherwise you'd have inaccurate coefficients. Homoscedasticity: when the residuals are more or less equal across the regression line. If you're going to use linear regression heavily, you need to test these assumptions before fitting the data. The intercept is not a feature; its meaning is: if all features were at 0, what would the prediction be? In practice this doesn't happen unless your features all have a plausible reason to be 0.
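A minimal sketch of these ideas with statsmodels, using a made-up housing dataset (the `sqft`, `age`, and `price` columns are hypothetical): the summary shows coefficients with their confidence intervals, and a VIF check flags multicollinearity.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic data purely for illustration
rng = np.random.default_rng(0)
df = pd.DataFrame({"sqft": rng.normal(1500, 300, 200),
                   "age": rng.integers(1, 50, 200)})
df["price"] = 50_000 + 120 * df["sqft"] - 800 * df["age"] + rng.normal(0, 10_000, 200)

X = sm.add_constant(df[["sqft", "age"]])  # the "const" column is the intercept
model = sm.OLS(df["price"], X).fit()

# Coefficients plus 95% confidence intervals (invalid if assumptions fail)
print(model.summary())

# Multicollinearity check: a VIF above ~10 is a common warning threshold
for i, col in enumerate(X.columns):
    if col != "const":
        print(col, variance_inflation_factor(X.values, i))
```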
Ridge Regression
It is part of the penalized or regularized regression family. It is called a sparse linear model because, thanks to the regularization, it cuts out some of the noise by making irrelevant features less relevant.
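A minimal sketch contrasting ordinary least squares with ridge's coefficient shrinkage on synthetic data (only 3 of 10 features are informative here by construction):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=10.0, random_state=42)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)  # alpha controls regularization strength

# Irrelevant features' coefficients shrink toward zero under the penalty
for i, (b_ols, b_ridge) in enumerate(zip(ols.coef_, ridge.coef_)):
    print(f"feature {i}: OLS={b_ols:+.2f}  ridge={b_ridge:+.2f}")
```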
Polynomial Regression
It is a special case of linear or logistic regression in which every feature is expanded to have higher-degree terms and interactions between all the features. It is still a linear regression in every way, except that it has extra features: the higher-degree terms and the interactions.
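A minimal sketch of that expansion with scikit-learn (tiny made-up data): the printed feature names show the higher-degree terms and interactions, and the model remains linear in the expanded space.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 5.0], [4.0, 4.0]])
y = np.array([3.0, 9.0, 22.0, 20.0])

poly = PolynomialFeatures(degree=2, include_bias=False)
print(poly.fit(X).get_feature_names_out(["x1", "x2"]))
# -> ['x1' 'x2' 'x1^2' 'x1 x2' 'x2^2']

# Still plain linear regression, just on the expanded features
model = make_pipeline(poly, LinearRegression()).fit(X, y)
print(model.predict([[5.0, 6.0]]))
```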
Logistic Regression
It is expressed by a logistic function that involves an exponential of the linear combination of the coefficients and the features. The presence of the exponential explains why the coefficients extracted from the model are log-odds: to isolate the coefficients, you must apply a logarithm to both sides of the equation. Interpreting each coefficient is the same as with linear regression, except that for each unit increase in a feature, you increase the odds of getting the positive case by a factor expressed by the exponential of the coefficient, all things being equal, a.k.a. ceteris paribus. There is no consensus on how to get feature importance yet.
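A minimal sketch of the log-odds interpretation on scikit-learn's breast cancer dataset: exponentiating a coefficient gives the odds ratio. Note that, because features are standardized here, each ratio applies per one-standard-deviation increase rather than per raw unit.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X, y)

coefs = clf[-1].coef_[0]
# exp(coef): the factor by which the odds of the positive class change
# for a one-standard-deviation increase in that feature, ceteris paribus
for name, b in sorted(zip(X.columns, coefs), key=lambda t: abs(t[1]), reverse=True)[:5]:
    print(f"{name}: log-odds={b:+.2f}  odds ratio={np.exp(b):.2f}")
```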
Decision Trees
They have been used for the longest time, even before they were turned into algorithms.
[Neural networks do not have a built-in way to capture feature importance]
XGBoost
It implements gradient-boosted decision trees, an ensemble method.
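A minimal sketch of fitting XGBoost and reading its built-in feature importances (assumes a reasonably recent `xgboost` package; parameter placement can vary across versions):

```python
from sklearn.datasets import load_breast_cancer
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = XGBClassifier(n_estimators=200, max_depth=3, eval_metric="logloss")
model.fit(X, y)

# "gain" importance: the average loss reduction from splits on each feature
booster = model.get_booster()
gains = booster.get_score(importance_type="gain")
for feat, gain in sorted(gains.items(), key=lambda t: t[1], reverse=True)[:5]:
    print(feat, round(gain, 2))
```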
SHapley Additive exPlanations (SHAP)
It is a collection of methods, or explainers, that approximate Shapley values; a feature's value is the average of its contributions over many simulations. You have a full coalition with all your features, and you have all the possible subsets of the features minus the feature you are evaluating. The contribution of a feature, a.k.a. its pay-off, is a reduction in predictive error for regression or an increase in probability for classification. The computation time grows exponentially as features increase, which is why we should sample some of the possible subsets of features using Monte Carlo sampling, which randomly samples from a probability distribution.
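A minimal sketch using the `shap` library's TreeExplainer on an XGBoost model (assumes `shap` and `xgboost` are installed): each row gets one additive contribution per feature, viewable globally or for a single prediction.

```python
import shap
from sklearn.datasets import load_breast_cancer
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = XGBClassifier(n_estimators=100, max_depth=3).fit(X, y)

# TreeExplainer approximates Shapley values efficiently for tree ensembles
explainer = shap.TreeExplainer(model)
shap_values = explainer(X)  # one additive contribution per feature per row

# Global view: mean |SHAP| per feature; local view: a single prediction
shap.plots.bar(shap_values)
shap.plots.waterfall(shap_values[0])
```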
Support Vector Classifiers (SVC)
SVM is a family of model classes that operate in high-dimensional space to find an optimal hyperplane that separates the classes with a maximum margin between them. Support vectors are the points closest to the decision boundary that would change it if they were removed. They tend to work effectively and efficiently when there are many features compared to the observations, but SVM is not as scalable to larger datasets, and its hyperparameters are hard to tune.
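A minimal sketch exposing the support vectors that define an SVC's decision boundary, on synthetic two-class data:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=7)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# The points closest to the boundary; removing them would change it
print("support vectors per class:", clf.n_support_)
print(clf.support_vectors_[:3])
```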
Global Surrogates
A white-box model that you train on the black-box model's predictions.
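A minimal sketch: a shallow decision tree (white-box) is fit to a random forest's (black-box) predictions rather than the true labels, and its fidelity measures how faithfully it mimics the black box.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
black_box = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Train the surrogate on the black box's predictions, not on y
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X, black_box.predict(X))

# Fidelity: how often the surrogate agrees with the black box
fidelity = (surrogate.predict(X) == black_box.predict(X)).mean()
print(f"fidelity: {fidelity:.2%}")
print(export_text(surrogate, feature_names=list(X.columns)))
```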
Permutation Feature Importance
It is a model-agnostic method that can be used with unseen data. It tells you what the model thinks is important according to what was learned from the training data, but it cannot tell you what is most important once you introduce unseen data. Its main disadvantage is that it won't pick up on the impact of features that are correlated with each other; that is, multicollinearity will trump feature importance.
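A minimal sketch with scikit-learn's model-agnostic implementation: each feature is shuffled on held-out data and the resulting drop in score is its importance.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Shuffle each feature n_repeats times and measure the score drop
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
for i in result.importances_mean.argsort()[::-1][:5]:
    print(f"{X.columns[i]}: {result.importances_mean[i]:.4f} "
          f"+/- {result.importances_std[i]:.4f}")
```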
Partial Dependence Plot (PDP)
It conveys the marginal effect of a feature on the prediction across all possible values of that feature. It's a global model interpretation method that can visually demonstrate the impact of a feature and the nature of its relationship with the target. Its main disadvantages are that it can only display up to two features at a time, and it assumes independence of features when they might be correlated with each other.
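A minimal sketch with scikit-learn on the diabetes dataset, showing one single-feature plot and one two-feature interaction plot (the two-feature limit noted above):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay

X, y = load_diabetes(return_X_y=True, as_frame=True)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# "bmi" alone, then the ("bmi", "bp") interaction as a 2D plot
PartialDependenceDisplay.from_estimator(model, X, ["bmi", ("bmi", "bp")])
plt.show()
```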
Feature Selection
The advantages of selecting a smaller subset of features: easier-to-understand, simpler models; shorter training times; and improved generalization through reduced overfitting, because variables with little predictive value are often just noise, and an ML model that learns from this noise will overfit.
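A minimal sketch of one possible approach, shrinking the feature set with an L1-penalized (Lasso-style) selector; other selection methods would work equally well here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
selector = make_pipeline(
    StandardScaler(),
    # L1 penalty drives irrelevant features' coefficients to exactly zero
    SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear", C=0.1)),
)
X_reduced = selector.fit_transform(X, y)
kept = X.columns[selector[-1].get_support()]
print(f"kept {len(kept)} of {X.shape[1]} features:", list(kept))
```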
Bias Mitigation
Feature engineering; balancing or resampling; re-labelling or massaging; reweighing; disparate impact remover; prejudice remover regularizer; exponentiated gradient reduction.
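A minimal sketch of one of these, reweighing, computed by hand: each (group, label) combination gets a weight so the protected attribute and the label look statistically independent during training. The `group` and `label` columns and the tiny dataset are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B", "B", "B"],
    "label": [1, 1, 0, 1, 0, 0, 0, 0],
})

n = len(df)
p_group = df["group"].value_counts(normalize=True)
p_label = df["label"].value_counts(normalize=True)
p_joint = df.groupby(["group", "label"]).size() / n

# weight(g, y) = P(g) * P(y) / P(g, y); pass as sample_weight when fitting
df["weight"] = [
    p_group[g] * p_label[y] / p_joint[(g, y)]
    for g, y in zip(df["group"], df["label"])
]
print(df)
```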
Tuning Hyperparameters
- Regularization
- Iterations
- Learning rate
- Early stopping
- Class imbalance
- Sample weight

A minimal sketch mapping these knobs onto XGBoost hyperparameters follows below.
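This sketch assumes a recent `xgboost` release; exact parameter placement (for example, where `early_stopping_rounds` goes) varies across versions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

model = XGBClassifier(
    n_estimators=500,          # iterations
    learning_rate=0.05,        # learning rate
    reg_lambda=1.0,            # L2 regularization
    scale_pos_weight=1.0,      # class imbalance
    early_stopping_rounds=20,  # early stopping on the validation set
    eval_metric="logloss",
)
model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)])  # sample_weight= also accepted
print("best iteration:", model.best_iteration)
```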
A very comprehensive book on the current state-of-the-art methods and open source libraries for interpretable ML. The book should equip users with introductory knowledge and the confidence to explore interpretable ML, which is a pretty frustrating part of real-world ML applications.
This book helps me take my data analysis to the next level by accounting for the intricacies of interpretable AI. It's going to be really helpful for the projects I am working on: explaining AI output to non-technical people so that it can be implemented for broader audiences in an understandable fashion.