Kindle Notes & Highlights
by
Field Cady
Read between
March 27 - July 31, 2022
“data‐driven development” (DDD).
hypotheses can be tested rigorously and retroactively,
Monitoring systems use autonomous decision algorithms to prioritize incidents for human investigation.
DDD goes so far beyond just giving people access to a common database; it keeps a pulse on all parts of a business operation, it automates large parts of it, and where automation isn't possible it puts all the best analyses at people's fingertips.
Data Science: Analytics work that, for one reason or another, requires a substantial amount of software engineering skills
Data science can largely be divided into two types of work: the kind where the clients are humans and the kind where the clients are machines.
If the client is a human, then typically you are either investigating a business situation (what are our users like?) or you are using data to help make a business decision (is this feature of our product useful enough to justify the cost of its upkeep?).
Determining whether some kind of compelling pattern exists in the available data.
Finding patterns that predict whether a machine will fail or a transaction will go through to completion.
Test which of two versions of a website works better (A/B testing).
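The A/B comparison above can be sketched as a two-proportion z-test using only the standard library; the visitor and conversion counts here are made up for illustration:

```python
# Hypothetical A/B test: compare conversion rates of website versions
# A and B with a two-sided two-proportion z-test (stdlib only).
from math import sqrt
from statistics import NormalDist

conversions_a, visitors_a = 200, 5000   # version A: 4.0% conversion
conversions_b, visitors_b = 260, 5000   # version B: 5.2% conversion

p_a = conversions_a / visitors_a
p_b = conversions_b / visitors_b
# Pooled rate under the null hypothesis that both versions convert equally.
p_pool = (conversions_a + conversions_b) / (visitors_a + visitors_b)
se = sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))
z = (p_b - p_a) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value
```

With these invented counts the p-value comes out below 0.05, so version B's lift would be judged statistically significant at the conventional threshold.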
If the client is a machine then typically the data scientist is devising some logic that will be used by a computer to make real‐time judgements autonomously.
Determining which ad to show a user, or which product to try up‐selling them with on a website
Monitoring an industrial machine to identify leading indicators of failure and sound an alarm when a situation arises
Identifying components on an assembly line that are likely to cause failures downstream so that they can be discarded or re‐processed
Table 2.1 Data science work can largely be divided into producing human‐understandable insights or producing code and models that get run in production.
But maybe you could use a very complicated model instead and just solve it with a computer. In that case you might not need to “understand” the world: if you had enough data, you could just fit a model to it by rote methods. The world is a complex place, and fitting complicated models to large datasets might be more accurate than any idealization simple enough to fit into a human brain.
There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown.
The early “algorithmic models” included the standard classifiers and regressors that the discipline of ML is built on. It has since grown to include deep learning, which is dramatically more complicated than the early models but also (potentially) much more powerful.
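A minimal sketch of the two cultures on synthetic data, assuming scikit-learn is available: a simple stochastic "data model" (linear regression) versus a rote-fit algorithmic model (random forest) on the same nonlinear relationship.

```python
# Two-cultures illustration: fit a linear model and a random forest
# to data generated from a nonlinear (sine) relationship.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=500)  # nonlinear ground truth

linear = LinearRegression().fit(X, y)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# The algorithmic model captures the curve that the simple linear
# data model cannot, without any human-interpretable "understanding".
r2_linear = linear.score(X, y)
r2_forest = forest.score(X, y)
```

Here the forest's fit is far better, illustrating Breiman's point: with enough data, a complicated model fit by rote can outperform an idealization simple enough to reason about.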
And so the hybrid role of data scientist was born. They were mostly drawn from the ranks of computer scientists and software engineers, especially the ones who were originally from math‐heavy backgrounds (I've always been shocked by how many of the great computer scientists were originally physicists).
The reality, though, was that you just needed somebody who was competent at both coding and analytics, and the people who knew both of these largely unrelated disciplines were predominantly polymaths.
Many of these problems fall under the umbrella of “operations research,” a grab‐bag discipline that uses computational math to solve various logistics problems. It is a fascinating subject in its own right, and I would argue that its practitioners deserve
ML guesses an answer based on the patterns it has seen previously. Operations research deduces the correct answer based on first principles.
operations research will often use data science to solve sub‐problems, but they are different disciplines.
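The contrast with ML can be made concrete with a toy operations-research problem, solved from first principles by linear programming rather than learned from historical patterns (assuming SciPy is available; the costs and constraints are invented):

```python
# Toy logistics problem: minimize shipping cost 2x + 3y subject to
# x + y >= 10 (demand), x <= 6 (capacity of the cheap route), x, y >= 0.
from scipy.optimize import linprog

result = linprog(
    c=[2, 3],                  # cost per unit on each route
    A_ub=[[-1, -1]],           # -(x + y) <= -10, i.e. x + y >= 10
    b_ub=[-10],
    bounds=[(0, 6), (0, None)],
)
# The solver deduces the optimum: use the cheap route to capacity
# (x = 6) and cover the rest on the expensive one (y = 4), cost 24.
```

No training data is involved: the answer follows deductively from the stated model, which is exactly the distinction the highlight draws.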
The most basic function of a data scientist is to provide business insights.
A critical distinction to understand is whether an insight is just nice to know, or whether it will be an ingredient in a business decision.
It is better to use data science to test the assumptions underlying business decisions, identify pain points in an operation, and help make crucial decisions.
Many of the most high‐profile applications of data science involve embedding logic into production software.
success metrics that capture business value),
There is tremendous overhead required to communicate the algorithm in the first place, after which you have the daunting task of ensuring that two parallel codebases perform the same even as bugs are fixed and features are added. Usually it is better to figure out a way for the data scientists to write the logical code themselves and then plug it into a larger framework that is maintained by the engineers.
A data engineer is a software engineer who specializes in designing, developing, and maintaining data storage and processing pipelines.
each is a software engineer working in a highly specialized niche, with a large corpus of technologies and best practices that a normal software engineer is unlikely to know.
Personally I generally draw the line at programming; if your analytics work includes a meaningful amount of programming in a coding or scripting language, then you are a data scientist, but if you stick to database queries, you are an analyst.
BI analysts generally lack the ability to create mathematically complicated models or to write their own code (except possibly for database queries). However, they have deep knowledge of the business itself and are experts in communicating results in compelling ways.
Software engineers create products of a scale and complexity far greater than data scientists can typically handle. However, data scientists often create logic that gets plugged into those products or analyze the data that they generate.
Typically they are domain experts who have picked up analytics skills as a way to make them better at their real job.
The biggest advantage of a citizen data scientist is that they have an intimate knowledge of the domain.
It will also be invaluable in the most important part of data science: asking the right questions.
You do not anticipate any need for writing production software
You do not see why you would need the more advanced subjects like deep learning and Bayesian modeling
of sophisticated math, but I've seen many that fall victim to entry‐level coding mistakes.
for making business decisions, measure the simplest thing that adequately captures the real‐world phenomenon you're trying to study.
caution against having lots of edge cases in the definition of a metric, but really what I mean is that you don't want a lot of judgment calls baked in.
Sometimes metrics have a clear and simple business meaning, but a lot of complexity is required to faithfully measure that meaning.
advocate simple metrics because they are easy to dissect and reason about. But in a few rare cases, you want metrics that are deliberately somewhat opaque, so as to discourage counter‐productive nit‐picking.
Bearing in mind that there is no such thing as a perfect metric, you should ask how big your error bars actually are before you worry about making them smaller. If the metric is adequate, move on for now and check back on it periodically.
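One common way to ask "how big are my error bars, actually?" is a bootstrap confidence interval, sketched here with the standard library on made-up metric values:

```python
# Bootstrap a 95% confidence interval for a mean metric, to judge
# whether its uncertainty is already small enough for the decision.
import random

random.seed(0)
data = [random.gauss(100, 15) for _ in range(500)]  # made-up metric values
observed_mean = sum(data) / len(data)

n_resamples = 2000
boot_means = []
for _ in range(n_resamples):
    # Resample the data with replacement and record the mean.
    sample = random.choices(data, k=len(data))
    boot_means.append(sum(sample) / len(sample))
boot_means.sort()
ci_low = boot_means[int(0.025 * n_resamples)]
ci_high = boot_means[int(0.975 * n_resamples)]
```

If the interval `[ci_low, ci_high]` is already narrow relative to the decision at hand, the metric is adequate and you can move on, as the highlight advises.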
Only Z should be manipulated. As much as possible everything else about the datapoints should either be the same or, at least, have the same variation between your test and control group.
In robust statistics it is common to summarize a distribution with five numbers – the min, the 25th percentile, the median, the 75th percentile, and the max – and to display them in what's called a “box and whisker plot”
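The five-number summary behind a box-and-whisker plot can be computed directly with NumPy; the data here are synthetic:

```python
# Five-number summary of a synthetic sample, as drawn in a box plot.
import numpy as np

data = np.random.default_rng(1).normal(50, 10, size=1000)
five = {
    "min": data.min(),
    "q1": np.percentile(data, 25),      # bottom of the box
    "median": np.median(data),          # line inside the box
    "q3": np.percentile(data, 75),      # top of the box
    "max": data.max(),
}
```

Plotting libraries such as matplotlib render exactly these five numbers (plus outlier rules) when you call their box-plot functions.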
In theory the MI between X and Y tells you how much knowing the value of X allows you to make predictions about Y, relative to not knowing
The biggest practical advantage of MI is that it applies to categorical data as well as numerical data, whereas correlation only applies to numerical data (or at least, in the case of ordinal correlations, data that can be sorted from “smallest” to “largest” even if it's not strictly numerical).
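This advantage is easy to see with scikit-learn's `mutual_info_score` on purely categorical data, where correlation is undefined; the labels below are invented:

```python
# Mutual information between categorical variables, where Pearson
# correlation does not apply.
from sklearn.metrics import mutual_info_score

color   = ["red", "red", "blue", "blue", "red", "blue"]
bought  = ["yes", "yes", "no",  "no",  "yes", "no"]   # determined by color
shuffled = ["yes", "no", "yes", "no", "yes", "no"]    # unrelated to color

mi_dependent = mutual_info_score(color, bought)    # high: knowing color pins down "bought"
mi_unrelated = mutual_info_score(color, shuffled)  # low: little is learned
```

With perfect dependence between two balanced binary variables, the MI (in nats) equals ln 2 ≈ 0.693, the full entropy of the predicted variable.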
All of these fall under the umbrella of “parametric curve fitting.”
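Parametric curve fitting means positing a functional form and fitting its parameters, as in this SciPy sketch with an invented exponential-decay model and synthetic noisy data:

```python
# Parametric curve fitting: posit y = a * exp(-k * t) and estimate
# the parameters a and k from noisy observations.
import numpy as np
from scipy.optimize import curve_fit

def model(t, a, k):
    return a * np.exp(-k * t)

t = np.linspace(0, 5, 50)
true_a, true_k = 2.0, 0.8
y = model(t, true_a, true_k) + np.random.default_rng(2).normal(0, 0.02, t.size)

params, _covariance = curve_fit(model, t, y, p0=(1.0, 1.0))
a_hat, k_hat = params  # estimates should land near 2.0 and 0.8
```

The same recipe covers lines, polynomials, logistic curves, and any other family you can write as an explicit function of parameters.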