Data Science: The Executive Summary - A Technical Book for Non-Technical Professionals
8%
“data‐driven development” (DDD).
8%
hypotheses can be tested rigorously and retroactively,
8%
Monitoring systems use autonomous decision algorithms to prioritize incidents for human investigation.
8%
DDD goes so far beyond just giving people access to a common database; it keeps a pulse on all parts of a business operation, it automates large parts of it, and where automation isn't possible it puts all the best analyses at people's fingertips.
9%
Data Science: Analytics work that, for one reason or another, requires a substantial amount of software engineering skills
9%
Data science can largely be divided into two types of work: the kind where the clients are humans and the kind where the clients are machines.
9%
If the client is a human, then typically you are either investigating a business situation (what are our users like?) or you are using data to help make a business decision (is this feature of our product useful enough to justify the cost of its upkeep?).
9%
Determining whether some kind of compelling pattern exists in the available data.
9%
Finding patterns that predict whether a machine will fail or a transaction will go through to completion.
9%
Test which of two versions of a website works better.
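One common way to compare two versions of a page is a two-proportion z-test on conversion counts. The sketch below is only an illustration under that assumption: the function name, the choice of a pooled z-test, and the example counts are not from the book.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates of versions A and B.

    Returns (z, p_value), where p_value is two-sided and based on the
    normal approximation. Assumes both samples are reasonably large.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability of a standard normal: erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: 100 of 1000 visitors convert on A, 150 of 1000 on B.
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
```

A small p-value suggests the difference is unlikely to be chance alone; whether the difference matters to the business is a separate question.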
9%
If the client is a machine then typically the data scientist is devising some logic that will be used by a computer to make real‐time judgements autonomously.
9%
Determining which ad to show a user, or which product to try up‐selling them with on a website
Monitoring an industrial machine to identify leading indicators of failure and sound an alarm when a situation arises
Identifying components on an assembly line that are likely to cause failures downstream so that they can be discarded or re‐processed
10%
Table 2.1 Data science work can largely be divided into producing human‐understandable insights or producing code and models that get run in production.
10%
But maybe you could use a very complicated model instead and just solve it with a computer. In that case you might not need to “understand” the world: if you had enough data, you could just fit a model to it by rote methods. The world is a complex place, and fitting complicated models to large datasets might be more accurate than any idealization simple enough to fit into a human brain.
11%
There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown.
11%
The early “algorithmic models” included the standard classifiers and regressors that the discipline of ML is built on. It has since grown to include deep learning, which is dramatically more complicated than the early models but also (potentially) much more powerful.
11%
And so the hybrid role of data scientist was born. They were mostly drawn from the ranks of computer scientists and software engineers, especially the ones who were originally from math‐heavy backgrounds (I've always been shocked by how many of the great computer scientists were originally physicists).
11%
The reality though was that you just needed somebody who was competent at both coding and analytics, and the people who knew both of these largely unrelated disciplines were predominantly polymaths.
15%
Many of these problems fall under the umbrella of “operations research,” a grab‐bag discipline that uses computational math to solve various logistics problems. It is a fascinating subject in its own right, and I would argue that its practitioners deserve
15%
ML guesses an answer based on the patterns it has seen previously. Operations research deduces the correct answer based on first principles.
15%
operations research will often use data science to solve sub‐problems, but they are different disciplines.
15%
The most basic function of a data scientist is to provide business insights.
15%
A critical distinction to understand is whether an insight is just nice to know, or whether it will be an ingredient in a business decision.
15%
It is better to use data science to test the assumptions underlying business decisions, identify pain points in an operation, and help make crucial decisions.
15%
Many of the most high‐profile applications of data science involve embedding logic into production software.
15%
success metrics that capture business value),
15%
There is tremendous overhead required to communicate the algorithm in the first place, after which you have the daunting task of ensuring that two parallel codebases perform the same even as bugs are fixed and features are added. Usually it is better to figure out a way for the data scientists to write the logical code themselves and then plug it into a larger framework that is maintained by the engineers.
16%
A data engineer is a software engineer who specializes in designing, developing, and maintaining data storage and processing pipelines.
16%
each is a software engineer working in a highly specialized niche, with a large corpus of technologies and best practices that a normal software engineer is unlikely to know.
17%
Personally I generally draw the line at programming; if your analytics work includes a meaningful amount of programming in a coding or scripting language, then you are a data scientist, but if you stick to database queries, you are an analyst.
17%
BI analysts generally lack the ability to create mathematically complicated models or to write their own code (except possibly for database queries). However, they have deep knowledge of the business itself and are experts in communicating results in compelling ways.
18%
Software engineers create products of a scale and complexity far greater than data scientists can typically handle. However, data scientists often create logic that gets plugged into those products or analyze the data that they generate.
19%
Typically they are domain experts who have picked up analytics skills as a way to make them better at their real job.
19%
The biggest advantage of a citizen data scientist is that they have an intimate knowledge of the domain.
19%
It will also be invaluable in the most important part of data science: asking the right questions.
19%
You do not anticipate any need for writing production software
19%
You do not see why you would need the more advanced subjects like deep learning and Bayesian modeling
20%
of sophisticated math, but I've seen many that fall victim to entry‐level coding mistakes.
33%
for making business decisions measure the simplest thing that adequately captures the real‐world phenomenon you're trying to study.
33%
caution against having lots of edge cases in the definition of a metric, but really what I mean is that you don't want a lot of judgment calls baked in.
33%
Sometimes metrics have a clear and simple business meaning, but a lot of complexity is required to faithfully measure that meaning.
34%
advocate simple metrics because they are easy to dissect and reason about. But in a few rare cases, you want metrics that are deliberately somewhat opaque, so as to discourage counter‐productive nit‐picking.
34%
Bearing in mind that there is no such thing as a perfect metric, you should ask how big your error bars actually are before you worry about making them smaller. If the metric is adequate, move on for now and check back on it periodically.
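One way to get a rough sense of how big the error bars on a metric are is a percentile bootstrap. This is a sketch under assumptions that are not in the book: the metric is taken to be a simple mean, and the function name, resample count, and example values are made up for illustration.

```python
import random
import statistics

def bootstrap_mean_interval(values, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile-bootstrap interval for the mean of `values`.

    Resamples the data with replacement, recomputes the mean each time,
    and reads the interval off the sorted resampled means.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    tail = (1 - confidence) / 2
    lower = means[int(tail * n_resamples)]
    upper = means[int((1 - tail) * n_resamples) - 1]
    return lower, upper

# Hypothetical daily values of some metric.
daily_values = [10, 12, 9, 11, 13, 10, 12, 11]
lower, upper = bootstrap_mean_interval(daily_values)
```

If the interval is already narrow compared with the size of the decisions the metric feeds, that is a hint the metric may be adequate for now.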
35%
Only Z should be manipulated. As much as possible everything else about the datapoints should either be the same or, at least, have the same variation between your test and control group.
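A common way to keep everything other than Z comparable between test and control is to assign units to the two groups at random. The sketch below assumes that approach; the function name and the user IDs are hypothetical, and the highlighted passage itself does not prescribe a method.

```python
import random

def assign_groups(unit_ids, seed=0):
    """Randomly split units into (test, control) groups of near-equal size.

    Random assignment means that, on average, every other property of the
    units has the same distribution in both groups, so only the
    manipulated variable differs systematically.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(unit_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical user IDs.
test_group, control_group = assign_groups(range(1, 11))
```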
38%
In robust statistics it is common to summarize a distribution with five numbers – the min, max, 25th, and 75th percentiles, and the median – and to display them in what's called a “box and whisker plot”
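The five numbers can be computed with the Python standard library, as in this minimal sketch. The function name and sample data are illustrative, and percentile conventions differ between tools, so other software may report slightly different quartiles.

```python
import statistics

def five_number_summary(values):
    """Return (min, 25th percentile, median, 75th percentile, max).

    Needs at least two data points. Uses the "inclusive" quantile
    method, which is one of several common conventions.
    """
    data = sorted(values)
    # With n=4, quantiles() returns the three quartile cut points.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]

clean = five_number_summary([1, 2, 3, 4, 5, 6, 7, 8, 9])
# Replacing the largest value with an extreme outlier changes only the max;
# the quartiles and the median stay where they were.
with_outlier = five_number_summary([1, 2, 3, 4, 5, 6, 7, 8, 900])
```

These five values are what a box and whisker plot draws: the box spans the 25th to 75th percentiles, with the median marked inside it.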
41%
In theory the MI between X and Y tells you how much knowing the value of X allows you to make predictions about Y, relative to not knowing
41%
The biggest practical advantage of MI is that it applies to categorical data as well as numerical data, whereas correlation only applies to numerical data (or at least, in the case of ordinal correlations, data that can be sorted from “smallest” to “largest” even if it's not strictly numerical).
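To illustrate that MI works directly on categorical labels, here is a minimal plug-in estimate using the standard definition I(X;Y) = Σ p(x,y) log2[p(x,y) / (p(x) p(y))]. The function name and the toy labels are assumptions for illustration, and this simple estimator can be biased on small samples.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Estimate mutual information (in bits) between two equal-length
    sequences of discrete labels, using observed frequencies."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p(x,y) / (p(x) * p(y)) simplifies to count * n / (count_x * count_y).
        mi += p_xy * math.log2(count * n / (count_x[x] * count_y[y]))
    return mi

colors = ["red", "red", "blue", "blue"]
# Y is fully determined by X: knowing the color tells you the answer (1 bit).
dependent = mutual_information(colors, ["yes", "yes", "no", "no"])
# Y is unrelated to X: knowing the color tells you nothing (0 bits).
independent = mutual_information(colors, ["yes", "no", "yes", "no"])
```

No ordering or numeric encoding of the labels is needed, which is the practical advantage the passage describes.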
41%
All of these fall under the umbrella of “parametric curve fitting.”
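The highlight does not show which methods "all of these" refers to, so as an assumed simplest case of parametric curve fitting, the sketch below fits a straight line y = slope * x + intercept by ordinary least squares. The function name and data are illustrative.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept.

    The "parameters" of this curve are slope and intercept; fitting means
    choosing the values that minimize the squared vertical errors.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # requires at least two distinct x values
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Points that lie exactly on y = 2x + 1.
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```

More elaborate curves follow the same pattern: pick a formula with a few adjustable parameters, then tune those parameters to match the data.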