Kindle Notes & Highlights
At a high level, data science is a set of fundamental principles that guide the extraction of knowledge from data. Data mining is the extraction of knowledge from data, via technologies that incorporate these principles. As a term, “data science” often is applied more broadly than the traditional use of “data mining,” but data mining techniques provide some of the clearest illustrations of the principles of data science.
In this book, we will view the ultimate goal of data science as improving decision making, as this generally is of direct interest to business.
Data-driven decision-making (DDD) refers to the practice of basing decisions on the analysis of data, rather than purely on intuition.
The sorts of decisions we will be interested in throughout this book mainly fall into two types: (1) decisions for which “discoveries” need to be made within data, and (2) decisions that repeat, especially at massive scale, and so decision-making can benefit from even small increases in decision-making accuracy based on data analysis.
For the time being, it is sufficient to understand that a predictive model abstracts away most of the complexity of the world, focusing in on a particular set of indicators that correlate in some way with a quantity of interest.
This highlights the often overlooked fact that, increasingly, business decisions are being made automatically by computer systems. Different industries have adopted automatic decision-making at different rates.
However, much more often the well-known big data technologies are used for data processing in support of the data mining techniques and other data science activities, as represented in Figure 1-1.
Data, and the capability to extract useful knowledge from data, should be regarded as key strategic assets.
The best data science team can yield little value without the appropriate data; the right data often cannot substantially improve decisions without suitable data science talent.
You may not have heard of little Signet Bank, but if you’re reading this book you’ve probably heard of the spin-off: Capital One.
This has an important implication: banks with bigger data assets may have an important strategic advantage over their smaller competitors.
If these employees do not have a fundamental grounding in the principles of data-analytic thinking, they will not really understand what is happening in the business. This lack of understanding is much more damaging in data science projects than in other technical projects, because the data science is supporting improved decision-making.
This book devotes a good deal of attention to the extraction of useful (nontrivial, hopefully actionable) patterns or models from large bodies of data (Fayyad, Piatetsky-Shapiro, & Smyth, 1996), and to the fundamental data science principles underlying such data mining.
Extracting useful knowledge from data to solve business problems can be treated systematically by following a process with reasonably well-defined stages.
Fundamental concept: If you look too hard at a set of data, you will find something — but it might not generalize beyond the data you’re looking at. This is referred to as overfitting a dataset.
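A minimal sketch of how overfitting shows up in practice, not taken from the book: assuming scikit-learn and a synthetic dataset, an unconstrained decision tree scores nearly perfectly on the data it was fit to but noticeably worse on held-out data.

```python
# Illustrative only: the dataset and model are assumptions, not examples from the book.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier(random_state=0)  # unconstrained depth: free to memorize
model.fit(X_train, y_train)

print("train accuracy:", model.score(X_train, y_train))  # typically ~1.0
print("test accuracy: ", model.score(X_test, y_test))    # noticeably lower: overfitting
```

The gap between the two scores is the practical symptom of finding patterns that do not generalize beyond the data being looked at.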
Fundamental concept: Formulating data mining solutions and evaluating the results involves thinking carefully about the context in which they will be used.
In collaboration with business stakeholders, data scientists decompose a business problem into subtasks. The solutions to the subtasks can then be composed to solve the overall problem. Some of these subtasks are unique to the particular business problem, but others are common data mining tasks.
Recognizing familiar problems and their solutions avoids wasting time and resources reinventing the wheel. It also allows people to focus attention on more interesting parts of the process that require human involvement — parts that have not been automated, so human creativity and intelligence must come into play.
Classification and class probability estimation attempt to predict, for each individual in a population, which of a (small) set of classes this individual belongs to.
Regression (“value estimation”) attempts to estimate or predict, for each individual, the numerical value of some variable for that individual.
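A brief sketch contrasting the two tasks above, assuming scikit-learn with synthetic data (the datasets and model choices are illustrative, not from the book):

```python
# Classification / class probability estimation: predict which class an individual
# belongs to, and how likely that membership is.
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LinearRegression, LogisticRegression

Xc, yc = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression().fit(Xc, yc)
print(clf.predict(Xc[:3]))        # predicted classes
print(clf.predict_proba(Xc[:3]))  # class probability estimates

# Regression ("value estimation"): estimate a numeric value for each individual.
Xr, yr = make_regression(n_samples=200, n_features=5, random_state=0)
reg = LinearRegression().fit(Xr, yr)
print(reg.predict(Xr[:3]))        # estimated numeric values
```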
Similarity matching attempts to identify similar individuals based on data known about them. Similarity matching can be used directly to find similar entities.
Clustering attempts to group individuals in a population together by their similarity, but not driven by any specific purpose.
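A minimal clustering sketch, assuming scikit-learn and synthetic data; note that no target variable is supplied, only a similarity-based grouping is produced.

```python
# Clustering: group individuals by similarity, with no predefined target.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)  # synthetic "individuals"
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
print(labels[:10])  # cluster assignment for the first ten individuals
```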
Co-occurrence grouping (also known as frequent itemset mining, association rule discovery, and market-basket analysis) attempts to find associations between entities based on transactions involving them.
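A toy sketch of the market-basket idea using only the standard library; the transactions are made up for illustration.

```python
# Co-occurrence grouping: count how often pairs of items appear in the same transaction.
from collections import Counter
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "beer", "chips"},
    {"milk", "chips"},
    {"bread", "milk", "chips"},
]

pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(3))  # the most frequently co-occurring item pairs
```

Full association-rule discovery also scores such pairs (support, confidence, lift), but the counting above is the core co-occurrence step.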
Profiling (also known as behavior description) attempts to characterize the typical behavior of an individual, group, or population.
Profiling is often used to establish behavioral norms for anomaly detection applications such as fraud detection and monitoring for intrusions into computer systems.
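A very small sketch of that idea: build a profile of an account’s typical transaction amounts, then flag departures from it. The numbers and the three-standard-deviation rule are assumptions for illustration.

```python
# Profiling for anomaly detection: characterize typical behavior, then flag departures.
import statistics

history = [22.0, 18.5, 25.0, 19.0, 21.5, 23.0, 20.0]  # made-up past amounts for one account
mean = statistics.mean(history)
std = statistics.stdev(history)

def is_anomalous(amount, k=3):
    """Flag an amount more than k standard deviations from the account's norm."""
    return abs(amount - mean) > k * std

print(is_anomalous(24.0))   # False: consistent with the profile
print(is_anomalous(310.0))  # True: far outside typical behavior
```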
Link prediction attempts to predict connections between data items, usually by suggesting that a link should exist, and possibly also estimating the strength of the link.
Data reduction attempts to take a large set of data and replace it with a smaller set of data that contains much of the important information in the larger set.
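One common way to do this is dimensionality reduction; the PCA example below is an illustrative choice (scikit-learn assumed), not the book’s prescription.

```python
# Data reduction: replace 64 features per record with 10 components that retain
# much of the variation in the original data.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)      # 1797 records, 64 features each
X_reduced = PCA(n_components=10).fit_transform(X)
print(X.shape, "->", X_reduced.shape)    # (1797, 64) -> (1797, 10)
```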
Causal modeling attempts to help us understand what events or actions actually influence others.
Both experimental and observational methods for causal modeling generally can be viewed as “counterfactual” analysis: they attempt to understand the difference between two situations that cannot both happen, one in which the “treatment” event (e.g., showing an advertisement to a particular individual) occurs and one in which it does not.
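A minimal sketch of the experimental side of this idea, assuming a randomized test in which one group is shown the advertisement and the other is not; the outcomes below are made up.

```python
# With random assignment, the control group stands in for the counterfactual:
# the difference in purchase rates estimates the causal effect of showing the ad.
treated_purchases = [1, 0, 0, 1, 1, 0, 1, 0]   # shown the ad (made-up outcomes)
control_purchases = [0, 0, 1, 0, 0, 1, 0, 0]   # not shown the ad

treated_rate = sum(treated_purchases) / len(treated_purchases)
control_rate = sum(control_purchases) / len(control_purchases)
print("estimated lift from the ad:", treated_rate - control_rate)
```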
Here no specific purpose or target has been specified for the grouping. When there is no such target, the data mining problem is referred to as unsupervised.
Segmentation is being done for a specific reason: to take action based on likelihood of churn. This is called a supervised data mining problem.
Supervised tasks require different techniques than unsupervised tasks do, and the results often are much more useful.
Technically, another condition must be met for supervised data mining: there must be data on the target.
Classification, regression, and causal modeling generally are solved with supervised methods. Similarity matching, link prediction, and data reduction could be either. Clustering, co-occurrence grouping, and profiling generally are unsupervised.
Regression involves a numeric target while classification involves a categorical (often binary) target.
A vital part in the early stages of the data mining process is (i) to decide whether the line of attack will be supervised or unsupervised, and (ii) if supervised, to produce a precise definition of a target variable.
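As a hypothetical example of point (ii), a churn target might be defined precisely as “did not renew within 30 days of contract expiration”; the column names and the 30-day grace period below are invented for illustration (pandas assumed).

```python
import pandas as pd

customers = pd.DataFrame({
    "contract_end": pd.to_datetime(["2017-01-31", "2017-02-28"]),
    "renewal_date": pd.to_datetime(["2017-02-10", pd.NaT]),  # NaT: never renewed
})

grace = pd.Timedelta(days=30)
customers["churned"] = (
    customers["renewal_date"].isna()
    | (customers["renewal_date"] > customers["contract_end"] + grace)
).astype(int)
print(customers)
```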
An important distinction is the difference between (1) mining the data to find patterns and build models, and (2) using the results of data mining.
Often the entire process is an exploration of the data, and after the first iteration the data science team knows much more. The next iteration can be far better informed.
The Business Understanding stage represents a part of the craft where the analysts’ creativity plays a large role. Data science has some things to say, as we will describe, but often the key to a great success is a creative problem formulation by some analyst regarding how to cast the business problem as one or more data science problems.
In this first stage, the design team should think carefully about the problem to be solved and about the use scenario.
In data understanding we need to dig beneath the surface to uncover the structure of the business problem and the data that are available, and then match them to one or more data mining tasks for which we may have substantial science and technology to apply.
Often the quality of the data mining solution rests on how well the analysts structure the problems and craft the variables (and sometimes it can be surprisingly hard for them to admit it).
A leak is a situation where a variable collected in historical data gives information on the target variable — information that appears in historical data but is not actually available when the decision has to be made.
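A hypothetical illustration of a leak (the column names are invented for the example, not drawn from the book):

```python
# "account_closed_date" is only filled in *after* a customer churns, so it leaks the
# target into the historical training data; at decision time the field would be empty.
import pandas as pd

df = pd.DataFrame({
    "monthly_spend":       [40.0, 55.0, 12.0],
    "account_closed_date": [None, None, "2017-03-01"],  # populated only after churn
    "churned":             [0, 0, 1],                   # target variable
})

leaky_features = ["monthly_spend", "account_closed_date"]  # predicts churn "perfectly"
safe_features = ["monthly_spend"]                          # available when deciding
```

A model trained on the leaky feature set would look excellent in historical evaluation and fail in deployment.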
The modeling stage is the primary place where data mining techniques are applied to the data.
What that means varies from application to application, but often stakeholders are looking to see whether the model is going to do more good than harm, and especially that the model is unlikely to make catastrophic mistakes.
We may also want to instrument deployed systems for evaluations to make sure that the world is not changing to the detriment of the model’s decision-making.
Two main reasons for deploying the data mining system itself rather than the models produced by a data mining system are (i) the world may change faster than the data science team can adapt, as with fraud and intrusion detection, and (ii) a business has too many modeling tasks for its data science team to manually curate each model individually. In these cases, it may be best to deploy the data mining phase into production.
In many cases, the data science team is responsible for producing a working prototype, along with its evaluation.
data mining is an exploratory undertaking closer to research and development than it is to engineering. The CRISP cycle is based around exploration; it iterates on approaches and strategy rather than on software designs. Outcomes are far less certain, and the results of a given step may change the fundamental understanding of the problem.