Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking
At a high level, data science is a set of fundamental principles that guide the extraction of knowledge from data. Data mining is the extraction of knowledge from data, via technologies that incorporate these principles. As a term, “data science” often is applied more broadly than the traditional use of “data mining,” but data mining techniques provide some of the clearest illustrations of the principles of data science.
In this book, we will view the ultimate goal of data science as improving decision making, as this generally is of direct interest to business.
Data-driven decision-making (DDD) refers to the practice of basing decisions on the analysis of data, rather than purely on intuition.
The sorts of decisions we will be interested in in this book mainly fall into two types: (1) decisions for which "discoveries" need to be made within data, and (2) decisions that repeat, especially at massive scale, and so can benefit from even small increases in decision-making accuracy based on data analysis.
For the time being, it is sufficient to understand that a predictive model abstracts away most of the complexity of the world, focusing in on a particular set of indicators that correlate in some way with a quantity of interest.
This highlights the often overlooked fact that, increasingly, business decisions are being made automatically by computer systems. Different industries have adopted automatic decision-making at different rates.
However, much more often the well-known big data technologies are used for data processing in support of the data mining techniques and other data science activities, as represented in Figure 1-1.
Data, and the capability to extract useful knowledge from data, should be regarded as key strategic assets.
The best data science team can yield little value without the appropriate data; the right data often cannot substantially improve decisions without suitable data science talent.
You may not have heard of little Signet Bank, but if you’re reading this book you’ve probably heard of the spin-off: Capital One.
This has an important implication: banks with bigger data assets may have an important strategic advantage over their smaller competitors.
If these employees do not have a fundamental grounding in the principles of data-analytic thinking, they will not really understand what is happening in the business. This lack of understanding is much more damaging in data science projects than in other technical projects, because the data science is supporting improved decision-making.
This book devotes a good deal of attention to the extraction of useful (nontrivial, hopefully actionable) patterns or models from large bodies of data (Fayyad, Piatetsky-Shapiro, & Smyth, 1996), and to the fundamental data science principles underlying such data mining.
Extracting useful knowledge from data to solve business problems can be treated systematically by following a process with reasonably well-defined stages.
Fundamental concept: If you look too hard at a set of data, you will find something — but it might not generalize beyond the data you’re looking at. This is referred to as overfitting a dataset.
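The book states this concept in prose only; a minimal sketch (with invented toy data) of what overfitting looks like in practice. The "model" below simply memorizes every training example, so it scores perfectly on the data it looked at but near chance on fresh data:

```python
import random

random.seed(0)

# Toy data: x values with purely random binary labels (nothing to learn).
train_xs = random.sample(range(1000), 50)
train = [(x, random.randrange(2)) for x in train_xs]
test = [(random.randrange(1000), random.randrange(2)) for _ in range(50)]

# "Overfit" model: memorize every training example exactly.
memorized = {x: y for x, y in train}

def overfit_predict(x):
    # Perfect recall on x values it has seen; guesses 0 otherwise.
    return memorized.get(x, 0)

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

train_acc = accuracy(overfit_predict, train)  # looks flawless
test_acc = accuracy(overfit_predict, test)    # near chance: no generalization
```

Because the labels are random, there was never a real pattern to find; "looking too hard" (here, memorizing) produces something that does not generalize beyond the data being looked at.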
Fundamental concept: Formulating data mining solutions and evaluating the results involves thinking carefully about the context in which they will be used.
In collaboration with business stakeholders, data scientists decompose a business problem into subtasks. The solutions to the subtasks can then be composed to solve the overall problem. Some of these subtasks are unique to the particular business problem, but others are common data mining tasks.
Basic tools in data science
Recognizing familiar problems and their solutions avoids wasting time and resources reinventing the wheel. It also allows people to focus attention on more interesting parts of the process that require human involvement — parts that have not been automated, so human creativity and intelligence must come into play.
Classification and class probability estimation attempt to predict, for each individual in a population, which of a (small) set of classes this individual belongs to.
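A minimal sketch of class-probability estimation, using a hypothetical churn example (the segment names and counts are invented, not from the book). Probabilities are estimated from historical frequencies, and classification then maps each probability to a discrete class:

```python
from collections import Counter, defaultdict

# Hypothetical historical data: (customer_segment, churned?) pairs.
history = [
    ("heavy_user", False), ("heavy_user", False), ("heavy_user", True),
    ("light_user", True),  ("light_user", True),  ("light_user", False),
    ("light_user", True),
]

# Class-probability estimation: P(churn | segment) from observed frequencies.
counts = defaultdict(Counter)
for segment, churned in history:
    counts[segment][churned] += 1

def churn_probability(segment):
    c = counts[segment]
    return c[True] / (c[True] + c[False])

def classify(segment, threshold=0.5):
    # Classification: map the probability estimate to one of a small set
    # of classes ("churn" vs. "stay").
    return "churn" if churn_probability(segment) >= threshold else "stay"
```

Real systems would use many features and a learned model rather than raw segment frequencies, but the two tasks — scoring a probability and assigning a class — are exactly as shown.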
Regression (“value estimation”) attempts to estimate or predict, for each individual, the numerical value of some variable for that individual.
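To make "value estimation" concrete, a self-contained sketch of ordinary least squares with one predictor, on invented data (tenure vs. spend is a hypothetical example, not the book's):

```python
# Toy data: customer tenure (months) vs. hypothetical yearly spend.
xs = [1, 2, 3, 4, 5]
ys = [12, 19, 29, 42, 50]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Ordinary least squares for a single predictor: slope and intercept.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

def predict_spend(tenure):
    # Regression: estimate a numerical value for an individual.
    return intercept + slope * tenure
```

The output is a number (estimated spend), in contrast to classification, where the output is a class label.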
Similarity matching attempts to identify similar individuals based on data known about them. Similarity matching can be used directly to find similar entities.
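A minimal similarity-matching sketch, assuming invented customer profiles and plain Euclidean distance over the feature vectors:

```python
import math

# Hypothetical customer profiles: (age, yearly_spend, visits_per_month).
customers = {
    "ann":  (34, 1200, 8),
    "bob":  (58,  300, 1),
    "cara": (36, 1100, 7),
}

def distance(a, b):
    # Euclidean distance; a real system would normalize each feature
    # first so that large-scale features (spend) don't dominate.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def most_similar(name):
    # Similarity matching used directly: find the most similar entity.
    target = customers[name]
    others = (n for n in customers if n != name)
    return min(others, key=lambda n: distance(customers[n], target))
```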
Clustering attempts to group individuals in a population together by their similarity, but not driven by any specific purpose.
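Clustering can be sketched with a tiny k-means on invented one-dimensional spend values. Note that no target variable appears anywhere — the grouping is driven only by similarity:

```python
import random

random.seed(1)

# Toy 1-D spend values with two visible groups; no target variable is used.
points = [10, 12, 11, 13, 95, 97, 94, 96]

def kmeans(points, k, iterations=10):
    centers = random.sample(points, k)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return clusters

clusters = kmeans(points, k=2)
```

On this data the algorithm separates the low-spend and high-spend groups, but what those groups *mean* is left to the analyst — which is exactly the "not driven by any specific purpose" point.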
Co-occurrence grouping (also known as frequent itemset mining, association rule discovery, and market-basket analysis) attempts to find associations between entities based on transactions involving them.
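A minimal market-basket sketch on invented transactions: count how often item pairs co-occur, and compute each pair's support (the fraction of transactions containing both items):

```python
from collections import Counter
from itertools import combinations

# Hypothetical market baskets: items bought together in one transaction.
baskets = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"beer", "chips"},
    {"milk", "eggs"},
    {"beer", "chips", "bread"},
]

# Count how often each unordered pair of items appears in the same basket.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

def support(pair):
    # Support: fraction of all transactions containing both items.
    return pair_counts[tuple(sorted(pair))] / len(baskets)
```

Full association-rule discovery adds confidence and lift on top of these counts, but pair support is the starting point.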
Profiling (also known as behavior description) attempts to characterize the typical behavior of an individual, group, or population.
Profiling is often used to establish behavioral norms for anomaly detection applications such as fraud detection and monitoring for intrusions to computer systems.
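A minimal profiling-for-fraud sketch, assuming an invented spending history: summarize the cardholder's "norm" as a mean and standard deviation, then flag transactions far outside it:

```python
import statistics

# Hypothetical history of a cardholder's daily spend (the behavioral "norm").
daily_spend = [22, 31, 18, 27, 25, 30, 21, 26, 24, 29]

mean = statistics.mean(daily_spend)
stdev = statistics.stdev(daily_spend)

def is_anomalous(amount, z_threshold=3.0):
    # Flag amounts far outside the profile, measured in standard deviations.
    return abs(amount - mean) / stdev > z_threshold
```

Production fraud systems profile many behaviors at once (merchants, times, geography), but the pattern — characterize typical behavior, then score deviations from it — is the same.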
Link prediction attempts to predict connections between data items, usually by suggesting that a link should exist, and possibly also estimating the strength of the link.
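One common baseline for link prediction (an illustrative heuristic, not a method the book specifies here) is the common-neighbors score on an invented friendship graph:

```python
# Hypothetical friendship graph (symmetric adjacency sets).
friends = {
    "ann":  {"bob", "cara", "dan"},
    "bob":  {"ann", "cara"},
    "cara": {"ann", "bob"},
    "dan":  {"ann", "eve"},
    "eve":  {"dan"},
}

def link_score(a, b):
    # Common-neighbors heuristic: the more friends two people share,
    # the stronger the evidence that a link between them should exist.
    # The count doubles as a crude estimate of the link's strength.
    return len(friends[a] & friends[b])
```

A recommender would score every unlinked pair and suggest the highest-scoring ones ("people you may know").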
Data reduction attempts to take a large set of data and replace it with a smaller set of data that contains much of the important information in the larger set.
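A minimal data-reduction sketch, assuming an invented event log: replace a raw transaction stream (potentially millions of rows) with one summary row per customer that retains the important information:

```python
from collections import defaultdict

# Hypothetical raw event log: (customer_id, purchase_amount) rows.
events = [
    ("c1", 10.0), ("c1", 15.0), ("c2", 8.0),
    ("c1", 5.0),  ("c2", 12.0), ("c3", 40.0),
]

# Data reduction by aggregation: one (count, total) summary per customer.
totals = defaultdict(lambda: [0, 0.0])
for customer, amount in events:
    totals[customer][0] += 1
    totals[customer][1] += amount

reduced = {c: {"purchases": n, "total": t} for c, (n, t) in totals.items()}
```

The reduced table is smaller than the event log yet still supports many downstream analyses; what counts as "important information" to keep is itself a modeling decision.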
Causal modeling attempts to help us understand what events or actions actually influence others.
Both experimental and observational methods for causal modeling generally can be viewed as “counterfactual” analysis: they attempt to understand what would be the difference between the situations — which cannot both happen — where the “treatment” event (e.g., showing an advertisement to a particular individual) were to happen, and were not to happen.
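In the experimental (randomized) case, the counterfactual comparison reduces to a difference in group outcomes. A minimal sketch on invented conversion data, where random assignment is what licenses the causal reading:

```python
# Hypothetical randomized experiment: conversion outcomes (1 = converted)
# for customers shown an ad (treatment) vs. not shown it (control).
# Because assignment was random, the group difference estimates what
# the ad itself changed -- the counterfactual contrast.
treated = [1, 0, 1, 1, 0, 1, 1, 0]
control = [0, 0, 1, 0, 0, 1, 0, 0]

treated_rate = sum(treated) / len(treated)
control_rate = sum(control) / len(control)

# Estimated average treatment effect of showing the advertisement.
ate = treated_rate - control_rate
```

With observational data, the same subtraction is not enough: the analyst must first adjust for the reasons some individuals were "treated" and others were not.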
Here no specific purpose or target has been specified for the grouping. When there is no such target, the data mining problem is referred to as unsupervised.
Segmentation is being done for a specific reason: to take action based on likelihood of churn. This is called a supervised data mining problem.
Supervised tasks require different techniques than unsupervised tasks do, and the results often are much more useful.
Technically, another condition must be met for supervised data mining: there must be data on the target.
Classification, regression, and causal modeling generally are solved with supervised methods. Similarity matching, link prediction, and data reduction could be either. Clustering, co-occurrence grouping, and profiling generally are unsupervised.
Regression involves a numeric target while classification involves a categorical (often binary) target.
A vital part in the early stages of the data mining process is (i) to decide whether the line of attack will be supervised or unsupervised, and (ii) if supervised, to produce a precise definition of a target variable.
There is a difference between (1) mining the data to find patterns and build models, and (2) using the results of data mining.
Standard procedure for a data science project
Often the entire process is an exploration of the data, and after the first iteration the data science team knows much more. The next iteration can be much more well-informed.
The Business Understanding stage represents a part of the craft where the analysts’ creativity plays a large role. Data science has some things to say, as we will describe, but often the key to a great success is a creative problem formulation by some analyst regarding how to cast the business problem as one or more data science problems.
In this first stage, the design team should think carefully about the problem to be solved and about the use scenario.
In data understanding we need to dig beneath the surface to uncover the structure of the business problem and the data that are available, and then match them to one or more data mining tasks for which we may have substantial science and technology to apply.
Often the quality of the data mining solution rests on how well the analysts structure the problems and craft the variables (and sometimes it can be surprisingly hard for them to admit it).
A leak is a situation where a variable collected in historical data gives information on the target variable — information that appears in historical data but is not actually available when the decision has to be made.
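A sketch of what a leak looks like in a table, with an invented churn example (the `sent_winback_offer` field and the screening heuristic are illustrative, not from the book). The offer is recorded only *after* a customer churns, so it exists in historical data but would not be available at decision time:

```python
# Hypothetical churn table. "sent_winback_offer" is recorded AFTER a
# customer churns, so it leaks the target: present in historical data,
# unavailable when the churn prediction actually has to be made.
rows = [
    {"usage": 5,  "sent_winback_offer": True,  "churned": True},
    {"usage": 40, "sent_winback_offer": False, "churned": False},
    {"usage": 7,  "sent_winback_offer": True,  "churned": True},
    {"usage": 35, "sent_winback_offer": False, "churned": False},
]

def agreement(feature):
    # Fraction of rows where the boolean feature matches the target.
    return sum(r[feature] == r["churned"] for r in rows) / len(rows)

# A quick leak screen: any feature that matches the target (almost)
# perfectly in historical data deserves suspicion and a timing audit.
suspicious = [f for f in ("sent_winback_offer",) if agreement(f) >= 0.99]
```

A model trained on such a table looks superb in evaluation and then fails in deployment, which is why leaks are usually caught by asking, for every variable, "would this value be known at decision time?"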
The modeling stage is the primary place where data mining techniques are applied to the data.
What that means varies from application to application, but often stakeholders are looking to see whether the model is going to do more good than harm, and especially that the model is unlikely to make catastrophic mistakes.
We may also want to instrument deployed systems for evaluations to make sure that the world is not changing to the detriment of the model’s decision-making.
Two main reasons for deploying the data mining system itself rather than the models produced by a data mining system are (i) the world may change faster than the data science team can adapt, as with fraud and intrusion detection, and (ii) a business has too many modeling tasks for their data science team to manually curate each model individually. In these cases, it may be best to deploy the data mining phase into production.
In many cases, the data science team is responsible for producing a working prototype, along with its evaluation.
Data mining is an exploratory undertaking closer to research and development than it is to engineering. The CRISP cycle is based around exploration; it iterates on approaches and strategy rather than on software designs. Outcomes are far less certain, and the results of a given step may change the fundamental understanding of the problem.