Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking
0%
It is liberally sprinkled with compelling real-world examples outlining familiar, accessible problems in the business world: customer churn, targeted marketing, even whiskey analytics!
About the book
3%
information is now widely available on external events such as market trends, industry news, and competitors’ movements.
Information is everywhere
3%
two brief case studies of analyzing data to extract predictive patterns.
Two examples: "Hurricane Frances" + "Predicting Customer Churn"
4%
The sort of decisions we will be interested in in this book mainly fall into two types: (1) decisions for which “discoveries” need to be made within data, and (2) decisions that repeat, especially at massive scale, and so decision-making can benefit from even small increases in decision-making accuracy based on data analysis.
Types of decisions
4%
Consumers tend to have inertia in their habits and getting them to change is very difficult.
inertia - tendency to do nothing or to remain unchanged
4%
the arrival of a new baby in a family is one point where people do change their shopping habits significantly.
Predictive model
4%
Big data essentially means datasets that are too large for traditional data processing systems, and therefore require new processing technologies.
Big data
4%
data, and the capability to extract useful knowledge from data, should be regarded as key strategic assets.
Data Science principle
5%
credit cards essentially had uniform pricing, for two reasons: (1) the companies did not have adequate information systems to deal with differential pricing at massive scale, and (2) bank management believed customers would not stand for price discrimination.
5%
Around 1990, two strategic visionaries (Richard Fairbank and Nigel Morris) realized that information technology was powerful enough that they could do more sophisticated predictive modeling
First steps of data science
5%
modeling profitability, not just default probability, was the right strategy.
5%
quantitative demonstrations of the value of a data asset are hard to find, primarily because firms are hesitant to divulge results of strategic value.
5%
Sociodemographic data provide a substantial ability to model the sort of consumers that are more likely to purchase one product or another. However, sociodemographic data only go so far; after a certain volume of data, no additional advantage is conferred.
5%
consumers find value in the rankings and recommendations that Amazon provides.
Amazon ranking system
5%
data analysis is now so critical to business strategy.
5%
They employ data science teams to bring advanced technologies to bear to increase revenue and to decrease costs.
5%
With an understanding of the fundamentals of data science you should be able to devise a few probing questions to determine whether their valuation arguments are plausible.
6%
The Cross Industry Standard Process for Data Mining, abbreviated CRISP-DM (CRISP-DM Project, 2000),
Useful conditions of data mining process
7%
classification predicts whether something will happen, whereas regression predicts how much something will happen.
Classification
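A minimal sketch of the distinction in Python with scikit-learn (the churn-style customers, feature names, and target values below are made up for illustration, not taken from the book): the same instances can feed a classifier that predicts whether churn will happen and a regressor that predicts how much will be spent.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Hypothetical customer data: [age, monthly_usage_hours, months_as_customer]
X = np.array([[25, 40, 3], [47, 5, 36], [33, 22, 12], [52, 2, 48], [29, 35, 6]])

# Classification target: will the customer churn? (categorical: 1 = yes, 0 = no)
y_churn = np.array([1, 0, 1, 0, 1])
clf = DecisionTreeClassifier().fit(X, y_churn)
print(clf.predict([[40, 10, 24]]))        # predicts a class label ("whether")
print(clf.predict_proba([[40, 10, 24]]))  # class probability estimation

# Regression target: how much will the customer spend next month? (numeric)
y_spend = np.array([12.0, 55.0, 20.0, 70.0, 15.0])
reg = DecisionTreeRegressor().fit(X, y_spend)
print(reg.predict([[40, 10, 24]]))        # predicts a numeric amount ("how much")
```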
7%
The result of co-occurrence grouping is a description of items that occur together. These descriptions usually include statistics on the frequency of the co-occurrence and an estimate of how surprising it is.
Co-occurrence grouping
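One way to make this concrete in plain Python (the shopping baskets are hypothetical): count how often pairs of items occur together, and use lift as a rough estimate of how surprising each co-occurrence is relative to chance.

```python
from itertools import combinations
from collections import Counter

# Hypothetical shopping baskets
baskets = [
    {"beer", "chips", "salsa"},
    {"beer", "chips"},
    {"milk", "bread"},
    {"beer", "salsa"},
    {"milk", "chips", "bread"},
]
n = len(baskets)

item_counts = Counter(item for basket in baskets for item in basket)
pair_counts = Counter(pair for basket in baskets
                      for pair in combinations(sorted(basket), 2))

for (a, b), count in pair_counts.most_common(3):
    support = count / n                          # frequency of the co-occurrence
    lift = support / ((item_counts[a] / n) * (item_counts[b] / n))
    # lift > 1 means the pair occurs together more often than chance would suggest
    print(f"{a} & {b}: support={support:.2f}, lift={lift:.2f}")
```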
7%
Link prediction attempts to predict connections between data items, usually by suggesting that a link should exist, and possibly also estimating the strength of the link.
Link prediction
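A tiny illustration of the idea (the friendship graph and the neighborhood-overlap scoring rule are just one simple choice, not the book's method): score a candidate link by how much the two nodes' neighborhoods overlap, and use that score as an estimate of the link's strength.

```python
# Hypothetical friendship graph: person -> set of friends
friends = {
    "ann":  {"bob", "cara", "dan"},
    "bob":  {"ann", "cara"},
    "cara": {"ann", "bob", "dan"},
    "dan":  {"ann", "cara"},
    "eve":  {"dan"},
}

def link_score(a, b):
    """Jaccard overlap of neighborhoods: a simple strength estimate
    for a suggested (not yet existing) link between a and b."""
    common = friends[a] & friends[b]
    union = friends[a] | friends[b]
    return len(common) / len(union) if union else 0.0

# A high score suggests recommending the bob-dan connection
print(link_score("bob", "dan"))
```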
8%
Classification, regression, and causal modeling generally are solved with supervised methods. Similarity matching, link prediction, and data reduction could be either. Clustering, co-occurrence grouping, and profiling generally are unsupervised.
8%
This is still considered classification modeling rather than regression because the underlying target is categorical. Where necessary for clarity, this is called “class probability estimation.”
Classification modeling vs regression
8%
There is another important distinction pertaining to mining data: the difference between (1) mining the data to find patterns and build models, and (2) using the results of data mining.
Data mining confusion
8%
The use of data mining results should influence and inform the data mining process itself, but the two should be kept distinct.
Use of data mining
8%
Data mining is a craft. It involves the application of a substantial amount of science and technology, but the proper application still involves art as well.
Data mining is a craft
8%
Business Understanding stage represents a part of the craft where the analysts’ creativity plays a large role.
Business understanding importance
8%
This can mean structuring (engineering) the problem such that one or more subproblems involve building models for classification, regression, probability estimation, and so on.
Breaking problem into the subproblems
9%
critical part of the data understanding phase is estimating the costs and benefits of each data source and deciding whether further investment is merited.
Critical part of the data understanding phase
9%
Therefore a data preparation phase often proceeds along with data understanding, in which the data are manipulated and converted into forms that yield better results.
Data understanding
9%
The output of modeling is some sort of model or pattern capturing regularities in the data.
Modeling
9%
A common problem with such systems (fraud detection, spam detection, and intrusion monitoring) is that they produce too many false alarms.
False alarms
9%
To facilitate such qualitative assessment, the data scientist must think about the comprehensibility of the model to stakeholders (not just to the data scientists).
Comprehensibility of the model to stakeholders
10%
data science team is responsible for producing a working prototype, along with its evaluation.
Data Science team responsibility
10%
data mining is an exploratory undertaking closer to research and development than it is to engineering.
10%
Team members may be evaluated using software metrics such as the amount of code written or number of bug tickets closed. In analytics, it’s more important for individuals to be able to formulate problems well, to prototype solutions quickly, to make reasonable assumptions in the face of ill-structured problems, to design experiments that represent good investments, and to analyze results. In building a data science team, these qualities, rather than traditional software engineering expertise, are skills that should be sought.
Qualifying software engineering projects vs data science ones
12%
In data science, prediction more generally means to estimate an unknown value
Prediction meaning
12%
descriptive modeling, where the primary purpose of the model is not to estimate a value but instead to gain insight into the underlying phenomenon or process.
Descriptive modeling
13%
instance is also sometimes called a feature vector, because it can be represented as a fixed-length ordered collection (vector) of feature values.
Instance as a feature vector
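A small illustration (the attribute names and values are hypothetical): the same instance written as a fixed-length, ordered vector of feature values.

```python
import numpy as np

# One instance described by a fixed, ordered set of attributes (features)
feature_names = ["age", "income", "years_as_customer", "is_homeowner"]

# The instance as a feature vector: one value per attribute, in a fixed order
x = np.array([38, 52000.0, 4, 1])

print(dict(zip(feature_names, x)))
```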
13%
The target variable, whose values are to be predicted, is commonly called the dependent variable in statistics.
Target/dependent variable
13%
The creation of models from data is known as model induction.
Model induction
13%
direct, multivariate supervised segmentation is just one application of this fundamental idea of selecting informative variables.
direct, multivariate supervised segmentation
13%
If every member of a group has the same value for the target, then the group is pure. If there is at least one member of the group that has a different value for the target variable than the rest of the group, then the group is impure.
Pure vs impure
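A one-line check of this definition in Python (the labels are hypothetical):

```python
def is_pure(target_values):
    """A group is pure if every member has the same target value."""
    return len(set(target_values)) <= 1

print(is_pure(["churn", "churn", "churn"]))      # True: pure
print(is_pure(["churn", "churn", "no churn"]))   # False: impure
```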
14%
formula that evaluates how well each attribute splits a set of examples into segments, with respect to a chosen target variable. Such a formula is based on a purity measure. The most common splitting criterion is called information gain, and it is based on a purity measure called entropy.
Purity measure, information gain
14%
Entropy is a measure of disorder
Entropy
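A minimal sketch, assuming the usual definition entropy = -sum_i p_i * log2(p_i) over the class proportions within a segment (the example labels are made up):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a segment: 0 for a pure segment, maximal (1 bit for
    two classes) when the classes are evenly mixed."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(entropy(["yes"] * 10))               # 0.0   (pure, no disorder)
print(entropy(["yes"] * 5 + ["no"] * 5))   # 1.0   (maximally mixed)
print(entropy(["yes"] * 9 + ["no"] * 1))   # ~0.47 (mostly pure)
```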
14%
Disorder corresponds to how mixed (impure) the segment is with respect to these properties of interest.
Disorder
14%
information gain (IG) to measure how much an attribute improves (decreases) entropy over the whole segmentation it creates.
Purpose of information gain
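A sketch of this computation under the standard formulation: information gain is the entropy of the parent segment minus the weighted average entropy of the child segments the split creates (the churn/stay labels are hypothetical).

```python
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(parent, children):
    """Entropy of the parent minus the weighted entropy of the children:
    how much the split decreases disorder over the whole segmentation."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["churn"] * 6 + ["stay"] * 6
split = [["churn"] * 5 + ["stay"] * 1,   # left child: mostly churners
         ["churn"] * 1 + ["stay"] * 5]   # right child: mostly stayers
print(information_gain(parent, split))   # ~0.35: the split is informative
```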
14%
information gain measures the change in entropy due to any amount of new information being added;
Information gain - what it measures
14%
Numeric variables can be “discretized” by choosing a split point (or many split points) and then treating the result as a categorical attribute.
Discretization of numerical variables
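A small NumPy sketch (the ages and split points are made up): choosing split points turns the numeric attribute into a categorical one that can then be treated like any other categorical attribute.

```python
import numpy as np

ages = np.array([22, 35, 47, 51, 63, 29, 44])   # hypothetical numeric attribute

# Discretize by choosing split points, producing a three-level categorical variable
split_points = [30, 50]
bins = np.digitize(ages, bins=split_points)     # 0: <30, 1: 30-49, 2: >=50
labels = np.array(["young", "middle", "older"])[bins]
print(labels)
```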
14%
natural measure of impurity for numeric values is variance.
Variance
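A sketch of the numeric analogue of information gain, assuming variance plays the role entropy plays for categorical targets (the spending values and candidate split are hypothetical).

```python
import numpy as np

def variance_reduction(parent, children):
    """Variance of the numeric target in the parent segment minus the
    weighted variance of the children: large values mean a useful split."""
    total = len(parent)
    weighted = sum(len(child) / total * np.var(child) for child in children)
    return np.var(parent) - weighted

spend = np.array([10, 12, 11, 80, 85, 90])                 # hypothetical numeric target
split = [np.array([10, 12, 11]), np.array([80, 85, 90])]   # candidate segmentation
print(variance_reduction(spend, split))                    # large reduction: good split
```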