Kindle Notes & Highlights
Read between August 2, 2018 and February 5, 2019
At a high level, data science is a set of fundamental principles that guide the extraction of knowledge from data. Data mining is the extraction of knowledge from data, via technologies that incorporate these principles.
Data-driven decision-making (DDD) refers to the practice of basing decisions on the analysis of data, rather than purely on intuition.
Data, and the capability to extract useful knowledge from data, should be regarded as key strategic assets.
Once we view data as a business asset, we should think about whether and how much we are willing to invest.
When faced with a business problem, you should be able to assess whether and how data can improve performance.
Extracting useful knowledge from data to solve business problems can be treated systematically by following a process with reasonably well-defined stages.
From a large mass of data, information technology can be used to find informative descriptive attributes of entities of interest.
If you look too hard at a set of data, you will find something — but it might not generalize beyond the data you’re looking at.
Formulating data mining solutions and evaluating the results involves thinking carefully about the context in which they will be used.
Classification and class probability estimation attempt to predict, for each individual in a population, which of a (small) set of classes this individual belongs to.
Regression (“value estimation”) attempts to estimate or predict, for each individual, the numerical value of some variable for that individual.
Informally, classification predicts whether something will happen, whereas regression predicts how much something will happen.
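The contrast between the two tasks can be sketched in a few lines. This is only an illustration, not a method taken from the book: the customer rows are made up, and a simple nearest-neighbor rule stands in for whatever model one would actually use. The same neighbors yield a category when their labels are voted on and a number when their values are averaged.

```python
from collections import Counter
from statistics import mean

# Hypothetical toy data: (monthly_usage_hours, outcome_label, spend_next_month)
customers = [
    (2.0, "churn", 5.0),
    (3.0, "churn", 8.0),
    (10.0, "stay", 40.0),
    (12.0, "stay", 46.0),
    (11.0, "stay", 43.0),
]

def nearest(usage, k=3):
    """Return the k rows whose usage is closest to `usage`."""
    return sorted(customers, key=lambda row: abs(row[0] - usage))[:k]

def classify(usage, k=3):
    """Classification: predict a category by majority vote of the neighbors."""
    labels = [row[1] for row in nearest(usage, k)]
    return Counter(labels).most_common(1)[0][0]

def regress(usage, k=3):
    """Regression: predict a number by averaging the neighbors' values."""
    return mean(row[2] for row in nearest(usage, k))

print(classify(10.5))  # "stay": whether something will happen
print(regress(10.5))   # 43.0: how much something will happen
```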
Similarity matching attempts to identify similar individuals based on data known about them.
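As a rough sketch of similarity matching, the snippet below finds the customer closest to a given one by Euclidean distance. The profiles and the choice of distance are assumptions made for illustration; in practice the features would usually be rescaled first so that no single attribute dominates.

```python
import math

# Hypothetical customer profiles: (age, yearly_spend_in_thousands)
profiles = {
    "ann": (34, 12.0),
    "bob": (36, 11.5),
    "cy": (58, 3.0),
}

def most_similar(name):
    """Return the other customer whose profile is nearest to `name`'s."""
    target = profiles[name]
    others = (n for n in profiles if n != name)
    return min(others, key=lambda n: math.dist(profiles[n], target))

print(most_similar("ann"))  # "bob"
```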
Clustering attempts to group individuals in a population together by their similarity, but not driven by any specific purpose.
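One common way to do this is k-means, sketched here in one dimension. The spend values and the starting centers are invented for the example, and k-means is just one of several clustering methods; no target label is involved at any point.

```python
from statistics import mean

def kmeans_1d(values, centers, iterations=10):
    """Tiny 1-D k-means: alternate between assignment and center update."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            # Assign each value to its nearest current center.
            idx = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[idx].append(v)
        # Move each center to the mean of its cluster (keep it if empty).
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

spend = [5, 6, 7, 40, 42, 45]  # hypothetical monthly spend per customer
centers, clusters = kmeans_1d(spend, centers=[0, 50])
print(clusters)  # [[5, 6, 7], [40, 42, 45]]
```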
Co-occurrence grouping (also known as frequent itemset mining, association rule discovery, and market-basket analysis) attempts to find associations between entities based on transactions involving them.
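A minimal sketch of the idea, assuming a handful of made-up shopping baskets: count how often each pair of items appears in the same transaction, then express that count as support (the fraction of all transactions containing the pair). Real association rule mining adds pruning and measures such as confidence and lift, which are left out here.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions (market baskets).
baskets = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"beer", "chips"},
    {"bread", "jam"},
]

# Count every unordered pair of items bought together.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

def support(item_a, item_b):
    """Fraction of baskets that contain both items."""
    return pair_counts[tuple(sorted((item_a, item_b)))] / len(baskets)

print(support("butter", "bread"))  # 0.5
```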
Profiling (also known as behavior description) attempts to characterize the typical behavior of an individual, group, or population.
Link prediction attempts to predict connections between data items, usually by suggesting that a link should exist, and possibly also estimating the strength of the link.
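A simple heuristic for this is to score a missing link by the number of neighbors two nodes share. The sketch below assumes a small invented friendship graph and uses the common-neighbor count as the link strength; it is one possible scoring rule, not the only one.

```python
# Hypothetical, symmetric friendship graph (assumed data for illustration).
friends = {
    "ann": {"bob", "cy"},
    "bob": {"ann", "cy", "dee"},
    "cy": {"ann", "bob", "dee"},
    "dee": {"bob", "cy", "eve"},
    "eve": {"dee"},
}

def link_strength(a, b):
    """Score a potential link by the number of shared friends."""
    return len(friends[a] & friends[b])

def suggest_link(person):
    """Suggest the not-yet-connected person with the most shared friends."""
    candidates = [p for p in friends if p != person and p not in friends[person]]
    return max(candidates, key=lambda p: link_strength(person, p))

print(suggest_link("ann"))  # "dee", who shares two friends with ann
```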
Data reduction attempts to take a large set of data and replace it with a smaller set of data that contains much of the important information in the larger set.
Causal modeling attempts to help us understand what events or actions actually influence others.
Classification, regression, and causal modeling generally are solved with supervised methods. Similarity matching, link prediction, and data reduction could be either supervised or unsupervised. Clustering, co-occurrence grouping, and profiling generally are unsupervised.
Two main subclasses of supervised data mining, classification and regression, are distinguished by the type of target. Regression involves a numeric target while classification involves a categorical (often binary) target.