Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking
It is called Manhattan (or taxicab) distance because it represents the total street distance you would have to travel in a place like midtown Manhattan (which is arranged in a grid) to get between two points — the total east-west distance traveled plus the total north-south distance traveled.
Manhattan distance
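As a minimal sketch of that definition (the helper below is mine, not the book's), the Manhattan distance between two numeric feature vectors is just the sum of the absolute per-dimension differences:

```python
import numpy as np

def manhattan_distance(x, y):
    """Taxicab (L1) distance: total east-west plus total north-south travel."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sum(np.abs(x - y)))

# 3 blocks east-west plus 4 blocks north-south = 7 blocks of street travel.
print(manhattan_distance([1, 2], [4, 6]))  # 7.0
```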
Jaccard distance treats the two objects as sets of characteristics. Thinking about the objects as sets allows one to think about the size of the union of all the characteristics of two objects X and Y, |X ∪ Y|, and the size of the set of characteristics shared by the two objects (the intersection), |X ∩ Y|.
Jaccard distance
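A small illustrative sketch (the set-valued inputs and flavor characteristics below are invented): Jaccard similarity is |X ∩ Y| / |X ∪ Y|, and Jaccard distance is one minus that.

```python
def jaccard_distance(x, y):
    """Jaccard distance between two objects viewed as sets of characteristics."""
    x, y = set(x), set(y)
    if not x and not y:
        return 0.0  # convention: two empty sets are treated as identical
    return 1.0 - len(x & y) / len(x | y)  # 1 - |X ∩ Y| / |X ∪ Y|

# Two of the four distinct characteristics are shared: distance = 1 - 2/4 = 0.5
print(jaccard_distance({"smoky", "peaty", "dry"}, {"smoky", "dry", "sweet"}))
```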
Cosine distance is often used in text classification to measure the similarity of two documents.
Cosine distance
Cosine distance is particularly useful when you want to ignore differences in scale across instances — technically, when you want to ignore the magnitude of the vectors.
Cosine distance usefulness
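A quick sketch of that scale-invariance (documents and word counts below are invented): the second "document" has exactly double the counts of the first, so its cosine distance from the first is essentially zero.

```python
import numpy as np

def cosine_distance(x, y):
    """1 minus cosine similarity; depends on vector direction, not magnitude."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return 1.0 - float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

doc_a = [2, 1, 0, 3]   # word counts for a short document
doc_b = [4, 2, 0, 6]   # a "twice as long" document with the same proportions
print(cosine_distance(doc_a, doc_b))  # ~0.0: same orientation, different magnitude
```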
Edit distance, or the Levenshtein metric, counts the minimum number of edit operations required to convert one string into the other.
Edit distance/Levenshtein metric
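A compact sketch of the standard dynamic-programming computation (my own implementation, not code from the book):

```python
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string s into string t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete a character of s
                            curr[j - 1] + 1,      # insert a character of t
                            prev[j - 1] + cost))  # substitute (or keep) a character
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3 edits
```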
This idea of finding natural groupings in the data may be called unsupervised segmentation, or more simply clustering.
Segmentation/clustering
An advantage of hierarchical clustering is that it allows the data analyst to see the groupings (the "landscape" of data similarity) before deciding on the number of clusters to extract.
Hierarchical clustering
For hierarchical clustering, we need a distance function between clusters, considering individual instances to be the smallest clusters. This is sometimes called the linkage function.
Linkage function
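As an illustration with SciPy (the toy data and the choice of single linkage are mine; the book does not prescribe a library), the linkage function is the `method` argument here:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Six toy instances with two numeric features each.
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],
              [5.0, 5.2], [5.1, 4.8], [9.0, 9.1]])

# "single" linkage defines the distance between two clusters as the distance
# between their two closest members; "complete" and "average" are alternatives.
Z = linkage(X, method="single", metric="euclidean")
dendrogram(Z)  # plots the tree of merges (needs matplotlib to display)
```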
There is a relatively long distance between cluster 3 (at about 0.10) and cluster 4 (at about 0.17). This suggests that this segmentation of the data, yielding three clusters, might be a good division.
Good division of clusters
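The specific cluster numbers and distances above refer to the book's whiskey dendrogram; as a generic sketch of the same idea (toy data again, heuristics are mine), you can look for a large gap between consecutive merge heights and cut the tree there:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],
              [5.0, 5.2], [5.1, 4.8], [9.0, 9.1]])
Z = linkage(X, method="single")

# Merge heights are the third column of the linkage matrix; a big jump between
# consecutive heights marks a candidate cut point for the dendrogram.
heights = Z[:, 2]
cut = heights[np.argmax(np.diff(heights))] + 1e-9
labels = fcluster(Z, t=cut, criterion="distance")
print(labels)  # the three tight groups in the toy data come out as three clusters
```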
Whenever a single point merges high up in a dendrogram, this is an indication that it seems different from the rest, which we might call an "outlier."
Outlier
In k-means the “means” are the centroids, represented by the arithmetic means (averages) of the values along each dimension for the instances in the cluster.
Centroids
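A one-line illustration of that definition (the cluster members are made up): the centroid is the per-dimension arithmetic mean of the instances assigned to the cluster.

```python
import numpy as np

# Instances currently assigned to one cluster (rows = instances, columns = features).
cluster_members = np.array([[1.0, 5.0],
                            [2.0, 7.0],
                            [3.0, 6.0]])

centroid = cluster_members.mean(axis=0)  # arithmetic mean along each dimension
print(centroid)  # [2. 6.]
```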
The clusters' distortion is the sum of the squared differences between each data point and its corresponding centroid.
Clusters' distortion
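A direct translation of that definition into code (helper function and data are my own, for illustration):

```python
import numpy as np

def distortion(X, labels, centroids):
    """Sum of squared differences between each point and its cluster's centroid."""
    return sum(np.sum((X[i] - centroids[labels[i]]) ** 2) for i in range(len(X)))

X = np.array([[1.0, 1.0], [2.0, 1.0], [8.0, 8.0], [9.0, 9.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([X[labels == k].mean(axis=0) for k in (0, 1)])
print(distortion(X, labels, centroids))  # 0.5 + 1.0 = 1.5
```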
In terms of run time, the k-means algorithm is efficient.
K-means = efficient run time
Hierarchical clustering is generally slower, as it needs to know the distances between all pairs of clusters on each iteration, which at the start is all pairs of data points.
Hierarchical clustering is generally slower
The value for k can be decreased if some clusters are too small and overly specific, and increased if some clusters are too broad and diffuse.
Value for k
Wikipedia's article "Determining the number of clusters in a data set" describes various metrics for evaluating sets of candidate clusters.
https://en.wikipedia.org/w/index.php?title=Determining_the_number_of_clusters_in_a_data_set&oldid=526596002
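One common heuristic covered in that family of methods is the "elbow" idea: fit k-means for a range of k values and look for the point where the distortion stops dropping sharply. A scikit-learn sketch on synthetic data (the library choice and the data are mine):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic data with three well-separated groups of points.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2))
               for c in ((0, 0), (3, 3), (6, 0))])

# Distortion (inertia_) versus k; the drop flattens out around the "right" k.
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 2))
```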
An old cliché in statistics: correlation is not causation, meaning that just because two things co-occur doesn't mean that one causes the other.
Correlation is not causation
Syntactic similarity is not semantic similarity. Just because two things — particularly text passages — have common surface characteristics doesn’t mean they’re necessarily related semantically.
Syntactic != semantic similarity
This mixes unsupervised learning (the clustering) with supervised learning in order to create differential descriptions of the clusters.
Creating differential descriptions of the clusters
We have k clusters, so we could set up a k-class task (one class per cluster). Alternatively, we could set up k separate learning tasks, each trying to differentiate one cluster from all the other (k–1) clusters.
Approach of setting up classification task
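A sketch of the second (one-vs-rest) setup using scikit-learn (the library, feature names, and data are my choices, not the book's): cluster first, then fit a shallow tree per cluster that separates its members from everything else; each tree reads as a differential description of that cluster.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)
# Synthetic data with three groups (stand-ins for the whiskey instances).
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(40, 2))
               for c in ((0, 0), (4, 0), (2, 4))])

# Unsupervised step: find the clusters.
k = 3
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

# Supervised step, one-vs-rest: a small tree per cluster, "this cluster" vs. the rest.
for c in range(k):
    tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, labels == c)
    print(f"cluster {c}:\n{export_text(tree, feature_names=['f1', 'f2'])}")
```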
A characteristic description describes what is typical or characteristic of the cluster, ignoring whether other clusters might share some of these characteristics.
Characteristic description
A differential description describes only what differentiates this cluster from the others, ignoring the characteristics that may be shared by whiskeys within it.
Differential description
For problems where we did not achieve a precise formulation of the problem in the early stages of the data mining process, we have to spend more time later in the process, in the Evaluation stage.
No precise formulation = more time spent in the Evaluation stage
It is useful to think of a positive example as one worthy of attention or alarm, and a negative example as uninteresting or benign.
Positive example vs negative example
Expected value computation provides a framework that is extremely useful in organizing thinking about data-analytic problems. Specifically, it decomposes data-analytic thinking into (i) the structure of the problem, (ii) the elements of the analysis that can be extracted from the data, and (iii) the elements of the analysis that need to be acquired from other sources (e.g., business knowledge of subject matter experts).
Expected value computation
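A tiny worked example of that decomposition (the numbers are invented): the response probability is estimated from the data, the values come from business knowledge, and the problem structure combines them into an expected value for the decision.

```python
# Should we target consumer x with an offer?
p_response = 0.05          # from the data/model: P(response | x)
value_response = 100.0     # from business knowledge: profit if x responds
value_no_response = -1.0   # from business knowledge: cost of the contact if x does not

expected_value = p_response * value_response + (1 - p_response) * value_no_response
print(expected_value)      # 0.05*100 + 0.95*(-1) = 4.05 -> positive, so targeting x pays off
```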