Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking
It is called Manhattan (or taxicab) distance because it represents the total street distance you would have to travel in a place like midtown Manhattan (which is arranged in a grid) to get between two points — the total east-west distance traveled plus the total north-south distance traveled.
Manhattan distance
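As a minimal sketch of that definition (the helper below is mine, not the book's), the Manhattan distance between two numeric feature vectors is just the sum of the absolute per-dimension differences:

```python
import numpy as np

def manhattan_distance(x, y):
    """Taxicab (L1) distance: total east-west plus total north-south travel."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sum(np.abs(x - y)))

# 3 blocks east-west plus 4 blocks north-south = 7 blocks of street travel.
print(manhattan_distance([1, 2], [4, 6]))  # 7.0
```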
Jaccard distance treats the two objects as sets of characteristics. Thinking about the objects as sets allows one to think about the size of the union of all the characteristics of two objects X and Y, |X ∪ Y|, and the size of the set of characteristics shared by the two objects (the intersection), |X ∩ Y|.
Jaccard distance
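A small illustrative sketch (the set-valued inputs and flavor characteristics below are invented): Jaccard similarity is |X ∩ Y| / |X ∪ Y|, and Jaccard distance is one minus that.

```python
def jaccard_distance(x, y):
    """Jaccard distance between two objects viewed as sets of characteristics."""
    x, y = set(x), set(y)
    if not x and not y:
        return 0.0  # convention: two empty sets are treated as identical
    return 1.0 - len(x & y) / len(x | y)  # 1 - |X ∩ Y| / |X ∪ Y|

# Two of the four distinct characteristics are shared: distance = 1 - 2/4 = 0.5
print(jaccard_distance({"smoky", "peaty", "dry"}, {"smoky", "dry", "sweet"}))
```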
Cosine distance is often used in text classification to measure the similarity of two documents.
Cosine distance
Cosine distance is particularly useful when you want to ignore differences in scale across instances — technically, when you want to ignore the magnitude of the vectors.
Cosine distance usefulness
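A quick sketch of that scale-invariance (documents and word counts below are invented): the second "document" has exactly double the counts of the first, so its cosine distance from the first is essentially zero.

```python
import numpy as np

def cosine_distance(x, y):
    """1 minus cosine similarity; depends on vector direction, not magnitude."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return 1.0 - float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

doc_a = [2, 1, 0, 3]   # word counts for a short document
doc_b = [4, 2, 0, 6]   # a "twice as long" document with the same proportions
print(cosine_distance(doc_a, doc_b))  # ~0.0: same orientation, different magnitude
```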
Edit distance, or the Levenshtein metric, counts the minimum number of edit operations required to convert one string into the other.
Edit distance/Levenshtein metric
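A compact sketch of the standard dynamic-programming computation (my own implementation, not code from the book):

```python
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string s into string t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete a character of s
                            curr[j - 1] + 1,      # insert a character of t
                            prev[j - 1] + cost))  # substitute (or keep) a character
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3 edits
```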
This idea of finding natural groupings in the data may be called unsupervised segmentation, or more simply clustering.
Segmentation/clustering
An advantage of hierarchical clustering is that it allows the data analyst to see the groupings (the "landscape" of data similarity) before deciding on the number of clusters to extract.
Hierarchical clustering
For hierarchical clustering, we need a distance function between clusters, considering individual instances to be the smallest clusters. This is sometimes called the linkage function.
Linkage function
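As an illustration with SciPy (the toy data and the choice of single linkage are mine; the book does not prescribe a library), the linkage function is the `method` argument here:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Six toy instances with two numeric features each.
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],
              [5.0, 5.2], [5.1, 4.8], [9.0, 9.1]])

# "single" linkage defines the distance between two clusters as the distance
# between their two closest members; "complete" and "average" are alternatives.
Z = linkage(X, method="single", metric="euclidean")
dendrogram(Z)  # plots the tree of merges (needs matplotlib to display)
```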
There is a relatively long distance between cluster 3 (at about 0.10) and cluster 4 (at about 0.17). This suggests that this segmentation of the data, yielding three clusters, might be a good division.
Good division of clusters
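The specific cluster numbers and distances above refer to the book's whiskey dendrogram; as a generic sketch of the same idea (toy data again, heuristics are mine), you can look for a large gap between consecutive merge heights and cut the tree there:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],
              [5.0, 5.2], [5.1, 4.8], [9.0, 9.1]])
Z = linkage(X, method="single")

# Merge heights are the third column of the linkage matrix; a big jump between
# consecutive heights marks a candidate cut point for the dendrogram.
heights = Z[:, 2]
cut = heights[np.argmax(np.diff(heights))] + 1e-9
labels = fcluster(Z, t=cut, criterion="distance")
print(labels)  # the three tight groups in the toy data come out as three clusters
```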
Whenever a single point merges high up in a dendrogram, this is an indication that it seems different from the rest, which we might call an "outlier."
Outlier
In k-means the “means” are the centroids, represented by the arithmetic means (averages) of the values along each dimension for the instances in the cluster.
Centroids
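A one-line illustration of that definition (the cluster members are made up): the centroid is the per-dimension arithmetic mean of the instances assigned to the cluster.

```python
import numpy as np

# Instances currently assigned to one cluster (rows = instances, columns = features).
cluster_members = np.array([[1.0, 5.0],
                            [2.0, 7.0],
                            [3.0, 6.0]])

centroid = cluster_members.mean(axis=0)  # arithmetic mean along each dimension
print(centroid)  # [2. 6.]
```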
The clusters' distortion is the sum of the squared differences between each data point and its corresponding centroid.
Clusters' distortion
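A direct translation of that definition into code (helper function and data are my own, for illustration):

```python
import numpy as np

def distortion(X, labels, centroids):
    """Sum of squared differences between each point and its cluster's centroid."""
    return sum(np.sum((X[i] - centroids[labels[i]]) ** 2) for i in range(len(X)))

X = np.array([[1.0, 1.0], [2.0, 1.0], [8.0, 8.0], [9.0, 9.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([X[labels == k].mean(axis=0) for k in (0, 1)])
print(distortion(X, labels, centroids))  # 0.5 + 1.0 = 1.5
```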
In terms of run time, the k-means algorithm is efficient.
K-means = efficient run time
Hierarchical clustering is generally slower, as it needs to know the distances between all pairs of clusters on each iteration, which at the start is all pairs of data points.
Hierarchical clustering is generally slower
The value for k can be decreased if some clusters are too small and overly specific, and increased if some clusters are too broad and diffuse.
Value for k
Wikipedia's article "Determining the number of clusters in a data set" describes various metrics for evaluating sets of candidate clusters.
https://en.wikipedia.org/w/index.php?title=Determining_the_number_of_clusters_in_a_data_set&oldid=526596002
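One common heuristic covered in that family of methods is the "elbow" idea: fit k-means for a range of k values and look for the point where the distortion stops dropping sharply. A scikit-learn sketch on synthetic data (the library choice and the data are mine):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic data with three well-separated groups of points.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2))
               for c in ((0, 0), (3, 3), (6, 0))])

# Distortion (inertia_) versus k; the drop flattens out around the "right" k.
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 2))
```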
An old cliché in statistics: correlation is not causation, meaning that just because two things co-occur doesn't mean that one causes the other.
Correlation is not causation
Syntactic similarity is not semantic similarity. Just because two things — particularly text passages — have common surface characteristics doesn’t mean they’re necessarily related semantically.
Syntactic != semantic similarity
This mixes unsupervised learning (the clustering) with supervised learning in order to create differential descriptions of the clusters.
Creating differential descriptions of the clusters
We have k clusters, so we could set up a k-class task (one class per cluster). Alternatively, we could set up k separate learning tasks, each trying to differentiate one cluster from all the other (k–1) clusters.
Approach of setting up classification task
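A sketch of the second (one-vs-rest) setup using scikit-learn (the library, feature names, and data are my choices, not the book's): cluster first, then fit a shallow tree per cluster that separates its members from everything else; each tree reads as a differential description of that cluster.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)
# Synthetic data with three groups (stand-ins for the whiskey instances).
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(40, 2))
               for c in ((0, 0), (4, 0), (2, 4))])

# Unsupervised step: find the clusters.
k = 3
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

# Supervised step, one-vs-rest: a small tree per cluster, "this cluster" vs. the rest.
for c in range(k):
    tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, labels == c)
    print(f"cluster {c}:\n{export_text(tree, feature_names=['f1', 'f2'])}")
```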
A characteristic description describes what is typical or characteristic of the cluster, ignoring whether other clusters might share some of these characteristics.
Characteristic description
A differential description describes only what differentiates this cluster from the others, ignoring the characteristics that may be shared by whiskeys within it.
Differential description
For problems where we did not achieve a precise formulation of the problem in the early stages of the data mining process, we have to spend more time later in the process, in the Evaluation stage.
No precise formulation = more time spent in the Evaluation stage
It is useful to think of a positive example as one worthy of attention or alarm, and a negative example as uninteresting or benign.
Positive example vs negative example
Expected value computation provides a framework that is extremely useful in organizing thinking about data-analytic problems. Specifically, it decomposes data-analytic thinking into (i) the structure of the problem, (ii) the elements of the analysis that can be extracted from the data, and (iii) the elements of the analysis that need to be acquired from other sources (e.g., business knowledge of subject matter experts).
Expected value computation
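A tiny worked example of that decomposition (the numbers are invented): the response probability is estimated from the data, the values come from business knowledge, and the problem structure combines them into an expected value for the decision.

```python
# Should we target consumer x with an offer?
p_response = 0.05          # from the data/model: P(response | x)
value_response = 100.0     # from business knowledge: profit if x responds
value_no_response = -1.0   # from business knowledge: cost of the contact if x does not

expected_value = p_response * value_response + (1 - p_response) * value_no_response
print(expected_value)      # 0.05*100 + 0.95*(-1) = 4.05 -> positive, so targeting x pays off
```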