Kindle Notes & Highlights
Read between August 30, 2016 and March 24, 2018
Weighted scoring has a nice consequence in that it reduces the importance of deciding how many neighbors to use. Because the contribution of each neighbor is moderated by its distance, the influence of neighbors naturally drops off the farther they are from the instance. Consequently, when using weighted scoring the exact value of k is much less critical than with majority voting or unweighted averaging.
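A minimal sketch of distance-weighted scoring, assuming scikit-learn and its built-in Iris data as a stand-in for a real training set:

# Distance-weighted k-NN: each neighbor's contribution is weighted by its
# closeness, so far-away neighbors count for little and the exact k matters less.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# weights="distance" implements the weighted scoring described above;
# weights="uniform" would be plain majority voting.
weighted_knn = KNeighborsClassifier(n_neighbors=30, weights="distance")
weighted_knn.fit(X, y)

# Class probabilities for an instance are the normalized, distance-weighted
# contributions of its 30 nearest neighbors.
print(weighted_knn.predict_proba(X[:1]))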
We can conduct cross-validation or other nested holdout testing on the training set, for a variety of different values of k, searching for the one that gives the best performance on the training data. Then, when we have chosen a value of k, we build a k-NN model from the entire training set.
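A sketch of that procedure, again assuming scikit-learn; the grid of candidate k values and the 5-fold setup are arbitrary choices for illustration:

# Choose k by cross-validation within the training set only, then refit
# a k-NN model on the entire training set with the chosen k.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 31))},  # candidate values of k
    cv=5,                       # 5-fold cross-validation inside the training set
)
search.fit(X_train, y_train)    # refit=True (the default) rebuilds on all training data

print("best k:", search.best_params_["n_neighbors"])
print("holdout accuracy:", search.best_estimator_.score(X_test, y_test))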
The boundaries for k-NN are more strongly defined by the data.
As mentioned, in some fields such as medicine and law, reasoning about similar historical cases is a natural way of coming to a decision about a new case. In such fields, a nearest-neighbor method may be a good fit. In other areas, the lack of an explicit, interpretable model may pose a problem.
Such problems are said to be high-dimensional — they suffer from the so-called curse of dimensionality — and this poses problems for nearest neighbor methods.
The main computational cost of a nearest neighbor method is borne by the prediction/classification step, when the database must be queried to find nearest neighbors of a new instance.
Because Euclidean distance employs the squares of the distances along each individual dimension, it is sometimes called the L2 norm and sometimes represented by || · ||2. Equation 6-2 shows how it looks formally.
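The highlight does not capture Equation 6-2 itself; the standard Euclidean (L2) distance between instances A = (a_1, ..., a_n) and B = (b_1, ..., b_n) that it refers to is:

d_{\text{Euclidean}}(A, B) = \|A - B\|_2 = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2 + \cdots + (a_n - b_n)^2}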
The Manhattan distance, or L1 norm, is the sum of the (unsquared) pairwise distances along each dimension.
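In the same notation:

d_{\text{Manhattan}}(A, B) = \|A - B\|_1 = |a_1 - b_1| + |a_2 - b_2| + \cdots + |a_n - b_n|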
Jaccard distance treats the two objects as sets of characteristics.
This is appropriate for problems where the possession of a common characteristic between two items is important, but the common absence of a characteristic is not.
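Writing A and B for the two objects' sets of characteristics, the Jaccard distance is one minus the fraction of characteristics they share:

d_{\text{Jaccard}}(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|}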
Cosine distance is often used in text classification to measure the similarity of two documents.
Cosine distance is particularly useful when you want to ignore differences in scale across instances — technically, when you want to ignore the magnitude of the vectors.
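Treating the two instances as vectors A and B, cosine distance depends only on the angle between them, not on their magnitudes:

d_{\text{Cosine}}(A, B) = 1 - \frac{A \cdot B}{\|A\|_2 \, \|B\|_2}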
Edit distance, or the Levenshtein metric, counts the minimum number of edit operations required to convert one string into the other, where an edit operation consists of inserting, deleting, or replacing a character.
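A small, self-contained sketch of the Levenshtein computation using the classic dynamic-programming recurrence; the example strings are just placeholders:

# Levenshtein (edit) distance: the minimum number of single-character
# insertions, deletions, and replacements needed to turn s into t.
def edit_distance(s: str, t: str) -> int:
    # prev[j] holds the edit distance between the prefix of s seen so far and t[:j]
    prev = list(range(len(t) + 1))
    for i, sc in enumerate(s, start=1):
        curr = [i]
        for j, tc in enumerate(t, start=1):
            cost = 0 if sc == tc else 1
            curr.append(min(prev[j] + 1,          # delete sc
                            curr[j - 1] + 1,      # insert tc
                            prev[j - 1] + cost))  # replace (or match for free)
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))  # 3: two replacements and one insertion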
The basic idea is that we want to find groups of objects (consumers, businesses, whiskeys, etc.), where the objects within groups are similar, but the objects in different groups are not so similar.
This diagram shows the key aspects of what is called “hierarchical” clustering. It is a clustering because it groups the points by their similarity. Notice that the only overlap between clusters is when one cluster contains other clusters. Because of this structure, the circles actually represent a hierarchy of clusterings.
The graph on the bottom of the figure is called a dendrogram, and it shows explicitly the hierarchy of the clusters.
An advantage of hierarchical clustering is that it allows the data analyst to see the groupings — the “landscape” of data similarity — before deciding on the number of clusters to extract.
So far we have discussed distance between instances. For hierarchical clustering, we need a distance function between clusters, considering individual instances to be the smallest clusters. This is sometimes called the linkage function.
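A sketch of hierarchical clustering with SciPy on synthetic two-dimensional data; the choice of "ward" as the linkage function is just one of several common options:

# Hierarchical (agglomerative) clustering: the linkage function defines the
# distance between clusters, and the dendrogram shows the resulting hierarchy.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

# "ward", "single", "complete", and "average" are common linkage functions,
# each a different way of turning instance distances into cluster distances.
Z = linkage(X, method="ward")

dendrogram(Z)   # the hierarchy of clusterings, from single points up to one big cluster
plt.show()

labels = fcluster(Z, t=2, criterion="maxclust")  # cut the dendrogram into 2 clusters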
The most common method for focusing on the clusters themselves is to represent each cluster by its “cluster center,” or centroid.
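A centroid-based sketch using k-means from scikit-learn on the same kind of synthetic data; the number of clusters is an assumption made for illustration:

# Centroid-based clustering (k-means): each cluster is represented by its
# centroid, the arithmetic mean of the instances assigned to it.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.cluster_centers_)   # the two centroids
print(kmeans.labels_[:10])       # cluster assignment of the first ten instances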
The general strategy is this: we use the cluster assignments to label examples. Each example will be given a label of the cluster it belongs to, and these can be treated as class labels. Once we have a labeled set of examples, we run a supervised learning algorithm on the example set to generate a classifier for each class/cluster. We can then inspect the classifier descriptions to get a (hopefully) intelligible and concise description of the corresponding cluster. The important thing to note is that these will be differential descriptions: for each cluster, what differentiates it from the other clusters.
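A sketch of that strategy, assuming scikit-learn, synthetic data, and placeholder feature names: cluster first, then fit a shallow decision tree to the cluster labels and read off its rules as differential descriptions.

# Differential cluster descriptions: label each example with its cluster,
# then learn a classifier whose rules describe what separates the clusters.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
feature_names = ["feature_a", "feature_b"]   # placeholder attribute names

cluster_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# A shallow tree yields a concise, interpretable, differential description:
# the splits say what sets each cluster apart from the others.
tree = DecisionTreeClassifier(max_depth=2).fit(X, cluster_labels)
print(export_text(tree, feature_names=feature_names))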
To put it another way: characteristic descriptions concentrate on intragroup commonalities, whereas differential descriptions concentrate on intergroup differences. Neither is inherently better — it depends on what you’re using it for.
We should spend as much time as we can in the business understanding/data understanding mini-cycle, until we have a concrete, specific definition of the problem we are trying to solve.
What do we do when in the business understanding phase we conclude: we would like to explore our data, possibly with only a vague notion of the exact problem we are solving? The problems to which we apply clustering often fall into this category.
We should not let our desire to be concrete and precise keep us from making important discoveries from data. But there is a trade-off: for problems where we did not achieve a precise formulation of the problem in the early stages of the data mining process, we have to spend more time later in the process, in the Evaluation stage.
Therefore, for clustering, additional creativity and business knowledge must be applied in the Evaluation stage of the data mining process.
Briefly, Haimowitz and Schwarz took this new knowledge and cycled back to the beginning of the data mining process. They used the knowledge to define a precise predictive modeling problem: using data that are available at the time of credit approval, predict the probability that a customer will fall into each of these clusters.

