Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking
34%
To incorporate this concern, nearest-neighbor methods often use weighted voting or similarity moderated voting such that each neighbor’s contribution is scaled by its similarity.
ElvinOuyang
use similarity-scaled (weighted) voting so that each neighbor's contribution reflects how close it is
34%
scoring. Weighted scoring has a nice consequence in that it reduces the importance of deciding how many neighbors to use. Because the contribution of each neighbor is moderated by its distance, the influence of neighbors naturally drops off the farther they are from the instance. Consequently, when using weighted scoring the exact value of k is much less critical than with majority voting or unweighted averaging.
ElvinOuyang
weighted scoring makes the result less sensitive to the exact choice of k than majority voting or unweighted averaging
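To make the weighted-voting idea concrete, here is a minimal sketch (not from the book) using scikit-learn's KNeighborsClassifier; the synthetic data and the choice of k=15 are illustrative:

```python
# Sketch of similarity-moderated voting: weights="distance" scales each
# neighbor's vote by 1/distance, so closer neighbors contribute more and the
# exact value of k matters less.
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf = KNeighborsClassifier(n_neighbors=15, weights="distance")
clf.fit(X, y)
print(clf.predict(X[:3]))        # weighted-vote class predictions
print(clf.predict_proba(X[:3]))  # similarity-weighted class scores
```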
35%
we can conduct cross-validation or other nested holdout testing on the training set, for a variety of different values of k, searching for one that gives the best performance on the training data. Then when we have chosen a value of k, we build a k-NN model from the entire training set.
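A minimal sketch of this procedure, assuming scikit-learn (the dataset and the 1-30 search range are illustrative): GridSearchCV cross-validates over candidate values of k on the training set only, then refits the chosen model on the entire training set.

```python
# Pick k by cross-validation on the training data, then build the final
# k-NN model from the whole training set (GridSearchCV refits automatically).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
param_grid = {"kneighborsclassifier__n_neighbors": list(range(1, 31))}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)      # cross-validation uses the training set only
print(search.best_params_, search.score(X_test, y_test))
```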
35%
The boundaries for k-NN are more strongly defined by the data.
35%
As mentioned, in some fields such as medicine and law, reasoning about similar historical cases is a natural way of coming to a decision about a new case. In such fields, a nearest-neighbor method may be a good fit. In other areas, the lack of an explicit, interpretable model may pose a problem.
35%
What is difficult is to explain more deeply what “knowledge” has been mined from the data.
ElvinOuyang
Challenges in articulating what "knowledge" has been mined by nearest-neighbor models
35%
numeric attributes may have vastly different ranges, and unless they are scaled appropriately the effect of one attribute with a wide range can swamp the effect of another with a much smaller range.
ElvinOuyang
differences in variable ranges matter because the distance calculation is based on the raw attribute values
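A tiny numeric illustration (the age and income values, and their assumed ranges, are made up) of how an attribute with a wide range swamps one with a narrow range until both are rescaled:

```python
# A wide-range attribute (income in dollars) dominates a narrow-range one
# (age in years) in Euclidean distance unless the attributes are scaled.
import numpy as np

a = np.array([35, 50_000.0])   # [age, income]
b = np.array([60, 51_000.0])

print(np.linalg.norm(a - b))               # ~1000.3, driven almost entirely by income

# After min-max scaling both attributes to [0, 1] (assumed ranges for the demo):
ranges = np.array([[20, 70], [20_000, 120_000]])
scale = lambda v: (v - ranges[:, 0]) / (ranges[:, 1] - ranges[:, 0])
print(np.linalg.norm(scale(a) - scale(b)))  # ~0.5, age and income now both contribute
```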
35%
Such problems are said to be high-dimensional — they suffer from the so-called curse of dimensionality — and this poses problems for nearest neighbor methods.
35%
The main computational cost of a nearest neighbor method is borne by the prediction/classification step, when the database must be queried to find nearest neighbors of a new instance.
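A minimal sketch, assuming scikit-learn's NearestNeighbors, of why the query step carries the cost: fitting mostly stores or indexes the training database, while each new instance triggers a neighbor search (index structures such as ball trees or k-d trees can speed this up).

```python
# The expensive step is the per-query neighbor search over the stored data,
# not the "training" step. Data sizes here are illustrative.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100_000, 10))

index = NearestNeighbors(n_neighbors=5, algorithm="ball_tree").fit(X_train)

x_new = rng.normal(size=(1, 10))
distances, neighbor_ids = index.kneighbors(x_new)   # the costly step, per new instance
print(neighbor_ids)
```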
36%
For this reason nearest-neighbor-based systems often have variable-scaling front ends. They measure the ranges of variables and scale values accordingly, or they apportion values to a fixed number of bins.
ElvinOuyang
Be sure to add a variable-scaling (or binning) front end before applying nearest-neighbor methods
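Two possible front ends of this kind, sketched with scikit-learn pipelines; the scaler choice and bin count are illustrative, not prescribed by the book.

```python
# A "variable-scaling front end" for k-NN: either rescale each variable by its
# range, or apportion values into a fixed number of bins, before measuring distance.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer, MinMaxScaler

scaled_knn = make_pipeline(MinMaxScaler(), KNeighborsClassifier())
binned_knn = make_pipeline(
    KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="uniform"),
    KNeighborsClassifier(),
)
# Both pipelines learn the ranges/bins from the training data inside fit(),
# so the same transformation is applied consistently at prediction time.
```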
36%
Because it employs the squares of the distances along each individual dimension, it is sometimes called the L2 norm and sometimes represented by || · ||₂. Equation 6-2 shows how it looks formally.
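Equation 6-2 itself is not reproduced in this highlight; the Euclidean (L2) distance it refers to has the standard form:

```latex
d_{\mathrm{Euclidean}}(A, B) = \lVert A - B \rVert_2
  = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2 + \cdots + (a_n - b_n)^2}
```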
36%
The Manhattan distance or L1-norm is the sum of the (unsquared) pairwise distances,
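In the same notation, the Manhattan (L1) distance is:

```latex
d_{\mathrm{Manhattan}}(A, B) = \lVert A - B \rVert_1
  = |a_1 - b_1| + |a_2 - b_2| + \cdots + |a_n - b_n|
```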
36%
Jaccard distance. Jaccard distance treats the two objects as sets of characteristics.
36%
This is appropriate for problems where the possession of a common characteristic between two items is important, but the common absence of a characteristic is not.
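A minimal sketch of Jaccard distance over sets of characteristics (the whiskey flavor sets are made up for illustration); note that a characteristic absent from both sets simply never appears, so shared absences do not increase similarity.

```python
# Jaccard distance: 1 minus the ratio of shared characteristics to all
# characteristics possessed by either object.
def jaccard_distance(a: set, b: set) -> float:
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

whiskey_x = {"smoky", "peaty", "dry"}
whiskey_y = {"smoky", "sweet", "dry"}
print(jaccard_distance(whiskey_x, whiskey_y))  # 1 - 2/4 = 0.5
```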
36%
Cosine distance is often used in text classification to measure the similarity of two documents.
36%
Cosine distance is particularly useful when you want to ignore differences in scale across instances — technically, when you want to ignore the magnitude of the vectors.
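A minimal sketch showing why cosine distance ignores magnitude: doubling every term count in a document leaves the distance essentially zero (the term-count vectors are made up).

```python
# Cosine distance depends only on the angle between vectors, not their length,
# so a short and a long document with the same word proportions look identical.
import numpy as np

def cosine_distance(x, y):
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

doc_a = np.array([3.0, 0.0, 2.0])      # term counts
doc_b = np.array([6.0, 0.0, 4.0])      # same proportions, twice the length
print(cosine_distance(doc_a, doc_b))   # ~0.0: magnitude is ignored
```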
36%
edit distance or the Levenshtein metric. This metric counts the minimum number of edit operations required to convert one string into the other, where an edit operation consists of either inserting, deleting, or replacing a character
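A minimal dynamic-programming sketch of the Levenshtein metric (a standard implementation, not the book's code):

```python
# Levenshtein distance: minimum number of single-character insertions,
# deletions, and replacements needed to turn s into t.
def levenshtein(s: str, t: str) -> int:
    prev = list(range(len(t) + 1))           # distance from "" to each prefix of t
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,                 # delete cs
                curr[j - 1] + 1,             # insert ct
                prev[j - 1] + (cs != ct),    # replace (or match for free)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))      # 3
```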
37%
The basic idea is that we want to find groups of objects (consumers, businesses, whiskeys, etc.), where the objects within groups are similar, but the objects in different groups are not so similar.
37%
This diagram shows the key aspects of what is called “hierarchical” clustering. It is a clustering because it groups the points by their similarity. Notice that the only overlap between clusters is when one cluster contains other clusters. Because of this structure, the circles actually represent a hierarchy of clusterings.
37%
The graph on the bottom of the figure is called a dendrogram, and it shows explicitly the hierarchy of the clusters.
37%
An advantage of hierarchical clustering is that it allows the data analyst to see the groupings — the “landscape” of data similarity — before deciding on the number of clusters to extract.
37%
So far we have discussed distance between instances. For hierarchical clustering, we need a distance function between clusters, considering individual instances to be the smallest clusters. This is sometimes called the linkage function.
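A minimal sketch of hierarchical clustering with SciPy, where the method argument is the linkage function between clusters; the synthetic data and the choice of average linkage are illustrative.

```python
# Agglomerative hierarchical clustering: linkage() builds the hierarchy,
# dendrogram() draws the "landscape" of similarity, fcluster() cuts the tree.
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 2)) for c in (0, 3, 6)])

Z = linkage(X, method="average", metric="euclidean")  # other linkages: "single", "complete", "ward"
dendrogram(Z)                                          # inspect the hierarchy before choosing k
plt.show()

labels = fcluster(Z, t=3, criterion="maxclust")        # then cut the tree into 3 clusters
```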
38%
The most common method for focusing on the clusters themselves is to represent each cluster by its “cluster center,” or centroid.
38%
numeric measure such as the clusters’ distortion, which is the sum of the squared differences between each data point and its corresponding centroid. In the latter case, the clustering with the lowest distortion value can be deemed the best clustering.
ElvinOuyang
cluster centroids, and distortion (sum of squared distances to centroids) as a measure of clustering quality
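A minimal sketch, assuming scikit-learn's KMeans, where the inertia_ attribute corresponds to the distortion described above; the synthetic data and the range of k values are illustrative.

```python
# Centroid-based clustering: each cluster is represented by its centroid, and
# inertia_ is the distortion (sum of squared distances of points to their centroid).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 4, 8)])

for k in (2, 3, 4, 5):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)   # lower distortion means tighter clusters (for a given k)
```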
40%
The general strategy is this: we use the cluster assignments to label examples. Each example will be given a label of the cluster it belongs to, and these can be treated as class labels. Once we have a labeled set of examples, we run a supervised learning algorithm on the example set to generate a classifier for each class/cluster. We can then inspect the classifier descriptions to get a (hopefully) intelligible and concise description of the corresponding cluster. The important thing to note is that these will be differential descriptions: for each cluster, what differentiates it from the …
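A minimal sketch of this strategy (the Iris data, k=3, and the shallow decision tree are illustrative): cluster, treat the cluster assignments as class labels, then fit an interpretable classifier and read off differential descriptions.

```python
# Label each example with its cluster, then train an interpretable classifier
# whose rules describe what distinguishes each cluster from the others.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
cluster_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(data.data)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(data.data, cluster_labels)
print(export_text(tree, feature_names=list(data.feature_names)))  # differential descriptions
```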
41%
To put it another way: characteristic descriptions concentrate on intragroup commonalities, whereas differential descriptions concentrate on intergroup differences. Neither is inherently better — it depends on what you’re using it for.
41%
We should spend as much time as we can in the business understanding/data understanding mini-cycle, until we have a concrete, specific definition of the problem we are trying to solve.
41%
What do we do when in the business understanding phase we conclude: we would like to explore our data, possibly with only a vague notion of the exact problem we are solving? The problems to which we apply clustering often fall into this category.
41%
We should not let our desire to be concrete and precise keep us from making important discoveries from data. But there is a trade-off. The tradeoff is that for problems where we did not achieve a precise formulation of the problem in the early stages of the data mining process, we have to spend more time later in the process — in the Evaluation stage.
41%
Therefore, for clustering, additional creativity and business knowledge must be applied in the Evaluation stage of the data mining process.
41%
Briefly, Haimowitz and Schwarz took this new knowledge and cycled back to the beginning of the data mining process. They used the knowledge to define a precise predictive modeling problem: using data that are available at the time of credit approval, predict the probability that a customer will fall into each of these clusters.