Kindle Notes & Highlights
Read between August 30, 2016 and March 24, 2018
Weighted scoring has a nice consequence in that it reduces the importance of deciding how many neighbors to use. Because the contribution of each neighbor is moderated by its distance, the influence of neighbors naturally drops off the farther they are from the instance. Consequently, when using weighted scoring the exact value of k is much less critical than with majority voting or unweighted averaging.
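A minimal sketch of distance-weighted scoring, assuming scikit-learn and its built-in Iris data as a stand-in for a real training set:

# Distance-weighted k-NN: each neighbor's contribution is weighted by its
# closeness, so far-away neighbors count for little and the exact k matters less.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# weights="distance" implements the weighted scoring described above;
# weights="uniform" would be plain majority voting.
weighted_knn = KNeighborsClassifier(n_neighbors=30, weights="distance")
weighted_knn.fit(X, y)

# Class probabilities for an instance are the normalized, distance-weighted
# contributions of its 30 nearest neighbors.
print(weighted_knn.predict_proba(X[:1]))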
We can conduct cross-validation or other nested holdout testing on the training set, for a variety of different values of k, searching for the one that gives the best performance on the training data. Then, when we have chosen a value of k, we build a k-NN model from the entire training set.
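A sketch of that procedure, again assuming scikit-learn; the grid of candidate k values and the 5-fold setup are arbitrary choices for illustration:

# Choose k by cross-validation within the training set only, then refit
# a k-NN model on the entire training set with the chosen k.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 31))},  # candidate values of k
    cv=5,                       # 5-fold cross-validation inside the training set
)
search.fit(X_train, y_train)    # refit=True (the default) rebuilds on all training data

print("best k:", search.best_params_["n_neighbors"])
print("holdout accuracy:", search.best_estimator_.score(X_test, y_test))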
The boundaries for k-NN are more strongly defined by the data.
As mentioned, in some fields such as medicine and law, reasoning about similar historical cases is a natural way of coming to a decision about a new case. In such fields, a nearest-neighbor method may be a good fit. In other areas, the lack of an explicit, interpretable model may pose a problem.
Such problems are said to be high-dimensional — they suffer from the so-called curse of dimensionality — and this poses problems for nearest neighbor methods.
The main computational cost of a nearest neighbor method is borne by the prediction/classification step, when the database must be queried to find nearest neighbors of a new instance.
Because Euclidean distance employs the squares of the distances along each individual dimension, it is sometimes called the L2 norm and sometimes represented by || · ||2. Equation 6-2 shows how it looks formally.
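The highlight does not capture Equation 6-2 itself; the standard Euclidean (L2) distance between instances A = (a_1, ..., a_n) and B = (b_1, ..., b_n) that it refers to is:

d_{\text{Euclidean}}(A, B) = \|A - B\|_2 = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2 + \cdots + (a_n - b_n)^2}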
The Manhattan distance, or L1 norm, is the sum of the (unsquared) pairwise distances along each dimension.
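In the same notation:

d_{\text{Manhattan}}(A, B) = \|A - B\|_1 = |a_1 - b_1| + |a_2 - b_2| + \cdots + |a_n - b_n|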
Jaccard distance treats the two objects as sets of characteristics.
This is appropriate for problems where the possession of a common characteristic between two items is important, but the common absence of a characteristic is not.
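Writing A and B for the two objects' sets of characteristics, the Jaccard distance is one minus the fraction of characteristics they share:

d_{\text{Jaccard}}(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|}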
Cosine distance is often used in text classification to measure the similarity of two documents.
Cosine distance is particularly useful when you want to ignore differences in scale across instances — technically, when you want to ignore the magnitude of the vectors.
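Treating the two instances as vectors A and B, cosine distance depends only on the angle between them, not on their magnitudes:

d_{\text{Cosine}}(A, B) = 1 - \frac{A \cdot B}{\|A\|_2 \, \|B\|_2}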
Edit distance, or the Levenshtein metric, counts the minimum number of edit operations required to convert one string into the other, where an edit operation consists of inserting, deleting, or replacing a character.
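A small, self-contained sketch of the Levenshtein computation using the classic dynamic-programming recurrence; the example strings are just placeholders:

# Levenshtein (edit) distance: the minimum number of single-character
# insertions, deletions, and replacements needed to turn s into t.
def edit_distance(s: str, t: str) -> int:
    # prev[j] holds the edit distance between the prefix of s seen so far and t[:j]
    prev = list(range(len(t) + 1))
    for i, sc in enumerate(s, start=1):
        curr = [i]
        for j, tc in enumerate(t, start=1):
            cost = 0 if sc == tc else 1
            curr.append(min(prev[j] + 1,          # delete sc
                            curr[j - 1] + 1,      # insert tc
                            prev[j - 1] + cost))  # replace (or match for free)
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))  # 3: two replacements and one insertion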
The basic idea is that we want to find groups of objects (consumers, businesses, whiskeys, etc.), where the objects within groups are similar, but the objects in different groups are not so similar.
This diagram shows the key aspects of what is called “hierarchical” clustering. It is a clustering because it groups the points by their similarity. Notice that the only overlap between clusters is when one cluster contains other clusters. Because of this structure, the circles actually represent a hierarchy of clusterings.
The graph on the bottom of the figure is called a dendrogram, and it shows explicitly the hierarchy of the clusters.
An advantage of hierarchical clustering is that it allows the data analyst to see the groupings — the “landscape” of data similarity — before deciding on the number of clusters to extract.
So far we have discussed distance between instances. For hierarchical clustering, we need a distance function between clusters, considering individual instances to be the smallest clusters. This is sometimes called the linkage function.
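A sketch of hierarchical clustering with SciPy on synthetic two-dimensional data; the choice of "ward" as the linkage function is just one of several common options:

# Hierarchical (agglomerative) clustering: the linkage function defines the
# distance between clusters, and the dendrogram shows the resulting hierarchy.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

# "ward", "single", "complete", and "average" are common linkage functions,
# each a different way of turning instance distances into cluster distances.
Z = linkage(X, method="ward")

dendrogram(Z)   # the hierarchy of clusterings, from single points up to one big cluster
plt.show()

labels = fcluster(Z, t=2, criterion="maxclust")  # cut the dendrogram into 2 clusters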
The most common method for focusing on the clusters themselves is to represent each cluster by its “cluster center,” or centroid.
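A centroid-based sketch using k-means from scikit-learn on the same kind of synthetic data; the number of clusters is an assumption made for illustration:

# Centroid-based clustering (k-means): each cluster is represented by its
# centroid, the arithmetic mean of the instances assigned to it.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.cluster_centers_)   # the two centroids
print(kmeans.labels_[:10])       # cluster assignment of the first ten instances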
The general strategy is this: we use the cluster assignments to label examples. Each example will be given a label of the cluster it belongs to, and these can be treated as class labels. Once we have a labeled set of examples, we run a supervised learning algorithm on the example set to generate a classifier for each class/cluster. We can then inspect the classifier descriptions to get a (hopefully) intelligible and concise description of the corresponding cluster. The important thing to note is that these will be differential descriptions: for each cluster, what differentiates it from the other clusters.
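A sketch of that strategy, assuming scikit-learn, synthetic data, and placeholder feature names: cluster first, then fit a shallow decision tree to the cluster labels and read off its rules as differential descriptions.

# Differential cluster descriptions: label each example with its cluster,
# then learn a classifier whose rules describe what separates the clusters.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
feature_names = ["feature_a", "feature_b"]   # placeholder attribute names

cluster_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# A shallow tree yields a concise, interpretable, differential description:
# the splits say what sets each cluster apart from the others.
tree = DecisionTreeClassifier(max_depth=2).fit(X, cluster_labels)
print(export_text(tree, feature_names=feature_names))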
To put it another way: characteristic descriptions concentrate on intragroup commonalities, whereas differential descriptions concentrate on intergroup differences. Neither is inherently better — it depends on what you’re using it for.
We should spend as much time as we can in the business understanding/data understanding mini-cycle, until we have a concrete, specific definition of the problem we are trying to solve.
What do we do when in the business understanding phase we conclude: we would like to explore our data, possibly with only a vague notion of the exact problem we are solving? The problems to which we apply clustering often fall into this category.
We should not let our desire to be concrete and precise keep us from making important discoveries from data. But there is a trade-off: for problems where we did not achieve a precise formulation of the problem in the early stages of the data mining process, we have to spend more time later in the process, in the Evaluation stage.
Therefore, for clustering, additional creativity and business knowledge must be applied in the Evaluation stage of the data mining process.
Briefly, Haimowitz and Schwarz took this new knowledge and cycled back to the beginning of the data mining process. They used the knowledge to define a precise predictive modeling problem: using data that are available at the time of credit approval, predict the probability that a customer will fall into each of these clusters.

