Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking
Kindle Notes & Highlights
This has become the standard procedure for building numerical models that give a good balance between data fit and model complexity. This general approach to optimizing the parameter values of a data mining procedure is known as grid search.
Sidebar: Beware of “multiple comparisons”
The problem is that you randomly chose the stocks! You have no idea whether the stocks in these “best” funds performed better because they indeed are fundamentally better, or because you cherry-picked the best from a large set that simply varied in performance. If you flip 1,000 fair coins many times each, one of them will have come up heads much more than 50% of the time. However, choosing that coin as the “best” of the coins for later flipping obviously is silly. These are instances of “the problem of multiple comparisons,” a very important statistical phenomenon that business analysts and …
A learning curve shows model performance on testing data plotted against the amount of training data used. Usually model performance increases with the amount of data, but the rate of increase and the final asymptotic performance can be quite different between models.
Euclidean distance is not limited to two dimensions. If A and B were objects described by three features, they could be represented by points in three-dimensional space and their positions would then be represented as (x_A, y_A, z_A) and (x_B, y_B, z_B). The distance between A and B would then include the term (z_A − z_B)².
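The generalization to any number of features can be sketched in a few lines (the helper name `euclidean_distance` is mine, not the book's):

```python
import math

def euclidean_distance(a, b):
    """General Euclidean distance between two equal-length feature vectors:
    the square root of the sum of squared per-feature differences."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# In three dimensions the (z_A - z_B)^2 term simply joins the sum:
d = euclidean_distance((0, 0, 0), (1, 2, 2))  # sqrt(1 + 4 + 4) = 3.0
```

The same function works unchanged for two, three, or a hundred features, which is exactly the point of the general formula.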
General Euclidean distance
So the distance between these examples is about 19. This distance is just a number — it has no units, and no meaningful interpretation. It is only really useful for comparing the similarity of one pair of instances to that of another pair. It turns out that comparing similarities is extremely useful.
Generally, we can think of the procedure as weighted scoring. Weighted scoring has a nice consequence in that it reduces the importance of deciding how many neighbors to use. Because the contribution of each neighbor is moderated by its distance, the influence of neighbors naturally drops off the farther they are from the instance. Consequently, when using weighted scoring the exact value of k is much less critical than with majority voting or unweighted averaging. Some methods avoid committing to a k by retrieving a very large number of instances (e.g., all instances, k = n) and depend …
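Distance-weighted voting can be sketched as below. The inverse-distance weighting scheme and the function name are illustrative choices of mine; inverse distance is one common option among several:

```python
def weighted_vote(neighbors):
    """neighbors: list of (distance, label) pairs for one query instance.
    Each neighbor contributes weight 1/(distance + eps), so far-away
    neighbors matter less and the exact choice of k is less critical.
    Returns the label with the largest total weight."""
    eps = 1e-9  # avoids division by zero when a neighbor matches exactly
    scores = {}
    for dist, label in neighbors:
        scores[label] = scores.get(label, 0.0) + 1.0 / (dist + eps)
    return max(scores, key=scores.get)

# One very close "yes" outweighs two distant "no" neighbors:
label = weighted_vote([(0.1, "yes"), (5.0, "no"), (6.0, "no")])
```

With majority voting the same three neighbors would have produced "no"; weighting by distance reverses that, which is the behavior the highlight describes.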
Intelligibility of nearest-neighbor classifiers is a complex issue. As mentioned, in some fields such as medicine and law, reasoning about similar historical cases is a natural way of coming to a decision about a new case. In such fields, a nearest-neighbor method may be a good fit. In other areas, the lack of an explicit, interpretable model may pose a problem. There are really two aspects to this issue of intelligibility: the justification of a specific decision and the intelligibility of an entire model.
What is difficult is to explain more deeply what “knowledge” has been mined from the data. If a stakeholder asks “What did your system learn from the data about my customers? On what basis does it make its decisions?” there may be no easy answer because there is no explicit model. Strictly speaking, the nearest-neighbor “model” consists of the entire case set (the database), the distance function, and the combining function. In two dimensions we can visualize this directly as we did in the prior figures. However, this is not possible when there are many dimensions. The knowledge embedded in …
For example, in the credit card offer domain, a customer database could contain much incidental information such as number of children, length of time at job, house size, median income, make and model of car, average education level, and so on. Conceivably some of these could be relevant to whether the customer would accept the credit card offer, but probably most would be irrelevant. Such problems are said to be high-dimensional — they suffer from the so-called curse of dimensionality — and this poses problems for nearest neighbor methods. Much of the reason and effects are quite …
Another way of injecting domain knowledge into similarity calculations is to tune the similarity/distance function manually. We may know, for example, that the attribute Number of Credit Cards should have a strong influence on whether a customer accepts an offer for another one. A data scientist can tune the distance function by assigning different weights to the different attributes (e.g., giving a larger weight to Number of Credit Cards).
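Manual attribute weighting amounts to a small change to the distance function. A minimal sketch, with the function name and the specific weights as my own illustrative assumptions:

```python
import math

def weighted_euclidean(a, b, weights):
    """Euclidean distance with per-attribute weights supplied from domain
    knowledge (e.g., a large weight on a Number of Credit Cards attribute
    so that differences there dominate the distance)."""
    return math.sqrt(sum(w * (ai - bi) ** 2
                         for w, ai, bi in zip(weights, a, b)))

# Upweighting the first attribute makes a difference there count fourfold:
d = weighted_euclidean((0,), (3,), weights=(4,))  # sqrt(4 * 9) = 6.0
```

Setting every weight to 1 recovers ordinary Euclidean distance, so this is a strict generalization rather than a different method.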
Azka (note): Which tool would you use for this?
There are techniques for speeding up neighbor retrievals. Specialized data structures like kd-trees and hashing methods (Shakhnarovich, Darrell, & Indyk, 2005; Papadopoulos & Manolopoulos, 2005) are employed in some commercial database and data mining systems to make nearest neighbor queries more efficient.
The reason there are so many is that in a nearest-neighbor method the distance function is critical. It basically reduces a comparison of two (potentially complex) examples into a single number. The data types and specifics of the domain of application greatly influence how the differences in individual attributes should combine.
Manhattan distance or L1-norm
Jaccard distance treats the two objects as sets of characteristics.
Cosine distance
edit distance or the Levenshtein metric.
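The distance functions named in these highlights (Manhattan/L1, Jaccard, cosine, and edit distance) can each be sketched in a few lines. Function names and details are mine; the book gives the general definitions:

```python
import math

def manhattan(a, b):
    """L1 norm: sum of absolute per-feature differences."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

def jaccard_distance(a, b):
    """Treats the two objects as sets of characteristics: one minus the
    ratio of shared characteristics to all characteristics seen."""
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b)

def cosine_distance(a, b):
    """One minus the cosine of the angle between the two vectors;
    ignores vector length, caring only about direction."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norms = (math.sqrt(sum(ai * ai for ai in a))
             * math.sqrt(sum(bi * bi for bi in b)))
    return 1.0 - dot / norms

def levenshtein(s, t):
    """Edit distance: minimum number of single-character insertions,
    deletions, and substitutions needed to turn s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # delete from s
                           cur[j - 1] + 1,              # insert into s
                           prev[j - 1] + (cs != ct)))   # substitute
        prev = cur
    return prev[-1]
```

Which of these is appropriate depends on the data type: numeric vectors (Manhattan, cosine), sets of attributes (Jaccard), or strings (Levenshtein) — exactly the point the highlight makes about the domain driving the distance function.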
For hierarchical clustering, we need a distance function between clusters, considering individual instances to be the smallest clusters. This is sometimes called the linkage function. So, for example, the linkage function could be “the Euclidean distance between the closest points in each of the clusters,” which would apply to any two clusters.
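The example linkage function quoted above ("the Euclidean distance between the closest points in each of the clusters") is what is usually called single linkage; a minimal sketch, with names of my choosing:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(cluster_a, cluster_b):
    """Linkage function: distance between the CLOSEST pair of points,
    one drawn from each cluster. Individual instances are simply
    clusters of size one, so this applies at every level of the merge."""
    return min(euclidean(p, q) for p in cluster_a for q in cluster_b)

# Closest cross-cluster pair is (1,0)-(4,0), so the linkage is 3.0:
d = single_linkage([(0, 0), (1, 0)], [(4, 0), (9, 9)])
```

Replacing `min` with `max` would give complete linkage; averaging over all pairs would give average linkage. The choice of linkage shapes the dendrogram.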
Also notice point F in the dendrogram. Whenever a single point merges high up in a dendrogram, this is an indication that it is different from the rest; we might call it an “outlier” and want to investigate it.
Using Supervised Learning to Generate Cluster Descriptions
Lapointe and Legendre’s is a characteristic description; it describes what is typical or characteristic of the cluster, ignoring whether other clusters might share some of these characteristics. The one generated by the decision tree is a differential description; it describes only what differentiates this cluster from the others, ignoring the characteristics that may be shared by whiskeys within it. To put it another way: characteristic descriptions concentrate on intragroup commonalities, whereas differential descriptions concentrate on intergroup differences. Neither is inherently better — …
There is a direct trade-off in where and how effort is expended in the data mining process. For the supervised problems, since we spent so much time defining precisely the problem we were going to solve, in the Evaluation stage of the data mining process we already have a clear-cut evaluation question: do the results of the modeling seem to solve the problem we have defined? For example, if we had defined our goal as improving prediction of defection when a customer’s contract is about to expire, we could assess whether our model has done this. In contrast, unsupervised problems often are much …
they settled on five clusters that represented very different consumer credit behavior (e.g., those who spend a lot but pay off their cards in full each month versus those who spend a lot and keep their balance near their credit limit). These different sorts of customers can tolerate very different credit lines (in the two examples, extra care must be taken with the latter to avoid default). The problem with using this clustering immediately for decision making is that the data are not available when the initial credit line is set. Briefly, Haimowitz and Schwarz took this new knowledge and …
Another problem with simple classification accuracy as a metric is that it makes no distinction between false positive and false negative errors. By counting them together, it makes the tacit assumption that both errors are equally important.
A Key Analytical Framework: Expected Value
With these example values, we should target the consumer as long as the estimated probability of responding is greater than 1%.
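The 1% threshold falls out of the expected-value calculation. A sketch, assuming for illustration a $99 profit from a responding consumer and a $1 cost of targeting a non-responder (values in the spirit of the chapter's example; the function name is mine):

```python
def expected_profit_of_targeting(p_response,
                                 value_if_respond=99.0,
                                 value_if_no_respond=-1.0):
    """Expected value of targeting one consumer:
    p * v_R + (1 - p) * v_NR. Target whenever this is positive."""
    return (p_response * value_if_respond
            + (1 - p_response) * value_if_no_respond)

ev_low = expected_profit_of_targeting(0.005)   # negative: don't target
ev_high = expected_profit_of_targeting(0.02)   # positive: target
```

Setting the expression to zero and solving gives the break-even probability 1/(99 + 1) = 1%, which is why any consumer with an estimated response probability above 1% is worth targeting under these values.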
A common way of expressing expected profit is to factor out the probabilities of seeing each class, often referred to as the class priors. The class priors, p(p) and p(n), specify the likelihood of seeing positive and negative instances, respectively. Factoring these out allows us to separate the influence of class imbalance from the fundamental predictive power of the model,
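The factored form can be sketched as follows. The signature and grouping are my own illustration of the idea; the rates are conditioned on the true class, so the class balance enters only through the prior:

```python
def expected_profit(rates_pos, rates_neg, p_pos, benefits):
    """Expected profit with the class priors factored out:
      p(p) * [tp_rate*b_tp + fn_rate*b_fn]
    + p(n) * [fp_rate*b_fp + tn_rate*b_tn]
    rates_pos = (tp_rate, fn_rate), conditioned on the true class being
    positive; rates_neg = (fp_rate, tn_rate), conditioned on negative.
    Changing the class imbalance changes only p_pos, not the rates."""
    tp_rate, fn_rate = rates_pos
    fp_rate, tn_rate = rates_neg
    b_tp, b_fn, b_fp, b_tn = benefits
    p_neg = 1.0 - p_pos
    return (p_pos * (tp_rate * b_tp + fn_rate * b_fn)
            + p_neg * (fp_rate * b_fp + tn_rate * b_tn))

# A perfect classifier with 30% positives, $10 per true positive:
ev = expected_profit((1.0, 0.0), (0.0, 1.0), 0.3, (10.0, -5.0, -2.0, 0.0))
```

The practical payoff of factoring is that you can re-estimate profit for a different deployment population simply by plugging in a new `p_pos`, without retraining or re-evaluating the model.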
weather forecasters have two simple — but not simplistic — baseline models that they compare against. One (persistence) predicts that the weather tomorrow is going to be whatever it was today. The other (climatology) predicts whatever the average historical weather has been on this day from prior years. Each model performs considerably better than random guessing, and both are so easy to compute that they make natural baselines of comparison. Any new, more complex model must beat these.
What are some general guidelines for good baselines?
For classification tasks, one good baseline is the majority classifier, a naive classifier that always chooses the majority class.
For regression problems we have a directly analogous baseline: predict the average value over the population (usually the mean or median). In some applications there are multiple simple averages that one may want to combine. For example, when evaluating recommendation systems that internally predict how many “stars” a particular customer would give to a particular movie, we have the average number of stars a movie gets across the population (how well liked it is) and the average number of stars a particular customer gives to movies (what that customer’s overall bias is).
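One simple way to combine the two averages in the movie example is to add each as an offset to the global mean. This additive scheme is a common choice but an assumption of mine, as are all the names; the book only says the averages may be combined:

```python
def baseline_prediction(global_mean, movie_avgs, user_avgs, user, movie):
    """Baseline star prediction from simple averages:
    global mean + movie's offset (how well liked the movie is)
                + user's offset (that user's overall rating bias).
    Unseen users or movies fall back to the global mean (offset 0)."""
    movie_offset = movie_avgs.get(movie, global_mean) - global_mean
    user_offset = user_avgs.get(user, global_mean) - global_mean
    return global_mean + movie_offset + user_offset

# A well-liked movie (+0.5) rated by a harsh user (-0.5):
pred = baseline_prediction(3.5, {"Up": 4.0}, {"ann": 3.0}, "ann", "Up")
```

Despite its simplicity, a baseline of this shape is notoriously hard to beat in recommendation settings, which is exactly why it makes a good comparison point.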
Moving beyond these simple baseline models, a slightly more complex alternative is a model that only considers a very small amount of feature information.
One example of mining such single-feature predictive models from data is to use tree induction to build a “decision stump” — a decision tree with only one internal node, the root node. A tree limited to one internal node simply means that the tree induction selects the single most informative feature to make a decision. In a well-known paper in machine learning, Robert Holte (1993) showed that decision stumps often produce quite good baseline performance on many of the test datasets used in machine learning research. A decision stump is an example of the strategy of choosing the single most …
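A decision stump in the 1R style of Holte's paper can be sketched as below. I select the feature by raw training accuracy for brevity; real tree induction would use an informativeness measure such as information gain:

```python
from collections import Counter, defaultdict

def fit_stump(rows, labels):
    """1R-style decision stump: for each feature, predict the majority
    label per feature value; keep the single feature whose rule has the
    best training accuracy. Returns (feature_index, value -> label)."""
    best = None
    for f in range(len(rows[0])):
        table = defaultdict(Counter)
        for row, y in zip(rows, labels):
            table[row[f]][y] += 1
        rule = {v: c.most_common(1)[0][0] for v, c in table.items()}
        acc = sum(rule[row[f]] == y
                  for row, y in zip(rows, labels)) / len(rows)
        if best is None or acc > best[0]:
            best = (acc, f, rule)
    _, f, rule = best
    return f, rule

rows = [("sunny", "hot"), ("sunny", "cold"),
        ("rainy", "hot"), ("rainy", "cold")]
labels = ["yes", "yes", "no", "no"]
feature, rule = fit_stump(rows, labels)  # weather separates perfectly
```

Here the first feature predicts the label perfectly while the second is uninformative, so the stump roots on feature 0, illustrating "the single most informative feature."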
If you are considering building models that integrate data from various sources, you should compare the result to models built from the individual sources. Often there are substantial costs to acquiring new sources of data.
Beyond comparing simple models (and reduced-data models), it is often useful to implement simple, inexpensive models based on domain knowledge or “received wisdom” and evaluate their performance. For example, in one fraud detection application it was commonly believed that most defrauded accounts would experience a sudden increase in usage, and so checking accounts for sudden jumps in volume was sufficient for catching a large proportion of fraud.
Profit curves are appropriate when you know fairly certainly the conditions under which a classifier will be used. Specifically, there are two critical conditions underlying the profit calculation: The class priors; that is, the proportion of positive and negative instances in the target population, also known as the base rate (usually referring to the proportion of positives). Recall that Equation 7-2 is sensitive to p(p) and p(n). The costs and benefits. The expected profit is specifically sensitive to the relative levels of costs and benefits for the different cells of the cost-benefit …
A ROC graph is a two-dimensional plot of a classifier with false positive rate on the x axis against true positive rate on the y axis. As such, a ROC graph depicts relative trade-offs that a classifier makes between benefits (true positives) and costs (false positives).
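A classifier's position in ROC space follows directly from its confusion-matrix counts; a minimal sketch (function name mine):

```python
def roc_point(tp, fn, fp, tn):
    """One classifier's point in ROC space:
    x = false positive rate = fp / all true negatives,
    y = true positive rate  = tp / all true positives."""
    return fp / (fp + tn), tp / (tp + fn)

# 80 of 100 positives caught, 10 of 100 negatives falsely flagged:
x, y = roc_point(tp=80, fn=20, fp=10, tn=90)  # (0.1, 0.8)
```

The ideal corner is (0, 1): every positive caught (benefit) with no false alarms (cost); points closer to it dominate points farther away.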
An advantage of ROC graphs is that they decouple classifier performance from the conditions under which the classifiers will be used. Specifically, they are independent of the class proportions as well as the costs and benefits.
One of the most common examples of the use of an alternate visualization is the use of the “cumulative response curve,” rather than the ROC curve. They are closely related, but the cumulative response curve is more intuitive. Cumulative response curves plot the hit rate (tp rate; y axis), i.e., the percentage of positives correctly classified, as a function of the percentage of the population that is targeted (x axis).
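The points of a cumulative response curve can be computed by ranking the test instances by model score and sweeping a targeting cutoff down the ranking. A sketch with names of my choosing; labels are 1 for positive, 0 for negative:

```python
def cumulative_response(scores_and_labels):
    """Points of a cumulative response curve: sort by model score
    (highest first); after targeting the top k of n instances, plot
    (k / n, positives found so far / all positives)."""
    ranked = sorted(scores_and_labels, key=lambda sl: -sl[0])
    total_pos = sum(label for _, label in ranked)
    n = len(ranked)
    points, found = [], 0
    for k, (_, label) in enumerate(ranked, 1):
        found += label
        points.append((k / n, found / total_pos))
    return points

pts = cumulative_response([(0.9, 1), (0.8, 0), (0.7, 1), (0.2, 0)])
```

A good model's curve bows above the diagonal: targeting the top-scored fraction of the population captures a disproportionate share of the positives, which is the "hit rate versus percentage targeted" reading the highlight describes.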
Both lift curves and cumulative response curves must be used with care if the exact proportion of positives in the population is unknown or is not represented accurately in the test data. Unlike for ROC curves, these curves assume that the test set has exactly the same target class priors as the population to which the model will be applied.
Azka (note): Isn’t this usually written with a comma, i.e., p(A,B)?
Practitioners do use Naive Bayes regularly for ranking, where the actual values of the probabilities are not relevant — only the relative values for examples in the different classes.
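Why uncalibrated probabilities still rank correctly: the Naive Bayes score for a class is a product of a prior and per-feature likelihoods, and taking logs (or dropping the shared normalizer) changes the values but not the ordering. A sketch with illustrative names:

```python
import math

def nb_score(features, prior, likelihoods):
    """Unnormalized Naive Bayes log-score for one class:
    log p(c) + sum of log p(f | c) over the observed features.
    The absolute value is not a calibrated probability, but sorting
    examples by this score preserves their ranking for the class."""
    return math.log(prior) + sum(math.log(likelihoods[f])
                                 for f in features)

# Stronger evidence for the feature yields a higher score:
hi = nb_score(["clicked_ad"], prior=0.5, likelihoods={"clicked_ad": 0.3})
lo = nb_score(["clicked_ad"], prior=0.5, likelihoods={"clicked_ad": 0.1})
```

This is why the independence assumption being wrong often does little harm in ranking applications: errors distort the score magnitudes far more than the order.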
This chapter introduced a new family of methods that essentially turns the question around and asks: “How do different target segments generate feature values?” They attempt to model how the data was generated. In the use phase, when faced with a new example to be classified, they use the models to answer the question: “Which class most likely generated this example?” Thus, in data science this approach to modeling is called generative. The large family of popular methods known as Bayesian methods, because they depend critically on Bayes’ Rule, are usually generative methods.
General methods for creating topic models include matrix factorization methods, such as Latent Semantic Indexing, and probabilistic topic models, such as Latent Dirichlet Allocation.
Equation 11-1. VT decomposition, where Δp is the difference in the predicted probabilities of staying, depending on whether the customer is targeted or not. Again we see an intuitive result: we want to target those customers with the greatest change in their probability of staying, moderated by their value if they were to stay!
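The reasoning can be sketched numerically. The function name, the cost term, and the example values are my own illustration of the "change in probability times value" idea:

```python
def value_of_targeting(p_stay_if_targeted, p_stay_if_not,
                       value_if_stays, cost_of_targeting=0.0):
    """Expected value of targeting one customer for retention:
    (difference in probability of staying) * (value if they stay),
    minus any cost of the retention action itself."""
    delta_p = p_stay_if_targeted - p_stay_if_not
    return delta_p * value_if_stays - cost_of_targeting

# A customer whose staying probability the offer moves a lot is worth
# more to target than one it barely moves, at equal customer value:
big_move = value_of_targeting(0.9, 0.6, value_if_stays=100.0)
small_move = value_of_targeting(0.9, 0.85, value_if_stays=100.0)
```

Note the counterintuitive consequence: a loyal customer whose probability of staying is high with or without the offer (Δp near zero) is a poor target, no matter how valuable they are.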
Finally, we would like the association to be in some sense “surprising.” There are many notions of surprisingness that have been pursued in data mining, but unfortunately most of them involve matching the discovered knowledge to our prior background knowledge, intuition, and common sense. In other words, an association is surprising if it contradicts something we already knew or believed. Researchers study how to address this difficult-to-codify knowledge, but dealing with it automatically is not common in practice. Instead, data scientists and business users pore over long lists of …
Louis Pasteur famously wrote, “Fortune favors the prepared mind.” Modern thinking on creativity focuses on the juxtaposition of a new way of thinking with a mind “saturated” with a particular problem.
We also must embrace the fact that data science is in part a craft. Analytical expertise takes time to acquire, and all the great books and video lectures alone will not turn someone into a master. The craft is learned by experience. The most effective learning path resembles that in the classic trades: aspiring data scientists work as apprentices to masters. This could be in a graduate program with a top applications-oriented professor, in a postdoctoral program, or in industry working with one of the best industrial data scientists.
Data incorporate the beliefs, purposes, biases, and pragmatics of those who designed the data collection systems. The meaning of data is colored by our own beliefs.