Confident Data Skills: Master the Fundamentals of Working with Data and Supercharge Your Career (Confident Series)
9%
You do not need to be a lifelong learner to master the principles of data science. What you really need is an ability to think about the various ways in which one or more questions – about business operations, about personal motivations – might be asked of data. Because data scientists are there to examine the possibilities of the information they have been given. You may be surprised to know that you already have some skills and experience that you can leverage in your journey to mastering the discipline.
10%
When you interacted with each of these touchpoints, you left a little bit of data about yourself behind. We call this ‘data exhaust’. It isn’t confined to your online presence, nor is it only for the social media generation. Whether or not you use social media platforms, whether you like it or not, you’re contributing data.
10%
Quite simply, data is any unit of information. It is the by-product of any and every action, pervading every part of our lives, not just within the sphere of the internet, but also in history, place and culture.
10%
Let’s say that, within this definition of data as a unit of information, data is the tangible past. This is quite profound when you think about it. Data is the past, and the past is data. The record of things to which data contributes is called a database. And data scientists can use it to better understand our present and future operations. They’re applying the very same principle that historians have been telling us about for ages: we can learn from history. We can learn from our successes – and our mistakes – in order to improve the present and future.
10%
All things considered, you might start to ask what the limits to the definition of data are. Does factual evidence about a plant’s flowering cycle (quantitative data) count as data as much as the scientist’s recording of the cultural stigma associated with giving a bunch of those flowers to a dying relative in their native country (qualitative data)? The answer is yes. Data doesn’t discriminate. It doesn’t matter whether the unit of information collected is quantitative or qualitative. Qualitative data may have been less usable in the past when the technology wasn’t sophisticated enough to process it, but thanks …
11%
Put very simply, big data is the name given to datasets with columns and rows so considerable in number that they cannot be captured and processed by conventional hardware and software within a reasonable length of time.
17%
Data science always builds upon what has been left behind, the information of our past. It is this ability to crowdsource data that makes the application of data science in the discipline of medicine so powerful – for as long as the data remains, the gathered knowledge will not be dependent on individuals.
22%
Not only is data science simple to pick up, it is also beneficial to have come to work in the discipline after having a grounding in another.
23%
A ‘raw’ data scientist may be able to play with the material, but a data scientist with the right background will be able to ask the right questions of the project in order to produce truly interesting results.
23%
This is another crucial factor for data scientists: if you want to be able to run a data science project, you will need to be able to speak to the right people. That will often mean asking around, outside your team and potential comfort zone. The data won’t tell you anything unless you ask the right questions, so it is your job to get out there and find answers from the people who have contributed towards your data.
25%
With business intelligence, you are required to identify the business question, find the relevant data and both visualize and present it in a compelling way to investors and stakeholders. Those are already four of the five stages of the Data Science Process, to which we will return in Parts II and III. The main exception is that BI does not carry out detailed, investigative analyses on the data. It simply describes what has happened, in a process that we call ‘descriptive analytics’.
28%
The process has five stages, which are:
1. Identify the question.
2. Prepare the data.
3. Analyse the data.
4. Visualize the insights.
5. Present the insights.
30%
Before we jump into our project, then, we must speak with the person who presented us with the problem in the first place. Understanding not only what the problem is but also why it must be resolved now, who its key stakeholders are, and what it will mean for the institution when it is resolved, will help you to start refining your investigation.
31%
In consulting, we outline the possible strategic approaches for our business. Consultants tend to be people who have worked in the business or the industry for several years, and they will have accrued a lot of knowledge about the sector. These people are often concerned with improving the large-scale strategic and organizational aspects of a company, which demands a top-down approach – and this methodology to analyse the big picture requires certain assumptions about a given problem to be made.
34%
Simply put, quantitative methods gather numerical information, while qualitative methods gather non-numerical information.
35%
The one thing that I insist on at the outset of any data science project is to make sure that you get stakeholder buy-in in writing. You may be the best of mates in your personal life, but in my experience, stakeholders have a tendency to change their concept of what they want as the project develops, no matter who they are. It’s understandable for undefined projects, but this contributes to scope creep, which can either overwork you beyond the original parameters of the project or kill the project completely. So before you move on to preparing your data, get that confirmation in writing.
35%
Give yourself ample time to meet the deadline. It is far better to under-promise and over-deliver than to do the opposite. A good rule of thumb is to work out how many days you think the project will take as a whole and then add 20 per cent to that number. More often than not in data science, you barely meet your deadline. And if you come up against any obstacles and you think that you may not meet the date on which you had initially agreed, make sure to tell the person who needs to know as early as possible. Keeping people informed will build trust between you and your stakeholders, and keep …
35%
Always remember that the essential job of a data scientist is to derive value for the company. This defining criterion for success will also serve as an aid for you when you tell people why you are not taking on their project. How do you say no? First, refrain from responding to a project idea immediately. Tell the person with whom you are speaking that you will think about it and that you will get back to them within a certain timeframe (I give myself two working days to mull a request over). If you want to take a more diplomatic pathway, one trick is to make sure that you are not considered …
44%
Put simply, we use classification when we already know the groups into which we want an analysis to place our data, and we use clustering when we do not know what the groups will be, in terms of either number or name.
44%
The following classification algorithms have been organized in order of difficulty. We will begin with decision trees, as many readers will already be familiar with flowcharts – they both use the same principle of splitting information into individual steps before presenting the participant with a final response. Random forest classification is simply an expansion of the decision trees algorithm, for it uses multiple decision trees for individual components of a dataset in order to provide more accurate results. Both the K-nearest neighbours and Naive Bayes algorithms classify data points into …
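To make the flowchart comparison concrete, here is a minimal sketch of a decision tree classifier in Python. The book prescribes no particular tool, so the use of scikit-learn, the feature names and the tiny dataset below are all invented for illustration.

```python
# Hypothetical example: classify customers as 'low' or 'high' spenders.
# Library choice (scikit-learn), features and data are assumptions, not
# anything specified in the book.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[23, 150], [35, 900], [41, 1200], [19, 80], [52, 2000], [30, 400]]  # [age, yearly_spend]
y = ["low", "high", "high", "low", "high", "low"]                        # known classes

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# The fitted tree is a flowchart of yes/no questions about the features.
print(export_text(tree, feature_names=["age", "yearly_spend"]))
print(tree.predict([[28, 700]]))  # classify a new, unseen customer
```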
45%
Random forest classification builds upon the principles of decision trees through ensemble learning. Instead of there being only one tree, a random forest will use many different trees to make the same prediction, taking the average of the individual trees’ results.
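A possible sketch of the same idea in code, again assuming scikit-learn and the invented customer data from the decision tree example: the forest trains many trees on random slices of the data and combines their individual answers.

```python
# Hypothetical example: the same toy data, now classified by an ensemble.
from sklearn.ensemble import RandomForestClassifier

X = [[23, 150], [35, 900], [41, 1200], [19, 80], [52, 2000], [30, 400]]  # [age, yearly_spend]
y = ["low", "high", "high", "low", "high", "low"]

# 100 decision trees, each fitted on a random bootstrap sample of the data.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

print(forest.predict([[28, 700]]))        # combined (majority) prediction
print(forest.predict_proba([[28, 700]]))  # class probabilities averaged over the trees
```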
46%
For projects that use relatively little data, using a random forest algorithm will not give optimum results because it will unnecessarily subdivide your data. In these scenarios, you would use a decision tree, which will give you fast, straightforward interpretations of your data. But if you are working with a large dataset, random forest will give a more accurate, though less interpretable prediction.
46%
K-nearest neighbours uses patterns in the data to place new data points in the relevant categories.
46%
K-NN analyses ‘likeness’. It will work by calculating the distance between your new data point and the existing data points.
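A minimal K-NN sketch, assuming scikit-learn and a handful of invented two-dimensional points: the new observation takes the class that dominates among its K nearest neighbours, found by computing the distance to every existing point.

```python
# Hypothetical example: two clouds of points labelled 'A' and 'B'.
from sklearn.neighbors import KNeighborsClassifier

X = [[1.0, 1.2], [0.8, 1.0], [5.0, 5.5], [5.2, 4.8], [0.9, 0.7], [5.1, 5.0]]
y = ["A", "A", "B", "B", "A", "B"]

knn = KNeighborsClassifier(n_neighbors=3)  # K = 3
knn.fit(X, y)

# The algorithm measures the (Euclidean) distance from the new point to
# every stored point and takes a majority vote among the 3 closest.
print(knn.predict([[1.1, 0.9]]))  # lands among the 'A' points
```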
47%
The K-NN algorithm is often the right choice because it is intuitive to understand and, unlike Naive Bayes, as we will see below, it doesn’t make assumptions about your data. The main disadvantage of K-NN, however, is that it takes a very long time to compute. Having to calculate the distance to every single point in the dataset takes its toll, and the more points you have, the slower K-NN will run.
48%
Now that we have a taste for Bayesian inference in practice, let’s look at the equation for the Bayes theorem. Here are the notations that will be used in this example: P(Drunk), P(Drunk | Positive), P(Positive | Drunk) and P(Positive), where P stands for probability and the vertical bar signifies a conditional probability. Each of the elements above has a mathematical name. P(Drunk) is the probability that a randomly selected driver is drunk. In Bayesian statistics, this probability is called the prior probability. If we recall our initial assumptions, we can calculate the prior probability as …
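For reference, Bayes' theorem combines the four quantities above as follows (writing Not Drunk for the complementary event and expanding P(Positive) with the law of total probability; this is the standard form of the theorem rather than a formula quoted from the book):

```latex
P(\text{Drunk} \mid \text{Positive})
  = \frac{P(\text{Positive} \mid \text{Drunk})\, P(\text{Drunk})}{P(\text{Positive})},
\quad\text{where}\quad
P(\text{Positive})
  = P(\text{Positive} \mid \text{Drunk})\, P(\text{Drunk})
  + P(\text{Positive} \mid \text{Not Drunk})\, P(\text{Not Drunk}).
```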
50%
Naive Bayes relies on a strong, naive independence assumption: that the features of the dataset are independent of each other. Indeed it would be naive to think so, as for many datasets there can be a level of correlation between the independent variables contained within them. Despite this naive assumption, the Naive Bayes algorithm has proved to work very well in many complex applications such as e-mail spam detection.
51%
Naive Bayes is good for:
- non-linear problems where the classes cannot be separated with a straight line on the scatter plot
- datasets containing outliers (unlike other algorithms, Naive Bayes cannot be biased by outliers)
The drawback to using Naive Bayes is that the naive assumptions it makes can create bias.
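As a sketch of how this might look in practice, here is a Gaussian Naive Bayes classifier, assuming scikit-learn and an invented spam-detection dataset with two numeric features (message length and number of links). The predict_proba call anticipates the next point: Naive Bayes returns a probability for every class, not just a single label.

```python
# Hypothetical example: crude spam detection from two numeric features.
from sklearn.naive_bayes import GaussianNB

X = [[120, 0], [30, 4], [200, 1], [25, 6], [150, 0], [40, 5]]  # [length, links]
y = ["ham", "spam", "ham", "spam", "ham", "spam"]

# The 'naive' assumption: length and links are treated as independent
# of each other within each class.
nb = GaussianNB()
nb.fit(X, y)

print(nb.predict([[60, 3]]))        # single most likely class
print(nb.predict_proba([[60, 3]]))  # probability for every class
```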
51%
Deterministic models like K-NN assign a new observation to a single class, while probabilistic models like Naive Bayes assign a probability distribution across all classes.
52%
Despite its name, logistic regression is actually not a regression algorithm; it is a type of classification method. It will use our data to predict our chances of success in, say, selling a product to a certain group of people or determining whether a key demographic will open your e-mails, and it is also used in many non-business fields such as medicine (predicting, for example, whether a patient has coronary heart disease based on age, sex and blood test results).
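A minimal sketch of logistic regression as a classifier, assuming scikit-learn and an invented 'will this customer buy?' dataset; the quantity of interest is the predicted probability of success.

```python
# Hypothetical example: predict a purchase (1) or no purchase (0) from
# a customer's age and the number of marketing e-mails they opened.
from sklearn.linear_model import LogisticRegression

X = [[22, 1], [35, 4], [47, 6], [29, 0], [51, 7], [33, 2]]
y = [0, 1, 1, 0, 1, 0]

model = LogisticRegression()
model.fit(X, y)

print(model.predict_proba([[40, 3]]))  # [P(no purchase), P(purchase)]
print(model.predict([[40, 3]]))        # 1 if P(purchase) > 0.5
```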
54%
If you do not know what the groups resulting from an analysis might be, you should use a clustering technique.
54%
In this section, we will cover two algorithms: K-means and hierarchical clustering. These two are similar in many ways, as they help us to segment our data into statistically relevant groups.
54%
K-means discovers statistically significant categories or groups in our dataset. It’s perfect in situations where we have two or more independent variables in a dataset and we want to cluster our data points into groups of similar attributes.
56%
Now that we know how the K-means clustering algorithm works, the only question that remains is: how can we find the optimal number of clusters (K) to use? This is a job for the elbow method. The elbow method helps us to identify our optimal number of clusters.
57%
Within Cluster Sum of Squares (WCSS).
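Here is a sketch of the elbow method, assuming scikit-learn and a handful of invented points. In scikit-learn, the WCSS of a fitted K-means model is exposed as its inertia_ attribute; plotting it against K and looking for the 'elbow' suggests the optimal number of clusters.

```python
# Hypothetical example: compute the WCSS for K = 1..6 on invented points.
from sklearn.cluster import KMeans

X = [[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11], [8, 2], [10, 2], [9, 3]]

wcss = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)  # within-cluster sum of squares for this K

# The 'elbow' is where adding another cluster stops reducing WCSS sharply.
print(list(zip(range(1, 7), wcss)))
```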
57%
There are two types of hierarchical clustering – agglomerative and divisive – and they are essentially two sides of the same coin. Agglomerative hierarchical clustering uses a bottom-up approach, working from a single data point and grouping it with the nearest data points in incremental steps until all of the points have been absorbed into a single cluster. Divisive hierarchical clustering works in the opposite way to agglomerative clustering. It begins from the top, where a single cluster encompasses all our data points, and works its way down, splitting the single cluster apart in order of …
58%
Here are a few possible options for the distance between two clusters:
A. Distance between their ‘centres of mass’
B. Distance between their two closest points
C. Distance between their two furthest points
D. The average of B and C
58%
A dendrogram will plot the points of your data (P1, P2, P3, P4) on the x axis of a graph. The distances between data points are represented on the y axis.
58%
The biggest advantage of using the hierarchical clustering algorithm is its dendrogram. The dendrogram is a practical visual tool which allows you to easily see all potential cluster configurations.
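A sketch of agglomerative clustering and its dendrogram, assuming SciPy and Matplotlib and four invented points P1 to P4; the method argument corresponds to the choice of cluster-to-cluster distance listed above.

```python
# Hypothetical example: four points, merged bottom-up and drawn as a dendrogram.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

points = np.array([[1.0, 1.0], [1.2, 1.1], [5.0, 5.0], [5.3, 4.9]])  # P1..P4

# 'ward' is one common cluster-distance rule; 'single' (closest points),
# 'complete' (furthest points) and 'average' are alternatives.
merges = linkage(points, method="ward")

dendrogram(merges, labels=["P1", "P2", "P3", "P4"])
plt.xlabel("Data points")
plt.ylabel("Distance")
plt.show()
```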
59%
Reinforcement learning is ultimately a form of machine learning, and it leans on the concepts of behaviourism to train AI and operate robots.
60%
The way reinforcement learning works is through trialling all the variations available to the machine and then working out the optimal actions from those individual experiences.
61%
The stakes for solving this real problem in the casino are high. We must spend our money to carry out these experiments, and the longer we take to find the solution, the more money we will have spent. For that reason, we must figure out our solution as quickly as possible in order to contain our losses. To maintain efficiency, we must take two factors into account: exploration and exploitation. These factors must be applied in tandem – exploration refers to searching for the best machine, and exploitation means applying the knowledge that we already have about each of the machines to make our …
62%
Applying the upper confidence bound algorithm (UCB) to our problem will determine which machine will give us the best revenue, leading us to the key difference between the algorithms covered in the previous chapter on classification and clustering, and those that appear here. In our earlier examples, we tended to use datasets with independent variables and dependent variables already collected. For reinforcement learning, however, things are different. We begin with no data at all. We have to experiment, observe and change our strategy based on our previous actions.
62%
It is important to understand the purpose of the confidence bound. In a real-life situation, we wouldn’t know exactly where the expected returns are. At the beginning of the first game, we wouldn’t know anything about them at all.
62%
What the confidence bounds do is ensure that the expected returns are captured within them.
63%
But what happens to the grey box that represents our upper and lower confidence bounds? It will shrink every time we play a round. This is because the more rounds we play, the more accurate our observed average will become and, therefore, the narrower our confidence bounds will be. The size of this box, then, will be inversely proportional to the number of rounds we have played on that machine. The smaller this box is, the more confident we can be that we are getting closer to the machine’s true expected return. This is a direct consequence of the law of large numbers.
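Pulling these pieces together, here is a sketch of the upper confidence bound idea for the slot-machine problem. It follows the standard UCB1 formulation rather than any code from the book, and the machines' true win rates are invented and hidden from the algorithm; notice that the bound term shrinks the more often a machine is played.

```python
# Hypothetical example: three machines with unknown win rates, 1,000 rounds.
import math
import random

true_win_rates = [0.10, 0.25, 0.40]  # hidden from the algorithm
plays = [0] * 3                      # rounds played on each machine
rewards = [0.0] * 3                  # total observed winnings per machine

for round_number in range(1, 1001):
    ucb = []
    for i in range(3):
        if plays[i] == 0:
            ucb.append(float("inf"))  # try every machine at least once
        else:
            average = rewards[i] / plays[i]                           # observed average
            bound = math.sqrt(2 * math.log(round_number) / plays[i])  # shrinks with plays
            ucb.append(average + bound)

    choice = ucb.index(max(ucb))  # play the machine with the highest upper bound
    reward = 1 if random.random() < true_win_rates[choice] else 0
    plays[choice] += 1
    rewards[choice] += reward

print(plays)  # most rounds should concentrate on the best machine (index 2)
```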
63%
This is the law of large numbers: as the sample size grows, the observed average will always converge to the true expected return.
64%
The upper confidence bound algorithm is good for:
- finding the most effective advertising campaigns
- managing multiple project finances
The UCB is not the only algorithm that can solve the multi-armed bandit problem. The following section will consider how Thompson sampling can also be applied, and consider when we might want to use this algorithm over UCB.
65%
Thompson sampling is probabilistic while the upper confidence bound is deterministic – and it’s easy to see why. Both approaches are similar in that through playing rounds they approximate the value of the true expected return. UCB does this through confidence bounds, while Thompson sampling constructs distributions.
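For comparison, a sketch of Thompson sampling on the same invented three-machine problem: each machine keeps a Beta distribution over its unknown win rate, one value is sampled from each distribution every round, and the machine with the highest draw is played.

```python
# Hypothetical example: Thompson sampling over three machines for 1,000 rounds.
import random

true_win_rates = [0.10, 0.25, 0.40]  # hidden from the algorithm
successes = [0] * 3
failures = [0] * 3

for _ in range(1000):
    # One random draw per machine from its current Beta(successes+1, failures+1).
    draws = [random.betavariate(successes[i] + 1, failures[i] + 1) for i in range(3)]
    choice = draws.index(max(draws))

    if random.random() < true_win_rates[choice]:
        successes[choice] += 1  # the machine's distribution shifts towards higher rates
    else:
        failures[choice] += 1   # and towards lower rates after a loss

print([successes[i] + failures[i] for i in range(3)])  # rounds played per machine
```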
67%
There are two branches to visuals in data science: visual analytics and visualization. The difference between the two is important in this chapter. Think of visual analytics as an additional tool for stages 1 to 3 in the Data Science Process (identifying the question, preparing the data and analysing the data). Visualization is what lies at the core of stage 4, data visualization.