Data Science
Read between March 19 - April 29, 2023
refer to a row. So a data set contains a set of instances, and each instance is described by a set of attributes.
The construction of the analytics record is a prerequisite of doing data science. In fact, the majority of the time and effort in data science projects is spent on creating, cleaning, and updating the analytics record.
We could have included many more attributes for each book, but, as is typical of data science projects, we needed to make a choice when we were designing the data set. In this instance, we were constrained by the size of the page and the number of attributes we could fit onto it. In most data science projects, however, the constraints relate to what attributes we can actually gather and what attributes we believe, based on our domain knowledge, are relevant to the problem we are trying to solve.
The problem of how to choose the correct attribute(s) is a challenge faced by all data science projects, and sometimes it comes down to an iterative process of trial-and-error experiments where each iteration checks the results achieved using different subsets of attributes.
The standard types are numeric, nominal, and ordinal. Numeric attributes describe measurable quantities that are represented using integer or real values. Numeric attributes can be measured on either an interval scale or a ratio scale. Interval attributes are measured on a scale with a fixed but arbitrary interval and arbitrary origin—for example, date and time measurements. It is appropriate to apply ordering and subtraction operations to interval attributes, but other arithmetic operations (such as multiplication and division) are not appropriate. Ratio scales are similar to interval scales, but the scale of measurement possesses a true-zero origin. A value of zero indicates that none of the quantity is being measured. A consequence of a ratio scale having a true-zero origin is that we can describe a value on a ratio scale as being a multiple (or ratio) of another value. Temperature is a useful example for distinguishing between interval and ratio scales. A temperature measurement …
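The interval/ratio distinction can be sketched in a few lines of code. A minimal illustration, assuming Celsius readings (interval scale, arbitrary zero) converted to Kelvin (ratio scale, true zero); the readings themselves are made up:

```python
# Minimal sketch of the interval vs. ratio distinction using temperature.
# Celsius is an interval scale (arbitrary zero); Kelvin is a ratio scale (true zero).

def celsius_to_kelvin(c):
    return c + 273.15

t1, t2 = 10.0, 20.0  # two hypothetical readings in Celsius

# Subtraction is meaningful on both scales: the difference is 10 degrees either way.
print(round(t2 - t1, 2))                                          # 10.0
print(round(celsius_to_kelvin(t2) - celsius_to_kelvin(t1), 2))    # 10.0

# A ratio is only meaningful on the ratio (Kelvin) scale:
print(t2 / t1)                                                    # 2.0, but 20 C is not "twice as hot" as 10 C
print(round(celsius_to_kelvin(t2) / celsius_to_kelvin(t1), 3))    # 1.035, the physically meaningful ratio
```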
Nominal (also known as categorical) attributes take values from a finite set. These values are names (hence “nominal”) for categories, classes, or states of things. Examples of nominal attributes include marital status (single, married, divorced) and beer type (ale, pale ale, pils, porter, stout, etc.). A binary attribute is a special case of a nominal attribute where the set of possible values is restricted to just two values. For example, we might have the binary attribute “spam,” which describes whether an email is spam (true) or not spam (false), or the binary attribute “smoker,” which describes whether an individual is a smoker (true) or not (false). Nominal attributes cannot have ordering or arithmetic operations applied to them. Note that a nominal attribute may be sorted alphabetically, but alphabetizing is a distinct operation from ordering. In table 1, “author” and “title” are examples of nominal …
The distinction between nominal and ordinal data is not always clear-cut. For example, consider an attribute that describes the weather and that can take the values “sunny,” “rainy,” and “overcast.” One person might view this attribute as being nominal, with no natural order over the values, whereas another person might argue that the attribute is ordinal, with “overcast” being treated as an intermediate value between “sunny” and “rainy.”
The data type of an attribute (numeric, ordinal, nominal) affects the methods we can use to analyze and understand the data, including both the basic statistics we can use to describe the distribution of values that an attribute takes and the more complex algorithms we use to identify the patterns of relationships between attributes. At the most basic level of analysis, numeric attributes allow arithmetic operations, and the typical statistical analysis applied to numeric attributes is to measure the central tendency (using the mean value of the attribute) and the dispersion of the attributes …
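This basic level of analysis can be sketched with Python's standard library, on made-up attribute values: mean and standard deviation for a numeric attribute, and frequency counts (the mode) for a nominal one, where arithmetic would be meaningless:

```python
import statistics
from collections import Counter

# Hypothetical numeric attribute: page counts for a handful of books.
pages = [204, 320, 156, 480, 272]
print(statistics.mean(pages))   # central tendency: 286.4
print(statistics.stdev(pages))  # dispersion (sample standard deviation)

# Hypothetical nominal attribute: arithmetic is meaningless, but we can
# still describe the distribution by counting value frequencies (the mode).
genres = ["fiction", "science", "fiction", "history", "fiction"]
print(Counter(genres).most_common(1))  # [('fiction', 3)]
```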
implication is that data are never an objective description of the world. They are instead always partial and biased. As Alfred Korzybski has observed, “A map is not the territory it represents, but, if correct, it has a similar structure to the territory which accounts for its usefulness” (1996, 58). In other words, the data we use for data science are not a perfect representation of the real-world entities and processes we are trying to understand, but if we are careful in how we design and gather the data that we use, then the results of our analysis will provide useful insights into our …
for data science to be successful, (a) we need to pay a great deal of attention to how we create our data (in terms of both the choices we make in designing the data abstractions and the quality of the data captured by our abstraction processes), and (b) we also need to “sense check” the results of a data science process—that is, we need to understand that just because the computer identifies a pattern in the data this doesn’t mean that it is identifying a real insight…
Other than type of data (numeric, nominal, and ordinal), a number of other useful distinctions can be made regarding data. One such distinction is between structured and unstructured data. Structured data are data that can be stored in a table, and every instance in the table has the same structure (i.e., set of attributes). As an example, consider the demographic data for a population, where each row in the table describes one person and consists of the same set of demographic attributes (name, age, date of birth, address, gender, education level, job status, etc.). Structured data can be … into an analytics record. Unstructured data are data where each instance in the data set may have its own internal structure, and this structure is not necessarily the same in every instance.
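A small sketch of this distinction, with made-up instances: structured rows share one set of attributes and map directly onto an analytics record, while unstructured instances (here, free-text snippets) each carry their own internal structure.

```python
# Structured data: every instance (row) has the same set of attributes,
# so the rows map directly onto the rows of an analytics record.
structured = [
    {"name": "Alice", "age": 34, "job_status": "employed"},
    {"name": "Bob",   "age": 29, "job_status": "student"},
]
assert all(row.keys() == structured[0].keys() for row in structured)

# Unstructured data: each instance may have its own internal structure
# (e.g., free-text emails of different shapes and lengths).
unstructured = [
    "Hi team, the meeting has moved to 3pm.",
    "RE: RE: invoice -- attached, thanks!",
]
print(len(structured), len(unstructured))  # 2 2
```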
Sometimes attributes are raw abstractions from an event or object—for example, a person’s height, the number of words in an email, the temperature in a room, the time or location of an event. But data can also be derived from other pieces of data. Consider the average salary in a company or the variance in the temperature of a room across a period of time. In both of these examples, the resulting data are derived from an original set of data by applying a function to the original raw data (individual salaries or temperature readings).
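Both of the derived-data examples in that passage are one-line functions over the raw data; a sketch with hypothetical salaries and temperature readings:

```python
import statistics

# Raw data (hypothetical values): individual salaries and room-temperature readings.
salaries = [38000, 42000, 55000, 61000, 47000]
temps = [19.8, 20.1, 20.4, 19.9, 20.2]

# Derived data: apply a function to the raw data.
avg_salary = statistics.mean(salaries)       # 48600
temp_variance = statistics.variance(temps)   # sample variance of the readings
print(avg_salary, temp_variance)
```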
Recognizing that the interaction between the raw attributes “mass” and “height” provides more information about obesity than either of these two attributes can when examined independently will help us to identify people in the population who are at risk of obesity.
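Body mass index is the classic derived attribute behind this example. A sketch with two hypothetical people and the commonly cited BMI cut-offs (not figures from the book):

```python
# BMI combines "mass" and "height" into one derived attribute that says more
# about obesity risk than either raw attribute does on its own.
def bmi(mass_kg, height_m):
    return mass_kg / height_m ** 2

# Two hypothetical people with identical mass but different heights:
print(round(bmi(80, 1.9), 1))  # 22.2 -- within the commonly cited healthy range
print(round(bmi(80, 1.6), 1))  # 31.2 -- above the commonly cited obesity cut-off of 30
```

Looking at mass alone (80 kg for both), the two people are indistinguishable; the derived interaction attribute separates them.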
There are generally two terms for gathered raw data: captured data and exhaust data (Kitchin 2014a). Captured data are collected through a direct measurement or observation that is designed to gather the data. For example, the primary purpose of surveys and experiments is to gather specific data on a particular topic of interest. By contrast, exhaust data are a by-product of a process whose primary purpose is something other than data capture.
One of the most common types of exhaust data is metadata—that is, data that describe other data.
The fact that many organizations have very specific purposes makes it relatively easy to infer sensitive information about a person based on his phone calls to these organizations. For example, some of the people in the MetaPhone study made calls to Alcoholics Anonymous, divorce lawyers, and medical clinics specializing in sexually transmitted diseases. Patterns in calls can also be revealing.
In fact, one of the factors driving the growth in data science in business today is the recognition of the value of exhaust data and the potential that data science has to unlock this value for businesses.
Data are created through abstractions or measurements taken from the world. Information is data that have been processed, structured, or contextualized so that it is meaningful to humans. Knowledge is information that has been interpreted and understood by a human so that she can act on it if required. Wisdom is acting on knowledge in an appropriate way. The activities in the data science process can also be represented using a similar pyramid hierarchy where the width of the pyramid represents the amount of data being processed at each level and where the higher the layer in the pyramid, the more informative the results of the activities are for decision making.
The primary advantage of CRISP-DM, the main reason why it is so widely used, is that it is designed to be independent of any software, vendor, or data-analysis technique.
The CRISP-DM life cycle consists of six stages: business understanding, data understanding, data preparation, modeling, evaluation, and deployment, as shown in figure 4. Data are at the center of all data science activities, and that is why the CRISP-DM diagram has data at its center. The arrows between the stages indicate the typical direction of the process. The process is semistructured, which means that a data scientist doesn’t always move through these six stages in a linear fashion. Depending on the outcome of a particular stage, a data scientist may go back to one of the previous …
In the first two stages, business understanding and data understanding, the data scientist is trying to define the goals of the project by understanding the business needs and the data that the business has available to it. In the early stages of a project, a data scientist will often iterate between focusing on the business and exploring what data are available. This iteration typically involves identifying a business problem and then exploring if the appropriate data are available to develop a data-driven solution to the problem. If the data are available, the project can proceed; if not, the data scientist will have to identify an alternative problem to tackle. During this stage of a project, a data scientist will spend a great deal of time in meetings with colleagues in the business-focused departments (e.g., sales, marketing, operations) to understand their problems and with the database administrators to get an understanding of what data are available. Once the data scientist has clearly defined a business problem …
a data scientist will normally use a number of different ML algorithms to train a number of different models on the data set. A model is trained on a data set by running an ML algorithm on the data set so as to identify useful patterns in the data and to return a model that encodes these patterns. In some cases an ML algorithm works by fitting a template model structure to a data set by setting the parameters of the template to good values for that data set (e.g., fitting a linear regression or neural network model to a data set). In other cases an ML algorithm builds a model in a piecewise …
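The "template model" case can be sketched with the simplest possible example: fitting y = a·x + b by closed-form least squares, where the algorithm's whole job is to set the template's parameters a and b to good values for the data (toy numbers, not from the book):

```python
# Fitting a template model structure (y = a*x + b) to a data set: the "learning"
# consists of setting the parameters a and b via closed-form least squares.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.0, 9.8]  # toy data, roughly y = 2x

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
b = mean_y - a * mean_x
print(round(a, 2), round(b, 2))  # 1.95 0.15 -- the fitted parameters
```

The template (a line) is chosen in advance; only its parameters are learned, which is what distinguishes this from the piecewise model-building case.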
why the performance of a model is lower than expected or notices that maybe the model’s performance is suspiciously good. Or by examining the structure of the models, the data scientist may find that the model is reliant on attributes that she would not expect, and as a result she revisits the data to check that these attributes are correctly encoded. It is thus not uncommon for a project to go through several r…
The last two stages of the CRISP-DM process, evaluation and deployment, are focused on how the models fit the business and its processes. The tests run during the modeling stage are focused purely on the accuracy of the models for the data set. The evaluation phase involves assessing the models in the broader context defined by the business needs. Does a model meet the business objectives of the process? Is there any business reason why a model is inadequate? At this point in the process, it is also useful for the data scientist to do a general quality-assurance review on the project …
decision made during the evaluation phase is whether any of the models should be deployed in the business or another iteration of the CRISP-DM process is required to create adequate models. Assuming the evaluation process approves a model or models, the project moves into the final stage of the process: deployment. The deployment phase involves examining how to deploy the selected models into the business environment. This includes planning how to integrate the models into the organization’s technical infrastructure and business processes. The best models are the ones that fit smoothly into …