The goal of data science is to improve decision making by basing decisions on insights extracted from large data sets. As a field of activity, data science encompasses a set of principles, problem definitions, algorithms, and processes for extracting nonobvious and useful patterns from large data sets. It is closely related to the fields of data mining and machine learning, but it is broader in scope. Today, data science drives decision making in nearly all parts of modern societies. Some of the ways that data science may affect your daily life include determining which advertisements are
methods for data analysis and modeling, such as deep learning. Together these factors mean that it has never been easier for organizations to gather, store, and process data.
One aspect of a typical data infrastructure that can be challenging is that data in databases and data warehouses often reside on servers different from the servers used for data analysis. As a consequence, when large data sets are handled, a surprisingly large amount of time can be spent moving data between the servers on which a database or data warehouse lives and the servers used for data analysis and machine learning.
Data science encompasses a set of principles, problem definitions, algorithms, and processes for extracting nonobvious and useful patterns from large data sets. Many of the elements of data science have been developed in related fields such as machine learning and data mining. In fact, the terms data science, machine learning, and data mining are often used interchangeably. The commonality across these disciplines is a focus on improving decision making through the analysis of data. However, although data science borrows from these other fields, it is broader in scope. Machine learning (ML)
Using data science, we can extract different types of patterns. For example, we might want to extract patterns that help us to identify groups of customers exhibiting similar behavior and tastes. In business jargon, this task is known as customer segmentation, and in data science terminology it is called clustering. Alternatively, we might want to extract a pattern that identifies products that are frequently bought together, a process called association-rule mining. Or we might want to extract patterns that identify strange or abnormal events, such as fraudulent insurance claims, a process
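To make the clustering idea concrete, here is a minimal sketch of customer segmentation with k-means; the attribute names, the toy data, and the use of scikit-learn are illustrative assumptions rather than anything specified in the text.

    # A minimal sketch of customer segmentation via clustering (k-means).
    # The attribute names and the choice of scikit-learn are illustrative assumptions.
    import numpy as np
    from sklearn.cluster import KMeans

    # Each row describes one customer: [annual_spend, visits_per_month]
    customers = np.array([
        [200.0, 1], [220.0, 2], [5000.0, 12],
        [4800.0, 10], [150.0, 1], [5100.0, 11],
    ])

    # Ask for two segments; each customer is assigned a cluster label.
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
    print(kmeans.labels_)           # e.g., [0 0 1 1 0 1]
    print(kmeans.cluster_centers_)  # the "profile" of each segment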
In general, data science becomes useful when we have a large number of data examples and when the patterns are too complex for humans to discover and extract manually.
We humans are reasonably good at defining rules that check one, two, or even three attributes (also commonly referred to as features or variables), but when we go higher than three attributes, we can start to struggle to handle the interactions between them. By contrast, data science is often applied in contexts where we want to look for patterns among tens, hundreds, thousands, and, in extreme cases, millions of attributes. The patterns that we extract using data science are useful only if they give us insight into the problem that enables us to do something to help solve the problem. The
highlights that the insight we get should also be something that we have the capacity to use in some way. For example, imagine we are working for a cell phone company that is trying to solve a customer churn problem—that is, too many customers are switching to other companies. One way data science might be used to address this problem is to extract patterns from the data about previous customers that allow us to identify current customers who are churn risks and then contact these customers and try to persuade them to stay with us. A pattern that enables us to identify likely churn customers
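To see why hand-crafted rules stop scaling beyond a few attributes, consider a sketch of the kind of rule a human analyst might write for a churn scenario like the one above; the attribute names and thresholds are invented for illustration.

    # Sketch of a hand-written rule over two or three attributes. With tens or
    # hundreds of attributes, enumerating such rules by hand becomes impractical.
    # Attribute names and thresholds are invented for illustration.
    def churn_risk_rule(customer):
        return (customer["months_since_last_upgrade"] > 18
                and customer["support_calls_last_quarter"] >= 3
                and customer["monthly_spend"] < 20.0)

    print(churn_risk_rule({
        "months_since_last_upgrade": 24,
        "support_calls_last_quarter": 4,
        "monthly_spend": 15.0,
    }))  # True: flagged as a churn risk by the hand-written rule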
One thread in this longer history is the history of data collection; another is the history of data analysis.
This type of record keeping captures what is known as transactional data. Transactional data include event information such as the sale of an item, the issuing of an invoice, the delivery of goods, credit card payments, insurance claims, and so on. Nontransactional data, such as demographic data, also have a long history.
A milestone in data collection and storage occurred in 1970 when Edgar F. Codd published a paper explaining the relational data model, which was revolutionary in terms of setting out how data were (at the time) stored, indexed, and retrieved from databases. The relational data model enabled users to extract data from a database using simple queries that defined what data the user wanted without requiring the user to worry about the underlying structure of the data or where they were physically stored. Codd’s paper provided the foundation for modern databases and the development of structured
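As a hedged sketch of the declarative querying the relational model enables, the example below creates a tiny table and retrieves rows by stating what is wanted rather than how or where it is stored; sqlite3 and the table and column names are illustrative assumptions.

    # Sketch of declarative querying: the SELECT says *what* rows are wanted,
    # not how or where they are physically stored.
    # sqlite3 and the table/column names are illustrative assumptions.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE sales (item TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (?, ?)",
                    [("phone", 299.0), ("case", 9.5), ("charger", 25.0)])

    for row in con.execute("SELECT item, amount FROM sales WHERE amount > 20"):
        print(row)  # ('phone', 299.0) and ('charger', 25.0)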
Another difficulty was that databases were optimized for storage and retrieval of data, activities characterized by high volumes of simple operations, such as SELECT, INSERT, UPDATE, and DELETE. In order to analyze their data, these companies needed technology that was able to bring together and reconcile the data from disparate databases and that facilitated more complex analytical data operations. This business challenge led to the development of data warehouses. In a data warehouse, data are taken from across the organization and integrated, thereby providing a more comprehensive data set
However, it is not only the amount of data collected that has grown dramatically but also the variety of data. Just consider the following list of online data sources: emails, blogs, photos, tweets, likes, shares, web searches, video uploads, online purchases, podcasts. And if we consider the metadata (data describing the structure and properties of the raw data) of these events, we can begin to understand the meaning of the term big data. Big data are often defined in terms of the three Vs: the extreme volume of data, the variety of the data types, and the velocity at which the data must be processed. The advent of big data has driven the development of a range of new database technologies. This new generation of databases is often referred to as “NoSQL databases.” They typically have a simpler data model than traditional relational databases. A NoSQL database stores data as objects with attributes, using an object notation language such as the JavaScript Object Notation (JSON). The advantage of using an object representation of data (in contrast to a relational table-based model) is that the set of attributes
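A brief sketch of the object (JSON-style) representation described here, with invented field names: each object can carry its own set of attributes, and no schema change is needed when one record has a field that another lacks.

    # Sketch of the object (JSON-style) representation described above.
    # Unlike rows in a fixed relational table, each object can carry its own
    # set of attributes. Field names here are illustrative assumptions.
    import json

    orders = [
        {"id": 1, "customer": "A. Smith", "items": ["phone"], "total": 299.0},
        # This object has an extra attribute ("coupon") that the first one lacks;
        # no schema change is needed to store it.
        {"id": 2, "customer": "B. Jones", "items": ["case", "charger"],
         "total": 34.5, "coupon": "SPRING10"},
    ]

    print(json.dumps(orders[1], indent=2))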
large volumes of data at high speeds, it can be useful from a computational and speed perspective to distribute the data across multiple servers, process queries by calculating partial results of a query on each server, and then merge these results to generate the response to the query. This is the approach taken by the MapReduce framework on Hadoop. In the MapReduce framework, the data and queries are mapped onto (or distributed...
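The MapReduce pattern can be sketched in a few lines of plain Python: a map step computes partial results for each chunk of data (as if each chunk lived on a different server), and a reduce step merges the partial results. This toy word count illustrates the pattern only, not Hadoop itself.

    # Toy illustration of the MapReduce pattern: compute partial results per
    # data chunk (map), then merge them into a final answer (reduce).
    from collections import Counter
    from functools import reduce

    chunks = [  # imagine each chunk living on a different server
        ["the", "sale", "of", "an", "item"],
        ["the", "delivery", "of", "goods"],
    ]

    def map_step(chunk):
        return Counter(chunk)   # partial word counts for one chunk

    def reduce_step(a, b):
        return a + b            # merge two partial results

    partials = [map_step(c) for c in chunks]
    totals = reduce(reduce_step, partials)
    print(totals["the"])        # 2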
Statistics is the branch of science that deals with the collection and analysis of data.
The simplest form of statistical analysis of data is the summarization of a data set in terms of summary (descriptive) statistics (including measures of central tendency, such as the arithmetic mean, or measures of variation, such as the range). However, in the seventeenth and eighteenth centuries the work of people such as Gerolamo Cardano, Blaise Pascal, Jakob Bernoulli, Abraham de Moivre, Thomas Bayes, and Richard Price laid the foundations of probability theory, and through the nineteenth century many statisticians began to use probability distributions as part of their analytic tool
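As a tiny illustration of summary statistics, the sketch below computes a measure of central tendency (the arithmetic mean) and a measure of variation (the range) for some made-up values.

    # Sketch of simple summary (descriptive) statistics: a measure of central
    # tendency (the arithmetic mean) and a measure of variation (the range).
    from statistics import mean

    values = [3, 7, 8, 5, 12, 14, 21, 13, 18]  # made-up data for illustration
    print(mean(values))                         # central tendency
    print(max(values) - min(values))            # range as a measure of variation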
developments in mathematics enabled statisticians to move beyond descriptive statistics and to start doing statistical learning.
Gauss, in his search for the missing dwarf planet Ceres, developed the method of least squares, which enables us to find the model that best fits a data set by minimizing the sum of squared differences between the data points in the data set and the model’s predictions. The method of least squares provided the foundation for statistical learning methods such as linear regression and logistic regression as well as for the development of artificial neural network models in artificial intelligence (we will return to least squares, regression analysis, and neural networks in chapter 4).
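A minimal sketch of the least-squares idea, assuming NumPy and made-up data: the fitted line is the one that minimizes the sum of squared differences between the observed values and the line’s predictions.

    # Minimal least-squares sketch: fit y ≈ a*x + b by minimizing
    # sum_i (y_i - (a*x_i + b))**2. The data points are made up for illustration.
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    a, b = np.polyfit(x, y, deg=1)     # slope and intercept of the best-fit line
    residuals = y - (a * x + b)
    print(a, b, np.sum(residuals**2))  # the quantity least squares minimizes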
Admittedly, it is difficult to visualize large (many data points) or complex (many attributes) data sets, but data visualization is still an important part of data science.
A recent development is the t-distributed stochastic neighbor embedding (t-SNE) algorithm, which is a useful technique for reducing high-dimensional data down to two or three dimensions, thereby facilitating the visualization of those data. The developments in probability theory and statistics continued into the twentieth century. Karl Pearson developed modern hypothesis testing, and R. A. Fisher developed statistical methods for multivariate analysis and introduced the idea of maximum likelihood estimation into statistical inference as a method to draw conclusions based on the relative
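A minimal sketch of how t-SNE is commonly applied, assuming scikit-learn’s implementation and randomly generated data: fifty-dimensional examples are reduced to two dimensions that can then be plotted.

    # Sketch: reduce high-dimensional data to 2-D with t-SNE so it can be plotted.
    # Use of scikit-learn and the random data are illustrative assumptions.
    import numpy as np
    from sklearn.manifold import TSNE

    rng = np.random.default_rng(0)
    high_dim = rng.normal(size=(100, 50))  # 100 examples, 50 attributes each

    embedding = TSNE(n_components=2, perplexity=30, random_state=0)
    low_dim = embedding.fit_transform(high_dim)
    print(low_dim.shape)                   # (100, 2): ready for a scatter plot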
The field of ML is at the core of modern data science because it provides algorithms that are able to automatically analyze large data sets to extract potentially interesting and useful patterns.
In fact, the terms knowledge discovery in databases (KDD) and data mining describe the same concept, the distinction being that data mining is more prevalent in business communities and KDD more prevalent in academic communities.
In this paper, Breiman characterizes the traditional approach to statistics as a data-modeling culture that views the primary goal of data analysis as identifying the (hidden) stochastic data model (e.g., linear regression) that explains how the data were generated. He contrasts this culture with the algorithmic-modeling culture that focuses on using computer algorithms to create prediction models that are accurate (rather than explanatory in terms of how the data were generated). Breiman’s distinction between a statistical focus on models that explain the data versus an algorithmic focus on
In general, today most data science projects are more aligned with the ML approach of building accurate prediction models and less concerned with the statistical focus on explaining the data. So although data science became prominent in discussions relating to statistics and still borrows methods and models from statistics, it has over time developed its own distinct approach to data analysis.
Gathering and preparing these data for use in data science projects has resulted in the need for data scientists to develop the programming and hacking skills to scrape, merge, and clean data (sometimes unstructured data) from external web sources. Also, the emergence of big data has meant that data scientists need to be able to work with big-data technologies, such as Hadoop.
Data scientists should have some domain expertise. Most data science projects begin with a real-world, domain-specific problem and the need to design a data-driven solution to this problem. As a result, it is important for a data scientist to have enough domain expertise that they understand the problem, why it is important, and how a data science solution to the problem might fit into an organization’s processes. This domain expertise guides the data scientist as she works toward identifying an optimized solution. It also enables her to engage with real domain experts in a meaningful way so that she can elicit and understand relevant knowledge about the underlying problem. Also, having some experience of the project domain allows the data scientist to bring her experiences from working on similar projects in the same and related domains to bear on defining the project focus and scope. Data are at the center of all data science projects.
In most organizations, a significant portion of the data will come from the databases in the organization. Furthermore, as the data architecture of an organization grows, data science projects will start incorporating data from a variety of other data sources, which are commonly referred to as “big-data sources.” The data in these data sources can exist in a variety of different formats, generally a database of some form—relational, NoSQL, or Hadoop. All of the data in these various databases and data sources will need to be integrated, cleansed, transformed, normalized, and so on. These tasks go by many names, such as extraction, transformation, and load, “data munging,” “data wrangling,” “data fusion,” “data crunching,” and so on. Like source data, the data generated from data science activities also need to be stored and managed. Again, a database is the typical storage location for the data generated by these activities because they can then be easily distributed and shared with different parts of the organization. As a
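A hedged sketch of the kind of integration and cleansing step described above, assuming pandas; the sources, column names, and values are invented for illustration.

    # Sketch of integrating and cleaning data from two sources before analysis.
    # pandas, the column names, and the values are illustrative assumptions.
    import pandas as pd

    crm = pd.DataFrame({"customer_id": [1, 2, 3],
                        "name": ["Ann", "Bob", "Cat"]})
    billing = pd.DataFrame({"customer_id": [1, 2, 2, 4],
                            "amount": [10.0, None, 15.5, 8.0]})

    merged = crm.merge(billing, on="customer_id", how="left")  # integrate sources
    merged["amount"] = merged["amount"].fillna(0.0)            # cleanse missing values
    print(merged)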
required to be able to understand and develop the ML models and integrate them into the production, analytic, or back-end applications in an organization. Presenting data in a graphical format makes it much easier to see and understand what is happening with the data. Data visualization applies to all phases of the data science process. When data are inspected in tabular form, it is easy to miss things such as outliers or trends in distributions or subtle changes in th...
Methods from statistics and probability are used throughout the data science process, from the initial gathering and investigation of the data right through to the comparing of the results of different models and analyses produced during the project. Machine learning involves using a variety of advanced statistical and computing techniques to process data to find patterns. The data scientist who is involved in the applied aspects of ML does not have to write his own versions of ML algorithms. By understanding the ML algorithms, what they can be used for, what the results they generate mean,
can consider the ML algorithms as a gray box. This allows him to concentrate on the applied aspects of data science and to test the various ML algorithms to see which ones work best for the scenario and data he is concerned with. Finally, a key aspect of being a successful data scientist is being able to communicate the story in the data. This story might uncover the insight that the analysis of the data has revealed or how the models created during a project fit into an organization’s processes and the likely impact they will have on the organization’s functioning.
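A hedged sketch of what treating ML algorithms as gray boxes and testing several of them can look like in practice, assuming scikit-learn and one of its toy data sets: the same data are handed to a few off-the-shelf models and their cross-validated scores are compared.

    # Sketch of treating ML algorithms as gray boxes: try several off-the-shelf
    # models on the same data and compare their cross-validated accuracy.
    # The use of scikit-learn and its toy dataset are illustrative assumptions.
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_iris(return_X_y=True)

    models = {
        "logistic regression": LogisticRegression(max_iter=1000),
        "decision tree": DecisionTreeClassifier(random_state=0),
    }

    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=5)
        print(f"{name}: mean accuracy {scores.mean():.3f}")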
The equivalent of up-selling and cross-selling in the online world is the “recommender system.” If you have watched a movie on Netflix or purchased an item on Amazon, you will know that these websites use the data they collect to provide suggestions for what you should watch or buy next.
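A toy sketch of the recommender idea, with invented data: for a customer who has bought only one item, suggest the other item whose co-purchase pattern is most similar. Real recommender systems are far more sophisticated; this only illustrates how collected interaction data can drive suggestions.

    # Toy sketch of item-based recommendation: for a customer who bought only
    # item 0, suggest the other item with the most similar co-purchase pattern.
    # The data and the similarity measure are illustrative assumptions.
    import numpy as np

    # Rows = customers, columns = items; 1 means the customer bought the item.
    interactions = np.array([
        [1, 1, 0, 0],
        [1, 1, 1, 0],
        [0, 1, 1, 1],
    ])

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

    items = interactions.T  # each item's pattern across customers
    scores = [cosine(items[0], items[j]) for j in range(1, items.shape[0])]
    print("recommend item", 1 + int(np.argmax(scores)))  # -> item 1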
Chris Anderson’s book The Long Tail (2008) argues that as production and distribution get less expensive, markets shift from selling large amounts of a small number of hit items to selling smaller amounts of a larger number of niche items.
However, from a pure data science perspective, perhaps the most important aspect of the Moneyball story is that it highlights that sometimes the primary value of data science is the identification of informative attributes. A common belief is that the value of data science is in the models created through the process. However, once we know the important attributes in a domain, it is very easy to create data-driven models. The key to success is getting the right data and finding the right attributes. In Freakonomics: A Rogue Economist Explores the Hidden Side of Everything, Steven D. Levitt and
The reason why data science is used in so many domains is that it doesn’t matter what the problem domain is: if the right data are available and the problem can be clearly defined, then data science can help.
A number of factors have contributed to the recent growth of data science. As we have already touched upon, the emergence of big data has been driven by the relative ease with which organizations can gather data. Be it through point-of-sales transaction records, clicks on online platforms, social media posts, apps on smart phones, or myriad other channels, companies can now build much richer profiles of individual customers. Another factor is the commoditization of data storage with economies of scale, making it less expensive than ever before to store data. There has also been tremendous
In the past 10 years there have also been major advances in ML. In particular, deep learning has emerged and has revolutionized how computers can process language and image data. The term deep learning describes a family of neural network models with multiple layers of units in the network. Neural networks have been around since the 1940s, but they work best with large, complex data sets and take a great deal of computing resources to train. So the emergence of deep learning is connected with growth in big data and computing power.
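To make “multiple layers of units” concrete, here is a minimal sketch of a small multilayer network, assuming scikit-learn’s MLPClassifier and its digits data set; production deep-learning models are far larger and are trained with specialized libraries and hardware.

    # Minimal sketch of a neural network with multiple layers of units.
    # scikit-learn's MLPClassifier and the toy dataset are illustrative choices;
    # real deep-learning models are much larger and trained on GPUs.
    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Two hidden layers of 64 and 32 units between the input and output layers.
    net = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
    net.fit(X_train, y_train)
    print(net.score(X_test, y_test))  # accuracy on held-out digits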
However, although deep learning is an important technical development, perhaps what is most significant about it in terms of the growth of data science is the increased awareness of the capabilities and benefits of data science, and the organizational buy-in, that have resulted from these high-profile success stories.
One of the biggest myths is the belief that data science is an autonomous process that we can let loose on our data to find the answers to our problems. In reality, data science requires skilled human oversight throughout the different stages of the process. Human analysts are needed to frame the problem, to design and prepare the data, to select which ML algorithms are most appropriate, to critically interpret the results of the analysis, and to plan the appropriate action to take based on the insight(s) the analysis has revealed. Without skilled human oversight, a data science project will
Human talent in data science is at a premium, and sourcing this talent is currently the main bottleneck in the adoption of data science.
The second big myth of data science is that every data science project needs big data and needs to use deep learning. In general, having more data helps, but having the right data is the more important requirement.
A third data science myth is that modern data science software is easy to use, and so data science is easy to do. It is true that data science software has become more user-friendly. However, this ease of use can hide the fact that doing data science properly requires both appropriate domain knowledge and expertise regarding the properties of the data and the assumptions underpinning the different ML algorithms. In fact, it has never been easier to do data science badly. Like everything else in life, if you don’t understand what you are doing when you do data science, you are going to make
The last myth about data science we want to mention here is the belief that data science pays for itself quickly. The truth of this belief depends on the context of the organization.
Furthermore, data science will not give positive results on every project. Sometimes there is no hidden gem of insight in the data, and sometimes the organization is not in a position to act on the insight the analysis has revealed.
As its name suggests, data science is fundamentally dependent on data. In its most basic form, a datum or a piece of information is an abstraction of a real-world entity (person, object, or event). The terms variable, feature, and attribute are often used interchangeably to denote an individual abstraction. Each entity is typically described by a number of attributes. For example, a book might have the following attributes: author, title, topic, genre, publisher, price, date published, word count, number of chapters, number of pages, edition, ISBN, and so on. A data set consists of the data
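One way to picture an entity and its attributes is as a record that maps attribute names to values. The sketch below uses the book example from the passage, with invented values, and represents a data set as a collection of such records.

    # Sketch: one entity (a book) described by a set of attributes, and a data set
    # as a collection of such records. Attribute values are invented for illustration.
    book = {
        "author": "A. Author",
        "title": "An Example Title",
        "topic": "data science",
        "genre": "nonfiction",
        "publisher": "Example Press",
        "price": 24.99,
        "date_published": "2018-04-13",
        "word_count": 55000,
        "number_of_chapters": 7,
        "number_of_pages": 280,
        "edition": 1,
        "isbn": "978-0-00-000000-0",
    }

    data_set = [book]           # a data set is a collection of such entities
    print(sorted(book.keys()))  # the attributes describing each entity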
The terms instance, example, entity, object, case, individual, and record are used in data science literature to

