Kindle Notes & Highlights
Generally, more data means more information, and hence tends to decrease uncertainty.
Bayesian Methods
The Bayesian approach allows us to incorporate our prior beliefs in training.
Bayesian estimation is especially interesting because we are no longer constrained to some fixed parametric model class; the model complexity changes dynamically to match the complexity of the task in the data.
Bayesian estimation uses Bayes' rule from probability theory (which we saw before), named after Thomas Bayes (1702–1761).
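For reference, Bayes' rule relates the posterior belief about the unknown parameters to the prior and the likelihood; written in LaTeX, with θ standing for the parameters and x for the observed data (notation assumed here, not taken from the book):

p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)}

that is, posterior = likelihood × prior / evidence.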
Artificial Neural Networks
Our brains make us intelligent; we see or hear, learn and remember, plan and act thanks to our brains. In trying to build machines to have such abilities then, our immediate source of inspiration is the human brain, just as birds were the source of inspiration in our early attempts to fly.
The basic idea in connectionist models is that intelligence is an emergent property: high-level tasks, such as recognition or association between patterns, arise automatically from the propagation of activity through the rather elementary operations of interconnected simple processing units. Similarly, learning is done at the connection level through simple operations, for instance according to the Hebbian rule, without any need for a higher-level programmer.
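As a rough sketch of what such connection-level learning might look like in code, here is a minimal Hebbian-style weight update in Python; the array shapes, variable names, and learning rate are assumptions for illustration, not the book's notation:

import numpy as np

# Hebbian rule: a connection is strengthened when the units it links
# are active at the same time ("cells that fire together wire together").
def hebbian_update(weights, pre, post, learning_rate=0.01):
    """Increase each weight in proportion to the product of the
    activations of the units on either side of the connection."""
    return weights + learning_rate * np.outer(post, pre)

# Toy example: 3 input units connected to 2 output units.
rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=(2, 3))   # synaptic weights
x = np.array([1.0, 0.0, 1.0])            # pre-synaptic activity
y = w @ x                                # post-synaptic activity
w = hebbian_update(w, x, y)
print(w)

Note that nothing here is programmed for a specific task; the weights simply change as a side effect of the activity passing through the connections.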
Neural Networks as a Paradigm for Parallel Processing
Since the 1980s, computer systems with thousands of processors have been commercially available. The software for such parallel architectures, however, has not advanced as quickly as the hardware, because almost all of our theory of computation has been based on serial, single-processor machines. We are not able to use parallel machines to their full capacity because we cannot program them efficiently.
Thus, artificial neural networks are a way to make use of the parallel hardware we can build with current technology and—thanks to learning—they need not be programmed.
This multilayered network is an example of a hierarchical cone where features get more complex, abstract, and fewer in number as we go up the network until we get to classes.
Deep Learning
Though these approaches have had some success, learning algorithms have recently been achieving higher accuracy with big data and powerful computers. With few assumptions and little manual interference, structures similar to the hierarchical cone are being learned automatically from large amounts of data. These learning approaches are especially interesting because, since they learn, they are not fixed for any specific task and can be used in a variety of applications.
Successive layers correspond to more abstract representations until we get to the final layer where the outputs are learned in terms of these most abstract concepts. We saw an example of this in the convolutional neural network where starting from pixels, we get to edges, and then to corners, and so on, until we get to a digit.
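To make the pixels-to-edges step concrete, here is a minimal numpy sketch of the kind of convolution a first layer applies; the filter and image below are hand-made for illustration, whereas in an actual convolutional network the filters would be learned from data:

import numpy as np

def convolve2d(image, kernel):
    """Slide a small filter over a grayscale image and record the
    response at each position (a 'valid' 2-D cross-correlation)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A hand-coded vertical-edge detector; a trained first layer would learn
# many such filters, and later layers would combine their responses into
# corners, parts, and eventually digits or objects.
vertical_edge = np.array([[1.0, 0.0, -1.0],
                          [1.0, 0.0, -1.0],
                          [1.0, 0.0, -1.0]])

image = np.zeros((8, 8))
image[:, 4:] = 1.0            # a dark-to-bright boundary down the middle
print(convolve2d(image, vertical_edge))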
In deep learning, the idea is to learn feature levels of increasing abstraction with minimal human contribution.
This is because in most applications we do not know what structure there is in the input, especially as we go up to higher levels of abstraction.
Deep learning methods are attractive mainly because they need less manual interference. We do not need to craft the right features or the suitable transformations. Once we have data—and nowadays we have “big” data—and sufficient computation available—and nowadays we have data centers with thousands of processors—we just wait and let the learning algorithm discover all that is necessary by itself.
The idea of multiple layers of increasing abstraction that underlies deep learning is intuitive.
One method for unsupervised learning is clustering, where the aim is to find clusters or groupings of input.
A clustering model allocates customers similar in their attributes to the same group, providing the company with natural groupings of its customers.
The aim in clustering in particular, or unsupervised learning in general, is to find structure in the data. In the case of supervised learning (for example, in classification), this structure is imposed by the supervisor who defines the different classes and labels the instances in the training data by these classes.
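As an illustration of finding such structure without labels, here is a minimal k-means clustering sketch in numpy; the synthetic customer data and the number of clusters are assumptions chosen just to show the idea:

import numpy as np

def kmeans(X, k, n_iters=50, seed=0):
    """Basic k-means: repeatedly assign each point to its nearest
    center, then move every center to the mean of its points."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # distances from every point to every center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# Two synthetic groups of customers described by two attributes each.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centers = kmeans(X, k=2)
print(centers)   # one center per discovered customer group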
Instead of learning association rules between pairs or triples of these items, if we can estimate the hidden baby factor based on past purchases, this will trigger an estimate of whatever it is that has not been bought yet.
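One common way to model such hidden factors is matrix factorization, sketched below in numpy; the toy purchase matrix, the number of factors, and the hyperparameters are illustrative assumptions, not an example from the book:

import numpy as np

def factorize(R, n_factors=2, lr=0.01, reg=0.1, n_iters=2000, seed=0):
    """Learn customer and product factor vectors whose dot products
    approximate observed purchases; entries that are zero (not bought
    yet) are then predicted from the learned hidden factors."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, n_factors))
    V = rng.normal(scale=0.1, size=(n_items, n_factors))
    observed = R > 0
    for _ in range(n_iters):
        err = observed * (R - U @ V.T)   # error only on observed entries
        U += lr * (err @ V - reg * U)
        V += lr * (err.T @ U - reg * V)
    return U @ V.T

# Rows are customers, columns are products; 1 means "bought".
R = np.array([[1.0, 1.0, 0.0, 0.0],
              [1.0, 1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0, 1.0]])
print(np.round(factorize(R), 2))   # scores for not-yet-bought items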
Machine learning algorithms perform a data-driven analysis: we can use any of the methods we discussed in previous chapters, for classification, regression, clustering, and so on, to build a model from the data.
Visualization is one of the best tools for data analysis, and sometimes just visualizing the data in a smart way is enough to understand the characteristics of a process that underlies a complicated data set, without any need for further complex and costly statistical processing.
Data Science
Most of the time data does not obey the parametric assumptions, such as the bell-shaped Gaussian curve, that we use in statistics to make estimation easier. Instead, with the new data, we need to resort to more flexible nonparametric models whose complexity can adjust automatically to the complexity of the task underlying the data. All these requirements make machine learning more challenging than statistics as we used to know and practice it.
In real-world applications, how efficiently the data is stored and manipulated may be as critical as the prediction accuracy.
One important point is that intelligence is a vague term, and its use in assessing the performance of computer systems may be misleading. For example, evaluating computers on tasks that are difficult for humans, such as playing chess, is not a good way to assess their intelligence.
For a computer, it is much more difficult to recognize the face of its opponent than to play chess.
In real life, all sorts of randomness occurs, and for its survival every species is slowly evolving to be a better cheater than the rest.
This makes trained software less predictable than programmed software.
There is an important risk in basing recommendations too much on past use and preferences. If a person only listens to songs similar to the ones they listened to and enjoyed before, or watches movies similar to those they watched and enjoyed before, or reads books similar to the books they read and enjoyed before, then there will be no new experience.
Current deep networks are not deep enough; they can learn enough abstraction in some limited context to recognize handwritten digits or a subset of objects, but they are far from having the capability of our visual cortex to recognize a scene.
Bayesian estimation A method for parameter estimation where we use not only the sample, but also the prior information about the unknown parameters given by a prior distribution.
Data mining Machine learning and statistical methods for extracting information from large amounts of data. For example, in basket analysis, by analyzing a large number of transactions, we find association rules.
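As a rough sketch of the quantities basket analysis computes, the following Python snippet calculates the support and confidence of a candidate association rule over a toy set of transactions; the items and the rule are made up for illustration:

# Toy transactions: each set is one customer's basket.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Among baskets containing the antecedent, the fraction that also
    contain the consequent, i.e. an estimate of P(consequent | antecedent)."""
    return support(antecedent | consequent) / support(antecedent)

# Evaluate the candidate rule "diapers -> beer".
print(support({"diapers", "beer"}))        # support of the rule
print(confidence({"diapers"}, {"beer"}))   # confidence of the rule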
Data warehouse A subset of data selected, extracted, and organized for a specific data analysis task. The original data may be very detailed and may lie in several different operational databases. The warehouse merges and summarizes them. The warehouse is read-only; it is used to get a high-level overview of the process that underlies the data either through OLAP and visualization tools, or by data mining software.
Deep learning Methods that are used to train models with several levels of abstraction from the raw input to the output. For example, in visual recognition, the lowest level is an image composed of pixels. As we go up through the layers, a deep learner combines these to form strokes and edges of different orientations, which can then be combined to detect longer lines, arcs, corners, and junctions, which in turn can be combined to form rectangles, circles, and so on. The units of each layer may be thought of as a set of primitives at a different level of abstraction.
If-then rules A model that can be written as a set of if-then rules is easy to understand, and hence rule bases allow knowledge extraction.
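Such a rule base reads almost directly as code; here is a minimal illustrative sketch in Python, with made-up attributes and thresholds:

def assess_risk(customer):
    """A tiny rule base: each if-then rule can be read, checked, and
    explained on its own, which is what makes knowledge extraction easy."""
    if customer["income"] > 50000 and customer["debt"] < 10000:
        return "low-risk"
    if customer["income"] > 50000:
        return "medium-risk"
    return "high-risk"

print(assess_risk({"income": 60000, "debt": 5000}))   # low-risk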
Neural network A model composed of a network of simple processing units called neurons and connections between neurons called synapses. Each synapse has a direction and a weight, and the weight defines the effect of the neuron before on the neuron after.
Occam’s razor A philosophical heuristic that advises us to prefer simple explanations to complicated ones.
Online analytical processing (OLAP) Data analysis software used to extract information from a data warehouse. OLAP is user-driven, in the sense that the user thinks of some hypotheses about the process and, using OLAP tools, checks whether the data supports those hypotheses. Machine learning is more data-driven, in the sense that automatic data analysis may find dependencies not previously thought of by users.

