Kindle Notes & Highlights
Before, data was what the programs processed and spit out—data was passive. With this question, data starts to drive the operation; it is not the programmers anymore but the data itself that defines what to do next.
customer behavior is not completely random. People do not go to supermarkets and buy things at random. When they buy beer, they buy chips; they buy ice cream in summer and spices for Glühwein in winter. There are certain patterns in customer behavior, and that is where data comes into play.
This is called data mining. The analogy is that a large volume of earth and raw material is extracted from the mine, which when processed
Data mining is one type of machine learning. We do not know the rules (of customer behavior), so we cannot write the program, but the machine—that is, the computer—“learns” by extracting such rules from (customer transaction) data.
Learning models are used in pattern recognition,
we use learning algorithms to make sense of the bigger and bigger data.
Learning versus Programming
An algorithm is a sequence of instructions that are carried out to transform the input to the output.
we would like the computer (the machine) to extract automatically the algorithm for this task.
Artificial Intelligence
A system that is in a changing environment should have the ability to learn; otherwise, we would hardly call it intelligent. If the system can learn and adapt to such changes, the system designer need not foresee and provide solutions for all possible situations.
Each of us, actually every animal, is a data scientist. We collect data from our sensors, and then we process the data to get abstract rules to perceive our environment and control our actions in that environment to minimize pain and/or maximize pleasure. We have memory to store those rules in our brains, and then we recall and use them when needed. Learning is lifelong; we forget rules when they no longer apply or revise them when the environment changes.
Whereas a computer generally has one or a few processors, the brain is composed of a very large number of processing units, namely, neurons, operating in parallel. Though the details are not completely known, the processing units are believed to be much simpler and slower than a typical processor in a computer.
Just as the initial attempts to build flying machines looked a lot like birds until we discovered the theory of aerodynamics, it is also expected that the first attempts to build structures possessing the brain’s abilities will look like the brain with networks of large numbers of processing units.
Pattern Recognition
we should keep in mind that just because we have a lot of data, it does not mean that there are underlying rules that can be learned. We should make sure that there are dependencies in the underlying process and that the collected data provides enough information for them to be learned with acceptable accuracy.
Machine Learning, Statistics, and Data Analytics
we use machine learning when we believe there is a relationship between observations of interest but do not know exactly how.
no matter how many properties we list as input, there are always other factors that affect the output; we cannot possibly record and take all of them as input, and all these other factors that we neglect introduce uncertainty.
expect customers in general to follow certain patterns in their purchases depending on factors such as the composition of their household, their tastes, their income, and so on. Still, there are always additional random factors that introduce variance: vacation, change in weather, some catchy advertisement, and so on.
Model Selection
if we believe that we can write the output as a weighted sum of the attributes, we can use a linear model where attributes have an additive effect—for example, each additional seat increases the value of the car by X dollars and each additional thousand miles driven decreases the value by Y dollars, and so on.
If a weight is estimated to be very close to zero, we can conclude that the corresponding attribute is not important and eliminate it from the model. These weights are the parameters of the model and are fine-tuned using data. The model is always fixed; it is the parameters that are adjustable, and it is this process of adjustment to better match the data that we call learning.
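The idea of a fixed model with adjustable parameters can be sketched in a few lines. Below, a one-attribute linear model (price as a function of mileage) is fit by closed-form least squares; the car data is made up purely for illustration.

```python
# A minimal sketch of "learning" as parameter adjustment:
# the model y = w0 + w1*x is fixed, and the weights w0, w1
# are tuned to best match the data (least squares, closed form).

def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    w1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
    w0 = mean_y - w1 * mean_x
    return w0, w1

# Hypothetical data: mileage (thousands of miles) vs. price (dollars)
miles = [10, 20, 30, 40, 50]
price = [19000, 18000, 17000, 16000, 15000]

w0, w1 = fit_line(miles, price)
print(w0, w1)  # learned intercept and per-thousand-miles weight
```

Here the learned weight w1 plays the role of "each additional thousand miles driven decreases the value by Y dollars"; a weight near zero would suggest the attribute can be dropped.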
Supervised Learning
Each model corresponds to a certain type of dependency assumption between the inputs and the output.
Learning corresponds to adjusting the parameters so that the model makes the most accurate predictions on the data.
0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55 You probably noticed that this is the Fibonacci sequence.
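Once the underlying rule is recognized, the whole sequence compresses into a two-line program—a concrete case of a simple model explaining the data:

```python
def fibonacci(n):
    """First n terms of the Fibonacci sequence:
    each term is the sum of the previous two."""
    seq = [0, 1]
    while len(seq) < n:
        seq.append(seq[-1] + seq[-2])
    return seq[:n]

print(fibonacci(11))  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
```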
In philosophy, Occam’s razor tells us to prefer simpler explanations, eliminating unnecessary complexity.
Human behavior is sometimes as much Dionysian as it is Apollonian.
Classification is another type of supervised learning where the output is a class code, as opposed to the numeric value we have in regression.
Expert Systems
Another way to represent uncertainty is to use probability theory, as we do in this book.
Pattern Recognition
In machine learning, the aim is to fit a model to the data.
diagnostics is the inference of hidden factors from observed variables.
conditional probability
Bayes’ rule,
Face Recognition
Outlier Detection
find instances that do not obey the general rule—those
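One simple sketch of this idea: flag instances that lie more than k standard deviations from the mean. The readings and threshold below are made up, and real systems use far more sophisticated methods.

```python
import statistics

def outliers(data, k=2.0):
    """Return values more than k sample standard deviations
    from the mean (a crude but illustrative rule)."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)
    return [x for x in data if abs(x - mean) > k * sd]

readings = [10, 11, 9, 10, 12, 10, 11, 48]
print(outliers(readings))  # [48]
```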
Dimensionality Reduction
when an input is deemed unnecessary, we save the cost of measuring it.
simpler models are more robust on small data sets; that is, they can be trained with fewer data; or when trained with the same amount of data, they have smaller variance (uncertainty).
when data can be explained with fewer features, we have a simpler model that is easier to interpret.
if we can find a good way to display the data, our visual cortex can do the rest, without any need for model fitting calculation.
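The simplest form of dimensionality reduction is feature selection: drop inputs that carry little information. The sketch below removes near-constant columns by variance; the data and threshold are made up, and methods such as PCA go much further.

```python
import statistics

def keep_informative(rows, threshold=1e-9):
    """Return indices of columns whose variance exceeds the
    threshold; near-constant columns can be dropped."""
    cols = list(zip(*rows))
    return [i for i, col in enumerate(cols)
            if statistics.pvariance(col) > threshold]

data = [
    [1.0, 5.0, 3.1],
    [2.0, 5.0, 2.9],
    [3.0, 5.0, 3.0],
]
print(keep_informative(data))  # column 1 is constant -> dropped
```

Besides saving measurement cost, the surviving features give a simpler model that is easier to interpret or display.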
Decision Trees
if-then rules
Trees are used successfully in various machine learning applications, and together with the linear model, the decision tree should be taken as one of the basic benchmark methods before any more complex learning algorithm is tried.
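A trained decision tree is literally a nest of if-then rules. The tiny hand-written tree below mimics what such a learner might produce for a made-up credit-scoring task; the thresholds and labels are hypothetical, as if learned from data.

```python
# A decision tree reduced to its essence: nested if-then rules.
# Thresholds and class labels are invented for illustration.

def classify(income, savings):
    """Classify an applicant for a hypothetical credit task."""
    if income > 40000:
        return "low-risk"
    else:
        if savings > 10000:
            return "low-risk"
        else:
            return "high-risk"

print(classify(50000, 2000))   # low-risk
print(classify(30000, 2000))   # high-risk
```

Reading the learned rules off the tree is exactly what makes it so interpretable, and why it serves well as a benchmark alongside the linear model.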

