Matt Mitchell’s Kindle Notes & Highlights for Fundamentals of Predictive Analytics with JMP

We believe there are six fundamental concepts:
● FC1: Always take a random and representative sample.
● FC2: Statistics is not an exact science.
● FC3: Understand a z-score.
● FC4: Understand the central limit theorem (not every distribution has to be bell-shaped).
● FC5: Understand one-sample hypothesis testing and p-values.
● FC6: Few approaches are correct and many wrong.

8%

What is a random and representative sample (called a 2R sample)?

8%

representative means representative of the population of interest.

8%

the population of interest is those individuals who are registered to vote and plan to vote.

8%

Random means that each individual has an equal chance of being selected.

8%

First, if the sample is a 2R sample, then the sample distribution of observations will follow a pattern resembling that of the population.

8%

The population parameters (such as the population mean, µ, the population variance, σ², or the population standard deviation, σ) are the true values of the population. These are the values that you are interested in knowing.

8%

Because the sample is a 2R sample, the sample distribution of observations is very similar to the population distribution of observations. Therefore, the sample statistics, calculated from the sample, are good estimates of their corresponding population parameters.

8%

The sample statistics (such as the sample mean, sample variance, and sample standard deviation) are estimates of their corresponding population parameters. It is highly unlikely that they will equal their corresponding population parameter.

8%

By using statistical techniques, you can test the likelihood of the population parameter being greater than 50%. (You can construct a confidence interval, and if the lower confidence limit is greater than 50%, you can be highly confident that the true population proportion is greater than 50%. Or you can conduct a hypothesis test to measure the likelihood that the proportion is greater than 50%.)
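
The confidence-interval check described above can be sketched in Python. This is only an illustration: the poll counts are made up, the function name `proportion_ci` is hypothetical, and the normal-approximation (Wald) interval is one common choice, not necessarily the one the book uses.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(successes, n, confidence=0.95):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p_hat = successes / n
    # z multiplier for a two-sided interval, e.g. about 1.96 for 95%
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


# Hypothetical 2R sample: 560 of 1,000 likely voters favor the candidate.
lower, upper = proportion_ci(560, 1000)
print(f"95% CI for the proportion: ({lower:.3f}, {upper:.3f})")
print("Lower limit above 50%?", lower > 0.5)
```

If the lower limit stays above 0.5, the sample supports the claim that the true population proportion exceeds 50%.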

8%

you must realize that these sample statistics are estimates, in that, if other 2R samples are taken, they will produce different estimates.

9%

The z-score (and the t-score) is not just a number. The z-score is how many standard deviations a value, like the 570, is from the mean of 500. The z-score can provide you some guidance, regardless of the shape of the distribution. A z-score with an absolute value greater than 3 is considered an outlier and highly unlikely.

9%

It depends on the spread of the data, which is measured by the standard deviation.

9%

In general, the z-score is like a traffic light. If its absolute value is greater than 3 (|z| > 3), the light is red; this is an extreme value. If |z| is between 1.65 and 3, the light is yellow; this value is borderline. If |z| is less than 1.65, the light is green, and the value is just considered random variation. (The cutpoints of 3 and 1.65 might vary slightly de...
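
The traffic-light rule can be written as a small Python sketch. The value 570 and the mean 500 come from the highlight above; the three standard deviations are assumed purely for illustration, and the function names are hypothetical.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / std_dev


def traffic_light(z, yellow=1.65, red=3.0):
    """Classify a z-score using the 1.65 and 3 cutpoints from the text."""
    size = abs(z)
    if size > red:
        return "red"      # extreme value
    if size >= yellow:
        return "yellow"   # borderline
    return "green"        # random variation


# Same value and mean, three assumed spreads: the verdict depends on
# the standard deviation, not on the raw distance of 70.
for sd in (50, 35, 20):
    z = z_score(570, 500, sd)
    print(f"sd={sd}: z={z:.2f} -> {traffic_light(z)}")
```

The raw distance from the mean is 70 in every case; only the spread changes the light.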

9%

in the real world, you only take one 2R sample.

9%

central limit theorem (CLT)

9%

The CLT will hold regardless of the shape of the population distribution of observations—whether it is normal, bimodal (like the sumo wrestlers and jockeys), or whatever shape, as long as a 2R sample is taken and the sample size is greater than 30.

9%

Then, the sampling distribution of sample means will be approximately normal, with a mean of x̄ and a standard deviation of (s / √n) ...

9%

You need to take only one 2R sample with a sample size greater than 30.

9%

If you have a 2R sample with a sample size greater than 30, you can approximate the sampling distribution of sample means by using the sample’s mean, x̄, and standard error, s / √n. If you collect a 2R sample with a sample size greater than 30, the CLT holds. As a result, you can use inferential statistics. That is, you can construct confidence intervals and perform hypothesis tests.
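
A quick simulation can make the CLT claim concrete. The sketch below builds a made-up bimodal population (loosely modeled on the jockeys and sumo wrestlers mentioned earlier; all weights are assumptions), draws many random samples of size 36, and compares the spread of the sample means with σ / √n. It then shows the standard error s / √n estimated from a single sample.

```python
import random
from math import sqrt
from statistics import mean, pstdev, stdev

random.seed(42)  # fixed seed so the run is repeatable

# Hypothetical bimodal population of weights in pounds.
population = ([random.gauss(115, 8) for _ in range(5000)]
              + [random.gauss(330, 40) for _ in range(5000)])

n = 36  # sample size greater than 30
sample_means = [mean(random.sample(population, n)) for _ in range(2000)]

theoretical_se = pstdev(population) / sqrt(n)
print("population mean:      ", round(mean(population), 1))
print("mean of sample means: ", round(mean(sample_means), 1))
print("sigma / sqrt(n):      ", round(theoretical_se, 1))
print("sd of sample means:   ", round(stdev(sample_means), 1))

# Share of sample means within 1.96 standard errors of the population
# mean; close to 0.95 if the sampling distribution is roughly normal.
center = mean(population)
inside = sum(abs(m - center) <= 1.96 * theoretical_se for m in sample_means)
print("share within 1.96 SE: ", inside / len(sample_means))

# In practice you take only one sample and estimate the standard error.
one_sample = random.sample(population, n)
standard_error = stdev(one_sample) / sqrt(n)
print("s / sqrt(n) from one sample:", round(standard_error, 1))
```

Even though the population has two humps, the sample means cluster around the population mean with a spread close to σ / √n.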

9%

the CLT is known as the “cornerstone of statistics.”

10%

You have now generated a random sample of 30. If you press F9, the random sample will change.

For Mac, press FN+F9 to do this.

10%

Again, as you press the F9 key, the random sample and corresponding frequency distribution change. (Hence, it is called a dynamic frequency distribution.)

10%

One of the inferential statistical techniques that you can apply, thanks to the CLT, is one-sample hypothesis testing of the mean.

10%

hypothesis testing consists of two hypotheses: the null hypothesis, called H0, and its opposite, the alternative hypothesis, called H1 or Ha. The null hypothesis for one-sample hypothesis testing of the mean tests whether the population mean is equal to, less than or equal to, or greater than or equal to a particular constant k: µ = k, µ ≤ k, or µ ≥ k.

10%

Once the hypotheses are identified, the test statistic is calculated.

10%

The calculated test statistic is called Zcalc. This Zcalc is compared to what here will be called the critical z, Zcritical. The Zcritical value is based on the level of significance, called α, which is usually equal to 0.10, 0.05, or 0.01.

10%

The level of significance can be viewed as the probability of making an error (that is, rejecting H0), given that H0 is correct.

10%

you want to keep the level of significance rather small.

10%

you want to keep the likelihood of making an error relatively small.

10%

If |Zcalc| > |Zcritical|, you reject H0. When you reject H0, there is enough statistical evidence to support H1.

10%

On the other hand, you fail to reject H0 when |Zcalc| ≤ |Zcritical|, and you conclude that there is not enough statistical evidence to support H1.
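
The decision rule above can be sketched as follows. This is a minimal two-tailed version (H0: µ = k against H1: µ ≠ k); the one-tailed cases would use a different critical value. The sample figures and the function name `one_sample_z_test` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z_test(x_bar, s, n, k, alpha=0.05):
    """Two-tailed one-sample z-test of H0: mu = k.

    Returns (z_calc, z_critical, reject_h0).
    """
    z_calc = (x_bar - k) / (s / sqrt(n))              # distance in standard errors
    z_critical = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    return z_calc, z_critical, abs(z_calc) > abs(z_critical)


# Hypothetical sample: mean 520, standard deviation 60, n = 36, tested against k = 500.
for alpha in (0.10, 0.05, 0.01):
    z_calc, z_critical, reject = one_sample_z_test(520, 60, 36, 500, alpha)
    verdict = "reject H0" if reject else "fail to reject H0"
    print(f"alpha={alpha}: Zcalc={z_calc:.2f}, Zcritical={z_critical:.3f} -> {verdict}")
```

The same Zcalc can lead to different decisions depending on the level of significance chosen.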

10%

As discussed under FC3, “Understand a Z-Score,” the |Zcalc| is not simply a number. It represents the number of standard deviations away from the mean that a value is.

10%

you reject H0 when the value is a relatively large number of standard deviations away from the hypothesized value.

10%

The p-value is the probability, assuming H0 is true, of obtaining a test statistic at least as extreme as the one calculated. Thus, in terms of the one-sample hypothesis test using the Z, the p-value is the probability that is associated with Zcalc.

11%

General interpretation of a p-value is as follows:
● Less than 1%: There is overwhelming evidence that supports the alternative hypothesis.
● Between 1% and 5%: There is strong evidence that supports the alternative hypothesis.
● Between 5% and 10%: There is weak evidence that supports the alternative hypothesis.
● Greater than 10%: There is little to no evidence that supports the alternative hypothesis.
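
As an illustration, the interpretation bands above can be paired with a two-tailed p-value computed from Zcalc under the standard normal distribution. The function names are hypothetical, and the handling of values that fall exactly on a boundary (1%, 5%, 10%) is an assumption, since the highlight does not say which band they belong to.

```python
from statistics import NormalDist


def two_tailed_p_value(z_calc):
    """Probability of a |Z| at least as large as |z_calc| when H0 is true."""
    return 2 * (1 - NormalDist().cdf(abs(z_calc)))


def evidence_for_h1(p_value):
    """Map a p-value to the strength-of-evidence bands from the text.

    Assumption: a value exactly on a boundary goes to the weaker band.
    """
    if p_value < 0.01:
        return "overwhelming"
    if p_value < 0.05:
        return "strong"
    if p_value < 0.10:
        return "weak"
    return "little to no"


for z in (1.0, 1.8, 2.0, 3.0):
    p = two_tailed_p_value(z)
    print(f"Zcalc={z}: p-value={p:.4f} -> {evidence_for_h1(p)} evidence for H1")
```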

11%

Two major questions should be asked when considering the use of a statistical approach or technique: ● Is it statistically appropriate?

11%

What will it possibly tell you?

11%

with categorical data, you cannot measure distance.

11%

Simply in terms of graphing, you would use bar and pie charts for categorical data but not for continuous data. On the other hand, graphing a continuous variable requires a histogram or box plot.

11%

When summarizing data, descriptive statistics are insightful for continuous variables. A frequency distribution is much...

11%

Countif.xls in worksheet rawdata.

11%

Major and gender (and correspondingly gender code) are examples of nominal data. The Likert scale of usefulness is an example of ordinal data. Salary, GPA, and years are examples of continuous data.

11%

descriptive statistics are valuable in understanding the continuous data. For example, because the average is somewhat less than the median, the salary data could be considered slightly left-skewed, with a minimum of $31,235 and a maximum of $65,437.
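
The mean-versus-median comparison can be sketched in a few lines of Python. Only the minimum ($31,235) and maximum ($65,437) come from the highlight; the other salaries are invented for illustration, and comparing mean and median is a rough rule of thumb, not a formal skewness test.

```python
from statistics import mean, median


def skew_hint(values):
    """Rough skew direction from comparing the mean with the median."""
    m, med = mean(values), median(values)
    if m < med:
        return "left-skewed"
    if m > med:
        return "right-skewed"
    return "roughly symmetric"


# Hypothetical salaries; only the minimum and maximum appear in the text.
salaries = [31235, 48000, 52000, 54000, 55000, 56000, 58000, 60000, 65437]

print("mean:  ", round(mean(salaries), 2))
print("median:", median(salaries))
print("min:", min(salaries), "max:", max(salaries))
print("hint:", skew_hint(salaries))
```

A low value pulls the mean below the median, which is the pattern the highlight describes.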

11%

All the categorical variables (Major, Gender, Usefulness, and Gender code), whether they are nominal or ordinal, have frequency numbers and a histogram, and no descriptive statistics. But the continuous variables have descriptive statistics and a histogram.
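
The same split between frequency counts and descriptive statistics can be mimicked outside JMP. The sketch below is not JMP's behavior; it is a hypothetical `summarize` helper with made-up data that treats numeric columns as continuous and everything else as categorical. Note the limitation: a categorical variable stored as numbers (such as a gender code) would be treated as continuous unless it is converted to text first.

```python
from collections import Counter
from statistics import mean, median, stdev


def summarize(values):
    """Frequency counts for categorical data, descriptive statistics for numeric data."""
    if all(isinstance(v, (int, float)) for v in values):
        return {
            "n": len(values),
            "mean": mean(values),
            "median": median(values),
            "stdev": stdev(values),
            "min": min(values),
            "max": max(values),
        }
    return dict(Counter(values))


# Hypothetical columns.
majors = ["Finance", "Marketing", "Finance", "Accounting"]  # nominal
gpa = [3.2, 3.8, 2.9, 3.5]                                  # continuous

print(summarize(majors))
print(summarize(gpa))
```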

12%

Most of the time in JMP, if you are looking for more information to display or for statistical options, they can usually be found by clicking the red triangle.