Fundamentals of Predictive Analytics with JMP
Read between January 2 and March 20, 2018
Click the red triangle next to Summary Statistics (note that summary statistics are listed for continuous variables only) and click Customize Summary Statistics. Select the check box for each summary statistic that you want displayed, such as Median, Minimum, or Maximum, and then click OK.
A positive value implies that as Years increases, Salary also increases; that is, the slope is positive. In contrast, a negative relationship has a negative slope: as the X variable increases, the Y variable decreases.
RSquare values can range from 0 (no linear relationship) to 1 (an exact/perfect linear relationship).
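For reference, RSquare is the share of the total variation in Y that the model explains; with SSE the sum of squared errors and SST the total sum of squares, the standard definition is

$$R^2 = 1 - \frac{SSE}{SST} = \frac{\sum_i (\hat{y}_i - \bar{y})^2}{\sum_i (y_i - \bar{y})^2}$$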
(A negative correlation implies a negative linear relationship, and a positive correlation implies a positive linear relationship.)
With a p-value of 0.1993, you would fail to reject H0 and conclude that there is not a significant relationship between Major and Gender.
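As an aside, the same kind of test of independence can be sketched outside JMP; the contingency counts below are invented for illustration, not the book's Major-by-Gender data.

```python
# Chi-square test of independence on a hypothetical Major-by-Gender
# table; the counts are invented for illustration.
import numpy as np
from scipy.stats import chi2_contingency

counts = np.array([[20, 15],    # hypothetical Major A: female, male
                   [12, 18],    # hypothetical Major B: female, male
                   [ 9, 11]])   # hypothetical Major C: female, male

chi2, p, dof, expected = chi2_contingency(counts)
print(f"chi-square = {chi2:.3f}, p-value = {p:.4f}, df = {dof}")
# A p-value above 0.05 (like the 0.1993 above) means failing to reject H0.
```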
In the bivariate analysis diagram of the Fit Y by X dialog box, JMP helps the analyst select the proper statistical method to use. The Y variable is usually considered the dependent variable.
Depending on the type of data, some techniques are appropriate and some are not.
Just because an approach or technique appears appropriate, before running it you need to step back and ask yourself what the results could provide. Part of that answer requires understanding of the actual problem situation being solved or examined.
When using a certain technique, three outcomes are possible:
●   The technique is not appropriate to use with the data and should not be used.
●   The technique is appropriate to use with the data; however, the results are not meaningful.
●   The technique is appropriate to use with the data, and the results are meaningful.
Regression analysis typically has one of two main purposes. First, it might be used to understand the cause-and-effect relationship between one dependent variable and one or more independent variables; for example, it might answer the question, how does the amount of advertising affect sales? Second, regression might be applied for prediction, in particular for forecasting a dependent variable based on one or more independent variables.
Regression analyses can handle linear or nonlinear relationships, although linear models are the ones most often emphasized.
The Fit Line option performs a simple linear regression and creates the tables that contain the regression results.
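For readers working outside JMP, here is a minimal sketch of the same simple linear regression using Python's statsmodels; the advertising and sales numbers are invented for illustration.

```python
# Simple linear regression sketch; the Adver and Sales values are invented.
import numpy as np
import statsmodels.api as sm

adver = np.array([10.0, 12, 15, 17, 20, 22, 25, 28])
sales = np.array([105.0, 118, 132, 141, 160, 168, 185, 199])

X = sm.add_constant(adver)       # add the intercept column
model = sm.OLS(sales, X).fit()   # fit by ordinary least squares

print(model.params)              # intercept and slope of the fit line
print(model.rsquared)            # RSquare of the fit
```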
The F test and t test, in simple linear regression, are equivalent because they both test whether the independent variable (Adver) is significantly related to the dependent variable (Sales).
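Continuing the hypothetical fit above, the equivalence is easy to verify numerically: the overall F statistic equals the square of the slope's t statistic.

```python
# In simple linear regression, F = t^2 for the slope (same p-value).
t_slope = model.tvalues[1]
print(model.fvalue)        # overall F statistic
print(t_slope ** 2)        # squared slope t statistic; matches F
```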
The oval shape in each scatterplot is the corresponding bivariate normal density ellipse of the two variables. If the two variables are bivariate normally distributed, then about 95% of the points would be within the ellipse. If the ellipse is rather wide (round) and does not follow either of the diagonals, then the two variables do not have a strong correlation.
The more significant the correlation, the narrower the ellipse and the more closely it follows one of the diagonals.
The Effect Summary report lists the LogWorth or False Discovery Rate (FDR) LogWorth values in ascending p-value order. These statistical values measure the effects of the independent variables in the model.
A LogWorth value greater than 2 corresponds to a p-value of less than 0.01. The FDR LogWorth is a better statistic for assessing significance because it adjusts the p-values to account for multiple testing.
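Since LogWorth is -log10(p-value), the correspondence between LogWorth 2 and p = 0.01 can be checked directly:

```python
# LogWorth = -log10(p-value); LogWorth > 2 is equivalent to p < 0.01.
import math

for p in (0.05, 0.01, 0.001):
    print(f"p = {p}: LogWorth = {-math.log10(p):.2f}")   # 0.01 -> 2.00
```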
As listed under the Parameter Estimates, the multiple linear regression equation is as follows: Sales = −1485.88 + 1.97 * Time + 0.04 * MktPoten + 0.15 * Adver + 198.31 * MktShare + 295.87 * Change + 5.61 * Accts + 19.90 * WkLoad
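As a sketch, the fitted equation can be turned into a prediction function; the predictor values in the example call are invented, not taken from the book's sales performance data.

```python
# The book's fitted multiple regression equation as a function;
# the example inputs below are hypothetical.
def predict_sales(time, mkt_poten, adver, mkt_share, change, accts, wk_load):
    return (-1485.88 + 1.97 * time + 0.04 * mkt_poten + 0.15 * adver
            + 198.31 * mkt_share + 295.87 * change
            + 5.61 * accts + 19.90 * wk_load)

print(predict_sales(time=80, mkt_poten=40_000, adver=5_000,
                    mkt_share=7.5, change=0.3, accts=60, wk_load=15))
```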
Each independent variable's regression coefficient represents an estimate of the change in the dependent variable for a unit increase in that independent variable while all the other independent variables are held constant.
The larger the absolute value of the standardized beta coefficient, the more important the variable.
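One common way to obtain standardized betas outside JMP is to z-score every variable and refit; below is a sketch on an invented data frame (the column names Adver, Accts, and Sales are hypothetical).

```python
# Standardized betas: z-score all variables, refit, compare |slopes|.
# The data frame is invented for illustration.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
df = pd.DataFrame({"Adver": rng.normal(50, 10, 100),
                   "Accts": rng.normal(60, 15, 100)})
df["Sales"] = 3 * df["Adver"] + 0.5 * df["Accts"] + rng.normal(0, 5, 100)

z = (df - df.mean()) / df.std()                # z-score every column
zfit = sm.OLS(z["Sales"], sm.add_constant(z[["Adver", "Accts"]])).fit()
print(zfit.params)   # the slopes are the standardized betas
```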
The Prediction Profiler displays a graph for each independent variable X against the dependent Y variable.
These transformed residuals are called X leverage residuals and Y leverage residuals. The black line represents the predicted values for individual X values, and the blue dotted line is the corresponding 95% confidence interval. If the confidence region between the upper and lower confidence interval crosses the horizontal line, then the effect of X is significant.
The process for evaluating the statistical significance of a regression model is as follows (a compact code sketch of these checks appears after the list):
1.   Determine whether the model is good or bad:
     a.   Conduct an F test.
     b.   Conduct a t test for each independent variable.
     c.   Examine the residual plot (if time series data, conduct a Durbin-Watson test).
     d.   Assess the degree of multicollinearity (variance inflation factor, VIF).
2.   Determine the goodness of fit:
     a.   Compute the Adjusted R2.
     b.   Compute the RMSE (or se).
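A compact sketch of this checklist with statsmodels, reusing the invented data frame df from the standardized-beta sketch above:

```python
# Walking the checklist above on the invented data frame df.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = sm.add_constant(df[["Adver", "Accts"]])
fit = sm.OLS(df["Sales"], X).fit()

print(fit.fvalue, fit.f_pvalue)    # 1a. overall F test
print(fit.pvalues)                 # 1b. t test p-value per coefficient
resid = fit.resid                  # 1c. residuals, to be plotted/inspected
for i, name in enumerate(X.columns[1:], start=1):
    print(name, variance_inflation_factor(X.values, i))   # 1d. VIF
print(fit.rsquared_adj)            # 2a. Adjusted R^2
print(np.sqrt(fit.mse_resid))      # 2b. RMSE (root mean square error)
```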
The F test, in multiple regression, is known as an overall test.
The hypotheses for the F test are as follows, where k is the number of independent variables:
H0: β1 = β2 = … = βk = 0
H1: not all are equal to 0
If you fail to reject H0 in the F test, then the overall model is not statistically significant: it shows no linear relationship between the dependent variable and the set of independent variables.
On the other hand, if you reject H0, then you can conclude that one or more of the independent variables are linearly related to the dependent variable.
If you reject the F test, you can conclude that one or more of the independent variables are significantly related to the dependent variable.
You are not testing whether independent variable k, xk, is significantly related to the dependent variable. What you are testing is whether xk is significantly related to the dependent variable above and beyond all the other independent variables that are currently in the model.
If βk cannot be determined to be significantly different from 0, then you cannot conclude that xk has an effect on Y. Again, in this situation, you want to reject H0 (the hypothesis that βk = 0).
Examine the plot for any patterns (oscillating too often, or steadily increasing or decreasing values) and for outliers. Here, the data appear to be random.
If the observations were taken in some time sequence, called a time series (not applicable to the sales performance data set), the Durbin-Watson test should be performed.
In general, you want high p-values. High p-values of the Durbin-Watson test indicate that there is no problem with first-order autocorrelation.
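Outside JMP, statsmodels reports the Durbin-Watson statistic itself rather than a p-value; values near 2 suggest no first-order autocorrelation. A sketch on the residuals of the hypothetical fit above:

```python
# Durbin-Watson statistic on the residuals of the earlier hypothetical fit;
# values near 2 indicate no first-order autocorrelation.
from statsmodels.stats.stattools import durbin_watson

print(durbin_watson(fit.resid))
```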
Multicollinearity occurs when two or more independent variables explain the same variability of Y. Multicollinearity does not violate any of the statistical assumptions of regression.
Significant multicollinearity is likely to make it difficult to interpret the meaning of the regression coefficients of the independent variables.
A measure of multicollinearity is the variance inflation factor (VIF). By definition, it must be greater than or equal to 1.
The basic guidelines (Marquardt 1980; Snee 1973) for identifying whether significant multicollinearity exists are as follows:
●   1 ≤ VIFk ≤ 5 means no significant multicollinearity.
●   5 < VIFk ≤ 10 means that you should be concerned that some multicollinearity might exist.
●   VIFk > 10 means significant multicollinearity.
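For reference, the VIF for the kth independent variable is computed from the RSquare obtained by regressing xk on the remaining independent variables:

$$VIF_k = \frac{1}{1 - R_k^2}$$

which is why it can never fall below 1.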
One perspective is to include those nonsignificant or high-VIF variables because they explain some, although not much, of the variation in the dependent variable.
The other point of view follows the principle of parsimony, which states that the smaller the number of variables in the model, the better.
There are numerous approaches and several statistical variable selection techniques to achieve this goal of keeping only significant independent variables in the model. Stepwise regression is one of the simplest approaches.
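A minimal forward-selection sketch (entry threshold p < 0.05), run on the invented data frame df from earlier; this illustrates the idea, not JMP's stepwise implementation.

```python
# Forward stepwise selection: repeatedly add the candidate with the
# smallest p-value while that p-value stays below the entry threshold.
import statsmodels.api as sm

def forward_stepwise(data, response, threshold=0.05):
    remaining = [c for c in data.columns if c != response]
    selected = []
    while remaining:
        # p-value each candidate would have if added to the model
        pvals = {}
        for cand in remaining:
            X = sm.add_constant(data[selected + [cand]])
            pvals[cand] = sm.OLS(data[response], X).fit().pvalues[cand]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= threshold:
            break                  # nothing left that qualifies
        selected.append(best)
        remaining.remove(best)
    return selected

print(forward_stepwise(df, "Sales"))   # e.g., ['Adver', 'Accts']
```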
The goodness of fit of the regression model is measured by the Adjusted R2 and the se (or RMSE).
The Adjusted R2 measures the percentage of the variability in the dependent variable that is explained by the set of independent variables and is adjusted for the number of independent variables (Xs) in the model. If the purpose of performing the regression is to understand the relationships of the independent variables with the dependent variable, the Adjusted R2 value is a major assessment of the goodness of fit.
The higher the Adjusted R2, the better the fit.
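The adjustment penalizes model size; with n observations and k independent variables, the standard formula is

$$\text{Adjusted } R^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}$$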
On the other hand, if the regression model is for prediction/forecasting, the value of the se is of more concern. A smaller se generally means a smaller forecast error.
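For reference, se (the RMSE) estimates the standard deviation of the errors; for a model with k independent variables fit to n observations,

$$s_e = \text{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - k - 1}}$$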