Complete Test Bank + Intermediate Statistical + Chapter.4 - Intermediate Statistical Investigations 1st Ed - Exam Bank by Nathan Tintle. DOCX document preview.

Chapter 4

Intermediate Statistical Investigations Test Bank

Question types: FIB = Fill in the blank; Calc = Calculation; Ma = Matching; MS = Multiple select; MC = Multiple choice; TF = True-false

CHAPTER 4 TERMINAL LEARNING OUTCOMES

TLO4-1: Represent the association between two quantitative variables with a linear regression model, and compare and contrast with a separate-means model.

TLO4-2: Assess the evidence of a linear association between two quantitative variables using both simulation and theory-based approaches.

TLO4-3: Use a regression model to represent the adjusted relationship between two quantitative variables based on a binary categorical variable.

TLO4-4: Include interaction between quantitative and categorical variables in a multiple regression model, and interpret the nature of the interaction.

TLO4-5: Interpret an interaction between a quantitative variable and a multi-level categorical variable in a multiple regression model.

Section 4.1: Quantitative Explanatory Variables

LO4.1-1: Describe the association between two quantitative variables numerically and graphically.

LO4.1-2: Interpret least squares regression models between two quantitative variables.

LO4.1-3: Compare and contrast separate means vs. linear regression models.

Questions 1 through 6: Does vitamin C affect tooth growth in guinea pigs? Sixty guinea pigs were randomly assigned to one of three dose levels of vitamin C (0.5, 1, or 2 mg/day). The response variable was the length of odontoblasts – cells responsible for tooth growth. The table below summarizes three models that could be used to analyze this data.

Predicted Length at each Dosage

| Model | 0.5 mg | 1 mg | 2 mg | SSError |
| --- | --- | --- | --- | --- |
| Single-mean | 18.813 | 18.813 | 18.813 | 3452.2 |
| Separate-means | 10.605 | 19.735 | 26.100 | 1025.8 |
| Linear | 12.304 | 17.186 | 26.950 | ? |

  1. If one of these models were used to predict length for a dosage of 1.5 mg/day, that would be called ________ (extrapolation/interpolation).

If one of these models were used to predict length for a dosage of 5 mg/day, that would be called ________ (extrapolation/interpolation).

  1. Which of the models could be used to predict length for a dosage of 1.5 mg/day? Select all that apply.
    1. Single-mean model
    2. Separate-means model
    3. Linear model
  2. Fill in the degrees of freedom for the separate-means model and the linear model.

Separate-means model

| Source | DF |
| --- | --- |
| Model | |
| Error | |
| Total | |

Linear model

| Source | DF |
| --- | --- |
| Model | |
| Error | |
| Total | |

  1. Calculate the y-intercept and slope for the linear regression model. Keep three decimal places in your answer.

y-intercept = _______ slope = _______
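One way to approach this calculation is to note that all three predictions in the Linear row of the table must lie on a single straight line, so any two of them determine the slope and intercept. The sketch below assumes that reading of the table; the helper name `line_through` is illustrative and not part of the test bank.

```python
# Predicted lengths from the "Linear" row of the table, keyed by dose (mg/day).
linear_predictions = {0.5: 12.304, 1.0: 17.186, 2.0: 26.950}


def line_through(x1, y1, x2, y2):
    """Return (intercept, slope) of the straight line through two points."""
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    return intercept, slope


# Use the predictions at 0.5 and 1 mg/day to recover the line.
intercept, slope = line_through(
    0.5, linear_predictions[0.5], 1.0, linear_predictions[1.0]
)

# Sanity check: the tabulated prediction at 2 mg/day should fall on the same line.
check_at_2mg = intercept + slope * 2.0

print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")
print(f"prediction at 2 mg/day = {check_at_2mg:.3f}")
```

The check at 2 mg/day is a useful habit: if the third prediction did not land on the line, the two chosen points would have been misread.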

  1. How does SSError for the linear model compare to SSError for the other models?
    1. SSError for the linear model is less than 1025.8.
    2. SSError for the linear model is greater than 1025.8 but less than 3452.2.
    3. SSError for the linear model is greater than 3452.2.
    4. There is not enough information to determine how SSError for the linear model compares to SSError for the other models.
  2. Does the table provide any indication that the linear model is not a good fit for the data?
    1. Yes, because the separate-means model and the linear model lead to substantially different predicted values, especially for dosages of 0.5 and 1.
    2. Yes, because 17.186 – 12.304 does not equal 26.950 – 17.186. If there were a linear relationship, we’d expect the predicted values to increase at a constant rate.
    3. No, because the linear model shows that odontoblast lengths increase at a constant rate of 9.764 units per mg/day of Vitamin C.
    4. No, because length and dosage are both quantitative variables, so a linear regression model is the most appropriate analysis method.

Questions 7 through 9: A dataset contains information about 325 books for sale at amazon.com. The dataset includes two prices for each book: List Price (the price set by the publisher in dollars) and the Amazon Price (in dollars). The linear model that uses a book’s list price to predict its Amazon price is given below.

  1. Interpret the slope by filling in the blanks.

As the ______ (Amazon/list) price increases by 1 dollar, the ______ (Amazon/list) price is predicted to increase by ______ dollar(s).

  1. Does the y-intercept have a meaningful interpretation in this context?
    1. Yes. The y-intercept describes how SSError for the linear model compares to SSError for the single-mean model.
    2. Yes. The y-intercept describes how Amazon prices compare to list prices, on average.
    3. No. $0 is not a reasonable value for the Amazon price, so the intercept does not have a meaningful interpretation.
    4. No. $0 is not a reasonable value for the list price, so the intercept does not have a meaningful interpretation.
  2. Calculate the residual for a book that has a list price of $12.95 and an Amazon price of $5.18.

Sol:

Questions 10 and 11: The graph below shows the relationship between life expectancy (in years) and fertility rate (babies per woman) in 1960. Each dot represents a country, and the size of the dot indicates the country’s population.

  1. If you fit a linear model to these data…

The slope would be _________ (positive/negative).

The y-intercept would be _________ (positive/negative).

  1. In 1960, China had a fertility rate of 3.99 babies per woman and life expectancy of 31.6 years.

The residual for China would be _________ (positive/negative), which means that the linear model would _________ (overestimate/underestimate) China’s life expectancy.

  1. The scatterplots below show two different datasets (A and B).

"Two dotplots describe the relationship between Explanatories and the responses. In the first dotplot, the horizontal axis is labeled Explanatory A and has markings from 1 to 4 in increments of 1. The vertical axis is labeled Response and has markings from 4 to 10 in increments of 2. The dots are plotted as follows: (1, 3), (1, 4), (1, 5), (1, 6), (1, 7), (2, 6), (2, 7), (2, 8), (2, 9), (2, 10), (3, 5), (3, 6), (3, 7), (3, 8), (3, 9), (4, 4), (4, 5), (4, 6), (4, 7), and (4, 8).
In the second dotplot, the horizontal axis is labeled Explanatory B and has markings from 1 to 4 in increments of 1. The vertical axis is labeled Response and has markings from 4 to 10 in increments of 2. The dots are plotted as follows: (1, 3), (1, 4), (1, 5), (1, 6), (1, 7), (2, 4), (2, 5), (2, 6), (2, 7), (2, 8), (3, 5), (3, 6), (3, 7), (3, 8), (3, 9), (4, 6), (4, 7), (4, 8), (4, 9), and (4, 10)."

In Dataset A, R² for the separate-means model _______ (>, <, =) R² for the linear model.

In Dataset B, R² for the separate-means model _______ (>, <, =) R² for the linear model.
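Because the plot description lists every point, the comparison can be checked numerically. The following is a minimal sketch, assuming the listed coordinates are exact; the function name `r_squared_both` is illustrative. It fits the separate-means model (one mean per x value) and the least squares line, then reports R² for each.

```python
def r_squared_both(points):
    """Return (R^2 for separate-means model, R^2 for linear model)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    n = len(points)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_total = sum((y - y_bar) ** 2 for y in ys)

    # Separate-means model: predict each y by the mean of its own x group.
    groups = {}
    for x, y in points:
        groups.setdefault(x, []).append(y)
    group_mean = {x: sum(v) / len(v) for x, v in groups.items()}
    sse_means = sum((y - group_mean[x]) ** 2 for x, y in points)

    # Linear model: least squares slope and intercept.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    sse_linear = sum((y - (intercept + slope * x)) ** 2 for x, y in points)

    return 1 - sse_means / ss_total, 1 - sse_linear / ss_total


# Points as listed in the plot description: five responses at each x value.
dataset_a = [(x, y) for x, lo in [(1, 3), (2, 6), (3, 5), (4, 4)]
             for y in range(lo, lo + 5)]
dataset_b = [(x, y) for x, lo in [(1, 3), (2, 4), (3, 5), (4, 6)]
             for y in range(lo, lo + 5)]

r2_means_a, r2_linear_a = r_squared_both(dataset_a)
r2_means_b, r2_linear_b = r_squared_both(dataset_b)
print(f"Dataset A: separate-means {r2_means_a:.3f}, linear {r2_linear_a:.3f}")
print(f"Dataset B: separate-means {r2_means_b:.3f}, linear {r2_linear_b:.3f}")
```

The separate-means R² can never be smaller than the linear R² on the same data, since the line is a constrained version of the group means; the two agree only when the group means themselves fall on a line.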

  1. Which sum of squares is minimized by the least squares regression line?
    1. SSModel
    2. SSError
    3. SSTotal
    4. All three of these values are minimized by the least squares regression line.
  2. True or False: When two models have similar SSError values, you should generally choose the one that uses more degrees of freedom (DF for Model is higher).
  3. You are considering two models to predict a quantitative response: a separate-means model and linear model. You find that SSError is smaller for the separate-means model, yet the SE of the residuals is smaller for the linear model. Does this indicate that you have made a calculation mistake?
    1. Yes. When the data are quantitative, SSError for the linear model should always be smaller than SSError for the separate-means model.
    2. Yes. SSError and the SE of the residuals both measure the amount of prediction error. As SSError increases, SE of the residuals should always increase.
    3. No. The separate-means model and the linear model have different degrees of freedom for error, so SSError and the SE of the residuals don’t always “agree.”
    4. No. Despite their similar names, SSError and the SE of the residuals measure completely different things. It is not surprising that these two values do not “agree.”
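The relationship between SSError and the SE of the residuals in the last question can be illustrated with a short calculation, using SE of residuals = sqrt(SSError / DF for Error). Only the guinea pig value (SSError = 1025.8 with 60 − 3 = 57 error DF) comes from this chapter; the other numbers are hypothetical and chosen purely to show that the two measures can disagree.

```python
import math


def residual_se(ss_error, df_error):
    """SE of the residuals: square root of SSError divided by error DF."""
    return math.sqrt(ss_error / df_error)


# From the guinea pig table: separate-means model, 60 animals, 3 groups.
guinea_pig_se = residual_se(1025.8, 60 - 3)

# Hypothetical example: 12 observations, explanatory variable with 6 levels.
n = 12
sse_means, df_means = 60.0, n - 6    # separate-means model: 6 group means
sse_linear, df_linear = 70.0, n - 2  # linear model: intercept and slope

se_means = residual_se(sse_means, df_means)
se_linear = residual_se(sse_linear, df_linear)

print(f"guinea pig separate-means SE of residuals = {guinea_pig_se:.3f}")
print(f"separate-means: SSError = {sse_means}, SE = {se_means:.3f}")
print(f"linear:         SSError = {sse_linear}, SE = {se_linear:.3f}")
```

In the hypothetical case the separate-means model has the smaller SSError but the larger SE of the residuals, because it spends more degrees of freedom on the model.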

Section 4.2: Inference for Simple Linear Regression

LO4.2-1: Carry out simulation-based inference to assess the strength of evidence of a linear association between the quantitative explanatory and response variables.

LO4.2-2: Use a theory-based approach to assess the strength of evidence of a linear association between the quantitative explanatory and response variables.

LO4.2-3: Evaluate the validity of the theory-based approach using residual plots.

Questions 1 through 6: A teacher randomly assigned students to a seat in one of six rows during a particular unit of instruction. At the end of the unit, the teacher recorded each student’s score on the unit test. The scatterplot and least squares regression equation are given below. Do these data provide convincing evidence that sitting further away from the front of the room causes lower scores? (Higher row numbers are further from the front.)

A scatterplot plots the relationship between row and score. The horizontal axis is labeled Row and ranges from 0 to 6 in increments of 1. The vertical axis is labeled Score and has markings from 50 to 100 in increments of 10. Dots are plotted vertically for certain markings on the horizontal axis and some dots are randomly scattered throughout the graph. A regression line starts from 80 on the vertical axis, decreases to the right such that some of the dots lie above the line, some of the dots lie below the line, and few dots lie on the line. The dots are plotted from 1 to 6 on the horizontal axis and from 52 to 100 on the vertical axis. The concentration of dots is more between 1 and 6 on the horizontal axis and between 60 and 90 on the vertical axis. All values are approximate.

  1. Match H0 and Ha to the appropriate statement in words. Two of the statements will not be used.

A. Sitting further from the front causes lower scores, on average.

B. Sitting further from the front causes higher scores, on average.

C. Sitting further from the front has an effect on average scores.

D. There is no association between seat location and score.

  1. You decide to use the 3S strategy to evaluate the evidence. Which of the following are statistics you could use to summarize the linear association? Select all that apply.
    1. Sample mean
    2. Sample proportion
    3. Correlation
    4. Slope
  2. You decide to use the 3S strategy to evaluate the evidence. How would you conduct the simulation?
    1. Write the scores on cards. Shuffle and deal the cards into two groups to represent seats close to the board and seats far from the board.
    2. Write the scores on cards. Shuffle and randomly re-assign the scores to row numbers 1 through 6.
    3. For each data point, randomly select an integer between 1 and 6 as the row number and randomly select an integer between 0 and 100 as the test score.
    4. For each data point, randomly select an integer between 1 and 6 as the row number and randomly select an integer between 52 and 100 as the test score.
  3. You decide to use the 3S strategy to evaluate the evidence. The dotplot below shows the shuffled slopes for 100 repetitions of the simulation.

A dotplot depicts the shuffled slopes for 100 repetitions of the simulation. The horizontal axis has markings from negative 1.5 to 1.5 in increments of 0.5. A series of dots is plotted vertically for certain markings on the horizontal axis. The dots are plotted as follows: 1 dot above negative 1.95, negative 1.85, negative 1.65, negative 1.4; 2 dots above negative 1.25, negative 1.2, negative 1.05; 1 dot above negative 0.95, negative 0.90, negative 0.75; 5 dots above negative 0.70; 2 dots above negative 0.65; 3 dots above negative 0.55; 1 dot above negative 0.45; 2 dots above negative 0.40; 1 dot above negative 0.35, negative 0.25; 3 dots above negative 0.30; 1 dot above negative 0.25; 4 dots above negative 0.20; 1 dot above negative 0.15; 5 dots above negative 0.10; 4 dots above negative 0.05, 0.05; 3 dots above 0.15; 1 dot above 0.20; 7 dots above 0.25; 1 dot above 0.30; 3 dots above 0.35; 1 dot above 0.40, 0.45; 2 dots above 0.5, 0.55; 1 dot above 0.60; 2 dots above 0.65; 1 dot above 0.70, 0.75; 2 dots above 0.80, 0.85; 3 dots above 0.90; 1 dot above 0.95; 1 dot above 1.0, 1.1; 2 dots above 1.15; 1 dot above 1.20, 1.30, 1.35; 2 dots above 1.40; 1 dot above 1.45, 1.55, 1.75, 1.80. A highlighted arrow from the expression, null equals 0, points toward 0.0 on the horizontal axis. All values are approximate.

Calculate the p-value based on this dotplot.

Sol: 10 of the 100 repetitions are less than -1.02, so we estimate the p-value to be 10/100.
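The shuffling procedure and the tail count in the solution above can be sketched in code. The first part tallies the left-tail dots as approximate readings from the plot description. The second part runs the shuffle on a small hypothetical dataset, since the students' actual scores are not listed here; `rows`, `scores`, and both function names are illustrative assumptions.

```python
import random


def slope(xs, ys):
    """Least squares slope of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


def shuffle_p_value(xs, ys, n_shuffles=1000, seed=0):
    """One-sided p-value: share of shuffled slopes at or below the observed one."""
    rng = random.Random(seed)
    observed = slope(xs, ys)
    shuffled = list(ys)
    count = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # re-assign scores to row numbers at random
        if slope(xs, shuffled) <= observed:
            count += 1
    return observed, count / n_shuffles


# Part 1: approximate left-tail readings from the dotplot (slope value -> dots).
left_tail = {-1.95: 1, -1.85: 1, -1.65: 1, -1.40: 1, -1.25: 2, -1.20: 2, -1.05: 2}
observed_slope = -1.02  # observed sample slope used in the solution
extreme = sum(k for s, k in left_tail.items() if s <= observed_slope)
dotplot_p_value = extreme / 100

# Part 2: hypothetical data (two students per row), for illustration only.
rows = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6]
scores = [88, 79, 91, 70, 84, 66, 75, 81, 62, 77, 58, 100]
obs, p_value = shuffle_p_value(rows, scores)

print(f"dotplot p-value = {dotplot_p_value:.2f}")
print(f"hypothetical data: observed slope = {obs:.2f}, p-value = {p_value:.3f}")
```

Shuffling the scores while leaving the row numbers fixed mimics the null hypothesis that seat location has no association with score.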

  1. One student in the 6th row made a 100 on the test. Suppose that this student’s score were removed from the dataset. Would this affect the p-value?

After removing this point, the p-value would be ________ (higher/lower/the same as) before.

  1. The plots below display the residuals for this regression model. Match each validity condition to the description of how that condition should be checked. Two of the descriptions will not be used.

A dotplot describes the residual plots for predicted values. The dotplot has the horizontal axis labeled Predicted Values and has markings 73 to 78 in increments of 1. The vertical axis is labeled Residual and has markings from negative 30 to 30 in increments of 10. For predicted value 73.25, the dots are plotted as follows: 1 dot above negative 16, negative 13, negative 10, negative 5, 1, 5, and 28 on the vertical axis. For predicted value 74.25, the dots are plotted as follows: 1 dot above negative 19, negative 10, negative 8, negative 5, negative 2, 0, 4, 10, 13, and 18. For predicted value 75.25, the dots are plotted as follows: 1 dot above negative 23, negative 8, negative 2, 4, 6, 10, 11, 15, and 16. For predicted value 76.25, the dots are plotted as follows: 1 dot above negative 15, negative 10, negative 2, 0, 2, 5, 7, 9 and 11. For predicted value 77.30, the dots are plotted as follows: 1 dot above negative 27, negative 13, negative 10, 1, 3, 7, 8, and 12. For predicted value 78.30, the dots are plotted as follows: 1 dot above negative 18, negative 13, negative 9, negative 6, 0, 8, 9, and 15. A highlighted dashed horizontal line starts from 0 on the vertical axis and extends toward the right passing through the dots. All values are approximate. A histogram describes the residual plot. The horizontal axis is labeled Residuals and has markings from negative 30 to 30 in increments of 10. The distribution of the bars starts from negative 30, and ends at 30. The longest bar is at the interval 0 to 10 and the shortest bar is at the interval negative 30 to negative 20 and 25 to 30. From negative 30 to negative 20, there are 2 bars with short heights. From negative 20 to 0, there are 4 bars with different heights. From 0 to 30, there are 5 bars with their heights decreasing progressively. There are no bars on the range between 20 and 25. All values are approximate.

Independence: _______

Equal variance: _______

Normality: _______

A. Check that the mean of the residuals is 0.

B. Check that the students were randomly assigned to rows with no repeated measures.

C. Check that the histogram of the residuals is reasonably symmetric and bell-shaped.

D. Check that predicted values are spaced fairly evenly along the x-axis.

E. Check that the vertical spread of the residuals at each of the predicted values is reasonably similar.

Questions 7 through 10: The scatterplot and software output below show the relationship between CEO salary and number of employees for 100 companies.

A scatterplot plots the relationship between number of employees and C E O Salary. The horizontal axis is labeled Number of employees and has markings from 100 to 600 in increments of 100. The vertical axis is labeled C E O Salary and has markings from 40,000 dollars to 180,000 dollars in increments of 20,000. Dots are randomly scattered throughout the graph. A regression line starts from 88,000 dollars on the vertical axis, increases to the right such that some of the dots lie above the line, some of the dots lie below the line, and few dots lie on the line. The dots are plotted from 50 to 600 on the horizontal axis and from 41,000 to 175,000 on the vertical axis. The concentration of dots is more between 55 and 570 on the horizontal axis and between 70,000 and 120,000 on the vertical axis. All values are approximate.

| Term | Coeff. | SE | t-stat | p-value |
| --- | --- | --- | --- | --- |
| Intercept | 85378.8 | 5604.3 | 15.23 | <0.0001 |
| Employees | 35.7438 | 14.8028 | 2.41 | 0.0176 |

  1. Calculate a 95% confidence interval to estimate the population slope. Even using a rough estimate of t*, you should be able to choose the best option below.
    1. (0.07, 71.41)
    2. (6.37, 65.12)
    3. (20.94, 50.55)
    4. (33.33, 38.15)
  2. State the hypotheses for testing the association between CEO salary and number of employees.
  3. The theory-based p-value shown above was calculated using a t distribution with _____ degrees of freedom.
  4. Which of the following is an appropriate conclusion based on the small p-value for the slope?
    1. There is strong evidence of an association between CEO salary and Number of Employees in the population; the sample slope would be unlikely to occur by chance.
    2. There is a strong association between CEO salary and Number of Employees in the sample; a large portion of the variability in salaries is explained by the model.
    3. There is sufficient evidence to conclude that hiring more employees causes a CEO’s salary to increase slightly.
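The interval and degrees of freedom asked about in this question set follow from the software output for the Employees term. The sketch below assumes the usual form, estimate ± t* × SE, with n − 2 degrees of freedom for simple linear regression. The critical value `t_star = 1.9845` is an assumed approximation for 98 degrees of freedom; a rough value of 2 leads to nearly the same interval.

```python
# Values from the software output for the Employees term.
coef = 35.7438
se = 14.8028
n = 100                 # number of companies

df = n - 2              # simple linear regression estimates two coefficients
t_star = 1.9845         # assumed approximate 97.5th percentile of t with 98 df

margin = t_star * se
lower, upper = coef - margin, coef + margin
t_stat = coef / se      # should reproduce the t-stat column

print(f"df = {df}")
print(f"t-stat = {t_stat:.2f}")
print(f"95% CI for the slope: ({lower:.2f}, {upper:.2f})")

# Rough version with t* = 2, as the question suggests.
rough = (coef - 2 * se, coef + 2 * se)
print(f"rough 95% CI: ({rough[0]:.2f}, {rough[1]:.2f})")
```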

Questions 11 through 13: The owner of a café suspects that coffee sales are related to the weather. For a period of 60 days in the spring, she records the number of coffees sold at her café as well as the minimum daily temperature (in degrees Fahrenheit).

  1. Using the partially filled in ANOVA table below, calculate the F-statistic.

| Source | DF | SS | MS | F |
| --- | --- | --- | --- | --- |
| Model | 1 | 8010 | | |
| Error | 58 | 13575 | | |
| Total | 59 | 21585 | | |

Sol:
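As a sketch of the arithmetic, each mean square is its sum of squares divided by its degrees of freedom, and F is the ratio of the model mean square to the error mean square. The variable names below are illustrative.

```python
# Entries from the partially filled ANOVA table.
df_model, ss_model = 1, 8010
df_error, ss_error = 58, 13575
df_total, ss_total = 59, 21585

ms_model = ss_model / df_model
ms_error = ss_error / df_error
f_stat = ms_model / ms_error

# The table should be internally consistent: SS and DF both add up.
sums_match = (ss_model + ss_error == ss_total) and (df_model + df_error == df_total)

r_squared = ss_model / ss_total  # proportion of variation explained

print(f"MSModel = {ms_model:.1f}, MSError = {ms_error:.2f}")
print(f"F = {f_stat:.2f}, R^2 = {r_squared:.3f}, sums match: {sums_match}")
```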

  1. The 95% confidence interval for the slope is (-1.37, -0.67). Interpret this interval.

We are 95% confident that…

    1. The sample slope is between 0.67 and 1.37 standard errors below the null hypothesis value (0).
    2. The true population slope is between 0.67 and 1.37 standard errors below the null hypothesis value (0).
    3. The café sells between 0.67 and 1.37 fewer coffees, on average, for every one degree increase in minimum daily temperature.
    4. The café sells between 0.67 and 1.37 fewer coffees, on average, for every one degree decrease in minimum daily temperature.
  1. The 95% confidence interval for the slope is (-1.37, -0.67). Suppose the owner continued to collect data until she had a sample size of 300 days. What would you expect to happen to the width of the confidence interval?
    1. The CI would get narrower, because the SE of the slope would decrease.
    2. The CI would get narrower, because the SE of the residuals would decrease.
    3. The CI would get wider, because the critical value, t*, would increase.
    4. The CI would get wider, because the confidence level would increase.
  2. Describe how each of the following factors is related to the standard error of the slope.

As sample size increases, SEb tends to ________ (increase/decrease).

As SE of the residuals increases, SEb tends to ________ (increase/decrease).

As SD of the explanatory variable increases, SEb tends to ________ (increase/decrease).
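These three relationships can be explored with the standard formula for the standard error of the slope in simple linear regression, SEb = (SE of residuals) / (SD of x × sqrt(n − 1)). The sketch below uses hypothetical input values throughout; only the sample sizes 60 and 300 echo the café scenario.

```python
import math


def se_slope(residual_se, sd_x, n):
    """Standard error of the least squares slope in simple linear regression."""
    return residual_se / (sd_x * math.sqrt(n - 1))


# Hypothetical baseline: residual SE of 15, SD of x equal to 8, 60 observations.
baseline = se_slope(residual_se=15.0, sd_x=8.0, n=60)

more_data = se_slope(residual_se=15.0, sd_x=8.0, n=300)  # larger sample
noisier = se_slope(residual_se=30.0, sd_x=8.0, n=60)     # larger residual SE
wider_x = se_slope(residual_se=15.0, sd_x=16.0, n=60)    # larger SD of x

print(f"baseline          SEb = {baseline:.4f}")
print(f"n = 300           SEb = {more_data:.4f}")
print(f"residual SE x 2   SEb = {noisier:.4f}")
print(f"SD of x x 2       SEb = {wider_x:.4f}")
```

Changing one input at a time, as here, isolates the direction of each effect.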

  1. The scatterplots below display two datasets of the same size with roughly the same SD of X. Which of the following would produce a larger F statistic?

"Two side by side scatterplots describe the relationship between two sets of data. The first scatterplot is titled, Dataset 1. The horizontal axis is labeled x and ranges from 0 to 10 in increments of 2. The vertical axis is labeled y and ranges from 0 to 50 in increments of 10. Dots are randomly scattered throughout the graph. A regression line starts from 20 on the vertical axis, increases to the right such that some of the dots lie above the line, some of the dots lie below the line, and few dots lie on the line. The dots are plotted from 0 to 10 on the horizontal axis and from 5 to 53 on the vertical axis. The concentration of dots is more between 0 and 9 on the horizontal axis and between 20 and 48 on the vertical axis. Two outliers are plotted at (2.2, 5) and (10, 53). All values are approximate.
The second scatterplot is titled, Dataset 2. The horizontal axis is labeled x and ranges from 0 to 10 in increments of 2. The vertical axis is labeled y and ranges from 0 to 50 in increments of 10. Dots are plotted in an increasing trend from left to right in the graph. A regression line starts from 13 on the vertical axis, increases to the right such that some of the dots lie above the line, some of the dots lie below the line, and few dots lie on the line. The dots are plotted from 0.2 to 9.5 on the horizontal axis and from 13 to 45 on the vertical axis. An outlier is plotted at (6.3, 27). All values are approximate."

    1. Dataset 1 would have a larger F statistic, because SSModel is larger.
    2. Dataset 1 would have a larger F statistic, because SSError is smaller.
    3. Dataset 2 would have a larger F statistic, because SSModel is larger.
    4. Dataset 2 would have a larger F statistic, because SSError is smaller.
  1. Which of the datasets below would produce the smallest p-value?

"Three side by side scatterplots describe the relationship between three sets of data. The first scatterplot is titled, Dataset A. The horizontal axis is labeled x and ranges from 0 to 10 in increments of 2. The vertical axis is labeled y and ranges from 0 to 60 in increments of 10. Dots are randomly scattered throughout the graph. A regression line starts from 26 on the vertical axis, increases to the right such that some of the dots lie above the line, some of the dots lie below the line, and few dots lie on the line. The dots are plotted from 0 to 10 on the horizontal axis and from 20 to 45 on the vertical axis. The concentration of dots is more between 3 and 10 on the horizontal axis and between 20 and 38 on the vertical axis. All values are approximate. Below the scatterplot, the text reads: sample size is 20, slope is 1.10, and correlation is 0.56.
The second scatterplot is titled, Dataset B. The horizontal axis is labeled x and ranges from 0 to 10 in increments of 2. The vertical axis is labeled y and ranges from 0 to 60 in increments of 10. Dots are randomly scattered throughout the graph. A regression line starts from 18 on the vertical axis, increases to the right such that some of the dots lie above the line, some of the dots lie below the line, and few dots lie on the line. The dots are plotted from 1 to 10 on the horizontal axis and from 15 to 50 on the vertical axis. The concentration of dots is more between 1 and 6 on the horizontal axis and between 15 and 30 on the vertical axis. All values are approximate. Below the scatterplot, the text reads: sample size is 20, slope is 2.17, and correlation is 0.74.
The third scatterplot is titled, Dataset C. The horizontal axis is labeled x and ranges from 0 to 10 in increments of 2. The vertical axis is labeled y and ranges from 0 to 60 in increments of 10. Dots are randomly scattered throughout the graph. A regression line starts from 19 on the vertical axis, increases to the right such that some of the dots lie above the line, some of the dots lie below the line, and few dots lie on the line. The dots are plotted from 0.5 to 10 on the horizontal axis and from 15 to 53 on the vertical axis. The concentration of dots is more between 1 and 9.5 on the horizontal axis and between 17 and 40 on the vertical axis. All values are approximate. Below the scatterplot, the text reads: sample size is 100, slope is 2.17, and correlation is 0.74."

    1. Dataset A would produce the smallest p-value.
    2. Dataset B would produce the smallest p-value.
    3. Dataset C would produce the smallest p-value.
    4. Two of these datasets are tied for the smallest p-value.
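For simple linear regression, the t-statistic for the slope can be written in terms of the correlation and sample size alone: t = r × sqrt(n − 2) / sqrt(1 − r²). A minimal sketch using the values printed beneath the three scatterplots follows; converting t to a p-value would need a t-distribution routine, which is omitted here.

```python
import math


def t_from_correlation(r, n):
    """t-statistic for testing zero slope, computed from r and n."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


# (correlation, sample size) as printed below each scatterplot.
datasets = {"A": (0.56, 20), "B": (0.74, 20), "C": (0.74, 100)}

t_stats = {name: t_from_correlation(r, n) for name, (r, n) in datasets.items()}

for name, t in t_stats.items():
    r, n = datasets[name]
    print(f"Dataset {name}: r = {r}, n = {n}, df = {n - 2}, t = {t:.2f}")
```

Note that the slope itself does not appear in the formula: two datasets with the same correlation and sample size give the same t-statistic regardless of the units of x and y.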

Section 4.3: Quantitative and Categorical Explanatory Variables

LO4.3-1: Adjust the relationship between two quantitative variables based on a categorical variable.

LO4.3-2: Evaluate the validity of the regression model.

LO4.3-3: Create indicator variables in order to include binary categorical variables in the regression model.

Questions 1 through 5: Consider a linear regression model that predicts body temperature (in degrees Fahrenheit) based on heart rate (in beats per minute, bpm), and sex (male, female).

The categorical variable, sex, was defined using effect coding:

| Term | Estimate | SE | t-stat | p-value |
| --- | --- | --- | --- | --- |
| Intercept | 96.39 | 0.649 | 148.47 | <0.0001 |
| Heart Rate | 0.025 | 0.009 | 2.88 | 0.0046 |
| Sex [female] | 0.135 | 0.0616 | 2.19 | 0.0307 |

  1. True or False: Based on this model, the predicted body temperature for a male subject whose heart rate is 0 bpm is 96.39 degrees Fahrenheit. Note: Interpretations of the intercept often involve extrapolation.
  2. True or False: Based on this model, body temperature is predicted to increase by 0.025 degrees for each 1 bpm increase in heart rate, holding sex constant.
  3. Suppose there are two subjects, one male and one female, who both have the same heart rate. Based on this model, what is the predicted difference in their body temperatures, female – male?
  4. Does this dataset provide strong evidence of a difference in body temperatures for males and females? You may assume that α = 0.05.

This dataset provides ________ (strong/weak) evidence of a difference in body temperature for males and females, ____________ (ignoring/adjusting for) heart rate.
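To see how predictions work under effect coding, the fitted model can be written as a small function. The coding itself is an assumption here: the row label "Sex [female]" suggests female = +1 and male = −1, but the definition is not shown in the text above. The heart rate of 70 bpm is an arbitrary illustrative value.

```python
# Coefficients from the regression output.
INTERCEPT = 96.39
B_HEART_RATE = 0.025
B_SEX_FEMALE = 0.135

# Assumed effect coding (not shown in the source): female = +1, male = -1.
SEX_CODE = {"female": 1, "male": -1}


def predict_temp(heart_rate, sex):
    """Predicted body temperature (degrees F) under the assumed effect coding."""
    return INTERCEPT + B_HEART_RATE * heart_rate + B_SEX_FEMALE * SEX_CODE[sex]


# Two subjects with the same heart rate; any common value gives the same gap.
diff = predict_temp(70, "female") - predict_temp(70, "male")

# With effect coding the intercept is not the prediction for either sex at 0 bpm.
male_at_zero = predict_temp(0, "male")
female_at_zero = predict_temp(0, "female")

print(f"female - male difference = {diff:.3f}")
print(f"male at 0 bpm = {male_at_zero:.3f}, female at 0 bpm = {female_at_zero:.3f}")
```

Under this coding the female minus male difference is twice the Sex coefficient, and the intercept sits midway between the two groups' predictions at a heart rate of 0.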

  1. Are conclusions based on theory-based p-values valid for this regression model?

A histogram describes the residual plot. The horizontal axis is labeled Residuals and has markings from negative 2 to 2.5 in increments of 0.5. The distribution of the bars is approximately bell-shaped and it starts from negative 2, and ends at 2.5. The longest bar is at 0.5 on the horizontal axis and the bars decrease in height to the left of 0 and to the right of 0.5. All values are approximate. A scatterplot describes the residual plots for predicted values. The horizontal axis is labeled Predicted Values and has markings from 97.5 to 99 in increments of 0.5. The vertical axis is labeled Residual and has markings from negative 2.0 to 2.0 in increments of 1.0. Dots are randomly scattered throughout the graph. A regression horizontal line starts from 0 on the vertical axis and extends toward the right passing through the dots, such that some of the dots lie above the line, some of the dots lie below the line, and few dots lie on the line. The dots are plotted from 97.2 to 98.75 on the horizontal axis and from negative 2 to 2 on the vertical axis. The concentration of dots is more between 97.85 and 98.7 on the horizontal axis and between negative 1.0 and 1.5 on the vertical axis. All values are approximate.

    1. No, because the independence condition has been violated.
    2. No, because the linearity condition has been violated.
    3. No, because the equal variance condition has been violated.
    4. No, because the normality condition has been violated.
    5. Yes, because all the validity conditions are met in this scenario.

Questions 6 through 8: A botanist collected data to measure the variation of Iris flowers of two related species. The data set consists of 50 flowers from two species of Iris – Iris setosa and Iris versicolor. Various features were measured including the length and width of the sepals, in centimeters. (Sepals are a part of the flower.)

  1. A simple linear regression model used sepal length to predict sepal width. The plot below shows residuals from that model broken down by species.

"Two side by side horizontal box plots with dots describe the association between residual and species. The horizontal axis is labeled Residual and has markings from negative 1.0 to 1.0 in increments of 0.5. The vertical axis is labeled Species and has markings as, versicolor and setosa in the order from top to bottom. 
For Versicolor, the whiskers range from negative 0.87 to 0.37 and its box ranges from negative 0.5 to 0.06 with median at 0.20. A dot is plotted to the left of the lower whisker, above negative 1.20 on the horizontal axis.
For Setosa, the whiskers range from negative 0.37 to 1.30 and its box ranges from 0.06 to 0.5 with median at 0.25. A dot is plotted to the left of the lower whisker, above negative 0.95 on the horizontal axis. All values are approximate."

The simple linear regression model tends to ____________ (overestimate/underestimate) the sepal widths for the setosa species and ____________ (overestimate/underestimate) the sepal widths for the versicolor species.

This residual plot suggests that Species should be ____________ (included/excluded) in the model to predict sepal width.

  1. A multiple linear regression model used sepal length and species to predict sepal width.

Which of the following interpretations is correct?

    1. Sepal width is predicted to be 0.517 cm after adjusting for sepal length and species.
    2. As sepal length increases by 1 cm, sepal width is predicted to increase by 0.472 cm, holding species constant.
    3. For both species, sepal width is predicted to increase by 0.548 cm for each one cm increase in sepal length.
    4. For the setosa species, sepal width is predicted to increase by 0.548 cm for each one cm increase in sepal length.
  1. A multiple linear regression model used sepal length and species to predict sepal width.

Suppose the binary categorical variable species had been recorded using indicator coding instead of effect coding. Which of the numerical values would change? Select all that apply.

    1. The intercept would change.
    2. The slope corresponding to sepal length would change.
    3. The slope(s) corresponding to species would change.
    4. The SE of the residuals would change.

Questions 9 through 11: The plot below represents a model that uses two explanatory variables, one quantitative and one categorical, to predict y.

An interaction plot describes the interaction between two variables X 1 and Y. The horizontal axis is labeled X 1 and ranges from 0 to 20 in increments of 5. The vertical axis is labeled Y and has markings from 30 to 90 in increments of 10. A blue line denotes B and a red line denotes A. The blue line increases toward the right from (0, 50) to (20, 90) and the red line increases toward the right from (0, 30) to (20, 70).

  1. Why are the lines parallel?
    1. On average, Y changes at the same constant rate relative to X for both Category A and Category B in the sample data.
    2. The model assumes that Y changes at the same constant rate relative to X for both Category A and Category B.
  2. Suppose the categorical variable is defined using effect coding.

Complete the model statement below:

  1. Suppose the categorical variable is defined by indicator coding.

Find the intercept and slopes for this model.

| Term | Coefficient |
| --- | --- |
| Intercept | |
| Variable | |
| Variable | |
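The coefficients under either coding can be read off the two lines in the plot description. The sketch below recovers each line from its endpoints and then expresses the same pair of parallel lines both ways. Which category is the reference level (indicator coding) or receives +1 (effect coding) is an assumption: here A is taken as the reference and B as +1.

```python
def line_from_endpoints(p, q):
    """Return (intercept, slope) of the line through points p and q."""
    (x1, y1), (x2, y2) = p, q
    slope = (y2 - y1) / (x2 - x1)
    return y1 - slope * x1, slope


# Endpoints from the plot description.
int_a, slope_a = line_from_endpoints((0, 30), (20, 70))  # red line, Category A
int_b, slope_b = line_from_endpoints((0, 50), (20, 90))  # blue line, Category B

# Indicator coding, assuming A is the reference level (B = 1, A = 0).
indicator = {
    "intercept": int_a,           # prediction for A when X1 = 0
    "x1": slope_a,                # common slope
    "category_b": int_b - int_a,  # vertical gap between the lines
}

# Effect coding, assuming B = +1 and A = -1.
effect = {
    "intercept": (int_a + int_b) / 2,   # midpoint of the two intercepts
    "x1": slope_a,                      # common slope
    "category_b": (int_b - int_a) / 2,  # half the vertical gap
}

print("parallel:", slope_a == slope_b)
print("indicator coding:", indicator)
print("effect coding:   ", effect)
```

Both codings describe the same two lines; only the intercept and the category coefficient change, while the slope for X1 is unaffected.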

Questions 12 and 13: The scatterplots below display data collected from 27 automotive plants. For each plant, three variables were recorded: Defects = number of assembly defects per 100 cars, Time = time (in hours) to assemble each vehicle, and Location = whether or not the plant is located in Japan. The graph on the left shows the relationship between Defects and Time for all 27 plants. The graph on the right shows the relationship between Defects and Time for each plant location (in Japan and not in Japan).

A scatterplot plots the relationship between time and defects. The horizontal axis is labeled Time and has markings from 10 to 60 in increments of 10. The vertical axis is labeled Defects and has markings from 25 to 175 in increments of 25. Dots are randomly scattered throughout the graph. A regression line starts from (10, 65), increases to the right such that some of the dots lie above the line, some of the dots lie below the line, and a dot lies on the line, and ends at (56, 81). The dots are plotted from 12 to 54 on the horizontal axis and from 27 to 170 on the vertical axis. The concentration of dots is more between 15 and 32 on the horizontal axis and between 37 and 100 on the vertical axis. Two outliers are plotted at (21, 137) and (28, 172). All values are approximate. "A scatterplot plots the relationship between time and defects. The horizontal axis is labeled Time and has markings from 10 to 60 in increments of 10. The vertical axis is labeled Defects and has markings from 25 to 175 in increments of 25. A blue line denotes Japan and a red line denotes Not Japan. The blue and red dots are randomly scattered throughout the graph. 
The blue regression line starts from (12, 62), decreases to the right such that some of the dots lie above the line, some of the dots lie below the line, and a dot lies on the line, and ends at (27, 43). The blue dots are plotted from 12 to 27 on the horizontal axis and from 33 to 87 on the vertical axis. 
The red regression line starts from (17, 93), decreases to the right such that some of the dots lie above the line, some of the dots lie below the line, and a dot lies on the line, and ends at (56, 62). The red dots are plotted from 18 to 54 on the horizontal axis and from 28 to 172 on the vertical axis. Two red outliers are plotted at (20, 138) and (28, 172). All values are approximate."

  12. The slope coefficient for Time is not given in the model above. What is the interpretation of the missing coefficient?

The predicted change in _________ (Location/Defects/Time) for a one-unit change in _________ (Location/Defects/Time), holding ________ (Location/Defects/Time) constant.

  13. The slope coefficient for Time is not given in the model above. Based on the scatterplots, we expect the missing slope coefficient for Time to be __________ (positive/negative).
  14. When evaluating a regression model with multiple explanatory variables, which significance test(s) should you conduct first? Why?
    1. First conduct an F-test for the overall model. This protects against Type I error.
    2. First conduct an F-test for the overall model. This protects against Type II error.
    3. First conduct t-tests for the individual predictors. This protects against Type I error.
    4. First conduct t-tests for the individual predictors. This protects against Type II error.
  15. Match the terms below with their features. Each choice (A-D) will be used exactly once.

Terms:

Effect coding: ________

Indicator coding: ________

Features:

A. The binary variable is coded (0, 1).

B. The binary variable is coded (-1, 1).

C. The slope coefficient is the difference in the group means.

D. The slope coefficient is the difference between the group means and the overall average (least squares mean).

  16. A dataset contains information about a random sample of 200 house sales in a town in the Pacific Northwest. A regression model was fit for predicting the selling price of a house (in thousands of dollars) using house size (sq ft) and whether the house has a garage as predictors. The ANOVA table is given below.

| Source | DF | Sum of Squares | F | p-value |
| --- | --- | --- | --- | --- |
| House Size | 1 | 1,380,568 | 215.26 | <0.0001 |
| Garage | 1 | 58,076 | 9.06 | 0.0030 |
| Error | 197 | 1,263,472 | | |
| Total | 199 | 2,668,870 | | |

Does SSHouseSize + SSGarage = SSModel?

    1. No, because there is likely covariation between size and garage in this observational study.
    2. No, because there is no covariation among explanatory variables when the data come from a random sample.
    3. Yes, because there is likely covariation between size and garage in this observational study.
    4. Yes, because there is no covariation among explanatory variables when the data come from a random sample.
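The arithmetic behind this question can be checked with a short sketch. It assumes SSModel is obtained as SSTotal minus SSError, with all values copied from the ANOVA table for the house-price model; the variable names are illustrative only.

```python
# Values copied from the ANOVA table for the house-price model.
ss_house_size = 1_380_568
ss_garage = 58_076
ss_error = 1_263_472
ss_total = 2_668_870

# Assumption: SSModel is what remains of SSTotal after removing SSError.
ss_model = ss_total - ss_error

# Sum of the two individual sums of squares listed in the table.
ss_sum_of_terms = ss_house_size + ss_garage

print("SSModel:", ss_model)
print("SSHouseSize + SSGarage:", ss_sum_of_terms)
print("Equal?", ss_model == ss_sum_of_terms)
```

Comparing the two printed totals shows whether the individual sums of squares add up to the model sum of squares for this table.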
  17. A simple linear regression model is being used to describe the relationship between two quantitative variables, $x_1$ and $y$, in an observational study. The p-value for testing $H_0: \beta_1 = 0$ vs. $H_a: \beta_1 \neq 0$ is 0.043.

Suppose the researchers added a new categorical explanatory variable, $x_2$, to the model. Would the p-value corresponding to the quantitative explanatory variable, $x_1$, change?

    1. No, adding a new variable, $x_2$, would not affect the p-value corresponding to $x_1$.
    2. Yes, the p-value would be larger than 0.043 after adding the new variable, $x_2$.
    3. Yes, the p-value would be smaller than 0.043 after adding the new variable, $x_2$.
    4. The p-value is likely to change, but there is not enough information to say whether the p-value would get larger or smaller.

Section 4.4: Quantitative/Categorical Interactions

LO4.4-1: Include interaction between quantitative and categorical variables in a statistical model.

LO4.4-2: Interpret the nature of the interaction.

Questions 1 through 5: A dataset contains information about a random sample of 200 house sales in a town in the Pacific Northwest. A regression model was fit for predicting the selling price of a house (in thousands of dollars) using house size (sq ft), garage (garage=1 if the house has a garage and 0 otherwise), and the interaction between size and garage.

| Term | Coeff. | SE | t-stat. | p-value |
| --- | --- | --- | --- | --- |
| Intercept | 141.835 | 18.130 | 7.82 | <0.0001 |
| Size | 0.039 | 0.006 | 7.07 | <0.0001 |
| Garage_ind | -120.392 | 24.163 | -4.98 | <0.0001 |
| Size × Garage_ind | 0.062 | 0.008 | 7.62 | <0.0001 |

  1. True or False: In this town, there is strong evidence to suggest that the association between size and price differs based on whether the house has a garage.
  2. True or False: In this town, there is strong evidence to suggest that houses with garages cost less, on average, than houses without garages.
  3. Interpret the slope corresponding to size.

As _________ (price/size) increases by 1 unit, _________ (price/size) is predicted to increase by 0.039 units, for houses with _________ (a garage/no garage).

  4. Write an equation that predicts house price (in thousands of dollars) for houses with garages.

For houses with garages:
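As a rough illustration of how an interaction model splits into one line per group, the sketch below plugs garage = 0 and garage = 1 into the fitted model. The coefficients are copied from the table for the house-price model; the function names `group_line` and `predict_price` are hypothetical helpers, not part of the original material.

```python
# Coefficients copied from the coefficient table (indicator coding:
# garage = 1 if the house has a garage, 0 otherwise).
b_intercept = 141.835
b_size = 0.039
b_garage = -120.392
b_interaction = 0.062

def group_line(garage):
    """Return (intercept, slope) of the price-vs-size line for garage = 0 or 1."""
    intercept = b_intercept + b_garage * garage
    slope = b_size + b_interaction * garage
    return intercept, slope

def predict_price(size_sqft, garage):
    """Predicted price in thousands of dollars."""
    intercept, slope = group_line(garage)
    return intercept + slope * size_sqft

for g in (0, 1):
    intercept, slope = group_line(g)
    print(f"garage={g}: predicted price = {intercept:.3f} + {slope:.3f} * size")
```

Setting the indicator to 1 adds the garage coefficient to the intercept and the interaction coefficient to the size slope; setting it to 0 leaves only the intercept and the size slope.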

  5. Suppose the binary variable garage had been recorded using effect coding instead of indicator coding. Which of the numerical values would change? Select all that apply.
    1. The intercept would change.
    2. The slope corresponding to house size would change.
    3. The slope corresponding to garage would change.
    4. The slope corresponding to the interaction would change.
    5. The SE of the residuals would change.
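One way to explore this question is to re-express the same two fitted lines under effect coding. The sketch below assumes the effect code is +1 for houses with a garage and -1 for houses without (the opposite assignment would flip the signs of the garage and interaction coefficients); `indicator_to_effect` and the two prediction helpers are hypothetical names.

```python
def indicator_to_effect(b0, b1, b2, b3):
    """Convert coefficients from (0, 1) coding to (-1, +1) coding.

    b0, b1, b2, b3 are the intercept, size, garage, and interaction
    coefficients under indicator coding.
    """
    return {
        "intercept": b0 + b2 / 2,   # average of the two group intercepts
        "size": b1 + b3 / 2,        # average of the two group slopes
        "garage": b2 / 2,           # half the difference in intercepts
        "interaction": b3 / 2,      # half the difference in slopes
    }

def predict_indicator(size, garage, b0, b1, b2, b3):
    # garage is 0 or 1
    return b0 + b1 * size + b2 * garage + b3 * size * garage

def predict_effect(size, garage, coef):
    code = 1 if garage == 1 else -1   # assumed effect code
    return (coef["intercept"] + coef["size"] * size
            + coef["garage"] * code + coef["interaction"] * size * code)

indicator = (141.835, 0.039, -120.392, 0.062)   # copied from the table
effect = indicator_to_effect(*indicator)
print(effect)

# Both codings describe the same two lines, so predictions agree.
for g in (0, 1):
    print(g, predict_indicator(1500, g, *indicator), predict_effect(1500, g, effect))
```

Because the fitted values are the same under either coding, the residuals are the same as well; only the way the coefficients carve up the two lines differs.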

Questions 6 through 10: A botanist collected data to measure the variation of iris flowers of two related species. The data set consists of 50 flowers from two species of iris – iris setosa and iris versicolor. Various features were measured including the length and width of the sepals, in centimeters. (Sepals are a part of the flower.)

A multiple regression model was fit to predict sepal width based on sepal length, species, and the interaction between sepal length and species. Note: Effect coding was used.

  6. Which of the following describes the relationship between length and width for the setosa species?
    1. As length increases by 1 cm, width is predicted to increase by 0.559 cm.
    2. As length increases by 1 cm, width is predicted to decrease by 0.721 cm.
    3. As length increases by 1 cm, width is predicted to increase by 0.239 cm.
    4. As length increases by 1 cm, width is predicted to increase by 0.798 cm.
  7. Which of the following is the appropriate interpretation of the slope corresponding to species? Note: Interpretations may involve extrapolation.
    1. The predicted sepal width of setosa is 0.721 lower than the predicted sepal width of versicolor for irises with a sepal length of 0 cm.
    2. The predicted sepal width of setosa is 0.721 lower than the predicted sepal width of versicolor for irises of average sepal length.
    3. The predicted sepal width of setosa is 1.442 lower than the predicted sepal width of versicolor for irises with a sepal length of 0 cm.
    4. The predicted sepal width of setosa is 1.442 lower than the predicted sepal width of versicolor for irises of average sepal length.
  8. Suppose the model used indicator coding for species instead of effect coding.

Calculate the coefficients for each term in the model.

| Term | Coefficient |
| --- | --- |
| Intercept | |
| Sepal length | |
| Species | |
| Length × Species | |

  9. Which of the following statements is a description of the interaction between sepal length and species in this sample? Select all that apply.

"A scatterplot plots the relationship between sepal length and sepal width. The horizontal axis is labeled Sepal dot Length and has markings from 4.0 to 7.0 in increments of 0.5. The vertical axis is labeled Sepal dot Width and has markings from 2.0 to 4.5 in increments of 0.5. A blue line denotes Setosa and a red line denotes Versicolor. The blue and red dots are randomly scattered throughout the graph. 
The blue regression line starts from (4.25, 2.75), increases to the right such that some of the dots lie above the line, some of the dots lie below the line, and a dot lies on the line, and ends at (5.87, 4.13). The blue dots are plotted from 4.25 to 5.75 on the horizontal axis and from 2.8 to 4.4 on the vertical axis. An outlier is plotted at (4.5, 2.25).
The red regression line starts from (4.75, 2.4), increases to the right such that some of the dots lie above the line, some of the dots lie below the line, and a dot lies on the line, and ends at (7.1, 3.1). The red dots are plotted from 4.85 to 7.0 on the horizontal axis and from 2.0 to 3.3 on the vertical axis. All values are approximate."

    1. Irises of the versicolor species tend to have longer sepal lengths than irises of the setosa species.
    2. Irises of the setosa species tend to have wider sepals than the versicolor species, and sepal length and sepal width are positively associated for both species.
    3. The relationship between sepal length and sepal width changes based on the species of iris.
    4. The difference in sepal width for the two species of iris is smaller when the sepal length is small and larger when the sepal length is large.
  10. The multiple linear regression model used sepal length, species, and the interaction between sepal length and species to predict sepal width. The scatterplot below shows the residuals of that model vs. the predicted values from that model, by species.

A scatterplot describes the residual plots for predicted values. The horizontal axis is labeled Predicted Value and has markings from 2.5 to 4.0 in increments of 0.5. The vertical axis is labeled Residual and has markings from negative 0.4 to 0.4 in increments of 0.4. Red dots denote Setosa and blue dots denote Versicolor. The red dots are plotted from 2.8 to 4.0 on the horizontal axis and from negative 0.6 to 0.5 on the vertical axis. The blue dots are plotted from 2.4 to 3.15 on the horizontal axis and from negative 0.6 to 0.6 on the vertical axis. A horizontal line starts from 0 on the vertical axis and extends toward the right passing through the blue and red dots, such that some of the dots lie above the line, some of the dots lie below the line, and few dots lie on the line. All values are approximate.

Does this residual plot indicate a problem with the regression model being used?

    1. Yes. The residuals are not normally distributed.
    2. Yes. There is no linear association between the residuals and predicted values.
    3. Yes. The versicolor irises tend to have lower predicted values than setosa irises.
    4. No. This residual plot does not indicate a problem with the regression model.

Questions 11 and 12: Suppose you want to predict body temperature (in degrees Fahrenheit) based on heart rate (in beats per minute), and sex (male, female). You decide to exclude the interaction between heart rate and sex from your model, because the term was not “statistically significant.”

  11. How do you determine whether an interaction is statistically significant?
    1. Fit a model that includes an interaction term. If the p-value for the interaction term is large, then the interaction is statistically significant.
    2. Fit a model that includes an interaction term. If the p-value for the interaction term is small, then the interaction is statistically significant.
    3. Fit a model that includes an interaction term. If the $R^2$ value for that model is large, then the interaction is statistically significant.
    4. Fit a model that does not include an interaction term. If the $R^2$ value for that model is large, then the interaction is not statistically significant.
  12. Explain what “no interaction” means in the context of this question. Select all that apply.
    1. The predicted difference in heart rates for men and women does not depend on body temperature.
    2. The relationship between heart rate and body temperature is the same regardless of sex.
    3. Neither sex nor heart rate is associated with body temperature, so neither of these are useful predictors in the model.
    4. There is no association between sex and heart rate, so knowing a person’s sex does not help predict their heart rate.
  13. What statistical term do we use when one explanatory variable modifies the association of another explanatory variable with the response?
    1. Association
    2. Covariation
    3. Interaction
    4. Regression
  14. In which of the following study designs is it possible to have an interaction between two explanatory variables?
    1. An observational study with substantial covariation between two predictors
    2. An observational study with minimal association between the predictors
    3. A designed experiment with factors that are independent of each other
    4. Interaction may occur in any of the three designs described above.
  15. The plot below represents a model that uses two explanatory variables, one quantitative and one categorical, to predict a quantitative response. The model includes an interaction term. The categorical variable is defined using indicator coding (1 for category A, 0 for category B).

An interaction plot describes the interaction between two variables X 1 and Y. The horizontal axis is labeled X 1 and ranges from 0 to 20 in increments of 5. The vertical axis is labeled Y and ranges from 0 to 50 in increments of 10. A blue line denotes B and a red line denotes A. The blue line decreases toward right from (0, 20) and ends at (20, 0) and the red line increases toward right from (0, 10) and ends at (20, 50). Both the lines intersect approximately at (3, 16).

Find the coefficients for this model.

| Term | Coefficient |
| --- | --- |
| Intercept | |
| Variable X1 | |
| Variable X2 | |
| Variable X1 × Variable X2 | |
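The link between two plotted lines and the four coefficients of an indicator-coded interaction model can be sketched as follows. The endpoints are read from the plot description above, category B is treated as the reference because it is coded 0, and `line_through` is a hypothetical helper.

```python
def line_through(p, q):
    """Return (intercept, slope) of the straight line through points p and q."""
    (x0, y0), (x1, y1) = p, q
    slope = (y1 - y0) / (x1 - x0)
    intercept = y0 - slope * x0
    return intercept, slope

# Endpoints taken from the plot description.
line_b = line_through((0, 20), (20, 0))    # category B, indicator = 0 (reference)
line_a = line_through((0, 10), (20, 50))   # category A, indicator = 1

coefficients = {
    "Intercept": line_b[0],                 # intercept of the reference line
    "X1": line_b[1],                        # slope of the reference line
    "X2": line_a[0] - line_b[0],            # difference in intercepts (A minus B)
    "X1 x X2": line_a[1] - line_b[1],       # difference in slopes (A minus B)
}
print(coefficients)

# Where the two lines cross: the intercept gap divided by the slope gap.
x_cross = -coefficients["X2"] / coefficients["X1 x X2"]
print("lines cross near X1 =", round(x_cross, 2))
```

The reference group supplies the intercept and the X1 slope, while the indicator and interaction coefficients record how the other group's line differs from it.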

  16. The plots below represent two different datasets. Each dataset has two explanatory variables, one quantitative and one categorical, that are being used to predict a quantitative response. The sample sizes and SE of the residuals are roughly the same for both datasets.

"Two side by side scatterplots describe the relationship between two sets of data. The first scatterplot is titled, Dataset 1. The horizontal axis is labeled X 1 and ranges from 0 to 20 in increments of 5. The vertical axis is labeled Y and ranges from 0 to 50 in increments of 10. A blue line denotes B and a red line denotes A. The blue line decreases from (0, 20) and ends at (20, 0) such that some of the blue dots lie above the line, some of the blue dots lie below the line, and few blue dots lie on the line. The red line increases upward from (0, 10) and ends at (20, 50) such that some of the red dots lie above the line, some of the red dots lie below the line, and few red dots lie on the line. Both the lines intersect approximately at (3, 16). 
The second scatterplot is titled, Dataset 2. The horizontal axis is labeled X 1 and ranges from 0 to 20 in increments of 5. The vertical axis is labeled Y and ranges from 0 to 50 in increments of 10. A blue line denotes B and a red line denotes A. The blue line increases from (0, 20) and ends at (20, 30) such that some of the blue dots lie above the line, some of the blue dots lie below the line, and few blue dots lie on the line. The red line increases upward from (0, 10) and ends at (20, 50) such that some of the red dots lie above the line, some of the red dots lie below the line, and few red dots lie on the line. Both the lines intersect approximately at (7, 23)."

For each dataset, a significance test is used to decide whether the interaction is statistically significant. How do the p-values for the two tests compare?

    1. Dataset 1 would produce a smaller p-value for testing the interaction.
    2. Dataset 2 would produce a smaller p-value for testing the interaction.
    3. Both datasets would produce the same p-value for testing the interaction.
    4. There is not enough information to determine which test would result in a smaller p-value.
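A rough way to compare the two datasets is to look at the size of the slope difference that the interaction term estimates in each one. The sketch below uses line endpoints read from the plot descriptions; it only computes the slope gaps and does not run a significance test. The reasoning that a larger gap gives a larger test statistic relies on the stated condition that sample sizes and residual SE are roughly equal.

```python
def slope(p, q):
    """Slope of the line through points p and q."""
    return (q[1] - p[1]) / (q[0] - p[0])

# Endpoints of the fitted lines, taken from the plot descriptions.
datasets = {
    "Dataset 1": {"A": ((0, 10), (20, 50)), "B": ((0, 20), (20, 0))},
    "Dataset 2": {"A": ((0, 10), (20, 50)), "B": ((0, 20), (20, 30))},
}

# The interaction coefficient estimates the difference in slopes (A minus B).
slope_gaps = {
    name: slope(*lines["A"]) - slope(*lines["B"])
    for name, lines in datasets.items()
}

for name, gap in slope_gaps.items():
    print(f"{name}: difference in slopes = {gap}")
```

With comparable standard errors, the dataset with the larger slope difference would be expected to give the larger interaction test statistic.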

Section 4.5: Multi-level Categorical Variables

LO4.5-1: Include categorical variables with more than two categories in a linear model.

LO4.5-2: Interpret an interaction between a quantitative variable and a multi-level categorical variable.

Questions 1 through 5: Most universities ask students to fill out course evaluations at the end of the semester, but some students choose not to complete the survey. The statistics department wonders whether high response rates are associated with higher or lower average ratings. They also want to account for differences in ratings for courses at different levels: introductory, intermediate, and advanced.

A multiple regression model (with no interaction) was used to predict a section’s average rating using response rate and course level. Response rate was recorded as a proportion and course level was recorded using indicator coding. The table of coefficients is given below. Note: The observational units are course sections not individual students.

| Term | Coeff. | SE | t-stat. | p-value |
| --- | --- | --- | --- | --- |
| Intercept | 3.8717 | 0.2404 | 16.103 | <0.0001 |
| ResponseRate | 0.9559 | 0.3468 | 2.757 | 0.0076 |
| Level_intro | -0.7026 | 0.1111 | -6.321 | <0.0001 |
| Level_inter | -0.2585 | 0.1108 | -2.563 | 0.0128 |

  1. The interaction term was not included in the final model for predicting average course ratings, because it was not statistically significant. Using the partially filled in ANOVA table below, calculate the F-statistic for testing the interaction between course level and response rate. Use three decimal places in your answer.

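| Source | DF | SS | MS | F | p-value |
| --- | --- | --- | --- | --- | --- |
| ResponseRate | | 0.6888 | | | 0.0085 |
| Level | | 8.9508 | | | <0.0001 |
| ResponseRate × Level | | 0.0307 | | | 0.8483 |
| Error | | 5.6795 | | | |
| Total | 66 | | | | |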

Sol:
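The calculation can be sketched as below. This is an illustration under stated assumptions, not the official solution: it assumes the unlabeled values in the table are sums of squares, that course level has three categories (as described in the question), and that degrees of freedom follow the usual counting rules (one for a quantitative term, categories minus one for a categorical term, and the product for their interaction).

```python
# Degrees of freedom (assumed standard counting rules).
n_levels = 3                      # introductory, intermediate, advanced
df_total = 66                     # from the table
df_response_rate = 1
df_level = n_levels - 1
df_interaction = df_response_rate * df_level
df_error = df_total - (df_response_rate + df_level + df_interaction)

# Sums of squares copied from the table (assumed to be the SS column).
ss_interaction = 0.0307
ss_error = 5.6795

# Mean squares and the F-statistic for the interaction.
ms_interaction = ss_interaction / df_interaction
ms_error = ss_error / df_error
f_stat = ms_interaction / ms_error

print(f"df_error = {df_error}")
print(f"MS_interaction = {ms_interaction:.5f}, MS_error = {ms_error:.5f}")
print(f"F = {f_stat:.3f}")
```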

  2. Fill in the linear equations to predict the average rating based on the response rate for each course level. Do not round.

Introductory:

Intermediate:

Advanced:
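How an additive model with indicator variables yields one line per level can be sketched as follows. The coefficients are copied from the coefficient table; treating advanced as the reference level is an inference from the fact that the table has no indicator for it, and the dictionary names are illustrative.

```python
# Coefficients copied from the coefficient table (additive model).
coef = {
    "Intercept": 3.8717,
    "ResponseRate": 0.9559,
    "Level_intro": -0.7026,
    "Level_inter": -0.2585,
}

# Assumption: advanced is the reference level, so its offset is 0.
level_offsets = {
    "Introductory": coef["Level_intro"],
    "Intermediate": coef["Level_inter"],
    "Advanced": 0.0,
}

# In an additive model every level shares the same slope;
# only the intercept shifts by the level's indicator coefficient.
intercepts = {lvl: coef["Intercept"] + off for lvl, off in level_offsets.items()}

for lvl, b0 in intercepts.items():
    print(f"{lvl}: predicted rating = {b0:.4f} + {coef['ResponseRate']} * ResponseRate")

# Predicted change in rating for a 0.1 increase in response rate.
change_per_tenth = coef["ResponseRate"] * 0.1
print("change per 0.1 increase:", round(change_per_tenth, 5))
```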

  3. Describe the relationship between the response rate and the average course rating based on the additive model given above.

As the response rate increases by 0.1, the average course rating is predicted to increase by ________ for ____________ (introductory/intermediate/advanced/all) level(s).

  4. After controlling for the response rate, which course levels are significantly different in terms of average ratings? Match the questions below to the appropriate p-values.

One of the p-values will not be used.

Questions:

Is introductory different from intermediate? ________

Is intermediate different from advanced? ________

Is introductory different from advanced? ________

P-values:

A. P-value = 0.0076

B. P-value < 0.0001

C. P-value = 0.0128

D. P-value not shown in table.

  5. The scatterplot below shows the relationship between average rating and response rate with a separate line of best fit for each level (introductory, intermediate, and advanced).

"A scatterplot describes the relationship between response rate and average rating. The horizontal axis is labeled Response Rate and has markings from 0.4 to 0.9 in increments of 0.1. The vertical axis is labeled Average Rating and has markings from 3.0 to 5.0 in increments of 0.5. A red line denotes Advanced, a green line denotes Inter, and a blue line denotes Intro. The red line increases from (0.47, 4.37) and ends at (0.88, 4.7) such that some of the red dots lie above the line, some of the red dots lie below the line, and few red dots lie on the line. The red dots are plotted from 0.47 to 0.88 on the horizontal axis and from 4.2 to 4.8 on the vertical axis. All values are approximate.
The green line increases from (0.49, 4.1) and ends at (0.93, 4.5) such that some of the green dots lie above the line, some of the green dots lie below the line, and few green dots lie on the line. The green dots are plotted from 0.49 to 0.93 on the horizontal axis and from 3.75 to 4.95 on the vertical axis. All values are approximate.
The blue line increases from (0.34, 3.45) and ends at (0.8, 4.0) such that some of the blue dots lie above the line, some of the blue dots lie below the line, and few blue dots lie on the line. The blue dots are plotted from 0.35 to 0.8 on the horizontal axis and from 3.0 to 4.3 on the vertical axis. All values are approximate."

The output above shows that the slope corresponding to response rate is 0.9559. If level were removed from the model, how would the slope coefficient change?

    1. The slope relating average rating to response rate would be higher than 0.9559.
    2. The slope relating average rating to response rate would be lower than 0.9559.
    3. The slope would change, because there is covariation between level and response rate, but there is not enough information to predict if it would increase or decrease.
    4. The slope would not change much, because there is no significant interaction between response rate and level.

Questions 6 through 8: A realtor took a random sample of 50 houses currently for sale from each quadrant of the town where she works (northeast, northwest, southeast, southwest). She wonders whether the relationship between size (in sq ft) and price (in dollars) differs based on location. The table of coefficients is given below.

| Term | Coefficient | Std. Error | t-stat | p-value |
| --- | --- | --- | --- | --- |
| (Intercept) | 127,739.66 | 17,060.53 | 7.487 | <0.0001 |
| size | 129.61 | 12.85 | 10.087 | <0.0001 |
| locationNW | -573.30 | 23,632.62 | -0.024 | 0.9807 |
| locationSE | -5,092.04 | 21,951.00 | -0.232 | 0.8168 |
| locationSW | -15,854.18 | 19,366.04 | -0.819 | 0.4140 |
| size:locationNW | 23.60 | 15.69 | 1.504 | 0.1341 |
| size:locationSE | -60.36 | 17.17 | -3.516 | 0.0005 |
| size:locationSW | -24.51 | 13.65 | -1.796 | 0.0741 |

  6. Based on the graph below, how can you tell that there is an interaction between size and location in this sample?

"A scatterplot describes the relationship between size and price. The horizontal axis is labeled Size in square feet and has markings from 1000 to 3000 in increments of 500. The vertical axis is labeled Price in thousands of dollars and has markings from 200 to 500 in increments of 100. A red line denotes N E, a green line denotes N W, a blue line denotes S E, and a purple line denotes S W. The red line increases from (1000, 250) and ends at (1650, 350) such that some of the red dots lie above the line, some of the red dots lie below the line, and few red dots lie on the line. The red dots are plotted from 1000 to 1650 on the horizontal axis and from 240 to 360 on the vertical axis. All values are approximate.
The green line increases from (875, 250) and ends at (2500, 510) such that some of the green dots lie above the line, some of the green dots lie below the line, and few green dots lie on the line. The green dots are plotted from 875 to 2500 on the horizontal axis and from 275 to 550 on the vertical axis. All values are approximate.
The blue line increases from (750, 175) and ends at (1800, 250) such that some of the blue dots lie above the line, some of the blue dots lie below the line, and few blue dots lie on the line. The blue dots are plotted from 750 to 1800 on the horizontal axis and from 170 to 240 on the vertical axis. All values are approximate.

The purple line increases from (950, 230) and ends at (3100, 425) such that some of the purple dots lie above the line, some of the purple dots lie below the line, and few purple dots lie on the line. The purple dots are plotted from 950 to 3100 on the horizontal axis and from 220 to 420 on the vertical axis. All values are approximate. "

    1. Some parts of town tend to have larger houses than other parts of town.
    2. Some parts of town tend to have more expensive houses than other parts of town.
    3. The best fit lines for the four parts of town do not all have the same intercept.
    4. The best fit lines for the four parts of town do not all have the same slope.
  7. Which location is the reference category?
    1. NE
    2. NW
    3. SE
    4. SW
  8. Fill in the linear equation to predict price for houses in the NW quadrant of town. Do not round.

NW:
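With a multi-level categorical variable and an interaction, each non-reference level gets its own intercept shift and its own slope shift. The sketch below applies that idea to the coefficient table for the realtor's model; the dictionary layout and the name `location_line` are hypothetical, and a location with no entry in the shift tables is treated as having zero shifts.

```python
# Coefficients copied from the table (price in dollars, size in sq ft).
base_intercept = 127_739.66
base_slope = 129.61
intercept_shift = {"NW": -573.30, "SE": -5_092.04, "SW": -15_854.18}
slope_shift = {"NW": 23.60, "SE": -60.36, "SW": -24.51}

def location_line(location):
    """Return (intercept, slope) of the price-vs-size line for a location.

    A location with no indicator in the table gets zero shifts.
    """
    intercept = base_intercept + intercept_shift.get(location, 0.0)
    slope = base_slope + slope_shift.get(location, 0.0)
    return intercept, slope

for loc in ("NW", "SE", "SW"):
    b0, b1 = location_line(loc)
    print(f"{loc}: predicted price = {b0:.2f} + {b1:.2f} * size")
```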

Questions 9 and 10: The plot below represents a dataset with two explanatory variables, one quantitative and one categorical, that are being used to predict a quantitative response.

"A scatterplot describes the relationship between X 1 and Y. The horizontal axis is labeled X 1 and ranges from 0 to 30 in increments of 10. The vertical axis is labeled Y and has markings from 20 to 80 in increments of 20. A red line denotes A, a green line denotes B, and a blue line denotes C. The red line increases from (0, 20) and ends at (29, 79) such that some of the red dots lie above the line, some of the red dots lie below the line, and few red dots lie on the line. The red dots are plotted from 0 to 29 on the horizontal axis and from 20 to 79 on the vertical axis. All values are approximate.
The green line increases from (0, 10) and ends at (29, 67) such that some of the green dots lie above the line, some of the green dots lie below the line, and few green dots lie on the line. The green dots are plotted from 0 to 29 on the horizontal axis and from 10 to 68 on the vertical axis. All values are approximate.
The blue line decreases from (0, 41) and ends at (28, 25) such that some of the blue dots lie above the line, some of the blue dots lie below the line, and few blue dots lie on the line. The blue dots are plotted from 0 to 28 on the horizontal axis and from 25 to 41 on the vertical axis. All values are approximate."

  9. For each term of the model, predict whether the coefficient would be positive, negative, or close to 0.

| Term | Coeff. |
| --- | --- |
| Intercept | |
| X1 | |
| Indicator_B | |
| Indicator_C | |
| X1 × Indicator_B | |
| X1 × Indicator_C | |
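Rough values for these coefficients can be read off the plotted lines. The sketch below uses the approximate endpoints from the plot description and assumes category A is the reference level, since the model has indicators only for B and C; `line_through` is a hypothetical helper and the results are only as precise as the plot readings.

```python
def line_through(p, q):
    """Return (intercept, slope) of the straight line through points p and q."""
    slope = (q[1] - p[1]) / (q[0] - p[0])
    return p[1] - slope * p[0], slope

# Approximate endpoints taken from the plot description.
lines = {
    "A": line_through((0, 20), (29, 79)),
    "B": line_through((0, 10), (29, 67)),
    "C": line_through((0, 41), (28, 25)),
}

# Assumption: A is the reference category.
ref_intercept, ref_slope = lines["A"]
approx_coefficients = {
    "Intercept": ref_intercept,
    "X1": ref_slope,
    "Indicator_B": lines["B"][0] - ref_intercept,
    "Indicator_C": lines["C"][0] - ref_intercept,
    "X1 x Indicator_B": lines["B"][1] - ref_slope,
    "X1 x Indicator_C": lines["C"][1] - ref_slope,
}

for term, value in approx_coefficients.items():
    print(f"{term}: {value:+.2f}")
```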

  10. If you tested the significance of the interaction in this model, what would you expect to find? Assume the sample size is large enough to provide strong evidence if the interaction actually exists.
    1. The p-value would probably be large, because two of the slopes are very similar.
    2. The p-value would probably be large, because the slopes are not all the same.
    3. The p-value would probably be small, because two of the slopes are very similar.
    4. The p-value would probably be small, because the slopes are not all the same.
  11. Suppose a regression model has a categorical variable with 5 categories. You would need to create _____ indicator variable(s) to represent this categorical variable.

Provide a numerical value for your answer.
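Indicator coding for a multi-level variable can be sketched in plain Python as below. The category labels "A" through "E" and the function name `indicator_columns` are made up for illustration; the point is that one level is held out as the reference and every other level gets its own 0/1 column.

```python
def indicator_columns(values, reference):
    """Build one 0/1 column per non-reference level of a categorical variable."""
    levels = sorted(set(values))
    non_reference = [lvl for lvl in levels if lvl != reference]
    return {
        f"is_{lvl}": [1 if v == lvl else 0 for v in values]
        for lvl in non_reference
    }

# Hypothetical data with five categories.
example = ["A", "B", "C", "D", "E", "B"]
columns = indicator_columns(example, reference="A")

print("number of indicator columns:", len(columns))
for name, column in columns.items():
    print(name, column)
```

An observation in the reference category has 0 in every indicator column, which is why no separate column is needed for it.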

  12. You are using a regression model to predict the prices of used cars based on mileage for four different types of cars (Toyota Camry, Honda Accord, Chevy Malibu, and Ford Fusion). You want to test whether there is an association between type of car and price after adjusting for mileage. What kind of test(s) should you use?
    1. Conduct an ANOVA F-test for the overall model. If the p-value is small, then the test provides strong evidence of an association.
    2. Conduct a partial F-test for the indicators corresponding to type of car. If the p-value is small, then the test provides strong evidence of an association.
    3. Conduct t-tests for each indicator corresponding to the type of car. If at least one of the p-values is small, then the test provides strong evidence of an association.
    4. Conduct t-tests for each indicator corresponding to the type of car. If all of the p-values are small, then the test provides strong evidence of an association.
  13. Which of the following is an appropriate strategy for defining indicator variables to represent political identity (Republican, Democrat, or Independent)? Select all that apply.
  14. You are using a multiple regression model with two explanatory variables: $x_1$ is quantitative and $x_2$ is categorical with multiple levels. You test the $x_1 \times x_2$ term in the model and find that the p-value is small. Which of the following interpretations is appropriate? Select all that apply.
    1. There is strong evidence of an association between the two explanatory variables in the population.
    2. There is strong evidence of an interaction between the two explanatory variables in the population.
    3. Removing the interaction term would lead to a large decrease in $R^2$ and increase in the standard error of the residuals.
    4. The difference in slopes seen in this sample would be unlikely to occur just by random chance if there were really no interaction.
  15. You are using a multiple regression model with two explanatory variables: $x_1$ is quantitative and $x_2$ is categorical with multiple levels. You plan to write separate prediction equations of the form $\hat{y} = b_0 + b_1 x_1$ for each level of $x_2$.

True or False: The prediction equations will be the same, regardless of whether the categorical variable, $x_2$, was recorded using indicator coding or effect coding.

  16. Which of the following factors affect the size of the test statistic in a partial F-test for comparing two models? Select all that apply.
    1. The difference in $R^2$ for the two models
    2. The difference in the number of coefficients for the two models
    3. The sample size
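One common way to write the partial F-statistic for comparing nested models uses each model's R-squared value. The sketch below implements that form; the function name `partial_f` and the example numbers are hypothetical, and `k_full` and `k_reduced` count slope coefficients (excluding the intercept).

```python
def partial_f(r2_full, r2_reduced, n, k_full, k_reduced):
    """Partial F-statistic for comparing a reduced model nested in a full model.

    r2_full, r2_reduced: R-squared of the full and reduced models
    n: sample size
    k_full, k_reduced: number of slope coefficients in each model
    """
    extra_terms = k_full - k_reduced
    df_error_full = n - k_full - 1
    numerator = (r2_full - r2_reduced) / extra_terms
    denominator = (1 - r2_full) / df_error_full
    return numerator / denominator

# Hypothetical example: the full model adds two coefficients.
f_small_n = partial_f(0.50, 0.40, n=100, k_full=5, k_reduced=3)
f_large_n = partial_f(0.50, 0.40, n=200, k_full=5, k_reduced=3)
print("F with n = 100:", round(f_small_n, 3))
print("F with n = 200:", round(f_large_n, 3))
```

Changing any one of the inputs while holding the others fixed shows how that input moves the statistic.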

Document Information

Document Type:
DOCX
Chapter Number:
4
Created Date:
Aug 21, 2025
Chapter Name:
Chapter 4 Intermediate Statistical Investigations Test Bank
Author:
Nathan Tintle
