Intermediate Statistical Investigations, 1st Ed., by Nathan Tintle
Chapter 4
Intermediate Statistical Investigations Test Bank
Question types: FIB = Fill in the blank; Calc = Calculation; Ma = Matching; MS = Multiple select; MC = Multiple choice; TF = True-false
CHAPTER 4 TERMINAL LEARNING OUTCOMES
TLO4-1: Represent the association between two quantitative variables with a linear regression model, and compare and contrast with a separate-means model.
TLO4-2: Assess the evidence of a linear association between two quantitative variables using both simulation and theory-based approaches.
TLO4-3: Use a regression model to represent the adjusted relationship between two quantitative variables based on a binary categorical variable.
TLO4-4: Include interaction between quantitative and categorical variables in a multiple regression model, and interpret the nature of the interaction.
TLO4-5: Interpret an interaction between a quantitative variable and a multi-level categorical variable in a multiple regression model.
Section 4.1: Quantitative Explanatory Variables
LO4.1-1: Describe the association between two quantitative variables numerically and graphically.
LO4.1-2: Interpret least squares regression models between two quantitative variables.
LO4.1-3: Compare and contrast separate means vs. linear regression models.
Questions 1 through 6: Does vitamin C affect tooth growth in guinea pigs? Sixty guinea pigs were randomly assigned to one of three dose levels of vitamin C (0.5, 1, or 2 mg/day). The response variable was the length of odontoblasts – cells responsible for tooth growth. The table below summarizes three models that could be used to analyze this data.
Model | Predicted length at 0.5 mg | Predicted length at 1 mg | Predicted length at 2 mg | SSError |
Single-mean | 18.813 | 18.813 | 18.813 | 3452.2 |
Separate-means | 10.605 | 19.735 | 26.100 | 1025.8 |
Linear | 12.304 | 17.186 | 26.950 | ? |
- If one of these models were used to predict length for a dosage of 1.5 mg/day, that would be called ________ (extrapolation/interpolation).
If one of these models were used to predict length for a dosage of 5 mg/day, that would be called ________ (extrapolation/interpolation).
- Which of the models could be used to predict length for a dosage of 1.5 mg/day? Select all that apply.
- Single-mean model
- Separate-means model
- Linear model
- Fill in the degrees of freedom for the separate-means model and the linear model.
Separate-means model | | Linear model | |
Source | DF | Source | DF |
Model | | Model | |
Error | | Error | |
Total | | Total | |
- Calculate the y-intercept and slope for the linear regression model. Keep three decimal places in your answer.
y-intercept = _______ slope = _______
- How does SSError for the linear model compare to SSError for the other models?
- SSError for the linear model is less than 1025.8.
- SSError for the linear model is greater than 1025.8 but less than 3452.2.
- SSError for the linear model is greater than 3452.2.
- There is not enough information to determine how SSError for the linear model compares to SSError for the other models.
- Does the table provide any indication that the linear model is not a good fit for the data?
- Yes, because the separate-means model and the linear model lead to substantially different predicted values, especially for dosages of 0.5 and 1.
- Yes, because 17.186 – 12.304 does not equal 26.950 – 17.186. If there were a linear relationship, we’d expect the predicted values to increase at a constant rate.
- No, because the linear model shows that odontoblast lengths increase at a constant rate of 9.764 units per mg/day of Vitamin C.
- No, because length and dosage are both quantitative variables, so a linear regression model is the most appropriate analysis method.
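The slope, intercept, and degrees of freedom asked for in Questions 1 through 6 can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes the "Linear" row of the table lies exactly on one line (so any two tabled points determine it) and that all 60 guinea pigs are used; the variable and function names are hypothetical.

```python
# Two predicted values from the "Linear" row of the table (dose in mg/day).
(x1, y1), (x2, y2) = (0.5, 12.304), (1.0, 17.186)

# Two points on a line determine its slope and intercept.
slope = (y2 - y1) / (x2 - x1)
intercept = y1 - slope * x1


def predict(dose):
    """Predicted odontoblast length from the fitted line."""
    return intercept + slope * dose


# Degrees of freedom, assuming n = 60 animals and 3 dose groups.
n, groups = 60, 3
df_separate_means = {"Model": groups - 1, "Error": n - groups, "Total": n - 1}
df_linear = {"Model": 1, "Error": n - 2, "Total": n - 1}

print(round(slope, 3), round(intercept, 3))
print(round(predict(2.0), 3))  # compare with the tabled value for 2 mg
print(round(predict(1.5), 3))  # interpolation: 1.5 lies between 0.5 and 2
print(df_separate_means, df_linear)
```

The third tabled prediction (2 mg) is not used to fit the line, so reproducing it serves as a check that the recovered coefficients are consistent with the table.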
Questions 7 through 9: A dataset contains information about 325 books for sale at amazon.com. The dataset includes two prices for each book: List Price (the price set by the publisher in dollars) and the Amazon Price (in dollars). The linear model that uses a book’s list price to predict its Amazon price is given below.
- Interpret the slope by filling in the blanks.
As the ______ (Amazon/list) price increases by 1 dollar, the ______ (Amazon/list) price is predicted to increase by ______ dollar(s).
- Does the y-intercept have a meaningful interpretation in this context?
- Yes. The y-intercept describes how SSError for the linear model compares to SSError for the single-mean model.
- Yes. The y-intercept describes how Amazon prices compare to list prices, on average.
- No. $0 is not a reasonable value for the Amazon price, so the intercept does not have a meaningful interpretation.
- No. $0 is not a reasonable value for the list price, so the intercept does not have a meaningful interpretation.
- Calculate the residual for a book that has a list price of $12.95 and an Amazon price of $5.18.
Sol:
Questions 10 and 11: The graph below shows the relationship between life expectancy (in years) and fertility rate (babies per woman) in 1960. Each dot represents a country, and the size of the dot indicates the country’s population.
- If you fit a linear model to these data…
The slope would be _________ (positive/negative).
The y-intercept would be _________ (positive/negative).
- In 1960, China had a fertility rate of 3.99 babies per woman and life expectancy of 31.6 years.
The residual for China would be _________ (positive/negative), which means that the linear model would _________ (overestimate/underestimate) China’s life expectancy.
- The scatterplots below show two different datasets (A and B).
In Dataset A, R² for the separate-means model _______ (>, <, =) R² for the linear model.
In Dataset B, R² for the separate-means model _______ (>, <, =) R² for the linear model.
- Which sum of squares is minimized by the least squares regression line?
- SSModel
- SSError
- SSTotal
- All three of these values are minimized by the least squares regression line.
- True or False: When two models have similar SSError values, you should generally choose the one that uses more degrees of freedom (DF for Model is higher).
- You are considering two models to predict a quantitative response: a separate-means model and a linear model. You find that SSError is smaller for the separate-means model, yet the SE of the residuals is smaller for the linear model. Does this indicate that you have made a calculation mistake?
- Yes. When the data are quantitative, SSError for the linear model should always be smaller than SSError for the separate-means model.
- Yes. SSError and the SE of the residuals both measure the amount of prediction error. As SSError increases, SE of the residuals should always increase.
- No. The separate-means model and the linear model have different degrees of freedom for error, so SSError and the SE of the residuals don’t always “agree.”
- No. Despite their similar names, SSError and the SE of the residuals measure completely different things. It is not surprising that these two values do not “agree.”
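The last question turns on the fact that the SE of the residuals divides SSError by the error degrees of freedom before taking a square root. The sketch below uses made-up numbers (20 observations, 10 groups, and invented SSError values) purely to show that the two measures can rank models differently; none of these values come from the test bank.

```python
import math


def residual_se(ss_error, df_error):
    """SE of the residuals: square root of SSError over error DF."""
    return math.sqrt(ss_error / df_error)


# Hypothetical setup: n = 20 observations, explanatory variable with 10 levels.
n, groups = 20, 10
ss_error_separate, ss_error_linear = 95.0, 100.0  # invented values

se_separate = residual_se(ss_error_separate, n - groups)  # error DF = 10
se_linear = residual_se(ss_error_linear, n - 2)           # error DF = 18

print(round(se_separate, 3), round(se_linear, 3))
```

Here the separate-means model has the smaller SSError but spends nine degrees of freedom on the model, so its SE of the residuals comes out larger.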
Section 4.2: Inference for Simple Linear Regression
LO4.2-1: Carry out simulation-based inference to assess the strength of evidence of a linear association between the quantitative explanatory and response variables.
LO4.2-2: Use a theory-based approach to assess the strength of evidence of a linear association between the quantitative explanatory and response variables.
LO4.2-3: Evaluate the validity of the theory-based approach using residual plots.
Questions 1 through 6: A teacher randomly assigned students to a seat in one of six rows during a particular unit of instruction. At the end of the unit, the teacher recorded each student’s score on the unit test. The scatterplot and least squares regression equation are given below. Do these data provide convincing evidence that sitting further away from the front of the room causes lower scores? (Higher row numbers are further from the front.)
- Match H0 and Ha to the appropriate statement in words. Two of the statements will not be used.
A. Sitting further from the front causes lower scores, on average.
B. Sitting further from the front causes higher scores, on average.
C. Sitting further from the front has an effect on average scores.
D. There is no association between seat location and score.
- You decide to use the 3S strategy to evaluate the evidence. Which of the following are statistics you could use to summarize the linear association? Select all that apply.
- Sample mean
- Sample proportion
- Correlation
- Slope
- You decide to use the 3S strategy to evaluate the evidence. How would you conduct the simulation?
- Write the scores on cards. Shuffle and deal the cards into two groups to represent seats close to the board and seats far from the board.
- Write the scores on cards. Shuffle and randomly re-assign the scores to row numbers 1 through 6.
- For each data point, randomly select an integer between 1 and 6 as the row number and randomly select an integer between 0 and 100 as the test score.
- For each data point, randomly select an integer between 1 and 6 as the row number and randomly select an integer between 52 and 100 as the test score.
- You decide to use the 3S strategy to evaluate the evidence. The dotplot below shows the shuffled slopes for 100 repetitions of the simulation.
Calculate the p-value based on this dotplot.
Sol: 10 of the 100 repetitions are less than -1.02, so we estimate the p-value to be 10/100.
- One student in the 6th row made a 100 on the test. Suppose that this student’s score were removed from the dataset. Would this affect the p-value?
After removing this point, the p-value would be ________ (higher/lower/the same as) before.
- The plots below display the residuals for this regression model. Match each validity condition to the description of how that condition should be checked. Two of the descriptions will not be used.
Independence: ________
Equal variance: ________
Normality: ________
A. Check that the mean of the residuals is 0.
B. Check that the students were randomly assigned to rows with no repeated measures.
C. Check that the histogram of the residuals is reasonably symmetric and bell-shaped.
D. Check that predicted values are spaced fairly evenly along the x-axis.
E. Check that the vertical spread of the residuals at each of the predicted values is reasonably similar.
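The shuffling simulation described in the 3S questions above can be sketched in code. The data below are hypothetical (the actual scores are only shown in the scatterplot), the function names are illustrative, and the p-value is one-sided because the research question asks about lower scores further from the front.

```python
import random


def slope(xs, ys):
    """Least squares slope for predicting ys from xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def shuffled_slope_p_value(rows, scores, reps=1000, seed=0):
    """Re-assign scores to rows at random and count slopes at least as negative as observed."""
    rng = random.Random(seed)
    observed = slope(rows, scores)
    shuffled = list(scores)
    extreme = 0
    for _ in range(reps):
        rng.shuffle(shuffled)  # breaks any link between row and score
        if slope(rows, shuffled) <= observed:
            extreme += 1
    return observed, extreme / reps


# Hypothetical data: two students per row, rows 1 through 6.
rows = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6]
scores = [95, 88, 90, 84, 86, 79, 83, 75, 78, 72, 74, 70]

observed_slope, p_value = shuffled_slope_p_value(rows, scores)
print(round(observed_slope, 2), p_value)
```

Keeping the row numbers fixed and shuffling only the scores mirrors the null hypothesis of no association: every score is equally likely to land in any row.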
Questions 7 through 10: The scatterplot and software output below show the relationship between CEO salary and number of employees for 100 companies.
Term | Coeff. | SE | t-stat | p-value |
Intercept | 85378.8 | 5604.3 | 15.23 | <0.0001 |
Employees | 35.7438 | 14.8028 | 2.41 | 0.0176 |
- Calculate a 95% confidence interval to estimate the population slope. Even using a rough estimate of t*, you should be able to choose the best option below.
- (0.07, 71.41)
- (6.37, 65.12)
- (20.94, 50.55)
- (33.33, 38.15)
- State the hypotheses for testing the association between CEO salary and number of employees.
- The theory-based p-value shown above was calculated using a t distribution with _____ degrees of freedom.
- Which of the following is an appropriate conclusion based on the small p-value for the slope?
- There is strong evidence of an association between CEO salary and Number of Employees in the population; the sample slope would be unlikely to occur by chance.
- There is a strong association between CEO salary and Number of Employees in the sample; a large portion of the variability in salaries is explained by the model.
- There is sufficient evidence to conclude that hiring more employees causes a CEO’s salary to increase slightly.
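The interval and degrees of freedom in Questions 7 through 10 follow from the Employees row of the output. The sketch below uses the rough multiplier t* = 2 suggested by the question (an approximation, not the exact critical value) and assumes n = 100 companies as stated.

```python
# Values from the "Employees" row of the software output.
coef, se = 35.7438, 14.8028
n = 100

df = n - 2      # simple linear regression estimates two coefficients
t_star = 2.0    # rough multiplier for 95% confidence (assumption)

lower = coef - t_star * se
upper = coef + t_star * se
t_stat = coef / se  # should match the reported t-stat after rounding

print(df, round(t_stat, 2))
print(round(lower, 2), round(upper, 2))
```

With the rough multiplier the endpoints land close to one of the listed options, which is all the question requires.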
Questions 11 through 13: The owner of a café suspects that coffee sales are related to the weather. For a period of 60 days in the spring, she records the number of coffees sold at her café as well as the minimum daily temperature (in degrees Fahrenheit).
- Using the partially filled in ANOVA table below, calculate the F-statistic.
Source | DF | SS | MS | F |
Model | 1 | 8010 | ||
Error | 58 | 13575 | ||
Total | 59 | 21585 |
Sol:
- The 95% confidence interval for the slope is (-1.37, -0.67). Interpret this interval.
We are 95% confident that…
- The sample slope is between 0.67 and 1.37 standard errors below the null hypothesis value (0).
- The true population slope is between 0.67 and 1.37 standard errors below the null hypothesis value (0).
- The café sells between 0.67 and 1.37 fewer coffees, on average, for every one degree increase in minimum daily temperature.
- The café sells between 0.67 and 1.37 fewer coffees, on average, for every one degree decrease in minimum daily temperature.
- The 95% confidence interval for the slope is (-1.37, -0.67). Suppose the owner continued to collect data until she had a sample size of 300 days. What would you expect to happen to the width of the confidence interval?
- The CI would get narrower, because the SE of the slope would decrease.
- The CI would get narrower, because the SE of the residuals would decrease.
- The CI would get wider, because the critical value, t*, would increase.
- The CI would get wider, because the confidence level would increase.
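For the partially filled ANOVA table in Questions 11 through 13, the F-statistic is the ratio of the two mean squares. The sketch below simply carries out that arithmetic on the tabled sums of squares and degrees of freedom; the variable names are illustrative.

```python
# Values from the ANOVA table for the coffee sales model.
ss_model, df_model = 8010, 1
ss_error, df_error = 13575, 58

ms_model = ss_model / df_model   # mean square for the model
ms_error = ss_error / df_error   # mean square error
f_stat = ms_model / ms_error

# Proportion of variation explained, from the same table.
r_squared = ss_model / (ss_model + ss_error)

print(round(ms_error, 2), round(f_stat, 2), round(r_squared, 3))
```

The F-statistic comes out at about 34.22.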
- Describe how each of the following factors is related to the standard error of the slope.
As sample size increases, SEb tends to ________ (increase/decrease).
As SE of the residuals increases, SEb tends to ________ (increase/decrease).
As SD of the explanatory variable increases, SEb tends to ________ (increase/decrease).
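All three relationships can be read off the usual theory-based formula for the standard error of the slope in simple linear regression. The symbols here are notation introduced for this note: s_e is the SE of the residuals, s_x is the SD of the explanatory variable, and n is the sample size.

```latex
SE_b = \frac{s_e}{s_x \sqrt{n - 1}}
```

The SE of the residuals sits in the numerator, while the SD of the explanatory variable and the sample size both sit in the denominator.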
- The scatterplots below display two datasets of the same size with roughly the same SD of X. Which of the following would produce a larger F statistic?
- Dataset 1 would have a larger F statistic, because SSModel is larger.
- Dataset 1 would have a larger F statistic, because SSError is smaller.
- Dataset 2 would have a larger F statistic, because SSModel is larger.
- Dataset 2 would have a larger F statistic, because SSError is smaller.
- Which of the datasets below would produce the smallest p-value?
- Dataset A would produce the smallest p-value.
- Dataset B would produce the smallest p-value.
- Dataset C would produce the smallest p-value.
- Two of these datasets are tied for the smallest p-value.
Section 4.3: Quantitative and Categorical Explanatory Variables
LO4.3-1: Adjust the relationship between two quantitative variables based on a categorical variable.
LO4.3-2: Evaluate the validity of the regression model.
LO4.3-3: Create indicator variables in order to include binary categorical variables in the regression model.
Questions 1 through 5: Consider a linear regression model that predicts body temperature (in degrees Fahrenheit) based on heart rate (in beats per minute, bpm), and sex (male, female).
The categorical variable, sex, was defined using effect coding:
Term | Estimate | SE | t-stat | p-value |
Intercept | 96.39 | 0.649 | 148.47 | <0.0001 |
Heart Rate | 0.025 | 0.009 | 2.88 | 0.0046 |
Sex [female] | 0.135 | 0.0616 | 2.19 | 0.0307 |
- True or False: Based on this model, the predicted body temperature for a male subject whose heart rate is 0 bpm is 96.39 degrees Fahrenheit. Note: Interpretations of the intercept often involve extrapolation.
- True or False: Based on this model, body temperature is predicted to increase by 0.025 degrees for each 1 bpm increase in heart rate, holding sex constant.
- Suppose there are two subjects, one male and one female, who both have the same heart rate. Based on this model, what is the predicted difference in their body temperatures, female – male?
- Does this dataset provide strong evidence of a difference in body temperatures for males and females? You may assume that α = 0.05.
This dataset provides ________ (strong/weak) evidence of a difference in body temperature for males and females, ____________ (ignoring/adjusting for) heart rate.
- Are conclusions based on theory-based p-values valid for this regression model?
- No, because the independence condition has been violated.
- No, because the linearity condition has been violated.
- No, because the equal variance condition has been violated.
- No, because the normality condition has been violated.
- Yes, because all the validity conditions are met in this scenario.
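The predictions in Questions 1 through 5 depend on how sex is coded. The sketch below assumes the common effect-coding convention suggested by the term name Sex [female], namely female = +1 and male = -1; the coding definition itself is not reproduced above, so treat that as an assumption. The function name is hypothetical.

```python
# Coefficients from the table of estimates.
INTERCEPT, B_HEART_RATE, B_SEX_FEMALE = 96.39, 0.025, 0.135

# Assumed effect coding: female = +1, male = -1.
EFFECT_CODE = {"female": 1, "male": -1}


def predict_temp(heart_rate, sex):
    """Predicted body temperature (degrees F) under the additive model."""
    return INTERCEPT + B_HEART_RATE * heart_rate + B_SEX_FEMALE * EFFECT_CODE[sex]


# Same heart rate, different sex: the heart rate term cancels.
difference = predict_temp(70, "female") - predict_temp(70, "male")

print(round(predict_temp(0, "male"), 3))
print(round(difference, 3))
```

Under this coding the female minus male difference is twice the sex coefficient, and the intercept alone is not the prediction for either sex at 0 bpm.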
Questions 6 through 8: A botanist collected data to measure the variation of Iris flowers of two related species. The data set consists of 50 flowers from two species of Iris – Iris setosa and Iris versicolor. Various features were measured including the length and width of the sepals, in centimeters. (Sepals are a part of the flower.)
- A simple linear regression model used sepal length to predict sepal width. The plot below shows residuals from that model broken down by species.
The simple linear regression model tends to ____________ (overestimate/underestimate) the sepal widths for the setosa species and ____________ (overestimate/underestimate) the sepal widths for the versicolor species.
This residual plot suggests that Species should be ____________ (included/excluded) in the model to predict sepal width.
- A multiple linear regression model used sepal length and species to predict sepal width.
Which of the following interpretations is correct?
- Sepal width is predicted to be 0.517 cm after adjusting for sepal length and species.
- As sepal length increases by 1 cm, sepal width is predicted to increase by 0.472 cm, holding species constant.
- For both species, sepal width is predicted to increase by 0.548 cm for each one cm increase in sepal length.
- For the setosa species, sepal width is predicted to increase by 0.548 cm for each one cm increase in sepal length.
- A multiple linear regression model used sepal length and species to predict sepal width.
Suppose the binary categorical variable species had been recorded using indicator coding instead of effect coding. Which of the numerical values would change? Select all that apply.
- The intercept would change.
- The slope corresponding to sepal length would change.
- The slope(s) corresponding to species would change.
- The SE of the residuals would change.
Questions 9 through 11: The plot below represents a model that uses two explanatory variables, one quantitative and one categorical, to predict y.
- Why are the lines parallel?
- On average, Y changes at the same constant rate relative to X for both Category A and Category B in the sample data.
- The model assumes that Y changes at the same constant rate relative to X for both Category A and Category B.
- Suppose the categorical variable is defined using effect coding.
Complete the model statement below:
- Suppose the categorical variable is defined by indicator coding.
Find the intercept and slopes for this model.
Term | Coefficient |
Intercept | |
Variable | |
Variable |
Questions 12 and 13: The scatterplots below display data collected from 27 automotive plants. For each plant, three variables were recorded: Defects = number of assembly defects per 100 cars, Time = time (in hours) to assemble each vehicle, and Location = whether or not the plant is located in Japan. The graph on the left shows the relationship between Defects and Time for all 27 plants. The graph on the right shows the relationship between Defects and Time for each plant location (in Japan and not in Japan).
- The slope coefficient for Time is not given in the model above. What is the interpretation of the missing coefficient?
The predicted change in _________ (Location/Defects/Time) for a one-unit change in _________ (Location/Defects/Time), holding ________ (Location/Defects/Time) constant.
- The slope coefficient for Time is not given in the model above. Based on the scatterplots, we expect the missing slope coefficient for Time to be __________ (positive/negative).
- When evaluating a regression model with multiple explanatory variables, which significance test(s) should you conduct first? Why?
- First conduct an F-test for the overall model. This protects against Type I error.
- First conduct an F-test for the overall model. This protects against Type II error.
- First conduct t-tests for the individual predictors. This protects against Type I error.
- First conduct t-tests for the individual predictors. This protects against Type II error.
- Match the terms below with their features. Each choice (A-D) will be used exactly once.
Effect coding: ________
Indicator coding: ________
A. The binary variable is coded (0, 1).
B. The binary variable is coded (-1, 1).
C. The slope coefficient is the difference in the group means.
D. The slope coefficient is the difference between the group means and the overall average (least squares mean).
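The two coding schemes in the matching question can be compared directly when the binary variable is the only predictor, because the fitted values are then just the two group means. The sketch below assumes that simplified setting, with hypothetical group means; the function name is illustrative.

```python
def coefficients_from_group_means(mean_a, mean_b):
    """Intercept and slope for a model with one binary predictor, under each coding."""
    # Indicator coding: A = 1, B = 0. The intercept is the mean of group B.
    indicator = {"intercept": mean_b, "slope": mean_a - mean_b}
    # Effect coding: A = +1, B = -1. The intercept is the average of the two group means.
    effect = {"intercept": (mean_a + mean_b) / 2, "slope": (mean_a - mean_b) / 2}
    return indicator, effect


# Hypothetical group means.
indicator, effect = coefficients_from_group_means(mean_a=10.0, mean_b=6.0)
print(indicator)
print(effect)

# Both codings reproduce the same predictions for each group.
assert indicator["intercept"] + indicator["slope"] == effect["intercept"] + effect["slope"]
assert indicator["intercept"] == effect["intercept"] - effect["slope"]
```

The predictions are identical either way; only the meaning of the intercept and slope changes, with the indicator slope equal to twice the effect-coded slope.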
- A dataset contains information about a random sample of 200 house sales in a town in the Pacific Northwest. A regression model was fit for predicting the selling price of a house (in thousands of dollars) using house size (sq ft) and whether the house has a garage as predictors. The ANOVA table is given below.
Source | DF | Sum of Squares | F | p-value |
House Size | 1 | 1,380,568 | 215.26 | <0.0001 |
Garage | 1 | 58,076 | 9.06 | 0.0030 |
Error | 197 | 1,263,472 | ||
Total | 199 | 2,668,870 |
Does SSHouseSize + SSGarage = SSModel?
- No, because there is likely covariation between size and garage in this observational study.
- No, because there is no covariation among explanatory variables when the data come from a random sample.
- Yes, because there is likely covariation between size and garage in this observational study.
- Yes, because there is no covariation among explanatory variables when the data come from a random sample.
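Whether the two sums of squares add up to SSModel can be checked from the ANOVA table itself, since SSModel = SSTotal - SSError. The sketch below carries out that comparison and also confirms that the tabled F-statistics equal each term's sum of squares divided by the mean square error; the variable names are illustrative.

```python
# Values from the house sales ANOVA table.
ss_house_size, ss_garage = 1_380_568, 58_076
ss_error, df_error = 1_263_472, 197
ss_total = 2_668_870

ss_model = ss_total - ss_error          # variation explained by the full model
sum_of_terms = ss_house_size + ss_garage

ms_error = ss_error / df_error
f_house_size = ss_house_size / ms_error  # each term has 1 DF
f_garage = ss_garage / ms_error

print(ss_model, sum_of_terms, sum_of_terms - ss_model)
print(round(f_house_size, 2), round(f_garage, 2))
```

The two totals differ, which is what one expects when the explanatory variables covary.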
- A simple linear regression model is being used to describe the relationship between two quantitative variables, x1 and y, in an observational study. The p-value for testing H0: β1 = 0 vs. Ha: β1 ≠ 0 is 0.043.
Suppose the researchers added a new categorical explanatory variable, x2, to the model. Would the p-value corresponding to the quantitative explanatory variable, x1, change?
- No, adding a new variable, x2, would not affect the p-value corresponding to x1.
- Yes, the p-value would be larger than 0.043 after adding the new variable, x2.
- Yes, the p-value would be smaller than 0.043 after adding the new variable, x2.
- The p-value is likely to change, but there is not enough information to say whether the p-value would get larger or smaller.
Section 4.4: Quantitative/Categorical Interactions
LO4.4-1: Include interaction between quantitative and categorical variables in a statistical model.
LO4.4-2: Interpret the nature of the interaction.
Questions 1 through 5: A dataset contains information about a random sample of 200 house sales in a town in the Pacific Northwest. A regression model was fit for predicting the selling price of a house (in thousands of dollars) using house size (sq ft), garage (garage=1 if the house has a garage and 0 otherwise), and the interaction between size and garage.
Term | Coeff. | SE | t-stat. | p-value |
Intercept | 141.835 | 18.130 | 7.82 | <0.0001 |
Size | 0.039 | 0.006 | 7.07 | <0.0001 |
Garage_ind | -120.392 | 24.163 | -4.98 | <0.0001 |
Size × Garage_ind | 0.062 | 0.008 | 7.62 | <0.0001 |
- True or False: In this town, there is strong evidence to suggest that the association between size and price differs based on whether the house has a garage.
- True or False: In this town, there is strong evidence to suggest that houses with garages cost less, on average, than houses without garages.
- Interpret the slope corresponding to size.
As _________ (price/size) increases by 1 unit, _________ (price/size) is predicted to increase by 0.039 units, for houses with _________ (a garage/no garage).
- Write an equation that predicts house price (in thousands of dollars) for houses with garages.
For houses with garages:
- Suppose the binary variable garage had been recorded using effect coding instead of indicator coding. Which of the numerical values would change? Select all that apply.
- The intercept would change.
- The slope corresponding to house size would change.
- The slope corresponding to garage would change.
- The slope corresponding to the interaction would change.
- The SE of the residuals would change.
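With indicator coding and an interaction term, each group gets its own intercept and slope. The sketch below combines the tabled coefficients for houses with and without a garage; the function name is hypothetical, and the 2,000 sq ft example house is invented for illustration.

```python
# Coefficients from the table (price in thousands of dollars, size in sq ft).
B0, B_SIZE, B_GARAGE, B_INTERACTION = 141.835, 0.039, -120.392, 0.062


def line_for(garage):
    """Intercept and slope of the price-vs-size line for garage = 0 or 1."""
    intercept = B0 + B_GARAGE * garage
    slope = B_SIZE + B_INTERACTION * garage
    return intercept, slope


for garage in (0, 1):
    intercept, slope = line_for(garage)
    predicted = intercept + slope * 2000  # hypothetical 2,000 sq ft house
    print(garage, round(intercept, 3), round(slope, 3), round(predicted, 3))
```

The garage coefficient shifts the intercept only, so it compares the two groups at a size of 0 sq ft, an extrapolation; the interaction coefficient is what changes the slope.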
Questions 6 through 10: A botanist collected data to measure the variation of iris flowers of two related species. The data set consists of 50 flowers from two species of iris – Iris setosa and Iris versicolor. Various features were measured including the length and width of the sepals, in centimeters. (Sepals are a part of the flower.)
A multiple regression model was fit to predict sepal width based on sepal length, species, and the interaction between sepal length and species. Note: Effect coding was used.
- Which of the following describes the relationship between length and width for the setosa species?
- As length increases by 1 cm, width is predicted to increase by 0.559 cm.
- As length increases by 1 cm, width is predicted to decrease by 0.721 cm.
- As length increases by 1 cm, width is predicted to increase by 0.239 cm.
- As length increases by 1 cm, width is predicted to increase by 0.798 cm.
- Which of the following is the appropriate interpretation of the slope corresponding to species? Note: Interpretations may involve extrapolation.
- The predicted sepal width of setosa is 0.721 lower than the predicted sepal width of versicolor for irises with a sepal length of 0 cm.
- The predicted sepal width of setosa is 0.721 lower than the predicted sepal width of versicolor for irises of average sepal length.
- The predicted sepal width of setosa is 1.442 lower than the predicted sepal width of versicolor for irises with a sepal length of 0 cm.
- The predicted sepal width of setosa is 1.442 lower than the predicted sepal width of versicolor for irises of average sepal length.
- Suppose the model used indicator coding for species instead of effect coding.
Calculate the coefficients for each term in the model.
Term | Coefficient |
Intercept | |
Sepal length | |
Species | |
Length × Species |
- Which of the following statements is a description of the interaction between sepal length and species in this sample? Select all that apply.
- Irises of the versicolor species tend to have longer sepal lengths than irises of the setosa species.
- Irises of the setosa species tend to have wider sepals than the versicolor species, and sepal length and sepal width are positively associated for both species.
- The relationship between sepal length and sepal width changes based on the species of iris.
- The difference in sepal width for the two species of iris is smaller when the sepal length is small and larger when the sepal length is large.
- The multiple linear regression model used sepal length, species, and the interaction between sepal length and species to predict sepal width. The scatterplot below shows the residuals of that model vs. the predicted values from that model, by species.
Does this residual plot indicate a problem with the regression model being used?
- Yes. The residuals are not normally distributed.
- Yes. There is no linear association between the residuals and predicted values.
- Yes. The versicolor irises tend to have lower predicted values than setosa irises.
- No. This residual plot does not indicate a problem with the regression model.
Questions 11 and 12: Suppose you want to predict body temperature (in degrees Fahrenheit) based on heart rate (in beats per minute), and sex (male, female). You decide to exclude the interaction between heart rate and sex from your model, because the term was not “statistically significant.”
- How do you determine whether an interaction is statistically significant?
- Fit a model that includes an interaction term. If the p-value for the interaction term is large, then the interaction is statistically significant.
- Fit a model that includes an interaction term. If the p-value for the interaction term is small, then the interaction is statistically significant.
- Fit a model that includes an interaction term. If the R² value for that model is large, then the interaction is statistically significant.
- Fit a model that does not include an interaction term. If the R² value for that model is large, then the interaction is not statistically significant.
- Explain what “no interaction” means in the context of this question. Select all that apply.
- The predicted difference in heart rates for men and women does not depend on body temperature.
- The relationship between heart rate and body temperature is the same regardless of sex.
- Neither sex nor heart rate is associated with body temperature, so neither of these are useful predictors in the model.
- There is no association between sex and heart rate, so knowing a person’s sex does not help predict their heart rate.
- What statistical term do we use when one explanatory variable modifies the association of another explanatory variable with the response?
- Association
- Covariation
- Interaction
- Regression
- In which of the following study designs is it possible to have an interaction between two explanatory variables?
- An observational study with substantial covariation between two predictors
- An observational study with minimal association between the predictors
- A designed experiment with factors that are independent of each other
- Interaction may occur in any of the three designs described above.
- The plot below represents a model that uses two explanatory variables, one quantitative and one categorical, to predict a quantitative response. The model includes an interaction term. The categorical variable is defined using indicator coding (1 for category A, 0 for category B).
Find the coefficients for this model.
Term | Coefficient |
Intercept | |
Variable X1 | |
Variable X2 | |
Variable X1 × Variable X2 |
- The plots below represent two different datasets. Each dataset has two explanatory variables, one quantitative and one categorical, that are being used to predict a quantitative response. The sample sizes and SE of the residuals are roughly the same for both datasets.
For each dataset, a significance test is used to decide whether the interaction is statistically significant. How do the p-values for the two tests compare?
- Dataset 1 would produce a smaller p-value for testing the interaction.
- Dataset 2 would produce a smaller p-value for testing the interaction.
- Both datasets would produce the same p-value for testing the interaction.
- There is not enough information to determine which test would result in a smaller p-value.
Section 4.5: Multi-level Categorical Variables
LO4.5-1: Include categorical variables with more than two categories in a linear model.
LO4.5-2: Interpret an interaction between a quantitative variable and a multi-level categorical variable.
Questions 1 through 5: Most universities ask students to fill out course evaluations at the end of the semester, but some students choose not to complete the survey. The statistics department wonders whether high response rates are associated with higher or lower average ratings. They also want to account for differences in ratings for courses at different levels: introductory, intermediate, and advanced.
A multiple regression model (with no interaction) was used to predict a section’s average rating using response rate and course level. Response rate was recorded as a proportion and course level was recorded using indicator coding. The table of coefficients is given below. Note: The observational units are course sections, not individual students.
Term | Coeff. | SE | t-stat. | p-value |
Intercept | 3.8717 | 0.2404 | 16.103 | <0.0001 |
ResponseRate | 0.9559 | 0.3468 | 2.757 | 0.0076 |
Level_intro | -0.7026 | 0.1111 | -6.321 | <0.0001 |
Level_inter | -0.2585 | 0.1108 | -2.563 | 0.0128 |
- The interaction term was not included in the final model for predicting average course ratings, because it was not statistically significant. Using the partially filled in ANOVA table below, calculate the F-statistic for testing the interaction between course level and response rate. Use three decimal places in your answer.
Source | DF | SS | MS | F | p-value |
ResponseRate | | 0.6888 | | | 0.0085 |
Level | | 8.9508 | | | <0.0001 |
ResponseRate × Level | | 0.0307 | | | 0.8483 |
Error | | 5.6795 | | | |
Total | 66 | | | | |
Sol:
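One way to work through this calculation is sketched below. It assumes the four-decimal entries in the ANOVA table are sums of squares and that 66 is the total degrees of freedom. The degrees of freedom for the individual sources are not printed in the table, so they are derived here from the three course levels and the single quantitative predictor.

```python
# Sums of squares read from the ANOVA table (assumed to be the SS column).
ss_interaction = 0.0307
ss_error = 5.6795

# Degrees of freedom, derived rather than read from the table.
df_total = 66
df_response_rate = 1            # one quantitative predictor
df_level = 3 - 1                # three course levels need two indicators
df_interaction = df_response_rate * df_level
df_error = df_total - df_response_rate - df_level - df_interaction

# Mean squares are sums of squares divided by their degrees of freedom.
ms_interaction = ss_interaction / df_interaction
ms_error = ss_error / df_error

# The F-statistic compares the interaction mean square to the error mean square.
f_stat = ms_interaction / ms_error
print(f"df_error = {df_error}, F = {f_stat:.3f}")
```

An F-statistic this far below 1 is consistent with the large p-value (0.8483) reported for the interaction.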
- Fill in the linear equations to predict the average rating based on the response rate for each course level. Do not round.
Introductory:
Intermediate:
Advanced:
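The sketch below shows how the three equations could be assembled from the table of coefficients. It assumes that advanced is the reference level, since the table has no indicator for it, and the variable names are illustrative only. Rounding to four decimals only removes floating-point noise, because the coefficients themselves are given to four decimals.

```python
# Coefficients from the additive model's table of coefficients.
intercept = 3.8717
slope_response_rate = 0.9559
level_shift = {
    "Introductory": -0.7026,   # Level_intro
    "Intermediate": -0.2585,   # Level_inter
    "Advanced": 0.0,           # assumed reference level (no indicator in the table)
}

# In an additive model every level shares the slope; only the intercept shifts.
lines = {
    level: (round(intercept + shift, 4), slope_response_rate)
    for level, shift in level_shift.items()
}

for level, (b0, b1) in lines.items():
    print(f"{level}: predicted rating = {b0} + {b1} * ResponseRate")
```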
- Describe the relationship between the response rate and the average course rating based on the additive model given above.
As the response rate increases by 0.1, the average course rating is predicted to increase by ________ for ____________ (introductory/intermediate/advanced/all) level(s).
- After controlling for the response rate, which course levels are significantly different in terms of average ratings? Match the questions below to the appropriate p-values.
One of the p-values will not be used.
Is introductory different from intermediate? A. P-value = 0.0076
Is intermediate different from advanced? B. P-value < 0.0001
Is introductory different from advanced? C. P-value = 0.0128
D. P-value not shown in table.
- The scatterplot below shows the relationship between average rating and response rate with a separate line of best fit for each level (introductory, intermediate, and advanced).
The output above shows that the slope corresponding to response rate is 0.9559. If level were removed from the model, how would the slope coefficient change?
- The slope relating average rating to response rate would be higher than 0.9559.
- The slope relating average rating to response rate would be lower than 0.9559.
- The slope would change, because there is covariation between level and response rate, but there is not enough information to predict if it would increase or decrease.
- The slope would not change much, because there is no significant interaction between response rate and level.
Questions 6 through 8: A realtor took a random sample of 50 houses currently for sale from each quadrant of the town where she works (northeast, northwest, southeast, southwest). She wonders whether the relationship between size (in sq ft) and price (in dollars) differs based on location. The table of coefficients is given below.
Term | Coefficient | Std. Error | t-stat | p-value |
(Intercept) | 127,739.66 | 17,060.53 | 7.487 | <0.0001 |
size | 129.61 | 12.85 | 10.087 | <0.0001 |
locationNW | -573.30 | 23,632.62 | -0.024 | 0.9807 |
locationSE | -5,092.04 | 21,951.00 | -0.232 | 0.8168 |
locationSW | -15,854.18 | 19,366.04 | -0.819 | 0.4140 |
size:locationNW | 23.60 | 15.69 | 1.504 | 0.1341 |
size:locationSE | -60.36 | 17.17 | -3.516 | 0.0005 |
size:locationSW | -24.51 | 13.65 | -1.796 | 0.0741 |
- Based on the graph below, how can you tell that there is an interaction between size and location in this sample?
- Some parts of town tend to have larger houses than other parts of town.
- Some parts of town tend to have more expensive houses than other parts of town.
- The best fit lines for the four parts of town do not all have the same intercept.
- The best fit lines for the four parts of town do not all have the same slope.
- Which location is the reference category?
- NE
- NW
- SE
- SW
- Fill in the linear equation to predict price for houses in the NW quadrant of town. Do not round.
NW:
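As a rough sketch of how a quadrant-specific line follows from an interaction model, the code below adds each location's indicator coefficient to the intercept and its interaction coefficient to the size slope. It assumes NE is the reference category because the table has no NE terms. The dictionary and function names are hypothetical, and rounding to two decimals only strips floating-point noise.

```python
# Coefficients from the realtor's table of coefficients.
coef = {
    "(Intercept)": 127739.66,
    "size": 129.61,
    "locationNW": -573.30,
    "locationSE": -5092.04,
    "locationSW": -15854.18,
    "size:locationNW": 23.60,
    "size:locationSE": -60.36,
    "size:locationSW": -24.51,
}

def line_for(location):
    """Return (intercept, slope) for one quadrant; NE is the assumed reference."""
    b0 = coef["(Intercept)"] + coef.get(f"location{location}", 0.0)
    b1 = coef["size"] + coef.get(f"size:location{location}", 0.0)
    return round(b0, 2), round(b1, 2)

for loc in ("NE", "NW", "SE", "SW"):
    b0, b1 = line_for(loc)
    print(f"{loc}: predicted price = {b0} + {b1} * size")
```

With an interaction in the model, both the intercept and the slope change from quadrant to quadrant, which is why the fitted lines are not parallel.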
Questions 9 and 10: The plot below represents a dataset with two explanatory variables, one quantitative and one categorical, that are being used to predict a quantitative response.
- For each term of the model, predict whether the coefficient would be positive, negative, or close to 0.
Term | Coeff. |
Intercept | |
X1 | |
Indicator_B | |
Indicator_C | |
X1 × Indicator_B | |
X1 × Indicator_C | |
- If you tested the significance of the interaction in this model, what would you expect to find? Assume the sample size is large enough to provide strong evidence if the interaction actually exists.
- The p-value would probably be large, because two of the slopes are very similar.
- The p-value would probably be large, because the slopes are not all the same.
- The p-value would probably be small, because two of the slopes are very similar.
- The p-value would probably be small, because the slopes are not all the same.
- Suppose a regression model has a categorical variable with 5 categories. You would need to create _____ indicator variable(s) to represent this categorical variable.
Provide a numerical value for your answer.
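The following is a minimal illustration of indicator coding, using made-up category labels A through E and a hypothetical helper function. One category serves as the reference and gets no column of its own, so the number of indicator columns is one fewer than the number of categories.

```python
def indicator_code(values, reference):
    """Build 0/1 indicator columns for every category except the reference."""
    levels = sorted(set(values))
    non_reference = [lv for lv in levels if lv != reference]
    return {
        f"Indicator_{lv}": [1 if v == lv else 0 for v in values]
        for lv in non_reference
    }

# Hypothetical data: six observations spread over five categories.
values = ["A", "B", "C", "D", "E", "B"]
indicators = indicator_code(values, reference="A")

for name, column in indicators.items():
    print(name, column)
print("number of indicator variables:", len(indicators))
```

An observation in the reference category has 0 in every indicator column, so a further column would add no new information.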
- You are using a regression model to predict the prices of used cars based on mileage for four different types of cars (Toyota Camry, Honda Accord, Chevy Malibu, and Ford Fusion). You want to test whether there is an association between type of car and price after adjusting for mileage. What kind of test(s) should you use?
- Conduct an ANOVA F-test for the overall model. If the p-value is small, then the test provides strong evidence of an association.
- Conduct a partial F-test for the indicators corresponding to type of car. If the p-value is small, then the test provides strong evidence of an association.
- Conduct t-tests for each indicator corresponding to the type of car. If at least one of the p-values is small, then the test provides strong evidence of an association.
- Conduct t-tests for each indicator corresponding to the type of car. If all of the p-values are small, then the test provides strong evidence of an association.
- Which of the following is an appropriate strategy for defining indicator variables to represent political identity (Republican, Democrat, or Independent)? Select all that apply.
- You are using a multiple regression model with two explanatory variables: one is quantitative and one is categorical with multiple levels. You test the interaction term in the model and find that the p-value is small. Which of the following interpretations is appropriate? Select all that apply.
- There is strong evidence of an association between the two explanatory variables in the population.
- There is strong evidence of an interaction between the two explanatory variables in the population.
- Removing the interaction term would lead to a large decrease in R² and increase in the standard error of the residuals.
- The difference in slopes seen in this sample would be unlikely to occur just by random chance if there were really no interaction.
- You are using a multiple regression model with two explanatory variables: one is quantitative and one is categorical with multiple levels. You plan to write separate prediction equations for each level of the categorical variable.
True or False: The prediction equations will be the same, regardless of whether the categorical variable was recorded using indicator coding or effect coding.
- Which of the following factors affect the size of the test statistic in a partial F-test for comparing two models? Select all that apply.
- The difference in for the two models
- The difference in the number of coefficients for the two models
- The sample size
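For reference, one common way of writing the partial F-statistic is given below. The notation is an assumption made for this sketch rather than the test bank's own: the "reduced" and "full" subscripts mark the two nested models, p counts the coefficients in each model (including the intercept), and n is the sample size.

```latex
F = \frac{\left(SSError_{\text{reduced}} - SSError_{\text{full}}\right) / \left(p_{\text{full}} - p_{\text{reduced}}\right)}
         {SSError_{\text{full}} / \left(n - p_{\text{full}}\right)}
```

The numerator depends on how much the error sum of squares changes and on how many coefficients were added. The denominator depends on the sample size through the error degrees of freedom.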