R Tutorial: ANOVA and Simple Linear Regression


Part A. ANOVA

Blue light has been a hot topic over the past few years. As people spend more time looking at screens, it is becoming increasingly important to understand the effects of blue light exposure on a person’s sleep. In this experiment, all 22 subjects are required to be exposed to 8 hours of blue light per day. Each subject was randomly assigned one of three blue light treatment groups, the control, treatment 1, or treatment 2. In treatment 1, subjects wore blue light blocking glasses for two hours before bedtime. In treatment 2, subjects were required to read a paper book for 30 minutes before bed. In the control group, subjects were expected to wear glasses that do not block blue light for two hours before bedtime. All subjects were required to be in bed by 10PM and have their alarm set for 6AM. Digital watches were used for tracking the average # of hours slept for each subject over 2 weeks with each assigned treatment group. Below are the average hours of sleep for each of the 22 subjects over 2 weeks.
(Note: The data used in Part A is fake)

Treatment Hours of Sleep (average)
Control 7.2, 6.8, 5.9, 7.4, 7.35, 6.87, 7.95, 7.13
Glasses 7.11, 7.44, 7.32, 7.88, 7.91, 7.43, 7.39
Book 7.9, 5.5, 6.9, 7.63, 7.34, 7.54, 7.92

Question A1

Consider the following incomplete R output:

Source Df Sum of Squares Mean Squares F-statistics p-value
Treatments ? ? ? ? ?
Error ? ? ?
TOTAL ? ?

Fill in the missing values in the analysis of the variance table.

knitr::opts_chunk$set(echo = TRUE)

Control <- c(7.2, 6.8, 5.9, 7.4, 7.35, 6.87, 7.95, 7.13)
Glasses <- c(7.11, 7.44, 7.32, 7.88, 7.91, 7.43, 7.39)
Book <- c(7.9, 5.5, 6.9, 7.63, 7.34, 7.54, 7.92)

df <- data.frame(Treatment=c(rep("Control", times=length(Control)), 
                        rep("Glasses", times=length(Glasses)),
                        rep("Book", times=length(Book))), 
                 Sleep_Hours=c(Control, Glasses, Book))

#print(df)

model <- aov(df$Sleep_Hours ~ df$Treatment)

summary(model)
##              Df Sum Sq Mean Sq F value Pr(>F)
## df$Treatment  2  0.668  0.3341   0.873  0.434
## Residuals    19  7.274  0.3829

Question A1 Answer

Source Df Sum of Squares Mean Squares F-statistics p-value
Treatments 2 0.668 0.3341 0.873 0.434
Error 19 7.274 0.3829
TOTAL 21 7.942

Question A2

Use \(\mu_1\), \(\mu_2\), and \(\mu_3\) as notation for the three mean parameters and define these parameters clearly based in the context of the topic above. Find the estimates of these parameters.

knitr::opts_chunk$set(echo = TRUE)

u1 <- mean(Control)
u2 <- mean(Glasses)
u3 <- mean(Book)

cat("The average hours of sleep for the control group is", u1, ".")
## The average hours of sleep for the control group is 7.075 .
cat("The average hours of sleep for the blue light glasses treatment group is", u2, ".")
## The average hours of sleep for the blue light glasses treatment group is 7.497143 .
cat("The average hours of sleep for the reading a book treatment group is", u3, ".")
## The average hours of sleep for the reading a book treatment group is 7.247143 .

Question A3

What are the null and alternative hypotheses of interest in ANOVA? Use the parameters in part A2 to write the two hypotheses. Based on the ANOVA table in A1, answer the following questions:

Null Hypothesis: U1 = U2 = U3

Alternative Hypothesis: Means are not all equal

(i) Fill in the blanks for k-1 and N-k: Under \(H_0\) (i.e., when the null hypothesis is true), the \(F\)-test statistic has an \(F\)(k-1;N-k) distribution: (This is the distribution under which the p-value is calculated).

k-1 = 2

n-k = 19

(ii) What is the p-value of the ANOVA \(F\)-test?

The p-value of the ANOVA F-test is 0.434.

(iii) State your conclusions in context. Do blue light blocking glasses or reading a book affect sleep?

The difference of the mean values for the treatments and the control are not statistically significant. Therefore, we fail to reject the null hypothesis, accepting that blue light blocking glasses nor reading a book changes a subject’s sleep.

Part B. Simple Linear Regression

Gross Domestic Product is the total monetary value of all the finished goods and services produced within a country’s borders in a specific time period.There are several methods for calculating GDP. The most common one is the expenditure approach. In this approach, GDP is estimated by summing consumption, government spending, investment, and net exports. Consumption is when consumers spend money to purchase goods and services, such as an iPhone or a haircut. Consumer spending accounts for 2/3 of GDP in this approach. Because consumption is dependent on spending, theoretically, it should be correlated with population. In Part B, we will explore the relationship between state population and GDP based on data sourced from the Bureau of Economic Analysis (BEA).

The GDPperState data file includes the following columns:

GeoFips: Geographic code

GeoName: Geographical region, including all U.S. states and various regions

2019: Real GDP for 2019 (millions of chained 2012 dollars)

The PersonalIncomeSummaryTable data file includes the following columns:

GeoFips: Geographic code

GeoName: Geographical region, including all U.S. states

LineCode: Statistic code

Description: Description of the statistic, including units

2019: Value of the statistic in 2019

The data is stored locally in two CSV files. To read the data in R, we will save them in my working directory and read the data using the R function read.csv(). We will utilize the following arguments to clean the source data before the CSV read:

head –> If the file contains Columns names as the First Row then it is TRUE otherwise, FALSE
sep –> specify the character that is separating the fields
skip –> specifies the number of rows you want to skip from file
nrows –> restrict the total number of rows to read

#Import two CSV files from local source

GDPperState = read.csv("C:\\Users\\71104861\\Documents\\Personal\\DataGeneralist\\Data Lessons\\Rblog2\\GDPperState.csv", head = TRUE, sep = ",", skip = 4, nrows = 52)

PersonalIncomeTable = read.csv("C:\\Users\\71104861\\Documents\\Personal\\DataGeneralist\\Data Lessons\\Rblog2\\PersonalIncomeSummaryTable.csv", head = TRUE, sep = ",", skip = 4, nrows = 156)

Then, we will filter out irrelevant rows and columns and merge the useful information into the same data frame to prepare it for plotting, modeling, and prediction.

Keep Geofips until after merge…

library(dplyr)

#Filter out non-population data and the United States code
PersonalIncomeTable = PersonalIncomeTable[PersonalIncomeTable$Description == "Population (persons) 1/", ]
PersonalIncomeTable = PersonalIncomeTable[PersonalIncomeTable$GeoFips > 0, ]

#Remove two columns
PersonalIncomeTable$LineCode <- NULL
PersonalIncomeTable$Description <- NULL

#Rename two columns
PersonalIncomeTable <- rename(PersonalIncomeTable, Population = X2019)
PersonalIncomeTable <- rename(PersonalIncomeTable, State2 = GeoName) 

#Filter out United States code
GDPperState = GDPperState[GDPperState$GeoFips > 0, ]

#Rename two columns
GDPperState <- rename(GDPperState, Real_GDP = X2019) 
GDPperState <- rename(GDPperState, State = GeoName) 

#Merge the two dataframes on the GeoFips code
GDP_Pop_Merge <- inner_join(PersonalIncomeTable, GDPperState, by = 'GeoFips')

#Delete extra state column and GeoFips column
GDP_Pop_Merge$GeoFips <- NULL
GDP_Pop_Merge$State2 <- NULL

#Convert population to millions
GDP_Pop_Merge$Population = GDP_Pop_Merge$Population / 1000000

#Write the two variables as vectors by slicing the data frame
Population = as.numeric(GDP_Pop_Merge[,1])
Real_GDP = as.numeric(GDP_Pop_Merge[,3])

Question B1: Exploratory Data Analysis

a. Using a scatter plot describe the relationship between the state’s population and its real GDP. Describe the general trend (direction and form). (Use the plot() function in R with the two input variables. Include plots and R-code used.

plot(Population, Real_GDP, main="State GDP vs. Population in 2019 Scatterplot", xlab="Population (in millions)", ylab="Real GDP (millions of dollars)", pch=19)

abline(lm(Real_GDP ~ Population), col="red")

plot of chunk unnamed-chunk-8

In general, as the state population increases, the real GDP increases. The linear relationship between these two variables appears to be strong.

b. What is the value of the correlation coefficient? (Use the cor() function in R with the two input variables. Please interpret. Discuss the difference in the strength in correlation.

cor(Real_GDP, Population)
## [1] 0.9807532

Correlation coefficient is 0.98.

There is a very strong positive linear relationship between the state’s population and its real GDP. As the number of people living in a state increases, we would expect the real GDP to increase.

c. Based on this exploratory analysis, is it reasonable to assume a simple linear regression model for the relationship between population and real GDP? Did you note anything unusual?

Yes, it is reasonale to assume a simple linear regression model for the relationship between these two variables because the correlation coefficient is .98. However, the variance appears to increase as population increases to the upper quartile of values. Also, the majority of the data is clustered under 8 million in population. In conclusion, the strong correlation suggests that the model is good for explaining the relationship, but the variance and clustering lead me to question the model’s predictive power.

d. Based on the analysis above, would you pursue a transformation of the data?

If this model’s purpose is explanatory power, I would not transform it. However, if the model is for prediction, I would consider a transformation.

Question B2: Fitting the Simple Linear Regression Model

Fit a linear regression to evaluate the relationship between the rate of accidents and the number of signs. Do not transform the data. The function to use in R is:

model = lm(Real_GDP ~ Population)

a. What are the model parameters and what are their estimates?

model = lm(Real_GDP ~ Population)
summary(model)
## 
## Call:
## lm(formula = Real_GDP ~ Population)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -396218  -49786    6977   37437  273730 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -47980      17960  -2.671   0.0102 *  
## Population     64949       1847  35.161   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 96140 on 49 degrees of freedom
## Multiple R-squared:  0.9619, Adjusted R-squared:  0.9611 
## F-statistic:  1236 on 1 and 49 DF,  p-value: < 2.2e-16

b. Write down the equation for the least squares line.

Real_GDP = 64949*(Population) – 47980

c. Interpret the estimated value of the slope parameter in the context of the problem. Include its standard error in your interpretation.

If you add one million of population to a state, the real GDP will increase by an average of 64,949 million. Most of the deviation from this average rate of increase can be explained by the standard error of 1847 million added or subtracted from this slope parameter.

d. Find a 95% confidence interval for the slope parameter. Is the slope statistically significant at this level?

confint(model, level=0.95)
##                 2.5 %    97.5 %
## (Intercept) -84072.70 -11887.02
## Population   61236.93  68661.02

The 95% confidence interval for the slope parameter is (61,237, 68,661). Yes, it is statistically significant at this level.

Question B3: Checking the Assumptions of the Model

To check whether the assumptions are met, we use three visual displays:

  1. Scatterplot of the data
plot(Population, Real_GDP, main= "State GDP vs. Population in 2019 Scatterplot", xlab="Population (in millions)", ylab="Real GDP (millions of dollars)", pch=19)
abline(lm(Real_GDP ~ Population), col="red")

plot of chunk unnamed-chunk-12

  1. Residual plot – a plot of the residuals, \(\epsilon_i\), versus, \(\hat{y}_i\) (also called the predicted or fitted values)
model.res = resid(model)
plot(Population, model.res, 
  ylab="Real GDP", xlab="Population", 
  main="Population vs. Real GDP") 
abline(0, 0) 

plot of chunk unnamed-chunk-13

  1. Normal probability plot of the residuals, or q-q
model.res = resid(model)
qqnorm(model.res, pch = 1, frame = FALSE)
qqline(model.res, col = "steelblue", lwd = 2)

plot of chunk unnamed-chunk-14

plot(model)

plot of chunk unnamed-chunk-15plot of chunk unnamed-chunk-15plot of chunk unnamed-chunk-15plot of chunk unnamed-chunk-15

#Calculate Cook's distance for each point. If the value is greater than 4/n, then it is an outlier.

GDP_Pop_Merge$cooksd <- cooks.distance(model)
threshold <- 4/nrow(GDP_Pop_Merge)
#Threshold is .078

Provide the plots and R commands used to evaluate the assumptions.

Interpret the 3 displays with respect to the assumptions of the linear regression model. In other words, comment on whether there are any apparent departures from the assumptions of the linear regression model. Make sure that you state the model assumptions and assess each one. Describe what graphical tool you used to evaluate each assumption. Finally, are there any extreme outliers in the data/residuals?

The scatterplot shows that there is a very strong positive linear relatiomship between the two variables. The residual deviance seems to increase as population increases suggesting a non-constant variance. The issues continue in the q-q plot which reveals that the residuals are not normally distributed. The upper quantile does not contain the expected number of observations. It looks like most of the data is distributed on the left side with a “long tail” extending out to the right. This cluster on the left side is confirmed in the residual plot.
California and Florida appear to be outliers. Both of them have values greater than the Cook’s distance threshold defined by 4/n = .078.

Question B4: Prediction

Suppose we are interested in what the rate of accidents is when population = 9.13. Please make a prediction and provide the 95% prediction interval. What observations can you make about the result?

predict(model, data.frame(Population = c(9.13)), se.fit=T, interval="prediction", level=0.95)
## $fit
##        fit      lwr      upr
## 1 545004.3 349659.7 740348.9
## 
## $se.fit
## [1] 14352.74
## 
## $df
## [1] 49
## 
## $residual.scale
## [1] 96141.49
predict(model, data.frame(Population = c(9.13)), interval="predict", level=0.99)
##        fit      lwr      upr
## 1 545004.3 284494.4 805514.2

Having 9.13 millions of people is predicted to yield $545,004 million GDP. The 95% prediction interval estimates this value to be between (284494.4, 805514.2), which is a large range, considering the scatter plot shows the majority of GDP data points ti be between these values. This seems like the model does not provide a prediction with too much value.

Additional Related Resources

One-way ANOVA tutorial: https://datascienceplus.com/one-way-anova-in-r/

Simple Linear Regression tutorial: http://www.r-tutor.com/elementary-statistics/simple-linear-regression

Sources, Methodology and Notes

Problem sets: The problem sets are modeled after Georgia Tech’s ISYE 6414 Regression Analysis Course; however, the data sets have changed considerably.

Data:

Part A: The data used is fake.

Part B: The data is sources from the Bureau of Economic Analysis (BEA)

SAINC1 Personal Income Summary: Personal Income, Population, Per Capita Personal Income
SAGDP1 Gross Domestic Product (GDP) summary, annual by state
https://apps.bea.gov/itable/iTable.cfm?ReqID=70&step=1#reqid=70&step=1&isuri=1

Rwordpress Help:
https://www.tabvizexplorer.com/how-to-upload-r-markdown-directly-to-wordpress/
https://rpubs.com/pbaumgartner/r2wp


4 Comments

Leave a Reply