Hypothesis Testing and Causal Inference (MT11)

At this stage, students should have sufficient knowledge to understand the basic form of linear models and the purpose of explanatory variables in reducing the error variance around our "best guess". Rather than the amount of error within our models, this week we'll be asking a different question: how much error is there in our estimates? In other words, what is the spread, or variance, of our linear model estimates across many possible samples? We will introduce confidence intervals as a way of describing this error variance in the estimate and discuss the various ways in which they can be constructed.

Learning outcomes

At the end of today’s workshop you should be able to do the following:

  1. Calculate and interpret a 95% confidence interval using normal theory and bootstrapping in a group comparison linear model

  2. Calculate and interpret a 95% confidence interval using normal theory and bootstrapping in a linear regression model

Install required packages

install.packages("supernova")
install.packages("readr")
install.packages("sjlabelled")
install.packages("supernova")
install.packages("readr")
install.packages("tidyverse")
install.packages("ggplot2")
install.packages("lm.beta")
install.packages("ggpubr")
install.packages("mosiac")
install.packages("Hmisc")
install.packages("car")
install.packages("plyr")

Load required packages

The required packages are the same as the installed packages. Write the code needed to load the required packages in the below R chunk.

library(sjlabelled)
library(supernova)
library(readr)
library(tidyverse)
library(ggplot2)
library(lm.beta)
library(ggpubr)
library(mosaic)
library(Hmisc)
library(car)
library(plyr)

The WVS data

For this week's workshop, we are going to return to the World Values Survey data. Just like in previous weeks, we are going to run a group comparison linear model and a regression linear model. However, this week we are going to calculate the confidence intervals for the estimates, rather than relying on the t ratio and p value for statistical inference.

If you would like to have a quick recap on this data, go to the World Values Survey site. Don't download anything: http://www.worldvaluessurvey.org/WVSContents.jsp.

You can access the R data file of the data here: https://github.com/thomcurran/PB130/raw/master/MT3/Workshop/WV6_Data_R_v20180912.rds. The file is called WV6_Data_R_v20180912.rds. The .rds extension is an R-specific format; most commonly we'll work with .csv files.

Now we are going to open the file in R using RStudio or RStudio Server. Use the command below, but be sure to change the filepath to the one where you have put the WV6_Data_R_v20180912.rds file. Alternatively, find the WV6_Data_R_v20180912.rds file in the MT11 workshop folder, click on it, and save it as WVS.

WVS <- readRDS("C:/Users/tc560/Dropbox/Work/LSE/PB130/MT11/Workshop/WV6_Data_R_v20180912.rds")

WVS$V2A <- as_character(WVS$V2A) # this saves the country variable as a "character" variable (i.e., the names of the countries) rather than its default "numeric" value.

Two-parameter model or independent t-test with confidence intervals

Okay, let's now move to a new research topic using the WVS to examine differences between independent groups. We will start with a two-parameter model (two groups) because it is the easiest to compare to the two-parameter regression model (one continuous explanatory variable) that we will run afterwards.

As we should now know, the two-parameter linear model is often called an independent t-test. Like all linear models, this test can be run using the lm() function in R.

To start, let's ask an interesting question of the WVS. In the data, there are some items that ask respondents about their attitudes toward marginalised groups. One of those questions asks whether respondents:

"Would not like to have immigrants/foreign workers as neighbors"

Coders recorded a value of 1 if the respondent mentioned this and 2 if the respondent did not mention this in their interview.

1 - Mentioned
2 - Not mentioned

Presently, the issue of attitudes toward immigrants is quite a 'live' one in the USA. So let's see whether Americans who reported that they would not like to have immigrants as neighbors differ in levels of life satisfaction from those who did not report such attitudes. Our theory is that life satisfaction may differentiate these attitudes, but we don't know yet.

The life satisfaction question asks respondents how satisfied they are with their life and is measured on a 1-10 scale:

  1. Completely dissatisfied

. . .

  10. Completely satisfied

Before we delve into this topic, though, let's just clarify the research question for our two-parameter linear model and the null hypothesis we are testing:

Research Question - Do levels of life satisfaction differ between Americans who would and would not like to have immigrants as neighbors?

Null Hypothesis - The difference in life satisfaction between Americans who would and would not like to have immigrants as neighbors will be zero.

I looked in the codebook and the immigrant variable is listed as V39. We also want the life satisfaction variable, which is listed as V23. So let's look for the variables and select them, then head the new dataframe entitled WVSimmigrant. Let's also select the country code variable, V2A, so we can use this to extract only responses for the USA. I have left some of the below code blank for you to fill the gap.

WVSimmigrant <- 
  WVS %>% 
  select(V39, V23, V2A) # SELECT THE VARIABLES OF INTEREST HERE
head(WVSimmigrant)
## # A tibble: 6 x 3
##   V39        V23        V2A    
##   <labelled> <labelled> <chr>  
## 1 2          8          Algeria
## 2 2          5          Algeria
## 3 2          4          Algeria
## 4 2          8          Algeria
## 5 1          8          Algeria
## 6 1          7          Algeria

Just like other variables in this dataset, we need to edit the dataframe since the variables have values we can't interpret, like -5. While we are at it, let's also filter for the USA as the country of interest. We can do that by using the filter() function.

WVSimmigrant <- 
WVSimmigrant %>% 
filter(V39 >= "1" & V23 >= "1" & V2A == c("United States")) 
WVSimmigrant
## # A tibble: 2,216 x 3
##    V39        V23        V2A          
##    <labelled> <labelled> <chr>        
##  1 2           7         United States
##  2 2           8         United States
##  3 2           8         United States
##  4 2           8         United States
##  5 2          10         United States
##  6 2           6         United States
##  7 2           7         United States
##  8 2           7         United States
##  9 1           7         United States
## 10 2           8         United States
## # ... with 2,206 more rows

Now, given we are building a linear model with a categorical explanatory variable, the coding of that variable is really important. We want to find the increment in life satisfaction from the group who reported they would not like to have immigrants as neighbors to those who don't mind. Currently, the V39 variable is coded 1 = mentioned and 2 = not mentioned. We are going to recode the 'not mentioned' group (i.e., 2) as 0, so that the intercept in our linear model reflects the mean level of life satisfaction for those who would not like to have immigrants as neighbors (coded 1) and the slope reflects the increment to those who did not mention this (coded 0).

WVSimmigrant$V39 <- revalue(as.factor(WVSimmigrant$V39), c("2"="0")) # revalue is part of the plyr package; it takes the V39 variable from the WVSimmigrant data frame and replaces values of 2 with values of 0. revalue expects a factor, so the as.factor() call forces V39 to a factor (i.e., not a numeric) variable.

Now let's run the histogram and look at the life satisfaction distribution.

WVSimmigrant %>%
  ggplot(aes(x= V23)) +
  geom_histogram(aes(y=..density..), binwidth=1, colour="black", fill="white") + 
  geom_density(adjust = 4, alpha = .2, fill = "antiquewhite3") +
  geom_vline(aes(xintercept=mean(V23)), col="black", linetype="dashed", size=1) +
  geom_vline(aes(xintercept=median(V23)), col="black", size=1) +
  ylab("Density") +
  xlab("Life Satisfaction")+
  ggtitle("Distribution of Life Satisfaction in the USA") +
  theme_minimal(base_size = 8)  

As we have seen with this data in the past, the life satisfaction variable has a negative skew with values to the left of the mean pulling it downwards. Keep this in mind as it may be relevant later.

Okay great, we are ready to go. Let's first inspect this data and take a look at the means and standard deviations for each group. By now, you should know how to do this. As a reminder, we can use favstats() to inspect the descriptives.

favstats(~V23, V39, data=WVSimmigrant) # the second argument of favstats() is our grouping factor; it splits the satisfaction scores into descriptives for each of the two groups.
##   fav_stats min Q1 median Q3 max     mean       sd    n missing
## 1         1   1  6      8  9  10 7.238854 2.024677  314       0
## 2         0   1  7      8  9  10 7.474763 1.832015 1902       0

Remember that, after our recoding, those who mentioned they would not like to have immigrants as neighbors are coded 1; all else are coded 0.

We can see that the means are pretty close, but there is a slight difference in favour of the group who did not mention this. Let's quickly work out that mean difference in R.

7.474763-7.238854
## [1] 0.235909

The difference is 0.24 points on a 1-10 Likert scale, which doesn't seem a great deal but in the context of the SD, could be significant. The question therefore is whether this mean difference is statistically meaningful or large enough for us to consider the possibility of a zero difference in the population unlikely.

Setting up the linear model

Now let's set up our linear model to examine the statistical significance of that 0.24 mean difference. As always, we are using the lm() function to test a two-parameter linear model because we have two means to estimate. Let's build the model and save it in a new R object called immigrant.model. Hopefully, you know how to do this.

immigrant.model <- lm(V23 ~ 1 + V39, data=WVSimmigrant) 

summary(immigrant.model) 
## 
## Call:
## lm(formula = V23 ~ 1 + V39, data = WVSimmigrant)
## 
## Residuals:
## Satisfaction with your life 
##     Min      1Q  Median      3Q     Max 
## -6.4748 -0.4748  0.5252  1.5252  2.7611 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   7.2389     0.1050  68.947   <2e-16 ***
## V390          0.2359     0.1133   2.082   0.0375 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.86 on 2214 degrees of freedom
## Multiple R-squared:  0.001953,   Adjusted R-squared:  0.001503 
## F-statistic: 4.333 on 1 and 2214 DF,  p-value: 0.03749

Notice that the intercept part of our model - b0 - is the same as the mean score for those who would not like to have immigrants as neighbors - 7.24? Notice also that b1 is the mean difference, or the increment needed to get to those who did not mention that they would not like to have immigrants as neighbors (i.e., .24 + 7.24 = 7.48)? This shows that we now have two parameters, and a mean difference that is exactly as we calculated by hand.

Notice also that we have the same other statistics as in our previous models: the standard error and the t-ratio. From what you now know about these statistics, you should be able to work out whether the difference of .24 is statistically meaningful. We have a t of 2.08 and a p value of .04, so we can reject the null hypothesis.

We might write this up in a research paper as follows:

There was a statistically significant difference between Americans who reported that they would not like to have immigrants as neighbors and those who did not: t(2214) = 2.08, p < .05.

The implication from this analysis appears to be that Americans who reported they would not like to have immigrants as neighbors are less satisfied with their life than those who didn't.

Calculating the 95% confidence interval for the estimate using normal theory

To this point, I've been showing you how we use the estimated standard deviation of the sampling distribution - or the standard error - to gauge how far away our estimate is from a zero effect.

When we divide the estimate by its standard error we have a metric - a t ratio - that tells us how many standard errors our estimate is away from zero. If the t ratio is greater than 1.96, then we reject the null hypothesis and conclude that, assuming a normal sampling distribution, a zero mean difference is unlikely in the population.
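To make this concrete, here is a quick sketch of that calculation pulled from the coefficient table of our model (just an illustration; the "V390" row and the column names are those shown in the summary output above):

coefs <- coef(summary(immigrant.model)) # the coefficient table from the model summary
coefs["V390", "Estimate"] / coefs["V390", "Std. Error"] # the estimate divided by its standard error gives the t ratio of roughly 2.08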

We can make this assumption of a normal sampling distribution because of something called the central limit theorem (see the lecture notes for more information).

However, as we saw in the lecture, there is a more intuitive way of expressing the same idea using confidence intervals, which provide a range of values within which the population mean difference is likely to reside around the sample mean difference.

From the central limit theorem, we know that the sampling distribution follows a normal distribution with a mean difference that is equal to the population mean difference. But what we don’t know is the mean difference in the population.

The sample mean difference of .24, then, is the best estimate we have of the population mean difference. The true population mean difference is probably different from our estimate. But the key question is: how different?

If we treat the standard error like we treat the standard deviation, then we can classify 1.96 standard errors either side of the estimate as the range of likely estimates in the population - those we would expect to sample 95 times in 100.

This is the essence of the normal theory 95% confidence interval and it is calculated as follows:

95% confidence interval = estimate +/- (SE*1.96)

If the 95% confidence interval does not include zero, then a zero difference in the population is unlikely. It is just another, more intuitive, way of expressing the p value!
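As a small illustration of that formula (just a sketch, not part of the workshop code we will use), we could wrap it in a helper function and feed it any estimate and standard error:

normal_ci <- function(estimate, se, critical = 1.96) { # normal theory confidence interval
  c(lower = estimate - critical * se, upper = estimate + critical * se)
}
normal_ci(0.24, 0.11) # returns roughly .02 and .46, as we calculate by hand below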

Let's calculate the upper and lower limits of the confidence interval using the estimate, standard error, and critical t value of 1.96 (the value that cuts off the outer 5% of a normal distribution).

.24 - (.11*1.96) # .24 is the mean difference, .11 is the standard error, and 1.96 is our critical value based on the normal distribution. I've subtracted 1.96 standard errors from the estimate here.
## [1] 0.0244
.24 + (.11*1.96) # I've added 1.96 standard errors to the estimate here.
## [1] 0.4556

We have a normal theory 95% confidence interval that runs from .02 to .46. A more accurate way to do this is to request the confidence interval of the mean difference directly in R. To do this, we can use the confint() function.

confint(immigrant.model) # this just takes the linear model R object and returns the confidence interval for the two estimates.
##                  2.5 %    97.5 %
## (Intercept) 7.03296032 7.4447467
## V390        0.01367023 0.4581496

Okay, so we are a little off in our rounding, but the interpretation stays the same. If we randomly sampled the same data 100 times and built a confidence interval this way each time, the interval would contain the true population mean difference in 95 of those samples - here, that range runs from .01 to .46. That's all we can say!

However, if zero is not included in that range, then we can conclude that a zero mean difference is pretty unlikely.

Does this include zero? No! Therefore, the interpretation is that we can reject the null hypothesis. People who would not like to have immigrants as neighbors are still less satisfied with their life under this interpretation of statistical inference.

Partitioning variance

We've established that there is a mean difference using both the t ratio and the 95% confidence interval, but how much variance in life satisfaction is explained by our explanatory variable? As you know, we can answer this question using the supernova() function in R, which breaks down the variance in life satisfaction due to the model and error (i.e., variance left over once we've subtracted out the model).

supernova(immigrant.model)
##  Analysis of Variance Table (Type III SS)
##  Model: V23 ~ 1 + V39
##  
##                                SS   df     MS     F    PRE     p
##  ----- --------------- | -------- ---- ------ ----- ------ -----
##  Model (error reduced) |   14.999    1 14.999 4.333 0.0020 .0375
##  Error (from model)    | 7663.375 2214  3.461                   
##  ----- --------------- | -------- ---- ------ ----- ------ -----
##  Total (empty model)   | 7678.374 2215  3.467

Okay, this table puts the sample mean difference into more perspective. Yes, it is statistically significant, but the variance explained is tiny - PRE or R2 = 0.2%! This needs to be borne in mind when drawing conclusions. We have a statistically significant difference, but is it practically meaningful? Perhaps not.
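If you want to check that PRE figure for yourself, it is just the model sum of squares divided by the total sum of squares from the table above:

14.999 / 7678.374 # PRE = SS model / SS total = 0.0020, or about 0.2% of the variance in life satisfaction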

Calculating the 95% confidence interval for the estimate using bootstrapping

Now, let's just take stock. What we have just done is create the 95% confidence interval for the mean difference using normal theory. That is, we used the normal curve to calculate a probability of finding the population estimate, with probabilities of .05 and less considered "unlikely" (i.e., more than 1.96 SE away from the mean).

We then used that probability to yield a range of values within which the population mean difference would be observed in 95 samples out of 100. This is the normal theory 95% confidence interval.

Because of the central limit theorem, the normal curve turns out to be an excellent model for a sampling distribution of means, but only when the distributions of the variables and the estimates are normal.

As we saw in the lecture, when the distribution of estimates are not normal, we run the risk of encountering type I and type II errors.

So, rather than assume a normal sampling distribution of estimated values, a better approach is to generate a sampling distribution from the available data to arrive at a 95% confidence interval that requires no restrictive assumptions regarding the distribution of estimates.

Bootstrapping is one method that uses the characteristics of the original sample to produce many thousands of resamples with replacement, estimating the coefficients of interest each time. Typically, little additional benefit is gained beyond about 5000 bootstrap resamples, so I recommend you always request 5000 resamples.

In this way, bootstrapping permits us to generate a distribution of estimates that we can rank from low to high and use to cut off the 2.5 and 97.5 percentiles of the distribution (see lecture notes for more details on the mechanics of bootstrapping). Genius!
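To see those mechanics written out, here is a purely illustrative sketch of what bootstrapping does by hand (the object name boot_slopes is made up, and this is not the code we will use - the Boot() function below does the same job far more conveniently):

set.seed(123) # make this illustration reproducible
boot_slopes <- replicate(5000, { # repeat 5000 times...
  resample <- WVSimmigrant[sample(nrow(WVSimmigrant), replace = TRUE), ] # ...resample the rows with replacement
  coef(lm(V23 ~ 1 + V39, data = resample))["V390"] # ...and keep the slope (mean difference) from this resample
})
quantile(boot_slopes, c(.025, .975)) # cut off the 2.5th and 97.5th percentiles to form a 95% confidence interval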

So let's go ahead and generate a bootstrap sampling distribution of estimates using the Boot() function and save it as a new R object called immigrant.boot. Note that we are going to request 5000 resamples, so it might take a little time to run!

immigrant.boot <- Boot(immigrant.model, f=coef, R = 5000) # this function takes the linear model we built earlier and asks for the coefficients to be calculated in 5000 resamples with replacement from the original sample (R = 5000)
## Loading required namespace: boot
summary(immigrant.boot) # just like the linear model, the summary function outputs the summary of this analysis
## 
## Number of bootstrap replications R = 5000 
##             original   bootBias  bootSE bootMed
## (Intercept)  7.23885 -0.0016880 0.11383 7.24013
## V390         0.23591  0.0016333 0.12195 0.23752

What R just did there is take a resample of the original sample, calculate the model coefficients (i.e., b0 and b1), and repeat that process 5000 times! Before the advent of fast computers that would have taken weeks, but R does it in 20 seconds or so.

You can see that the output gives the sample estimates under the original column and the median of the 5000 bootstrap estimates in the bootMed column. The difference between the mean of the bootstrap estimates and the sample estimate is given in the bootBias column. Finally, the standard deviation of the 5000 bootstrap estimates - the bootstrap standard error - is given in the bootSE column.
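If you want to see where those columns come from, the 5000 replicate estimates are stored inside the boot object. As a sketch (assuming, as in the summary above, that the second column of the replicate matrix holds the V390 slope):

slopes <- immigrant.boot$t[, 2] # the 5000 bootstrap replicates of the V390 slope
mean(slopes) - immigrant.boot$t0[2] # bootBias: the mean of the replicates minus the original sample estimate
sd(slopes) # bootSE: the standard deviation of the replicates
median(slopes) # bootMed: the median of the replicates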

We can see that the sample mean difference and bootstrap mean difference are virtually the same. The bias is quite small.

Now, though, to the important part. With the bootstrap distribution of mean differences that we have just generated, we can construct a confidence interval for our estimate. We do this by simply finding the values of the mean difference in the bootstrap distribution that mark the 2.5 and 97.5 percentiles - the lower and upper bounds of the confidence interval.
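As a quick sketch of that percentile logic using the replicates stored in the boot object (again assuming the second column holds the V390 slope):

quantile(immigrant.boot$t[, 2], c(.025, .975)) # the 2.5th and 97.5th percentiles of the bootstrap mean differences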

And, just like normal theory, if zero is not included in the bootstrap 95% confidence interval then we can reject the null hypothesis and conclude that a zero mean difference in the population is unlikely.

Let's go ahead and call the bootstrap 95% confidence interval using the confint() function.

confint(immigrant.boot, level = .95, type = "norm") # this requests the confidence intervals from the immigrant.boot object. The level is .95 because we want the 95% confidence intervals. The type "norm" gives a normal-approximation bootstrap interval rather than a corrected (e.g., bca) interval.
## Bootstrap normal confidence intervals
## 
##                    2.5 %    97.5 %
## (Intercept)  7.017435511 7.4636474
## V390        -0.004738366 0.4732916

And well, what do you know! The bootstrap 95% confidence interval tells us something different to the normal theory 95% confidence interval. Here, we see that the range of values runs from about -.005 to .47 (your values might differ slightly from this; bootstrapping will produce slightly different estimates each time it resamples). Crucially, though, this range now includes zero! We cannot reject the null hypothesis.

To get a clue about why the normal theory 95% confidence interval is too liberal (i.e., rejecting the null when it is possibly true), take a look at the distribution of life satisfaction above. We can see a negative skew, meaning that a normal curve underestimates values below the mean and overestimates them above the mean. These differences are marginal, but when the estimate is ONLY JUST significant, as is the case here, they can cause erroneous conclusions. This is why bootstrapping should always be preferred: it makes no assumptions about the distribution of variables and estimates.

As I mentioned in the lecture, bootstrapping the estimates is now the conventional way to determine statistical significance. Therefore, the difference in life satisfaction between those who reported they would not like to have immigrants as neighbors and those who didn't mind immigrants as neighbors is not statistically significant.

We might write this up in a research paper like so:

An independent t-test was performed to test the difference in life satisfaction between Americans who reported they would not like to have immigrants as neighbors versus those who didn't mind immigrants as neighbors. The linear model indicated that not liking immigrants as neighbors explained 0.2% of the variance in life satisfaction (R = .04, R2 = .002, F(1, 2214) = 4.333, p < .05). Inspection of the model estimates indicated that there was a mean difference in life satisfaction between the groups of .24 in favor of those who did not mind immigrants as neighbors. The 95% confidence interval associated with this mean difference was obtained with 5,000 bootstrap resamples and included zero (b1 = .24, 95% CI = -.005, .47). As such, the difference in life satisfaction between Americans who reported they would not like to have immigrants as neighbors and those who didn't mind immigrants as neighbors is not statistically significant.

Exercise

Okay, let's now give you a chance to set up and test a linear regression model with 95% normal theory and bootstrap confidence intervals. In the WVS dataset there is another interesting potential predictor of life satisfaction - satisfaction with the financial situation of the household. This variable asks the question: "How satisfied are you with the current financial situation of your household?" and is measured on a 1-10 scale:

  1. Completely dissatisfied

. . .

  10. Completely satisfied

Some theories propose that economic satisfaction is an important prerequisite for life satisfaction. So let's find out if that's the case in the American data.

Before we delve into this topic, let's just clarify the research question and null hypothesis we are testing:

Research Question - Is there a linear relationship between Americans' satisfaction with financial situation of household and their life satisfaction?

Null Hypothesis - The relationship between satisfaction with financial situation of household and life satisfaction will be zero.

Task 1 - Select the variables

I looked in the codebook and satisfaction with the financial situation of the household is listed as V59. We know that the life satisfaction variable is V23. And we also need the country variable, V2A, to filter for the United States. So let's first select those variables out of the WVS dataframe and save them as a new dataframe called WVSfinance.

WVSfinance <- 
  WVS %>% 
  select(??) # SELECT THE VARIABLES OF INTEREST HERE
WVSfinance

Task 2 - Filter the variables

Just like other variables in this dataset, we need to edit the dataframe since the variables have values we can’t interpret like -5. Do that by using the filter() function. Also filter data from the United States in this code.

WVSfinance <- 
WVSfinance %>% 
filter(??) 
WVSfinance

Task 3 - Plot the distribution of the financial satisfaction

Write the code to draw a plot of the distribution of financial satisfaction.

WVSfinance %>% 
  ggplot(aes(x= ??)) + ## fill in the variable gap here for financial satisfaction
  geom_histogram(aes(y=..density..), binwidth=1, colour="black", fill="white") + 
  geom_density(adjust = 4, alpha = .2, fill = "antiquewhite3") +
  geom_vline(aes(xintercept=mean(??)), col="black", linetype="dashed", size=1) +
  geom_vline(aes(xintercept=median(??)), col="black", size=1) +
  ylab("??") + # add appropriate y lab 
  xlab("??") + # add appropriate x lab
  ggtitle("Distribution of financial staisfaction in the WVS") +
  theme_minimal(base_size = 8) 

Is the distribution of financial satisfaction approximately normal?

Task 4 - Calculate the correlation coefficient for the relationship between financial satisfaction and life satisfaction

Now use the rcorr() function to call the correlation coefficient - or Pearson's r - for the relationship between financial satisfaction and life satisfaction.

rcorr(as.matrix(WVSfinance[,c("??","V23")]), type="pearson") # fill in the ?? gap here for the explanatory variable

What is the correlation coefficient?

Is this a small, medium, or large correlation?

Is this correlation statistically significant?

Task 5 - Set up the linear model

Now let's set up our linear regression model to examine the intercept and slope of the best-fitting regression line using the lm() function. Don't forget to save the linear model in a new R object; I suggest finance.model.

finance.model <- lm(?? ~ 1 + ??, data=WVSfinance) # add ?? variables to the linear model code
summary(finance.model) # add the new R object to the summary function

What is the intercept?

What is the slope?

What is the interpretation of the slope?

What is the t-ratio?

What is the p value (remember that values written in scientific notation, such as 2e-16, are effectively zero)?

Do we accept or reject the null hypothesis?

Why?

Task 6 - The standardized estimate

In the below chunk, use the lm.beta() function to calcuate the standardized estimate for the slope of financial satisfaction and life satisfaction.

lm.beta(??) # add finance model to the lm.beta function

What is the standardized slope?

How does this compare to the unstandardized slope?

Task 7 - Partitioning variance

We've established whether there is a statistically significant relationship, but how much variance in life satisfaction is explained by the financial satisfaction model? We can answer this question using the supernova() function in R, which breaks down the variance in life satisfaction due to the model and error (i.e., variance left over once we've subtracted out the model).

supernova(??) # add finance model to the supernova function

What is the total variance or sum of squares for life satisfaction?

How much of that sum of squares is attributed to the model?

And how much to the error?

What is the percentage of variance explained by the model?

What is the F ratio?

What is the p value?

What do we conclude in relation to our research question?

Task 8 - Normal theory 95% confidence interval for the model estimates

Okay, so we have a significant positive relationship and we also know that the financial satisfaction model explains a significant portion of life satisfaction variance. Let's now calculate the normal theory 95% confidence interval for our model estimates using the confint() function.

confint(??) # add the finance model to the confint() function

What is the normal theory 95% confidence interval for the slope of finance satisfaction on life satisfaction?

What does this confidence interval tell us?

Can we accept the null hypothesis?

Why?

Task 9 - Generate the bootstrap sampling distribution

We can see that the normal theory confidence interval supports our conclusions from the linear model. This is not a surprise; p values and normal theory confidence intervals are equivalent. However, we also know that they may be biased if the variables or estimates are not normally distributed. The solution to this, of course, is to bootstrap the estimates!

So let's go ahead and generate a bootstrap sampling distribution of estimates using the Boot() function and save it as a new R object called finance.boot. Note that we are going to request 5000 resamples, so it might take a little time to run!

finance.boot <- Boot(??, f=coef, R = ??) 
summary(finance.boot) ## add the finance model and stipulate how many resamples you want to run (typically 5,000)

What is the bootstrap estimate (bootMed) for the slope of finance satisfaction on life satisfaction?

What is the bias between the bootstrap estimate and the original estimate?

Are these similar?

Task 10 - Bootstrap 95% confidence interval for the model estimates

Let's go ahead and call the bootstrap 95% confidence interval using the confint() function.

confint(??, level = .95, type = "norm") # add the bootstrap model you just created to this code

What is the bootstrap 95% confidence interval for the slope of finance satisfaction on life satisfaction?

What does this confidence interval tell us?

Can we accept the null hypothesis?

Why?

How might you write this analysis up in a paper?
