Day 1 Practical: Post-graduate Statistics Course

The average grumpiness score for each of the 44 photos can be found in photo_long.csv.

Using lm, estimate the difference between the scores of happy and grumpy photos.

Solution

Fit a model with type as a factor.

model<-lm(y~type, data=photo_long)

The typegrump coefficient is the difference between the grumpy and happy photos.

Simulate data according to this model assuming the parameter estimates to be the true values.

Hint

There’s a few ‘clever’ ways to do this, but the most obvious way to do it is using a for loop and calling the function rnorm. In R it’s usually much more efficient to create an object of the appropriate size and fill it up, rather than to increase the size of the object each iteration. For example, if we wanted to iterate from 1 to 10 and store the squares of these values, then:

res<-matrix(NA,10,1)

for(i in 1:10){
 res[i]<-i^2
}

is more efficient than:

res<-c()

for(i in 1:10){
 res<-c(res,i^2)
}

It’s also easier to read I think.

If you need to extract the coefficient from a model you can use the function coef. If you want to extract the residual standard deviation you can use the function sigma.

Solution

Simulate Function

Many model objects in R come with a method simulate which simulates data from the model using the estimated parameter values as the true values. However, it can be instructive to roll your own, especially for such a simple model as this. Let’s call the function my_simulate. The first argument nsim specifies how many replicate data-sets I would like to simulate. The second argument, group, is a vector (either numeric or a factor) specifying which group the observations belong. mu and sd specify the means and standard deviations of the groups and must be equal in length to the number of levels in group:

my_simulate<-function(nsim, group, mu, sd){
  
  n<-length(group) 
  # number of observations
  
  ngroup<-length(unique(group))
  # number of groups
  
  if(!is.numeric(group) & !is.factor(group)){stop('group must be a factor or numeric')}
  
  group<-as.numeric(group)
  
  if(length(mu)!=ngroup){stop('mu is the wrong length')}
  if(length(sd)!=ngroup){stop('sd is the wrong length')}
  
  y<-matrix(NA, n, nsim) 
  # matrix for holding nsim simulate data sets
  
  for(i in 1:nsim){
     y[,i]<-rnorm(n, mu[group], sd[group])
  }  

  return(y)
}

From the model fitted above get the estimated means for the two groups. Note that for happy photos the two coefficients have to be added because they are the intercept and the difference.

mu<-c(coef(model)[1], sum(coef(model)))

We have assumed that the residual variance is the same for all observations, and so we can extract the estimate of the residual standard deviation and give both sets of photos this value.

sd<-rep(sigma(model),2)

Simulate data

nsim<-100
# number of data sets to simulate

sim_y<-my_simulate(nsim, as.factor(photo_long$type), mu, sd)
# simulate data

Refit the model to the simulated data and compare the distribution of the estimated difference between the two types of photo with the sampling distribution you expect.

Hint

I would compare the two distributions graphically. You could overlay the theoretical probability density function on a histogram of the simulated values, but this is always a bit fiddly. I would simply simulate values from the theoretical distribution (i.e. use rnorm) and then compare using qqplot.

Solution

Refit models to simulated data and store the estimate of the difference

sim_est<-matrix(NA, nsim, 1)
# matrix for holding estimate of the difference

for(i in 1:nsim){

  model_sim<-lm(sim_y[,i]~type, data=photo_long)
  # fit model to the ith simulated data set

  sim_est[i]<-coef(model_sim)[2]
  # save the estimate of the difference
}

Compare actual and theoretical distributions graphically

est.d<-coef(model)[2]
# estimate of the difference 
se.d<-coef(summary(model))[2,"Std. Error"]
# standard error of the difference

expected_samp<-rnorm(nsim, est.d, se.d)
# sampling distribution of the estimate is expected to be normal since the residual standard deviation is assumed known (and set to sigma(model)). If the residual standard deviation is not known, but estimated, then the sampling distribution is a scaled mean shifted t-distribution. 

qqplot(expected_samp, sim_est)
abline(0,1)

Simulate data under the null model (the photos have the same mean) and generate the null distribution of the t-value and likelihood ratio for testing the difference.

Hint

When you pass a model object to the function summary a coefficient table is printed that not only includes the estimates, but also their standard errors, t-values and p-values. To access this information you can use the function coef on the model summary object. The likelihood of a model can be extracted by passing the model object to the function logLik.

Solution

Fit the null model

model_null<-lm(y~1, data=photo_long)


mu_null<-rep(coef(model_null)[1], 2)
sd_null<-rep(sigma(model_null),2) 
# estimate of mean and standard deviation for the two types under the null 

sim_null<-my_simulate(nsim, as.factor(photo_long$type), mu_null, sd_null)
# simulate data under the null

Fit the null model to data simulated under the null to obtain the distribution of the t-value under the null. Fit the null and alternate model to data simulated under the null in order to calculate the likelihood ratio statistic under the null.

sim_val<-matrix(NA, nsim, 2, dimnames=list(1:nsim, c("t_value", "LR2")))
# matrix for holding t-value and likelihood-ratio

for(i in 1:nsim){

  model_sim_null<-lm(sim_null[,i]~1, data=photo_long)
  # null model on data generated under the null

  model_sim_alt<-lm(sim_null[,i]~type, data=photo_long)
  # alternate model on data generated under the null

  sim_val[i,"t_value"]<-coef(summary(model_sim_alt))[2,"t value"]
  # store t-value of the difference on data generated under the null model

  sim_val[i,"LR2"]<-2*(logLik(model_sim_alt)-logLik(model_sim_null))
  # store twice the difference in log-likelihood between null and alternate model fitted to null data
}

Comparing these distributions to what you expect.

Solution

Simulate t-value and likelihood-ratio statistic from their theoretical distributions

expected_t<-rt(nsim, df=44-2)
# distribution of the t-value under the null model is expected to be a t-distribution

expected_LR2<-rchisq(nsim,1)
# distribution of twice the difference in log-likelihood between null and alternate model is expected to be chi-squared

Compare actual and theoretical distributions graphically

par(mfrow=c(1,2))
qqplot(sim_val[,"t_value"], expected_t)
abline(0,1)
qqplot(sim_val[,"LR2"], expected_LR2)
abline(0,1)

Pretty good!

Think of a way of testing whether the the observations associated with the different categories also have different residual variances.

Hint

This is a tricky question. A likelihood ratio test is the most obvious way to do it, but a little thought needs to go into how the likelihood could be obtained when the means and the variances differ between the groups. It can be done using lm only.

Solution

The scores for the two types of photo have no shared parameters if separate means and variances are estimated for each photo type. Consequently two models can be fitted and their likelihoods can be multiplied (or their log-likelihoods summed) to get the total likelihood. We can compare this likelihood to the model where only the means are allowed to differ.

model_grumpy<-lm(y~1, data=subset(photo_long, type=="grumpy"))
model_happy<-lm(y~1, data=subset(photo_long, type=="happy"))
# separate models for the two photo types

loglik_dv<-logLik(model_grumpy)+logLik(model_happy)
# combine the likelihoods to get the total likelihood

loglik_sv<-logLik(model)
# likelihood when only the mean differs

LR2<-2*(loglik_dv-loglik_sv)
# twice the difference in log-likelihood between the two models

1-pchisq(as.numeric(LR2), df=1)

## [1] 0.01155496

# likelihood ratio test

They may well have different variances.

The implementation above was long-winded but hopefully made the logic of the ideas more concrete. A shorter way of obtaining the sampling distributions is to use the simulation function to simulate replicate data-sets

sim_null<-simulate(model_null, nsim=100)  # simulate from the null
sim_alt<-simulate(model, nsim=100)  # simulate from the alternate model

fit_alt<-function(x){
  model_alt<-lm(x~type, data=photo_long)
  c(coef(summary(model_alt))["typehappy",], logLik=logLik(model_alt))
}  # a function for fitting the alternate model to x and extracting coefficients, t-values and the log-likelihood

null_llik<-apply(sim_null, 2, function(x){logLik(lm(x~1))})
# apply null model to data generated under the null to get log-lik

null_dist<-t(apply(sim_null, 2, fit_alt))
# apply alternate model to data generated under the null

sampling_dist<-t(apply(sim_alt, 2, fit_alt))
# apply alternate model to data generated under the alternate model

par(mfrow=c(2,2))
qqplot(sampling_dist[,"Estimate"], expected_samp)
abline(0,1)
qqplot(null_dist[,"t value"], expected_t)
abline(0,1)
qqplot(2*(null_dist[,"logLik"]-null_llik), expected_LR2)
abline(0,1)

Day 1 Practical: Post-graduate Statistics Course

Jarrod Hadfield