The average grumpiness score for each of the 44 photos can be found in photo_long.csv.

  1. Using lm, estimate the difference between the scores of happy and grumpy photos.

  2. Simulate data according to this model assuming the parameter estimates to be the true values.

  3. Hint

    There are a few ‘clever’ ways to do this, but the most obvious is to use a for loop and call the function rnorm. In R it is usually much more efficient to create an object of the appropriate size and fill it in, rather than to grow the object at each iteration. For example, if we wanted to iterate from 1 to 10 and store the squares of these values, then:

    res<-matrix(NA,10,1)
    
    for(i in 1:10){
     res[i]<-i^2
    }

    is more efficient than:

    res<-c()
    
    for(i in 1:10){
     res<-c(res,i^2)
    }

    It’s also easier to read, I think.
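
    Incidentally, one of the ‘clever’ ways alluded to above is to vectorise and skip the loop entirely; the whole computation is then a single expression:

    res<-(1:10)^2
    # vectorised equivalent of the loop above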

    If you need to extract the coefficients from a model you can use the function coef. Getting the residual standard deviation is a bit more painful: it can be found in the model summary object, in the list element sigma.
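
    For example, using a throwaway model fitted to R’s built-in cars data (purely for illustration):

    fit<-lm(dist~speed, data=cars)
    # a placeholder model
    
    coef(fit)
    # named vector of coefficient estimates
    
    summary(fit)$sigma
    # residual standard deviation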


    Solution

    Simulate Function

    Many model objects in R come with a simulate method that simulates data from the model using the estimated parameter values as the true values. However, it can be instructive to roll your own, especially for a model as simple as this one. Let’s call the function my_simulate. The first argument, nsim, specifies how many replicate data sets to simulate. The second argument, group, is a vector (either numeric or a factor) specifying which group each observation belongs to. mu and sd specify the means and standard deviations of the groups, and each must be equal in length to the number of groups:

    my_simulate<-function(nsim, group, mu, sd){
      
      n<-length(group) 
      # number of observations
      
      ngroup<-length(unique(group))
      # number of groups
      
      if(!is.numeric(group) & !is.factor(group)){stop('group must be a factor or numeric')}
      
      group<-as.numeric(group)
      # convert group labels to indices 1..ngroup for indexing mu and sd
      
      if(length(mu)!=ngroup){stop('mu is the wrong length')}
      if(length(sd)!=ngroup){stop('sd is the wrong length')}
      
      y<-matrix(NA, n, nsim) 
      # matrix for holding nsim simulated data sets
      
      for(i in 1:nsim){
         y[,i]<-rnorm(n, mu[group], sd[group])
      }  
    
      return(y)
    }
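
    As a quick sanity check (illustrative values only), here are two simulated data sets for two groups of three observations with means 0 and 10:

    my_simulate(2, factor(rep(c("grumpy","happy"), each=3)), mu=c(0,10), sd=c(1,1))
    # returns a 6 x 2 matrix; rows 1-3 should be near 0 and rows 4-6 near 10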

    Fit the model

    model<-lm(y~type, data=photo_long)

    Get the estimated means for the two groups. Note that for the happy photos the two coefficients have to be added, because the model is parameterised as an intercept (the grumpy mean) plus the happy-grumpy difference.

    mu<-c(coef(model)[1], sum(coef(model)))

    We have assumed that the residual variance is the same for all observations, and so we can extract the estimate of the residual standard deviation and give both sets of photos this value.

    sd<-rep(summary(model)$sigma,2) 

    Simulate data

    nsim<-100
    # number of data sets to simulate
    
    sim_y<-my_simulate(nsim, as.factor(photo_long$type), mu, sd)
    # simulate data


  4. Refit the model to the simulated data and compare the distribution of the estimated difference between the two types of photo with the sampling distribution you expect.

  5. Hint

    I would compare the two distributions graphically. You could overlay the theoretical probability density function on a histogram of the simulated values, but this is always a bit fiddly. I would simply simulate values from the theoretical distribution (i.e. use rt, or rt.scaled from the metRology package) and then compare using qqplot.
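
    For reference, a minimal sketch of the equivalence (est and se below are placeholders for an estimate and its standard error, and the metRology package must be installed):

    est<-1; se<-0.5
    # placeholder values
    
    draws_a<-rt(1000, df=42)*se+est
    # scaled, mean-shifted t by hand
    
    draws_b<-metRology::rt.scaled(1000, df=42, mean=est, sd=se)
    # the same distribution via metRology
    
    qqplot(draws_a, draws_b); abline(0,1)
    # points should lie near the 1:1 line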


    Solution

    Refit models to simulated data and store the estimate of the difference

    sim_est<-matrix(NA, nsim, 1)
    # matrix for holding the estimates of the difference
    
    for(i in 1:nsim){
    
      model_sim<-lm(sim_y[,i]~type, data=photo_long)
      # fit model to the ith simulated data set
    
      sim_est[i]<-coef(model_sim)[2]
      # save the estimate of the difference
    }  

    Compare actual and theoretical distributions graphically

    est.d<-coef(model)[2]
    # estimate of the difference 
    se.d<-coef(summary(model))[2,"Std. Error"]
    # standard error of the difference
    
    expected_samp<-rt(nsim, df=44-2)*se.d+est.d
    # sampling distribution of the estimate is expected to be a scaled, mean-shifted t-distribution. You could also use the rt.scaled function from metRology.
    
    qqplot(expected_samp, sim_est)
    abline(0,1)


  6. Simulate data under the null model (the photos have the same mean) and generate the null distribution of the t-value and likelihood ratio for testing the difference.

  7. Hint

    When you pass a model object to the function summary, a coefficient table is printed that includes not only the estimates but also their standard errors, t-values and p-values. To access this information you can use the function coef on the model summary object. The likelihood of a model can be extracted by passing the model object to the function logLik.
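
    As a concrete sketch, using the model fitted in the earlier solution:

    coef(summary(model))
    # table of estimates, standard errors, t-values and p-values
    
    coef(summary(model))[2, "t value"]
    # t-value for the difference
    
    logLik(model)
    # log-likelihood of the fitted model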


    Solution

    Fit the null model

    model_null<-lm(y~1, data=photo_long)
    
    
    mu_null<-rep(coef(model_null)[1], 2)
    sd_null<-rep(summary(model_null)$sigma,2) 
    # estimate of mean and standard deviation for the two types under the null 
    
    sim_null<-my_simulate(nsim, as.factor(photo_long$type), mu_null, sd_null)
    # simulate data under the null

    Fit the alternate model to data simulated under the null to obtain the null distribution of the t-value. Fit both the null and alternate models to the same simulated data in order to calculate the likelihood ratio statistic under the null.

    sim_val<-matrix(NA, nsim, 2, dimnames=list(1:nsim, c("t_value", "LR2")))
    # matrix for holding t-value and likelihood-ratio
    
    for(i in 1:nsim){
    
      model_sim_null<-lm(sim_null[,i]~1, data=photo_long)
      # null model on data generated under the null
    
      model_sim_alt<-lm(sim_null[,i]~type, data=photo_long)
      # alternate model on data generated under the null
    
      sim_val[i,"t_value"]<-coef(summary(model_sim_alt))[2,"t value"]
      # store t-value of the difference on data generated under the null model
    
      sim_val[i,"LR2"]<-2*(logLik(model_sim_alt)-logLik(model_sim_null))
      # store twice the difference in log-likelihood between null and alternate model fitted to null data
    }  


  8. Compare these distributions to what you expect.

  9. Solution

    Simulate t-value and likelihood-ratio statistic from their theoretical distributions

    expected_t<-rt(nsim, df=44-2)
    # under the null, the t-value is expected to follow a t-distribution with 44 - 2 = 42 degrees of freedom
    
    expected_LR2<-rchisq(nsim,1)
    # twice the difference in log-likelihood between the null and alternate models is expected to be chi-squared with 1 degree of freedom

    Compare actual and theoretical distributions graphically

    par(mfrow=c(1,2))
    qqplot(sim_val[,"t_value"], expected_t)
    abline(0,1)
    qqplot(sim_val[,"LR2"], expected_LR2)
    abline(0,1)

    Pretty good!


  10. Think of a way of testing whether the observations associated with the different categories also have different residual variances.

  11. Hint

    This is a tricky question. A likelihood ratio test is the most obvious way to do it, but a little thought needs to go into how the likelihood can be obtained when both the means and the variances differ between the groups. It can be done using lm alone.


    Solution

    The scores for the two types of photo have no shared parameters if separate means and variances are estimated for each photo type. Consequently two models can be fitted, and their likelihoods can be multiplied (or their log-likelihoods summed) to get the total likelihood. We can compare this likelihood to that of the model where only the means are allowed to differ.

    model_grumpy<-lm(y~1, data=subset(photo_long, type=="grumpy"))
    model_happy<-lm(y~1, data=subset(photo_long, type=="happy"))
    # separate models for the two photo types
    
    loglik_dv<-logLik(model_grumpy)+logLik(model_happy)
    # sum the log-likelihoods to get the total log-likelihood
    
    loglik_sv<-logLik(model)
    # log-likelihood when only the means differ (common variance)
    
    LR2<-2*(loglik_dv-loglik_sv)
    # twice the difference in log-likelihood between the two models
    
    1-pchisq(as.numeric(LR2), df=1)
    ## [1] 0.01155496
    # p-value for the likelihood ratio test

    So the two photo types may well have different residual variances.
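
    As an aside (not part of the exercise), base R’s var.test performs an F test for the equality of two variances under normality, which provides a quick cross-check:

    var.test(y~type, data=photo_long)
    # F test comparing the variances of the two photo types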


The implementation above was long-winded but hopefully made the logic of the ideas more concrete. A shorter way of obtaining the sampling distributions is to use the simulate function to generate replicate data sets:

sim_null<-simulate(model_null, nsim=100)  # simulate from the null
sim_alt<-simulate(model, nsim=100)  # simulate from the alternate model

fit_alt<-function(x){
  model_alt<-lm(x~type, data=photo_long)
  c(coef(summary(model_alt))["typehappy",], logLik=logLik(model_alt))
}  # fit the alternate model to x and extract the estimate, standard error, t-value, p-value and log-likelihood

null_llik<-apply(sim_null, 2, function(x){logLik(lm(x~1))})
# apply null model to data generated under the null to get log-lik

null_dist<-t(apply(sim_null, 2, fit_alt))
# apply alternate model to data generated under the null

sampling_dist<-t(apply(sim_alt, 2, fit_alt))
# apply alternate model to data generated under the alternate model

par(mfrow=c(2,2))
qqplot(sampling_dist[,"Estimate"], expected_samp)
abline(0,1)
qqplot(null_dist[,"t value"], expected_t)
abline(0,1)
qqplot(2*(null_dist[,"logLik"]-null_llik), expected_LR2)
abline(0,1)
