The average grumpiness score for each of the 44 photos can be found in photo_long.csv.
lm
, estimate the difference between the scores of happy and grumpy photos.
Hint
There’s a few ‘clever’ ways to do this, but the most obvious way to do it is using a for
loop and calling the function rnorm
. In R it’s usually much more efficient to create an object of the appropriate size and fill it up, rather than to increase the size of the object each iteration. For example, if we wanted to iterate from 1 to 10 and store the squares of these values, then:
res<-matrix(NA,10,1)
for(i in 1:10){
res[i]<-i^2
}
is more efficient than:
res<-c()
for(i in 1:10){
res<-c(res,i^2)
}
It’s also easier to read I think.
If you need to extract the coefficient from a model you can use the function coef
. Getting the residual standard deviation is a bit more painful: it can be found in the model summary
object in the list element sigma
.
Solution
Simulate Function
Many model objects in R come with a method simulate
which simulates data from the model using the estimated parameter values as the true values. However, it can be instructive to roll your own, especially for such a simple model as this. Let’s call the function my_simulate
. The first argument nsim
specifies how many replicate data-sets I would like to simulate. The second argument, group
, is a vector (either numeric or a factor) specifying which group the observations belong. mu
and sd
specify the means and standard deviations of the groups and must be equal in length to the number of levels in group
:
my_simulate<-function(nsim, group, mu, sd){
n<-length(group)
# number of observations
ngroup<-length(unique(group))
# number of groups
if(!is.numeric(group) & !is.factor(group)){stop('group must be a factor or numeric')}
group<-as.numeric(group)
if(length(mu)!=ngroup){stop('mu is the wrong length')}
if(length(sd)!=ngroup){stop('sd is the wrong length')}
y<-matrix(NA, n, nsim)
# matrix for holding nsim simulate data sets
for(i in 1:nsim){
y[,i]<-rnorm(n, mu[group], sd[group])
}
return(y)
}
Fit the model
model<-lm(y~type, data=photo_long)
Get the estimated means for the two groups. Note that for happy photos the two coefficients have to be added because they are the intercept and the difference.
mu<-c(coef(model)[1], sum(coef(model)))
We have assumed that the residual variance is the same for all observations, and so we can extract the estimate of the residual standard deviation and give both sets of photos this value.
sd<-rep(summary(model)$sigma,2)
Simulate data
nsim<-100
# number of data sets to simulate
sim_y<-my_simulate(nsim, as.factor(photo_long$type), mu, sd)
# simulate data
Hint
I would compare the two distributions graphically. You could overlay the theoretical probability density function on a histogram of the simulated values, but this is always a bit fiddly. I would simply simulate values from the theoretical distribution (i.e. use rt
, or rt.scaled
from the metRologly
library) and then compare using qqplot
.
Solution
Refit models to simulated data and store the estimate of the difference
sim_est<-matrix(NA, nsim, 1)
# matrix for holding estimate of the difference
for(i in 1:nsim){
model_sim<-lm(sim_y[,i]~type, data=photo_long)
# fit model to the ith simulated data set
sim_est[i]<-coef(model_sim)[2]
# save the estimate of the difference
}
Compare actual and theoretical distributions graphically
est.d<-coef(model)[2]
# estimate of the difference
se.d<-coef(summary(model))[2,"Std. Error"]
# standard error of the difference
expected_samp<-rt(nsim, df=44-2)*se.d+est.d
# sampling distribution of the estimate is expected to be a scaled mean shifted t-distribtion. You could also use rt.scaled function from metRology.
qqplot(expected_samp, sim_est)
abline(0,1)
Hint
When you pass a model object to the function summary
a coefficient table is printed that not only includes the estimates, but also their standard errors, t-values and p-values. To access this information you can use the function coef
on the model summary object. The likelihood of a model can be extracted by passing the model object to the function logLik
.
Solution
Fit the null model
model_null<-lm(y~1, data=photo_long)
mu_null<-rep(coef(model_null)[1], 2)
sd_null<-rep(summary(model_null)$sigma,2)
# estimate of mean and standard deviation for the two types under the null
sim_null<-my_simulate(nsim, as.factor(photo_long$type), mu_null, sd_null)
# simulate data under the null
Fit the null model to data simulated under the null to obtain the distribution of the t-value under the null. Fit the null and alternate model to data simulated under the null in order to calculate the likelihood ratio statistic under the null.
sim_val<-matrix(NA, nsim, 2, dimnames=list(1:nsim, c("t_value", "LR2")))
# matrix for holding t-value and likelihood-ratio
for(i in 1:nsim){
model_sim_null<-lm(sim_null[,i]~1, data=photo_long)
# null model on data generated under the null
model_sim_alt<-lm(sim_null[,i]~type, data=photo_long)
# alternate model on data generated under the null
sim_val[i,"t_value"]<-coef(summary(model_sim_alt))[2,"t value"]
# store t-value of the difference on data generated under the null model
sim_val[i,"LR2"]<-2*(logLik(model_sim_alt)-logLik(model_sim_null))
# store twice the difference in log-likelihood between null and alternate model fitted to null data
}
Solution
Simulate t-value and likelihood-ratio statistic from their theoretical distributions
expected_t<-rt(nsim, df=44-2)
# distribution of the t-value under the null model is expected to be a t-distribution
expected_LR2<-rchisq(nsim,1)
# distribution of twice the difference in log-likelihood between null and alternate model is expected to be chi-squared
Compare actual and theoretical distributions graphically
par(mfrow=c(1,2))
qqplot(sim_val[,"t_value"], expected_t)
abline(0,1)
qqplot(sim_val[,"LR2"], expected_LR2)
abline(0,1)
Pretty good!
Hint
This is a tricky question. A likelihood ratio test is the most obvious way to do it, but a little thought needs to go into how the likelihood could be obtained when the means and the variances differ between the groups. It can be done using lm
only.
Solution
The scores for the two types of photo have no shared parameters if separate means and variances are estimated for each photo type. Consequently two models can be fitted and their likelihoods can be multiplied (or their log-likelihoods summed) to get the total likelihood. We can compare this likelihood to the model where only the means are allowed to differ.
model_grumpy<-lm(y~1, data=subset(photo_long, type=="grumpy"))
model_happy<-lm(y~1, data=subset(photo_long, type=="happy"))
# separate models for the two photo types
loglik_dv<-logLik(model_grumpy)+logLik(model_happy)
# combine the likelihoods to get the total likelihood
loglik_sv<-logLik(model)
# likelihood when only the mean differs
LR2<-2*(loglik_dv-loglik_sv)
# twice the difference in log-likelihood between the two models
1-pchisq(as.numeric(LR2), df=1)
## [1] 0.01155496
# likelihood ratio test
They may well have different variances.
The implementation above was long-winded but hopefully made the logic of the ideas more concrete. A shorter way of obtaining the sampling distributions is to use the simulation function to simulate replicate data-sets
sim_null<-simulate(model_null, nsim=100) # simulate from the null
sim_alt<-simulate(model, nsim=100) # simulate from the alternate model
fit_alt<-function(x){
model_alt<-lm(x~type, data=photo_long)
c(coef(summary(model_alt))["typehappy",], logLik=logLik(model_alt))
} # a function for fitting the alternate model to x and extracting coefficients, t-values and the log-likelihood
null_llik<-apply(sim_null, 2, function(x){logLik(lm(x~1))})
# apply null model to data generated under the null to get log-lik
null_dist<-t(apply(sim_null, 2, fit_m2))
# apply alternate model to data generated under the null
sampling_dist<-t(apply(sim_alt, 2, fit_m2))
# apply alternate model to data generated under the alternate model
par(mfrow=c(2,2))
qqplot(sampling_dist[,"Estimate"], expected_samp)
abline(0,1)
qqplot(null_dist[,"t value"], expected_t)
abline(0,1)
qqplot(2*(null_dist[,"logLik"]-null_llik), expected_LR2)
abline(0,1)