The average grumpiness score for each of the 44 photos can be found in photo_long.csv.
lm
, estimate the difference between the scores of
happy and grumpy photos.
Solution
Fit a model with type
as a factor.
model<-lm(y~type, data=photo_long)
The typegrump
coefficient is the difference
between the grumpy and happy photos.
Hint
There’s a few ‘clever’ ways to do this, but the most obvious way to
do it is using a for
loop and calling the function
rnorm
. In R it’s usually much more efficient to create an
object of the appropriate size and fill it up, rather than to increase
the size of the object each iteration. For example, if we wanted to
iterate from 1 to 10 and store the squares of these values, then:
res<-matrix(NA,10,1)
for(i in 1:10){
res[i]<-i^2
}
is more efficient than:
res<-c()
for(i in 1:10){
res<-c(res,i^2)
}
It’s also easier to read I think.
If you need to extract the coefficient from a model you can use the
function coef
. If you want to extract the residual standard
deviation you can use the function sigma
.
Solution
Simulate Function
Many model objects in R come with a method simulate
which simulates data from the model using the estimated parameter values
as the true values. However, it can be instructive to roll your own,
especially for such a simple model as this. Let’s call the function
my_simulate
. The first argument nsim
specifies
how many replicate data-sets I would like to simulate. The second
argument, group
, is a vector (either numeric or a factor)
specifying which group the observations belong. mu
and
sd
specify the means and standard deviations of the groups
and must be equal in length to the number of levels in
group
:
my_simulate<-function(nsim, group, mu, sd){
n<-length(group)
# number of observations
ngroup<-length(unique(group))
# number of groups
if(!is.numeric(group) & !is.factor(group)){stop('group must be a factor or numeric')}
group<-as.numeric(group)
if(length(mu)!=ngroup){stop('mu is the wrong length')}
if(length(sd)!=ngroup){stop('sd is the wrong length')}
y<-matrix(NA, n, nsim)
# matrix for holding nsim simulate data sets
for(i in 1:nsim){
y[,i]<-rnorm(n, mu[group], sd[group])
}
return(y)
}
From the model fitted above get the estimated means for the two groups. Note that for happy photos the two coefficients have to be added because they are the intercept and the difference.
mu<-c(coef(model)[1], sum(coef(model)))
We have assumed that the residual variance is the same for all observations, and so we can extract the estimate of the residual standard deviation and give both sets of photos this value.
sd<-rep(sigma(model),2)
Simulate data
nsim<-100
# number of data sets to simulate
sim_y<-my_simulate(nsim, as.factor(photo_long$type), mu, sd)
# simulate data
Hint
I would compare the two distributions graphically. You could overlay
the theoretical probability density function on a histogram of the
simulated values, but this is always a bit fiddly. I would simply
simulate values from the theoretical distribution (i.e. use
rnorm
) and then compare using qqplot
.
Solution
Refit models to simulated data and store the estimate of the difference
sim_est<-matrix(NA, nsim, 1)
# matrix for holding estimate of the difference
for(i in 1:nsim){
model_sim<-lm(sim_y[,i]~type, data=photo_long)
# fit model to the ith simulated data set
sim_est[i]<-coef(model_sim)[2]
# save the estimate of the difference
}
Compare actual and theoretical distributions graphically
est.d<-coef(model)[2]
# estimate of the difference
se.d<-coef(summary(model))[2,"Std. Error"]
# standard error of the difference
expected_samp<-rnorm(nsim, est.d, se.d)
# sampling distribution of the estimate is expected to be normal since the residual standard deviation is assumed known (and set to sigma(model)). If the residual standard deviation is not known, but estimated, then the sampling distribution is a scaled mean shifted t-distribution.
qqplot(expected_samp, sim_est)
abline(0,1)
Hint
When you pass a model object to the function summary
a
coefficient table is printed that not only includes the estimates, but
also their standard errors, t-values and p-values. To access this
information you can use the function coef
on the model
summary object. The likelihood of a model can be extracted by passing
the model object to the function logLik
.
Solution
Fit the null model
model_null<-lm(y~1, data=photo_long)
mu_null<-rep(coef(model_null)[1], 2)
sd_null<-rep(sigma(model_null),2)
# estimate of mean and standard deviation for the two types under the null
sim_null<-my_simulate(nsim, as.factor(photo_long$type), mu_null, sd_null)
# simulate data under the null
Fit the null model to data simulated under the null to obtain the distribution of the t-value under the null. Fit the null and alternate model to data simulated under the null in order to calculate the likelihood ratio statistic under the null.
sim_val<-matrix(NA, nsim, 2, dimnames=list(1:nsim, c("t_value", "LR2")))
# matrix for holding t-value and likelihood-ratio
for(i in 1:nsim){
model_sim_null<-lm(sim_null[,i]~1, data=photo_long)
# null model on data generated under the null
model_sim_alt<-lm(sim_null[,i]~type, data=photo_long)
# alternate model on data generated under the null
sim_val[i,"t_value"]<-coef(summary(model_sim_alt))[2,"t value"]
# store t-value of the difference on data generated under the null model
sim_val[i,"LR2"]<-2*(logLik(model_sim_alt)-logLik(model_sim_null))
# store twice the difference in log-likelihood between null and alternate model fitted to null data
}
Solution
Simulate t-value and likelihood-ratio statistic from their theoretical distributions
expected_t<-rt(nsim, df=44-2)
# distribution of the t-value under the null model is expected to be a t-distribution
expected_LR2<-rchisq(nsim,1)
# distribution of twice the difference in log-likelihood between null and alternate model is expected to be chi-squared
Compare actual and theoretical distributions graphically
par(mfrow=c(1,2))
qqplot(sim_val[,"t_value"], expected_t)
abline(0,1)
qqplot(sim_val[,"LR2"], expected_LR2)
abline(0,1)
Pretty good!
Hint
This is a tricky question. A likelihood ratio test is the most
obvious way to do it, but a little thought needs to go into how the
likelihood could be obtained when the means and the variances
differ between the groups. It can be done using lm
only.
Solution
The scores for the two types of photo have no shared parameters if separate means and variances are estimated for each photo type. Consequently two models can be fitted and their likelihoods can be multiplied (or their log-likelihoods summed) to get the total likelihood. We can compare this likelihood to the model where only the means are allowed to differ.
model_grumpy<-lm(y~1, data=subset(photo_long, type=="grumpy"))
model_happy<-lm(y~1, data=subset(photo_long, type=="happy"))
# separate models for the two photo types
loglik_dv<-logLik(model_grumpy)+logLik(model_happy)
# combine the likelihoods to get the total likelihood
loglik_sv<-logLik(model)
# likelihood when only the mean differs
LR2<-2*(loglik_dv-loglik_sv)
# twice the difference in log-likelihood between the two models
1-pchisq(as.numeric(LR2), df=1)
## [1] 0.01155496
# likelihood ratio test
They may well have different variances.
The implementation above was long-winded but hopefully made the logic of the ideas more concrete. A shorter way of obtaining the sampling distributions is to use the simulation function to simulate replicate data-sets
sim_null<-simulate(model_null, nsim=100) # simulate from the null
sim_alt<-simulate(model, nsim=100) # simulate from the alternate model
fit_alt<-function(x){
model_alt<-lm(x~type, data=photo_long)
c(coef(summary(model_alt))["typehappy",], logLik=logLik(model_alt))
} # a function for fitting the alternate model to x and extracting coefficients, t-values and the log-likelihood
null_llik<-apply(sim_null, 2, function(x){logLik(lm(x~1))})
# apply null model to data generated under the null to get log-lik
null_dist<-t(apply(sim_null, 2, fit_alt))
# apply alternate model to data generated under the null
sampling_dist<-t(apply(sim_alt, 2, fit_alt))
# apply alternate model to data generated under the alternate model
par(mfrow=c(2,2))
qqplot(sampling_dist[,"Estimate"], expected_samp)
abline(0,1)
qqplot(null_dist[,"t value"], expected_t)
abline(0,1)
qqplot(2*(null_dist[,"logLik"]-null_llik), expected_LR2)
abline(0,1)