Eleven people assessed six 10x10 arrays of photographs and scored the
number of people they recognised. In the first three trials the
participant was given 5 seconds; in the final three trials the participant
was given 3 seconds. The data can be found in collage_photo.csv and are
recorded in chronological order. Using lm
Solution
Read in the data
collage_photo<-read.csv("collage_photo.csv")
head(collage_photo)
## y speed person
## 1 2 5 1
## 2 1 5 1
## 3 2 5 1
## 4 2 3 1
## 5 2 3 1
## 6 0 3 1
and generate a covariate, trial, recording the trial number (the data are in chronological order within each person)
collage_photo$trial<-rep(1:6,11)
Fit a simple model with speed (as a factor)
m1<-lm(y~as.factor(speed), data=collage_photo)
summary(m1)
##
## Call:
## lm(formula = y ~ as.factor(speed), data = collage_photo)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.27273 -0.93939 0.06061 0.72727 1.72727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.9394 0.1514 6.204 4.53e-08 ***
## as.factor(speed)5 0.3333 0.2141 1.557 0.124
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8698 on 64 degrees of freedom
## Multiple R-squared: 0.03648, Adjusted R-squared: 0.02142
## F-statistic: 2.423 on 1 and 64 DF, p-value: 0.1245
e3<-coef(m1)[1]
# expected number in three seconds
e5<-sum(coef(m1)[1:2])
# expected number in five seconds
e5/e3
## (Intercept)
## 1.354839
# The proportional change
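Because m1 contains only the speed factor, its fitted values are just the group means, so the same ratio can be recovered directly from the raw data (a quick sanity check on the same data set):

```r
# ratio of raw group means; with a single factor this matches e5/e3
with(collage_photo, mean(y[speed == 5]) / mean(y[speed == 3]))
```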
Solution
m2<-lm(y~as.factor(speed)+trial, data=collage_photo)
## add trial as a covariate
summary(m2)
##
## Call:
## lm(formula = y ~ as.factor(speed) + trial, data = collage_photo)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.31818 -0.89394 0.01515 0.72727 1.77273
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.16667 0.67761 1.722 0.090 .
## as.factor(speed)5 0.19697 0.45102 0.437 0.664
## trial -0.04545 0.13205 -0.344 0.732
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8759 on 63 degrees of freedom
## Multiple R-squared: 0.03829, Adjusted R-squared: 0.007757
## F-statistic: 1.254 on 2 and 63 DF, p-value: 0.2924
The best estimate suggests they get worse as they assess more photos - perhaps they tire rather than gain experience. For each additional trial they recognise 0.045 fewer people. But the standard error is large!
Solution
plot(m2,2)
The residuals are clustered into bands (because the data are counts), and there seem to be too many small values and too few large values compared with what we would expect from a normal distribution.
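The banding is easy to see by tabulating the response: it takes only a handful of integer values, so the residuals can only fall at a few distinct points within each speed group.

```r
# the response is a small count, hence the banded QQ-plot
table(collage_photo$y)
```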
Solution
There are a few ways this can be done. The speed5 effect in the current
model is the difference in y when given 5 seconds versus 3. We could
start by removing the intercept (we could also reorder the factor) so
that speed5 is an estimate of the expected effect. However, this is
still the expected effect when trial = 0. To obtain an estimate when
trial = 1 we could fit a new covariate, trial - 1, which would be zero
when trial = 1.
m3<-lm(y~as.factor(speed)-1+I(trial-1), data=collage_photo)
We can get the estimate and its standard error out:
est<-coef(summary(m3))["as.factor(speed)5", "Estimate"]
se<-coef(summary(m3))["as.factor(speed)5", "Std. Error"]
If the true value were two, we would expect the distribution of our estimates to have a mean of two and (approximating the sampling distribution as normal rather than \(t\)) a standard deviation equal to the standard error. The chance that we get an estimate smaller than the one we did is
pnorm(est, 2, se)
## [1] 0.0003620524
So yes, on the first trial and given 5 seconds, people identify
significantly fewer than 2 people. We could also perform a more exact
t-test, but more work needs to be done. The t-value (the estimate divided
by its standard error) is only t-distributed when the true value of the
parameter being estimated is zero. In this example, we are assuming
(under the null) that the true value of the parameter is 2. However,
this implies that we expect the parameter minus 2 to be zero.
Consequently, we can treat the t-value as (est - 2)/se and perform a
t-test
pt((est-2)/se, 63)
## [1] 0.0006237342
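An equivalent way to reach the same conclusion, assuming the model m3 fitted above, is a t-based confidence interval for the speed5 effect; 2 lying well outside the interval is consistent with the small p-value.

```r
# 95% confidence interval for the speed5 coefficient in m3
confint(m3)["as.factor(speed)5", ]
```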
Other options would have been to subtract 2 from the response variable and fit the model
m3b<-lm(I(y-2)~as.factor(speed)-1+I(trial-1), data=collage_photo)
and the null hypothesis would have been 0 for the speed5 effect rather
than 2 (the speed5 effect here is exactly equal to est - 2). Another
option is to force the intercept to be 2 using an offset.
collage_photo$two<-rep(2, nrow(collage_photo))
m3c<-lm(y~offset(two)+as.factor(speed)-1+I(trial-1), data=collage_photo)
and again the speed5 effect here is exactly equal to est - 2.
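As a sanity check on the three parameterisations, the speed5 coefficients from m3b and m3c should both equal est - 2 exactly (a sketch, assuming m3, m3b and m3c have all been fitted as above):

```r
# all three quantities below should be identical
c(est - 2,
  coef(m3b)["as.factor(speed)5"],
  coef(m3c)["as.factor(speed)5"])
```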
Solution
Start by fitting speed as a continuous covariate:
m4<-lm(y~speed+I(trial-1), data=collage_photo)
The coefficient tells us how many additional people are recognised for every additional second. Consequently,
\[ \texttt{Number recognised} = \beta_\texttt{(Intercept)} + \beta_\texttt{speed}\texttt{seconds}\] and so after a bit of algebra we get
\[ \texttt{seconds} = \frac{\texttt{Number recognised}-\beta_\texttt{(Intercept)}}{\beta_\texttt{speed}}\] and so
(2-coef(m4)["(Intercept)"])/coef(m4)["speed"]
## (Intercept)
## 11.92308
seconds.
Solution
The quantity is a ratio of parameters, so getting a confidence interval
is not straightforward (see Fieller's theorem). However, what we can do
is simulate replicate data according to our model, refit the model to
these simulated data sets, and re-estimate the parameters. We would then
have a collection of parameter estimates for the relevant parameters
((Intercept) and speed) that are drawn from their joint sampling
distribution. For each simulated data set we can then take the pair of
parameter estimates and solve for the time it takes. Over simulations,
we would then have a collection of estimates for how long it would take,
and we can treat this collection of estimates as being drawn from the
sampling distribution.
nsim<-1000
yrep<-simulate(m4, nsim=nsim) # simulate data according to our model
time<-rep(NA, nsim)
for(i in 1:nsim){ # go through each replicate data set
  collage_photo$yrep <- yrep[, i]
  # refit the model to the replicate data set
  m4.rep <- lm(yrep ~ speed + I(trial - 1), data = collage_photo)
  # solve for the time it takes
  time[i] <- (2 - coef(m4.rep)["(Intercept)"]) / coef(m4.rep)["speed"]
}
quantile(time, c(0.025, 0.975)) # get the 2.5% and 97.5% quantiles of the sampling distribution.
## 2.5% 97.5%
## -31.45232 52.15061
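It can also help to look at the whole simulated sampling distribution rather than just its quantiles; the very wide spread (negative times can arise when a simulated speed coefficient changes sign) makes it obvious why the interval is so uninformative.

```r
# visualise the simulated sampling distribution of the time estimate
hist(time, breaks = 50, xlab = "seconds",
     main = "Simulated times to recognise 2 people")
```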
Time to collect more data!