Eleven people assessed six 10x10 arrays of photographs and scored the number of people they recognised. In the first three trials the participant was given 5 seconds; in the final three trials the participant was given 3 seconds. The data can be found in collage_photo.csv and are recorded in chronological order. Using lm:

  1. What is the best estimate of how many more people (as a %) people recognise when given more time?

Solution

Read in the data

collage_photo<-read.csv("collage_photo.csv")

head(collage_photo)
##   y speed person
## 1 2     5      1
## 2 1     5      1
## 3 2     5      1
## 4 2     3      1
## 5 2     3      1
## 6 0     3      1

and generate a covariate recording the trial number (1 to 6 for each person)

collage_photo$trial<-rep(1:6,11)
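
As a quick sanity check (a minimal sketch, assuming the rows really are in chronological order within each person), the first three trials should correspond to 5 seconds and the last three to 3 seconds:

table(collage_photo$trial, collage_photo$speed)  # cross-tabulate trial number against speed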

Fit a simple model with speed (as a factor)

m1<-lm(y~as.factor(speed), data=collage_photo)
summary(m1)
## 
## Call:
## lm(formula = y ~ as.factor(speed), data = collage_photo)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.27273 -0.93939  0.06061  0.72727  1.72727 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         0.9394     0.1514   6.204 4.53e-08 ***
## as.factor(speed)5   0.3333     0.2141   1.557    0.124    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8698 on 64 degrees of freedom
## Multiple R-squared:  0.03648,    Adjusted R-squared:  0.02142 
## F-statistic: 2.423 on 1 and 64 DF,  p-value: 0.1245
e3<-coef(m1)[1]         # expected number when given three seconds
e5<-sum(coef(m1)[1:2])  # expected number when given five seconds

e5/e3                   # the proportional change
## (Intercept) 
##    1.354839
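
To express this as a percentage increase, one extra line:

100*(e5/e3 - 1)   # percentage increase when given 5 seconds rather than 3

so the best estimate is that people recognise roughly 35% more people when given 5 seconds.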


  2. Do people improve as they assess more arrays?

Solution

m2<-lm(y~as.factor(speed)+trial, data=collage_photo)
# add trial as a covariate

summary(m2)
## 
## Call:
## lm(formula = y ~ as.factor(speed) + trial, data = collage_photo)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.31818 -0.89394  0.01515  0.72727  1.77273 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)  
## (Intercept)        1.16667    0.67761   1.722    0.090 .
## as.factor(speed)5  0.19697    0.45102   0.437    0.664  
## trial             -0.04545    0.13205  -0.344    0.732  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8759 on 63 degrees of freedom
## Multiple R-squared:  0.03829,    Adjusted R-squared:  0.007757 
## F-statistic: 1.254 on 2 and 63 DF,  p-value: 0.2924

The best estimate says they get worse as they assess more arrays: perhaps they get tired rather than experienced. For each additional trial they recognise 0.045 fewer people. But the standard error (0.132) is large, so there is no clear evidence of a trend.
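
One way to see how imprecise this estimate is (a minimal sketch using confint) is to look at a 95% confidence interval for the trial coefficient, which comfortably spans zero:

confint(m2, "trial")   # 95% confidence interval for the trial effect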


  3. Using a qqplot, judge what is wrong with the distribution of residuals.

Solution

plot(m2, 2)   # the second diagnostic plot is the normal Q-Q plot of the residuals

The residuals are clustered at certain values (because the data are counts), and there seem to be too many small values and too few large values compared with what we would expect from a normal distribution.
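
For reference (a minimal sketch), plot(m2, 2) is essentially a normal Q-Q plot of the standardised residuals, which can also be built directly:

r <- rstandard(m2)   # standardised residuals from the fitted model
qqnorm(r)            # plot sample quantiles against theoretical normal quantiles
qqline(r)            # add the reference line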


  4. Assume that the improvement is linear with respect to how many times they had tried. The expected number of IEB photos on an array was 2. Does a participant on their first try, given 5 seconds, recognise significantly fewer than this number?

Solution

There are a few ways this can be done. The speed5 effect in the current model is the difference in y between being given 5 seconds and being given 3. We could start by removing the intercept (we could also reorder the factor) so that the speed5 coefficient estimates the expected number recognised when given 5 seconds. However, this is still the expected number when trial=0. To obtain an estimate when trial=1 we can fit the covariate trial-1, which is zero when trial=1.

m3<-lm(y~as.factor(speed)-1+I(trial-1), data=collage_photo)

We can then extract the estimate and its standard error:

est<-coef(summary(m3))["as.factor(speed)5", "Estimate"]
se<-coef(summary(m3))["as.factor(speed)5", "Std. Error"]

If the true value were two, we would expect the sampling distribution of our estimate to have a mean of two and a standard deviation equal to the standard error (approximating the sampling distribution as normal rather than \(t\)). The chance of obtaining an estimate as small as the one we observed is

pnorm(est, 2, se)
## [1] 0.0003620524

So yes, on the first trial and given 5 seconds, people identify significantly fewer than 2 people. We could also perform a more exact t-test, but a little more work is needed. The t-value (the estimate divided by its standard error) is only t-distributed when the true value of the parameter being estimated is zero. In this example, we are assuming (under the null) that the true value of the parameter is 2, which implies that the parameter minus 2 is expected to be zero. Consequently, we can treat (est-2)/se as the t-value and perform a t-test

pt((est-2)/se, 63)
## [1] 0.0006237342

Another option would have been to subtract 2 from the response variable and fit the model

m3b<-lm(I(y-2)~as.factor(speed)-1+I(trial-1), data=collage_photo)

and the null hypothesis would then have been 0 for the speed5 effect rather than 2 (the speed5 effect here is exactly equal to est-2). A further option is to force the intercept to be 2 using an offset.

collage_photo$two<-rep(2, nrow(collage_photo))
m3c<-lm(y~offset(two)+as.factor(speed)-1+I(trial-1), data=collage_photo)

and again the speed5 effect here is exactly equal to est-2.
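
As a quick check (a small sketch comparing the three parameterisations), the speed5 coefficients from m3b and m3c should both equal est-2:

coef(m3)["as.factor(speed)5"] - 2    # est - 2 from the original parameterisation
coef(m3b)["as.factor(speed)5"]       # speed5 effect with 2 subtracted from the response
coef(m3c)["as.factor(speed)5"]       # speed5 effect with the intercept forced to 2 via an offset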


  5. Assume that the improvement is linear with respect to the time the people were given. If the point estimates were the true values, how long should I have given people if I wanted them to recognise two people on their first try (on average)?

Solution

Start by fitting speed as a continuous covariate:

m4<-lm(y~speed+I(trial-1), data=collage_photo)

The speed coefficient tells us how many additional people are recognised for every additional second. Consequently,

\[ \texttt{Number recognised} = \beta_\texttt{(Intercept)} + \beta_\texttt{speed}\texttt{seconds}\] and so after a bit of algebra we get

\[ \texttt{seconds} = \frac{\texttt{Number recognised}-\beta_\texttt{(Intercept)}}{\beta_\texttt{speed}}\] and so

(2-coef(m4)["(Intercept)"])/coef(m4)["speed"]
## (Intercept) 
##    11.92308

seconds.
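
As a sanity check (a minimal sketch using predict), the fitted model should predict roughly two people recognised at that speed on the first trial:

predict(m4, newdata = data.frame(speed = 11.92308, trial = 1))   # should be approximately 2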


  6. Put approximate confidence intervals on this using simulation.

Solution

The quantity is a ratio of parameters, so getting a confidence interval is not straightforward (see Fieller's theorem). However, what we can do is simulate replicate data according to our model, refit the model to each simulated data set and re-estimate the parameters. We then have a collection of estimates for the relevant parameters ((Intercept) and speed) drawn from their joint sampling distribution. For each simulated data set we can take the pair of parameter estimates and solve for the time it takes. Across simulations we then have a collection of estimates of how long it would take, which we can treat as draws from the sampling distribution of that quantity, and whose quantiles give an approximate confidence interval.

nsim<-1000
yrep<-simulate(m4, nsim=nsim)  # simulate data according to our model
time<-rep(NA, nsim)
for(i in 1:nsim){              # go through each replicate data set
  collage_photo$yrep<-yrep[,i]
  m4.rep<-lm(yrep~speed+I(trial-1), data=collage_photo)             # refit the model to the replicate data set 
  time[i]<-(2-coef(m4.rep)["(Intercept)"])/coef(m4.rep)["speed"]    # solve for the time it takes
}
quantile(time, c(0.025, 0.975))                                     # get the 2.5% and 97.5% quantiles of the sampling distribution.
##      2.5%     97.5% 
## -31.45232  52.15061

Time to collect more data!
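
An alternative (a sketch, assuming the MASS package is installed) is to skip the refitting step and draw coefficient vectors directly from their approximate normal sampling distribution, using the fitted coefficients and their covariance matrix:

library(MASS)                                                         # for mvrnorm
beta <- coef(m4)[c("(Intercept)", "speed")]                           # point estimates
V <- vcov(m4)[c("(Intercept)", "speed"), c("(Intercept)", "speed")]   # their covariance
draws <- mvrnorm(1000, mu = beta, Sigma = V)                          # draws from the approximate sampling distribution
time2 <- (2 - draws[, "(Intercept)"]) / draws[, "speed"]              # solve for the time in each draw
quantile(time2, c(0.025, 0.975))                                      # approximate 95% interval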

