Download this R markdown as: R, Rmd.

Outline of this lab session:

- general linear model with a known variance structure (weighted least squares);
- model with heteroscedastic errors and the sandwich variance estimator.

The idea is to consider a general version of the linear regression model \[ \boldsymbol{Y} \sim \mathsf{N}(\mathbb{X}\boldsymbol{\beta}, \sigma^2 \mathbb{W}^{-1}), \] where \(\boldsymbol{Y} = (Y_1, \dots, Y_n)^\top\) is the response vector for \(n \in \mathbb{N}\) independent subjects, \(\mathbb{X}\) is the corresponding model matrix (of full rank \(p \in \mathbb{N}\)), and \(\boldsymbol{\beta} \in \mathbb{R}^p\) is the vector of unknown parameters.

Note that the normality assumption above is used only as a simplification; it is not needed to fit the model (without normality, the results and the inference hold asymptotically).

Considering the variance structure, reflected in the matrix \(\mathbb{W} \in \mathbb{R}^{n \times n}\) (a diagonal matrix because the subjects are mutually independent), we can distinguish two cases:

- the matrix \(\mathbb{W}\) is known (typically from the design of the experiment);
- the matrix \(\mathbb{W}\) is unknown and the heteroscedasticity has to be accounted for when estimating the variance of \(\widehat{\boldsymbol{\beta}}\).

These cases lead to the so-called general linear model, fitted by weighted least squares with \(\widehat{\boldsymbol{\beta}} = (\mathbb{X}^\top \mathbb{W} \mathbb{X})^{-1} \mathbb{X}^\top \mathbb{W} \boldsymbol{Y}\), and to the heteroscedasticity-consistent (sandwich) variance estimator, respectively.

Both cases will be briefly addressed below.

1. General linear model

library("mffSM")        # Hlavy, plotLM()
library("colorspace")   # color palettes
library("MASS")         # sdres()
library("sandwich")     # vcovHC()
library("lmtest")       # coeftest()
data(Hlavy, package = "mffSM")

We start with the dataset Hlavy (Czech for “heads”), which contains head sizes of fetuses (from ultrasound monitoring) in different periods of pregnancy. Each data value (row of the data frame) is an average over \(n_i \in \mathbb{N}\) measurements of \(n_i\) independent subjects (mothers).

The information provided in the data stands for:

- i: observation (row) number;
- t: week of pregnancy;
- n: number of measurements (mothers) the average is based on;
- y: average head size.

head(Hlavy)
##   i  t  n     y
## 1 1 16  2 5.100
## 2 2 18  3 5.233
## 3 3 19  9 4.744
## 4 4 20 10 5.110
## 5 5 21 11 5.236
## 6 6 22 20 5.740
dim(Hlavy)
## [1] 26  4
summary(Hlavy)
##        i               t              n              y       
##  Min.   : 1.00   Min.   :16.0   Min.   : 2.0   Min.   :4.74  
##  1st Qu.: 7.25   1st Qu.:23.2   1st Qu.:20.2   1st Qu.:5.87  
##  Median :13.50   Median :29.5   Median :47.0   Median :7.78  
##  Mean   :13.50   Mean   :29.5   Mean   :39.8   Mean   :7.46  
##  3rd Qu.:19.75   3rd Qu.:35.8   3rd Qu.:60.0   3rd Qu.:8.98  
##  Max.   :26.00   Max.   :42.0   Max.   :85.0   Max.   :9.63
par(mfrow = c(1,1), mar = c(4,4,0.5,0.5))
plot(y ~ t, data = Hlavy, pch = 23, col = "red3", bg = "orange")

The simple scatterplot above can, however, be substantially improved by also accounting for the reliability of each observation (a higher number of measurements behind the average means higher reliability). This can be obtained, for instance, by the following code:

with(Hlavy, quantile(n, prob = seq(0, 1, by = 0.2)))
##   0%  20%  40%  60%  80% 100% 
##    2   11   31   53   60   85
Hlavy <- transform(Hlavy, 
                   fn = cut(n, breaks = c(0, 10, 30, 50, 60, 90),
                            labels = c("<=10", "11-30", "31-50", "51-60", ">60")),
                   cexn = n/25+1)
with(Hlavy, summary(fn))
##  <=10 11-30 31-50 51-60   >60 
##     5     5     5     6     5
plotHlavy <- function(){
  PAL <- rev(heat_hcl(nlevels(Hlavy[, "fn"]), c.=c(80, 30), l=c(30, 90), power=c(1/5, 1.3)))
  names(PAL) <- levels(Hlavy[, "fn"])
  plot(y ~ t, data = Hlavy, 
       xlab = "Week", ylab = "Average head size",
       pch = 23, col = "black", bg = PAL[fn], cex = cexn)
  legend(17, 9, legend = levels(Hlavy[, "fn"]), title = "n", fill = PAL)
  abline(v = 27, col = "lightblue", lty = 2)
}    

par(mfrow = c(1,1), mar = c(4,4,0.5,0.5))
plotHlavy()

It is reasonable to assume that for each individual measurement \(\widetilde{Y}_{ij}\) we have a model of the form \[ \widetilde{Y}_{ij} | \boldsymbol{X}_{i} \sim \mathsf{N}(\boldsymbol{X}_{i}^\top\boldsymbol{\beta}, \sigma^2), \] where for each \(j \in \{1, \ldots, n_i\}\) the observations are measured under the same set of covariates \(\boldsymbol{X}_i\). This is a homoscedastic model.

Unfortunately, we do not observe \(\widetilde{Y}_{ij}\) directly, but, instead, we only observe the overall average \(Y_i = \frac{1}{n_i}\sum_{j = 1}^{n_i} \widetilde{Y}_{ij}\).

Since the average of \(n_i\) independent \(\mathsf{N}(\boldsymbol{X}_i^\top\boldsymbol{\beta}, \sigma^2)\) variables has variance \(\sigma^2/n_i\), it holds that \[ Y_i \sim \mathsf{N}(\boldsymbol{X}_i^\top\boldsymbol{\beta}, \sigma^2/{n_i}), \] or, considering all data together, \[ \boldsymbol{Y} \sim \mathsf{N}_n(\mathbb{X}\boldsymbol{\beta}, \sigma^2 \mathbb{W}^{-1}), \] where \(\mathbb{W}^{-1} = \mathsf{diag}(1/n_1, \dots, 1/n_n)\) is a diagonal matrix (thus, the variance structure is known from the design of the experiment). The matrix \(\mathbb{W} = \mathsf{diag}(n_1, \dots, n_n)\) is the weight matrix which gives each \(Y_i\) the weight corresponding to the number of measurements it was computed from.
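Equivalently, multiplying the \(i\)-th observation by \(\sqrt{n_i}\) yields a homoscedastic model, so weighted least squares is just ordinary least squares on rescaled data. A minimal sketch (the helper wls_by_ols is our own illustration, not part of the lab materials):

### Sketch (our own helper): WLS as OLS on rows rescaled by sqrt(w),
### since sqrt(n_i) * Y_i has constant variance sigma^2
wls_by_ols <- function(X, Y, w) {
  sw <- sqrt(w)
  qr.solve(sw * X, sw * Y)  # least-squares solution of the rescaled system
}

This is exactly what lm() with the weights argument does below; the individual work section additionally verifies the same estimate via the weighted normal equations.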

In addition, the practical motivation behind the model is the following: practitioners say that the relationship y ~ t can be described by a piecewise quadratic function with t = 27 being the point where the quadratic relationship changes its shape.

From the practical point of view, the fitted curve should be continuous at t = 27 and perhaps also smooth (i.e., with at least a continuous first derivative), as it is biologically not defensible to claim that the speed (rate) of growth jumps at t = 27.

Since different observations have different reliability, we will use the general linear model with the known matrix \(\mathbb{W}\). The model can be fitted as follows:

Hlavy <- transform(Hlavy, t27 = t - 27)

summary(mCont <- lm(y ~ t27 + I(t27^2) + (t27 > 0):t27 + (t27 > 0):I(t27^2), 
                    weights = n, data = Hlavy))
## 
## Call:
## lm(formula = y ~ t27 + I(t27^2) + (t27 > 0):t27 + (t27 > 0):I(t27^2), 
##     data = Hlavy, weights = n)
## 
## Weighted Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8821 -0.3522  0.0123  0.3792  0.8532 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           7.04975    0.04242  166.17  < 2e-16 ***
## t27                   0.38118    0.03029   12.59  3.0e-11 ***
## I(t27^2)              0.01621    0.00394    4.11   0.0005 ***
## t27:t27 > 0TRUE      -0.06353    0.03943   -1.61   0.1220    
## I(t27^2):t27 > 0TRUE -0.02679    0.00378   -7.09  5.4e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.477 on 21 degrees of freedom
## Multiple R-squared:  0.997,  Adjusted R-squared:  0.996 
## F-statistic: 1.65e+03 on 4 and 21 DF,  p-value: <2e-16
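
Note that this parametrization keeps the fitted curve continuous at t = 27 by construction (there is no separate intercept for t27 > 0). Smoothness at t = 27 corresponds to a zero coefficient of the t27:(t27 > 0) term, which is indeed not significant above (p = 0.122). The smoothness restriction can be tested by an F-test; a small sketch (the model name mSmooth is ours):

### Submodel with a continuous first derivative at t = 27
mSmooth <- lm(y ~ t27 + I(t27^2) + (t27 > 0):I(t27^2), weights = n, data = Hlavy)
anova(mSmooth, mCont)  # F-test of the smoothness restriction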

Compare to the naive (and incorrect) model where the weights are ignored:

mCont_noweights <- lm(y ~ t27 + I(t27^2) + (t27 > 0):t27 + (t27 > 0):I(t27^2), data = Hlavy)
summary(mCont_noweights)
## 
## Call:
## lm(formula = y ~ t27 + I(t27^2) + (t27 > 0):t27 + (t27 > 0):I(t27^2), 
##     data = Hlavy)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3439 -0.0575 -0.0173  0.0535  0.2285 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           7.08435    0.06901  102.65  < 2e-16 ***
## t27                   0.42294    0.03263   12.96  1.7e-11 ***
## I(t27^2)              0.02167    0.00311    6.97  7.0e-07 ***
## t27:t27 > 0TRUE      -0.12116    0.04926   -2.46    0.023 *  
## I(t27^2):t27 > 0TRUE -0.03098    0.00292  -10.59  7.0e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.121 on 21 degrees of freedom
## Multiple R-squared:  0.996,  Adjusted R-squared:  0.995 
## F-statistic: 1.17e+03 on 4 and 21 DF,  p-value: <2e-16

Fitted curves:

tgrid <- seq(16, 42, length = 100)
pdata <- data.frame(t27 = tgrid - 27)
fitCont <- predict(mCont, newdata = pdata)

par(mfrow = c(1,1), mar = c(4,4,0.5,0.5))
plotHlavy()
lines(tgrid, fitCont, col = "red2", lwd = 3)
lines(tgrid, predict(mCont_noweights, newdata = pdata), col = "navyblue", lwd = 2, lty = 2)
legend("bottomright", bty = "n", 
       legend = c(bquote(Weights: hat(sigma) == .(summary(mCont)$sigma)), 
                  bquote(Noweights: hat(sigma) == .(summary(mCont_noweights)$sigma))),
       col = c("red2", "navyblue"), lty = c(1,2), lwd = c(2,3))

Individual work

### Key matrices 
W <- diag(Hlavy$n)
Y <- Hlavy$y
X <- model.matrix(mCont)

### Estimated regression coefficients 
c(betahat <- solve(t(X)%*%W%*%X)%*%(t(X)%*%W%*%Y))
## [1]  7.04975  0.38118  0.01621 -0.06353 -0.02679
coef(mCont)
##          (Intercept)                  t27             I(t27^2)      t27:t27 > 0TRUE I(t27^2):t27 > 0TRUE 
##              7.04975              0.38118              0.01621             -0.06353             -0.02679
### Generalized fitted values 
c(Yg <- X%*%betahat)
##  [1] 4.819 4.933 5.038 5.176 5.346 5.549 5.784 6.052 6.352 6.685 7.050 7.357 7.643 7.908 8.151 8.374 8.575
## [18] 8.755 8.914 9.052 9.169 9.265 9.339 9.393 9.425 9.436
fitted(mCont)
##     1     2     3     4     5     6     7     8     9    10    11    12    13    14    15    16    17    18 
## 4.819 4.933 5.038 5.176 5.346 5.549 5.784 6.052 6.352 6.685 7.050 7.357 7.643 7.908 8.151 8.374 8.575 8.755 
##    19    20    21    22    23    24    25    26 
## 8.914 9.052 9.169 9.265 9.339 9.393 9.425 9.436
### Generalized residual sum of squares 
(RSS <- t(Y - Yg)%*%W%*%(Y - Yg))
##       [,1]
## [1,] 4.778
### Generalized mean square 
RSS/(length(Y)-ncol(X))
##        [,1]
## [1,] 0.2275
(summary(mCont)$sigma)^2
## [1] 0.2275
### Be careful that Multiple R-squared:  0.9968 reported in the output "summary(mCont)" is not equal to 
sum((Yg-mean(Yg))^2)/sum((Y-mean(Y))^2)
## [1] 1.006
### Instead, it equals the following quantity that is more difficult to interpret 
barYw <- sum(Y*Hlavy$n)/sum(Hlavy$n)   ### weighted mean of the original observations

### something like a "generalized" regression sum of squares 
SSR <- sum((Yg - barYw)^2 * Hlavy$n)

### something like a "generalized" total sum of squares 
SST <- sum((Y - barYw)^2 * Hlavy$n) 

### "generalized" R squared
SSR/SST
## [1] 0.9968
summary(mCont)$r.squared
## [1] 0.9968
The same point estimates can also be obtained by expanding the dataset so that each row is replicated \(n_i\) times, as if the individual measurements were observed directly:

iii <- rep(1:nrow(Hlavy), Hlavy$n)
Hlavy2 <- Hlavy[iii,] # expanded dataset: each row replicated n_i times

summary(mCont2 <- lm(y ~ t27 + I(t27^2) + (t27 > 0):t27 + (t27 > 0):I(t27^2), data = Hlavy2))
## 
## Call:
## lm(formula = y ~ t27 + I(t27^2) + (t27 > 0):t27 + (t27 > 0):I(t27^2), 
##     data = Hlavy2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.2940 -0.0450 -0.0147  0.0413  0.3005 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           7.049749   0.006055  1164.3   <2e-16 ***
## t27                   0.381177   0.004323    88.2   <2e-16 ***
## I(t27^2)              0.016214   0.000563    28.8   <2e-16 ***
## t27:t27 > 0TRUE      -0.063531   0.005627   -11.3   <2e-16 ***
## I(t27^2):t27 > 0TRUE -0.026786   0.000539   -49.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0681 on 1031 degrees of freedom
## Multiple R-squared:  0.997,  Adjusted R-squared:  0.997 
## F-statistic: 8.11e+04 on 4 and 1031 DF,  p-value: <2e-16
summary(mCont)
## 
## Call:
## lm(formula = y ~ t27 + I(t27^2) + (t27 > 0):t27 + (t27 > 0):I(t27^2), 
##     data = Hlavy, weights = n)
## 
## Weighted Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8821 -0.3522  0.0123  0.3792  0.8532 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           7.04975    0.04242  166.17  < 2e-16 ***
## t27                   0.38118    0.03029   12.59  3.0e-11 ***
## I(t27^2)              0.01621    0.00394    4.11   0.0005 ***
## t27:t27 > 0TRUE      -0.06353    0.03943   -1.61   0.1220    
## I(t27^2):t27 > 0TRUE -0.02679    0.00378   -7.09  5.4e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.477 on 21 degrees of freedom
## Multiple R-squared:  0.997,  Adjusted R-squared:  0.996 
## F-statistic: 1.65e+03 on 4 and 21 DF,  p-value: <2e-16
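
The point estimates of mCont2 and mCont coincide, but the standard errors do not: the expanded dataset pretends that 1036 individual measurements, each exactly equal to its group average, were observed, which understates the residual variability. A quick check of ours:

### Ratio of the (invalid) expanded-data SEs to the weighted-fit SEs
coef(summary(mCont2))[, "Std. Error"] / coef(summary(mCont))[, "Std. Error"]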

2. Model with heteroscedastic errors (sandwich estimate)

For a general heteroscedastic model it is usually assumed that
\[ Y_i | \boldsymbol{X}_i \sim \mathsf{N}(\boldsymbol{X}_i^\top\boldsymbol{\beta}, \sigma^2(\boldsymbol{X}_i)), \] where \(\sigma^2(\cdot)\) is some positive (unknown) function of the covariates \(\boldsymbol{X}_i \in \mathbb{R}^p\). Note that the normality assumption is again only a simplification and is not required.

In the following, we will consider a dataset that comes from the project MANA (Maturita na necisto), launched in the Czech Republic in 2005. This project was part of a bigger project whose goal was to prepare a new form of the graduation exam (“statni maturita”).
The dataset contains the results in the English language at grammar schools.

load(url("http://msekce.karlin.mff.cuni.cz/~vavraj/nmfm334/data/mana.RData"))

The variables in the data are:

- region: region of the school (coded 1–14);
- specialization: school specialization (coded 1–11);
- gender: 0 = female, 1 = male;
- graduation: 0 = no, 1 = yes;
- entry.exam: 0 = no, 1 = yes;
- score: test score;
- avg.mark: average school mark;
- mark.eng: mark in English.

For better intuition, we can transform the data as follows:

mana <- transform(mana, 
                  fregion = factor(region, levels = 1:14,
                                   labels = c("A", "S", "C", "P", "K", "U", "L", 
                                              "H", "E", "J", "B", "M", "Z", "T")),
                  fspec = factor(specialization, levels = 1:11, 
                                 labels = c("Educ", "Social", "Lang", "Law", 
                                            "Mat-Phys", "Engin", "Science", 
                                            "Medic", "Econom", "Agric", "Art")),
                  fgender = factor(gender, levels = 0:1, labels = c("Female", "Male")),
                  fgrad = factor(graduation, levels = 0:1, labels = c("No", "Yes")),
                  fentry  = factor(entry.exam, levels = 0:1,  labels = c("No", "Yes")))

summary(mana)
##      region      specialization     gender        graduation      entry.exam        score       avg.mark   
##  Min.   : 1.00   Min.   : 1.0   Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   : 5   Min.   :1.00  
##  1st Qu.: 3.00   1st Qu.: 2.0   1st Qu.:0.000   1st Qu.:1.000   1st Qu.:0.000   1st Qu.:40   1st Qu.:1.48  
##  Median : 7.00   Median : 6.0   Median :0.000   Median :1.000   Median :0.000   Median :44   Median :1.64  
##  Mean   : 7.46   Mean   : 5.5   Mean   :0.415   Mean   :0.958   Mean   :0.454   Mean   :43   Mean   :1.65  
##  3rd Qu.:11.00   3rd Qu.: 8.0   3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.:48   3rd Qu.:1.81  
##  Max.   :14.00   Max.   :11.0   Max.   :1.000   Max.   :1.000   Max.   :1.000   Max.   :52   Max.   :2.47  
##                                                                                                            
##     mark.eng       fregion         fspec        fgender     fgrad      fentry    
##  Min.   :1.00   T      : 923   Social : 855   Female:3044   No : 216   No :2843  
##  1st Qu.:1.40   S      : 665   Econom : 846   Male  :2159   Yes:4987   Yes:2360  
##  Median :2.00   B      : 634   Engin  : 638                                      
##  Mean   :2.18   A      : 599   Medic  : 564                                      
##  3rd Qu.:2.80   L      : 534   Science: 557                                      
##  Max.   :4.20   C      : 474   Educ   : 513                                      
##                 (Other):1374   (Other):1230

Individual work

For illustration purposes we will fit a simple (additive) model to explain the average mark given the region and the specialization.

mAddit <- lm(avg.mark ~ fregion + fspec, data = mana)
summary(mAddit)
## 
## Call:
## lm(formula = avg.mark ~ fregion + fspec, data = mana)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6884 -0.1644 -0.0052  0.1562  0.8402 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.77616    0.01432  124.06  < 2e-16 ***
## fregionS       0.00887    0.01334    0.66   0.5063    
## fregionC      -0.01336    0.01455   -0.92   0.3585    
## fregionP       0.02008    0.02197    0.91   0.3607    
## fregionK      -0.02198    0.02098   -1.05   0.2948    
## fregionU       0.05678    0.02049    2.77   0.0056 ** 
## fregionL       0.02388    0.01411    1.69   0.0907 .  
## fregionH       0.04861    0.02104    2.31   0.0209 *  
## fregionE       0.07980    0.02049    3.89  1.0e-04 ***
## fregionJ      -0.03138    0.01942   -1.62   0.1062    
## fregionB       0.03236    0.01348    2.40   0.0164 *  
## fregionM       0.02588    0.02159    1.20   0.2306    
## fregionZ       0.01653    0.01880    0.88   0.3794    
## fregionT       0.00846    0.01250    0.68   0.4984    
## fspecSocial   -0.15080    0.01325  -11.38  < 2e-16 ***
## fspecLang     -0.26387    0.01700  -15.52  < 2e-16 ***
## fspecLaw      -0.15020    0.01534   -9.79  < 2e-16 ***
## fspecMat-Phys -0.22192    0.01926  -11.52  < 2e-16 ***
## fspecEngin    -0.17751    0.01403  -12.65  < 2e-16 ***
## fspecScience  -0.10411    0.01448   -7.19  7.4e-13 ***
## fspecMedic    -0.14368    0.01446   -9.94  < 2e-16 ***
## fspecEconom   -0.16751    0.01330  -12.59  < 2e-16 ***
## fspecAgric    -0.07115    0.03120   -2.28   0.0226 *  
## fspecArt      -0.15527    0.02030   -7.65  2.4e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.236 on 5179 degrees of freedom
## Multiple R-squared:  0.0712, Adjusted R-squared:  0.0671 
## F-statistic: 17.3 on 23 and 5179 DF,  p-value: <2e-16
par(mar = c(4,4,1.5,0.5))
plotLM(mAddit)

Could the variability vary with the region or school specialization?

par(mfrow = c(1,1), mar = c(4,4,0.5,0.5))
plot(sqrt(abs(stdres(mAddit))) ~ mana$fregion)

plot(sqrt(abs(stdres(mAddit))) ~ mana$fspec)

In the diagnostic plots, there are some obvious issues with heteroscedasticity. We will use the sandwich estimate of the variance matrix to account for this heteroscedasticity.

Generally, for the variance matrix of \(\widehat{\boldsymbol{\beta}}\) we have \[ \mathsf{var} \left[ \left. \widehat{\boldsymbol{\beta}} \right| \mathbb{X} \right] = \mathsf{var} \left[ \left. \left(\mathbb{X}^\top \mathbb{X}\right)^{-1} \mathbb{X}^\top \mathbf{Y} \right| \mathbb{X} \right] = \left(\mathbb{X}^\top \mathbb{X}\right)^{-1} \mathbb{X}^\top \mathsf{var} \left[ \left. \mathbf{Y} \right| \mathbb{X} \right] \mathbb{X} \left(\mathbb{X}^\top \mathbb{X}\right)^{-1}. \] Under homoscedasticity, \(\mathsf{var} \left[ \left. \mathbf{Y} \right| \mathbb{X} \right] = \sigma^2 \mathbb{I}\) and the variance estimator for \(\mathsf{var} \left[ \left. \widehat{\boldsymbol{\beta}} \right| \mathbb{X} \right]\) is \[ \widehat{\mathsf{var}} \left[ \left. \widehat{\boldsymbol{\beta}} \right| \mathbb{X} \right] = \left(\mathbb{X}^\top \mathbb{X}\right)^{-1} \mathbb{X}^\top \left[\widehat{\sigma}^2 \mathbb{I} \right] \mathbb{X} \left(\mathbb{X}^\top \mathbb{X}\right)^{-1} = \widehat{\sigma}^2 \left(\mathbb{X}^\top \mathbb{X}\right)^{-1}. \] Under heteroscedasticity, we consider the heteroscedasticity-consistent (HC) covariance matrix estimator, which differs in the estimation of the meat \(\mathsf{var} \left[ \left. \mathbf{Y} \right| \mathbb{X} \right] = \mathsf{var} \left[ \left. \boldsymbol{\varepsilon} \right| \mathbb{X} \right]\). The default choice type = "HC3" uses \[ \widehat{\mathsf{var}} \left[ \left. \mathbf{Y} \right| \mathbb{X} \right] = \mathsf{diag} \left(U_i^2 / m_{ii}^2 \right), \] where \(\mathbf{U} = \mathbf{Y} - \widehat{\mathbf{Y}} = \mathbf{Y} - \mathbb{X} \left(\mathbb{X}^\top \mathbb{X}\right)^{-1} \mathbb{X}^\top \mathbf{Y} = (\mathbb{I} - \mathbb{H}) \mathbf{Y} = \mathbb{M} \mathbf{Y}\) are the residuals and \(m_{ii}\) are the diagonal elements of \(\mathbb{M}\).

library(sandwich)

### Estimate of the variance matrix of the regression coefficients 
VHC <- vcovHC(mAddit, type = "HC3") # type changes the "meat"
VHC[1:5,1:5]
##             (Intercept)   fregionS   fregionC   fregionP   fregionK
## (Intercept)   0.0002036 -0.0001045 -0.0001066 -0.0001052 -0.0001035
## fregionS     -0.0001045  0.0001943  0.0001021  0.0001023  0.0001012
## fregionC     -0.0001066  0.0001021  0.0002078  0.0001022  0.0001006
## fregionP     -0.0001052  0.0001023  0.0001022  0.0004862  0.0001011
## fregionK     -0.0001035  0.0001012  0.0001006  0.0001011  0.0004262

And we can compare the result with the manually reconstructed estimate:

X     <- model.matrix(mAddit)                                               
Bread <- solve(t(X) %*% X) %*% t(X)
Meat  <- diag(residuals(mAddit)^2 / (1 - hatvalues(mAddit))^2)
(Bread %*% Meat %*% t(Bread))[1:5, 1:5]
##             (Intercept)   fregionS   fregionC   fregionP   fregionK
## (Intercept)   0.0002036 -0.0001045 -0.0001066 -0.0001052 -0.0001035
## fregionS     -0.0001045  0.0001943  0.0001021  0.0001023  0.0001012
## fregionC     -0.0001066  0.0001021  0.0002078  0.0001022  0.0001006
## fregionP     -0.0001052  0.0001023  0.0001022  0.0004862  0.0001011
## fregionK     -0.0001035  0.0001012  0.0001006  0.0001011  0.0004262
all.equal(Bread %*% Meat %*% t(Bread), VHC)
## [1] TRUE

This can be compared with the classical estimate of the variance-covariance matrix; in the following, we only look at a portion of each matrix, multiplied by 10000.

10000*VHC[1:5, 1:5] 
##             (Intercept) fregionS fregionC fregionP fregionK
## (Intercept)       2.036   -1.045   -1.066   -1.052   -1.035
## fregionS         -1.045    1.943    1.021    1.023    1.012
## fregionC         -1.066    1.021    2.078    1.022    1.006
## fregionP         -1.052    1.023    1.022    4.862    1.011
## fregionK         -1.035    1.012    1.006    1.011    4.262
10000*vcov(mAddit)[1:5, 1:5]
##             (Intercept) fregionS fregionC fregionP fregionK
## (Intercept)      2.0496  -1.0091  -0.9705  -0.9934  -0.9450
## fregionS        -1.0091   1.7808   0.9401   0.9440   0.9342
## fregionC        -0.9705   0.9401   2.1167   0.9437   0.9367
## fregionP        -0.9934   0.9440   0.9437   4.8252   0.9394
## fregionK        -0.9450   0.9342   0.9367   0.9394   4.4006

Note that some variance components are overestimated by the standard estimate (function vcov()) while some others are underestimated.
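
A quick way to quantify this is the ratio of sandwich to classical standard errors (a small check of ours; values above 1 mean the classical standard errors were too optimistic):

### Ratio of sandwich (HC3) to classical standard errors
sqrt(diag(VHC)) / sqrt(diag(vcov(mAddit)))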

The inference then proceeds with the same point estimate \(\widehat{\boldsymbol{\beta}}\), but with the vcovHC() matrix used in place of the standard vcov(). The classical summary table can be recalculated in the following way:

coeftest(mAddit, vcov = VHC)    # from library("lmtest")
## 
## t test of coefficients:
## 
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.77616    0.01427  124.48  < 2e-16 ***
## fregionS       0.00887    0.01394    0.64  0.52456    
## fregionC      -0.01336    0.01442   -0.93  0.35401    
## fregionP       0.02008    0.02205    0.91  0.36255    
## fregionK      -0.02198    0.02064   -1.06  0.28711    
## fregionU       0.05678    0.02139    2.65  0.00797 ** 
## fregionL       0.02388    0.01410    1.69  0.09046 .  
## fregionH       0.04861    0.02023    2.40  0.01631 *  
## fregionE       0.07980    0.02170    3.68  0.00024 ***
## fregionJ      -0.03138    0.01963   -1.60  0.10999    
## fregionB       0.03236    0.01395    2.32  0.02042 *  
## fregionM       0.02588    0.02425    1.07  0.28581    
## fregionZ       0.01653    0.01835    0.90  0.36766    
## fregionT       0.00846    0.01261    0.67  0.50237    
## fspecSocial   -0.15080    0.01287  -11.71  < 2e-16 ***
## fspecLang     -0.26387    0.01688  -15.63  < 2e-16 ***
## fspecLaw      -0.15020    0.01528   -9.83  < 2e-16 ***
## fspecMat-Phys -0.22192    0.01888  -11.75  < 2e-16 ***
## fspecEngin    -0.17751    0.01420  -12.50  < 2e-16 ***
## fspecScience  -0.10411    0.01437   -7.24  5.0e-13 ***
## fspecMedic    -0.14368    0.01413  -10.17  < 2e-16 ***
## fspecEconom   -0.16751    0.01282  -13.06  < 2e-16 ***
## fspecAgric    -0.07115    0.02715   -2.62  0.00879 ** 
## fspecArt      -0.15527    0.02229   -6.97  3.7e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Unfortunately, the function predict.lm() is not that flexible, and the confidence intervals need to be computed manually:

newdata <- data.frame(fregion = "A",
                      fspec = levels(mana$fspec),
                      avg.mark = 0)
# predict no sandwich
pred_vcov <- predict(mAddit, newdata = newdata, interval = "confidence")

# predict with sandwich (manually)
L <- model.matrix(mAddit, data = newdata)
se.fit <- sqrt(diag(L %*% VHC %*% t(L)))
pred_VHC <- matrix(NA, nrow = nrow(L), ncol = 3)
pred_VHC[,1] <- L %*% coef(mAddit)
pred_VHC[,2] <- pred_VHC[,1] - se.fit * qt(0.975, df = mAddit$df.residual)
pred_VHC[,3] <- pred_VHC[,1] + se.fit * qt(0.975, df = mAddit$df.residual)
colnames(pred_VHC) <- c("fit", "lwr", "upr")
rownames(pred_VHC) <- rownames(pred_vcov)

all.equal(pred_vcov[,"fit"], pred_VHC[,"fit"])
## [1] TRUE
all.equal(pred_vcov[,"lwr"], pred_VHC[,"lwr"])
## [1] "Mean relative difference: 0.001029"
all.equal(pred_vcov[,"upr"], pred_VHC[,"upr"])
## [1] "Mean relative difference: 0.0009899"
par(mfrow = c(1,1), mar = c(4,4,0.5,0.5))
plot(x = 1:nlevels(mana$fspec), y = pred_vcov[,"fit"], 
     xlab = "Specialization", ylab = "Expected average mark, Prague region",
     xaxt = "n", ylim = range(pred_vcov, pred_VHC), 
     col = "red4", pch = 16, cex = 2)
axis(1, at = 1:nlevels(mana$fspec), labels = levels(mana$fspec))
arrows(x0 = 1:nlevels(mana$fspec), y0 = pred_vcov[,"fit"], y1 = pred_vcov[,"lwr"],
       angle = 80, lwd = 2, col = "red")
arrows(x0 = 1:nlevels(mana$fspec), y0 = pred_vcov[,"fit"], y1 = pred_vcov[,"upr"],
       angle = 80, lwd = 2, col = "red")
arrows(x0 = 1:nlevels(mana$fspec), y0 = pred_vcov[,"fit"], y1 = pred_VHC[,"lwr"],
       angle = 80, lwd = 2, col = "navyblue")
arrows(x0 = 1:nlevels(mana$fspec), y0 = pred_vcov[,"fit"], y1 = pred_VHC[,"upr"],
       angle = 80, lwd = 2, col = "navyblue")
legend("top", legend = c("vcov()", "vcovHC(type = 'HC3')"),
       ncol = 2, bty = "n", col = c("red", "navyblue"), lwd = 2)

Outline of this lab session:

- exploratory analysis of the Chicago fires data;
- modeling the expected number of fires given minority and locality;
- response (logarithmic) transformation.

Loading the data and libraries

We will use a small dataset (47 observations and 8 covariates) containing information about fires that occurred in Chicago in 1970. In particular, we will focus on the locality within Chicago (north vs. south) and the proportion of minority citizens. The dataset is loaded from the website and a brief insight is provided below.

library("mffSM")   # plotLM()
library("MASS")    # stdres()
library("fBasics") # dagoTest()
library("car")     # ncvTest()
library("lmtest")  # bptest()

chicago <- read.csv("http://www.karlin.mff.cuni.cz/~vavraj/nmfm334/data/chicago.csv", 
                    header=T, stringsAsFactors = TRUE)
        
chicago <- transform(chicago, 
                     fside = factor(side, levels = 0:1, labels=c("North", "South")))          
                    
head(chicago)
##   minor fire theft  old insur income side fside
## 1  10.0  6.2    29 60.4   0.0 11.744    0 North
## 2  22.2  9.5    44 76.5   0.1  9.323    0 North
## 3  19.6 10.5    36 73.5   1.2  9.948    0 North
## 4  17.3  7.7    37 66.9   0.5 10.656    0 North
## 5  24.5  8.6    53 81.4   0.7  9.730    0 North
## 6  54.0 34.1    68 52.6   0.3  8.231    0 North
summary(chicago)
##      minor            fire           theft            old           insur           income     
##  Min.   : 1.00   Min.   : 2.00   Min.   :  3.0   Min.   : 2.0   Min.   :0.000   Min.   : 5.58  
##  1st Qu.: 3.75   1st Qu.: 5.65   1st Qu.: 22.0   1st Qu.:48.6   1st Qu.:0.000   1st Qu.: 8.45  
##  Median :24.50   Median :10.40   Median : 29.0   Median :65.0   Median :0.400   Median :10.69  
##  Mean   :34.99   Mean   :12.28   Mean   : 32.4   Mean   :60.3   Mean   :0.615   Mean   :10.70  
##  3rd Qu.:57.65   3rd Qu.:16.05   3rd Qu.: 38.0   3rd Qu.:77.3   3rd Qu.:0.900   3rd Qu.:11.99  
##  Max.   :99.70   Max.   :39.70   Max.   :147.0   Max.   :90.1   Max.   :2.200   Max.   :21.48  
##       side         fside   
##  Min.   :0.000   North:25  
##  1st Qu.:0.000   South:22  
##  Median :0.000             
##  Mean   :0.468             
##  3rd Qu.:1.000             
##  Max.   :1.000

The data contain information from the Chicago insurance redlining study in 47 districts in 1970, where minor is the percentage of minority residents, fire the number of fires (per 100 households), side the locality (0 = North, 1 = South, also available as the factor fside), and theft, old, insur, and income are further district-level characteristics (thefts, age of housing, insurance activity, and income).

Individual work

Use the data on fires in Chicago (the dataset loaded above) and perform some basic exploratory analysis, focusing on the number of fires (the dependent variable fire) and two explanatory variables: the proportion of the minority (minor) and the location in Chicago (side).

What are your expectations about the form of the regression function when modeling the conditional expected number of fires \(\mathsf{E} [Y |\, \mathtt{minor}, \mathtt{side}]\), conditioning on the minority proportion and the locality information?

Are there some issues that you should be concerned about? What about the assumptions of the normal regression model, or of an ordinary regression model (without the assumption of normality)?

What particular graphical tools can be used to visually verify the assumptions imposed on the model?

1. Expected number of fires given minority and locality

We start by looking for a model of the dependence of the number of fires (fire) on the percentage of minority (minor). We are also aware of the fact that the form of this dependence may differ between the north and south parts of Chicago.

par(mfrow = c(1,1), mar = c(4,4,0.5,0.5))
XLAB <- "Minority (%)"
YLAB <- "Fires (#/100 households)"
COLS <- c(North = "darkgreen", South = "red")
BGS <- c(North = "aquamarine", South = "orange")
PCHS <- c(North = 21, South = 23)
plot(fire ~ minor, data = chicago, 
     xlab = XLAB, ylab = YLAB, 
     col = COLS[fside], bg = BGS[fside], pch = PCHS[fside])
legend("topleft", legend = c("North", "South"), bty = "n", 
       pch = PCHS, col = COLS, pt.bg = BGS)
lines(with(subset(chicago, fside == "North"), lowess(minor, fire)),
      col = COLS["North"], lwd = 2)
lines(with(subset(chicago, fside == "South"), lowess(minor, fire)), 
      col = COLS["South"], lwd = 2)

Let's compare the following three models. Based on the output, describe the effect of the percentage of minority on the number of fires. Describe also the effect of side.

m <- list()
m[["Line"]]  <- lm(fire ~ minor * fside, data = chicago)
m[["Log2"]]   <- lm(fire ~ log2(minor) * fside, data = chicago)
m[["Parab"]] <- lm(fire ~ (minor + I(minor^2)) * fside, data = chicago)

summary(m[["Line"]])
## 
## Call:
## lm(formula = fire ~ minor * fside, data = chicago)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -17.26  -3.34  -1.47   3.08  18.30 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        5.1952     1.6921    3.07   0.0037 ** 
## minor              0.3227     0.0490    6.59    5e-08 ***
## fsideSouth         1.3745     3.1025    0.44   0.6600    
## minor:fsideSouth  -0.2081     0.0659   -3.16   0.0029 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.53 on 43 degrees of freedom
## Multiple R-squared:  0.539,  Adjusted R-squared:  0.506 
## F-statistic: 16.7 on 3 and 43 DF,  p-value: 2.38e-07
summary(m[["Log2"]])
## 
## Call:
## lm(formula = fire ~ log2(minor) * fside, data = chicago)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.597 -4.660 -0.875  2.285 17.383 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              -0.173      2.272   -0.08    0.940    
## log2(minor)               3.981      0.598    6.65  4.1e-08 ***
## fsideSouth                2.918      4.188    0.70    0.490    
## log2(minor):fsideSouth   -2.018      0.896   -2.25    0.029 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.44 on 43 degrees of freedom
## Multiple R-squared:  0.552,  Adjusted R-squared:  0.521 
## F-statistic: 17.6 on 3 and 43 DF,  p-value: 1.29e-07
summary(m[["Parab"]])
## 
## Call:
## lm(formula = fire ~ (minor + I(minor^2)) * fside, data = chicago)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.763  -3.918  -0.572   3.096  14.322 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            2.02190    1.82254    1.11   0.2737    
## minor                  0.75499    0.14458    5.22  5.5e-06 ***
## I(minor^2)            -0.00529    0.00169   -3.14   0.0032 ** 
## fsideSouth             1.72111    3.41937    0.50   0.6174    
## minor:fsideSouth      -0.43592    0.19456   -2.24   0.0305 *  
## I(minor^2):fsideSouth  0.00317    0.00212    1.50   0.1418    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.86 on 41 degrees of freedom
## Multiple R-squared:  0.647,  Adjusted R-squared:  0.604 
## F-statistic:   15 on 5 and 41 DF,  p-value: 2.23e-08

And the corresponding diagnostics:

par(mar = c(4,4,2,0.5))
plotLM(m[["Line"]])

plotLM(m[["Log2"]])

plotLM(m[["Parab"]])

For all three models, some issues can be spotted in the diagnostic plots above (heteroscedasticity, conditional mean misspecification, maybe also normality issues).

Some statistical tests for normality applied to the raw residuals, which are heteroscedastic even in a homoscedastic model (their variances are \(\sigma^2(1 - h_{ii})\), where \(h_{ii}\) are the diagonal elements of the hat matrix):

shapiro.test(residuals(m[["Line"]]))          
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(m[["Line"]])
## W = 0.9, p-value = 9e-04
dagoTest(residuals(m[["Line"]]))
## 
## Title:
##  D'Agostino Normality Test
## 
## Test Results:
##   STATISTIC:
##     Chi2 | Omnibus: 9.4764
##     Z3  | Skewness: 2.1817
##     Z4  | Kurtosis: 2.1717
##   P VALUE:
##     Omnibus  Test: 0.008754 
##     Skewness Test: 0.02913 
##     Kurtosis Test: 0.02987

Some statistical tests for normality applied to the standardized residuals, which are not exactly normally distributed even in a normal linear model:

shapiro.test(stdres(m[["Line"]]))          
## 
##  Shapiro-Wilk normality test
## 
## data:  stdres(m[["Line"]])
## W = 0.9, p-value = 7e-04
dagoTest(stdres(m[["Line"]]))
## 
## Title:
##  D'Agostino Normality Test
## 
## Test Results:
##   STATISTIC:
##     Chi2 | Omnibus: 7.9183
##     Z3  | Skewness: 1.3467
##     Z4  | Kurtosis: 2.4707
##   P VALUE:
##     Omnibus  Test: 0.01908 
##     Skewness Test: 0.1781 
##     Kurtosis Test: 0.01348

Statistical test for homoscedasticity (Breusch-Pagan test and Koenker’s studentized version) with alternatives \(\mathsf{var} \left[ Y_i | \mathbf{X}_i\right] = \sigma^2 \exp\left\{\tau \mathsf{E} \left[Y_i | \mathbf{X}_i\right]\right\}\) or in general \(\mathsf{var} \left[ Y_i | \mathbf{X}_i, \mathbf{Z_i}\right] = \sigma^2 \exp\left\{\mathbf{Z}_i^\top \boldsymbol{\tau}\right\}\):

ncvTest(m[["Line"]])                                     # default is fitted
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 25.24, Df = 1, p = 5.1e-07
ncvTest(m[["Line"]], var.formula = ~fitted(m[["Line"]])) # the same as before 
## Non-constant Variance Score Test 
## Variance formula: ~ fitted(m[["Line"]]) 
## Chisquare = 25.24, Df = 1, p = 5.1e-07
bptest(m[["Line"]], varformula = ~fitted(m[["Line"]]))   # Koenker's studentized version (robust to non-normality)
## 
##  studentized Breusch-Pagan test
## 
## data:  m[["Line"]]
## BP = 14, df = 1, p-value = 2e-04
bptest(m[["Line"]])                                      # default is the same as model formula
## 
##  studentized Breusch-Pagan test
## 
## data:  m[["Line"]]
## BP = 15, df = 3, p-value = 0.002

In addition, there is a bunch of points scattered around the origin, and only relatively few observations are available for larger proportions of the minority or larger numbers of fires. This suggests that a log transformation (of the response, and of the predictor as well) could help.

2. Response (logarithmic) transformation

When the residual variance increases with the response expectation, a log transformation of the response often yields a homoscedastic model (Assumption A3a from the lecture).
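
A minimal simulated sketch of this effect (a toy example of ours, assuming the standard deviation is proportional to the mean):

### Toy illustration: sd proportional to the mean, stabilized by log()
set.seed(1)
mu <- rep(c(5, 20, 80), each = 200)        # three mean levels
y  <- rnorm(600, mean = mu, sd = 0.2 * mu) # sd grows with the mean
tapply(y, mu, sd)                          # raw scale: sd increases with the mean
tapply(log(y), mu, sd)                     # log scale: roughly constant sd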

Firstly, compare the plot below (on the log scale on \(y\) axes) with the previous plot on the original scale on the \(y\) axes:

par(mfrow = c(1,1), mar = c(4,4,0.5,0.5))
LYLAB <- "Log(Fires)"
plot(log(fire) ~ minor, data = chicago, 
     xlab = XLAB, ylab = LYLAB, 
     col = COLS[fside], bg = BGS[fside], pch = PCHS[fside])
legend("topleft", legend = c("North", "South"), bty = "n", 
       pch = PCHS,  col = COLS, pt.bg = BGS)
lines(with(subset(chicago, fside == "North"), lowess(minor, log(fire))), 
      col = COLS["North"], lwd = 2)
lines(with(subset(chicago, fside == "South"), lowess(minor, log(fire))), 
      col = COLS["South"], lwd = 2)

By the proposed transformation of the response we also change the underlying data. This means that instead of the model \[ \mathsf{E} [Y_i | \boldsymbol{X}_i] = \boldsymbol{X}_i^\top \boldsymbol{\beta} \] fitted from the random sample \(\{(Y_i, \boldsymbol{X}_i);\; i = 1, \dots, n\}\), we fit another model of the form \[ \mathsf{E} [\log(Y_i) | \boldsymbol{X}_i] = \boldsymbol{X}_i^\top \widetilde{\boldsymbol{\beta}}, \] which is based on the different random sample \(\{(\log(Y_i), \boldsymbol{X}_i);\;i = 1, \dots, n\}\). This has some crucial consequences: the logarithmic transformation helps to deal with the heteroscedasticity but, on the other hand, introduces new problems regarding the model interpretability.

The idea is that the final model should always be interpreted with respect to the original data, i.e., the number of fires (fire), not the logarithm of the number of fires (log(fire)).

For better interpretability we will also log-transform the independent variable, the proportion of minority (minor). What is the interpretation of the following model?

summary(m <- lm(log(fire) ~ log2(minor) * fside, data = chicago))
## 
## Call:
## lm(formula = log(fire) ~ log2(minor) * fside, data = chicago)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -0.752 -0.340 -0.079  0.346  0.849 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              1.0162     0.1463    6.94  1.6e-08 ***
## log2(minor)              0.3596     0.0385    9.33  6.7e-12 ***
## fsideSouth               0.2796     0.2698    1.04    0.306    
## log2(minor):fsideSouth  -0.1441     0.0577   -2.50    0.016 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.415 on 43 degrees of freedom
## Multiple R-squared:  0.728,  Adjusted R-squared:  0.709 
## F-statistic: 38.3 on 3 and 43 DF,  p-value: 3.23e-12

The diagnostic plots have improved in terms of mean and variance:

par(mar = c(4,4,2,0.5))
plotLM(m)

Recall the Jensen inequality and the fact that \[ \mathsf{E} [\log(Y_i) | \boldsymbol{X}_i] \neq \log(\mathsf{E} [Y_i | \boldsymbol{X}_i]) \qquad \text{but rather} \qquad \mathsf{E} [\log(Y_i) | \boldsymbol{X}_i] < \log(\mathsf{E} [Y_i | \boldsymbol{X}_i]). \] And now compare the two models: \[ \begin{aligned} Y_i &= \mathbf{X}_i^\top \boldsymbol{\beta} + \varepsilon_i, \\ \log Y_i &= \mathbf{X}_i^\top \widetilde{\boldsymbol{\beta}} + \widetilde{\varepsilon}_i \quad\Longleftrightarrow\quad Y_i = \exp\left\{\mathbf{X}_i^\top \widetilde{\boldsymbol{\beta}}\right\} \cdot \exp\left\{ \widetilde{\varepsilon}_i \right\}. \end{aligned} \]
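
Consequently, the naive back-transformation \(\exp\{\mathbf{X}_i^\top \widehat{\widetilde{\boldsymbol{\beta}}}\}\) underestimates \(\mathsf{E}[Y_i | \boldsymbol{X}_i]\). Under normality of \(\widetilde{\varepsilon}_i\), \(\mathsf{E}[Y_i | \boldsymbol{X}_i] = \exp\{\mathbf{X}_i^\top \widetilde{\boldsymbol{\beta}} + \widetilde{\sigma}^2/2\}\); Duan's smearing estimator is a distribution-free alternative. A minimal sketch (the object names are ours):

### Back-transforming fitted values of log(fire) to the scale of fire
pred_log  <- fitted(m)                               # fitted values on the log scale
naive     <- exp(pred_log)                           # underestimates E[Y | X]
lognormal <- exp(pred_log + summary(m)$sigma^2 / 2)  # normal-theory correction
smearing  <- exp(pred_log) * mean(exp(residuals(m))) # Duan's smearing estimator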