Chi squared distribution of the residual variance estimator
Supplement: Residual variance as a chi squared variable
We begin by writing the fitted residuals in matrix form. Defining the vector of fitted residuals as \(\hat{\boldsymbol{\varepsilon}}:= \boldsymbol{Y}- \boldsymbol{X}\hat{\beta}\), we get \[ \begin{aligned} \hat{\boldsymbol{\varepsilon}}={}& \boldsymbol{Y}- \boldsymbol{X}(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\boldsymbol{Y} \\={}& \left(\boldsymbol{I}- \boldsymbol{X}(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\right) \boldsymbol{Y}, \end{aligned} \]
where we can recognize the projection matrix
\[ \underset{\boldsymbol{X}^\perp}{\boldsymbol{P}} = \boldsymbol{I}- \boldsymbol{X}(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal \]
that projects vectors to the orthogonal complement of the column space of \(\boldsymbol{X}\). Plugging in our model for \(\boldsymbol{Y}\),
\[ \hat{\boldsymbol{\varepsilon}}= \underset{\boldsymbol{X}^\perp}{\boldsymbol{P}} (\boldsymbol{X}\beta + \boldsymbol{\varepsilon}) = \underset{\boldsymbol{X}^\perp}{\boldsymbol{P}} \boldsymbol{\varepsilon}, \]
since \(\underset{\boldsymbol{X}^\perp}{\boldsymbol{P}} \boldsymbol{X}= \boldsymbol{0}\). We thus have
\[ \hat{\boldsymbol{\varepsilon}}^\intercal\hat{\boldsymbol{\varepsilon}}= \boldsymbol{\varepsilon}^\intercal\underset{\boldsymbol{X}^\perp}{\boldsymbol{P}} \underset{\boldsymbol{X}^\perp}{\boldsymbol{P}} \boldsymbol{\varepsilon} = \boldsymbol{\varepsilon}^\intercal\underset{\boldsymbol{X}^\perp}{\boldsymbol{P}} \boldsymbol{\varepsilon}, \]
since projecting onto a subspace twice is the same as projecting once.
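As a quick sanity check of these identities (a numerical aside, not part of the derivation), here is a short numpy sketch; the design matrix, coefficients, dimensions, and noise level are all made up for illustration.

```python
import numpy as np

# Illustrative setup: an arbitrary design matrix and normal errors.
rng = np.random.default_rng(0)
N, P = 50, 3
X = rng.normal(size=(N, P))
beta = rng.normal(size=P)
sigma = 1.5
eps = sigma * rng.normal(size=N)
Y = X @ beta + eps

# Projection onto the orthogonal complement of the column space of X.
P_perp = np.eye(N) - X @ np.linalg.solve(X.T @ X, X.T)

# The fitted residuals from the OLS fit...
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
eps_hat = Y - X @ beta_hat

# ...equal P_perp Y, and also P_perp eps, since P_perp X = 0.
assert np.allclose(eps_hat, P_perp @ Y)
assert np.allclose(eps_hat, P_perp @ eps)
assert np.allclose(P_perp @ X, 0.0)

# Idempotence: projecting twice is the same as projecting once.
assert np.allclose(P_perp @ P_perp, P_perp)
```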
The problem with the preceding expression is that \(\underset{\boldsymbol{X}^\perp}{\boldsymbol{P}}\) is complicated, and ties up the IID entries of \(\boldsymbol{\varepsilon}\) in a messy way. We can disentangle them by writing the eigenvalue decomposition of the square, symmetric matrix \(\underset{\boldsymbol{X}^\perp}{\boldsymbol{P}}\):
\[ \underset{\boldsymbol{X}^\perp}{\boldsymbol{P}} = \boldsymbol{U}\Lambda \boldsymbol{U}^\intercal, \]
where \(\boldsymbol{U}\) is orthonormal and \(\Lambda\) has on the diagonal either \(1\) (for columns of \(\boldsymbol{U}\) perpendicular to the columns of \(\boldsymbol{X}\)) or \(0\) (for columns of \(\boldsymbol{U}\) in the span of the columns of \(\boldsymbol{X}\)).
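This eigenvalue structure is easy to verify numerically. Here is a short sketch under the same kind of illustrative setup (numpy assumed; the dimensions are arbitrary):

```python
import numpy as np

# Illustrative check of the eigenstructure of the projection matrix.
rng = np.random.default_rng(0)
N, P = 50, 3
X = rng.normal(size=(N, P))
P_perp = np.eye(N) - X @ np.linalg.solve(X.T @ X, X.T)

# Eigendecomposition of the symmetric matrix P_perp = U Lambda U^T.
eigvals, U = np.linalg.eigh(P_perp)

# Every eigenvalue is 0 or 1, with exactly N - P ones.
assert np.allclose(np.sort(eigvals), np.r_[np.zeros(P), np.ones(N - P)])
assert np.allclose(U @ np.diag(eigvals) @ U.T, P_perp)
assert np.allclose(U.T @ U, np.eye(N))  # the columns of U are orthonormal
```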
Plugging in gives
\[ \hat{\boldsymbol{\varepsilon}}^\intercal\hat{\boldsymbol{\varepsilon}} = \boldsymbol{\varepsilon}^\intercal\boldsymbol{U}\Lambda \boldsymbol{U}^\intercal\boldsymbol{\varepsilon}= (\boldsymbol{U}^\intercal\boldsymbol{\varepsilon})^\intercal \Lambda \, (\boldsymbol{U}^\intercal\boldsymbol{\varepsilon}). \]
What is the distribution of \(\boldsymbol{U}^\intercal\boldsymbol{\varepsilon}\)? Recall that \(\boldsymbol{U}\) is fixed — it is simply a (very complicated) function of the regressors, \(\boldsymbol{X}\). So \(\boldsymbol{U}^\intercal\boldsymbol{\varepsilon}\) is a linear combination of normal random variables, and is itself normal. Its mean is \(\mathbb{E}\left[\boldsymbol{U}^\intercal\boldsymbol{\varepsilon}\right] = \boldsymbol{U}^\intercal\mathbb{E}\left[\boldsymbol{\varepsilon}\right] = \boldsymbol{0}\), and its covariance is
\[ \mathrm{Cov}\left(\boldsymbol{U}^\intercal\boldsymbol{\varepsilon}\right) = \mathbb{E}\left[\boldsymbol{U}^\intercal\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^\intercal\boldsymbol{U}\right] = \boldsymbol{U}^\intercal\mathbb{E}\left[\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^\intercal\right] \boldsymbol{U} = \boldsymbol{U}^\intercal\sigma^2 \boldsymbol{I}\boldsymbol{U} = \sigma^2 \boldsymbol{U}^\intercal\boldsymbol{U} = \sigma^2 \boldsymbol{I}, \]
where the final line is by the orthonormality of \(\boldsymbol{U}\) (which followed from the symmetry of \(\underset{\boldsymbol{X}^\perp}{\boldsymbol{P}}\)). It follows that
\[ \boldsymbol{U}^\intercal\boldsymbol{\varepsilon}\sim \mathcal{N}\left(\boldsymbol{0}, \sigma^2 \boldsymbol{I}\right). \]
So \(\boldsymbol{U}^\intercal\boldsymbol{\varepsilon}\) has the same distribution as \(\boldsymbol{\varepsilon}\)! They are not the same random variable — the two are correlated in a complicated way by the matrix \(\boldsymbol{U}\) — but they have the same distribution.
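We can also see this numerically. The following Monte Carlo sketch (with an arbitrary design, noise level, and number of simulations) checks that \(\boldsymbol{U}^\intercal\boldsymbol{\varepsilon}\) has mean zero and covariance \(\sigma^2 \boldsymbol{I}\):

```python
import numpy as np

# Monte Carlo check that U^T eps has the same mean and covariance as eps.
rng = np.random.default_rng(1)
N, P, sigma = 20, 3, 2.0
X = rng.normal(size=(N, P))
P_perp = np.eye(N) - X @ np.linalg.solve(X.T @ X, X.T)
_, U = np.linalg.eigh(P_perp)

n_sims = 200_000
eps = sigma * rng.normal(size=(n_sims, N))  # each row is one draw of eps
rotated = eps @ U                           # each row is U^T eps for that draw

print(np.abs(rotated.mean(axis=0)).max())                      # close to 0
print(np.abs(np.cov(rotated.T) - sigma**2 * np.eye(N)).max())  # close to 0, up to Monte Carlo error
```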
Let’s write \(\boldsymbol{\varepsilon}' := \boldsymbol{U}^\intercal\boldsymbol{\varepsilon}\) for this random variable, so that
\[ \hat{\boldsymbol{\varepsilon}}^\intercal\hat{\boldsymbol{\varepsilon}} = \boldsymbol{\varepsilon}'^\intercal\Lambda \boldsymbol{\varepsilon}' = \sum_{n=1}^N{\varepsilon'_n}^2 \Lambda_{nn} = \sum_{n=1}^{N-P} {\varepsilon'_n}^2 = \sigma^2 \sum_{n=1}^{N-P} \left(\frac{\varepsilon'_n}{\sigma}\right)^2, \]
where without loss of generality we take the non-zero eigenvalues to be listed first, recalling that there are \(N - P\) of them, since the orthogonal complement of the column space of \(\boldsymbol{X}\) has dimension \(N - P\).
By recognizing the definition of a chi-squared random variable, we can define an estimator of the residual variance,
\[ \hat{\sigma}^2 = \frac{1}{N - P} \hat{\boldsymbol{\varepsilon}}^\intercal\hat{\boldsymbol{\varepsilon}}\sim \frac{\sigma^2}{N - P} s, \quad\textrm{where}\quad s\sim \chi^2_{N-P}. \]
This distribution holds exactly in finite samples; it is not merely asymptotic. But this beautiful result is only possible because of the normality assumption, which may not be very reasonable in practice.
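To illustrate the finite-sample claim (a numerical aside, not part of the argument), here is a small simulation comparing the empirical quantiles of \((N-P)\hat{\sigma}^2 / \sigma^2\) to those of a \(\chi^2_{N-P}\) distribution. The design, dimensions, and noise level are arbitrary, and scipy is assumed for the reference quantiles.

```python
import numpy as np
from scipy import stats

# Simulate the scaled residual sum of squares at a small sample size and
# compare its quantiles to those of chi^2_{N - P}.
rng = np.random.default_rng(2)
N, P, sigma = 15, 4, 0.7
X = rng.normal(size=(N, P))
P_perp = np.eye(N) - X @ np.linalg.solve(X.T @ X, X.T)

n_sims = 100_000
eps = sigma * rng.normal(size=(n_sims, N))
rss = ((eps @ P_perp) * eps).sum(axis=1)   # eps^T P_perp eps, one value per draw
scaled = rss / sigma**2                    # = (N - P) * sigma_hat^2 / sigma^2

qs = np.linspace(0.05, 0.95, 10)
print(np.quantile(scaled, qs))             # empirical quantiles
print(stats.chi2(df=N - P).ppf(qs))        # chi^2_{N-P} quantiles: should match closely
```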
From standard properties of the chi-squared distribution, some immediate results follow.
Variance
As expected, the variance of our estimator around the truth goes to zero as \(N \rightarrow \infty\), since
\[ \mathrm{Var}\left(\hat{\sigma}^2 - \sigma^2\right) = \sigma^4 \mathrm{Var}\left(\frac{s}{N - P} - 1\right) = \frac{\sigma^4}{(N - P)^2} \mathrm{Var}\left(s\right) = \frac{2(N - P)}{(N - P)^2}\sigma^4 \rightarrow 0. \]
This is essentially the chi-squared version of the proof we already gave using the LLN.
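A quick simulation confirms the rate (again an illustrative sketch with made-up dimensions): the sampling variance of \(\hat{\sigma}^2\) should be close to \(2\sigma^4 / (N - P)\) and shrink as \(N\) grows.

```python
import numpy as np

# Compare the simulated variance of sigma_hat^2 with 2 sigma^4 / (N - P).
rng = np.random.default_rng(3)
sigma, P, n_sims = 1.0, 3, 50_000
for N in [10, 40, 160]:
    X = rng.normal(size=(N, P))
    P_perp = np.eye(N) - X @ np.linalg.solve(X.T @ X, X.T)
    eps = sigma * rng.normal(size=(n_sims, N))
    sigma2_hat = ((eps @ P_perp) * eps).sum(axis=1) / (N - P)
    print(N, sigma2_hat.var(), 2 * sigma**4 / (N - P))
```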
Bias
First, we can see that
\[ \mathbb{E}\left[\hat{\sigma}^2\right] = \sigma^2 \frac{N-P}{N - P} = \sigma^2. \]
In other words, \(\hat{\sigma}^2\) is unbiased. Normalizing by \(N - P\) instead of \(N\), as we might naively have done, results in an unbiased estimator.
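The bias from the naive normalization is easy to see in a simulation (illustrative only; the small \(N\) is chosen to make the gap obvious):

```python
import numpy as np

# Dividing the residual sum of squares by N is biased downward;
# dividing by N - P is unbiased.
rng = np.random.default_rng(4)
N, P, sigma = 10, 4, 1.0
X = rng.normal(size=(N, P))
P_perp = np.eye(N) - X @ np.linalg.solve(X.T @ X, X.T)

eps = sigma * rng.normal(size=(200_000, N))
rss = ((eps @ P_perp) * eps).sum(axis=1)
print("divide by N:    ", (rss / N).mean())        # about sigma^2 (N - P) / N = 0.6
print("divide by N - P:", (rss / (N - P)).mean())  # about sigma^2 = 1.0
```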
Alternative (simpler) derivation of the bias of the variance estimator
If all we want is \(\mathbb{E}\left[\hat{\sigma}^2\right]\), we can use a trick that avoids both the chi-squared distribution and the eigendecomposition of the projection matrix.
In particular, \(\hat{\boldsymbol{\varepsilon}}^\intercal\hat{\boldsymbol{\varepsilon}}\) is a scalar, and a scalar is equal to the trace of the corresponding \(1 \times 1\) matrix, so
\[ \boldsymbol{\varepsilon}^\intercal\underset{\boldsymbol{X}^\perp}{\boldsymbol{P}} \boldsymbol{\varepsilon}= \mathrm{trace}\left(\boldsymbol{\varepsilon}^\intercal\underset{\boldsymbol{X}^\perp}{\boldsymbol{P}} \boldsymbol{\varepsilon}\right) = \mathrm{trace}\left(\underset{\boldsymbol{X}^\perp}{\boldsymbol{P}} \boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^\intercal\right), \]
where we have used the general property of the trace that \(\mathrm{trace}\left(AB\right) = \mathrm{trace}\left(BA\right)\). Combining the above results, we see that
\[ \begin{aligned} \mathbb{E}\left[\hat{\boldsymbol{\varepsilon}}^\intercal\hat{\boldsymbol{\varepsilon}}\right] ={}& \mathbb{E}\left[\mathrm{trace}\left(\underset{\boldsymbol{X}^\perp}{\boldsymbol{P}} \boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^\intercal\right)\right] \\={}& \mathrm{trace}\left(\underset{\boldsymbol{X}^\perp}{\boldsymbol{P}} \mathbb{E}\left[\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^\intercal\right]\right) & \textrm{(linearity of the trace and expectation)} \\={}& \mathrm{trace}\left(\underset{\boldsymbol{X}^\perp}{\boldsymbol{P}} \sigma^2 \boldsymbol{I}\right) & \textrm{(by A2)} \\={}& \sigma^2 \mathrm{trace}\left(\underset{\boldsymbol{X}^\perp}{\boldsymbol{P}} \right). \end{aligned} \]
Now, since \(\underset{\boldsymbol{X}^\perp}{\boldsymbol{P}}\) is a projection matrix onto an \(N - P\) dimensional space, \(\mathrm{trace}\left(\underset{\boldsymbol{X}^\perp}{\boldsymbol{P}}\right) = N - P\), so we get
\[ \mathbb{E}\left[\hat{\boldsymbol{\varepsilon}}^\intercal\hat{\boldsymbol{\varepsilon}}\right] = (N - P) \sigma^2. \]
Dividing by \(N - P\) then recovers \(\mathbb{E}\left[\hat{\sigma}^2\right] = \sigma^2\), matching the unbiasedness result above.
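As a final sanity check (a numerical aside with arbitrary dimensions), the trace identity is immediate to verify:

```python
import numpy as np

# The trace of the projection onto the orthogonal complement of col(X) is N - P.
rng = np.random.default_rng(5)
N, P = 30, 5
X = rng.normal(size=(N, P))
P_perp = np.eye(N) - X @ np.linalg.solve(X.T @ X, X.T)
print(np.trace(P_perp))   # = N - P = 25, up to floating point error
```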