STAT151A Homework 1
1 Ordinary least squares in matrix form
Consider simple least squares regression \(y_n = \beta_1 + \beta_2 x_n + \varepsilon_n\), where \(x_n\) is a scalar. Assume that we have \(N\) datapoints. We showed directly that the least–squares solution is given by
\[ \hat{\beta}_1 = \overline{y} - \hat{\beta}_2 \overline{x} \quad\text{and}\quad \hat{\beta}_2 = \frac{\overline{xy} - \overline{x} \, \overline{y}} {\overline{xx} - \overline{x}^2}. \]
Let us re–derive this using matrix notation.
(a) Write simple linear regression in the form \(\boldsymbol{Y}= \boldsymbol{X}\boldsymbol{\beta}+ \boldsymbol{\varepsilon}\). Be precise about what goes into each entry of \(\boldsymbol{Y}\), \(\boldsymbol{X}\), \(\boldsymbol{\beta}\), and \(\boldsymbol{\varepsilon}\). What are the dimensions of each?
(b) We proved that the optimal \(\hat{\boldsymbol{\beta}}\) satisfies \(\boldsymbol{X}^\intercal\boldsymbol{X}\hat{\boldsymbol{\beta}}= \boldsymbol{X}^\intercal\boldsymbol{Y}\). Define the “barred quantities” \[ \begin{aligned} \overline{y} ={}& \frac{1}{N} \sum_{n=1}^N y_n \\ \overline{x} ={}& \frac{1}{N} \sum_{n=1}^N x_n \\ \overline{xy} ={}& \frac{1}{N} \sum_{n=1}^N x_n y_n \\ \overline{xx} ={}& \frac{1}{N} \sum_{n=1}^N x_n^2. \end{aligned} \]
In terms of the barred quantities and the number of datapoints \(N\), write expressions for \(\boldsymbol{X}^\intercal\boldsymbol{X}\) and \(\boldsymbol{X}^\intercal\boldsymbol{Y}\).
(c) When is \(\boldsymbol{X}^\intercal\boldsymbol{X}\) invertible? Write a formal expression in terms of the barred quantities. Interpret this condition intuitively in terms of the distribution of the regressors \(x_n\).
(d) Using the formula for the inverse of a \(2\times 2\) matrix, find an expression for \(\hat{\boldsymbol{\beta}}\), and confirm that we get the same answer that we got by solving directly.
(e) In the case where \(\boldsymbol{X}^\intercal\boldsymbol{X}\) is not invertible, find three distinct values of \(\boldsymbol{\beta}\) that all achieve the same sum of squared residuals \(\boldsymbol{\varepsilon}^\intercal\boldsymbol{\varepsilon}\).
(a)
\[ \boldsymbol{X}= \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_N \end{pmatrix} \quad\quad \boldsymbol{Y}= \begin{pmatrix} y_1 \\ \vdots \\ y_N \end{pmatrix} \quad\quad \boldsymbol{\varepsilon}= \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_N \end{pmatrix} \quad\quad \boldsymbol{\beta}= \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} \]
These are \(N \times 2\), \(N \times 1\), \(N \times 1\), and \(2 \times 1\) respectively.
(b)
\[ \boldsymbol{X}^\intercal\boldsymbol{X}= N \begin{pmatrix} 1 & \bar{x}\\ \bar{x}& \overline{xx} \end{pmatrix} \quad\quad \boldsymbol{X}^\intercal\boldsymbol{Y}= N \begin{pmatrix} \bar{y}\\ \overline{xy} \end{pmatrix} \]
(c)
\(\boldsymbol{X}^\intercal\boldsymbol{X}\) is invertible if and only if its determinant, \(N^2 (\overline{xx} - \bar{x}\, \bar{x})\), is nonzero. This occurs exactly when the sample variance of \(x_n\) is greater than zero.
(d)
\[ \begin{aligned} (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\boldsymbol{Y}={}& \frac{1}{N (\overline{xx} - \bar{x}\, \bar{x})} \begin{pmatrix} \overline{xx} & -\bar{x}\\ -\bar{x}& 1 \end{pmatrix} N \begin{pmatrix} \bar{y}\\ \overline{xy} \end{pmatrix} \\={}& \frac{1}{\overline{xx} - \bar{x}\, \bar{x}} \begin{pmatrix} \bar{y}\, \overline{xx} - \bar{x}\, \overline{xy} \\ -\bar{y}\, \bar{x}+ \overline{xy} \end{pmatrix} \end{aligned}. \]
We already have \(\hat{\beta}_2 = (\overline{xy} - \bar{y}\, \bar{x}) / (\overline{xx} - \bar{x}\, \bar{x})\) as expected. To see that \(\hat{\beta}_1\) is correct, write
\[ \begin{aligned} \bar{y}\, \overline{xx} - \bar{x}\, \overline{xy} ={}& \bar{y}\, \overline{xx} - \bar{y}\, \bar{x}\, \bar{x}+ \bar{y}\, \bar{x}\, \bar{x}- \bar{x}\, \overline{xy} \\={}& \bar{y}(\overline{xx} - \bar{x}\, \bar{x}) - \bar{x}(\overline{xy} - \bar{y}\, \bar{x}). \end{aligned} \]
Plugging this in gives
\[ \begin{aligned} \frac{\bar{y}\, \overline{xx} - \bar{x}\, \overline{xy}}{\overline{xx} - \bar{x}\, \bar{x}} ={}& \frac{\bar{y}(\overline{xx} - \bar{x}\, \bar{x}) - \bar{x}(\overline{xy} - \bar{y}\, \bar{x})} {\overline{xx} - \bar{x}\, \bar{x}} \\={}& \bar{y}- \bar{x}\hat{\beta}_2, \end{aligned} \]
as expected.
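The agreement between the matrix-form and direct solutions can also be checked numerically. The following is a minimal sketch assuming numpy; the simulated data and all variable names are purely illustrative.

```python
# Compare the matrix-form OLS solution with the closed-form
# intercept and slope from the barred quantities.
import numpy as np

rng = np.random.default_rng(0)
N = 50
x = rng.normal(size=N)
y = 2.0 + 3.0 * x + rng.normal(size=N)

# Matrix form: solve (X^T X) beta = X^T Y.
X = np.column_stack([np.ones(N), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Closed form in terms of the barred quantities.
slope = (np.mean(x * y) - np.mean(x) * np.mean(y)) / (np.mean(x**2) - np.mean(x)**2)
intercept = np.mean(y) - slope * np.mean(x)

assert np.allclose(beta_hat, [intercept, slope])
```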
(e)
When \(\boldsymbol{X}^\intercal\boldsymbol{X}\) is not invertible, the sample variance of \(x_n\) is zero, meaning all \(x_n\) are equal to some constant \(c = \bar{x}\). In this case, the fit is \(\hat{y}_n = \beta_1 + \beta_2 \bar{x}\), which depends only on the sum \(\beta_1 + \beta_2 \bar{x}\). Any \(\boldsymbol{\beta}\) with the same value of \(\beta_1 + \beta_2 \bar{x}\) gives the same residuals. Three such values that set \(\beta_1 + \beta_2 \bar{x}= \bar{y}\):
\[ \begin{aligned} \boldsymbol{\beta}&= (\bar{y}, 0)^\intercal\\ \boldsymbol{\beta}&= (\bar{y}- \bar{x}, 1)^\intercal\\ \boldsymbol{\beta}&= (\bar{y}- 2\bar{x}, 2)^\intercal \end{aligned} \] All three give \(\hat{y}_n = \bar{y}\) for all \(n\), and thus the same sum of squared residuals \(\sum_n (y_n - \bar{y})^2\).
Since \(\hat{y}_n = \bar{y}\), these values also happen to minimize the sum of squared residuals. However, a perfectly good answer to the question as asked could instead set \(\beta_1 + \beta_2 \bar{x}= d\) for any value \(d\): all such solutions give the same residuals as one another, though the residuals are suboptimal when \(d \ne \bar{y}\).
Finally, note that if \(\boldsymbol{X}^\intercal\boldsymbol{X}\) is not invertible, then the vector \((\bar{x}, -1)^\intercal\) is an eigenvector of \(\boldsymbol{X}^\intercal\boldsymbol{X}\) with eigenvalue zero: \[ \boldsymbol{X}^\intercal\boldsymbol{X} \begin{pmatrix} \bar{x}\\ -1 \end{pmatrix} = N \begin{pmatrix} 1 & \bar{x}\\ \bar{x}& \bar{x}^2 \end{pmatrix} \begin{pmatrix} \bar{x}\\ -1 \end{pmatrix} = N \begin{pmatrix} \bar{x}- \bar{x}\\ \bar{x}^2 - \bar{x}^2 \end{pmatrix} = \boldsymbol{0}. \]
This means that any coefficient vector of the form \[ \boldsymbol{\beta}= \boldsymbol{\beta}_0 + c \begin{pmatrix} \bar{x}\\ -1 \end{pmatrix}, \] for any \(\boldsymbol{\beta}_0\) and \(c\), gives the same \(\hat{y}_n\). For example, the above solutions can be generated by taking \(\boldsymbol{\beta}_0 = (\bar{y}, 0)^\intercal\), and the values \(c = 0\), \(-1\), and \(-2\). Since any such \(c\) gives an equivalent linear fit, there is an infinite, one-dimensional set of equivalent regression lines, corresponding to the zero eigenvector of \(\boldsymbol{X}^\intercal\boldsymbol{X}\).
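This degeneracy is easy to see numerically. The following sketch (assuming numpy, with illustrative simulated data) confirms that all three coefficient vectors produce identical residuals when every \(x_n\) is the same constant.

```python
# When all x_n equal a constant, shifting beta along (x_bar, -1)
# leaves the fitted values, and hence the residuals, unchanged.
import numpy as np

N = 20
xbar = 3.0
x = np.full(N, xbar)                 # zero sample variance
rng = np.random.default_rng(1)
y = rng.normal(loc=5.0, size=N)
X = np.column_stack([np.ones(N), x])

ybar = y.mean()
betas = [np.array([ybar, 0.0]),
         np.array([ybar - xbar, 1.0]),
         np.array([ybar - 2 * xbar, 2.0])]
rss = [np.sum((y - X @ b) ** 2) for b in betas]

# All three coefficient vectors give the same sum of squared residuals.
assert np.allclose(rss, rss[0])
```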
2 One-hot encoding
Consider a one–hot encoding of a variable \(z_n\) that takes three distinct values, “a”, “b”, and “c”. That is, let
\[ \boldsymbol{x}_n = \begin{cases} \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix} & \textrm{ when }z_n = a \\ \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix} & \textrm{ when }z_n = b \\ \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} & \textrm{ when }z_n = c \\ \end{cases} \]
Let \(\boldsymbol{X}\) be the regressor matrix with \(\boldsymbol{x}_n^\intercal\) in row \(n\).
(a)
Let \(N_a\) be the number of observations with \(z_n\) = a, and let \(\sum_{n:z_n = a}\) denote a sum over rows with \(z_n\) = a, with analogous definitions for b and c. In terms of these quantities, write expressions for \(\boldsymbol{X}^\intercal\boldsymbol{X}\) and \(\boldsymbol{X}^\intercal\boldsymbol{Y}\).
(b)
When is \(\boldsymbol{X}^\intercal\boldsymbol{X}\) invertible? Explain intuitively why the regression problem cannot be solved when \(\boldsymbol{X}^\intercal\boldsymbol{X}\) is not invertible. Write an explicit expression for \((\boldsymbol{X}^\intercal\boldsymbol{X})^{-1}\) when it is invertible.
(c)
Using your previous answer, show that the entries of the least squares vector \(\hat{\boldsymbol{\beta}}\) are the means of \(y_n\) within the distinct values of \(z_n\).
(d)
Suppose now you include a constant in the regression, so that
\[ y_n = \alpha + \boldsymbol{\beta}^\intercal\boldsymbol{x}_n + \varepsilon_n, \]
and let \(\boldsymbol{X}'\) denote the regressor matrix for this regression with coefficient vector \((\alpha, \boldsymbol{\beta}^\intercal)^\intercal\). Write an expression for \(\boldsymbol{X}'^\intercal\boldsymbol{X}'\) and show that it is not invertible.
(e)
Find three distinct values of \((\alpha, \boldsymbol{\beta}^\intercal)\) that all give the exact same fit \(\alpha + \boldsymbol{\beta}^\intercal\boldsymbol{x}_n\).
(a)
\[ \boldsymbol{X}^\intercal\boldsymbol{X}= \begin{pmatrix} N_a & 0 & 0 \\ 0 & N_b & 0 \\ 0 & 0 & N_c \\ \end{pmatrix} \quad\quad \boldsymbol{X}^\intercal\boldsymbol{Y}= \begin{pmatrix} \sum_{n:z_n = a} y_n \\ \sum_{n:z_n = b} y_n \\ \sum_{n:z_n = c} y_n \\ \end{pmatrix}. \]
(b)
It is invertible as long as each of \(N_a\), \(N_b\), and \(N_c\) is nonzero. If there are no observations for a particular level, you of course cannot estimate its relationship with \(y_n\). When \(\boldsymbol{X}^\intercal\boldsymbol{X}\) is invertible, then
\[ (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} = \begin{pmatrix} 1/N_a & 0 & 0 \\ 0 & 1/N_b & 0 \\ 0 & 0 & 1/N_c \\ \end{pmatrix}. \]
(c)
By direct multiplication,
\[ \hat{\boldsymbol{\beta}}= (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\boldsymbol{Y}= \begin{pmatrix} \frac{1}{N_a} \sum_{n:z_n = a} y_n \\ \frac{1}{N_b} \sum_{n:z_n = b} y_n \\ \frac{1}{N_c} \sum_{n:z_n = c} y_n \\ \end{pmatrix}. \]
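The group-means result can be verified numerically. A minimal sketch assuming numpy, with simulated labels and responses chosen purely for illustration:

```python
# With one-hot regressors and no intercept, the OLS coefficients
# are the within-group means of y.
import numpy as np

rng = np.random.default_rng(2)
z = rng.choice(["a", "b", "c"], size=60)
y = rng.normal(size=60)

# Build the one-hot regressor matrix, one column per level.
X = np.column_stack([(z == lvl).astype(float) for lvl in ["a", "b", "c"]])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

group_means = np.array([y[z == lvl].mean() for lvl in ["a", "b", "c"]])
assert np.allclose(beta_hat, group_means)
```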
(d)
\[ (\boldsymbol{X}')^\intercal\boldsymbol{X}' = \begin{pmatrix} N & N_a & N_b & N_c \\ N_a & N_a & 0 & 0 \\ N_b & 0 & N_b & 0 \\ N_c & 0 & 0 & N_c \\ \end{pmatrix}. \]
This is not invertible because the first column is the sum of the other three. Equivalently, \((\boldsymbol{X}')^\intercal\boldsymbol{X}' \boldsymbol{v}= \boldsymbol{0}\) where \(\boldsymbol{v}= (1, -1, -1, -1)^\intercal\).
(e)
Any fit of the form
\[ (\alpha + \beta_1) \mathbb{I}\left(z_n = a\right) + (\alpha + \beta_2) \mathbb{I}\left(z_n = b\right) + (\alpha + \beta_3) \mathbb{I}\left(z_n = c\right) \]
depends on the coefficients only through the sums \(\alpha + \beta_k\). Three equivalent coefficient vectors that also happen to solve the least squares problem are
\[ \begin{aligned} (\alpha, \beta_1, \beta_2, \beta_3)^\intercal ={}& (0, \hat{\boldsymbol{\beta}}^\intercal)^\intercal + (0, 0, 0, 0)^\intercal \\ (\alpha, \beta_1, \beta_2, \beta_3)^\intercal ={}& (0, \hat{\boldsymbol{\beta}}^\intercal)^\intercal + (1, -1, -1, -1)^\intercal \\ (\alpha, \beta_1, \beta_2, \beta_3)^\intercal ={}& (0, \hat{\boldsymbol{\beta}}^\intercal)^\intercal + (2, -2, -2, -2)^\intercal, \end{aligned} \]
where \(\hat{\boldsymbol{\beta}}\) is the vector of group means from (c). These are all of the form \((0, \hat{\boldsymbol{\beta}}^\intercal)^\intercal + C \boldsymbol{v}\) for \(C = 0\), \(C = 1\), and \(C = 2\), as they must be, where \(\boldsymbol{v}\) is the null vector from (d).
3 Linear regression as projections
In this problem, we show that the OLS regression acts like a “projection,” breaking \(\boldsymbol{Y}\) up into two orthogonal components: a component \(\hat{\boldsymbol{Y}}\) that is in the span of the columns of \(\boldsymbol{X}\), and a component \(\hat{\boldsymbol{\varepsilon}}\) that is orthogonal to the columns of \(\boldsymbol{X}\).
Consider a regression setup where \(\boldsymbol{X}\) is full-rank with \(P < N\). Let \(\hat{\boldsymbol{\beta}}\), \(\hat{\boldsymbol{Y}}\), and \(\hat{\varepsilon}\) denote the usual quantities. Write the \(p\)-th column of \(\boldsymbol{X}\) as \(\boldsymbol{X}_p\), so that \(\boldsymbol{X}= (\boldsymbol{X}_1, \ldots, \boldsymbol{X}_P)\), and define the \(N \times N\) “projection matrix” \(\underset{\boldsymbol{X}}{\boldsymbol{P}} := \boldsymbol{X}(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\).
(a) An \(N\)-vector \(\boldsymbol{v}\) is said to be “in the linear span of the columns of \(\boldsymbol{X}\)” if it can be written as \(\boldsymbol{v}= \sum_{p=1}^P a_p \boldsymbol{X}_p\) for some \(a_1,\ldots,a_P\). Show that such a vector can be written in the form \(\boldsymbol{X}\boldsymbol{a}\) for some \(P\)-vector \(\boldsymbol{a}\).
(b) Show that if \(\boldsymbol{v}\) is in the in the linear span of the columns of \(\boldsymbol{X}\), then \(\underset{\boldsymbol{X}}{\boldsymbol{P}} \boldsymbol{v}= \boldsymbol{v}\). That is, \(\underset{\boldsymbol{X}}{\boldsymbol{P}}\) leaves \(\boldsymbol{v}\) unchanged.
(c) Show that if an \(N\)-vector \(\boldsymbol{q}\) is orthogonal to each column of \(\boldsymbol{X}\), then \(\underset{\boldsymbol{X}}{\boldsymbol{P}} \boldsymbol{q}= \boldsymbol{0}\).
(d) Show that \(\boldsymbol{v}^\intercal\boldsymbol{q}= 0\), where \(\boldsymbol{v}\) and \(\boldsymbol{q}\) are as given in the preceding parts.
For the remaining problems, we introduce some additional notation. Note that, in general, the response vector \(\boldsymbol{Y}\) is just an \(N\)-vector. Write \(\boldsymbol{Y}= \boldsymbol{Y}_{\parallel} + \boldsymbol{Y}_{\perp}\), where \(\boldsymbol{Y}_{\parallel}\) is in the linear span of the columns of \(\boldsymbol{X}\), and \(\boldsymbol{Y}_{\perp}\) is orthogonal to the columns of \(\boldsymbol{X}\). We call \(\boldsymbol{Y}_{\parallel}\) the “projection of \(\boldsymbol{Y}\) onto the span of the columns of \(\boldsymbol{X}\).”
(e) Show that \(\hat{\boldsymbol{Y}}= \boldsymbol{Y}_{\parallel}\) and \(\hat{\varepsilon}= \boldsymbol{Y}_{\perp}\).
(f) Show that \(\boldsymbol{X}^\intercal\hat{\varepsilon}= \boldsymbol{0}\).
(g) Show that \(\left\Vert\boldsymbol{Y}\right\Vert_2^2 = \left\Vert\hat{\boldsymbol{Y}}\right\Vert_2^2 + \left\Vert\hat{\varepsilon}\right\Vert_2^2\).
(h) Let \(\boldsymbol{\gamma}\) denote a \(P \times 1\) vector. Using (b), show directly that \(\left\Vert\boldsymbol{Y}- \boldsymbol{X}\boldsymbol{\gamma}\right\Vert_2^2\) is smallest when \(\boldsymbol{\gamma}= \hat{\boldsymbol{\beta}}\). Hint: Plug in \(\boldsymbol{Y}= \hat{\varepsilon}+ \boldsymbol{X}\hat{\boldsymbol{\beta}}\), use \(\boldsymbol{X}\hat{\boldsymbol{\beta}}- \boldsymbol{X}\boldsymbol{\gamma}= \boldsymbol{X}(\hat{\boldsymbol{\beta}}- \boldsymbol{\gamma})\), and expand the square.
(i) Suppose that \(\boldsymbol{Y}= \boldsymbol{X}\boldsymbol{a}\) for some \(\boldsymbol{a}\). Show that in this case \(\hat{\boldsymbol{Y}}= \boldsymbol{Y}\) and \(\hat{\boldsymbol{\varepsilon}}= \boldsymbol{0}\).
(j) As a special case of (f), show that \(\frac{1}{N} \sum_{n=1}^N\hat{\varepsilon}_n = 0\) when the regression includes an intercept term. Hint: When the regression includes an intercept, then \(\boldsymbol{1}\) is a column of \(\boldsymbol{X}\).
(a)
By definition, \(\boldsymbol{v}= \sum_{p=1}^P a_p \boldsymbol{X}_p\). Let \(\boldsymbol{a}= (a_1, \ldots, a_P)^\intercal\). Then \(\boldsymbol{X}\boldsymbol{a}= \sum_{p=1}^P a_p \boldsymbol{X}_p = \boldsymbol{v}\).
(b)
If \(\boldsymbol{v}= \boldsymbol{X}\boldsymbol{a}\) for some \(\boldsymbol{a}\), then \[ \underset{\boldsymbol{X}}{\boldsymbol{P}} \boldsymbol{v}= \boldsymbol{X}(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\boldsymbol{X}\boldsymbol{a}= \boldsymbol{X}\boldsymbol{a}= \boldsymbol{v}. \]
(c)
If \(\boldsymbol{q}\) is orthogonal to each column of \(\boldsymbol{X}\), then \(\boldsymbol{X}^\intercal\boldsymbol{q}= \boldsymbol{0}\). Therefore, \[ \underset{\boldsymbol{X}}{\boldsymbol{P}} \boldsymbol{q}= \boldsymbol{X}(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\boldsymbol{q}= \boldsymbol{X}(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{0}= \boldsymbol{0}. \]
(d)
Since \(\boldsymbol{v}= \boldsymbol{X}\boldsymbol{a}\) for some \(\boldsymbol{a}\), and \(\boldsymbol{X}^\intercal\boldsymbol{q}= \boldsymbol{0}\), \[ \boldsymbol{v}^\intercal\boldsymbol{q}= (\boldsymbol{X}\boldsymbol{a})^\intercal\boldsymbol{q}= \boldsymbol{a}^\intercal\boldsymbol{X}^\intercal\boldsymbol{q}= \boldsymbol{a}^\intercal\boldsymbol{0}= 0. \]
(e)
From the definitions, \(\hat{\boldsymbol{Y}}= \boldsymbol{X}\hat{\boldsymbol{\beta}}= \boldsymbol{X}(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\boldsymbol{Y}= \underset{\boldsymbol{X}}{\boldsymbol{P}} \boldsymbol{Y}= \boldsymbol{Y}_{\parallel}\).
Note that \(\boldsymbol{Y}_{\perp} = \boldsymbol{Y}- \boldsymbol{Y}_{\parallel} = \boldsymbol{Y}- \hat{\boldsymbol{Y}}= \hat{\varepsilon}\) by definition of \(\hat{\varepsilon}\).
(f)
From part (e), \(\hat{\varepsilon}= \boldsymbol{Y}_{\perp}\) is orthogonal to each column of \(\boldsymbol{X}\), so \(\boldsymbol{X}^\intercal\hat{\varepsilon}= \boldsymbol{X}^\intercal\boldsymbol{Y}_{\perp} = \boldsymbol{0}\).
(g)
Since \(\boldsymbol{Y}= \hat{\boldsymbol{Y}}+ \hat{\varepsilon}\) and \(\hat{\boldsymbol{Y}}^\intercal\hat{\varepsilon}= 0\) (from parts (d) and (f)), \[ \left\Vert\boldsymbol{Y}\right\Vert_2^2 = \boldsymbol{Y}^\intercal\boldsymbol{Y}= (\hat{\boldsymbol{Y}}+ \hat{\varepsilon})^\intercal(\hat{\boldsymbol{Y}}+ \hat{\varepsilon}) = \hat{\boldsymbol{Y}}^\intercal\hat{\boldsymbol{Y}}+ 2 \hat{\boldsymbol{Y}}^\intercal\hat{\varepsilon}+ \hat{\varepsilon}^\intercal\hat{\varepsilon}= \hat{\boldsymbol{Y}}^\intercal\hat{\boldsymbol{Y}}+ \hat{\varepsilon}^\intercal\hat{\varepsilon}= \left\Vert\hat{\boldsymbol{Y}}\right\Vert_2^2 + \left\Vert\hat{\varepsilon}\right\Vert_2^2. \]
(h)
Using \(\boldsymbol{Y}= \hat{\varepsilon}+ \boldsymbol{X}\hat{\boldsymbol{\beta}}\), \[ \begin{aligned} \left\Vert\boldsymbol{Y}- \boldsymbol{X}\boldsymbol{\gamma}\right\Vert_2^2 &= \left\Vert\hat{\varepsilon}+ \boldsymbol{X}\hat{\boldsymbol{\beta}}- \boldsymbol{X}\boldsymbol{\gamma}\right\Vert_2^2 \\ &= \left\Vert\hat{\varepsilon}+ \boldsymbol{X}(\hat{\boldsymbol{\beta}}- \boldsymbol{\gamma})\right\Vert_2^2 \\ &= \left\Vert\hat{\varepsilon}\right\Vert_2^2 + 2 \hat{\varepsilon}^\intercal\boldsymbol{X}(\hat{\boldsymbol{\beta}}- \boldsymbol{\gamma}) + \left\Vert\boldsymbol{X}(\hat{\boldsymbol{\beta}}- \boldsymbol{\gamma})\right\Vert_2^2 \\ &= \left\Vert\hat{\varepsilon}\right\Vert_2^2 + 0 + \left\Vert\boldsymbol{X}(\hat{\boldsymbol{\beta}}- \boldsymbol{\gamma})\right\Vert_2^2 \\ &\ge \left\Vert\hat{\varepsilon}\right\Vert_2^2, \end{aligned} \] with equality when \(\boldsymbol{X}(\hat{\boldsymbol{\beta}}- \boldsymbol{\gamma}) = \boldsymbol{0}\). Since \(\boldsymbol{X}\) is full rank, this occurs only when \(\boldsymbol{\gamma}= \hat{\boldsymbol{\beta}}\).
(i)
If \(\boldsymbol{Y}= \boldsymbol{X}\boldsymbol{a}\), then \(\boldsymbol{Y}\) is already in the span of the columns of \(\boldsymbol{X}\). By part (b), \(\underset{\boldsymbol{X}}{\boldsymbol{P}} \boldsymbol{Y}= \boldsymbol{Y}\), so \(\hat{\boldsymbol{Y}}= \boldsymbol{Y}\). Then \(\hat{\boldsymbol{\varepsilon}}= \boldsymbol{Y}- \hat{\boldsymbol{Y}}= \boldsymbol{0}\).
(j)
When the regression includes an intercept, \(\boldsymbol{1}\) is a column of \(\boldsymbol{X}\). By part (f), \(\boldsymbol{X}^\intercal\hat{\varepsilon}= \boldsymbol{0}\), which means \(\boldsymbol{1}^\intercal\hat{\varepsilon}= 0\). But \(\boldsymbol{1}^\intercal\hat{\varepsilon}= \sum_{n=1}^N \hat{\varepsilon}_n = N \frac{1}{N} \sum_{n=1}^N\hat{\varepsilon}_n\). Therefore \(\frac{1}{N} \sum_{n=1}^N\hat{\varepsilon}_n = 0\).
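The projection identities above can all be confirmed on simulated data. A minimal numerical sketch assuming numpy; the design and response are arbitrary illustrations.

```python
# Check the projection identities: residuals orthogonal to X (f),
# the Pythagorean norm decomposition (g), and mean-zero residuals
# when X contains an intercept column (j).
import numpy as np

rng = np.random.default_rng(3)
N, P = 40, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, P - 1))])
y = rng.normal(size=N)

P_X = X @ np.linalg.solve(X.T @ X, X.T)   # projection matrix onto col(X)
y_hat = P_X @ y
resid = y - y_hat

assert np.allclose(X.T @ resid, 0.0)                     # part (f)
assert np.isclose(y @ y, y_hat @ y_hat + resid @ resid)  # part (g)
assert np.isclose(resid.mean(), 0.0)                     # part (j)
```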
4 Changes of basis
In this problem, we show how rewriting a regression problem using an invertible linear transformation of the regressors changes the estimated coefficients, but does not change the fit.
Consider a regression setup where \(\boldsymbol{X}\) is full-rank with \(P < N\). Let \(\boldsymbol{z}_n = \boldsymbol{A}^\intercal\boldsymbol{x}_n\) for each \(n\), where \(\boldsymbol{A}\) is an invertible \(P \times P\) matrix, so that \(\boldsymbol{Z}= \boldsymbol{X}\boldsymbol{A}\). Consider the regressions \[ \boldsymbol{Y}\sim \boldsymbol{X}\boldsymbol{\beta}\quad\textrm{and}\quad \boldsymbol{Y}\sim \boldsymbol{Z}\boldsymbol{\gamma}, \] with corresponding OLS estimates \(\hat{\boldsymbol{\beta}}\) and \(\hat{\boldsymbol{\gamma}}\).
(a) Find an explicit expression for \(\hat{\boldsymbol{\gamma}}\) in terms of \(\boldsymbol{A}\) and \(\hat{\boldsymbol{\beta}}\). Hint: Plug in \(\boldsymbol{Z}= \boldsymbol{X}\boldsymbol{A}\) and simplify using the linear algebra fact that \((\boldsymbol{A}\boldsymbol{B}\boldsymbol{Q})^{-1} = \boldsymbol{Q}^{-1} \boldsymbol{B}^{-1} \boldsymbol{A}^{-1}\) when all inverses exist.
(b) Using your result from (a), show that \(\boldsymbol{x}_n^\intercal\hat{\boldsymbol{\beta}}= \boldsymbol{z}_n^\intercal\hat{\boldsymbol{\gamma}}\). In other words, show that \(\hat{y}_n\) is the same whether you regress on \(\boldsymbol{X}\) or on \(\boldsymbol{Z}\).
(c) Let \(R^2_x\) and \(R^2_z\) denote the \(R^2\) fits from the regression on \(\boldsymbol{X}\) and \(\boldsymbol{Z}\) respectively. Show that \(R^2_x= R^2_z\). Hint: Using (b), what do you know about the residuals for the two regressions?
(a)
\[ \begin{aligned} \hat{\boldsymbol{\gamma}}&= (\boldsymbol{Z}^\intercal\boldsymbol{Z})^{-1} \boldsymbol{Z}^\intercal\boldsymbol{Y}\\ &= ((\boldsymbol{X}\boldsymbol{A})^\intercal(\boldsymbol{X}\boldsymbol{A}))^{-1} (\boldsymbol{X}\boldsymbol{A})^\intercal\boldsymbol{Y}\\ &= (\boldsymbol{A}^\intercal\boldsymbol{X}^\intercal\boldsymbol{X}\boldsymbol{A})^{-1} \boldsymbol{A}^\intercal\boldsymbol{X}^\intercal\boldsymbol{Y}\\ &= \boldsymbol{A}^{-1} (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} (\boldsymbol{A}^\intercal)^{-1} \boldsymbol{A}^\intercal\boldsymbol{X}^\intercal\boldsymbol{Y}\\ &= \boldsymbol{A}^{-1} (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\boldsymbol{Y}\\ &= \boldsymbol{A}^{-1} \hat{\boldsymbol{\beta}}. \end{aligned} \]
(b)
\[ \boldsymbol{z}_n^\intercal\hat{\boldsymbol{\gamma}}= (\boldsymbol{A}^\intercal\boldsymbol{x}_n)^\intercal\boldsymbol{A}^{-1} \hat{\boldsymbol{\beta}}= \boldsymbol{x}_n^\intercal\boldsymbol{A}\boldsymbol{A}^{-1} \hat{\boldsymbol{\beta}}= \boldsymbol{x}_n^\intercal\hat{\boldsymbol{\beta}}. \]
Thus \(\hat{y}_n\) is the same for both regressions.
(c)
From part (b), the fitted values \(\hat{y}_n\) are identical for the two regressions. Therefore, the residuals \(\hat{\varepsilon}_n = y_n - \hat{y}_n\) are also identical. Since \(R^2 = 1 - \frac{\sum_n \hat{\varepsilon}_n^2}{\sum_n (y_n - \bar{y})^2}\) and both the numerator and the denominator are the same for the two regressions, we have \(R^2_x= R^2_z\).
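The change-of-basis identities can be checked numerically. The sketch below assumes numpy; the particular invertible \(\boldsymbol{A}\) and the simulated data are illustrative choices.

```python
# Regressing on Z = X A gives gamma_hat = A^{-1} beta_hat and
# identical fitted values (hence identical residuals and R^2).
import numpy as np

rng = np.random.default_rng(4)
N = 30
X = np.column_stack([np.ones(N), rng.normal(size=N)])
y = rng.normal(size=N)
A = np.array([[1.0, 2.0], [0.5, 3.0]])    # invertible: det = 2.0

Z = X @ A
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
gamma_hat = np.linalg.solve(Z.T @ Z, Z.T @ y)

assert np.allclose(gamma_hat, np.linalg.solve(A, beta_hat))  # part (a)
assert np.allclose(X @ beta_hat, Z @ gamma_hat)              # part (b)
```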
5 Changes of basis continued
This is an application of the change of basis problem to one-hot indicators with a constant. Let \(\boldsymbol{z}_n\) be a one-hot encoding of a binary regressor, so that \(\boldsymbol{z}_n = (1, 0)^\intercal\) or \(\boldsymbol{z}_n = (0, 1)^\intercal\), which encodes a binary variable \(a_n \in \{A, B\}\). For example, \(\boldsymbol{z}_n\) might encode \(\textrm{Kitchen Quality} \in \{\textrm{Good}, \textrm{Excellent}\}\), where \(a_n = \textrm{Kitchen Quality}\), \(A = \textrm{Good}\), and \(B = \textrm{Excellent}\).
Let \(\boldsymbol{x}_n\) be the regressor for the regression \(y\sim 1 + z_2\), the regression on a constant and the second component of \(\boldsymbol{z}_n\), so \(\boldsymbol{x}_n = (1, \mathbb{I}\left(a_n = B\right))^\intercal\).
Consider the regressions \[ \boldsymbol{Y}\sim \boldsymbol{X}\boldsymbol{\beta}\quad\textrm{and}\quad \boldsymbol{Y}\sim \boldsymbol{Z}\boldsymbol{\gamma}, \] with corresponding OLS estimates \(\hat{\boldsymbol{\beta}}\) and \(\hat{\boldsymbol{\gamma}}\).
(a) Find an \(\boldsymbol{A}\) such that \(\boldsymbol{z}_n = \boldsymbol{A}^\intercal\boldsymbol{x}_n\) in this case, and prove that it’s invertible.
(b) Write an expression for \(\hat{\boldsymbol{\gamma}}\) in terms of \(\bar{y}_A\) and \(\bar{y}_B\), which are respectively the average responses when \(a_n = A\) and \(a_n = B\).
(c) Write an expression for \(\hat{\beta}_2\) in terms of \(\bar{y}_A\) and \(\bar{y}_B\) using the fact that \(\hat{y}_n\) must be the same for the two regressions.
(d) In ordinary language, what is the difference in the interpretation of the coefficient \(\hat{\beta}_2\) and \(\hat{\gamma}_2\)?
(a)
We need \(\boldsymbol{z}_n = \boldsymbol{A}^\intercal\boldsymbol{x}_n\) where \(\boldsymbol{x}_n = (1, \mathbb{I}\left(a_n = B\right))^\intercal\).
When \(a_n = A\): \(\boldsymbol{x}_n = (1, 0)^\intercal\) and \(\boldsymbol{z}_n = (1, 0)^\intercal\). When \(a_n = B\): \(\boldsymbol{x}_n = (1, 1)^\intercal\) and \(\boldsymbol{z}_n = (0, 1)^\intercal\).
So we need: \[ \boldsymbol{A}^\intercal\begin{pmatrix} 1 \\ 0 \end{pmatrix} = \begin{pmatrix} 1 \\ 0 \end{pmatrix} \quad\textrm{and}\quad \boldsymbol{A}^\intercal\begin{pmatrix} 1 \\ 1 \end{pmatrix} = \begin{pmatrix} 0 \\ 1 \end{pmatrix}. \]
This gives \(\boldsymbol{A}^\intercal= \begin{pmatrix} 1 & -1 \\ 0 & 1 \end{pmatrix}\), so \(\boldsymbol{A}= \begin{pmatrix} 1 & 0 \\ -1 & 1 \end{pmatrix}\).
The determinant is \(\det(\boldsymbol{A}) = 1 \ne 0\), so \(\boldsymbol{A}\) is invertible.
(b)
From the one-hot encoding problem (Problem 2), when regressing \(\boldsymbol{Y}\) on \(\boldsymbol{Z}\) (the one-hot encoding without intercept), the OLS estimate is simply the group means: \[ \hat{\boldsymbol{\gamma}}= \begin{pmatrix} \bar{y}_A \\ \bar{y}_B \end{pmatrix}. \]
(c)
From the change of basis result, the fitted values must be identical. For an observation with \(a_n = A\): \[ \hat{y}_n = \boldsymbol{z}_n^\intercal\hat{\boldsymbol{\gamma}}= (1, 0) \begin{pmatrix} \bar{y}_A \\ \bar{y}_B \end{pmatrix} = \bar{y}_A. \]
For the regression on \(\boldsymbol{x}_n = (1, 0)^\intercal\): \[ \hat{y}_n = \boldsymbol{x}_n^\intercal\hat{\boldsymbol{\beta}}= \hat{\beta}_1 + 0 \cdot \hat{\beta}_2 = \hat{\beta}_1. \]
So \(\hat{\beta}_1 = \bar{y}_A\).
For an observation with \(a_n = B\): \[ \hat{y}_n = \boldsymbol{z}_n^\intercal\hat{\boldsymbol{\gamma}}= (0, 1) \begin{pmatrix} \bar{y}_A \\ \bar{y}_B \end{pmatrix} = \bar{y}_B. \]
For the regression on \(\boldsymbol{x}_n = (1, 1)^\intercal\): \[ \hat{y}_n = \hat{\beta}_1 + \hat{\beta}_2 = \bar{y}_B. \]
Therefore \(\hat{\beta}_2 = \bar{y}_B - \hat{\beta}_1 = \bar{y}_B - \bar{y}_A\).
Note that an equivalent way to solve this would be to use the formula \(\hat{\boldsymbol{\gamma}}= \boldsymbol{A}^{-1} \hat{\boldsymbol{\beta}}\), which gives \(\hat{\boldsymbol{\beta}}= \boldsymbol{A}\hat{\boldsymbol{\gamma}}\), so
\[ \hat{\boldsymbol{\beta}} = \begin{pmatrix} 1 & 0 \\ -1 & 1 \end{pmatrix} \begin{pmatrix} \bar{y}_A \\ \bar{y}_B \end{pmatrix} = \begin{pmatrix} \bar{y}_A \\ \bar{y}_B - \bar{y}_A \end{pmatrix}. \]
(d)
\(\hat{\gamma}_2 = \bar{y}_B\) is the average response for observations in group \(B\).
\(\hat{\beta}_2 = \bar{y}_B - \bar{y}_A\) is the difference in average response between group \(B\) and group \(A\). It represents the effect of being in group \(B\) relative to the baseline group \(A\).
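This interpretation can be confirmed numerically. A minimal sketch assuming numpy, with a deterministic alternating group assignment chosen purely for illustration:

```python
# Intercept-plus-indicator coding recovers (ybar_A, ybar_B - ybar_A);
# one-hot coding recovers the group means (ybar_A, ybar_B).
import numpy as np

rng = np.random.default_rng(5)
is_B = np.tile([0.0, 1.0], 25)            # alternating group labels
y = rng.normal(size=50) + 2.0 * is_B

X = np.column_stack([np.ones(50), is_B])  # constant + indicator of B
Z = np.column_stack([1.0 - is_B, is_B])   # one-hot encoding
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
gamma_hat = np.linalg.solve(Z.T @ Z, Z.T @ y)

ybar_A, ybar_B = y[is_B == 0].mean(), y[is_B == 1].mean()
assert np.allclose(gamma_hat, [ybar_A, ybar_B])
assert np.allclose(beta_hat, [ybar_A, ybar_B - ybar_A])
```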
6 Linear combinations of regressors (based on Freedman (2009) Chapter 4 set B exercise 14)
Let \(\hat{\boldsymbol{\beta}}\) be the OLS estimator \(\hat{\boldsymbol{\beta}}= (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\boldsymbol{Y}\), where the design matrix \(\boldsymbol{X}\) has full rank \(P < N\). Assume that, for each \(n\) and some \(\boldsymbol{\beta}\), \(y_n = \boldsymbol{x}_n^\intercal\boldsymbol{\beta}+ \varepsilon_n\), where \(\mathbb{E}\left[\varepsilon_n \vert \boldsymbol{x}_n\right] = 0\), the \(\varepsilon_n\) are IID, and \(\mathrm{Var}\left(\varepsilon_n \vert \boldsymbol{x}_n\right) = \sigma^2\).
(a) Write an expression for \(\boldsymbol{Y}\) in terms of \(\boldsymbol{X}\), \(\boldsymbol{\beta}\), and \(\boldsymbol{\varepsilon}\).
(b) Plug your result from (a) into the OLS formula to get an expression for \(\hat{\boldsymbol{\beta}}\).
(c) Using your result from (b), what is \(\mathbb{E}\left[\hat{\boldsymbol{\beta}}\vert \boldsymbol{X}\right]\)?
(d) Using your result from (b), what is \(\mathrm{Cov}\left(\hat{\boldsymbol{\beta}}\vert \boldsymbol{X}\right)\) in terms of \(\sigma^2\) and \(\boldsymbol{X}\)?
(e) Find an \(\boldsymbol{a}\) such that \(\boldsymbol{a}^\intercal\hat{\boldsymbol{\beta}}= \hat{\beta}_1 - \hat{\beta}_2\).
(f) For a given \(\boldsymbol{a}\), find an expression for \(\mathbb{E}\left[\boldsymbol{a}^\intercal\hat{\boldsymbol{\beta}}\vert \boldsymbol{X}\right]\).
(g) For a given \(\boldsymbol{a}\), find an expression for \(\mathrm{Var}\left(\boldsymbol{a}^\intercal\hat{\boldsymbol{\beta}}\vert \boldsymbol{X}\right)\).
(a)
\[ \boldsymbol{Y}= \boldsymbol{X}\boldsymbol{\beta}+ \boldsymbol{\varepsilon} \]
where \(\boldsymbol{\varepsilon}= (\varepsilon_1, \ldots, \varepsilon_N)^\intercal\).
(b)
\[ \begin{aligned} \hat{\boldsymbol{\beta}}&= (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\boldsymbol{Y}\\ &= (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal(\boldsymbol{X}\boldsymbol{\beta}+ \boldsymbol{\varepsilon}) \\ &= (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\boldsymbol{X}\boldsymbol{\beta}+ (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\boldsymbol{\varepsilon}\\ &= \boldsymbol{\beta}+ (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\boldsymbol{\varepsilon}. \end{aligned} \]
(c)
\[ \begin{aligned} \mathbb{E}\left[\hat{\boldsymbol{\beta}}\vert \boldsymbol{X}\right] &= \mathbb{E}\left[\boldsymbol{\beta}+ (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\boldsymbol{\varepsilon}\vert \boldsymbol{X}\right] \\ &= \boldsymbol{\beta}+ (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\mathbb{E}\left[\boldsymbol{\varepsilon}\vert \boldsymbol{X}\right] \\ &= \boldsymbol{\beta}+ (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\boldsymbol{0}\\ &= \boldsymbol{\beta}. \end{aligned} \]
So \(\hat{\boldsymbol{\beta}}\) is an unbiased estimator of \(\boldsymbol{\beta}\).
(d)
\[ \begin{aligned} \mathrm{Cov}\left(\hat{\boldsymbol{\beta}}\vert \boldsymbol{X}\right) &= \mathrm{Cov}\left((\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\boldsymbol{\varepsilon}\vert \boldsymbol{X}\right) \\ &= (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\mathrm{Cov}\left(\boldsymbol{\varepsilon}\vert \boldsymbol{X}\right) \boldsymbol{X}(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \\ &= (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal(\sigma^2 \boldsymbol{I}) \boldsymbol{X}(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \\ &= \sigma^2 (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\boldsymbol{X}(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \\ &= \sigma^2 (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1}. \end{aligned} \]
(e)
\[ \boldsymbol{a}= (1, -1, 0, \ldots, 0)^\intercal \]
Then \(\boldsymbol{a}^\intercal\hat{\boldsymbol{\beta}}= \hat{\beta}_1 - \hat{\beta}_2\).
(f)
\[ \mathbb{E}\left[\boldsymbol{a}^\intercal\hat{\boldsymbol{\beta}}\vert \boldsymbol{X}\right] = \boldsymbol{a}^\intercal\mathbb{E}\left[\hat{\boldsymbol{\beta}}\vert \boldsymbol{X}\right] = \boldsymbol{a}^\intercal\boldsymbol{\beta}= \beta_1 - \beta_2. \]
(g)
\[ \mathrm{Var}\left(\boldsymbol{a}^\intercal\hat{\boldsymbol{\beta}}\vert \boldsymbol{X}\right) = \boldsymbol{a}^\intercal\mathrm{Cov}\left(\hat{\boldsymbol{\beta}}\vert \boldsymbol{X}\right) \boldsymbol{a}= \sigma^2 \boldsymbol{a}^\intercal(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{a}. \]
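The variance formula can be checked by Monte Carlo. The sketch below assumes numpy; the simulation size, design, and true parameters are arbitrary illustrative choices, and the tolerance simply reflects Monte Carlo error.

```python
# Monte Carlo check of parts (d) and (g): with X held fixed, simulate
# many datasets and compare the empirical variance of a^T beta_hat
# with sigma^2 a^T (X^T X)^{-1} a.
import numpy as np

rng = np.random.default_rng(6)
N, sigma = 100, 1.5
X = np.column_stack([np.ones(N), rng.normal(size=N)])
beta = np.array([1.0, -2.0])
a = np.array([1.0, -1.0])             # picks out beta_1 - beta_2

XtX_inv = np.linalg.inv(X.T @ X)
draws = np.array([
    np.linalg.solve(X.T @ X, X.T @ (X @ beta + sigma * rng.normal(size=N)))
    for _ in range(20000)
])

theory = sigma**2 * (a @ XtX_inv @ a)
empirical = (draws @ a).var()
assert abs(empirical - theory) / theory < 0.1
```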