STAT151A Homework 1
1 Ordinary least squares in matrix form
Consider simple least squares regression \(y_n = \beta_1 + \beta_2 x_n + \varepsilon_n\), where \(x_n\) is a scalar. Assume that we have \(N\) datapoints. We showed directly that the least–squares solution is given by
\[ \hat{\beta}_1 = \overline{y} - \hat{\beta}_2 \overline{x} \quad\text{and}\quad \hat{\beta}_2 = \frac{\overline{xy} - \overline{x} \, \overline{y}} {\overline{xx} - \overline{x}^2}. \]
Let us re–derive this using matrix notation.
(a) Write simple linear regression in the form \(\boldsymbol{Y}= \boldsymbol{X}\boldsymbol{\beta}+ \boldsymbol{\varepsilon}\). Be precise about what goes into each entry of \(\boldsymbol{Y}\), \(\boldsymbol{X}\), \(\boldsymbol{\beta}\), and \(\boldsymbol{\varepsilon}\). What are the dimensions of each?
(b) We proved that the optimal \(\hat{\boldsymbol{\beta}}\) satisfies \(\boldsymbol{X}^\intercal\boldsymbol{X}\hat{\boldsymbol{\beta}}= \boldsymbol{X}^\intercal\boldsymbol{Y}\). Define the “barred quantities” \[ \begin{aligned} \overline{y} ={}& \frac{1}{N} \sum_{n=1}^N y_n \\ \overline{x} ={}& \frac{1}{N} \sum_{n=1}^N x_n \\ \overline{xy} ={}& \frac{1}{N} \sum_{n=1}^N x_n y_n \\ \overline{xx} ={}& \frac{1}{N} \sum_{n=1}^N x_n^2. \end{aligned} \]
In terms of the barred quantities and the number of datapoints \(N\), write expressions for \(\boldsymbol{X}^\intercal\boldsymbol{X}\) and \(\boldsymbol{X}^\intercal\boldsymbol{Y}\).
(c) When is \(\boldsymbol{X}^\intercal\boldsymbol{X}\) invertible? Write a formal expression in terms of the barred quantities. Interpret this condition intuitively in terms of the distribution of the regressors \(x_n\).
(d) Using the formula for the inverse of a \(2\times 2\) matrix, find an expression for \(\hat{\boldsymbol{\beta}}\), and confirm that we get the same answer that we got by solving directly.
(e) In the case where \(\boldsymbol{X}^\intercal\boldsymbol{X}\) is not invertible, find three distinct values of \(\boldsymbol{\beta}\) that all achieve the same sum of squared residuals \(\boldsymbol{\varepsilon}^\intercal\boldsymbol{\varepsilon}\).
(a)
\[ \boldsymbol{X}= \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_N \end{pmatrix} \quad\quad \boldsymbol{Y}= \begin{pmatrix} y_1 \\ \vdots \\ y_N \end{pmatrix} \quad\quad \boldsymbol{\varepsilon}= \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_N \end{pmatrix} \quad\quad \boldsymbol{\beta}= \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} \]
These are \(N \times 2\), \(N \times 1\), \(N \times 1\), and \(2 \times 1\) respectively.
(b)
\[ \boldsymbol{X}^\intercal\boldsymbol{X}= N \begin{pmatrix} 1 & \bar{x}\\ \bar{x}& \overline{xx} \end{pmatrix} \quad\quad \boldsymbol{X}^\intercal\boldsymbol{Y}= N \begin{pmatrix} \bar{y}\\ \overline{xy} \end{pmatrix} \]
(c)
\(\boldsymbol{X}^\intercal\boldsymbol{X}\) is invertible if and only if its determinant, \(N^2 (\overline{xx} - \bar{x}\, \bar{x})\), is nonzero. This occurs exactly when the sample variance of \(x_n\) is greater than zero.
(d)
\[ \begin{aligned} (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\boldsymbol{Y}={}& \frac{1}{N (\overline{xx} - \bar{x}\, \bar{x})} \begin{pmatrix} \overline{xx} & -\bar{x}\\ -\bar{x}& 1 \end{pmatrix} N \begin{pmatrix} \bar{y}\\ \overline{xy} \end{pmatrix} \\={}& \frac{1}{\overline{xx} - \bar{x}\, \bar{x}} \begin{pmatrix} \bar{y}\, \overline{xx} - \bar{x}\, \overline{xy} \\ -\bar{y}\, \bar{x}+ \overline{xy} \end{pmatrix} \end{aligned}. \]
We already have \(\hat{\beta}_2 = (\overline{xy} - \bar{y}\, \bar{x}) / (\overline{xx} - \bar{x}\, \bar{x})\) as expected. To see that \(\hat{\beta}_1\) is correct, write
\[ \begin{aligned} \bar{y}\, \overline{xx} - \bar{x}\, \overline{xy} ={}& \bar{y}\, \overline{xx} - \bar{y}\, \bar{x}\, \bar{x}+ \bar{y}\, \bar{x}\, \bar{x}- \bar{x}\, \overline{xy} \\={}& \bar{y}(\overline{xx} - \bar{x}\, \bar{x}) - \bar{x}(\overline{xy} - \bar{y}\, \bar{x}). \end{aligned} \]
Plugging this in gives
\[ \begin{aligned} \frac{\bar{y}\, \overline{xx} - \bar{x}\, \overline{xy}}{\overline{xx} - \bar{x}\, \bar{x}} ={}& \frac{\bar{y}(\overline{xx} - \bar{x}\, \bar{x}) - \bar{x}(\overline{xy} - \bar{y}\, \bar{x})} {\overline{xx} - \bar{x}\, \bar{x}} \\={}& \bar{y}- \bar{x}\hat{\beta}_2, \end{aligned} \]
as expected.
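The agreement between the matrix-form and direct solutions can also be checked numerically. The following is a minimal sketch assuming numpy; the simulated data and all variable names are purely illustrative.

```python
# Compare the matrix-form OLS solution with the closed-form
# intercept and slope from the barred quantities.
import numpy as np

rng = np.random.default_rng(0)
N = 50
x = rng.normal(size=N)
y = 2.0 + 3.0 * x + rng.normal(size=N)

# Matrix form: solve (X^T X) beta = X^T Y.
X = np.column_stack([np.ones(N), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Closed form in terms of the barred quantities.
slope = (np.mean(x * y) - np.mean(x) * np.mean(y)) / (np.mean(x**2) - np.mean(x)**2)
intercept = np.mean(y) - slope * np.mean(x)

assert np.allclose(beta_hat, [intercept, slope])
```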
(e)
When \(\boldsymbol{X}^\intercal\boldsymbol{X}\) is not invertible, the sample variance of \(x_n\) is zero, meaning all \(x_n\) are equal to some constant \(c = \bar{x}\). In this case, the fit is \(\hat{y}_n = \beta_1 + \beta_2 \bar{x}\), which depends only on the sum \(\beta_1 + \beta_2 \bar{x}\). Any \(\boldsymbol{\beta}\) with the same value of \(\beta_1 + \beta_2 \bar{x}\) gives the same residuals. Three such values that set \(\beta_1 + \beta_2 \bar{x}= \bar{y}\):
\[ \begin{aligned} \boldsymbol{\beta}&= (\bar{y}, 0)^\intercal\\ \boldsymbol{\beta}&= (\bar{y}- \bar{x}, 1)^\intercal\\ \boldsymbol{\beta}&= (\bar{y}- 2\bar{x}, 2)^\intercal \end{aligned} \] All three give \(\hat{y}_n = \bar{y}\) for all \(n\), and thus the same sum of squared residuals \(\sum_n (y_n - \bar{y})^2\).
Since \(\hat{y}_n = \bar{y}\), these values also happen to minimize the sum of squared residuals. However, a perfectly good answer to the question as asked could instead set \(\beta_1 + \beta_2 \bar{x}= d\) for any value \(d\): all such solutions give the same residuals as one another, though the residuals are suboptimal when \(d \ne \bar{y}\).
Finally, note that if \(\boldsymbol{X}^\intercal\boldsymbol{X}\) is not invertible, then the vector \((\bar{x}, -1)^\intercal\) is an eigenvector of \(\boldsymbol{X}^\intercal\boldsymbol{X}\) with eigenvalue zero: \[ \boldsymbol{X}^\intercal\boldsymbol{X} \begin{pmatrix} \bar{x}\\ -1 \end{pmatrix} = N \begin{pmatrix} 1 & \bar{x}\\ \bar{x}& \bar{x}^2 \end{pmatrix} \begin{pmatrix} \bar{x}\\ -1 \end{pmatrix} = N \begin{pmatrix} \bar{x}- \bar{x}\\ \bar{x}^2 - \bar{x}^2 \end{pmatrix} = \boldsymbol{0}. \]
This means that any coefficient vector of the form \[ \boldsymbol{\beta}= \boldsymbol{\beta}_0 + c \begin{pmatrix} \bar{x}\\ -1 \end{pmatrix}, \] for any \(\boldsymbol{\beta}_0\) and \(c\), gives the same \(\hat{y}_n\). For example, the above solutions can be generated by taking \(\boldsymbol{\beta}_0 = (\bar{y}, 0)^\intercal\), and the values \(c = 0\), \(-1\), and \(-2\). Since any such \(c\) gives an equivalent linear fit, there is an infinite, one-dimensional set of equivalent regression lines, corresponding to the zero eigenvector of \(\boldsymbol{X}^\intercal\boldsymbol{X}\).
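This degeneracy is easy to see numerically. The following sketch (assuming numpy, with illustrative simulated data) confirms that all three coefficient vectors produce identical residuals when every \(x_n\) is the same constant.

```python
# When all x_n equal a constant, shifting beta along (x_bar, -1)
# leaves the fitted values, and hence the residuals, unchanged.
import numpy as np

N = 20
xbar = 3.0
x = np.full(N, xbar)                 # zero sample variance
rng = np.random.default_rng(1)
y = rng.normal(loc=5.0, size=N)
X = np.column_stack([np.ones(N), x])

ybar = y.mean()
betas = [np.array([ybar, 0.0]),
         np.array([ybar - xbar, 1.0]),
         np.array([ybar - 2 * xbar, 2.0])]
rss = [np.sum((y - X @ b) ** 2) for b in betas]

# All three coefficient vectors give the same sum of squared residuals.
assert np.allclose(rss, rss[0])
```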
2 One-hot encoding
Consider a one–hot encoding of a variable \(z_n\) that takes three distinct values, “a”, “b”, and “c”. That is, let
\[ \boldsymbol{x}_n = \begin{cases} \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix} & \textrm{ when }z_n = a \\ \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix} & \textrm{ when }z_n = b \\ \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} & \textrm{ when }z_n = c \\ \end{cases} \]
Let \(\boldsymbol{X}\) be the regressor matrix with \(\boldsymbol{x}_n^\intercal\) in row \(n\).
(a)
Let \(N_a\) be the number of observations with \(z_n\) = a, and let \(\sum_{n:z_n = a}\) denote a sum over rows with \(z_n\) = a, with analogous definitions for b and c. In terms of these quantities, write expressions for \(\boldsymbol{X}^\intercal\boldsymbol{X}\) and \(\boldsymbol{X}^\intercal\boldsymbol{Y}\).
(b)
When is \(\boldsymbol{X}^\intercal\boldsymbol{X}\) invertible? Explain intuitively why the regression problem cannot be solved when \(\boldsymbol{X}^\intercal\boldsymbol{X}\) is not invertible. Write an explicit expression for \((\boldsymbol{X}^\intercal\boldsymbol{X})^{-1}\) when it is invertible.
(c)
Using your previous answer, show that the entries of the least squares vector \(\hat{\boldsymbol{\beta}}\) are the means of \(y_n\) within the distinct values of \(z_n\).
(d)
Suppose now you include a constant in the regression, so that
\[ y_n = \alpha + \boldsymbol{\beta}^\intercal\boldsymbol{x}_n + \varepsilon_n, \]
and let \(\boldsymbol{X}'\) denote the regressor matrix for this regression with coefficient vector \((\alpha, \boldsymbol{\beta}^\intercal)^\intercal\). Write an expression for \(\boldsymbol{X}'^\intercal\boldsymbol{X}'\) and show that it is not invertible.
(e)
Find three distinct values of \((\alpha, \boldsymbol{\beta}^\intercal)\) that all give the exact same fit \(\alpha + \boldsymbol{\beta}^\intercal\boldsymbol{x}_n\).
(a)
\[ \boldsymbol{X}^\intercal\boldsymbol{X}= \begin{pmatrix} N_a & 0 & 0 \\ 0 & N_b & 0 \\ 0 & 0 & N_c \\ \end{pmatrix} \quad\quad \boldsymbol{X}^\intercal\boldsymbol{Y}= \begin{pmatrix} \sum_{n:z_n = a} y_n \\ \sum_{n:z_n = b} y_n \\ \sum_{n:z_n = c} y_n \\ \end{pmatrix}. \]
(b)
It is invertible as long as each of \(N_a\), \(N_b\), and \(N_c\) is nonzero. If there are no observations for a particular level, you of course cannot estimate its relationship with \(y_n\). When \(\boldsymbol{X}^\intercal\boldsymbol{X}\) is invertible, then
\[ (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} = \begin{pmatrix} 1/N_a & 0 & 0 \\ 0 & 1/N_b & 0 \\ 0 & 0 & 1/N_c \\ \end{pmatrix}. \]
(c)
By direct multiplication,
\[ \hat{\boldsymbol{\beta}}= (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\boldsymbol{Y}= \begin{pmatrix} \frac{1}{N_a} \sum_{n:z_n = a} y_n \\ \frac{1}{N_b} \sum_{n:z_n = b} y_n \\ \frac{1}{N_c} \sum_{n:z_n = c} y_n \\ \end{pmatrix}. \]
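The group-means result can be verified numerically. A minimal sketch assuming numpy, with simulated labels and responses chosen purely for illustration:

```python
# With one-hot regressors and no intercept, the OLS coefficients
# are the within-group means of y.
import numpy as np

rng = np.random.default_rng(2)
z = rng.choice(["a", "b", "c"], size=60)
y = rng.normal(size=60)

# Build the one-hot regressor matrix, one column per level.
X = np.column_stack([(z == lvl).astype(float) for lvl in ["a", "b", "c"]])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

group_means = np.array([y[z == lvl].mean() for lvl in ["a", "b", "c"]])
assert np.allclose(beta_hat, group_means)
```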
(d)
\[ (\boldsymbol{X}')^\intercal\boldsymbol{X}' = \begin{pmatrix} N & N_a & N_b & N_c \\ N_a & N_a & 0 & 0 \\ N_b & 0 & N_b & 0 \\ N_c & 0 & 0 & N_c \\ \end{pmatrix}. \]
This is not invertible because the first column is the sum of the other three. Equivalently, \((\boldsymbol{X}')^\intercal\boldsymbol{X}' \boldsymbol{v}= \boldsymbol{0}\) where \(\boldsymbol{v}= (1, -1, -1, -1)^\intercal\).
(e)
Any fit of the form
\[ (\alpha + \beta_1) \mathbb{I}\left(z_n = a\right) + (\alpha + \beta_2) \mathbb{I}\left(z_n = b\right) + (\alpha + \beta_3) \mathbb{I}\left(z_n = c\right) \]
depends on the coefficients only through the sums \(\alpha + \beta_k\). Three equivalent coefficient vectors that also happen to solve the least squares problem are
\[ \begin{aligned} (\alpha, \beta_1, \beta_2, \beta_3)^\intercal ={}& (0, \hat{\boldsymbol{\beta}}^\intercal)^\intercal + (0, 0, 0, 0)^\intercal \\ (\alpha, \beta_1, \beta_2, \beta_3)^\intercal ={}& (0, \hat{\boldsymbol{\beta}}^\intercal)^\intercal + (1, -1, -1, -1)^\intercal \\ (\alpha, \beta_1, \beta_2, \beta_3)^\intercal ={}& (0, \hat{\boldsymbol{\beta}}^\intercal)^\intercal + (2, -2, -2, -2)^\intercal, \end{aligned} \]
where \(\hat{\boldsymbol{\beta}}\) is the vector of group means from (c). These are all of the form \((0, \hat{\boldsymbol{\beta}}^\intercal)^\intercal + C \boldsymbol{v}\) for \(C = 0\), \(C = 1\), and \(C = 2\), as they must be, where \(\boldsymbol{v}\) is the null vector from (d).
3 Linear regression as projections
In this problem, we show that the OLS regression acts like a “projection,” breaking \(\boldsymbol{Y}\) up into two orthogonal components: a component \(\hat{\boldsymbol{Y}}\) that is in the span of the columns of \(\boldsymbol{X}\), and a component \(\hat{\boldsymbol{\varepsilon}}\) that is orthogonal to the columns of \(\boldsymbol{X}\).
Consider a regression setup where \(\boldsymbol{X}\) is full-rank with \(P < N\). Let \(\hat{\boldsymbol{\beta}}\), \(\hat{\boldsymbol{Y}}\), and \(\hat{\varepsilon}\) denote the usual quantities. Write the \(p\)-th column of \(\boldsymbol{X}\) as \(\boldsymbol{X}_p\), so that \(\boldsymbol{X}= (\boldsymbol{X}_1, \ldots, \boldsymbol{X}_P)\), and define the \(N \times N\) “projection matrix” \(\underset{\boldsymbol{X}}{\boldsymbol{P}} := \boldsymbol{X}(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\).
(a) An \(N\)-vector \(\boldsymbol{v}\) is said to be “in the linear span of the columns of \(\boldsymbol{X}\)” if it can be written as \(\boldsymbol{v}= \sum_{p=1}^P a_p \boldsymbol{X}_p\) for some \(a_1,\ldots,a_P\). Show that such a vector can be written in the form \(\boldsymbol{X}\boldsymbol{a}\) for some \(P\)-vector \(\boldsymbol{a}\).
(b) Show that if \(\boldsymbol{v}\) is in the in the linear span of the columns of \(\boldsymbol{X}\), then \(\underset{\boldsymbol{X}}{\boldsymbol{P}} \boldsymbol{v}= \boldsymbol{v}\). That is, \(\underset{\boldsymbol{X}}{\boldsymbol{P}}\) leaves \(\boldsymbol{v}\) unchanged.
(c) Show that if an \(N\)-vector \(\boldsymbol{q}\) is orthogonal to each column of \(\boldsymbol{X}\), then \(\underset{\boldsymbol{X}}{\boldsymbol{P}} \boldsymbol{q}= \boldsymbol{0}\).
(d) Show that \(\boldsymbol{v}^\intercal\boldsymbol{q}= 0\), where \(\boldsymbol{v}\) and \(\boldsymbol{q}\) are as given in the preceding parts.
For the remaining problems, we introduce some additional notation. Note that, in general, the response vector \(\boldsymbol{Y}\) is just an \(N\)-vector. Write \(\boldsymbol{Y}= \boldsymbol{Y}_{\parallel} + \boldsymbol{Y}_{\perp}\), where \(\boldsymbol{Y}_{\parallel}\) is in the linear span of the columns of \(\boldsymbol{X}\), and \(\boldsymbol{Y}_{\perp}\) is orthogonal to the columns of \(\boldsymbol{X}\). We call \(\boldsymbol{Y}_{\parallel}\) the “projection of \(\boldsymbol{Y}\) onto the span of the columns of \(\boldsymbol{X}\).”
(e) Show that \(\hat{\boldsymbol{Y}}= \boldsymbol{Y}_{\parallel}\) and \(\hat{\varepsilon}= \boldsymbol{Y}_{\perp}\).
(f) Show that \(\boldsymbol{X}^\intercal\hat{\varepsilon}= \boldsymbol{0}\).
(g) Show that \(\left\Vert\boldsymbol{Y}\right\Vert_2^2 = \left\Vert\hat{\boldsymbol{Y}}\right\Vert_2^2 + \left\Vert\hat{\varepsilon}\right\Vert_2^2\).
(h) Let \(\boldsymbol{\gamma}\) denote a \(P \times 1\) vector. Using (b), show directly that \(\left\Vert\boldsymbol{Y}- \boldsymbol{X}\boldsymbol{\gamma}\right\Vert_2^2\) is smallest when \(\boldsymbol{\gamma}= \hat{\boldsymbol{\beta}}\). Hint: Plug in \(\boldsymbol{Y}= \hat{\varepsilon}+ \boldsymbol{X}\hat{\boldsymbol{\beta}}\), use \(\boldsymbol{X}\hat{\boldsymbol{\beta}}- \boldsymbol{X}\boldsymbol{\gamma}= \boldsymbol{X}(\hat{\boldsymbol{\beta}}- \boldsymbol{\gamma})\), and expand the square.
(i) Suppose that \(\boldsymbol{Y}= \boldsymbol{X}\boldsymbol{a}\) for some \(\boldsymbol{a}\). Show that in this case \(\hat{\boldsymbol{Y}}= \boldsymbol{Y}\) and \(\hat{\boldsymbol{\varepsilon}}= \boldsymbol{0}\).
(j) As a special case of (f), show that \(\frac{1}{N} \sum_{n=1}^N\hat{\varepsilon}_n = 0\) when the regression includes an intercept term. Hint: When the regression includes an intercept, then \(\boldsymbol{1}\) is a column of \(\boldsymbol{X}\).
(a)
By definition, \(\boldsymbol{v}= \sum_{p=1}^P a_p \boldsymbol{X}_p\). Let \(\boldsymbol{a}= (a_1, \ldots, a_P)^\intercal\). Then \(\boldsymbol{X}\boldsymbol{a}= \sum_{p=1}^P a_p \boldsymbol{X}_p = \boldsymbol{v}\).
(b)
If \(\boldsymbol{v}= \boldsymbol{X}\boldsymbol{a}\) for some \(\boldsymbol{a}\), then \[ \underset{\boldsymbol{X}}{\boldsymbol{P}} \boldsymbol{v}= \boldsymbol{X}(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\boldsymbol{X}\boldsymbol{a}= \boldsymbol{X}\boldsymbol{a}= \boldsymbol{v}. \]
(c)
If \(\boldsymbol{q}\) is orthogonal to each column of \(\boldsymbol{X}\), then \(\boldsymbol{X}^\intercal\boldsymbol{q}= \boldsymbol{0}\). Therefore, \[ \underset{\boldsymbol{X}}{\boldsymbol{P}} \boldsymbol{q}= \boldsymbol{X}(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\boldsymbol{q}= \boldsymbol{X}(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{0}= \boldsymbol{0}. \]
(d)
Since \(\boldsymbol{v}= \boldsymbol{X}\boldsymbol{a}\) for some \(\boldsymbol{a}\), and \(\boldsymbol{X}^\intercal\boldsymbol{q}= \boldsymbol{0}\), \[ \boldsymbol{v}^\intercal\boldsymbol{q}= (\boldsymbol{X}\boldsymbol{a})^\intercal\boldsymbol{q}= \boldsymbol{a}^\intercal\boldsymbol{X}^\intercal\boldsymbol{q}= \boldsymbol{a}^\intercal\boldsymbol{0}= 0. \]
(e)
From the definitions, \(\hat{\boldsymbol{Y}}= \boldsymbol{X}\hat{\boldsymbol{\beta}}= \boldsymbol{X}(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\boldsymbol{Y}= \underset{\boldsymbol{X}}{\boldsymbol{P}} \boldsymbol{Y}= \boldsymbol{Y}_{\parallel}\).
Note that \(\boldsymbol{Y}_{\perp} = \boldsymbol{Y}- \boldsymbol{Y}_{\parallel} = \boldsymbol{Y}- \hat{\boldsymbol{Y}}= \hat{\varepsilon}\) by definition of \(\hat{\varepsilon}\).
(f)
From part (e), \(\hat{\varepsilon}= \boldsymbol{Y}_{\perp}\) is orthogonal to each column of \(\boldsymbol{X}\), so \(\boldsymbol{X}^\intercal\hat{\varepsilon}= \boldsymbol{X}^\intercal\boldsymbol{Y}_{\perp} = \boldsymbol{0}\).
(g)
Since \(\boldsymbol{Y}= \hat{\boldsymbol{Y}}+ \hat{\varepsilon}\) and \(\hat{\boldsymbol{Y}}^\intercal\hat{\varepsilon}= 0\) (from parts (d) and (f)), \[ \left\Vert\boldsymbol{Y}\right\Vert_2^2 = \boldsymbol{Y}^\intercal\boldsymbol{Y}= (\hat{\boldsymbol{Y}}+ \hat{\varepsilon})^\intercal(\hat{\boldsymbol{Y}}+ \hat{\varepsilon}) = \hat{\boldsymbol{Y}}^\intercal\hat{\boldsymbol{Y}}+ 2 \hat{\boldsymbol{Y}}^\intercal\hat{\varepsilon}+ \hat{\varepsilon}^\intercal\hat{\varepsilon}= \hat{\boldsymbol{Y}}^\intercal\hat{\boldsymbol{Y}}+ \hat{\varepsilon}^\intercal\hat{\varepsilon}= \left\Vert\hat{\boldsymbol{Y}}\right\Vert_2^2 + \left\Vert\hat{\varepsilon}\right\Vert_2^2. \]
(h)
Using \(\boldsymbol{Y}= \hat{\varepsilon}+ \boldsymbol{X}\hat{\boldsymbol{\beta}}\), \[ \begin{aligned} \left\Vert\boldsymbol{Y}- \boldsymbol{X}\boldsymbol{\gamma}\right\Vert_2^2 &= \left\Vert\hat{\varepsilon}+ \boldsymbol{X}\hat{\boldsymbol{\beta}}- \boldsymbol{X}\boldsymbol{\gamma}\right\Vert_2^2 \\ &= \left\Vert\hat{\varepsilon}+ \boldsymbol{X}(\hat{\boldsymbol{\beta}}- \boldsymbol{\gamma})\right\Vert_2^2 \\ &= \left\Vert\hat{\varepsilon}\right\Vert_2^2 + 2 \hat{\varepsilon}^\intercal\boldsymbol{X}(\hat{\boldsymbol{\beta}}- \boldsymbol{\gamma}) + \left\Vert\boldsymbol{X}(\hat{\boldsymbol{\beta}}- \boldsymbol{\gamma})\right\Vert_2^2 \\ &= \left\Vert\hat{\varepsilon}\right\Vert_2^2 + 0 + \left\Vert\boldsymbol{X}(\hat{\boldsymbol{\beta}}- \boldsymbol{\gamma})\right\Vert_2^2 \\ &\ge \left\Vert\hat{\varepsilon}\right\Vert_2^2, \end{aligned} \] with equality when \(\boldsymbol{X}(\hat{\boldsymbol{\beta}}- \boldsymbol{\gamma}) = \boldsymbol{0}\). Since \(\boldsymbol{X}\) is full rank, this occurs only when \(\boldsymbol{\gamma}= \hat{\boldsymbol{\beta}}\).
(i)
If \(\boldsymbol{Y}= \boldsymbol{X}\boldsymbol{a}\), then \(\boldsymbol{Y}\) is already in the span of the columns of \(\boldsymbol{X}\). By part (b), \(\underset{\boldsymbol{X}}{\boldsymbol{P}} \boldsymbol{Y}= \boldsymbol{Y}\), so \(\hat{\boldsymbol{Y}}= \boldsymbol{Y}\). Then \(\hat{\boldsymbol{\varepsilon}}= \boldsymbol{Y}- \hat{\boldsymbol{Y}}= \boldsymbol{0}\).
(j)
When the regression includes an intercept, \(\boldsymbol{1}\) is a column of \(\boldsymbol{X}\). By part (f), \(\boldsymbol{X}^\intercal\hat{\varepsilon}= \boldsymbol{0}\), which means \(\boldsymbol{1}^\intercal\hat{\varepsilon}= 0\). But \(\boldsymbol{1}^\intercal\hat{\varepsilon}= \sum_{n=1}^N \hat{\varepsilon}_n = N \frac{1}{N} \sum_{n=1}^N\hat{\varepsilon}_n\). Therefore \(\frac{1}{N} \sum_{n=1}^N\hat{\varepsilon}_n = 0\).
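The projection identities above can all be confirmed on simulated data. A minimal numerical sketch assuming numpy; the design and response are arbitrary illustrations.

```python
# Check the projection identities: residuals orthogonal to X (f),
# the Pythagorean norm decomposition (g), and mean-zero residuals
# when X contains an intercept column (j).
import numpy as np

rng = np.random.default_rng(3)
N, P = 40, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, P - 1))])
y = rng.normal(size=N)

P_X = X @ np.linalg.solve(X.T @ X, X.T)   # projection matrix onto col(X)
y_hat = P_X @ y
resid = y - y_hat

assert np.allclose(X.T @ resid, 0.0)                     # part (f)
assert np.isclose(y @ y, y_hat @ y_hat + resid @ resid)  # part (g)
assert np.isclose(resid.mean(), 0.0)                     # part (j)
```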
4 Changes of basis
In this problem, we show how rewriting a regression problem using an invertible linear transformation of the regressors changes the estimated coefficients, but does not change the fit.
Consider a regression setup where \(\boldsymbol{X}\) is full-rank with \(P < N\). Let \(\boldsymbol{z}_n = \boldsymbol{A}^\intercal\boldsymbol{x}_n\) for each \(n\), where \(\boldsymbol{A}\) is an invertible \(P \times P\) matrix, so that \(\boldsymbol{Z}= \boldsymbol{X}\boldsymbol{A}\). Consider the regressions \[ \boldsymbol{Y}\sim \boldsymbol{X}\boldsymbol{\beta}\quad\textrm{and}\quad \boldsymbol{Y}\sim \boldsymbol{Z}\boldsymbol{\gamma}, \] with corresponding OLS estimates \(\hat{\boldsymbol{\beta}}\) and \(\hat{\boldsymbol{\gamma}}\).
(a) Find an explicit expression for \(\hat{\boldsymbol{\gamma}}\) in terms of \(\boldsymbol{A}\) and \(\hat{\boldsymbol{\beta}}\). Hint: Plug in \(\boldsymbol{Z}= \boldsymbol{X}\boldsymbol{A}\) and simplify using the linear algebra fact that \((\boldsymbol{A}\boldsymbol{B}\boldsymbol{Q})^{-1} = \boldsymbol{Q}^{-1} \boldsymbol{B}^{-1} \boldsymbol{A}^{-1}\) when all inverses exist.
(b) Using your result from (a), show that \(\boldsymbol{x}_n^\intercal\hat{\boldsymbol{\beta}}= \boldsymbol{z}_n^\intercal\hat{\boldsymbol{\gamma}}\). In other words, show that \(\hat{y}_n\) is the same whether you regress on \(\boldsymbol{X}\) or on \(\boldsymbol{Z}\).
(c) Let \(R^2_x\) and \(R^2_z\) denote the \(R^2\) fits from the regression on \(\boldsymbol{X}\) and \(\boldsymbol{Z}\) respectively. Show that \(R^2_x= R^2_z\). Hint: Using (b), what do you know about the residuals for the two regressions?
(a)
\[ \begin{aligned} \hat{\boldsymbol{\gamma}}&= (\boldsymbol{Z}^\intercal\boldsymbol{Z})^{-1} \boldsymbol{Z}^\intercal\boldsymbol{Y}\\ &= ((\boldsymbol{X}\boldsymbol{A})^\intercal(\boldsymbol{X}\boldsymbol{A}))^{-1} (\boldsymbol{X}\boldsymbol{A})^\intercal\boldsymbol{Y}\\ &= (\boldsymbol{A}^\intercal\boldsymbol{X}^\intercal\boldsymbol{X}\boldsymbol{A})^{-1} \boldsymbol{A}^\intercal\boldsymbol{X}^\intercal\boldsymbol{Y}\\ &= \boldsymbol{A}^{-1} (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} (\boldsymbol{A}^\intercal)^{-1} \boldsymbol{A}^\intercal\boldsymbol{X}^\intercal\boldsymbol{Y}\\ &= \boldsymbol{A}^{-1} (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\boldsymbol{Y}\\ &= \boldsymbol{A}^{-1} \hat{\boldsymbol{\beta}}. \end{aligned} \]
(b)
\[ \boldsymbol{z}_n^\intercal\hat{\boldsymbol{\gamma}}= (\boldsymbol{A}^\intercal\boldsymbol{x}_n)^\intercal\boldsymbol{A}^{-1} \hat{\boldsymbol{\beta}}= \boldsymbol{x}_n^\intercal\boldsymbol{A}\boldsymbol{A}^{-1} \hat{\boldsymbol{\beta}}= \boldsymbol{x}_n^\intercal\hat{\boldsymbol{\beta}}. \]
Thus \(\hat{y}_n\) is the same for both regressions.
(c)
From part (b), the fitted values \(\hat{y}_n\) are identical for the two regressions. Therefore, the residuals \(\hat{\varepsilon}_n = y_n - \hat{y}_n\) are also identical. Since \(R^2 = 1 - \frac{\sum_n \hat{\varepsilon}_n^2}{\sum_n (y_n - \bar{y})^2}\) and both the numerator and the denominator are the same for the two regressions, we have \(R^2_x= R^2_z\).
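The change-of-basis identities can be checked numerically. The sketch below assumes numpy; the particular invertible \(\boldsymbol{A}\) and the simulated data are illustrative choices.

```python
# Regressing on Z = X A gives gamma_hat = A^{-1} beta_hat and
# identical fitted values (hence identical residuals and R^2).
import numpy as np

rng = np.random.default_rng(4)
N = 30
X = np.column_stack([np.ones(N), rng.normal(size=N)])
y = rng.normal(size=N)
A = np.array([[1.0, 2.0], [0.5, 3.0]])    # invertible: det = 2.0

Z = X @ A
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
gamma_hat = np.linalg.solve(Z.T @ Z, Z.T @ y)

assert np.allclose(gamma_hat, np.linalg.solve(A, beta_hat))  # part (a)
assert np.allclose(X @ beta_hat, Z @ gamma_hat)              # part (b)
```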
5 Changes of basis continued
This is an application of the change of basis problem to one-hot indicators with a constant. Let \(\boldsymbol{z}_n\) be a one-hot encoding of a binary regressor, so that \(\boldsymbol{z}_n = (1, 0)^\intercal\) or \(\boldsymbol{z}_n = (0, 1)^\intercal\), which encodes a binary variable \(a_n \in \{A, B\}\). For example, \(\boldsymbol{z}_n\) might encode \(\textrm{Kitchen Quality} \in \{\textrm{Good}, \textrm{Excellent}\}\), where \(a_n = \textrm{Kitchen Quality}\), \(A = \textrm{Good}\), and \(B = \textrm{Excellent}\).
Let \(\boldsymbol{x}_n\) be the regressor for the regression \(y\sim 1 + z_2\), the regression on a constant and the second component of \(\boldsymbol{z}_n\), so \(\boldsymbol{x}_n = (1, \mathbb{I}\left(a_n = B\right))^\intercal\).
Consider the regressions \[ \boldsymbol{Y}\sim \boldsymbol{X}\boldsymbol{\beta}\quad\textrm{and}\quad \boldsymbol{Y}\sim \boldsymbol{Z}\boldsymbol{\gamma}, \] with corresponding OLS estimates \(\hat{\boldsymbol{\beta}}\) and \(\hat{\boldsymbol{\gamma}}\).
(a) Find an \(\boldsymbol{A}\) such that \(\boldsymbol{z}_n = \boldsymbol{A}^\intercal\boldsymbol{x}_n\) in this case, and prove that it’s invertible.
(b) Write an expression for \(\hat{\boldsymbol{\gamma}}\) in terms of \(\bar{y}_A\) and \(\bar{y}_B\), which are respectively the average responses when \(a_n = A\) and \(a_n = B\).
(c) Write an expression for \(\hat{\beta}_2\) in terms of \(\bar{y}_A\) and \(\bar{y}_B\) using the fact that \(\hat{y}_n\) must be the same for the two regressions.
(d) In ordinary language, what is the difference in the interpretation of the coefficient \(\hat{\beta}_2\) and \(\hat{\gamma}_2\)?
(a)
We need \(\boldsymbol{z}_n = \boldsymbol{A}^\intercal\boldsymbol{x}_n\) where \(\boldsymbol{x}_n = (1, \mathbb{I}\left(a_n = B\right))^\intercal\).
When \(a_n = A\): \(\boldsymbol{x}_n = (1, 0)^\intercal\) and \(\boldsymbol{z}_n = (1, 0)^\intercal\). When \(a_n = B\): \(\boldsymbol{x}_n = (1, 1)^\intercal\) and \(\boldsymbol{z}_n = (0, 1)^\intercal\).
So we need: \[ \boldsymbol{A}^\intercal\begin{pmatrix} 1 \\ 0 \end{pmatrix} = \begin{pmatrix} 1 \\ 0 \end{pmatrix} \quad\textrm{and}\quad \boldsymbol{A}^\intercal\begin{pmatrix} 1 \\ 1 \end{pmatrix} = \begin{pmatrix} 0 \\ 1 \end{pmatrix}. \]
This gives \(\boldsymbol{A}^\intercal= \begin{pmatrix} 1 & -1 \\ 0 & 1 \end{pmatrix}\), so \(\boldsymbol{A}= \begin{pmatrix} 1 & 0 \\ -1 & 1 \end{pmatrix}\).
The determinant is \(\det(\boldsymbol{A}) = 1 \ne 0\), so \(\boldsymbol{A}\) is invertible.
(b)
From the one-hot encoding problem (Problem 2), when regressing \(\boldsymbol{Y}\) on \(\boldsymbol{Z}\) (the one-hot encoding without intercept), the OLS estimate is simply the group means: \[ \hat{\boldsymbol{\gamma}}= \begin{pmatrix} \bar{y}_A \\ \bar{y}_B \end{pmatrix}. \]
(c)
From the change of basis result, the fitted values must be identical. For an observation with \(a_n = A\): \[ \hat{y}_n = \boldsymbol{z}_n^\intercal\hat{\boldsymbol{\gamma}}= (1, 0) \begin{pmatrix} \bar{y}_A \\ \bar{y}_B \end{pmatrix} = \bar{y}_A. \]
For the regression on \(\boldsymbol{x}_n = (1, 0)^\intercal\): \[ \hat{y}_n = \boldsymbol{x}_n^\intercal\hat{\boldsymbol{\beta}}= \hat{\beta}_1 + 0 \cdot \hat{\beta}_2 = \hat{\beta}_1. \]
So \(\hat{\beta}_1 = \bar{y}_A\).
For an observation with \(a_n = B\): \[ \hat{y}_n = \boldsymbol{z}_n^\intercal\hat{\boldsymbol{\gamma}}= (0, 1) \begin{pmatrix} \bar{y}_A \\ \bar{y}_B \end{pmatrix} = \bar{y}_B. \]
For the regression on \(\boldsymbol{x}_n = (1, 1)^\intercal\): \[ \hat{y}_n = \hat{\beta}_1 + \hat{\beta}_2 = \bar{y}_B. \]
Therefore \(\hat{\beta}_2 = \bar{y}_B - \hat{\beta}_1 = \bar{y}_B - \bar{y}_A\).
Note that an equivalent way to solve this would be to use the formula \(\hat{\boldsymbol{\gamma}}= \boldsymbol{A}^{-1} \hat{\boldsymbol{\beta}}\), which gives \(\hat{\boldsymbol{\beta}}= \boldsymbol{A}\hat{\boldsymbol{\gamma}}\), so
\[ \hat{\boldsymbol{\beta}} = \begin{pmatrix} 1 & 0 \\ -1 & 1 \end{pmatrix} \begin{pmatrix} \bar{y}_A \\ \bar{y}_B \end{pmatrix} = \begin{pmatrix} \bar{y}_A \\ \bar{y}_B - \bar{y}_A \end{pmatrix}. \]
(d)
\(\hat{\gamma}_2 = \bar{y}_B\) is the average response for observations in group \(B\).
\(\hat{\beta}_2 = \bar{y}_B - \bar{y}_A\) is the difference in average response between group \(B\) and group \(A\). It represents the effect of being in group \(B\) relative to the baseline group \(A\).
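This interpretation can be confirmed numerically. A minimal sketch assuming numpy, with a deterministic alternating group assignment chosen purely for illustration:

```python
# Intercept-plus-indicator coding recovers (ybar_A, ybar_B - ybar_A);
# one-hot coding recovers the group means (ybar_A, ybar_B).
import numpy as np

rng = np.random.default_rng(5)
is_B = np.tile([0.0, 1.0], 25)            # alternating group labels
y = rng.normal(size=50) + 2.0 * is_B

X = np.column_stack([np.ones(50), is_B])  # constant + indicator of B
Z = np.column_stack([1.0 - is_B, is_B])   # one-hot encoding
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
gamma_hat = np.linalg.solve(Z.T @ Z, Z.T @ y)

ybar_A, ybar_B = y[is_B == 0].mean(), y[is_B == 1].mean()
assert np.allclose(gamma_hat, [ybar_A, ybar_B])
assert np.allclose(beta_hat, [ybar_A, ybar_B - ybar_A])
```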
6 Linear combinations of regressors (based on Freedman (2009) Chapter 4 set B exercise 14)
Let \(\hat{\boldsymbol{\beta}}\) be the OLS estimator \(\hat{\boldsymbol{\beta}}= (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\boldsymbol{Y}\), where the design matrix \(\boldsymbol{X}\) has full rank \(P < N\). Assume that, for each \(n\) and some \(\boldsymbol{\beta}\), \(y_n = \boldsymbol{x}_n^\intercal\boldsymbol{\beta}+ \varepsilon_n\), where \(\mathbb{E}\left[\varepsilon_n \vert \boldsymbol{x}_n\right] = 0\), the \(\varepsilon_n\) are IID, and \(\mathrm{Var}\left(\varepsilon_n \vert \boldsymbol{x}_n\right) = \sigma^2\).
(a) Write an expression for \(\boldsymbol{Y}\) in terms of \(\boldsymbol{X}\), \(\boldsymbol{\beta}\), and \(\boldsymbol{\varepsilon}\).
(b) Plug your result from (a) into the OLS formula to get an expression for \(\hat{\boldsymbol{\beta}}\).
(c) Using your result from (b), what is \(\mathbb{E}\left[\hat{\boldsymbol{\beta}}\vert \boldsymbol{X}\right]\)?
(d) Using your result from (b), what is \(\mathrm{Cov}\left(\hat{\boldsymbol{\beta}}\vert \boldsymbol{X}\right)\) in terms of \(\sigma^2\) and \(\boldsymbol{X}\)?
(e) Find an \(\boldsymbol{a}\) such that \(\boldsymbol{a}^\intercal\hat{\boldsymbol{\beta}}= \hat{\beta}_1 - \hat{\beta}_2\).
(f) For a given \(\boldsymbol{a}\), find an expression for \(\mathbb{E}\left[\boldsymbol{a}^\intercal\hat{\boldsymbol{\beta}}\vert \boldsymbol{X}\right]\).
(g) For a given \(\boldsymbol{a}\), find an expression for \(\mathrm{Var}\left(\boldsymbol{a}^\intercal\hat{\boldsymbol{\beta}}\vert \boldsymbol{X}\right)\).
(a)
\[ \boldsymbol{Y}= \boldsymbol{X}\boldsymbol{\beta}+ \boldsymbol{\varepsilon} \]
where \(\boldsymbol{\varepsilon}= (\varepsilon_1, \ldots, \varepsilon_N)^\intercal\).
(b)
\[ \begin{aligned} \hat{\boldsymbol{\beta}}&= (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\boldsymbol{Y}\\ &= (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal(\boldsymbol{X}\boldsymbol{\beta}+ \boldsymbol{\varepsilon}) \\ &= (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\boldsymbol{X}\boldsymbol{\beta}+ (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\boldsymbol{\varepsilon}\\ &= \boldsymbol{\beta}+ (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\boldsymbol{\varepsilon}. \end{aligned} \]
(c)
\[ \begin{aligned} \mathbb{E}\left[\hat{\boldsymbol{\beta}}\vert \boldsymbol{X}\right] &= \mathbb{E}\left[\boldsymbol{\beta}+ (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\boldsymbol{\varepsilon}\vert \boldsymbol{X}\right] \\ &= \boldsymbol{\beta}+ (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\mathbb{E}\left[\boldsymbol{\varepsilon}\vert \boldsymbol{X}\right] \\ &= \boldsymbol{\beta}+ (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\boldsymbol{0}\\ &= \boldsymbol{\beta}. \end{aligned} \]
So \(\hat{\boldsymbol{\beta}}\) is an unbiased estimator of \(\boldsymbol{\beta}\).
(d)
\[ \begin{aligned} \mathrm{Cov}\left(\hat{\boldsymbol{\beta}}\vert \boldsymbol{X}\right) &= \mathrm{Cov}\left((\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\boldsymbol{\varepsilon}\vert \boldsymbol{X}\right) \\ &= (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\mathrm{Cov}\left(\boldsymbol{\varepsilon}\vert \boldsymbol{X}\right) \boldsymbol{X}(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \\ &= (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal(\sigma^2 \boldsymbol{I}) \boldsymbol{X}(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \\ &= \sigma^2 (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\boldsymbol{X}(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \\ &= \sigma^2 (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1}. \end{aligned} \]
(e)
\[ \boldsymbol{a}= (1, -1, 0, \ldots, 0)^\intercal \]
Then \(\boldsymbol{a}^\intercal\hat{\boldsymbol{\beta}}= \hat{\beta}_1 - \hat{\beta}_2\).
(f)
\[ \mathbb{E}\left[\boldsymbol{a}^\intercal\hat{\boldsymbol{\beta}}\vert \boldsymbol{X}\right] = \boldsymbol{a}^\intercal\mathbb{E}\left[\hat{\boldsymbol{\beta}}\vert \boldsymbol{X}\right] = \boldsymbol{a}^\intercal\boldsymbol{\beta}= \beta_1 - \beta_2. \]
(g)
\[ \mathrm{Var}\left(\boldsymbol{a}^\intercal\hat{\boldsymbol{\beta}}\vert \boldsymbol{X}\right) = \boldsymbol{a}^\intercal\mathrm{Cov}\left(\hat{\boldsymbol{\beta}}\vert \boldsymbol{X}\right) \boldsymbol{a}= \sigma^2 \boldsymbol{a}^\intercal(\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{a}. \]
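The variance formula can be checked by Monte Carlo. The sketch below assumes numpy; the simulation size, design, and true parameters are arbitrary illustrative choices, and the tolerance simply reflects Monte Carlo error.

```python
# Monte Carlo check of parts (d) and (g): with X held fixed, simulate
# many datasets and compare the empirical variance of a^T beta_hat
# with sigma^2 a^T (X^T X)^{-1} a.
import numpy as np

rng = np.random.default_rng(6)
N, sigma = 100, 1.5
X = np.column_stack([np.ones(N), rng.normal(size=N)])
beta = np.array([1.0, -2.0])
a = np.array([1.0, -1.0])             # picks out beta_1 - beta_2

XtX_inv = np.linalg.inv(X.T @ X)
draws = np.array([
    np.linalg.solve(X.T @ X, X.T @ (X @ beta + sigma * rng.normal(size=N)))
    for _ in range(20000)
])

theory = sigma**2 * (a @ XtX_inv @ a)
empirical = (draws @ a).var()
assert abs(empirical - theory) / theory < 0.1
```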