STAT151A Homework 2 (with solutions)

Author

Your name here

This homework is due on Gradescope on Monday September 30th at 9pm.

1 Ordinary least squares in matrix form

Consider simple least squares regression \(y_n = \beta_1 + \beta_2 x_n + \varepsilon_n\), where \(x_n\) is a scalar. Assume that we have \(N\) datapoints. We showed directly that the least–squares solution is given by

\[ \hat{\beta}_1 = \overline{y} - \hat{\beta}_2 \overline{x} \quad\text{and}\quad \hat{\beta}_2 = \frac{\overline{xy} - \overline{x} \, \overline{y}} {\overline{xx} - \overline{x}^2}. \]

Let us re–derive this using matrix notation.

(a)

Write simple linear regression in the form \(\boldsymbol{Y}= \boldsymbol{X}\boldsymbol{\beta}+ \boldsymbol{\varepsilon}\). Be precise about what goes into each entry of \(\boldsymbol{Y}\), \(\boldsymbol{X}\), \(\boldsymbol{\beta}\), and \(\boldsymbol{\varepsilon}\). What are the dimensions of each?

(b)

We proved that the optimal \(\hat{\boldsymbol{\beta}}\) satisfies \(\boldsymbol{X}^\intercal\boldsymbol{X}\hat{\boldsymbol{\beta}}= \boldsymbol{X}^\intercal\boldsymbol{Y}\). Define the “barred quantities” \[ \begin{aligned} \overline{y} ={}& \frac{1}{N} \sum_{n=1}^N y_n \\ \overline{x} ={}& \frac{1}{N} \sum_{n=1}^N x_n \\ \overline{xy} ={}& \frac{1}{N} \sum_{n=1}^N x_n y_n \\ \overline{xx} ={}& \frac{1}{N} \sum_{n=1}^N x_n^2. \end{aligned} \]

In terms of the barred quantities and the number of datapoints \(N\), write expressions for \(\boldsymbol{X}^\intercal\boldsymbol{X}\) and \(\boldsymbol{X}^\intercal\boldsymbol{Y}\).

(c)

When is \(\boldsymbol{X}^\intercal\boldsymbol{X}\) invertible? Write a formal expression in terms of the barred quantities. Interpret this condition intuitively in terms of the distribution of the regressors \(x_n\).

(d)

Using the formula for the inverse of a \(2\times 2\) matrix, find an expression for \(\hat{\boldsymbol{\beta}}\), and confirm that we get the same answer that we got by solving directly.

(e)

In the case where \(\boldsymbol{X}^\intercal\boldsymbol{X}\) is not invertible, find three distinct values of \(\boldsymbol{\beta}\) that all achieve the same sum of squared residuals \(\boldsymbol{\varepsilon}^\intercal\boldsymbol{\varepsilon}\).

Solutions

(a)

\[ \boldsymbol{X}= \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_N \end{pmatrix} \quad\quad \boldsymbol{Y}= \begin{pmatrix} y_1 \\ \vdots \\ y_N \end{pmatrix} \quad\quad \boldsymbol{\varepsilon}= \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_N \end{pmatrix} \quad\quad \boldsymbol{\beta}= \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} \]

These are \(N \times 2\), \(N \times 1\), \(N \times 1\), and \(2 \times 1\) respectively.

(b)

\[ \boldsymbol{X}^\intercal\boldsymbol{X}= N \begin{pmatrix} 1 & \bar{x}\\ \bar{x}& \overline{xx} \end{pmatrix} \quad\quad \boldsymbol{X}^\intercal\boldsymbol{Y}= N \begin{pmatrix} \bar{y}\\ \overline{xy} \end{pmatrix} \]
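
As a quick numerical sanity check (not part of the derivation), the sketch below builds the design matrix from part (a) on arbitrary simulated data and compares \(\boldsymbol{X}^\intercal\boldsymbol{X}\) and \(\boldsymbol{X}^\intercal\boldsymbol{Y}\) to the barred-quantity expressions. The use of numpy and the particular data are assumptions made only for illustration.

```python
# Compare X^T X and X^T Y to their barred-quantity forms on simulated data.
import numpy as np

rng = np.random.default_rng(0)
N = 50
x = rng.normal(size=N)
y = 2.0 + 3.0 * x + rng.normal(size=N)

X = np.column_stack([np.ones(N), x])          # N x 2 design matrix: (1, x_n)
xbar, ybar = x.mean(), y.mean()
xxbar, xybar = (x * x).mean(), (x * y).mean()

XtX_barred = N * np.array([[1.0, xbar], [xbar, xxbar]])
XtY_barred = N * np.array([ybar, xybar])

print(np.allclose(X.T @ X, XtX_barred))       # True
print(np.allclose(X.T @ y, XtY_barred))       # True
```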

(c)

\(\boldsymbol{X}^\intercal\boldsymbol{X}\) is invertible if and only if its determinant, \(N^2 (\overline{xx} - \bar{x}\, \bar{x})\), is nonzero, i.e., if \(\overline{xx} - \bar{x}\, \bar{x} \ne 0\). Since \(\overline{xx} - \bar{x}\, \bar{x} = \frac{1}{N} \sum_{n=1}^N (x_n - \bar{x})^2\) is the sample variance of the \(x_n\), this occurs exactly when the sample variance is greater than zero, that is, when the \(x_n\) are not all equal.

(d)

\[ \begin{aligned} (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\boldsymbol{Y}={}& \frac{1}{N (\overline{xx} - \bar{x}\, \bar{x})} \begin{pmatrix} \overline{xx} & -\bar{x}\\ -\bar{x}& 1 \end{pmatrix} N \begin{pmatrix} \bar{y}\\ \overline{xy} \end{pmatrix} \\={}& \frac{1}{\overline{xx} - \bar{x}\, \bar{x}} \begin{pmatrix} \bar{y}\, \overline{xx} - \bar{x}\, \overline{xy} \\ -\bar{y}\, \bar{x}+ \overline{xy} \end{pmatrix} \end{aligned}. \]

We already have \(\hat{\beta}_2 = (\overline{xy} - \bar{y}\, \bar{x}) / (\overline{xx} - \bar{x}\, \bar{x})\) as expected. To see that \(\hat{\beta}_1\) is correct, write

\[ \begin{aligned} \bar{y}\, \overline{xx} - \bar{x}\, \overline{xy} ={}& \bar{y}\, \overline{xx} - \bar{y}\, \bar{x}\, \bar{x}+ \bar{y}\, \bar{x}\, \bar{x}- \bar{x}\, \overline{xy} \\={}& \bar{y}(\overline{xx} - \bar{x}\, \bar{x}) - \bar{x}(\overline{xy} - \bar{y}\, \bar{x}). \end{aligned} \]

Plugging this in gives

\[ \begin{aligned} \frac{\bar{y}\, \overline{xx} - \bar{x}\, \overline{xy}}{\overline{xx} - \bar{x}\, \bar{x}} ={}& \frac{\bar{y}(\overline{xx} - \bar{x}\, \bar{x}) - \bar{x}(\overline{xy} - \bar{y}\, \bar{x})} {\overline{xx} - \bar{x}\, \bar{x}} \\={}& \bar{y}- \bar{x}\hat{\beta}_2, \end{aligned} \]

as expected.
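
The full derivation can also be checked numerically. The sketch below (numpy and simulated data are again illustration-only assumptions) solves the normal equations and compares the result to the closed-form expressions for \(\hat{\beta}_1\) and \(\hat{\beta}_2\).

```python
# Solve the normal equations and compare to the barred-quantity formulas.
import numpy as np

rng = np.random.default_rng(1)
N = 200
x = rng.normal(loc=1.0, scale=2.0, size=N)
y = -1.0 + 0.5 * x + rng.normal(size=N)

X = np.column_stack([np.ones(N), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # solves X^T X beta = X^T Y

xbar, ybar = x.mean(), y.mean()
beta2 = ((x * y).mean() - xbar * ybar) / ((x * x).mean() - xbar ** 2)
beta1 = ybar - beta2 * xbar

print(np.allclose(beta_hat, [beta1, beta2]))   # True
```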

2 Probability and matrices

For this problem, assume that \(y_n = \beta_0 + x_n \beta_1 + \varepsilon_n\) for scalar \(x_n\) and some fixed \(\beta_0\) and \(\beta_1\). Assume that

  • The residuals \(\varepsilon_n\) are IID with \(\mathbb{E}\left[\varepsilon_n\right] = 0\) and \(\mathrm{Var}\left(\varepsilon_n\right) = \sigma^2\).
  • The regressors \(x_n\) are IID with \(\mathbb{E}\left[x_n\right] = \mu\) and \(\mathrm{Var}\left(x_n\right) = \nu^2\).
  • The residuals are independent of the regressors.

(a)

Evaluate the following expressions. (You may need to remind yourself of the definition of conditional expectation and variance.)

  • \(\mathbb{E}\left[y_n\right]\)
  • \(\mathrm{Var}\left(y_n\right)\)
  • \(\mathbb{E}\left[y_n \vert x_n\right]\)
  • \(\mathrm{Var}\left(y_n \vert x_n\right)\)

(b)

Compute the following limits using the LLN, or say that the limit does not exist or is infinite.

  • \(\lim_{N \rightarrow \infty } \frac{1}{N} \sum_{n=1}^N\varepsilon_n\)
  • \(\lim_{N \rightarrow \infty } \frac{1}{N} \sum_{n=1}^N\varepsilon_n^2\)
  • \(\lim_{N \rightarrow \infty } \frac{1}{N} \sum_{n=1}^Nx_n\)
  • \(\lim_{N \rightarrow \infty } \frac{1}{N} \sum_{n=1}^Nx_n^2\)
  • \(\lim_{N \rightarrow \infty } \frac{1}{N} \sum_{n=1}^Nx_n \varepsilon_n\)
  • \(\lim_{N \rightarrow \infty } \frac{1}{N} \sum_{n=1}^Nx_n y_n\)

(c)

Compute the following limits using the CLT, or say that the limit does not exist or is infinite.

  • \(\lim_{N \rightarrow \infty } \frac{1}{\sqrt{N}} \sum_{n=1}^N\varepsilon_n\)
  • \(\lim_{N \rightarrow \infty } \frac{1}{\sqrt{N}} \sum_{n=1}^Ny_n\)
  • \(\lim_{N \rightarrow \infty } \frac{1}{\sqrt{N}} \sum_{n=1}^N(y_n - (\beta_0 + x_n \beta_1))\)

(d)

Noting that this is simple linear regression, let \(\boldsymbol{X}\), \(\boldsymbol{Y}\), and \(\boldsymbol{\varepsilon}\) be as in the solution to question one above. Evaluate the following limits, or say that the limit does not exist or is infinite.

Here, \((\boldsymbol{A})_{ij}\) denotes the \(i,j\)–th entry of the matrix \(\boldsymbol{A}\). Let the regressor \(x_n\) be in the second column of \(\boldsymbol{X}\).

  • \(\lim_{N \rightarrow \infty } \frac{1}{N} \boldsymbol{1}^\intercal\boldsymbol{\varepsilon}\)
  • \(\lim_{N \rightarrow \infty } \frac{1}{\sqrt{N}} \boldsymbol{1}^\intercal\boldsymbol{\varepsilon}\)
  • \(\lim_{N \rightarrow \infty } \frac{1}{N} (\boldsymbol{X}^\intercal\boldsymbol{X})_{11}\)
  • \(\lim_{N \rightarrow \infty } \frac{1}{N} (\boldsymbol{X}^\intercal\boldsymbol{X})_{12}\)
  • \(\lim_{N \rightarrow \infty } \frac{1}{N} (\boldsymbol{X}^\intercal\boldsymbol{X})_{22}\)
  • \(\lim_{N \rightarrow \infty } (\boldsymbol{X}^\intercal\boldsymbol{X})_{11}\)
  • \(\lim_{N \rightarrow \infty } \frac{1}{N} (\boldsymbol{Y}- \boldsymbol{X}\boldsymbol{\beta})^\intercal(\boldsymbol{Y}- \boldsymbol{X}\boldsymbol{\beta})\)
  • \(\lim_{N \rightarrow \infty } \frac{1}{\sqrt{N}} (\boldsymbol{Y}- \boldsymbol{X}\boldsymbol{\beta})^\intercal(\boldsymbol{Y}- \boldsymbol{X}\boldsymbol{\beta})\)

Hint: Write the matrix expressions as sums over \(n=1\) to \(N\).

Solutions

(a)

  • \(\mathbb{E}\left[y_n\right] = \beta_0 + \mu \beta_1\)
  • \(\mathrm{Var}\left(y_n\right) = \beta_1^2 \nu^2 + \sigma^2\)
  • \(\mathbb{E}\left[y_n \vert x_n\right] = \beta_0 + x_n \beta_1\)
  • \(\mathrm{Var}\left(y_n \vert x_n\right) = \sigma^2\)

(b)

  • \(\lim_{N \rightarrow \infty } \frac{1}{N} \sum_{n=1}^N\varepsilon_n = 0\) by the LLN
  • \(\lim_{N \rightarrow \infty } \frac{1}{N} \sum_{n=1}^N\varepsilon_n^2 = \sigma^2\) by the LLN
  • \(\lim_{N \rightarrow \infty } \frac{1}{N} \sum_{n=1}^Nx_n = \mu\) by the LLN
  • \(\lim_{N \rightarrow \infty } \frac{1}{N} \sum_{n=1}^Nx_n^2 = \mu^2 + \nu^2\) by the LLN
  • \(\lim_{N \rightarrow \infty } \frac{1}{N} \sum_{n=1}^Nx_n \varepsilon_n = 0\) by the LLN
  • \(\lim_{N \rightarrow \infty } \frac{1}{N} \sum_{n=1}^Nx_n y_n = \beta_0 \mu + \beta_1 (\nu^2 + \mu^2)\) by the LLN
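
These limits can be checked with a short Monte Carlo sketch. The normal distributions chosen for \(x_n\) and \(\varepsilon_n\), and the use of numpy, are assumptions made only for illustration; any distributions with the stated moments would do.

```python
# Compare large-sample averages to the LLN limits in part (b).
import numpy as np

rng = np.random.default_rng(2)
N = 1_000_000
beta0, beta1, mu, nu, sigma = 1.0, 2.0, 0.5, 1.5, 0.7

x = rng.normal(loc=mu, scale=nu, size=N)
eps = rng.normal(loc=0.0, scale=sigma, size=N)
y = beta0 + beta1 * x + eps

print(np.mean(eps))          # ~ 0
print(np.mean(eps ** 2))     # ~ sigma^2 = 0.49
print(np.mean(x))            # ~ mu = 0.5
print(np.mean(x ** 2))       # ~ mu^2 + nu^2 = 2.5
print(np.mean(x * eps))      # ~ 0
print(np.mean(x * y))        # ~ beta0 * mu + beta1 * (nu^2 + mu^2) = 5.5
```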

(c)

  • \(\lim_{N \rightarrow \infty } \frac{1}{\sqrt{N}} \sum_{n=1}^N\varepsilon_n = \mathcal{N}\left(0,\sigma^2\right)\) by the CLT
  • \(\lim_{N \rightarrow \infty } \frac{1}{\sqrt{N}} \sum_{n=1}^Ny_n\) diverges if \(\beta_0 + \beta_1 \mu \ne 0\), and otherwise converges to \(\mathcal{N}\left(0, \beta_1^2 \nu^2 + \sigma^2\right)\).
  • \(\lim_{N \rightarrow \infty } \frac{1}{\sqrt{N}} \sum_{n=1}^N(y_n - (\beta_0 + x_n \beta_1)) = \mathcal{N}\left(0, \sigma^2\right)\) by the CLT because \(y_n - (\beta_0 + x_n \beta_1) = \varepsilon_n\).

(d)

The key is to write these expressions as limits of sums, and then use the techniques given above.

  • \(\frac{1}{N} \boldsymbol{1}^\intercal\boldsymbol{\varepsilon}= \frac{1}{N} \sum_{n=1}^N\varepsilon_n \rightarrow 0\)
  • \(\frac{1}{\sqrt{N}} \boldsymbol{1}^\intercal\boldsymbol{\varepsilon}= \frac{1}{\sqrt{N}} \sum_{n=1}^N\varepsilon_n \rightarrow \mathcal{N}\left(0, \sigma^2\right)\)
  • \(\frac{1}{N} (\boldsymbol{X}^\intercal\boldsymbol{X})_{11} = \frac{1}{N} \sum_{n=1}^N1 = 1\)
  • \(\frac{1}{N} (\boldsymbol{X}^\intercal\boldsymbol{X})_{12} = \frac{1}{N} \sum_{n=1}^Nx_n \rightarrow \mu\)
  • \(\frac{1}{N} (\boldsymbol{X}^\intercal\boldsymbol{X})_{22} = \frac{1}{N} \sum_{n=1}^Nx_n^2 \rightarrow \mu^2 + \nu^2\)
  • \((\boldsymbol{X}^\intercal\boldsymbol{X})_{11} = \sum_{n=1}^N 1 = N \rightarrow \infty\), so without the \(1/N\) scaling the limit is infinite
  • \(\frac{1}{N} (\boldsymbol{Y}- \boldsymbol{X}\boldsymbol{\beta})^\intercal(\boldsymbol{Y}- \boldsymbol{X}\boldsymbol{\beta}) = \frac{1}{N} \boldsymbol{\varepsilon}^\intercal\boldsymbol{\varepsilon}= \frac{1}{N} \sum_{n=1}^N\varepsilon_n^2 \rightarrow \sigma^2\)
  • \(\frac{1}{\sqrt{N}} (\boldsymbol{Y}- \boldsymbol{X}\boldsymbol{\beta})^\intercal(\boldsymbol{Y}- \boldsymbol{X}\boldsymbol{\beta}) = \frac{1}{\sqrt{N}} \sum_{n=1}^N\varepsilon_n^2 = \sqrt{N} \frac{1}{N} \sum_{n=1}^N\varepsilon_n^2 \rightarrow \infty\)
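
A similar sketch confirms that the matrix expressions reduce to the scalar sums above; numpy and the simulated normal data are again illustration-only assumptions.

```python
# Check that the matrix expressions in part (d) match the scalar sums in (b).
import numpy as np

rng = np.random.default_rng(3)
N = 100_000
beta0, beta1, mu, nu, sigma = 1.0, 2.0, 0.5, 1.5, 0.7

x = rng.normal(loc=mu, scale=nu, size=N)
eps = rng.normal(loc=0.0, scale=sigma, size=N)
y = beta0 + beta1 * x + eps

X = np.column_stack([np.ones(N), x])
beta = np.array([beta0, beta1])
resid = y - X @ beta                                     # equals eps exactly

print(np.isclose(np.ones(N) @ eps / N, eps.mean()))      # True (~ 0)
print(np.isclose((X.T @ X)[0, 0] / N, 1.0))              # True
print(np.isclose((X.T @ X)[0, 1] / N, x.mean()))         # True (~ mu)
print(np.isclose((X.T @ X)[1, 1] / N, (x ** 2).mean()))  # True (~ mu^2 + nu^2)
print(np.isclose(resid @ resid / N, (eps ** 2).mean()))  # True (~ sigma^2)
```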

3 One-hot encoding

Consider a one–hot encoding of a variable \(z_n\) that takes three distinct values, “a”, “b”, and “c”. That is, let

\[ \boldsymbol{x}_n = \begin{cases} \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix} & \textrm{ when }z_n = a \\ \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix} & \textrm{ when }z_n = b \\ \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} & \textrm{ when }z_n = c \\ \end{cases} \]

Let \(\boldsymbol{X}\) be the regressor matrix with \(\boldsymbol{x}_n^\intercal\) in row \(n\).

(a)

Let \(N_a\) be the number of observations with \(z_n\) = a, and let \(\sum_{n:z_n = a}\) denote a sum over rows with \(z_n\) = a, with analogous definitions for b and c. In terms of these quantities, write expressions for \(\boldsymbol{X}^\intercal\boldsymbol{X}\) and \(\boldsymbol{X}^\intercal\boldsymbol{Y}\).

(b)

When is \(\boldsymbol{X}^\intercal\boldsymbol{X}\) invertible? Explain intuitively why the regression problem cannot be solved when \(\boldsymbol{X}^\intercal\boldsymbol{X}\) is not invertible. Write an explicit expression for \((\boldsymbol{X}^\intercal\boldsymbol{X})^{-1}\) when it is invertible.

(c)

Using your previous answer, show that the entries of the least squares vector \(\hat{\boldsymbol{\beta}}\) are the means of \(y_n\) within the distinct values of \(z_n\).

(d)

Suppose now you include a constant in the regression, so that

\[ y_n = \alpha + \boldsymbol{\beta}^\intercal\boldsymbol{x}_n + \varepsilon_n, \]

and let \(\boldsymbol{X}'\) denote the regressor matrix for this regression with coefficient vector \((\alpha, \boldsymbol{\beta}^\intercal)^\intercal\). Write an expression for \(\boldsymbol{X}'^\intercal\boldsymbol{X}'\) and show that it is not invertible.

(e)

Find three distinct values of \((\alpha, \boldsymbol{\beta}^\intercal)\) that all give the exact same fit \(\alpha + \boldsymbol{\beta}^\intercal\boldsymbol{x}_n\).

Solutions

(a)

\[ \boldsymbol{X}^\intercal\boldsymbol{X}= \begin{pmatrix} N_a & 0 & 0 \\ 0 & N_b & 0 \\ 0 & 0 & N_c \\ \end{pmatrix} \quad\quad \boldsymbol{X}^\intercal\boldsymbol{Y}= \begin{pmatrix} \sum_{n:z_n = a} y_n \\ \sum_{n:z_n = b} y_n \\ \sum_{n:z_n = c} y_n \\ \end{pmatrix}. \]
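
A small numerical check, assuming numpy and arbitrary simulated data: build the one-hot design matrix and verify that \(\boldsymbol{X}^\intercal\boldsymbol{X}\) is the diagonal matrix of level counts and \(\boldsymbol{X}^\intercal\boldsymbol{Y}\) is the vector of within-level sums.

```python
# One-hot design matrix: X^T X is diagonal with level counts, X^T Y holds sums.
import numpy as np

rng = np.random.default_rng(4)
z = rng.choice(["a", "b", "c"], size=30)
y = rng.normal(size=30)

levels = np.array(["a", "b", "c"])
X = (z[:, None] == levels[None, :]).astype(float)   # one column per level

counts = np.array([(z == lev).sum() for lev in levels])
sums = np.array([y[z == lev].sum() for lev in levels])

print(np.allclose(X.T @ X, np.diag(counts)))   # True
print(np.allclose(X.T @ y, sums))              # True
```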

(b)

It is invertible as long as each of \(N_a\), \(N_b\), and \(N_c\) is nonzero. If there are no observations for a particular level, you of course cannot estimate its relationship with \(y_n\). When \(\boldsymbol{X}^\intercal\boldsymbol{X}\) is invertible, then

\[ (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} = \begin{pmatrix} 1/N_a & 0 & 0 \\ 0 & 1/N_b & 0 \\ 0 & 0 & 1/N_c \\ \end{pmatrix}. \]

(c)

By direct multiplication,

\[ \hat{\boldsymbol{\beta}}= (\boldsymbol{X}^\intercal\boldsymbol{X})^{-1} \boldsymbol{X}^\intercal\boldsymbol{Y}= \begin{pmatrix} \frac{1}{N_a} \sum_{n:z_n = a} y_n \\ \frac{1}{N_b} \sum_{n:z_n = b} y_n \\ \frac{1}{N_c} \sum_{n:z_n = c} y_n \\ \end{pmatrix}. \]
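
As a sanity check, the sketch below (numpy and simulated data assumed, as before) compares the least squares solution to the within-level means of \(y_n\).

```python
# Least squares with a one-hot encoding recovers the within-level means.
import numpy as np

rng = np.random.default_rng(5)
z = rng.choice(["a", "b", "c"], size=60)
y = rng.normal(size=60)

levels = np.array(["a", "b", "c"])
X = (z[:, None] == levels[None, :]).astype(float)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
group_means = np.array([y[z == lev].mean() for lev in levels])

print(np.allclose(beta_hat, group_means))   # True
```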

(d)

\[ (\boldsymbol{X}')^\intercal\boldsymbol{X}' = \begin{pmatrix} N & N_a & N_b & N_c \\ N_a & N_a & 0 & 0 \\ N_b & 0 & N_b & 0 \\ N_c & 0 & 0 & N_c \\ \end{pmatrix}. \]

This is not invertible because the first column is the sum of the other three. Equivalently, \((\boldsymbol{X}')^\intercal\boldsymbol{X}' \boldsymbol{v}= \boldsymbol{0}\) where \(\boldsymbol{v}= (1, -1, -1, -1)^\intercal\).
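
This can also be verified numerically (numpy and simulated \(z_n\) are illustration-only assumptions): the rank of \((\boldsymbol{X}')^\intercal\boldsymbol{X}'\) is 3 rather than 4, and \(\boldsymbol{v}\) lies in its null space.

```python
# Adding an intercept to the one-hot encoding makes X'^T X' rank deficient.
import numpy as np

rng = np.random.default_rng(6)
z = rng.choice(["a", "b", "c"], size=40)
levels = np.array(["a", "b", "c"])
X = (z[:, None] == levels[None, :]).astype(float)
Xp = np.column_stack([np.ones(len(z)), X])        # X' = [1, one-hot columns]

A = Xp.T @ Xp
v = np.array([1.0, -1.0, -1.0, -1.0])

print(np.linalg.matrix_rank(A))    # 3, not 4
print(np.allclose(A @ v, 0.0))     # True: v is in the null space
```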

(e)

Since each observation has exactly one of \(x_{n1} = 1\), \(x_{n2} = 1\), or \(x_{n3} = 1\), the fit can be written as

\[ \alpha + \boldsymbol{\beta}^\intercal\boldsymbol{x}_n = (\alpha + \beta_1) x_{n1} + (\alpha + \beta_2) x_{n2} + (\alpha + \beta_3) x_{n3}, \]

so any coefficients with the same values of \(\alpha + \beta_1\), \(\alpha + \beta_2\), and \(\alpha + \beta_3\) give the same fit. Three distinct choices that also happen to solve the least squares problem are

\[ \begin{aligned} (\alpha, \beta_1, \beta_2, \beta_3)^\intercal ={}& (0, \hat{\boldsymbol{\beta}}^\intercal)^\intercal + (0, 0, 0, 0)^\intercal \\ (\alpha, \beta_1, \beta_2, \beta_3)^\intercal ={}& (0, \hat{\boldsymbol{\beta}}^\intercal)^\intercal + (1, -1, -1, -1)^\intercal \\ (\alpha, \beta_1, \beta_2, \beta_3)^\intercal ={}& (0, \hat{\boldsymbol{\beta}}^\intercal)^\intercal + (2, -2, -2, -2)^\intercal, \end{aligned} \]

where \(\hat{\boldsymbol{\beta}}\) is the vector of level means from (c). These are all of the form \((0, \hat{\boldsymbol{\beta}}^\intercal)^\intercal + C \boldsymbol{v}\) for \(C = 0\), \(C = 1\), and \(C = 2\), as they must be, where \(\boldsymbol{v}\) is the null vector from (d).
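
A quick check that these coefficient vectors produce identical fitted values; numpy and the simulated data are illustration-only assumptions.

```python
# The coefficient vectors (0, beta_hat) + C * v all give the same fit.
import numpy as np

rng = np.random.default_rng(7)
z = rng.choice(["a", "b", "c"], size=40)
y = rng.normal(size=40)

levels = np.array(["a", "b", "c"])
X = (z[:, None] == levels[None, :]).astype(float)
Xp = np.column_stack([np.ones(len(z)), X])

beta_hat = np.array([y[z == lev].mean() for lev in levels])   # level means, from (c)
v = np.array([1.0, -1.0, -1.0, -1.0])

fits = [Xp @ (np.concatenate([[0.0], beta_hat]) + C * v) for C in (0, 1, 2)]
print(np.allclose(fits[0], fits[1]) and np.allclose(fits[0], fits[2]))   # True
```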

4 Correlated regressors

Suppose that \(y_n = \boldsymbol{x}_n^\intercal\boldsymbol{\beta}+ \varepsilon_n\) for some \(\boldsymbol{\beta}\). Suppose that \(\mathbb{E}\left[\varepsilon_n\right] = 0\) and \(\mathrm{Var}\left(\varepsilon_n\right) = \sigma^2\), and \(\varepsilon_n\) are independent of each other and the \(\boldsymbol{x}_n\).

Let \(\boldsymbol{x}_n \in \mathbb{R}^{2}\), where

  • \(\boldsymbol{x}_n\) is independent of \(\boldsymbol{x}_m\) for \(n \ne m\),
  • \(\mathbb{E}\left[\boldsymbol{x}_{n1}\right] = \mathbb{E}\left[\boldsymbol{x}_{n2}\right] = 0\),
  • \(\mathrm{Var}\left(\boldsymbol{x}_{n1}\right) = \mathrm{Var}\left(\boldsymbol{x}_{n2}\right) = 1\), and
  • \(\mathbb{E}\left[\boldsymbol{x}_{n1} \boldsymbol{x}_{n2}\right] = \rho\).

(a)

If \(\left|\rho\right| < 1\), is \(\boldsymbol{X}^\intercal\boldsymbol{X}\) always, sometimes, or never invertible?

(b)

If \(\left|\rho\right| = 1\), is \(\boldsymbol{X}^\intercal\boldsymbol{X}\) always, sometimes, or never invertible?

(c)

What is \(\lim_{N \rightarrow \infty} \frac{1}{N} \boldsymbol{X}^\intercal\boldsymbol{X}\)? When is the limit invertible?

(d)

State intuitively why there is no unique \(\hat{\boldsymbol{\beta}}\) when \(\rho = 1\). When \(\rho = 1\), give two distinct values of \(\boldsymbol{\beta}\) that result in the same fit \(\boldsymbol{\beta}^\intercal\boldsymbol{x}_n\).

Solutions

(a)

It may be that \(\boldsymbol{X}^\intercal\boldsymbol{X}\) is non–invertible by chance, depending on the distribution of \(\boldsymbol{x}_n\). There are distributions for which \(\boldsymbol{X}^\intercal\boldsymbol{X}\) is invertible with probability one when \(N \ge 2\), such as independent normals. But without extra information like this, the right answer is “sometimes.”

(b)

If \(\rho = 1\), then \(\boldsymbol{x}_{n1} = \boldsymbol{x}_{n2}\) almost surely, because the difference between them has zero variance; similarly, if \(\rho = -1\), then \(\boldsymbol{x}_{n1} = -\boldsymbol{x}_{n2}\) almost surely. In either case the two columns of \(\boldsymbol{X}\) are linearly dependent, so \(\boldsymbol{X}^\intercal\boldsymbol{X}\) is never invertible (non–invertible with probability one).

(c)

By the LLN,

\[ \lim_{N \rightarrow \infty} \frac{1}{N} \boldsymbol{X}^\intercal\boldsymbol{X}= \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}. \]

The limit is invertible if and only if \(\left|\rho\right| < 1\).

(d)

If \(\rho = 1\), then \(\boldsymbol{x}_{n1} = \boldsymbol{x}_{n2}\) with probability one, and we can write \(\boldsymbol{\beta}^\intercal\boldsymbol{x}_n = (\beta_1 + \beta_2) \boldsymbol{x}_{n1}\). Intuitively, we cannot distinguish a relationship between \(y_n\) and the two components of \(\boldsymbol{x}_n\) because the two components are numerically the same. Three values that give the same fit are \(\boldsymbol{\beta}= (0, 0)^\intercal\), \(\boldsymbol{\beta}= (1, -1)^\intercal\), and \(\boldsymbol{\beta}= (-1, 1)^\intercal\).
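
Both (c) and (d) can be illustrated with a short simulation. The bivariate normal regressors and numpy are assumptions made only for illustration; any distribution with the stated moments would do for (c).

```python
# (c): (1/N) X^T X approaches [[1, rho], [rho, 1]].
# (d): with rho = 1 the columns coincide, and different betas give the same fit.
import numpy as np

rng = np.random.default_rng(8)
N, rho = 200_000, 0.6
cov = np.array([[1.0, rho], [rho, 1.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=N)

print(X.T @ X / N)                            # approximately [[1, 0.6], [0.6, 1]]

X1 = np.column_stack([X[:, 0], X[:, 0]])      # rho = 1: identical columns
print(np.linalg.matrix_rank(X1.T @ X1))       # 1: not invertible
for beta in ([0.0, 0.0], [1.0, -1.0], [-1.0, 1.0]):
    print(np.allclose(X1 @ np.array(beta), 0.0))   # True for all three
```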

5 Matrix square roots

In the last homework, we proved that if \(\boldsymbol{A}\) is a square symmetric matrix with eigenvalues \(\lambda_p\) and eigenvectors \(\boldsymbol{u}_p\), then we can write \(\boldsymbol{A}= \boldsymbol{U}\Lambda \boldsymbol{U}^\intercal\), where \(\boldsymbol{U}= (\boldsymbol{u}_1 \ldots \boldsymbol{u}_p)\) has \(\boldsymbol{u}_p\) in its \(p\)–th column, and \(\Lambda\) is diagonal with \(\lambda_p\) in the \(p\)–th diagonal entry. We also have that the eigenvectors can be taken to be orthonormal without loss of generality.

We will additionally assume that \(\boldsymbol{A}= \boldsymbol{X}^\intercal\boldsymbol{X}\) for some (possibly non–square) matrix \(\boldsymbol{X}\).

Define \(\Lambda^{1/2}\) to be the diagonal matrix with \(\sqrt{\lambda_p}\) on the \(p\)–th diagonal.

  • Prove that, since \(\boldsymbol{A}= \boldsymbol{X}^\intercal\boldsymbol{X}\), \(\lambda_p \ge 0\), and so \(\Lambda^{1/2}\) is always real–valued.
    When the eigenvalues are non–negative, we say that \(\boldsymbol{A}\) is “positive semi–definite.” (Hint: using the fact that \(\lambda_p = \boldsymbol{u}_p^\intercal\boldsymbol{A}\boldsymbol{u}_p\), show that \(\lambda_p\) is the square of something.)
  • Show that if we take \(\boldsymbol{Q}= \boldsymbol{U}\Lambda^{1/2} \boldsymbol{U}^\intercal\) then \(\boldsymbol{A}= \boldsymbol{Q}\boldsymbol{Q}^\intercal\). We say that \(\boldsymbol{Q}\) is a “matrix square root” of \(\boldsymbol{A}\).
  • Show that we also have \(\boldsymbol{A}= \boldsymbol{Q}\boldsymbol{Q}\) (without the second transpose).
  • Show that, if \(\boldsymbol{V}\) is any orthonormal matrix (a square matrix with orthonormal columns), then \(\boldsymbol{Q}' = \boldsymbol{Q}\boldsymbol{V}\ne \boldsymbol{Q}\) also satisfies \(\boldsymbol{A}= \boldsymbol{Q}' \boldsymbol{Q}'^\intercal\). This shows that the matrix square root is not unique. (This fact can be thought of as the matrix analogue of the fact that \(4 = 2 \cdot 2\) but also \(4 = (-2) \cdot (-2)\)).
  • Show that if \(\lambda_p > 0\) then \(\boldsymbol{Q}\) is invertible.
  • Show that, if \(\boldsymbol{Q}\) is invertible, then the columns of \(\boldsymbol{X}\boldsymbol{Q}^{-1}\) are orthonormal. (Hint: show that \((\boldsymbol{X}\boldsymbol{Q}^{-1})^\intercal(\boldsymbol{X}\boldsymbol{Q}^{-1})\) is the identity matrix.)

Solutions

(1)

Suppose that \(\boldsymbol{v}\) is an eigenvector with eigenvalue \(\lambda\). Then

\[ \lambda \left\Vert\boldsymbol{v}\right\Vert^2 = \boldsymbol{v}^\intercal\boldsymbol{A}\boldsymbol{v}= \boldsymbol{v}^\intercal\boldsymbol{X}^\intercal\boldsymbol{X}\boldsymbol{v}= \left\Vert\boldsymbol{X}\boldsymbol{v}\right\Vert^2 \ge 0. \]

Since an eigenvector is nonzero by definition, \(\left\Vert\boldsymbol{v}\right\Vert^2 > 0\), and so we must have \(\lambda \ge 0\) as well.

(2)

We already know that \(\Lambda^{1/2} \Lambda^{1/2} = \Lambda\) because the matrices are diagonal. Because the eigenvector matrices are orthonormal,

\[ \boldsymbol{Q}\boldsymbol{Q}^\intercal= \boldsymbol{U}\Lambda^{1/2} \boldsymbol{U}^\intercal\boldsymbol{U}\Lambda^{1/2} \boldsymbol{U}^\intercal= \boldsymbol{U}\Lambda^{1/2} \Lambda^{1/2} \boldsymbol{U}^\intercal= \boldsymbol{U}\Lambda \boldsymbol{U}^\intercal= \boldsymbol{A}. \]
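
A numerical illustration of the first three parts, assuming numpy and a random \(\boldsymbol{X}\):

```python
# Build Q = U Lambda^{1/2} U^T from the eigendecomposition of A = X^T X and
# confirm that the eigenvalues are nonnegative and that Q Q^T = Q Q = A.
import numpy as np

rng = np.random.default_rng(9)
X = rng.normal(size=(20, 4))
A = X.T @ X

lam, U = np.linalg.eigh(A)                  # symmetric eigendecomposition
Q = U @ np.diag(np.sqrt(lam)) @ U.T         # real because lam >= 0

print(np.all(lam >= -1e-10))                # True: A is positive semi-definite
print(np.allclose(Q @ Q.T, A))              # True
print(np.allclose(Q @ Q, A))                # True (Q is symmetric)
```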

(3)

We can show directly that \(\boldsymbol{Q}\) is symmetric: since \(\Lambda^{1/2}\) is diagonal, \(\boldsymbol{Q}^\intercal= (\boldsymbol{U}\Lambda^{1/2} \boldsymbol{U}^\intercal)^\intercal= \boldsymbol{U}\Lambda^{1/2} \boldsymbol{U}^\intercal= \boldsymbol{Q}\). Therefore \(\boldsymbol{Q}\boldsymbol{Q}= \boldsymbol{Q}\boldsymbol{Q}^\intercal= \boldsymbol{A}\).

(4)

\[ \boldsymbol{Q}' \boldsymbol{Q}'^\intercal= \boldsymbol{Q}\boldsymbol{V}\boldsymbol{V}^\intercal\boldsymbol{Q}^\intercal= \boldsymbol{Q}\boldsymbol{Q}^\intercal= \boldsymbol{A}, \]

since \(\boldsymbol{V}^\intercal= \boldsymbol{V}^{-1}\) and left and right inverses are the same.

(5)

If every \(\lambda_p > 0\), the inverse is \(\boldsymbol{Q}^{-1} = \boldsymbol{U}\Lambda^{-1/2} \boldsymbol{U}^\intercal\), where \(\Lambda^{-1/2}\) is diagonal with \(1/\sqrt{\lambda_p}\) on the \(p\)–th diagonal. This can be verified by direct multiplication: \(\boldsymbol{Q}\boldsymbol{Q}^{-1} = \boldsymbol{U}\Lambda^{1/2} \boldsymbol{U}^\intercal\boldsymbol{U}\Lambda^{-1/2} \boldsymbol{U}^\intercal= \boldsymbol{U}\boldsymbol{U}^\intercal= \boldsymbol{I}\).

(6)

Since \(\boldsymbol{Q}\) is symmetric, so is \(\boldsymbol{Q}^{-1}\), and \(\boldsymbol{A}= \boldsymbol{X}^\intercal\boldsymbol{X}= \boldsymbol{Q}\boldsymbol{Q}\). Therefore

\[ \begin{aligned} (\boldsymbol{X}\boldsymbol{Q}^{-1})^\intercal(\boldsymbol{X}\boldsymbol{Q}^{-1}) ={}& \boldsymbol{Q}^{-1} \boldsymbol{X}^\intercal\boldsymbol{X}\boldsymbol{Q}^{-1} \\={}& \boldsymbol{Q}^{-1} \boldsymbol{Q}\boldsymbol{Q}\boldsymbol{Q}^{-1} \\={}& \boldsymbol{I}. \end{aligned} \]
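
Finally, a numerical check of the last part, again assuming numpy and a random \(\boldsymbol{X}\) whose Gram matrix has strictly positive eigenvalues (which holds with probability one here):

```python
# The columns of X Q^{-1} are orthonormal when all eigenvalues of X^T X are positive.
import numpy as np

rng = np.random.default_rng(10)
X = rng.normal(size=(30, 5))
A = X.T @ X

lam, U = np.linalg.eigh(A)
Q = U @ np.diag(np.sqrt(lam)) @ U.T
Q_inv = U @ np.diag(1.0 / np.sqrt(lam)) @ U.T   # Q^{-1}, valid since lam > 0

Z = X @ Q_inv
print(np.allclose(Z.T @ Z, np.eye(5)))          # True: orthonormal columns
```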