Vector- and matrix-valued statistics and limit theorems.

Goals

  • Introduce / review multivariate probability
    • The law of large numbers for vectors and matrices
    • The multivariate normal distribution
    • The central limit theorem for vectors

Normal random variables

There is a distribution, called the “Normal” or “Gaussian” distribution, that plays a special role in statistics, largely due to the Central Limit Theorem (which we will discuss shortly). Specifically, $z$ is a “standard normal random variable” if

  • $\mathbb{E}[z] = 0$
  • $\mathrm{Var}(z) = 1$
  • $p(z) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{z^2}{2}\right)$.

For this class, the particular form of the density will not be too important, except possibly when doing Bayesian statistics. But importantly, if $z$ is standard normal, then we can accurately calculate the probability $P(z \le \tilde{z})$ for any $\tilde{z} \in \mathbb{R}$.

Exercise

From $P(z \le \tilde{z})$, show how to compute $P(z \ge \tilde{z})$ and $P(|z| \le \tilde{z})$ for any $\tilde{z}$.

For any $\mu$ and $\sigma$, we say that the random variable $x = \mu + \sigma z$ is also normal, but with

  • $\mathbb{E}[x] = \mu$
  • $\mathrm{Var}(x) = \sigma^2$
  • $p(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$.
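
For example, if $z$ is standard normal, then $x = 2 + 3z$ has mean $\mathbb{E}[x] = 2$ and variance $\mathrm{Var}(x) = 3^2 = 9$.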

Further, if we have two independent normal random variables, $u$ and $v$, the sum $u + v$ is itself normal, with mean and variance given by the ordinary rules of expectation. Recall that independence means things like:

  • $\mathbb{E}[uv] = \mathbb{E}[u]\,\mathbb{E}[v]$
  • $\mathbb{E}[u \mid v] = \mathbb{E}[u]$
  • $P(u \in A \text{ and } v \in B) = P(u \in A)\,P(v \in B)$,

and so on.
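
For example, if $u$ and $v$ are independent normal random variables, then $u + v$ is normal with mean $\mathbb{E}[u] + \mathbb{E}[v]$ and variance $\mathrm{Var}(u) + \mathrm{Var}(v)$, where the formula for the variance uses the independence of $u$ and $v$.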

Notation

I will write $x \sim \mathcal{N}(\mu, \sigma^2)$ to mean that $x$ is a normal random variable with mean $\mu$ and variance $\sigma^2$.

Vector-valued random variables

The bivariate normal distribution

Suppose that $u_1$ and $u_2$ are independent standard normal random variables, and define the new random variables

\begin{align*}
x_1 &:= a_{11} u_1 + a_{12} u_2 \\
x_2 &:= a_{21} u_1 + a_{22} u_2.
\end{align*}

We can see that $x_1$ and $x_2$ are each normal random variables, and we can compute their means and variances:

\begin{align*}
\mathbb{E}[x_1] &= a_{11}\,\mathbb{E}[u_1] + a_{12}\,\mathbb{E}[u_2] = 0 \\
\mathbb{E}[x_2] &= a_{21}\,\mathbb{E}[u_1] + a_{22}\,\mathbb{E}[u_2] = 0 \\
\mathrm{Var}(x_1) &= a_{11}^2\,\mathrm{Var}(u_1) + a_{12}^2\,\mathrm{Var}(u_2) = a_{11}^2 + a_{12}^2 \\
\mathrm{Var}(x_2) &= a_{21}^2\,\mathrm{Var}(u_1) + a_{22}^2\,\mathrm{Var}(u_2) = a_{21}^2 + a_{22}^2.
\end{align*}

But in general $x_1$ and $x_2$ are not independent. If they were, we would have $\mathbb{E}[x_1 x_2] = \mathbb{E}[x_1]\,\mathbb{E}[x_2] = 0$, but in fact we have

\begin{align*}
\mathbb{E}[x_1 x_2] &= \mathbb{E}[(a_{11} u_1 + a_{12} u_2)(a_{21} u_1 + a_{22} u_2)] \\
&= a_{11} a_{21}\,\mathbb{E}[u_1^2] + (a_{11} a_{22} + a_{12} a_{21})\,\mathbb{E}[u_1 u_2] + a_{12} a_{22}\,\mathbb{E}[u_2^2] \\
&= a_{11} a_{21} + a_{12} a_{22},
\end{align*}

since $\mathbb{E}[u_1 u_2] = \mathbb{E}[u_1]\,\mathbb{E}[u_2] = 0$ by independence, and $\mathbb{E}[u_1^2] = \mathbb{E}[u_2^2] = 1$.

The variables $x_1$ and $x_2$ are instances of the bivariate normal distribution, which is the two-dimensional analogue of the normal distribution. One can define the bivariate normal distribution in many ways. Here, we have defined it by taking linear combinations of independent univariate normal random variables. It turns out that every bivariate normal can be constructed this way, and for the purposes of this class, such a definition will be enough.

The properties we derived above can in fact be represented more succinctly in vector notation. We can write

$$
x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
  = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}
    \begin{pmatrix} u_1 \\ u_2 \end{pmatrix}
  =: A u.
$$

Note that the matrix $A$ is not random. Defining the expectation of a vector to be the vector of expectations of its entries, we get

$$\mathbb{E}[x] = A\,\mathbb{E}[u] = A \cdot 0 = 0,$$

just as above.

Defining the “variance” of $x$ is more subtle. Recall that a normal distribution is fully characterized by its mean and variance. We would like this to be the case for the bivariate normal as well. But for this it is not enough to know the marginal variances $\mathrm{Var}(x_1)$ and $\mathrm{Var}(x_2)$, since the “covariance” $\mathrm{Cov}(x_1, x_2)$ is also very important for the behavior of the random variable.

Notation

When dealing with two scalar- or vector-valued random variables, I will write the covariance as a function of two arguments, e.g., $\mathrm{Cov}(x_1, x_2)$. However, when speaking of only a single random variable I might write $\mathrm{Cov}(x)$ as a shorthand for the covariance of a vector with itself, e.g., $\mathrm{Cov}(x) = \mathrm{Cov}(x, x)$. I will try to reserve the variance $\mathrm{Var}(x)$ for only the variance of a single scalar random variable.

A convenient way to write all the covariances in a single expression is to define

\begin{align*}
\mathrm{Cov}(x) &= \mathbb{E}\left[(x - \mathbb{E}[x])(x - \mathbb{E}[x])^\top\right] \\
&= \mathbb{E}\left[\begin{pmatrix}
(x_1 - \mathbb{E}[x_1])^2 & (x_1 - \mathbb{E}[x_1])(x_2 - \mathbb{E}[x_2]) \\
(x_1 - \mathbb{E}[x_1])(x_2 - \mathbb{E}[x_2]) & (x_2 - \mathbb{E}[x_2])^2
\end{pmatrix}\right] \\
&= \begin{pmatrix}
\mathrm{Var}(x_1) & \mathrm{Cov}(x_1, x_2) \\
\mathrm{Cov}(x_1, x_2) & \mathrm{Var}(x_2)
\end{pmatrix}.
\end{align*}

Note that the dimension of the matrix inside the expectation is $2 \times 2$, since the transpose is on the second factor, making $(x - \mathbb{E}[x])(x - \mathbb{E}[x])^\top$ an outer product rather than an inner product. By expanding, we see that each entry of this “covariance matrix” is the covariance between two entries of the vector, and the diagonal contains the “marginal” variances.

This expression in fact allows quite convenient calculation of all the covariances above, since

$$\mathrm{Cov}(x) = \mathbb{E}\left[A u u^\top A^\top\right] = A\,\mathbb{E}\left[u u^\top\right] A^\top = A I A^\top = A A^\top.$$

You can readily verify that this expression matches the quantities derived manually above.
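
For instance, writing out the entries of $A A^\top$ in the $2 \times 2$ case gives

$$
A A^\top
= \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}
  \begin{pmatrix} a_{11} & a_{21} \\ a_{12} & a_{22} \end{pmatrix}
= \begin{pmatrix}
a_{11}^2 + a_{12}^2 & a_{11} a_{21} + a_{12} a_{22} \\
a_{11} a_{21} + a_{12} a_{22} & a_{21}^2 + a_{22}^2
\end{pmatrix},
$$

whose diagonal entries are $\mathrm{Var}(x_1)$ and $\mathrm{Var}(x_2)$, and whose off-diagonal entry is $\mathbb{E}[x_1 x_2] = \mathrm{Cov}(x_1, x_2)$.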

The multivariate normal distribution

We can readily generalize the previous section to $P$-dimensional vectors. Let $u$ denote a vector of $P$ independent standard normal random variables. Then define

$$x := A u + \mu.$$

Then we say that $x$ is a multivariate normal random variable with mean

$$\mathbb{E}[x] = A\,\mathbb{E}[u] + \mu = \mu$$

and

$$\mathrm{Cov}(x) = \mathbb{E}\left[(x - \mu)(x - \mu)^\top\right] = \mathbb{E}\left[(A u)(A u)^\top\right] = A A^\top,$$

writing

$$x \sim \mathcal{N}(\mu, A A^\top).$$

Suppose we want to design a multivariate normal with a given covariance matrix $\Sigma$. If we require that $\Sigma$ be positive semi-definite (see exercise), then we can take $A = \Sigma^{1/2}$, and use that to construct a multivariate normal with covariance $\Sigma$.

Notation

I will write $x \sim \mathcal{N}(\mu, \Sigma)$ to mean that $x$ is a multivariate normal random variable with mean $\mu$ and covariance matrix $\Sigma$.

Note that we will not typically go through the construction $x = \Sigma^{1/2} u$ — we’ll take for granted that we can do so. The construction in terms of univariate random variables is simply an easy way to define multivariate random variables without having to deal with multivariate densities.
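
As a concrete, purely illustrative sketch of this construction (the specific $\mu$ and $\Sigma$ below are made up for the example), one can simulate draws of $x \sim \mathcal{N}(\mu, \Sigma)$ by taking any matrix $A$ with $A A^\top = \Sigma$; here the Cholesky factor plays the role of $\Sigma^{1/2}$:

```python
import numpy as np

# Illustrative sketch: simulate x = A u + mu, where A A^T = Sigma.
# The particular mu and Sigma are arbitrary choices for this example.
rng = np.random.default_rng(seed=0)

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

# Any A with A A^T = Sigma works; the Cholesky factor is one convenient
# "square root" when Sigma is positive definite.
A = np.linalg.cholesky(Sigma)

u = rng.standard_normal(size=(100_000, 2))  # rows of independent standard normals
x = u @ A.T + mu                            # each row is one draw of x ~ N(mu, Sigma)

print(x.mean(axis=0))           # close to mu
print(np.cov(x, rowvar=False))  # close to Sigma
```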

Exercise

Show that if you have a vector-valued random variable whose covariance matrix is not positive semi-definite, then you can construct a univariate random variable with negative variance, which is impossible. It follows that every covariance matrix must be positive semi-definite.

A few useful properties follow immediately from properties of the univariate normal distribution:

  • The entries of a multivariate normal random vector are independent if and only if the covariance matrix is diagonal.
  • Any linear transformation of a multivariate normal random variable is itself multivariate normal (possibly with a different dimension).
  • $P(x \in S) = P\left(u \in \{\tilde{u} : A \tilde{u} + \mu \in S\}\right)$.

Properties of the multivariate normal

Since we have defined an $x \sim \mathcal{N}(\mu, \Sigma)$ random variable to be $x = \Sigma^{1/2} u + \mu$, where $u$ is a vector of independent standard normal random variables, we can derive the distribution of linear transformations of $x$ directly. In particular, suppose that $x$ is a $P$-vector, let $v$ be another $P$-vector, and let $A$ be a $P \times P$ matrix. Then

$$
v^\top x \sim \mathcal{N}\left(v^\top \mu,\; v^\top \Sigma v\right)
\quad\text{and}\quad
A x \sim \mathcal{N}\left(A \mu,\; A \Sigma A^\top\right).
$$

Exercise

Prove the preceding result.

In particular,

\begin{align*}
\mathbb{E}[v^\top x] &= v^\top \mu  &  \mathbb{E}[A x] &= A \mu \\
\mathrm{Cov}(v^\top x) &= v^\top \Sigma v  &  \mathrm{Cov}(A x) &= A \Sigma A^\top.
\end{align*}

As a special case, we can take $v = e_n$, the vector with a 1 in entry $n$ and 0 elsewhere, to get $e_n^\top x = x_n \sim \mathcal{N}(\mu_n, \Sigma_{nn})$, as expected.
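
As another small worked case (not derived above), taking $P = 2$ and $v = (1, 1)^\top$ gives the distribution of the sum of the components:

$$
x_1 + x_2 = v^\top x \sim \mathcal{N}\left(\mu_1 + \mu_2,\; \Sigma_{11} + 2\Sigma_{12} + \Sigma_{22}\right).
$$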

Generic random vectors

In general, we can consider a random vector (or a random matrix) as a collection of potentially non-independent random values.
For such a vector $x \in \mathbb{R}^P$, we can speak about its expectation and covariance just as we would for a multivariate normal distribution. Specifically,

$$
\mathbb{E}[x] = \begin{pmatrix} \mathbb{E}[x_1] \\ \vdots \\ \mathbb{E}[x_P] \end{pmatrix}
\quad\text{and}\quad
\mathrm{Cov}(x) = \mathbb{E}\left[(x - \mathbb{E}[x])(x - \mathbb{E}[x])^\top\right]
= \begin{pmatrix}
\mathrm{Var}(x_1) & \mathrm{Cov}(x_1, x_2) & \cdots & \mathrm{Cov}(x_1, x_P) \\
\mathrm{Cov}(x_2, x_1) & \mathrm{Var}(x_2) & \cdots & \mathrm{Cov}(x_2, x_P) \\
\vdots & & \ddots & \vdots \\
\mathrm{Cov}(x_P, x_1) & \cdots & & \mathrm{Var}(x_P)
\end{pmatrix}.
$$

For these expressions to exist, it suffices that $\mathbb{E}[x_p^2] < \infty$ for all $p \in \{1, \ldots, P\}$.
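
The reason is the Cauchy–Schwarz inequality: $|\mathrm{Cov}(x_p, x_q)| \le \sqrt{\mathrm{Var}(x_p)\,\mathrm{Var}(x_q)}$, so finite second moments guarantee that every entry of the covariance matrix is finite.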

We will at times talk about the expectation of a random matrix, $X$, which is simply the matrix of expectations. The notation for covariances of course doesn’t make sense for matrices, but it won’t be needed. (In fact, in cases where the covariance of a random matrix is needed in statistics, the matrix is typically stacked into a vector first.)

The law of large numbers for vectors

The law of large numbers is particularly simple for vectors: as long as the dimension $P$ stays fixed as $N \to \infty$, you can simply apply the LLN to each component separately. Suppose you’re given a sequence of vector-valued random variables, $x_n \in \mathbb{R}^P$. Write $\mathbb{E}[x_n] = \mu$ for every $n$. Then, as long as we can apply the LLN to each component, we get

$$
\bar{x} := \frac{1}{N} \sum_{n=1}^N x_n \rightarrow \mu \quad \text{as } N \rightarrow \infty.
$$

Extensions

Since we see that the shape of the vector doesn’t matter, we can also apply the LLN to matrices. For example, if $X_n$ is a sequence of random matrices with $\mathbb{E}[X_n] = A$, then

$$
\frac{1}{N} \sum_{n=1}^N X_n \rightarrow A.
$$
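
As a concrete instance (an example added here for illustration), if the $x_n$ are IID random vectors with $\mathbb{E}\left[x_n x_n^\top\right] = M$ entrywise finite, then applying the LLN entrywise to the random matrices $X_n := x_n x_n^\top$ gives

$$
\frac{1}{N} \sum_{n=1}^N x_n x_n^\top \rightarrow M.
$$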

We will also use (without proof) a theorem called the continuous mapping theorem, which says that, for a continuous function $f(\cdot)$,

$$
f\left(\frac{1}{N} \sum_{n=1}^N x_n\right) \rightarrow f(\mu).
$$

Note that the preceding statement is different from saying

$$
\frac{1}{N} \sum_{n=1}^N f(x_n) \rightarrow \mathbb{E}[f(x_n)],
$$

which may also be true, but which applies the LLN to the random variables $f(x_n)$.
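
A concrete illustration of the difference (with IID $x_n$): take $f(a) = \|a\|^2$. Then

$$
f\left(\frac{1}{N} \sum_{n=1}^N x_n\right) \rightarrow \|\mu\|^2,
\qquad\text{while}\qquad
\frac{1}{N} \sum_{n=1}^N \|x_n\|^2 \rightarrow \mathbb{E}\left[\|x_n\|^2\right] = \|\mu\|^2 + \sum_{p=1}^P \mathrm{Var}(x_{np}),
$$

so the two limits differ unless every component of $x_n$ has zero variance.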

The central limit theorem for vectors

We might hope that we can apply the CLT componentwise just as we did for the LLN, but we are not so lucky. It is true that, assuming (as before) that $\frac{1}{N} \sum_{n=1}^N \mathrm{Var}(x_{np}) \rightarrow v_p$ for each $p$,

$$
\sqrt{N}\,(\bar{x}_p - \mu_p) \rightarrow \mathcal{N}(0, v_p).
$$

However, this does not tell us about the joint behavior of the random variables. For the CLT, we have to consider the behavior of the whole vector. In particular, write

$$\Sigma_n := \mathrm{Cov}(x_n)$$

and assume that

$$
\frac{1}{N} \sum_{n=1}^N \Sigma_n \rightarrow \Sigma
$$

for each entry of the matrix, where $\Sigma$ is element-wise finite. (Note that this requires the average of each diagonal entry to converge!) Then

$$
\frac{1}{\sqrt{N}} \sum_{n=1}^N \left(x_n - \mathbb{E}[x_n]\right) \rightarrow z, \quad \text{where } z \sim \mathcal{N}(0, \Sigma).
$$
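
In the common special case where the $x_n$ are IID with mean $\mu$ and covariance $\Sigma$, this reduces to the more familiar statement

$$
\sqrt{N}\,(\bar{x} - \mu) \rightarrow z, \quad z \sim \mathcal{N}(0, \Sigma).
$$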

Finally, we note (again without proof) that the continuous mapping theorem applies to the CLT as well. That is, if $f(\cdot)$ is continuous, then

$$
f\left(\frac{1}{\sqrt{N}} \sum_{n=1}^N \left(x_n - \mathbb{E}[x_n]\right)\right) \rightarrow f(z).
$$

Examples of different joint behavior with the same marginal behavior

Here’s a simple example that illustrates the problem. Consider two settings, based on IID standard normal scalar variables $u_n$ and $v_n$.

Setting one

Take

$$
x_n = \begin{pmatrix} u_n \\ v_n \end{pmatrix}.
$$

We know that $\frac{1}{\sqrt{N}} \sum_{n=1}^N x_n$ is exactly multivariate normal with mean $0$ and covariance matrix $I$ for all $N$. So, a fortiori,

$$
\frac{1}{\sqrt{N}} \sum_{n=1}^N x_n \rightarrow \mathcal{N}(0, I).
$$

Setting two

Take

$$
y_n = \begin{pmatrix} u_n \\ u_n \end{pmatrix}.
$$

We know that $\frac{1}{\sqrt{N}} \sum_{n=1}^N y_n$ is exactly multivariate normal with mean $0$ and covariance matrix $\mathbf{1}\mathbf{1}^\top$ for all $N$.
So, a fortiori,

$$
\frac{1}{\sqrt{N}} \sum_{n=1}^N y_n \rightarrow \mathcal{N}(0, \mathbf{1}\mathbf{1}^\top).
$$

For both $x$ and $y$, each component of the scaled sum converges to a standard normal random variable. But for $x$ the two components were independent, and for $y$ they were perfectly dependent. The behavior of the marginals cannot tell you about the joint behavior.
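
As a purely illustrative simulation sketch of this point (not part of the notes' development), one can draw many replications of the scaled sums and compare their empirical covariance matrices:

```python
import numpy as np

# Same standard normal marginals, different joint behavior.
rng = np.random.default_rng(seed=0)
N, reps = 1_000, 5_000

u = rng.standard_normal((reps, N))
v = rng.standard_normal((reps, N))

# Setting one: x_n = (u_n, v_n).  One scaled sum (1/sqrt(N)) * sum_n x_n per replication.
x_scaled = np.stack([u.sum(axis=1), v.sum(axis=1)], axis=1) / np.sqrt(N)

# Setting two: y_n = (u_n, u_n).
y_scaled = np.stack([u.sum(axis=1), u.sum(axis=1)], axis=1) / np.sqrt(N)

print(np.cov(x_scaled, rowvar=False))  # close to the identity matrix
print(np.cov(y_scaled, rowvar=False))  # close to the all-ones matrix [[1, 1], [1, 1]]
```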