Vector and matrix-valued statistics and limit theorems.
# Goals

- Introduce / review multivariate probability
- The law of large numbers for vectors and matrices
- The multivariate normal distribution
- The central limit theorem for vectors
# Normal random variables
The "Normal" or "Gaussian" distribution plays a special role in statistics, largely due to the Central Limit Theorem (which we will discuss shortly). Specifically, $z$ is a "standard normal random variable" if

- $\mathbb{E}[z] = 0$,
- $\mathrm{Var}(z) = 1$,
- its density is $p(z) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{z^2}{2}\right)$.

For this class, the particular form of the density will not be too important, except possibly when doing Bayesian statistics. But importantly, if $z$ is standard normal, then we can accurately calculate the probability $P(z \le z_0)$ for any $z_0 \in \mathbb{R}$.
::: {.callout-warning title='Exercise'}
From $P(z \le z_0)$, show how to compute $P(z \ge z')$ and $P(|z| \le z'')$ for any $z'$ and $z''$.
:::
For any $\mu$ and $\sigma$, the random variable $x = \mu + \sigma z$ is also normal (though no longer standard), with

- $\mathbb{E}[x] = \mu$,
- $\mathrm{Var}(x) = \sigma^2$,
- density $p(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$.

Further, if we have two independent normal random variables, $u$ and $v$, the sum $u + v$ is itself normal, with mean and variance given by the ordinary rules of expectation. Recall that independence means things like:

- $\mathbb{E}[uv] = \mathbb{E}[u]\,\mathbb{E}[v]$,
- $\mathbb{E}[u \mid v] = \mathbb{E}[u]$,
- $P(u \in A \textrm{ and } v \in B) = P(u \in A)\,P(v \in B)$,

and so on.
::: {.callout-tip title='Notation'}
I will write $x \sim \mathcal{N}(\mu, \sigma^2)$ to mean that $x$ is a normal random variable with mean $\mu$ and variance $\sigma^2$.
:::
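As a quick sanity check of these facts, here is a minimal simulation sketch. It assumes `numpy` and `scipy` are available; the particular values of $\mu$, $\sigma$, and the seed are arbitrary illustrations, not part of the notes.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)
n_draws = 100_000

# Standard normal draws: mean 0, variance 1.
z = rng.normal(loc=0.0, scale=1.0, size=n_draws)
print(z.mean(), z.var())                 # close to 0 and 1

# x = mu + sigma * z is N(mu, sigma^2).
mu, sigma = 2.0, 3.0
x = mu + sigma * z
print(x.mean(), x.var())                 # close to 2 and 9

# P(z <= z0) is available numerically from the standard normal CDF.
z0 = 1.5
print(stats.norm.cdf(z0), np.mean(z <= z0))

# Sums of independent normals are normal; means and variances add.
u = rng.normal(1.0, 2.0, size=n_draws)   # N(1, 4)
v = rng.normal(-1.0, 1.0, size=n_draws)  # N(-1, 1)
print((u + v).mean(), (u + v).var())     # close to 0 and 5
```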
# Vector-valued random variables

## The bivariate normal distribution
Suppose that $u_1$ and $u_2$ are independent standard normal random variables, and define the new random variables
$$
\begin{aligned}
x_1 :={}& a_{11} u_1 + a_{12} u_2 \\
x_2 :={}& a_{21} u_1 + a_{22} u_2.
\end{aligned}
$$

We can see that $x_1$ and $x_2$ are each normal random variables, and we can compute their means and variances:
$$
\begin{aligned}
\mathbb{E}[x_1] ={}& a_{11} \mathbb{E}[u_1] + a_{12} \mathbb{E}[u_2] = 0 \\
\mathbb{E}[x_2] ={}& a_{21} \mathbb{E}[u_1] + a_{22} \mathbb{E}[u_2] = 0 \\
\mathrm{Var}(x_1) ={}& a_{11}^2 \mathrm{Var}(u_1) + a_{12}^2 \mathrm{Var}(u_2) = a_{11}^2 + a_{12}^2 \\
\mathrm{Var}(x_2) ={}& a_{21}^2 \mathrm{Var}(u_1) + a_{22}^2 \mathrm{Var}(u_2) = a_{21}^2 + a_{22}^2.
\end{aligned}
$$

But in general $x_1$ and $x_2$ are not independent. If they were, we would have $\mathbb{E}[x_1 x_2] = \mathbb{E}[x_1]\,\mathbb{E}[x_2] = 0$, but in fact, using $\mathbb{E}[u_1 u_2] = 0$,
$$
\mathbb{E}[x_1 x_2] =
\mathbb{E}\left[(a_{11} u_1 + a_{12} u_2)(a_{21} u_1 + a_{22} u_2)\right] =
a_{11} a_{21} \mathbb{E}[u_1^2] + a_{12} a_{22} \mathbb{E}[u_2^2] =
a_{11} a_{21} + a_{12} a_{22}.
$$

The variables $x_1$ and $x_2$ are instances of the *bivariate normal distribution*, which is the two-dimensional analogue of the normal distribution. One can define the bivariate normal distribution in many ways. Here, we have defined it by taking linear combinations of independent univariate normal distributions. It turns out this is always possible, and for the purposes of this class, we will see that it is enough to use such a definition.
The properties we derived above can be represented more succinctly in vector notation. We can write
$$
\mathbf{x} =
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} =
\begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}
\begin{pmatrix} u_1 \\ u_2 \end{pmatrix}
=: A \mathbf{u}.
$$

Note that the matrix $A$ is not random. Defining the expectation of a vector to be the vector of expectations of its entries, we get
$$
\mathbb{E}[\mathbf{x}] = A\,\mathbb{E}[\mathbf{u}] = A \mathbf{0} = \mathbf{0},
$$
just as above.

Defining the "variance" of $\mathbf{x}$ is more subtle. Recall that a normal distribution is fully characterized by its mean and variance. We would like the same to be true for the bivariate normal. But for this it is not enough to know the marginal variances $\mathrm{Var}(x_1)$ and $\mathrm{Var}(x_2)$, since the "covariance" $\mathrm{Cov}(x_1, x_2)$ is also very important for the behavior of the random variable.
::: {.callout-tip title='Notation'}
When dealing with two scalar- or vector-valued random variables, I will write the covariance as a function of two arguments, e.g., $\mathrm{Cov}(x_1, x_2)$. However, when speaking of only a single random variable I might write $\mathrm{Cov}(\mathbf{x})$ as a shorthand for the covariance of a vector with itself, e.g., $\mathrm{Cov}(\mathbf{x}) = \mathrm{Cov}(\mathbf{x}, \mathbf{x})$. I will try to reserve the variance $\mathrm{Var}(x)$ for the variance of a single scalar random variable.
:::
A convenient way to write all the covariances in a single expression is to define
$$
\begin{aligned}
\mathrm{Cov}(\mathbf{x}) ={}& \mathbb{E}\left[\left(\mathbf{x} - \mathbb{E}[\mathbf{x}]\right)\left(\mathbf{x} - \mathbb{E}[\mathbf{x}]\right)^\top\right] \\
={}& \mathbb{E}\begin{pmatrix}
(x_1 - \mathbb{E}[x_1])^2 & (x_1 - \mathbb{E}[x_1])(x_2 - \mathbb{E}[x_2]) \\
(x_1 - \mathbb{E}[x_1])(x_2 - \mathbb{E}[x_2]) & (x_2 - \mathbb{E}[x_2])^2
\end{pmatrix} \\
={}& \begin{pmatrix}
\mathrm{Var}(x_1) & \mathrm{Cov}(x_1, x_2) \\
\mathrm{Cov}(x_1, x_2) & \mathrm{Var}(x_2)
\end{pmatrix}.
\end{aligned}
$$

Note that the matrix inside the expectation is $2 \times 2$, since the transpose is on the right-hand factor. By expanding, we see that each entry of this "covariance matrix" is the covariance between two entries of the vector, and the diagonal contains the "marginal" variances.

This expression allows quite convenient calculation of all the covariances above, since
$$
\begin{aligned}
\mathrm{Cov}(\mathbf{x}) ={}& \mathbb{E}\left[A \mathbf{u} \mathbf{u}^\top A^\top\right] \\
={}& A\,\mathbb{E}\left[\mathbf{u} \mathbf{u}^\top\right] A^\top \\
={}& A I A^\top \\
={}& A A^\top.
\end{aligned}
$$

You can readily verify that this expression matches the ones derived manually above.
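A short simulation sketch of this calculation (assuming `numpy`; the matrix $A$ below is an arbitrary illustration):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
A = np.array([[1.0, 0.5],
              [0.3, 2.0]])

u = rng.normal(size=(100_000, 2))   # rows are independent draws of (u1, u2)
x = u @ A.T                         # each row is x = A u

print(x.mean(axis=0))               # close to (0, 0)
print(np.cov(x, rowvar=False))      # close to A @ A.T
print(A @ A.T)
```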
## The multivariate normal distribution
We can readily generalize the previous section to $P$-dimensional vectors. Let $\mathbf{u}$ denote a vector of $P$ independent standard normal random variables, and define
$$
\mathbf{x} := A \mathbf{u} + \boldsymbol{\mu}.
$$

Then we say that $\mathbf{x}$ is a multivariate normal random variable with mean
$$
\mathbb{E}[\mathbf{x}] = A\,\mathbb{E}[\mathbf{u}] + \boldsymbol{\mu} = \boldsymbol{\mu}
$$
and covariance
$$
\mathrm{Cov}(\mathbf{x}) =
\mathbb{E}\left[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^\top\right] =
\mathbb{E}\left[(A \mathbf{u})(A \mathbf{u})^\top\right] =
A A^\top,
$$
writing
$$
\mathbf{x} \sim \mathcal{N}\left(\boldsymbol{\mu}, A A^\top\right).
$$

Suppose we want to design a multivariate normal with a given covariance matrix $\Sigma$. If we require that $\Sigma$ be positive semi-definite (see the exercise below), then we can take $A = \Sigma^{1/2}$ and use that to construct a multivariate normal with covariance $\Sigma$.
::: {.callout-tip title='Notation'}
I will write $\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$ to mean that $\mathbf{x}$ is a multivariate normal random variable with mean $\boldsymbol{\mu}$ and covariance matrix $\Sigma$.
:::

Note that we will not typically go through the construction $\mathbf{x} = \Sigma^{1/2} \mathbf{u} + \boldsymbol{\mu}$ explicitly; we'll take for granted that we can do so. The construction in terms of univariate random variables is simply an easy way to define multivariate random variables without having to deal with multivariate densities.
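Here is a minimal sketch of this construction (assuming `numpy`; the particular $\boldsymbol{\mu}$ and $\Sigma$ are arbitrary illustrations). Any $A$ with $A A^\top = \Sigma$ works, so the code uses the Cholesky factor rather than the symmetric square root.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.4],
                  [0.0, 0.4, 1.5]])

A = np.linalg.cholesky(Sigma)      # lower triangular, with A @ A.T == Sigma
u = rng.normal(size=(200_000, 3))  # rows of independent standard normals
x = u @ A.T + mu                   # each row is a draw from N(mu, Sigma)

print(x.mean(axis=0))              # close to mu
print(np.cov(x, rowvar=False))     # close to Sigma
```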
::: {.callout-warning title='Exercise'}
Show that if you have a vector-valued random variable whose covariance matrix is not positive semi-definite, you can construct a univariate random variable with negative variance, which is impossible. It follows that every covariance matrix must be positive semi-definite.
:::
A few useful properties follow immediately from properties of the univariate normal distribution:

- The entries of a multivariate normal random vector are independent if and only if the covariance matrix is diagonal.
- Any linear combination of a multivariate normal random variable is itself multivariate normal (possibly with a different dimension).
- $P(\mathbf{x} \in S) = P\left(\mathbf{u} \in \{\mathbf{u} : A \mathbf{u} + \boldsymbol{\mu} \in S\}\right)$ for any (measurable) set $S$.
### Properties of the multivariate normal
Since we have defined an $\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$ random variable to be $\mathbf{x} = \Sigma^{1/2} \mathbf{u} + \boldsymbol{\mu}$, where $\mathbf{u}$ is a vector of independent standard normal random variables, we can derive the distribution of linear transformations of $\mathbf{x}$ directly. In particular, suppose that $\mathbf{x}$ is a $P$-vector, let $\mathbf{v}$ be another $P$-vector, and let $A$ be a $P \times P$ matrix. Then
$$
\mathbf{v}^\top \mathbf{x} \sim \mathcal{N}\left(\mathbf{v}^\top \boldsymbol{\mu},\, \mathbf{v}^\top \Sigma \mathbf{v}\right)
\quad\textrm{and}\quad
A \mathbf{x} \sim \mathcal{N}\left(A \boldsymbol{\mu},\, A \Sigma A^\top\right).
$$
::: {.callout-warning title='Exercise'}
Prove the preceding result.
:::
In particular,
$$
\begin{aligned}
\mathbb{E}\left[\mathbf{v}^\top \mathbf{x}\right] ={}& \mathbf{v}^\top \boldsymbol{\mu} \\
\mathbb{E}\left[A \mathbf{x}\right] ={}& A \boldsymbol{\mu} \\
\mathrm{Var}\left(\mathbf{v}^\top \mathbf{x}\right) ={}& \mathbf{v}^\top \Sigma \mathbf{v} \\
\mathrm{Cov}\left(A \mathbf{x}\right) ={}& A \Sigma A^\top.
\end{aligned}
$$

As a special case, we can take $\mathbf{v} = \mathbf{e}_p$, the vector with $1$ in entry $p$ and $0$ elsewhere, to get
$$
\mathbf{e}_p^\top \mathbf{x} = x_p \sim \mathcal{N}\left(\mu_p, \Sigma_{pp}\right),
$$
as expected.
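A quick numerical check of these identities (a sketch assuming `numpy`, reusing the hypothetical $\boldsymbol{\mu}$ and $\Sigma$ from the sampling sketch above):

```python
import numpy as np

rng = np.random.default_rng(seed=1)
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.4],
                  [0.0, 0.4, 1.5]])
x = rng.normal(size=(200_000, 3)) @ np.linalg.cholesky(Sigma).T + mu

v = np.array([1.0, 2.0, -1.0])
proj = x @ v                        # draws of v^T x

print(proj.mean(), v @ mu)          # empirical mean vs. v^T mu
print(proj.var(), v @ Sigma @ v)    # empirical variance vs. v^T Sigma v
```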
## Generic random vectors
In general, we can consider a random vector (or a random matrix) as a collection of potentially non-independent random values.

For such a vector $\mathbf{x} \in \mathbb{R}^P$, we can speak about its expectation and covariance just as we would for a multivariate normal distribution. Specifically,
$$
\mathbb{E}[\mathbf{x}] =
\begin{pmatrix}
\mathbb{E}[x_1] \\ \vdots \\ \mathbb{E}[x_P]
\end{pmatrix}
\quad\textrm{and}\quad
\mathrm{Cov}(\mathbf{x}) =
\mathbb{E}\left[\left(\mathbf{x} - \mathbb{E}[\mathbf{x}]\right)\left(\mathbf{x} - \mathbb{E}[\mathbf{x}]\right)^\top\right] =
\begin{pmatrix}
\mathrm{Var}(x_1) & \mathrm{Cov}(x_1, x_2) & \ldots & \mathrm{Cov}(x_1, x_P) \\
\mathrm{Cov}(x_2, x_1) & \ldots & \ldots & \mathrm{Cov}(x_2, x_P) \\
\vdots & & & \vdots \\
\mathrm{Cov}(x_P, x_1) & \ldots & \ldots & \mathrm{Var}(x_P)
\end{pmatrix}.
$$

For these expressions to exist, it suffices that $\mathbb{E}[x_p^2] < \infty$ for all $p \in \{1, \ldots, P\}$.

We will at times talk about the expectation of a random matrix, $X$, which is simply the matrix of expectations of its entries. The covariance notation of course doesn't make sense for matrices, but it won't be needed. (In fact, in cases where the covariance of a random matrix is needed in statistics, the matrix is typically stacked into a vector first.)
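As a small illustration (a sketch assuming `numpy`; the particular distributions are arbitrary), these definitions apply just as well to vectors that are not normal at all:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_draws = 200_000

# A 3-dimensional random vector with dependent, non-normal entries.
g = rng.gamma(shape=2.0, scale=1.0, size=n_draws)
x = np.column_stack([g,
                     g + rng.normal(size=n_draws),
                     rng.uniform(size=n_draws)])

print(x.mean(axis=0))            # estimate of E[x]
print(np.cov(x, rowvar=False))   # estimate of Cov(x); diagonal entries are variances
```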
## The law of large numbers for vectors
The law of large numbers is particularly simple for vectors: as long as the dimension $P$ stays fixed as $N \rightarrow \infty$, you can simply apply the LLN to each component separately. Suppose you're given a sequence of vector-valued random variables $\mathbf{x}_1, \ldots, \mathbf{x}_N \in \mathbb{R}^P$. Write $\mathbb{E}[\mathbf{x}_n] = \boldsymbol{\mu}_n$ and $\overline{\boldsymbol{\mu}} := \frac{1}{N} \sum_{n=1}^N \boldsymbol{\mu}_n$. Then, as long as we can apply the LLN to each component, we get
$$
\overline{\mathbf{x}} := \frac{1}{N} \sum_{n=1}^N \mathbf{x}_n \rightarrow \overline{\boldsymbol{\mu}}
\quad\textrm{as } N \rightarrow \infty.
$$
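A minimal sketch of the componentwise convergence (assuming `numpy`; here the $\mathbf{x}_n$ are IID with a hypothetical mean vector, and the maximum componentwise error shrinks as $N$ grows):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
mu = np.array([1.0, -1.0, 3.0])

for N in (100, 10_000, 1_000_000):
    # Shifted exponential draws so that E[x_n] = mu (not normal, which is fine).
    x = rng.exponential(scale=1.0, size=(N, 3)) + (mu - 1.0)
    print(N, np.abs(x.mean(axis=0) - mu).max())
```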
## Extensions
Since the shape of the vector doesn't matter, we can also apply the LLN to matrices. For example, if $X_1, \ldots, X_N$ is a sequence of random matrices with $\mathbb{E}[X_n] = A$, then
$$
\frac{1}{N} \sum_{n=1}^N X_n \rightarrow A.
$$

We will also use (without proof) a theorem called the **continuous mapping theorem**, which says that, for a continuous function $f(\cdot)$,
$$
f\left(\frac{1}{N} \sum_{n=1}^N \mathbf{x}_n\right) \rightarrow f\left(\overline{\boldsymbol{\mu}}\right).
$$

Note that the preceding statement is different from saying
$$
\frac{1}{N} \sum_{n=1}^N f\left(\mathbf{x}_n\right) \rightarrow \mathbb{E}\left[f\left(\mathbf{x}_n\right)\right],
$$
which may also be true, but which applies the LLN to the random variables $f(\mathbf{x}_n)$.
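The two statements generally have different limits. Here is a sketch (assuming `numpy`) with $f(\mathbf{w}) = w_1 w_2$ and correlated components, where $f(\overline{\mathbf{x}})$ approaches $\mu_1 \mu_2$ while the average of $f(\mathbf{x}_n)$ approaches $\mu_1 \mu_2 + \mathrm{Cov}(x_1, x_2)$:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
N = 1_000_000
u = rng.normal(size=N)
x = np.column_stack([2.0 + u, 3.0 + u])   # mean (2, 3), Cov(x1, x2) = 1

xbar = x.mean(axis=0)
print(xbar[0] * xbar[1])                  # f(x-bar): close to 2 * 3 = 6
print(np.mean(x[:, 0] * x[:, 1]))         # mean of f(x_n): close to 6 + 1 = 7
```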
## The central limit theorem for vectors
We might hope that we can apply the CLT componentwise just as we did for the LLN, but we are not so lucky. Writing $v_{np} := \mathrm{Var}(x_{np})$ and assuming (as before) that $\frac{1}{N} \sum_{n=1}^N v_{np} \rightarrow \overline{v}_p$ for each $p$, it is true that
$$
\sqrt{N} \left(\overline{x}_p - \overline{\mu}_p\right) \rightarrow \mathcal{N}\left(0, \overline{v}_p\right).
$$

However, this does not tell us about the *joint* behavior of the random variables. For the CLT, we have to consider the behavior of the whole vector. In particular, write
$$
\Sigma_n := \mathrm{Cov}(\mathbf{x}_n)
$$
and assume that
$$
\frac{1}{N} \sum_{n=1}^N \Sigma_n \rightarrow \overline{\Sigma}
$$
entrywise, where $\overline{\Sigma}$ is elementwise finite. (Note that this requires the average of each diagonal entry to converge!) Then
$$
\frac{1}{\sqrt{N}} \sum_{n=1}^N \left(\mathbf{x}_n - \mathbb{E}[\mathbf{x}_n]\right) \rightarrow \mathbf{z}
\quad\textrm{where}\quad
\mathbf{z} \sim \mathcal{N}\left(\mathbf{0}, \overline{\Sigma}\right).
$$

Finally, we note (again without proof) that the continuous mapping theorem applies to the CLT as well. That is, if $f(\cdot)$ is continuous, then
$$
f\left(\frac{1}{\sqrt{N}} \sum_{n=1}^N \left(\mathbf{x}_n - \mathbb{E}[\mathbf{x}_n]\right)\right) \rightarrow f(\mathbf{z}).
$$
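A simulation sketch of the vector CLT (assuming `numpy`; the distribution of $\mathbf{x}_n$ below is an arbitrary non-normal illustration with a known covariance, given in the code comment). Across many replications, the scaled, centered sum has approximately that covariance:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
N, reps = 2_000, 5_000

def scaled_centered_sum(rng, N):
    # x_n = (u_n, 0.5 u_n + w_n) with u_n, w_n mean-zero, unit-variance
    # (shifted exponentials), so Cov(x_n) = [[1, 0.5], [0.5, 1.25]].
    u = rng.exponential(size=N) - 1.0
    w = rng.exponential(size=N) - 1.0
    x = np.column_stack([u, 0.5 * u + w])
    return x.sum(axis=0) / np.sqrt(N)

draws = np.array([scaled_centered_sum(rng, N) for _ in range(reps)])
print(np.cov(draws, rowvar=False))   # close to [[1, 0.5], [0.5, 1.25]]
```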
## Examples of different joint behavior with the same marginal behavior

Here's a simple example that illustrates the problem. Consider two settings, based on IID standard normal scalar variables $u_n$ and $v_n$.
#### Setting one

Take
$$
\mathbf{x}_n = \begin{pmatrix} u_n \\ v_n \end{pmatrix}.
$$

We know that $\frac{1}{\sqrt{N}} \sum_{n=1}^N \mathbf{x}_n$ is exactly multivariate normal with mean $\mathbf{0}$ and covariance matrix $I$ for all $N$. So, *a fortiori*,
$$
\frac{1}{\sqrt{N}} \sum_{n=1}^N \mathbf{x}_n \rightarrow \mathcal{N}\left(\mathbf{0}, I\right).
$$
#### Setting two

Take
$$
\mathbf{y}_n = \begin{pmatrix} u_n \\ u_n \end{pmatrix}.
$$

We know that $\frac{1}{\sqrt{N}} \sum_{n=1}^N \mathbf{y}_n$ is exactly multivariate normal with mean $\mathbf{0}$ and covariance matrix $\mathbf{1}\mathbf{1}^\top$ for all $N$. So, *a fortiori*,
$$
\frac{1}{\sqrt{N}} \sum_{n=1}^N \mathbf{y}_n \rightarrow \mathcal{N}\left(\mathbf{0}, \mathbf{1}\mathbf{1}^\top\right).
$$

For both $\mathbf{x}_n$ and $\mathbf{y}_n$, each component converges to a standard normal random variable. But for $\mathbf{x}_n$ the two components are independent, and for $\mathbf{y}_n$ they are perfectly dependent. The behavior of the marginals cannot tell you about the joint behavior.
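A sketch comparing the two settings by simulation (assuming `numpy`): the marginal variances of the scaled sums match, but the empirical covariance matrices are very different.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
N, reps = 1_000, 5_000

sums_x, sums_y = [], []
for _ in range(reps):
    u = rng.normal(size=N)
    v = rng.normal(size=N)
    sums_x.append(np.array([u.sum(), v.sum()]) / np.sqrt(N))  # setting one
    sums_y.append(np.array([u.sum(), u.sum()]) / np.sqrt(N))  # setting two

print(np.cov(np.array(sums_x), rowvar=False))   # close to the identity
print(np.cov(np.array(sums_y), rowvar=False))   # close to a matrix of all ones
```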