Vector- and matrix-valued statistics and limit theorems.

Goals

  • Introduce / review multivariate probability
    • The law of large numbers for vectors and matrices
    • The multivariate normal distribution
    • The central limit theorem for vectors

Normal random variables

There is a distribution, called the “Normal” or “Gaussian” distribution, that plays a special role in statistics, largely due to the Central Limit Theorem (which we will discuss shortly). Specifically, $z$ is a “standard normal random variable” if

  • $\mathbb{E}[z] = 0$
  • $\mathrm{Var}(z) = 1$
  • $p(z) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{z^2}{2}\right)$.

For this class, the particular form of the density will not be too important, except possibly when doing Bayesian statistics. But importantly, if $z$ is standard normal, then we can accurately calculate the probability $P(z \le \tilde{z})$ for any $\tilde{z} \in \mathbb{R}$.

Exercise

From $P(z \le \tilde{z})$, show how to compute $P(z \ge \tilde{z})$ and $P(|z| \le \tilde{z})$ for any $\tilde{z}$.

For any $\mu$ and $\sigma$, we say that the random variable $x = \mu + \sigma z$ is also normal, but with

  • $\mathbb{E}[x] = \mu$
  • $\mathrm{Var}(x) = \sigma^2$
  • $p(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$.
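
For example, if $z$ is standard normal, then $x = 2 + 3z$ has mean $\mathbb{E}[x] = 2$ and variance $\mathrm{Var}(x) = 3^2 = 9$.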

Further, if we have two independent normal random variables, $u$ and $v$, the sum $u + v$ is itself normal, with mean and variance given by the ordinary rules of expectation. Recall that independence means things like:

  • $\mathbb{E}[uv] = \mathbb{E}[u]\,\mathbb{E}[v]$
  • $\mathbb{E}[u \mid v] = \mathbb{E}[u]$
  • $P(u \in A \text{ and } v \in B) = P(u \in A)\,P(v \in B)$,

and so on.
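
For example, if $u$ and $v$ are independent normal random variables, then $u + v$ is normal with mean $\mathbb{E}[u] + \mathbb{E}[v]$ and variance $\mathrm{Var}(u) + \mathrm{Var}(v)$, where the formula for the variance uses the independence of $u$ and $v$.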

Notation

I will write $x \sim \mathcal{N}(\mu, \sigma^2)$ to mean that $x$ is a normal random variable with mean $\mu$ and variance $\sigma^2$.

Vector-valued random variables

The bivariate normal distribution

Suppose that $u_1$ and $u_2$ are independent standard normal random variables, and define the new random variables

\begin{align*}
x_1 &:= a_{11} u_1 + a_{12} u_2 \\
x_2 &:= a_{21} u_1 + a_{22} u_2.
\end{align*}

We can see that $x_1$ and $x_2$ are each normal random variables, and we can compute their means and variances:

\begin{align*}
\mathbb{E}[x_1] &= a_{11}\,\mathbb{E}[u_1] + a_{12}\,\mathbb{E}[u_2] = 0 \\
\mathbb{E}[x_2] &= a_{21}\,\mathbb{E}[u_1] + a_{22}\,\mathbb{E}[u_2] = 0 \\
\mathrm{Var}(x_1) &= a_{11}^2\,\mathrm{Var}(u_1) + a_{12}^2\,\mathrm{Var}(u_2) = a_{11}^2 + a_{12}^2 \\
\mathrm{Var}(x_2) &= a_{21}^2\,\mathrm{Var}(u_1) + a_{22}^2\,\mathrm{Var}(u_2) = a_{21}^2 + a_{22}^2.
\end{align*}

But in general $x_1$ and $x_2$ are not independent. If they were, we would have $\mathbb{E}[x_1 x_2] = \mathbb{E}[x_1]\,\mathbb{E}[x_2] = 0$, but in fact we have

\begin{align*}
\mathbb{E}[x_1 x_2] &= \mathbb{E}[(a_{11} u_1 + a_{12} u_2)(a_{21} u_1 + a_{22} u_2)] \\
&= a_{11} a_{21}\,\mathbb{E}[u_1^2] + (a_{11} a_{22} + a_{12} a_{21})\,\mathbb{E}[u_1 u_2] + a_{12} a_{22}\,\mathbb{E}[u_2^2] \\
&= a_{11} a_{21} + a_{12} a_{22},
\end{align*}

since $\mathbb{E}[u_1 u_2] = \mathbb{E}[u_1]\,\mathbb{E}[u_2] = 0$ by independence, and $\mathbb{E}[u_1^2] = \mathbb{E}[u_2^2] = 1$.

The variables $x_1$ and $x_2$ are instances of the bivariate normal distribution, which is the two-dimensional analogue of the normal distribution. One can define the bivariate normal distribution in many ways. Here, we have defined it by taking linear combinations of independent univariate normal random variables. It turns out that every bivariate normal can be constructed this way, and for the purposes of this class, such a definition will be enough.

The properties we derived above can in fact be represented more succinctly in vector notation. We can write

$$
x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
  = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}
    \begin{pmatrix} u_1 \\ u_2 \end{pmatrix}
  =: A u.
$$

Note that the matrix $A$ is not random. Defining the expectation of a vector to be the vector of expectations of its entries, we get

$$\mathbb{E}[x] = A\,\mathbb{E}[u] = A \cdot 0 = 0,$$

just as above.

Defining the “variance” of $x$ is more subtle. Recall that a normal distribution is fully characterized by its mean and variance. We would like this to be the case for the bivariate normal as well. But for this it is not enough to know the marginal variances $\mathrm{Var}(x_1)$ and $\mathrm{Var}(x_2)$, since the “covariance” $\mathrm{Cov}(x_1, x_2)$ is also very important for the behavior of the random variable.

Notation

When dealing with two scalar- or vector-valued random variables, I will write the covariance as a function of two arguments, e.g., $\mathrm{Cov}(x_1, x_2)$. However, when speaking of only a single random variable I might write $\mathrm{Cov}(x)$ as a shorthand for the covariance of a vector with itself, e.g., $\mathrm{Cov}(x) = \mathrm{Cov}(x, x)$. I will try to reserve the variance $\mathrm{Var}(x)$ for only the variance of a single scalar random variable.

A convenient way to write all the covariances in a single expression is to define

\begin{align*}
\mathrm{Cov}(x) &= \mathbb{E}\left[(x - \mathbb{E}[x])(x - \mathbb{E}[x])^\top\right] \\
&= \mathbb{E}\left[\begin{pmatrix}
(x_1 - \mathbb{E}[x_1])^2 & (x_1 - \mathbb{E}[x_1])(x_2 - \mathbb{E}[x_2]) \\
(x_1 - \mathbb{E}[x_1])(x_2 - \mathbb{E}[x_2]) & (x_2 - \mathbb{E}[x_2])^2
\end{pmatrix}\right] \\
&= \begin{pmatrix}
\mathrm{Var}(x_1) & \mathrm{Cov}(x_1, x_2) \\
\mathrm{Cov}(x_1, x_2) & \mathrm{Var}(x_2)
\end{pmatrix}.
\end{align*}

Note that the dimension of the matrix inside the expectation is $2 \times 2$, since the transpose is on the second factor, making $(x - \mathbb{E}[x])(x - \mathbb{E}[x])^\top$ an outer product rather than an inner product. By expanding, we see that each entry of this “covariance matrix” is the covariance between two entries of the vector, and the diagonal contains the “marginal” variances.

This expression in fact allows quite convenient calculation of all the covariances above, since

$$\mathrm{Cov}(x) = \mathbb{E}\left[A u u^\top A^\top\right] = A\,\mathbb{E}\left[u u^\top\right] A^\top = A I A^\top = A A^\top.$$

You can readily verify that this expression matches the quantities derived manually above.
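
For instance, writing out the entries of $A A^\top$ in the $2 \times 2$ case gives

$$
A A^\top
= \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}
  \begin{pmatrix} a_{11} & a_{21} \\ a_{12} & a_{22} \end{pmatrix}
= \begin{pmatrix}
a_{11}^2 + a_{12}^2 & a_{11} a_{21} + a_{12} a_{22} \\
a_{11} a_{21} + a_{12} a_{22} & a_{21}^2 + a_{22}^2
\end{pmatrix},
$$

whose diagonal entries are $\mathrm{Var}(x_1)$ and $\mathrm{Var}(x_2)$, and whose off-diagonal entry is $\mathbb{E}[x_1 x_2] = \mathrm{Cov}(x_1, x_2)$.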

The multivariate normal distribution

We can readily generalize the previous section to $P$-dimensional vectors. Let $u$ denote a vector of $P$ independent standard normal random variables. Then define

$$x := A u + \mu.$$

Then we say that $x$ is a multivariate normal random variable with mean

$$\mathbb{E}[x] = A\,\mathbb{E}[u] + \mu = \mu$$

and

$$\mathrm{Cov}(x) = \mathbb{E}\left[(x - \mu)(x - \mu)^\top\right] = \mathbb{E}\left[(A u)(A u)^\top\right] = A A^\top,$$

writing

$$x \sim \mathcal{N}(\mu, A A^\top).$$

Suppose we want to design a multivariate normal with a given covariance matrix $\Sigma$. If we require that $\Sigma$ be positive semi-definite (see exercise), then we can take $A = \Sigma^{1/2}$, and use that to construct a multivariate normal with covariance $\Sigma$.

Notation

I will write $x \sim \mathcal{N}(\mu, \Sigma)$ to mean that $x$ is a multivariate normal random variable with mean $\mu$ and covariance matrix $\Sigma$.

Note that we will not typically go through the construction $x = \Sigma^{1/2} u$ — we’ll take for granted that we can do so. The construction in terms of univariate random variables is simply an easy way to define multivariate random variables without having to deal with multivariate densities.
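
As a concrete, purely illustrative sketch of this construction (the specific $\mu$ and $\Sigma$ below are made up for the example), one can simulate draws of $x \sim \mathcal{N}(\mu, \Sigma)$ by taking any matrix $A$ with $A A^\top = \Sigma$; here the Cholesky factor plays the role of $\Sigma^{1/2}$:

```python
import numpy as np

# Illustrative sketch: simulate x = A u + mu, where A A^T = Sigma.
# The particular mu and Sigma are arbitrary choices for this example.
rng = np.random.default_rng(seed=0)

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

# Any A with A A^T = Sigma works; the Cholesky factor is one convenient
# "square root" when Sigma is positive definite.
A = np.linalg.cholesky(Sigma)

u = rng.standard_normal(size=(100_000, 2))  # rows of independent standard normals
x = u @ A.T + mu                            # each row is one draw of x ~ N(mu, Sigma)

print(x.mean(axis=0))           # close to mu
print(np.cov(x, rowvar=False))  # close to Sigma
```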

Exercise

Show that if you have a vector-valued random variable whose covariance matrix is not positive semi-definite, then you can construct a univariate random variable with negative variance, which is impossible. It follows that every covariance matrix must be positive semi-definite.

A few useful properties follow immediately from properties of the univariate normal distribution:

  • The entries of a multivariate normal random vector are independent if and only if the covariance matrix is diagonal.
  • Any linear transformation of a multivariate normal random variable is itself multivariate normal (possibly with a different dimension).
  • $P(x \in S) = P\left(u \in \{\tilde{u} : A \tilde{u} + \mu \in S\}\right)$.

Properties of the multivariate normal

Since we have defined an $x \sim \mathcal{N}(\mu, \Sigma)$ random variable to be $x = \Sigma^{1/2} u + \mu$, where $u$ is a vector of independent standard normal random variables, we can derive the distribution of linear transformations of $x$ directly. In particular, suppose that $x$ is a $P$-vector, let $v$ be another $P$-vector, and let $A$ be a $P \times P$ matrix. Then

$$
v^\top x \sim \mathcal{N}\left(v^\top \mu,\; v^\top \Sigma v\right)
\quad\text{and}\quad
A x \sim \mathcal{N}\left(A \mu,\; A \Sigma A^\top\right).
$$

Exercise

Prove the preceding result.

In particular,

\begin{align*}
\mathbb{E}[v^\top x] &= v^\top \mu  &  \mathbb{E}[A x] &= A \mu \\
\mathrm{Cov}(v^\top x) &= v^\top \Sigma v  &  \mathrm{Cov}(A x) &= A \Sigma A^\top.
\end{align*}

As a special case, we can take $v = e_n$, the vector with a 1 in entry $n$ and 0 elsewhere, to get $e_n^\top x = x_n \sim \mathcal{N}(\mu_n, \Sigma_{nn})$, as expected.
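
As another small worked case (not derived above), taking $P = 2$ and $v = (1, 1)^\top$ gives the distribution of the sum of the components:

$$
x_1 + x_2 = v^\top x \sim \mathcal{N}\left(\mu_1 + \mu_2,\; \Sigma_{11} + 2\Sigma_{12} + \Sigma_{22}\right).
$$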

Generic random vectors

In general, we can consider a random vector (or a random matrix) as a collection of potentially non-independent random values.
For such a vector $x \in \mathbb{R}^P$, we can speak about its expectation and covariance just as we would for a multivariate normal distribution. Specifically,

$$
\mathbb{E}[x] = \begin{pmatrix} \mathbb{E}[x_1] \\ \vdots \\ \mathbb{E}[x_P] \end{pmatrix}
\quad\text{and}\quad
\mathrm{Cov}(x) = \mathbb{E}\left[(x - \mathbb{E}[x])(x - \mathbb{E}[x])^\top\right]
= \begin{pmatrix}
\mathrm{Var}(x_1) & \mathrm{Cov}(x_1, x_2) & \cdots & \mathrm{Cov}(x_1, x_P) \\
\mathrm{Cov}(x_2, x_1) & \mathrm{Var}(x_2) & \cdots & \mathrm{Cov}(x_2, x_P) \\
\vdots & & \ddots & \vdots \\
\mathrm{Cov}(x_P, x_1) & \cdots & & \mathrm{Var}(x_P)
\end{pmatrix}.
$$

For these expressions to exist, it suffices that $\mathbb{E}[x_p^2] < \infty$ for all $p \in \{1, \ldots, P\}$.
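
The reason is the Cauchy–Schwarz inequality: $|\mathrm{Cov}(x_p, x_q)| \le \sqrt{\mathrm{Var}(x_p)\,\mathrm{Var}(x_q)}$, so finite second moments guarantee that every entry of the covariance matrix is finite.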

We will at times talk about the expectation of a random matrix, $X$, which is simply the matrix of expectations. The notation for covariances of course doesn’t make sense for matrices, but it won’t be needed. (In fact, in cases where the covariance of a random matrix is needed in statistics, the matrix is typically stacked into a vector first.)

The law of large numbers for vectors

The law of large numbers is particularly simple for vectors: as long as the dimension $P$ stays fixed as $N \to \infty$, you can simply apply the LLN to each component separately. Suppose you’re given a sequence of vector-valued random variables, $x_n \in \mathbb{R}^P$. Write $\mathbb{E}[x_n] = \mu$ for every $n$. Then, as long as we can apply the LLN to each component, we get

$$
\bar{x} := \frac{1}{N} \sum_{n=1}^N x_n \rightarrow \mu \quad \text{as } N \rightarrow \infty.
$$

Extensions

Since we see that the shape of the vector doesn’t matter, we can also apply the LLN to matrices. For example, if $X_n$ is a sequence of random matrices with $\mathbb{E}[X_n] = A$, then

$$
\frac{1}{N} \sum_{n=1}^N X_n \rightarrow A.
$$
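
As a concrete instance (an example added here for illustration), if the $x_n$ are IID random vectors with $\mathbb{E}\left[x_n x_n^\top\right] = M$ entrywise finite, then applying the LLN entrywise to the random matrices $X_n := x_n x_n^\top$ gives

$$
\frac{1}{N} \sum_{n=1}^N x_n x_n^\top \rightarrow M.
$$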

We will also use (without proof) a theorem called the continuous mapping theorem, which says that, for a continuous function $f(\cdot)$,

$$
f\left(\frac{1}{N} \sum_{n=1}^N x_n\right) \rightarrow f(\mu).
$$

Note that the preceding statement is different from saying

$$
\frac{1}{N} \sum_{n=1}^N f(x_n) \rightarrow \mathbb{E}[f(x_n)],
$$

which may also be true, but which applies the LLN to the random variables $f(x_n)$.
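
A concrete illustration of the difference (with IID $x_n$): take $f(a) = \|a\|^2$. Then

$$
f\left(\frac{1}{N} \sum_{n=1}^N x_n\right) \rightarrow \|\mu\|^2,
\qquad\text{while}\qquad
\frac{1}{N} \sum_{n=1}^N \|x_n\|^2 \rightarrow \mathbb{E}\left[\|x_n\|^2\right] = \|\mu\|^2 + \sum_{p=1}^P \mathrm{Var}(x_{np}),
$$

so the two limits differ unless every component of $x_n$ has zero variance.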

The central limit theorem for vectors

We might hope that we can apply the CLT componentwise just as we did for the LLN, but we are not so lucky. It is true that, assuming (as before) that $\frac{1}{N} \sum_{n=1}^N \mathrm{Var}(x_{np}) \rightarrow v_p$ for each $p$,

$$
\sqrt{N}\,(\bar{x}_p - \mu_p) \rightarrow \mathcal{N}(0, v_p).
$$

However, this does not tell us about the joint behavior of the random variables. For the CLT, we have to consider the behavior of the whole vector. In particular, write

$$\Sigma_n := \mathrm{Cov}(x_n)$$

and assume that

$$
\frac{1}{N} \sum_{n=1}^N \Sigma_n \rightarrow \Sigma
$$

for each entry of the matrix, where $\Sigma$ is element-wise finite. (Note that this requires the average of each diagonal entry to converge!) Then

$$
\frac{1}{\sqrt{N}} \sum_{n=1}^N \left(x_n - \mathbb{E}[x_n]\right) \rightarrow z, \quad \text{where } z \sim \mathcal{N}(0, \Sigma).
$$
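
In the common special case where the $x_n$ are IID with mean $\mu$ and covariance $\Sigma$, this reduces to the more familiar statement

$$
\sqrt{N}\,(\bar{x} - \mu) \rightarrow z, \quad z \sim \mathcal{N}(0, \Sigma).
$$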

Finally, we note (again without proof) that the continuous mapping theorem applies to the CLT as well. That is, if $f(\cdot)$ is continuous, then

$$
f\left(\frac{1}{\sqrt{N}} \sum_{n=1}^N \left(x_n - \mathbb{E}[x_n]\right)\right) \rightarrow f(z).
$$

Examples of different joint behavior with the same marginal behavior

Here’s a simple example that illustrates the problem. Consider two settings, based on IID standard normal scalar variables $u_n$ and $v_n$.

Setting one

Take

$$
x_n = \begin{pmatrix} u_n \\ v_n \end{pmatrix}.
$$

We know that $\frac{1}{\sqrt{N}} \sum_{n=1}^N x_n$ is exactly multivariate normal with mean $0$ and covariance matrix $I$ for all $N$. So, a fortiori,

$$
\frac{1}{\sqrt{N}} \sum_{n=1}^N x_n \rightarrow \mathcal{N}(0, I).
$$

Setting two

Take

$$
y_n = \begin{pmatrix} u_n \\ u_n \end{pmatrix}.
$$

We know that $\frac{1}{\sqrt{N}} \sum_{n=1}^N y_n$ is exactly multivariate normal with mean $0$ and covariance matrix $\mathbf{1}\mathbf{1}^\top$ for all $N$.
So, a fortiori,

$$
\frac{1}{\sqrt{N}} \sum_{n=1}^N y_n \rightarrow \mathcal{N}(0, \mathbf{1}\mathbf{1}^\top).
$$

For both $x$ and $y$, each component of the scaled sum converges to a standard normal random variable. But for $x$ the two components were independent, and for $y$ they were perfectly dependent. The behavior of the marginals cannot tell you about the joint behavior.
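
As a purely illustrative simulation sketch of this point (not part of the notes' development), one can draw many replications of the scaled sums and compare their empirical covariance matrices:

```python
import numpy as np

# Same standard normal marginals, different joint behavior.
rng = np.random.default_rng(seed=0)
N, reps = 1_000, 5_000

u = rng.standard_normal((reps, N))
v = rng.standard_normal((reps, N))

# Setting one: x_n = (u_n, v_n).  One scaled sum (1/sqrt(N)) * sum_n x_n per replication.
x_scaled = np.stack([u.sum(axis=1), v.sum(axis=1)], axis=1) / np.sqrt(N)

# Setting two: y_n = (u_n, u_n).
y_scaled = np.stack([u.sum(axis=1), u.sum(axis=1)], axis=1) / np.sqrt(N)

print(np.cov(x_scaled, rowvar=False))  # close to the identity matrix
print(np.cov(y_scaled, rowvar=False))  # close to the all-ones matrix [[1, 1], [1, 1]]
```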