B.1 Matrix Calculus
In the settings of machine learning and computational neuroscience, derivatives often appear in equations with matrices and vectors. Although it is always possible to re-express these equations in terms of sums of simpler derivatives, evaluating such expressions can be extremely tedious. It is therefore quite useful to have at hand rules for applying the derivatives directly to the matrices and vectors. The results are both easier to execute and more economically expressed. In defining these “matrix derivatives,” however, some care is required to ensure that the usual formulations of the rules of scalar calculus—the chain rule, product rule, etc.—are preserved. We do that here.
Throughout, we treat vectors with a transpose (denoted with a superscript $^{\mathrm{T}}$) as rows, and vectors without as columns.
B.1.1 Derivatives with respect to vectors
We conceptualize this fundamental operation as applying to vectors and yielding matrices. Application to scalars—or rather, scalar-valued functions—is then defined as a special case. Application to matrices (and higher-order tensors) is undefined.
The central idea in our definition is that the dimensions of the derivative must match the dimensions of the resulting matrix. In particular, we allow derivatives with respect to both column and row vectors; however:
1. In derivatives of a vector with respect to a vector, the two vectors must have opposite orientations; that is, we can take $\frac{d\mathbf{y}}{d\mathbf{x}^{\mathrm{T}}}$ and $\frac{d\mathbf{y}^{\mathrm{T}}}{d\mathbf{x}}$, but not $\frac{d\mathbf{y}}{d\mathbf{x}}$ or $\frac{d\mathbf{y}^{\mathrm{T}}}{d\mathbf{x}^{\mathrm{T}}}$. They are defined according to
\[ \frac{d\mathbf{y}}{d\mathbf{x}^{\mathrm{T}}} := \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_N} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_M}{\partial x_1} & \cdots & \frac{\partial y_M}{\partial x_N} \end{bmatrix}, \qquad \frac{d\mathbf{y}^{\mathrm{T}}}{d\mathbf{x}} := \left(\frac{d\mathbf{y}}{d\mathbf{x}^{\mathrm{T}}}\right)^{\mathrm{T}}, \]
the Jacobian and its transpose.
Thus, the transformation of “shapes” behaves like an outer product: if $\mathbf{y}$ has length $M$ and $\mathbf{x}$ has length $N$, then $\frac{d\mathbf{y}}{d\mathbf{x}^{\mathrm{T}}}$ is $M \times N$ and $\frac{d\mathbf{y}^{\mathrm{T}}}{d\mathbf{x}}$ is $N \times M$.
Several special cases warrant attention. Consider the linear vector-valued function $\mathbf{y}(\mathbf{x}) = \mathbf{A}\mathbf{x}$. Since $\mathbf{y}$ is a column, the derivative must be with respect to a row. In particular:
\[ \frac{d\mathbf{y}}{d\mathbf{x}^{\mathrm{T}}} = \frac{d(\mathbf{A}\mathbf{x})}{d\mathbf{x}^{\mathrm{T}}} = \mathbf{A}. \]
Or again, consider the case where $y$ is just a scalar function of $\mathbf{x}$. Rule 1 then says that $\frac{dy}{d\mathbf{x}}$ is a column-vector version of the gradient, and $\frac{dy}{d\mathbf{x}^{\mathrm{T}}}$ a row-vector version. When $y$ is a linear, scalar function of $\mathbf{x}$, $y = \mathbf{a}^{\mathrm{T}}\mathbf{x}$, the rule says that:
\[ \frac{dy}{d\mathbf{x}} = \frac{d(\mathbf{a}^{\mathrm{T}}\mathbf{x})}{d\mathbf{x}} = \mathbf{a}, \qquad \frac{dy}{d\mathbf{x}^{\mathrm{T}}} = \frac{d(\mathbf{a}^{\mathrm{T}}\mathbf{x})}{d\mathbf{x}^{\mathrm{T}}} = \mathbf{a}^{\mathrm{T}}. \]
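As a quick numerical sanity check of these two special cases (and of the “outer-product” shape rule), the following NumPy sketch compares finite-difference derivatives against $\mathbf{A}$ and $\mathbf{a}$; the sizes and random values are arbitrary choices for illustration, not anything fixed by the text.

```python
# Finite-difference check of d(Ax)/dx^T = A (an M x N Jacobian) and of
# d(a^T x)/dx = a (a column gradient). Sizes and values are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
M, N = 3, 5
A = rng.standard_normal((M, N))
a = rng.standard_normal(N)
x = rng.standard_normal(N)
eps = 1e-6

jac = np.empty((M, N))
grad = np.empty(N)
for j in range(N):
    dx = np.zeros(N)
    dx[j] = eps
    jac[:, j] = (A @ (x + dx) - A @ x) / eps   # j-th column of the Jacobian
    grad[j] = (a @ (x + dx) - a @ x) / eps     # j-th element of the gradient

assert jac.shape == (M, N)                     # the "outer-product" shape rule
assert np.allclose(jac, A, atol=1e-4)
assert np.allclose(grad, a, atol=1e-4)
print("linear-function checks passed")
```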
The chain rule.
Getting the chain rule right means making sure that the dimensions of the vectors and matrices generated by taking derivatives line up properly, which motivates the rule:
2. In the elements generated by the chain rule, all the numerators on the RHS must have the same orientation as the numerator on the LHS, and likewise for the denominators.
Rules 1 and 2, along with the requirement that inner matrix dimensions agree, ensure that the chain rule for a row-vector derivative of a composition $\mathbf{z}\big(\mathbf{y}(\mathbf{x})\big)$ is:
\[ \frac{d\mathbf{z}}{d\mathbf{x}^{\mathrm{T}}} = \frac{d\mathbf{z}}{d\mathbf{y}^{\mathrm{T}}}\,\frac{d\mathbf{y}}{d\mathbf{x}^{\mathrm{T}}}. \]
This chain rule works just as well if $y$ or $\mathbf{z}$ are scalars:
\[ \frac{d\mathbf{z}}{d\mathbf{x}^{\mathrm{T}}} = \frac{d\mathbf{z}}{dy}\,\frac{dy}{d\mathbf{x}^{\mathrm{T}}} \quad\text{(a matrix)}, \qquad \frac{dz}{d\mathbf{x}^{\mathrm{T}}} = \frac{dz}{d\mathbf{y}^{\mathrm{T}}}\,\frac{d\mathbf{y}}{d\mathbf{x}^{\mathrm{T}}} \quad\text{(a row vector)}. \]
We could write down the column-vector version by applying rule 2 while ensuring agreement between the inner matrix dimensions. Alternatively, we can apply rule 1 to the chain rule just derived for the row-vector derivative:
\[ \frac{d\mathbf{z}^{\mathrm{T}}}{d\mathbf{x}} = \left(\frac{d\mathbf{z}}{d\mathbf{x}^{\mathrm{T}}}\right)^{\mathrm{T}} = \left(\frac{d\mathbf{z}}{d\mathbf{y}^{\mathrm{T}}}\,\frac{d\mathbf{y}}{d\mathbf{x}^{\mathrm{T}}}\right)^{\mathrm{T}} = \frac{d\mathbf{y}^{\mathrm{T}}}{d\mathbf{x}}\,\frac{d\mathbf{z}^{\mathrm{T}}}{d\mathbf{y}}. \]
This is perhaps the less intuitive of the two chain rules, since it reverses the order in which the factors are usually written in scalar calculus.
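Since the ordering of factors is easy to get wrong, here is a small numerical check of the row-vector chain rule for one arbitrary composition, $\mathbf{y}(\mathbf{x}) = \tanh(\mathbf{A}\mathbf{x})$ and $\mathbf{z}(\mathbf{y}) = \mathbf{B}\mathbf{y}$; these particular functions and sizes are illustrative assumptions only.

```python
# Check dz/dx^T = (dz/dy^T)(dy/dx^T) for y(x) = tanh(Ax), z(y) = By.
import numpy as np

rng = np.random.default_rng(1)
N, M, P = 4, 3, 2
A = rng.standard_normal((M, N))
B = rng.standard_normal((P, M))
x = rng.standard_normal(N)

y = np.tanh(A @ x)
dy_dxT = np.diag(1 - y**2) @ A           # M x N Jacobian of y w.r.t. x
dz_dyT = B                               # P x M Jacobian of z w.r.t. y
dz_dxT = dz_dyT @ dy_dxT                 # chain rule; inner dimensions (M) agree

# Compare against a finite-difference Jacobian of z(x) = B tanh(Ax).
eps = 1e-6
fd = np.empty((P, N))
for j in range(N):
    dx = np.zeros(N)
    dx[j] = eps
    fd[:, j] = (B @ np.tanh(A @ (x + dx)) - B @ np.tanh(A @ x)) / eps
assert np.allclose(dz_dxT, fd, atol=1e-4)
print("chain-rule check passed")
```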
The product rule.
This motivates no additional matrix-calculus rules, but maintaining agreement among inner matrix dimensions does enforce a particular order. For example, let $f(\mathbf{x}) = \mathbf{u}(\mathbf{x})^{\mathrm{T}}\mathbf{v}(\mathbf{x})$, the dot product of two vector-valued functions. Then the product rule must read:
\[ \frac{df}{d\mathbf{x}^{\mathrm{T}}} = \mathbf{u}^{\mathrm{T}}\frac{d\mathbf{v}}{d\mathbf{x}^{\mathrm{T}}} + \mathbf{v}^{\mathrm{T}}\frac{d\mathbf{u}}{d\mathbf{x}^{\mathrm{T}}}. \]
The column-vector equivalent is easily derived by transposing the RHS. Neither, unfortunately, can be read as “the derivative of the first times the second, plus the first times the derivative of the second,” as it is often taught in scalar calculus. It is easily remembered, nevertheless, by applying our rules 1 and 2, and checking inner matrix dimensions for agreement.
In the special case of a quadratic form, $f(\mathbf{x}) = \mathbf{x}^{\mathrm{T}}\mathbf{A}\mathbf{x}$, this reduces to:
\[ \frac{df}{d\mathbf{x}^{\mathrm{T}}} = \mathbf{x}^{\mathrm{T}}\mathbf{A} + \mathbf{x}^{\mathrm{T}}\mathbf{A}^{\mathrm{T}} = \mathbf{x}^{\mathrm{T}}\left(\mathbf{A} + \mathbf{A}^{\mathrm{T}}\right). \]
In the even more special case where $\mathbf{A}$ is symmetric, $\mathbf{A} = \mathbf{A}^{\mathrm{T}}$, this yields $2\mathbf{x}^{\mathrm{T}}\mathbf{A}$. Evidently, the column-vector equivalent is $2\mathbf{A}\mathbf{x}$.
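The quadratic-form derivatives just stated are also easy to verify numerically; the sketch below does so with finite differences, using an arbitrary $\mathbf{A}$ and $\mathbf{x}$ and building the symmetric case as $\mathbf{A} + \mathbf{A}^{\mathrm{T}}$.

```python
# Check d(x^T A x)/dx^T = x^T (A + A^T), and the symmetric special case.
import numpy as np

rng = np.random.default_rng(2)
N = 4
A = rng.standard_normal((N, N))
x = rng.standard_normal(N)
eps = 1e-6

f = lambda v: v @ A @ v
fd_grad = np.array([(f(x + eps * np.eye(N)[j]) - f(x)) / eps for j in range(N)])
assert np.allclose(fd_grad, x @ (A + A.T), atol=1e-4)    # row-vector version

S = A + A.T                                              # a symmetric matrix
g = lambda v: v @ S @ v
fd_grad_sym = np.array([(g(x + eps * np.eye(N)[j]) - g(x)) / eps for j in range(N)])
assert np.allclose(fd_grad_sym, 2 * S @ x, atol=1e-4)    # column version, 2 S x
print("quadratic-form checks passed")
```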
B.1.2 Derivatives with respect to matrices
Scalar-valued functions.
Given a matrix,
\[ \mathbf{X} = \begin{bmatrix} x_{11} & \cdots & x_{1N} \\ \vdots & \ddots & \vdots \\ x_{M1} & \cdots & x_{MN} \end{bmatrix}, \]
and a scalar-valued function $f(\mathbf{X})$, we define:
\[ \frac{df}{d\mathbf{X}} := \begin{bmatrix} \frac{\partial f}{\partial x_{11}} & \cdots & \frac{\partial f}{\partial x_{1N}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial x_{M1}} & \cdots & \frac{\partial f}{\partial x_{MN}} \end{bmatrix}. \]
This definition can be more easily applied if we translate it into the derivatives with respect to vectors introduced in the previous section. Giving names to the rows ($\bar{\mathbf{x}}_i^{\mathrm{T}}$) and columns ($\mathbf{x}_j$) of $\mathbf{X}$:
\[ \mathbf{X} = \begin{bmatrix} \bar{\mathbf{x}}_1^{\mathrm{T}} \\ \vdots \\ \bar{\mathbf{x}}_M^{\mathrm{T}} \end{bmatrix} = \begin{bmatrix} \mathbf{x}_1 & \cdots & \mathbf{x}_N \end{bmatrix}, \]
we can write:
\[ \frac{df}{d\mathbf{X}} = \begin{bmatrix} \frac{df}{d\bar{\mathbf{x}}_1^{\mathrm{T}}} \\ \vdots \\ \frac{df}{d\bar{\mathbf{x}}_M^{\mathrm{T}}} \end{bmatrix} = \begin{bmatrix} \frac{df}{d\mathbf{x}_1} & \cdots & \frac{df}{d\mathbf{x}_N} \end{bmatrix}. \tag{B.1} \]
This lets us more easily derive some common special cases. Consider the bilinear form $f(\mathbf{X}) = \mathbf{a}^{\mathrm{T}}\mathbf{X}\mathbf{b}$. The derivative with respect to the first row of $\mathbf{X}$ is:
\[ \frac{df}{d\bar{\mathbf{x}}_1^{\mathrm{T}}} = \frac{d}{d\bar{\mathbf{x}}_1^{\mathrm{T}}}\left(\sum_i a_i\,\bar{\mathbf{x}}_i^{\mathrm{T}}\mathbf{b}\right) = a_1\mathbf{b}^{\mathrm{T}}. \]
Stacking all of these rows vertically as in Eq. B.1, we see that:
\[ \frac{df}{d\mathbf{X}} = \begin{bmatrix} a_1\mathbf{b}^{\mathrm{T}} \\ \vdots \\ a_M\mathbf{b}^{\mathrm{T}} \end{bmatrix} = \mathbf{a}\mathbf{b}^{\mathrm{T}}. \]
Alternatively, we might have used the column-gradient formulation:
\[ \frac{df}{d\mathbf{x}_1} = \frac{d}{d\mathbf{x}_1}\left(\sum_j b_j\,\mathbf{a}^{\mathrm{T}}\mathbf{x}_j\right) = b_1\mathbf{a}, \]
and then stacked these columns horizontally as in Eq. B.1:
\[ \frac{df}{d\mathbf{X}} = \begin{bmatrix} b_1\mathbf{a} & \cdots & b_N\mathbf{a} \end{bmatrix} = \mathbf{a}\mathbf{b}^{\mathrm{T}}. \]
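The bilinear-form result $d(\mathbf{a}^{\mathrm{T}}\mathbf{X}\mathbf{b})/d\mathbf{X} = \mathbf{a}\mathbf{b}^{\mathrm{T}}$ can likewise be checked element by element with finite differences; the values below are arbitrary.

```python
# Check that the derivative of a^T X b with respect to X is the outer product a b^T.
import numpy as np

rng = np.random.default_rng(3)
M, N = 3, 4
a = rng.standard_normal(M)
b = rng.standard_normal(N)
X = rng.standard_normal((M, N))
eps = 1e-6

f = lambda Xmat: a @ Xmat @ b
dfdX = np.empty((M, N))
for i in range(M):
    for j in range(N):
        dX = np.zeros((M, N))
        dX[i, j] = eps
        dfdX[i, j] = (f(X + dX) - f(X)) / eps
assert np.allclose(dfdX, np.outer(a, b), atol=1e-4)
print("bilinear-form check passed")
```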
Or again, consider a case where $\mathbf{X}$ shows up in the other part of the bilinear form (in this case, a quadratic form):
\[ f(\mathbf{X}) = (\mathbf{X}\mathbf{a})^{\mathrm{T}}\mathbf{B}\,\mathbf{X}\mathbf{a}. \tag{B.2} \]
Then defining $\mathbf{u} := \mathbf{X}\mathbf{a}$, and considering again just the first row of $\mathbf{X}$ (on which only the first element of $\mathbf{u}$ depends), we find:
\[ \frac{df}{d\bar{\mathbf{x}}_1^{\mathrm{T}}} = \frac{df}{d\mathbf{u}^{\mathrm{T}}}\,\frac{d\mathbf{u}}{d\bar{\mathbf{x}}_1^{\mathrm{T}}} = \mathbf{u}^{\mathrm{T}}\left(\mathbf{B} + \mathbf{B}^{\mathrm{T}}\right)\mathbf{e}_1\mathbf{a}^{\mathrm{T}} = c_1\mathbf{a}^{\mathrm{T}}, \]
where $\mathbf{e}_1$ is the first standard basis vector, $\mathbf{c} := \left(\mathbf{B} + \mathbf{B}^{\mathrm{T}}\right)\mathbf{u}$, and $c_1$ is its first element. Stacking these rows vertically, as in Eq. B.1, yields:
\[ \frac{df}{d\mathbf{X}} = \mathbf{c}\mathbf{a}^{\mathrm{T}} = \left(\mathbf{B} + \mathbf{B}^{\mathrm{T}}\right)\mathbf{X}\mathbf{a}\mathbf{a}^{\mathrm{T}}. \]
A common application of this derivative occurs when working with Gaussian functions, whose exponent can be written in terms of the quadratic form defined in Eq. B.2. In this case, the matrix $\mathbf{B}$ is symmetric, and the result simplifies further. More generally, Eq. B.2 occurs in quadratic penalties on the state in control problems, in which case $\mathbf{X}$ would be the state-transition matrix.
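The quadratic-form-in-$\mathbf{X}$ result can also be checked numerically. The sketch below verifies $df/d\mathbf{X} = (\mathbf{B}+\mathbf{B}^{\mathrm{T}})\mathbf{X}\mathbf{a}\mathbf{a}^{\mathrm{T}}$ by finite differences, with arbitrary values for $\mathbf{X}$, $\mathbf{a}$, and $\mathbf{B}$ (and $\mathbf{B}$ deliberately left asymmetric, so the general form is exercised).

```python
# Check d[(Xa)^T B (Xa)]/dX = (B + B^T) X a a^T with finite differences.
import numpy as np

rng = np.random.default_rng(4)
M, N = 3, 4
X = rng.standard_normal((M, N))
a = rng.standard_normal(N)
B = rng.standard_normal((M, M))
eps = 1e-6

f = lambda Xmat: (Xmat @ a) @ B @ (Xmat @ a)
dfdX = np.empty((M, N))
for i in range(M):
    for j in range(N):
        dX = np.zeros((M, N))
        dX[i, j] = eps
        dfdX[i, j] = (f(X + dX) - f(X)) / eps
assert np.allclose(dfdX, (B + B.T) @ X @ np.outer(a, a), atol=1e-4)
print("quadratic-in-X check passed")
```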
Matrix-valued functions.
The derivative of a matrix-valued function with respect to a matrix is a tensor. These are cumbersome, so in a way our discussion of them is merely preliminary to what follows. Let $x_{ij}$ be the $(i,j)^{\text{th}}$ entry of $\mathbf{X}$. We consider a few simple matrix functions of $\mathbf{X}$:
where is the column of . Transposes and derivatives commute, as usual, so the derivative of (e.g.) is just the transpose of the above. That means that
with the row of . From the first we can also compute the slightly more complicated, but elegant:
with the column of . And for a square matrix , we consider the even more complicated:
Finally, consider the vector-valued function , where has Jacobian but is otherwise unspecified. Its derivative with respect to an element of is
For all of the above, the derivative with respect to the entire matrix $\mathbf{X}$ is just the collection of these matrices for all $i$ and $j$.
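Derivatives of this kind are easy to check numerically. As one simple (assumed) instance, for $\mathbf{Y}(\mathbf{X}) = \mathbf{A}\mathbf{X}$ the derivative with respect to $x_{ij}$ is the $i^{\text{th}}$ column of $\mathbf{A}$ placed in the $j^{\text{th}}$ column of an otherwise zero matrix; the sketch below confirms this by finite differences.

```python
# For Y(X) = A X (an illustrative choice), dY/dx_ij = A[:, i] e_j^T.
import numpy as np

rng = np.random.default_rng(5)
P, M, N = 2, 3, 4
A = rng.standard_normal((P, M))
X = rng.standard_normal((M, N))
i, j, eps = 1, 2, 1e-6

dX = np.zeros((M, N))
dX[i, j] = eps
fd = (A @ (X + dX) - A @ X) / eps                 # finite-difference P x N matrix
analytic = np.outer(A[:, i], np.eye(N)[j])        # i-th column of A times e_j^T
assert np.allclose(fd, analytic, atol=1e-4)
print("matrix-valued derivative check passed")
```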
Some applications of the chain rule to matrix derivatives.
Now suppose we want to take a derivative with respect to $\mathbf{X}$ of the scalar-valued function $f(\mathbf{Y}(\mathbf{X}))$, for various matrix-valued functions $\mathbf{Y}(\mathbf{X})$. We shall consider in particular those lately worked out. The chain rule here says that the $(i,j)^{\text{th}}$ element of this matrix is:
\[ \frac{\partial f}{\partial x_{ij}} = \sum_{k,l}\frac{\partial f}{\partial y_{kl}}\frac{\partial y_{kl}}{\partial x_{ij}} = \mathbf{1}^{\mathrm{T}}\left(\frac{df}{d\mathbf{Y}}\odot\frac{\partial\mathbf{Y}}{\partial x_{ij}}\right)\mathbf{1}, \]
with $\mathbf{1}$ a vector of ones and $\odot$ the entry-wise (Hadamard) product. We now apply this equation to the results above:
(Eqs. B.3–B.6)
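As a concrete (assumed) instance of this element-wise chain rule, take $f(\mathbf{Y}) = \sum_{k,l} y_{kl}^2$, so that $df/d\mathbf{Y} = 2\mathbf{Y}$, and $\mathbf{Y}(\mathbf{X}) = \mathbf{A}\mathbf{X}$; the sketch below compares the formula against a finite difference.

```python
# Check df/dx_ij = 1^T (df/dY ⊙ dY/dx_ij) 1 for f(Y) = sum(Y**2), Y(X) = A X.
import numpy as np

rng = np.random.default_rng(6)
P, M, N = 2, 3, 4
A = rng.standard_normal((P, M))
X = rng.standard_normal((M, N))
i, j, eps = 0, 1, 1e-6

Y = A @ X
dfdY = 2 * Y                                      # df/dY for f = sum(Y**2)
dYdxij = np.outer(A[:, i], np.eye(N)[j])          # dY/dx_ij for Y = A X
chain = np.sum(dfdY * dYdxij)                     # 1^T (· ⊙ ·) 1, i.e. sum of elements

dX = np.zeros((M, N))
dX[i, j] = eps
fd = (np.sum((A @ (X + dX))**2) - np.sum(Y**2)) / eps
assert np.isclose(chain, fd, atol=1e-3)
print("matrix chain-rule check passed")
```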
For the vector-valued function considered above,
A few special cases are interesting. Since the trace and the derivative are linear operators, they commute, and in particular
\[ \frac{\partial}{\partial x_{ij}}\,\mathrm{tr}\!\left(\mathbf{Y}\right) = \mathrm{tr}\!\left(\frac{\partial\mathbf{Y}}{\partial x_{ij}}\right). \]
Therefore, letting $f(\cdot) = \mathrm{tr}(\cdot)$ in the above equations, we have
(Eqs. B.8–B.10)
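One standard identity of this family (whichever particular functions Eqs. B.8–B.10 record) is $d\,\mathrm{tr}(\mathbf{A}\mathbf{X})/d\mathbf{X} = \mathbf{A}^{\mathrm{T}}$, which can be checked numerically:

```python
# Finite-difference check that d tr(A X)/dX = A^T.
import numpy as np

rng = np.random.default_rng(7)
M, N = 3, 4
X = rng.standard_normal((M, N))
A = rng.standard_normal((N, M))        # A X is then square, so tr(A X) is defined
eps = 1e-6

f = lambda Xmat: np.trace(A @ Xmat)
dfdX = np.empty((M, N))
for i in range(M):
    for j in range(N):
        dX = np.zeros((M, N))
        dX[i, j] = eps
        dfdX[i, j] = (f(X + dX) - f(X)) / eps
assert np.allclose(dfdX, A.T, atol=1e-4)
print("trace-derivative check passed")
```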
B.1.3 More useful identities
Now that we have defined derivatives (of scalars and vectors) with respect to vectors and derivatives (of scalars) with respect to matrices, we can derive the following useful identities.
The derivative of the log-determinant.
The trace and determinant of a matrix are related by a useful formula, derived through their respective relationships with the matrix’s spectrum. Recall:
\[ |\mathbf{A}| = \prod_i \lambda_i, \qquad \mathrm{tr}(\mathbf{A}) = \sum_i \lambda_i. \]
For each eigenvalue $\lambda_i$ and associated eigenvector $\mathbf{v}_i$, $\mathbf{A}\mathbf{v}_i = \lambda_i\mathbf{v}_i$, so:
\[ \exp(\mathbf{A})\,\mathbf{v}_i = \sum_{n=0}^{\infty}\frac{\mathbf{A}^n}{n!}\mathbf{v}_i = \sum_{n=0}^{\infty}\frac{\lambda_i^n}{n!}\mathbf{v}_i = e^{\lambda_i}\mathbf{v}_i. \]
Therefore, the eigenvectors of $\exp(\mathbf{A})$ are the eigenvectors $\mathbf{v}_i$ of $\mathbf{A}$, and the eigenvalues of $\exp(\mathbf{A})$ are the exponentiated eigenvalues of $\mathbf{A}$. Hence:
\[ |\exp(\mathbf{A})| = \prod_i e^{\lambda_i} = \exp\!\left(\sum_i\lambda_i\right) = \exp\!\big(\mathrm{tr}(\mathbf{A})\big). \]
Therefore:
\[ \log\left|\exp(\mathbf{A})\right| = \mathrm{tr}(\mathbf{A}); \]
or, writing $\mathbf{B} := \exp(\mathbf{A})$, $\log|\mathbf{B}| = \mathrm{tr}\!\big(\log\mathbf{B}\big)$.
So far we have made use only of results from scalar calculus. (The derivative of the log of a matrix can be derived easily in terms of the Maclaurin series for the natural logarithm.)
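The identity $|\exp(\mathbf{A})| = \exp(\mathrm{tr}(\mathbf{A}))$ can be checked numerically; the sketch below uses a random symmetric $\mathbf{A}$ so that its matrix exponential can be formed by eigendecomposition with NumPy alone.

```python
# Check |exp(A)| = exp(tr(A)) for a random symmetric A.
import numpy as np

rng = np.random.default_rng(8)
N = 4
S = rng.standard_normal((N, N))
A = (S + S.T) / 2                        # symmetric, so np.linalg.eigh applies
lam, V = np.linalg.eigh(A)
expA = V @ np.diag(np.exp(lam)) @ V.T    # matrix exponential of A

sign, logdet = np.linalg.slogdet(expA)
assert sign > 0
assert np.isclose(logdet, np.trace(A))
print("log-determinant/trace check passed")
```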
Another example.
Cf. the “trace trick,” in e.g. IPGM. When $f(\mathbf{X}) = \log|\mathbf{X}|$ (i.e. the log of the determinant of $\mathbf{X}$),
\[ \frac{df}{d\mathbf{X}} = \left(\mathbf{X}^{-1}\right)^{\mathrm{T}} = \mathbf{X}^{-\mathrm{T}}, \]
i.e. the inverse transpose. (This can be derived from the “interesting scalar case” below.) From this it follows easily that
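A finite-difference check of $d\log|\mathbf{X}|/d\mathbf{X} = \mathbf{X}^{-\mathrm{T}}$ is given below, for an arbitrary $\mathbf{X}$ shifted along the diagonal to keep its determinant safely positive.

```python
# Check d log|X| / dX = (X^{-1})^T by finite differences.
import numpy as np

rng = np.random.default_rng(9)
N = 4
X = rng.standard_normal((N, N)) + N * np.eye(N)    # diagonal shift keeps |X| > 0
eps = 1e-6

f = lambda Xm: np.linalg.slogdet(Xm)[1]            # log|Xm| (sign assumed positive)
dfdX = np.empty((N, N))
for i in range(N):
    for j in range(N):
        dX = np.zeros((N, N))
        dX[i, j] = eps
        dfdX[i, j] = (f(X + dX) - f(X)) / eps
assert np.allclose(dfdX, np.linalg.inv(X).T, atol=1e-4)
print("log-determinant derivative check passed")
```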