B.1 Matrix Calculus
In machine learning and computational neuroscience, derivatives often appear in equations containing matrices and vectors. Although it is always possible to re-express these equations in terms of sums of simpler derivatives, evaluating such expressions can be extremely tedious. It is therefore quite useful to have at hand rules for applying the derivatives directly to the matrices and vectors. The resulting computations are both easier to execute and more economical to express. In defining these “matrix derivatives,” however, some care is required to ensure that the usual formulations of the rules of scalar calculus—the chain rule, product rule, etc.—are preserved. We do that here.
Throughout, we treat vectors written with a transpose (denoted by a superscript T) as row vectors, and vectors written without one as column vectors.
B.1.1 Derivatives with respect to vectors
We conceptualize this fundamental operation as applying to vectors and yielding matrices. Application to scalars—or rather, scalar-valued functions—is then defined as a special case. Application to matrices (and higher-order tensors) is undefined.
The central idea in our definition is that the dimensions of the numerator and denominator must match the dimensions of the resulting matrix. In particular, we allow derivatives with respect to both column and row vectors; however:
1. In derivatives of a vector with respect to a vector, the two vectors must have opposite orientations; that is, we can take the derivative of a column vector with respect to a row vector, or of a row vector with respect to a column vector, but not of a column with respect to a column or of a row with respect to a row. The two allowed derivatives are defined according to the Jacobian and its transpose, respectively.
Thus, the transformation of “shapes” behaves like an outer product: if the column vector in the numerator has length $m$ and the row vector in the denominator has length $n$, the resulting derivative is an $m \times n$ matrix (and, conversely, the derivative of a row vector with respect to a column vector is $n \times m$).
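For instance (with generic symbols used here for illustration), if the column vector $\mathbf{y}$, of length $m$, depends on the column vector $\mathbf{x}$, of length $n$, the two allowed derivatives are
\[
\frac{\partial \mathbf{y}}{\partial \mathbf{x}^{\mathrm{T}}} =
\begin{bmatrix}
\partial y_1/\partial x_1 & \cdots & \partial y_1/\partial x_n \\
\vdots & \ddots & \vdots \\
\partial y_m/\partial x_1 & \cdots & \partial y_m/\partial x_n
\end{bmatrix}
\qquad\text{and}\qquad
\frac{\partial \mathbf{y}^{\mathrm{T}}}{\partial \mathbf{x}} =
\left(\frac{\partial \mathbf{y}}{\partial \mathbf{x}^{\mathrm{T}}}\right)^{\mathrm{T}},
\]
that is, the $m \times n$ Jacobian and its $n \times m$ transpose.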
Several special cases warrant attention.
Consider the linear, vector-valued function $\mathbf{f}(\mathbf{x}) = \mathbf{A}\mathbf{x}$, with $\mathbf{A}$ a constant matrix.
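Writing this map out elementwise (a quick check, in the notation just introduced), $f_i(\mathbf{x}) = \sum_j A_{ij}x_j$, so $\partial f_i/\partial x_j = A_{ij}$ and therefore
\[
\frac{\partial \mathbf{f}}{\partial \mathbf{x}^{\mathrm{T}}} = \mathbf{A},
\qquad
\frac{\partial \mathbf{f}^{\mathrm{T}}}{\partial \mathbf{x}} = \mathbf{A}^{\mathrm{T}}.
\]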
Or again, consider the case where
The chain rule.
Getting the chain rule right means making sure that the dimensions of the vectors and matrices generated by taking derivatives line up properly, which motivates the rule:
2. In the elements generated by the chain rule, all the numerators on the RHS must have the same orientation as the numerator on the LHS, and likewise for the denominators.
Rules 1 and 2, along with the requirement that inner matrix dimensions agree, ensure that the chain rule for a row-vector derivative is:
This chain rule works just as well whether the quantity being differentiated is a vector (in which case the derivative is a matrix) or a scalar (in which case it is a row vector).
We could write down the column-vector version by applying rule 2 while ensuring agreement between the inner matrix dimensions. Alternatively, we can apply rule 1 to the chain rule just derived for the row-vector derivative:
This is perhaps the less intuitive of the two chain rules, since it reverses the order in which the factors are usually written in scalar calculus.
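For reference, with $\mathbf{x}$, $\mathbf{y}$, and $\mathbf{z}$ of lengths $n$, $m$, and $p$ (generic symbols, used here for illustration), and $\mathbf{z}$ depending on $\mathbf{x}$ only through $\mathbf{y}$, the two chain rules read
\begin{align*}
\frac{\partial \mathbf{z}}{\partial \mathbf{x}^{\mathrm{T}}}
&= \frac{\partial \mathbf{z}}{\partial \mathbf{y}^{\mathrm{T}}}\,
   \frac{\partial \mathbf{y}}{\partial \mathbf{x}^{\mathrm{T}}}
&& (p \times m)(m \times n) = p \times n,\\
\frac{\partial \mathbf{z}^{\mathrm{T}}}{\partial \mathbf{x}}
&= \frac{\partial \mathbf{y}^{\mathrm{T}}}{\partial \mathbf{x}}\,
   \frac{\partial \mathbf{z}^{\mathrm{T}}}{\partial \mathbf{y}}
&& (n \times m)(m \times p) = n \times p.
\end{align*}
In both cases the inner matrix dimensions agree and rules 1 and 2 are respected; only the second reverses the scalar-calculus ordering.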
The product rule.
This motivates no additional matrix-calculus rules, but maintaining agreement among inner matrix dimensions does enforce a particular order.
For example, let
The column-vector equivalent is easily derived by transposing the RHS. Neither, unfortunately, can be read as “the derivative of the first times the second, plus the first times the derivative of the second,” as it is often taught in scalar calculus. It is easily remembered, nevertheless, by applying our rules 1 and 2, and checking inner matrix dimensions for agreement.
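For instance (a generic inner product, with symbols introduced here for illustration), if $\mathbf{u}(\mathbf{x})$ and $\mathbf{v}(\mathbf{x})$ are column-vector-valued functions of $\mathbf{x}$, then
\[
\frac{\partial\,(\mathbf{u}^{\mathrm{T}}\mathbf{v})}{\partial \mathbf{x}^{\mathrm{T}}}
= \mathbf{u}^{\mathrm{T}}\frac{\partial \mathbf{v}}{\partial \mathbf{x}^{\mathrm{T}}}
+ \mathbf{v}^{\mathrm{T}}\frac{\partial \mathbf{u}}{\partial \mathbf{x}^{\mathrm{T}}}:
\]
each derivative factor is pre-multiplied by the transpose of the other vector, since that is the only arrangement in which the inner matrix dimensions agree.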
In the special case of a quadratic form, $\mathbf{x}^{\mathrm{T}}\mathbf{A}\mathbf{x}$, the row-vector derivative is $\mathbf{x}^{\mathrm{T}}(\mathbf{A} + \mathbf{A}^{\mathrm{T}})$. In the even more special case where $\mathbf{A}$ is symmetric, this reduces to $2\mathbf{x}^{\mathrm{T}}\mathbf{A}$.
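To verify (an illustrative derivation, applying the product rule above with $\mathbf{u} = \mathbf{x}$ and $\mathbf{v} = \mathbf{A}\mathbf{x}$):
\[
\frac{\partial\,(\mathbf{x}^{\mathrm{T}}\mathbf{A}\mathbf{x})}{\partial \mathbf{x}^{\mathrm{T}}}
= \mathbf{x}^{\mathrm{T}}\frac{\partial (\mathbf{A}\mathbf{x})}{\partial \mathbf{x}^{\mathrm{T}}}
+ (\mathbf{A}\mathbf{x})^{\mathrm{T}}\frac{\partial \mathbf{x}}{\partial \mathbf{x}^{\mathrm{T}}}
= \mathbf{x}^{\mathrm{T}}\mathbf{A} + \mathbf{x}^{\mathrm{T}}\mathbf{A}^{\mathrm{T}}
= \mathbf{x}^{\mathrm{T}}\bigl(\mathbf{A} + \mathbf{A}^{\mathrm{T}}\bigr).
\]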
B.1.2 Derivatives with respect to matrices
Scalar-valued functions.
Given a matrix, $\mathbf{A}$, and a scalar-valued function of it, $f(\mathbf{A})$, we define the derivative of $f$ with respect to $\mathbf{A}$ to be the matrix, with the same dimensions as $\mathbf{A}$, whose $(i,j)^{\text{th}}$ element is the derivative of $f$ with respect to the $(i,j)^{\text{th}}$ element of $\mathbf{A}$.
This definition can be more easily applied if we translate it into the derivatives with respect to vectors introduced in the previous section.
Giving names to the rows (or, alternatively, the columns) of the matrix, we can write the derivative with respect to the whole matrix in terms of derivatives with respect to these vectors (Eq. B.1).
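Concretely, suppose $\mathbf{A}$ is $m \times n$, with rows $\tilde{\mathbf{a}}_1^{\mathrm{T}}, \ldots, \tilde{\mathbf{a}}_m^{\mathrm{T}}$ and columns $\mathbf{a}_1, \ldots, \mathbf{a}_n$ (names introduced here for concreteness). Then the matrix derivative can be assembled either from row-gradients or from column-gradients:
\[
\frac{\partial f}{\partial \mathbf{A}} =
\begin{bmatrix}
\partial f/\partial \tilde{\mathbf{a}}_1^{\mathrm{T}} \\
\vdots \\
\partial f/\partial \tilde{\mathbf{a}}_m^{\mathrm{T}}
\end{bmatrix}
=
\begin{bmatrix}
\partial f/\partial \mathbf{a}_1 & \cdots & \partial f/\partial \mathbf{a}_n
\end{bmatrix}.
\]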
This lets us more easily derive some common special cases.
Consider the bilinear form $\mathbf{x}^{\mathrm{T}}\mathbf{A}\mathbf{y}$ and its derivative with respect to $\mathbf{A}$. Stacking all of the row-gradients vertically yields the derivative with respect to the entire matrix.
Alternatively, we might have used the column-gradient formulation, and then stacked these columns horizontally, as in Eq. B.1.
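In the notation above (again, names chosen for illustration), the row- and column-gradients of the bilinear form are
\[
\frac{\partial\,(\mathbf{x}^{\mathrm{T}}\mathbf{A}\mathbf{y})}{\partial \tilde{\mathbf{a}}_i^{\mathrm{T}}} = x_i\,\mathbf{y}^{\mathrm{T}}
\qquad\text{and}\qquad
\frac{\partial\,(\mathbf{x}^{\mathrm{T}}\mathbf{A}\mathbf{y})}{\partial \mathbf{a}_j} = y_j\,\mathbf{x},
\]
and stacking either set, vertically or horizontally respectively, gives the same result,
\[
\frac{\partial\,(\mathbf{x}^{\mathrm{T}}\mathbf{A}\mathbf{y})}{\partial \mathbf{A}} = \mathbf{x}\mathbf{y}^{\mathrm{T}}.
\]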
Or again, consider a case where
Then defining
where
A common application of this derivative occurs when working with Gaussian functions, which can be written in terms of a quadratic form in the inverse covariance matrix.
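For instance (a sketch, with mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$ written here for illustration), the exponent of a Gaussian is $-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})$, a bilinear form in the precision matrix $\boldsymbol{\Sigma}^{-1}$, so by the preceding result
\[
\frac{\partial}{\partial \boldsymbol{\Sigma}^{-1}}
\left[-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right]
= -\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}}.
\]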
Matrix-valued functions.
The derivative of a matrix-valued function with respect to a matrix is a tensor.
These are cumbersome, so in a way our discussion of them is merely preliminary to what follows.
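In components (with generic names $\mathbf{F}$ and $\mathbf{X}$ used for illustration), the derivative of a matrix-valued function $\mathbf{F}(\mathbf{X})$ with respect to the matrix $\mathbf{X}$ has four indices,
\[
\left[\frac{\partial \mathbf{F}}{\partial \mathbf{X}}\right]_{ijkl} = \frac{\partial F_{ij}}{\partial X_{kl}},
\]
i.e., it is a fourth-order tensor.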
Let
where
with
with
Finally, consider the vector-valued function
For all of the above, the derivative with respect to the entire matrix
Some applications of the chain rule to matrix derivatives.
Now suppose we want to take a derivative with respect to
with
(B.3)–(B.6)
For the vector-valued function
A few special cases are interesting. Since the trace and the derivative are linear operators, they commute; in particular, the derivative of the trace of a matrix-valued function is the trace of its derivative.
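That is, written out elementwise for a matrix-valued function $\mathbf{F}$ of a scalar argument $t$ (symbols chosen here for illustration),
\[
\frac{\mathrm{d}}{\mathrm{d}t}\,\mathrm{tr}\bigl[\mathbf{F}(t)\bigr]
= \frac{\mathrm{d}}{\mathrm{d}t}\sum_i F_{ii}(t)
= \sum_i \frac{\mathrm{d}F_{ii}}{\mathrm{d}t}
= \mathrm{tr}\!\left[\frac{\mathrm{d}\mathbf{F}}{\mathrm{d}t}\right].
\]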
Therefore, letting
(B.8)–(B.10)
B.1.3 More useful identities
Now that we have defined derivatives (of scalars and vectors) with respect to vectors, and derivatives (of scalars) with respect to matrices, we can derive the following useful identities.
The derivative of the log-determinant.
The trace and determinant of a matrix are related by a useful formula, derived through their respective relationships with the matrix’s spectrum. Recall that the trace of a matrix is the sum of its eigenvalues, whereas the determinant is their product.
For each eigenvalue
Therefore, the eigenvectors of
Therefore:
So far we have made use only of results from scalar calculus.
(The derivative of the log of
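One way to assemble these facts into the derivative of the log-determinant (a sketch, assuming $\mathbf{A}$ is invertible with positive determinant; the symbols are generic): perturbing $\mathbf{A}$ by a small matrix $\mathrm{d}\mathbf{A}$,
\[
\log\bigl|\mathbf{A} + \mathrm{d}\mathbf{A}\bigr|
= \log\bigl|\mathbf{A}\bigr| + \log\bigl|\mathbf{I} + \mathbf{A}^{-1}\mathrm{d}\mathbf{A}\bigr|
\approx \log\bigl|\mathbf{A}\bigr| + \mathrm{tr}\bigl[\mathbf{A}^{-1}\mathrm{d}\mathbf{A}\bigr],
\]
since the determinant of $\mathbf{I} + \mathbf{E}$ is the product of its eigenvalues, $\prod_i (1 + \varepsilon_i) \approx 1 + \sum_i \varepsilon_i = 1 + \mathrm{tr}[\mathbf{E}]$ for small $\mathbf{E}$, and $\log(1+t) \approx t$ for small $t$. Reading off the coefficient of $\mathrm{d}A_{ij}$ then gives $\partial \log|\mathbf{A}|/\partial A_{ij} = [\mathbf{A}^{-1}]_{ji}$, i.e.
\[
\frac{\partial \log\bigl|\mathbf{A}\bigr|}{\partial \mathbf{A}} = \mathbf{A}^{-\mathrm{T}}.
\]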
ANOTHER EXAMPLE (fix me).
Cf. the “trace trick,” in e.g. IPGM.
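That trick is commonly stated as follows (recorded here for convenience): a scalar-valued quadratic form can be moved inside a trace and then cyclically permuted,
\[
\mathbf{x}^{\mathrm{T}}\mathbf{A}\mathbf{x}
= \mathrm{tr}\bigl[\mathbf{x}^{\mathrm{T}}\mathbf{A}\mathbf{x}\bigr]
= \mathrm{tr}\bigl[\mathbf{A}\mathbf{x}\mathbf{x}^{\mathrm{T}}\bigr],
\]
which converts derivatives of quadratic forms into derivatives of traces.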
When
i.e. the inverse transpose. (This can be derived from the “interesting scalar case” below.) From this it follows easily that