B.1 Matrix Calculus
In the settings of machine learning and computational neuroscience, derivatives often appear in equations with matrices and vectors. Although it is always possible to re-express these equations in terms of sums of simpler derivatives, evaluating such expressions can be extremely tedious. It is therefore quite useful to have at hand rules for applying the derivatives directly to the matrices and vectors. The results are both easier to execute and more economically expressed. In defining these “matrix derivatives,” however, some care is required to ensure that the usual formulations of the rules of scalar calculus—the chain rule, product rule, etc.—are preserved. We do that here.
Throughout, we treat vectors with a transpose (denoted with a superscript $^{\mathrm{T}}$) as rows, and vectors without as columns.
B.1.1 Derivatives with respect to vectors
We conceptualize this fundamental operation as applying to vectors and yielding matrices. Application to scalars—or rather, scalar-valued functions—is then defined as a special case. Application to matrices (and higher-order tensors) is undefined.
The central idea in our definition is that the dimensions of the derivative must match the dimensions of the resulting matrix. In particular, we allow derivatives with respect to both column and row vectors; however:
1. In derivatives of a vector with respect to a vector, the two vectors must have opposite orientations; that is, we can take $\frac{d\mathbf{y}}{d\mathbf{x}^{\mathrm{T}}}$ and $\frac{d\mathbf{y}^{\mathrm{T}}}{d\mathbf{x}}$, but not $\frac{d\mathbf{y}}{d\mathbf{x}}$ or $\frac{d\mathbf{y}^{\mathrm{T}}}{d\mathbf{x}^{\mathrm{T}}}$. They are defined according to
\[ \frac{d\mathbf{y}}{d\mathbf{x}^{\mathrm{T}}} := \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_N} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_M}{\partial x_1} & \cdots & \frac{\partial y_M}{\partial x_N} \end{bmatrix}, \qquad \frac{d\mathbf{y}^{\mathrm{T}}}{d\mathbf{x}} := \left(\frac{d\mathbf{y}}{d\mathbf{x}^{\mathrm{T}}}\right)^{\mathrm{T}}, \]
the Jacobian and its transpose.
Thus, the transformation of “shapes” behaves like an outer product: if $\mathbf{y}$ has length $M$ and $\mathbf{x}$ has length $N$, then $\frac{d\mathbf{y}}{d\mathbf{x}^{\mathrm{T}}}$ is $M \times N$ and $\frac{d\mathbf{y}^{\mathrm{T}}}{d\mathbf{x}}$ is $N \times M$.
Several special cases warrant attention. Consider the linear vector-valued function $\mathbf{y}(\mathbf{x}) = \mathbf{A}\mathbf{x}$. Since $\mathbf{y}$ is a column, the derivative must be with respect to a row. In particular:
\[ \frac{d\mathbf{y}}{d\mathbf{x}^{\mathrm{T}}} = \frac{d(\mathbf{A}\mathbf{x})}{d\mathbf{x}^{\mathrm{T}}} = \mathbf{A}. \]
Or again, consider the case where $y$ is just a scalar function of $\mathbf{x}$. Rule 1 then says that $\frac{dy}{d\mathbf{x}}$ is a column-vector version of the gradient, and $\frac{dy}{d\mathbf{x}^{\mathrm{T}}}$ a row-vector version. When $y$ is a linear, scalar function of $\mathbf{x}$, $y = \mathbf{a}^{\mathrm{T}}\mathbf{x}$, the rule says that:
\[ \frac{dy}{d\mathbf{x}} = \frac{d(\mathbf{a}^{\mathrm{T}}\mathbf{x})}{d\mathbf{x}} = \mathbf{a}, \qquad \frac{dy}{d\mathbf{x}^{\mathrm{T}}} = \frac{d(\mathbf{a}^{\mathrm{T}}\mathbf{x})}{d\mathbf{x}^{\mathrm{T}}} = \mathbf{a}^{\mathrm{T}}. \]
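As a quick numerical sanity check of these two special cases (and of the “outer-product” shape rule), the following NumPy sketch compares finite-difference derivatives against $\mathbf{A}$ and $\mathbf{a}$; the sizes and random values are arbitrary choices for illustration, not anything fixed by the text.

```python
# Finite-difference check of d(Ax)/dx^T = A (an M x N Jacobian) and of
# d(a^T x)/dx = a (a column gradient). Sizes and values are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
M, N = 3, 5
A = rng.standard_normal((M, N))
a = rng.standard_normal(N)
x = rng.standard_normal(N)
eps = 1e-6

jac = np.empty((M, N))
grad = np.empty(N)
for j in range(N):
    dx = np.zeros(N)
    dx[j] = eps
    jac[:, j] = (A @ (x + dx) - A @ x) / eps   # j-th column of the Jacobian
    grad[j] = (a @ (x + dx) - a @ x) / eps     # j-th element of the gradient

assert jac.shape == (M, N)                     # the "outer-product" shape rule
assert np.allclose(jac, A, atol=1e-4)
assert np.allclose(grad, a, atol=1e-4)
print("linear-function checks passed")
```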
The chain rule.
Getting the chain rule right means making sure that the dimensions of the vectors and matrices generated by taking derivatives line up properly, which motivates the rule:
2. In the elements generated by the chain rule, all the numerators on the RHS must have the same orientation as the numerator on the LHS, and likewise for the denominators.
Rules 1 and 2, along with the requirement that inner matrix dimensions agree, ensure that the chain rule for a row-vector derivative of a composition $\mathbf{z}\big(\mathbf{y}(\mathbf{x})\big)$ is:
\[ \frac{d\mathbf{z}}{d\mathbf{x}^{\mathrm{T}}} = \frac{d\mathbf{z}}{d\mathbf{y}^{\mathrm{T}}}\,\frac{d\mathbf{y}}{d\mathbf{x}^{\mathrm{T}}}. \]
This chain rule works just as well if $y$ or $\mathbf{z}$ are scalars:
\[ \frac{d\mathbf{z}}{d\mathbf{x}^{\mathrm{T}}} = \frac{d\mathbf{z}}{dy}\,\frac{dy}{d\mathbf{x}^{\mathrm{T}}} \quad\text{(a matrix)}, \qquad \frac{dz}{d\mathbf{x}^{\mathrm{T}}} = \frac{dz}{d\mathbf{y}^{\mathrm{T}}}\,\frac{d\mathbf{y}}{d\mathbf{x}^{\mathrm{T}}} \quad\text{(a row vector)}. \]
We could write down the column-vector version by applying rule 2 while ensuring agreement between the inner matrix dimensions. Alternatively, we can apply rule 1 to the chain rule just derived for the row-vector derivative:
\[ \frac{d\mathbf{z}^{\mathrm{T}}}{d\mathbf{x}} = \left(\frac{d\mathbf{z}}{d\mathbf{x}^{\mathrm{T}}}\right)^{\mathrm{T}} = \left(\frac{d\mathbf{z}}{d\mathbf{y}^{\mathrm{T}}}\,\frac{d\mathbf{y}}{d\mathbf{x}^{\mathrm{T}}}\right)^{\mathrm{T}} = \frac{d\mathbf{y}^{\mathrm{T}}}{d\mathbf{x}}\,\frac{d\mathbf{z}^{\mathrm{T}}}{d\mathbf{y}}. \]
This is perhaps the less intuitive of the two chain rules, since it reverses the order in which the factors are usually written in scalar calculus.
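Since the ordering of factors is easy to get wrong, here is a small numerical check of the row-vector chain rule for one arbitrary composition, $\mathbf{y}(\mathbf{x}) = \tanh(\mathbf{A}\mathbf{x})$ and $\mathbf{z}(\mathbf{y}) = \mathbf{B}\mathbf{y}$; these particular functions and sizes are illustrative assumptions only.

```python
# Check dz/dx^T = (dz/dy^T)(dy/dx^T) for y(x) = tanh(Ax), z(y) = By.
import numpy as np

rng = np.random.default_rng(1)
N, M, P = 4, 3, 2
A = rng.standard_normal((M, N))
B = rng.standard_normal((P, M))
x = rng.standard_normal(N)

y = np.tanh(A @ x)
dy_dxT = np.diag(1 - y**2) @ A           # M x N Jacobian of y w.r.t. x
dz_dyT = B                               # P x M Jacobian of z w.r.t. y
dz_dxT = dz_dyT @ dy_dxT                 # chain rule; inner dimensions (M) agree

# Compare against a finite-difference Jacobian of z(x) = B tanh(Ax).
eps = 1e-6
fd = np.empty((P, N))
for j in range(N):
    dx = np.zeros(N)
    dx[j] = eps
    fd[:, j] = (B @ np.tanh(A @ (x + dx)) - B @ np.tanh(A @ x)) / eps
assert np.allclose(dz_dxT, fd, atol=1e-4)
print("chain-rule check passed")
```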
The product rule.
This motivates no additional matrix-calculus rules, but maintaining agreement among inner matrix dimensions does enforce a particular order. For example, let $f(\mathbf{x}) = \mathbf{u}(\mathbf{x})^{\mathrm{T}}\mathbf{v}(\mathbf{x})$, the dot product of two vector-valued functions. Then the product rule must read:
\[ \frac{df}{d\mathbf{x}^{\mathrm{T}}} = \mathbf{u}^{\mathrm{T}}\frac{d\mathbf{v}}{d\mathbf{x}^{\mathrm{T}}} + \mathbf{v}^{\mathrm{T}}\frac{d\mathbf{u}}{d\mathbf{x}^{\mathrm{T}}}. \]
The column-vector equivalent is easily derived by transposing the RHS. Neither, unfortunately, can be read as “the derivative of the first times the second, plus the first times the derivative of the second,” as it is often taught in scalar calculus. It is easily remembered, nevertheless, by applying our rules 1 and 2, and checking inner matrix dimensions for agreement.
In the special case of a quadratic form, $f(\mathbf{x}) = \mathbf{x}^{\mathrm{T}}\mathbf{A}\mathbf{x}$, this reduces to:
\[ \frac{df}{d\mathbf{x}^{\mathrm{T}}} = \mathbf{x}^{\mathrm{T}}\mathbf{A} + \mathbf{x}^{\mathrm{T}}\mathbf{A}^{\mathrm{T}} = \mathbf{x}^{\mathrm{T}}\left(\mathbf{A} + \mathbf{A}^{\mathrm{T}}\right). \]
In the even more special case where $\mathbf{A}$ is symmetric, $\mathbf{A} = \mathbf{A}^{\mathrm{T}}$, this yields $2\mathbf{x}^{\mathrm{T}}\mathbf{A}$. Evidently, the column-vector equivalent is $2\mathbf{A}\mathbf{x}$.
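The quadratic-form derivatives just stated are also easy to verify numerically; the sketch below does so with finite differences, using an arbitrary $\mathbf{A}$ and $\mathbf{x}$ and building the symmetric case as $\mathbf{A} + \mathbf{A}^{\mathrm{T}}$.

```python
# Check d(x^T A x)/dx^T = x^T (A + A^T), and the symmetric special case.
import numpy as np

rng = np.random.default_rng(2)
N = 4
A = rng.standard_normal((N, N))
x = rng.standard_normal(N)
eps = 1e-6

f = lambda v: v @ A @ v
fd_grad = np.array([(f(x + eps * np.eye(N)[j]) - f(x)) / eps for j in range(N)])
assert np.allclose(fd_grad, x @ (A + A.T), atol=1e-4)    # row-vector version

S = A + A.T                                              # a symmetric matrix
g = lambda v: v @ S @ v
fd_grad_sym = np.array([(g(x + eps * np.eye(N)[j]) - g(x)) / eps for j in range(N)])
assert np.allclose(fd_grad_sym, 2 * S @ x, atol=1e-4)    # column version, 2 S x
print("quadratic-form checks passed")
```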
B.1.2 Derivatives with respect to matrices
Scalar-valued functions.
Given a matrix,
\[ \mathbf{X} = \begin{bmatrix} x_{11} & \cdots & x_{1N} \\ \vdots & \ddots & \vdots \\ x_{M1} & \cdots & x_{MN} \end{bmatrix}, \]
and a scalar-valued function $f(\mathbf{X})$, we define:
\[ \frac{df}{d\mathbf{X}} := \begin{bmatrix} \frac{\partial f}{\partial x_{11}} & \cdots & \frac{\partial f}{\partial x_{1N}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial x_{M1}} & \cdots & \frac{\partial f}{\partial x_{MN}} \end{bmatrix}. \]
This definition can be more easily applied if we translate it into the derivatives with respect to vectors introduced in the previous section. Giving names to the rows ($\bar{\mathbf{x}}_i^{\mathrm{T}}$) and columns ($\mathbf{x}_j$) of $\mathbf{X}$:
\[ \mathbf{X} = \begin{bmatrix} \bar{\mathbf{x}}_1^{\mathrm{T}} \\ \vdots \\ \bar{\mathbf{x}}_M^{\mathrm{T}} \end{bmatrix} = \begin{bmatrix} \mathbf{x}_1 & \cdots & \mathbf{x}_N \end{bmatrix}, \]
we can write:
\[ \frac{df}{d\mathbf{X}} = \begin{bmatrix} \frac{df}{d\bar{\mathbf{x}}_1^{\mathrm{T}}} \\ \vdots \\ \frac{df}{d\bar{\mathbf{x}}_M^{\mathrm{T}}} \end{bmatrix} = \begin{bmatrix} \frac{df}{d\mathbf{x}_1} & \cdots & \frac{df}{d\mathbf{x}_N} \end{bmatrix}. \tag{B.1} \]
This lets us more easily derive some common special cases. Consider the bilinear form $f(\mathbf{X}) = \mathbf{a}^{\mathrm{T}}\mathbf{X}\mathbf{b}$. The derivative with respect to the first row of $\mathbf{X}$ is:
\[ \frac{df}{d\bar{\mathbf{x}}_1^{\mathrm{T}}} = \frac{d}{d\bar{\mathbf{x}}_1^{\mathrm{T}}}\left(\sum_i a_i\,\bar{\mathbf{x}}_i^{\mathrm{T}}\mathbf{b}\right) = a_1\mathbf{b}^{\mathrm{T}}. \]
Stacking all of these rows vertically as in Eq. B.1, we see that:
\[ \frac{df}{d\mathbf{X}} = \begin{bmatrix} a_1\mathbf{b}^{\mathrm{T}} \\ \vdots \\ a_M\mathbf{b}^{\mathrm{T}} \end{bmatrix} = \mathbf{a}\mathbf{b}^{\mathrm{T}}. \]
Alternatively, we might have used the column-gradient formulation:
\[ \frac{df}{d\mathbf{x}_1} = \frac{d}{d\mathbf{x}_1}\left(\sum_j b_j\,\mathbf{a}^{\mathrm{T}}\mathbf{x}_j\right) = b_1\mathbf{a}, \]
and then stacked these columns horizontally as in Eq. B.1:
\[ \frac{df}{d\mathbf{X}} = \begin{bmatrix} b_1\mathbf{a} & \cdots & b_N\mathbf{a} \end{bmatrix} = \mathbf{a}\mathbf{b}^{\mathrm{T}}. \]
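The bilinear-form result $d(\mathbf{a}^{\mathrm{T}}\mathbf{X}\mathbf{b})/d\mathbf{X} = \mathbf{a}\mathbf{b}^{\mathrm{T}}$ can likewise be checked element by element with finite differences; the values below are arbitrary.

```python
# Check that the derivative of a^T X b with respect to X is the outer product a b^T.
import numpy as np

rng = np.random.default_rng(3)
M, N = 3, 4
a = rng.standard_normal(M)
b = rng.standard_normal(N)
X = rng.standard_normal((M, N))
eps = 1e-6

f = lambda Xmat: a @ Xmat @ b
dfdX = np.empty((M, N))
for i in range(M):
    for j in range(N):
        dX = np.zeros((M, N))
        dX[i, j] = eps
        dfdX[i, j] = (f(X + dX) - f(X)) / eps
assert np.allclose(dfdX, np.outer(a, b), atol=1e-4)
print("bilinear-form check passed")
```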
Or again, consider a case where $\mathbf{X}$ shows up in the other part of the bilinear form (in this case, a quadratic form):
\[ f(\mathbf{X}) = (\mathbf{X}\mathbf{a})^{\mathrm{T}}\mathbf{B}\,\mathbf{X}\mathbf{a}. \tag{B.2} \]
Then defining $\mathbf{u} := \mathbf{X}\mathbf{a}$, and considering again just the first row of $\mathbf{X}$ (on which only the first element of $\mathbf{u}$ depends), we find:
\[ \frac{df}{d\bar{\mathbf{x}}_1^{\mathrm{T}}} = \frac{df}{d\mathbf{u}^{\mathrm{T}}}\,\frac{d\mathbf{u}}{d\bar{\mathbf{x}}_1^{\mathrm{T}}} = \mathbf{u}^{\mathrm{T}}\left(\mathbf{B} + \mathbf{B}^{\mathrm{T}}\right)\mathbf{e}_1\mathbf{a}^{\mathrm{T}} = c_1\mathbf{a}^{\mathrm{T}}, \]
where $\mathbf{e}_1$ is the first standard basis vector, $\mathbf{c} := \left(\mathbf{B} + \mathbf{B}^{\mathrm{T}}\right)\mathbf{u}$, and $c_1$ is its first element. Stacking these rows vertically, as in Eq. B.1, yields:
\[ \frac{df}{d\mathbf{X}} = \mathbf{c}\mathbf{a}^{\mathrm{T}} = \left(\mathbf{B} + \mathbf{B}^{\mathrm{T}}\right)\mathbf{X}\mathbf{a}\mathbf{a}^{\mathrm{T}}. \]
A common application of this derivative occurs when working with Gaussian functions, whose exponent can be written in terms of the quadratic form defined in Eq. B.2. In this case, the matrix $\mathbf{B}$ is symmetric, and the result simplifies further. More generally, Eq. B.2 occurs in quadratic penalties on the state in control problems, in which case $\mathbf{X}$ would be the state-transition matrix.
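The quadratic-form-in-$\mathbf{X}$ result can also be checked numerically. The sketch below verifies $df/d\mathbf{X} = (\mathbf{B}+\mathbf{B}^{\mathrm{T}})\mathbf{X}\mathbf{a}\mathbf{a}^{\mathrm{T}}$ by finite differences, with arbitrary values for $\mathbf{X}$, $\mathbf{a}$, and $\mathbf{B}$ (and $\mathbf{B}$ deliberately left asymmetric, so the general form is exercised).

```python
# Check d[(Xa)^T B (Xa)]/dX = (B + B^T) X a a^T with finite differences.
import numpy as np

rng = np.random.default_rng(4)
M, N = 3, 4
X = rng.standard_normal((M, N))
a = rng.standard_normal(N)
B = rng.standard_normal((M, M))
eps = 1e-6

f = lambda Xmat: (Xmat @ a) @ B @ (Xmat @ a)
dfdX = np.empty((M, N))
for i in range(M):
    for j in range(N):
        dX = np.zeros((M, N))
        dX[i, j] = eps
        dfdX[i, j] = (f(X + dX) - f(X)) / eps
assert np.allclose(dfdX, (B + B.T) @ X @ np.outer(a, a), atol=1e-4)
print("quadratic-in-X check passed")
```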
Matrix-valued functions.
The derivative of a matrix-valued function with respect to a matrix is a tensor. These are cumbersome, so in a way our discussion of them is merely preliminary to what follows. Let $x_{ij}$ be the $(i,j)^{\text{th}}$ entry of $\mathbf{X}$. We consider a few simple matrix functions of $\mathbf{X}$:
where is the column of . Transposes and derivatives commute, as usual, so the derivative of (e.g.) is just the transpose of the above. That means that
with the row of . From the first we can also compute the slightly more complicated, but elegant:
with the column of . And for a square matrix , we consider the even more complicated:
Finally, consider the vector-valued function , where has Jacobian but is otherwise unspecified. Its derivative with respect to an element of is
For all of the above, the derivative with respect to the entire matrix $\mathbf{X}$ is just the collection of these matrices for all $i$ and $j$.
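Derivatives of this kind are easy to check numerically. As one simple (assumed) instance, for $\mathbf{Y}(\mathbf{X}) = \mathbf{A}\mathbf{X}$ the derivative with respect to $x_{ij}$ is the $i^{\text{th}}$ column of $\mathbf{A}$ placed in the $j^{\text{th}}$ column of an otherwise zero matrix; the sketch below confirms this by finite differences.

```python
# For Y(X) = A X (an illustrative choice), dY/dx_ij = A[:, i] e_j^T.
import numpy as np

rng = np.random.default_rng(5)
P, M, N = 2, 3, 4
A = rng.standard_normal((P, M))
X = rng.standard_normal((M, N))
i, j, eps = 1, 2, 1e-6

dX = np.zeros((M, N))
dX[i, j] = eps
fd = (A @ (X + dX) - A @ X) / eps                 # finite-difference P x N matrix
analytic = np.outer(A[:, i], np.eye(N)[j])        # i-th column of A times e_j^T
assert np.allclose(fd, analytic, atol=1e-4)
print("matrix-valued derivative check passed")
```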
Some applications of the chain rule to matrix derivatives.
Now suppose we want to take a derivative with respect to $\mathbf{X}$ of the scalar-valued function $f(\mathbf{Y}(\mathbf{X}))$, for various matrix-valued functions $\mathbf{Y}(\mathbf{X})$. We shall consider in particular those lately worked out. The chain rule here says that the $(i,j)^{\text{th}}$ element of this matrix is:
\[ \frac{\partial f}{\partial x_{ij}} = \sum_{k,l}\frac{\partial f}{\partial y_{kl}}\frac{\partial y_{kl}}{\partial x_{ij}} = \mathbf{1}^{\mathrm{T}}\left(\frac{df}{d\mathbf{Y}}\odot\frac{\partial\mathbf{Y}}{\partial x_{ij}}\right)\mathbf{1}, \]
with $\mathbf{1}$ a vector of ones and $\odot$ the entry-wise (Hadamard) product. We now apply this equation to the results above:
(Eqs. B.3–B.6)
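As a concrete (assumed) instance of this element-wise chain rule, take $f(\mathbf{Y}) = \sum_{k,l} y_{kl}^2$, so that $df/d\mathbf{Y} = 2\mathbf{Y}$, and $\mathbf{Y}(\mathbf{X}) = \mathbf{A}\mathbf{X}$; the sketch below compares the formula against a finite difference.

```python
# Check df/dx_ij = 1^T (df/dY ⊙ dY/dx_ij) 1 for f(Y) = sum(Y**2), Y(X) = A X.
import numpy as np

rng = np.random.default_rng(6)
P, M, N = 2, 3, 4
A = rng.standard_normal((P, M))
X = rng.standard_normal((M, N))
i, j, eps = 0, 1, 1e-6

Y = A @ X
dfdY = 2 * Y                                      # df/dY for f = sum(Y**2)
dYdxij = np.outer(A[:, i], np.eye(N)[j])          # dY/dx_ij for Y = A X
chain = np.sum(dfdY * dYdxij)                     # 1^T (· ⊙ ·) 1, i.e. sum of elements

dX = np.zeros((M, N))
dX[i, j] = eps
fd = (np.sum((A @ (X + dX))**2) - np.sum(Y**2)) / eps
assert np.isclose(chain, fd, atol=1e-3)
print("matrix chain-rule check passed")
```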
For the vector-valued function considered above,
A few special cases are interesting. Since the trace and the derivative are linear operators, they commute, and in particular
\[ \frac{\partial}{\partial x_{ij}}\,\mathrm{tr}\!\left(\mathbf{Y}\right) = \mathrm{tr}\!\left(\frac{\partial\mathbf{Y}}{\partial x_{ij}}\right). \]
Therefore, letting $f(\cdot) = \mathrm{tr}(\cdot)$ in the above equations, we have
(Eqs. B.8–B.10)
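One standard identity of this family (whichever particular functions Eqs. B.8–B.10 record) is $d\,\mathrm{tr}(\mathbf{A}\mathbf{X})/d\mathbf{X} = \mathbf{A}^{\mathrm{T}}$, which can be checked numerically:

```python
# Finite-difference check that d tr(A X)/dX = A^T.
import numpy as np

rng = np.random.default_rng(7)
M, N = 3, 4
X = rng.standard_normal((M, N))
A = rng.standard_normal((N, M))        # A X is then square, so tr(A X) is defined
eps = 1e-6

f = lambda Xmat: np.trace(A @ Xmat)
dfdX = np.empty((M, N))
for i in range(M):
    for j in range(N):
        dX = np.zeros((M, N))
        dX[i, j] = eps
        dfdX[i, j] = (f(X + dX) - f(X)) / eps
assert np.allclose(dfdX, A.T, atol=1e-4)
print("trace-derivative check passed")
```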
B.1.3 More useful identities
Now that we have defined derivatives (of scalars and vectors) with respect to vectors and derivatives (of scalars) with respect to matrices, we can derive the following useful identities.
The derivative of the log-determinant.
The trace and determinant of a matrix are related by a useful formula, derived through their respective relationships with the matrix’s spectrum. Recall:
\[ |\mathbf{A}| = \prod_i \lambda_i, \qquad \mathrm{tr}(\mathbf{A}) = \sum_i \lambda_i. \]
For each eigenvalue $\lambda_i$ and associated eigenvector $\mathbf{v}_i$, $\mathbf{A}\mathbf{v}_i = \lambda_i\mathbf{v}_i$, so:
\[ \exp(\mathbf{A})\,\mathbf{v}_i = \sum_{n=0}^{\infty}\frac{\mathbf{A}^n}{n!}\mathbf{v}_i = \sum_{n=0}^{\infty}\frac{\lambda_i^n}{n!}\mathbf{v}_i = e^{\lambda_i}\mathbf{v}_i. \]
Therefore, the eigenvectors of $\exp(\mathbf{A})$ are the eigenvectors $\mathbf{v}_i$ of $\mathbf{A}$, and the eigenvalues of $\exp(\mathbf{A})$ are the exponentiated eigenvalues of $\mathbf{A}$. Hence:
\[ |\exp(\mathbf{A})| = \prod_i e^{\lambda_i} = \exp\!\left(\sum_i\lambda_i\right) = \exp\!\big(\mathrm{tr}(\mathbf{A})\big). \]
Therefore:
\[ \log\left|\exp(\mathbf{A})\right| = \mathrm{tr}(\mathbf{A}); \]
or, writing $\mathbf{B} := \exp(\mathbf{A})$, $\log|\mathbf{B}| = \mathrm{tr}\!\big(\log\mathbf{B}\big)$.
So far we have made use only of results from scalar calculus. (The derivative of the log of a matrix can be derived easily in terms of the Maclaurin series for the natural logarithm.)
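The identity $|\exp(\mathbf{A})| = \exp(\mathrm{tr}(\mathbf{A}))$ can be checked numerically; the sketch below uses a random symmetric $\mathbf{A}$ so that its matrix exponential can be formed by eigendecomposition with NumPy alone.

```python
# Check |exp(A)| = exp(tr(A)) for a random symmetric A.
import numpy as np

rng = np.random.default_rng(8)
N = 4
S = rng.standard_normal((N, N))
A = (S + S.T) / 2                        # symmetric, so np.linalg.eigh applies
lam, V = np.linalg.eigh(A)
expA = V @ np.diag(np.exp(lam)) @ V.T    # matrix exponential of A

sign, logdet = np.linalg.slogdet(expA)
assert sign > 0
assert np.isclose(logdet, np.trace(A))
print("log-determinant/trace check passed")
```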
Another example.
Cf. the “trace trick,” in e.g. IPGM. When $f(\mathbf{X}) = \log|\mathbf{X}|$ (i.e. the log of the determinant of $\mathbf{X}$),
\[ \frac{df}{d\mathbf{X}} = \left(\mathbf{X}^{-1}\right)^{\mathrm{T}} = \mathbf{X}^{-\mathrm{T}}, \]
i.e. the inverse transpose. (This can be derived from the “interesting scalar case” below.) From this it follows easily that
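A finite-difference check of $d\log|\mathbf{X}|/d\mathbf{X} = \mathbf{X}^{-\mathrm{T}}$ is given below, for an arbitrary $\mathbf{X}$ shifted along the diagonal to keep its determinant safely positive.

```python
# Check d log|X| / dX = (X^{-1})^T by finite differences.
import numpy as np

rng = np.random.default_rng(9)
N = 4
X = rng.standard_normal((N, N)) + N * np.eye(N)    # diagonal shift keeps |X| > 0
eps = 1e-6

f = lambda Xm: np.linalg.slogdet(Xm)[1]            # log|Xm| (sign assumed positive)
dfdX = np.empty((N, N))
for i in range(N):
    for j in range(N):
        dX = np.zeros((N, N))
        dX[i, j] = eps
        dfdX[i, j] = (f(X + dX) - f(X)) / eps
assert np.allclose(dfdX, np.linalg.inv(X).T, atol=1e-4)
print("log-determinant derivative check passed")
```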