B.1 Matrix Calculus

In machine learning and computational neuroscience, derivatives often appear in equations involving matrices and vectors. Although it is always possible to re-express these equations in terms of sums of simpler derivatives, evaluating such expressions can be extremely tedious. It is therefore quite useful to have at hand rules for applying the derivatives directly to the matrices and vectors. The resulting computations are both easier to carry out and more economical to express. In defining these “matrix derivatives,” however, some care is required to ensure that the usual formulations of the rules of scalar calculus (the chain rule, product rule, etc.) are preserved. We do that here.

Throughout, we treat vectors with a transpose (denoted $\bm{x}^{\text{T}}$) as rows, and vectors without as columns.

B.1.1 Derivatives with respect to vectors

We conceptualize this fundamental operation as applying to vectors and yielding matrices. Application to scalars—or rather, scalar-valued functions—is then defined as a special case. Application to matrices (and higher-order tensors) is undefined.

The central idea in our definition is that the dimensions of the derivative must match the dimensions of the resulting matrix. In particular, we allow derivatives with respect to both column and row vectors; however:

  1. In derivatives of a vector with respect to a vector, the two vectors must have opposite orientations; that is, we can take $\mathrm{d}\bm{y}/\mathrm{d}\bm{x}^{\text{T}}$ and $\mathrm{d}\bm{y}^{\text{T}}/\mathrm{d}\bm{x}$, but not $\mathrm{d}\bm{y}/\mathrm{d}\bm{x}$ or $\mathrm{d}\bm{y}^{\text{T}}/\mathrm{d}\bm{x}^{\text{T}}$. They are defined according to

     $$\frac{\mathrm{d}\bm{y}}{\mathrm{d}\bm{x}^{\text{T}}} = \mathbf{J}(\bm{y}) \qquad\qquad \frac{\mathrm{d}\bm{y}^{\text{T}}}{\mathrm{d}\bm{x}} = \mathbf{J}^{\text{T}}(\bm{y}),$$

     the Jacobian and its transpose.

Thus, the transformation of “shapes” behaves like an outer product: if $\bm{y}$ has length $m$ and $\bm{x}$ has length $n$, then $\mathrm{d}\bm{y}/\mathrm{d}\bm{x}^{\text{T}}$ is $m \times n$ and $\mathrm{d}\bm{y}^{\text{T}}/\mathrm{d}\bm{x}$ is $n \times m$.

Several special cases warrant attention. Consider the linear, vector-valued function $\bm{y} = \mathbf{A}\bm{x}$. Since $\bm{y}$ is a column, the derivative must be with respect to a row. In particular:

$$\frac{\mathrm{d}(\mathbf{A}\bm{x})}{\mathrm{d}\bm{x}^{\text{T}}} = \mathbf{A}\frac{\mathrm{d}\bm{x}}{\mathrm{d}\bm{x}^{\text{T}}} = \mathbf{A}\mathbf{I} = \mathbf{A}.$$

Or again, consider the case where $y$ is just a scalar function of $\bm{x}$. Rule 1 then says that $\mathrm{d}y/\mathrm{d}\bm{x}$ is a column-vector version of the gradient, and $\mathrm{d}y/\mathrm{d}\bm{x}^{\text{T}}$ a row-vector version. When $y$ is a linear, scalar function of $\bm{x}$, $\bm{c}\cdot\bm{x}$, the rule says that:

$$\begin{split}
\frac{\mathrm{d}(\bm{c}\cdot\bm{x})}{\mathrm{d}\bm{x}^{\text{T}}} &= \frac{\mathrm{d}(\bm{x}^{\text{T}}\bm{c})}{\mathrm{d}\bm{x}^{\text{T}}} = \frac{\mathrm{d}(\bm{c}^{\text{T}}\bm{x})}{\mathrm{d}\bm{x}^{\text{T}}} = \bm{c}^{\text{T}}\frac{\mathrm{d}\bm{x}}{\mathrm{d}\bm{x}^{\text{T}}} = \bm{c}^{\text{T}}\mathbf{I} = \bm{c}^{\text{T}} \\
\frac{\mathrm{d}(\bm{c}\cdot\bm{x})}{\mathrm{d}\bm{x}} &= \frac{\mathrm{d}(\bm{c}^{\text{T}}\bm{x})}{\mathrm{d}\bm{x}} = \frac{\mathrm{d}(\bm{x}^{\text{T}}\bm{c})}{\mathrm{d}\bm{x}} = \frac{\mathrm{d}\bm{x}^{\text{T}}}{\mathrm{d}\bm{x}}\bm{c} = \mathbf{I}\bm{c} = \bm{c}.
\end{split}$$
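As a quick numerical check of these conventions, here is a minimal sketch in NumPy (the finite-difference helper `numerical_jacobian` is our own construction, not a library routine), confirming that the row derivative of $\mathbf{A}\bm{x}$ is $\mathbf{A}$ and that the row derivative of $\bm{c}\cdot\bm{x}$ is $\bm{c}^{\text{T}}$:

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Finite-difference row derivative df/dx^T: rows index outputs, columns index inputs."""
    y0 = np.atleast_1d(f(x))
    J = np.zeros((y0.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (np.atleast_1d(f(x + dx)) - y0) / eps
    return J

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
c = rng.standard_normal(4)
x = rng.standard_normal(4)

# d(Ax)/dx^T = A  (an m x n Jacobian)
print(np.allclose(numerical_jacobian(lambda x: A @ x, x), A, atol=1e-4))
# d(c.x)/dx^T = c^T  (a row vector); its transpose is the column gradient c
print(np.allclose(numerical_jacobian(lambda x: c @ x, x), c[None, :], atol=1e-4))
```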

The chain rule.

Getting the chain rule right means making sure that the dimensions of the vectors and matrices generated by taking derivatives line up properly, which motivates the rule:

  2. In the elements generated by the chain rule, all the numerators on the RHS must have the same orientation as the numerator on the LHS, and likewise for the denominators.

Rules 1 and 2, along with the requirement that inner matrix dimensions agree, ensure that the chain rule for a row-vector derivative is:

$$\frac{\mathrm{d}}{\mathrm{d}\bm{x}^{\text{T}}}\bm{z}(\bm{y}(\bm{x})) = \frac{\mathrm{d}\bm{z}}{\mathrm{d}\bm{y}^{\text{T}}}\frac{\mathrm{d}\bm{y}}{\mathrm{d}\bm{x}^{\text{T}}}.$$

This chain rule works just as well if $z$ or $y$ is a scalar:

$$\begin{aligned}
\frac{\mathrm{d}}{\mathrm{d}\bm{x}^{\text{T}}}\bm{z}(y(\bm{x})) &= \frac{\mathrm{d}\bm{z}}{\mathrm{d}y}\frac{\mathrm{d}y}{\mathrm{d}\bm{x}^{\text{T}}} &&\text{(a matrix)} \\
\frac{\mathrm{d}}{\mathrm{d}\bm{x}^{\text{T}}}z(\bm{y}(\bm{x})) &= \frac{\mathrm{d}z}{\mathrm{d}\bm{y}^{\text{T}}}\frac{\mathrm{d}\bm{y}}{\mathrm{d}\bm{x}^{\text{T}}} &&\text{(a row vector)} \\
\frac{\mathrm{d}}{\mathrm{d}\bm{x}^{\text{T}}}z(y(\bm{x})) &= \frac{\mathrm{d}z}{\mathrm{d}y}\frac{\mathrm{d}y}{\mathrm{d}\bm{x}^{\text{T}}} &&\text{(a row vector).}
\end{aligned}$$

We could write down the column-vector version by applying rule 2 while ensuring agreement between the inner matrix dimensions. Alternatively, we can apply rule 1 to the chain rule just derived for the row-vector derivative:

$$\frac{\mathrm{d}}{\mathrm{d}\bm{x}}\bm{z}^{\text{T}}(\bm{y}(\bm{x})) = \left(\frac{\mathrm{d}\bm{z}}{\mathrm{d}\bm{y}^{\text{T}}}\frac{\mathrm{d}\bm{y}}{\mathrm{d}\bm{x}^{\text{T}}}\right)^{\text{T}} = \frac{\mathrm{d}\bm{y}^{\text{T}}}{\mathrm{d}\bm{x}}\frac{\mathrm{d}\bm{z}^{\text{T}}}{\mathrm{d}\bm{y}}.$$

This is perhaps the less intuitive of the two chain rules, since it reverses the order in which the factors are usually written in scalar calculus.
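The row-vector chain rule is easily verified numerically. A minimal sketch, assuming NumPy and using test functions of our own choosing ($\bm{y}(\bm{x}) = \tanh(\mathbf{A}\bm{x})$, $\bm{z}(\bm{y}) = \mathbf{B}\bm{y}$):

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Finite-difference row derivative df/dx^T."""
    y0 = np.atleast_1d(f(x))
    J = np.zeros((y0.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (np.atleast_1d(f(x + dx)) - y0) / eps
    return J

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 4))   # y = tanh(A x)
B = rng.standard_normal((2, 3))   # z = B y
x = rng.standard_normal(4)

y_of = lambda x: np.tanh(A @ x)
z_of = lambda x: B @ y_of(x)

# dz/dx^T = (dz/dy^T)(dy/dx^T): a (2x3) times a (3x4) gives the (2x4) Jacobian
lhs = numerical_jacobian(z_of, x)
rhs = numerical_jacobian(lambda y: B @ y, y_of(x)) @ numerical_jacobian(y_of, x)
print(np.allclose(lhs, rhs, atol=1e-4))
```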

The product rule.

This motivates no additional matrix-calculus rules, but maintaining agreement among inner matrix dimensions does enforce a particular order. For example, let $y = \bm{u}(\bm{x})\cdot\bm{v}(\bm{x})$, the dot product of two vector-valued functions. Then the product rule must read:

$$\frac{\mathrm{d}y}{\mathrm{d}\bm{x}^{\text{T}}} = \frac{\mathrm{d}(\bm{v}^{\text{T}}\bm{u})}{\mathrm{d}\bm{u}^{\text{T}}}\frac{\mathrm{d}\bm{u}}{\mathrm{d}\bm{x}^{\text{T}}} + \frac{\mathrm{d}(\bm{u}^{\text{T}}\bm{v})}{\mathrm{d}\bm{v}^{\text{T}}}\frac{\mathrm{d}\bm{v}}{\mathrm{d}\bm{x}^{\text{T}}} = \bm{v}^{\text{T}}\mathbf{J}(\bm{u}) + \bm{u}^{\text{T}}\mathbf{J}(\bm{v}).$$

The column-vector equivalent is easily derived by transposing the RHS. Neither, unfortunately, can be read as “the derivative of the first times the second, plus the first times the derivative of the second,” as it is often taught in scalar calculus. It is easily remembered, nevertheless, by applying our rules 1 and 2, and checking inner matrix dimensions for agreement.

In the special case of a quadratic form, $y = \bm{x}^{\text{T}}\mathbf{A}\bm{x}$, this reduces to:

$$\frac{\mathrm{d}(\bm{x}^{\text{T}}\mathbf{A}\bm{x})}{\mathrm{d}\bm{x}^{\text{T}}} = \bm{x}^{\text{T}}\mathbf{A}^{\text{T}} + \bm{x}^{\text{T}}\mathbf{A} = \bm{x}^{\text{T}}(\mathbf{A}^{\text{T}} + \mathbf{A}).$$

In the even more special case where $\mathbf{A}$ is symmetric, $\mathbf{A} = \mathbf{A}^{\text{T}}$, this yields $2\bm{x}^{\text{T}}\mathbf{A}$. Evidently, the column-vector equivalent is $2\mathbf{A}\bm{x}$.
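Both the dot-product rule and its quadratic-form special case can be confirmed numerically; a minimal NumPy sketch follows (the particular test functions and dimensions are our own choices):

```python
import numpy as np

def numerical_row_derivative(f, x, eps=1e-6):
    """Finite-difference d f / d x^T for scalar- or vector-valued f."""
    y0 = np.atleast_1d(f(x))
    J = np.zeros((y0.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (np.atleast_1d(f(x + dx)) - y0) / eps
    return J

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((3, 4))
Q = rng.standard_normal((4, 4))
x = rng.standard_normal(4)

u = lambda x: np.tanh(A @ x)
v = lambda x: np.sin(B @ x)

# product rule: d(u.v)/dx^T = v^T J(u) + u^T J(v)
lhs = numerical_row_derivative(lambda x: u(x) @ v(x), x)
rhs = v(x) @ numerical_row_derivative(u, x) + u(x) @ numerical_row_derivative(v, x)
print(np.allclose(lhs, rhs, atol=1e-4))

# quadratic form: d(x^T Q x)/dx^T = x^T (Q^T + Q)
lhs = numerical_row_derivative(lambda x: x @ Q @ x, x)
print(np.allclose(lhs, (x @ (Q.T + Q))[None, :], atol=1e-4))
```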

B.1.2 Derivatives with respect to matrices

Scalar-valued functions.

Given a matrix,

$$\mathbf{X} = \begin{pmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,n} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m,1} & x_{m,2} & \cdots & x_{m,n} \end{pmatrix},$$

and a scalar-valued function $y$, we define:

$$\frac{\mathrm{d}y}{\mathrm{d}\mathbf{X}} = \begin{pmatrix} \frac{\mathrm{d}y}{\mathrm{d}x_{1,1}} & \frac{\mathrm{d}y}{\mathrm{d}x_{1,2}} & \cdots & \frac{\mathrm{d}y}{\mathrm{d}x_{1,n}} \\ \frac{\mathrm{d}y}{\mathrm{d}x_{2,1}} & \frac{\mathrm{d}y}{\mathrm{d}x_{2,2}} & \cdots & \frac{\mathrm{d}y}{\mathrm{d}x_{2,n}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\mathrm{d}y}{\mathrm{d}x_{m,1}} & \frac{\mathrm{d}y}{\mathrm{d}x_{m,2}} & \cdots & \frac{\mathrm{d}y}{\mathrm{d}x_{m,n}} \end{pmatrix}.$$

This definition can be more easily applied if we translate it into the derivatives with respect to vectors introduced in the previous section. Giving names to the rows ($\bm{\bar{x}}_i^{\text{T}}$) and columns ($\bm{x}_i$) of $\mathbf{X}$:

$$\mathbf{X} = \begin{pmatrix} \bm{\bar{x}}_1^{\text{T}} \\ \bm{\bar{x}}_2^{\text{T}} \\ \vdots \\ \bm{\bar{x}}_m^{\text{T}} \end{pmatrix} = \begin{pmatrix} \bm{x}_1 & \bm{x}_2 & \cdots & \bm{x}_n \end{pmatrix},$$

we can write:

$$\frac{\mathrm{d}y}{\mathrm{d}\mathbf{X}} = \begin{pmatrix} \frac{\mathrm{d}y}{\mathrm{d}\bm{\bar{x}}_1^{\text{T}}} \\ \frac{\mathrm{d}y}{\mathrm{d}\bm{\bar{x}}_2^{\text{T}}} \\ \vdots \\ \frac{\mathrm{d}y}{\mathrm{d}\bm{\bar{x}}_m^{\text{T}}} \end{pmatrix} = \begin{pmatrix} \frac{\mathrm{d}y}{\mathrm{d}\bm{x}_1} & \frac{\mathrm{d}y}{\mathrm{d}\bm{x}_2} & \cdots & \frac{\mathrm{d}y}{\mathrm{d}\bm{x}_n} \end{pmatrix}. \tag{B.1}$$

This lets us more easily derive some common special cases. Consider the bilinear form $y = \bm{a}^{\text{T}}\mathbf{X}\bm{b}$. The derivative with respect to the first row of $\mathbf{X}$ is:

$$\frac{\mathrm{d}(\bm{a}^{\text{T}}\mathbf{X}\bm{b})}{\mathrm{d}\bm{\bar{x}}_1^{\text{T}}} = \bm{a}^{\text{T}} \begin{pmatrix} \frac{\mathrm{d}(\bm{b}^{\text{T}}\bm{\bar{x}}_1)}{\mathrm{d}\bm{\bar{x}}_1^{\text{T}}} \\ \frac{\mathrm{d}(\bm{b}^{\text{T}}\bm{\bar{x}}_2)}{\mathrm{d}\bm{\bar{x}}_1^{\text{T}}} \\ \vdots \\ \frac{\mathrm{d}(\bm{b}^{\text{T}}\bm{\bar{x}}_m)}{\mathrm{d}\bm{\bar{x}}_1^{\text{T}}} \end{pmatrix} = \bm{a}^{\text{T}} \begin{pmatrix} \bm{b}^{\text{T}} \\ \bm{0}^{\text{T}} \\ \vdots \\ \bm{0}^{\text{T}} \end{pmatrix} = a_1 \bm{b}^{\text{T}}.$$

Stacking all $m$ of these rows vertically, as in Eq. B.1, we see that:

$$\frac{\mathrm{d}(\bm{a}^{\text{T}}\mathbf{X}\bm{b})}{\mathrm{d}\mathbf{X}} = \begin{pmatrix} a_1 \bm{b}^{\text{T}} \\ a_2 \bm{b}^{\text{T}} \\ \vdots \\ a_m \bm{b}^{\text{T}} \end{pmatrix} = \bm{a}\bm{b}^{\text{T}}.$$

Alternatively, we might have used the column-gradient formulation:

$$\frac{\mathrm{d}(\bm{a}^{\text{T}}\mathbf{X}\bm{b})}{\mathrm{d}\bm{x}_1} = \begin{pmatrix} \frac{\mathrm{d}(\bm{x}_1^{\text{T}}\bm{a})}{\mathrm{d}\bm{x}_1} & \frac{\mathrm{d}(\bm{x}_2^{\text{T}}\bm{a})}{\mathrm{d}\bm{x}_1} & \cdots & \frac{\mathrm{d}(\bm{x}_n^{\text{T}}\bm{a})}{\mathrm{d}\bm{x}_1} \end{pmatrix} \bm{b} = b_1 \bm{a},$$

and then stacked these columns horizontally as in Eq. B.1:

$$\frac{\mathrm{d}(\bm{a}^{\text{T}}\mathbf{X}\bm{b})}{\mathrm{d}\mathbf{X}} = \begin{pmatrix} b_1 \bm{a} & b_2 \bm{a} & \cdots & b_n \bm{a} \end{pmatrix} = \bm{a}\bm{b}^{\text{T}}.$$
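A numerical check of $\mathrm{d}(\bm{a}^{\text{T}}\mathbf{X}\bm{b})/\mathrm{d}\mathbf{X} = \bm{a}\bm{b}^{\text{T}}$, perturbing one entry of $\mathbf{X}$ at a time (a minimal NumPy sketch; the dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 3, 5
X = rng.standard_normal((m, n))
a = rng.standard_normal(m)
b = rng.standard_normal(n)

# finite-difference derivative of y = a^T X b with respect to every entry of X
eps = 1e-6
dydX = np.zeros((m, n))
for i in range(m):
    for j in range(n):
        Xp = X.copy()
        Xp[i, j] += eps
        dydX[i, j] = (a @ Xp @ b - a @ X @ b) / eps

print(np.allclose(dydX, np.outer(a, b), atol=1e-4))   # d(a^T X b)/dX = a b^T
```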

Or again, consider a case where $\mathbf{X}$ shows up in the other part of the bilinear form (in this case, a quadratic form):

$$y = (\mathbf{X}\bm{a} + \bm{b})^{\text{T}}\mathbf{W}(\mathbf{X}\bm{a} + \bm{b}). \tag{B.2}$$

Then defining $\bm{z} := \mathbf{X}\bm{a} + \bm{b}$, and considering again just the first row of $\mathbf{X}$, we find:

$$\frac{\mathrm{d}y}{\mathrm{d}\bm{\bar{x}}_1^{\text{T}}} = \frac{\mathrm{d}(\bm{z}^{\text{T}}\mathbf{W}\bm{z})}{\mathrm{d}\bm{\bar{x}}_1^{\text{T}}} = (\mathbf{W}\bm{z})^{\text{T}}\frac{\mathrm{d}\bm{z}}{\mathrm{d}\bm{\bar{x}}_1^{\text{T}}} + \bm{z}^{\text{T}}\frac{\mathrm{d}(\mathbf{W}\bm{z})}{\mathrm{d}\bm{\bar{x}}_1^{\text{T}}} = \bm{z}^{\text{T}}(\mathbf{W}^{\text{T}} + \mathbf{W})\frac{\mathrm{d}\bm{z}}{\mathrm{d}\bm{\bar{x}}_1^{\text{T}}} = \bm{z}^{\text{T}}(\mathbf{W}^{\text{T}} + \mathbf{W}) \begin{pmatrix} \bm{a}^{\text{T}} \\ \bm{0}^{\text{T}} \\ \vdots \\ \bm{0}^{\text{T}} \end{pmatrix} = v_1 \bm{a}^{\text{T}},$$

where $\bm{v} := (\mathbf{W} + \mathbf{W}^{\text{T}})\bm{z}$, and $v_1$ is its first element. Stacking these rows vertically, as in Eq. B.1, yields:

$$\frac{\mathrm{d}y}{\mathrm{d}\mathbf{X}} = \bm{v}\bm{a}^{\text{T}} = (\mathbf{W} + \mathbf{W}^{\text{T}})(\mathbf{X}\bm{a} + \bm{b})\bm{a}^{\text{T}}.$$

A common application of this derivative occurs when working with Gaussian functions, which can be written $e^{-y/2}$ for the $y$ defined in Eq. B.2. In this case, the matrix $\mathbf{W}$ is symmetric, and the result simplifies further. More generally, Eq. B.2 occurs in quadratic penalties on the state in control problems, in which case $\mathbf{X}$ would be the state-transition matrix.
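The same entry-by-entry check applies to the derivative of Eq. B.2. A minimal NumPy sketch (the dimensions and random matrices are our own choices):

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 4, 3
X = rng.standard_normal((m, n))
W = rng.standard_normal((m, m))
a = rng.standard_normal(n)
b = rng.standard_normal(m)

y = lambda X: (X @ a + b) @ W @ (X @ a + b)   # Eq. B.2

# finite-difference dy/dX, entry by entry
eps = 1e-6
dydX = np.zeros((m, n))
for i in range(m):
    for j in range(n):
        Xp = X.copy()
        Xp[i, j] += eps
        dydX[i, j] = (y(Xp) - y(X)) / eps

analytic = np.outer((W + W.T) @ (X @ a + b), a)   # (W + W^T)(X a + b) a^T
print(np.allclose(dydX, analytic, atol=1e-3))
```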

Matrix-valued functions.

The derivative of a matrix-valued function with respect to a matrix is a tensor. These are cumbersome, so in a way our discussion of them is merely preliminary to what follows. Let $x_{ij}$ be the $(i,j)^{\text{th}}$ entry of $\mathbf{X}$. We consider a few simple matrix functions of $\mathbf{X}$:

$$\frac{\mathrm{d}(\mathbf{A}\mathbf{X})}{\mathrm{d}x_{ij}} = \begin{bmatrix} \bm{0} & \bm{0} & \cdots & \bm{0} & \underbrace{\bm{a}_i}_{j^{\text{th}}\text{ column}} & \bm{0} & \cdots & \bm{0} \end{bmatrix},$$

where $\bm{a}_i$ is the $i^{\text{th}}$ column of $\mathbf{A}$. Transposes and derivatives commute, as usual, so the derivative of $\mathbf{X}^{\text{T}}\mathbf{A}^{\text{T}}$ (e.g.) is just the transpose of the above. That means that

$$\frac{\mathrm{d}(\mathbf{X}\mathbf{A})}{\mathrm{d}x_{ij}} = \begin{bmatrix} \bm{0} & \bm{0} & \cdots & \bm{0} & \underbrace{\bm{\tilde{a}}_j}_{i^{\text{th}}\text{ column}} & \bm{0} & \cdots & \bm{0} \end{bmatrix}^{\text{T}},$$

with $\bm{\tilde{a}}_j^{\text{T}}$ the $j^{\text{th}}$ row of $\mathbf{A}$. From the first we can also compute the slightly more complicated, but elegant:

$$\frac{\mathrm{d}(\mathbf{A}\mathbf{X}\mathbf{B}^{\text{T}})}{\mathrm{d}x_{ij}} = \bm{a}_i\bm{b}_j^{\text{T}},$$

with $\bm{b}_j$ the $j^{\text{th}}$ column of $\mathbf{B}$. And for a square matrix $\mathbf{A}$, we consider the even more complicated:

$$\frac{\mathrm{d}(\mathbf{X}\mathbf{A}\mathbf{X}^{\text{T}})}{\mathrm{d}x_{ij}} = \begin{bmatrix} \bm{0} & \cdots & \bm{0} & \underbrace{\mathbf{X}\bm{\tilde{a}}_j}_{i^{\text{th}}\text{ column}} & \bm{0} & \cdots & \bm{0} \end{bmatrix}^{\text{T}} + \begin{bmatrix} \bm{0} & \cdots & \bm{0} & \underbrace{\mathbf{X}\bm{a}_j}_{i^{\text{th}}\text{ column}} & \bm{0} & \cdots & \bm{0} \end{bmatrix}.$$

Finally, consider the vector-valued function $\bm{f}(\mathbf{X}\bm{a}+\bm{b})$, where $\bm{f}$ has Jacobian $\mathbf{J}$ but is otherwise unspecified. Its derivative with respect to an element of $\mathbf{X}$ is

$$\frac{\mathrm{d}\bm{f}}{\mathrm{d}x_{ij}} = \mathbf{J} \begin{bmatrix} 0 & 0 & \cdots & 0 & \underbrace{a_j}_{i^{\text{th}}\text{ column}} & 0 & \cdots & 0 \end{bmatrix}^{\text{T}} = \left\{\mathbf{J}\right\}_i a_j.$$

For all of the above, the derivative with respect to the entire matrix $\mathbf{X}$ is just the collection of these matrices for all $i$ and $j$.
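Here is a minimal numerical sketch of the last identity, assuming NumPy and taking $\bm{f} = \tanh$ applied elementwise (so its Jacobian is diagonal); the choice of $(i,j)$ is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(6)
m, n = 4, 3
X = rng.standard_normal((m, n))
a = rng.standard_normal(n)
b = rng.standard_normal(m)

u = X @ a + b                               # the argument of f
J = np.diag(1.0 - np.tanh(u) ** 2)          # Jacobian of elementwise tanh at u

# check df/dx_ij = {J}_i a_j for one (arbitrary) choice of i, j
i, j, eps = 2, 1, 1e-6
Xp = X.copy()
Xp[i, j] += eps
fd = (np.tanh(Xp @ a + b) - np.tanh(u)) / eps
print(np.allclose(fd, J[:, i] * a[j], atol=1e-4))
```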

Some applications of the chain rule to matrix derivatives.

Now suppose we want to take a derivative with respect to $\mathbf{X}$ of the scalar-valued function $y(\mathbf{F}(\mathbf{X}))$, for various matrix-valued functions $\mathbf{F}(\cdot)$. We shall consider in particular those lately worked out. The chain rule here says that the $(i,j)^{\text{th}}$ element of this matrix is:

$$\frac{\mathrm{d}y(\mathbf{F}(\mathbf{X}))}{\mathrm{d}x_{ij}} = \sum_{kl}\frac{\mathrm{d}y}{\mathrm{d}F_{kl}}\frac{\mathrm{d}F_{kl}}{\mathrm{d}x_{ij}} = \bm{1}^{\text{T}}\left(\frac{\mathrm{d}y}{\mathrm{d}\mathbf{F}} \circ \frac{\mathrm{d}\mathbf{F}}{\mathrm{d}x_{ij}}\right)\bm{1},$$

with $\bm{1}$ a vector of ones and $\circ$ the entry-wise (Hadamard) product. We now apply this equation to the results above:

$$\frac{\mathrm{d}y(\mathbf{A}\mathbf{X})}{\mathrm{d}x_{ij}} = \bm{1}^{\text{T}}\left(\frac{\mathrm{d}y}{\mathrm{d}(\mathbf{A}\mathbf{X})} \circ \frac{\mathrm{d}(\mathbf{A}\mathbf{X})}{\mathrm{d}x_{ij}}\right)\bm{1} = \left\{\frac{\mathrm{d}y}{\mathrm{d}(\mathbf{A}\mathbf{X})}\right\}_j^{\text{T}}\bm{a}_i \implies \frac{\mathrm{d}y(\mathbf{A}\mathbf{X})}{\mathrm{d}\mathbf{X}} = \mathbf{A}^{\text{T}}\frac{\mathrm{d}y}{\mathrm{d}(\mathbf{A}\mathbf{X})}, \tag{B.3}$$

$$\frac{\mathrm{d}y(\mathbf{X}\mathbf{A})}{\mathrm{d}x_{ij}} = \bm{1}^{\text{T}}\left(\frac{\mathrm{d}y}{\mathrm{d}(\mathbf{X}\mathbf{A})} \circ \frac{\mathrm{d}(\mathbf{X}\mathbf{A})}{\mathrm{d}x_{ij}}\right)\bm{1} \implies \frac{\mathrm{d}y(\mathbf{X}\mathbf{A})}{\mathrm{d}\mathbf{X}} = \frac{\mathrm{d}y}{\mathrm{d}(\mathbf{X}\mathbf{A})}\mathbf{A}^{\text{T}}, \tag{B.4}$$

$$\frac{\mathrm{d}y(\mathbf{A}\mathbf{X}\mathbf{B}^{\text{T}})}{\mathrm{d}x_{ij}} = \bm{1}^{\text{T}}\left(\frac{\mathrm{d}y}{\mathrm{d}(\mathbf{A}\mathbf{X}\mathbf{B}^{\text{T}})} \circ \frac{\mathrm{d}(\mathbf{A}\mathbf{X}\mathbf{B}^{\text{T}})}{\mathrm{d}x_{ij}}\right)\bm{1} \implies \frac{\mathrm{d}y(\mathbf{A}\mathbf{X}\mathbf{B}^{\text{T}})}{\mathrm{d}\mathbf{X}} = \mathbf{A}^{\text{T}}\frac{\mathrm{d}y}{\mathrm{d}(\mathbf{A}\mathbf{X}\mathbf{B}^{\text{T}})}\mathbf{B}, \tag{B.5}$$

$$\frac{\mathrm{d}y(\mathbf{X}\mathbf{A}\mathbf{X}^{\text{T}})}{\mathrm{d}x_{ij}} = \bm{1}^{\text{T}}\left(\frac{\mathrm{d}y}{\mathrm{d}(\mathbf{X}\mathbf{A}\mathbf{X}^{\text{T}})} \circ \frac{\mathrm{d}(\mathbf{X}\mathbf{A}\mathbf{X}^{\text{T}})}{\mathrm{d}x_{ij}}\right)\bm{1} \implies \frac{\mathrm{d}y(\mathbf{X}\mathbf{A}\mathbf{X}^{\text{T}})}{\mathrm{d}\mathbf{X}} = \frac{\mathrm{d}y}{\mathrm{d}(\mathbf{X}\mathbf{A}\mathbf{X}^{\text{T}})}\mathbf{X}\mathbf{A}^{\text{T}} + \left(\frac{\mathrm{d}y}{\mathrm{d}(\mathbf{X}\mathbf{A}\mathbf{X}^{\text{T}})}\right)^{\text{T}}\mathbf{X}\mathbf{A}. \tag{B.6}$$
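Of these, Eq. B.6 is the easiest to get wrong, so a numerical check may be welcome. A minimal NumPy sketch, taking $y(\mathbf{F}) = \sum_{kl}\sin(F_{kl})$ (our own choice), so that $\mathrm{d}y/\mathrm{d}\mathbf{F} = \cos(\mathbf{F})$ elementwise:

```python
import numpy as np

rng = np.random.default_rng(7)
m, n = 4, 3
X = rng.standard_normal((m, n))
A = rng.standard_normal((n, n))

y = lambda F: np.sum(np.sin(F))          # a scalar function of a matrix; dy/dF = cos(F)
F = lambda X: X @ A @ X.T

# finite-difference dy(F(X))/dX
eps = 1e-6
fd = np.zeros((m, n))
for i in range(m):
    for j in range(n):
        Xp = X.copy()
        Xp[i, j] += eps
        fd[i, j] = (y(F(Xp)) - y(F(X))) / eps

G = np.cos(F(X))                          # dy/dF evaluated at F(X)
analytic = G @ X @ A.T + G.T @ X @ A      # Eq. B.6
print(np.allclose(fd, analytic, atol=1e-4))
```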

For the vector-valued function $\bm{f}(\mathbf{X}\bm{a}+\bm{b}) =: \bm{z}$ considered above,

$$\frac{\mathrm{d}y(\bm{f})}{\mathrm{d}x_{ij}} = \bm{1}^{\text{T}}\left(\frac{\mathrm{d}y}{\mathrm{d}\bm{z}} \circ \frac{\mathrm{d}\bm{f}}{\mathrm{d}x_{ij}}\right)\bm{1} = \frac{\mathrm{d}y}{\mathrm{d}\bm{z}^{\text{T}}}\left\{\mathbf{J}\right\}_i a_j \implies \frac{\mathrm{d}y(\bm{f})}{\mathrm{d}\mathbf{X}} = \mathbf{J}^{\text{T}}\frac{\mathrm{d}y}{\mathrm{d}\bm{z}}\bm{a}^{\text{T}}. \tag{B.7}$$
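Eq. B.7 is essentially the backpropagation rule for the weight matrix of a single layer. A minimal NumPy sketch (the nonlinearity, readout, and dimensions are our own choices):

```python
import numpy as np

rng = np.random.default_rng(8)
m, n = 4, 3
X = rng.standard_normal((m, n))            # "weights"
a = rng.standard_normal(n)                 # "input"
b = rng.standard_normal(m)                 # "bias"

u = X @ a + b                              # pre-activation (the argument of f)
z = np.tanh(u)                             # z = f(X a + b)
y = lambda z: 0.5 * np.sum(z ** 2)         # scalar readout; dy/dz = z

J = np.diag(1.0 - np.tanh(u) ** 2)         # Jacobian of f = tanh at u
analytic = np.outer(J.T @ z, a)            # Eq. B.7: J^T (dy/dz) a^T

# finite-difference check, entry by entry
eps = 1e-6
fd = np.zeros((m, n))
for i in range(m):
    for j in range(n):
        Xp = X.copy()
        Xp[i, j] += eps
        fd[i, j] = (y(np.tanh(Xp @ a + b)) - y(z)) / eps
print(np.allclose(fd, analytic, atol=1e-4))
```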

A few special cases are interesting. Since the trace and the derivative are linear operators, they commute, and in particular

$$\frac{\mathrm{d}}{\mathrm{d}\mathbf{X}}\text{tr}\left[\mathbf{X}\right] = \mathbf{I}.$$

Therefore, letting $y(\mathbf{X}) \stackrel{\text{set}}{=} \text{tr}\left[\mathbf{X}\right]$ in the above equations, we have

$$\frac{\mathrm{d}\,\text{tr}\left[\mathbf{A}\mathbf{X}\right]}{\mathrm{d}\mathbf{X}} = \frac{\mathrm{d}\,\text{tr}\left[\mathbf{X}\mathbf{A}\right]}{\mathrm{d}\mathbf{X}} = \mathbf{A}^{\text{T}}, \tag{B.8}$$

$$\frac{\mathrm{d}\,\text{tr}\left[\mathbf{A}\mathbf{X}\mathbf{B}^{\text{T}}\right]}{\mathrm{d}\mathbf{X}} = \mathbf{A}^{\text{T}}\mathbf{B}, \tag{B.9}$$

$$\frac{\mathrm{d}\,\text{tr}\left[\mathbf{X}\mathbf{A}\mathbf{X}^{\text{T}}\right]}{\mathrm{d}\mathbf{X}} = \mathbf{X}\mathbf{A}^{\text{T}} + \mathbf{X}\mathbf{A}. \tag{B.10}$$
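Eq. B.10, for example, can be confirmed numerically in a few lines (a NumPy sketch; the dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(9)
n = 4
X = rng.standard_normal((n, n))
A = rng.standard_normal((n, n))

# finite-difference derivative of tr[X A X^T] with respect to each entry of X
eps = 1e-6
fd = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        Xp = X.copy()
        Xp[i, j] += eps
        fd[i, j] = (np.trace(Xp @ A @ Xp.T) - np.trace(X @ A @ X.T)) / eps

print(np.allclose(fd, X @ A.T + X @ A, atol=1e-4))   # Eq. B.10
```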

B.1.3 More useful identities

Now that we have defined derivatives (of scalars and vectors) with respect to vectors, and derivatives (of scalars) with respect to matrices, we can derive the following useful identities.

The derivative of the log-determinant.

The trace and determinant of a matrix are related by a useful formula, derived through their respective relationships with the matrix’s spectrum. Recall:

$$\text{tr}\left[\mathbf{M}\right] = \sum_i \lambda_i, \qquad\qquad |\mathbf{M}| = \prod_i \lambda_i.$$

For each eigenvalue $\lambda$ and associated eigenvector $\bm{v}$, $\mathbf{M}\bm{v} = \lambda\bm{v}$, so:

$$\begin{split}
\exp\{\mathbf{M}\}\bm{v} &= \left(\mathbf{I} + \mathbf{M} + \frac{\mathbf{M}^2}{2!} + \frac{\mathbf{M}^3}{3!} + \cdots\right)\bm{v} \\
&= \bm{v} + \lambda\bm{v} + \frac{\lambda^2}{2!}\bm{v} + \frac{\lambda^3}{3!}\bm{v} + \cdots \\
&= e^{\lambda}\bm{v}.
\end{split}$$

Therefore, the eigenvectors of $\exp\{\mathbf{M}\}$ are the eigenvectors of $\mathbf{M}$, and the eigenvalues of $\exp\{\mathbf{M}\}$ are the exponentiated eigenvalues of $\mathbf{M}$. Hence:

$$\exp\{\text{tr}\left[\mathbf{M}\right]\} = \exp\bigg\{\sum_i \lambda_i\bigg\} = \prod_i \exp\{\lambda_i\} = |\exp\{\mathbf{M}\}|. \tag{B.11}$$

Therefore:

$$\begin{split}
\frac{\mathrm{d}}{\mathrm{d}x}\log|\mathbf{A}(x)| &= \frac{1}{|\mathbf{A}(x)|}\frac{\mathrm{d}}{\mathrm{d}x}|\mathbf{A}(x)| \\
\mathbf{M} := \log\mathbf{A} \implies \quad &= \frac{1}{|\mathbf{A}(x)|}\frac{\mathrm{d}}{\mathrm{d}x}\left|\exp\left\{\mathbf{M}(x)\right\}\right| \\
\text{Eq.~B.11} \implies \quad &= \frac{1}{|\mathbf{A}(x)|}\frac{\mathrm{d}}{\mathrm{d}x}\exp\left\{\text{tr}\left[\mathbf{M}(x)\right]\right\} \\
&= \frac{1}{|\mathbf{A}(x)|}\exp\left\{\text{tr}\left[\mathbf{M}(x)\right]\right\}\frac{\mathrm{d}}{\mathrm{d}x}\text{tr}\left[\mathbf{M}(x)\right] \\
&= \frac{1}{|\mathbf{A}(x)|}\left|\exp\left\{\mathbf{M}(x)\right\}\right|\,\text{tr}\left[\frac{\mathrm{d}}{\mathrm{d}x}\mathbf{M}(x)\right] \\
&= \frac{1}{|\mathbf{A}(x)|}|\mathbf{A}(x)|\,\text{tr}\left[\frac{\mathrm{d}}{\mathrm{d}x}\log\mathbf{A}(x)\right] \\
&= \text{tr}\left[\mathbf{A}^{-1}\frac{\mathrm{d}\mathbf{A}}{\mathrm{d}x}\right].
\end{split}$$

So far we have made use only of results from scalar calculus. (The derivative of the log of $\mathbf{A}(x)$ can be derived easily in terms of the Maclaurin series for the natural logarithm.)
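The final identity is easily checked numerically for a matrix-valued function of our own choosing, $\mathbf{A}(x) = \mathbf{A}_0 + x\mathbf{B}$ (NumPy assumed; $\mathbf{A}_0$ is made positive definite so that the determinant stays positive):

```python
import numpy as np

rng = np.random.default_rng(10)
n = 4
B0 = rng.standard_normal((n, n))
A0 = B0 @ B0.T + n * np.eye(n)             # positive definite
B = rng.standard_normal((n, n))

A = lambda x: A0 + x * B                   # matrix-valued function of the scalar x; dA/dx = B
x0, eps = 0.3, 1e-6

lhs = (np.log(np.linalg.det(A(x0 + eps))) - np.log(np.linalg.det(A(x0)))) / eps
rhs = np.trace(np.linalg.solve(A(x0), B))  # tr[A^{-1} dA/dx]
print(np.isclose(lhs, rhs, atol=1e-4))
```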

The derivative of the log-determinant with respect to a matrix.

Cf. the “trace trick” in, e.g., IPGM. When $f = \log|\mathbf{A}|$ (i.e., the log of the determinant of $\mathbf{A}$),

$$\frac{\mathrm{d}\log|\mathbf{A}|}{\mathrm{d}\mathbf{A}} = \mathbf{A}^{-\text{T}},$$

i.e. the inverse transpose. (This can be derived from the “interesting scalar case” below.) From this it follows easily that

$$\frac{\mathrm{d}|\mathbf{A}|}{\mathrm{d}\mathbf{A}} = |\mathbf{A}|\,\mathbf{A}^{-\text{T}}.$$
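Both identities are easy to confirm numerically (a minimal NumPy sketch; the positive-definite construction of $\mathbf{A}$ is our own choice, made so that $\log|\mathbf{A}|$ is well defined):

```python
import numpy as np

rng = np.random.default_rng(11)
n = 4
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)               # positive definite, so log|A| is well defined

# finite-difference derivative of log|A| with respect to each entry of A
eps = 1e-6
fd = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        Ap = A.copy()
        Ap[i, j] += eps
        fd[i, j] = (np.log(np.linalg.det(Ap)) - np.log(np.linalg.det(A))) / eps

# d log|A| / dA = A^{-T}; multiplying through by |A| gives d|A|/dA = |A| A^{-T}
print(np.allclose(fd, np.linalg.inv(A).T, atol=1e-4))
```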