Matrix Derivatives#
In principle, we can compute matrix derivatives using partial derivatives and familiar rules like the chain rule. However, this process can be tedious and error-prone. Fortunately, most of the rules you know from one-dimensional calculus carry over to vector and matrix derivatives. Unlike in the one-dimensional case, though, the order of multiplication matters for matrix derivatives, because matrix multiplication is generally not commutative.
We’ll start by examining the dimensionalities involved in matrix derivatives. For a function \(f:\mathbb{R}^{n\times d}\rightarrow \mathbb{R}\) that maps matrices to real values, we can define its derivative in two ways: as the gradient or the Jacobian.
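A sketch of the two conventions, assuming the layout suggested by the surrounding text (the gradient keeps the shape of the input, and the Jacobian is its transpose):

```latex
\nabla f(X) = \begin{pmatrix}
\frac{\partial f(X)}{\partial X_{11}} & \cdots & \frac{\partial f(X)}{\partial X_{1d}}\\
\vdots & \ddots & \vdots\\
\frac{\partial f(X)}{\partial X_{n1}} & \cdots & \frac{\partial f(X)}{\partial X_{nd}}
\end{pmatrix}\in\mathbb{R}^{n\times d},
\qquad
\frac{\partial f(X)}{\partial X} = \nabla f(X)^\top \in\mathbb{R}^{d\times n}.
```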
You might notice that the Jacobian is the transpose of the gradient, and vice versa.
Derivatives for Real Valued Functions#
From the definition of matrix derivatives, we can also infer the definition of a vector derivative for a function \(f:\mathbb{R}^d\rightarrow \mathbb{R}\):
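A sketch of this special case, using the remark that the Jacobian is the transpose of the gradient: the gradient is a column vector, and the Jacobian is the corresponding row vector.

```latex
\nabla f(\vvec{x}) = \begin{pmatrix}
\frac{\partial f(\vvec{x})}{\partial x_1}\\ \vdots\\ \frac{\partial f(\vvec{x})}{\partial x_d}
\end{pmatrix}\in\mathbb{R}^{d},
\qquad
\frac{\partial f(\vvec{x})}{\partial \vvec{x}} = \nabla f(\vvec{x})^\top
= \begin{pmatrix}
\frac{\partial f(\vvec{x})}{\partial x_1} & \cdots & \frac{\partial f(\vvec{x})}{\partial x_d}
\end{pmatrix}\in\mathbb{R}^{1\times d}.
```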
Derivatives for Vector Valued Functions (from Real Values)#
If we have a function that maps to a vector space, then we can compute the partial derivatives for each coordinate of the function value. For example, if we have a function mapping from real values to the \(c\)-dimensional real-valued vector space
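then a sketch of the definition is: stack the derivatives of the component functions into a vector of the same shape as the output.

```latex
\vvec{f}:\mathbb{R}\rightarrow\mathbb{R}^{c},\quad
\vvec{f}(x) = \begin{pmatrix}f_1(x)\\ \vdots\\ f_c(x)\end{pmatrix},
\qquad
\frac{\partial \vvec{f}(x)}{\partial x}
= \begin{pmatrix}\frac{\partial f_1(x)}{\partial x}\\ \vdots\\ \frac{\partial f_c(x)}{\partial x}\end{pmatrix}
\in\mathbb{R}^{c}.
```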
Note that the Jacobian now preserves the dimensionality of the function's output: function values lie in \(\mathbb{R}^c\), and so does the Jacobian.
Derivatives for Functions Mapping from a Vector Space to a Vector Space#
We can define the derivatives for a function \(\vvec{f}:\mathbb{R}^d\rightarrow \mathbb{R}^{c}\) from a vector space to a vector space:
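A sketch of the definition, assuming the layout where outputs index the rows and inputs index the columns (consistent with the Jacobian of \(\vvec{f}:\mathbb{R}\rightarrow\mathbb{R}^c\) being a vector in \(\mathbb{R}^c\)):

```latex
\frac{\partial \vvec{f}(\vvec{x})}{\partial \vvec{x}} =
\begin{pmatrix}
\frac{\partial f_1(\vvec{x})}{\partial x_1} & \cdots & \frac{\partial f_1(\vvec{x})}{\partial x_d}\\
\vdots & \ddots & \vdots\\
\frac{\partial f_c(\vvec{x})}{\partial x_1} & \cdots & \frac{\partial f_c(\vvec{x})}{\partial x_d}
\end{pmatrix}\in\mathbb{R}^{c\times d},
```

that is, row \(i\) contains the transposed gradient of the component function \(f_i\).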
Of course, we could now consider further cases, such as a function mapping a matrix to a matrix. Unfortunately, from this point on it gets really complicated: there are multiple ways to define such derivatives, as tensors or as specifically structured matrices. We're going to keep it comparatively simple and circumvent these cases in this course.
We can now combine these derivatives using linearity and the chain rule for matrix derivatives.
Gradient and Jacobian Computation Rules#
Theorem 10 (The Jacobian is linear)
For any function whose Jacobian is defined as a matrix of partial derivatives \(\frac{\partial \vvec{f}(\vvec{x})}{\partial \vvec{x}} = \begin{pmatrix}\frac{\partial f_j(\vvec{x})}{\partial x_i}\end{pmatrix}_{i,j}\) for some indexes \(i,j\), the Jacobian is linear:
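A sketch of the stated identity, for scalars \(\alpha,\beta\in\mathbb{R}\):

```latex
\frac{\partial\,\bigl(\alpha\vvec{f}(\vvec{x}) + \beta\vvec{g}(\vvec{x})\bigr)}{\partial \vvec{x}}
= \alpha\,\frac{\partial \vvec{f}(\vvec{x})}{\partial \vvec{x}}
+ \beta\,\frac{\partial \vvec{g}(\vvec{x})}{\partial \vvec{x}}.
```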
Proof. The proof follows from the linearity of the partial derivatives:
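A sketch of the entry-wise argument, in the index notation \(\left(\frac{\partial f_j(\vvec{x})}{\partial x_i}\right)_{i,j}\) used for the Jacobian:

```latex
\left(\frac{\partial\,\bigl(\alpha\vvec{f}(\vvec{x})+\beta\vvec{g}(\vvec{x})\bigr)}{\partial \vvec{x}}\right)_{i,j}
= \frac{\partial\,\bigl(\alpha f_j(\vvec{x})+\beta g_j(\vvec{x})\bigr)}{\partial x_i}
= \alpha\,\frac{\partial f_j(\vvec{x})}{\partial x_i}
+ \beta\,\frac{\partial g_j(\vvec{x})}{\partial x_i}.
```

Since the identity holds for every entry, it holds for the whole Jacobian.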
Theorem 11 (Chain Rule for the Jacobian)
For any continuously differentiable functions \(\vvec{f}:\mathbb{R}^c\rightarrow \mathbb{R}^p\) and \(\vvec{g}:\mathbb{R}^d\rightarrow \mathbb{R}^c\), the Jacobian of the composition \(\vvec{f}\circ\vvec{g}\) is given by the chain rule:
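A sketch of the rule, assuming the layout where outputs index the rows of the Jacobian:

```latex
\frac{\partial\, \vvec{f}(\vvec{g}(\vvec{x}))}{\partial \vvec{x}}
= \frac{\partial \vvec{f}(\vvec{g}(\vvec{x}))}{\partial \vvec{g}(\vvec{x})}
\cdot \frac{\partial \vvec{g}(\vvec{x})}{\partial \vvec{x}}.
```

The dimensions compose as \((p\times c)\cdot(c\times d) = p\times d\); swapping the order of the factors is in general not even well-defined, which is why the order of multiplication matters.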
Theorem 12 (Jacobian of Element-wise Functions)
The gradient and the Jacobian of any element-wise defined function have a simple diagonal structure.
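A sketch of the stated form, assuming \(f\) is applied coordinate-wise:

```latex
\vvec{f}(\vvec{x}) = \begin{pmatrix}f(x_1)\\ \vdots\\ f(x_d)\end{pmatrix},
\qquad
\frac{\partial \vvec{f}(\vvec{x})}{\partial \vvec{x}}
= \begin{pmatrix}
f'(x_1) & & 0\\
 & \ddots & \\
0 & & f'(x_d)
\end{pmatrix}
= \mathrm{diag}\bigl(f'(x_1),\ldots,f'(x_d)\bigr).
```

The off-diagonal entries vanish because the \(i\)-th output \(f(x_i)\) depends only on the \(i\)-th input \(x_i\).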
Example 9
Consider the exponential function applied element-wise to a vector \(\vvec{x}\in\mathbb{R}^d\).
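By the element-wise rule, \(\vvec{f}(\vvec{x}) = (e^{x_1},\ldots,e^{x_d})^\top\) has the Jacobian \(\frac{\partial \vvec{f}(\vvec{x})}{\partial \vvec{x}} = \mathrm{diag}(e^{x_1},\ldots,e^{x_d})\). A quick numerical sanity check of this claim, sketched with NumPy and central finite differences (`jacobian_fd` is a helper written for this illustration, not part of the course material):

```python
import numpy as np

def jacobian_fd(f, x, eps=1e-6):
    """Approximate the Jacobian of f at x via central differences.

    Row i, column j holds d f_i / d x_j, matching the layout where
    outputs index the rows.
    """
    fx = np.asarray(f(x))
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        e = np.zeros(x.size)
        e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2 * eps)
    return J

x = np.array([0.0, 0.5, -1.0])

# Element-wise exponential: the Jacobian should be diag(exp(x)).
J_numeric = jacobian_fd(np.exp, x)
J_exact = np.diag(np.exp(x))

print(np.allclose(J_numeric, J_exact, atol=1e-5))  # True
```

In particular, the off-diagonal entries of the numerical Jacobian are (numerically) zero, confirming the diagonal structure.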