Vector Spaces#

Vector spaces are ubiquitous because they let us build an understanding of complex objects from simple pieces. The pieces are the vectors, which allow only two operations: addition and scalar multiplication. This might sound uneventful, but it leads to a rich description of geometric properties that are relevant for modeling real-world relationships, which in turn can be used to create some form of artificial intelligence. In the most general way, vector spaces are defined as follows.

Definition 1

A vector space over the real numbers is a set of vectors \(\mathcal{V}\) with two operations \(+\) and \(\cdot\) such that the following properties hold:

  • Addition: for \(\vvec{v},\vvec{w}\in\mathcal{V}\) we have \(\vvec{v}+\vvec{w}\in\mathcal{V}\). The set of vectors with the addition \((\mathcal{V},+)\) is an abelian group.

  • Scalar multiplication: for \(\alpha\in\mathbb{R}\) and \(\vvec{v}\in\mathcal{V}\), we have \(\alpha\vvec{v}\in\mathcal{V}\) such that the following properties hold:

    • \(\alpha(\beta\vvec{v}) = (\alpha\beta) \vvec{v}\) for \(\alpha,\beta\in\mathbb{R}\) and \(\vvec{v}\in\mathcal{V}\)

    • \(1\vvec{v}=\vvec{v}\) for \(\vvec{v}\in\mathcal{V}\)

  • Distributivity: the following properties hold:

    • \((\alpha + \beta)\vvec{v} = \alpha\vvec{v} +\beta \vvec{v}\) for \(\alpha,\beta\in\mathbb{R}\) and \(\vvec{v}\in\mathcal{V}\)

    • \(\alpha (\vvec{v}+\vvec{w})=\alpha \vvec{v}+\alpha\vvec{w}\) for \(\alpha\in\mathbb{R}\) and \(\vvec{v},\vvec{w}\in\mathcal{V}\)

This definition is quite technical, but it basically tells us that we can use the addition as we know it from the real numbers (that is the abelian group part). We can add and subtract vectors, and adding the zero vector doesn’t change anything. Likewise, we can multiply and divide a vector by a scalar. That is, for \(\alpha\in\mathbb{R}\) and \(\vvec{v},\vvec{w}\in\mathcal{V}\), the following operations are well-defined:

  • \(\vvec{v}/\alpha = \frac1\alpha \vvec{v}\) for \(\alpha\neq 0\)

  • \(\vvec{v}-\vvec{w}\)

However, the vector-vector product is not per se defined in a vector space, and likewise, you cannot divide by a vector. The following operations are not defined in a general vector space:

  • \(\vvec{v}\cdot \vvec{w}\)

  • \(\alpha/\vvec{v}\)
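As a concrete illustration (not part of the abstract definition), the well-defined operations can be sketched in NumPy with vectors from \(\mathbb{R}^3\):

```python
import numpy as np

v = np.array([1.0, 2.0, 3.0])
w = np.array([4.0, 5.0, 6.0])
alpha = 2.0

print(v + w)      # vector addition
print(v - w)      # subtraction (adding the additive inverse)
print(alpha * v)  # scalar multiplication
print(v / alpha)  # division by a nonzero scalar = multiplication by 1/alpha
```

Note that `v / w` would also run in NumPy (elementwise division), but this operation is a convenience of the library, not part of the vector space structure.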

The most canonical vector space is \(\mathbb{R}^d\).

Theorem 1 (\(\mathbb{R}^d\) is a vector space)

The elements of the space \(\mathbb{R}^d\) are defined as \(d\)-dimensional vectors

\[\begin{split}\vvec{v} = \begin{pmatrix}v_1\\\vdots\\v_d\end{pmatrix},\quad v_i\in\mathbb{R} \text{ for } 1\leq i \leq d, \end{split}\]
whose addition and the scalar multiplication are defined for \(\vvec{v},\vvec{w}\in\mathbb{R}^d\) and \(\alpha\in\mathbb{R}\) as follows:

\[\begin{align*} \vvec{v}+\vvec{w} = \begin{pmatrix}v_1+w_1\\\vdots\\v_d+w_d\end{pmatrix}, \qquad \alpha\vvec{v} = \begin{pmatrix}\alpha v_1\\\vdots\\\alpha v_d\end{pmatrix} \end{align*}\]

The space \(\mathbb{R}^d\) is a vector space.
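We can spot-check some of the vector space axioms numerically for sample vectors (a sanity check, not a proof):

```python
import numpy as np

rng = np.random.default_rng(0)
v, w = rng.random(4), rng.random(4)
alpha, beta = 2.0, -3.0

# Commutativity of addition: v + w = w + v
assert np.allclose(v + w, w + v)
# Distributivity: (alpha + beta) v = alpha v + beta v
assert np.allclose((alpha + beta) * v, alpha * v + beta * v)
# Compatibility of scalar multiplication: alpha (beta v) = (alpha beta) v
assert np.allclose(alpha * (beta * v), (alpha * beta) * v)
# Neutral element of scalar multiplication: 1 v = v
assert np.allclose(1 * v, v)
```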

A vector in \(\mathbb{R}^d\) can be visualized as an arrow from the origin to the point in space that it represents. Addition and scalar multiplication have visual interpretations as well, as indicated below.

Example 1 (The geometry of \(\mathbb{R}^2\))

The plot below shows four vectors:

\[\begin{align*} \vvec{v} &= \begin{pmatrix}0.5\\1.5\end{pmatrix} &\vvec{w} &= \begin{pmatrix}2\\0.5\end{pmatrix} &\vvec{v}+\vvec{w} &= \begin{pmatrix}2.5\\2\end{pmatrix} &2\vvec{v} &= \begin{pmatrix}1\\3\end{pmatrix} \end{align*}\]

Let’s first have a look at the vector \(\mathbf{v}\) and the scaled vector \(2\mathbf{v}\). Multiplying a vector by a scalar changes the vector’s length; in the plot below, we multiply by the factor \(2\). We can see that the vector \(2\mathbf{v}\) is twice as long as the vector \(\mathbf{v}\).

The addition of vectors can be visualized by a concatenation of the vectors. Below, we see vectors \(\mathbf{v}\) and \(\mathbf{w}\), and the sum of both vectors \(\mathbf{z}=\mathbf{v}+\mathbf{w}\). The result \(\mathbf{z}\) is the vector that we get when we start at the origin, then follow \(\mathbf{v}\) and then \(\mathbf{w}\). Alternatively, we can first follow \(\mathbf{w}\) and then \(\mathbf{v}\), because the vector addition is commutative.

Figure made with TikZ
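The original figure is drawn with TikZ; a rough matplotlib sketch of the same picture, assuming the vectors from the example above, could look like this:

```python
import numpy as np
import matplotlib.pyplot as plt

v = np.array([0.5, 1.5])
w = np.array([2.0, 0.5])

fig, ax = plt.subplots()
# Draw 2v, v, w and v+w as arrows from the origin
for vec, label, color in [(2 * v, "2v", "lightblue"), (v, "v", "blue"),
                          (w, "w", "green"), (v + w, "v+w", "red")]:
    ax.annotate("", xy=tuple(vec), xytext=(0, 0),
                arrowprops=dict(arrowstyle="->", color=color))
    ax.annotate(label, xy=tuple(vec), color=color)
ax.set_xlim(0, 3)
ax.set_ylim(0, 3.5)
plt.show()
```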

Matrix Spaces#

Next to the vector space \(\mathbb{R}^d\), there are other important vector spaces, such as the vector space of matrices \(\mathbb{R}^{n\times d}\). This vector space is relevant in machine learning because data is represented as a matrix: a data table of \(n\) observations of \(d\) features is represented by an \((n\times d)\) matrix.

| ID | \(\mathtt{x}_1\) | \(\mathtt{x}_2\) | \(\mathtt{x}_3\) | \(\ldots\) | \(\mathtt{x}_d\) |
|---|---|---|---|---|---|
| 1 | 5.1 | 3.5 | 1.4 | \(\ldots\) | 0.2 |
| 2 | 6.4 | 3.5 | 4.5 | \(\ldots\) | 1.2 |
| \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) |
| \(n\) | 5.9 | 3.0 | 5.0 | \(\ldots\) | 1.8 |

An \((n\times d)\) matrix concatenates \(d\) \(n\)-dimensional vectors column-wise (\(A_{\cdot j}\) denotes the column-vector \(j\) of \(A\))

\[\begin{align*} A= \begin{pmatrix} \vert & & \vert\\ A_{\cdot 1}&\ldots & A_{\cdot d}\\ \vert & & \vert \end{pmatrix} =\begin{pmatrix} A_{11 } & \ldots& A_{1d}\\ \vdots& & \vdots\\ A_{n1} &\ldots & A_{nd} \end{pmatrix} \end{align*}\]

Simultaneously, we can see a matrix as a concatenation of \(n\) row-vectors (\(A_{i\cdot}\) denotes the row-vector \(i\)):

\[\begin{align*} A= \begin{pmatrix} - & A_{1\cdot} & -\\ &\vdots & \\ - & A_{n\cdot} & - \end{pmatrix} =\begin{pmatrix} A_{11 } & \ldots& A_{1d}\\ \vdots& & \vdots\\ A_{n1} &\ldots & A_{nd} \end{pmatrix} \end{align*}\]

This notation is actually quite close to how we select rows and columns of matrices in Python. For example, consider the following matrix:

import numpy as np
A=np.array([[1,2,3],[4,5,6],[7,8,9]])
A
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In mathematical notation, the first column would be denoted as \(A_{\cdot 1}\). In Python we write

A[:,0]
array([1, 4, 7])

Note that the indices differ by convention: in mathematical notation we start with index 1, while the array indices in Python start with 0. But apart from that, both notations are quite close. The dot in the mathematical notation \(A_{\cdot 1}\) and the colon in the Python syntax A[:,0] indicate that we choose all row indices. Likewise, we can select the first row \(A_{1\cdot}\) in Python as follows:

A[0,:]
array([1, 2, 3])

Proposition 1 (The Vector Space \(\mathbb{R}^{n\times d}\))

The elements of the vector space \(\mathbb{R}^{n\times d}\) are \((n\times d)\)-dimensional matrices.

The addition between matrices and the scalar multiplication are defined for \(A,B\in\mathbb{R}^{n\times d}\) and \(\alpha\in\mathbb{R}\) as

\[\begin{align*} A+B &= \begin{pmatrix} A_{11 } + B_{11} & \ldots& A_{1d}+B_{1d}\\ \vdots& & \vdots\\ A_{n1}+B_{n1} &\ldots & A_{nd}+B_{nd} \end{pmatrix}\\ \alpha A &= \begin{pmatrix} \alpha A_{11 } & \ldots& \alpha A_{1d}\\ \vdots& & \vdots\\ \alpha A_{n1} &\ldots & \alpha A_{nd} \end{pmatrix} \end{align*}\]

Proof. All vector space axioms follow directly from the corresponding properties of real numbers, applied entrywise.
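The entrywise definitions correspond directly to NumPy’s elementwise operations on 2-D arrays, as a quick illustration with small sample matrices shows:

```python
import numpy as np

A = np.array([[1., 2.], [3., 4.]])
B = np.array([[5., 6.], [7., 8.]])

print(A + B)  # entrywise addition
print(2 * A)  # entrywise scalar multiplication
# The zero matrix is the neutral element of the addition
print(A + np.zeros((2, 2)))
```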

Matrix Operations#

In contrast to real numbers, matrices carry additional structure: their entries are arranged in rows and columns. The rows of a matrix can be turned into columns with the transpose operator.

Definition 2

The transpose of a matrix \(A\in\mathbb{R}^{n\times d}\) is the matrix \(A^\top\in\mathbb{R}^{d\times n}\) obtained by exchanging rows and columns.

\[\begin{align*} A&= \begin{pmatrix} \vert & & \vert\\ A_{\cdot 1}&\ldots & A_{\cdot d}\\ \vert & & \vert \end{pmatrix} &=\begin{pmatrix} A_{11 } & \ldots& A_{1d}\\ \vdots& & \vdots\\ A_{n1} &\ldots & A_{nd} \end{pmatrix}\in\mathbb{R}^{n\times d}\\ A^\top&= \begin{pmatrix} - & A_{\cdot 1}^\top & -\\ &\vdots & \\ -&A_{\cdot d}^\top &- \end{pmatrix} &=\begin{pmatrix} A_{11 } & \ldots& A_{n1}\\ \vdots& & \vdots\\ A_{1d} &\ldots & A_{nd} \end{pmatrix}\in\mathbb{R}^{d\times n} \end{align*}\]

The transpose of a \(d\)-dimensional vector can be interpreted as the transpose of a \((d\times 1)\) matrix:

\[\begin{align*} v&= \begin{pmatrix} v_1\\ \vdots\\ v_d\\ \end{pmatrix} &\in\mathbb{R}^{d\times 1}\\ v^\top&= \begin{pmatrix} v_1 & \ldots & v_d \end{pmatrix}&\in\mathbb{R}^{1\times d} \end{align*}\]
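In NumPy, this distinction becomes visible if we store a vector explicitly as a \((d\times 1)\) array, in contrast to the 1-D arrays used elsewhere in this chapter:

```python
import numpy as np

# A vector stored as an explicit (3 x 1) column matrix
v = np.array([[1], [2], [3]])
print(v.shape)    # (3, 1)
print(v.T)        # the corresponding (1 x 3) row vector
print(v.T.shape)  # (1, 3)
```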

The transpose of the transpose returns the original matrix.

Proposition 2

For any matrix \(A\in\mathbb{R}^{n\times d}\), we have

\[{(A^\top)}^\top = A.\]

A= np.array([[1 , 2 , 3],[4 , 5 , 6]])
A
array([[1, 2, 3],
       [4, 5, 6]])
A.T
array([[1, 4],
       [2, 5],
       [3, 6]])
A.T.T
array([[1, 2, 3],
       [4, 5, 6]])

Definition 3

A symmetric matrix is a matrix \(A\in\mathbb{R}^{n\times n}\) such that \(A^\top = A\).

A= np.array([[1 , 2 , 3],[2 , 4 , 5],[3, 5, 6]])
A
array([[1, 2, 3],
       [2, 4, 5],
       [3, 5, 6]])
A.T
array([[1, 2, 3],
       [2, 4, 5],
       [3, 5, 6]])

Definition 4

A diagonal matrix is a square matrix whose nonzero elements appear only on the diagonal (in particular, it is symmetric):

\[\begin{split}\diag(a_1,\ldots, a_n) = \begin{pmatrix}a_1 & 0 & \ldots & 0 \\ 0 & a_2 &\ldots & 0\\ & & \ddots &\\ 0 & 0 & \ldots & a_n\end{pmatrix}\end{split}\]

np.diag([1,2,3])
array([[1, 0, 0],
       [0, 2, 0],
       [0, 0, 3]])

Okay, great: we can add, scale and transpose matrices, and thereby data. That might seem uninteresting at first; those operations are not the most exciting ones. But the good news is that it gets more interesting once we establish the matrix product.

Vector Products#

Definition 5

The inner product of two vectors \(\vvec{v},\vvec{w}\in\mathbb{R}^d\) returns a scalar:

\[\begin{align*} \vvec{v}^\top\vvec{w} = \begin{pmatrix}v_1&\ldots & v_d\end{pmatrix}\begin{pmatrix}w_1\\\vdots\\w_d\end{pmatrix}=\sum_{i=1}^dv_iw_i \end{align*}\]

Definition 6

The outer product of two vectors \(\vvec{v}\in\mathbb{R}^d\) and \(\vvec{w}\in\mathbb{R}^n\) returns a (\(d\times n\)) matrix:

\[\begin{align*} \vvec{v}\vvec{w}^\top = \begin{pmatrix}v_1\\ \vdots \\ v_d\end{pmatrix}\begin{pmatrix}w_1 & \ldots &w_n\end{pmatrix} = \begin{pmatrix} v_1\vvec{w}^\top\\ \vdots\\ v_d \vvec{w}^\top \end{pmatrix} = \begin{pmatrix} v_1w_1 &\ldots & v_1w_n\\ \vdots & & \vdots\\ v_dw_1 &\ldots & v_dw_n \end{pmatrix} \end{align*}\]

The inner product compresses two vectors into a scalar, measuring similarity, while the outer product expands them into a matrix capturing all pairwise interactions.

Let’s see how we can compute these products in Python. We define two vectors \(\mathbf{v}\) and \(\mathbf{w}\).

v = np.array([1,2,3])
w = np.array([4,5,6])

We can compute the inner product with the inner or the dot function, or use the @ operator.

np.inner(v,w)
32
np.dot(v,w)
32
v.T@w
32

For computing the outer product, we can use the outer function.

np.outer(v,w)
array([[ 4,  5,  6],
       [ 8, 10, 12],
       [12, 15, 18]])

Using the @ operator doesn’t work here, because transposing a one-dimensional numpy array doesn’t change anything: v.T is still a 1-D array, so v@w.T computes the inner product.

v.shape, (v.T).shape
((3,), (3,))
v@w.T
32
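One possible workaround: if we reshape the 1-D arrays into explicit column and row matrices, the outer product can be written with the @ operator after all.

```python
import numpy as np

v = np.array([1, 2, 3])
w = np.array([4, 5, 6])

# Reshape v into a (3 x 1) column and w into a (1 x 3) row;
# the matrix product of the two is the outer product.
outer = v.reshape(-1, 1) @ w.reshape(1, -1)
print(outer)
# array([[ 4,  5,  6],
#        [ 8, 10, 12],
#        [12, 15, 18]])
```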

Matrix Products#

Definition 7

Given \(A\in \mathbb{R}^{n\times r}\) and \(B\in\mathbb{R}^{r\times d}\), the matrix product \(C=AB\in\mathbb{R}^{n\times d}\) is defined as

\[\begin{split}C= \begin{pmatrix} A_{1\cdot}B_{\cdot 1}&\ldots & A_{1\cdot}B_{\cdot d}\\ \vdots & & \vdots\\ A_{n\cdot}B_{\cdot 1}&\ldots & A_{n\cdot}B_{\cdot d} \end{pmatrix} = \begin{pmatrix}-&A_{1\cdot}&-\\&\vdots&\\-&A_{n\cdot}&-\end{pmatrix} \begin{pmatrix} \vert& & \vert\\ B_{\cdot 1}&\ldots&B_{\cdot d}\\\vert& & \vert \end{pmatrix}\end{split}\]

Every element \(C_{ji}\) is computed as the inner product of row \(j\) of \(A\) and column \(i\) of \(B\):

\[C_{ji}=A_{j\cdot}B_{\cdot i} = \sum_{s=1}^r A_{js}B_{si}\]
This is often referred to as the “row-times-column” rule.
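We can spot-check this entrywise formula against NumPy’s matrix product for a single entry, using hypothetical random sample matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((2, 3))
B = rng.random((3, 5))

j, i = 1, 2
# C_ji as the inner product of row j of A and column i of B
c_ji = sum(A[j, s] * B[s, i] for s in range(A.shape[1]))
assert np.isclose(c_ji, (A @ B)[j, i])
```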

Let’s compute a matrix product in Python. We define two matrices \(A\) and \(B\).

A = np.random.rand(2,3)
B = np.random.rand(3,5)

The simplest way to compute the matrix product is the @ operator.

A@B
array([[1.3, 0.8, 0.9, 1.1, 0.9],
       [0.8, 0.4, 0.5, 0.9, 0.5]])

The result of multiplying a \((2\times 3)\) matrix with a \((3\times 5)\) matrix is a \((2\times 5)\) matrix. We can get the dimensions of a matrix via the shape property.

(A@B).shape
(2, 5)

Proposition 3

Given \(A\in \mathbb{R}^{n\times r}\) and \(B\in\mathbb{R}^{r\times d}\), the matrix product \(C=AB\) can be written as a sum of outer products:

\[\begin{align*} C&=\sum_{s=1}^r \begin{pmatrix} A_{1 s}B_{s1} &\ldots & A_{1 s}B_{sd}\\ \vdots & & \vdots\\ A_{n s}B_{s1} &\ldots & A_{n s}B_{sd} \end{pmatrix} = \begin{pmatrix} \vert & & \vert\\ A_{\cdot 1}&\ldots & A_{\cdot r}\\ \vert & & \vert \end{pmatrix} \begin{pmatrix} - &B_{1\cdot } & -\\ & \vdots & \\ -& B_{r\cdot} & -\end{pmatrix} \\ &= \sum_{s=1}^r \begin{pmatrix} \vert\\ A_{\cdot s}\\ \vert \end{pmatrix} \begin{pmatrix} - & B_{s\cdot } & - \end{pmatrix}=\sum_{s=1}^r A_{\cdot s}B_{s \cdot} \end{align*}\]

This representation shows that matrix multiplication builds a matrix as a sum of rank-one components.
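The outer-product decomposition is easy to verify numerically, again with hypothetical random sample matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((4, 3))
B = rng.random((3, 5))

# Sum of r rank-one matrices: outer product of column s of A and row s of B
C = sum(np.outer(A[:, s], B[s, :]) for s in range(A.shape[1]))
assert np.allclose(C, A @ B)
```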

Definition 8

The identity matrix \(I\) is a diagonal matrix having only ones on the diagonal:

\[\begin{split}I_3 = \begin{pmatrix} 1 & 0 & 0\\ 0 & 1 & 0\\ 0& 0& 1 \end{pmatrix}\end{split}\]

Proposition 4

Given \(A\in\mathbb{R}^{n\times d}\), with \(I_n\) the \((n\times n)\) identity matrix and \(I_d\) the \((d\times d)\) identity matrix, we have

\[I_n A = A = AI_d \]

In Python, we can create the identity matrix with the eye function.

I = np.eye(3)
I
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

Multiplying \(A\) with the identity doesn’t change the matrix \(A\).

print("AI=",A@I,"\nA=", A)
AI= [[0.3 0.7 0.7]
 [0.3 0.3 0.6]] 
A= [[0.3 0.7 0.7]
 [0.3 0.3 0.6]]

Proposition 5

We have for \(A\in \mathbb{R}^{n\times r}\), \(B\in\mathbb{R}^{r\times d}\) and \(C=AB\)

\[\begin{align*} C^\top= B^\top A^\top \end{align*}\]

Proof. The proof follows from the definitions of the transpose and the matrix product, together with the fact that an inner product is a scalar and hence equals its own transpose: \(A_{j\cdot}B_{\cdot i} = (A_{j\cdot}B_{\cdot i})^\top = B_{\cdot i}^\top A_{j\cdot}^\top\).

\[\begin{align*} C^\top&= \begin{pmatrix} A_{1\cdot}B_{\cdot 1}&\ldots & A_{1\cdot}B_{\cdot d}\\ \vdots & & \vdots\\ A_{n\cdot}B_{\cdot 1}&\ldots & A_{n\cdot}B_{\cdot d} \end{pmatrix}^\top = \begin{pmatrix} A_{1\cdot}B_{\cdot 1}&\ldots & A_{n\cdot}B_{\cdot 1}\\ \vdots & & \vdots\\ A_{1\cdot}B_{\cdot d}&\ldots & A_{n\cdot}B_{\cdot d} \end{pmatrix} \\ &= \begin{pmatrix} B_{\cdot 1}^\top A_{1\cdot}^\top&\ldots & B_{\cdot 1}^\top A_{n\cdot}^\top\\ \vdots & & \vdots\\ B_{\cdot d}^\top A_{1\cdot}^\top &\ldots & B_{\cdot d}^\top A_{n\cdot}^\top \end{pmatrix}= B^\top A^\top \end{align*}\]
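A quick numerical check of the identity \((AB)^\top = B^\top A^\top\), with hypothetical sample matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((2, 3))
B = rng.random((3, 4))

# (AB)^T equals B^T A^T
assert np.allclose((A @ B).T, B.T @ A.T)
# Note that A.T @ B.T would not even have compatible shapes here:
# A.T is (3 x 2) and B.T is (4 x 3).
```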

Now, if we can multiply matrices, can we also divide by them? Only sometimes: namely, if the matrix has an inverse.

Definition 9

The inverse matrix to a matrix \(A\in\mathbb{R}^{n\times n}\) is a matrix \(A^{-1}\) satisfying

\[AA^{-1} = A^{-1}A = I\]
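For a general invertible matrix, NumPy computes the inverse with np.linalg.inv; a small sanity check with a hypothetical \((2\times 2)\) example:

```python
import numpy as np

A = np.array([[2., 1.], [1., 1.]])
A_inv = np.linalg.inv(A)

# Both products recover the identity (up to floating point error)
assert np.allclose(A @ A_inv, np.eye(2))
assert np.allclose(A_inv @ A, np.eye(2))
```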

Proposition 6

A diagonal matrix with nonzero diagonal entries is invertible, and its inverse is obtained by inverting each diagonal entry. That is, given a vector \(\mathbf{v}\in\mathbb{R}^n\) where \(v_i\neq 0\) for all \(1\leq i \leq n\), then

\[\diag(\mathbf{v})^{-1} = \diag\left(\frac{1}{v_1},\ldots, \frac{1}{v_n}\right)\]

For example, we can easily check that the diagonal matrix having the numbers \(1,2,3\) on the diagonal has as inverse the diagonal matrix with fractions \(1,\frac12,\frac13\) on the diagonal. If we multiply both matrices, then we get the identity.

\[\begin{align*}\begin{pmatrix} 1& 0 & 0\\ 0 & 2 & 0\\ 0 & 0 & 3\end{pmatrix} \begin{pmatrix}1& 0 & 0\\ 0 & \frac12 & 0\\ 0 & 0 & \frac13\end{pmatrix} = I\end{align*}\]

In Python, we can define diagonal matrices with the diag function.

A = np.diag([1,2,3])
np.linalg.inv(A)
array([[1.        , 0.        , 0.        ],
       [0.        , 0.5       , 0.        ],
       [0.        , 0.        , 0.33333333]])

So far, we have developed a set of operations that allow us to combine and transform vectors and matrices in a systematic way. These constructions are not just algebraic conveniences—they form the foundation for many practical applications.

In particular, machine learning models are typically expressed in matrix notation, where data and parameters are organized as matrices and vectors, and learning corresponds to applying sequences of linear transformations. The main advantage of formalizing operations in matrix notation is that matrix multiplication is highly parallelizable; it can hence be computed fast and accelerated with Graphics Processing Units (GPUs).

Let’s have a look at the time needed to compute a matrix multiplication the naive way, compared with the numpy implementation.

import time

def matrix_mult(A,B):
    # naive triple-loop implementation of the matrix product
    C = np.zeros((A.shape[0],B.shape[1]))
    for i in range(A.shape[0]):
        for j in range(B.shape[1]):
            for s in range(A.shape[1]):
                C[i,j] += A[i,s]*B[s,j]
    return C

A = np.random.rand(200,100)
B = np.random.rand(100,300)
startTime = time.time()
matrix_mult(A,B)
executionTime = (time.time() - startTime)
print('Execution time of our naive implementation in seconds: ' + str(executionTime))
startTime = time.time()
A@B
executionTime = (time.time() - startTime)
print('Execution time of the numpy implementation in seconds: ' + str(executionTime))
Execution time of our naive implementation in seconds: 1.7978038787841797
Execution time of the numpy implementation in seconds: 0.013770103454589844

The numpy implementation is more than 100 times faster!