Table of Contents
- 1. Linear Algebra
- 1.1. Scalars
- 1.2. Vectors
- 1.3. Length, Dimensionality, and Shape
- 1.4. Matrices
- 1.5. Tensors / NDArrays
- 1.6. Basic Properties of Tensor Arithmetic
- 1.7. Reduction
- 1.8. Dot Products
- 1.9. Matrix-Vector Products
- 1.10. Matrix-Matrix Multiplication
- 1.11. Norms
- 1.12. Norms and Objectives
- 1.13. More on Linear Algebra
- 1.14. Summary
1. Linear Algebra
Now that you can store and manipulate data, let us briefly review the subset of basic linear algebra that you will need to understand and implement most of models covered in this book. Below, we introduce the basic mathematical objects, arithmetic, and operations in linear algebra, expressing each of them through mathematical notation and the corresponding implementation in code.
(ns clj-d2l.linear-algebra (:require [clj-djl.ndarray :as nd] [clj-d2l.core :as d2l] [clojure.java.io :as io]))
1.1. Scalars
If you never studied linear algebra or machine learning, then your past experience with math probably consisted of thinking about one number at a time. And, if you ever balanced a checkbook or even paid for dinner at a restaurant then you already know how to do basic things like adding and multiplying pairs of numbers. For example, the temperature in Palo Alto is \(52\) degrees Fahrenheit. Formally, we call values consisting of just one numerical quantity scalars. If you wanted to convert this value to Celsius (the metric system’s more sensible temperature scale), you would evaluate the expression \(c=\frac{5}{9}(f−32)\), setting \(f\) to \(52\). In this equation, each of the terms — \(5\), \(9\), and \(32\) — are scalar values. The placeholders \(c\) and \(f\) are called variables and they represent unknown scalar values. We denote the space of all (continuous) real-valued scalars by \(\mathbb{R}\).
In this book, we adopt the mathematical notation where scalar variables are denoted by ordinary lower-cased letters (e.g., \(x\), \(y\), and \(z\)). For expedience, we will punt on rigorous definitions of what precisely space is, but just remember for now that the expression \(x \in \mathbb{R}\) is a formal way to say that \(x\) is a real-valued scalar. The symbol \(\in\) can be pronounced “in” and simply denotes membership in a set. Analogously, we could write \(x, y \in \{0,1\}\) to state that \(x\) and \(y\) are numbers whose value can only be \(0\) or \(1\).
A scalar is represented by a NDArray with just one element. In the next snippet, we instantiate two scalars and perform some familiar arithmetic operations with them, namely addition, multiplication, division, and exponentiation.
(def ndm (nd/base-manager)) (def x (nd/create ndm 3.)) (def y (nd/create ndm 2.))
(nd/+ x y)
ND: () cpu() float64 5.
(nd/* x y)
ND: () cpu() float64 6.
(nd// x y)
ND: () cpu() float64 1.5
(nd/pow x y)
ND: () cpu() float64 9.
1.2. Vectors
You can think of a vector as simply a list of scalar values. We call these values the elements (entries or components) of the vector. When our vectors represent examples from our dataset, their values hold some real-world significance. For example, if we were training a model to predict the risk that a loan defaults, we might associate each applicant with a vector whose components correspond to their income, length of employment, number of previous defaults, and other factors. If we were studying the risk of heart attacks hospital patients potentially face, we might represent each patient by a vector whose components capture their most recent vital signs, cholesterol levels, minutes of exercise per day, etc. In math notation, we will usually denote vectors as bold-faced, lower-cased letters (e.g., \(\mathbf{x}\), \(\mathbf{y}\), and \(\mathbf{z}\).
We work with vectors via one-dimensional NDArrays. In general NDArrays can have arbitrary lengths, subject to the memory limits of your machine.
(def x (nd/arange ndm 4.)) x
ND: (4) cpu() float32 [0., 1., 2., 3.]
We can refer to any element of a vector by using a subscript. For example, we can refer to the ith element of \(\mathbf{x}\) by \(x_i\). Note that the element \(x_i\) is a scalar, so we do not bold-face the font when referring to it. Extensive literature considers column vectors to be the default orientation of vectors, so does this book. In math, a vector \(\mathbf{x}\) can be written as
\begin{equation} \mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}, \end{equation}where \(x_1, \ldots, x_n\) are elements of the vector. In code, we access any element by indexing into the NDArray.
(nd/get x 3)
ND: () cpu() float32 3.
1.3. Length, Dimensionality, and Shape
Let us revisit some concepts from Section 2.1. A vector is just an array of numbers. And just as every array has a length, so does every vector. In math notation, if we want to say that a vector \(\mathbf{x}\) consists of \(n\) real-valued scalars, we can express this as \(\mathbf{x} \in \mathbb{R}^n\). The length of a vector is commonly called the dimension of the vector.
As with an ordinary Java array, we can access the length. In the case
of a NDArray we can achieve this by using the (size ndarray 0)
function, where \(0\) means the axis-0.
(nd/size x)
4
When a NDArray represents a vector (with precisely one axis), we can
also access its length via the shape
function. The shape lists the
length (dimensionality) along each axis of the NDArray. For NDArrays
with just one axis, the shape has just one element.
(nd/shape x)
(4)
or we can use get-shape
function:
(nd/get-shape x)
(4)
Note that the word “dimension” tends to get overloaded in these contexts and this tends to confuse people. To clarify, we use the dimensionality of a vector or an axis to refer to its length, i.e., the number of elements of a vector or an axis. However, we use the dimensionality of a NDArray to refer to the number of axes that a NDArray has. In this sense, the dimensionality of some axis of a NDArray will be the length of that axis.
1.4. Matrices
Just as vectors generalize scalars from order zero to order one, matrices generalize vectors from order one to order two. Matrices, which we will typically denote with bold-faced, capital letters (e.g., \(\mathbf{X}\), \(\mathbf{Y}\), and \(\mathbf{Z}\)), are represented in code as NDArray with two axes.
In math notation, we use \(\mathbf{A} \in \mathbb{R}^{m \times n}\) to express that the matrix \(\mathbf{A}\) consists of \(m\) rows and \(n\) columns of real-valued scalars. Visually, we can illustrate any matrix \(\mathbf{A} \in \mathbb{R}^{m \times n}\) as a table, where each element \(a_{ij}\) belongs to the \(i\)th row and \(j\)th column:
\begin{equation} \mathbf{A}= \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \\ \end{bmatrix}. \end{equation}For any \(\mathbf{A} \in \mathbb{R}^{m \times n}\), the shape of \(\mathbf{A}\) is \((m, n)\) or \(m \times n\). Specifically, when a matrix has the same number of rows and columns, its shape becomes a square; thus, it is called a square matrix.
We can create an \(m \times n\) matrix by specifying a shape with two components \(m\) and \(n\) when calling any of our favorite functions for instantiating a NDArray.
(def A (-> (nd/arange ndm 20.) (nd/reshape 5 4))) A
ND: (5, 4) cpu() float32 [[ 0., 1., 2., 3.], [ 4., 5., 6., 7.], [ 8., 9., 10., 11.], [12., 13., 14., 15.], [16., 17., 18., 19.], ]
We can access the scalar element \(a_{ij}\) of a matrix \(\mathbf{A}\) in (2.3.2) by specifying the indices for the row (\(i\)) and column (\(j\)), such as \([\mathbf{A}]_{ij}\). When the scalar elements of a matrix \(\mathbf{A}\), such as in (2.3.2), are not given, we may simply use the lower-case letter of the matrix \(\mathbf{A}\) with the index subscript, \(a_{ij}\), to refer to \([\mathbf{A}]_{ij}\). To keep notation simple, commas are inserted to separate indices only when necessary, such as \(a_{2,3j}\) and \([\mathbf{A}]_{2i−1,3}\).
Sometimes, we want to flip the axes. When we exchange a matrix’s rows and columns, the result is called the transpose of the matrix. Formally, we signify a matrix \(\mathbf{A}\)’s transpose by \(\mathbf{A}^\top\) and if \(\mathbf{B}=\mathbf{A}^\top\), then \(b_{ij}=a_{ji}\) for any \(i\) and \(j\). Thus, the transpose of \(\mathbf{A}\) in (2.3.2) is a \(n \times m\) matrix:
\begin{equation} \mathbf{A}^\top = \begin{bmatrix} a_{11} & a_{21} & \dots & a_{m1} \\ a_{12} & a_{22} & \dots & a_{m2} \\ \vdots & \vdots & \ddots & \vdots \\ a_{1n} & a_{2n} & \dots & a_{mn} \end{bmatrix}. \end{equation}Now we access a matrix’s transpose in code.
(nd/transpose A)
ND: (4, 5) cpu() float32 [[ 0., 4., 8., 12., 16.], [ 1., 5., 9., 13., 17.], [ 2., 6., 10., 14., 18.], [ 3., 7., 11., 15., 19.], ]
There is also a simplified function for the same purpose:
(nd/t A)
ND: (4, 5) cpu() float32 [[ 0., 4., 8., 12., 16.], [ 1., 5., 9., 13., 17.], [ 2., 6., 10., 14., 18.], [ 3., 7., 11., 15., 19.], ]
As a special type of the square matrix, a symmetric matrix \(\mathbf{A}\) is equal to its transpose: \(\mathbf{A}=\mathbf{A}^\top\). Here we define a symmetric matrix \(\mathbf{B}\).
(def B (nd/create ndm [[1 2 3] [2 0 4] [3 4 5]])) B
ND: (3, 3) cpu() int64 [[ 1, 2, 3], [ 2, 0, 4], [ 3, 4, 5], ]
Now we compare \(\mathbf{B}\) with its transpose.
(nd/= B (nd/t B))
ND: (3, 3) cpu() boolean [[ true, true, true], [ true, true, true], [ true, true, true], ]
Matrices are useful data structures: they allow us to organize data that have different modalities of variation. For example, rows in our matrix might correspond to different houses (data examples), while columns might correspond to different attributes. This should sound familiar if you have ever used spreadsheet software or have read Section 2.2. Thus, although the default orientation of a single vector is a column vector, in a matrix that represents a tabular dataset, it is more conventional to treat each data example as a row vector in the matrix. And, as we will see in later chapters, this convention will enable common deep learning practices. For example, along the outermost axis of a NDArray, we can access or enumerate minibatches of data examples, or just data examples if no minibatch exists.
1.5. Tensors / NDArrays
Just as vectors generalize scalars, and matrices generalize vectors, we can build data structures with even more axes. NDArrays (“NDArrays” in this subsection refer to algebraic objects) give us a generic way of describing $n$-dimensional arrays with an arbitrary number of axes. Vectors, for example, are first-order NDArrays, and matrices are second-order NDArrays. NDArrays are denoted with capital letters of a special font face (e.g., \(\mathbf{X}\), \(\mathbf{Y}\), and \(\mathbf{Z}\)) and their indexing mechanism (e.g., \(x_{ijk}\) and \([X]_{1,2i−1,3}\)) is similar to that of matrices.
NDArrays will become more important when we start working with images, which arrive as $n$-dimensional arrays with 3 axes corresponding to the height, width, and a channel axis for stacking the color channels (red, green, and blue). For now, we will skip over higher order NDArrays and focus on the basics.
(def X (-> (nd/arange ndm 24.) (nd/reshape 2 3 4))) X
ND: (2, 3, 4) cpu() float32 [[[ 0., 1., 2., 3.], [ 4., 5., 6., 7.], [ 8., 9., 10., 11.], ], [[12., 13., 14., 15.], [16., 17., 18., 19.], [20., 21., 22., 23.], ], ]
1.6. Basic Properties of Tensor Arithmetic
Scalars, vectors, matrices, and NDArrays (“NDArrays” in this subsection refer to algebraic objects) of an arbitrary number of axes have some nice properties that often come in handy. For example, you might have noticed from the definition of an elementwise operation that any elementwise unary operation does not change the shape of its operand. Similarly, given any two NDArrays with the same shape, the result of any binary elementwise operation will be a NDArray of that same shape. For example, adding two matrices of the same shape performs elementwise addition over these two matrices.
(def A (-> (nd/arange ndm 20.) (nd/reshape 5 4))) (def B (nd/duplicate A))
A
ND: (5, 4) cpu() float32 [[ 0., 1., 2., 3.], [ 4., 5., 6., 7.], [ 8., 9., 10., 11.], [12., 13., 14., 15.], [16., 17., 18., 19.], ]
B
ND: (5, 4) cpu() float32 [[ 0., 1., 2., 3.], [ 4., 5., 6., 7.], [ 8., 9., 10., 11.], [12., 13., 14., 15.], [16., 17., 18., 19.], ]
(nd/+ A B)
ND: (5, 4) cpu() float32 [[ 0., 2., 4., 6.], [ 8., 10., 12., 14.], [16., 18., 20., 22.], [24., 26., 28., 30.], [32., 34., 36., 38.], ]
Specifically, elementwise multiplication of two matrices is called their Hadamard product (math notation \(\odot\)). Consider matrix \(\mathbf{B} \in \mathbb{R}^{m \times n}\) whose element of row \(i\) and column \(j\) is \(b_{ij}\). The Hadamard product of matrices \(\mathbf{A}\) (defined in (2.3.2)) and \(\mathbf{B}\)
\begin{equation} \mathbf{A} \odot \mathbf{B} = \begin{bmatrix} a_{11} b_{11} & a_{12} b_{12} & \dots & a_{1n} b_{1n} \\ a_{21} b_{21} & a_{22} b_{22} & \dots & a_{2n} b_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} b_{m1} & a_{m2} b_{m2} & \dots & a_{mn} b_{mn} \end{bmatrix}. \end{equation}(nd/* A B)
ND: (5, 4) cpu() float32 [[ 0., 1., 4., 9.], [ 16., 25., 36., 49.], [ 64., 81., 100., 121.], [144., 169., 196., 225.], [256., 289., 324., 361.], ]
Multiplying or adding a NDArray by a scalar also does not change the shape of the NDArray, where each element of the operand NDArray will be added or multiplied by the scalar.
(def a 2) (def X (-> (nd/arange ndm 24.) (nd/reshape 2 3 4))) (nd/+ X a)
ND: (2, 3, 4) cpu() float32 [[[ 2., 3., 4., 5.], [ 6., 7., 8., 9.], [10., 11., 12., 13.], ], [[14., 15., 16., 17.], [18., 19., 20., 21.], [22., 23., 24., 25.], ], ]
(nd/shape (nd/* X a))
(2, 3, 4)
1.7. Reduction
One useful operation that we can perform with arbitrary NDArrays is to calculate the sum of their elements. In mathematical notation, we express sums using the \(\sum\) symbol. To express the sum of the elements in a vector \(x\) of length \(d\), we write \(\sum^d_{i=1} x_i\). In code, we can just call the function for calculating the sum.
(def x (nd/arange ndm 4.)) x
ND: (4) cpu() float32 [0., 1., 2., 3.]
(nd/sum x)
ND: () cpu() float32 6.
We can express sums over the elements of NDArrays of arbitrary shape. For example, the sum of the elements of an \(m \times n\) matrix \(\mathbf{A}\) could be written \(\sum^m_{i=1} \sum^n_{j=1} a_{ij}\).
(nd/shape A)
(5, 4)
(nd/sum A)
ND: () cpu() float32 190.
By default, invoking the function for calculating the sum reduces a
NDArray along all its axes to a scalar. We can also specify the axes
along which the NDArray is reduced via summation. Take matrices as an
example. To reduce the row dimension (axis 0) by summing up elements
of all the rows, we specify [0]
when invoking the function. Since the
input matrix reduces along axis 0 to generate the output vector, the
dimension of axis 0 of the input is lost in the output shape.
(def A-sum-axis0 (nd/sum A [0])) A-sum-axis0
ND: (4) cpu() float32 [40., 45., 50., 55.]
(nd/shape A-sum-axis0)
(4)
Specifying [1]
will reduce the column dimension (axis 1) by summing up
elements of all the columns. Thus, the dimension of axis 1 of the
input is lost in the output shape.
(def A-sum-axis1 (nd/sum A [1])) A-sum-axis1
ND: (5) cpu() float32 [ 6., 22., 38., 54., 70.]
(nd/shape A-sum-axis1)
(5)
Reducing a matrix along both rows and columns via summation is equivalent to summing up all the elements of the matrix.
(nd/sum A [0 1])
ND: () cpu() float32 190.
A related quantity is the mean, which is also called the average. We calculate the mean by dividing the sum by the total number of elements. In code, we could just call the function for calculating the mean on NDArrays of arbitrary shape.
(nd/mean A)
ND: () cpu() float32 9.5
(nd// (nd/sum A) (nd/size A))
ND: () cpu() float32 9.5
Likewise, the function for calculating the mean can also reduce a NDArray along the specified axes.
(nd/mean A [0])
ND: (4) cpu() float32 [ 8., 9., 10., 11.]
(nd/shape A)
(5, 4)
(nd// (nd/sum A [0]) (nd/get (nd/shape A) 0))
ND: (4) cpu() float32 [ 8., 9., 10., 11.]
1.7.1. Non-Reduction Sum
However, sometimes it can be useful to keep the number of axes unchanged when invoking the function for calculating the sum or mean.
(def sum-A (nd/sum A [1] true)) sum-A
ND: (5, 1) cpu() float32 [[ 6.], [22.], [38.], [54.], [70.], ]
For instance, since sum-A
still keeps its two axes after summing each
row, we can divide A
by sum-A
with broadcasting.
(nd// A sum-A)
ND: (5, 4) cpu() float32 [[0. , 0.1667, 0.3333, 0.5 ], [0.1818, 0.2273, 0.2727, 0.3182], [0.2105, 0.2368, 0.2632, 0.2895], [0.2222, 0.2407, 0.2593, 0.2778], [0.2286, 0.2429, 0.2571, 0.2714], ]
If we want to calculate the cumulative sum of elements of A along some
axis, say axis 0 (row by row), we can call the cumsum
function. This
function will not reduce the input NDArray along any axis.
A
ND: (5, 4) cpu() float32 [[ 0., 1., 2., 3.], [ 4., 5., 6., 7.], [ 8., 9., 10., 11.], [12., 13., 14., 15.], [16., 17., 18., 19.], ]
(nd/cumsum A 0)
ND: (5, 4) cpu() float32 [[ 0., 1., 2., 3.], [ 4., 6., 8., 10.], [12., 15., 18., 21.], [24., 28., 32., 36.], [40., 45., 50., 55.], ]
(nd/cumsum A 1)
ND: (5, 4) cpu() float32 [[ 0., 1., 3., 6.], [ 4., 9., 15., 22.], [ 8., 17., 27., 38.], [12., 25., 39., 54.], [16., 33., 51., 70.], ]
1.8. Dot Products
So far, we have only performed elementwise operations, sums, and averages. And if this was all we could do, linear algebra probably would not deserve its own section. However, one of the most fundamental operations is the dot product. Given two vectors \(x,y \in \mathbb{R}^d\), their dot product \(x^\top y\) (or \(\langle x,y \rangle\)) is a sum over the products of the elements at the same position: \(x^\top y = \sum^d_{i=1} x_i y_i\).
(def y (nd/ones ndm [4]))
#'clj-d2l.linear-algebra/y
x
ND: (4) cpu() float32 [0., 1., 2., 3.]
y
ND: (4) cpu() float32 [1., 1., 1., 1.]
(nd/dot x y)
ND: () cpu() float32 6.
Note that we can express the dot product of two vectors equivalently by performing an elementwise multiplication and then a sum:
(nd/sum (nd/* x y))
ND: () cpu() float32 6.
Dot products are useful in a wide range of contexts. For example, given some set of values, denoted by a vector \(x \in \mathbb{R}^d\) and a set of weights denoted by \(w \in \mathbb{R}^d\), the weighted sum of the values in \(x\) according to the weights \(w\) could be expressed as the dot product \(x^\top w\). When the weights are non-negative and sum to one (i.e., (\(\sum^d_{i=1} w_i = 1\))), the dot product expresses a weighted average. After normalizing two vectors to have the unit length, the dot products express the cosine of the angle between them. We will formally introduce this notion of length later in this section.
1.9. Matrix-Vector Products
Now that we know how to calculate dot products, we can begin to understand matrix-vector products. Recall the matrix \(\mathbf{A} \in \mathbb{R}^{m \times n}\) and the vector \(x \in \mathbb{R}^n\) defined and visualized in (2.3.2) and (2.3.1) respectively. Let us start off by visualizing the matrix \(\mathbf{A}\) in terms of its row vectors
\begin{equation} \mathbf{A}= \begin{bmatrix} \mathbf{a}^\top_{1} \\ \mathbf{a}^\top_{2} \\ \vdots \\ \mathbf{a}^\top_m \\ \end{bmatrix}, \end{equation}where each \(\mathbf{a}^\top_i \in \mathbb{R}^n\) is a row vector representing the \(i\)th row of the matrix \(\mathbf{A}\). The matrix-vector product \(\mathbf{A}\mathbf{x}\) is simply a column vector of length \(m\), whose \(i\)th element is the dot product \(\mathbf{a}^top_i \mathbf{x}\):
\begin{equation} \mathbf{A}\mathbf{x} = \begin{bmatrix} \mathbf{a}^\top_{1} \\ \mathbf{a}^\top_{2} \\ \vdots \\ \mathbf{a}^\top_m \\ \end{bmatrix}\mathbf{x} = \begin{bmatrix} \mathbf{a}^\top_{1} \mathbf{x} \\ \mathbf{a}^\top_{2} \mathbf{x} \\ \vdots\\ \mathbf{a}^\top_{m} \mathbf{x}\\ \end{bmatrix}. \end{equation}We can think of multiplication by a matrix \(\mathbf{A} \in \mathbb{R}^{m \times n}\) as a transformation that projects vectors from \(\mathbb{R}^n\) to \(\mathbb{R}^m\). These transformations turn out to be remarkably useful. For example, we can represent rotations as multiplications by a square matrix. As we will see in subsequent chapters, we can also use matrix-vector products to describe the most intensive calculations required when computing each layer in a neural network given the values of the previous layer.
Expressing matrix-vector products in code with NDArrays, we use the
same dot function as for dot products. When we call (nd/dot A x)
with
a matrix \(\mathbf{A}\) and a vector \(\mathbf{x}\), the matrix-vector
product is performed. Note that the column dimension of \(\mathbf{A}\)
(its length along axis 1) must be the same as the dimension of
\(\mathbf{x}\) (its length).
(nd/shape A)
(5, 4)
(nd/shape x)
(4)
(nd/dot A x)
ND: (5) cpu() float32 [ 14., 38., 62., 86., 110.]
1.10. Matrix-Matrix Multiplication
If you have gotten the hang of dot products and matrix-vector products, then matrix-matrix multiplication should be straightforward.
Say that we have two matrices \(\mathbf{A} \in \mathbb{R}^{n \times k}\) and \(\mathbf{B} \in \mathbb{R}^{k \times m}\):
\begin{equation} \mathbf{A}=\begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1k} \\ a_{21} & a_{22} & \cdots & a_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nk} \\ \end{bmatrix},\quad \mathbf{B}=\begin{bmatrix} b_{11} & b_{12} & \cdots & b_{1m} \\ b_{21} & b_{22} & \cdots & b_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ b_{k1} & b_{k2} & \cdots & b_{km} \\ \end{bmatrix}. \end{equation}Denote by \(\mathbf{a}^\top_i \in \mathbb{R}^k\) the row vector representing the \(i\)th row of the matrix \(\mathbf{A}\), and let \(\mathbf{b}_j \in \mathbb{R}^k\) be the column vector from the \(j\)th column of the matrix \(\mathbf{B}\). To produce the matrix product \(\mathbf{C}=\mathbf{AB}\), it is easiest to think of \(\mathbf{A}\) in terms of its row vectors and \(\mathbf{B}\) in terms of its column vectors:
\begin{equation} \mathbf{A}= \begin{bmatrix} \mathbf{a}^\top_{1} \\ \mathbf{a}^\top_{2} \\ \vdots \\ \mathbf{a}^\top_n \\ \end{bmatrix}, \quad \mathbf{B}=\begin{bmatrix} \mathbf{b}_{1} & \mathbf{b}_{2} & \cdots & \mathbf{b}_{m} \\ \end{bmatrix}. \end{equation}Then the matrix product \(\mathbf{C} \in \mathbb{R}^{n \times m}\) is produced as we simply compute each element \(c_{ij}\) as the dot product \(\mathbf{a}^\top_i \mathbf{b}_j\):
\begin{equation} \mathbf{C} = \mathbf{AB} = \begin{bmatrix} \mathbf{a}^\top_{1} \\ \mathbf{a}^\top_{2} \\ \vdots \\ \mathbf{a}^\top_n \\ \end{bmatrix} \begin{bmatrix} \mathbf{b}_{1} & \mathbf{b}_{2} & \cdots & \mathbf{b}_{m} \\ \end{bmatrix} = \begin{bmatrix} \mathbf{a}^\top_{1} \mathbf{b}_1 & \mathbf{a}^\top_{1}\mathbf{b}_2& \cdots & \mathbf{a}^\top_{1} \mathbf{b}_m \\ \mathbf{a}^\top_{2}\mathbf{b}_1 & \mathbf{a}^\top_{2} \mathbf{b}_2 & \cdots & \mathbf{a}^\top_{2} \mathbf{b}_m \\ \vdots & \vdots & \ddots &\vdots\\ \mathbf{a}^\top_{n} \mathbf{b}_1 & \mathbf{a}^\top_{n}\mathbf{b}_2& \cdots& \mathbf{a}^\top_{n} \mathbf{b}_m \end{bmatrix}. \end{equation}We can think of the matrix-matrix multiplication \(\mathbf{AB}\) as simply performing \(m\) matrix-vector products and stitching the results together to form an \(n \times m\) matrix. In the following snippet, we perform matrix multiplication on \(\mathbf{A}\) and \(\mathbf{B}\). Here, \(\mathbf{A}\) is a matrix with 5 rows and 4 columns, and \(\mathbf{B}\) is a matrix with 4 rows and 3 columns. After multiplication, we obtain a matrix with 5 rows and 3 columns.
(def B (nd/ones ndm [4 3])) B
ND: (4, 3) cpu() float32 [[1., 1., 1.], [1., 1., 1.], [1., 1., 1.], [1., 1., 1.], ]
A
ND: (5, 4) cpu() float32 [[ 0., 1., 2., 3.], [ 4., 5., 6., 7.], [ 8., 9., 10., 11.], [12., 13., 14., 15.], [16., 17., 18., 19.], ]
(nd/dot A B)
ND: (5, 3) cpu() float32 [[ 6., 6., 6.], [22., 22., 22.], [38., 38., 38.], [54., 54., 54.], [70., 70., 70.], ]
Matrix-matrix multiplication can be simply called matrix multiplication, and should not be confused with the Hadamard product.
1.11. Norms
Some of the most useful operators in linear algebra are norms. Informally, the norm of a vector tells us how big a vector is. The notion of size under consideration here concerns not dimensionality but rather the magnitude of the components.
In linear algebra, a vector norm is a function \(f\) that maps a vector to a scalar, satisfying a handful of properties. Given any vector \(\mathbf{x}\), the first property says that if we scale all the elements of a vector by a constant factor \(\alpha\), its norm also scales by the absolute value of the same constant factor:
\begin{equation} f(\alpha \mathbf{x}) = |\alpha| f(\mathbf{x}). \end{equation}The second property is the familiar triangle inequality:
\begin{equation} f(\mathbf{x} + \mathbf{y}) \leq f(\mathbf{x}) + f(\mathbf{y}). \end{equation}The third property simply says that the norm must be non-negative:
\begin{equation} f(\mathbf{x}) \geq 0. \end{equation}That makes sense, as in most contexts the smallest size for anything is 0. The final property requires that the smallest norm is achieved and only achieved by a vector consisting of all zeros.
\begin{equation} \forall i, [\mathbf{x}]_i = 0 \Leftrightarrow f(\mathbf{x})=0. \end{equation}You might notice that norms sound a lot like measures of distance. And if you remember Euclidean distances (think Pythagoras’ theorem) from grade school, then the concepts of non-negativity and the triangle inequality might ring a bell. In fact, the Euclidean distance is a norm: specifically it is the \(L_2\) norm. Suppose that the elements in the $n$-dimensional vector \(\mathbf{x}\) are \(x_1, \ldots, x_n\). The \(L_2\) norm of \(\mathbf{x}\) is the square root of the sum of the squares of the vector elements:
\begin{equation} \|\mathbf{x}\|_2 = \sqrt{\sum_{i=1}^n x_i^2}, \end{equation}where the subscript \(2\) is often omitted in \(L_2\) norms, i.e., \(\|\mathbf{x}\|\) is equivalent to \(\|\mathbf{x}\|_2\). In code, we can calculate the \(L_2\) norm of a vector as follows.
(defn l2norm [ndarray] (-> ndarray (nd/pow 2) (nd/sum) (nd/sqrt)))
(def u (nd/create ndm [3. -4.])) u
ND: (2) cpu() float64 [ 3., -4.]
(l2norm u)
ND: () cpu() float64 5.
In deep learning, we work more often with the squared \(L_2\) norm. You will also frequently encounter the \(L_1\) norm, which is expressed as the sum of the absolute values of the vector elements:
\begin{equation} \|\mathbf{x}\|_1 = \sum_{i=1}^n \left|x_i \right|. \end{equation}As compared with the \(L_2\) norm, it is less influenced by outliers. To calculate the \(L_1\) norm, we compose the absolute value function with a sum over the elements.
(nd/sum (nd/abs u))
ND: () cpu() float64 7.
Both the \(L_2\) norm and the \(L_1\) norm are special cases of the more general \(L_p\) norm:
\begin{equation} \|\mathbf{x}\|_p = \left(\sum_{i=1}^n \left|x_i \right|^p \right)^{1/p}. \end{equation}Analogous to \(L_2\) norms of vectors, the Frobenius norm of a matrix \(\mathbf{X} \in \mathbb{R}^{m \times n}\) is the square root of the sum of the squares of the matrix elements:
\begin{equation} \|\mathbf{X}\|_F = \sqrt{\sum_{i=1}^m \sum_{j=1}^n x_{ij}^2}. \end{equation}The Frobenius norm satisfies all the properties of vector norms. It behaves as if it were an \(L_2\) norm of a matrix-shaped vector. Invoking the following function will calculate the Frobenius norm of a matrix.
(l2norm (nd/ones ndm [4 9]))
ND: () cpu() float32 6.
1.12. Norms and Objectives
While we do not want to get too far ahead of ourselves, we can plant some intuition already about why these concepts are useful. In deep learning, we are often trying to solve optimization problems: maximize the probability assigned to observed data; minimize the distance between predictions and the ground-truth observations. Assign vector representations to items (like words, products, or news articles) such that the distance between similar items is minimized, and the distance between dissimilar items is maximized. Oftentimes, the objectives, perhaps the most important components of deep learning algorithms (besides the data), are expressed as norms.
1.13. More on Linear Algebra
In just this section, we have taught you all the linear algebra that you will need to understand a remarkable chunk of modern deep learning. There is a lot more to linear algebra and a lot of that mathematics is useful for machine learning. For example, matrices can be decomposed into factors, and these decompositions can reveal low-dimensional structure in real-world datasets. There are entire subfields of machine learning that focus on using matrix decompositions and their generalizations to high-order NDArrays to discover structure in datasets and solve prediction problems. But this book focuses on deep learning. And we believe you will be much more inclined to learn more mathematics once you have gotten your hands dirty deploying useful machine learning models on real datasets. So while we reserve the right to introduce more mathematics much later on, we will wrap up this section here.
If you are eager to learn more about linear algebra, you may refer to either the online appendix on linear algebraic operations or other excellent resources [Strang, 1993][Kolter, 2008][Petersen et al., 2008].
1.14. Summary
- Scalars, vectors, matrices, and NDArrays are basic mathematical objects in linear algebra.
- Vectors generalize scalars, and matrices generalize vectors.
- Scalars, vectors, matrices, and NDArrays have zero, one, two, and an arbitrary number of axes, respectively.
- A NDArray can be reduced along the specified axes by sum and mean.
- Elementwise multiplication of two matrices is called their Hadamard product. It is different from matrix multiplication.
- In deep learning, we often work with norms such as the \(L_1\) norm, the \(L_2\) norm, and the Frobenius norm.
- We can perform a variety of operations over scalars, vectors, matrices, and NDArrays.