CLJ D2L

Notation

The notation used throughout this book is summarized below.

Numbers

  • \(x\): A scalar
  • \(\mathbf{x}\): A vector
  • \(\mathbf{X}\): A matrix
  • \(\mathsf{X}\): A tensor
  • \(\mathbf{I}\): An identity matrix
  • \(x_i\), \([\mathbf{x}]_i\): The \(i^\mathrm{th}\) element of vector \(\mathbf{x}\)
  • \(x_{ij}\), \([\mathbf{X}]_{ij}\): The element of matrix \(\mathbf{X}\) at row \(i\) and column \(j\)

Set Theory

  • \(\mathcal{X}\): A set
  • \(\mathbb{Z}\): The set of integers
  • \(\mathbb{R}\): The set of real numbers
  • \(\mathbb{R}^n\): The set of \(n\)-dimensional vectors of real numbers
  • \(\mathbb{R}^{a\times b}\): The set of matrices of real numbers with \(a\) rows and \(b\) columns
  • \(\mathcal{A}\cup\mathcal{B}\): Union of sets \(\mathcal{A}\) and \(\mathcal{B}\)
  • \(\mathcal{A}\cap\mathcal{B}\): Intersection of sets \(\mathcal{A}\) and \(\mathcal{B}\)
  • \(\mathcal{A}\setminus\mathcal{B}\): Subtraction of set \(\mathcal{B}\) from set \(\mathcal{A}\)

Functions and Operators

  • \(f(\cdot)\): A function
  • \(\log(\cdot)\): The natural logarithm
  • \(\exp(\cdot)\): The exponential function
  • \(\mathbf{1}_\mathcal{X}\): The indicator function
  • \(\mathbf{(\cdot)}^\top\): Transpose of a vector or a matrix
  • \(\mathbf{X}^{-1}\): Inverse of matrix \(\mathbf{X}\)
  • \(\odot\): Hadamard (elementwise) product
  • \([\cdot, \cdot]\): Concatenation
  • \(\lvert \mathcal{X} \rvert\): Cardinality of set \(\mathcal{X}\)
  • \(\|\cdot\|_p\): \(\ell_p\) norm
  • \(\|\cdot\|\): \(\ell_2\) norm
  • \(\langle \mathbf{x}, \mathbf{y} \rangle\): Dot product of vectors \(\mathbf{x}\) and \(\mathbf{y}\)
  • \(\sum\): Series addition
  • \(\prod\): Series multiplication

Calculus

  • \(\frac{dy}{dx}\): Derivative of \(y\) with respect to \(x\)
  • \(\frac{\partial y}{\partial x}\): Partial derivative of \(y\) with respect to \(x\)
  • \(\nabla_{\mathbf{x}} y\): Gradient of \(y\) with respect to \(\mathbf{x}\)
  • \(\int_a^b f(x) \;dx\): Definite integral of \(f\) from \(a\) to \(b\) with respect to \(x\)
  • \(\int f(x) \;dx\): Indefinite integral of \(f\) with respect to \(x\)

Probability and Information Theory

  • \(P(\cdot)\): Probability distribution
  • \(z \sim P\): Random variable \(z\) has probability distribution \(P\)
  • \(P(X \mid Y)\): Conditional probability of \(X\) given \(Y\)
  • \(p(x)\): Probability density function
  • \(E_x[f(x)]\): Expectation of \(f\) with respect to \(x\)
  • \(X \perp Y\): Random variables \(X\) and \(Y\) are independent
  • \(X \perp Y \mid Z\): Random variables \(X\) and \(Y\) are conditionally independent given random variable \(Z\)
  • \(\mathrm{Var}(X)\): Variance of random variable \(X\)
  • \(\sigma_X\): Standard deviation of random variable \(X\)
  • \(\mathrm{Cov}(X, Y)\): Covariance of random variables \(X\) and \(Y\)
  • \(\rho(X, Y)\): Correlation of random variables \(X\) and \(Y\)
  • \(H(X)\): Entropy of random variable \(X\)
  • \(D_{\mathrm{KL}}(P\|Q)\): KL-divergence of distributions \(P\) and \(Q\)

Complexity

  • \(\mathcal{O}\): Big O notation

1. Introduction

2. Preliminaries

To get started with deep learning, we will need to develop a few basic skills. All machine learning is concerned with extracting information from data. So we will begin by learning the practical skills for storing, manipulating, and preprocessing data.

Moreover, machine learning typically requires working with large datasets, which we can think of as tables, where the rows correspond to examples and the columns correspond to attributes. Linear algebra gives us a powerful set of techniques for working with tabular data. We will not go too far into the weeds but rather focus on the basics of matrix operations and their implementation.

Additionally, deep learning is all about optimization. We have a model with some parameters and we want to find those that fit our data the best. Determining which way to move each parameter at each step of an algorithm requires a little bit of calculus, which will be briefly introduced. Fortunately, the autograd package computes derivatives for us automatically, and we will cover it next.

Next, machine learning is concerned with making predictions: what is the likely value of some unknown attribute, given the information that we observe? To reason rigorously under uncertainty we will need to invoke the language of probability.

In the end, the official documentation provides plenty of descriptions and examples that are beyond this book. To conclude the chapter, we will show you how to look up documentation for the needed information.

This book has kept the mathematical content to the minimum necessary to get a proper understanding of deep learning. However, that does not mean this book is free of mathematics. Thus, this chapter provides a rapid introduction to basic and frequently-used mathematics to allow anyone to understand at least most of the mathematical content of the book. If you wish to understand all of the mathematical content, further reviewing the online appendix on mathematics should be sufficient.

2.1. Data Manipulation

In order to get anything done, we need some way to store and manipulate data. Generally, there are two important things we need to do with data: (i) acquire them; and (ii) process them once they are inside the computer. There is no point in acquiring data without some way to store it, so let us get our hands dirty first by playing with synthetic data. To start, we introduce the \(n\)-dimensional array, which is also called the ndarray.

If you have worked with NumPy, the most widely-used scientific computing package in Python, then you will find this section familiar. No matter which framework you use, its tensor class (ndarray in MXNet, DJL and clj-djl, Tensor in both PyTorch and TensorFlow) is similar to NumPy’s ndarray, with a few killer features. First, GPU computation is well supported to accelerate calculations, whereas NumPy supports only CPU computation. Second, the tensor class supports automatic differentiation. These properties make the tensor class suitable for deep learning. Throughout the book, when we say ndarrays, we are referring to instances of the ndarray class unless otherwise stated.

2.1.1. Getting Started

In this section, we aim to get you up and running, equipping you with the basic math and numerical computing tools that you will build on as you progress through the book. Do not worry if you struggle to grok some of the mathematical concepts or library functions. The following sections will revisit this material in the context of practical examples and it will sink in. On the other hand, if you already have some background and want to go deeper into the mathematical content, just skip this section.

To start, we import the ndarray namespace from clj-djl. Here, the ndarray namespace includes functions supported by clj-djl.

(ns clj-d2l.data-manipulation
  (:require [clj-djl.ndarray :as nd]
            [clj-d2l.core :as d2l]))

An ndarray represents a (possibly multi-dimensional) array of numerical values. With one axis, an ndarray corresponds (in math) to a vector. With two axes, an ndarray corresponds to a matrix. NDArrays with more than two axes do not have special mathematical names.

To start, we can use arange to create a row vector x containing the first 12 integers starting with 0. Each of the values in an ndarray is called an element of the ndarray. For instance, there are 12 elements in the ndarray x. Unless otherwise specified, a new ndarray will be stored in main memory and designated for CPU-based computation.

(def ndm (nd/base-manager))
(def x (nd/arange ndm 0 12))
x
ND: (12) cpu() int32
[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11]

We can access an ndarray’s shape (the length along each axis) by inspecting its shape property.

(nd/shape x)
(12)

If we just want to know the total number of elements in an ndarray, i.e., the product of all of the shape elements, we can inspect its size. Because we are dealing with a vector here, the single element of its shape is identical to its size. The difference is that shape returns a Shape object.

(nd/size x)
12
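
To see the difference, we can inspect the type of the object that shape returns (a quick check; the class is DJL's Shape):

(type (nd/shape x))
class ai.djl.ndarray.types.Shape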

To change the shape of an ndarray without altering either the number of elements or their values, we can invoke the reshape function. For example, we can transform our ndarray, x, from a row vector with shape (12) to a matrix with shape (3, 4). This new ndarray contains the exact same values, but views them as a matrix organized as 3 rows and 4 columns. To reiterate, although the shape has changed, the elements have not. Note that the size is unaltered by reshaping.

(def y (nd/reshape x [3 4]))
y
ND: (3, 4) cpu() int32
[[ 0,  1,  2,  3],
 [ 4,  5,  6,  7],
 [ 8,  9, 10, 11],
]

Reshaping by manually specifying every dimension is unnecessary. If our target shape is a matrix with shape (height, width), then after we know the width, the height is given implicitly. Why should we have to perform the division ourselves? In the example above, to get a matrix with 3 rows, we specified both that it should have 3 rows and 4 columns. Fortunately, ndarrays can automatically work out one dimension given the rest. We invoke this capability by placing -1 for the dimension that we would like ndarrays to infer automatically. In our case, instead of calling (nd/reshape x [3 4]), we could have equivalently called (nd/reshape x [-1 4]) or (nd/reshape x [3 -1]).

(def y (nd/reshape x [3 -1]))
y
ND: (3, 4) cpu() int32
[[ 0,  1,  2,  3],
 [ 4,  5,  6,  7],
 [ 8,  9, 10, 11],
]

Calling the create function with only a Shape grabs a chunk of memory and hands us back a matrix without bothering to change the value of any of its entries. This is remarkably efficient, but we must be careful because the entries might take arbitrary values, including very big ones!

(nd/create ndm (nd/shape [3 4]))
ND: (3, 4) cpu() float32
[[ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
 [ 1.21782351e+29,  0.00000000e+00,  1.21783258e+29,  0.00000000e+00],
 [ 1.19888078e+29,  0.00000000e+00,  1.21744119e+29,  0.00000000e+00],
]

Typically, we will want our matrices initialized either with zeros, ones, some other constants, or numbers randomly sampled from a specific distribution. We can create an ndarray representing a tensor with all elements set to 0 and a shape of [2 3 4] as follows:

(nd/zeros ndm [2 3 4])
ND: (2, 3, 4) cpu() float32
[[[0., 0., 0., 0.],
  [0., 0., 0., 0.],
  [0., 0., 0., 0.],
 ],
 [[0., 0., 0., 0.],
  [0., 0., 0., 0.],
  [0., 0., 0., 0.],
 ],
]

Similarly, we can create ndarrays with each element set to 1 as follows:

(nd/ones ndm [2 3 4])
ND: (2, 3, 4) cpu() float32
[[[1., 1., 1., 1.],
  [1., 1., 1., 1.],
  [1., 1., 1., 1.],
 ],
 [[1., 1., 1., 1.],
  [1., 1., 1., 1.],
  [1., 1., 1., 1.],
 ],
]

Often, we want to randomly sample the values for each element in an ndarray from some probability distribution. For example, when we construct arrays to serve as parameters in a neural network, we will typically initialize their values randomly. The following snippet creates an ndarray with shape (3, 4). Each of its elements is randomly sampled from a standard Gaussian (normal) distribution with a mean of 0 and a standard deviation of 1.

(nd/random-normal ndm 0 1 (nd/shape [3 4]))
ND: (3, 4) cpu() float32
[[ 1.1631,  2.2122,  0.4838,  0.774 ],
 [ 0.2996,  1.0434,  0.153 ,  1.1839],
 [-1.1688,  1.8917,  1.5581, -1.2347],
]

We can directly use a Clojure vec as the shape:

(nd/random-normal ndm 0 1 [3 4])
ND: (3, 4) cpu() float32
[[-0.5459, -1.771 , -2.3556, -0.4514],
 [ 0.5414,  0.5794,  2.6785, -1.8561],
 [ 1.2546, -1.9769, -0.5488, -0.208 ],
]

You can also just pass the shape and it will use default values for mean and standard deviation (0 and 1).

(nd/random-normal ndm [3 4])
ND: (3, 4) cpu() float32
[[-0.6811,  0.2444, -0.1353, -0.0372],
 [ 0.3772, -0.4877,  0.4102, -0.0226],
 [ 0.5713,  0.5746, -2.758 ,  1.4661],
]

We can also specify the exact values for each element in the desired ndarray by supplying a Clojure vec (or list) containing the numerical values. Here, the outermost list corresponds to axis 0, and the inner list to axis 1.

(nd/create ndm [2 1 4 3 1 2 3 4 4 3 2 1] [3 4])
ND: (3, 4) cpu() int64
[[ 2,  1,  4,  3],
 [ 1,  2,  3,  4],
 [ 4,  3,  2,  1],
]
(nd/create ndm [[2 1 4 3][1 2 3 4][4 3 2 1]])
ND: (3, 4) cpu() int64
[[ 2,  1,  4,  3],
 [ 1,  2,  3,  4],
 [ 4,  3,  2,  1],
]

2.1.2. Operations

This book is not about software engineering. Our interests are not limited to simply reading and writing data from/to arrays. We want to perform mathematical operations on those arrays. Some of the simplest and most useful operations are the elementwise operations. These apply a standard scalar operation to each element of an array. For functions that take two arrays as inputs, elementwise operations apply some standard binary operator on each pair of corresponding elements from the two arrays. We can create an elementwise function from any function that maps from a scalar to a scalar.

In mathematical notation, we would denote such a unary scalar operator (taking one input) by the signature \(f: \mathbb{R} \rightarrow \mathbb{R}\). This just means that the function is mapping from any real number (\(\mathbb{R}\)) onto another. Likewise, we denote a binary scalar operator (taking two real inputs, and yielding one output) by the signature \(f: \mathbb{R}, \mathbb{R} \rightarrow \mathbb{R}\). Given any two vectors \(\mathbf{u}\) and \(\mathbf{v}\) of the same shape, and a binary operator \(f\), we can produce a vector \(\mathbf{c} = F(\mathbf{u}, \mathbf{v})\) by setting \(c_i \gets f(u_i, v_i)\) for all \(i\), where \(c_i, u_i\), and \(v_i\) are the \(i^\mathrm{th}\) elements of vectors \(\mathbf{c}\), \(\mathbf{u}\), and \(\mathbf{v}\). Here, we produced the vector-valued \(F: \mathbb{R}^d, \mathbb{R}^d \rightarrow \mathbb{R}^d\) by lifting the scalar function to an elementwise vector operation.
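
To make the lifting idea concrete, here is a minimal sketch in plain Clojure that round-trips through a vector with nd/to-vec and nd/create (both used elsewhere in this chapter). Real frameworks implement lifting natively and far more efficiently, so the helper name lift-unary and the round-trip are purely illustrative:

(defn lift-unary
  "Lift a scalar function f to an elementwise operation on a 1-D ndarray."
  [f ndarray]
  (nd/create ndm (mapv f (nd/to-vec ndarray))))
(lift-unary #(* % %) (nd/create ndm [1. 2. 3.]))
ND: (3) cpu() float64
[1., 4., 9.]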

The common standard arithmetic operators (+, -, *, /) have all been lifted to elementwise operations for identically-shaped ndarrays of arbitrary shape. In the following example, each expression produces the result of an elementwise operation on two ndarrays of the same shape.

(def x (nd/create ndm [1. 2. 4. 8.]))
(def y (nd/create ndm [2. 2. 2. 2.]))
(nd/+ x y)
ND: (4) cpu() float64
[ 3.,  4.,  6., 10.]
(nd/- x y)
ND: (4) cpu() float64
[-1.,  0.,  2.,  6.]
(nd// x y)
ND: (4) cpu() float64
[0.5, 1. , 2. , 4. ]
(nd/pow x y)
ND: (4) cpu() float64
[ 1.,  4., 16., 64.]

Many more operations can be applied elementwise, including unary operators like exponentiation.

(nd/exp x)
ND: (4) cpu() float64
[ 2.71828183e+00,  7.38905610e+00,  5.45981500e+01,  2.98095799e+03]

In addition to elementwise computations, we can also perform linear algebra operations, including vector dot products and matrix multiplication. We will explain the crucial bits of linear algebra (with no assumed prior knowledge) in Section 2.3.

We can also concatenate multiple ndarrays together, stacking them end-to-end to form a larger ndarray. We just need to provide a list of ndarrays and tell the system along which axis to concatenate. The example below shows what happens when we concatenate two matrices along rows (axis 0, the first element of the shape) vs. columns (axis 1, the second element of the shape). We can see that the first output ndarray’s shape is (6, 4), its axis-0 length (6) is the sum of the two input ndarrays’ axis-0 lengths \((3+3)\); while the second output ndarray’s shape is (3, 8), its axis-1 length (8) is the sum of the two input ndarrays’ axis-1 lengths \((4+4)\).

(def X (-> (nd/arange ndm 12)
           (nd/reshape [3 4])))
X
ND: (3, 4) cpu() int32
[[ 0,  1,  2,  3],
 [ 4,  5,  6,  7],
 [ 8,  9, 10, 11],
]
(def Y (nd/create ndm [[2 1 4 3][1 2 3 4][4 3 2 1]]))
Y
ND: (3, 4) cpu() int64
[[ 2,  1,  4,  3],
 [ 1,  2,  3,  4],
 [ 4,  3,  2,  1],
]
;; concat only supports the int32 and float32 datatypes
(def Y (nd/to-type Y :int32 false))
(nd/concat Y Y)
ND: (6, 4) cpu() int32
[[ 2,  1,  4,  3],
 [ 1,  2,  3,  4],
 [ 4,  3,  2,  1],
 [ 2,  1,  4,  3],
 [ 1,  2,  3,  4],
 [ 4,  3,  2,  1],
]
(nd/concat X Y 1)
ND: (3, 8) cpu() int32
[[ 0,  1,  2,  3,  2,  1,  4,  3],
 [ 4,  5,  6,  7,  1,  2,  3,  4],
 [ 8,  9, 10, 11,  4,  3,  2,  1],
]

The third argument of nd/concat specifies the axis along which to concatenate; the default is axis 0.

Sometimes, we want to construct a binary ndarray via logical statements. Take (nd/= X Y) as an example. For each position, if X and Y are equal at that position, the corresponding entry in the new ndarray takes a value of true, meaning that the logical statement (nd/= X Y) is true at that position; otherwise that position takes false.

(nd/= X Y)
ND: (3, 4) cpu() boolean
[[false,  true, false,  true],
 [false, false, false, false],
 [false, false, false, false],
]

Summing all the elements in the ndarray yields an ndarray with only one element.

(nd/sum X)
ND: () cpu() int32
66

2.1.3. Broadcasting Mechanism

In the above section, we saw how to perform elementwise operations on two ndarrays of the same shape. Under certain conditions, even when shapes differ, we can still perform elementwise operations by invoking the broadcasting mechanism. This mechanism works in the following way: First, expand one or both arrays by copying elements appropriately so that after this transformation, the two ndarrays have the same shape. Second, carry out the elementwise operations on the resulting arrays.

In most cases, we broadcast along an axis where an array initially only has length 1, such as in the following example:

(def a (-> (nd/arange ndm 3) (nd/reshape [3 1])))
a
ND: (3, 1) cpu() int32
[[ 0],
 [ 1],
 [ 2],
]
(def b (-> (nd/arange ndm 2) (nd/reshape [1 2])))
b
ND: (1, 2) cpu() int32
[[ 0,  1],
]

Since a and b are \(3 \times 1\) and \(1 \times 2\) matrices respectively, their shapes do not match up if we want to add them. We broadcast the entries of both matrices into a larger \(3 \times 2\) matrix as follows: for matrix a it replicates the columns and for matrix b it replicates the rows before adding up both elementwise.

The result of \(a\) broadcasted is:

(nd/concat a a 1)
ND: (3, 2) cpu() int32
[[ 0,  0],
 [ 1,  1],
 [ 2,  2],
]

The result of \(b\) broadcasted is:

(->> b
     (nd/concat b)
     (nd/concat b))
ND: (3, 2) cpu() int32
[[ 0,  1],
 [ 0,  1],
 [ 0,  1],
]

Thus the result is:

(nd/+ a b)
ND: (3, 2) cpu() int32
[[ 0,  1],
 [ 1,  2],
 [ 2,  3],
]

2.1.4. Indexing and Slicing

As in any array, elements in an ndarray can be accessed by index. The first element has index 0, and ranges are specified to include the first element but exclude the last. As in standard Python lists, we can access elements according to their position relative to the end of the list by using negative indices.

Java and Clojure do not support overloading the [] operator, so indexing and slicing are instead expressed with index and slice strings.

X
ND: (3, 4) cpu() int32
[[ 0,  1,  2,  3],
 [ 4,  5,  6,  7],
 [ 8,  9, 10, 11],
]

Thus, [-1] selects the last element and [1:3] selects the second and the third elements as follows:

(nd/get X "-1")
ND: (4) cpu() int32
[ 8,  9, 10, 11]
(nd/get X "1:3")
ND: (2, 4) cpu() int32
[[ 4,  5,  6,  7],
 [ 8,  9, 10, 11],
]

Beyond reading, we can also set elements of a matrix by specifying indices.

(nd/set X "1,2" 999)
ND: (3, 4) cpu() int32
[[  0,   1,   2,   3],
 [  4,   5, 999,   7],
 [  8,   9,  10,  11],
]

If we want to assign multiple elements the same value, we simply index all of them and then assign them the value. For instance, [0:2, :] accesses the first and second rows, where : takes all the elements along axis 1 (column). While we discussed indexing for matrices, this obviously also works for vectors and for tensors of more than 2 dimensions.

(nd/set X "0:2,:" 12)
ND: (3, 4) cpu() int32
[[12, 12, 12, 12],
 [12, 12, 12, 12],
 [ 8,  9, 10, 11],
]

2.1.5. Saving Memory

Running operations can cause new memory to be allocated to host results. For example, if we write (def Y' (nd/+ Y X)), the result is written to newly allocated memory and Y' points there. In the following example, we demonstrate this with Clojure’s identical? function, which returns true if two references point to exactly the same object. After evaluating (nd/+ Y X), we find that Y and Y' are different objects, because nd/+ allocates new memory for the result. The in-place variant nd/+!, by contrast, writes the result back into Y, so Y'' and Y are the same object.

(def Y (nd/zeros ndm (nd/get-shape X)))
(def Y' (nd/+ Y X))
(identical? Y Y')
false
(def Y'' (nd/+! Y X))
(identical? Y Y'')
true

This might be undesirable for two reasons. First, we do not want to run around allocating memory unnecessarily all the time. In machine learning, we might have hundreds of megabytes of parameters and update all of them multiple times per second. Typically, we will want to perform these updates in place. Second, we might point at the same parameters from multiple variables. If we do not update in place, other references will still point to the old memory location, making it possible for parts of our code to inadvertently reference stale parameters.

Fortunately, performing in-place operations in clj-djl is easy, as we just saw: we can assign the result of an operation to a previously allocated array with in-place operators like nd/+!, which wrap DJL's addi, subi, muli, and divi.

2.1.6. Conversion to Other Clojure Objects

We can convert an ndarray to a Clojure vector with to-vec, or to a Java array with to-array; type shows the class of each result, while the ndarray itself remains a DJL object.

(type (nd/to-vec X))
class clojure.lang.PersistentVector
(nd/to-vec X)
[12 12 12 12 12 12 12 12 8 9 10 11]
(type (nd/to-array X))
class [Ljava.lang.Integer;
(type X)
class ai.djl.mxnet.engine.MxNDArray
X
ND: (3, 4) cpu() int32
[[12, 12, 12, 12],
 [12, 12, 12, 12],
 [ 8,  9, 10, 11],
]

To convert a size-1 ndarray to a Clojure scalar, we can invoke the get-element function:

(def a (nd/create ndm [3.5]))
a
ND: (1) cpu() float64
[3.5]
(nd/get-element a)
3.5

2.2. Data Preprocessing

So far we have worked with synthetic data that arrived neatly packaged in ndarrays. To apply deep learning to real-world problems, we often need to preprocess raw data first. We begin by requiring the namespaces used in this section.

(ns clj-d2l.data-preprocess
  (:require [clj-djl.ndarray :as nd]
            [clj-d2l.core :as d2l]
            [clj-djl.dataframe :as df]
            [clj-djl.dataframe.functional :as functional]
            [clojure.java.io :as io]))

2.2.1. Reading the Dataset

As an example, we begin by creating an artificial dataset that is stored in a csv (comma-separated values) file data/house_tiny.csv. Data stored in other formats may be processed in similar ways.

Below we write the dataset row by row into a csv file.

(let [filename "data/house_tiny.csv"
      records ["NumRooms,Alley,Price\n"  ;; Column names
               "NA,Pave,127500\n"        ;;Each row represents a data example
               "2,NA,106000\n"
               "4,NA,178100\n"
               "NA,NA,140000\n"]]
  (io/make-parents filename)
  (dorun
   (map #(spit filename % :append true) records))
  (slurp filename))
NumRooms,Alley,Price
NA,Pave,127500
2,NA,106000
4,NA,178100
NA,NA,140000

To load the raw dataset from the created csv file, we require the clj-djl.dataframe namespace and invoke the dataframe function to read directly from the csv we created. This dataset has four rows and three columns, where each row describes the number of rooms (“NumRooms”), the alley type (“Alley”), and the price (“Price”) of a house.

(def data (df/dataframe "data/house_tiny.csv"))
data
data/house_tiny.csv [4 3]:

| NumRooms | Alley |  Price |
|---------:|-------|-------:|
|          |  Pave | 127500 |
|        2 |       | 106000 |
|        4 |       | 178100 |
|          |       | 140000 |

2.2.2. Handling Missing Data

Note that there are some blank spaces which are missing values. To handle missing data, typical methods include imputation and deletion, where imputation replaces missing values with substituted ones, while deletion ignores missing values. Here we will consider imputation.

We split the data into inputs and outputs by creating new dataframes with the desired columns: the former takes the NumRooms column together with two indicator columns derived from Alley, while the latter keeps only the Price column. For numerical values in inputs that are missing, we replace the missing entries with the mean value of the same column. For the categorical Alley column, we treat a missing value as a category of its own: Alley_Pave is 1 when the value is present (here, Pave) and Alley_nan is 1 when it is missing.

(def dataframe
  (let [data (df/replace-missing
              data ["NumRooms"] functional/mean)
        data (df/update-column
              data "Alley_nan"
              (map #(if (nil? %) 1 0) (data "Alley")))
        data (df/update-column
              data "Alley_Pave"
              (map #(if (some? %) 1 0) (data "Alley")))
        inputs (df/select-columns
                data ["NumRooms" "Alley_Pave" "Alley_nan"])
        outputs (df/select-columns
                 data ["Price"])]
    [inputs outputs]))
(first dataframe)
data/house_tiny.csv [4 3]:

| NumRooms | Alley_Pave | Alley_nan |
|---------:|-----------:|----------:|
|        3 |          1 |         0 |
|        2 |          0 |         1 |
|        4 |          0 |         1 |
|        3 |          0 |         1 |
(second dataframe)
data/house_tiny.csv [4 1]:

|  Price |
|-------:|
| 127500 |
| 106000 |
| 178100 |
| 140000 |

2.2.3. Conversion to the Tensor Format

Now that all the entries in inputs and outputs are numerical, they can be converted to the NDArray format. Once data are in this format, they can be further manipulated with those NDArray functionalities that we have introduced in Section 2.1.

(def ndm (nd/new-base-manager))
(def X (df/->ndarray ndm (first dataframe)))
(def Y (df/->ndarray ndm (second dataframe)))
X
ND: (4, 3) cpu() int32
[[ 3,  1,  0],
 [ 2,  0,  1],
 [ 4,  0,  1],
 [ 3,  0,  1],
]
Y
ND: (4, 1) cpu() int64
[[127500],
 [106000],
 [178100],
 [140000],
]

2.2.4. Summary

  • Like many other extension packages in the vast ecosystem of Clojure, clj-djl.dataframe can work together with NDArray.
  • Imputation and deletion can be used to handle missing data.

2.3. Linear Algebra

Now that you can store and manipulate data, let us briefly review the subset of basic linear algebra that you will need to understand and implement most of the models covered in this book. Below, we introduce the basic mathematical objects, arithmetic, and operations in linear algebra, expressing each of them through mathematical notation and the corresponding implementation in code.

(ns clj-d2l.linear-algebra
  (:require [clj-djl.ndarray :as nd]
            [clj-d2l.core :as d2l]
            [clojure.java.io :as io]))

2.3.1. Scalars

If you never studied linear algebra or machine learning, then your past experience with math probably consisted of thinking about one number at a time. And, if you ever balanced a checkbook or even paid for dinner at a restaurant then you already know how to do basic things like adding and multiplying pairs of numbers. For example, the temperature in Palo Alto is \(52\) degrees Fahrenheit. Formally, we call values consisting of just one numerical quantity scalars. If you wanted to convert this value to Celsius (the metric system’s more sensible temperature scale), you would evaluate the expression \(c=\frac{5}{9}(f−32)\), setting \(f\) to \(52\). In this equation, each of the terms — \(5\), \(9\), and \(32\) — are scalar values. The placeholders \(c\) and \(f\) are called variables and they represent unknown scalar values. We denote the space of all (continuous) real-valued scalars by \(\mathbb{R}\).
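
As a quick arithmetic check of the conversion formula in plain Clojure (using exact ratios; no ndarrays needed yet), \(52\) degrees Fahrenheit is \(100/9 \approx 11.1\) degrees Celsius:

(defn f->c [f] (* (- f 32) 5/9))
(f->c 52)
100/9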

In this book, we adopt the mathematical notation where scalar variables are denoted by ordinary lower-cased letters (e.g., \(x\), \(y\), and \(z\)). For expedience, we will punt on rigorous definitions of what precisely a space is, but just remember for now that the expression \(x \in \mathbb{R}\) is a formal way to say that \(x\) is a real-valued scalar. The symbol \(\in\) can be pronounced “in” and simply denotes membership in a set. Analogously, we could write \(x, y \in \{0,1\}\) to state that \(x\) and \(y\) are numbers whose value can only be \(0\) or \(1\).

A scalar is represented by an NDArray with just one element. In the next snippet, we instantiate two scalars and perform some familiar arithmetic operations with them, namely addition, multiplication, division, and exponentiation.

(def ndm (nd/base-manager))
(def x (nd/create ndm 3.))
(def y (nd/create ndm 2.))
(nd/+ x y)
ND: () cpu() float64
5.
(nd/* x y)
ND: () cpu() float64
6.
(nd// x y)
ND: () cpu() float64
1.5
(nd/pow x y)
ND: () cpu() float64
9.

2.3.2. Vectors

You can think of a vector as simply a list of scalar values. We call these values the elements (entries or components) of the vector. When our vectors represent examples from our dataset, their values hold some real-world significance. For example, if we were training a model to predict the risk that a loan defaults, we might associate each applicant with a vector whose components correspond to their income, length of employment, number of previous defaults, and other factors. If we were studying the risk of heart attacks hospital patients potentially face, we might represent each patient by a vector whose components capture their most recent vital signs, cholesterol levels, minutes of exercise per day, etc. In math notation, we will usually denote vectors as bold-faced, lower-cased letters (e.g., \(\mathbf{x}\), \(\mathbf{y}\), and \(\mathbf{z}\)).

We work with vectors via one-dimensional NDArrays. In general NDArrays can have arbitrary lengths, subject to the memory limits of your machine.

(def x (nd/arange ndm 4.))
x
ND: (4) cpu() float32
[0., 1., 2., 3.]

We can refer to any element of a vector by using a subscript. For example, we can refer to the \(i\)th element of \(\mathbf{x}\) by \(x_i\). Note that the element \(x_i\) is a scalar, so we do not bold-face the font when referring to it. Extensive literature considers column vectors to be the default orientation of vectors, and so does this book. In math, a vector \(\mathbf{x}\) can be written as

\begin{equation} \mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}, \end{equation}

where \(x_1, \ldots, x_n\) are elements of the vector. In code, we access any element by indexing into the NDArray.

(nd/get x 3)
ND: () cpu() float32
3.

2.3.3. Length, Dimensionality, and Shape

Let us revisit some concepts from Section 2.1. A vector is just an array of numbers. And just as every array has a length, so does every vector. In math notation, if we want to say that a vector \(\mathbf{x}\) consists of \(n\) real-valued scalars, we can express this as \(\mathbf{x} \in \mathbb{R}^n\). The length of a vector is commonly called the dimension of the vector.

As with an ordinary Java array, we can access the length of an NDArray; here we use the size function.

(nd/size x)
4

When an NDArray represents a vector (with precisely one axis), we can also access its length via the shape function. The shape lists the length (dimensionality) along each axis of the NDArray. For NDArrays with just one axis, the shape has just one element.

(nd/shape x)
(4)

or we can use the get-shape function:

(nd/get-shape x)
(4)

Note that the word “dimension” tends to get overloaded in these contexts and this tends to confuse people. To clarify, we use the dimensionality of a vector or an axis to refer to its length, i.e., the number of elements of a vector or an axis. However, we use the dimensionality of an NDArray to refer to the number of axes that an NDArray has. In this sense, the dimensionality of some axis of an NDArray will be the length of that axis.

2.3.4. Matrices

Just as vectors generalize scalars from order zero to order one, matrices generalize vectors from order one to order two. Matrices, which we will typically denote with bold-faced, capital letters (e.g., \(\mathbf{X}\), \(\mathbf{Y}\), and \(\mathbf{Z}\)), are represented in code as NDArrays with two axes.

In math notation, we use \(\mathbf{A} \in \mathbb{R}^{m \times n}\) to express that the matrix \(\mathbf{A}\) consists of \(m\) rows and \(n\) columns of real-valued scalars. Visually, we can illustrate any matrix \(\mathbf{A} \in \mathbb{R}^{m \times n}\) as a table, where each element \(a_{ij}\) belongs to the \(i\)th row and \(j\)th column:

\begin{equation} \mathbf{A}= \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \\ \end{bmatrix}. \end{equation}

For any \(\mathbf{A} \in \mathbb{R}^{m \times n}\), the shape of \(\mathbf{A}\) is \((m, n)\) or \(m \times n\). Specifically, when a matrix has the same number of rows and columns, its shape becomes a square; thus, it is called a square matrix.

We can create an \(m \times n\) matrix by specifying a shape with two components \(m\) and \(n\) when calling any of our favorite functions for instantiating an NDArray.

(def A (-> (nd/arange ndm 20.)
           (nd/reshape 5 4)))
A
ND: (5, 4) cpu() float32
[[ 0.,  1.,  2.,  3.],
 [ 4.,  5.,  6.,  7.],
 [ 8.,  9., 10., 11.],
 [12., 13., 14., 15.],
 [16., 17., 18., 19.],
]

We can access the scalar element \(a_{ij}\) of a matrix \(\mathbf{A}\) in (2.3.2) by specifying the indices for the row (\(i\)) and column (\(j\)), such as \([\mathbf{A}]_{ij}\). When the scalar elements of a matrix \(\mathbf{A}\), such as in (2.3.2), are not given, we may simply use the lower-case letter of the matrix \(\mathbf{A}\) with the index subscript, \(a_{ij}\), to refer to \([\mathbf{A}]_{ij}\). To keep notation simple, commas are inserted to separate indices only when necessary, such as \(a_{2,3j}\) and \([\mathbf{A}]_{2i−1,3}\).
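
In code, we can read a single element with nd/get and an index string, as in Section 2.1.4 (a small sketch, assuming nd/get accepts the same "i,j" strings that nd/set does; indices are 0-based, so this is \(a_{2,3}\) in math notation):

(nd/get A "1,2")
ND: () cpu() float32
6.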

Sometimes, we want to flip the axes. When we exchange a matrix’s rows and columns, the result is called the transpose of the matrix. Formally, we signify a matrix \(\mathbf{A}\)’s transpose by \(\mathbf{A}^\top\) and if \(\mathbf{B}=\mathbf{A}^\top\), then \(b_{ij}=a_{ji}\) for any \(i\) and \(j\). Thus, the transpose of \(\mathbf{A}\) in (2.3.2) is an \(n \times m\) matrix:

\begin{equation} \mathbf{A}^\top = \begin{bmatrix} a_{11} & a_{21} & \dots & a_{m1} \\ a_{12} & a_{22} & \dots & a_{m2} \\ \vdots & \vdots & \ddots & \vdots \\ a_{1n} & a_{2n} & \dots & a_{mn} \end{bmatrix}. \end{equation}

Now we access a matrix’s transpose in code.

(nd/transpose A)
ND: (4, 5) cpu() float32
[[ 0.,  4.,  8., 12., 16.],
 [ 1.,  5.,  9., 13., 17.],
 [ 2.,  6., 10., 14., 18.],
 [ 3.,  7., 11., 15., 19.],
]

There is also a simplified function for the same purpose:

(nd/t A)
ND: (4, 5) cpu() float32
[[ 0.,  4.,  8., 12., 16.],
 [ 1.,  5.,  9., 13., 17.],
 [ 2.,  6., 10., 14., 18.],
 [ 3.,  7., 11., 15., 19.],
]

As a special type of square matrix, a symmetric matrix \(\mathbf{A}\) is equal to its transpose: \(\mathbf{A}=\mathbf{A}^\top\). Here we define a symmetric matrix \(\mathbf{B}\).

(def B (nd/create ndm [[1 2 3] [2 0 4] [3 4 5]]))
B
ND: (3, 3) cpu() int64
[[ 1,  2,  3],
 [ 2,  0,  4],
 [ 3,  4,  5],
]

Now we compare \(\mathbf{B}\) with its transpose.

(nd/= B (nd/t B))
ND: (3, 3) cpu() boolean
[[ true,  true,  true],
 [ true,  true,  true],
 [ true,  true,  true],
]

Matrices are useful data structures: they allow us to organize data that have different modalities of variation. For example, rows in our matrix might correspond to different houses (data examples), while columns might correspond to different attributes. This should sound familiar if you have ever used spreadsheet software or have read Section 2.2. Thus, although the default orientation of a single vector is a column vector, in a matrix that represents a tabular dataset, it is more conventional to treat each data example as a row vector in the matrix. And, as we will see in later chapters, this convention will enable common deep learning practices. For example, along the outermost axis of a NDArray, we can access or enumerate minibatches of data examples, or just data examples if no minibatch exists.

2.3.5. Tensors / NDArrays

Just as vectors generalize scalars, and matrices generalize vectors, we can build data structures with even more axes. NDArrays (“NDArrays” in this subsection refer to algebraic objects) give us a generic way of describing \(n\)-dimensional arrays with an arbitrary number of axes. Vectors, for example, are first-order NDArrays, and matrices are second-order NDArrays. NDArrays are denoted with capital letters of a special font face (e.g., \(\mathsf{X}\), \(\mathsf{Y}\), and \(\mathsf{Z}\)) and their indexing mechanism (e.g., \(x_{ijk}\) and \([\mathsf{X}]_{1,2i-1,3}\)) is similar to that of matrices.

NDArrays will become more important when we start working with images, which arrive as \(n\)-dimensional arrays with 3 axes corresponding to the height, width, and a channel axis for stacking the color channels (red, green, and blue). For now, we will skip over higher order NDArrays and focus on the basics.

(def X (-> (nd/arange ndm 24.)
           (nd/reshape 2 3 4)))
X
ND: (2, 3, 4) cpu() float32
[[[ 0.,  1.,  2.,  3.],
  [ 4.,  5.,  6.,  7.],
  [ 8.,  9., 10., 11.],
 ],
 [[12., 13., 14., 15.],
  [16., 17., 18., 19.],
  [20., 21., 22., 23.],
 ],
]

2.3.6. Basic Properties of Tensor Arithmetic

Scalars, vectors, matrices, and NDArrays (“NDArrays” in this subsection refer to algebraic objects) of an arbitrary number of axes have some nice properties that often come in handy. For example, you might have noticed from the definition of an elementwise operation that any elementwise unary operation does not change the shape of its operand. Similarly, given any two NDArrays with the same shape, the result of any binary elementwise operation will be an NDArray of that same shape. For example, adding two matrices of the same shape performs elementwise addition over these two matrices.

(def A (-> (nd/arange ndm 20.)
           (nd/reshape 5 4)))
(def B (nd/duplicate A))
A
ND: (5, 4) cpu() float32
[[ 0.,  1.,  2.,  3.],
 [ 4.,  5.,  6.,  7.],
 [ 8.,  9., 10., 11.],
 [12., 13., 14., 15.],
 [16., 17., 18., 19.],
]
B
ND: (5, 4) cpu() float32
[[ 0.,  1.,  2.,  3.],
 [ 4.,  5.,  6.,  7.],
 [ 8.,  9., 10., 11.],
 [12., 13., 14., 15.],
 [16., 17., 18., 19.],
]
(nd/+ A B)
ND: (5, 4) cpu() float32
[[ 0.,  2.,  4.,  6.],
 [ 8., 10., 12., 14.],
 [16., 18., 20., 22.],
 [24., 26., 28., 30.],
 [32., 34., 36., 38.],
]

Specifically, elementwise multiplication of two matrices is called their Hadamard product (math notation \(\odot\)). Consider matrix \(\mathbf{B} \in \mathbb{R}^{m \times n}\) whose element of row \(i\) and column \(j\) is \(b_{ij}\). The Hadamard product of matrices \(\mathbf{A}\) (defined in (2.3.2)) and \(\mathbf{B}\) is

\begin{equation} \mathbf{A} \odot \mathbf{B} = \begin{bmatrix} a_{11} b_{11} & a_{12} b_{12} & \dots & a_{1n} b_{1n} \\ a_{21} b_{21} & a_{22} b_{22} & \dots & a_{2n} b_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} b_{m1} & a_{m2} b_{m2} & \dots & a_{mn} b_{mn} \end{bmatrix}. \end{equation}
(nd/* A B)
ND: (5, 4) cpu() float32
[[  0.,   1.,   4.,   9.],
 [ 16.,  25.,  36.,  49.],
 [ 64.,  81., 100., 121.],
 [144., 169., 196., 225.],
 [256., 289., 324., 361.],
]

Multiplying or adding an NDArray by a scalar also does not change the shape of the NDArray: each element of the operand NDArray is added to or multiplied by the scalar.

(def a 2)
(def X (-> (nd/arange ndm 24.)
           (nd/reshape 2 3 4)))
(nd/+ X a)
ND: (2, 3, 4) cpu() float32
[[[ 2.,  3.,  4.,  5.],
  [ 6.,  7.,  8.,  9.],
  [10., 11., 12., 13.],
 ],
 [[14., 15., 16., 17.],
  [18., 19., 20., 21.],
  [22., 23., 24., 25.],
 ],
]
(nd/shape (nd/* X a))
(2, 3, 4)

2.3.7. Reduction

One useful operation that we can perform with arbitrary NDArrays is to calculate the sum of their elements. In mathematical notation, we express sums using the \(\sum\) symbol. To express the sum of the elements in a vector \(\mathbf{x}\) of length \(d\), we write \(\sum^d_{i=1} x_i\). In code, we can just call the function for calculating the sum.

(def x (nd/arange ndm 4.))
x
ND: (4) cpu() float32
[0., 1., 2., 3.]
(nd/sum x)
ND: () cpu() float32
6.

We can express sums over the elements of NDArrays of arbitrary shape. For example, the sum of the elements of an \(m \times n\) matrix \(\mathbf{A}\) could be written \(\sum^m_{i=1} \sum^n_{j=1} a_{ij}\).

(nd/shape A)
(5, 4)
(nd/sum A)
ND: () cpu() float32
190.

By default, invoking the function for calculating the sum reduces an NDArray along all its axes to a scalar. We can also specify the axes along which the NDArray is reduced via summation. Take matrices as an example. To reduce the row dimension (axis 0) by summing up elements of all the rows, we specify [0] when invoking the function. Since the input matrix reduces along axis 0 to generate the output vector, the dimension of axis 0 of the input is lost in the output shape.

(def A-sum-axis0 (nd/sum A [0]))
A-sum-axis0
ND: (4) cpu() float32
[40., 45., 50., 55.]
(nd/shape A-sum-axis0)
(4)

Specifying [1] will reduce the column dimension (axis 1) by summing up elements of all the columns. Thus, the dimension of axis 1 of the input is lost in the output shape.

(def A-sum-axis1 (nd/sum A [1]))
A-sum-axis1
ND: (5) cpu() float32
[ 6., 22., 38., 54., 70.]
(nd/shape A-sum-axis1)
(5)

Reducing a matrix along both rows and columns via summation is equivalent to summing up all the elements of the matrix.

(nd/sum A [0 1])
ND: () cpu() float32
190.

A related quantity is the mean, which is also called the average. We calculate the mean by dividing the sum by the total number of elements. In code, we could just call the function for calculating the mean on NDArrays of arbitrary shape.

(nd/mean A)
ND: () cpu() float32
9.5
(nd// (nd/sum A) (nd/size A))
ND: () cpu() float32
9.5

Likewise, the function for calculating the mean can also reduce an NDArray along the specified axes.

(nd/mean A [0])
ND: (4) cpu() float32
[ 8.,  9., 10., 11.]
(nd/shape A)
(5, 4)
(nd// (nd/sum A [0]) (nd/get (nd/shape A) 0))
ND: (4) cpu() float32
[ 8.,  9., 10., 11.]

2.3.7.1. Non-Reduction Sum

However, sometimes it can be useful to keep the number of axes unchanged when invoking the function for calculating the sum or mean.

(def sum-A (nd/sum A [1] true))
sum-A
ND: (5, 1) cpu() float32
[[ 6.],
 [22.],
 [38.],
 [54.],
 [70.],
]

For instance, since sum-A still keeps its two axes after summing each row, we can divide A by sum-A with broadcasting.

(nd// A sum-A)
ND: (5, 4) cpu() float32
[[0.    , 0.1667, 0.3333, 0.5   ],
 [0.1818, 0.2273, 0.2727, 0.3182],
 [0.2105, 0.2368, 0.2632, 0.2895],
 [0.2222, 0.2407, 0.2593, 0.2778],
 [0.2286, 0.2429, 0.2571, 0.2714],
]

If we want to calculate the cumulative sum of elements of A along some axis, say axis 0 (row by row), we can call the cumsum function. This function will not reduce the input NDArray along any axis.

A
ND: (5, 4) cpu() float32
[[ 0.,  1.,  2.,  3.],
 [ 4.,  5.,  6.,  7.],
 [ 8.,  9., 10., 11.],
 [12., 13., 14., 15.],
 [16., 17., 18., 19.],
]
(nd/cumsum A 0)
ND: (5, 4) cpu() float32
[[ 0.,  1.,  2.,  3.],
 [ 4.,  6.,  8., 10.],
 [12., 15., 18., 21.],
 [24., 28., 32., 36.],
 [40., 45., 50., 55.],
]
(nd/cumsum A 1)
ND: (5, 4) cpu() float32
[[ 0.,  1.,  3.,  6.],
 [ 4.,  9., 15., 22.],
 [ 8., 17., 27., 38.],
 [12., 25., 39., 54.],
 [16., 33., 51., 70.],
]

2.3.8. Dot Products

So far, we have only performed elementwise operations, sums, and averages. And if this was all we could do, linear algebra probably would not deserve its own section. However, one of the most fundamental operations is the dot product. Given two vectors \(\mathbf{x}, \mathbf{y} \in \mathbb{R}^d\), their dot product \(\mathbf{x}^\top \mathbf{y}\) (or \(\langle \mathbf{x}, \mathbf{y} \rangle\)) is a sum over the products of the elements at the same position: \(\mathbf{x}^\top \mathbf{y} = \sum^d_{i=1} x_i y_i\).

(def y (nd/ones ndm [4]))
#'clj-d2l.linear-algebra/y
x
ND: (4) cpu() float32
[0., 1., 2., 3.]
y
ND: (4) cpu() float32
[1., 1., 1., 1.]
(nd/dot x y)
ND: () cpu() float32
6.

Note that we can express the dot product of two vectors equivalently by performing an elementwise multiplication and then a sum:

(nd/sum (nd/* x y))
ND: () cpu() float32
6.

Dot products are useful in a wide range of contexts. For example, given some set of values, denoted by a vector \(\mathbf{x} \in \mathbb{R}^d\), and a set of weights, denoted by \(\mathbf{w} \in \mathbb{R}^d\), the weighted sum of the values in \(\mathbf{x}\) according to the weights \(\mathbf{w}\) could be expressed as the dot product \(\mathbf{x}^\top \mathbf{w}\). When the weights are non-negative and sum to one (i.e., \(\sum^d_{i=1} w_i = 1\)), the dot product expresses a weighted average. After normalizing two vectors to have unit length, the dot product expresses the cosine of the angle between them. We will formally introduce this notion of length later in this section.
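
For example, with some made-up values and weights (the weights are non-negative and sum to one, so the dot product below is a weighted average):

(def values (nd/create ndm [10. 20. 30.]))
(def weights (nd/create ndm [0.2 0.3 0.5]))
(nd/dot values weights)
ND: () cpu() float64
23.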

2.3.9. Matrix-Vector Products

Now that we know how to calculate dot products, we can begin to understand matrix-vector products. Recall the matrix \(\mathbf{A} \in \mathbb{R}^{m \times n}\) and the vector \(\mathbf{x} \in \mathbb{R}^n\) defined and visualized in (2.3.2) and (2.3.1) respectively. Let us start off by visualizing the matrix \(\mathbf{A}\) in terms of its row vectors

\begin{equation} \mathbf{A}= \begin{bmatrix} \mathbf{a}^\top_{1} \\ \mathbf{a}^\top_{2} \\ \vdots \\ \mathbf{a}^\top_m \\ \end{bmatrix}, \end{equation}

where each \(\mathbf{a}^\top_i \in \mathbb{R}^n\) is a row vector representing the \(i\)th row of the matrix \(\mathbf{A}\). The matrix-vector product \(\mathbf{A}\mathbf{x}\) is simply a column vector of length \(m\), whose \(i\)th element is the dot product \(\mathbf{a}^\top_i \mathbf{x}\):

\begin{equation} \mathbf{A}\mathbf{x} = \begin{bmatrix} \mathbf{a}^\top_{1} \\ \mathbf{a}^\top_{2} \\ \vdots \\ \mathbf{a}^\top_m \\ \end{bmatrix}\mathbf{x} = \begin{bmatrix} \mathbf{a}^\top_{1} \mathbf{x} \\ \mathbf{a}^\top_{2} \mathbf{x} \\ \vdots\\ \mathbf{a}^\top_{m} \mathbf{x}\\ \end{bmatrix}. \end{equation}

We can think of multiplication by a matrix \(\mathbf{A} \in \mathbb{R}^{m \times n}\) as a transformation that projects vectors from \(\mathbb{R}^n\) to \(\mathbb{R}^m\). These transformations turn out to be remarkably useful. For example, we can represent rotations as multiplications by a square matrix. As we will see in subsequent chapters, we can also use matrix-vector products to describe the most intensive calculations required when computing each layer in a neural network given the values of the previous layer.
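
For instance, here is a 90-degree counterclockwise rotation of the plane expressed as a matrix-vector product (a small sketch using nd/dot, which is introduced just below; R and v are made-up names):

(def R (nd/create ndm [[0. -1.] [1. 0.]]))
(def v (nd/create ndm [1. 0.]))
(nd/dot R v)
ND: (2) cpu() float64
[0., 1.]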

Expressing matrix-vector products in code with NDArrays, we use the same dot function as for dot products. When we call (nd/dot A x) with a matrix \(\mathbf{A}\) and a vector \(\mathbf{x}\), the matrix-vector product is performed. Note that the column dimension of \(\mathbf{A}\) (its length along axis 1) must be the same as the dimension of \(\mathbf{x}\) (its length).

(nd/shape A)
(5, 4)
(nd/shape x)
(4)
(nd/dot A x)
ND: (5) cpu() float32
[ 14.,  38.,  62.,  86., 110.]

2.3.10. Matrix-Matrix Multiplication

If you have gotten the hang of dot products and matrix-vector products, then matrix-matrix multiplication should be straightforward.

Say that we have two matrices \(\mathbf{A} \in \mathbb{R}^{n \times k}\) and \(\mathbf{B} \in \mathbb{R}^{k \times m}\):

\begin{equation} \mathbf{A}=\begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1k} \\ a_{21} & a_{22} & \cdots & a_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nk} \\ \end{bmatrix},\quad \mathbf{B}=\begin{bmatrix} b_{11} & b_{12} & \cdots & b_{1m} \\ b_{21} & b_{22} & \cdots & b_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ b_{k1} & b_{k2} & \cdots & b_{km} \\ \end{bmatrix}. \end{equation}

Denote by \(\mathbf{a}^\top_i \in \mathbb{R}^k\) the row vector representing the \(i\)th row of the matrix \(\mathbf{A}\), and let \(\mathbf{b}_j \in \mathbb{R}^k\) be the column vector from the \(j\)th column of the matrix \(\mathbf{B}\). To produce the matrix product \(\mathbf{C}=\mathbf{AB}\), it is easiest to think of \(\mathbf{A}\) in terms of its row vectors and \(\mathbf{B}\) in terms of its column vectors:

\begin{equation} \mathbf{A}= \begin{bmatrix} \mathbf{a}^\top_{1} \\ \mathbf{a}^\top_{2} \\ \vdots \\ \mathbf{a}^\top_n \\ \end{bmatrix}, \quad \mathbf{B}=\begin{bmatrix} \mathbf{b}_{1} & \mathbf{b}_{2} & \cdots & \mathbf{b}_{m} \\ \end{bmatrix}. \end{equation}

Then the matrix product \(\mathbf{C} \in \mathbb{R}^{n \times m}\) is produced as we simply compute each element \(c_{ij}\) as the dot product \(\mathbf{a}^\top_i \mathbf{b}_j\):

\begin{equation} \mathbf{C} = \mathbf{AB} = \begin{bmatrix} \mathbf{a}^\top_{1} \\ \mathbf{a}^\top_{2} \\ \vdots \\ \mathbf{a}^\top_n \\ \end{bmatrix} \begin{bmatrix} \mathbf{b}_{1} & \mathbf{b}_{2} & \cdots & \mathbf{b}_{m} \\ \end{bmatrix} = \begin{bmatrix} \mathbf{a}^\top_{1} \mathbf{b}_1 & \mathbf{a}^\top_{1}\mathbf{b}_2& \cdots & \mathbf{a}^\top_{1} \mathbf{b}_m \\ \mathbf{a}^\top_{2}\mathbf{b}_1 & \mathbf{a}^\top_{2} \mathbf{b}_2 & \cdots & \mathbf{a}^\top_{2} \mathbf{b}_m \\ \vdots & \vdots & \ddots &\vdots\\ \mathbf{a}^\top_{n} \mathbf{b}_1 & \mathbf{a}^\top_{n}\mathbf{b}_2& \cdots& \mathbf{a}^\top_{n} \mathbf{b}_m \end{bmatrix}. \end{equation}

We can think of the matrix-matrix multiplication \(\mathbf{AB}\) as simply performing \(m\) matrix-vector products and stitching the results together to form an \(n \times m\) matrix. In the following snippet, we perform matrix multiplication on \(\mathbf{A}\) and \(\mathbf{B}\). Here, \(\mathbf{A}\) is a matrix with 5 rows and 4 columns, and \(\mathbf{B}\) is a matrix with 4 rows and 3 columns. After multiplication, we obtain a matrix with 5 rows and 3 columns.

(def B (nd/ones ndm [4 3]))
B
ND: (4, 3) cpu() float32
[[1., 1., 1.],
 [1., 1., 1.],
 [1., 1., 1.],
 [1., 1., 1.],
]
A
ND: (5, 4) cpu() float32
[[ 0.,  1.,  2.,  3.],
 [ 4.,  5.,  6.,  7.],
 [ 8.,  9., 10., 11.],
 [12., 13., 14., 15.],
 [16., 17., 18., 19.],
]
(nd/dot A B)
ND: (5, 3) cpu() float32
[[ 6.,  6.,  6.],
 [22., 22., 22.],
 [38., 38., 38.],
 [54., 54., 54.],
 [70., 70., 70.],
]

Matrix-matrix multiplication can be simply called matrix multiplication, and should not be confused with the Hadamard product.

2.3.11. Norms

Some of the most useful operators in linear algebra are norms. Informally, the norm of a vector tells us how big a vector is. The notion of size under consideration here concerns not dimensionality but rather the magnitude of the components.

In linear algebra, a vector norm is a function \(f\) that maps a vector to a scalar, satisfying a handful of properties. Given any vector \(\mathbf{x}\), the first property says that if we scale all the elements of a vector by a constant factor \(\alpha\), its norm also scales by the absolute value of the same constant factor:

\begin{equation} f(\alpha \mathbf{x}) = |\alpha| f(\mathbf{x}). \end{equation}

The second property is the familiar triangle inequality:

\begin{equation} f(\mathbf{x} + \mathbf{y}) \leq f(\mathbf{x}) + f(\mathbf{y}). \end{equation}

The third property simply says that the norm must be non-negative:

\begin{equation} f(\mathbf{x}) \geq 0. \end{equation}

That makes sense, as in most contexts the smallest size for anything is 0. The final property requires that the smallest norm is achieved and only achieved by a vector consisting of all zeros.

\begin{equation} \forall i, [\mathbf{x}]_i = 0 \Leftrightarrow f(\mathbf{x})=0. \end{equation}

You might notice that norms sound a lot like measures of distance. And if you remember Euclidean distances (think Pythagoras’ theorem) from grade school, then the concepts of non-negativity and the triangle inequality might ring a bell. In fact, the Euclidean distance is a norm: specifically it is the \(L_2\) norm. Suppose that the elements in the \(n\)-dimensional vector \(\mathbf{x}\) are \(x_1, \ldots, x_n\). The \(L_2\) norm of \(\mathbf{x}\) is the square root of the sum of the squares of the vector elements:

\begin{equation} \|\mathbf{x}\|_2 = \sqrt{\sum_{i=1}^n x_i^2}, \end{equation}

where the subscript \(2\) is often omitted in \(L_2\) norms, i.e., \(\|\mathbf{x}\|\) is equivalent to \(\|\mathbf{x}\|_2\). In code, we can calculate the \(L_2\) norm of a vector as follows.

(defn l2norm [ndarray]
  (-> ndarray
      (nd/pow 2)
      (nd/sum)
      (nd/sqrt)))
(def u (nd/create ndm [3. -4.]))
u
ND: (2) cpu() float64
[ 3., -4.]
(l2norm u)
ND: () cpu() float64
5.
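
With l2norm in hand, we can also verify the scaling property above numerically: scaling \(\mathbf{u}\) by \(-2\) scales its norm by \(|-2| = 2\), from \(5\) to \(10\).

(l2norm (nd/* u -2))
ND: () cpu() float64
10.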

In deep learning, we work more often with the squared \(L_2\) norm. You will also frequently encounter the \(L_1\) norm, which is expressed as the sum of the absolute values of the vector elements:

\begin{equation} \|\mathbf{x}\|_1 = \sum_{i=1}^n \left|x_i \right|. \end{equation}

As compared with the \(L_2\) norm, it is less influenced by outliers. To calculate the \(L_1\) norm, we compose the absolute value function with a sum over the elements.

(nd/sum (nd/abs u))
ND: () cpu() float64
7.

Both the \(L_2\) norm and the \(L_1\) norm are special cases of the more general \(L_p\) norm:

\begin{equation} \|\mathbf{x}\|_p = \left(\sum_{i=1}^n \left|x_i \right|^p \right)^{1/p}. \end{equation}
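
In code, we could define a generic helper in the same style as l2norm above (a sketch composing nd/abs, nd/pow, and nd/sum, all of which appear in this section; with \(p = 2\) it recovers the \(L_2\) norm):

(defn lpnorm [ndarray p]
  (-> ndarray
      (nd/abs)
      (nd/pow p)
      (nd/sum)
      (nd/pow (/ 1.0 p))))
(lpnorm u 2)
ND: () cpu() float64
5.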

Analogous to \(L_2\) norms of vectors, the Frobenius norm of a matrix \(\mathbf{X} \in \mathbb{R}^{m \times n}\) is the square root of the sum of the squares of the matrix elements:

\begin{equation} \|\mathbf{X}\|_F = \sqrt{\sum_{i=1}^m \sum_{j=1}^n x_{ij}^2}. \end{equation}

The Frobenius norm satisfies all the properties of vector norms. It behaves as if it were an \(L_2\) norm of a matrix-shaped vector. Invoking the following function will calculate the Frobenius norm of a matrix.

(l2norm (nd/ones ndm [4 9]))
ND: () cpu() float32
6.

2.3.12. Norms and Objectives

While we do not want to get too far ahead of ourselves, we can plant some intuition already about why these concepts are useful. In deep learning, we are often trying to solve optimization problems: maximize the probability assigned to observed data; minimize the distance between predictions and the ground-truth observations; assign vector representations to items (like words, products, or news articles) such that the distance between similar items is minimized and the distance between dissimilar items is maximized. Oftentimes, the objectives, perhaps the most important components of deep learning algorithms (besides the data), are expressed as norms.

2.3.13. More on Linear Algebra

In just this section, we have taught you all the linear algebra that you will need to understand a remarkable chunk of modern deep learning. There is a lot more to linear algebra and a lot of that mathematics is useful for machine learning. For example, matrices can be decomposed into factors, and these decompositions can reveal low-dimensional structure in real-world datasets. There are entire subfields of machine learning that focus on using matrix decompositions and their generalizations to high-order NDArrays to discover structure in datasets and solve prediction problems. But this book focuses on deep learning. And we believe you will be much more inclined to learn more mathematics once you have gotten your hands dirty deploying useful machine learning models on real datasets. So while we reserve the right to introduce more mathematics much later on, we will wrap up this section here.

If you are eager to learn more about linear algebra, you may refer to either the online appendix on linear algebraic operations or other excellent resources [Strang, 1993][Kolter, 2008][Petersen et al., 2008].

2.3.14. Summary

  • Scalars, vectors, matrices, and NDArrays are basic mathematical objects in linear algebra.
  • Vectors generalize scalars, and matrices generalize vectors.
  • Scalars, vectors, matrices, and NDArrays have zero, one, two, and an arbitrary number of axes, respectively.
  • An NDArray can be reduced along the specified axes by sum and mean.
  • Elementwise multiplication of two matrices is called their Hadamard product. It is different from matrix multiplication.
  • In deep learning, we often work with norms such as the \(L_1\) norm, the \(L_2\) norm, and the Frobenius norm.
  • We can perform a variety of operations over scalars, vectors, matrices, and NDArrays.

2.4. Calculus

Finding the area of a polygon had remained mysterious until at least 2,500 years ago, when ancient Greeks divided a polygon into triangles and summed their areas. To find the area of curved shapes, such as a circle, ancient Greeks inscribed polygons in such shapes. An inscribed polygon with more sides of equal length better approximates the circle. This process is also known as the method of exhaustion.

In fact, the method of exhaustion is where integral calculus originates. More than 2,000 years later, the other branch of calculus, differential calculus, was invented. Among the most critical applications of differential calculus, optimization problems consider how to do something as well as possible. As discussed in Section 2.3.10.1, such problems are ubiquitous in deep learning.

In deep learning, we train models, updating them successively so that they get better and better as they see more and more data. Usually, getting better means minimizing a loss function, a score that answers the question “how bad is our model?” This question is more subtle than it appears. Ultimately, what we really care about is producing a model that performs well on data that we have never seen before. But we can only fit the model to data that we can actually see. Thus we can decompose the task of fitting models into two key concerns: i) optimization: the process of fitting our models to observed data; ii) generalization: the mathematical principles and practitioners’ wisdom that guide us as to how to produce models whose validity extends beyond the exact set of data examples used to train them.

To help you understand optimization problems and methods in later chapters, here we give a very brief primer on differential calculus that is commonly used in deep learning.

2.4.1. Derivatives and Differentiation

We begin by addressing the calculation of derivatives, a crucial step in nearly all deep learning optimization algorithms. In deep learning, we typically choose loss functions that are differentiable with respect to our model’s parameters. Put simply, this means that for each parameter, we can determine how rapidly the loss would increase or decrease, were we to increase or decrease that parameter by an infinitesimally small amount.

Suppose that we have a function \(f: \mathbb{R} \rightarrow \mathbb{R}\), whose input and output are both scalars. The derivative of \(f\) is defined as

\begin{equation} \label{orgda04469} f'(x) = \lim_{h \rightarrow 0} \frac{f(x+h) - f(x)}{h}, \end{equation}

if this limit exists. If \(f'(a)\) exists, \(f\) is said to be differentiable at \(a\). If \(f\) is differentiable at every point of an interval, then this function is differentiable on this interval. We can interpret the derivative \(f'(x)\) in \eqref{orgda04469} as the instantaneous rate of change of \(f(x)\) with respect to \(x\). The so-called instantaneous rate of change is based on the variation \(h\) in \(x\), which approaches \(0\).

To illustrate derivatives, let us experiment with an example. Define \(u = f(x) = 3x^2-4x\).

*Note: We will be using Double in this section to avoid incorrect results, since Double provides more decimal precision. Generally, though, we would use Float, as deep learning frameworks use Float by default.*

(ns clj-d2l.calculus
  (:require [clj-djl.ndarray :as nd]
            [clj-chart.chart :as chart]
            [clj-chart.plot :as plot]
            [clj-d2l.core :as d2l]
            [clojure.java.io :as io]))
(defn f [x]
  (- (* 3 (Math/pow x 2)) (* 4 x)))

By setting \(x=1\) and letting \(h\) approach \(0\), the numerical result of \(\frac{f(x+h) - f(x)}{h}\) in \eqref{orgda04469} approaches \(2\). Though this experiment is not a mathematical proof, we will see later that the derivative \(u'\) is \(2\) when \(x=1\).

(defn numerical-lim [f x h]
  (/ (- (f (+ x h)) (f x)) h))

(->> (map #(/ 0.1 (Math/pow 10 %)) (range 5))
     (map (fn [h] [h (numerical-lim f 1 h)]))
     (map #(println "h = " (% 0) ", numerical limit = " (% 1)))
     (dorun))
h =  0.1 , numerical limit =  2.3000000000000043
h =  0.01 , numerical limit =  2.0299999999999763
h =  0.001 , numerical limit =  2.002999999999311
h =  1.0E-4 , numerical limit =  2.0002999999979565
h =  1.0E-5 , numerical limit =  2.0000300000155846

Let us familiarize ourselves with a few equivalent notations for derivatives. Given \(y = f(x)\), where \(x\) and \(y\) are the independent variable and the dependent variable of the function \(f\), respectively, the following expressions are equivalent:

\begin{equation} \label{org5726755} f'(x) = y' = \frac{dy}{dx} = \frac{df}{dx} = \frac{d}{dx} f(x) = Df(x) = D_x f(x), \end{equation}

where symbols \(\frac{d}{dx}\) and \(D\) are differentiation operators that indicate the operation of differentiation. We can use the following rules to differentiate common functions (a quick numerical check follows the list):

  • \(DC = 0\) (\(C\) is a constant),
  • \(Dx^n = nx^{n-1}\) (the power rule, \(n\) is any real number),
  • \(De^x = e^x\),
  • \(D\ln(x) = 1/x.\)
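As a sanity check, we can compare one of these rules against the numerical limit from above. Here we check \(De^x = e^x\) at \(x = 1\), reusing our numerical-lim helper; both values should be close to \(e \approx 2.71828\).

(numerical-lim #(Math/exp %) 1 1e-6) ; numerical estimate of De^x at x = 1
;=> ≈ 2.71828
(Math/exp 1)                         ; e^1 for comparison
;=> 2.718281828459045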

To differentiate a function that is formed from a few simpler functions, such as the above common functions, the following rules can be handy. Suppose that functions \(f\) and \(g\) are both differentiable and \(C\) is a constant. Then we have the constant multiple rule

\begin{equation} \label{orgb68afac} \frac{d}{dx} [Cf(x)] = C \frac{d}{dx} f(x), \end{equation}

the sum rule

\begin{equation} \label{org3cd5fcc} \frac{d}{dx} [f(x) + g(x)] = \frac{d}{dx} f(x) + \frac{d}{dx} g(x), \end{equation}

the product rule

\begin{equation} \frac{d}{dx} [f(x)g(x)] = f(x) \frac{d}{dx} [g(x)] + g(x) \frac{d}{dx} [f(x)], \end{equation}

and the quotient rule

\begin{equation} \label{orga4b489c} \frac{d}{dx} \left[\frac{f(x)}{g(x)}\right] = \frac{g(x) \frac{d}{dx} [f(x)] - f(x) \frac{d}{dx} [g(x)]}{[g(x)]^2}. \end{equation}

Now we can apply a few of the above rules to find \(u' = f'(x) = 3 \frac{d}{dx} x^2-4\frac{d}{dx}x = 6x-4\). Thus, by setting \(x = 1\), we have \(u' = 2\): this is supported by our earlier experiment in this section where the numerical result approaches \(2\). This derivative is also the slope of the tangent line to the curve \(u = f(x)\) when \(x = 1\).

To visualize such an interpretation of derivatives, we will use xchart, a simple plotting library.

We build charts with chart/line, which takes a title and a collection of series; each series specifies a name together with the data for the x and y axes. We then write the chart to an SVG file with plot/store!. This pattern, or a modified version of it, will allow us to plot multiple curves succinctly, since we will need to visualize many curves throughout the book.

Now we can plot the function \(u = f(x)\) and its tangent line \(y = 2x - 3\) at \(x=1\), where the coefficient \(2\) is the slope of the tangent line.

(let [x (range 0 3 1/32)
      y1 (map f x)
      y2 (map #(- (* 2 %) 3) x)
      chart (chart/line {:title "tangent line (x=1)"
                         :series [{:name "y1"
                                   :xs x
                                   :ys y1}
                                  {:name "y2"
                                   :xs x
                                   :ys y2}]})]
  (plot/store! chart nil "notes/figures/tangent_line.svg"))

tangent_line.svg

2.4.2. Partial Derivatives

So far we have dealt with the differentiation of functions of just one variable. In deep learning, functions often depend on many variables. Thus, we need to extend the ideas of differentiation to these multivariate functions.

Let \(y = f(x_1, x_2, \ldots, x_n)\) be a function with \(n\) variables. The partial derivative of \(y\) with respect to its \(i^\mathrm{th}\) parameter \(x_i\) is

\begin{equation} \label{orge083780} \frac{\partial y}{\partial x_i} = \lim_{h \rightarrow 0} \frac{f(x_1, \ldots, x_{i-1}, x_i+h, x_{i+1}, \ldots, x_n) - f(x_1, \ldots, x_i, \ldots, x_n)}{h}. \end{equation}

To calculate \(\frac{\partial y}{\partial x_i}\), we can simply treat \(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n\) as constants and calculate the derivative of \(y\) with respect to \(x_i\). For notation of partial derivatives, the following are equivalent:

\begin{equation} \frac{\partial y}{\partial x_i} = \frac{\partial f}{\partial x_i} = f_{x_i} = f_i = D_i f = D_{x_i} f. \end{equation}
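We can estimate a partial derivative numerically in the same spirit as numerical-lim, perturbing one variable while holding the rest fixed. Below is a minimal sketch; the helper partial-lim is our own illustration, not part of clj-djl.

(defn partial-lim
  "Numerically estimates the partial derivative of f with respect to its
  i-th argument at the point xs, using step size h."
  [f xs i h]
  (/ (- (apply f (update (vec xs) i + h))
        (apply f xs))
     h))

;; f(x1, x2) = 3x1^2 + 5e^x2: the partial with respect to x1 at (1, 0)
;; should approach 6x1 = 6.
(partial-lim (fn [x1 x2] (+ (* 3 x1 x1) (* 5 (Math/exp x2)))) [1.0 0.0] 0 1e-6)
;=> ≈ 6.0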

2.4.3. Gradients

We can concatenate partial derivatives of a multivariate function with respect to all its variables to obtain the gradient vector of the function. Suppose that the input of function \(f: \mathbb{R}^n \rightarrow \mathbb{R}\) is an \(n\)-dimensional vector \(\mathbf{x} = [x_1, x_2, \ldots, x_n]^\top\) and the output is a scalar. The gradient of the function \(f(\mathbf{x})\) with respect to \(\mathbf{x}\) is a vector of \(n\) partial derivatives:

\begin{equation} \label{orgbcc17ba} \nabla_{\mathbf{x}} f(\mathbf{x}) = \bigg[\frac{\partial f(\mathbf{x})}{\partial x_1}, \frac{\partial f(\mathbf{x})}{\partial x_2}, \ldots, \frac{\partial f(\mathbf{x})}{\partial x_n}\bigg]^\top, \end{equation}

where \(\nabla_{\mathbf{x}} f(\mathbf{x})\) is often replaced by \(\nabla f(\mathbf{x})\) when there is no ambiguity.

Let \(\mathbf{x}\) be an \(n\)-dimensional vector. The following rules are often used when differentiating multivariate functions:

  • For all \(\mathbf{A} \in \mathbb{R}^{m \times n}\), \(\nabla_{\mathbf{x}} \mathbf{A} \mathbf{x} = \mathbf{A}^\top\),
  • For all \(\mathbf{A} \in \mathbb{R}^{n \times m}\), \(\nabla_{\mathbf{x}} \mathbf{x}^\top \mathbf{A} = \mathbf{A}\),
  • For all \(\mathbf{A} \in \mathbb{R}^{n \times n}\), \(\nabla_{\mathbf{x}} \mathbf{x}^\top \mathbf{A} \mathbf{x} = (\mathbf{A} + \mathbf{A}^\top)\mathbf{x}\),
  • \(\nabla_{\mathbf{x}} \|\mathbf{x} \|^2 = \nabla_{\mathbf{x}} \mathbf{x}^\top \mathbf{x} = 2\mathbf{x}\).

Similarly, for any matrix \(\mathbf{X}\), we have \(\nabla_{\mathbf{X}} \|\mathbf{X} \|_F^2 = 2\mathbf{X}\). As we will see later, gradients are useful for designing optimization algorithms in deep learning.
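As a quick sanity check of the last rule, we can approximate the gradient of \(\mathbf{x}^\top \mathbf{x}\) componentwise by finite differences. This is a self-contained sketch in plain Clojure; the automatic way to obtain gradients follows in Section 2.5.

(let [g (fn [v] (reduce + (map * v v))) ; g(x) = x^T x
      x [1.0 2.0 3.0]
      h 1e-6]
  (mapv (fn [i] (/ (- (g (update x i + h)) (g x)) h))
        (range (count x))))
;=> ≈ [2.0 4.0 6.0], i.e. 2x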

2.4.4. Chain Rule

However, such gradients can be hard to find. This is because multivariate functions in deep learning are often composite, so we may not apply any of the aforementioned rules to differentiate these functions. Fortunately, the chain rule enables us to differentiate composite functions.

Let us first consider functions of a single variable. Suppose that functions \(y=f(u)\) and \(u=g(x)\) are both differentiable, then the chain rule states that

\begin{equation} \label{orgee05063} \frac{dy}{dx} = \frac{dy}{du} \frac{du}{dx}. \end{equation}

Now let us turn our attention to a more general scenario where functions have an arbitrary number of variables. Suppose that the differentiable function \(y\) has variables \(u_1, u_2, \ldots, u_m\), where each differentiable function \(u_i\) has variables \(x_1, x_2, \ldots, x_n\). Note that \(y\) is a function of \(x_1, x_2, \ldots, x_n\). Then the chain rule gives

\begin{equation} \label{orgc0fd4de} \frac{dy}{dx_i} = \frac{dy}{du_1} \frac{du_1}{dx_i} + \frac{dy}{du_2} \frac{du_2}{dx_i} + \cdots + \frac{dy}{du_m} \frac{du_m}{dx_i} \end{equation}

for any \(i = 1, 2, \ldots, n\).
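To see the single-variable rule \eqref{orgee05063} in action, take \(y = u^2\) with \(u = f(x) = 3x^2 - 4x\) from above. Then \(\frac{dy}{dx} = 2u \cdot (6x - 4)\), which at \(x = 2\) (where \(u = 4\)) equals \(64\). A quick numerical check, reusing numerical-lim and f:

(numerical-lim (fn [x] (Math/pow (f x) 2)) 2 1e-6)
;=> ≈ 64.0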

2.4.5. Summary

  • Differential calculus and integral calculus are two branches of calculus, where the former can be applied to the ubiquitous optimization problems in deep learning.
  • A derivative can be interpreted as the instantaneous rate of change of a function with respect to its variable. It is also the slope of the tangent line to the curve of the function.
  • A gradient is a vector whose components are the partial derivatives of a multivariate function with respect to all its variables.
  • The chain rule enables us to differentiate composite functions.

2.4.6. Exercises

  1. Plot the function \(y = f(x) = x^3 - \frac{1}{x}\) and its tangent line when \(x = 1\).

    (defn f2 [x]
      (- (Math/pow x 3) (/ 1 x)))
    
    (numerical-lim f2 1 0.000001)
    
    4.000001999737712
    
    (- (* 4 1) (f2 1))
    
    4.0
    
    (let [x (map #(* 1/100 %) (range 10 300))
          y1 (map f2 x)
          y2 (map #(- (* 4 %) 4) x)
          chart (chart/line {:title "tangent line"
                             :series [{:name "y1"
                                       :xs x
                                       :ys y1}
                                      {:name "y2"
                                       :xs x
                                       :ys y2}]})]
      (plot/store! chart nil "notes/figures/exercise-2.4-1.svg"))
    

    exercise-2.4-1.svg

  2. Find the gradient of the function \(f(\mathbf{x}) = 3x_1^2 + 5e^{x_2}\).

    \begin{equation} \label{org7ef0601} \nabla_\mathbf{x} f(\mathbf{x}) = \bigg[\frac{\partial f(\mathbf{x})}{\partial x_1}, \frac{\partial f(\mathbf{x})}{\partial x_2}\bigg]^\top = \bigg[6x_1, 5e^{x_2}\bigg]^\top \end{equation}
  3. What is the gradient of the function \(f(\mathbf{x}) = \|\mathbf{x}\|_2\)?
  4. Can you write out the chain rule for the case where \(u = f(x, y, z)\) and \(x = x(a, b)\), \(y = y(a, b)\), and \(z = z(a, b)\)?

2.5. Automatic Differentiation

As we have explained in Section 2.4, differentiation is a crucial step in nearly all deep learning optimization algorithms. While the calculations for taking these derivatives are straightforward, requiring only some basic calculus, for complex models, working out the updates by hand can be a pain (and often error-prone).

Deep learning frameworks expedite this work by automatically calculating derivatives, i.e., automatic differentiation. In practice, based on our designed model the system builds a computational graph, tracking which data combined through which operations to produce the output. Automatic differentiation enables the system to subsequently backpropagate gradients. Here, backpropagate simply means to trace through the computational graph, filling in the partial derivatives with respect to each parameter.

(ns clj-d2l.auto-diff
  (:require [clj-djl.ndarray :as nd]
            [clj-djl.training :as t]))

2.5.1. A Simple Example

As a toy example, say that we are interested in differentiating the function \(y = 2\mathbf{x}^{\top}\mathbf{x}\) with respect to the column vector \(\mathbf{x}\). To start, let us create the variable \(x\) and assign it an initial value.

(def ndm (nd/base-manager))
(def x (nd/arange ndm 4.))
x
ND: (4) cpu() float32
[0., 1., 2., 3.]

Before we even calculate the gradient of \(y\) with respect to \(\mathbf{x}\), we will need a place to store it. It is important that we do not allocate new memory every time we take a derivative with respect to a parameter because we will often update the same parameters thousands or millions of times and could quickly run out of memory. Note that a gradient of a scalar-valued function with respect to a vector \(\mathbf{x}\) is itself vector-valued and has the same shape as \(\mathbf{x}\).

(t/set-requires-gradient x true)
(t/get-gradient x)
ND: (4) cpu() float32
[0., 0., 0., 0.]

We place our code inside a with-open and declare the gradient-collector object that will build the computational graph. Now let us calculate \(y\).

Since \(\mathbf{x}\) is a vector of length 4, an inner product of \(\mathbf{x}\) and \(\mathbf{x}\) is performed, yielding the scalar output that we assign to \(y\). Next, we can automatically calculate the gradient of \(y\) with respect to each component of \(\mathbf{x}\) by calling the function for backpropagation and printing the gradient.

(with-open [gc (t/gradient-collector)]
  (let [y (nd/* (nd/dot x x) 2)]
    (println (str x))
    (println (str y))
    (t/backward gc y)))
ND: (4) cpu() float32 hasGradient
[0., 1., 2., 3.]

ND: () cpu() float32
28.

(t/get-gradient x)
ND: (4) cpu() float32
[ 0.,  4.,  8., 12.]

The gradient of the function \(y = 2\mathbf{x}^{\top}\mathbf{x}\) with respect to \(\mathbf{x}\) should be \(4\mathbf{x}\). Let us quickly verify that our desired gradient was calculated correctly.

(nd/= (t/get-gradient x) (nd/* x 4))
ND: (4) cpu() boolean
[ true,  true,  true,  true]

Now let us calculate another function of \(\mathbf{x}\).

(with-open [gc (t/gradient-collector)]
  (let [y (nd/sum x)]
    (println (str x))
    (t/backward gc y)))

(t/get-gradient x)
ND: (4) cpu() float32 hasGradient
[0., 1., 2., 3.]

ND: (4) cpu() float32
[1., 1., 1., 1.]

2.5.2. Backward for Non-Scalar Variables

Technically, when \(y\) is not a scalar, the most natural interpretation of the differentiation of a vector \(\mathbf{y}\) with respect to a vector \(\mathbf{x}\) is a matrix. For higher-order and higher-dimensional \(\mathbf{y}\) and \(\mathbf{x}\), the differentiation result could be a high-order tensor.

However, while these more exotic objects do show up in advanced machine learning (including in deep learning), more often when we are calling backward on a vector, we are trying to calculate the derivatives of the loss functions for each constituent of a batch of training examples. Here, our intent is not to calculate the differentiation matrix but rather the sum of the partial derivatives computed individually for each example in the batch.

(with-open [gc (t/gradient-collector)]
  (let [y (nd/* x x)]
    (t/backward gc y)))
(t/get-gradient x)
ND: (4) cpu() float32
[0., 2., 4., 6.]

2.5.3. Detaching Computation

Sometimes, we wish to move some calculations outside of the recorded computational graph. For example, say that \(\mathbf{y}\) was calculated as a function of \(\mathbf{x}\), and that subsequently \(\mathbf{z}\) was calculated as a function of both \(\mathbf{y}\) and \(\mathbf{x}\). Now, imagine that we wanted to calculate the gradient of \(\mathbf{z}\) with respect to \(\mathbf{x}\), but wanted for some reason to treat \(\mathbf{y}\) as a constant, and only take into account the role that \(\mathbf{x}\) played after \(\mathbf{y}\) was calculated.

Here, we can detach \(\mathbf{y}\) using stop-gradient to return a new variable \(\mathbf{u}\) that has the same value as \(\mathbf{y}\) but discards any information about how \(\mathbf{y}\) was computed in the computational graph. In other words, the gradient will not flow backwards through \(\mathbf{u}\) to \(\mathbf{x}\). Thus, the following backpropagation function computes the partial derivative of \(\mathbf{z} = \mathbf{u} \times \mathbf{x}\) with respect to \(\mathbf{x}\) while treating \(\mathbf{u}\) as a constant, instead of the partial derivative of \(\mathbf{z} = \mathbf{x} \times \mathbf{x} \times \mathbf{x}\) with respect to \(\mathbf{x}\).

(with-open [gc (t/gradient-collector)]
  (let [y (nd/* x x)
        u (t/stop-gradient y)
        z (nd/* u x)]
    (t/backward gc z)
    (nd/= u (t/get-gradient x))))
ND: (4) cpu() boolean
[ true,  true,  true,  true]

We can subsequently invoke backpropagation on \(\mathbf{y}\) to get the derivative of \(\mathbf{y} = \mathbf{x} \times \mathbf{x}\) with respect to \(\mathbf{x}\), which is \(2 \times \mathbf{x}\).

(with-open [gc (t/gradient-collector)]
  (let [y (nd/* x x)]
    (t/backward gc y)
    (nd/= (t/get-gradient x) (nd/* x 2))))
ND: (4) cpu() boolean
[ true,  true,  true,  true]

2.5.4. Computing the Gradient of Clojure Control Flow

One benefit of using automatic differentiation is that even if building the computational graph of a function required passing through a maze of Clojure control flow (e.g., conditionals, loops, and arbitrary function calls), we can still calculate the gradient of the resulting variable. In the following snippet, note that the number of iterations of the loop and the evaluation of the if statement both depend on the value of the input \(\mathbf{a}\).

(defn f [a]
  (loop [b (nd/* a 2)]
    (if (nd/get-element (.lt (nd/norm b) 1000))
      (recur (nd/* b 2))
      (if (nd/get-element (.gt (nd/sum b) 0))
        b
        (nd/* b 100)))))

Let us compute the gradient.

We can then analyze the f function defined above. Note that it is piecewise linear in its input \(\mathbf{a}\). In other words, for any \(\mathbf{a}\) there exists some constant scalar \(k\) such that \(f(\mathbf{a}) = k \times \mathbf{a}\), where the value of \(k\) depends on the input \(\mathbf{a}\). Since the gradient of \(f\) with respect to \(\mathbf{a}\) is then just \(k\), evaluating (nd// d a) allows us to verify that the gradient is correct.

(def a (nd/random-normal ndm [10]))
a
ND: (10) cpu() float32
[-1.475 ,  1.5194, -0.5241,  1.9041,  1.2663, -1.5734,  0.8951, -0.1401, -0.6016,  0.2967]
(t/set-requires-gradient a true)
(with-open [gc (t/gradient-collector)]
  (let [d (f a)]
    (t/backward gc d)
    (println (str (nd// d a)))
    (println (str (nd/= (t/get-gradient a) (nd// d a))))))
ND: (10) cpu() float32
[512., 512., 512., 512., 512., 512., 512., 512., 512., 512.]

ND: (10) cpu() boolean
[ true,  true,  true,  true,  true,  true,  true,  true,  true,  true]

2.5.5. Summary

  • Deep learning frameworks can automate the calculation of derivatives. To use it, we first attach gradients to those variables with respect to which we desire partial derivatives. We then record the computation of our target value, execute its function for backpropagation, and access the resulting gradient.

2.6. Probability

In some form or another, machine learning is all about making predictions. We might want to predict the probability of a patient suffering a heart attack in the next year, given their clinical history. In anomaly detection, we might want to assess how likely a set of readings from an airplane’s jet engine would be, were it operating normally. In reinforcement learning, we want an agent to act intelligently in an environment. This means we need to think about the probability of getting a high reward under each of the available actions. And when we build recommender systems we also need to think about probability. For example, say hypothetically that we worked for a large online bookseller. We might want to estimate the probability that a particular user would buy a particular book. For this we need to use the language of probability. Entire courses, majors, theses, careers, and even departments, are devoted to probability. So naturally, our goal in this section is not to teach the whole subject. Instead we hope to get you off the ground, to teach you just enough that you can start building your first deep learning models, and to give you enough of a flavor for the subject that you can begin to explore it on your own if you wish.

We have already invoked probabilities in previous sections without articulating what precisely they are or giving a concrete example. Let us get more serious now by considering the first case: distinguishing cats and dogs based on photographs. This might sound simple but it is actually a formidable challenge. To start with, the difficulty of the problem may depend on the resolution of the image.

2.6.1. Basic Probability Theory

Say that we cast a die and want to know what the chance is of seeing a \(1\) rather than another digit. If the die is fair, all six outcomes \(\{1, \ldots, 6\}\) are equally likely to occur, and thus we would see a \(1\) in one out of six cases. Formally we state that \(1\) occurs with probability \(\frac{1}{6}\).

For a real die that we receive from a factory, we might not know those proportions and we would need to check whether it is tainted. The only way to investigate the die is by casting it many times and recording the outcomes. For each cast of the die, we will observe a value in \(\{1, \ldots, 6\}\). Given these outcomes, we want to investigate the probability of observing each outcome.

One natural approach for each value is to take the individual count for that value and to divide it by the total number of tosses. This gives us an estimate of the probability of a given event. The law of large numbers tells us that as the number of tosses grows, this estimate will draw closer and closer to the true underlying probability. Before going into the details of what is going on here, let us try it out.

To start, let us import the necessary packages.

(ns clj-d2l.probability
  (:require [clj-djl.ndarray :as nd]
            [clj-djl.training :as t]
            [clj-chart.chart :as chart]
            [clj-chart.plot :as plot]
            [clojure.java.io :as io]
            [clj-d2l.core :as d2l]))

Next, we will want to be able to cast the die. In statistics we call this process of drawing examples from probability distributions sampling. The distribution that assigns probabilities to a number of discrete choices is called the multinomial distribution. We will give a more formal definition of distribution later, but at a high level, think of it as just an assignment of probabilities to events.

To draw a single sample, we simply pass in a vector of probabilities. The output is another vector of the same length: its value at index \(i\) is the number of times the sampling outcome corresponds to \(i\).

(def ndm (nd/new-base-manager))
(def fair-probs (nd/create ndm (repeat 6 (/ 1.0 6))))
;=> an NDArray of six probabilities, each 1/6 ≈ 0.1667
(nd/random-multinomial ndm 1 fair-probs)
ND: (6) cpu() int64
[ 0,  0,  1,  0,  0,  0]

If you run the sampler a bunch of times, you will find that you get out random values each time. As with estimating the fairness of a die, we often want to generate many samples from the same distribution. It would be unbearably slow to do this with a Clojure loop, so the function we are using supports drawing multiple samples at once, returning an array of independent samples in any shape we might desire.

(nd/random-multinomial ndm 10 fair-probs)
ND: (6) cpu() int64
[ 3,  1,  0,  0,  3,  3]

Now that we know how to sample rolls of a die, we can simulate 1000 rolls. We can then go through and count, after each of the 1000 rolls, how many times each number was rolled. Specifically, we calculate the relative frequency as the estimate of the true probability.

(def counts (nd/random-multinomial ndm 1000 fair-probs))
(nd// counts 1000)
ND: (6) cpu() float32
[0.183, 0.156, 0.182, 0.169, 0.153, 0.157]

Because we generated the data from a fair die, we know that each outcome has true probability \(\frac{1}{6}\), roughly \(0.167\), so the above output estimates look good.

We can also visualize how these probabilities converge over time towards the true probability. Let us conduct 500 groups of experiments where each group draws 10 samples.

(def counts (nd/random-multinomial ndm 10 fair-probs [500]))
(def cum-counts (nd/cumsum counts 0))
(def estimates (nd// cum-counts (nd/sum cum-counts 1 true)))
(let [x (range 0 500)
      dies (range 0 6)
      names (mapv #(str "P(die=" % ")") dies)
      ys (mapv #(nd/to-vec (nd/get (nd/transpose estimates) %)) dies)
      series (concat [{:name "0.167"
                       :xs x
                       :ys (repeat 500 0.167)}]
                     (map (fn [n y] {:name n :xs x :ys y})
                          names ys))
      c (chart/line {:series series})]
  (plot/store! c nil "notes/figures/probability_dies.svg"))

probability_dies.svg

Each solid curve corresponds to one of the six values of the die and gives our estimated probability that the die turns up that value as assessed after each group of experiments. The horizontal line labeled 0.167 gives the true underlying probability. As we get more data by conducting more experiments, the \(6\) solid curves converge towards the true probability.

2.6.2. Axioms of Probability Theory

When dealing with the rolls of a die, we call the set \(\mathcal{S} = \{1, 2, 3, 4, 5, 6\}\) the sample space or outcome space, where each element is an outcome. An event is a set of outcomes from a given sample space. For instance, “seeing a \(5\)” (\(\{5\}\)) and “seeing an odd number” (\(\{1, 3, 5\}\)) are both valid events of rolling a die. Note that if the outcome of a random experiment is in event \(\mathcal{A}\), then event \(\mathcal{A}\) has occurred. That is to say, if \(3\) dots faced up after rolling a die, since \(3 \in \{1, 3, 5\}\), we can say that the event “seeing an odd number” has occurred.

Formally, probability can be thought of as a function that maps a set to a real value. The probability of an event \(\mathcal{A}\) in the given sample space \(\mathcal{S}\), denoted as \(P(\mathcal{A})\), satisfies the following properties:

  • For any event \(\mathcal{A}\), its probability is never negative, i.e., \(P(\mathcal{A}) \geq 0\);
  • Probability of the entire sample space is \(1\), i.e., \(P(\mathcal{S}) = 1\);
  • For any countable sequence of events \(\mathcal{A}_1, \mathcal{A}_2, \ldots\) that are mutually exclusive (\(\mathcal{A}_i \cap \mathcal{A}_j = \emptyset\) for all \(i \neq j\)), the probability that any happens is equal to the sum of their individual probabilities, i.e., \(P(\bigcup_{i=1}^{\infty} \mathcal{A}_i) = \sum_{i=1}^{\infty} P(\mathcal{A}_i)\).

These are also the axioms of probability theory, proposed by Kolmogorov in 1933. Thanks to this axiom system, we can avoid any philosophical dispute on randomness; instead, we can reason rigorously with a mathematical language. For instance, by letting event \(\mathcal{A}_1\) be the entire sample space and \(\mathcal{A}_i = \emptyset\) for all \(i > 1\), we can prove that \(P(\emptyset) = 0\), i.e., the probability of an impossible event is \(0\).
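We can also check the additivity axiom empirically: for the mutually exclusive events \(\{1\}\), \(\{3\}\), and \(\{5\}\), the relative frequency of their union (“seeing an odd number”) equals the sum of their individual relative frequencies, and both approach \(\frac{1}{2}\). A self-contained sketch in plain Clojure:

(let [rolls (repeatedly 10000 #(inc (rand-int 6)))
      freq (fn [pred] (/ (count (filter pred rolls)) 10000.0))]
  ;; the two values agree exactly on any sample; both are near 0.5
  [(freq odd?) (+ (freq #{1}) (freq #{3}) (freq #{5}))])
;=> e.g. [0.5011 0.5011]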

2.6.3. Random Variables

In our random experiment of casting a die, we introduced the notion of a random variable. A random variable can be pretty much any quantity and is not deterministic. It could take one value among a set of possibilities in a random experiment. Consider a random variable \(X\) whose value is in the sample space \(\mathcal{S} = \{1, 2, 3, 4, 5, 6\}\) of rolling a die. We can denote the event “seeing a \(5\)” as \(\{X = 5\}\) or \(X = 5\), and its probability as \(P(\{X = 5\})\) or \(P(X = 5)\). By \(P(X = a)\), we make a distinction between the random variable \(X\) and the values (e.g., \(a\)) that \(X\) can take. However, such pedantry results in a cumbersome notation. For a compact notation, on one hand, we can just denote \(P(X)\) as the distribution over the random variable \(X\): the distribution tells us the probability that \(X\) takes any value. On the other hand, we can simply write \(P(a)\) to denote the probability that a random variable takes the value \(a\). Since an event in probability theory is a set of outcomes from the sample space, we can specify a range of values for a random variable to take. For example, \(P(1 \leq X \leq 3)\) denotes the probability of the event \(\{1 \leq X \leq 3\}\), which means \(\{X = 1, 2, \text{or}, 3\}\). Equivalently, \(P(1 \leq X \leq 3)\) represents the probability that the random variable \(X\) can take a value from \(\{1, 2, 3\}\).

Note that there is a subtle difference between discrete random variables, like the sides of a die, and continuous ones, like the weight and the height of a person. There is little point in asking whether two people have exactly the same height. If we take precise enough measurements you will find that no two people on the planet have the exact same height. In fact, if we take a fine enough measurement, you will not have the same height when you wake up and when you go to sleep. So there is no purpose in asking about the probability that someone is 1.80139278291028719210196740527486202 meters tall. Given the world population of humans the probability is virtually 0. It makes more sense in this case to ask whether someone’s height falls into a given interval, say between 1.79 and 1.81 meters. In these cases we quantify the likelihood that we see a value as a density. The height of exactly 1.80 meters has no probability, but nonzero density. In the interval between any two different heights we have nonzero probability. In the rest of this section, we consider probability in discrete space. For probability over continuous random variables, you may refer to Section 18.6.

2.6.4. Dealing with Multiple Random Variables

Very often, we will want to consider more than one random variable at a time. For instance, we may want to model the relationship between diseases and symptoms. Given a disease and a symptom, say “flu” and “cough”, either may or may not occur in a patient with some probability. While we hope that the probability of both would be close to zero, we may want to estimate these probabilities and their relationships to each other so that we may apply our inferences to effect better medical care.

As a more complicated example, images contain millions of pixels, thus millions of random variables. And in many cases images will come with a label, identifying objects in the image. We can also think of the label as a random variable. We can even think of all the metadata as random variables such as location, time, aperture, focal length, ISO, focus distance, and camera type. All of these are random variables that occur jointly. When we deal with multiple random variables, there are several quantities of interest.

2.6.5. Joint Probability

The first is called the joint probability \(P(A = a, B=b)\). Given any values \(a\) and \(b\), the joint probability lets us answer, what is the probability that \(A=a\) and \(B=b\) simultaneously? Note that for any values \(a\) and \(b\), \(P(A=a, B=b) \leq P(A=a)\). This has to be the case, since for \(A=a\) and \(B=b\) to happen, \(A=a\) has to happen and \(B=b\) also has to happen (and vice versa). Thus, \(A=a\) and \(B=b\) cannot be more likely than \(A=a\) or \(B=b\) individually.

2.6.6. Conditional Probability

This brings us to an interesting ratio: \(0 \leq \frac{P(A=a, B=b)}{P(A=a)} \leq 1\). We call this ratio a conditional probability and denote it by \(P(B=b \mid A=a)\): it is the probability of \(B=b\), provided that \(A=a\) has occurred.

2.6.7. Bayes’ theorem

Using the definition of conditional probabilities, we can derive one of the most useful and celebrated equations in statistics: Bayes’ theorem. It goes as follows. By construction, we have the multiplication rule that \(P(A, B) = P(B \mid A) P(A)\). By symmetry, this also holds for \(P(A, B) = P(A \mid B) P(B)\). Assume that \(P(B) > 0\). Solving for one of the conditional variables we get

\begin{equation} \label{org0ad706a} P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)}. \end{equation}

Note that here we use the more compact notation where \(P(A, B)\) is a joint distribution and \(P(A \mid B)\) is a conditional distribution. Such distributions can be evaluated for particular values \(A = a, B=b\).

2.6.8. Marginalization

Bayes’ theorem is very useful if we want to infer one thing from the other, say cause and effect, but we only know the properties in the reverse direction, as we will see later in this section. One important operation that we need to make this work is marginalization. It is the operation of determining \(P(B)\) from \(P(A, B)\). We can see that the probability of \(B\) amounts to accounting for all possible choices of \(A\) and aggregating the joint probabilities over all of them:

\begin{equation} \label{org9485ae5} P(B) = \sum_{A} P(A, B), \end{equation}

which is also known as the sum rule. The probability or distribution as a result of marginalization is called a marginal probability or a marginal distribution.

2.6.9. Independence

Another useful property to check for is dependence vs. independence. Two random variables \(A\) and \(B\) being independent means that the occurrence of one event of \(A\) does not reveal any information about the occurrence of an event of \(B\). In this case \(P(B \mid A) = P(B)\). Statisticians typically express this as \(A \perp B\). From Bayes’ theorem, it follows immediately that also \(P(A \mid B) = P(A)\). In all the other cases we call \(A\) and \(B\) dependent. For instance, two successive rolls of a die are independent. In contrast, the position of a light switch and the brightness in the room are not (they are not perfectly deterministic, though, since we could always have a broken light bulb, power failure, or a broken switch).

Since \(P(A \mid B) = \frac{P(A, B)}{P(B)} = P(A)\) is equivalent to \(P(A, B) = P(A)P(B)\), two random variables are independent if and only if their joint distribution is the product of their individual distributions. Likewise, two random variables \(A\) and \(B\) are conditionally independent given another random variable \(C\) if and only if \(P(A, B \mid C) = P(A \mid C)P(B \mid C)\). This is expressed as \(A \perp B \mid C\).
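For instance, we can verify empirically that two independent die rolls satisfy \(P(A, B) = P(A)P(B)\). With \(A\) being “the first roll is a 6” and \(B\) being “the second roll is a 6”, both sides should be near \(\frac{1}{36} \approx 0.0278\). A self-contained sketch:

(let [n 100000
      pairs (repeatedly n (fn [] [(inc (rand-int 6)) (inc (rand-int 6))]))
      p (fn [pred] (/ (count (filter pred pairs)) (double n)))
      p-a (p #(= 6 (first %)))   ; P(A)
      p-b (p #(= 6 (second %)))  ; P(B)
      p-ab (p #(= [6 6] %))]     ; P(A, B)
  [p-ab (* p-a p-b)])
;=> two values, both near 1/36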

2.6.10. Application

Let us put our skills to the test. Assume that a doctor administers an HIV test to a patient. This test is fairly accurate: if the patient is healthy, it fails with only 1% probability, reporting him as diseased. Moreover, it never fails to detect HIV if the patient actually has it. We use \(D_1\) to indicate the diagnosis (\(1\) if positive and \(0\) if negative) and \(H\) to denote the HIV status (\(1\) if positive and \(0\) if negative). Table 1 lists such conditional probabilities.

Table 1: Conditional probability of \(P(D_1 \mid H)\).

| Conditional probability | \(H=1\) | \(H=0\) |
|-------------------------+---------+---------|
| \(P(D_1 = 1 \mid H)\)   | 1       | 0.01    |
| \(P(D_1 = 0 \mid H)\)   | 0       | 0.99    |

Note that the column sums are all 1 (but the row sums are not), since the conditional probability needs to sum up to 1, just like the probability. Let us work out the probability of the patient having HIV if the test comes back positive, i.e., \(P(H = 1 \mid D_1 = 1)\). Obviously this is going to depend on how common the disease is, since it affects the number of false alarms. Assume that the population is quite healthy, e.g., \(P(H=1) = 0.0015\). To apply Bayes’ theorem, we need to apply marginalization and the multiplication rule to determine

\begin{equation} \begin{aligned} &P(D_1 = 1) \\ =& P(D_1=1, H=0) + P(D_1=1, H=1) \\ =& P(D_1=1 \mid H=0) P(H=0) + P(D_1=1 \mid H=1) P(H=1) \\ =& 0.011485. \end{aligned} \end{equation}

Thus, we get

\begin{equation} \begin{aligned} &P(H = 1 \mid D_1 = 1)\\ =& \frac{P(D_1=1 \mid H=1) P(H=1)}{P(D_1=1)} \\ =& 0.1306 \end{aligned} \end{equation}

In other words, there is only a 13.06% chance that the patient actually has HIV, despite using a very accurate test. As we can see, probability can be counterintuitive.
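The same calculation is easy to mirror in plain Clojure, with the probabilities from Table 1 and the prior hard-coded:

(let [p-h1 0.0015                 ; P(H=1), the prior
      p-h0 (- 1 p-h1)             ; P(H=0)
      p-d1-h1 1.0                 ; P(D1=1 | H=1)
      p-d1-h0 0.01                ; P(D1=1 | H=0)
      p-d1 (+ (* p-d1-h0 p-h0)    ; marginalization and multiplication rule
              (* p-d1-h1 p-h1))]
  (/ (* p-d1-h1 p-h1) p-d1))      ; Bayes' theorem
;=> ≈ 0.1306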

What should a patient do upon receiving such terrifying news? Likely, the patient would ask the physician to administer another test to get clarity. The second test has different characteristics and it is not as good as the first one, as shown in Table 2.

Table 2: Conditional probability of \(P(D_2 \mid H)\).

| Conditional probability | \(H=1\) | \(H=0\) |
|-------------------------+---------+---------|
| \(P(D_2 = 1 \mid H)\)   | 0.98    | 0.03    |
| \(P(D_2 = 0 \mid H)\)   | 0.02    | 0.97    |

Unfortunately, the second test comes back positive, too. Let us work out the requisite probabilities to invoke Bayes’ theorem by assuming the conditional independence:

\begin{equation} \begin{aligned} &P(D_1 = 1, D_2 = 1 \mid H = 0) \\ =& P(D_1 = 1 \mid H = 0) P(D_2 = 1 \mid H = 0) \\ =& 0.0003, \end{aligned} \end{equation} \begin{equation} \begin{aligned} &P(D_1 = 1, D_2 = 1 \mid H = 1) \\ =& P(D_1 = 1 \mid H = 1) P(D_2 = 1 \mid H = 1) \\ =& 0.98. \end{aligned} \end{equation}

Now we can apply marginalization and the multiplication rule:

\begin{equation} \begin{aligned} &P(D_1 = 1, D_2 = 1) \\ =& P(D_1 = 1, D_2 = 1, H = 0) + P(D_1 = 1, D_2 = 1, H = 1) \\ =& P(D_1 = 1, D_2 = 1 \mid H = 0)P(H=0) + P(D_1 = 1, D_2 = 1 \mid H = 1)P(H=1)\\ =& 0.00176955. \end{aligned} \end{equation}

In the end, the probability of the patient having HIV given both positive tests is

\begin{equation} \begin{aligned} &P(H = 1 \mid D_1 = 1, D_2 = 1)\\ =& \frac{P(D_1 = 1, D_2 = 1 \mid H=1) P(H=1)}{P(D_1 = 1, D_2 = 1)} \\ =& 0.8307. \end{aligned} \end{equation}

That is, the second test allowed us to gain much higher confidence that not all is well. Despite the second test being considerably less accurate than the first one, it still significantly improved our estimate.
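The same arithmetic in plain Clojure, again assuming conditional independence given \(H\):

(let [p-h1 0.0015
      p-h0 (- 1 p-h1)
      p-dd-h1 (* 1.0 0.98)   ; P(D1=1, D2=1 | H=1)
      p-dd-h0 (* 0.01 0.03)  ; P(D1=1, D2=1 | H=0)
      p-dd (+ (* p-dd-h0 p-h0) (* p-dd-h1 p-h1))]
  (/ (* p-dd-h1 p-h1) p-dd))
;=> ≈ 0.8307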

2.6.11. Expectation and Variance

To summarize key characteristics of probability distributions, we need some measures. The expectation (or average) of the random variable \(X\) is denoted as

\begin{equation} E[X] = \sum_{x} x P(X = x). \end{equation}
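For a fair die, for example, \(E[X] = \sum_{x=1}^6 x \cdot \frac{1}{6} = 3.5\), which we can confirm directly:

(double (reduce + (map #(* % 1/6) (range 1 7))))
;=> 3.5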

When the input of a function \(f(x)\) is a random variable drawn from the distribution \(P\) with different values \(x\), the expectation of \(f(x)\) is computed as

\begin{equation} E_{x \sim P}[f(x)] = \sum_x f(x) P(x). \end{equation}

In many cases we want to measure by how much the random variable \(X\) deviates from its expectation. This can be quantified by the variance

\begin{equation} \mathrm{Var}[X] = E\left[(X - E[X])^2\right] = E[X^2] - E[X]^2. \end{equation}

Its square root is called the standard deviation. The variance of a function of a random variable measures by how much the function deviates from the expectation of the function, as different values \(x\) of the random variable are sampled from its distribution:

\begin{equation} \mathrm{Var}[f(x)] = E\left[\left(f(x) - E[f(x)]\right)^2\right]. \end{equation}
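Continuing the fair-die example, \(E[X^2] = \frac{91}{6}\) and \(E[X]^2 = \frac{49}{4}\), so \(\mathrm{Var}[X] = \frac{35}{12} \approx 2.92\):

(let [xs (range 1 7)
      e-x (/ (reduce + xs) 6)                   ; E[X] = 7/2
      e-x2 (/ (reduce + (map #(* % %) xs)) 6)]  ; E[X^2] = 91/6
  (double (- e-x2 (* e-x e-x))))
;=> ≈ 2.9167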

2.6.12. Summary

  • We can sample from probability distributions.
  • We can analyze multiple random variables using joint distribution, conditional distribution, Bayes’ theorem, marginalization, and independence assumptions.
  • Expectation and variance offer useful measures to summarize key characteristics of probability distributions.

3. Linear Neural Networks

Before we get into the details of deep neural networks, we need to cover the basics of neural network training. In this chapter, we will cover the entire training process, including defining simple neural network architectures, handling data, specifying a loss function, and training the model. In order to make things easier to grasp, we begin with the simplest concepts. Fortunately, classic statistical learning techniques such as linear and softmax regression can be cast as linear neural networks. Starting from these classic algorithms, we will introduce you to the basics, providing the basis for more complex techniques in the rest of the book.

3.1. Linear Regression

Regression refers to a set of methods for modeling the relationship between one or more independent variables and a dependent variable. In the natural sciences and social sciences, the purpose of regression is most often to characterize the relationship between the inputs and outputs. Machine learning, on the other hand, is most often concerned with prediction.

Regression problems pop up whenever we want to predict a numerical value. Common examples include predicting prices (of homes, stocks, etc.), predicting length of stay (for patients in the hospital), demand forecasting (for retail sales), among countless others. Not every prediction problem is a classic regression problem. In subsequent sections, we will introduce classification problems, where the goal is to predict membership among a set of categories.

3.1.1. Basic Elements of Linear Regression

Linear regression may be both the simplest and most popular among the standard tools for regression. Dating back to the dawn of the 19th century, linear regression flows from a few simple assumptions. First, we assume that the relationship between the independent variables \(\mathbf{x}\) and the dependent variable \(y\) is linear, i.e., that \(y\) can be expressed as a weighted sum of the elements in \(\mathbf{x}\), given some noise on the observations. Second, we assume that any noise is well-behaved (following a Gaussian distribution).

To motivate the approach, let us start with a running example. Suppose that we wish to estimate the prices of houses (in dollars) based on their area (in square feet) and age (in years). To actually develop a model for predicting house prices, we would need to get our hands on a dataset consisting of sales for which we know the sale price, area, and age for each home. In the terminology of machine learning, the dataset is called a training dataset or training set, and each row (here the data corresponding to one sale) is called an example (or data point, data instance, sample). The thing we are trying to predict (price) is called a label (or target). The independent variables (age and area) upon which the predictions are based are called features (or covariates).

Typically, we will use \(n\) to denote the number of examples in our dataset. We index the data examples by \(i\), denoting each input as \(\mathbf{x}^{(i)} = [x_1^{(i)}, x_2^{(i)}]^\top\) and the corresponding label as \(y^{(i)}\).

3.1.1.1. Linear Model

The linearity assumption just says that the target (price) can be expressed as a weighted sum of the features (area and age):

\begin{equation} \label{org21b6b79} \mathrm{price} = w_{\mathrm{area}} \cdot \mathrm{area} + w_{\mathrm{age}} \cdot \mathrm{age} + b. \end{equation}

In equation \eqref{org21b6b79}, \(w_{\mathrm{area}}\) and \(w_{\mathrm{age}}\) are called weights, and \(b\) is called a bias (also called an offset or intercept). The weights determine the influence of each feature on our prediction and the bias just says what value the predicted price should take when all of the features take value 0. Even if we will never see any homes with zero area, or that are precisely zero years old, we still need the bias or else we will limit the expressivity of our model. Strictly speaking, equation \eqref{org21b6b79} is an affine transformation of input features, which is characterized by a linear transformation of features via weighted sum, combined with a translation via the added bias.
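For illustration only, here is what such a model looks like as a tiny Clojure function; the weights and bias below are made up, not learned from any data:

(defn predict-price
  "Hypothetical affine model: price = w-area * area + w-age * age + b."
  [area age]
  (let [w-area 100.0    ; made-up dollars per square foot
        w-age  -500.0   ; made-up dollars per year of age
        b      50000.0] ; made-up bias (intercept)
    (+ (* w-area area) (* w-age age) b)))

(predict-price 1500 10)
;=> 195000.0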

Given a dataset, our goal is to choose the weights \(\mathbf{w}\) and the bias \(b\) such that on average, the predictions made according to our model best fit the true prices observed in the data. Models whose output prediction is determined by the affine transformation of input features are linear models, where the affine transformation is specified by the chosen weights and bias.

In disciplines where it is common to focus on datasets with just a few features, explicitly expressing models long-form like this is common. In machine learning, we usually work with high-dimensional datasets, so it is more convenient to employ linear algebra notation. When our inputs consist of \(d\) features, we express our prediction \(\hat{y}\) (in general the “hat” symbol denotes estimates) as

\begin{equation} \hat{y} = w_1 x_1 + ... + w_d x_d + b. \end{equation}

Collecting all features into a vector \(\mathbf{x} \in \mathbb{R}^d\) and all weights into a vector \(\mathbf{w} \in \mathbb{R}^d\), we can express our model compactly using a dot product:

\begin{equation} \label{org89c0d05} \hat{y} = \mathbf{w}^\top \mathbf{x} + b. \end{equation}

In equation \eqref{org89c0d05}, the vector \(\mathbf{x}\) corresponds to features of a single data example. We will often find it convenient to refer to features of our entire dataset of \(n\) examples via the design matrix \(\mathbf{X} \in \mathbb{R}^{n \times d}\). Here, \(\mathbf{X}\) contains one row for every example and one column for every feature.

For a collection of features \(\mathbf{X}\), the predictions \(\hat{\mathbf{y}} \in \mathbb{R}^n\) can be expressed via the matrix-vector product:

\begin{equation} {\hat{\mathbf{y}}} = \mathbf{X} \mathbf{w} + b, \end{equation}

where broadcasting (see 2.1-data-manipulation.html#9dcbe412-db7e-485a-bb3c-d7181f2f7f05) is applied during the summation. Given features of a training dataset \(\mathbf{X}\) and corresponding (known) labels \(\mathbf{y}\), the goal of linear regression is to find the weight vector \(\mathbf{w}\) and the bias term \(b\) such that, given features of a new data example sampled from the same distribution as \(\mathbf{X}\), the new example’s label will (in expectation) be predicted with the lowest error.
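As a small sketch of the matrix-vector form (assuming nd/reshape accepts a shape vector and nd/dot handles the matrix-vector product, as elsewhere in clj-djl):

(let [ndm (nd/base-manager)
      X (nd/reshape (nd/arange ndm 6.) [3 2]) ; 3 examples, 2 features
      w (nd/create ndm [2. -3.4])
      b 4.2]
  ;; X w + b, with the scalar b broadcast across the 3 predictions
  (nd/+ (nd/dot X w) b))
;=> ≈ [0.8, -2.0, -4.8]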

Even if we believe that the best model for predicting \(y\) given \(\mathbf{x}\) is linear, we would not expect to find a real-world dataset of \(n\) examples where \(y^{(i)}\) exactly equals \(\mathbf{w}^\top \mathbf{x}^{(i)}+b\) for all \(1 \leq i \leq n\). For example, whatever instruments we use to observe the features \(\mathbf{X}\) and labels \(\mathbf{y}\) might suffer a small amount of measurement error. Thus, even when we are confident that the underlying relationship is linear, we will incorporate a noise term to account for such errors.

Before we can go about searching for the best parameters (or model parameters) \(\mathbf{w}\) and \(b\), we will need two more things: (i) a quality measure for some given model; and (ii) a procedure for updating the model to improve its quality.

3.1.1.2. Loss Function

Before we start thinking about how to fit data with our model, we need to determine a measure of fitness. The loss function quantifies the distance between the real and predicted value of the target. The loss will usually be a non-negative number where smaller values are better and perfect predictions incur a loss of 0. The most popular loss function in regression problems is the squared error. When our prediction for an example \(i\) is \(\hat{y}^{(i)}\) and the corresponding true label is \(y^{(i)}\), the squared error is given by:

\begin{equation} l^{(i)}(\mathbf{w}, b) = \frac{1}{2} \left(\hat{y}^{(i)} - y^{(i)}\right)^2. \end{equation}

The constant \(\frac{1}{2}\) makes no real difference but will prove notationally convenient, canceling out when we take the derivative of the loss. Since the training dataset is given to us, and thus out of our control, the empirical error is only a function of the model parameters. To make things more concrete, consider the example below where we plot a regression problem for a one-dimensional case, as shown in Figure 1.

fit-linreg.svg

Figure 1: Fit data with a linear model.

Note that large differences between estimates \(\hat{y}^{(i)}\) and observations \(y^{(i)}\) lead to even larger contributions to the loss, due to the quadratic dependence. To measure the quality of a model on the entire dataset of \(n\) examples, we simply average (or equivalently, sum) the losses on the training set.

\begin{equation} L(\mathbf{w}, b) =\frac{1}{n}\sum_{i=1}^n l^{(i)}(\mathbf{w}, b) =\frac{1}{n} \sum_{i=1}^n \frac{1}{2}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2. \end{equation}
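In plain Clojure, the per-example squared error and the average loss over a dataset might look like the following sketch; the real implementation later in the book operates on NDArrays instead.

(defn squared-loss
  "Squared error for a single example: 1/2 (y-hat - y)^2."
  [y-hat y]
  (* 0.5 (Math/pow (- y-hat y) 2)))

(defn average-loss
  "Mean squared-error loss over pairs [y-hat y]."
  [pairs]
  (/ (reduce + (map (fn [[y-hat y]] (squared-loss y-hat y)) pairs))
     (count pairs)))

(average-loss [[2.5 2.0] [1.0 1.0] [0.0 1.0]])
;=> ≈ 0.2083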

When training the model, we want to find parameters (\(\mathbf{w}^*, b^*\)) that minimize the total loss across all training examples:

\begin{equation} \mathbf{w}^*, b^* = \operatorname*{argmin}_{\mathbf{w}, b}\ L(\mathbf{w}, b). \end{equation}

3.1.1.3. Analytic Solution

Linear regression happens to be an unusually simple optimization problem. Unlike most other models that we will encounter in this book, linear regression can be solved analytically by applying a simple formula. To start, we can subsume the bias \(b\) into the parameter \(\mathbf{w}\) by appending a column to the design matrix consisting of all ones. Then our prediction problem is to minimize \(\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2\). There is just one critical point on the loss surface and it corresponds to the minimum of the loss over the entire domain. Taking the derivative of the loss with respect to \(\mathbf{w}\) and setting it equal to zero yields the analytic (closed-form) solution:

\begin{equation} \mathbf{w}^* = (\mathbf X^\top \mathbf X)^{-1}\mathbf X^\top \mathbf{y}. \end{equation}

While simple problems like linear regression may admit analytic solutions, you should not get used to such good fortune. Although analytic solutions allow for nice mathematical analysis, the requirement of an analytic solution is so restrictive that it would exclude all of deep learning.

3.1.1.4. Minibatch Stochastic Gradient Descent

Even in cases where we cannot solve the models analytically, it turns out that we can still train models effectively in practice. Moreover, for many tasks, those difficult-to-optimize models turn out to be so much better that figuring out how to train them ends up being well worth the trouble.

The key technique for optimizing nearly any deep learning model, and which we will call upon throughout this book, consists of iteratively reducing the error by updating the parameters in the direction that incrementally lowers the loss function. This algorithm is called gradient descent.

The most naive application of gradient descent consists of taking the derivative of the loss function, which is an average of the losses computed on every single example in the dataset. In practice, this can be extremely slow: we must pass over the entire dataset before making a single update. Thus, we will often settle for sampling a random minibatch of examples every time we need to compute the update, a variant called minibatch stochastic gradient descent.

In each iteration, we first randomly sample a minibatch \(\mathcal{B}\) consisting of a fixed number of training examples. We then compute the derivative (gradient) of the average loss on the minibatch with regard to the model parameters. Finally, we multiply the gradient by a predetermined positive value \(\eta\) and subtract the resulting term from the current parameter values.

We can express the update mathematically as follows (\(\partial\) denotes the partial derivative):

\begin{equation} (\mathbf{w},b) \leftarrow (\mathbf{w},b) - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{(\mathbf{w},b)} l^{(i)}(\mathbf{w},b). \end{equation}

To summarize, steps of the algorithm are the following: (i) we initialize the values of the model parameters, typically at random; (ii) we iteratively sample random minibatches from the data, updating the parameters in the direction of the negative gradient. For quadratic losses and affine transformations, we can write this out explicitly as follows:

\begin{equation} \label{org3306916} \begin{aligned} \mathbf{w} &\leftarrow \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{\mathbf{w}} l^{(i)}(\mathbf{w}, b) = \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \mathbf{x}^{(i)} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right),\\ b &\leftarrow b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_b l^{(i)}(\mathbf{w}, b) = b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right). \end{aligned} \end{equation}

Note that \(\mathbf{w}\) and \(\mathbf{x}\) are vectors in equation \eqref{org3306916}. Here, the more elegant vector notation makes the math much more readable than expressing things in terms of coefficients, say \(w_1, w_2, \ldots, w_d\). The set cardinality \(|\mathcal{B}|\) represents the number of examples in each minibatch (the batch size) and \(\eta\) denotes the learning rate. We emphasize that the values of the batch size and learning rate are manually pre-specified and not typically learned through model training. These parameters that are tunable but not updated in the training loop are called hyperparameters. Hyperparameter tuning is the process by which hyperparameters are chosen, and typically requires that we adjust them based on the results of the training loop as assessed on a separate validation dataset (or validation set).
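To make the update in \eqref{org3306916} concrete, here is a minimal sketch of one minibatch step in plain Clojure, for a single feature and scalar parameters; the helper sgd-step is our own illustration, not the training loop used later in the book.

(defn sgd-step
  "One minibatch SGD update for scalar parameters w and b.
  batch is a sequence of [x y] pairs; eta is the learning rate."
  [w b eta batch]
  (let [m (count batch)
        errors (map (fn [[x y]] (- (+ (* w x) b) y)) batch)
        grad-w (/ (reduce + (map * errors (map first batch))) m)
        grad-b (/ (reduce + errors) m)]
    [(- w (* eta grad-w)) (- b (* eta grad-b))]))

(sgd-step 0.0 0.0 0.1 [[1.0 3.0] [2.0 5.0]])
;=> ≈ [0.65 0.4]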

After training for some predetermined number of iterations (or until some other stopping criteria are met), we record the estimated model parameters, denoted \(\hat{\mathbf{w}}, \hat{b}\). Note that even if our function is truly linear and noiseless, these parameters will not be the exact minimizers of the loss because, although the algorithm converges slowly towards the minimizers, it cannot achieve them exactly in a finite number of steps.

Linear regression happens to be a learning problem where there is only one minimum over the entire domain. However, for more complicated models, like deep networks, the loss surfaces contain many minima. Fortunately, for reasons that are not yet fully understood, deep learning practitioners seldom struggle to find parameters that minimize the loss on training sets. The more formidable task is to find parameters that will achieve low loss on data that we have not seen before, a challenge called generalization. We return to these topics throughout the book.

3.1.1.5. Making Predictions with the Learned Model

Given the learned linear regression model \(\hat{\mathbf{w}}^\top \mathbf{x} + \hat{b}\), we can now estimate the price of a new house (not contained in the training data) given its area \(x_1\) and age \(x_2\). Estimating targets given features is commonly called prediction or inference.

We will try to stick with prediction because calling this step inference, despite emerging as standard jargon in deep learning, is somewhat of a misnomer. In statistics, inference more often denotes estimating parameters based on a dataset. This misuse of terminology is a common source of confusion when deep learning practitioners talk to statisticians.

3.1.2. Vectorization for Speed

When training our models, we typically want to process whole minibatches of examples simultaneously. Doing this efficiently requires that we vectorize the calculations and leverage fast linear algebra libraries rather than writing costly loops in Clojure.

We will use dm3/stopwatch to measure elapsed time.

(ns clj-d2l.linear_regression
  (:require [clj-djl.ndarray :as nd]
            [stopwatch.core :as stopwatch]
            [clj-chart.chart :as chart]
            [clj-chart.plot :as plot]
            [clojure.java.io :as io]
            [clj-d2l.core :as d2l]))

To illustrate why this matters so much, we can consider two methods for adding vectors. To start, we instantiate two 10000-dimensional vectors containing all ones. In one method we will loop over the vectors with a Clojure doseq. In the other method we will rely on a single call to nd/+.

(def n 10000)
(def ndm (nd/base-manager))
(def a (nd/ones ndm [n]))
(def b (nd/ones ndm [n]))
(def c (nd/zeros ndm [n]))

Now we can benchmark the workloads. First, we add them, one coordinate at a time, using a doseq.

(let [elapsed (stopwatch/start)]
  (doseq [i (range n)]
    (nd/set c [i] (+ (nd/get-element a [i]) (nd/get-element b [i]))))
  (println "Elapsed: " (/ (elapsed) 1e9) "sec"))
Elapsed:  4.044424873 sec

Alternatively, we rely on the clj-djl.ndarray/+ operator to compute the elementwise sum.

(let [elapsed (stopwatch/start)]
  (nd/+ a b)
  (println "Elapsed: " (/ (elapsed) 1e9) "sec"))
Elapsed:  1.64916E-4 sec

You probably noticed that the second method is dramatically faster than the first. Vectorizing code often yields order-of-magnitude speedups. Moreover, we push more of the mathematics to the library and need not write as many calculations ourselves, reducing the potential for errors.

3.1.3. The Normal Distribution and Squared Loss

While you can already get your hands dirty using only the information above, in the following we can more formally motivate the squared loss objective via assumptions about the distribution of noise.

Linear regression was invented by Gauss in 1795, who also discovered the normal distribution (also called the Gaussian). It turns out that the connection between the normal distribution and linear regression runs deeper than common parentage. To refresh your memory, the probability density of a normal distribution with mean \(\mu\) and variance \(\sigma^2\) (standard deviation \(\sigma\)) is given as

\begin{equation} p(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{1}{2 \sigma^2} (x - \mu)^2\right). \end{equation}

Below we define a Clojure function to compute the normal distribution.

(defn normal
  "Density of the normal distribution with mean mu and standard
  deviation sigma, evaluated elementwise on the NDArray x."
  [x mu sigma]
  (let [p (/ 1.0 (Math/sqrt (* 2 Math/PI (Math/pow sigma 2))))]
    ;; p * exp(-(x - mu)^2 / (2 sigma^2))
    (nd/* (nd/exp (nd/* (nd/pow (nd/- x mu) 2)
                        (/ -0.5 (Math/pow sigma 2))))
          p)))
(def x (nd/arange ndm -7. 7. 0.01))
(def params [[0 1] [0 2] [3 1]])
(let [titles (map #(str "mean " (first %) ", std " (second %)) params)
      yss (map #(nd/to-vec (normal x (first %) (second %))) params)
      xs (nd/to-vec x)
      series (map (fn [t ys-] {:name t :xs xs :ys ys-}) titles yss)
      c (chart/line {:series series})]
  (plot/store! c nil "notes/figures/normal-distribution.svg"))

normal-distribution.svg

As we can see, changing the mean corresponds to a shift along the \(x\)-axis, and increasing the variance spreads the distribution out, lowering its peak.

One way to motivate linear regression with the mean squared error loss function (or simply squared loss) is to formally assume that observations arise from noisy measurements, where the noise is normally distributed as follows:

\begin{equation} y = \mathbf{w}^\top \mathbf{x} + b + \epsilon \text{ where } \epsilon \sim \mathcal{N}(0, \sigma^2). \end{equation}

Thus, we can now write out the likelihood of seeing a particular \(y\) for a given \(\mathbf{x}\) via

\begin{equation} P(y \mid \mathbf{x}) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{1}{2 \sigma^2} (y - \mathbf{w}^\top \mathbf{x} - b)^2\right). \end{equation}

Now, according to the principle of maximum likelihood, the best values of parameters \(\mathbf{w}\) and \(b\) are those that maximize the likelihood of the entire dataset:

\begin{equation} P(\mathbf y \mid \mathbf X) = \prod_{i=1}^{n} p(y^{(i)}|\mathbf{x}^{(i)}). \end{equation}

Estimators chosen according to the principle of maximum likelihood are called maximum likelihood estimators. While maximizing the product of many exponential functions might look difficult, we can simplify things significantly, without changing the objective, by maximizing the log of the likelihood instead. For historical reasons, optimizations are more often expressed as minimization rather than maximization. So, without changing anything, we can minimize the negative log-likelihood \(-\log P(\mathbf y \mid \mathbf X)\). Working out the mathematics gives us:

\begin{equation} -\log P(\mathbf y \mid \mathbf X) = \sum_{i=1}^n \frac{1}{2} \log(2 \pi \sigma^2) + \frac{1}{2 \sigma^2} \left(y^{(i)} - \mathbf{w}^\top \mathbf{x}^{(i)} - b\right)^2. \end{equation}

Now we just need one more assumption: that \(\sigma\) is some fixed constant. Thus we can ignore the first term, because it does not depend on \(\mathbf{w}\) or \(b\). The second term is identical to the squared error loss introduced earlier, except for the multiplicative constant \(\frac{1}{\sigma^2}\). Fortunately, the solution does not depend on \(\sigma\). It follows that minimizing the mean squared error is equivalent to maximum likelihood estimation of a linear model under the assumption of additive Gaussian noise.
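Concretely, dropping the constant term and the factor \(\frac{1}{\sigma^2}\) (neither of which changes the minimizer) leaves

\begin{equation} \min_{\mathbf{w}, b} \sum_{i=1}^n \frac{1}{2} \left(y^{(i)} - \mathbf{w}^\top \mathbf{x}^{(i)} - b\right)^2, \end{equation}

which is exactly the squared loss summed over the training examples.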

3.1.4. From Linear Regression to Deep Networks

So far we only talked about linear models. While neural networks cover a much richer family of models, we can begin thinking of the linear model as a neural network by expressing it in the language of neural networks. To begin, let us start by rewriting things in a “layer” notation.

3.1.4.1. Neural Network Diagram

Deep learning practitioners like to draw diagrams to visualize what is happening in their models. In fig 2, we depict our linear regression model as a neural network. Note that these diagrams highlight the connectivity pattern such as how each input is connected to the output, but not the values taken by the weights or biases.

singleneuron.svg

Figure 2: Linear regression is a single-layer neural network.

For the neural network shown in fig 2, the inputs are \(x_1, \ldots, x_d\), so the number of inputs (or feature dimensionality) in the input layer is \(d\). The output of the network in fig 2 is \(o_1\), so the number of outputs in the output layer is 1. Note that the input values are all given and there is just a single computed neuron. Focusing on where computation takes place, conventionally we do not consider the input layer when counting layers. That is to say, the number of layers for the neural network in fig 2 is 1. We can think of linear regression models as neural networks consisting of just a single artificial neuron, or as single-layer neural networks.

Since for linear regression, every input is connected to every output (in this case there is only one output), we can regard this transformation (the output layer in fig 2) as a fully-connected layer or dense layer. We will talk a lot more about networks composed of such layers in the next chapter.

3.1.4.2. Biology

Since linear regression (invented in 1795) predates computational neuroscience, it might seem anachronistic to describe linear regression as a neural network. To see why linear models were a natural place to begin when the cyberneticists/neurophysiologists Warren McCulloch and Walter Pitts began to develop models of artificial neurons, consider the cartoonish picture of a biological neuron in fig 3, consisting of dendrites (input terminals), the nucleus (CPU), the axon (output wire), and the axon terminals (output terminals), enabling connections to other neurons via synapses.

neuron.svg

Figure 3: The real neuron.

Information \(x_i\) arriving from other neurons (or environmental sensors such as the retina) is received in the dendrites. In particular, that information is weighted by synaptic weights \(w_i\) determining the effect of the inputs (e.g., activation or inhibition via the product \(x_i w_i\)). The weighted inputs arriving from multiple sources are aggregated in the nucleus as a weighted sum \(y = \sum_i x_i w_i + b\), and this information is then sent for further processing in the axon \(y\), typically after some nonlinear processing via \(\sigma(y)\). From there it either reaches its destination (e.g., a muscle) or is fed into another neuron via its dendrites.
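As a toy illustration of this computation in plain Clojure (all names and numbers here are purely illustrative):

(defn sigmoid
  "A simple nonlinearity applied to the aggregated signal y."
  [y]
  (/ 1.0 (+ 1.0 (Math/exp (- y)))))

(defn neuron
  "Weighted sum of inputs xs with synaptic weights ws plus bias b,
  followed by a nonlinearity."
  [ws b xs]
  (sigmoid (+ b (reduce + (map * ws xs)))))

(neuron [0.5 -0.3] 0.1 [1.0 2.0])
;; => 0.5, since the weighted sum 0.5 - 0.6 + 0.1 is exactly 0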

Certainly, the high-level idea that many such units could be cobbled together, given the right connectivity and the right learning algorithm, to produce far more interesting and complex behavior than any one neuron alone could express owes to our study of real biological neural systems.

At the same time, most research in deep learning today draws little direct inspiration from neuroscience. We invoke Stuart Russell and Peter Norvig who, in their classic AI textbook Artificial Intelligence: A Modern Approach [Russell & Norvig, 2016], pointed out that although airplanes might have been inspired by birds, ornithology has not been the primary driver of aeronautics innovation for some centuries. Likewise, inspiration in deep learning these days comes in equal or greater measure from mathematics, statistics, and computer science.

3.1.5. Summary

  • Key ingredients in a machine learning model are training data, a loss function, an optimization algorithm, and quite obviously, the model itself.
  • Vectorizing makes everything better (mostly math) and faster (mostly code).
  • Minimizing an objective function and performing maximum likelihood estimation can mean the same thing.
  • Linear regression models are neural networks, too.

3.2. Linear Regression Implementation from Scratch

Now that you understand the key ideas behind linear regression, we can begin to work through a hands-on implementation in code. In this section, we will implement the entire method from scratch, including the data pipeline, the model, the loss function, and the minibatch stochastic gradient descent optimizer. While modern deep learning frameworks can automate nearly all of this work, implementing things from scratch is the only way to make sure that you really know what you are doing. Moreover, when it comes time to customize models, defining our own layers or loss functions, understanding how things work under the hood will prove handy. In this section, we will rely only on tensors and auto differentiation. Afterwards, we will introduce a more concise implementation, taking advantage of bells and whistles of deep learning frameworks.

(ns clj-d2l.linreg
  (:require [clj-djl.ndarray :as nd]
            [clj-djl.device :as device]
            [clj-djl.engine :as engine]
            [clj-djl.training :as t]
            [clj-djl.training.dataset :as ds]
            [clj-d2l.core :as d2l]))

3.2.1. Generating the Dataset

To keep things simple, we will construct an artificial dataset according to a linear model with additive noise. Our task will be to recover this model’s parameters using the finite set of examples contained in our dataset. We will keep the data low-dimensional so we can visualize it easily. In the following code snippet, we generate a dataset containing 1000 examples, each consisting of 2 features sampled from a standard normal distribution. Thus our synthetic dataset will be a matrix \(\mathbf{X}\in \mathbb{R}^{1000 \times 2}\).

The true parameters generating our dataset will be \(\mathbf{w} = [2, -3.4]^\top\) and \(b = 4.2\), and our synthetic labels will be assigned according to the following linear model with the noise term \(\epsilon\):

\begin{equation} \mathbf{y} = \mathbf{X} \mathbf{w} + b + \boldsymbol{\epsilon}. \end{equation}

You could think of \(\epsilon\) as capturing potential measurement errors on the features and labels. We will assume that the standard assumptions hold and thus that \(\epsilon\) obeys a normal distribution with mean of 0. To make our problem easy, we will set its standard deviation to 0.01. The following code generates our synthetic dataset.

(defn synthetic-data
  "Generate num examples with y = Xw + b plus Gaussian noise; returns [X y]."
  [ndm w b num]
  (let [X (nd/random-normal ndm [num (nd/size w)])
        y (nd/+ (nd/dot X w) b)
        noise (nd/random-normal ndm 0 0.01 (nd/shape y) :float32)]
    [X (nd/+ y noise)]))

(def ndm (nd/new-base-manager))
(def true-w (nd/create ndm (float-array [2 -3.4])))
(def true-b 4.2)
(def dp (synthetic-data ndm true-w true-b 1000))
(def features (get dp 0))
(def labels (get dp 1))

Note that each row in features consists of a 2-dimensional data example and that each row in labels consists of a 1-dimensional label value (a scalar).

(str "features(0): " (nd/to-vec (nd/get features [0]))
     "\nlabels(0): " (nd/get-element labels [0]))
features(0): [1.1630785 2.2122061]
labels(0): -1.0015316

By generating a scatter plot using the second feature and labels, we can clearly observe the linear correlation between the two.

(let [x (nd/to-vec (nd/get features ":, 1"))
      y (nd/to-vec labels)]
  (d2l/plot-scatter
   "notes/figures/synthetic_data.svg"
   "data"
   x
   y))

synthetic_data.svg

3.2.2. Reading the Dataset

Recall that training models consists of making multiple passes over the dataset, grabbing one minibatch of examples at a time, and using them to update our model. Since this process is so fundamental to training machine learning algorithms, it is worth defining a utility function to shuffle the dataset and access it in minibatches.

In the following code, we define the data-iter function to demonstrate one possible implementation of this functionality. The function takes a batch size, a matrix of features, and a vector of labels, yielding minibatches of size batch-size drawn in shuffled order. Each minibatch consists of a tuple of features and labels.

(defn data-iter [batch-size features labels]
  (let [num-examples (nd/size labels) ;; one label per example
        indices (shuffle (range num-examples))]
    ;; draw examples in the shuffled order, batch-size at a time
    (for [batch-indices (partition-all batch-size indices)]
      [(->> batch-indices
            (map #(nd/get features [%]))
            (nd/stack))
       (->> batch-indices
            (map #(nd/get labels [%]))
            (nd/stack))])))

In general, note that we want to use reasonably sized minibatches to take advantage of the GPU hardware, which excels at parallelizing operations. Because each example can be fed through our models in parallel and the gradient of the loss function for each example can also be taken in parallel, GPUs allow us to process hundreds of examples in scarcely more time than it might take to process just a single example.

To build some intuition, let us read and print the first small batch of data examples. The shape of the features in each minibatch tells us both the minibatch size and the number of input features. Likewise, our minibatch of labels will have a shape given by batch-size.

(first (first (data-iter 10 features labels)))
ND: (10, 2) cpu() float32
[[ 1.1631,  2.2122],
 [ 0.4838,  0.774 ],
 [ 0.2996,  1.0434],
 [ 0.153 ,  1.1839],
 [-1.1688,  1.8917],
 [ 1.5581, -1.2347],
 [-0.5459, -1.771 ],
 [-2.3556, -0.4514],
 [ 0.5414,  0.5794],
 [ 2.6785, -1.8561],
]
(second (first (data-iter 10 features labels)))
ND: (10) cpu() float32
[-1.0015,  2.5281,  1.2197,  0.4807, -4.5658, 11.5155,  9.1389,  1.0271,  3.3058, 15.8548]

As we run the iteration, we obtain distinct minibatches successively until the entire dataset has been exhausted (try this). While the iteration implemented above is good for didactic purposes, it is inefficient in ways that might get us in trouble on real problems. For example, it requires that we load all the data in memory and that we perform lots of random memory access. The built-in iterators implemented in a deep learning framework are considerably more efficient and they can deal with both data stored in files and data fed via data streams.

3.2.3. Initializing Model Parameters

Before we can begin optimizing our model’s parameters by minibatch stochastic gradient descent, we need to have some parameters in the first place. In the following code, we initialize weights by sampling random numbers from a normal distribution with mean 0 and a standard deviation of 0.01, and setting the bias to 0.

(def w (nd/random-normal ndm 0 0.01 [2 1] :float32))
(def b (nd/zeros ndm [1]))
(nd/set-requires-gradient w true)
(nd/set-requires-gradient b true)
(println (nd/to-vec w))
(println (nd/to-vec b))
[0.0013263007 0.0072455113]
[0.0]

After initializing our parameters, our next task is to update them until they fit our data sufficiently well. Each update requires taking the gradient of our loss function with respect to the parameters. Given this gradient, we can update each parameter in the direction that may reduce the loss.

Since nobody wants to compute gradients explicitly (this is tedious and error prone), we use automatic differentiation, as introduced in Section 2.5-automatic-differentiation.html#automatic_differentiation, to compute the gradient.

3.2.4. Defining the Model

Next, we must define our model, relating its inputs and parameters to its outputs. Recall that to calculate the output of the linear model, we simply take the matrix-vector dot product of the input features \(\mathbf{X}\) and the model weights \(\mathbf{w}\), and add the offset \(b\) to each example. Note that below \(\mathbf{Xw}\) is a vector and \(b\) is a scalar. Recall the broadcasting mechanism as described in Section 2.1-data-manipulation.html#9dcbe412-db7e-485a-bb3c-d7181f2f7f05. When we add a vector and a scalar, the scalar is added to each component of the vector.

(defn linreg
  "The linear regression model."
  [X w b]
  (nd/+ (nd/dot X w) b))

3.2.5. Defining the Loss Function

Since updating our model requires taking the gradient of our loss function, we ought to define the loss function first. Here we will use the squared loss function as described in Section 3.1-linear-regression.html#lin_reg. In the implementation, we need to reshape the true values y to match the shape of the predictions y-hat. The result returned by the following function will also have the same shape as y-hat.

(defn squared-loss [y-hat y]
  ;; reshape y to match y-hat, then compute (y-hat - y)^2 / 2 elementwise
  (let [diff (nd/- y-hat (nd/reshape y (nd/shape y-hat)))]
    (nd// (nd/* diff diff) 2)))

3.2.6. Defining the Optimization Algorithm

As we discussed in Section 3.1-linear-regression.html#lin_reg, linear regression has a closed-form solution. However, this is not a book about linear regression: it is a book about deep learning. Since none of the other models that this book introduces can be solved analytically, we will take this opportunity to introduce your first working example of minibatch stochastic gradient descent.

At each step, using one minibatch randomly drawn from our dataset, we will estimate the gradient of the loss with respect to our parameters. Next, we will update our parameters in the direction that may reduce the loss. The following code applies the minibatch stochastic gradient descent update, given a set of parameters, a learning rate, and a batch size. The size of the update step is determined by the learning rate lr. Because our loss is calculated as a sum over the minibatch of examples, we normalize our step size by the batch size (batch-size), so that the magnitude of a typical step size does not depend heavily on our choice of the batch size.

(defn sgd
  "Minibatch stochastic gradient descent."
  [params lr batch-size]
  (doseq [param params]
    ;; param = param - param.gradient * lr / batchSize
    (nd/-! param (nd// (nd/* (nd/get-gradient param) lr) batch-size))))

3.2.7. Training

Now that we have all of the parts in place, we are ready to implement the main training loop. It is crucial that you understand this code because you will see nearly identical training loops over and over again throughout your career in deep learning.

In each iteration, we will grab a minibatch of training examples, and pass them through our model to obtain a set of predictions. After calculating the loss, we initiate the backwards pass through the network, storing the gradients with respect to each parameter. Finally, we will call the optimization algorithm sgd to update the model parameters.

In summary, we will execute the following loop:

  • Initialize parameters \((\mathbf{w}, b)\)
  • Repeat until done
    • Compute gradient \(\mathbf{g} \leftarrow \partial_{(\mathbf{w},b)} \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} l(\mathbf{x}^{(i)}, y^{(i)}, \mathbf{w}, b)\)
    • Update parameters \((\mathbf{w}, b) \leftarrow (\mathbf{w}, b) - \eta \mathbf{g}\)

In each epoch, we will iterate through the entire dataset (using the data-iter function), passing once through every example in the training dataset (assuming that the number of examples is divisible by the batch size). The number of epochs num-epochs and the learning rate lr are both hyperparameters, which we set here to 3 and 0.03, respectively. Unfortunately, setting hyperparameters is tricky and requires some adjustment by trial and error. We elide these details for now but revisit them later in Section 11.

(def lr 0.03)
(def num-epochs 3)
(def batch-size 10)
[lr num-epochs batch-size]
[0.03 3 10]
(def datasets (data-iter batch-size features labels))
#'clj-d2l.linreg/datasets
(doseq [epoch (range num-epochs)]
  (doseq [[X y] datasets]
    (with-open [gc (t/gradient-collector)]
      (let [l (-> (linreg X w b) (squared-loss y))]
        (t/backward gc l)))
    (sgd [w b] lr batch-size))
  (println "epoch " epoch
           (-> (squared-loss (linreg features w b) labels)
               (nd/mean)
               (nd/get-element))))
epoch  0 3.3477103E-4
epoch  1 5.24727E-5
epoch  2 5.1000967E-5

In this case, because we synthesized the dataset ourselves, we know precisely what the true parameters are. Thus, we can evaluate our success in training by comparing the true parameters with those that we learned through our training loop. Indeed they turn out to be very close to each other.

(println (nd/to-vec w))
(println (nd/to-vec true-w))
(def w-error (nd/to-vec (nd/- true-w (nd/reshape w (nd/get-shape true-w)))))
(println "Error in estimating w:" (vec w-error))
(println "Error in estimating w:" (- true-b (nd/get-element b)))
[1.999567 -3.400089]
[2.0 -3.4]
Error in estimating w: [4.3296814E-4 8.893013E-5]
Error in estimating b: 5.704879760743964E-4

Note that we should not take it for granted that we are able to recover the parameters perfectly. However, in machine learning, we are typically less concerned with recovering true underlying parameters, and more concerned with parameters that lead to highly accurate prediction. Fortunately, even on difficult optimization problems, stochastic gradient descent can often find remarkably good solutions, owing partly to the fact that, for deep networks, there exist many configurations of the parameters that lead to highly accurate prediction.

3.2.8. Summary

  • We saw how a deep network can be implemented and optimized from scratch, using just tensors and auto differentiation, without any need for defining layers or fancy optimizers.
  • This section only scratches the surface of what is possible. In the following sections, we will describe additional models based on the concepts that we have just introduced and learn how to implement them more concisely.

3.3. Concise Implementation of Linear Regression

Broad and intense interest in deep learning for the past several years has inspired companies, academics, and hobbyists to develop a variety of mature open source frameworks for automating the repetitive work of implementing gradient-based learning algorithms. In Section 3.2-linear-reg-impl-from-scratch.html#lin-reg-scratch, we relied only on (i) tensors for data storage and linear algebra; and (ii) auto differentiation for calculating gradients. In practice, because data iterators, loss functions, optimizers, and neural network layers are so common, modern libraries implement these components for us as well.

In this section, we will show you how to implement the linear regression model from Section 3.2-linear-reg-impl-from-scratch.html#lin-reg-scratch concisely by using high-level APIs of deep learning frameworks.

3.3.1. Generating the Dataset

To start, we will generate the same dataset as in Section 3.2-linear-reg-impl-from-scratch.html#lin-reg-scratch.

(ns clj-d2l.linreg-easy
  (:require [clojure.java.io :as io]
            [clj-d2l.core :as d2l]
            [clj-djl.ndarray :as nd]
            [clj-djl.device :as device]
            [clj-djl.engine :as engine]
            [clj-djl.training.dataset :as ds]
            [clj-djl.model :as model]
            [clj-djl.nn :as nn]
            [clj-djl.training.loss :as loss]
            [clj-djl.training.tracker :as tracker]
            [clj-djl.training.optimizer :as optimizer]
            [clj-djl.training.parameter :as parameter]
            [clj-djl.training.initializer :as initializer]
            [clj-djl.training :as t]
            [clj-djl.training.listener :as listener])
  (:import [ai.djl.ndarray.types DataType]
           [java.nio.file Paths]))
(def ndm (nd/base-manager))
(def true-w (nd/create ndm (float-array [2 -3.4])))
(def true-b 4.2)
(def dp (d2l/synthetic-data ndm true-w true-b 1000))
(def features (get dp 0))
(def labels (get dp 1))
(println "features(0): "(nd/to-vec (nd/get features [0])))
(println "labels(0): " (nd/to-vec (nd/get labels [0])))
features(0):  [0.36353156 1.8406333]
labels(0):  [-1.3278697]

3.3.2. Reading the Dataset

Rather than rolling our own iterator, we can call upon the existing API in a framework to read data. We pass in features and labels as arguments and specify batch-size when instantiating a data iterator object. In addition, the boolean second argument of set-sampling indicates whether or not we want the data iterator object to shuffle the data on each epoch (pass through the dataset).

(def batch-size 10)
(def datasets (-> (ds/new-array-dataset-builder)
                  (ds/set-data features)
                  (ds/opt-labels labels)
                  ;; (defn set-sampling [batch-size shuffle] ...)
                  (ds/set-sampling batch-size false)
                  (ds/build)))

Now we can use get-data-iterator in much the same way as we called the data-iter function in Section 3.2-linear-reg-impl-from-scratch.html#lin-reg-scratch. To verify that it is working, we can read and print the first minibatch of examples. Comparing with Section 3.2-linear-reg-impl-from-scratch.html#lin-reg-scratch, here we use Clojure function first to obtain the first item from the iterator.

(-> datasets
    (ds/get-data-iterator ndm) ;; generate a data iterator
    first ;; get the first batch
    ds/get-data ;; get the data
    first)
ND: (10, 2) cpu() float32
[[-2.2896,  1.0315],
 [-0.6617,  0.5531],
 [ 0.3967, -0.9902],
 [ 0.9992, -1.9574],
 [-0.9857, -0.1098],
 [-0.5344,  0.0834],
 [ 1.0844,  0.2221],
 [ 1.3125, -0.8627],
 [-0.581 ,  0.7608],
 [-1.4804,  0.2687],
]
(-> datasets
    (ds/get-data-iterator ndm) ;; generate a data iterator
    first ;; get the first batch
    ds/get-labels ;; get the labels
    first)
ND: (10) cpu() float32
[ 5.7955,  3.7533, 11.8775,  6.7918,  3.2628,  2.2621,  1.5316, 10.5247, 15.5371, -0.0894]

3.3.3. Defining the Model

When we implemented linear regression from scratch in Section 3.2-linear-reg-impl-from-scratch.html#lin-reg-scratch, we defined our model parameters explicitly and coded up the calculations to produce output using basic linear algebra operations. You should know how to do this. But once your models get more complex, and once you have to do this nearly every day, you will be glad for the assistance. The situation is similar to coding up your own blog from scratch. Doing it once or twice is rewarding and instructive, but you would be a lousy web developer if every time you needed a blog you spent a month reinventing the wheel.

For standard operations, we can use a framework’s predefined layers, which allow us to focus especially on the layers used to construct the model rather than having to focus on the implementation. We will first define a model variable net, which will refer to an instance of the sequential-block class. The sequential-block class defines a container for several layers that will be chained together. Given input data, a sequential-block instance passes it through the first layer, in turn passing the output as the second layer’s input and so forth. In the following example, our model consists of only one layer, so we do not really need sequential-block. But since nearly all of our future models will involve multiple layers, we will use it anyway just to familiarize you with the most standard workflow.

Recall the architecture of a single-layer network as shown in fig 2. The layer is said to be fully-connected because each of its inputs is connected to each of its outputs by means of a matrix-vector multiplication.

Now we define a model with the name “lin-reg” and create a sequential-block with a linear-block inside it. We will attach the block to the model after specifying its initializer below.

(def model (model/new-instance "lin-reg"))
(def net (nn/sequential-block))
(def linear-block (nn/linear-block {:bias true
                                    :units 1}))
(nn/add net linear-block)

3.3.4. Initializing Model Parameters

Before using net, we need to initialize the model parameters, such as the weights and bias in the linear regression model. Deep learning frameworks often have a predefined way to initialize the parameters. Here we specify that each weight parameter should be randomly sampled from a normal distribution with mean 0 and standard deviation 0.01. The bias parameter will be initialized to zero.

We import the initializer namespace from clj-djl. This namespace provides various methods for model parameter initialization. We only specify how to initialize the weight by calling (initializer/normal-initializer 0.01). Bias parameters are initialized to zero by default.

(nn/set-initializer net (initializer/normal-initializer 0.01) parameter/weight)
(model/set-block model net)
Model (
	Name: lin-reg
	Data Type: float32
)

The code above may look straightforward but you should note that something strange is happening here. We are initializing parameters for a network even though clj-djl does not yet know how many dimensions the input will have! It might be 2 as in our example or it might be 2000. clj-djl lets us get away with this because behind the scenes, the initialization is actually deferred. The real initialization will take place only when we attempt to pass data through the network for the first time. Just be careful to remember that since the parameters have not been initialized yet, we cannot access or manipulate them.

3.3.5. Defining the Loss Function

In clj-djl, the loss namespace defines various loss functions. In this example, we will use the squared loss (l2-loss).

(def loss (loss/l2-loss))
#'clj-d2l.linreg-easy/loss

3.3.6. Defining the Optimization Algorithm

Minibatch stochastic gradient descent is a standard tool for optimizing neural networks and thus clj-djl supports it alongside a number of variations on this algorithm through its trainer. When we instantiate the trainer, we will specify the parameters to optimize over, the optimization algorithm we wish to use (sgd), and a map of hyperparameters required by our optimization algorithm. Minibatch stochastic gradient descent just requires that we set the value of the learning rate, which is set to 0.3 here via a fixed tracker.

(def lrt (tracker/fixed 0.3))
(def sgd (optimizer/sgd {:tracker lrt}))
#'clj-d2l.linreg-easy/sgd

3.3.7. Instantiating Configuration and Trainer

(def trainer (t/trainer {:model model
                         :loss loss
                         :optimizer sgd
                         :listeners (listener/logging)}))
#'clj-d2l.linreg-easy/trainer

3.3.8. Initializing Model Parameters

Calling initialize with the expected input shape, (batch-size, 2), triggers the deferred parameter initialization discussed above.

(t/initialize trainer [(nd/shape batch-size 2)])
ai.djl.training.Trainer@4b641b7c

3.3.9. Metrics

We also attach a metrics object to the trainer so that measurements such as the training loss can be collected and reported during training.

(def metrics (t/metrics))
(t/set-metrics trainer metrics)

3.3.10. Training

You might have noticed that expressing our model through high-level APIs of a deep learning framework requires comparatively few lines of code. We did not have to individually allocate parameters, define our loss function, or implement minibatch stochastic gradient descent. Once we start working with much more complex models, advantages of high-level APIs will grow considerably. However, once we have all the basic pieces in place, the training loop itself is strikingly similar to what we did when implementing everything from scratch.

To refresh your memory: for some number of epochs, we will make a complete pass over the dataset (datasets), iteratively grabbing one minibatch of inputs and the corresponding ground-truth labels. For each minibatch, we go through the following ritual:

  • Generate predictions by calling train-batch and calculate the loss l (the forward propagation).
  • Calculate gradients by running the backpropagation.
  • Update the model parameters by invoking our optimizer.

For good measure, we compute the loss after each epoch and print it to monitor progress.

(def epochs 3)

(doseq [epoch (range epochs)]
  (doseq [batch (t/iterate-dataset trainer datasets)]
    (t/train-batch trainer batch)
    (t/step trainer)
    (ds/close-batch batch))
  (t/notify-listeners trainer (fn [listener] (.onEpoch listener trainer))))

Training:      5% |===                                     | L2Loss: 6.82
Training:     50% |=====================                   | L2Loss: 0.71
Training:    100% |========================================| L2Loss: 0.35

Training:      5% |===                                     | L2Loss: 7.43E-05
Training:     50% |=====================                   | L2Loss: 5.64E-05
Training:    100% |========================================| L2Loss: 5.63E-05

Training:      5% |===                                     | L2Loss: 7.43E-05
Training:     50% |=====================                   | L2Loss: 5.64E-05
Training:    100% |========================================| L2Loss: 5.63E-05

Below, we compare the model parameters learned by training on finite data and the actual parameters that generated our dataset. To access parameters, we first access the layer that we need from net and then access that layer’s weights and bias. As in our from-scratch implementation, note that our estimated parameters are close to their ground-truth counterparts.

(def params (-> model (model/get-block) (model/get-parameters)))
(def w (.getArray (.valueAt params 0)))
(def b (.getArray (.valueAt params 1)))
(def w-error (nd/to-vec (nd/- true-w (nd/reshape w (nd/get-shape true-w)))))
(println "Error in estimating w:" (vec w-error))
(println "Error in estimating w:" (- true-b (nd/get-element b)))
Error in estimating w: [-0.0019903183 7.4744225E-4]
Error in estimating w: -4.289627075193536E-4

3.4.8. Saving Your Model

You can also save the model for future prediction tasks.

(defn save-model [model path epoch name]
  (let [nio-path (java.nio.file.Paths/get path (into-array [""]))]
    (io/make-parents path)                   ; ensure the target directory exists
    (model/set-property model "Epoch" epoch) ; recorded in the saved model metadata
    (model/save model nio-path name)))

(save-model model "models/lin-reg" "3" "lin-reg")
(println (str model))
Model (
	Name: lin-reg
	Model location: /home/kimim/workspace/clj-d2l/models/lin-reg
	Data Type: float32
	Epoch: 3
)
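
To reuse the saved model later, we can create a fresh Model, set a block with the same architecture, and load the parameters back. The following is only a minimal sketch using DJL's Java API directly (clj-djl wrappers may exist, but we have not introduced them here), and it assumes the network is the single-output Linear layer trained above:

(def loaded-model
  (doto (ai.djl.Model/newInstance "lin-reg")
    ;; assumption: the same single-unit Linear block used for training
    (.setBlock (-> (ai.djl.nn.core.Linear/builder)
                   (.setUnits 1)
                   (.build)))
    (.load (java.nio.file.Paths/get "models/lin-reg" (into-array [""]))
           "lin-reg")))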

3.4.9. Summary

  • Using clj-djl, we can implement models much more concisely.
  • In clj-djl, the dataset namespace provides tools for data processing, the nn namespace defines a large number of neural network layers, and the loss namespace defines many common loss functions.
  • The initializer namespace provides various methods for model parameter initialization.
  • Dimensionality and storage are automatically inferred, but be careful not to attempt to access parameters before they have been initialized.

3.5. Softmax Regression

In Section 3.1, we introduced linear regression, working through implementations from scratch in Section 3.2 and again using high-level APIs of a deep learning framework in Section 3.3 to do the heavy lifting.

Regression is the hammer we reach for when we want to answer how much? or how many? questions. If you want to predict the number of dollars (price) at which a house will be sold, or the number of wins a baseball team might have, or the number of days that a patient will remain hospitalized before being discharged, then you are probably looking for a regression model.

In practice, we are more often interested in classification: asking not “how much” but “which one”:

  • Does this email belong in the spam folder or the inbox?
  • Is this customer more likely to sign up or not to sign up for a subscription service?
  • Does this image depict a donkey, a dog, a cat, or a rooster?
  • Which movie is Aston most likely to watch next?

Colloquially, machine learning practitioners overload the word classification to describe two subtly different problems: (i) those where we are interested only in hard assignments of examples to categories (classes); and (ii) those where we wish to make soft assignments, i.e., to assess the probability that each category applies. The distinction tends to get blurred, in part, because often, even when we only care about hard assignments, we still use models that make soft assignments.
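
To make the distinction concrete, here is a toy sketch (not from the original text) in plain Clojure: a soft assignment is a probability vector over the classes, and the corresponding hard assignment simply picks the most probable class.

;; soft assignment: probabilities over classes; hard assignment: the argmax
(def soft-assignment {:donkey 0.05 :dog 0.80 :cat 0.10 :rooster 0.05})
(key (apply max-key val soft-assignment))
;; => :dog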

3.6. The Image Classification Dataset

One of the most widely used datasets for image classification is the MNIST dataset [LeCun et al., 1998]. While it had a good run as a benchmark dataset, even simple models by today’s standards achieve classification accuracy over 95%, making it unsuitable for distinguishing between stronger models and weaker ones. Today, MNIST serves more as a sanity check than as a benchmark. To up the ante just a bit, we will focus our discussion in the coming sections on the qualitatively similar, but comparatively complex Fashion-MNIST dataset [Xiao et al., 2017], which was released in 2017.

(ns clj-d2l.image-classification
  (:require [clj-djl.training.dataset :as ds]
            [clj-djl.ndarray :as nd]
            [stopwatch.core :as stopwatch])
  (:import [ai.djl.basicdataset.cv.classification FashionMnist]))

3.6.1. Getting the Dataset

We can download and read the Fashion-MNIST dataset into memory via the built-in functions in the framework.

;; downloading may take quite long, so extend the repl timeout
(setq org-babel-clojure-sync-nrepl-timeout 1000)
1000
(def batch-size 256)
(def random-shuffle true)
;; defn d2l/load-data-fashion-mnist in clj-d2l.core
(def mnist-train (-> (FashionMnist/builder)
                     (ds/opt-usage :train)
                     (ds/set-sampling batch-size random-shuffle)
                     (ds/build)
                     (ds/prepare)))

(def mnist-test (-> (FashionMnist/builder)
                    (ds/opt-usage :test)
                    (ds/set-sampling batch-size random-shuffle)
                    (ds/build)
                    (ds/prepare)))

Fashion-MNIST consists of images from 10 categories, each represented by 6000 images in the training dataset and by 1000 in the test dataset. A test dataset (or test set) is used for evaluating model performance and not for training. Consequently, the training set and the test set contain 60000 and 10000 images, respectively.

[(nd/size mnist-test)
 (nd/size mnist-train)]
[10000 60000]

The height and width of each input image are both 28 pixels. Note that the dataset consists of grayscale images, whose number of channels is 1. For brevity, throughout this book we store the shape of any image with height \(h\) and width \(w\) pixels as \(h \times w\) or \((h, w)\).

(def ndm (nd/base-manager))
(def train-ds (ds/get-data-iterator mnist-train ndm))
#'clj-d2l.image-classification/train-ds
(nd/shape (nd/get (first (ds/get-data (first train-ds))) [0]))
(1, 28, 28)

The images in Fashion-MNIST are associated with the following categories: t-shirt, trousers, pullover, dress, coat, sandal, shirt, sneaker, bag, and ankle boot. The following function converts between numeric label indices and their names in text.

(defn get-fashion-mnist-labels [labels]
  (let [num-label-map  {0 "t-shirt" 1 "trouser" 2 "pullover"
                        3 "dress" 4 "coat" 5 "sandal"
                        6 "shirt" 7 "sneaker" 8 "bag"
                        9 "ankle boot"}]
    (map num-label-map labels)))
(print (get-fashion-mnist-labels [1 2 2 4]))
(trouser pullover pullover coat)

We can now create a function to visualize these examples.

;; TODO
;; how to convert ndarray to an image?
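
One possible approach, sketched below (it is not part of the original text), is to copy the pixel values into a java.awt.image.BufferedImage and write it to disk. It assumes a single 28x28 grayscale example with float pixel values in [0, 255].

(import '[java.awt.image BufferedImage]
        '[javax.imageio ImageIO]
        '[java.io File])

(defn save-example-png [ndarray path]
  (let [img (BufferedImage. 28 28 BufferedImage/TYPE_BYTE_GRAY)
        raster (.getRaster img)
        pixels (nd/to-vec ndarray)] ; 1x28x28 = 784 values, row-major
    (doseq [y (range 28)
            x (range 28)]
      ;; band 0 is the single gray channel
      (.setSample raster x y 0 (int (nth pixels (+ (* y 28) x)))))
    (ImageIO/write img "png" (File. path))))

;; e.g. (save-example-png (nd/get (first (ds/get-data (first train-ds))) [0])
;;                        "fashion-mnist-example.png")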

3.6.2. Reading a Minibatch

To make our life easier when reading from the training and test sets, we use the built-in data iterator rather than creating one from scratch. Recall that a data iterator reads a minibatch of data of size batch-size at each iteration. We also randomly shuffle the examples for the training data iterator.

Let us look at the time it takes to read the training data.

(let [elapsed (stopwatch/start)]
  ;; for builds a lazy sequence; without forcing it with dorun we would
  ;; only measure the time to create the seq, not to read the batches
  (dorun
   (for [batch train-ds]
     [(first (ds/get-data batch))
      (first (ds/get-labels batch))]))
  (println "Elapsed: " (/ (elapsed) 1e9) "sec"))
Elapsed:  1.667E-4 sec

We are now ready to work with the Fashion-MNIST dataset in the sections that follow.

3.6.3. Summary

  • Fashion-MNIST is an apparel classification dataset consisting of images representing 10 categories. We will use this dataset in subsequent sections and chapters to evaluate various classification algorithms.
  • We store the shape of any image with height \(h\) and width \(w\) pixels as \(h \times w\) or (\(h\), \(w\)).
  • Data iterators are a key component for efficient performance. Rely on well-implemented data iterators that exploit high-performance computing to avoid slowing down your training loop.

3.7. Implementation of Softmax Regression from Scratch

(ns clj-d2l.softmax-from-scratch
  (:require [clojure.java.io :as io]
            [clj-djl.ndarray :as nd]
            [clj-djl.training.dataset :as ds]
            [clj-djl.model :as model]
            [clj-djl.nn :as nn]
            [clj-djl.training.loss :as loss]
            [clj-djl.training.tracker :as tracker]
            [clj-djl.training.optimizer :as optimizer]
            [clj-djl.training :as t]
            [clj-djl.training.listener :as listener]
            [clj-djl.engine :as engine]
            [clj-d2l.core :as d2l])
  (:import [ai.djl.basicdataset.cv.classification FashionMnist]
           [java.nio.file Paths]))
(setq org-babel-clojure-sync-nrepl-timeout 1000)
1000
(def batch-size 256)
(def fashion-mnist (d2l/load-data-fashion-mnist batch-size))
#'clj-d2l.softmax-from-scratch/batch-size
#'clj-d2l.softmax-from-scratch/fashion-mnist
(.size (first fashion-mnist))
60000

3.7.1. Initializing Model Parameters

As in our linear regression example, each example here will be represented by a fixed-length vector. Each example in the raw dataset is a \(28 \times 28\) image. In this section, we will flatten each image, treating them as vectors of length 784. In the future, we will talk about more sophisticated strategies for exploiting the spatial structure in images, but for now we treat each pixel location as just another feature.

Recall that in softmax regression, we have as many outputs as there are classes. Because our dataset has 10 classes, our network will have an output dimension of 10. Consequently, our weights will constitute a \(784 \times 10\) matrix and the biases will constitute a \(1 \times 10\) row vector. As with linear regression, we will initialize our weights W with Gaussian noise and our biases to take the initial value 0.

(def num-inputs 784)
(def num-outputs 10)
(def ndm (nd/base-manager))
(def W (nd/random-normal ndm 0 0.01 [num-inputs num-outputs] :float32))
(def b (nd/zeros ndm [num-outputs] :float32))
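
As a quick sanity check (not in the original text), flattening works as described: reshaping a batch of single-channel \(28 \times 28\) images with -1 in the first position yields rows of num-inputs = 784 features.

;; flatten a (hypothetical) batch of 256 grayscale 28x28 images
(let [batch (nd/zeros ndm [256 1 28 28] :float32)]
  (nd/shape (nd/reshape batch [-1 num-inputs])))
;; => (256, 784)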

3.7.2. Defining the Softmax Operation

Before implementing the softmax regression model, let us briefly review how the sum operator works along specific dimensions in a tensor, as discussed in the reduction and non-reduction subsections of Section 2.3. Given a matrix X, we can sum over all elements (by default) or only over elements in the same axis, i.e., the same column (axis 0) or the same row (axis 1). Note that if X is a tensor with shape (2, 3) and we sum over the columns, the result will be a vector with shape (3,). When invoking the sum operator, we can specify to keep the number of axes in the original tensor, rather than collapsing out the dimension that we summed over. This will result in a two-dimensional tensor with shape (1, 3).

(def X (nd/create ndm [[1 2 3] [4 5 6]]))
(nd/sum X [0] true)
ND: (1, 3) cpu() int64
[[ 5,  7,  9],
]
(nd/sum X [1] true)
ND: (2, 1) cpu() int64
[[ 6],
 [15],
]
(nd/sum X [0 1] true)
ND: (1, 1) cpu() int64
[[21],
]
(nd/sum X [0 1] false)
ND: () cpu() int64
21
(nd/sum X)
ND: () cpu() int64
21

We are now ready to implement the softmax operation. Recall that softmax consists of three steps: (i) we exponentiate each term (using exp); (ii) we sum over each row (we have one row per example in the batch) to get the normalization constant for each example; (iii) we divide each row by its normalization constant, ensuring that the result sums to 1. Before looking at the code, let us recall how this looks expressed as an equation:

\begin{equation} \mathrm{softmax}(\mathbf{X})_{ij} = \frac{\exp(\mathbf{X}_{ij})}{\sum_k \exp(\mathbf{X}_{ik})}. \end{equation}

The denominator, or normalization constant, is also sometimes called the partition function (and its logarithm is called the log-partition function). The origins of that name are in statistical physics where a related equation models the distribution over an ensemble of particles.

(defn softmax [ndarray]
  (let [Xexp (nd/exp ndarray)
        ;; per-row normalization constant (shadows clojure.core/partition)
        partition (nd/sum Xexp [1] true)]
    (nd// Xexp partition)))
#'clj-d2l.softmax-from-scratch/softmax

As you can see, for any random input, we turn each element into a non-negative number. Moreover, each row sums up to 1, as is required for a probability.

(def X (nd/random-normal ndm [2 5]))
(softmax X)
ND: (2, 5) cpu() float32
[[0.2492, 0.2315, 0.2786, 0.0503, 0.1905],
 [0.1874, 0.4392, 0.2314, 0.0868, 0.0552],
]
(nd/sum (softmax X) [1])
ND: (2) cpu() float32
[1., 1.]

Note that while this looks correct mathematically, we were a bit sloppy in our implementation because we failed to take precautions against numerical overflow or underflow due to large or very small elements of the matrix.
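
A common remedy, sketched below (it is not part of the original implementation), is to subtract each row's maximum before exponentiating: adding a constant to every element of a row leaves the softmax output unchanged, but the arguments to exp stay bounded above by 0. The .max call is Java interop against DJL's NDArray.max(int[] axes, boolean keepDims), since we have not introduced a clj-djl wrapper for it.

(defn stable-softmax [ndarray]
  (let [row-max (.max ndarray (int-array [1]) true) ; per-row maximum, dims kept
        Xexp (nd/exp (nd/- ndarray row-max))        ; exponents are now <= 0
        partition (nd/sum Xexp [1] true)]
    (nd// Xexp partition)))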

3.7.3. Defining the Model

Now that we have defined the softmax operation, we can implement the softmax regression model. The code below defines how the input is mapped to the output through the network. Note that we flatten each original image in the batch into a vector using the reshape function before passing the data through our model.

(defn net [ndarray]
  (let [current-W W
        current-b b]
    (-> ndarray
        (nd/reshape [-1 num-inputs])
        (nd/dot current-W)
        (nd/+ current-b)
        softmax)))

3.7.4. Defining the Loss Function

Next, we need to implement the cross-entropy loss function, as introduced in Section 3.5. This may be the most common loss function in all of deep learning because, at the moment, classification problems far outnumber regression problems.

Recall that cross-entropy takes the negative log-likelihood of the predicted probability assigned to the true label. Rather than iterating over the predictions with a for-loop (which tends to be inefficient), we can pick all elements by a single operator. Below, we create sample data y-hat with 2 examples of predicted probabilities over 3 classes and their corresponding labels y. With y we know that in the first example the first class is the correct prediction and in the second example the third class is the ground-truth. Using y as the indices of the probabilities in y-hat, we pick the probability of the first class in the first example and the probability of the third class in the second example.

(def y (nd/create ndm [0 2]))
(def y-hat (nd/create ndm [[0.1 0.3 0.6][0.3 0.2 0.5]]))
(nd/get y-hat ":,{}" y)
ND: (2, 1) cpu() float64
[[0.1],
 [0.5],
]

Now we can implement the cross-entropy loss function efficiently with just one line of code.

(defn cross-entropy [y-hat y]
  (-> (nd/get y-hat ":, {}" (nd/to-type y :int32 false))
      (.log)
      (.neg)))

(cross-entropy y-hat y)
ND: (2, 1) cpu() float64
[[2.3026],
 [0.6931],
]

3.7.5. Classification Accuracy

Given the predicted probability distribution y-hat, we typically choose the class with the highest predicted probability whenever we must output a hard prediction. Indeed, many applications require that we make a choice. Gmail must categorize an email into “Primary”, “Social”, “Updates”, or “Forums”. It might estimate probabilities internally, but at the end of the day it has to choose one among the classes.

When predictions are consistent with the label class y, they are correct. The classification accuracy is the fraction of all predictions that are correct. Although it can be difficult to optimize accuracy directly (it is not differentiable), it is often the performance measure that we care most about, and we will nearly always report it when training classifiers.

To compute accuracy we do the following. First, if y-hat is a matrix, we assume that the second dimension stores prediction scores for each class. We use argmax to obtain the predicted class by the index of the largest entry in each row. Then we compare the predicted classes with the ground-truth y elementwise. Since the equality operator nd/= is sensitive to data types, we convert y to int64 to match the integer indices returned by argmax. The result is a tensor containing entries of 0 (false) and 1 (true). Taking the sum yields the number of correct predictions.

(defn accuracy [y-hat y]
  ;; reduce a matrix of per-class scores to predicted class indices,
  ;; then count how many predictions match the labels
  (let [y-pred (if (> (nd/size (nd/shape y-hat)) 1)
                 (nd/argmax y-hat 1)
                 y-hat)]
    (-> (nd/= y-pred (nd/to-type y :int64 false))
        (nd/sum)
        (nd/get-element))))

We will continue to use the variables y-hat and y defined before as the predicted probability distributions and labels, respectively. We can see that the first example’s prediction class is 2 (the largest element of the row is 0.6 with the index 2), which is inconsistent with the actual label, 0. The second example’s prediction class is 2 (the largest element of the row is 0.5 with the index of 2), which is consistent with the actual label, 2. Therefore, the classification accuracy rate for these two examples is 0.5.

(/ (accuracy y-hat y) (nd/size y))
1/2

Similarly, we can evaluate the accuracy for any model net on a dataset that is accessed via the data iterator data-iter.

(def fashion-mnist-train (first fashion-mnist))
(defn evaluate-accuracy [net data-iter]
  (let [acc (atom [0 0])]
    (doseq [batch data-iter]
      (let [X (nd/head (ds/get-data batch))
            y (nd/head (ds/get-labels batch))]
        (swap! acc update 0 + (accuracy (net X) y))
        (swap! acc update 1 + (nd/size y))
        (ds/close batch)))
    (reduce / @acc)))
#'clj-d2l.softmax-from-scratch/evaluate-accuracy

Below, accumulate is a utility function that accumulates sums over multiple variables. In the evaluate-accuracy function above, we created an atom holding a vector of two values for storing the number of correct predictions and the total number of predictions, respectively. Both are accumulated as we iterate over the dataset.

(defn accumulate [atom x y z]
  (swap! atom update 0 + x)
  (swap! atom update 1 + y)
  (swap! atom update 2 + z))

Because we initialized the net model with random weights, the accuracy of this model should be close to random guessing, i.e., 0.1 for 10 classes.

(evaluate-accuracy net (ds/get-data-iterator (second fashion-mnist) ndm))
119/2000

3.7.6. Training

The training loop for softmax regression should look strikingly familiar if you read through our implementation of linear regression in Section 3.2. Here we refactor the implementation to make it reusable. First, we define a function to train for one epoch. Note that updater is a general function to update the model parameters, which accepts the batch size as an argument. It can be either a wrapper of the d2l/sgd function or a framework’s built-in optimization function.

(defn train-epoch-ch3 [net train-iter lr loss updater]
  (let [acc (atom [0 0 0])] ; [summed loss, correct predictions, examples seen]
    (doseq [param [W b]]
      (nd/set-requires-gradient param true))
    (doseq [batch (t/iter-seq train-iter)]
      (let [X (-> batch ds/get-data nd/head (nd/reshape [-1 num-inputs]))
            y (-> batch ds/get-labels nd/head)]
        (with-open [gc (-> (engine/get-instance) (engine/new-gradient-collector))]
          (let [y-hat (net X)
                l (loss y-hat y)]
            (t/backward gc l)
            (accumulate acc (nd/get-element (nd/sum l)) (accuracy y-hat y) (nd/size y)))))
      (updater [W b] lr batch-size)
      (ds/close batch))
    ;; return [average training loss, training accuracy] for this epoch
    [(/ (@acc 0) (@acc 2)) (/ (@acc 1) (@acc 2))]))

The training function then runs multiple epochs and prints the training progress.

Again, we use minibatch stochastic gradient descent to optimize the loss function of the model. Note that the number of epochs (num-epochs) and the learning rate (lr) are both adjustable hyper-parameters. By changing their values, we may be able to increase the classification accuracy of the model. In practice we will want to split our data three ways into training, validation, and test data, using the validation data to choose the best values of our hyper-parameters.

(defn sgd [params lr batch-size]
  (doseq [param params]
    (nd/-! param (nd// (nd/* (nd/get-gradient param) lr) batch-size))))
(defn train-ch3 [net train-ds test-ds lr loss num-epochs updater]
  (doseq [i (range num-epochs)]
    (let [train-metrics (train-epoch-ch3 net (ds/get-data-iterator train-ds ndm) lr loss updater)
          accuracy (evaluate-accuracy net (ds/get-data-iterator test-ds ndm))
          train-accuracy (get train-metrics 1)
          train-loss (get train-metrics 0)]
      (println "Epoch " i ": Test Accuracy: " accuracy)
      (println "Train Accuracy: " train-accuracy)
      (println "Train Loss: "train-loss))))
(def num-epochs 3)
(def lr 0.1)
(train-ch3 net (first fashion-mnist) (second fashion-mnist) lr cross-entropy num-epochs sgd)
Epoch  0 : Test Accuracy:  2067/2500
Train Accuracy:  10117/12000
Train Loss:  0.4646147914886475
Epoch  1 : Test Accuracy:  1041/1250
Train Accuracy:  25369/30000
Train Loss:  0.4579694811503092
Epoch  2 : Test Accuracy:  4169/5000
Train Accuracy:  25367/30000
Train Loss:  0.45321130771636964

3.7.7. Prediction

Now that training is complete, our model is ready to classify some images. Given a series of images, we will compare the model predictions (first line of text output) with their actual labels (second line of text output).

(defn predict-ch3 [net dataset ndmanager]
  (let [batch (first (ds/get-data-iterator dataset ndmanager))
        X (nd/head (ds/get-data batch))
        y-hat (nd/argmax (net X) 1)
        y (nd/head (ds/get-labels batch))]
    [y-hat y]))

(def prediction (predict-ch3 net (second fashion-mnist) ndm))
(println "Prediction:   " (take 20 (nd/to-vec (prediction 0))))
(println "Actual label: "(take 20 (map int (nd/to-vec (prediction 1)))))
Prediction:    (3 4 3 0 7 1 5 3 9 7 8 0 5 3 4 8 1 0 4 4)
Actual label:  (3 4 3 0 7 1 5 3 9 7 8 0 5 3 2 8 1 6 2 4)
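
As a quick check (not part of the original text), we can count how many of the 20 displayed predictions agree with their labels; == compares the long indices and float labels numerically.

(let [[y-hat y] prediction]
  (count (filter true?
                 (map == (take 20 (nd/to-vec y-hat))
                      (take 20 (nd/to-vec y))))))
;; counting the output above by hand: 17 of the 20 match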

3.7.8. Summary

With softmax regression, we can train models for multi-category classification. The training loop is very similar to that in linear regression: retrieve and read data, define models and loss functions, then train models using optimization algorithms. As you will soon find out, most common deep learning models have similar training procedures.

Author: Kimi Ma

Created: 2022-05-17 Tue 08:13