Table of Contents
1. Implementation of Softmax Regression from Scratch
(ns clj-d2l.softmax-from-scratch
  (:require [clojure.java.io :as io]
            [clj-djl.ndarray :as nd]
            [clj-djl.training.dataset :as ds]
            [clj-djl.model :as model]
            [clj-djl.nn :as nn]
            [clj-djl.training.loss :as loss]
            [clj-djl.training.tracker :as tracker]
            [clj-djl.training.optimizer :as optimizer]
            [clj-djl.training :as t]
            [clj-djl.training.listener :as listener]
            [clj-djl.engine :as engine]
            [clj-d2l.core :as d2l])
  (:import [ai.djl.basicdataset.cv.classification FashionMnist]
           [java.nio.file Paths]))
(def batch-size 256)
(def fashion-mnist (d2l/load-data-fashion-mnist batch-size))
#'clj-d2l.softmax-from-scratch/batch-size
#'clj-d2l.softmax-from-scratch/fashion-mnist
(.size (first fashion-mnist))
60000
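The corresponding test split, (second fashion-mnist), can be checked the same way; it should hold the 10,000 Fashion-MNIST test examples (a quick check mirroring the call above, not part of the original text):

(.size (second fashion-mnist))
;; expected: 10000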
1.1. Initializing Model Parameters
As in our linear regression example, each example here will be represented by a fixed-length vector. Each example in the raw dataset is a \(28 \times 28\) image. In this section, we will flatten each image, treating them as vectors of length 784. In the future, we will talk about more sophisticated strategies for exploiting the spatial structure in images, but for now we treat each pixel location as just another feature.
Recall that in softmax regression, we have as many outputs as there are classes. Because our dataset has 10 classes, our network will have an output dimension of 10. Consequently, our weights will constitute a \(784 \times 10\) matrix and the biases will constitute a \(1 \times 10\) row vector. As with linear regression, we will initialize our weights W with Gaussian noise and our biases to take the initial value 0.
(def num-inputs 784)
(def num-outputs 10)
(def ndm (nd/base-manager))
(def W (nd/random-normal ndm 0 0.01 [num-inputs num-outputs] :float32))
(def b (nd/zeros ndm [num-outputs] :float32))
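As a quick sanity check (a sketch, not part of the original text), the parameter shapes match the \(784 \times 10\) mapping described above:

(nd/shape W) ;; => (784, 10)
(nd/shape b) ;; => (10)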
1.2. Defining the Softmax Operation
Before implementing the softmax regression model, let us briefly review how the sum operator works along specific dimensions in a tensor, as discussed in Section 2.3-linear-algebra.html#lin-alg-reduction and Section 2.3-linear-algebra.html#lin-alg-non-reduction. Given a matrix X, we can sum over all elements (by default) or only over elements in the same axis, i.e., the same column (axis 0) or the same row (axis 1). Note that if X is a tensor with shape (2, 3) and we sum over the columns, the result will be a vector with shape (3,). When invoking the sum operator, we can specify to keep the number of axes in the original tensor, rather than collapsing out the dimension that we summed over. This will result in a two-dimensional tensor with shape (1, 3).
(def X (nd/create ndm [[1 2 3] [4 5 6]]))
(nd/sum X [0] true)
ND: (1, 3) cpu() int64 [[ 5, 7, 9], ]
(nd/sum X [1] true)
ND: (2, 1) cpu() int64 [[ 6], [15], ]
(nd/sum X [0 1] true)
ND: (1, 1) cpu() int64 [[21], ]
(nd/sum X [0 1] false)
ND: () cpu() int64 21
(nd/sum X)
ND: () cpu() int64 21
We are now ready to implement the softmax operation. Recall that softmax consists of three steps: (i) we exponentiate each term (using exp); (ii) we sum over each row (we have one row per example in the batch) to get the normalization constant for each example; (iii) we divide each row by its normalization constant, ensuring that the result sums to 1. Before looking at the code, let us recall how this looks expressed as an equation:

\[\mathrm{softmax}(\mathbf{X})_{ij} = \frac{\exp(\mathbf{X}_{ij})}{\sum_k \exp(\mathbf{X}_{ik})}\]
The denominator, or normalization constant, is also sometimes called the partition function (and its logarithm is called the log-partition function). The origins of that name are in statistical physics where a related equation models the distribution over an ensemble of particles.
(defn softmax [ndarray]
  (let [Xexp (nd/exp ndarray)
        partition (nd/sum Xexp [1] true)]
    (nd// Xexp partition)))
#'clj-d2l.softmax-from-scratch/softmax
As you can see, for any random input, we turn each element into a non-negative number. Moreover, each row sums up to 1, as is required for a probability.
(def X (nd/random-normal ndm [2 5]))
(softmax X)
ND: (2, 5) cpu() float32 [[0.2492, 0.2315, 0.2786, 0.0503, 0.1905], [0.1874, 0.4392, 0.2314, 0.0868, 0.0552], ]
(nd/sum (softmax X) [1])
ND: (2) cpu() float32 [1., 1.]
Note that while this looks correct mathematically, we were a bit sloppy in our implementation because we failed to take precautions against numerical overflow or underflow due to large or very small elements of the matrix.
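A common remedy, sketched below but not used in the rest of this section, is to subtract each row's maximum before exponentiating: the result is mathematically unchanged, but exp can no longer overflow. The row-wise max and the subtraction use DJL's Java interop (.max and .sub), since we have not used clj-djl wrappers for them elsewhere in this section.

(defn stable-softmax [ndarray]
  (let [row-max (.max ndarray (int-array [1]) true) ;; per-row maximum, keeping dims
        Xexp (nd/exp (.sub ndarray row-max))        ;; exponentiate the shifted values
        partition (nd/sum Xexp [1] true)]
    (nd// Xexp partition)))
(nd/sum (stable-softmax X) [1]) ;; each row still sums to 1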
1.3. Defining the Model
Now that we have defined the softmax operation, we can implement the softmax regression model. The code below defines how the input is mapped to the output through the network. Note that we flatten each original image in the batch into a vector using the reshape function before passing the data through our model.
(defn net [ndarray]
  (let [current-W W
        current-b b]
    (-> ndarray
        (nd/reshape [-1 num-inputs])
        (nd/dot current-W)
        (nd/+ current-b)
        softmax)))
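As a quick check (a sketch with a random dummy batch, not part of the original text), the untrained net maps two \(28 \times 28\) inputs to two rows of 10 probabilities, each summing to 1:

(def dummy-X (nd/random-normal ndm [2 1 28 28]))
(nd/shape (net dummy-X))   ;; => (2, 10)
(nd/sum (net dummy-X) [1]) ;; each row sums to 1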
1.4. Defining the Loss Function
Next, we need to implement the cross-entropy loss function, as introduced in Section 3.4-softmax-regression.html#sec-softmax. This may be the most common loss function in all of deep learning because, at the moment, classification problems far outnumber regression problems.
Recall that cross-entropy takes the negative log-likelihood of the predicted probability assigned to the true label. Rather than iterating over the predictions with a for-loop (which tends to be inefficient), we can pick all elements with a single operator. Below, we create sample data y-hat with 2 examples of predicted probabilities over 3 classes and their corresponding labels y. With y we know that in the first example the first class is the correct prediction, and in the second example the third class is the ground-truth. Using y as the indices of the probabilities in y-hat, we pick the probability of the first class in the first example and the probability of the third class in the second example.
(def y (nd/create ndm [0 2]))
(def y-hat (nd/create ndm [[0.1 0.3 0.6] [0.3 0.2 0.5]]))
(nd/get y-hat ":,{}" y)
ND: (2, 1) cpu() float64 [[0.1], [0.5], ]
Now we can implement the cross-entropy loss function efficiently with just one line of code.
(defn cross-entropy [y-hat y]
  (-> (nd/get y-hat ":, {}" (nd/to-type y :int32 false))
      (.log)
      (.neg)))
(cross-entropy y-hat y)
ND: (2, 1) cpu() float64 [[2.3026], [0.6931], ]
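These values are simply the negative logs of the probabilities we picked, as a quick check (not in the original text) confirms:

(map #(- (Math/log %)) [0.1 0.5])
;; => (2.302585092994046 0.6931471805599453)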
1.5. Classification Accuracy
Given the predicted probability distribution y-hat, we typically choose the class with the highest predicted probability whenever we must output a hard prediction. Indeed, many applications require that we make a choice. Gmail must categorize an email into “Primary”, “Social”, “Updates”, or “Forums”. It might estimate probabilities internally, but at the end of the day it has to choose one among the classes.
When predictions are consistent with the label class y, they are correct. The classification accuracy is the fraction of all predictions that are correct. Although it can be difficult to optimize accuracy directly (it is not differentiable), it is often the performance measure that we care most about, and we will nearly always report it when training classifiers.
To compute accuracy we do the following. First, if y-hat is a matrix, we assume that the second dimension stores prediction scores for each class. We use argmax to obtain the predicted class as the index of the largest entry in each row. Then we compare the predicted classes with the ground-truth y elementwise. Since the equality operator nd/= is sensitive to data types, we convert y to match the integer type of the predicted indices. The result is a tensor containing entries of 0 (false) and 1 (true). Taking the sum yields the number of correct predictions.
(defn accuracy [y-hat y]
  ;; when y-hat is a matrix of per-class scores, count how many argmax
  ;; predictions match the labels
  (when (> (nd/size (nd/shape y-hat)) 1)
    (-> (nd/argmax y-hat 1)
        (nd/= (nd/to-type y :int64 false))
        (nd/sum)
        (nd/get-element))))
We will continue to use the variables y-hat and y defined before as the predicted probability distributions and labels, respectively. We can see that the first example’s predicted class is 2 (the largest element of the row is 0.6, with index 2), which is inconsistent with the actual label, 0. The second example’s predicted class is 2 (the largest element of the row is 0.5, with index 2), which is consistent with the actual label, 2. Therefore, the classification accuracy rate for these two examples is 0.5.
(/ (accuracy y-hat y) (nd/size y))
1/2
Similarly, we can evaluate the accuracy for any model net on a dataset that is accessed via the data iterator data-iter.
(def fashion-mnist-train (first fashion-mnist))

(defn evaluate-accuracy [net data-iter]
  (let [acc (atom [0 0])]
    (doseq [batch data-iter]
      (let [X (nd/head (ds/get-data batch))
            y (nd/head (ds/get-labels batch))]
        (swap! acc update 0 + (accuracy (net X) y))
        (swap! acc update 1 + (nd/size y))
        (ds/close batch)))
    (reduce / @acc)))
#'clj-d2l.softmax-from-scratch/evaluate-accuracy
The accumulate function defined below is a utility for accumulating sums over multiple variables; we will use it in the training loop. In the above evaluate-accuracy function, we use an atom holding a vector with two entries to store both the number of correct predictions and the number of predictions, respectively. Both are accumulated over time as we iterate over the dataset.
(defn accumulate [atom x y z]
  (swap! atom update 0 + x)
  (swap! atom update 1 + y)
  (swap! atom update 2 + z))
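A quick usage sketch with illustrative values (not part of the original text):

(def demo-acc (atom [0 0 0]))
(accumulate demo-acc 1.5 2 10) ;; add a summed loss of 1.5, 2 correct predictions, 10 examples
(accumulate demo-acc 0.5 1 10)
@demo-acc ;; => [2.0 3 20]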
Because we initialized the net model with random weights, the accuracy of this model should be close to random guessing, i.e., 0.1 for 10 classes.
(evaluate-accuracy net (ds/get-data-iterator (second fashion-mnist) ndm))
119/2000
1.6. Training
The training loop for softmax regression should look strikingly familiar if you read through our implementation of linear regression in Section 3.2. Here we refactor the implementation to make it reusable. First, we define a function to train for one epoch. Note that updater is a general function to update the model parameters, which accepts the batch size as an argument. It can be either a wrapper of the d2l/sgd function or a framework’s built-in optimization function.
(defn train-epoch-ch3 [net train-iter lr loss updater]
  (let [acc (atom [0 0 0])] ;; running [total loss, correct predictions, examples seen]
    (doseq [param [W b]]
      (nd/set-requires-gradient param true))
    (doseq [batch (t/iter-seq train-iter)]
      (let [X (-> batch ds/get-data nd/head (nd/reshape [-1 num-inputs]))
            y (-> batch ds/get-labels nd/head)]
        (with-open [gc (-> (engine/get-instance) (engine/new-gradient-collector))]
          (let [y-hat (net X)
                l (loss y-hat y)]
            (t/backward gc l)
            (accumulate acc (nd/get-element (nd/sum l)) (accuracy y-hat y) (nd/size y)))))
      (updater [W b] lr batch-size)
      (ds/close batch))
    ;; return [average training loss, training accuracy] for the epoch
    [(/ (@acc 0) (@acc 2)) (/ (@acc 1) (@acc 2))]))
The training function then runs multiple epochs and reports the training progress.

Again, we use minibatch stochastic gradient descent to optimize the loss function of the model. Note that the number of epochs (num-epochs) and the learning rate (lr) are both adjustable hyperparameters. By changing their values, we may be able to increase the classification accuracy of the model. In practice we will want to split our data three ways into training, validation, and test data, using the validation data to choose the best values of our hyperparameters.
(defn sgd [params lr batch-size]
  (doseq [param params]
    ;; in-place update: param := param - lr * gradient / batch-size
    (nd/-! param (nd// (nd/* (nd/get-gradient param) lr) batch-size))))
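For reference, the update that sgd applies to each parameter \(\mathbf{w}\), with learning rate \(\eta\) and minibatch \(\mathcal{B}\), is the familiar minibatch rule:

\[\mathbf{w} \leftarrow \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \nabla_{\mathbf{w}} l^{(i)}(\mathbf{w})\]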
(defn train-ch3 [net train-ds test-ds lr loss num-epochs updater]
  (doseq [i (range num-epochs)]
    (let [train-metrics (train-epoch-ch3 net (ds/get-data-iterator train-ds ndm) lr loss updater)
          accuracy (evaluate-accuracy net (ds/get-data-iterator test-ds ndm))
          train-accuracy (get train-metrics 1)
          train-loss (get train-metrics 0)]
      (println "Epoch " i ": Test Accuracy: " accuracy)
      (println "Train Accuracy: " train-accuracy)
      (println "Train Loss: " train-loss))))
(def num-epochs 3)
(def lr 0.1)
(train-ch3 net (first fashion-mnist) (second fashion-mnist) lr cross-entropy num-epochs sgd)
Epoch 0 : Test Accuracy: 2067/2500
Train Accuracy: 10117/12000
Train Loss: 0.4646147914886475
Epoch 1 : Test Accuracy: 1041/1250
Train Accuracy: 25369/30000
Train Loss: 0.4579694811503092
Epoch 2 : Test Accuracy: 4169/5000
Train Accuracy: 25367/30000
Train Loss: 0.45321130771636964
1.7. Prediction
Now that training is complete, our model is ready to classify some images. Given a series of images, we will compare the model predictions (first line of text output) with their actual labels (second line of text output).
(defn predict-ch3 [net dataset ndmanager]
  (let [batch (first (ds/get-data-iterator dataset ndmanager))
        X (nd/head (ds/get-data batch))
        y-hat (nd/argmax (net X) 1)
        y (nd/head (ds/get-labels batch))]
    [y-hat y]))

(def prediction (predict-ch3 net (second fashion-mnist) ndm))
(println "Prediction: " (take 20 (nd/to-vec (prediction 0))))
(println "Actual label: " (take 20 (map int (nd/to-vec (prediction 1)))))
Prediction: (3 4 3 0 7 1 5 3 9 7 8 0 5 3 4 8 1 0 4 4)
Actual label: (3 4 3 0 7 1 5 3 9 7 8 0 5 3 2 8 1 6 2 4)
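To make the indices easier to read, we can map them to class names. The label order below follows the standard Fashion-MNIST convention and is spelled out here as an assumption rather than taken from clj-d2l, which may provide its own helper for this:

;; standard Fashion-MNIST class order (assumed here)
(def fashion-mnist-labels
  ["t-shirt" "trouser" "pullover" "dress" "coat"
   "sandal" "shirt" "sneaker" "bag" "ankle boot"])
(map fashion-mnist-labels (map int (take 10 (nd/to-vec (prediction 0)))))
;; => ("dress" "coat" "dress" "t-shirt" "sneaker" "trouser" "sandal" "dress" "ankle boot" "sneaker")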
1.8. Summary
With softmax regression, we can train models for multi-category classification. The training loop is very similar to that in linear regression: retrieve and read data, define models and loss functions, then train models using optimization algorithms. As you will soon find out, most common deep learning models have similar training procedures.