
1. Implementation of Softmax Regression from Scratch

(ns clj-d2l.softmax-from-scratch
  (:require [clojure.java.io :as io]
            [clj-djl.ndarray :as nd]
            [clj-djl.training.dataset :as ds]
            [clj-djl.model :as model]
            [clj-djl.nn :as nn]
            [clj-djl.training.loss :as loss]
            [clj-djl.training.tracker :as tracker]
            [clj-djl.training.optimizer :as optimizer]
            [clj-djl.training :as t]
            [clj-djl.training.listener :as listener]
            [clj-djl.engine :as engine]
            [clj-d2l.core :as d2l])
  (:import [ai.djl.basicdataset.cv.classification FashionMnist]
           [java.nio.file Paths]))
(def batch-size 256)
(def fashion-mnist (d2l/load-data-fashion-mnist batch-size))
#'clj-d2l.softmax-from-scratch/batch-size
#'clj-d2l.softmax-from-scratch/fashion-mnist
(.size (first fashion-mnist))
60000

1.1. Initializing Model Parameters

As in our linear regression example, each example here will be represented by a fixed-length vector. Each example in the raw dataset is a \(28 \times 28\) image. In this section, we will flatten each image, treating it as a vector of length 784. In the future, we will talk about more sophisticated strategies for exploiting the spatial structure in images, but for now we treat each pixel location as just another feature.

Recall that in softmax regression, we have as many outputs as there are classes. Because our dataset has 10 classes, our network will have an output dimension of 10. Consequently, our weights will constitute a \(784 \times 10\) matrix and the biases will constitute a \(1 \times 10\) row vector. As with linear regression, we will initialize our weights W with Gaussian noise and our biases to take the initial value 0.

(def num-inputs 784)
(def num-outputs 10)
(def ndm (nd/base-manager))
(def W (nd/random-normal ndm 0 0.01 [num-inputs num-outputs] :float32))
(def b (nd/zeros ndm [num-outputs] :float32))
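
As a quick check (illustrative; nd/shape is used the same way later in this section), the parameters have the shapes we just described:

(nd/shape W)
;; => (784, 10)
(nd/shape b)
;; => (10)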

1.2. Defining the Softmax Operation

Before implementing the softmax regression model, let us briefly review how the sum operator works along specific dimensions in a tensor, as discussed in Section 2.3-linear-algebra.html#lin-alg-reduction and Section 2.3-linear-algebra.html#lin-alg-non-reduction. Given a matrix X, we can sum over all elements (by default) or only over elements along the same axis, i.e., the same column (axis 0) or the same row (axis 1). Note that if X is a tensor with shape (2, 3) and we sum over the columns, the result will be a vector with shape (3,). When invoking the sum operator, we can ask it to keep the number of axes of the original tensor rather than collapsing the dimension we summed over; this results in a two-dimensional tensor with shape (1, 3).

(def X (nd/create ndm [[1 2 3] [4 5 6]]))
(nd/sum X [0] true)
ND: (1, 3) cpu() int64
[[ 5,  7,  9],
]
(nd/sum X [1] true)
ND: (2, 1) cpu() int64
[[ 6],
 [15],
]
(nd/sum X [0 1] true)
ND: (1, 1) cpu() int64
[[21],
]
(nd/sum X [0 1] false)
ND: () cpu() int64
21
(nd/sum X)
ND: () cpu() int64
21

We are now ready to implement the softmax operation. Recall that softmax consists of three steps: (i) we exponentiate each term (using exp); (ii) we sum over each row (we have one row per example in the batch) to get the normalization constant for each example; (iii) we divide each row by its normalization constant, ensuring that the result sums to 1. Before looking at the code, let us recall how this looks expressed as an equation:

\begin{equation} \mathrm{softmax}(\mathbf{X})_{ij} = \frac{\exp(\mathbf{X}_{ij})}{\sum_k \exp(\mathbf{X}_{ik})}. \end{equation}

The denominator, or normalization constant, is also sometimes called the partition function (and its logarithm is called the log-partition function). The origins of that name are in statistical physics where a related equation models the distribution over an ensemble of particles.

(defn softmax [ndarray]
  (let [Xexp (nd/exp ndarray)             ;; (i) exponentiate each term
        partition (nd/sum Xexp [1] true)] ;; (ii) row-wise normalization constant
    (nd// Xexp partition)))               ;; (iii) divide each row by its constant
#'clj-d2l.softmax-from-scratch/softmax

As you can see, for any random input, we turn each element into a non-negative number. Moreover, each row sums up to 1, as is required for a probability.

(def X (nd/random-normal ndm [2 5]))
(softmax X)
ND: (2, 5) cpu() float32
[[0.2492, 0.2315, 0.2786, 0.0503, 0.1905],
 [0.1874, 0.4392, 0.2314, 0.0868, 0.0552],
]
(nd/sum (softmax X) [1])
ND: (2) cpu() float32
[1., 1.]

Note that while this looks correct mathematically, we were a bit sloppy in our implementation because we failed to take precautions against numerical overflow or underflow due to large or very small elements of the matrix.
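
A common remedy, sketched here but not used in the remainder of this section, is to subtract each row's maximum before exponentiating: the factor \(\exp(-\max_k \mathbf{X}_{ik})\) cancels between numerator and denominator, so the result is unchanged mathematically, but exp can no longer overflow. Assuming nd/- behaves like the other arithmetic wrappers used above, and using DJL interop for the row-wise maximum:

(defn stable-softmax [ndarray]
  ;; subtract the row-wise max: mathematically a no-op, numerically a safeguard
  (let [shifted (nd/- ndarray (.max ndarray (int-array [1]) true))
        Xexp (nd/exp shifted)
        partition (nd/sum Xexp [1] true)]
    (nd// Xexp partition)))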

1.3. Defining the Model

Now that we have defined the softmax operation, we can implement the softmax regression model. The below code defines how the input is mapped to the output through the network. Note that we flatten each original image in the batch into a vector using the reshape function before passing the data through our model.

(defn net [ndarray]
  (-> ndarray
      (nd/reshape [-1 num-inputs]) ;; flatten each image into a length-784 vector
      (nd/dot W)
      (nd/+ b)
      softmax))
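
As a quick sanity check (illustrative, reusing nd/random-normal from above), a batch of two random 784-dimensional inputs should map to one 10-way distribution per example:

(nd/shape (net (nd/random-normal ndm [2 784])))
;; => (2, 10)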

1.4. Defining the Loss Function

Next, we need to implement the cross-entropy loss function, as introduced in Section 3.4-softmax-regression.html#sec-softmax. This may be the most common loss function in all of deep learning because, at the moment, classification problems far outnumber regression problems.

Recall that cross-entropy takes the negative log-likelihood of the predicted probability assigned to the true label. Rather than iterating over the predictions with a for-loop (which tends to be inefficient), we can pick all elements by a single operator. Below, we create sample data y-hat with 2 examples of predicted probabilities over 3 classes and their corresponding labels y. With y we know that in the first example the first class is the correct prediction and in the second example the third class is the ground-truth. Using y as the indices of the probabilities in y-hat, we pick the probability of the first class in the first example and the probability of the third class in the second example.

(def y (nd/create ndm [0 2]))
(def y-hat (nd/create ndm [[0.1 0.3 0.6][0.3 0.2 0.5]]))
(nd/get y-hat ":,{}" y)
ND: (2, 1) cpu() float64
[[0.1],
 [0.5],
]

Now we can implement the cross-entropy loss function efficiently with just one line of code.

(defn cross-entropy [y-hat y]
  ;; pick the predicted probability of the true class for each example,
  ;; then take the negative log
  (-> (nd/get y-hat ":, {}" (nd/to-type y :int32 false))
      (.log)
      (.neg)))

(cross-entropy y-hat y)
ND: (2, 1) cpu() float64
[[2.3026],
 [0.6931],
]

1.5. Classification Accuracy

Given the predicted probability distribution y-hat, we typically choose the class with the highest predicted probability whenever we must output a hard prediction. Indeed, many applications require that we make a choice. Gmail must categorize an email into “Primary”, “Social”, “Updates”, or “Forums”. It might estimate probabilities internally, but at the end of the day it has to choose one among the classes.

When predictions are consistent with the label class y, they are correct. The classification accuracy is the fraction of all predictions that are correct. Although it can be difficult to optimize accuracy directly (it is not differentiable), it is often the performance measure that we care most about, and we will nearly always report it when training classifiers.

To compute accuracy we do the following. First, if y-hat is a matrix, we assume that the second dimension stores prediction scores for each class. We use argmax to obtain the predicted class as the index of the largest entry in each row. Then we compare the predicted classes with the ground-truth y elementwise. Since the equality operator nd/= is sensitive to data types, we convert y to int64 to match the type that argmax returns. The result is a tensor containing entries of 0 (false) and 1 (true). Taking the sum yields the number of correct predictions.

(defn accuracy [y-hat y]
  (when (> (nd/size (nd/shape y-hat)) 1)
    (-> (nd/argmax y-hat 1)                ;; predicted class index per row
        (nd/= (nd/to-type y :int64 false)) ;; elementwise comparison with the labels
        (nd/sum)
        (nd/get-element))))                ;; number of correct predictions

We will continue to use the variables y-hat and y defined before as the predicted probability distributions and labels, respectively. We can see that the first example’s prediction class is 2 (the largest element of the row is 0.6 with the index 2), which is inconsistent with the actual label, 0. The second example’s prediction class is 2 (the largest element of the row is 0.5 with the index of 2), which is consistent with the actual label, 2. Therefore, the classification accuracy rate for these two examples is 0.5.

(/ (accuracy y-hat y) (nd/size y))
1/2

Similarly, we can evaluate the accuracy for any model net on a dataset that is accessed via the data iterator data-iter.

(def fashion-mnist-train (first fashion-mnist))
(defn evaluate-accuracy [net data-iter]
  (let [acc (atom [0 0])] ;; [number of correct predictions, number of predictions]
    (doseq [batch data-iter]
      (let [X (nd/head (ds/get-data batch))
            y (nd/head (ds/get-labels batch))]
        (swap! acc update 0 + (accuracy (net X) y))
        (swap! acc update 1 + (nd/size y))
        (ds/close batch)))
    (reduce / @acc))) ;; correct / total
#'clj-d2l.softmax-from-scratch/evaluate-accuracy

In the above evaluate-accuracy function, we created an atom holding a vector of two counters that store the number of correct predictions and the total number of predictions, respectively; both are accumulated as we iterate over the dataset. The accumulate function below is a utility that generalizes this pattern to three running sums, which the training loop will use to track the summed loss, the number of correct predictions, and the number of examples.

(defn accumulate [atom x y z]
  ;; add x, y, and z to the three running sums held in the atom's vector
  (swap! atom update 0 + x)
  (swap! atom update 1 + y)
  (swap! atom update 2 + z))
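
For instance (illustrative values; the metrics atom here is hypothetical):

(def metrics (atom [0 0 0]))
(accumulate metrics 0.5 200 256) ;; e.g. summed loss, correct count, batch size
@metrics
;; => [0.5 200 256]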

Because we initialized the net model with random weights, the accuracy of this model should be close to random guessing, i.e., 0.1 for 10 classes.

(evaluate-accuracy net (ds/get-data-iterator (second fashion-mnist) ndm))
119/2000

1.6. Training

The training loop for softmax regression should look strikingly familiar if you read through our implementation of linear regression in Section 3.2. Here we refactor the implementation to make it reusable. First, we define a function to train for one epoch. Note that updater is a general function to update the model parameters, which accepts the batch size as an argument. It can be either a wrapper of the d2l/sgd function or a framework’s built-in optimization function.

(defn train-epoch-ch3 [net train-iter lr loss updater]
  (let [acc (atom [0 0 0])] ;; [summed loss, correct predictions, number of examples]
    (doseq [param [W b]]
      (nd/set-requires-gradient param true))
    (doseq [batch (t/iter-seq train-iter)]
      (let [X (-> batch ds/get-data nd/head (nd/reshape [-1 num-inputs]))
            y (-> batch ds/get-labels nd/head)]
        (with-open [gc (-> (engine/get-instance) (engine/new-gradient-collector))]
          (let [y-hat (net X)
                l (loss y-hat y)]
            (t/backward gc l)
            (accumulate acc (nd/get-element (nd/sum l)) (accuracy y-hat y) (nd/size y)))))
      ;; update the parameters once per minibatch
      (updater [W b] lr batch-size)
      (ds/close batch))
    ;; return [average training loss, training accuracy]
    [(/ (@acc 0) (@acc 2)) (/ (@acc 1) (@acc 2))]))

The training function then runs multiple epochs and reports the training progress.

Again, we use minibatch stochastic gradient descent to optimize the loss function of the model. Note that the number of epochs (num-epochs) and the learning rate (lr) are both adjustable hyperparameters. By changing their values, we may be able to increase the classification accuracy of the model. In practice we would want to split our data three ways into training, validation, and test data, using the validation data to choose the best values of our hyperparameters.
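
Recall the update that the sgd function below applies to each parameter once per minibatch:

\begin{equation} \mathbf{w} \leftarrow \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \nabla_{\mathbf{w}} l^{(i)}(\mathbf{w}), \end{equation}

where \(\eta\) is the learning rate (lr) and \(|\mathcal{B}|\) is the batch size.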

(defn sgd [params lr batch-size]
  (doseq [param params]
    ;; in-place update: param <- param - lr * gradient / batch-size
    (nd/-! param (nd// (nd/* (nd/get-gradient param) lr) batch-size))))
(defn train-ch3 [net train-ds test-ds lr loss num-epochs updater]
  (doseq [i (range num-epochs)]
    (let [train-metrics (train-epoch-ch3 net (ds/get-data-iterator train-ds ndm) lr loss updater)
          accuracy (evaluate-accuracy net (ds/get-data-iterator test-ds ndm))
          train-accuracy (get train-metrics 1)
          train-loss (get train-metrics 0)]
      (println "Epoch " i ": Test Accuracy: " accuracy)
      (println "Train Accuracy: " train-accuracy)
      (println "Train Loss: "train-loss))))
(def num-epochs 3)
(def lr 0.1)
(train-ch3 net (first fashion-mnist) (second fashion-mnist) lr cross-entropy num-epochs sgd)
Epoch  0 : Test Accuracy:  2067/2500
Train Accuracy:  10117/12000
Train Loss:  0.4646147914886475
Epoch  1 : Test Accuracy:  1041/1250
Train Accuracy:  25369/30000
Train Loss:  0.4579694811503092
Epoch  2 : Test Accuracy:  4169/5000
Train Accuracy:  25367/30000
Train Loss:  0.45321130771636964

1.7. Prediction

Now that training is complete, our model is ready to classify some images. Given a series of images, we will compare the model predictions (first line of text output) with their actual labels (second line of text output).

(defn predict-ch3 [net dataset ndmanager]
  (let [batch (first (ds/get-data-iterator dataset ndmanager))
        X (nd/head (ds/get-data batch))
        y-hat (nd/argmax (net X) 1)
        y (nd/head (ds/get-labels batch))]
    [y-hat y]))

(def prediction (predict-ch3 net (second fashion-mnist) ndm))
(println "Prediction:   " (take 20 (nd/to-vec (prediction 0))))
(println "Actual label: "(take 20 (map int (nd/to-vec (prediction 1)))))
Prediction:    (3 4 3 0 7 1 5 3 9 7 8 0 5 3 4 8 1 0 4 4)
Actual label:  (3 4 3 0 7 1 5 3 9 7 8 0 5 3 2 8 1 6 2 4)

1.8. Summary

With softmax regression, we can train models for multi-category classification. The training loop is very similar to that in linear regression: retrieve and read data, define models and loss functions, then train models using optimization algorithms. As you will soon find out, most common deep learning models have similar training procedures.

Author: Kimi Ma

Created: 2022-05-17 Tue 07:53