1. Data Preprocessing

(ns clj-d2l.data-preprocess
  (:require [clj-djl.ndarray :as nd]
            [clj-d2l.core :as d2l]
            [clj-djl.dataframe :as df]
            [clj-djl.dataframe.functional :as functional]
            [clojure.java.io :as io]))

1.1. Reading the Dataset

As an example, we begin by creating an artificial dataset stored in a CSV (comma-separated values) file, data/house_tiny.csv. Data stored in other formats can be processed in similar ways.

Below we write the dataset row by row into the CSV file.

(let [filename "data/house_tiny.csv"
      records ["NumRooms,Alley,Price\n"  ;; Column names
               "NA,Pave,127500\n"        ;; Each row represents a data example
               "2,NA,106000\n"
               "4,NA,178100\n"
               "NA,NA,140000\n"]]
  (io/make-parents filename)
  (spit filename "")  ;; Start from an empty file so re-evaluation does not duplicate rows
  (doseq [record records]
    (spit filename record :append true))
  (slurp filename))
NumRooms,Alley,Price
NA,Pave,127500
2,NA,106000
4,NA,178100
NA,NA,140000

To load the raw dataset from the CSV file we just created, we use the clj-djl.dataframe namespace (aliased as df above) and call its dataframe function on the file path. This dataset has four rows and three columns, where each row describes the number of rooms (“NumRooms”), the alley type (“Alley”), and the price (“Price”) of a house.

(def data (df/dataframe "data/house_tiny.csv"))
data
data/house_tiny.csv [4 3]:

| NumRooms | Alley |  Price |
|---------:|-------|-------:|
|          |  Pave | 127500 |
|        2 |       | 106000 |
|        4 |       | 178100 |
|          |       | 140000 |
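
To make the treatment of the "NA" entries concrete, the following is a minimal sketch in plain Clojure. It is not how clj-djl.dataframe is implemented; parse-field and parse-csv are hypothetical helpers introduced only for illustration, and the sketch assumes that "NA" or an empty field marks a missing value and that no field contains a quoted comma.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper: map "NA" or an empty field to nil,
;; integer-looking fields to longs, and leave everything else as a string.
(defn parse-field [s]
  (cond
    (or (= s "NA") (str/blank? s)) nil
    (re-matches #"-?\d+" s)        (Long/parseLong s)
    :else                          s))

;; Hypothetical helper: parse simple comma-separated text (no quoting)
;; into a vector of maps keyed by column name.
(defn parse-csv [text]
  (let [[header & lines] (str/split-lines text)
        columns          (str/split header #",")]
    (mapv (fn [line]
            (zipmap columns (map parse-field (str/split line #","))))
          lines)))

(def csv-text
  (str "NumRooms,Alley,Price\n"
       "NA,Pave,127500\n"
       "2,NA,106000\n"
       "4,NA,178100\n"
       "NA,NA,140000\n"))

(def parsed-rows (parse-csv csv-text))

(count parsed-rows)
;; => 4
(first parsed-rows)
;; => {"NumRooms" nil, "Alley" "Pave", "Price" 127500}
```

Representing a missing value as nil keeps it distinct from any real value such as 0 or an empty string, which is what the blank cells in the table above convey.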

1.2. Handling Missing Data

Note that the blank cells in the table above are missing values; they correspond to the "NA" entries in the CSV file. Typical methods for handling missing data include imputation and deletion: imputation replaces missing values with substituted ones, while deletion discards the entries that contain them. Here we will consider imputation.
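
For contrast with the imputation used in the rest of this section, deletion can be sketched in plain Clojure as follows. This is an illustrative sketch only: the rows are written out by hand with nil for each missing value, and drop-incomplete-rows, missing-count, and drop-sparsest-column are hypothetical helpers, not part of clj-djl.

```clojure
;; The four example rows, with nil standing in for each missing value.
(def rows
  [{"NumRooms" nil, "Alley" "Pave", "Price" 127500}
   {"NumRooms" 2,   "Alley" nil,    "Price" 106000}
   {"NumRooms" 4,   "Alley" nil,    "Price" 178100}
   {"NumRooms" nil, "Alley" nil,    "Price" 140000}])

;; Row-wise deletion: keep only the rows in which every value is present.
(defn drop-incomplete-rows [rows]
  (filterv #(every? some? (vals %)) rows))

;; Number of missing values in one column.
(defn missing-count [rows column]
  (count (filter #(nil? (get % column)) rows)))

;; Column-wise deletion: remove the column with the most missing values.
(defn drop-sparsest-column [rows]
  (let [columns (keys (first rows))
        worst   (apply max-key #(missing-count rows %) columns)]
    (mapv #(dissoc % worst) rows)))

(drop-incomplete-rows rows)
;; => []   every row has at least one missing value

(missing-count rows "Alley")
;; => 3

(first (drop-sparsest-column rows))
;; => {"NumRooms" nil, "Price" 127500}   "Alley" was removed
```

On a dataset this small, row-wise deletion leaves nothing to work with, which is one reason to prefer imputation here.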

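We split the data into inputs and outputs by selecting the desired columns into new dataframes: inputs is derived from the first two columns (NumRooms and Alley), and outputs keeps only the last column (Price). For numerical values in inputs that are missing, we replace each missing entry with the mean of the same column; for NumRooms, the mean of 2 and 4 is 3. For the categorical Alley column, we treat a missing value as a category of its own. Since Alley takes only two values here, "Pave" and missing, we replace it with two indicator columns, Alley_Pave and Alley_nan: a row whose alley type is "Pave" gets Alley_Pave = 1 and Alley_nan = 0, and a row whose alley type is missing gets the opposite.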

(def dataframe
  (let [;; Replace missing NumRooms entries with the column mean
        data (df/replace-missing
              data ["NumRooms"] functional/mean)
        ;; Indicator column: 1 where Alley is missing, 0 otherwise
        data (df/update-column
              data "Alley_nan"
              (map #(if (nil? %) 1 0) (data "Alley")))
        ;; Indicator column: 1 where Alley is present ("Pave"), 0 otherwise
        data (df/update-column
              data "Alley_Pave"
              (map #(if (some? %) 1 0) (data "Alley")))
        inputs (df/select-columns
                data ["NumRooms" "Alley_Pave" "Alley_nan"])
        outputs (df/select-columns
                 data ["Price"])]
    [inputs outputs]))
(first dataframe)
data/house_tiny.csv [4 3]:

| NumRooms | Alley_Pave | Alley_nan |
|---------:|-----------:|----------:|
|        3 |          1 |         0 |
|        2 |          0 |         1 |
|        4 |          0 |         1 |
|        3 |          0 |         1 |
(second dataframe)
data/house_tiny.csv [4 1]:

|  Price |
|-------:|
| 127500 |
| 106000 |
| 178100 |
| 140000 |
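
The two transformations applied to inputs can also be written out in plain Clojure, which makes the arithmetic easy to check by hand. This is a sketch under stated assumptions rather than the clj-djl implementation: the column values are typed in directly with nil for missing entries, and mean-of-present, impute-mean, and indicator are hypothetical helper names.

```clojure
;; Column values from the example dataset, with nil for missing entries.
(def num-rooms [nil 2 4 nil])
(def alley     ["Pave" nil nil nil])

;; Mean of the non-missing values. Assumes at least one value is present;
;; Clojure returns a ratio when the sum is not evenly divisible.
(defn mean-of-present [xs]
  (let [present (remove nil? xs)]
    (/ (reduce + present) (count present))))

;; Mean imputation: replace every nil with the mean of the present values.
(defn impute-mean [xs]
  (let [m (mean-of-present xs)]
    (mapv #(if (nil? %) m %) xs)))

;; Indicator column: 1 where the predicate holds, 0 elsewhere.
(defn indicator [pred xs]
  (mapv #(if (pred %) 1 0) xs))

(impute-mean num-rooms)
;; => [3 2 4 3]     (2 + 4) / 2 = 3
(indicator some? alley)
;; => [1 0 0 0]     corresponds to Alley_Pave
(indicator nil? alley)
;; => [0 1 1 1]     corresponds to Alley_nan
```

The three resulting vectors match the NumRooms, Alley_Pave, and Alley_nan columns shown in the inputs table above.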

1.3. Conversion to the Tensor Format

Now that all the entries in inputs and outputs are numerical, they can be converted to the NDArray format. Once the data are in this format, they can be manipulated further with the NDArray functions introduced in Section 2.1.

(def ndm (nd/new-base-manager))
(def X (df/->ndarray ndm (first dataframe)))
(def Y (df/->ndarray ndm (second dataframe)))
X
ND: (4, 3) cpu() int32
[[ 3,  1,  0],
 [ 2,  0,  1],
 [ 4,  0,  1],
 [ 3,  0,  1],
]
Y
ND: (4, 1) cpu() int64
[[127500],
 [106000],
 [178100],
 [140000],
]
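
Conceptually, the conversion turns a table of named columns into a rectangular block of numbers with a shape. The sketch below illustrates that idea in plain Clojure using nested vectors; it does not use or describe the clj-djl API, and rows->matrix and shape are hypothetical helpers introduced for illustration.

```clojure
;; The imputed inputs from the table above, one map per row.
(def input-rows
  [{"NumRooms" 3, "Alley_Pave" 1, "Alley_nan" 0}
   {"NumRooms" 2, "Alley_Pave" 0, "Alley_nan" 1}
   {"NumRooms" 4, "Alley_Pave" 0, "Alley_nan" 1}
   {"NumRooms" 3, "Alley_Pave" 0, "Alley_nan" 1}])

;; Fix a column order and turn each row map into a vector of numbers.
(defn rows->matrix [columns rows]
  (mapv (fn [row] (mapv #(get row %) columns)) rows))

;; Shape of a rectangular nested vector as [rows columns].
(defn shape [matrix]
  [(count matrix) (count (first matrix))])

(def x-matrix
  (rows->matrix ["NumRooms" "Alley_Pave" "Alley_nan"] input-rows))

x-matrix
;; => [[3 1 0] [2 0 1] [4 0 1] [3 0 1]]
(shape x-matrix)
;; => [4 3]

;; Row-major flattening: the same values laid out as one flat sequence.
(vec (apply concat x-matrix))
;; => [3 1 0 2 0 1 4 0 1 3 0 1]
```

The column order has to be fixed explicitly because maps carry names rather than positions, whereas a tensor identifies each value only by its indices.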

1.4. Summary

  • Like many other extension packages in the vast Clojure ecosystem, clj-djl.dataframe can work together with NDArray.
  • Imputation and deletion can be used to handle missing data.

Author: Kimi Ma

Created: 2022-05-17 Tue 08:06