1. Data Preprocessing
(ns clj-d2l.data-preprocess
  (:require [clj-djl.ndarray :as nd]
            [clj-d2l.core :as d2l]
            [clj-djl.dataframe :as df]
            [clj-djl.dataframe.functional :as functional]
            [clojure.java.io :as io]))
1.1. Reading the Dataset
As an example, we begin by creating an artificial dataset stored in a CSV (comma-separated values) file, data/house_tiny.csv. Data stored in other formats may be processed in similar ways.
Below we write the dataset row by row into the CSV file.
(let [filename "data/house_tiny.csv"
      records ["NumRooms,Alley,Price\n"  ;; Column names
               "NA,Pave,127500\n"        ;; Each row represents a data example
               "2,NA,106000\n"
               "4,NA,178100\n"
               "NA,NA,140000\n"]]
  (io/make-parents filename)
  (dorun (map #(spit filename % :append true) records))
  (slurp filename))
NumRooms,Alley,Price
NA,Pave,127500
2,NA,106000
4,NA,178100
NA,NA,140000
To load the raw dataset from the CSV file we created, we require the clj-djl.dataframe package and call its dataframe function on the file path. This dataset has four rows and three columns, where each row describes the number of rooms (“NumRooms”), the alley type (“Alley”), and the price (“Price”) of a house.
(def data (df/dataframe "data/house_tiny.csv"))
data
data/house_tiny.csv [4 3]:

| NumRooms | Alley |  Price |
|---------:|-------|-------:|
|          | Pave  | 127500 |
|        2 |       | 106000 |
|        4 |       | 178100 |
|          |       | 140000 |
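To make the blank cells concrete, here is a minimal sketch that parses the same CSV text by hand with clojure.string, treating the literal string NA as a missing value (nil). The helpers parse-cell and parse-csv are hypothetical names introduced for illustration; they are not part of clj-djl, and the library's own parser may behave differently.

```clojure
(require '[clojure.string :as str])

;; The same content that was written to data/house_tiny.csv.
(def csv-text
  (str "NumRooms,Alley,Price\n"
       "NA,Pave,127500\n"
       "2,NA,106000\n"
       "4,NA,178100\n"
       "NA,NA,140000\n"))

(defn parse-cell
  "Hypothetical helper: \"NA\" becomes nil, digit strings become longs,
  anything else stays a string."
  [s]
  (cond
    (= s "NA")              nil
    (re-matches #"-?\d+" s) (Long/parseLong s)
    :else                   s))

(defn parse-csv
  "Hypothetical helper: returns a vector of maps keyed by column name."
  [text]
  (let [[header & rows] (str/split-lines text)
        columns         (str/split header #",")]
    (mapv (fn [row]
            (zipmap columns (map parse-cell (str/split row #","))))
          rows)))

(def rows (parse-csv csv-text))
```

Each nil in rows corresponds to one blank cell in the table printed above.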
1.2. Handling Missing Data
Note that the blank entries in the table are missing values. Typical methods for handling missing data include imputation and deletion: imputation replaces missing values with substituted ones, while deletion discards the rows or columns that contain them. Here we will consider imputation.
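Deletion is not demonstrated in this section, so the following is a minimal sketch of it in plain Clojure, assuming each row is a map and a missing value is nil. The names drop-incomplete-rows, missing-counts and drop-sparsest-column are hypothetical and are not clj-djl functions.

```clojure
;; The example dataset as a vector of maps; nil marks a missing value.
(def rows
  [{"NumRooms" nil "Alley" "Pave" "Price" 127500}
   {"NumRooms" 2   "Alley" nil    "Price" 106000}
   {"NumRooms" 4   "Alley" nil    "Price" 178100}
   {"NumRooms" nil "Alley" nil    "Price" 140000}])

(defn drop-incomplete-rows
  "Keep only the rows in which every value is present."
  [rows]
  (filterv #(every? some? (vals %)) rows))

(defn missing-counts
  "Map each column name to its number of nil entries."
  [rows]
  (into {}
        (for [col (keys (first rows))]
          [col (count (filter nil? (map #(get % col) rows)))])))

(defn drop-sparsest-column
  "Remove the column that has the most missing values."
  [rows]
  (let [[col _] (apply max-key val (missing-counts rows))]
    (mapv #(dissoc % col) rows)))
```

On this tiny dataset every row has at least one missing value, so (drop-incomplete-rows rows) returns an empty vector. Column-wise deletion would instead remove Alley, which has three missing entries.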
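We split the data into inputs and outputs by creating new dataframes that select the desired columns: the inputs are built from the first two columns, while the outputs keep only the last column. For numerical values in inputs that are missing, we replace the missing entries with the mean value of the same column. For the categorical “Alley” column, we treat a missing value as a category of its own and encode the column as two 0/1 indicator columns, “Alley_Pave” and “Alley_nan”.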
(def dataframe
  (let [data (df/replace-missing data ["NumRooms"] functional/mean)
        data (df/update-column data "Alley_nan"
                               (map #(if (nil? %) 1 0) (data "Alley")))
        data (df/update-column data "Alley_Pave"
                               (map #(if (some? %) 1 0) (data "Alley")))
        inputs (df/select-columns data ["NumRooms" "Alley_Pave" "Alley_nan"])
        outputs (df/select-columns data ["Price"])]
    [inputs outputs]))
(first dataframe)
data/house_tiny.csv [4 3]:

| NumRooms | Alley_Pave | Alley_nan |
|---------:|-----------:|----------:|
|        3 |          1 |         0 |
|        2 |          0 |         1 |
|        4 |          0 |         1 |
|        3 |          0 |         1 |
(second dataframe)
data/house_tiny.csv [4 1]:

|  Price |
|-------:|
| 127500 |
| 106000 |
| 178100 |
| 140000 |
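The same two steps, mean imputation and indicator encoding, can be sketched without the dataframe library, assuming rows are maps with nil for missing values. The helpers mean-impute and one-hot-alley are hypothetical illustrations and are not how clj-djl implements these operations. The imputed value is the integer 3 here only because (2 + 4) / 2 divides evenly; in general Clojure's / would return a ratio for integer inputs.

```clojure
;; The example dataset as a vector of maps; nil marks a missing value.
(def rows
  [{"NumRooms" nil "Alley" "Pave" "Price" 127500}
   {"NumRooms" 2   "Alley" nil    "Price" 106000}
   {"NumRooms" 4   "Alley" nil    "Price" 178100}
   {"NumRooms" nil "Alley" nil    "Price" 140000}])

(defn mean-impute
  "Replace nil entries of column col with the mean of its present values.
  Assumes the column has at least one present value."
  [rows col]
  (let [present (keep #(get % col) rows)
        mean    (/ (reduce + present) (count present))]
    (mapv #(update % col (fn [v] (if (nil? v) mean v))) rows)))

(defn one-hot-alley
  "Replace the Alley column with two 0/1 indicator columns."
  [rows]
  (mapv (fn [row]
          (let [alley (get row "Alley")]
            (-> row
                (dissoc "Alley")
                (assoc "Alley_Pave" (if (= alley "Pave") 1 0)
                       "Alley_nan"  (if (nil? alley) 1 0)))))
        rows))

;; Inputs: imputed NumRooms plus the two indicator columns.
(def inputs
  (->> (mean-impute rows "NumRooms")
       one-hot-alley
       (mapv #(select-keys % ["NumRooms" "Alley_Pave" "Alley_nan"]))))

;; Outputs: the Price column only.
(def outputs (mapv #(get % "Price") rows))
```

Unlike the dataframe code, this sketch marks Alley_Pave only when the value equals "Pave", so it would remain correct if a further alley type were added.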
1.3. Conversion to the Tensor Format
Now that all the entries in inputs and outputs are numerical, they can be converted to the NDArray format. Once the data are in this format, they can be further manipulated with the NDArray functions introduced in Section 2.1.
(def ndm (nd/new-base-manager))
(def X (df/->ndarray ndm (first dataframe)))
(def Y (df/->ndarray ndm (second dataframe)))
X
ND: (4, 3) cpu() int32
[[ 3,  1,  0],
 [ 2,  0,  1],
 [ 4,  0,  1],
 [ 3,  0,  1],
]
Y
ND: (4, 1) cpu() int64
[[127500],
 [106000],
 [178100],
 [140000],
]
1.4. Summary
- Like many other extension packages in the vast ecosystem of Clojure, clj-djl.dataframe can work together with NDArray.
- Imputation and deletion can be used to handle missing data.