Dataset preparation

Friday, November 03, 2017


Posted by Yoshiyuki Kobayashi

A dataset is a collection of data used for neural network training and performance evaluation.



Using the sample project included in the Neural Network Console enables you to try a training example without having to create a dataset from scratch. For instructions on how to train using this sample dataset, see “Tutorial.” This section explains how to create a dataset on your own.



The Neural Network Console can create a dataset for image classification from data that has been sorted into one folder per category. For details, see “Creating an image classification dataset based on categorized folder-separated data”.


The Neural Network Console supports datasets in a CSV (comma-separated text) format that the Neural Network Console defines. This format covers a wide variety of data layouts, including image input-vector output (used in image classifier training), image input-image output (used in pixel-level classification and image filter training), and matrix input-vector output (used in classifier training based on other types of arbitrary vector or matrix data).


In order to train on the Neural Network Console, you need to prepare two datasets: a training dataset and a validation dataset. The training dataset is used for neural network training, and the validation dataset is used only for accuracy evaluation (but not for training). The file format of training datasets and validation datasets is the same.
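As an illustration of preparing the two datasets, the following minimal Python sketch splits one list of dataset rows (header first) into a training part and a validation part. The function name split_dataset, the 20% validation ratio, and the fixed seed are assumptions made for this example; the Neural Network Console does not prescribe any particular split.

```python
import random


def split_dataset(rows, validation_ratio=0.2, seed=0):
    """Split dataset rows (header first) into training and validation rows.

    Hypothetical helper: the ratio and seed are illustrative defaults only.
    """
    header, samples = rows[0], list(rows[1:])
    # Shuffle first so the split does not simply follow the file order.
    random.Random(seed).shuffle(samples)
    n_valid = int(round(len(samples) * validation_ratio))
    validation = [header] + samples[:n_valid]
    training = [header] + samples[n_valid:]
    return training, validation
```

Both returned lists start with the same header row, which matches the statement above that the two datasets share one file format.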


1 Basic structure of datasets

The dataset CSV file consists of a header (the first row) and data (all subsequent rows).


1.1 Header

Each cell in the header row indicates the variable name, dimensional index, and label name of the data in each column of the CSV file. The value of each cell in the header is expressed as variable name[__dimensional index][:label name], where the parts in square brackets are optional.


The variable name is the name used for identification within the Neural Network Console. You can use any character string for the variable name, but typically x is assigned to input data and y to output data. If there are multiple inputs and outputs, they are distinguished by appending a suffix number to x and y, as in x1, x2, … and y1, y2, …. The y columns are placed to the right of the x columns, and the x and y variables are each placed in ascending order of suffix from the left.


The dimensional index indicates which dimension of the vector each column of the CSV corresponds to, when the variable is in vector form. A dimensional index is expressed as a variable name followed by two underscores and a number. The dimensional index starts at 0. For example, a 10-dimensional vector takes on index numbers 0 to 9.


The label name indicates a simple name that represents each variable or each dimension of a variable. Unlike variable names, any character string that is easy for the user to understand can be assigned to a label name with no particular restrictions.



The Neural Network Console handles spaces in CSV files as normal characters. Be careful not to insert extraneous spaces after commas or the like.
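The header cell syntax described in this section can be sketched as a small parser. This is only an illustration of the naming rule, not the Console's own implementation; the names HeaderCell and parse_header_cell are hypothetical. Note that the sketch does not strip spaces, since spaces are treated as normal characters.

```python
import re
from typing import NamedTuple, Optional


class HeaderCell(NamedTuple):
    variable: str          # e.g. "x" or "y1"
    index: Optional[int]   # dimensional index, or None if absent
    label: Optional[str]   # label name, or None if absent


# variable name, optional "__<number>", optional ":<label name>"
_HEADER_PATTERN = re.compile(
    r"(?P<variable>[^:]+?)(?:__(?P<index>\d+))?(?::(?P<label>.*))?"
)


def parse_header_cell(cell: str) -> HeaderCell:
    """Split one header cell into variable name, index, and label name."""
    match = _HEADER_PATTERN.fullmatch(cell)
    if match is None:
        raise ValueError("not a valid header cell: %r" % cell)
    index = match.group("index")
    return HeaderCell(
        variable=match.group("variable"),
        index=int(index) if index is not None else None,
        label=match.group("label"),
    )
```

For example, parse_header_cell("x__3:petal width") yields the variable x, index 3, and label "petal width", while parse_header_cell(" x") keeps the leading space as part of the variable name.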


1.2 Data

Each row, starting from the second row, contains a data sample. For example, a CSV file with 1001 rows contains a total of 1000 data samples. A data sample consists of multiple variables specified by the header. Each cell from the second row contains a file name or value depending on the variable type.


If the variable is image data, enter an image file name in each cell. For a file name, you can use an absolute path or a path relative to the dataset CSV file. The Neural Network Console supports image files with .png, .jpg, .jpeg, .gif, .bmp, and .tif extensions with one (grayscale) or three (RGB) color channels. In the core library, images are handled as arrays whose dimensions are, in order, the number of channels, the height, and the width.


If the variable is a vector or matrix instead of an image, enter the name of the data CSV file in each cell. Data CSV files are CSV files that you prepare for each data sample separately from the dataset CSV file. A data CSV file does not have a header and consists only of value cells. The Neural Network Console handles a CSV file consisting of multiple rows and multiple columns as an array of shape (rows, columns). The actual values need to be preprocessed in advance so that the range of the data is roughly between -1.0 and 1.0.
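A minimal sketch of producing the contents of one data CSV file is shown below. Linear min-max scaling is an assumption chosen for this example (any preprocessing that brings values roughly into the -1.0 to 1.0 range would do), and the names scale_to_unit_range and to_data_csv_text are hypothetical.

```python
import csv
import io


def scale_to_unit_range(rows, lo, hi):
    """Linearly map values from [lo, hi] to [-1.0, 1.0] (assumed scaling)."""
    span = hi - lo
    return [[2.0 * (value - lo) / span - 1.0 for value in row] for row in rows]


def to_data_csv_text(rows):
    """Render a matrix as header-less CSV text with no spaces after commas."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

The returned text can be written to a file whose name is then entered in the corresponding cell of the dataset CSV file.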



For any given variable in a single dataset CSV file, all arrays must have the same number of elements. For example, images with different numbers of color channels, widths, or heights must not be mixed as images for the same variable x. If they differ, the resolution of all images must be unified in advance through preprocessing (e.g., resizing).
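Before loading a dataset, this size rule can be pre-checked for data CSV files with a short script. The sketch below is only an assumption about how such a check might look; the helper names data_csv_shape and find_shape_mismatches are hypothetical, and image files would need an image library that is not covered here.

```python
import csv
import io
from collections import Counter


def data_csv_shape(text):
    """Return (rows, columns) of header-less data CSV text."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    widths = {len(row) for row in rows}
    if len(widths) > 1:
        raise ValueError("rows have different numbers of columns")
    return (len(rows), widths.pop() if widths else 0)


def find_shape_mismatches(shapes):
    """Given {file name: shape}, list files that differ from the most common shape."""
    if not shapes:
        return []
    expected, _ = Counter(shapes.values()).most_common(1)[0]
    return sorted(name for name, shape in shapes.items() if shape != expected)
```

Any file names returned by find_shape_mismatches would need to be resized or otherwise preprocessed before they can share a variable with the others.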



You can check whether the dataset you created is in the proper dataset format by loading the dataset on the Neural Network Console’s DATASET tab and clicking Check Consistency on the shortcut menu. For details, see “Checking the consistency of datasets.”



The Neural Network Console loads the data in order from the beginning of the provided training dataset and optimizes parameters using mini-batch gradient descent, which processes the data in mini-batches whose size is specified by Batch Size on the CONFIG tab. For optimization to be efficient, each mini-batch should contain a wide variety of data. We therefore recommend shuffling the rows of a dataset CSV file in advance.
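The effect of row order on mini-batches can be sketched as follows. The code groups samples into consecutive mini-batches in file order, counts the distinct labels in each, and shuffles the data rows while keeping the header in place. The helper names and the assumption that the label is in the last column are made up for this illustration.

```python
import random


def mini_batches(samples, batch_size):
    """Group samples into consecutive mini-batches, in file order."""
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]


def labels_per_batch(samples, batch_size):
    """Count distinct labels (assumed to be the last column) in each mini-batch."""
    return [len({row[-1] for row in batch})
            for batch in mini_batches(samples, batch_size)]


def shuffle_rows(rows, seed=None):
    """Shuffle the data rows of a dataset while keeping the header row first."""
    header, samples = rows[0], list(rows[1:])
    random.Random(seed).shuffle(samples)
    return [header] + samples
```

With rows sorted by label and a batch size equal to the number of samples per class, labels_per_batch reports a single label in every mini-batch, which is the low-variation situation that shuffling is meant to avoid.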


2 Example of an image classification dataset

This section explains the format of datasets for training an image classifier using the MNIST handwritten digit classification sample dataset generated in the following folder as an example.



(* This file is automatically created when any of the sample project files that use the MNIST dataset is loaded. If the file is not available, load a sample project such as 01_logistic_regression. For details on loading sample projects, see “Tutorial.”)


This dataset consists of monochrome images of digits of size 28 by 28 pixels and label data that indicates which digit each image represents, from 0 to 9.


x:image,y:label
./training/5/0.png,5
./training/0/1.png,0
./training/4/2.png,4
./training/1/3.png,1
./training/9/4.png,9
./training/2/5.png,2
./training/1/6.png,1
./training/3/7.png,3
./training/1/8.png,1


In row 1, column 1, x:image indicates that the variable name of column 1 is x and that its label name is image. In row 1, column 2, y:label indicates that the variable name of column 2 is y and that its label name is label.


Each row from the second row indicates a single data sample. Column 1 indicates the image file name (relative path from the CSV file) of input x while column 2 indicates the correct category index. As there are 10 types of digits (0 to 9) for classification in the MNIST dataset, the category indexes take on values 0 to 9, which correspond to the digits “0” to “9.” For example, if the image given in column 1 is the digit “5,” category index 5, which corresponds to the digit “5,” is specified in column 2.
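For illustration, rows of this form can be generated from image paths whose parent folder name is the category index, as in ./training/5/0.png. This is a sketch under that assumption and is not the Console's built-in dataset creation feature; the function names are hypothetical.

```python
import csv
import io
from pathlib import PurePosixPath


def rows_from_image_paths(paths):
    """Build dataset rows, taking the category index from the parent folder name."""
    rows = [["x:image", "y:label"]]
    for path in paths:
        category = PurePosixPath(path).parent.name
        # int() raises ValueError if the folder name is not a number.
        rows.append([path, str(int(category))])
    return rows


def to_dataset_csv_text(rows):
    """Render dataset rows as CSV text with no spaces after commas."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

The paths are written exactly as given, so they should already be relative to the location where the dataset CSV file will be saved, or be absolute.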


In the MNIST dataset, the image data file indicated by variable x is a monochrome image of 28 by 28 pixels. As such, the array size of x on the Neural Network Console is (1,28,28). As the data indicated by variable y is one-dimensional, the array size of y is (1).