FAQ – During Training Execution

Friday, November 03, 2017


Posted by Yoshiyuki Kobayashi

“Creating cache data…” takes an extremely long time.

To speed up training, cache data is created from the dataset provided in CSV format. If there are many training data samples and disk access is slow, creating this cache data may take a long time.
If you store datasets on slow storage, such as a network drive or an archival hard disk, consider using an NVMe SSD or similar.
The cache data is created in a folder whose name is the CSV file name followed by “.cache”.
By opening the cache data folder, you can check the cache creation status (a file is created for every 100 data samples).
If you select the Enable Dataset Cache check box on the CONFIG tab, previously created cache data is reused, so cache creation is skipped in subsequent runs. If you update the CSV file, you must clear the Enable Dataset Cache check box and run training again so that the cache data is recreated.
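The cache creation status described above can also be estimated with a short script. The following is a minimal sketch, not part of Neural Network Console: the function name cache_progress is hypothetical, and it assumes that every file in the cache folder corresponds to 100 data samples. If the folder also contains auxiliary files, the estimate will be slightly too high.

```python
import os

def cache_progress(csv_path, total_samples, samples_per_file=100):
    """Estimate cache creation progress (0.0 to 1.0) from the file count."""
    # Folder name: CSV file name followed by ".cache" (as stated in this FAQ).
    cache_dir = csv_path + ".cache"
    if not os.path.isdir(cache_dir):
        return 0.0
    # Assumption: each regular file holds up to `samples_per_file` samples.
    n_files = sum(
        1 for name in os.listdir(cache_dir)
        if os.path.isfile(os.path.join(cache_dir, name))
    )
    expected_files = -(-total_samples // samples_per_file)  # ceiling division
    if expected_files == 0:
        return 0.0
    return min(1.0, n_files / expected_files)
```

For example, cache_progress("train.csv", 60000) compares the number of files in "train.csv.cache" against the 600 files expected for 60,000 samples.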

Loss and error on the validation data do not decrease monotonically.

This may be because the number of data samples is insufficient or the network structure is not appropriate. See also “Training does not progress properly such as NaN being displayed for the evaluation values” in the FAQ.
Another reason may be that similar data samples appear consecutively (unshuffled) in the training dataset. Neural Network Console loads data in order from the beginning of the provided training dataset and optimizes parameters with mini-batch gradient descent, which processes data in units of the size specified by Batch Size on the CONFIG tab. For optimization to be efficient, the data in each mini-batch should ideally be shuffled. The situation may improve if you shuffle the rows in the dataset CSV file in advance or select the Shuffle check box on the CONFIG tab to randomly shuffle the dataset at execution time.
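Shuffling the rows of the dataset CSV file in advance can be done with a few lines of Python. This is a minimal sketch under the assumption that the first row of the CSV file is a header row that must stay in place; the function name shuffle_csv_rows and the file names are hypothetical.

```python
import csv
import random

def shuffle_csv_rows(src_path, dst_path, seed=None):
    """Write a copy of the CSV with the data rows shuffled and the header kept first."""
    with open(src_path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))
    if not rows:
        # Empty input: write an empty output file and stop.
        open(dst_path, "w", encoding="utf-8").close()
        return
    # Assumption: the first row is the header; only the rest is shuffled.
    header, data = rows[0], rows[1:]
    random.Random(seed).shuffle(data)
    with open(dst_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(data)
```

Calling shuffle_csv_rows("train.csv", "train_shuffled.csv", seed=0) produces a reproducible shuffled copy and leaves the original file untouched.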

Training does not progress properly such as NaN being displayed for the evaluation values.

Adjust the input data values so that they fall roughly between -1.0 and 1.0. If some data samples deviate greatly from this range, NaN may occur during the neural network computation (for image input, values are normalized to the range 0.0 ≤ x < 1.0 by default).
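One simple way to bring values into this range is min-max scaling. The sketch below is only an illustration: the function names fit_range and scale are hypothetical, and min-max scaling is one possible choice among several (standardization is another). The range should be computed from the training data and then reused for the validation data.

```python
def fit_range(values):
    """Return the (minimum, maximum) of the training values."""
    return min(values), max(values)

def scale(values, lo, hi):
    """Linearly map values from [lo, hi] to [-1.0, 1.0]."""
    if hi == lo:
        # Constant column: map everything to 0.0 to avoid division by zero.
        return [0.0 for _ in values]
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]
```

Values outside the fitted range (for example, in validation data) map slightly outside -1.0 to 1.0, which is consistent with the "roughly" in the guidance above.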
The initialization function (Initializer) for Weight of Affine or Convolution may not be set properly.
The combination of the Loss layer and the Activation layer preceding it may not be appropriate. To perform regression on continuous values, specify a SquaredError layer without an Activation layer. To perform binary classification of 0s and 1s, specify Sigmoid + BinaryCrossEntropy layers. To solve a category classification problem, specify Softmax + CategoricalCrossEntropy layers.
You may be able to avoid the problem by replacing layers such as Tanh, in which NaN can occur (via inf) when the input value deviates from a given range, with a layer such as ReLU, in which NaN does not occur.

“MemoryError” or “Out of memory” appears in the log output.

There is not enough memory. If you are using a GPU, consider using a GPU with more memory. Reproducing cutting-edge image classification results requires at least 12 GB of GPU memory, with 8 GB being the bare minimum.
To resolve the memory shortage without changing the GPU, you can make the input data smaller (for example, downsample images to a lower resolution), set Batch Size on the CONFIG tab to a smaller value (change the default value of 100 to 50, 20, or even 1), or edit the network structure so that it uses less memory (refer to Output and CostParameter in the network statistics at the lower right of the EDIT tab).
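The effect of Batch Size on memory can be seen with a back-of-the-envelope calculation. This is a rough sketch, not how Neural Network Console accounts for memory: it assumes 32-bit floats, assumes that Output is the number of layer output values per data sample, and assumes that a gradient buffer of the same size is kept for every output and parameter. Optimizer state and workspace memory are ignored, so real usage will be higher. The function name estimate_training_bytes is hypothetical.

```python
def estimate_training_bytes(output_per_sample, cost_parameter, batch_size,
                            bytes_per_value=4):
    """Rough lower bound on training memory, in bytes."""
    # Layer outputs scale with the batch size; parameters do not.
    activations = output_per_sample * batch_size
    # Factor 2: one buffer for the values, one for their gradients (assumption).
    values = 2 * (activations + cost_parameter)
    return values * bytes_per_value
```

With a hypothetical network that has 1,000,000 outputs per sample and 500,000 parameters, the estimate drops from about 804 MB at Batch Size 100 to about 404 MB at Batch Size 50, because the layer outputs dominate.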

After loading trained parameters and editing the network, an error occurs when training is started.

The parameter size of the edited layer may not match the size of the trained parameters. Check that the size of the loaded trained parameters (for Affine, the parameter size of the .txt file specified by the W.File property) matches the layer parameter size (for Affine, input size × output size). If they do not match, the trained parameters cannot be used, so leave the property that specifies the trained parameters blank.
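The size comparison for an Affine layer can be sketched as follows. The function names are hypothetical, and count_values assumes that the .txt parameter file stores numbers separated by whitespace or commas, which may not match the actual file layout; adjust the parsing to your file if it differs.

```python
import math
import re

def affine_weight_size(input_shape, output_shape):
    """Expected number of Affine weights: input size x output size."""
    return math.prod(input_shape) * math.prod(output_shape)

def count_values(path):
    """Count the tokens in a text file, split on whitespace and commas.

    Assumption: the file contains only numeric values with these separators.
    """
    with open(path, encoding="utf-8") as f:
        return len([tok for tok in re.split(r"[\s,]+", f.read()) if tok])
```

For example, an Affine layer with input 1,28,28 and output 100 expects affine_weight_size((1, 28, 28), (100,)) = 78400 weights; if count_values on the file set in W.File returns a different number, the trained parameters do not fit the edited layer.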

An error appears in the log output, but I do not understand the meaning.

If the error is caused by a dataset, you may be able to identify its cause by selecting Check Consistency on the shortcut menu on the DATASET tab. For details, see “Checking the consistency of datasets.”