Project window. Using Optimizer on the CONFIG tab

Friday, November 03, 2017


Posted by Yoshiyuki Kobayashi

 

1 Specifying the range to apply the optimizer

Specify the range over which the optimizer is applied in the Effective Range from and to boxes, in units of iterations or epochs.

For example, to apply the optimizer from epoch 50 to epoch 100, enter 50 in the from box and 100 in the to box. To apply the optimizer over the entire range, from the first epoch to the maximum epoch, leave the from and to boxes blank.

 

2 Specifying the name of the network used for optimization

Set Network to the name of the network created on the EDIT tab.

 

3 Specifying the name of the dataset used for optimization

Set Dataset to the name of the dataset loaded on the DATASET tab. To use several datasets with the optimizer at the same time, specify the dataset names separated by commas.

 

4 Specifying the parameter update method

  1. From the Config list, select Optimizer.
  2. Select an updater from the following (Adam by default).
Updater Update expression
Adadelta  
$$g_t \leftarrow \Delta w_t\\
v_t \leftarrow - \frac{RMS \left[ v \right]_{t-1}}{RMS \left[ g \right]_t}g_t\\
w_{t+1} \leftarrow w_t + \eta v_t$$

Matthew D. Zeiler

ADADELTA: An Adaptive Learning Rate Method

https://arxiv.org/abs/1212.5701

Adagrad  
$$g_t \leftarrow \Delta w_t\\
G_t \leftarrow G_{t-1} + g_t^2\\
w_{t+1} \leftarrow w_t - \frac{\eta}{\sqrt{G_t} + \epsilon} g_t$$

John Duchi, Elad Hazan and Yoram Singer

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf

Adam  
$$m_t \leftarrow \beta_1 m_{t-1} + (1 - \beta_1) g_t\\
v_t \leftarrow \beta_2 v_{t-1} + (1 - \beta_2) g_t^2\\
w_{t+1} \leftarrow w_t - \alpha
\frac{\sqrt{1 - \beta_2^t}}{1 - \beta_1^t}
\frac{m_t}{\sqrt{v_t} + \epsilon}$$

Kingma and Ba

Adam: A Method for Stochastic Optimization.

https://arxiv.org/abs/1412.6980

Adamax  
$$m_t \leftarrow \beta_1 m_{t-1} + (1 - \beta_1) g_t\\
v_t \leftarrow \max\left(\beta_2 v_{t-1}, |g_t|\right)\\
w_{t+1} \leftarrow w_t - \frac{\alpha}{1 - \beta_1^t}
\frac{m_t}{v_t + \epsilon}$$

Kingma and Ba

Adam: A Method for Stochastic Optimization.

https://arxiv.org/abs/1412.6980

AMSGRAD  
$$m_t \leftarrow \beta_1 m_{t-1} + (1 - \beta_1) g_t\\
v_t \leftarrow \beta_2 v_{t-1} + (1 - \beta_2) g_t^2\\
\hat{v}_t \leftarrow \max\left(\hat{v}_{t-1}, v_t\right)\\
w_{t+1} \leftarrow w_t - \alpha \frac{m_t}{\sqrt{\hat{v}_t} + \epsilon}$$

Reddi et al.

On the Convergence of Adam and Beyond

https://openreview.net/pdf?id=ryQu7f-RZ

Momentum  
$$v_t \leftarrow \gamma v_{t-1} + \eta \Delta w_t\\
w_{t+1} \leftarrow w_t - v_t$$

Ning Qian

On the momentum term in gradient descent learning algorithms

http://www.columbia.edu/~nq6/publications/momentum.pdf

Nag  
$$v_t \leftarrow \gamma v_{t-1} - \eta \Delta w_t\\
w_{t+1} \leftarrow w_t - \gamma v_{t-1} + \left(1 + \gamma \right) v_t$$

Yurii Nesterov

A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2)

RMSprop  
$$g_t \leftarrow \Delta w_t\\
v_t \leftarrow \gamma v_{t-1} + \left(1 - \gamma \right) g_t^2\\
w_{t+1} \leftarrow w_t - \eta \frac{g_t}{\sqrt{v_t} + \epsilon}$$

Geoff Hinton

Lecture 6a: Overview of mini-batch gradient descent

http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf

Sgd $$w_{t+1} \leftarrow w_t - \eta \Delta w_t$$

 

w: Parameter to be updated

g: Gradient

η, α: Learning Rate or Alpha (the learning rate)

γ, β1, β2: Momentum or Decay, Beta1, Beta2 (decay parameters)

ε: Epsilon (a small value used to prevent division by zero)
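
As a concrete illustration of these update expressions, the following is a minimal NumPy sketch of the default Adam updater, transcribed directly from the expression in the table. It is a reference for understanding the formula only, not the tool's implementation; alpha, beta1, beta2, and eps correspond to the Alpha, Beta1, Beta2, and Epsilon settings of the Updater.

    import numpy as np

    def adam_update(w, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        """One Adam step, following the update expression in the table above."""
        m = beta1 * m + (1 - beta1) * grad           # first-moment estimate m_t
        v = beta2 * v + (1 - beta2) * grad ** 2      # second-moment estimate v_t
        step = alpha * np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)   # bias-corrected step size
        w = w - step * m / (np.sqrt(v) + eps)
        return w, m, v

    # Toy usage: minimize ||w||^2 starting from w = (1, -2)
    w = np.array([1.0, -2.0])
    m, v = np.zeros_like(w), np.zeros_like(w)
    for t in range(1, 201):
        w, m, v = adam_update(w, grad=2 * w, m=m, v=v, t=t)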

 

5 Setting the Weight Decay (L2 regularization) strength

Specify the weight decay coefficient in Weight Decay.
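
In terms of the notation above, Weight Decay adds the standard L2 penalty to the cost function, so the gradient passed to the updater gains an extra term proportional to the current weights (here λ stands for the value entered in Weight Decay):

$$g_t \leftarrow \Delta w_t + \lambda w_t$$

A larger coefficient therefore pulls the weights more strongly toward zero at every update.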

 

6 Specifying the learning rate decay mode

  1. From the Config list, select Optimizer.
  2. Select an LR Scheduler (Learning Rate Scheduler) for the updater from the following (Exponential by default).
  3. To perform Linear Warmup, select the Warmup check box, and specify the Warmup length in iterations or epochs.
LR Scheduler Description
Cosine  
The learning rate is decayed according to the following formula.
$$\eta_t=\frac{\eta_0}{2}\left(1+\cos\left(\frac{t}{T}\pi\right)\right)$$

t: Parameter update count up to the present
ηt: Learning rate to use for parameter updating
η0: Learning rate specified by Updater
T: Parameter update count up to training completion

Exponential  
The learning rate is decayed using the exponential function.

Specify the factor by which to multiply the learning rate in Learning Rate Multiplier. Specify the interval at which to apply it, in units of iterations (mini-batches) or epochs, in LR Update Interval.
For example, to multiply the learning rate by 0.9999 after every mini-batch, set Learning Rate Multiplier to 0.9999 and LR Update Interval to 1 iteration. To multiply the learning rate by 0.1 every 20 epochs, set Learning Rate Multiplier to 0.1 and LR Update Interval to 20 epochs.

Polynomial  
The learning rate is decayed using the following polynomial.
$$\eta_t=\eta_0\left(1-\frac{t}{T}\right)^p$$

Specify the exponent p in Power.
t: Parameter update count up to the present
ηt: Learning rate to use for parameter updating
η0: Learning rate specified by Updater
T: Parameter update count up to training completion

Step  
The learning rate is decayed by the specified multiplying factor at the specified iterations or epochs.

Specify the factor by which to multiply the learning rate in Learning Rate Multiplier. Specify the points at which to decay the learning rate in LR Update Steps as comma-separated values.
For example, to multiply the learning rate by 0.1 at 200,000, 300,000, and 400,000 iterations, set Learning Rate Multiplier to 0.1 and LR Update Steps to “200000,300000,400000” iterations.
To multiply the learning rate by 0.1 at 30, 60, and 80 epochs, set Learning Rate Multiplier to 0.1 and LR Update Steps to “30,60,80” epochs.
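
As a reference for how these schedules behave, the following is a minimal Python sketch that evaluates each of them as a function of the parameter update count, following the descriptions above. It is illustrative only: the argument names mirror the CONFIG settings but are placeholders, and the way Warmup is combined with the base schedule here is a simplification.

    import math

    def scheduled_lr(t, T, eta0, scheduler="Exponential", multiplier=0.9999,
                     update_interval=1, power=1.0, steps=(), warmup=0):
        """Learning rate for parameter update t (1-based) out of T total updates."""
        if scheduler == "Cosine":
            lr = eta0 / 2 * (1 + math.cos(t / T * math.pi))
        elif scheduler == "Exponential":
            lr = eta0 * multiplier ** (t // update_interval)
        elif scheduler == "Polynomial":
            lr = eta0 * (1 - t / T) ** power
        elif scheduler == "Step":
            lr = eta0 * multiplier ** sum(1 for s in steps if t >= s)
        else:
            lr = eta0                               # no decay
        if warmup and t <= warmup:                  # Linear Warmup over the first `warmup` updates
            lr *= t / warmup
        return lr

    # e.g. Step decay by a factor of 0.1 at updates 30, 60, and 80 out of 90
    print(scheduled_lr(t=70, T=90, eta0=0.001, scheduler="Step", multiplier=0.1, steps=(30, 60, 80)))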

 

7 Updating parameters once every several mini-batches

Specify the parameter update interval in Update Interval. For example, to accumulate the gradients of four mini-batches of 64 samples each and update the parameters once every four mini-batches, set Batch Size to 64 and Update Interval to 4.
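
Conceptually, this is gradient accumulation: the gradients of several consecutive mini-batches are combined and applied in a single updater step. The following is a minimal, self-contained NumPy sketch of the idea using plain SGD on a toy quadratic loss. All names are placeholders, and whether the accumulated gradients are summed or averaged is an implementation detail of the training engine; this sketch averages them.

    import numpy as np

    rng = np.random.default_rng(0)
    update_interval, eta = 4, 0.1          # Update Interval = 4; eta is a toy learning rate
    w = np.array([1.0, -2.0])              # parameters of a toy quadratic loss ||w||^2

    accumulated = np.zeros_like(w)
    for i in range(1, 101):                # 100 mini-batches (each would hold Batch Size samples)
        grad = 2 * w + rng.normal(scale=0.1, size=w.shape)    # stand-in for a back-propagated gradient
        accumulated += grad
        if i % update_interval == 0:       # parameters change only once every 4 mini-batches
            w -= eta * accumulated / update_interval          # one SGD step with the averaged gradient
            accumulated = np.zeros_like(w)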

 

Notes

To perform optimization using multiple training networks, Update Interval must be set to 1.

 

8 Adding a new optimizer

Click the hamburger menu or right-click the Config list to open a shortcut menu, and click Add Optimizer.

 

9 Renaming an optimizer

  1. Click the hamburger menu or right-click the Config list to open a shortcut menu, and click Rename.
    Or, on the Config list, double-click the optimizer you want to rename.

  2. Type the new name, and press Enter.

 

10 Deleting an optimizer

  1. From the Config list, select the optimizer you want to delete.
  2. Click the hamburger menu or right-click the Config list to open a shortcut menu, and click Delete.
    Or, press Delete on the keyboard.

 

11 Rearranging optimizers

  1. From the Config list, select the optimizer you want to rearrange.
  2. Click the hamburger menu or right-click the Config list to open a shortcut menu, and click Move Up or Move Down.