ELU activation: A comprehensive analysis


The Exponential Linear Unit (ELU), proposed by Djork-Arné Clevert et al. in 2015, is a variant of the ReLU nonlinearity. Across a range of experiments, ELU has been accepted by many researchers as a good successor to the original ReLU.

ELU(x, \alpha) = \begin{cases} x & \text{if } x > 0 \\ \alpha (e^x - 1) & \text{if } x \leq 0 \end{cases}

The ELU function for different values of \alpha. Note that \alpha is pre-defined by the user, i.e. it is not learned by the network.

ELU's output is never smaller than or equal to -\alpha, while the upper bound is unbounded (the output goes to infinity for large positive inputs); the output range is therefore (-\alpha, +\infty). Since it does not saturate on the positive side, ELU is considered a non-saturating nonlinearity.

In practice, \alpha is often set to 1.0 or chosen in the range [0.1, 0.3].

The derivative of ELU is given below. Note that for x \leq 0 the derivative equals \alpha e^x, which can conveniently be written as ELU(x, \alpha) + \alpha:

ELU'(x, \alpha) = \begin{cases} 1 & \text{if } x > 0 \\ ELU(x, \alpha) + \alpha & \text{if } x \leq 0 \end{cases}

The derivative of ELU.
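
To make the two formulas above concrete, here is a minimal NumPy sketch of ELU and its derivative (the function names and test values are my own, not from the paper):

```python
import numpy as np

def elu(x, alpha=1.0):
    """ELU: x for x > 0, alpha * (exp(x) - 1) for x <= 0."""
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0)) - 1))

def elu_grad(x, alpha=1.0):
    """ELU derivative: 1 for x > 0, alpha * exp(x) = ELU(x, alpha) + alpha otherwise."""
    return np.where(x > 0, 1.0, elu(x, alpha) + alpha)

x = np.linspace(-5, 5, 11)
print(elu(x))       # bounded below by -alpha, unbounded above
print(elu_grad(x))  # strictly positive everywhere, so gradients never die
```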

Advantages

\blacktriangleright The dying ReLU problem no longer exists: negative inputs still produce non-zero outputs and, more importantly, non-zero gradients, so neurons cannot become permanently inactive.

\blacktriangleright Outputs can be either positive or negative. Studies have shown that activations with zero-centered outputs help networks train faster. Although ELU's outputs are not distributed exactly around 0, the fact that it produces negative values makes it preferable to ReLU in this respect. In practice, networks with ELU tend to converge more quickly than with ReLU, even though the exponential in ELU (e^x) takes longer to compute.

\blacktriangleright ELU is not piecewise linear; its smooth exponential curve in the negative region lets it model non-linear relationships better.

\blacktriangleright The saturation plateau in the negative region bounds the effect of large negative inputs, which helps keep the representation robust to noise and the training stable.

Disadvantages

\blacktriangleright \alpha is fixed, not learned.

\blacktriangleright It still suffers from the exploding and vanishing gradient problems, so normalization methods may remain necessary in some cases (e.g. this paper). SELU, which builds on ELU with an innate ability to self-normalize, was introduced to address this problem.

\blacktriangleright It is not 0-centered. Although ELU does produce negative outputs, its mean activation is not exactly zero, which makes it seemingly sub-optimal. Another nonlinearity, Parametric ELU (PELU), was introduced to mitigate this issue.

\blacktriangleright No sparsity. ReLU outputs exactly 0 for negative inputs, which produces sparse activations; this is a drawback in some respects but a benefit in others, as elaborated in this post. ELU does not offer this kind of sparsity.


Performance Comparison

Experiments from ELU's authors

Experiment 1:

Objective: To compare the performance of ELU with ReLU and Leaky ReLU on a simple classification task.

Experimental Design:

  • Dataset: MNIST
  • Network: 8 hidden layers of 128 neurons each. The networks are trained with stochastic gradient descent (SGD), a learning rate of 0.1, and a mini-batch size of 64 (a sketch of this setup follows the list).
  • ELU with \alpha = 1
  • Leaky ReLU with \alpha = 0.1
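
A hedged PyTorch sketch of this classification setup is given below. The description only fixes the hidden-layer sizes, the optimizer, the learning rate, the batch size, and \alpha; the flattened 28×28 input, the 10-way output layer, and the loss wiring are assumptions.

```python
import torch
import torch.nn as nn

# A hedged sketch of the 8 x 128 MNIST classifier described above. Assumed details:
# flattened 28x28 inputs, a 10-way output layer, and no extra regularization.
def make_mlp(depth=8, width=128, alpha=1.0):
    layers, in_features = [nn.Flatten()], 28 * 28
    for _ in range(depth):
        layers += [nn.Linear(in_features, width), nn.ELU(alpha=alpha)]
        in_features = width
    layers.append(nn.Linear(in_features, 10))  # class logits
    return nn.Sequential(*layers)

model = make_mlp()                                       # the ELU network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # SGD with lr = 0.1, as in the setup
loss_fn = nn.CrossEntropyLoss()                          # cross-entropy, as reported in the result

# To compare against ReLU / Leaky ReLU, swap nn.ELU(alpha=alpha) for nn.ReLU()
# or nn.LeakyReLU(negative_slope=0.1).
```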

Result:

Networks' losses with the 3 types of nonlinearity on MNIST. Solid lines show training error; dotted lines show test error.

Verdict: While ReLU and Leaky ReLU behaved similarly, reaching a loss of around 0.15 after 10 epochs of training, ELU converged faster, reaching a cross-entropy loss of about 0.1 after only 5 epochs.

Experiment 2:

Objective: To compare the performance of ELU versus ReLU and Leaky ReLU on unsupervised learning.

Experimental Design:

  • Dataset: MNIST
  • Network: a deep autoencoder whose encoder consists of 4 fully connected hidden layers of sizes 1000, 500, 250, and 30, with a decoder that mirrors the encoder (a sketch of this architecture follows the list). Several different fixed learning rates are used. The networks are trained with stochastic gradient descent with a mini-batch size of 64.
  • ELU with \alpha = 1
  • Leaky ReLU with \alpha = 0.1
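
Below is a hedged PyTorch sketch of the autoencoder. The layer sizes follow the description; the output activation and the reconstruction loss are assumptions.

```python
import torch.nn as nn

# A hedged sketch of the deep autoencoder described above: encoder 784 -> 1000 -> 500
# -> 250 -> 30 with a mirrored decoder and ELU (alpha = 1) between layers. The output
# layer and the loss are assumptions (sigmoid + MSE reconstruction is one common choice).
sizes = [28 * 28, 1000, 500, 250, 30]

def stack(dims):
    layers = []
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        layers += [nn.Linear(d_in, d_out), nn.ELU(alpha=1.0)]
    return layers

encoder = nn.Sequential(*stack(sizes))
decoder = nn.Sequential(*stack(sizes[::-1])[:-1], nn.Sigmoid())  # drop last ELU, map back to pixels
autoencoder = nn.Sequential(encoder, decoder)
reconstruction_loss = nn.MSELoss()
```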

Result:

Network reconstruction errors with the different types of nonlinearity, measured on the test set. Blue, red, and green represent ELU, ReLU, and Leaky ReLU, respectively.

Verdict: This result implies the superiority of ELU over the other two activations, as its error rates are clearly the lowest for every choice of learning rate.

Experiment 3:

Objective: To compare the performance of ELU versus ReLU, Leaky ReLU and SReLU on a more complex supervised learning task.

Experimental Design:

  • Dataset: CIFAR-100, which contains color images in 100 classes (50k training and 10k test images). The images were preprocessed with global contrast normalization and ZCA whitening, and padded with 4 zero-valued pixels at all borders.
  • Network: a CNN with 11 convolutional layers arranged in stacks, with 2×2 max-pooling with a stride of 2 applied after each stack. Dropout and L2-weight decay regularization are used, and the learning rate is decayed over the course of training.
  • The networks are trained with and without BatchNorm.

(More details about the configuration can be found in the original paper.)

Result:

Without BatchNorm:

Networks' test errors (and their standard deviations) on CIFAR-100 using different nonlinearities, without BatchNorm.

With BatchNorm:

Performance comparison between ELU and each of the three other nonlinearities (ReLU, SReLU, and LReLU), with and without BatchNorm.

Verdict: ELU outperforms the other activations, even when they are helped by BatchNorm. Furthermore, there is some evidence that adding BatchNorm does not improve a network with ELU and may even increase its error rate.

Experiment 4:

Objective: To compare the performance of a CNN using ELU against other renowned CNN architectures.

Experimental Design:

  • Dataset: The CIFAR-10 and CIFAR-100 datasets. Data preprocessing is similar to the previous experiment.
  • Network: the CNN with ELU has 18 convolutional layers. The dropout rate, max-pooling, L2-weight decay, and momentum term are similar to the previous experiment. The initial learning rate is set to 0.01 and decreased by a factor of 10 after every 35k iterations (a sketch of this schedule is given below). The mini-batch size is 100.

(More details about the configuration can be found in the original paper.)
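
As a small illustration of the learning-rate schedule above, here is a hedged PyTorch sketch; the model and the momentum value are placeholders, not values taken from the paper.

```python
import torch
import torch.nn as nn

# Start at 0.01 and divide by 10 every 35k iterations, stepping per iteration.
model = nn.Linear(10, 10)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # momentum value assumed
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=35_000, gamma=0.1)

for iteration in range(105_000):
    # forward pass, loss.backward() and the real update would go here
    optimizer.step()   # no-op in this sketch since no gradients were computed
    scheduler.step()   # stepped once per iteration (not per epoch)

print(optimizer.param_groups[0]["lr"])  # 1e-05 after three decays
```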

Result:

Results comparing the network with ELU against other well-known architectures.

Verdict: The network with ELU achieved the lowest error rate on CIFAR-100 and ranked second on CIFAR-10.

Experiment 5:

Objective: To compare ELU and ReLU on the ImageNet dataset.

Experimental Design:

  • Dataset: the 1000-class ImageNet. Images are resized to 256×256 pixels and per-pixel mean subtracted. Training is performed on 224×224 random crops with random horizontal flipping; no further augmentation is applied.
  • Network: The network has 15 convolutional layers. 2×2 max-pooling with a stride of 2 is applied after each stack. Spatial pyramid pooling with 3 levels is added before the first fully-connected (FC) layer. An L2-weight decay term of 0.0005 is used. The dropout rate is 50% for the two penultimate FC layers.

Result:

Comparison between ELU and ReLU on the ImageNet dataset. Left: Top-5 test error; right: Top-1 test error.

Verdict: The network with ELU reduces the error rate faster than the one with ReLU in terms of the number of iterations: to reach a 20% Top-5 error rate, ELU needs only 160k iterations, versus 200k for ReLU.

Note, however, that training with ELU is about 5% slower than with ReLU.

Experiments from others

Experiment 1:

Source: Deep Residual Networks with Exponential Linear Unit

Objective: To compare the performance of the original ResNet (which uses a combination of ReLU and BatchNorm) with the same network modified to use ELU instead.

Experimental design:

  • Dataset: CIFAR-10
  • Network: The original ResNet (at various depths) is used as the benchmark. The new ResNet (the one with ELU) has a slightly modified block structure, as shown below; a hedged code sketch follows the figure caption.
On the left is the i-th residual block of the original ResNet; on the right is the modified version, which replaces ReLU with ELU and slightly rearranges the block's internals.
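
For illustration, here is a hedged PyTorch sketch of a residual block that uses ELU in place of ReLU. It is not the authors' exact block (their precise rearrangement is shown only in the paper's figure); it simply swaps the activation in a standard block.

```python
import torch
import torch.nn as nn

class EluResidualBlock(nn.Module):
    """Residual block with ELU instead of ReLU (illustrative; assumes the number of
    channels is unchanged and no downsampling happens inside the block)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.elu = nn.ELU(alpha=1.0)

    def forward(self, x):
        out = self.elu(self.bn1(self.conv1(x)))  # first conv -> BN -> ELU
        out = self.bn2(self.conv2(out))          # second conv -> BN
        return self.elu(out + x)                 # residual addition, then ELU
```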

Result:

The 2 architectures are trained with different depths.

Comparison of test error between the original and modified ResNet at various depths.

Verdict: The new ResNet with ELU gives higher performance regardless of the depth.

Experiment 2:

Source: Deep Residual Networks with Exponential Linear Unit

Objective: To compare the new network architecture (the modified ResNet using ELU) with other state-of-the-art methods.

Experimental design:

  • Dataset: CIFAR-10 and CIFAR-100
  • Network: The same modification is applied to ResNet as stated in the experiment above.

Result:

Performance of the ResNet with ELU compared with other methods.

Verdict: The modified ResNet with ELU shows the best performance.

Experiment 3:

Source: On the Impact of the Activation Function on Deep Neural Networks Training

Objective: The original purpose of this experiment was to highlight the effectiveness of the Edge of Chaos (EOC), a specific choice of initialization hyperparameters described in Schoenholz et al., 2017. However, the result can also be used to compare the time to convergence of networks using ELU, ReLU, and Tanh nonlinearities.

Experimental design:

  • Dataset: MNIST
  • Network: The networks have a depth of 200 and a width of 300. They are trained with RMSProp, a mini-batch size of 64, and a learning rate of 10^{-5}.

(More details about the configuration can be found in the original paper.)

Result:

Experimental results on the MNIST dataset. The upper figures show accuracy versus epochs; the lower figures show accuracy versus wall-clock time.

Verdict: In terms of speed, ELU outperforms ReLU and Tanh in this experiment, converging in noticeably fewer epochs and less wall-clock time than the other two activations.

Experiment 4:

Source: sonamsingh19.github.io

Objective: To compare the performance of the ReLU and ELU activation functions on a task with long-range dependencies.

Experimental design:

  • Dataset: MNIST, with the pixels fed to the network sequentially (one pixel at a time).
  • Network: an RNN whose recurrent weights are initialized with the identity matrix (an IRNN), as sketched below.
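
Below is a hedged PyTorch sketch of such an IRNN-style cell with ELU. The identity initialization follows the IRNN recipe of Le et al.; the hidden size and the way the pixel sequence is fed in are assumptions.

```python
import torch
import torch.nn as nn

class EluIRNNCell(nn.Module):
    """IRNN-style cell: recurrent weights initialized to the identity, with ELU as
    the nonlinearity instead of ReLU. Hidden size is an assumed placeholder."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.in2h = nn.Linear(input_size, hidden_size)
        self.h2h = nn.Linear(hidden_size, hidden_size, bias=False)
        nn.init.eye_(self.h2h.weight)  # identity initialization of recurrent weights
        self.act = nn.ELU(alpha=1.0)

    def forward(self, x_t, h):
        return self.act(self.in2h(x_t) + self.h2h(h))

# Usage: feed MNIST one pixel per step (sequence length 784, input size 1).
cell = EluIRNNCell(input_size=1, hidden_size=100)
h = torch.zeros(32, 100)         # batch of 32 hidden states
pixels = torch.rand(784, 32, 1)  # dummy pixel sequence
for x_t in pixels:
    h = cell(x_t, h)
```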

Result:

The experimental result.

Verdict: ELU reaches an accuracy of around 96% after 15 epochs while ReLU struggles below 85%.

Conclusion

In this article, we discussed the ELU activation function for deep learning. Compared to the canonical ReLU, ELU is a bit more expensive to compute; however, given its various advantages, networks using ELU are in practice not significantly slower than those using ReLU, and some experiments indicate that ELU can even speed up convergence considerably. Furthermore, there is evidence that ELU outperforms many other activations in terms of accuracy and error rate. Consequently, many researchers have adopted ELU as an enhanced replacement for ReLU as their default nonlinearity.

References:

  • The original ELU paper by Djork-Arné Clevert et al.: link
  • Deep Residual Networks with Exponential Linear Unit by Shah et al.: link
  • On the Impact of the Activation Function on Deep Neural Networks Training by Hayou et al.: link
  • ML Low hanging Fruit: If you are using RNN for long dependencies: Try ELU: link
  • A Simple Way to Initialize Recurrent Networks of Rectified Linear Units by Le et al.: link
  • Deep Information Propagation by Schoenholz et al.: link
