Exponential Linear Unit (ELU), proposed by DjorkArné in 2015, is a variant of the socalled ReLU nonlinearity. Through various experiments, ELU is accepted by many researchers as a good successor of the original version (ReLU).
While ELU’s output may not be smaller than or equal to –, the upper bound is clearly undefined (it goes to infinity). Hence, ELU is a nonsaturated nonlinearity.
In practice, is often set to be 1.0 or in the range [0.1, 0.3].
The derivative of ELU is given by:
Advantages
The dying ReLU problem no longer exists, as both negative and positive inputs are transformed to non0 outputs.
Outputs can be either positive or negative. Studies showed that functions with 0centered outputs help networks train faster. Although ELU’s outputs are not distributed around 0, the fact that it does produce negative values makes it be preferred in this sense compared to ReLU. In practice, it seems that networks with ELU converge more quickly than with ReLU, even though the exponential computation in ELU () requires longer processing time.
ELU is not piecewise linear, this makes it model the nonlinearity better.
The plateau in the negative region helps in maintaining robustness and stability.
Disadvantages
is fixed, not learned.
It still suffers from Gradient Exploding and Gradient Vanishing Problem, thus the help of normalization methods may remain necessary sometimes (e.g. this paper). SELU, which was built from ELU with an innate ability to selfnormalize, was introduced to address this problem.
It is not 0centered. Although ELU does produce negative outputs, the fact that it is not 0centered makes it seemingly suboptimal. Another nonlinearity called Parametric ELU is introduced to mitigate this issue.
No sparsity. ReLU outputs 0 for negative inputs, which is bad in some senses but also good in some others, as elaborated in this post.
Test your understanding 

Performance Comparison
Experiment from ELU’s authors
Experiment 1:
Objective: To compare the performance of ELU with ReLU and Leaky ReLU on a simple classification task.
Experimental Design:
Result:
Verdict: While ReLU and Leaky ReLU had similar behaviors after 10 epochs of training with the loss around 0.15, ELU expressed a stronger power that nearly converged after 5 epochs with only 0.1 crossentropy loss.
Experiment 2:
Objective: To compare the performance of ELU versus ReLU and Leaky ReLU on unsupervised learning.
Experimental Design:
Result:
Verdict: This result implies the superiority of ELU over the other 2 activations while its error rates are clearly the lowest for every choice of learning rate.
Experiment 3:
Objective: To compare the performance of ELU versus ReLU, Leaky ReLU and SReLU on a more complex supervised learning task.
Experimental Design:
(More details about the configuration can be found in the paper here.)
Result:
Without BatchNorm:
With BatchNorm:
Verdict: It is shown that ELU outperforms the remaining activations, even when they are helped by BatchNorm. Furthermore, there is some evidence that adding BatchNorm does not enhance a network with ELU but may even make its error rate higher.
Experiment 4:
Objective: To compare the performance of a CNN network with ELU versus other renowned CNN architectures.
Experimental Design:
(More details about the configuration can be found in the paper here.)
Result:
Verdict: It is shown that the network with ELU gave the lowest error rate for CIFAR100. Furthermore, it is also ranked second on CIFAR10.
Experiment 5:
Objective: To compare ELU and ReLU on the ImageNet dataset.
Experimental Design:
Result:
Verdict: This result shows that the network with ELU reduces the error rate faster than with ReLU in terms of the number of iterations. To reach a 20% Top5 error rate, ELU needs only 160k iterations, this number is 200k iterations for ReLU.
Note that the speed of using ELU is slower by 5% compared with using ReLU.
Experiments from others
Experiment 1:
Source: Deep Residual Networks with Exponential Linear Unit
Objective: To compare the performance of the original ResNet (which uses a combination of ReLU + BatchNorm) and the same network but changed to use ELU instead.
Experimental design:
Result:
The 2 architectures are trained with different depths.
Verdict: The new ResNet with ELU gives higher performance regardless of the depth.
Experiment 2:
Source: Deep Residual Networks with Exponential Linear Unit
Objective: To compare the new network architecture (the modified ResNet using ELU) with other stateoftheart methods.
Experimental design:
Result:
Verdict: The modified ResNet with ELU shows the best performance.
Experiment 3:
Source: On the Impact of the Activation Function on Deep Neural Networks Training
Objective: The original purpose of this experiment is to highlight the effectiveness of Edge of Chaos (EOC), which is a specific choice of hyperparameters as described in Schoenholz, 2017. However, we can adapt the result to compare the time to convergence of networks using ELU, ReLU and Tanh nonlinearities.
Experimental design:
(More details about the configuration can be found in the paper here.)
Result:
Verdict: In terms of speed, ELU outperforms ReLU and Tanh in this experiment, given that it converges after sufficiently less time and epochs than the other 2 activations.
Experiment 4:
Source: sonamsingh19.github.io
Objective: To compare the performances of ReLU and ELU activation functions for long dependencies.
Experimental design:
Result:
Verdict: ELU reaches an accuracy of around 96% after 15 epochs while ReLU struggles below 85%.
Conclusion
In this article, we discussed the ELU activation function for Deep Learning. Compared to the canonical ReLU, ELU is a bit more involved in the computation, however, with its various advantages, in practice, networks using ELU is not significantly slower than using ReLU (in fact, some experiments indicate that using ELU may even boost the speed for a large amount). Furthermore, there is evidence that ELU outperforms many other activations in terms of accuracy and error rate. Henceforth, many researchers have assumed using ELU as an enhanced version to replacing ReLU as their default nonlinearity.
References:
 The original paper of ELU by DjorkArné: link
 Deep Residual Networks with Exponential Linear Unit by Shah et al.: link
 On the Impact of the Activation Function on Deep Neural Networks Training by Hayou et al.: link
 ML Low hanging Fruit: If you are using RNN for long dependencies: Try ELU: link
 A Simple Way to Initialize Recurrent Networks of Rectified Linear Units by Quoc et al.: link
 Deep Information Propagation by Schoenholz et al.: link