Paper

The Benefits of Over-parameterization at Initialization in Deep ReLU Networks

It has been noted in existing literature that over-parameterization in ReLU networks generally improves performance. While there could be several factors involved behind this, we prove some desirable theoretical properties at initialization which may be enjoyed by ReLU networks. Specifically, it is known that He initialization in deep ReLU networks asymptotically preserves variance of activations in the forward pass and variance of gradients in the backward pass for infinitely wide networks, thus preserving the flow of information in both directions. Our paper goes beyond these results and shows novel properties that hold under He initialization: i) the norm of hidden activation of each layer is equal to the norm of the input, and, ii) the norm of weight gradient of each layer is equal to the product of norm of the input vector and the error at output layer. These results are derived using the PAC analysis framework, and hold true for finitely sized datasets such that the width of the ReLU network only needs to be larger than a certain finite lower bound. As we show, this lower bound depends on the depth of the network and the number of samples, and by virtue of being a lower bound, over-parameterized ReLU networks are endowed with these desirable properties. For the aforementioned hidden activation norm property under He initialization, we further extend our theory and show that this property holds for a finite width network even when the number of data samples is infinite. Thus we overcome several limitations of existing papers, and show new properties of deep ReLU networks at initialization.

arXiv (Cornell University) · Published 2019-01-11

Authors: Arpit, Devansh · Bengio, Yoshua
