The Cardinal Edge


Weight initialization is the process of choosing the starting values of the weights in a neural network. The choice of method can strongly affect the network [2, 3, 6, 9] and can halt training if handled improperly; chosen carefully, it can greatly improve training and accuracy. The initialization method usually called Normalized Xavier is referred to as Nox in this paper to avoid confusion with the Xavier initialization method. This study analyzes five weight initialization methods (Nox, He, Xavier, Plutonian, and Self-Root), two of which are new to this study, combined with three activation functions (ReLU, Swish, and Tanh) on two datasets (MNIST [5] and US Census 1990 [4]). The methods are compared using the average mean squared error (MSE) of feedforward networks (FFNs), and significance is assessed with Mann-Whitney U tests. This study provides few definitive results beyond what other studies have already established, but it raises new questions and speculation that can hopefully be answered. The definitive results are as follows. When Swish is the activation function for all layers, Plutonian produces lower error than He, Nox, and Xavier, and Xavier produces higher error than any other initialization method, with statistical significance. When Tanh is the activation function for all layers, Self-Root produces higher error than any other initialization method. When ReLU is the activation function for all layers, Nox and He perform statistically very similarly. As for speculation, Plutonian proved quite flexible, which may indicate that it would yield low error in networks that use different activation functions in different layers. The Nox networks with Tanh activation performed better on MNIST, which could mean that, with Tanh, more neurons per layer lead to lower error under Nox.
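For orientation, the three established methods named above can be sketched as follows. This is a minimal sketch that assumes the formulas commonly given for these methods, which may differ from the exact variants used in this study. Plutonian and Self-Root are not defined in this abstract and are therefore omitted. The function names and the (n_in, n_out) weight layout are illustrative assumptions.

```python
import math
import random


def xavier_uniform(n_in, n_out, rng):
    """Xavier (assumed form): uniform in [-1/sqrt(n_in), 1/sqrt(n_in)]."""
    limit = 1.0 / math.sqrt(n_in)
    return [[rng.uniform(-limit, limit) for _ in range(n_out)]
            for _ in range(n_in)]


def nox_uniform(n_in, n_out, rng):
    """Normalized Xavier, "Nox" (assumed form): uniform in
    [-sqrt(6 / (n_in + n_out)), sqrt(6 / (n_in + n_out))]."""
    limit = math.sqrt(6.0 / (n_in + n_out))
    return [[rng.uniform(-limit, limit) for _ in range(n_out)]
            for _ in range(n_in)]


def he_normal(n_in, n_out, rng):
    """He (assumed form): Gaussian with mean 0 and std sqrt(2 / n_in)."""
    std = math.sqrt(2.0 / n_in)
    return [[rng.gauss(0.0, std) for _ in range(n_out)]
            for _ in range(n_in)]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    # Hypothetical layer sizes: 784 inputs (a flattened MNIST image), 64 units.
    for init in (xavier_uniform, nox_uniform, he_normal):
        weights = init(784, 64, rng)
        flat = [w for row in weights for w in row]
        print(init.__name__, "max |w| =", max(abs(w) for w in flat))
```

Each function returns an n_in by n_out nested list of weights. The only difference between the methods is how the scale of the random draw depends on the layer's fan-in and fan-out, and that scale is the quantity the comparison in this study varies.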