Introduction

In software engineering, artificial neural networks (ANNs) are a family of software systems that perform supervised learning or unsupervised pattern-recognition tasks. They are based on the idea that a network of neuron-like signal-processing elements can find an optimal set of parameters that fits a classifier function to the given data.

For classification tasks, feed-forward neural networks can be harder to train than Support Vector Machine (SVM) algorithms because they can easily get stuck in local minima or overfit the training data. SVMs extract their 'support vectors' from the data set, so the number of model parameters can grow with the data set, whereas neural networks fit a fixed number of parameters to the given data. SVMs also expect the user to choose a 'kernel', while in neural networks the function of the kernel is learned by the network itself. On the other hand, SVM algorithms do not scale to huge data sets the way neural networks do unless the data set is split up, and the layered architecture of neural networks is a more natural fit for deep learning.

Network Architectures

  • Multilayer Networks
  • Convolutional Networks
  • Recursive Networks


Network Size versus Training Effort

Interestingly, the loss functions of larger networks (larger in number of parameters rather than number of layers) have a disproportionately larger share of local minima with values close to the global minimum. That is, the chance of finding a close-to-optimal combination of parameters by wandering through parameter space with a local method (such as gradient descent) is higher in larger networks. It is therefore not recommended to reduce the size of the network to deal with overfitting. Instead, regularization techniques should be applied, such as weight decay (adding an L2 parameter penalty term), dropout, bagging, early stopping, or adding noise to the input.
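As a minimal sketch of one of these techniques, the snippet below shows how weight decay enters a gradient-descent update: the L2 penalty term (λ/2)·‖w‖² contributes λ·w to the gradient, shrinking weights toward zero on every step. The function name and constants are illustrative, not from the article.

```python
import numpy as np

def sgd_step_weight_decay(w, grad, lr=0.1, lam=0.01):
    """One SGD step with L2 weight decay.

    The penalty (lam/2) * ||w||^2 adds lam * w to the data gradient,
    so every step multiplies the weights by (1 - lr * lam) on top of
    the usual descent direction.
    """
    return w - lr * (grad + lam * w)

w = np.array([1.0, -2.0])
# With a zero data gradient, the update is pure shrinkage by (1 - lr*lam):
w_shrunk = sgd_step_weight_decay(w, grad=np.zeros(2))
```

The other listed techniques (dropout, bagging, early stopping, input noise) follow the same spirit: they penalize or perturb the model so that its effective capacity is reduced without shrinking the network itself.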

Parameter Initialization

The idea is to have unit variance (=1) at the outputs of a cell when training starts. The weights should be random and large enough to break the redundant symmetries between units in the network, while also small enough to prevent instability from overly large gradient values. A common heuristic for a generic network cell is to initialize the weights by sampling from the uniform distribution [math]\displaystyle{ \mathcal{U}(-\frac1{\sqrt{n}},\frac1{\sqrt{n}}) }[/math], where [math]\displaystyle{ n }[/math] is the number of inputs of the cell.
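This heuristic can be sketched in a few lines; the function name and seed handling are illustrative assumptions, not part of the article.

```python
import numpy as np

def init_uniform(n_in, n_out, seed=0):
    """Sample a weight matrix from U(-1/sqrt(n_in), 1/sqrt(n_in)).

    Random sampling breaks the symmetry between units; the 1/sqrt(n)
    bound keeps the variance of a cell's summed input close to that
    of its individual inputs at the start of training.
    """
    rng = np.random.default_rng(seed)
    bound = 1.0 / np.sqrt(n_in)
    return rng.uniform(-bound, bound, size=(n_in, n_out))

W = init_uniform(100, 50)  # weights for a cell with 100 inputs, 50 outputs
```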

For networks using rectified linear units (ReLU) as the activation function, the current best practice is to initialize the weights of a cell to [math]\displaystyle{ \sqrt{\frac{2}{n}}\mathcal{N}(0, 1) }[/math], where [math]\displaystyle{ \mathcal{N}(0,1) }[/math] is the Normal distribution with zero mean and unit variance. Biases are either initialized to zero or to a small value (0.01) that makes sure the ReLUs are in their active regions when training starts.
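A minimal sketch of this ReLU-oriented scheme (often called He initialization), with illustrative function names:

```python
import numpy as np

def init_he(n_in, n_out, seed=0):
    """He initialization for ReLU cells: N(0, 1) scaled by sqrt(2/n_in).

    The factor 2 compensates for ReLU zeroing out roughly half of the
    activations, preserving the variance of signals layer to layer.
    """
    rng = np.random.default_rng(seed)
    return np.sqrt(2.0 / n_in) * rng.normal(0.0, 1.0, size=(n_in, n_out))

def init_bias(n_out, value=0.01):
    """Small positive bias so the ReLUs start in their active region."""
    return np.full(n_out, value)

W = init_he(1000, 1000)   # weight std should be close to sqrt(2/1000)
b = init_bias(1000)
```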

Batch Normalization is a recent technique that inserts a normalization operation into the cell architecture, between the linear input weighting and the nonlinear activation function. The normalization operation is differentiable and therefore does not interfere with backpropagation.
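The forward pass of that normalization operation can be sketched as follows; the learned scale and shift parameters (here `gamma` and `beta`) restore the cell's representational power after normalization. Names and the epsilon value are illustrative assumptions.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Batch-normalize pre-activations over a mini-batch.

    x: (batch, features) outputs of the linear input weighting.
    gamma, beta: learned per-feature scale and shift. Every step
    here is differentiable, so gradients flow through unchanged.
    """
    mu = x.mean(axis=0)                    # per-feature batch mean
    var = x.var(axis=0)                    # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta            # rescale and shift
```

The nonlinear activation function would then be applied to the returned value instead of to the raw weighted input.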

Software Tools

Neural networks are well suited to parallel computation, and parallel computing techniques are used in their implementation to vastly reduce training times.

Resources
