Scaling and Activation Functions


NeuroShell 2's Advanced System gives you the option of using several different methods to scale data for neuron values in the input layer and to select different activation functions for each slab that is not in the input layer.


You can set these options in the Advanced System's Design Module by opening the Architecture and Parameters module and clicking on the desired network architecture.  (The scaling option is available for all types of networks except self-organizing Kohonen networks.  The activation functions may be changed in Backpropagation networks only.)   The large diagram of the network architecture includes a box for each slab, which allows you to select a scaling function if the slab is in the input layer or an activation function if it is in another layer.


Scaling Functions

When variables are loaded into a neural network, they must be scaled from their numeric range into the numeric range that the neural network deals with efficiently.  There are two main numeric ranges the networks commonly operate in, depending upon the type of activation functions you use:  zero to one, denoted [0, 1], and minus one to one, denoted [-1, 1].  The square brackets denote that NeuroShell 2 will clip values in later, new data that fall below or above the range of the original data.  If the numbers are scaled into the same ranges, but values outside the original range are allowed later (i.e., they are not clipped at the bottom or top), then we denote the ranges <<0, 1>> and <<-1, 1>>.  For example, if data from 0 to 100 is scaled to [0, 1], then a later data value of 120 will be scaled to 1.  However, if the same data were scaled to <<0, 1>>, then 120 would be scaled to 1.2.
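The difference between the clipped and unclipped ranges can be sketched in a few lines of Python.  This is only an illustration of the arithmetic described above, not NeuroShell 2's own code; the function name scale_linear and its parameters are hypothetical.

```python
def scale_linear(value, lo, hi, out_lo=0.0, out_hi=1.0, clip=True):
    """Map value linearly from the data range lo..hi onto out_lo..out_hi.

    clip=True  corresponds to the [ ] ranges (out-of-range values are clipped).
    clip=False corresponds to the << >> ranges (out-of-range values pass through).
    The name and signature are hypothetical, chosen for this sketch.
    """
    if hi == lo:
        raise ValueError("lo and hi must differ")
    scaled = out_lo + (value - lo) * (out_hi - out_lo) / (hi - lo)
    if clip:
        scaled = max(out_lo, min(out_hi, scaled))
    return scaled


# Data originally ranged from 0 to 100; a later value of 120 arrives.
print(scale_linear(120, 0, 100, clip=True))    # [0, 1]   -> 1.0
print(scale_linear(120, 0, 100, clip=False))   # <<0, 1>> -> 1.2
```

The same function covers [-1, 1] and <<-1, 1>> by passing out_lo=-1.0 and out_hi=1.0.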


In addition to the aforementioned linear scaling functions, there are two non-linear scaling functions:  logistic and tanh.  The logistic function scales data to (0, 1) according to the following formula:

f(value) = 1 / (1 + exp(-(value - mean) / sd))

where mean is the average of all of the values of that variable in the pattern file, and sd is the standard deviation of those values.


Note: The ( ) instead of [ ] indicates that the data never actually gets to 0 or 1.


Tanh scales to (-1, 1) according to:

f(value) = tanh((value - mean) / sd)

where tanh is the hyperbolic tangent.


Both of these functions will tend to squeeze together data at the low and high ends of the original data range.  They may thus be helpful in reducing the effects of "outliers."  They have an additional advantage in that no new data, no matter how large, is ever clipped or scaled out of range.  Try both linear and non-linear scaling methods to see which works best for your data.  The default scaling function is [-1, 1], which is linear.
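The following Python sketch shows how the two non-linear scalings squeeze an outlier toward the end of the range.  It assumes the value is standardized as (value - mean) / sd before the logistic or tanh is applied, and it assumes the population standard deviation; which form of the standard deviation NeuroShell 2 uses is not stated here.  The function names are hypothetical.

```python
import math
import statistics


def make_nonlinear_scalers(training_values):
    """Return (logistic_scale, tanh_scale) built from the training data.

    Assumption: sd is the population standard deviation (statistics.pstdev).
    """
    mean = statistics.mean(training_values)
    sd = statistics.pstdev(training_values)
    if sd == 0:
        raise ValueError("standard deviation is zero; cannot scale")

    def logistic_scale(value):
        z = (value - mean) / sd
        # Two algebraically equal forms, chosen to avoid overflow in exp().
        if z >= 0:
            return 1.0 / (1.0 + math.exp(-z))
        ez = math.exp(z)
        return ez / (1.0 + ez)

    def tanh_scale(value):
        return math.tanh((value - mean) / sd)

    return logistic_scale, tanh_scale


logistic_scale, tanh_scale = make_nonlinear_scalers([10, 20, 30, 40, 50])
for v in (30, 50, 120):   # 120 lies far outside the training range
    print(v, logistic_scale(v), tanh_scale(v))
```

The mean of the training values maps to the middle of each range (0.5 for logistic, 0 for tanh), while the outlier 120 lands just inside the upper end rather than being clipped.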


You may also select "none" and NeuroShell 2 will not scale your data but will move it into the input layer "as is."  This option should only be used when you have used some other method to scale your input data into either the range of [0, 1] or [-1, 1] or a similar range.


Activation Functions

You may also select an activation function to be used on each slab to which you will propagate (e.g., in a three-layer net, the hidden and output slabs).  The default activation function, if you do not specify one, is the logistic function.  You may also obtain good results by using the Gaussian function in the hidden layer.  Refer to the description of the Gaussian function below for details.


The hidden layers produce outputs based upon the sum of weighted values passed to them.  So does the output layer.  The way they produce their outputs is by applying an "activation" function to the sum of the weighted values.  The activation function, also called the squashing function, maps this sum into the output value, which is then "fired" on to the next layer.
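As a rough illustration of this step, the Python sketch below computes the outputs of one slab: each neuron sums its weighted inputs and passes the sum through the activation function.  The names slab_output and weights are hypothetical, and bias terms are left out to keep the sketch short.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def slab_output(inputs, weights, activation=logistic):
    """Compute the outputs of one slab.

    weights[j][i] is the weight on the link from input i to neuron j.
    Hypothetical helper for illustration; no bias term is included.
    """
    outputs = []
    for neuron_weights in weights:
        total = sum(w * x for w, x in zip(neuron_weights, inputs))
        outputs.append(activation(total))   # the value "fired" to the next layer
    return outputs


# Two inputs feeding a slab of two neurons.
print(slab_output([1.0, 2.0], [[0.0, 0.0], [1.0, -0.5]]))
```

Both weighted sums in this example are zero, so both neurons fire the logistic midpoint of 0.5.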


Although the logistic function is the most popular, there are other functions which may be used.  Some enhancements in NeuroShell 2 have reduced the need for these other functions.  However, some problems will nevertheless respond better to these other activation functions, so they are included.


logistic -- f(x)=1/(1+exp(-x))

linear -- f(x)=x

tanh -- f(x)=tanh(x), the hyperbolic tangent function

tanh15 -- f(x)=tanh(1.5x)

sine -- f(x)=sin(x)

symmetric_logistic -- f(x)=2/(1+exp(-x))-1

Gaussian -- f(x)=exp(-x^2)

Gaussian-complement -- f(x)=1-exp(-x^2)
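For readers who want to experiment with these formulas outside NeuroShell 2, they can be written directly in Python as below.  The dictionary name ACTIVATIONS and its keys are chosen for this sketch and are not NeuroShell 2 identifiers.

```python
import math

# Each entry is a direct transcription of the formula listed above.
# The exp(-x) forms can overflow for very large negative x; that is
# acceptable for a sketch but would need guarding in real use.
ACTIVATIONS = {
    "logistic": lambda x: 1.0 / (1.0 + math.exp(-x)),
    "linear": lambda x: x,
    "tanh": math.tanh,
    "tanh15": lambda x: math.tanh(1.5 * x),
    "sine": math.sin,
    "symmetric_logistic": lambda x: 2.0 / (1.0 + math.exp(-x)) - 1.0,
    "gaussian": lambda x: math.exp(-x * x),
    "gaussian_complement": lambda x: 1.0 - math.exp(-x * x),
}

# Tabulate each function at a few sample sums of weighted values.
for name, f in ACTIVATIONS.items():
    print(f"{name:20s}", [round(f(x), 3) for x in (-2.0, 0.0, 2.0)])
```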


The following are guidelines for when to use the various functions, but like so much in neural networks, there will always be exceptions to these guidelines:


Logistic (Sigmoid logistic) - We have found this function useful for most neural network applications.  It maps values into the (0, 1) range.  Always use this function when the outputs are categories.





Linear - Use of this function should generally be limited to the output slab.  It is useful for problems where the output is a continuous variable, as opposed to several outputs which represent categories.  Stick to the logistic for categories.  Although the linear function detracts from the power of the network somewhat, it sometimes prevents the network from producing outputs with more error near the min or max of the output scale.  In other words, the results may be more consistent throughout the scale.  If you use it, stick to smaller learning rates, momentums, and initial weight sizes.  Otherwise, the network may produce larger and larger errors and weights and never lower the error.  For the same reason, the linear activation function is often ineffective when a large number of connections come into the output layer, because the total weighted sum generated will be high.





Tanh (hyperbolic tangent) - Many experts feel this function should be used almost exclusively, but we do not agree.  It is sometimes better for continuous-valued outputs, however, especially if the linear function is used on the output layer.  If you use it in the first hidden layer, scale your inputs into [-1, 1] instead of [0, 1].  We have experienced good results when using the hyperbolic tangent in the hidden layer of a three-layer network, and using the logistic or the linear function on the output layer.





Tanh15 (hyperbolic tangent 1.5) - There has been at least one technical paper wherein it was strongly proposed that tanh(1.5x) is much better than tanh(x).  We do not necessarily agree, but have included it anyway.





Sine - This is included for research only.  We know of no advocates of it, although on some problems it is just as good as the others.  If you use it in the first hidden layer, scale your inputs into [-1, 1] instead of [0, 1].  If used on the output layer, scale outputs to [-1, 1] also.





Symmetric Logistic - This is like the logistic, except that it maps to (-1, 1) instead of to (0, 1).  The different scaling may be better for some problems, and it should be tried.  When the outputs are categories, try using the symmetric logistic function instead of logistic in the hidden and output layers.  In some cases, the network will train to a lower error in the training and test sets.





Gaussian - This function is unusual because, unlike most of the others, it is not an increasing function.  It is the classic bell-shaped curve, which maps sums of large magnitude (strongly negative or strongly positive) into low values, and maps mid-range sums into high ones.  There is not much about its use in the literature, but we have found it very useful in a small set of problems.  We suspect that it is bringing out meaningful characteristics not found at the extreme ends of the sum of weighted values.  This function produces outputs in (0, 1]; it reaches 1 only when the sum is exactly zero.


An architecture which has worked very well for us is a three-layer network with two slabs in the hidden layer (this is one of the Ward backpropagation networks).  Put a Gaussian activation function in one slab and Tanh in the second, then use the logistic activation function in the output layer.  For prediction of continuous-valued outputs, use a three-layer backpropagation network with a Gaussian in the hidden layer and a linear activation function in the output layer.  For classification problems, try a regular three-layer backpropagation network with a Gaussian in the hidden layer and a logistic (Sigmoid logistic) function in the output layer.
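The forward pass of the first architecture can be sketched as follows.  This is a toy illustration only: it assumes both hidden slabs receive the same inputs and both feed the output slab, it omits bias terms and training, and the slab sizes, weight range, and helper names are all hypothetical.

```python
import math
import random


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def gaussian(x):
    return math.exp(-x * x)


def slab(inputs, weights, activation):
    """Apply the activation to each neuron's weighted sum (no bias term)."""
    return [activation(sum(w * x for w, x in zip(row, inputs))) for row in weights]


def random_weights(n_out, n_in, rng, size=0.3):
    """Small random initial weights; the range is an arbitrary choice."""
    return [[rng.uniform(-size, size) for _ in range(n_in)] for _ in range(n_out)]


def ward_forward(inputs, w_gauss, w_tanh, w_out):
    # Assumption: both hidden slabs see the same inputs, and the output
    # slab sees the outputs of both hidden slabs side by side.
    hidden = slab(inputs, w_gauss, gaussian) + slab(inputs, w_tanh, math.tanh)
    return slab(hidden, w_out, logistic)


rng = random.Random(0)
n_in, n_hidden_each, n_out = 3, 4, 2          # arbitrary sizes for the sketch
w_gauss = random_weights(n_hidden_each, n_in, rng)
w_tanh = random_weights(n_hidden_each, n_in, rng)
w_out = random_weights(n_out, 2 * n_hidden_each, rng)

outputs = ward_forward([0.2, -0.5, 1.0], w_gauss, w_tanh, w_out)
print(outputs)   # two values, each strictly between 0 and 1
```

Swapping logistic for a linear function in the last call to slab gives the continuous-valued variant mentioned above.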





Gaussian Complement - This function will tend to bring out meaningful characteristics in the extremes of the data.  We have not seen the function in the literature, but we have included it for research.  It is very useful in Ward networks.





Some of these activation functions may require lower learning rates and momentums than the logistic function requires; in general, they may perform best with learning rates and momentums that differ from the default settings for the logistic function.  Also, be careful if you find that a particular function solves your problem faster:  make sure it solves the problem equally well, i.e., that the net generalizes as well.  Tanh and sine have steeper slopes than the logistic or symmetric logistic functions, and so they may appear to solve the problem faster.  Also note that because some activation functions scale data between 0 and 1 while others scale data between -1 and 1, you may get a different average error for architectures which use different activation functions on the output layer.
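The remark about slopes can be checked numerically.  The Python sketch below estimates the slope of each function at zero, where these functions are steepest, using a central difference; the helper name slope_at_zero is hypothetical.  Analytically, the slopes at zero are 0.25 for the logistic, 0.5 for the symmetric logistic, 1 for tanh and sine, and 1.5 for tanh15.

```python
import math


def slope_at_zero(f, h=1e-6):
    """Central-difference estimate of the derivative of f at x = 0."""
    return (f(h) - f(-h)) / (2.0 * h)


FUNCS = {
    "logistic": lambda x: 1.0 / (1.0 + math.exp(-x)),
    "symmetric_logistic": lambda x: 2.0 / (1.0 + math.exp(-x)) - 1.0,
    "tanh": math.tanh,
    "sine": math.sin,
    "tanh15": lambda x: math.tanh(1.5 * x),
}

for name, f in FUNCS.items():
    print(f"{name:20s} slope at 0 = {slope_at_zero(f):.3f}")
```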