Example File: GHSAMPLE.XLS, Polynom worksheet
Solution Type: Continuous Chromosomes
This example is located in a tab labeled Polynom in the GHSAMPLE.XLS worksheet that was installed in the C:\GENEHUNTER\EXAMPLES\EXCEL subdirectory if you chose the default directory during GeneHunter installation. (If you install the entire AI Trilogy set of programs, the GeneHunter folder is a subfolder of the C:\AI Trilogy folder.)
Following is an example where we have used GeneHunter to build a non-linear model from a set of independent variables. The model is similar to the kind that can be built with non-linear regression analysis, and even the so called "polynomial nets". Of course, our example does not contain terms in several variables, but such terms are not difficult to add. Adding too many additional terms, however, may make evolution much slower. As a matter of fact, there is no reason why more general types of equations than polynomials can't be used in the same fashion.
The general idea is to find an equation that "fits" a set of data. The data consists of some number of "independent variables" from which some sort of prediction is to be made. The prediction is called the "dependent variable". A large number of sample predictions must be available, wherein the dependent variable corresponding to the independent variables is known. Each such sample is called an "observation".
For example, suppose we wanted to build an equation which models Bob's New York Style Deli sandwich sales (See the GeneHunter example called Neural Net in the file GHSAMPLE.XLS). We could use the same worksheet. The neural network output is the dependent variable, the number of sandwiches actually sold. The neural network inputs are the independent variables, the temperature, day of the week, etc. The rows in the worksheet are the observations.
The model or formula which predicts the dependent variable from the independent variables will be a polynomial, i.e., an equation of the form:
c1*x1^n1 + c2*x2^n2 + ... ck*xk^nk
where x1, x2, ..., xk are the k independent variables,
c1, c2, ...,ck are the k coefficients of the independent variables, and
n1, n2, ... ,nk are the powers to which the independent variables are raised.
Here is an example of a polynomial of 3 independent variables x1, x2, and x3:
dependent variable = 2.2x1^2 + 4x2^3 - 1x3^1
Figure 5.15 Polynomial Approximation
In the example Polynom, we created 5 independent variables in 5 rows labeled a, b, c, d, and e. We filled these columns with random numbers. Then we created a dependent variable column. In order to see how well the process worked, we wanted the dependent variable to be computed with an actual formula so we could see if GeneHunter could find the correct formula.
The formula we used was:
dependent variable = 3.2a + 4b + 5.5c^2 - 6.1d^3 + 3.3e^3
The fitness function we used was the sum of the squared errors between the actual dependent variable as calculated by our formula and the dependent variable as calculated by the formula generated by GeneHunter. Obviously, we would like to minimize this total error. The total error cell is H15.
Since there are 5 independent variables, there are 5 continuous chromosomes for the coefficients. We constrained these to lie within [-7, 7]. There are also 5 powers which should be integers between 0 and 3. The coefficients are on the worksheet in the range B14:F14 and the powers are in the range B15:F15.
The formulas in column H simply raise the independent variables to the right power, then multiply them by the appropriate coefficients, and finally sum the terms.
You can control the degree of overfitting by the limitations you place on the powers. The higher the powers you allow, the greater the chance of overfitting. Note that if a power goes to 0, then the term becomes a constant since any variable to the 0 power is 1. Also, if the coefficient goes close to zero, the term should probably be dropped. Use precisions of 16 or 32 bits so more precise coefficients can be found.
There are some extensions that you can make to this problem if you desire.
1. You can try using more than one variable in a single term, e.g., a^2*b^3 or b^2*d^2.
2. You may want to fix certain powers (exponents) and allow the GA to only find the coefficients for certain terms. For example, if you only want a first degree equation, you can force all the terms to use only an exponent of 1. Then there is less work for the GA.
3. You can provide multiple terms in each variable which allows the GA to find terms in the same variables with different exponents, e.g, b^1, b^2, and b^3.
4. If you want the GA to actually pick variables, you can use some integer chromosomes that will refer to the variable number, i.e., the column in which the variable is found. Use the Excel INDEX macro function in the equation to refer to the proper column. This scheme will also allow some variables to be picked more than once with different exponents.
5. If the variables are of different types in different scales, you will probably want to scale or normalize the variables into one range before finding the polynomial fitting equation. An excellent way to do this is by taking the mean and standard deviation of each column (variable). Then adjust each value in the column by subtracting the mean and dividing the result by the standard deviation. This will give you values for all variables in the same small range.