GMDH Advanced Training Criteria

GMDH Type

Use the mouse to click on the button for either Smart or Advanced GMDH. Different parameter screens are displayed depending upon which type is chosen.

Note: We realize that the explanation we have given for GMDH is brief, but a complete and detailed description of the algorithm is beyond the scope of this help file. Readers interested in more technical detail should refer to Farlow’s book (listed in References).

Also, whenever the notation X^2 appears in the following documentation, it refers to X squared. X^3 refers to X cubed, etc.

It is strongly recommended that you choose Smart GMDH unless you want to become an expert in the use of GMDH. Even if you are an expert, it is recommended that you try Smart GMDH as a starting point. The controls available in Smart GMDH are sufficient to enable GMDH to find solutions to many real-world problems.

1. Smart GMDH - This mode minimizes the number of controls to be set by the user and allows the user to describe the desired properties of the constructed model in simple terms.

2. Advanced GMDH - This mode gives expert users maximum freedom in setting training parameters, releasing the full power of GMDH. Because there is no universal method which will give the best results for every problem, using Advanced GMDH may achieve better results than using Smart GMDH. Advanced GMDH is mathematical in nature and should only be used by those with training in math.

Refer to GMDH Smart Training Criteria for additional information on using that option. Refer to GMDH Overview for details on GMDH terminology.

Overview

In order to understand many of the Advanced GMDH controls, it will be helpful to understand the polynomial models that GMDH builds. A more complete description with slightly different terminology is in the GMDH Overview.
At each layer, GMDH will build a polynomial like the following:

c1 + c2*x1 + c3*x2 + c4*x3 + c5*x1*x2 + c6*x1*x3 + c7*x2*x3 + c8*x1^2 + c9*x2^2 + c10*x3^2

where c1 is a constant, c2, c3, ... are coefficients, and x1, x2, and x3 are the input variables.

In the first layer, the input neurons are just the problem inputs, and the second layer neurons consist of polynomials like the one above. When the third layer is built, the inputs to the third layer polynomials can be the original problem inputs, the polynomials from the second layer, or both. If the inputs are polynomials, then a much more complicated polynomial will result. The fourth layer takes either the original inputs or the polynomials in the third layer as inputs, and so forth. Successive layers take inputs from either the original inputs or the polynomials from the immediately preceding layer.

Each new layer initially has many neurons in it. Each neuron is a polynomial of some inputs (up to three inputs in each polynomial). The best ones (those with the lowest selection criterion values) are kept and called survivors, and the rest are discarded.

Some of the Advanced GMDH controls are concerned with the form that each polynomial takes (e.g., how high a power each variable may have) and others are concerned with how the survivors are chosen. The selection criterion to be used is one of the major controls, since different selection criterion formulas will produce different types of models.

Max. Variables in Connection

This control sets the maximum number of input variables coming into a layer that are used to build one non-linear candidate for survival (each input being either an initial input variable or one of the survivors of the preceding layer). It may be set to:

(X1), which means only one variable is selected. This makes sense only for a limited number of special cases (see below).
(X1, X2), which means two variables are selected.
(X1, X2, X3), which means three variables are selected. This is the default value, which is recommended for most cases.

Max. Product Term in Connection

This control sets the maximum number of input variables of this layer (each of them being either an initial input variable of the system or one of the survivors of the preceding layer) that are multiplied together to form the covariants included in one non-linear candidate for survival. This control may be set to:

None - no covariants allowed. Used only in a limited number of special cases. Refer to Useful Hints at the end of this section for more information.
Double (X1X2) - only covariants (products of two variables) allowed.
Triple (X1X2X3) - covariants and trivariants (products of three variables) allowed. This is the default value recommended for most cases.

Max. Variable Degree in Connection

This is the maximum allowed degree of one input variable of any layer used to build one non-linear candidate for survival. Each input is either an initial input variable or one of the survivors of the preceding layer. This control may be set to X, X^2 or X^3. The default value of X^3 is recommended for most cases.

Selection Criterion

The Selection Criterion is the most important parameter that must be set in the Design module for GMDH. The Selection Criterion determines which candidates for survival actually become survivors to go on to the next layer. It is the “objective function” that determines how good the model is at any stage. Do not worry if you do not understand the technical descriptions of these criteria, since you do not need to understand them to use them. There are six criteria available. PSE is defined first (even though it is not first on the screen) since it is a common one in GMDH implementations.

Note: When using any of the Selection Criteria except for Regularity, you should not extract a test set from the pattern (.PAT) file.
If .TRN and .TST files exist, they should be erased prior to training with GMDH, unless you are using Regularity as the Selection Criterion. This is because, except with Regularity (Calibration), GMDH trains with all data and does not use a test set to prevent overtraining. Therefore you will want to use all available data.

PSE (Prediction Squared Error)

To avoid both splitting the data into two data sets and overfitting the model, this criterion is the sum of two terms: Norm. MSE and an overfitting penalty. (Norm. MSE is often referred to as Training Squared Error or TSE in the literature.) Norm. MSE is the average squared error of a model on the training set. The value of Norm. MSE is displayed in the Learning module in the Norm. MSE text box of the Statistics frame. If you want to know what the overfitting penalty is, you can calculate it as the difference between the Best Criterion Value and the Norm. MSE (this is also true for the FCPSE and MDL criteria).

Let N be the number of patterns in the pattern file, and let k be the number of coefficients in the model (that are determined in order to minimize Norm. MSE). Then PSE can be expressed as

PSE = Norm. MSE + 2 * var(p) * k/N,

where var(p) is a prior estimate of the true error variance of the prediction model. Theoretical considerations show that var(p) = var(a)/2 is a good estimate [here var(a) is the variance of the actual output variable]. Nevertheless, it makes sense to introduce a special factor named CC (Criterion Coefficient) to change the weight of the overfitting penalty. So, the final expression for PSE is

PSE = Norm. MSE + CC * var(a) * k/N.

You may set the CC yourself (see Selection Strategy below). Usually you should begin with the Criterion Coefficient = 1, and make it larger or smaller, depending upon whether the model is overfitting or underfitting the data, respectively.
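As a rough illustration of the final PSE expression above, the following Python sketch computes Norm. MSE, the overfitting penalty, and PSE for a small made-up data set. The function name pse and the sample numbers are hypothetical and used for illustration only. How NeuroShell 2 normalizes Norm. MSE and estimates var(a) internally is not described in this help file, so the sketch assumes the plain mean squared error and the population variance of the actual outputs.

```python
from statistics import pvariance


def pse(actual, predicted, k, cc=1.0):
    """Return (norm_mse, penalty, pse) for one candidate model.

    actual, predicted: equal-length sequences of output values
    k: number of fitted coefficients in the model
    cc: Criterion Coefficient (weight of the overfitting penalty)
    """
    n = len(actual)
    # Average squared error on the training patterns (assumed unnormalized).
    norm_mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n
    # Overfitting penalty CC * var(a) * k / N, as in the PSE expression.
    penalty = cc * pvariance(actual) * k / n
    return norm_mse, penalty, norm_mse + penalty


# Made-up data: four patterns and a model with two coefficients.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]
norm_mse, penalty, value = pse(actual, predicted, k=2, cc=1.0)
print(norm_mse, penalty, value)
```

The penalty returned here is the same quantity the text describes as the difference between the Best Criterion Value and the Norm. MSE.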
FCPSE (Full Complexity Prediction Squared Error)

This is a modified version of PSE. The expression for FCPSE

FCPSE = Norm. MSE + CC * var(a) * C/N

is similar to that for PSE, except that the number of coefficients in the model k is changed to an overall model complexity C, which takes into account the complexity of each term in the model. The algorithm for calculating overall complexity is a Ward Systems Group proprietary method. In many cases, FCPSE gives better results than PSE, as it provides a more effective estimation of complexity. The Criterion Coefficient for FCPSE is used in the same way as it is for PSE.

MDL (Minimal Description Length)

This is a criterion similar to PSE, but it incorporates a stronger penalty for model complexity:

MDL = Norm. MSE + CC * var(a) * k/N * ln(N)

where ln is the natural log. The Criterion Coefficient for MDL is used in the same way as for PSE.

GCV (Generalized Cross Validation)

This is one more method of introducing an overfitting penalty. The expression for GCV is

GCV = Norm. MSE / (1 - CC * k/N)^2.

The Criterion Coefficient for GCV is used in the same way as for PSE.

FPE (Final Prediction Error)

This is the minimum variance unbiased estimator of the mean-squared error of prediction, in the case of a correctly specified model and Gaussian errors. The expression for FPE is

FPE = Norm. MSE * (N+k)/(N-k)

The Criterion Coefficient is not used for FPE, so the value set in the Training Criteria module is ignored.

Regularity (Calibration)

This is the average squared error of the model on the test set. Use of this criterion is very much like using the Calibration feature for other NeuroShell 2 architectures. To use Regularity, you should extract a test set from the pattern file with the Test Set Extract module. It is recommended that the test set be about one third the size of the pattern file.
The Criterion Coefficient is not used for Regularity, so the value set in the Training Criteria module is ignored.

Maximum Number of Survivors in the First Layer

This control determines the maximum number of models which are allowed to survive the current layer and which will be used as inputs to the next layer. (In some cases, the actual number of survivors may be less than the value defined by this option.) If a Schedule Type (see below) other than Const is selected, then the Max. Number of Survivors is decreased automatically from layer to layer. The Max. Number of Survivors affects the diversity of survivor models and the quality of choice of important variables. If you set this value too low, then the algorithm may miss some important variables. On the other hand, increasing the Max. Number of Survivors increases computation time, so there is a tradeoff. The minimum and maximum values allowed to be set for this control depend upon the number of inputs to the problem. It is recommended that Max. Number of Survivors be set at the default value. However, it is one of the options which should be varied if you want to improve your model.

Schedule Type

This control automatically decreases the Max. Number of Survivors, which may, in some cases, achieve results faster or even better. However, this control is a fine tuning parameter, and it should be varied only after the other controls have been varied.

There are three options available for this control:

Const (default) - sets the Max. Number of Survivors for all the layers equal to the Max. Number of Survivors in the First Layer.
Asymptotic - decreases the Max. Number of Survivors from layer to layer, asymptotically achieving the minimum value of 2. The degree of the decrease is determined by Decrease in Max. Number of Survivors (see below).
Linear - sets a linear rule to decrease Max. Number of Survivors from layer to layer. The degree of this decrease is determined by Decrease in Max.
Number of Survivors (see below). For a given number of layers and a given Decrease in Max. Number of Survivors, the Linear option results in fewer survivors than Asymptotic.

Decrease in Max. Number of Survivors

This control determines the relative slope of the decrease of the Max. Number of Survivors. This control is disabled if Schedule Type is set to Const. The possible options are:

Gentle - gradually decreases the maximum number of survivors.
Medium - decreases the maximum number of survivors at a steady rate.
Steep - rapidly decreases the maximum number of survivors.

Model Optimization

The model obtained by the GMDH algorithm may be improved by removing some unnecessary terms at different stages of the algorithm. As there is no way to determine a priori which terms should be removed, several different algorithms are available which implement different strategies and which trade thoroughness for speed.

As a rule, turning on more thorough optimization slows the system down, but it may bring better solutions. This option may also affect how well GMDH is able to select significant variables. There are more optimization options in Advanced GMDH than there are in Smart GMDH.

Off - this mode is very fast, but it creates solutions which are too complex. It usually is not recommended unless you want a very rough determination of significant variables.
Express - this mode of optimization is fast, but it may leave some terms which would be removed by a more thorough optimization.
Fast - this mode may provide a better selection of significant variables than Express mode.
Smart (default) - provides smart optimization, which in most cases is an optimal tradeoff between calculation speed and model quality.
Thorough - same as Smart but performs the most thorough selection of significant variables. In most cases it makes sense to try this mode in addition to Smart.
Full - checks all possible combinations of terms in all stages of the algorithm. It provides the best possible results but it takes a lot of time. It is not necessary for most problems.

Selection Strategy

Criterion Coefficient

The Criterion Coefficient governs the relative weight of the penalty for model complexity. It must be greater than zero, and should usually be within the range 0.1 to 4; 1 is the preferred setting. (The program will allow you to use values from 0.05 to 10.) If the value of the Criterion Coefficient is set too high, it causes the generation of simpler models with poor performance. If the value is set too low, it creates more complex models which overfit the training data and generalize poorly. The Criterion Coefficient is a parameter of the FCPSE, PSE, MDL, and GCV Selection Criteria types described above.

If you want GMDH to memorize the training data, set the Criterion Coefficient to 0.05, the smallest possible value. You may want to do this to find the relative importance of input variables, which you could use in other networks.

Extended Linear Models

Check this box if you want to allow linear regressions at each layer even if Max. Variable Degree in Connection is > 1. Such regressions may contain more variables than Max. Variables in Connection allows.

Whether testing of Extended Linear Models (ELMs) should be done depends upon the problem. For some problems, the use of ELMs achieves suitable results much faster. However, some problems progress fast at first, but finish with poor results. This happens because, if linear models survive over the early layers, the non-linearity of all survivors turns out to be too low. All of the survivors are too much alike and it is hard to gain the required non-linearity during the following layers. It is recommended that you allow testing of ELMs at first, but if the results are poor, turn it off and try again.
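To make the effect of the Criterion Coefficient described under Selection Strategy more concrete, the following Python sketch applies the PSE, MDL, and GCV expressions given earlier to two hypothetical candidates and reports which one each criterion prefers as CC changes. The candidate names, their Norm. MSE values, N, and var(a) are invented for illustration and do not come from the program. FCPSE is left out because its complexity measure is proprietary.

```python
import math


def pse(mse, var_a, k, n, cc):
    return mse + cc * var_a * k / n


def mdl(mse, var_a, k, n, cc):
    return mse + cc * var_a * k / n * math.log(n)


def gcv(mse, var_a, k, n, cc):
    # var_a is unused by GCV; it is kept so all criteria share one signature.
    return mse / (1.0 - cc * k / n) ** 2


CRITERIA = {"PSE": pse, "MDL": mdl, "GCV": gcv}

# Hypothetical candidates: name -> (Norm. MSE, number of coefficients k).
MODELS = {"simple": (0.30, 3), "complex": (0.20, 12)}
N = 50        # number of patterns (made up)
VAR_A = 1.0   # variance of the actual output (made up)


def preferred(criterion, cc):
    """Name of the candidate with the lowest value of the given criterion."""
    score = CRITERIA[criterion]
    return min(
        MODELS,
        key=lambda name: score(MODELS[name][0], VAR_A, MODELS[name][1], N, cc),
    )


for cc in (0.1, 1.0, 4.0):
    print(cc, {crit: preferred(crit, cc) for crit in CRITERIA})
```

With these made-up numbers, a small coefficient lets the model with more coefficients win on its lower training error, while a coefficient of 1 or more shifts the choice to the model with fewer coefficients, which matches the overfitting and underfitting behavior described above.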
Random Selection of Vars

Usually testing of all possible models is accomplished by sequentially testing all possible combinations of input variables. That is fine if you are trying to find the very best model and don’t care how long it trains. However, if you want to estimate the performance of the algorithm for a given set of parameters or for a new problem, then it is sufficient to test only some of the possible variable combinations. If your problem is large and has many inputs and/or you have allowed many Survivors, this saves a lot of time. The combinations of variables selected for testing should be chosen at random in order not to give preference to the first variables. Check the Random Selection of Vars box to do this.

Note: Random Selection of Vars has nothing in common with Random Pattern Selection used with other architectures. In GMDH, all patterns are used with a rotational pattern selection.

Combinations to Try

If Random Selection of Vars is turned on, you can limit the number of tested combinations of variables. The Combinations to Try edit box allows you to enter the percentage of all the possible variable combinations which will actually be tested. For example, if you set Combinations to Try to 50 percent, this will cause the system to test half of all possible variable combinations, randomly chosen. (Our experience has shown that 50 percent usually is enough to obtain a model with a criterion value very close to that which results from testing all possible variable combinations.) If you set Combinations to Try to 100 percent, then all combinations will be tested (this makes sense if you use Fast Completion Threshold - see below). If Random Selection of Vars is turned off, then the Combinations to Try box is disabled and its value is ignored.

Fast Completion Threshold

The Fast Completion Threshold checkbox works in conjunction with the Completion Factor described below.
If Random Selection of Vars is turned off, then the Fast Completion Threshold box is disabled and its value is ignored.

The quality of the models chosen when Random Selection of Vars is on depends upon which combinations are actually tested. If you set the Combinations to Try percentage too low, you risk missing the best possible models. On the other hand, if you set it close to 100%, random selection becomes useless for saving time.

There is a solution, however. You may specify a criterion value which, once reached by the worst of the survivors, terminates construction of the current layer and passes on to the next one.

In some sense, the use of Fast Completion Threshold is equivalent to the use of the Terminate Layer Construction mode of training interrupt (refer to GMDH Learning), except that the decision to terminate the current layer is made automatically based upon the criterion value, which is determined by the Completion Factor (see below).

Completion Factor

If Fast Completion Threshold is turned on, this factor determines the desired improvement in the criterion value for the survivor models compared to the criterion value from the preceding layer. The current layer is terminated if the criterion value for the worst survivor becomes less than the criterion value of the best model of the preceding layer, multiplied by the specified Completion Factor.

For example, if you set Completion Factor to 0.9, then the current layer will be terminated if the criterion values of all current survivors become less than 90 percent of the Best Criterion Value displayed in the Output Status frame of the GMDH Learning module.

When setting the Completion Factor, it is better to err on the low side. In fact, if you set the factor too low, the system will check all possible combinations of variables, but you will miss nothing.
On the other hand, if you set the value too close to 1, the system may terminate the layer a long time before it finds the best possible models. In this case the whole system will probably perform poorly. Choosing the correct value of this factor is quite difficult.

If either Random Selection of Vars or Fast Completion Threshold is turned off, then the Completion Factor box is disabled and its value is ignored.

Missing Values

See Missing Values for details.

Useful Hints

Parameters Variation - Advanced GMDH

The recommended order of parameter variation for Advanced GMDH is the following:

Most Significant Parameters

Selection Criterion: FCPSE, PSE, MDL, GCV, FPE, Regularity

Criterion Coefficient: 1
For any type of Selection Criterion, start with the Criterion Coefficient equal to 1. The expressions for criteria calculation were normalized in such a way that statistically (if you apply a criterion to many problems) the best results are obtained for a coefficient = 1. However, the dependence on the coefficient value is very strong, so usually it makes sense to vary the coefficient, say, in the range from 0.1 to 4. If the Criterion Coefficient is too small, the model will be too complex and will overfit the data. If the Criterion Coefficient is too large, the model will be too simple and give poor results.

You should probably try the following for a given Selection Criterion: vary the Criterion Coefficient, finding the region of best models; adjust all other parameters as described below; slightly vary the Criterion Coefficient again in the previously found region to locate the value that gives the best model.

Extended Linear Models: on, off
It is hard to predict whether turning on Extended Linear Models will make the result better or worse. You should try both.

Max. Number of Survivors in the First Layer
Start with the default value. Decreasing the Number of Survivors will definitely speed up training, but it may not give good results.
If you increase the Number of Survivors, you may obtain better results at the cost of slower training.

Model Optimization: Smart, Thorough
Thorough may provide better selection of significant variables than Smart, at a cost of longer computation time. You may also try Fast or even Express if your problem is quite large and/or your computer is quite slow. Fast sometimes gives the same results as Smart, but never better. Selecting Off makes sense only for a "quick and rough" estimation. Full is recommended only for very small problems; turning it on increases calculation time drastically, with little or no improvement in the model compared to that obtained with Thorough optimization.

Less Significant Parameters

Data scaling
First choice is <<-1, 1>> with min/max chosen by mean plus or minus 1 standard deviation (or 3 standard deviations for complex data such as financial data); second choice is <<-1, 1>> with min/max chosen in the usual way. (Scaling using a standard deviation is done in the Define Inputs and Outputs module.)

Schedule Type: Const, Asymptotic, Linear
Try other schedules if Const has failed. Usually they give either worse or the same results, but they may succeed for some cases where the Const schedule results in overfitting the data.

Decrease in Max. Number of Survivors: Gentle, Medium, Steep
Begin with a slow decrease in the number of survivors, but if you start changing this option, you should try all three settings.

Max. Variable Degree in Connection: X^3, then X^2, then X
X^3 is strongly recommended. If you feel your data is linear, it may be more effective to use X than X^2 or X^3.

Max. Product Term in Connection: X1X2X3, then X1X2, then None
X1X2X3 is strongly recommended. If you feel your data is fairly linear, it may be more effective to use X1X2. None may be useful only if you are trying to find a purely linear solution.

Max. Variables in Connection: (X1, X2, X3), then (X1, X2), then (X1)
(X1, X2, X3) is strongly recommended.
Reducing Max. Variables in Connection may help in very rare cases when the algorithm includes insignificant input variables in the model. (X1) may be useful only if you are trying to find a purely linear solution (together with setting Max. Product Term in Connection to None and Max. Variable Degree in Connection to X).
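The way the three connection controls shape one candidate polynomial can be sketched in Python as follows. This is only an illustration: the function candidate_terms and its integer encoding of the settings are hypothetical, and the exact set of terms the program generates is not specified in this help file. The sketch assumes each candidate contains a constant, pure powers of each variable up to the allowed degree, and products of distinct variables up to the allowed product size, an assumption chosen because it reproduces the ten-term polynomial shown in the Overview.

```python
from itertools import combinations


def candidate_terms(max_vars=3, max_product=3, max_degree=3):
    """List the terms of one candidate polynomial as strings.

    max_vars: Max. Variables in Connection (1, 2 or 3)
    max_product: Max. Product Term in Connection (1 = None, 2 = Double, 3 = Triple)
    max_degree: Max. Variable Degree in Connection (1 = X, 2 = X^2, 3 = X^3)
    """
    names = [f"X{i}" for i in range(1, max_vars + 1)]
    terms = ["1"]  # constant term
    # Pure powers of each variable, up to the allowed degree.
    for degree in range(1, max_degree + 1):
        for name in names:
            terms.append(name if degree == 1 else f"{name}^{degree}")
    # Products of distinct variables (covariants, then trivariants).
    for size in range(2, min(max_product, max_vars) + 1):
        for combo in combinations(names, size):
            terms.append("*".join(combo))
    return terms


# Three variables, Double products, degree X^2: the Overview polynomial.
print(candidate_terms(max_vars=3, max_product=2, max_degree=2))
# One variable, no products, degree X: the purely linear case.
print(candidate_terms(max_vars=1, max_product=1, max_degree=1))
```

Under this assumption, each coefficient c1, c2, ... in a candidate corresponds to one entry of the returned list, so tightening any of the three controls directly reduces the number of coefficients k that the selection criteria penalize.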