Janet E. Burge, Genetic Algorithm Project

Computer Science Department

Genetic Algorithms

Background/Description
Genetic Algorithms are a search technique that utilizes the concept of natural selection. The concept being learned is represented as a "gene" - usually a bit string. Populations of these genes are created randomly and then tested using a fitness function to indicate which ones are the "best" (this depends on the application). For example, if a GA is being used for classification, the best genes are the ones that correctly classify the most training instances. After the fitness for each gene has been computed, a selection algorithm is used to determine which genes survive to the next population. Some selection algorithms are:

Probabalistic selection of the fittest
Tournament selection - two are selected at random, the best is kept

Some percentage of the genes in the new population are created by crossover. Crossover is the GA equivalent to mating. In crossover, two genes are spliced and recombined in such a way that the correctness of the resulting gene is preserved. Some crossover algorithms use one crossover point (like the middle), others use several. Crossover points can be fixed or random.

GAs can be used for many different types of applications. Some examples are:

Classification - the genes consist of sets of rules used for data classification.
Planning - the genes consist of parameters of a plan (such as a battle plan).
Design - the genes give the parameters of each element of a design.

The advantage of GAs are that they can be used to search a wide search space and come up with answers that might not be obtained by other means. A disadvantage is that they may need many generations to find their solution, if one is found at all. Another potential problem is that if there is not sufficient randomness in the selection and crossover methods the GA may not converge on the best possible solution.

References
Books and Papers

Mitchell, T., "Machine Learning," McGraw Hill, 1997, pp. 249-259.

This textbook, used in class, described how GAs worked and provided the example used for crossover by this project.

Schlabach, J., Hayes, C., Goldberg, D., SHAKA-GA: A Genetic Algorithm for Generating and Analyzing Battlefield Courses of Action, 1997.

This paper describes an earlier version of the FOX Course of Action Generator described in the project presentation.

Bently, P., Wakefield, J., The Table: An Illustration of Evolutionary Design using Genetic Algorithms, Genetic Algorithms in Enginerring Systems: Innovations and Applications, 12-14 September 1995, Conference Publication No. 414, IEE, 1995.

This paper describes using GAs to design tables.

Web References:

These were used for background information only.

Collection of GA Links
GA FAQ

Problem
This project used GAs to classify the census income data used in earlier projects in this class.
Gene Representation
Each gene is a variable length number of classification rules stored as bit strings. Not all of the census data attributes were used. Attributes used were selected based on the Decision Tree project done earlier in the semester.

Education - 3 bits
Occupation Status - 4 bits
Marital Status - 3 bits
Relationship - 3 bits
Hours Worked - 2 bits
Salary - 1 bit (target attribute)

Classification attributes where any value was acceptable were represented as all ones so that a bitwise AND could be used to compare the rules with the training and test instances.
Gene Selection and Crossover
Selection was done using a tournament selection where two genes were selected at random and the one with the highest fitness was selected. Genes for crossover were selected randomly from the original population and crossed over using the algorithm used by GABIL. In this algorithm, cutpoints are chosen randomly but checked to ensure that the rule boundaries remain consistent. Mutation was not implemented for this project.
Fitness
Fitness was computed by comparing each training instance to each rule within a gene. If any of the rules matched the instance, the rule was considered a correct classification. The fitness for each gene was incremented by one each time it correctly classified a training instance.
Results
The results were surprisingly good. They are shown in the sample runs referenced below and in the report and presentation. The results are even more surprising when the rules are examined. They are not what would be expected to classify the data. This could be due to a problem with the program (which has not been found) or the data set is not evenly distributed for the attributes chosen.
Solution
There are two sample runs available for the program: a test using 5000 test and training instances and a test using 100 test and training instances. These sample runs give the accuracy for the program and the resulting rules.

Results are also given in the Presentation and project report.