Examples of non-strategy testing would be saving and loading a game file and continuing play from that point, or translating messages into a new language and verifying that those messages appear correctly throughout the client.
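As a rough illustration of the first kind of non-strategy test, a save-and-restore round trip might look like the following sketch. The Game class and its methods used here are assumptions made for illustration only, not the project's actual API.

    // Rough sketch of a non-strategy test: save a game, load it back, and
    // confirm that play can continue from the restored position. The Game
    // class and its methods are hypothetical stand-ins for illustration.
    import junit.framework.TestCase;
    import java.io.File;

    public class SaveLoadTest extends TestCase {
        public void testSaveAndRestore() throws Exception {
            Game game = Game.fromRecord("testdata/partial_game.sgf");
            File saved = File.createTempFile("moyoman", ".sav");
            game.save(saved);
            Game restored = Game.load(saved);
            // The restored game should reproduce the original position...
            assertEquals(game.getBoard(), restored.getBoard());
            // ...and accept further moves without error.
            restored.play(restored.generateMove());
        }
    }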
Strategy testing will take the bulk of the effort. There will be a test directory at the same level as src which will contain a parallel package hierarchy. Test cases will be written for the module type, not the module name. Thus, tests would be written for Shape, Life and Death, etc., rather than for specific implementations of these. The test cases will only use the publicly defined methods of these module types. There are two types of test cases: those that are pass-fail, and those that give a graded response. The former are used for testing per se, and the latter for the evaluation task described below.
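To make the idea of testing against the module type concrete, a pass-fail test might be written along the following lines. This is only a sketch: the LifeAndDeath interface, the ModuleFactory, and the Board and Status classes are assumed names for illustration, not the actual Moyoman classes.

    // Hypothetical sketch: a pass-fail test written against the LifeAndDeath
    // module *type*, so any implementation obtained from the (assumed)
    // ModuleFactory can be exercised by the same test case.
    import junit.framework.TestCase;

    public class LifeAndDeathTest extends TestCase {
        public void testDeadGroupIsRecognized() throws Exception {
            // Obtain whichever implementation is currently configured.
            LifeAndDeath lad = (LifeAndDeath) ModuleFactory.create("LifeAndDeath");
            Board board = Board.fromSGF("testdata/dead_corner_group.sgf");
            // Only the publicly defined methods of the module type are used.
            Status status = lad.evaluate(board, board.getGroupAt(0, 0));
            assertEquals(Status.DEAD, status);
        }
    }

Because the test depends only on the module type, the same test case can be run unchanged against every implementation of that type.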
The basic method of testing is as follows: for each of a series of modes and game records, computer vs. computer play is initiated. After each move of the game, the test cases for each module are run. For example, the Groups module can be tested by getting the liberties for each group and comparing them against known results. A great deal of effort would go into both writing these test cases and recording the correct answer for each test case for each move of all of the test games. In addition, the log file is monitored to determine whether any errors or warnings occur for any of the modules.
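The per-move checking loop for the Groups example might be sketched roughly as follows. All of the class and method names here (GameRecord, ExpectedResults, Game, Groups, and so on) are invented stand-ins for whatever the real framework provides.

    // Illustrative sketch only: replay a recorded game and, after each move,
    // compare the liberties computed by the Groups module against the known
    // results recorded for that move.
    public class GroupsReplayTest {
        public static void runChecks(GameRecord record, ExpectedResults expected) {
            Game game = new Game();
            Groups groups = (Groups) ModuleFactory.create("Groups");
            int moveNumber = 0;
            for (Move move : record.getMoves()) {
                game.play(move);
                moveNumber++;
                groups.update(game.getBoard());
                for (Group g : groups.getGroups()) {
                    int actual = groups.getLiberties(g);
                    int known = expected.libertiesFor(moveNumber, g);
                    if (actual != known) {
                        throw new AssertionError("Move " + moveNumber
                            + ": expected " + known + " liberties, got " + actual);
                    }
                }
            }
        }
    }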
Another type of testing would be running the program in computer vs. computer mode with the game saved after each move, and checking the log files for any errors that occur. In addition, the game records would be saved, and a person could manually go through them. If any of the moves appear to be below the program's usual skill level, the game is restored to that move and the debug windows are used to determine the reason for the poor move.
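A simple log check of the kind described could be sketched as follows; the log file location and the ERROR/WARNING keywords are assumptions rather than the project's actual logging conventions.

    // Minimal sketch: scan a log file produced during computer vs. computer
    // play and report any ERROR or WARNING lines. The level keywords and the
    // log format are assumptions for illustration.
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class LogChecker {
        public static boolean hasProblems(String logPath) throws IOException {
            boolean found = false;
            try (BufferedReader in = new BufferedReader(new FileReader(logPath))) {
                String line;
                int lineNo = 0;
                while ((line = in.readLine()) != null) {
                    lineNo++;
                    if (line.contains("ERROR") || line.contains("WARNING")) {
                        System.err.println(logPath + ":" + lineNo + " " + line);
                        found = true;
                    }
                }
            }
            return found;
        }
    }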
In order to perform this evaluation, there would have to be a standard test suite for each module which would be run against all of the different implementations to determine this ranking for each mode. For example, the Life and Death module could be given a series of life and death problems, and its results, the time taken, and the memory used would be measured. This would not consist of pass-fail tests as in the testing task described above, but of life and death problems of varying difficulty. Each Life and Death implementation would then be graded on a scale of how well it solved the problems, as opposed to failing the test if it got any problem wrong. Evaluation may also include playing different configurations of the program against itself. Of course, this evaluation would also include making sure that the implementations did not throw exceptions, enter infinite loops, etc.
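As a sketch of how a graded, rather than pass-fail, evaluation might work, the following loop scores a Life and Death implementation against a problem set and records the time taken for each problem. The Problem class, its difficulty weights, and the evaluate method are all hypothetical; measuring memory use is omitted from the sketch for brevity.

    // Hypothetical grading sketch: run a LifeAndDeath implementation over a
    // set of problems of varying difficulty, awarding partial credit and
    // recording elapsed time rather than simply passing or failing.
    import java.util.List;

    public class LifeAndDeathEvaluator {
        public static double grade(LifeAndDeath lad, List<Problem> problems) {
            double score = 0.0;
            double maxScore = 0.0;
            for (Problem p : problems) {
                maxScore += p.getDifficultyWeight();
                long start = System.currentTimeMillis();
                Status answer = lad.evaluate(p.getBoard(), p.getTargetGroup());
                long elapsed = System.currentTimeMillis() - start;
                if (answer == p.getCorrectStatus()) {
                    // Harder problems are worth more; a time penalty could
                    // also be applied here (the weighting is an assumption).
                    score += p.getDifficultyWeight();
                }
                System.out.println(p.getName() + ": " + elapsed + " ms");
            }
            return score / maxScore;  // grade on a 0.0 to 1.0 scale
        }
    }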
Obviously, this is a very large task in itself, which includes designing and implementing a code framework, as well as gathering data from other sources such as life and death problems or joseki dictionaries. The complexity of this task and the amount of code required could easily match that of the code in org.moyoman.module itself. As with the development of the Go-playing code, the quality of the testing effort will depend on having enough volunteers to perform this task.