Zobrist hashing is used throughout the Moyoman program. A Zobrist hash is a 64 bit number which represents a given board position. The Zobrist hash could be used as an index. This scheme can most easily be explained using an example. Assume that 20,000 professional games are used, with an average of 250 moves per game. That results in a maximum of 5 million board positions. Since the games will be unique after the first 30 or 40 moves, the actual number of board positions will be somewhere between 4 and 5 million. Each of these board positions will have a 64 bit number associated with it. For a given board position during a game by the Moyoman program, 8 different Zobrist hashes can be computed for the board position, to take into account rotation and reflection. Only one of the 8 hashes would be stored, for example, the largest one. If the hash matches any of the 5 million stored Zobrist hashes, then the game so far has been played at least once in one of the professional games. For each of the stored Zobrist hashes, a list of followup moves and Zobrist hashes are stored as well. Thus, the program can quickly determine for any board position if it occurred before in the 20,000 professional games, and what the next move or moves were.
How much space would be required to store this information? Assuming that we want to use an ASCII file to represent this data, a 64 bit number can be represented in 12 bytes using BASE64 encoding. On average, there might be 2 follow up moves, each of which would take up 15 bytes for the hash, a separator character, and the follow up move position. Assume a blank line to separate board position entries. Assuming each part of this entry is on its own line, and there is an average of 45 bytes per entry. Multiply this by 5 million board positions, and we now have a 200Mb file. Clearly, the program would not cache this information, but would do its work off of disk. Repeatedly reading a 200Mb file would be cumbersome, so the files could be stored as 4096 files of about 50kb each. This would be natural because the file names could just be based on the first two characters of the BASE64 encoding. The two characters could not be used directly, since some operating systems do not have case sensitive file names and the slash character might be problematic. The file name could be a number between 0 and 4095. The files would all be sorted to allow for binary search to be used to efficiently find entries in these files.
It would not be unreasonable to use 200Mb of storage for this game information. If the game records were stored as SGF files, then 20,000 games would take up about 30Mb without compression. Downloading this would be the problematic part, especially for people without a high speed Internet connection. The Moyoman program would provide a conversion program to convert SGF files into the data format of 4096 files discussed above. Perhaps the default distribution of Moyoman would come with one hundred professional games, and the user could then get additional games and run the conversion program to produce the 4096 files at their convenience.
The end user would not need to use these professional games, only the module developer who is using information from professional games as part of the move generation strategy for the module. The idea is that the module developer obtains the sgf games on his own, then runs the conversion program which is part of the Moyoman software. Tools for lookup based on Zobrist hash would also be available. The module developer then processes this data, and produces a much smaller amount of data which his module would use and would be part of the standard Moyoman distribution under the data directory.