Endgame tablebase generation can be highly parallelized, but not uniformly, and not necessarily efficiently.
When I say "not uniformly" I mean two things. First, different force combinations are not equally easy to run in parallel, nor do they offer equal parallelism. Second, the threading code must be aware of the force combination. I'll illustrate with tbgen2, since that is the generator I'm familiar with.
In tbgen2 indexing, pawns are the "outermost" multiplier. With the pawns fixed in position, the king pair iterates through its range. With pawns and kings fixed, the pieces move. For pawnless positions, each king configuration can be computed in parallel within a given iteration. For the standard 8x8 board that is a maximum of 462 parallel units of work. Each unit must finish and be collected at a synchronization point before the next iteration begins. In practice, N << 462 worker threads are launched and wait, each plucking off a portion of the work from a queue (or being assigned jobs).
Things get fun with pawns on the board. Adding a single pawn increases the parallelism by a factor of 4. First, positions with pawns on {a7/h7, b7/g7, c7/f7, d7/e7} are computed. If, let us say, b7/g7 finishes first, then you are ready to tackle b6/g6. It is not necessary to synchronize with the other pawn threads, since each subspace converges independently of the others. Columns are treated as matching pairs because of the horizontal symmetry of pawnful indexing. To be more specific, if the white king traverses the d/e boundary the configuration is no longer canonical, and the board has to be flipped along the vertical axis to restore order to the land. As a result a pawn on the b column moves to g, and vice versa, and so on. That's why you can't get an 8x improvement in parallelism. At first I thought each column could be treated separately; only after careful study did I realize my mistake.
Adding extra pawns to the board offers even more leverage, but becomes increasingly difficult to manage. The top of the pyramid -- pure pawn endgames à la kppppk -- is a dream and a nightmare. It can be parallelized like crazy, but the overhead of managing the work far outweighs the little bit of computation that goes on inside each case.
v is incorrect in claiming that generation is memory bandwidth-bound. It depends on the algorithm. By all reports the algorithm of Muller is bandwidth-bound. tbgen and tbgen2 are CPU-bound. Edwards is disk-bound. FEG appears well balanced between disk and CPU. A point he makes that is true is that in a cluster architecture the huge amount of data passing between nodes would drag down any naive strategy for distributing the work. Bal et al. solved this by cleverly combining messages in a message-passing parallel architecture (*). Preferable (for ease of programming) is a shared-memory architecture, if you can get enough RAM. Of today's big iron machines I favor SGI NUMAlink for tablebase crunching. (Not that I have access to one, mind you.)
Enough talk of supercomputers. Back to desktop machines. Here are some measurements I made on a quad and a dual quad Intel machine with 16 GB memory. The table is kbnkn. Times are in seconds and are scaled to a 3 GHz clock.
Code:
CPUs   Quad    Oct
  1    5491   5543
  2    3098   3158
  4    1521   1567
  6      --   1067
  8      --    862
The quad running at full utilization achieves a speedup of 3.61x. For the dual-quad machine it is only 6.43x. The underutilization is due to the semi-random, nonlocal memory access pattern of looking up child position DTM values. They can be spread all over the memory space. That sucks, because the hierarchical memory architecture of modern CPUs is deeply hinged on the presumption of memory locality. EGTB generation breaks this assumption with a vengeance. If instead of generating a table I merely iterate over positions, the speedup is perfectly linear.
I realize I haven't answered guy's implicit question. Should we be getting excited about GPUs and Cell architectures and so on? I don't know, but probably not.
jk
(*) Henri Bal and Victor Allis, Parallel Retrograde Analysis on a Distributed System (1995),
http://citeseer.ist.psu.edu/bal95parallel.html.
See also John W. Romein and Henri E. Bal, Solving the Game of Awari Using Parallel Retrograde Analysis (2003).