The Parallelisability of EGT-generation

Endgame analysis using tablebases, EGTB generation, exchange, sharing, discussions, etc.
guyhaw
Posts: 489
Joined: Sat Jan 21, 2006 10:43 am
Sign-up code: 10159
Location: Reading, UK

The Parallelisability of EGT-generation

Post by guyhaw »

A colleague referred me to nVidia's Tesla offerings - lots of processors but only 16 GB of memory. Impressive, but is it useful for EGT work?
It started me thinking about how intrinsically parallelisable EGT algorithms are: what are the separable processes, etc.?
Unless one has a compute-bound task, needing O(n^k) compute cycles on 'n' items of data (with k > 1), perhaps one cannot usefully employ more than a few processors.
Vegan

Re: The Parallelisability of EGT-generation

Post by Vegan »

EGTB generation is sensitive to memory bandwidth, not CPU performance. This is due to the massive amount of searching done.

6 pieces need at least 8 GB of RAM, and 7 pieces need at least 128 GB.
jkominek
Posts: 150
Joined: Mon Dec 04, 2006 9:02 am
Sign-up code: 0
Location: Pittsburgh, PA

Re: The Parallelisability of EGT-generation

Post by jkominek »

Endgame tablebase generation can be highly parallelized, but not uniformly, and not necessarily efficiently.

When I say "not uniformly" I mean two things. First, different force combinations are not equally easy to run in parallel, nor do they offer equal parallelism. Secondly, the threading code must be aware of the force combination. I'll illustrate with tbgen2, since that is what I'm familiar with.

In tbgen2 indexing, pawns are the "outermost" multiplier. With pawns fixed in position, the king pair iterates through its range. With pawns and kings fixed, the pieces move. For pawnless positions, each king configuration can be computed in parallel within a given iteration. For the standard 8x8 board that's a maximum of 462 threads. Each thread must finish and be collected at a synchronization point before the next iteration begins. In practice, N << 462 worker threads are launched and wait, each plucking a portion of the work off a queue (or being assigned jobs).
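
To make the scheme concrete, here is a minimal sketch in Python of the per-iteration fan-out and the synchronization point. It is not tbgen2 code: the count of 462 king configurations is real, but do_retrograde_sweep and its convergence rule are stand-ins.

Code:

from concurrent.futures import ThreadPoolExecutor

NUM_KING_CONFIGS = 462   # canonical wK/bK placements on the 8x8 board, pawnless
NUM_WORKERS = 8          # in practice N << 462 workers pull jobs off a queue

def do_retrograde_sweep(config_id, iteration):
    # Stand-in for the real work: with this king pair fixed, sweep every
    # piece placement, retract moves and update the DTx values.  Returns
    # True if anything changed, i.e. another iteration is needed.
    return iteration < 3 and config_id % 7 == 0   # dummy convergence rule

def run_iteration(iteration):
    # Each king configuration is an independent job within this pass; the
    # end of the 'with' block is the synchronization point where all jobs
    # are collected before the next iteration may begin.
    with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
        results = list(pool.map(
            lambda c: do_retrograde_sweep(c, iteration),
            range(NUM_KING_CONFIGS)))
    return any(results)

if __name__ == "__main__":
    iteration = 0
    while run_iteration(iteration):
        iteration += 1
    print("converged after", iteration, "iterations")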

Things get fun with pawns on the board. Adding a single pawn increases the parallelism by a factor of 4. First, positions with pawns on {a7/h7, b7/g7, c7/f7, d7/e7} are computed. If, let us say, b7/g7 finishes first, then you are ready to tackle b6/g6. It is not necessary to synchronize with the other pawn threads, since each subspace converges independently of the others. Columns are treated as matching pairs because of the horizontal symmetry of pawnful indexing. To be more specific, if the white king traverses the d/e boundary the configuration is no longer canonical, and the board has to be flipped along the vertical axis to restore order to the land. As a result a pawn on the b column moves to g, and vice versa, etc. That's why you can't get an 8x improvement in parallelization. At first I thought each column could be treated separately; only after careful study did I realize my mistake.
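
The dependency-driven scheduling of the pawn slices could look something like the sketch below. Again this is illustrative only, not tbgen2 code: solve_slice is a stand-in, and promotion subgames are assumed to be already complete.

Code:

from concurrent.futures import ThreadPoolExecutor, as_completed

FILE_PAIRS = ["a/h", "b/g", "c/f", "d/e"]   # matched pairs under horizontal symmetry
RANKS = [7, 6, 5, 4, 3, 2]                  # one slice per pawn rank, most advanced first

def solve_slice(file_pair, rank):
    # Stand-in for solving one pawn slice; the slice on rank r only needs
    # the already-finished slice on rank r+1 of the same file pair (plus
    # promotion subgames, assumed complete here).
    return (file_pair, rank)

with ThreadPoolExecutor(max_workers=4) as pool:
    # Seed one job per file pair: the 7th-rank slices.
    pending = {pool.submit(solve_slice, fp, RANKS[0]): (fp, 0) for fp in FILE_PAIRS}
    while pending:
        done = next(as_completed(pending))          # whichever slice finishes first
        fp, idx = pending.pop(done)
        print("finished slice", done.result())
        if idx + 1 < len(RANKS):
            # This file pair advances immediately; no barrier with the other pairs.
            pending[pool.submit(solve_slice, fp, RANKS[idx + 1])] = (fp, idx + 1)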

Adding extra pawns to the board offers even more leverage, but becomes increasingly difficult to manage. The top of the pyramid -- pure pawn endgames à la kppppk -- is a dream and a nightmare. It can be parallelized like crazy, but the overhead of managing the work far outweighs the little bit of computation that goes on inside each case.

Vegan is incorrect in claiming that generation is memory bandwidth-bound. It depends on the algorithm. By all reports the algorithm of Muller is bandwidth-bound. tbgen and tbgen2 are CPU-bound. Edwards is disk-bound. FEG appears well balanced between disk and CPU. A point he makes that is true is that in a cluster architecture the huge amount of data passing between nodes would drag down any naive strategy for distributing the work. Bal et al. solved this by cleverly combining messages in a message-passing parallel architecture (*). Preferable (for ease of programming) is a shared-memory architecture, if you can get enough RAM. Of today's big-iron machines I favor SGI NUMAlink for tablebase crunching. (Not that I have access to one, mind you.)
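
To be specific about the message combining: the trick is to buffer DTx updates destined for positions owned by another node and ship them in large batches, so the per-message network overhead is paid once per batch rather than once per update. A rough sketch in Python (my own illustration, not Bal et al.'s code; the partitioning function is arbitrary):

Code:

from collections import defaultdict

NUM_NODES = 16
BATCH_SIZE = 4096

class UpdateCombiner:
    def __init__(self, send):
        self.send = send                      # send(node, list_of_updates)
        self.buffers = defaultdict(list)

    def owner(self, position_index):
        return position_index % NUM_NODES     # illustrative partitioning only

    def post_update(self, position_index, new_value):
        node = self.owner(position_index)
        self.buffers[node].append((position_index, new_value))
        if len(self.buffers[node]) >= BATCH_SIZE:
            self.send(node, self.buffers.pop(node))

    def flush(self):                          # call at the end of each pass
        for node, batch in self.buffers.items():
            self.send(node, batch)
        self.buffers.clear()

if __name__ == "__main__":
    sent = []
    combiner = UpdateCombiner(lambda node, batch: sent.append((node, len(batch))))
    for i in range(100000):
        combiner.post_update(i * 37, 5)
    combiner.flush()
    print("first few batches (node, size):", sent[:5])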

Enough talk of supercomputers. Back to desktop machines. Here are some measurements I made on a quad-core and a dual quad-core Intel machine with 16 GB of memory. The table is kbnkn. Times are in seconds and are scaled to a 3 GHz clock.

Code:

CPUs  Quad  Oct
1     5491  5543
2     3098  3158
4     1521  1567
6           1067
8           862
The quad running at full utilization achieves a speedup of 3.61x. For the dual-quad machine it is only 6.43x. The underutilization is due to the semi-random, nonlocal memory access pattern of looking up child-position DTM values: they can be spread all over the memory space. That sucks, because the hierarchical memory architecture of modern CPUs hinges deeply on the presumption of memory locality. EGTB generation breaks this assumption with a vengeance. If instead of generating a table I merely iterate over the positions, then the speedup is perfectly linear.
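
For reference, the speedups quoted above follow straight from the table; a few lines of Python reproduce them, together with the per-core efficiency:

Code:

quad = {1: 5491, 2: 3098, 4: 1521}
octo = {1: 5543, 2: 3158, 4: 1567, 6: 1067, 8: 862}
for name, times in (("quad", quad), ("oct", octo)):
    base = times[1]                              # single-CPU time is the baseline
    for cpus, t in sorted(times.items()):
        print(f"{name} {cpus} CPUs: speedup {base / t:.2f}x, "
              f"efficiency {base / t / cpus:.0%}")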

I realize I haven't answered guy's implicit question. Should we be getting excited about GPUs and Cell architectures and so on? I don't know, but probably not.

jk

(*) Henri Bal and Victor Allis, Parallel Retrograde Analysis on a Distributed System (1995), http://citeseer.ist.psu.edu/bal95parallel.html.
See also John W. Romein and Henri E. Bal, Solving the Game of Awari Using Parallel Retrograde Analysis (2003).
Vegan

Re: The Parallelisability of EGT-generation

Post by Vegan »

I have the use of several mainframes, and I have tested chess engines and retrograde analysis on them. Machines with faster memory access, such as Cray's, are much faster than others such as a Sun Enterprise. The problem is that 7 pieces needs tons of memory to be achievable.

I have access to Cray, Sun and the like, so I have experience with many mainframes as well as PCs. I found the biggest gains for EGTB generation come from memory, not the processor. Cutting random-access times matters much more.

As for using more cores, you may be chaining cache memory, which would help. I have only got a dual-core machine at present; the quad burned out.
jkominek
Posts: 150
Joined: Mon Dec 04, 2006 9:02 am
Sign-up code: 0
Location: Pittsburgh, PA

Re: The Parallelisability of EGT-generation

Post by jkominek »

Vegan wrote: I have the use of several mainframes, and I have tested chess engines and retrograde analysis on them. Machines with faster memory access, such as Cray's, are much faster than others such as a Sun Enterprise.
Nice. Conduct some carefully designed experiments and show us scaling numbers.

jk
Vegan

Re: The Parallelisability of EGT-generation

Post by Vegan »

Unfortunately I am not using those machines as a single user, so tests are not very stable.

I found that tweaking my RAM gave more benefit than overclocking the processor. My current CPU is an AMD X2 4200+ with a 2 GB stick of DDR2-800 RAM.
guyhaw
Posts: 489
Joined: Sat Jan 21, 2006 10:43 am
Sign-up code: 10159
Location: Reading, UK

Re: The Parallelisability of EGT-generation

Post by guyhaw »

Thanks to jk for the precise data, showing good benefits [better than I'd expect] from up to 8 processors. Also, thanks for the reference to the past work of Allis, Romein and Bal, which I had forgotten about. My question was really about parallelism within one endgame, which jk addressed.

I recall that Ren Wu did some careful experiments with Retrograde Analysis on different architectures. Ren and I discussed his algorithm in some detail, especially the serial use of multiple disc channels and the possible use of compression/decompression. With extra processors, on-the-fly compression/decompression comes 'free', so there is no need to store information uncompressed on disc, not even the evolving bitmaps. Unmoving, especially from stm-wins to possible stm-losses, is, I think, an expensive inner loop.
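
As an illustration of the 'free' compression idea (a sketch of my own, not Ren Wu's code): with spare cores, finished blocks of the evolving tables can be compressed in parallel before being written, so nothing need sit uncompressed on disc. zlib and the length-prefixed format here are just stand-ins.

Code:

import zlib
from concurrent.futures import ThreadPoolExecutor

def compress_block(block):
    # Compress one finished block of table data; zlib stands in for
    # whatever scheme a real generator would use.
    block_id, raw = block
    return block_id, zlib.compress(raw, 6)

def write_blocks(blocks, outfile, workers=4):
    # Worker threads compress blocks while the main thread writes them
    # out, length-prefixed, in their original order.
    with ThreadPoolExecutor(max_workers=workers) as pool, open(outfile, "wb") as f:
        for block_id, packed in pool.map(compress_block, blocks):
            f.write(len(packed).to_bytes(4, "little"))
            f.write(packed)

if __name__ == "__main__":
    dummy = [(i, bytes([i % 251]) * 65536) for i in range(32)]   # stand-in blocks
    write_blocks(dummy, "slice.bin.z")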

Obviously, there's a 'precedence lattice' which indicates the endgame forces for which independent computations can be launched in parallel. This extends to P-slice sub-EGTs (of equal total distance to conversion) within a P-ful endgame force. Computing by P-slices surely improves the locality of search, as much smaller index ranges are involved, so it's the P-less endgames that are the problem.
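
To illustrate the precedence lattice, here is a toy sketch in Python; the dependency table is hand-picked and incomplete (underpromotions are omitted), and a real scheduler would generate each 'ready' wave in parallel rather than merely print it.

Code:

DEPENDS_ON = {
    # An endgame can be started once every force reachable from it by a
    # capture or promotion has been completed.  Illustrative entries only.
    "KQK":  [],
    "KRK":  [],
    "KQKQ": ["KQK"],
    "KQKR": ["KQK", "KRK"],
    "KPK":  ["KQK"],                     # promotion only; underpromotions omitted
    "KQKP": ["KQK", "KPK", "KQKQ"],
}

def generation_order(deps):
    # Kahn-style topological sort, returning the endgames in waves;
    # everything within one wave could be generated in parallel.
    done, waves = set(), []
    remaining = dict(deps)
    while remaining:
        ready = sorted(eg for eg, d in remaining.items() if all(x in done for x in d))
        if not ready:
            raise ValueError("cycle in dependency table")
        waves.append(ready)
        done.update(ready)
        for eg in ready:
            del remaining[eg]
    return waves

print(generation_order(DEPENDS_ON))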

I guess the more RAM the better, and I think the existence of Bitbases speeds up the generation of metric-EGTs. I always believed that EN completed his 6-man EGTs with no more than 2 GB of RAM.

g
Vegan

Re: The Parallelisability of EGT-generation

Post by Vegan »

Eugene used a quad-processor Alpha box to generate the tablebases. It initially had 512 MB of RAM but was upgraded to 8 GB for 6 pieces.

7 pieces needs a mainframe-class machine.

I have been looking at and testing various machines, and I am attempting to get single-user access for some serious testing before running a program as a low-priority user.