Factorizing tablebases by K-slice
Posted: Tue May 06, 2008 12:27 pm
To minimize the working set during TB generation, to make sure it fits in the CPU caches, it can be very useful to work by K-slice. A slice is a set of positions that have certain 'static' pieces always on the same square, all other pieces (the 'dynamic' ones) being put in every possible constellation on the board. A K-slice has a King in such a fixed position.
The nice property of the King, apart from its ever-presence on both sides, is its localized mode of moving. This makes it possible to strongly reduce the storage requirement for the working set if we are prepared to treat the King moves in two passes through the TB. E.g., we could treat in the first pass all moves within and between the pairs of files (a,b), (c,d), (e,f) and (g,h). This would require holding at most 4 different K-slices in the working set. If we are working on the (a,b) files, we would load the slices with King on a1, a2, b1 and b2 in cache, process all moves between them, and then we are completely done with a1 and b1. So those could be flushed out of the cache at the expense of the K-slices for a3 and b3, after which we could do all moves between a2, b2, a3 and b3 (i.e. a2<->a3, b2<->b3, a3<->b3, a2<->b3 and b2<->a3, as a2<->b2 was already done in the previous set). And so on, replacing a2 and b2 in the cache with a4 and b4.
In a second pass we would then have to treat moves between the remaining file pairs: (b,c), (d,e) and (f,g). This would go in the same way, first loading b1, c1, b2, c2, processing all moves between them (this time leaving out b1<->b2 and c1<->c2, which had already been processed in the first pass), before discarding b1 and c1 at the expense of b3, c3, etc. I will call this set of moves the Y-moves, the first set the X-moves.
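The X/Y split above can be stated very compactly: a King move changes file by at most one, so every move either stays within one of the pass-1 file pairs or crosses between two of them. A minimal sketch (the function name and 0-based file numbering are my own, not from the post):

```c
#include <assert.h>

/* Classify a King move by the files of its from- and to-squares
 * (0 = a-file ... 7 = h-file). X-moves stay within one of the pairs
 * (a,b), (c,d), (e,f), (g,h) and are handled in pass 1; Y-moves cross
 * between pairs, i.e. (b,c), (d,e) or (f,g), and are handled in pass 2.
 * Since a King move changes file by at most one, these two cases
 * together cover all King moves. */
int is_x_move(int from_file, int to_file)
{
    return from_file / 2 == to_file / 2;   /* same file pair */
}
```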
In each pass at most 4 King positions need to be held in cache simultaneously. This is very convenient: current CPUs cannot hold slices with 4 dynamic pieces (which have 64^4 = 16M positions, so even packing the TB to 4 bits per position would take 8MB, exceeding even the Penryn L2 size), but have room to spare for slices with 3 dynamic pieces (256K positions each). So it would be very easy to keep 4 K-slices with 3 dynamic pieces each in cache (1M positions).
This would provide efficient schemes for in-RAM generation of 6-men EGTBs or bitbases. For instance:
2+4 men:
Pass 1 brings slices in cache where 2 white pieces are dynamic, for 4 different positions of the black King. This looks like very little data, but the storage scheme will be such that some black piece (other than the K) will be the most-rapidly scanning piece when we sequentially increase the index. Because cache lines always contain 64 contiguous bytes of data, we will be forced to always treat that piece as dynamic, whether we want to or not: the other positions of that piece will always be dragged into cache together with any one of them we try to access. So in fact we will have 3 dynamic pieces, even though we don't use the black one to move with, and need to cache 4 x 256K = 1M entries.
Within this set we can then scan for LOST-in-N btm positions, and do all white unmoves to get to the WON-in-N+1 wtm positions (and mark those as such). From the latter, we then continue doing the black King unmoves of the X-group to get to (some of the) POTENTIALLY-LOST-in-N+1 btm positions. All these moves stay within the cached slices, so 100% cache hits after the first miss that brought them in cache.
Pass 2 then scans by slices that have 3 (non-K) dynamic pieces, again for 4 different black King positions. It seeks out the WON-in-N+1 wtm positions marked in pass 1, to do the unmoves of the three dynamic black pieces, and tries the black King Y-group unmoves, to get to the remaining POTENTIALLY-LOST-in-N+1 btm positions (which are now all in cache). For each such position it creates (plus each such position it encounters that was created in pass 1) it then starts the verification step: it tries all black moves with the dynamic pieces, to see if black has an 'escape' to an UNDECIDED position. If black has such an escape, the POTENTIALLY-LOST position is reset to UNDECIDED. If not, we still cannot be sure that it is a LOST-in-N+1 position, as it might still have X-group King moves that save black. And we cannot test those at this time, as the positions to which these moves lead are in K-slices currently not in cache. So at the end of pass 2 we will have identified all POTENTIALLY-LOST-in-N+1 positions, and already discarded the bulk of them. But we have no LOST-in-N+1 positions yet.
This means that pass 1 of the next cycle, which again does white unmoves and X-group black King moves, has no LOST-in-N+1 positions to start from, but must actually first finish the verification of the POTENTIALLY-LOST-in-N+1 positions, by trying the X-moves. As a consequence, we need to actually store 6 K-slices, rather than 4, as we cannot verify the positions with Ka2 and Kb2 through X-moves without having a3 and b3 present for probing (next to a1 and b1). And we cannot discard a1 and b1 yet, because after we have verified the POTENTIALLY-LOST-in-N+1 positions with Ka2, Kb2 (turning them into LOST-in-N+1 positions), we will have to revisit them with the black X-group unmoves. But storing 6 x 256K = 1.5M positions should still be no problem.
So in summary:
Pass 1
1) verify any POTENTIALLY-LOST-in-N through black King X-moves
2) if they survive, do all white unmoves from them to get to WON-in-N+1
3) from every WON-in-N+1 you create, do black King X-unmoves to label the POTENTIALLY-LOST-in-N+1
Pass 2
1) for every WON-in-N+1 do all King Y-unmoves and black pieces' other moves to create more POTENTIALLY-LOST-in-N+1
2) for every POTENTIALLY-LOST-in-N+1 created or found, (partially) verify it with black King Y-moves and all other pieces' moves. If it verifies, it remains POTENTIALLY-LOST-in-N+1, otherwise it is reset to UNDECIDED.
For 3+3 you can do it in a similar way, but now a white piece has to be stored contiguously (so that a cache line contains positions that differ by a single white piece).
Pass 1 would now treat 3 white dynamic pieces, and the black King X-moves.
Pass 2 would treat the black King Y-moves and other black pieces' moves. There would now only be 2 black dynamic pieces. The third dynamic piece would be the white one forced on us, but we already had done its moves in pass 1, so we ignore it here. (Apart from the fact that we scan over all its positions, of course, to make sure we do not skip parts of the TB.)
For 4+2 another scheme is needed, as there are now too many white pieces to treat them all as dynamic in pass 1. Here we can easily keep a slice containing all black pieces as dynamic, though, and there would still be room left to treat one white piece as dynamic, plus the white King partially. So we would need 3 passes.
Pass 1 would scan in 1+2 K-slices for 4 fixed white King positions. It would treat the LOST-in-N btm positions by generating white King X-unmoves plus unmoves of the dynamic white piece to create WON-in-N+1 wtm positions. For each one it created it would generate all black unmoves to identify POTENTIALLY-LOST-in-N+1 positions, and immediately verify them with all black moves. As all black moves are available within the cached data set (and thus any arbitrarily long sequence of black moves), the verification can be completed, and we get either a LOST-in-N+1 btm position, or the entry reverts to UNDECIDED.
Such a pass thus brings us directly from LOST-in-N to LOST-in-N+1 btm positions (grandparent algorithm). It is just that it did not get to all of them, as some of them have to be reached through unmoves of other white pieces. We do those in the later passes:
Pass 2 is as pass 1, for another white piece, but without King moves.
Pass 3 is as pass 1, for the third white piece, plus the white King Y-unmoves. After that, we will have created all LOST-in-N+1 positions.
Now it seems we need 3 passes here, but if we are clever, we can change the order of the passes in the next cycle (as, unlike in the 3+3 and 2+4 case, they are identical, conceptually parallel operations rather than sequential steps). If we start with pass 3 there, we could merge it with the last pass of the previous cycle. So as soon as we have got all LOST-in-N+1 positions of the slices of pass 3 of the previous cycle that are currently in cache, we immediately repeat the process on those slices to get to the LOST-in-N+2 positions. This would require us to keep 6 K-slices in cache, though, as we now want to do two subsequent white King moves, and would still have to be able to move from b2 to b1 in cycle N+2 after we moved from b3 to b2 in cycle N+1 to create the LOST-in-N+1 position there. But this is no problem: six 3-men slices fit easily. By merging these passes we would need only 2 passes per cycle, like before.
For 5+1 it would be even easier. It would be done as 4+2, except now we can have 2 dynamic white pieces in the slice. So pass 2 of the 4+2 scheme would be skipped, and passes 1 and 3 would be done with 2+1 slices rather than 1+2 slices. The black piece (the King) would be stored contiguously.
In conclusion: it seems that all of the 6-men tablebases can be built with only 2 passes per cycle and 1.5M positions in cache. Note that my 2.4GHz Core 2 Duo can sequentially scan through dirty memory (i.e. where it has to write back the old cache content before loading the new) at 0.9ns/byte. So on an 8GB uncompressed pawnless 6-men TB (1 byte per position), a pass could be done in 7.2 sec, or 15 sec per cycle, if all the entries in the TB needed to be accessed. In practice, most passes are very sparse (especially in lengthy end-games), and many cache lines can be skipped, saving a large factor on those 15 sec. You would really only need the 15 sec during passes where most of the positions would be assigned a DTx. This is usually a range of 10 cycles, or so. (You cannot keep assigning more than 10% of the entries a DTx for more than 10 cycles, as you would run out of entries.) That would take 150 sec, then. The challenge would be to make sure the CPU can keep up with this, so that the process remains memory-bandwidth limited.