Factorizing tablebases by K-slice

Endgame analysis using tablebases, EGTB generation, exchange, sharing, discussions, etc.

Factorizing tablebases by K-slice

Post by h.g.muller »

To minimize the working set during TB generation, so that it fits in the CPU caches, it can be very useful to work by K-slice. A slice is a set of positions that have certain 'static' pieces always on the same square, all other pieces (the 'dynamic' ones) being put in every possible constellation on the board. A K-slice has a King in such a fixed position.
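To make the bookkeeping concrete: a minimal sketch (hypothetical layout, not taken from any existing generator) of how a K-slice with one static King and three dynamic pieces could be indexed:

#include <stdio.h>

/* A K-slice: one King is static (its square identifies the slice);
   the three dynamic pieces enumerate all 64^3 = 256K constellations.
   p3 varies fastest, so entries that differ only in p3 are contiguous
   in memory and thus share cache lines. */
static unsigned slice_index(unsigned p1, unsigned p2, unsigned p3)
{
    return (p1 << 12) | (p2 << 6) | p3;
}

int main(void)
{
    printf("entries per slice: %u\n", 64u * 64u * 64u);    /* 256K */
    printf("p3 on c1 vs d1: %u %u\n",
           slice_index(0, 0, 2), slice_index(0, 0, 3));    /* adjacent */
    return 0;
}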

The nice property of the King, apart from its ever-presence on both sides, is its localized mode of moving. This makes it possible to strongly reduce the storage requirement for the working set if we are prepared to treat the King moves in two passes through the TB. E.g., we could treat in the first pass all moves within and between the file pairs (a,b), (c,d), (e,f) and (g,h). This would require at most 4 different K-slices in the working set. If we are working on the (a,b) files, we would load the slices with the King on a1, a2, b1 and b2 into cache, process all moves between them, and then we are completely done with a1 and b1. So these could be flushed out of the cache at the expense of the K-slices for a3 and b3, after which we could do all moves between a2, b2, a3 and b3 (i.e. a2<->a3, b2<->b3, a3<->b3, a2<->b3 and b2<->a3, as a2<->b2 was already done in the previous set). And so on, replacing a2 and b2 in the cache with a4 and b4.

In a second pass we would then have to treat moves between the remaining file pairs: (b,c), (d,e) and (f,g). This would go in the same way, first loading b1, c1, b2, c2, processing all moves between them (this time leaving out b1<->b2 and c1<->c2, which had already been processed in the first pass), before discarding b1 and c1 at the expense of b3, c3, etc. I will call this set of moves the Y-moves, the first set the X-moves.
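As a quick standalone check (my own sketch, not generator code) that this X/Y split covers every King move exactly once: a move is an X-move precisely when both squares fall in the same pass-1 file pair, and a Y-move otherwise.

#include <stdio.h>

int main(void)
{
    int x = 0, y = 0;
    for (int f = 0; f < 8; f++)
        for (int r = 0; r < 8; r++)
            for (int df = -1; df <= 1; df++)
                for (int dr = -1; dr <= 1; dr++) {
                    int f2 = f + df, r2 = r + dr;
                    if ((df == 0 && dr == 0) ||
                        f2 < 0 || f2 > 7 || r2 < 0 || r2 > 7) continue;
                    if (f / 2 == f2 / 2) x++;   /* stays in its file pair */
                    else                 y++;   /* crosses b-c, d-e or f-g */
                }
    /* every move was counted in both directions, hence the /2;
       prints: X-moves: 144  Y-moves: 66  total: 210 */
    printf("X-moves: %d  Y-moves: %d  total: %d\n", x / 2, y / 2, (x + y) / 2);
    return 0;
}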

In each pass, at most 4 K-slices need to be held in cache simultaneously. This is very convenient, as current CPU caches cannot hold slices with 4 dynamic pieces (which have 64^4 = 16M positions, so even packing the TB to 4 bits per position would take 8MB, exceeding even the Penryn L2 size), but are far larger than needed for a single slice with 3 dynamic pieces (256K positions). So it is very easy to keep 4 K-slices with 3 dynamic pieces each in cache (1M positions).

This would provide efficient schemes for in-RAM generation of 6-men EGTBs or bitbases. For instance:
2+4 men:
Pass 1 brings slices into cache where the 2 white pieces are dynamic, for 4 different positions of the black King. This looks like very little data, but the storage scheme will be such that some black piece (other than the K) is the most rapidly scanning piece when we sequentially increase the index. Because cache lines always contain 64 contiguous bytes of data, we will be forced to always treat that piece as dynamic, whether we want to or not: the other positions of that piece will always be dragged into cache together with the one we try to access. So in fact we will have 3 dynamic pieces, even though we don't use the black one to move with, and need to cache 4 x 256K = 1M entries.
Within this set we can then scan for LOST-in-N btm positions, and do all white unmoves to get to the WON-in-N+1 wtm positions (and mark those as such). From the latter, we then continue doing the black King unmoves of the X-group to get to (some of the) POTENTIALLY-LOST-in-N+1 btm positions. All these moves stay within the cached slices, so we get 100% cache hits after the first miss that brought them into cache.
Pass 2 then scans by slices that have 3 (non-K) dynamic pieces, again for 4 different black King positions. It seeks out the WON-in-N+1 wtm positions marked in pass 1, does the unmoves of the three dynamic black pieces, and tries the black King Y-group unmoves, to get to the remaining POTENTIALLY-LOST-in-N+1 btm positions (which are now all in cache). For each such position it creates (plus each one it encounters that was created in pass 1), it then starts the verification step: it tries all black moves with the dynamic pieces, to see if black has an 'escape' to an UNDECIDED position. If he has, the POTENTIALLY-LOST position is reset to UNDECIDED. If he hasn't, we still cannot be sure that it is a LOST-in-N+1 position, as it might still have X-group King moves that save black. And we cannot test those at this time, as the positions these moves lead to are in K-slices currently not in cache. So at the end of pass 2 we will have identified all POTENTIALLY-LOST-in-N+1 positions, and already discarded the bulk of them. But we have no LOST-in-N+1 positions yet.
This means that pass 1 of the next cycle, which again does white unmoves and black King X-moves, has no LOST-in-N+1 positions to start from, but must actually first finish the verification of the POTENTIALLY-LOST-in-N+1 positions, by trying the X-moves. As a consequence, we need to store 6 K-slices rather than 4, as we cannot verify the positions with Ka2 and Kb2 through X-moves without having a3 and b3 present for probing (next to a1 and b1). And we cannot discard a1 and b1 yet, because after we have verified the POTENTIALLY-LOST-in-N+1 positions with Ka2 and Kb2 (turning them into LOST-in-N+1 positions), we will have to revisit them with the black X-group unmoves. But storing 6 x 256K = 1.5M positions should still be no problem.
So in summary (a code skeleton follows the list):
Pass 1
1) verify any POTENTIALLY-LOST-in-N through black King X-moves
2) if they survive, do all white unmoves from them to get to WON-in-N+1
3) from every WON-in-N+1 you create, do black King X-unmoves to label the POTENTIALLY-LOST-in-N+1
Pass 2
1) for every WON-in-N+1 do all black King Y-unmoves and the other black pieces' unmoves to create more POTENTIALLY-LOST-in-N+1
2) for every POTENTIALLY-LOST-in-N+1 created or found, (partially) verify it with black King Y-moves and all other black pieces' moves. If it verifies, it remains POTENTIALLY-LOST-in-N+1; otherwise it is reset to UNDECIDED.
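In code form, one such cycle could look roughly like the skeleton below (plain C, with every helper a hypothetical stub standing in for the real slice, indexing and move-generation machinery; only the control flow is meant to be taken literally):

#include <stdbool.h>
#include <stddef.h>

enum State { UNDECIDED, WON, POT_LOST, LOST };

/* hypothetical stubs for the real TB machinery */
static void set_state(size_t pos, enum State s) { (void)pos; (void)s; }
static bool black_king_x_escape(size_t pos)     { (void)pos; return false; }
static bool black_y_or_piece_escape(size_t pos) { (void)pos; return false; }

/* Pass 1: applied to every POTENTIALLY-LOST-in-N btm position. */
static void pass1(size_t pos)
{
    if (black_king_x_escape(pos)) {       /* 1) finish the verification */
        set_state(pos, UNDECIDED);
        return;
    }
    set_state(pos, LOST);                 /* survived: LOST-in-N        */
    /* 2) all white unmoves from pos -> mark the parents WON-in-N+1     */
    /* 3) black King X-unmoves from each new WON-in-N+1                 */
    /*    -> mark those btm positions POTENTIALLY-LOST-in-N+1           */
}

/* Pass 2: applied to every POTENTIALLY-LOST-in-N+1 btm position,
   whether created here (through black King Y-unmoves and the other
   black pieces' unmoves from the WON-in-N+1) or found from pass 1. */
static void pass2(size_t pos)
{
    /* partial verification: King Y-moves plus all other black moves */
    if (black_y_or_piece_escape(pos))
        set_state(pos, UNDECIDED);
    /* else it stays POTENTIALLY-LOST-in-N+1; the X-move part of the
       verification has to wait until pass 1 of the next cycle */
}

int main(void) { pass1(0); pass2(0); return 0; }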

For 3+3 you can do it in a similar way, but now a white piece has to be stored contiguously (so that a cache line contains positions that differ only in the location of a single white piece).
Pass 1 would now treat the 3 dynamic white pieces and the black King X-moves.
Pass 2 would treat the black King Y-moves and the other black pieces' moves. There would now be only 2 dynamic black pieces. The third dynamic piece would be the white one forced on us, but we already did its moves in pass 1, so we ignore it here. (Apart from the fact that we scan over all its positions, of course, to make sure we do not skip parts of the TB.)

For 4+2 another scheme is needed, as there are now too many white pieces to treat them all as dynamic in pass 1. Here we can easily keep a slice containing all black pieces as dynamic, though, and there would still be room left to treat one white piece as dynamic, plus the white King partially. So we would need 3 passes.
Pass 1 would scan in 1+2 K-slices, for 4 fixed white King positions. It would treat the LOST-in-N btm positions by generating white King X-unmoves plus unmoves of the dynamic white piece, to create WON-in-N+1 wtm. For each one it created, it would generate all black unmoves to identify POTENTIALLY-LOST-in-N+1 positions, and immediately verify them with all black moves. As all black moves are available within the cached data set (and thus any arbitrarily long sequence of black moves), the verification can be completed, and we get either a LOST-in-N+1 btm position, or the entry reverts to UNDECIDED.
Such a pass thus brings us directly from LOST-in-N to LOST-in-N+1 btm positions (grandparent algorithm). It just does not get to all of them, as some have to be reached through unmoves of the other white pieces. We do those in the later passes:
Pass 2 is as pass 1, for another white piece, but without King moves.
Pass 3 is as pass 1, for the third white piece, plus the white King Y-unmoves. After that, we will have created all LOST-in-N+1 positions.
Now it seems we need 3 passes here, but if we are clever, we can change the order of the passes in the next cycle (as, unlike in the 3+3 and 2+4 cases, they are identical, conceptually parallel operations rather than sequential steps). If we start with pass 3 there, we can merge it with the last pass of the previous cycle: as soon as we have all LOST-in-N+1 positions of the slices of pass 3 of the previous cycle that are currently in cache, we immediately repeat the process on those slices to get to the LOST-in-N+2 positions. This would require us to keep 6 K-slices in cache, though, as we now want to do two subsequent white King moves, and would still have to be able to move from b2 to b1 in cycle N+2 after we moved from b3 to b2 in cycle N+1 to create the LOST-in-N+1 position there. But this is no problem; six 3-men slices fit easily. By merging these passes we would need only 2 passes per cycle, like before.

For 5+1 it would be even easier. It would be done as 4+2, except that now we can have 2 dynamic white pieces in the slice. So pass 2 of the 4+2 scheme would be skipped, and passes 1 and 3 would be done with 2+1 slices rather than 1+2 slices. The black piece (the King) would be stored contiguously.

In conclusion: it seems that all of the 6-men tablebases can be built with only 2 passes per cycle and 1.5M positions in cache. Note that my 2.4GHz Core 2 Duo can sequentially scan through dirty memory (i.e. where it has to write back the old cache content before loading the new) at 0.9ns/byte. So on an 8GB uncompressed pawnless 6-men (1 byte per position), a pass could be done in 7.2 sec, or 15 sec per cycle, if all the entries in the TB needed to be accessed. In practice, most passes are very sparse (especially in lengthy end-games), and many cache lines can be skipped, saving a large factor on those 15 sec. You would really only need the 15 sec during passes where most of the positions are assigned a DTx. This is usually a range of some 10 cycles. (You cannot keep assigning a DTx to more than 10% of the entries for more than 10 cycles, as you would run out of entries.) That would take 150 sec, then. The challenge would be to make sure the CPU can keep up with this, so that the process remains memory-bandwidth limited.
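The arithmetic behind those numbers, for whoever wants to redo it (constants as quoted above):

#include <stdio.h>

int main(void)
{
    double bytes = 8e9;         /* uncompressed pawnless 6-men, 1 byte/position */
    double ns_per_byte = 0.9;   /* measured dirty-memory streaming rate         */
    double pass = bytes * ns_per_byte * 1e-9;
    printf("one pass: %.1f sec\n", pass);                  /* 7.2 sec  */
    printf("one cycle (2 passes): %.1f sec\n", 2 * pass);  /* ~15 sec  */
    printf("10 dense cycles: %.0f sec\n", 20 * pass);      /* ~150 sec */
    return 0;
}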

Re: Factorizing tablebases by K-slice

Post by Tuhma »

If you can write a program that is able to generate all 6-men tablebase files (Nalimov format) on a 2.4-2.7 GHz quad with 2-4 GB of memory in under 100 days, then this is a great leap forward!

I have been downloading all 6-men tablebases for 1.3 years now and I am not finished yet, just to put your goals in perspective. What I am trying to say is that even if you do not reach your own goals, but are able to meet the goal I stated above, then that kind of program would be extremely useful for the tablebase community. Besides, one-terabyte disks are getting quite cheap, so having a complete 6-men set is a reasonable target nowadays.

regards,

Tuhma

Re: Factorizing tablebases by K-slice

Post by h.g.muller »

Well, the memory seems to be the biggest problem. For pawnless 6-men, that is. They have 8G entries, and I am not sure they could be compressed to anything less than 4GB of RAM.

But fortunately most 6-men contain Pawns, and a P-slice with even a single Pawn is only 1G positions (and any additional Pawn would make that another factor 64 smaller). So for pawny 6-men, which are the large majority, it should be no problem.
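In numbers (the division by 8 is my reading of where the 8G figure for pawnless 6-men comes from, i.e. the usual 8-fold board symmetry; the P-slice figure follows directly from fixing one Pawn):

#include <stdio.h>

int main(void)
{
    double men6 = 1.0;
    for (int i = 0; i < 6; i++) men6 *= 64;   /* 64^6 squares */
    printf("pawnless 6-men: %.1fG entries\n", men6 / 8 / 1e9);        /* ~8.6G  */
    printf("single-Pawn P-slice: %.2fG positions\n", men6 / 64 / 1e9); /* ~1.07G */
    return 0;
}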

Re: Factorizing tablebases by K-slice

Post by h.g.muller »

I have found a slight refinement of the King scanning.

If we want to distribute the King moves over two passes, we can do the first pass in the order:

a1, b1, b2, a2, a3, b3, b4, a4, a5, ..., b8, a8, c1, d1, d2, c2, c3, ...

where we treat all diagonal moves between the (a,b) files and between the (c,d) files (so not between (b,c)!). We would also treat all moves between squares directly following each other in the sequence (so not between, say, b2 and b3). In the second pass we would use the same pattern shifted one file to the right. This would do the moves we left out in the first pass.

Inspection shows that a diagonal move always goes to the next-nearest neighbor in the sequence. This means that you would have to store at most three K-slices simultaneously, if you are only doing a single move. If you want to advance by two moves within the subset (for doing two cycles of white unmoves, or for doing a black unmove and a forward verification move), you might end up four places further in the sequence. So you would only need to hold 5 slices (instead of 6).
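A small standalone check of that property (my own code; squares are (file, rank) pairs, both counted from 0):

#include <stdio.h>
#include <stdlib.h>

/* Position of a square in the pass-1 sequence
   a1, b1, b2, a2, a3, b3, b4, a4, ..., repeated per file pair. */
static int seq(int f, int r)
{
    int left  = (f % 2 == 0);                 /* left file of its pair?   */
    int first = (r % 2 == 0) ? left : !left;  /* visited first on rank r? */
    return (f / 2) * 16 + r * 2 + (first ? 0 : 1);
}

int main(void)
{
    /* every diagonal King move that stays inside a file pair must join
       next-nearest neighbours in the sequence, i.e. distance exactly 2 */
    for (int f = 0; f < 8; f++)
        for (int r = 0; r < 8; r++)
            for (int df = -1; df <= 1; df += 2)
                for (int dr = -1; dr <= 1; dr += 2) {
                    int f2 = f + df, r2 = r + dr;
                    if (f2 < 0 || f2 > 7 || r2 < 0 || r2 > 7) continue;
                    if (f / 2 != f2 / 2) continue;    /* leaves the pair  */
                    if (abs(seq(f, r) - seq(f2, r2)) != 2)
                        printf("violation: %c%d-%c%d\n",
                               'a' + f, r + 1, 'a' + f2, r2 + 1);
                }
    printf("check done (no violations printed = property holds)\n");
    return 0;
}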

This might not be very important for squeezing several 3-men K-slices into the cache, but this technique could also be used for on-disk generation of 7-men EGTBs, where you want to squeeze several 5-men K-slices into RAM. A 5-men K-slice has 1G positions. E.g. for a 7-men 4+3 you could store (1+3)-men contiguous chunks on the disk. This means the entire TB can be read with only 256K seeks, to step through the chunks in any order, while you would have needed 16M seeks if you had stored them as 3-men chunks. The remaining 3 white pieces (the King being one of them) would have to be treated in two passes: piece 1 + X-group King moves, and later piece 2 + Y-group King moves, where you can choose to do the moves with the contiguously stored piece (which is always dynamic) in either of the passes. The maximum RAM demand is then five 1G slices if you want to use leap-frog (doing two cycles of white moves each pass), or three 1G slices if you do two passes per cycle. As a 1G slice might take only 4 bits per entry (even without compression, if it is a bitbase), this means we could already use this technique with 2GB of RAM.
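The seek arithmetic for that 4+3 example, spelled out:

#include <stdio.h>

int main(void)
{
    long long total  = 1LL << 42;   /* 64^7 positions in a 7-men TB */
    long long chunk4 = 1LL << 24;   /* (1+3)-men chunk: 64^4 = 16M  */
    long long chunk3 = 1LL << 18;   /* 3-men chunk: 64^3 = 256K     */
    printf("(1+3)-men chunks: %lld seeks\n", total / chunk4);  /* 256K */
    printf("3-men chunks:     %lld seeks\n", total / chunk3);  /* 16M  */
    return 0;
}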

With (3+4)-men it works the same, but with the 16MB chunks containing 4 dynamic black pieces.