fastest EGTB generation

Endgame analysis using tablebases, EGTB generation, exchange, sharing, discussions, etc..
h.g.muller
Posts: 223
Joined: Mon Feb 19, 2007 8:24 am
Sign-up code: 0
Location: Amsterdam
Contact:

fastest EGTB generation

Post by h.g.muller »

As I want to have my engine construct EGTBs 'behind the board' as it needs them (so that I can handle 9-10 men end-games with Pawns without needing gigabytes of pre-computed information), speed is of the utmost importance.

My own EGTB generator for pieces-only end-games (i.e. applying 8-fold symmetry) now generates the worst-case 5-men TB (KQRKQ) in 339 sec on my 1.3GHz laptop (wins only). Although this is acceptable for non-blitz time control, I am curious how this compares with what others are doing. What is the world record on generation time?
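
For reference, the symmetry reduction itself is the standard one; a minimal sketch of the idea (not my actual code, and the tie-break for positions with the King on the diagonal is omitted):

```c
#include <stdio.h>

/* Minimal sketch of the 8-fold symmetry reduction for pawnless TBs:
 * mirror the position so that a designated piece (here the white King)
 * ends up in the a1-d1-d4 triangle.  Squares are 0..63, file = sq & 7,
 * rank = sq >> 3. */

static int mirror_lr(int sq)   { return sq ^ 7;  }                     /* a<->h  */
static int mirror_tb(int sq)   { return sq ^ 56; }                     /* 1<->8  */
static int mirror_diag(int sq) { return ((sq & 7) << 3) | (sq >> 3); } /* a1-h8  */

static void canonicalize(int *sq, int n)   /* sq[0] is the white King */
{
    int i;
    if ((sq[0] & 7) > 3)             /* King on files e-h: mirror left-right   */
        for (i = 0; i < n; i++) sq[i] = mirror_lr(sq[i]);
    if ((sq[0] >> 3) > 3)            /* King on ranks 5-8: mirror top-bottom   */
        for (i = 0; i < n; i++) sq[i] = mirror_tb(sq[i]);
    if ((sq[0] >> 3) > (sq[0] & 7))  /* King above the a1-h8 diagonal: reflect */
        for (i = 0; i < n; i++) sq[i] = mirror_diag(sq[i]);
    /* King-on-diagonal positions still need a tie-break on the next piece. */
}

int main(void)                       /* tiny usage example: Ke5 maps to Kd4    */
{
    int pos[2] = { 36 /* e5 */, 0 /* a1 */ };
    canonicalize(pos, 2);
    printf("King now on square %d\n", pos[0]);   /* 27 = d4 */
    return 0;
}
```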

The average time over all 40 possible 3+2 men is 211 sec per end-game, but for generating behind the board the worst case is more relevant than the average. On the other hand, the EGTBs that take the longest (those with two Queens for the winning side) are not very relevant, as my engine would win them (except KQQKQ) with 100% probability at any time control even without TB. Especially if I tell it in the evaluation that any KQK position scores better than a position with 5 men on the board, so that it trades one of the Queens at the earliest possible moment.

Many non-trivial TBs have generation times in the 240 sec range, however. This is acceptable, as after the generation it can play all the remaining moves in zero time. If it is generating the TB only for probing, this is a bit more tricky. So I am still interested in speeding up the process.

(I guess I could easily gain a factor 2 for KQQKx by making use of the exchange symmetry between the two white Queens. But in the end I expect that the TBs that will be needed most are of the type KX..KY.., where X and Y are Pieces, with any number of Pawns substituting the dots, e.g. KRPKBPP. And in such end-games exchange symmetry between pieces doesn't play a role. So I am interested more in raw speed of the algorithm rather than in tricks to gain time through reducing the TB size.)
Arpad Rusz
Posts: 93
Joined: Mon Mar 27, 2006 5:33 pm
Sign-up code: 0
Location: Romania/Hungary
Contact:

Post by Arpad Rusz »

Just an idea: what if you use pre-computed bitbases (which are very small in size compared to tablebases) and then, during the game, compute the tablebases you need with their help? Generating the tablebases with the help of the bitbases should be faster. The only question is: how much faster is that process?
h.g.muller
Posts: 223
Joined: Mon Feb 19, 2007 8:24 am
Sign-up code: 0
Location: Amsterdam
Contact:

Post by h.g.muller »

Why do you expect that to be faster? I don't see how having the bitbase could shorten the building process.

It might help to have a bitbase for the TB with Queens to which one with a Pawn corresponds after promotion, though. Then it can easily look up whether a promotion is winning or not. :idea:
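
To make that concrete: a promotion in, say, KQPKQ could then be scored by a single bit lookup in a KQQKQ bitbase. A hedged sketch (kqqkq_index() and kqqkq_bits are made-up names for such an interface, not existing code):

```c
#include <stdint.h>

/* Hypothetical interface to a pre-computed win/non-win bitbase of the
 * promotion end-game (one bit per white-to-move position). */
extern const uint8_t kqqkq_bits[];
extern unsigned kqqkq_index(const int *squares);  /* piece squares -> index */

/* Returns 1 if the position after promoting the Pawn is a White win. */
static int promotion_wins(const int *squares_after_promotion)
{
    unsigned idx = kqqkq_index(squares_after_promotion);
    return (kqqkq_bits[idx >> 3] >> (idx & 7)) & 1;
}
```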
Arpad Rusz
Posts: 93
Joined: Mon Mar 27, 2006 5:33 pm
Sign-up code: 0
Location: Romania/Hungary
Contact:

Post by Arpad Rusz »

OK, maybe it is not a classical bitbase that helps in the generation process, but some other pre-computed, small-sized database.
guyhaw
Posts: 489
Joined: Sat Jan 21, 2006 10:43 am
Sign-up code: 10159
Location: Reading, UK
Contact:

EGT Generation: turbo ideas

Post by guyhaw »

5-6 minutes to generate KQRKQ is certainly quicker than Nalimov's GTBGEN on my 2.8GHz uni-processor - and would still be if it were only computing wins. I take it you are computing DTM and not a win/not_win bitbase.
Computing 9-10-man EGTs during the game is a touch ambitious at the moment, but here are some (possibly further) ideas that speed up the computation:
- compute to the DTC rather than the DTM metric
- write your program in machine-code rather than, e.g., C++ (as Yakov Konoval has demonstrated on 7-man)
- use compression before writing large blocks of data to disc
- use more discs, placed wisely to minimise head movement: 3 are enough with the best algorithms if running just one processor
- use more processors: the retrograde algorithms are entirely parallelisable, in themselves and alongside compression/decompression
- increase parallelisability by computing to the DTZ metric, doing a 'Pawn-slice sub-endgame' at a time, e.g. KQP(a7)KQ then KQP(a6)KQ etc
- probe 'minor endgame' EGTs before the game starts (if that's within your rules)
- use a bitbase of the same endgame (as suggested above) to avoid testing draws and wins as potential losses.
g
h.g.muller
Posts: 223
Joined: Mon Feb 19, 2007 8:24 am
Sign-up code: 0
Location: Amsterdam
Contact:

Post by h.g.muller »

Currently I calculate DTM. The advantage of DTC would be that I only need win/loss information on the TBs it converts to. For capture conversions it doesn't really matter, as my current algorithm automatically calculates all TBs with a subset of the pieces as well. (This is included in the quoted times.)

Currently I don't save anything on disk at all, so there is nothing to gain there. (The TBs are purely produced for playing / probing by the engine that generated them on the fly, that wants them in RAM anyway.)

I am not sure if SMP or assembly coding would help much, as it seems the building process is largely memory-bandwidth bound. I tried for instance to keep track of which entry in a cache line (of 64 entries) needs to be treated, so that it does not have to loop through 64 entries to find it. Even in tablebases where almost all cache lines contain only a single new win in every iteration (such as KBBKN), this only gave a speed-up of 1%. It seems that the time to process the position is completely negligible compared to the time needed to bring the data into cache. (This is not surprising, as a cache miss on my machine takes 286 CPU clocks to process, in which time 750 instructions could execute.)

Doing 9 or 10 men is ambitious, but if 5 or 6 of the men are Pawns, I don't think it is out of reach. I plan to use the fragmentation trick (based on Pawn configuration). At the current speed I should be able to do a fragment with 4 non-Pawns in ~16 sec. That means I can only handle Pawn configurations resulting in ~50 reachable Pawn structures, which corresponds to 2 or 3 Pawns. With only 3 non-Pawns things speed up by a factor 64, and I can thus handle Pawn configurations that have ~3200 reachable Pawn structures. There are many configurations of 5-6 Pawns that fit this requirement.
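
As a sanity check of that scaling argument (the per-fragment times and the implied ~800 sec budget are just the numbers quoted above, not new measurements):

```c
#include <stdio.h>

int main(void)
{
    double budget = 50 * 16.0;       /* ~50 structures x ~16 s each = ~800 s */
    double t4 = 16.0;                /* one P-slice with 4 non-Pawns         */
    double t3 = t4 / 64.0;           /* one piece fewer: slice is 64x smaller */

    printf("4 non-Pawns: ~%.0f reachable Pawn structures affordable\n", budget / t4);
    printf("3 non-Pawns: ~%.0f reachable Pawn structures affordable\n", budget / t3);
    return 0;
}
```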

I hope to be able to save appreciably by just excluding a large fraction of the Pawn configurations from the winning path. E.g. in KRPPPKRPP with one passer and the other 4 Pawns blocked head to head, it seems in general a bad idea for the side that is ahead to sacrifice both his blocked Pawns and let Black advance his two Pawns to the 2nd rank. It is extremely unlikely that such a fragment (e.g. KR(e4)KR(g2)(h2)) would still contain any relevant wins. Declaring the entire fragment lost will probably not affect any position in KR(e4)(g7)(h6)KR(g6)(h5).

So my algorithm will be to first designate unattractive fragments as completely lost, and build the TB for the attractive fragments based on that assumption. If the current position is found to be a win under these assumptions, there is no reason to search any further. If not, you add some fragments that were so far deemed unattractive to the list, to dig up any wins there (e.g. give up h5, and let Black's h-Pawn advance to h4, h3 and h2). Then you build those fragments, which gives rise to seeding of new won positions in the fragments you already had, so you just extend those with the new wins that follow from it by retrograde analysis.
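
In pseudo-C, the control flow I have in mind would look something like the sketch below; all the helper routines are placeholders standing in for the real generator, only the lazy-expansion loop is the point:

```c
#include <stdbool.h>

typedef int Fragment;                            /* one Pawn structure / fragment */

/* Placeholder hooks into the real generator: */
extern bool     root_is_won(void);               /* is the game position won yet? */
extern Fragment next_unattractive_fragment(void);/* returns -1 when exhausted     */
extern void     build_fragment(Fragment f);      /* retrograde analysis of f      */
extern void     propagate_new_wins(Fragment f);  /* extend already-built fragments
                                                    with wins seeded from f       */

void solve_lazily(const Fragment *attractive, int n_attractive)
{
    /* Pass 1: declare every unattractive fragment lost and build only the
     * attractive ones on top of that (pessimistic) assumption. */
    for (int i = 0; i < n_attractive; i++)
        build_fragment(attractive[i]);

    /* Pass 2: only if the root is still not proven won, admit previously
     * excluded fragments one by one and let any wins found there flow back
     * into the fragments that were already built. */
    while (!root_is_won()) {
        Fragment f = next_unattractive_fragment();
        if (f < 0)
            break;                               /* no fragments left: exact result */
        build_fragment(f);
        propagate_new_wins(f);
    }
}
```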
guyhaw
Posts: 489
Joined: Sat Jan 21, 2006 10:43 am
Sign-up code: 10159
Location: Reading, UK
Contact:

Pondering performance ...

Post by guyhaw »

There's a good thought there: that subsequent-endgames' DTC EGTs could be reduced to 2-bit win/draw/loss/broken form before being probed to produce the new endgame. That's a new one for the idea-set, at least with me.
DTC EGTs also only require maxDTC cycles to create them, whereas DTM EGTs require maxDTM cycles: inherited wins in 'm' are held over until cycle 'm' if the most efficient use is made of RAM.
Some close analysis of where disc-bandwidth and CPU-time are used would be useful. Having just produced some 6-man DTC EGTs, and seen no perceptible speed-up per cycle as the number of positions at that depth dried up, I tend to agree that the compression/decompression task is the CPU-eater.
Therefore, minimising cache-misses, and having more processors to do the decompression would seem to be good.
There's also the question of how the board is represented: there are some fancy schemes for board-representation which make move (and presumably 'unmove') generation quicker: bitboards?
'Fixed Pawn slices', or P-slices as I call them, would be useful: then one only needs to generate part of a 6-man or 7-man P-ful endgame's EGT. One might substitute for some endgames by chessic assumptions but this is fraught with danger. An approach to a 6-man EGT, done years ago, was undermined by false assumptions - though not by much.
Cf also Eiko Bleicher's excellent innovation with FREEZER which allows this sort of approach and would allow you to compute 7-man EGTs with blocked Pawns, if one had the relevant 6-man EGTs.
g
h.g.muller
Posts: 223
Joined: Mon Feb 19, 2007 8:24 am
Sign-up code: 0
Location: Amsterdam
Contact:

Post by h.g.muller »

My generator experiences a large speedup as the number of new won states dries up, as it can then skip all cache lines that do not contain any won states. Before I did that, a single cycle through a 5-men symmetrized TB (128MB) took at least 1.2 sec, even if there was almost nothing in it to process. Now it is negligible.
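
The mechanism is simply one summary bit per 64-byte cache line, set whenever that line received a new win in the previous cycle. A rough reconstruction (not my actual code; __builtin_ctzll is the GCC/Clang bit-scan builtin):

```c
#include <stdint.h>

/* One summary bit per 64-entry (64-byte) cache line of the table; a line
 * is visited in the next cycle only if it received a new win in this one. */

#define TB_ENTRIES  (128u * 1024 * 1024)           /* 1-byte entries: 128 MB   */
#define N_LINES     (TB_ENTRIES / 64)

static uint64_t line_dirty[N_LINES / 64];          /* 1 bit per cache line     */

static void mark_new_win(uint32_t index)           /* call when an entry       */
{                                                  /* becomes a new win        */
    uint32_t line = index >> 6;
    line_dirty[line >> 6] |= 1ull << (line & 63);
}

static void scan_cycle(void (*treat)(uint32_t index))
{
    for (uint32_t w = 0; w < N_LINES / 64; w++) {
        uint64_t bits = line_dirty[w];
        line_dirty[w] = 0;                         /* consumed by this cycle   */
        while (bits) {                             /* visit only dirty lines   */
            uint32_t line = (w << 6) + (uint32_t)__builtin_ctzll(bits);
            bits &= bits - 1;
            for (uint32_t i = line << 6; i < (line << 6) + 64; i++)
                treat(i);
        }
    }
}
/* (A real generator must also handle wins that get marked for lines the
 * current scan has not reached yet.) */
```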

But that only works in RAM; if between cycles the TB is stored on disk, you must read it in per disk sector, or (if the OS is not very smart) perhaps even per cluster. And the sectors are much larger than the cache lines (512 vs 64 bytes), so the chance that there is one you can skip becomes much smaller.

My estimate for the time it will take to do a single P-slice in an end-game with 4 other pieces is based on the time it takes to build a symmetrized 4-men EGTB, which is ~2 sec. For the P-slices, which are completely without symmetry, the size is 8 times larger. But I hope that this estimate will in practice turn out to be very pessimistic, as not having to worry about the symmetry might save a lot of code (and branch mispredictions...), and a very large part of the P-slices might be immediately identified as non-won, by seeding them from the P-slices they convert to.

The way I generate the TB now is by first doing one pass to generate and count all legal moves from each position. The counts go into the TB, so that in later passes, for any position you identify as won-in-N, you can decrease the count of the grandmothers, and immediately know from the remaining count if they are won-in-(N+1). This 'escape counting' consumes about 40% of the time. In principle, you know that the position is not won yet after the first escape you find, but you still have to count the total, because you never know if that escape will be plugged in a later cycle. But an escape to another P-slice, which basically you simply copy from the already finished TB for that P-slice, is guaranteed to be permanent. So after the first one you find, you don't have to bother with that position at all anymore. This could be a big time saver.
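
In code, the counting scheme boils down to something like the sketch below (heavily simplified: the seeding of mates and capture conversions, the handling of stalemates and the exact depth conventions are left out, and all move-generation routines are placeholders):

```c
#include <stdint.h>
#include <stdbool.h>

#define UNKNOWN 0xFF                 /* depths are assumed to fit in 7 bits   */

extern uint32_t NPOS;                /* positions in this (slice of a) TB     */
extern uint8_t *esc;                 /* btm: remaining escape count           */
extern uint8_t *btm_dtx;             /* btm: lost-in-n for Black, or UNKNOWN  */
extern uint8_t *wtm_dtx;             /* wtm: won-in-n for White, or UNKNOWN   */

extern int count_black_moves(uint32_t pos);
extern int retro_black(uint32_t wtm_pos, uint32_t *out);  /* btm predecessors */
extern int retro_white(uint32_t btm_pos, uint32_t *out);  /* wtm predecessors */

void retrograde(void)
{
    uint32_t pred1[64], pred2[64];

    for (uint32_t p = 0; p < NPOS; p++)          /* pass 0: store escape counts */
        esc[p] = (uint8_t)count_black_moves(p);
    /* (seeding of mates, and of conversions into other TBs, omitted here)     */

    for (int n = 1; ; n++) {
        bool progress = false;
        for (uint32_t p = 0; p < NPOS; p++) {
            if (wtm_dtx[p] != n - 1) continue;            /* newly won only    */
            int k1 = retro_black(p, pred1);
            for (int i = 0; i < k1; i++) {
                uint32_t q = pred1[i];
                if (btm_dtx[q] != UNKNOWN) continue;      /* already decided   */
                if (--esc[q] != 0) continue;              /* escapes remain    */
                /* (mates/stalemates are assumed marked in the omitted seeding,
                 * so esc[q] was still > 0 here)                               */
                btm_dtx[q] = (uint8_t)n;                  /* all escapes plugged */
                progress = true;
                int k2 = retro_white(q, pred2);           /* its wtm parents   */
                for (int j = 0; j < k2; j++)              /* become wins too   */
                    if (wtm_dtx[pred2[j]] == UNKNOWN)
                        wtm_dtx[pred2[j]] = (uint8_t)n;
            }
        }
        if (!progress) break;
    }
}
```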
guyhaw
Posts: 489
Joined: Sat Jan 21, 2006 10:43 am
Sign-up code: 10159
Location: Reading, UK
Contact:

EGT Generation performance

Post by guyhaw »

I should have said that one idea is to generate the EGT in a format that gives quick generation, and then convert it to a probably different format that gives efficient storage and access.
Thus, one might not use the full sophistication of Nalimov indexing, with all the clever ideas, but convert to Nalimov-indexing later. Of course, an EGT-verification has to be done after that, which could take a long time.

Since you mention that you count the number of moves from a position, I should refer you to a Ken Thompson paper where he said that this idea did not work for him when backing up known wins to potential losses. This paper, '6-Piece Endgames', is in the ICCA Journal, v19.4, pp. 215-226.
The RAM-efficient schemes now work with a set of bitmaps, with one (?) bit per position in each, rather than with the EGT itself which will have one byte (and maybe two) per position.
Certainly, if one is working a P-slice at a time, then 'polling subsequent endgames' EGTs' will give information like 'at least a draw' which will be helpful.

g
h.g.muller
Posts: 223
Joined: Mon Feb 19, 2007 8:24 am
Sign-up code: 0
Location: Amsterdam
Contact:

Post by h.g.muller »

OK, thanks for the reference, I will have a look at what Ken Thompson has to say on this. So far it is the fastest way I have found for end-games that are predominantly won. Only in end-games that are virtually without wins is it a big waste of time to count moves in positions that you will never visit again. Initially I labeled grandmothers as potentially lost (in N+1), and in a second pass through the TB I generated forward moves in these positions until I found the first escape to a non-lost position. Only the ones without any escapes remained labeled as lost. The problem in a generally won end-game is that you eventually plug all escapes, and the number of moves you have to try before you find the first one grows during the process, so that near the end you go through the early moves again and again, repeating each move on average 4-5 times.

I also tried just leaving an indication in the potential wins of which escapes had already been plugged, so that it would not redo them, but start immediately with the first one that is still open. But for some reason this was even slower. The advantage of counting once and for all is that you don't have to label potential wins for later processing, but you immediately know if they are truly lost in N+1 or not. This leaves far fewer positions to visit in the next pass through the TB.

The first idea is what I actually do: for speed I made the indexing completely trivial. Basically I just concatenate the bits of the square numbers. The advantage is that I can do move generation directly from the index, by adding the stride corresponding to the step for a particular piece on the board to the index (repeatedly so for sliders). Only symmetry clutters it up a little, but I ordered the pieces in such a way that the symmetry class can only change on white moves (of which you have to do fewer if you look for grandmothers of positions with black to move, and which you don't need at all in the counting of the black escapes). Positions that are forbidden by the symmetry rules (e.g. second white piece above the diagonal if the white King is on it) remain simply unused.
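
In code, the scheme boils down to something like this (a stripped-down illustration, not the actual generator code; the piece order shown is just an example, with the black pieces in the low-order bits):

```c
#include <stdint.h>

/* 'Trivial' index: the 6-bit square numbers of the pieces concatenated,
 * 30 bits for a 5-man TB.  Squares forbidden by symmetry are left unused. */

enum { BK, BQ, WQ, WR, WK };          /* index fields, low to high           */

static inline uint32_t make_index(const int sq[5])
{
    return  (uint32_t)sq[BK]
         | ((uint32_t)sq[BQ] <<  6)
         | ((uint32_t)sq[WQ] << 12)
         | ((uint32_t)sq[WR] << 18)
         | ((uint32_t)sq[WK] << 24);
}

/* A board step d (e.g. +1, -8, +9) of piece p is an addition to the index;
 * for sliders the addition is simply repeated until the piece runs off the
 * board or into another piece (those tests are not shown here). */
static inline uint32_t step_index(uint32_t index, int piece, int d)
{
    return (uint32_t)((int64_t)index + (int64_t)d * (1 << (6 * piece)));
}
```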

The storage and access part I don't get to, as storage is not my aim. (Dan Corbit seems to be eager to do this, though.) If I were forced to save P-slices to disk, because they don't all fit into memory, I guess I would do it as a bit-base (two bits per position, one for won with white to move, one for lost with black to move). Symmetry doesn't play a role in P-slices anyway.
h.g.muller
Posts: 223
Joined: Mon Feb 19, 2007 8:24 am
Sign-up code: 0
Location: Amsterdam
Contact:

Re: EGT Generation performance

Post by h.g.muller »

guyhaw wrote: This paper, '6-Piece Endgames', is in the ICCA Journal, v19.4, pp. 215-226.
Do you happen to have a URL for this paper? I could not find it on the web.

I did find the paper by Wu and Beal, which discusses the bit-map strategy. I guess this is very similar to what I do, except that I hide the bit map as one bit of the DTC field (which only needs 7 bits, as DTCs larger than 128 do not seem very useful). Of course this is only attractive if the main databases reside in RAM as well. In Beal's case they were assumed to reside on disk, so he needed to separate off the bit maps. But once you are going to use disk, you are dead anyway.
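
For what it is worth, the packing amounts to one byte per position, along these lines (the names are just for illustration):

```c
#include <stdint.h>

/* One byte per position: 7 bits of DTC plus one spare bit that doubles as
 * the 'freshly won, still to be backed up' flag of the bit-map scheme. */

#define DTC_MASK   0x7F
#define FRESH_BIT  0x80

static inline uint8_t  pack(unsigned dtc, int fresh)
{
    return (uint8_t)((dtc & DTC_MASK) | (fresh ? FRESH_BIT : 0));
}
static inline unsigned dtc_of(uint8_t e)    { return e & DTC_MASK; }
static inline int      is_fresh(uint8_t e)  { return (e & FRESH_BIT) != 0; }
```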
guyhaw
Posts: 489
Joined: Sat Jan 21, 2006 10:43 am
Sign-up code: 10159
Location: Reading, UK
Contact:

Ken Thompson's paper ...

Post by guyhaw »

I don't think it's on the web: http://www.cs.unimaas.nl/icga/journal/backissu.php is the source for buying back copies.
The Wu/Beal papers were the first to show that DTM, as well as DTC, EGTs can be done with a bit-based technique - by deferring the promulgation of DTM wins. One only refers to the EGT itself in cycle 'd' to note those positions that have just been given depth DTx = 'd'. This has to be better than lugging around 1-byte and 2-byte entries per position.
I believe that the fact that Yakov Konoval uses Pentium Assembler rather than a high-level language does make a contribution to his program's speed.
I fell to thinking how the number of disc accesses could be minimised when backing up to stm-losses: maybe someone does this already. One could:
1) 'retro' to the potentially-lost positions from the 'current frontier' of won positions
2) subset this list, removing positions that are known to be draws (if there is any prior knowledge): none will be in the 'lost already' category
3) generate successor-positions for these positions - and order these before consulting some bit-map, noting those that are 'escapes' to draws or wins.
4) thereby know which of the positions retro'd-to are actually losses - avoiding random-access to the bit-map of known results.
g
h.g.muller
Posts: 223
Joined: Mon Feb 19, 2007 8:24 am
Sign-up code: 0
Location: Amsterdam
Contact:

Post by h.g.muller »

The 'random' accesses are actually not very random if you design the indexing scheme properly. I use the black piece locations to derive the low-order index bits. So during potential-loss verification you sample the TB very locally (4KB blocks for 2 black pieces, 256KB for three, nicely fitting in the L1 or L2 cache, respectively), and you can verify all positions within such a 4KB or 256KB block without leaving the block. This means that the block has to be cached only once, with no use of any memory outside the block while it is being used. This scheme would benefit caching of disk in memory just as well as caching of DRAM in the CPU caches.
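
With the index layout from before (black squares in the low-order bits), the locality argument is just this (sizes assume one byte per entry):

```c
#include <stdint.h>

/* Black piece squares occupy the low-order 6-bit index fields, so every
 * position reachable from 'index' by black moves alone lies in the same
 * contiguous block of the table. */

#define BLACK_BITS  12                         /* two black pieces             */
#define BLOCK_SIZE  (1u << BLACK_BITS)         /* 4096 entries = 4KB at 1 byte;
                                                  three black pieces: 18 bits
                                                  = 256KB                       */

static inline uint32_t block_base(uint32_t index)
{
    return index & ~(uint32_t)(BLOCK_SIZE - 1);  /* white placement picks the block */
}

/* Potential-loss verification generates only black moves, so all probes for
 * a position stay between block_base(index) and block_base(index)+BLOCK_SIZE-1,
 * and each block needs to be brought into cache (or read from disk) only once. */
```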

The nasty part is the white moves: they correspond to large strides in the TB. Currently I do this the 'naive' way: scanning the TB sequentially for potential wins, and for every one I find, visiting all the target blocks reachable by displacing one white piece. With 3 white pieces, you would access ~30 far-away target locations in the TB. In general, they won't last long enough in the cache for a second access from another position.

(Unless there are only two black pieces, then all moves of the white piece supplying the lowest-order index bits fall into a 256KB address range, that could be simultaneously cached, and all moves with this piece would give cache hits. But that would still leave ~20 moves with the other white pieces. A disk cache can easily be 16MB, but the only reason to use disk is if you have more pieces, so there will always be some 20-40 moves that lead to non-cached data.)

If from every position 22 white moves (say 8 for the K and 14 for the R) lead to non-cached data, that means that all data on average has to be brought into the cache 23 times (be it from DRAM or from disk). Now this is very suboptimal. :idea: It would require far fewer accesses if one treated groups of potentially-lost positions that share many mothers simultaneously. :idea: If you would for instance treat all positions with the Rook in a 2x2 sub-square like {a1,a2,b1,b2} simultaneously, you would reduce the number of scan steps by a factor 4 (as you do 4 at a time), and the work per scan step would increase from 8+14=22 mother positions to 8+28=36 (the a+b files and 1st+2nd ranks have only 28 squares). And (36+4)/4 = 10, so that the number of times each data set has to be brought into the cache reduces from 23 to 10.

The limit to this technique is the number of blocks that you can bring into memory simultaneously. Doing one block of potential losses and its 22 mothers requires the simultaneous presence of only 2 blocks, as the mothers can be treated one at a time, and can be discarded as soon as they are processed. With the 2x2 Rook set of potential losses, you need to cache 5 blocks, as you have to keep the four blocks of potential losses. But you could reduce them to a bit set first, saving a factor 8.

So I guess you could easily push the number of mothers even more, e.g. use a 4x4 quarter board of Rook locations for treating 16 potential-loss blocks simultaneously. From this 4x4 quadrant a Rook can reach (retrogradely) 48 squares, plus 8 King moves, which makes 56 mother positions. Together with the 16 potential-loss blocks that is 72, but you would save a factor 16 on the number of scan steps you need, and 72/16 = 4.5. Compared to the original 23, that is a saving of a factor 5 in the number of disk accesses. And the memory (or cache) requirement only goes up by 50%, from having 2 blocks (potential losses + buffer for treating the mothers one by one) to 3 (16 blocks of potential losses, reduced by a factor 8 by packing them as a bit map, makes 2, plus the buffer in which you sequentially treat the mothers against each of the 16 potential losses). Admittedly, if I had used the compression of potential losses from the beginning, I would have needed only 1+1/8 blocks, so the memory/cache requirement goes up by a factor 3/(9/8) = 8/3 = 2.66. But that is still not bad.
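
The little calculation behind those numbers, written out for checking (8 retro King moves plus the retro Rook moves from the tile, as above):

```c
#include <stdio.h>

/* fetches per potential-loss block = (mother blocks + loss blocks) / loss blocks */
static double fetches_per_block(int tile_blocks, int mother_blocks)
{
    return (double)(mother_blocks + tile_blocks) / tile_blocks;
}

int main(void)
{
    printf("single square: %.1f\n", fetches_per_block(1,  8 + 14));  /* 23.0 */
    printf("2x2 Rook tile: %.1f\n", fetches_per_block(4,  8 + 28));  /* 10.0 */
    printf("4x4 Rook tile: %.1f\n", fetches_per_block(16, 8 + 48));  /* 4.5  */
    return 0;
}
```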

Currently I don't use such tricks yet, but I guess I could still speed up my code by a sizeable factor this way!
guyhaw
Posts: 489
Joined: Sat Jan 21, 2006 10:43 am
Sign-up code: 10159
Location: Reading, UK
Contact:

To hgm ...

Post by guyhaw »

I have permission to scan Ken Thompson's 1996 paper and make it available here. I will do so shortly.
I think we are using two different 'languages' to talk about algorithms, which makes it difficult for me to understand your last post.
As we are in a binary dialogue, you can email me via http://www.tinyurl.com/law6k if you wish and we'll see if we can make progress.
I could have added earlier that I think Yakov Konoval's new generation-time index-scheme contributes to his speed of generation. I wonder if he is marshalling positions to minimise cache-misses as well.
g
h.g.muller
Posts: 223
Joined: Mon Feb 19, 2007 8:24 am
Sign-up code: 0
Location: Amsterdam
Contact:

Post by h.g.muller »

OK. I will try the e-mail. My guess why escape counting was not working for Ken, while it seems to work for me, is that he was building the TB on disk rather than in memory. Redoing a calculation 4 or 5 times, even a complicated one like a one-ply search, is always much faster than a disk access, but will be much slower than a cache-line fill.