EGTB Compression Goals

Endgame analysis using tablebases, EGTB generation, exchange, sharing, discussions, etc.
h.g.muller
Posts: 223
Joined: Mon Feb 19, 2007 8:24 am
Location: Amsterdam

Re: EGTB Compression Goals

Post by h.g.muller »

First then a clear description of leap-frog:

Code:

// pseudo-code for leap-frog solution of KQK
// Note it is not really essential to store the full DTx of BTM positions
// For efficiency, however, it is important to be able to distinguish
// the entries decided in the most-recent cycle from those assigned earlier
//
// The basic structure has the nesting
//
// SCAN staticPieces
//   SCAN dynamicPieces
//     MOVE dynamicPieces

#define WHITE_KING 0
#define WHITE_QUEEN 1
#define BLACK_KING 2
#define NR_OF_MEN 3       // KQK: three men on the board
#define UNDECIDED (-1)    // any value that can never be a real DTx

int lostBTM[64*64*64];
boolean wonWTM[64*64*64];
int pos[NR_OF_MEN];

DoSlice(int N, int whiteDynamicPiece, int whiteStaticPiece)
{
     // perform one cycle through grandparent algorithm within cached slice

  for(pos[whiteDynamicPiece] = 0 to 63) {
    for(pos[BLACK_KING] = 0 to 63) {
      if(lostBTM[currentPosition] == N) {            // btm position with DTx == N
        while(move1 = NextMove(whiteDynamicPiece)) { // retract a white move
          Make(move1);                               // -> wtm parent position
          if(!wonWTM[currentPosition]) {             // newly discovered won wtm position
            wonWTM[currentPosition] = 1;
            while(move2 = NextMove(BLACK_KING)) {    // retract a black move
              Make(move2);                           // -> btm grandparent
              if(lostBTM[currentPosition] == UNDECIDED) {
                while(move3 = NextMove(BLACK_KING)) { // verify all black replies
                  Make(move3);
                  escape = !wonWTM[currentPosition];
                  UnMake(move3);
                  if(escape) goto cutoff;            // black still has an escape
                }
                lostBTM[currentPosition] = N+1;      // no escape: lost in N+1
               cutoff: ;
              }
              UnMake(move2);
            }
          }
          UnMake(move1);
        }
      }
    }
  }
}

DoCycles(int N, boolean leapFrog, int whiteDynamicPiece, int whiteStaticPiece)
{
  for(pos[whiteStaticPiece]=0 to 63) {

    // do the N -> N+1 cycle for the slice that keeps the whiteStaticPiece fixed
    // in the current place. The slice fits in fast storage, 
    // and all moves remain inside of it

    DoSlice(N, whiteDynamicPiece, whiteStaticPiece);

    if(!leapFrog) continue;

    // the slice with the whiteStaticPiece fixed (64*64 entries)
    // is still in fast storage after this. Make use of that by doing
    // the N+1 -> N+2 cycle in this slice as well, now that we are finished
    // determining the lostBTM[] == N+1 positions.

    DoSlice(N+1, whiteDynamicPiece, whiteStaticPiece);
  }
}

build()
{
   int N = 0;

   LabelCheckmatesAndIllegals();

   // now lostBTM[i] is set to 0 for every checkmate, and to UNDECIDED for others
   // wonWTM is set to 1 for every position where black is in check.

   DoCycles(N, FALSE, WHITE_KING, WHITE_QUEEN);

   do {
       DoCycles(N, TRUE, WHITE_QUEEN, WHITE_KING);
       if(FINISHED) break;
       N++;
       DoCycles(N, TRUE, WHITE_KING, WHITE_QUEEN);
       N++;
   } while(!FINISHED);

}
This description does not deal with symmetry yet. If the white King is restricted to 1/8 of the board, the TB scanning loop should only run over the squares where it is allowed, when it scans the King. Make and UnMake should deal with moves that make the white King leave the allowed region, by mirroring the entire currentPosition such that the white King is mapped back into the allowed region. As this requires displacement of the static pieces as well, it means that when the King is a dynamic piece (and thus the slices are 8 times smaller), all 8 symmetry-equivalent positions of the static pieces have to be loaded to 'close' the slice w.r.t. all white King moves.
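To make the mirroring concrete, here is a minimal sketch of how Make/UnMake could normalize a position for the 8-fold symmetry, taking the usual a1-d1-d4 triangle as the allowed white-King region. The helper names (mirror, normalize) are made up for illustration, not taken from any existing generator:

Code:

#include <stdbool.h>

// Map a square (0-63, file = sq & 7, rank = sq >> 3) through optional mirrors
int mirror(int sq, bool flipFile, bool flipRank, bool flipDiag)
{
  int f = sq & 7, r = sq >> 3;
  if(flipFile) f = 7 - f;
  if(flipRank) r = 7 - r;
  if(flipDiag) { int t = f; f = r; r = t; }   // reflect in the a1-h8 diagonal
  return (r << 3) | f;
}

// Apply the same mirrors to every man, so that the white King lands in the
// a1-d1-d4 triangle. This is what Make/UnMake would do whenever a white
// King move leaves the allowed region.
void normalize(int pos[], int nrOfMen, int whiteKing)
{
  int k = pos[whiteKing];
  bool fFile = (k & 7) > 3;          // King on files e-h: mirror left-right
  bool fRank = (k >> 3) > 3;         // King on ranks 5-8: mirror top-bottom
  k = mirror(k, fFile, fRank, false);
  bool fDiag = (k >> 3) > (k & 7);   // rank above file: reflect in the diagonal
  for(int i = 0; i < nrOfMen; i++)
    pos[i] = mirror(pos[i], fFile, fRank, fDiag);
}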
koistinen
Posts: 92
Joined: Fri May 02, 2008 7:59 pm
Location: Stockholm

Re: EGTB Compression Goals

Post by koistinen »

h.g.muller:
Thanks, that was a good explanation of leapfrog.

Here is how I now understand it:
In one cycle you read and write the wtm table twice and go 2 full moves forward, so you only need about 25 cycles to compute DTZ50.
With a simple indexing scheme the main wtm table would contain 10*2^36 positions for KQRBKQR and need 80GB.
Disk I/O needed would be about 8TB and at 100MB/s that would take about 24h including seeks.
(I seem to have misread myself and got into thinking my algorithm would only need 24h when it is more like 72h of disk time.)
To keep track of the wins found at different DTZ you need 100 tables, which will be sparse, as each position is set in at most one table.
Using run-length encoding by octets when storing blocks of such positions needs at most 8+8*99/256 bits per position. These need to be written once and read no more than 3 times, requiring no more than 3.6 TB of I/O for the above example, i.e. about 45% extra I/O.
With 9 bit encoding it would be 9+9*99/512 bits/position, just a little cheaper. Perhaps best would be to use 10 bits as that would cost about the same in the worst case while being better in other cases, or perhaps select the number of bits to use dynamically.
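Working those worst-case figures out (assuming, as above, that a won position is set in one of the ~100 tables and only contributes to zero runs in the other 99):

$$8 + \frac{8\cdot 99}{256} \approx 11.1, \qquad 9 + \frac{9\cdot 99}{512} \approx 10.7, \qquad 10 + \frac{10\cdot 99}{1024} \approx 11.0 \quad \text{bits per position},$$

so the 9-bit code is indeed the cheapest of the three in the worst case, with 10 bits close behind.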

I guess using your leapfrog idea with examining every black position at each step would about double the needed cpu. The grandparent algorithm has an advantage in that its cpu cost is about the same with leapfrog as without. The mirroring indexing scheme might help you increase the block size.
h.g.muller
Posts: 223
Joined: Mon Feb 19, 2007 8:24 am
Location: Amsterdam

Re: EGTB Compression Goals

Post by h.g.muller »

koistinen wrote: I guess using your leapfrog idea with examining every black position at each step would about double the needed cpu. The grandparent algorithm has an advantage in that its cpu cost is about the same with leapfrog as without. The mirroring indexing scheme might help you increase the block size.
I think it should not, as the black moves are only done if the wonWTM bit gets set by the preceding white retro-move. And this can happen only once for every entry during TB build. Any later white retro-move that leads to the same position will be 'blocked' as it finds the wonWTM bit set.

I like the idea of run-length coding for indicating the 'win front'. Typically only 40% of the btm positions are lost even in a generally won endgame, if black has more than a bare King. (The rest are King captures, or captures of a hanging essential white piece.) And the bulk of these 40% have a DTx somewhere in a range of 10-20. So that is at most 2-4% of all positions at any cycle, and during many cycles a lot less. So with 6 bits encoding the distance to the next lost-in-N position it would take at most about 0.25 bit per position in the TB.

Now there would have to be two such data sets, one for the lost-in-N positions we are working from, one for the lost-in-N+1 we are in the process of building. During the creation of the latter, it cannot be in run-length encoded form, as we will find the lost-in-N+1 positions not in sequential order, but by following the moves. So we will create it as a bitmap, as large as the slice we are working on, and compress it to run-length (for each chunk of contiguous data separately) as soon as we are done with the slice, to store it together with the rest of the slice data.

The slice read from the HD will thus reside in memory as 2.5 bits per position, the lostBTM bit, the wonWTM bit, and two run-length encoded sets to single out the lost-in-N and lost-in-N+1 positions of 0.25 bit each. It will be treated in slices of one white and all black dynamic pieces (1+3 or 1+4). If any lost-in-N+1 data for the sub-slice already exists, it will be unpacked to a bitmap, and the new lost-in-N+1 positions uncovered by the white moves now under treatment will be added to it.

At a point where a cycle is completed (and leap-frogging makes us start the next cycle in that same memory-loaded slice) the lost-in-N information is discarded to make room for the lost-in-N+2 information. The lost-in-N+1 information takes over its role, and is written out to HD for merging into a final EGTB after the last cycle is completed. This requires ~6 bit per lost btm position for all cycles together.
koistinen
Posts: 92
Joined: Fri May 02, 2008 7:59 pm
Location: Stockholm

Re: EGTB Compression Goals

Post by koistinen »

h.g.muller:
About my guess that leapfrog needs double the cpu: that refers to my idea of doing a full search at every step. In order to do that quickly, I can't limit the search to just the positions that need to be searched.

Run-length encoding with n bits costs, in the worst case with a proportion p of positions set, p*n + (1-p)*n*2^-n bits per position.
For your example with n=6 and p=4% it becomes: 0.04*6+0.96*6*2^-6 = 0.24+0.09 = 0.33
To understand, think of what happens when all the set bits happen to be next to each other.
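As a quick sanity check on that formula, a throwaway sketch (the function name rle_worst_case_bits is invented here):

Code:

#include <math.h>
#include <stdio.h>

// Worst-case bits per position for n-bit run-length coding with a fraction p
// of positions set: each set position may cost a full n-bit count, and a
// maximal run of clear positions costs n bits per 2^n entries.
double rle_worst_case_bits(double p, int n)
{
  return p * n + (1.0 - p) * n * pow(2.0, -n);
}

int main(void)
{
  // reproduces the n = 6, p = 4% example above: prints 0.33
  printf("%.2f\n", rle_worst_case_bits(0.04, 6));
  return 0;
}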

Why would you need to keep the lostBTM positions in memory? Whenever you find one it is enough to prove that all WTM positions that have a white move leading to it are won.
Alternatively you could use RLE when storing the BTM positions as only the lose front is needed.

How would you handle endgames like KQRKQRB and KQKQRBN?
h.g.muller
Posts: 223
Joined: Mon Feb 19, 2007 8:24 am
Location: Amsterdam

Re: EGTB Compression Goals

Post by h.g.muller »

I guess that what I had in mind was not pure run-length coding. The worst-case is not very interesting, as the EGTB will look nothing like that. I see no reason to assume that the DTx=N positions will not be randomly scattered over the TB. So I was thinking about a one-sided run-length encoding, where only the distances between the win-front positions would be encoded as a binary number, and all positions themselves would be assumed to be isolated. (So that 0 would also have to be a possible number of intervening positions.)

In fact, if the number of bits is not 8 or 16, so that we have to shift and mask anyway in order to unpack the data, I might as well encode the distance between the DTx=N positions in a Huffman code. For a small fraction p of randomly scattered DTx=N positions, the distance x between them would be distributed exponentially, as exp(-d*x). An optimal Huffman encoding for the length of these intervals would use enough bits to encode all distances until the probability has dropped by a factor of 2 (i.e. x = ln(2)/d). For p=4% that would be about 4 bits. The next group of codes would then use 1 bit more for the next 16 distances, etc. So

0 = 10000
1 = 10001
..
15 = 11111
16 = 010000
17 = 010001
...
31 = 011111
32 = 0010000

Another way to look at this is that there are 16 five-bit codes all starting with 1 that indicate intervals of 0-15 positions to be skipped, followed by a single position to be treated. The single-bit code 0 would indicate 16 positions to be skipped, without a following position to be treated.
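A minimal sketch of such a skip code (essentially a Rice-style code with a 4-bit remainder). The toy bit-stream helpers below are invented for illustration; the leading zeros of a gap play the role of the stand-alone 'skip 16' code described above:

Code:

#include <stdint.h>
#include <stddef.h>

// Toy bit streams, msb-first in a byte buffer (illustrative only)
typedef struct { uint8_t *buf; size_t bitpos; } BitWriter;
typedef struct { const uint8_t *buf; size_t bitpos; } BitReader;

void put_bit(BitWriter *w, int bit)
{
  uint8_t mask = 0x80 >> (w->bitpos & 7);
  if(bit) w->buf[w->bitpos >> 3] |=  mask;
  else    w->buf[w->bitpos >> 3] &= ~mask;
  w->bitpos++;
}

int get_bit(BitReader *r)
{
  int bit = (r->buf[r->bitpos >> 3] >> (7 - (r->bitpos & 7))) & 1;
  r->bitpos++;
  return bit;
}

// Encode the gap (number of positions to skip) before the next win-front
// position: each leading 0 means 'skip 16, nothing to treat', the closing 1
// plus 4 bits gives the remaining 0-15 skips followed by one position to
// treat. This reproduces 0 -> 10000, 16 -> 010000, 32 -> 0010000.
void put_gap(BitWriter *w, unsigned gap)
{
  while(gap >= 16) { put_bit(w, 0); gap -= 16; }
  put_bit(w, 1);
  for(int i = 3; i >= 0; i--) put_bit(w, (gap >> i) & 1);
}

// Decode one gap; the caller then treats the position that follows it
unsigned get_gap(BitReader *r)
{
  unsigned gap = 0, rem = 0;
  while(!get_bit(r)) gap += 16;            // each leading 0 skips 16 positions
  for(int i = 0; i < 4; i++) rem = (rem << 1) | (unsigned)get_bit(r);
  return gap + rem;
}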

As for the 3+4 or 2+5 men:

3+4 would not be different from 4+3 as far as disk access is concerned. The 16MB chunks that have to be gathered to bring a slice in memory would be 0+4 slices, instead of 1+3 slices. We would need five different K-slices of 1+4 type in memory. The treatment once it is in RAM would have to be different, though. Doing a grandparent step for the single white piece and all following black moves would heavily overflow the cache, even for the black moves, and would then have to run entirely from DRAM, making it very slow. One possibility would be to again do separate passes for white and black moves. But this would require remembering which wtm positions are part of the win front. And there are usually far more of those than btm positions. And although it would not have to be done for the full 1.5+4 slice in memory, it should at least be done for a 1+4 slice. As a bitmap this would take an extra 1G bit = 128MB. This would probably not kill us.
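For reference, the sizing behind that 128MB figure: a 1+4 slice holds $64^5 = 2^{30}$ positions, so a one-bit-per-position win-front bitmap takes $2^{30}$ bits $= 2^{27}$ bytes $= 128$ MB.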

For 2+5 I don't really have a satisfactory solution yet. There might not be many interesting end-games of this type, though, in FIDE Chess. Even Q vs 4 minors is probably not generally won, and the win-front would die out quite quickly.
h.g.muller
Posts: 223
Joined: Mon Feb 19, 2007 8:24 am
Location: Amsterdam

Re: EGTB Compression Goals

Post by h.g.muller »

Oh, I overlooked this one:
koistinen wrote:Why would you need to keep the lostBTM positions in memory? Whenever you find one it is enough to prove that all WTM positions that have a white move leading to it are won.
Alternatively you could use RLE when storing the BTM positions as only the lose front is needed.
Very good question. All parent positions of a newly declared won wtm position must still be UNDECIDED, as they have a black move to escape to a position that up to now was also UNDECIDED. So it seems the information is indeed totally redundant.

This means the lostBTM bitmap only needs to be around when building the next-generation lose front through random access. And (at least for 4+3) it does not have to be present for the entire set of K-slices that is in memory, just for the 1+3 slice that we are working on. As soon as we step to the next position of a static white piece, we stop getting random accesses for new lost(N+1) positions, and thus can compress the lost(N+1) info in all the 0+3 chunks making up the 1+3 slice to (quasi-)RLE, not touching it until we gather those chunks to prepare another 1+3 slice for another white dynamic piece. Then we decompress it again into a bitmap.

So the memory requirement would be:
- five buffers, each holding a 1G K-slice, consisting of
  - the wonWTM bitmap (1 bit/pos)
  - the RL-encoded lost-in-N info (~0.25 bit)
  - the RL-encoded lost-in-N+1 info (~0.25 bit)
- a bitmap to indicate the win front amongst the wonWTM positions, sized to hold one 'working slice'
- a bitmap to accumulate the new lose front for one working slice.

The buffers take ~1.5 bit per position, times 1G times 5, or nearly 1GB.
When building a 4+3 the working slices are 1+3, or 16M positions, making the memory demand of the two auxiliary bitmaps negligible. For 3+4, the working slice would be 1+4, or 1G positions. This makes the auxiliary bitmaps 128MB each. So total memory demand is only 1.25 GB. That is great, as my machine has 2GB!
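Spelling that budget out (using the ~1.5 bit per position figure above):

$$5 \times 2^{30} \times 1.5\ \text{bits} = 7.5\ \text{Gbit} \approx 0.94\ \text{GB}$$

for the five K-slice buffers, plus (for 3+4) two auxiliary bitmaps of $2^{30}$ bits $= 128$ MB each, which is where the ~1.25 GB total comes from.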

Per cycle we read the wonWTM bitmap plus the lost(N) info, K-slice by K-slice. Each K-slice is 2+3 or 1+4. Starting from the lost(N) info, we update the wonWTM bitmap, and (piece-wise) decompress the lost(N+1) info to update it by adding the new lost positions we find through the white moves that we currently treat, and then compress it again as quasi-RLE.

At some point we have treated all the white moves that stay within the working set of 5 K-slices, and at that time we have completed the lost(N+1) info. We can then discard the lost(N) info, after making sure that it is or remains saved somewhere on HD. Because of leap-frog we immediately continue generating lost(N+2) info, starting from the lost(N+1) info, within the currently loaded working set. It can overwrite the lost(N) info in memory. The lost(N+2) info will not be complete, as the positions reached by moves with the currently static white pieces will still be missing. It will only be added once the first half of the HD pass that makes those other pieces dynamic has completed its N+1 -> N+2 cycle. So both the lost(N+1) and lost(N+2) info remains relevant to that pass, and is written out to disk together with the wonWTM bitmap.

The data is stored on HD as chunks of 16M positions: a 2MB bitmap and two ~512KB sets of lose-front info. These chunks have to be gathered on reading and scattered on writing in every pass. 40K such chunks (because of 8-fold symmetry) make up the entire EGTB under construction, i.e. 100GB in total. In addition, up to 10GB of RLE lost(N) info is written on each cycle to save that particular DTx. This can only happen some 20 times before we run out of positions to assign a DTx to; if the end-game is much lengthier, it also means that there are fewer positions in the lost(N) lose fronts, and their files will be correspondingly smaller.
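Those chunk sizes follow from the bit budget above: $16\text{M} \times 1\ \text{bit} = 2\ \text{MB}$ for the wonWTM bitmap, and $16\text{M} \times {\sim}0.25\ \text{bit} \approx 0.5\ \text{MB}$ for each of the two lose-front sets, so roughly 3 MB per chunk.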

So it seems we can do with 300GB of HD space. 100GB of this is read and written once for each cycle, half of the time non-sequentially. The other 200GB is written once during build, and read once during the merging of all the passes into a single EGTB.
koistinen
Posts: 92
Joined: Fri May 02, 2008 7:59 pm
Location: Stockholm

Re: EGTB Compression Goals

Post by koistinen »

It has been a while and I have had good time to think, thank you.
Here are my views on how to use compression when computing 7-man endgame tablebases:

A normal computer now has about 2-4 GB ram and 500-1000GB hard disk and can execute about 3-10 Gips (64-bit operations).

For 4-3, 5-2 and 6-1 leapfrog works very well:
Divide white moves into two parts: A with king moves and moves of one piece, B with the rest of white's moves.
Put king index (0-9) and the invariant (under mirroring) part of the other piece in the highest part of the index.
Put the invariant part of two more white pieces in the next highest part of the index.
Compute wins for DTZ=0+A and store that compressed with RLE.
Compute wins for DTZ=0+B, DTZ=0+A+B=1, DTZ=1+B.
Compute wins for DTZ=1+A, DTZ=1+B+A=2, DTZ=2+A, etc.
Because there are about 100 DTZ values to compute and they are all mutually exclusive, the maximum density is 1/100 and the likely density is perhaps 1/160.
This means that by using 10-bit RLE (run-length encoding), at roughly 10/160 ≈ 1/16 bit per position, each DTZ (bitbase) can be stored in about 64GB/16 = 4GB on average.
Because that is much smaller than the full accumulated bitbase (it is about 64GB), a saving can be made by not storing the accumulated bitbase after each step.
Having stored the accumulated bitbase after N+A the next few steps could be computed like this:
(Each step has to load the stored accumulated bitbase.)
Load N+A, compute and store N+A+B, N+1+B.
Load N+A+B, N+1+B, compute and store N+1+B+A, N+2+A.
Load N+A+B, N+1+B, N+1+B+A, N+2+A, compute and store N+2+A+B, N+3+B.
Load N+A+B, N+1+B, N+1+B+A, N+2+A, N+2+A+B, N+3+B, compute and store N+3+B+A, N+4+A, accumulated N+4+A.
This way the steps cost x+3y, x+4y, x+6y, 2x+8y, giving a total of 5x+21y instead of the 8x+12y that simple leapfrog would cost.
As x is the cost of reading or writing 64GB and y is the cost of reading or writing about 4GB the benefit should be clear.
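As a rough illustration, plugging in $y \approx x/16$ (64 GB versus about 4 GB per RLE-compressed DTZ set):

$$5x + 21y \approx 6.3x \qquad \text{versus} \qquad 8x + 12y \approx 8.8x,$$

i.e. roughly a quarter less I/O over those four steps.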

For 2-5 and 3-4 endgames leapfrog is not quite as good, but as there are more black pieces another possibility enters:
Perhaps it is much more uncommon for positions with black to move to be won by white.
In those cases (which could be decided for each DTZ separately), compute and store all the btm positions where white wins with the current DTZ in RAM, using RLE to make them fit.
Then when doing the next step, use those (in ram) values to compute the next value of the DTZ bitbase for wtm.
Cost of storing the accumulated DTZ bitbase could be reduced in a way similar as for leapfrog.
When the relevant btm positions do not fit in RAM, store them and integrate them in a separate step. (This should about double the cost for the steps where this is necessary.)

Another idea is to avoid storing illegal positions in the accumulated bitbase. Might be good for a 20-50% saving (in the x value above)? It is not good for the RLE-compressed tables, because of the small benefit compared to the cost.

As a 6-man should take about 64 times less time using our methods, and so should be doable in about 10-30 minutes on current computers, maybe that would be a good thing to start with? Improving theory is fun, but to be useful in practice, working code is needed.
gambit3
Posts: 57
Joined: Mon Mar 06, 2006 8:06 am

Re: EGTB Compression Goals

Post by gambit3 »

hate to disappoint, but the 'normal' computer still runs a 32-bit operating system. vista is still not purely 64-bit, though it is a choice to be made when upgrading from xp (32-bit versions only, since vista will not upgrade a 64-bit xp, and ESPECIALLY doesn't like x64 corporate): whether you want to keep all the proggies you had for xp, thus installing 32-bit vista, or start afresh, thus installing 64-bit. most people that upgrade will go with what they know, since they have their machine set up the way they want it, and install the 32-bit.
more correctly, most NEW machines now have 2+GB RAM ...

if i may ask a (semi-) naive question: how much time cpu-wise were you planning on devoting to the codec? i ask because currently most of the computation is (or should be) done in the 512k L1 cache of the cpu, while loading to and from RAM and disc as necessary. given that L1 RAM is considerably faster than main memory, even using the cpu-light RLE for a codec will cause considerable delay with L1<->main memory transfers, or have i missed something lately in computer technology that changes that?
koistinen
Posts: 92
Joined: Fri May 02, 2008 7:59 pm
Location: Stockholm

Re: EGTB Compression Goals

Post by koistinen »

A 64-bit computer running a 32-bit operating system is a software problem that has already been solved. Just replace with a 64-bit OS. I hear there is some free software for that. :)
koistinen
Posts: 92
Joined: Fri May 02, 2008 7:59 pm
Sign-up code: 0
Location: Stockholm

Re: EGTB Compression Goals

Post by koistinen »

gambit3 wrote:...
if i may ask a (semi-) naive question: how much time cpu-wise were you planning on devoting to the codec? i ask because currently most of the computation is (or should be) done in the 512k L1 cache of the cpu, while loading to and from RAM and disc as necessary. given that L1 RAM is considerably faster than main memory, even using the cpu-light RLE for a codec will cause considerable delay with L1<->main memory transfers, or have i missed something lately in computer technology that changes that?
Compression should reasonably be for things on disk. If bandwidth between main memory and on-chip cache becomes a limiting factor (with multicore CPUs), it might be good to use it there as well, perhaps using one or two cores for that.
Initially, I would not bother with it, as main improvements can be done without it.