EGTB Compression Goals

Endgame analysis using tablebases, EGTB generation, exchange, sharing, discussions, etc.
koistinen
Posts: 92
Joined: Fri May 02, 2008 7:59 pm
Sign-up code: 0
Location: Stockholm

EGTB Compression Goals

Post by koistinen »

What goals do you think would be worthwhile regarding compression of EGTB:s? (besides reducing (local) storage requirements)
jkominek
Posts: 150
Joined: Mon Dec 04, 2006 9:02 am
Sign-up code: 0
Location: Pittsburgh, PA

Re: EGTB Compression Goals

Post by jkominek »

koistinen wrote:What goals do you think would be worthwhile regarding compression of EGTB:s? (besides reducing (local) storage requirements)
  • Mandatory: direct indexing (block indexing), devoid of patent infringements
  • Desirable: variable block size, better encoding speed, faster initialization when loading a tablebase set, lower memory requirements for decoding data structures
  • Nice to have: single-pass compression, API that accepts a memory array (as an alternative to a file), API that acts as an incremental Writer with low buffering overhead.
john
h.g.muller
Posts: 223
Joined: Mon Feb 19, 2007 8:24 am
Sign-up code: 0
Location: Amsterdam
Contact:

Re: EGTB Compression Goals

Post by h.g.muller »

I am interested (but, alas, not very knowledgeable) in the subject of compression, for the purpose of limiting the storage requirement during EGTB generation.

The algorithm I plan to use processes the table base each cycle slice by slice, each slice being defined by keeping certain (white) pieces fixed ('static pieces'), and having all other pieces (the 'dynamic pieces') in every possible position. The slices (the data of which in the full tablebase might be scattered over the slow storage medium) are loaded one by one into the faster storage medium for processing (a 'gather' operation). This processing involves treatment of all positions in the slice that needed treatment (i.e. those that are part of the 'win front'), by generating and accessing all their predecessors within that same slice (i.e. positions that can be reached by white retrograde moves with the dynamic pieces).

When the slice is in the fast storage, it has to be present in uncompressed form, for efficient random access, as the retrograde moving will cause a random access pattern to the predecessors even if we sequentially scan the slice for win-front positions. But the gather operation and the later writing back to the slow storage through the inverse 'scatter' operation might benefit from compression. This might both reduce storage requirements in the slow medium and reduce the required transfer time of the scatter and gather operations.

Now I am mainly interested in building EGTBs in RAM, which then would be the slow storage, the CPU caches being the fast storage. So if any compression scheme were to be used, it will have to compete with DRAM access time, and thus be very fast. Simple Huffman coding might fit the description, requiring only a few instructions per entry. The compression factor might not be very high, though (3 perhaps).

The fastest version of my compressionless algorithm would store move counts in the DTx field of the tablebase for all UNDECIDED positions, so that I don't have to generate forward black moves and probe the positions they lead to when such a position is encountered as a POTENTIALLY_LOST one. I just decrement the move count, and only when it hits zero do I know that the last move to an UNDECIDED successor has disappeared (because that successor changed state), and that the position is now LOST and can be assigned a DTx (overwriting the move count).
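A minimal sketch of what that bookkeeping could look like in C, with the state flag and the shared counter/DTx byte shown as two separate illustrative fields (how they are actually packed is left open):

/* For an UNDECIDED btm position, dtx holds the number of moves to
   UNDECIDED successors. Called when the position is reached by a
   retrograde move from a newly WON wtm position. */
enum { UNDECIDED, LOST };

static void treat_potentially_lost(unsigned char *state, unsigned char *dtx,
                                   unsigned n /* DTx of the current cycle */)
{
    if (*state != UNDECIDED)
        return;                    /* already decided in an earlier cycle  */
    if (--*dtx == 0) {             /* last move to an UNDECIDED successor  */
        *state = LOST;             /* ...has just disappeared              */
        *dtx = (unsigned char)n;   /* overwrite the counter with the DTx   */
    }
}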

The problem is that this strategy subverts the efficiency of Huffman compression: the UNDECIDED positions, which are the most abundant and otherwise would all share the same code, now need different codes spread out over the range 0-40, typically. So I guess I would have to give up the move-count idea if I want to compress. (In the early cycles of lengthy end-games, such as KBBKN or KBNK, move counting and compression might still be compatible, as most positions would have similar numbers of legal moves, and the early cycles only dig up very few decided positions, so that only very few of these moves would have been taken away.)

I am not sure how much the move counting will save, time-wise. Generating forward moves and probing is definitely slower, but during a large part of the generation, the first or second move we try will hit an UNDECIDED position, after which we can stop and leave the position UNDECIDED. Only at the end of a generally won end-game do the UNDECIDED positions start to disappear, and we would have to try on average a very large fraction of the moves before we find one that reaches the remaining UNDECIDED position. But perhaps a large fraction of the work can already be saved by including a one- or two-bit hint where to start searching the forward moves (e.g. which of the black pieces made the move that solved black's problems the last time we encountered this position).

I am not sure compression can be competitive speed-wise at all. I measured the speed of sequential DRAM access on my 2.4GHz Core 2 Duo, and dirty accesses (changing the probed data) amounted to only 0.9 ns per byte (57 ns for a 64-byte cache-line fill). That seems hard to outdo. But there could still be an incentive to compress for space reasons. I guess one should go for maximum compression in that case, no matter how large the time sacrifice. A pawnless 6-men end-game would have 8G positions (using 8-fold symmetry, for one side to move), and not many PCs would have that amount of RAM (plus you need something for the OS and code as well). So even a factor-2 compression might be the difference between being able to do it in RAM, or not.

Perhaps a plain nybble encoding would be a useful solution: store two positions per byte, one bit for the wtm state (UNDECIDED or WON), three bits for the btm DTx field. This would allow 8 btm states, which could be taken as 4 different DTx (so that we are able to distinguish the current win front from the future one we are generating, and earlier ones) and 4 different UNDECIDED positions (the different codes representing different move hints for which moves have already been determined to not lead to an UNDECIDED successor, and thus will not have to be retried). This at least should be useful when generating bitbases, and would reduce the requirement for a pawnless 6-men to 4GB.
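For concreteness, a minimal sketch of such a nybble packing (names and bit layout are illustrative, not a fixed format):

#include <stdint.h>
#include <stddef.h>

#define WTM_WON_BIT 0x8   /* top bit of the nybble: wtm WON vs UNDECIDED */
#define BTM_MASK    0x7   /* low 3 bits: one of the 8 btm states         */

static unsigned get_entry(const uint8_t *tb, size_t pos) {
    uint8_t b = tb[pos >> 1];                  /* two positions per byte */
    return (pos & 1) ? (unsigned)(b >> 4) : (unsigned)(b & 0x0F);
}

static void set_entry(uint8_t *tb, size_t pos, unsigned v) {
    uint8_t *b = &tb[pos >> 1];
    if (pos & 1) *b = (uint8_t)((*b & 0x0F) | ((v & 0x0F) << 4));
    else         *b = (uint8_t)((*b & 0xF0) | (v & 0x0F));
}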
koistinen
Posts: 92
Joined: Fri May 02, 2008 7:59 pm
Sign-up code: 0
Location: Stockholm

Re: EGTB Compression Goals

Post by koistinen »

jkominek wrote:
koistinen wrote:What goals do you think would be worthwhile regarding compression of EGTB:s? (besides reducing (local) storage requirements)
  • Mandatory: direct indexing (block indexing), devoid of patent infringements
  • Desirable: variable block size, better encoding speed, faster initialization when loading a tablebase set, lower memory requirements for decoding data structures
  • Nice to have: single-pass compression, API that accepts a memory array (as an alternative to a file), API that acts as an incremental Writer with low buffering overhead.
john
  • Direct indexing: Is this for the ability to get the value of a position reasonably quickly from the compressed EGTB?
  • Devoid of patent infringements: Impossible to be sure of. To me, software patents are just another form of corruption. Maybe putting the software under GPL 3 would be helpful? Perhaps with parts under a more permissive license?
  • Other: I think it would help to think about each usage of EGTB:s separately (like finding the value of a position in a chess-playing program's search tree: winning/not winning, DTZ, or a best move).
koistinen
Posts: 92
Joined: Fri May 02, 2008 7:59 pm
Sign-up code: 0
Location: Stockholm

Re: EGTB Compression Goals

Post by koistinen »

I have deleted much from the quoted text.
h.g.muller wrote:I am interested (but, alas, not very knowledgeble) in the subject of compression, for the purpose of limiting the storage requirement during EGTB generation.

Now I am mainly interested in building EGTBs in RAM, which then would be the slow storage, the CPU caches being the fast storage. So if any compression scheme were to be used, it will have to compete with DRAM access time, and thus be very fast. Simple Hufman coding might fit the description, requiring only a few instructions per entry. The compression factor might not be very high, though (3 perhaps).

The fastest version of my compressionless algorithm would store move counts in the DTx field of the tablebase for all UNDECIDED positions, so that I don't have to generate forward black moves, and probe the positions they lead to, if such a position is encountered as a POTENTIALLY_LOST one. I just decrement the move count, and only if it hits zero I know the last move to an UNDECIDED successor disappeared (because that successor changed state), and that the position now is LOST and can be assigned a DTx (overwriting the move count).

Perhaps a plain nybble encoding would be a useful solution: store two positions per byte, one bit for the wtm state (UNDECIDED or WON), three bits for the btm DTx field. This would allow 8 btm states, which could be taken as 4 different DTx (so that we are able to distinguish the current win front from the future one we are generating, and earlier ones) and 4 different UNDECIDED positions (the different codes representing different move hints for which moves have already been determined to not lead to an UNDECIDED successor, and thus will not have to be retried). This at least should be useful when generating bitbases, and would reduce the requirement for a pawnless 6-men to 4GB.
I think simple bit tables are the way to go:
  • Illegal or white win: this gets updated when examining white moves; subtables are used to initialize it.
  • Legal, no known black draw or win: this is computed once from subtables.
  • Legal and black lose: this gets computed from the above by examining black moves into the first and then masking with the second.
For 6-man the above might need about 3 GB (possibly on disk) plus some working memory and storage for DTZ (on disk). I have longer texts about this at http://teo.se/eg/ where I also write about an indexing scheme to reduce the need for seeks.
Anyway, I don't think compression is much use when generating EGTB:s. Disks are reasonably quick as long as the block size is large. Disks can be used in parallel if you need greater speed.
Compression for distribution is a different matter. This is something I feel can be done much better.
h.g.muller
Posts: 223
Joined: Mon Feb 19, 2007 8:24 am
Sign-up code: 0
Location: Amsterdam
Contact:

Re: EGTB Compression Goals

Post by h.g.muller »

I am very interested in ways to reduce seeks. You were mentioning a 4K x 4K matrix, but to access such a matrix in column order would require 16M seeks (if it was stored sequentially by row), and I am afraid this will wear out my hard drive too quickly. So I am now looking for ways to store a 7-men EGTB as a 64x64x64 matrix, with contiguous 4-men chunks. That would reduce the number of seeks for a non-sequential scan to 256K.

Perhaps I am overly worried about this, because I don't know enough about the hardware. If reading a sequential chunk of 256B would already involve seeks, because it cannot be stored on a single track, there really is nothing to gain at all, and I would just be fooling myself in thinking that reading 16MB chunks would require fewer seeks.
koistinen
Posts: 92
Joined: Fri May 02, 2008 7:59 pm
Sign-up code: 0
Location: Stockholm

EGTB indexing scheme (not compression)

Post by koistinen »

This is not about compression but I answer it here anyway.
h.g.muller wrote:I am very interested in ways to reduce seeks. You were mentioning a 4K x 4K matrix, but to access such a matrix in column order would require 16M seeks (if it was stored sequentially by row), and I am afraid this will wear out my hard drive too quickly. So I am now looking for ways to store a 7-men EGTB as a 64x64x64 matrix, with contiguous 4-men chunks. That would reduce the number of seeks for a non-sequential scan to 256K.

Perhaps I am overly worried about this, because I don't know enough about the hardware. If reading a sequential chunk of 256B would already involve seeks, because it cannot be stored on a single track, there really is nothing to gain at all, and I would just be fooling myself in thinking that reading 16MB chunks would require fewer seeks.
You want to read the table in as large chunks as possible. The number of seeks is inversely proportional to chunk size. Nearly all disks have a seek time of around 10ms. This has been so for many years, while the number of bytes and bytes/s have increased roughly together.

One way to achieve larger chunks is by examining moves of one or a few pieces at a time, much as you have found. Another is by using a different indexing scheme. It is possible to index the squares by 3+3 bits such that mirroring only changes the last 3 bits. This in turn can help your scheme for computing endgames. Put the last part of the square indexes together in the big index and then use the first part to split the table into nice chunks.

For KQRNvKBN you might want to index the black king using a traditional triangle and then use that, plus the first part of the bishop, queen and rook, to divide the table into 80x64 parts, each containing a chunk of 8x8x8x64x64x64 = 134217728 positions. For the worst case (wrong side to move) this would require 80x64 = 5120 seeks for 1 pass (1 retrograde step). Since this seeking would take about 50000ms = 50s, it would be dwarfed by the time spent on sequential reading (80GB (1 bit/position) at 100MB/s is 800s). At any time you would only need to keep about 80GB/64 = 1.25GB in RAM when examining black moves and 80GB/80 = 1GB when examining white moves. (For endings like KQRBNNvK a more complicated and/or expensive scheme would be needed.)

This means a 7-man ending would require about 102x850s = 86700s, approx 24h, of disk time to compute, but as more needs to be read and written I think about 3 or 4 times more would be needed. Using more disks should however make it possible to do it more quickly (speed proportional to the number of disks, up to at least 16 disks). I don't know how much CPU time would be needed, but my guess is that it would be similar.

Example of indexing scheme in octal:
00, 10, 20, 30, 31, 21, 11, 01
12, 02, 40, 50, 51, 41, 03, 13
22, 42, 60, 70, 71, 61, 43, 23
32, 52, 72, 62, 63, 73, 53, 33
36, 56, 76, 66, 67, 77, 57, 37
26, 46, 64, 74, 75, 65, 47, 27
16, 06, 44, 54, 55, 45, 07, 17
04, 14, 24, 34, 35, 25, 15, 05

Note how the first part of each index is constant when a position is mirrored.
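A small self-contained check of that property (assuming the table is laid out rank by rank exactly as printed above; the inner loop runs over all 8 mirrorings):

#include <stdio.h>

static const unsigned char idx[64] = {   /* octal, rank 1 first */
    000,010,020,030,031,021,011,001,
    012,002,040,050,051,041,003,013,
    022,042,060,070,071,061,043,023,
    032,052,072,062,063,073,053,033,
    036,056,076,066,067,077,057,037,
    026,046,064,074,075,065,047,027,
    016,006,044,054,055,045,007,017,
    004,014,024,034,035,025,015,005,
};

int main(void) {
    int ok = 1;
    for (int sq = 0; sq < 64; sq++) {
        int f = sq & 7, r = sq >> 3;
        for (int m = 0; m < 8; m++) {   /* file flip, rank flip, transpose */
            int ff = (m & 1) ? 7 - f : f;
            int rr = (m & 2) ? 7 - r : r;
            int sq2 = (m & 4) ? (ff << 3) | rr : (rr << 3) | ff;
            if ((idx[sq] >> 3) != (idx[sq2] >> 3))
                ok = 0;
        }
    }
    printf(ok ? "high 3 bits invariant under all 8 mirrorings\n"
              : "property violated\n");
    return 0;
}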
Sorry about sounding like a know-it-all. (I know I don't yet know how to go into subtables, that is, like from KRvKB to KRvK, in a good way.)
h.g.muller
Posts: 223
Joined: Mon Feb 19, 2007 8:24 am
Sign-up code: 0
Location: Amsterdam
Contact:

Re: EGTB Compression Goals

Post by h.g.muller »

Sorry I am a little slow, but I am not sure I completely get it. So let me re-iterate:

A 128M chunk in your scheme would contain all constellations of the white King and the white and black Knights (so these are what we called 'dynamic pieces' elsewhere), would fix the black King ('static piece'), but would restrict the Q, R and B to a symmetry-related set of squares. So within the chunk, these pieces are 'semi-dynamic'.

Is the idea that you now would do one pass through the TB for each piece you must move, gathering the 8 chunks where this piece is in different symmetry sets? Eight symmetry sets would cover the entire board, and would thus make the selected piece fully dynamic. And 8x128M = 1G, so 8 chunks would likely fit in memory. For the black King we would have to hold 10 (=all) chunks in memory to make it dynamic, but by the clever subdivision of the board, that would be all. (And by treating them in a clever order, you might be able to even save on that, as the King moves very locally, but even 1.25G positions doesn't sound scary.)

In fact the chunks with the black King on a diagonal square could be half the size, as one other piece could be restricted to below the diagonal, but it is questionable whether this is a worthwhile saving.

But you would need 2 passes for doing the white retrograde moves, as you don't have the RAM to make both Q and R dynamic at the same time, and similarly, 2 passes for doing the black unmoves (and perhaps 3 if you do verification). Or do you really count on having 1 bit per position, so that you can hold 8 times as many chunks, and make 2 pieces dynamic at the same time? That seems a bit optimistic. But it would be very nice if it could be done, as it would allow us to do a cycle in only two passes through the on-disk data, one for the white unmoves, the other for the black unmoves + verify moves.

That would require the working set of 64 or 80 chunks (8G or 10G positions) to be stored in RAM in compressed form, I suppose. We would then make the multiple passes (one for each piece) only through the data in memory (not requiring any seeks or disk I/O), decompressing only those chunks to make one of the pieces dynamic, and then treat all moves of that piece. The decompressed part would contain 1G positions, and we could then do all moves for the dynamic piece within it, before compressing again and going to the next 'thick' slice (the semi-dynamic piece now restricted to another symmetry set) for this piece.

Could we really achieve a compression of 1 bit/pos (for wtm and btm together) in DTC/DTZ tablebases? Or does this only work for bitbases? My estimates for simple schemes came to at least 2 bits per position. But I guess even that would not be lethal: it means a buffer of 2-2.5GB for the compressed slice where all pieces of a certain color are dynamic (at 2 Huffman-encoded bits per position), and a buffer of 1GB (1 byte per position, 2 bits for the wtm positions, 6 bits for the btm DTx). So a 4GB machine could do it (with memory to spare for OS and program), and you would not even need a 64-bit system.
h.g.muller
Posts: 223
Joined: Mon Feb 19, 2007 8:24 am
Sign-up code: 0
Location: Amsterdam
Contact:

Re: EGTB Compression Goals

Post by h.g.muller »

OK, I had to let it sink in a little, but indeed this seems extremely useful. I would use it in a slightly different way, though:

Again, look at KQRNKBN as an example.

Within a 128M chunk, I would make all black pieces fully dynamic, and the white Q, R and N all semi-dynamic. That means that 8 such chunks need to be loaded to make one white piece fully dynamic, and 8x8=64 chunks to make two white pieces fully dynamic. But all symmetry transitions will be completely confined to within a chunk. I would choose the white King to be restricted to the triangle, and within a chunk it would be fully static.

Supposing, as before, that we have enough memory to load 64 chunks, I would not do alternating wtm and btm passes, but I would stick to the "grandparent" algorithm: I would apply this algorithm in 2 passes, each pass doing all unmoves of two of the four white pieces, (say Q+R), followed by all black unmoves, and all black verification moves. In the next pass, I would do the unmoves of the other two white pieces, (plus subsequent black unmoves/moves), which requires me to make one of the semi-dynamic white pieces (N) fully dynamic, plus the white King.

This has the advantage over alternating black and white passes that the passes represent parallel, rather than serial, operations, and can thus be performed in any order within the cycle. This allows leap-frogging, where you would alternate the order of the two passes on subsequent cycles, and then merge the last pass of one with the first pass of the next (as they treat the same pieces as dynamic).

To allow an advance of two moves of the white King (as major symmetry piece), one needs to have at most 8 king positions in memory at any time (so not the full 10): If one scans the TB in the order a1-b1-b2-c1-c2-c3-d1-d2-d3-d4, after treating c3, a1 becomes unreachable in 2 moves for a King, and can be discarded. After treating d3 (at which point there are 8 K-slices in memory, d4 not yet loaded and a1 already discarded), b1, c1 and d1 become unreachable and can be discarded to make room for d4. So the pass that treats the sub-trees starting with white K+N moves can do with exactly the same amount of memory as the other, requiring 64 chunks to be loaded to make a complete (thick) slice.

So without any extra demand for memory, the number of passes could be reduced two-fold. An additional advantage of the grandparent algorithm is that there is no need to remember if a WON wtm position was declared WON in the most recent pass, or earlier, as you immediately process the WON wtm positions as you assign them. So you only need to distinguish wtm UNDECIDED from wtm WON positions, allowing for more compact compression.

Note 1: This would also work with fewer black pieces (5+2 or 6+1). Simply replace one of the black pieces by a white piece. As this white piece is then fully dynamic, its unmoves can be treated in either pass, within the slice already loaded by the scheme above.

Note 2: Although it is nice to have "4.5-men" chunks of 128M positions, it is not completely fatal to have 4-men chunks of 16M positions: with the HD parameters you assume, and 1 bit/pos, a 4-men chunk would measure 2MB, and at 100MB/s require 20 ms of transfer time. Even with 10 ms/seek, the seeking would thus add only 50% to the I/O time. So it would be better to switch to the more flexible smaller (4-men) chunks, rather than using an algorithm that would require more passes per cycle just to accommodate the bigger 4.5-men chunks. E.g. for 3+4 men, using the K-slice method described in another thread.
koistinen
Posts: 92
Joined: Fri May 02, 2008 7:59 pm
Sign-up code: 0
Location: Stockholm

Re: EGTB Compression Goals

Post by koistinen »

I don't fully understand unmoves so I'll just discuss moves that connect one table with another.

For KQRNKBN:

Index squares as in the table above.
Split the indexing into a high and a low index.
Put the index of the black king in the high part. (10 possible values)
Put the high part of the bishop's index in the high part. (8 possible values)
Put the high part of the queen's index in the high part. (8 possible values)
Put the high part of the rook's index in the high part. (8 possible values)
Put the remaining parts in the low index. (8*8*8*64*64*64 possible values)
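A sketch of that split with hypothetical names (the ordering of the factors is one possible choice; the *_hi/*_lo fields are the high/low 3 bits of the 3+3 square index, and wk/wn/bn are full 6-bit squares):

typedef struct {
    unsigned bk_tri;               /* black king, triangle index, 0..9    */
    unsigned bb_hi, wq_hi, wr_hi;  /* high 3 bits of bishop, queen, rook  */
    unsigned bb_lo, wq_lo, wr_lo;  /* low 3 bits of bishop, queen, rook   */
    unsigned wk, wn, bn;           /* remaining pieces, full 0..63 square */
} Pos;

/* high part: 10 * 8 * 8 * 8 = 5120 values, i.e. the 80 x 64 chunks */
static unsigned high_index(const Pos *p) {
    return ((p->bk_tri * 8 + p->bb_hi) * 8 + p->wq_hi) * 8 + p->wr_hi;
}

/* low part: 8 * 8 * 8 * 64 * 64 * 64 = 134217728 positions per chunk */
static unsigned long low_index(const Pos *p) {
    unsigned long i = p->bb_lo;
    i = i * 8  + p->wq_lo;
    i = i * 8  + p->wr_lo;
    i = i * 64 + p->wk;
    i = i * 64 + p->wn;
    i = i * 64 + p->bn;
    return i;
}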

Pseudocode for how to do/divide the computation:

When examining the effect of black moves:
For each value of the high part of the queen's and rook's indices:
: The high parts of the queen's and rook's indices are now bound, decided, fixed, or whatever you like to call it.
: This means we only need to look at 1/64 of the entire table at this time.
: Load into RAM all positions where the queen's and rook's high indices are as set above.
: For each possible black move:
: : Examine its effect, and adjust the table accordingly.
: : When examining moves by the black king, the position might have to be mirrored, but that will only cause the king's index or the lower part to change.
: Save the result.

When examining the effect of white moves:
For each position of the black king and each value of the high part of the bishop's index:
: The position of the black king and the value of the high part of the bishop's index are bound.
: This means we only need to look at 1/80 of the entire table at this time.
: Load into RAM all positions where the black king's and the bishop's high indices are as set above.
: For each possible white move:
: : Examine its effect and adjust the table accordingly.
: : No mirroring is needed.
: Save the result.

Regarding Note 1:
For 6+1, two passes would be needed to get the size down to something like 1/64 (more likely 1/80).

Regarding Note 2:
The purpose of separating 3 bits of each square index that won't be changed by mirroring is to give you flexibility in how you use those bits.
Combine 3 bits like that with 3 bits for the opposite colour and you can reduce RAM need by 8 while reducing block size by 64 = 8*8. This is good compared with using old-style square indexes, where you could reduce RAM need by 8 while reducing block size by 512 = 8*8*8.
(RAM need you want to be low; block size you want to keep large.)
h.g.muller
Posts: 223
Joined: Mon Feb 19, 2007 8:24 am
Sign-up code: 0
Location: Amsterdam
Contact:

Re: EGTB Compression Goals

Post by h.g.muller »

I am still in doubt if this can be competitive. I guess a lot depends on the exact HD parameters (transfer rate, seek time) and the compression factor. 100MB/s and 1 bit/position seems a bit optimistic to me (I used to assume 50MB/s and 2 bits/pos), while a seek time of 10ms is on the high side (I was counting on 6ms). But even with those parameters, which are unfavorable for low chunk size, the difference between 128M chunks and 16M chunks is not very big.

The problem is that your algorithm needs two passes per cycle, one for the black moves, and one for the white moves. It is true that you then need many fewer seeks per pass, but the problem is that there is little benefit in reducing a time that is already small to one that is completely negligible. For both 128M and 16M chunk sizes, the problem is still dominated by transfer time. And you need twice as much data transfer, because of the two passes.

As my chunks are 8 times smaller, I would need 8 times as many seeks in the nonsequential pass, but I need only 1 non-sequential pass every other cycle. So I would need 4 times as many seeks as you do. But I would need only half the transfer time. With the mentioned HD parameters and chunk sizes, that seems a good deal. And the maximum RAM demand with a K-slice factorization would be 5 x 64 x 16M = 5G positions (1/128 of the TB), while you might need 10 x 8 x 128M = 10G positions. So the K-slice factorization seems faster, with lower RAM demand.
koistinen
Posts: 92
Joined: Fri May 02, 2008 7:59 pm
Sign-up code: 0
Location: Stockholm

Re: EGTB Compression Goals

Post by koistinen »

Why would you need more than 1 bit per position per DTZ? Maybe it has something to do with unmoves?
You only need 1 bit to tell if the position is a white win or not at a given DTZ.
Also, I don't understand how you manage to examine both white and black moves while only looking at part of the TB.
Perhaps I should learn how the K-slice factorization works.

I have measured my 3 hard disks: 250 GB, 320 GB, 500 GB and for reads they are each 70-80 MB/s. I have not measured writes lately but expect them to be about the same.
100 MB/s is what I expect from a 1 TB disk.
I have not measured seek time. I agree that 10 ms is a pessimistic approximation and 6 ms is likely closer to the truth at present.

The expected disk time I wrote of earlier is for computing a DTZ50 table. Is not 24h for a 7-man pretty quick? Then use 4 disk drives to make it 6h, or 16 to make it 1.5h. Soon 8-man will not look too difficult. The only worry then is whether the CPU can keep up.
guyhaw
Posts: 489
Joined: Sat Jan 21, 2006 10:43 am
Sign-up code: 10159
Location: Reading, UK
Contact:

Re: EGTB Compression Goals

Post by guyhaw »

Requirements of compression, using MoSCoW rules (Must, Should, Would be nice, Could)

Must - be 'lossless': decompression reverses back to the original, correct, uncompressed data
Must - give compressed files which can be used in the context required
Must - include an 'integrity check' on each block of the file
Should - (probably) compress a fixed-sized block of uncompressed file (Rob Hyatt optimised to 8k bytes here?)
Should - (probably) output compressed files in fixed-sized blocks
Would (be good if) - the priority metric was DTR rather than DTM, and in its absence DTZ
Could - have improved infrastructure to speed initialisation of EGT take-on at runtime

g
koistinen
Posts: 92
Joined: Fri May 02, 2008 7:59 pm
Sign-up code: 0
Location: Stockholm

Re: EGTB Compression Goals

Post by koistinen »

guyhaw:
Nice to get back to discussing compression.

In the context of EGTBs all data can be recomputed, so lossless is not very meaningful.
I'd be more interested in what purposes the table might be compressed for and what different goals those purposes imply.
I guess DTR is the same as DTZ50 and that is about the easiest tablebase to compute.

What are the contexts for using an EGTB?

Why would an integrity check be needed? Can't the operating system handle this?

How to compress would depend on why.
What an ok cost of usage would be depends on why and where you use it.
h.g.muller
Posts: 223
Joined: Mon Feb 19, 2007 8:24 am
Sign-up code: 0
Location: Amsterdam
Contact:

Re: EGTB Compression Goals

Post by h.g.muller »

koistinen wrote:Why would you need more than 1 bit per position per DTZ? Maybe it has something to do with unmoves?
You only need 1 bit to tell if the position is a white win or not at a given DTZ.
Perhaps this is what I don't understand. In a DTZ EGTB you need to know not only whether the position is a white win or not, but also what the DTZ actually is.
Also, I don't understand how you manage to examine both white and black moves while only looking at part of the TB.
Perhaps I should learn how the K-slice factorization works.
Well, K-slice factorization is only a refinement. The basic principle is the grandparent algorithm. If you have two slices (wtm+btm info) of the TB where all black pieces are dynamic, differing in the location of a single white piece in such a way that a white move is possible between the two locations, you already have all the data available that is needed to construct lost(n+1) btm positions in one of them from the lost(n) positions in the other. (Namely those that use that white move to win.)
I have measured my 3 hard disks: 250 GB, 320 GB, 500 GB and for reads they are each 70-80 MB/s. I have not measured writes lately but expect them to be about the same.
100 MB/s is what I expect from a 1 TB disk.
OK, it makes sense that a 1TB disk would use higher density and be faster. I only have 250GB disks. Anyway, 80 < 100, so my assessment that 100 MB/s is on the high end of the spectrum was correct.
I have not measured seek time. I agree that 10 ms is a pessimistic approximation and 6 ms is likely closer to the truth at present.

The expected disk time I wrote of earlier is for computing a DTZ50 table. Is not 24h for a 7-man pretty quick? Then use 4 disk drive to make it 6h or 16 to make it 1.5h. Soon 8-man will not look too difficult. Only worry then is if the CPU can keep up.
If you do not store more information per position than whether the position is won or not, the CPU will almost certainly not be able to keep up with this, as it has to recalculate every won position cycle after cycle. The minimum information needed to generate efficiently is, for the wtm positions, whether they are won or not, and for the btm positions, whether they were declared lost in the previous cycle, in the current cycle, in a cycle before, or whether they are still undecided. That sounds like 3 bits, but as the lost(n) and lost(n+1) positions each make up only a small to very small fraction of the TB at any time, it should be compressible to under 2 bits by a simple Huffman code.
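To make the 'under 2 bits' plausible, here is a toy calculation with an example prefix code (0, 10, 110, 111 for the four btm states) and made-up round frequencies; only the shape of the argument matters, not the exact numbers:

#include <stdio.h>

int main(void) {
    /* states: UNDECIDED, lost earlier, lost(n), lost(n+1) */
    const unsigned clen[4] = { 1, 2, 3, 3 };              /* code lengths */
    const double   p[4]    = { 0.70, 0.27, 0.02, 0.01 };  /* assumed mix  */
    double avg = 0.0;
    for (int s = 0; s < 4; s++)
        avg += p[s] * clen[s];
    printf("average: %.2f bits per btm position\n", avg); /* 1.33 here */
    return 0;
}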
koistinen
Posts: 92
Joined: Fri May 02, 2008 7:59 pm
Sign-up code: 0
Location: Stockholm

Re: EGTB Compression Goals

Post by koistinen »

h.g.muller:
DTZ is another dimension of the table. So you have one table with white wins at DTZ=0 (black is checkmated), one with white wins at DTZ=1 (white can mate or convert to a won subgame in 1 move), one with white wins at DTZ=2 (not stalemate, no conversion to a subendgame where white doesn't win, and no move into the table with DTZ=1 avoids the white win), etc. You know the DTZ, as it is decided by you when you look into the table. The table only tells whether it is a white win with DTZ at that value, and so it only needs one bit per position.

I have looked into your K-slice factorization idea and think it might be applied to my kind of tables, but only for a slight gain, perhaps reducing ram needed by about 20-30%.

When you examine all moves it can be done much more efficiently than when you do them position by position. For one thing, you can do moves for 64 positions at a time by using a bitboard. Also, you can use dynamic programming techniques to make searching the moves for a queen cost only about as much as searching the moves for a knight. With 64-bit computers, searching all moves 50 times over should cost about the same as searching them once, one by one.
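The sliding-piece trick I have in mind is, I assume, along the lines of a Kogge-Stone occluded fill, which propagates a whole bitboard of positions in a handful of shifts per direction; a single-direction sketch:

#include <stdint.h>

/* gen: bitboard of positions whose value is to be propagated northward,
   pro: squares the slider may pass through (empty squares).
   Returns gen plus everything reachable by sliding north through pro. */
static uint64_t north_occluded_fill(uint64_t gen, uint64_t pro) {
    gen |= pro & (gen << 8);
    pro &=       (pro << 8);
    gen |= pro & (gen << 16);
    pro &=       (pro << 16);
    gen |= pro & (gen << 32);
    return gen;
}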

Once you are done you should of course store the table in the more compact form, storing the value of DTZ at which it becomes a white win instead of having one bit per possible value of DTZ. (This can be made a little cheaper, saving hard disk space, by keeping the expanded form of the table only as long as it is useful and instead storing differences using sparse-matrix techniques such as run-length encoding.)
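One way such a difference encoding might look (illustrative only): XOR two successive layers and run-length encode the zero bytes, which are the vast majority because few positions change per step. The output buffer must allow for 2*n bytes in the worst case.

#include <stdint.h>
#include <stddef.h>

/* Emits (skip, literal) byte pairs: skip zero-difference bytes, then one
   literal XOR byte. The last input byte is always emitted as a literal,
   so the decoder never overruns n. */
static size_t rle_diff(const uint8_t *prev, const uint8_t *next, size_t n,
                       uint8_t *out)
{
    size_t o = 0, run = 0;
    for (size_t i = 0; i < n; i++) {
        uint8_t d = prev[i] ^ next[i];          /* changed bits only       */
        if (d == 0 && run < 255 && i + 1 < n) { run++; continue; }
        out[o++] = (uint8_t)run;                /* zero-diff bytes skipped */
        out[o++] = d;                           /* then one literal byte   */
        run = 0;
    }
    return o;                                   /* bytes written           */
}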

Huffman coding is likely much too slow for use when generating tables. Even avoiding storing positions where two pieces are on the same square might be too expensive, at least for the piece indexed by the bitboard. Likewise, avoiding storing positions that are illegal because of king proximity would also be too expensive.
h.g.muller
Posts: 223
Joined: Mon Feb 19, 2007 8:24 am
Sign-up code: 0
Location: Amsterdam
Contact:

Re: EGTB Compression Goals

Post by h.g.muller »

I am still not sure how this works. If you have a separate 'layer' for each DTZ, how can you construct DTZ=7 from DTZ=6? You would have to know which positions have DTZ=0, 1, 2, ... , 5 as well. Or is layer N storing all DTZ <= N?

The fact that you would have to read every layer again when combining all data seems another doubling of the required I/O. And in the meantime you bloat HD storage requirements enormously too. If the EGTB requires 50 moves, you would at the end have stored 50 bits per position (or 100, when you keep wtm as well), while it could have been 6 (or 7) bits (uncompressed).

In my experience, doing things in bulk, rather than position by position, is not competitive at all. The problem is that usually fewer than 1 out of every 64 positions needs attention. So if you treat 64 at a time, on average the effort on 63 of them will be wasted, as they needed no treatment.
guyhaw
Posts: 489
Joined: Sat Jan 21, 2006 10:43 am
Sign-up code: 10159
Location: Reading, UK
Contact:

Re: EGTB Compression Goals

Post by guyhaw »

Seems like some people don't see the DTZ EGT as analogous to the DTM or DTC EGTs.

DTZ = 7 positions are derived from the DTZ = 6 positions.

Nope, DTR =/= DTZ50, otherwise it wouldn't deserve a different identity as a metric. Counterexample: if a KNNKP position has DTZ = 51, then DTR = 51, but DTZ50 = 'draw'.

g
h.g.muller
Posts: 223
Joined: Mon Feb 19, 2007 8:24 am
Sign-up code: 0
Location: Amsterdam
Contact:

Re: EGTB Compression Goals

Post by h.g.muller »

guyhaw wrote:Seems like some people don't see the DTZ EGT as analogous to the DTM or DTC EGTs.

DTx = 7 positions are derived from the DTx = 6 positions.
I am not sure if this was directed at me, but it seems a wrong / incomplete statement.

If you only have DTx=6 btm positions, and you have a position that can reach DTx=6 btm positions after one move, you cannot conclude the position has DTx=7 (wtm). It might be DTx=3. DTx=N+1 wtm positions are derived from all DTx<=N btm positions.
koistinen
Posts: 92
Joined: Fri May 02, 2008 7:59 pm
Sign-up code: 0
Location: Stockholm

Re: EGTB Compression Goals

Post by koistinen »

h.g.muller wrote:I am still not sure how this works. If you have a separate 'layer' for each DTZ, how can you construct DTZ=7 from DTZ=6? You would have to know which positions have DTZ=0, 1, 2, ... , 5 as well. Or is layer N storing all DTZ <= N?
Any position won with DTZ=N is also won with DTZ=N+1. In that way layer N is storing all DTZ <= N.
h.g.muller wrote:The fact that you would have to read every layer again when combining all data seems another doubling of the required I/O. And in the mean time you bloat HD storage requirements enormously too. If the EGTB requires 50 moves, you would at the end have stored 50 bits per position, (or 100, when you keep wtm as well), while it could have been 6 (or 7) bits (uncompressed).
Regarding storage I agree, but it can be avoided and is not particularly important for discussing speed. Regarding I/O I disagree as you only need to read each bitbase once in the combining step.

For white moves at DTZ=N, my scheme would need to:
read btm with DTZ=N-1 (to examine moves into btm)
read wtm with DTZ=N-2 (to compress info for combination step and to avoid having to examine conversion moves or compute illegal positions)
write wtm with DTZ=N

For black moves at DTZ=N, my scheme would need to:
read wtm with DTZ=N-1 (to examine moves into wtm)
read precomputed bitbase (to avoid having to examine conversion moves, compute illegal positions or stalemates)
write btm with DTZ=N
h.g.muller wrote:In my experience, doing things in bulk, rather than position by position, is not competative at all. The problem is that usually fewer than 1 out of every 64 positions needs attention. So if you treat 64 at a time, on the average the effort on 63 of them will be wasted, as they needed no treatment.
64*1/64 = 1: the total cost is about the same, since doing the 64 all at once costs only about 1/64 (per position) of doing them singly.

Perhaps selecting the method to use depending on how many values changed in the last step would be good, at least when computing btm table?
guyhaw wrote:Seems like some people don't see the DTZ EGT as analogous to the DTM or DTC EGTs.

DTZ = 7 positions are derived from the DTZ = 6 positions.

Nope, DTR =/= DTZ50, otherwise it wouldn't deserve a different identity as a metric. Counterexample: if a KNNKP positions has DTZ = 51, then DTR = 51, but DTZ50 = 'draw'.

g
Interesting about DTZ50 =/= DTR. With my scheme you would need to walk through the retrograde steps twice to get DTR, as there is a difference depending on whether the winning conversion you depend on in the end is a white or a black move.
h.g.muller
Posts: 223
Joined: Mon Feb 19, 2007 8:24 am
Sign-up code: 0
Location: Amsterdam
Contact:

Re: EGTB Compression Goals

Post by h.g.muller »

koistinen wrote:Any position won with DTZ=N is also won with DTZ=N+1. In that way layer N is storing all DTZ <= N.
OK, then it is clear.
koistinen wrote:For white moves at DTZ=N, my scheme would need to:
read btm with DTZ=N-1 (to examine moves into btm)
read wtm with DTZ=N-2 (to compress info for combination step and to avoid having to examine conversion moves or compute illegal positions)
write wtm with DTZ=N

For black moves at DTZ=N, my scheme would need to:
read wtm with DTZ=N-1 (to examine moves into wtm)
read precomputed bitbase (to avoid having to examine conversion moves, compute illegal positions or stalemates)
write btm with DTZ=N
So you need to read 4 bits per position per cycle, and write 2. The reads would be distributed over four different passes, though, two of which are non-sequential. So per 128M slice you would need 16MB / 100MB/s = 160ms, 6 times, = 960 ms transfer time, and 3 seeks = 30ms, for a total of 990 ms.

If you would merge the wtm and btm data, for 2 bits per position, and use the grandparent algorithm with leapfrogging, you would only need to read 2 bits and then write them back, half the time sequential. For eight 16M slices this would be 8 x 4MB = 32MB, reading plus writing = 640ms. And you would need 8 seeks for reading and 8 seeks for writing on half the passes, so on the average 80 ms. So 720 ms.
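Just to make the arithmetic explicit, a throwaway check of these two estimates with the stated parameters (purely illustrative):

#include <stdio.h>

int main(void) {
    const double mb_s = 100.0, seek_ms = 10.0;

    /* per-DTZ-layer scheme: six 16MB transfers + 3 seeks per 128M slice */
    double layered  = 6 * (16.0 / mb_s) * 1000.0 + 3 * seek_ms;

    /* grandparent + leapfrog: eight 4MB slices read and written back,
       16 seeks on half the passes (8 on average)                       */
    double leapfrog = 2 * 8 * (4.0 / mb_s) * 1000.0 + 8 * seek_ms;

    printf("per-layer scheme: %.0f ms\n", layered);   /* 990 ms */
    printf("leapfrog scheme:  %.0f ms\n", leapfrog);  /* 720 ms */
    return 0;
}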

So the grandparent + leapfrog is faster. But what is even more important to me, it would only require half the amount of RAM. That makes it a win-win.
koistinen wrote: 64*1/64 =1: Total cost is about the same as the cost of doing the 64 all at once is about 1/64 of doing the 1 singly.
Yes, assuming that you know which of your 64-bit chunks contain an element that needs treatment. Otherwise, when the density is less than 1/64, you would also waste time processing positions that do not need treatment at all. And remembering that requires extra storage.

But I think the more important problem is that, because you accumulate the won positions, without keeping track of which were newly added, your data set will very quickly no longer be sparse in the number of elements that need treatment. So you will treat every position on every cycle over and over again, even the positions that just happen to be undecided positions that shared a 64-bit word with another position that was already treated 10 times before, but is treated again because you don't know it.
I don't expect the CPU to be able to keep up with this. Have you tested the in-RAM part of your algorithm, say on 5-men EGTBs? How fast can you do KBBKN in RAM? (And on which machine?)
koistinen
Posts: 92
Joined: Fri May 02, 2008 7:59 pm
Sign-up code: 0
Location: Stockholm

Re: EGTB Compression Goals

Post by koistinen »

h.g.muller:
I got to think this over, writing some code to test speed of examining all moves.
Previously I simply assumed that being about as quick as examining every move once individually would be enough.
guyhaw
Posts: 489
Joined: Sat Jan 21, 2006 10:43 am
Sign-up code: 10159
Location: Reading, UK
Contact:

Re: EGTB Compression Goals

Post by guyhaw »

I'm not following this koistinen-muller dialogue too closely, but when I say that a 'DTx = 7' position is derived from a 'DTx = 6' position, I am not saying that positions backed-up from DTx = 6 positions _are_ DTx = 7 positions.
'K' and 'M' may also be backing up positions 'too early'. It is important not to back up, e.g., a DTM=6 position until all DTM=6 positions are known. Then one knows that the new positions resolved in the next cycle are DTx = 7 positions.
g
h.g.muller
Posts: 223
Joined: Mon Feb 19, 2007 8:24 am
Sign-up code: 0
Location: Amsterdam
Contact:

Re: EGTB Compression Goals

Post by h.g.muller »

Well, the question was what information exactly has to be around at which time, and how many bits would be needed to encode it. Initially there was some confusion as to what the single bit in one layer of the Koistinen scheme meant. (DTx=N as opposed to DTx<=N).

With DTx<=N you do have enough information to calculate the next DTx for the other side to move. But that would require a brute-force effort of trying every position, as you would not know which DTx<=N positions were part of the 'win front', and you also would not know which of the positions you are now trying to assign were assigned before.

This is why I am in favor of adding the information of which positions belong to the win front, for at least the btm positions. On most cycles it is only very little information, and exactly because of that saves an enormous amount of processing.
koistinen
Posts: 92
Joined: Fri May 02, 2008 7:59 pm
Sign-up code: 0
Location: Stockholm

Re: EGTB Compression Goals

Post by koistinen »

h.g.muller:
I think I understand the grandparent part of your algorithm.
I think run length code is the compression method to use to keep track of recent changes.
The leapfrog part I don't quite understand. Mirroring seems a problem to me.
Perhaps I have just not found a detailed enough description?

I have written some test code: http://teo.se/eg/Farrah-0.0.20080518.tar.gz
My CPU is an AMD Athlon 64 X2 4200+, and using a single core I estimate the CPU time for computing a 7-man is about 4h. Using both cores should give about twice the CPU power. This might be needed, as each extra piece would need the same time (almost double for queens). Perhaps a more recent CPU would be needed to keep up in unbalanced endgames.

This means that with 4 disks the computation would still be disk-time-bound.
So it looks to me that my algorithm should be good for computing a 7-man in about 6 hours using 4 disk drives and 24h with 1.

As 6-man endgames can be done mostly in ram they should be doable in about 10 minutes each.

The indexing and exhaustive computation part can be used separately.

I hope to hear you tell me where I am wrong.