what to do with broken positions


Post by jkominek »

Or: some recent experiments in EGTB compression.

Earlier this year codeman raised the idea of squeezing Nalimov's already-compact indexing scheme even further. I expressed doubt, warning not to underestimate the difficulty of said undertaking - and that compression picks up the slop anyway. So I drew up some questions and set forth with my hammer and measuring tape.
  1. How much of an impact does indexing have, post-compression?
  2. How much does it help to merge broken and drawn values?
  3. How well does an optimally compact index compress, i.e. one with no broken positions at all?
  4. What is the trend as the compression block size increases?
  5. What can we expect from various forms of bitbases?
  6. What does bzip have to say about it?
  7. How do we compare to FEG?
To create an optimally compact index, I write a copy of an existing tablebase while skipping all broken positions. It contains no more and no less than what is necessary. It can't be randomly accessed, but the point is to find the lower bound.
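To make the "merged" and "compacted" variants concrete, here is a minimal sketch of the two transforms. The one-byte encoding (0 = broken, 1 = draw) is an assumption for illustration only, not my generator code or the real value layout:

Code:

    #include <cstdint>
    #include <fstream>
    #include <vector>

    // Toy value encoding, assumed for this sketch only:
    // 0 = broken, 1 = draw, anything else = a win/loss-in-N code.
    const uint8_t BROKEN = 0;
    const uint8_t DRAW   = 1;

    // "Compacted": broken positions are skipped outright, so the output
    // holds no more and no less than the real data. The price is that it
    // can no longer be random-accessed by index.
    void write_compacted(const std::vector<uint8_t>& values, const char* path)
    {
        std::ofstream out(path, std::ios::binary);
        for (uint8_t v : values)
            if (v != BROKEN)
                out.put(static_cast<char>(v));
    }

    // "Merged": the index stays intact; broken positions are simply
    // relabeled as draws before the stream goes to the compressor.
    void write_merged(std::vector<uint8_t> values, const char* path)
    {
        for (uint8_t& v : values)
            if (v == BROKEN)
                v = DRAW;
        std::ofstream out(path, std::ios::binary);
        out.write(reinterpret_cast<const char*>(values.data()),
                  static_cast<std::streamsize>(values.size()));
    }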

Amazingly, it doesn't help to squeeze out the juice 100%.

Code:

        4-Men    |          normal            merged          compacted  
    ---------------------------------------------------------------------
    uncompressed |     159,781,072       159,781,072        135,546,986  
     datacomp 1k |      35,013,069        33,604,732         30,478,092  
              2k |      32,787,888        31,574,051         28,788,605  
              4k |      31,289,135        30,134,708         27,810,841  
              8k |      28,557,596        27,812,462         26,634,956  
             16k |      27,167,576        26,688,444         26,107,831  
             32k |      25,646,662        25,174,006         25,759,160  
             64k |      24,395,671 *      24,113,440 *       25,400,025 *
        bzip2 -9 |      26,108,450        25,469,296         23,396,262  
        deKoning |      23,230,836 **                                    
         Nalimov |      30,983,745   

Code:

        5-Men    |          normal            merged          compacted  
    ---------------------------------------------------------------------
    uncompressed |  33,527,345,596    33,527,345,596     26,033,675,367  
     datacomp 1k |   8,728,031,845     8,547,021,647      7,959,221,556  
              2k |   8,227,824,249     8,087,536,135      7,638,536,972  
              4k |   7,878,680,116     7,771,436,365      7,417,596,573  
              8k |   7,404,751,015     7,319,889,259      7,192,668,341  
             16k |   7,118,743,447     7,041,928,611      7,052,201,966  
             32k |   6,905,076,267     6,829,779,193      6,935,576,730  
             64k |   6,625,764,600 *   6,549,849,083 *    6,839,679,244 *
        bzip2 -9 |   6,821,025,466     6,691,301,666      6,228,415,821  
        deKoning |   6,054,999,960 **                                    
         Nalimov |   7,549,300,897                                       
Direct your attention to the lines labeled 64k. 64 KB is the largest block size datacomp supports, at least without a significant rewrite. In the 5-men suite, merging broken and drawn positions saves about 75 MB (6,625,764,600 vs. 6,549,849,083 bytes). Operating on the fully compacted data actually increases the final storage size. bzip2 shows that the entropy really is lower, but from a practical standpoint that gain is inaccessible to us.

The argument for merging broken into drawn values is that a) it improves compression, and b) they'll never be probed anyway, so do with them what you want. On the downside, the full statistics can no longer be extracted by a simple linear scan of the data; you'll need the services of a position generator. This makes verification less straightforward. Thus it is not quite accurate to say the broken positions are "don't cares".

The gain from increasing the block size from 8k (used by Nalimov) to 64k is significant. Combined with merging, we shave a full gigabyte off the 7.5 GB baseline (7,549,300,897 down to 6,549,849,083 bytes). Not too shabby.

The difference between Nalimov's size of 7549 MB - I'm using base 10 nomenclature here - and the 8k Kominek size of 7405 MB is due to indexing simplification, I presume. The "Kominek" indexing of placing pawns before kings (and not exerting great effort in removing broken positions) leads to symbol streams that are more easily compressible. The catch is that when decompressed into the working memory of a program, they take up more room, likely leading to more cache swaps.
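As an illustration of what I mean by simple, pawns-first indexing - a naive example, not the actual scheme - consider a KPK-style index with the pawn as the most significant digit. The kings vary fastest, so within a compression block the pawn changes only rarely and the value stream stays regular, but none of the unreachable slots are squeezed out:

Code:

    #include <cstdint>

    // Naive KPK index: pawn first (most significant), kings varying fastest.
    // No symmetry reduction and no removal of impossible placements (e.g. a
    // pawn on the first or eighth rank), so the index is larger but regular.
    uint32_t kpk_index(int pawn_sq, int wk_sq, int bk_sq)
    {
        return (static_cast<uint32_t>(pawn_sq) * 64u + wk_sq) * 64u + bk_sq;
    }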

Johan deKoning's FEG format is the size champ. The key to his success, I'm inclined to believe, is an asymmetrical partitioning. His white-to-move files store only wins for white. They don't store losses. Where the Nalimov format requires two files per piece combination, deKoning needs up to four, plus separate index files. It works, but is not a strategy I'm going to adopt.

So, to finally answer my own rhetorical question: leave the damn things in!

john

P.S. The attached text file contains my current compression numbers, including some for bitbases.
Attachments
compression_results.txt.gz
Size requirements of various indexing+compression combinations, partial results.
(1.86 KiB)

Re: what to do with broken positions

Post by guyhaw »

The choice of block size in the Nalimov EGTs was determined by some experiments with Rob Hyatt. It's not just about minimising the size of the compressed file - increasing the block size reduces total file size - it's also about the accessibility of the compressed data, and for that you don't want the largest possible block size. I think they settled on 8 KB.
There's another way to optimise compression vis-a-vis the 'broken' positions. Since they are never going to be accessed, they can be set to W, D or L depending on what maximises the compression - after you have noted the correct stats. I don't think anyone has done any experiments on this.
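A minimal sketch of one way this could be done (illustration only, not an existing implementation; the names are made up): within each independently compressed block, copy the value of the nearest preceding real position into every broken slot. That lengthens runs at no cost, since those entries are never probed.

Code:

    #include <cstdint>
    #include <vector>

    // Fill don't-cares inside one compression block. 'broken' marks the
    // broken-position code, 'fallback' is used if the block starts broken.
    // The correct statistics have to be gathered before this pass.
    void fill_dont_cares(std::vector<uint8_t>& block,
                         uint8_t broken, uint8_t fallback)
    {
        uint8_t last = fallback;
        for (uint8_t& v : block) {
            if (v == broken)
                v = last;        // never probed, so any value is legal here
            else
                last = v;
        }
    }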
g

Re: what to do with broken positions

Post by jkominek »

guyhaw wrote:There's another way to optimise compression vis-a-vis the 'broken' positions. Since they are never going to be accessed, they can be set to W, D or L depending on what maximises the compression - after you have noted the correct stats. I don't think anyone has done any experiments on this.
g
You may be right about that. I first came across that idea in an online discussion of checkers tablebases. It wasn't clear whether they actually did set the broken positions to be compressor-friendly, or just pointed out the possibility. I may know someone willing to do the experiments for chess.

jk

Re: what to do with broken positions

Post by syzygy »

guyhaw wrote:There's another way to optimise compression vis-a-vis the 'broken' positions. Since they are never going to be accessed, they can be set to W, D or L depending on what maximises the compression - after you have noted the correct stats. I don't think anyone has done any experiments on this.
My compressor does this, and it works very well.

Another thing I do is to check, for each permutation of the pieces, how well a given tablebase compresses (based on a sample of the data) and to pick the best permutation for that tablebase. The gain can be quite significant.
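A rough sketch of the idea (not the actual code - zlib stands in for the real compressor, and sample_for is a placeholder hook for producing a sample of the value stream under a given piece order):

Code:

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <functional>
    #include <vector>
    #include <zlib.h>   // link with -lz; stand-in for the real compressor

    // Compressed size of a buffer at zlib's default level, used as a cheap
    // proxy for how well the whole table will compress.
    size_t compressed_size(const std::vector<uint8_t>& data)
    {
        uLongf out_len = compressBound(static_cast<uLong>(data.size()));
        std::vector<Bytef> out(out_len);
        compress2(out.data(), &out_len, data.data(),
                  static_cast<uLong>(data.size()), Z_DEFAULT_COMPRESSION);
        return out_len;
    }

    // Try every permutation of the piece order on a sample of the data and
    // keep the one whose sample compresses smallest.
    std::vector<int> best_piece_order(
        std::vector<int> order,
        const std::function<std::vector<uint8_t>(const std::vector<int>&)>& sample_for)
    {
        std::sort(order.begin(), order.end());
        std::vector<int> best = order;
        size_t best_size = SIZE_MAX;
        do {
            size_t s = compressed_size(sample_for(order));
            if (s < best_size) { best_size = s; best = order; }
        } while (std::next_permutation(order.begin(), order.end()));
        return best;
    }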

My current tables use very complex indexing that removes all symmetry and positions with two or more pieces on one square. I tend to think now that it is much better to use very simple indexing and to rely on the compressor. The data will be more regular and - I expect - compress better. I haven't experimented yet, but john's results seem to confirm this.

(I've built 5-piece tables for suicide chess, which has tables such as KKKvKK. If you want to have fun, try to come up with a symmetry-free indexing function for that ;))

There are lots of other options that could be considered to reduce table size. For example, most or all tables will have almost all wins or losses in less than N moves, say with N=10. Why not set all these positions to WDL and keep the positions >N unchanged? With a smart compressor the result will approach WDL size. It should not have any effect on game play, except that you will need to search a bit at the root node once you're in the table.
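A sketch of that last idea, with a made-up signed distance-to-mate encoding (positive = win in that many moves, negative = loss, zero = draw; not how any real table is stored):

Code:

    #include <cstdint>
    #include <vector>

    // Generic symbols for "decided within n moves".
    const int8_t GENERIC_WIN  = 127;
    const int8_t GENERIC_LOSS = -127;

    // Collapse everything decided within n moves to a bare win/loss symbol;
    // a decent compressor then gets the table close to WDL size, while
    // longer wins and losses keep their exact distance.
    void cap_at_n(std::vector<int8_t>& dtm, int n)
    {
        for (int8_t& v : dtm) {
            if (v > 0 && v <= n)        v = GENERIC_WIN;
            else if (v < 0 && -v <= n)  v = GENERIC_LOSS;
            // draws and positions deeper than n are left unchanged
        }
    }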