Earlier this year codeman raised the idea of squeezing Nalimov's already-compact indexing scheme even further. I expressed doubt, warning not to underestimate the difficulty of said undertaking - and that compression picks up the slop anyway. So I drew up some questions and set forth with my hammer and measuring tape.
1. How much of an impact does indexing have, post-compression?
2. How much does it help to merge broken and drawn values?
3. How well does an optimally compact index, i.e. one with no broken positions, compress?
4. What is the trend with increasing the compression block size?
5. What can we expect from various forms of bitbases?
6. What does bzip have to say about it?
7. How are we comparing to FEG?
Amazingly, it doesn't help to squeeze out the juice 100%: at the larger block sizes the fully compacted index actually ends up bigger after compression than the simply merged one.
Code:
4-Men        |       normal       merged    compacted
------------------------------------------------------
uncompressed |  159,781,072  159,781,072  135,546,986
datacomp  1k |   35,013,069   33,604,732   30,478,092
          2k |   32,787,888   31,574,051   28,788,605
          4k |   31,289,135   30,134,708   27,810,841
          8k |   28,557,596   27,812,462   26,634,956
         16k |   27,167,576   26,688,444   26,107,831
         32k |   25,646,662   25,174,006   25,759,160
         64k |   24,395,671 * 24,113,440 * 25,400,025 *
bzip2 -9     |   26,108,450   25,469,296   23,396,262
deKoning     |   23,230,836 **
Nalimov      |   30,983,745
Code:
5-Men        |          normal          merged       compacted
-----------------------------------------------------------------
uncompressed |  33,527,345,596  33,527,345,596  26,033,675,367
datacomp  1k |   8,728,031,845   8,547,021,647   7,959,221,556
          2k |   8,227,824,249   8,087,536,135   7,638,536,972
          4k |   7,878,680,116   7,771,436,365   7,417,596,573
          8k |   7,404,751,015   7,319,889,259   7,192,668,341
         16k |   7,118,743,447   7,041,928,611   7,052,201,966
         32k |   6,905,076,267   6,829,779,193   6,935,576,730
         64k |   6,625,764,600 * 6,549,849,083 * 6,839,679,244 *
bzip2 -9     |   6,821,025,466   6,691,301,666   6,228,415,821
deKoning     |   6,054,999,960 **
Nalimov      |   7,549,300,897
The argument for merging broken into drawn values is that a) this improves compression, and b) they'll never be probed anyway so do with them what you want. On the downside, the full statistics can no longer be extracted by a simple linear scan of the data; you'll need the services of a position generator. This makes verification less straightforward. Thus it is not quite accurate to say the broken positions are "don't cares".
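To make the merge step concrete, here is a minimal sketch of the idea. The value codes, names, and flat 8-bit array are my own assumptions for illustration, not anyone's actual file layout.
Code:
/* Sketch only: remap every broken position's value to "draw" before
 * the table is handed to the block compressor. */
#include <stddef.h>
#include <stdint.h>

enum { VAL_BROKEN = 0, VAL_DRAW = 1 };   /* win/loss distances use other codes */

static void merge_broken_into_draw(uint8_t *val, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (val[i] == VAL_BROKEN)
            val[i] = VAL_DRAW;           /* never probed, so any legal code will do */
}
Presumably what the compressor then cashes in on is the longer runs of a single symbol.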
The gain from increasing the block size from 8k (used by Nalimov) to 64k is significant. Combined with merging, it shaves roughly 1 GB off 7.5: from Nalimov's 7,549,300,897 bytes down to 6,549,849,083 for the merged 64k build. Not too shabby.
The difference between Nalimov's size of 7549 MB - I'm using base-10 nomenclature here - and the 8k Kominek size of 7405 MB is due, I presume, to indexing simplification. The "Kominek" indexing places pawns before kings, and does not exert great effort to remove broken positions, which leads to symbol streams that are more easily compressible. The catch is that, once decompressed into a program's working memory, the tables take up more room, likely leading to more cache swaps.
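As a toy sketch of the "pawns before kings" ordering (the general idea only, not the real indexing routine; the names are invented and symmetry folding is deliberately left out):
Code:
/* Toy KPK index with the pawn square as the most significant digit.
 * The white pawn lives on a2..h7 (48 squares); each king gets all 64.
 * Broken positions (kings adjacent, a king on the pawn's square, ...)
 * are left in, trading wasted slots for a trivial formula. */
#include <stdint.h>

#define KPK_ENTRIES (48u * 64u * 64u)    /* 196,608 slots per side to move */

static uint32_t kpk_index(int wp_sq, int wk_sq, int bk_sq)
{
    /* wp_sq in 8..55 (a2..h7), king squares in 0..63 */
    return ((uint32_t)(wp_sq - 8) * 64u + (uint32_t)wk_sq) * 64u
           + (uint32_t)bk_sq;
}
With the pawn first, long contiguous stretches of the table share the same pawn square, which is presumably part of why the streams compress so readily; the price, as noted, is paid once the blocks are expanded in memory.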
Johan deKoning's FEG format is the size champ. The key to his success, I'm inclined to believe, is an asymmetrical partitioning. His white-to-move files store only wins for white. They don't store losses. Where the Nalimov format requires two files per piece combination, deKoning needs up to four, plus separate index files. It works, but is not a strategy I'm going to adopt.
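Purely as a guess at what probing against such a partitioning might look like (none of the names or conventions below are FEG's actual API or on-disk layout):
Code:
/* Hypothetical probe against per-outcome partitions: one file holds
 * only the white-to-move wins, another only the white-to-move losses,
 * and a draw is whatever appears in neither.  Invented for
 * illustration; not FEG's real format. */
#include <stdbool.h>

typedef struct Position Position;                     /* opaque for this sketch */
bool wtm_wins_lookup(const Position *p, int *dtm);    /* assumed win partition  */
bool wtm_losses_lookup(const Position *p, int *dtm);  /* assumed loss partition */

/* Result from White's point of view: +1 win, -1 loss, 0 draw. */
static int probe_white_to_move(const Position *p, int *dtm)
{
    if (wtm_wins_lookup(p, dtm))
        return +1;
    if (wtm_losses_lookup(p, dtm))
        return -1;
    return 0;
}
If that is roughly right, each partition carries a more uniform stream of values (only win distances, or only loss distances), which would account for some of the compression edge.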
So, to finally answer my own rhetorical question: leave the damn broken positions in!
john
P.S. The attached text file contains my current compression numbers, including some for bitbases.