about CCRL stat

Post by **Uri Blass** » Wed Jun 21, 2006 9:20 am

I guess that you have a date for every game in your database, so it may be interesting to calculate ratings for all programs based only on the first 50 games that each played and on the games that took place in that period.

You can do it by simply ignoring all games after the 50th game of the relevant program and calculating the ratings from the remaining data.

It may be interesting to compare the ratings of programs after 50 games with their ratings after more games to see the change.

I do not trust the statistical error given as + and - because I guess that it is based on some wrong assumptions (for example, that the probability is the same for white and black, or that the rating does not depend on the opponent, when it is possible that a program scores better or worse than expected against particular opponents).

My guess, based on a comparison between CEGT and CCRL, is that the real error is smaller than the number you give in more than 95% of cases, but I am not sure about it. It would also be interesting to know whether I am correct in guessing that after 50 games we can expect the rating not to change by more than 20 Elo in most cases.

Uri

Post by **Graham Banks** » Wed Jun 21, 2006 9:28 am

Hi Uri,

I'll get Kirill to respond to this one.

Regards, Graham.

Post by **Shaun** » Wed Jun 21, 2006 12:03 pm

Uri Blass wrote:I guess that you have a date for every game in your database, so it may be interesting to calculate ratings for all programs based only on the first 50 games that each played and on the games that took place in that period.

You can do it by simply ignoring all games after the 50th game of the relevant program and calculating the ratings from the remaining data.

It may be interesting to compare the ratings of programs after 50 games with their ratings after more games to see the change.

Uri,
I think it would be interesting to look at this. However, taking the first 50 games would be prone to a large error, as certain engines perform better or worse depending on the opponent, and the first 50 may all be against the same engine. If instead you took the first n games against each opponent (with n chosen so that the total is near 50), this could give a valid comparison. Unfortunately, I believe you are going to be disappointed with the results: it does seem that you need a huge number of games to prove superiority between engines of similar strength. I have been testing Toga versions for Thomas and am finding it very difficult to prove or disprove improvements.

Shaun

EDIT: Examples added:

Code: Select all

   1 Rybka 1.1 32-bit       2608 300.0 (182.0 : 118.0)
                                  50.0 ( 28.5 :  21.5) Toga II 1.2 Beta3a     2551
                                  50.0 ( 29.0 :  21.0) Toga II 1.2 Beta3c     2528
                                  50.0 ( 29.5 :  20.5) Toga II 1.2 Beta2a     2566
                                  50.0 ( 31.0 :  19.0) Toga II 1.2 Beta3b     2527
                                  50.0 ( 31.5 :  18.5) Toga II 1.3 Beta1      2535
                                  50.0 ( 32.5 :  17.5) Toga II 1.2 Beta2aE26  2526
   3 Hiarcs X50 UCI         2555 300.0 (157.5 : 142.5)
                                  50.0 ( 23.0 :  27.0) Toga II 1.2 Beta3a     2551
                                  50.0 ( 24.0 :  26.0) Toga II 1.2 Beta2a     2566
                                  50.0 ( 25.5 :  24.5) Toga II 1.2 Beta3b     2527
                                  50.0 ( 27.0 :  23.0) Toga II 1.3 Beta1      2535
                                  50.0 ( 28.5 :  21.5) Toga II 1.2 Beta3c     2528
                                  50.0 ( 29.5 :  20.5) Toga II 1.2 Beta2aE26  2526
   9 Fritz 9                2519 300.0 (141.5 : 158.5)
                                  50.0 ( 21.5 :  28.5) Toga II 1.2 Beta3a     2551
                                  50.0 ( 23.0 :  27.0) Toga II 1.3 Beta1      2535
                                  50.0 ( 24.0 :  26.0) Toga II 1.2 Beta2a     2566
                                  50.0 ( 24.0 :  26.0) Toga II 1.2 Beta3c     2528
                                  50.0 ( 24.5 :  25.5) Toga II 1.2 Beta3b     2527
                                  50.0 ( 24.5 :  25.5) Toga II 1.2 Beta2aE26  2526
  10 Hiarcs 10 UCI          2515 300.0 (139.0 : 161.0)
                                  50.0 ( 21.5 :  28.5) Toga II 1.2 Beta2aE26  2526
                                  50.0 ( 22.0 :  28.0) Toga II 1.2 Beta3c     2528
                                  50.0 ( 22.5 :  27.5) Toga II 1.2 Beta3b     2527
                                  50.0 ( 23.0 :  27.0) Toga II 1.2 Beta2a     2566
                                  50.0 ( 23.0 :  27.0) Toga II 1.3 Beta1      2535
                                  50.0 ( 27.0 :  23.0) Toga II 1.2 Beta3a     2551
  12 Spike 1.1              2473 300.0 (119.0 : 181.0)
                                  50.0 ( 17.5 :  32.5) Toga II 1.3 Beta1      2535
                                  50.0 ( 17.5 :  32.5) Toga II 1.2 Beta2aE26  2526
                                  50.0 ( 18.0 :  32.0) Toga II 1.2 Beta2a     2566
                                  50.0 ( 19.0 :  31.0) Toga II 1.2 Beta3a     2551
                                  50.0 ( 20.0 :  30.0) Toga II 1.2 Beta3c     2528
                                  50.0 ( 27.0 :  23.0) Toga II 1.2 Beta3b     2527
  13 CM10k Xperience        2423 300.0 (102.0 : 198.0)
                                  50.0 ( 10.0 :  40.0) Toga II 1.2 Beta2a     2566
                                  50.0 ( 14.5 :  35.5) Toga II 1.2 Beta3b     2527
                                  50.0 ( 17.0 :  33.0) Toga II 1.2 Beta3c     2528
                                  50.0 ( 19.5 :  30.5) Toga II 1.2 Beta3a     2551
                                  50.0 ( 19.5 :  30.5) Toga II 1.3 Beta1      2535
                                  50.0 ( 21.5 :  28.5) Toga II 1.2 Beta2aE26  2526
  14 SlowChess Blitz WV2.1  2404 300.0 ( 93.0 : 207.0)
                                  50.0 ( 13.5 :  36.5) Toga II 1.2 Beta2a     2566
                                  50.0 ( 14.5 :  35.5) Toga II 1.2 Beta3a     2551
                                  50.0 ( 15.0 :  35.0) Toga II 1.3 Beta1      2535
                                  50.0 ( 15.0 :  35.0) Toga II 1.2 Beta3b     2527
                                  50.0 ( 16.0 :  34.0) Toga II 1.2 Beta2aE26  2526
                                  50.0 ( 19.0 :  31.0) Toga II 1.2 Beta3c     2528
  15 Pro Deo 1.1            2401 300.0 ( 91.0 : 209.0)
                                  50.0 ( 13.0 :  37.0) Toga II 1.2 Beta2a     2566
                                  50.0 ( 14.0 :  36.0) Toga II 1.2 Beta3b     2527
                                  50.0 ( 14.5 :  35.5) Toga II 1.2 Beta3a     2551
                                  50.0 ( 15.5 :  34.5) Toga II 1.2 Beta3c     2528
                                  50.0 ( 16.5 :  33.5) Toga II 1.3 Beta1      2535
                                  50.0 ( 17.5 :  32.5) Toga II 1.2 Beta2aE26  2526
  16 Glaurung 1.0.2         2378 350.0 ( 99.0 : 251.0)
                                  50.0 (  9.5 :  40.5) Toga II 1.2 Beta3a     2551
                                  50.0 ( 12.0 :  38.0) Toga II 1.2 Beta2a     2566
                                  50.0 ( 12.5 :  37.5) Toga II 1.2 Beta2aE26  2526
                                  50.0 ( 14.0 :  36.0) Toga II 1.3 Beta1      2535
                                  50.0 ( 16.0 :  34.0) Toga II 1.2 Beta3c     2528
                                  50.0 ( 16.0 :  34.0) Toga II 1.2 Beta3      2491
                                  50.0 ( 19.0 :  31.0) Toga II 1.2 Beta3b     2527

See where Toga II 1.2 Beta2a* ended up in each match.

These games were all CCRL 40/4 equivalent, using a common opening book and common testing conditions.

I am also running CCRL 40/12 equivalent matches, and there Toga II 1.2 Beta2a is not leading!

*Toga II 1.2 Beta2a has the highest overall rating across all opponents so far at CCRL 40/4 equivalent.

EDIT 2: Ignore the actual Elo ratings (the relative ones are fine); these are based on a 2500 offset, and I have made no attempt to standardise them.

Uri Blass wrote:I do not trust the statistical error given as + and - because I guess that it is based on some wrong assumptions (for example, that the probability is the same for white and black, or that the rating does not depend on the opponent, when it is possible that a program scores better or worse than expected against particular opponents).

My guess, based on a comparison between CEGT and CCRL, is that the real error is smaller than the number you give in more than 95% of cases, but I am not sure about it. It would also be interesting to know whether I am correct in guessing that after 50 games we can expect the rating not to change by more than 20 Elo in most cases.

Uri

Post by **Uri Blass** » Wed Jun 21, 2006 12:50 pm

ShaunBrewer wrote:

Uri Blass wrote:I guess that you have a date for every game in your database, so it may be interesting to calculate ratings for all programs based only on the first 50 games that each played and on the games that took place in that period.

You can do it by simply ignoring all games after the 50th game of the relevant program and calculating the ratings from the remaining data.

It may be interesting to compare the ratings of programs after 50 games with their ratings after more games to see the change.

Uri,
I think it would be interesting to look at this. However, taking the first 50 games would be prone to a large error, as certain engines perform better or worse depending on the opponent, and the first 50 may all be against the same engine. If instead you took the first n games against each opponent (with n chosen so that the total is near 50), this could give a valid comparison. Unfortunately, I believe you are going to be disappointed with the results: it does seem that you need a huge number of games to prove superiority between engines of similar strength. I have been testing Toga versions for Thomas and am finding it very difficult to prove or disprove improvements.

Shaun

As far as I can see, the difference between the Toga versions is very small, and of course under these conditions you cannot tell which is better based on 50 games.

I did not claim that 50 games are enough to decide which version is better, only that a rating estimated from 50 games may be wrong by less than 20 Elo.

These claims are different.

Note that after 50 games the stated possible error is somewhere near 80 Elo, which means that we can expect the rating to be wrong by more than 80 Elo in 5% of cases; the question is whether this expectation is borne out by experience.

Uri
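The figure of roughly 80 Elo after 50 games can be sanity-checked with a simple normal approximation. The sketch below is not how Bayeselo computes its + and - columns; it assumes independent games, a 50% expected score and a 33% draw ratio (both values are assumptions chosen to resemble the tables in this thread), and converts a 95% interval on the score into Elo with the standard logistic curve. The function names are made up for illustration.

```python
import math

def elo_from_score(score: float) -> float:
    """Elo difference implied by an expected score under the logistic Elo curve."""
    return 400.0 * math.log10(score / (1.0 - score))

def elo_margin_95(games: int, score: float = 0.5, draw_ratio: float = 0.33) -> float:
    """Approximate 95% half-width, in Elo, of a rating estimated from `games` games.

    Assumes independent games and a normal approximation for the mean score.
    """
    win = score - draw_ratio / 2.0
    loss = 1.0 - win - draw_ratio
    # Per-game variance of the result (1, 0.5 or 0 points) around the mean score.
    variance = (win * (1.0 - score) ** 2
                + draw_ratio * (0.5 - score) ** 2
                + loss * (0.0 - score) ** 2)
    std_error = math.sqrt(variance / games)
    upper = score + 1.96 * std_error
    return elo_from_score(upper) - elo_from_score(score)

for n in (50, 300, 450):
    print(n, "games:", round(elo_margin_95(n)), "Elo")
```

Under these assumptions the margin comes out at about 80 Elo for 50 games and shrinks with the square root of the number of games, which is the same order of magnitude as the + and - columns in the rating lists posted in this thread.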

Post by **Shaun** » Wed Jun 21, 2006 2:13 pm

Hi again Uri,

Uri Blass wrote: As far as I can see, the difference between the Toga versions is very small, and of course under these conditions you cannot tell which is better based on 50 games.

I did not claim that 50 games are enough to decide which version is better, only that a rating estimated from 50 games may be wrong by less than 20 Elo.

These claims are different.

Agreed

Uri Blass wrote: Note that after 50 games the stated possible error is somewhere near 80 Elo, which means that we can expect the rating to be wrong by more than 80 Elo in 5% of cases; the question is whether this expectation is borne out by experience.

Uri

If I get a chance I will split my db into six 50-game chunks (for the non-Toga engines), produce ratings for each chunk, and post them here; that might be interesting.

Shaun

Post by **Shaun** » Wed Jun 21, 2006 10:42 pm

Hi again Uri,

Okay I have taken my database of 2,750 games and split it into 5 chunks of 550 games. Each chunk has identical pairings on identical hardware.

The resulting rating lists are below:

Chunk1

Code: Select all

Rank Name                    Elo    +    - games score oppo. draws 
   1 Toga II 1.2 Beta3a     2554   61   61    90   62%  2476   29% 
   2 Toga II 1.2 Beta3      2548  156  156    10   70%  2447   60% 
   3 Rybka 1.1 32-bit       2540   71   71    60   52%  2529   37% 
   4 Toga II 1.3 Beta1      2538   60   60    90   59%  2476   30% 
   5 Hiarcs 10 UCI          2537   72   72    60   52%  2529   33% 
   6 Toga II 1.2 Beta2a     2526   60   60    90   57%  2476   30% 
   7 Toga II 1.2 Beta2aE26  2524   60   60    90   57%  2476   30% 
   8 Toga II 1.2 Beta3c     2519   58   58    90   57%  2476   39% 
   9 Toga II 1.2 Beta3b     2511   59   59    90   55%  2476   34% 
  10 Fritz 9                2498   74   74    60   46%  2529   28% 
  11 Spike 1.1              2498   71   71    60   45%  2529   37% 
  12 Hiarcs X50 UCI         2485   72   72    60   43%  2529   33% 
  13 CM10k Xperience        2457   75   75    60   39%  2529   25% 
  14 Glaurung 1.0.2         2447   67   67    70   36%  2531   33% 
  15 Pro Deo 1.1            2425   71   71    60   33%  2529   45% 
  16 SlowChess Blitz WV2.1  2392   78   78    60   31%  2529   22%

Chunk2

Code: Select all

Rank Name                    Elo    +    - games score oppo. draws 
   1 Rybka 1.1 32-bit       2583   71   71    60   58%  2529   40% 
   2 Hiarcs X50 UCI         2581   73   73    60   58%  2529   33% 
   3 Toga II 1.2 Beta2a     2575   59   59    90   65%  2474   37% 
   4 Toga II 1.2 Beta3      2559  176  176    10   75%  2414   30% 
   5 Fritz 9                2548   76   76    60   53%  2529   22% 
   6 Toga II 1.2 Beta3c     2540   61   61    90   59%  2474   26% 
   7 Toga II 1.2 Beta2aE26  2535   62   62    90   59%  2474   24% 
   8 Toga II 1.2 Beta3b     2516   61   61    90   56%  2474   27% 
   9 Toga II 1.2 Beta3a     2510   60   60    90   55%  2474   32% 
  10 Toga II 1.3 Beta1      2499   60   60    90   54%  2474   32% 
  11 Spike 1.1              2481   71   71    60   42%  2529   40% 
  12 Hiarcs 10 UCI          2476   70   70    60   41%  2529   42% 
  13 CM10k Xperience        2427   77   77    60   36%  2529   18% 
  14 Glaurung 1.0.2         2414   70   70    70   32%  2533   24% 
  15 Pro Deo 1.1            2394   77   77    60   31%  2529   25% 
  16 SlowChess Blitz WV2.1  2365   79   79    60   27%  2529   23%

Chunk3

Code: Select all

Rank Name                    Elo    +    - games score oppo. draws 
   1 Hiarcs X50 UCI         2603   72   72    60   58%  2556   35% 
   2 Rybka 1.1 32-bit       2596   71   71    60   57%  2556   40% 
   3 Toga II 1.2 Beta2a     2592   62   62    90   66%  2478   29% 
   4 Toga II 1.2 Beta3b     2578   62   62    90   64%  2478   29% 
   5 Toga II 1.2 Beta3a     2560   62   62    90   62%  2478   26% 
   6 Toga II 1.2 Beta2aE26  2551   62   62    90   61%  2478   26% 
   7 Toga II 1.2 Beta3c     2532   61   61    90   57%  2478   27% 
   8 Toga II 1.3 Beta1      2521   60   60    90   56%  2478   31% 
   9 Hiarcs 10 UCI          2512   73   73    60   43%  2556   30% 
  10 Spike 1.1              2505   71   71    60   42%  2556   37% 
  11 Fritz 9                2481   75   75    60   39%  2556   25% 
  12 SlowChess Blitz WV2.1  2427   76   76    60   31%  2556   28% 
  13 Pro Deo 1.1            2421   78   78    60   31%  2556   22% 
  14 CM10k Xperience        2418   80   80    60   32%  2556   17% 
  15 Toga II 1.2 Beta3      2363  158  158    10   55%  2339   50% 
  16 Glaurung 1.0.2         2339   75   75    70   25%  2528   21%

Chunk4

Code: Select all

Rank Name                    Elo    +    - games score oppo. draws 
   1 Rybka 1.1 32-bit       2626   74   74    60   64%  2535   32% 
   2 Toga II 1.3 Beta1      2563   59   59    90   62%  2480   37% 
   3 Toga II 1.2 Beta3c     2559   61   61    90   62%  2480   31% 
   4 Fritz 9                2557   72   72    60   53%  2535   33% 
   5 Hiarcs X50 UCI         2553   72   72    60   53%  2535   35% 
   6 Toga II 1.2 Beta3a     2546   61   61    90   61%  2480   32% 
   7 Toga II 1.2 Beta2a     2535   59   59    90   59%  2480   36% 
   8 Toga II 1.2 Beta3b     2508   61   61    90   54%  2480   30% 
   9 Hiarcs 10 UCI          2508   73   73    60   46%  2535   32% 
  10 Toga II 1.2 Beta2aE26  2499   61   61    90   52%  2480   24% 
  11 Toga II 1.2 Beta3      2472  179  179    10   70%  2346   20% 
  12 Spike 1.1              2458   73   73    60   38%  2535   35% 
  13 SlowChess Blitz WV2.1  2443   73   73    60   36%  2535   35% 
  14 Pro Deo 1.1            2433   75   75    60   35%  2535   27% 
  15 CM10k Xperience        2394   76   76    60   29%  2535   28% 
  16 Glaurung 1.0.2         2346   73   73    70   24%  2526   27%

Chunk5

Code: Select all

Rank Name                    Elo    +    - games score oppo. draws 
   1 Rybka 1.1 32-bit       2683   72   72    60   73%  2544   45% 
   2 Toga II 1.2 Beta2a     2600   62   62    90   68%  2473   31% 
   3 Toga II 1.2 Beta3a     2582   64   64    90   64%  2473   21% 
   4 Toga II 1.3 Beta1      2552   60   60    90   62%  2473   39% 
   5 Hiarcs X50 UCI         2550   72   72    60   51%  2544   32% 
   6 Hiarcs 10 UCI          2542   72   72    60   50%  2544   33% 
   7 Toga II 1.2 Beta2aE26  2523   60   60    90   57%  2473   31% 
   8 Toga II 1.2 Beta3b     2517   61   61    90   57%  2473   29% 
   9 Fritz 9                2516   69   69    60   45%  2544   47% 
  10 Toga II 1.2 Beta3c     2491   59   59    90   52%  2473   38% 
  11 Toga II 1.2 Beta3      2477  179  179    10   70%  2356   20% 
  12 CM10k Xperience        2435   75   75    60   34%  2544   28% 
  13 Spike 1.1              2424   77   77    60   33%  2544   25% 
  14 SlowChess Blitz WV2.1  2409   77   77    60   31%  2544   25% 
  15 Glaurung 1.0.2         2356   73   73    70   24%  2535   26% 
  16 Pro Deo 1.1            2344   82   82    60   23%  2544   22%

Here is the rating list for the 5 combined:

Code: Select all

Rank Name                    Elo    +    - games score oppo. draws 
   1 Rybka 1.1 32-bit       2608   32   32   300   61%  2539   39% 
   2 Toga II 1.2 Beta2a     2566   27   27   450   63%  2475   32% 
   3 Hiarcs X50 UCI         2555   33   33   300   53%  2539   34% 
   4 Toga II 1.2 Beta3a     2551   28   28   450   61%  2475   28% 
   5 Toga II 1.3 Beta1      2535   27   27   450   58%  2475   34% 
   6 Toga II 1.2 Beta3c     2528   27   27   450   58%  2475   32% 
   7 Toga II 1.2 Beta3b     2527   27   27   450   57%  2475   30% 
   8 Toga II 1.2 Beta2aE26  2526   28   28   450   57%  2475   27% 
   9 Fritz 9                2519   33   33   300   47%  2539   31% 
  10 Hiarcs 10 UCI          2515   33   33   300   46%  2539   34% 
  11 Toga II 1.2 Beta3      2491   80   80    50   68%  2378   36% 
  12 Spike 1.1              2473   33   33   300   40%  2539   35% 
  13 CM10k Xperience        2423   35   35   300   34%  2539   23% 
  14 SlowChess Blitz WV2.1  2404   35   35   300   31%  2539   27% 
  15 Pro Deo 1.1            2401   35   35   300   30%  2539   28% 
  16 Glaurung 1.0.2         2378   33   33   350   28%  2532   26%

Elo ratings are all based on an arbitrary 2500 offset.

Shaun

Post by **Uri Blass** » Wed Jun 21, 2006 11:38 pm

Here are some statistics about the results:

1) In most cases the rating after 1/5 of the games missed the final rating by more than 20 Elo. It seems that 50 games are not enough for that accuracy, but 90 games are.
2) Out of 80 rating estimates, only 2 were outside the bounds given in the table.

One example is Rybka 1.1 32-bit, where 2683-72 > 2608; the other is Glaurung 1.0.2, where 2447-67 > 2378.

I do not know whether it is an accident that these were the highest- and lowest-rated programs in the list; it is possible that the error is larger for programs at the top or the bottom of the list.

Rank Name                    Elo    +    - games score oppo. draws chunk1 chunk2 chunk3 chunk4 chunk5 within20 errors
   1 Rybka 1.1 32-bit       2608   32   32   300   61%  2539   39%   2540   2583   2596   2626   2683      2/5    1/5
   2 Toga II 1.2 Beta2a     2566   27   27   450   63%  2475   32%   2526   2575   2592   2535   2600      1/5    0/5
   3 Hiarcs X50 UCI         2555   33   33   300   53%  2539   34%   2485   2581   2603   2553   2550      2/5    0/5
   4 Toga II 1.2 Beta3a     2551   28   28   450   61%  2475   28%   2554   2510   2560   2546   2582      3/5    0/5
   5 Toga II 1.3 Beta1      2535   27   27   450   58%  2475   34%   2538   2499   2521   2563   2552      3/5    0/5
   6 Toga II 1.2 Beta3c     2528   27   27   450   58%  2475   32%   2519   2540   2532   2559   2491      3/5    0/5
   7 Toga II 1.2 Beta3b     2527   27   27   450   57%  2475   30%   2511   2516   2578   2508   2517      4/5    0/5
   8 Toga II 1.2 Beta2aE26  2526   28   28   450   57%  2475   27%   2524   2535   2551   2499   2523      3/5    0/5
   9 Fritz 9                2519   33   33   300   47%  2539   31%   2498   2548   2481   2557   2516      1/5    0/5
  10 Hiarcs 10 UCI          2515   33   33   300   46%  2539   34%   2537   2476   2512   2508   2542      2/5    0/5
  11 Toga II 1.2 Beta3      2491   80   80    50   68%  2378   36%   2548   2559   2363   2472   2477      2/5    0/5
  12 Spike 1.1              2473   33   33   300   40%  2539   35%   2498   2481   2505   2458   2424      2/5    0/5
  13 CM10k Xperience        2423   35   35   300   34%  2539   23%   2457   2427   2418   2394   2435      3/5    0/5
  14 SlowChess Blitz WV2.1  2404   35   35   300   31%  2539   27%   2392   2365   2427   2443   2409      2/5    0/5
  15 Pro Deo 1.1            2401   35   35   300   30%  2539   28%   2425   2394   2421   2433   2344      2/5    0/5
  16 Glaurung 1.0.2         2378   33   33   350   28%  2532   26%   2447   2414   2339   2346   2356      0/5    1/5
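The two tallies in the table (how many chunk ratings land within 20 Elo of the final rating, and how many miss it by more than the chunk's own + / - margin) can be recomputed mechanically. The sketch below does this for the two engines singled out above, with the numbers copied from Shaun's chunk tables; the `tally` helper and the data layout are illustrative choices. Like the check in the post, it treats the combined rating as the reference value and ignores that rating's own uncertainty.

```python
# Final (combined) rating, then (chunk rating, chunk +/- margin) for chunks 1 to 5,
# copied from the rating lists posted earlier in the thread.
chunks = {
    "Rybka 1.1 32-bit": (2608, [(2540, 71), (2583, 71), (2596, 71), (2626, 74), (2683, 72)]),
    "Glaurung 1.0.2": (2378, [(2447, 67), (2414, 70), (2339, 75), (2346, 73), (2356, 73)]),
}

def tally(final, estimates, tolerance=20):
    """Return (chunks within `tolerance` Elo of final, chunks whose bounds miss final)."""
    within = sum(1 for elo, _ in estimates if abs(elo - final) <= tolerance)
    outside = sum(1 for elo, margin in estimates if abs(elo - final) > margin)
    return within, outside

for name, (final, estimates) in chunks.items():
    within, outside = tally(final, estimates)
    print(f"{name}: {within}/5 within 20 Elo, {outside}/5 outside the +/- bounds")
```

For these two engines the script gives 2/5 and 1/5 for Rybka and 0/5 and 1/5 for Glaurung, matching the corresponding rows of the table.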

Post by **Shaun** » Wed Jun 21, 2006 11:50 pm

Hi Uri,

The stats are interesting. These games were at 40/4 equivalent; in a short while I will have a similar batch at 40/12 equivalent, and I will do the same split with these.

I wonder if a longer time control will produce less deviation between the batches?

I am also running some round-robin tournaments; I will look at these too.

Anything interesting I will post in this thread.

All the best

Shaun

Post by **Kirill Kryukov** » Thu Jun 22, 2006 3:45 am

Thanks for the comments, Uri. To the original question:

Uri Blass wrote:I guess that you have a date for every game in your database, so it may be interesting to calculate ratings for all programs based only on the first 50 games that each played and on the games that took place in that period.

You can do it by simply ignoring all games after the 50th game of the relevant program and calculating the ratings from the remaining data.

It may be interesting to compare the ratings of programs after 50 games with their ratings after more games to see the change.

Yes, we have dates for every game in our database. The problem with this method is that we often test by running 30-game matches, so the first 50 games will cover just two opponents, and the rating computed from those 50 games will represent the engine's performance against those two opponents only. It would be much better to take 50 random games of an engine rather than the first 50, and to repeat this resampling many times (say, 1000 times). Then we can estimate how big the stochastic effect is. This is called bootstrapping in statistics. It would be interesting to do, but it is not my priority at the moment. Of course anyone is welcome to do it; I would be very interested to see the results of such an experiment.
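The resampling idea described above can be sketched as follows. This is only an illustration under stated assumptions: the game list is made-up data, each game is reduced to a (result, opponent rating) pair, and the rating of each sample is a simple performance rating (mean opponent rating plus the Elo difference implied by the score) rather than a Bayeselo fit. The function names are hypothetical.

```python
import math
import random
import statistics

def performance_rating(games):
    """Simple performance rating: mean opponent rating plus the Elo
    difference implied by the score. This is NOT the Bayeselo model."""
    score = sum(result for result, _ in games) / len(games)
    # Clamp so that 0% or 100% samples do not produce infinite ratings.
    score = min(max(score, 0.01), 0.99)
    mean_opponent = sum(opp for _, opp in games) / len(games)
    return mean_opponent + 400.0 * math.log10(score / (1.0 - score))

def bootstrap_ratings(games, sample_size=50, repeats=1000, seed=0):
    """Rate `repeats` random samples of `sample_size` games drawn with replacement."""
    rng = random.Random(seed)
    return [performance_rating(rng.choices(games, k=sample_size))
            for _ in range(repeats)]

# Hypothetical data: (result, opponent rating) with result 1, 0.5 or 0.
games = [(1.0, 2500)] * 120 + [(0.5, 2550)] * 100 + [(0.0, 2600)] * 80
ratings = sorted(bootstrap_ratings(games))
print("all games   :", round(performance_rating(games)))
print("bootstrap sd:", round(statistics.stdev(ratings)))
print("95% range   :", round(ratings[25]), "to", round(ratings[974]))
```

The spread of the 1000 resampled ratings gives an empirical estimate of how much a 50-game rating can move by chance alone, which could then be compared with the + and - values reported by the rating tool.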

Uri Blass wrote:I do not trust the statistical error given as + and - because I guess that it is based on some wrong assumptions (for example, that the probability is the same for white and black, or that the rating does not depend on the opponent, when it is possible that a program scores better or worse than expected against particular opponents).

+ and - are based on a statistical model. We compute ratings with Bayeselo, which is based on a refined Elo model. Unlike the original Elo model, it does take White's advantage into account. It does not, however, take distortions into account, so there is still room for improvement in this area. I am happy with the Bayeselo model for now, but if a tool implementing a more complex model appears I will be curious to try it.
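To make "taking White's advantage into account" concrete, here is a sketch of a three-outcome extension of the Elo formula with a white-advantage term and a draw term. Whether this matches Bayeselo's exact parameterization is an assumption, and the values 30 and 100 are illustrative placeholders, not the tool's defaults. Note that the win probability still depends only on the rating difference, so an engine that is unusually strong or weak against one particular opponent is not captured.

```python
def outcome_probabilities(white_elo, black_elo, white_advantage=30.0, draw_elo=100.0):
    """Win/draw/loss probabilities from White's point of view.

    `white_advantage` shifts the rating difference in White's favour;
    `draw_elo` widens the band of rating differences that tend to end in draws.
    Both default values are illustrative assumptions.
    """
    def logistic(diff):
        return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

    diff = white_elo - black_elo
    p_white = logistic(diff + white_advantage - draw_elo)
    p_black = logistic(-diff - white_advantage - draw_elo)
    p_draw = 1.0 - p_white - p_black
    return p_white, p_draw, p_black

p_white, p_draw, p_black = outcome_probabilities(2500, 2500)
print(f"equal ratings: white {p_white:.2f}, draw {p_draw:.2f}, black {p_black:.2f}")
```

With equal ratings the white-win probability comes out higher than the black-win probability, which is exactly the asymmetry the plain Elo formula lacks.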

Uri Blass wrote:My guess, based on a comparison between CEGT and CCRL, is that the real error is smaller than the number you give in more than 95% of cases, but I am not sure about it. It would also be interesting to know whether I am correct in guessing that after 50 games we can expect the rating not to change by more than 20 Elo in most cases.

Uri

My feeling, based on the dynamics of our ratings over time, is that the real error is smaller than those numbers in less than 95% of cases. This is because, as I said, the Bayeselo model (while superior to Elo) does not take into account distortions caused by some engines being unexpectedly strong or weak against particular opponents.

This is of course just a feeling or a guess, but I will be interested to hear if you (or anyone) does any detailed analysis.

Post by **Kirill Kryukov** » Thu Jun 22, 2006 2:06 pm

ShaunBrewer wrote:Hi Uri,

The stats are interesting. These games were at 40/4 equivalent; in a short while I will have a similar batch at 40/12 equivalent, and I will do the same split with these.

I wonder if a longer time control will produce less deviation between the batches?

I am also running some round-robin tournaments; I will look at these too.

Anything interesting I will post in this thread.

All the best

Shaun

Thanks for the interesting analysis! I would love to see some automated bootstrap test on a game database.

I don't see why the deviation at a longer time control should be smaller than in blitz. I would be surprised if it differed between time controls.

Post by **Shaun** » Thu Jun 22, 2006 4:44 pm

Kirill Kryukov wrote:
ShaunBrewer wrote:Hi Uri,

The stats are interesting. These games were at 40/4 equivalent; in a short while I will have a similar batch at 40/12 equivalent, and I will do the same split with these.

I wonder if a longer time control will produce less deviation between the batches?

I am also running some round-robin tournaments; I will look at these too.

Anything interesting I will post in this thread.

All the best

Shaun
Thanks for the interesting analysis! I would love to see some automated bootstrap test on a game database.

I don't see why the deviation at a longer time control should be smaller than in blitz. I would be surprised if it differed between time controls.

Hi Kirill,

My thought is that at faster time controls the engine's move choice changes more frequently (i.e. when you watch analysis, the 'best' move tends to change a fair bit at the start and then less frequently); this early indecision might cause more randomness in the results of faster games.

I should shortly have the same pairings at 40/12; I will do the same split.

I am not confident of the outcome - but it will be interesting either way.

Shaun

Post by **Kirill Kryukov** » Sat Jun 24, 2006 5:20 pm

ShaunBrewer wrote:My thought is that at faster time controls the engine's move choice changes more frequently (i.e. when you watch analysis, the 'best' move tends to change a fair bit at the start and then less frequently); this early indecision might cause more randomness in the results of faster games.

After some thinking, it seems possible to me.

It would be good to see whether it results in a larger deviation. I'm afraid it requires a bootstrap test to check. I'll add it to the end of my to-do list.
