about CCRL stat
I guess that you have a date for every game in your database, so I think it may be interesting if you can calculate ratings for all programs based on only the first 50 games that they played and the games that happened in that period.
You can do it by simply ignoring all games after the 50th game of the relevant program and calculating the rating based on this data.
I think it may be interesting to compare the rating of programs after 50 games to their rating after more games to see the change.
I do not trust the statistical error of + - because I guess that it is based on some wrong assumptions (for example, an assumption that the probability is the same for white and black, or an assumption that rating does not depend on the opponent, when it is possible that a program scores better or worse against particular opponents).
My guess, based on a comparison between CEGT and CCRL, is that the real error is smaller than the number that you give in more than 95% of the cases, but I am not sure about it. It is also interesting to know if I am correct when I guess that after 50 games we can expect the rating not to change by more than 20 Elo in most cases.
Uri
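A minimal sketch of the truncation proposed in this post, assuming each game is stored as a (date, white, black, result) tuple. The tuple layout and the name `games_up_to_nth` are hypothetical and are not CCRL's actual tooling; the truncated list would still have to be passed to a rating program.

```python
from datetime import date

def games_up_to_nth(games, engine, n=50):
    """Keep every game dated no later than `engine`'s n-th game.

    `games` is a list of (date, white, black, result) tuples; this layout
    is an assumption made for illustration only.
    """
    ordered = sorted(games, key=lambda g: g[0])
    # Dates of the games the engine itself took part in, oldest first.
    own_dates = [g[0] for g in ordered if engine in (g[1], g[2])]
    if len(own_dates) <= n:
        return ordered  # the engine has not played more than n games
    cutoff = own_dates[n - 1]
    kept, own_seen = [], 0
    for g in ordered:
        if g[0] > cutoff:
            break
        if engine in (g[1], g[2]):
            if own_seen == n:
                continue  # same-day games beyond the n-th are ignored
            own_seen += 1
        kept.append(g)
    return kept

# Tiny made-up example: engine "A" has three games, we keep its first two.
example = [
    (date(2006, 1, 1), "A", "B", "1-0"),
    (date(2006, 1, 2), "C", "A", "1/2-1/2"),
    (date(2006, 1, 2), "B", "C", "0-1"),
    (date(2006, 1, 3), "A", "C", "1-0"),
    (date(2006, 1, 4), "B", "C", "1-0"),
]
first_two = games_up_to_nth(example, "A", n=2)
print(len(first_two))
```

Games between other engines that fall inside the period are kept, which is one reading of "games that happened in that period".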
- Graham Banks
- Posts: 27029
- Joined: Sun Dec 18, 2005 5:47 pm
- Sign-up code: 0
- Location: Auckland, NZ
Re: about CCRL stat
Uri Blass wrote:I guess that you have a date for every game in your database so I think that it may be interesting if you can calculate rating for all programs based on only the first 50 games that they played and games that happened in that period.
You can do it by simply ignoring all games after the 50th game of the relevant program and calculate rating based on this data.
I think that it may be interesting to compare rating of programs after 50 games to rating of programs after more games to see the change.
Uri,
I think it would be interesting to look at this. However, taking the first 50 games would be prone to a large error, as certain engines perform better/worse depending on the opponent and the first 50 may all be against the same engine. If you could instead take the first n games against each opponent (where n is chosen to get the total near 50), this could give a valid comparison. Unfortunately I believe you are going to be disappointed with the results - it does seem you need a huge number of games to prove superiority with similar strength engines. I have been testing Toga versions for Thomas and am finding it very difficult to prove/disprove improvements.
Shaun
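The "first n games against each opponent" selection suggested here could be sketched as follows. This is only an illustration: the (white, black, result) tuple layout and the function name `first_n_per_opponent` are assumptions, and n is simply chosen so that the total lands nearest the target.

```python
from collections import defaultdict

def first_n_per_opponent(games, engine, target_total=50):
    """Take the first n games against each opponent, with n chosen so the
    total is as close as possible to `target_total`.

    `games` is a chronologically ordered list of (white, black, result)
    tuples; this layout is assumed for illustration.
    """
    by_opponent = defaultdict(list)
    for g in games:
        white, black = g[0], g[1]
        if engine == white:
            by_opponent[black].append(g)
        elif engine == black:
            by_opponent[white].append(g)
    if not by_opponent:
        return []
    longest = max(len(v) for v in by_opponent.values())
    # Try every n and keep the one whose total lands nearest the target.
    best_n = min(
        range(1, longest + 1),
        key=lambda n: abs(
            sum(min(n, len(v)) for v in by_opponent.values()) - target_total
        ),
    )
    return [g for v in by_opponent.values() for g in v[:best_n]]
```

With six opponents and 50 games against each, this picks n = 8, giving 48 games in total.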
EDIT: Examples added:
see where Toga II 1.2 Beta2a* ended up in each match
Code: Select all
1 Rybka 1.1 32-bit 2608 300.0 (182.0 : 118.0)
50.0 ( 28.5 : 21.5) Toga II 1.2 Beta3a 2551
50.0 ( 29.0 : 21.0) Toga II 1.2 Beta3c 2528
50.0 ( 29.5 : 20.5) Toga II 1.2 Beta2a 2566
50.0 ( 31.0 : 19.0) Toga II 1.2 Beta3b 2527
50.0 ( 31.5 : 18.5) Toga II 1.3 Beta1 2535
50.0 ( 32.5 : 17.5) Toga II 1.2 Beta2aE26 2526
3 Hiarcs X50 UCI 2555 300.0 (157.5 : 142.5)
50.0 ( 23.0 : 27.0) Toga II 1.2 Beta3a 2551
50.0 ( 24.0 : 26.0) Toga II 1.2 Beta2a 2566
50.0 ( 25.5 : 24.5) Toga II 1.2 Beta3b 2527
50.0 ( 27.0 : 23.0) Toga II 1.3 Beta1 2535
50.0 ( 28.5 : 21.5) Toga II 1.2 Beta3c 2528
50.0 ( 29.5 : 20.5) Toga II 1.2 Beta2aE26 2526
9 Fritz 9 2519 300.0 (141.5 : 158.5)
50.0 ( 21.5 : 28.5) Toga II 1.2 Beta3a 2551
50.0 ( 23.0 : 27.0) Toga II 1.3 Beta1 2535
50.0 ( 24.0 : 26.0) Toga II 1.2 Beta2a 2566
50.0 ( 24.0 : 26.0) Toga II 1.2 Beta3c 2528
50.0 ( 24.5 : 25.5) Toga II 1.2 Beta3b 2527
50.0 ( 24.5 : 25.5) Toga II 1.2 Beta2aE26 2526
10 Hiarcs 10 UCI 2515 300.0 (139.0 : 161.0)
50.0 ( 21.5 : 28.5) Toga II 1.2 Beta2aE26 2526
50.0 ( 22.0 : 28.0) Toga II 1.2 Beta3c 2528
50.0 ( 22.5 : 27.5) Toga II 1.2 Beta3b 2527
50.0 ( 23.0 : 27.0) Toga II 1.2 Beta2a 2566
50.0 ( 23.0 : 27.0) Toga II 1.3 Beta1 2535
50.0 ( 27.0 : 23.0) Toga II 1.2 Beta3a 2551
12 Spike 1.1 2473 300.0 (119.0 : 181.0)
50.0 ( 17.5 : 32.5) Toga II 1.3 Beta1 2535
50.0 ( 17.5 : 32.5) Toga II 1.2 Beta2aE26 2526
50.0 ( 18.0 : 32.0) Toga II 1.2 Beta2a 2566
50.0 ( 19.0 : 31.0) Toga II 1.2 Beta3a 2551
50.0 ( 20.0 : 30.0) Toga II 1.2 Beta3c 2528
50.0 ( 27.0 : 23.0) Toga II 1.2 Beta3b 2527
13 CM10k Xperience 2423 300.0 (102.0 : 198.0)
50.0 ( 10.0 : 40.0) Toga II 1.2 Beta2a 2566
50.0 ( 14.5 : 35.5) Toga II 1.2 Beta3b 2527
50.0 ( 17.0 : 33.0) Toga II 1.2 Beta3c 2528
50.0 ( 19.5 : 30.5) Toga II 1.2 Beta3a 2551
50.0 ( 19.5 : 30.5) Toga II 1.3 Beta1 2535
50.0 ( 21.5 : 28.5) Toga II 1.2 Beta2aE26 2526
14 SlowChess Blitz WV2.1 2404 300.0 ( 93.0 : 207.0)
50.0 ( 13.5 : 36.5) Toga II 1.2 Beta2a 2566
50.0 ( 14.5 : 35.5) Toga II 1.2 Beta3a 2551
50.0 ( 15.0 : 35.0) Toga II 1.3 Beta1 2535
50.0 ( 15.0 : 35.0) Toga II 1.2 Beta3b 2527
50.0 ( 16.0 : 34.0) Toga II 1.2 Beta2aE26 2526
50.0 ( 19.0 : 31.0) Toga II 1.2 Beta3c 2528
15 Pro Deo 1.1 2401 300.0 ( 91.0 : 209.0)
50.0 ( 13.0 : 37.0) Toga II 1.2 Beta2a 2566
50.0 ( 14.0 : 36.0) Toga II 1.2 Beta3b 2527
50.0 ( 14.5 : 35.5) Toga II 1.2 Beta3a 2551
50.0 ( 15.5 : 34.5) Toga II 1.2 Beta3c 2528
50.0 ( 16.5 : 33.5) Toga II 1.3 Beta1 2535
50.0 ( 17.5 : 32.5) Toga II 1.2 Beta2aE26 2526
16 Glaurung 1.0.2 2378 350.0 ( 99.0 : 251.0)
50.0 ( 9.5 : 40.5) Toga II 1.2 Beta3a 2551
50.0 ( 12.0 : 38.0) Toga II 1.2 Beta2a 2566
50.0 ( 12.5 : 37.5) Toga II 1.2 Beta2aE26 2526
50.0 ( 14.0 : 36.0) Toga II 1.3 Beta1 2535
50.0 ( 16.0 : 34.0) Toga II 1.2 Beta3c 2528
50.0 ( 16.0 : 34.0) Toga II 1.2 Beta3 2491
50.0 ( 19.0 : 31.0) Toga II 1.2 Beta3b 2527
These games were all CCRL 40/4 equivalent, using a common opening book and testing conditions.
I am also running CCRL 40/12 equivalent matches, and Toga II 1.2 Beta2a is not leading!
*Toga II 1.2 Beta2a has the overall highest rating across all opponents so far at CCRL 40/4 equivalent.
EDIT 2: Ignore the actual Elo ratings (relative is fine); these are based on a 2500 offset - no attempt has been made by me to standardise.
Re: about CCRL stat
ShaunBrewer wrote:see where Toga II 1.2 Beta2a* ended up in each match
As far as I see the difference between Toga versions is very small and of course in these conditions you cannot find which is better based on 50 games.
I did not claim that it is possible to decide which version is better based on 50 games, but that it may be possible to be wrong by less than 20 Elo in guessing the rating after 50 games.
These claims are different.
Note that after 50 games you have a possible error of something near 80 Elo. That means we can expect the rating to be wrong by more than 80 Elo in 5% of the cases, and the question is whether this expectation is correct based on experience.
Uri
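As a rough check on the "near 80 Elo after 50 games" figure, here is a sketch that uses a normal approximation on the mean score and the plain logistic Elo curve. It is not the Bayeselo computation; the function name `elo_margin_95` and the default 33% draw ratio are assumptions made for illustration.

```python
import math

def elo_margin_95(n_games, score=0.5, draw_ratio=0.33):
    """Approximate 95% margin (in Elo) on a rating after `n_games`.

    Uses a normal approximation on the mean score and the logistic Elo
    curve; this is not the Bayeselo computation behind the CCRL lists.
    """
    wins = score - draw_ratio / 2.0
    # Per-game variance of a result worth 1, 0.5 or 0 points.
    variance = wins + draw_ratio / 4.0 - score ** 2
    std_error = math.sqrt(variance / n_games)
    # Slope of the Elo-vs-score curve at the observed score.
    slope = 400.0 / (math.log(10) * score * (1.0 - score))
    return 1.96 * std_error * slope

for n in (50, 90, 300, 450):
    print(n, round(elo_margin_95(n)))
```

Under these assumptions the margin for 50 games comes out a little under 80 Elo, and it shrinks with the square root of the number of games.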
Re: about CCRL stat
Hi again Uri,
Uri Blass wrote: As far as I see the difference between toga versions is very small and of course in these conditions you cannot find who is better based on 50 games.
I did not claim that it is possible to decide which version is better based on 50 games but that it may be possible to be wrong by less than 20 elo in guessing the rating after 50 games.
These claims are different.
Agreed.
Uri Blass wrote: Note that after 50 games you have possible error of something near 80 elo that mean that we can expect the rating to be wrong by more than 80 elo in 5% of the cases and the question is if this expectation is correct based on experience.
Uri
If I get a chance I will split my db into six 50-game chunks (non-Toga engines), produce ratings for each chunk and post them here - that might be interesting.
Shaun
Hi again Uri,
Okay I have taken my database of 2,750 games and split it into 5 chunks of 550 games. Each chunk has identical pairings on identical hardware.
The resulting rating lists are below: Chunk1 to Chunk5, then the rating list for the 5 combined. Elo ratings are all based on an arbitrary 2500 offset.
Chunk1
Code: Select all
Rank Name Elo + - games score oppo. draws
1 Toga II 1.2 Beta3a 2554 61 61 90 62% 2476 29%
2 Toga II 1.2 Beta3 2548 156 156 10 70% 2447 60%
3 Rybka 1.1 32-bit 2540 71 71 60 52% 2529 37%
4 Toga II 1.3 Beta1 2538 60 60 90 59% 2476 30%
5 Hiarcs 10 UCI 2537 72 72 60 52% 2529 33%
6 Toga II 1.2 Beta2a 2526 60 60 90 57% 2476 30%
7 Toga II 1.2 Beta2aE26 2524 60 60 90 57% 2476 30%
8 Toga II 1.2 Beta3c 2519 58 58 90 57% 2476 39%
9 Toga II 1.2 Beta3b 2511 59 59 90 55% 2476 34%
10 Fritz 9 2498 74 74 60 46% 2529 28%
11 Spike 1.1 2498 71 71 60 45% 2529 37%
12 Hiarcs X50 UCI 2485 72 72 60 43% 2529 33%
13 CM10k Xperience 2457 75 75 60 39% 2529 25%
14 Glaurung 1.0.2 2447 67 67 70 36% 2531 33%
15 Pro Deo 1.1 2425 71 71 60 33% 2529 45%
16 SlowChess Blitz WV2.1 2392 78 78 60 31% 2529 22%
Chunk2
Code: Select all
Rank Name Elo + - games score oppo. draws
1 Rybka 1.1 32-bit 2583 71 71 60 58% 2529 40%
2 Hiarcs X50 UCI 2581 73 73 60 58% 2529 33%
3 Toga II 1.2 Beta2a 2575 59 59 90 65% 2474 37%
4 Toga II 1.2 Beta3 2559 176 176 10 75% 2414 30%
5 Fritz 9 2548 76 76 60 53% 2529 22%
6 Toga II 1.2 Beta3c 2540 61 61 90 59% 2474 26%
7 Toga II 1.2 Beta2aE26 2535 62 62 90 59% 2474 24%
8 Toga II 1.2 Beta3b 2516 61 61 90 56% 2474 27%
9 Toga II 1.2 Beta3a 2510 60 60 90 55% 2474 32%
10 Toga II 1.3 Beta1 2499 60 60 90 54% 2474 32%
11 Spike 1.1 2481 71 71 60 42% 2529 40%
12 Hiarcs 10 UCI 2476 70 70 60 41% 2529 42%
13 CM10k Xperience 2427 77 77 60 36% 2529 18%
14 Glaurung 1.0.2 2414 70 70 70 32% 2533 24%
15 Pro Deo 1.1 2394 77 77 60 31% 2529 25%
16 SlowChess Blitz WV2.1 2365 79 79 60 27% 2529 23%
Chunk3
Code: Select all
Rank Name Elo + - games score oppo. draws
1 Hiarcs X50 UCI 2603 72 72 60 58% 2556 35%
2 Rybka 1.1 32-bit 2596 71 71 60 57% 2556 40%
3 Toga II 1.2 Beta2a 2592 62 62 90 66% 2478 29%
4 Toga II 1.2 Beta3b 2578 62 62 90 64% 2478 29%
5 Toga II 1.2 Beta3a 2560 62 62 90 62% 2478 26%
6 Toga II 1.2 Beta2aE26 2551 62 62 90 61% 2478 26%
7 Toga II 1.2 Beta3c 2532 61 61 90 57% 2478 27%
8 Toga II 1.3 Beta1 2521 60 60 90 56% 2478 31%
9 Hiarcs 10 UCI 2512 73 73 60 43% 2556 30%
10 Spike 1.1 2505 71 71 60 42% 2556 37%
11 Fritz 9 2481 75 75 60 39% 2556 25%
12 SlowChess Blitz WV2.1 2427 76 76 60 31% 2556 28%
13 Pro Deo 1.1 2421 78 78 60 31% 2556 22%
14 CM10k Xperience 2418 80 80 60 32% 2556 17%
15 Toga II 1.2 Beta3 2363 158 158 10 55% 2339 50%
16 Glaurung 1.0.2 2339 75 75 70 25% 2528 21%
Chunk4
Code: Select all
Rank Name Elo + - games score oppo. draws
1 Rybka 1.1 32-bit 2626 74 74 60 64% 2535 32%
2 Toga II 1.3 Beta1 2563 59 59 90 62% 2480 37%
3 Toga II 1.2 Beta3c 2559 61 61 90 62% 2480 31%
4 Fritz 9 2557 72 72 60 53% 2535 33%
5 Hiarcs X50 UCI 2553 72 72 60 53% 2535 35%
6 Toga II 1.2 Beta3a 2546 61 61 90 61% 2480 32%
7 Toga II 1.2 Beta2a 2535 59 59 90 59% 2480 36%
8 Toga II 1.2 Beta3b 2508 61 61 90 54% 2480 30%
9 Hiarcs 10 UCI 2508 73 73 60 46% 2535 32%
10 Toga II 1.2 Beta2aE26 2499 61 61 90 52% 2480 24%
11 Toga II 1.2 Beta3 2472 179 179 10 70% 2346 20%
12 Spike 1.1 2458 73 73 60 38% 2535 35%
13 SlowChess Blitz WV2.1 2443 73 73 60 36% 2535 35%
14 Pro Deo 1.1 2433 75 75 60 35% 2535 27%
15 CM10k Xperience 2394 76 76 60 29% 2535 28%
16 Glaurung 1.0.2 2346 73 73 70 24% 2526 27%
Chunk5
Code: Select all
Rank Name Elo + - games score oppo. draws
1 Rybka 1.1 32-bit 2683 72 72 60 73% 2544 45%
2 Toga II 1.2 Beta2a 2600 62 62 90 68% 2473 31%
3 Toga II 1.2 Beta3a 2582 64 64 90 64% 2473 21%
4 Toga II 1.3 Beta1 2552 60 60 90 62% 2473 39%
5 Hiarcs X50 UCI 2550 72 72 60 51% 2544 32%
6 Hiarcs 10 UCI 2542 72 72 60 50% 2544 33%
7 Toga II 1.2 Beta2aE26 2523 60 60 90 57% 2473 31%
8 Toga II 1.2 Beta3b 2517 61 61 90 57% 2473 29%
9 Fritz 9 2516 69 69 60 45% 2544 47%
10 Toga II 1.2 Beta3c 2491 59 59 90 52% 2473 38%
11 Toga II 1.2 Beta3 2477 179 179 10 70% 2356 20%
12 CM10k Xperience 2435 75 75 60 34% 2544 28%
13 Spike 1.1 2424 77 77 60 33% 2544 25%
14 SlowChess Blitz WV2.1 2409 77 77 60 31% 2544 25%
15 Glaurung 1.0.2 2356 73 73 70 24% 2535 26%
16 Pro Deo 1.1 2344 82 82 60 23% 2544 22%
Here is the rating list for the 5 combined:
Code: Select all
Rank Name Elo + - games score oppo. draws
1 Rybka 1.1 32-bit 2608 32 32 300 61% 2539 39%
2 Toga II 1.2 Beta2a 2566 27 27 450 63% 2475 32%
3 Hiarcs X50 UCI 2555 33 33 300 53% 2539 34%
4 Toga II 1.2 Beta3a 2551 28 28 450 61% 2475 28%
5 Toga II 1.3 Beta1 2535 27 27 450 58% 2475 34%
6 Toga II 1.2 Beta3c 2528 27 27 450 58% 2475 32%
7 Toga II 1.2 Beta3b 2527 27 27 450 57% 2475 30%
8 Toga II 1.2 Beta2aE26 2526 28 28 450 57% 2475 27%
9 Fritz 9 2519 33 33 300 47% 2539 31%
10 Hiarcs 10 UCI 2515 33 33 300 46% 2539 34%
11 Toga II 1.2 Beta3 2491 80 80 50 68% 2378 36%
12 Spike 1.1 2473 33 33 300 40% 2539 35%
13 CM10k Xperience 2423 35 35 300 34% 2539 23%
14 SlowChess Blitz WV2.1 2404 35 35 300 31% 2539 27%
15 Pro Deo 1.1 2401 35 35 300 30% 2539 28%
16 Glaurung 1.0.2 2378 33 33 350 28% 2532 26%
Shaun
Here are some statistics about the results:
1) In most cases the rating after 1/5 of the games missed the final rating by more than 20 Elo. It seems that 50 games are not enough for that accuracy, but 90 games are enough.
2) Out of 80 guesses of the rating, only 2 were outside the bounds given in the table. One example is Rybka 1.1 32-bit, where 2683-72>2608; the other is Glaurung 1.0.2, where 2447-67>2378.
I do not know if it was an accident that these were the highest and lowest rated programs in the list; it is possible that the error is higher for programs at the top or the bottom of the list.
Rank Name Elo + - games score oppo. draws Chunk1 Chunk2 Chunk3 Chunk4 Chunk5 within 20 Elo(out of bounds)
1 Rybka 1.1 32-bit 2608 32 32 300 61% 2539 39% 2540 2583 2596 2626 2683 2/5(1/5 errors)
2 Toga II 1.2 Beta2a 2566 27 27 450 63% 2475 32% 2526 2575 2592 2535 2600 1/5(0/5 errors)
3 Hiarcs X50 UCI 2555 33 33 300 53% 2539 34% 2485 2581 2603 2553 2550 2/5(0/5 errors)
4 Toga II 1.2 Beta3a 2551 28 28 450 61% 2475 28% 2554 2510 2560 2546 2582 3/5(0/5 errors)
5 Toga II 1.3 Beta1 2535 27 27 450 58% 2475 34% 2538 2499 2521 2563 2552 3/5(0/5 errors)
6 Toga II 1.2 Beta3c 2528 27 27 450 58% 2475 32% 2519 2540 2532 2559 2491 3/5(0/5 errors)
7 Toga II 1.2 Beta3b 2527 27 27 450 57% 2475 30% 2511 2516 2578 2508 2517 4/5(0/5 errors)
8 Toga II 1.2 Beta2aE26 2526 28 28 450 57% 2475 27% 2524 2535 2551 2499 2523 3/5(0/5 errors)
9 Fritz 9 2519 33 33 300 47% 2539 31% 2498 2548 2481 2557 2516 1/5(0/5 errors)
10 Hiarcs 10 UCI 2515 33 33 300 46% 2539 34% 2537 2476 2512 2508 2542 2/5(0/5 errors)
11 Toga II 1.2 Beta3 2491 80 80 50 68% 2378 36% 2548 2559 2363 2472 2477 2/5(0/5 errors)
12 Spike 1.1 2473 33 33 300 40% 2539 35% 2498 2481 2505 2458 2424 2/5(0/5 errors)
13 CM10k Xperience 2423 35 35 300 34% 2539 23% 2457 2427 2418 2394 2435 3/5(0/5 errors)
14 SlowChess Blitz WV2.1 2404 35 35 300 31% 2539 27% 2392 2365 2427 2443 2409 2/5(0/5 errors)
15 Pro Deo 1.1 2401 35 35 300 30% 2539 28% 2425 2394 2421 2433 2344 2/5(0/5 errors)
16 Glaurung 1.0.2 2378 33 33 350 28% 2532 26% 2447 2414 2339 2346 2356 0/5(1/5 errors)
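The two counts in the last column can be reproduced with a few lines. This is a sketch using the ratings and + - values copied from the lists posted above; the function name `compare_chunks` is made up for illustration.

```python
def compare_chunks(final, chunks, tol=20):
    """Count chunk ratings within `tol` Elo of the final rating, and
    chunk ratings whose +/- interval misses the final rating.

    `chunks` is a list of (rating, margin) pairs read from the tables.
    """
    within = sum(abs(r - final) <= tol for r, _ in chunks)
    out_of_bounds = sum(abs(r - final) > m for r, m in chunks)
    return within, out_of_bounds

# Values copied from the chunk lists and the combined list.
rybka = [(2540, 71), (2583, 71), (2596, 71), (2626, 74), (2683, 72)]
glaurung = [(2447, 67), (2414, 70), (2339, 75), (2346, 73), (2356, 73)]
print(compare_chunks(2608, rybka))     # (2, 1)
print(compare_chunks(2378, glaurung))  # (0, 1)
```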
Hi Uri,
The stats are interesting. These games were at 40/4 equivalent; in a short while I will have a similar batch at 40/12 equivalent and I will do the same split with those.
I wonder if a longer time control will produce less deviation between the batches?
I am also running some round robin tournaments; I will look at these too.
I will post anything interesting in this thread.
All the best
Shaun
- Kirill Kryukov
- Site Admin
- Posts: 7399
- Joined: Sun Dec 18, 2005 9:58 am
- Sign-up code: 0
- Location: Mishima, Japan
- Contact:
Re: about CCRL stat
Thanks for the comments, Uri. To the original question:
Uri Blass wrote:I guess that you have a date for every game in your database so I think that it may be interesting if you can calculate rating for all programs based on only the first 50 games that they played and games that happened in that period.
You can do it by simply ignoring all games after the 50th game of the relevant program and calculate rating based on this data.
I think that it may be interesting to compare rating of programs after 50 games to rating of programs after more games to see the change.
Yes, we have dates for every game in our database. The problem with this method is that we often test by running 30-game matches, so the first 50 games will represent just two opponents, and a rating computed from those 50 games will reflect the engine's performance against those 2 opponents only. Much better would be to take a random 50 games of an engine, not the first 50. It is also good to repeat this resampling many times (say, 1000 times); then we can estimate how big the stochastic effect is. This is called bootstrapping in statistics. It would be interesting to do, but it is not my priority at the moment. Of course anyone is welcome to do it, and I will be very interested to see the results of such an experiment.
Uri Blass wrote:I do not trust the statistical error of + - because I guess that this + - is based on some wrong assumptions (for example an assumption that the probability is the same for white and black or an assumption that rating is not dependent on the opponent when it is possible that some program has opponents that it can score better or worse against them).
+ and - are based on a statistical model. We compute ratings with Bayeselo, which is based on a refined Elo model. This model (unlike the original Elo model) does take into account White's advantage. However, it does not take into account distortions, so there is still room for improvement in this area. I am happy with the Bayeselo model for now, but if a tool implementing a more complex model appears I will be curious to try it.
Uri Blass wrote:My guess based on comparison between CEGT and CCRL is that the real error is really smaller than the number that you give in more than 95% of the cases but I am not sure about it and it is also interesting to know if I am correct when I guess that after 50 games we can expect the rating not to be changed by more than 20 elo in most cases.
Uri
My feeling, based on the dynamics of our ratings over time, is that the real error is smaller than those numbers in less than 95% of the cases. This is because, as I said, the Bayeselo model (while being superior to Elo) does not take into account distortions caused by the possibility of some engines being unexpectedly strong or weak versus some particular opponents.
This is of course just a feeling or guess, but I will be interested to hear if you (or anyone) do any detailed analysis.
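The resampling idea (random 50 games, repeated about 1000 times) might look like the sketch below. It is only an illustration: the rating is a plain logistic performance rating against an average opponent, used as a stand-in for Bayeselo, and the names `performance_elo` and `bootstrap_ratings` as well as the example engine are hypothetical.

```python
import math
import random

def performance_elo(scores, opponent_avg):
    """Rating implied by a mean score against an average opponent rating,
    using the plain logistic Elo curve (a stand-in for Bayeselo)."""
    mean = sum(scores) / len(scores)
    mean = min(max(mean, 0.01), 0.99)  # avoid infinite ratings
    return opponent_avg - 400.0 * math.log10(1.0 / mean - 1.0)

def bootstrap_ratings(scores, opponent_avg, sample_size=50,
                      repeats=1000, seed=0):
    """Resample `sample_size` games with replacement, `repeats` times,
    and return the sorted ratings obtained from each resample."""
    rng = random.Random(seed)
    ratings = [
        performance_elo(rng.choices(scores, k=sample_size), opponent_avg)
        for _ in range(repeats)
    ]
    return sorted(ratings)

# Hypothetical engine: 300 games, 40% wins, 30% draws, 30% losses.
scores = [1.0] * 120 + [0.5] * 90 + [0.0] * 90
ratings = bootstrap_ratings(scores, opponent_avg=2500)
low, high = ratings[25], ratings[974]  # central ~95% of the resamples
print(round(low), round(high))
```

The width of the interval between `low` and `high` is an empirical estimate of the stochastic effect for a 50-game sample, which can be compared with the + - values reported by the rating program.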
Last edited by Kirill Kryukov on Thu Jun 22, 2006 2:07 pm, edited 2 times in total.
KCEC | EGTB Online | 3x3 Chess | 3x4 Chess | 4x4 Chess | Longest Checkmates | EGTB Test Suite | Opening Sampler | EGTB Bounty | NULP
- Kirill Kryukov
ShaunBrewer wrote:Hi Uri,
The stats are interesting - these games were at 40/4 equivalent in a short while I will have a similar batch of 40/12 equivalent I will do the same split with these.
I wonder if a longer time control will produce less deviation between the batches?
I am also running some round robin tournaments I will look at these too.
Anything interesting I will post in this thread.
All the best
Shaun
Thanks for the interesting analysis! I would love to see some automated bootstrap test on a game database.
I don't see why the deviation at a longer time control should be smaller than in blitz. I would be surprised if it is different for different time controls.
Kirill Kryukov wrote:I don't see why deviation at longer time control should be smaller than in blitz. I would be surprised if it is different for different time controls.
Hi Kirill,
My thought is that at faster time controls the engine's move choice changes more frequently (i.e. when you watch analysis the 'best' move tends to change a fair bit at the start and then less frequently) - this early indecision might cause more randomness in the results of faster games.
I should shortly have the same pairings at 40/12, and I will do the same split.
I am not confident of the outcome - but it will be interesting either way.
Shaun
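One simple way to compare the deviation between batches at two time controls is to compute, per engine, the spread of its chunk ratings. The sketch below does this for two engines at 40/4 using the chunk ratings posted earlier in the thread; the 40/12 figures are not available yet, and the function name `batch_spread` is an assumption.

```python
from statistics import pstdev

def batch_spread(chunk_ratings):
    """Population standard deviation of each engine's per-chunk ratings."""
    return {name: pstdev(r) for name, r in chunk_ratings.items()}

# 40/4 chunk ratings for two engines, copied from the lists posted earlier.
blitz = {
    "Rybka 1.1 32-bit": [2540, 2583, 2596, 2626, 2683],
    "Fritz 9": [2498, 2548, 2481, 2557, 2516],
}
for name, sd in batch_spread(blitz).items():
    print(name, round(sd, 1))
```

Running the same function on the 40/12 chunks would give directly comparable numbers, although with only five chunks per engine the spread itself is a noisy estimate.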
- Kirill Kryukov
ShaunBrewer wrote:My thought is that at faster time controls the engine's move choice changes more frequently (i.e. when you watch analysis the 'best' move tends to change a fair bit at the start and then less frequently) - this early indecision might cause more randomness in the results of faster games.
After some thinking it seems possible to me. It would be good to see whether it results in a larger deviation. I'm afraid it requires a bootstrap test to check it. I'll add it to the end of my to-do list.