An idea for new fixed depth rating list

Post by **Uri Blass** » Wed Jan 23, 2008 7:36 am

If you are interested in having a fixed-depth rating list, I can help by providing games and results.
Conditions that I use:
1) no tablebases
2) hash = 64 MB per engine
3) openings 1.a3, 1.a4, 1.b3, ... (all possible one-ply openings)
Results that I have so far:

Rybka 2.3.2a depth 6 vs Yace 0.99.87 depth 6: 37.5-2.5 (Yace depth 6 rejected)
Rybka 2.3.2a depth 6 vs Yace 0.99.87 depth 7: 30.5-9.5 (Yace depth 7 rejected)
Rybka 2.3.2a depth 6 vs Yace 0.99.87 depth 8: 17.5-22.5 (Yace depth 8 accepted)
Rybka 2.3.2a depth 6 vs Glaurung 2.0.1 depth 8: 36.5-3.5 (Glaurung needs bigger depth)
Rybka 2.3.2a depth 6 vs Glaurung 2.0.1 depth 9: 26-14
Rybka 2.3.2a depth 6 vs Glaurung 2.0.1 depth 10: 16-24 (Glaurung depth 10 accepted as it scored closest to 50%)
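
The acceptance rule above (keep the depth whose score against Rybka at depth 6 is closest to 50%) can be sketched in a few lines of Python. This is only an illustration: the function names are made up here, and the rule is stated explicitly in the post only for Glaurung, so applying it to Yace as well is an assumption.

```python
# Match results copied from the post:
# (opponent, opponent depth) -> (Rybka points, opponent points)
RESULTS = {
    ("Yace 0.99.87", 6): (37.5, 2.5),
    ("Yace 0.99.87", 7): (30.5, 9.5),
    ("Yace 0.99.87", 8): (17.5, 22.5),
    ("Glaurung 2.0.1", 8): (36.5, 3.5),
    ("Glaurung 2.0.1", 9): (26.0, 14.0),
    ("Glaurung 2.0.1", 10): (16.0, 24.0),
}


def opponent_score(rybka_points, opponent_points):
    """Fraction of the available points won by Rybka's opponent."""
    return opponent_points / (rybka_points + opponent_points)


def closest_to_even(results, engine):
    """Return the tested depth whose score is nearest to 50% (assumed rule)."""
    depths = [depth for (name, depth) in results if name == engine]
    return min(
        depths,
        key=lambda d: abs(opponent_score(*results[(engine, d)]) - 0.5),
    )


for engine in ("Yace 0.99.87", "Glaurung 2.0.1"):
    print(engine, "-> depth", closest_to_even(RESULTS, engine))
```

With the six results listed, this picks depth 8 for Yace and depth 10 for Glaurung, the same depths that were accepted.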

I have already posted about these ideas in CCC here:
http://64.68.157.89/forum/viewtopic.php ... 31&t=19138

Uri

Post by **Kirill Kryukov** » Wed Jan 23, 2008 9:08 am

I don't think any fixed-depth comparison makes sense, because: 1. Programs define depth differently. 2. Most programs use search extensions. 3. Even if we knew how to define depth, we would have no way to know at what depth a program is thinking, because all we have is the program's own output.

Tord once commented about depth and I saved it:

It makes no sense to compare the search depths of different programs. It is common to refer to the numbers printed by the chess programs as the “search depth”, but this is very misleading. No serious chess program today searches all lines in the search tree to exactly the same depth. Some lines are searched to many more plies than the search depth displayed on the screen, and other lines are pruned to a much lower depth. There are programs which extend (i.e. search more deeply) many lines and reduce or prune very little, and there are programs which do exactly the opposite. As a consequence, a ply N search can mean very different things for different programs.

The search depth displayed in the GUI is the iteration counter, nothing more. It can sometimes be useful for comparing different versions of a single program, but when comparing two completely different engines it tells you nothing of interest.

The best way to compare the search depths of two programs would probably be to compare the average depth of all nodes in the tree for both programs. As far as I know, no current programs display such information.

I wish the community could stop using the misleading term “search depth” for the numbers we see in the GUI, and replace it with the more precise term “iteration counter”. Instead of saying “program X finds move Y at depth N”, let’s start saying “program X finds move Y at the Nth iteration”.
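
To make the "average depth of all nodes" idea concrete, here is a minimal Python sketch on a synthetic uniform tree (no real chess involved). Everything in it is an assumption made for illustration: the tree shape, the `reduce_from` parameter, and the rule that later children are searched one ply shallower, which only loosely mimics a reduction scheme.

```python
from dataclasses import dataclass


@dataclass
class TreeStats:
    nodes: int = 0
    ply_sum: int = 0

    def average_ply(self):
        """Average distance from the root over all visited nodes."""
        return self.ply_sum / self.nodes if self.nodes else 0.0


def walk(depth_left, ply, branching, reduce_from, stats):
    """Visit a uniform synthetic tree and record the ply of every node.

    Children with index >= reduce_from are searched one ply shallower.
    """
    stats.nodes += 1
    stats.ply_sum += ply
    if depth_left <= 0:
        return
    for i in range(branching):
        reduction = 1 if i >= reduce_from else 0
        walk(depth_left - 1 - reduction, ply + 1, branching, reduce_from, stats)


def measure(nominal_depth, branching, reduce_from):
    stats = TreeStats()
    walk(nominal_depth, 0, branching, reduce_from, stats)
    return stats


full = measure(6, 4, reduce_from=4)     # no child is reduced
reduced = measure(6, 4, reduce_from=1)  # every child but the first is reduced
print(full.nodes, round(full.average_ply(), 2))
print(reduced.nodes, round(reduced.average_ply(), 2))
```

Both runs would report "depth 6" as their iteration counter, yet the average ply of the visited nodes is clearly lower in the reduced tree, which is the kind of difference the displayed number hides.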

Post by **Uri Blass** » Wed Jan 23, 2008 9:45 am

Kirill Kryukov wrote:I think any fixed depth comparison does not make sense. Because 1. Programs define depth differently. 2. Search extensions are used by most of programs. 3. Even if we knew how to define depth, we have no way to know at what depth a program is thinking, because all we have is program's own output.

The target of the list I have in mind is not to compare different engines and claim that engine A is better than engine B; it is a different one.

The first target is to see how much rating the same engine gains from an additional ply, and whether there are diminishing returns.
Once such a rating list exists, it can be used to help determine the rating of weak engines, to see whether they made progress and how much, because it saves part of the computer time.

If you test a weak engine (one that is not itself limited to fixed depth) against Rybka at fixed depth 6 or against Yace at depth 8, you can get a more reliable result than by testing it against weak engines. One problem with some weak engines is that they may make stupid mistakes because of bugs and lose against significantly weaker engines.

Imagine an engine that usually plays like 2300 but sometimes makes stupid mistakes because of a hash bug (in 30% of the games), so it gets a rating of 2100.
This engine may score only 60% against a 1900 engine and 40% against a 2300 engine; the result may be that the rating gap between those two becomes smaller, and the 1900 engine ends up with a higher rating than it deserves.
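
A rough check of these numbers, as a Python sketch. It assumes the common logistic Elo formula and assumes that every game hit by the bug is simply lost; both are simplifications chosen here, not something stated in the post.

```python
def expected_score(rating, opponent):
    """Expected score under the logistic Elo formula."""
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / 400.0))


def buggy_expected_score(true_rating, opponent, bug_rate):
    """Assumed model: a game hit by the bug is lost, the rest are played
    at true_rating."""
    return (1.0 - bug_rate) * expected_score(true_rating, opponent)


for opponent in (1900, 2300):
    buggy = buggy_expected_score(2300, opponent, 0.30)
    clean_2100 = expected_score(2100, opponent)
    print(opponent, round(buggy, 2), round(clean_2100, 2))
```

Under these assumptions the buggy engine scores about 64% against 1900 and 35% against 2300, in the same ballpark as the 60% and 40% above, while a bug-free 2100 engine would be expected to score about 76% and 24%. The buggy engine therefore does worse than its rating predicts against the weaker opponent and better against the stronger one, which is what pulls their ratings together.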

I think that using non-buggy engines at fixed depth to determine the rating of weak engines is a better solution; later, the buggy engines can get a rating based mainly on their performance against the fixed-depth non-buggy engines.

Uri

Post by **h.g.muller** » Thu Jan 24, 2008 8:50 am

I agree with Uri that strong, stable engines, weakened by somehow limiting their resources, would provide a lot better rating scale than engines that are weak because of horrendous bugs. The latter do often not conform to the underlying rating model at all. E.g. Omar is known to resign as soon as he sees he can checkmate the opponent. So the weaker the opponent, the lower Omar is likely to score against it. Such engines completely corrupt the rating scale if you extract ratings in the normal way.

I am not sure that fixed depth would be a good solution. The problem is that at fixed depth most engines start to play like an idiot in the end-game, because they no longer see obvious promotions in time. So a very large fraction of the games would be decided by luck, or by trivial evaluation differences like a Pawn-push bonus, which would enable a program to win even when 5 Pawns behind, just because it happens to get the promotion within its horizon first if the opponent happens to be a bit reluctant in pushing Pawns.

I understand why Uri prefers a hardware-independent metric for limiting program performance instead of limiting search time. But I think the number of nodes would be a more natural choice, as it would allow engines to adapt their search depth when the position simplifies. That different programs count nodes in different ways is also of no concern: if Rybka@100,000 nodes beat Glaurung@200,000 nodes because Rybka is lying about its node count, that would not be taken to mean that Rybka is stronger than Glaurung. To make that comparison, you could simultaneously publish the average time per move that each program takes. If Rybka@100k used 500 ms/move and Glaurung@200k only 200 ms/move on the same hardware, it would be obvious that Rybka@100k had an unfair advantage, and its higher rating would not really be considered an impressive feat. If on other hardware the times were the other way around, one could conclude that the strength of one or both programs is very hardware dependent.

I think that making a rating list of node-limited top engines is a very interesting undertaking from a fundamental point of view. It would tell us how rating scales with search-tree size, and would reveal which engines have the best scaling (e.g. Elo/log(nodes)). Buggy programs could then be played against a large variety of these accurately calibrated standards to determine their win probability as a function of opponent rating, so that not only could a rating be derived, but the width and shape of their performance curve could also be measured without distorting the rating scale.
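
One way to put a number on "Elo/log(nodes)" would be the least-squares slope of rating against log2 of the node budget, i.e. Elo gained per doubling of nodes. The Python sketch below is only an illustration of that arithmetic; the function name is made up, and the sample points are a synthetic straight line used to check the calculation, not measured ratings.

```python
import math


def elo_per_doubling(samples):
    """Least-squares slope of rating against log2(node budget).

    `samples` is a list of (nodes, elo) pairs for one engine.
    """
    xs = [math.log2(nodes) for nodes, _ in samples]
    ys = [elo for _, elo in samples]
    n = len(samples)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic line with exactly 50 Elo per doubling; NOT real measurements.
synthetic = [(1024, 1000.0), (2048, 1050.0), (4096, 1100.0), (8192, 1150.0)]
print(elo_per_doubling(synthetic))
```

With real gauntlet results, a slope that shrinks as the node budget grows would be the signature of diminishing returns.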

Post by **Uri Blass** » Thu Jan 24, 2008 2:11 pm

I agree that a fixed-node rating list is also interesting.
The main problems with fixed nodes are the following:

1) The Fritz interface does not support fixed nodes but does support fixed depth, and I believe that fewer engines support a fixed number of nodes (I did not check it).
2) If you use a small fixed number of nodes, some engines may not yet have a move in the PV; if you use a large number of nodes to prevent this problem, the level may be too high for the weak engines.

There are even extreme examples where an engine needs a long time to get a move in the PV.
Here is one composed example (not the most extreme one that you could compose):
Strelka needs 8 seconds to get a PV at depth 1 (the analysis was run under Fritz, and Fritz simply does not report nodes for depths 1-5).

I guess that in practical games this never happens, but even if an engine needs only 10,000 nodes to get a move in the PV, 10,000 nodes per move may still be too strong for the weak engines.

This problem can be solved if an engine starts by generating all legal moves and doing a fast evaluation of every legal move based on piece-square tables, so that it puts one move in the PV at depth 0 even without making a move on the board; but I think that most engines do not work like that.

4K3/PPPPPPP1/8/8/8/8/ppppppp1/4k3 w - - 0 1

Analysis by Strelka 2.0 B:

1.b7-b8Q a2-a1Q 2.a7-a8Q Qa1xa8 3.Qb8xa8 g2-g1Q
= (0.20) Depth: 1 00:00:08
1.b7-b8Q a2-a1Q 2.a7-a8Q Qa1xa8 3.Qb8xa8 g2-g1Q
= (0.20) Depth: 2 00:00:10
1.b7-b8Q a2-a1Q 2.a7-a8Q Qa1xa8 3.Qb8xa8 g2-g1Q
= (0.20) Depth: 3 00:00:15
1.g7-g8Q g2-g1Q 2.a7-a8Q a2-a1Q
= (0.08) Depth: 4 00:00:28
1.g7-g8Q g2-g1Q 2.c7-c8Q a2-a1Q 3.d7-d8Q b2-b1Q 4.b7-b8Q c2-c1Q
=+ (-0.41) Depth: 5 00:00:39
1.g7-g8Q g2-g1Q 2.c7-c8Q a2-a1Q 3.d7-d8Q b2-b1Q 4.b7-b8Q c2-c1Q
=+ (-0.41) Depth: 6 00:00:46 42960kN
1.a7-a8Q a2-a1Q 2.f7-f8Q c2-c1Q 3.g7-g8Q g2-g1Q 4.Qg8-e6 b2-b1Q 5.c7-c8Q
=+ (-0.29) Depth: 7 00:00:59 62513kN
1.a7-a8Q a2-a1Q 2.f7-f8Q c2-c1Q 3.g7-g8Q g2-g1Q 4.Qg8-e6 b2-b1Q 5.c7-c8Q
=+ (-0.29) Depth: 8 00:01:10 66303kN
1.b7-b8Q a2-a1Q 2.g7-g8Q g2-g1Q 3.Qb8-b6 Qg1xg8+ 4.f7xg8Q f2-f1Q 5.Qg8-g1 Ke1-d1 6.d7-d8Q c2-c1Q 7.a7-a8Q Qa1xa8 8.Qd8xa8 Qf1xg1 9.Qb6xg1+ e2-e1Q
= (0.08) Depth: 9 00:01:43 99980kN
1.b7-b8Q a2-a1Q 2.g7-g8Q g2-g1Q 3.Qb8-b6 Qg1xg8+ 4.f7xg8Q f2-f1Q 5.Qg8-g1 Ke1-d1 6.d7-d8Q c2-c1Q 7.a7-a8Q Qa1xa8 8.Qd8xa8 Qf1xg1 9.Qb6xg1+ e2-e1Q
= (0.08) Depth: 10 00:02:26 125636kN
1.b7-b8Q a2-a1Q 2.g7-g8Q g2-g1Q 3.Qb8-b6 Qg1-h2 4.c7-c8Q b2-b1Q 5.d7-d8Q Qb1xb6 6.Qd8xb6 Qa1-a4+ 7.Ke8-f8 c2-c1Q 8.a7-a8Q Qa4xa8 9.Qc8xa8 d2-d1Q 10.e7-e8Q f2-f1Q
= (0.04) Depth: 11 00:04:07 193777kN
1.g7-g8Q g2-g1Q 2.c7-c8Q c2-c1Q 3.b7-b8Q a2-a1Q 4.Qb8-b6 Qg1-h2 5.a7-a8Q Qa1xa8 6.Qc8xa8 b2-b1Q 7.d7-d8Q Qb1xb6 8.Qd8xb6 d2-d1Q 9.f7-f8Q f2-f1Q
= (0.09) Depth: 12 00:07:19 399458kN

(, 24.01.2008)
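
The depth-0 fallback described in this post (score every legal move with a piece-square table and put the best one in the PV before any search) might look roughly like the Python sketch below. It is a toy: the square encoding, the `pawn_bonus` table and its values are invented for illustration, and a real engine would supply the legal moves from its own move generator.

```python
def pawn_bonus(square):
    """Toy piece-square value: a pawn is worth more the further it has
    advanced. Squares are (file, rank) with rank 0..7 from the mover's side.
    The factor 10 is an arbitrary illustrative value."""
    _file, rank = square
    return 10 * rank


def depth0_move(moves, bonus):
    """Pick the move with the largest static gain, without any search.

    `moves` is a list of (from_square, to_square) pairs.
    Returns None when there is no legal move.
    """
    if not moves:
        return None
    return max(moves, key=lambda m: bonus(m[1]) - bonus(m[0]))


# Three hypothetical pawn moves: a single push, a double push, a single push.
moves = [((0, 1), (0, 2)), ((0, 1), (0, 3)), ((4, 5), (4, 6))]
print(depth0_move(moves, pawn_bonus))
```

The point is only that some move is available immediately, so a very small node budget never leaves the engine without a PV.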

Post by **Shaun** » Thu Jan 24, 2008 2:35 pm

Uri,

I am probably missing something, but isn't a time handicap what you are looking for? You have used these before...

Shaun

Post by **Uri Blass** » Thu Jan 24, 2008 6:37 pm

I also suggested time-handicap matches in the past, and I think it is good to play 40/40 against 40/4, but there is a problem with applying the same idea to 40/4 against faster time controls.

The problem with a time handicap at blitz time controls is that I am afraid the results may depend too much on the hardware and the interface, and that programs may have specific problems with very fast time controls.

Rybka at 40/0.4 minutes (an average of 0.6 seconds per move) may be too strong for many programs, and at very fast time controls there may be a problem if the interface steals 0.2 seconds per move from the engine, or if the engine does some initialization before every move that takes 0.2 seconds.

Another reason for my post about fixed depth is that there is an interesting question in computer chess, namely whether there are diminishing returns, and a list without fixed depth does not answer it.

Diminishing returns from time may be the result of a branching factor that is not constant, so checking whether there are diminishing returns from time does not answer the question of whether there are diminishing returns from depth.
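
The Strelka analysis posted earlier in this thread gives a small illustration of a non-constant branching factor. The Python sketch below takes the node counts reported there for depths 6 to 12 and computes the growth ratio from one depth to the next. Treat it as a rough indication only: the position is a composed one, and the counts are whatever Fritz displayed.

```python
# Node counts in kN as shown in the Strelka 2.0 B analysis for depths 6-12.
KNODES = {
    6: 42960,
    7: 62513,
    8: 66303,
    9: 99980,
    10: 125636,
    11: 193777,
    12: 399458,
}


def growth_ratios(knodes):
    """Ratio of node counts between consecutive reported depths."""
    depths = sorted(knodes)
    return {d: knodes[d] / knodes[d - 1] for d in depths[1:] if d - 1 in knodes}


for depth, ratio in growth_ratios(KNODES).items():
    print(f"depth {depth - 1} -> {depth}: x{ratio:.2f}")
```

In this one example the ratio ranges from about 1.06 to about 2.06, so equal steps in time do not correspond to equal steps in depth.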

Uri

Post by **Kirill Kryukov** » Sat Jan 26, 2008 6:11 am

Uri Blass wrote:The target of the list that I think about is not to compare different engine and to claim engine A is better than B but a different target.

My target is: 1. To find the ranking of free engines. 2. To estimate rating differences between close engines. (3. To find various other things in the process, like ponder-hit or draw-rate statistics.)

Uri Blass wrote:The first target is to see how much rating the same engine get from additional ply and if there is diminishing return.

I think this is not generally interesting, but it may be interesting for the developer of a particular engine?

Uri Blass wrote:After having a rating list the list can be used later to help to determine rating of weak engines to see if they made progress and how much progress they made because it save part of the computer time.

There are plenty of free engines of all strengths, and they can also be used as references to determine whether a particular engine made progress. I think it is important to test engines against close opponents only, and fortunately there are enough free engines of every strength to enable a good comparison.

Uri Blass wrote:If you test weak engine that is not used at fixed depth against rybka at fixed depth 6 or against Yace at depth 8 you can get
more reliable result relative to testing it against weak engines and one problem that happens with some weak engines is that they may do stupid mistakes because of bugs and lose against significantly weaker engines.

You always mention engine bugs. Each free engine has its own bugs, so using many different engines minimizes the effect of each particular bug. As long as an engine is able to play valid moves, its rating can be estimated by playing against other engines. Any bugs a particular engine may have will not distort the rating scale as a whole, because the other engines have different bugs, not that same one.

On the other hand, using Rybka or other strong engines at fixed depth can also be considered a "bug". In this case you are proposing to use many engines with the same sort of "bug", and this is exactly what will lead to the distortion of the rating scale that you are trying to avoid. Example scenario (though other scenarios are possible too): an engine with capped search depth is obviously weak tactically. Therefore the free engines (tested against these depth-capped strong engines) that are for some reason good against a tactically weak engine will get an advantage. So this strategy will reward some engines more than others. This is just a hypothesis; reality may be different. Anyway, my point is that by introducing the same kind of handicap everywhere you are introducing a systematic error that will distort the rating list. As a result, the ratings will be incorrect and may lead to wrong predictions when comparing free engines with each other. That's why I think it is better and more natural to test each free engine against other free engines, but only those of similar strength.

Uri Blass wrote:Imagine an engine that usually play like 2300 and sometime plays stupid mistakes because of hash bug(in 30% of the games) so it gets rating of 2100.
This engine may score only 60% against 1900 and 40% against 2300 and the result may be that the rating gap between them becomes smaller and the 1900 gets practically higher rating than the rating that it deserves.

You can't "usually" play like 2300. Either you play like 2300 or you don't. If an engine gets tested and receives a rating of 2100, we have no way to determine whether that was because of bugs or something else. We can't say that it "usually" plays like 2300 if it got a rating of 2100. All engines sometimes make strong moves and sometimes weak moves. Whether it was a lack of positional knowledge, bad luck, the result of a bug, or anything else is difficult to answer. Maybe an engine developer can answer that. Anyway, even if the 1900 engine ends up rated higher, that is OK, because more importantly it does not distort the ranking (order) of engines, and it does not distort the rating scale much either, because it is only one engine out of 50 or more.

Uri Blass wrote:I think that using non buggy engines at fixed depth to determine rating of weak engines is a better solution and later the buggy engines
can get rating mainly based on performance against the fixed depth non buggy engines.

Such a rating list is not efficient because: 1. It requires twice as many games as the normal approach. 2. It is difficult to say which engines are buggy and which are not. Are you proposing to consider all free engines buggy? 3. Limiting engines to fixed depth introduces a very specific kind of weakness that may reward some engines more than others, resulting in a very real distortion of the rating list. 4. Most people are interested in comparing free engines with other free engines, especially those of similar strength. It is only natural to let those engines play each other and compute the ratings from those games. This answers the question of how those engines rank relative to each other. A list based on fixed-depth engines does not answer the ranking question, because it is not obvious that the performance of free engines against each other correlates well with their performance against fixed-depth versions of stronger engines.

Post by **Uri Blass** » Mon Jan 28, 2008 6:31 am

I clearly care about rating and not only about ranking.

The problem is that I want the ratings to tell us which programs gain more from time, and unrealistic ratings can give a wrong impression.

I also want the ratings to be closer to ratings against humans, and I think it would be good if humans with known FIDE ratings volunteered to play games against engines for the CCRL list (under CCRL conditions, which means a generic book and an engine that does not ponder).

Uri

Post by **Uri Blass** » Mon Jan 28, 2008 6:45 am

Note that the advantage of testing strong engines at fixed depth is that you can get more games in the same time.

The reason is that engines can finish a game in less than a minute, and even if you test against weak engines you can save almost half of the time, because the opponent of the weak engine is going to use only a small percentage of the time.

I understand why you think that fixed depth can distort the rating, but I think that it is not going to.

Chess is basically mainly about tactics, and many weak engines are weaker mainly because they reach smaller depths.

I do not consider all weak engines buggy; only some of them have bugs.
The main advantage of testing at fixed depth, relative to testing against non-buggy weak engines, is simply that testing strong engines at fixed depth can be faster.

Uri

Post by **Kirill Kryukov** » Mon Jan 28, 2008 6:57 am

Hm.. yeah, I did not realize that fixed depth takes less time. That will help with the time problem, but fixed-depth testing is still less time-efficient than normal testing. Suppose you have 10 free engines you want to compare and you require 1000 games per engine. In the usual way you need 1000*10/2 = 5000 games, because each game provides information about both sides. With fixed-depth testing (if I understand it correctly) you instead introduce a number of additional engines, and those 10 free engines will have to play a full 10000 games before each gets 1000 games. (Or maybe fewer than 10000 if you allow mixing in normal games too.)
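
The game-count arithmetic above, written out as a small Python sketch (the function names are made up here):

```python
def games_among_themselves(engines, games_per_engine):
    """Engines play each other, so every game counts for both participants."""
    return engines * games_per_engine // 2


def games_vs_references(engines, games_per_engine):
    """Engines play only fixed-depth reference engines, so every game counts
    for just one of the engines being rated."""
    return engines * games_per_engine


print(games_among_themselves(10, 1000), games_vs_references(10, 1000))
```

This counts games only. It says nothing about how long each game takes, which is the other half of the efficiency argument in this thread.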

Uri Blass wrote:I also want rating to be closer to the rating against humans and
I think that it can be good if humans with known fide rating volunteer to play games against engines for the ccrl list(in ccrl conditions that mean generic book and engine that does not ponder).

Yes, this would be very nice.

Post by **h.g.muller** » Tue Jan 29, 2008 8:43 am

Uri Blass wrote:I agree that fixed node rating list is also interesting.
The main problems with fixed nodes are the following problems

1)Fritz interface does not support fixed nodes but support fixed depth and I believe that less engines support fixed number of nodes(I did not check it).
2)If you use small number of nodes as fixed number of nodes then it is possible that some engines may not have a move in the pv and if you use big number of nodes to prevent this problem then the level may be too high for the weak engines.

I don't see these as major problems. Even for engines that do not support a fixed number of nodes (and WinBoard engines are likely not to support it, as there is no WB command defined for it), you could still use a GUI that extracts the move from the 'show thinking' output, which in WB protocol should give the number of nodes. You would simply program the GUI to accept the move from the last status line with a node count below the preset limit.
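
A minimal Python sketch of that GUI-side filter follows. It assumes thinking lines in the WinBoard layout of four integers followed by the PV (ply, score, time, nodes, then the moves) and assumes the PV starts directly with the move, without a move number. The function names and the sample lines are hypothetical, made up to show the selection logic.

```python
def parse_thinking_line(line):
    """Parse 'ply score time nodes pv...'; return None for anything else."""
    fields = line.split()
    if len(fields) < 5:
        return None
    try:
        ply, score, _time_cs, nodes = (int(f) for f in fields[:4])
    except ValueError:
        return None
    return {"ply": ply, "score": score, "nodes": nodes, "move": fields[4]}


def move_within_node_limit(lines, node_limit):
    """Return the first PV move of the last line whose node count does not
    exceed node_limit, or None if no line qualifies."""
    chosen = None
    for line in lines:
        info = parse_thinking_line(line)
        if info is not None and info["nodes"] <= node_limit:
            chosen = info["move"]
    return chosen


# Hypothetical engine output, invented for this example.
sample = [
    "1 20 0 35 e2e4",
    "2 5 1 410 e2e4 e7e5",
    "3 18 3 2900 g1f3 d7d5 d2d4",
    "4 7 9 15500 d2d4 d7d5 c2c4 e7e6",
]
print(move_within_node_limit(sample, 10000))
```

With a limit of 10,000 nodes the sketch plays the move from the depth-3 line and ignores the depth-4 line, which was reported only after the budget had been exceeded.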

Surely you do not want to imply that Strelka will start to play illegal or random moves if you set it to a time of less than 8 sec per move? The information must be there. It might just be that the Fritz GUI doesn't print it.

Even if engines like Rybka could not be limited to play below 1000 Elo, I don't see that as a major problem. At some point of the scale another, naturally weaker calibration engine could take over.

Post by **h.g.muller** » Tue Jan 29, 2008 9:18 am

Kirill Kryukov wrote:You can't "usually" play like 2300. Either you play like 2300 or you don't. If an engine gets tested and receives a rating of 2100, we have no way to determine whether it was because of bugs or something else. We can't say that it "usually" plays like 2300, if it got rating of 2100. All engines make sometimes strong moves and sometimes weak moves. Was it a lack of positional knowledge, a bad luck, a result of bug, or anything else - is difficult to answer. May be for an engine developer it is possible to answer that. Anyway even if 1900 gets practically higher it's OK because more importantly it does not distort the ranking (order) of engines, and it does not distort the rating scale much either because it is only one engine out of 50 or more.

I think this point deserves some clarification: I think it is possible to 'play usually 2300', except for a bug.

The meaning of such a statement should be sought in the probability distribution of the performance-vs.-opponent-rating of an engine. Suppose we determine the win probability of engine X against every other engine on the scale. This will in general be a noisy curve, starting at 100% for an infinitely weak opponent, with a decreasing trend all the way to zero for extremely strong opponents. It will be noisy because particular engine styles might make it perform better against engine A than against the very slightly weaker engine B. But if the density of opponents along the rating scale is large enough everywhere, this noise could be smoothed out, and we would have a decreasing curve.

For 'normal' engines, this curve would be a (decreasing) sigmoid curve, and minus the derivative of this curve would be a bell curve centered on some average rating, giving the probability distribution of the performance of engine X. Now if the sub-optimal performance of an engine is the result of many small bugs, that each would suppress the performance by only a few rating points, and are sampled independently during the game, the central-limit theorem tells us that the performance will be distributed like a Gaussian. This is the assumption underlying the Elo model. (It even goes a bit further, in assuming that the width of this Gaussian is the same for any player).

Now suppose I construct engine X by taking Rybka and putting in code that makes it resign after 10 moves with 70% probability. If I measured its performance distribution, I would get the bell curve of Rybka, a clean Gaussian centered around 3100 Elo, except that its area would not be 100% but only 30%. Next to that there would be a sharp peak of area 70% at minus-infinite Elo, corresponding to the 70% of games in which it resigns (which would make it lose against almost any opponent, as there are not many engines, no matter how weak, that would lose even to Rybka within 10 moves).

The problem is that such a probability distribution of the performance does not conform to the underlying rating model at all. Converting the percentage score of a gauntlet of such an engine against a set of opponents (say from 2000-2500 Elo) would give a certain rating (X would score about 30% against them, so it would get something like 2100), but a gauntlet against 1000-1500 Elo engines would also give a 30% score (suggesting 1100), and testing it against 500-1000 Elo engines would suggest 600. There simply is no way to determine the rating of X through standard methods.
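
The three "suggested" ratings can be reproduced with the usual performance-rating conversion. The Python sketch below assumes the logistic form of the Elo formula rather than the Gaussian mentioned above, takes the midpoint of each opponent range as the average opponent rating, and reads the last range as 500-1000 Elo; all three are assumptions made here.

```python
import math


def performance_rating(avg_opponent, score):
    """Rating at which the logistic Elo formula predicts `score` against an
    opponent rated avg_opponent (requires 0 < score < 1)."""
    return avg_opponent + 400.0 * math.log10(score / (1.0 - score))


for low, high in ((2000, 2500), (1000, 1500), (500, 1000)):
    midpoint = (low + high) / 2
    print(low, high, round(performance_rating(midpoint, 0.30)))
```

A 30% score always lands roughly 147 points below the opposition, so the three gauntlets give about 2103, 1103 and 603. The result simply tracks whatever opponents were chosen, which is the failure being described.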

Look at the results of Rattate Bologna in the current WBEC 5th-division play-offs, and you will see exactly what I mean.

Post by **mpm** » Fri Feb 15, 2008 1:13 am

h.g.muller wrote:I agree with Uri that strong, stable engines, weakened by somehow limiting their resources, would provide a lot better rating scale than engines that are weak because of horrendous bugs. The latter do often not conform to the underlying rating model at all. E.g. Omar is known to resign as soon as he sees he can checkmate the opponent. So the weaker the opponent, the lower Omar is likely to score against it. Such engines completely corrupt the rating scale if you extract ratings in the normal way.

The rating of an engine is a product of all its behaviour. So it (of course) includes its bugs. That IS the normal way, and there's no corruption. If you don't want that, repair the bugs.

Post by **Uri Blass** » Fri Feb 15, 2008 8:39 am

You miss the point.
The problem is not that the rating of a buggy engine is lower, but that it spoils the ratings of other engines.

As an extreme example, if an engine plays half of its games like Rybka and half of its games like a beginner, it may score near 50% against almost everything.
If this engine gets a rating of 2500, then the ratings of engines that play it and are weaker than 2500 are going to go up, and the ratings of engines that play it and are stronger than 2500 are going to go down.

You may give that engine some rating, but it is not fair if its games change the ratings of the other engines.
Of course I know of no practical example where something so extreme happens, but even if an engine merely has a higher draw rate because it often allows draws by repetition in better positions, including this engine can reduce the Elo gap between engines, and I think that is not fair.

I want to see rating gaps that are based on non-buggy engines, where buggy engines do not change the ratings of others but only receive a rating.

Uri

CCRL Discussion Board
