Data-mining the HvdH Chess Study Database III

Endgame analysis using tablebases, EGTB generation, exchange, sharing, discussions, etc.
guyhaw
Posts: 489
Joined: Sat Jan 21, 2006 10:43 am
Sign-up code: 10159
Location: Reading, UK

Data-mining the HvdH Chess Study Database III

Post by guyhaw »

Eiko Bleicher and I have mined the remarkable Chess Study Database III, compiled by Harold van der Heijden and available via all good chess shops and via http://home.studieaccess.nl/heijd336/home.html. This version includes 67,691 studies.

Using CQL ( http://rbnn.com/cql/contents.html ), pgn2fen ( http://www.7sun.com/chess/ ) and Eiko's complete copy of Nalimov's EGTs, we identified all studies ending in the sub-7-man zone. Of these, we raised questions about some 2,950 studies where the mainline solution appears to contain a drawn or lost (for White) position in a 'Win Study', or a non-drawn position in a 'Draw Study'.
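For readers without CQL or pgn2fen to hand, here is a minimal sketch of the filtering idea using the python-chess library - not the tools we actually used, and the input file name is only a placeholder:

import chess
import chess.pgn

# Walk each study's mainline and report those that reach a sub-7-man position.
with open("heijden.pgn") as pgn:
    while True:
        game = chess.pgn.read_game(pgn)
        if game is None:
            break
        board = game.board()                    # starts from the study's FEN tag
        sub7_fens = []
        for move in game.mainline_moves():
            board.push(move)
            if chess.popcount(board.occupied) <= 6:
                sub7_fens.append(board.fen())   # a sub-7-man position reached
        if sub7_fens:
            print(game.headers.get("White", "?"),
                  game.headers.get("Result", "?"),
                  len(sub7_fens), "sub-7-man mainline positions")

CQL does the study-level selection far more flexibly than this; the sketch only shows the 'does the mainline reach the sub-7-man zone?' test and the FEN extraction.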

About half of the distilled 2,950 studies have an '@' against them in HvdH's database, indicating that he knew of a flaw (or 'cook' in Chess Composition language) in the study when db_III was published. Further cooks have accumulated in his files as input to version IV.

Nevertheless, the flaws we found add knowledge when they are found upstream of the already-noted flaw. The same will be true when MB checks the excellent HvdH corpus with his 7-man EGTs.

We had to remember that the Nalimov EGTs do not include positions with castling rights, a long-standing lacuna in the history of EGTs. Positions with castling rights were identified and checked out manually.

g
Kirill Kryukov
Site Admin
Posts: 7399
Joined: Sun Dec 18, 2005 9:58 am
Sign-up code: 0
Location: Mishima, Japan

Re: Data-mining the HvdH Chess Study Database III

Post by Kirill Kryukov »

Are the results available? Or will they be incorporated into a future edition of the Chess Study Database?
guyhaw
Posts: 489
Joined: Sat Jan 21, 2006 10:43 am
Sign-up code: 10159
Location: Reading, UK

Re: Data-mining the HvdH Chess Study Database III

Post by guyhaw »

My results are available in pairs of spreadsheets - one set for 'Draw Studies' and the other set for 'Win Studies'. However, they can only be followed up in context if you have the HvdH dbIII of Studies.

The smallest pair of spreadsheets just gives the first position and the first 'failure indicator' position per cooked study, together with tags identifying the serial number in the database, the author, the GBR and the King positions.
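For anyone unfamiliar with the GBR code in those tags: it summarises the force on the board as four digits for queens, rooks, bishops and knights - each digit being the number of white pieces of that type plus three times the number of black pieces - followed by the white and black pawn counts. A small illustrative function, assuming the python-chess library:

import chess

def gbr_code(board: chess.Board) -> str:
    # One digit per piece type in the order Q, R, B, N:
    # digit = (white pieces of that type) + 3 * (black pieces of that type).
    # Kings are not encoded; pawn counts follow after the dot.
    digits = "".join(
        str(len(board.pieces(pt, chess.WHITE)) + 3 * len(board.pieces(pt, chess.BLACK)))
        for pt in (chess.QUEEN, chess.ROOK, chess.BISHOP, chess.KNIGHT))
    return "%s.%d%d" % (digits,
                        len(board.pieces(chess.PAWN, chess.WHITE)),
                        len(board.pieces(chess.PAWN, chess.BLACK)))

print(gbr_code(chess.Board("8/8/8/8/8/1k6/4n3/K2BB3 w - - 0 1")))   # KBB v KN: '0023.00'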

Email if you want these: the other pairs of spreadsheets (e.g. all sub-7-man positions in all studies) run into many Megabytes.

HvdH is working through the information, but even at 10 questioned studies per day, he would take most of a year. The plan was to get dbIV out next year, so I don't see all our information being in there. If MB follows up with his plan to inspect all sub-8-man positions in databases, there will be much more 'cook' information, either new or upstream of previous cook information.

I have also found over 3,000 wtm positions in otherwise unfaulted studies where White does not play a DTM-optimal move, indicating that there is a dual of some kind at that point.

g
Kirill Kryukov
Site Admin
Posts: 7399
Joined: Sun Dec 18, 2005 9:58 am
Sign-up code: 0
Location: Mishima, Japan

Re: Data-mining the HvdH Chess Study Database III

Post by Kirill Kryukov »

If HvdH has the information and will include it in the future release of the database, that's probably the best option for me. Thanks!
kb2ct
Posts: 43
Joined: Thu May 21, 2009 6:55 pm
Sign-up code: 10159

Re: Data-mining the HvdH Chess Study Database III

Post by kb2ct »

I bought the database. Exactly my cup of tea. :oops:
guyhaw
Posts: 489
Joined: Sat Jan 21, 2006 10:43 am
Sign-up code: 10159
Location: Reading, UK

Re: Data-mining the HvdH Chess Study Database III

Post by guyhaw »

Have also discovered 3,356 cases of White making a non-DTM-optimal move, which indicates some sort of dual.

These occur in 2,485 studies: about 1/3rd of these have an '@' in the Event tag, indicating that some sort of flaw is already known about them - maybe the dual I've picked up, maybe not.

g
kb2ct
Posts: 43
Joined: Thu May 21, 2009 6:55 pm
Sign-up code: 10159

Re: Data-mining the HvdH Chess Study Database III

Post by kb2ct »

How do you use your findings with the database?? :oops:
guyhaw
Posts: 489
Joined: Sat Jan 21, 2006 10:43 am
Sign-up code: 10159
Location: Reading, UK

Re: Data-mining the HvdH Chess Study Database III

Post by guyhaw »

I am not sure I understand the last question (as there are several 'databases' in play), but here's an attempted answer:

1) CQL can subset into a .pgn file all those studies in HvdH's dbIII database where (a) the result is marked as 1-0 and (b) the mainline includes a sub-7-man position. [... or the result is a Draw and there are non-draws in the mainline, but that's a separate 'job'.]

2) pgn2fen converts all the positions in the mainline into FEN notation, including the position after the last move if required.

3) I delete, from two copies of the FEN, all the characters not indicating a man (or a Pawn) - to help identify the start/end of 'positions in a study' and because the super-6-man positions will be deleted.

4) I identify the start and end of each sequence of positions from a study, (re-sort), delete the positions with more than 6 men (and re-sort).

Note that before sorting and resorting, it is necessary to convert cells with formulae (involving adjacent rows) into Values - so it's worth keeping a copy of the formulae in rows below the title of the worksheet, and outside the area that is subject to sorting and resorting.

5) I extracted a list of FEN positions, all supposed to be '1-0' for 'Win Studies', and sent them to Eiko Bleicher by email. He returned the positions with data taken from the EN EGTs - DTM(ate), number of optimal moves, winning/drawing/losing moves, number of moves - a bit of inexpensive redundancy there - a useful check on the integrity of the data, row by row.

6) This new 'EN EGT' information was zipped back into the spreadsheet and the zipping was checked as correct.

7) Now it was possible to identify non-win positions, first-fail positions in apparently-flawed studies, the downstream tail from the first-fail position, etc. These first-fail positions could then be pulled out by another sort-and-extract. (A rough do-it-yourself probing sketch appears after this numbered list.)

Early on here, it is also necessary to mark positions with castling rights as these will have to be assessed by non-EGT means, especially as the castling rights were often the point of the study - White not being able to win without them.

8) Down a parallel limb of the process, by putting the HvdH dbIII pgn into a spreadsheet [there are fewer than 1,000,000 rows of text], I could cut this down to just the 'White tag', 'Black tag', 'initial FEN position', 'Event' (giving publication data), and the original game comment inserted by CQL - vital for connecting back to the serial number of the study in the HvdH dbIII database. (G)AWK is useful here but it is doable in EXCEL.

9) This information was (SQL-)joined to the list of first-fail positions - ACCESS or any RDBMS technology would be useful here but it is doable in EXCEL. (A small sketch of such a join also appears after this list.)

10) A few random checks were run to confirm, as far as one reasonably can, that there had been no corruption in the above data-management process. The outputs were then sent to HvdH, who is putting them in the context of what he already knows. Remarkably, he has already dealt with over one quarter of the data - about 800 of the studies I sent him.
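The DTM probing in steps 5-7 was Eiko's work against the Nalimov EGTs, which python-chess cannot read. As a hedged do-it-yourself substitute, the same win/draw/loss classification and first-fail test can be sketched with Syzygy WDL tables (no DTM, and the table directory below is only a placeholder):

import chess
import chess.syzygy

def first_fail(fens, tb_dir="syzygy"):
    # Return the first FEN in a supposed '1-0' mainline that is not won for
    # White, or None if every probed position is winning. Positions with
    # castling rights are skipped: tablebases omit them, so they must be
    # assessed by other means (cf. the castling-rights note above).
    with chess.syzygy.open_tablebase(tb_dir) as tb:
        for fen in fens:
            board = chess.Board(fen)
            if board.castling_rights:
                print("castling rights - check manually:", fen)
                continue
            wdl = tb.probe_wdl(board)   # win/draw/loss from the side to move's view
            white_wins = wdl > 0 if board.turn == chess.WHITE else wdl < 0
            if not white_wins:
                return fen
    return None

Note that Syzygy WDL values of +/-1 are 'cursed' wins and losses (drawn under the 50-move rule); the sketch counts them as wins, which is usually, but not always, what a study verdict needs.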
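Steps 8 and 9 amount to a relational join on the study's serial number. Here is a tiny sketch of such a join using sqlite3 in place of ACCESS or EXCEL; the database file, table and column names are purely illustrative:

import sqlite3

# Hypothetical layout: 'studies' holds the tag data cut from the pgn, keyed on
# the HvdH serial number recovered from the CQL comment; 'fails' holds one
# first-fail FEN per questioned study.
con = sqlite3.connect("hvdh_checks.db")
rows = con.execute("""
    SELECT s.serial, s.white, s.black, s.event, f.fail_fen
    FROM studies AS s
    JOIN fails   AS f ON f.serial = s.serial
    ORDER BY s.serial
""").fetchall()
for serial, white, black, event, fail_fen in rows:
    print(serial, white, "-", black, "|", event, "| first fail:", fail_fen)
con.close()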

The sort of thing that is coming up:
a) a study could be simply mislabelled '1-0' when it is a 'Draw Study' ... pretty obvious if all the positions in the mainline are drawn,
b) the study in HvdH dbIII may have been mis-transcribed at some point in time before it reached HvdH's dbIII,
c) the 'cook' [flaw] that the above process finds has been notified to HvdH before, and is (or is not) already in HvdH's dbIII,
d) the cook that the above process finds is upstream of a cook already noted in HvdH's dbIII, so it adds information without identifying a 'new' cooked-study,
e) the cook that the above process finds is the first 'cook' to be found in the study.

HvdH will integrate the new information into his database of studies, within the constraints imposed by the CHESSBASE technology. It is an open question as to what IT provision would allow everything that one wants to know about a study to be recorded in a database ... publication history of the study in time, its relationship to other studies, comments about it including cooks and duals, ... The problem with the CHESSBASE database, and with the Studies Quarterly EG, is that it is not clear what was the composer's original content and what has been added by judges and commentators afterwards.

In a study, White is supposed to have (essentially) only one move that achieves the objective - 'essentially' because there are often moves that simply 'temporise', allowing Black to force a repetition of position, after which White still achieves the goal.

Obviously, if White plays a non-DTM-optimal move, there must also have been at least one DTM-optimal move, so there is a possibly-significant dual here. Of course, it may be that the White move in the given mainline simply takes a long route to a position that would have been reached earlier if a DTM-optimal move had been chosen ... i.e. we are seeing a temporising move rather than the most efficient one. So the dual may be more (or less) significant.
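On the dual point, the test is simply: did the played White move reach a position with the shortest distance to mate available? A hedged sketch of that check, using Gaviota tables (which give DTM for up to 5 men) via python-chess in place of the Nalimov probes actually used - the table directory and the sample position are illustrative only:

import chess
import chess.gaviota

def is_dtm_optimal(board, played, tb):
    # True if 'played' reaches a position with the shortest distance-to-mate
    # among all of White's legal moves. Assumes White to move in a won,
    # castling-free position covered by the tables.
    def dtm_after(move):
        board.push(move)
        try:
            dtm = tb.probe_dtm(board)
        finally:
            board.pop()
        # After White's move it is Black to move: dtm < 0 means Black is lost,
        # i.e. the move keeps the win; draws (0) and losses rank worst.
        return -dtm if dtm < 0 else float("inf")
    return dtm_after(played) == min(dtm_after(m) for m in board.legal_moves)

with chess.gaviota.open_tablebase("gaviota") as tb:        # placeholder path
    board = chess.Board("6k1/8/5KP1/8/8/8/8/8 w - - 0 1")  # a won K+P v K position
    print(is_dtm_optimal(board, chess.Move.from_uci("g6g7"), tb))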

I hope that's all clear.

g
kb2ct
Posts: 43
Joined: Thu May 21, 2009 6:55 pm
Sign-up code: 10159

Re: Data-mining the HvdH Chess Study Database III

Post by kb2ct »

I asked for your spreadsheet by private message.

If it is as thorough as your answer, I won't have any problem. :oops: :oops: