I am not sure I understand the last question (as there are several 'databases' in play), but here's an attempted answer:
1) CQL can subset into a .pgn file all those studies in HvdH's dbIII database where (a) the result is marked as 1-0 and (b) the mainline includes a sub-7-man position. [... or the result is a Draw and there are non-draws in the mainline, but that's a separate 'job'.]
2) pgn2fen converts all the positions in the mainline into FEN notation, including the position after the last move if required
3) I delete, from two copies of the FEN, all the characters not indicating a man (or a Pawn) - to help identify the start/end of 'positions in a study' and because the super-6-man positions will be deleted.
4) I identify the start and end of each sequence of positions from a study, (re-sort), delete the positions with more than 6 men (and re-sort).
Note that before sorting and resorting, it is necessary to convert cells with formulae (involving adjacent rows) into Values - so it's worth keeping a copy of the formulae in rows below the title of the worksheet, and outside the area that is subject to sorting and resorting.
5) I extracted a list of FEN positions, all supposed to be '1-0' for 'Win Studies', and sent them to Eiko Bleicher by email. He returned the positions with data taken from the EN EGTs - DTM(ate), number of optimal moves, winning/drawing/losing moves, number of moves - a bit of inexpensive redundancy there - a useful check on the integrity of the data, row by row.
6) This new 'EN EGT' information was zipped back into the spreadsheet and the zipping was checked as correct.
7) Now it was possible to identify non-win positions, first-fail positions in apparently-flawed studies, the downstream tail from the first-fail position etc. It was possible to extract these first-fail positions by another sort and extract.
Early on here, it is also necessary to mark positions with castling rights as these will have to be assessed by non-EGT means, especially as the castling rights were often the point of the study - White not being able to win without them.
8) Down a parallel limb of the process, by putting the HvdH dbIII pgn into a spreadsheet [there are fewer than 1,000,000 rows of text], I could cut this down to just the 'White tag', 'Black tag', 'initial FEN position', 'Event' (giving publication data), and 'Original game' comment inserted by CQL - vital for connecting back to the serial number of the study in the HvdH dbIII database. (G)AWK is useful here but it is doable in EXCEL.
9) This information was (SQL-)joined to the list of first-fail positions - ACCESS or any RDBMS technology would be useful here but it is doable in EXCEL.
10) A few random checks were run to ensure (or at least to raise one's confidence) that there had been no corruption in the above data management process. The outputs were then sent to HvdH who is putting them in the context of what he already knows. Remarkably, he has already dealt with over one quarter of the data - about 800 of the studies I sent him.
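For anyone who would rather script this than use a spreadsheet, steps 3), 4) and 7) might be sketched in Python roughly as follows. This is only an illustration, not what I actually ran: the function names are made up, and it assumes that each EGT value has already been normalised to White's point of view as 'win', 'draw' or 'loss'.

```python
def count_men(fen: str) -> int:
    """Count every man (Kings and Pawns included) in the FEN placement field."""
    placement = fen.split()[0]
    # Letters are men; digits are runs of empty squares; '/' separates ranks.
    return sum(ch.isalpha() for ch in placement)


def has_castling_rights(fen: str) -> bool:
    """True if the FEN castling field (third field) is anything other than '-'."""
    fields = fen.split()
    return len(fields) > 2 and fields[2] != "-"


def first_fail_index(fens, white_values, max_men=6):
    """Index of the first EGT-assessable mainline position that is not a White win.

    fens         : mainline positions of one study, in order
    white_values : assumed 'win' / 'draw' / 'loss' from White's point of view
    Positions with more than max_men men, or with castling rights, are skipped
    because the EGT cannot assess them. Returns None if no failure is found.
    """
    for i, (fen, value) in enumerate(zip(fens, white_values)):
        if count_men(fen) > max_men or has_castling_rights(fen):
            continue
        if value != "win":
            return i
    return None


print(count_men("8/8/8/4k3/8/8/4P3/4K3 w - - 0 1"))  # 3
```

Everything downstream of the returned index is the 'tail' from the first-fail position; positions skipped for castling rights still need to be assessed by other means.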
The sort of thing that is coming up:
a) a study could be simply mislabelled '1-0' when it is a 'Draw Study' ... pretty obvious if all the positions in the mainline are drawn,
b) the study in HvdH dbIII may have been subject to a mis-transcription in time before it reached HvdH's dbIII,
c) the 'cook' [flaw] that the above process finds has been notified to HvdH before, and is (or is not) already in HvdH's dbIII,
d) the cook that the above process finds is upstream of a cook already noted in HvdH's dbIII, so it adds information without identifying a 'new' cooked-study,
e) the cook that the above process finds is the first 'cook' to be found in the study.
HvdH will integrate the new information into his database of studies, within the constraints imposed by the CHESSBASE technology. It is an open question as to what IT provision would allow everything that one wants to know about a study to be recorded in a database ... publication history of the study in time, its relationship to other studies, comments about it including cooks and duals, ... The problem with the CHESSBASE database, and with the Studies Quarterly EG, is that it is not clear what was the composer's original content and what has been added by judges and commentators afterwards.
In a study, White is supposed to have (essentially) only one move that achieves the objective - 'essentially' because there are often moves that simply 'temporise', allowing Black to force a repetition of position if White still achieves the goal.
Obviously, if White plays a non-DTM-optimal move, there must also have been at least one DTM-optimal move, so there is a possibly-significant dual here. Of course, it may be that the White move in the given mainline simply takes a long route to a position that would have been in the mainline earlier if a DTM-optimal move had been chosen ... i.e. we are seeing the temporising move rather than the most efficient move. So the dual may be more (or less) significant.
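To make the DTM test concrete, here is a rough Python sketch of how non-DTM-optimal White moves could be flagged along a mainline. It rests on assumptions that are mine rather than anything in the EGT data as returned: the depths are taken to be in plies (half-moves), every listed position is taken to be a White win, and the function name is invented for the example. Under those assumptions an optimal White move reduces the depth by exactly one ply.

```python
def non_optimal_white_moves(fens, dtm_plies):
    """Indices i where White is to move and the move played is not DTM-optimal.

    fens      : consecutive mainline positions in FEN
    dtm_plies : assumed depth to mate in plies for each position,
                or None where no EGT value is available
    """
    flagged = []
    for i in range(len(fens) - 1):
        if fens[i].split()[1] != "w":
            continue  # only White's choices matter for duals
        before, after = dtm_plies[i], dtm_plies[i + 1]
        if before is None or after is None:
            continue  # cannot compare without EGT data on both sides
        if after != before - 1:
            flagged.append(i)  # White had a faster route to mate
    return flagged
```

A flagged index only says that a faster winning move existed; whether that is a significant dual or just a temporising move still has to be judged by looking at the position.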
I hope that's all clear.