Generic Massive Parallelization Algorithms

Endgame analysis using tablebases, EGTB generation, exchange, sharing, discussions, etc.
User avatar
Ed Trice
Posts: 58
Joined: Thu Oct 28, 2010 1:13 am
Sign-up code: 10159
Location: Henlopen Acres, Delaware
Contact:

Generic Massive Parallelization Algorithms

Post by Ed Trice »

I've seen posts where certain endgame tablebase generators have options such as "-2" to take advantage of systems with 2 cores, usually followed up shortly by a request for "when is your -4 version going to become available?"

I started thinking about a generalized approach to minimize the time taken to solve a particular tablebase set given you have N cores, provided you have enough storage.

There seem to be a few hurdles to overcome before one sophisticated application can service everything required to keep all of the cores busy.

1. Sharing information across cores

How does core #1 know what the other cores are doing? Is there a way to communicate to every executable running on a system that would be saving data across many different drives?

The problem is, if I launch an app on core #1, another on core #2, another on core #3, and yet another on core #4, each one "has to know" what the others are working on in order to determine what it will work on next. If you have a computer system where you know there is a C, D, G, and H drive, you might have app #1 write to C, app #2 write to D, app #3 write to G, and so on. So you could write each executable to act like a "traffic cop", telling them all where to look for various files, in order to see which tablebases can be solved next, and by whom.

This seems very tedious.

But, if you make your tablebase generator flexible enough to handle "any situation", you delegate a fair amount of work to the end user of the app, who really just wants to run the thing and get it started.

Ideally, you could launch just one executable, query the user about how many cores or threads to use for solving the specified tablebases, and one single app will take care of everything.

Easier said than done.
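A minimal sketch of the single-executable idea, in Python with invented table names (the solver call is only a placeholder, not any real generator's API): one driver process asks how many workers to use and feeds a dependency-ordered list of tablebases to a process pool.

```python
# Sketch of "launch one executable that takes care of everything":
# a single driver hands tablebase names to N worker processes.
# Table names and the solver body are invented for illustration.
from multiprocessing import Pool

def solve_tablebase(name):
    # Placeholder for the real retrograde-analysis solver.
    return f"{name}: solved"

def main():
    n_cores = 4  # in a real app, queried from the user at startup
    todo = ["kk", "kqk", "krk", "kqkr"]  # dependency-ordered list
    with Pool(processes=n_cores) as pool:
        for result in pool.imap_unordered(solve_tablebase, todo):
            print(result)

if __name__ == "__main__":
    main()
```

The hard part this sketch glosses over is exactly the scheduling question below: which name to hand out next.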

2. Auto-sensory of prioritization, keeping parallelism in mind

Sometimes solving the next "available tb" once a core becomes free could lead to multiple idle cores later on. It would be better to pre-emptively examine the holistic set of configurations needing to be solved before engaging a single idle core. This would require coding a "priority tree" and not just a flat "dependency tree." The priority tree would need to be more than just a list of tbs to attempt to solve: it would need to forecast time on task, impact on idleness of other cores should a high-priority task be engaged "too soon", and evolve its list as the tablebases are solved. In short, it should track the performance of its own management and priorities, and make adjustments as necessary to maximize the use of all available cores.
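One toy way to picture the "priority tree" versus a flat dependency list: give each unsolved table a score combining its forecast solve time with how many other tables it unblocks. Everything here (the scoring formula, the numbers, the table names) is invented for illustration, not a claim about any real generator.

```python
# Toy "priority tree" scoring: cheap tables and tables that unblock
# more future work are pulled first. All values are invented.
import heapq

def priority(est_hours, dependents, total_cores=4):
    # Lower is better; heapq pops the smallest score first.
    return est_hours - 2.0 * dependents / total_cores

tables = {          # name: (estimated hours, number of dependents)
    "A": (1.0, 3),
    "B": (8.0, 1),
    "C": (2.0, 1),
    "D": (10.0, 6),
}

heap = [(priority(h, d), name) for name, (h, d) in tables.items()]
heapq.heapify(heap)
order = [heapq.heappop(heap)[1] for _ in range(len(heap))]
print(order)  # ['A', 'C', 'D', 'B']
```

A real scheduler would also re-score as measured times come in, which is the "track the performance of its own management" part.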

3. Bidirectional Asymmetric Processing

Say you are trying to solve one huge tablebase where the material on the white side is different from the material on the black side. After solving a position as "side to move and win" (for whichever side), you could give another core work to do immediately. Execute the winning move, send the resulting position as a loss to the other core, have it make every unmove for the side that just made the winning move, and literally every move from every unmoved position is a win that can be harvested by that core.

In this fashion, you can have two cores working on one large tablebase (pair).

At some point, though, the payoff from all of the time spent "fishing" for work would start to decay, and one core could finish up alone. Determining where that inflection point is might be difficult, however.
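The hand-off itself can be illustrated with a toy move graph (positions as integers, edges invented): one core proves a win, ships the successor position to its partner as a loss, and the partner's unmove expansion harvests every predecessor as a win.

```python
# Toy illustration of the bidirectional hand-off. Positions are
# integers and the move graph is invented, just to show the data flow.
from collections import deque, defaultdict

moves = {0: [2], 1: [2], 3: [2]}   # toy moves: 0, 1 and 3 reach 2
unmoves = defaultdict(list)        # reverse edges = the unmove table
for src, dsts in moves.items():
    for dst in dsts:
        unmoves[dst].append(src)

wins = {0}                   # core 1 proved: side to move wins at 0
handoff = deque(moves[0])    # execute the winning move, ship the result
losses = set()
while handoff:
    loss = handoff.popleft()
    losses.add(loss)         # position after the winning move is a loss
    for prev in unmoves[loss]:
        wins.add(prev)       # every unmove from a loss lands on a win
print(sorted(wins), sorted(losses))
```

In a real generator the handed-off positions would cross a process or drive boundary, which is where the "fishing" overhead comes from.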
5.0 Ghz Intel Gulftown Supercomputers
http://www.liquidnitrogenoverclocking.com
koistinen
Posts: 92
Joined: Fri May 02, 2008 7:59 pm
Sign-up code: 0
Location: Stockholm

Re: Generic Massive Parallelization Algorithms

Post by koistinen »

1. Sharing information across cores
Don't! It will only complicate things without improving performance much.
2. Auto-sensory of prioritization, keeping parallelism in mind
Just use one core. That helps keep complexity of writing the generator down.
3. Bidirectional Asymmetric Processing
That is another premature optimization.

Write a simple generator. Measure performance when running it to see if and where it needs to be improved.

Unless of course you find enjoyment in doing it another way :) That is, do what you feel like, maybe it will be nice for others too.
guyhaw
Posts: 489
Joined: Sat Jan 21, 2006 10:43 am
Sign-up code: 10159
Location: Reading, UK
Contact:

Err ...

Post by guyhaw »

... 'posts where certain endgame tablebase generators say ...'

Do you have any specific, accurate references to these? What generators are out there, not including - please - the overly-inscrutable 'Skipper_NORTON'.

I take it they do not include Nalimov (not geared for multi-core), FEG (ditto), Konoval (not published), or MB's code (ditto).

g
User avatar
Kirill Kryukov
Site Admin
Posts: 7399
Joined: Sun Dec 18, 2005 9:58 am
Sign-up code: 0
Location: Mishima, Japan
Contact:

Re: Generic Massive Parallelization Algorithms

Post by Kirill Kryukov »

My current approach to parallelism is simple: keep separate instances of the generator as individual processes, unaware of each other's existence. Then it boils down to whether you have enough RAM for multiple instances or you don't. If you have enough RAM, by all means run multiple instances of the generator. (My generators can figure out by themselves which endgames are already solved, which are being solved by another instance, and which ones still need solving.)

I may look into other ways to do parallelization, like multiple cores cooperating while solving the same endgame, but only after I squeeze all the performance I can out of the other parts, which won't be very soon. My current priority is to optimize the RAM and IO usage; utilizing the additional cores is less important at the moment.
User avatar
Kirill Kryukov
Site Admin
Posts: 7399
Joined: Sun Dec 18, 2005 9:58 am
Sign-up code: 0
Location: Mishima, Japan
Contact:

Re: Generic Massive Parallelization Algorithms

Post by Kirill Kryukov »

Ed Trice wrote:How does core #1 know what the other cores are doing? Is there a way to communicate to every executable running on a system that would be saving data across many different drives?
It scans the directory tree where the tables are being saved. After finding the first missing table it creates an empty file there (to stop other instances from solving the same table) and starts crunching. When the table is completed, it is saved instead of that empty file. There are a few more details, but this is the overall scheme. :-)
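The empty-file trick works because file creation can be made atomic: with `O_CREAT | O_EXCL`, exactly one instance can create the marker, so two instances can never claim the same table. A small sketch (paths and table names are illustrative):

```python
# Sketch of the empty-file claiming scheme: O_CREAT | O_EXCL makes
# "create the marker" atomic, so only one instance wins the race.
import os
import tempfile

def try_claim(directory, table_name):
    path = os.path.join(directory, table_name)
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.close(fd)
        return True    # we created the marker; this table is ours
    except FileExistsError:
        return False   # another instance got there first

workdir = tempfile.mkdtemp()
print(try_claim(workdir, "kqkr.tb"))  # True: first claim succeeds
print(try_claim(workdir, "kqkr.tb"))  # False: second claim is refused
```

When the solve finishes, the instance overwrites the empty marker with the completed table, exactly as described above.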
User avatar
Ed Trice
Posts: 58
Joined: Thu Oct 28, 2010 1:13 am
Sign-up code: 10159
Location: Henlopen Acres, Delaware
Contact:

Re: Generic Massive Parallelization Algorithms

Post by Ed Trice »

Kirill Kryukov wrote: It scans the directory tree where the tables are being saved. After finding the first missing table it creates an empty file there (to stop other instances from solving the same table) and starts crunching. When the table is completed, it is saved instead of that empty file. There are a few more details, but this is the overall scheme. :-)
Hi Kirill,

Yes, I understood that part :) That is what I was primarily referring to when I said:
Ed Trice wrote:This would require coding a "priority tree" and not just a flat "dependency tree." The priority tree would need to be more than just a list of tbs to attempt to solve: it would need to forecast time on task, ...
It would be great if the "overlord" program that has "agents" running could do this:

Code: Select all

I can start working on either Tablebase_A, Tablebase_B, Tablebase_C, Tablebase_D now.

If Tablebase_A: Nothing in my priority list identifies it as a potential parallelism bottleneck, based on current status of existing work being performed, and future work that might depend on this core.

If Tablebase_B: This file would take a long time to solve, and has only "X" dependencies on it. Lower its priority.

If Tablebase_C: This file can be solved more quickly than _B, and it has "X" dependencies also. Raise its priority.

If Tablebase_D: This file is the largest, but it has more dependencies. Therefore, it could impact the future idleness of other cores if all of the "smaller files" are solved, and the cores must wait for _D to be solved before proceeding. Apply some logic here to determine if it would be better to increase the priority of this tablebase, and by how much if so. 

Possible outcomes:

Priority list could become re-ordered to...  _C, _A, _D, _B ...if... the solving time for _C is much less than _B's and _D had its priority increased only slightly.

Priority list could become re-ordered to...  _D, _C, _A, _B ...if... the amount of idle time estimated for _D to complete with a lower priority would be intolerable for so many projected idle cores. Increase its priority the most.

Priority list could become re-ordered to...  _A, _C, _B, _D ...if... the "time to solve" logic dominates over other bottleneck estimations. If the impact to future core bottlenecks is small enough, just solve whichever will most likely be done the fastest.
You can see that if this type of logic is invoked on each core that has just become idle, and if each core can see the priority list that has (perhaps) been modified by one or more of the other cores, then even the "worst initial prioritization" would gradually improve, and the final list would be one that maximizes the efficiency of the machine and minimizes the solving time for it.

The big benefit would be: different solving orders could result on different machines, and the solving time would be minimized for each user's tablebase generation run.
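The "track the performance of its own management" part could be as simple as a running correction factor: each time a table finishes, compare measured time to the forecast and rescale the remaining estimates before re-sorting. All numbers and names below are invented:

```python
# Toy feedback loop: when a table finishes, blend the measured/forecast
# ratio into a correction factor and re-sort what remains.
estimates = {"A": 1.0, "B": 8.0, "C": 2.0, "D": 10.0}  # forecast hours
correction = 1.0

def reorder(remaining):
    return sorted(remaining, key=lambda t: estimates[t] * correction)

def finished(table, measured_hours):
    global correction
    # Exponential moving average of (measured / forecast).
    correction = 0.5 * correction + 0.5 * (measured_hours / estimates[table])

queue = reorder(["B", "C", "D"])
finished("A", 2.0)       # A took twice as long as forecast...
queue = reorder(queue)   # ...so all remaining forecasts are inflated too
print(correction, queue)
```

A real scheduler would keep per-size or per-material corrections rather than one global factor, so that relative priorities can actually change.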
User avatar
Ed Trice
Posts: 58
Joined: Thu Oct 28, 2010 1:13 am
Sign-up code: 10159
Location: Henlopen Acres, Delaware
Contact:

Re:

Post by Ed Trice »

guyhaw wrote:... 'posts where certain endgame tablebase generators say ...'

Do you have any specific, accurate references to these? What generators are out there, not including - please - the overly-inscrutable 'Skipper_NORTON'.

I take it they do not include Nalimov, not geared for multi-core, FEG (ditto), Konoval (not published) or MB's code (ditto).

g
I was referring to this post:
Smartie33 wrote:Bingo, now it works with Vista64:

TbGen -2 -m auto -b 128 -d C:\TB\345 -v -c 256 knnknn

Time for this: 1h 5min, it's ok;-)

I have 6 GB RAM, in Vista32 "-m auto" crashes permanently, 4 GB- 32bit- Bug?

Is there a "-4" option in the future (for my quad;-)

thank you, codeman!
User avatar
Kirill Kryukov
Site Admin
Posts: 7399
Joined: Sun Dec 18, 2005 9:58 am
Sign-up code: 0
Location: Mishima, Japan
Contact:

Re: Generic Massive Parallelization Algorithms

Post by Kirill Kryukov »

Ed Trice wrote:Hi Kirill,

Yes, I understood that part :) That is what I was primarily referring to when I said:
Ed Trice wrote:This would require coding a "priority tree" and not just a flat "dependency tree." The priority tree would need to be more than just a list of tbs to attempt to solve: it would need to forecast time on task, ...
It would be great if the "overlord" program that has "agents" running could do this:

Code: Select all

...
You can see if this type of logic is evoked on each idle core that has just become free, and if each core can see the priority list that has been (perhaps) modified by one or more of the other cores, then even the "worst initial prioritization" would gradually become improved, and the final list will be one that maximized the efficiency of the machine and minimized the solving time for it.

The big benefit would be: different solving orders could result on different machines, and the solving time would be minimized for each user's tablebase generation run.
I think I understand now: You are probably talking about a checkers-specific problem here. In chess the dependency tree is so simple that none of these complications are needed. You just create a list of endgames to solve. Every solver instance will follow the same list, and all cores will be occupied 100% of the time. (Assuming you can afford to solve multiple endgames at the same time.) This is what I did in 3x4 chess; all endgames up to 12 pieces are solved with no idling cores.
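The shared-list idea works because the chess sub-endgame graph is an acyclic dependency graph, so one topological sort up front gives every instance the same safe ordering. A sketch using Python's standard-library `graphlib` (the dependency entries are toy data, not a full capture/promotion analysis):

```python
# Sketch: topologically sort a toy endgame dependency graph so every
# solver instance can walk the same list, skipping done/claimed entries.
from graphlib import TopologicalSorter

deps = {                      # endgame -> sub-endgames it can convert to
    "kqk":  ["kk"],           # the queen can be captured
    "krk":  ["kk"],
    "kqkr": ["kqk", "krk", "kk"],
}
order = list(TopologicalSorter(deps).static_order())
print(order)  # "kk" comes first, "kqkr" last
```

Each instance walks `order` front to back, claiming the first entry that is neither finished nor marked in progress.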

I haven't thought much about checkers, but when you can convert to almost any sub-endgame with a single move, probably it can complicate things slightly. :-)
User avatar
Ed Trice
Posts: 58
Joined: Thu Oct 28, 2010 1:13 am
Sign-up code: 10159
Location: Henlopen Acres, Delaware
Contact:

Re: Generic Massive Parallelization Algorithms

Post by Ed Trice »

Kirill Kryukov wrote: I think I understand now: You are probably talking about checkers-specific problem here.
Kirill Kryukov wrote: I haven't thought much about checkers, but when you can convert to almost any sub-endgame with a single move, probably it can complicate things slightly. :-)
When solving checkers tablebases, you have to make one pass where you examine nothing but jumps first. Since multiple jumps are possible, you could land in just about any subgame at that point.

For example, consider 3 kings + 2 checkers vs. 2 kings + 3 checkers and the side with 3 kings is doing its jump pass. It can jump into:

3 kings + 2 checkers vs. 2 kings + 2 checkers
3 kings + 2 checkers vs. 1 king + 3 checkers
3 kings + 2 checkers vs. 2 kings + 1 checker
3 kings + 2 checkers vs. 1 king + 2 checkers
3 kings + 2 checkers vs. 3 checkers
3 kings + 2 checkers vs. 2 kings
3 kings + 2 checkers vs. 2 checkers
3 kings + 2 checkers vs. 1 king + 1 checker
3 kings + 2 checkers vs. 1 king
3 kings + 2 checkers vs. 1 checker
3 kings + 2 checkers vs. no pieces

So the lookup procedure during the jump pass is fairly complicated.
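The complication is in the board lookup, but enumerating the reachable sub-endgames themselves is just counting: remove any non-empty subset of the opponent's material. For 2 enemy kings and 3 enemy checkers that is 3 × 4 − 1 = 11 materials, matching the list above.

```python
# Enumerate the sub-endgames a jump pass can land in: every way to
# remove at least one enemy piece from the starting material.
def jump_targets(enemy_kings, enemy_checkers):
    targets = []
    for k in range(enemy_kings + 1):
        for c in range(enemy_checkers + 1):
            if (k, c) != (enemy_kings, enemy_checkers):  # >= 1 capture
                targets.append((k, c))   # (kings left, checkers left)
    return targets

subgames = jump_targets(2, 3)
print(len(subgames))  # 11 sub-endgames, one per line in the list above
```

The (0, 0) entry is the "vs. no pieces" case, reachable when a multiple jump captures everything.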

Once the jump pass is completed, there is also a promotion pass.

3 kings + 2 checkers can only promote to 4 kings + 1 checker, even though there are 2 checkers on the board. Why? Only one can promote on any single move.

But, to complicate matters, the DTW for a promotion is determined very early, but there may be a non-promotion move that is encountered later that wins more quickly. Once the DTW is updated, then many DTL positions for the other side to move change, and the cycle continues for a while. Then, a few iterations later, another move may be determined to win, and that win could be faster, and this goes on and on. Win length does not correspond to iteration in checkers, and that's the hard part.
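Because an early DTW is only provisional, the iteration has to keep relaxing values until nothing improves. A deliberately simplified fixpoint loop on an invented graph (it ignores the side-to-move alternation between DTW and DTL, and only illustrates "iterate until stable"):

```python
# Toy fixpoint loop for "wins keep getting shorter": keep relaxing
# provisional win distances until no value improves. Graph and initial
# values are invented; real DTW/DTL alternates sides, omitted here.
moves = {0: [1, 2], 1: [3], 2: [3]}  # toy moves toward a known win at 3
dtw = {3: 0, 1: 9, 2: 1, 0: None}    # 1 got a slow provisional win first

changed = True
while changed:
    changed = False
    for pos, succs in moves.items():
        known = [dtw[s] for s in succs if dtw.get(s) is not None]
        if known:
            best = min(known) + 1
            if dtw.get(pos) is None or best < dtw[pos]:
                dtw[pos] = best   # a faster win found on a later pass
                changed = True
print(dtw)
```

Position 1's provisional distance of 9 collapses to 1 on a later pass, which is exactly why win length and iteration number decouple.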
User avatar
Kirill Kryukov
Site Admin
Posts: 7399
Joined: Sun Dec 18, 2005 9:58 am
Sign-up code: 0
Location: Mishima, Japan
Contact:

Re: Generic Massive Parallelization Algorithms

Post by Kirill Kryukov »

This is why I so much love shogi - no endgame, no sub-endgames, no conversion, no irreversible moves. :-)