Yes, with computations of this scale (and even with minichess) we have to be prepared for random crashes and data corruption. I am willing to take my chances, as my computation benefits greatly from larger RAM, and one error every several months is something I can (try to) live with. Also, storage can create bit errors too (with compressed data it will just corrupt a data block).
My general naive approach to hardware-induced errors:
1. Verification is absolutely necessary. (I checksum the data, then verify it, then checksum again.)
2. Verification should not follow the computation immediately, but only be done after making sure that the data is stored to disk (and not taken from cache).
3. I try to use the highest quality components (especially power supply, motherboard).
4. I slightly underclock the memory (currently running 16 GB of 2133 MHz DDR3 at 1866 MHz).
5. I try to provide abundant cooling to all parts.
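Points 1 and 2 could be sketched roughly like this in Python. This is only an illustration, not my actual code: the function names are made up, SHA-256 is an assumed choice of checksum, and `verify` stands for whatever verification pass you run. Note that `os.fsync` only asks the OS to push the data to the storage device; it does not by itself guarantee that a later read bypasses the page cache, so dropping the cache (or rebooting) before verifying is still up to you.

```python
import hashlib
import os


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_durably(path, data):
    """Write bytes, then ask the OS to flush them to the storage device."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())


def checked_verification(path, verify):
    """Checksum, run the caller's verification, then checksum again.

    `verify` is a hypothetical callable taking the path and returning True
    on success. Returns (ok, digest); ok is True only if verification
    passed and the file content was identical before and after.
    """
    before = file_sha256(path)
    verified = bool(verify(path))
    after = file_sha256(path)
    return (verified and before == after), before
```

Storing the returned digest next to the data lets you re-check the file later (for example after copying it to another disk) without rerunning the full verification.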
I have to say that I have never seen hardware-induced errors in my data so far, although I have seen some infrequent crashes (once every few months) - they could be caused by hardware, a power surge, cosmic rays, or perhaps by subtle bugs in my code (or in the OS, drivers, libraries or compiler).
I'd welcome ECC if it did not cost three times as much as normal RAM. I used ECC memory in the past, but not anymore.
Allan, good point: If more RAM makes the computation faster, then perhaps the total number of errors experienced during one solving run can be lower on a system with more RAM. This is probably the case for me.
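To make that argument concrete, here is a toy model in Python. It assumes (and this is only an assumption) that memory errors arrive at a constant rate per GB of RAM per hour, so the expected count is proportional to RAM size times run time. The rate and the run times below are invented purely for illustration, not measurements.

```python
def expected_errors(ram_gb, run_hours, errors_per_gb_hour):
    """Expected number of memory errors under a constant per-GB-hour rate."""
    return ram_gb * run_hours * errors_per_gb_hour


# Hypothetical numbers: doubling the RAM cuts the run time by more than half.
rate = 1e-6  # made-up error rate per GB per hour
small = expected_errors(ram_gb=8, run_hours=1000, errors_per_gb_hour=rate)
large = expected_errors(ram_gb=16, run_hours=400, errors_per_gb_hour=rate)
print(large < small)
```

Under this model the larger system wins whenever doubling the RAM shortens the run by more than a factor of two (here 6400 GB-hours against 8000 GB-hours).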