It does not matter what algorithm is used for the CRC/checksum/hash. In all cases it is a smaller number generated from data that (if taken as one string of bits) constitutes a massively larger number, and it takes time to compute and storage to keep around. The question is this: is it worth the extra storage and the extra computation times for every single I/O operation performed on the filesystem? I say it isn’t.
Hard drives DO in fact know if something has bit rotted, assuming the rot isn’t so severe that it extends beyond the error detection capabilities of the on-disk ECC. Whenever a drive reports an “uncorrectable error” it’s actually reporting an on-disk ECC error that was severe enough that the data couldn’t be corrected. In my opinion, on-disk checksums (CRCs, hashes, whatever term is preferred) are targeting a few types of very rare hardware failures (they must mangle data despite all hardware error checking mechanisms AND must not cause any other damage that crashes the program or machine which would process or write that data out to disk) and do so at significant expense (a check must be done for every piece of data that is read from disk). Even ZFS checksums are not foolproof; for example, if data is damaged in RAM or even in a CPU register before being sent to ZFS, the damaged data will still be treated as valid by ZFS because it has no way to know anything is wrong.
As discussed in my post, ZFS checksums are useless without a working backup of the data to pull from, preferably a ZFS-specific RAID configuration that enables real-time “self-healing” as you’ve mentioned. Without some sort of redundancy…well, what are you going to do? You know it’s damaged but you have no way to fix it.
You seem to take particular issue with my assertion that checksums are a waste of space. Granted, they’re relatively small compared to file data, however the space issue pales in comparison to the processing time and additional I/O for storing and retrieving those checksums. If the checksums aren’t beside the data then that 128K read will incur at least one 4K read to fetch the checksum which is not nearby, resulting in a disk performance hit. Enough read operations with checksum checking at once and streaming read speeds approach the speed of fully random I/O a lot faster than it would otherwise. It also takes CPU time to calculate a hash value over a 128K block; while some are faster than others, all take CPU time and large enough block sizes will repeatedly blow away CPU D-cache lines during the checksum work, reducing overall system performance. Since many ZFS users seem to pair it with FreeNAS and relatively small, weak systems like NAS enclosures, the implications of all this extra CPU hammering should be obvious. Of course, a Core i7 machine with 16GB of DDR4 RAM might do it so fast that it doesn’t matter as much, but being able to buy a bigger box to minimize the impact of lower efficiency does not change the fact that such a drop exists.
In computing, we have to choose a set of compromises since rarely does any given solution satisfy speed, precision, reliability, etc. all at the same time. In my opinion, ZFS data checksums are not worth the added cost, particularly since the problem surface area is very small and unlikely to ever happen once the error checking coverage of hard drive ECC, RAM and on-CPU ECC if applicable, and various bus-level transceiver error detection methods are taken away. The beauty of computing is that you are free to make a different trade-off in favor of bit rot paranoia if it makes you sleep better at night. What’s right for me may not be right for you. I do not consider the very tiny risk of highly specific and unlikely corruption circumstances that can be detected to be worth covering ESPECIALLY since the same cosmic rays that can bit-flip the data in a detectable place could just as easily flip it in an undetectable place, but I’m not in your situation and making your choices.
tl;dr: one of us is less risk-averse, and that’s okay.