Originally posted to Data Center POST
Random bit flips are far more common than most people, even IT professionals, think. Surprisingly, the problem isn’t widely discussed, even though it is silently causing data corruption that can directly impact our jobs, our businesses, and our security. It’s really scary knowing that such corruptions are happening in the memory of our computers and servers – that is before they even reach the network and storage portions of the stack. Google’s in-depth study of bit-level DRAM errors showed that such uncorrectable errors are a fact of life. And, do you remember the time when Amazon had to globally reboot their entire S3 service due to a single bit error?
The Error-Prone Data Trail
Let’s assume for a moment that your data survives its many passes through a system’s DRAM and emerges intact. That data must then be safely transported over a network to the storage system where it is written to disk. How do you assure the data remains unaltered along the way? Well, if you’re using one of the storage protocols that lack end-to-end checksums (e.g. NFSv2, NFSv3, SMBv2), your data remains susceptible to random bit flips and data corruption. Even NFSv4 plus Kerberos 5 with integrity checking (krb5i) doesn’t offer true end-to-end checksums. Once the data is extracted from the RPC, it is unprotected once again. In addition, the widespread adoption of NFSv4 hasn’t happened, and fewer still use krb5i.
To read the full article please click here.