ZFS checksums add reliability to NAS storage.
Companies are increasingly turning to IP-based video surveillance technologies to meet their security needs. Higher video resolution and longer data retention times are driving substantial growth in storage capacity requirements. With this constantly growing demand for storage, offering the same degree of data protection as enterprise-class IT systems has become a critical challenge for video surveillance.
A common but often neglected failure point in the storage system is the bit error rate specified for the drive. Enterprise-class disks have a specified bit error rate of 1 in 10^15 bits read. This rate is too low to affect normal operation, but in reconstruct mode it exposes how the protection offered by traditional RAID has failed to keep up with the rapid growth in the size of modern disks.
Traditional single-parity RAID-5 offers protection against a single disk failure. When one disk fails, RAID recreates its data onto a spare from the parity and the data on the remaining disks in the array. During this process the entire capacity of all of the remaining disks in the group has to be read, even if there is no data in the blocks. Every remaining disk must be read perfectly from start to finish or else the rebuild will fail, leading to total data loss.
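The rebuild described above can be illustrated with XOR parity, the usual basis of single-parity RAID. The following is a minimal sketch, not any vendor's implementation: the function names and the tiny 4-byte "blocks" are hypothetical, chosen only to show why every surviving block in a stripe has to be read correctly.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def rebuild_missing(surviving_blocks):
    """Recreate the one missing block of a stripe.

    surviving_blocks must hold every remaining data block plus the parity block.
    """
    return xor_blocks(surviving_blocks)

# Hypothetical 4-disk stripe: three data blocks and one parity block.
data = [b"cam1", b"cam2", b"cam3"]
parity = xor_blocks(data)

# The disk holding data[1] fails; all other blocks in the stripe are read.
survivors = [data[0], data[2], parity]
rebuilt = rebuild_missing(survivors)

# A single flipped bit on a surviving disk silently produces wrong data,
# because plain parity cannot tell which block is the bad one.
flipped = bytes([data[0][0] ^ 0x01]) + data[0][1:]
bad_rebuild = rebuild_missing([flipped, data[2], parity])
```

Here `rebuilt` equals the lost block, while `bad_rebuild` does not. A real array would normally see the drive report an unrecoverable read error rather than return flipped data, but either way the stripe cannot be reconstructed from single parity alone.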
An enterprise-class disk bit error rate of 1 in 10^15 translates into one expected read failure for every 125 terabytes read. So with a 65 terabyte RAID-5 array using enterprise-class disks, if we were to lose a single disk, a rebuild that reads roughly 65 terabytes should expect about 0.5 read failures. If bit errors are treated as independent, that is roughly a 40% chance that the rebuild fails, and the odds get worse as the array grows.
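The arithmetic behind these figures can be checked with a short calculation. This is a sketch under stated assumptions: bit errors are independent, the drive performs exactly at its specified rate, terabytes are decimal (10^12 bytes), and the rebuild reads about 65 terabytes. The function name is hypothetical. Note that 65/125 = 0.52 is the expected number of errors; the probability of hitting at least one error is somewhat lower than that ratio.

```python
import math

def rebuild_failure_probability(terabytes_read, bit_error_rate=1e-15):
    """Chance of at least one unrecoverable read error while reading this volume.

    Assumes independent bit errors and decimal terabytes (10**12 bytes).
    """
    bits_read = terabytes_read * 1e12 * 8
    # 1 - (1 - p)**n, computed in a numerically stable way for very small p.
    return -math.expm1(bits_read * math.log1p(-bit_error_rate))

# 10**15 bits is 125 TB, so one error is expected per 125 TB read.
terabytes_per_expected_error = 1e15 / 8 / 1e12

expected_errors = 65 / terabytes_per_expected_error   # 0.52 expected errors
p_fail = rebuild_failure_probability(65)              # roughly 0.41
```

Under this model the failure probability keeps climbing with capacity: reading a full 125 terabytes gives roughly a 63% chance of at least one error.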
As disk capacities and RAID-5 array sizes grow, it becomes increasingly unlikely that a RAID set can be rebuilt successfully after a drive failure, because a bit error on one of the remaining drives becomes more probable.
Like RAID-5, RAID-6 is parity-based, but its double parity provides redundancy for up to two failed drives. It has become popular in video surveillance because of its increased fault tolerance for mission-critical security applications.
When a RAID-6 rebuild is initiated after a single drive failure, a bit error on one of the remaining drives costs the RAID group its protection against a second drive failure, and the array continues running with only RAID-5-level protection. During this time, it is like driving a car without a spare tire: if a second drive fails, all data in the RAID system is lost. When the disks are also busy with other tasks, such as recording video surveillance footage, they work harder to handle both the normal and the parity I/O load. The chance of a second drive failure increases dramatically, and the probability of losing data rises to dangerous levels.
The solution to this problem is a bit-error-aware storage operating system that performs integrity checking every time data is read. With LucidNAS, the checksums are built into the data storage format and stored separately from the data they protect, so bit errors are detected on read and the risk of failed RAID rebuilds caused by them is eliminated.
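The idea of verifying a separately stored checksum on every read can be sketched in a few lines. This is an illustration only, not the LucidNAS or ZFS on-disk format: the `ChecksummedStore` class and its two dictionaries are hypothetical stand-ins for the disk and for the metadata that holds the checksums, and SHA-256 is used here simply because it is available in the standard library.

```python
import hashlib

class ChecksummedStore:
    """Toy block store that keeps checksums apart from the data they protect."""

    def __init__(self):
        self.blocks = {}      # block id -> raw bytes (stands in for the disk)
        self.checksums = {}   # block id -> digest, kept separately from the data

    def write(self, block_id, data):
        self.blocks[block_id] = data
        self.checksums[block_id] = hashlib.sha256(data).digest()

    def read(self, block_id):
        data = self.blocks[block_id]
        # Integrity check on every read: recompute and compare.
        if hashlib.sha256(data).digest() != self.checksums[block_id]:
            raise IOError(f"checksum mismatch on block {block_id}")
        return data

store = ChecksummedStore()
store.write(7, b"video frame")
first_read = store.read(7)

# Simulate a silent single-bit flip on the disk.
corrupted = bytearray(store.blocks[7])
corrupted[0] ^= 0x01
store.blocks[7] = bytes(corrupted)

try:
    store.read(7)
    detected = False
except IOError:
    detected = True
```

Because the checksum lives apart from the block, damage to the block cannot also damage the value it is checked against. Once a mismatch is detected, a system with redundancy can fetch the data from parity or a mirror instead of returning the bad block.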