Current thinking regarding XFS and RAID arrays


I recently advised a customer as follows:

OK. First, a note about performance testing.


You don’t want to use dd’s default block size of 512 bytes. That’s much too small to get the best performance. One megabyte (bs=1M) is probably the minimum I would use for streaming copy/read/write testing.
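
A quick way to see the block-size effect for yourself is to read the same amount of data (roughly 100MB here) with a small and then a large block size; the device name is just an example, and direct I/O is used so the page cache doesn’t hide the difference:

   # dd if=/dev/sdc of=/dev/null bs=512 count=200000 iflag=direct
   # dd if=/dev/sdc of=/dev/null bs=1M count=100 iflag=direct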


The Linux kernel’s memory caching turns out to actually get in the way of large streaming read and write tests. You can see this for yourself by using a “raw” device or the O_DIRECT flag on open(). dd supports a “direct” flag like this:

dd if=/dev/shm/cam00.mhs bs=1M of=/dev/null
vs
dd if=/dev/shm/cam00.mhs bs=1M of=/dev/null iflag=direct oflag=direct


and

dd if=/dev/sdc bs=1M count=4000 iflag=direct of=/dev/null
vs
echo 3 > /proc/sys/vm/drop_caches ; dd if=/dev/sdc bs=1M count=4000 of=/dev/null


and (WARNING: these write directly to the disk device and will destroy its contents)

dd if=/dev/zero bs=1M count=4000 of=/dev/sdc oflag=direct
vs
dd if=/dev/zero bs=1M count=4000 of=/dev/sdc conv=fdatasync


The direct flag prevents the kernel from making wasteful intermediate copies of the data in the page cache and instead does I/O directly between the user buffers and the hardware.


Another way to “prevent” cache effects is to use the “conv=fdatasync” flag, which causes dd to issue an fdatasync() call at the end of the write, just before closing the file. This forces the filesystem to write any unwritten data to disk before the file is closed, so dd reports repeatable and accurate performance numbers for a buffered copy. Calling fdatasync() after writing a file in your own C code is a good idea as well.


To prevent read caching from affecting results, you can use “echo 3 > /proc/sys/vm/drop_caches”, which causes all buffered data in memory to be “forgotten” so that it has to be read from the disk again. That way you don’t have to unmount/remount the filesystems or resort to other tricks to clear the read cache when you want to benchmark reads with buffering enabled. (iflag=direct bypasses the read cache entirely.)
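
So a typical buffered-read benchmark ends up looking like the following; the file path is just a placeholder for whatever you are testing:

   # echo 3 > /proc/sys/vm/drop_caches
   # time dd if=/data/bigfile bs=1M of=/dev/null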

Tuning hints


When creating your RAID groups, I would recommend 4 sets of 9 drives if you’re comfortable with RAID-5. If you would rather be able to recover from double drive faults, then three 10-drive RAID-6 arrays plus one 6-drive RAID-6 array is a better configuration. If you want more parallelism, 6 sets of 6 drives in RAID-6 is another good configuration, at the cost of some usable disk space. All of those configurations have 4 or 8 data drives per array, which gives a stripe width that is a power of 2. This is important when creating LVM stripes and aggregates of the arrays, as LVM can only do “chunk” sizes in powers of 2.
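
As a rough sketch, creating one of the 6-drive RAID-6 arrays with arcconf might look like the line below. The controller number, channel/device numbers, and 256k stripe size are assumptions for illustration, and the option spelling varies between arcconf versions, so check the usage output of “arcconf create” on your controller first:

   # arcconf create 1 logicaldrive stripesize 256 max 6 0 0 0 1 0 2 0 3 0 4 0 5 noprompt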


Enable both read caching (“RON”) and write-back caching (“WBB”, used when the battery/ZMM is available) for each RAID array as you create them.
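
If you need to change this on an existing array, arcconf can set the cache mode per logical drive. This is only a sketch assuming the usual setcache syntax, and the logical drive number is an example; check “arcconf setcache” usage on your controller:

   # arcconf setcache 1 logicaldrive 1 ron
   # arcconf setcache 1 logicaldrive 1 wbb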


Set the global task priority to medium (down from the default high) so that a rebuild won’t starve the OS for I/O.

   # arcconf setpriority 1 medium


If you want maximum data protection, disable the write caches on the individual drives. This will impact write performance, but the write-back cache on the RAID controller will mitigate it for small writes. Large streaming writes will be impacted more.


Enable background consistency checking so that the RAID controller is constantly (at a low rate) checking the RAID arrays for bad blocks.
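
The exact command for this varies with the controller and arcconf version. On many Adaptec controllers it is along the lines of the sketch below, but treat the command name and arguments as an assumption and confirm them against your arcconf help output:

   # arcconf consistencycheck 1 on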


I recommend LVM over MD RAID for creating aggregates. When aggregating storage, use the *entire* drive as an LVM PV rather than a partition. This makes it much easier to ensure your data is aligned on a RAID stripe boundary, which can yield a 2-3x performance improvement by helping the RAID controller perform full-stripe writes instead of read-modify-write updates.


When creating the LVM PVs, it is *vital* that you specify “--dataalignment=xxxx”, where xxxx is the size of the data in a full stripe. For example, a 6-drive RAID-6 with a 256k chunk size has 4 data drives * 256k = 1024k stripe size (1 megabyte). Note that LVM is CASE SENSITIVE and “k” is a different size than “K” in some LVM contexts: “k” is kibibytes (2**10 bytes) and “K” is 10**3 bytes.

   # pvcreate --dataalignment=1024k /dev/sdc
   # pvs -o +pe_start --unit k

I leave LVM’s PE size at the default 4 megabytes. It’s a multiple of the stripe size, and since we’re going to be allocating all of the extents on the PVs, the granularity doesn’t really matter.

   # vgcreate storage /dev/sd[cdef]
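
If you want to double-check the result, vgdisplay reports the extent size for the new volume group (and the pvs command above shows whether pe_start landed on the full-stripe boundary):

   # vgdisplay storage | grep "PE Size"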


When creating the logical volume, I recommend you use “linear” and *not* striped mode. This lets you handle more parallelism and I/O against multiple files at the same time, since XFS (with inode64, below) spreads its allocation groups, and therefore different files, across the different arrays. If you choose the 3×10 + 1×6 RAID-6 layout, select the three 10-drive arrays first, then append the 6-drive array last.

   # lvcreate -n data -l 100%PVS storage /dev/sdc /dev/sdd /dev/sde
   # lvextend -l +100%PVS storage/data /dev/sdf
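
It’s worth confirming that the result is a single linear volume with the segments in the intended order; lvs can show the segment type and backing devices:

   # lvs -o +segtype,devices storage/data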


This also simplifies the stripe unit and stripe width settings for XFS, since they only need to match the underlying RAID array values: su is the array’s chunk size and sw is the number of data drives per array (here 128k and 8, for a 1024k full stripe matching the --dataalignment above).

   # mkfs -t xfs -L label -d su=128k,sw=8 /dev/storage/data

When using XFS on a large, multi-segment volume, it’s important to specify “inode64” in the local mount options so that inodes and data are distributed throughout the filesystem, avoiding fragmentation and allocation pressure on the lower blocks of the filesystem. The default “inode32” was necessary when NFS clients couldn’t handle 64-bit inodes, but Red Hat Enterprise Linux 5 supports 64-bit inodes in NFS version 3.


The “logbufs” mount option defaults to 2; increasing it to 8 should help with metadata and write performance.


And if you don’t need “atime” accuracy, either use “relatime” or specify “noatime” to disable atime updates completely. “relatime” is probably good enough.

   # mkdir /data
   # mount -t xfs -o logbufs=8,inode64 /dev/storage/data /data
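
To make the options persistent, an /etc/fstab entry along these lines (restating the example device, mount point, and options above) should work, and xfs_info on the mounted filesystem will confirm that the stripe geometry (sunit/swidth) was picked up:

   /dev/storage/data  /data  xfs  logbufs=8,inode64  0 0
   # xfs_info /data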


Now test it with dd and the “direct” flag:

   # time dd if=/dev/zero of=/data/bigfile bs=1M count=4000 oflag=direct
   # time dd if=/data/bigfile bs=1M of=/dev/null iflag=direct


You should see pretty good performance.


NFS adds a whole ’nother layer of ways to mess with performance, and it can be difficult to get the local performance from remote clients.
