Last year we handled several emergency disaster recovery calls from outside our own customer support base, and many of them were related to bandwidth expansion on Xsan 1.4.x (which a lot of installations are still running), Xsan 2 and StorNext. We never recommend bandwidth expansion to our own customers.
In the case of Xsan 2, even expansions that the logs reported as "successful" did not actually succeed. We saw cvfsck reporting bus errors after such "successful" expansions, and in a few instances we ended up recovering several terabytes of data as a last resort.
There is very little technical documentation from Apple or Quantum on issues like this, so we investigated the problem and ran our own simulated tests. I also had a lot of technical discussions with peers and colleagues who work with Xsan, StorNext, QFS, GFS, etc.
This article is a summary of the conclusions drawn from the above.
First, recall two definitions from Apple's document on the Xsan API. (The document is no longer available at its original URL.)
Stripe Breadth (sb): The stripe breadth is the maximum amount of data that is read or written before switching to the next LUN in the storage pool. When the last LUN is reached, I/O operations go back to the first LUN.
For example, if you have defined the file system block size as 16KB, and the stripe breadth as 64 blocks, the stripe breadth data size is 16 * 64 = 1024KB.
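The arithmetic above, together with the round-robin behavior described in the definition, can be sketched in a few lines of Python. This is only an illustration: the function names and the `lun_count` parameter are hypothetical, offsets are taken relative to the start of the storage pool, and the real allocator is certainly more involved than this.

```python
def stripe_breadth_kb(block_size_kb: int, breadth_blocks: int) -> int:
    """Stripe breadth expressed as a data size in KB."""
    return block_size_kb * breadth_blocks


def lun_for_offset(offset_kb: int, sb_kb: int, lun_count: int) -> int:
    """Index of the LUN that holds offset_kb under plain round-robin striping.

    Assumption: data is laid out in chunks of sb_kb, one chunk per LUN,
    wrapping back to LUN 0 after the last LUN in the pool.
    """
    return (offset_kb // sb_kb) % lun_count


if __name__ == "__main__":
    sb = stripe_breadth_kb(16, 64)  # 16 KB blocks, 64 blocks per stripe
    print("stripe breadth:", sb, "KB")
    for offset in (0, 1023, 1024, 2048, 3072, 4096):
        print(offset, "KB -> LUN", lun_for_offset(offset, sb, 3))
```

With three LUNs, the first 1024 KB land on LUN 0, the next 1024 KB on LUN 1, and so on, wrapping back to LUN 0 at offset 3072 KB.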
Stripe Depth (sd): The number of LUNs assigned to a storage pool. Note that Apple's document defines this as the number of disks that have been assigned to a storage pool (likely a typo in Apple's document), but the long output of cvadmin shows the correct definition:
Xsanadmin (Volume1) > show long
Show stripe groups (File System "Volume1")
...
...
Stripe Group 1 [Data] Status:Up
  Total Blocks:30627054 (116.83 GB) Reserved:1082880 (4.13 GB) Free:25267210 (96.39 GB) (82%)
  MultiPath Method:Rotate
  Stripe Depth:3 Stripe Breadth:2 blocks (8.00 KB)
  Affinity Key:Data
...
...
  Disk stripes:
    Primary Stripe [Data] Read:Enabled Write:Enabled
      Node 0 [XsanLUN1]
      Node 1 [XsanLUN3]
      Node 2 [XsanLUN4]
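For anyone who wants to pull these numbers out of the cvadmin output with a script, here is a minimal Python sketch. It assumes the field labels appear exactly as in the sample above and that the stripe breadth is printed in KB. Treating used space as Total Blocks minus Free is my own simplification (it ignores how Reserved is accounted for), and `parse_stripe_group` is a hypothetical helper, not part of any Xsan tool.

```python
import re

# Excerpt of the sample output shown above.
SAMPLE = (
    "Stripe Group 1 [Data] Status:Up "
    "Total Blocks:30627054 (116.83 GB) Reserved:1082880 (4.13 GB) "
    "Free:25267210 (96.39 GB) (82%) MultiPath Method:Rotate "
    "Stripe Depth:3 Stripe Breadth:2 blocks (8.00 KB) Affinity Key:Data"
)


def parse_stripe_group(text: str) -> dict:
    """Extract stripe depth, stripe breadth and an estimate of used space."""
    total = int(re.search(r"Total Blocks:\s*(\d+)", text).group(1))
    free = int(re.search(r"Free:\s*(\d+)", text).group(1))
    depth = int(re.search(r"Stripe Depth:\s*(\d+)", text).group(1))
    breadth = re.search(r"Stripe Breadth:\s*(\d+) blocks \(([\d.]+) KB\)", text)
    breadth_blocks = int(breadth.group(1))
    breadth_kb = float(breadth.group(2))
    # Block size follows from the breadth given in both blocks and KB.
    block_kb = breadth_kb / breadth_blocks
    return {
        "sd": depth,
        "sb_kb": breadth_kb,
        "block_kb": block_kb,
        # Assumption: used space = (total - free) blocks.
        "used_kb": (total - free) * block_kb,
    }


if __name__ == "__main__":
    print(parse_stripe_group(SAMPLE))
```

The regular expressions do not depend on line breaks, so the same function should work whether the output is captured on one line or several.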
Bandwidth expansion is defined as expanding an existing volume by adding one or more equal-sized LUNs to an existing storage pool.
Why is this a bad idea?
When one or more LUNs are added to an existing storage pool, free space for the new LUNs needs to be inserted into the existing data of the storage pool. This is because the data needs to be striped across all the LUNs after the expansion (see the definition of stripe breadth above).
This will introduce free space fragmentation. In Xsan 1.4.x and in StorNext (at least up to 3.1.1; we haven't tested 3.1.2 and above for this), there appears to be a maximum limit of 1000000 free space fragmentation chunks during bandwidth expansion. If this number is exceeded, cvupdatefs will quit with an error and the bandwidth expansion will fail.
The reason for this limitation seems twofold:
- Too much free space fragmentation will result in file system performance problems.
- Each fragment needs a bit of FSM memory, so with enough fragments the FSM could be swapped out or be unable to start.
So, it is a good idea to calculate ahead of time how much free space fragmentation will be introduced. First, calculate the used space in the storage pool (from the long output of cvadmin, or from the cvlog file, where the summary output for space usage in the pools will show you the information). Let us assume it is m KB.
StorNext (up to 3.1.1 we tested): m / (sb * sd) should not exceed 1000000.
Xsan 1.4.x: m / sb should not exceed 1000000.
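The two rules of thumb above are easy to check with a short Python sketch. It assumes m and sb are both expressed in KB (sb being the stripe breadth data size, not the block count) and that sd is the stripe depth before the expansion. The function names and the product labels "stornext" and "xsan1.4" are made up for this example, and the limit is the empirical figure from our tests, not a documented constant.

```python
FRAGMENT_LIMIT = 1_000_000  # empirical limit observed in our tests


def estimated_fragments(m_kb: float, sb_kb: float, sd: int, product: str) -> float:
    """Estimated free space fragment count for a bandwidth expansion."""
    if product == "stornext":   # StorNext, tested up to 3.1.1
        return m_kb / (sb_kb * sd)
    if product == "xsan1.4":    # Xsan 1.4.x
        return m_kb / sb_kb
    raise ValueError(f"no rule of thumb for {product!r}")


def within_limit(m_kb: float, sb_kb: float, sd: int, product: str) -> bool:
    """True if the estimate does not exceed the observed limit."""
    return estimated_fragments(m_kb, sb_kb, sd, product) <= FRAGMENT_LIMIT


if __name__ == "__main__":
    # Figures based on the sample cvadmin output: sb = 8 KB, sd = 3, and
    # m = (Total Blocks - Free) * 4 KB = 21439376 KB (my own derivation).
    m_kb, sb_kb, sd = 21_439_376, 8, 3
    for product in ("stornext", "xsan1.4"):
        count = estimated_fragments(m_kb, sb_kb, sd, product)
        print(product, round(count), within_limit(m_kb, sb_kb, sd, product))
```

Note how small stripe breadths hurt: with an 8 KB stripe breadth, roughly 20 GB of data is already enough to exceed the limit under the Xsan 1.4.x rule.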
Xsan 2: The behavior of Xsan 2 (we did not test the current version) appears to be different. There are two parameters, StoragePoolIdealLUNCount and StoragePoolStripeBreadth, in the -auxdata.plist under the /Library/Filesystems/Xsan/config directory. MattG alerted me to the fact that they come from the volume type chosen when the file system is initially created, so that the system has a fighting chance of expanding the volume correctly in the future. However, it is not clear whether the volume expands "correctly" in the bandwidth expansion case, as opposed to the storage expansion case (creating a separate new pool with the new LUNs).
cvupdatefs in Xsan 2 does not seem to care about the 1000000 free space fragmentation figure, but it appears to stop inserting free space fragment chunks if a lot of FSM memory would be required to track free space. This can solve the memory and FSM swapping problems, but free space usage is not optimized and there is excessive free space fragmentation. Even after running
snfsdefrag -d
to correct stale depth extents, the performance of the resulting volume could be worse than a freshly created volume with the same configuration parameters as the expanded volume.
On the other hand, in the Xsan 2 disaster recovery cases alluded to at the beginning of this article, the bandwidth expansion seems to have been "successful" as indicated in the cvupdatefs log in /Library/Filesystems/Xsan/data//trace but there were constant FSM panics and cvfsck bus errors, so in reality the expansion failed.
Thus, bandwidth expansion seems to be a good idea only on a relatively empty volume. But if the volume is relatively empty, simply saving that data elsewhere and rebuilding the volume with the additional LUNs seems to be a better course. I am happy to hear any counterarguments to this conclusion.

