A Couple of Xsan Disaster Recovery Techniques

ravi's picture


I. ACL RELATED:

We have discussed in the past Xsan ACL support and the related two NTSD inodes sb_NTSecurityIdxInode and sb_NTSecurityDatInode governing the ACL data.
When one or both of these inodes are corrupted, we have seen Xsan volumes crashing with panic errors and volumes themselves unmountable.

Since the days of Xsan 1.4.2, cvfsck has seen a lot of improvements. The file system check will now gracefully recover the volume by resetting the security descriptors.

However, there may still be cases where, if sb_NTSecurityIdxInode is damaged, the file system mounts but not all files and folders are visible, and some folders show up as empty files. Depending on the type of damage, cvfsck might simply quit with

Verifying NT Security Descriptors
Segmentation Fault

or

Verifying NT Security Descriptors
*Fatal*: Fatal error attempting to verify NTSD's
*Error*: Fatal error checking NTSD's

One traditional Unix approach is to try earlier versions of the file system check (after removing from the config file any directives that do not apply to those earlier versions) and hope for a file system fix and recovery. I outline a different approach below.

Please remember that improper use of such techniques might further damage your volume, so use them at your own risk. The steps below assume that the volume is stopped and not mounted.

Step I

First, note down the inode number of sb_NTSecurityIdxInode. Assume the Xsan volume name is “Victim.” You can use this command (as root from a terminal) to find it:

echo show sb | cvfsdb Victim | grep sb_NTSecurityIdxInode

Let us assume the output shows the inode number as 0x5.
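To avoid a transcription slip, the value can be pulled out of the output with a small filter. This is only a sketch: it assumes the field name and its hexadecimal value appear on the same line of the "show sb" output, which is not reproduced in this post, and the function name extract_idx_inode is made up for illustration.

```shell
# Hypothetical helper: read a superblock dump on stdin and print the first
# hexadecimal value found on the sb_NTSecurityIdxInode line.
# Assumption: the field name and its 0x... value share a single line.
extract_idx_inode() {
    grep 'sb_NTSecurityIdxInode' | grep -Eo '0x[0-9A-Fa-f]+' | head -n 1
}
```

It would be used as: echo show sb | cvfsdb Victim | extract_idx_inode. If nothing is printed, the assumed layout does not match and the number should be read by eye instead.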

Step II

Now create an empty Xsan volume somewhere else (FireWire drives, USB sticks, a lab Xsan) with the name “Victim” (any name will do, for that matter). In this setup, create an ACE giving Full Control to the admin user for /Volumes/Victim. Check the inode number of sb_NTSecurityIdxInode in this volume as above and, assuming it is 0x5, save the 8 blocks of its contents:

echo save 0x5 /var/tmp/new_idx | cvfsdb Victim

Step III

Now use this new_idx file to rewrite the 8 blocks of contents of the sb_NTSecurityIdxInode in your original SAN from Step I:

echo replace 0x5 /<wherever you copied>/new_idx | cvfsdb Victim
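Since new_idx travels from one machine to another before it is written into the damaged volume, it may be worth confirming that the copy arrived intact before running the replace. A minimal sketch using the standard cksum utility; the function names file_sum and verify_sum are hypothetical.

```shell
# Hypothetical helpers for checking that a copied file is unchanged.
# file_sum prints "CRC SIZE" for a file (cksum reading stdin omits the name).
file_sum() {
    cksum < "$1"
}

# verify_sum FILE "CRC SIZE" succeeds only if FILE still matches the
# checksum recorded earlier; an empty recorded value never matches.
verify_sum() {
    [ -n "$2" ] && [ "$(file_sum "$1")" = "$2" ]
}
```

Record the output of file_sum /var/tmp/new_idx on the machine where the file was saved, then run verify_sum against the copy on the metadata controller with that recorded value before doing the replace.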

Step IV

Run cvfsck. It should reset the security descriptors, after which you should be able to mount the volume and recreate your ACLs.


II. LOST LUN LABELS:

Due to improper upgrades to Xsan 2.2 in conjunction with a bug, LUN labels are sometimes “lost” and Disk Utility.app pops up on one of the metadata controllers with an initialize/ignore/eject prompt. Unfortunately, some customers mistakenly went ahead and clicked the initialize button.

The recovery is fairly straightforward. First, use diskutil to identify the LUN; let us call it disk10. Unmount the HFS+ volume associated with this LUN. If it refuses to unmount, use lsof to find the fseventsd process associated with the volume, kill it, and then unmount the volume with diskutil. Then, from a terminal (as root), run

gpt destroy disk10

You should see Disk Utility.app pop up again. Just ignore it, use cvlabel to put the original Xsan LUN label back, and run cvfsck.
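Note that gpt appears to want the whole-disk identifier rather than a slice: a commenter below reports that disk5s1 failed with "device doesn't contain a GPT" while disk5 brought up the Disk Utility prompt. The sketch below strips a slice suffix from a diskutil identifier; the function name whole_disk is hypothetical, and it only handles bare identifiers of the form diskN or diskNsM (anything else is passed through unchanged).

```shell
# Hypothetical helper: reduce a diskutil identifier such as disk10s1 to its
# whole-disk form (disk10). Identifiers without a slice suffix, or in any
# other form, are printed unchanged.
whole_disk() {
    printf '%s\n' "$1" | sed -E 's/^(disk[0-9]+)s[0-9]+$/\1/'
}
```

For example, gpt destroy "$(whole_disk disk10s1)" would operate on disk10. Double-check the identifier against the diskutil listing first, since destroying the partition map of the wrong disk is not recoverable by this procedure.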

The usual disclaimers apply. Use all these kinds of techniques at your own risk.

Comments

ravi's picture

The line

echo replace 0x5 /<wherever you copied>/new_idx | cvfsdb Victim

should read

echo replace 0x5 //new_idx | cvfsdb Victim

R

ludo's picture

Hi,

I know the post is three years old but I have the same issue with someone that
must have initialised some LUNs by mistake.
The gpt destroy disk5s1 returns error: device doesn't contain a GPT
What does that mean?

Ludo

ludo's picture

Hi again,

Used gpt destroy disk5 and I got the Disk Utility prompt. But I still can't relabel
though. Other posts suggest it is not possible to recover the LUN. Did it work for
you?

Kind regards,

Ludo