Xsanity: Sanity for Apple's Xsan and Final Cut Server.
  
Volume corrupt after update from 1.2 to 1.4

 
axello
fully protected

Joined: 18 Sep 2006
Posts: 12

Posted: Mon Sep 18, 2006 3:33 pm    Post subject: Volume corrupt after update from 1.2 to 1.4

Hi
We have a serious problem after upgrading our Xsan from 1.2 to 1.4, and we are hoping someone here can help us.

It is an Xsan 1.2 setup with two Xserve RAIDs. We didn't install it ourselves; it was set up by a previous service provider.

We stopped the Xsan on both MDCs and shut down all the clients as well. Then we upgraded the MDC to 1.4. According to Apple's update docs, it is possible to do this without going through 1.3 first, because we stopped the SAN. After updating the MDC, we updated the backup MDC as well, so that a failover wouldn't fall back to older Xsan software.
We generally apply the rule "if it ain't broke, don't fix it". The main reason for us to update to 1.4 is that it fixes a problem with re-sharing the SAN over NFS, which is what we do. We also thought we could always uninstall the update and go back to 1.2 if necessary, but that was not the case! Please read on.

Problem:
After the update, the volume 'store' doesn't come online because a PANIC occurs. Under Xsan 1.4, the store.cfg file contains a different LUN size, so there is a discrepancy between the old 1.2 config file and the new 1.4 one. To prevent misunderstandings: the SAN volume is called 'store'.

First impression:
The 1.4 update changed store.cfg, and the volume then failed to mount with a 'bad configuration' error. We have no idea why it changed the config file.

Some lines from the system.log:
Quote:
Sep 18 15:58:16 server fsm[936]: Xsan FSS 'store[1]': PANIC: /Library/Filesystems/Xsan/bin/fsm " The stripe group "VideoDataPool" size is 1432559, expected 1430759. You must maintain the original configuration or re-initialize the file system. File system 'store' not started. " file alloc.c, line 2923
Sep 18 15:58:16 server fsm[936]: PANIC: /Library/Filesystems/Xsan/bin/fsm "
The stripe group "VideoDataPool" size is 1432559, expected 1430759.
You must maintain the original configuration or re-initialize the file system.File system 'store' not started. " file alloc.c, line 2923
Sep 18 15:58:16 server servermgrd: xsan: [84/395430] ERROR: get_fsmvol_at_index: Could not connect to FSM because Admin Tap Connection to FSM failed. - Connection reset by peer
FSM may have too many connections active.

So the 1.4 update changed the config file. OK, so we changed it back to the old 1.2 backup and tried again. This doesn't work either: it seems store.cfg is changed again as soon as you open Xsan Admin.
Two of the six LUNs are about 3GB bigger than the other four, but the 1.2 software handled this correctly. There was even a separate icon in the GUI indicating that the usable sector count was smaller than the physical size of the LUNs. The 1.4 update makes all LUN sizes equal, but then gives an error when you start the volume.
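
As a rough cross-check of the PANIC message, the two stripe group sizes in the log can be reproduced from the sector counts that cvlabel reports for the six data LUNs further down in this post. The sketch below assumes the stripe group size is counted in 4 MiB units of 512-byte sectors and rounded up. That unit is a guess that happens to fit the numbers, not something taken from the Apple or StorNext docs, and the helper name pool_size is made up for illustration.

```python
# Assumption: stripe group size is expressed in 4 MiB units (8192 sectors of 512 bytes).
SECTORS_PER_UNIT = 4 * 2**20 // 512

def pool_size(sector_counts, sectors_per_unit=SECTORS_PER_UNIT):
    """Total size of a pool in assumed 4 MiB units, rounded up."""
    total = sum(sector_counts)
    return -(-total // sectors_per_unit)  # ceiling division on integers

SMALL = 1953462272  # sectors on the four 5.6TB slices (from cvlabel -l)
LARGE = 1960835072  # maximum sectors on the two 3.5TB LUNs (from cvlabel -l)

equal_luns = pool_size([SMALL] * 6)               # all six LUNs labeled at the small size
mixed_luns = pool_size([SMALL] * 4 + [LARGE] * 2)  # two LUNs at their full size

print(equal_luns, mixed_luns)
```

Under that assumption, six equal LUNs give 1430759 and the mix of four smaller and two larger LUNs gives 1432559, which are the two values in the log.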

So we uninstalled 1.4, installed 1.1, applied the 1.2 update, and put the old Xsan config directory back (and re-entered the license key). But it is still a no-go.
We get the following error messages in the log:

Quote:
Sep 18 18:03:18 server fsmpm[289]: PortMapper: Starting FSS service 'store[1]' on host server.aaa.bbb.\n
Sep 18 18:03:18 server fsm[520]: *WARNING*: StripeGroup "VideoDataPool" can use only 99 percent of its space due to inconsistent disk sizes!\n
Sep 18 18:03:18 server fsmpm[289]: PortMapper: FSS 'store'[1] (pid 520) at 10.0.1.200:49407 is registered.\n
Sep 18 18:03:37 server fsmpm[289]: PortMapper: Initiating activation vote for FSS 'store'.\n
Sep 18 18:05:10 server kernel[0]: CVFS: 3.5TBUpperData: configured sector count 1960835072 exceeds labeled capacity 1953462272
Sep 18 18:05:10 server kernel[0]: Could not mount filesystem store, cvfs error ' Invalid argument' (11)

We checked the LUN labels with cvlabel -l, because we thought they were the cause of our troubles:

Quote:
/dev/rdisk6 [APPLE Xserve RAID 1.50] CVFS "MetaDataLUN" Sectors: 490190848. SectorSize: 512. Maximum sectors: 490190848.
/dev/rdisk4 [APPLE Xserve RAID 1.50] CVFS "5.6TBUpperSlice1" Sectors: 1953462272. SectorSize: 512. Maximum sectors: 1953462272.
/dev/rdisk8 [APPLE Xserve RAID 1.50] CVFS "5.6TBUpperSlice0" Sectors: 1953462272. SectorSize: 512. Maximum sectors: 1953462272.
/dev/rdisk9 [APPLE Xserve RAID 1.50] CVFS "3.5TBUpperData" Sectors: 1953462272. SectorSize: 512. Maximum sectors: 1960835072.
/dev/rdisk5 [APPLE Xserve RAID 1.50] CVFS "3.5TBLowerData" Sectors: 1953462272. SectorSize: 512. Maximum sectors: 1960835072.
/dev/rdisk3 [APPLE Xserve RAID 1.50] CVFS "5.6TBLowerSlice0" Sectors: 1953462272. SectorSize: 512. Maximum sectors: 1953462272.
/dev/rdisk7 [APPLE Xserve RAID 1.50] CVFS "5.6TBLowerSlice1" Sectors: 1953462272. SectorSize: 512. Maximum sectors: 1953462272.

We see that disk5 and disk9 now have a 'Sectors' value that is the same as the other LUNs, but their 'Maximum sectors' value is the same as the 'Sectors' entry in the OLD store config. So it seems the old config file used a different LUN size for the 3.5TB LUNs than for the 5.6TB LUNs.
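
The comparison above can be sketched in a few lines of Python. This is only an illustration: the regular expression is fitted to the cvlabel -l lines quoted in this post and may not match the output of other cvlabel versions, and the function names are invented here.

```python
import re

# Pattern fitted to the `cvlabel -l` lines quoted above (an assumption about the format).
LABEL_RE = re.compile(
    r'^(?P<dev>/dev/\S+)\s+\[(?P<model>[^\]]*)\]\s+CVFS\s+"(?P<name>[^"]+)"\s+'
    r'Sectors:\s*(?P<sectors>\d+)\.\s+SectorSize:\s*(?P<ssize>\d+)\.\s+'
    r'Maximum sectors:\s*(?P<max>\d+)\.'
)

def parse_labels(text):
    """Return {label name: (sectors, sector_size, max_sectors)} for matching lines."""
    labels = {}
    for line in text.splitlines():
        m = LABEL_RE.match(line.strip())
        if m:
            labels[m["name"]] = (int(m["sectors"]), int(m["ssize"]), int(m["max"]))
    return labels

def truncated(labels):
    """Names of LUNs whose label uses fewer sectors than the device offers."""
    return sorted(name for name, (s, _, mx) in labels.items() if s < mx)

def gib(sectors, sector_size=512):
    """Convert a sector count to GiB."""
    return sectors * sector_size / 2**30

LISTING = '''
/dev/rdisk6 [APPLE Xserve RAID 1.50] CVFS "MetaDataLUN" Sectors: 490190848. SectorSize: 512. Maximum sectors: 490190848.
/dev/rdisk4 [APPLE Xserve RAID 1.50] CVFS "5.6TBUpperSlice1" Sectors: 1953462272. SectorSize: 512. Maximum sectors: 1953462272.
/dev/rdisk8 [APPLE Xserve RAID 1.50] CVFS "5.6TBUpperSlice0" Sectors: 1953462272. SectorSize: 512. Maximum sectors: 1953462272.
/dev/rdisk9 [APPLE Xserve RAID 1.50] CVFS "3.5TBUpperData" Sectors: 1953462272. SectorSize: 512. Maximum sectors: 1960835072.
/dev/rdisk5 [APPLE Xserve RAID 1.50] CVFS "3.5TBLowerData" Sectors: 1953462272. SectorSize: 512. Maximum sectors: 1960835072.
/dev/rdisk3 [APPLE Xserve RAID 1.50] CVFS "5.6TBLowerSlice0" Sectors: 1953462272. SectorSize: 512. Maximum sectors: 1953462272.
/dev/rdisk7 [APPLE Xserve RAID 1.50] CVFS "5.6TBLowerSlice1" Sectors: 1953462272. SectorSize: 512. Maximum sectors: 1953462272.
'''

labels = parse_labels(LISTING)
for name in truncated(labels):
    sectors, ssize, max_sectors = labels[name]
    print(f"{name}: labeled {gib(sectors, ssize):.1f} GiB of {gib(max_sectors, ssize):.1f} GiB")
```

Run against the listing above, this flags only the two 3.5TB LUNs as labeled below their maximum size.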

Our suspicion now is that the Xsan 1.4 update CHANGED the labels on the LUNs themselves. How else could the old 1.2 software have worked if the LUNs had a different number of sectors than the store config file specified?
The 1.4 software now won't work because the volume size has changed, and the 1.2 software won't work because the LUN labels have changed.
So the SAN is corrupt. AAaargh!

Our question, for which we could not find an answer in either the Apple docs or the StorNext docs:
Is using cvlabel to bring the LUNs back to their old size destructive for the contents of the LUNs, like a format? Or is it non-destructive, more comparable to editing a partition map, where you can twiddle with the map but the data stays intact?

We hope someone can help us, thanks in advance,


Axel Roest
aaron
Site Admin

Joined: 19 Mar 2005
Posts: 405

Posted: Mon Sep 18, 2006 4:46 pm

Wow. This one is beyond me. Anyone else?
_________________
Aaron Freimark
http://www.tekserve.com/vcard/af.vcf
MattG
Xsan Master

Joined: 15 Apr 2005
Posts: 456

Posted: Mon Sep 18, 2006 11:49 pm    Post subject: 1.2 to 1.4 woes...

Ok three thoughts...

1) The main cause of your woes is that your original integrator used slicing on your large RAID to get four LUNs whose sizes were similar, but not equal, to the LUN sizes of your 3.5TB RAID. When these slices were placed into a Storage Pool along with the non-sliced LUNs of the 3.5TB RAID, the 3.5TB LUNs got smaller to be exactly the same size as the slices. So, to be clear, the LUNs on the 3.5TB RAID have been truncated to match the size of the sliced LUNs. In fact, you even mention that you saw an "icon" (probably an arrow facing downwards) that indicated that the sector size was smaller than the physical size.

In essence, your integrator did you no favors using this technique. We stress in class that using any combination of sliced and unsliced LUNs to create a multiple LUN storage pool is more effort than it's worth.

2) No question that the LUNs were re-labeled in the update.

A full disclaimer: messing with LUN labels is the surest way of ensuring that the volume will never mount again.

But, in class we did practice once with unlabeling and relabeling LUNs with the exact same name and seeing that the LUNs' data still remained intact. However, we didn't re-specify sizes during that exercise.

3) I hope you have AppleCare, because I am almost certain that the cvlabel command, with its myriad switches, could force a sector count on those two LUNs that reflects the original truncated size, which would allow you to run the volume in 1.2 or 1.4. I'm just not familiar with how to do it, and I think their tier 3 has probably done one or two of these in the past year.

So, once again, truncating LUNs to fit the size of others in the name of performance is a bad idea.
axello
fully protected

Joined: 18 Sep 2006
Posts: 12

Posted: Tue Sep 19, 2006 1:32 am    Post subject: thx, will look further

So it basically amounts to: you should be glad it worked in the first place! I've also posted this to Apple; maybe they will come up with something...

Who or what is this 'tier 3' you are talking about in regard to AppleCare?

Thanks
axello
fully protected

Joined: 18 Sep 2006
Posts: 12

Posted: Tue Sep 19, 2006 4:43 am    Post subject: more info

To recap:
We have problems updating a 1.2 Xsan system that was hacked to support truncated LUNs. The main storage pool contains six LUNs, 4x 931GB and 2x 935GB (from a new Xserve RAID and an older-model Xserve RAID). The larger LUNs were truncated to match the smaller LUN size in the pool. This was visible in the Xsan Admin GUI as a greyed-out total size with an arrow pointing to the usable size.

We ran some more tests:

- Boot the MDC (all clients switched off).
Now the volume runs (automatically).
Then we try to mount it (read-only) in the Xsan Admin GUI on the MDC, but this gives the error "configured sector count exceeds labeled capacity".

- We check cvlabel and the volume config file (store.cfg):
cvlabel shows the smaller LUNs as using their full size (931GB), and the larger LUNs as using only 931GB out of the 935GB total. So that seems OK.
The volume config file shows the larger 935GB for the two bigger LUNs. This is strange, as they are truncated. However:

When we adjust the store.cfg file to match the larger LUNs to their truncated size, a subsequent cvfsck (read-only) gives many, many errors.

When we revert the store.cfg file to the original Xsan 1.2 config and run cvfsck, it reports that all is well, with 0 errors. It seems that the larger size setting in the config file corresponds to the actual data on the volume.

The only problem we have is that in the Xsan Admin GUI the larger LUNs no longer show up as truncated 935GB LUNs, but as actual 931GB LUNs... which they aren't. And because of this the mount command fails...
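
The mount failure above comes down to one comparison per LUN: the sector count in the volume config versus the sector count in the label. A minimal sketch of that check, assuming both sets of numbers have already been extracted by hand. The dictionaries and the function name are hypothetical, and nothing here reads store.cfg or calls cvlabel.

```python
def check_config_against_labels(configured, labeled):
    """Return one message per LUN whose configured sector count exceeds its label."""
    problems = []
    for name, want in sorted(configured.items()):
        have = labeled.get(name)
        if have is None:
            problems.append(f"{name}: no label found")
        elif want > have:
            problems.append(
                f"{name}: configured sector count {want} exceeds labeled capacity {have}"
            )
    return problems

# Values taken from the kernel log and the cvlabel -l listing earlier in the thread.
configured = {"3.5TBUpperData": 1960835072, "5.6TBUpperSlice0": 1953462272}
labeled = {"3.5TBUpperData": 1953462272, "5.6TBUpperSlice0": 1953462272}

for message in check_config_against_labels(configured, labeled):
    print(message)
```

With these numbers only 3.5TBUpperData is reported, which matches the LUN named in the kernel log.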
axello
fully protected

Joined: 18 Sep 2006
Posts: 12

Posted: Tue Sep 19, 2006 2:30 pm    Post subject: Solved!

Update
We were able to recover the SAN by redefining the labels with the original cvlabel configuration file. Some people convinced us that this would not cause irreparable damage, so we took the plunge.

After that, and after restoring the old store config file, cvfsck gave no errors and the SAN mounted correctly. Then we updated to 1.3: no problems, the LUN labels stayed the same, and the SAN mounted automagically.
We then upgraded to Xsan 1.4. That also went well. So it seems something went wrong with the LUN labels during the direct upgrade from 1.2 to 1.4, and it went well when we did it in smaller steps...

So in short:

Uninstall 1.4
Reinstall 1.1 + 1.2
Restore the old 1.2 config
Change the cvlabels back to their original size
1.3 update + test: store mounts OK
1.4 update: also mounts OK.
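
The recovery above depended on still having the old config directory and the original label definitions. A minimal sketch of a pre-upgrade snapshot, assuming Python 3.7 or later is available on the MDC. The snapshot function and the config path in the usage note are assumptions for illustration; only cvlabel -l is taken from this thread. The saved listing is a record for reference, not the label file that cvlabel takes as input.

```python
import shutil
import subprocess
from datetime import datetime
from pathlib import Path

def snapshot(config_dir, dest_root, label_cmd=("cvlabel", "-l")):
    """Copy the config directory and save the label listing in a timestamped folder."""
    stamp = datetime.now().strftime("xsan-snapshot-%Y%m%d-%H%M%S")
    dest = Path(dest_root) / stamp
    # copytree creates dest and dest/config; it fails if the target already exists.
    shutil.copytree(config_dir, dest / "config")
    # check=True raises if the label command exits with an error.
    listing = subprocess.run(
        list(label_cmd), capture_output=True, text=True, check=True
    ).stdout
    (dest / "cvlabel-l.txt").write_text(listing)
    return dest
```

It might be called as snapshot("/Library/Filesystems/Xsan/config", "/Users/Shared") before each update step, where both paths are assumptions about a typical install and should be adjusted to the actual system.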

Thanks to all who helped.

Case closed.
aaron
Site Admin

Joined: 19 Mar 2005
Posts: 405

Posted: Tue Sep 19, 2006 3:15 pm

Congratulations! Let me know if you are in the market for a good backup solution.
_________________
Aaron Freimark
http://www.tekserve.com/vcard/af.vcf
axello
fully protected

Joined: 18 Sep 2006
Posts: 12

Posted: Wed Sep 20, 2006 2:20 am

:-)

They have now bought a DLT2 tape archive solution from Quantum. (Note: I explicitly don't say: backup solution).

This is mainly to free up space on the san.
All times are GMT - 5 Hours