Xsanity Sanity for Apple's Xsan and Final Cut Server.
  
Friday, September 03 2010 @ 05:37 AM EDT
Xsan 2.1.1 volume corrupting itself- doubly allocated inodes
proton
Could work for Apple


Joined: 29 Nov 2007
Posts: 48

Posted: Mon May 04, 2009 3:21 am    Post subject: Xsan 2.1.1 volume corrupting itself- doubly allocated inodes

Hello,

I have a strange problem where an Xsan volume is constantly corrupting itself.

A little background on the problem: a couple of weeks ago I had a crash. After that I ran cvfsck to correct the problems. The check found a free list warning that could not be corrected without the -C flag. I did that and made sure the volume was clean by running cvfsck with -nv and -wv multiple times afterwards.

A week later I decided to check whether the volume had healed itself. Sadly, no. I don't get the free list warning anymore, but some files show doubly allocated inodes. If I run cvfsck with -wv those files disappear, as they seem to be treated as orphans. In most cases these errors seem to occur on large files (3-30 GB).

I'm not sure if it is related, but the system log also reports "add_to_free_list: inode 0x7f8000000001ac failed lookup" every time I start the volume. However, the volume starts just fine and works normally.

I have another volume on the same MDC and it works without such problems.

Any hint on where to look for the problem would be appreciated. I can post detailed logs and system details if needed.
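
The check-then-repair order described in this post can be sketched as a small Python helper. It only builds the command lines and never runs them. The flags -nv and -wv are the ones quoted above; the function name `cvfsck_plan` is hypothetical, and the volume name `EditSAN` is borrowed from the log later in this thread. The -C clobber step is left out on purpose, since the thread does not show how it is combined with other flags.

```python
def cvfsck_plan(volume, repair=False):
    """Return the cvfsck command lines to run, in order, as argv lists.

    Hypothetical helper: it only builds the commands, it never executes them.
    """
    plan = [["cvfsck", "-nv", volume]]           # -nv as used in the thread: check only
    if repair:
        plan.append(["cvfsck", "-wv", volume])   # -wv as used in the thread: apply fixes
        plan.append(["cvfsck", "-nv", volume])   # check again to confirm the result
    return plan


# Print the planned commands for a repair run on the example volume.
for argv in cvfsck_plan("EditSAN", repair=True):
    print(" ".join(argv))
```

Printing the plan first makes it easy to review exactly what would be run before touching the metadata.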
lotte
Xsan Master


Joined: 11 Dec 2008
Posts: 164

Posted: Mon May 04, 2009 4:00 am

Hi Proton, I would start by looking for defective hard disks on any of the RAIDs used for this volume. I've seen this behaviour before; it turned out to be a bad HD that did not put the logical disk into degraded mode.
Replacing the HD fixed the problem.

Lotte
proton
Could work for Apple


Joined: 29 Nov 2007
Posts: 48

Posted: Mon May 04, 2009 4:03 am

OK. How exactly should I do this? Is there some kind of program to check for hardware HDD problems?

Swapping every single disk in 5 LUNs until this is fixed is not an option, I'm afraid :)
lotte
Xsan Master


Joined: 11 Dec 2008
Posts: 164

Posted: Mon May 04, 2009 4:19 pm

Of course not. What kind of RAIDs are you using? Normally you should see anything strange that happens to the disks in the log file.

Lotte
proton
Could work for Apple


Joined: 29 Nov 2007
Posts: 48

Posted: Mon May 04, 2009 4:34 pm

Apple Xserve RAIDs. Sadly, I didn't spot anything strange in the Xsan or RAID logs.
lotte
Xsan Master


Joined: 11 Dec 2008
Posts: 164

Posted: Tue May 05, 2009 2:50 am

Well, then I would suggest backing up all the data and recreating the filesystem...
That could be more helpful than corrupting the filesystem even more.

Lotte
proton
Could work for Apple


Joined: 29 Nov 2007
Posts: 48

Posted: Tue May 05, 2009 4:45 am

Unfortunately, I don't have anywhere to back up that amount of data. Maybe someone knows a good way to identify a broken disk? How about Verify in the RAID Admin utility?
MattG
Xsan Master


Joined: 15 Apr 2005
Posts: 422

Posted: Tue May 05, 2009 7:53 am

You need to do it with your eyes, and hopefully one of your LUNs will reveal itself as the bad one. Do a large I/O operation (a large file copy) so you can see all drive activity. The bad LUN will usually have an abnormal activity pattern, typically with one drive showing activity when the others are off, or vice versa. Bring a chair to the equipment room. ;)
proton
Could work for Apple


Joined: 29 Nov 2007
Posts: 48

Posted: Tue May 05, 2009 7:59 am

Hehe :)

In fact, I have such a drive in the metadata LUN for that volume (ouch!). It's the oldest drive in the entire SAN, with more than 40K hours of work. During big operations on the volume it flashes almost constantly, even though the other drive in that mirror is a lot quieter. Could that be it? If so, what's the proper procedure for replacing a drive in a metadata LUN?

Edit: maybe I should run Verify from RAID Admin on the LUN?
MattG
Xsan Master


Joined: 15 Apr 2005
Posts: 422

Posted: Tue May 05, 2009 8:44 am

Well, the tough love is that no Xserve RAID drive with 40K hours (which dates it back to 2004!) should be spinning, anywhere.
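
As a rough check of that date, here is a minimal Python sketch of the arithmetic. It assumes the drive has been spinning continuously and uses this post's timestamp as "now"; both are simplifications, and the variable names are illustrative.

```python
from datetime import datetime, timedelta

POWER_ON_HOURS = 40_000                   # figure quoted for the oldest drive
posted = datetime(2009, 5, 5, 8, 44)      # timestamp of this post, used as "now"

# Assumption: the drive has been powered on without interruption.
first_power_on = posted - timedelta(hours=POWER_ON_HOURS)
years = POWER_ON_HOURS / (24 * 365.25)

print(f"{years:.1f} years of spinning, first powered on around {first_power_on:%B %Y}")
```

Any downtime would push the real in-service date even earlier, so 2004 is the latest plausible year under these assumptions.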

I would stop the volume, do an snmetadump to get a complete image of the metadata LUN into a file, and then, while still stopped, pull the drive and replace it with a new one (We still have 9 750GB drives! Couldn't resist a plug.).

Wait for the rebuild to complete before starting the volume again. This could take between 2 and 8 hours.

Fire it back up and see where you are. If all goes awry, you can always build yourself a new RAID1 and load the dumped metadata image onto it.
proton
Could work for Apple


Joined: 29 Nov 2007
Posts: 48

Posted: Tue May 05, 2009 8:58 am

I'm not sure I have more than 4 hours at night to do this. So if the rebuild does not complete in time, can I start the volume while it is still in progress?
lotte
Xsan Master


Joined: 11 Dec 2008
Posts: 164

Posted: Tue May 05, 2009 3:05 pm

@MattG, excellent hint! I will remember that!

Lotte, who assumes that 4 hours will not be enough, and so would suggest not starting the volume until all is fine.
cedge318
Could work for Apple


Joined: 12 Jan 2009
Posts: 47

Posted: Fri May 15, 2009 12:03 pm    Post subject: clobber

I'm assuming this was a physical disk issue. But if not:
There's been no mention of fragmentation. There are two reasons an inode might not be written: metadata storage and target storage. With regard to metadata storage, the focus has been on the physical disks and on using cvfsck to repair and clobber. I'd also check the volume itself for fragmentation. If the process that determines where data will reside cannot complete, then the inode should not be written. "Should" being the operative word, as I have seen duplicate inodes also happen in cases of heavy fragmentation, something else mentioned in the original post. Having said this, I usually see a different error altogether when that's the problem, but it's worth mentioning...
proton
Could work for Apple


Joined: 29 Nov 2007
Posts: 48

Posted: Thu Jun 04, 2009 3:41 am

I've replaced what looked like a failing disk, but it didn't help. Multiple orphan/directory errors still appear on the volume during cvfsck checks.

Can anyone comment on the "add_to_free_list: inode 0x7f8000000001ac failed lookup" error when starting a volume? I'm almost positive that this could be the problem, because I don't see this error on other volumes. Is there some kind of tool I can use to check where exactly that inode is (LUN or disk)? Or at least which data files lie under it?
proton
Could work for Apple


Joined: 29 Nov 2007
Posts: 48

Posted: Fri Jun 05, 2009 3:43 pm

OK. I found out that the "add_to_free_list: inode 0x7f8000000001ac failed lookup" error is caused by the clobbering command, because the inode number changes after every cvfsck -C. If I'm correct, it seems that clobbering somehow reserves the bad free list data under an inode and makes that inode unavailable for later use. Hence the error message.
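
One way to track that observation across restarts is to pull the inode numbers out of the system log. The following is a minimal Python sketch; the function name `failed_lookup_inodes` is hypothetical, and the sample uses only the message text quoted in this thread (the real log lines will carry a timestamp and process prefix, which the pattern ignores).

```python
import re

# Matches the message quoted in the thread and captures the inode number.
FAILED_LOOKUP = re.compile(r"add_to_free_list: inode (0x[0-9a-fA-F]+) failed lookup")


def failed_lookup_inodes(log_lines):
    """Return the distinct inode numbers from failed-lookup messages, in first-seen order."""
    seen = []
    for line in log_lines:
        match = FAILED_LOOKUP.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


# Sample input: the same message logged on two volume starts.
sample = [
    "add_to_free_list: inode 0x7f8000000001ac failed lookup",
    "add_to_free_list: inode 0x7f8000000001ac failed lookup",
]
print(failed_lookup_inodes(sample))
```

If a new inode number shows up in the list only after a cvfsck -C run, that would support the idea that clobbering is what moves the error.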

Now let's get back to the data fragmentation mentioned earlier. There is one thing I didn't tell you: we had to expand the volume a while ago, so I added a 4th LUN (4 TB) to the existing storage pool (8 TB). It was strange at first, because ~300 GB went missing during the expansion. After that I ran snfsdefrag -dr /Volumes/Volumename. It failed 3 times during the operation, but I kept resuming it. Then I ran cvfsck -wv Volumename to see if everything was OK. That was the first time I saw these doubly allocated inode errors. But the check corrected all the errors and I got my 300 GB back!

The question is: could the volume expansion and/or the snfsdefrag operation corrupt metadata in such a way that cvfsck is unable to fix it? (ACL corruption deja vu, anyone?) :)

P.S. Oh, and in case it is important: File System Status *always* reports Clean, even if there are doubly allocated/orphan inodes on the volume.

ANOTHER UPDATE: I ran snfsdefrag on the volume for a whole weekend. Some of the files had thousands of extents, but fragmentation was generally OK: 99% of files had only 1 extent. Nevertheless, the process crashed my volume 5-6 times with the same error messages I got when I ran it the first time after the volume expansion:

Jun 6 00:18:08 xserve3 fsm[29592]: Xsan FSS 'EditSAN[0]': PANIC: /Library/Filesystems/Xsan/bin/fsm ASSERT failed "ip->i_extender_hint == L64(0) || ip->i_idinode.idi_extents[NUMEXTENTS - 1].idiext_flags == ExtentFlagExtender" file inode.c, line 11590
Jun 6 00:18:08 xserve3 KernelEventAgent[99]: tid 00000000 received VQ_NOTRESP event (1)
Jun 6 00:18:08 xserve3 KernelEventAgent[99]: tid 00000000 type 'acfs', mounted on '/Volumes/EditSAN', from '/dev/disk12', not responding
Jun 6 00:18:08 xserve3 KernelEventAgent[99]: tid 00000000 found 1 filesystem(s) with problem(s)
Jun 6 00:18:08 xserve3 loginwindow[98]: 1 server now unresponsive
Jun 6 00:18:08 xserve3 fsm[29592]: PANIC: /Library/Filesystems/Xsan/bin/fsm ASSERT failed "ip->i_extender_hint == L64(0) || ip->i_idinode.idi_extents[NUMEXTENTS - 1].idiext_flags == ExtentFlagExtender" file inode.c, line 11590
Jun 6 00:18:08 xserve3 fsm[29592]: Xsan FSS 'EditSAN[0]': PANIC: aborting threads now.
Jun 6 00:18:10 xserve3 kernel[0]: Reconnecting to FSS 'EditSAN'
Jun 6 00:18:10 xserve3 kernel[0]: No FSS registered with PortMapper on host 10.5.5.3, retrying...
Jun 6 00:18:20 xserve3 ReportCrash[29842]: Formulating crash report for process fsm[29592]
Jun 6 00:18:21 xserve3 fsmpm[145]: PortMapper: FSS 'EditSAN' disconnected.
Jun 6 00:18:21 xserve3 fsmpm[145]: PortMapper: kicking diskscan_thread -264712192.
Jun 6 00:18:21 xserve3 fsmpm[145]: Portmapper: FSS 'EditSAN' (pid 29592) exited on signal 6
Jun 6 00:18:21 xserve3 ReportCrash[29842]: Saved crashreport to /Library/Logs/CrashReporter/fsm_2009-06-06-001819_xserve3.crash using uid: 0 gid: 0, euid: 0 egid: 0
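
To count how often fsm hits this assertion, the log can be reduced to one entry per panic. Below is a minimal Python sketch that works on the two ASSERT lines from the excerpt above; the regular expression and the function name `fsm_panics` are assumptions based only on the line format shown here.

```python
import re

# Timestamp, host, fsm pid, then the assertion text, source file and line number.
ASSERT_LINE = re.compile(
    r'^(?P<when>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+fsm\[(?P<pid>\d+)\]:'
    r'.*ASSERT failed "(?P<expr>.*)" file (?P<file>\S+), line (?P<line>\d+)'
)


def fsm_panics(log_text):
    """Return one (when, pid, file, line) tuple per distinct fsm assertion failure."""
    events = []
    for raw in log_text.splitlines():
        match = ASSERT_LINE.match(raw.strip())
        if not match:
            continue
        event = (match["when"], int(match["pid"]), match["file"], int(match["line"]))
        if event not in events:   # the same panic appears twice in the excerpt
            events.append(event)
    return events


SAMPLE = r'''
Jun 6 00:18:08 xserve3 fsm[29592]: Xsan FSS 'EditSAN[0]': PANIC: /Library/Filesystems/Xsan/bin/fsm ASSERT failed "ip->i_extender_hint == L64(0) || ip->i_idinode.idi_extents[NUMEXTENTS - 1].idiext_flags == ExtentFlagExtender" file inode.c, line 11590
Jun 6 00:18:08 xserve3 fsm[29592]: PANIC: /Library/Filesystems/Xsan/bin/fsm ASSERT failed "ip->i_extender_hint == L64(0) || ip->i_idinode.idi_extents[NUMEXTENTS - 1].idiext_flags == ExtentFlagExtender" file inode.c, line 11590
'''

for when, pid, src, line in fsm_panics(SAMPLE):
    print(f"{when}  fsm pid {pid}  assert at {src}:{line}")
```

Run against the full system log, the length of the returned list gives the number of distinct crashes, which can be compared with the 5-6 crashes seen over the weekend.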

Also, after defragmenting, cvfsck showed another bunch of files with orphan inodes. These files were old and had been rewritten by snfsdefrag, I suppose.

UPDATE 2: I had time to run the defragmentation a second time. This time it completed successfully and didn't crash the volume. But when I ran cvfsck afterwards, I again found some corrupted inodes, even though the volume was clean before the defrag. So is it now safe to say that this is not a fragmentation problem?
All times are GMT - 5 Hours
Page 1 of 3