Xsanity: Sanity for Apple's Xsan and Final Cut Server.
  
Wednesday, May 22 2013 @ 11:58 AM EDT
Finder crashes after xsan finder copy, Apple is stumped
dmastroluca
fully protected

Joined: 12 Nov 2009
Posts: 14

Posted: Fri Aug 06, 2010 11:40 am    Post subject: Finder crashes after xsan finder copy, Apple is stumped

I have been seeing a problem on my two Xsan environments. Once or twice a week, an Xsan client will become unresponsive after a Finder copy or file transfer. When it happens on the client, the mounted Xsan volume is slow to display its contents, or will show nothing at all even though there is 12TB of data there. When the client is in this state, if I do a "get info" on the mounted volume, the "more info" window says "Fetching" and there is a spinning gear. The client never recovers from this state, and has to be power cycled to break free of this funk. If I go into Xsan Admin and try to unmount the volume from the problem client, it will not let me unmount it. I have seen this on my Xserve and Mac Pro clients. I am running the current software versions of the OS and Xsan. I first experienced this issue on one SAN environment, then noticed it on a completely different SAN. I have uninstalled and reinstalled the OS and Xsan on the clients and on the MDCs, to no avail. I have destroyed one of my Xsan volumes and rebuilt it, and the problem came back.

I have been working with high level Apple XSAN engineers on this problem since May and they have no idea why this is happening. I have been sending them EDC reports after the issue and all the logs are fine.

Has anyone else experienced anything like this? Sorry for the long post, but this issue is driving me crazy.

nrausch
Xsan Master

Joined: 14 Sep 2007
Posts: 202

Posted: Fri Aug 06, 2010 3:52 pm

I did experience this in two different locations.

Once it was bad Fiber between MDC and Switch.
The other time we were able to isolate it to a bad SFP.
It was going in and out, so it wasn't always busted.

But when it did go out....

You could still see the Volume and maybe some Folders, because the Metadata network thought things were still connected. You couldn't usually see files. If you clicked on anything it would lock up and I would have to force reboot everything.

After shutting down all clients and still getting the symptoms at the rack with servers and storage only... we just replaced all the copper Fibre Channel cables in the rack. The problem never came back.

We found it had to be hardware after doing what you did, reinstalling and rebuilding...

MattG
Xsan Master

Joined: 15 Apr 2005
Posts: 456

Posted: Fri Aug 06, 2010 6:08 pm

As usual, Apple's ability to only look at software is what's leading you astray. To pick up on what nrausch mentioned, also look at any kind of errors registering on your fibre channel switches on the ports that are connected to those clients, specifically decode and CRC Errors. Look for inconsistencies in connectivity on your metadata network. Make sure the metadata network switch itself is not going bad. Connectivity is likely your issue.

abstractrude
Xsan Master

Joined: 13 Mar 2008
Posts: 863

Posted: Mon Aug 09, 2010 12:17 pm

I know this makes me an xsan snob or whatever but I no longer recommend the copper cables. They are fine for small environments but it seems whenever I cable them up in racks a few go bad and cause more headaches than they are worth.

dmastroluca
fully protected

Joined: 12 Nov 2009
Posts: 14

Posted: Mon Aug 09, 2010 3:03 pm    Post subject: Another possible cause?????

Both myself and the Apple engineers looked at my switch logs and they came back clean. I even replaced a metadata ethernet switch with no results.

Could this be another possible cause? Permissions???

I can make it happen on my production SAN by doing a Finder copy while logged in as "admin". I am asked to authenticate when I am copying to the Xsan volume, it starts the copy, gets almost to the end, and then the client goes wonky. After examining the permissions (via Xsan Admin file management) on the Xsan volume, I saw that the admin account I am using does not have RW access on the ACL of the volume. Instead of admin being on the ACL with RW access, there is a user called FFFFEEEEE-DDDD-CCCC-BBBB-AAAA820480 that has RW access. What the hell is this??

abstractrude
Xsan Master

Joined: 13 Mar 2008
Posts: 863

Posted: Mon Aug 09, 2010 3:45 pm

are all your machines pointing to the same directory?

dmastroluca
fully protected

Joined: 12 Nov 2009
Posts: 14

Posted: Mon Aug 09, 2010 4:05 pm

Yes, OD and DNS configs are top notch.

lotte
Xsan Master

Joined: 11 Dec 2008
Posts: 190

Posted: Mon Aug 09, 2010 4:11 pm

Have you tried changing all your SFP modules (in case you don't use copper)?
If you are using copper I would also, like nrausch, recommend switching to fibre cables!
I once had the same problem and it was indeed an SFP module... In our case it was one plugged into the RAID system, so there were no errors on the switch...

Lotte

ACSA
Xsan Master

Joined: 28 Jan 2007
Posts: 104

Posted: Tue Aug 10, 2010 8:22 am

abstractrude wrote:
I know this makes me an xsan snob or whatever but I no longer recommend the copper cables. They are fine for small environments but it seems whenever I cable them up in racks a few go bad and cause more headaches than they are worth.


I'll have to agree, the copper cables are more prone to failure and give a lot of headaches... Just replace those cables with fibre...

mark raudonis
Xsan Master

Joined: 23 Sep 2005
Posts: 123

Posted: Tue Aug 10, 2010 10:27 am

Over the past five years we HAVE had a couple of copper cables go bad. My question is "HOW DOES THAT HAPPEN"? Nobody touches them. They just sit there. It's not high voltage going through. Are there any electrical components within the connectors? If not, how does a piece of copper "go bad?" Anybody have a theory?

Mark

abstractrude
Xsan Master

Joined: 13 Mar 2008
Posts: 863

Posted: Tue Aug 10, 2010 6:07 pm

i have no idea! :( i'm thinking that they don't bend well. it's funny, one of the reasons i used to like the copper cables over optical was because i thought they were more durable. turns out optical cables are pretty durable. i treat them pretty badly on the client machines all the time.

i remember one guy told me they are affected by lighting like fluorescent lights and stuff, but he was one guy.

MattG
Xsan Master

Joined: 15 Apr 2005
Posts: 456

Posted: Wed Aug 11, 2010 7:15 am

Let's steer this back on topic. Permissions are almost never to blame for something like this. Even though, again, you've looked at logs that come back clean, you'd have to be doing serious I/O on each connection of the fibre channel switch, setting baselines, and then looking at _realtime_ stats to see if there is a bad connection. Logs are not going to help you. If you have qlogic switching, you can see this in their GUI. This will often lead to where the bad connectivity is. It's up to you to then further troubleshoot whether it's bad HBA port, bad HBA transceiver, bad cabling, bad switch transceiver or bad switch port.
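A minimal sketch of the baseline-then-compare approach described above, in Python. It assumes you have already copied the per-port error counters out of your switch's management interface by hand, once before and once after pushing heavy I/O through each connection. The counter names (`crc_errors`, `decode_errors`), the port numbers, and all sample values are made up for illustration; this does not talk to any real switch.

```python
from typing import Dict, List, Tuple

# Hypothetical counter names; a real switch will label these its own way.
ERROR_COUNTERS = ("crc_errors", "decode_errors")

# port number -> counter name -> counter value
Snapshot = Dict[int, Dict[str, int]]


def ports_with_new_errors(baseline: Snapshot, current: Snapshot) -> List[Tuple[int, str, int]]:
    """Return (port, counter, increase) for every error counter that grew."""
    findings = []
    for port in sorted(current):
        before = baseline.get(port, {})
        for name in ERROR_COUNTERS:
            delta = current[port].get(name, 0) - before.get(name, 0)
            if delta > 0:
                findings.append((port, name, delta))
    return findings


# Made-up numbers: one snapshot taken while idle, one after a heavy copy.
baseline = {
    0: {"crc_errors": 2, "decode_errors": 0},
    1: {"crc_errors": 0, "decode_errors": 0},
}
after_load = {
    0: {"crc_errors": 2, "decode_errors": 0},
    1: {"crc_errors": 17, "decode_errors": 4},
}

for port, name, delta in ports_with_new_errors(baseline, after_load):
    print(f"port {port}: {name} rose by {delta} under load")
```

A port whose counters only climb while it is carrying traffic is the one to chase further: HBA port, HBA transceiver, cable, switch transceiver, or switch port.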

Further, seeing a bogus user as you mention is clear evidence that at least that machine you mentioned is not properly bound to the directory, or bad permissions exist on the volume.

dmastroluca
fully protected

Joined: 12 Nov 2009
Posts: 14

Posted: Thu Aug 12, 2010 12:14 pm

FYI, we are all fiber with SFP's.

But the plot thickens on the user called FFFFEEEEE-DDDD-CCCC-BBBB-AAAA820480. I created a test folder on the san. I added my admin (501) to the ACL with RW permissions. When I checked back later on the permissions of that test folder the admin user was replaced by a user called FFFFEEEEE-DDDD-CCCC-BBBB-AAAA820480 with RW access. Again, not sure if this is causing THE problem, but this does not look kosher. Why would the admin user change to garbage?

MattG
Xsan Master

Joined: 15 Apr 2005
Posts: 456

Posted: Thu Aug 12, 2010 7:40 pm

The long string of characters might be the GUID of the 501 user. Simply add the 501 user to your generic San Users group in the OD and that part should go away.
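For reference, a small Python sketch of how a placeholder string like this could map to a UID. It assumes the Mac OS X convention in which an account with no stored UUID is shown as the fixed prefix `FFFFEEEE-DDDD-CCCC-BBBB-AAAA` followed by the numeric UID as eight hex digits; nothing in this thread confirms that is what happened here, and the function names are hypothetical. The string exactly as posted above is not a well-formed 8-4-4-4-12 UUID, so the sketch declines to decode it rather than guess.

```python
import re
from typing import Optional

# Assumed convention: fixed prefix, then the UID as 8 hex digits.
PLACEHOLDER_PREFIX = "FFFFEEEE-DDDD-CCCC-BBBB-AAAA"
_PLACEHOLDER_RE = re.compile(
    r"^FFFFEEEE-DDDD-CCCC-BBBB-AAAA([0-9A-F]{8})$", re.IGNORECASE
)


def placeholder_uuid_for_uid(uid: int) -> str:
    """Build the placeholder UUID string for a numeric UID."""
    if not 0 <= uid <= 0xFFFFFFFF:
        raise ValueError("uid must fit in 32 bits")
    return f"{PLACEHOLDER_PREFIX}{uid:08X}"


def uid_from_placeholder_uuid(text: str) -> Optional[int]:
    """Return the UID encoded in a placeholder UUID, or None if it does not match."""
    match = _PLACEHOLDER_RE.match(text.strip())
    return int(match.group(1), 16) if match else None


print(placeholder_uuid_for_uid(501))
# FFFFEEEE-DDDD-CCCC-BBBB-AAAA000001F5
print(uid_from_placeholder_uuid("FFFFEEEE-DDDD-CCCC-BBBB-AAAA000001F5"))
# 501
print(uid_from_placeholder_uuid("FFFFEEEEE-DDDD-CCCC-BBBB-AAAA820480"))
# None (malformed as posted, so it is not decoded)
```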

rstasel
Xsan Master

Joined: 03 Aug 2007
Posts: 120

Posted: Sun Aug 15, 2010 8:36 pm

I've seen these ACL results before with 10.6.

So, my questions would be:

What OS are the clients/MDCs running? And where is the OD Master in the whole process?

I kept getting these same weird ACL results on my setup since my OD Master wasn't my primary MDC (it's my backup). The fix was to configure my primary MDC to be an OD replica. It seems that xsan is doing some part of the permissions mapping in Xsan 2.2. That said, a 10.5 Xsan client showed ACLs perfectly. It was only with the 10.6 Xsan clients. There also was the issue that some user accounts didn't have UUIDs (they'd been around since before OD).

That said, I never saw anything like you're seeing. So, guess I'm just helping clear up the potentially misleading problem so you can focus back on the main one.

Have a post about it on my site here: http://www.staze.org/10-6-server-xsan-2-2-1-and-acl-oddities/
All times are GMT - 5 Hours
Page 1 of 6

 