| Author |
Message |
dmastroluca fully protected

Joined: 12 Nov 2009 Posts: 14
|
Posted: Fri Aug 06, 2010 11:40 am Post subject: Finder crashes after xsan finder copy, Apple is stumped |
|
|
I have been seeing a problem on my two Xsan environments. Once or twice a week, an Xsan client becomes unresponsive after a Finder copy or file transfer. When it happens, the mounted Xsan volume is slow to display its contents on that client, or shows nothing at all even though there is 12TB of data there. While the client is in this state, if I do a "Get Info" on the mounted volume, the "More Info" section says "Fetching" and shows a spinning gear. The client never recovers from this state and has to be power cycled. If I go into Xsan Admin and try to unmount the volume from the problem client, it will not let me unmount it. I have seen this on both my Xserve and Mac Pro clients. I am running the current software versions of the OS and Xsan. I first experienced this issue on one SAN environment, then noticed it on a completely different SAN. I have uninstalled and reinstalled the OS and Xsan on the clients and on the MDCs, to no avail. I have also destroyed one of my Xsan volumes and rebuilt it, and the problem came back.
I have been working with high-level Apple Xsan engineers on this problem since May and they have no idea why this is happening. I have been sending them EDC reports after each occurrence and all the logs are fine.
Has anyone else experienced anything like this? Sorry for the long post, but this issue is driving me crazy. |
|
|
 |
nrausch Xsan Master

Joined: 14 Sep 2007 Posts: 202
|
Posted: Fri Aug 06, 2010 3:52 pm Post subject: |
|
|
I did experience this in two different locations.
Once it was bad Fiber between MDC and Switch.
The other time we were able to isolate it to a bad SFP.
It was going in and out, so it wasn't always busted.
But when it did go out....
You could still see the volume and maybe some folders, because the metadata network thought things were still connected. You usually couldn't see files. If you clicked on anything it would lock up and I would have to force reboot everything.
After shutting down all clients and still getting the symptoms at the rack with only the servers and storage running, we just replaced all the copper Fibre Channel cables in the rack. The problem never came back.
We found it had to be hardware after doing what you did, reinstalling and rebuilding... |
|
|
 |
MattG Xsan Master

Joined: 15 Apr 2005 Posts: 456
|
Posted: Fri Aug 06, 2010 6:08 pm Post subject: |
|
|
| As usual, Apple's habit of looking only at software is what's leading you astray. To pick up on what nrausch mentioned, also look for errors registering on your Fibre Channel switches on the ports connected to those clients, specifically decode and CRC errors. Look for inconsistencies in connectivity on your metadata network. Make sure the metadata network switch itself is not going bad. Connectivity is likely your issue. |
|
|
 |
abstractrude Xsan Master

Joined: 13 Mar 2008 Posts: 860
|
Posted: Mon Aug 09, 2010 12:17 pm Post subject: |
|
|
| I know this makes me an xsan snob or whatever but I no longer recommend the copper cables. They are fine for small environments but it seems whenever I cable them up in racks a few go bad and cause more headaches than they are worth. |
|
|
 |
dmastroluca fully protected

Joined: 12 Nov 2009 Posts: 14
|
Posted: Mon Aug 09, 2010 3:03 pm Post subject: Another possible cause????? |
|
|
Both the Apple engineers and I looked at my switch logs and they came back clean. I even replaced a metadata Ethernet switch, with no change.
Could this be another possible cause? Permissions???
I can make it happen on my production SAN by doing a Finder copy while logged in as "admin". I am asked to authenticate when copying to the Xsan volume; the copy starts, gets almost to the end, and then the client goes wonky. After examining the permissions (via Xsan Admin file management) on the Xsan volume, I saw that the admin account I am using does not have RW access on the ACL of the volume. Instead of admin being on the ACL with RW access, there is a user called FFFFEEEEE-DDDD-CCCC-BBBB-AAAA820480 that has RW access. What the hell is this?? |
|
|
 |
abstractrude Xsan Master

Joined: 13 Mar 2008 Posts: 860
|
Posted: Mon Aug 09, 2010 3:45 pm Post subject: |
|
|
| Are all your machines pointing to the same directory? |
|
|
 |
dmastroluca fully protected

Joined: 12 Nov 2009 Posts: 14
|
Posted: Mon Aug 09, 2010 4:05 pm Post subject: |
|
|
| Yes, OD and DNS configs are top notch. |
|
|
 |
lotte Xsan Master

Joined: 11 Dec 2008 Posts: 190
|
Posted: Mon Aug 09, 2010 4:11 pm Post subject: |
|
|
Have you tried changing all your SFP modules (in case you don't use copper)?
If you are using copper, I would also, like nrausch, recommend switching to fibre cables!
I once had the same problem and it was indeed an SFP module. In our case it was one plugged into the RAID system, so there were no errors on the switch.
Lotte |
|
|
 |
ACSA Xsan Master

Joined: 28 Jan 2007 Posts: 104
|
Posted: Tue Aug 10, 2010 8:22 am Post subject: |
|
|
| abstractrude wrote: | | I know this makes me an xsan snob or whatever but I no longer recommend the copper cables. They are fine for small environments but it seems whenever I cable them up in racks a few go bad and cause more headaches than they are worth. |
I'll have to agree: the copper cables are more prone to failure and cause a lot of headaches. Just replace those cables with fibre. |
|
|
 |
mark raudonis Xsan Master

Joined: 23 Sep 2005 Posts: 123
|
Posted: Tue Aug 10, 2010 10:27 am Post subject: |
|
|
Over the past five years we HAVE had a couple of copper cables go bad. My question is "HOW DOES THAT HAPPEN"? Nobody touches them. They just sit there. It's not high voltage going through. Are there any electrical components within the connectors? If not, how does a piece of copper "go bad?" Anybody have a theory?
Mark |
|
|
 |
abstractrude Xsan Master

Joined: 13 Mar 2008 Posts: 860
|
Posted: Tue Aug 10, 2010 6:07 pm Post subject: |
|
|
I have no idea! I'm thinking that they don't bend well. It's funny: one of the reasons I used to like the copper cables over optical was that I thought they were more durable. Turns out optical cables are pretty durable. I treat them pretty badly on the client machines all the time.
I remember one guy told me they are affected by lighting, like fluorescent lights and stuff, but he was one guy. |
|
|
 |
MattG Xsan Master

Joined: 15 Apr 2005 Posts: 456
|
Posted: Wed Aug 11, 2010 7:15 am Post subject: |
|
|
Let's steer this back on topic. Permissions are almost never to blame for something like this. Even though, again, you've looked at logs that came back clean, you'd have to be doing serious I/O on each connection of the Fibre Channel switch, setting baselines, and then looking at _realtime_ stats to see if there is a bad connection. Logs are not going to help you. If you have QLogic switching, you can see this in their GUI. This will often lead you to where the bad connectivity is. It's then up to you to troubleshoot further whether it's a bad HBA port, bad HBA transceiver, bad cabling, bad switch transceiver or bad switch port.
Further, seeing a bogus user as you mention is clear evidence that at least that machine you mentioned is not properly bound to the directory, or bad permissions exist on the volume. |
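For anyone who would rather script the baseline-and-compare idea than watch a GUI, here is a minimal sketch in Python. It assumes you have already captured two snapshots of per-port counters by whatever means your switch offers. The `PortCounters` fields, the `compare_snapshots` name and the `min_frames` threshold are all made up for illustration and are not part of any switch vendor's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortCounters:
    """Hypothetical snapshot of one switch port's cumulative counters."""
    crc_errors: int
    decode_errors: int
    frames: int  # total frames seen, used as a rough measure of I/O


def compare_snapshots(baseline, current, min_frames=1_000_000):
    """Compare two {port: PortCounters} snapshots taken before and after load.

    Returns (suspects, untested):
      suspects  maps port -> (crc_delta, decode_delta) where either error
                counter rose between the snapshots
      untested  lists ports with no new errors but too little traffic
                (fewer than min_frames frames) to call them healthy
    """
    suspects = {}
    untested = []
    for port in sorted(current):
        before = baseline.get(port)
        if before is None:
            continue  # no baseline for this port, nothing to compare
        now = current[port]
        crc_delta = now.crc_errors - before.crc_errors
        decode_delta = now.decode_errors - before.decode_errors
        frames_delta = now.frames - before.frames
        if crc_delta > 0 or decode_delta > 0:
            suspects[port] = (crc_delta, decode_delta)
        elif frames_delta < min_frames:
            untested.append(port)
    return suspects, untested
```

A port that lands in `suspects` only tells you which link to look at; you still have to swap the HBA, transceivers and cable one at a time to find the bad part. Ports in `untested` need to be pushed with real I/O and checked again.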
|
|
 |
dmastroluca fully protected

Joined: 12 Nov 2009 Posts: 14
|
Posted: Thu Aug 12, 2010 12:14 pm Post subject: |
|
|
FYI, we are all fiber with SFPs.
But the plot thickens on the user called FFFFEEEEE-DDDD-CCCC-BBBB-AAAA820480. I created a test folder on the san. I added my admin (501) to the ACL with RW permissions. When I checked back later on the permissions of that test folder the admin user was replaced by a user called FFFFEEEEE-DDDD-CCCC-BBBB-AAAA820480 with RW access. Again, not sure if this is causing THE problem, but this does not look kosher. Why would the admin user change to garbage? |
|
|
 |
MattG Xsan Master

Joined: 15 Apr 2005 Posts: 456
|
Posted: Thu Aug 12, 2010 7:40 pm Post subject: |
|
|
| The long string of characters might be the GUID of the 501 user. Simply add the 501 user to your generic San Users group in the OD and that part should go away. |
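A quick way to sanity-check that guess, sketched in Python. It rests on an assumption worth verifying on your own systems: that Mac OS X, when an account has no stored UUID, shows a placeholder of the form FFFFEEEE-DDDD-CCCC-BBBB-AAAA followed by the numeric uid as eight hex digits. The function names below are made up for illustration.

```python
import re

# ASSUMPTION: placeholder UUIDs for users without a stored UUID use this
# fixed prefix followed by the uid as 8 hex digits. Verify before relying on it.
USER_PREFIX = "FFFFEEEE-DDDD-CCCC-BBBB-AAAA"
_PLACEHOLDER = re.compile(
    r"^FFFFEEEE-DDDD-CCCC-BBBB-AAAA([0-9A-F]{8})$", re.IGNORECASE
)


def placeholder_uuid_for_uid(uid):
    """Build the placeholder UUID string expected for a numeric uid."""
    if not 0 <= uid <= 0xFFFFFFFF:
        raise ValueError("uid must fit in 32 bits")
    return f"{USER_PREFIX}{uid:08X}"


def uid_from_placeholder(text):
    """Return the uid encoded in a placeholder UUID, or None if the string
    is not a well-formed placeholder of the assumed shape."""
    match = _PLACEHOLDER.match(text.strip())
    return int(match.group(1), 16) if match else None
```

Under that assumption, `placeholder_uuid_for_uid(501)` gives the string you would expect to see on the ACL for the 501 user, and you can compare it against what Xsan Admin actually displays. The string as typed earlier in this thread does not parse as a well-formed UUID, so `uid_from_placeholder` returns `None` for it; copy the value straight from the ACL listing before drawing conclusions.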
|
|
 |
rstasel Xsan Master

Joined: 03 Aug 2007 Posts: 120
|
Posted: Sun Aug 15, 2010 8:36 pm Post subject: |
|
|
I've seen these ACL results before with 10.6.
So, my questions would be:
What OS are the clients/MDCs running? And where is the OD Master in the whole process?
I kept getting these same weird ACL results on my setup since my OD Master wasn't my primary MDC (it's my backup). The fix was to configure my primary MDC to be an OD replica. It seems that xsan is doing some part of the permissions mapping in Xsan 2.2. That said, a 10.5 Xsan client showed ACLs perfectly. It was only with the 10.6 Xsan clients. There also was the issue that some user accounts didn't have UUIDs (they'd been around since before OD).
That said, I never saw anything like you're seeing. So, guess I'm just helping clear up the potentially misleading problem so you can focus back on the main one.
Have a post about it on my site here: http://www.staze.org/10-6-server-xsan-2-2-1-and-acl-oddities/ |
|
|
 |
|