"OpHangLimitSecs exceeded VOP-Setattr 183.18 secs"

MaXi-XCeL's picture

Hi Guys,

My XSAN setup is driving me nuts... My MDC's keep crashing on the following error message:

[code]
Oct 25 14:38:08 xsan02 fsm[348]: Xsan FSS 'StudioWorkDisc[1]': PANIC: /Library/Filesystems/Xsan/bin/fsm "OpHangLimitSecs exceeded VOP-Setattr 183.18 secs Conn[1] Thread-0x187fc00 Pqueue-0x404b78 Workp-0x1cf9218 MsgQ-0x1cf9208 Msg-0x1cf9264 now 0x43d5081a1bb34 started 0x43d5076b6961b limit 180 secs. " file queues.c, line 612
Oct 25 14:38:08 xsan02 fsm[348]: PANIC: /Library/Filesystems/Xsan/bin/fsm "OpHangLimitSecs exceeded VOP-Setattr 183.18 secs Conn[1] Thread-0x187fc00 Pqueue-0x404b78 Workp-0x1cf9218 MsgQ-0x1cf9208 Msg-0x1cf9264 now 0x43d5081a1bb34 started 0x43d5076b6961b limit 180 secs.\n" file queues.c, line 612\n
Oct 25 14:38:08 xsan02 fsm[348]: Xsan FSS 'StudioWorkDisc[1]': PANIC: wait 3 secs for journal to flush
Oct 25 14:38:08 xsan02 fsm[348]: Xsan FSS 'StudioWorkDisc[1]': PANIC: aborting threads now.
Oct 25 14:38:17 xsan02 fsmpm[274]: Portmapper: FSS 'StudioWorkDisc' (pid 348) exited on signal 4
Oct 25 14:59:11 xsan02 crashdump[671]: fsm crashed
Oct 25 14:59:11 xsan02 crashdump[671]: crash report written to: /Library/Logs/CrashReporter/fsm.crash.log
[/code]

My current configuration is:
2 MDC's
3 XRAIDs
6 LUNS

Any ideas? I've done all the cable stuff already...

ACSA's picture

Hi, can you give us some more info than you have provided so far?

What OS are the MDCs running?
What version of Xsan?
What are the firmware versions of the RAIDs?
What versions are on the clients?
Is the metadata LUN separate from the data LUN?

etc

Then perhaps we'll know what to do.

MaXi-XCeL's picture

MDC's: Mac OS X Server 10.4.10
XSAN: Version 1.4.1
XRAID FIRMWARE: 1.5/1.50f

Metadata is separate from the data luns.

XRAID 1:
RAID 1 Metadata LUN
RAID 5 Data LUN

XRAID 2:
RAID 5 Data LUN
RAID 5 Data LUN

XRAID 3:
RAID 5 Data LUN
RAID 5 Data LUN

StudioWorkDisc (13.65 TB)
-> metadatapool
- XSERVE1-LUN1-METADATA 465,73 GB
-> storagepool
- XSERVE1-LUN2-RAID5 2,73 TB
-> Any
- XSERVE2-LUN1-RAID5 2,73 TB
- XSERVE2-LUN2-RAID5 2,73 TB
-> Last
- XSERVE2-LUN1-RAID5 2,73 TB
- XSERVE2-LUN2-RAID5 2,73 TB

I know my choice of config and storage pools degrades the way Xsan is meant to be used, but that's down to some historical issues.

donald's picture

From the log it seems you have trouble with metadata.

With [i]cvlabel -l -s[/i] you can check whether the LUN for metadata is available.
From [i]cvadmin[/i] you can check the storage pools (aka stripe groups) with the command [i]show long[/i].
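
For example, run on the MDC it looks roughly like this (just a sketch: the volume name is the one from this thread, and the exact output differs between Xsan versions):

[code]
# List the labelled LUNs this MDC can see; the metadata LUN should show up here
sudo cvlabel -l -s

# Open the Xsan admin shell, then run these at its prompt
sudo cvadmin
  select StudioWorkDisc
  show long
[/code]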

ipott's picture

Does the OpHangLimit message appear BEFORE or AFTER the fsm crashed message?

Do you have file sharing active for the volume on more than one node?

jordan's picture

Maxi,

have you found a fix for this one?

BenT's picture

We have recently started seeing this same behaviour:
XSAN 1.4.1
25 FC-connected users
24TB volume approx 92% full
OSX 10.4.9
MDC is Intel XServe with 2GB RAM
All clients and servers dual-FC connected
The FSM process on the MDC sits at very high CPU usage; when it reaches 100%, the FSM process dies with the error:

[code]
0x180ac00 (**FATAL**) PANIC: /Library/Filesystems/Xsan/bin/fsm "OpHangLimitSecs exceeded VOP-VopLookupV4 183.31 secs Conn[71] Thread-0x1881200 Pqueue-0x405018 Workp-0x959f618 MsgQ-0x959f608 Msg-0x959f664 now 0x43e4a1e3e8923 started 0x43e4a13517fd5 limit 180 secs." file queues.c, line 612
[/code]

Anyone else found a solution to this problem?
We are going to increase MDC RAM, upgrade to 10.4.10 and Xsan 1.4.2, and then maybe try an MDC with faster CPUs.

drocamor's picture

Hey gang,

I'm working with romannumeral5 on his Xsan. We see the "OpHangLimitSecs exceeded" message after the san has already frozen for the users. (approximately 180 seconds later...)

We've escalated up through Applecare and they tell us that the message means that the MDC is unable to write to the metadata LUN for this period of time. When this happens fsm panics and attempts to shut itself down. In our situation the process does not exit and a manual failover must be forced.

We have seen some high latency on the fibre channel and a handful of IO errors in the system.log on the hosting MDC every once in a while. We've replaced the fiber cable, switch port, SFPs, RAID controller, and one disk in the metadata LUN while trying to fix this. Our next steps are to replace the second disk in the LUN and then move the disks and controllers to a new XServe RAID.

We upgraded to 1.4.2 in an attempt to resolve the issue, but this just brought us to a situation where the manual failover does not work reliably. We were unable to fail the volume over when just the two MDCs were online, and since we couldn't stop production outside of our downtime window, we brought the volume back up without knowing whether we would be able to fail over. There was a panic yesterday and we were able to fail over; I attribute the earlier trouble to bad luck while we were testing. We've got an AppleCare ticket open for this as well as one for the original panics.

BenT, I would hope that you're not having a similar experience to ours. Does your volume fail over after the panic? I would check all the fiber connections and look for errors on the Xserve RAID that hosts your metadata LUN. You should also probably have more than 2 GB of RAM in your MDC because of the size of your volume. Your volume is also mad full; in my experience that just makes any Xsan problem worse.

If anyone has any other ideas about things to try please let me know.

Thanks,
Dave

ipott's picture

We have the same issue... fsm goes up to 350% CPU and then crashes.
The failover hangs most of the time. Sometimes we have to fail over several times.

We opened a ticket with AppleCare six (!) months ago. So far they have not come up with any solution, telling us we are the only customer with this kind of problem.

We will throw the XSAN out in the next weeks.

ipott's picture

Our MDC has 4 cores and 16 GB RAM, and the filesystem is 75% full.
10.4.11 and XSAN 1.4.2

drocamor's picture

ipott,

Sorry to hear you are having issues. How many clients do you have? What are they doing? How large is your volume?

ipott's picture

Hi,

The volume is 17 TB, with 5 Xsan clients plus SMB + AFP file sharing over 2 of the clients for the render farm.

We have tried a lot to solve the issue, but currently even a simple "find" command crashes the volume after a few minutes.

drocamor's picture

What kind of renderfarm software is accessing the Xsan over the AFP and SMB share?

I wonder if that could be an issue; we are resharing to several machines that are transcoders and others that do some heavy searching and indexing of certain folders.

Is everyone else having this problem also resharing the volume over AFP, SMB, or NFS?

ipott's picture

It's a Royal Render render farm with about 25 nodes, but I think that is not the problem, because even with file sharing turned off I can crash the volume by typing a simple "find" command, or by doing an snfsdefrag -r on the volume.

We are currently thinking about tuning volume parameters like BufferCacheSize, InodeCacheSize and ThreadPoolSize. ADIC's documentation says these parameters increase metadata performance...
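
For example, the relevant lines in the volume .cfg would look something like this (the values are only illustrative guesses for our setup, not a recommendation):

[code]
# Possible metadata tuning in the volume .cfg (illustrative values only)
BufferCacheSize 64M
InodeCacheSize 16K
ThreadPoolSize 256
[/code]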

We also had 3 weeks without a crash some months ago, and people were rendering a lot on the volume. That was after we installed more memory.

We are trying some things at this stage and I will post the results here.

drocamor's picture

To help me track down my own problems and maybe help you out too, do you think you could send me some of the system.logs, cvlogs, and volume configuration for your xsan?

I'm looking for possible issues with my volume involving latency and disk io. If you could send me those files I'd like to compare them to my own. Would that be alright? Maybe we can help each other.

Thanks,
Dave

rogerbg's picture

Hello,

I'm having the same problem. At least I think it's the same...

We have set up two Xserves: one for DHCP/DNS, the other for Xsan. We have an Xserve RAID, a QLogic fibre switch and a Linksys switch. I can get the whole thing up and running, but Xsan only runs for 3 or 4 hours and then stops with a PANIC in the logs. When this panic occurs, the clients lock up and display the "you must restart this machine" message. Once I restart the two clients and the Xserve/Xsan, everything goes back to normal: the clients are mounted again, the volume is running. It's like nothing ever happened...

The Xserve RAID has no data on it yet; we don't fully trust this setup at this time. Both clients are running OS X 10.4.11 and Xsan 1.4.2. Both Xserves are running OS X Server 10.4.11, and the Xserve running Xsan has Xsan 1.4.2.

Here's the log from the Xserve that is running Xsan:
[code]
Nov 26 11:46:09 server2 kernel[0]: FusionMPT: Notification = 1 (Log Data) for SCSI Domain = 2
Nov 26 11:46:09 server2 kernel[0]: FusionMPT: No log data type available, data = 0x07610400
Nov 26 15:09:41 server2 fsm[289]: Xsan FSS 'R1V1[0]': PANIC: /Library/Filesystems/Xsan/bin/fsm "OpHangLimitSecs exceeded VOP-Setattr 183.34 secs Conn[6] Thread-0x186e600 Pqueue-0x4049d8 Workp-0x188bc18 MsgQ-0x188bc08 Msg-0x188bc64 now 0x43fda8b74223c started 0x43fda808692f7 limit 180 secs. " file queues.c, line 612
Nov 26 15:09:41 server2 fsm[289]: PANIC: /Library/Filesystems/Xsan/bin/fsm "OpHangLimitSecs exceeded VOP-Setattr 183.34 secs Conn[6] Thread-0x186e600 Pqueue-0x4049d8 Workp-0x188bc18 MsgQ-0x188bc08 Msg-0x188bc64 now 0x43fda8b74223c started 0x43fda808692f7 limit 180 secs.\n" file queues.c, line 612\n
Nov 26 15:09:41 server2 fsm[289]: Xsan FSS 'R1V1[0]': PANIC: aborting threads now.
Nov 26 15:09:50 server2 servermgrd: xsan: [45/39DDC0] ERROR: get_fsmvol_at_index: Could not connect to FSM because Connect to FSM failed - Connection refused
Nov 26 15:09:50 server2 servermgrd: xsan: [45/3ADA70] ERROR: get_fsmvol_at_index: Could not connect to FSM because Connect to FSM failed - Connection refused
Nov 26 15:09:50 server2 servermgrd: xsan: [45/39DDC0] ERROR: get_clients_for_fsmvol_named(R1V1): Could not connect: Connect to FSM failed - Connection refused
Nov 26 15:09:50 server2 servermgrd: xsan: [45/39DDC0] ERROR: get_computer_properties(R1V1): No client list for volume
Nov 26 15:10:51 server2 servermgrd: xsan: [45/3BD1B0] ERROR: get_fsmvol_at_index: Could not connect to FSM because Connect to FSM failed - Connection refused
Nov 26 15:10:51 server2 servermgrd: xsan: [45/3A0860] ERROR: get_fsmvol_at_index: Could not connect to FSM because Connect to FSM failed - Connection refused
Nov 26 15:10:51 server2 servermgrd: xsan: [45/3BD1B0] ERROR: get_clients_for_fsmvol_named(R1V1): Could not connect: Connect to FSM failed - Connection refused
Nov 26 15:10:51 server2 servermgrd: xsan: [45/3BD1B0] ERROR: get_computer_properties(R1V1): No client list for volume
Nov 26 15:11:51 server2 kernel[0]: FusionMPT: Notification = 9 (Logout) for SCSI Domain = 0
Nov 26 15:11:51 server2 kernel[0]: FusionMPT: Notification = 9 (Logout) for SCSI Domain = 2
Nov 26 15:11:51 server2 kernel[0]: FusionMPT: Notification = 9 (Logout) for SCSI Domain = 0
Nov 26 15:11:51 server2 kernel[0]: FusionMPT: Notification = 9 (Logout) for SCSI Domain = 2
Nov 26 15:11:52 server2 servermgrd: xsan: [45/39DDC0] ERROR: get_fsmvol_at_index: Could not connect to FSM because Connect to FSM failed - Connection refused
Nov 26 15:11:52 server2 servermgrd: xsan: [45/39B8D0] ERROR: get_fsmvol_at_index: Could not connect to FSM because Connect to FSM failed - Connection refused
Nov 26 15:12:17 server2 servermgrd: xsan: [45/3BA980] ERROR: get_remote_properties(R1V1): Could not reach 192.168.1.22:311
Nov 26 15:12:17 server2 servermgrd: xsan: [45/39DDC0] ERROR: get_clients_for_fsmvol_named(R1V1): Could not connect: Connect to FSM failed - Connection refused
Nov 26 15:12:17 server2 servermgrd: xsan: [45/39DDC0] ERROR: get_computer_properties(R1V1): No client list for volume
Nov 26 15:13:17 server2 servermgrd: xsan: [45/348030] ERROR: get_fsmvol_at_index: Could not connect to FSM because Connect to FSM failed - Connection refused
Nov 26 15:13:17 server2 servermgrd: xsan: [45/3B3820] ERROR: get_fsmvol_at_index: Could not connect to FSM because Connect to FSM failed - Connection refused
Nov 26 15:13:31 server2 kernel[0]: FusionMPT: Notification = 9 (Logout) for SCSI Domain = 0
Nov 26 15:13:31 server2 kernel[0]: FusionMPT: Notification = 9 (Logout) for SCSI Domain = 2
Nov 26 15:13:33 server2 kernel[0]: FusionMPT: Notification = 9 (Logout) for SCSI Domain = 0
Nov 26 15:13:33 server2 kernel[0]: FusionMPT: Notification = 9 (Logout) for SCSI Domain = 2
Nov 26 15:13:43 server2 servermgrd: xsan: [45/3CD0F0] ERROR: get_remote_properties(R1V1): Could not reach 192.168.1.22:311
Nov 26 15:13:43 server2 servermgrd: xsan: [45/348030] ERROR: get_clients_for_fsmvol_named(R1V1): Could not connect: Connect to FSM failed - Connection refused
Nov 26 15:13:43 server2 servermgrd: xsan: [45/348030] ERROR: get_computer_properties(R1V1): No client list for volume
[/code]

Any idea what causes this sort of thing? Does this log post help anyone else in this thread?

Thanks,
R

matx's picture

I don't know if it helps, but our metadata failovers and crashes occurred during high-latency periods brought on by our render farm, which involved several Xsan clients resharing out over AFP. Turning off the farm brought sanity back to Xsan. I think multiple points of reshare are bad, and fibre-attached clients fare better than network-shared nodes. My 2 cents.

BenT's picture

In our instance this issue has been resolved:
- reduce usage of the volume with the FSM crashing to below 90%
- increase physical RAM in the MDC

According to the Apple engineers I discussed this problem with, the "OpHangLimitSecs exceeded" error is a pretty generic one despite the specifics in the log file.
There are a number of things in their database that can cause this - it basically just means the XSAN FSM process died for some reason.

drocamor's picture

We're still experiencing our problem. We know that load plays a part in it so we're working on changing the workflow to help this.

We have an Episode Engine cluster that does a lot of transcoding. It works right off the Xsan. On an average day it probably does a little more than 1000 jobs. I'm going to see if I can scale it back or move it to some other storage. Is anyone else using Episode Engine on their Xsan?

sundraghan's picture

We've had a similar issue in our Xsan setup for several months now, where the MDC would get maxed out on the CPU and then we'd have to force a failover in order to get it operational again. When we looked at the MDC Xserve itself we could see that the ethernet lights were fully on whenever that state occurred. Sometimes it would happen several times a day, while other times we'd get 2-3 days of sanity. The one process that would pretty much always trigger the freeze on the MDC, though, was a search. For months and months we tried everything...

Finally, on a support call to Apple yesterday, I was told that I should make sure that all the permissions were set to a network account. We use an AD - OD setup, but some of the file permissions on the SAN in the Owner / Group and ACL fields had local user or group accounts instead of AD or OD ones. After spending months trying to fix the issue I was tempted to laugh and ask how that could possibly cause this problem. However, I had nothing to lose, and our setup is essentially 'open' so everyone has read / write access to everything. To make life simple I just did a permissions change at the top of the directories and propagated it down, after ensuring that all accounts were AD or OD in the Owner, Group and ACL fields. Presto - our SAN is now working stably (fingers crossed).

I ran three back-to-back searches without a single crash or even seeing any of the CPU processes spike like they used to. We have hundreds of thousands of files on there too, so each search took about 15 minutes. So far so good!

I'm not sure if this is everyone's penicillin on this thread, but just wanted to share a simple something to try that was a miracle cure for us.

maccebu's picture

Hi all,

After four months of Xsan going fine, for some reason just today I got this kind of error. Does anyone have a suggestion on how to eliminate this problem?

I'm running 10.4.10 and Xsan 1.4.1, all Intel-based machines.

I'm just curious whether you are all using a Cisco switch for your metadata/public network.

drocamor's picture

maccebu,

We are using Cisco switches for the ethernet networks. Is there a reason that you think this might be causing a problem?

-Dave

maccebu's picture

drocamor,

I'm not so sure; we are also using a Cisco switch here. I was amazed that all my MDCs and some of my Xsan clients are running at only 10baseT and half duplex on the metadata and public network (can you imagine that?). You can check this in System Profiler: select Network and your built-in interface, then look at the Media Options and Media Subtype. I don't know if this may be causing the OpHangLimitSecs error or the crashing of FCP from time to time, but I'm pretty sure it's not good.

I have tried changing the settings in the System Preferences -> Network pane to 1000baseT, full duplex, flow control, as the Apple doc says, and the Cisco is capable of these settings according to our network guy, but it seems the interface will still only use 10baseT/half duplex.

here is the link:
http://docs.info.apple.com/article.html?artnum=301740

It says:

At each MDC and Xsan client, open Apple System Profiler. Select Network, then select the interface from the list which is the metadata interface for that computer. Ensure that the settings for Ethernet > Media Options are set to "Full Duplex, flow-control" and Media Subtype is "1000baseT".

Could you tell me if you are seeing the same Media Options or Subtype as me?

I still have to confirm with the network guy whether there are settings on the switch to change the speed, as they were telling us that it's already full duplex and auto-negotiate.

drocamor's picture

maccebu,

Early on we found some problems with interfaces negotiating to 10/half, which we determined were due to cabling. We have a CCIE on staff who helped with troubleshooting. We recabled the workstations that were having problems and in some cases hard-set the ports to gigabit/full on both the switch and the machine.

Note that if you are doing this then you need to set the speed on both the switch port and the computer port. If you don't do it on both then you can be in for trouble.
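
For reference, it was roughly this on our side (only a sketch: the port number and interface name are made up, and the exact IOS syntax may differ on your switch):

[code]
! Cisco switch port for the metadata interface (hypothetical port)
interface GigabitEthernet0/12
 speed 1000
 duplex full

# On the Mac, check the current media and hard-set it to match
# (ifconfig changes do not survive a reboot; use the Network pane to make it stick)
ifconfig en0
sudo ifconfig en0 media 1000baseT mediaopt full-duplex
[/code]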

For us I do not think that the ethernet network is the problem. We fixed the negotiation issue and still have the same panic error every week or so. Applecare has told me this is sort of a generic error that can mean a lot of different things. I recommend you open a case with them if you haven't already.

Definitely fix your network issues, that should bring you a lot closer.

-Dave

maccebu's picture

drocamor wrote:
We're still experiencing our problem. We know that load plays a part in it so we're working on changing the workflow to help this.

We have an Episode Engine cluster that does a lot of transcoding. It works right off the Xsan. On an average day it probably does a little more than 1000 jobs. I'm going to see if I can scale it back or move it to some other storage. Is anyone else using Episode Engine on their Xsan?[/quote]

drocamor,

We also have Episode Engine; it converts videos from high-res to low-res directly on the Xsan volume, but not with as heavy a workload as yours. It is used rarely because we are still testing the application, and we are only running it on one Xserve Xsan client.

We already fixed our network problem regarding 10baseT by changing cables and adjusting some configuration on the switch and in Mac OS X, and now I'm watching to see whether the OpHangLimitSecs error occurs again. I hope not.

A lot of people are saying that this is a problem communicating with the metadata pool.

How do we get the logs off the fibre switch, e.g. a QLogic 5200 or 9000?

donald's picture

maccebu wrote:
How do we get the logs off the fibre switch, e.g. a QLogic 5200 or 9000?[/quote]

Read all about this in the [url=http://www.servants.co.jp/qlogic/pdf/2-2_59060-00_A-QLogicFCSwitch_Event...]QLogic event guide[/url].

maccebu's picture

thanks,

it was very informative.

drocamor's picture

maccebu,

Please keep me posted on whether or not your Xsan keeps crashing.

I migrated our transcode system off the Xsan and had another crash this morning so it's not just that. I have a feeling that my Xsan crashing for 4 months has caused some problems with the actual volume.

We're going to rebuild the volume sometime in January after we get storage in to migrate processes off the Xsan. We'll probably rebuild much smaller and use the other LUNs for something else.

-Dave

maccebu's picture

drocamor wrote:
maccebu,

Please keep me posted on whether or not your Xsan keeps crashing.

I migrated our transcode system off the Xsan and had another crash this morning so it's not just that. I have a feeling that my Xsan crashing for 4 months has caused some problems with the actual volume.

We're going to rebuild the volume sometime in January after we get storage in to migrate processes off the Xsan. We'll probably rebuild much smaller and use the other LUNs for something else.

-Dave[/quote]

It's been a week now, and it seems like the OpHangLimitSecs error is gone, but I'm still keeping my fingers crossed. For this span of days we did not run Episode Engine; I mean it's running in the background, but it's not transcoding any video files.

If this happens again, I will do a fresh installation on the BMDC, making it a new controller, and host the Xsan volume on this new MDC; then hopefully I can do this on all of the controllers.

MaXi-XCeL's picture

Hi Guys,

Sorry for abandoning this topic... It sucks to find more Xsan admins experiencing these issues. When the problems started on my SAN, Google didn't show as much as it does now.
[url]http://discussions.apple.com/thread.jspa?threadID=1341596&tstart=0[/url]

So taking that into consideration, updates could possibly be a culprit in this issue. Also, I found out that connecting a render farm (> 10 computers) to an Xsan volume shared over SMB causes a lot of stress / crashes of this type.

I've changed my Xsan configuration to 1 metadata pool and 1 data pool, because when one of the data pools has less than 10% free space the OpHangLimitSecs exceeded error also occurs more often.

As for the failover issues I've read in this topic: be absolutely sure your DNS configuration is correct. For example, Xsan for some reason wants to reverse-DNS your metadata network. If you create a zone in your DNS server without hosts, the DNS server responds "Host not Found", which is (apparently) better than the "Zone not Known" response you get without the zone.

Quote:
Now I can tell you for sure that the Xsan client and the MDCs are going to attempt reverse DNS lookups on the private network IPs. I don't know why -- maybe for logging, maybe for security, or maybe it is a bug. If the Xsan client gets a valid PTR response, great! If it gets a negative response, great! But if it gets no response, if there is a timeout, or if the PTR is incorrect, your SAN won't start.[/quote]

[url]http://www.xsanity.com/article.php?story=20060920201633799&query=DNS[/url]
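
A quick way to see what your DNS actually answers for the metadata addresses (just a sketch; the IP is an example, use your own metadata network IPs on each MDC and client):

[code]
# Ask for the PTR record of a metadata network IP (example address)
dig -x 10.0.0.1 +short

# Or with host; you want either a valid name or a clean "not found",
# not a timeout or a server failure
host 10.0.0.1
[/code]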

As for my problem, it still exists, but it happens way less than before. The storage pool with less than 10% free space really messed up Xsan; now, with all the LUNs combined in one pool, I've got enough free space. Also, when file sharing (SMB/AFP) is shut down the volumes do not crash at all. This could be because the render farm is then unable to work, or it could have something to do with the secondary MDC, which is sharing the volume (and thus has it mounted as well).

Also, it is a good idea to follow Apple's Xsan optimization guidelines and pay some attention to your fabric.
[url]http://docs.info.apple.com/article.html?artnum=301740[/url]

Check back on you guys later!

D :D

XSAN 1.4.2
OS X 10.4.11
2 MDC's
5 Clients
3 XRAIDS

MaXi-XCeL's picture

Hey guys!

Would it be possible to make the OpHangLimitSecs value larger in "YourXsanVolume.cfg"?

[url]http://bbs2.chinaunix.net/archiver/tid-813178-page-2.html[/url]

It would look something like this:

[code]

# ****************************************************************************
# A global section for defining file system-wide parameters.
# ****************************************************************************

FileLocks Yes
GlobalSuperUser Yes
Quotas Yes
WindowsSecurity Yes
ForceStripeAlignment Yes
UnixIdFabricationOnWindows Yes
EnforceACLs Yes
Debug 0
AllocationStrategy Round
InodeExpandMin 8
InodeExpandInc 32
InodeExpandMax 2048
BufferCacheSize 32M
JournalSize 16M
FsBlockSize 4K
InodeCacheSize 8K
MaxConnections 75
MaxLogSize 10M
ThreadPoolSize 128
OpHangLimitSecs 300
[/code]

Dying to try ;)

MattG's picture

In Xsan, this tag is not in the .cfg file by default.

In our lab, we added the tag with a value of 300 and it started up fine.

However, I think this would just prolong the effects of the symptom. This panic (OpHangLimitSecs exceeded) is due to a very large latency (180 seconds!) and therefore the troubleshooting needs to revolve around either latency/collisions in the metadata network, or any delays in the MDC being able to flush its changed filesystem metadata to the metadata LUN.

MaXi-XCeL's picture

Hey Matt!

Many thanks for testing this in your lab. I understand that it will only prolong the symptom, but maybe it gives the MDC enough time to free itself of whatever makes the response take longer than usual.

In my Xsan's case, before the volume fails over (after the 180 sec) the volume is still accessible to the clients. But once it fails over, a couple of Finders start to hang, which makes the clients very unhappy. Also, I use the backup MDC for sharing out the volume, because not all clients are Xsan-enabled.

I know this is not the desired solution of course and I'm working hard to fine tune my setup.

Are you sure the OpHangLimitSecs panic has something to do with collisions/latency on the metadata network? In my case I have a dedicated 1000 Mbps switch and all metadata-connected interfaces are hard-set to 1000 Mbps/full duplex.

Again, thanks for testing!
David

maccebu's picture

We've experienced this three times in a week now... any suggestions on how to avoid this problem?

MaXi-XCeL's picture

Hi Maccebu,

Here are some bullets for you to shoot with:

- Every storage pool needs at least 10% free space
- Limit the number of files per folder (move sequence files to other storage or render them to QuickTime)
- Limit the resharing of the volume (AFP / SMB)
- Limit the number of clients connecting through AFP / SMB
- Avoid access to large numbers of files; a lot of small I/O over SMB / AFP causes a lot of stress (e.g. a render farm rendering with sequence files)
- Check your Xsan filesystem; corruption can occur because of the numerous failovers [url]http://docs.info.apple.com/article.html?artnum=301911[/url]
- Follow the tuning guidelines for optimal performance
[url]http://docs.info.apple.com/article.html?artnum=301740[/url]
- Look at your permissions; when connected to OD/AD, AppleCare recommends using directory accounts when setting permissions (see a post earlier in this thread)
- Preferably only use fibre-connected workstations to connect to your Xsan.

We are all praying for a solution from apple. :(

Join us!
David

masashi's picture

Hi,

I had the OpHangLimitSecs exceeded error last night, and eventually I found the cause (one of the ports on the FC switch).
During troubleshooting I erased and re-created the Xsan volume from scratch, but after reviewing the cause I now realize I didn't have to do that.

Here are my 2 cents:
Check all the FC port stats, including those for interconnecting switches. If you are using a QLogic FC switch, you can monitor them using SANsurfer Switch Manager or EFS; in my case I had invalid CRC errors.
It may tell you which port, or which connection, is having the problem.

maccebu's picture

Just this week it happened, and I'm pretty sure it will happen again :( The last time, I remember there were b4m folks installing an application on other Xserve/Mac Pro Xsan clients the week before last. Is anybody else here running b4m?

MaXi-XCeL's picture

masashi wrote:

Check all the FC port stats, including those for interconnecting switches. If you are using a QLogic FC switch, you can monitor them using SANsurfer Switch Manager or EFS; in my case I had invalid CRC errors.
It may tell you which port, or which connection, is having the problem.[/quote]

Hi Masashi,

Do you mean one port on your FC switch seems defective? That sounds pretty odd ;)

I can reproduce the OpHangLimitSecs exceeded error just by changing permissions on a folder with a lot of files, or by defragging a folder with a lot of files (thus lots of small I/O), and it doesn't matter which MDC the volume is active on...

David

MaXi-XCeL's picture

maccebu wrote:
the last time, I remember there were b4m folks installing an application on other Xserve/Mac Pro Xsan clients the week before last. Is anybody else here running b4m?[/quote]

I'm not running b4m. Apple does recommend using your MDCs only and exclusively for Xsan (see the tuning guide). I'm in the process of freeing my MDCs from any other burden. That seems to improve the issue (improvement as in fewer occurrences).

But the error still occurs when I induce high I/O.

masashi's picture

MaXi-XCeL wrote:

Hi Masashi,

Do you mean one port on your FC switch seems defective? That sounds pretty odd ;)

I can reproduce the OpHangLimitSecs exceeded error just by changing permissions on a folder with a lot of files, or by defragging a folder with a lot of files (thus lots of small I/O), and it doesn't matter which MDC the volume is active on...

David[/quote]

Hi David,

Sorry, please forget my previous input. You are right.
After reviewing last night's troubleshooting, the cause of the problem I had has nothing to do with the "OpHangLimitSecs exceeded" error. It was just left in the log as a secondary error, and it went away after I fixed the main problem.

maccebu's picture

Just wanted to share:

Another fsm crash today, another OpHangLimitSecs, damn! :( In frustration I restarted all the servers. The thing that intrigues me is one server, an Xsan client sharing out file services using AFP: when I was about to restart it, a dialog popped up saying the server had 230 connections, do you want to disconnect them all and continue restarting? And to think our Macintoshes number no more than 50.

I cancelled the restart on this server and checked the Connections tab under AFP, and indeed there were that many connections.

I'm sure the ones experiencing this problem have AFP sharing turned on, especially those with a render farm that uses AFP to share out resources.

I wonder if turning off AFP would help avoid this issue. For us this is a no-no, since we need to share out resources from the Xsan.

ipott's picture

The AFP connections are probably queued up because the clients all try to reconnect when the volume disappears. That's a result of the volume crash.

We fixed that issue for our customer with a Perl script that fixes his ACLs. He had lots of bogus ACLs on his files, created by Macs over SMB (or AFP?). For some reason this caused the volume to crash when he accessed these files.

You can try to delete the ACLs with
chmod -R -N
and repropagate the permissions.

Also find the files that belong to no user and give them a valid user:
find -nouser ....

Then try to crash it.
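
Roughly like this (only a sketch: the path and the user name are placeholders, and I would test it on a small folder first):

[code]
# Strip all ACLs below a folder on the Xsan volume (placeholder path)
sudo chmod -R -N /Volumes/YourXsanVolume/Projects

# Find the files that belong to no known user...
sudo find /Volumes/YourXsanVolume/Projects -nouser -print

# ...and hand them to a valid directory user ("editor" is a placeholder)
sudo find /Volumes/YourXsanVolume/Projects -nouser -exec chown editor {} \;
[/code]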

In our case the chmod crashed the volume several times, but it finally went through. No crashes for 4 months now!! The customer regularly runs our Perl script to fix the permissions, and that did it.

MaXi-XCeL's picture

ipott wrote:
For some reason this caused the volume to crash when he accessed these files.

The customer regularly runs our Perl script to fix the permissions, and that did it.[/quote]

That makes some sense. Permissions are mentioned earlier as a possible cause for the fsm crashes...

ipott, could you share your Perl script with us? Also a question for all of you: should we use ACLs at all, and if so, how?

ipott's picture

Hi,

I think I can share the Perl script in about 2 weeks. We are trying to make an article out of it and post it here. It is customized for the customer's environment and creates the ACLs the way the customer needs them. You will certainly have to change it.

As a first stage, I would suggest throwing away all of the ACLs as I mentioned above and recreating them. Neither the delete nor the recreate process worked without trouble.

Also find all the files with no valid user and give them a valid user. I saw the Xsan crashing when snfsdefrag touched a file with an invalid owner.

The script is just maintenance and a workaround, because for unknown reasons the Macs keep creating bogus permissions over SMB file shares.

MaXi-XCeL's picture

Hi Guys,

It seems that Xsan 1 (1.4.2) doesn't support ACLs at all. You are able to use them, but they cause major stress on the metadata, which leads to these types of crashes. In my SAN in particular, because it consists of > 1.000.000 files spread over > 40.000 folders, which Xsan 1 doesn't like either.

It seems that Xsan 2 has better support for ACLs, for high file/folder counts, and for using Xsan as a NAS for the rest of the network.

jpaquin's picture

Is this what you have determined from this thread and your experience, or do you have any confirmation from Apple on this as well? It certainly seems to make sense.

It definitely helps to fix this problem by cleaning up bogus ACLs, along with the nagging problem I am seeing of multiple duplicate inherited ACL records. But by cleaning that up, are we actually cleaning up the previous metadata, or just alleviating the stress on the metadata from that point on? I find myself constantly monitoring my clients who are doing any file sharing at all with an Xsan volume, as we have been running into this here as well.

jordanwwoods's picture

I don't know if "doesn't support ACLs" is altogether correct. I have been using ACLs for the past year and never experienced an FSM crash... but the ACLs don't always behave correctly, and I have since thrown them out and gone with wide-open users to get around the default "read only" permissions issue. I have yet to see a video environment that actually needs permissions structured heavily enough to warrant ACLs. I tell them: if you don't trust the editor reading files in a certain folder, just get rid of the editor; they are a dime a dozen here in LA. :)

Pablitus's picture

I have the following setup

1 Xserve Quad Core / 4GB RAM / MDC / OS X 10.4.11 / XSAN 1.4.2
1 Xserve Quad Core / 8GB RAM / NFS / OS X 10.4.11 / XSAN 1.4.2
1 Xserve Quad Core / 6GB RAM / OD Master / OS X 10.4.11 / XSAN 1.4.2
1 Xserve Quad Core / 4GB RAM / NFS / OS X 10.4.11 / XSAN 1.4.2
1 Xserve Quad Core / 16GB RAM / Episode Engine (currently off) / OS X 10.5.2 / XSAN 1.4.2

1 Xserve RAID of 5.4 TB with data and meta in the same volume
4 Xserve RAIDs of 28 TB (750 GB HDDs) with data and meta separated (2 disks dedicated to meta)

6 Mac Pros with FCP

Cisco MDS 9506 SAN Switch

Once or twice a day I get this error on one of the volumes, and I need to reset the Xserve RAID and fire up the volume again.

[code]
[0327 16:59:44] 0x1809800 (**FATAL**) PANIC: /Library/Filesystems/Xsan/bin/fsm "OpHangLimitSecs exceeded VOP-VopLookupV4 183.21 secs Conn[16] Thread-0x187c800 Pqueue-0x404c18 Workp-0x6081218 MsgQ-0x6081208 Msg-0x6081264 now 0x44970a080da94 started 0x4497095954c68 limit 180 secs.
" file queues.c, line 612
[/code]

I don't know how to proceed with this anymore... I need to get the system stable... any ideas?

If you ask whether we are using ACLs, the answer is yes... and the POSIX permissions are set to local server users.

The funny thing is that the volume that has meta and data together NEVER goes down... just the other volume, which has the meta and data separated.

Thanks in advance

Pablitus's picture

Double post...

Does anyone think that migrating to Xsan 2.0 will solve this issue? I have the licenses, so it could be a possibility.

Thanks again! :)

gimpbully's picture

We have this OpHangLimit problem. We aren't sure how long it has existed, but we have records of it going back to 2 days after we installed Xsan 2, and from the looks of it, it likely existed before that. We have a support ticket in with Apple for the issue. It is going nowhere: it has been open for over a month, and they acknowledge that it is a bug in the fsm but will not share what causes it. We have had 49 occurrences since April 15th (it is now July 9th). We can get up to 9 of these a day, and it's really starting to mess with our editors and our encoding workflow. I'm furious, to say the least, and have little hope for any resolution any time soon. I can say that there is a procedure for getting things back up fairly quickly. Normally it goes like this:
An interactive session hangs (this happens on renames, moves, deletes, attribute setting... basically any metadata action). After 180+ seconds, the OpHangLimit kicks in and tries to restart the fsm. There is a 50/50 chance the fsm will just start back up and everything will come back up clean. The other 50% of the time, we just do a quick `cvfsck -wv [fs]` and then manually do a `start [fs]` from cvadmin. This all takes about 5 minutes.
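
In concrete terms the recovery looks roughly like this for us (the volume name is a placeholder, and cvfsck is only run while the FSM for that volume is down):

[code]
# Check/repair the metadata while the FSM for that volume is stopped
sudo cvfsck -wv OurVolume

# Then restart the file system from the cvadmin prompt
sudo cvadmin
  start OurVolume
[/code]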

keithkoby's picture

Can you get your MDCs to fail over manually through cvadmin?
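
Something like this, I mean (just a sketch; the volume name is a placeholder):

[code]
# On one of the MDCs, at the cvadmin prompt
sudo cvadmin
  fsmlist          # shows which MDC is hosting the FSM right now
  fail OurVolume   # asks the active FSM to hand the volume to the standby
[/code]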
