Xsan Failback

Andrew Allen:

Hi,

We have an Xsan 2.1 environment with 3 SANs, 3 MDCs and 11 clients. Occasionally we've had the odd failover occur over the years. We're currently in a state where the 3 MDCs and 2 of the clients are running Mavericks and Xsan 3.1. We're eventually aiming to move all clients to Mavericks and Xsan 3.1, but in the meantime the older Snow Leopard machines are running Xsan 2.2.2 (build 148).

Every once in a while a SAN will fail over to its secondary metadata controller. However, we recently had our second SAN fail over and then fail BACK to the original controller. The physical copy of the Xsan 2 Administration Guide (2009) that I have calls this a failback and says it should never occur automatically: it must be manually instigated by a person. However, we had a failback occur without anyone instigating it.

Has anyone experienced this? Is it a concern? I'm heading to the site to investigate the console logs and I'll post them below shortly.
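
For what it's worth, when I'm on site I'll also confirm which controller is actually hosting each volume from Terminal rather than trusting Xsan Admin. This is just a rough sketch using cvadmin (I'm assuming cvadmin is in the path on the MDCs; otherwise it lives inside the Xsan filesystem bundle):

sudo cvadmin -e fsmlist     # lists every FSM and whether it is activated or standing by
sudo cvadmin -e select      # lists the volumes that currently have an active FSM

If the FSM for SAN2 shows as active on the original controller again, that would at least confirm the failback really happened at the file system level.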

Andrew Allen:

The situation has become more complicated. The error logs are inconsistent, but I'll cover those in a minute.

The first potential problem I've found is that 2 of the 3 metadata controllers have forgotten their system serial numbers after being upgraded to Mavericks. That upgrade happened almost 2 months ago, however, and the machines have seemingly been functioning OK. No, we did not replace any logic boards or have anyone at the Apple Store do anything with the Xserves. There's a rare but documented problem with machines upgraded to Mavericks forgetting their serial numbers. We're trying to figure out how to resolve this, since it could cause major Xsan problems. One of the two MDCs that has forgotten its serial number fairly frequently logs an error like this:

"apsd: Hardware SerialNumber "System Serial#" looks incorrect or invalid

Now on to the error logs.

SAN1's logs are full of this error:

Xsan Admin: CFNetwork SSLHandshake failed (-9806).

To my knowledge, this might just mean that a machine in the SAN is asleep or offline.

The only other error on SAN1 is this one, from roughly the time (8:23 AM) when we think the SAN failed over (it happened on the weekend, when no one was in the building). Can someone confirm what this error means?

Xsan Admin: tcp_connection_destination_prepare_complete 713297 connectx to 10.36.2.153#312 failed: 65 - no route to host
Xsan Admin: tcp_connection_handle_destination_prepare_complete 713297 failed to connect

There's a short burst of these errors, targeted at 4 of the workstations. They should have been shut down over the weekend, and I'm not sure why being unable to see them would cause a failover, but that's what seems to have happened.
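
When I'm back on site I'll also rule out basic reachability to those edit bays from the MDC with a quick ping, just to separate "machine was off" from "network path was gone" (using the address from the log above as an example):

ping -c 3 10.36.2.153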

MDC2 (the controller for SAN2, which failed over and then failed back) has a huge number of occurrences of this error:

Xsan Admin: ERROR: Error getting list of volumes: kOfflineError (0)

In the midst of all these errors, I found these two curious lines:

fsm: Xsan FSS 'FSN_SAN2 [0]': FSS Socket SEND to client [7] 'fcserver.metadata.net:53811' has returned bad status - [errno 64]: Host is down

fsm: Xsan FSS 'FSN_SAN2 [0]': SNFS Client (192.168.1.4) disconnected unexpectedly from file system 'FSN_SAN2', reason: network communication error

192.168.1.4 is the FCserver machine, our Final Cut Server. It's an Xserve, but not a metadata controller. It was one at one point, but it's been demoted, removed from the SAN and re-added as a client. This Final Cut Server is now starting to crash and disconnect from the SAN frequently, but only from SAN1, not SAN2 or SAN3.
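
One thing I plan to check on the Final Cut Server, since it was demoted and re-added, is that its list of metadata controllers still matches what the MDCs expect. On our Xsan 2.x machines that list is a plain text file, and I believe the same path is used on the Mavericks boxes, though I'm not 100% sure:

cat /Library/Preferences/Xsan/fsnameservers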

More errors from MDC2:

fsm: Xsan FSS 'FSN_SAN [1]': Node 192.168.1.12 [13] does not support Directory Quotas. DQ limits will not be enforced on this client.

This is one of the edit bays. We don't have any quotas set in Xsan Admin, nor to my knowledge have we ever had them on any of the machines. I know we have a directory server set up, but I don't think it has any quotas attached to it either.
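
To be thorough I also want to confirm quotas really are off in the volume configuration itself, not just in Xsan Admin. If I remember right, the volume configs on the controllers live in /Library/Preferences/Xsan as .cfg files (or XML .cfgx files on the Mavericks MDCs), so something like this should show the Quotas setting:

grep -i quotas /Library/Preferences/Xsan/*.cfg*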

Finally, for the Xsanity occurring on MDC3/ShareServer, as seen from MDC2's logs:

fsm: Xsan FSS 'FSN_SAN3 [0]': Node 192.168.1.4 [85] does not support Directory Quotas. DQ limits will not be enforced on this client.

fsm: Xsan FSS 'FSN_SAN [1]': Node 192.168.1.4 [67] does not support Directory Quotas. DQ limits will not be enforced on this client.

192.168.1.4 is our MDC3/ShareServer. It was a client once upon a time, but should not be a client now. It's a metadata controller, and was in fact hosting SAN2 and SAN3 when these errors occurred.

Also from MDC3:

fsm: Xsan FSS 'FSN_SAN3 [0]': SNFS Client '192.168.1.4' (192.168.1.4) disconnected unexpectedly from the file system 'FSN_SAN3', reason: client socket shut down

And then after these messages came a TON of these two messages, repeated back to back over and over:

fsm: Xsan FSS 'FSN_SAN[1]': Node 192.168.1.4 does not support Directory Quotas. DQ limits will not be enforced on this client.

fsm: Xsan FSS 'FSN_SAN [1]': SNFS client 192.168.1.4 (192.168.1.4) disconnected unexpectedly from file system 'FSN_SAN', reason: client socket shut down

Any help on any of these error messages would be much appreciated.

During all of this, MDC3/ShareServer took over hosting SAN1 (as we configured it to do). However, its console logs are strewn with all manner of errors.
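
Rather than paste MDC3's whole console log, I've been pulling out just the fsm lines to see the order in which the FSMs stopped and activated (assuming the failover messages land in system.log on MDC3 the way they seem to on the other controllers):

grep fsm /var/log/system.log

(and the rotated system.log.*.bz2 files if the event has already aged out of the current log).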

bpolyak:

What you had was two consecutive failovers. Xsan does not have failback.

Xsan Admin errors have nothing to do with it. Xsan Admin is nothing but a management layer on top of the actual file system. Never mind those.
The rest of your errors have nothing to do with actual file system restarts.
The client socket shutdown errors are also most likely a result of a restart, not the reason behind it.

Search your logs for the word PANIC. If there's an error in your filesystem, it will cause a PANIC error and a forced failover.
Check the cvlogs of the file system: they reside in /Library/Logs/Xsan/data/[Volume name]/cvlog or something similar (I can't look it up right now).
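
A quick way to do the PANIC search across all volumes at once, assuming the cvlogs are under that path (adjust if they sit in a log subfolder on your version):

grep -i panic /Library/Logs/Xsan/data/*/cvlog*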

Do you experience any other issues apart from a restart? Does the filesystem hang forever for some clients until they force reboot? Do files disappear?
Could you also check sleep settings on the controllers, and check that they are not set to automatically install updates and restart?
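
Both are quick to check from Terminal on each controller; these commands should exist on both Snow Leopard and Mavericks:

pmset -g                  # shows sleep, autorestart and related power settings
softwareupdate --schedule # shows whether automatic update checking is enabled

If the controllers are allowed to sleep, sudo pmset -a sleep 0 turns that off.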