Shutting down Metadata Controllers in Xsan 3.1

Andrew Allen

I've got another question. Before we upgraded to Xsan 3.1 and Mavericks on our metadata controllers, they were running Snow Leopard 10.6.8. Our client machines also ran 10.6.8. Occasionally an MDC would become unresponsive for some reason and a failover would occur. It used to be that when an MDC had stopped hosting a volume, you could reboot or shut down that MDC (the one no longer hosting) without negatively affecting the SAN. Perhaps this is a huge no-no and we just weren't aware of it, but it never caused issues.

Well, we needed to upgrade from Mavericks 10.9.1 to 10.9.2 on one of our MDCs. We did a failover so that a different Mavericks controller was hosting the third SAN, and then we shut down the machine after waiting 2-3 minutes. We have 3 Mavericks 10.9.1 controllers, 1 Mountain Lion 10.8.5 client and 5 Snow Leopard 10.6.8 clients (which we will be upgrading to Mavericks soon). This time, when we shut down the controller, the 10.6.8 machines freaked out. They froze, lost connection to the volume we did the failover on and needed a reboot. The Mountain Lion machine was just fine: it still had connectivity and could open files on the SAN. To our knowledge, nothing was damaged on the SAN. When we used to do failovers this way, we'd make the editors save their files and close their programs while we did the failover. We did that this time as well but still had these issues.

What is the proper procedure in Mavericks to shut down an MDC? If an MDC isn't hosting, can it be shut down safely in Xsan 3.1 or not?

thomasb

You most likely experienced the bug in 10.9.1 where one or more of your MDCs lost the ability to communicate properly with the local FSMPM.

You would see log messages like these in your system logs:
xsand[32]: Unable to connect to local FSMPM
opendirectoryd[28]: tcp_connection_destination_cleanup_fds 7498 closed failed 9 - Bad file descriptor
assertion failed: 13B42: AppleLDAP + 23042 [DEADA232-8912-3ACB-8262-B3B764C6777F]: 0x16
opendirectoryd[28]: tcp_connection_destination_cleanup_fds 7504 closed failed 9 - Bad file descriptor
opendirectoryd[28]: tcp_connection_destination_cleanup_fds 7507 closed failed 9 - Bad file descriptor
opendirectoryd[28]: assertion failed: 13B42: AppleLDAP + 23042 [DEADA232-8912-3ACB-8262-B3B764C6777F]: 0x16
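
If you want to check an MDC's logs for those messages, a rough Python sketch along these lines might help. It only counts the two signatures quoted above; the function name and the sample lines are made up for illustration, and in practice you would feed it lines read from your own system log (for example /var/log/system.log, assuming that is where your messages end up).

```python
import re

# Signatures copied from the log lines quoted above; the numeric
# process IDs and descriptor numbers vary, so they are matched loosely.
FSMPM_PATTERN = re.compile(r"xsand\[\d+\]: Unable to connect to local FSMPM")
BAD_FD_PATTERN = re.compile(
    r"tcp_connection_destination_cleanup_fds \d+ closed failed 9 - Bad file descriptor"
)


def count_signatures(lines):
    """Count how many lines match each of the two signatures (hypothetical helper)."""
    counts = {"fsmpm": 0, "bad_fd": 0}
    for line in lines:
        if FSMPM_PATTERN.search(line):
            counts["fsmpm"] += 1
        if BAD_FD_PATTERN.search(line):
            counts["bad_fd"] += 1
    return counts


# Illustrative input only: three lines from the post plus one unrelated line.
SAMPLE_LINES = [
    "xsand[32]: Unable to connect to local FSMPM",
    "opendirectoryd[28]: tcp_connection_destination_cleanup_fds 7498 closed failed 9 - Bad file descriptor",
    "opendirectoryd[28]: tcp_connection_destination_cleanup_fds 7504 closed failed 9 - Bad file descriptor",
    "some unrelated log line",
]

if __name__ == "__main__":
    print(count_signatures(SAMPLE_LINES))
```

A non-zero "fsmpm" count would only suggest that the MDC hit the same symptom; it does not prove the cause.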

This is supposed to be fixed in 10.9.2. 

It should work perfectly fine to simply shut down an MDC that's not in control of any volumes. There is no other way to do it. However, if the failover didn't go as expected, the volume will try to fail back to the MDC you just shut down. 
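
To spell out the check I mean, here is a minimal Python sketch of the logic, not a real Xsan tool. It assumes you already know which controller is currently in control of each volume (from whatever admin tool you use) and have written that down as a simple mapping; the volume names, controller names and function names are all hypothetical.

```python
def volumes_hosted_by(mdc, active_controller_by_volume):
    """Return the volumes the given MDC is currently in control of, sorted by name."""
    return sorted(
        volume
        for volume, controller in active_controller_by_volume.items()
        if controller == mdc
    )


def safe_to_shut_down(mdc, active_controller_by_volume):
    """An MDC is treated as safe to shut down only if it controls no volumes."""
    return not volumes_hosted_by(mdc, active_controller_by_volume)


# Hypothetical state after a failover: mdc1 no longer hosts anything.
HOSTING = {"VolA": "mdc2", "VolB": "mdc2", "VolC": "mdc3"}

if __name__ == "__main__":
    for mdc in ("mdc1", "mdc2", "mdc3"):
        print(mdc, safe_to_shut_down(mdc, HOSTING), volumes_hosted_by(mdc, HOSTING))
```

The point is to re-check the mapping after the failover and before the shutdown. If the failover did not take, the MDC will still show up as hosting the volume.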

I did this with 7 MDCs over the weekend, when upgrading from 10.9.1 to 10.9.2.

10.6.8 clients are more likely to lock up or freeze when a failover occurs, especially if those 10.6.8 clients are doing any AFP/SMB re-sharing of your Xsan volume(s). Clients from 10.7.5 all the way up to 10.9.2 seem to handle failovers better, even when there is activity going on.