Xsan on Mavericks: FSMPM dies randomly on both MDCs

serged's picture

We have a new install with two Mac minis, Promise Thunderbolt storage, and OS X 10.9.1.


Randomly, the FSMPM seems to die and restart. This happens on both MDCs, and a failover is triggered when the failing MDC owns the volume.

This is our first Mavericks install...

Thanks,

Serge

 xsand[30]: Unable to connect to local FSMPM

 kernel[0]: Reconnecting to local portmapper on host '127.0.0.1'

[20140130 13:48:07] 0x7fff777e6310 (debug) PortMapper: FSD on port 49170 disconnected.
[20140130 13:48:07] 0x7fff777e6310 (debug) PortMapper: FSS 'SANVOL01' disconnected.
[20140130 13:48:07] 0x7fff777e6310 (debug) PortMapper: kicking diskscan_thread 4389867520.
[20140130 13:48:07] 0x7fff777e6310 (debug) FSS: State Change 'SANVOL01' REGISTERED: (no substate) -> DYING: (no substate) , next event in 60s (/SourceCache/XsanFS/XsanFS-508/snfs/fsmpm/fsmpm.c#5597)
[20140130 13:48:07] 0x105a81000 INFO Starting Disk rescan
[20140130 13:48:07] 0x105a81000 (debug) Disk rescan delay completed
[20140130 13:48:07] 0x7fff777e6310 (debug) PortMapper: new_input authenticating protocol type(134) pmt_type(0) FSD(127.0.0.1)...
[20140130 13:48:07] 0x7fff777e6310 (debug) PortMapper: Local FSD client is registered, on port 49170.

thomasb's picture

We have the exact same issue with three 10.9.1 Xsan environments at work, and we have yet to figure out why. We can go a week without any failovers, and suddenly we have multiple failovers on one day.

We had the same thing happen occasionally with 10.8.4 MDCs too, but it seems to be happening more frequently with 10.9.1.

Anybody else?

Thomas

abstractrude's picture

Let's take a look at more of your logs: the cvlog and the system log from the time of the failover.

-Trevor Carlson
THUMBWAR

serged's picture

nssdbg log:

[20140208 15:51:32] 0x7fff777e6310 (debug) PortMapper: FSD on port 49170 disconnected.
[20140208 15:51:32] 0x7fff777e6310 (debug) PortMapper: FSS 'SANVOL01' disconnected.
[20140208 15:51:32] 0x7fff777e6310 (debug) PortMapper: kicking diskscan_thread 4389867520.
[20140208 15:51:32] 0x7fff777e6310 (debug) FSS: State Change 'SANVOL01' REGISTERED: (no substate) -> DYING: (no substate) , next event in 60s (/SourceCache/XsanFS/XsanFS-508/snfs/fsmpm/fsmpm.c#5597)
[20140208 15:51:32] 0x105a81000 INFO Starting Disk rescan
[20140208 15:51:32] 0x105a81000 (debug) Disk rescan delay completed
[20140208 15:51:32] 0x7fff777e6310 (debug) PortMapper: new_input authenticating protocol type(134) pmt_type(0) FSD(127.0.0.1)...
[20140208 15:51:32] 0x7fff777e6310 (debug) PortMapper: Local FSD client is registered, on port 49170.
[20140208 15:51:32] 0x105a81000 INFO Disk rescan found 4 disks
[20140208 15:51:32] 0x7fff777e6310 (debug) NSS: Active FSS 'SANVOL01[0]' at 192.168.1.150:64540 (pid 56080) - dropped.
[20140208 15:51:33] 0x7fff777e6310 NOTICE PortMapper: Initiating activation vote for FSS 'SANVOL01'.
[20140208 15:51:33] 0x7fff777e6310 (debug) Initiate_nss_vote for FSS SANVOL01
[20140208 15:51:33] 0x7fff777e6310 (debug) NSS: sending message (type 2) to Name Server '10.10.10.150' (10.10.10.150:53944).
[20140208 15:51:33] 0x7fff777e6310 (debug) NSS: sending message (type 2) to Name Server '10.10.10.151' (10.10.10.151:50545).
[20140208 15:51:33] 0x7fff777e6310 INFO NSS: election initiated by 10.10.10.150:53944 (id 192.168.1.150) - client request.
[20140208 15:51:33] 0x7fff777e6310 (debug) NSS: Vote call for FSS SANVOL01 is inhibited - vote dis-allowed.
[20140208 15:51:33] 0x7fff777e6310 (debug) NSS: Vote call for FSS SANVOL01 is inhibited - vote dis-allowed.
[20140208 15:51:33] 0x7fff777e6310 (debug) NSS_VOTE2 to 10.10.10.150:53944
[20140208 15:51:33] 0x7fff777e6310 INFO NSS: Starting vote for FSS SANVOL01 using 2 voting members: 10.10.10.172, 192.168.1.150.
[20140208 15:51:33] 0x7fff777e6310 (debug) Connectivity test[1] to FSS 10.10.10.151:49199 passed .
[20140208 15:51:33] 0x7fff777e6310 (debug) local_fss_vote: WINNER SANVOL01 at 10.10.10.151:49199.
[20140208 15:51:33] 0x7fff777e6310 (debug) local_fss_vote: sending tally to 192.168.1.150:49199
[20140208 15:51:33] 0x7fff777e6310 (debug) tally_member:FSS/SANVOL01 COUNTED 1 vote for 10.10.10.151:49199 from member 192.168.1.150:53944
[20140208 15:51:33] 0x7fff777e6310 (debug) tally_member:FSS/SANVOL01 COUNTED 1 vote for 10.10.10.151:49199 from member 10.10.10.172:59434
[20140208 15:51:33] 0x7fff777e6310 (debug) Elect FSS[0] 10.10.10.151:49199 svc_pri/0x1 votes/2 wght/0x2703697a2e0
[20140208 15:51:33] 0x7fff777e6310 (debug) Elect FSS winner so far 10.10.10.151:49199 svc_pri/01 votes/2 wght/0x2703697a2e0
[20140208 15:51:33] 0x7fff777e6310 INFO NSS: Vote selected FSS 'SANVOL01[1]' at 10.10.10.151:49199 (pid 277) - attempting activation.
[20140208 15:51:33] 0x7fff777e6310 (debug) set_fss_active: sending to 10.10.10.151:50545 (id 192.168.1.151)
[20140208 15:51:34] 0x105e84000 (debug) FSS: State Change 'SANVOL01' DYING: (no substate) -> RELAUNCH: (no substate) , next event in 10s (/SourceCache/XsanFS/XsanFS-508/snfs/fsmpm/fsmpm.c#6555)
[20140208 15:51:36] 0x1048bd000 NOTICE PortMapper: Reconnect Event for /Volumes/SANVOL01
[20140208 15:51:36] 0x1048bd000 NOTICE PortMapper: Requesting MDS recycle of /Volumes/SANVOL01
[20140208 15:51:45] 0x105e84000 (debug) FSS: State Change 'SANVOL01' RELAUNCH: (no substate) -> LAUNCHED: (no substate) , next event in 60s (/SourceCache/XsanFS/XsanFS-508/snfs/fsmpm/fsmpm.c#2452)
[20140208 15:51:45] 0x105e84000 (debug) PortMapper: FSM 'SANVOL01' queued for restart on host xsansrv01.pimiento.prod (pri=0).
[20140208 15:51:45] 0x105e84000 NOTICE PortMapper: Starting FSS service 'SANVOL01[0]' on xsansrv01.pimiento.prod.
[20140208 15:51:45] 0x105e84000 NOTICE PortMapper: Started FSS service 'SANVOL01' pid 80631.
[20140208 15:51:46] 0x7fff777e6310 (debug) FSS: State Change 'SANVOL01' LAUNCHED: (no substate) -> REGISTERED: (no substate)  (/SourceCache/XsanFS/XsanFS-508/snfs/fsmpm/fsmpm.c#2715)


CVLOGS:

[20140208 15:45:17.943192] 0x7fff777e6310 (Debug) Node [125] [10.10.10.151:55666] connected.
[20140208 15:46:18.750845] 0x7fff777e6310 (Debug) Node [126] [10.10.10.151:55684] connected.
[20140208 15:47:18.762551] 0x7fff777e6310 (Debug) Node [127] [10.10.10.151:55701] connected.
[20140208 15:48:19.330040] 0x7fff777e6310 (Debug) Node [128] [10.10.10.151:55718] connected.
[20140208 15:49:19.290263] 0x7fff777e6310 (Debug) Node [129] [10.10.10.151:55735] connected.
[20140208 15:50:17.936021] 0x7fff777e6310 (Debug) Node [130] [10.10.10.151:55751] connected.
[20140208 15:51:17.932995] 0x7fff777e6310 (Debug) Node [131] [10.10.10.151:55768] connected.
[20140208 15:51:32] 0x7fff777e6310 (Info) File System 'SANVOL01' stopped by FSM Portmapper - [errno 54]: Connection reset by peer
[20140208 15:51:32] 0x7fff777e6310 (Info) Shutting down file system.
[20140208 15:51:32] 0x7fff777e6310 (Info) File system shut down successful.
[20140208 15:51:32.932383] 0x7fff777e6310 (Debug) FSM qustat archive created: /Library/Logs/Xsan/qustats/FSM/SANVOL01/xsansrv01.pimiento.prod/qustat_FSM_SANVOL01_xsansrv01.pimiento.prod_1391892692.csv
[20140208 15:51:32.932415] 0x7fff777e6310 (Debug) NETWORK SUMMARY [10.10.10.173:49229]: QueuedInputMsgs max/0 QueuedOutputBytes max/0.
[20140208 15:51:32.932425] 0x7fff777e6310 (Debug) NETWORK SUMMARY [montage001.pimiento.:49169]: QueuedInputMsgs max/0 QueuedOutputBytes max/0.
[20140208 15:51:32.939287] 0x7fff777e6310 (Debug) FSM memory SUMMARY virtual size 2692MB resident size 78MB.
[20140208 15:51:33.441367] 0x7fff777e6310 (Debug) PioRIOQuiescent: All devices are idle.
[20140208 15:51:33.441400] 0x7fff777e6310 (Debug) journal_quiescent: journal_writeq is empty.
[20140208 15:51:33] 0x7fff777e6310 (Info) wait for last arb block update.
[20140208 15:51:33] 0x10c7ed000 (Info) Failover - fsm has written the IO Quiescent arb block!
[20140208 15:51:33] 0x7fff777e6310 (Info) Exiting.
Logger_thread: sleeps/1043454 signals/0 flushes/11120 writes/11120 switches 0
Logger_thread: logged/41022 clean/41022 toss/0 signalled/0 toss_message/0
Logger_thread: waited/0 awakened/0
[20140208 15:51:46] 0x7fff777e6310 (Info) Revision 4.3.2 Build 508[30118] Branch Head
[20140208 15:51:46] 0x7fff777e6310 (Info) Built for Darwin 13.0 x86_64
[20140208 15:51:46] 0x7fff777e6310 (Info) Created on Mon Sep 16 17:32:54 PDT 2013
[20140208 15:51:46] 0x7fff777e6310 (Info) Built in /SourceCache/XsanFS/XsanFS-508/buildinfo
[20140208 15:51:46] 0x7fff777e6310 (Info)
Configuration:
    DiskTypes-4
    Disks-4
    StripeGroups-4
    MaxConnections-139
    ThreadPoolSize-256
    StripeAlignSize-16
    FsBlockSize-65536
    BufferCacheSize-32M
    InodeCacheSize-8192
    RestoreJournal-Disabled
    RestoreJournalDir-None
    ClientTableSize-144
[20140208 15:51:46] 0x7fff777e6310 (Info) Potentially using volume promiseMDC1 via device /dev/rdisk2 (active)
[20140208 15:51:46] 0x7fff777e6310 (Info) Self (xsansrv01.pimiento.prod) IP address is 192.168.1.150.
[20140208 15:51:46] 0x7fff777e6310 (Info) Process ID 80631
[20140208 15:51:46.204345] 0x7fff777e6310 (Debug) No fsports file - port range enforcement disabled.
[20140208 15:51:46] 0x7fff777e6310 (Info) Listening on TCP socket xsansrv01.pimiento.prod:64914
[20140208 15:51:46] 0x7fff777e6310 (Info) Node [0] [xsansrv01.pimiento.p:64914] File System Manager Login.
[20140208 15:51:46] 0x7fff777e6310 (Info) Opened volume promiseMDC1 via device /dev/rdisk2 (active)
[20140208 15:51:46] 0x7fff777e6310 (Info) ForceStripeAlignment is enabled.
[20140208 15:51:46] 0x7fff777e6310 (Info) Wopen_Wait_Interval=35
[20140208 15:51:46] 0x7fff777e6310 (Info) Service standing by on host 'xsansrv01.pimiento.prod:64914'.
[20140208 15:51:56] 0x107f01000 (Info) DiskArb: won't set Spotlight search level; volume is not hosted locally

Thanks,

Serge


serged's picture

Feb  8 15:51:32 xsansrv01 kernel[0]: Reconnecting to local portmapper on host '127.0.0.1'
Feb  8 15:51:32 xsansrv01.pimiento.prod KernelEventAgent[54]: tid 54485244 received event(s) VQ_NOTRESP (1)
Feb  8 15:51:32 xsansrv01.pimiento.prod KernelEventAgent[54]: tid 54485244 type 'acfs', mounted on '/Volumes/SANVOL01', from '/dev/disk6', not responding
Feb  8 15:51:32 xsansrv01.pimiento.prod KernelEventAgent[54]: tid 54485244 found 1 filesystem(s) with problem(s)
Feb  8 15:51:32 xsansrv01 kernel[0]: Local portmapper OK
Feb  8 15:51:33 xsansrv01.pimiento.prod kdc[55]: AS-REQ _ldap_replicator@XSANSRV01.PIMIENTO.PROD from 127.0.0.1:61939 for krbtgt/XSANSRV01.PIMIENTO.PROD@XSANSRV01.PIMIENTO.PROD
Feb  8 15:51:33 --- last message repeated 1 time ---
Feb  8 15:51:33 xsansrv01.pimiento.prod kdc[55]: Need to use PA-ENC-TIMESTAMP/PA-PK-AS-REQ
Feb  8 15:51:33 xsansrv01.pimiento.prod kdc[55]: AS-REQ _ldap_replicator@XSANSRV01.PIMIENTO.PROD from 127.0.0.1:51697 for krbtgt/XSANSRV01.PIMIENTO.PROD@XSANSRV01.PIMIENTO.PROD
Feb  8 15:51:33 --- last message repeated 1 time ---
Feb  8 15:51:33 xsansrv01.pimiento.prod kdc[55]: Client sent patypes: ENC-TS
Feb  8 15:51:33 xsansrv01.pimiento.prod kdc[55]: ENC-TS pre-authentication succeeded -- _ldap_replicator@XSANSRV01.PIMIENTO.PROD
Feb  8 15:51:33 xsansrv01.pimiento.prod kdc[55]: Client supported enctypes: aes256-cts-hmac-sha1-96, aes128-cts-hmac-sha1-96, des3-cbc-sha1, arcfour-hmac-md5, using aes256-cts-hmac-sha1-96/aes256-cts-hmac-sha1-96
Feb  8 15:51:33 xsansrv01.pimiento.prod kdc[55]: Requested flags: forwardable
Feb  8 15:51:33 xsansrv01 kernel[0]: Reconnecting to FSS 'SANVOL01'
Feb  8 15:51:33 xsansrv01.pimiento.prod fsmpm[173]: PortMapper: Initiating activation vote for FSS 'SANVOL01'.
Feb  8 15:51:34 xsansrv01.pimiento.prod apsd[274]: Unrecognized leaf certificate
Feb  8 15:51:36 xsansrv01 kernel[0]: Reconnect successful to FSS 'SANVOL01' on host '10.10.10.151'.
Feb  8 15:51:36 xsansrv01 kernel[0]: Using v2 readdir for 'SANVOL01'
Feb  8 15:51:36 xsansrv01.pimiento.prod fsmpm[173]: PortMapper: Reconnect Event for /Volumes/SANVOL01
Feb  8 15:51:36 xsansrv01.pimiento.prod fsmpm[173]: PortMapper: Requesting MDS recycle of /Volumes/SANVOL01
Feb  8 15:51:36 xsansrv01.pimiento.prod KernelEventAgent[54]: tid 54485244 received event(s) VQ_NOTRESP (1)
Feb  8 15:51:36 xsansrv01.pimiento.prod mds[47]: (Normal) Volume: volume:0x7f859c03e800 ********** Bootstrapped Creating a default store:0 SpotLoc:(null) SpotVerLoc:(null) occlude:0 /Volumes/SANVOL01
Feb  8 15:51:45 xsansrv01.pimiento.prod fsmpm[173]: PortMapper: Starting FSS service 'SANVOL01[0]' on xsansrv01.pimiento.prod.
Feb  8 15:51:45 xsansrv01.pimiento.prod fsmpm[173]: PortMapper: Started FSS service 'SANVOL01' pid 80631.
Feb  8 15:52:32 xsansrv01.pimiento.prod kdc[55]: TGS-REQ _ldap_replicator@XSANSRV01.PIMIENTO.PROD from 127.0.0.1:53827 for ldap/xsansrv02.pimiento.prod@XSANSRV01.PIMIENTO.PROD [canonicalize, forwardable]
Feb  8 15:52:32 xsansrv01.pimiento.prod kdc[55]: TGS-REQ _ldap_replicator@XSANSRV01.PIMIENTO.PROD from 127.0.0.1:50036 for ldap/xsansrv02.pimiento.prod@XSANSRV01.PIMIENTO.PROD [forwardable]
Feb  8 15:57:41 xsansrv01.pimiento.prod storeagent[313]: multibyte ASN1 identifiers are  not supported.


thomasb's picture

I would recommend upgrading to 10.9.2. The update is supposed to fix some network issues in OS X, which seem to have caused the "random" failovers. Basically, it looks like OS X stopped being able to open new TCP sockets.

The problem is very similar to this:
http://nslog.com/2013/12/14/lost_internet_connection

If you look through the logs of your MDCs, you'll most likely see messages like these:

<code>opendirectoryd[28]: tcp_connection_destination_cleanup_fds 7498 closed failed 9 - Bad file descriptor
assertion failed: 13B42: AppleLDAP + 23042 [DEADA232-8912-3ACB-8262-B3B764C6777F]: 0x16
opendirectoryd[28]: tcp_connection_destination_cleanup_fds 7504 closed failed 9 - Bad file descriptor
opendirectoryd[28]: tcp_connection_destination_cleanup_fds 7507 closed failed 9 - Bad file descriptor
opendirectoryd[28]: assertion failed: 13B42: AppleLDAP + 23042 [DEADA232-8912-3ACB-8262-B3B764C6777F]: 0x16</code>

After these messages, errors like the one below start appearing, which means cvadmin no longer works and things quickly go bad. That's why I chose to do a manual failover of all the volumes on that MDC before it happened automatically.

<code>xsand[32]: Unable to connect to local FSMPM</code>
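
If you want to scan for these quickly, something along these lines should do it (just a rough sketch from memory; the patterns are the messages above, and on 10.9 the rotated system logs are bzip2-compressed, so adjust to your setup):

<code># current system log
grep -E "tcp_connection_destination_cleanup_fds|Unable to connect to local FSMPM" /private/var/log/system.log

# rotated system logs (bzip2-compressed)
for f in /private/var/log/system.log.*.bz2; do
  bzcat "$f" | grep -E "tcp_connection_destination_cleanup_fds|Unable to connect to local FSMPM"
done</code>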

mattriley's picture

Have you installed 10.9.2 on any MDCs yet? I'm on 10.8.5 on my MDCs still and have been waiting for Mavericks to stabilize before upgrading.

I will probably wait a couple of days to see if anything shakes out but maybe Mavericks is suitable for MDCs now, yeah?

-Matt

thomasb's picture

I've been running OS X 10.9.2 for a day now in our Xsan lab environment without issues, and I just upgraded two MDCs (controlling three volumes) from 10.9.1 to 10.9.2 tonight. I'm upgrading another Xsan environment with two MDCs and two volumes tomorrow, and over the weekend I will upgrade our big MultiSAN with 7 MDCs and 16 volumes from 10.9.1 to 10.9.2.

I'll report back on Monday next week with a status update :)

mattriley's picture

I appreciate you sharing your experience and look forward to reading what you have to say about 10.9.2.

Finding resources for Xsan has become increasingly difficult. Luckily, it just works for us, so I'm thankful I haven't needed to do a ton of digging except on rare occasions (big upgrades).

Thanks!
Matt

mattriley's picture

Any luck with the upgrade over the weekend?

-Matt

thomasb's picture

So far so good :)

We have not had any failovers in any of our three Xsan environments since upgrading to 10.9.2.

I need to see all volumes without failovers for 2-3 weeks before I can say for sure though.

Thomas

mattriley's picture

Thanks for posting!

I may give it a shot on my MDCs (both older model Xserves running 10.8.5 now) later this week and see how it goes.

Please post back if anything abnormal jumps out from your installs and I will do the same. :-)

-Matt

Andrew Allen's picture

Please update this again when you can. I'm running into the same failover issues on 10.9.1: 3 MDCs and 11 clients.

I'll be upgrading to 10.9.2 tonight. I hope that works.

serged's picture

Thanks for the info. We will update to 10.9.2 and share the results.


thomasb's picture

After two weeks, it seems the failovers have begun again :-/

Same symptoms. The connection between FSMPM and FSM simply dies, and all volumes fail over from one MDC to the other.

I have begun always rebooting the MDC the volumes failed over from, to make sure it's ready for the next failover.

Example from /Library/Logs/Xsan/debug/nssdbg.out:

[20140315 10:14:24] 0x7fff7cd61310 (debug) PortMapper: FSS 'MyVolume' disconnected.
[20140315 10:14:24] 0x7fff7cd61310 (debug) PortMapper: kicking diskscan_thread 4579147776.
[20140315 10:14:24] 0x7fff7cd61310 (debug) FSS: State Change 'MyVolume' REGISTERED: (no substate) -> DYING: (no substate) , next event in 60s (/SourceCache/XsanFS/XsanFS-508.2/snfs/fsmpm/fsmpm.c#5597)

mattriley's picture

That's disappointing to hear, Thomas.

I scheduled some time later this week to do the upgrade here but after reading your notes, I think I may wait for 10.9.3. My MDCs have been rock-solid for me on 10.8.5 so I'm nervous about messing with them in the first place. Reading about persistent issues in 10.9.2 definitely doesn't give me the warm fuzzies about upgrading and I will continue to put it off as long as I can.

Thanks again for posting about your experiences. Have you tried to contact Apple about the problem through their forums or an official channel? Did you work with a reseller to build your XSAN who might be able to help contact Apple?

Also, what hardware are your MDCs on? I don't know if that would make a difference or not, just trying to feel this out and help if I can.

-Matt

thomasb's picture

I would wait, yes.

For those of you that have upgraded, here is my advice until Apple manages to find and fix the issues.

1. Check the logs ASAP after the failover, and make a note of any suspicious log messages:

/private/var/log/system.log*
/Library/Logs/Xsan/debug/nssdbg.out*

2. Always reboot the MDC that was running the volumes prior to the failover.

3. After the reboot, always check that both MDCs are listed when running "sudo cvadmin" (see the sketch below).
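
Roughly what I run after a failover, put together as a sketch (the log paths are the ones above; the cvadmin subcommands are from memory, so double-check them against your version before relying on this):

<code># 1. Look for suspicious messages around the time of the failover
tail -n 200 /Library/Logs/Xsan/debug/nssdbg.out
grep -iE "fsmpm|fss|portmapper" /private/var/log/system.log

# 2. Reboot the MDC that was hosting the volumes (run this on that MDC)
sudo shutdown -r now

# 3. After the reboot, confirm both MDCs are listed for every volume
sudo cvadmin -e fsmlist
sudo cvadmin -e select</code>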


We have a case with Apple about these random failovers. We actually saw the issues start back with OS X 10.8.4, which is the first time we bug-reported it to Apple. It's the same case we're still tracking with Apple in 10.9.2.

We have five separate Xsan environments in production, and one in our lab environment, and all of them have the same issues. The largest environment consists of 7 MDCs and 16 volumes, with 100 clients. There doesn't seem to be any difference in the frequency of failovers between the larger environment and the smaller ones.

The failovers happen about every two weeks in our production environments, sometimes more often. The problem seems to be OS X network related, as there are no signs in the logs related to Xsan. We have had full debugging enabled for several of our Xsan volumes, without any additional clues as to why the failovers occur.

I'll try to keep you posted.

Thomas

serged's picture

Sadly, I can confirm that after the upgrade it's still the same. It took two weeks, but the failovers are back.

I'm also seeing something similar to this on 10.9.2, though I'm pretty sure our volumes have been running fine on 10.9.2 for several weeks.  This issue cropped up on the primary MDC yesterday.  If PortMapper reported a hiccup from FSM on the primary MDC, the volume would fail over to the secondary MDC.  Both my volumes run fine on the secondary MDC, so I've kept them like that until I can sort out what's wrong with the primary.  I watched the primary and saw a few more of these hiccups throughout the day yesterday, so on a hunch, I rebooted it.  I haven't seen any more yet, and if it holds steady for another day or so, I'll fail one of the volumes back to it and see how it goes.
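
When I do fail one back, it'll probably be along these lines (a sketch, not a recipe; cvadmin's fail command just forces a new election, so which MDC ends up hosting the volume depends on the FSM priorities, and the volume name here is only a placeholder):

<code># see which FSM is currently active for each volume
sudo cvadmin -e fsmlist

# force a failover election for one volume (volume name is just an example)
sudo cvadmin -e "fail MyVolume"

# confirm which MDC the volume came up on
sudo cvadmin -e select</code>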

thomasb's picture

Sorry to say this, but I really doubt that this issue is isolated to your primary MDC. If you haven't seen it happen to your secondary MDC yet, you will soon enough, unfortunately.

We'll hopefully see a fix from Apple for these issues soon.

Thomas,

You're right about the secondary MDC.  I've been rebooting them whenever one acts up, and that seems to stabilize it for a bit.

You mentioned that you thought it was a network issue in OS X. Just out of curiosity, what ethernet switches are you using? We recently had some OS X network issues that traced back to our Cisco gear, which apparently doesn't play nice like all the other switches on the block. That was just with a PEG6 card we were trying to bond, though, and completely unrelated to Xsan. I can't say that I've had any other issues with our Ciscos, but I'm not willing to rule them out completely. I haven't been able to find any telltale log messages about networking either.

Another possibility might be our ethernet configuration on the MDCs. What hardware are you using? Xserves? Minis? Pros? I've got two Mac minis, each with a Promise SANLink adapter and a Thunderbolt-to-GigE adapter. The GigE adapters daisy-chain through the Promise adapter back to the single Thunderbolt port on the minis. We're currently using the GigE adapters for our metadata network, but I'm considering flipping them around. If anyone is having this issue with non-mini MDCs, then I'd scratch that idea.

Pete

thomasb's picture

Pete,

Our network is all enterprise Cisco based. I'm not managing the network, so I can't give you too many details there. However, we didn't have failover issues this frequently (every 1-2 weeks) before upgrading to OS X 10.9, so I can't see how this could be an external network issue.

Considering we see the same issues in all five of our separate Xsan environments, and the fact that everyone I have talked to who is running Xsan on OS X 10.9 has these issues, I can't see how this can be anything other than a bug in OS X.

Our MDCs consist of Xserves, Mac Pros and Mac minis. The MDC hardware doesn't matter; they all have failover issues.

The Mac mini MDCs are running in a lab environment with a simple HP ProCurve switch for the Metadata Network. We have failovers happen there too, just not as frequently, which I guess is because of the limited activity in our lab environment.


Gerard's picture

Just talked to Apple. This is a known issue and they are working on a fix.

I have a pair of MDCs (Xserve and mac mini) running 10.8.5 and I was considering upgrading to 10.9 until I read this thread. Guess we all play the waiting game now.

Anyone know if the 10.9.3 update fixes this issue?

Gerard's picture

From what Apple has told me, 10.9.3 doesn't fix this MDC issue.

Ugh. That's frustrating to hear.

Still waiting to upgrade here. 10.8.5 has been solid for me so far, so I'm glad I am able to stay back on this version while this issue gets sorted out.

-Matt

sf809's picture

Hi Xanity crew,

Does anyone have news about this issue?

I'll be installing an Xsan with 10.6.3 tomorrow.

I'll also contact AppleCare for feedback!

Cheers from Berlin
Seb

sf809's picture

sf809 wrote:

I'll be installing an Xsan with 10.6.3 tomorrow.

10.9.3 (typo in the version number above).

dmastroluca's picture

Has anyone received anything official from Apple about fixing this issue? Is Apple going to continue to disappoint us enterprise users (again)? Just venting.

Dan Mastroluca
Chief Engineer
KCLV-City of Las Vegas
www.kclv.tv

abstractrude's picture

Has someone actually reported this bug?

-Trevor Carlson
THUMBWAR

sf809's picture

Yes, they are still investigating.

Gerard's picture

Straight from the horse's (Apple's) mouth:

"The current developer seed of 10.9.4 contains networking improvement which product engineering believes may address this behavior. As more information becomes available, I’ll let you know."

This is what my Apple case holder told me yesterday.

mattriley's picture

OK, who is brave enough to update to 10.9.4 and see if it is the magical release we've been waiting for?

thomasb's picture

I have been running our Xsan lab environment on an early seed of 10.9.4 for 18 days, and there have been no failovers. Looks promising, but I'm still waiting a couple more days before starting to upgrade some of our smaller environments.

Gerard's picture

10.9.4 is officially out today. Hopefully it corrects this MDC issue.

ruggieru's picture

I upgraded the failover MDC and then the primary MDC this week.
No problems so far.
Xserves, Early 2009.
5-volume Xsan just short of 100 TB.

mattriley's picture

It's been a week or so now. How's 10.9.4 treating everyone? Still going OK?

Thinking of upgrading tomorrow, so I have the weekend to fix things if it goes wrong.

-Matt

thomasb's picture

We have 29 days of uptime without any failovers in our Xsan lab at work, which is promising.

I also upgraded four different Xsan production environments to 10.9.4 on Sunday last weekend. Upgrade went without a hitch, and everything is still running fine so far.

I'm still waiting until Friday next week, before upgrading our large 104 LUN, 7 MDC, 16 volumes, 100+ client MultiSAN.

Gerard's picture

Had a follow-up with Apple, and they have confirmed that 10.9.4 finally fixes this issue.

mattriley's picture

Great to hear! Thanks for the feedback everyone!

I think I'll give it a go ASAP.

-Matt