Xsan AFP reshare problems

Hi all,

Firstly, sorry for cross-posting, but I did already post this over on [url=http://forums.creativecow.net/readpost/180/857759]Creative Cow[/url].

I've been hitting a huge problem with our Xsan reshare this week. I'm getting to the end of my rope pretty fast, so any insights are greatly appreciated.

Our setup:

Xsan 2.3
2x MDCs @ 10.7.5
4x (ActiveStorage) head raids + 1x head chassis for metadata and other data
Qlogic 5600 switches
1x beefy Mac Pro AFP reshare @ 10.7.5
- 12-core @ 2.4GHz
- 64GB RAM
- 6x 4Gb FC link
- 2x 10Gb Ethernet bonded link via SmallTree card (lab1)
- 6x 1Gb Ethernet bonded link via SmallTree card (lab2)
Lab 1 = 40x 2013 iMacs @ 10.8.4
Lab 2 = 17x 2012 iMacs @ 10.8.4

The media that the students are using is primarily old DV-NTSC footage or AVCHD straight off our new cameras. We also have several students building simple websites, and they store all their source files and images on this volume. The volume itself can easily handle these data rates, so I suspect the problem lies in my reshare configuration. After a server reboot, the labs all perform fine, but as the day wears on, we eventually reach a point where new users can no longer connect to the AFP reshare. One class arrives, logs in, edits, logs out (repeat), but by the fourth class, the students get a nasty prompt that the server is unavailable. Users who are already connected can continue working just fine.
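For context on the data rates: assuming roughly 3.6 MB/s per DV-NTSC (DV25) stream and about 3 MB/s per 24 Mb/s AVCHD stream (my estimates, not measured), even the absolute worst case of every seat streaming at once is modest for a SAN of this class. Rough math:

```python
# Back-of-envelope check on aggregate client demand.
# Per-stream rates below are assumptions, not measurements.
DV_NTSC_MB_S = 3.6   # DV25 is ~25 Mb/s video, ~3.6 MB/s with audio/overhead
AVCHD_MB_S = 3.0     # 24 Mb/s AVCHD ~= 3.0 MB/s

lab1_clients = 40
lab2_clients = 17

# Worst case: every seat in both labs playing one DV stream simultaneously.
worst_case = (lab1_clients + lab2_clients) * DV_NTSC_MB_S
print(f"All 57 seats streaming DV-NTSC: ~{worst_case:.0f} MB/s")  # ~205 MB/s
```

That's comfortably inside what four head RAIDs over 6x 4Gb FC should sustain, which is why I keep coming back to the reshare layer rather than the volume.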

In addition to that, we occasionally have users experience choppy connections to the server: their footage will suddenly just pause for about 30 seconds. A reboot of the client workstation usually clears this up, but not always (it'll crop up again). When I check the AFP connection list on the server, I usually see two or more duplicate connections from the workstation in question, the older "stale" ones all showing increasing idle times while the active connection shows 0 idle time.
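In case anyone wants to script the same check, here's the logic I use to spot those stale duplicates, sketched in Python. The (client, session_id, idle_minutes) record format is hypothetical; adapt it to however you dump your AFP connection list:

```python
# Flag clients holding more than one AFP session, where all but the most
# recently active session are "stale" suspects.
# The record format here is made up for illustration.
def find_stale_sessions(sessions):
    """sessions: list of (client_addr, session_id, idle_minutes) tuples."""
    by_client = {}
    for addr, sid, idle in sessions:
        by_client.setdefault(addr, []).append((sid, idle))
    stale = {}
    for addr, recs in by_client.items():
        if len(recs) > 1:
            recs.sort(key=lambda r: r[1])  # least idle (active) first
            # Everything after the active session with nonzero idle is suspect.
            stale[addr] = [sid for sid, idle in recs[1:] if idle > 0]
    return stale

sample = [
    ("10.0.1.41", 211, 0),   # active session
    ("10.0.1.41", 187, 35),  # stale duplicate from the same iMac
    ("10.0.1.52", 190, 5),   # single session, fine
]
print(find_stale_sessions(sample))  # {'10.0.1.41': [187]}
```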

When new connections can no longer be made to the server, that's usually the point where I have to run over and forcibly power it off. Even after a fresh reboot of the server and the clients, things aren't perfect. I'm now seeing constant metadata traffic between my reshare server and the acting MDC. It's all very small data (<1MB/s), but it's constant, even when no users are connected via AFP. My acting MDC also shows >100% CPU load for the fsm process, which currently boasts 299 threads (is this normal?). The kernel process on the reshare server sits constantly at about 20% CPU, and the AppleFileServer process always has about 204 threads (seems high, right?).

Some frequent log messages I'm seeing on the reshare server are:

Oct 24 18:10:39 reshare-server kernel[0]: add_fsevent: unable to get path for vp 0xffffff80881e67c0 (-UNKNOWN-FILE; ret 22; type 2)
Oct 24 18:10:39 reshare-server kernel[0]: add_fsevent: unable to get path for vp 0xffffff80881e67c0 (-UNKNOWN-FILE; ret 22; type 4)
Oct 24 19:03:51: --- last message repeated 1 time ---

Oct 24 19:55:23 reshare-server AppleFileServer[958]: received message with invalid client_id 1583
Oct 24 19:55:25 reshare-server AppleFileServer[958]: _Assert: /SourceCache/afpserver/afpserver-585.7/afpserver/AFPRequest.cpp, 2006
Oct 24 19:55:25: --- last message repeated 2 times ---

Oct 24 20:33:01 reshare-server AppleFileServer[958]: MDSChannelPeerCreate: (os/kern) invalid argument
Oct 24 20:33:51: --- last message repeated 1 time ---

The first group of log messages above seems to relate to when someone has experienced a bad (stale) connection, though I can't confirm this. After said client reboots, the messages cease. I was really happy when I made it the whole morning without seeing any of them, but then they started cropping up again. All these messages tend to increase with more use of the client systems, but I can't pinpoint the cause.
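To test that stale-connection theory, I've started tallying the add_fsevent errors per vnode pointer so I can match spikes against which client got rebooted when. A quick sketch (using the sample lines above; the regex is my guess at the message shape):

```python
import re
from collections import Counter

# Count add_fsevent errors per vnode pointer, so spikes can be correlated
# with individual client reboots. Sample lines are from the logs above.
log_lines = [
    "Oct 24 18:10:39 reshare-server kernel[0]: add_fsevent: unable to get path for vp 0xffffff80881e67c0 (-UNKNOWN-FILE; ret 22; type 2)",
    "Oct 24 18:10:39 reshare-server kernel[0]: add_fsevent: unable to get path for vp 0xffffff80881e67c0 (-UNKNOWN-FILE; ret 22; type 4)",
]

vp_pattern = re.compile(r"add_fsevent: unable to get path for vp (0x[0-9a-f]+)")
counts = Counter(
    m.group(1) for line in log_lines if (m := vp_pattern.search(line))
)
print(counts)  # Counter({'0xffffff80881e67c0': 2})
```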

I've been debating converting this setup to NFS instead of AFP, but several of our workflows rely pretty heavily on the ACLs that are in place on this volume. My understanding is that Apple's nfsd doesn't pass the ACLs along to the clients. We tried NFS for a small subset of machines for a semester, but AFP was much easier and gave us our ACLs. I also have another almost identical reshare server ready and waiting to go in the rack, but I'd like to make sure I'm heading in the right direction before I throw more hardware at the situation.

Thank you!

Additional log info from the MDC about the server in question:

[1029 12:00:00.399694] 0x10ba28000 (Debug) TKN SUMMARY [server] TokenRequestV3 cnt/1888 avg/31+159 min/13+18 max/118+10989.
[1029 12:00:00.399700] 0x10ba28000 (Debug) TKN SUMMARY [server] OpenChange cnt/48 avg/0+0 min/0+0 max/0+0.
[1029 12:00:00.399705] 0x10ba28000 (Debug) TKN SUMMARY [server] TokenReqAlloc cnt/9977 avg/35+1649 min/13+573 max/28866+28603.
[1029 12:00:00.399711] 0x10ba28000 (Debug) TKN SUMMARY [server] TokenChangeV5 cnt/128570 avg/26+37 min/0+3 max/
[1029 12:00:00.399759] 0x10ba28000 (Debug) BRL SUMMARY [server] SetLock cnt/17291 avg/33+35 min/13+12 max/673+406.
[1029 12:00:00.399765] 0x10ba28000 (Debug) BRL SUMMARY [server] ClearLock cnt/82152 avg/32+28 min/13+9 max/3040+3

I eventually found out that the Xsan volume had gotten corrupted at some point. Numerous attempts to clean it up with cvfsck didn't help, so I ended up reinitializing it and restoring it from backups. The restore is still in progress.

Also got some handy sysctl.conf tuning info from Chris at Small-Tree. I'll be applying this to our file servers and clients after everything's back online.

Hi all,

Still having plenty of issues. Things I've attempted to date:

- reinitialized and restored the SAN volume
- added an additional AFP server to split the load between our labs
- tuned sysctl.conf on AFP servers and clients per instructions from Small-Tree:
    [code]net.inet.tcp.doautorcvbuf=0
    net.inet.tcp.doautosndbuf=0
    kern.ipc.maxsockbuf=8388608
    net.inet.tcp.sendspace=3145728
    net.inet.tcp.recvspace=3145728
    net.inet.tcp.maxseg_unacked=32
    net.inet.tcp.delayed_ack=2
    net.inet.tcp.win_scale_factor=3
kern.maxnbuf=524288 #this line is removed from the client[/code]
- configured AFP servers and clients for jumbo frames
- configured AFP servers and clients for data traffic over a separate subnet
- set afp:authenticationMode="standard" (was set to standard and kerberos)
- ensured Spotlight was disabled for the Xsan volume and the AFP reshares
- added the Xsan volume to the Spotlight privacy list on both AFP servers, just for good measure
- configured AFP servers to purge inactive memory nightly, when no users are connected
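One thing I've learned the hard way: sysctl.conf settings can silently revert (or never apply) after an OS update. I now diff the intended values against what's live. A sketch of the comparison logic; on a real server the "current" dict would be populated from `sysctl -n <key>`, but here a sample dict stands in so the snippet is self-contained:

```python
# Compare intended sysctl tuning against live values and report drift.
# DESIRED mirrors the Small-Tree tuning above; find_drift is my own helper.
DESIRED = {
    "net.inet.tcp.doautorcvbuf": 0,
    "net.inet.tcp.doautosndbuf": 0,
    "kern.ipc.maxsockbuf": 8388608,
    "net.inet.tcp.sendspace": 3145728,
    "net.inet.tcp.recvspace": 3145728,
    "net.inet.tcp.maxseg_unacked": 32,
    "net.inet.tcp.delayed_ack": 2,
    "net.inet.tcp.win_scale_factor": 3,
}

def find_drift(current, desired):
    """Return {key: (current_value, desired_value)} for every mismatch."""
    return {
        key: (current.get(key), want)
        for key, want in desired.items()
        if current.get(key) != want
    }

# Example: a client where delayed_ack reverted to the OS default of 3.
sample_current = dict(DESIRED, **{"net.inet.tcp.delayed_ack": 3})
print(find_drift(sample_current, DESIRED))
# {'net.inet.tcp.delayed_ack': (3, 2)}
```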
What I've observed:

- The FSM process on the acting MDC rises and falls as expected, and the volume seems to perform fine.
- Every time a student opens a project or ingests footage to the SAN, I see tons of "allocator race detected" messages fly by in the logs on the AFP server:
        [code]Nov 8 09:33:50 bronson AppleFileServer[13062]: Allocator race detected: transaction is not verified for -102/488568 - 00011.MTS
        Nov 8 09:33:51 bronson AppleFileServer[13062]: Allocator race detected: transaction is not verified for -102/488579 - 00011.CPI
        Nov 8 09:33:52 bronson AppleFileServer[13062]: Allocator race detected: transaction is not verified for -102/488579 - 00012.CPI
        Nov 8 09:33:52 bronson AppleFileServer[13062]: Allocator race detected: transaction is not verified for -102/488579 - 00013.CPI
        Nov 8 09:33:53 bronson AppleFileServer[13062]: Allocator race detected: transaction is not verified for -102/488568 - 00014.MTS
        Nov 8 09:33:53 bronson AppleFileServer[13062]: Allocator race detected: transaction is not verified for -102/488568 - 00015.MTS
        Nov 8 09:33:53 bronson AppleFileServer[13062]: Allocator race detected: transaction is not verified for -102/488579 - 00015.CPI
        Nov 8 09:34:28 bronson AppleFileServer[13062]: Allocator race detected: transaction is not verified for -102/90938 - 00012-2f82795e-552d-7506-c565-d29b0000018e.ims
        Nov 8 09:34:30 bronson AppleFileServer[13062]: Allocator race detected: transaction is not verified for -102/90936 - b6d8a43f-85f4-62bf-99d4-dd81000000281536.mcdb
        Nov 8 09:34:34 bronson AppleFileServer[13062]: Allocator race detected: transaction is not verified for -102/90936 - ecb8f473-2500-670f-eac7-a1d10000018e1536.mcdb
        Nov 8 09:34:35 bronson AppleFileServer[13062]: Allocator race detected: transaction is not verified for -102/90938 - 00007.MTS 48000_3.pek
        Nov 8 09:34:38 bronson AppleFileServer[13062]: Allocator race detected: transaction is not verified for -102/90936 - a1077181-71c7-9e85-bc49-d234000000281536.mcdb
        Nov 8 09:34:40 bronson AppleFileServer[13062]: Allocator race detected: transaction is not verified for -102/90938 - 00011-3c79c119-f0bc-0379-54be-496d0000018e.ims
        Nov 8 09:34:40 bronson AppleFileServer[13062]: Allocator race detected: transaction is not verified for -102/90936 - 3c79c119-f0bc-0379-54be-496d0000018e1536.mcdb
Nov 8 09:34:40 bronson AppleFileServer[13062]: Allocator race detected: transaction is not verified for -102/90938 - 00011.MTS 48000_1.pek[/code]
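To see whether it's just camera media or also the editors' sidecar files, I bucket those messages by file extension. A sketch (the regex is my guess at the message shape, based on the samples above):

```python
import re
from collections import Counter

# Bucket "Allocator race detected" messages by file extension, to see whether
# camera media (.MTS/.CPI) or editor sidecar files (.pek/.ims/.mcdb) dominate.
sample = [
    "Nov 8 09:33:50 bronson AppleFileServer[13062]: Allocator race detected: transaction is not verified for -102/488568 - 00011.MTS",
    "Nov 8 09:33:51 bronson AppleFileServer[13062]: Allocator race detected: transaction is not verified for -102/488579 - 00011.CPI",
    "Nov 8 09:34:35 bronson AppleFileServer[13062]: Allocator race detected: transaction is not verified for -102/90938 - 00007.MTS 48000_3.pek",
]

race = re.compile(r"Allocator race detected: transaction is not verified for \S+ - (.+)$")
exts = Counter()
for line in sample:
    m = race.search(line)
    if m:
        # The last dot-suffix of the reported name is the best guess at type.
        exts[m.group(1).rsplit(".", 1)[-1]] += 1
print(dict(exts))  # {'MTS': 1, 'CPI': 1, 'pek': 1}
```

In my logs, the sidecar files (.pek waveform caches, .mcdb thumbnail databases) show up at least as often as the raw media.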
- Also seeing occasional MDSChannelPeerCreate messages, which is what prompted making sure Spotlight was disabled. These have not gone away:
[code]Nov 8 09:39:21 bronson AppleFileServer[13062]: MDSChannelPeerCreate: (os/kern) invalid argument
Nov 8 09:40:04: --- last message repeated 1 time ---
Nov 8 09:41:41 bronson AppleFileServer[13062]: MDSChannelPeerCreate: (os/kern) invalid argument[/code]
- Prior to the volume crashing and burning last time, I had seen a lot of inode and "add_fsevent: unable to get path for vp…" errors on the AFP servers. These error messages have disappeared since reinitializing the volume.
- While I know each server should be able to push a lot of data in and out, I rarely see anything higher than 300-400MB/s. The average output is about 100MB/s. (This is via Activity Monitor.)
- When there's a lot of traffic through the AFP servers, I'm seeing a LOT of inbound packets: tens of thousands of packets per second. The acting MDC rarely surpasses 1,000 packets per second, though I have seen it jump to 2,000-3,000 during heavy use. I've been told the hardware can handle this many packets per second (or more), but I don't have enough experience with these numbers to know whether this looks normal. (Again, all observed via Activity Monitor.)
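For what it's worth, the packet rates and throughput numbers seem roughly consistent with each other once you account for the MTU. Assuming full frames, the conversion is simple arithmetic:

```python
# Relate Activity Monitor's packets/s to MB/s under different MTUs.
# Assumes every packet carries a full frame, which overstates efficiency a bit.
def packets_per_sec(throughput_mb_s, mtu_bytes):
    """Packets per second needed to carry a given payload rate."""
    return throughput_mb_s * 1_000_000 / mtu_bytes

# 300 MB/s of AFP traffic:
print(round(packets_per_sec(300, 1500)))  # 200000 packets/s at standard MTU
print(round(packets_per_sec(300, 9000)))  # 33333 packets/s with jumbo frames
```

So "tens of thousands of packets per second" at 300MB/s lines up with jumbo frames actually being in effect, which is at least some reassurance that the frame-size config took.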
- On the AFP servers, I'm seeing lots of errors in the opendirectoryd logs:
            [code]2013-11-08 08:39:29.490 CST - Module: SystemCache - Misconfiguration detected in hash 'Kerberos':
            User 'researchsubject10' (/LDAPv3/od.server) - ID 4510 - UUID 5FB0280D-ECCB-4471-A0FC-E2045E87E6DD - SID S-1-5-21-4096-2147483673-3823609622-10020
            User 'cjlimburg' (/LDAPv3/od.server) - ID 11420395 - UUID B47B82BE-5902-461A-9A75-AF6BFDF6C9D3 - SID S-1-5-21-4096-2147483673-3823609622-22841790
            2013-11-08 08:39:48.912 CST - Module: SystemCache - Misconfiguration detected in hash 'Kerberos':
            User 'researchsubject10' (/LDAPv3/od.server) - ID 4510 - UUID 5FB0280D-ECCB-4471-A0FC-E2045E87E6DD - SID S-1-5-21-4096-2147483673-3823609622-10020
            User 'cjlimburg' (/LDAPv3/od.server) - ID 11420395 - UUID B47B82BE-5902-461A-9A75-AF6BFDF6C9D3 - SID S-1-5-21-4096-2147483673-3823609622-22841790
            2013-11-08 09:04:38.688 CST - Module: SystemCache - Misconfiguration detected in hash 'Kerberos':
            User 'gabriel2' (/LDAPv3/od.server) - ID 11420386 - UUID E14BB6CB-6AFE-4A2A-9850-682E1002195C - SID S-1-5-21-4096-2147483673-3823609622-22841772
User 'cjlimburg' (/LDAPv3/od.server) - ID 11420395 - UUID B47B82BE-5902-461A-9A75-AF6BFDF6C9D3 - SID S-1-5-21-4096-2147483673-3823609622-22841790[/code]
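To check whether it's always the same accounts colliding, I reduce those warnings down to the user pairs flagged together. A parsing sketch (the block structure is my reading of the format above, and the sample is abbreviated to the fields that matter):

```python
import re

# Extract which user accounts opendirectoryd flags together in the
# "Misconfiguration detected in hash 'Kerberos'" warnings.
sample_log = """\
2013-11-08 08:39:29.490 CST - Module: SystemCache - Misconfiguration detected in hash 'Kerberos':
User 'researchsubject10' (/LDAPv3/od.server) - ID 4510
User 'cjlimburg' (/LDAPv3/od.server) - ID 11420395
2013-11-08 09:04:38.688 CST - Module: SystemCache - Misconfiguration detected in hash 'Kerberos':
User 'gabriel2' (/LDAPv3/od.server) - ID 11420386
User 'cjlimburg' (/LDAPv3/od.server) - ID 11420395
"""

pairs = []
current = []
for line in sample_log.splitlines():
    if "Misconfiguration detected" in line:
        # A new warning block starts; flush the previous group of users.
        if current:
            pairs.append(tuple(current))
        current = []
    else:
        m = re.match(r"User '([^']+)'", line)
        if m:
            current.append(m.group(1))
if current:
    pairs.append(tuple(current))

print(pairs)
# [('researchsubject10', 'cjlimburg'), ('gabriel2', 'cjlimburg')]
```

In my case one account ('cjlimburg') appears in nearly every warning, which makes me wonder about a duplicate or stale entry for that user in Open Directory.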

I'm really starting to lose it here. Hoping someone can shed a little insight into all this…

Thank you!

Quote:
- Prior to the volume crashing and burning last time, I had seen a lot of inode and "add_fsevent: unable to get path for vp…" errors on the AFP servers. These error messages have disappeared since reinitializing the volume.

And it looks like those messages are actually still there. Between the two servers, I count 6 messages total within the last four days…