Xsan AFP reshare problems
I've been hitting a huge problem with our Xsan reshare this week. I'm getting to the end of my rope pretty fast, so any insights are greatly appreciated.
2x MDCs @ 10.7.5
4x (ActiveStorage) head raids + 1x head chassis for metadata and other data
Qlogic 5600 switches
1x beefy Mac Pro AFP reshare @ 10.7.5
- 12-core @ 2.4GHz
- 64GB RAM
- 6x 4Gb FC link
- 2x 10Gb Ethernet bonded link via SmallTree card (lab1)
- 6x 1Gb Ethernet bonded link via SmallTree card (lab2)
Lab 1 = 40x 2013 iMacs @ 10.8.4
Lab 2 = 17x 2012 iMacs @ 10.8.4
The media that the students are using is primarily old DV-NTSC footage or AVHCD right off our new cameras. We also have several students building simple websites, and they store all their source files and images on this volume. The volume is definitely able to handle these data rates, so the problem must be with my reshare configuration. After a server reboot, the labs all perform fine, but as the day wears on, we eventually reach a point where new users can no longer connect to the AFP reshare. One class arrives, logs in, edits, logs out, (repeat), but by the fourth class, the students get a nasty prompt that the sever is unavailable. Users who are already connected can continue working just fine.
In additional to that, we occasionally have users experience choppy connections to the server. Their footage will suddenly just pause for about 30 seconds. A reboot of the client workstation usually clears this up, but not always (it'll crop up again). When I check the AFP connection list on the server, I usually see two or more duplicate connections from the workstation in question, the older "stale" ones all with increasing idle times while the active connection shows 0 idle time.
When new connections can no longer be made to the server, that's usually the point where I need to run and forcibly shut it off. Even after a fresh reboot of the server and the clients, things aren't perfect. I'm now seeing constant metadata traffic between my reshare server and the acting MDC. It's all very small data (<1MB/s), but it's constant, even when no users are connected via AFP. My acting MDC also has >100% CPU load for the fsm process, which currently boasts 299 threads (is this normal?). The kernel process on the reshare server is constantly at about 20% CPU, and the AppleFileServer process always has about 204 threads (seems high, right?).
Some frequent log messages I'm seeing on the reshare server are:
Oct 24 18:10:39 reshare-server kernel: add_fsevent: unable to get path for vp 0xffffff80881e67c0 (-UNKNOWN-FILE; ret 22; type 2)
Oct 24 18:10:39 reshare-server kernel: add_fsevent: unable to get path for vp 0xffffff80881e67c0 (-UNKNOWN-FILE; ret 22; type 4)
Oct 24 19:03:51: --- last message repeated 1 time ---
Oct 24 19:55:23 reshare-server AppleFileServer: received message with invalid client_id 1583
Oct 24 19:55:25 reshare-server AppleFileServer: _Assert: /SourceCache/afpserver/afpserver-585.7/afpserver/AFPRequest.cpp, 2006
Oct 24 19:55:25: --- last message repeated 2 times ---
Oct 24 20:33:01 reshare-server AppleFileServer: MDSChannelPeerCreate: (os/kern) invalid argument
Oct 24 20:33:51: --- last message repeated 1 time ---
The first group of log messages above seems to relate to when someone has experienced a bad (stale) connection, though I can't confirm this. After said client reboots, these message cease. I was really happy when I made it the whole morning without seeing any of them, but then they started cropping up again. All these messages tend to increase with more use of the client systems, but I can't pinpoint the cause.
I've been debating converting this setup to NFS instead of AFP, but several of our workflows rely pretty heavily on the ACLs that are in place on this volume. My understanding is that Apple's nfsd doesn't pass the ACLs along to the clients. We tried NFS for a small subset of machines for a semester, but AFP was much easier and gave us our ACLs. I also have another almost identical reshare server ready and waiting to go in the rack, but I'd like to make sure I'm heading in the right direction before I throw more hardware at the situation.