Finder crashes after xsan finder copy, Apple is stumped

dmastroluca's picture

I have been seeing a problem on my two xsan environments. Once or twice a week, an xsan client will become unresponsive after a finder copy or file transfer. When it happens on the client, the mounted xsan volume is slow to display its contents, or will show nothing at all when there is 12TB of data there. When the client is in this state, if I do a "get info" on the mounted volume, the "more info" window says "Fetching" and there is a spinning gear. the client never recovers from this state, and has to be power cycled to break free of this funk. If I go into xsan admin and try to unmount the volume from the problem client, it will not let me unmount it. I have seen this on my xserve and MacPro clients. I am running all the current softwares on the os and xsan. I first experienced this issue on one san environment, then noticed it on a completely different san. I have un-installed and re-installed the os and xsan on the clients and un-installed and re-installed the os and xsan on the MDC's to no avail. I have destroyed one of my xsan volumes and rebuilt it and the problem came back.

I have been working with high level Apple XSAN engineers on this problem since May and they have no idea why this is happening. I have been sending them EDC reports after the issue and all the logs are fine.

Has anyone else experienced anything like this? Sorry for the long post, but this issue is driving me crazy.

nrausch's picture

I did experience this in two different locations.

Once it was bad Fiber between MDC and Switch.
The other time we were able to isolate it to a bad SFP.
It was going in and out, so it wasn't always busted.

But when it did go out....

You could still see the Volume and maybe some Folders. Because the Metadata network thought things were still connected. You couldn't usually see files. If you clicked on anything it would lock up and I would have to force reboot everything.

After shutting down all clients and still getting the symptoms at the rack with servers and storage only... we just replaced all the copper fiber cables in the rack. The problem never came back.

We found it had to be hardware after doing what you did, reinstalling and rebuilding...

MattG's picture

As usual, Apple's ability to only look at software is what's leading you astray. To pick up on what nrausch mentioned, also look at any kind of errors registering on your fibre channel switches on the ports that are connected to those clients, specifically decode and CRC Errors. Look for inconsistencies in connectivity on your metadata network. Make sure the metadata network switch itself is not going bad. Connectivity is likely your issue.

abstractrude's picture

I know this makes me an xsan snob or whatever but I no longer recommend the copper cables. They are fine for small environments but it seems whenever I cable them up in racks a few go bad and cause more headaches than they are worth.

-Trevor Carlson
THUMBWAR

dmastroluca's picture

Both myself and the Apple engineers looked at my switch logs and they came back clean. I even replaced a metadata ethernet switch with no results.

Could this be another possible cause? Permissions???

I can make it happen on my production san by doing a finder copy while logged in as "admin" I am asked to authenticate when I am copying to the xsan volume, it starts the copy, gets almost to the end, and then the client goes wonky. After examining the permissions (via xsan admin file management) on the xsan volume, I saw that the admin account I am using does not have RW access on the ACL of the volume. Instead of admin being on the ACL with RW access, there is a user called FFFFEEEEE-DDDD-CCCC-BBBB-AAAA820480 that has RW access. What the hell is this??

Dan Mastroluca
Chief Engineer
KCLV-City of Las Vegas
www.kclv.tv

abstractrude's picture

are all your machines pointing to the same directory?

-Trevor Carlson
THUMBWAR

dmastroluca's picture

Yes, OD and DNS configs are top notch.

Dan Mastroluca
Chief Engineer
KCLV-City of Las Vegas
www.kclv.tv

lotte's picture

Have you tried changing all your SFP Modules (in case you don´t use copper).
If you are using copper I would also as nrausch recommend switching to fibre cables!
I once had the same and it was indeed a SFP Module... In our case one plugged into the raid system, so no errors on the switch...

Lotte<

ACSA's picture

abstractrude wrote:
I know this makes me an xsan snob or whatever but I no longer recommend the copper cables. They are fine for small environments but it seems whenever I cable them up in racks a few go bad and cause more headaches than they are worth./quote

I'll have to agree, the copper cables are more prone to failure and give a lot of headaches...... Just replace those cables to fibre.....

mark raudonis's picture

Over the past five years we HAVE had a couple of copper cables go bad. My question is "HOW DOES THAT HAPPEN"? Nobody touches them. They just sit there. It's not high voltage going through. Are there any electrical components within the connectors? If not, how does a piece of copper "go bad?" Anybody have a theory?

Mark

abstractrude's picture

i have no idea! :( Im thinking that they dont bend well. its funny one of the reasons i used to like the copper cables over optical was because i thought they were more durable. turns out optical cables are pretty durable. i treat them pretty bad on the client machines all the time.

i remember one guy told me they are affected by lighting like fluorescent lights and stuff, but he was one guy.

-Trevor Carlson
THUMBWAR

MattG's picture

Let's steer this back on topic. Permissions are almost never to blame for something like this. Even though, again, you've looked at logs that come back clean, you'd have to be doing serious I/O on each connection of the fibre channel switch, setting baselines, and then looking at _realtime_ stats to see if there is a bad connection. Logs are not going to help you. If you have qlogic switching, you can see this in their GUI. This will often lead to where the bad connectivity is. It's up to you to then further troubleshoot whether it's bad HBA port, bad HBA transceiver, bad cabling, bad switch transceiver or bad switch port.

Further, seeing a bogus user as you mention is clear evidence that at least that machine you mentioned is not properly bound to the directory, or bad permissions exist on the volume.

dmastroluca's picture

FYI, we are all fiber with SFP's.

But the plot thickens on the user called FFFFEEEEE-DDDD-CCCC-BBBB-AAAA820480. I created a test folder on the san. I added my admin (501) to the ACL with RW permissions. When I checked back later on the permissions of that test folder the admin user was replaced by a user called FFFFEEEEE-DDDD-CCCC-BBBB-AAAA820480 with RW access. Again, not sure if this is causing THE problem, but this does not look kosher. Why would the admin user change to garbage?

Dan Mastroluca
Chief Engineer
KCLV-City of Las Vegas
www.kclv.tv

MattG's picture

The long string of characters might be the GUID of the 501 user. Simply add the 501 user to your generic San Users group in the OD and that part should go away.

rstasel's picture

I've seen these ACL results before with 10.6.

So, my questions would be:

What OS are the clients/MDCs running? And where is the OD Master in the whole process?

I kept getting these same weird ACL results on my setup since my OD Master wasn't my primary MDC (it's my backup). The fix was to configure my primary MDC to be an OD replica. It seems that xsan is doing some part of the permissions mapping in Xsan 2.2. That said, a 10.5 Xsan client showed ACLs perfectly. It was only with the 10.6 Xsan clients. There also was the issue that some user accounts didn't have UUIDs (they'd been around since before OD).

That said, I never saw anything like you're seeing. So, guess I'm just helping clear up the potentially misleading problem so you can focus back on the main one.

Have a post about it on my site here: http://www.staze.org/10-6-server-xsan-2-2-1-and-acl-oddities/

abstractrude's picture

so in a multi san config do you just need multple replicas. nothing like reading this before moving a huge SAN to 10.6.4 2.2. MattG, Aaron you guys seeing anything like this in clients?

-Trevor Carlson
THUMBWAR

dmastroluca's picture

Clients and MDC's running 10.6.4 on one san, and 10.6.3 on the another san. OD Master is on a mac mini running 10.6.3. OD Master runs the show for both san environments.

UPDATEUPDATEUPDATEUPDATEUPDATE

Apple had me go through all my FC Switches and I replaced some fiber and transceivers with copper. All switch errors are now gone, but the issue is still happening. Just yesterday I tried to delete a video file off the volume and it caused the mounted volume to become unresponsive. Looks like it is not a hardware problem. Apple now has me collecting stackshots and DirectoryService and Finder processes during an "event" for their review.

Dan Mastroluca
Chief Engineer
KCLV-City of Las Vegas
www.kclv.tv

abstractrude's picture

booooooooooo

-Trevor Carlson
THUMBWAR

smlsan's picture

Have you checked for fragmented files? Run "snfsdefrag -cr " where is the mount point for your XSAN filesystem (/Volumes/...) and look for files with >1000 extents.

If there are particular files you have trouble with you can run "snfsdefrag -c " to get that file's extent count.

We are working with Apple on an issue that appears to hang Finder (and afterward CLI commands) when files have thousands of extents. This is only happening with 10.6 clients (XSAN 2.2.1 or 2.1.1) - the 10.5 clients are not affected.

Defragmenting files is a temporary workaround.

abstractrude's picture

I have not been able to replicate this issue in the lab or production. Running 10.6.4/2.2.1 no problemo.

-Trevor Carlson
THUMBWAR

smlsan's picture

Hang due to high extent count may not be dmastroluca's issue. Just something to look for.

Took us months to figure out it was highly fragmented files (thousands of extents) triggering the hang. We're still trying to figure out exactly where the bug is and get a fix. In the mean time we adjusted filesystem parameters to reduce the likelihood of having lots of extents. The person who setup the system originally had used low values for InodeExpand{Min,Inc,Max).

dmastroluca - when the problem occurs does it affect the entire SAN, ie. all clients? Or does the problem happen on only one client at a time?

dmastroluca's picture

The volume is a new volume and has not ben used very much so the extents are very low.

When the event happens it is only on one client. The rest of the clients have RW access just fine.

Who have you been working with at Apple? My case number is 160722155 if you want them to take a look and see if this is a bigger problem.

Dan Mastroluca
Chief Engineer
KCLV-City of Las Vegas
www.kclv.tv

smlsan's picture

The issue we've run into is number of extents in a file, not number of extents in use overall. Issue occurs only on clients that access the file(s) that have very high extent count. Only occurs on 10.6 clients.

Disabling Icon Preview and Preview Column (column view) in Finder reduces the chances of a hang. We're doing testing now to see if copying a file in Finder can induce the hang. Note that move/rename will not because Finder does not read through the file for a move/rename and it is reading all the extent info that appears to trigger the hang.

We have recreated the problem in one of our labs with a new volume. Age of volume and amount of content does not have a direct impact, although a highly fragmented file system is more likely to create the right conditions that lead to the hang.

Repro case is fastest with a file >10,000 extents. Simply browsing a directory containing a file with that many extents will hang in column view with preview enabled. Touching the file (touch command in Terminal) will make Finder rescan the file and usually speeds up the hang occurrence.

I need to check with our client before sending their case number or any customer specific info.

dmastroluca's picture

Wow!! This looks like the same issue we are having. My Master Control Operator just tried to copy a large video file yesterday via finder and it triggered a hang. I sent stackshot logs and process samples for Finder and DirectoryService to our Apple engineer.

The Apple engineers told me to disable Icon Preview, but I had it disabled long ago thinking that was the problem. Since I had it disabled already, I did not see a difference in the frequency of hangs.

What were the tweaks you did to help with the number of extents created in a file?

Dan Mastroluca
Chief Engineer
KCLV-City of Las Vegas
www.kclv.tv

smlsan's picture

We have reproduced the Finder hang with 4 modes: Icon Preview, Preview Column (column view), Quick Look and Get Info. All 4 of these appear to read through the file extents in order to generate the preview. Something about this process relates to the hang, although we are still researching the root cause. We can even get a hang to occur with Quick Look or Get Info after turning off Icon Preview and Preview Column.

Again we are only seeing this on 10.6 clients. We have tested 10.5 clients in parallel and had no hangs to date.

Check extent count on the file that hung: snfsdefrag -c . If extent count is in the thousands this may be the same issue. When the hang occurs try doing 'ls -l' on the file in Terminal. The ls should hang as well. Sometimes the file that actually caused the hang is not the one you have trouble with. You may need to do a recursive ls on the directory tree to find the object that is actually hung.

You can defragment the file with snfsdefrag and should wind up with 1 extent which will eliminate the hang behavior for that file. For reasons discussed in other forum threads it isn't a good idea to run snfsdefrag on the whole filesystem or constantly, so use per file defrag and that is at best a temporary workaround.

We were able to eliminate the high extent count by increasing InodeExpand* parameters in the filesystem.cfg. InodeExpandInc is helpful, but InodeExpandMax was the main improvement. Defaults for XSAN 2.2.1 are:

InodeExpandMin 8 blocks (128K @ 16K block size)
InodeExpandInc 32 blocks (512K @ 16K block size)
InodeExpandMax 2048 blocks (32M @ 16K block size)

Filesystem needs to be stopped and restarted for those .cfg changes to take effect.

If you are working on very large files (100GB+) even the 32MB InodeExpandMax can lead to a few thousand extents under specific conditions. Check with Apple first, but it may make sense to increase those parameters beyond default. If they are set lower than default it may make sense to change them to default first.

Note, if you have only one process writing to a stripe group and the filesystem free space is not heavily fragmented you should get a file with 1 or maybe 2 extents. The high extent counts occur when writing two or more files at the same time in the same stripe group and they are competing for allocation of free space. That is where the InodeExpand* parameters kick in.

Please post a follow up with some of the extent count info. You can run a recursive 'snfsdefrag -c -r ' for a directory tree, or use 'cvfsck -e' to get detailed extent information about every file in the filesystem. I'm very curious to see if you are running into the same issue we are working on.

dmastroluca's picture

I am so excited that there might actually be a workaround for this!! Here is my extent count on just one of my folders. This folder is for long form records of our City Council Meetings, usually 3-4 hours in length @ DV25. These files are the ones that give us the most problems.

sh-3.2# snfsdefrag -c -r /Volumes/KCLV_AIR_OPS/PLAYOUT/Council\ Meeting
/Volumes/KCLV_AIR_OPS/PLAYOUT/Council Meeting/Council PM Pt.mov: 1 extent(s)
/Volumes/KCLV_AIR_OPS/PLAYOUT/Council Meeting/Council AM Pt. 2 080410.mov: 11513 extent(s)
/Volumes/KCLV_AIR_OPS/PLAYOUT/Council Meeting/._Council AM Pt. 2 080410.mov: 1 extent(s)
/Volumes/KCLV_AIR_OPS/PLAYOUT/Council Meeting/Council AM 072110 INGEST.mov: 61 extent(s)
/Volumes/KCLV_AIR_OPS/PLAYOUT/Council Meeting/._Council (CLOSE).mov: 1 extent(s)
/Volumes/KCLV_AIR_OPS/PLAYOUT/Council Meeting/Council AM Pt. 1 090110.mov: 1363 extent(s)
/Volumes/KCLV_AIR_OPS/PLAYOUT/Council Meeting/Council AM 072110 Item 55.mov: 6075 extent(s)
/Volumes/KCLV_AIR_OPS/PLAYOUT/Council Meeting/City Council AM Pt. 2 081810.mov: 12880 extent(s)
/Volumes/KCLV_AIR_OPS/PLAYOUT/Council Meeting/._Council PM Pt. 3 080410.mov: 1 extent(s)
/Volumes/KCLV_AIR_OPS/PLAYOUT/Council Meeting/Council PM Pt. 3 080410.mov: 8248 extent(s)
/Volumes/KCLV_AIR_OPS/PLAYOUT/Council Meeting/._Council PM Pt. 3 081810.mov: 1 extent(s)
/Volumes/KCLV_AIR_OPS/PLAYOUT/Council Meeting/._Special City Council 051810.mov: 1 extent(s)
/Volumes/KCLV_AIR_OPS/PLAYOUT/Council Meeting/Council PM Pt. 3 081810.mov: 9315 extent(s)
/Volumes/KCLV_AIR_OPS/PLAYOUT/Council Meeting/._Council Closed Captioning Announcement.mov: 1 extent(s)
/Volumes/KCLV_AIR_OPS/PLAYOUT/Council Meeting/Council AM Pt. 2 090110.mov: 11156 extent(s)
/Volumes/KCLV_AIR_OPS/PLAYOUT/Council Meeting/._Council AM Pt. 1 081810.mov: 1 extent(s)
/Volumes/KCLV_AIR_OPS/PLAYOUT/Council Meeting/Council AM Pt. 1 081810.mov: 1037 extent(s)
/Volumes/KCLV_AIR_OPS/PLAYOUT/Council Meeting/._Council AM Pt. 1 080410.mov: 1 extent(s)
/Volumes/KCLV_AIR_OPS/PLAYOUT/Council Meeting/Council AM Pt. 1 080410.mov: 1448 extent(s)

I usually use "Get Info" as a way of telling that a hang is happening. The "more info" pane will always say "Fetching" with a spinning gear during the hang. That is when I know I have to do a power cycle to shake it free of the funk.

I am defragging my high extent count files right now!!!!!! Thanks for this info. Could you please contact me via email for a sidebar question? My email is dmastroluca@lasvegasnevada.gov

Dan Mastroluca
Chief Engineer
KCLV-City of Las Vegas

Dan Mastroluca
Chief Engineer
KCLV-City of Las Vegas
www.kclv.tv

smlsan's picture

Definitely sounds like the issue we have been working on. Let us know if defrag functions as a workaround for now.

FYI - testing in our lab reliably produces hangs with 10,000+ extents. It's less reliable in the 3,000-8,000 range. This is observational only - no specific testing was done to find a particular threshold and it appears to affected by memory use on the client.

I suspect those files with 6,000 extents and above will produce hangs more often than not if you play them in Quicktime or preview/get info.

dmastroluca's picture

I wanted to give everyone an update.

Apple had me change my xsan config files to

InodeExpandMin = 16 (**to match the stripe breadth)
InodeExpandInc = 64
InodeExpandMax = 4096

Then they had me move my media off the volume, erase the volume, and then transfer the media back to remove all the extents. Sometime after the transfer back the san became unresponsive on 2 of my clients. Same issue as before. Spinning gear and "fetching" in more info. Files do not show up in folders, when I re-launch finder, it blue screens with a spinning gear. Only option is power cycle with required disk repair of HD.

They had me run a cvfsck -x after transfer and it came back with a lot fewer extents. My Apple engineer said the EDC's for both affected clients and my 2 MDC's were clean with no indication why this was happening.

They wanted me to reset everything and "push" the volume to see if the problem re-appears.

On one of my clients I did a few finder copies, and played some media in quicktime, and it started acting up again. Same issue as before. Spinning gear and "fetching" in more info. Files do not show up in folders.

This time all the clients are affected. When I right click get info on volume in finder the fetching gear is spinning.

This is INSANE!! I am thinking about getting off of Snow Leopard until this issue is addressed by Apple.

Dan Mastroluca
Chief Engineer
KCLV-City of Las Vegas
www.kclv.tv

abstractrude's picture

i still havent seen this issues in my SL deployments. sorry about all the trouble you're having.

-Trevor Carlson
THUMBWAR

lucasnap's picture

I am seeing the same problem at one of my sites. They are using very large ProRes files, up to 150GB per file. The problems occur with the files who are ingested or copied simultaneously, like smlsan said.

When we duplicate a file with a lot of extent(s) in a moment when the Xsan is not used by anyone else the new file is fine.

The setup is fully 10.6.4 and 2.2.1.

I just wanted to say to abstractrude that he is not alone in this.

abstractrude's picture

i totally believe him. are you using native extended attributes? i have about 70 hollywood movies at prores 422 hq in our SAN. they are usually about 125-175 gigs. Haven't been able to reproduce those issues...what are your metadata latency times like. can you give us a cvsummary.

Typical large prores file found on the big SAN.
Macgruber.mov: 8 extent(s)
137G Aug 6 15:33 Macgruber.mov

Have you thought about doing a perl script or even bash that does nightly extent scrubbing?? so doesnt solve the problem but it would be a great workaround.

-Trevor Carlson
THUMBWAR

lucasnap's picture

Latency is between 250-280 us. So that's OK.
Extended Attributes are turned off, you never know when they come with the question if they can hook up a Linux machine for this and that.

What do you mean by cvsummary? Do you mean a cvgather?

Does anyone have any news on the Apple bugreport?

Our extent counts are up to 6000, but even more files are holy. This is because of the way they work. They record a show with 3 HD camera's simultaneously ingesting on the Xsan Volume.
Our workaround is a big bummer: The ingest Mac's record on an extra internal disk and after that the file is copied (one by one) to the Xsan Volume.

I've played with snfsdefrag, but because of the hanging and instability of machines when only touching the corrupt file i'm not so keen on scripting on this subject. I rather see an update.

abstractrude's picture

cvsummary is in the volume log. Also, how much free space do you usually have on your Xsan? cvadmin show for us? im trying to figure out why are you getting such high extent counts on your files.

-Trevor Carlson
THUMBWAR

lucasnap's picture

These are the summary's:

[code][1016 14:00:00.485958] 0x108573000 (Debug) FSM memory SUMMARY virtual size 2683MB resident size 122MB.
[1016 14:00:00.485978] 0x108573000 (Debug) FSM I/O SUMMARY writes total/9075.
[1016 14:00:00.485984] 0x108573000 (Debug) FSM I/O SUMMARY writes journal/942 sb/0 buf/954 abm/0.
[1016 14:00:00.485996] 0x108573000 (Debug) FSM I/O SUMMARY writes inode/954 ganged/2160 (7.08%).
[1016 14:00:00.486001] 0x108573000 (Debug) FSM wait SUMMARY inode pool expand waits/0.
[1016 14:00:00.486006] 0x108573000 (Debug) FSM wait SUMMARY journal waits/0 max depth/0 max time/0
[1016 14:00:00.486011] 0x108573000 (Debug) FSM wait SUMMARY journal bytes used avg/443051 max/1230848.
[1016 14:00:00.486016] 0x108573000 (Debug) FSM wait SUMMARY journal rsv count/2160 avg/409600 bytes.
[1016 14:00:00.486020] 0x108573000 (Debug) FSM wait SUMMARY free buffer waits/0.
[1016 14:00:00.486025] 0x108573000 (Debug) FSM wait SUMMARY free inode waits/0.
[1016 14:00:00.486030] 0x108573000 (Debug) FSM wait SUMMARY revokes/0 avg/0 min/0 max/0.
[1016 14:00:00.486034] 0x108573000 (Debug) FSM wait SUMMARY revokes/0 avg/0 min/0 max/0.
[1016 14:00:00.486042] 0x108573000 (Debug) FSM Trans SUMMARY Reserve cnt/2160 avg/2 min/0 max/20.
[1016 14:00:00.486049] 0x108573000 (Debug) FSM Trans SUMMARY Commit cnt/2160 avg/23 min/4 max/176.
[1016 14:00:00.486054] 0x108573000 (Debug) FSM threads SUMMARY max busy hi-prio/6 lo-prio/3.
[1016 14:00:00.486059] 0x108573000 (Debug) FSM threads SUMMARY max busy dmig/0 events/0.
[1016 14:00:00.486064] 0x108573000 (Debug) FSM msg queue SUMMARY hi-prio now/0 min/0 max/4.
[1016 14:00:00.486069] 0x108573000 (Debug) FSM msg queue SUMMARY lo-prio now/0 min/0 max/2.
[1016 14:00:00.486073] 0x108573000 (Debug) FSM msg queue SUMMARY dmig now/0 min/0 max/0.
[1016 14:00:00.486078] 0x108573000 (Debug) FSM msg queue SUMMARY events now/0 min/0 max/0.
[1016 14:00:00.486084] 0x108573000 (Debug) FSM cache SUMMARY inode lookups/22500 misses/0 hits/100.00%.
[1016 14:00:00.486089] 0x108573000 (Debug) FSM cache SUMMARY free incore inodes now/8192 min/8189 max/8192.
[1016 14:00:00.486095] 0x108573000 (Debug) FSM cache SUMMARY buffer lookups/84626 misses/0 hits/100.00%.
[1016 14:00:00.486100] 0x108573000 (Debug) FSM cache SUMMARY free buffers now/2048 min/2045 max/2048.
[1016 14:00:00.486105] 0x108573000 (Debug) FSM cache SUMMARY attrs now/3726 min/3726 max/3726.
[1016 14:00:00.486110] 0x108573000 (Debug) FSM extent SUMMARY extent lookups/11695 misses/0 hits/100.00%.
[1016 14:00:00.486115] 0x108573000 (Debug) FSM extent SUMMARY hint tries/0 misses/0 hits/0.00%.
[1016 14:00:00.486124] 0x108573000 (Debug) VOP SUMMARY Setattr cnt/2160 avg/24+62 min/16+34 max/108+254.
[1016 14:00:00.486133] 0x108573000 (Debug) VOP SUMMARY Getquota cnt/120 avg/23+54 min/17+29 max/28+78.
[1016 14:00:00.486141] 0x108573000 (Debug) VOP SUMMARY VopLookupV4 cnt/8640 avg/27+44 min/16+19 max/123+199.
[1016 14:00:00.486151] 0x108573000 (Debug) TKN SUMMARY TokenChangeV5 cnt/2160 avg/20+18 min/10+4 max/107+166.
[1016 14:00:00.486165] 0x108573000 (Debug) SG SUMMARY WorkingMTD space total/1.82 TB free/1.82 TB (99.96%)
[1016 14:00:00.486174] 0x108573000 (Debug) SG SUMMARY WorkingMTD space minfree/1.82 TB (99.96%) maxfree/1.82 TB (99.96%)
[1016 14:00:00.486181] 0x108573000 (Debug) SG SUMMARY WorkingMTD alloc extent cnt/0 avgsize/0.00 B.
[1016 14:00:00.486186] 0x108573000 (Debug) SG SUMMARY WorkingMTD 19 btree free space fragments 35 splay tree fragments
[1016 14:00:00.486191] 0x108573000 (Debug) SG SUMMARY WorkingMTD perfectfit succeeded/0 failed/0
[1016 14:00:00.486196] 0x108573000 (Debug) SG SUMMARY WorkingMTD stripealign succeeded/0 failed/0
[1016 14:00:00.486203] 0x108573000 (Debug) SG SUMMARY WorkingData-1 space total/43.66 TB free/39.12 TB (89.61%)
[1016 14:00:00.486211] 0x108573000 (Debug) SG SUMMARY WorkingData-1 space minfree/39.12 TB (89.61%) maxfree/39.12 TB (89.61%)
[1016 14:00:00.486217] 0x108573000 (Debug) SG SUMMARY WorkingData-1 alloc extent cnt/0 avgsize/0.00 B.
[1016 14:00:00.486222] 0x108573000 (Debug) SG SUMMARY WorkingData-1 21402 btree free space fragments 42666 splay tree fragments
[1016 14:00:00.486227] 0x108573000 (Debug) SG SUMMARY WorkingData-1 perfectfit succeeded/0 failed/0
[1016 14:00:00.486232] 0x108573000 (Debug) SG SUMMARY WorkingData-1 stripealign succeeded/0 failed/0
[1016 14:00:00.486239] 0x108573000 (Debug) SG SUMMARY WorkingData-2 space total/43.66 TB free/38.53 TB (88.26%)
[1016 14:00:00.486246] 0x108573000 (Debug) SG SUMMARY WorkingData-2 space minfree/38.53 TB (88.26%) maxfree/38.53 TB (88.26%)
[1016 14:00:00.486252] 0x108573000 (Debug) SG SUMMARY WorkingData-2 alloc extent cnt/0 avgsize/0.00 B.
[1016 14:00:00.486257] 0x108573000 (Debug) SG SUMMARY WorkingData-2 19447 btree free space fragments 38732 splay tree fragments
[1016 14:00:00.486262] 0x108573000 (Debug) SG SUMMARY WorkingData-2 perfectfit succeeded/0 failed/0
[1016 14:00:00.486267] 0x108573000 (Debug) SG SUMMARY WorkingData-2 stripealign succeeded/0 failed/0
[1016 14:00:00.486273] 0x108573000 (Debug) PIO HiPriWr SUMMARY Working_MTD cnt/942 maxq/0.
[1016 14:00:00.486280] 0x108573000 (Debug) PIO HiPriWr SUMMARY Working_MTD avg/1691 min/479 max/14271.
[1016 14:00:00.486286] 0x108573000 (Debug) PIO HiPriWr SUMMARY Working_MTD sysavg/1690 sysmin/479 sysmax/14267.
[1016 14:00:00.486292] 0x108573000 (Debug) PIO HiPriWr SUMMARY Working_MTD avglen/1686 minlen/1024 maxlen/2048.
[1016 14:00:00.486298] 0x108573000 (Debug) PIO Read SUMMARY Working_MTD cnt/7179 maxq/1.
[1016 14:00:00.486304] 0x108573000 (Debug) PIO Read SUMMARY Working_MTD avg/161 min/111 max/10382.
[1016 14:00:00.486310] 0x108573000 (Debug) PIO Read SUMMARY Working_MTD sysavg/130 sysmin/99 sysmax/10363.
[1016 14:00:00.486315] 0x108573000 (Debug) PIO Read SUMMARY Working_MTD avglen/512 minlen/512 maxlen/512.
[1016 14:00:00.486321] 0x108573000 (Debug) PIO Write SUMMARY Working_MTD cnt/8133 maxq/1.
[1016 14:00:00.486327] 0x108573000 (Debug) PIO Write SUMMARY Working_MTD avg/1933 min/482 max/209779.
[1016 14:00:00.486333] 0x108573000 (Debug) PIO Write SUMMARY Working_MTD sysavg/1919 sysmin/472 sysmax/209766.
[1016 14:00:00.486339] 0x108573000 (Debug) PIO Write SUMMARY Working_MTD avglen/2373 minlen/512 maxlen/16384.
[1016 14:00:00.486351] 0x108573000 (Debug) Cache SUMMARY [homes.metadata.hmth] attrs now/2 min/2 max/2.
[1016 14:00:00.486357] 0x108573000 (Debug) RSVD SUMMARY [homes.metadata.hmth] reserved space max requested/282 MB accounted now/0 MB
[1016 14:00:00.486366] 0x108573000 (Debug) VOP SUMMARY [homes.metadata.hmth] Getquota cnt/120 avg/23+54 min/17+29 max/28+78.
[1016 14:00:00.486373] 0x108573000 (Debug) Cache SUMMARY [editset02.metadata.hmth] attrs now/1 min/1 max/1.
[1016 14:00:00.486378] 0x108573000 (Debug) RSVD SUMMARY [editset02.metadata.hmth] reserved space max requested/282 MB accounted now/0 MB
[1016 14:00:00.486387] 0x108573000 (Debug) Cache SUMMARY [editset03.metadata.hmth] attrs now/184 min/184 max/184.
[1016 14:00:00.486392] 0x108573000 (Debug) RSVD SUMMARY [editset03.metadata.hmth] reserved space max requested/282 MB accounted now/0 MB
[1016 14:00:00.486400] 0x108573000 (Debug) Cache SUMMARY [metadata02.metadata.hmth] attrs now/27 min/27 max/27.
[1016 14:00:00.486405] 0x108573000 (Debug) RSVD SUMMARY [metadata02.metadata.hmth] reserved space max requested/282 MB accounted now/282 MB
[1016 14:00:00.486414] 0x108573000 (Debug) VOP SUMMARY [metadata02.metadata.hmth] Setattr cnt/2160 avg/24+62 min/16+34 max/108+254.
[1016 14:00:00.486422] 0x108573000 (Debug) VOP SUMMARY [metadata02.metadata.hmth] VopLookupV4 cnt/8640 avg/27+44 min/16+19 max/123+199.
[1016 14:00:00.486428] 0x108573000 (Debug) Cache SUMMARY [metadata01.metadata.hmth] attrs now/3511 min/3511 max/3511.
[1016 14:00:00.486433] 0x108573000 (Debug) RSVD SUMMARY [metadata01.metadata.hmth] reserved space max requested/282 MB accounted now/0 MB
[1016 14:00:00.486442] 0x108573000 (Debug) Cache SUMMARY [editset01.metadata.hmth] attrs now/1 min/1 max/1.
[1016 14:00:00.486447] 0x108573000 (Debug) RSVD SUMMARY [editset01.metadata.hmth] reserved space max requested/282 MB accounted now/0 MB
[1016 14:00:00.486462] 0x108573000 (Debug) TKN SUMMARY [metadata02.metadata.hmth] TokenChangeV5 cnt/2160 avg/20+18 min/10+4 max/107+166.
[1016 14:00:00.486488] 0x108573000 (Debug) NETWORK SUMMARY [homes.metadata.hmth:50228]: QueuedInputMsgs max/2 QueuedOutputBytes max/0.
[1016 14:00:00.486494] 0x108573000 (Debug) NETWORK SUMMARY [editset02.metadata.h:49167]: QueuedInputMsgs max/2 QueuedOutputBytes max/0.
[1016 14:00:00.486500] 0x108573000 (Debug) NETWORK SUMMARY [editset03.metadata.h:49167]: QueuedInputMsgs max/1 QueuedOutputBytes max/0.
[1016 14:00:00.486506] 0x108573000 (Debug) NETWORK SUMMARY [metadata02.metadata.:49199]: QueuedInputMsgs max/3 QueuedOutputBytes max/0.
[1016 14:00:00.486511] 0x108573000 (Debug) NETWORK SUMMARY [metadata01.metadata.:49963]: QueuedInputMsgs max/1 QueuedOutputBytes max/0.
[1016 14:00:00.486516] 0x108573000 (Debug) NETWORK SUMMARY [editset01.metadata.h:49167]: QueuedInputMsgs max/0 QueuedOutputBytes max/0.
[1016 14:00:00.486522] 0x108573000 (Debug) NETWORK SUMMARY [editset01.metadata.h:49167]: QueuedInputMsgs max/2 QueuedOutputBytes max/0.
[1016 14:00:00.486529] 0x108573000 (Debug) FSM RSVD SPACE current 4230 MB actual 282 MB max since boot 4230 MB since last stats 4230 MB
[1016 14:00:00.486534] 0x108573000 (Debug) FSM RSVD clients with files open for write: now 5 max 6 last 5
[1016 14:00:00.486540] 0x108573000 (Debug) FSM RSVD client dirty in slot 4 with IP 192.168.0.202
[1016 14:00:00.486545] 0x108573000 (Debug) FSM RSVD clients writing: now 5 empty 4 max 6 last 5/code

Free space:

[code]Xsanadmin> select Working
Select FSM "Working"

Created : Tue Jul 20 14:54:49 2010
Active Connections: 6
Fs Block Size : 16K
Msg Buffer Size : 4K
Disk Devices : 9
Stripe Groups : 3
Fs Blocks : 5859369856 (87.31 TB)
Fs Blocks Free : 5210703728 (77.65 TB) (88%)/code

The large files are made by Pictureready. Mostly 3 Mac's ingesting at the same time. The way Pictureready (and FCP) are ingesting is the reason why we get big amounts of extents: they just start writing. We've just started using MovieRecorder for our ingests and it does a much nicer job. It allocates the space on the volumes it needs (you have to give a max time), this results in one extent. So just like the Finder it preallocates the disk space.

Render files of a FCP project is now our main issue. We still have no nice workaround for this.

Instead of increasing the InodeExpand settings of the volume it sounds more logical to decrease the InodeExpandMin setting cause this will result in less inodes.. I will try this next time i'm there.

Karepi's picture

Wanted to add a me too, to this situation. Currently experiencing slow browsing just as described in this thread. Sometimes it will become unstuck but most of the times a client reboot will be necessary

So right now the possible causes are:

Cables.
Fragmentation.

Any other option that we might be missing?

Have a good day.

jfw's picture

I have a client with these symptoms too.

I saw it on a directory without any files with extents problems, but perhaps they hit a file with the problem without noticing it before navigating to the directory that wouldn't show up at all.

Apparently once the Finder gets in this state, you can still do an "ls" in Terminal to see the contents of the directory. But you have to reboot to get the Finder to work again (for a while anyway).

I'd like to get an update from the original poster, [b]dmastroluca/b, who hasn't posted since 3 weeks ago when he said ([url]http://www.xsanity.com/forum/viewtopic.php?p=53703#53703/url) that essentially defragmenting the whole volume (by erasing it and recopying the files) didn't help.

lucasnap's picture

Nice to know there are more sites with this problem. Maybe we can try to figure out some more.

Do you guys use very large files as well? (around 100GB and up)
And how do these files get on the Xsan? Ingest, copy, FCP, FCSvr?

We have seen the same extreme amounts of extends on files who got picked up by FCSvr (Watch folder > copy and delete) so we need to make a work around on that as well.

Karepi's picture

Very large files as well.
We get the files on the SAN through ingest.
And have some scripts moving files around from volume to volume.

Running snfsdefrag will get stuck while defragging, would seem like a latency problem the thing is if you test the whole fiber setup and network setup it seems ok.

If the Volume is clean there are no problems. It starts when the volume is around 50%.

Doing ls through the terminal works sometimes, but i bet if you do ls -ls it will get stuck.

My question is, is it a client-side problem or a server-side problem? Meaning what if the mdc's are hosted by a leopard instead of a snow leopard?

lucasnap's picture

If you run Leopard on the MDC you also need to run Leopard (or lower) on the clients.. So it certainly is a good question but it is a rhetorical one.

The MDC makes too many extends, that is the problem. So the solution will be either at the server (to make less) or at the client (read more).

lucasnap's picture

I posted a bugreport at radar.apple.com:

Quote:
25-Oct-2010 02:03 PM Lucas Nap:
Summary:
Large file (100GB+) will get holy (with holes) on an Xsan volume under some circumstances: a busy filesystem and a application who is not using correct memory allocation (such as Final Cut Pro and Quicktime).

Steps to Reproduce:
1. Setup an Xsan 2.2.1 Volume with default settings (I used the Xsan Admin for HiRes)
2. Connect multiple MacPro's with Mac OS 10.6.4 as an Xsan client
3. Ingest with Final Cut Pro 7.x 3 large movies (3 hours) ProRes (one stream per MacPro with AJA Kona card)
4. Check the files with snfsdefrag -c
5. The amount of extends will be up to 6000 or even "holy".
6. Check the files with the Finder (Command-I or dubble-click)
7. Finder will be unresponsive from this moment on, all the folder on the Xsan volume will seem empty until the Xsan client is restarted.

Expected Results:
Normal Finder behavior.

Actual Results:
Finder will crash, quicktime will crash trying to play the large movie.

Regression:
I'm using all up to date software:
- Mac OS 10.6.4 (for the MacPro's running Final Cut Studio 3)
- Mac OS 10.6.4 Server (for the Xserve's running as metadata controller and for the Xserve running Open Directory )
- Xsan 2.2.1
- 2 x Promise Vtrack E-class (firmware 10.6.2270.00)
- 2 x Promise Vtrack J-class
- 2 x Qlogic 5602 FC switch (firmware 7.4.0.16.00)
- Netgear gigabit switch for metadata
- Netgear gigabit switch for normal LAN

We did not had this problem with our old setup (differences):

- Mac OS 10.5.x
- Apple Xserve RAID
- Only one Qlogic FC switch

/quote

I hope to get a response soon but I had less productive results with radar before.

AL's picture

Hello, I'm administering a system that's having similar problems. There's separate post about our issue (http://www.xsanity.com/forum/viewtopic.php?t=10603) but I think I should continue posting here.

We don't think we have any hardware issues but big problems with Finder when browsing the file system. I've been to contact with Apple's Xsan support and currently waiting information from their engineers of the collected log files. I suspect that the problem is caused by some the following or a combination of them: high fragmentation of the volume, high number of files and used inodes or/and volume settings.

mcjunkie's picture

Today Apple released Mac OS X 10.6.5.
There is an Xsan related note in update description. (http://support.apple.com/kb/HT4249)

Xsan
Improves reliability of fibre channel connections, resolving a potential Xsan volume availability problem.

What do you think about this?
I hope it may fix issue we've experienced long time.

Thanks.

lucasnap's picture

Doesn't sound like what we need... I was hoping more in the line of "improved reliability memory allocation/fragmentation".

I've seen this in the Pro App 2010-02 update:
Final Cut Pro "Several memory fragmentation issues have been resolved."

That sounds about right. Did anyone test this already?

Karepi's picture

Still haven't tested 10.6.5. Anyone out there that has? So what's the current status of everybody? Over here still same old same old. Would moving to full fledged Stornext help?

lucasnap's picture

As promised I decreased the InodeExpandMin settings to 2 and 4 and interesting enough it did change the behavior of the volume. (default is 8 )

First with "2" the volume became pretty unresponsive, especially with writhing big files.

With the "4" setting I exported 2 150GB QT files with QT 7 Pro on two different Xsan clients. The result was extents of 7000 and 14000!!! That must be a record! But unfortunately still all the Finder's crashed trying to do something with the files.

Next test: 10.6.5 (I'm at the test environment at the end of the month, so we will have to wait).

abstractrude's picture

I have read the 10.6.5 server update has been pulled due to serious bugs.
http://reviews.cnet.com/8301-13727_7-20022832-263.html

-Trevor Carlson
THUMBWAR

lucasnap's picture

Is there someone with news on this topic?

Karepi's picture

  • bump
  • mw10dot1's picture

    On a recent new install with 10.6.5 and XSAN 2.2.1 i noticed similar behaviour with huge file fragmentation and finder becoming unusable.

    The files where being pushed to the SAN via SMB from a windows machine and for a 200GB file the extents where about 22,000

    With a rejigging of the workflow and switching off SMB. This time pulling the files via finder we where back down to 1 extent.

    Transferring the files with a ruby script caused ~1,000 extents for a 70GB file.

    Using cp from the terminal caused files to be created with 1 extent.

    I hope this is useful information in hunting down this issue.

    Michael

    jsiegers's picture

    I think I'm the first to report this on a 10.5.8 client! Yay!

    Our MDC's are 10.6.5 and also Xsan 2.2.1.

    We've had this kind of behavior as well. But on 10.5.8 it's a little bit different.
    The Finder works fine, Quicktime works fine and most of the time Quicklook as well. But as soon as you open the Final Cut Pro project it gets stuck on these movies and crashes. After that you're no longer able to see the contents of the folder in the Finder or through the Terminal on that client.

    I've had a file with more the 14443 extents for only 18GB! I guess I hold the record now lucasnap?

    We do have FCP 6 and still have to install the Pro Applications Update 2010.2 where several memory fragmentation issues have been resolved.
    Perhaps it does work after that?

    lucasnap's picture

    Wow, that is a lot for 18Gig.

    What kind of file is it? How did it get on you volume? (ingest, finder copy, render file, QT export.. etc)

    Pages