Xsanity Sanity for Apple's Xsan and Final Cut Server.
  
My recent experience with Xsan2 and smaller file sizes
memblin
Been around the blocks
Joined: 22 Apr 2009
Posts: 20

Posted: Wed Mar 03, 2010 4:21 pm    Post subject: My recent experience with Xsan2 and smaller file sizes

About two years ago I started work for a new company that already had an
existing Xsan built running one volume. They chose the technology due to
the need for a clustered file system and the price point being very very right.
I've been pulling my hair out for weeks trying to configure a new and faster
volume on a new set of equipment. I got some good results and haven't
found a single place with the information posted as a for sure "this is what
I did, this is how it worked out" scenario.

I see everywhere that Xsan is much more robust when used with larger files
and generally tunes itself for sequential reads / writes but that performance
suffers when we move into the realm of general file server utilization.

The performance we were seeing on this original volume was between 35
and 60MB/s depending on what we were copying around. The read speeds
when just reading a file were not bad enough to impact daily use but were
horrible when doing maintenance and moving file sets around. That said,
the performance on large files was phenomenal: anything over 1Gig ran at
600MB/s or more when reading, 450MB/s when writing.

Our file sizes are generally from 500k to 5MB, sometimes larger sometimes
smaller but most directories have over 8,000 files in them of that size.

I was able to work the settings over I don't know how many times to try
and find the best performance for our data set. Keep in mind I'm not a
professional Apple Xsan or Promise guy, and the support from Apple was almost
completely lacking. I even asked the simplest of questions: if my
file sizes are smaller, what should I click after I click the + to add a new
volume? I was told they don't do that kind of support, even with a paid
Xsan agreement for every Xsan seat we have on the Xsan.

I have seen so much back and forth and contradicting information I finally
just threw caution to the wind and while some of what I've done may be
considered dangerous, it seems to be working for now. If it bombs at a later
date I'll correct myself and warn folks. heh

After much trial and error I was able to get our speeds up from a
sustained 40MB/s - 60MB/s on the old volume to a sustained
90MB/s to 110MB/s on the new volume.
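
For anyone wanting to reproduce this kind of small-file measurement, here is a minimal sketch of a copy benchmark. It is not the method used in this thread (the post does not say how the numbers were taken); the function names, file count and file size are all assumptions picked for illustration. It writes a directory of small files, copies them one by one, and reports MB/s.

```python
import os
import shutil
import tempfile
import time


def make_files(directory, count, size_bytes):
    """Create `count` files of `size_bytes` random bytes each in `directory`."""
    for i in range(count):
        path = os.path.join(directory, f"file_{i:05d}.bin")
        with open(path, "wb") as handle:
            handle.write(os.urandom(size_bytes))


def copy_throughput_mb_s(src_dir, dst_dir):
    """Copy every file from src_dir to dst_dir; return (file_count, MB/s)."""
    names = sorted(os.listdir(src_dir))
    total_bytes = 0
    start = time.perf_counter()
    for name in names:
        src = os.path.join(src_dir, name)
        shutil.copyfile(src, os.path.join(dst_dir, name))
        total_bytes += os.path.getsize(src)
    elapsed = time.perf_counter() - start
    # Decimal megabytes (10^6 bytes); guard against a zero-length timer read.
    return len(names), total_bytes / 1e6 / max(elapsed, 1e-9)


if __name__ == "__main__":
    # Hypothetical test set: 200 files of 512 KiB, near the low end of the
    # 500k to 5MB range described in the post.
    with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
        make_files(src, count=200, size_bytes=512 * 1024)
        files, rate = copy_throughput_mb_s(src, dst)
        print(f"copied {files} files at {rate:.1f} MB/s")
```

To test a SAN volume rather than local temp space, point the two directories at paths on the volumes in question. Freshly written files are likely to be served from the OS cache, so a realistic run needs a data set much larger than RAM.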

The original volume is running on the following hardware / settings.

2x VTrak E610F Chassis fully loaded 16x 750GB SATA
4x QLogic SANBox 5600 FC Switches
2x OS X (Leopard) MDC
2x OS X (Leopard) Clients
6x RedHat StorNext Clients

The new equipment purchased

1x VTrak E610f Chassis Loaded w/ 1TB SATA drives
1x VTrak J610f Chassis Loaded w/ 1TB SATA drives

We connected these into the existing fiber mesh, set the zoning, and moved
on to configuration testing.

VTrak array configurations used were identical to the original set of VTraks.

E-Class
1x 2 Disk RAID 1 Metadata Array
2x Global Revertible Spares
2x 6 Disk RAID 5 Arrays

J-Class
2x 7 Disk RAID 5 Arrays
2x Global Revertible Spares

Not an "Apple approved" script from their site but it's the one the original
consultant from Apple used on the original setup.

With the exact same settings, I got the exact same performance as the
original volume. After two weeks of messing with array load outs, Apple
approved scripts, and begging for support at Apple which I finally did get
a minor amount of help from, this is where we landed.

(same as above and original volume config)
E-Class
1x 2 Disk RAID 1 Metadata Array
2x Global Revertible Spares
2x 6 Disk RAID 5 Arrays

J-Class
2x 7 Disk RAID 5 Arrays
2x Global Revertible Spares

I changed all the read and write policies, including the one for the metadata
array, to:
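
As a quick sanity check on this layout, the sketch below adds up the drive slots and the raw data capacity. It assumes 1TB drives throughout (as on the new enclosures) and counts decimal terabytes before any file system overhead; the dictionary and function names are made up for the example.

```python
DRIVE_TB = 1.0  # assumption: every slot holds a 1TB SATA drive

# Per-enclosure layout as listed in the post: disk counts per array.
LAYOUT = {
    "E-Class": {"raid1": [2], "raid5": [6, 6], "spares": 2},
    "J-Class": {"raid1": [], "raid5": [7, 7], "spares": 2},
}


def drives_used(enclosure):
    """Total drive slots consumed by arrays plus spares."""
    return sum(enclosure["raid1"]) + sum(enclosure["raid5"]) + enclosure["spares"]


def raid5_data_tb(enclosure, drive_tb=DRIVE_TB):
    """RAID 5 gives up one drive's worth of capacity per array to parity."""
    return sum((disks - 1) * drive_tb for disks in enclosure["raid5"])


if __name__ == "__main__":
    total = 0.0
    for name, enclosure in LAYOUT.items():
        data_tb = raid5_data_tb(enclosure)
        total += data_tb
        print(f"{name}: {drives_used(enclosure)} drives, {data_tb:.0f} TB data")
    print(f"Data LUN total: {total:.0f} TB raw")
```

Both enclosures come out to 16 slots, which matches a fully loaded 16-bay chassis.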

readpolicy=readcache,writepolicy=writeback

We have RAID batteries as well as in-rack power backup, plus a nice
generator, so I'm not worried about losing power and corrupting things.

I originally left Forced ReadAhead enabled, but went back after the array
had finished initializing and unchecked Forced ReadAhead on the
controllers. I've read in one post that this will burn out your disks faster or
something like that, but after reading the 'Optimizing_VTrak_1.0.pdf' file I
found over on Promise's site, I've decided this probably isn't true. Again,
I'm not a professional, but ForcedReadAhead is mentioned in that document
as a method for optimizing sequential data types.

So I did it. That document is also what made me decide to change all of
the arrays' read policy to ReadCache instead of ReadAhead.

I set up the volume with the General File Server defaults but then changed
the advanced settings just a smidge. Turned on Spotlight, turned off native
extended attributes, set block size to 4K blocks. We don't use ACLs and are
very careful to make sure any server that is a part of the Xsan has the
exact same UID / GID groups, so ACLs were disabled as well. Not awesome,
but it works for what we do.

I labeled the LUNs after that and dropped the MDC LUN into the Metadata
and Journal area, then dropped the other four Data LUNs into the Data area.

Then I got a little goofy, set the stripe breadth on the MDC affinity / storage
pool (not sure what the proper terminology is) to 32. The Data affinity /
storage pool I set to 256. Let the volume finish building and we've got
better performance than I've ever seen for our particular data set.
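
To put those stripe breadth numbers in bytes, here is a small worked example. It assumes stripe breadth is counted in file system blocks (so it scales with the 4K block size above) and that a file starts on a stripe boundary, which real allocation does not guarantee. The function names are illustrative only.

```python
BLOCK_SIZE = 4 * 1024  # 4K file system blocks, as set on the new volume


def stripe_bytes(breadth_blocks, block_size=BLOCK_SIZE):
    """Bytes written to one LUN before moving on to the next LUN."""
    return breadth_blocks * block_size


def luns_touched(file_bytes, breadth_blocks, lun_count, block_size=BLOCK_SIZE):
    """Rough count of LUNs a file spans, assuming it starts on a stripe boundary."""
    segment = stripe_bytes(breadth_blocks, block_size)
    segments = -(-file_bytes // segment)  # ceiling division
    return min(segments, lun_count)


if __name__ == "__main__":
    print("metadata pool stripe:", stripe_bytes(32) // 1024, "KiB")
    print("data pool stripe:", stripe_bytes(256) // 1024, "KiB")
    for size in (500_000, 5_000_000):
        print(f"{size} byte file spans about {luns_touched(size, 256, 4)} of 4 data LUNs")
```

Under those assumptions, a file at the small end of our range stays on a single data LUN, while one at the large end is spread across all four.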

Copying from the old volume to the new one using rsync is rocking 80MB/s
overall for a copy of 140GB or so. We are still migrating data over and plan
to keep an rsync'd copy between the two volumes until we are sure about
the stability of the new volume.
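
For a sense of what those rates mean in wall-clock time, here is a back-of-the-envelope calculation. It assumes decimal units (1 GB = 1000 MB) and a perfectly steady rate, neither of which a real rsync run will match exactly.

```python
def transfer_seconds(size_gb, rate_mb_s):
    """Seconds to move size_gb at a steady rate_mb_s (decimal units)."""
    return size_gb * 1000 / rate_mb_s


def fmt(seconds):
    """Format a duration as minutes and seconds."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes}m{secs:02d}s"


if __name__ == "__main__":
    # 140GB is the copy size from the post; 40 is the old volume's low end.
    for rate in (40, 80):
        print(f"140 GB at {rate} MB/s: {fmt(transfer_seconds(140, rate))}")
```

At 80MB/s the 140GB copy takes a bit under half an hour; at the old volume's 40MB/s it would take about twice that.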

I'm sure this would be terrible for anyone whose data set is made up of larger
files than ours. I'm not sure where the performance would start to degrade
since I couldn't really test that. I'm also not sure yet about the stability of
this new setup.

I did learn one other thing, and that is any Xsan client with a .dmg mounted
can present that .dmg file in the LUN list in Xsan Admin. The EDC.dmg showed
up as an 82k LUN... now THAT one messed with my head a bit. Luckily one
of the techs at Apple pointed me in that direction. Ejected the DMG and
tada, no more phantom funky LUN.

I tried every combo of drive/array configs, block sizes, and stripe breadths I
could think of and logged them all so I wouldn't get a repeat. The three
things that seemed to make the biggest impact on performance were disabling
ForcedReadAhead on the VTrak controllers, changing the read policy to
ReadCache instead of ReadAhead, and dropping the stripe breadth on the
metadata affinity / storage pool down from 256 to 32.
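
One way to keep a log like that without repeating a combination is to generate the test matrix up front. The sketch below is only an illustration, not the log actually kept here: it enumerates the three settings named above (using just the values mentioned in this post) and writes a CSV with an empty column to fill in after each run. The column and function names are assumptions.

```python
import csv
import io
import itertools

# Only the values mentioned in the post; extend the lists to test more.
READ_POLICIES = ["ReadAhead", "ReadCache"]
FORCED_READ_AHEAD = [True, False]
METADATA_STRIPE_BREADTHS = [256, 32]

KEYS = ("read_policy", "forced_read_ahead", "metadata_stripe_breadth")


def build_matrix():
    """Every combination of the settings, each as a dict, with no repeats."""
    combos = itertools.product(
        READ_POLICIES, FORCED_READ_AHEAD, METADATA_STRIPE_BREADTHS
    )
    return [dict(zip(KEYS, combo)) for combo in combos]


def write_log(rows, stream):
    """Write the matrix as CSV with a blank result column per run."""
    writer = csv.DictWriter(stream, fieldnames=list(KEYS) + ["measured_mb_s"])
    writer.writeheader()
    for row in rows:
        writer.writerow({**row, "measured_mb_s": ""})


if __name__ == "__main__":
    buffer = io.StringIO()
    write_log(build_matrix(), buffer)
    print(buffer.getvalue())
```

Testing one variable at a time against a fixed baseline is cheaper, but the full matrix also catches settings that only help in combination.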

Anyway, hope that helps someone out there. Mileage may vary and all that,
only test on non-production systems, etc. I'm no professional, so for all I
know what we did may burn up those drives just like the other post I read
said it would, or cause other stability issues that I haven't even thought about
yet.

Now if I could only get my hands on two sets of E+J+J 16x chassis
with SAS drives in them. This thing would REALLY fly. Too bad it would take that
many to make up the same amount of storage space we have in an E+J.

bradbraddington
partially protected
Joined: 13 Apr 2009
Posts: 5

Posted: Wed Mar 03, 2010 10:47 pm

I'm fairly sure dropping the number of blocks on your metadata pool helped, but in reality what you are seeing is the increased performance from the lack of readahead. When you disable that you no longer allow the controllers and drives to utilize the cache. Instead the data is immediate on the drives, this forces the drives to work much much harder since they are not allowed to fall back on what is in cache. In the short term this will show increased results because the drives don't have much to work with, but run your tests again when your drive is over 50% full and you will see horrific results, then run the tests again in 1 year and you will have URE that is amazing. (URE = in rough terms, simultaneous dead drives)

Instead of turning off read ahead, what you should do is attempt to tune read ahead. Some drive manufacturers have auto tuning policies and some have hand tuning. I am not familiar with what you have but I'm sure it exists. Have you tried to call Promise on this? Don't ask about Xsan, just ask about their read ahead tuning parameters that best help small file transfers.

-Brad

abstractrude
Xsan Master
Joined: 13 Mar 2008
Posts: 881

Posted: Wed Mar 03, 2010 11:17 pm

I'm sorry Apple support was so bad. Sometimes you get someone that will help you, sometimes you don't. Apple may not say it in documentation, but lots of places that need better small file performance are using two RAID 1 LUNs in their metadata pool. Also, tuning block size and such is important. I think the 4k block size is a good idea. I am experimenting with SAS drives for metadata too.

The fact is StorNext is designed for tiered storage like most datacenter filesystems. Apple sells it as a simple clustered file system, but you can do so much with it. You can have some storage that's fast I/O and good for small files, you can have storage for large files and high bandwidth, you can have storage that's always slow but is cheap and always available, and it's all in the same environment. The beauty of StorNext is all the problems that people have. It's super configurable, what a shitty problem to have! I say run with it. Learn from the mistakes and build the best SAN you can.

It's high performance and very powerful, and with those powers come the downsides. My biggest mistake was listening to people that said "oh, it's ok to just do that, this new update or some new RAID will take care of it." Build your pools and storage for the type of data they are handling. The best data rate I have ever gotten on small file sizes was 60-70 MB a second. That was severe tuning! I tuned the hell out of every possible variable and was able to move image sequences around fast.

One very important note: if you are moving data around your Xsan, use cvcp. You will freak when you get 160 megabytes a second on small files. After all the pain that is Xsan, I love the product.

memblin
Been around the blocks
Joined: 22 Apr 2009
Posts: 20

Posted: Thu Mar 04, 2010 12:14 pm

I appreciate the recommendations Brad, I've got a call in with Promise now to
ask them directly about all that for a clear answer. I went ahead and did some
reading on the unrecoverable read errors (URE) and how they come about and
that is just plain scary. I'd never heard of such a thing before I read your
post. At least now I can be on the lookout for it and prepared to handle it when
the inevitable eventually happens in the coming months / years. *grin*

Abstractrude, cvcp did step it up a bit. I didn't quite hit 160MB/s but anything
is better than the 40MB/s we were getting on that old volume. I'd love to get
my hands on some more documentation for really tweaking out the available
options on Xsan or even ditching it and going straight STORnext. heh

abstractrude
Xsan Master
Joined: 13 Mar 2008
Posts: 881

Posted: Thu Mar 04, 2010 1:38 pm

FYI i use XSAN controllers.

memblin
Been around the blocks
Joined: 22 Apr 2009
Posts: 20

Posted: Thu Mar 04, 2010 2:40 pm

This just in direct from Promise support.

Disabling the Forced ReadAhead option that is available in the controller
settings can increase performance for smaller non-sequential reads and
is better left on for larger sequential reads.

This option does not affect the actual disk caching that is built into the
individual drives themselves. So turning this feature off or leaving it on
will not impact the longevity of a drive in the chassis. It only affects the
performance characteristics of the arrays and caching at the controller
level.

Changing ReadAhead to ReadCache on the logical drives built within the
arrays also does not modify the individual drives' built-in caching
functionality. It only affects the manner in which data is read from the logical
drives and how the cache is utilized, and also cannot affect the longevity of
the individual drives in the chassis.

I gather from my discussion with the guys at Promise that with ReadAhead
turned on in the logical drive settings the unit attempts to predict what
you will be reading next, and puts that in the cache. With large sequentially
accessed files this is GREAT for performance, with our data set of smaller
very randomly accessed files it is better to change this setting on the logical
drives from ReadAhead to ReadCache.
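
The trade-off can be seen in a toy simulation. This is not a model of the VTrak firmware, just a small LRU cache with an optional "fetch the next N blocks on a miss" rule; the cache size, read-ahead depth and block counts are all made-up numbers. It compares hit rate and blocks pulled from disk for a sequential trace and a random one.

```python
import random
from collections import OrderedDict


def simulate(accesses, cache_blocks, read_ahead):
    """Return (hit_rate, disk_block_reads) for a trace of block numbers."""
    cache = OrderedDict()  # LRU order: least recently used entry first
    hits = 0
    disk_reads = 0

    def load(block):
        nonlocal disk_reads
        if block in cache:
            cache.move_to_end(block)
            return
        disk_reads += 1
        cache[block] = True
        if len(cache) > cache_blocks:
            cache.popitem(last=False)  # evict the least recently used block

    for block in accesses:
        if block in cache:
            hits += 1
            cache.move_to_end(block)
        else:
            load(block)
            # On a miss, speculatively fetch the following blocks as well.
            for nxt in range(block + 1, block + 1 + read_ahead):
                load(nxt)
    return hits / len(accesses), disk_reads


if __name__ == "__main__":
    rng = random.Random(0)
    sequential = list(range(10_000))
    scattered = [rng.randrange(1_000_000) for _ in range(10_000)]
    for label, trace in (("sequential", sequential), ("random", scattered)):
        for depth in (0, 8):
            rate, reads = simulate(trace, cache_blocks=64, read_ahead=depth)
            print(f"{label:10s} read_ahead={depth}: hit rate {rate:.2f}, disk reads {reads}")
```

In this model, prefetching turns most sequential reads into cache hits at almost no extra disk cost, while on the random trace it multiplies the disk reads without improving the hit rate.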

I know there are apparently two schools of thought on this one, and while I
am wary of the warnings about turning off these features that I have seen
posted in a couple of places, including Brad's post above, I think I'm just
going to have to risk it for performance's sake based on Promise's response
to the question. The guy did put me on hold to go ask his lead to make sure
he was giving the right answer. I know the Promise guys roam these forums,
so any more information I'm sure would be appreciated.

I decided to hunt down a link for the Promise pdf that originally had me
run down this road in the first place just in case someone wants to give it
a look. I had a hard time finding it myself and just lucked into the right
Google combination that led me to it.

http://www.promise.com/apple/Optimizing_VTrak_1.0.pdf

Will post more as results become available or if catastrophe strikes.

SS
partially protected
Joined: 25 Feb 2010
Posts: 6

Posted: Thu Mar 04, 2010 7:39 pm

Hi memblin,
Indeed, the information offered up by our technical support was correct.

Think of the controller forced read ahead option as a form of aggressive prefetching done at the controller level. Logical drive read-ahead is still prefetching, but not as aggressive. When you enable both, large-block, sequential IO performance is increased but small block IO performance suffers.

At no time do the settings above affect the physical drive caching policies. However, we do offer the option to enable/disable physical drive settings (although it's not recommended). The feature in WebPam is located under Enclosures > Enclosure X > Physical Drives > Global Settings Tab.

I understand your feeling regarding URE's. As drive density has increased exponentially in the past few years, unrecoverable read errors have been an area of concern. Our Vtraks have implemented a feature to recover from such errors. When a URE is encountered during a rebuild, we effectively log the bad block into a "Check Table", skip this block, and then continue with the rebuild. A read to this LBA by the host will be returned as Medium Error. A write to that LBA will clear the entry. This will allow the file system to remain intact and the data accessible. For more information regarding this topic, please see this knowledge base article:

http://kb.promise.com/KnowledgebaseArticle10144.aspx
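
The Check Table behaviour described above can be sketched as a toy model. This is only an illustration of the three rules as stated (log and skip on rebuild, Medium Error on read, cleared on write), not Promise's implementation; the class and method names are hypothetical.

```python
class MediumError(Exception):
    """Stands in for the Medium Error returned to the host."""


class LogicalDrive:
    def __init__(self):
        self.blocks = {}          # LBA -> data
        self.check_table = set()  # LBAs that could not be rebuilt

    def rebuild(self, lbas, unreadable):
        """Rebuild each LBA; log unreadable ones and carry on instead of aborting."""
        rebuilt = 0
        for lba in lbas:
            if lba in unreadable:
                self.check_table.add(lba)  # bad block noted, then skipped
                continue
            rebuilt += 1
        return rebuilt

    def read(self, lba):
        if lba in self.check_table:
            raise MediumError(f"LBA {lba} is in the check table")
        return self.blocks.get(lba, b"")

    def write(self, lba, data):
        self.blocks[lba] = data
        self.check_table.discard(lba)  # fresh data clears the entry


if __name__ == "__main__":
    drive = LogicalDrive()
    print("rebuilt", drive.rebuild(range(10), unreadable={7}), "of 10 blocks")
    try:
        drive.read(7)
    except MediumError as err:
        print("read failed:", err)
    drive.write(7, b"new data")
    print("after write:", drive.read(7))
```

The point of the design is that one unreadable block costs one block, not the whole rebuild.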

As abstractrude mentioned earlier, you should definitely consider building a metadata LUN comprising 2*RAID1 LUNs. It will allow the journal to be put on the first LUN and the metadata to be striped across both. You will get a considerable boost in small file performance.
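
As a rough picture of what striping metadata across two LUNs means, the sketch below maps block numbers to LUNs round-robin by stripe breadth. It is a simplification under stated assumptions: real Xsan allocation is more involved, journal placement is not modelled, and the breadth of 32 is borrowed from memblin's metadata pool setting rather than from anything recommended here.

```python
from collections import Counter


def lun_for_block(block, stripe_breadth, lun_count):
    """Round-robin striping: `stripe_breadth` blocks per LUN, then the next LUN."""
    return (block // stripe_breadth) % lun_count


if __name__ == "__main__":
    counts = Counter(lun_for_block(b, stripe_breadth=32, lun_count=2) for b in range(4096))
    for lun in sorted(counts):
        print(f"LUN {lun}: {counts[lun]} of 4096 blocks")
```

With two LUNs, each carries half of the blocks, which is where the extra small-file headroom would come from if the LUNs sit on different controllers.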

Feel free to reply, PM, or email me - stephen(dot)shyn(at)promise(dot)com - if you have any questions or concerns. Thanks.

-Steve

daver
fully protected
Joined: 29 Jun 2008
Posts: 10

Posted: Thu Mar 04, 2010 10:40 pm

Hi Steve,

I'm intrigued by the concept of building a metadata LUN comprising 2*RAID1 LUNs. Will this make any significant difference in the case where the metadata logical volume is set to writeback caching? I can see that it could help where the metadata is set to write-through, but in the writeback case the disk write accesses are buffered anyway. Or am I wrong - it could be that the gains come from using two controllers rather than from more LUNs.

Perhaps I should test this...

Thanks,
Dave.

abstractrude
Xsan Master
Joined: 13 Mar 2008
Posts: 881

Posted: Fri Mar 05, 2010 12:21 am

don't forget block size and stripe breadth!
also, I got this 2 lun metadata thing from WWDC last year from people at the top...

memblin
Been around the blocks
Joined: 22 Apr 2009
Posts: 20

Posted: Fri Mar 05, 2010 9:59 am

Steve, some great info there and I appreciate your post. The info you have on
the URE and how VTrak systems can recover from them is very good to hear.

I've seen some posts here and articles sprinkled about the web about
separating metadata from journal data. I haven't gone that route yet myself
because I wasn't sure about the kind of performance increase I might be
looking at.

We are still doing the data migration from the old volume to the new, once
it is loaded up and moved in production we're going to let it cook for a while
before I blow away the other volume and rebuild it for other uses.

When I do get that opportunity I would like to try the 2*RAID1 LUNs for
metadata and journal. I don't suppose you guys have a document with some
direction for getting that going? The only part I'm wondering about is how
to separate the metadata and journal data in the manner you described in a
safe and reliable manner.

I'm also a bit curious about how spare drives affect the outcome of running
into a disk failure and a URE on recovery. The KB article leaves spares
out of the equation for simplicity.

SS
partially protected
Joined: 25 Feb 2010
Posts: 6

Posted: Fri Mar 05, 2010 12:55 pm

daver wrote:
Hi Steve,

I'm intrigued by the concept of building a metadata LUN comprising 2*RAID1 LUNs. Will this make any significant difference in the case where the metadata logical volume is set to writeback caching? I can see that it could help where the metadata is set to write-through, but in the writeback case the disk write accesses are buffered anyway. Or am I wrong - it could be that the gains come from using two controllers rather than from more LUNs.

Perhaps I should test this...

Thanks,
Dave.


I haven't experimented with metadata LUNs set to write-back cache, so I can't speak to this; I'd imagine it would make less of an impact if this was the case. However - you are correct - the load from the metadata IO will be spread across two controllers (assuming that you have the LUN affinities set correctly) and two LUNs, thereby increasing performance.

abstractrude wrote:
don't forget block size and stripe breadth!


Agreed!


@ memblin:
No problem, glad to be of help.

I'm a bit confused with your statement here:

memblin wrote:
I'm also a bit curious about how spare drives affect the outcome of running into a disk failure and a URE on recovery. The KB article leaves spares out of the equation for simplicity.


UREs generally occur during a rebuild to a spare. We consider this a "double-fault" scenario. The first "fault" is that a member drive of the array died. The second "fault" would be the URE on a specific LBA located on a parity block contained within the source drive(s). RAID6 is more tolerant of UREs since the specific LBA can be recovered from another parity block if the first is unreadable.
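
To get a feel for the odds in that double-fault scenario, here is a rough estimate of the chance of hitting at least one URE while reading every surviving drive during a rebuild. It rests on loud assumptions that do not come from this thread: an error rate of one unrecoverable error per 10^14 bits read (a figure commonly quoted on SATA drive spec sheets), independent errors, and 1TB (10^12 byte) drives read in full. Real drives may behave better or worse.

```python
import math

URE_PER_BIT = 1e-14  # assumption: 1 unrecoverable error per 10^14 bits read
DRIVE_BYTES = 1e12   # assumption: 1TB drives, read end to end


def rebuild_ure_probability(surviving_drives, drive_bytes=DRIVE_BYTES, rate=URE_PER_BIT):
    """P(at least one URE) when every surviving drive is read in full."""
    bits_read = surviving_drives * drive_bytes * 8
    # 1 - (1 - rate) ** bits_read, computed stably for a tiny rate.
    return -math.expm1(bits_read * math.log1p(-rate))


if __name__ == "__main__":
    for label, survivors in (("2-disk RAID 1", 1), ("6-disk RAID 5", 5), ("7-disk RAID 5", 6)):
        p = rebuild_ure_probability(survivors)
        print(f"{label}: {p:.1%} chance of a URE during rebuild")
```

The estimate grows with the number of drives that must be read, so wider RAID 5 arrays carry more rebuild risk than narrower ones under this model.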

memblin wrote:
When I do get that opportunity I would like to try the 2*RAID1 LUNs for metadata and journal. I don't suppose you guys have a document with some direction for getting that going? The only part I'm wondering about is how to separate the metadata and journal data in the manner you described in a safe and reliable manner.


This is simple. Create two RAID1's during initial setup. When dragging LUNs into the Metadata pool, simply drag both of these RAID1's into it and you're good to go. Obviously, you'll have to do away with the two spares in the top chassis for this. You'll still have the protection of the spares located in the JBOD as they can span enclosures. This definitely strays from Apple's recommended practices but then again, this particular configuration isn't exactly posted on their website either. If you'd like help scripting the LUN creation when you decide to try this, get in touch with me and I'll whip something up for you.

BTW, I work with one of our largest XSAN deployments utilizing Promise storage (1+ PETABytes) and they utilize this 2*RAID1 Metadata configuration with much success.

memblin
Been around the blocks
Joined: 22 Apr 2009
Posts: 20

Posted: Fri Mar 05, 2010 2:24 pm

SS wrote:

@ memblin:
No problem, glad to be of help.

I'm a bit confused with your statement here:

memblin wrote:
I'm also a bit curious about how spare drives affect the outcome of running into a disk failure and a URE on recovery. The KB article leaves spares out of the equation for simplicity.


UREs generally occur during a rebuild to a spare. We consider this a "double-fault" scenario. The first "fault" is that a member drive of the array died. The second "fault" would be the URE on a specific LBA located on a parity block contained within the source drive(s). RAID6 is more tolerant of UREs since the specific LBA can be recovered from another parity block if the first is unreadable.


The KB article mentions pulling a failed drive and replacing with a new drive
which then begins rebuilding at which point if the conditions are just right
a very rare error, the unrecoverable read error, can occur. Then it goes into
how Vtrak can minimize the damage if one does run into a URE.

I guess what I'm asking is, does having the global revertible hot spares in a
chassis avoid the troubles of a URE a little better, or are they just like dropping in a
new drive, minus physically swapping the new disk in, since it's already on
standby as a spare? Now that I've spelled it out I realize I can probably find
the answer to this in a manual some place. heh

-Mem

SS
partially protected
Joined: 25 Feb 2010
Posts: 6

Posted: Fri Mar 05, 2010 3:07 pm

Hi Mem,
Thanks for clarifying. The KB article assumes you don't have any hot spares installed. A rebuild is a rebuild - be it to a cold or hot spare. :)

-Steve

aaron
Site Admin
Joined: 19 Mar 2005
Posts: 407

Posted: Fri Mar 05, 2010 3:23 pm

Quote:
are they just like dropping in a new drive minus the physically swapping the new disk in since it's already on standby as a spare


Yes. My understanding is that a spare on standby has the same URE risks as popping in a new drive. (Or perhaps just slightly lower, since there are likely fewer drives in your LUNs.)
_________________
Aaron Freimark
http://www.tekserve.com/vcard/af.vcf

staze
Been around the blocks
Joined: 15 Oct 2007
Posts: 25

Posted: Mon Mar 08, 2010 1:23 am

SS wrote:

This is simple. Create two RAID1's during initial setup. When dragging LUNs into the Metadata pool, simply drag both of these RAID1's into it and you're good to go. Obviously, you'll have to do away with the two spares in the top chassis for this. You'll still have the protection of the spares located in the JBOD as they can span enclosures. This definitely strays from Apple's recommended practices but then again, this particular configuration isn't exactly posted on their website either. If you'd like help scripting the LUN creation when you decide to try this, get in touch with me and I'll whip something up for you.


Since I haven't tried this, or seen anything about it, I'm curious. Isn't this basically creating a RAID 10 array for the MD, or does Xsan just see that there are two RAID1's and stick the journal on one and the MD on the other?

Has there been any testing with using SAS drives for MD rather than the standard SATA drives? Of course, you'd want a J class to be 13 SATAs and 3 SAS drives to allow a spare for each type of LUN (Data and MD, respectively). But 15k SAS drives seem like they'd be faster than the dual RAID1 idea, and putting MD on a 1TB RAID1 seems kinda overkill. =P
All times are GMT - 5 Hours