Volume not mounting - LUNs with exclamation points

arls1481:

So after weathering a full failure on one of my LUNs (well, I haven't really gotten through it, but all my LUNs appear to be back online as far as I can tell from WebPAM on all my VTraks), my volumes will not mount and I need some guidance, please!

The long and short of it is that 1 of the 2 drives in my metadata & journal LUN failed at the same time as 4 of the 6 drives in a data LUN died. With only 4 GR spares available for that drawer, I think you all know what I was faced with. I have a VTrak Ex610fD system with 4 E-class and 4 J-class chassis, 4 GR spares per E&J pair, all combined into 2 Xsan volumes.
I've attached a screen grab of my LUNs for reference if it helps.

Basically, VTrak1 (my top E&J) had the failures. I successfully recovered the metadata/journal LUN, but the data LUN went offline and I had no choice but to replace the bad disks and initialize it as a new LUN, duplicating the naming and settings that were there before. Promise support was of little help and told me my volume was lost because of that, which I don't buy, but that's why I'm turning to you all for help.
I can download and attach whatever other logs or images you need, or try to explain more, but what I need is help figuring out how to get my volume back online.
I think the exclamation point icons mean that Xsan can't see those LUNs, but I have all 4 subsystems online and OK, so I don't know why I'm stuck here.
Also, I realize that everyone sets things up differently, but is it conceivable that my LUN schema provides for failure in this manner? In other words, can Xsan deal with a single LUN failing in a multi-LUN volume?
Hopefully that makes some sense!
And thanks for anyone's time and effort in advance!


xsanguy:

Xsan (absent exotic HA solutions) is not designed to be resilient to permanently losing LUNs.

In this type of scenario, you are going to lose ALL (usually) or at least SOME (rarely, but hopefully; I've recovered data when others said it was impossible) of your data, depending largely on your data type and your stripe group / affinity settings.

First though, are you *absolutely sure* you really lost 4 of 6 drives in one of your LUNs? It's extremely rare to genuinely lose that many at once. It may be possible to force them back online, unless you've re-used them to create the new LUN (in which case they're probably gone for good).

We'd need to take a deeper look at your system, but you can't just nuke a LUN, make another one, and slot it back into place. Did you name it the same? That's just one of half a dozen hurdles.

If you have a backup, it's going to be easier to nuke the volume and restore.

If not, shoot me a PM. There are a few things to try depending on your configuration.

arls1481:

I'm positive that I lost all the drives because, in my ignorance, I wiped the failed drives as I replaced them, thinking that the rebuild running for each failed drive would complete without issue. It did not. The rebuild BGA ran for 48+ hours, but after looking behind the curtain, it was only hung up and idling; the rebuilds never actually completed, and eventually the LUN went offline on its own because of that. I tried forcing the drives back online too and couldn't make that work either.

When I re-created the LUN, I named and configured it exactly as it had been prior to faulting out.

I do have backups of my mission-critical data, but there were some tertiary file stores that I would rather not lose if I don't have to. So I'm trying to give it the best shot I can here.

What other information can I furnish to give better details?
cvadmin logs?

csanchez:

You can try deactivating the data stripe groups that have suffered LUN failures. If the volume is able to start, you should then be able to mount it again. All files with extents on the downed stripe groups will be inaccessible.
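If it helps to see what "deactivating" means concretely: the usual approach is to mark the damaged stripe group Down in the volume's configuration file before starting the volume. Below is a minimal sketch of that edit against a throwaway sample config in /tmp. The volume name SanVol, the stripe group names, and the settings are placeholder assumptions, not taken from your setup; Xsan keeps the real file under /Library/Filesystems/Xsan/config/, and depending on version it may be in the XML .cfgx flavor instead of this flat .cfg style.

```shell
# Throwaway sample config in the flat StorNext/Xsan .cfg style.
# SanVol, MetadataAndJournal, and Data1 are hypothetical names.
cat > /tmp/SanVol.cfg <<'EOF'
[StripeGroup MetadataAndJournal]
Status Up
MetaData Yes
Journal Yes

[StripeGroup Data1]
Status Up
MetaData No
Journal No
EOF

# Mark only the damaged data stripe group Down so the volume can
# start without it; the metadata/journal group stays Up.
sed -i.bak '/\[StripeGroup Data1\]/,/^$/ s/^Status Up/Status Down/' /tmp/SanVol.cfg

# Show each stripe group header with its resulting Status line.
grep -B1 '^Status' /tmp/SanVol.cfg
```

On a real system, follow the linked article rather than this sketch; the procedure there also involves checking the volume (cvfsck) before starting it again, and as noted above, files with extents on the downed group will be inaccessible.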

See this article:


Gerard:

From your screenshot, I would double check the configuration files of your MDCs and make sure they match up.

I had a similar problem a few weeks back. I added a second MDC into an environment, which only had one MDC at the time. When I failed over to this new MDC, it failed back onto the first one. Looking at Xsan Admin, I had the same image (duplicate LUNs with yellow warning signs).

I have an Alliance support contract, so I gave an Xsan engineer a call. Comparing the config files between the two MDCs, they didn't match.

1. Failed over to the first MDC.
2. Copied its config file onto the second MDC.
3. Restarted the second MDC and compared the config files again; this time they matched.
4. Disconnected from Xsan Admin, reconnected, and the yellow signs/duplicates were gone.
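A shell sketch of that comparison step, for anyone who wants to script it. Everything here is a stand-in: the /tmp/mdc1 and /tmp/mdc2 directories simulate the two MDCs' copies of the config (really /Library/Filesystems/Xsan/config/<Volume>.cfg on each machine, which you'd fetch over ssh), and the settings are made up.

```shell
# Simulate the two MDCs' copies of the volume config.
mkdir -p /tmp/mdc1 /tmp/mdc2
printf 'FsBlockSize 16K\nJournalSize 16M\n' > /tmp/mdc1/SanVol.cfg
printf 'FsBlockSize 16K\nJournalSize 64M\n' > /tmp/mdc2/SanVol.cfg

# A mismatch like this is what produces duplicate, yellow-flagged
# LUNs in Xsan Admin; copy the authoritative file over the stale one.
if ! cmp -s /tmp/mdc1/SanVol.cfg /tmp/mdc2/SanVol.cfg; then
    echo "configs differ, syncing from mdc1"
    cp /tmp/mdc1/SanVol.cfg /tmp/mdc2/SanVol.cfg
fi

cmp -s /tmp/mdc1/SanVol.cfg /tmp/mdc2/SanVol.cfg && echo "configs match"
```

On the real systems you'd then restart Xsan on the corrected MDC, as in the steps above.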

Hope this helps.

arls1481:

csanchez wrote:
You can try deactivating the data stripe groups that have suffered LUN failures. If the volume is able to start, you should then be able to mount it again. All files with extents on the downed stripe groups will be inaccessible.

See this article:

This article got my volume back online. No Google search got me to this; you did! TY TY TY!!!! 8)
So I am probably facing some tertiary data loss and/or fragmentation, but I can deal with that. What I need to work on now is how to heal those two LUNs that are being so pesky. They show a status of OK in WebPAM, so I am a little confused. I've run consistency checks against them via WebPAM and they pass. What next?

Sirsloth:

Since you rebuilt the LUNs from scratch, you will probably need to cvlabel them again, as the recreated LUNs have lost their StorNext partition labels. Run sudo cvlabel -l in Terminal to check. If you do need to relabel the new LUNs, be sure to name them the same as before. Data on these LUNs is gone for good at this stage, so you will need to re-initialize the volume or use these LUNs in a separate volume... whichever way forward you prefer.
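For anyone following this later, the check-and-relabel cycle looks roughly like the commands below. This is a sketch, not a verbatim session: the label file path is arbitrary, and relabeling is destructive to anything on the target LUNs, so double-check the device-to-label mapping before applying.

```
# List the LUNs that currently carry StorNext/Xsan labels; the
# recreated LUNs should be conspicuously absent here.
sudo cvlabel -l

# Dump the current label list to a file, then edit it so the new
# LUNs get exactly the label names the old ones had...
sudo cvlabel -c > /tmp/labels.txt

# ...and apply the edited file (destructive on the target LUNs).
sudo cvlabel /tmp/labels.txt
```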