Xsan 3.1 created two duplicate LUNs

Andrew Allen:

Hi,

We've got a system with three different SANs on it. We added the third SAN a few weeks ago and finally got it working. The third SAN (SAN3) has three LUNs, which I defined in the Infortrend web interface for the RAID box. The LUN names are FSN_SAN3_LUN1, FSN_SAN3_LUN2, and FSN_SAN3_LUN3.

LUN1 is a RAID 1 across two drives for the Xsan metadata and journaling partition.

LUN2 is a RAID 6 across 11 drives for 36 TB total, and I'm using it for video.

LUN3 is another RAID 6 across 11 drives for 36 TB total, also used for video.

My first question, now that I think of it, is this: how bad is it to use two LUNs for video data in Xsan instead of the four it asks for? SAN1 has four LUNs for video data, but SAN2 was built with just two LUNs and has been in use with video data for four years straight with no failures. Are there issues with ignoring (or being ignorant of) Xsan's preference for four video-data LUNs?

My second question (the subject of this thread) is this: we've been using SAN3 for about a week now. The editors have already put roughly 25 TB of material on it and it functions fine. However, for some reason Xsan Admin is reporting duplicate LUN names for this volume. Here's what Console is spitting at me roughly every 50 seconds:

Xsan Admin: ERROR : Error getting list of volumes: kOfflineError (0)
Xsan Admin: ERROR : Error getting list of volumes: kOfflineError (0)
Xsan Admin: ERROR: Duplicate LUN label: FSN_SAN3_LUN2
Xsan Admin: ERROR: Duplicate LUN label: FSN_SAN3_LUN3
Xsan Admin: ERROR: Duplicate LUN label: FSN_SAN3_LUN2
Xsan Admin: ERROR: Duplicate LUN label: FSN_SAN3_LUN3
Xsan Admin: ERROR: Duplicate LUN label: FSN_SAN3_LUN2
Xsan Admin: ERROR: Duplicate LUN label: FSN_SAN3_LUN3
Xsan Admin: ERROR: Duplicate LUN label: FSN_SAN3_LUN2
Xsan Admin: ERROR: Duplicate LUN label: FSN_SAN3_LUN3
Xsan Admin: ERROR: Duplicate LUN label: FSN_SAN3_LUN2
Xsan Admin: ERROR: Duplicate LUN label: FSN_SAN3_LUN3

Sometimes it will also include one to three more "kOfflineError (0)" messages.

When I go into the LUNs section on the left of Xsan Admin, there are indeed two duplicate LUNs with these names: FSN_SAN3_LUN2 and FSN_SAN3_LUN3 each appear twice in the list. One of the FSN_SAN3_LUN2 entries has a yellow exclamation mark over it and is flagged under Errors & Warnings; it is correctly listed as belonging to the Video storage pool and to the volume FSN_SAN3. The other LUN2 entry has no error flag and has nothing in its Storage Pool and Volume columns. Both are listed as 36 TB in size.

The exact same scenario is the case for FSN_SAN3_LUN3.

The LUN2 and LUN3 entries that do NOT have anything in the Storage Pool and Volume columns appear in the "Unused LUNs" tab.

My questions are: A) where did these come from, and B) can I delete them? If yes to B), how? Can I just right-click > Remove LUN Label? Is that a problematic action? Can I do it while the SANs are running without messing something up?

Probably a simple question, but I don't want to muck anything up, and I'm still very new to Xsan. Did Xsan create these LUNs to try to fulfill its preference for four video partitions? I wouldn't think so, because SAN2 has the same LUN format (LUN1 for metadata and journaling, then LUN2 and LUN3 as equal-sized 18 TB video storage pools).

Andrew Allen:

Here's an update: 

I decided to try removing those LUNs that claimed they weren't attached to anything. Removing the LUN label just removes its name; I wasn't able to remove the LUNs themselves from the SAN. However, when I changed their names, DISASTER struck.

We have three metadata controllers: MDC1, MDC2, and ShareServer (soon to be relabeled MDC3). These "unused" LUNs were only appearing on ShareServer/MDC3.

The third metadata controller has a corrupt directory or something: the two LUNs it thinks are unused and duplicately named are the actual LUNs that all the other metadata controllers are looking at. When I changed the names, SAN3 became inaccessible. Thankfully, by the grace of Apple programmers, I was able to rename the LUNs through the other metadata controllers (even though most LUNs are greyed out for renaming). Now we can see SAN3 again, though there are still lots of errors I'll report on later in other threads.
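(For the record, I did all of the renaming through Xsan Admin. From what I've read since, the same label operations can be done from Terminal with cvlabel; I haven't tried it and I assume it's just as dangerous on a live volume, so treat this as a rough sketch from the man page rather than a recipe. The /tmp path is just an example:)

sudo cvlabel -c > /tmp/current_labels   # dump the existing labels to a text file
# edit /tmp/current_labels so each device carries the label it's supposed to have
sudo cvlabel /tmp/current_labels        # re-apply the edited labels to the listed devices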

So the problem seems to be that MDC3 (ShareServer) is messed up. It has these extra LUNs, and it also thinks that a client which used to be a controller is still a controller: it has two entries for the same machine, one appearing as a client and one appearing as a controller.

MDC3 is hosting SAN3 at the moment. Do I need to find some time to fail SAN3 over to a different metadata controller and remove MDC3 from the SAN? It needs its "memory" wiped. How should I go about doing that?

Johnny_0_o:

Well, it sounds like you have a multipath issue or something like it.

On your RAID enclosure, I would check and compare how the LUN mapping is done. Usually, if there are two RAID controllers, they will each provide visibility to the same LUN, and that will cause the duplicate. Since each controller generates its own ID for the LUNs, they will appear as different LUNs in Xsan Admin but are actually the same.

I can't find the documents to back this up at the moment, but I would strongly suggest checking your different enclosures and comparing how the LUN mapping is set up on those two other Xsan volumes.
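A quick way to sanity-check this from the Mac side (I'm going from memory on the exact output, so take it as a rough sketch) is to list what each controller actually sees:

sudo cvlabel -l
# each line shows a raw disk device plus the Xsan label it carries, roughly:
#   /dev/rdisk5 [vendor/product] acfs "FSN_SAN3_LUN2" ...
#   /dev/rdisk9 [vendor/product] acfs "FSN_SAN3_LUN2" ...
# the same label on two different rdisk devices usually means two controller paths
# are being presented as two separate LUNs instead of one multipathed LUN

sudo cvadmin -e paths
# shows the disk paths Xsan has discovered over fibre channel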

 

Good luck.

Andrew Allen:

Johnny_0_o wrote:

"Well, it sounds like you have a multipath issue or something like it. [...] I would strongly suggest checking your different enclosures and comparing how the LUN mapping is set up on those two other Xsan volumes."

If I understand you correctly, you're saying that in the RAID interface for each SAN I should check and make sure that the LUN labels are different on each RAID, correct?

When I installed SAN3, I did check the LUN mapping on the other two RAIDs. The existing SAN1 and SAN2 were inconsistent in how they were configured. SAN1 was set up as Xsan specifically asked: a small metadata and journaling partition, four partitions for video data, and then two partitions for "other". SAN2 simply had two partitions: one for video data (despite Xsan asking for four) and the metadata and journaling partition. I set up SAN3 in the same general format as SAN2: one partition for all the data and one small partition for metadata. I did not reuse the LUN labels that SAN2 had defined. I'll need to go back and check exactly how that is all configured, however.

 

I resolved the pathing issue after removing MDC3/ShareServer from the SAN. After deleting the Xsan configuration files on that machine and re-adding it to the SAN, we haven't had the duplicate LUN issue come up again.
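For anyone who hits the same thing: the per-machine Xsan configuration lives under /Library/Filesystems/Xsan/config/ (at least on the Xsan 2/3 generation we're running; newer releases may keep it elsewhere), so that's the directory I was working in. The file names below are from memory, so verify against your own system before touching anything:

ls /Library/Filesystems/Xsan/config/
# typically contains fsnameservers, automount.plist, config.plist and a .cfg file per hosted volume
# from what I can tell, these are rewritten when a machine is removed from and re-added to the SAN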

matx:

Are you staring at Xsan Admin or looking at cvadmin? Many bad things happen when a) you stare at something long enough and try to fix a working SAN, and b) you use Xsan Admin to determine what's wrong with your SAN. Check cvadmin before making any rash decision like deleting LUNs that you think are duplicates. Use cvlabel -l to determine what LUNs exist on each client and controller.
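For example (substitute your own volume name, and double-check the man pages; I'm typing these from memory):

sudo cvlabel -l            # list every labeled LUN this particular host can see
sudo cvadmin -e fsmlist    # list the FSMs (volume services) this controller knows about
sudo cvadmin               # interactive: 'select FSN_SAN3' then 'show long' lists the stripe groups and the disks behind them

If cvlabel and cvadmin agree across your controllers, the problem is in Xsan Admin's picture of the SAN rather than in the SAN itself.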

Questions: are you using dual fibre connections? Redundant fibre channel switches? What fibre channel card? Atto or LSI / Apple? What drivers? 

Andrew Allen:

Thanks for your comments and questions.

I was using Xsan Admin. I haven't used cvadmin before; I'll have to do some more research on how to use this command-line tool.
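From the man page, it looks like the failover I was asking about earlier can also be driven from cvadmin. I haven't tried any of this on our system yet, so these are notes to myself rather than a recipe (the volume name is ours):

sudo cvadmin                      # the opening listing shows each volume and marks the active FSM with an asterisk
sudo cvadmin -e "fail FSN_SAN3"   # asks the active FSM for FSN_SAN3 to hand over to a standby metadata controller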

We are using multichannel fibre cables on the MDCs, the clients, and the SANs themselves. We have just a single QLogic fibre switch, zoned so that each client port and MDC port are isolated from each other; each port is grouped with the three SANs and nothing else.

We're using ATTO fibre cards; I believe they are the 8 Gb Celerity model that supports Rorke RAIDs. Rorke isn't in the industry any more, but Rorke's RAIDs were rebadged Infortrend boxes. All three SANs are Rorke/Infortrend RAIDs: two 24-bay SANs and one 16-bay SAN with two 16-bay JBODs.

This SAN was originally configured 4-5 years ago. I don't think the ATTO drivers have been updated since then on most of the machines that are still running Snow Leopard. We're cautiously advancing towards Mavericks; we've updated an ingest system and a Final Cut workstation to Mavericks to see how they perform. I have not updated the ATTO drivers on the Final Cut workstation. I've been following the "if it works, don't fix it" mantra (with the exception mentioned in this thread). Should we be upgrading the ATTO drivers as well?