Some LUNs invisible to some clients but not others?

thomasb:

Hello,

Has anybody here seen a case where some LUNs are visible to some Mac Pro clients, but not to others on the same Qlogic 5600 switch? There is no zoning enabled.

We have checked the port properties and tried switching between different physical ports, but whatever we do, a group of 8 ActiveStorage LUNs (two Xsan volumes) is not visible to a new Mac Pro 10.7.5 client. The new client can see the other storage LUNs available on the same FC fabric, and existing 10.6.8 and 10.7.5 clients on the same 5600 switch can see the 8 ActiveStorage LUNs that the new client can't. We have tried reinstalling the new client with 10.6.8, but that doesn't help.

I found these Qlogic kbase articles, but our issue does not seem to be a random one:

Initiators are Randomly Losing LUNs After Reboot
https://support.qlogic.com/app/answers/detail/a_id/306/kw/306

Windows XP and MAC OSx Servers Do Not See LUNs
https://support.qlogic.com/app/answers/detail/a_id/368/kw/368

Apple Systems Losing Access To LUNs After Activating a Zoning Change
https://support.qlogic.com/app/answers/detail/a_id/611/kw/611

We have a pretty big FC fabric, and have been running without zoning for quite a while. Have we reached some kind of limit, maybe? That would still not explain why it's the same set of 8 LUNs that are invisible/unreachable to this new client.

Are there any limits on how many targets and initiators can see each other in the same FC fabric? How many targets and initiators can you run in a Qlogic FC fabric without zoning?

abstractrude:

Zone it and see if your issue goes away. You 100% should always zone.

-Trevor Carlson
THUMBWAR

thomasb:

Thanks, yeah, I struggle to find any other logical reason for these strange visibility issues. I was just wondering if anybody has seen something similar.

I'm surprised it has worked fine with 17 switches, 108 LUNs, 100 clients and 7 MDCs for so long. The issues didn't start until switch 17 was added to the fabric (and yes, ImplicitHZ is set to false on all switches). The fabric is a mix of 5800 and 5600 switches.

Zoning will be implemented in June. I will for sure post back here with the results.

abstractrude:

Every host should be in its own zone (a soft zone in the Qlogic world).

-Trevor Carlson
THUMBWAR

thomasb:

We found the issue: a faulty 5800 switch. When we shut everything down and turned it back on, one of the 5800 switches would not boot properly. Once we replaced that switch, everything worked fine again, and all LUNs were visible across all 19 switches.

There were no signs of trouble in the logs of the faulty switch, which made the error quite hard to find without doing a complete shutdown.

abstractrude:

Did you ever zone? Hope so. Good troubleshooting, though.

-Trevor Carlson
THUMBWAR

thomasb:

We implemented zoning like this for our big Xsan MultiSAN environment (7 MDCs, 108 LUNs, 15 volumes, 100 clients):

1 zone per switch for initiators (excluding storage ports)
1 alias for all storage ports
The storage alias is a member of every switch zone
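On a Qlogic SANbox-class switch, the scheme above could be set up from the CLI along these lines. This is a sketch from memory, not verbatim syntax, and all zone/alias names and WWNs here are invented, so verify against your switch's CLI guide before using it:

```
admin start
zoning edit
alias create Storage_Ports
alias add Storage_Ports 21:00:00:aa:bb:cc:dd:01
zone create Switch01_Hosts
zone add Switch01_Hosts 10:00:00:aa:bb:cc:dd:01
zone add Switch01_Hosts Storage_Ports
zoneset add Main_ZoneSet Switch01_Hosts
zoning save
zoneset activate Main_ZoneSet
```

Repeating the zone create/add steps once per switch, with the storage alias added to each zone, gives the layout in the three points above.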

We were going to implement Single Initiator Zoning, but with 224 zones, each with the storage alias as a member, we exceeded the number of zone members allowed, which is 10,000.
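To make the arithmetic concrete, here is a rough sketch of why single-initiator zoning can blow past the member limit once the storage alias expands inside every zone. Only the 224 zones and the 10,000-member limit come from the post; the storage port count is an assumption for illustration:

```python
# Rough sketch: why single-initiator zoning can exceed a fabric-wide
# zone-member limit when a storage alias is expanded inside every zone.

initiators = 224        # one zone per initiator (from the post)
member_limit = 10_000   # fabric-wide limit on zone members (from the post)
storage_ports = 45      # ASSUMED number of ports behind the storage alias

# Each single-initiator zone holds 1 initiator plus the expanded alias.
members_per_zone = 1 + storage_ports
total_members = initiators * members_per_zone

print(total_members)                  # 10304
print(total_members > member_limit)   # True
```

With anything around 44 or more storage ports behind the alias, 224 zones already exceed the limit, which matches what we saw.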

Any comments, thoughts or advice?

Everything seems to be running fine after we managed to find and replace the faulty 5800 switch.

We have done some tests in our Xsan lab environment, making zoning changes live (changing zones and adding/removing ports from the storage alias), and everything seems to run fine without traffic disruption when editing, saving and re-activating a zone set. The only thing we see on the client side is a "rescan" message when new storage ports are added to the alias. I played back video from the storage during the zoning changes, and there were no dropped frames or other visible traffic disruption.

Example of log messages on the Mac client when ports are added to the storage alias, which is a member of each zone:
27.06.13 18:32:34,000 kernel[0]: FusionMPT: Notification = 6 (Rescan) for SCSI Domain = 0
27.06.13 18:32:34,000 kernel[0]: FusionMPT: Notification = 6 (Rescan) for SCSI Domain = 1

thomasb:

Hi,

Just wanted to update this thread and tell you that we ended up disabling zoning again because of issues with replicating zoning changes across all 19 switches.

Also, the faulty switch that caused the LUN visibility issues had to be repaired because of an uncommon problem with its DDR memory (not a user-caused problem).

We have also found that there is a CLI shutdown command, which I would highly recommend using instead of just cutting the power when you need to turn off a switch. Even Qlogic support say they rarely use the shutdown command, but after having three 5800 switches fail in less than a year, we have become quite cautious about it.
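For anyone finding this later, if I remember the SANbox CLI right, the graceful sequence is roughly the following; verify against your switch's CLI guide before relying on it:

```
admin start
shutdown
```

Then cut power only after the switch reports that it has finished shutting down.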