XSAN 2 Admin - Computers unreachable or offline

BenT's picture

I've got this problem at a few sites, and am wondering if anyone has a trick to reliably fix it:

- Computers in XSAN2 admin show "UNREACHABLE OR OFFLINE"
- All computers can be controlled via Screensharing, can ping on both MD and LAN IPs, and can be resolved by hostname.
- If you Force Remove the computer from the SAN, and then add it back by specifying IP, it appears as READY (until you quit XSAN Admin and relaunch).
- Computers showing as UNREACHABLE OR OFFLINE will mount the XSAN volume(s) each boot without error.
- Only some client machines are affected, but the same machines show the problem on both MDCs.

It seems to be something DNS related, as it only happens on sites where the XSAN has its own Apple DNS service, but there is also a corporate DNS server present on the same network.

I was hoping there is some host file entry or some plist tweak to make XSAN Admin stop "forgetting" how to communicate with these clients.

abstractrude's picture

You're on static addresses, right?
Clients don't really need entries in DNS....

-Trevor Carlson
THUMBWAR

BenT's picture

Yes, all the XSAN client systems have static addresses for both the public LAN interface and the Metadata interface.
It seems to be Bonjour fighting with DNS that causes the issue.

Not sure if it makes a difference, but if you look at Inspector for the clients, it always lists their metadata interface address for IP, but their public LAN DNS name.

abstractrude's picture

Are your controllers both pointing to the same DNS server, and is that server sending requests up the tree properly? Why don't you remove the entries for the clients? I don't put entries in for clients and have never had issues. That said, make sure you keep your DNS working for the controllers to resolve both ways.

-Trevor Carlson
THUMBWAR

BenT's picture

Yes, both controllers are pointing to the same DNS servers, with the same DNS search order.

I suspect the problem is some sort of conflict between the entries in the local Mac-based DNS servers and the Corporate servers.
The local servers have the zones xsan.companyname.com and metadata.companyname.com, while the corporate server just hosts companyname.com.
However there are entries for the client machines with the same hostname and IPs in both xsan.companyname.com and companyname.com zones.

I'll give your suggestion a go and remove the clients completely from the local DNS servers, and see if the behaviour changes.

JesusAli's picture

Hello, a little bit back, Aaron Freimark posted a great "DNS Worksheet" in a thread where someone seemed to be having a DNS problem.

He wanted to try to fix it with Host Config files, but the general consensus was that it was better to NAIL DOWN all DNS first because it would be easier to maintain in the long run.

Here is that Thread:
http://www.xsanity.com/forum/viewtopic.php?t=6123

Aaron's Worksheet is about three-quarters of the way down.

Good Luck. If you run the worksheet, please tell us what you find.

BenT's picture

Unfortunately this problem is still persisting at one site.
DNS servers are just the 2 MDCs (no public DNS servers in the mix at all), all tests on Aaron's worksheet return expected values (and I also ran through the client names/IPs and these return correct too).

3 of the 7 client systems show up as UNREACHABLE OR OFFLINE in XSAN Admin, yet their System Preferences values match the MDCs and all other clients.
If you remove them with FORCE REMOVE then add them back to XSAN Admin they show up as READY and can be managed as normal..... but only until you quit XSAN Admin and re-launch it, then you get the UNREACHABLE OR OFFLINE status again.

The dig results for the problem clients are correct for both the forward and reverse lookups.
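For anyone who wants to repeat the forward/reverse check across a list of clients, here is a minimal Python sketch. It is not part of Aaron's worksheet and it does not call dig: it uses the system resolver through the standard `socket` module, so it shows what the machine running it sees rather than what one particular DNS server returns. The function name `check_forward_reverse` and the example host names are made up for illustration.

```python
import socket


def check_forward_reverse(hostname,
                          forward=socket.gethostbyname,
                          reverse=socket.gethostbyaddr):
    """Resolve hostname to an IP, resolve that IP back to a name,
    and report whether the two names agree.

    The comparison ignores case and a trailing dot, because DNS
    names are case-insensitive. The forward/reverse parameters
    exist only so the lookups can be replaced in tests.
    """
    ip = forward(hostname)
    reverse_name = reverse(ip)[0]  # gethostbyaddr returns (name, aliases, addresses)
    consistent = (reverse_name.rstrip(".").lower()
                  == hostname.rstrip(".").lower())
    return ip, reverse_name, consistent


# Example use (hypothetical names; lookups raise OSError on failure):
#   for name in ["client1.xsan.companyname.com", "client2.xsan.companyname.com"]:
#       print(name, check_forward_reverse(name))
```

Run it on both MDCs and on the Mac that runs XSAN Admin; if the answers differ between machines, that narrows things down more than a single dig against one server does.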

So right now it looks like DNS setup is perfect, but XSAN Admin is still having troubles.... :(

I'm not sure if this is related, and I've seen this at multiple XSAN2 sites, but when you do any operation in XSAN Admin, the first entry in the computers list gets duplicated. After a while you get a lot of duplicates. The only way to clear this is to disconnect the XSAN Admin session, and set it up again through the wizard.

nrausch's picture

Ben,

Just curious if the client machines were cloned?
I had a very similar issue (especially with the repeated listings) a while back. I found out the client machines had been cloned from an image that had xsan client software installed. Removing all the clients and reinstalling xsan client software individually on clients solved it...

The other thing to check would be local firewalls on clients, and any switches that may be blocking necessary ports for xsan admin...

JesusAli's picture

YES! I had the same problem in my setup when two labs were imaged from a master that had Xsan activated before duplication!

The culprit is a particular file inside the Libraries/FileSystems/Xsan/config/ directory (that path is probably wrong, I'm away from an xsan right now).

The file's name is "UUID" (it stands for "universally unique identifier"). You can trash it and restart the computer and a new UUID file will be generated.

Please let us know if that was the culprit!

BenT's picture

2 of the clients showing the problem are running server and were not cloned.
All the other clients were from a master SOE - but the SOE did NOT have XSAN software installed (it was a post-SOE step).

I'll run some tests trashing the UUID file - thanks for the suggestion.

peterk's picture

Hi Ben,

Did you find a fix for your problem? I had a similar problem today: two fresh XServer installs (10.6.2, XS 2.2.1), DNS records on a separate machine.
The first weird thing I noticed was that when I ran changeip -checkhostname, changeip told me that the host name "mainMDC" doesn't match the DNS name, which is "mainmdc". I checked DNS and there was a mainMDC record, and I am pretty sure I never used the string mainmdc anywhere (mainMDC is the convention I have used for the last few years). Then I fixed it (with changeip), checked again, and it looked OK. I also noticed that the installer had installed and activated DNS on both MDCs (this never happened on 10.5). So I stopped them, restarted, checked the DNS names, launched XSAN Admin, and saw the same problem. I am pretty sure my DNS is working correctly.
I had no time to search further, but I'm curious if you have some news on this issue.

Thank you.

Peter
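A side note on the "mainMDC" vs "mainmdc" report: DNS names are case-insensitive, so a difference in letter case alone is not a DNS mismatch, even if a tool that compares strings exactly flags it. The small Python sketch below separates the two situations. The function name `compare_names` is made up for illustration, and this is not how changeip itself does its comparison.

```python
def compare_names(configured, dns_name):
    """Classify how a configured host name relates to the name DNS returns.

    Returns "match" for identical names, "case-only" when they differ
    only in letter case, and "different" otherwise. A trailing dot
    (fully qualified form) is ignored on both sides.
    """
    a = configured.rstrip(".")
    b = dns_name.rstrip(".")
    if a == b:
        return "match"
    if a.lower() == b.lower():
        return "case-only"
    return "different"


print(compare_names("mainMDC", "mainmdc"))
```

A "case-only" result points at a cosmetic difference in how the name was entered, while "different" means the host name and the DNS record really disagree.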

BenT's picture

Unfortunately not yet.
When restarting XSAN Admin, I still get a few of the clients reporting UNREACHABLE OR OFFLINE, and one of the clients gets multiple entries in the COMPUTERS list.

I have to FORCE REMOVE the problem machines, then re-add them, then XSAN Admin works for a time with all machines showing READY.

I'm suspecting some sort of UUID file problem, but I've yet to go through every machine to strip and re-create the file.

JesusAli's picture

After you Force Remove a station, I recommend running the latest version of Xsan Uninstaller on it. Completely wipe out the Xsan file system on the station.

When you download the Xsan 2.2 install CD, you can right click the installer and get to the Xsan installer pkg. Then also download the Xsan File System Update pkg. You can push them both out to station(s) with Apple Remote Desktop.

btw, you may be able to run the Uninstaller that way too.

cthomasquinlan's picture

It sounds simplistic, but I've resolved this exact issue before by removing all keychain access entries for mdcs and clients; if they were entered improperly at some point, or have changed, this could resolve the issue. I've had dns working, all pinging correctly, and authenticating initially but upon relaunch, some came back unreachable or offline. Close xsan admin, remove all mdc and client entries in Keychain Access, relaunch xsan admin and reauthenticate the machines, creating new entries.

d4corp's picture

BenT please check UUID files. This _is_ the answer. And cthomasquinlan, you should check your UUID files as well, just issue a 'cat' command in ARD and compare the results.
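Once the 'cat' output has been collected from each machine, comparing it by eye gets error-prone with more than a handful of clients. Here is a minimal Python sketch that does the comparison. It assumes you have already gathered each machine's uuid file contents into a dictionary keyed by host name; the function name `find_duplicate_uuids`, the host names, and the values are all made up. Treating the values as case-insensitive is also an assumption on my part.

```python
from collections import defaultdict


def find_duplicate_uuids(uuid_by_host):
    """Return {uuid: [hosts]} for every uuid value shared by 2+ hosts.

    uuid_by_host maps a host name to the raw text of its uuid file.
    Surrounding whitespace is stripped and case is ignored
    (an assumption) before comparing.
    """
    hosts_by_uuid = defaultdict(list)
    for host, raw in uuid_by_host.items():
        hosts_by_uuid[raw.strip().lower()].append(host)
    return {uuid: sorted(hosts)
            for uuid, hosts in hosts_by_uuid.items()
            if len(hosts) > 1}


# Hypothetical data pasted from the collected 'cat' output:
collected = {
    "edit1": "AAAA-1111\n",
    "edit2": "BBBB-2222\n",
    "edit3": "aaaa-1111\n",
}
print(find_duplicate_uuids(collected))
```

Any host that appears in the result shares its uuid with another machine and is a candidate for the uninstall and reinstall described in this thread.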

Cloning workstations is a very bad habit in a xsan and/or open directory environment.

If you don't have an open directory server, make sure that you uninstall xsan before cloning (that way everything in /Library/Filesystems/Xsan will be deleted, including uuid). Otherwise you could write a script that puts a random uuid in the file, but I wouldn't call this good practice.

If you do have an open directory server, either use a script to prepare the system (check those included in DeployStudio) or use DeployStudio directly to deploy or clone them. The main things to check are LocalKDC (LKDC) and system keychain.

Uninstalling Xsan with ARD works fine, just make sure you're using the latest version of Xsan Uninstaller, as JesusAli wrote.

BenT's picture

Yes, we have seen problems before when cloning machines with XSAN installed.
Our standard practice for building SOEs is to NOT install XSAN, but simply put the XSAN installers on the desktop of the image, so you can install it first boot.

This is what was done with the machines that are showing this odd problem with the XSAN Admin listings, so they all had unique UUIDs created when XSAN 2 was installed (and the 2.2 update applied).

I've even seen installing the XSAN filesystem via ARD not work 100% (ARD reports it completed successfully, but the new machine doesn't show up automatically in the ADD COMPUTER wizard of XSAN Admin), so I always install this directly from the machine.

I'll do some poking around clearing the UUIDs and Keychains at this customer site on Monday and hopefully get this issue resolved once and for all.

Thanks for all the suggestions.

MacCyclone's picture

Hi Ben,

Did you manage to resolve the issue? I have the same problem at one of my client sites. Hope you can share it with us.

Regards.

BenT's picture

Unfortunately it's not gone away completely. There are still 2 client systems at 2 separate sites which both throw this problem up periodically.

In each case, the workaround is to:
- Launch XSAN Admin
- Choose Disconnect
- Connect to an existing SAN in the wizard
- Force Remove the machines that won't authenticate or are not responding
- Add Computers to put all the machines back in the console
- Everything works until you quit and relaunch XSAN Admin

Still looking for a permanent fix - uninstalling and re-installing XSAN from the client machines is not enough.

dibyendu's picture

In my experience, managing the SAN with XSAN Admin from a single controller generally resolves these authentication issues. Otherwise, use the ':hostname' command to edit or resolve the host names of the servers and the machines.

Managing an XSAN from different controllers leaves a .plist on every controller; these may collide and result in an unstable XSAN Admin GUI.

Follow this and you can have a good sleep.

Cheers :D

proton's picture

Ben, are you running Parallels or similar software which creates its own network interfaces on the workstations? We constantly have a similar problem here. Sometimes, for some strange reason, Xsan Admin starts to see the clients through the virtual Parallels interface (IP 10.32.x.x), and this of course confuses it when the virtual machine on the workstation is down. I have managed to fix this partially by editing the Xsan Admin configuration file by hand and changing the IP values in it to the correct ones. But if anyone knows a permanent solution, feel free to post your suggestions :)
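If you end up checking recorded client addresses by hand, a quick way to spot ones that have drifted onto a virtual interface is to test them against the networks you expect. The Python sketch below uses the standard `ipaddress` module. It does not read the Xsan Admin configuration file (its format is not described in this thread), so the recorded addresses are supplied by you; the function name `addresses_outside` and the subnets and host names are made up.

```python
import ipaddress


def addresses_outside(expected_networks, recorded):
    """Return {host: ip} for recorded addresses in none of the expected networks.

    expected_networks: CIDR strings for the LAN and metadata networks.
    recorded: mapping of host name to the IP address recorded for it.
    """
    networks = [ipaddress.ip_network(n) for n in expected_networks]
    return {host: ip
            for host, ip in recorded.items()
            if not any(ipaddress.ip_address(ip) in net for net in networks)}


# Hypothetical site: public LAN and metadata subnets, two clients.
expected = ["192.168.10.0/24", "10.0.1.0/24"]
recorded = {"edit1": "10.0.1.21", "edit2": "10.32.4.2"}
print(addresses_outside(expected, recorded))
```

Here "edit2" would be reported because 10.32.4.2 is in neither expected subnet, which matches the Parallels-style address described above.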

Kaku Ito's picture

After unknowingly updating the OS to 10.7.3, I had problems mounting the Xsan 2 volume on the clients (which I had updated to 10.7.3). Luckily I didn't update the servers, so the volume was still mounted on both servers. However, I didn't know that at the beginning, and I tried too hard to solve the problem; I then started cloning the system volume from the one that was working.

Some folks mentioned that cloning is not good, and while I agree, I think I found that cloning the system volume for use on multiple machines is okay if you clone the system before you set up Xsan. Xsan seems to register a unique ID for each machine, so if you clone the system after setting up Xsan, you can't seem to initialize it no matter what you do. If a machine is running Xsan with another machine's ID, that will prevent either the original or the copy from working. If you build both systems from scratch, they will work. I'm no expert in SAN, but that is what I found from two weeks of fighting.

alanscs's picture

I realize this thread is a bit old, but for those still struggling, the only solution I have found is to temporarily disable all network connections except the Metadata network connection (on the system you are trying to manage, not the Metadata Controller)