Starting and Stopping Xsan in 1.4.2

MattG:

There's been a lot of buzz about the effectiveness of the Xsan 1.4.2 update. For the systems we manage, it's the most robust version of the software we've encountered, on both Tiger and Leopard. Console reporting is far more detailed, and MDC failover reliability has greatly improved.

That's not to say there aren't some nasty bugs. Most notably, machines will drop out of both the Setup and Clients lists in Xsan Admin, even though the SAN is functioning perfectly well on those machines. The annoying part is when you want to use the Xsan Admin GUI to unmount the volume from a particular client, but can't, because you don't see it!

When possible, a reboot of the client usually does the trick. Sometimes re-entering the serial number and resaving the configuration in the Setup tab will get a client you see in the Setup tab to show up in the Clients tab.

But I'm writing this to explain a new way to do something that used to be easy before 1.4.2 and has since been taken away from us. Sometimes restarting the Xsan processes, without rebooting, is all that's needed.

In 1.4.1 and earlier, we accomplished this by issuing the following command in the Terminal:

[code]sudo /System/Library/StartupItems/acfs/acfs restart[/code]

The hostconfig file on Tiger machines (located in /etc) used to contain a laundry list of system-level processes to launch at startup. In it we had an "ACFS=-YES-" entry that told the machine to run a startup script from StartupItems. The acfs script, inside the acfs folder in StartupItems, launched the fsmpm process that runs on all nodes; fsmpm, in turn, launched fsm if the node was a controller. Using the restart switch on the acfs script simply killed these processes and started them up again, a handy way to refresh the Xsan software without rebooting the machine.
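
For reference, here's roughly what the relevant pieces looked like (paths as described above; the exact contents of hostconfig varied from machine to machine):

[code]
# /etc/hostconfig (Tiger) -- the entry that enabled Xsan at startup:
ACFS=-YES-

# which told SystemStarter to run the startup script:
#   /System/Library/StartupItems/acfs/acfs start
#     -> launches fsmpm (all nodes)
#        -> launches fsm (controllers only)
[/code]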

Well, those times are gone. In Leopard, the StartupItems folder is starkly empty (except for code that hasn't been rewritten for Leopard). The /etc/hostconfig file even carries the ominous comment "# This file is going away" on its first line. And even on Tiger, with 1.4.2 installed, the acfs script that used to be there is gone.

What has replaced all this?

launchd

And for good reason. launchd is the first process the OS launches, and it is basically responsible for running and maintaining the state of every other process. In 1.4.2, launchd launches a new process called xsand.

xsand replaces the cumbersome acfs startup script. It launches very early in the boot process and knows to launch fsmpm, which in turn launches fsm if the machine is a controller. And unlike the acfs script, which launched fsmpm and called it a day, xsand also monitors the fsmpm and fsm processes and relaunches them if they crash.

Because of this, the Xsan processes are far more reliable, and xsand was written from the ground up to be more verbose about what it is doing.
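
If you want to see this chain for yourself, a couple of quick checks (standard commands, nothing Xsan-specific):

[code]
# Confirm launchd knows about the Xsan job
sudo launchctl list | grep -i xsan

# Confirm the processes xsand supervises are actually running
ps axww | grep -E 'xsand|fsmpm|fsm' | grep -v grep
[/code]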

But what if we want to kill the fsmpm, fsm and xsand processes properly to give Xsan a swift kick without rebooting the machine?

All we need to do is "unload" xsand from launchd's list of jobs. We do this with launchd's companion command, launchctl.

So to stop Xsan on a machine, we would type:

[code]sudo launchctl unload -w /System/Library/LaunchDaemons/com.apple.xsan.plist[/code]

The -w switch writes the change to disk, ensuring the Xsan job will not reload, even after a reboot, until we want it to. So, to get things started again, we issue:

[code]sudo launchctl load -w /System/Library/LaunchDaemons/com.apple.xsan.plist[/code]

The only difference in the second command is "load". We want to get the Xsan software back in the good graces of launchd.

Now we have a reliable way of restarting Xsan on a machine without rebooting.
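
If you do this often, the two commands are easy to wrap in a small script (just a sketch; adjust the sleep to taste):

[code]
#!/bin/sh
# restart-xsan.sh -- restart the Xsan processes without a reboot
launchctl unload -w /System/Library/LaunchDaemons/com.apple.xsan.plist
sleep 5   # give fsmpm/fsm a moment to exit cleanly
launchctl load -w /System/Library/LaunchDaemons/com.apple.xsan.plist
[/code]

Run it with sudo, and remember the caution below about active MDCs.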

Just one caution: we shouldn't issue this command on an active MDC. That would yield basically the same result as cutting power to it. Though I suppose if you wanted to test failover, this is one way you could do it.

Please write back with corrections or successes!

MattG

francisyo:

Hey MattG,

Thank you very much for sharing this one. To be frank, this is the first time I've learned there is a Unix command that can restart Xsan without restarting the metadata controller (I'm not a Unix guy; all I know is the basics). I immediately tried the command on my standby metadata controller (the failover, not the active one) and it was successful. Thanks to you! Before, if there was a problem with my Xsan and I felt I should restart it, I would restart both metadata controllers (the active one and the failover), following the Xsan rules: unmount the clients first, then stop the volumes, and so on.

Before this gets too long, I have one question; I don't know if it's related to restarting Xsan. Before I ran the command there were no errors in the logs, but after I restarted, I found a bunch of errors in my MDC logs. Here they are:

[code]
Feb 1 11:04:35 MDC servermgrd: xsan: index_of_fsmvol_named: SNFS Generic Error
Feb 1 11:04:35 MDC servermgrd: xsan: [3285/7D5ABB0] ERROR: index_of_fsmvol_named(VOLUME NAME): SNAdmin_NSListFsm(0) returned -1, error Broken pipe
Feb 1 11:04:35 MDC servermgrd: xsan: index_of_fsmvol_named: SNFS Generic Error
Feb 1 11:04:35 MDC servermgrd: xsan: [3285/7D5ABB0] ERROR: index_of_fsmvol_named(VOLUME NAME): SNAdmin_NSListFsm(0) returned -1, error Broken pipe
Feb 1 11:04:35 MDC servermgrd: xsan: index_of_fsmvol_named: SNFS Generic Error
Feb 1 11:04:35 MDC servermgrd: xsan: [3285/7D5ABB0] ERROR: index_of_fsmvol_named(VOLUME NAME): SNAdmin_NSListFsm(0) returned -1, error Broken pipe
Feb 1 11:04:35 MDC servermgrd: xsan: index_of_fsmvol_named: SNFS Generic Error
Feb 1 11:04:35 canopus servermgrd: xsan: [3285/7D5ABB0] ERROR: index_of_fsmvol_named(VOLUME NAME): SNAdmin_NSListFsm(0) returned -1, error Broken pipe
[/code]

Right now I don't know what this error message means. Can you help me, please?

Thanks

keithkoby:

Hey Matt,

I'm glad to hear about your success with 1.4.2. There are a large number of posters here (and in other places on the internets) who have had issues with 1.4.2 crashing and controllers [i]not[/i] failing over successfully. It's nice to know about the fix to get the clients back in the Xsan Admin lists (something that plagued my setup), but I'd also like to know about the special sauce :wink: you have that's making your 1.4.2 systems stable. Any advice?

Thanks Matt!
Keith

MattG:

There is no special sauce.

If folks are having issues with 1.4.2, it most probably has to do with improper configuration.

What we really should be discussing is why specific issues are happening, rather than making blanket statements about whether a version works or not.

colbru:

Hi MattG

Thank you very much for this very informative posting.

I'm still running 1.4.1 (never change a running system, as they say).

I've tried the acfs restart on two of my clients. It restarted the service fine, but my previously mounted volumes did not remount automatically. (On a reboot, the volumes remount as they should.)

I need to remount them from Xsan Admin.

Is this "normal"?

Here are the last couple of lines that acfs restart gives me back:

[code]
Starting fsmpm
fsmpm started
Starting cvfsd
cvfsd started
Mounting Xsan File System volumes
(null)
[/code]

MattG:

Good question.

Mounting the Xsan Volume actually happens on the client end. Even though you are pushing a button in the Xsan Admin program, you are essentially sending two commands to the client, and modifying one config file over there. They are as follows:

[code]sudo mkdir /Volumes/volumename[/code]

This creates a mountpoint for the volume.

[code]sudo mount_acfs volumename /Volumes/volumename[/code]

This mounts the volume at the mountpoint.

Then, the automount.plist file within /Library/Filesystems/Xsan/config is modified so that the AutoMount key for that volume is set to "rw", which will do the steps above automatically on next reboot.
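
For the curious, the relevant fragment of automount.plist looks something like this. This is a sketch from memory, with a hypothetical volume named SanVol; check your own file before editing, as the exact structure may differ:

[code]
<key>SanVol</key>
<dict>
    <key>AutoMount</key>
    <string>rw</string>  <!-- set to "no" to keep the volume unmounted -->
</dict>
[/code]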

The automount feature is very persistent in version 1.4.2! If an end user accidentally (or intentionally) ejects the Xsan volume, it pops back up again very quickly.

That is also why, when you try to unmount an Xsan volume with

[code]sudo umount /Volumes/xsanvolume[/code]

the volume pops back into place within a few seconds.

However, as expected, if you modify the automount.plist file on that client and change the AutoMount key to "no", then execute the umount command above, the volume will stay unmounted.

By the way, this persistent mounting behavior is also why Qmaster render nodes and other clustering software that needs to "hard path" to the volume now work properly with Xsan 1.4.2.

If a hard path already exists for the volume, Xsan now does further testing to see whether it's a valid mountpoint. If the folder has the correct permissions and doesn't contain any nested folders, Xsan assumes it's a residual mountpoint and will mount the Xsan volume to it. This didn't happen in previous versions and was the bane of folks trying to make Xsan work with Qmaster.
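
To put that check in shell terms (just an illustration of the logic described above, not Xsan's actual code, and simplified to test only for emptiness; SanVol is a hypothetical volume name):

[code]
# Is /Volumes/SanVol a leftover (empty) mountpoint that can safely be reused?
MP=/Volumes/SanVol
if [ -d "$MP" ] && [ -z "$(ls -A "$MP")" ]; then
    echo "$MP looks like a residual mountpoint"
fi
[/code]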

Chief Technician:

MattG wrote:
[quote]However, as expected, if you modify the automount.plist file on that client[/quote]

Located in /Library/Filesystems/Xsan/config.

MattG wrote:
[quote]and change the AutoMount key to "no", then execute the umount command above,[/quote]

[code]sudo umount /Volumes/[/code]

MattG wrote:
[quote]the volume will stay unmounted.[/quote]

Indeed it does. This allowed me to unmount a volume that the GUI would not unmount. Thanks for this bit!