Fail Actions on CSM Serverfarms

I’ve talked about probes and stuff on the CSM, but I never mentioned what happens to the connections to a server that fails.  That is, if I’m connected to server A in a cluster and that server suddenly commits ritual seppuku, what happens to my connection through the CSM?

Remember how the CSM works?  You connect to the VIP, some state tables are updated, your packet’s destination IP is changed to a RIP, and the packet is forwarded.  The point I want to emphasize this time is the state table.  If you were to send another packet to the same VIP on the same port, the CSM would look in its state table and see that you’re already connected to a server and just forward you on over after a NAT.  What if that server has suddenly died?

The answer is nothing by default.  You keep sending packets asking for more information, and the CSM keeps forwarding your packets to a server that’s dead.  Even if the CSM knows the server is dead through probes or health checking, it still keeps the connections open.  Eventually, after a chunk of seconds, you’ll stop sending packets, and the CSM will time out your connection.  That’s not really good, though.  You don’t want users to have to wait 45 seconds or more for everyone to time out, do you?

The answer is to set a fail action on your serverfarms.  By default there is none, but you do have two choices: purge or reassign.  Purge does exactly what you think it does.  Come to think of it, so does reassign.  I love it when stuff is easy.  For those who didn’t buy a program, the purge directive will just clear the connection and make the user try again (probably behind the scenes to the extent that the user won’t know it happened) while the reassign directive sends the current connections to another server in the farm.

Configuration is easy.  You just use the failaction command under the serverfarm with the action you want.

switch(config)#mod csm 7
switch(config-module-csm)#serverfarm MYFARM
switch(config-slb-sfarm)#failaction purge
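
For comparison, here is a sketch of the same serverfarm set to reassign instead of purge.  The slot number (7) and the farm name (MYFARM) are carried over from the example above, and the only change is the keyword.  The show command at the end is how I'd expect to look at the connection table afterward, but treat its exact syntax as an assumption and check it against your CSM software release.

```
switch(config)#mod csm 7
switch(config-module-csm)#serverfarm MYFARM
switch(config-slb-sfarm)#failaction reassign
switch(config-slb-sfarm)#end
switch#show module csm 7 conns
```

Like purge, this is set per serverfarm, so different farms on the same CSM can use different fail actions.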

Deciding which one to use depends on a few things — your failure rate, the reason a box dies, your cluster’s capacity, etc.  For example, if the web application is freezing up and making your server unusable, chances are that the other servers in the farm are doing the same, and reassigning may just accelerate the new server’s failure with the additional connections.  If your servers fail because you are just doing maintenance, reassigning those connections probably won’t hurt the cluster (assuming you have the capacity to handle all the connections).

The moral of the story is that you probably don’t want to leave the users hanging around, so you should look into purging them or reassigning them when something is amiss.

Send any leprechauns questions to me.

Aaron Conaway

I shake my head around sometimes and see what falls out. That's what lands on these pages.

3 comments for “Fail Actions on CSM Serverfarms”

  1. Ponch
    January 27, 2010 at 4:10 pm

    This makes me happy in multiple ways.

  2. Eamon
    September 6, 2010 at 10:38 am

    Hello Aaron,

    This probably isn’t the correct place to ask this question, so if you wish to move it somewhere else, no probs.  I have a requirement at work where the managers of a website want any requests that come into our load balancer (CSM) on https from our internal network (10.0.0.0) to be redirected to http, as https causes extra needless processing on our SSL blade.  I’ve been doing a bit of searching and can find info about URL rewrites from http to https but nothing the other way.

    We are running an old CSM version, 3.2(1), and an SSL module running 2.1(9).

    This is in a 6500 chassis with a Sup 720 running IOS.

    Any help you can give would be very useful,

    Thanks and keep up the good work.

    Eamon

  3. swampie
    January 21, 2011 at 4:07 am

    Hi Aaron,
     
    Great site and lots of info easily accessed. It's probably worth highlighting on this topic that "reassign" depends on the available (non-failed) real being able to pick up the stateful session where the failed server left off. My understanding of the purge action is that it will clear the sessions and the clients will have to reinitiate their required session.
    ;@)
