Archive for the ‘csm’ tag
A coworker brought an interesting problem to me the other day. He wanted to move a serverfarm from one server VLAN to another without taking an outage. Since I didn’t want to have to come into the office late at night to do work, I decided to see what we could do.
It turned out to be pretty easy. We tend to think of CSM VLANs as pairs — you have the client VLAN for the web servers where the vserver sits and the server VLAN where the serverfarm sits. The CSM doesn’t know about these relationships; all it cares about is whether the servers are in a server VLAN, and we can use that to our advantage here.
Here’s a snippet of what the original config looked like (not really, since I’m not telling you how my company’s network is set up). The original setup included a serverfarm called SFARM-ORIG that contained 192.168.0.101 and 192.168.0.102. That farm is used by the vserver VSERV-ORIG, which listens on 22.214.171.124 for HTTP. The probe is in there, too.
probe HTTP tcp
 port 80
!
serverfarm SFARM-ORIG
 real 192.168.0.101
  inservice
 real 192.168.0.102
  inservice
 probe HTTP
!
vserver VSERV-ORIG
 virtual 126.96.36.199 tcp http
 vlan 100
 serverfarm SFARM-ORIG
 inservice
To make the move, we start by creating a new vserver and serverfarm that contains all the IPs involved: both the original IPs that are already in service and the new IPs to which the servers will migrate. The new vserver listens on 188.8.131.52. In this case, we’re moving the servers to 10.10.1.101 and 10.10.1.102.
serverfarm SFARM-NEW
 real 192.168.0.101
  inservice
 real 192.168.0.102
  inservice
 real 10.10.1.101
  inservice
 real 10.10.1.102
  inservice
 probe HTTP
!
vserver VSERV-NEW
 virtual 184.108.40.206 tcp http
 vlan 200
 serverfarm SFARM-NEW
 inservice
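The combined farm is just the old RIPs plus the new ones, so it is easy to generate if you have a lot of servers to move. Here is a minimal sketch in Python, assuming you keep the old and new addresses in two lists. `render_serverfarm` is a made-up helper, not part of any Cisco tooling, and it only emits the serverfarm, real, inservice, and probe lines already used in the config above.

```python
def render_serverfarm(name, reals, probe):
    """Return CSM serverfarm config text for the given RIPs (illustrative only)."""
    lines = [f"serverfarm {name}"]
    seen = set()
    for ip in reals:
        if ip in seen:  # skip duplicates so each RIP is listed once
            continue
        seen.add(ip)
        lines.append(f" real {ip}")
        lines.append("  inservice")
    lines.append(f" probe {probe}")
    return "\n".join(lines)


old_rips = ["192.168.0.101", "192.168.0.102"]
new_rips = ["10.10.1.101", "10.10.1.102"]
config = render_serverfarm("SFARM-NEW", old_rips + new_rips, "HTTP")
print(config)
```

When a server has finished moving, drop its old address from `old_rips` and regenerate to see what the farm should look like.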
When you first drop in the config, the original RIPs should come up as operational, and the new ones should fail since they don’t exist yet (duh!). When everyone’s ready, you then move the service over to the new VIP and run off of that for a while to make sure everything’s working as expected. When all the parties involved are happy, you can then start moving over the servers one at a time. The probe should fail out a server pretty quickly, then, when the server is reconfigured and put on the right VLAN, the CSM should eventually see the new RIP come up and put it back in the available server pool for the farm.
Configured like that, you can move the servers whenever you would like, and the CSM will automatically detect the changes and act accordingly. You just have to remember to remove the old IPs from the serverfarm when a server moves.
Send any alternative study techniques questions my way.
It seems that we have another piece of evidence that Cisco doesn’t like the CSM. From what I’m able to creatively interpret, the software developers didn’t think anyone would be running the CSM for very long, so they set a variable that expires CSM-inserted cookies at 01:01:50 GMT on 1 January 2010. If you’re using cookies to make connections sticky, that means you may see some unexpected results; this shouldn’t affect the web servers’ cookies.
The bug toolkit lists 4.3(3) as the “first found in” version, but I’m fairly confident that it exists in every version before 4.3(3). If you want to be sure you have the bug, you can run the show mod csm # variable command and look for the COOKIE_INSERT_EXPIRATION_DATE value. It should look something like this.
Switch#sh mod csm 2 variable
variable                          value
----------------------------------------------------------------
ARP_INTERVAL                      300
ARP_LEARNED_INTERVAL              14400
ARP_GRATUITOUS_INTERVAL           15
ARP_RATE                          10
ARP_RETRIES                       3
ARP_LEARN_MODE                    1
ARP_REPLY_FOR_NO_INSERVICE_VIP    0
ADVERTISE_RHI_FREQ                10
AGGREGATE_BACKUP_SF_STATE_TO_VS   0
COOKIE_INSERT_EXPIRATION_DATE     Fri, 1 Jan 2010 01:01:50 GMT
...
The “real fix” is to upgrade to 4.3(3.1) or 4.2(12.1). Of course that means a reboot of the CSM and an outage and all that. A workaround includes setting the COOKIE_INSERT_EXPIRATION_DATE variable to some time in the future. The bug text gives an example of some time in 2020, but any time in the distant future will do. Assuming your CSM is in slot 2 and you’ve selected 1 Jan 2020 at 00:00:00 for your expiration, you would do this.
Switch(config)#mod csm 2
Switch(config-module-csm)#variable COOKIE_INSERT_EXPIRATION_DATE "Wed, 1 Jan 2020 00:00:00 GMT"
That’s much easier than upgrading the CSM, eh? If you’re still using your CSM by 2020, you can set it again if you want, but you’ll be well past the EOL on that guy (4.1 goes EOL on 13 Oct 2012).
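If you have more than a couple of CSMs to check, you can script the comparison. The sketch below is Python and assumes you have already captured the show mod csm # variable output as text. The function names are mine, and I have not tested whether the CSM cares about the zero-padded day that Python produces, so treat the generated string as a starting point.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def cookie_expiration(show_output):
    """Pull COOKIE_INSERT_EXPIRATION_DATE out of captured 'show mod csm # variable' text."""
    for line in show_output.splitlines():
        parts = line.split(None, 1)  # variable name, then the rest of the line
        if len(parts) == 2 and parts[0] == "COOKIE_INSERT_EXPIRATION_DATE":
            when = parsedate_to_datetime(parts[1].strip())
            if when.tzinfo is None:  # treat a zone-less date as GMT
                when = when.replace(tzinfo=timezone.utc)
            return when
    return None


def is_expired(when, now=None):
    """True if inserted cookies would already be expired at 'now' (default: current time)."""
    now = now or datetime.now(timezone.utc)
    return when <= now


sample = """ARP_RETRIES 3
COOKIE_INSERT_EXPIRATION_DATE Fri, 1 Jan 2010 01:01:50 GMT
"""
expires = cookie_expiration(sample)
print(expires, is_expired(expires))

# Build a replacement value whose weekday matches the date.
new_value = format_datetime(datetime(2020, 1, 1, tzinfo=timezone.utc), usegmt=True)
print(new_value)
```

Letting the library work out the weekday saves you from typing a day name that doesn’t match the date.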
Send any space shuttle launch tickets questions my way.
* May require CCO access
SSH is more than just a shell. You can copy files from and to a server or piece of network gear with it. You can use it to tunnel traffic. Possibly my favorite, though, is to use SSH to run a command on a remote box without interacting with a shell.
One of my biggest pet peeves with IOS (or pretty much any Cisco OS) is the lack of complex filtering. Let’s say I want to look at all the downed ports and interfaces on modules 3 and 6 of my 6509. I can’t easily do that with commands from IOS, but, on my Linux box, I can use multiple grep commands to get exactly what I want really easily. Let’s work through the example, shall we?
To start with, let’s just do a show ip int brief without getting a shell on the switch.
ssh my.switch.com "show ip int brief"
When you run this and give your password, you see the output we’ve all learned to love, and, now that you’ve got it in STDOUT on your Linux box, you can start filtering. Now, let’s use grep to find the downed ports and interfaces on modules 3 and 6.
ssh my.switch.com "show ip int brief" | grep down | grep "Ethernet[36]/"
How about downed ports and interfaces on modules 3 and 6 that are not administratively down?
ssh my.switch.com "show ip int brief" | grep down | grep "Ethernet[36]/" | grep -v admin
I’ll stop there, but it can go on and on. Read up on regular expressions and/or grep if you don’t know what we’re doing here.
What’s really happening is that we’re taking the output of the command “ssh ….” and piping it (with |) to the command grep. We can send it to whatever command we want, though, so don’t be shy. I’ve actually written several scripts that take output of commands like show int description on a router to generate some reports. When I want to run one of those, I do something like this.
ssh my.switch.com "show int desc" | parseOutput.pl
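My scripts are Perl, but to give you an idea of what sits on the far end of that pipe, here is a rough Python sketch of a parser for show int desc output. The column layout in the sample is an assumption based on typical IOS output, and the function names are invented for the example. A real script would read the text from standard input instead of a hard-coded sample.

```python
import re

# Status can be two words ("admin down"), so match columns with a regex
# instead of splitting on whitespace.
ROW = re.compile(
    r"^(?P<intf>\S+)\s+(?P<status>admin down|up|down)\s+(?P<proto>up|down)\s*(?P<desc>.*)$"
)


def parse_int_desc(text):
    """Turn 'show int desc' text into a list of dicts, skipping the header and junk lines."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.rstrip())
        if match:
            rows.append(match.groupdict())
    return rows


def down_but_enabled(rows):
    """Names of interfaces that are down without being administratively shut."""
    return [r["intf"] for r in rows if r["status"] == "down"]


# Hypothetical sample output, for illustration only.
sample = """\
Interface                      Status         Protocol Description
Gi3/1                          up             up       uplink to core
Gi3/2                          admin down     down
Gi6/1                          down           down     web server
"""
rows = parse_int_desc(sample)
print(down_but_enabled(rows))
```

Once the rows are dictionaries, building whatever report you need is just a matter of looping over them.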
There’s always a gotcha or two to watch for, isn’t there? I’ve found a couple.
First, your command runs at your privilege level, so, if your user is priv 1, you’re not going to be able to do a show run or reload. You could just ignore security for a bit and set your privilege to 15, but I don’t recommend doing anything like that. Before you say it, you’ll probably have a hard time with enabling as well. You can only run one command at a time, so you would just enable yourself and get kicked off. Not very helpful.
Another problem I see is the lack of public/private key pair support on Cisco devices. On a Linux box, you can copy your keys around, and those are presented in lieu of a password. Since (most) Cisco devices don’t have home directories, there’s no place to drop the keys, and we’re left with just using passwords. Support for this would be nice, but the security problems associated with keeping SSH keys and user home directories are probably too much to even think about.
What else? Oh, yeah. The PIX/FWSM/ASA family supports SSH, but it acts differently from the IOS guys. When you run a command through SSH, you actually get an interactive shell with the command already on the CLI for you. This is probably by design; the only thing you can really do from a non-priv prompt is to enable.
Anyway, send any grilling tips questions my way.
We talked about running multiple data centers on a stick back in August, which is where you have multiple logical pairs of client and server VLANs on a single CSM for different tiers or functions. The big point of the article was that you had to do some fancy forwarding to get a server-initiated connection from one server VLAN to appear out the appropriate client VLAN. Well, we ran into an interesting issue with the given solution.
Let’s set up a scenario. Check the diagram for an overview. We have many pairs of client and server VLANs, each with a firewall interface as the gateway into the DCOAS. Let’s focus on just one, though: client VLAN 101 and server VLAN 102. In VLAN 101 is ServerA (not pictured) with an IP of 220.127.116.11; in VLAN 102 is our web farm that needs to connect to ServerA to drop off some data. We add a static route on ServerA pointing traffic for 18.104.22.168/24 back through the CSM.
When you try to connect from the web farm, though, it just times out. Why is that?
Remember that weird forwarding vserver that we had to use to get traffic to come out of the right client VLAN? Well, that’s stabbing you in the eye right now. When the web server initiates a connection, it sends traffic to the server VLAN IP of the CSM. The forwarding vserver grabs the new connection and load balances it to its only RIP, which is the IP of the firewall. What happens when any good firewall accepts traffic on an interface destined for a host out of that same interface? It drops the packet, and, eventually, the server times out.
What’s the fix, then? There are a few that come to mind. The first may be to just move ServerA to another network segment. Another may be to change the process around a bit by having ServerA pull the data instead of it being pushed since client-initiated connections will work like a champ.
A really outrageous one would be to set up another forwarding vserver that has only ServerA as its serverfarm. You would then add a static route in the web servers pointing to ServerA through that VIP, which would forward it over.
On the CSM, you’d do something like this.
serverfarm SERVERA-SF
 no nat server
 no nat client
 real 22.214.171.124    <--- ServerA
  inservice
vserver SERVERA-VS
 virtual 126.96.36.199 any
 vlan 102
 serverfarm SERVERA-SF
 inservice
On the server, you would add a static route to ServerA through 188.8.131.52. If you’re using some brand of Linux, you’d do this.
route add 184.108.40.206 gw 220.127.116.11
Don’t forget the static route on ServerA.
Send any Peeps questions my way.
I must be bored since I’m posting again.
A colleague asked me to change the failed value of a TCP probe today. It was no big deal, but, when I looked to see the status of the change, I noticed interesting stati of the RIPs.
switch#sh mod csm 7 probe name TCP80-PROBE detail
probe           type    port  interval retries failed  open   receive
---------------------------------------------------------------------
TCP80-PROBE     tcp     80    20       3       120     10
 Description: Quick fail recovery
 recover = 3
 real                  vserver         serverfarm      policy          status
 ------------------------------------------------------------------------------
 192.168.1.45:80       VS01            FARM01          (default)       ???
 192.168.1.44:80       VS01            FARM01          (default)       ???
 192.168.1.43:80       VS01            FARM01          (default)       ???
 192.168.1.42:80       VS01            FARM01          (default)       ???
It seems that when a change is made to a probe, the CSM discards the state of the probe and starts over. If you catch it before the first probe is finished, you’ll get a status of “???”. I’m just picturing the CSM saying “Uhh…I…don’t…know”.
It quickly cleared back to “OPERABLE”, and my morning fun was gone. :(
I’ve talked about probes and stuff on the CSM, but I never mentioned what happens to the connections to a server that fails. That is, if I’m connected to server A in a cluster and that server suddenly commits ritual seppuku, what happens to my connection through the CSM?
Remember how the CSM works? You connect to the VIP, some state tables are updated, your packet’s destination IP is changed to a RIP, and the packet is forwarded. The point I want to emphasize this time is the state table. If you were to send another packet to the same VIP on the same port, the CSM would look in its state table and see that you’re already connected to a server and just forward you on over after a NAT. What if that server has suddenly died?
The answer is nothing by default. You keep sending packets asking for more information, and the CSM keeps forwarding your packets to a server that’s dead. Even if the CSM knows the server is dead through probes or health checking, it still keeps the connections open. Eventually, after a chunk of seconds, you’ll stop sending packets, and the CSM will time out your connection. That’s not really good, though. You don’t want users to have to wait 45 seconds or more for everyone to time out, do you?
The answer is to set a fail action on your serverfarms. By default there is none, but you do have two choices — purge or reassign. Purge does exactly what you think it does. Come to think of it, so does reassign. I love it when stuff is easy. For those who didn’t buy a program, the purge directive will just clear the connection and make the user try again (probably behind the scenes to the extent that the user won’t know it happened) while the reassign directive sends the current connections to another server in the farm.
Configuration is easy. You just use the failaction command under the serverfarm with the action you want.
switch(config)#mod csm 7
Deciding which one to use depends on a few things — your failure rate, the reason a box dies, your cluster’s capacity, etc. For example, if the web application is freezing up and making your server unusable, chances are that the other servers in the farm are doing the same, and reassigning may just accelerate the new server’s failure with the additional connections. If your servers fail because you are just doing maintenance, reassigning those connections probably won’t hurt the cluster (assuming you have the capacity to handle all the connections).
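If it helps to see the three behaviors side by side, here is a toy model in Python of a connection table when a real dies. It is not how the CSM is implemented. In particular, the round-robin spread used for reassign is my own placeholder, since I don’t know how the CSM actually picks the new server.

```python
def apply_failaction(connections, dead, healthy, action=None):
    """Model what happens to a connection table when the real 'dead' fails.

    connections maps a client to the real serving it. Returns a new table.
    action is None (the default: do nothing), "purge", or "reassign".
    """
    if action not in (None, "purge", "reassign"):
        raise ValueError(f"unknown fail action: {action}")
    result = {}
    moved = 0
    for client, real in connections.items():
        if real != dead or action is None:
            result[client] = real  # untouched; with no action, dead connections linger
        elif action == "reassign" and healthy:
            # placeholder policy: spread orphaned connections round-robin over survivors
            result[client] = healthy[moved % len(healthy)]
            moved += 1
        # "purge" (or reassign with no survivors) drops the entry entirely
    return result


table = {"client1": "A", "client2": "B", "client3": "A"}
print(apply_failaction(table, "A", ["B", "C"]))              # default: nothing changes
print(apply_failaction(table, "A", ["B", "C"], "purge"))     # only client2 remains
print(apply_failaction(table, "A", ["B", "C"], "reassign"))  # client1 and client3 move
```

Notice that reassign puts extra load on the survivors, which is exactly the capacity concern mentioned above.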
The moral of the story is that you probably don’t want to leave the users hanging around, so you should look into purging them or reassigning them when something is amiss.
Send any leprechauns questions to me.
Did you catch the article on setting up fault tolerance on the CSM? In that article, I mentioned that Cisco recommends a dedicated trunk for the FT VLAN if you have two HA CSMs in two chassis. Discuss amongst yourselves while I drone on.
Why should you set up a dedicated trunk for this stuff? The most obvious reason is to be sure that normal traffic doesn’t step on the syncing traffic. Since we’re syncing state information as well as configuration, the frames need to arrive in a timely manner. Any errors could potentially disrupt the FT process, which is bad. You surely don’t want the primary to fail only to find out that the standby doesn’t have the complete or current config.
Another reason is to keep the syncing traffic from stepping on normal traffic. The CSM is a pretty robust box and can handle a pretty good chunk of data. If you had a 100Mbps trunk between your chassis, there is the potential for the link to get flooded if the CSM ever starts sending some real data. All things being equal, though, your trunks are probably sized properly for your network, and the addition of the syncing traffic probably won’t affect much.
Let’s review our configuration from the other article.
vlan 83
 name CSM-Sync
!
module csm 3
 ft group 1 vlan 83
  priority 100 alt 90
  preempt
This snippet creates VLAN 83 and tells the CSM to use it for syncing, but how do we dedicate a trunk for that VLAN? We use the switchport trunk allowed vlan directive. We’ll assume that G1/1 on your primary switch is connected to G1/1 on your standby.
interface GigabitEthernet1/1
 description CSM Syncing
 switchport
 switchport trunk encapsulation dot1q
 switchport trunk allowed vlan 83
 switchport mode trunk
This sets G1/1 up to only allow VLAN 83 across it. If you do a show int G1/1 trunk, you’ll see that this VLAN is the only one allowed, the only one active, and the only one forwarding on that link. Of course, you’ll need to do the same on the other side to keep traffic flow sane, but it’s fairly easy.
What if G1/1 goes down, though? You’d lose sync, so you probably want to look at a solution for that little problem. You could put in multiple links and let Spanning Tree do the work. You could even turn those links into an EtherChannel for redundancy and throughput. If you have more than two chassis, you could full mesh them with trunks dedicated to VLAN 83. There are a number of ways around the problem. Be creative.
Be sure to send turkey questions my way.
There are three different ways that a CSM checks for the health of the servers — active probes, inband health checking, and inband HTTP monitoring. Let’s talk about active probes.
Active probes (or just probes) typically send traffic to one of the RIPs of a serverfarm, do some stuff, and give a pass or fail grade. If the probe fails a certain number of times in a row, that server is considered sick and taken out of the pool for use. The CSM keeps checking the unhealthy server until it passes a number of times in a row, at which point it is placed back in the pool for use. Almost everything is configurable, of course, so let’s look at some of those settings.
These all have their defaults, so you don’t need to actually configure them, but it’s important to know they’re there to tweak later. There are other parameters to the more specific types of probes as well. You’ll have to venture forth on your own to figure those out, though I’ll be glad to help.
- interval: The time between probes when the server is healthy.
- retries: The number of consecutive failures before a healthy server is considered failed.
- failed: The time between probes when the server is failed.
- recover: The number of consecutive successes before a failed server is considered healthy.
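The four settings above interact like a small state machine, which is easier to see in code. This is a toy model in Python, not the CSM’s actual logic: the class and method names are made up, and the numbers in the example are just sample values.

```python
class ProbeState:
    """Track one real's health from a stream of probe results (a toy model)."""

    def __init__(self, interval, retries, failed, recover):
        self.interval = interval  # seconds between probes while healthy
        self.retries = retries    # consecutive failures needed to mark the real failed
        self.failed = failed      # seconds between probes while failed
        self.recover = recover    # consecutive successes needed to mark it healthy again
        self.healthy = True
        self.streak = 0           # consecutive results that disagree with the current state

    def record(self, passed):
        """Feed in one probe result; return seconds to wait before the next probe."""
        if passed == self.healthy:
            self.streak = 0  # result agrees with the current state, so reset the count
        else:
            self.streak += 1
            needed = self.retries if self.healthy else self.recover
            if self.streak >= needed:
                self.healthy = not self.healthy
                self.streak = 0
        return self.interval if self.healthy else self.failed


probe = ProbeState(interval=20, retries=3, failed=120, recover=3)
results = [True, False, False, False, True, True, True]
waits = [probe.record(r) for r in results]
print(waits, probe.healthy)
```

The list of waits shows the probe switching from the interval timer to the failed timer after the third failure in a row, and back again after the third success.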
I always said that the CSM only speaks HTTP, but it knows how to order a ham and cheese sandwich in a few other protocols, including doing some decent stuff like watching for SMTP banners, looking for ICMP reachability, or getting a response from a Tcl script. Here’s a list of the probes with some boring commentary. Depending on IOS versions, you may have more or less available to you.
- tcp: Establishes a connection to a TCP port. If the port is open, we pass.
- udp: Same as TCP, but in UDP. Duh.
- icmp: Ping-a-ling. Do I need to explain this one?
- http: Requests a URL from a webserver and looks for HTTP return codes.
- dns, smtp, telnet: These guys just attach to the service and look for a proper response header. They don’t do any transactions or anything but simply make sure that DNS, SMTP, or telnet is running on the port.
- script: These probes run a Tcl script (that you write) and look for the return code of the script. Very powerful!
- kal-ap-tcp, kal-ap-udp: Admittedly, I have no earthly idea what those are. Most references I’ve seen to it involve the Cisco ACE, but I have no clue. Can someone fill in for me here?
Shall we try one or two? How about a TCP probe that makes sure your custom app is still running on TCP/8839? Since it’s a custom app, we can use a TCP probe on that port to make sure something is listening. (If you need something more, you’ll need to check out script probes.)
probe MYAPP tcp
description My app on TCP/8839
Now we apply the probe to the serverfarm.
Easy enough. How about another? I want to configure an HTTP probe that gets the URL /test.php and looks for the status code 200 and apply it to a serverfarm.
probe LOOKFOR200 http
request url /test.php
expect status 200
By the way, you don’t want to use this in production at all. You’ll probably want to elect to look for ranges of status codes instead of a single value. Google up HTTP return codes and you’ll see why.
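To show why a single value is too strict, here is the range idea as a few lines of Python. The default range of 200 through 299 is only my assumption about what a healthy answer looks like. Your application may need a different set.

```python
def status_ok(code, expected=((200, 299),)):
    """True if an HTTP status code falls inside any inclusive (low, high) range."""
    return any(low <= code <= high for low, high in expected)


print(status_ok(200))                            # the exact match still passes
print(status_ok(204))                            # other 2xx codes pass too
print(status_ok(302, ((200, 299), (301, 302))))  # redirects allowed explicitly
print(status_ok(500))                            # a server error fails
```

A probe that only accepts 200 would mark a server down for answering 204, even though nothing is wrong with it.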
I will mention again, though, that the script probe is very interesting. If you know Tcl or have done any development, check these things out. You can do whatever you want with them instead of using the canned probes that come with your CSM. I suggest taking a look at Ivan Pepelnjak’s page for some insight into Tcl on Cisco devices.
I think you can take it from here. As always, comments are welcome.
Like (nearly) everything in the Cisco world, you can set up your CSM to fail over to another module when the primary dies a horrible death. You can have two in the same chassis or even have them in separate chassis — the process is the same no matter how you have it set up. Either way, you have a primary and a secondary module in fault tolerance (FT) mode.
First, we’ll establish a VLAN that the CSM will use to do its configuration and state syncing over. This is just an ordinary VLAN; there’s nothing special about it, really, but it should be dedicated for the CSM to use for syncing. Let’s randomly choose VLAN 83.
You will, of course, have to do this on every switch that holds a CSM, so, if you’re using them in two different chassis, you’ll put the same VLAN on each making sure they can see each other through a trunk. Cisco recommends that you dedicate a trunk between the two switches for the sync VLAN in order to remove the chance of other traffic stepping on the sync packets, but I’m not convinced that’s necessary. Use your judgement on that one.
Back to it. Next, you need to decide on a FT group ID. This is similar to a HSRP group and lets you run multiple FT groups on the same VLAN. The group ID needs to be in the range of 1 to 256, so, since this is the first one, let’s just use 1. Get into config mode for the CSM that you want to be the primary and do this.
ft group 1 vlan 83
This takes you to the config-slb-ft prompt. Just like HSRP, we need to set priorities for each device and whether or not it should preempt, so let’s configure. Yes, we want to preempt, right? Let’s set the priorities to 100 and 90, too.
priority 100 alt 90
preempt
This sets the primary CSM to priority 100 and the secondary to 90; both will preempt.
What about configuring the secondary for FT? That’s easy. Go into CSM config mode on the secondary and enter the ft group 1 vlan 83 command. That’s it. The two CSMs will do a little arguing and come back as the primary and secondary. After that, all configuration is done on the primary, which is synced over to the secondary just like an ASA. Pretty cool, eh?
When configuring things like IP addresses, though, you’ll need to make provisions for the secondary with the alt directive (remember that one from the priority). I won’t go into much, but you’ll need it mostly when assigning IPs to VLANs. Here’s an example of setting an IP address on client VLAN 100 for both the primary and secondary.
vlan 100 client
ip address 192.168.0.11 255.255.255.0 alt 192.168.0.12 255.255.255.0
Alright…one more thing. The configurations don’t sync automagically (at least not on my old version of code). If you make a change to the primary CSM, you’ll see an out-of-sync message when you look at the FT status.
Switch#sh mod csm X ft
FT group 1, vlan 83
This box is active
Configuration is out-of-sync
priority 100, heartbeat 1, failover 3, preemption is on
alternate priority 90
If the primary goes down now and the secondary takes over, the changes you just made won’t be reflected on the secondary. You fix this with the hw-module contentSwitchingModule X standby config-sync command (where X is the module slot in the chassis). Alternatively, you can just type hw c X s c as a shortcut. It’ll take a few minutes depending on your configuration, so check your logs for when it’s finished. Note that the secondary does not save the new configuration to its startup-config; you’ll have to log in and save that manually (or automatically through CiscoWorks or something) to save changes there.
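Since it is easy to forget the sync, you may want a script to nag you. The Python sketch below checks captured show mod csm X ft output for the out-of-sync message. It treats the absence of that line as in sync, which is an assumption on my part, because the only wording I have seen is the output above.

```python
def ft_status(show_ft_output):
    """Summarize 'show mod csm X ft' text: is this box active, and is the config in sync?"""
    text = show_ft_output.lower()
    return {
        "active": "this box is active" in text,
        "in_sync": "out-of-sync" not in text,  # assumption: no message means in sync
    }


sample = """FT group 1, vlan 83
This box is active
Configuration is out-of-sync
priority 100, heartbeat 1, failover 3, preemption is on
alternate priority 90
"""
print(ft_status(sample))
```

Run something like this from cron against both switches and you will know about a stale standby before a failover tells you.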
Let me know if you have any questions and check out my page on getting output from Cisco’s fine mid-tier load balancer. :)
That’s an awesome title, eh? I’ve mentioned a router-on-a-stick before but not a data-center-on-a-stick (DCOAS). This is one of those Cisco terms I ran across a while ago and is a group of servers sort of sticking out on their own behind a load balancer and/or firewall. Connections to and from the server group go through a single spoke — kinda like stubby routing. Here’s a pretty picture.
To configure this type of setup on the CSM, assuming you’re running in router mode, you just define a server and client VLAN pair along with an alias IP on the server VLAN. The servers on that VLAN point to the alias as their gateway for return traffic, and the CSM handles the VIP/RIP conversion.
What if you have more than one server/client VLAN pair, though?
You set up your VLANs and your server VLAN alias for both sets and all is well, right? Not really. The CSM isn’t a router and doesn’t do a very good job at routing. If you were to send traffic from “Other Stuff” to a VIP on VLAN101, everything works great. If, however, one of those servers initiates a connection, the traffic will come out of the client VLAN with the lowest VLAN ID. If you have VLANs 1 and 2 for the top pair, traffic from all servers will come out VLAN 1. Very big problem if you have a firewall that’s expecting traffic on another interface.
How do you get around it, then? You have to trick the CSM by generating new catch-all vservers on each server VLAN to “load balance” the gateway for each VLAN. Let’s look at the configs.
Here are the serverfarms.
no nat server
no nat client
no nat server
no nat client
And the vservers.
virtual 0.0.0.0 0.0.0.0 any
virtual 0.0.0.0 0.0.0.0 any
When a server makes a new connection to a database or whatnot, it sends the traffic to the alias you already configured on the server VLAN. When the CSM receives that traffic through the alias, this new vserver takes over since it’s set for any protocol and port on any IP. This vserver uses a serverfarm that contains the gateway for the appropriate client VLAN — in our case, the IP of the firewall — so now every connection from the server will exit an appropriate VLAN instead of out the lowest ID.
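If the before and after are still fuzzy, this toy model in Python captures the difference. The VLAN numbers and firewall addresses are invented for the example, and the function only mimics the behavior described here. It is not a claim about CSM internals.

```python
def egress_gateway(server_vlan, pairs, forwarding_vservers=True):
    """Pick where a server-initiated connection leaves (toy model).

    pairs maps each server VLAN to (client VLAN, gateway IP on that client VLAN).
    """
    if forwarding_vservers:
        # The catch-all vserver on the server VLAN "load balances" to its one real:
        # the gateway on the matching client VLAN.
        return pairs[server_vlan]
    # Without it, traffic leaves via the lowest-numbered client VLAN, whatever its source.
    return min(pairs.values())


# Hypothetical VLAN pairs and firewall addresses, for illustration only.
pairs = {102: (101, "10.0.101.1"), 202: (201, "10.0.201.1")}
print(egress_gateway(202, pairs, forwarding_vservers=False))
print(egress_gateway(202, pairs))
```

The first call shows a server on VLAN 202 leaking out client VLAN 101, and the second shows it leaving through its own pair once the forwarding vserver is in place.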
Confusing? Yes. Should be fixed? Yes. Is already fixed? I have no idea, but it’s not as of 12.2 and some change.