Archive for February, 2009
I must be bored since I’m posting again.
A colleague asked me to change the failed value of a TCP probe today. It was no big deal, but, when I looked to see the status of the change, I noticed some interesting statuses on the RIPs.
switch#sh mod csm 7 probe name TCP80-PROBE detail
probe                 type   port  interval  retries  failed  open  receive
---------------------------------------------------------------------
TCP80-PROBE           tcp    80    20        3        120     10
Description: Quick fail recovery
   recover = 3
real                  vserver        serverfarm     policy         status
------------------------------------------------------------------------------
192.168.1.45:80       VS01           FARM01         (default)      ???
192.168.1.44:80       VS01           FARM01         (default)      ???
192.168.1.43:80       VS01           FARM01         (default)      ???
192.168.1.42:80       VS01           FARM01         (default)      ???
It seems that when a change is made to a probe, the CSM discards the state of the probe and starts over. If you catch it before the first probe is finished, you’ll get a status of “???”. I’m just picturing the CSM saying “Uhh…I…don’t…know”.
It quickly cleared back to “OPERABLE”, and my morning fun was gone. :(
I’ve talked about probes and stuff on the CSM, but I never mentioned what happens to the connections to a server that fails. That is, if I’m connected to server A in a cluster and that server suddenly commits ritual seppuku, what happens to my connection through the CSM?
Remember how the CSM works? You connect to the VIP, some state tables are updated, your packet’s destination IP is changed to a RIP, and the packet is forwarded. The point I want to emphasize this time is the state table. If you were to send another packet to the same VIP on the same port, the CSM would look in its state table and see that you’re already connected to a server and just forward you on over after a NAT. What if that server has suddenly died?
The answer is nothing, by default. You keep sending packets asking for more information, and the CSM keeps forwarding your packets to a server that's dead. Even if the CSM knows the server is dead through probes or health checking, it still keeps the connections open. Eventually, after a chunk of seconds, you'll stop sending packets, and the CSM will time out your connection. That's not really good, though. You don't want users to have to wait 45 seconds or more for everything to time out, do you?
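By the way, you can actually watch those zombie connections sit in the state table. Something like this will dump the CSM's current connections so you can spot flows still pointed at a dead real (I'm going from memory on the exact keywords, so check your version's command reference):
switch#sh mod csm 7 conns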
The answer is to set a fail action on your serverfarms. By default there is none, but you do have two choices — purge or reassign. Purge does exactly what you think it does. Come to think of it, so does reassign. I love it when stuff is easy. For those who didn’t buy a program, the purge directive will just clear the connection and make the user try again (probably behind the scenes to the extent that the user won’t know it happened) while the reassign directive sends the current connections to another server in the farm.
Configuration is easy. You just use the failaction command under the serverfarm with the action you want. Here's what it looks like with the FARM01 farm from earlier.
switch(config)#mod csm 7
switch(config-module-csm)#serverfarm FARM01
switch(config-slb-sfarm)#failaction purge
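If you want to make sure the fail action took, the serverfarm detail output should show it (again, syntax from memory, so adjust for your IOS version):
switch#sh mod csm 7 serverfarms name FARM01 detail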
Deciding which one to use depends on a few things — your failure rate, the reason a box dies, your cluster’s capacity, etc. For example, if the web application is freezing up and making your server unusable, chances are that the other servers in the farm are doing the same, and reassigning may just accelerate the new server’s failure with the additional connections. If your servers fail because you are just doing maintenance, reassigning those connections probably won’t hurt the cluster (assuming you have the capacity to handle all the connections).
The moral of the story is that you probably don't want to leave your users hanging around, so you should look into purging them or reassigning them when something is amiss.
Send any leprechauns questions to me.
My home network has a Linux box running IPTables as its center point, and, since there are four networks, it has 4 NICs and 4 cables into the switch. I kept running into problems with the NICs (they would reorder depending on what flavor of Linux was installed), so I wanted to consolidate the NICs down to 2 — one for the Internet link and one for the LAN segments with 802.1q tagging.
Disclosure: I have only labbed this stuff out, and it seems to work, but I have yet to implement it. Use at your own risk in the wild.
Configuring VLAN tagging on Linux is pretty simple, actually. One way to do it is to use the vconfig command to add and remove VLANs from interfaces. As a demonstration, say you want to run VLANs 20 and 30 on eth0. You would just do something like this. Note that the interface you mention here has to be in an UP state, so do an ifconfig eth0 up if you need to get it into a good state.
vconfig add eth0 20
vconfig add eth0 30
Now, whenever eth0 comes up, you’ll have the interfaces eth0.20 and eth0.30. You can give them IP addresses through the command line with ifconfig.
ifconfig eth0.20 192.168.20.1 netmask 255.255.255.0 up
ifconfig eth0.30 192.168.30.1 netmask 255.255.255.0 up
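By the way, vconfig handles removal, too. Note that you remove the VLAN device itself rather than naming the interface and VLAN number again:
vconfig rem eth0.20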
I didn't expect it, but the IP addresses actually stick across network restarts; as long as the physical interface comes up, the VLANs come up with IPs and everything. Speaking of network restarts, the "downfall" with using vconfig is that VLAN interfaces don't show as coming up or going down during network restarts; I don't like that at all.
Another way to configure the VLANs is through the old-fashioned network-scripts directory. Copy your interface config (ifcfg-eth0) to the same name but with the VLAN extension (ifcfg-eth0.20) and edit it. Change the device field appropriately along with the IP address and subnet mask info. For the final touch, at the end of the file, add this line.
VLAN=yes
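Putting it all together, here's roughly what a finished ifcfg-eth0.20 could look like, reusing the addressing from the vconfig example above (adjust BOOTPROTO and friends to match your setup):
DEVICE=eth0.20
BOOTPROTO=static
IPADDR=192.168.20.1
NETMASK=255.255.255.0
ONBOOT=yes
VLAN=yes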
Personally, this is the way I would do it. It lets you manage the configuration through config files just like a physical interface instead of trusting a configuration that resides out in the ether that is the Linux kernel. Also, when you restart the network service, the interface itself actually goes up and down, so you can see what's going on with it. If you need some help with this, check out Redhat's manual on it. Let me know if you're still having problems with it.
Remember to set the box’s switch port to a matching 802.1q trunk. You’ve seen that before, but here’s a refresher, assuming the Linux box is plugged into f0/1 on the switch.
interface FastEthernet0/1
 switchport trunk encapsulation dot1q
 switchport mode trunk
To check the status of your VLANs, look in /proc/net/vlan. You’ll see the config file, which lists all your VLANs. You’ll also see a device file (like eth0.20) with the statistics for that VLAN device (interface).
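A quick look is just a couple of cats (the exact output format varies a little between kernel versions, so don't hold me to the columns):
cat /proc/net/vlan/config
cat /proc/net/vlan/eth0.20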
Let’s talk security, though. First of all, I could argue that a Linux box shouldn’t be participating in any trunking at all. There will be exceptions, but, in my experience, a Linux box (read: server) should only be on one network at a time and shouldn’t straddle networks. Do you really trust the Linux guys to keep their boxes from doing bad things on more than one network? I don’t. Heh.
If, however, you need to use VLANs on a Linux box, you'll need to make sure you have only the proper VLANs running across this port (like we did with the CSM VLAN). If a box were to be compromised, the bad guy could simply start adding VLANs to the server and suddenly get around your routers and firewalls. Awesome, right? Make sure you put in the switchport trunk allowed vlan x directive so the server only has access to those VLANs.
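Continuing the f0/1 example from earlier with our two demo VLANs, that looks something like this.
interface FastEthernet0/1
 switchport trunk allowed vlan 20,30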
As always, send me any four-leaf clovers questions you have.
P.S.: For the record, since I haven’t tried this in the field yet, I can’t tell you how well it works with IPTables, but, from what I’ve been reading, it should work fine. Good luck.
For the record, I’ve got this working at my house connected to a Cisco Catalyst 2950 trunk port. I’m happy to report that it works like a champ with IPTables and everything.
This single Czech provider announcing a single prefix caused a huge increase in the global rate of updates, peaking at 107,780 updates per second. This peak occurred at 16:30:54 UTC, less than 8 minutes after the first announcement.
At Renesys, we call a prefix impacted in a given hour if it either suffers an outage or has a non-trivial amount of instability. In the hour before this event, there were 1215 impacted prefixes globally out of a total of 271,175. During the event, that number surged to 12,920 or 4.8% of all prefixes on earth. One announcement from one provider and we have a 10-fold increase in planetary routing instability for an hour. North America suffered the most, increasing from 0.35% to 4.76%, while South America suffered the least, increasing from 0.52% to 1.75%.
It’s an interesting read and shows another example of just how vulnerable the Internet is as a whole.
I’m kind of an obsessive-compulsive when it comes to numbers (1, 2, 3, 4, 5…), so I’m fairly excited about next Friday (..6, 7, 8, 9, 10…) when Epoch time reaches 1234567890 at 18:31:30 on 13 February(…11, 12, 13, 14, 15…). I’m sure my ADD will kick in (Oh, look. A squirrel!) right before, but I’ll try to remember to run to a Linux box and type date +%s (…16, 17, 18, 19, 20! Made it!).
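If you want to see when 1234567890 hits in your own timezone, GNU date will convert an Epoch timestamp for you (the @ syntax requires a reasonably recent coreutils).
date -d @1234567890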