Archive for the ‘tools’ tag
Happy new year, all. I’m finally over my hangover from the party and ready to blog.
Everywhere I go, I always wind up in a debate about how to alert on log messages as they come in. I was at the grocery store yesterday, and the cashier told me that she had a list of log messages that she watched for, and, if she saw one of them, she sent an email. I asked her what she would do if she got a log message that she had never seen before, and she said that she would have to find it first, then research the message and put in an alert for the next time it showed up.
This is probably a decent method for most setups, but what if that one message said that someone was DoSing your Internet router or that your firewalls were all at 100% CPU? My preferred logging method (and plenty of others') is to send an alert on every message and, if I don't really care about it, to filter it out for the next time. Now, when I see a message that hasn't come in before, I know that something has happened, instead of finding the message after all my routers have died.
The big disadvantage to using this method, though, is the noise you'll get at first. If there are no filters on the messages, you'll see denies from your firewalls, port up/down from your switches, and a whole bunch of other messages that you get 4849278 times a day. If your network is large enough, you may get so many that you fill up your syslog server. That sucks, but it's the price you pay for being able to know when something unknown is happening. As the alerts DB matures, you usually only see stuff that doesn't come up very often — like when a power supply dies or someone is attacking your web server.
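If you're wondering what "filter it out for the next time" looks like in practice, here's a toy sketch. The first two IOS message IDs are real; the filter list itself (and the unknown message) is invented for the example.

```python
# Toy sketch of "alert on everything, filter out what you already know."
import re

KNOWN_NOISE = [
    re.compile(r"%SEC-6-IPACCESSLOGP"),   # ACL hits from the firewalls
    re.compile(r"%LINEPROTO-5-UPDOWN"),   # port up/down chatter from the switches
]

def should_alert(message):
    """Alert unless the message matches a filter we've already added."""
    return not any(p.search(message) for p in KNOWN_NOISE)

print(should_alert("%LINEPROTO-5-UPDOWN: Line protocol on Interface Gi0/1, changed state to down"))  # False
print(should_alert("a message we have never seen before"))  # True
```

Every new message type alerts once, you decide whether you care, and the noise list grows until only the interesting stuff gets through.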
Just some food for thought.
Stretch at Packetlife has a lively little write-up on the Australian government’s attempt to implement a nation-wide web filtering service.
Setting aside the myriad of technical barriers to implementing such a system, the most obvious question is, “who decides what gets blocked?” When a corporation implements a web filter, it does so in accordance with corporate policy — policy that is set by the owner of the network. But the Internet doesn’t belong to any one entity, be it governmental or commercial, so such an authority simply doesn’t exist at this scale. In a very Orwellian sense, this filtering initiative appears to want to create that authority out of thin air.
I don’t know enough about the specifics down under to weigh in very heavily, but I would never support any service that filters web content from my house.
I thought I’d throw an easy one out before taking off for the holiday. Merry Christmas, Hanukkah, Kwanzaa, Saturnalia, etc., to all.
A few years ago, I was looking through some Cacti graphs of gigabit trunks between 6500s and noticed an abrupt change in traffic. The graphs were nice and smooth at around 135Mbps until, seemingly randomly, they just started going wild. It seriously looked like a lie detector from the movies; I saw spikes up to 140Mbps in one sample and 2Mbps the next sample for days and days. I looked around to see if anything weird was going on somewhere on the network, but I didn’t find anything.
I manually went to the trunk ports and sampled the output of show interface over the course of a day or so. Nothing strange. Everything was moving up and down about 10% during the day, but there were no huge jumps and drops like the graphs told me. I asked our monitoring guy for a little help, and we sat down and found…nothing. On a whim, we started looking through the templates that we could use on an interface and saw the 64-bit counters, which triggered a little binary math in both of our heads.
2^32 = 4,294,967,296
Remember that the interface counters count octets (bytes), not bits, so at 110Mbps:
110Mbps / 8 bits-per-byte * 60 sec/min * 5 min/sample = 4,125,000,000 octets/sample
That’s awful close, isn’t it? What if traffic goes up to 150Mbps?
150Mbps / 8 bits-per-byte * 60 sec/min * 5 min/sample = 5,625,000,000 octets/sample
That’s bigger than a 32-bit counter! If a trunk was pushing 150Mbps, the counter could wrap between samples and still land above the previous reading, and Cacti would have no way to tell that it had wrapped at all!
When Cacti (or most any other SNMP tool) polls an interface, it gets the total number of octets that have been sent or received since the counters were reset (usually at boot). When it polls again in 5 minutes, it gets the new number, subtracts the old number, and, voila, that's the total transferred in the last 5 minutes. If the new number is smaller than the first, Cacti assumes that the counter flipped and adds the second number to the difference between the first number and 2^32 to get the value. If, however, the interface is spewing out 150Mbps of data, the counter may flip around once and then still be higher than the original number. If that's the case, Cacti only sees a small difference and shows you a sample rate of 2Mbps. What if you're pushing 300Mbps on the trunk? It may flip twice and still land higher than the first sample for a rate of 2Mbps. Ack!
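That wrap behavior is easy to sketch. The counter values below are invented, and this simplifies what Cacti/RRDtool actually does, but the arithmetic is the same:

```python
# How a poller turns two reads of a 32-bit octet counter into a rate.
WRAP = 2**32
POLL = 300  # seconds between samples

def delta(old, new):
    """Octets transferred between polls, correcting for at most ONE wrap."""
    if new >= old:
        return new - old
    return (WRAP - old) + new  # counter wrapped once

old = 1_000_000_000

# 100Mbps for 300s = 3.75e9 octets: even if the counter wraps, the math works out
new = (old + 3_750_000_000) % WRAP
print(delta(old, new) * 8 / POLL / 1e6)  # -> 100.0 (Mbps)

# 150Mbps for 300s = 5.625e9 octets: the counter wraps AND passes the old
# value, so the poller sees only a tiny delta and reports garbage
new = (old + 5_625_000_000) % WRAP
print(delta(old, new) * 8 / POLL / 1e6)  # -> about 35.5 instead of 150
```

Once the real delta exceeds 2^32 octets per sample, there's no way for the poller to know how many times the counter went around, which is exactly the lie-detector graph I was seeing.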
The fix? Query the proper OIDs for the 64-bit counters (ifHCInOctets and ifHCOutOctets). They show the same data but report it in much larger numbers. Calculating 2^64 gives you 18,446,744,073,709,551,616 octets, which is about 18.4 exabytes (nearly 150 exabits) before the counter wraps. Wow. I can’t even imagine that much traffic. I’m sure I’ll be dead and gone by the time networks reach those speeds in the wild.
All modern network gear has the capability to use 64-bit counters, so use them where you can. Since it’s just another OID, using 64-bit counters doesn’t add any more CPU load to the gear or the monitoring box. Some packages like Cacti come bundled with support for “the big boy” counters, but you may have to do a little research and find the right OID to query. Google is your friend. Let me know if you have problems, and I’ll try to help.
Do it now, by the way; you don’t want to have to explain those flaky graphs to the boss. The concept of Exabits may be a little much for him to understand.
A few articles ago, we discussed getting logging up and running on your IOS box. Part of the discussion was actually having the device log remotely to a box somewhere, but that’s kind of worthless without a properly (for definitions of proper) configured syslog server. A low-end Linux box with an appropriate amount of disk space is a really good candidate to do this for you. I’ll assume you’re running some Redhat-based distro.
I won’t go through the installation, but it should be easy. Just look for the syslog packages for your distro and you should wind up with a working copy on your box. On a Redhat distro, you’ll probably just do a yum install syslog to get it working.
The first thing you need to do is configure the daemon to listen for remote machines, so open /etc/sysconfig/syslog in your favorite editor (read: vi) and change the SYSLOGD_OPTIONS line to read this.
SYSLOGD_OPTIONS="-m 0 -r"
By default, if you restarted the daemon now, you’d wind up sending all your syslog messages to /var/log/boot.log. That may be alright for you, but, if you have a lot of devices, you may want to log them to another file. To do that, you need to change the local7 line at the bottom of /etc/syslog.conf. Just add this and comment out the original line.
# Write router messages to /var/log/cisco.log
local7.*						/var/log/cisco.log
This is not the best way to handle messages from IOS devices, but it’ll get you started. You’ll want to look at changing the facility or further filtering the logging based on facility and severity. In several setups I’ve done, the devices all log to files based on function — routers to one file, firewalls to another, switches to another, etc. — and those files are rotated every X hours.
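For a rough idea, the syslog.conf side of that split-by-function setup might look like this. The facility assignments here are assumptions — they're whatever you configured your devices to send:

```
local7.*        /var/log/routers.log
local6.*        /var/log/firewalls.log
local5.*        /var/log/switches.log
```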
That’s all the configuration you need, so let’s restart the service (a quick service syslog restart on a Redhat box) for everything to spring into action.
In a perfect world, when your IOS devices are configured properly, you’ll have a nice log of IOS messages to keep for posterity.
I like logging on an IOS device. I like to look at the buffer and tell you that your interface went down 30 seconds ago. I like to look on the box and see that BGP with my Internet provider has been flapping since 02:13ET. I like to look and see that one of the other guys has been making changes to the gear all morning. I could go on and on.
There are lots of ways to monitor a Cisco box — SNMP polling, SNMP traps, show commands, etc. — but there’s nothing so handy as the log buffer. A show logging can provide you all sorts of information on things you do and don’t care about, so it’s important to know the destinations and levels when setting up logging.
There are four logging destinations.
- Console — logs to the console ports of the device
- Monitor — logs to any device or pseudo-device that’s in monitoring mode. The most common application is when you do a terminal monitor to see output of debugs.
- Buffer — logs to a memory buffer that lets you see the log messages on demand. It has a finite size and scrolls old messages out after X bytes have been written.
- Host or Trap — logs to an external syslog server.
What’s the most important destination? There isn’t just one. I personally think the syslog host is the most important since it allows you to log messages to disk on a server somewhere. The buffer is also important since it lets anyone with access see what’s going on with the device. Your mileage will vary depending on what you have set up.
There are eight logging levels as well.
- Debug – level 7
- Informational – level 6
- Notifications – level 5
- Warnings – level 4
- Errors – level 3
- Critical – level 2
- Alerts – level 1
- Emergencies – level 0
Wow. Those are some numbers. What do they mean? Every logging message comes with a level. If my CSM loses a RIP, it generates a level 6 message (%CSM_SLB-6-RSERVERSTATE) telling me. If I configure the box, I get a level 5 message (%SYS-5-CONFIG_I) saying that I’ve done so. If my router is on fire in the rack, I’ll get a level 0 message telling me. Now, when you configure a destination, you have to give a logging level, and every message at that level or below will be logged. If I set my logging buffer to 5, I’ll see the configuration message and the “Oh, the humanity!” message, but not the RIP failures. If I set it to 0, I only see the emergencies. If I set it to 7, I see everything.
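The "at that level or below" rule is easy to sketch. This is a toy model, obviously not anything IOS actually runs:

```python
# The eight IOS severity levels, numbered as in the list above.
SEVERITY = {
    "emergencies": 0, "alerts": 1, "critical": 2, "errors": 3,
    "warnings": 4, "notifications": 5, "informational": 6, "debugging": 7,
}

def is_logged(message_severity, destination_level):
    """A message reaches a destination when its number is at or below the destination's."""
    return SEVERITY[message_severity] <= SEVERITY[destination_level]

print(is_logged("notifications", "notifications"))  # %SYS-5-CONFIG_I at a level-5 buffer -> True
print(is_logged("informational", "notifications"))  # %CSM_SLB-6-RSERVERSTATE at level 5 -> False
print(is_logged("emergencies", "notifications"))    # level 0 always makes the cut -> True
```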
Let’s do the configuration, then. After hours of research, you’ve decided to use a remote syslog server at 220.127.116.11 for warnings and the buffer for informational. Here’s what you’d do.
logging host 220.127.116.11
logging trap warnings
logging buffered informational
It’s not that hard. You can even use the number instead of the words for the logging level if you would like. The same procedure holds true for the console and monitor mechanisms — logging <mechanism> <level>. Easy.
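And since the numbers map straight to the words in the list above, the numeric version of those last two lines would be:

```
logging trap 4
logging buffered 6
```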
If you care, here’s what I usually run.
- Console — off. I’ve seen a console rendered unusable because the console was being obliterated with syslog messages. Not only is it an issue with being able to see what you’re typing when stuff is scrolling, but some older devices wind up using 100% CPU because they’re sending messages to the console.
- Monitor — debug. It doesn’t really log anything unless you do a term mon or something, and, in that case, I want to see my debugs.
- Buffer — informational, but it depends on the device. It lets me see all the messages except for debugs, which is probably just right for most routers and switches. If your switch is in a closet somewhere with users plugged directly into it, you may be flooded with up/down messages, so keep an eye out for stuff like that.
- Host or Trap — informational. Debug’s a little too much for the corporate environment, but, depending on how much disk space you have, you may be able to handle it.
There’s a lot more to syslog and log messages, so see this nifty Cisco page.
Alright, that’s an exaggeration, but screen is pretty freaking cool. It’s an app that’s (usually) run under Linux that lets you run commands then detach from that session and reattach later. It doesn’t seem like much, but a few examples can show what it does for me.
I have a backup script at home that takes a target file, tars up everything listed in there, zips up the new file, and puts it on an external drive. It’s very simple but takes about 3 hours to run. I run it manually, so, in normal circumstances, I have to SSH in to my box and keep that window open for 3 hours while the backup runs. With screen, I can open a new shell, run the script, and detach from it while everything gets backed up.
To do this, I log into my box and simply type screen. This takes me to a new shell that’s no different than the one I got when I first logged into the box, and, from here, I run my backup script and watch it dump output like it’s going out of style. When I see it’s running as expected, I do a Ctrl-A, D to detach from the session and return to my original shell. From there, I can do my other business or just log off. When I want to check status, I log into the box again, type screen -r to reattach, and I’m back at my backup session.
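If you wind up running more than one long job like this, naming the session keeps things straight. A quick sketch — the session name here is my own invention:

```
screen -S backup    # start a session named "backup" and work inside it
# ... kick off the script, then hit Ctrl-A, D to detach ...
screen -ls          # list the sessions you can reattach to
screen -r backup    # reattach to that one by name
```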
How about something more network-dude(tte)-based? In the past, we’ve had issues with our VPN kicking us off at random times while we’re trying to do some maintenance. This sucked pretty badly for us when we were doing log archive searches or running custom reporting scripts that may each take several minutes to run — when we got kicked off, we lost everything we had. Since we weren’t the guys doing the VPN at the time, we wound up using screen to help alleviate some of those problems. We would VPN in and connect to one of the Linux management servers. From there, we would open a new screen session and do our work. When the inevitable boot came around, we could just reattach to the screen session to find our stuff still running. That saves a whole mess of frustration when something happens at 03:00.
What else? I’ve mentioned in past articles that I use screen to run dynagen labs — I have a shell for dynamips, one for dynagen, and one for each console that all run in the same screen session. I can use my function keys to add new shells, navigate among them, and detach when I’m done. I edited my .screenrc file on my lab box so that I get the same setup just by typing screen. I stole most of this off the Intrawebs, but here’s my .screenrc file. It sets up the function keys for navigation and opens (and labels) the multiple sessions for my labs.
bindkey -k k7 detach
bindkey -k k8 kill
bindkey -k k9 screen
bindkey -k k; title
bindkey -k F1 prev
bindkey -k F2 next
termcapinfo xterm ti@:te@
screen -t dynamips 0
screen -t dynagen 1
screen -t R0 2
screen -t R1 3
Check the man pages or ask me for more details.
Here’s a quick one for you. In Dynagen, if you want to load a configuration when you first fire up the router instance, you can use the cnfg tag in your NET file like this.
cnfg = /home/jac/labs/cfg/R0.cfg
If you put that in your dynagen NET file under a router, the contents of that file will be loaded into the router configuration when it’s brought up. This is great if you already have a configuration to use in another lab or if you want to load a basic configuration on startup. Please be warned, though; if you make changes to your router instance via the CLI and restart dynagen, the configuration changes you made will be gone. Be sure to remove that line from the NET file before you restart dynagen.
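For context, here's roughly where that line sits in a NET file. The image path and the rest of the layout are placeholders, not from a real lab:

```
[localhost]
    [[7200]]
        image = /path/to/c7200-ios.image
    [[ROUTER R0]]
        cnfg = /home/jac/labs/cfg/R0.cfg
```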
If you’ve never used TCPDump before, you’re missing out on one of the best parts of being a network guy — pointing fingers at everyone else.
TCPDump is an open-source app that copies packets on a machine’s NIC to screen or to file. TCPDump is typically a Linux/Unix app; in the Windows world, TCPDump is replaced by WinDump or Ethereal, now known as Wireshark. It’s a must-know for network dude(tte)s since it lets you capture the packets that a machine is generating. An app may be documented to work one way, but I’ve seen many times where the documentation is out-of-date or just wrong, and I’ve had to look at captures to see what it was actually doing. I used it one time way back when a developer told me the switch was changing his HTTP POST to an HTTP GET; I captured the packets he was sending, pointed to the GET, and never answered a phone call from him ever again.
Am I angry today?
Here’s a more down-to-earth example. How many times have you been asked to open an ACL for a host, but the requester didn’t know the destination IP or the service port? For me, this happens at least twice a week, and I use TCPDump to figure out what the requester is trying to do. Here’s a typical conversation.
Requester: I’ve installed an app on my Linux server to help me zip up my pants, but it’s not working.
Me: Well, I’ve told you on at least 6 occasions that the firewall is going to block connections unless you tell us to open it up.
R: Oh, yeah. I’m still used to the way we did things in 1988. Can you open that up for me?
M: Can you put in a ticket like I always ask you to do?
R: Sure. What do you need it to say?
M: The same thing the last 6 tickets said — source IP, protocol, port, destination IP.
R: I don’t know all that. I just know it tries to connect to my home IP via HTTP, but I don’t know what that IP is.
M: Then how are you going to zip up your pants?
R: I guess I won’t. Can you help me find that information?
Wow. I’ve got some pent-up aggression, don’t I?
To find out what’s used to help this user zip up his pants, we can just run TCPDump (as root, since you need access to the NIC) to capture some packets.
sudo /usr/sbin/tcpdump
Hmmm…I bet you get a lot of stuff. You’re looking at all the packets that are flying in and out of the box, including all sorts of stuff like DNS requests and your own SSH session. This is where TCPDump really shines, though; it’s got a very powerful capture filtering system, which can get complicated at times. You can do all sorts of filtering on the capture, but the most common are the host and port filters, which, like all the other filters, can be strung together to make great huge chains of filters. Since we know it’s trying to connect to an HTTP server, we can start by showing only traffic on port 80.
sudo /usr/sbin/tcpdump port 80
Much better. Here’s some output.
[jac@finland ~]$ sudo /usr/sbin/tcpdump port 80
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
18:54:36.690198 IP 192.168.70.129.8736 > yourmama.com.http: S 1864432324:1864432324(0) win 5840 <mss 1460,sackOK,timestamp 383978802 0,nop,wscale 3>
18:54:36.694046 IP yourmama.com.http > 192.168.70.129.8736: S 1931495854:1931495854(0) ack 1864432325 win 64240 <mss 1460>
18:54:36.694110 IP 192.168.70.129.8736 > yourmama.com.http: . ack 1 win 5840
18:54:38.172982 IP 192.168.70.129.8736 > yourmama.com.http: P 1:16(15) ack 1 win 5840
18:54:38.173947 IP yourmama.com.http > 192.168.70.129.8736: . ack 16 win 64240
It looks like it’s using the hostname yourmama.com. Now we can just look up what that IP is and generate the firewall rule for it (or, if you’re like me, just give it the -nn flag so it doesn’t resolve hostnames or service ports in the first place).
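Once you know the host, you can string filters together as mentioned above; something like this (using the hostname from our fake output) would show just that one conversation:

```
sudo /usr/sbin/tcpdump -nn host yourmama.com and port 80
```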
What else can you do with TCPDump? You can check out the man pages for how to do this stuff or just ask me nicely.
- Capture the packet headers to file for review later
- Capture the whole packet to screen or file
- Read in a packet capture from file
- Listen on a particular interface (like eth3 instead of eth0)
- Capture X number of packets
- Filter based on all sorts of stuff including IP address, port, protocol, MAC address, IPv6 multicast address, VLAN from 802.1Q packet, and all sorts of other good stuff
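A few hedged examples of the items above — the file and interface names here are made up:

```
sudo /usr/sbin/tcpdump -i eth3 -c 100 -w capture.pcap port 80   # save 100 packets from eth3 to a file
sudo /usr/sbin/tcpdump -nn -r capture.pcap                      # read that capture back later
sudo /usr/sbin/tcpdump -s 0 port 80                             # grab whole packets, not just the headers
```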
Let’s talk auditing for a bit. It’s important to have an outside person look over your configurations every so often to be sure you didn’t do something stupid, so, every quarter or so (mostly so), I bring in someone to…wait a minute. It would cost about $3000 for someone to do that, and the company surely isn’t going to pay for that. The wonderful people from “The Internet” know this, though, and have released a whole bunch of tools to audit gear like that. One of those is called Nipper.
Nipper was the dog in the RCA logo, but that has nothing to do with this. What I’m talking about is the Network Infrastructure Parser. It’s a very nice tool for parsing the configs of your IOS routers, IOS switches, CatOS switches, PIXes, ASAs, FWSMs, and a whole mess of other gear. It’s ultra-fast and spits out a great report in HTML by default.
It’s very easy to use, so I won’t get into that, but check it out. It’s worth running your config through this guy every once in a while to make sure you didn’t miss something stupid. Check it out!
Note: You shouldn’t just trust one app to do all your auditing. There’s no way that just a single app can cover everything, so download a bunch of them and run them all when you do your audit.
We all have limited budgets these days. Long gone are the days of unlimited resources and uncontrollable expansion of the network, so it’s important that any network dude or dudette pay attention to the open-source world. Below is a list of stuff I use at the office and at home to monitor, trend, and alert the network. All this stuff is free and runs on Linux to save even more cash.
- Cacti – This is a system for trending pretty much anything. If it has an SNMP value, Cacti can trend it. It’s also really flexible, allowing multiple displays of data and even a mechanism to get values from scripts you write. At the office, we use it to monitor utilization of the circuit and Ethernet ports, CPU and memory of the gear, and the number of connections on the load-balancer. At home, I use it to watch utilization and track the number of connections to the wireless networks.
- Nagios – This is a monitoring and alerting system for all sorts of stuff. It watches hosts and applications for availability and response time, then alerts based on threshold. This is one of the most complicated apps to configure, but, once it’s up, it rocks. I use it at home to monitor all the network gear and systems for response times. I also use it to monitor the web servers and restart them if they’re down.
- Apache – You know what Apache is. You use it already. About 71% of webservers on the Internet are Apache.
- Squid – A caching proxy server. It can be configured for both inbound and outbound application acceleration. It’s great to put in front of a CMS like Drupal or Joomla. It has a mess of built-in functions that can look for bad requests, do redirects, or completely rewrite requests. At work, it fronts our application and CMS servers so users don’t have access directly to them. At home, it runs on the firewall to serve pages to the Internet. The real webserver actually sits on a box behind the firewall for security.
- Subversion – This is a version control system. Subversion lets you create repositories, check out the contents, edit them, and check them back in. This is good for keeping track of configuration files or scripts you write. We use it at work to track configuration files for Apache, NTP, yum, etc. At home, I use it to keep track of my scripts and Perl modules.
- Rancid – This is configuration management for Cisco (and other network) devices. It gets configs from devices and checks them for changes. It’s got built-in alerting and is easy to set up. I use it at home to keep track of the configs on the switches and access points.
- nfsen/nfdump – These are netflow tools. Nfdump is a suite for collecting the data, while nfsen is for displaying the information. Check out netflow if you’ve never worked with it…it’s pretty cool.
- Dynamips/dynagen – These apps let you run virtual Cisco routers on a machine. You can set up full network deployments for testing and configuration experimentation. It takes a good bit of resources, but it’s well worth it for the functionality. I use it all the time at work to test or tweak configs. I also use it to simulate certification labs.