It’s pretty widely known that I hate Cisco 3750 switches. We’ve had so many hardware and software failures with them that I’ve got a seriously bad taste in my mouth. Since I’m leaving for a new company, I thought I’d publish some statistics while I still have access to the numbers.
Total TAC cases opened related to 3750s: 21
Number of 3750G-12S-S replaced: 21
Number of 3750G-24TS replaced: 7
Total number of RMAs issued: 28
Total number of 3750s in the company: ~120
Failure rate: 23.3%
I can accept a handful of failures, but 23%?!?!? That’s one fine platform you’ve developed there, Cisco. Keep up the good work.
Last week, @fletcherjoyce posted an article on his blog about his positive experiences with Cisco’s 3750 switches. If you follow my complaint tweets, you know that I’ve had quite the opposite experience with them. I would never pick on anyone, but I had to throw in my 2 cents.
I’m guessing here, but we have about 50 3750 stacks in the enterprise. Since most of them are pairs, you wind up with roughly 120 switches. We’ve done about 20 replacements over the last 5 years, which works out to roughly a 17% failure rate. That’s pretty horrible, isn’t it?
For the most part and with few (if any) exceptions, we use the 3750s as aggregation points for our access switches. We don’t do QoS on them. We don’t do any access control on them. We don’t even do routing on them. They’re simply used to connect all the access switches in the closet to the core, so they’re not doing anything funky or burdensome. The CPU and memory are always well within normal operating parameters. They just fail and fail repeatedly.
The flies started dropping in closets at our corporate headquarters a few years ago. It was the middle of summer, and the temperatures kept rising to over 90F (32C) until we lost 3 switches in 3 weeks. If you could stand to be in the closet at all, the sheet metal of the switches was hot enough to make you pull your hand back! When the facilities team added more cooling, the temperatures there dropped to around 82F (28C), but we continued losing switches. I figured the newly-failed switches were feeling the effects of the earlier heat wave and were just getting around to giving up the ghost. Surely the heat was the culprit.
A few months after our headquarters meltdown, a tech at a satellite office called and asked if we could help with some latency issues. He showed me the switch stacks throughout the building, and I noticed that only one of the 10 switches actually had a label. The tech said that he never got around to relabeling them after they were replaced. Some, he said, had been replaced multiple times. The closets were running about 76F (24C), so heat didn’t seem to be the problem at this location. The closets were clean as a whistle, and everything in the racks was on building UPS. I couldn’t find a pattern at all. For the record, all their latency issues came down to two unrelated 3750s. Two RMAs later, and their problems were gone.
I’ve been trying to find patterns in the failures, but I can’t think of any. If it’s heat, humidity, power, dust, etc., then why are we not replacing 2950s as well? There are 4-10 of them for every 3750 stack we have. We do replace them, but at a rate of less than 1%. If it’s the environment, then the 2950s are English hooligans to the 3750s’ French aristocracy. Maybe it’s sabotage. I still don’t know after years of watching RMA after RMA come in.
I have noticed one pattern, though. The only deployments of 3750s that have never had a problem are in data centers. They seem to love any room that has an ambient temperature of 62F (16C) with less than 40% humidity and large volumes of air flow. If only we could install micro-data centers in all our closets, then I would be a happy network dude.
Send any wooden shoes questions my way.
Edit: I went back and checked our TAC cases to see what switches we actually replaced. It turns out that we’ve done 19 replacements, and they’ve all been 3750G-12S-S switches.
For those that don’t know, when I say “stack”, I mean a group of 3750s connected together using the StackWise technology. When you use a very expensive and very proprietary cable, your individual switches are combined into a single logical device. This means you configure one device to control potentially many switches.
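To make that concrete, interface names on a stack include the stack member number, so from the single console (or management IP) you address ports on any member. The descriptions below are just made-up examples:

Switch(config)# interface GigabitEthernet1/0/1
Switch(config-if)# description uplink to access switch A
Switch(config-if)# interface GigabitEthernet2/0/1
Switch(config-if)# description uplink to access switch B

GigabitEthernet1/0/1 lives on stack member 1, and GigabitEthernet2/0/1 is the same port position on member 2.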
To the point. I’ve spent the last few weeks replacing a mess of 3750s in stacks. These guys are very easy to replace, but the big problem I find is getting the IOS version in sync. When the RMA comes, it’s inevitably got a different version on it, and you’ll see something like this.
Switch# Role    Mac Address     Priority   State
--------------------------------------------------------
 1      Member  0023.33ad.a500      1      Version Mismatch
*2      Master  0023.5eac.e900     15      Ready
In this case, switch 2, running 12.2(25)SEE3, is the master, and switch 1, running 12.2(35)SEB, is the new member.
A switch in the stack needs to run the same version of IOS as the master to be brought into the fold, so you’ve got to get the code onto the switch somehow. You can use traditional methods to get the right code on the box (like TFTPing one up, roughly as shown below), but there are easier ways.
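For reference, the TFTP route is just a normal copy from your server to the replacement member’s flash. The server address here is made up, and the image name matches the one in the output later in this post:

Switch# copy tftp://192.0.2.10/c3750-ipbasek9-mz.122-25.SEE3.bin flash1:/c3750-ipbasek9-mz.122-25.SEE3.bin

You still have to make sure the boot variable for that member points at the new file afterward.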
If the code versions are close (I’m not sure exactly how close they have to be), you’ll be able to see the flash of every switch in the stack from the master. If you do a dir ?, you’ll see flash1:, flash2:, etc. Those are the individual flash filesystems for each switch in the stack, numbered by stack member, so all you have to do is copy the IOS image over from the master. Well, maybe.
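In the simplest case, a copy like this from the master drops the image onto the replacement’s flash. The source path here matches the directory layout shown in the archive output further down; adjust it to wherever the image actually lives on your master:

Switch# copy flash:/c3750-ipbasek9-mz.122-25.SEE3/c3750-ipbasek9-mz.122-25.SEE3.bin flash1:/c3750-ipbasek9-mz.122-25.SEE3.bin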
I found that a lot (maybe all) of my 3750s actually keep the image inside a directory along with the HTML files for the device manager. You don’t need the HTML files for the switch to function in the stack, but having the IOS image in different places on different switches forces you to change the boot variable for each one. I like consistency, so I put everything in the same place when I can, and that means creating the directory by hand.
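If you want to mirror the master’s layout by hand, the sequence looks roughly like this. The image and directory names are the ones from my switches, and double-check the per-switch form of the boot system command on your IOS release before trusting it:

Switch# mkdir flash1:c3750-ipbasek9-mz.122-25.SEE3
Switch# copy flash:/c3750-ipbasek9-mz.122-25.SEE3/c3750-ipbasek9-mz.122-25.SEE3.bin flash1:/c3750-ipbasek9-mz.122-25.SEE3/c3750-ipbasek9-mz.122-25.SEE3.bin
Switch# configure terminal
Switch(config)# boot system switch 1 flash:/c3750-ipbasek9-mz.122-25.SEE3/c3750-ipbasek9-mz.122-25.SEE3.bin
Switch(config)# end
Switch# write memory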
That method works fine, but there’s an easier way using the archive command. You can tell your switch to tar, copy, and extract all of the files from one switch to the others by giving it one of these.
archive copy-sw [ /destination-system X ] Y
I used this command to copy the image from switch 2 (the master) to switch 1 (the replaced member) and got a whole bunch of output. The whole process took about 3 minutes.
Switch#archive copy-sw /destination-system 1 2
System software to be uploaded:
System Type: 0x00000000
archiving c3750-ipbasek9-mz.122-25.SEE3 (directory)
archiving c3750-ipbasek9-mz.122-25.SEE3/c3750-ipbasek9-mz.122-25.SEE3.bin (7206732 bytes)
archiving c3750-ipbasek9-mz.122-25.SEE3/html (directory)
archiving c3750-ipbasek9-mz.122-25.SEE3/html/toolbar.js (7084 bytes)
archiving c3750-ipbasek9-mz.122-25.SEE3/html/title.js (577 bytes)
...
extracting c3750-ipbasek9-mz.122-25.SEE3/html/images/141280.gif (3053 bytes)
extracting c3750-ipbasek9-mz.122-25.SEE3/html/images/meter_yellow.gif (59 bytes)
extracting c3750-ipbasek9-mz.122-25.SEE3/html/images/legend_off.gif (1158 bytes)
extracting c3750-ipbasek9-mz.122-25.SEE3/info (682 bytes)
extracting info (108 bytes)
Installing (renaming): `flash1:update/c3750-ipbasek9-mz.122-25.SEE3' -> `flash1:c3750-ipbasek9-mz.122-25.SEE3'
New software image installed in flash1:c3750-ipbasek9-mz.122-25.SEE3
All software images installed.
I left out a large chunk, but you get the idea. If you’ll notice, all the HTML files are copied along with the IOS image, so you get exactly the same structure on all your switches. It beats doing it all manually.
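One last thing: archive copy-sw on its own doesn’t reload anything, so the member keeps showing Version Mismatch until it restarts on the new code. Something along these lines should finish the job (check that the reload slot form is supported on your release):

Switch# show boot
Switch# reload slot 1

Once member 1 comes back up, show switch should report it as Ready instead of Version Mismatch.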
Send any RMA packages questions my way.