Catalyst 3750s – Bad Luck with a Cisco Logo
Last week, @fletcherjoyce posted an article on his blog about his positive experiences with Cisco’s 3750 switches. If you follow my complaints tweets, you know that I’ve had quite the opposite experience with them. I would never pick on anyone, but I had to throw in my 2 cents.
I’m guessing here, but we have about 50 3750 stacks in the enterprise. Most of them are pairs, so you wind up with roughly 120 switches. Since we’ve done about 20 replacements over the last 5 years, that means we have a 17% failure rate. That’s pretty horrible, isn’t it?
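For anyone who wants to check that arithmetic, here is a minimal Python sketch. It uses the rough counts from the paragraph above, which are guesses rather than inventory data, and the failure_rate helper is just a name made up for illustration.

```python
def failure_rate(replaced, installed):
    """Return replacements as a percentage of installed switches."""
    if installed <= 0:
        raise ValueError("installed must be positive")
    return 100.0 * replaced / installed

# Rough figures from the post: about 20 replacements out of about 120 switches.
rate = failure_rate(20, 120)
print(f"{rate:.1f}% of switches replaced over 5 years")

# Spread evenly over the 5 years, purely as an average.
print(f"{rate / 5:.1f}% per year on average")
```

Keep in mind this counts replacements against the installed base, so a switch that was swapped more than once is counted each time.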
For the most part and with few (if any) exceptions, we use the 3750s as aggregation points for our access switches. We don’t do QoS on them. We don’t do any access control on them. We don’t even do routing on them. They’re simply used to connect all the access switches in the closet to the core, so they’re not doing anything funky or burdensome. The CPU and memory are always well within normal operating parameters. They just fail and fail repeatedly.
The flies started dropping in closets at our corporate headquarters a few years ago. It was the middle of summer, and the temperatures kept rising to over 90F (32C) until we lost 3 switches in 3 weeks. If you could stand to make it into the closet, you could feel that the sheet metal of the switches was hot enough to make you pull your hand back! When the facilities team added more cooling, the temperatures dropped to around 82F there (28C), but we continued losing switches. I figured the newly-failed switches were feeling the effects of the earlier heat wave and were just getting around to giving up the ghost. Surely the heat was the culprit.
A few months after our headquarters meltdown, a tech for a satellite office called and asked if we could help with some latency issues. He showed me the switch stacks throughout the building, and I noticed that only one of the 10 switches actually had a label. The tech said that he never got around to relabeling them after they were replaced. Some, he said, had been replaced multiple times. The closets were running about 76F (24C), so heat didn’t seem to be the problem at this location. The closets were clean as a whistle, and everything in the racks was on building UPS. I couldn’t find a pattern at all. For the record, all their latency issues were related to two unrelated 3750s. Two RMAs later, and their problems were gone.
I’ve been trying to find patterns for the failures, but I can’t think of any. If it’s heat, humidity, power, dust, etc., then why are we not replacing 2950s as well? There are 4-10 of them for every 3750 stack we have. We’re replacing them, but it’s a rate of less than 1%. If it is environment, then the 2950s are English hooligans compared to the 3750s being French aristocracy. Maybe it’s sabotage. I still don’t know after years of watching RMA after RMA come in.
I have noticed one pattern, though. The only deployments of 3750s that have never had a problem are in data centers. They seem to love any room that has an ambient temperature of 62F (16C) with less than 40% humidity and large volumes of air flow. If only we could install micro-data centers in all our closets, then I would be a happy network dude.
Send any wooden shoes questions my way.
Edit: I went back and checked our TAC cases to see what switches we actually replaced. It turns out that we’ve done 19 replacements, and they’ve all been 3750G-12S-S switches.
But what does the 3750 crash actually mean? Is it a power supply failure?
btw: I have a lot of 3750 switches in my environment and I haven’t had any crashes in 2 years.
Hey, Neferith. The problems show up differently, but mostly we see ports slowly stop passing traffic until users complain or we notice the errors. The switch always stays up, and we never see a crash or anything. The fix from Cisco is always an RMA.
The 3750G in a stack configuration is one of the few Cisco products that make me twitch every time I see it. Failures relating to the stacking cable were the dominant issue, where one of the stupid stack ports just quit working. Reload the stack, might come back for a while, then go away. Or maybe go up, and go down. And then back up. And then go down again. Enough to drive you mad with the SNMP traps until you go in and unplug the bad port, which incidentally also cuts your stack redundancy in half. Split brain scenario appeal to anyone? No? Okay, how about splitting your cross-stack etherchannel uplinks to your core switch into two different physical switches because of a total stacking backplane failure? That scare you? Yeah…me too.
Solution? RMA after RMA. We replaced many 3750G stacks due to stacking port (not cable) failures, a tedious and risky process in a downtime intolerant data center. My Cisco circle tells me that newer stacking technology is oh-so-much more wonderful, but you won’t catch me using it. Fool me once…
I have never had this kind of problem with any of the 3750G stacks. Even had stacks of 7.
The only thing I don’t like are the SFP connectors.
Some might lose connection when you hardly even touch them.
They’ve been pretty stable from an RMA perspective in my environment. But like most things Cisco, we’ve run into some fun bugs with them. Cross-stack etherchannels didn’t keep it from functioning but damn if it didn’t introduce performance problems. I’m about ready to deploy some of the new 3750-X switches, which use the new “StackWise Plus”. Someone tell me everything’s going to be OK….
Not that this is the source of your problem, but Cisco does have bad batches of hardware every now and again. A few years ago, I had a bad batch of 2821 routers that were all purchased at the same time. Their fans started failing, but it wasn’t always the same fan. Cisco replaced the entire chassis each time we opened a TAC case. Eventually, we opened up 1 big case with TAC when we had about 10 or so go bad around the same time.
I have replaced my share of 3750s due to various problems. The 3750G-12S sticks out in my mind in particular. I think I have had to replace 2 or 3 of these in the past couple of years. It has always been a case of the switch going completely dead, almost as if the power supply completely fails. It doesn’t even spin up. Perhaps the entire line is a “bad batch”?
I have had the same issue with 3750s failing in our corporate and remote offices. Guess where we don’t have any issues? Our data center. The only difference between our DC and closet 3750s is that we have PoE in all closets and not in the DC. I really think the heat produced by the PoE components is too much for these things.
We started leaving 1U between each switch that has PoE last year and our failure rate has dropped significantly. We were replacing a switch every month, and after the 1U spacing we are down to about 1 or 2 over the last 12 months. We currently have roughly 80-90 3750s with PoE in our environment.
I work as a consultant in the Service Provider space in Australia, and I have a particular dislike for the 3750 series after an incident in January.
Scene: I was on my last night at the NZNOG conference and the beer was about to start flowing. First pint was poured when I got an alert about a stack “acting odd”. CPU usage was high (compared to normal operating mode), with a large amount of “IP Input” load. Very little information was evident in the log except for a reference to a TCAM problem. “sh platform tcam utilization” suggested that the router had reached its limit of “unicast indirectly-connected routes”. This customer is a hosting provider with over 50 switches and no summarisation due to their network addressing plan 🙁 – oh and did I mention they were currently experiencing a (small) DDoS to a host on this switch at the time?
NOTE: Best time to run into a problem like this is when you have a lot of other service provider engineers around – even if they are drunk!
After much investigation I found an obscure reference to a possible problem on a 3750 stack that would only manifest on a 3750 *in* a stack *with* the IPv6 SDM modification (yes, it’s kind of specific). In this situation the router chooses which routes to kick out of the FIB and, instead of losing the most specific routes, would drop 0.0.0.0/0 (not what you want during a DDoS), thus causing all traffic external to the AS to be process switched.
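To show why losing the default route hurts so much, here is a minimal Python sketch of longest-prefix matching using only the standard ipaddress module. The prefixes and the lookup helper are hypothetical, not taken from the customer's stack, and real hardware forwarding is far more involved than a list scan.

```python
import ipaddress

def lookup(fib, destination):
    """Return the most specific prefix in fib covering destination, or None."""
    addr = ipaddress.ip_address(destination)
    matches = [net for net in fib if addr in net]
    if not matches:
        # No matching entry in the table: in the incident described above,
        # this is the traffic that ended up being process switched.
        return None
    return max(matches, key=lambda net: net.prefixlen)

# Hypothetical forwarding table: two internal routes plus the default route.
fib = [
    ipaddress.ip_network("10.1.0.0/16"),
    ipaddress.ip_network("10.1.2.0/24"),
    ipaddress.ip_network("0.0.0.0/0"),
]

print(lookup(fib, "10.1.2.5"))     # internal host: the most specific route wins
print(lookup(fib, "203.0.113.9"))  # external host: only the default route matches

# Evict the default route, as the stack did, and try the external host again.
fib_without_default = [net for net in fib if net.prefixlen != 0]
print(lookup(fib_without_default, "203.0.113.9"))
```

Evicting a single /24 would only affect hosts in that prefix, which still fall back to a covering route. Evicting 0.0.0.0/0 leaves every external destination with no entry at all.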
Short term fix was to change back to an IPv4 only SDM profile (with the associated reboot which my customer loved!). Longer term solution was to move this switch into its own OSPF stub area to reduce the total number of routes on the box.
Lesson: Don’t touch 3750s if you can help it!
Epilogue: All was not lost, after this fix I returned to drinking, and won the Casino night competition (funnily enough the prize was a flight from NZ to Australia for the AusNOG conference in Sydney… shame I live in Sydney!).
We’re finding out ourselves, along with the general IT community, that there are upgrade issues when 3750 switches are stacked. The switch will boot after a software upgrade, but not with the new image, because the wrong command was entered. Some of these wrong commands are “/leave-old-sw” and “archive download-sw/leave-old-sw”, which will lead the switches to boot with the old software. There is a simple solution for upgrading the software on all WS-C3750 switches in a stack: the command “archive copy-sw”. For more information about this issue, please refer to our blog. http://www.ccnytech.com/blog/Cisco3750/