Lessons Learned from a Bad Day

I had a really, really bad day this past Tuesday.  I mean, a really bad day.  I guess I should have seen it coming since the last #stabbytuesday was uneventful.  Here's what the cosmos had in store for me and the lessons I took away.  Most of these are things we've all lived through before, but, for various reasons, I got blindsided.  I expected more from myself.

First of all, I drove the 2 1/2 hours from my home office to the headquarters to get some work done.  We have a large migration going on this weekend, and I didn't want the guys there to have to do all the hands-on work.  I planned to get some switches installed and do some cabling so the systems guys could just plug in when the servers landed in the data center.  None of that got done thanks to the rest of the karmic retribution.

Moral of the story:  Give yourself some additional time to do projects just in case something happens.

Over the past weekend and into the week, we ran into an issue with a 48-port GE module on a 6513.  It kept reloading, and a TAC engineer figured out that we were seeing a bug that would cause the module to reload every 12 hours (don't ask me why it was showing up after years in the same configuration).  Each time it went down, we'd lose around half of our servers.  Bear in mind that we have a pair of 6513s and that each server is cabled to each switch for redundancy.  We shouldn't have seen any servers go down; they should have simply flipped over to the other NIC and kept trucking.  As you probably figured out, the bonding/teaming configurations on some of them were either wrong or missing, so the $100k+ switch we bought for redundancy was just a big paperweight.

Moral of the story:  Having redundant network gear is worthless if the servers can't handle a lost link.
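For the curious, here's a minimal sketch of the kind of NIC failover configuration that was missing — assuming Linux servers and Debian-style networking; the interface names and address are hypothetical:

```
# /etc/network/interfaces -- hypothetical example, Debian-style
# active-backup bonding: if the link to one 6513 dies, traffic
# moves to the other NIC with no switch-side config needed
auto bond0
iface bond0 inet static
    address 10.0.0.10
    netmask 255.255.255.0
    bond-slaves eth0 eth1       # eth0 -> 6513 #1, eth1 -> 6513 #2
    bond-mode active-backup
    bond-miimon 100             # poll link state every 100 ms
    bond-primary eth0
```

The point isn't the exact syntax (Windows teaming and other distros differ); it's that this config has to exist, and be tested by actually pulling a cable, before the redundant switch earns its keep.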

When we got back from lunch, the module had rebooted again, and we lost all those servers and services.  This time, however, the module didn't come back.  We reloaded the module via software but we couldn't make any headway.  It would come up with a status of "Other" and then reboot again.  We ran down to the data center to reseat it, but that didn't help.  It absolutely refused to come up, so we called our TAC engineer and had him generate an RMA for us.

Moral of the story:  Modules die all the time.  You have to have a plan to deal with it.
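If you ever find yourself in the same dance, these are the sorts of IOS commands involved on a 6500 (the slot number here is hypothetical, and exact syntax varies by IOS version):

```
! Check the status column for the suspect slot ("Ok" vs "Other")
show module
show diagnostic result module 3

! Software reset of just that slot
hw-module module 3 reset

! If it still won't come up, power the slot off before pulling the card
no power enable module 3
```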

We tried one more time to reset the module in software but we lost our SSH connection to the 6500.  Nothing but solid red lights on the supervisor.  The console didn't respond at all, so we wound up pulling power to reboot the whole switch.  It came back up, allowed us to log into the console, and then crashed.  It did this over and over until we physically removed the bad module.  This wasn't exactly an easy feat.  The switch isn't fully populated, but all the 48-port cards were right on top of one another.  This made the cabling very, very dense and hard to maneuver through.  The cables were neat and tidy, mind you, but it's hard to keep one bundle above the card without snagging it, and there is always the one cable that has to be different and not follow the rest of the structured cabling.  Of course, that one winds up cutting right across the module you need to get out.

Moral of the story:  Your switch may be modular, but each module needs the supervisor.  Break the sup, and nothing works.
Moral #2:  If the cabling weighs more than you do, you may want to spread it out a bit.

We finally got the 6500 more-or-less stable, but there are 48 ports missing.  Of course, all the servers are still down.  The servers aren't configured to use their other NICs, and the RMA is 4 hours out; what do we do now?  I said we had to wait, but upper management declared an emergency and said that it needed to be back up as soon as possible.  Remember the switches I was going to install for the migration?  Yeah…those are installed as replacements now.  There go those plans, eh?

Moral of the story:  If you can't wait until the RMA gets there, you better have spare parts on hand.

The replacement switches are actually a pair of 3750s stacked together.  As you know from my previous rants on them, I absolutely hate the 3750s.  These were the only ones we had, though, so we configured them and went into the data center to install them.  That's when we noticed that there was no appropriate power in that rack.  The 6513s had power, but it was an L21-30 (or something similar; I didn't look at the plugs), which doesn't exactly help when you have standard connectors.  We found a power outlet in the floor a few feet down the way and wound up putting in a power strip to get power into the rack.

Here's our Senior Network Engineer doing his part to help.  The pic pretty much sums up the whole day.  Upside down.  Cold.  Parts and cables all over the place.  He looks unconscious.

There was another problem, though.  There was no room to mount the new switches.  There was only room for the patch panels and the 6513 with a single rack unit of space at the bottom.  Luckily, we were able to finagle the switches into the bottom of the rack using that single U and the extra space on the floor.  Once powered up, we got the cables plugged in and checked the systems that we could while our boss walked around to get status from the different departments.  Everything was finally fine.

Not really.  While we were doing our checks, someone sent an email saying that one of the major systems was down.  We had just had a major outage, and all hands were in the data center helping where they could.  Do you really think anyone checked their email?  After about an hour and a half, we finally saw the problem reports and went back at it to figure out what was wrong.

Moral of the story:  Use a proper trouble reporting and escalation system.  Email is not it.
Moral #2:  Don't be so inflexible in your infrastructure planning that a rack can only serve one function.  That includes power, cabling, and rack arrangement.

I totally take credit for this one.  It turns out that VTP had slapped us in the face.  I configured the switches and didn't even check the VTP configuration in the rush to get all the servers back up.  There were other contributing factors, though, that were beyond me.  First of all, the 6513 that we trunked up to had a null VTP domain in transparent mode, but the other 6513 had a domain in server mode.  When I plugged in my 3750 stack (also in server mode), down went the VLANs.  If you know how VTP works, you know that the domains have to match in order to give it the opportunity to take down the network.  It turns out that the 3750s and the other 6513 (the healthy one) were configured with the same VTP domain.  It's not our standard practice to use the same VTP domains everywhere, but, for some reason, the other 6513 was configured with the VTP domain of another location.

Moral of the story:  Go through your configuration checklist no matter what.  If I had taken 5 more minutes in configuration, we would have saved over an hour of downtime.
Moral #2:  Just because two devices are supposed to be configured the same doesn't mean that they are.  Always check.
Moral #3:  Be consistent in your configurations.  Most configurations should be predictable.
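A hedged sketch of the five minutes that would have saved the hour — the domain name below is made up; the idea is simply to check VTP before trunking any switch into production:

```
! On the new switch, BEFORE connecting the trunk:
show vtp status          ! check domain name, mode, and config revision

! Safest stance for a replacement switch: transparent mode
! (which also resets the configuration revision to 0)
conf t
 vtp mode transparent
 vtp domain SITE-A       ! hypothetical domain name
end
```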

Finally, after about 6 hours or so, everything was up and running.  The replacement module arrived, but management said that we couldn't take another outage to do the replacement.  I can understand that they don't want any more downtime and wasted time, so I have no problem with that.  There's an old adage, though, that says that nothing is temporary.  I'm not expecting to get those 3750s out of that rack for quite a while.

Oh, yeah.  The 2 1/2 hour drive back home sucked.

Send any stiff drinks questions my way.

Aaron Conaway

I shake my head around sometimes and see what falls out. That's what lands on these pages.


15 comments for "Lessons Learned from a Bad Day"

  1. November 11, 2010 at 11:06 pm

    That pic is priceless!!! 🙂 btw, sry about that awful day dude. 

  2. November 11, 2010 at 11:13 pm

    I feel so bad! And I thought the outages I had were terrible. But I congratulate you in getting the network up and running again in that amount of time. 
    Great tips provided, especially at the end about configurations.  Who knew that the sup would be a single point of failure and luckily you had some extra switches on hand. 

  3. Tyler
    November 11, 2010 at 11:52 pm

    Thanks for posting.  It's just nice to know that someone else has felt that same moment of dread and then spent the rest of the day working to get out of it.

  4. Nevot
    November 15, 2010 at 5:25 pm

    Reading this I found myself depicted. I run into these same issues two or three times. I know the deal…
    Regards!

  5. Clint Young
    November 17, 2010 at 3:07 pm

    Wow, I thought my Thursday sucked last week… (Well, it really did…)  Hit a bug on an ACE module, killing the Nitrox and crashing the blade.  Crashed, and traffic failed over to the standby blade.  Well, what do ya know…  continued SSL loads crashed the standby blade too.  The primary had been up for 40 seconds, but was still transferring data for  a few virtual contexts.  So when everything was back up, I had no configurations AT ALL for a few virtual contexts.  Oh $hit, everything for a few internet facing cages are GONE!  Then, I go to the tool that is supposed to automatically back up the configs, there were 3 configs missing.  And guess what, two of our 3 missing contexts were the "missing backup" winners!  Luckily I had created a TAC case about 30 days prior, and found a show tech and got a copy of the configs on there.  Found an audit history of all other changes we made and got those in there.  Whew… got this all done in about an hour.  And to know the real kicker?  We had upgrades scheduled for Saturday to fix this bug.  lol…    Good to hear I'm not the only one who has to deal with extreme disasters.

  6. Taoufik Belmekki
    November 22, 2010 at 11:54 am

    Some stories must be told 😉

  7. Dan Holme
    December 6, 2010 at 8:31 am

    Yeah tough one. The thing is, the best way to learn is the HARD way, unfortunate for us all but that's how it is.
    Not that it matters now, but how come you didn't just swap the uplink ports on the servers that were down so that their working NICs were plugged into the working 6500?

  8. December 6, 2010 at 9:10 am

    Thanks for the input, all.  It was a rough day for sure, and I hope someone can learn a lesson from it.

    Clint:  Wow.  Quite the story.  When I read it, I went to all my core network gear and manually backed up the configs just in case something like this happened.  I think I'll do it again just in case.  Heh.

    Dan:  The 6500s were full and had no room to add servers, so I guess I should have added one more moral of the story – if you have no room on the switch, you'll need more ports sooner than you think.  🙁

  9. Leo Song
    December 6, 2010 at 9:55 am

    I second your comment "Having redundant network gear is worthless if the servers can't handle a lost link.".
    It might be the best time for you have cold spare line cards, and put them on the shelf, fight budget to store those spare parts.
    And I hope all of your patch cords are running to right hand side, just in case you need to replace fan tray some day.

  10. December 6, 2010 at 10:39 am

    Leo:  We would only be so lucky to have all the cables on the right side.  The mass covers nearly the entire switch including the fan tray.  🙁

  11. December 7, 2010 at 5:39 am

    Aaron & Clint: Rancid is your friend 🙂 (http://www.shrubbery.net/rancid/) Not only does it handle configuration backup and revisioning, it can also be used to do bulk changes on several devices. And a lot more. (Did I mention it's free software?) 🙂

  12. JL
    December 11, 2010 at 1:41 pm

    Ouch, but this situation happens a lot. Enterprises should test the “redundant” systems regularly, but most never do, and so when something like this happens, there is panic and chaos.

  13. George Karavitis
    December 14, 2010 at 8:11 am

    For me, the top moral story is the following:
    When it comes to networks, assumption is the mother of (almost) all f*ckups

  14. SG
    October 28, 2011 at 12:26 am

    VTPv3 and Primary Server mode have solved a lot of the VTP-related switch incidents. 🙂
