  1. #1
    Platinum Lounger
    Join Date
    Dec 2000
    Location
    Hornsby Heights, New South Wales, Australia
    Posts
    3,822
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Lounge Server Crash Explanation

    A lot of people emailed me about their inability to contact the wopr site a couple of days ago. This all happened during my sleeping hours, so I wasn't aware of it until the following morning. For those of you who are interested, here is the technical explanation provided by the ISP as to what happened:

    <hr>Earlier today, our central core Gateway router (HUGGIN) experienced a soft lockup due to a driver failure on our high speed Intel Quad ether card... This card feeds out to our two frontside routers handling the DS-3 links, NOVA and IKONOS... The default Gateway NIC itself is a separate built-in Intel 82558 port... The Quad card was only letting every 3rd, 5th, 7th, or 9th (random) packet through and dropping all the rest, while the gateway NIC itself was accepting all packets... This is what prevented the failover system from kicking in and taking over the gateway IPs... BGP at this point had become my worst enemy as trying to mix static with dynamic routing is an experience better left for campfires with your enemies...
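
    A minimal sketch of the soft-failure problem described above, assuming a monitor that probes both the gateway NIC and the forwarding path through the Quad card. The class name, window size and 20% loss threshold are hypothetical and not taken from the ISP's setup. The point is only that a check based on measured loss can notice a card that passes every 3rd packet, where a simple up/down check on the gateway NIC cannot.

```python
from collections import deque


class PathHealth:
    """Track probe results for one path over a sliding window (hypothetical helper)."""

    def __init__(self, window=20, max_loss=0.2):
        # Keep only the most recent `window` probe results.
        self.results = deque(maxlen=window)
        # Loss ratio above which the path counts as unhealthy (assumed value).
        self.max_loss = max_loss

    def record(self, replied):
        """Store one probe result: True if a reply came back, False if it was lost."""
        self.results.append(bool(replied))

    def loss_ratio(self):
        """Fraction of probes in the window that got no reply."""
        if not self.results:
            return 0.0
        return 1.0 - sum(self.results) / len(self.results)

    def is_healthy(self):
        return self.loss_ratio() <= self.max_loss


def should_fail_over(gateway_nic, forwarding_path):
    """Fail over if either the gateway NIC or the forwarding path is unhealthy."""
    return not (gateway_nic.is_healthy() and forwarding_path.is_healthy())


# The gateway NIC answers every probe; the forwarding path answers every 3rd.
nic = PathHealth()
quad = PathHealth()
for i in range(20):
    nic.record(True)
    quad.record(i % 3 == 0)
```

    With these inputs the NIC alone looks fine, but `should_fail_over(nic, quad)` is true because the forwarding path is losing well over 20% of its probes.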

    I was able to easily work with the two frontside routers, but could not get anything through because the switch has security controls that block 'casual' IP to MAC changes to prevent spoofing... Trying to get into the switch was next to impossible because it still believed that HUGGIN was working properly, therefore shoving all its outbound packets back through HUGGIN, where they in turn got dropped on the floor... The same goes for the remote power control equipment...

    I was finally able to get the NOVA router to route to the inside via a backup (DOROTHY) switch link and finally climb into our primary switch (OZ) so that I could shut down the primary gateway ether port... This allowed me to start populating the failover ether link on NOVA with all the gateway addresses... Getting to this part alone was a long and winding road due to security measures in place...

    Once HUGGIN was offline, an unfortunate turn of events happened that caused the gated processes (the BGP daemons) to go into a 'D' (locked) state where they were no longer responding... This caused our advertised IPs to fall off the net, creating an even larger mess... I had to shut down all my connections and restart going in through 'alternative' paths to the front routers... Once back in, and wrestling with the gated daemons, I finally got them to die off... At this point I decided to shut down the Qwest side of the link, and take our InterNAP link solo... When bringing back up our InterNAP BGP, the daemon proceeded to flap the daylights out of InterNAP for which we got rightfully squelched by them... After a few cool down periods, I was able to get a very small handful of nets advertised, but it was still not enough...

    This is when Kevin entered the scene and we went through a complete dump of all the routers + BGP daemons and slowly brought everything back up piece by piece... HUGGIN was responding properly again, however the BGP daemon was not advertising to the iBGP properly, therefore the front routers were not pushing them out to the net... After a few good swift kicks to HUGGIN's gated, the iBGP was once again back alive... In turn the front routers tried to advertise, however they were stopped dead... My summation was that we were still under heavy squelching from our Qwest and InterNAP uplinks... I then backed out all of our nets, and waited another 10 minutes to let things cool off... At that point, we started *slowly* advertising all of our nets one by one, with only a small handful of nets that were on hard squelch... The ones that didn't come up had to be manually released by the upstreams to allow them to be advertised properly...
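
    The 'squelching' and cool-down periods above read like BGP route flap dampening, though the post does not say so explicitly. Assuming that is what the upstreams were applying, the sketch below shows the usual idea: each flap adds a penalty, the penalty decays exponentially with a fixed half-life, and a suppressed route is only accepted again once the penalty drops below a reuse limit. The 15-minute half-life and reuse limit of 750 are illustrative values, not the actual Qwest or InterNAP settings, and the function names are hypothetical.

```python
import math


def decayed_penalty(penalty, elapsed_min, half_life_min=15.0):
    """Penalty remaining after `elapsed_min` minutes of exponential decay."""
    return penalty * 0.5 ** (elapsed_min / half_life_min)


def minutes_until_reuse(penalty, reuse_limit=750.0, half_life_min=15.0):
    """Minutes of quiet needed before the penalty falls to the reuse limit."""
    if penalty <= reuse_limit:
        return 0.0
    # Solve penalty * 0.5 ** (t / half_life) == reuse_limit for t.
    return half_life_min * math.log2(penalty / reuse_limit)


# Three flaps at an assumed 1000 points each leave a penalty of 3000.
remaining = decayed_penalty(3000.0, 15.0)   # one half-life later
wait = minutes_until_reuse(3000.0)          # quiet time until reuse
```

    Under these assumed numbers a route that flapped three times needs two half-lives of silence before it is accepted again, and any further flap during the wait pushes the penalty back up, which is one reason backing all nets out and re-advertising slowly can work better than retrying immediately.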

    Overall, our routers are designed to be failure resilient, with BGP4... In this case, the 'failure' was a soft failure and HUGGIN gave enough indication that it was still operational that failover procedures did not kick in... It is sort of an all-or-nothing situation - with current technology... The next release of our router's software will include VRRP, which is supposed to better handle the switching over of the default Gateways as it's built into the protocol itself... What I have now emulates this as closely as possible, yet the tolerances were unfortunately set too high (to avoid false positives) and ended up stonewalling me from making a clean cut without completely hosing up the switches' ARP tables and MAC lockdowns...
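
    For readers curious about the tolerance trade-off mentioned above, here is a small sketch of how VRRP (as specified in RFC 3768) decides that the current master is gone. A backup waits for the Master_Down_Interval, which is three advertisement intervals plus a skew that depends on its own priority. The function names are hypothetical, and this says nothing about how the ISP's router software will actually be configured.

```python
def skew_time(priority):
    """Skew in seconds; higher-priority backups get a shorter skew and react first."""
    return (256 - priority) / 256.0


def master_down_interval(priority, advert_interval=1.0):
    """Seconds a backup waits without hearing an advertisement before taking over."""
    return 3 * advert_interval + skew_time(priority)


# Two backups with different priorities and the default 1-second advertisements.
high = master_down_interval(254)
low = master_down_interval(100)
```

    The higher-priority backup times out slightly earlier, so it claims the gateway addresses first. Shortening the advertisement interval makes failover faster but leaves less margin before a few lost advertisements trigger a false positive.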

    I have dubbed this the 'Doomsday Scenario'... I will definitely be making a pile of alterations to our router configurations to ensure I don't get stonewalled like this again now that I see how the cascading events all intertwined with one another creating a huge granny knot 6 layers deep... It was the epitome of every solution ending up creating another problem...

    The actual routing/switch details are deeply technical though I really tried to make the above as non-technical as possible while still covering as much as I could... Hopefully this will explain the cause of the failure, and provide you with the assurance that this situation is not being taken lightly by the FutureQuest Team... I will do everything in my power to prevent this from happening in the future, now that I understand the personality and behavior of this unfortunate event...

    Our sincerest apologies for the hardship and inconvenience this routing outage caused!
    <hr>
    So, there you have it (smile)
    Cheers, Claude.

  2. #2
    Uranium Lounger
    Join Date
    Dec 2000
    Location
    Salt Lake City, Utah, USA
    Posts
    9,508
    Thanks
    0
    Thanked 6 Times in 6 Posts

    Re: Lounge Server Crash Explanation

    What a load of baloney. Obviously they all went out and got drunk as Lords, and they don't want to confess.

    (grin)
    -John ... I float in liquid gardens
    UTC -7±DS

  3. #3
    Plutonium Lounger
    Join Date
    Oct 2001
    Location
    Lexington, Kentucky, USA
    Posts
    12,107
    Thanks
    0
    Thanked 1 Time in 1 Post

    Re: Lounge Server Crash Explanation

    You're a Pistol, John. Drunk as Lords???

  4. #4
    Uranium Lounger
    Join Date
    Dec 2000
    Location
    Salt Lake City, Utah, USA
    Posts
    9,508
    Thanks
    0
    Thanked 6 Times in 6 Posts

    Re: Lounge Server Crash Explanation

    Good British expression, BigAl. Drunk as a Lord is the first appropriate explanation I found from Googling.
    -John ... I float in liquid gardens
    UTC -7±DS

  5. #5
    5 Star Lounger
    Join Date
    Jan 2001
    Location
    Cumberland, Maryland, USA
    Posts
    880
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Re: Lounge Server Crash Explanation

    My favorite part is "I really tried to make the above as non-technical as possible while still covering as much as I could. . . ." Really??!!

  6. #6
    Uranium Lounger viking33's Avatar
    Join Date
    Jun 2002
    Location
    Cape Cod, Massachusetts, USA
    Posts
    6,308
    Thanks
    0
    Thanked 1 Time in 1 Post

    Re: Lounge Server Crash Explanation

    (claude) All this happened during your sleeping hours? Don't you have 150 dB crash alarms in every room? Two in the bedroom?

    As for the rest of your post...don't worry about it, it happens to all of us once in a while, as we all can attest. Well familiar with that problem! (dizzy) (laugh)

    Bob
    BOB


    Long ago, there was a time when men cursed and beat on the ground with sticks. It was called witchcraft.
    Today it is called golf!

  7. #7
    Platinum Lounger
    Join Date
    Feb 2001
    Location
    Yilgarn region of Toronto, Ontario
    Posts
    5,453
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Re: Lounge Server Crash Explanation

    Good post. It's important to remember that there are people who can do/understand this.

    The best part of reading this post through from start to finish was learning (yet again) what I sound like when I'm explaining a simple one-off patch via VBA to fix a corrupted whatchamacallit file. Your post ranks right up there with attending a seminar, any seminar, once every three months to see how boring I might be to my students (grin!)

  8. #8
    4 Star Lounger
    Join Date
    Dec 2000
    Location
    London, Ontario, Canada
    Posts
    437
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Re: Lounge Server Crash Explanation

    I was right with him there up until Dorothy climbed into OZ. We're obviously not in Kansas anymore, Toto. (puppy)

  9. #9
    2 Star Lounger
    Join Date
    Apr 2001
    Location
    Coppell, Texas, USA
    Posts
    168
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Re: Lounge Server Crash Explanation

    (laugh) You got that far?

    I noticed that all the replies are from serious lounge lizards, three-stars and way up. Everyone else must have bailed...

  10. #10
    5 Star Lounger
    Join Date
    Mar 2001
    Location
    Lorain, Ohio, USA
    Posts
    953
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Re: Lounge Server Crash Explanation

    I read it, but didn't understand a single word...so I didn't think a reply from me was worth the space. (evilgrin)

  11. #11
    Plutonium Lounger
    Join Date
    Nov 2001
    Posts
    10,550
    Thanks
    0
    Thanked 7 Times in 7 Posts

    Re: Lounge Server Crash Explanation

    > I read it, but didn't understand a single word...

    The single words made sense, but none of the sentences did!

    StuartR

  12. #12
    Banned Member
    Join Date
    Jul 2002
    Location
    Newport Richey, Florida, USA
    Posts
    2,149
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Re: Lounge Server Crash Explanation

    Oh yeah, I understood every other paragraph. Or at least the part that started with "The". After that I was bored to tears. Let me explain it for him. "It was down due to SH** Happens and now it's back up again." See how easy that was.

  13. #13
    Platinum Lounger
    Join Date
    Dec 2000
    Location
    Queanbeyan, New South Wales, Australia
    Posts
    3,730
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Re: Lounge Server Crash Explanation

    (rofl) (rofl) (rofl) (rofl)
    Subway Belconnen- home of the Signboard to make you smile. Get (almost) daily updates- follow SubwayBelconnen on Twitter.

  14. #14
    4 Star Lounger
    Join Date
    Aug 2002
    Location
    Dallas, Texas, USA
    Posts
    594
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Re: Lounge Server Crash Explanation

    Made perfect sense to me. Of course after reading it, I thank my lucky stars that my co-worker in our two-man IS Department is the one that programs our router! (grin)

    I wish my ISP gave me that much detail. Instead, I get 'IP Communications went bankrupt, and turned off their network. We don't know what to do, but we're working on the problem as we speak.' Got that for about a week. I had to wonder if anyone at the ISP knew what DSL actually meant! (grin)

    My personal favorite is when I first got DSL. I was given my IP Address (static), and at the time, I was running the Beta 3 version of Millennium Edition. (Don't ask me...I guess I was feeling too chipper.) Anyhow, I called their tech support to get the DNS addresses, since they didn't send those along with the IP Address. The 'technician' didn't know what I was asking for (must not think about what he reads all day long), and asked what OS I was using. I told him Windows ME, and I got a long silent pause on the other end. 'I don't know what that is, you'll need to call Microsoft.' Call Microsoft? To get the DNS servers for MY ISP? I managed to talk the guy into reading the script he had for Windows 98, until he finally got to the DNS part.....egads.......

    Oh what fond memories.....
