My Psychotic Little Datacenter….
by Jesse on May.26, 2008, under General
Ok – I’m starting to think that it’s not just my imagination.
About an hour ago I had this great post written up – basically explaining what some may have noticed as a period of “instability” in the website over the past few days. I’ll try and recap.
On Friday I started having what can only be described as “random weirdness.” VMware hosts coming in and out, network issues, etc. (When your primary storage is NFS, network issues can be fatal.)
I *THOUGHT* I had narrowed it down to a bad card in one of my servers. I was up until about 4am working on that particular problem.
Saturday morning I wake up and *poof* the environment is gone again. I got it up by about 1:30pm by recycling everything, taking into account an hour or two to mow the back lawn, a chore that was about 12″ of growth overdue.
Things work fairly well for the rest of the day. Then we sit down to watch a movie and it blows up *AGAIN*. Well I’m getting good at cycling the systems, so this time I work on it from 11p – about 1:30a. All back up and seemingly purring along, I go to bed…
Well according to the logs, it crashed at 1:47am this morning. Now my wife regularly jokes with me about the co-dependent relationship I have with my computers – and this is the point at which I realize that it’s gone from co-dependent to stalking. My little server-rack has developed a personality of its own, and just doesn’t want me to leave.
So about 2 hours ago, I just finished writing a great posting about how I’d really like to know if anyone would like to trade a “datacenter-in-a-rack” for a slightly used abacus….and as I click “Publish” I get a weird database error. I log into the webserver (which is still, amazingly enough, accepting logins) and I tail the message log to find that the MySQL database has crashed…because something (or someone) marked the filesystem read-only…..
Ok, now it’s spooky. In fact I’m going to have to go with very spooky.
P.S. – Through a dexterous shuffling of hard drives, the VMware host that was “Argos” is now “Goliath,” my storage server. That leaves me with two VMware hosts instead of three while I try to figure out what in the heck was going wrong. Near as I can figure, the Dell PowerEdge 2650 that was my storage server may be starting to have the computer equivalent of a nervous breakdown…
May 26th, 2008 on 2:49 am
I’m sure there is nothing wrong with the computer. It sounds like they’re just jealous of the other parts of your life and can’t stand to be without you.
Obviously that “trade a “datacenter-in-a-rack” for a slightly used abacus” line really offended the MySQL daemon and it decided to punish you. Just be glad it didn’t banish the entire website to hell permanently…
May 26th, 2008 on 9:08 am
I was worried for a moment. Images and sounds of HAL going through my head… “What are you doing Dave….”
Going on 10 hours up on the “new” hardware and all seems well so far. Now if I can just figure out what was going on with the old server… Time to start a little thing I like to call “Troubleshooting.”
I say that with such a sarcastic tone because I think troubleshooting has become a lost art in this industry. I run across so many people who don’t know the first thing about troubleshooting, say, a down fibre path or a defective HBA. A path goes down, they replace the card, then if that doesn’t fix the problem they throw up their hands and call the SAC. (Though diagnostically speaking a failed HBA is usually the culprit – or more specifically a failed optic, because the HBA works fine without the light.)
How many people swap cables to see if the light problem follows the cable or the HBA? Run a new cable across the top of the floor to see if the problem is elsewhere in the path? Or check a second host on the storage port in the interest of making sure it’s not the storage?
Nope – swap the HBA and hope. Shotgun troubleshooting at its best.
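For anyone who wants the three checks above spelled out, here is a rough sketch of the elimination logic in Python. The function name, its arguments, and the result strings are all made up for illustration. It assumes you have already done each physical test and are just recording what you saw.

```python
def isolate_fault(problem_follows_cable, temp_cable_fixes_it, second_host_works_on_port):
    """Narrow down a dead fibre path from three manual checks.

    All names here are hypothetical; each argument is the yes/no answer
    to one of the checks described in the text.
    """
    # Check 1: swap cables. If the dark link moves with the cable, it's the cable.
    if problem_follows_cable:
        return "cable"
    # Check 2: run a temporary cable across the floor. If that link comes up,
    # the fault is somewhere in the permanent path (patch panel, under-floor run).
    if temp_cable_fixes_it:
        return "path"
    # Check 3: put a second, known-good host on the same storage port.
    # If it also fails, the storage side is suspect.
    if not second_host_works_on_port:
        return "storage port"
    # Everything else checked out, so what is left is the host end.
    return "HBA or optic"


print(isolate_fault(False, False, True))
```

Only once the first three answers have ruled out cable, path, and storage does swapping the HBA become the sensible move.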
May 27th, 2008 on 1:00 pm
I’ve had similar issues. It turned out to be a VMware config issue, where one server was being told it had the guest while another thought it had it too. It would turn the filesystem read-only.
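A quick way to confirm the read-only symptom on a Linux guest is to look at the mount options in /proc/mounts. The sketch below is only an illustration: the function name is made up, and the sample text is invented data in the standard /proc/mounts layout (device, mount point, type, options, dump, pass).

```python
def read_only_mounts(mounts_text):
    """Return the mount points whose option list contains 'ro'."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, options = fields[1], fields[3]
        # Options are comma separated; match 'ro' exactly, not 'errors=remount-ro'.
        if "ro" in options.split(","):
            found.append(mount_point)
    return found


# Invented sample data, not output from the systems in this post.
sample = """\
/dev/sda1 / ext3 ro,errors=remount-ro 0 0
/dev/sda2 /home ext3 rw,errors=remount-ro 0 0
proc /proc proc rw 0 0
"""
print(read_only_mounts(sample))
```

On a live system you would pass it the real file contents instead, for example `read_only_mounts(open("/proc/mounts").read())`.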
May 27th, 2008 on 1:00 pm
Turn off HA and DRS and see if the problem goes away.
May 27th, 2008 on 1:03 pm
Actually it’s looking more and more like a network problem brought on by my attempts at configuring LACP on the switch – even with the virtual switch and VMkernel set to IP hash, the NFS connection gets wildly unstable as soon as I set up LACP.
Unfortunately I’m not as much a network guy as I am a storage guy so I’m at a loss as to why this would be the case, unless it’s my crappy Dell switch.
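For what it’s worth, here is a simplified model of how IP-hash teaming is commonly described: XOR the source and destination addresses and take the result modulo the number of uplinks. This is a sketch of that description, not VMware’s actual code, and the function name and addresses are invented for the example.

```python
import ipaddress


def ip_hash_uplink(src_ip, dst_ip, uplink_count):
    """Pick an uplink index from a source/destination IPv4 pair.

    Simplified model (an assumption, not the vendor implementation):
    XOR the two 32-bit addresses, then take the remainder by uplink count.
    """
    src = int(ipaddress.IPv4Address(src_ip))
    dst = int(ipaddress.IPv4Address(dst_ip))
    return (src ^ dst) % uplink_count


# Hypothetical addresses: one VMkernel port talking to two NFS servers.
print(ip_hash_uplink("192.168.1.10", "192.168.1.20", 2))
print(ip_hash_uplink("192.168.1.10", "192.168.1.21", 2))
```

Under this model, a given VMkernel address talking to a given NFS server address always lands on the same uplink, so a single NFS mount never spreads across the team. The host and the switch also have to agree on which links belong to that team.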
May 27th, 2008 on 1:07 pm
Re Shotgun Troubleshooting: In our case the most likely culprit is a failed GBIC on a switch (usually an i10k). We always go and check to see light before replacing either a GBIC or HBA. When we were cabling for our new datacenter we did a lot of tracing along to see where things failed or were miscabled (all of our under-floor work was done by a contracting company).
May 27th, 2008 on 1:10 pm
I hate dealing with random cabling nightmares – when I was at LoanTolLearn, the cabling contractor, on their first pass, hooked up the transmit/receive in any random order they saw fit. (Truthfully – with all cables connected, you should be able to look at the patch-panel and see a straight line of lights.)
When we called them back to re-do it they straightened everything up – backwards. So that when I connected any host to the patch panel I first had to separate the leads and swap them so that TX/RX would be lined up correctly.
May 27th, 2008 on 1:23 pm
Think about what happens to a VM when it loses its heartbeat between hosts… Yes, a network issue is just as likely (your storage is NAS)… But if you turn off HA/DRS, you’ll eliminate interconnectivity between the hosts. Actually, if you sort through your logs you should find some notation of heartbeat loss, etc.
May 27th, 2008 on 1:25 pm
Remember, VMware brings another layer of complexity to your networking. Enjoy.
May 28th, 2008 on 10:19 pm
The strangest thing was – the problem I was having was the VMkernel losing contact with the NFS mount, and as such all of the VMs on that mount would just disappear.
I always thought that network was a strange way to go about this – since network connections between devices always seem so hit-or-miss.
I’m still a fibrechannel bigot.
May 29th, 2008 on 8:15 am
Join the crowd… It’s uber reliable, uber fast. What’s not to love?