Duck
So much fun, so little time.
by Jesse on Apr.11, 2008, under DAS, Downtime, Duck, VMWare, Vmware-NFS
A few of you have noticed the site was down for an extended period this week. I learned a few things in the process.
I set up my FC system and was so excited to get it moving that I neglected to adequately test my equipment. I bought used equipment, with used drives, and put real data on them after a whopping 2 days of light testing. I never stress-tested the drives or exercised them in any way to validate that they were worthy of production data.
I also neglected to functionally test the array. While it did offer the ability to configure a hot-spare, I didn’t check to see if the hot-spare was functional before I moved data over to it. (Seeing that it was configured was enough for me.)
So what happened was this. I was running on the system and all was well until a drive failed. The hot-spare didn’t invoke on its own, and while one drive was in a failed state, a second drive failed. Needless to say I lost half my LUNs, and three of them were corrupted beyond repair.
Luckily I’m one of the old hold-outs. I have a tape backup system consisting of a Veritas 6.0 environment with an ATL tape library. I was able to restore to within 48 hours of the failure using tape.
My *NEW* storage back-end consists of a Dell 2650 with 5x 146G drives. I installed CentOS 5 with a 512GB NFS-mount partition and mounted it on my VMWare servers. The most interesting part is I realized that by bonding the network interfaces I’m getting the same bandwidth I got out of the 2x 1Gig fibrechannel ports.
Not being a network guy though, does anyone have any suggestions for optimising NFS for storage applications?
Comcast is *SO* fired
by Jesse on Jul.01, 2007, under Comcast, Downtime, Duck
They screwed it up again. Verizon has already been called.
Not so much for the “the system is down” as for the “I really don’t care when you get it back up or how much business you’re losing in the meantime.”
Woah – Bigtime Duck
by Jesse on Mar.20, 2007, under Backup, Duck
I think I’m going to send this guy the biggest duck ever.
Oops! Techie wipes out $38 billion fund
We’ve all done it, accidentally deleted a file, maybe some of us have even formatted the wrong drive.
But I think it’s safe to say that most of us have not wiped out $38 billion.
According to the MSNBC story, this guy formatted a drive at the Alaska Department of Revenue, accidentally deleting application information.
If that wasn’t enough, the guy then follows this with a perfect encore: He formats the backup drive.
*THEN* they realize that their backups are useless.
Talk about the perfect storm. Three disasters. Three totally preventable mistakes that, put together, cost them over $200,000, and it was only recoverable at all because they had one more backup: 300 cardboard boxes containing some 800,000 applications.
The impressive thing was the following quote, from Revenue Commissioner Bill Corbus: “Everybody felt very bad about it and we all learned a lesson. There was no witch hunt.”
I’m impressed.
My Duck – redux
by Jesse on Jan.24, 2007, under Duck
I’m not sure but I think I’ve got to go get the duck back from Tim.
I say I’m not sure because, even by my own rules of my own game, I don’t know whether taking a system down at midnight and getting it back up by 5am, before the users come in, counts.
Our usual criterion for a duck is that it has to affect at least 10 people, 10 being an arbitrary number that I pulled out of my backside.
This was a major system, with many, many users, but since none of them noticed (other than the people I notified of the problem) I personally am not sure it should count.
Can we get a ruling on this?
More information. In the process of doing a data migration on one of our development systems (mind you, we’re at peak development cycle right now, so that’s a busy system) one of the requests was that I resize the C: drive on this system from 12G to something more useful, like the 36G that is the physical drive. The problem that has kept me from doing it is that someone had put application data in the latter partition of the drive.
So after (from home, over Remote Desktop) moving the Application, SQL Data, SQL logs, and such to the Symmetrix, I wiped the left-over partitions and resized the C: drive using a tool called “Acronis Disk-Director”.  Unfortunately this was bought before I had any input into purchasing of software like this, because I’m a PowerQuest guy myself.
So at about 12:30am I resize the drive and the system locks up cold. Won’t respond to a ping, nothing. This didn’t make me happy because I live about an hour’s drive from work and having to haul my tired butt in at this hour is just not attractive.
I get into the office about 2:30 (after the usual puttering around that involves getting ready to leave on short notice) and realize that I left my badge on my dresser at home. No other way into the building, let alone the datacenter, so I haul my tail all the way back home to get my badge. It’s a good thing though, because I found I had both keys to my wife’s car in my pocket and it would have totally thrown off her day if she couldn’t get the kids to school. (side story)
So I get back in around 4:30am, get into the datacenter, and find the system sitting at a “Delayed Write failed” message.  I power it off and power it back on to find it *thankfully* boots. However I have two different partition sizes being reported, which usually means that there is a corruption in the FAT table.
Acronis showed the size as being 33G, Windows still showed it at 12G. What I ended up having to do was resize the 33G partition to 30G, reboot, then resize it back to 33G to get all of the various tables in sync.
Seemed to fix the problem, like I said, before 5am.
So tell me, is it duck-worthy? Or just another example of how Microsoft conspires to keep us all grumpy and sleepless?