Downtime
We interrupt this experiment to bring you this special bulletin…
by Jesse on Feb.25, 2011, under Consulting, Downtime, Employement, Politics
The government’s “Continuing Resolution” will be expiring a week from today. As a government contractor, this directly affects me.
They have two choices. They can pass ANOTHER C.R. or they can actually pass a budget.
I don’t post political statements here too often. However, I don’t know about you, but from where I stand this travesty that the House has floated is a disaster: 1.2 million jobs lost by the estimates I’m hearing, and to top it all off, it doesn’t do SQUAT to balance the budget, because the places that need to be cut or reformed (i.e. Defense, etc.) are off the table. So this will be for nothing.
If the posturing peacocks on Capitol Hill don’t get their collective crap together, and one side (or the other) forces the government to shut down, I may have some time on my hands.
Part of me is hoping that cooler heads prevail.
Part of me is looking forward to a little time off.
I’m told it is actually a CRIMINAL offense for me to work if the government shuts down.
Bring it.
I remember the good old days….
by Jesse on Mar.17, 2009, under Downtime, EMC Failures, Gripe, LackOfCustomerService, SRDF, Support-Calls
When the triage guys @ EMC actually listened to the person calling in the problem and directed the call appropriately. (Just had one where I specifically asked for PSE and they routed me to software for some unknown reason; now, three hours later, it’s been re-routed to PSE.)
When the support specialist working a call would stay through the end of the problem, and didn’t give you the “Sorry my shift is over, please explain your problem to a new guy for 45 minutes before you do anything else productive”
Follow-up. It’s now 0100 the next morning. I’ve been working on this problem for 8 hours straight. The software guy who was originally assigned went home without turning the call over to someone else. I’ve gotten not a single call since before 9pm.
This is not support, people. This is the opposite of support. I’ve since fixed the problem myself, without the help of the SAC/PSE folks. The sad part is that if I had done this 8 hours ago, I could have been home eating my corned beef and cabbage and drinking my Guinness.
Totally missed St. Patrick’s Day, the only religious holiday I actually observe.
Not happy.
Jumping the shark
by Jesse on Aug.25, 2008, under Downtime, FC@Home, Fibrechannel, iSCSI, Linux, SiteAdmin, Vmware-NFS
This may be a more well-known reference than I earlier thought.
I grew up watching Happy Days. The show was great until the episode where Fonzie jumped the shark tank. After that it pretty much went downhill quickly.
Hence the term “Jumped the shark” or “Jumping the shark” has come to mean any single event that marks the point where something degenerates into crap.
My VMware NFS server jumped the shark this weekend. It was hilarious. I had a beautifully quiet afternoon on Friday; from about 14:30 on, my BlackBerry was quiet. Turns out that the NFS server that I use for storage experienced an unexplained (and apparently barely logged) kernel panic and rebooted.
In the process, the 6 adapters, in what I can only guess was a techno-square-dance, all switched places and lost their bonding configuration.
All went south, right in the middle of one of my busiest travel weeks as far as work goes. So my wife, God bless her, earned her stripes this weekend as I walked her through ‘ifconfig eth0 10.1.1.10’ and ‘ping 10.1.1.254’ etc., trying to figure out what happened.
Still don’t know. But with everything down (including this site), my first priority was to get it all back online and troubleshoot later. (When my desktop goes down I know why: I have an inquisitive 3-year-old with a fetish for power buttons. The server power buttons, however, are protected by a key for that very purpose.)
So I ordered a bunch of 146G drives for the hosts, and I’m going to move critical apps back to internal storage until I figure out what in the hell happened and how to fix it. It might give me an opportunity to evaluate some new FC target toys I’ve been thinking about.
Who knows. No more shark-jumping though.
Enterprise vs….not
by Jesse on Jun.22, 2008, under CustomerService, Downtime, EMC Failures, Symmetrix, Technical Support
I have a cousin. Very well-to-do man, owns a company that does something with storing and providing stock data to other users. I don’t pretend to know the details of the business, but what I do know is that it’s storage and bandwidth intensive.
He’s building his infrastructure on a home-grown storage solution: Tyan motherboards, Areca SATA controllers, InfiniBand back-end, etc. Probably screaming fast, but I don’t have any hard numbers on what kind of performance he’s getting.
Now I understand people like me not wanting to invest a quarter-mil on “enterprise-class” storage, but why would someone whose complete and total livelihood depends on their storage infrastructure rely on an open-source, unsupported architecture?
One of the things you get with the Symmetrix is the 24×7 monitored support. One of the stories I tell people is about my first experience with EMC. When I worked at Intuit I was on the graveyard operations shift. (The grunt shift that most of us have been subjected to at least once in our lives.) About 4am one morning I got a call from EMC saying that a hard disk in our old Symmetrix-3 array had failed, and that the tech would be onsite in about 20 minutes (I guess they gave him the head start) to replace it. I asked them if there was anything I needed to do, and they told me that it was transparent and that the hosts wouldn’t notice the difference.
I was in love.
People ask what the “Enterprise” money gets you, and that’s it. You get the security of knowing that it doesn’t matter when, where, or how a failure happens, they are on top of it and have it dealt with before you even know the problem exists most of the time.
My second great EMC story: I was working at the Library of Congress on a tech refresh. They had four Symm4 and two Symm5 arrays that were being upgraded to a pair of DMXs. About two weeks before we were to have decommissioned one of the Symm4s, it started experiencing problems. It seemed that two of the three power supplies had failed. The Symm4 was at least 7 years old at the time, and was designed for n+1 redundancy.
Even with two-thirds of its power gone, the thing kept running for almost 7 hours, tapping the internal batteries as needed. (Unfortunately it took only slightly longer to locate a replacement power supply for such an antiquated piece of hardware, but at least it gave us the chance to gracefully power down the last remaining hosts and power off the Symm.)
I’ve heard other stories, one in particular of a Symm in California that, after an earthquake, ran lying on its side until the hardware could be replaced and the data migrated off it. (But having no first-hand knowledge of this, I will consider it an urban legend until someone who witnessed it tells me it really happened.)
*THAT* is what you get for enterprise money.
Of course another relative from the same branch of the family is the one who told me “I have RAID, why do I need backups?”
So much fun, so little time.
by Jesse on Apr.11, 2008, under DAS, Downtime, Duck, VMWare, Vmware-NFS
A few have noticed the site was down for an extended period this week. I learned a few things this week.
I set up my FC system and was so excited to get it moving that I neglected to adequately test my equipment. I bought used equipment, with used drives, and put real data on them after a whopping 2 days of light testing. I never stress-tested the drives, and didn’t do any kind of exercising of them to validate that they were worthy of production data.
I also neglected to functionally test the array. While it did offer the ability to configure a hot-spare, I didn’t check to see if the hot-spare was functional before I moved data over to it. (Seeing that it was configured was enough for me)
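For what it’s worth, the kind of drive exercising that got skipped can be sketched in a few lines of Python. This is only a minimal illustration, not what was (or should have been) run on the array: the function names `pattern_block`, `write_pattern`, and `verify_pattern` are made up for this example, and it assumes the target is an ordinary file on the volume being tested. A real burn-in would run far longer, cover the whole disk, and bypass the OS page cache, which this sketch does not do.

```python
import hashlib
import os


def pattern_block(index, block_size, seed=b"burn-in"):
    """Return a deterministic block of block_size bytes for a given block index."""
    digest = hashlib.sha256(seed + index.to_bytes(8, "big")).digest()
    reps = block_size // len(digest) + 1
    return (digest * reps)[:block_size]


def write_pattern(path, n_blocks, block_size=1 << 20):
    """Write n_blocks pattern blocks to path and push them to the device."""
    with open(path, "wb") as f:
        for i in range(n_blocks):
            f.write(pattern_block(i, block_size))
        f.flush()
        os.fsync(f.fileno())  # ask the OS to flush its buffers to storage


def verify_pattern(path, n_blocks, block_size=1 << 20):
    """Read the file back and return the indices of blocks that do not match."""
    bad = []
    with open(path, "rb") as f:
        for i in range(n_blocks):
            # Note: reads may be served from the page cache, not the disk.
            if f.read(block_size) != pattern_block(i, block_size):
                bad.append(i)
    return bad
```

Calling `write_pattern(path, n)` followed by `verify_pattern(path, n)` in a loop for a few days would at least give a used drive the chance to fail before real data lands on it. An empty list back from `verify_pattern` means every block read back as written.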
So what happened was this. I was running on the system and all was well until a drive failed. The hot-spare didn’t invoke on its own, and while one drive was in a failed state, a second drive failed. Needless to say, I lost half my LUNs, and three of them were corrupted beyond repair.
Luckily I’m one of the old hold-outs. I have a tape backup system consisting of a Veritas 6.0 environment with an ATL tape library. I was able to restore to within 48 hours of the failure using tape.
My *NEW* storage back-end consists of a Dell 2650 with 5x 146G drives. I installed CentOS5 with a 512GB NFS-mount partition and mounted it on my VMware servers. The most interesting part is that I realized that by bonding the network interfaces I’m getting the same bandwidth I got out of the 2x 1Gig Fibre Channel ports.
Not being a network guy though, does anyone have any suggestions for optimising NFS for storage applications?
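The bandwidth comparison above can be checked with some back-of-the-envelope arithmetic. The sketch below is a rough illustration only. It assumes two bonded gigabit NICs (the post doesn’t say how many interfaces were bonded), uses the nominal 1 Gbit Fibre Channel line rate of 1062.5 Mbaud with 8b/10b encoding, and ignores all protocol overhead (FC framing, TCP/IP, NFS). The names are invented for the example.

```python
def aggregate_mb_per_s(ports, payload_mbit_per_port):
    """Aggregate payload rate in MB/s across a number of identical links."""
    return ports * payload_mbit_per_port / 8  # bits -> bytes


# 1 Gbit Fibre Channel: 1062.5 Mbaud on the wire, 8 data bits per 10 line bits.
FC_1G_PAYLOAD_MBIT = 1062.5 * 8 / 10

# Gigabit Ethernet: 1000 Mbit/s of data per port.
GIGE_PAYLOAD_MBIT = 1000.0

fc_total = aggregate_mb_per_s(2, FC_1G_PAYLOAD_MBIT)    # two FC ports
bonded_total = aggregate_mb_per_s(2, GIGE_PAYLOAD_MBIT)  # assumption: two bonded NICs

print(f"2x 1G FC:      {fc_total:.1f} MB/s")
print(f"2x bonded GigE: {bonded_total:.1f} MB/s")
```

On paper the two land in the same ballpark, which fits what I saw. One caveat: depending on the bonding mode, a single NFS connection may still be limited to the speed of one link, so the aggregate figure is a ceiling across several clients rather than a promise for any one of them.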
Comcast is *SO* fired
by Jesse on Jul.01, 2007, under Comcast, Downtime, Duck
They screwed it up again. Verizon has already been called.
Not so much for the “the system is down” but for the “I really don’t care when you get it back up or how much business you’re losing in the meantime.”
Another six hours of down-time….
by Jesse on May.04, 2007, under Comcast, Downtime, General
Ok, this one was my fault. I messed with the cable company’s careful organizational system.
I paid my bill.
I went in there today to pay my bill. While I was in there I asked them for an adjustment, because when they set up the business service, they neglected to stop charging me for the residential service I used to have.
While she was removing the residential service, she noticed that my old modem was still associated with the account, and proceeded to remove it.
The warning bells should have gone off in my head there, but they didn’t until I was sitting at the auto service center getting the oil changed in my wife’s car. She called me and told me that the internet connection was down again, and this time rebooting the modem didn’t fix it.
Didn’t take me more than 8 seconds to put 2+2 together and figure out that the dim-wit behind the counter at the local Comcast office had disconnected the wrong modem.
It did, however, take me more than 6 hours of repeated calls before I got someone on the phone who actually had half a clue and got the connection back up, with my IP addresses intact.
Hats off to Comcast’s one good tech.
I’ve got Verizon’s business services number on speed dial. If it goes down one more time, they’re up.
Another bout of down-time – Sorry.
by Jesse on Mar.30, 2007, under Downtime, General
Apparently the move to Comcast Business Services hasn’t done me all that much good. The SMC gateway they gave me hangs on a regular basis, and the 6meg I’m paying for is closer to 1.5.
If you have the chance to make use of their services….pass. I can’t wait until Verizon comes out with FiOS in this area.
Downtime
by Jesse on Mar.14, 2007, under Downtime, Non-Storage
Some may have noticed the downtime last night – well it was planned…sort of.
Last night I started working on moving the site over to the new internet connection. (I’m curious as to whether or not anyone outside notices the speed difference)
When I realized (as I should have before) that moving the default gateway from the old router to the new would affect *EVERY* website and that I couldn’t do it gracefully one at a time, I made the decision to go ahead and make the DNS changes to get off the Dynamic DNS service and over to the fixed IPs.
It took almost 12 hours for the changes to fully propagate because not only was I changing the DNS entries, I was changing DNS servers from dyndns.org to Network Solutions.
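Why changing the DNS servers stretches things out can be shown with a simplified model. This is a sketch under stated assumptions, not a description of how dyndns.org or Network Solutions actually behaved: the function name and the TTL values are hypothetical, picked only to illustrate the arithmetic. The idea is that a resolver which cached the old nameserver delegation keeps asking the old servers until that cache entry expires, and just before it expires it can pick up one more copy of the old address record.

```python
def worst_case_stale_seconds(old_record_ttl, old_ns_ttl, old_servers_still_answer=True):
    """Rough upper bound on how long a caching resolver may serve the old address.

    old_record_ttl: TTL (seconds) on the old address record.
    old_ns_ttl: TTL (seconds) on the old nameserver delegation.
    old_servers_still_answer: whether the old nameservers keep handing out
        the old record after the switch.
    """
    if old_servers_still_answer:
        # A cached delegation can fetch a fresh copy of the old record
        # right before the delegation itself expires.
        return old_ns_ttl + old_record_ttl
    # Otherwise each cached item simply ages out on its own.
    return max(old_record_ttl, old_ns_ttl)


# Hypothetical values: a short TTL typical of dynamic DNS records,
# and a one-day TTL on the delegation.
stale = worst_case_stale_seconds(old_record_ttl=60, old_ns_ttl=86400)
print(f"worst case: about {stale / 3600:.1f} hours")
```

The takeaway is that with only a record change, the short record TTL would have been the whole wait. Moving nameservers at the same time puts the much longer delegation TTL in charge.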
I apologize for the downtime, but the upside is that the most painful part of this is done, now I can start with the rest of the upgrades.
Oops I did it again?
by Jesse on Dec.23, 2006, under Downtime, General
Took the servers down today – on purpose this time. Had to move the rack to put the new floor in my office.
It’s starting to look like a real office now – my little 10′ x 15′ world.
Anyway, I apologize for the inconvenience.
I have a question for anyone who is VMWare savvy – I’m learning VMWare the hard (but most efficient) way. I’m running it. Everything works wonderfully (as far as vmware goes, I’m having postfix issues that are totally self-inflicted) except that I can’t seem to get the time to synchronize correctly.
Any ideas?