Archive for the 'CustomerService' Category

Good Cloud, Bad Cloud, a Titanic story…

Saturday, April 23rd, 2011

This week’s abject failure of Amazon.com’s EC2 hosting environment has caused quite the stir.  There are those who say that this incident “Proves Cloud Failure Recovery is a Myth” and others who say that we should just give it a chance.

Facts are facts.  Amazon screwed the pooch big-time last week.  Their outage caused ripple effects nation-wide.  But while it’s easy to throw the blame at Amazon for the failure, it’s important to remember that cloud computing is still only in its infancy, and this mad rush to adopt it is part and parcel of the reason these problems are happening.  Customers rushing for a new product create demand, and companies looking to be the first to capitalize on that demand create a product that may or may not be ready for prime time.

But because no one ever thought to test the kind of cascade failure they experienced (because it’s impossible), they were pushing the high-availability envelope right out of the gate.

So no big deal, right?  Foursquare, parts of Netflix, etc. were down due to the outage.  Other than inconvenience and the inability of narcissistic people to let the world know where they are and what they’re doing, it’s not really that big a deal (for us).

And then this came out: https://forums.aws.amazon.com/thread.jspa?threadID=65649&tstart=0

Specifically this line:

“We are a monitoring company and are monitoring hundreds of cardiac patients at home.  We were unable to see their ECG signals since 21st of April.”

Really?  You have a life-critical application and you hosted it “in the cloud”?  Did it never occur to you that it’s probably *NOT* a good place for a life-or-death application?  While I would consider it as a backup, definitely not my one and only.

People who know me know I have a rule.  I don’t say it works until I’ve seen it work at least once, and even then I’ll qualify my statement with “well I saw it work under THESE conditions.”  I do *NOT* say something works based on what some sales or marketing person tells me works.  (Trust me, this has been a major sticking point between me and my sales team. ;-)

That being said, you have to accept that if you put your critical apps in “the cloud,” by its very nature you are abdicating your control over them, and putting your full faith in someone ELSE to fix the problem.  Someone who may not think your application is as important as the one in the rack next to yours.

Are you going to take someone’s word that something is “Highly Available” if you haven’t actually pulled the plug yourself and watched it fail over?  I won’t.  I will candidly couch my answer in “That’s the way it’s supposed to work” or “That’s the way it’s designed to work.”  But until you see a failover, that’s not the way it DOES work, because it never has.

I run my own email, my own webserver, my own infrastructure. I prefer it this way, because now if the system goes down, I know exactly whose butt to kick.

As a rule, if I’m paying someone else to provide a service… I make sure I know where, how, and whom to call when it blows up.  It’s probably the best advice I can give.

Amazon billed this as being “highly available” and maybe it is, for the most part.  But obviously if you think of a million ways for something to go wrong, you can bet even money on there being at least a million and one ways for it to fail.

Instead of EC2, they should have named it “Titanic” because everyone knows the easiest way to invite disaster is to tell the world you’re immune to it.

Enterprise vs….not

Sunday, June 22nd, 2008

I have a cousin. Very well-to-do man, owns a company that does something with storing and providing stock data to other users. I don’t pretend to know the details of the business, but what I do know is that it’s storage and bandwidth intensive.

He’s building his infrastructure on a home-grown storage solution – Tyan motherboards, Areca SATA controllers, InfiniBand back-end, etc. Probably screaming fast, but I don’t have any hard numbers on what kind of performance he’s getting.

Now I understand people like me not wanting to invest a quarter-mil on “enterprise-class” storage, but why would someone whose complete and total livelihood depends on their storage infrastructure rely on an open-source, unsupported architecture?

One of the things you get with the Symmetrix is the 24×7 monitored support. One of the stories I tell people is about my first experience with EMC. When I worked at Intuit I was on the graveyard operations shift. (The grunt shift that most of us have been subjected to at least once in our lives.) About 4am one morning I got a call from EMC saying that a hard disk in our old Symmetrix-3 array had failed, and that the tech would be onsite in about 20 minutes (I guess they gave him the head-start) to replace it. I asked them if there was anything I needed to do and they told me that it was transparent and that the hosts wouldn’t notice the difference.

I was in love.

People ask what the “Enterprise” money gets you, and that’s it. You get the security of knowing that it doesn’t matter when, where, or how a failure happens, they are on top of it and have it dealt with before you even know the problem exists most of the time.

My second great EMC story: I was working at the Library of Congress on a tech refresh. They had four Symm4 and two Symm5 arrays that were being upgraded to a pair of DMXs. About two weeks before we were to have decommissioned one of the Symm4s, it started experiencing problems. It seemed that two of the three power supplies had failed. The Symm4 was at least 7 years old at the time, and was designed for n+1 redundancy.

Even with two-thirds of its power gone, the thing kept running for almost 7 hours, tapping the internal batteries as needed. (Unfortunately it took only slightly longer to locate a replacement power supply for such an antiquated piece of hardware, but at least it gave us the chance to gracefully power down the last remaining hosts and power off the Symm.)

I’ve heard other stories, one in particular of a Symm in California that, after an earthquake, ran lying on its side until the hardware could be replaced and the data migrated off it. (But having no first-hand knowledge of this, I will consider this an urban legend until someone who witnessed it tells me it really happened.)

*THAT* is what you get for enterprise money.

Of course another relative from the same branch of the family is the one who told me “I have RAID, why do I need backups?”

Dell’s false promises -

Monday, November 19th, 2007

Yes – I can say it.  Dell lied to me.

I have a problem with my new notebook.  Recently (read: less than 90 days ago) I bought a new Dell D620.  When I was forcibly ejected from my last company I found myself without a notebook computer.  So I logged in and bought one. 

Given that my new job requires extensive travel, I chose to spend the extra $300 US or so on their “Next Business Day” warranty.

They could have at least bought me a drink first.

According to Dell, “next business day” doesn’t mean it will be replaced in the next business day; it means the tech will call you on the next business day and schedule a time to come and replace your motherboard, and that time will be the “NEXT BUSINESS DAY AFTER HE RECEIVES THE PARTS.”

That’s such a load I can’t even believe it.

So here I am with a laptop with one failed USB port, and the second one failing – an issue which Microsoft has already said contributed to the multiple failures Vista experienced – in a situation where the only day I’m not traveling in the next month is Thanksgiving Day.

So the fun part is:

The idiot Jr. level support dweeb told me, promised me after I asked him twice if he was sure, that the tech would come to my house ON THANKSGIVING DAY to replace the motherboard in my laptop.

Now I don’t hold him responsible (though I do hold him stupid for not checking his calendar before making a promise that you and I know will cost Dell a mint to fulfill) for Dell’s policies.  As a Jr. level support dweeb he effectively did his job of keeping me from getting to a more senior support dweeb.

But to promise support on a national holiday is hilarious, and I’m going to have a blast ramming that one down their throats when the guy doesn’t show up.

Sorry – it’s late, I’m tired, I’ve had a bad day.  ;-)

The hilarious part is – I even gave him the chance to back out of his lie:

 12:31:21 AM         Jesse

you’re lying to me again you know – you have no intention of having a tech to fix this on Thursday.

 12:31:56 AM         Vikas_166576 

On Thursday the system will be serviced.

 12:32:10 AM         Jesse  

and if it isn’t what is my recourse?

 12:32:26 AM         Vikas_166576 

And there is no reason to lie to my customer.

 12:32:41 AM         Jesse  

ok, so long as I have you on record saying that.

 12:33:32 AM         Jesse  

Thank you for your time – I’ll expect a manager’s call tomorrow.

The problem is, I think, that of course Dell doesn’t use support people from the US, which would mean that he didn’t realize that Thursday was a holiday. ;-) Don’t know, don’t care.  They made a promise; now I expect them to follow through.