Archive for the 'Symmetrix' Category

In case you’re wondering…

Friday, May 27th, 2011

Point of reference – A few months ago I wrote a post that I never ended up publishing that started with the line:

“My gods I need to work with technology that wasn’t conceived of in the 1990s.”

With that in mind, in case you’re wondering where I’ve been this past month or so…

I’ve been playing with this beast…

8 Engine VMAX

225 400G SSD Drives (90 TB Raw)

Direct Attached to *ONE* host.

Biggest.  Thumbdrive.  Ever.

Well I was saying I needed to get some serious hands-on VMAX experience.  When you put a request like that out there, sometimes the universe answers LOUDLY. ;-)

Storage Tiering…

Thursday, July 9th, 2009

Ok, given the changes to the storage arena I’ve been working on a revised “Tiering system” to incorporate all of the levels of data…importance?

My version of Storage Tiering is (or should be) as follows:

  • Tier-1 – Symmetrix/Replicated – High Performance/Critical Data
  • Tier-2 – Symmetrix/NonReplicated – High Performance/Non-Critical Data
  • Tier-3 – Symmetrix/SATA/Replicated – High-Medium Performance/Critical Data
  • Tier-4 – Symmetrix/SATA/NonReplicated – High-Medium Performance/Non-Critical Data
  • Tier-5 – Clariion/FC/Replicated – Medium Performance/Critical Data
  • Tier-6 – Clariion/FC/NonReplicated – Medium Performance/Non-Critical Data
  • Tier-7 – Clariion/SATA/Replicated – Low Performance/Critical Data
  • Tier-8 – Clariion/SATA/NonReplicated – Low Performance/Non-Critical Data
  • Tier-9 – CelerraNAS/Replicated – Network Attached/Critical Data
  • Tier-10 – CelerraNAS/NonReplicated – Network Attached/Non-Critical Data
  • Tier-11 – Atmos – Network Attached/Low Performance
  • Tier-12 – Centera (Content Addressable Storage) – Low Performance Archive/Highly Available
  • Tier-13 – Primary Tape-In-Library (Automatic loading on demand via HSM)
  • Tier-14 – Primary Tape-Out-Of-Library (Manual Intervention Required)
  • Tier-13  – Primary Tape-In-Library (Automatic loading on demand via HSM)
  • Tier-14  – Primary Tape-Out-Of-Library (Manual Intervention Required)

“Critical Data” vs. “Non-Critical Data” is simply a matter of how long you can be without the data should a failure or accidental deletion occur, since (in theory) all data is available in Tier8/9 storage.

I’ve also considered using Tier1/Tier1B to describe DMX storage vs. Clariion storage, given that there is a LOT of overlap in performance characteristics these days…

Oh, and iSCSI would be somewhere between 10 and 13….

Any thoughts?

Thin Provisioning

Wednesday, August 27th, 2008

Just got word that I’m going to be doing a “Thin Provisioning” install next week.  I’ve not had a lot of experience with EMC’s implementation of this particular brand of virtualization so it’s going to be interesting.

Thin provisioning has been around for a while; I think NetApp and Compellent have had thin devices in one form or another. But it’s not my idea of a fun time.

Thin provisioning is basically pretending you have space that you don’t.  You create a storage pool, or a group of volumes, that your “thin” disks can pull from.  The thin disks fill up space as it’s used, taking from the pool of disks in the process.

The thin devices then pull from the common pool as tracks are used.  This way a 500GB pool of disks can easily be provisioned to the hosts as a terabyte or more of virtual disks.

But if you create a 500GB pool and 4 250G “thin” devices, you are only safe until the total used space hits 500GB.  (i.e. 125G in each thin device)
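
To make that arithmetic concrete, here is a minimal Python sketch of the example above.  The function name and the fields it returns are my own illustration and have nothing to do with any EMC tool; it just adds up what was promised against what the pool can really hold.

```python
def thin_pool_summary(pool_gb, device_sizes_gb, used_gb):
    """Summarize oversubscription for a thin pool (illustrative only)."""
    provisioned = sum(device_sizes_gb)  # space promised to the hosts
    used = sum(used_gb)                 # space the hosts have actually written
    return {
        "provisioned_gb": provisioned,
        "oversubscription": provisioned / pool_gb,
        "free_pool_gb": pool_gb - used,
        "pool_exhausted": used >= pool_gb,
    }


# The example from the text: a 500GB pool backing four 250G thin devices,
# each of which has filled up to 125G.
summary = thin_pool_summary(500, [250] * 4, [125] * 4)
print(summary)
```

With those numbers the hosts think they have a terabyte, the pool is oversubscribed 2:1, and it is already out of space with every device only half full.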

I can’t even begin to count the number of ways this can blow up on you.  The only reason for using thin provisioning would be to lie to your internal customers.  Now granted I can think of a great use for this in dealing with users who make unreasonable storage requests.  (You know, the “I need 250GB for a webserver that only has a 35MB webpage hosted on it” types)

The problem, again, is that when they decide to fill up the disk, you’re out of luck.

Enterprise vs….not

Sunday, June 22nd, 2008

I have a cousin. Very well-to-do man, owns a company that does something with storing and providing stock data to other users. I don’t pretend to know the details of the business, but what I do know is that it’s storage and bandwidth intensive.

He’s building his infrastructure on a home-grown storage solution – Tyan motherboards, Areca SATA controllers, infiniband back-end, etc. Probably screaming fast but I don’t have any hard-numbers on what kind of performance he’s getting.

Now I understand people like me not wanting to invest a quarter-mil on “enterprise-class” storage, but why would someone whose complete and total livelihood depends on their storage infrastructure rely on an open-source, unsupported architecture?

One of the things you get with the Symmetrix is the 24×7 monitored support. One of the stories I tell people is about my first experience with EMC. When I worked at Intuit I was on the graveyard operations shift. (The grunt shift that most of us have been subjected to at least once in our lives.) About 4am one morning I got a call from EMC saying that a hard-disk in our old Symmetrix-3 array had failed, and that the tech would be onsite in about 20 minutes (I guess they gave him the head-start) to replace it. I asked them if there was anything I needed to do and they told me that it was transparent and that the hosts wouldn’t notice the difference.

I was in love.

People ask what the “Enterprise” money gets you, and that’s it. You get the security of knowing that it doesn’t matter when, where, or how a failure happens, they are on top of it and have it dealt with before you even know the problem exists most of the time.

My second great EMC story – I was working at the Library of Congress on a tech-refresh. They had four Symm4 and two Symm5 arrays that were being upgraded to a pair of DMX’s. About two weeks before we were to have decommissioned one of the Symm4’s, it started experiencing problems. It seemed that two of the three power supplies had failed. The Symm4 was at least 7 years old at the time, and was designed for n+1 redundancy.

Even with two-thirds of its power gone, the thing kept running for almost 7 hours, tapping the internal batteries as needed. (Unfortunately it took only slightly longer to locate a replacement power-supply for such an antiquated piece of hardware, but at least it gave us the chance to gracefully power-down the last remaining hosts and gracefully power-off the Symm.)

I’ve heard other stories, one in particular of a Symm in California that, after an earthquake, ran lying on its side until the hardware could be replaced and the data migrated off it. (But having no first-hand knowledge of this, I will consider this an urban legend until someone who witnessed it tells me it really happened.)

*THAT* is what you get for enterprise money.

Of course another relative from the same branch of the family is the one who told me “I have RAID, why do I need backups?”

Migration complete -

Sunday, November 25th, 2007

We did it.   Migrated the hosts/data.   Production is now running in Kansas, DR in Georgia, and the old datacenters in NY/NJ are one step closer to being shut down.

Interesting couple of things I learned today.

SRDF/A is a great technology for replicating over long distances while maintaining what they call a “dependent-write-consistent” state.  It means that even though the replication is being taken care of asynchronously, with minimal performance impact to the host, in the event of a failure you’re going to lose a minimal amount of data.  (In our case, when it was running the R2 disks were about 45-60 seconds behind the R1.)

We also performed a “failure” (disconnected both Gig-E ports to simulate the Kansas site dropping out) and brought the DR hardware up as primary, then reconnected, unmounted, and restarted the SRDF/A session.

The only downside I’ve found with SRDF/A is that it’s a royal pain to stop and restart the replication.  In cases like this one, where once a week they take the R2’s offline to run a 20-hour backup off them, they are putting themselves at unneeded risk.  It’s a situation where TimeFinder/SNAP would be a great benefit.  You snap the R2’s at midnight and back up the snaps, thereby leaving your R2’s in sync with your R1’s for the duration.  You can also then mount the SNAP volumes to a separate media server, thereby avoiding having to re-configure the DR server as a temporary media server.

It’s just a thought.

It’s always a great feeling when you hit the deadline dead-on, especially when you’re dealing with a situation where the requirements kept changing throughout the project, even to the point of having to add new devices at the last minute.

Oh well, on to the next.  At least the next is going to keep me closer to home.   Small-scale data migration from DMX2 to DMX2 within the same room, this should be a cake-walk. :)

Binfile changes

Thursday, November 8th, 2007

The joys of data migrations. 

One of the most common problems is the standard practice of most companies to avoid upgrading whenever possible.  The “if it ain’t broke, don’t fix it” mentality.

I could spend days and days on that particular brand of suicide.  For now I’ll just replace that adage with a new one.

If you don’t upgrade it now when you can do it in a controlled fashion, you will end up doing it when your life depends on it with very little planning.

So on the 17th, a customer is going to have to take an application down on an *OLD* Symmetrix 4.8 system to upgrade from 5265 code to 5267 code.  (Two major code revs up, from 5x65 to 5x66, then from 5x66 to 5x67, and neither can be loaded on-line.)

All of this has to happen *JUST* so we can move the data off this Symm and onto a “not-so” old Symm 5.0 that will then be packed up to be moved out of state.

First off, the idea that you can simply turn off a Symm and ship it across the country is nuts.  Anytime you get a system with that many moving parts (hard drives) that have been spinning for that length of time and simply “turn it off” you run the risk of multiple hard-disk failures.  And as we all know, any time you have multiple hard-disk failures in an array, you run the risk of losing both halves of a mirror.  Hell, I cringe at turning off my desktop PC because I know that there is always the chance it’s not going to come back up, and I’ve got Raid-1 (160G) on my boot devices and Raid-5 (500G) on my data volumes, so I’m reasonably protected.

Secondly, why are we moving a Symm that is going to hit EOSL before too long?  Doesn’t it make sense to go ahead and upgrade to the latest and greatest hardware, get a free support renewal (included with the purchase of new hardware) and get the latest and greatest features/functionality?  Of course we’re moving a bunch of Sun E3000/E4000/E6000 class hardware.  These are the systems I cut my admin teeth on back in 1996 when I first started out in datacenter operations.  They were old 6 years ago.

Next time someone asks you the correct way to move a datacenter, the correct answer is “twin the hardware and replicate” followed by “trade-in.”

Bekins Moving should never be an option.

Upgrades complete

Thursday, November 8th, 2007

With the exception of a small problem I had with sendmail, everything seems to be working.

Love it when it works this smoothly. ;-)

/jg

Slow week….

Sunday, November 4th, 2007

Last month my timesheet showed a whopping 245 hours…..  I’m tired.

But the insanity continues.  We completed the first stage of migrations last week, a swimming success.  (it’s funny, because I got the impression from the end-customer that they didn’t expect it to go as smoothly.)

SRDF is the easiest way to move data – granted these were….shall we say OLDER systems, running Solaris 2.7/2.8 and Solstice DiskSuite.  (Not something I would *EVER* recommend, as any logical volume manager that runs based on the CxTxDx number of the device is inherently risky.  If the CxTxDx number changes, there goes your entire volume group configuration.)  This created a number of ridiculous requirements that we had to jump through hoops at the last minute to meet, like making sure the lun numbers didn’t change.

Another requirement was that we had to stick to an out-of-date code rev on the new DMX-3.  Normally I’m not the biggest fan of running the “latest and greatest” revision of software, mostly because I don’t like being the one to do the post-beta – beta-testing.  I prefer to let it crash someone else’s system first.

But EMC’s Enginuity code is pretty unique in that the benefits of a major code rev update usually outweigh the draw-backs.

As an example.  5772 is the latest major rev of DMX code.  We’re running 5771 on the DMX3’s we’re migrating to.  5771 is very stable, does a lot, performs well, etc.

It does not, however, allow for dynamic lun numbering.  On a Clariion you have two lun numbers: the ALU (Array Logical Unit) and the HLU (Host Logical Unit).  The ALU is the identifier assigned to the lun by the array; the HLU is what the host sees, and is usually determined by the order in which the devices are added to the storage group.

On a Symmetrix, the ALU is the only lun.  You configure a device to have lun 0EF, and that is exactly what the host sees.  So on a Sun system you’d translate that to decimal and configure lun # 239 in your sd.conf as the lun.
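
As a quick sketch of that translation, here it is in Python.  The helper name is hypothetical, and the target number is just a placeholder for whatever your HBA binding assigns; the only real work is the hex-to-decimal conversion, wrapped in the usual sd.conf entry format.

```python
def sd_conf_entry(hex_lun, target=0):
    """Build an sd.conf line for a Symmetrix lun given in hex.

    target=0 is a placeholder; use the target your HBA binding assigns.
    """
    lun = int(hex_lun, 16)  # the array shows the lun in hex, sd.conf wants decimal
    return f'name="sd" class="scsi" target={target} lun={lun};'


print(sd_conf_entry("0EF"))
# prints: name="sd" class="scsi" target=0 lun=239;
```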

Now enter 72 code and dynamic lun numbering.  You can now, through the symmask command, dynamically assign a “host-facing” lun # to a device, which means that even if the array-based lun skips all over, your host-based lun numbers will end up nice and sequential.  This can come in very handy when you run across operating systems that don’t like to have holes in the lun numbering.
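
To illustrate the idea only (this is not how symmask works internally, and the function and sample lun values below are made up for the example), here is a small Python sketch that takes a scattered set of array luns and hands out sequential host-facing lun numbers.

```python
def assign_host_luns(array_luns_hex, first_host_lun=0):
    """Map sparse array luns (hex strings) to sequential host-facing luns."""
    # Sort by numeric value so the host numbering follows the array order.
    ordered = sorted(array_luns_hex, key=lambda s: int(s, 16))
    return {lun: first_host_lun + i for i, lun in enumerate(ordered)}


# Hypothetical array luns that "skip all over".
mapping = assign_host_luns(["0EF", "01A", "1C3", "0F0"])
for array_lun, host_lun in mapping.items():
    print(f"array lun {array_lun} -> host lun {host_lun}")
```

The host ends up seeing luns 0 through 3 with no holes, no matter how scattered the array-side numbers are.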

An interesting tidbit. 

Cisco FCIP and SRDF

Friday, October 19th, 2007

Been a while since I’ve written anything – I’m not even sure if I still have a readership.

I’ve been working an average of 60 hours a week on a single project these days.  Doing a datacenter migration and consolidation.  Basically moving 4 Symm-5 generation systems into a single DMX-3.

The funniest part of this has been learning the DMX-3, which I’ve not had a lot of stick-time with.  It seems like a great machine, a good hybrid of the Clariion and the Symmetrix.  I don’t much care for the DAE back-end, too many major points of failure, too many cables.  (Though when you do your first code-load on one, it sure gives you a work-out as far as learning what plugs in where.)

Anyway, as the title suggests, we’re doing a large part of this migration using temporary hardware, in the form of the Cisco MDS9216i.  This is a normal MDS 92xx chassis (2-slot) with a 14/2 FCIP blade in it.  Simply 14x4gbit FC ports and 2xGig-E ports on the same blade.  So far it’s been one challenge after another, and as of this posting we still don’t have the Georgia and New York datacenters talking to each other.

Part of the problem is the customer’s network infrastructure.  Namely it sucks.  For those who don’t know, Gig-E ports on the Cisco don’t negotiate down, they are essentially 1000-SX ports.  The customer, who makes a substantial part of their income off network traffic, doesn’t have a single Gig-E port in the entire datacenter.  That was problem number one.

Problem #2 was, in the datacenter that does have Gigabit available, namely the new one in Georgia, there is no optical available.  So we have to go through the painful process of getting an RPQ (an inexact definition is “Request for Price Quote” – what it really means is getting engineering to bless the configuration) to use copper SFP’s on the MDS switches.

*THEN* we find out we’re replicating over a DS3 circuit, and even at that we have to “nice” our hardware down to 12.5Meg/Sec so as to not affect their production traffic, which is (of course) running on the same network.  (SRDF has a nasty habit of sucking up all available bandwidth.)

Do you know how long it takes to replicate terabytes of data at 12Meg?  LOL
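
For a rough answer, here is a back-of-the-envelope Python sketch.  “Meg” could be read as megabytes or megabits per second, so it works out both, and the 1/5/10 TB sizes are only sample values picked for illustration.  It also assumes the link runs flat-out at the throttled rate with no protocol overhead, which is optimistic.

```python
def transfer_days(terabytes, rate, unit="MB/s"):
    """Days needed to copy `terabytes` (decimal TB) at a sustained rate."""
    bytes_total = terabytes * 10**12
    if unit == "MB/s":
        bytes_per_sec = rate * 10**6
    elif unit == "Mbit/s":
        bytes_per_sec = rate * 10**6 / 8
    else:
        raise ValueError(f"unknown unit: {unit}")
    return bytes_total / bytes_per_sec / 86400  # 86400 seconds in a day


# Sample sizes only; the real amount of data is whatever you are migrating.
for tb in (1, 5, 10):
    print(f"{tb:>2} TB: {transfer_days(tb, 12.5, 'MB/s'):.1f} days at 12.5 MB/s, "
          f"{transfer_days(tb, 12.5, 'Mbit/s'):.1f} days at 12.5 Mbit/s")
```

Either way you read the number, every terabyte costs somewhere between most of a day and about a week.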

This is going to be fun.  I’ll keep you posted.

 

Bulletproof storage

Saturday, May 19th, 2007

You know – I’ve worked with EMC hardware for a long time (in technology terms, eons.)

In my going on 10 years of experience working with EMC hardware, I’ve seen ONE Symm completely fail.  An *OLD* Symmetrix 3530 failed during a 9-month project I was working on to decommission 4 of them and migrate them to DMX1000’s.

The funny part is, it failed because two of the three power supplies went at the same time…..and still took almost 7 hours to go down completely.  (This is where EMC dropped the ball.  They were getting the alerts for some time and, because it was registering as “AC Failure”, assumed that it was environmental (customer power) and didn’t check it out properly.)

At the end of the cycle, when the batteries in the Symm were almost dead, we finally took the hosts down gracefully and shut the Symm down gracefully to await the new power supplies.  They arrived about an hour later.

I’ve seen Symms in a lot of states, and heard of more.  Stories abound about a Symmetrix that was lying on its side in California after the Northridge earthquake, still passing data.

However, there are still more stories about New Orleans after Katrina that weren’t as good.  No hardware, no matter how bulletproof, will work sitting in two feet of water. ;-)

That’s what you get for putting a data center 25′ below sea-level in a coastal area.