Archive for the 'Replication' Category

Multivendor or Single Source? Is there a right answer?

Wednesday, May 26th, 2010

Every time I turn around, I seem to run into the same question.

Is it better to be multi-vendor or single source?

Well, the easy answer to that is: it depends.  Different vendors do things differently, work better/worse with some hardware, etc.

The arguments in favor of a single-vendor solution are easy: cost, simplicity, management, interoperability.

Even if you’re buying a more expensive solution, there can STILL be major cost savings.

First, in staffing.  When you maintain multiple vendors, you have to maintain support staff knowledgeable in each vendor.

If you’ve got a storage team that consists of 5 people, and two of them work almost exclusively on Veritas Netbackup, you *MIGHT* be lucky enough to get one subject matter expert capable of doing Tier 1 (i.e. Symmetrix), one for Tier 2 (Clariion), and one for NAS (Celerra).

But throw in HDS, IBM DSxxxx, XiV, IBM GPFS, IBM HPSS, NetApp, SONAS, Sun StorEdge, etc. etc. etc.  And what do you have?

You either have an overworked staff (and as I’ve discussed, union protected salaried federal employees aren’t known for 70 hour weeks) or stuff just plain doesn’t get done.

If you don’t spend the money on staffing, you *WILL* spend the money in support and professional services.  Now support is one thing.  If my XiV or Symm or whatever loses a hard drive, I expect the vendor to own that problem and fix it.

They will *NOT* however send people out to help with day-to-day provisioning without a pretty hefty P.O. associated with it.

And the last reason for single-vendor options is simple.  I want stuff that is going to work together.  Now yes, functionality costs, but one of the things I like about EMC is that when it comes down to it, it *ALL* works together.  I can move data from Symm to Clariion or vice-versa using SanCopy, and I can migrate fileservers to Celerra and between storage tiers as needed.

There is nothing worse than needing to expand one storage system by 20TB and having the storage somewhere else, but unusable.  It means you’re wasting money buying storage you already have.  (Especially when your purchase cycle is 4-6 months on average.)

Not a happy thing to explain to the boss.

“Yes we have 80TB of Clariion available, but the IBM DS4800 is running short so I need to spend an extra $100k on disks.”

“Yes, I know this isn’t budgeted, but the data grew faster than we’d expected.”

(Of course, you can span filesystems across arrays, as long as it’s not replicated data, because you can’t get a consistent split when half of your extents are on one array and half on another)

On tape…

Friday, October 23rd, 2009

Ok, I have no problem with tape.  It’s a *GREAT* backup medium when your requirement is portability for massive amounts of data and you’re not replicating said data.

If I had to ship 400TB of backups to Iron Mountain to protect against the earthquake-to-end-all-earthquakes, tape would be my FIRST choice (though maybe, as a GIANT CAVE, Iron Mountain might not be.) ;-)

But… (and this is where it gets fun)

I have a customer who *LOVES* tape.

Wants-to-have-its-children loves it.

Uses-it-as-primary-storage loves it.

Now if you:

A> Have a few hundred terabytes of data to Archive.

B> Have millions of dollars to spend on giant room-sized storagetek libraries, and the space, power, and cooling that that entails.

C> Really love tape.

and most importantly

D> Live in the early 1980s

Then Archival to tape is *SO* the way to go.

The argument given is as follows.  “Tape is cheaper than Disk”

Well yes, on a terabyte for terabyte scale tape might be cheaper…maybe if you exclude the hardware.

But consider something along the lines of EMC’s Atmos product, or even Centera.  (I’d even go so far as to say the NetApp box appealed to me at one point.  Now that the Celerra supports File Level Retention, I’ve been cured of that.)

Because when you throw in modern options like replication and, dare I say it, DEDUPLICATION, disk rapidly becomes the better, faster, more cost effective way to store your long-term data.

Now I wouldn’t recommend anyone go out and buy a DMX-4 for archival purposes.  (Though if you want to, let me know ahead of time so I can buy some EMC stock.  I’m not currently holding any.)

I checked, and the only Tape vs. Disk comparisons I could find on-line were done by storage vendors, each of which has its own agenda (and big surprise, the analysis came out favouring whatever they were selling), so none of them are valid in the grand scheme of things.  (I have a few things to say about marketing and statistics, but that’s a different post.)

The things I look for when judging where to store data…

A> How many copies of the data do I need?

This is often overlooked and a question not asked.  How many copies of a piece of data do you really need?  And how many do you currently have?  I’ve been in one data center recently where they LITERALLY have boxes of old tapes stacked up along the walls.  (Note: Storing your backups WITH the system you’re backing up doesn’t do much in the event of a fire or natural disaster)

B> How long do I need to keep the data?

Retention policies are a big catch for a lot of people.  For “Backup” purposes (see my last post) I say two full backups are all that is really required.  If there is any kind of likelihood that some critical corruption could be missed for weeks (or months), then adjust your backup strategy accordingly (or find a better way of auditing your production data for errors).

C> Does my data have to be portable?

Ok, this is aimed specifically at Tape.  The answer is this.  If you have a remote DR facility and a high-speed connection to it, there is absolutely NO REASON to go to tape for portability.  By virtue of Replication (whether it be the production data or VTL) you’ve already moved your data off-site.  Now if you’ve only got one data centre and it’s sitting right on the San Andreas fault line (I’ve actually worked in one, not joking) then send tapes off-site.

Lots of them.

5 or 6 times a day if you can.

D> Am I storing a copy of production or my only copy?

If you’re storing a copy of production (running) then chances are you’re not going to need the backup.  If you’re protecting yourself against someone hitting the delete key accidentally, then maybe Celerra (SnapSure: periodic checkpoints that even the users can access themselves) or Centera (Don’tEvenThinkAboutDeletingThis) are better options.

If you’re storing a copy of something so you can make room for something else, then backup tape is probably not your best option.  Consider an archiving solution like Atmos or Centera, or even a Celerra with File Level Retention enabled.  Version 5.6.44 and later supports de-duplication (both single-instance storage and compression) natively.

E> Do I have the money to spend now, or am I willing to spend more over time to keep the initial investment down?  (This is a valid question, and I’d like to know if anyone has any ideas on which would be the cheaper initial investment.)

Just remember that you have to count the floor-space as well.  Something many people forget when scoping out storage buys.

If I want 150TB of storage and I want to do it with tape, what’s the supporting hardware going to cost me?  (A single CX4-240 with one rack of disks can provide up to about 220TB of storage with current drive sizes.)

A final note.  Remember with any “portable” backup solution that you have to keep your backups safe.  Tapes, like disks, don’t respond well to things like…well…dropping.  Anytime you transport a medium from one location to another physically you put that data at risk.

Just my two cents.

How to tell if your sales rep hates you….

Friday, May 22nd, 2009

I just got the following job posting and it made me, literally, laugh out loud, spitting latte all over my laptop.

If your sales rep allows you to do something like this, it’s a fair bet that s/he hates you (or is planning to buy your company out of bankruptcy later).

“WANTED: VMWare 1-month resident to assist with new deployment/planning around 200VM’s and new Celerra NS480’s being purchased by client. Will probably end up primarily being VM’s using NFS on NS Celerra Replication will be enabled between (2) NS480’s.”

The key points are:

200VM’s

Celerra

**NFS**

Replicator

Ewww…..

Did I mention NFS?

Someone actually sold this?  Even if the customer comes to you direct and says “this is what I want…” the answer should be “In the interests of protecting you from yourself, I can’t allow you to do this.”

I don’t care how much the deal is worth.

Recoverpoint vs. Conventional Replication

Tuesday, March 17th, 2009

Ok – I see the surface benefits of a third party replication appliance, such as Recoverpoint. I even sat through a long discussion on it. I’m still totally on the fence, but there are a few questions I need satisfactorily answered before I can recommend this to my customer.

I want to ask my readers (all five of you): what’s your take on Recoverpoint, or any appliance-based replication standard, vs. array-based SRDF, Mirrorview/S, and the like?

There were a few points in the presentation that I had to cry foul.

“There is no appreciable latency in the appliance.”

Sorry, that doesn’t wash. Any time you put an intermediary device into a fibre path, you’re going to introduce latency. The real question is “Is the latency introduced by said appliance mitigated by the compression gain?”

“Fully Synchronous Replication”

Synchronous replication exists when the IO is not acknowledged back to the host until the write is completed to disk on BOTH SIDES. Because of obvious latency issues this is not possible without A> sufficient bandwidth and B> low latency. Since the Recoverpoint product seems to respond to the source array as soon as the write is journaled, there is a potential for data loss if the source building becomes a smoking hole in the ground. (Or even if something as simple as a regional power failure happens, provided such a failure affects the circuit as well as the datacenter.)
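
To put a rough number on the latency concern, here is a minimal sketch. It assumes light in fibre covers roughly 200 km per millisecond and that every synchronous write costs at least one full round trip to the remote site. The function name and the optional appliance latency parameter are mine for illustration and are not taken from any vendor's documentation.

```python
# Rough lower bound on the write latency added by synchronous replication.
# Assumption: light in fibre covers about 200 km per millisecond (one way),
# and each write costs at least one full round trip to the remote site.

FIBRE_KM_PER_MS = 200.0  # approximate propagation speed in optical fibre


def sync_write_penalty_ms(distance_km: float, round_trips: int = 1,
                          device_latency_ms: float = 0.0) -> float:
    """Minimum extra latency per write, in milliseconds.

    device_latency_ms is a placeholder for whatever an in-path appliance
    adds; it is an input to this sketch, not a measured figure.
    """
    one_way_ms = distance_km / FIBRE_KM_PER_MS
    return round_trips * 2 * one_way_ms + device_latency_ms


if __name__ == "__main__":
    for km in (10, 100, 1000):
        print(f"{km:>5} km: at least {sync_write_penalty_ms(km):.2f} ms per write")
```

Distance alone sets a floor under every acknowledged write, before any appliance or array overhead is counted.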

Anyway – I really need to know – are my concerns off base or am I just being protectionist of the technologies I already know?

Open Replicator

Tuesday, September 9th, 2008

Pretty cool toy.  Essentially it’s SanCopy for Symmetrix.

So a customer is particularly reluctant about upgrades, to the point that they are going from an 8830 directly to a DMX-4.  This is an easy data migration, but they have a handful of test/dev servers that are going to stay on the 8830 for the time being.

So enter Open Replicator.  It actually turned out to be easier than I expected.  Zone the FA’s, mask the source FA’s to the target LUNs, create the device pairing file, and issue a couple of simple commands on the “Control” (usually source) side to push the data across.

The fun part is of course that, like SanCopy, if you mount a filesystem on the target side, you have to issue a full sync.  After all, SanCopy isn’t like SRDF: there is no persistent cache table tracking changed tracks on the target side.  When a track changes on the target device, the only way it’s going to get copied is if it’s changed on the source side as well.

Very cool.  I always like playing with new toys.

Clariion – Mirrorview – Cisco – FCIP

Tuesday, June 17th, 2008

Got into a scary situation this week.  Got called in to help a customer with a Mirrorview implementation.

Situation was:  Customer had Mirrorview/S set up within the existing switch environment, replication worked perfectly.

Then they reconfigured the switches to run FCIP so they could start replication to a remote site.  This is where things went badly.

First off – Cisco sets the Gig/E ports on the 9216i for jumbo-frames.  (MTU defaults to 2300)

This is a great idea for Fibre Channel replication, because a Fibre Channel frame is 2114 bytes and this allows an entire FC frame to be sent within an ethernet packet.

Problem is that the default MTU on most network environments is 1500.  Now the *REAL* problem is that when you first connect the Gig/E ports on the 9216i to a 6509 or other switch, it will at first appear to work perfectly…..until you try to pass data.

When you try to pass data across this link, the DF (Don’t Fragment) bit is set and the larger frames get dropped.  This causes an ISL connection between switches to flap, which causes no end of issues.  The fabrics will segment and re-join repeatedly until the first time you do anything that causes a reconfiguration, like updating the zoneset.  If you do that during a cycle where the ISL is going up and down, the VSANs will fragment and stay fragmented because the switches will not be able to re-merge the fabrics.
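
The "works until you pass data" behaviour falls out of simple arithmetic, sketched below. The 2114-byte frame size and the 2300 and 1500 MTUs are the figures quoted in this post. The encapsulation overhead is a hypothetical placeholder, not a measured value, and the function names are mine.

```python
# Will a full-size FCIP packet survive the path without fragmentation?
# Frame size and MTUs are the figures quoted in the post; ENCAP_OVERHEAD is
# a hypothetical allowance for IP/TCP/FCIP headers, not a measured value.

FC_FRAME_BYTES = 2114
ENCAP_OVERHEAD = 100


def path_mtu(hop_mtus):
    """The usable MTU of a path is the smallest MTU of any hop on it."""
    return min(hop_mtus)


def dropped_with_df(packet_bytes, hop_mtus):
    """With Don't Fragment set, a packet larger than the path MTU is dropped."""
    return packet_bytes > path_mtu(hop_mtus)


if __name__ == "__main__":
    full_packet = FC_FRAME_BYTES + ENCAP_OVERHEAD
    paths = {
        "jumbo frames end to end": [2300, 2300, 2300],
        "default 1500 in the middle": [2300, 1500, 2300],
    }
    for name, hops in paths.items():
        for label, size in (("small control packet", 64), ("full data packet", full_packet)):
            verdict = "DROPPED" if dropped_with_df(size, hops) else "ok"
            print(f"{name}: {label} ({size} bytes) -> {verdict}")
```

Small control packets fit under either MTU, which would explain why the link looks healthy right up to the moment real data starts moving.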

So I come into this situation and the switches are so badly configured that it takes me a day just to get the ISL’s up and stable.  I set the MTU to 1500 on the switches, took the gig-e links down, and went to each switch and (carefully) deleted each VSAN that didn’t belong on that switch.  (In addition to this being set up incorrectly, all three VSANs merged when the switches were first connected due to the ISL’s being configured incorrectly.)

Now the Clariion issue is still open.  A normal mirrorview configuration is as follows:

Source_SPAx –> Target_SPAx  (Where ‘x’ is the highest SP port #)

Source_SPBx –> Target_SPBx   (same here)

Now when the customer’s Clariions are zoned this way (in this case SPB3 to SPB1) nothing shows up in the Connectivity Status window.  But when I reverse the zoning, running SPA3 to SPB1, it shows up fine.  (Unfortunately Mirrorview doesn’t work in that configuration.)

That’s where we stand.  A “simple 15 minute FCIP fix” is coming to the end of its third day.

Frying Pan / Fire

Friday, January 25th, 2008

Well – after moving along for a few weeks at a nice leisurely pace, I find myself working on six different projects.  Loads of fun, especially when three of them are just similar enough to get the details confused.

Got an iSCSI install next weekend though, this should be interesting.  I think I have a handle on how the Celerra does iSCSI, so the only real trick will be setting up the hosts correctly.  It’s a mixture of RHEL3, RHEL4, and CentOS5, which makes it even more interesting.

Another thing I got to play with last week was the McData Eclipse series FCIP router.   When tied to a pair of Brocade switches (one on each side of the FCIP tunnel) I found them to be almost impossible to use.  I’ve got quite a bit of FCIP experience in different replication scenarios, and it still took me almost 6 hours to get these connected.  Talk about having to pay attention to detail, this was painful.  I’d be infinitely happier with a Cisco 9216i with a 14/2 blade in it and be done with it.  Eliminates the need for 90% of the make-work that had to be done to get this thing running.

In McData’s defense though, I came in unprepared.  It was my understanding going into the engagement that the McData was already set up.  I came in and found that not only was it not set up for FCIP (it was set up for iSCSI), the tunnel hadn’t been built.  So all we really had was a CE who came in, set the IP’s and ran like heck for the door.

I also had a Celerra NSX install that same weekend.  The NSX is an interesting piece, very much like the old CNS boxes, albeit much smaller/faster.  Modular setup makes it very expandable.

What I don’t get, is why, in this day and age, you are still required to use a floppy to boot the control-station installation CD.  Bootable CD’s have been around for quite some time, and in fact the NS502g I ran when I was at the evil empire even booted perfectly off the CD.

The NSX, however, required a serial console connection into the server, and had the bios locked so you couldn’t change the boot order.  Add to that the fact that EMC put the wrong crossover serial cable in the box with it (Male-Female), and a 2 hour install ended up taking 8 hours when you factored in driving around looking for a USB floppy drive (to create the boot disk) and a null-modem adapter (first time I’ve set foot in a Radio Shack in 10 years).

This week I’m off to sunny (?) Florida to do an ECC install and then off to Texas to do the NS20/iSCSI install.  *THAT* should be worth writing about. :)

Thanks.

The beatings will continue…

Tuesday, October 23rd, 2007

…Until morale improves.

Trying to run 4 FCIP trunks over half a DS3 is a lot like raising a teenager.

Sometimes it looks like it’s working, but in reality it’s just screwing around playing video games.

Actually, my favorite is that “Raising a teenager is like trying to nail JELL-O to a tree”  I’m feeling about the same level of frustration.

What’s basically happening is that the link is fine, as long as we’re not doing anything silly like, oh, PASSING DATA over it.  The minute we start moving data the link gives up and goes to Palm Beach for the holiday.

I tried to explain to both EMC and the customer at the start of this engagement that replicating four Symms over even a full DS3 is very…optimistic.

So I’ve spent the last three days solid beating my head over this, more than 18 hours a day (except for yesterday, which involved 7 hours + 8 hours travel time).

Cisco FCIP and SRDF

Friday, October 19th, 2007

Been a while since I’ve written anything – I’m not even sure if I still have a readership.

I’ve been working an average of 60 hours a week on a single project these days.  Doing a datacenter migration and consolidation.  Basically moving 4 Symm-5 generation systems into a single DMX-3.

The funniest part of this has been learning the DMX-3, which I’ve not had a lot of stick-time with.  It seems like a great machine, a good hybrid of the Clariion and the Symmetrix.  I don’t much care for the DAE back-end, too many major points of failure, too many cables.  (Though when you do your first code-load on one, it sure gives you a work-out as far as learning what plugs in where.)

Anyway, as the title suggests, we’re doing a large part of this migration using temporary hardware, in the form of the Cisco MDS9216i.  This is a normal MDS 92xx chassis (2-slot) with a 14/2 FCIP blade in it.  Simply 14x4gbit FC ports and 2xGig-E ports on the same blade.  So far it’s been one challenge after another, and as of this posting we still don’t have the Georgia and New York datacenters talking to each other.

Part of the problem is the customer’s network infrastructure.  Namely it sucks.  For those who don’t know, Gig-E ports on the Cisco don’t negotiate down, they are essentially 1000-SX ports.  The customer, who makes a substantial part of their income off network traffic, doesn’t have a single Gig-E port in the entire datacenter. – that was problem number one.

Problem #2 was, in the datacenter that does have Gigabit available, namely the new one in Georgia, there is no optical available.  So we have to go through the painful process of getting an RPQ (an inexact definition is “Request for Price Quote”; what it really means is getting engineering to bless the configuration) to use copper SFP’s on the MDS switches.

*THEN* we find out we’re replicating over a DS3 circuit, and even at that, that we have to “nice” our hardware down to 12.5Meg/Sec so as to not affect their production traffic, which is (of course) running on the same network.  (SRDF has a nasty habit of sucking up all available bandwidth.)

Do you know how long it takes to replicate terabytes of data at 12Meg?  LOL
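
For the curious, a back-of-the-envelope sketch. Since "Meg" could mean megabytes or megabits per second, the function takes either. It assumes decimal units (1 TB = 10**12 bytes), a perfectly sustained rate, and no protocol overhead, so real numbers would be worse. The function name is mine.

```python
def replication_days(terabytes: float, rate: float, unit: str = "MB/s") -> float:
    """Days needed to push `terabytes` of data at a sustained rate.

    unit is "MB/s" (megabytes per second) or "Mbit/s" (megabits per second).
    Uses decimal units (1 TB = 10**12 bytes) and ignores protocol overhead.
    """
    bytes_per_second = {"MB/s": rate * 1e6, "Mbit/s": rate * 1e6 / 8}[unit]
    seconds = terabytes * 1e12 / bytes_per_second
    return seconds / 86400


if __name__ == "__main__":
    for unit in ("MB/s", "Mbit/s"):
        for tb in (1, 10):
            days = replication_days(tb, 12.5, unit)
            print(f"{tb:>3} TB at 12.5 {unit}: {days:.1f} days")
```

Under those assumptions a single terabyte takes a bit under a day at 12.5 megabytes per second, and over a week at 12.5 megabits per second.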

This is going to be fun.  I’ll keep you posted.

 

SRDF woes….

Thursday, April 12th, 2007

Ok, for about 4 months I’ve had an SRDF problem that has been kicking my ass. 

SRDF over Ethernet presents a host of new problems, unique to Ethernet. 

First, and most importantly, the RE adapters want to see *ALL* of the remote adapters in the RDF group.  That means if you have four RDF/Ethernet (RE) boards in the source box and four RE boards in the target, the Symm expects EACH source RE to see *ALL* target RE’s.  It allows for a nice load balancing scenario, but doesn’t do much to allow for either dual paths or direct attach connections.
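
A quick way to sanity-check an addressing plan against that every-source-sees-every-target requirement is sketched below. It assumes no gateway is configured, so an RE can only reach addresses inside its own subnet. All IP addresses are made up for illustration, and the function name is mine.

```python
import ipaddress
from itertools import product


def unreachable_pairs(source_res, target_res):
    """Return (source, target) pairs that do not share a subnet.

    Each RE is given as an interface string such as "10.0.1.11/24".
    With no gateway configured, a source RE can only reach target REs
    inside its own subnet.
    """
    missing = []
    for src, dst in product(source_res, target_res):
        src_if = ipaddress.ip_interface(src)
        dst_if = ipaddress.ip_interface(dst)
        if dst_if.ip not in src_if.network:
            missing.append((src, dst))
    return missing


if __name__ == "__main__":
    # Hypothetical plan 1: each RE shares a subnet only with its partner.
    paired_src = [f"10.0.{n}.11/24" for n in range(1, 5)]
    paired_dst = [f"10.0.{n}.21/24" for n in range(1, 5)]
    # Hypothetical plan 2: all eight REs in one flat subnet.
    flat_src = [f"10.0.0.{n}/24" for n in range(11, 15)]
    flat_dst = [f"10.0.0.{n}/24" for n in range(21, 25)]

    print("paired subnets:", len(unreachable_pairs(paired_src, paired_dst)), "of 16 pairs unreachable")
    print("flat subnet:   ", len(unreachable_pairs(flat_src, flat_dst)), "of 16 pairs unreachable")
```

With one subnet per RE pair, only the four partner links can talk; a single flat subnet covers all sixteen.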

The second problem I found was an undocumented feature of Solutions Enabler 6.3.1 which caused problems querying SRDF devices, and more importantly, problems running any configuration change script against an RDF device.

For instance, the following config change script took about 45 minutes before it timed out and died without error.

convert dev 0db to 2-way-mir ;

Now some will recognize this as a simple “de-configuration” of an SRDF volume.  But because the two Symms have to be correctly communicating with each other for this to work, the script simply failed.

I went in to work this morning determined to kick this thing once and for all.

First off I picked up a couple of 29xx series Cisco switches, configured them with ports 1-4 in a trunk leaving the 4 optical ports on each switch for the Symmetrix connections.  This allowed me to put all four connections on each Symm into one subnet.

Then I realized that that’s NOT in fact how the Symm was configured.  The CE left the Symm configured (granted, per my suggestion) so that each RE was in a subnet with only the corresponding RE on the remote side.  My understanding was that if there was no gateway it would not try to contact the other (target) RE’s.  (Bad assumption.)

So I did the one thing neither Customer-me nor Consultant-me should ever do.  I did a binfile change on the Symm without contacting EMC and putting it through any form of change control.  It’s not a difficult thing to do a binfile change, provided you know the passwords to get into Symmwin and have a basic understanding of how a Symm functions.

It’s just frowned upon by EMC because they don’t like the idea that they are not the gods of the universe that they purport themselves to be.  (for binfile changes not related to initial setup they also expect to be engaged professionally, for approximately $5k per change – if I were smart, and had the time to spare, I’d start offering my services at half that price)

After the binfile change all lights as far as SRDF went were green, but my config change was still hanging.  A brief email exchange with one of the SAC’s better engineers (I’ve worked with him before) pointed me very quickly to the Solutions Enabler upgrade.

The 6.3.2 upgrade did the trick, and is now in the process of being pushed out to all servers.
