50Micron.com

DR/COOP

On tape…

by Jesse on Oct.23, 2009, under Backup, Best Practices, DR/COOP, Replication, Tape, Worst Practices

Ok, I have no problem with tape.  It’s a *GREAT* backup medium when your requirement is portability for massive amounts of data and you’re not replicating said data.

If I had to ship 400TB of backups to Iron Mountain to protect against the earthquake-to-end-all-earthquakes, tape would be my FIRST choice (though maybe, as a GIANT CAVE, Iron Mountain might not be). ;-)

But… (and this is where it gets fun)

I have a customer who *LOVES* tape.

Wants-to-have-its-children loves it.

Uses-it-as-primary-storage loves it.

Now if you:

A> Have a few hundred terabytes of data to Archive.

B> Have millions of dollars to spend on giant room-sized StorageTek libraries, and the space, power, and cooling that they entail.

C> Really love tape.

and most importantly

D> Live in the early 1980s

Then Archival to tape is *SO* the way to go.

The argument given is as follows.  “Tape is cheaper than Disk”

Well yes, on a terabyte-for-terabyte scale tape might be cheaper… maybe, if you exclude the hardware.

But look at something along the lines of EMC’s Atmos product, or even Centera.  I’d even go so far as to say the NetApp box appealed to me at one point.  (Now that the Celerra supports File Level Retention, I’ve been cured of that.)

When you throw in modern options like replication and, dare I say it, DEDUPLICATION, disk rapidly becomes the better, faster, more cost-effective way to store your long-term data.

Now I wouldn’t recommend anyone go out and buy a DMX-4 for Archival purposes.  (Though if you want to, let me know ahead of time so I can buy some EMC stock.  I’m not currently holding any.)

I checked, and the only Tape vs. Disk comparisons I could find on-line were done by storage vendors, each of which has its own agenda (and big surprise, the analysis came out favouring whatever they were selling), so none of them are valid in the grand scheme of things.  (I have a few things to say about marketing and statistics, but that’s a different post.)

The things I look for when judging where to store data…

A> How many copies of the data do I need?

This question is often overlooked, or never asked at all.  How many copies of a piece of data do you really need?  And how many do you currently have?  I’ve been in one data center recently where they LITERALLY have boxes of old tapes stacked up along the walls.  (Note: Storing your backups WITH the system you’re backing up doesn’t do much in the event of a fire or natural disaster.)

B> How long do I need to keep the data?

Retention policies are a big catch for a lot of people.  For “Backup” purposes (see my last post) I say two full backups are all that is really required.  If there is any kind of a likelihood that some critical corruption could be missed for weeks (or months), then adjust your backup strategy accordingly (or find a better way of auditing your production data for errors).

C> Does my data have to be portable?

Ok, this is aimed specifically at Tape.  The answer is this.  If you have a remote DR facility and a high-speed connection between the two sites, there is absolutely NO REASON to go to tape for portability.  By virtue of Replication (whether it be the production data or VTL) you’ve already moved your data off-site.  Now if you’ve only got one data centre and it’s sitting right on the San Andreas fault line (I’ve actually worked in one, not joking) then send tapes off-site.

Lots of them.

5 or 6 times a day if you can.
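To put a number on the "high-speed connection" part of the portability question, here is a rough back-of-the-envelope sketch.  The link speeds and the 70% efficiency figure are my own assumptions for illustration, not measurements from any real environment, and the function name is made up.

```python
def transfer_days(data_tb: float, link_gbps: float, efficiency: float = 0.7) -> float:
    """Days needed to push data_tb (decimal terabytes) over a link_gbps link.

    efficiency is an assumed fraction of line rate that is actually usable.
    """
    bits = data_tb * 1e12 * 8                  # terabytes -> bits
    usable_bps = link_gbps * 1e9 * efficiency  # usable bits per second
    return bits / usable_bps / 86400           # seconds -> days


# Assumed link speeds, purely for comparison.
for gbps in (1, 10):
    print(f"400 TB over {gbps} Gbit/s: {transfer_days(400, gbps):.1f} days")
```

A full copy of a few hundred terabytes takes a long time even on a fast pipe, which is why the question only has a clean answer once replication is already carrying the changes as they happen.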

D> Am I storing a copy of production or my only copy?

If you’re storing a copy of production (running) then chances are you’re not going to need the backup.  If you’re protecting yourself against someone hitting the delete key accidentally, then maybe Celerra (SnapSure: periodic checkpoints that even the users can access themselves) or Centera (Don’tEvenThinkAboutDeletingThis) are better options.

If you’re storing a copy of something so you can make room for something else, then backup tape is probably not your best option.  Consider an archiving solution like Atmos or Centera, or even a Celerra with File Level Retention enabled.  Version 5.6.44 and later supports de-duplication (both single-instance storage and compression) natively.

E> Do I have the money to spend now, or am I willing to spend more over time to keep the initial investment down?  (This is a valid question, and I’d like to know if anyone has any ideas on which would be the cheaper initial investment.)

Just remember that you have to count the floor space as well, something many people forget when scoping out storage buys.

If I want 150TB of storage and I want to do it with tape, what’s the supporting hardware going to cost me?  (A single CX4-240 with one rack of disks can provide up to about 220TB of storage with current drive sizes.)
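The "spend now or spend more over time" question is really a break-even calculation.  Here is a minimal sketch of it.  Every dollar figure below is a made-up placeholder, not a quote for any product; the yearly figure is where floor space, power, cooling and media would go.

```python
def cumulative_cost(up_front: float, per_year: float, years: int) -> float:
    """Total spend after a number of years: purchase plus running costs."""
    return up_front + per_year * years


def break_even_year(a_up, a_year, b_up, b_year, horizon=10):
    """First year in which option A's running total is no higher than option B's.

    Returns None if A never catches up within the horizon.
    """
    for year in range(horizon + 1):
        if cumulative_cost(a_up, a_year, year) <= cumulative_cost(b_up, b_year, year):
            return year
    return None


# Placeholder numbers only: option A costs more up front, option B costs more to run.
print(break_even_year(a_up=500_000, a_year=50_000, b_up=200_000, b_year=150_000))
```

Plug in real quotes for both options and the retention period you actually need, and the answer to E> usually stops being a matter of opinion.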

A final note.  Remember with any “portable” backup solution that you have to keep your backups safe.  Tapes, like disks, don’t respond well to things like…well…dropping.  Anytime you transport a medium from one location to another physically you put that data at risk.

Just my .02 cents.

Storage Tiering…

by Jesse on Jul.09, 2009, under "Cloud", Backup, Celerra, Centerra, Clariion, DR/COOP, ILM, RAID, Symmetrix

Ok, given the changes to the storage arena I’ve been working on a revised “Tiering system” to incorporate all of the levels of data…importance?

My version of Storage Tiering is (or should be) as follows:

  • Tier-1   – Symmetrix/Replicated – High Performance/Critical Data
  • Tier-2   – Symmetrix/NonReplicated – High Performance/Non-Critical Data
  • Tier-3   – Symmetrix/SATA/Replicated – High-Medium Performance/Critical Data
  • Tier-4   – Symmetrix/SATA/NonReplicated – High-Medium Performance/Non-Critical Data
  • Tier-5   – Clariion/FC/Replicated – Medium Performance/Critical Data
  • Tier-6   – Clariion/FC/NonReplicated – Medium Performance/Non-Critical Data
  • Tier-7   – Clariion/SATA/Replicated – Low Performance/Critical Data
  • Tier-8   – Clariion/SATA/NonReplicated – Low Performance/Non-Critical Data
  • Tier-9   – CelerraNAS/Replicated – Network Attached/Critical Data
  • Tier-10  – CelerraNAS/NonReplicated – Network Attached/Non-Critical Data
  • Tier-11  – Atmos – Network Attached / Low Performance
  • Tier-12  – Centera (Content Addressable Storage) – Low Performance Archive / Highly Available
  • Tier-13  – Primary Tape-In-Library (Automatic loading on demand via HSM)
  • Tier-14  – Primary Tape-Out-Of-Library (Manual Intervention Required)

“Critical Data” vs. “Non-Critical Data” is simply a matter of how long you can be without the data should a failure or accidental deletion occur, as all data is available in Tier8/9 storage (in theory).
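Tiers 1 through 10 follow a simple pattern: each platform gets a pair of tiers, replicated for critical data and non-replicated for everything else.  A minimal sketch of that lookup follows.  The function and list names are mine, and tiers 11 to 14 are left out because they don't follow the pairing.

```python
# Platforms in the order they appear in the tier list (tiers 1-10 only).
PLATFORMS = [
    "Symmetrix",
    "Symmetrix/SATA",
    "Clariion/FC",
    "Clariion/SATA",
    "CelerraNAS",
]


def tier_for(platform: str, critical: bool) -> int:
    """Tier number for a platform: critical data lands on the replicated tier."""
    index = PLATFORMS.index(platform)  # raises ValueError for unknown platforms
    return 2 * index + (1 if critical else 2)


print(tier_for("Clariion/FC", critical=True))
print(tier_for("Clariion/SATA", critical=False))
```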

I’ve also considered using Tier1/Tier1B to describe DMX storage vs. Clariion storage, given that there is a LOT of overlap in performance characteristics these days…

Oh, and iSCSI would be somewhere between 10 and 13….

Any thoughts?

Priceless…..

by Jesse on Jun.07, 2009, under DR/COOP

6x Dell 1850 Servers - $8,000.  One two-post Rack - $500.  One Furniture Moving Strap from Home Depot - $5.99.  Earthquake-proofing your server rack - Priceless.

Just Sayin……

Hours lost….

by Jesse on Aug.22, 2008, under DR/COOP

So we have 12 500G SATA disks in a DMX-4…  We carved them into 147 29G hypervolumes, and the VTOC took over 7 hours.

The fun part happened when the symconfigure script died.  Had to have the lab dial into the Symm and step the process through to the end.

The cool part is that we were in the middle of a fail-back.  Other than having to shuffle the steps a bit (we did the swap last instead of during the failover), the customers absolutely didn’t notice any change.

In fact, we had a 2 hour window for the fail-back, and it was completed in 65 minutes.

However, it wasn’t until after 4:30am that the VTOC completed and we were able to issue the R1/R2 swap command.

Ouch – I’m tired.

The Fake Synchronous SAN

by Jesse on Feb.06, 2007, under DR/COOP, Gripe, Replication, Switches

In response to this article on “EnterpriseStorageForum”:

 Synchronous SAN Sets Fibre Channel Distance Record

My Response:

True Synchronous transmission works over any distance, if you can live with the latency.  The problem is that most host operating systems can’t.  So different buffering schemes are cooked up to fool the host into thinking the write is complete on both sides when in fact it’s not.

Any time you get over about 30km, the latency (that is, the time it takes for the IO to be transmitted, acknowledged, and released) becomes about that of a normal unbuffered physical drive, about 20-30 ms.

Any further and you will start seeing slower and slower response times and eventually IO timeouts on the source hosts.

In order for a storage system to be truly “synchronous”, the array cannot acknowledge the I/O to the host until it has the write ACK from both the source *AND* the target array.  If there is buffering going on between point A and point B, such as a Cisco MDS with the buffer credits cranked up or a Nishan IPS3300, it is not truly synchronous replication, because the failure of the switch on the source (or target) side will cause the target array to have missed I/Os that have already been acknowledged to the host as complete.
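The failure mode is easy to show with a toy model.  This is only a sketch: the classes below are invented for illustration and do not model any real switch or array, just the difference between an ACK from the target and an ACK from something sitting in front of it.

```python
class Array:
    """Toy storage array: a dict of block address -> data."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}
        self.online = True

    def write(self, lba, data):
        """Store the block and return True (the ACK), or False if unreachable."""
        if not self.online:
            return False
        self.blocks[lba] = data
        return True


class BufferingLink:
    """Hypothetical in-path device that ACKs before the target has the data."""

    def __init__(self, target):
        self.target = target
        self.pending = []

    def write(self, lba, data):
        self.pending.append((lba, data))
        return True  # early ACK: the target has not seen this write yet

    def drain(self):
        """Forward buffered writes to the target array."""
        while self.pending:
            lba, data = self.pending.pop(0)
            self.target.write(lba, data)

    def fail(self):
        """Simulate losing the device: buffered writes are gone."""
        self.pending.clear()


def host_write(source, remote, lba, data):
    """ACK the host only if the source and the remote side both ACK."""
    return source.write(lba, data) and remote.write(lba, data)


source, target = Array("source"), Array("target")
link = BufferingLink(target)

acked = host_write(source, link, 100, b"payment")
link.fail()  # the buffering device dies before draining
print("host saw ACK:", acked, "| target has block:", 100 in target.blocks)
```

The host was told the write was safe on both sides, and the target never saw it.  Point `host_write` straight at the target array instead of the link and the ACK only comes back once the block is really there.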

Sorry, but this test, as it appears to have been run, was obviously designed by the various vendors to accentuate their hardware without showing the failures and flaws in the logic.  I’m sure I could walk in there and in about 5 minutes simulate a link failure that would leave the remote site in an inconsistent state at worst, or having to roll back journaled IOs at best.

RPO vs. RTO

by Jesse on Oct.25, 2006, under Backup, DR/COOP, Replication

An engineer friend of mine (a real engineer, not affiliated with computers) once told me:

 “There are three options:

     1. You can have it faster.
     2. You can have it smaller.
     3. You can have it cheaper.

….Now pick any two.”

Over and over in my life I’ve put that theory to the test and to this day it has always held true.  The smaller and faster something is, the more expensive it gets.  The cheaper something is, the slower and less portable it is.

Disaster Recovery and COOP (Continuity of OPerations, for the layman) follow a lot of the same rules.

There are three main criteria you’re aiming for.  The first two are RPO and RTO: “Recovery Point Objective” and “Recovery Time Objective”.

The third is, of course, cost.

RPO is defined by the point in time to which you need to be able to recover.  Some goals are easy to meet: “Midnight on the morning of the failure” is usually pretty easily obtainable, as you can do that by restoring from backups.  Financial institutions aim for somewhat stricter objectives.  Most banks will require an RPO of “Zero”, meaning “I want to see the last committed transaction on my DR site in the event the source site becomes a smoking hole in the ground.”

This is doable of course, provided the DR site is close enough to the source site to run dark fibre between the two with low enough latency to add negligible impact to production.  (The rule of thumb for synchronous replication is 2ms per 10km; that is, for every 10 kilometers you’re adding 2ms of latency.  A normal physical drive has a latency of about 9-14ms, so if you go too far you’re going to slow your system to a crawl.)
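The rule of thumb is easy to turn into a quick calculator.  This sketch uses only the two figures above (2ms per 10km and a 9-14ms drive).  Simply adding the link latency on top of the drive latency is my own simplification, and the distances in the loop are arbitrary examples.

```python
MS_PER_10_KM = 2.0          # rule of thumb quoted above
DRIVE_LATENCY_MS = (9, 14)  # "normal physical drive" range quoted above


def added_latency_ms(distance_km: float) -> float:
    """Extra latency per synchronous write at a given site separation."""
    return distance_km / 10 * MS_PER_10_KM


# Example distances only; pick your own.
for km in (10, 20, 50, 100):
    extra = added_latency_ms(km)
    low, high = (drive + extra for drive in DRIVE_LATENCY_MS)
    print(f"{km:>4} km: +{extra:.0f} ms, write latency roughly {low:.0f}-{high:.0f} ms")
```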

RTO is defined as “how long can I afford to have my environment down to effect a failover?”  I’ve worked in one environment where transaction logs were backed up to tape and shipped across country from the L.A. area to Orlando, Florida, where the tapes were then restored into a standby system.  The recovery time to a 15 minute increment was effectively days, because they actually had to wait for the last tape to make it to the target site before they could restore it and bring the system on-line.  It was insane.

Your goal is to get RPO and RTO to as close to zero as possible without bankrupting the budget (or the company).

An RPO of zero can be obtained with a DR site within about 10 kilometers, 20 if you can live with the slower response times in production.  This is full synchronous transfer from one array to another: every write from host to disk has to be acknowledged by the REMOTE array before it is reported to the host that the write is committed.

EMC’s SRDF/A and SRDF/AR mitigate that in environments where the DR site is far enough away as to kill any chance of SRDF/Synchronous working.

SRDF/A is a “packetized” SRDF, where the receiving Symm has to receive two consecutive “checkpoints” before it commits the block of data.  That way, if an incomplete block is received, it’s discarded to prevent data corruption resulting from incomplete write information.  The downside to SRDF/A, of course, is that it requires an insane amount of cache to function properly.  (And don’t let an SE tell you it doesn’t; he’s either lying or not capable of understanding that for the remote Symm to receive a block of data, it has to be able to store it somewhere other than disk until it receives two checkpoints.)
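Here is a toy model of that receive side, built only from the description above and not from EMC’s documentation.  The class and method names are invented.  The point is just to show why the data has to sit in cache: nothing reaches disk until the closing checkpoint shows up.

```python
class AsyncReceiver:
    """Toy receiver: writes sit in cache until two checkpoints bracket them."""

    def __init__(self):
        self.disk = {}           # committed blocks
        self.cache = {}          # blocks waiting for the closing checkpoint
        self.cycle_open = False  # True once an opening checkpoint has arrived

    def on_checkpoint(self):
        if self.cycle_open:
            self.disk.update(self.cache)  # both checkpoints seen: commit
        self.cache = {}
        self.cycle_open = True            # this checkpoint opens the next cycle

    def on_write(self, lba, data):
        if self.cycle_open:
            self.cache[lba] = data        # held in cache, not on disk
        # Writes outside a cycle are ignored in this simplified model.

    def on_link_loss(self):
        self.cache = {}                   # incomplete block is discarded
        self.cycle_open = False


receiver = AsyncReceiver()
receiver.on_checkpoint()
receiver.on_write(1, "complete")
receiver.on_checkpoint()       # closes the cycle, block 1 is committed
receiver.on_write(2, "partial")
receiver.on_link_loss()        # cycle never closed, block 2 is thrown away
print(receiver.disk)
```

Everything written during a cycle lives in `cache` for the whole cycle, which is the cache appetite the SE would rather not talk about.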

SRDF/AR is an automated replication product.  You are essentially mirroring production to a TimeFinder BCV, which is then sent synchronously to the remote site.  You can run a Sync transfer because the BCVs are not connected to the production volumes, and as such the production volumes do not require any ACK/NAK from the remote system.  Depending on the time it takes to replicate (how fast the pipe is between the two sites) you can get RPO to about 10 minutes, which is good enough for most.

The effects of SRDF/AR can be duplicated by anyone proficient in Korn shell, as it literally runs a series of waits and whiles for each stage of the process.  AR has the added bonus that you can actually keep a second set of BCVs on the target host and run your backups from them.

The down-side to the AR type of scenario (whether it be SRDF/AR or a scripted set-up) is that it costs disks, and lots of them.  There are the production volumes (mirrored), the first set of BCVs (unprotected), the SRDF target devices (mirrored or RAID5) and the second set of BCVs.
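The "waits and whiles" structure looks roughly like this.  It’s a sketch in Python rather than Korn shell, and the stage names and the `FakeStage` class are placeholders I made up.  A real script would start the array operation and poll its status where the fake one just counts down.

```python
import time


def wait_until(is_done, poll_seconds=30, sleep=time.sleep):
    """Poll is_done() until it returns True (the 'while ... wait' of a script)."""
    while not is_done():
        sleep(poll_seconds)


def run_cycle(stages, poll_seconds=30, sleep=time.sleep):
    """Run one replication cycle: start each stage, then wait for it to finish."""
    completed = []
    for stage in stages:
        stage.start()
        wait_until(stage.is_done, poll_seconds, sleep)
        completed.append(stage.name)
    return completed


class FakeStage:
    """Stand-in for a real array operation; finishes after `polls` status checks."""

    def __init__(self, name, polls):
        self.name = name
        self.remaining = polls
        self.started = False

    def start(self):
        self.started = True

    def is_done(self):
        self.remaining -= 1
        return self.remaining <= 0


# Hypothetical stage names following the description above.
stages = [
    FakeStage("sync production to local BCVs", polls=3),
    FakeStage("split local BCVs", polls=1),
    FakeStage("copy local BCVs to remote site", polls=5),
    FakeStage("refresh second BCV set at remote site", polls=2),
]
print(run_cycle(stages, sleep=lambda seconds: None))  # no real sleeping in the demo
```

Wrap `run_cycle` in an outer loop and the time one pass takes is, give or take, the recovery point you can promise.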

Scary huh?

As I prepare to start my own replication design this was foremost on my mind, which is how it ended up here.  (This is, after all, the dumping ground for my random thoughts.)
