50Micron.com

Replication

Cloning for fun and profit.

by on Mar.23, 2007, under Cloning, Replication, TimeFinder

You know, every time I start thinking of “Cloning” I am afraid the far-right is going to burn me in effigy, just on the principle of it.

But in this case, I’m talking about cloning disks within an array for a data migration.

The decision was made to move our Microsoft Exchange (Corporate Email) from the Tier-1 (Raid-1+SRDF) to Tier-2 (3+1 Raid-5) storage.  I guess the logic is that losing a day of email won’t hurt us horribly.  (Ok, maybe it will, but I’ll get into the solution to that problem in a different post)

In this case, I’ve got the following drives:

7x 84G Metavolumes
4x 33G Metavolumes
2x single hypervolumes.

The new Exchange administrator, who I am actually most impressed with so far, would like to add 3x 200+G Metavolumes to the mix.  The main reason for the move is that we’re rapidly running out of Tier-1 storage, and need to save it for expansion, production growth, etc.  (Or buy new disks, but that’s yet another story)

So I am going to use this opportunity to demonstrate the power of TimeFinder/Clone.

I’ve created the new volumes on the Raid-5, mapped them to the front end ports, and prepared the masking scripts to move the device masking from the old devices to the new. 

Now in the old days, the way you’d do this is: shut the hosts down, do a bit-by-bit copy of the disks (if you’re lucky you can do them in parallel, otherwise it’s single-threaded), change the pointers on the host, and bring them back up, hoping everything is exactly how you left it.  Net downtime could be in the neighborhood of several hours, if you’re lucky.

This is a new beast.  Enter TimeFinder/Clone.  Now I have the blank devices.  I do things in a slightly different order.

  1. Create the clone session – this establishes the pairing of devices.
  2. Shut down the Exchange services.
  3. (each node) Unmount the existing disks using admhost or symntctl
  4. (each node) Change the device masking so Exchange sees its new disks
  5. (each node) re-scan the bus to remove the entries for the old disks and create the entries for the new luns. (at this time, it will see all new blank disks)
  6. Shut the cluster hosts down.
  7. “Activate” the clone session 
  8. Bring the first “Active” node of exchange up
  9. Bring the passive node up.

Net downtime is about an hour.  The reason being, once the clone session is activated, a background copy is started.  However from the target side, any reads to “invalid” tracks on the target disks actually get serviced from the source disks.  As far as the host is concerned it’s all the original data.  

As the copy progresses, more and more of the reads are serviced from the new disks.

When the array receives a write to a track that hasn’t been copied to the target yet, that track is first copied, THEN the write is processed on the target disks only.  This preserves your production disks in their original state.
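The read and write behaviour described above can be modelled with a toy sketch. This is not EMC code: the class and method names (`CloneSession`, `background_copy_step`) are made up for illustration, and a "track" is just a list entry.

```python
class CloneSession:
    """Toy model of an activated clone session (hypothetical names, not an EMC API)."""

    def __init__(self, source):
        self.source = source                  # production tracks, never modified here
        self.target = [None] * len(source)    # new devices start out blank
        self.copied = [False] * len(source)   # False = track still "invalid" on target

    def _copy_track(self, n):
        if not self.copied[n]:
            self.target[n] = self.source[n]
            self.copied[n] = True

    def read(self, n):
        # Reads of tracks not yet copied are serviced from the source.
        return self.target[n] if self.copied[n] else self.source[n]

    def write(self, n, data):
        # Copy the original track first, then apply the write to the target only.
        self._copy_track(n)
        self.target[n] = data

    def background_copy_step(self):
        # Copy the next uncopied track; return False once everything is copied.
        for n, done in enumerate(self.copied):
            if not done:
                self._copy_track(n)
                return True
        return False


source = ["a", "b", "c"]
session = CloneSession(source)
print(session.read(0))          # serviced from the source, nothing copied yet
session.write(1, "B")           # lands on the target only
while session.background_copy_step():
    pass
print(source, session.target)   # source is untouched, target holds the new write
```

The point of the sketch is that the source list is never written to, which is why the old production disks remain a usable fallback.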

With the advent of TF/Clone, I’m surprised anyone still uses BCV’s.  They’re so “old-tech”.  The main hang-up of course was the fact that while you (in theory) could protect a BCV using Raid-1, the performance hit you took during establish and split operations was so bad that it wasn’t worth it.  With TF/Clone you can go from Raid-1, to 3+1 Raid5, to 7+1 Raid5, etc. etc., with minimal performance impact.

The only downside comes of course when you’re cloning production volumes while they’re in use.  Since reads are being serviced by the production disks and not the clone disks (technically, the tracks you’re reading are simply being copied to the target while the host read is serviced from the cached track) you’re impacting the production spindles during the copy process.

It’s a cool bit of magic – and it’s really fun to play with the minds of people who don’t understand the technology.


DWDM Limitations – how far is too far?

by on Mar.21, 2007, under DWDM, Fibrechannel, Latency, Replication

I saw this post on http://lordegg.wordpress.com and felt that the comment I posted to him there would make a pretty good topic here.

Most people don’t understand that the speed of light has become a serious limitation in computing.  Even the original Cray, which was installed in Los Alamos in 1976, had some million individual wires pushing data, and no single one of them was more than something like a foot long, due to the time it took to push electrons across them. (I wish I could remember the exact numbers, but I’ve been up for going on 20 hours now, my brain is shutting down)


DWDM is a great technology – allowing 4-8 different signals to travel down the same link.

The down side is when you get, say 8 channels going down a 60km link, you’ve created a very wide path indeed.

But you’ve not fixed the latency problem. Under ideal circumstances the propagation delay over fibre is about 5µs per kilometer.

5µs per km at 60km is about 300µs. That’s each way; there is a return trip as well for each ACK transmission, and every switch, repeater and protocol handshake in the path adds to it.

Now when you add multiple data paths, the only thing that changes is that instead of having one I/O outstanding, waiting for its ACK, you’ve got four or eight.

60km is more than twice what I as an engineer would recommend without some sort of repeater, especially when you consider that optical cable is not an “ideal” transmission medium.

The speed of light has some profound implications for networking technology. Light, or electromagnetic radiation, travels at 299,792,458 meters per second in a vacuum. Within a copper conductor the propagation speed is some three quarters of this speed, and in a fibre optic cable the speed of propagation is slightly slower, at about two thirds of this speed.

At 2/3 the speed of light, that works out to roughly 200,000 km per second, or about 5µs for every kilometer of glass.
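A quick way to sanity-check propagation figures is to compute them directly from the speed of light. The sketch below assumes the two-thirds factor for fibre and ignores switch, transceiver and protocol overhead, so it gives a lower bound on the delay rather than what a host would actually see. The function names are my own.

```python
C_VACUUM_M_PER_S = 299_792_458      # speed of light in a vacuum
FIBRE_FRACTION = 2 / 3              # approximate propagation speed in fibre, relative to c


def one_way_delay_us(distance_km, fraction=FIBRE_FRACTION):
    """Propagation delay in microseconds for one trip down the link."""
    speed_m_per_s = C_VACUUM_M_PER_S * fraction
    return distance_km * 1000 / speed_m_per_s * 1_000_000


def round_trip_delay_us(distance_km, fraction=FIBRE_FRACTION):
    """Delay for the data going out plus the ACK coming back."""
    return 2 * one_way_delay_us(distance_km, fraction)


for km in (1, 10, 60):
    print(f"{km:>3} km: one way {one_way_delay_us(km):7.1f} us, "
          f"round trip {round_trip_delay_us(km):7.1f} us")
```

Adding more DWDM channels changes none of these numbers; it only changes how many I/Os can be in flight while each one waits.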


SRDF over what media?

by on Feb.06, 2007, under Best Practices, Data Migration, Replication

Well – Tomorrow I should have data replication going between the two Symms.  And it gives me pause.

We’re going to be using SRDF/A for our replication.  For those who are not familiar with the EMC terminology, SRDF/A is a “Semi-Asynchronous” form of SRDF that provides consistency points in the data being transmitted without affecting production performance.

SRDF inserts a “Checkpoint” periodically into asynchronous traffic.  The target frame will only write a block of changes, called a “Delta-Set”, when the ending checkpoint has been received.  If a link fails before the checkpoint is received, the previous block of data is considered to be invalid and discarded.

This allows a recovery-point of 10-15 minutes, with guaranteed consistency, over a longer distance.  (Our planned replication distance is approximately 1500 miles)
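The checkpoint behaviour can be illustrated with a toy model of the receiving side. This is only a sketch of the idea as described in this post, not how the Symmetrix implements it; `DeltaSetTarget` and its methods are hypothetical names.

```python
class DeltaSetTarget:
    """Toy model of the receiving array (hypothetical names, not an EMC API)."""

    def __init__(self):
        self.disk = {}        # committed, consistent data
        self.pending = {}     # writes received since the last checkpoint

    def receive_write(self, block, data):
        self.pending[block] = data

    def receive_checkpoint(self):
        # The closing checkpoint arrived: the delta set is complete, so apply it.
        self.disk.update(self.pending)
        self.pending = {}

    def link_failed(self):
        # No closing checkpoint: the partial delta set is discarded.
        self.pending = {}


target = DeltaSetTarget()
target.receive_write(1, "v1")
target.receive_checkpoint()      # first delta set committed
target.receive_write(1, "v2")
target.receive_write(2, "v2")
target.link_failed()             # second delta set never completed
print(target.disk)               # only the first delta set is on disk
```

The target ends up behind the source, but never holds half of a delta set, which is where the consistency guarantee comes from.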

The other option, albeit too expensive for the bean-counters who manage our money, is multi-hop SRDF, which allows you to replicate to a bunker site 10-15km away from the primary site in full synchronous mode, and then from the bunker site to the DR site in Async. or SRDF/A mode.  This allows a recovery up to the point of failure in the event the primary site is lost, and recovery to the last delta-set in the event of both a primary and bunker site loss.  (nuclear explosion?)

So the options for distance are:  Ethernet, and Ethernet.  The longest piece of dark fibre I’ve ever seen covers the 35km or so between Capitol Hill and the congressional DR facility.  They ran full synchronous mode, but the users never noticed because they never saw what performance was like without the 30ms round-trip.

The Symmetrix supports three protocols for SRDF

*  IP (current max 1gb per link)
*  FibreChannel (current max 4gb per link with DMX-3 and 5772 code)
*  Escon (The original standard for SRDF going all the way back to the Symm3)

FC and Escon are good for limited distances. With long-wave (1300nm) optics you can do about 10km reliably.  With a good DWDM set you can stretch that out significantly (not natively; the DWDM hardware acts as a repeater of sorts), and it will allow you to put multiple links down the same fibre-pair.

Ethernet seems to be the most often implemented version these days.  I’ve seen a few, though not many, “Symmetrix RFA (RDF over Fibre) –> Nishan IPS3300 –> Ethernet –> Nishan IPS3300 –> Symmetrix RFA” types of implementations, but it seems to me that you’re just throwing that many more potential breaks in the transmission line, plus every time you have to decode a signal and re-encode it in another format you’re losing a step. 

(Even the fastest computer hardware takes time to process data, nothing hands it straight across.)

Of course, then the bean-counters get into it….the RE (RDF/Ethernet) adapters are more expensive than simply dedicating two ports of your existing FA (Fibre Host Adapters) to the RDF functionality.

Just ranting. :)


The Fake Synchronous SAN

by on Feb.06, 2007, under DR/COOP, Gripe, Replication, Switches

In response to this article on “EnterpriseStorageForum”:

 Synchronous SAN Sets Fibre Channel Distance Record

My Response:

True Synchronous transmission works over any distance – if you can live with the latency.   The problem is that most host operating systems can’t.  So different buffering schemes are cooked up to fool the host into thinking the write is complete on both sides when in fact it’s not.

Any time you get over about 30km the latency, that is the time it takes for the IO to be transmitted, acknowledged, and released, becomes about that of a normal unbuffered physical drive, about 20-30 ms.

Any further and you will start seeing slower and slower response times and eventually IO timeouts on the source hosts.

In order for a storage system to be truly “synchronous”, the array cannot acknowledge the I/O to the host until it’s received the write ACK from both the source *AND* the target array.  If there is buffering going on between point-A and point-B, such as a Cisco MDS with the buffer credits cranked up or a Nishan IPS3300, it is not a truly synchronous replication, because the failure of the switch on the source (or target) side will cause the target array to have missed I/O’s that have already been acknowledged to the host as having been completed.
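To make the failure mode concrete, here is a deliberately simplified model. It assumes a device in the path that acknowledges writes on the target's behalf; the function names and the three-write scenario are invented for illustration and do not describe any particular switch or gateway.

```python
def replicate_write(data, source, target, link_up, buffering_device=False):
    """Return True if the host gets an ACK for this write (toy model)."""
    source.append(data)
    if buffering_device:
        # A device in the path acknowledges on the target's behalf and
        # only delivers the write if the link survives.
        if link_up:
            target.append(data)
        return True
    if not link_up:
        return False            # no ACK from the target, so no ACK to the host
    target.append(data)
    return True


def acked_but_missing(buffering_device):
    """Writes the host believes are safe but the target never received."""
    source, target, acked = [], [], []
    for write_id, link_up in enumerate([True, True, False]):
        if replicate_write(write_id, source, target, link_up, buffering_device):
            acked.append(write_id)
    return [w for w in acked if w not in target]


print(acked_but_missing(buffering_device=False))   # nothing acknowledged is missing
print(acked_but_missing(buffering_device=True))    # the last write was ACKed but lost
```

In the truly synchronous case the host may see an I/O error or a timeout, but it is never told a write is safe when the remote side does not have it.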

Sorry – but this test, as it appears to have been run was obviously designed by the various vendors to accentuate their hardware without showing the failures and flaws in the logic.  I’m sure I could walk in there and in about 5 minutes simulate a link failure that would have the remote site in an inconsistent state at worst, or having to roll back journaled IO’s at best.


Centera – love it?

by on Jan.05, 2007, under Backup, Centerra, ILM, Replication

We just had our sales presentation on EMC’s Centera Content Addressable Storage system.  I have to admit, I went into it knowing a little about it, and even the 60,000 foot “executive summary” EMC put together really impressed me.

The idea of putting so much data to tape but keeping it up and available just floors me.  But for a “reasonable” price, I can offload all of our imaging (we don’t use paper records) voice recordings for the call center, and email traffic to a system that is widely considered to be so bulletproof (when in a multi-location DR environment) that it doesn’t require backups.

By doing object-level mirroring it seems like they’ve really conquered the need for backups, as well as the management nightmare that is records retention.   Since the objects can be mirrored within the frame, as well as to a remote frame, that makes it even more solid.

I have to say I’m impressed – now to sell it to the Execs…..  (Actually our CEO is so “compliance driven” it may not be much of a hard sell)


RPO vs. RTO

by on Oct.25, 2006, under Backup, DR/COOP, Replication

An engineer friend of mine (real engineer, not affiliated with computers) once told me:

 “There are three options:

     1. You can have it faster.
     2. You can have it smaller.
     3. You can have it cheaper.

….Now pick any two.”

Over and over in my life I’ve put that theory to the test and to this day it has always held true.  The smaller and faster something is, the more expensive it gets.  The cheaper something is, the slower and less portable it is.

Disaster Recovery and COOP (Continuity of Operations – for the layman) follow a lot of the same rules.

There are three main criteria you’re aiming for.  The main two are RPO and RTO.  That’s “Recovery Point Objective” and “Recovery Time Objective”.

The third is, of course, cost.

RPO is the point in time you need to be able to recover to.  Goals are sometimes easy to obtain: “Midnight on the morning of the failure” is usually pretty easily obtainable, as you can do that by restoring from backups.  Financial institutions aim for somewhat stricter objectives.  Most banks will require an RPO of “Zero”, meaning “I want to see the last committed transaction on my DR site in the event the source site becomes a smoking hole in the ground.”

This is doable of course, provided the DR site is close enough to the source site to run dark fibre between the two with low enough latency to add negligible impact to production.  (The rule of thumb for synchronous replication is 2ms per 10k, that is, for every 10 kilometers you’re adding 2ms of latency.  A normal physical drive has a latency of about 9-14ms, so if you go too far you’re going to slow your system to a crawl.)
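Taking that rule of thumb at face value, a few lines of arithmetic show how quickly distance eats into write performance. The sketch below simply adds the rule-of-thumb penalty to the quoted drive latency range; it is a back-of-the-envelope estimate under those two assumptions, not a measurement, and the function names are my own.

```python
RULE_OF_THUMB_MS_PER_10KM = 2.0     # figure quoted in this post
DRIVE_LATENCY_MS = (9.0, 14.0)      # "normal physical drive" range quoted in this post


def added_latency_ms(distance_km):
    """Extra write latency from synchronous replication under the rule of thumb."""
    return distance_km / 10 * RULE_OF_THUMB_MS_PER_10KM


def slowdown_range(distance_km):
    """Ratio of replicated to local write latency, as (slow drive, fast drive)."""
    extra = added_latency_ms(distance_km)
    fast, slow = DRIVE_LATENCY_MS
    return (slow + extra) / slow, (fast + extra) / fast


for km in (10, 20, 50, 100):
    low, high = slowdown_range(km)
    print(f"{km:>4} km: +{added_latency_ms(km):4.1f} ms per write, "
          f"{low:.2f}x to {high:.2f}x the local write latency")
```

The faster the local storage, the larger the relative penalty, which is why cached arrays feel the distance more than the raw drive numbers suggest.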

RTO is defined as “how long can I afford to have my environment down to effect a failover.”  I’ve worked in one environment where transaction logs were backed up to tape and shipped across country from the L.A. area to Orlando, Florida, where the tapes were then restored into a standby system.  The recovery time to a 15 minute increment was effectively days, because they actually had to wait for the last tape to make it to the target site before they could restore it and bring the system on-line.  It was insane.

Your goal is to get RPO and RTO to as close to zero as possible without bankrupting the budget (or the company).

An RPO of zero can be obtained with a DR site within about 10 kilometers, 20 if you can live with the slower response times in production.  This is full synchronous transfer from one array to another, every write from host to disk has to be acknowledged by the REMOTE array before it is reported to the host that the write is committed.

EMC’s SRDF/A and SRDF/AR mitigate that in environments where the DR site is far enough away as to kill any chance of SRDF/Synchronous working.

SRDF/A is a “packetized” SRDF, where the receiving Symm has to receive two consecutive “checkpoints” before it commits the block of data.  That way if an incomplete block is received, it’s discarded to prevent data corruption resulting from incomplete write information.  The downside to SRDF/A of course is that it requires an insane amount of cache to function properly.  (And don’t let an SE tell you it doesn’t; he’s lying or not capable of understanding that for the remote Symm to receive a block of data, it has to be able to store it somewhere other than disk until it receives two checkpoints.)

SRDF/AR is an automated replication product.  You are essentially mirroring production to a TimeFinder BCV, which is then sent synchronously to the remote site.  You can run a Sync transfer because the BCV’s are not connected to the production volumes, and as such the production volumes do not require any ACK/NAK from the remote system.  Depending on the time it takes to replicate (how fast the pipe is between the two sites) you can get RPO to about 10 minutes, which is good enough for most.  The effects of SRDF/AR can be duplicated by anyone proficient in Korn shell, as it literally runs a series of waits and whiles for each stage of the process.  AR has the added bonus that you can actually keep a second set of BCV’s on the target side and run your backups from them.  The down-side to the AR type of scenario (whether it be SRDF/AR or a scripted set-up) is that it costs disks – and lots of them.  There are the production volumes (mirrored), the first set of BCV’s (unprotected), the SRDF target devices (mirrored or Raid5) and the second set of BCV’s.
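To put a number on "lots of them", here is a rough capacity sketch. The protection overheads are the usual ones (2x for a two-way mirror, 4/3 for 3+1 Raid-5); treating the second set of BCV's as unprotected is my assumption, since the post doesn't say, and the 1000 GB figure is just an example.

```python
from fractions import Fraction

# Raw capacity consumed per unit of usable capacity (assumed protection overheads).
RAID1 = Fraction(2)           # two-way mirror
RAID5_3_1 = Fraction(4, 3)    # 3 data + 1 parity
UNPROTECTED = Fraction(1)


def ar_raw_capacity(usable_gb, r2_protection=RAID1, second_bcv=UNPROTECTED):
    """Raw GB needed for each copy in an AR-style layout."""
    copies = {
        "production (mirrored)": RAID1,
        "local BCVs (unprotected)": UNPROTECTED,
        "SRDF targets": r2_protection,
        "remote BCVs": second_bcv,
    }
    return {name: usable_gb * factor for name, factor in copies.items()}


for label, protection in (("mirrored targets", RAID1), ("Raid-5 targets", RAID5_3_1)):
    layout = ar_raw_capacity(1000, r2_protection=protection)
    print(label)
    for name, gb in layout.items():
        print(f"  {name:<26} {float(gb):8.1f} GB")
    print(f"  {'total':<26} {float(sum(layout.values())):8.1f} GB")
```

Either way you are buying somewhere between five and six times the usable capacity in raw disk.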

Scary huh?

As I prepare to start my own replication design this was foremost on my mind, which is how it ended up here.  (This is after all the dumping ground for my random thoughts)


The Different Flavors of EMC TimeFinder

by on Sep.26, 2006, under Backup, General, RAID, Replication, TimeFinder

I don’t know what most people know about TimeFinder, so I’ll start with a short introduction.

EMC Timefinder was developed to provide customers with a dynamic mirror they could use to try and cut some of the tediousness out of copying data, whether from one host to another or within the same host.

When I was working for EMC, TimeFinder was still more or less in its infancy, and only came in one flavor.  (Now referred to as “TimeFinder/Mirror”)

A TimeFinder volume is essentially a dynamic mirror of a production volume.  Called a ‘BCV’ (for Business Continuance Volume), it was a straight 1:1 mirror of your production data.  (If you were running in a 2-Way-Mirrored configuration, the BCV essentially became a third mirror.)  At a point-in-time, you could split the third mirror off its production pair and make it available to another host.

The main benefits were obviously discovered in Backup.  You could split a BCV of your production data off, mount it to another host, and back it up to a locally attached tape library with zero network overhead.  Using this you could also use a single backup host to back-up copies of BCV’s from multiple production hosts.

Another good use for BCV’s was in development.  One story I like to tell is from when I was an admin several years ago.  Developers liked to “break” their database on a Friday afternoon, knowing that the restore from tape would take the better part of a day and in doing so they guaranteed they would make their tee-time.  With the advent of TimeFinder, I was able to tell them “Not a problem, I’ll have it back up in 20 minutes.”   The reason being that I could restore from the mirror almost instantly.

The main negatives for TF/Mirror are that in all cases, the initial synchronization has to complete before the data can be made available to the target system.  Now this is mitigated by the fact that after the initial relationship is established all further mirrors are incremental, meaning that only changed tracks are copied to the BCV volume, but it can still be a time consuming process.

Now EMC has come out with two new forms of TimeFinder in Symmetrix, very similar to the Clariion functionality.

TimeFinder/SNAP

  SNAP uses a process called “Copy On First Write”.   This uses a much smaller volume than the production volume as a “virtual device”  (Called a VDEV)   The VDEV serves as a list of pointers for each track in the volume.  Reads are serviced from the original volume until the track is changed.  When the track is changed the original data is copied to a cache area, and the pointer for this track in the VDEV device is changed to point to the cached original track.  In doing this the VDEV device will contain an exact copy of the production data as it was when the Snap session was activated.  When the data is no longer needed, you terminate the Snap session and the cached changes are discarded.

The data is available the instant the SNAP session is activated.

The downside to this is that all reads touch the production volumes.  In a heavily utilized system there can be a noticeable impact.  Another negative is that SNAP sessions are limited to the amount of cache set aside.  A usual configuration is to set aside about 20% of the area used by production as “SnapCache”.  This can then be used as needed.  If the SnapCache fills up, the Snap session ends and that is that.
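Copy-on-first-write and the SnapCache limit are easier to see in a few lines of code. This is a toy model of the idea only; `SnapSession`, `SnapSaveAreaFull` and the track-count capacity are hypothetical and say nothing about how the array sizes or manages its save area.

```python
class SnapSaveAreaFull(Exception):
    """Raised when there is no room left to preserve original tracks."""


class SnapSession:
    """Toy copy-on-first-write model (hypothetical names, not an EMC API)."""

    def __init__(self, production, save_capacity):
        self.production = production      # live volume, keeps changing
        self.saved = {}                   # track number -> original contents
        self.save_capacity = save_capacity

    def host_write(self, n, data):
        # First write to a track since activation: preserve the original first.
        if n not in self.saved:
            if len(self.saved) >= self.save_capacity:
                raise SnapSaveAreaFull("session ends, point-in-time copy is lost")
            self.saved[n] = self.production[n]
        self.production[n] = data

    def snap_read(self, n):
        # The pointer leads to the saved original if the track has changed,
        # otherwise straight to the production track.
        return self.saved.get(n, self.production[n])


production = ["a", "b", "c"]
snap = SnapSession(production, save_capacity=1)
snap.host_write(0, "A")
print([snap.snap_read(n) for n in range(3)])   # the view as of activation
print(production)                              # the live volume
try:
    snap.host_write(1, "B")                    # a second changed track will not fit
except SnapSaveAreaFull as err:
    print("snap session failed:", err)
```

Note that unchanged tracks are read straight from `production`, which is the production-spindle impact mentioned above.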

TimeFinder/CLONE

  Clone uses another process, similar to SNAP, called “Copy On Access”.  Clone volumes are identically sized to the production volumes, which of course uses up more space, but provides for a more permanent home.  This provides the data permanence of TimeFinder Mirror, the speed of TimeFinder Snap, and the agility to move data from one standard volume to another standard volume. (Raid-1 production volumes to Raid-5 Development volumes is a good example)

What copy-on-access offers is the unique ability to use the volume before it’s actually finished mirroring.  When a clone session is first started, all the target volume contains are pointers to the source volume.  Every time a track is accessed, (read or write) it is copied to the target volume first.  (prior to any write operations)  If no options are selected this is the only time a track is copied.  If the -copy option is selected when the Clone session is created, a background copy of the production volume is started.  This will eventually result in a copy of the data that will persist after the clone session is terminated.  (when no option is specified, the data will disappear when the session is terminated)  There is also the option to copy (mirror) the entire production volume to the target volume before the session is activated.  This is called “Precopy” and is a close emulation to what is done using TimeFinder mirror without the limitations of having to use BCV’s as targets.

TF/Clone has to be the best of all worlds.  It gives you the flexibility of Snap with the data-resilience of Mirror, and the flexibility of being able to go from one volume to another without restrictions on what type of volume your target is.  (TimeFinder/Mirror requires the use of BCV volumes)

TimeFinder is the product that gave me my introduction into EMC.  TimeFinder and SRDF are also the technologies I’ve implemented more often than any others in my work for (and with) EMC.

If you’ve got questions, feel free to post them.  You’re probably not the only ones.

