Archive for the 'Fibrechannel' Category

The Great Conversion… (Part1)

Tuesday, December 7th, 2010

Tonight I have started the process of converting my own CX300 to a CX3-20c.  (yay!  upgrading from SERIOUSLY out-of-date hardware to MODERATELY out-of-date hardware, right?)

Why?  Because it’s there.  Since EMC doesn’t offer free training to sub-sub-sub-contractors such as myself, it falls to me to learn what I can where I can. 

Besides.  It’s fun.

So far it’s been pretty simple.  Printed out the 63-page guide from the Clariion Procedure Generator, and was briefly intimidated by it before I realized that 80% of it is completely useless.

But, because I want the experience, I’m going through each step of it.

First thing I did was back up the vault pack.  This particular CX300 has no data on it, so it was a simple process to swap the five vault drives one at a time.  Though in the interest of doing it gracefully, I did bind a 1G lun across the five drives so that I could use proactive sparing to remove each drive.  (The option isn’t available unless you have a bound lun on the raid group.)

Obviously I want to come out of this with a working CX300 as well as the CX3-20. :)   (And of course my fear is ending up with not one but TWO doorstops at the end of this process)

Ran CRC2.  Interesting application, might come in handy in the future, because  when it comes down to it, it gives you a LOT of information about your clariion that you don’t get from the GUI.

Luckily this is a situation where there isn’t anything I need to keep on the array.  One thing stands out though: I had to delete the little 1G lun I bound on the vault pack to get the CRC2 check to pass, because apparently a number of the system partitions need to expand.  (Life would have sucked if I didn’t have the ability to wipe the drives.)

After deleting the 1G lun, CRC2 passed and all was good with the world.

Skipped the next 6 pages, which describe how to make room in the rack for the new equipment.

Why?

No rack in my basement anymore, I guess a desk will do.

Let’s just say rack-space really isn’t the issue here. ;-)

So the only thing I wasn’t able to find was a copy of EMCRemote.  Though I have an older one from back in my EMC days, hopefully the protocols haven’t changed much.  <crosses fingers>

So Unisphere Service Manager (formerly Navisphere Service TaskBar) is installed, and the first step is to install the target platform conversion-prep software.  A pretty straightforward install.  Once this package is installed you’re committed. (or should be)

I had a bit of a worry loading the conversion-image: SPB didn’t come back before USM timed out… which seems like something that might just happen on these older, slower devices, right?

The ConversionPrep, ConversionImage, and most importantly, the new Utility partition all seem to have installed correctly, as did the HSConversionB package.

Now sadly, there are no descriptions as to what each NDU does, though logically:

ConversionPrep handles the settings.  One of the things that you verify after the ConversionPrep is loaded is that write-cache is disabled (which it always was, because I don’t have an SPS cable for this model) and such.

ConversionImage and Utility Partition are pretty self-explanatory: the CX3 looks for things in different places, requires different drivers, etc.

HSConversionB was the stumper.  HardwareSwap?  Any ideas?

Last step was to shut-down the array.  This will be my stopping point. 

Stay tuned, same bat-time, same bat-rss-feed.

No SAN is an island…

Wednesday, September 16th, 2009

Ok, that was too cutesy for such a classy establishment.

When you’re building a SAN, everything should play together, in the same SAN box if you will (with my apologies to QLogic.)

When you start putting in multiple stand-alone SAN islands you increase your maintenance overhead exponentially.  You also prevent the very thing that makes a SAN a huge advantage over DAS.

Everything can see everything.

If you have a host and need to throw a certain type of storage at it, you can do that easily.  (and if it’s already cabled you can do it from your living-room)

However, if SAN A/B are connected to one group of hosts/storage, and SAN C/D are connected to a second group of storage, and SAN E (no redundancy) is connected to even a third, you run into a problem.

*NOW* if I want Host_A (Connected to SAN A/B) to see Storage_C (Connected to SAN C/D) I have to do much more than a simple zoning change.

In the end, this is where a few well placed ISL connections can come in handy.  VSAN them off so they don’t cause the fabrics to merge, create an IVR zone to route across them, and then presto.  Host_A can see Storage_C with a minimum of fuss.

Or maybe a *REAL* core-edge topology even.  Where you put a core switch with 24-port (fully subscribed) blades ISL’d to an edge switch, which maybe has the 4/4/40 configuration (4 8-Gig ports, 4 dedicated 4Gig ports, and 40 shared 4Gig ports)

And put one person in charge of it.  Preferably someone with a touch of OCD. ;-)

Jumping the shark

Monday, August 25th, 2008

This may be a more well-known reference than I earlier thought.

I grew up watching Happy Days.  The show was great until the episode where Fonzie jumped the shark-tank.  After that it pretty much went downhill quickly.

Hence the term “Jumped the shark” or “Jumping the shark” has come to mean any single event that marks the point where something degenerates into crap.

My VMware NFS server jumped the shark this weekend.  It was hilarious.  I had a beautifully quiet afternoon on Friday; from about 14:30 on, my blackberry was quiet.  Turns out that the NFS server that I use for storage experienced an unexplained (and apparently barely logged) kernel panic and rebooted.

In the process, the 6 adapters, in what I can only guess was a techno-square-dance, all switched places and lost their bonding configuration.

All went south, right in the middle of one of my busiest travel weeks as far as work goes.  So my wife, god bless her, earned her stripes this weekend as I walked her through ‘ifconfig eth0 10.1.1.10′ and ‘ping 10.1.1.254′ etc.  trying to figure out what happened.

Still don’t know.  But with everything down (including this site) my first priority was to get it all back online and troubleshoot later.  (When my desktop goes down I know why: I have an inquisitive 3-year-old with a fetish for power-buttons.  The server power buttons are protected by a key for that very purpose.)

So I ordered a bunch of 146G drives for the hosts, and I’m going to move critical apps back to internal storage until I figure out what in the hell happened and how to fix it.  It might give me an opportunity to evaluate some new FC Target toys I’ve been thinking about.

Who knows.  No more shark-jumping though.  ;-)

Clariion – Mirrorview – Cisco – FCIP

Tuesday, June 17th, 2008

Got into a scary situation this week.  Got called in to help a customer with a Mirrorview implementation.

Situation was:  Customer had Mirrorview/S set up within the existing switch environment, replication worked perfectly.

Then they reconfigured the switches to run FCIP so they could start replication to a remote site.  This is where things went badly.

First off – Cisco sets the Gig/E ports on the 9216i for jumbo-frames.  (MTU defaults to 2300)

This is a great idea for Fibrechannel replication, because a full-size fibrechannel frame is 2148 bytes (2112 bytes of payload plus header, CRC and delimiters) and this allows an entire FC frame to be sent within a single ethernet packet.

Problem is that the default MTU on most network environments is 1500.  Now the *REAL* problem is that when you first connect the Gig/E ports on the 9216i to a 6509 or other switch, it will at first appear to work perfectly… until you try to pass data.

When you try to pass data across this link, the DF (Don’t Fragment) bit is set and the larger frames get dropped.  This causes an ISL connection between switches to flap, which causes no end of issues.  The fabrics will segment and re-join repeatedly until the first time you do anything that causes a reconfiguration, like updating the zoneset.  If you do that during a cycle where the ISL is going up and down, the VSANs will fragment and stay fragmented because the switches will not be able to re-merge the fabrics.
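To see why the link looks healthy until real data flows, here is a minimal Python sketch of the DF behaviour described above.  The function name and the three-hop path are made up for illustration; the 2300 and 1500 figures are the ones from this post, and the model ignores everything except packet size.

```python
def first_blocking_hop(packet_bytes, hop_mtus, dont_fragment=True):
    """Return the index of the first hop that drops the packet, or None.

    Simplified model: a packet larger than a hop's MTU is dropped when
    the DF bit is set.  Without DF, routers may fragment instead, so
    nothing is dropped for size alone.
    """
    if not dont_fragment:
        return None
    for index, mtu in enumerate(hop_mtus):
        if packet_bytes > mtu:
            return index
    return None


# Illustrative path: 9216i Gig/E port, a core switch left at 1500,
# and the 9216i port at the far end.
path = [2300, 1500, 2300]

small = first_blocking_hop(200, path)    # small control traffic
large = first_blocking_hop(2300, path)   # full-size data packet
print(small, large)                      # None 1
```

Small packets make it end to end, so the link comes up.  The first full-size packet dies at the 1500-byte hop, which is the point where the flapping starts.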

So I come into this situation and the switches are so badly configured that it takes me a day just to get the ISLs up and stable.  I set the MTU to 1500 on the switches, took the gig-e links down, and went to each switch and (carefully) deleted each vsan that didn’t belong on that switch.  (In addition to this being set up incorrectly, all three vsans merged when the switches were first connected, due to the ISLs being configured incorrectly.)

Now the Clariion issue is still open.  A normal mirrorview configuration is as follows:

Source_SPAx –> Target_SPAx  (Where ‘x’ is the highest SP port #)

Source_SPBx –> Target_SPBx   (same here)
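As a rough illustration of that pairing rule, the Python sketch below builds the two expected zone pairs from the number of front-end ports on each SP.  The function name is hypothetical, and it assumes ports are numbered from 0, so an SP with four ports has port 3 as its highest.

```python
def mirrorview_zone_pairs(source_ports_per_sp, target_ports_per_sp):
    """Pair the highest-numbered port on each source SP with the
    highest-numbered port on the same SP of the target array."""
    src = source_ports_per_sp - 1   # assumes 0-based port numbering
    tgt = target_ports_per_sp - 1
    return [(f"Source_SP{sp}{src}", f"Target_SP{sp}{tgt}") for sp in "AB"]


# A four-port source array replicating to a two-port target array.
for source, target in mirrorview_zone_pairs(4, 2):
    print(source, "->", target)
```

With four ports on the source and two on the target this gives SPA3 to SPA1 and SPB3 to SPB1, never an A-to-B cross.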

Now when the customer’s Clariions are zoned this way (in this case SPB3 to SPB1) nothing shows up in the Connectivity Status window.  But when I reverse the zoning, running SPA3 to SPB1, it shows up fine.  (Unfortunately Mirrorview doesn’t work in that configuration.)

That’s where we stand.  A “simple 15 minute FCIP fix” is coming to the end of its third day.

FC@Home

Monday, March 31st, 2008

A couple of years ago, I picked up an old Clariion FC5300 wholesale (free) from a junk-pile at one of my customers.  I played with it, it was nice, but I couldn’t figure out why I should use it when I had 73+G drives available to me.

I started the FC@Home project then.  Because I thought it would be cool to have fibrechannel running in my home system.

Well I got rid of the FC5300 because the 30 x 18G Full-Height drives were just too much to power and cool.

A few weeks ago I decided that I needed to do it again.  (I posted something of it earlier)  Got an old EMC/Brocade DS16B2 switch, a PowerVault 224F JBOD, and started playing.

Well the first thing I found is that I could never use JBOD for the purposes I wanted to.  I wanted to put together some redundant shared storage for my VMWare servers so I could play with VMotion and Clustering.  While I could share individual disks, RAID wasn’t an option and I refuse to use unprotected storage.

So I scoured Ebay and found a PowerVault 660F to add to the 224F.  Now the 224F came with 14x 18G drives, the 660F came with 14x36G drives.  I paid under $200 (not including shipping) for each of the two racks.  They are 3U units, don’t pull a tremendous amount of power, and are as difficult to cool as any drive array (it’s the drives that cause the heat, not the array)

Another $100 or so in cables (the HSSDC->HSSDC jumpers that were required between the units) and I was good to go.  I already had some DB-9 FC –> SC-Duplex converters, as well as some SC–>LC cables, so that part was easy.  I found someone who off-loaded a bunch of old QLogic QLA2200 HBAs (9 for $50) and the whole thing was done.

I initially had issues getting it recognized, but on a whim I called Dell support.  The tech informed me that this was so far out of support he really couldn’t help me, then proceeded to spend about an hour helping me out.  Turns out that the Array Manager software that you use to manage the thing doesn’t work with the latest / greatest QLogic drivers.  I had to back-rev them to v8.x and suddenly it worked perfectly.  (He also told me it was never ever going to work on Win2k3 – a fact I’ve happily disproven.)

I just got it carved, and all but one of my VMs are moved over to it.  I have about 500G of Raid-5 Storage available with 2 Hot-Spares (since I don’t know the history of the drives, I figured better safe than sorry).

So far so good.  Performance is great.  Though I’m only going through one switch, I have redundant RAID controllers, so that’s at least something.  As soon as I find someone dumping a second DS16B2 I’ll probably incorporate that into the mix as well.

So I set up the 2-node VMWare cluster, and set it for DRS just to see if it works the way they say it will.  (I’m also curious because I have less memory in the second node than the first, if it will be aware of that.)  I have a third 2650 I got here because some newbie on Ebay didn’t realize that the particular error message he got on boot meant simply that there was no operating system on the disks.  As soon as I get the rail-kit I’m going to mount this puppy up and make it a 3-node cluster.

I’m such a geek.

Brocade is just in a buying mood these days…

Tuesday, March 4th, 2008

Brocade bought SBS.

I don’t know how many of you happen to have looked at the resume I had posted – but I spent a couple of years at Strategic Business Systems (www.sbsplanet.com).

I’m not sure what Brocade is hoping to get out of this.  SBS doesn’t do sales, and doesn’t even really have any influence in the buying process.

SBS has been a pretty successful company – grown by leaps and bounds.  I would never go back to them because they wield their non-compete agreement like a battle-axe and use every opportunity as a chance to hook someone in.

The real problem is that Brocade as a switch manufacturer is on its way out.  From a 90% install base they really have nowhere to go but down, and Cisco is gaining very quickly.

I’m not a big fan of Brocade.  I have a Brocade switch in my home SAN not because of any preference, but because they are cheap on Ebay.  Their ASICs are slow and their licensing is oppressive.

Does this make sense to anyone?

Things I’ve learned today:

Monday, February 4th, 2008

1. iSCSI is a viable alternative to FC for Small infrastructures.

2. I learned that no matter how well prepared for an install you are, the techie-gods will always throw curve-balls at you.

3. I’ve learned that with Linux and PowerPath, multiple iSCSI HBAs in a single host are not supported.
(Author’s note – this is not entirely true – see comment #3 below)

4. I’ve learned that seeing mice (yes, plural) running around a datacenter while you’re crawling around on the floor running cables is creepy.

Nuff said. I just got off 20 almost-straight hours in a row (I napped between 3am and 6am this morning) doing what should have been a very very simple install.

Needless to say it wasn’t. I’m going to bed.

-J

Overcomplicating the world

Thursday, December 6th, 2007

Ok, I’ve seen it happen over and over again.  Customers who think they know better.

Now I absolutely applaud a customer who wants to take the time to learn the ins and outs of the storage they’ve spent probably  hundreds of thousands of dollars on.

But when you pay a consultant to come in and do work for you, please please PLEASE don’t handicap him by telling him to do something he knows is wrong.

Simple things like zoning, pathing, masking, failover, even the naming conventions we use, come from years of experience on what is the best way to put something together.  More damage is done by people thinking they know better than the years and years of developer-hours that went into the system.

As a for instance.  Single initiator, single target.  There is no need to zone an HBA to multiple targets for redundancy unless it’s the only HBA in the system, in which case you have your first mistake right there.

HBA_1 –> Switch_1 –> FA_1

HBA_2 –> Switch_2 –> FA_2

These should be two completely separate paths, completely isolated from each other.  It’s so simple it’s not funny, yet I’ve seen more “unique” ways of zoning than I can count.
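Here is a small Python sketch of the same rule, one zone per initiator/target pair on each fabric, plus a check that flags any HBA fanned out to more than one target.  The function names and the HBA/switch/FA labels are placeholders for illustration, not anything a switch vendor ships.

```python
from collections import defaultdict


def build_zones(paths):
    """paths: list of (hba, switch, target) tuples.

    Returns {switch: {zone_name: [members]}} with exactly one
    initiator and one target in every zone.
    """
    zones = defaultdict(dict)
    for hba, switch, target in paths:
        zones[switch][f"{hba}_{target}"] = [hba, target]
    return dict(zones)


def multi_target_hbas(paths):
    """Return the HBAs that are zoned to more than one target."""
    targets = defaultdict(set)
    for hba, _switch, target in paths:
        targets[hba].add(target)
    return sorted(hba for hba, seen in targets.items() if len(seen) > 1)


# Two isolated paths, one per fabric.
good = [("HBA_1", "Switch_1", "FA_1"), ("HBA_2", "Switch_2", "FA_2")]
# One HBA fanned out to two FA ports on the same fabric.
bad = [("HBA_1", "Switch_1", "FA_1A"), ("HBA_1", "Switch_1", "FA_1B")]

print(build_zones(good))
print(multi_target_hbas(good))   # []
print(multi_target_hbas(bad))    # ['HBA_1']
```

The check is deliberately dumb.  If it returns anything at all, somebody got creative with the zoning.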

For instance:

HBA_1 –> Switch_1 –> FA_1A
                  –> FA_1B

The inherent problem with this is that PowerPath (or DMP, or whatever flavor of multipath software you use) is going to spend more time managing two paths, and the weak/slow point in the link is still the HBA.  You can’t get beyond the fact that it is a serial interface.

Plus the fact that on a Symmetrix, FA1A and FA1B share a processor, so you aren’t even gaining anything from spreading the IO across the Symmetrix front-end.  (not that you would even if you used FA1A and FA2A, because the processors are still writing to cache, minimal lag.)

Then you get into management.  You spend more time and effort managing a complex solution than it’s worth.  Simplify it and you’ll find you spend much more time at your local pub.  ;-)

/jg

DWDM Limitations – how far is too far?

Wednesday, March 21st, 2007

I saw this post on http://lordegg.wordpress.com and felt that the comment I posted to him there would make a pretty good topic here.

Most people don’t understand that the speed of light has become a serious limitation in computing.  Even the original Cray, which was installed in Los Alamos in 1976, had some million individual wires pushing data, no single one of them was more than something like a foot long, due to the time it took to push electrons across them. (I wish I could remember the exact numbers, but I’ve been up for going on 20 hours now, my brain is shutting down)


DWDM is a great technology – allowing 4-8 different signals to travel down the same link.

The down side is when you get, say 8 channels going down a 60km link, you’ve created a very wide path indeed.

But you’ve not fixed the latency problem. Under ideal circumstances latency over fibre is about 5 microseconds per kilometer.

5µs per km at 60km is 300µs, or 0.3ms. That’s each way; there is a return trip as well for each ACK transmission, so every acknowledged I/O costs at least 0.6ms before the arrays or switches add anything of their own.

Now when you add multiple data paths, the only thing that changes is that instead of having one I/O outstanding, waiting for its ACK, you’ve got four or eight.

60km is more than twice what I as an engineer would recommend without some sort of repeater, especially when you consider that optical cable is not an “ideal” transmission medium.

The speed of light has some profound implications for networking technology. Light, or electromagnetic radiation, travels at 299,792,458 meters per second in a vacuum. Within a copper conductor the propagation speed is some three quarters of this speed, and in a fibre optic cable the speed of propagation is slightly slower, at about two thirds of this speed.

At 2/3 the speed of light, that is where the figure of roughly 5µs per kilometer comes from.
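A quick sanity check of the per-kilometer figure, in Python, using only the propagation speed quoted above (light in a vacuum times two thirds).  The function name is made up, and the two-thirds factor is the rule of thumb from the text rather than a measured value for any particular fibre.  It counts propagation delay only; switches, DWDM gear and the arrays themselves add more.

```python
C_VACUUM_M_PER_S = 299_792_458  # speed of light in a vacuum


def one_way_delay_us(distance_km, velocity_factor=2 / 3):
    """Propagation delay in microseconds over distance_km of fibre."""
    speed_m_per_s = C_VACUUM_M_PER_S * velocity_factor
    return distance_km * 1000 / speed_m_per_s * 1e6


per_km = one_way_delay_us(1)
one_way_60 = one_way_delay_us(60)
round_trip_60 = 2 * one_way_60

print(f"{per_km:.2f} us per km")
print(f"{one_way_60:.0f} us one way over 60 km")
print(f"{round_trip_60:.0f} us round trip over 60 km")
```

By this arithmetic the delay is measured in microseconds per kilometer, and the round trip roughly doubles whatever the one-way number is.  Adding more wavelengths does nothing to either figure.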