Best Practices
Multivendor or Single Source? Is there a right answer?
by Jesse on May.26, 2010, under Best Practices, Replication, SingleVendor
Every time I turn around, I seem to run into the same question.
Is it better to be multi-vendor or single source?
Well, the easy answer to that is: it depends. Different vendors do things differently, work better or worse with some hardware, etc.
The arguments in favor of a single-vendor solution are easy: cost, simplicity, management, interoperability.
Even if you’re buying a more expensive solution, there can STILL be major cost savings.
First, in staffing. When you maintain multiple vendors, you have to maintain support staff knowledgeable in each vendor.
If you’ve got a storage team of 5 people, and two of them work almost exclusively on Veritas NetBackup, you *MIGHT*, if you’re lucky, get one subject matter expert for Tier 1 (i.e. Symmetrix), one for Tier 2 (Clariion) and one for NAS (Celerra).
But throw in HDS, IBM DSxxxx, XiV, IBM GPFS, IBM HPSS, NetApp, SONAS, Sun StorEdge, etc. etc. etc. And what do you have?
You either have an overworked staff (and as I’ve discussed, union-protected salaried federal employees aren’t known for 70 hour weeks) or stuff just plain doesn’t get done.
If you don’t spend the money on staffing, you *WILL* spend the money on support and professional services. Now support is one thing. If my XiV or Symm or whatever loses a hard drive, I expect the vendor to own that problem and fix it.
They will *NOT* however send people out to help with day-to-day provisioning without a pretty hefty P.O. associated with it.
And the last reason for going single-vendor is simple: I want stuff that is going to work together. Now yes, functionality costs, but one of the things I like about EMC is that when it comes down to it, it *ALL* works together. I can move data from Symm to Clariion or vice-versa using SAN Copy, and I can migrate fileservers to Celerra and between storage tiers as needed.
There is nothing worse than needing to expand one storage system by 20TB and having the storage somewhere else, but unusable. It means you’re wasting money buying storage you already have. (Especially when your purchase cycle is 4-6 months on average.)
Not a happy thing to explain to the boss.
“Yes, we have 80TB of Clariion available, but the IBM DS4800 is running short so I need to spend an extra $100k on disks.”
“Yes, I know this isn’t budgeted, but the data grew faster than we’d expected.”
(Of course, you can span filesystems across arrays, as long as it’s not replicated data, because you can’t get a consistent split when half of your extents are on one array and half on another)
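To put rough numbers on that conversation with the boss, here is a minimal Python sketch. The 80TB of free Clariion and the 20TB expansion come from the example above; the DS4800 free-space figure and the function name are made up purely for illustration.

```python
def stranded_capacity(free_tb, needed_tb):
    """Return (shortfall_tb, stranded_tb) when free space cannot move between arrays.

    free_tb and needed_tb map an array name to terabytes. Free space on one
    array is assumed to be unusable by any other array (the multi-vendor case).
    """
    shortfall = 0
    stranded = 0
    for array in set(free_tb) | set(needed_tb):
        free = free_tb.get(array, 0)
        need = needed_tb.get(array, 0)
        if need > free:
            shortfall += need - free   # storage you have to go and buy
        else:
            stranded += free - need    # storage you own but cannot use here
    return shortfall, stranded


# Hypothetical figures: the DS4800 has 5TB free and needs 25TB (a 20TB expansion).
free = {"Clariion": 80, "DS4800": 5}
needed = {"DS4800": 25}
shortfall, stranded = stranded_capacity(free, needed)
print(f"Need to buy {shortfall}TB while {stranded}TB sits idle elsewhere")
```

If the arrays could share capacity, the same request would have been covered four times over without a purchase order.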
Consulting vs. Contracting…. A primer….
by Jesse on Nov.04, 2009, under Best Practices, Gripe, Opinion
Ok, I can say it in a sentence.
A contractor is someone you hire to do a job, a consultant is someone you hire to fix a problem.
I’ve done both, but in the last 8 years I’ve been primarily a “Consultant.” My job is to fix whatever the perceived problem is.
Some companies might have a backup problem. You streamline their process and reduce redundancy, and poof, backup problem solved.
Some companies might have a replication problem. You analyze their environment then recommend and implement changes.
Some companies have a data management problem. You simplify storage, identify tiers, and move storage to where it best suits the organization. (i.e. static image data doesn’t belong on Tier-1 Symmetrix)
Some companies have a culture problem.
Here I got nothing.
But when your culture problem interferes with the consulting that you are asking me to do, I bristle.
When your culture problem causes me to wait 8 months of a 1 year contract before I’m given the tools to do my job, I boil.
When your culture problem is making me feel like I should take up golf, I start looking at dice.com for something better. (I hate golf)
Maybe it’s just that I *LOVE* what I do.
I do. I love what I do. I get paid to do what I love. Which is why I can’t stand seeing people who are either A> there to collect a paycheck and, if they’re lucky, a pension, or B> trying to build their own little empire so they can brag to colleagues about how much money they have to spend this year on nothing.
It boggles the mind.
On tape…
by Jesse on Oct.23, 2009, under Backup, Best Practices, DR/COOP, Replication, Tape, Worst Practices
Ok, I have no problem with tape. It’s a *GREAT* backup medium when your requirement is portability for massive amounts of data and you’re not replicating said data.
If I had to ship 400TB of backups to Iron Mountain to protect against the earthquake-to-end-all-earthquakes, tape would be my FIRST choice (though, as a GIANT CAVE, Iron Mountain might not be).
But… (and this is where it gets fun)
I have a customer who *LOVES* tape.
Wants-to-have-its-children loves it.
Uses-it-as-primary-storage loves it.
Now if you:
A> Have a few hundred terabytes of data to Archive.
B> Have millions of dollars to spend on giant room-sized StorageTek libraries, and the space, power, and cooling that entails.
C> Really love tape.
and most importantly
D> Live in the early 1980s
Then Archival to tape is *SO* the way to go.
The argument given is as follows. “Tape is cheaper than Disk”
Well yes, on a terabyte for terabyte scale tape might be cheaper…maybe if you exclude the hardware.
But then throw in something along the lines of EMC’s Atmos product, or even Centera. (I’d even go so far as to say the NetApp box appealed to me at one point. Now that the Celerra supports File Level Retention, I’ve been cured of that.)
Because when you throw in modern options like replication and, dare I say it, DEDUPLICATION, Disk rapidly becomes the better, faster, more cost effective way to store your long-term data.
Now I wouldn’t recommend anyone go out and buy a DMX-4 for archival purposes. (Though if you want to, let me know ahead of time so I can buy some EMC stock. I’m not currently holding any.)
I checked, and the only Tape vs. Disk comparisons I could find on-line were done by storage vendors, each of which has its own agenda (and big surprise, the analysis came out favouring whatever they were selling), so none of them are valid in the grand scheme of things. (I have a few things to say about marketing and statistics, but that’s a different post)
The things I look for when judging where to store data…
A> How many copies of the data do I need?
This is often overlooked and a question not asked. How many copies of a piece of data do you really need? And how many do you currently have? I’ve been in one data center recently where they LITERALLY have boxes of old tapes stacked up along the walls. (Note: Storing your backups WITH the system you’re backing up doesn’t do much in the event of a fire or natural disaster)
B> How long do I need to keep the data?
Retention policies are a big catch for a lot of people. For “Backup” purposes (see my last post) I say two full backups are all that is really required. If there is any likelihood that some critical corruption could be missed for weeks (or months), then adjust your backup strategy accordingly (or find a better way of auditing your production data for errors).
C> Does my data have to be portable?
Ok, this is aimed specifically at Tape. The answer is this. If you have a remote DR facility and a high-speed connection between them, there is absolutely NO REASON to go to tape for portability. By virtue of Replication (whether it be the production data or VTL) you’ve already moved your data off-site. Now if you’ve only got one data centre and it’s sitting right on the San Andreas fault line (I’ve actually worked here – not joking) then send tapes off-site.
Lots of them.
5 or 6 times a day if you can.
D> Am I storing a copy of production or my only copy?
If you’re storing a copy of production (running) then chances are you’re not going to need the backup. If you’re protecting yourself against someone hitting the delete key accidentally, then maybe Celerra (SnapSure: periodic checkpoints that even the users can access themselves) or Centera (Don’tEvenThinkAboutDeletingThis) are better options.
If you’re storing a copy of something so you can make room for something else, then backup tape is probably not your best option. Consider an archiving solution like Atmos or Centera, or even a Celerra with File Level Retention enabled. Version 5.6.44 and later supports de-duplication (both single-instance storage and compression) natively.
E> Do I have the money to spend now, or am I willing to spend more over time to keep the initial investment down? (This is a valid question, and I’d like to know if anyone has any ideas on which would be the cheaper initial investment.)
Just remember that you have to count the floor-space as well. Something many people forget when scoping out storage buys.
If I want 150TB of storage and I want to do it with tape, what’s the supporting hardware going to cost me? (A single CX4-240 with one rack of disks can provide up to about 220TB of storage with current drive sizes.)
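For anyone who wants to run question E against their own quotes, here is a rough sketch of the comparison in Python. I don’t have real pricing to offer, so every number below is a placeholder to be replaced with your own figures, and the function and parameter names are my own invention. The point is only that hardware and floor-space belong in the sum next to the media.

```python
def total_cost(hardware, media_per_tb, capacity_tb,
               floor_tiles, tile_cost_per_year, years):
    """Up-front hardware + media, plus floor-space over the life of the system."""
    upfront = hardware + media_per_tb * capacity_tb
    floor_space = floor_tiles * tile_cost_per_year * years
    return upfront + floor_space


# PLACEHOLDER numbers only; substitute real quotes before drawing conclusions.
capacity_tb = 150
years = 5
tape = total_cost(hardware=0, media_per_tb=0, capacity_tb=capacity_tb,
                  floor_tiles=0, tile_cost_per_year=0, years=years)
disk = total_cost(hardware=0, media_per_tb=0, capacity_tb=capacity_tb,
                  floor_tiles=0, tile_cost_per_year=0, years=years)
print("tape:", tape, "disk:", disk)
```

Power, cooling, and support contracts are left out to keep the sketch short; they would be extra terms in the same sum.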
A final note. Remember with any “portable” backup solution that you have to keep your backups safe. Tapes, like disks, don’t respond well to things like…well…dropping. Anytime you transport a medium from one location to another physically you put that data at risk.
Just my $.02.
No SAN is an island…
by Jesse on Sep.16, 2009, under Best Practices, Cisco, Fibrechannel, Switches
Ok, that was too cutesy for such a classy establishment.
When you’re building a SAN, everything should play together, in the same SAN box if you will (with my apologies to QLogic.)
When you start putting in multiple stand-alone SAN islands you increase your maintenance overhead exponentially. You also prevent the very thing that makes a SAN a huge advantage over DAS.
Everything can see everything.
If you have a host and need to throw a certain type of storage at it, you can do that easily. (and if it’s already cabled you can do it from your living-room)
However, if SAN A/B are connected to one group of hosts/storage, SAN C/D are connected to a second group, and SAN E (no redundancy) is connected to yet a third, you run into a problem.
*NOW* if I want Host_A (Connected to SAN A/B) to see Storage_C (Connected to SAN C/D) I have to do much more than a simple zoning change.
In the end, this is where a few well placed ISL connections can come in handy. VSAN them off so they don’t cause the fabrics to merge, create an IVR zone to route across them, and then presto. Host_A can see Storage_C with a minimum of fuss.
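As a toy model of the island problem, here is a short Python sketch. It is not switch configuration. Host_A, Storage_C and the fabric pairs come from the example above; the function name and the way a routed ISL is represented are assumptions made for illustration.

```python
# Which fabric pair each device is cabled to (from the example above).
fabric_of = {
    "Host_A": "SAN_AB",
    "Storage_C": "SAN_CD",
}


def can_zone(host, storage, fabric_of, routed_links=frozenset()):
    """True if the host can be zoned to the storage.

    Devices on the same fabric only need a zoning change. Devices on different
    fabrics need a routed ISL (e.g. an IVR zone) directly between those fabrics.
    This sketch only checks direct links, not multi-hop routes.
    """
    a, b = fabric_of[host], fabric_of[storage]
    return a == b or frozenset((a, b)) in routed_links


# Stand-alone islands: no path, no matter how you zone it.
print(can_zone("Host_A", "Storage_C", fabric_of))

# With one routed ISL between the two fabrics, a zone is all that is left to do.
links = {frozenset(("SAN_AB", "SAN_CD"))}
print(can_zone("Host_A", "Storage_C", fabric_of, links))
```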
Or maybe a *REAL* core-edge topology even. Where you put a core switch with 24-port (fully subscribed) blades ISL’d to an edge switch, which maybe has the 4/4/40 configuration (4 8-Gig ports, 4 dedicated 4Gig ports, and 40 shared 4Gig ports)
And put one person in charge of it. Preferably someone with a touch of OCD.
Backup Vs. Archive
by Jesse on Sep.15, 2009, under "Cloud", Archive, Backup, Best Practices, Centerra, Deduplciation, Gripe
The fundamental difference between BACKUP and ARCHIVE.
A backup is there to help you deal with a crisis such as “My datacenter is a smoking hole in the ground, now what do I do?” or something not quite as dramatic like “A virus ate my data.” You recover from the backup to the last known good and all is happy, right? Well, except for the two or three days that might have gone by since your last good backup… (I was in one law firm that lost a drive only to find out their backups hadn’t been running for two months. I came back two weeks later to find that a COMPLETE change in personnel had taken place while I was gone. Lawyers are not very forgiving when they lose two months’ worth of email.)
An archive is data that, while not “Active,” still might be required on a day-to-day basis. Film / Video / Image archives are good examples of that.
So on a disk-based archive you have some platform, ostensibly EMC/Legato DiskExtender or Rainfinity or something along those lines – that will move the data from “Active” storage to “Archive” storage. In some applications you can even set up a true HSM, moving data that hasn’t been accessed to Tier-2(Enterprise SATA) and even Tier-3(yes, tape) as it ages, only to be recalled to Tier-1 when it’s accessed.
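The aging rule behind a true HSM can be sketched in a few lines of Python. This is only an illustration of the idea, not how DiskExtender or Rainfinity actually decide; the 90-day and 365-day thresholds and the function name are assumptions.

```python
from datetime import datetime, timedelta


def pick_tier(last_access, now,
              tier2_after=timedelta(days=90),    # assumed threshold
              tier3_after=timedelta(days=365)):  # assumed threshold
    """Choose a storage tier from how long ago the data was last accessed."""
    age = now - last_access
    if age >= tier3_after:
        return "Tier-3 (tape)"
    if age >= tier2_after:
        return "Tier-2 (Enterprise SATA)"
    return "Tier-1"


now = datetime(2009, 9, 15)
print(pick_tier(datetime(2009, 9, 1), now))   # touched two weeks ago
print(pick_tier(datetime(2009, 3, 1), now))   # untouched for about six months
print(pick_tier(datetime(2007, 1, 1), now))   # untouched for years
```

A recall is just the same rule run again: reading a file resets its last-access time, so the next pass puts it back on Tier-1.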
More often than not I’m brought face to face with people who don’t understand that very subtle difference. One of my recent customers is actually doing it appropriately, using DX and a smallish Centera to archive data that, while retention is required, is almost never actually accessed.
Then there are the people who use backup technology for archival purposes.
I’m pretty “old school” when it comes down to it.
Tape is for backup. Tape is *NOT* supposed to be used as nearline storage when there are equally inexpensive (and more reliable) disk methods out there.
My main complaint about tape as archive: you don’t know if it’s bad until you try to read it. And any time you read it, the simple act of moving the tape into a tape drive that was manufactured under less than ideal conditions puts your data at risk.
Spending millions of dollars on a new room-sized tape library doesn’t make sense when Centera storage is fairly inexpensive *AND* provides redundancy of the data automatically.
Spending more millions of dollars on three of them is lunacy when one EMC Atmos setup could provide redundancy and a single namespace for recall. (And if you go whole hog, geographically relevant retrieval is an option too, so you automatically get it from the closest copy.)
It pains me to see it done wrong. Especially when it involves trying to shoe-horn two more STK monsters into an already cramped datacenter when the work of it could be done in a couple of floor-tiles of spinning disks.
VMWare Booting…
by Jesse on Aug.31, 2009, under Best Practices, Linux, VMWare, Worst Practices
Ok, I’m curious as to whether anyone has an answer for this.
Why don’t more people boot VMWare ESX from the SAN?
It occurred to me the other night that I have 2 36G drives in each of my servers that I use possibly 10G of, when I already have a High-Availability storage solution at my fingertips. I’ve got plenty of storage space, not even including the vault drives.
So I tried it. I took one of my off-line VMware boxes (I use DPM, so at any given time 2 of my 3 VMware hosts are probably in standby mode) and popped the drives out of it.
I turned it on, went into the BIOS, disabled the onboard RAID controller and enabled the boot BIOS on one of the Emulex HBAs.
I created an 18G LUN on the Clariion and assigned it to the host as LUN0 and poof, I have a boot disk.
Worked like a charm. The one surprise (pleasant) is that VMWare seems aware of the multi-pathed boot device even without any form of powerpath on the system. (That was my biggest concern)
So now I have my VMWare infrastructure running on a host with ZERO fixed-disk drives spinning in it.
So has anyone else tried this and know of any gotchas involved that I may not have run across yet? I’ve done windows and Linux native boot-from-san many many times, but this is my first attempt at VMWare.
I’ve not, however, tried pulling a path to see just HOW resilient it is… I should probably try that before I convert the other two systems to diskless operation, right?
New look
by Jesse on Aug.18, 2009, under Best Practices
I’m bored – bygones.
Monday was another one of those days. When will facilities people get it through their heads that
…maybe it isn’t a GREAT idea to test the generator during the day…
…maybe it’s something better done at night, on a weekend, or when the moon isn’t full…
…maybe it’s a good idea to let the IT people know you’re testing it…
…maybe it’s an even better idea to CLOSE THE BYPASS BREAKER before you start the test.
Monday at about 2pm the planets aligned in their universal task of making me work late.
17 hours later I left the site.
I’m pretty impressed. We went from a quick-quiet datacenter to back up and running in about 10 hours. A few more hours working out parts replacements… and all is golden.
Not bad. Could have been better. I hope so because in order to fix the Generator/UPS problem that caused the issue in the first place, they are going to have to take the power down again…
At least this time it will be graceful…I hope…I think they’ve scheduled it for the next full moon.
I will say this – of the vendors EMC was first on scene, and they had parts in tow before we even knew what we needed.
I’m suitably impressed.
On roleplaying…
by Jesse on Aug.07, 2009, under Best Practices, Celerra, NFS, NetApp, Vendor Abuse, Worst Practices
Ok – certain people do certain things well.
I’m a storage administrator/architect. If you present me a problem I will *ALWAYS* look at it from a storage standpoint. If you present me with a non-storage problem, I’ll try and make it fit.
I’ve identified four types of systems engineer-type-people:
Storage people
Server people
Network people
Desktop people
I think that just about anyone in IT either fits into one of those four roles or supports one of those roles.
Now when you are looking to solve a problem, the solution you get depends on who you go to. If you ask a desktop person to solve a network problem for instance, they will probably come up with something under the desk. (IE throwing a linksys router under a desk.)
If you try and throw a server person a storage role, you’re going to get a server solution to that role.
Enter IBM GPFS.
GPFS is a server solution to a storage problem. It’s obvious that the person who came up with the idea of solving a storage problem by loading software on a server is not a storage person.
POSIT: Multiple hosts in a web-farm need access to data. Filesystems need to be R/W to an ingest server and R/O to the web-content servers.
Storage Solution: NAS/NFS. Trunked connection to a real backbone and multiple Apache webserver front-ends running at 1G to play out data. (The fastest data transfer is going to be the 45MB/Sec backbone coming into the building, so a single Gigabit connection can handle it.) F5 round-robin load-balancer to distribute the front-end load. (Might also be proposed by savvy network people, who tend to understand NAS.)
Server Solution: IBM GPFS. Over a million dollars in net-new server hardware plus software licensing (not including storage). Each host accessing storage requires HBAs, drivers, and a fast, RELIABLE network, plus a level of complexity unheard of even in government.
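The “a single Gigabit connection can handle it” claim in the storage solution is simple arithmetic. Here it is as a Python sketch, taking the 45MB/Sec figure at face value and ignoring protocol overhead (both simplifications are mine, as is the function name).

```python
def link_utilisation(demand_mbytes_per_s, link_mbits_per_s):
    """Fraction of a link used by a given demand, ignoring protocol overhead."""
    link_mbytes_per_s = link_mbits_per_s / 8   # 8 bits per byte
    return demand_mbytes_per_s / link_mbytes_per_s


backbone = 45     # MB/s coming into the building (from the post)
gigabit = 1000    # Mb/s, one front-end connection

used = link_utilisation(backbone, gigabit)
print(f"One Gigabit link carries the full backbone at {used:.0%} utilisation")
```

Even with a generous allowance for TCP and NFS overhead, the backbone saturates long before a single front-end link does.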
From what I can tell (and maybe someone can give me a little more insight), it works very much like Sun’s Shared QFS. A metadata server acts as a gatekeeper, telling which member servers can access which blocks on which disks. There is still no simultaneous disk access because a SCSI lock is a SCSI lock.
Now from a storage standpoint, this is rife with problems.
First off, it would seem that if network access were interrupted during a write, data integrity could easily be compromised.
Secondly, Other than block-level mirroring of the underlying disks, I can’t see a good way to replicate this. And block-level mirroring of the underlying disks would require an identical infrastructure at the remote/DR site wouldn’t it? That is of course assuming that the metadata can be mirrored.
Now in database uses or other types of distributed computing I can see it being VERY valuable. But for flat file storage and web retrieval I can’t think of a single good reason to use something so obnoxiously complicated. Especially when EMC Celerra, NetApp, or just about any of the other higher-end NAS appliances would cost *SO MUCH* less and be *SO MUCH* more reliable.
/EndOfRant
How to tell if your sales rep hates you….
by Jesse on May.22, 2009, under Best Practices, Celerra, Ethics, NFS, Replication, Vmware-NFS, Worst Practices
I just got the following job posting and it made me, literally, laugh out loud, spitting latte all over my laptop.
If your sales rep allows you to do something like this, it’s a fair bet that s/he hates you (or is planning to buy your company out of bankruptcy later).
“WANTED: VMWare 1-month resident to assist with new deployment/planning around 200VM’s and new Celerra NS480′s being purchased by client. Will probably end up primarily being VM’s using NFS on NS Celerra Replication will be enabled between (2) NS480′s.”
The key points are:
200VM’s
Celerra
**NFS**
Replicator
Ewww…..
Did I mention NFS?
Someone actually sold this? Even if the customer comes to you direct and says “this is what I want…” the answer should be “In the interests of protecting you from yourself, I can’t allow you to do this.”
I don’t care how much the deal is worth.
On Security….
by Jesse on Mar.25, 2009, under Best Practices, General, Job Market, Security
Security is a good thing….until it isn’t.
Security isn’t a good thing when it interferes needlessly with productivity. By needlessly I mean that you don’t get the security you’re looking for, but you do make it harder than it needs to be for your people to do their jobs.
A few examples:
1. Company “A” hires consultants to perform day-to-day tasks. Company “A” then refuses to give them access to the troubleshooting tools and software downloads they are supposed to be supporting.
2. Company “B” decides that its employees can’t be trusted. (If you can’t trust an employee, why are they an employee?) Company “B” then decides to lock down PC workstations so that *NO* software can be installed or removed by said employee. Company “B” instructs their helpdesk to ignore all requests for installation of needed software.
3. Company “C” requires a contractor to be on-call for 24×7 support. Company “C” refuses to grant said contractor remote access to support the equipment he’s on-call to support, forcing a 45 minute drive in the event of an emergency. Company “C” then reams the contractor for not being timely in his/her support.
4. (My Favourite) Company “D” gets *VERY* creative with Windows Group Policies on a workstation, rendering said workstation a paperweight. Company “D” then neglects to block access to the system BIOS and leaves booting from USB enabled, which lets any user introduce any unlocked/unguarded operating system in the world into their environment by way of a thumbdrive.
In my career, I’ve been said employee/contractor in every one of these instances.
(Just an aside - my favorite gotcha came from watching a help-desk guy come in and disable the USB ports in the BIOS of a system, only to be rudely reminded that the keyboard and mouse are USB (and that they don’t make PS/2 connections for them any longer).)
My point is this: If you’re going to implement security make sure it’s effective security that also allows your employees to do their jobs.
If it’s not effective security (i.e. it isn’t going to show a security benefit, that benefit being a quantifiable improvement in the security of your data or the stability of your environment), don’t bother with it. You do nothing but alienate the people you hire to work for you and make them want to go elsewhere.
Contrary to popular belief, there are still elsewheres to go.