<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Data De-Duplication</title>
	<atom:link href="http://blog.50micron.com/2008/05/14/data-de-duplication/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.50micron.com/2008/05/14/data-de-duplication/</link>
	<description>Ranting and raving about storage and technology</description>
	<lastBuildDate>Mon, 19 Dec 2011 15:36:52 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
	<item>
		<title>By: storageguy201</title>
		<link>http://blog.50micron.com/2008/05/14/data-de-duplication/comment-page-1/#comment-6246</link>
		<dc:creator>storageguy201</dc:creator>
		<pubDate>Thu, 29 May 2008 14:58:00 +0000</pubDate>
		<guid isPermaLink="false">http://www.sangod.com/?p=214#comment-6246</guid>
		<description>Yep, but it has its benefits in the way it&#039;s implemented today. Prior to these appliances we couldn&#039;t back up 80TB of data online on a single 20TB tape cartridge or disk automagically and do it in a short backup window.

Though I agree with the old technology, new market name thought...the mainframes could run multiple OSes back in the 60s...now it&#039;s VMware. Sun E10K had dozens of CPUs way back when, now there are multi-core multiprocessors from Intel/AMD (implemented differently). Databases wrote logs for lazy writes, now this is a journaling file system (and the concept behind many storage appliances to improve write performance)...and so on.

However, all these technologies have a better implementation which allows them to become so popular in the mainstream computer market.</description>
		<content:encoded><![CDATA[<p>Yep, but it has its benefits in the way it&#8217;s implemented today. Prior to these appliances we couldn&#8217;t back up 80TB of data online on a single 20TB tape cartridge or disk automagically and do it in a short backup window.</p>
<p>Though I agree with the old technology, new market name thought&#8230;the mainframes could run multiple OSes back in the 60s&#8230;now it&#8217;s VMware. Sun E10K had dozens of CPUs way back when, now there are multi-core multiprocessors from Intel/AMD (implemented differently). Databases wrote logs for lazy writes, now this is a journaling file system (and the concept behind many storage appliances to improve write performance)&#8230;and so on.</p>
<p>However, all these technologies have a better implementation which allows them to become so popular in the mainstream computer market.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: SanGod</title>
		<link>http://blog.50micron.com/2008/05/14/data-de-duplication/comment-page-1/#comment-6243</link>
		<dc:creator>SanGod</dc:creator>
		<pubDate>Thu, 29 May 2008 03:35:04 +0000</pubDate>
		<guid isPermaLink="false">http://www.sangod.com/?p=214#comment-6243</guid>
		<description>We used to do the same thing at the pretend-bank - but we used a Clariion as a disk-storage-unit.  Now I don&#039;t know that I got 40% compression out of most of the crap we backed-up to that area, most of it was MP3 / WAV format audio from the call-center floor trying to keep the reps honest.  

I know, from past experience, that MP3 format audio just plain doesn&#039;t compress well.  I seriously doubt we would have seen &quot;as advertised&quot; compression ratios.

Again - DeDuplication is simply new-fangled marketing-speak for compression.  The basic mechanism for compressing data isn&#039;t anything that PKZIP hasn&#039;t been doing for 20 years.</description>
		<content:encoded><![CDATA[<p>We used to do the same thing at the pretend-bank &#8211; but we used a Clariion as a disk-storage-unit.  Now I don&#8217;t know that I got 40% compression out of most of the crap we backed-up to that area, most of it was MP3 / WAV format audio from the call-center floor trying to keep the reps honest.  </p>
<p>I know, from past experience, that MP3 format audio just plain doesn&#8217;t compress well.  I seriously doubt we would have seen &#8220;as advertised&#8221; compression ratios.</p>
<p>Again &#8211; DeDuplication is simply new-fangled marketing-speak for compression.  The basic mechanism for compressing data isn&#8217;t anything that PKZIP hasn&#8217;t been doing for 20 years.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: storageguy201</title>
		<link>http://blog.50micron.com/2008/05/14/data-de-duplication/comment-page-1/#comment-6240</link>
		<dc:creator>storageguy201</dc:creator>
		<pubDate>Wed, 28 May 2008 15:23:37 +0000</pubDate>
		<guid isPermaLink="false">http://www.sangod.com/?p=214#comment-6240</guid>
		<description>As a real-world use case - we have a requirement to back up our mission-critical databases (rather large) in a way that&#039;s easily restorable. We have to keep 30 days&#039; worth of backups, and the backups must finish within a very short window so as to not drag down performance. This is where we use our de-dup appliance. Because of all the white-space in the database it compresses really nicely. We can typically project up to 40x with the de-dup technology. Restoring it is fast and painless without having to recall or load LTO4 tapes (which we also use - for long term archival of the databases).</description>
		<content:encoded><![CDATA[<p>As a real-world use case &#8211; we have a requirement to back up our mission-critical databases (rather large) in a way that&#8217;s easily restorable. We have to keep 30 days&#8217; worth of backups, and the backups must finish within a very short window so as to not drag down performance. This is where we use our de-dup appliance. Because of all the white-space in the database it compresses really nicely. We can typically project up to 40x with the de-dup technology. Restoring it is fast and painless without having to recall or load LTO4 tapes (which we also use &#8211; for long term archival of the databases).</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: SanGod</title>
		<link>http://blog.50micron.com/2008/05/14/data-de-duplication/comment-page-1/#comment-6229</link>
		<dc:creator>SanGod</dc:creator>
		<pubDate>Mon, 26 May 2008 03:12:58 +0000</pubDate>
		<guid isPermaLink="false">http://www.sangod.com/?p=214#comment-6229</guid>
		<description>It&#039;s not whether it works or not, it&#039;s whether the ROI is worth it.  If all you&#039;re looking to do is to save money on tapes, well you&#039;re going to have to save a *LOT* of tapes to come out even.

I&#039;m hoping no-one is trying to push this as something to be done to production data?  I wouldn&#039;t run any kind of compression against production data for the same reason that I wouldn&#039;t rely on disk-compression.  It takes processing cycles/time, and puts you further at risk for corruption.

De-duplication hardware is expensive, it&#039;s another working part to break, and it&#039;s something else an already beleaguered SAN admin has to master.</description>
		<content:encoded><![CDATA[<p>It&#8217;s not whether it works or not, it&#8217;s whether the ROI is worth it.  If all you&#8217;re looking to do is to save money on tapes, well you&#8217;re going to have to save a *LOT* of tapes to come out even.</p>
<p>I&#8217;m hoping no-one is trying to push this as something to be done to production data?  I wouldn&#8217;t run any kind of compression against production data for the same reason that I wouldn&#8217;t rely on disk-compression.  It takes processing cycles/time, and puts you further at risk for corruption.</p>
<p>De-duplication hardware is expensive, it&#8217;s another working part to break, and it&#8217;s something else an already beleaguered SAN admin has to master.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: kstansb</title>
		<link>http://blog.50micron.com/2008/05/14/data-de-duplication/comment-page-1/#comment-6228</link>
		<dc:creator>kstansb</dc:creator>
		<pubDate>Thu, 22 May 2008 14:19:19 +0000</pubDate>
		<guid isPermaLink="false">http://www.sangod.com/?p=214#comment-6228</guid>
		<description>I&#039;m working on a POC for Avamar right now and it really does work as advertised.  We have a 10TB (and growing) VMware environment.  Right now we have to dump full backups to tape every night for various reasons.  In my Avamar testing I&#039;m seeing a 40-80% reduction for the first backup for any given guest.  After the first pass I&#039;m seeing reductions of 99.7 to 99.9%.  This is across 18 VM guests of various applications ranging from 4GB to 750GB.

This works well for us because we have a lot of spare CPU cycles on our VMware boxes at night when we run backups.  Once we move to production we are planning on doing away with tape completely for VMware and simply replicating the data to a 2nd Avamar appliance at our remote site.</description>
		<content:encoded><![CDATA[<p>I&#8217;m working on a POC for Avamar right now and it really does work as advertised.  We have a 10TB (and growing) VMware environment.  Right now we have to dump full backups to tape every night for various reasons.  In my Avamar testing I&#8217;m seeing a 40-80% reduction for the first backup for any given guest.  After the first pass I&#8217;m seeing reductions of 99.7 to 99.9%.  This is across 18 VM guests of various applications ranging from 4GB to 750GB.</p>
<p>This works well for us because we have a lot of spare CPU cycles on our VMware boxes at night when we run backups.  Once we move to production we are planning on doing away with tape completely for VMware and simply replicating the data to a 2nd Avamar appliance at our remote site.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: jm</title>
		<link>http://blog.50micron.com/2008/05/14/data-de-duplication/comment-page-1/#comment-6225</link>
		<dc:creator>jm</dc:creator>
		<pubDate>Thu, 15 May 2008 17:44:02 +0000</pubDate>
		<guid isPermaLink="false">http://www.sangod.com/?p=214#comment-6225</guid>
		<description>You got it, they store a hash table for everything they&#039;ve ever seen.  Encryption would be easy to add on at the back end so your data is encrypted at rest, maybe some of the dedup vendors are doing it already.  A while back EMC bought a company called Avamar that does all this deduplication at the host side, and I know it does encrypted data in flight, not sure about at rest though.  The Avamar appliance not only stores a big hash table for everything it has ever seen, it also shares the hashes it has seen with all the clients it backs up.  This way the clients don&#039;t ever send data over the wire that the back-end has already seen, no matter which client originally sent it.  It&#039;s a cool idea.  In addition to the space savings, I imagine it works quite well for folks who want to do centralized backups for remote offices on WAN links.</description>
		<content:encoded><![CDATA[<p>You got it, they store a hash table for everything they&#8217;ve ever seen.  Encryption would be easy to add on at the back end so your data is encrypted at rest, maybe some of the dedup vendors are doing it already.  A while back EMC bought a company called Avamar that does all this deduplication at the host side, and I know it does encrypted data in flight, not sure about at rest though.  The Avamar appliance not only stores a big hash table for everything it has ever seen, it also shares the hashes it has seen with all the clients it backs up.  This way the clients don&#8217;t ever send data over the wire that the back-end has already seen, no matter which client originally sent it.  It&#8217;s a cool idea.  In addition to the space savings, I imagine it works quite well for folks who want to do centralized backups for remote offices on WAN links.</p>
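<p>Here is a toy sketch of the hash-table idea, not how Avamar or any other product actually does it. The names (DedupeStore, backup, restore), the fixed 4096-byte chunks, and the choice of SHA-256 are all assumptions made for illustration; real appliances use variable-size blocks. It only shows that a chunk already known to the back end is never sent again, no matter which client sent it first.</p>

```python
import hashlib


class DedupeStore:
    """Toy back end: keeps one copy of each unique chunk, keyed by its hash."""

    def __init__(self):
        self.chunks = {}  # maps hex digest to chunk bytes

    def has(self, digest):
        return digest in self.chunks

    def put(self, digest, chunk):
        self.chunks[digest] = chunk


def backup(store, data, chunk_size=4096):
    """Send only chunks the store has not seen; return (recipe, bytes_sent).

    Assumption: fixed-size chunks and SHA-256, chosen for simplicity.
    """
    recipe = []
    bytes_sent = 0
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        if not store.has(digest):
            store.put(digest, chunk)
            bytes_sent += len(chunk)
        recipe.append(digest)
    return recipe, bytes_sent


def restore(store, recipe):
    """Rebuild the original data from its ordered list of chunk hashes."""
    return b"".join(store.chunks[d] for d in recipe)


if __name__ == "__main__":
    store = DedupeStore()
    client_a = b"A" * 4096 + b"B" * 4096
    client_b = b"B" * 4096 + b"C" * 4096  # shares the "B" chunk with client A
    recipe_a, sent_a = backup(store, client_a)
    recipe_b, sent_b = backup(store, client_b)
    print("client A sent", sent_a, "bytes; client B sent", sent_b, "bytes")
```

<p>In this sketch the second client only ships the chunk the store has never seen, which is the WAN-friendly part of the idea.</p>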
]]></content:encoded>
	</item>
	<item>
		<title>By: SanGod</title>
		<link>http://blog.50micron.com/2008/05/14/data-de-duplication/comment-page-1/#comment-6224</link>
		<dc:creator>SanGod</dc:creator>
		<pubDate>Thu, 15 May 2008 16:46:52 +0000</pubDate>
		<guid isPermaLink="false">http://www.sangod.com/?p=214#comment-6224</guid>
		<description>Ok - that&#039;s the first &quot;new&quot; aspect of DeDupe I&#039;ve heard so far.  You&#039;re saying that the appliance stores historical data as well as what&#039;s transmitted in-line?  That adds a bit to the scheme.

I also understand that the DeDupe appliance uses variable block-sizes, whereas I believe the older compression engines look at a fixed block-size, say 2114 bytes, and look for redundancies within.

Now add that to a key-based &quot;de-duplication&quot; engine and you have space-savings and encryption in the same barrel.  Because once the data is de-duplicated it&#039;s mostly encrypted as it is.  

Ok you have my attention.</description>
		<content:encoded><![CDATA[<p>Ok &#8211; that&#8217;s the first &#8220;new&#8221; aspect of DeDupe I&#8217;ve heard so far.  You&#8217;re saying that the appliance stores historical data as well as what&#8217;s transmitted in-line?  That adds a bit to the scheme.</p>
<p>I also understand that the DeDupe appliance uses variable block-sizes, whereas I believe the older compression engines look at a fixed block-size, say 2114 bytes, and look for redundancies within.</p>
<p>Now add that to a key-based &#8220;de-duplication&#8221; engine and you have space-savings and encryption in the same barrel.  Because once the data is de-duplicated it&#8217;s mostly encrypted as it is.  </p>
<p>Ok you have my attention.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: jm</title>
		<link>http://blog.50micron.com/2008/05/14/data-de-duplication/comment-page-1/#comment-6223</link>
		<dc:creator>jm</dc:creator>
		<pubDate>Thu, 15 May 2008 16:39:36 +0000</pubDate>
		<guid isPermaLink="false">http://www.sangod.com/?p=214#comment-6223</guid>
		<description>I would agree that the idea has been around for a long time, but it hasn&#039;t been practical to implement on a reasonable scale until relatively recently.  Sure, it&#039;s exactly how compression algorithms work, but your typical compression algorithm takes a chunk of data and removes the redundancy in it.  Depending on the algorithm that chunk could be measured in KB or MB, and the larger the chunk the better your overall compression ratio.  In theory, data deduplication extends that chunk size to be the size of all the data the dedupe engine has ever seen before.

People are getting significant storage savings from this.  At my previous employer we were doing backup to disk and running out of room.  Adding a dedupe appliance allowed us to store 20 to 30x as much data in the same space.</description>
		<content:encoded><![CDATA[<p>I would agree that the idea has been around for a long time, but it hasn&#8217;t been practical to implement on a reasonable scale until relatively recently.  Sure, it&#8217;s exactly how compression algorithms work, but your typical compression algorithm takes a chunk of data and removes the redundancy in it.  Depending on the algorithm that chunk could be measured in KB or MB, and the larger the chunk the better your overall compression ratio.  In theory, data deduplication extends that chunk size to be the size of all the data the dedupe engine has ever seen before.</p>
<p>People are getting significant storage savings from this.  At my previous employer we were doing backup to disk and running out of room.  Adding a dedupe appliance allowed us to store 20 to 30x as much data in the same space.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: SanGod</title>
		<link>http://blog.50micron.com/2008/05/14/data-de-duplication/comment-page-1/#comment-6222</link>
		<dc:creator>SanGod</dc:creator>
		<pubDate>Thu, 15 May 2008 16:23:17 +0000</pubDate>
		<guid isPermaLink="false">http://www.sangod.com/?p=214#comment-6222</guid>
		<description>Block-level incrementals, right?

I&#039;ve seen it done before - but as with any incremental backup the longer you go between full stores the more tedious and time-consuming the restore/recovery procedure is, right?

I mean in &#039;backup-class&#039; we learned that in the perfect world you&#039;d have the time and storage to do full backups every day.  You start from the ideal and work your way backwards.

What you&#039;re describing as deduplication is EXACTLY how compression algorithms work.  It&#039;s almost akin to the difference between bitmap graphics and JPEG compressed graphics.   With a bitmap you&#039;re storing information on every pixel.  With JPEG compression you&#039;re storing information on one pixel, then counting how many similar pixels are around.  The difference in the resulting picture quality is solely measured by how the compression algorithm measures &quot;similar.&quot;

I wonder if we&#039;re truly getting into that &quot;nothing new under the sun&quot; state of technology.  Everyone&#039;s new breakthrough seems to be a re-branding / re-marketing of the old technology.  I don&#039;t think you can put a faster processor and more cache / RAM in a computer and call it a new computer.  You can call it an upgrade certainly, but not new.</description>
		<content:encoded><![CDATA[<p>Block-level incrementals, right?</p>
<p>I&#8217;ve seen it done before &#8211; but as with any incremental backup the longer you go between full stores the more tedious and time-consuming the restore/recovery procedure is, right?</p>
<p>I mean in &#8216;backup-class&#8217; we learned that in the perfect world you&#8217;d have the time and storage to do full backups every day.  You start from the ideal and work your way backwards.</p>
<p>What you&#8217;re describing as deduplication is EXACTLY how compression algorithms work.  It&#8217;s almost akin to the difference between bitmap graphics and JPEG compressed graphics.   With a bitmap you&#8217;re storing information on every pixel.  With JPEG compression you&#8217;re storing information on one pixel, then counting how many similar pixels are around.  The difference in the resulting picture quality is solely measured by how the compression algorithm measures &#8220;similar.&#8221;</p>
<p>I wonder if we&#8217;re truly getting into that &#8220;nothing new under the sun&#8221; state of technology.  Everyone&#8217;s new breakthrough seems to be a re-branding / re-marketing of the old technology.  I don&#8217;t think you can put a faster processor and more cache / RAM in a computer and call it a new computer.  You can call it an upgrade certainly, but not new.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: jm</title>
		<link>http://blog.50micron.com/2008/05/14/data-de-duplication/comment-page-1/#comment-6221</link>
		<dc:creator>jm</dc:creator>
		<pubDate>Thu, 15 May 2008 16:17:02 +0000</pubDate>
		<guid isPermaLink="false">http://www.sangod.com/?p=214#comment-6221</guid>
		<description>I suppose you could call de-duplication a single instance store, but typically single instance stores work at a file level.  Data de-duplication works on a variable-size block level.  If I save two copies of a file and slightly modify one of them, I only need to save a single copy of the bits in the first file and the delta between the first and second.  Also, if you&#039;re comparing data dedupe to tape compression I think you&#039;re missing the bigger picture.  Say I have 1TB of data and I do a full backup.  With tape and tape compression, I store say 500GB on tape on a good day.  Via a dedupe appliance I store 500GB also.  Wait a week and do another full backup.  I store another 500GB on tape.  Via my dedupe appliance I store the changes between that first full and the next, say 10%.  So, with only two full backups I&#039;ve got 1TB of data on tape and 550GB on my dedupe appliance.  Consider how often you do full backups and your retention period and do the math, you&#039;re storing a whole heck of a lot less data if you dedupe it.</description>
		<content:encoded><![CDATA[<p>I suppose you could call de-duplication a single instance store, but typically single instance stores work at a file level.  Data de-duplication works on a variable-size block level.  If I save two copies of a file and slightly modify one of them, I only need to save a single copy of the bits in the first file and the delta between the first and second.  Also, if you&#8217;re comparing data dedupe to tape compression I think you&#8217;re missing the bigger picture.  Say I have 1TB of data and I do a full backup.  With tape and tape compression, I store say 500GB on tape on a good day.  Via a dedupe appliance I store 500GB also.  Wait a week and do another full backup.  I store another 500GB on tape.  Via my dedupe appliance I store the changes between that first full and the next, say 10%.  So, with only two full backups I&#8217;ve got 1TB of data on tape and 550GB on my dedupe appliance.  Consider how often you do full backups and your retention period and do the math, you&#8217;re storing a whole heck of a lot less data if you dedupe it.</p>
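<p>A rough sketch of that math, using the numbers above as assumptions (1TB per full backup, 2:1 compression, 10% change between fulls). The function names are made up for illustration and no real appliance is modeled; the changed data is counted as 10% of the 500GB first pass so that two fulls come out to the 550GB figure.</p>

```python
def tape_stored_gb(full_gb, compression_ratio, num_fulls):
    """Every full backup is compressed and written to tape in its entirety."""
    return num_fulls * full_gb / compression_ratio


def dedupe_stored_gb(full_gb, compression_ratio, change_rate, num_fulls):
    """First full is stored once; each later full adds only the changed part.

    Assumption: the change is measured against the stored size of the first
    full (10% of 500GB is 50GB), matching the 550GB example.
    """
    first = full_gb / compression_ratio
    return first + (num_fulls - 1) * first * change_rate


if __name__ == "__main__":
    # Weekly fulls kept for 2, 4, and 12 weeks of retention.
    for fulls in (2, 4, 12):
        tape = tape_stored_gb(1000, 2.0, fulls)
        dedupe = dedupe_stored_gb(1000, 2.0, 0.10, fulls)
        print(f"{fulls} fulls: tape {tape:.0f}GB, dedupe {dedupe:.0f}GB")
```

<p>The gap widens with every extra full you retain, because tape grows by a whole compressed full each time while the dedupe side grows only by the changes.</p>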
]]></content:encoded>
	</item>
</channel>
</rss>

