04-28-06 06:12 PM
In article <1146241178.602187.47380@u72g2000cwu.googlegroups.com>,
<lacka_dacka@yahoo.com> wrote:
>That's correct, Faeandar... it can compress existing data as well...
>
>I agree that its implementation doesn't fully match marketing yet...
>but I guess my question is... *if* the product were to match marketing
>and be able to introduce minimal latency and at the same time compress
>at least about 60%, would this be a type of technology to invest in?

Storage compression is fun. It is quite easy if the data is only
written and never modified or read, except that it requires quite a
bit of CPU power, which has to come from somewhere: either extra CPU
boxes have to be introduced in the storage stack (which cost money and
add complexity and unreliability), or existing CPUs in the disk arrays
/ NAS boxes have to be used (which slows down the storage systems), or
the compression runs like a device driver or loadable file system on
the application server (where compression uses expensive CPU cycles on
a machine that was bought to run the customer's application, not to
masturbate the customer's data).

Reading the compressed data sequentially from the beginning is
typically easy. Reading it randomly can be hard if decompression is
implemented carelessly. Reading little blocks in the middle can be
very hard if decompression relies on processing the whole stream
sequentially in large chunks.

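To make the random-read point concrete, here is a minimal sketch of
one common way around it: compress fixed-size chunks independently and
keep an index of where each compressed chunk starts. The chunk size and
the names compress_chunked and read_range are my own assumptions for
illustration, not how any particular product does it.

```python
import zlib

CHUNK = 64 * 1024  # hypothetical: bytes of uncompressed data per chunk


def compress_chunked(data: bytes, chunk: int = CHUNK):
    """Compress each fixed-size chunk independently; return (blob, index).

    index[i] is the (offset, length) of compressed chunk i inside blob.
    """
    parts, index, offset = [], [], 0
    for start in range(0, len(data), chunk):
        c = zlib.compress(data[start:start + chunk])
        parts.append(c)
        index.append((offset, len(c)))
        offset += len(c)
    return b"".join(parts), index


def read_range(blob: bytes, index, pos: int, size: int,
               chunk: int = CHUNK) -> bytes:
    """Return data[pos:pos+size], decompressing only overlapping chunks."""
    if size <= 0:
        return b""
    first = pos // chunk
    last = min((pos + size - 1) // chunk, len(index) - 1)
    out = []
    for i in range(first, last + 1):
        off, length = index[i]
        out.append(zlib.decompress(blob[off:off + length]))
    skip = pos - first * chunk  # bytes to drop from the first chunk
    return b"".join(out)[skip:skip + size]
```

The price is a somewhat worse compression ratio (each chunk starts
with an empty dictionary) plus the index, which is one more piece of
metadata that has to be kept consistent.
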
What can be catastrophic is modifying (overwriting) the data in place.
First off, many compression algorithms rely on finding similarities
within the data stream, and modifying the data disrupts them, so the
new data typically compresses less well and ends up larger. When that
happens, it no longer fits into the hole left by the old data, and the
storage system has to virtualize the layout and store the new data
somewhere else. If you do that for a while, the original file layout
becomes completely chaotic, both reading and writing speed go to hell,
and the metadata overhead and complexity of the stored files become a
big mess. Furthermore, it is very difficult (but not impossible) to
implement a storage system that can move blocks of data around within
a file and still be correct, without losing or corrupting data, even
in the face of system failures. Example: What if the compression
system is in the middle of updating the file to indicate (typically in
some metadata) that one block had to be moved to the end because it is
less compressible, and then the power fails, and this complex
multi-phase update is only partially recorded on disk? There are ways
around this (which typically involve logging, hardware NVRAM, and very
careful ordering of operations), but those require serious thought and
great care in the implementation.

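A toy sketch of the overwrite problem, under heavy assumptions: blocks
are compressed independently and packed back to back, and a block map
records where each one lives. CompressedFile, BLOCK and demo are
hypothetical names; nothing here is crash-safe, which is exactly the
hard part described above.

```python
import os
import zlib

BLOCK = 4096  # hypothetical logical block size


class CompressedFile:
    """Toy layout: compressed blocks packed back to back, plus a map."""

    def __init__(self, data: bytes):
        self.store = bytearray()
        self.block_map = []   # per block: (offset, compressed_len, slot_size)
        self.relocations = 0
        for start in range(0, len(data), BLOCK):
            c = zlib.compress(data[start:start + BLOCK])
            self.block_map.append((len(self.store), len(c), len(c)))
            self.store += c

    def overwrite(self, block_no: int, new_data: bytes) -> None:
        c = zlib.compress(new_data)
        offset, _, slot = self.block_map[block_no]
        if len(c) <= slot:
            # Fits into the old hole: update in place.
            self.store[offset:offset + len(c)] = c
            self.block_map[block_no] = (offset, len(c), slot)
        else:
            # Does not fit: append at the end, leaving a dead hole behind.
            # A real system must make this map update survive a power
            # failure; this sketch does not even try.
            self.block_map[block_no] = (len(self.store), len(c), len(c))
            self.store += c
            self.relocations += 1

    def read(self, block_no: int) -> bytes:
        offset, length, _ = self.block_map[block_no]
        return zlib.decompress(bytes(self.store[offset:offset + length]))


def demo() -> int:
    f = CompressedFile(b"A" * (4 * BLOCK))   # compresses extremely well
    f.overwrite(1, os.urandom(BLOCK))        # random bytes barely compress
    return f.relocations
```

After the overwrite, logical block 1 lives behind block 3 on the
medium, so a sequential read of the file is no longer a sequential
read of the storage.
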
One more hair in the soup: Some data doesn't compress very well.
Examples include images (for example in JPEG format), documents (in
PDF format, which is often internally compressed), backups (which are
often compressed by the backup software), and archival data such as
mail archives (which are usually compressed by the archiving
software). If you are running interestingly complex ILM (information
lifecycle management) software, you probably already have more
compression going on in the software stack, and then adding one more
compression layer won't help much.

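This is easy to check for yourself. The sketch below compresses some
generated text once, then compresses the result a second time. The
word list and the use of zlib are assumptions on my part; the
once-compressed stream merely stands in for a JPEG, PDF or backup
stream.

```python
import random
import zlib

# Hypothetical vocabulary, used only to generate compressible sample text.
WORDS = [b"storage", b"filer", b"backup", b"archive", b"block", b"tape",
         b"disk", b"mail", b"latency", b"compress", b"data", b"system"]


def ratio(data: bytes) -> float:
    """Compressed size divided by original size (smaller is better)."""
    return len(zlib.compress(data, 9)) / len(data)


rng = random.Random(0)  # fixed seed so the sample text is reproducible
text = b" ".join(rng.choice(WORDS) for _ in range(50_000))
once = zlib.compress(text, 9)  # stand-in for already-compressed content

first_pass = ratio(text)    # expected well below 1
second_pass = ratio(once)   # expected close to 1: little left to squeeze

print(f"first pass:  {first_pass:.2f}")
print(f"second pass: {second_pass:.2f}")
```

The second pass costs about as much CPU as the first and saves next
to nothing, which is the whole point.
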
One technique that is closely related to compression is duplicate
elimination: Don't store copies of files (or blocks or mail messages)
if the content is identical. This really helps with backups of
desktop workstations (because every machine has a copy of the MS Excel
DLLs, which are mostly identical), and sometimes helps with mail
archiving (because the same 5MB spreadsheet attachment is forwarded
100 times within the same mail system, meaning that 100 copies of it
are in the mail archive). But again, be warned: some ILM software
already contains such duplicate elimination, so doing it again in the
software stack can be pointless and wasteful.

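The idea fits in a few lines if you are willing to identify content
by a hash. The sketch below is a whole-file version keyed by SHA-256;
DedupStore and its methods are hypothetical names, and a real system
would also have to think about hash collisions, reference counting for
deletes, and block-level rather than file-level matching.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store: identical content is kept once."""

    def __init__(self):
        self.blobs = {}   # digest -> content, stored a single time
        self.names = {}   # name -> digest

    def put(self, name: str, content: bytes) -> bool:
        """Store content under name; return True if new bytes were written."""
        digest = hashlib.sha256(content).hexdigest()
        is_new = digest not in self.blobs
        if is_new:
            self.blobs[digest] = content
        self.names[name] = digest
        return is_new

    def get(self, name: str) -> bytes:
        return self.blobs[self.names[name]]

    def stored_bytes(self) -> int:
        return sum(len(b) for b in self.blobs.values())


# The forwarded-attachment case: 100 mails, one attachment body.
store = DedupStore()
attachment = b"spreadsheet bytes " * 1000
for i in range(100):
    store.put(f"mail-{i}/budget.xls", attachment)
```

Here store.names has 100 entries but store.stored_bytes() equals the
size of a single attachment.
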
>I see all the arguments that I need to cut down on the cost of
>management of the data... and believe me, we are doing everything we
>can to do that using products from Archivus etc...

This is really the place where compression can shine: data that is
written once, never modified, and not read all that often. Examples
include backup, reference data, and compliance archives. But the
above warnings still apply: compression is not a panacea.

> But in addition, we
>do spend a lot of money on NetApp filers that these kind of products
>seem to be able to help with... Not to mention, since I started this
>thread, in doing more research into our environment, I am amazed to see
>how much money we are paying on energy bills alone with all the storage
>equipment we have...
Absolutely true. Here is my new rule of thumb: For every $1 you spend
on storage systems, you will spend another $1 on energy and
infrastructure costs (that includes air conditioning and floorspace
for it) over the lifetime of the hardware, and anywhere between $3 and
$15 on system administration and management (a good fraction of which
goes into avoiding, planning for, and managing failures of the storage
system). And if you buy a tape drive for $1, you can easily spend
anywhere between $10 and $100 on the blank tapes required to operate
it.

Also remember that management overhead doesn't just scale with the
size of the storage system in GB, but also with the complexity of the
storage system. A NetApp with 80 disks is only a little harder to
administer than a NetApp with 60 disks. But a NetApp with a separate
compression system installed is a lot harder to administer than just a
NetApp. It might be much cheaper to throw a few dozen disks at the
problem than to have another moving part in an already complex system.

From this point of view, an investment of $0.40 in compression
hardware/software that makes your storage 30% more space efficient,
but increases the management overhead by 20% (for example because it
reduces reliability), may be very foolish:

Before:
    $1.00 for storage system
    $1.00 for energy/cooling/floorspace
   $10.00 for management
   => $12.00 lifetime cost

After:
    $0.70 for storage system
    $0.70 for energy/cooling/floorspace
    $0.40 for compression system
   $12.00 for management
   => $13.80 lifetime cost

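If you want to plug in your own numbers, the arithmetic in the table
is trivial to script. The function names below are mine; the dollar
figures are the ones from the example, all per $1 of uncompressed
storage hardware.

```python
def lifetime_cost(storage, energy, management, compression=0.0):
    """Sum the cost components used in the before/after comparison."""
    return storage + energy + management + compression


def break_even_management(before_total, storage, energy, compression):
    """Largest management cost at which the compressed setup is no
    more expensive than the uncompressed one."""
    return before_total - (storage + energy + compression)


before = lifetime_cost(storage=1.00, energy=1.00, management=10.00)
after = lifetime_cost(storage=0.70, energy=0.70, management=12.00,
                      compression=0.40)
limit = break_even_management(before, storage=0.70, energy=0.70,
                              compression=0.40)

print(f"before: ${before:.2f}")
print(f"after:  ${after:.2f}")
print(f"break-even management cost: ${limit:.2f}")
```

With these figures the break-even management cost comes out at $10.20,
so the compression box pays off only if it raises management cost by
less than about 2%.
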
--
The address in the header is invalid for obvious reasons. Please
reconstruct the address from the information below (look for _).
Ralph Becker-Szendy _firstname_@lr_dot_los-gatos_dot_ca.us