Data Storage - Long term archival storage


Author Long term archival storage
dgm

2005-03-29, 2:46 am

I have around 20 TB of data that I want to store for a very long time
(>50 years) and also have available for search and download.

The data consists of two types:

(a) the preservation masters, which are the data we want to keep and are
in tiff and bwf formats among others

(b) the viewing copies, which are in derived formats such as png and
mp3.

I am coming down to an HSM type of solution with a large enough front
end cache to allow us to keep the viewing copies online at all times
but which allows the archival copies to disappear off to tape to be
cloned and duplicated etc.

Anyone else doing this? Anyone got a better idea?

-dgm
Paul Rubin

2005-03-29, 2:46 am

doug.moncur@gmail.com (dgm) writes:
> I have around 20 Tb of data that I (a) want to store for a very (>50
> years)long time and also have available for search and download...
> (a) the preservation masters which is the data we want to keep and is
> in tiff and bwf formats among others
>
> (b) the viewing copies which are in derived formats such as png and mp3.


You have a bunch of images and audio recordings. If the images are
scanned paper docs, the png files won't be substantially smaller than
the tiff files. If they're photographs, you probably want to view jpg
rather than png. In this case the mp3 and jpg files will probably be
less than 1/10th the size of the originals, or about 2 TB total, not
much at all. A small RAID disk system can hold that much easily, on
say six 400GB drives.
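
A rough check of that estimate, as a minimal Python sketch. The 1/10
ratio is the upper bound quoted above; the RAID 5 layout (one drive's
worth of parity) is an assumption, since no RAID level is named, and
decimal units (1 TB = 1000 GB) are assumed throughout.

```python
# Back-of-envelope sizing for the viewing copies (decimal units).
GB_PER_TB = 1000

masters_gb = 20 * GB_PER_TB        # total archive size from the original post
viewing_gb = masters_gb // 10      # assumed bound: derived copies <= 1/10 of masters

drives = 6
drive_gb = 400
raw_gb = drives * drive_gb             # raw capacity, no redundancy
raid5_gb = (drives - 1) * drive_gb     # RAID 5 assumed: one drive's worth of parity

print(f"viewing copies: {viewing_gb} GB")
print(f"raw capacity:   {raw_gb} GB")
print(f"RAID 5 usable:  {raid5_gb} GB")
```

Under those assumptions the viewing copies come to about 2000 GB, which
equals the usable space of six 400GB drives in RAID 5 with no headroom,
so a real build would probably want a drive or two more.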

> I am coming down to an HSM type of solution with a large enough front
> end cache to allow us to keep the viewing copies online at all times
> but which allows the archival copies to disappear off to tape to be
> cloned and duplicated etc.


Sounds kind of complicated. Where's this data now, how is it stored,
and how fast are you adding to it and through what kind of system? 20
TB isn't really big storage these days. You could have a small tape
library online and move incoming raw data to tape immediately while
also making the online viewing copies on disk. HSM systems with
automatic migration and retrieval are probably overkill.
boatgeek

2005-03-29, 6:09 pm

Depends. A lot of companies have regulatory requirements that data be
kept for a long time. If it needs to be accessible quickly, the ATA
drive solution is advisable. I think the cost is around $7 per GB for
stuff like NetApp NearStore solutions. The particular product which
would simulate a WORM drive is called SnapLock.

RPR

2005-03-29, 6:09 pm

More than 50 years is a long time by archival standards. Most media/drive
manufacturers guarantee 30 years under nominal conditions. You'll have
to store your masters under optimal conditions and check them regularly
(read them every couple of years, check error rate, copy if needed
etc.) to go beyond that. Which raises the question where you'll find
the equipment (spares, service) a couple decades down the road. Example
from my perspective: If you have DLTtape III tapes lying around from
the early nineties, you'd better do something about them now since
Quantum EOL'd the DLT8000, which is the last generation that will read
those tapes. Service will be available for another 5 years. That's 20
of the 50 years.
Basically you'll have to copy the data to state of the art media about
every decade. Don't try to store spare drives for the future, that
doesn't usually work - electromechanical devices age when they're not
in use too.
There have been numerous stories about the problems NASA has retrieving
old data recordings. Your project will face the same. Fortunately 20 TB
isn't a big deal any more and will be less so in the future. The front
end doesn't really matter, but the archive will need a lot of thought
and care. Think what the state of the art was 50 years ago.
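
The "read them every couple of years" step can be partly automated at
the file level. A minimal sketch in Python, assuming the masters are
visible as ordinary files under one directory; the function names and
the manifest layout are hypothetical, not taken from any particular
archive product. It detects missing or altered files but does not
measure the drive's raw error rate, which needs the drive's own
diagnostics.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large masters need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def audit(root, manifest):
    """Return (missing, changed) relative paths compared with the manifest."""
    root = Path(root)
    missing, changed = [], []
    for rel, expected in manifest.items():
        target = root / rel
        if not target.is_file():
            missing.append(rel)
        elif sha256_of(target) != expected:
            changed.append(rel)
    return missing, changed
```

The manifest would be built once when the masters are written and kept
separately from them; each later audit reads every file back and flags
anything that needs recopying from another copy.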

Rob Turk

2005-03-29, 6:09 pm

"RPR" <rohbeck@yahoo.com> wrote in message
news:1112123184.224224.230810@g14g2000cwa.googlegroups.com...
> Basically you'll have to copy the data to state of the art media about
> every decade. Don't try to store spare drives for the future, that
> doesn't usually work - electromechanical devices age when they're not
> in use too.


In addition to Ralf-Peter's comment, you'd better think long and hard about
how you will be accessing that data 50 years from now, from an application
point of view. 50 years from now, the computing devices will be radically
different from today's PCs. Unless you have documented every bit about the
format of the files you stored and the environment you need to recreate the
information, even migration to state of the art media will not help.

Consider a Word Perfect 4.2 file from 20 years ago. You'll need some effort
today to open and read such a file. Because the format is relatively simple,
you can still read the text using any hex editor. But recreating the page
formatting may be harder already.

Now consider your MP3 and picture files, which are heavily encoded and
compressed, and fast forward to the year 2055. Unless you know exactly how
they are recreated, all you'll have 50 years from now is a bunch of zeroes
and ones. This is scary for single files, but things are even worse when
multiple files form a single context. Think databases with external
pointers. Think HTML files with web links. How much of that will exist 50
years from now?

For permanent long-term records, store the information on a medium that can
be interpreted by the most universal and long-term computer you have - the
one between your ears. Microfiche and dead trees aren't obsolete just
yet...

Rob


Faeandar

2005-03-29, 8:46 pm

On 28 Mar 2005 19:49:21 -0800, doug.moncur@gmail.com (dgm) wrote:

>I have around 20 Tb of data that I (a) want to store for a very (>50
>years)long time and also have available for search and download.
>
>The data consists of two types:
>
>(a) the preservation masters which is the data we want to keep and is
>in tiff
> and bwf formats among others
>
>(b) the viewing copies which are in derived formats such as png and
>mp3.
>
>I am coming down to an HSM type of solution with a large enough front
>end cache to allow us to keep the viewing copies online at all times
>but which allows the archival copies to disappear off to tape to be
>cloned and duplicated etc.
>
>Anyone else doing this? Anyone got a better idea?
>
>-dgm



Other replies have made several good points. Here's what we did at a
former employer.

All archived data was stored on NetApp NearStore (any cheap disk will
do though). No ifs, ands, or buts. The reason is that whenever the
next disk upgrade comes in, the data is migrated along with it. There is
no issue of recovery or media type not being available; the data set
follows the technology.
Disks were more expensive than tape (and may still be) but the
guarantee of being able to at least access the data was worth it. As
someone pointed out, you still have to deal with the application to
read it but that can be tested along the way much easier if it's on
disk. Heck, you could even package the app with the data; that's what
we did.

And as technology progresses, no matter what the main media storage
type, there will always be migration techniques. Any vendor wanting
you to migrate from your 15PB 4 billion k magnetic drives to their
solid light storage will provide a migration path, guaranteed.

The data can be backed up to tape for DR as you see fit. We sent
copies offsite just for "smoking hole" purposes but mostly they were
rotated weekly.

As a proof of concept we did a forklift upgrade from the R100 to the
R200. Just roll in the new and roll out the old. Went from 144GB
drives to 266GB drives so existing data set took up about half of what
it did. This will always be the case.
The migration went fine and data that is now 8 years old is still
spinning away on new disk with their applications. Now whether or not
anyone knows how to work the app is another issue...

~F
Curious George

2005-03-30, 2:45 am

On Tue, 29 Mar 2005 23:36:05 +0200, "Rob Turk"
<_wipe_me_r.turk@chello.nl> wrote:

>"RPR" <rohbeck@yahoo.com> wrote in message
>news:1112123184.224224.230810@g14g2000cwa.googlegroups.com...
>
>In addition to Ralf-Peter's comment, you better think long and hard about
>how you will be accessing that data 50 years from now, from an application
>point of view. 50 years from now, the computing devices will be radically
>different from today's PC's. Unless you have documented every bit about the
>format of the files you stored and the environment you need to recreate the
>information, even migration to state of the art media will not help.
>
>Consider a Word Perfect 4.2 file from 20 years ago. You'll need some effort
>today to open and read such a file. Because the format is relatively simple,
>you can still read the text using any hex editor. But recreating the page
>formatting may be harder already.


Ok so a lot of converters do an incomplete job, but is this really so
complicated? Save a copy of the application(s) and maybe the OS that
ran it with the data. Between backwards compatibility and improving
emulation technology it might be more doable than you think.

Also keeping data for 50 years doesn't necessarily imply keeping
storage devices for 50 years. Periodic upgrades of the storage and
maybe even the file format of the data might be what needs to happen
to realistically keep useable information for many decades. A major
overhaul like this around every 10 years seems to be working for me
pretty well. Waiting 15 years or more tends to be problematic.

Your mileage may vary and, well, the past is not always a good
indicator of the future.
Paul Rubin

2005-03-30, 2:45 am

Faeandar <mr_castalot@yahoo.com> writes:
> All archived data was stored on NetApp Nearstore (any cheap disk will
> do though). No if's, and's, or but's. Reason being is whenever the
> next disk upgrade comes in the data is migrated along with it. no
> issue of recovery or media type not being available, the data set
> follows the technology.


You seriously think you'll still be using that Netapp stuff in 2055?
Paul Rubin

2005-03-30, 2:45 am

Curious George <cg@email.net> writes:
>
> Ok so a lot of converters do an incomplete job, but is this really so
> complicated? Save a copy of the application(s) and maybe the OS that
> ran it with the data. Between backwards compatibility and improving
> emulation technology it might be more doable than you think.


I would say that most of these conversion problems have stemmed from
secret, undocumented formats. Formats like jpg and mp3, which are well
documented and have reference implementations available as free source
code, should be pretty well immune to the problems.
HVB

2005-03-30, 7:46 am

On 29 Mar 2005 20:26:17 -0800, Paul Rubin
<http://phr.cx@NOSPAM.invalid> wrote:

>Faeandar <mr_castalot@yahoo.com> writes:
>
>You seriously think you'll still be using that Netapp stuff in 2055?


It doesn't matter... he's saying that ANY vendor that wants to sell
you some storage is going to provide a migration facility to do this.
For example... EMC will happily migrate data off NetApp devices into
Clariion or Symmetrix today.

If they didn't do this, it would make it really hard to convince
heavily entrenched users to move.

We all know that disk-based storage will need to be migrated every 5
years or so. As we know we're going to migrate, we're also sure that
it's going to be possible to do this, rather than waiting for the
technology to fall over before doing something about it.

So in 2055 it could be NetAPP, EMC or ACME Storage Corp... it doesn't
matter.

HVB.
_firstname_@lr_dot_los-gatos_dot_ca.us

2005-03-30, 5:47 pm

Imagine that today (2004) you would need to read 20-year old data.
Say it is the content of a hierarchical database (not a relational
database). The source code of the database still exists, but it is
written in IBM 360 assembly, and only runs under OS/VSE, being run 20
years ago on a 3081 under VM. The last guy who maintained it died of
cancer 10 years back; his widow threw out his files.

Or the data was written 20 years ago with a cp/m machine, in binary
format using Borland dBase. Say for grins the cp/m program was doing
data acquisition on custom-built hardware (this was very common back
then), and requires special hardware interfaces to external sensors
and actuators to run.

In the former case, you have to deal with a huge problem: The data is
probably not in 512-byte blocks, but is written in IBM CKD
(count-key-data) format, on special disks (probably 3350 or 3380); a
sensible database application on the 370 would use CKD search
instructions for performance. Fortunately, IBM will today still sell
you a mainframe that is instruction-set compatible with the 360, and
disk arrays that can still execute CKD searches. And you can still
get an OS that somewhat resembles OS/VSE and VM. So for the next few
years, a few million $ and several months of hard work would recover
the information.

Or you could read 50000 lines of IBM assembly code to determine what
the exact data format really is, and write a converter. Enjoy.

The second case is even worse. Most likely, the old cp/m hardware
(even if you have managed to preserve it) will no longer boot, because
the EPROMs and boot floppies have decayed. You can no longer buy a
legal copy of cp/m or dbase. Running an illegal copy on a cp/m
emulator on a modern computer won't work, because the program requires
custom-built hardware sensors and actuators (I carefully constructed
the problem to maximally inconvenience you). Finding dbase manuals
today to decode what the dbase code was doing and understand the data
format will be very time consuming.

What I'm trying to say: The problem of preserving the bit pattern of
the raw data is the absolute least of the issues. It can be solved
trivially: Write the data to good-quality rewriteable CDs, make a few
copies of each, and every few years read all of them back, and write
them to newly current media. Done. The real problem is documenting
the semantics of the bits. The easy way out is to preserve the
complete computing environment used to read the data (including all
hardware, documentation, and wetware that is required to operate it).
This is hard, because hardware, paperware and wetware don't preserve
very well. The second-best way is to convert the data to formats that
are well-documented (say plain text files, or formats that are
enshrined in standards, like ISO9660 or JPG), and also preserve a
human-readable description of that format in a universally readable
way (like enclose a copy of the standard that defines ISO9660 in an
ASCII text file with the data).
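
The "make a few copies and read them all back" step might look like the
following Python sketch. It assumes each copy of a file is reachable as
a path; the name pick_good_copy and the strict-majority rule are
illustrative choices, not something prescribed above.

```python
import hashlib
from collections import Counter


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def pick_good_copy(paths):
    """Return a path whose content matches a strict majority of the
    readable copies, or None if the copies cannot settle it."""
    digests = {}
    for path in paths:
        try:
            digests[path] = sha256_of(path)
        except OSError:
            continue  # unreadable or missing copy: treat it as lost
    if not digests:
        return None
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes * 2 <= len(digests):
        return None  # no strict majority, so a human has to look at it
    return next(p for p, d in digests.items() if d == winner)
```

The copy it returns is the one to write to the newly current media;
a None result means the surviving copies disagree and none can be
trusted automatically.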

I'm not saying that preserving the raw bits should be abandoned. This
is absolutely the most important step; if you fail at that, all other
problems become irrelevant. But please don't believe that it solves
the problem.

The long-term preservation of data is a huge research topic. Please
read the abundant literature on it, to get a flavor of the difficulty.
The real issue you need to think about is this: How valuable is this
data really? How valuable will it be in 20 years? What is the
expected cost of recovering it in 20 years (above I budgeted M$ for
buying the hardware for reading CKD data)? How much do you need to
invest now to minimize the expected data recovery cost in 20 years?
Is your CEO cool with investing this much money now, given that in 20
years he will no longer be the CEO? Will it be economically viable to
use the old data in 20 years? Wouldn't it be easier to print it all
on acid-free paper now, store it in a mine or an old railroad tunnel,
and scan the printout in 20 years?

As an example: I used to be an astrophysicist. I happen to have the
original data tape of the 8 neutrinos from the 1987 supernova that hit
the particle detector in the northern US at home. The tape is 6250
bpi open reel, with a highly complex data format on it; fortunately,
the data format was described on paper in people's PhD theses, but
finding the old decoding software and getting it to run would be very
hard (anyone got a VAX with VMS 4?). Reading it and decoding the data
would take several months of work. At this point, the tape has only
emotional value.

--
The address in the header is invalid for obvious reasons. Please
reconstruct the address from the information below (look for _).
Ralph Becker-Szendy _firstname_@lr_dot_los-gatos_dot_ca.us
Faeandar

2005-03-30, 5:47 pm

On Wed, 30 Mar 2005 18:12:54 -0000,
_firstname_@lr_dot_los-gatos_dot_ca.us wrote:

>Imagine that today (2004) you would need to read 20-year old data.
>Say it is the content of a hierarchical database (not a relational
>database). The source code of the database still exists, but it is
>written in IBM 360 assembly, and only runs under OS/VSE, being run 20
>years ago on a 3081 under VM. The last guy who maintained it died of
>cancer 10 years back; his widow threw out his files.
>
>Or the data was written 20 years ago with a cp/m machine, in binary
>format using Borland dBase. Say for grins the cp/m program was doing
>data acquisition on custom-built hardware (this was very common back
>then), and requires special hardware interfaces to external sensors
>and actuators to run.
>
>In the former case, you have to deal with a huge problem: The data is
>probably not in 512-byte blocks, but is written in IBM CKD
>(count-key-data) format, on special disks (probably 3350 or 3380); a
>sensible database application on the 370 would use CKD search
>instructions for performance. Fortunately, IBM will today still sell
>you a mainframe that is instruction-set compatible with the 360, and
>disk arrays that can still execute CKD searches. And you can still
>get an OS that somewhat resembles OS/VSE and VM. So for the next few
>years, a few million $ and several months of hard work would recover
>the information.


Well, for the most part we're discussing open systems so this would
not be an issue I think. These data sets all follow a standard, the
biggest problem I see is the application to read them as well as
something they can run on. Some apps that we packaged with the data
can only run on (get this) NT4 SP3. Nothing higher. We did not
package an OS with the data so this could be a serious problem in
about 5 more years. Right now it can still be acquired with minimal
effort.

>
>Or you could read 50000 lines of IBM assembly code to determine what
>the exact data format really is, and write a converter. Enjoy.
>
>The second case is even worse. Most likely, the old cp/m hardware
>(even if you have managed to preserve it) will no longer boot, because
>the EPROMs and boot floppies have decayed. You can no longer buy a
>legal copy of cp/m or dbase. Running an illegal copy on a cp/m
>emulator on a modern computer won't work, because the program requires
>custom-built hardware sensors and actuators (I carefully constructed
>the problem to maximally inconvenience you). Finding dbase manuals
>today to decode what the dbase code was doing and understand the data
>format will be very time consuming.
>
>What I'm trying to say: The problem of preserving the bit pattern of
>the raw data is the absolute least of the issues. It can be solved
>trivially: Write the data to good-quality rewriteable CDs, make a few
>copies of each, and every few years read all of them back, and write
>them to newly current media. Done. The real problem is documenting
>the semantics of the bits.


I'm not sure I understand what you're saying here; "the semantics of
the bits"? If you can elaborate a little I would appreciate it.
As to your trivial solution, good luck. If it were that trivial
everyone would be doing it. The problem is people: no one wants to
recall 4000 CDs from cold storage so they can convert them to DVD or
Blu-ray or whatever. Of course this is a procedural problem, not a
technical one, but you have to plan for those as well.
And like you said, if someone can't preserve data for that long how in
hell are they going to preserve the entire environment? Some people
do, but those are rare.

After thinking about the semantics a bit (pun intended) I finally got
it. Forget I asked for the elaboration. And as I said initially,
package the app with it. As long as you can get the app to run you
can access the data. I think we're saying the same thing but my way
is A LOT easier IMO. And cheaper in the long run too. Tried to get
support for a System 34 lately?

>The easy way out is to preserve the
>complete computing environment used to read the data (including all
>hardware, documentation, and wetware that is required to operate it).
>This is hard, because hardware, paperware and wetware don't preserve
>very well.


So is it easy or hard?

>
>I'm not saying that preserving the raw bits should be abandoned. This
>is absolutely the most important step; if you fail at that, all other
>problems become irrelevant. But please don't believe that it solves
>the problem.


Keep the app with the data. That should solve 90% of the potential
problems, minus things like OS or hardware platform.

>
>The long-term preservation of data is a huge research topic. Please
>read the abundant literature on it, to get a flavor of the difficulty.
>The real issue you need to think about is this: How valuable is this
>data really? How valuable will it be in 20 years? What is the
>expected cost of recovering it in 20 years (above I budgeted M$ for
>buying the hardware for reading CKD data)? How much do you need to
>invest now to minimize the expected data recovery cost in 20 years?
>Is your CEO cool with investing this much money now, given that in 20
>years he will no longer be the CEO? Will it be economically viable to
>use the old data in 20 years? Wouldn't it be easier to print it all
>on acid-free paper now, store it in a mine or an old railroad tunnel,
>and scan the printout in 20 years?


Most data loses any real value after 10 years. Some much sooner. The
most common value for data 10+ years is patent defense. And we've had
to pull data from 13 years ago for just that so it's a reality not
just a possibility. It cost a bundle but the alternative would have
cost infinitely more.

>
>As an example: I used to be an astrophysicist. I happen to have the
>original data tape of the 8 neutrinos from the 1987 supernova that hit
>the particle detector in the northern US at home. The tape is 6250
>bpi open reel, with a highly complex data format on it; fortunately,
>the data format was described on paper in people's PhD theses, but
>finding the old decoding software and getting it to run would be very
>hard (anyone got a VAX with VMS 4?). Reading it and decoding the data
>would take several months of work. At this point, the tape has only
>emotional value.


This is why I say keep it on disk and migrate it with the rest of your
data. It's extremely hard to find a Kennedy reel as well, but we have
data on those tapes too. Now if it were on disk with the application
that wrote it, I might have a 40% chance of getting to it rather than
a .05% chance, hardware and OS availability being key at that point.

~F
boatgeek

2005-03-30, 8:45 pm

Check out this article.
http://www.infoconomy.com/pages/sto...group101451.adp

Basically: keep digital information for present term access, understanding
that it will need a migration of the back end storage platform and a
translation of the front end software and data into whatever is the
current technical lingua franca.

For long term storage and DR, use microfilm rated at 250 years.

There was a law enacted in the UK and US for long term census
records.

So, there is your answer, everyone is right.

Wanna get really cool? Ideas are floating for laser lithographs on
ceramic disks at the microscopic level for storage which would last
thousands of years. I personally like that, but I'm a geek with an
interest in history.

Al Dykes

2005-03-31, 2:45 am

In article <1112237441.536966.23640@o13g2000cwo.googlegroups.com>,
boatgeek <dougvibbert@yahoo.com> wrote:
>Check out this article.
>http://www.infoconomy.com/pages/sto...group101451.adp
>
>basically digital information for present term access, understanding
>that it will need a migration of the back end storage platform and a
>translation of the front end software and data into whatever is the
>current technical lingua franca..
>
>For long term storage and DR, using microfilm rated at 250 years.
>
>The was a law enacted by congress of UK and US for long term census
>records.
>
>So, there is your answer, everyone is right.
>
>Wanna get really cool, ideas are floating for laser lithographs on
>ceramic disks at the microscopic level for storage which would last
>thousands of years. I personally like that, but I'm a geek with an
>interest in history.
>



microscopic laser pits on nickel sheets.

But seriously, folks....


Emulation and virtual machines are going to be the salvation for
recovering ancient applications and data. As it has been pointed out,
having a database dump does you no good unless you can run the
application. Now it can be done in emulation. Current computers are
sooo much faster than the machines we emulate that performance can be
decent even if you have to emulate the entire instruction set.

There will always be some service shop that will read your old media
(if it's readable) and burn it onto a CD-R (or whatever the media is
years from now), and as long as you have the OS, application and data
you'll be good to go.


Take a look at this for a list of emulators for machines that haven't
existed outside museums for decades.

http://simh.trailing-edge.com/


I've owned (as a corporate manager) several of the machines on this
list and played with an emulator. A machine that cost close to a
million bucks in 1978 and sucked about 30kW runs slower than its
emulator does on my PC, at least for a single user.

The PC Emulator of IBM370 (http://www.conmicro.cx/hercules/) was used
big-time by corporations in 1999 for testing Y2K conversions.




--
a d y k e s @ p a n i x . c o m

Don't blame me. I voted for Gore.
dgm

2005-04-03, 8:46 pm


Paul Rubin wrote:
[...]

> Sounds kind of complicated. Where's this data now, how is it stored,
> and how fast are you adding to it and through what kind of system? 20
> TB isn't really big storage these days. You could have a small tape
> library online and move incoming raw data to tape immediately while
> also making the online viewing copies on disk. HSM systems with
> automatic migration and retrieval are probably overkill.


It is kind of complicated. Currently we have 6 TB digitised and are
adding 0.1 TB/week.
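
For scale, a quick projection in Python. It assumes the 20 TB from the
first post is the eventual total and that the 0.1 TB/week rate stays
constant; neither is stated outright, and video is not counted.

```python
# Work in whole gigabytes (decimal units) to keep the arithmetic exact.
digitised_gb = 6_000       # digitised so far
target_gb = 20_000         # assumed eventual total, from the first post
rate_gb_per_week = 100     # 0.1 TB per week

weeks_left = (target_gb - digitised_gb) // rate_gb_per_week
years_left = weeks_left / 52
growth_gb_per_year = rate_gb_per_week * 52

print(f"weeks to reach target: {weeks_left}")
print(f"years to reach target: {years_left:.1f}")
print(f"growth per year:       {growth_gb_per_year} GB")
```

That is 140 weeks, a little under three years, at roughly 5.2 TB per
year.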

Now this is data that needs to be kept for ever - the
audio stuff is world heritage stuff. The driver for using HSM is
twofold:

1) keeping multiple copies securely, including offsite
2) we know we have a 900kg gorilla called video waiting in the wings
...

dgm

2005-04-03, 8:46 pm

The point about file formats is well made, but we've been through the
same argument in detail already. We're choosing file formats which are
publicly described and for which there are multiple (open source)
clients.

The idea is to be able to ensure that we have the format description
and enough example code to be able to recreate viewers in the future.
That's why we're using tiff and bwf as the archival masters. I don't
care about the mp3s as they are *derived* copies - we can as easily
use ogg vorbis, or whatever we're using in 2055, as long as we can parse
the original compression-free datastream.
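
To illustrate why an openly described, uncompressed format is
recoverable: a BWF file is a RIFF/WAVE container, and its chunk layout
can be walked with a few lines of Python. This is a minimal sketch, not
a full BWF reader. It assumes the whole file fits in memory, does not
handle RF64, and the 'bext' payload in the demo is a placeholder rather
than a valid broadcast extension chunk. All function names are
hypothetical.

```python
import struct


def riff_chunks(data):
    """Yield (chunk_id, payload) for each top-level chunk of a RIFF/WAVE byte string."""
    if len(data) < 12 or data[0:4] != b"RIFF" or data[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    pos = 12
    while pos + 8 <= len(data):
        chunk_id, size = struct.unpack_from("<4sI", data, pos)
        yield chunk_id, data[pos + 8 : pos + 8 + size]
        pos += 8 + size + (size & 1)  # chunks are padded to an even length


def pcm_format(fmt_payload):
    """Decode the first 16 bytes of a 'fmt ' chunk."""
    tag, channels, rate, _byte_rate, _align, bits = struct.unpack_from(
        "<HHIIHH", fmt_payload
    )
    return {
        "format_tag": tag,          # 1 means uncompressed PCM
        "channels": channels,
        "sample_rate": rate,
        "bits_per_sample": bits,
    }


def _chunk(chunk_id, payload):
    """Serialise one chunk, adding the pad byte for odd-length payloads."""
    pad = b"\x00" if len(payload) & 1 else b""
    return chunk_id + struct.pack("<I", len(payload)) + payload + pad


def make_demo_wave():
    """Build a tiny in-memory WAVE file: placeholder bext, fmt, and 8 data bytes."""
    fmt = struct.pack("<HHIIHH", 1, 2, 48000, 48000 * 4, 4, 16)
    body = (
        b"WAVE"
        + _chunk(b"bext", b"xyz")   # placeholder, not a real bext chunk
        + _chunk(b"fmt ", fmt)
        + _chunk(b"data", bytes(8))
    )
    return b"RIFF" + struct.pack("<I", len(body)) + body


if __name__ == "__main__":
    for chunk_id, payload in riff_chunks(make_demo_wave()):
        print(chunk_id, len(payload))
        if chunk_id == b"fmt ":
            print(pcm_format(payload))
```

Once the 'fmt ' chunk is decoded, the 'data' chunk is just interleaved
PCM samples, which is what makes re-deriving viewing copies in some
future codec straightforward.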
