Unix Programming - how an editor should write a file the right way


Ulrich Eckhardt

2006-05-07, 7:15 am

Greetings!

I'm currently musing about the right way how e.g. a text editor should
write a file and properly handle all possible error conditions. What I'm
also concerned about is that e.g. permissions/ownership are maintained
properly.

I think the typical advice goes like this:

1. write to some temp file next to the original
2. replace the original with the temporary

but this conflicts with the requirement that permissions/ownership are
preserved. What I don't know is how to copy over those attributes from the
original. Things like the Unix file permissions are rather easy, but
what about the increasing number and variety of ACLs? Handling all of
those would be pretty hard, so I propose a different solution:

1. make a backup of the target
2. replace the content of the target
3. remove the backup

When replacing the content fails, you could still try to restore the
initial content and failing at that, you could at least alert the user
that the operation failed and that the content was backed up to a certain
place.
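
The backup-then-overwrite sequence proposed above could be sketched roughly as follows. This is only an illustration in Python, not code from the thread: the function name save_with_backup and the "~" backup suffix are assumptions. It relies on overwriting the file in place, so the inode (and with it ownership, mode bits and any ACLs) is left alone.

```python
import os
import shutil
import tempfile


def save_with_backup(path, data, backup_suffix="~"):
    """Overwrite `path` in place, keeping a backup until the write succeeds.

    On failure, tries to restore the old content and re-raises; the backup
    file is left behind so the caller can tell the user where it is.
    """
    backup = path + backup_suffix          # hypothetical naming scheme
    shutil.copy2(path, backup)             # step 1: back up the target
    try:
        # Step 2: replace the content without creating a new inode.
        with open(path, "r+b") as f:
            f.write(data)
            f.truncate()                   # drop any leftover tail of the old data
            f.flush()
            os.fsync(f.fileno())
    except OSError:
        try:
            shutil.copyfile(backup, path)  # try to restore the old content
        except OSError:
            pass                           # restore failed; backup still exists
        raise
    os.remove(backup)                      # step 3: remove the backup
```

A caller would catch OSError from save_with_backup() and report the backup path (path + "~" in this sketch) to the user.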

Is my reasoning sound? Are there other ways that are preferred? Any
comments?

Uli



--
http://www.erlenstar.demon.co.uk/unix/
Barry Margolin

2006-05-07, 7:15 am

In article <4c5i5gF12s4q9U1@uni-berlin.de>,
Ulrich Eckhardt <doomster@knuut.de> wrote:

> Greetings!
>
> I'm currently musing about the right way how e.g. a text editor should
> write a file and properly handle all possible error conditions. What I'm
> also concerned about is that e.g. permissions/ownership are maintained
> properly.
>
> I think the typical advice goes like this:
>
> 1. write to some temp file next to the original
> 2. replace the original with the temporary


This is what GNU Emacs does, unless one of the backup-by-copying*
variables applies.

>
> but this conflicts with the requirement that permissions/ownership are
> preserved. What I don't know is how to copy over those attributes from the
> original. Things like the Unix file permissions are rather easy, but
> what about the increasing number and variety of ACLs? Handling all of
> those would be pretty hard, so I propose a different solution:
>
> 1. make a backup of the target
> 2. replace the content of the target
> 3. remove the backup
>
> When replacing the content fails, you could still try to restore the
> initial content and failing at that, you could at least alert the user
> that the operation failed and that the content was backed up to a certain
> place.
>
> Is my reasoning sound? Are there other ways that are preferred? Any
> comments?


Most editors simply overwrite the file without trying to be failsafe.
Some of them also make snapshots of the editing buffer somewhere, so
that if the editor crashes you can recover what you were doing. But
they don't generally save the original file anywhere.

--
Barry Margolin, barmar@alum.mit.edu
Arlington, MA
*** PLEASE post questions in newsgroups, not directly to me ***
*** PLEASE don't copy me on replies, I'll read them in the group ***
Nils O. Selåsdal

2006-05-07, 7:15 am

Ulrich Eckhardt wrote:
> Greetings!
>
> I'm currently musing about the right way how e.g. a text editor should
> write a file and properly handle all possible error conditions. What I'm
> also concerned about is that e.g. permissions/ownership are maintained
> properly.
>
> I think the typical advice goes like this:
>
> 1. write to some temp file next to the original
> 2. replace the original with the temporary
>
> but this conflicts with the requirement that permissions/ownership are
> preserved. What I don't know is how to copy over those attributes from the
> original. Things like the Unix file permissions are rather easy, but
> what about the increasing number and variety of ACLs? Handling all of
> those would be pretty hard, so I propose a different solution:
>
> 1. make a backup of the target
> 2. replace the content of the target
> 3. remove the backup
>
> When replacing the content fails, you could still try to restore the
> initial content and failing at that, you could at least alert the user
> that the operation failed and that the content was backed up to a certain
> place.
>
> Is my reasoning sound? Are there other ways that are preferred? Any
> comments?


Read http://www.eelab.usyd.edu.au/doc/sam.pdf; it discusses some
implementation details of R. Pike's sam text editor on Plan 9.

Pascal Bourguignon

2006-05-07, 1:15 pm

Ulrich Eckhardt <doomster@knuut.de> writes:
> [...]
> 1. make a backup of the target
> 2. replace the content of the target
> 3. remove the backup
>
> When replacing the content fails, you could still try to restore the
> initial content and failing at that, you could at least alert the user
> that the operation failed and that the content was backed up to a certain
> place.
>
> Is my reasoning sound?


Indeed. That's why in scripts I write:

cp file file~ && some_filter < file~ > file && rm file~

instead of:

some_filter < file > file.new && mv file.new file


> Are there other ways that are preferred? Any comments?


--
__Pascal Bourguignon__ http://www.informatimago.com/
Litter box not here.
You must have moved it again.
I'll poop in the sink.
Michael Paoli

2006-05-07, 7:14 pm

Ulrich Eckhardt wrote:
> I'm currently musing about the right way how e.g. a text editor should
> write a file and properly handle all possible error conditions. What I'm
> also concerned about is that e.g. permissions/ownership are maintained
> properly.
> I think the typical advice goes like this:
> 1. write to some temp file next to the original
> 2. replace the original with the temporary
> but this conflicts with the requirement that permissions/ownership are
> preserved. What I don't know is how to copy over those attributes from the
> original. Things like the Unix file permissions are rather easy, but
> what about the increasing number and variety of ACLs? Handling all of
> those would be pretty hard, so I propose a different solution:
> 1. make a backup of the target
> 2. replace the content of the target
> 3. remove the backup
> When replacing the content fails, you could still try to restore the
> initial content and failing at that, you could at least alert the user
> that the operation failed and that the content was backed up to a certain
> place.
> Is my reasoning sound? Are there other ways that are preferred? Any
> comments?


Here's my take on it:
o The temporary(/ies) should be written in temporary location(s), not
  "next to" the original.[1] If TMPDIR is set in the environment and
  that's a writable directory, use TMPDIR, otherwise use /var/tmp for
  general contents which should be non-volatile and /tmp for volatile
  contents. Be sure to properly and securely handle temporary
  files to avoid race conditions and security problems.
  Rationale:
  o [2]
  o [3]
  o Also general best practice - e.g. if the program gets abnormally
    terminated, the cruft (temporary files) are left in temporary
    and/or well known location(s), rather than scattered about the
    filesystems - much easier to maintain, cleanup, etc. that way, and
    in many cases better security and/or performance is obtained by
    writing files to more suitable temporary location (e.g. /var/tmp
    may be on a RAID 0+1 filesystem and /tmp may be a tmpfs
    filesystem from virtual memory, whereas the file being edited may
    reside on a much lower performance RAID-5 filesystem).
o The editor should suitably track its state such that it can recover
  at least semi-reasonably from crashes or other abnormal
  terminations. This may be done via a common directory, e.g.
  /var/tmp/vi.recover, or per HOME directory file or directory, e.g.
  ~/.editor.d
o When it's time to write the file, directly overwrite it.[4] If one
  doesn't take that approach, other problems may result, e.g. (hard)
  link relationships lost, open file descriptors are no longer open
  on the same file, one may not be able to set the same ownership(s)
  on the file, if space is tight and/or the file large, there may be
  insufficient space to have both old and new versions of the file on
  the same filesystem at the same time, etc.[1][2]
o Handle in a reasonable manner any and all errors on a write attempt
  that fails.[2][3]
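
As an illustration of the first point (temporaries in a temporary location, created securely), here is a minimal Python sketch. The function name open_scratch_file and the ".swp" suffix are made up for the example. It leans on the standard tempfile module: gettempdir() consults TMPDIR before falling back to defaults such as /tmp and /var/tmp, and mkstemp() creates the file exclusively with a unique name, accessible only to the creating user.

```python
import os
import tempfile


def open_scratch_file(edited_path):
    """Create a private scratch file for one editing session.

    Returns (fd, scratch_path). The name only hints at the edited file;
    uniqueness comes from mkstemp(), not from the basename.
    """
    prefix = os.path.basename(edited_path) + "."
    fd, scratch = tempfile.mkstemp(prefix=prefix, suffix=".swp",
                                   dir=tempfile.gettempdir())
    return fd, scratch
```

Two sessions editing different files that share a basename (say, two users each editing their own notes.txt) get distinct scratch files, which is the "naming conflict" case that comes up later in the thread.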

references/footnotes:
1. Default emacs(1) behavior violates this and [2]
2. http://en.wikipedia.org/wiki/Princi...st_astonishment
3. http://www.rawbw.com/~mp/unix/sh/#G...mming_Practices
4. Failure to do so when performing administrative tasks (e.g. as
superuser (UID 0, generally "root")) is particularly likely to lead
to problems.
nvi(1)
vi(1)

Logan Shaw

2006-05-09, 1:15 am

Michael Paoli wrote:
> o Also general best practice - e.g. if the program gets abnormally
> terminated, the cruft (temporary files) are left in temporary
> and/or well known location(s), rather than scattered about the
> filesystems - much easier to maintain, cleanup, etc. that way, and
> in many cases better security and/or performance is obtained by
> writing files to more suitable temporary location (e.g. /var/tmp
> may be on a RAID 0+1 filesystem and /tmp may be a tmpfs
> filesystem from virtual memory, whereas the file being edited may
> reside on a much lower performance RAID-5 filesystem).


I don't think it's that one-sided when it comes to performance. If
the goal is, as the original poster described, to write the modified
version of the file but preserve the original until the modified
version has been completely written to disk, then it seems like the
next logical step is to replace the original with the modified
version. This is where writing the temporary file (which is the
modified version) to the same directory as the original can be
a huge win: two files in the same directory are almost always
on the same filesystem.

Thus, replacing the original version with the modified version
becomes an O(1) operation, where if you had written the modified
version to, say, /var/tmp, replacing the original with the
modified version is an O(N) operation unless the two happen to
be on the same filesystem.
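
The O(1) versus O(N) distinction can be made concrete with a small Python sketch. The helper name move_into_place is an assumption for illustration. Within one filesystem os.rename() only updates directory entries; across filesystems rename(2) fails with EXDEV and the data has to be copied byte by byte, which is neither cheap nor atomic.

```python
import errno
import os
import shutil


def move_into_place(scratch, target):
    """Move `scratch` over `target`; report which path was taken."""
    try:
        os.rename(scratch, target)       # same filesystem: O(1), no data copied
        return "renamed"
    except OSError as e:
        if e.errno != errno.EXDEV:       # anything but "cross-device link" is a real error
            raise
    # Different filesystems: O(N) copy that can fail halfway through.
    shutil.copyfile(scratch, target)
    os.remove(scratch)
    return "copied"
```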

Furthermore, there is an extra measure of safety that can be
achieved if you are replacing a file with another file on the
same filesystem. If the original file is /home/someuser/foo
and you have written your modified version to /tmp/foo.new,
then replacing foo with foo.new is a fairly unsafe thing. You
must start writing bytes to /home/someuser/foo, and if the
machine crashes or there's a power loss (or if you merely run
out of disk space) before you complete this process, you've
lost both the modified version AND the original. In contrast,
if you write the modified version to /home/someuser/foo.new,
then overwriting foo with foo.new can be done pretty safely.
I don't know of any guarantees that Unix filesystems make about
whether directory name changes are transactional (in the sense
that they either complete fully or have no effect at all, i.e.
that they never partially complete), but I wouldn't be surprised
if modern Unix filesystems that support logging actually can
guarantee this.
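
A rough Python sketch of the same-directory approach described above follows. The name atomic_write and the ".tmp-" prefix are assumptions for illustration. It writes the new version next to the target, forces it to disk, and then renames it over the original. Note that the replacement file carries the ownership and mode that mkstemp() gave it, not those of the original, which is exactly the attribute problem raised at the start of the thread.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace `path` with `data` via a temporary file in the same directory."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())         # data must be on disk before the rename
        os.replace(tmp, path)            # rename(2) under the hood
    except BaseException:
        try:
            os.unlink(tmp)               # do not leave the temporary behind
        except FileNotFoundError:
            pass
        raise
    # Also flush the directory, so the new name itself survives a crash.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

If the process dies before os.replace(), the original file is untouched and only the ".tmp-" file is left as cruft in the directory.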

- Logan
Michael Paoli

2006-05-09, 7:18 am

Logan Shaw wrote:
> Michael Paoli wrote:
> I don't think it's that one-sided when it comes to performance. If
> the goal is, as the original poster described, to write the modified
> version of the file but preserve the original until the modified
> version has been completely written to disk, then it seems like the
> next logical step is to replace the original with the modified
> version. This is where writing the temporary file (which is the
> modified version) to the same directory as the original can be
> a huge win: two files in the same directory are almost always
> on the same filesystem.
>
> Thus, replacing the original version with the modified version
> becomes an O(1) operation, where if you had written the modified
> version to, say, /var/tmp, replacing the original with the
> modified version is an O(N) operation unless the two happen to
> be on the same filesystem.
>
> Furthermore, there is an extra measure of safety that can be
> achieved if you are replacing a file with another file on the
> same filesystem. If the original file is /home/someuser/foo
> and you have written your modified version to /tmp/foo.new,
> then replacing foo with foo.new is a fairly unsafe thing. You
> must start writing bytes to /home/someuser/foo, and if the
> machine crashes or there's a power loss (or if you merely run
> out of disk space) before you complete this process, you've
> lost both the modified version AND the original. In contrast,
> if you write the modified version to /home/someuser/foo.new,
> then overwriting foo with foo.new can be done pretty safely.
> I don't know of any guarantees that Unix filesystems make about
> whether directory name changes are transactional (in the sense
> that they either complete fully or have no effect at all, i.e.
> that they never partially complete), but I wouldn't be surprised
> if modern Unix filesystems that support logging actually can
> guarantee this.


True, there are tradeoffs. There isn't really a particular "perfect"
solution.

I'd still argue, for various reasons that I already covered[1] pretty
well, that, for most typical editor usage, the temporary files should
be written in an appropriate temporary location, and not alongside
the file being edited. That, however, is an issue which is mostly
quite distinct from how we ultimately go about ending up with our
completed edited version of the file at the pathname of our target
file.

To ultimately end up with our edited version of the file at our
target pathname, there are essentially two approaches, with various
pros and cons:

direct write(2)                     rename(2)
o direct overwrite (write(2))       o rename(2)
o generally easier to implement     o have to securely and properly
                                      manage a temporary file on the
                                      same filesystem
o can be optimized to not have to   o full write of file required,
  rewrite portions of the file        certain optimizations that might
  which retain the same data in       otherwise be possible cannot be
  the same locations (can be          taken advantage of
  significantly more efficient
  for small number of changes on
  large file)
o don't need space for both to      o do need space for both to exist
  exist on the same filesystem at     on the same filesystem at the
  the same time                       same time
o ownerships and permissions        o appropriate care must be taken
  remain the same[2]                  if ownerships and permissions
                                      are to be replicated, and in
                                      some cases the ownerships and/or
                                      permissions cannot be preserved
                                      [3]
o (hard) link relationships are     o (hard) link relationships are
  preserved                           lost[4]
o open file descriptors on the      o once the edits are complete,
  original file continue to refer     open file descriptors on the
  to the same pathname, even after    original file no longer
  the edits are completed             reference the file at the same
                                      pathname; if the original file
                                      is unlink(2)ed, open unlinked
                                      file problems may result
o the write(2) operations are not   o rename(2) is atomic[5], thus the
  guaranteed to be atomic, hence      original pathname is always
  the file could be read or left      either the old unedited file, or
  in a partially                      the new fully edited file -
  rewritten/overwritten state         never something indeterminate
                                      between the two
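
The hard-link row of the comparison can be checked with a short experiment. The following Python sketch (the function name demo_links and the file names are made up) creates a second link to a file, then updates it once by overwriting in place and once by renaming a new file over it.

```python
import os
import tempfile


def demo_links():
    d = tempfile.mkdtemp()
    a, b = os.path.join(d, "a"), os.path.join(d, "b")

    with open(a, "w") as f:
        f.write("old")
    os.link(a, b)                        # b is a second name for the same inode

    # Direct overwrite: the inode is reused, so b sees the new content.
    with open(a, "w") as f:
        f.write("new")
    with open(b) as f:
        after_overwrite = f.read()

    # rename(2)-style replacement: a now names a new inode, b keeps the old one.
    tmp = os.path.join(d, "a.new")
    with open(tmp, "w") as f:
        f.write("newer")
    os.replace(tmp, a)
    with open(b) as f:
        after_rename = f.read()

    same_inode = os.stat(a).st_ino == os.stat(b).st_ino
    return after_overwrite, after_rename, same_inode
```

After the in-place overwrite both names show "new"; after the rename, b still shows "new" while a holds "newer", and the two names no longer share an inode.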

And of course there are also general tradeoffs, such as, in the case
of system crash or abnormal termination of the editing program,
what's the worst damage that could occur to the file being edited,
and what cruft (temporaries, etc.) are left, where, and how easy are
they to find, cleanup, and generally manage, and if/when something
goes wrong, how easy is it to detect and fix the issue/problem.


footnotes/references:
1. news:1147034979.155446.109330@v46g2000cwv.googlegroups.com
2. except S[UG]ID are generally cleared upon write by non-superuser,
ACLs would likely be preserved
3. e.g. in general if non-superuser euid is distinct from the owner of
the original file, or non-superuser writing the file is not member
of the group which is the group owner of the original file; any
ACLs present on the original which are to be reproduced on the
edited file, may further complicate matters and/or it may not be
possible to reproduce the ACLs.
4. tracking down and replacing all of the hard links may not be
possible for non-superuser, and even when it's possible, replacing
all the other hard links may cause more problems/surprises than not
doing so, and even if they were to be replaced, replacing them all
would not be an atomic operation
5. this does not necessarily hold on NFS and/or possibly certain other
types of shared and/or remote filesystems and/or filesystem types
which may not be native to UNIX(/LINUX/...)
open(2)
write(2)
lseek(2)
llseek(2)
truncate(2)
rename(2)
unlink(2)
link(2)
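
Footnote 3's caveat about replicating ownerships and permissions on a freshly created file can be sketched in Python as below. The helper name copy_mode_and_owner is an assumption, and the sketch deliberately ignores ACLs and other extended attributes, which need platform-specific handling.

```python
import os
import stat


def copy_mode_and_owner(src, dst):
    """Best-effort copy of ownership and mode bits from src to dst.

    Returns a list naming the attributes that could not be replicated,
    e.g. because a non-superuser may not give a file away.
    """
    st = os.stat(src)
    failed = []
    try:
        # chown first: changing ownership may clear set-user/group-ID bits.
        os.chown(dst, st.st_uid, st.st_gid)
    except PermissionError:
        failed.append("owner/group")
    try:
        os.chmod(dst, stat.S_IMODE(st.st_mode))
    except PermissionError:
        failed.append("mode")
    return failed
```

An editor using the rename(2) approach could call this on the temporary file before renaming it, and warn the user when the returned list is not empty.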

Brian Raiter

2006-05-10, 7:15 pm

> o The temporary(/ies) should be written in temporary location(s), not
> "next to" the original.


[...]

> o Also general best practice - e.g. if the program gets abnormally
> terminated, the cruft (temporary files) are left in temporary
> and/or well known location(s), rather than scattered about the
> filesystems - much easier to maintain, cleanup, etc.


[...]

> o The editor should suitably track its state such that it can recover
> at least semi-reasonably from crashes or other abnormal
> terminations.


I'm afraid I'd have to argue that keeping the temporaries "next to"
the original is actually the better practice when it comes to the
other two points I've quoted above. If you keep all your temporaries
in one place, then you increase the likelihood of naming conflicts
(from two people editing different files with the same filename),
making it harder for casual users to see which temporary goes to which
file after a crash. Keeping the temporary "next to" the original means
that naming clashes occur only when people are (potentially) modifying
the same file anyway. emacs follows this path, and so it is easy for
me to tell when a temporary emacs file has been left behind, and emacs
does not rely on other data in order to resume from an autosave.

Of course these issues become unimportant in situations where you know
you will not have multiple simultaneous users, but that is the
exception.

b
Michael Paoli

2006-05-13, 1:18 am

Brian Raiter wrote[1]:
> I'm afraid I'd have to argue that keeping the temporaries "next to"
> the original is actually the better practice when it comes to the
> other two points I've quoted above. If you keep all your temporaries
> in one place, then you increase the likelihood of naming conflicts
> (from two people editing different files with the same filename),
> making it harder for casual users to see which temporary goes to which
> file after a crash. Keeping the temporary "next to" the original means
> that naming clashes occur only when people are (potentially) modifying
> the same file anyway. emacs follows this path, and so it is easy for
> me to tell when a temporary emacs file has been left behind, and emacs
> does not rely on other data in order to resume from an autosave.
>
> Of course these issues become unimportant in situations where you know
> you will not have multiple simultaneous users, but that is the
> exception.


Well, I beg to differ, on at least several of your points:
o "Naming conflicts" should not be an issue, e.g. with temporaries in
  a "common" location, and/or multiple instances of editor working on
  identically named file at the same time (with the same or distinct
  pathnames). This is mostly dealt with via proper temporary file
  handling[2]. This is essentially a "solved" problem in
  UNIX(/LINUX, etc.) ... of course that doesn't mean folks don't
  repeatedly make the same mistakes when it comes to programming
  (e.g. BUGTRAQ[3] provides tons of examples of the same classic
  mistakes made over and over again).
o Casual users don't need to "see" which temporaries are associated
  with editing sessions on which files. Granted, for the casual
  user, seeing something quite similarly named there alongside the
  original may help give them more suitable hints, but that can also
  lead to more problems (e.g. they think it's cruft and remove it,
  not realizing they actually want to make use of it, or they remove
  it or overwrite it or truncate it while it's open and in use by the
  editor, or there's a naming conflict if the naming convention is
  too limited, etc. - those problems are mostly avoided when the
  temporaries are in suitable common temporary location(s)). Having
  suitable means to find out there's a "recovery" file (abnormally
  terminated edit session), and how to recover is usually pretty
  sufficient (e.g. editor sends user e-mail informing them of the
  fact and how to recover, basic training/documentation on the editor
  covers how to check if there are recovery files and how to recover,
  etc.).
o This is UNIX, etc., one must presume that various asynchronous
  events may and will generally occur (same or other users editing
  identically named file or pathname at the same time, any sequence
  of non-atomic events may have other events occur between them or
  the latter events may not even occur (e.g. abnormal termination));
  one must take suitable precautions to avoid problematic race
  conditions (e.g. proper handling of temporaries). And placing
  temporaries alongside the original does not eliminate potential
  race conditions, e.g. in a shared drwxrwx--T directory (with or
  without any and/or all of sticky bit, SGID, world "execute" and/or
  world read on directory) two distinct users could generally breach
  the security of each other's accounts if temporary files are
  written in that directory and the security of such isn't properly
  handled. Even in a non-shared directory, a user could
  unintentionally clobber or corrupt their own work if temporaries
  aren't handled properly.
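
One common building block of "proper handling of temporaries" is exclusive creation. A minimal Python sketch follows; the function name create_exclusive is hypothetical. With O_CREAT and O_EXCL together, open(2) fails if anything already exists under the name, including a symbolic link planted by another user, so the editor never writes through a file it did not create itself.

```python
import os


def create_exclusive(path, mode=0o600):
    """Create `path` only if no file or symlink exists under that name.

    Returns a file descriptor; raises FileExistsError on a clash.
    """
    flags = os.O_WRONLY | os.O_CREAT | os.O_EXCL
    return os.open(path, flags, mode)
```

On FileExistsError the caller would pick another name and retry, which is roughly what library routines such as mkstemp(3) do internally.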

I certainly tend to be of the opinion that how we do our penultimate
operations to have the edited data end up at the original pathname is
much more debatable[4], but that the case of where the temporaries
should go - at least generally speaking - is relatively clear cut (or
at least makes for a rather lopsided debate).

footnotes/references:
1. in news:e3tu6e$8ej$1@cascadia.drizzle.com
and left out some attribution information:
news:1147034979.155446.109330@v46g2000cwv.googlegroups.com
2. e.g. see:

http://www-128.ibm.com/developerwor...ace.html#N10174
and numerous similar qualified reference materials.
3. http://www.securityfocus.com/archive/1
4. news:1147162629.287269.146060@e56g2000cwe.googlegroups.com
news:comp.security.unix
