NFS file problem: maybe a stale/stuck handle?
| Theodore W. Hall 2004-02-22, 10:33 pm |
| This has me stumped. My apologies if the subject line misses the
mark.
I've recently migrated some directories from one NFS file server
(SunOS 5.6) to another (IRIX 6.5.13m) using "cp -Rp" as root from an
NFS client (IRIX 6.5.19m).
Before the migration, the NFS client did this:
/usr/local -> /net/sun-host/mnt/usr/local
After the migration, it does this:
/usr/local -> /net/irix-host/mnt/usr/local
Everything seems fine, except for one file:
/usr/local/netscape_4.79_irix6.5/spell/pen4s324.dat
The symptom is that netscape mail's "Spelling" tool is unavailable.
I've traced the spelling problem to that particular file.
I've compared the files on the SunOS and IRIX servers with diff and
cmp, as a regular user (no special privileges), and they appear to
be identical. Neither diff nor cmp has any trouble reading either
file. Ownership, sizes, dates, permissions, everything I can think
of to compare is identical, except for their physical locations on
disk. In trying to debug this, I've found the following:
* Only netscape seems to have any trouble accessing the file.
Other apps (diff, cmp, strings, cp) have no trouble reading it.
* If I replace the file on the IRIX server with a symlink back to
the SunOS server, it works in netscape:
pen4s324.dat -> /net/sun-host/mnt/usr/local/netscape_4.79_irix6.5/spell/pen4s324.dat
* If I replace it with a symlink to another duplicate of the file
on the NFS client's internal disk, it works in netscape (only on
that host, of course):
pen4s324.dat -> /usr/local-on-nfs-client/pen4s324.dat
* If I make another duplicate of the file, or even the entire
netscape_4.79_irix6.5 directory, on the IRIX server, it doesn't
work. Everything else in netscape seems to run fine, except for
the "Spelling" tool.
* If I log in to the IRIX server and run netscape there (with
DISPLAY pointing back to the NFS client), THAT works. It avoids
NFS and accesses the file locally:
/usr/local -> /mnt/usr/local
I've rebooted both the IRIX NFS server and the NFS client, but this
behavior persists. It's not tied to netscape's pathname to the file
(since the pathname works if it's a symlink to the SunOS server).
It's not tied to the file's inode (since copying the file to a
different inode doesn't change anything). It's not tied to the
file's contents (since it works as long as the app is running on the
NFS server rather than the NFS client).
The last point seems to indicate some sort of NFS problem, but as I
said, I've rebooted both machines to no avail. I'd appreciate any
tips.
--
Ted Hall
| |
| Mark Hittinger 2004-02-23, 4:34 am |
| "Theodore W. Hall" <twhall@cuhk.edu.hk> writes:
>This has me stumped. My apologies if the subject line misses the
>mark.
>...
>Everything seems fine, except for one file:
> /usr/local/netscape_4.79_irix6.5/spell/pen4s324.dat
>...
>* If I make another duplicate of the file, or even the entire
> netscape_4.79_irix6.5 directory, on the IRIX server, it doesn't
> work. Everything else in netscape seems to run fine, except for
> the "Spelling" tool.
I wonder if the pen4s324.dat might be a "sparse file" which may not
have gotten copied correctly onto the new server. Try running a local
and nfs mounted cksum on both files and see if there is a checksum
difference. Along the same line of reasoning if the file did get
copied OK but is indeed a sparse file there may be some sort of nfs
bug that you are running into related to sparse file handling.
Good Luck!
Mark Hittinger
bugs@pu.net
| |
| Barry Margolin 2004-02-23, 9:34 am |
| In article <eKOdnVU3rrwIqafdRVn_vA@comcast.com>,
bugs@pu.net (Mark Hittinger) wrote:
> "Theodore W. Hall" <twhall@cuhk.edu.hk> writes:
>
> I wonder if the pen4s324.dat might be a "sparse file" which may not
> have gotten copied correctly onto the new server. Try running a local
> and nfs mounted cksum on both files and see if there is a checksum
> difference. Along the same line of reasoning if the file did get
> copied OK but is indeed a sparse file there may be some sort of nfs
> bug that you are running into related to sparse file handling.
He said he already compared them using "cmp" and "diff"; I can't imagine
how checksum would detect a difference that these didn't. Sparseness is
transparent to user-level applications.
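
The point about sparseness can be checked directly. The following is a minimal sketch, assuming a POSIX system where st_blocks is reported in 512-byte units; the function names is_sparse and same_bytes are illustrative and do not come from the thread. It shows that a file with holes and a fully allocated file can read back as the same bytes, which is why cmp, diff, and cksum would all agree.

```python
import os


def is_sparse(path):
    """Return True if the file occupies fewer disk blocks than its size implies.

    Assumes st_blocks counts 512-byte units, as on POSIX systems.
    The result depends on the filesystem, so treat it as a hint only.
    """
    st = os.stat(path)
    return st.st_blocks * 512 < st.st_size


def same_bytes(path_a, path_b, chunk=65536):
    """Compare two files byte for byte, as cmp would."""
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            block_a = a.read(chunk)
            block_b = b.read(chunk)
            if block_a != block_b:
                return False
            if not block_a:
                # Both files reached end-of-file together.
                return True
```

Calling same_bytes() on a file created with truncate() and on a file of the same length filled with explicit zero bytes returns True, whatever is_sparse() reports for either one.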
I suggest the OP use truss on the client to see what it's doing when
Netscape hangs.
--
Barry Margolin, barmar@alum.mit.edu
Arlington, MA
*** PLEASE post questions in newsgroups, not directly to me ***
| |
| Doug Freyburger 2004-02-24, 12:34 am |
| Theodore W. Hall wrote:
>
> I've recently migrated some directories from one NFS file server
> (SunOS 5.6) to another (IRIX 6.5.13m) using "cp -Rp" as root from an
> NFS client (IRIX 6.5.19m).
cp isn't a good tool to copy directory trees from one machine to
another.
> /usr/local/netscape_4.79_irix6.5/spell/pen4s324.dat
>
> The symptom is that netscape mail's "Spelling" tool is unavailable.
> I've traced the spelling problem to that particular file.
>
> I've compared the files on the SunOS and IRIX servers with diff and
> cmp, as a regular user (no special privileges), and they appear to
> be identical. Neither diff nor cmp has any trouble reading either
> file. Ownership, sizes, dates, permissions, everything I can think
> of to compare is identical, except for their physical locations on
> disk.
Next check the UID and GID numbers that give the names you see.
Check them on both machines. The ownership could still be wrong.
Finally there's a topic that comes up a lot in tape drives that may
apply. IRIX uses little-endian and Solaris uses big-endian (or is
it the other way around, anyways they are different). Reading a file
locally won't do htonl() and ntohl() mapping of binary files. Reading
a file over NFS should do network order mapping. I suspect it is a
binary file and the XDR layer of NFS broke its byte order.
| |
| Theodore W. Hall 2004-02-24, 1:34 am |
| Barry Margolin wrote:
> I suggest the OP use truss on the client to see what it's doing
> when netscape hangs.
That was revealing. "man -k truss" on IRIX led me to /usr/sbin/par,
which reports an error of "No locks available":
open("/usr/local/netscape_4.79_irix6.5/spell/pen4s324.dat", O_RDONLY, 0777) = 28
fcntl(28, F_GETLK, 0x7ffefba0) errno = 46 (No locks available)
close(28) OK
close(28) errno = 9 (Bad file number)
Moreover, when I shuffle things to get around that (as I described
in my original post), I get a similar "No locks available" error on
a couple of other spelling related files -- netscape.dic and
${HOME}/.netscape/custom.dic
open("/usr/local/netscape_4.79_irix6.5/spell/netscape.dic", O_RDONLY, 0777) = 29
fcntl(29, F_GETLK, 0x7ffef950) errno = 46 (No locks available)
close(29)
END-close() OK
...
open("/usr/people/hall/.netscape/custom.dic", O_RDONLY, 0777) = 29
fcntl(29, F_GETLK, 0x7ffef950) errno = 46 (No locks available)
close(29) OK
Apparently, if "pen4s324.dat" fails, then netscape doesn't even try
to open the custom dictionaries. If "pen4s324.dat" succeeds, then
the Spelling tool is available even though it fails to open the
custom dictionaries. ("pen4s324.dat" seems to be the main
dictionary, in some binary format. "netscape.dic" is a text file
with only a few entries such as "Netscape", "HTML", "browser",
"Collabra", "applets", ... not essential to the tool.)
I've rebooted the NFS client again, but it made no difference. So
the next question is: Why are there no locks available for these
few files? Everything else seems to be fine.
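
The failure in the trace above can be approximated outside netscape. This is a minimal sketch, not what netscape itself does: the trace shows an F_GETLK query, whereas Python's fcntl.lockf() issues a lock-setting request, so this is a related but different call. The function name probe_lock is illustrative. On an NFS mount, both kinds of request depend on the lock manager, so a file that reads fine can still fail here.

```python
import errno
import fcntl
import os


def probe_lock(path):
    """Try to take and release a shared advisory lock on path.

    Returns None on success, or the symbolic errno name (for example
    "ENOLCK", which is reported as "No locks available") on failure.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        try:
            # Non-blocking shared lock; a read-only descriptor is enough.
            fcntl.lockf(fd, fcntl.LOCK_SH | fcntl.LOCK_NB)
        except OSError as exc:
            return errno.errorcode.get(exc.errno, str(exc.errno))
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return None
    finally:
        os.close(fd)
```

Running probe_lock() against the same file through each path (the NFS mount, a local copy, the other server) would separate "cannot read" from "cannot lock", which diff, cmp, and cksum cannot do because they never request a lock.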
--
Ted Hall
| |
| Theodore W. Hall 2004-02-24, 2:34 am |
| Mark Hittinger wrote:
> I wonder if the pen4s324.dat might be a "sparse file" which may not
> have gotten copied correctly onto the new server. Try running a local
> and nfs mounted cksum on both files and see if there is a checksum
> difference. Along the same line of reasoning if the file did get
> copied OK but is indeed a sparse file there may be some sort of nfs
> bug that you are running into related to sparse file handling.
>
> Good Luck!
Thanks. I tried /usr/bin/cksum on both NFS servers and the NFS
client and got identical results on all three hosts.
I've also discovered via /usr/sbin/par that a similar error is
occurring on a plain text file of just 144 bytes (netscape.dic), so
it doesn't seem to be related to "sparsity". I didn't notice this
error previously since the only "damage" is the loss of a handful
of custom dictionary entries.
--
Ted Hall
| |
| Theodore W. Hall 2004-02-24, 3:34 am |
| Doug Freyburger wrote:
> Next check the UID and GID numbers that give the names you see.
> Check them on both machines. The ownership could still be wrong.
UID = 0, GID = 0, mode = -rw-r--r--
> Finally there's a topic that comes up a lot in tape drives that may
> apply. IRIX uses little-endian and Solaris uses big-endian (or is
> it the other way around, anyways they are different). Reading a
> file locally won't do htonl() and ntohl() mapping of binary files.
> Reading a file over NFS should do network order mapping. I suspect
> it is a binary file and the XDR layer of NFS broke its byte order.
Mmm, in my experience, IRIX-MIPS and SunOS-SPARC are both big-endian.
Anyway, /usr/sbin/par reveals that I'm getting a similar error
"No locks available" on a 144-byte plain text file, "netscape.dic".
Thanks for your suggestions. I've been bitten by byte order before,
but not this time.
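
For reference, byte order only matters when raw bytes are interpreted as multi-byte numbers; copying or reading a file, locally or over NFS (which carries file contents as opaque data), leaves the byte sequence unchanged. A minimal sketch of the difference, using an arbitrary example value and an illustrative helper name:

```python
import struct

value = 0x12345678

# The same 32-bit integer laid out in each byte order.
big = struct.pack(">I", value)      # most significant byte first
little = struct.pack("<I", value)   # least significant byte first


def reinterpret(raw, order):
    """Decode four raw bytes as an unsigned 32-bit integer.

    order is ">" for big-endian or "<" for little-endian.
    """
    return struct.unpack(order + "I", raw)[0]
```

Decoding big with the wrong order, reinterpret(big, "<"), gives 0x78563412 instead of 0x12345678. That kind of damage would show up in a binary dictionary but could not explain a failure on a plain text file.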
--
Ted Hall
| |
| Barry Margolin 2004-02-25, 9:39 am |
| In article <403B697A.96099D1E@cuhk.edu.hk>,
"Theodore W. Hall" <twhall@cuhk.edu.hk> wrote:
> That was revealing. "man -k truss" on IRIX led me to /usr/sbin/par,
> which reports an error of "No locks available":
Sounds like there's a problem with file locking on the server you copied
the files to.
--
Barry Margolin, barmar@alum.mit.edu
Arlington, MA
*** PLEASE post questions in newsgroups, not directly to me ***
| |
| Theodore W. Hall 2004-02-26, 4:34 am |
| In article <403B697A.96099D1E@cuhk.edu.hk>, I wrote:
> That was revealing. "man -k truss" on IRIX led me to
> /usr/sbin/par, which reports an error of "No locks available":
Barry Margolin wrote:
> Sounds like there's a problem with file locking on the server you
> copied the files to.
Eureka! I've just discovered that the SGI NFS server isn't
running lockd -- it's a separate switch from nfsd. nfsd is on,
but lockd is off. Urggh ... live and learn ... I'm obviously
an amateur at this.
I can't reboot now, but I'm confident that starting lockd on the
next reboot will clear this up. If it doesn't, I'll be back ...
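
One way to confirm this from the client is to look for the network lock manager (RPC program 100021, usually listed as nlockmgr) in the output of "rpcinfo -p" for the server. The following is a minimal sketch; the function name and the sample listing are made up for illustration and are not output from the hosts in this thread.

```python
NLM_PROGRAM = 100021  # RPC program number of the network lock manager

# Made-up sample in the usual "rpcinfo -p" column layout.
SAMPLE = """\
   program vers proto   port  service
    100000    2   tcp    111  portmapper
    100003    3   udp   2049  nfs
    100021    4   udp  32771  nlockmgr
"""


def lock_manager_registered(rpcinfo_text):
    """Return True if any line of rpcinfo output lists program 100021."""
    for line in rpcinfo_text.splitlines():
        fields = line.split()
        # Skip the header and blank lines; match on the program number.
        if fields and fields[0].isdigit() and int(fields[0]) == NLM_PROGRAM:
            return True
    return False
```

The text could come from subprocess.run(["rpcinfo", "-p", "irix-host"], capture_output=True, text=True).stdout. A listing that shows nfs but no nlockmgr entry matches the symptom here: reads succeed while lock requests fail.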
--
Ted Hall