Please explain why kill -9 doesn't always kill
| Andrew Falanga 2007-02-14, 1:20 pm |
| I've seen this before and I'm confused by it because the manual pages
for kill, signal, etc. all say that SIGKILL and SIGSTOP "cannot be
caught, blocked or ignored." I know that this has to do with
programming for sigaction that they cannot be caught, blocked or
ignored, but why is it that sometimes, at the command prompt, I'll
give a process "kill -9 <pid>" and the kill program completes, but
then a ps -aux | grep <proc_name> shows that the process is still
active?
I'd like to know if I'm on the right track with thinking that it's
because the process I'm attempting to kill is in some blocking state
waiting for I/O, but I'm not sure I'm correct. This has really
baffled me from time to time because on the one hand the manual pages
say that the signal "cannot be caught, blocked or ignored" but yet
when I give the kill command as above, it sure seems to be "caught,
blocked or ignored." So, what gives?
Andy
| |
| Rafael Almeida 2007-02-14, 7:21 pm |
| On 14 Feb 2007 10:05:28 -0800
"Andrew Falanga" <af300wsm@gmail.com> wrote:
> I'd like to know if I'm on the right track with thinking that it's
> because the process I'm attempting to kill is in some blocking state
> waiting for I/O, but I'm not sure I'm correct. This has really
> baffled me from time to time because on the one hand the manual pages
> say that the signal "cannot be caught, blocked or ignored" but yet
> when I give the kill command as above, it sure seems to be "caught,
> blocked or ignored." So, what gives?
>
Yes, that's correct. I had a problem like that once. I removed a USB
pendrive while I was transferring data to it and a few processes just
wouldn't die. Their stat in ps was Z, iirc. I'm not completely sure how
everything happens, and I'm not sure why the kernel can't just see that the
pendrive is not there anymore and kill the process or something,
but that's how it happened to me.
The whole "cannot be caught, blocked or ignored" thing means that your
userland program won't be able to do any of that. The kernel, of course, is the
one handling the signals and it can do whatever it wants. My guess is that it
never delivers the signals to the process while it's doing I/O; the
SIGKILLs just sit in some buffer waiting for the I/O to be ready, but
since it's never ready, the process never gets killed. There are no
guarantees as to how fast a signal will get to a process.
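The userland half of that statement is easy to check: a request to ignore (or catch) SIGKILL or SIGSTOP is simply refused. A minimal sketch, assuming Python 3 on a POSIX system; the helper name can_ignore is made up for illustration.

```python
import signal

def can_ignore(signum):
    """Return True if this process is allowed to ignore signum."""
    try:
        previous = signal.signal(signum, signal.SIG_IGN)
    except OSError:
        # The request is rejected (EINVAL) for SIGKILL and SIGSTOP.
        return False
    if previous is not None:
        signal.signal(signum, previous)  # put the old disposition back
    return True

if __name__ == "__main__":
    for name in ("SIGUSR1", "SIGKILL", "SIGSTOP"):
        print(name, can_ignore(getattr(signal, name)))
```

This only shows that a program cannot change the disposition of these signals. It says nothing about when the kernel gets around to acting on them.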
| |
| Jens Thoms Toerring 2007-02-14, 7:21 pm |
| Rafael Almeida <rafaelc@dcc.ufmg.br> wrote:
> On 14 Feb 2007 10:05:28 -0800
> "Andrew Falanga" <af300wsm@gmail.com> wrote:
>
> Yes, that's correct. I had a problem like that once. I removed a USB
> pendrive while I was transferring data to it and a few processes just
> wouldn't die. Their stat in ps was Z, iirc. I'm not completely sure how
> everything happens, and I'm not sure why the kernel can't just see that the
> pendrive is not there anymore and kill the process or something,
> but that's how it happened to me.
The kernel (or the module that deals with the drive) can do that if
it can detect that it's gone. But that might not always be as simple
as it may look from the outside - quite often the module will be
waiting for an interrupt telling it the transfer is done, but when
the drive suddenly gets disconnected that interrupt may never arrive...
On the other hand, simply stopping I/O somewhere in the middle of a
transaction just because of a SIGKILL signal isn't really an option.
Imagine what would happen to the data on the drive if a user kills
a program with that signal while the data and accompanying meta-data
haven't made it to the drive - that easily could bring the whole file
system on the drive into disarray, resulting in an unreadable drive
and data loss. I would guess that this is the reason why a SIGKILL
signal sometimes doesn't kill a process that's still in the kernel
doing I/O (at least as long as the kernel doesn't deem it safe to
abort the system call on receipt of the signal).
Regards, Jens
--
\ Jens Thoms Toerring ___ jt@toerring.de
\__________________________ http://toerring.de
| |
| Andrew Falanga 2007-02-15, 1:19 pm |
| On Feb 14, 1:47 pm, j...@toerring.de (Jens Thoms Toerring) wrote:
> Rafael Almeida <rafa...@dcc.ufmg.br> wrote:
>
>
> The kernel (or the module that deals with the drive) can do that if
> it can detect that it's gone. But that might not always be as simple
> as it may look from the outside - quite often the module will be
> waiting for an interrupt telling it the transfer is done, but when
> the drive suddenly gets disconnected that interrupt may never arrive...
>
> On the other hand, simply stopping I/O somewhere in the middle of a
> transaction just because of a SIGKILL signal isn't really an option.
> Imagine what would happen to the data on the drive if a user kills
> a program with that signal while the data and accompanying meta-data
> haven't made it to the drive - that easily could bring the whole file
> system on the drive into disarray, resulting in an unreadable drive
> and data loss. I would guess that this is the reason why a SIGKILL
> signal sometimes doesn't kill a process that's still in the kernel
> doing I/O (at least as long as the kernel doesn't deem it safe to
> abort the system call on receipt of the signal).
>
> Regards, Jens
> --
> \ Jens Thoms Toerring ___ j...@toerring.de
> \__________________________ http://toerring.de
Thanks everybody. I figured that was the case, but I wanted to ask to
find out.
Andy
| |
| Andrei Voropaev 2007-02-15, 1:19 pm |
| On 2007-02-14, Andrew Falanga <af300wsm@gmail.com> wrote:
> I'd like to know if I'm on the right track with thinking that it's
> because the process I'm attempting to kill is in some blocking state
> waiting for I/O, but I'm not sure I'm correct. This has really
> baffled me from time to time because on the one hand the manual pages
> say that the signal "cannot be caught, blocked or ignored" but yet
> when I give the kill command as above, it sure seems to be "caught,
> blocked or ignored." So, what gives?
If you look at the output of ps more carefully, you'll see that it
reports a state for each process. Two of those states are of
interest here.
One interesting state is Z (zombie). This process has already terminated,
but because its parent hasn't done a waitpid on it, the kernel keeps
it around. Obviously you can't kill a dead process. To remove it, find its
parent (the f switch may help) and try killing the parent first.
The other state is D (uninterruptible sleep). This process is "alive",
but the kernel has put it in this state because the process can't proceed
for some reason (maybe it is waiting for I/O). Since signals require
processing and the process is in uninterruptible sleep, the signal
simply gets queued and will be delivered to the process when its state
changes to something different. Usually it does change. But I've seen it
happen a couple of times that a process never returned from that state, so we
had to hard reboot the machine to get rid of it.
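The Z case can be reproduced on purpose. The sketch below assumes Linux (it reads the state letter from /proc/<pid>/stat) and Python 3; proc_state and zombie_demo are illustrative names, not anything from ps or the kernel. It forks a child that exits at once, does not wait for it, and shows that SIGKILL is accepted but changes nothing until the parent reaps the child.

```python
import os
import signal
import time

def proc_state(pid):
    """Return the one-letter state (R, S, D, Z, ...) from /proc/<pid>/stat."""
    with open(f"/proc/{pid}/stat", "rb") as f:
        data = f.read()
    # The state is the first field after the parenthesised command name.
    return data.rsplit(b")", 1)[1].split()[0].decode()

def zombie_demo():
    pid = os.fork()
    if pid == 0:
        os._exit(0)               # child: terminate immediately
    deadline = time.monotonic() + 5.0
    while proc_state(pid) != "Z" and time.monotonic() < deadline:
        time.sleep(0.01)          # give the child time to exit
    before = proc_state(pid)
    os.kill(pid, signal.SIGKILL)  # accepted, but there is nothing left to kill
    after = proc_state(pid)
    os.waitpid(pid, 0)            # reaping is what removes the entry
    gone = not os.path.exists(f"/proc/{pid}")
    return before, after, gone

if __name__ == "__main__":
    print(zombie_demo())
```

The D state cannot be provoked this easily from userland, which is part of why it is harder to reason about.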
--
Minds, like parachutes, function best when open
| |
| Krusty 2007-02-15, 1:19 pm |
| "Andrew Falanga" <af300wsm@gmail.com> wrote:
>I know that this has to do with
>programming for sigaction that they cannot be caught, blocked or
>ignored, but why is it that sometimes, at the command prompt, I'll
>give a process "kill -9 <pid>" and the kill program completes, but
>then a ps -aux | grep <proc_name> shows that the process is still
>active?
Zombies! You've got zombies!
A process will remain in the process list until its parent receives
its termination status. This is done with a variant of the wait()
call. If the parent, for whatever reason, hasn't yet wait()ed on its
children and you kill one of them, the dead child will hang around, perhaps
indefinitely. These are called zombie processes.
All processes have parents, except for init. New processes are created
by a fork() followed by an exec(), and then the parent does a wait() to
clean up after the new process exits. init starts the system daemons and
wait()s for them to exit, as well as for orphan processes - ones whose
parents exit()ed before they did.
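The fork(), exec(), wait() sequence described above can be sketched as follows, assuming Python 3 on a POSIX system. run_and_reap is a made-up helper name, and the exit code 127 for a failed exec is only the usual shell convention.

```python
import os
import sys

def run_and_reap(argv):
    """Fork, exec argv in the child, and wait for it in the parent."""
    pid = os.fork()
    if pid == 0:
        try:
            os.execvp(argv[0], argv)   # only returns (by raising) on failure
        finally:
            os._exit(127)              # exec failed: leave the child at once
    _, status = os.waitpid(pid, 0)     # collecting the status removes the child
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)        # negative value: killed by that signal

if __name__ == "__main__":
    print(run_and_reap([sys.executable, "-c", "raise SystemExit(3)"]))
```

If the waitpid() call is left out, the child stays in the process list as a zombie for as long as the parent lives.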
| |
| phil-news-nospam@ipal.net 2007-02-15, 7:15 pm |
| On Wed, 14 Feb 2007 18:08:48 -0200 Rafael Almeida <rafaelc@dcc.ufmg.br> wrote:
| On 14 Feb 2007 10:05:28 -0800
| "Andrew Falanga" <af300wsm@gmail.com> wrote:
|
|> I'd like to know if I'm on the right track with thinking that it's
|> because the process I'm attempting to kill is in some blocking state
|> waiting for I/O, but I'm not sure I'm correct. This has really
|> baffled me from time to time because on the one hand the manual pages
|> say that the signal "cannot be caught, blocked or ignored" but yet
|> when I give the kill command as above, it sure seems to be "caught,
|> blocked or ignored." So, what gives?
|>
|
| Yes, that's correct. I had a problem like that once. I removed a USB
| pendrive while I was transferring data to it and a few processes just
| wouldn't die. Their stat in ps was Z, iirc. I'm not completely sure how
| everything happens, and I'm not sure why the kernel can't just see that the
| pendrive is not there anymore and kill the process or something,
| but that's how it happened to me.
Killing the processes because the device is gone is wrong. Causing the
syscall that the process is waiting in to return with an error is more
appropriate. The descriptor should be flagged as in error, or maybe even
directly closed.
But if you do a kill -KILL, it really should kill, including cancelling
all pending I/O, or where stuff is really stuck (like DMA transfers that
are pending), lock out the memory by assigning it to a dummy process or
PID 1.
| The whole "cannot be caught, blocked or ignored" thing means that your
| userland program won't be able to do any of that. The kernel, of course, is the
| one handling the signals and it can do whatever it wants. My guess is that it
| never delivers the signals to the process while it's doing I/O; the
| SIGKILLs just sit in some buffer waiting for the I/O to be ready, but
| since it's never ready, the process never gets killed. There are no
| guarantees as to how fast a signal will get to a process.
It doesn't know the I/O is in error, so presumably the signal just remains
pending. There needs to be some way to wake up a process from any I/O, at
least if the process is killed.
--
|---------------------------------------/----------------------------------|
| Phil Howard KA9WGN (ka9wgn.ham.org) / Do not send to the address below |
| first name lower case at ipal.net / spamtrap-2007-02-15-1454@ipal.net |
|------------------------------------/-------------------------------------|
| |
| phil-news-nospam@ipal.net 2007-02-15, 7:15 pm |
| On 14 Feb 2007 20:47:58 GMT Jens Thoms Toerring <jt@toerring.de> wrote:
| On the other hand, simply stopping I/O somewhere in the middle of a
| transaction just because of a SIGKILL signal isn't really an option.
| Imagine what would happen to the data on the drive if a user kills
| a program with that signal while the data and accompanying meta-data
| haven't made it to the drive - that easily could bring the whole file
| system on the drive into disarray, resulting in an unreadable drive
| and data loss. I would guess that this is the reason why a SIGKILL
| signal sometimes doesn't kill a process that's still in the kernel
| doing I/O (at least as long as the kernel doesn't deem it safe to
| abort the system call on receipt of the signal).
Once the data is transferred from the process and the filesystem has it,
there should be no associativity between the process and the I/O. It
should proceed to completion even if the process exits or is killed.
When a process is killed, things normally get closed. That should all be
done regardless of any pending I/O. Even if we leave the process in the
process table, it should have no open descriptors besides the hung ones.
All else should be closed immediately in parallel. Same for current root
and current working directories (and decrement the use count). Everything
that could be released should be released, including all mappings. None
should be waiting behind any other.
--
|---------------------------------------/----------------------------------|
| Phil Howard KA9WGN (ka9wgn.ham.org) / Do not send to the address below |
| first name lower case at ipal.net / spamtrap-2007-02-15-1459@ipal.net |
|------------------------------------/-------------------------------------|
| |
| Jens Thoms Toerring 2007-02-15, 7:15 pm |
| phil-news-nospam@ipal.net wrote:
> On 14 Feb 2007 20:47:58 GMT Jens Thoms Toerring <jt@toerring.de> wrote:
> | On the other hand, simply stopping I/O somewhere in the middle of a
> | transaction just because of a SIGKILL signal isn't really an option.
> | Imagine what would happen to the data on the drive if a user kills
> | a program with that signal while the data and accompanying meta-data
> | haven't made it to the drive - that easily could bring the whole file
> | system on the drive into disarray, resulting in an unreadable drive
> | and data loss. I would guess that this is the reason why a SIGKILL
> | signal sometimes doesn't kill a process that's still in the kernel
> | doing I/O (at least as long as the kernel doesn't deem it safe to
> | abort the system call on receipt of the signal).
> Once the data is transferred from the process and the filesystem has it,
> there should be no associativity between the process and the I/O. It
> should proceed to completion even if the process exits or is killed.
What you write looks as if you assume that there is some
kind of "kernel process" (like a kind of server process) that
e.g. accepts the data from a write() call, processes them and
then returns a value to the process that has been waiting in
between (and which could be killed while the "kernel process"
still continues until it is finished). But that isn't how it
works, there is no such "kernel process". When a process makes
a system call like write() the process is switched into kernel
mode and runs the kernel code for write(). And while it's running
in kernel mode it can't be killed - the only way a signal
could influence what the process does while it's running in
kernel mode would be if the code in the kernel it's running
takes notice of the signal and switches the process back to
user mode where it then immediately would be killed due to the
signal. Making system calls is rather similar to calling a
function in some library - while a process runs the code in
a library it's also still the same process and there is no
"library process" that gets run while the process that made
the call is waiting.
Regards, Jens
--
\ Jens Thoms Toerring ___ jt@toerring.de
\__________________________ http://toerring.de
| |
| Binary 2007-02-16, 1:19 am |
| On Feb 16, 12:45 am, Andrei Voropaev <avo...@mail.ru> wrote:
> On 2007-02-14, Andrew Falanga <af300...@gmail.com> wrote:
>
>
> If you look at the output of ps more carefully, you'll see that it
> reports a state for each process. Two of those states are of
> interest here.
>
> One interesting state is Z (zombie). This process has already terminated,
> but because its parent hasn't done a waitpid on it, the kernel keeps
> it around. Obviously you can't kill a dead process. To remove it, find its
> parent (the f switch may help) and try killing the parent first.
>
> The other state is D (uninterruptible sleep). This process is "alive",
> but the kernel has put it in this state because the process can't proceed
> for some reason (maybe it is waiting for I/O). Since signals require
> processing and the process is in uninterruptible sleep, the signal
> simply gets queued and will be delivered to the process when its state
> changes to something different. Usually it does change. But I've seen it
> happen a couple of times that a process never returned from that state, so we
> had to hard reboot the machine to get rid of it.
Can a process put itself into the D state, or can only a system call in
kernel space do this?
>
> --
> Minds, like parachutes, function best when open
| |
| Gordon Burditt 2007-02-18, 1:22 am |
| >I've seen this before and I'm confused by it because the manual pages
>for kill, signal, etc. all say that SIGKILL and SIGSTOP "cannot be
>caught, blocked or ignored." I know that this has to do with
>programming for sigaction that they cannot be caught, blocked or
>ignored, but why is it that sometimes, at the command prompt, I'll
>give a process "kill -9 <pid>" and the kill program completes, but
>then a ps -aux | grep <proc_name> shows that the process is still
>active?
>
>I'd like to know if I'm on the right track with thinking that it's
>because the process I'm attempting to kill is in some blocking state
>waiting for I/O, but I'm not sure I'm correct. This has really
>baffled me from time to time because on the one hand the manual pages
>say that the signal "cannot be caught, blocked or ignored" but yet
>when I give the kill command as above, it sure seems to be "caught,
>blocked or ignored." So, what gives?
There are several reasons for this.
1) Zombies. Zombies are processes that have *ALREADY* been killed,
but haven't finished flushing their I/O or haven't been waited for
yet. (These show up in Z state in ps). One drastic alternative
in this case is to kill the parent. The child will be inherited
by init (process 1), and waited for, unless init has died.
2) Buggy device drivers or problems with I/O. If a device driver
is stuck waiting for something that will never happen, the process
may stay stuck in D state for a long time. One fairly common
situation is a hard-mounted NFS filesystem where the
server has crashed, lost power, lost network connectivity, or is
otherwise unresponsive. Since "hard mounts" are supposed to keep
retrying, the process may stick around for a long time, like until
the service tech gets the parts he ordered for the server. What
is likely to start happening is that any process trying to touch
the mounted filesystem hangs.
The same sort of things apply to devices suddenly disconnected when
they are not designed to be hot-swappable, or even where they *are*
designed to be hot-swappable but the driver has a bug. Some drivers
may be upset by unexpected power-cycling of a tape drive made
necessary because it's eating a tape.
I've even seen it happen in (typically parallel) printer drivers.
The process won't finish until it flushes all of its I/O, which
includes the last bit of what it was printing, but the printer is
offline because it is out of paper, and it's *SUPPOSED* to wait until
a new order of paper can arrive. Give it more paper, and the
process will quickly finish.
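To see whether anything on a machine is currently stuck like this, one can scan the process table for the D state. A rough sketch, assuming Linux with /proc mounted and Python 3; pids_in_state is an illustrative name, and the state letter is taken from the field after the command name in /proc/<pid>/stat.

```python
import os

def pids_in_state(wanted="D"):
    """Return the PIDs whose /proc/<pid>/stat state letter equals wanted."""
    found = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue                      # not a process directory
        try:
            with open(f"/proc/{entry}/stat", "rb") as f:
                data = f.read()
        except OSError:
            continue                      # process went away while scanning
        # The state is the first field after the parenthesised command name.
        state = data.rsplit(b")", 1)[1].split()[0].decode()
        if state == wanted:
            found.append(int(entry))
    return found

if __name__ == "__main__":
    print("uninterruptible sleep:", pids_in_state("D"))
    print("zombies:", pids_in_state("Z"))
```

A process that shows up in D on every scan over a long period is the kind described above. Brief appearances in D are normal during ordinary disk I/O.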
| |
| Daniel C. Bastos 2007-02-18, 1:18 pm |
| In article <53k584F1t4id9U1@mid.uni-berlin.de>,
Jens Thoms Toerring wrote:
> phil-news-nospam@ipal.net wrote:
>
>
>
> What you write looks as if you assume that there is some
> kind of "kernel process" (like a kind of server process) that
> e.g. accepts the data from a write() call, processes them and
> then returns a value to the process that has been waiting in
> between (and which could be killed while the "kernel process"
> still continues until it is finished). But that isn't how it
> works, there is no such "kernel process". When a process makes
> a system call like write() the process is switched into kernel
> mode and runs the kernel code for write(). And while it's running
> in kernel mode it can't be killed - the only way a signal
> could influence what the process does while it's running in
> kernel mode would be if the code in the kernel it's running
> takes notice of the signal and switches the process back to
> user mode where it then immediately would be killed due to the
> signal.
Doesn't write() return EINTR when interrupted by a signal?
[...]
| |
| Jens Thoms Toerring 2007-02-18, 1:18 pm |
| Daniel C. Bastos <dbast0s@yahoo.com.br> wrote:
> Doesn't write() return EINTR when interrupted by a signal?
Yes, but only if interrupted before any data was written. Once
write has started writing data it (normally) can't be stopped
anymore.
Regards, Jens
--
\ Jens Thoms Toerring ___ jt@toerring.de
\__________________________ http://toerring.de
| |
| Rainer Weikusat 2007-02-18, 1:18 pm |
| jt@toerring.de (Jens Thoms Toerring) writes:
> Daniel C. Bastos <dbast0s@yahoo.com.br> wrote:
>
> Yes, but only if interrupted before any data was written. Once
> write has started writing data it (normally) can't be stopped
> anymore.
SUS says: If write() is interrupted by a signal after it successfully
writes some data, it shall return the number of bytes written.
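Whatever one concludes about when an interruption can happen, the practical consequence of the sentence quoted above is that callers should be prepared for a short count. A minimal sketch of the usual loop, assuming Python 3.5 or later (where os.write retries by itself on EINTR unless the signal handler raises); write_all is a made-up helper name.

```python
import os

def write_all(fd, data):
    """Keep calling os.write until every byte of data has been written."""
    view = memoryview(data)
    total = 0
    while total < len(view):
        # os.write may accept fewer bytes than offered; it returns the count.
        total += os.write(fd, view[total:])
    return total

if __name__ == "__main__":
    r, w = os.pipe()
    print(write_all(w, b"hello world"))
    os.close(w)
    print(os.read(r, 100))
    os.close(r)
```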
| |
| Logan Shaw 2007-02-18, 1:18 pm |
| Rainer Weikusat wrote:
> jt@toerring.de (Jens Thoms Toerring) writes:
>
> SUS says: If write() is interrupted by a signal after it successfully
> writes some data, it shall return the number of bytes written.
That sentence doesn't make any statement about whether it can be
interrupted after successfully writing bytes. It merely says that
*if* that happens, then a particular behavior is required.
- Logan
| |
| Rainer Weikusat 2007-02-18, 1:18 pm |
| Logan Shaw <lshaw-usenet@austin.rr.com> writes:
> Rainer Weikusat wrote:
>
> That sentence doesn't make any statement about whether it can be
> interrupted after successfully writing bytes. It merely says that
> *if* that happens, then a particular behavior is required.
If it cannot be interrupted, the sentence above is obviously
meaningless. Which part of the standard specifies which other parts of
the standard are supposed to be meaningless? Answer:
none. Consequently, assumptions of meaninglessness are unfounded.
| |
| Michael Paoli 2007-02-20, 1:17 pm |
| Andrew Falanga wrote:
> I've seen this before and I'm confused by it because the manual pages
> for kill, signal, etc. all say that SIGKILL and SIGSTOP "cannot be
> caught, blocked or ignored." I know that this has to do with
> programming for sigaction that they cannot be caught, blocked or
> ignored, but why is it that sometimes, at the command prompt, I'll
> give a process "kill -9 <pid>" and the kill program completes, but
> then a ps -aux | grep <proc_name> shows that the process is still
> active?
> I'd like to know if I'm on the right track with thinking that it's
> because the process I'm attempting to kill is in some blocking state
> waiting for I/O, but I'm not sure I'm correct. This has really
> baffled me from time to time because on the one hand the manual pages
> say that the signal "cannot be caught, blocked or ignored" but yet
> when I give the kill command as above, it sure seems to be "caught,
> blocked or ignored." So, what gives?
It was not caught, blocked or ignored. If the kill(2) attempt wasn't
valid or allowed, an error results (e.g. EINVAL, EPERM, ESRCH). If
it was allowed, the PID is signaled (kill(2)ed). The effect may not
be absolutely immediate - e.g. if the PID is in the middle of a
non-interruptible system call, or perhaps if the kernel's scheduler
hasn't yet processed the signal. Also, a fairly common scenario is
that the parent doesn't reap the dead child, or doesn't do so promptly -
this results in a zombie process - and it's common knowledge that one
can't kill a zombie.
references:
news:d85eb83f.0312180842.21c629d3@posting.google.com
wait(2)
exit(3)
exit(2)
kill(2)
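The error cases listed above can be observed directly. The sketch below assumes Python 3 on a POSIX system and uses signal 0, which makes kill(2) perform its checks without delivering anything; kill_error is an illustrative name.

```python
import errno
import os

def kill_error(pid, sig=0):
    """Return None if kill(pid, sig) is accepted, else the errno name."""
    try:
        os.kill(pid, sig)
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    return None

if __name__ == "__main__":
    print(kill_error(os.getpid()))        # own process: accepted
    print(kill_error(os.getpid(), 9999))  # not a valid signal number
    print(kill_error(1))                  # init: EPERM unless run as root
```

A return of None only means the signal was accepted for the target. As the post says, it does not tell you when, or whether, the process actually goes away.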
| |
| Andrei Voropaev 2007-02-20, 1:17 pm |
| On 2007-02-16, Binary <binary.chen@gmail.com> wrote:
> On Feb 16, 12:45 am, Andrei Voropaev <avo...@mail.ru> wrote:
[...]
> Can process itself put it into D state or just system call in kernel
> space can do this?
All the states of a process are manipulated by the kernel. The
process can't directly manipulate them.
--
Minds, like parachutes, function best when open