|
Home > Archive > AOL Webserver > October 2007 > Ns_RegisterAtReady, NsRunAtReadyProcs
You are viewing an archived Text-only version of the thread.
To view this thread in it's original format and/or if you want to reply to
this thread please [click here]
| Author |
Ns_RegisterAtReady, NsRunAtReadyProcs
|
|
| Tom Jackson 2007-10-17, 1:11 am |
| Code was added to driver.c about 9 months ago reportedly to fix a race
condition of some sort. The Bug description (1615787) is at:
http://sourceforge.net/tracker/inde...ail&aid=1615787]&group_id=3152&atid=103152
The code added to callbacks.c are the two undocumented functions:
void *
Ns_RegisterAtReady(Ns_Callback *proc, void *arg)
{
return RegisterCallback(&firstReady, proc, arg);
}
void
NsRunAtReadyProcs(void)
{
Ns_Log(Debug, "************NsRunAtReadyProcs running****");
RunCallbacks(firstReady);
}
The block/context of code added to driver.c is this:
/*
* Register an at-ready callback to trigger the poll.
*/
Ns_RegisterAtReady(TriggerDriver, drvPtr);
/*
* Loop forever until signalled to shutdown and all
* connections are complete and gracefully closed.
*/
This all looks great, except for the fact that the NsRunAtReadyProcs proc is
never called.
The original bug was for 4.0.10, with a different setup:
Ns_RegisterAtReady(SockTrigger,NULL);
Question is why the code was put in, undocumented and unused. It also refers
to a bug from another version of AOLserver, not 4.5. Is NsRunAtReadyProcs
being used somewhere? If not, maybe it should be removed; if used, we need to
add some documentation.
tom jackson
| |
| Dossy Shiobara 2007-10-17, 1:11 pm |
| On 2007.10.13, Tom Jackson <tom@RMADILO.COM> wrote:
> This all looks great, except for the fact that the NsRunAtReadyProcs
> proc is never called.
In 4.5, that appears to be the case. In 4.0, it's called from
nsd/queue.c.
I'd need to look more closely at the code to see if we should remove
those functions entirely in 4.5 or if there's actually a missing call to
NsRunAtReadyProcs() somewhere.
But, you're absolutely right: we need to figure this out, one way or
another. Thanks for finding this.
-- Dossy
--
Dossy Shiobara | dossy@panoptic.com | http://dossy.org/
Panoptic Computer Network | http://panoptic.com/
"He realized the fastest way to change is to laugh at your own
folly -- then you can let go and quickly move on." (p. 70)
| |
| Gustaf Neumann 2007-10-19, 7:11 am |
| Dossy Shiobara schrieb:
> On 2007.10.13, Tom Jackson <tom@RMADILO.COM> wrote:
>
> But, you're absolutely right: we need to figure this out, one way or
> another. Thanks for finding this.
>
The question, why the patch helped is indeed a miracle. Maybe it
changed the timing or causes a different serialization due to a mutex.
However, it did NOT solve the underlying problem.
Activated by Tom's question, i spent some more time on the problem
and managed to reproduce the problem reliably. The problem
actually happend when the number of available connection threads
is configured very low, as well as the the number of requests
per connection thread (e.g. 5 and 3). When the server accepts a
high number of requests, these are queued even without
guaranteeing resources to process these. The situation
is especially bad with bursty requests. When a connection thread
exits e.g. after 3 requests, and there are still many requests queued,
there is no mechanism to automatically trigger new
connection threads in this situation. When a client connects
in this situation to the server, some old (maybe obsolete) requests
are taken from the queue, but after this, the server might sit idle
even with a high number of requests queued.
I have commited a patch to cvs head, that processes these pending
requests, even when maxconns is exceeded. Maybe it is better to
restart exiting connection threads automatically or to limit the
number of queued requests at first hand; or maybe there is a callback
solution. However. I have removed the old - useless - patch
from the head version in CVS.
best regards
-gustaf neumann
| |
| Tom Jackson 2007-10-19, 1:12 pm |
| On Friday 19 October 2007 03:17, Gustaf Neumann wrote:
> The problem
> actually happend when the number of available connection threads
> is configured very low, as well as the the number of requests
> per connection thread (e.g. 5 and 3). When the server accepts a
> high number of requests, these are queued even without
> guaranteeing resources to process these.
I know this is true, several months ago I configured a threadpool with zero
threads after startup. The driver thread will still send requests to the
threadpool queue, where it will sit forever. Also, if you then configure the
threadpool with more threads after going to zero, they are never created.
I also noticed that you should be able to trigger the driver by calling the
Tcl command [ns_driver query ...], so maybe an experiment with a scheduled
proc would be interesting: put the server into the unresponsive mode and see
if it gets unstuck. The log should show the scheduled thread running, then
hopefully the server unsticking. Otherwise it isn't necessarily the driver
thread which is stuck, so triggering it would be pointless.
I can try running my tests again with 3-5 threads, what other parameters were
there? Does this get stuck running OpenACS on any machine, or just on
openacs.org?
Okay, just read Gustaf's comments on his patch, there might be a problem with
the configuration on openacs.org, see below...
> The situation
> is especially bad with bursty requests. When a connection thread
> exits e.g. after 3 requests, and there are still many requests queued,
> there is no mechanism to automatically trigger new
> connection threads in this situation. When a client connects
> in this situation to the server, some old (maybe obsolete) requests
> are taken from the queue, but after this, the server might sit idle
> even with a high number of requests queued.
As above, I wonder if the driver thread and the exiting connection thread
should signal the queue thread? There seems to be something not complete with
the queue loop.
Also, Gustaf, the maxconnections should be higher...with maxconnections=3 you
are exiting a connection thread every three requests, is this how you want it
to work? If resources are limited, this isn't going to help.
>
> I have commited a patch to cvs head, that processes these pending
> requests, even when maxconns is exceeded. Maybe it is better to
> restart exiting connection threads automatically or to limit the
> number of queued requests at first hand; or maybe there is a callback
> solution. However. I have removed the old - useless - patch
> from the head version in CVS.
>
Very interesting, I'll look into this more today. One thing I didn't try was a
configuration where maxconnections was lower than the number of simultaneous
requests from Apache Bench.
tom jackson
| |
| Tom Jackson 2007-10-19, 1:12 pm |
| On Friday 19 October 2007 03:17, Gustaf Neumann wrote:
> I have commited a patch to cvs head, that processes these pending
> requests, even when maxconns is exceeded. Maybe it is better to
> restart exiting connection threads automatically or to limit the
> number of queued requests at first hand; or maybe there is a callback
> solution. However. I have removed the old - useless - patch
> from the head version in CVS.
>
More on the patch and the problem.
Your code comment is that:
+ request is queued without resources to process these.
+ One has to take care about that one restarts the queue, when
+ resources become available again.
The only way the driver thread could guess if a queue has resources is to look
at the number of maxthreads, making sure it is at least 1. But it would only
be a guess, it could instantly change to zero if someone decided to zero out
the threadpool.
How exactly does the patch work? It appears to simply extend the tour-of-duty
for a thread, but does the exiting thread ever get to trigger the queue to
create new threads? This seemed to be the issue several users noticed:
exiting threads never cause the queue to reassess the situation. At the very
least, it seems that when a thread exits, the queue checks if the number of
threads is less than minthreads, and if so, creates a new thread, but also
creates a new thread if minthreads = 0 and there are queued requests. Then
the queue should loop over requests creating threads for queued requests
until it reaches maxthreads.
Okay, maybe restate this: the queue should loop over requests up to the point
of reaching maxthreads. If there is a spike in activity, it is best to
immediately respond with more threads up to the limit. I think the opposite
iw what is happening. Once threads die, it is nearly impossible to get the
number of threads back up, and threads is how long requests are handled,
otherwise the wait time is additive.
tom jackson
| |
| Gustaf Neumann 2007-10-19, 1:12 pm |
| Tom Jackson schrieb:
> I can try running my tests again with 3-5 threads, what other parameters were
> there? Does this get stuck running OpenACS on any machine, or just on
> openacs.org?
>
As written before, i set maxthreads to 5 and maxconnections to 3 to
trigger the problem.
I did not mess with openacs.org, but reproduced the problem under mac os x.
> Also, Gustaf, the maxconnections should be higher...with maxconnections=3 you
> are exiting a connection thread every three requests, is this how you want it
> to work?
yes, i wanted to trigger the bug. When the problem occured, openacs.org
had different settings
(maxthreads 5 maxconnections 10).
> Very interesting, I'll look into this more today. One thing I didn't try was a
> configuration where maxconnections was lower than the number of simultaneous
> requests from Apache Bench.
>
That is not the point. Run ab with more parallel requests than
maxthreads, such
that requests queue up. As long as maxconnections is high enough, the
connection
threads will process the queued requests. If the number of queued
requests is
higher than the number of maxconns, matters degrade. Having one or two
queued requests on a thread exit (due to maxconn exceeded) is no real
problem,
since on the next request, a new thread is started, which will care
about the
queued request as well. That means, that under "good conditions", it is
possible
that there are e.g. 10 requests queued, but a slow and steady heartbeat of
request will cause the queue to empty itself, such that aolserver recovers.
The worst case scenario is actually maxconnections 1 (stop every
connection thread after every request). In this situation (without
my patch) if a queue piles up, it is never shrinking.
Actually, if minthreads is set to a value > 0 (it was not set), then idle
threads should care about queued requests. Maybe the
easier fix is to set minthreads per default to 1. I will
try this, when i am back home (in a couple of hours).
> How exactly does the patch work? It appears to simply
> extend the tour-of-duty for a thread,
yes
> but does the exiting thread ever get to trigger the queue to
> create new threads?
no, i have not modified anything in this regard. When a thread
processes all queued requests before exiting, there is no
urgent need to create a new connection request. The next
incoming request will create a new thread on demand.
> Once threads die,
> it is nearly impossible to get the number of threads
> back up,
i have not observed this in my test, the problem appears
with reaching maxthreads. This does not exclude, that there
might be another problem.
-gustaf neumann
| |
| Tom Jackson 2007-10-19, 7:11 pm |
| On Friday 19 October 2007 09:12, Gustaf Neumann wrote:
> Actually, if minthreads is set to a value > 0 (it was not set), then idle
> threads should care about queued requests. Maybe the
> easier fix is to set minthreads per default to 1. I will
> try this, when i am back home (in a couple of hours).
AOLserver does not respect minthreads except at startup. This is part of the
same issue: Nothing except a request can create a thread. It seems like on
thread exit, a little accounting process could go on to bring threads.current
up to threads.min, this might require more than one thread creation, maybe
not.
In your patch, you change the while loop test:
+ while (poolPtr->threads.maxconns <= 0
+ || ncons-- > 0
+ || poolPtr->queue.wait.num > 1) {
Shouldn't the loop continue with poolPtr->queue.wait.num > 0 ?
Your patch looks like a great fix...I just still don't understand why the
server would completely lock up. As long as you have requests coming in,
seems like a new thread would get created. I wonder if what is happening is
that Apache Bench simple stops sending new requests because none of the
requests are finishing. If it can block, I wonder if simply visiting with a
browser would kick things off again?
tom jackson
| |
| Jeff Rogers 2007-10-19, 7:11 pm |
| Dossy Shiobara wrote:
> On 2007.10.13, Tom Jackson <tom@RMADILO.COM> wrote:
>
> In 4.5, that appears to be the case. In 4.0, it's called from
> nsd/queue.c.
>
> I'd need to look more closely at the code to see if we should remove
> those functions entirely in 4.5 or if there's actually a missing call to
> NsRunAtReadyProcs() somewhere.
I'm thinking the call to NsRunAtReadyProcs should be re-added in. It
was apparently removed a long time ago, just after 4.0.10 was released:
http://aolserver.cvs.sourceforge.ne...r1=1.24&r2=1.25
As the api was not being used at all it was fine to remove it, until it
became useful for this race condition I haven't successfully
reproduced the problem in 4.5 yet (or read the code deeply, but the main
connection service loop - queue.c:NsConnThread - is not significantly
different now) so it may be a different underlying problem than the one
I described, but it should be an easy test to see if adding the AtReady
callbacks back into the end of the loop helps things.
-J
>
> But, you're absolutely right: we need to figure this out, one way or
> another. Thanks for finding this.
>
> -- Dossy
>
| |
| Tom Jackson 2007-10-19, 7:11 pm |
| Jeff,
I don't have a problem with it being in, but it actually needs to do something
we can measure. There are significant overall differences between 4.0.10 and
4.5, although if you look at small parts it seems the same. Just as an
example, the patch in driver.c was hundreds of lines off from 4.0.10 to 4.5.
Somehow Gustaf has been able to reproduce the fix with just the driver.c
change, and the new fix he is trying out also works. I'm not sure we
understand why either of them work quite yet, or if they will continue to
work...and that is the real problem. Magical fixes are nice, but in the long
we might need to know more about it.
I have ab run without either fix and never had a problem. Actually the call
was in driver.c, but the function body was a noop, it never even registered
the TriggerDriver, the AtReadyProcs fuction was completely removed, not even
defined. To me this only means that I don't know how to reproduce the bad
behavior, but at least it didn't cause any difference for me.
tom jackson
On Friday 19 October 2007 11:11, Jeff Rogers wrote:
> I'm thinking the call to NsRunAtReadyProcs should be re-added in. It
> was apparently removed a long time ago, just after 4.0.10 was released:
>
> http://aolserver.cvs.sourceforge.ne.../queue.c?r1=1.2
>4&r2=1.25
>
> As the api was not being used at all it was fine to remove it, until it
> became useful for this race condition I haven't successfully
> reproduced the problem in 4.5 yet (or read the code deeply, but the main
> connection service loop - queue.c:NsConnThread - is not significantly
> different now) so it may be a different underlying problem than the one
> I described, but it should be an easy test to see if adding the AtReady
> callbacks back into the end of the loop helps things.
>
> -J
>
>
> --
> AOLserver - http://www.aolserver.com/
>
> To Remove yourself from this list, simply send an email to
> <listserv@listserv.aol.com> with the body of "SIGNOFF AOLSERVER" in the
> email message. You can leave the Subject: field of your email blank.
| |
| Jeff Rogers 2007-10-19, 7:11 pm |
| Tom Jackson wrote:
> Jeff,
>
> I don't have a problem with it being in, but it actually needs to do something
> we can measure. There are significant overall differences between 4.0.10 and
> 4.5, although if you look at small parts it seems the same. Just as an
> example, the patch in driver.c was hundreds of lines off from 4.0.10 to 4.5.
I was suggesting adding in extra stuff so that the patch to register the
AtReady proc would actually run; you're right in that it appears that
without such the patch looks like it does nothing.
Looking at it slightly more deeply, the AtReady callback shouldn't be
needed, because TriggerDriver is called from NsFreeConn, which is called
every time through the ConnThread loop after the connections have been
put back on the free list.
> Somehow Gustaf has been able to reproduce the fix with just the driver.c
> change, and the new fix he is trying out also works. I'm not sure we
> understand why either of them work quite yet, or if they will continue to
> work...and that is the real problem. Magical fixes are nice, but in the long
> we might need to know more about it.
Magical fixes are fine so long as the magic keeps working. But
something like there where there appears to be no effect may mean that
it is super timing sensitive, maybe pushing an allocated structure into
another block which causes a cache miss at just the right time to allow
a task switch... ugh, my brain hurts just thinking about it. 5 guys and
5 chopsticks are so much simpler 
> I have ab run without either fix and never had a problem. Actually the call
> was in driver.c, but the function body was a noop, it never even registered
> the TriggerDriver, the AtReadyProcs fuction was completely removed, not even
> defined. To me this only means that I don't know how to reproduce the bad
> behavior, but at least it didn't cause any difference for me.
Gustaf's description of his reproduction scenario sounds much different
from mine from last year, which suggests that the underlying cause is
different, and this trigger/atready is a red herring.
-J
[vbcol=seagreen]
>
> tom jackson
>
> On Friday 19 October 2007 11:11, Jeff Rogers wrote:
| |
| Tom Jackson 2007-10-19, 7:11 pm |
| On Friday 19 October 2007 12:46, Jeff Rogers wrote:
> Looking at it slightly more deeply, the AtReady callback shouldn't be
> needed, because TriggerDriver is called from NsFreeConn, which is called
> every time through the ConnThread loop after the connections have been
> put back on the free list.
>
This is one thing that I noticed, TriggerDriver is called in more places, but
the point seems to be to update any statistics, and also to return conns. And
of course, the driver doesn't need triggering when a lot of requests are
coming in, only when the server starts to reap stale or used up threads.
One thing which seems different is that the driver doesn't hold on to
requests, they are queued, so there is no reason for threads to need to do a
TriggerDriver in order to get a new conn (assuming one is in the queue).
In other words, the TriggerDriver is used to return stuff to the driver
thread, not to signal ready.
At least one thing is still missing here: if there is a -timeout on the thread
wait, all threads will end up exiting, going past minthreads. If -timeout is
set to 0, threads will exit and are not replaced when this ends up leaving
less than minthreads available. If you get surges of requests, a timeout of
zero would help a little to keep from continuously creating new threads.
tom jackson
| |
| Tom Jackson 2007-10-20, 1:11 am |
| I just started another thread about using apache bench. I'm not sure if it is
working as advertised, and I wonder if it might be misleading us with the
testing of this potential race condition fix.
Using the most recent version of queue.c, I have been able to create a timeout
on the apache bench test. That is, the client gets stuck and eventually times
out and quits.
The ab command was:
$ ab -t 500 -n 50 -c 1 http://192.168.1.102:8000/bgwrite/sns-thumb.jpg
There is a rotation of 0-5 sec delay built in to this, so even with a 28k
file, it takes a while to chew through 50 requests. (about 101 sec). The -t
500 appears to be a useless option, ab always times out much quicker than
this value.
Now how to get ab stuck and give up? I use a regular browser (mozilla), and I
try to load the same image, I hit reload like 5-7 times or so. The number of
reloads is larger than the number of maxconns per thread. There is only one
thread working.
What may be happening is that the keepalive system is expiring the thread all
by itself (one client with keepalive). Even though there is a queue, maybe
the sock and therefore the conn are being reused, giving no other requests a
chance. The last log entry indicates that the thread has processed too many
conns and is exiting. Then ab freezes. However, if I reload the page one more
time in the browser, ab starts up again.
I still think all the evidence points to a logical error of deleting a thread
and then not maintaining at least one thread (and prefer minthreads) if there
is something in the queue. This isn't a race condition, I don't think,
because it is easy to repeat this: it happens exactly the same way every
time.
This might not be easy to do correctly because killing threads is part of
shutting down, so a thread might want to die because of a shutdown, not
because maxconns was reached, or a timeout.
In driver.c here is the keep-alive section:
/*
* Process Sock's returned for keep-alive or close.
*/
while ((sockPtr = closePtr) != NULL) {
closePtr = sockPtr->nextPtr;
if (!stop && sockPtr->state == SOCK_READWAIT) {
sockPtr->connPtr = AllocConn(drvPtr, &now, sockPtr);
SockWait(sockPtr, &now, drvPtr->keepwait, &waitPtr); /* <--keep*/
} else if (!drvPtr->closewait || shutdown(sockPtr->sock, 1) != 0) {
/* Graceful close diabled or shutdown() failed. */
SockClose(sockPtr);
} else {
SockWait(sockPtr, &now, drvPtr->closewait, &waitPtr);
}
}
So it looks like a keep-wait socket is immediately put to the head of the line
looking for more requests from the same client. If you have a web page with
50+ images, you can easily blow out any reasonable maxconns setting, or if
someone reloads a less heavy page.
tom jackson
| |
| Jeff Rogers 2007-10-20, 7:11 am |
| Tom Jackson wrote:
> Now how to get ab stuck and give up? I use a regular browser (mozilla), and I
> try to load the same image, I hit reload like 5-7 times or so. The number of
> reloads is larger than the number of maxconns per thread. There is only one
> thread working.
>
> What may be happening is that the keepalive system is expiring the thread all
> by itself (one client with keepalive). Even though there is a queue, maybe
> the sock and therefore the conn are being reused, giving no other requests a
> chance. The last log entry indicates that the thread has processed too many
> conns and is exiting. Then ab freezes. However, if I reload the page one more
> time in the browser, ab starts up again.
>
> I still think all the evidence points to a logical error of deleting a thread
> and then not maintaining at least one thread (and prefer minthreads) if there
> is something in the queue. This isn't a race condition, I don't think,
> because it is easy to repeat this: it happens exactly the same way every
> time.
> So it looks like a keep-wait socket is immediately put to the head of the line
> looking for more requests from the same client. If you have a web page with
> 50+ images, you can easily blow out any reasonable maxconns setting, or if
> someone reloads a less heavy page.
Do you still get the lockup in your scenario with keepalives off? If
not, it might be a workaround until it gets fixed. If memory serves, it
would not be the first time that keepalives have been broken.
-J
| |
| Tom Jackson 2007-10-20, 7:11 am |
| On Saturday 20 October 2007 00:14, Jeff Rogers wrote:
> Do you still get the lockup in your scenario with keepalives off? If
> not, it might be a workaround until it gets fixed. If memory serves, it
> would not be the first time that keepalives have been broken.
What I suspect is that there isn't a race condition, just a test condition
which causes the system to appear stuck.
I doubt I'll ever see this in production. But a low number of threads may lead
to this type of thing where you have a few clients requesting a bunch of
files using keepalive. When the thread exits, it could leave unprocessed
requests in the queue. Or maybe a few clients could essentially hog all the
threads within the keep-alive loop. I'm not 100% sure this is a short circuit
of the regular queue, but it could be.
If these greedy clients hog all the connection threads, and work fast enough,
the connection threads will never exit (due to the new patch). This might
actually work against sharing threads, but I'll have to test if exiting a
thread and creating another one gives queued requests a chance to execute.
tom jackson
| |
| Gustaf Neumann 2007-10-20, 1:11 pm |
| Tom Jackson schrieb:
> On Friday 19 October 2007 09:12, Gustaf Neumann wrote:
>
>
> AOLserver does not respect minthreads except at startup. This is part of the
> same issue: Nothing except a request can create a thread. It seems like on
> thread exit, a little accounting process could go on to bring threads.current
> up to threads.min, this might require more than one thread creation, maybe
> not.
>
fully agreed.
i continued a little on the isue and commited a patch
to CVS, which respects
maxconns always. Instead of performing
in boundary situations in the worst case more than the
configured maxconns requests, the code creates now
a new connection thread automatically after the exit of a thread
coming to the end of it work cycle, when jobs are pending
and no other thread is able to process these.
i made as well a small variant of this (not in CVS), which
defines a new command NsCheckPools() in pools.c
which iterates over pools and starts a thread in the
same situation as above.
NsCheckPools could be extended to check for the existance of
minthreads etc. However, for this patch, i wanted to be
as little invasive as necessary, and added the thread-revival
code to the place, where a thread exits.
For the new thread generation, i had to parameterize
NsCreateConnThread() to avoid a resource deadlock
in Ns_ThreadJoin().
> In your patch, you change the while loop test:
> + while (poolPtr->threads.maxconns <= 0
> + || ncons-- > 0
> + || poolPtr->queue.wait.num > 1) {
>
> Shouldn't the loop continue with poolPtr->queue.wait.num > 0 ?
>
0 or 1, both is fine here. the condition trigger for the new cases only
when > 1 is used.
> Your patch looks like a great fix...I just still don't understand why the
> server would completely lock up. As long as you have requests coming in,
> seems like a new thread would get created. I wonder if what is happening is
> that Apache Bench simple stops sending new requests because none of the
> requests are finishing. If it can block, I wonder if simply visiting with a
> browser would kick things off again?
>
one can use the browser to kick things off again, but if there are
already say 50 requests
in the queue, and the browser hangs in the first request. The bad thing
is, that in the
meantime, some other bulky requests are queued as well. So the queue
might be
acutally permanently growing.
for my testing, i use:
ab -c 20 -n 35 http://localhost:8003/file-storage/...otlrn-2.3.1.tgz
with maxthreads 5 and maxconns 3 I get reliable hangs, where
the last thread exists with about 5 to 15 queued requests. Without the
recent patches, the queue is processed until it is empty.
btw, the reason, when the patch helps, is no miracle. It is completely
clear why the server hangs. This bug is not a classical deadlock (it
is possible that the queued reuests are processed), but shares some
properties of a life lock (some other requests prevent the processing
of some other requests, at least for a while). It is not a race condition
either. I don't believe that TriggerDriver() is the right place to
handle the problem, since the driver is happily accepting requests
in the bug situation. It can certainly be, that the fixed bug is different
from the bug the Jeff Rogers fixed originally.
-gustaf neumann
PS: i am pretty sure that this is the same bug as on openacs.org.
Around the time of the bug, somebody in spain was apparently
benchmarking his internet connection, downloading dotlrn*.tar*
in multiple parallel sessions. Once i saw that the user
was trying to download 1200 copies of dotlrn.tar in about 10 minutes.
| |
| Tom Jackson 2007-10-21, 7:11 pm |
| Hey,
(Note: none of my changes have been committed, they are not complete or
integrated with Gustaf's code yet.)
I also looked more at this, the idea of triggering the accounting process at
the end of a thread's life seems very good. One real problem with this setup
is that there is no independent thread which can be triggered to do a pool
maintenance cycle. Until now this has been characterized by several bad
behaviors which leads to poor performance and too much thread exiting on
timeouts. If a timeout exists, any lull in activity will expire all threads.
I haven't tried Gustaf's patch yet because I have been working on the
accounting portion of the problem. Here are some of the things I have tried,
and the apparent results:
1. Added a new counter to the Pool structure to signal the number of threads
to create. (poolPtr->threads.create)
2. A connection thread incrs this under a certain conditions:
a. if (ncons == 1 && poolPtr->threads.maxconns > 0 ) {
++poolPtr->threads.create;
}
However, this leads to slightly more threads sometimes than maxthreads,
Usually, the number of threads exactly equals minthreads due to other
changes. Maxthreads is exceeded if you use a low number of maxconns.
b. Thread timeout is moved inside the wait loop, and recalculated on each
cycle:
while (!poolPtr->shutdown
&& status == NS_OK
&& poolPtr->queue.wait.firstPtr == NULL) {
/*
* Wait for a connection to arrive, exiting if one doesn't
* arrive in the configured timeout period.
*/
if (poolPtr->threads.current -1 <= poolPtr->threads.min) {
timePtr = NULL;
} else {
Ns_GetTime(&wait);
Ns_IncrTime(&wait, poolPtr->threads.timeout, 0);
timePtr = &wait;
}
status = Ns_CondTimedWait(&poolPtr->cond, &poolPtr->lock, timePtr);
}
Also, the conditions for infinite wait are slightly changed.
c. If a thread times out, a check is made if this will push the min threads
too low, and an additional create is added. It seems possible that a
single thread could signal the creation of two threads during it's
lifetime, but this may just be making up for threads which exit without
signalling a new thread. This is probably a mistake. What I have observed
is that this doesn't prevent threads from exiting below minthreads if
there is a lull in activity, but create still has the accounting of these
and if a single request comes in, they are all created. Then they may
timeout again.
Probably timed out threads should not be replaced, but a way to keep
timed out threads from exiting if this pushes below minthreads doesn't
exist yet. So this will be removed, but for now it allows a little probe
as to the thread state and driver thread state.
if (poolPtr->queue.wait.firstPtr == NULL) {
msg = "timeout waiting for connection";
if (poolPtr->threads.current -1 <= poolPtr->threads.min) {
++poolPtr->threads.create;
}
break;
}
3. The number of queued requests was unknown before. poolPtr->threads.queued
only increased. I added a line to reduce this number when requests are
dequeued. This made it very easy to discover how things are working between
the driver thread and the connection threads.
For instance, if a heavy load of requests suddenly ends, there appear to be
leftover queued requests. Using netstat, you can see these in CLOSE_WAIT.
These appear to be requests which were abandoned (by apache bench), but the
driver thread or connection threads are stuck, and they remain in this state
until another (even single) request comes in.
So another difference with this code appears to be that the queued, but
abandoned requests are not being serviced anymore. It may be that the driver
thread gets triggerd one more time and cleans up the mess. I still have to
verify this, but it would be good if true.
4. Gustaf identified another problem with the code: the thread creation proc
includes a thread join. This makes it impossible to use inside a connection
thread. There is actually no reason to do the thread join in a the thread
create proc. Only the driver thread can do this, or the pool exit code. So it
should be removed from where it is and placed in the NsQueueConn proc, just
before thread creation is done.
My code doesn't yet reflect this idea, but I have added another proc to create
multiple threads. It does the accounting and then loops while
poolPtr->threads.create-- > 0:
void
NsCreateConnThreads(Pool *poolPtr)
{
/*
* Reap any dead threads. NOTE: Move this to NsQueueConn
*/
NsJoinConnThreads();
/*
* Create new connection threads.
*/
Ns_MutexLock(&poolPtr->lock);
if (poolPtr->threads.queued > 1) {
poolPtr->threads.create += 1;
}
if ( poolPtr->threads.current < poolPtr->threads.min ) {
poolPtr->threads.create += poolPtr->threads.min - poolPtr->threads.current;
}
while (poolPtr->threads.create-- > 0) {
NsCreateConnThread(poolPtr);
++poolPtr->threads.idle;
++poolPtr->threads.current;
}
Ns_MutexUnlock(&poolPtr->lock);
Ns_CondBroadcast(&poolPtr->cond);
}
It isn't obvious how to create more than minthreads. All that is known, as far
as I can tell, is that if the number of queued requests is greater than 1, an
additional thread should be created for the incoming request. A queue with
more than one request waiting means that all threads are busy, and there is
still no guarantee that all of them will not exit and leave the current
request waiting. This may push the number of threads above maxthreads by 1.
Weirdly, it is easy to maintain minthreads under continuous load. The above
code, perfectly calculates the deficit when there are no threads exiting due
to a timeout. This is an important difference between the previous code which
was thread starved after thread exit. Only new requests could boost the
number up, and only by 1.
Another addition is the final line: Ns_CondBroadcast. I haven't tried this
inside the mutex yet, it is supposed to be okay outside, but behavior isn't
guaranteed to be consistent.
I still haven't considered how (or if) to integrate this with Gustaf's latest
code. His code appears to minimize resources, taking care to replace any
thread which exits when there is still work to be done. Due to my changes,
poolPtr->queue.wait.num = poolPtr->threads.queued. Maybe I should have used
queue.wait.num, but the fact that they are the same (or should be) is good to
know. (poolPtr->threads.queued is printed with the ns_pools get command).
My code has the potential to exceed maxthreads while a thread services it's
last request. In practice, this doesn't happen unless maxconns is set low and
requests take a long time. Essentially this jumpstarts the thread creation by
a little bit, but even Gustaf's code creates a thread prior to exit, so the
difference may not be that large (or it might be!). It isn't clear to me yet
that if a thread exits or starts if this affects the memory use. Considering
the fact that pool threads can be shared with different servers, interps must
hang around with the bulk of the bytes even on thread exit.
One thing I'm not too sure about (due completely to ignorance on my part) is
this part of Gustaf's code:
505 Ns_MutexUnlock(&poolPtr->lock);
506
507 if (poolPtr->queue.wait.num > 0 && poolPtr->threads.idle == 0
&& !poolPtr->shutdown) {
508 /* We are exiting from a thread in a situation, where more
509 queue entries are waiting. Since no other mechanism ensures
510 that the entries are processed, we recreate a new connection
thread.
511 */
512 Ns_MutexLock(&poolPtr->lock);
513 poolPtr->threads.current++;
514 poolPtr->threads.idle++;
515 Ns_MutexUnlock(&poolPtr->lock);
516 NsCreateConnThread(poolPtr, 0); /* joinThreads == 0 to avoid
deadlock */
517 }
The if statement on line 507 checks the Pool structure outside of a mutex
lock. I don't believe there is any guarantee that we know what is going to
happen next. These conditions could change before the code has a chance to
grab a mutex.
tom jackson
On Saturday 20 October 2007 05:35, Gustaf Neumann wrote:
> Tom Jackson schrieb:
>
> fully agreed.
>
> i continued a little on the isue and commited a patch
> to CVS, which respects
> maxconns always. Instead of performing
> in boundary situations in the worst case more than the
> configured maxconns requests, the code creates now
> a new connection thread automatically after the exit of a thread
> coming to the end of it work cycle, when jobs are pending
> and no other thread is able to process these.
>
> i made as well a small variant of this (not in CVS), which
> defines a new command NsCheckPools() in pools.c
> which iterates over pools and starts a thread in the
> same situation as above.
> NsCheckPools could be extended to check for the existance of
> minthreads etc. However, for this patch, i wanted to be
> as little invasive as necessary, and added the thread-revival
> code to the place, where a thread exits.
>
> For the new thread generation, i had to parameterize
> NsCreateConnThread() to avoid a resource deadlock
> in Ns_ThreadJoin().
>
>
> 0 or 1, both is fine here. the condition trigger for the new cases only
> when > 1 is used.
>
>
> one can use the browser to kick things off again, but if there are
> already say 50 requests
> in the queue, and the browser hangs in the first request. The bad thing
> is, that in the
> meantime, some other bulky requests are queued as well. So the queue
> might be
> acutally permanently growing.
>
> for my testing, i use:
>
> ab -c 20 -n 35 http://localhost:8003/file-storage/...otlrn-2.3.1.tgz
>
> with maxthreads 5 and maxconns 3 I get reliable hangs, where
> the last thread exists with about 5 to 15 queued requests. Without the
> recent patches, the queue is processed until it is empty.
>
> btw, the reason, when the patch helps, is no miracle. It is completely
> clear why the server hangs. This bug is not a classical deadlock (it
> is possible that the queued reuests are processed), but shares some
> properties of a life lock (some other requests prevent the processing
> of some other requests, at least for a while). It is not a race condition
> either. I don't believe that TriggerDriver() is the right place to
> handle the problem, since the driver is happily accepting requests
> in the bug situation. It can certainly be, that the fixed bug is different
> from the bug the Jeff Rogers fixed originally.
>
> -gustaf neumann
>
> PS: i am pretty sure that this is the same bug as on openacs.org.
> Around the time of the bug, somebody in spain was apparently
> benchmarking his internet connection, downloading dotlrn*.tar*
> in multiple parallel sessions. Once i saw that the user
> was trying to download 1200 copies of dotlrn.tar in about 10 minutes.
>
>
> --
> AOLserver - http://www.aolserver.com/
>
> To Remove yourself from this list, simply send an email to
> <listserv@listserv.aol.com> with the body of "SIGNOFF AOLSERVER" in the
> email message. You can leave the Subject: field of your email blank.
| |
| Gustaf Neumann 2007-10-21, 7:11 pm |
| Tom Jackson schrieb:
> The if statement on line 507 checks the Pool structure outside of a mutex
> lock.
>
Good observation.
The code was not correct, although the potential damage was small.
Fixed by now.
-gustaf
| |
| Tom Jackson 2007-10-22, 1:11 am |
| I've been working on this more with some effort to get the number of
connection threads to actually stay somewhere close to the range of
minthreads to maxthreads.
But there was something a little more interesting to track down: what happens
when a thread exits? I think there is some idea that the reason for exiting a
thread is to remove memory leaks which might occur. But the question is, if
threads are shared among servers, and threads exit on a number of
connections, what else happens when the thread exits?
Apparently, all the interps associated with the thread exit. I haven't seen
any evidence against this conclusion. What this means is that thread exit is
expensive. It could take down multiple interps (one for each server sharing
the thread).
This impacts the current discussion because I'm having trouble finding an easy
way to avoid thread exit because of a timeout. But this information seems to
simplify the problem. Why exit a thread just because of a timeout? I can't
see any possible advantage. Instead, every effort should be directed at
keeping threads alive until maxconns is reached. Also, shared threadpools
should be limited to the default and error pools, otherwise a simple server
could quickly expire interps of a more expensive server.
tom jackson
On Sunday 21 October 2007 14:47, Gustaf Neumann wrote:
> Tom Jackson schrieb:
>
> Good observation.
> The code was not correct, although the potential damage was small.
> Fixed by now.
| |
| Gustaf Neumann 2007-10-22, 7:11 am |
| Tom Jackson schrieb:
> But there was something a little more interesting to track down: what happens
> when a thread exits? ...
>
> Apparently, all the interps associated with the thread exit.
certainly, every worker thread has its own tcl interpreter associated.
When the interpreter
exits, the tcl interpreter with all its state exists as well.
> This impacts the current discussion because I'm having trouble finding an easy
> way to avoid thread exit because of a timeout. But this information seems to
> simplify the problem. Why exit a thread just because of a timeout?
i don't understand. what timeouts exactly are you taking about? In the
code are several timeouts for
various purposes...
-gustaf neumann
| |
| Tom Jackson 2007-10-22, 1:11 pm |
| On Sunday 21 October 2007 23:47, Gustaf Neumann wrote:
> i don't understand. what timeouts exactly are you taking about? In the
> code are several timeouts for
> various purposes...
The timeout associated with pool threads.
If a thread times out waiting for a request, it exits. But in AOLserver, you
don't really reclaim any memory, so unless there is some other reason to
exit, it doesn't seem like a very useful parameter, when performance is a
consideration.
My only point here was that I'm going to stop looking at the timeout
parameter, and timed out threads as an issue. If threads timeout, the number
of threads in a threadpool will drop below minthreads, usually to zero.
tom jackson
| |
| Andrew Piskorski 2007-10-22, 1:11 pm |
| On Mon, Oct 22, 2007 at 06:30:40AM -0700, Tom Jackson wrote:
> My only point here was that I'm going to stop looking at the timeout
> parameter, and timed out threads as an issue. If threads timeout, the number
> of threads in a threadpool will drop below minthreads, usually to zero.
But isn't that a bug? If not, just what is "minthreads" supposed to
really mean?
--
Andrew Piskorski <atp@piskorski.com>
http://www.piskorski.com/
| |
| Tom Jackson 2007-10-22, 1:11 pm |
| Andrew,
It might be a bug, and I just discovered another one related to this.
If -timeout is set to zero or less, the value is still used, but it should
probably be set to null, meaning infinite timeout:
403 if (poolPtr->threads.current <= poolPtr->threads.min) {
404 timePtr = NULL;
405 } else {
For one, I've moved this check inside the following while loop so it gets a
chance to change on each cycle, but I also made these modifications:
.. if (poolPtr->threads.timeout <= 0
.. || poolPtr->threads.current -1 <= poolPtr->threads.min) {
.. timePtr = NULL;
.. } else {
The behavior I see with apache bench is that after a bunch of requests are
finished, most of the threads exit, if there is a timeout. However, this may
be a result of the fact that apache bench sends extra requests and then
disconnects them. The problem is that each thread calculates for itself what
to do, and there is no way for them to know what other threads are going to
do within the next cycle.
I've started looking at JMeter, because apache bench doesn't really do what
you ask it to do, but maybe that is good. You discover things you weren't
testing for. For one, even with concurrency set at some number, fast requests
will lead apache bench to send another request before cleanup is complete on
AOLserver. AOLserver thinks the concurrency is higher than ab, up to double
the concurrency. I know this because I'm testing a boost factor for
minthreads. On large requests, threads.current never reaches the concurrency
level of ab, but on small requests, it can exceed it, up to maxthreads-2.
With my latest changes to my setup, the threadpool moderates to something
else: with high concurrency, the idle threads value tends to zero. If
concurrency is less than maxthreads, the queued requests tends to zero.
Eventually, if concurrency is less than maxthreads, both of these values tend
to zero, they bounce around a low number and zero, so all threads are busy
and the queue is empty. Otherwise the trend appears to be
current+queued=concurrency and idle=0.
tom jackson
On Monday 22 October 2007 07:48, Andrew Piskorski wrote:
> On Mon, Oct 22, 2007 at 06:30:40AM -0700, Tom Jackson wrote:
>
> But isn't that a bug? If not, just what is "minthreads" supposed to
> really mean?
| |
| Tom Jackson 2007-10-22, 1:11 pm |
| Andrew,
Maybe I should have said that I'll look at the issue, but not as a performance
issue. The bug aspect may be simply allowing a timeout. Otherwise finding a
solution would probably require a thread to say that it is going to sleep
until a connection arrives. Then other threads can figure out if they can
exit on timeout without pushing current below min.
On Monday 22 October 2007 07:48, Andrew Piskorski wrote:
> On Mon, Oct 22, 2007 at 06:30:40AM -0700, Tom Jackson wrote:
>
> But isn't that a bug? If not, just what is "minthreads" supposed to
> really mean?
| |
| Gustaf Neumann 2007-10-22, 7:11 pm |
| Andrew Piskorski schrieb:
>
>
> But isn't that a bug? If not, just what is "minthreads" supposed to
> really mean?
>
you are right, if someone specified minthreads, they should be maintained.
i have added an additional condition to the thread recreation to
maintain the specified minimal number of threads.
maybe Tom has something more sophisticated in mind. this is again
a minimal invasive fix, similar to the fix for assuring the processing
of the queued requests, when all connection threads exit.
for more details, see:
http://aolserver.cvs.sourceforge.ne...ueue.c?view=log
best regards
-gustaf neumann
| |
| Tom Jackson 2007-10-23, 7:11 pm |
| I've been working on a slightly different angle here, trying to avoid thread
exit, and to allow threads to range usefully between min and max.
The current code in CVS has a number of problems which lead to the server not
creating more threads when there are queued requests and the number of
threads is below max (sometimes below min!).
Now my test code has solved most of these issues.
1. threads go up very quickly to with load increase.
2. threads go down when load goes down (if queue is kept empty) to slightly
below concurrency.
3. if maxthreads is lowered, threads slowly go down to meet the new value.
There are still a few things I'd like to look at, so I'll hold off on any
commit for now.
tom jackson
On Monday 22 October 2007 16:04, Gustaf Neumann wrote:
> Andrew Piskorski schrieb:
>
> you are right, if someone specified minthreads, they should be maintained.
> i have added an additional condition to the thread recreation to
> maintain the specified minimal number of threads.
> maybe Tom has something more sophisticated in mind. this is again
> a minimal invasive fix, similar to the fix for assuring the processing
> of the queued requests, when all connection threads exit.
>
> for more details, see:
> http://aolserver.cvs.sourceforge.ne.../queue.c?view=l
>og
>
> best regards
> -gustaf neumann
>
>
> --
> AOLserver - http://www.aolserver.com/
>
> To Remove yourself from this list, simply send an email to
> <listserv@listserv.aol.com> with the body of "SIGNOFF AOLSERVER" in the
> email message. You can leave the Subject: field of your email blank.
| |
| Gustaf Neumann 2007-10-24, 7:11 am |
| Tom Jackson schrieb:
> I've been working on a slightly different angle here, trying to avoid thread
> exit, and to allow threads to range usefully between min and max.
>
> The current code in CVS has a number of problems which lead to the server not
> creating more threads when there are queued requests and the number of
> threads is below max (sometimes below min!).
>
concerning the code in cvs head:
the server creates now worker threads when a
- thread exits and more requests are queued,
and no other thread is idle, or when a
- thread exits and the number of connection threads
falls below minthreads,
both cases unless shutdown. The handled thread
exists are typically due to exceeding maxconns
or the idle timeout.
This is what my tests show. What are the problems? Or are
you taking about the version in cvs before these changes?
Of course, before these changes, minthreads was respected
only at startup, queued requests were not processed.
My first solution was based on a strategy, where a
thread did not exit, when there was more to do (more
queued requests), allthough it should exit due to maxconns.
However, a user specifying maxconns might have good
reasons for doing so, therefore the value must be respected.
Avoiding exit in this situation is not viable option IMHO.
It can be discussed, whether the idle timeout should be
obeyed e.g. in situations where number of threads is
below minthreads. In the code in cvs the thread exits
and is recreated as in other cases, when it falls below
minthreads. It would be more efficient to avoid the
timeout in such situations, but one might end up
with quite old worker threads, which are more
sensitive to growth.
-gustaf neumann
| |
| Tom Jackson 2007-10-24, 1:11 pm |
| On Wednesday 24 October 2007 01:07, Gustaf Neumann wrote:
> concerning the code in cvs head:
> the server creates now worker threads when a
> - thread exits and more requests are queued,
> and no other thread is idle, or when a
> - thread exits and the number of connection threads
> falls below minthreads,
> both cases unless shutdown. The handled thread
> exists are typically due to exceeding maxconns
> or the idle timeout.
>
> This is what my tests show. What are the problems? Or are
> you taking about the version in cvs before these changes?
> Of course, before these changes, minthreads was respected
> only at startup, queued requests were not processed.
>
I've pointed out several times that I'm looking at the operation of the queue
from the point of view of performance (both economy and responsiveness), not
the specific numbers (min, max, timeout, maxconns) being maintained. Also,
I'm looking at the impact of choosing different combinations of numbers.
So for instance, if Andrew thinks it is a bug for minthreads to be violated,
it is best to not use a thread timeout. That fixes the issue instantly.
I don't think the operation of the queue has to hold your hand to keep you
from making bad decisions. But lets address this a little more carefully
before rushing in to fix something which may logically appear to be a bug,
but probably isn't one. And also we need to look at the solution. I'll start
with yours.
If minthreads is taken as a sacred number, then your solution doesn't cut it.
The thread exits and then recreates another one. This is merely cosmetic, and
it only appears correct if you don't notice that the number actually went
below minthreads.
If there was a correct solution, it would prevent the violation of this lower
limit, not fix it up after it happens.
As an example, just look at the original code. How did threads escape exit at
startup if there is a timeout?
388 if (poolPtr->threads.current <= poolPtr->threads.min) {
389 timePtr = NULL;
390 } ...
So there was already some thought at preventing thread exit based upon a
timeout, if minthreads would be violated. So the question is: why isn't this
code working?
There are a number of bugs in this code, and fixing it up at the end, after a
thread has exited doesn't remove the actual bugs.
An individual thread sits in a while loop waiting for a queued request this
loop, and the following check is currently:
status = NS_OK;
while (!poolPtr->shutdown
&& status == NS_OK
&& poolPtr->queue.wait.firstPtr == NULL) {
/*
nothing is queued, we wait for a queue entry
*/
status = Ns_CondTimedWait(&poolPtr->cond, &poolPtr->lock, timePtr);
}
if (poolPtr->queue.wait.firstPtr == NULL) {
msg = "timeout waiting for connection";
break;
}
Status starts out as NS_OK, the entire while loop is skipped if there is a
waiting request, poolPtr->queue.wait.firstPtr != NULL.
Once the wait is done, the conditions are again checked, the loop is skipped
if status becomes NS_TIMEOUT, or if there is a waiting request (or shutdown).
The problem is that we made the decision to exit on timeout before we knew if
exiting would violate the minthread condition.
So we could do a wait, then again check if we will violate this condition, and
if so, update status to NS_OK, and avoid unnecessary exit. If we move the
timeout code inside the loop, we also avoid repeating/executing this anywhere
else.
But there is another bug/error in the code. You can set timeout to <= 0. In
this case, the code happily sets the timeout (as a real future time), but
will immediately exit in wait.
I've fixed the last bug in my code, but I haven't yet focused on the timeout
issue as I'm benchmarking the current cvs code.
Current CVS:
Running the latest code, I discovered the same issue that my code had, but has
been eliminated. If maxthreads is reduced [ns_pools set mypool -maxthreads x]
the number of threads doesn't decrease. It also doesn't go down based upon
load. Instead it goes immediately to original maxthreads and stays there.
I'll need to be more exact in my descriptions later, but basically, the queue
operation goes to max, which is better than it was before, when it was hard
to get even up to min. However, as an overall performance goal, threadpools
should moderate their behavior, as activity may move from one threadpool to
another. If everything stays at max, this isn't achieved.
The original problem is easy to describe now that I have done a bunch of
testing: build up a large enough queue so that all remaining threads will not
be able to service them, then stop sending requests. This is what was
happening under Apache Bench. Send a 1000 requests, as long as concurrency is
a little bigger than current threads, things get stuck. It doesn't
necessarily appear to be a thread exit issue. Other problems also show up.
The driver thread also does sock cleanup, which gets stuck. These stuck socks
are dropped connections, they just appear to be in a queue. In fact, they are
not waiting to be serviced, the sock is just waiting for cleanup.
Another issue was the fact that the driver thread was not doing a
condBroadcast on thread create, and if thread create is skipped, it just does
a condSignal. Apparently it is not a good idea to just do condSignal. I
haven't checked what happens if this is changed, but already my queue works
very well. I haven't seen any stuck requests that are not dropped (but I'm
still looking).
The current code in cvs makes it difficult to view the operation of the queue.
The main problem is that threads.queued is a cumulative total, a totally
pointless value. queue.wait.num has the correct value, but this is
unavailable (and should be) to the ns_pools command. Without the ability to
query this information while benchmarking it is difficult to tell what is
going on. The biggest problem is assuming that using apache bench to send 10
concurrent requests means that AOLserver is only queueing 10 (at most)
requests at one time. However this is not necessarily true. First, a request
might come in faster than a thread can recycle. This is one reason that the
current CVS code pushes threads.current all the way up to maxthreads. But I
have also been experimenting with [ns_conn channel] to send the request to a
worker thread to relieve the connection thread. When this is used, queued
requests can build up much higher values.
tom jackson
> My first solution was based on a strategy, where a
> thread did not exit, when there was more to do (more
> queued requests), allthough it should exit due to maxconns.
> However, a user specifying maxconns might have good
> reasons for doing so, therefore the value must be respected.
> Avoiding exit in this situation is not viable option IMHO.
>
> It can be discussed, whether the idle timeout should be
> obeyed e.g. in situations where number of threads is
> below minthreads. In the code in cvs the thread exits
> and is recreated as in other cases, when it falls below
> minthreads. It would be more efficient to avoid the
> timeout in such situations, but one might end up
> with quite old worker threads, which are more
> sensitive to growth.
>
> -gustaf neumann
>
>
> --
> AOLserver - http://www.aolserver.com/
>
> To Remove yourself from this list, simply send an email to
> <listserv@listserv.aol.com> with the body of "SIGNOFF AOLSERVER" in the
> email message. You can leave the Subject: field of your email blank.
| |
| Tom Jackson 2007-10-24, 7:11 pm |
| Okay, there is a slight bug here in the new cvs code.
What is happening is that sometimes, the driver thread and a connection
thread are creating threads at the same time. I've added a few log statements,
which makes the bug show up less often, but I can easily reproduce it in a
few minutes at most (7-100k requests).
Here is the instrumented proc, minus the code comments:
void
NsCreateConnThread(Pool *poolPtr, int joinThreads)
{
ConnData *dataPtr;
Ns_Log(Notice, "Creating Threads, joinThreads = %i", joinThreads);
if (joinThreads) {
NsJoinConnThreads();
}
dataPtr = ns_malloc(sizeof(ConnData));
dataPtr->poolPtr = poolPtr;
dataPtr->connPtr = NULL;
Ns_ThreadCreate(NsConnThread, dataPtr, 0, &dataPtr->thread);
Ns_Log(Notice, "Done Creating Threads");
}
Also I've added a log just before a thread is going to create a new thread to replace itself (from the NsConnThread)
poolPtr->threads.current++;
poolPtr->threads.idle++;
Ns_Log(Notice, "About to exit and create new conn thread");
Ns_MutexUnlock(&poolPtr->lock);
NsCreateConnThread(poolPtr, 0);
The logs which lead up to fatal 11:
[24/Oct/2007:14:20:20.924295][26946.1126394176][-default:36551-] Notice: Added thread to join list for thread exit
[24/Oct/2007:14:20:20.924335][26946.1126394176][-default:36551-] Notice: exiting: exceeded max connections per thread
[24/Oct/2007:14:20:20.925328][26946.1121110336][-nssock3:driver-] Notice: Closing Sock in SockClose
[24/Oct/2007:14:20:20.925450][26946.1121110336][-nssock3:driver-] Notice: Creating Threads, joinThreads = 1
[24/Oct/2007:14:20:20.925508][26946.1121110336][-nssock3:driver-] Notice: Done Creating Threads
[24/Oct/2007:14:20:20.926428][26946.1121110336][-nssock3:driver-] Notice: Closing Sock in SockClose
[24/Oct/2007:14:20:20.927418][26946.1121110336][-nssock3:driver-] Notice: Closing Sock in SockClose
[24/Oct/2007:14:20:20.928339][26946.1121110336][-nssock3:driver-] Notice: Closing Sock in SockClose
[24/Oct/2007:14:20:20.929239][26946.1121110336][-nssock3:driver-] Notice: Closing Sock in SockClose
[24/Oct/2007:14:20:20.930144][26946.1121110336][-nssock3:driver-] Notice: Closing Sock in SockClose
[24/Oct/2007:14:20:20.930271][26946.1181346112][-default:36554-] Notice: Added thread to join list for thread exit
[24/Oct/2007:14:20:20.930293][26946.1181346112][-default:36554-] Notice: About to exit and create new conn thread
[24/Oct/2007:14:20:20.930301][26946.1181346112][-default:36554-] Notice: Creating Threads, joinThreads = 0
[24/Oct/2007:14:20:20.931191][26946.1121110336][-nssock3:driver-] Notice: Closing Sock in SockClose
[24/Oct/2007:14:20:20.932100][26946.1121110336][-nssock3:driver-] Notice: Closing Sock in SockClose
[24/Oct/2007:14:20:20.932986][26946.1121110336][-nssock3:driver-] Notice: Closing Sock in SockClose
[24/Oct/2007:14:20:20.933628][26946.1130621248][-default:36553-] Notice: Added thread to join list for thread exit
[24/Oct/2007:14:20:20.933650][26946.1130621248][-default:36553-] Notice: exiting: exceeded max connections per thread
[24/Oct/2007:14:20:20.936221][26946.1126394176][-default:36555-] Notice: Added thread to join list for thread exit
[24/Oct/2007:14:20:20.936242][26946.1126394176][-default:36555-] Notice: exiting: exceeded max connections per thread
[24/Oct/2007:14:20:20.937692][26946.1121110336][-nssock3:driver-] Notice: Creating Threads, joinThreads = 1
[24/Oct/2007:14:20:20.937722][26946.1121110336][-nssock3:driver-] Fatal: received fatal signal 11
Summary:
thread 36554: adds itself to join list
thread 36554: about to exit mutex and run NsCreateConnThread
thread 36554: At top of NsCreateConnThread
thread 36554: sleeping
driver thred: closing a sock, three times
thread 36553: adds itself to join list, then exits (Doesn't create thread)
thread 36555: adds itself to join list, then exits (Doesn't create thread)
driver thred: At top of NsCreateConnThread
driver thred: receives signal 11.
Regardless of the actual cause of this bug, this is very illustrative of
how quickly action switches from one thread to another.
I'm guessing that since the original code only had the driver creating threads,
there might be something unprotected, or maybe something is deleted, then assumed
to exist.
I kind of stumbled upon this when trying to test a pool setup like this:
ns_pools set default -minthreads 1 -maxthreads 5 -maxconns 5 -timeout 10
Then I ran ab on a 28k static file (very fast to serve):
ab -n 120000 -c 120 http://192.168.1.102:8000/bgwrite/sns-thumb.jpg
With higher -maxconns I doubt this bug would show up in real life.
tom jackson
On Wednesday 24 October 2007 10:10, Tom Jackson wrote:
> I've fixed the last bug in my code, but I haven't yet focused on the
> timeout issue as I'm benchmarking the current cvs code.
| |
| Gustaf Neumann 2007-10-27, 1:11 am |
| Tom Jackson schrieb:
> I've pointed out several times that I'm looking at the operation of the queue
> from the point of view of performance (both economy and responsiveness), not
> the specific numbers (min, max, timeout, maxconns) being maintained.
....
> So for instance, if Andrew thinks it is a bug for minthreads to be violated,
> it is best to not use a thread timeout. That fixes the issue instantly.
>
to obey minthreads helps to improve performance, since when all
connection threads are
gone, there will be a delay for thread creation upon a new request. I
agree with
Andrew that it is a bug if - as it was before - minthreads are only
available at server start,
and disappear as soon the threads timeout.
> If minthreads is taken as a sacred number, then your solution doesn't cut it.
> The thread exits and then recreates another one. This is merely cosmetic, and
> it only appears correct if you don't notice that the number actually went
> below minthreads.
>
> If there was a correct solution, it would prevent the violation of this lower
> limit, not fix it up after it happens.
>
i do not agree here. maxconns should be obeyed as well. If a thread has
served the
specified number of items, it should exit. If the number of threads is
reduced
below minthreads, a fresh thread should be created. Keeping the thread
instead
alife is not correct. If one does not like the behavior, one could
specify maxconns=9999999999.
> As an example, just look at the original code. How did threads escape exit at
> startup if there is a timeout?
>
> 388 if (poolPtr->threads.current <= poolPtr->threads.min) {
> 389 timePtr = NULL;
> 390 } ...
>
>
> So there was already some thought at preventing thread exit based upon a
> timeout, if minthreads would be violated. So the question is: why isn't this
> code working?
>
simply because it does not consider that maxconn cycles may lead as well
to thread exits.
> There are a number of bugs in this code, and fixing it up at the end, after a
> thread has exited doesn't remove the actual bugs.
>
> An individual thread sits in a while loop waiting for a queued request this
> loop, and the following check is currently:
>
> status = NS_OK;
> while (!poolPtr->shutdown
> && status == NS_OK
> && poolPtr->queue.wait.firstPtr == NULL) {
> /*
> nothing is queued, we wait for a queue entry
> */
> status = Ns_CondTimedWait(&poolPtr->cond, &poolPtr->lock, timePtr);
> }
>
> if (poolPtr->queue.wait.firstPtr == NULL) {
> msg = "timeout waiting for connection";
> break;
> }
>
> Status starts out as NS_OK, the entire while loop is skipped if there is a
> waiting request, poolPtr->queue.wait.firstPtr != NULL.
>
this is fine, since the waiting request can be immediately processed
without the
condwait.
> Once the wait is done, the conditions are again checked, the loop is skipped
> if status becomes NS_TIMEOUT, or if there is a waiting request (or shutdown).
>
> The problem is that we made the decision to exit on timeout before we knew if
> exiting would violate the minthread condition.
>
> So we could do a wait, then again check if we will violate this condition, and
> if so, update status to NS_OK, and avoid unnecessary exit. If we move the
> timeout code inside the loop, we also avoid repeating/executing this anywhere
> else.
>
this won't get rid of the exits due to maxconns. The code after your
snippet assumes, that
it will be reached only when there is a request ready for processing.
> This is what was
> happening under Apache Bench. Send a 1000 requests, as long as concurrency is
> a little bigger than current threads, things get stuck. It doesn't
> necessarily appear to be a thread exit issue. Other problems also show up.
> The driver thread also does sock cleanup, which gets stuck. These stuck socks
> are dropped connections, they just appear to be in a queue. In fact, they are
> not waiting to be serviced, the sock is just waiting for cleanup.
>
I have noticed this as well, but have not tried to fix this. Sometimes
the driver
stops queueing requests, although some connection threads are idling around.
Maybe this is the problem that jeff rogers had already fixed for 4.0?
> Another issue was the fact that the driver thread was not doing a
> condBroadcast on thread create, and if thread create is skipped, it just does
> a condSignal. Apparently it is not a good idea to just do condSignal.
there was another problem in the book-keeping of idle etc. which i have
fixed
today (committed to cvs). The problem was that under heavy concurrency,
NsQueueConn() did not know if there are conneciton threads in a cond
wait or not. A condSignal is the right thing. It does not make sense to
broadcast and wake up multiple waiting threads for a single request.
i was not able to recreate the crash that you have reported in the other
mail.
However, with ab with small files and pure aolserver (without openacs),
i could verify that some of the counters were wrong. The decision to
create a thread or not in NsQueueConn() was based on a counter that
was wrong, since it did not take into account that some more threads
are already starting in the background. To address this issuess, there
are now two additional variables, namely starting (# of currently
starting threads) and waiting (# of threads in a cond wait).
It would be great if you could test this code again in your environment.
> The current code in cvs makes it difficult to view the operation of the queue.
> The main problem is that threads.queued is a cumulative total, a totally
> pointless value.
well, for statistics, it might make sense.
> queue.wait.num has the correct value, but this is
> unavailable (and should be) to the ns_pools command. Without the ability to
> query this information while benchmarking it is difficult to tell what is
> going on.
it would be simple to return it via pool commands, but that might already
break some applications. but i do agree, that one should return that value.
Is there an objection from the community to return the number of currently
queued requests via "ns_pools get default"?
> The biggest problem is assuming that using apache bench to send 10
> concurrent requests means that AOLserver is only queueing 10 (at most)
> requests at one time.
there is no indication that someone assumes this. The original problem
with the
high number of queued requests is most easily created with slow requests
and
slow starting threads. The number are controlled via limits (defaults
for maxrun
and maxwait is 100).
-gustaf
| |
| Tom Jackson 2007-10-27, 7:11 am |
| On Friday 26 October 2007 17:27, Gustaf Neumann wrote:
> i was not able to recreate the crash that you have reported in the other
> mail.
> However, with ab with small files and pure aolserver (without openacs),
> i could verify that some of the counters were wrong. The decision to
> create a thread or not in NsQueueConn() was based on a counter that
> was wrong, since it did not take into account that some more threads
> are already starting in the background. To address this issuess, there
> are now two additional variables, namely starting (# of currently
> starting threads) and waiting (# of threads in a cond wait).
> It would be great if you could test this code again in your environment.
I'm not sure if you are disputing the bug or not, but I'm somewhat tired of
this. I've been testing various patches for the last few weeks in addition to
my own code. It seems that everyone knows I've been working on this for a few
weeks or more, and everytime I identify a problem, a new patch with a new
problem is placed into CVS.
Each time I try to point out that this isn't a simple issue, but patches show
up with a one dimensional solution or no testing to back them up.
I'll commit my code over the weekend replacing the current code. My fix will
be a bug fix. The bugs are these:
1. unable to respond to load changes (up and down)
2. unable to respond to configuration changes (up and down)
3. poor, incorrect or missing synchronization code.
4. queued doesn't reflect current queued requests.
5. threads exit unnecessarily on timeout, then new threads to replace
6. current patches cause signal 11 under easily repeated conditions.
If anyone wants to stick to the current code, let me know, and I'll just keep
my changes here. My code is well tested with small, large and very large
files, low, and high maxconns, low-high threads, and also using [ns_conn
channel] to send the sock to a worker thread. I run the queue under
continuous loads and vary the threads/maxconns to see the response.
tom jackson
| |
| Gustaf Neumann 2007-10-27, 1:11 pm |
| Tom Jackson schrieb:
> I'm not sure if you are disputing the bug or not, but I'm somewhat tired of
> this. I've been testing various patches for the last few weeks in addition to
> my own code. It seems that everyone knows I've been working on this for a few
> weeks or more, and everytime I identify a problem, a new patch with a new
> problem is placed into CVS.
>
Well, i think, i have identified and pinpointed a couple of bugs, and
you did as well.
Your are probably as well aware, that i am working on the code, so i
commited
over time a couple of improvements based on different load scenarios etc.
Yes, i tried rather to fix the code than to come up with a different
approach
(and was as well surprised the original code has much problems).
> Each time I try to point out that this isn't a simple issue, but patches show
> up with a one dimensional solution or no testing to back them up.
>
> I'll commit my code over the weekend replacing the current code. My fix will
> be a bug fix. The bugs are these:
> 1. unable to respond to load changes (up and down)
> 2. unable to respond to configuration changes (up and down)
> 3. poor, incorrect or missing synchronization code.
> 4. queued doesn't reflect current queued requests.
> 5. threads exit unnecessarily on timeout, then new threads to replace
> 6. current patches cause signal 11 under easily repeated conditions.
>
> If anyone wants to stick to the current code, let me know, and I'll just keep
> my changes here.
well, this list of issues is not complete (e.g. queuing in the driver)
and rather
vague (1-3). For (4) you seem to propose a change in semantics for the
reported
value, for (5) you proposal seems to be "ignore maxconns", (6) is not
easy to repeat,
i was not able to do so.
What makes you believe that (4) is a bug? The total number of processed
queries is certainly interested for e.g. implementing a performance
monitor. If you siltently change the semantics, you will break such
applications. Therefore, it is better to add another variable (e.g.
"currently_queued") than altering something you find not useful.
Btw, with the current code, i get on my home pc (intel imac,
mac os x) up to 5000 to 6000 requests per second on small files
(maxhreads 5) in cases without driver stalls. With maxthreads 10
i get about twice the value (e.g.g 10169.49 [#/sec]).
Certainly, these values are hard to compare since they
depend on OS and hardware.
Server Software: AOLserver/4.5.0
Server Hostname: localhost
Server Port: 8001
Document Path: /hires_image-learn.gif
Document Length: 6339 bytes
Concurrency Level: 10
Time taken for tests: 2.111 seconds
Complete requests: 12000
Failed requests: 0
Broken pipe errors: 0
Total transferred: 78638204 bytes
HTML transferred: 76093356 bytes
Requests per second: 5684.51 [#/sec] (mean)
Time per request: 1.76 [ms] (mean)
Time per request: 0.18 [ms] (mean, across all concurrent requests)
Transfer rate: 37251.64 [Kbytes/sec] received
Connnection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.0 0 2
Processing: 0 1 0.3 1 6
Waiting: 0 1 0.3 1 6
Total: 0 1 0.4 1 6
> My code is well tested with small, large and very large
> files, low, and high maxconns, low-high threads, and also using [ns_conn
> channel] to send the sock to a worker thread. I run the queue under
> continuous loads and vary the threads/maxconns to see the response.
>
note that different load it is not primarily an issue of small and large
files
but more an issue of fast and slow requests (fast requests are typically
small
files, but slow requests might return as well small responses, but require
a many CPU cycles) and as well an issue of slow and fast thread start up
time
(aolserver vs. openacs). I have observed quite different issues in such
situations.
I have no objections in replacing the current code in CVS, as long
- it fixes the issues we have discussed,
- it does not introduce new issues,
- it respects configuration values,
- it is performance-wise no drawback.
best regards
-gustaf neumann
| |
| Tom Jackson 2007-10-27, 7:11 pm |
| On Saturday 27 October 2007 05:22, Gustaf Neumann wrote:
>
> well, this list of issues is not complete (e.g. queuing in the driver)
> and rather
> vague (1-3). For (4) you seem to propose a change in semantics for the
> reported
> value, for (5) you proposal seems to be "ignore maxconns", (6) is not
> easy to repeat,
> i was not able to do so.
>
> I have no objections in replacing the current code in CVS, as long
> - it fixes the issues we have discussed,
> - it does not introduce new issues,
> - it respects configuration values,
> - it is performance-wise no drawback.
You're right, nevermind. I can't guarantee any of these conditions.
We don't need two people overwriting each other's commits.
tom jackson
|
|
|
|
|