Unix Programming - How could this core dump happen?


Author How could this core dump happen?
Kevin Ren

2004-09-22, 9:21 pm

Hello,

I have a program that has produced two strange core dumps. I can't figure out
how this could happen. Can anyone help?

The program was written in C++ and compiled with SUNW CC 5.0.
Platform: Solaris 2.7.


The code suspected of causing the core dump looks like this:

//Start a new thread by creating a SockThread object.
SockThread *pst = new SockThread();

In the last two months, I got two core dumps: one caused by SIGABRT, the
other by SIGSEGV.


Here is pseudocode for SockThread:

// Pseudocode for SockThread

// Thread function passed to pthread_create
void *OSThreadMain( void *pContext ) {
    SockThread *pst = (SockThread *) pContext;
    // Run the thread body; the core dumps sometimes occur in this call
    pst->ThreadEntry();
    return NULL;
}

class SockThread
{
public:
    // Constructor: initializes the object and creates the worker thread
    SockThread() {
        // Do some initialization
        ...

        // Start the worker thread
        StartWorker();
    }

    // The real thread code
    void ThreadEntry() {
        while( true ) {
            // Do something
            ...
        }
    }

private:
    int StartWorker() {
        ... open some sockets with the socket system call
        ... call the poll system call
        ... then
        return OSCreateWorker();
    }

    int OSCreateWorker() {
        // Create the POSIX thread
        pthread_create( &tid, pattr, OSThreadMain, (void *)this );
        ...
    }
};

I examined the core files with dbx; they show the following:

The first one:
t@1833225 (l@1833229) terminated by signal BUS (invalid address alignment)
0xfede2584: ThreadEntry+0x0288: call SockState #Nvariant 1
(/ora04/WS_U2/SUNWspro/bin/../WS6U2/bin/sparcv9/dbx) where
current thread: t@1833225
=>[1] SockThread::ThreadEntry(0xe7878d0, 0x1, 0x0, 0xd599a70, 0x3,
0xc2000004), at 0xfede2584
[2] SockThread::StartWorker(0xe0888d0, 0x0, 0x1, 0xfee2e0c0, 0x1,
0xfee2ca54), at 0xfede2c24
(/ora04/WS_U2/SUNWspro/bin/../WS6U2/bin/sparcv9/dbx)

The second one:
(/ora04/WS_U2/SUNWspro/bin/../WS6U2/bin/sparcv9/dbx) where
current thread: t@148124
[1] __sigprocmask(0x0, 0xf9d95538, 0x0, 0xffffffff, 0xffffffff, 0x0), at
0xfee19dc0
[2] _resetsig(0xfee2ca54, 0x0, 0x0, 0x0, 0xf9d95e3c, 0xf9d95e40), at
0xfee0f3c4
[3] _sigon(0xfee34400, 0xfee34360, 0xf9d95e38, 0xf9d9560c, 0x6, 0xfed4e168),
at 0xfee0eb80
[4] _thrp_kill(0x0, 0x5, 0x6, 0xfee2ca54, 0xf9d95dc0, 0x0), at 0xfee11954
[5] abort(0xfedb5f74, 0xc304c, 0xb, 0xfee2ca54, 0xf9d95664, 0x0), at
0xfed39590
[6] sigSegv(0xb, 0xf9d95bd0, 0xf9d95918, 0xfee2ca54, 0xf9d95e48,
0xf9d95e28), at 0x56904
[7] __libthread_segvhdlr(0xb, 0xf9d95bd0, 0xf9d95918, 0xfee2ca54, 0x0, 0x0),
at 0xfee19378
[8] __sighndlr(0xb, 0xf9d95bd0, 0xf9d95918, 0xfee19298, 0xf9d95e48,
0xf9d95e28), at 0xfee1be20
[9] sigacthandler(0xb, 0xf9d95dc0, 0xf9d95918, 0xfee2ca54, 0xf9d95bd0,
0xf9d95dc0), at 0xfee186e0
---- called from signal handler with signal 11 (SIGSEGV) ------
=>[10] SockThread::ThreadEntry(0x0, 0x1, 0x0, 0xd5b1cd8, 0x3, 0xc2000004),
at 0xfede266c
[11] SockThread::StartWorker(0xeb5e2b0, 0x0, 0x1, 0xfee2e0c0, 0x1,
0xfee2ca54), at 0xfede2c24


When I attach dbx to a running program in its normal state, I get:

(/opt2/SUNWrkShp/SUNWspro/bin/../WS6/bin/sparcv9/dbx) where
current thread: t@473
=>[1] SockThread::ThreadEntry(0x2e57178, 0xfea45d54, 0x0, 0x0, 0x0, 0x0), at
0xff0e3770
[2] OSThreadMain(0x2e57178, 0x0, 0x1, 0xff13e0c0, 0x1, 0xff13ca54), at
0xff0e5000
(/opt2/SUNWrkShp/SUNWspro/bin/../WS6/bin/sparcv9/dbx)



Two things confuse me:

1. StartWorker should never call ThreadEntry directly. The two functions
should run in two different threads. The expected sequence is:
SockThread::SockThread()->SockThread::StartWorker()->pthread_create();
and then, in the new thread, after pthread_create is called:
OSThreadMain()->SockThread::ThreadEntry();
As you can see, when I debug the program in its normal state, the call stack
trace is correct.

2. StartWorker and ThreadEntry are both non-static member functions of the
class SockThread, so the first parameter of each should be the this pointer.
OSThreadMain is a normal C-style function; its first parameter is the this
pointer of the SockThread object that created the thread (see the code
calling pthread_create). In the debug output for the normal case, you can
see this is true. But in the core files, the first parameters of StartWorker
and ThreadEntry are not the same. That is why the core dump happened.

Additional information:
1) The only way to call ThreadEntry is to create a new SockThread object.
The program has many SockThread objects, all created on the heap.
2) The core dump occurs rarely.
3) The new thread is created in the constructor, so it is possible that
ThreadEntry is called before the object is completely constructed.
4) purify shows a lot of BSR errors near the point at which the core dump
occurs. poll is called in SockThread::StartWorker.
BSR: Beyond stack read error (325 times)
This is occurring while in thread 30:
_poll [libc.so.1]
poll [libthread.so.1]
OSThreadMain [libpxtrans.a]
_thread_start [libthread.so.1]
Reading 4 bytes from 0xfcc45bc8.
Stack pointer 0xfcc45bf0
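
Point 3 above describes a thread that can start running before the
constructor has returned. One common way to rule that out is to separate
construction from thread start. The following is only a minimal sketch under
assumptions: Worker, WorkerThreadMain, Start, Join and RunTwoPhaseDemo are
hypothetical names standing in for the real SockThread code, which is not
shown in full in this thread.

```cpp
#include <pthread.h>
#include <assert.h>
#include <stddef.h>

// Thread function with C linkage, as pthread_create expects.
extern "C" void *WorkerThreadMain(void *pContext);

// Hypothetical stand-in for SockThread: the constructor only initializes
// members; the thread is created later by an explicit Start() call.
class Worker {
public:
    Worker() : tid_(), started_(false), value_(0), seen_(-1) {
        value_ = 42;  // all initialization completes before any thread exists
    }

    // Call only on a fully constructed object. Returns 0 on success.
    int Start() {
        assert(!started_);  // starting twice would be a logic error
        int rc = pthread_create(&tid_, NULL, WorkerThreadMain, this);
        started_ = (rc == 0);
        return rc;
    }

    // Wait for the worker thread to finish. Returns 0 on success.
    int Join() {
        if (!started_) return -1;
        started_ = false;
        return pthread_join(tid_, NULL);
    }

    // Thread body: here it just records what it saw in the object.
    void ThreadEntry() { seen_ = value_; }

    int Seen() const { return seen_; }

private:
    pthread_t tid_;
    bool started_;
    int value_;
    int seen_;
};

extern "C" void *WorkerThreadMain(void *pContext) {
    Worker *self = static_cast<Worker *>(pContext);
    self->ThreadEntry();
    return NULL;
}

// Construct first, start second, then join before the object goes away.
// Returns the value the worker thread observed, or -1 on failure.
int RunTwoPhaseDemo() {
    Worker w;
    if (w.Start() != 0) return -1;
    if (w.Join() != 0) return -1;
    return w.Seen();
}
```

With this shape the caller writes "Worker *w = new Worker(); w->Start();"
instead of relying on the constructor to create the thread. Joining before
the object is destroyed also keeps the thread from using a dead object.
Whether this is related to the crashes reported here is not known.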


Thanks,
Kevin Ren



Paul Pluzhnikov

2004-09-22, 9:21 pm

"Kevin Ren" <hren@lucent.com> writes:

> I have a program that has produced two strange core dumps. I can't figure out
> how this could happen. Can anyone help?


Not really: you did not provide enough info.

> The code suspected of causing the core dump looks like this:


You've omitted too many relevant details. Try to at least create
a complete compilable example (you can omit all the application
functionality, but the test had better create threads in exactly the
same fashion as the application).

> 4) purify shows a lot of BSR errors near the point at which the core dump
> occurs. poll is called in SockThread::StartWorker.
> BSR: Beyond stack read error (325 times)


This is probably the reason for your crashes, but I can't deduce what
you are doing wrong given the limited code snippets you've posted.

Cheers,
--
In order to understand recursion you must first understand recursion.
Remove /-nsp/ for email.



Kevin Ren

2004-09-22, 9:22 pm

Yes, I think your suggestion is right. I am writing a test program to
simulate the situation. I have also found that there is a rapid memory leak
before the crash.

What I am wondering is why the call stack shows StartWorker calling
ThreadEntry directly. This is impossible; they should run in different
threads. Maybe the memory leak can explain it. I have seen a case where the
call stack showed a function calling a completely unrelated function after
a block of memory was freed twice.

Thanks for your help.
Kevin


Chuck Dillon

2004-09-22, 9:22 pm

Kevin Ren wrote:
>
> I examined the core files with dbx; they show the following:
>
> The first one:
> t@1833225 (l@1833229) terminated by signal BUS (invalid address alignment)
>


> Two things confuse me:


SIGBUS generally indicates you are trashing memory that contains
pointers. Strange stack traces indicate the memory being trashed
includes the stack. In such a situation the stack trace is of little
help. The code that is trashing memory may not be anywhere near (in
any sense) where the symptoms are being seen.

Check the code that allocates space for data and make sure that sufficient
space is provided in all cases, and add checks for things that aren't
supposed to happen. Be meticulous about checking the code and the design.
Have someone else look through it. This could be a very small and subtle
bug or design oversight.
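
As one illustration of "checks for things that aren't supposed to happen",
the thread function could validate its context pointer before using it. This
is a sketch only: WorkerContext, kContextMagic, ContextLooksValid,
RequireValid and GuardedThreadBody are hypothetical names, not part of the
code posted above, and a marker check like this catches only some kinds of
corruption (a null, misaligned, or overwritten object), not all of them.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical marker value stored in every live context object.
static const unsigned long kContextMagic = 0x534F434BUL;  // "SOCK"

// Hypothetical stand-in for the object handed to the thread function.
struct WorkerContext {
    unsigned long magic;
    int fd;
    WorkerContext() : magic(kContextMagic), fd(-1) {}
};

// Returns true if p is non-null, pointer-aligned, and carries the marker.
// The alignment test runs first so a misaligned address is never read.
bool ContextLooksValid(const void *p) {
    if (p == NULL) return false;
    if (reinterpret_cast<uintptr_t>(p) % sizeof(void *) != 0) return false;
    return static_cast<const WorkerContext *>(p)->magic == kContextMagic;
}

// Debug-build variant: abort at the point of detection, so the core file
// points at the check rather than at some later symptom.
void RequireValid(const void *p) {
    assert(ContextLooksValid(p));
}

// Refuses to touch a context that fails the check.
// Returns 0 if the body ran, -1 if the context was rejected.
int GuardedThreadBody(void *pContext) {
    if (!ContextLooksValid(pContext)) return -1;
    WorkerContext *ctx = static_cast<WorkerContext *>(pContext);
    ctx->fd = -1;  // placeholder for the real work
    return 0;
}
```

A rejected context can then be reported once, with the pointer value, which
is far cheaper than full tracing.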

The rarity of the problem could indicate that it is data related (e.g.
a protocol violation that you aren't checking for) or it could be a
race condition between the threads (thread A trying to stuff its data
into space allocated for thread B), or a combination of both.

As great as memory checkers and debuggers are, these kinds of problems
often can't be tracked down at runtime without inserting traces (e.g.
debug prints). Sometimes the tools themselves just hide the problem by
changing how memory is arranged or changing the timing of events.
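
Where debug prints are too expensive, one alternative is to keep traces in
a small in-memory ring buffer that costs a lock and a short string copy per
event and can be inspected in the core file afterwards. The following is a
rough sketch under assumptions, not a measured solution: Trace, LastTrace,
TraceCount, g_trace and the buffer sizes are all hypothetical, and the
actual overhead would have to be measured in the real program.

```cpp
#include <pthread.h>
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Hypothetical sizes; tune them to the application.
const int kTraceSlots = 256;
const int kTraceMsgLen = 48;

struct TraceEntry {
    unsigned long seq;        // global sequence number of this event
    char msg[kTraceMsgLen];   // short, truncated message
};

// Global buffer: it ends up in the core file, where a debugger can print it.
static TraceEntry g_trace[kTraceSlots];
static unsigned long g_traceNext = 0;
static pthread_mutex_t g_traceLock = PTHREAD_MUTEX_INITIALIZER;

// Record one event, overwriting the oldest slot once the buffer is full.
void Trace(const char *msg) {
    assert(msg != NULL);
    pthread_mutex_lock(&g_traceLock);
    TraceEntry &e = g_trace[g_traceNext % kTraceSlots];
    e.seq = g_traceNext++;
    strncpy(e.msg, msg, kTraceMsgLen - 1);
    e.msg[kTraceMsgLen - 1] = '\0';
    pthread_mutex_unlock(&g_traceLock);
}

// Number of events recorded so far (including overwritten ones).
unsigned long TraceCount() {
    pthread_mutex_lock(&g_traceLock);
    unsigned long n = g_traceNext;
    pthread_mutex_unlock(&g_traceLock);
    return n;
}

// Copy the most recent message into out. Returns false if nothing was traced.
bool LastTrace(char *out, size_t outLen) {
    if (out == NULL || outLen == 0) return false;
    bool found = false;
    pthread_mutex_lock(&g_traceLock);
    if (g_traceNext > 0) {
        const TraceEntry &e = g_trace[(g_traceNext - 1) % kTraceSlots];
        strncpy(out, e.msg, outLen - 1);
        out[outLen - 1] = '\0';
        found = true;
    }
    pthread_mutex_unlock(&g_traceLock);
    return found;
}
```

Calls such as Trace("before pthread_create") placed around thread creation
would leave the last few hundred events available after a crash, provided
the buffer itself has not been overwritten by the same memory corruption.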

Good luck,

-- ced



--
Chuck Dillon
Senior Software Engineer
NimbleGen Systems Inc.



Kevin Fox

2004-10-04, 6:01 pm

Hi Chuck,
Thank you for your help. I am still struggling with this problem, but I
haven't made any progress yet. The rarity of this problem makes it very
difficult to reproduce in our lab. At least, I have failed to reproduce it.
I can't activate the log and trace facility at the customer site, since we
can't afford the performance cost. Anyway, thank you all very much.

Kevin

