09-23-04 02:21 AM
Hello,
I have a program that has produced two strange core dumps, and I can't work
out why they happen. Can anyone help?
The program was written in C++ and compiled with SUNW CC 5.0.
Platform: Solaris 2.7.
The code suspected of causing the core dumps looks like this:
//Start a new thread by creating a SockThread object.
SockThread *pst = new SockThread();
In the last two months I got two core dumps: one caused by SIGABRT, the
other by SIGSEGV.
Here is the pseudo code of SockThread:
//Pseudocode of SockThread
//Thread function
void *OSThreadMain( void *pContext ) {
    SockThread *pst = (SockThread *) pContext;
    //Call the thread code; this sometimes dumps core
    pst->ThreadEntry();
    return NULL;
}
class SockThread
{
public:
    //Constructor, creates a thread
    SockThread() {
        //Do some initialization
        ..
        //Start worker thread
        StartWorker();
    }
    //Real thread code
    void ThreadEntry() {
        while( true ) {
            //Do something
            ..
        }
    }
private:
    int StartWorker() {
        .. open some sockets with the socket system call
        .. call the poll system call
        .. then
        return OSCreateWorker();
    }
    int OSCreateWorker() {
        //Create POSIX thread
        pthread_create( &tid, pattr, OSThreadMain, (void *)this );
        ..
    }
};
I examined the core files with dbx. They show the following:
The first one:
t@1833225 (l@1833229) terminated by signal BUS (invalid address alignment)
0xfede2584: ThreadEntry+0x0288: call SockState #Nvariant 1
(/ora04/WS_U2/SUNWspro/bin/../WS6U2/bin/sparcv9/dbx) where
current thread: t@1833225
=>[1] SockThread::ThreadEntry(0xe7878d0, 0x1, 0x0, 0xd599a70, 0x3, 0xc2000004), at 0xfede2584
  [2] SockThread::StartWorker(0xe0888d0, 0x0, 0x1, 0xfee2e0c0, 0x1, 0xfee2ca54), at 0xfede2c24
(/ora04/WS_U2/SUNWspro/bin/../WS6U2/bin/sparcv9/dbx)
The second one:
(/ora04/WS_U2/SUNWspro/bin/../WS6U2/bin/sparcv9/dbx) where
current thread: t@148124
  [1] __sigprocmask(0x0, 0xf9d95538, 0x0, 0xffffffff, 0xffffffff, 0x0), at 0xfee19dc0
  [2] _resetsig(0xfee2ca54, 0x0, 0x0, 0x0, 0xf9d95e3c, 0xf9d95e40), at 0xfee0f3c4
  [3] _sigon(0xfee34400, 0xfee34360, 0xf9d95e38, 0xf9d9560c, 0x6, 0xfed4e168), at 0xfee0eb80
  [4] _thrp_kill(0x0, 0x5, 0x6, 0xfee2ca54, 0xf9d95dc0, 0x0), at 0xfee11954
  [5] abort(0xfedb5f74, 0xc304c, 0xb, 0xfee2ca54, 0xf9d95664, 0x0), at 0xfed39590
  [6] sigSegv(0xb, 0xf9d95bd0, 0xf9d95918, 0xfee2ca54, 0xf9d95e48, 0xf9d95e28), at 0x56904
  [7] __libthread_segvhdlr(0xb, 0xf9d95bd0, 0xf9d95918, 0xfee2ca54, 0x0, 0x0), at 0xfee19378
  [8] __sighndlr(0xb, 0xf9d95bd0, 0xf9d95918, 0xfee19298, 0xf9d95e48, 0xf9d95e28), at 0xfee1be20
  [9] sigacthandler(0xb, 0xf9d95dc0, 0xf9d95918, 0xfee2ca54, 0xf9d95bd0, 0xf9d95dc0), at 0xfee186e0
---- called from signal handler with signal 11 (SIGSEGV) ------
=>[10] SockThread::ThreadEntry(0x0, 0x1, 0x0, 0xd5b1cd8, 0x3, 0xc2000004), at 0xfede266c
  [11] SockThread::StartWorker(0xeb5e2b0, 0x0, 0x1, 0xfee2e0c0, 0x1, 0xfee2ca54), at 0xfede2c24
When I attach dbx to a running program that is in its normal state, I get:
(/opt2/SUNWrkShp/SUNWspro/bin/../WS6/bin/sparcv9/dbx) where
current thread: t@473
=>[1] SockThread::ThreadEntry(0x2e57178, 0xfea45d54, 0x0, 0x0, 0x0, 0x0), at 0xff0e3770
  [2] OSThreadMain(0x2e57178, 0x0, 0x1, 0xff13e0c0, 0x1, 0xff13ca54), at 0xff0e5000
(/opt2/SUNWrkShp/SUNWspro/bin/../WS6/bin/sparcv9/dbx)
Two things confuse me:
1. StartWorker should never call ThreadEntry directly; the two functions
should run in two different threads. The expected sequence is:
SockThread::SockThread()->SockThread::StartWorker()->pthread_create();
and then, in the new thread started by pthread_create:
OSThreadMain()->SockThread::ThreadEntry();
As you can see, when I debug the program in its normal state, the call stack
trace is correct.
2. StartWorker and ThreadEntry are both non-static member functions of the
class SockThread, so the first parameter of each should be the this pointer.
OSThreadMain is a plain C-style function; its first parameter is the this
pointer of the SockThread object that created the thread (see the call to
pthread_create). In the trace from the normal case you can see that this
holds. But in the core files, the first parameters of StartWorker and
ThreadEntry are not the same. That would explain why the core dumps happened.
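To catch the mismatch from point 2 closer to where it starts, the context pointer could be validated before ThreadEntry runs. This is only a sketch and not the real code: the m_magic field, the kSockThreadMagic constant, IsValid(), Entries(), g_badContext and the name CheckedThreadMain are all hypothetical, and the endless loop in ThreadEntry is replaced by a counter so that the example terminates.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Hypothetical guard value; any unlikely bit pattern would do.
static const unsigned long kSockThreadMagic = 0x534F434BUL;

class SockThread {
public:
    SockThread() : m_magic(kSockThreadMagic), m_entries(0) {}

    // Clear the guard so a stale pointer has a chance of failing IsValid().
    // Best effort only: reading a destroyed object is undefined behaviour.
    ~SockThread() { m_magic = 0; }

    bool IsValid() const { return m_magic == kSockThreadMagic; }

    // Stand-in for the real loop: only records that it was entered.
    void ThreadEntry() {
        assert(IsValid());
        ++m_entries;
    }

    int Entries() const { return m_entries; }

private:
    volatile unsigned long m_magic;  // volatile so the store in the destructor is kept
    int m_entries;
};

// Marker whose address is returned when the context is rejected
// (a hypothetical convention for this sketch).
static int g_badContext = 0;

// Possible replacement for OSThreadMain that checks its argument first.
void *CheckedThreadMain(void *pContext) {
    // Reject NULL, as seen in frame [10] of the second core.
    if (pContext == NULL) {
        std::fprintf(stderr, "CheckedThreadMain: NULL context\n");
        return &g_badContext;
    }
    // Reject a misaligned address before dereferencing it
    // (the first core was a SIGBUS, invalid address alignment).
    if (reinterpret_cast<std::size_t>(pContext) % sizeof(unsigned long) != 0) {
        std::fprintf(stderr, "CheckedThreadMain: misaligned context %p\n", pContext);
        return &g_badContext;
    }
    SockThread *pst = static_cast<SockThread *>(pContext);
    if (!pst->IsValid()) {
        std::fprintf(stderr, "CheckedThreadMain: bad guard in %p\n", pContext);
        return &g_badContext;
    }
    pst->ThreadEntry();
    return NULL;
}
```

A check like this cannot prove that the pointer is good, but a rejected context logged here would show whether the bad value already arrives at the thread function or only appears later.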
Additional information:
1) The only way to call ThreadEntry is to create a new SockThread object. The
program has many SockThread objects, all of which are created on the heap.
2) The core dump occurs rarely.
3) The new thread is created in the constructor, so ThreadEntry may be
called before the object is completely constructed.
4) Purify reports many BSR errors near the point at which the core dump
occurs. poll is called in SockThread::StartWorker.
BSR: Beyond stack read error (325 times)
This is occurring while in thread 30:
_poll [libc.so.1]
poll [libthread.so.1]
OSThreadMain [libpxtrans.a]
_thread_start [libthread.so.1]
Reading 4 bytes from 0xfcc45bc8.
Stack pointer 0xfcc45bf0
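Regarding point 3, one way to rule out the constructor race would be to separate construction from thread start, so that pthread_create is only called once the object is fully built. A minimal sketch, assuming a hypothetical Start()/Join() pair and Passes() accessor that do not exist in the real code, with default thread attributes and with the worker loop shortened to a single pass so that the example terminates:

```cpp
#include <pthread.h>
#include <cassert>
#include <cstddef>

extern "C" void *OSThreadMain(void *pContext);

class SockThread {
public:
    // The constructor only initializes members; it no longer starts the thread.
    SockThread() : m_started(false), m_passes(0) {}

    // Hypothetical second phase: call only after the constructor has returned.
    // Returns 0 on success, otherwise the pthread_create error code.
    int Start() {
        assert(!m_started);
        int rc = pthread_create(&m_tid, NULL, OSThreadMain, this);
        if (rc == 0) {
            m_started = true;
        }
        return rc;
    }

    // Hypothetical: wait for the worker, so the object outlives its thread.
    // Returns 0 if nothing was running or the join succeeded.
    int Join() {
        if (!m_started) {
            return 0;
        }
        m_started = false;
        return pthread_join(m_tid, NULL);
    }

    // Stand-in for the real loop: a single pass.
    void ThreadEntry() { ++m_passes; }

    // Safe to read after Join() has returned.
    int Passes() const { return m_passes; }

private:
    pthread_t m_tid;
    bool m_started;
    int m_passes;
};

// Thread function with C linkage, as pthread_create expects.
extern "C" void *OSThreadMain(void *pContext) {
    SockThread *pst = static_cast<SockThread *>(pContext);
    pst->ThreadEntry();
    return NULL;
}
```

A caller would then write `SockThread *pst = new SockThread(); pst->Start();` and call `pst->Join()` before `delete pst`. Whether this has anything to do with the two cores is not established; it only removes one of the possibilities listed above.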
Thanks,
Kevin Ren