|
Multi-CPU ? (Unix Programming archive, March 2006)
|
|
| seenutn@gmail.com 2006-02-23, 2:54 am |
| Hi All,
I have a multithreaded application. It runs properly on a
single-CPU machine, but crashes on a multi-CPU machine. Analysing the
core did not help either, as the core dump happens at a different
place each time (the core is generated by signal 11).
Any idea what may be happening? Documents or links that
cover the points to consider for a program on a multi-CPU machine
would be of help.
I am using RHEL 4 update 2 (kernel-2.6.9-5EL.smp), running the app
on an IBM eServer 336 (dual Intel Xeon processors).
Regards,
Seenu.
| |
| Tommy Willoughby 2006-02-23, 2:54 am |
| seenutn@gmail.com wrote:
> Hi All,
> I have an multithreaded application. It is running properly on a
> single CPU machine, but crashes on multi-cpu machine. Analysing the
> core also did not help as each time the core dump happens at different
> place (core is generated by signal 11).
> Any idea about what may be happening? Documents or links which
> inform the points to be considered for a program on multi-cpu machine
> will be of help.
> I am using RHEL 4 update 2 (kernel-2.6.9-5EL.smp), running the app
> on IBM eServer 336 (Dual Intel Xeon processor).
My bet is that it's caused by bad ram.
--
mail: teeuu at qwest dot net
| |
| Ian Collins 2006-02-23, 2:54 am |
| seenutn@gmail.com wrote:
> Hi All,
> I have an multithreaded application. It is running properly on a
> single CPU machine, but crashes on multi-cpu machine. Analysing the
> core also did not help as each time the core dump happens at different
> place (core is generated by signal 11).
> Any idea about what may be happening? Documents or links which
> inform the points to be considered for a program on multi-cpu machine
> will be of help.
One thread deleting something another is still using maybe? Or
modifying something another doesn't expect to be modified?
It all depends on what you are doing and how thread-safe your implementation is.
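A minimal sketch of the "one thread deleting something another is still using" case, written in Python only for brevity. The class and function names (SessionTable, run_demo) are hypothetical and not from the original poster's application. The point is the discipline, not the language: lookup and use happen under the same lock that guards deletion. In C the unprotected version of this pattern would be a use-after-free, which is one plausible source of a signal 11 at a different place each run.

```python
import threading


class SessionTable:
    """Hypothetical shared table; one lock guards every read and write."""

    def __init__(self):
        self._lock = threading.Lock()
        self._sessions = {}

    def add(self, key, value):
        with self._lock:
            self._sessions[key] = value

    def remove(self, key):
        with self._lock:
            self._sessions.pop(key, None)

    def use(self, key):
        # Lookup and use happen under the same lock, so a concurrent
        # remove() cannot delete the entry between the check and the use.
        with self._lock:
            value = self._sessions.get(key)
            if value is None:
                return None
            return value.upper()


def run_demo(n_keys=1000):
    """Run one reader and one deleter against the same table."""
    table = SessionTable()
    for i in range(n_keys):
        table.add(i, f"session-{i}")

    results = []

    def reader():
        for i in range(n_keys):
            results.append(table.use(i))

    def deleter():
        for i in range(n_keys):
            table.remove(i)

    threads = [threading.Thread(target=reader),
               threading.Thread(target=deleter)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return table, results
```

Each reader result is either None (the deleter got there first) or the fully processed value; it is never a half-deleted entry. Which of the two a given key gets depends on scheduling and will differ between runs.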
--
Ian Collins.
| |
|
| ["Followup-To:" header set to comp.unix.programmer
for this really doesn't need to be crossposted.]
Begin <1140670783.883180.120380@i40g2000cwc.googlegroups.com>
On 2006-02-23, seenutn@gmail.com <seenutn@gmail.com> wrote:
> I have an multithreaded application. It is running properly on a
> single CPU machine, but crashes on multi-cpu machine. Analysing the
> core also did not help as each time the core dump happens at different
> place (core is generated by signal 11).
> Any idea about what may be happening? Documents or links which
> inform the points to be considered for a program on multi-cpu machine
> will be of help.
Any text on multithreading on multiple cpus. You might have a race
condition which somehow causes null or meaningless pointers, and no lock
to protect against it. Using multiple CPUs really is harder than merely
faking it with one cpu.
> I am using RHEL 4 update 2 (kernel-2.6.9-5EL.smp), running the app
> on IBM eServer 336 (Dual Intel Xeon processor).
IIRC these machines usually run with ECC RAM and keep logs of ECC faults
in the BIOS. That makes me slower to assume bad RAM than another poster
was, although the way the crash location moves around does point in that
direction.
It won't hurt to run a memory checker, and it's probably easier
than trying to pin down multi-CPU race conditions. Note that it will
need to go through the entire memory at least a few times, since a pass
that reports no errors is no guarantee that there are none.
--
j p d (at) d s b (dot) t u d e l f t (dot) n l .
This message was originally posted on Usenet in plain text.
Any other representation, additions, or changes do not have my
consent and may be a violation of international copyright law.
| |
| Chris Friesen 2006-02-26, 10:15 am |
| seenutn@gmail.com wrote:
> Hi All,
> I have an multithreaded application. It is running properly on a
> single CPU machine, but crashes on multi-cpu machine. Analysing the
> core also did not help as each time the core dump happens at different
> place (core is generated by signal 11).
> Any idea about what may be happening? Documents or links which
> inform the points to be considered for a program on multi-cpu machine
> will be of help.
The most likely scenario is that you have bugs in your code.
Now that you have threads that are actually running simultaneously,
you're hitting race conditions that you didn't see before.
Chris
| |
| Logan Shaw 2006-02-26, 10:16 am |
| jpd wrote:
> ["Followup-To:" header set to comp.unix.programmer
> for this really doesn't need to be crossposted.]
> Begin <1140670783.883180.120380@i40g2000cwc.googlegroups.com>
> On 2006-02-23, seenutn@gmail.com <seenutn@gmail.com> wrote:
>
> Any text on multithreading on multiple cpus.
I wouldn't trust any book that claims writing correct multithreading code
for a multiple-CPU system is any different than writing correct code for
a single-CPU system (unless you are talking about multithreading where
one thread can't preempt another). Since in a preemptive multithreading
system any thread can be preempted at any time, anything that's a bug on
a multiple-CPU system also is a bug on a single-CPU system. The only
difference (other than performance issues) is that the multiple-CPU
system will expose the bugs more easily.
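To make the preemption argument concrete, here is a small deterministic sketch (an illustration, not anything from the original post). It replays a fixed interleaving of two threads that each do `counter += 1`, modelled as a separate load and store. The function name run_schedule and the schedule strings are hypothetical. No real threads are involved, so the outcome does not depend on how many CPUs the machine has.

```python
def run_schedule(schedule):
    """Replay a fixed interleaving of two threads doing `counter += 1`.

    Each thread performs two steps: "load" (copy the counter into a
    private register) and "store" (write register + 1 back). `schedule`
    names the thread that runs its next step, e.g. "AABB".
    """
    counter = 0
    register = {}                          # per-thread private copy
    next_step = {"A": "load", "B": "load"}
    for thread in schedule:
        if next_step[thread] == "load":
            register[thread] = counter
            next_step[thread] = "store"
        elif next_step[thread] == "store":
            counter = register[thread] + 1
            next_step[thread] = "done"
    return counter


# No preemption inside the increment: both updates survive.
serial = run_schedule("AABB")
# A is preempted between its load and its store: one update is lost.
preempted = run_schedule("ABBA")
```

The "ABBA" schedule is legal on a single CPU with a preemptive scheduler, which is the point being made: the bug is already there, and a second CPU only makes that interleaving far more frequent.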
- Logan
| |
| David Schwartz 2006-02-26, 10:16 am |
|
"jpd" <read_the_sig@do.not.spam.it.invalid> wrote in message
news:4655juF9dr4jU1@individual.net...
> Any text on multithreading on multiple cpus. You might have a race
> condition which somehow causes null or meaningless pointers, and no lock
> to protect against it. Using multiple CPUs really is harder than merely
> faking it with one cpu.
Huh?! Actually, it's much easier on multiple CPUs. Anyone who develops
or debugs multithreaded code on a single CPU machine will find their job
much, much harder because of it.
DS
| |
| loic-dev@gmx.net 2006-02-26, 10:16 am |
| Hello,
> I wouldn't trust any book that claims writing correct multithreading code
> for a multiple-CPU system is any different than writing correct code for
> a single-CPU system (unless you are talking about multithreading where
> one thread can't preempt another). Since in a preemptive multithreading
> system any thread can be preempted at any time, anything that's a bug on
> a multiple-CPU system also is a bug on a single-CPU system. The only
> difference (other than performance issues) is that the multiple-CPU
> system will expose the bugs more easily.
Very true... I always recommend that people test their
multi-threaded programs intensively on multiple-CPU systems, if they have
access to them. In my experience, it's not unusual for new bugs to
appear that never show up on a single-CPU machine.
Cheers,
Loic.
| |
| Lee Sau Dan 2006-02-26, 10:16 am |
| >>>>> "seenutn" == seenutn <seenutn@gmail.com> writes:
seenutn> Hi All, I have an multithreaded application. It is
seenutn> running properly on a single CPU machine, but crashes on
seenutn> multi-cpu machine. Analysing the core also did not help
seenutn> as each time the core dump happens at different place
seenutn> (core is generated by signal 11). Any idea about what
seenutn> may be happening? Documents or links which inform the
seenutn> points to be considered for a program on multi-cpu
seenutn> machine will be of help. I am using RHEL 4 update 2
seenutn> (kernel-2.6.9-5EL.smp), running the app on IBM eServer
seenutn> 336 (Dual Intel Xeon processor).
Sounds like problems in your code. First time running your own
multi-threaded code on a real SMP machine?
You should read standard university textbooks on OS or parallel
programming. Keywords include "race condition", "critical section",
"synchronization", etc. Many of these concurrent-programming problems
do not show up when running a multi-threaded program in a single-CPU
system.
BTW, learn to use Apache Log4j (or the less popular and less flexible
java.util.logging.*) to print out log messages from your code. It's
very helpful for tracing what your threads are really doing, and more
importantly, how they dance together. Very likely, one of your
threads is stepping on the foot of the other, causing the latter to
cry! (Try also the Chainsaw GUI for viewing the log messages!)
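The post recommends Log4j, which is a Java library. As a stand-in, the sketch below uses Python's standard `logging` module to show the same idea: tag every log line with the name of the thread that wrote it, so the interleaving can be read back afterwards. The logger name, the worker function and the message texts are hypothetical.

```python
import io
import logging
import threading


def make_logger(stream=None):
    """Build a logger whose every line names the emitting thread."""
    logger = logging.getLogger("mt-trace")    # hypothetical logger name
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)   # defaults to stderr
    handler.setFormatter(
        logging.Formatter("%(asctime)s [%(threadName)s] %(message)s"))
    logger.addHandler(handler)
    return logger


def worker(logger, shared, lock, item):
    logger.debug("waiting for lock")
    with lock:
        logger.debug("got lock, appending %r", item)
        shared.append(item)
    logger.debug("released lock")


def run_workers(logger, n=3):
    """Start n named threads that all append to one shared list."""
    shared, lock = [], threading.Lock()
    threads = [
        threading.Thread(target=worker, name=f"worker-{i}",
                         args=(logger, shared, lock, i))
        for i in range(n)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared


def trace_to_string(n=3):
    """Run the workers and return the captured trace as one string."""
    buf = io.StringIO()
    run_workers(make_logger(buf), n)
    return buf.getvalue()
```

Printing `trace_to_string()` shows which thread waited, which one held the lock, and in what order. One caveat: logging takes time and its own internal lock, so it can change the timing enough to hide a race. It is a tracing aid, not proof that the race is gone.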
-- 
Lee Sau Dan 李守敦
E-mail: danlee@informatik.uni-freiburg.de
Home page: http://www.informatik.uni-freiburg.de/~danlee
| |
| Dave (from the UK) 2006-02-26, 10:16 am |
| seenutn@gmail.com wrote:
> Hi All,
> I have an multithreaded application. It is running properly on a
> single CPU machine, but crashes on multi-cpu machine. Analysing the
> core also did not help as each time the core dump happens at different
> place (core is generated by signal 11).
> Any idea about what may be happening? Documents or links which
> inform the points to be considered for a program on multi-cpu machine
> will be of help.
> I am using RHEL 4 update 2 (kernel-2.6.9-5EL.smp), running the app
> on IBM eServer 336 (Dual Intel Xeon processor).
>
> Regards,
> Seenu.
>
Been there, done that.
Debugging multi-threaded code is no simple task. A very talented guy I
used to work with ran his MT code on a dual-processor Linux box with no
hassle. He then tried it on a quad-processor Sun SPARCstation 20 and
found it crashed. It turned out to be a real bug that just never seemed
to occur on single-processor Suns or a dual-processor Linux box.
I wrote a program that worked fine on numerous UNIX boxes, *all* of them
multi-processor. These included PCs running Linux, *BSD etc., Suns, HPs
running both HP-UX, Cray and others. It also ran on a single-processor DEC
Alpha. But it would not run properly on a Sun running Linux. I tended to
ignore that and blame Linux on SPARC, as multi-threaded Linux on SPARC
is not well tested.
But then I found it sometimes misbehaved on a quad-processor IBM
RS/6000. Then I became suspicious of my code, as I trusted AIX somewhat
more than Linux on a SPARC.
Sure enough, I found a bug in my code.
There is an open-source library that you can link in that is supposed to
help find such bugs. I can't say I was over-impressed with it, but it
might be worth a try.
Multi-threaded code is just hard to write properly. Finding bugs is hard.
BTW, I have a couple of book recommendations.
1) This is one *NOT* to buy. It is the second worst technical book I
have ever seen. I am afraid to say it is published by Sun.
"Multithreaded programming with pthreads"
by B. Lewis and Daniel J Berg, published by Sun Microsystems.
IMHO a total waste of time. It tries to cover far too much, including
threads on OS/2, yet ignores LWPs on Solaris. That seems odd for a book
published by Sun.
2) "Foundations of Multithreaded, Parallel, and Distributed Programming"
by Gregory R. Andrews, Addison-Wesley. Excellent book.
I'm not sure if the author is the same 'Greg Andrews' that used to post
to the Sun related newsgroups, but I suspect not. But that is well worth
the money. Just forget the Sun book.
--
Dave K
Minefield Consultant and Solitaire Expert (MCSE).
Please note my email address changes periodically to avoid spam.
It is always of the form: month-year@domain. Hitting reply will work
for a couple of months only. Later set it manually.
| |
| Gianni Mariani 2006-02-26, 10:16 am |
| seenutn@gmail.com wrote:
> Hi All,
> I have an multithreaded application. It is running properly on a
> single CPU machine, but crashes on multi-cpu machine. Analysing the
> core also did not help as each time the core dump happens at different
> place (core is generated by signal 11).
> Any idea about what may be happening? Documents or links which
> inform the points to be considered for a program on multi-cpu machine
> will be of help.
> I am using RHEL 4 update 2 (kernel-2.6.9-5EL.smp), running the app
> on IBM eServer 336 (Dual Intel Xeon processor).
I would wager this is a bug in your code.
Finding such bugs is very difficult so you need to:
a) Instrument your code so that debug builds uncover broken assumptions
b) Keep to a small number of well tested MT helper classes that "just
work". (not necessarily simple classes BTW)
c) Monte-carlo unit tests (stress tests).
I recently deployed a heavily multithreaded system which followed these
principles and no MT related bugs have been found yet (in the wild).
The monte-carlo tests really did work to ferret out the most unthinkable
issues.
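A minimal sketch of points (a) to (c) above, under stated assumptions: the Accounts class stands in for a "well tested MT helper class", the assert inside transfer() plays the role of debug-build instrumentation, and stress() is a Monte Carlo test that hammers the class with random operations and then checks an invariant. All names and numbers here are hypothetical and not taken from the system described in the post.

```python
import random
import threading


class Accounts:
    """Hypothetical MT helper class: transfers keep the total constant."""

    def __init__(self, n_accounts, opening_balance):
        self._lock = threading.Lock()
        self._balances = [opening_balance] * n_accounts

    def transfer(self, src, dst, amount):
        with self._lock:
            if self._balances[src] < amount:
                return False
            self._balances[src] -= amount
            self._balances[dst] += amount
            # Debug-style check of an assumption the design relies on.
            assert self._balances[src] >= 0
            return True

    def total(self):
        with self._lock:
            return sum(self._balances)


def stress(n_threads=8, ops_per_thread=5000, seed=1234):
    """Hammer Accounts with random transfers; return (before, after)."""
    accounts = Accounts(n_accounts=10, opening_balance=100)
    before = accounts.total()

    def hammer(thread_seed):
        rng = random.Random(thread_seed)   # per-thread RNG
        for _ in range(ops_per_thread):
            src, dst = rng.randrange(10), rng.randrange(10)
            accounts.transfer(src, dst, rng.randint(1, 50))

    threads = [threading.Thread(target=hammer, args=(seed + i,))
               for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return before, accounts.total()
```

The seeds make the sequence of operations per thread repeatable, but not the interleaving between threads, so a test like this is worth running many times and on as many CPUs as are available. If the lock in transfer() is removed, the invariant `before == after` is the kind of check that can eventually catch it.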
MT code places severe constraints on what your design can look like. I
remember a number of engineers complaining about the design
constraints because of the "strange" way the MT classes worked. I now
hear amazement at the stability of the application.
| |
| seenutn@gmail.com 2006-03-03, 6:43 pm |
| >
> Sure enough, I found a bug in my code.
>
> There is a open-source library that you can link in that is supposed to
> help find such bugs. I can't say I was over-impressed with it, but it
> might be worth a try.
Hi Dave,
Any idea which library you used? I thought I would give it a
try.
Regards,
Seenu.
| |
|
| Begin <68uLf.36330$UN2.17701@tornado.texas.rr.com>
On 2006-02-24, Logan Shaw <lshaw-usenet@austin.rr.com> wrote:
>
> I wouldn't trust any book that claims writing correct multithreading code
> for a multiple-CPU system is any different than writing correct code for
> a single-CPU system
And, of course, ``correct'' is the kicker the OP likely stumbled on.
> The only difference (other than performance issues) is that the
> multiple-CPU system will expose the bugs more easily.
Exactly.
--
j p d (at) d s b (dot) t u d e l f t (dot) n l .
| |
|
| Begin <dtlvj3$vdr$1@nntp.webmaster.com>
On 2006-02-24, David Schwartz <davids@webmaster.com> wrote:
> "jpd" <read_the_sig@do.not.spam.it.invalid> wrote in message
> news:4655juF9dr4jU1@individual.net...
>
>
> Huh?! Actually, it's much easier on multiple CPUs. Anyone who develops
> or debugs multithreaded code on a single CPU machine will find their job
> much, much harder because of it.
I said ``using it'', not ``getting it right'', which you implicitly took
for granted. Which means we're basically saying the same thing. I'll
admit I wasn't too clear so apologies for the confusion.
--
j p d (at) d s b (dot) t u d e l f t (dot) n l .
| |
| seenutn@gmail.com 2006-03-09, 8:08 am |
| seenutn@gmail.com wrote:
> Hi All,
> I have an multithreaded application. It is running properly on a
> single CPU machine, but crashes on multi-cpu machine. Analysing the
> core also did not help as each time the core dump happens at different
> place (core is generated by signal 11).
> Any idea about what may be happening? Documents or links which
> inform the points to be considered for a program on multi-cpu machine
> will be of help.
> I am using RHEL 4 update 2 (kernel-2.6.9-5EL.smp), running the app
> on IBM eServer 336 (Dual Intel Xeon processor).
The article
http://docs.sun.com/app/docs/doc/81...idelines&a=view
(Working With Multiprocessors section) helps a little bit.
Regards,
Seenu.
| |
| Dave (from the UK) 2006-03-12, 5:51 pm |
| seenutn@gmail.com wrote:
>
>
> Hi Dave,
>
> Any idea about what was the library you used? I thought I will give a
> try.
>
> Regards,
> Seenu.
>
No,
I can't recall. Thinking about it more, it might have been a patch for gcc.
There are some thread specific newsgroups - that would probably be the
place to ask.
--
Dave K MCSE.
MCSE = Minefield Consultant and Solitaire Expert.
|