Unix Programming - SysV shared memory coherency issues


Ryan Underwood

2006-02-13, 6:04 pm

[This is crossposted to comp.lang.asm.x86 due to the instruction level issue
below]

Hi,
I'm getting really puzzled here. The basic story is that we have a 4 way Xeon
machine with hyperthreading enabled (8 logical processors). The application is
a fairly straightforward SysV (not POSIX) shared memory application. The
shared memory region is declared as a volatile double *. We are using GCC
4.0.2 with -O0 and a 2.6.6 Linux kernel with RTLinux extensions.

One of the processes sets up essentially an RPC call by placing several double
arguments in the shared memory region along with a function ID, and then
setting a flag to notify the other process that a call is waiting. The other
process is doing soft-realtime computational work as fast as it can with no
system calls, and occasionally checks the flag mailbox to see if an RPC is
waiting. If it is, then it reads the function ID and arguments and goes to
work on that task.

This seemed to work fine in my test cases, but eventually broke down for a
different user. The shared memory is written to and verified by the writer
process, but the computation process _never_ "sees" the flag even if allowed to
run forever. I can get the computation process to "see" the flag in several
ways:

1) Insert a sleep of between 1 and 10ms before the flag is polled (less than
1ms is not sufficient)

2) Print the contents of the mailbox on the computation process side before
polling the flag

Running under gdb or valgrind does not affect the bug.

Based on the above, my hypothesis is that in the cases where the condition
improves, it is because a system call has been taken, and thus the registers
clobbered. In the case where the sleep is insufficient, I think it is because
short sleep durations are executed as busy waits instead of forfeiting the
processor.

However, as a matter of experimentation, I changed the memory access in the
computation process from:

if (m_Mailbox[STALE_MBOX] != 1)
{ /* these actions are never taken */ }

which generated the following assembly:

movl 8(%ebp), %eax # this, this
movl 32(%eax), %eax # <variable>.m_Mail
addl $288, %eax #, D.40882
fldl (%eax) #* D.40882
fld1

to: [with knowledge that sizeof(double)==2*sizeof(uint32_t)]

if (((uint32_t*)m_Mailbox)[STALE_MBOX*2] != 1)
{ /* these actions are taken when mbox value is set to 0 */ }

which generated the following assembly:

movl 8(%ebp), %eax # this, this
movl 32(%eax), %eax # <variable>.m_Mail
addl $288, %eax #, D.40882
movl (%eax), %eax #* D.40882, D.40883
cmpl $1, %eax #, D.40883

As you can see, the load changes from a floating point to an integer load. To
my utter astonishment, this corrected the symptom that I was observing.
Apparently the memory access semantics of integer and FP loads differ in some
way on these processors. But is this actually the root of the problem, or is
there a more fundamental issue I am failing to understand? And is the issue
specific to the SysV shared page implementation (meaning I should look into
POSIX shared memory, pthreads, or mmap instead), or is it a processor-level
issue, perhaps with hyperthreading?

I was advised that I need memory barriers in this application for two reasons:

1) to keep the cache consistent across processors
2) because the mailbox writes may be reordered, meaning that even if the other
process saw the flag, there would be no guarantee that the function argument
writes had been performed yet.

If memory barrier functionality even exists on the IA-32 platform I don't seem
to be finding a way to access it from userspace. I am under the impression
that all operations on IA-32 are cache-coherent, and did not believe
hyperthreading changed that. Also, the wmb() macros in the Linux kernel seem
to evaluate to void for IA-32, supporting my position that memory barriers are
unnecessary in this case (that the writes, even if reordered when issued, are
committed in program order):

/*
 * For now, "wmb()" doesn't actually do anything, as all
 * Intel CPU's follow what Intel calls a *Processor Order*,
 * in which all writes are seen in the program order even
 * outside the CPU.
 */
#define wmb() __asm__ __volatile__ ("": : :"memory")
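
The ordering concern in point 2 above (payload written first, flag published last) can be sketched with C11 atomics. This is only an illustration under stated assumptions: `<stdatomic.h>` postdates the GCC 4.0.2 toolchain used here, and `struct mailbox`, `MAX_ARGS`, `mailbox_post` and `mailbox_poll` are hypothetical names, not part of the application described in this thread (whose region is a flat array of doubles). It also assumes `atomic_int` is lock-free on the target so that it can safely live in memory shared between processes.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_ARGS 4              /* hypothetical limit, not from the thread */

/* Hypothetical layout for the shared region. */
struct mailbox {
    double     args[MAX_ARGS];  /* call arguments */
    int        func_id;         /* which function to run */
    atomic_int ready;           /* 0 = empty, 1 = call waiting */
};

/* Writer side: store the payload first, then publish the flag. */
static void mailbox_post(struct mailbox *mb, int func_id,
                         const double *args, int nargs)
{
    for (int i = 0; i < nargs && i < MAX_ARGS; i++)
        mb->args[i] = args[i];
    mb->func_id = func_id;
    /* Release store: the payload writes above cannot be moved after it. */
    atomic_store_explicit(&mb->ready, 1, memory_order_release);
}

/* Reader side: non-blocking poll that makes no system call. */
static bool mailbox_poll(struct mailbox *mb, int *func_id, double *args)
{
    /* Acquire load: the payload reads below cannot be moved before it. */
    if (atomic_load_explicit(&mb->ready, memory_order_acquire) != 1)
        return false;
    *func_id = mb->func_id;
    for (int i = 0; i < MAX_ARGS; i++)
        args[i] = mb->args[i];
    /* Hand the slot back to the writer. */
    atomic_store_explicit(&mb->ready, 0, memory_order_release);
    return true;
}
```

On x86 a release store and an acquire load typically compile to ordinary `mov` instructions, so the main effect there is to stop the compiler from reordering the accesses, much like the `wmb()` definition quoted above.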

I'm having a difficult time seeing how this application differs from a typical
pthreads application, except in this case we only share specific preallocated
physical pages, instead of the entire page table of the parent process.

Looking forward to your thoughts,

Ryan Underwood

Brian Raiter

2006-02-14, 2:49 am

There surely must be something I'm missing here, because the two code
snippets are not at all identical in function.

> However, as a matter of experimentation, I changed the memory access
> in the computation process from:
>
> if (m_Mailbox[STALE_MBOX] != 1)
> { /* these actions are never taken */ }
>
> which generated the following assembly:
>
> movl 8(%ebp), %eax # this, this
> movl 32(%eax), %eax # <variable>.m_Mail
> addl $288, %eax #, D.40882
> fldl (%eax) #* D.40882
> fld1


Your assembly snippet is truncated a bit early, but it certainly
appears that the code is about to compare m_Mailbox[STALE_MBOX] to the
floating-point value 1.0.

> to: [with knowledge that sizeof(double)==2*sizeof(uint32_t)]
>
> if (((uint32_t*)m_Mailbox)[STALE_MBOX*2] != 1)
> { /* these actions are taken when mbox value is set to 0 */ }
>
> which generated the following assembly:
>
> movl 8(%ebp), %eax # this, this
> movl 32(%eax), %eax # <variable>.m_Mail
> addl $288, %eax #, D.40882
> movl (%eax), %eax #* D.40882, D.40883
> cmpl $1, %eax #, D.40883


And here, you are comparing the lower four bytes of
m_Mailbox[STALE_MBOX] to 0x0001.

But a double set to 1.0 has the bit pattern 0x3FF0000000000000, which x86
stores in memory as the bytes 00 00 00 00 00 00 F0 3F. So what am I missing?

b

Ian Collins

2006-02-14, 2:49 am

Ryan Underwood wrote:
> I'm having a difficult time seeing how this application differs from a typical
> pthreads application, except in this case we only share specific preallocated
> physical pages, instead of the entire page table of the parent process.
>
> Looking forward to your thoughts,
>

(sorry if this gets posted twice, one of the many groups this was posted
to must be moderated)

A couple,

Why not use a semaphore rather than a flag in shared memory?

Why is the flag a double?

If you have to use a flag, what happens if you protected it with a
process shared mutex?

--
Ian Collins.

Nils O. Selåsdal

2006-02-14, 7:49 am

Ryan Underwood wrote:
> [This is crossposted to comp.lang.asm.x86 due to the instruction level issue
> below]
>
> Hi,
> I'm getting really puzzled here. The basic story is that we have a 4 way Xeon
> machine with hyperthreading enabled (8 logical processors). The application is
> a fairly straightforward SysV (not POSIX) shared memory application. The
> shared memory region is declared as a volatile double *. We are using GCC
> 4.0.2 with -O0 and a 2.6.6 Linux kernel with RTLinux extensions.
>
> One of the processes sets up essentially an RPC call by placing several double
> arguments in the shared memory region along with a function ID, and then
> setting a flag to notify the other process that a call is waiting. The other
> process is doing soft-realtime computational work as fast as it can with no
> system calls, and occasionally checks the flag mailbox to see if an RPC is
> waiting. If it is, then it reads the function ID and arguments and goes to
> work on that task.
>
> This seemed to work fine in my test cases, but eventually broke down for a
> different user. The shared memory is written to and verified by the writer
> process, but the computation process _never_ "sees" the flag even if allowed to
> run forever. I can get the computation process to "see" the flag in several
> ways:
>
> 1) Insert a sleep of between 1 and 10ms before the flag is polled (less than
> 1ms is not sufficient)
>
> 2) Print the contents of the mailbox on the computation process side before
> polling the flag
>
> Running under gdb or valgrind does not affect the bug.
>
> Based on the above, my hypothesis is that in the cases where the condition
> improves, it is because a system call has been taken, and thus the registers
> clobbered. In the case where the sleep is insufficient, I think it is because
> short sleep durations are executed as busy waits instead of forfeiting the
> processor.
>
> However, as a matter of experimentation, I changed the memory access in the
> computation process from:
>
> if (m_Mailbox[STALE_MBOX] != 1)
> { /* these actions are never taken */ }
>
> which generated the following assembly:
>
> movl 8(%ebp), %eax # this, this
> movl 32(%eax), %eax # <variable>.m_Mail
> addl $288, %eax #, D.40882
> fldl (%eax) #* D.40882
> fld1
>
> to: [with knowledge that sizeof(double)==2*sizeof(uint32_t)]
>
> if (((uint32_t*)m_Mailbox)[STALE_MBOX*2] != 1)
> { /* these actions are taken when mbox value is set to 0 */ }


Why all this casting ?

It'd be interesting to see what differences there were if you made
the flag a simple int, with the volatile keyword too.

(But really - you should be protecting this using semaphores)

Robert Redelmeier

2006-02-14, 5:54 pm

In comp.lang.asm.x86 Ryan Underwood <spamtrap@crayne.org> wrote in part:
> if (m_Mailbox[STALE_MBOX] != 1)
> { /* these actions are never taken */ }
>
> which generated the following assembly:
>
> movl 8(%ebp), %eax # this, this
> movl 32(%eax), %eax # <variable>.m_Mail
> addl $288, %eax #, D.40882
> fldl (%eax) #* D.40882
> fld1


Egads! You're not testing FP numbers for _equality_, are you?
This is a cardinal sin! FP numbers are like sandpiles.
Whenever you move them, you lose some sand, and mix some dirt in.

When comparing floats/doubles, subtract&abs and test if the
difference is small enough.

In theory, if you start with clean FP registers (load an integer)
and never do anything nasty with them (like divide by non-2powers
or multiply by non-integers) then maybe they'll stay clean enough
for an equality compare. But I wouldn't count on it!
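
The subtract-and-compare test described above can be sketched as follows. `nearly_equal` and `flag_is_one` are hypothetical helper names, and the tolerance is something the caller has to choose for the data at hand. The second helper covers the case in this thread: a value stored as exactly 1.0 and loaded back unchanged does compare equal to 1.0, since no arithmetic has touched it.

```c
#include <stdbool.h>

/* True when a and b differ by no more than tol (absolute tolerance). */
static bool nearly_equal(double a, double b, double tol)
{
    double diff = a - b;
    if (diff < 0)
        diff = -diff;           /* absolute value without needing libm */
    return diff <= tol;
}

/* Exact test for a flag that is only ever stored as 0.0 or 1.0. */
static bool flag_is_one(const volatile double *flag)
{
    return *flag == 1.0;
}
```

For example, `0.1 + 0.2` and `0.3` are not equal as doubles, but `nearly_equal(0.1 + 0.2, 0.3, 1e-9)` is true.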

>
> to: [with knowledge that sizeof(double)==2*sizeof(uint32_t)]
>
> if (((uint32_t*)m_Mailbox)[STALE_MBOX*2] != 1)
> { /* these actions are taken when mbox value is set to 0 */ }
>
> which generated the following assembly:
>
> movl 8(%ebp), %eax # this, this
> movl 32(%eax), %eax # <variable>.m_Mail
> addl $288, %eax #, D.40882
> movl (%eax), %eax #* D.40882, D.40883
> cmpl $1, %eax #, D.40883
>
> As you can see, the load changes from a floating point to
> an integer load. To my utter astonishment, this corrected
> the symptom that I was observing.


Doh!

-- Robert

Ryan Underwood

2006-02-14, 5:54 pm

Ian Collins <ian-news@hotmail.com> writes:

>Why not use a semaphore rather than a flag in shared memory?


We are trying to avoid the computation process ever blocking or even syscalling
unless absolutely necessary. The flag is to pre-notify the computation process
that data is available. Then it takes the semaphore (also in shared memory).
There is only one process writing and one reading, so the mutual exclusion need
not be rigid.

>Why is the flag a double?


Because the whole shared memory region is a double *. I could split it out,
but I'd rather understand what's going on first to make sure I'm not going to
have the same problem with values that actually are doubles.

>If you have to use a flag, what happens if you protected it with a
>process shared mutex?


Good question, it would probably work given that any I/O before the flag check
seems to make it work. But a mutex that blocks or syscalls every time I poll
just isn't going to work very well in this application.

Ryan Underwood

2006-02-14, 5:54 pm

Robert Redelmeier <redelm@ev1.net.invalid> writes:

>In comp.lang.asm.x86 Ryan Underwood <spamtrap@crayne.org> wrote in part:
>Egads! You're not testing FP numbers for _equality_, are you?
>This is a cardinal sin! FP numbers are like sandpiles.
>Whenever you move them, you lose some sand, and mix some dirt in.


No, no. This person is testing it for inequality. I understand what you're
saying, but in that case shouldn't that code block ALWAYS be executed rather
than NEVER? I agree that it is bad style, but check out the generated asm
in the correction post.

>In theory, if you start with clean FP registers (load an integer)
>and never do anything nasty with them (like divide by non-2powers
>or multiply by non-integers) then maybe they'll stay clean enough
>for an equality compare. But I wouldn't count on it!


Yes, excellent point.

Ryan Underwood

2006-02-14, 5:54 pm

spamtrap@crayne.org (Brian Raiter) writes:

>There surely must be something I'm missing here, because the two code
>snippets are not at all identical in function.


Yeah, that was my fault.

>Your assembly snippet is truncated a bit early, but it certainly
>appears that the code is about to compare m_Mailbox[STALE_MBOX] to the
>floating-point value 1.0.


Here is the rest:
fxch %st(1) #
fucompp
fnstsw %ax # tmp558
andb $69, %ah #, tmp558
xorb $64, %ah #, tmp558
setne %al #, retval.121
testb %al, %al # retval.121
je .L148 #,

>And here, you are comparing the lower four bytes of
>m_Mailbox[STALE_MBOX] to 0x0001.


>But a double set to 1.0 has the bit pattern 0x3FF0000000000000, which x86
>stores in memory as the bytes 00 00 00 00 00 00 F0 3F. So what am I missing?


I believe the type conversion is done automatically, but I could be wrong.
Anyway, since it works with the delay, the comparison should be correct.

Ryan Underwood

2006-02-14, 5:54 pm

"Nils O. Selåsdal" <spamtrap@crayne.org> writes:

>Why all this casting ?


The memory region is a double *. The cast is so that the compiler would
generate an integer load instead of a FP load.

>It'd be interesting to see what differences there were if you made
>the flag a simple int, with the volatile keyword too.


I imagine it would work since the above cast works. My question is why loading
it as a double does not work.

>(But really - you should be protecting this using semaphores)


Why? I don't see a mutual exclusion problem.

Ian Collins

2006-02-14, 5:54 pm

Ryan Underwood wrote:
> Ian Collins <ian-news@hotmail.com> writes:
>
>
>
>
> We are trying to avoid the computation process ever blocking or even syscalling
> unless absolutely necessary. The flag is to pre-notify the computation process
> that data is available. Then it takes the semaphore (also in shared memory).
> There is only one process writing and one reading, so the mutual exclusion need
> not be rigid.
>

If the process is ready for data, does it matter if it blocks? I assume
if it is ready, it has nothing to do.
>
>
>
> Because the whole shared memory region is a double *. I could split it out,
> but I'd rather understand what's going on first to make sure I'm not going to
> have the same problem with values that actually are doubles.
>

Still easier than using the FPU....

>
>
>
> Good question, it would probably work given that any I/O before the flag check
> seems to make it work. But a mutex that blocks or syscalls every time I poll
> just isn't going to work very well in this application.
>

Someone will correct me if I'm wrong, but using a mutex as a barrier
will ensure the CPU caches are consistent.

If it blocks, it is blocking for a reason! If not, then the call is
optimised to have a low overhead.

I normally do this style of IPC with a process shared mutex and a
condition variable, or Solaris doors.
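
A minimal sketch of setting up such a process shared mutex with POSIX threads follows. `init_shared_mutex` and `try_take` are hypothetical names. The sketch assumes the `pthread_mutex_t` is placed inside the shared memory segment by the caller and that the platform supports `PTHREAD_PROCESS_SHARED`.

```c
#include <pthread.h>

/* Initialise a mutex that may live in memory shared between processes.
 * Returns 0 on success or a pthread error number. */
static int init_shared_mutex(pthread_mutex_t *m)
{
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0)
        return rc;
    rc = pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    if (rc == 0)
        rc = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}

/* Non-blocking attempt to take the mutex; returns 1 if it was acquired. */
static int try_take(pthread_mutex_t *m)
{
    return pthread_mutex_trylock(m) == 0;
}
```

A polling reader could call `try_take` and carry on with its computation whenever the mutex is busy, which avoids blocking on the lock.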

--
Ian Collins.

Brian Raiter

2006-02-17, 10:40 pm

> I believe the type conversion is done automatically, but I could be
> wrong.


You are in fact mistaken, as the assembly language snippets you've
posted make clear.

If you were casting a floating-point value to an int, then yes, the
compiler would automatically insert the code to convert the value. But
that's not what you're doing. You're casting a pointer to a
floating-point value to a pointer to an int. The only conversion that
the compiler is doing is on the pointer value proper. (And in this
case, no conversion is needed, so no extra code is inserted.)
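
The distinction can be made concrete with a small sketch. The helper names are hypothetical, and the code assumes an 8-byte IEEE 754 double on a little-endian machine such as the x86 system discussed in this thread. `memcpy` stands in for the pointer cast so that the example stays well-defined C.

```c
#include <stdint.h>
#include <string.h>

/* Value cast: the compiler emits a floating-point to integer conversion. */
static int value_cast(double d)
{
    return (int)d;
}

/* Raw 64-bit pattern of a double, with no conversion at all. */
static uint64_t double_bits(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return bits;
}

/* The 32-bit word at the lowest address of the double, which is what
 * ((uint32_t *)p)[0] reads. No conversion takes place here either. */
static uint32_t first_word(const double *p)
{
    uint32_t w;
    memcpy(&w, p, sizeof w);
    return w;
}
```

Under those assumptions `value_cast(1.0)` is 1 and `double_bits(1.0)` is 0x3FF0000000000000, while `first_word` returns 0 for both 1.0 and 0.0. An integer comparison of that word against 1 therefore comes out "not equal" whichever of the two values the flag holds.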

b

Ryan Underwood

2006-02-17, 10:40 pm

Ryan Underwood <spamtrap@crayne.org> writes:

>I believe the type conversion is done automatically, but I could be wrong.
>Anyway, since it works with the delay, the comparison should be correct.


I found the problem. Another process using the shared memory area was
overwriting the flag due to a program error. Somehow the race always worked
out in that process's favor, except when the delay was introduced. I knew what
I seemed to be seeing was too strange to be true!

Thanks for all of your speculations and reinforcement, it certainly helped me
deduce that the problem must have been elsewhere.

Concurrent programming is a walk in the park. Central Park.
