| Slawomir J. 2006-05-09, 7:12 pm |
| David DiNucci wrote:
> Since you also posted a request for comments to comp.distributed, and
> since this is part of such a response and is on topic there, I'm
> cross-posting this there. Hope you don't mind.
No problem whatsoever.
>
> I don't understand this last sentence--e.g. what sorts of constraints,
> what you are calling an atom.
The constraints within the atom correspond directly to how much static or
dynamic in terms of dataflow technology an atom work can be. In great
simplification, all that needs to be deterministic (static-dataflow like) is
placed within the stressed section. All that is to be dynamic is placed
within the relaxed section. All that needs to be dynamic but still kept in
order uses ordered calls out of the relaxed section. (Ordered calls out of
the stressed section make no sense).
> I've never considered Kahn-McQueen (KM) as a machine, just as a definition
> used in a theorem that states that when you have a bunch of deterministic
> transforms connected by fixed sequential (i.e. non-passing) channels, the
> results will be deterministic. It's sort of a generalization of function
> composition. To the extent that some dataflow models (though perhaps not
> yours) allow such fixed networks at various places throughout the program,
> whether those models are static or dynamic, it is a useful (if not
> somewhat obvious) result, and it's a shame not to let programmers take
> advantage of it. If I understand, it's not possible to build such a fixed
> network with stressflow.
Sure it is. To do it, the stressflow program has to be special-cased to
consist of atoms that:
1. Return no values.
2. Pass needed parameters by value
3. Have all their code in stressed section (thus removing any “dynamicity”)
> Another issue here is that the ScalPL semantics describe the possible
> outcomes of any given ScalPL program, not necessarily how to get there.
> (In fact, the F-Net semantics, upon which ScalPL is based, have been
> described axiomatically as well as procedurally.) So, ScalPL does not
> preclude optimizations which produce the required outcome in some more
> parallel way than the obvious pipelined execution. Put another way, using
> matching as you've described (where partial results pass one another) can,
> in some circumstances at least, be used by a ScalPL compiler or runtime
> even if the user program is expressed in an easily-understood "static
> dataflow" Kahn-McQueen form.
That can always be said, that some “behind the scenes magic” can be used to
improve performance. The devil is always in the details on how the hard
reality is dealt with. Dynamic, as in fully parallel firing of the same node
connection/arc instance wise, will always result in a situation where the
multiple results will be arriving at the output arc out of order and
something, somehow has to reorder them or stall the fast ones so the slower
ones can catch up. That call pre-reservation thing is nice because it simply
does reserve a place in line for delivering the output as the first
operation guaranteed to be executed in the right order. When the actual call
is ready, the pre-reserved place is used. If the pre-reserved place reaches
the head of the lock reservation queue the target stalls until the reserving
task produces the data. This guarantees strict deterministic order without
limiting dynamic execution and without tagging and collecting results in
sorted queues.
> Perhaps this is one reason that I avoid these terms and all of the
> historical baggage that goes along with them. For example, a single node
> in ScalPL can often be instantiated multiple times with the results still
> arriving in order (to use your terminology). Is that static because of
> the way the results come out, or dynamic because of the multiple
> instantiations? And, if the programmer wants individual tasks or entire
> strategies (dataflow graphs) to be instantiated multiple times (e.g. once
> per input) to get more parallelism, that's fine, too. Is each of those
> strategies static or dynamic? It's even possible to instantiate entire
> strategies multiple times and still ensure that the results arrive in
> order.
Dynamic dataflow means unlimited firings per same arc (or connection
context) without waiting for the previous firing to complete. It is
generally assumed that a well constructed/optimized static machine can do
(statistically speaking) above 1 firing and below 2 firings per same arc.
Meaning – the next firing can happen when the previous one is, say, half way
through processing.
Multiple instantiations of a whole or part of a system are therefore not
considered dynamic, because they use separate arcs (connection context
instance).
>
> I'm not sure what your point is, but it does seem quite sensible to me to
> provide enough information that the platform (including compilers and
> runtime) can optimize the program as either dynamic (with tagging) or as
> independent (partitioned) sets of static graphs.
I have no problem with it, except for the fact that efficient method is
still needed to do it – low level wise. My project was more about the low
level implementation scheme than high level syntax using it.
>
> Speaking of which: How do you coordinate shared data between the caller
> and the callee? For example, if large data structures are being passed
> from the caller, and potentially updated in the caller between each call,
> must the callee's programmer choose between (a) performing all
> reading/processing of the structure in the stressed section (so that
> reading is finished by the time it returns) or (b) copying the structure
> within the stressed section and then processing the copy in the relaxed
> section to allow the caller to continue or (c) not copying at all and just
> reading/processing the data in the relaxed section in hopes that the
> caller won't modify it? If so, an optimal decision of which to do depends
> not only on the callee's programmer knowing what the caller will be doing
> with the data (since, for example, c is optimal if the caller doesn't
> modify the data), but also on whether the callee and the caller are on
> distributed processors (since the act of messaging the data from caller to
> callee will automatically create a copy anyway).
(a) or (b) – never (c) which should be detected by a compiler as an error.
Anyway, strict, puritan dataflow-type model can always be used with the
needed data passed as call parameters, which already handles the copying.
Other tricks are simply matters of avoiding unnecessary copying,
de-composing and recomposing the data for parallel processing logically but
not physically, etc. The idea was to give a whole bag of tricks for sake of
complete efficiency, not to hide the silverware because someone could hurt
himself with it.
> In ScalPL, the programmer specifies what sorts of access each plan will
> need to interface data, and the platform can optimize that access and
> scheduling depending on the circumstances at the time. In other words,
> the mechanism of accomplishing the sharing (e.g. call by name or value, by
> passing messages or through shared memory) is not hardwired into the
> program. Only the policy of what needs to be shared is.
As I said before, the goal of my project was to provide a model that
directly translates into implementation, so that it could be used for
performance programming on both high and very low level. The goal was to
create a C++ of parallel programming and not Pascal or Smalltalk of parallel
programming. In spite of many pure, “safe” structural methods of parallel
programming being proposed, most commercial parallel programming is done
with low level tools or tools based on 50 year old languages like Fortran.
The reason behind it, in my view, is utopian Puritanism behind most of the
structural parallel programming methods. Puritanism always carries the
hidden cost, in the case of most puritan parallel programming methods,
“guaranteed” fool-proof data consistency was always based on passing copies
of entire sets of data being worked as “tokens” or messages. It worked and
was “safe”, but for “real” programming there was C++ and ugly non-structural
methods of parallelism.
> Thanks. So it sounds as though the you *do* have constructs to perform the
> matching within the runtime (i.e. lock access mechanism). That's good.
> Still, a principal difference from ScalPL appears to be that stressflow
> defaults to no matching, and requires matching for determinism whenever
> data from different sources merges, while ScalPL defaults to determinism
> by virtue of Kahn-McQueen, but provides ways to build constructs which are
> not. I would argue that this implies that ScalPL provides well-defined
> behavior in cases where stressflow does not. For example, the potential
> results that a program produces can be directly dependent upon how "out of
> order" the intermediate results can become, and this out-of-orderness is a
> direct part of a ScalPL specification, whereas I don't think it is in a
> stressflow specification.
Full “dynamic” parallel execution will produce results out of order – there
are no miracles. There is no timing “determinism” by some mythical virtue
that happens by itself on tasks that run at the same time. That is talking
timing Perpetum-Mobile. If they are free to start almost together and run as
fast as they can, they will sometimes come at the finish out of order – so
the question still is – how are they kept/forced into the right order at the
finish line (only when specifically necessary) and without being slowed down
at the start? Criticizing a working implementation method on the basis that
it can all by avoided altogether magically by some mystical magic is rather
silly.
> I know you've written much more that's probably worth responding to some
> day, but I do have other work to do. Maybe we've both provided at least a
> glimpse into each other's development methods. I am tempted to construct
> a ScalPL plan for each of the major stressflow constructs to illustrate
> the primary differences and similarities, but I rather doubt that will
> come soon. It appears that, in general, there are four principal
> stressflow segments to model: (1) The caller (precall), (2) the caller
> (postcall), (3) the callee (stressed), and (4) the callee (unstressed).
> In ScalPL, these would most naturally be 4 different tasks, with the
> obvious dependences between them, but it sometimes sounds as though 1, 3,
> and 2 could all be one task with only 4 being a separate task. I will
> need to understand more fully the circumstances (if any) under which the
> flow from 1 to 3 to 2 can block.
This is making it far more complex that it is. The caller creates a callee
and cooperates with him (sort of) while the callee is in its stressed
section. The parting ways inside the callee and not at the point of making
the call allows for smootly controlling amount of free "dynamicity" of the
callee.
Ultimate test will be to test drive both methods, I think, and I am looking
forward to it. As one of the next steps I will provide complete examples
with benchmarks for different hardware. Of course, I will say something here
at the newsgroups once I am ready for that. One way to go about it is
propose some specific parallel programming jobs to solve as benchmarking
tests by anybody and any method.
Thanks for your interest and your time and Very Best Regards
Slawomir J. info@stressflow.com
|