MPI-2 : Communication

/***** Point to point discussion that lead to collective communication ******/
toonen/Projects/MPICH-Design/pt2pt.txt
Brian's notes from this file:

- David suggested that we might be able to use the xfer interface for
  point-to-point messaging as well as for collective operations.

  What should the xfer interface look like?

  - David provided a write-up of the existing interface

  - We questioned whether or not multiple receive blocks could be used to
    receive a message sent from a single send block.  We decided that blocks
    define envelopes which match, where a single block defines an envelope (and
    payload) per destination and/or source.  So, a message sent to a particular
    destination (from a single send block) must be received by a single receive
    block.  In other words, the message cannot be broken across receive blocks.
 
    - there is an asymmetry in the existing interface which allows multiple
      destinations but prevents multiple sources.  the result of this is that
      scattering operations can be naturally described, but aggregation
      operations cannot.  we believe that there are important cases where
      aggregation would benefit collective operations.

    - to address this we believe that we should extend the interface to
      implement a many-to-one, in addition to the existing one-to-many
      interface.  we hope we don't need the many-to-many...

    - perhaps we should call these scatter_init and gather_init (etc)?

  - Nick proposed that the interface be split up such that sends requests were
    separate from receive requests.  This implies that there would be a
    xfer_send_init() and xfer_recv_init().  We later threw this out, as it
    didn't make a whole lot of sense with forwards existing in the recv case.

  - Brian wondered about aggregating sends into a single receive and whether
    that could be used to reduce the overhead of message headers when
    forwarding.  We think that this can be done below the xfer interface when
    converting into a dataflow-like structure (?)

- We think it may be necessary to describe dependencies, such as progress,
  completion and buffer.  These dependencies as frighteningly close to
  dataflow...

- basically we see the xfer init...start calls as being converted into a set of
  comm. agent requests and a dependency graph.  we see the dependencies as
  being possibly stored in a tabular format, so that ranges of the incoming
  stream can have different dependencies on them -- specifically this allows
  for progress dependencies on a range basis, which we see as a requirement.
  completion dependencies (of which there may be > 1) would be listed at the
  end of this table

  the table describes what depends on THIS request, rather than the other way
  around.  this is tailored to a notification system rather than some sort of
  search-for-ready approach (which would be a disaster).

- for dependencies BETWEEN blocks, we propose waiting on the first block to
  complete before starting the next block.  you can still create blocks ahead
  of time if desired.  otherwise blocks may be processed in parallel

- blocks follow the same envelope matching rules as posted mpi send/recvs
  (commit time order).  this is the only "dependency" between blocks

reminder: envelope = (context (communicator), source_rank, tag)

QUESTION: what exactly are the semantics of a block?  Sends to the same
destination are definitely ordered.  Sends to different desinations could
proceed in parallel.  Should they?

example:
  init
  rf(5)
  rf(4)
  r
  start

a transfer block defines 0 or 1 envelope/payloads for sources and 0 to N envelope/payloads for destinations, one per destination.


- The communication agent will need to process these requests and data
  dependencies.  We see the agent having queues of requests similar in nature
  to the run queue within an operating system.  (We aren't really sure what
  this means yet...)  Queues might consist of the active queue, the wait queue,
  and the still-to-be-matched queue.

  - the "try to send right away" code will look to see if there is anything in
    the active queue for the vc, and if not just put it in run queue and call
    the make progress function (whatever that is...)

- adaptive polling done at the agent level, perhaps with method supplied
  min/max/increments.  comm. agent must track outstanding requests (as
  described above) in order to know WHAT to poll.  we must also take into
  account that there might be incoming active message or error conditions, so
  we should poll all methods (and all vcs) periodically.

- We believe that a MPID_Request might simply contain enough information for
  signalling that one or more CARs have completed.  This implies that a
  MPID_Request might consist of a integer counter of outstanding CARs.  When
  the counter reached zero, the request is complete.  David suggests making
  CARs and MPID_Requests reside in the same physical structure so that in the
  MPI_Send/Recv() case, two logical allocations (one for MPID_Request and CAR)
  are combined into one.

- operations within a block are prioritized by the order in which they are
  added to the block.  operations may proceed in parallel so long as higher
  priority operations are not slowed down by lesser priority operations.  a
  valid implementation is to serialize the operations thus guaranteeing that
  the current operation has all available resources at its desposal.

end of Brian's notes.

/***** Collective communication ***********************************************/
David's notes:

CAR stands for communication agent request.  For buffer movement, it describes 
operations that do not need to be or cannot be broken into smaller requests.

int Initialize_car_table();
int Finalize_car_table();


- Operations to describe buffer flow:

These can be used to describe collective and point to point operations.
We may want to add operations for RMA

Xfer_init is called first to acquire a request.  Then any number of Xfer_xx_op calls may
be made to describe buffer flow.  Then Xfer_start is called to enqueue and start the
operation.

There is only one receiver so the source is placed in the init call to guarantee this.  
It could be a parameter to recv_op and recv_forward_op but then we would have to detect errors.
buf, count, and datatype describe an MPI buffer. 
first and last are byte displacements into the MPI buffer as if it were a contiguous buffer.

int Xfer_init(src, tag, comm, &request); 
int Xfer_recv_op(request, buf, count, datatype, first, last); 
int Xfer_send_op(request, buf, count, datatype, first, last, destination);
int Xfer_recv_forward_op(request, buf, count, datatype, first, last, destination);
int Xfer_forward_op(request, size, destination);
int Xfer_start(request);


------- David, Brian and Rob have not discussed the following section yet
------- These are just some ideas that David thinks the agent needs to address

- Operations performed on a virtual connection:

Next retrieves a pointer from the method pointing to data available to be read.
It can work for TCP, VIA and Shmem:
 bsocket can return a pointer to its internal buffer.
 VIA can return a pointer to the data portion of the current descriptor
 Shmem can return a pointer into the shmem queue
Consume tells the method that 'num' bytes have been read from the internal buffers 
and they are not needed any more.
Next and consume will not be appropriate for methods that cannot provide a pointer to 
an internal buffer.  They will also be bad for methods that can read directly into a user buffer.

int Next(VC, &buf, [IN/OUT] &len);
int Nextv(VC, &iovec, [IN/OUT] &len);
int Consume(VC, num);

Reserve_read does not request a buffer until the first byte is ready to be read.
int Reserve_read(VC, len, request);
int Read(VC, buf, len, request);
int Readv(VC, iovec, len, request);

int Write(VC, buf, len, request);
int Writev(VC, iovec, len, request);
int Reserve_write(VC, &marker);

Set_marker provides a buffer to the agent to be inserted where the marker was placed by a 
previous Reserve_write call.
What should this be changed to in order to allow pipelining?  
We will want to insert the buffer even if it is not full and inform the agent as more data
becomes available.
int Set_marker(VC, buf, len, request, marker);
int Set_markerv(VC, iovec, len, delete_flag, requests, marker);

Get and release method specific buffers.  These functions allow the buffers for two VC's to
be tied together.  This may allow for data from one VC to be read directly into the write buffer 
of another VC.  If the VC does not have a buffer available it returns NULL.
int Get_buffer(VC, &buf, &len);
int Get_bufferv(VC, &iovec, [IN/OUT] &len);
int Release_buffer(VC, buf);
int Release_bufferv(VC, iovec, len);

The queues must be locked so that multiple reads/reserve_reads or send/reserve_sends can be 
inserted in blocks.
int Lock_write(VC);
int Lock_read(VC);
int Unlock_write(VC);
int Unlock_read(VC);

There need to be progress functions for polling and blocking implementations.
int Make_progress(VC);
int Make_progress_blocking();

int Init();
int Finalize();
int Connect(VC, rank, comm or BNR_Group, &request);

-------- End of undiscussed section

The xfer functions mentioned above are really a scatter interface.  They allow for one to many
communication.  We should add a gather interface to allow for many to one.  But these two interfaces
should be separate because many to many is too hard to implement and not necessary for any of our 
collective algorithms.

xfer_gather_init(dest, tag, comm, &req);
xfer_gather_recv_op(req, buf, count, dtype, first, last, src);
xfer_gather_recv_forward_op(req, buf, count, dtype, first, last, src);
xfer_gather_forward_op(req, size);
xfer_gather_send_op(req, buf, count, dtype, first, last);
xfer_gather_start(req);

xfer_scatter_init(src, tag, comm, &req);
xfer_scatter_recv_op(req, buf, count, dtype, first, last);
xfer_scatter_recv_forward_op(req, buf, count, dtype, first, last, dest);
xfer_scatter_forward_op(req, size, dest);
xfer_scatter_send_op(req, buf, count, dtype, first, last, dest);
xfer_scatter_start(req);

xfer_gather_init and xfer_scatter_init may be the same function
xfer_gather_start and xfer_scatter_start may be the same function

What is an xfer block? - A block is defined by an init call, followed by one or more
xfer operations, and finally a start call.
Priority in a block is defined by the order of operations issued in a block.
Lower priority operations may make progress as long as they can be preempted
by higher priority operations.

Progress on blocks may occur in parallel.  Blocks are independent.

dependencies:
1) completion
2) progress
3) resource (buffer)

Xfer blocks are used to create graphs of cars. Multiple cars within an xfer block 
may be aggregated in the order issued.

Add:
xfer_scatter_recv_mop_forward_op
xfer_gather_recv_mop_forward_op
xfer_scatter_recv_mop_op
xfer_gather_recv_mop_op

mop = MPI operation
Do not add xfer_gather/scatter_mop_send_op because it is not necessary at the 
xfer level.

These above xfer_mop functions generate cars like this:
(recv_mop) - (send)
(recv) - (mop_send)

The mop_send exists at the car level, not the xfert level.

xfer_gather/scatter_recv_forward_op generates these cars:
(recv) - (send)

recv_mop can be used for accumulate

send_mop could cause remote operations to occur.  We will not use send_mop 
currently.