
L10: Message Passing Interface (MPI)

MPI Program Structure

int MPI_Init(int *argc, char ***argv)
  • Initialize the MPI program
  • Must be called only once and before any other MPI routines

int MPI_Finalize(void)

  • Terminate all MPI processing
  • Must be the last MPI call

int MPI_Abort(MPI_Comm comm, int errorCode)

  • Force all processes to terminate
  • Return the error code to mpirun

MPI_Comm_size(MPI_COMM_WORLD, &size);

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
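
Taken together, these calls form the canonical skeleton of an MPI program. A minimal sketch (the printed message is illustrative):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int size, rank;

    MPI_Init(&argc, &argv);                 /* must come before any other MPI call */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* rank of this process: 0..size-1 */

    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();                         /* must be the last MPI call */
    return 0;
}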

Point-to-point Communication

Blocking:

  • MPI_Send, MPI_Recv
  • MPI_Sendrecv, MPI_Sendrecv_replace

Non-blocking:

  • MPI_Isend, MPI_Irecv
  • Blocking and non-blocking operations can be mixed
    • Data sent by MPI_Isend() can be received by MPI_Recv()
    • Data sent by MPI_Send() can be received by MPI_Irecv()
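
A minimal sketch of such mixing, assuming two processes (buffer and tag values are illustrative): rank 0 sends without blocking and must complete the request before reusing its buffer, while rank 1 uses a plain blocking receive.

int data = 42, recvd;
MPI_Request req;
MPI_Status status;

if (rank == 0) {
    MPI_Isend(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
    /* ... do useful work while the send is in flight ... */
    MPI_Wait(&req, &status);   /* send buffer may only be reused after completion */
} else if (rank == 1) {
    MPI_Recv(&recvd, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);   /* blocking receive matches the Isend */
}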

MPI Message Format

Message = data (the actual data to send/receive) + envelope (how to route the data)

Data = start buffer (address where the data starts) + count (number of elements of data in the message) + datatype (type of the data to be transmitted)

Envelope = destination or source (rank within a communicator) + tag (an arbitrary number to distinguish between messages) + communicator (send and receive must specify the same communicator)

MPI Send and Receive

int MPI_Send(void* buf, int count, MPI_Datatype dt,
             int dest, int tag, MPI_Comm c);

int MPI_Recv(void* buf, int count, MPI_Datatype dt,
             int src, int tag, MPI_Comm c, MPI_Status *status);

  • The received message must not be longer than the receive buffer
  • When receiving a message, you can use:
    • src = MPI_ANY_SOURCE → accept a message from any process
    • tag = MPI_ANY_TAG → accept a message with any tag
  • MPI_Status is a structure with the fields MPI_SOURCE, MPI_TAG, and MPI_ERROR
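
A sketch of a wildcard receive that afterwards inspects the envelope through the status object (the buffer size is illustrative):

int buf[8], count;
MPI_Status status;

/* accept the next message from any sender, with any tag */
MPI_Recv(buf, 8, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);

printf("message from rank %d with tag %d\n", status.MPI_SOURCE, status.MPI_TAG);
MPI_Get_count(&status, MPI_INT, &count);   /* actual number of elements received */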

Classification of the point-to-point operations:

  • Blocking, synchronous: MPI_Ssend, MPI_Rsend (receive side: MPI_Mrecv)
  • Blocking, asynchronous: MPI_Send, MPI_Recv (may be buffered); MPI_Bsend (buffered)
  • Non-blocking, synchronous: MPI_Issend (receive side: MPI_Imrecv)
  • Non-blocking, asynchronous: MPI_Isend, MPI_Irecv

Order of Receive Operations:

  • 2 processes (one sender, one receiver): if a sender sends two or more messages to the same receiver, they are delivered in the order in which they were sent
  • If more than 2 processes: Message delivery order not guaranteed!

Deadlocks in MPI

  • Message order: deadlock occurs when the message-passing pattern cannot complete
    • Process 0 waits for process 1 and vice versa → always deadlocks!
  • System buffers: deadlock occurs if the runtime system does not use system buffers, or if the system buffers are too small
    • No system buffer / buffer too small → deadlock, because neither send can complete (see the sketch below)
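
A sketch of the classic head-to-head exchange between two ranks: the first version deadlocks whenever the system does not buffer the sends, while MPI_Sendrecv lets the library pair the two transfers safely (the buffer names and N are illustrative).

#define N 1024                     /* illustrative message length */

double sendbuf[N], recvbuf[N];
MPI_Status status;
int partner = 1 - rank;            /* assumes exactly two processes, ranks 0 and 1 */

/* Deadlock-prone: both ranks may block in MPI_Send if nothing is buffered */
MPI_Send(sendbuf, N, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD);
MPI_Recv(recvbuf, N, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &status);

/* Safe: the combined send/receive has no ordering constraint between the ranks */
MPI_Sendrecv(sendbuf, N, MPI_DOUBLE, partner, 0,
             recvbuf, N, MPI_DOUBLE, partner, 0,
             MPI_COMM_WORLD, &status);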

Process Groups and Communicators

Process Group:

  • An ordered set of processes, each process has a unique rank
  • A process may be a member of multiple groups, and may have different ranks in each of these groups
  • MPI system handles the representation and management of process groups

Communicator:

  • Communicator is the communication domain for a group of processes
  • Intra-communicators: Support the execution of arbitrary collective communication operations on a single group
    • Default: MPI_COMM_WORLD
  • Inter-communicators: Support the point-to-point communication operations between two process groups

Rationale:

  • Allow us to organize tasks, based on functions, into task groups
  • Enable collective communication operations across a subset of related tasks
  • Provide the basis for user-defined virtual topologies
  • Provide for safe communication

Operations on Process Group:

  • Obtain the group associated with a communicator: MPI_Comm_group

  • Get size of a group: MPI_Group_size

  • Get rank of the calling process in a group: MPI_Group_rank

  • Group union: MPI_Group_union

  • Group intersection: MPI_Group_intersection

  • Group difference: MPI_Group_difference

  • Group inclusion: MPI_Group_incl

  • Group exclusion: MPI_Group_excl

  • Group compare: MPI_Group_compare

  • Delete group: MPI_Group_free
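
A minimal sketch combining several of these calls: take the group of MPI_COMM_WORLD, include a subset of ranks, and build a communicator for it (the ranks array is illustrative).

MPI_Group world_group, sub_group;
MPI_Comm sub_comm;
int ranks[2] = {0, 1};   /* illustrative: keep only ranks 0 and 1 */

MPI_Comm_group(MPI_COMM_WORLD, &world_group);           /* group behind the communicator */
MPI_Group_incl(world_group, 2, ranks, &sub_group);      /* subgroup by inclusion */
MPI_Comm_create(MPI_COMM_WORLD, sub_group, &sub_comm);  /* collective over MPI_COMM_WORLD */

/* sub_comm is MPI_COMM_NULL on processes outside the subgroup */
MPI_Group_free(&sub_group);
MPI_Group_free(&world_group);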

Operations on Communicators:

  • Get size of communicator: MPI_Comm_size

  • Get rank of process in communicator: MPI_Comm_rank

  • Create communicator: MPI_Comm_create

  • Compare communicators: MPI_Comm_compare

  • Duplicate communicator: MPI_Comm_dup

  • Split communicator: MPI_Comm_split

  • Delete communicator: MPI_Comm_free
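
MPI_Comm_split is often the most convenient of these. A sketch that partitions MPI_COMM_WORLD into groups of four consecutive ranks (the color formula is illustrative):

MPI_Comm sub_comm;
int color = rank / 4;   /* processes with the same color join the same new communicator */

/* rank is passed as the key, so the relative order of the ranks is preserved */
MPI_Comm_split(MPI_COMM_WORLD, color, rank, &sub_comm);

int sub_rank;
MPI_Comm_rank(sub_comm, &sub_rank);
MPI_Comm_free(&sub_comm);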

Process Virtual Topologies

  • Sometimes an alternative way of representing and addressing the processes is useful, e.g. when processes communicate only with their neighbors in a mesh pattern
  • Create topologies where neighbors are easily addressable

Virtual Topology Operations:

  • Virtual topology: a communicator with a Cartesian style of addressing the ranks of the processes
  • Create a Cartesian topology: MPI_Cart_create
  • Get info on Cartesian topology: MPI_Cart_get
  • Get number of dimensions: MPI_Cartdim_get
  • Comm rank → Cartesian coords: MPI_Cart_coords
  • Cartesian coords → Comm rank: MPI_Cart_rank
  • Access neighbors in Cartesian coords: MPI_Cart_shift
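
A sketch tying these calls together on an illustrative 4x4 periodic mesh:

MPI_Comm cart;
int dims[2] = {4, 4}, periods[2] = {1, 1};   /* illustrative 4x4 torus */
int cart_rank, coords[2], left, right;

MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart);  /* reorder allowed */
MPI_Comm_rank(cart, &cart_rank);              /* rank may differ after reordering */
MPI_Cart_coords(cart, cart_rank, 2, coords);  /* rank -> (row, column) */
MPI_Cart_shift(cart, 1, 1, &left, &right);    /* neighbor ranks along dimension 1 */
MPI_Comm_free(&cart);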

Collective Communication

  • Operations that involve all processes in a communicator
    • Every process must make the matching call; otherwise: deadlock
    • Blocking operations by default
  • Two types of operations:
    • Scatter: from 1 process to many processes
    • Gather: from many processes to 1 process
  • Each variant exists with accumulation (a reduction using an arithmetic operation) or without accumulation
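
A sketch of the two basic patterns without accumulation, with rank 0 as the root (array sizes are illustrative):

int all[64];   /* illustrative: root-side array, one element per process */
int chunk;

/* scatter: one element of the root's array goes to each process */
MPI_Scatter(all, 1, MPI_INT, &chunk, 1, MPI_INT, 0, MPI_COMM_WORLD);

chunk *= 2;    /* some local work */

/* gather: one element from each process is collected at the root */
MPI_Gather(&chunk, 1, MPI_INT, all, 1, MPI_INT, 0, MPI_COMM_WORLD);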

Barrier

Processes block until all processes of the communicator reach the barrier

int MPI_Barrier(MPI_Comm comm);

Measuring Program Timings

  • Measure the parallel execution time: returns the absolute time elapsed between the start and the end of a program part

double MPI_Wtime(void);

  • Get the resolution of MPI_Wtime():

double MPI_Wtick(void);
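
A common timing idiom combines barriers with MPI_Wtime so that all processes start together and the slowest process determines the measured span (a sketch):

double start, end;

MPI_Barrier(MPI_COMM_WORLD);   /* line all processes up before timing */
start = MPI_Wtime();
/* ... program part being measured ... */
MPI_Barrier(MPI_COMM_WORLD);   /* wait for the slowest process */
end = MPI_Wtime();

if (rank == 0)
    printf("elapsed: %f s (clock resolution: %g s)\n", end - start, MPI_Wtick());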

Communication Operations

  • Scatter-Gather
  • Single Broadcast
  • Multi Broadcast (no root process)
  • Single Accumulation (gather with reduction): data must be of the same type and size
  • Multi Accumulation (no root process)
  • Total Exchange (no root process)
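
In MPI terms, single broadcast maps to MPI_Bcast and single accumulation to MPI_Reduce; the rootless multi accumulation corresponds to MPI_Allreduce. A sketch:

int value = 0, sum;
if (rank == 0) value = 100;

/* single broadcast: the root's value reaches every process */
MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

/* single accumulation: combine each rank's contribution with MPI_SUM at root 0 */
MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

/* multi accumulation (no root): every process obtains the sum */
MPI_Allreduce(&rank, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);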

Duality of Communication Operations:

  • A communication operation can be represented by a graph
  • Two communication operations are duals of each other if the same spanning tree can be used for both operations
    • Single-broadcast operation: top-down traversal of the spanning tree
    • Single-accumulation operation: bottom-up traversal of the spanning tree

Stepwise Specialization:

  • Communication operations can be ordered into a hierarchy:
    • From the most general to the most specific
    • Operations that result from stepwise specialization are placed near each other
  • Example: A stepwise specialization from total exchange to multi-broadcast
    • Total exchange: each processor sends a different message to every other processor
    • Multi-broadcast: each processor sends the same message to every other processor

Duality and Specialization