L4: Parallel Computing Models
Parallelism
- Average number of units of work that can be performed in parallel per unit time
- Example: MIPS, MFLOPS, average number of threads (processes) per second
- Limits in exploiting parallelism
- Program dependencies – data dependencies, control dependencies
- Runtime – memory contention, communication overheads, thread/process overhead, synchronization (coordination)
Data Parallelism
- Same operation is applied to different elements of a data set
- If the operations are independent, the elements can be distributed among the cores for parallel execution → data parallelism
- SIMD computers / instructions are designed to exploit data parallelism
- Many algorithms perform computations by iteratively traversing a large data structure
- Commonly expressed as a loop
- If the iterations are independent, iterations can be executed in arbitrary order and in parallel on different cores
// Parallelize the matrix multiplication (result = a x b)
// Each thread will work on one iteration of the outer-most loop
// Variables are shared among threads (a, b, result)
// and each thread has its own private copy (i, j, k)

#pragma omp parallel for num_threads(8) shared(a, b, result) private(i, j, k)
for (i = 0; i < size; i++)
    for (j = 0; j < size; j++)
        for (k = 0; k < size; k++)
            result.element[i][j] += a.element[i][k] * b.element[k][j];

Data Parallelism on MIMD:
- Common model: SPMD
- One parallel program is executed by all cores in parallel (both shared and distributed address space)

E.g: Scalar product \vec{x} \cdot \vec{y} on p processing units
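A minimal SPMD-style sketch of the scalar product in C with OpenMP (the block partitioning and variable names are illustrative assumptions, not from the source): every thread executes the same code, selects its own block of x and y from its thread id, and the private partial sums are combined at the end.

#include <omp.h>

// SPMD sketch: all threads run this same region; each works on its own block.
double scalar_product(const double *x, const double *y, int n) {
    double result = 0.0;
    #pragma omp parallel
    {
        int p  = omp_get_num_threads();              // number of processing units
        int id = omp_get_thread_num();               // rank of this thread
        int lo = (int)((long long)id * n / p);       // start of this thread's block
        int hi = (int)((long long)(id + 1) * n / p); // end of this thread's block
        double local = 0.0;                          // private partial sum
        for (int i = lo; i < hi; i++)
            local += x[i] * y[i];
        #pragma omp atomic                           // combine partial results safely
        result += local;
    }
    return result;
}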

Task (Functional) Parallelism:
- Independent program parts (tasks) can be executed in parallel
- Tasks: single statement, series of statements, loops or function calls
- E.g: Query: Model = "civic" AND Year = "2001" AND (Color = "green" OR Color = "white") → the individual comparisons and the AND/OR combination steps can be evaluated as separate tasks
- Further decomposition: A single task can be executed sequentially by one processing unit, or in parallel by multiple processing units
Task Dependence Graph:
- Can be used to visualize and evaluate the task decomposition strategy
- A directed acyclic graph:
- Node: Represents a task; the node value is the expected execution time
- Edge: Represents a control dependency between tasks
- Properties:
- Critical path length: Sum of task times along the longest path through the graph → lower bound on the parallel completion time
- Average degree of concurrency = Total Work / Critical Path Length (see the worked example after this list)
- An indication of amount of work that can be done concurrently
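A small worked example (hypothetical numbers, not from the source): consider tasks A, B, C, D with expected times 2, 3, 4, 1 and edges A→B, A→C, B→D, C→D. The total work is 2 + 3 + 4 + 1 = 10; the longest path is A→C→D with length 2 + 4 + 1 = 7, so the critical path length is 7 and the average degree of concurrency is 10/7 ≈ 1.43.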


Models of Coordination
Shared Address Space
- Communication abstraction
- Tasks communicate by reading/writing from/to shared variables
- Ensure mutual exclusion via use of locks
- Logical extension of uniprocessor programming
- Requires hardware support to implement efficiently
- Any processor can load and store from any address – contention
- Even with NUMA, costly to scale
- Matches shared memory systems – UMA, NUMA, etc
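A minimal sketch of this model in C with pthreads (the counter, thread count, and loop bound are illustrative assumptions): all threads read and write the shared variable directly, and a lock provides the mutual exclusion mentioned above.

#include <pthread.h>

static long shared_counter = 0;                      // shared variable
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

// Each thread increments the shared counter; the lock turns the
// read-modify-write into a critical section.
void *worker(void *arg) {
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);
        shared_counter++;                            // ordinary load/store to shared memory
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    return 0;
}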
Data Parallel
- Historically: same operation on each element of an array → SIMD, vector processors
- Basic structure: map a function onto a large collection of data
- Functional: side-effect-free execution
- No communication among distinct function invocations → Allows invocations to be scheduled in parallel
- Stream programming model
- Modern performance-oriented data-parallel languages do not enforce this SIMD structure anymore (SPMD instead).
- E.g: CUDA, OpenCL, ISPC
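A minimal map sketch in C with OpenMP (the function f and array names are assumptions; the listed languages express the same idea): a side-effect-free function is applied to every element of a collection, so distinct invocations need no communication and can be scheduled in parallel.

// Apply f to each element; iterations are independent (no side effects).
float f(float x) { return x * x + 1.0f; }

void map_f(const float *in, float *out, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        out[i] = f(in[i]);
}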
Message Passing
- Tasks operate within their own private address spaces and communicate by explicitly sending/receiving messages
- Popular software library: MPI (message passing interface)
- Hardware does not implement system-wide loads and stores
- Can connect commodity systems together to form large parallel machine
- Matches distributed memory systems → Programming model for clusters, supercomputers, etc.
- It is common to implement message-passing abstractions on shared memory machines (hardware)
- “Sending message” means copying data into message library buffers
- “Receiving message” means copying data from message library buffers
- It is possible to implement shared address space abstraction on machines that do not support it in hardware
- Less efficient software solutions
- Modifying a shared variable: send messages to invalidate all copies of the memory page containing the shared variable
- Reading a shared variable: page-fault handler issues appropriate network requests (messages)
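A minimal message-passing sketch in C with MPI (the value and ranks are illustrative): each process owns a private address space, and data moves between processes only through explicit send/receive calls. Run with at least two processes (e.g., mpirun -np 2).

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1) {
        int value = 42;                               // lives only in rank 1's memory
        MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    } else if (rank == 0) {
        int value;
        MPI_Recv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 0 received %d\n", value);        // a copy now exists in rank 0's memory
    }

    MPI_Finalize();
    return 0;
}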
Shared address space:
- very little structure
- All threads can read and write to all shared variables
- Drawback: not all reads and writes have the same cost (and that cost is not apparent in program text)
Data-parallel:
- very rigid computation structure
- Programs perform the same function on different data elements in a collection
Message passing:
- highly structured communication
- All communication occurs in the form of messages
Program Parallelization
Foster’s Design Methodology
1. Partitioning
- Divide computation and data into independent pieces to discover maximum parallelism
- Different way of thinking about problems – reveals structure in a problem, and hence opportunities for optimization
Data Centric: Domain decomposition → Data Parallelism
- Divide data into pieces of approximately equal size
- Determine how to associate computations with the data
Computation Centric: Functional decomposition → Task Parallelism
- Divide computation into pieces (tasks)
- Determine how to associate data with the computations
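A minimal domain-decomposition sketch in C (the function name and index convention are assumptions): n data elements are divided into p pieces of approximately equal size, and piece rank owns the half-open index range [lo, hi).

// Block decomposition: compute the index range owned by one piece.
void block_range(int n, int p, int rank, int *lo, int *hi) {
    *lo = (int)((long long)rank * n / p);
    *hi = (int)((long long)(rank + 1) * n / p);
}

For example, n = 10 and p = 4 gives pieces of sizes 2, 3, 2, 3.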
Partitioning Rules of Thumb:
- At least 10x more primitive tasks than cores in target computer
- Minimize redundant computations and redundant data storage
- Primitive tasks are roughly of the same size
- Number of tasks is an increasing function of problem size
2. Communication (Coordination)
- Tasks are intended to execute in parallel, but generally cannot execute independently
→ Need to determine data passed among tasks
Local communication:
- Task needs data from a small number of other tasks (“neighbors”)
- Create channels illustrating data flow
Global communication:
- Significant number of tasks contribute data to perform a computation
- Don’t create channels for them early in design
Communication Rules of Thumb:
- Communication operations balanced among tasks
- Each task communicates with only a small group of neighbors
- Tasks can perform communication in parallel
- Overlap computation with communication
3. Agglomeration
- Combine tasks into larger tasks → Number of tasks >= number of cores
- Goals:
- Improve performance (cost of task creation + communication)
- Maintain scalability of program
- Simplify programming
- Motivation: Eliminate communication between primitive tasks agglomerated into consolidated task
Agglomeration Rules of Thumb:
- Locality of parallel algorithm has increased
- Number of tasks increases with problem size
- Number of tasks suitable for likely target systems
- Tradeoff between agglomeration and code modifications costs is reasonable
4. Mapping
- Assignment of tasks to execution units
- Conflicting goals:
- Maximize processor utilization – place tasks on different processing units to increase parallelism
- Minimize inter-processor communication – place tasks that communicate frequently on the same processing unit to increase locality
- Mapping may be performed by:
- OS for centralized multiprocessor
- User for distributed memory systems
Mapping Rules of Thumb:
- Finding an optimal mapping is NP-hard in general → must rely on heuristics
- Consider designs based on one task per core and multiple tasks per core
- Evaluate static and dynamic task allocation
- If dynamic task allocation is chosen, the task allocator should not be a bottleneck to performance
- If static task allocation is chosen, the ratio of tasks to cores is at least 10:1
Automatic Parallelization
Parallelizing compilers perform decomposition and scheduling
Drawbacks:
- Dependence analysis is difficult for pointer based computations or indirect addressing
- Execution time of function calls or loops with unknown bounds is difficult to predict at compile time
Functional Programming Languages
Describe the computations of a program as the evaluation of mathematical functions without side effects
Advantage: New language constructs are not necessary to enable parallel execution
Challenge: Extract the parallelism at the right level of recursion
Parallel Programming Patterns
Fork-Join
Task T creates child tasks
- The children run in parallel and are independent of each other
- The children can execute the same or a different program part, or function
- Children might join the parent at different times
Implementation: Processes, threads, and any paradigm that makes use of these concepts
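A minimal fork-join sketch in C with pthreads (the child functions are hypothetical): the parent creates two child threads that run independently and may be joined at different times.

#include <pthread.h>
#include <stdio.h>

void *child_a(void *arg) { puts("child A working"); return NULL; }
void *child_b(void *arg) { puts("child B working"); return NULL; }

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, child_a, NULL);   // fork child A
    pthread_create(&b, NULL, child_b, NULL);   // fork child B
    pthread_join(a, NULL);                     // join child A
    pthread_join(b, NULL);                     // join child B (may finish at a different time)
    return 0;
}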
Parbegin–Parend
Programmer specifies a sequence of statements (function calls) to be executed by a set of cores in parallel
- When an executing thread reaches a parbegin-parend construct, a set of threads is created and the statements of the construct are assigned to these threads for execution
- Usually, the threads execute the same code (function)
The statements following the parbegin-parend construct are executed only after all these threads have finished their work
→ Like a fork-join pattern in which all forks happen at the same time and all joins happen at the same time
Implementation: A language construct such as OpenMP or compiler directives
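A minimal sketch of a parbegin-parend construct in C using OpenMP sections (task_a, task_b, task_c are hypothetical functions): each section is assigned to a thread, and execution continues past the construct only after all sections have completed (implicit barrier).

void task_a(void);   // hypothetical functions defined elsewhere
void task_b(void);
void task_c(void);

void parbegin_parend_example(void) {
    #pragma omp parallel sections
    {
        #pragma omp section
        { task_a(); }
        #pragma omp section
        { task_b(); }
        #pragma omp section
        { task_c(); }
    }
    // statements here execute only after all three sections have finished
}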
SIMD
Single instructions are executed synchronously by the different threads on different data
Similar to parbegin-parend, all threads execute the same instruction at the same time
Implementation: AVX/SSE instructions on Intel processors
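A minimal SIMD sketch in C using AVX intrinsics (array names are assumptions; n is assumed to be a multiple of 8, otherwise a remainder loop is needed): one instruction operates on eight float lanes at a time.

#include <immintrin.h>

void vec_add(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);    // load 8 floats
        __m256 vb = _mm256_loadu_ps(&b[i]);
        __m256 vc = _mm256_add_ps(va, vb);     // 8 additions in one instruction
        _mm256_storeu_ps(&c[i], vc);
    }
}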
SPMD
Same program executed on different cores but operate on different data
- Different threads may execute different parts of the parallel program because of
- Different speeds of the executing cores
- Control statement in the program, e.g., If statement
- Similar to parbegin-parend, but SPMD is the preferred name when we do not follow the pattern
No implicit synchronization
- Synchronization can be achieved by explicit synchronization operations
Implementation: Programs running on GPGPU
Master–Worker
A single master thread (or process) controls the execution of the program
- Master executes the main function
- Assigns work to worker threads
Master task: Generally responsible for coordination; performs initializations, timings, and output operations
Worker task: Waits for instructions from the master task
Task (Work) Pools
A common data structure from which threads retrieve tasks for execution
Number of threads is fixed
- Threads are created statically by the main thread
- Once a task is finished, the worker thread retrieves another task from the pool
- Work is not pre-allocated to the worker threads; instead, a new task is retrieved from the pool by the worker thread
During the processing of a task, a thread can generate new tasks and insert them into the task pool
Access to the task pool must be synchronized to avoid race conditions
Execution of a parallel program is completed when
- Task pool is empty
- Each thread has terminated the processing of its last task
Advantages:
- Useful for adaptive and irregular applications
- Tasks can be generated dynamically
- Overhead for thread creation is independent of the problem size and the number of tasks
Disadvantages:
- For fine grained tasks, the overhead of retrieval and insertion of tasks becomes important
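A minimal task-pool sketch in C with pthreads (the task count, task body, and names are assumptions; dynamic insertion of new tasks is omitted): a fixed set of worker threads repeatedly take the next task from a shared pool, and a mutex synchronizes access to the pool.

#include <pthread.h>

#define NUM_TASKS 64

static int next_task = 0;                        // the pool: index of the next unprocessed task
static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;

void process_task(int id);                       // hypothetical task body

void *worker(void *arg) {
    for (;;) {
        pthread_mutex_lock(&pool_lock);          // synchronized access to the pool
        int task = (next_task < NUM_TASKS) ? next_task++ : -1;
        pthread_mutex_unlock(&pool_lock);
        if (task < 0) break;                     // pool is empty → worker terminates
        process_task(task);
    }
    return NULL;
}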
Producer–Consumer
Producer threads produce data which are used as input by consumer threads
Synchronization has to be used to ensure correct coordination between producer and consumer threads
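A minimal producer-consumer sketch in C with pthreads (the buffer size and names are assumptions): a bounded buffer is shared between producer and consumer threads; a mutex plus condition variables provide the coordination, so producers wait while the buffer is full and consumers wait while it is empty.

#include <pthread.h>

#define CAP 16

static int buf[CAP];
static int count = 0, in = 0, out = 0;
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t not_full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;

void produce(int item) {
    pthread_mutex_lock(&m);
    while (count == CAP)                         // buffer full → producer waits
        pthread_cond_wait(&not_full, &m);
    buf[in] = item; in = (in + 1) % CAP; count++;
    pthread_cond_signal(&not_empty);             // wake a waiting consumer
    pthread_mutex_unlock(&m);
}

int consume(void) {
    pthread_mutex_lock(&m);
    while (count == 0)                           // buffer empty → consumer waits
        pthread_cond_wait(&not_empty, &m);
    int item = buf[out]; out = (out + 1) % CAP; count--;
    pthread_cond_signal(&not_full);              // wake a waiting producer
    pthread_mutex_unlock(&m);
    return item;
}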
Pipelining
Data in the application is partitioned into a stream of data elements that flows through the pipeline stages one after the other to perform different processing steps
A form of functional parallelism: Stream parallelism