L3: Parallel Computing Architectures

Source of Processor Performance Gain

Bit Level Parallelism

Word size (e.g. 16/32/64 bits) may mean:

Unit of transfer between processor and memory
Memory address space capacity
Integer size
Single precision floating point number size
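
Word size matters because an operation on operands wider than one word has to be split into several word-sized steps. A minimal sketch of that effect, assuming unsigned 64-bit operands; the function name add_with_word_size is hypothetical and only for illustration:

```python
def add_with_word_size(a, b, word_bits, total_bits=64):
    """Add two unsigned total_bits-wide integers using word_bits-wide adds.

    Returns (result, number_of_word_adds). Illustrative model only.
    """
    mask = (1 << word_bits) - 1
    result, carry, steps = 0, 0, 0
    for shift in range(0, total_bits, word_bits):
        # Add one word-sized slice of each operand plus the incoming carry.
        partial = ((a >> shift) & mask) + ((b >> shift) & mask) + carry
        result |= (partial & mask) << shift
        carry = partial >> word_bits
        steps += 1
    return result & ((1 << total_bits) - 1), steps


x, y = 0x0123456789ABCDEF, 0x0FEDCBA987654321
for bits in (16, 32, 64):
    total, steps = add_with_word_size(x, y, bits)
    assert total == (x + y) & (2**64 - 1)
    print(f"{bits}-bit words: {steps} add step(s)")
```

In this model a 64-bit addition needs 4 steps with 16-bit words, 2 with 32-bit words and 1 with 64-bit words.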

Instruction Level Parallelism

Pipelining:

Split instruction execution into multiple stages, e.g. Fetch (IF), Decode (ID), Execute (EX), Write-Back (WB)
Allow multiple instructions to occupy different stages in the same clock cycle
- Provided there are no data or control dependencies between them
Number of pipeline stages == Maximum achievable speedup
Disadvantages: requires independent instructions; bubbles (stalls); hazards from data and control flow dependencies
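
The speedup bound can be checked with the usual idealized timing model. This is a sketch that assumes every stage takes one cycle: n instructions on a k-stage pipeline take k + (n - 1) cycles instead of n * k. The function names and the stall_cycles parameter are hypothetical.

```python
def pipelined_cycles(n_instructions, n_stages, stall_cycles=0):
    """Cycles to finish n instructions on an ideal k-stage pipeline.

    The first instruction needs n_stages cycles; each later one completes
    one cycle after its predecessor. stall_cycles models bubbles.
    """
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1) + stall_cycles


def speedup(n_instructions, n_stages, stall_cycles=0):
    """Speedup over an unpipelined processor taking n_stages cycles each."""
    unpipelined = n_instructions * n_stages
    return unpipelined / pipelined_cycles(n_instructions, n_stages, stall_cycles)


for n in (1, 10, 1000):
    print(n, round(speedup(n, 4), 3))
```

The speedup approaches the number of stages as the instruction count grows, and every bubble pushes it further below that bound.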

Superscalar: Duplicate the pipelines:

Allow multiple instructions to pass through the same stage in the same clock cycle
Scheduling is challenging (decide which instructions can be executed together):
- Dynamic (Hardware decision)
- Static (Compiler decision)
Disadvantages: structural hazards, data/control dependencies
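
The scheduling decision can be sketched as grouping instructions into issue cycles. The model below is a simplification under stated assumptions: in-order issue, results available in the next cycle, and a hypothetical (dest_register, [source_registers]) instruction format. It is not how any particular processor or compiler does it.

```python
def schedule_in_order(instructions, width=2):
    """Group instructions into issue cycles for an in-order superscalar core.

    Each instruction is (dest_register, [source_registers]). An instruction
    joins the current group only if the group has a free slot and it neither
    reads nor writes a register written by an instruction already in the group.
    """
    cycles, group, written = [], [], set()
    for dest, sources in instructions:
        conflict = dest in written or any(s in written for s in sources)
        if len(group) == width or conflict:
            cycles.append(group)          # close the current issue group
            group, written = [], set()
        group.append((dest, sources))
        written.add(dest)
    if group:
        cycles.append(group)
    return cycles


program = [
    ("r1", ["r0"]),        # r1 = f(r0)
    ("r2", ["r0"]),        # independent of r1: can issue with it
    ("r3", ["r1", "r2"]),  # needs r1 and r2: must wait
    ("r4", ["r3"]),        # needs r3: must wait again
]
for cycle, group in enumerate(schedule_in_order(program)):
    print(cycle, [dest for dest, _ in group])
```

For this program the dual-issue model produces three issue groups (r1 with r2, then r3, then r4), so the dependency chain, not the issue width, limits the gain.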

Pipelined vs Superscalar Processor: (figure omitted)

Thread Level Parallelism

Processor can provide hardware support for multiple “thread contexts”, known as simultaneous multithreading (SMT)
- A thread context holds the information specific to each thread, e.g. Program Counter, Registers, etc.
- Software threads can then execute in parallel
E.g. Intel processors with Hyper-Threading technology: each i7 core can execute 2 threads at the same time

Processor Level Parallelism (Multiprocessing)

Add more cores to the processor
The application should have multiple execution flows
- Each process/thread needs an independent context that can be mapped to multiple processor cores
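
A minimal sketch of an application with multiple execution flows, using Python's standard concurrent.futures module. The function names and the chunking scheme are hypothetical. In standard CPython the global interpreter lock keeps threads from running Python bytecode on several cores at once, so this shows the structure (independent flows over independent data), not a measured speedup.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def partial_sum(chunk):
    """Work done by one execution flow on its own slice of the data."""
    return sum(v * v for v in chunk)


def parallel_sum_of_squares(values, n_workers=None):
    """Split values into n_workers chunks and process each in its own thread."""
    n_workers = n_workers or os.cpu_count() or 1
    size = max(1, -(-len(values) // n_workers))  # ceiling division
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(partial_sum, chunks))


data = list(range(1_000))
print(parallel_sum_of_squares(data, n_workers=4))
```

Each chunk carries everything its worker needs, which is what allows the flows to be mapped to different cores.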

Flynn’s Parallel Architecture Taxonomy

Instruction stream: A single execution flow i.e. a single Program Counter (PC)

Data stream: Data being manipulated by the instruction stream

Single Instruction Single Data (SISD)

A single instruction stream is executed
Each instruction works on a single data item
Most of the uniprocessors fall into this category

Single Instruction Multiple Data (SIMD)

A single stream of instructions
Each instruction works on multiple data
Exploits data parallelism; a processor of this kind is commonly known as a vector processor
Same instruction is broadcast to all ALUs
Inefficient for divergent execution, e.g. data-dependent branches that differ between data elements
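
Why divergence hurts can be sketched by emulating masked execution in plain Python. This is an illustrative model, not a real SIMD API: all lanes follow the same instruction, so an if/else is handled by issuing both branch bodies and letting a mask pick the lanes that keep each result. The function name simd_if_else is hypothetical.

```python
def simd_if_else(values, predicate, then_op, else_op):
    """Emulate a SIMD if/else over all lanes using a mask.

    A branch body is issued whenever at least one lane needs it, and the
    mask selects which lanes keep its result. Returns (results, issued),
    where issued counts the branch bodies executed for the whole vector.
    """
    mask = [predicate(v) for v in values]      # one compare for all lanes
    results = list(values)
    issued = 0
    if any(mask):                              # "then" side, masked lanes only
        results = [then_op(v) if m else r for v, m, r in zip(values, mask, results)]
        issued += 1
    if not all(mask):                          # "else" side, remaining lanes
        results = [r if m else else_op(v) for v, m, r in zip(values, mask, results)]
        issued += 1
    return results, issued


def is_even(v):
    return v % 2 == 0


print(simd_if_else([2, 4, 6, 8], is_even, lambda v: v // 2, lambda v: 3 * v + 1))
print(simd_if_else([1, 2, 3, 4], is_even, lambda v: v // 2, lambda v: 3 * v + 1))
```

When all lanes agree, only one branch body is issued; when they diverge, both are issued and some lanes sit idle in each.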

Multiple Instruction Single Data (MISD)

Multiple instruction streams
All instructions work on the same data at any time
No practical implementation, with the systolic array as a possible exception

Multiple Instruction Multiple Data (MIMD)

Each processing unit (PU) fetches its own instruction
Each PU operates on its own data
Currently the most popular model for multiprocessors

Multicore Architecture

Hierarchical Design

Multiple cores share multiple caches
Cache size increases from the leaves to the root
Each core can have a separate L1 cache and shares the L2 cache with other cores
All cores share the common external memory
Used in: standard desktop processors, server processors, graphics processing units

Pipelined Design

Data elements are processed by multiple execution cores in a pipelined way
Useful if the same computation steps have to be applied to a long sequence of data elements
E.g. processors used in routers and graphics processors
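
The pipelined design can be sketched in software with one thread per stage, each thread standing in for a core, connected by FIFO queues. This is an analogy under that assumption, not a description of any router or graphics processor; the names stage, run_pipeline and DONE are hypothetical.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the data stream


def stage(func, inbox, outbox):
    """One pipeline stage: apply func to each item and pass it on."""
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)   # propagate shutdown to the next stage
            return
        outbox.put(func(item))


def run_pipeline(funcs, items):
    """Run items through funcs, one thread (standing in for a core) per stage."""
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    workers = [
        threading.Thread(target=stage, args=(f, queues[i], queues[i + 1]))
        for i, f in enumerate(funcs)
    ]
    for w in workers:
        w.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(DONE)
    results = []
    while (out := queues[-1].get()) is not DONE:
        results.append(out)
    for w in workers:
        w.join()
    return results


print(run_pipeline([lambda x: x + 1, lambda x: x * 2], [1, 2, 3]))
```

While the second stage works on one element, the first stage can already process the next, which is why this layout suits long sequences of data elements.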

Network-based Design

Cores and their local caches and memories are connected via an interconnection network

Memory Organization

Distributed Memory System

Each node is an independent unit
- With processor, memory and, sometimes, peripheral elements
Physically distributed memory modules
- Memory in a node is private
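
Because memory in a node is private, data reaches another node only through an explicit message. A toy sketch of that idea; the Node class and its send/receive methods are hypothetical and do not correspond to a real message-passing library:

```python
from collections import deque


class Node:
    """A node with private memory; other nodes never access self.memory."""

    def __init__(self, name):
        self.name = name
        self.memory = {}        # private to this node
        self.mailbox = deque()  # incoming messages from the network

    def send(self, other, key):
        """Copy a value to another node as an explicit message."""
        other.mailbox.append((self.name, key, self.memory[key]))

    def receive(self):
        """Store the oldest pending message in local memory."""
        sender, key, value = self.mailbox.popleft()
        self.memory[key] = value
        return sender


a, b = Node("A"), Node("B")
a.memory["x"] = 42
assert "x" not in b.memory   # B has no view of A's memory
a.send(b, "x")
b.receive()
print(b.memory["x"])
```

After the exchange each node holds its own copy of x, so a later update on one node is not visible on the other unless another message is sent.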

Shared Memory System

Example: Intel Core i7 (quad core), where the interconnect is a ring

Parallel programs / threads access memory through the shared memory provider
Program is unaware of the actual hardware memory architecture
- This raises two issues: cache coherence and memory consistency

Cache Coherence:

Multiple copies of the same data exist on different caches
Local update by one processor → other processors should not see the stale (old) data
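
The stale-read problem can be reproduced with a toy model of two private caches in front of one main memory. The write-through, write-invalidate scheme used here is an assumption chosen for brevity (the notes do not name a protocol), and all class names are hypothetical.

```python
class Cache:
    def __init__(self):
        self.lines = {}  # address -> cached value


class SharedMemorySystem:
    """Toy model: one main memory, one private cache per processor."""

    def __init__(self, n_processors, coherent):
        self.memory = {}
        self.caches = [Cache() for _ in range(n_processors)]
        self.coherent = coherent

    def read(self, cpu, addr):
        lines = self.caches[cpu].lines
        if addr not in lines:                 # miss: fetch from main memory
            lines[addr] = self.memory.get(addr, 0)
        return lines[addr]

    def write(self, cpu, addr, value):
        self.caches[cpu].lines[addr] = value
        self.memory[addr] = value             # write-through, for simplicity
        if self.coherent:                     # write-invalidate other copies
            for other, cache in enumerate(self.caches):
                if other != cpu:
                    cache.lines.pop(addr, None)


for coherent in (False, True):
    system = SharedMemorySystem(2, coherent)
    system.read(1, "x")         # CPU 1 caches x = 0
    system.write(0, "x", 7)     # CPU 0 updates x locally
    print(coherent, system.read(1, "x"))
```

Without invalidation CPU 1 keeps reading its old copy (0); with invalidation its next read misses and fetches the new value (7).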

Memory Consistency:

Uniform Memory Access (Time) (UMA):

Latency of accessing the main memory is the same for every processor
Suitable only for a small number of processors, because contention for the shared memory grows with the processor count

Non-Uniform Memory Access (NUMA):

The physically distributed memories of all processing elements are combined to form a global shared-memory address space
- also called distributed shared-memory
Accessing local memory is faster than remote memory for a processor
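
The cost of remote accesses can be estimated with a simple weighted average. This is a sketch: the latency figures below are hypothetical values chosen only to show the trend, not measurements of any machine.

```python
def average_latency(local_fraction, local_ns, remote_ns):
    """Expected memory latency when local_fraction of accesses hit local memory."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


# Hypothetical latencies, chosen only to show the trend.
LOCAL_NS, REMOTE_NS = 100.0, 300.0
for fraction in (1.0, 0.75, 0.5, 0.0):
    print(fraction, average_latency(fraction, LOCAL_NS, REMOTE_NS))
```

The more accesses a processor can keep in its local memory, the closer its average latency stays to the local figure, which is why data placement matters on NUMA systems.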

ccNUMA:

Cache Coherent Non-Uniform Memory Access
- Each node has cache memory to reduce contention

COMA:

Cache Only Memory Architecture
- Each memory block works as cache memory
- Data migrates dynamically and continuously according to the cache coherence scheme

Advantages of shared memory systems:

No need to partition code or data
No need to physically move data among processors → communication is efficient

Disadvantages of shared memory systems:

Special synchronization constructs are required
Lack of scalability due to contention
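
The need for synchronization constructs can be sketched with a shared counter updated by several threads, using the lock from Python's standard threading module. The function name run_counter is hypothetical.

```python
import threading


def run_counter(n_threads, increments, use_lock):
    """Increment a shared counter from several threads; return the final value."""
    state = {"count": 0}
    lock = threading.Lock()

    def worker():
        for _ in range(increments):
            if use_lock:
                with lock:                 # only one thread updates at a time
                    state["count"] += 1
            else:
                state["count"] += 1        # unsynchronized read-modify-write

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["count"]


print(run_counter(4, 10_000, use_lock=True))   # always 4 * 10_000
print(run_counter(4, 10_000, use_lock=False))  # may be lower if updates are lost
```

The lock also shows the scalability cost: all threads queue up on the same lock, so adding threads adds contention rather than throughput for this update.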

Hybrid (Distributed-Shared Memory)

Hybrid with Shared-memory Multicore Processors

Hybrid with Shared-memory Multicore Processor and GPU