
L3: Parallel Computing Architectures

Source of Processor Performance Gain

Bit Level Parallelism

Word size (typically 16/32/64 bits) may refer to:

  • Unit of transfer between processor and memory
  • Memory address space capacity
  • Integer size
  • Single precision floating point number size

Instruction Level Parallelism

Pipelining:

  • Split instruction execution into multiple stages, e.g. Fetch (IF), Decode (ID), Execute (EX), Write-Back (WB)
  • Allow multiple instructions to occupy different stages in the same clock cycle
    • Provided there are no data / control dependencies
  • Number of pipeline stages == maximum achievable speedup
  • Disadvantages: instructions must be independent; bubbles (stalls); data and control hazards

Superscalar: Duplicate the pipelines:

  • Allow multiple instructions to pass through the same stage
  • Scheduling is challenging (decide which instructions can be executed together):
    • Dynamic (Hardware decision)
    • Static (Compiler decision)
  • Disadvantages: structural hazard, data/control dependencies

Pipelined vs Superscalar Processor:

(Figure: execution timelines of a pipelined vs a superscalar processor)

Thread Level Parallelism

(Figures: thread level parallelism; thread level parallelism hierarchy)

  • Processor can provide hardware support for multiple “thread contexts”: simultaneous multithreading (SMT)
    • Information specific to each thread, e.g. Program Counter, Registers, etc
    • Software threads can then execute in parallel
  • E.g. Intel processors with Hyper-Threading Technology: each i7 core can execute 2 threads at the same time

Processor Level Parallelism (Multiprocessing)

  • Add more cores to the processor
  • The application should have multiple execution flows
    • Each process/thread needs an independent context that can be mapped to multiple processor cores

Flynn’s Parallel Architecture Taxonomy

Instruction stream: A single execution flow, i.e. a single Program Counter (PC)

Data stream: Data being manipulated by the instruction stream

Single Instruction Single Data (SISD)

(Figure: SISD)
  • A single instruction stream is executed
  • Each instruction works on a single data item
  • Most uniprocessors fall into this category

Single Instruction Multiple Data (SIMD)

(Figure: SIMD)
  • A single stream of instructions
  • Each instruction works on multiple data
  • Exploits data parallelism; commonly known as a vector processor
  • Same instruction broadcasted to all ALUs
  • Not great for divergent executions

Multiple Instruction Single Data (MISD)

(Figure: MISD)
  • Multiple instruction streams
  • All instruction streams work on the same data at any time
  • No actual implementation except for the systolic array

Multiple Instruction Multiple Data (MIMD)

(Figure: MIMD)
  • Each processing unit (PU) fetches its own instructions
  • Each PU operates on its own data
  • Currently the most popular model for multiprocessors

Multicore Architecture

Hierarchical Design

(Figure: hierarchical design)
  • Multiple cores share multiple caches
  • Cache size increases from the leaves to the root
  • Each core can have a separate L1 cache and share the L2 cache with other cores
  • All cores share the common external memory
  • Usages: Standard desktop, Server processors, Graphics processing units

Pipelined Design

(Figure: pipelined design)
  • Data elements are processed by multiple execution cores in a pipelined way
  • Useful if the same computation steps have to be applied to a long sequence of data elements
  • E.g. processors used in routers and graphics processors

Network-based Design

(Figures: network-based designs)
  • Cores and their local caches and memories are connected via an interconnection network

Memory Organization

(Figure: memory organization)

Distributed Memory System

(Figure: distributed memory system)

  • Each node is an independent unit
    • With processor, memory and, sometimes, peripheral elements
  • Physically distributed memory modules
    • Memory in a node is private

Shared Memory System

(Figure: shared memory system)

(Figure: Intel Core i7 (quad core); the interconnect is a ring)

  • Parallel programs / threads access memory through the shared address space
  • Program is unaware of the actual hardware memory architecture
    • The hardware must still provide cache coherence and memory consistency

Cache Coherence:

  • Multiple copies of the same data exist on different caches
  • Local update by a processor → other processors must not keep seeing the stale, unchanged data


Memory Consistency:

  • Defines the order in which memory operations of one processor become visible to the other processors

Uniform Memory Access (UMA):

  • Latency of accessing the main memory is the same for every processor
  • Suitable only for a small number of processors, due to contention


Non-Uniform Memory Access (NUMA):


(Figure: modern multi-socket configuration)

  • Physically distributed memory of all processing elements are combined to form a global shared-memory address space
    • also called distributed shared-memory
  • Accessing local memory is faster than remote memory for a processor

ccNUMA:

  • Cache Coherent Non-Uniform Memory Access
    • Each node has cache memory to reduce contention


COMA:

  • Cache Only Memory Architecture
    • Each memory block works as cache memory
    • Data migrates dynamically and continuously according to the cache coherence scheme

Advantages:

  • No need to partition code or data
  • No need to physically move data among processors → communication is efficient

Disadvantages:

  • Special synchronization constructs are required
  • Lack of scalability due to contention

Hybrid (Distributed-Shared Memory)

(Figure: hybrid with shared-memory multicore processors)

(Figure: hybrid with shared-memory multicore processor and GPU)