Skip to content

L4: Resource Hosting & Datacenter

Resource Hosting

On-prem:

  • organizations responsible for deploying, operating and maintaining internet connectivity
  • have complete control over QoS (Quality of Service)
  • safeguard using firewalls and monitoring software

Cloud-based:

  • use mutiple ISPs → difficult QoS management
  • need multiple could carriers: connectivity and reliability of cloud application
  • easier to adopt for applications with relaxed bandwidth and latency
  • security concerns

Untitled

Private Cloud Networking

Untitled

Latency: time it takes a packet to travel from 1 data node to another; highly variable latency with shared nodes; important when application requires swift response time

Bandwidth: number of bits transferred per unit time; important when applications require substantial amount of data transfer

Cloud-based Networking

Untitled

On-prem vs Cloud-based

On-prem IT ResourcesCloud-based IT Resources
internal users access corporate IT services using corporate networkinternal users access corporate IT services using internet connection
while roaming in external networks, internal users access corporate IT services using corporate internet connectionwhile roaming in external networks, internal users access corporate IT services using cloud provider’s internet connection
external users access corporate IT services using corporate Internet connectionexternal users access corporate IT services using cloud provider’s internet connection

Datacenter

Besides housing the physical resources (server, storage and network) in a building, additional infrastructures includes power and cooling, humidity control, fire protection and security of access and installation.

Main components

Main components

Server (1U Server) → Cluster (42U Server Rack) → Multiple Clusters → Datacenter

Machine Rack

  • Support servers, storage and networking equipment
  • Classic rack: 19-inch wide and 48-inch sizes to support different form factors
  • Power conversion (AC to DC) and delivery, battery backup
  • Rack-level switch, e.g. TOR (top of rack) with ports divided between downlinks that connect servers to TOR and uplinks that connect the TOR to datacenter network fabric

Example: Intel Haswell-based Server Tray

  • Supports 2 Intel Haswell CPU sockets
  • Each CPU can support up to 145W with DVFS, e.g. Haswell processot with 18-core per socket
  • 16 DIMM (dual in-line memory modules) slots
  • PCIe (peripheral component interconnect express) for SSD, NIC, accelerators
  • SATA ports for DAS and PCIe-attached storage (disk trays)

Computing Accelerator: GPUs

  • GPU to CPU host via PCIe-attached accelerator tray with multiple GPUs
  • within a tray GPUs use interconnects such as NVlink
  • Multiple GPU trays are connected to datacenter with NICs, GPUs and NICs communicate directly through PCI-e

Storage

  • 2 categories of data:
    • private to individual running tasks: resides in local DRAM or disk
    • part of shared state of distributed workload - distributed storage network
  • Example: Google’s disk tray with 22 hard drives, provides storage to servers over Ethernet

Storage Hierarchy

Storage Hierarchy

Network

  • carrier interconnection between external WAN and datacenter LAN
  • LAN fabric
    • within a rack, e.g. Ethernet switch with 48 ports to connect servers at 40-100 Gbps
    • redundant connectivity
  • SAN fabric between servers and storage systems (Fibre channel)

Building

  • Described in terms of either floor area for IT equipment or critical power supplied to IT equipment
  • Power supply, UPS (uninterruptible power supply), battery power backup
  • Cooling - Hot-Cold Aisles: 2 main types of cooling systems:
    1. Cold air is blown into the datacenter from the raised floor. The cold air blown into the cool aisles get suck into the rack and heat escape from the other side of the rack resulting in the hot aisles. Hot air rises and get suck out of the datacenter.
    2. Water cool – this is analogous to put the server in a refrigerator, i.e., the server rack is enclosed in a box and heat exchange using cold water is performed within the box. Hot water that take off the heat is discharged through pipes hidden under raised floor. This method has a higher cooling efficiency than using cold air.
  • Fire protection
  • Security and Monitoring

Datacenter Tiers

Tier I: a single path for power consumption, UPS, and cooling distribution, no redundant components

→ Availability: 99.671% (29hrs downtime/yr)

Tier II: adds redundant components to this design (N + 1)

→ Availability: 99.741% (23hrs downtime/yr)

Tier III: one active and one alternate distribution path for utilities

  • Each path with redundant components, concurrently maintainable

  • Redundancy even during maintenance

    → Availability: 99.982% (1.5hrs downtime/yr)

Tier IV: 2 simultaneously active power and cooling distribution paths

  • Redundant components in each path

  • Tolerate any single equipment failure without impacting the load

    → Availability: 99.995% (0.5hrs downtime/yr)

Energy Usage and Power Efficiency

PUE (Power Usage Efficiency) = total power used by datacenter / power used by IT equipment

→ Ideal: PUE = 1

→ Avg. US datacenter PUE = 2

→ Energy efficient datacenter PUE = 1.2

Energy-proportion systems:

  • Even when power requirement scale linearly with the load, energy efficiency is not a linear function of load; idle system use 50% power
  • An energy-proportional system consumes no power when idle, very little power under a light load and, gradually, more power as the load increases
  • Dynamic power range: low and upper range of the power consumption of a device
    • Processor (70%), DRAM (50%), disk drive (25%), network switches (15%), human (?%)
    • wider range is better