Source: https://jax-ml.github.io/scaling-book/gpus/

Building Blocks (Single SM) [WIP]

Subpartition

1. CUDA Cores:

  • General arithmetic operations (Floating point, integer)
  • Each subpartition has 32 FP32 cores and a smaller number of INT32 and FP64 cores.
  • Used for ReLUs and other pointwise operations (see the kernel sketch after this list).

2. Tensor Cores:

  • Specialized cores for matrix multiplications, which account for most FLOPs (floating-point operations).
  • Each Tensor Core on an H100 GPU can perform 1024 FLOPs per cycle (a minimal WMMA sketch follows the final flow below).
  • There is one Tensor Core per subpartition.

3. Thread:

  • A thread loads data from memory into registers.
  • Once an instruction arrives from the warp scheduler, the work is executed on a core.
  • The basic unit of work.

4. Warp:

  • A group of 32 threads.
  • All threads execute the same instruction on different data (SIMT).
  • Threads are executed on either CUDA cores or Tensor cores.

5. Warp Scheduler:

  • Manages multiple resident warps.
  • Decides which warp's instruction to issue next.
  • Switches between warps to hide latency (if a warp is waiting on memory, it switches to a ready warp).
  • Issues each instruction warp-wide, following the SIMT principle.

6. Titbits

  • Warps and threads are logical groupings, not physical hardware units.
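A minimal sketch of how items 1, 3, and 4 fit together for a pointwise op like ReLU (the kernel name, block size, and array size are illustrative choices, not from the source): each block of 256 threads runs as 8 warps of 32, and every thread executes the same instruction on its own element (SIMT) on an FP32 CUDA core.

```
#include <cstdio>
#include <cuda_runtime.h>

// Pointwise ReLU: every thread handles one element. Threads are grouped
// into warps of 32 that execute this same instruction stream in lockstep
// on the subpartition's FP32 CUDA cores (SIMT).
__global__ void relu_kernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) {
        out[i] = fmaxf(in[i], 0.0f);                // executed on an FP32 CUDA core
    }
}

int main() {
    const int n = 1 << 20;                          // 1M elements (arbitrary size)
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = i - n / 2;  // half negative, half positive

    int threads = 256;                              // 256 threads = 8 warps per block
    int blocks = (n + threads - 1) / threads;
    relu_kernel<<<blocks, threads>>>(in, out, n);
    cudaDeviceSynchronize();

    printf("out[0]=%f out[n-1]=%f\n", out[0], out[n - 1]);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```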

Final Flow

Warp Scheduler -> Warp (32 threads) -> CUDA Cores/Tensor Cores (with data from registers/memory, using L0 cache for instructions)
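For the Tensor Core path of the flow above, a minimal sketch using CUDA's WMMA API (assumes an sm_70+ GPU; the 16x16x16 half-precision tile and pointer names are illustrative):

```
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp (32 threads) cooperatively issues a 16x16x16 matmul-accumulate
// to the subpartition's Tensor Core: C = A * B + C, with fp16 inputs and
// an fp32 accumulator.
__global__ void wmma_tile(const half* A, const half* B, float* C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);               // start the accumulator at zero
    wmma::load_matrix_sync(a_frag, A, 16);           // leading dimension = 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // the Tensor Core instruction
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}

// Launched with exactly one warp, e.g. wmma_tile<<<1, 32>>>(A, B, C);
```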

Note: CUDA cores are very flexible. Each core can perform different operations, which is managed by masking out the cores that do not need to perform a divergent operation. However, if warps diverge too often, performance silently degrades.
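A sketch of the divergence the note warns about (the kernel and its branch condition are made up for illustration): lanes within one warp disagree on a branch, so the warp executes both paths with the inactive lanes masked off.

```
__global__ void divergent_kernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Threads within one warp disagree on this condition, so the warp runs
    // both branches serially, masking off the inactive lanes each time.
    if (threadIdx.x % 2 == 0) {
        out[i] = in[i] * 2.0f;   // even lanes active, odd lanes masked
    } else {
        out[i] = in[i] + 1.0f;   // odd lanes active, even lanes masked
    }
    // A warp-uniform condition (e.g. one based on blockIdx.x) would avoid this cost.
}
```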

Memory 🧠

  • Register File: Private memory accessible to individual threads. The H100 SM has a 256KiB register file. The number of resident warps depends on how much of this memory is used. Let's calculate how many warps can fit in the register file (see the sketch after this list).
    • Total register capacity: 16,384 registers per subpartition x 4 bytes (32 bits) x 4 subpartitions = 262,144 bytes = 256KiB (kibibytes, a binary prefix).
    • Each thread can use at most 256 registers at a time, even though up to 64 resident warps can be scheduled per SM.
    • 4 bytes (per register) x 32 (threads per warp) x 256 (max registers per thread) = 32,768 bytes per warp.
    • Total bytes / max bytes per warp = 262,144 / 32,768 = 8, so only 8 warps fit in the register file when each thread uses its maximum of 256 registers.
  • L0 Instruction Cache: Stores instructions to speed up execution.
  • L1 Cache(SMEM):
    • Capacity: 256KB.
    • On-chip cache called SMEM.
    • Each SM has an L1 cache.
    • Can act either as programmer-managed shared memory or as a hardware-managed on-chip cache.
    • Used to stage Tensor Core matmul operands, other input data, and thread-block communication (the threads of a thread block, i.e. a group of warps, share it); see the sketch after the memory flow below.
  • L2 Cache:
    • Capacity: 50MB
    • On-Chip cache.
    • Accessible to all SMs.
    • Isn't programmer-controlled, so memory access patterns have to be optimized to use it well.
    • Slower than L1 cache.
    • Bandwidth of 5.5TB/s.
    • If data is not found in the registers or L1 cache, the thread checks the L2 cache.
  • High Bandwidth Memory (HBM):
    • Main GPU memory
    • Capacity: 32GB in Volta to 192GB in Blackwell.
    • Off-chip memory.
    • Bandwidth (HBM to Tensor Core): 3.5TB/s to 9TB/s.
    • Stores model weights, activations, gradients.
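The register-file arithmetic above, re-derived as a small host-side sketch (it only restates the constants from the bullets; nothing here queries real hardware):

```
#include <cstdio>

int main() {
    // Per-SM register file, as described above.
    const long registers_per_subpartition = 16384;
    const long bytes_per_register = 4;            // 32-bit registers
    const long subpartitions_per_sm = 4;
    const long register_file_bytes =
        registers_per_subpartition * bytes_per_register * subpartitions_per_sm;  // 262,144 B

    // Worst case: every thread uses its maximum of 256 registers.
    const long max_registers_per_thread = 256;
    const long threads_per_warp = 32;
    const long bytes_per_warp =
        bytes_per_register * threads_per_warp * max_registers_per_thread;        // 32,768 B

    printf("register file: %ld KiB\n", register_file_bytes / 1024);              // 256 KiB
    printf("resident warps at max register use: %ld\n",
           register_file_bytes / bytes_per_warp);                                // 8 warps
    return 0;
}
```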

Final Flow:

A subpartition looks for data in this hierarchy: register file -> L1 cache -> L2 cache -> HBM.
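A minimal sketch of the programmable side of L1/SMEM mentioned above (the kernel, its 256-thread block size, and the in-block reversal are illustrative only): a thread block stages data in shared memory, synchronizes, and threads read values written by other threads in the same block.

```
// Each thread block stages its tile in shared memory (SMEM, carved out of
// the SM's L1), synchronizes, then reads a value written by another thread
// in the same block, i.e. block-level communication through on-chip memory.
__global__ void reverse_in_block(const float* in, float* out) {
    __shared__ float tile[256];                    // lives in the SM's L1/SMEM

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];                     // HBM/L2 -> SMEM
    __syncthreads();                               // whole block sees the tile

    // Read an element written by a different thread in this block.
    out[i] = tile[blockDim.x - 1 - threadIdx.x];
}

// Launched with 256 threads per block, e.g. reverse_in_block<<<blocks, 256>>>(in, out);
```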

Performance Calculations (at 1 GHz Clock Speed)

FP32 CUDA Cores:

  • A single FP32 CUDA core performs 1 floating-point operation per cycle, i.e. 1 GFLOP/s (1 billion operations per second) at 1 GHz.
  • A subpartition has 32 FP32 CUDA cores, so it can perform 32 billion operations per second.
  • A single SM has 4 subpartitions, so it can perform 4 * 32 = 128 billion operations per second.
  • The H100 has 132 SMs, so the total theoretical FP32 performance is 132 * 128 = 16,896 billion operations per second, or 16.896 TeraFLOPS.

Tensor Cores:

  • A single Tensor Core can perform 1024 FLOPs per cycle.
  • At 1 GHz, a single Tensor Core can perform 1024 billion operations per second, or 1.024 TeraFLOPS.
  • A subpartition has 1 Tensor Core.
  • A single SM has 4 subpartitions (and thus 4 Tensor Cores).
  • A single SM can perform 4 * 1.024 = 4.096 TeraFLOPS.
  • The H100 has 132 SMs, so the total theoretical Tensor Core performance is 132 * 4.096 = 540.672 TeraFLOPS.
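The same 1 GHz peak-throughput arithmetic as the two lists above, written out as a small sketch (the clock speed and per-unit rates are the simplified assumptions used here, not measured numbers):

```
#include <cstdio>

int main() {
    const double clock_hz = 1e9;               // assumed 1 GHz, as above
    const int sms = 132;                       // H100 SM count
    const int subpartitions_per_sm = 4;

    // FP32 CUDA cores: 32 per subpartition, 1 FLOP per cycle each.
    const double fp32_flops =
        clock_hz * sms * subpartitions_per_sm * 32 * 1;
    printf("FP32 peak:        %.3f TFLOP/s\n", fp32_flops / 1e12);   // 16.896

    // Tensor Cores: 1 per subpartition, 1024 FLOPs per cycle each.
    const double tc_flops =
        clock_hz * sms * subpartitions_per_sm * 1 * 1024;
    printf("Tensor Core peak: %.3f TFLOP/s\n", tc_flops / 1e12);     // 540.672
    return 0;
}
```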

In Progress…