GPUs

08 Jan 2025 - cohlem

GPU physical structure

Let's first understand the structure of a GPU.

Inside a GPU sits a chip, in this case the GA102 (the chip depends on the architecture; this one is from the Ampere architecture), built from 28.3 billion transistors (a transistor is a semiconductor device that can switch or amplify electrical signals). The majority of the chip area is covered by processing cores, which are divided into seven Graphics Processing Clusters (GPCs).


Within each GPC there are 12 Streaming Multiprocessors (SMs). Inside each SM there are 4 processing blocks (warps) and 1 Ray Tracing core, and inside each processing block there are 32 CUDA cores and 1 Tensor Core.

Altogether there are 7 × 12 = 84 SMs, which gives 84 × 128 = 10752 CUDA cores, 336 Tensor Cores, and 84 Ray Tracing cores.
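The totals follow directly from the hierarchy described above; a quick sanity-check sketch:

```python
# Core counts implied by the GA102 hierarchy: 7 GPCs x 12 SMs,
# each SM with 4 processing blocks of 32 CUDA cores + 1 Tensor Core,
# and 1 Ray Tracing core per SM.
gpcs = 7
sms_per_gpc = 12
blocks_per_sm = 4
cuda_per_block = 32
tensor_per_block = 1
rt_per_sm = 1

sms = gpcs * sms_per_gpc
cuda_cores = sms * blocks_per_sm * cuda_per_block
tensor_cores = sms * blocks_per_sm * tensor_per_block
rt_cores = sms * rt_per_sm

print(sms, cuda_cores, tensor_cores, rt_cores)  # 84 10752 336 84
```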

Each type of core has a different function.

Cuda Cores

A CUDA core is like a basic calculator that performs multiplication and addition operations.

Ray tracing Cores

(picture) These are specialized cores that accelerate the ray-intersection calculations used to render realistic lighting and reflections.

Depending on the number of streaming multiprocessors that are damaged during manufacturing, chips are categorized and sold at different prices. For instance, the RTX 3090 Ti has the full 10752 CUDA cores, whereas the 3090 might have some damaged SMs disabled. These cards may also have different clock speeds.

Graphics Memory GDDR6X SDRAM

These 24 GB of GDDR6X surround the GPU chip and feed it the data needed to run its operations.

How are operations executed?

If a processor runs at 1 GHz, it completes 10^9 cycles per second; assuming 1 cycle = 1 basic operation, it can execute 10^9 operations per second. Different operations cost different numbers of cycles:

- Global memory access (up to 80GB): ~380 cycles
- L2 cache: ~200 cycles
- L1 cache or Shared memory access (up to 128 KB per Streaming Multiprocessor): ~34 cycles
- Fused multiplication and addition, a*b+c (FFMA): 4 cycles
- Tensor Core matrix multiply: 1 cycle
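To get a feel for these cycle counts in wall-clock terms, divide by the clock rate; a small sketch (the 1.7 GHz clock here is an assumed example figure, not from the list above):

```python
# Convert cycle latencies to nanoseconds at an assumed 1.7 GHz clock.
clock_hz = 1.7e9
latencies_cycles = {
    "global memory": 380,
    "L2 cache": 200,
    "L1 / shared memory": 34,
    "FFMA": 4,
}
for name, cycles in latencies_cycles.items():
    ns = cycles / clock_hz * 1e9
    print(f"{name}: {cycles} cycles = {ns:.1f} ns")
```

At this clock, a global-memory access (~223 ns) costs as much as dozens of fused multiply-adds, which is why keeping data in shared memory matters so much.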

Tensor Cores (Most important)

These cores are used for matrix multiplication and addition, computing fused operations of the form D = A×B + C.


First, let's understand the precision formats that are used.

1. FP16 (Half-Precision Floating-Point)

2. FP32 (Single-Precision Floating-Point)
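FP16 uses 1 sign + 5 exponent + 10 mantissa bits, while FP32 uses 1 sign + 8 exponent + 23 mantissa bits. The largest finite value each format can hold follows from those widths; a sketch (the helper name `max_finite` is just for illustration):

```python
# Largest finite value of a binary floating-point format:
# (2 - 2**-mantissa_bits) * 2**max_exponent, where the all-ones
# exponent pattern is reserved for inf/NaN.
def max_finite(exp_bits, man_bits):
    bias = 2 ** (exp_bits - 1) - 1
    max_exp = (2 ** exp_bits - 2) - bias
    return (2 - 2 ** -man_bits) * 2.0 ** max_exp

print(max_finite(5, 10))   # FP16 -> 65504.0
print(max_finite(8, 23))   # FP32 -> ~3.4e38
```

This is why FP16 overflows easily during training: anything above 65504 becomes infinity, while FP32 has headroom up to ~3.4 × 10^38.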

The operation performed by a Tensor Core looks like this.

(picture) It performs 64 FMA operations per clock.
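The 64 FMAs per clock correspond exactly to one 4×4 fused operation D = A×B + C: each of the 16 output elements needs 4 multiply-adds. A sketch that counts them (plain Python standing in for the hardware):

```python
# One Tensor Core step: D = A @ B + C on 4x4 matrices, counting FMAs.
N = 4
A = [[1.0] * N for _ in range(N)]
B = [[2.0] * N for _ in range(N)]
C = [[0.5] * N for _ in range(N)]

fmas = 0
D = [[0.0] * N for _ in range(N)]
for i in range(N):
    for j in range(N):
        acc = C[i][j]
        for k in range(N):
            acc = A[i][k] * B[k][j] + acc   # one fused multiply-add
            fmas += 1
        D[i][j] = acc

print(fmas)     # 64
print(D[0][0])  # 1*2 summed 4 times + 0.5 = 8.5
```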

Understanding matrix multiplication in tensor cores.

Task:

Matrix multiply A and B with each size 32 x 32

Let's say our Tensor Cores process one 4x4 matrix multiplication per cycle. The steps to multiply A and B are: split both matrices into 4x4 tiles, multiply corresponding pairs of tiles, and accumulate the partial products into the output tiles.
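A minimal sketch of this tiling in plain Python, where each inner 4×4 multiply is what a single Tensor Core step would handle:

```python
# 32x32 matmul done as an 8x8 grid of 4x4 tile multiplications.
N, T = 32, 4   # matrix size, tile size

A = [[float(i + j) for j in range(N)] for i in range(N)]
B = [[float(i * j % 5) for j in range(N)] for i in range(N)]

C = [[0.0] * N for _ in range(N)]
tile_ops = 0
for bi in range(0, N, T):            # tile row of C
    for bj in range(0, N, T):        # tile column of C
        for bk in range(0, N, T):    # inner (reduction) tile index
            tile_ops += 1            # one 4x4 "Tensor Core" multiply
            for i in range(T):
                for j in range(T):
                    acc = C[bi + i][bj + j]
                    for k in range(T):
                        acc += A[bi + i][bk + k] * B[bk + k][bj + j]
                    C[bi + i][bj + j] = acc

print(tile_ops)  # (32/4)^3 = 512 tile multiplications
```

So a 32×32 product decomposes into (32/4)³ = 512 of these 4×4 steps, each accumulating into the output tile.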

Memory Bandwidth

We have seen that Tensor Cores are very fast. So fast, in fact, that they are idle most of the time, waiting for data to arrive from global memory. For example, during GPT-3-sized training, which uses huge matrices (the larger, the better for Tensor Cores), Tensor Core TFLOPS utilization is about 45-65%, meaning that even for large neural networks the Tensor Cores are idle about 50% of the time.

Nvidia Ampere Architecture

More details in the picture below.

(picture)

These TFLOPS are calculated using this formula

Number of cores for that precision (e.g. FP64) × clock speed × operations per clock (generally 1 FMA = 2 operations)
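For example, plugging in approximate RTX 3090 Ti figures (10752 FP32 CUDA cores; the ~1.86 GHz boost clock is an approximation, not from the text above):

```python
# Peak TFLOPS = cores x clock (Hz) x ops per clock.
# FP32 on an RTX 3090 Ti, using an approximate boost clock.
cores = 10752
clock_hz = 1.86e9
ops_per_clock = 2          # 1 FMA = 1 multiply + 1 add

tflops = cores * clock_hz * ops_per_clock / 1e12
print(f"{tflops:.1f} TFLOPS")  # ~40.0
```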

This figure shows the range and precision of each of these.

(picture)


How numbers are stored (an FP16 example)

Let's store -8764.781267 in FP16 format.

First, convert the integer and fractional parts to binary; we get

(8764.781267)base10 ≈ (10001000111100.11001)base2

Normalize the binary number to the form 1.mantissa x 2^(exponent)

10001000111100.11001 = 1.000100011110011001 × 2^13

The exponent is biased by 15 in FP16, i.e. Exponent Bits = Actual Exponent + 15 = 13 + 15 = 28

(28)base10 = (11100)base2

The mantissa is the fractional part of the normalized number, truncated to 10 bits:

(1.000100011110011001)base2 → fractional part (000100011110011001)base2 → truncated to 10 bits (0001000111)base2

The number is negative, so the sign bit is:

Sign Bit = 1

Combine the sign bit, exponent bits, and mantissa bits:

The FP16 representation is:

1 11100 0001000111 = 1111000001000111
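The steps above can be sketched as code. Note this follows the walkthrough exactly, truncating the mantissa to 10 bits; real hardware rounds to nearest even, which can differ in the last bit. The helper name `fp16_bits_truncated` is just for illustration:

```python
def fp16_bits_truncated(x: float) -> str:
    # Manual FP16 encoding following the steps above
    # (truncates the mantissa; hardware rounds to nearest even).
    sign = '1' if x < 0 else '0'
    x = abs(x)
    # Normalize to 1.fraction x 2**e
    e = 0
    while x >= 2:
        x /= 2
        e += 1
    while x < 1:
        x *= 2
        e -= 1
    # Extract 10 mantissa bits from the fractional part
    frac = x - 1
    mantissa = ''
    for _ in range(10):
        frac *= 2
        bit = int(frac)
        mantissa += str(bit)
        frac -= bit
    exponent = format(e + 15, '05b')   # bias of 15
    return sign + exponent + mantissa

print(fp16_bits_truncated(-8764.781267))  # 1111000001000111
print(fp16_bits_truncated(1.0))           # 0011110000000000
```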

Tensor Cores are optimized for matrix multiplication, so they can perform far more operations per clock than just 64 FMAs.

Sparsity

A sparse matrix contains a large number of zeros. By using a fine-grained pruning algorithm to compress the matrix (essentially removing small and zero values), the GPU saves computing resources, power, memory, and bandwidth.
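Ampere's fine-grained sparsity uses a 2:4 structured pattern: in every group of 4 weights, at most 2 are nonzero, so the hardware stores half the values plus small index metadata. A simplified sketch of the pruning step (metadata handling omitted; `prune_2_of_4` is an illustrative name):

```python
def prune_2_of_4(row):
    # 2:4 structured pruning, simplified: in each group of 4 values,
    # zero out the 2 with the smallest magnitude.
    out = list(row)
    for g in range(0, len(row), 4):
        group = row[g:g + 4]
        # indices of the 2 smallest-magnitude entries in this group
        drop = sorted(range(len(group)), key=lambda i: abs(group[i]))[:2]
        for i in drop:
            out[g + i] = 0.0
    return out

print(prune_2_of_4([0.9, -0.1, 0.0, 1.5, -2.0, 0.05, 0.3, -0.02]))
# -> [0.9, 0.0, 0.0, 1.5, -2.0, 0.0, 0.3, 0.0]
```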

Form Factor

They define how the GPU is physically integrated into a system and how it connects to other components like the CPU and memory.

|           | PCIe Gen4 (x16) | NVLink (per GPU pair) |
| --------- | --------------- | --------------------- |
| Bandwidth | 64 GB/s         | 600 GB/s              |
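The bandwidth gap matters in practice; for instance, moving the 24 GB of a full GPU's memory over each link (best case, ignoring protocol overhead):

```python
# Best-case time to move 24 GB over each interconnect.
data_gb = 24
for link, gb_per_s in [("PCIe Gen4 x16", 64), ("NVLink", 600)]:
    ms = data_gb / gb_per_s * 1000
    print(f"{link}: {ms:.0f} ms")  # PCIe: 375 ms, NVLink: 40 ms
```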
