MIT 6.5940 (2024 Fall)

course website: [https://hanlab.mit.edu/courses/2024-fall-65940](https://hanlab.mit.edu/courses/2024-fall-65940)

Pruning

  • Pruning Granularity
  • Pruning Criterion
  • Pruning Ratio
  • Fine-tuning: recover the accuracy lost by pruning
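A common pruning criterion is weight magnitude: remove the weights with the smallest absolute values. A minimal sketch (function name and the example tensor are illustrative, not from the course code):

```python
import numpy as np

def magnitude_prune(weights, ratio):
    """Zero out the `ratio` fraction of weights with the smallest magnitude."""
    flat = np.abs(weights).flatten()
    k = int(len(flat) * ratio)          # number of weights to remove
    if k == 0:
        return weights.copy()
    threshold = np.sort(flat)[k - 1]    # k-th smallest magnitude
    mask = np.abs(weights) > threshold  # keep only larger-magnitude weights
    return weights * mask

w = np.array([[0.1, -2.0], [0.05, 3.0]])
pruned = magnitude_prune(w, 0.5)  # removes the 50% smallest-magnitude weights
```

After pruning, the network is typically fine-tuned so the remaining weights can compensate for the removed ones.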

Distinguish Between Regularization and Normalization

Regularization

  • purpose: prevent overfitting
  • method: add a penalty term to the loss function
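For example, L2 regularization adds the squared norm of the weights to the loss. A minimal sketch (the function name and `lam` coefficient are illustrative):

```python
import numpy as np

def mse_with_l2(pred, target, weights, lam=0.01):
    """Data loss plus an L2 penalty that discourages large weights."""
    mse = np.mean((pred - target) ** 2)        # data term
    penalty = lam * np.sum(weights ** 2)       # regularization term
    return mse + penalty
```

Larger weights incur a larger penalty, which pushes the optimizer toward simpler models and reduces overfitting.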

Normalization

  • purpose: improve training speed and stability
  • method: scale the input to have zero mean and unit variance
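The zero-mean, unit-variance scaling can be sketched as follows (the `eps` term for numerical stability is an added assumption):

```python
import numpy as np

def normalize(x, eps=1e-8):
    """Scale each feature column to zero mean and unit variance."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + eps)
```

Keeping activations in a consistent range makes gradients better conditioned, which speeds up and stabilizes training.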

Quantization

K-means Quantization
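In K-means quantization, the weights are clustered into K centroids; each weight is then stored as a small integer index into a shared codebook. A simplified sketch (the initialization and iteration count are illustrative choices):

```python
import numpy as np

def kmeans_quantize(weights, k, iters=10):
    """Cluster weights into k centroids; return (indices, codebook)."""
    flat = weights.flatten()
    # initialize centroids evenly across the weight range
    centroids = np.linspace(flat.min(), flat.max(), k)
    for _ in range(iters):
        # assign each weight to its nearest centroid, then recenter
        idx = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
        for c in range(k):
            if np.any(idx == c):
                centroids[c] = flat[idx == c].mean()
    idx = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
    return idx.reshape(weights.shape), centroids

w = np.array([[2.09, -0.98], [1.48, 0.09]])
idx, codebook = kmeans_quantize(w, 2)
w_hat = codebook[idx]  # dequantized (approximate) weights
```

Storage drops because each weight needs only log2(K) bits for its index, plus the small shared codebook.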

Linear Quantization
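Linear quantization maps real values to integers via an affine transform, x ≈ scale * (q - zero_point). A minimal asymmetric uint8 sketch (the rounding and clipping choices are one common convention, not the only one):

```python
import numpy as np

def linear_quantize(x, num_bits=8):
    """Asymmetric linear quantization: x ~= scale * (q - zero_point)."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def linear_dequantize(q, scale, zero_point):
    """Recover an approximation of the original real values."""
    return scale * (q.astype(np.float32) - zero_point)

x = np.array([-1.0, 0.0, 1.0], dtype=np.float32)
q, scale, zero_point = linear_quantize(x)
x_hat = linear_dequantize(q, scale, zero_point)
```

The round-trip error is bounded by about half a quantization step (scale / 2).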

Post-Training Quantization vs. Quantization-Aware Training

Neural Architecture Search (NAS)

Knowledge Distillation
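In knowledge distillation, a small student network is trained to match the teacher's softened output distribution. A sketch of the soft-target loss (the temperature value and T² scaling follow the common formulation; names are illustrative):

```python
import numpy as np

def softmax(z, T=1.0):
    """Softmax with temperature T; higher T gives a softer distribution."""
    e = np.exp(z / T - np.max(z / T))  # subtract max for numerical stability
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """Cross-entropy between softened teacher and student outputs, scaled by T^2."""
    p = softmax(teacher_logits, T)  # soft targets from the teacher
    q = softmax(student_logits, T)
    return -np.sum(p * np.log(q + 1e-12)) * T * T

teacher = np.array([2.0, 1.0, 0.1])
loss_matched = distillation_loss(teacher, teacher)                 # student agrees
loss_wrong = distillation_loss(np.array([0.0, 0.0, 5.0]), teacher) # student disagrees
```

The loss is minimized when the student's softened distribution matches the teacher's; in practice it is combined with the ordinary hard-label loss.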

Matrix Multiplication

GEMM: General Matrix Multiplication

  • Loop optimization
    • reordering
    • tiling
    • unrolling
  • SIMD
  • Multithreading
  • CUDA programming

Loop Optimization

Loop Reordering

Target: improve cache data locality. In the naive i-j-k order, the inner loop scans a column of B (strided access); reordering to i-k-j makes the inner loop walk B and C along contiguous rows.

# naive loop order (i, j, k); A, B, C are N x N matrices, C initialized to zeros
for i in range(N):
    for j in range(N):
        for k in range(N):
            C[i][j] += A[i][k] * B[k][j]

# reordered loop (i, k, j): the inner loop now accesses B and C row by row
for i in range(N):
    for k in range(N):
        for j in range(N):
            C[i][j] += A[i][k] * B[k][j]

Loop Tiling

Target: reduce cache misses (when the matrices are much larger than the cache)

Determine the TILE_SIZE according to the cache size

T2_j = TILE_SIZE2  # a TILE_SIZE2 x TILE_SIZE tile should fit in the L2 cache
T_j = T_k = T_i = TILE_SIZE
for j_t2 in range(0, N, T2_j):
    for i_t in range(0, N, T_i):
        for k_t in range(0, N, T_k):
            for j_t1 in range(j_t2, j_t2 + T2_j, T_j):
                # inner tile sized for the L1 cache
                for i in range(i_t, i_t + T_i):
                    for k in range(k_t, k_t + T_k):
                        for j in range(j_t1, j_t1 + T_j):
                            C[i][j] += A[i][k] * B[k][j]

Loop Unrolling

Target: reduce branching overhead

# unroll the innermost loop by a factor of 4 (assumes N is a multiple of 4)
for i in range(N):
    for j in range(N):
        for k in range(0, N, 4):
            C[i][j] += A[i][k]     * B[k][j]
            C[i][j] += A[i][k + 1] * B[k + 1][j]
            C[i][j] += A[i][k + 2] * B[k + 2][j]
            C[i][j] += A[i][k + 3] * B[k + 3][j]

SIMD