6.45

    assume matrix size N = 4, cache block size B = 8 bytes, and sizeof(int) = 4

    matrix size 4*4

    +--+--+--+--+
    |0 |1 |2 |3 |
    +--+--+--+--+
    |4 |5 |6 |7 |
    +--+--+--+--+
    |8 |9 |10|11|
    +--+--+--+--+
    |12|13|14|15|
    +--+--+--+--+

    function transpose:

    void transpose(int* dst, int* src, int N) {
        int i, j;
        for (i = 0; i <= N-1; i++)
            for (j = 0; j <= N-1; j++)
                dst[j*N+i] = src[i*N+j];
    }

    every cache block can hold 2 ints (B / sizeof(int) = 8 / 4 = 2)

    src is traversed with stride 1, in this order:

    0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

    element 0: cache miss, load src[0],src[1]
    element 1: cache hit, already loaded by element 0
    element 2: cache miss, load src[2],src[3]
    ...
    element 15: cache hit

    if B were larger, the hit rate of this stride-1 scan would be higher.
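    To put a number on that (my own sketch, not part of the original answer): a cold,
    block-aligned stride-1 scan misses once per cache block, so its hit rate is
    1 - sizeof(int)/B.

    #include <stdio.h>

    /* expected misses for a cold, block-aligned stride-1 scan of n ints:
       one miss per cache block, i.e. one miss per B/sizeof(int) accesses */
    static int stride1_misses(int n, int block_bytes) {
        int ints_per_block = block_bytes / (int)sizeof(int);
        return (n + ints_per_block - 1) / ints_per_block;
    }

    int main(void) {
        /* N*N = 16 ints, B = 8 bytes -> 8 misses, 8 hits, 50% hit rate;
           with B = 64 the same scan would miss only once per 16 accesses */
        printf("misses = %d of 16 accesses\n", stride1_misses(16, 8));
        return 0;
    }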

    dst is traversed with stride 4 (stride N), in this order:

    0 4 8 12 1 5 9 13 2 6 10 14 3 7 11 15

    element 0: miss, load dst[0],dst[1]
    element 4: miss, load dst[4],dst[5]
    element 8: miss, load dst[8],dst[9]
    element 12: miss, load dst[12],dst[13]
    element 1: interesting case. element 0 already loaded dst[1] into the cache,
    so this can be a hit. but in many caches, element 4, 8, or 12 maps to the
    same cache set as element 0 and may have evicted that block, so a miss is
    also possible.
    ...
    element 15: hit / miss

    let's assume the worst case: every dst access misses the cache.
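    Whether the worst case really occurs depends on the cache size and associativity.
    As a side note, a tiny direct-mapped cache simulator can check which scenario a
    given configuration produces. This is my own sketch: the line count S, the block
    size B, and the assumption that dst sits in memory right after src are all made
    up for illustration.

    #include <stdio.h>

    #define N 4       /* matrix size                                   */
    #define B 8       /* cache block size in bytes (assumed)           */
    #define S 4       /* number of direct-mapped cache lines (assumed) */

    static long tag[S];
    static int hits, misses;

    /* look up one byte address in a tiny direct-mapped cache */
    static void touch(long addr) {
        long block = addr / B;
        int  line  = (int)(block % S);
        if (tag[line] == block) hits++;
        else { tag[line] = block; misses++; }
    }

    int main(void) {
        int i, j;
        for (i = 0; i < S; i++) tag[i] = -1;                 /* cold cache */
        /* assumed layout: src starts at address 0, dst right after it */
        long src_base = 0, dst_base = (long)N * N * (long)sizeof(int);
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++) {
                touch(src_base + (long)(i * N + j) * (long)sizeof(int)); /* read src[i*N+j]  */
                touch(dst_base + (long)(j * N + i) * (long)sizeof(int)); /* write dst[j*N+i] */
            }
        printf("hits = %d, misses = %d\n", hits, misses);
        return 0;
    }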

    if we split the matrix into 2*2 blocks:

    split the matrix into 2*2 blocks

    +--+--+--+--+
    |     |     |
    +  0  +  1  +
    |     |     |
    +--+--+--+--+
    |     |     |
    +  2  +  3  +
    |     |     |
    +--+--+--+--+

    transpose block 0 onto itself

    transpose blocks 1 and 2 into each other's positions

    transpose block 3 onto itself

    code

    for (i = 0; i <= N-2; i+=2)
        for (j = 0; j <= N-2; j+=2) {
            dst[j*N+i]       = src[i*N+j];
            dst[j*N+i+1]     = src[(i+1)*N+j];
            dst[(j+1)*N+i]   = src[i*N+j+1];
            dst[(j+1)*N+i+1] = src[(i+1)*N+j+1];
        }

    when i = 0, j = 0, block 0 is transposed onto itself:

    +--+--+      +--+--+
    |0 |1 |      |0 |4 |
    +--+--+  =>  +--+--+
    |4 |5 |      |1 |5 |
    +--+--+      +--+--+

    dst[0] = src[0];
    dst[1] = src[4];
    dst[4] = src[1];
    dst[5] = src[5];

    if element 0 misses, element 1 must hit (they share one cache block);
    if element 4 misses, element 5 must hit.

    every 2-int block of src and dst is loaded exactly once and used twice,
    so at most half of the accesses can hit: 50% is the highest hit rate
    possible with such a small cache block.

    if B is larger and the cache size C is larger, we can split the matrix into
    4*4, 8*8, or even bigger blocks; in theory we can then achieve the highest
    possible hit rate.

    the final code:

    !INCLUDE "./code/transpose.c"

    in the code, the matrix size is 1024*1024 and the transpose is looped 1000 times
    to measure the program run time.
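    The included file is not reproduced here; the following is only a rough sketch of
    what a BLOCK-parameterized kernel plus timing loop could look like (my own
    reconstruction, not the actual transpose.c; clock() from <time.h> is assumed to
    be precise enough at these run times):

    #include <stdio.h>
    #include <time.h>

    #define N     1024
    #define BLOCK 8        /* tile size; the benchmark varies this from 2 to 16 */

    static int src[N * N], dst[N * N];

    /* blocked transpose: process the matrix tile by tile so the cache lines
       of src and dst touched inside one tile are reused before eviction */
    static void transpose_blocked(int *d, const int *s, int n) {
        int i, j, ii, jj;
        for (i = 0; i < n; i += BLOCK)
            for (j = 0; j < n; j += BLOCK)
                for (ii = i; ii < i + BLOCK && ii < n; ii++)
                    for (jj = j; jj < j + BLOCK && jj < n; jj++)
                        d[jj * n + ii] = s[ii * n + jj];
    }

    int main(void) {
        int k;
        clock_t start = clock();
        for (k = 0; k < 1000; k++)             /* 1000 iterations, as in the text */
            transpose_blocked(dst, src, N);
        printf("time: %.2fs\n", (double)(clock() - start) / CLOCKS_PER_SEC);
        return 0;
    }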

    changing BLOCK from 2 to 16 and recording the run times:

    original function run time: 16.46s

    BLOCK   time(s)
    2       9.99
    3       7.16
    4       5.60
    5       5.66
    6       5.34
    7       5.39
    8       5.38
    9       5.48
    10      6.21
    11      7.90
    12      10.17
    13      11.14
    14      11.88
    15      12.11
    16      11.85

    tip: look up the CPU cache info:

    cat /sys/devices/system/cpu/cpu0/cache/*
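    For example, on a typical Linux system the per-level attributes live under
    index0, index1, ... (the exact layout varies by machine):

    cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size   # cache block size B in bytes
    cat /sys/devices/system/cpu/cpu0/cache/index0/size                  # cache size C
    cat /sys/devices/system/cpu/cpu0/cache/index0/ways_of_associativity # associativity E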