6.45

    assume matrix size N = 4, cache block size B = 8 bytes, and sizeof(int) = 4

    matrix size 4*4

    +--+--+--+--+
    |0 |1 |2 |3 |
    +--+--+--+--+
    |4 |5 |6 |7 |
    +--+--+--+--+
    |8 |9 |10|11|
    +--+--+--+--+
    |12|13|14|15|
    +--+--+--+--+

    function transpose:

    void transpose(int* dst, int* src, int N) {
        int i, j;
        for (i = 0; i <= N-1; i++)
            for (j = 0; j <= N-1; j++)
                dst[j*N+i] = src[i*N+j];
    }

    every cache block can hold 2 ints (B / sizeof(int) = 8 / 4 = 2)

    src is traversed with stride 1, in this order:

    0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

    element 0: cache miss, load src[0],src[1]
    element 1: cache hit, already loaded by element 0
    element 2: cache miss, load src[2],src[3]
    ...
    element 15: cache hit

    if B were larger, the hit rate of this stride-1 scan would be higher.
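    To put a number on that (my own sketch, not part of the original answer): a cold,
    block-aligned stride-1 scan misses once per cache block, so its hit rate is
    1 - sizeof(int)/B.

    #include <stdio.h>

    /* expected misses for a cold, block-aligned stride-1 scan of n ints:
       one miss per cache block, i.e. one miss per B/sizeof(int) accesses */
    static int stride1_misses(int n, int block_bytes) {
        int ints_per_block = block_bytes / (int)sizeof(int);
        return (n + ints_per_block - 1) / ints_per_block;
    }

    int main(void) {
        /* N*N = 16 ints, B = 8 bytes -> 8 misses, 8 hits, 50% hit rate;
           with B = 64 the same scan would miss only once per 16 accesses */
        printf("misses = %d of 16 accesses\n", stride1_misses(16, 8));
        return 0;
    }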

    dst is traversed with stride 4 (stride N), in this order:

    0 4 8 12 1 5 9 13 2 6 10 14 3 7 11 15

    element 0: miss, load dst[0],dst[1]
    element 4: miss, load dst[4],dst[5]
    element 8: miss, load dst[8],dst[9]
    element 12: miss, load dst[12],dst[13]
    element 1: interesting case. element 0 already loaded dst[1] into the cache,
    so this can be a hit. but in many caches, element 4, 8, or 12 maps to the
    same cache set as element 0 and may have evicted that block, so a miss is
    also possible.
    ...
    element 15: hit / miss

    let's assume the worst case: every dst access misses the cache.
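    Whether the worst case really occurs depends on the cache size and associativity.
    As a side note, a tiny direct-mapped cache simulator can check which scenario a
    given configuration produces. This is my own sketch: the line count S, the block
    size B, and the assumption that dst sits in memory right after src are all made
    up for illustration.

    #include <stdio.h>

    #define N 4       /* matrix size                                   */
    #define B 8       /* cache block size in bytes (assumed)           */
    #define S 4       /* number of direct-mapped cache lines (assumed) */

    static long tag[S];
    static int hits, misses;

    /* look up one byte address in a tiny direct-mapped cache */
    static void touch(long addr) {
        long block = addr / B;
        int  line  = (int)(block % S);
        if (tag[line] == block) hits++;
        else { tag[line] = block; misses++; }
    }

    int main(void) {
        int i, j;
        for (i = 0; i < S; i++) tag[i] = -1;                 /* cold cache */
        /* assumed layout: src starts at address 0, dst right after it */
        long src_base = 0, dst_base = (long)N * N * (long)sizeof(int);
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++) {
                touch(src_base + (long)(i * N + j) * (long)sizeof(int)); /* read src[i*N+j]  */
                touch(dst_base + (long)(j * N + i) * (long)sizeof(int)); /* write dst[j*N+i] */
            }
        printf("hits = %d, misses = %d\n", hits, misses);
        return 0;
    }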

    if we split the matrix into 2*2 blocks:

    split the matrix into 2*2 blocks

    +--+--+--+--+
    |     |     |
    +  0  +  1  +
    |     |     |
    +--+--+--+--+
    |     |     |
    +  2  +  3  +
    |     |     |
    +--+--+--+--+

    transpose block 0 onto itself

    transpose blocks 1 and 2 into each other's positions

    transpose block 3 onto itself

    code

    for (i = 0; i <= N-2; i+=2)
        for (j = 0; j <= N-2; j+=2) {
            dst[j*N+i]       = src[i*N+j];
            dst[j*N+i+1]     = src[(i+1)*N+j];
            dst[(j+1)*N+i]   = src[i*N+j+1];
            dst[(j+1)*N+i+1] = src[(i+1)*N+j+1];
        }

    when i = 0, j = 0, block 0 is transposed onto itself:

    +--+--+      +--+--+
    |0 |1 |      |0 |4 |
    +--+--+  =>  +--+--+
    |4 |5 |      |1 |5 |
    +--+--+      +--+--+

    dst[0] = src[0];
    dst[1] = src[4];
    dst[4] = src[1];
    dst[5] = src[5];

    if element 0 misses, element 1 must hit (they share one cache block);
    if element 4 misses, element 5 must hit.

    every 2-int block of src and dst is loaded exactly once and used twice,
    so at most half of the accesses can hit: 50% is the highest hit rate
    possible with such a small cache block.

    if B is larger and the cache size C is larger, we can split the matrix into
    4*4, 8*8, or even bigger blocks; in theory we can then achieve the highest
    possible hit rate.

    the final code:

    !INCLUDE "./code/transpose.c"

    in the code, the matrix size is 1024*1024 and the transpose is looped 1000 times
    to measure the program run time.
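    The included file is not reproduced here; the following is only a rough sketch of
    what a BLOCK-parameterized kernel plus timing loop could look like (my own
    reconstruction, not the actual transpose.c; clock() from <time.h> is assumed to
    be precise enough at these run times):

    #include <stdio.h>
    #include <time.h>

    #define N     1024
    #define BLOCK 8        /* tile size; the benchmark varies this from 2 to 16 */

    static int src[N * N], dst[N * N];

    /* blocked transpose: process the matrix tile by tile so the cache lines
       of src and dst touched inside one tile are reused before eviction */
    static void transpose_blocked(int *d, const int *s, int n) {
        int i, j, ii, jj;
        for (i = 0; i < n; i += BLOCK)
            for (j = 0; j < n; j += BLOCK)
                for (ii = i; ii < i + BLOCK && ii < n; ii++)
                    for (jj = j; jj < j + BLOCK && jj < n; jj++)
                        d[jj * n + ii] = s[ii * n + jj];
    }

    int main(void) {
        int k;
        clock_t start = clock();
        for (k = 0; k < 1000; k++)             /* 1000 iterations, as in the text */
            transpose_blocked(dst, src, N);
        printf("time: %.2fs\n", (double)(clock() - start) / CLOCKS_PER_SEC);
        return 0;
    }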

    changing BLOCK from 2 to 16 and recording the run times:

    original function run time: 16.46s

    BLOCK   time(s)
    2       9.99
    3       7.16
    4       5.60
    5       5.66
    6       5.34
    7       5.39
    8       5.38
    9       5.48
    10      6.21
    11      7.90
    12      10.17
    13      11.14
    14      11.88
    15      12.11
    16      11.85

    tip: look up the CPU cache info:

    cat /sys/devices/system/cpu/cpu0/cache/*
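    For example, on a typical Linux system the per-level attributes live under
    index0, index1, ... (the exact layout varies by machine):

    cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size   # cache block size B in bytes
    cat /sys/devices/system/cpu/cpu0/cache/index0/size                  # cache size C
    cat /sys/devices/system/cpu/cpu0/cache/index0/ways_of_associativity # associativity E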