4.59

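    The 4.47 version is the best of the three: its loop body averages 7 cycles
    per iteration, versus 9 for 4.48 and 11 for 4.49.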

    loop part in 4.47

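    L4:
      mrmovq 8(%rax), %r9     # r9 = value at 8(%rax)
      mrmovq (%rax), %r10     # r10 = value at (%rax)
      rrmovq %r9, %r8
      subq %r10, %r8          # r8 = r9 - r10, sets the condition codes
      jge L3                  # r9 >= r10: skip the swap
      rmmovq %r10, 8(%rax)    # otherwise store the two values swapped
      rmmovq %r9, (%rax)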

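    Assume the jge is taken half the time. The pipeline predicts the branch as
    taken, so when it is taken the prediction is right and the loop body runs
    5 instructions. When it is not taken the prediction is wrong: the body runs
    7 instructions and the misprediction adds 2 bubbles. So Cycles Per Loop is
    50% * 5 + 50% * (7 + 2) = 7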
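The average above can be sketched as a small calculation. This is only a model, assuming one cycle per instruction, a predict-taken branch, and a fixed 2-bubble misprediction penalty; the function name `cycles_per_loop` and its parameters are hypothetical names chosen for illustration.

```python
def cycles_per_loop(p_taken, taken_instrs=5, not_taken_instrs=7, penalty=2):
    """Expected cycles for one pass of the 4.47 loop body.

    Assumes one cycle per instruction, a predict-taken branch, and a
    fixed bubble penalty whenever the branch is not taken.
    """
    taken_cost = taken_instrs                   # prediction correct
    not_taken_cost = not_taken_instrs + penalty # prediction wrong
    return p_taken * taken_cost + (1 - p_taken) * not_taken_cost


print(cycles_per_loop(0.5))  # 7.0
```

Under this model the cost is 9 - 4p for a taken probability p, so the branch version only reaches 9 cycles when the branch is never taken.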

    loop part in 4.48

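    L4:
      mrmovq 8(%rax), %r9     # r9 = value at 8(%rax)
      mrmovq (%rax), %r10     # r10 = value at (%rax)
      rrmovq %r9, %r8
      subq %r10, %r8          # r8 = r9 - r10, sets the condition codes
      cmovl %r9, %r11         # if r9 < r10: swap r9 and r10 through r11
      cmovl %r10, %r9
      cmovl %r11, %r10
      rmmovq %r9, 8(%rax)     # always store both values back
      rmmovq %r10, (%rax)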

    There is no branch to mispredict, so all 9 instructions run on every
    iteration. Cycles Per Loop is 9

    loop part in 4.49

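    L4:
      mrmovq 8(%rax), %r9     # r9 = value at 8(%rax)
      mrmovq (%rax), %r10     # r10 = value at (%rax)
      rrmovq %r9, %r8
      rrmovq %r10, %r11       # r11 = copy of the original r10
      xorq %r9, %r10          # r10 = r9 ^ r10
      subq %r11, %r8          # r8 = r9 - original r10, sets the condition codes
      cmovge %r11, %r9        # the only conditional move
      xorq %r10, %r9          # XOR steps recover the ordered pair
      xorq %r9, %r10
      rmmovq %r9, 8(%rax)     # always store both values back
      rmmovq %r10, (%rax)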

    Again there is no branch, so all 11 instructions run on every iteration.
    Cycles Per Loop is 11
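As a sanity check that the 4.48 and 4.49 loop bodies store the same pair, the register sequences can be modeled roughly as below. This is a sketch, not a Y86-64 simulator: it uses Python integers as signed register values, treats `cmovl` and `cmovge` as plain signed comparisons of the two loaded values, and the function names `loop_448` and `loop_449` are hypothetical.

```python
def loop_448(r9, r10):
    """Model of the 4.48 body: three cmovl instructions swap r9 and r10."""
    less = r9 < r10        # condition codes from subq %r10, %r8 (r9 - r10)
    r11 = 0                # placeholder: r11 is only read after being written
    if less:
        r11 = r9           # cmovl %r9, %r11
    if less:
        r9 = r10           # cmovl %r10, %r9
    if less:
        r10 = r11          # cmovl %r11, %r10
    return r9, r10         # stored to 8(%rax) and (%rax)


def loop_449(r9, r10):
    """Model of the 4.49 body: one cmovge combined with XOR steps."""
    r11 = r10              # rrmovq %r10, %r11
    r10 = r9 ^ r10         # xorq %r9, %r10
    ge = r9 >= r11         # condition codes from subq %r11, %r8
    if ge:
        r9 = r11           # cmovge %r11, %r9
    r9 = r10 ^ r9          # xorq %r10, %r9
    r10 = r9 ^ r10         # xorq %r9, %r10
    return r9, r10         # stored to 8(%rax) and (%rax)


# Both models should agree on every pair, swapped or not.
for pair in [(3, 5), (5, 3), (4, 4), (-2, 7)]:
    assert loop_448(*pair) == loop_449(*pair)
```

In both models the larger value ends up in `r9` (stored at 8(%rax)) and the smaller in `r10`, which matches what the branch version in 4.47 leaves in memory.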