Loop unswitching – a performance appraisal
Continuing from the last blog post
After reading Mr. Hưng's last blog post about the loop-unswitching optimization technique and his question of whether the JVM or other language interpreters use it, I did some Googling around, and the answer is "Yes", though unfortunately not in a positive light: http://bugs.java.com/view_bug.do?bug_id=8021898 – "Broken JIT compiler optimization for loop unswitching". The Java Virtual Machine is one of the most advanced VMs, with state-of-the-art optimization techniques, so it can obviously perform this kind of optimization. In fact, the JVM can sometimes optimize better than traditional ahead-of-time compilers (C/C++), because it can use runtime information to judge whether an optimization is worth doing. On the other hand, plain interpreters for other languages (CPython, Ruby, Perl) can't perform this kind of optimization, since they execute code directly and have no compilation step in which to apply such transformations.
I also wondered in which cases this optimization is actually useful. First, GCC lists loop unswitching (-funswitch-loops) among the optimizations enabled only at -O3 (https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html), which means it is not enabled by default and needs careful consideration: it may not increase performance, and aggressive optimization levels can even cause a program to run incorrectly. Okay, so it is not easy to find a good case for this kind of optimization, I think. To answer that question, I conducted a small test as below:
The setup
I have written two code snippets to test:
    volatile bool w = true;
    for (int i = 0; i < size; i++) {
        a[i] += sin(a[i]) + cos(b[i]);
        if (w)
            b[size - 1 - i] += sin(c[i]);
    }
and its loop-unswitched version:
    volatile bool w = true;
    for (int i = 0; i < size; i++) {
        a[i] += sin(a[i]) + cos(b[i]);
    }
    for (int i = 0; i < size; i++) {
        if (w)
            b[size - 1 - i] += sin(c[i]);
    }
(These snippets are written in C99. I made w volatile so GCC won't optimize the branch away, which gives us consistent results between the two versions.)
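To make the setup concrete, a minimal harness around these snippets could look roughly like the sketch below. The array initialisation, the final printf, the UNSWITCHED macro and the compile commands are illustrative choices of mine, not necessarily the exact code used in the benchmark:

    /* bench.c – a minimal sketch; the real benchmark may differ in details.
     * Possible build commands (the optimization level and the decision not to
     * pass -funswitch-loops are assumptions, chosen so that the hand-written
     * loop structure is what actually runs):
     *   gcc -std=c99 -O2 bench.c -o unoptimized -lm
     *   gcc -std=c99 -O2 -DUNSWITCHED bench.c -o optimized -lm
     */
    #include <math.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        /* Array size in floats, e.g. 4096 so that two arrays fit in a 32 KiB L1 cache. */
        int size = (argc > 1) ? atoi(argv[1]) : 4096;

        float *a = malloc(size * sizeof *a);
        float *b = malloc(size * sizeof *b);
        float *c = malloc(size * sizeof *c);
        if (!a || !b || !c)
            return 1;

        /* Fill the arrays with arbitrary values so the math has something to work on. */
        for (int i = 0; i < size; i++) {
            a[i] = i * 0.5f;
            b[i] = i * 0.25f;
            c[i] = i * 0.125f;
        }

        /* volatile keeps GCC from folding the branch away (or unswitching it itself). */
        volatile bool w = true;

    #ifdef UNSWITCHED
        /* Hand-unswitched version: the loop-invariant branch gets its own loop. */
        for (int i = 0; i < size; i++) {
            a[i] += sin(a[i]) + cos(b[i]);
        }
        for (int i = 0; i < size; i++) {
            if (w)
                b[size - 1 - i] += sin(c[i]);
        }
    #else
        /* Original version: the branch on w stays inside the single loop. */
        for (int i = 0; i < size; i++) {
            a[i] += sin(a[i]) + cos(b[i]);
            if (w)
                b[size - 1 - i] += sin(c[i]);
        }
    #endif

        /* Print something that depends on the results so the loops cannot be optimized out. */
        printf("%f %f\n", a[size / 2], b[size / 2]);

        free(a);
        free(b);
        free(c);
        return 0;
    }

Taking the array size on the command line matches the ./optimized 4096 and ./unoptimized 4096 invocations shown in the results below.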
Which one will run faster? A few considerations before the results:
- The total number of steps is roughly the same in both cases; in fact, the loop-unswitched version executes slightly more code because of the second loop (roughly two extra instructions per iteration: one increment and one comparison). So there won't be any gain from executing less code.
- The optimized version should generate fewer cache misses: within each loop it only needs to keep either (a and b) or (b and c) in the cache, while the unoptimized version needs to keep all three (a, b and c) in the cache at once. So in theory the interesting edge case is where (a and b) fit exactly within the cache. My CPU's L1 cache is 32 KiB, so the size of each array should be 32 * 1024 / 4 / 2 = 4096 floats. In that case the optimized version should run much faster than the unoptimized one.
I then created two versions by hand (since I can't be sure that GCC -O3 will optimize exactly as I expect) and fed them into a Python script that runs each binary 1000 times and computes the average execution time and standard deviation. Unfortunately this method is not very accurate: CPUs these days are extremely fast, so the majority of the program's execution time is spent starting up, loading libraries and allocating memory (it still gives some information, though). So I also used Linux's performance counter tool, perf, to measure page faults and other statistics.
The results
Here are the results:
    perf_3.2 stat -B ./optimized 4096

     Performance counter stats for './optimized 4096':

           5.548000 task-clock                #    0.738 CPUs utilized
                  1 context-switches          #    0.000 M/sec
                  0 CPU-migrations            #    0.000 M/sec
                 41 page-faults               #    0.007 M/sec
          4,945,792 cycles                    #    0.891 GHz
          2,748,457 stalled-cycles-frontend   #   55.57% frontend cycles idle
            216,234 stalled-cycles-backend    #    4.37% backend cycles idle
      <not counted> instructions
      <not counted> branches
      <not counted> branch-misses

        0.007518521 seconds time elapsed
    perf_3.2 stat -B ./unoptimized 4096

     Performance counter stats for './unoptimized 4096':

           6.099000 task-clock                #    0.748 CPUs utilized
                  1 context-switches          #    0.000 M/sec
                  0 CPU-migrations            #    0.000 M/sec
                 41 page-faults               #    0.007 M/sec
          4,330,276 cycles                    #    0.710 GHz
          2,331,382 stalled-cycles-frontend   #   53.84% frontend cycles idle
            170,320 stalled-cycles-backend    #    3.93% backend cycles idle
      <not counted> instructions
      <not counted> branches
      <not counted> branch-misses

        0.008151564 seconds time elapsed
Much to my surprise, the results are almost the same for the two versions! Most surprising of all, the number of page faults is identical in both cases (also confirmed with the time command), which means loop unswitching doesn't help lower the miss count at all, and so there is no performance gain in this case. My guess is that Linux divides memory into 4 KiB pages, so when we access the arrays sequentially a miss only happens once per 4 KiB / 4 bytes = 1024 accesses; to see a difference we would have to change the iteration pattern to something other than sequential. Another possible reason is that modern CPUs have sophisticated branch predictors and other advanced optimization tricks, so the gain from loop unswitching in such a simple program is nil.
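For illustration, a non-sequential access pattern for such a follow-up test could look roughly like the sketch below. The stride value, the helper function and its name are purely illustrative (it would slot into the harness sketch above, reusing the same arrays and w), and I have not benchmarked this variant:

    /* Illustrative follow-up only: walk the arrays with a large stride so that
     * consecutive iterations touch different cache lines (and, for large enough
     * arrays, different pages). STRIDE is prime, hence co-prime with a
     * 4096-element array, so every index is still visited exactly once. */
    #define STRIDE 1031

    static void strided_pass(float *a, float *b, float *c, int size, volatile bool w)
    {
        for (int n = 0; n < size; n++) {
            int i = (int)(((long)n * STRIDE) % size);  /* jump around instead of i = n */
            a[i] += sin(a[i]) + cos(b[i]);
            if (w)
                b[size - 1 - i] += sin(c[i]);
        }
    }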
Final words
So, the final question is: is loop unswitching worth doing by hand? IMO, it isn't. This kind of optimization is only suitable for BLAS or other numerical libraries, where the library author needs to squeeze every bit of performance out of every CPU cycle. And yes, I said library author, not library user, because it's better to use a carefully handcrafted piece of code that has been tested and reviewed than to write one yourself. And if you are not writing heavy numerical programs but only shuffling memory and data around (as in a typical web application), then it is simply not worth it. Still, it is an interesting technique to know and a fun exercise to test your knowledge.
(The code used in the benchmark will be uploaded to GitHub soon; comments and corrections are welcome.)