## Tuesday, July 7, 2015

### Observing Cache Associativity Effects

How Certain Memory Access Patterns Can Actually Degrade Performance

As processor clock speeds increase and memory latency improving comparatively slowly, it's important to make efficient use of the system's cache. Having your programs data stay within one of the cache hierarchies (the lower the better), is crucial for having a high performance system that can take advantage of the CPU's internal optimization mechanisms without being bounded by memory access time.

Cache Structure
Most cache structures have converged on the model of set associativity which is a combination of a theoretically ideal fully associative cache and a direct-mapped cache. The benefits of both are combined -- data is mapped into a set via a direct mapping scheme, and then uses a fully-associative scheme to find its location within the assigned set. For example, in an "8-way associative" cache, each line is assigned to set and then finds its home within one of the 8 slots available per set.

The bits of a memory address of the beginning of the data's cache line are used to decide in which set this particular cache line should go. 64-byte line sizes prompt for the first 6 bits to be reserved as offsets into the cache line to get the specific data in question, which is log2(64).  Depending on the number of sets in the cache, log2 of that number  of the following bits will be used to determine which set the line should go in. Afterwards, the line is fitted as it would in a typical fully-associative cache.  Standard stuff.

The Implications
What follows is that only a fairly limited amount of lines can be fit per set, and if a new cache line is to be loaded into L1 cache and it happens to be placed in a set that's already full, a cache line needs to be evicted conforming to the eviction policy used -- LRU for instance. This can have a tremendous and usually unexpected performance degradation on your application should there be consistent eviction and loading to and from the same set. But wait a minute, what are the chances that a new cache line would reside on the same set? For a caching system that has 64 sets and 64 byte line sizes, data whose base cache line address (the address of where its cache line begins)  are 4096 bytes apart would end up residing on the same set. This number comes from the following calculation: Every 64 bytes,  a new cache line begins. If we take a look at the binary representation, every 64 bytes would mean that the first six bits are all 1 (32+16+8+4+2+1 = 63) and in order to get to the next address, the 7th bit (which is part of the calculation of which set a cache line should go in) needs to be modified. So every 64 bytes, the bits pertaining to the cache set are changed. But eventually , the bits that correspond to the cache set will also be all 1 and need to reset back to 0 in order to keep going to a further address, and this process continues. Since every 64 bytes the cache line bits are changed, and only 6 bits are dedicated toward calculating the cache set (for a 64-set system), it follows that data 64*64 bytes apart are going to reside on the same cache set: The first 64 coming from the fact that every 64 bytes, the cache set bits are modified and the second 64 coming from the fact that there are 64 cache sets (or, 6 bits can make 64 unique values).  It just so happens that 64*64 == 4096, as I mentioned above.

The following simple program demonstrates this:

#include <iostream>
#include <array>
#include <algorithm>

int main(int argc, char* argv[])
{

std::array<int,32768> buffer = {0};
std::iota(buffer.begin(),buffer.end(),0);

unsigned long sum1 = 0;
unsigned long sum2 = 0;
unsigned long sum3 = 0;
unsigned long sum4 = 0;

constexpr int num_iterations = 1000000;
for(int i = 0; i < num_iterations;++i)
{
sum1 += buffer[0];
sum2 += buffer[1024];
sum3 += buffer[2048];
sum4 += buffer[3072];

sum1 += buffer[4096];
sum2 += buffer[5120];
sum3 += buffer[6144];
sum4 += buffer[7168];

sum1 += buffer[8192]; // hidden domino!

}
std::cout << (sum1 + sum2) + (sum3 + sum4) << '\n';

}


All its doing is summing up values from the array whose index is a multiple of 1024. What's so significant about 1024? On the system I ran this code against, sizeof(int) gives 4, which means that I'm really accessing elements that 4096 bytes apart! Each one of these accesses into the array will load the appropriate cache line and into the appropriate set. Since they're 4096 bytes away, the critical stride, we should expect each cache line to be loaded into the same set. It's easy to verify this with a simple printing routine:

void print_addr(int* const ptr)
{
std::cout << ptr << "\n\t Set: " << set_offset << '\n';
}


This function will print the bits corresponding to the cache set. Running print_addr for each multiple of 1024 (up until 8 in this case), gives the following result:

Each address in question will be placed in the same cache set, as indicated by the output of the print_addr function.

You may have noticed the comment I put in the main program -- "hidden domino". What could be so destructive about accessing that element of the buffer? It isn't an inherently expensive operation, and it looks the same as the others. In fact, we're using nice linear access traversal , going in multiples of 1024 so as to be friendly to the prefetcher. So what gives? This program has fallen victim to the 8-way associativity. The first 8 elements accessed, namely buffer[0],buffer[1024],...,buffer[7168]  will initially incur a cache miss and then be placed into the same set.  Once those first 8 elements are loaded, we do the dreaded ninth operation, which is a load from yet another location that is 4096 bytes away. What's going to happen? We'll incur an initial cache miss, as expected, and will try to place the cache line into the same set as the others. But the set is full! This leads to the eviction of the least-recently-used line which happens to be the corresponding line for buffer[0]. Here's where the dominoes start rolling. On the next iteration, buffer[0] will be accessed again but it was just evicted, leading to a miss! The line is loaded again, and the LRU line which would be buffer[1024] is evicted, but the next operation happens to be accessing buffer[1024], which misses again. The entire chain of events causes miss after miss, which is detrimental to performance.

Surely most processors would do out-of-order execution and may fetch some values before they're evicted since most of the operations are independent of each other, but this seemingly innocent traversal can cause massive L1 misses. Running the sample program above with valgrind --tool=cachegrind gives the following metrics:

Perhaps the most interesting metric is "D1 misses" which is around 8 million. There are 1 million loop iterations, and on each one we expect around 8 cache misses due to the long domino chain.

How can we remedy this? To demonstrate that this is in fact a problem with the associativity, changing "sum2 += buffer[1024]" to "sum2 += buffer[42]", which lies on a different set,  the cache misses go down significantly.

D1 misses is now at roughly 20,000. This around 400x (not 400%, but 400x) less!

How Does This Affect Me?
You may be thinking that the above example is only a theoretical case and that nobody would actually run into such problems, since a case like the one above is artificial. However, several real world programs such as matrix multiplication, matrix transposing,  and image processing lend themselves easily susceptible to this sort of problem.

Having power-of-two sized matrices and performing column-wise operations or row-column multiplications can easily lend itself to this since traversing a column on a power-of-two square matrix requires that you skip by a power-of-two amount of bytes to get to the next column element. The critical stride is by nature a power of two! It can be quite easy to begin filling up a particular cache set, and eventually lead to huge performance deviations. For an extreme case, consider this question asked on StackOverflow.

Final Remarks
I did leave off optimizations for this test. Turning on -O2 completely avoided the loop and instead loaded the elements once into registers and multiplied it out by num_iterations. This makes both the domino version and the "safe" version run equally fast. However, most programs will not be a simple addition of numbers in an array. It's important to pay attention to the usage of the cache and the access patterns used, to avoid tripping over one of the most important components in optimizing memory access and avoiding latency.