
What Every Programmer Should Know About Memory by Ulrich Drepper (2007) (read in 2021)

Published by marco on

Updated by marco on

Standard disclaimer[1]

This is a 114-page document about various features of processor architectures and of programming languages and techniques that affect performance. It starts with a discussion of how memory (RAM or cache) even works, on an electrical level. How much maintenance overhead does a capacitor in a memory cell need in order to retain its value? That is, there are cycles during which the RAM is refreshing its capacitors and cannot be read. If a processor’s request for that data arrives at the wrong time, the reply will stall longer than usual.

“A second problem resulting from the tiny charge is that the information read from the cell is not directly usable. The data line must be connected to a sense amplifier which can distinguish between a stored 0 or 1 over the whole range of charges which still have to count as 1. A third problem is that reading a cell causes the charge of the capacitor to be depleted. This means every read operation must be followed by an operation to recharge the capacitor. This is done automatically by feeding the output of the sense amplifier back into the capacitor.”
Position 162-165

This takes him into discussing processor caches and how and when data is retained or evicted. How can you align structures and data so that they fit into a cache line? How do you order operations so that a cache line can stay in the cache for as long as possible? What sort of eviction policies do processors even use? LRU? Something else? Can you control them? How do you avoid thrashing?
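
To make this concrete, here’s a small sketch of my own (not from the paper): ask the compiler to start a hot structure on a cache-line boundary and keep the fields that are used together next to each other. The 64-byte line size and the field layout are assumptions.

#include <stdalign.h>
#include <stdio.h>

#define CACHE_LINE 64   /* assumed line size; typical on x86-64 */

/* The fields that the hot loop touches live side by side, and each element
 * starts on a cache-line boundary, so iterating over an array of these
 * touches as few lines as possible.  Rarely-used bookkeeping belongs in a
 * separate, "cold" structure. */
struct particle {
    alignas(CACHE_LINE) float x;   /* alignas on the first member aligns the whole struct */
    float y, z;                    /* position, written every iteration */
    float vx, vy, vz;              /* velocity, read every iteration    */
};

int main(void)
{
    printf("size %zu, alignment %zu\n",
           sizeof(struct particle), alignof(struct particle));
    return 0;
}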

“The state changes are accomplished without too much effort by the processors listening, or snooping, on the other processors’ work. Certain operations a processor performs are announced on external pins and thus make the processor’s cache handling visible to the outside. The address of the cache line in question is visible on the address bus.”
Position 841-843

Next up is concurrency: when multiple processors, each with its own hierarchy of caches, are working with the “same” memory, they have to coordinate on cache invalidations and evictions so that no processor is using stale data. The amount of coordination is incredible, and an eviction becomes even more costly as the number of processors increases. Changing data in memory becomes quite expensive and should be avoided, if possible. At the very least, one should be aware of how much it costs to change memory. There might be simple things you can do at a high level of abstraction that make it easier for the compiler to generate performant code for you.
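
A classic example of that cost is false sharing. The following is my own minimal sketch (not from the paper), assuming a 64-byte cache line: two threads each increment their own counter, but without padding both counters sit in the same cache line, so the line ping-pongs between the cores on every write. Compile with -pthread, and with -DPADDED to give each counter its own line and compare the run times.

#include <pthread.h>
#include <stdio.h>

#ifdef PADDED
#  define PAD char pad[64 - sizeof(long)];   /* one counter per (assumed) line */
#else
#  define PAD                                /* both counters share a line */
#endif

static struct { long value; PAD } counters[2];

static void *worker(void *arg)
{
    long *counter = arg;
    for (long i = 0; i < 200 * 1000 * 1000; i++)
        (*counter)++;                        /* each write forces an RFO */
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, &counters[i].value);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    printf("%ld %ld\n", counters[0].value, counters[1].value);
    return 0;
}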

“When we modify memory things get even uglier. Figure 3.20 shows the results for the sequential Increment test. This graph is using a logarithmic scale for the Y axis. So, do not be fooled by the apparently small differences. We still have about a 18% penalty for running two threads and now an amazing 93% penalty for running four threads. This means the prefetch traffic together with the write-back traffic is pretty much saturating the bus when four threads are used.”
Position 898-901

It’s very good to know about the problems, but solving them yourself, as instructed in this document, is almost certainly a bridge too far for most developers. This is why you use task libraries. Let the experts figure out what the optimal number of threads is for your tasks and the available hardware. If you’re launching your own thread, you’re almost certainly doing it wrong.

This stuff is hard, as evidenced by this book. Instead, follow the old edict: use as high a level of abstraction as possible and leave the lower-level stuff to people who are experts at the lower levels. If your application is doing a high-level task, use a high-level API and trust that the runtime and compiler will help optimize it for you.

Obviously, this has limits. If measurements (see the end of the paper, where he demonstrates tools like valgrind) show that there is a performance problem and it’s critical, then you’ll have to drop one or more levels of abstraction and take care of it yourself. This may also become necessary if the API, language, or runtime doesn’t let you provide enough information about your application for the compiler or runtime to optimize performance.

This is, for example, why .NET and C# continue to introduce concepts like record struct and readonly record struct. These are all additional hints about what sort of changes are even allowed. A struct has copy semantics, which automatically means that the original structure won’t be modified when passed across a function boundary. A readonly record struct promises that the fields can’t be modified at all, so the compiler can actually just pass a reference instead of a copy. This avoids cache eviction on function boundaries as well.
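
The paper’s world is C, where the analogous trade-off is spelled out by hand: passing a large struct by value copies it across the call, while passing a pointer-to-const moves a single word and promises that the callee won’t modify anything. A hypothetical sketch (the struct and function names are mine, not from the paper or from .NET):

#include <stdio.h>

struct big {
    double samples[64];   /* 512 bytes: several cache lines */
};

/* by value: the caller copies all 512 bytes into the callee's frame */
static double sum_by_value(struct big b)
{
    double s = 0.0;
    for (int i = 0; i < 64; i++)
        s += b.samples[i];
    return s;
}

/* by const pointer: one 8-byte pointer is passed; the const is the
 * "readonly" promise that lets both sides skip the copy */
static double sum_by_ref(const struct big *b)
{
    double s = 0.0;
    for (int i = 0; i < 64; i++)
        s += b->samples[i];
    return s;
}

int main(void)
{
    struct big b = { { 0 } };
    b.samples[0] = 1.0;
    printf("%f %f\n", sum_by_value(b), sum_by_ref(&b));
    return 0;
}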

Most of the information in this long paper is still accurate today, but a good number of the middle sections that deal with BP (branch prediction) say nothing about potential data leaks. Instead, the paper still discusses the advantages of processors being able to inspect each other’s caches, at least to some degree. Spectre and Meltdown showed us that those caches will sometimes still contain the results of speculated branches that were discarded, and that clever programs can read them back through timing side channels. Some of this functionality is no longer available to the aggressive optimizer.

“Program flow is much more predictable than data access patterns. Today’s CPUs are very good at detecting patterns. This helps with prefetching. As a result CPU designers spend a lot of time and chip real estate on branch prediction so that pipeline stalls happen as infrequently as possible.”
Position 1006-1008

There is also a section on how one could really optimize for performance by bypassing unneeded OS restrictions, or save memory by allowing an application to modify itself. Such tricks can be highly beneficial for performance, but a general-purpose operating system would not be able to distinguish them from the workings of a malicious program.

“In early computer days memory was a premium. People went to great lengths to reduce the size of the program to make more room for program data. One trick frequently deployed was to change the program itself over time. Such Self Modifying Code (SMC) is occasionally still found, these days mostly for performance reasons or in security exploits. SMC”
Position 1024-1026

Forth was one such language/environment. One use for something like this would be to discard code like one-time init/startup code or large and intricate singleton initializers (e.g. data generation). There are ways of doing this now by loading dynamic modules and then unloading them when no longer needed. Such solutions aren’t used very often because the work involved is complex.
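
On Linux, the dynamic-module route looks roughly like this (my sketch, not from the paper; libinit.so and run_startup are made-up names; link with -ldl):

#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    /* load the one-time initialization code only when it's needed ... */
    void *handle = dlopen("./libinit.so", RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL) {
        fprintf(stderr, "dlopen: %s\n", dlerror());
        return 1;
    }

    void (*run_startup)(void) = (void (*)(void))dlsym(handle, "run_startup");
    if (run_startup != NULL)
        run_startup();

    /* ... and drop it again so its code and data no longer compete for
     * cache and memory with the long-running part of the program */
    dlclose(handle);

    /* long-running work continues here */
    return 0;
}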

“If performance is really much more important than security, randomization can be turned off. The OS will then usually at least load all DSOs contiguously in virtual memory.”
Position 1248-1249

I find it fascinating that, for a really fast program, you’d almost have to consider using a different operating system. You can’t use any consumer-grade OS because those have so many layers of code built to protect against attacks over an open network. None of that is necessary if you have much more control over the environment. If you’re air-gapped and control the processes, then you don’t have to worry about malicious code and can remove many layers of protection.

You can use self-modifying code (because you know it’s not a trojan), you can skip virus checks, you can keep things in memory contiguously instead of randomizing loading positions, you can run a much lighter, more-dedicated kernel. Don’t run in any more of a VM than you have to. If you’re really, really pushing for performance, then it makes more sense to first remove the umpteen layers of protective code that aren’t necessary before you start rewriting your application to try to get around them.
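
On Linux, for example, a tiny launcher can turn off address-space randomization for its child before exec’ing it (a hedged sketch; ./workload is a made-up name, and setarch -R does the same thing from the shell):

#include <stdio.h>
#include <sys/personality.h>
#include <unistd.h>

int main(void)
{
    int persona = personality(0xffffffff);          /* read current persona */
    if (persona == -1 ||
        personality(persona | ADDR_NO_RANDOMIZE) == -1) {
        perror("personality");
        return 1;
    }
    /* exec the real workload with ASLR disabled */
    execl("./workload", "./workload", (char *)NULL);
    perror("execl");
    return 1;
}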

That takes the discussion to virtualization, which has become even more of a hot topic than it was in 2007.

“Overall, programmers must be aware that, with virtualization used, the cost of cache misses (instruction, data, or TLB) is even higher than without virtualization. Any optimization which reduces this work will pay off even more in virtualized environments. Processor designers will, over time, reduce the difference more and more through technologies like EPT and NPT but it will never completely go away.”
Position 1379-1382

We’re very often using several layers of virtualization in our programming environments. The code that we write is farther and farther away from running “on the metal”.

For example, I’m working on an M1 Mac right now, typing into a browser window. The browser runs JavaScript in an engine with several tiers of interpretation and JIT compilation, each trying to determine what kind of changes could happen to the code and to what degree it can be compiled down to simple machine code.

If the browser doesn’t run natively on the M1, then it runs through a translation layer named Rosetta 2. If I’m running MySQL in a Docker container, that’s another layer of virtualization. All of these layers have to do their level best to present enough information to the next layer down so that it can produce the simplest and most efficient representation possible. This is all getting even more complicated.

For example, the costs of the various types of cache misses are not even so easy to predict:

“While page faults usually are a one-time cost, TLB misses are a perpetual penalty given that the TLB cache is usually small and it is flushed frequently. Page faults are orders of magnitude more expensive than TLB misses but, if a program is running long enough and certain parts of the program are executed frequently enough, TLB misses can outweigh even page fault costs.”
Position 1944-1947

“The measured overhead, computed by dividing the time needed when using one single cache line versus a separate cache line for each thread, is 390%, 734%, and 1,147% respectively. These large numbers might be surprising at first sight but, when thinking about the cache interaction needed, it should be obvious. The cache line is pulled from one processor’s cache just after it has finished writing to the cache line. All processors, except the one which has the cache line at any given moment, are delayed and cannot do anything. Each additional processor will just cause more delays.”
Position 2166-2170

For these reasons, manufacturers like Apple have moved to System on a Chip (SoC) architectures. These have many, many cores and place the RAM very close to those cores and their caches, connected by very wide buses. As Drepper mentions several times, the various components from various manufacturers must be very finely coordinated. The SoC approach collects all of these components under the aegis of one manufacturer and opens up the possibility of improving cache coherence and interaction with memory. Interesting times.

“In general the best advice which can be given is 1. Separate at least read-only (after initialization) and read-write variables. Maybe extend this separation to read-mostly variables as a third category. 2. Group read-write variables which are used together into a structure. Using a structure is the only way to ensure the memory locations for all of those variables are close together in a way which is translated consistently by all gcc versions..”
Position 2208-2211

This aligns pretty well with the advice to use immutable data structures as much as possible. It also explains why it’s often faster to make a modified copy of a structure than it is to modify the value in place. The initial modification costs less, but the overall cost of using a mutable structure ends up being much higher over the lifetime of the program, because the compiler simply cannot optimize cache usage for a mutable data structure.
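
Concretely, the two rules from the quote might look something like this in C (my sketch; all names are made up):

/* read-only after initialization: const, so it lands in .rodata, well away
 * from anything that is ever written */
static const struct config {
    int    max_items;
    double timeout_seconds;
} cfg = { 128, 2.5 };

/* read-write state that the hot path updates together: grouped into one
 * structure so the fields end up on the same cache line(s) */
static struct counters {
    unsigned long requests;
    unsigned long errors;
    unsigned long bytes_out;
} stats;

void handle_request(unsigned long n_bytes)
{
    stats.requests++;
    if (n_bytes == 0)
        stats.errors++;
    else
        stats.bytes_out += n_bytes;
}

double request_timeout(void)
{
    return cfg.timeout_seconds;   /* readers of cfg never dirty a cache line */
}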

Again, most of us programmers won’t have very much direct control over any of this, but knowing how it all works is very helpful for predicting when you’re doing something really silly. Remember: your application logic is probably operating at a level of abstraction where the best you can do is to be as explicit and clear as possible about what your code needs and does, and then let the compiler and runtime do the best they can with it. Use immutable data, use non-nullable references, avoid side-effects (be pure), use task libraries for concurrency…and you should be OK almost all of the time.


[1] Disclaimer: these are notes I took while reading this book. They include citations I found interesting or enlightening or particularly well-written. In some cases, I’ve pointed out which of these applies to which citation; in others, I have not. Any benefit you gain from reading these notes is purely incidental to the purpose they serve of reminding me of what I once read. Please see Wikipedia for a summary if I’ve failed to provide one sufficient for your purposes. If my notes serve to trigger an interest in this book, then I’m happy for you.

Citations

“The title of this paper is an homage to David Goldberg’s classic paper “What Every Computer Scientist Should Know About Floating-Point Arithmetic” [12]. This paper is still not widely known, although it should be a prerequisite for anybody daring to touch a keyboard for serious programming.”
Position 49-51
“A couple of bottlenecks are immediately apparent in this design. One such bottleneck involves access to RAM for devices. In the earliest days of the PC, all communication with devices on either bridge had to pass through the CPU, negatively impacting overall system performance. To work around this problem some devices became capable of direct memory access (DMA). DMA allows devices, with the help of the Northbridge, to store and receive data in RAM directly without the intervention of the CPU (and its inherent performance cost).”
Position 77-80
“Today all high-performance devices attached to any of the buses can utilize DMA. While this greatly reduces the workload on the CPU, it also creates contention for the bandwidth of the Northbridge as DMA requests compete with RAM access from the CPUs. This problem, therefore, must be taken into account.”
Position 80-82

Also security issues.

“[…] to accommodate the huge number of cells (chips with 10⁹ or more cells are now common) the capacity to the capacitor must be low (in the femto-farad range or lower). A fully charged capacitor holds a few 10’s of thousands of electrons. Even though the resistance of the capacitor is high (a couple of tera-ohms) it only takes a short time for the capacity to dissipate. This problem is called “leakage”. This leakage is why a DRAM cell must be constantly refreshed. For most DRAM chips these days this refresh must happen every 64ms. During the refresh cycle no access to the memory is possible since a refresh is simply a memory read operation where the result is discarded.”
Position 156-161
“A second problem resulting from the tiny charge is that the information read from the cell is not directly usable. The data line must be connected to a sense amplifier which can distinguish between a stored 0 or 1 over the whole range of charges which still have to count as 1. A third problem is that reading a cell causes the charge of the capacitor to be depleted. This means every read operation must be followed by an operation to recharge the capacitor. This is done automatically by feeding the output of the sense amplifier back into the capacitor.”
Position 162-165
“A program selects a memory location using a virtual address. The processor translates this into a physical address and finally the memory controller selects the RAM chip corresponding to that address. To select the individual memory cell on the RAM chip, parts of the physical address are passed on in the form of a number of address lines.”
Position 182-184
“In the example the address lines a0 and a1 through the row address selection (RAS) demultiplexer select the address lines of a whole row of cells. When reading, the content of all cells is thusly made available to the column address selection (CAS) multiplexer. Based on the address lines a2 and a3 the content of one column is then made available to the data pin of the DRAM chip. This happens many times in parallel on a number of DRAM chips to produce a total number of bits corresponding to the width of the data bus.”
Position 196-201
“Hardware and software prefetching (see section 6.3) can be used to create more overlap in the timing and reduce the stall. Prefetching also helps shift memory operations in time so that there is less contention at later times, right before the data is actually needed. This is a frequent problem when the data produced in one round has to be stored and the data required for the next round has to be read.”
Position 395-398
“While DMA is certainly beneficial, it means that there is more competition for the FSB bandwidth. In times with high DMA traffic the CPU might stall more than usual while waiting for data from the main memory.”
Position 404-406
“In symmetric multi-processor (SMP) systems the caches of the CPUs cannot work independently from each other. All processors are supposed to see the same memory content at all times. The maintenance of this uniform view of memory is called “cache coherency”.”
Position 519-521
“If a write access is detected and the processor has a clean copy of the cache line in its cache, this cache line is marked invalid. Future references will require the cache line to be reloaded. Note that a read access on another CPU does not necessitate an invalidation, multiple clean copies can very well be kept around.”
Position 525-527
“If these rules can be maintained, processors can use their caches efficiently even in multi-processor systems. All the processors need to do is to monitor each others’ write accesses and compare the addresses with those in their local caches […]”
Position 535-536
“Write-combining is a limited caching optimization more often used for RAM on devices such as graphics cards. Since the transfer costs to the devices are much higher than the local RAM access it is even more important to avoid doing too many transfers. Transferring an entire cache line just because a word in the line has been written is wasteful if the next operation modifies the next word. One can easily imagine that this is a common occurrence, the memory for horizontal neighboring pixels on a screen are in most cases neighbors, too. As the name suggests, write-combining combines multiple write accesses before the cache line is written out. In ideal cases the entire cache line is modified word by word and, only after the last word is written, the cache line is written to the device.”
Position 814-819
“Processor operations on cache lines are frequent (of course, why else would we have this paper?) which means broadcasting information about changed cache lines after each write access would be impractical.”
Position 832-834
“The state changes are accomplished without too much effort by the processors listening, or snooping, on the other processors’ work. Certain operations a processor performs are announced on external pins and thus make the processor’s cache handling visible to the outside. The address of the cache line in question is visible on the address bus.”
Position 841-843
“[…] the second processor wants to write to the cache line the first processor sends the cache line content and marks the cache line locally as Invalid. This is the infamous “Request For Ownership” (RFO) operation. Performing this operation in the last level cache, just like the I→M transition is comparatively expensive. For write-through caches we also have to add the time it takes to write the new cache line content to the next higher-level cache or the main memory, further increasing the cost.”
Position 850-854
“When we modify memory things get even uglier. Figure 3.20 shows the results for the sequential Increment test. This graph is using a logarithmic scale for the Y axis. So, do not be fooled by the apparently small differences. We still have about a 18% penalty for running two threads and now an amazing 93% penalty for running four threads. This means the prefetch traffic together with the write-back traffic is pretty much saturating the bus when four threads are used.”
Position 898-901
“Finally in Figure 3.21 we have the numbers for the Addnextlast test with random access of memory. This figure is provided mainly to show the appallingly high numbers. It now takes around 1,500 cycles to process a single list element in the extreme case. The use of more threads is even more questionable. We can summarize the efficiency of multiple thread use in a table.”
Position 916-918

This is why you use task libraries. Let the experts figure out what the optimal number of threads is for your tasks and the available hardware. If you’re launching your own thread, you’re almost certainly doing it wrong. This stuff is hard, as evidenced by this book.

“as soon as the L3 cache is not sufficient to hold the working set the numbers crash. They crash to the point that the speed-up of two and four threads is identical (see the fourth column in Table 3.3). This is one of the reasons why one can hardly find motherboard with sockets for more than four CPUs all using the same memory controller. Machines with more processors have to be built differently.”
Position 926-928
“As for the cache replacement there is not much a programmer can do. If the cache is using physical address tags there is no way to find out how the virtual addresses correlate with the cache sets. It might be that cache lines in all logical pages are mapped to the same cache sets, leaving much of the cache unused. If anything, it is the job of the OS to arrange that this does not happen too often. With the advent of virtualization things get even more complicated. Now not even the OS has control over the assignment of physical memory. The Virtual Machine Monitor (VMM, aka Hypervisor) is responsible for the physical memory assignment.”
Position 995-999
“Program flow is much more predictable than data access patterns. Today’s CPUs are very good at detecting patterns. This helps with prefetching. As a result CPU designers spend a lot of time and chip real estate on branch prediction so that pipeline stalls happen as infrequently as possible.”
Position 1006-1008

Oh, this was before Spectre and Meltdown ruined that party.

“A long pipeline means that if the pipeline stalls (i.e., the instruction flow through it is interrupted) it takes a while to get up to speed again. Pipeline stalls happen, for instance, if the location of the next instruction cannot be correctly predicted or if it takes too long to load the next instruction (e.g., when it has to be read from memory).”
Position 1021-1024
“In early computer days memory was a premium. People went to great lengths to reduce the size of the program to make more room for program data. One trick frequently deployed was to change the program itself over time. Such Self Modifying Code (SMC) is occasionally still found, these days mostly for performance reasons or in security exploits. SMC”
Position 1024-1026

Forth was one such language/environment. One use would be to discard one-time init/startup code or large singleton initializers.

“Finally, since the processor assumes–for simplicity reasons and because it is true in 99.9999999% of all cases– that the code pages are immutable, the L1i implementation does not use the MESI protocol but instead a simplified SI protocol. This means if modifications are detected a lot of pessimistic assumptions have to be made.”
Position 1031-1033
“If performance is really much more important than security, randomization can be turned off. The OS will then usually at least load all DSOs contiguously in virtual memory.”
Position 1248-1249
“The cache into which the computed values are stored is called the Translation Look-Aside Buffer (TLB). It is usually a small cache since it has to be extremely fast. Modern CPUs provide multi-level TLB caches, just as for the other caches; the higher-level caches are larger and slower. The small size of the L1TLB is often made up for by making the cache fully associative, with an LRU eviction policy.”
Position 1263-1266
“In the first case the TLB is flushed whenever a context switch is performed. Since, in most OSes, a switch from one thread/process to another requires executing some kernel code, TLB flushes are restricted to leaving (and sometimes entering) the kernel address space. On virtualized systems it also happens when the kernel has to call the VMM and on the way back.”
Position 1281-1284
“One possibility to optimize the cache flushes is to individually invalidate TLB entries. For instance, if the kernel code and data falls into a specific address range, only the pages falling into this address range have to evicted from the TLB. This only requires comparing tags and, therefore, is not very expensive. This method is also useful in case a part of the address space is changed, for instance, through a call to munmap.”
Position 1293-1296
“Overall, programmers must be aware that, with virtualization used, the cost of cache misses (instruction, data, or TLB) is even higher than without virtualization. Any optimization which reduces this work will pay off even more in virtualized environments. Processor designers will, over time, reduce the difference more and more through technologies like EPT and NPT but it will never completely go away.”
Position 1379-1382
“When data is produced and not (immediately) consumed again, the fact that memory store operations read a full cache line first and then modify the cached data is detrimental to performance. This operation pushes data out of the caches which might be needed again in favor of data which will not be used soon. This is especially true for large data structures, like matrices, which are filled and then used later. Before the last element of the matrix is filled the sheer size evicts the first elements, making caching of the writes ineffective.”
Position 1539-1542
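
The remedy the paper goes on to discuss, as far as I recall, is non-temporal (streaming) stores, which write around the cache when filling a large buffer. A minimal sketch of my own using the SSE2 intrinsics:

#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

/* dst must be 16-byte aligned and len_words a multiple of 4 */
void fill_streaming(uint32_t *dst, size_t len_words, uint32_t value)
{
    __m128i v = _mm_set1_epi32((int)value);
    for (size_t i = 0; i < len_words; i += 4)
        _mm_stream_si128((__m128i *)&dst[i], v);   /* bypasses the cache */
    _mm_sfence();   /* make the streamed writes globally visible */
}
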
“The multimedia extensions previously mentioned in this section almost always require that the memory accesses are aligned. I.e., for 16 byte memory accesses the address is supposed to be 16 byte aligned. The x86 and x86-64 processors have special variants of the memory operations which can handle unaligned accesses but these are slower. This hard alignment requirement is nothing new for most RISC architectures which require full alignment for all memory accesses. Even if an architecture supports unaligned accesses this is sometimes slower than using appropriate alignment, especially if the misalignment causes a load or store to use two cache lines instead of one.”
Position 1727-1731
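
In practice this means allocating buffers with the required alignment so that the fast, aligned load/store variants can be used. My sketch, not the paper’s:

#include <stdio.h>
#include <stdlib.h>
#include <xmmintrin.h>

int main(void)
{
    /* C11 aligned_alloc: the size must be a multiple of the alignment */
    float *v = aligned_alloc(16, 1024 * sizeof(float));
    if (v == NULL)
        return 1;
    for (int i = 0; i < 1024; i++)
        v[i] = (float)i;

    __m128 acc = _mm_setzero_ps();
    for (int i = 0; i < 1024; i += 4)
        acc = _mm_add_ps(acc, _mm_load_ps(&v[i]));   /* aligned load */

    float out[4];
    _mm_storeu_ps(out, acc);
    printf("%f\n", out[0] + out[1] + out[2] + out[3]);
    free(v);
    return 0;
}
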
“These problems create stalls in execution with a possibly severe impact on performance. This is why today’s processors invest heavily in branch prediction (BP). Highly specialized BP units try to determine the target of a jump as far ahead of the jump as possible so that the processor can initiate loading the instructions at the new location into the cache. They use static and dynamic rules and are increasingly good at determining patterns in execution.”
Position 1808-1812

Spectre and Meltdown are again a problem.

“if a function should never be inlined despite being small enough, the noinline function attribute can be used. Using this attribute makes sense even for small functions if they are called often from different places. If the L1i content can be reused and the overall footprint is reduced this often makes up for the additional cost of the extra function call. Branch prediction units are pretty good these days. If inlining can lead to more aggressive optimizations things look different. This is something which must be decided on a case-by-case basis.”
Position 1842-1846
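
The attribute from the quote, as it looks in gcc/clang (the function is hypothetical; whether marking it noinline actually helps is exactly the case-by-case question above):

#include <string.h>

__attribute__((noinline))
static int is_verbose_flag(const char *arg)
{
    /* called from many sites; keeping a single copy keeps L1i pressure down */
    return strcmp(arg, "--verbose") == 0;
}

int count_verbose(int argc, char **argv)
{
    int n = 0;
    for (int i = 1; i < argc; i++)
        n += is_verbose_flag(argv[i]);
    return n;
}
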
“If the condition is frequently false, the execution is not linear. There is a big chunk of unused code in the middle which not only pollutes the L1i due to prefetching, it also can cause problems with branch prediction. If the branch prediction is wrong the conditional expression can be very inefficient.”
Position 1854-1856
“Whenever conditional execution is used and it is lopsided (i.e., the expression far more often leads to one result than the other) there is the potential for false static branch prediction and thus bubbles in the pipeline. This can be prevented by telling the compiler to move the less often executed code out of the main code path.”
Position 1856-1858
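
The gcc facility for “telling the compiler” this is __builtin_expect; the likely/unlikely macros and the function below are my own sketch:

#include <stdio.h>

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

void process(int *data, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (unlikely(data[i] < 0)) {
            /* rare error path: the compiler moves this out of the
             * straight-line hot path */
            fprintf(stderr, "negative value at %zu\n", i);
            continue;
        }
        data[i] *= 2;   /* hot path stays linear and prefetchable */
    }
}
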
“While page faults usually are a one-time cost, TLB misses are a perpetual penalty given that the TLB cache is usually small and it is flushed frequently. Page faults are orders of magnitude more expensive than TLB misses but, if a program is running long enough and certain parts of the program are executed frequently enough, TLB misses can outweigh even page fault costs.”
Position 1944-1947
“Prefetching has one big weakness: it cannot cross page boundaries. The reason should be obvious when one realizes that the CPUs support demand paging. If the prefetcher were allowed to cross page boundaries, the access might trigger an OS event to make the page available. This by itself can be bad, especially for performance. What is worse is that the prefetcher does not know about the semantics of the program or the OS itself. It might therefore prefetch pages which, in real life, never would be requested.”
Position 1975-1979
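
The software side of prefetching, as gcc exposes it, looks like this (my sketch; the prefetch distance is a guess that would need measuring):

#include <stddef.h>

long sum_array(const long *data, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* ask for data a few cache lines ahead: read-only access,
         * moderate temporal locality */
        if (i + 32 < n)
            __builtin_prefetch(&data[i + 32], 0, 1);
        sum += data[i];
    }
    return sum;
}
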
“The OS, when processing the packet, first has to determine what kind of packet it is. If the DCA hint is not ignored, the loads the OS has to perform to identify the packet most likely hit the cache. Multiply this saving of hundreds of cycles per packet with tens of thousands of packets which can be processed per second, and the savings add up to very significant numbers, especially when it comes to latency. Without the integration of I/O hardware (a NIC in this case), chipset, and CPUs such an optimization is not possible. It is therefore necessary to make sure to select the platform wisely if this technology is needed.”
Position 2137-2141
“The measured overhead, computed by dividing the time needed when using one single cache line versus a separate cache line for each thread, is 390%, 734%, and 1,147% respectively. These large numbers might be surprising at first sight but, when thinking about the cache interaction needed, it should be obvious. The cache line is pulled from one processor’s cache just after it has finished writing to the cache line. All processors, except the one which has the cache line at any given moment, are delayed and cannot do anything. Each additional processor will just cause more delays.”
Position 2166-2170
“In general the best advice which can be given is 1. Separate at least read-only (after initialization) and read-write variables. Maybe extend this separation to read-mostly variables as a third category. 2. Group read-write variables which are used together into a structure. Using a structure is the only way to ensure the memory locations for all of those variables are close together in a way which is translated consistently by all gcc versions..”
Position 2208-2211
“Now a reader might ask a question: why would somebody use the complicated and longer code which utilizes CAS? The answer to this is: the complexity is usually hidden. As mentioned before, CAS is currently the unifying atomic operation across all interesting architectures. So some people think it is sufficient to define all atomic operations in terms of CAS. This makes programs simpler. But as the numbers show, the results can be everything but optimal. The memory handling overhead of the CAS solution is huge.”
Position 2265-2269
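
For reference, this is what the CAS pattern the quote criticizes looks like with C11 atomics (my sketch); the cheaper, direct alternative for this particular case would be atomic_fetch_add:

#include <stdatomic.h>

static _Atomic long counter;

void add(long delta)
{
    long old = atomic_load(&counter);
    /* retry until no other thread changed the value underneath us;
     * on failure the CAS reloads `old` with the current value */
    while (!atomic_compare_exchange_weak(&counter, &old, old + delta))
        ;
    /* direct form: atomic_fetch_add(&counter, delta); */
}
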
“The processor cores themselves run at frequencies where, at full speed, even in perfect conditions, the connection to the memory cannot fulfill all load and store requests without waiting. Now, further divide the available bandwidth by the number of cores, hyper-threads, and processors sharing a connection to the Northbridge and suddenly parallelism becomes a big problem. Efficient programs may be limited in their performance by the available memory bandwidth.”
Position 2295-2298
“The performance measurement counters of modern processors allow the observation of FSB contention. On Core 2 processors the NUS_BNR_DRV event counts the number of cycles a core has to wait because the bus is not ready. This indicates that the bus is highly used and loads from or stores to main memory take even longer than usual. The Core 2 processors support more events which can count specific bus actions like RFOs or the general FSB utilization.”
Position 2301-2304
“cachegrind is a simulator which does not use measurements from the processor. The actual cache implementation in the processor might very well be quite different. cachegrind simulates Least Recently Used (LRU) eviction, which is likely to be too expensive for caches with large associativity. Furthermore, the simulation does not take context switches and system calls into account, both of which can destroy large parts of L2 and must flush L1i and L1d. This causes the total number of cache misses to be lower than experienced in reality. Nevertheless, cachegrind is a nice tool to learn about a program’s memory use and its problems with memory.”
Position 2691-2695
“Before the process terminates massif creates two files: massif.XXXXX.txt and massif.XXXXX.ps; XXXXX is as before the PID of the process. The .txt file is a summary of the memory use for all call sites and the .ps is what can be seen in Figure”
Position 2709-2710
“This is a nice separation of concerns. A GUI mindset would draw directly to screen. This streams data to handlers that transform data into outputs. Writing it like this is not overdesign. It makes the components simpler and more boring. It makes it testable. It’s better to have more, but simpler components. Maximize obviousness.”
Position 2710
“This all means that memory used by the program is interspersed with memory only used by the allocator for administrative purposes. We might see something like this: Header Data Padding Each block represents one memory word. In this small region of memory we have four allocated blocks. The overhead due to the block header and padding is 50%. Due to the placement of the header, this automatically means that the effective prefetch rate of the processor is lowered by up to 50% as well. If the blocks were be processed sequentially (to take maximum advantage of prefetching), the processor would read all the header and padding words into the cache, even though they are never supposed to be read from or written to by the application itself. Only the runtime uses the header words, and the runtime only comes into play when the block is freed.”
Position 2761-2767
“The most important thing to get right is the selection of representative tests to perform the measurements. If the test workload does not match the way the program is actually used, the performed optimizations might actually do more harm than good. For this reason, is it often hard to use PGO for libraries.”
Position 2797-2799
“The waste of physical RAM would simply be too large. But very large pages have their advantages: if huge data sets are used, storing them in 2MB pages on x86-64 would require 511 fewer page faults (per large page) than using the same amount of memory with 4k pages. This can make a big difference. The solution is to selectively request memory allocation which, just for the requested address range, uses huge memory pages and, for all the other mappings in the same process, uses the normal page size. Huge page sizes come with a price, though. Since the physical memory used for large pages must be continuous, it might, after a while, not be possible to allocate such pages due to memory fragmentation. People are working on memory defragmentation and fragmentation avoidance, but it is very complicated.”
Position 2875-2880
“By using the same file name in the open call, multiple processes can share the same huge pages and collaborate.”
Position 2887-2888
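
A hedged Linux sketch of that idea: processes that open the same file name under a hugetlbfs mount and mmap it share the same huge pages. The mount point and file name are assumptions, and huge pages must already have been reserved by the administrator.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define LENGTH (2UL * 1024 * 1024)   /* one 2MB huge page */

int main(void)
{
    int fd = open("/dev/hugepages/shared-area", O_CREAT | O_RDWR, 0600);
    if (fd == -1) { perror("open"); return 1; }

    void *p = mmap(NULL, LENGTH, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    /* every process that maps the same name sees the same huge page */
    ((char *)p)[0] = 42;

    munmap(p, LENGTH);
    close(fd);
    return 0;
}
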
“There are a few more details to iron out with respect to context switches (possible modification on the same processor) and accidental reloading of the cache line after a write on another processor. This is nothing that policies (cache flush on context switch) and extra flags, or separate cache lines for LL/SC instructions, cannot fix. In general, the LL/SC implementation comes almost for free with the implementation of a cache coherence protocol like MESI.”
Position 2992-2995
“Neither the VALIDATE nor COMMIT operations automatically and implicitly create bus operations. This is the huge advantage transactional memory has over atomic operations. With atomic operations, concurrency is made possible by writing changed values back into main memory. If you have read this document thus far, you should know how expensive this is. With transactional memory, no accesses to the main memory are forced”
Position 3094-3097
“In section 6.4.2, we already discussed how the lock prefix, available on x86 and x86-64, can be used to avoid the coding of atomic operations in some situations. The proposed tricks falls short, though, when there are multiple threads in use which do not contend for the same memory. In this case, the atomic operations are used unnecessarily. With transactional memory this problem goes away. The expensive RFO bus message are issued only if memory is used on different CPUs concurrently or in succession; this is only the case when they are needed. It is almost impossible to do any better.”
Position 3108-3112
“Buffered (or registered) memory changes the situation: instead of directly connecting the RAM chips on the DRAM module to the memory, they are connected to a buffer which, in turn, is then connected to the memory controller. This significantly reduces the complexity of the electrical connections. The ability of the memory controllers to drive DRAM modules increases by a factor corresponding to the number of connections saved.”
Position 3359-3362
“As with registered DRAM, the question has to be asked: why is ECC DRAM not the norm? The answer to this question is the same as for the equivalent question about registered RAM: the extra RAM chip increases the cost and the parity computation increases the delay.”
Position 3418-3420