Designing for low latency
Cache Coherence
Modern multi-core processors use multiple levels of cache—L1, L2, and L3—to reduce memory access latency and improve performance. However, giving each core its own cache introduces the problem of maintaining a consistent view of memory across those caches. This is where cache coherence becomes crucial.
Key Cache Coherence Properties
• Write Propagation: If one core writes a value to a memory location, that change must eventually become visible to all other cores.
• Transaction Serialization: All cores must agree on the order of writes to a single memory location. This ensures consistency across caches.
• Program Order: A single core must see its own memory operations in the order specified by its program.
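These properties can be observed from software. The minimal sketch below (hypothetical names and workload, written against C++ std::atomic) relies on the coherence machinery to make a writer's store eventually visible to a reader running on another core:

#include <atomic>
#include <cstdio>
#include <thread>

// A writer core publishes a value; a reader core spins until the write
// propagates to it. The default seq_cst ordering of std::atomic gives a
// single, globally agreed order of writes to this location.
std::atomic<int> shared_value{0};   // illustrative name

void writer() {
    shared_value.store(42);         // write propagation: becomes visible to all cores
}

void reader() {
    int v = 0;
    while ((v = shared_value.load()) == 0) {
        // spin until the writer's store reaches this core's cache
    }
    std::printf("reader observed %d\n", v);
}

int main() {
    std::thread t1(writer), t2(reader);
    t1.join();
    t2.join();
}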
L1, L2, and L3 Cache Overview
Processor caches are organized in a hierarchy to balance speed and size:
L1 Cache: The smallest and fastest cache, located closest to each CPU core. It is typically split into an instruction cache (L1i) and a data cache (L1d), and is private to each core.
L2 Cache: Larger and slightly slower than L1. Depending on the processor architecture, it may be private to a core or shared between a few cores. It may reside on-chip or on a separate die connected via a high-speed bus.
L3 Cache: The largest and slowest of the three, typically shared among all CPU cores. It serves as the Last Level Cache (LLC) before accessing main memory.
Maintaining Coherence in Write-Back Caches
Most modern CPUs use write-back caches, where data is written to main memory only when a cache line is evicted. This can lead to situations where multiple cores hold stale copies of the same data.
To maintain coherence, the system uses protocols like MESI (Modified, Exclusive, Shared, Invalid) to track the state of cache lines across cores. For example, when one core writes to a cache line, other cores’ caches may mark that line as invalid. This ensures that future reads will fetch the updated data from memory or the writing core's cache.
Consider a scenario where Core 0 updates a cache line. It notifies other cores that the data is invalid, prompting them to either discard or update their own copies. This coordination maintains a coherent and consistent view of memory across all cores.
Cache Coherence Protocols
Snooping Protocols
In snooping-based protocols, all caches monitor a shared communication bus. When a cache writes to a memory location, it broadcasts the change to all other caches. These caches then either update or invalidate their copies, ensuring coherence across the system.
Directory-Based Protocols
Directory-based protocols are commonly used in large-scale multiprocessor systems. A centralized or distributed directory keeps track of which caches hold copies of each memory block. The directory maintains metadata about each block's state—such as whether it's uncached, shared, or modified—and coordinates access to maintain consistency.
Cache states:
• Uncached: No processor holds a copy.
• Shared: Multiple caches hold read-only copies.
• Exclusive/Modified: A single cache holds the only copy and has write permission.
Read Miss Example:
Processor P1 wants to read block A. The directory sees that A is uncached. Memory sends the block to P1 and updates the state to Shared, noting that P1 now holds a copy.
Write Miss Example:
Processor P2 wants to write to block A. The directory sees that A is in a Shared state and currently held by P1. It sends invalidation messages to all sharers (e.g., P1). After receiving acknowledgments, it grants P2 exclusive (Modified) access. P2 can now update the block.
Subsequent Read by Another Processor:
If P1 later attempts to read block A again, it checks its cache and sees the block is marked invalid. This triggers a read miss.
Read Request Handling:
P1 sends a read request for block A to the directory. The directory examines the block's current state:
• If the block is Uncached: The directory fetches it from main memory, sends it to P1, and updates the state to Shared.
• If the block is held Modified in another cache (e.g., P2): The directory sends a downgrade or fetch request to P2. P2 flushes the modified data to memory or forwards it directly to P1. The directory then updates the sharers list to include P1.
P1 receives the block and stores it in its cache, typically in Shared state (or Exclusive if it's the only owner).
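The directory's bookkeeping can be sketched as a small state machine. The toy model below is a simplification (one block, no real interconnect messages, invented names such as DirectoryEntry), but it walks through exactly the read-miss, write-miss, and subsequent-read sequence described above:

#include <cstdio>
#include <set>

// Toy directory entry for a single memory block. State names follow the
// text above: Uncached, Shared, Modified. Only the transitions used in the
// example sequence are modeled.
enum class State { Uncached, Shared, Modified };

struct DirectoryEntry {
    State state = State::Uncached;
    std::set<int> sharers;   // processor ids holding a copy
    int owner = -1;          // valid only in Modified state

    void handleReadMiss(int p) {
        if (state == State::Modified) {
            // Owner flushes or forwards the dirty data and is downgraded.
            std::printf("P%d flushes/forwards block, downgraded to Shared\n", owner);
            sharers.insert(owner);
            owner = -1;
        }
        sharers.insert(p);
        state = State::Shared;
        std::printf("P%d gets block in Shared state\n", p);
    }

    void handleWriteMiss(int p) {
        for (int s : sharers)
            if (s != p) std::printf("invalidate copy in P%d\n", s);
        sharers.clear();
        owner = p;
        state = State::Modified;
        std::printf("P%d gets exclusive (Modified) access\n", p);
    }
};

int main() {
    DirectoryEntry a;
    a.handleReadMiss(1);   // P1 read miss: Uncached -> Shared
    a.handleWriteMiss(2);  // P2 write miss: invalidate P1, Modified by P2
    a.handleReadMiss(1);   // P1 reads again: P2 flushes, block becomes Shared
}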
Cache Lines
Modern CPUs use multiple levels of cache (L1, L2, and L3) to bridge the speed gap between the processor and main memory. These caches do not store individual bytes or words directly. Instead, they operate on fixed-size blocks of memory called cache lines, which are typically 64 bytes in size on most modern architectures.
When the CPU accesses a piece of data in memory, the entire cache line containing that data is loaded into the cache. For example, when reading the first element of a vector of integers, the cache fetches not just that element but also the surrounding elements that fit into the same 64-byte block. This behavior takes advantage of spatial locality—the idea that programs often access data stored close together in memory. As a result, subsequent accesses to nearby elements are much faster because they are already present in the cache.
However, if the required data is not found in the cache—a situation known as a cache miss—the CPU must fetch it from main memory. This is significantly slower, often by an order of magnitude or more compared to cache access. By writing programs that make efficient use of spatial locality (e.g., processing contiguous arrays rather than scattered memory locations), developers can greatly improve cache hit rates and overall performance.
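A quick way to feel this difference is to sum an array contiguously versus with a stride of one element per cache line. In the hedged sketch below (sizes and names are assumptions), the strided loop touches 16× fewer elements yet often runs for a comparable amount of time, because nearly every access misses in cache:

#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    constexpr std::size_t N = 1 << 24;    // ~16M ints (assumed size)
    constexpr std::size_t kStride = 16;   // 16 ints * 4 bytes = one 64-byte line (assuming 4-byte int)
    std::vector<int> data(N, 1);

    auto time_sum = [&](std::size_t stride) {
        auto start = std::chrono::steady_clock::now();
        long long sum = 0;
        // With stride 1, every loaded cache line is fully used; with stride 16,
        // each access pulls in a fresh 64-byte line but uses only 4 bytes of it.
        for (std::size_t i = 0; i < N; i += stride)
            sum += data[i];
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                      std::chrono::steady_clock::now() - start).count();
        std::printf("stride %zu: sum=%lld, %lld ms\n", stride, sum, (long long)ms);
    };

    time_sum(1);        // contiguous: good spatial locality
    time_sum(kStride);  // strided: one element per cache line
}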
Cache Ping-Pong
Cache ping-pong is a performance bottleneck in multiprocessor systems caused by the frequent invalidation and transfer of cache lines between cores. This typically occurs when multiple processors repeatedly write to the same shared memory location.
When two or more processors take turns modifying a shared variable, the associated cache line continuously “bounces” between their caches. Each processor must first obtain exclusive access to the line, which triggers invalidations in the other caches—leading to high latency due to constant coherence traffic.
This phenomenon is common in systems that implement cache coherence protocols like MESI or directory-based protocols. The repeated ownership handoff causes delays and can significantly degrade performance in multi-threaded applications.
Example:
Consider two processors, P1 and P2, accessing a shared variable x (initially set to 0) and alternately incrementing it:
• P1 reads x and increments it → the cache line ends up in P1's cache in the Modified state.
• P2 wants to increment x → requires exclusive access.
• The cache coherence protocol invalidates P1's copy and transfers the line to P2.
• P2 writes to x.
• Next, P1 wants to increment again → the process repeats in reverse.
This back-and-forth movement of the cache line is the essence of cache ping-pong. Each write forces the other processor to relinquish access, causing repeated invalidations and transfers.
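A minimal way to reproduce the effect (iteration counts are illustrative) is to have two threads hammer the same std::atomic counter; every increment requires exclusive ownership of the line, so it bounces on each operation:

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<long> shared_counter{0};   // the line holding this value ping-pongs

void hammer(int iterations) {
    for (int i = 0; i < iterations; ++i)
        shared_counter.fetch_add(1);   // each RMW needs exclusive (Modified) access
}

int main() {
    constexpr int kIters = 10'000'000;  // assumed workload size
    std::thread t1(hammer, kIters), t2(hammer, kIters);
    t1.join();
    t2.join();
    std::printf("counter = %ld\n", shared_counter.load());
    // Compare against a version where each thread increments a private local
    // counter and the results are combined at the end: it is typically many
    // times faster because no cache line bounces between cores.
}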
How to Prevent Cache Ping-Pong:
• Avoid sharing writable data between threads when possible.
• Use thread-local storage to keep data private to each thread.
• Apply padding or alignment to separate frequently updated variables into different cache lines (also known as false sharing mitigation).
• Minimize the use of spinlocks or busy-waiting on shared variables.
• Be cautious with atomic operations—they can also lead to ping-pong effects if used on shared data.
Cache Alignment
Cache alignment refers to organizing data in memory so that its starting address aligns with the boundaries of a cache line (commonly 64 bytes on modern CPUs). When data is properly aligned, the CPU can fetch and operate on it more efficiently, since the data fits neatly within a single cache line rather than being split across two.
In C++, you can explicitly enforce alignment using the alignas(64) specifier. For example, aligning a struct on a 64-byte boundary ensures that its fields start exactly at the beginning of a cache line. This avoids situations where a single object spans multiple cache lines, which would otherwise require multiple memory fetches for a single access.
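As a concrete sketch (assuming a 64-byte line; the struct name is hypothetical), a per-thread statistics block can be forced onto its own cache line and the layout verified at compile time:

#include <cstddef>

// Assumed 64-byte cache line, typical of modern x86-64 parts.
constexpr std::size_t kCacheLine = 64;

struct alignas(kCacheLine) PerCoreStats {   // hypothetical name
    long hits   = 0;
    long misses = 0;
    // alignas rounds sizeof(PerCoreStats) up to 64 bytes, so consecutive
    // array elements land on different cache lines.
};

static_assert(alignof(PerCoreStats) == kCacheLine, "expected cache-line alignment");
static_assert(sizeof(PerCoreStats) % kCacheLine == 0, "expected whole-line size");

PerCoreStats stats[4];   // one slot per core/thread, no line sharing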
Proper cache alignment not only improves cache hit rates but also reduces performance issues such as false sharing in multithreaded programs. False sharing occurs when two threads operate on different variables that happen to reside on the same cache line, causing unnecessary cache invalidations. Aligning data structures to cache line boundaries helps mitigate this problem by keeping frequently accessed variables isolated.
False Sharing
False sharing is a performance-degrading phenomenon in multithreaded programs. It occurs when multiple threads modify different variables that reside on the same cache line. Although the threads operate on separate variables, they inadvertently invalidate each other's cache lines, leading to unnecessary cache coherence traffic.
Modern CPUs transfer data between memory and cache in units called cache lines, typically 64 bytes in size. If two independent variables are close enough in memory to occupy the same cache line, modifying one causes the entire line to be marked invalid in other cores—even if those cores are only accessing the other variable. This results in frequent invalidation and reloading of cache lines, despite the absence of actual data sharing—hence the term "false" sharing.
Example: Consider two threads updating separate counters stored in adjacent memory locations. If these counters share a cache line, each update by one thread invalidates the cache line for the other, causing performance degradation due to constant cache line transfers.
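That scenario can be sketched directly: the version below places the two counters adjacently so they share a line, then pads them onto separate lines with alignas. Struct names and iteration counts are illustrative, and the measured gap varies by machine:

#include <atomic>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <thread>

constexpr int kIters = 50'000'000;            // assumed workload
constexpr std::size_t kCacheLine = 64;        // assumed line size

struct Adjacent {                              // both counters share one cache line
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

struct Padded {                                // each counter gets its own cache line
    alignas(kCacheLine) std::atomic<long> a{0};
    alignas(kCacheLine) std::atomic<long> b{0};
};

template <typename Counters>
void run(const char* label) {
    Counters c;
    auto start = std::chrono::steady_clock::now();
    std::thread t1([&] { for (int i = 0; i < kIters; ++i) c.a.fetch_add(1, std::memory_order_relaxed); });
    std::thread t2([&] { for (int i = 0; i < kIters; ++i) c.b.fetch_add(1, std::memory_order_relaxed); });
    t1.join(); t2.join();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                  std::chrono::steady_clock::now() - start).count();
    std::printf("%s: %lld ms\n", label, (long long)ms);
}

int main() {
    run<Adjacent>("adjacent (false sharing)");   // typically much slower
    run<Padded>("padded (separate lines)");
}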
How to Avoid False Sharing
• Padding: Introduce unused fields or use compiler directives to separate variables onto different cache lines. This ensures that each thread operates on data in its own cache line.
• Aligning Data: Use alignment attributes or language-specific features to align variables to cache line boundaries, preventing them from sharing cache lines inadvertently.
• Thread-Local Storage: Utilize thread-local variables when sharing is unnecessary. This confines data to individual threads, eliminating the possibility of false sharing.
• Data Structure Reorganization: Rearrange data structures so that frequently modified variables accessed by different threads are placed in separate cache lines.
By implementing these strategies, developers can minimize the adverse effects of false sharing, leading to improved performance and scalability in multithreaded applications.
Coherence vs Consistency
In a multiprocessor or multicore system, memory behavior can be understood through two important concepts: cache coherence and memory consistency. While both are related to how memory operations are observed across threads or cores, they refer to different guarantees.
Cache Coherence
Cache coherence ensures that all processing units (e.g., CPU cores) agree on the order of reads and writes to the same memory location. If one core updates a memory location, the new value must eventually be reflected in the caches of other cores in a consistent order.
• Applies to operations on the same address.
• Guarantees that writes to a particular address are seen in the same order by all processors.
• For example, if Core A writes value 1 to address X and then writes value 2, all other cores must see these writes in the same order: 1 followed by 2.
Memory Consistency
Memory consistency, on the other hand, defines the rules for the visibility and ordering of memory operations (reads and writes) across different memory locations. It governs when a write by one thread becomes visible to other threads relative to other memory operations.
• Applies to operations on different addresses.
• Defines the allowed order in which memory operations may appear to execute across threads.
• Different consistency models exist (e.g., Sequential Consistency, Total Store Order), each with different visibility guarantees.
Key Difference
Coherence ensures agreement on the value and order of operations to a single memory location, while consistency determines how the overall memory system behaves with respect to the order of operations on multiple locations. Coherence is a prerequisite for consistency, but consistency models go further in defining how memory operations can be interleaved or reordered.
Out-of-Order Execution
Modern CPUs use out-of-order (OoO) execution to maximize performance by keeping their internal pipelines and execution units as busy as possible. Instead of executing instructions strictly in the order they appear in a program, the CPU reorders them at runtime. Instructions are executed as soon as their operands become available, rather than waiting for earlier instructions to finish.
At the heart of this process is instruction scheduling. The CPU's scheduler looks at a window of pending instructions and determines the best order to execute them. By analyzing data dependencies and available resources, the scheduler allows independent instructions to run in parallel, improving throughput and reducing idle cycles.
To support this, instructions that are waiting for their operands are held in reservation stations. Once the required data becomes available, they are dispatched to execution units. After execution, the results are temporarily stored in a structure called the reorder buffer. The reorder buffer ensures that, even though instructions may have been executed out of order, their results are committed back to the program state in the original sequence. This preserves the correct program semantics.
Out-of-order execution is one of the key innovations that allows modern processors to exploit instruction-level parallelism. By reordering work while maintaining logical correctness, CPUs achieve significantly higher utilization and performance compared to a purely in-order approach.
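Software can make this parallelism easier to find. In the sketch below (illustrative only), splitting a reduction into several independent accumulators replaces one long dependency chain with four that the out-of-order scheduler can execute concurrently:

#include <cstddef>
#include <vector>

// Single dependency chain: each addition depends on the previous one, so
// the out-of-order core cannot overlap them.
long long sum_single(const std::vector<int>& v) {
    long long s = 0;
    for (std::size_t i = 0; i < v.size(); ++i)
        s += v[i];
    return s;
}

// Four independent chains: the scheduler can keep several execution units
// busy because s0..s3 have no dependencies on each other.
long long sum_unrolled(const std::vector<int>& v) {
    long long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= v.size(); i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < v.size(); ++i) s0 += v[i];
    return s0 + s1 + s2 + s3;
}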
Write-to-Read Relaxed Consistency Models
These models allow reads to be reordered before earlier writes (write → read reordering). The rationale is that writes typically take longer to complete than reads, and in many cases, it's more performant to allow reads to proceed while a write is still pending.
Total Store Order (TSO)
Under TSO, writes appear in the same order to all other processors — this is called the write atomicity property. For example, if Processor A writes x = 2, all other processors will see x = 2 at the same logical time. No processor will see x = 2 while another sees the old value. This ensures an all-or-nothing visibility model: a write is either fully visible to all processors or to none.
TSO guarantees that writes from a single processor are observed in the same order by all other processors. If Processor A writes to X and then Y, then all other processors will observe the write to X before the write to Y.
However, a read may be reordered ahead of an earlier write to a different address. As a result, a processor might perform a read and receive a stale value (from its own cache) rather than the latest value written by another processor. Critically, though, a processor will never observe a value from a write that logically occurs after the read.
In short:
• Writes are seen in order by all processors (preserved write order).
• Reads may be reordered ahead of earlier writes and may return stale values.
• Write atomicity ensures that a write is visible to either all or none.
• Processors may not immediately see the most recent writes made by others unless synchronization is enforced.
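The permitted write → read reordering is usually demonstrated with a Dekker-style litmus test. In the hedged C++ sketch below, both threads may read 0 because each load is allowed to overtake the thread's own pending store; inserting a seq_cst fence (or using seq_cst operations) rules that outcome out. Whether the reordering actually shows up on a given run depends on the hardware and compiler:

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1 = 0, r2 = 0;

void thread1() {
    x.store(1, std::memory_order_relaxed);
    // Without a fence here, the load of y may be satisfied before the store
    // to x drains from the store buffer (W->R reordering, allowed by TSO).
    r1 = y.load(std::memory_order_relaxed);
}

void thread2() {
    y.store(1, std::memory_order_relaxed);
    r2 = x.load(std::memory_order_relaxed);
}

int main() {
    std::thread t1(thread1), t2(thread2);
    t1.join(); t2.join();
    // Under a strict interleaving (sequential consistency) r1 == 0 && r2 == 0
    // is impossible; with relaxed ordering on real hardware it can occur.
    // A std::atomic_thread_fence(std::memory_order_seq_cst) between the store
    // and the load in each thread forbids it.
    std::printf("r1=%d r2=%d\n", r1, r2);
}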
Relaxed Consistency (Processor Consistency)
In more relaxed consistency models (e.g., those weaker than TSO), a processor may observe the result of a write from another processor before that write is visible to all processors. This breaks write atomicity — some processors may see the new value while others still see the old one.
The system guarantees only eventual visibility — all processors will eventually see the write, but not necessarily at the same time.
These relaxed models allow for even more performance optimizations but require careful synchronization (e.g., memory barriers or fences) to ensure correctness.
Write-to-Write Relaxed Consistency Models
These models allow writes to bypass earlier writes to different memory locations via a write buffer. This enables higher performance by relaxing the strict ordering of writes when they target different addresses.
Partial Store Ordering (PSO)
PSO relaxes both write → read (W→R) and write → write (W→W) ordering. Like Total Store Order (TSO), it allows reads to be reordered before writes, but goes further by allowing later writes to different addresses to be committed before earlier ones.
Despite relaxing these orderings, PSO still enforces write atomicity. This means that a write, once visible, is seen by all processors simultaneously. There are no scenarios where one processor observes the new value of a variable while another still sees the old value.
In practice, this allows hardware to optimize memory access by writing to independent memory locations out of order, while still preserving correctness through synchronization mechanisms like memory fences when needed.
• Relaxes both W→R and W→W ordering.
• Allows faster execution by reordering independent writes.
• Maintains write atomicity across processors.
• Requires memory fences to enforce ordering when needed.
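For example, publishing data through a flag is only safe under a W→W-relaxed model if a barrier keeps the data write ahead of the flag write. A minimal C++ sketch (names are illustrative) using release/acquire ordering to provide that fence:

#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                      // ordinary data being published
std::atomic<bool> ready{false};       // flag observed by the consumer

void producer() {
    payload = 42;
    // Release ordering acts as the memory fence mentioned above: the write
    // to payload may not be reordered after the write to ready.
    ready.store(true, std::memory_order_release);
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) {
        // spin until the flag becomes visible
    }
    // Acquire pairs with the release above, so the payload write is visible.
    assert(payload == 42);
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join(); t2.join();
}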
Branch Prediction
Branch prediction is a technique used by modern CPUs to anticipate the outcome of conditional branches, such as if statements or loop exits, before the actual result is known. By making an educated guess, the CPU can keep its instruction pipeline full and avoid the stalls that would occur if it had to wait for the branch decision.
To make these predictions, CPUs rely on algorithms that analyze past branch behavior. By observing whether a branch was previously taken or not, the CPU learns patterns and applies them to future decisions. Over time, this history-based approach allows the processor to predict outcomes with a high degree of accuracy.
Once a prediction is made, the CPU engages in speculative execution. It begins executing instructions along the predicted path before the actual branch result is available. If the guess is correct, execution continues seamlessly, and performance is improved because the pipeline never went idle. However, if the prediction is wrong, the speculative work must be discarded, and the pipeline is flushed and reloaded with the correct instructions. This flushing introduces delays and can hurt performance if mispredictions happen frequently.
Although modern branch prediction is highly effective, it struggles with irregular control flow or conditions that depend on unpredictable data. Programmers can sometimes help by providing hints to the compiler, such as marking certain branches as likely or unlikely, or using built-in functions like __builtin_expect to guide optimization. When applied well, branch prediction significantly reduces idle cycles and maximizes instruction throughput, making it one of the key techniques that enables the high performance of modern CPUs.
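As a small illustration, an error path that is almost never taken can be annotated with __builtin_expect (GCC/Clang) or the C++20 [[unlikely]] attribute so the compiler lays out the hot path favorably. The functions below are a hedged sketch; whether the hint helps depends on the compiler and the real branch behavior:

#include <cstdio>

// GCC/Clang builtin: hint that the condition is unlikely to be true.
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

int process(int value) {
    int err = (value < 0);            // rare error condition (assumed)
    if (UNLIKELY(err)) {
        std::fprintf(stderr, "bad value\n");
        return -1;
    }
    return value * 2;                 // hot, well-predicted path
}

// C++20 equivalent using the [[unlikely]] attribute.
int process20(int value) {
    if (value < 0) [[unlikely]] {
        std::fprintf(stderr, "bad value\n");
        return -1;
    }
    return value * 2;
}

int main() {
    std::printf("%d\n", process(21));
    std::printf("%d\n", process20(21));
}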