Some architectures do support "cache pinning" -- a way to instruct the CPU to reserve a portion of the cache for a specific memory region (in practice, a cache line or page). To the best of my knowledge, neither Intel nor AMD processors implement such a feature.
You can, however, instruct the CPU to load something into cache before you use it, via a prefetch instruction. (It remains subject to eviction later, as usual.) In GCC, this is done with __builtin_prefetch() [1].
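As a minimal sketch of what that looks like (the loop and PREFETCH_DISTANCE here are illustrative -- the right distance depends on your memory latency and loop body cost):

    #include <stddef.h>

    /* Hypothetical tuning parameter: how many iterations ahead to fetch. */
    #define PREFETCH_DISTANCE 8

    long sum(const long *data, size_t n)
    {
        long total = 0;
        for (size_t i = 0; i < n; i++) {
            /* Ask for data[i + PREFETCH_DISTANCE] while we work on data[i].
             * Args: address, rw (0 = read), locality (3 = keep in all levels). */
            if (i + PREFETCH_DISTANCE < n)
                __builtin_prefetch(&data[i + PREFETCH_DISTANCE], 0, 3);
            total += data[i];
        }
        return total;
    }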
But -- if you add prefetches blindly, you will almost certainly slow down whatever it is you're trying to fix. You need to analyze your memory access patterns and the generated assembly, using tools like perf and llvm-mca, and understand the cache line usage, cache access latency, pipeline stalls, cross-process and cross-core contention for the cache line, register pressure, etc. of the code in question, before you can tell whether a prefetch is appropriate and, if so, where to place it. Notably, it's a challenge to get a compiler to emit the prefetch at a useful point in the assembly.
What evidence are you working from that access latency to a single specific global variable is a performance bottleneck for your application?
To share an anecdote -- the only time I've been in a similar situation was working on a 10 Gbps line-rate network processor. Many functions had CPU budgets of fewer than 100 cycles. We found (using perf) that these cycles were often eaten up stalling for various globals to be fetched from DRAM, even though the globals were accessed with every packet. Notably, prefetching didn't help: we could not prefetch early enough to hide the entire stall, and issuing the prefetch itself ate precious CPU cycles. The true culprit was that the small handful of globals in question were each allocated at the start of a hugepage boundary -- and therefore all mapped to the same cache set. This particular CPU had low cache associativity (L1 was 2-way, IIRC), so the globals kept bumping each other out of cache. The solution in our case was to manually place the globals at different cache line offsets, so they mapped to different sets.
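A sketch of that kind of fix, assuming 64-byte cache lines (the variable names are made up): give each hot global its own cache line, at distinct line offsets, so they index into different cache sets instead of all colliding at offset zero.

    #define CACHE_LINE 64

    /* Each global gets its own cache line. The compiler typically lays
     * these out at consecutive line offsets within the section, so they
     * fall into different cache sets rather than contending for one. */
    static long g_rx_packets __attribute__((aligned(CACHE_LINE)));
    static long g_tx_packets __attribute__((aligned(CACHE_LINE)));
    static long g_drop_count __attribute__((aligned(CACHE_LINE)));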
You might then ask how to keep the data in cache once it's there, but I doubt you can.
Your best bet is to keep the working set small during the steps you care about. It's also worth keeping in mind that caches work on "lines" -- small, fixed-size blocks of data, typically 64 bytes. So you can give yourself a small edge if you keep all the crap you're going to use in one contiguous block of memory.
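For instance (a sketch; the struct and field names are hypothetical), rather than scattering hot-path state across separate globals, pack it into one struct so it spans as few cache lines as possible:

    /* Everything the hot loop touches, packed into one block.
     * With 64-byte lines, this whole struct fits in a single line. */
    struct hot_state {
        long counter;
        long last_seen;
        unsigned flags;
    } __attribute__((aligned(64)));

    static struct hot_state g_hot;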
See also the book The Art of Multiprocessor Programming by Herlihy and Shavit.