HACKER Q&A
📣 kloch

What server or CPU/motherboard combo offers maximum memory bandwidth?


A side project I'm working on is maxing out memory bandwidth well before CPU. It reads data from a 43GB binary data file (cached in RAM by the OS), does some processing, and generates an output file. On an AWS r6gd.16xlarge instance (64 CPUs, 512GB RAM) it takes 14 seconds to run with 64 threads. I can reduce the number of threads to 32 before it starts taking longer, so it seems to be memory-bandwidth limited. The application is interactive, so shorter run times are highly desirable. I want to buy or build a server to maximize performance. What are my options?

For the sake of discussion let's assume the program is optimally written (Ha!). Porting to GPU is also out of the question because some settings require 300GB of RAM or more during processing.


👤 toast0 Accepted Answer ✓
You should look at the Netflix engineering blog and USENIX papers about their CDN appliances. In the latest instance types their main bottleneck is physical memory bandwidth. I'm fairly sure they mention profiling tools that would help you verify your assumptions (assuming they work in your environment), as well as hardware details if you want to build a similar system.

I think Epyc 7313 processors are likely a good fit if you are RAM-bandwidth limited, though. Zen 3, lowest core count, but still has all the RAM channels. There's a lower-cost single-socket version too.

Can you tell us more about the data dependencies between threads? If there's contention on the shared memory, that's a more likely bottleneck than raw bandwidth.
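Before reaching for the heavier profilers, a crude single-threaded copy test can give a ballpark for what one core can pull from RAM on a given box. This is only a sketch: the function name, buffer size, and repeat count are made up for illustration, it measures one thread rather than the whole socket, and a STREAM-style benchmark in C would be more trustworthy.

```python
import time


def copy_bandwidth_gb_s(size_mb=256, repeats=5):
    """Estimate single-thread copy bandwidth in GB/s (hypothetical helper)."""
    n = size_mb * 1024 * 1024
    src = bytearray(n)
    dst = bytearray(n)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src  # same-length slice assignment: a bulk memory copy
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    # Each copy reads n bytes and writes n bytes, so count 2 * n bytes moved.
    return 2 * n / best / 1e9


if __name__ == "__main__":
    print(f"approx. single-thread copy bandwidth: {copy_bandwidth_gb_s():.1f} GB/s")
```

If the number per thread times the thread count lands far above what the application actually achieves, contention or access pattern is a more likely suspect than raw bandwidth.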


👤 wmf
Intel Ice Lake-SP, AMD Epyc or Threadripper Pro, or Ampere Altra have the highest memory bandwidth but in theory they're the same as the Graviton2 you're already using. If you can handle NUMA, maybe try a 2-socket m6i.32xlarge.
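To make the "same in theory" point concrete, here is a rough sketch of the usual peak-bandwidth arithmetic. It assumes 8 channels of DDR4-3200 per socket with a 64-bit (8-byte) data bus per channel, which is my understanding of these platforms rather than something stated in this thread; check the spec sheet for the exact part. The function name is hypothetical.

```python
def peak_bandwidth_gb_s(channels, mega_transfers_per_s, bus_bytes=8):
    """Theoretical peak DRAM bandwidth in GB/s for one socket.

    channels: number of memory channels (assumed 8 here)
    mega_transfers_per_s: DDR data rate in MT/s (assumed 3200 here)
    bus_bytes: bytes moved per transfer per channel (64-bit bus = 8 bytes)
    """
    return channels * mega_transfers_per_s * 1e6 * bus_bytes / 1e9


if __name__ == "__main__":
    one_socket = peak_bandwidth_gb_s(8, 3200)
    print(f"one socket:  {one_socket:.1f} GB/s")
    # A 2-socket box doubles the theoretical figure, but only if each
    # thread mostly touches memory local to its own socket (NUMA).
    print(f"two sockets: {2 * one_socket:.1f} GB/s")
```

Sustained bandwidth in practice is lower than this theoretical figure, so treat it as a ceiling, not an expectation.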

👤 mattbillenstein
Are you sure 64 -> 32 threads isn't some sort of locking?

Sequential read memory bandwidth on modern hardware is tens of GB per second as I recall, so I think there's probably some optimization you can do on the program?
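A quick back-of-the-envelope check using the numbers from the question (43GB file, 14 seconds), assuming the file is read exactly once per run. The post doesn't say how many passes are made or how much intermediate data gets written, so this is a lower bound on memory traffic, not a measurement. The function name is hypothetical.

```python
def effective_throughput_gb_s(data_gb, seconds):
    """Average throughput if data_gb is streamed once in the given time."""
    return data_gb / seconds


if __name__ == "__main__":
    # 43 GB and 14 s come from the question; "read once" is an assumption.
    rate = effective_throughput_gb_s(43, 14)
    print(f"effective input rate: {rate:.2f} GB/s")
```

That works out to roughly 3 GB/s of input, so unless the program makes many passes or has a very scattered access pattern, it is well short of what sequential reads should manage.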