For the sake of discussion, let's assume the program is optimally written (ha!). Porting to GPU is also out of the question, because some settings require 300 GB of RAM or more during processing.
I think EPYC 7313 processors are likely a good fit if you are RAM-bandwidth limited, though. They're Zen 3 with a low core count (16 cores), but they still have all eight memory channels. There's a lower-cost single-socket version too.
Can you tell us more about the data dependencies between threads? If there's contention on the shared memory, that's a more likely bottleneck than raw bandwidth.
Sequential-read memory bandwidth on modern hardware is tens of GB per second, as I recall, so I suspect there's still some optimization you can do in the program.
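To put rough numbers on the bandwidth question, here is a back-of-the-envelope sketch in Python. The figures are my assumptions, not measurements from the original poster's machine: DDR4-3200 with a 64-bit (8-byte) bus per channel, eight channels per socket, and a 300 GB working set. The function names are made up for illustration. Sustained bandwidth in practice is noticeably below these theoretical peaks.

```python
def peak_bandwidth_gb_s(channels: int, mega_transfers_per_s: int, bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (decimal gigabytes).

    Assumes every channel transfers bus_bytes per transfer with no overhead,
    which real systems never quite reach.
    """
    return channels * mega_transfers_per_s * 1e6 * bus_bytes / 1e9


def seconds_per_pass(working_set_gb: float, bandwidth_gb_s: float) -> float:
    """Time to stream the whole working set through memory once."""
    return working_set_gb / bandwidth_gb_s


# Assumed configuration: DDR4-3200, one channel vs. eight channels.
single_channel = peak_bandwidth_gb_s(channels=1, mega_transfers_per_s=3200)
eight_channels = peak_bandwidth_gb_s(channels=8, mega_transfers_per_s=3200)

# Assumed working set: the 300 GB mentioned earlier in the thread.
working_set_gb = 300.0

print(f"1 channel : {single_channel:.1f} GB/s, "
      f"{seconds_per_pass(working_set_gb, single_channel):.1f} s per pass")
print(f"8 channels: {eight_channels:.1f} GB/s, "
      f"{seconds_per_pass(working_set_gb, eight_channels):.1f} s per pass")
```

If the program takes far longer than this per pass over its data, raw sequential bandwidth is probably not the limit, and access patterns or contention between threads are the more likely places to look.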