Throughput over latency. The problem is embarrassingly parallel and can be divided to suit any number of cores. Cores could be register or stack machines. Each core needs access to >100 MB. The algorithm does not depend on the order of communication between cores, but inter-core communication is required (shared memory or message passing is fine). The workload is branch-rich, so SIMD/SIMT is not possible. The workload is mostly integer instructions. The workload is memory bound, not compute bound.
Can you think of any off-the-shelf or rentable hardware better suited to this workload than many-core ARM chips, which have some vector instructions I won't use?
[1] https://www.igorslab.de/wp-content/uploads/2021/12/alder_lak...
Notes: GPUs tend to have huge vector widths. Xeon Phi has huge vectors. GreenArrays (https://www.greenarraychips.com/) cores are too small, have too little RAM, and are not available in large machines. Adapteva chips are not taped out on a competitive node. The Sunway SW26010 has huge vectors. Graphcore has huge vectors.
Just because there are many branches doesn't mean SIMD/SIMT is impossible or too slow.
I recently needed to solve a problem where I wanted to traverse a tree, computing something for every node and deciding whether to descend into or skip nodes based on the results of those computations. Obviously branch-rich, but modern GPUs weren't too bad at that use case. My compute shader has a main loop like the following HLSL; note it's an iterative state machine rather than recursion, so divergent threads just take different paths through the same loop.
ByteAddressBuffer tree : register( t0 );

static const uint UINT_MAX = 0xFFFFFFFFu; // sentinel meaning "no node"

for( uint state = 0; state != UINT_MAX; )
{
    // Each node starts with two byte offsets: [ child, next ]
    const uint2 nodeHeader = tree.Load2( state );
    if( shouldDescend( state + 8 ) ) // payload follows the 8-byte header
    {
        // Need to go deeper, compute something for that node
        state = nodeHeader.x;
    }
    else
    {
        // Skipping the child, compute something else here
        state = nodeHeader.y;
    }
}
> The workload is memory bound, not compute bound.
I don't think so? How many gigabytes per second per core are you processing? For scale, a 64-core EPYC socket with ~200 GB/s of DRAM bandwidth leaves only ~3 GB/s per core.
Edit: If for some reason you can talk about this problem privately with random SIMD programmers online but cannot post about it publicly, please add your contact information to your profile or your post.
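If you want a rough number, a sketch like this can help (C++, single-threaded, everything here is my own illustration): it sums a buffer much larger than the last-level cache and reports GB/s. Pin one instance per core (e.g. with taskset) and compare the total against the socket's DRAM bandwidth.

#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Rough streaming-read benchmark: sum a 1 GiB buffer and report GB/s.
int main()
{
    const size_t bytes = size_t( 1 ) << 30; // 1 GiB, well past any cache
    std::vector<uint64_t> buf( bytes / sizeof( uint64_t ), 1 );

    uint64_t sum = 0;
    const auto t0 = std::chrono::steady_clock::now();
    for( uint64_t v : buf )
        sum += v;
    const auto t1 = std::chrono::steady_clock::now();

    const double seconds = std::chrono::duration<double>( t1 - t0 ).count();
    std::printf( "%.2f GB/s (checksum %llu)\n", bytes / seconds / 1e9,
        (unsigned long long)sum );
}

If each core's measured demand is well below its share, the workload is bound by miss latency rather than bandwidth, and big-cache parts will help more than wide memory buses.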
https://en.m.wikipedia.org/wiki/Branch_predictor
You can probably use SIMD just fine.
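To make that concrete, here's a minimal sketch (C++ with SSE2 intrinsics; the function names are my own) of how a data-dependent branch becomes a compare-and-blend, which is also what SIMT hardware does with divergent lanes:

#include <cstddef>
#include <cstdint>
#include <emmintrin.h> // SSE2

// Scalar version: one unpredictable branch per element.
void select_scalar( const int32_t* a, const int32_t* b, int32_t* out, size_t n )
{
    for( size_t i = 0; i < n; i++ )
        out[ i ] = ( a[ i ] > b[ i ] ) ? a[ i ] + b[ i ] : a[ i ] - b[ i ];
}

// SIMD version: compute both sides, then blend with a comparison mask.
// No branch at all, so there is nothing for the predictor to miss.
void select_simd( const int32_t* a, const int32_t* b, int32_t* out, size_t n )
{
    for( size_t i = 0; i + 4 <= n; i += 4 )
    {
        const __m128i va = _mm_loadu_si128( (const __m128i*)( a + i ) );
        const __m128i vb = _mm_loadu_si128( (const __m128i*)( b + i ) );
        const __m128i mask = _mm_cmpgt_epi32( va, vb ); // all ones where a > b
        const __m128i sum = _mm_add_epi32( va, vb );
        const __m128i diff = _mm_sub_epi32( va, vb );
        const __m128i res = _mm_or_si128(
            _mm_and_si128( mask, sum ), _mm_andnot_si128( mask, diff ) );
        _mm_storeu_si128( (__m128i*)( out + i ), res );
    }
    // A real version would handle the remaining ( n % 4 ) elements scalar.
}

Both arms execute every iteration, so this wins when the branch is data-dependent and unpredictable; a branch the predictor gets right most of the time can still be faster scalar.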
You can get an EPYC with a shit ton of CPU cache; 100 MB is easy. Faster, higher-IPC cores can beat more cores. Even at 100 MB per core, you can do some work on the code to improve its cache locality so that the CPU prefetcher(s) handle that constraint well. I am guessing C/C++? Are you compiling with -O3 and profile-guided optimization (GCC)?
Worst case scenario, throw it on an FPGA :)
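For reference, a hypothetical sketch of both suggestions (all names are mine): the comment header shows the GCC PGO workflow, and the layout replaces pointer chasing with 32-bit indices into one contiguous pool so hot nodes share cache lines and the prefetcher sees a predictable stream.

// GCC profile-guided optimization workflow, roughly:
//   g++ -O3 -fprofile-generate app.cpp -o app && ./app training-input
//   g++ -O3 -fprofile-use app.cpp -o app

#include <cstdint>
#include <vector>

// Node-pool layout: children are 32-bit indices into one contiguous array
// instead of heap pointers. Nodes visited together can be stored together
// (e.g. in BFS order), keeping traversal within a few cache lines.
struct Node
{
    uint32_t child;   // index of first child, or UINT32_MAX for none
    uint32_t next;    // index of next sibling, or UINT32_MAX for done
    uint64_t payload;
};

static bool wantDescend( const Node& n ) // stand-in for the real predicate
{
    return ( n.payload & 1 ) != 0;
}

uint64_t traverse( const std::vector<Node>& pool )
{
    uint64_t acc = 0;
    for( uint32_t i = 0; i != UINT32_MAX; )
    {
        const Node& n = pool[ i ];
        acc += n.payload; // "compute something for every node"
        i = wantDescend( n ) ? n.child : n.next;
    }
    return acc;
}

This is the same state-machine traversal as the GPU snippet upthread; the locality win comes from how you order the pool, not from the loop itself.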
I think that those vector functional units more than pay for themselves. If you combined them all on that Alder Lake die you linked, you would get _one_ extra core's worth of area. Every recent manycore mesh I know of packs in a vector or tensor unit, because compute circuits are actually pretty small for the punch they pack! You vastly reduce dispatch overhead, which is the main source of slowdown once you saturate your cache hierarchy. Distributing memory access in a manycore is frustratingly slow.