HACKER Q&A
📣 freemint

Where to run embarrassingly parallel, Integer, no SIMD workloads?


Many modern CPUs (e.g. Intel, AMD), GPUs, and TPUs include SIMD instructions. Following the motto "don't pay for what you do not use", I would rather have more cores. Even the ARM-based servers (Graviton, Ampere Altra) have NEON vector instructions, though those smaller vectors seem to be part of what lets them fit more cores (among other trade-offs). Even on consumer chips, vector instructions take up a noticeable amount of die space [1].

Throughput over latency. The problem is embarrassingly parallel and can be divided to suit any number of cores. Cores could be register or stack machines. Cores need access to >100 MB. The algorithm does not depend on the order of communication between cores, but inter-core communication is required (shared memory or message passing is fine). The workload is branch-rich, so SIMD/SIMT is not possible. The workload is mostly integer instructions. The workload is memory-bound, not compute-bound.

Can you think of any off-the-shelf or rentable hardware better suited to this workload than many-core ARM chips, which have some vector instructions I won't use?

[1] https://www.igorslab.de/wp-content/uploads/2021/12/alder_lak...

Notes: GPUs tend to have huge vector widths. Xeon Phi has huge vectors. GreenArrays' (https://www.greenarraychips.com/) cores are too small, have too little RAM, and are not available in large machines. Adapteva chips are not taped out on a competitive node. The Sunway SW26010 has huge vectors. Graphcore has huge vectors.


  👤 Const-me Accepted Answer ✓
> The workload is branch rich so SIMD/SIMT is not possible

Just because there’re many branches doesn’t mean SIMD/SIMT is impossible or too slow.

I recently needed to solve a problem where I wanted to traverse a tree, computing something for every node and deciding whether to descend into or skip nodes based on the results of some computations. Obviously branch-rich, but modern GPUs weren’t too bad at that use case. My compute shader has a main loop like the following HLSL.

    ByteAddressBuffer tree: register( t0 );    // flattened tree; each node starts with an 8-byte header
    static const uint UINT_MAX = 0xFFFFFFFFu;  // HLSL has no <limits>; sentinel meaning "no node"

    // state is the byte offset of the current node within the buffer;
    // shouldDescend() is the per-node predicate (omitted here)
    for( uint state = 0; state != UINT_MAX; )
    {
        const uint2 nodeHeader = tree.Load2( state ); // [ child, next ]
        if( shouldDescend( state + 8 ) )              // payload begins right after the header
        {
            // Need to go deeper, compute something for that node
            state = nodeHeader.x;                     // offset of the first child
        }
        else
        {
            // Skipping the child, compute something else here
            state = nodeHeader.y;                     // offset of the next sibling
        }
    }

👤 theandrewbailey
Processors with vector support are ubiquitous. There's no point in trying to escape them, even if you don't want or use them. There's simply not enough demand for non-vector processors.

👤 anonymoushn
If you explain more about the problem, I can probably write a SIMD implementation. Most "branch-rich integer instruction" algorithms, e.g. running a billion different Miller-Rabin tests, can be made into SIMD algorithms that use almost no branches. This conversion gives you the throughput benefits of not having a bunch of hard-to-predict branches constantly causing pipeline flushes *and* the throughput benefits of SIMD.
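To make the conversion concrete, here is a minimal sketch of the pattern (not the OP's workload, which isn't public; a toy branchy update, the Collatz step, stands in purely for illustration): compute both sides of the branch for eight lanes at once, then select per lane with a mask instead of branching.

    #include <immintrin.h>
    #include <stdint.h>
    #include <stddef.h>

    // Scalar version: one hard-to-predict branch per element.
    void step_scalar( uint32_t *v, size_t n )
    {
        for( size_t i = 0; i < n; i++ )
            v[i] = ( v[i] & 1 ) ? 3 * v[i] + 1 : v[i] / 2;
    }

    // AVX2 version: branchless, 8 elements per iteration (compile with -mavx2).
    void step_avx2( uint32_t *v, size_t n )
    {
        size_t i = 0;
        for( ; i + 8 <= n; i += 8 )
        {
            __m256i x    = _mm256_loadu_si256( (__m256i *)( v + i ) );
            // Lane mask: all-ones where x is odd, computed as 0 - (x & 1).
            __m256i odd  = _mm256_sub_epi32( _mm256_setzero_si256(),
                                             _mm256_and_si256( x, _mm256_set1_epi32( 1 ) ) );
            __m256i half = _mm256_srli_epi32( x, 1 );              // x / 2
            __m256i tri  = _mm256_add_epi32(                       // 3x + 1
                _mm256_add_epi32( x, _mm256_add_epi32( x, x ) ),
                _mm256_set1_epi32( 1 ) );
            // Select 3x+1 for odd lanes, x/2 for even lanes -- no branch at all.
            _mm256_storeu_si256( (__m256i *)( v + i ),
                                 _mm256_blendv_epi8( half, tri, odd ) );
        }
        for( ; i < n; i++ ) // scalar tail for the remainder
            v[i] = ( v[i] & 1 ) ? 3 * v[i] + 1 : v[i] / 2;
    }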

> The workload is memory bound not compute bound.

I don't think so? How many gigabytes per second per core are you processing?

Edit: If for some reason you can discuss this problem privately with random SIMD programmers online but cannot post about it publicly, please add your contact information to your profile or your post.


👤 bick_nyers
If you are worried about branching, then use perf to profile it and see how many branch misses you have. The predictor might be getting 90%+ of branches right, in which case you don't actually have anything to worry about.
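For example, using perf's generic branch events (the binary name is a placeholder):

    perf stat -e branches,branch-misses ./your_workload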

https://en.m.wikipedia.org/wiki/Branch_predictor

You can probably use SIMD just fine.

You can get an EPYC with a shit ton of CPU cache; 100MB is easy. Faster, higher-IPC cores can be better than more cores. Even at 100MB per core, you can do some work on the code to improve cache locality so that the CPU cache prefetcher(s) handle that constraint well. I am guessing C/C++? Are you compiling with -O3 and profile-guided optimization (GCC)?
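If not, a minimal GCC PGO recipe looks like this (single-file build assumed; file and input names are placeholders):

    g++ -O3 -fprofile-generate workload.cpp -o workload   # instrumented build
    ./workload training_input                             # run a representative input; writes .gcda profile data
    g++ -O3 -fprofile-use workload.cpp -o workload        # rebuild guided by the recorded profile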

Worst case scenario, throw it on an FPGA :)


👤 nynx
A GPU is almost certainly what you want. Modern GPUs have some interesting tricks up their sleeves that make them much better at branchy code than they used to be.

👤 yababa_y
This isn’t off the shelf, but you can rent enormous FPGAs from Amazon. You can dispatch ~6800 DSP slices at 500 MHz for a hot ~1.7 T u32op/s. Each FPGA can have 4 DDR4 controllers for you to saturate with memory ops. You can cram ~800 VexRiscv cores in there (possibly more; I’m extrapolating from less-featured FPGA architectures) for a minimum of 220 GIPS, or about 1/7th of a Threadripper 3990X (Dhrystone IPS).
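(Back-of-envelope on that throughput figure: 6800 slices × 0.5 GHz ≈ 3.4 T DSP ops/s, so the ~1.7 T u32op/s estimate presumably budgets about two DSP slices per full 32-bit integer op, since a single DSP slice's multiplier is narrower than 32×32.)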

I think that those vector functional units more than pay for themselves. If you combined them all on that Alder Lake die you linked, you would get _one_ extra core's worth of space. Every recent many-core mesh I know of packs in a vector or tensor unit, because compute circuits are actually pretty small for the punch they pack! They vastly reduce dispatch overhead, which is the main source of slowdown once you saturate your cache hierarchy. Distributing memory access in a many-core is frustratingly slow.


👤 amirhirsch
If you are doing only a few compute steps per memory access, then you are memory-bandwidth limited, and >100MB of RAM per core is too large a working set not to be bottlenecked by RAM access. Use a GPU or FPGA with HBM.
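As a rough, assumed illustration of that arithmetic: 128 cores each touching 4 bytes per cycle at 3 GHz would want ~1.5 TB/s, an order of magnitude beyond a typical 8-channel DDR4 socket (~200 GB/s) but within reach of HBM parts.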