I've been pursuing a path that is decidedly edgy... and might work out great, or might be a miserable failure: the BitGrid[1]. It's dead-nuts simple... a Cartesian grid of 4-bit-input, 4-bit-output LUTs, latched and clocked in 2 phases (like the colors on a checkerboard: every cell's neighbors are the other color, so they hold steady while it updates) to prevent race conditions. It's a Turing-complete architecture that doesn't have the routing issues of an FPGA, because there's no routing hardware in the way. But it is also nuts, because there's no routing fabric to get data rapidly across the chip.
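To make the checkerboard clocking concrete, here's a toy Python simulator. It's a sketch under my own assumptions, not the cell layout used in the BitGrid repo: each cell holds four hypothetical 16-entry truth tables (`luts`), one per output direction, indexed by the 4 input bits arriving from the N/E/S/W neighbors, and the west edge of each row takes an external input bit (`west_in`). The class and all its names are illustrative only.

```python
# Direction indices for a cell's four inputs and four outputs (assumed ordering).
N, E, S, W = 0, 1, 2, 3


class BitGrid:
    """Toy simulator of a BitGrid-style array (hypothetical layout).

    Each cell holds four 16-bit lookup tables, one per output direction.
    The four input bits (one from each neighbor) form a 4-bit index, and
    output bit d is bit `index` of luts[d].
    """

    def __init__(self, width: int, height: int) -> None:
        self.width = width
        self.height = height
        # luts[y][x] is four 16-bit ints; all zero means the cell outputs 0s.
        self.luts = [[[0, 0, 0, 0] for _ in range(width)] for _ in range(height)]
        # Latched outputs: out[y][x][d] is the bit cell (x, y) drives toward d.
        self.out = [[[0, 0, 0, 0] for _ in range(width)] for _ in range(height)]
        # External bits fed into the west edge, one per row.
        self.west_in = [0] * height

    def _input(self, x: int, y: int, d: int) -> int:
        """Bit arriving at cell (x, y) from its neighbor in direction d."""
        if d == N:
            return self.out[y - 1][x][S] if y > 0 else 0
        if d == E:
            return self.out[y][x + 1][W] if x < self.width - 1 else 0
        if d == S:
            return self.out[y + 1][x][N] if y < self.height - 1 else 0
        # d == W: the left edge takes an external input bit for this row.
        return self.out[y][x - 1][E] if x > 0 else self.west_in[y]

    def half_step(self, phase: int) -> None:
        """Update only the cells whose (x + y) parity equals `phase`.

        Every neighbor of such a cell has the opposite parity, so the inputs
        read here were latched on the previous phase and can't change while
        this phase is evaluated: no race conditions.
        """
        for y in range(self.height):
            for x in range(self.width):
                if (x + y) % 2 != phase:
                    continue
                idx = sum(self._input(x, y, d) << d for d in (N, E, S, W))
                self.out[y][x] = [(self.luts[y][x][d] >> idx) & 1 for d in (N, E, S, W)]

    def step(self) -> None:
        """One full clock: the 'black squares', then the 'white squares'."""
        self.half_step(0)
        self.half_step(1)


# Usage: a 4x1 "wire" where each cell copies its west input to its east output.
PASS_W_TO_E = 0xFF00  # bits 8..15 set: output 1 whenever the W input (index bit 3) is 1

grid = BitGrid(4, 1)
for x in range(4):
    grid.luts[0][x] = [0, PASS_W_TO_E, 0, 0]
grid.west_in[0] = 1
for _ in range(2):
    grid.step()
print(grid.out[0][3][E])  # 1: the bit crossed 4 cells in 2 clocks (4 phases)
```

In this model each phase updates only one color, so a bit advances exactly one cell per phase. That's the latency you pay for having no routing fabric.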
If you can unlearn the aversion to latency that we've all had since the days of TTL and the IMSAI, you realize that you could clock an array at 3 GHz or more, giving 3 billion answers per second, even if each answer takes many clocks to ripple across the grid. It's all a question of programming. (Which is where I'm stuck right now: analysis paralysis.)
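A back-of-envelope sketch of that latency/throughput trade. It assumes, as in a toy checkerboard model rather than any measured figure, that data advances one cell per phase with two phases per clock; `latency_ns` and `answers_per_second` are hypothetical helpers. Under that assumption, crossing 1,000 cells costs 500 clocks, while a fully pipelined path still delivers one answer per clock.

```python
CLOCK_HZ = 3e9  # the 3 GHz figure from the text

# Hypothetical model: one cell per phase, two phases per clock.
CELLS_PER_CLOCK = 2


def latency_ns(cells_crossed: int, clock_hz: float = CLOCK_HZ) -> float:
    """Time for one answer to ripple across `cells_crossed` cells, in ns."""
    clocks = cells_crossed / CELLS_PER_CLOCK
    return clocks / clock_hz * 1e9


def answers_per_second(clock_hz: float = CLOCK_HZ) -> float:
    """Steady-state throughput once the pipeline is full: one answer per clock."""
    return clock_hz


print(f"{latency_ns(1000):.1f} ns latency")          # 166.7 ns latency
print(f"{answers_per_second():.0e} answers/second")  # 3e+09 answers/second
```

The latency grows with the distance a signal has to travel, but the throughput doesn't depend on it at all.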
[1] https://github.com/mikewarot/Bitgrid