Of course, the most used stack is currently CUDA, which is proprietary and only works on Nvidia.
AMD made its own version called HIP, which translates quite directly to CUDA, but also runs on AMD cards via ROCm.
Intel has its own stack called oneAPI, which I think is based on SYCL, a Khronos standard (just like OpenCL/OpenGL/Vulkan/etc). I believe SYCL programs can also be run on AMD and NVIDIA using third-party compilers/translation tools such as hipSYCL, and I think SYCL can also be compiled to OpenCL.
I recently also heard about efforts to support running HIP programs on top of SYCL, so hopefully GPU compute will soon be less vendor-bound, with CUDA translating relatively easily to HIP, and HIP and SYCL interoperating in both directions without compatibility problems.
The only hope is that Intel pushes OpenCL as one of the selling points of their new Arc GPUs. They seem to be starting with the gaming market, however, which isn't very interested in compute. It could be that they plan to go after the non-HPC compute market later, in which case it would make sense to back OpenCL.
At the same time, Intel is developing oneAPI, so it may make more sense to look at that instead of OpenCL.
From others' comments here, it seems there hasn't been much development in OpenCL land since then. I imagine the details of device support vary across generations of hardware and driver releases... I made use of half-precision float storage but ran the compute at single precision, since my devices did not offer fast half-precision math.
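To make the half-storage trick concrete, here is a minimal pyopencl sketch (not the original code, just the general pattern): buffers hold float16 data, the core vload_half/vstore_half built-ins convert at the memory boundary, and the arithmetic itself stays in float.

```python
# Minimal sketch (not the original code) of half storage with float compute,
# assuming pyopencl and a device whose driver accepts half pointers with the
# core vload_half/vstore_half built-ins (no cl_khr_fp16 math required).
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

# Data is stored as float16 to halve memory footprint and bus traffic.
x = np.random.rand(1 << 20).astype(np.float16)

src = """
__kernel void scale(__global const half *in, __global half *out, float gain) {
    size_t i = get_global_id(0);
    float v = vload_half(i, in);      /* half in memory -> float in registers */
    vstore_half(v * gain, i, out);    /* round back to half for storage */
}
"""
prg = cl.Program(ctx, src).build()

mf = cl.mem_flags
in_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=x)
out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, x.nbytes)

prg.scale(queue, (x.size,), None, in_buf, out_buf, np.float32(2.0))

y = np.empty_like(x)
cl.enqueue_copy(queue, y, out_buf)
```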
This was small "HPC", i.e. a task that would run on one workstation or a modest single- or dual-socket x86_64 server or VM. The most performance-sensitive aspect was that it was also used during the data-loading/startup phase of an interactive tool, so a human user was impatiently waiting for results. The first prototype just used Python numpy routines for convolutions, etc. I used OpenCL to get the running time down from many minutes to tens of seconds and called it good enough.
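As an illustration of the kind of swap that bought those minutes back, here is a hedged sketch of a naive 1D convolution done both ways, numpy and pyopencl; the kernel and names are made up for the example, not taken from the actual tool.

```python
# Illustrative only: a numpy convolution prototype next to an equivalent
# (naive, unoptimized) OpenCL kernel, one work-item per output sample.
import numpy as np
import pyopencl as cl

signal = np.random.rand(4096).astype(np.float32)
taps = np.ones(9, dtype=np.float32) / 9.0

# Prototype path: pure numpy.
reference = np.convolve(signal, taps, mode="same")

# OpenCL path: same math on the device.
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
src = """
__kernel void conv1d(__global const float *x, __global const float *h,
                     __global float *y, int n, int k) {
    int i = get_global_id(0);
    int half_k = k / 2;
    float acc = 0.0f;
    for (int j = 0; j < k; ++j) {
        int idx = i + j - half_k;          /* zero-pad at the edges */
        if (idx >= 0 && idx < n)
            acc += x[idx] * h[k - 1 - j];  /* flipped taps = convolution */
    }
    y[i] = acc;
}
"""
prg = cl.Program(ctx, src).build()
mf = cl.mem_flags
x_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=signal)
h_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=taps)
y_buf = cl.Buffer(ctx, mf.WRITE_ONLY, signal.nbytes)
prg.conv1d(queue, signal.shape, None, x_buf, h_buf, y_buf,
           np.int32(signal.size), np.int32(taps.size))
result = np.empty_like(signal)
cl.enqueue_copy(queue, result, y_buf)
print("max |numpy - opencl| =", np.max(np.abs(result - reference)))
```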
I enjoyed that I could run the same code on a workstation GPU with the NVIDIA driver or via x86 multithreaded SIMD using the Intel driver. I did not do any real work with Intel or AMD GPUs because of the hardware selection we had on hand. I also needed more than 6 GB of GPU RAM for it to be worth using. Even my Titan X with 12 GB was only about 2-3x faster than x86 SIMD for my problem, due to the complex tradeoffs between RAM and bus bandwidths for data transfers. This was after I'd already done some algorithmic optimizations to bring down the compute/IO ratio.
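The portable part worked roughly like this in pyopencl: enumerate platforms, prefer a GPU device, and otherwise fall back to a CPU device such as the Intel OpenCL CPU runtime, all without touching the kernel source. A rough sketch, with the fallback policy just as an example:

```python
# Sketch of "same kernels, different drivers", assuming pyopencl.
# Prefers a GPU device but falls back to a CPU device (e.g. the Intel
# OpenCL CPU runtime, which maps kernels onto multithreaded SIMD).
import pyopencl as cl

def pick_device(prefer=cl.device_type.GPU):
    fallback = None
    for platform in cl.get_platforms():
        for dev in platform.get_devices():
            if dev.type & prefer:
                return dev
            if dev.type & cl.device_type.CPU and fallback is None:
                fallback = dev
    if fallback is None:
        raise RuntimeError("no usable OpenCL device found")
    return fallback

dev = pick_device()
ctx = cl.Context([dev])
queue = cl.CommandQueue(ctx)
print("running on:", dev.platform.name, "/", dev.name)
# ...then build and enqueue the exact same cl.Program source as before...
```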
A big thing I hear others talk about is the rich development tooling around CUDA and the relatively impoverished OpenCL tooling. I am old school enough to get along without it. I was able to compensate for the limited Python OpenCL tools by doing some of my development and debugging cycle embedded in hacked-up variants of my own viz tools. You might think of this a bit like debugging with printf, except my print statement could send a dense 3D array into an OpenGL-based renderer on my workstation.
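For the curious, the pattern looked roughly like the sketch below: block on a read-back of an intermediate device buffer into numpy, print a couple of summary stats, and push the volume into a viewer. show_volume() here is only a stand-in for the OpenGL hook, not a real library call.

```python
# Rough analogue of the "printf that renders a volume" workflow, assuming
# pyopencl. show_volume() is a placeholder for whatever renderer you have.
import numpy as np
import pyopencl as cl

def show_volume(volume, title=""):
    # Placeholder for the actual renderer (napari, matplotlib slices,
    # a custom OpenGL view, etc.).
    pass

def dump_volume(queue, dev_buf, shape, dtype=np.float32, label=""):
    """Blocking read-back of an intermediate device buffer for inspection."""
    host = np.empty(shape, dtype=dtype)
    cl.enqueue_copy(queue, host, dev_buf)
    queue.finish()
    print(f"[{label}] min={host.min():.3g} max={host.max():.3g}")
    show_volume(host, title=label)
    return host
```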