I set up some basic compute stuff with the ROCm stack on a 6900HX-based mini computer the other week (mostly to see if it was possible as there are some image processing workloads a colleague was hoping to accelerate on a similar host) and noticed that the docs occasionally pretend you could use GTT dynamicly allocated memory for compute tasks, but there was no evidence of it ever having worked for anyone.
That machine had flexible firmware and 64GB of RAM stuffed in it so I just shuffled the boot time allocation in the EFI to give 8GB to the GPU to make it work, but it’s not elegant.
It’s also pretty clumsy to actually make things run, lot of “set the magic environment variable because the tool chain will mis-detect the architecture of your unsupported card” and “Inject this wall of text into your CMake list to override libraries with our cooked versions” to make things work. Then it performs like an old GTX1060, which is on one hand impressive for an integrated part in a fairly low wattage machine, and on the other hand is competing with a low-mid range card from 2016.
Pretty on brand really, they’ve been fucking up their compute stack since before any other vendor was doing the GPGPU thing (abandoning CTM for Stream in like a year).
I think the OpenMP situation was the least jank of the ways I tried getting something to offload on an APU, but it was also one of the later attempts so maybe I was just getting used to it’s shit.
“set the magic environment variable because the tool chain will mis-detect the architecture of your unsupported card”
I don’t think it’s misdetecting it. Rather it detects it correctly, tries to use specific support for that device, but then finds that the support was switched off at compile time. The environment variable forces it to pretend to be a different (very similar) device.
I find the hardware architecture and licensing situation with AMD much more appealing than Nivida and really want to like their cards for compute, but they sure make it challenging to recommend.
I had to do a little dead reckoning with the list of supported targets to find one that did the right thing with the 12CU RDNA2 680M.
I’ve been meaning to put my findings on the internet since it might be useful to someone else, this is a good a place as any.
On a fresh Xubuntu 22.04.4 LTS install doing the official ROCm 6.1 setup instructions, using a Minisforum UM690S Ryzen 9 6900HX/64GB/1TB box as the target, and after setting the GPU Memory to 8GB in the EFI before boot so it doesn’t OOM.
For OpenMP projects, you’ll probably need to install libstdc++-12-dev in addition to the documented stuff because HIP won’t see the cmath libs otherwise (bug), then the <CMakeConfig.txt> mods for adapting a project with accelerator directives to that target are
And torch, because I was curious how that would go (after I watched the Docker based suggested method download 30GB of trash then fall over, and did the bare metal install instead) seems to work with
PYTORCH_TEST_WITH_ROCM=1 HSA_OVERRIDE_GFX_VERSION=10.3.0 python3 testtorch.py
which is the most confidence inspiring.
Also amdgpu_top is your friend for figuring out if you actually have something on the GPU compute pipes or if it’s just lying and running on the CPU.
Neat.
I set up some basic compute stuff with the ROCm stack on a 6900HX-based mini computer the other week (mostly to see if it was possible as there are some image processing workloads a colleague was hoping to accelerate on a similar host) and noticed that the docs occasionally pretend you could use GTT dynamicly allocated memory for compute tasks, but there was no evidence of it ever having worked for anyone.
That machine had flexible firmware and 64GB of RAM stuffed in it so I just shuffled the boot time allocation in the EFI to give 8GB to the GPU to make it work, but it’s not elegant.
It’s also pretty clumsy to actually make things run, lot of “set the magic environment variable because the tool chain will mis-detect the architecture of your unsupported card” and “Inject this wall of text into your CMake list to override libraries with our cooked versions” to make things work. Then it performs like an old GTX1060, which is on one hand impressive for an integrated part in a fairly low wattage machine, and on the other hand is competing with a low-mid range card from 2016.
Pretty on brand really, they’ve been fucking up their compute stack since before any other vendor was doing the GPGPU thing (abandoning CTM for Stream in like a year).
I think the OpenMP situation was the least jank of the ways I tried getting something to offload on an APU, but it was also one of the later attempts so maybe I was just getting used to it’s shit.
I don’t think it’s misdetecting it. Rather it detects it correctly, tries to use specific support for that device, but then finds that the support was switched off at compile time. The environment variable forces it to pretend to be a different (very similar) device.
Clunky, yes.
That’s credible.
I find the hardware architecture and licensing situation with AMD much more appealing than Nivida and really want to like their cards for compute, but they sure make it challenging to recommend.
I had to do a little dead reckoning with the list of supported targets to find one that did the right thing with the 12CU RDNA2 680M.
I’ve been meaning to put my findings on the internet since it might be useful to someone else, this is a good a place as any.
On a fresh Xubuntu 22.04.4 LTS install doing the official ROCm 6.1 setup instructions, using a Minisforum UM690S Ryzen 9 6900HX/64GB/1TB box as the target, and after setting the GPU Memory to 8GB in the EFI before boot so it doesn’t OOM.
For OpenMP projects, you’ll probably need to install
libstdc++-12-dev
in addition to the documented stuff because HIP won’t see the cmath libs otherwise (bug), then the <CMakeConfig.txt> mods for adapting a project with accelerator directives to that target arefind_package(hip REQUIRED) list(APPEND CMAKE_PREFIX_PATH /opt/rocm-6.1.0) set(CMAKE_CXX_COMPILER ${HIP_HIPCC_EXECUTABLE}) set(CMAKE_CXX_LINKER ${HIP_HIPCC_EXECUTABLE}) target_compile_options(yourtargetname PUBLIC "-lm;-fopenmp;-fopenmp-targets=amdgcn-amd-amdhsa;-Xopenmp-target=amdgcn-amd-amdhsa;-march=gfx1035"
And torch, because I was curious how that would go (after I watched the Docker based suggested method download 30GB of trash then fall over, and did the bare metal install instead) seems to work with
PYTORCH_TEST_WITH_ROCM=1 HSA_OVERRIDE_GFX_VERSION=10.3.0 python3 testtorch.py
which is the most confidence inspiring.Also amdgpu_top is your friend for figuring out if you actually have something on the GPU compute pipes or if it’s just lying and running on the CPU.