I believe I have the definitive answer about why we see this.

Joe_H wrote: ↑Tue Mar 04, 2025 4:40 am

From what I understand, the difference is in how Nvidia and AMD wrote their drivers. Nvidia's driver is doing a spin-wait looking for instructions to be processed and sent to the GPU. AMD from the explanations I have seen implemented this as an interrupt instead. As soon as something is handed off to the driver to process, it wakes up and takes CPU cycles to handle the request and then goes inactive until the next request. So the Nvidia driver process is always active, but the actual amount of work done by the CPU may be a fraction of the cycles available.
It turns out it's not the driver at all, just a choice FAH made in their configuration of OpenMM: they overrode the default for UseBlockingSync and set it to false. This increases performance slightly but causes the high CPU usage people report.
http://docs.openmm.org/latest/userguide ... a-platform
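For illustration, here is roughly what flipping that property back looks like through OpenMM's C++ API. This is a minimal sketch of my own, not FAH's actual code; the two-particle system and integrator settings are placeholders just to get a CUDA context created. (Older OpenMM versions named the property CudaUseBlockingSync; the linked docs use UseBlockingSync.)

[code]
// blocking_sync_property.cpp -- minimal sketch, not FAH code.
// Build (paths will vary): g++ blocking_sync_property.cpp -lOpenMM -o demo
#include <OpenMM.h>
#include <map>
#include <string>
#include <vector>

int main() {
    // Smallest possible system: two free particles, just enough
    // for OpenMM to create a CUDA context.
    OpenMM::System system;
    system.addParticle(1.0);
    system.addParticle(1.0);
    OpenMM::VerletIntegrator integrator(0.002);

    OpenMM::Platform& platform = OpenMM::Platform::getPlatformByName("CUDA");

    // Explicitly ask for the documented default; per the above, FAH
    // passes "false" here, which is what produces the spin-wait.
    std::map<std::string, std::string> props;
    props["UseBlockingSync"] = "true";

    OpenMM::Context context(system, integrator, platform, props);
    std::vector<OpenMM::Vec3> pos = {OpenMM::Vec3(0, 0, 0),
                                     OpenMM::Vec3(1, 0, 0)};
    context.setPositions(pos);
    integrator.step(1000);
    return 0;
}
[/code]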
When the CPU sends work to the GPU, it calls cudaDeviceSynchronize(), which waits until the GPU has finished before it returns. The majority of the CPU's time is spent inside that function. It will either spin-wait or yield the CPU and sleep until an interrupt arrives, depending on whether cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync) has been called. The UseBlockingSync setting in OpenMM simply controls that flag. From the linked docs:

UseBlockingSync: This is used to control how the CUDA runtime synchronizes between the CPU and GPU. If this is set to “true” (the default), CUDA will allow the calling thread to sleep while the GPU is performing a computation, allowing the CPU to do other work. If it is set to “false”, CUDA will spin-lock while the GPU is working. Setting it to “false” can improve performance slightly, but also prevents the CPU from doing anything else while the GPU is working.
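To make the difference concrete, here is a small self-contained CUDA sketch of my own (not FAH or OpenMM code) that launches a long-running kernel and then waits on it under each scheduling policy. The kernel and cycle count are arbitrary, just enough GPU work to watch the process in top:

[code]
// blocking_sync_demo.cu -- build: nvcc blocking_sync_demo.cu -o demo
// Run "./demo spin" vs "./demo block" and watch the process in top/htop.
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Busy-loop on the GPU long enough for the CPU-side wait to be visible.
__global__ void spin_kernel(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main(int argc, char **argv) {
    // The flags must be set before the CUDA context is created.
    if (argc > 1 && std::strcmp(argv[1], "block") == 0) {
        cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
        std::puts("waiting with cudaDeviceScheduleBlockingSync (thread sleeps)");
    } else {
        cudaSetDeviceFlags(cudaDeviceScheduleSpin);
        std::puts("waiting with cudaDeviceScheduleSpin (thread spin-locks)");
    }

    spin_kernel<<<1, 1>>>(4000000000LL);  // a few seconds of GPU work

    // Nearly all CPU time is spent here: ~100% of a core in spin mode,
    // close to 0% in blocking-sync mode.
    cudaDeviceSynchronize();
    return 0;
}
[/code]

The spin build should sit near 100% of one core inside cudaDeviceSynchronize() even though it's doing no useful work, while the blocking build should sit near 0%, which matches the symptom people report with the fahcore process.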
So why did FAH do this? I don't know. From reports, the performance improvement is very slight. Maybe it was just set and forget? Maybe there are some systems out there where the improvement is big enough to make it worthwhile? I'll test this when I get a new Nvidia card and see if I can write a simple program that injects the blocking sync flag for a test WU (rough idea sketched below).
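For anyone who wants to experiment in the meantime, the rough idea I have in mind is an LD_PRELOAD shim. OpenMM's CUDA platform creates its context through the driver API, so intercepting cuCtxCreate (exported by libcuda.so as cuCtxCreate_v2) and forcing the blocking-sync scheduling flag should flip the behavior without modifying the core. This is an untested sketch; it assumes the core loads libcuda dynamically and goes through that entry point rather than a newer versioned one:

[code]
// shim.cpp -- untested sketch. Build: g++ -shared -fPIC shim.cpp -o shim.so -ldl
// Run the GPU core with: LD_PRELOAD=./shim.so <core command line>
#include <dlfcn.h>
#include <cstdio>

// Minimal CUDA driver-API declarations so cuda.h isn't required.
typedef int CUresult;
typedef int CUdevice;
typedef struct CUctx_st *CUcontext;

#define CU_CTX_SCHED_MASK          0x07u
#define CU_CTX_SCHED_BLOCKING_SYNC 0x04u

// cuCtxCreate is version-mangled; the symbol libcuda.so actually
// exports is cuCtxCreate_v2.
extern "C" CUresult cuCtxCreate_v2(CUcontext *pctx, unsigned int flags,
                                   CUdevice dev) {
    using Fn = CUresult (*)(CUcontext *, unsigned int, CUdevice);
    static Fn real = (Fn)dlsym(RTLD_NEXT, "cuCtxCreate_v2");

    // Replace whatever scheduling policy the caller requested with
    // blocking sync, so waits sleep on an interrupt instead of spinning.
    flags = (flags & ~CU_CTX_SCHED_MASK) | CU_CTX_SCHED_BLOCKING_SYNC;
    std::fprintf(stderr, "shim: forcing CU_CTX_SCHED_BLOCKING_SYNC\n");
    return real(pctx, flags, dev);
}
[/code]

If that works, comparing PPD with and without the shim on the same project would put a number on how much performance the spin-wait actually buys.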