GPU temperature control

Open Hardware Monitor has been running on my machine (via a scheduled task) since I discovered it in February. It collects data from the 36 sensors in my system and logs it every 5 seconds. I’m generally pretty happy with the performance, so the logging is mostly in aid of sheer curiosity.

One small thing I’d like to improve is the behaviour of the GPU cooler. Cooling capacity is fine, as far as I can tell, but it’s not always employed as much as I think it might be. In particular, the automatic control will let the GPU get quite warm (up to 85 degrees) while its fan is at only 50% capacity. It’s true that the fan does get noticeably louder at this speed, and it is almost unbearably loud at full capacity (not to mention that running it flat out might shorten its lifetime). But this temperature is close to worryingly hot, and I would like the option of having a little more noise for a slightly lower temperature.

According to here and here, a Radeon HD 6870 should run up to about 50 degrees hotter than room temperature. I’m reasonably confident that my room is colder than 35 degrees, so I do not know why my GPU runs so hot. It could be because of the game I’m playing (Neverwinter Nights 2, with near-best settings). It could simply be that my case layout and cooling are less than optimal for it. Whatever the explanation, 85 degrees seems to be close to, if not over, the recommended maximum temperature (ranging from 75 to 100 depending on source; I’m not sure what AMD says).

Here’s a chart of GPU temperature (red), load (green), and fan speed (blue). The data is aggregated per hour and distributed into polygons depending on density. Darker polygons mean more data points for that hour are within that range. The temperature scale is shown; the others are on scales from 0 to 100% and 1000 to 4500 RPM. (Graph generated from the data in PostgreSQL and rendered as SVG. I’m not sure it’s the best way of presenting the data, so let me know if you can think of something that might be better.)

Temperature (red), load (green) and fan (blue) for Radeon 6870

Observant readers will have noticed that the temperature has gone as high as 103 degrees. The story is that OHM provides a Fan Control option. This lets the user override automatic control and specify a fixed speed (as a percentage of capacity). I played around with this a bit. What I didn’t realise is that once set in OHM, the option is not unset when the OHM user sets it back to default. (It needs to be reset in the Radeon Catalyst Control Centre.) The card seems to be none the worse for wear, but I don’t intend to repeat that precise experiment. ;-)

I then added a new hack to OHM, to set a target temperature, and let OHM adjust the fan speed accordingly. The algorithm is:

Check every second:
    If too hot and fan < 100%, increase fan by 1%.
    If too cold and fan > 0%, decrease fan by 1%.
    Otherwise leave fan alone.
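
The rule above can be sketched in Python. This is only an illustration, not the actual OHM hack: the function name next_fan_speed and its parameters are hypothetical, and it assumes "too hot" and "too cold" simply mean above and below the target temperature.

```python
def next_fan_speed(temp_c: float, target_c: float, fan_pct: int) -> int:
    """Return the fan speed (in percent) to use for the next one-second tick.

    Hypothetical helper: "too hot" / "too cold" are taken to mean
    strictly above / below the target temperature.
    """
    if temp_c > target_c and fan_pct < 100:
        return fan_pct + 1  # too hot: speed the fan up by 1%
    if temp_c < target_c and fan_pct > 0:
        return fan_pct - 1  # too cold: slow the fan down by 1%
    return fan_pct          # on target, or already at a limit: leave it alone
```

Called once per second with the latest reading, for example fan = next_fan_speed(temp, 75.0, fan) with a made-up 75 degree target, it moves the fan at most one percentage point per tick.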

Yes, it’s quite simplistic. At first glance it might seem that it’s not responsive enough: if the GPU starts being heavily used it might take a minute or more to spin the fan up. In practice, it’s hard to say what the problem is. It tends to oscillate up and down between 0 and 100% fan. Now, it turns out that temperature is not a direct function of fan speed at that instant (who’d have guessed?!), and consequently, the temperature oscillates up and down too, out of phase with the fan. (Chart is forthcoming…)
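
A toy simulation gives a feel for why a lagging temperature makes this rule overshoot. Everything about the thermal model here is an assumption made up for illustration, not something measured from the card: the temperature is taken to drift towards a settling point of 95 degrees at 0% fan and 55 degrees at 100% fan, with a 20 second time constant, and the target is an arbitrary 75 degrees. It will not necessarily reproduce the full 0 to 100% swings described above, but it does show the fan reversing only once the temperature has crossed the target, so the two are never in step.

```python
def simulate(seconds=600, target_c=75.0, tau_s=20.0):
    """Run the 1%-per-second rule against an invented first-order thermal lag.

    Returns a list of (fan_pct, temp_c) pairs, one per simulated second.
    All constants are illustrative assumptions, not measurements.
    """
    temp_c, fan_pct = 85.0, 50
    history = []
    for _ in range(seconds):
        # Controller: the same rule as the algorithm described above.
        if temp_c > target_c and fan_pct < 100:
            fan_pct += 1
        elif temp_c < target_c and fan_pct > 0:
            fan_pct -= 1
        # Assumed plant: the temperature creeps towards the settling point
        # for the current fan speed instead of jumping there instantly.
        settle_c = 95.0 - 0.4 * fan_pct
        temp_c += (settle_c - temp_c) / tau_s
        history.append((fan_pct, temp_c))
    return history


def count_reversals(history):
    """Count how often the fan switches between speeding up and slowing down."""
    reversals, last_dir = 0, 0
    for (prev_fan, _), (fan, _) in zip(history, history[1:]):
        direction = (fan > prev_fan) - (fan < prev_fan)  # +1, 0 or -1
        if direction and last_dir and direction != last_dir:
            reversals += 1
        if direction:
            last_dir = direction
    return reversals


if __name__ == "__main__":
    run = simulate()
    fans = [f for f, _ in run]
    temps = [t for _, t in run]
    print(f"fan range: {min(fans)}-{max(fans)}%")
    print(f"temperature range: {min(temps):.1f}-{max(temps):.1f} C")
    print(f"fan reversals: {count_reversals(run)}")
```

Because the rule only looks at which side of the target the temperature is on, the fan keeps ramping until the temperature has actually crossed over, by which point it has already gone too far.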

A slightly improved algorithm is:

Leave fan alone.

I have stuck with this one in the meantime.

However, I can’t help thinking that I’ve now got enough data to reverse engineer the GPU’s built-in control, and then merely tweak some of its parameters. I suspect it relies on a better heat model (possibly involving differential equations and Lyapunov functions). It might also use the GPU load to anticipate upcoming temperature changes. All this starts to sound very hard, so I’m not sure I’m going to embark on the science required to replicate it. But if I do take some tentative steps in that direction I’ll post about them here. ;-)

3 Responses to GPU temperature control

Michael Möller says:

May 15, 2011 at 1:41 am

“What I didn’t realise is that once set in OHM, the option is not unset when the OHM user sets it back to default.”

The Open Hardware Monitor should have unset it. Unfortunately there are some systems where this didn’t work correctly (while on others everything worked fine). Anyway, I have added a patch to the SVN that should hopefully fix this on all systems. Maybe you can check if it is correct now on your system as well.

For temperature-based fan control, a table or curve that maps temperature to desired fan speed usually works very well.

- ejrh says:
  
  May 15, 2011 at 12:42 pm
  
  Thanks — I didn’t want to make the assumption that OHM was to blame, so I hope it didn’t come across like that. :-)
  
  It’s a great piece of software, thanks for working on it Michael.