Getting accurate per-thread timing on Windows

Published 17 July 07 05:04 PM | john 

If you need to accurately time operations in Windows, you're usually directed to the QueryPerformanceCounter API.  This API is also neatly wrapped in .NET under the Stopwatch class.  Indeed, this is the simplest way to get reasonably accurate timings of code under Windows.  Ah, you sense some mischievousness in my tone.

Sadly, one of the big problems with timing some code in Windows is the fact that Windows is a preemptive, multi-threaded environment.  To state the obvious, this means that periodically, without your consent, your code's ability to control the CPU is wrenched away and given to some other chunk of code.  Eventually, your code gets control back, but if you are in the process of timing something your code is supposed to be doing, guess what?  Your timing measurement now includes all the time the O/S and who knows how many other threads just spent doing something completely different.

In practice, this shows up as small amounts of randomness in your timings.  How much depends on exactly what else is happening on your system at the time.  Try measuring some performance while you're doing a massive compilation and, not surprisingly, your timing values will be much bigger.  If your system is quiet, you'll get only very small perturbations, perhaps on the order of milliseconds.

Another problem with the QueryPerformanceCounter API is that (a) it has quite a high overhead and (b) it's not the most accurate timer on your computer.

The most accurate timer on your computer will be connected to the clock that is running the fastest, and that, my friends, would be the one that is driving your CPU.  Intel CPUs increment a counter every time the CPU clock ticks.  You can get the current value of this counter through the RDTSC instruction.  Yes, it's just too tempting to walk away from, isn't it?

So let's say that despite the warnings, we want to use the RDTSC instruction to do some timing.  Microsoft recommends that you don't, and so does Raymond, for two reasons: (i) systems with multiple cores won't always have synchronized tick counters and (ii) power management can effectively slow down a counter or even stop it by changing a CPU core's clock speed.

But what if you absolutely, positively, have to have the most accurate timing possible?  For the sake of argument let's assume that your reasons are solid, and just get to the part where we discuss how one might do that.

Firstly, you can use the RDTSC instruction directly in your x86 C++ code if you like; it's not a protected instruction.  It will put the 64-bit clock cycle counter in the EDX:EAX registers, subject to all the caveats about sleep states and processor halting given in the links.  If you want the full skinny on this instruction, take a look at the IA-32 Intel Architecture Manual Vol.3.  So you'll need to use some inline assembler, or the Macro Assembler (MASM, or ml.exe).  Getting the value is the easy part.  It's all the steps you have to take to make sure that the value's rate of change stays consistent that aren't.
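For what it's worth, you may not have to write the assembler yourself.  Here is a minimal sketch, assuming an x86 or x64 target and a compiler that ships the __rdtsc intrinsic (MSVC through intrin.h, GCC and Clang through x86intrin.h).  The helper names read_tsc and cycles_for are made up for illustration, and none of the caveats above go away just because the call is shorter.

```cpp
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>      // __rdtsc on MSVC
#else
#include <x86intrin.h>   // __rdtsc on GCC and Clang
#endif

// Hypothetical helper: returns the raw 64-bit time stamp counter,
// the same value the RDTSC instruction leaves in EDX:EAX.
inline std::uint64_t read_tsc() {
    return __rdtsc();
}

// Hypothetical helper: raw tick delta around a callable.  The result
// still includes context switches, interrupts, and the effect of any
// clock speed change that happened while `work` was running.
template <typename Work>
std::uint64_t cycles_for(Work&& work) {
    const std::uint64_t start = read_tsc();
    work();
    const std::uint64_t stop = read_tsc();
    return stop - start;
}
```

You would call it as cycles_for([&] { thing_to_time(); }), where thing_to_time stands in for whatever you are measuring.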

To avoid power management problems the only recourse you have is to set your system power scheme to the "high performance" state to prevent CPU throttling.  It's also not a bad idea to tell the system that your thread is doing something important and that it shouldn't change state, such as by hibernating or going into standby.  See SetThreadExecutionState and SetActivePwrScheme for more information.  You can also register for a power state notification with RegisterPowerSettingNotification to make sure that nothing changes while you are timing.  Note that a power state notification will not notify you of dynamic CPU throttling, only that the system went from AC power to battery, or that the power scheme was changed, which is also a likely indicator that CPU speed changed.  In short, there's no way to guarantee that a power state change won't happen.  The best way to avoid getting accidentally hosed by CPU throttling is to take a bunch of timings of the same thing and check for consistency.
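One way to do that consistency check is sketched below.  It is only an illustration: the helper names median_cycles and timings_consistent are invented here, and the idea of comparing every sample against the median within some fractional tolerance is my assumption about what "consistent" should mean, not an established rule.  Pick a tolerance that suits your workload.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: median of a non-empty set of cycle counts.
// Takes the vector by value because nth_element reorders it.
inline std::uint64_t median_cycles(std::vector<std::uint64_t> samples) {
    assert(!samples.empty());
    const auto mid = samples.begin() + samples.size() / 2;
    std::nth_element(samples.begin(), mid, samples.end());
    return *mid;  // upper median when the sample count is even
}

// Hypothetical helper: true when every sample lies within `tolerance`
// (a fraction, e.g. 0.05 for 5%) of the median.  One throttled or
// preempted run is enough to make this return false.
inline bool timings_consistent(const std::vector<std::uint64_t>& samples,
                               double tolerance) {
    const double median = static_cast<double>(median_cycles(samples));
    for (const std::uint64_t s : samples) {
        const double value = static_cast<double>(s);
        const double diff = value > median ? value - median : median - value;
        if (diff > tolerance * median) return false;
    }
    return true;
}
```

If the check fails, throw the batch away and measure again rather than trying to average the outlier out.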

Next, it doesn't hurt to affinitize your process's threads.  This will avoid nasty problems with inconsistent clock counters across your multiple CPU cores.  SetProcessAffinityMask should do the trick.  Don't forget to scare yourself with all the caveats in the documentation.
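If you only care about the thread doing the timing, a narrower option is to pin just that thread.  The sketch below swaps in SetThreadAffinityMask for SetProcessAffinityMask, which is my substitution rather than anything prescribed above.  The helper name pin_current_thread_to_cpu is made up, and the non-Windows branch is a stub that exists only so the file compiles elsewhere.

```cpp
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: pin the calling thread to one logical processor.
// Returns false for an out-of-range index, on API failure, or on
// non-Windows builds.
inline bool pin_current_thread_to_cpu(unsigned cpu_index) {
#ifdef _WIN32
    // An affinity mask is a DWORD_PTR with one bit per logical
    // processor in the thread's current processor group.
    if (cpu_index >= sizeof(DWORD_PTR) * 8) return false;
    const DWORD_PTR mask = static_cast<DWORD_PTR>(1) << cpu_index;
    // SetThreadAffinityMask returns the previous mask, or 0 on failure.
    return SetThreadAffinityMask(GetCurrentThread(), mask) != 0;
#else
    (void)cpu_index;
    return false;  // stub: nothing to pin with in this sketch
#endif
}
```

Pin before you take your first reading, and leave the thread where it is until the last one is done.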

Lastly, what about the thread context switch problem?  RDTSC doesn't help with that.  But Windows Vista can.  Vista actually uses the RDTSC instruction as a better way of ensuring fairness when scheduling threads.  Vista keeps track of the clock cycle count when thread context switches occur and also when threads are hijacked to handle interrupt servicing.  Vista gets away with this because it doesn't much care how long your quantum of CPU time lasts in wall-clock time, just that everyone gets their fair share of cycles.  Anyway, the functions you need are QueryThreadCycleTime and QueryProcessCycleTime.  Call them instead of using RDTSC directly and you can count only the cycles spent executing your code and nothing else.
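A minimal sketch of the per-thread version follows, assuming Windows Vista or later.  QueryThreadCycleTime and GetCurrentThread are the real APIs; the wrapper thread_cycles_for and its out-parameter are invented for illustration, and the non-Windows branch is a stub so the file compiles elsewhere.

```cpp
#include <cstdint>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: cycles charged to the calling thread while
// `work` runs.  Returns false, leaving `cycles_out` untouched, if the
// query fails or on non-Windows builds.
template <typename Work>
bool thread_cycles_for(Work&& work, std::uint64_t& cycles_out) {
#ifdef _WIN32
    ULONG64 start = 0;
    ULONG64 stop = 0;
    if (!QueryThreadCycleTime(GetCurrentThread(), &start)) return false;
    work();
    if (!QueryThreadCycleTime(GetCurrentThread(), &stop)) return false;
    cycles_out = static_cast<std::uint64_t>(stop - start);
    return true;
#else
    work();        // still run the work; there is nothing to measure with
    return false;
#endif
}
```

Because the count belongs to the thread, time spent switched out in favor of other threads should not show up in the difference.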

All you are left with is the problem of turning a cycle count into an actual time.  Don't just use your theoretical CPU speed, measure it.  Sit in a loop for a second or two as measured by QueryPerformanceCounter, count how many cycles went by in that time, and save the resulting cycles-per-second figure off somewhere.
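Here is a rough sketch of that calibration.  A few assumptions to flag: it reads the counter with the __rdtsc intrinsic, it uses std::chrono::steady_clock as the reference clock (which, as far as I know, Microsoft's standard library builds on QueryPerformanceCounter), and it assumes the cycle counts you later convert advance at the same rate as the counter measured here.  The names measure_cycles_per_second and cycles_to_seconds are made up.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>      // __rdtsc on MSVC
#else
#include <x86intrin.h>   // __rdtsc on GCC and Clang
#endif

// Hypothetical helper: busy-wait for `window` on the reference clock
// and return how many counter ticks went by per second.
inline double measure_cycles_per_second(std::chrono::milliseconds window) {
    assert(window.count() > 0);
    using clock = std::chrono::steady_clock;
    const clock::time_point t0 = clock::now();
    const std::uint64_t c0 = __rdtsc();
    clock::time_point t1 = t0;
    while (t1 - t0 < window) t1 = clock::now();  // sit in a loop
    const std::uint64_t c1 = __rdtsc();
    const double seconds = std::chrono::duration<double>(t1 - t0).count();
    return static_cast<double>(c1 - c0) / seconds;
}

// Hypothetical helper: convert a cycle count using the measured rate.
inline double cycles_to_seconds(std::uint64_t cycles, double cycles_per_second) {
    return static_cast<double>(cycles) / cycles_per_second;
}
```

Calibrate once with a window of a second or two, on the same core and in the same power state you will be timing in, and reuse the number.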

Despite the warnings, the RDTSC instruction is still the mechanism used by all the major performance measurement tools that have ever existed on Windows, including Visual Studio Team System.  If you decide you understand the issues and you want to go ahead anyway, at least now you have some idea of how to solve the challenges ahead.
