
I was going through this link, delay in assembly, to add a delay in assembly code. I want to perform some experiments by adding different delay values.

The relevant code to generate the delay:

; start delay

mov bp, 43690
mov si, 43690
delay2:
dec bp
nop
jnz delay2
dec si
cmp si,0    
jnz delay2
; end delay

What I understood from the code: the delay is proportional to the time spent executing the nop instructions (43690 x 43690 of them). So on different systems and different OS versions, the delay will be different. Am I right?

Can anyone explain how to calculate the amount of delay, in nanoseconds, that the following assembly code generates, so that I can conclude my experiment with respect to the delay I added in my experimental setup?

This is the code I am using to generate the delay, without understanding the logic behind the value 43690 (I used only one loop instead of the two loops in the original source code). To generate a different delay (without knowing its value), I just changed 43690 to 403690 or some other value.

Code on a 32-bit OS:

movl  $43690, %esi   # ---> if I vary this to 4003690, what is the delay value?
.delay2:
    dec %esi
    nop
    jnz .delay2

How much delay is generated by this assembly code?

If I want to generate a delay of 100 ns, 1000 ns, or some other value in microseconds, what initial value do I need to load into the register?

I am using Ubuntu 16.04 (both 32-bit and 64-bit), on an Intel Core i5-7200U CPU @ 2.50GHz and a Core i3-3470 CPU @ 3.20GHz.

Thank you in advance.

  • The delay is not deterministic, nor should you expect it to be.
    – old_timer
    Commented Apr 19, 2018 at 15:07
  • @old_timer: Why do you believe that caching, prefetch, branch prediction, threading and memory latency have any influence? It is assembly, isn't it? ;)
    – Klaus
    Commented Apr 19, 2018 at 19:19
  • @Klaus well you know on this platform it is actually microcoded, so maybe if it were written in microcode then it would be deterministic. Just add a delay instruction to the instruction set and there you go
    – old_timer
    Commented Apr 19, 2018 at 21:31
  • @old_timer: those are all single-uop instructions on the OP's Kaby Lake and IvyBridge. It's not microcode that's the problem; it's dynamic CPU frequency, competition from other hyperthreads, and interrupt delays. Even possibly system-management-mode interrupts that even the kernel doesn't know about. (Linux is not a hard-realtime OS, and modern PCs are full of voodoo apart from that.) The loop is totally predictable at 1 iteration per core clock cycle, whether or not there's a nop in it. (agner.org/optimize)
    Commented Apr 20, 2018 at 1:46
  • @PeterCordes please re-read the last two comments and realize neither are serious, just a little humor. Should I have added a grin to my comment as well? It is too late to edit at this point.
    – old_timer
    Commented Apr 20, 2018 at 1:49

1 Answer


There is no very good way to get accurate and predictable timing from fixed counts for delay loops on a modern x86 PC, especially in user-space under a non-realtime OS like Linux. (But you could spin on rdtsc for very short delays; see below). You can use a simple delay-loop if you need to sleep at least long enough and it's ok to sleep longer when things go wrong.

Normally you want to sleep and let the OS wake your process, but this doesn't work for delays of only a couple microseconds on Linux. nanosleep can express it, but the kernel doesn't schedule with such precise timing. See How to make a thread sleep/block for nanoseconds (or at least milliseconds)?. On a kernel with Meltdown + Spectre mitigation enabled, a round-trip to the kernel takes longer than a microsecond anyway.
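
To see the scale of the problem, here's a quick user-space test (a sketch of mine, not from the question) that asks nanosleep for 1 us and measures what it actually gets with CLOCK_MONOTONIC:

    // sleep_overshoot.c -- hypothetical test program
    // gcc -O2 sleep_overshoot.c -o sleep_overshoot && ./sleep_overshoot
    #include <stdio.h>
    #include <time.h>

    int main(void) {
        struct timespec req = { 0, 1000 };          // request 1000 ns = 1 us
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        nanosleep(&req, NULL);                      // kernel round trip + scheduler wakeup
        clock_gettime(CLOCK_MONOTONIC, &t1);
        long long elapsed = (t1.tv_sec - t0.tv_sec) * 1000000000LL
                          + (t1.tv_nsec - t0.tv_nsec);
        printf("requested 1000 ns, actually slept %lld ns\n", elapsed);
        return 0;
    }

On a typical desktop kernel the overshoot is commonly tens of microseconds, which is why a busy-wait is the only realistic option at this scale.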

(Or are you doing this inside the kernel? I think Linux already has a calibrated delay loop. In any case, it has a standard API for delays: https://www.kernel.org/doc/Documentation/timers/timers-howto.txt, including ndelay(unsigned long nsecs) which uses the "jiffies" clock-speed estimate to sleep for at least long enough. IDK how accurate that is, or if it sometimes sleeps much longer than needed when clock speed is low, or if it updates the calibration as the CPU freq changes.)
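
(If you are in a module, those helpers are just ordinary calls; a minimal, untested sketch, with names that are mine:)

    // delay_demo.c -- hypothetical kernel-module sketch, untested
    #include <linux/module.h>
    #include <linux/init.h>
    #include <linux/delay.h>

    static int __init delay_demo_init(void)
    {
        ndelay(800);    /* busy-wait for at least ~800 ns */
        udelay(10);     /* busy-wait for at least ~10 us */
        mdelay(2);      /* busy-wait ~2 ms; prefer msleep() for delays this long */
        pr_info("delay_demo loaded\n");
        return 0;
    }

    static void __exit delay_demo_exit(void) { }

    module_init(delay_demo_init);
    module_exit(delay_demo_exit);
    MODULE_LICENSE("GPL");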


Your (inner) loop is totally predictable at 1 iteration per core clock cycle on recent Intel/AMD CPUs, whether or not there's a nop in it. It's under 4 fused-domain uops, so you bottleneck on the 1-per-clock loop throughput of your CPUs. (See Agner Fog's x86 microarch guide, or time it yourself for large iteration counts with perf stat ./a.out.) Unless there's competition from another hyperthread on the same physical core...

Or unless the inner loop spans a 32-byte boundary, on Skylake or Kaby Lake (loop buffer disabled by microcode updates to work around a design bug). Then your dec / jnz loop could run at 1 per 2 cycles because it would require fetching from 2 different uop-cache lines.

I'd recommend leaving out the nop to have a better chance of it being 1 per clock on more CPUs, too. You need to calibrate it anyway, so a larger code footprint isn't helpful (so leave out extra alignment, too). (Make sure calibration happens while CPU is at max turbo, if you need to ensure a minimum delay time.)

If your inner loop wasn't quite so small (e.g. more nops), see Is performance reduced when executing loops whose uop count is not a multiple of processor width? for details on front-end throughput when the uop count isn't a multiple of 8. SKL / KBL with disabled loop buffers run from the uop cache even for tiny loops.
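
If you want to verify the 1 iteration per clock figure (and calibrate a fixed-count loop) on your own machine, a rough harness like this (hypothetical file name, GNU C inline asm) run under perf stat should show cycles roughly equal to the iteration count:

    // loop_timing.c -- hypothetical calibration harness
    // gcc -O2 loop_timing.c -o loop_timing && perf stat ./loop_timing
    int main(void) {
        unsigned long counter = 1000000000UL;       // 1e9 iterations
        asm volatile("1:\n\t"
                     "dec %0\n\t"
                     "jnz 1b"
                     : "+r"(counter)
                     :
                     : "cc");                       // same dec/jnz loop, no nop
        return 0;
    }

Divide perf stat's reported cycles by the iteration count; anything close to 1.0 confirms one iteration per core clock, and the cycles-to-seconds ratio tells you what clock speed the loop actually ran at.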


But x86 doesn't have a fixed clock frequency (and transitions between frequency states stop the clock for ~20k clock cycles, i.e. ~8.5 us, on a Skylake CPU).

If running this with interrupts enabled, then interrupts are another unpredictable source of delays. (Even in kernel mode, Linux usually has interrupts enabled. An interrupts-disabled delay loop for tens of thousands of clock cycles seems like a bad idea.)

If running in user-space, then I hope you're using a kernel compiled with realtime support. But even then, Linux isn't fully designed for hard-realtime operation, so I'm not sure how good you can get.

System management mode interrupts are another source of delay that even the kernel doesn't know about. PERFORMANCE IMPLICATIONS OF SYSTEM MANAGEMENT MODE from 2013 says that 150 microseconds is considered an "acceptable" latency for an SMI, according to Intel's test suite for PC BIOSes. Modern PCs are full of voodoo. I think/hope that the firmware on most motherboards doesn't have much SMM overhead, and that SMIs are very rare in normal operation, but I'm not sure. See also Evaluating SMI (System Management Interrupt) latency on Linux-CentOS/Intel machine

Extremely low-power Skylake CPUs stop their clock with some duty-cycle, instead of clocking lower and running continuously. See this, and also Intel's IDF2015 presentation about Skylake power management.


Spin on RDTSC until the right wall-clock time

If you really need to busy-wait, spin on rdtsc waiting for the current time to reach a deadline. You need to know the reference frequency, which is not tied to the core clock, so it's fixed and nonstop on modern CPUs. There are CPUID feature bits for an invariant, nonstop TSC; Linux checks these, so you could look in /proc/cpuinfo for constant_tsc and nonstop_tsc, but really you should just check CPUID yourself at program startup and work out the RDTSC frequency (somehow...).
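
The startup check could look something like this (a sketch using GCC's <cpuid.h>; leaf 0x80000007, EDX bit 8 is the invariant-TSC bit):

    // tsc_check.c -- hypothetical startup check for an invariant TSC
    #include <stdio.h>
    #include <cpuid.h>

    int main(void) {
        unsigned eax, ebx, ecx, edx;
        if (__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx) && (edx & (1u << 8))) {
            puts("invariant TSC: ticks at a constant rate and keeps running in idle states");
        } else {
            puts("no invariant TSC: don't treat raw RDTSC deltas as wall-clock time");
        }
        return 0;
    }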

I wrote such a loop as part of a silly-computer-tricks exercise: a stopwatch in the fewest bytes of x86 machine code. Most of the code size is for the string manipulation to increment a 00:00:00 display and print it. I hard-coded the 4GHz RDTSC frequency for my CPU.

For sleeps of less than 2^32 reference clocks, you only need to look at the low 32 bits of the counter. If you do your compare correctly, wrap-around takes care of itself. For the 1-second stopwatch, a 4.3GHz CPU would have a problem, but for nsec / usec sleeps there's no issue.

 ;;; Untested,  NASM syntax

 default rel
 section .data
    ; RDTSC frequency in counts per 2^16 nanoseconds
    ; 3200000000 would be for a 3.2GHz CPU like your i3-3470

    ref_freq_fixedpoint: dd  3200000000 * (1<<16) / 1000000000

    ; The actual integer value is 0x033333
    ; which represents a fixed-point value of 3.1999969482421875 GHz
    ; use a different shift count if you like to get more fractional bits.
    ; I don't think you need 64-bit operand-size


 ; nanodelay(unsigned nanos /*edi*/)
 ; x86-64 System-V calling convention
 ; clobbers EAX, ECX, EDX, and EDI
 global nanodelay
 nanodelay:
      ; take the initial clock sample as early as possible.
      ; ideally even inline rdtsc into the caller so we don't wait for I$ miss.
      rdtsc                   ; edx:eax = current timestamp
      mov      ecx, eax       ; ecx = start
      ; lea ecx, [rax-30]    ; optionally bias the start time to account for overhead.  Maybe make this a variable stored with the frequency.

      ; then calculate edi = ref counts = nsec * ref_freq
      imul     edi, [ref_freq_fixedpoint]  ; counts * 2^16
      shr      edi, 16        ; actual counts, rounding down

.spinwait:                     ; do{
    pause         ; optional but recommended.
    rdtsc                      ;   edx:eax = reference cycles since boot
    sub      eax, ecx          ;   delta = now - start.  This may wrap, but the result is always a correct unsigned 0..n
    cmp      eax, edi          ; } while(delta < sleep_counts)
    jb     .spinwait

    ret

To avoid floating-point for the frequency calculation, I used fixed-point like uint32_t ref_freq_fixedpoint = 3.2 * (1<<16);. This means we just use an integer multiply and shift inside the delay loop. Use C code to set ref_freq_fixedpoint during startup with the right value for the CPU.

If you recompile this for each target CPU, the multiply constant can be an immediate operand for imul instead of loading from memory.
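
If you'd rather not hard-code the frequency at all, one option (a sketch of mine, not part of the original code) is to measure TSC ticks against CLOCK_MONOTONIC once at startup and derive ref_freq_fixedpoint from that:

    // calibrate.c -- hypothetical startup calibration for ref_freq_fixedpoint
    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>
    #include <x86intrin.h>                  // __rdtsc()

    uint32_t ref_freq_fixedpoint;           // TSC ticks per nanosecond, scaled by 2^16

    static void calibrate_tsc(void) {
        struct timespec t0, t1, req = { 0, 50 * 1000 * 1000 };   // ~50 ms sample
        clock_gettime(CLOCK_MONOTONIC, &t0);
        uint64_t tsc0 = __rdtsc();
        nanosleep(&req, NULL);
        uint64_t tsc1 = __rdtsc();
        clock_gettime(CLOCK_MONOTONIC, &t1);

        uint64_t ns = (t1.tv_sec - t0.tv_sec) * 1000000000ull
                    + (t1.tv_nsec - t0.tv_nsec);
        ref_freq_fixedpoint = (uint32_t)(((tsc1 - tsc0) << 16) / ns);
    }

    int main(void) {
        calibrate_tsc();
        printf("TSC ~ %.4f GHz (fixed-point 0x%x)\n",
               ref_freq_fixedpoint / 65536.0, ref_freq_fixedpoint);
        return 0;
    }

This only makes sense on a CPU with an invariant TSC (see the CPUID check above); otherwise the measured ratio depends on whatever clock speed the core happened to run at during the sample.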

pause sleeps for ~100 clocks on Skylake, but only for ~5 clocks on previous Intel uarches. So it hurts timing precision a bit, maybe sleeping up to 100 ns past a deadline when the CPU frequency is clocked down to ~1GHz. Or at a normal ~3GHz speed, more like up to +33ns.

Running continuously, this loop heated up one core of my Skylake i7-6700k at ~3.9GHz by ~15 degrees C without pause, but only by ~9 C with pause. (From a baseline of ~30C with a big CoolerMaster Gemini II heatpipe cooler, but low airflow in the case to keep fan noise low.)

Adjusting the start-time measurement to be earlier than it really is will let you compensate for some of the extra overhead, like branch-misprediction when leaving the loop, as well as the fact that the first rdtsc doesn't sample the clock until probably near the end of its execution. Out-of-order execution can let rdtsc run early; you might use lfence, or consider rdtscp, to stop the first clock sample from happening out-of-order ahead of instructions before the delay function is called.

Keeping the offset in a variable will let you calibrate the constant offset, too. If you can do this automatically at startup, that could be good to handle variations between CPUs. But you need some high-accuracy timer for that to work, and this is already based on rdtsc.

Inlining the first RDTSC into the caller and passing the low 32 bits as another function arg would make sure the "timer" starts right away even if there's an instruction-cache miss or other pipeline stall when calling the delay function. So the I$ miss time would be part of the delay interval, not extra overhead.
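
In C terms, that caller-samples-the-start idea is just an extra argument. A hypothetical variant built on GCC's __rdtsc() and _mm_pause() intrinsics:

    // Hypothetical: the caller grabs the start time itself, so any I-cache miss
    // or call overhead counts toward the delay instead of on top of it.
    #include <stdint.h>
    #include <x86intrin.h>                  // __rdtsc(), _mm_pause()

    extern uint32_t ref_freq_fixedpoint;    // set at startup, as above

    static inline void nanodelay_from(uint32_t start, uint32_t nanos) {
        uint32_t ticks = (uint32_t)(((uint64_t)nanos * ref_freq_fixedpoint) >> 16);
        while ((uint32_t)__rdtsc() - start < ticks)   // wrap-safe unsigned compare
            _mm_pause();
    }

    // usage: sample the clock first, then do whatever setup, then wait
    //   uint32_t t0 = (uint32_t)__rdtsc();
    //   setup_stuff();                      // hypothetical work
    //   nanodelay_from(t0, 500);            // deadline is 500 ns after t0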


The advantage of spinning on rdtsc:

If anything happens that delays execution, the loop still exits at the deadline, unless execution is currently blocked when the deadline passes (in which case you're screwed with any method).

So instead of burning exactly n cycles of CPU time, you burn CPU time until the current time is the requested number of nanoseconds later than when you first checked it.

With a simple counter delay loop, a delay that's long enough at 4GHz would make you sleep more than 4x too long at 0.8GHz (typical minimum frequency on recent Intel CPUs).

This does run rdtsc twice, so it's not appropriate for delays of only a couple nanoseconds. (rdtsc itself is ~20 uops, and has a throughput of one per 25 clocks on Skylake/Kaby Lake.) I think this is probably the least bad solution for a busy-wait of hundreds or thousands of nanoseconds, though.

Downside: a migration to another core with an unsynced TSC could result in sleeping for the wrong time. But unless your delays are very long, the migration time will be longer than the intended delay. The worst case is sleeping for the delay-time again after the migration. The way I do the compare, (now - start) < count instead of waiting for the counter to reach an absolute target, means unsigned wraparound simply makes the loop exit as soon as now - start becomes a huge number, so you can't get stuck sleeping for nearly a whole second while the counter wraps around.
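
A tiny illustration of why the (now - start) < count form survives wraparound (values made up):

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint32_t start = 0xFFFFFF00u;   // sampled just before the low 32 bits wrap
        uint32_t now   = 0x00000020u;   // sampled shortly after the wrap
        uint32_t count = 0x200;         // delay length in reference ticks

        // unsigned subtraction wraps too, so the delta is still the true elapsed ticks
        printf("elapsed = %u ticks\n", now - start);              // 0x120 = 288
        printf("keep spinning? %s\n", (now - start) < count ? "yes" : "no");
        return 0;
    }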

Downside: maybe you want to sleep for a certain number of core cycles, or to pause the count when the CPU is asleep.

Downside: old CPUs may not have a non-stop / invariant TSC. Check these CPUID feature bits at startup, and maybe use an alternate delay loop, or at least take it into account when calibrating. See also Get CPU cycle count? for my attempt at a canonical answer about RDTSC behaviour.


Future CPUs: use tpause on CPUs with the WAITPKG CPUID feature.

(I don't know which future CPUs are expected to have this.)

It's like pause, but puts the logical core to sleep until the TSC = the value you supply in EDX:EAX. So you could rdtsc to find out the current time, add / adc the sleep time scaled to TSC ticks to EDX:EAX, then run tpause.

Interestingly, it takes another input register where you can put a 0 for a deeper sleep (more friendly to the other hyperthread, probably drops back to single-thread mode), or 1 for faster wakeup and less power-saving.

You wouldn't want to use this to sleep for seconds; you'd want to hand control back to the OS. But you could do an OS sleep to get close to your target wakeup if it's far away, then mov ecx,1 or xor ecx,ecx / tpause ecx for whatever time is left.
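
With a compiler that exposes the WAITPKG intrinsics (GCC and Clang provide _tpause with -mwaitpkg, if I have the name right), that last step might look something like this untested sketch:

    // tpause_sketch.c -- hypothetical; needs a WAITPKG-capable CPU, else tpause faults
    // gcc -O2 -mwaitpkg tpause_sketch.c
    #include <stdint.h>
    #include <x86intrin.h>                  // __rdtsc(), _tpause()

    extern uint32_t ref_freq_fixedpoint;    // TSC ticks per ns, scaled by 2^16, as above

    static inline void tpause_until_ns(uint64_t nanos) {
        uint64_t ticks    = (nanos * ref_freq_fixedpoint) >> 16;
        uint64_t deadline = __rdtsc() + ticks;
        // ctrl = 0: deeper C0.2 sleep; ctrl = 1: lighter C0.1 sleep, faster wakeup.
        _tpause(0, deadline);               // may also wake early, e.g. on an interrupt
    }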

Semi-related (also part of the WAITPKG extension) are the even more fun umonitor / umwait, which (like privileged monitor/mwait) can have a core wake up when it sees a change to memory in an address range. For a timeout, it has the same wakeup on TSC = EDX:EAX as tpause.

  • A busy-loop based wait for very short times isn't too insane, even on x86, and indeed the Linux kernel calculates the bogomips value for this reason. Yes, you'll get outliers, but AFAIK they are all in the same direction: a longer wait than you wanted, and that's often the type of outlier which doesn't break the underlying reason for the wait: often you want to sleep at least T before doing something, say checking a hardware response, but longer is OK (although not desirable if it happens too frequently).
    – BeeOnRope
    Commented Apr 20, 2018 at 4:27
  • Perhaps more relevantly, if these "longer" outliers aren't acceptable, you are mostly screwed anyways since you can't really avoid them: the things that cause the outliers tend to pause the CPU entirely from the PoV of the user, so at best you can detect them, but not prevent them. So for sleeps which are on the order of the events in question, a busy loop seems about as good as any other approach. For longer sleeps, something like polling rdtsc starts to have considerably better QoI since you can cancel out delays and get a lot closer to your deadline.
    – BeeOnRope
    Commented Apr 20, 2018 at 4:29
  • Two reasons you might not be able to use rdtsc: if it isn't synced closely enough on different CPUs (although a context switch is probably going to blow your timing anyways), you might do something bad like sleep too short, or if you really want to ensure your delay is in CPU cycles not wall-clock time and/or you really want counting to stop when the CPU stops.
    – BeeOnRope
    Commented Apr 20, 2018 at 4:32
  • oh, good point about context switches with unsynced TSCs, and other use-cases.
    Commented Apr 20, 2018 at 4:35
  • Yeah, I agree. If you really wanted precision, you could always use pause; rdtsc in the primary loop, but when you calculate you have less than one pause; rdtsc iteration left before the deadline, drop into a very small delay-loop for the remaining time. Most of the problems of delay loops are eliminated if they are very short like that.
    – BeeOnRope
    Commented Apr 20, 2018 at 7:20
