
Questions tagged [assembly]

Assembly language questions. Please tag the processor and/or the instruction set you are using, as well as the assembler. A valid set of tags looks like this: [assembly] [x86] [gnu-assembler] or [att]. Use the [.net-assembly] tag instead for .NET assemblies, [cil] for .NET assembly language, [wasm] for WebAssembly, and [java-bytecode-asm] for Java bytecode.

10,531 questions with no upvoted or accepted answers
13 votes
0 answers
423 views

Why does a NOP (as a 5th uop) speed up a 4 uop loop on Ice Lake?

All benchmarks are done on Ice Lake: Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz (ark). Edit: I was not able to reproduce this on Broadwell, and @PeterCordes was unable to reproduce it on Skylake. I was ...
Noah
  • 1,759
10 votes
0 answers
2k views

Difference between VMOVDQA and VMOVAPS?

I read the ISA reference and I am clear that the 2 instructions differ in the type of value they load (integer vs single precision float). What I don't understand is that the effect of the load is ...
Anmol Sahoo
9 votes
0 answers
204 views

Are there processors on which VPMASKMOVD generates faults for the masked-out elements?

Are there processors on which VPMASKMOVD generates faults for the masked-out elements? Going by the Intel Software Developer's Manual, the answer is plainly "no": Faults occur only due to ...
user555045
  • 63.9k
9 votes
0 answers
300 views

Why does gcc -O3 produce wildly different assembly for the same function?

I have a loop in a physics engine which detects collisions like this: // now check for collisions // we only allow 1 collision per 2 particles per frame so the // one with the lower index will ...
brenzo
  • 769
9 votes
1 answer
460 views

Best way to do a packed 16 element blend using SSE

I would like to implement the following function using SSE. It blends elements from a with packed elements from b, where elements are only present if they are used. void packedBlend16(uint8_t mask, ...
Nick
  • 397
9 votes
0 answers
751 views

What's up with the "half fence" behavior of rdtscp?

For many years x86 CPUs supported the rdtsc instruction, which reads the "time stamp counter" of the current CPU. The exact definition of this counter has changed over time, but on recent CPUs it is a ...
BeeOnRope
  • 63.1k
9 votes
0 answers
561 views

Intellisense warning that it can't find function definition for assembly function

In my MSVC 2015 project I have a function, int foo(int, int) which is implemented in an .asm file. When I extern "C" declare this function in a .cpp file in the same project, Intellisense complains ...
BeeOnRope
  • 63.1k
9 votes
0 answers
5k views

ld: Undefined symbols for architecture x86_64

I have made a nasm assembly hello world program like this: global start section .text start: mov rax, 0x20000004 mov rdi, 1 lea rsi, [rel msg] mov rdx, msg.len syscall mov ...
Jerfov2
  • 5,455
8 votes
0 answers
272 views

Golang goroutine preemption

I was wondering how Golang preempts goroutines since version 1.14, where the scheduler became non-cooperative. I studied the source code, but it seems my knowledge is not enough to comprehend ...
toozyfuzzy
  • 1,198
8 votes
1 answer
153 views

Why does GCC fail to reduce a loop that increments two locations of the same buffer?

Here is a bounded loop that increments two locations of the same buffer. unsigned int getid(); void foo(unsigned int *counter, unsigned int n) { unsigned int A = getid(); unsigned int ...
AceSrc
  • 99
8 votes
0 answers
165 views

Why does newer clang generate one more instruction than just popcntl to count the bits of an int on the Haswell architecture?

While watching this talk by Matt Godbolt, I was astonished to see that Clang, if instructed to compile for the Haswell¹ architecture, works out that the following code int foo(int a) { int count = ...
Enlico
  • 26.7k
8 votes
0 answers
826 views

Optimizing cumulative sum

I need some help to understand how an optimization I tried is even working. The cumsum function gets a vector, and writes a vector with the accumulated sum. I tried the following to optimize this: ...
8 votes
0 answers
556 views

gdb tui -- turn off printing function parameters for asm layout

Gdb in tui mode in asm layout prints something like: <address+0> <namespace:func(int, int, ..... many many many parameters)+0> instruction1 <address+4> <namespace:func(int, int, ....
JenyaKh
  • 2,268
8 votes
0 answers
296 views

Why is x/10 optimized with an unnecessary shift when x has a restricted range?

I have this function: long long int divideBy10(long long int a){ return a / 10; } It's compiled to: mov rax, rdi movabs rcx, 7378697629483820647 imul rcx ...
Slei
  • 111
8 votes
0 answers
1k views

Usage of instruction pxor before SSE instruction cvtsi2ss

I am currently writing various implementations of a color to black/white image converter. I would like to do a : Simple C++ implementation Self made ASM implementation Self made ASM implementation ...
Sydney Hauke
