Questions tagged [assembly]
Assembly language questions. Please tag the processor and/or the instruction set you are using, as well as the assembler. A valid set of tags looks like this: ([assembly] [x86] [gnu-assembler] or [att]). Use the [.net-assembly] tag instead for .NET assemblies, [cil] for .NET assembly language, [wasm] for WebAssembly, and [java-bytecode-asm] for Java bytecode.
10,531 questions with no upvoted or accepted answers
13 votes, 0 answers, 423 views
Why does a NOP (as a 5th uop) speed up a 4 uop loop on Ice Lake?
All benchmarks were run on Ice Lake: Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz (ark)
Edit: I was not able to reproduce this on Broadwell, and @PeterCordes was unable to reproduce it on Skylake.
I was ...
10 votes, 0 answers, 2k views
Difference between VMOVDQA and VMOVAPS?
I read the ISA reference and I am clear that the 2 instructions differ in the type of value they load (integer vs single precision float). What I don't understand is that the effect of the load is ...
9 votes, 0 answers, 204 views
Are there processors on which VPMASKMOVD generates faults for the masked-out elements?
Going by the Intel Software Developer's Manual, the answer is plainly "no":
Faults occur only due to ...
9 votes, 0 answers, 300 views
Why does gcc -O3 produce wildly different assembly for the same function?
I have a loop in a physics engine which detects collisions like this:
// now check for collisions
// we only allow 1 collision per 2 particles per frame so the
// one with the lower index will ...
9 votes, 1 answer, 460 views
Best way to do a packed 16 element blend using SSE
I would like to implement the following function using SSE. It blends elements from a with packed elements from b, where elements are only present if they are used.
void packedBlend16(uint8_t mask, ...
9 votes, 0 answers, 751 views
What's up with the "half fence" behavior of rdtscp?
For many years x86 CPUs supported the rdtsc instruction, which reads the "time stamp counter" of the current CPU. The exact definition of this counter has changed over time, but on recent CPUs it is a ...
9 votes, 0 answers, 561 views
Intellisense warning that it can't find function definition for assembly function
In my MSVC 2015 project I have a function, int foo(int, int) which is implemented in an .asm file. When I extern "C" declare this function in a .cpp file in the same project, Intellisense complains ...
9 votes, 0 answers, 5k views
ld: Undefined symbols for architecture x86_64
I have made a nasm assembly hello world program like this:
global start
section .text
start:
mov rax, 0x20000004
mov rdi, 1
lea rsi, [rel msg]
mov rdx, msg.len
syscall
mov ...
8 votes, 0 answers, 272 views
Golang goroutine preemption
I was wondering how Go preempts goroutines since version 1.14, where the scheduler became non-cooperative. I studied the source code, but it seems my knowledge is not enough to comprehend ...
8 votes, 1 answer, 153 views
Why does GCC fail to reduce a loop that increments two locations of the same buffer?
Here is a bounded loop that increments two locations of the same buffer.
unsigned int getid();
void foo(unsigned int *counter, unsigned int n) {
unsigned int A = getid();
unsigned int ...
8 votes, 0 answers, 165 views
Why is newer Clang generating one more instruction than just popcntl to count the bits of an int on the Haswell architecture?
While watching this talk by Matt Godbolt, I was astonished to see that Clang, if instructed to compile for the Haswell¹ architecture, works out that the following code
int foo(int a) {
int count = ...
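The question's own code is cut off above, so the following is only a minimal sketch of the kind of bit-counting loop that compilers are commonly reported to turn into a single popcount instruction. The function name `count_set_bits` and the use of `unsigned` are assumptions made for illustration, not the asker's code.

```c
/* Hypothetical illustration: count set bits by repeatedly clearing
 * the lowest set bit (often attributed to Kernighan). */
unsigned count_set_bits(unsigned value) {
    unsigned count = 0;
    while (value != 0) {
        value &= value - 1;  /* clears the lowest set bit */
        ++count;
    }
    return count;
}
```

Compiling a function like this with something like `-O2 -march=haswell` and inspecting the output is one way to check whether a given compiler version emits `popcnt` alone or adds extra instructions around it.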
8 votes, 0 answers, 826 views
Optimizing cumulative sum
I need some help to understand how an optimization I tried is even working.
The cumsum function gets a vector, and writes a vector with the accumulated sum.
I tried the following to optimize this: ...
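For reference, a plain scalar cumulative sum can be sketched as below. This is not the asker's optimized version (which is truncated above); the name `cumsum_ref`, the `float` element type, and the separate input and output pointers are assumptions.

```c
#include <stddef.h>

/* Hypothetical baseline: out[i] = in[0] + in[1] + ... + in[i].
 * Each iteration depends on the previous running total, which is
 * what makes this loop awkward to vectorize directly. */
void cumsum_ref(const float *in, float *out, size_t n) {
    float running = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        running += in[i];
        out[i] = running;
    }
}
```

A baseline like this is mainly useful for checking that an optimized variant still produces the same results on small inputs.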
8 votes, 0 answers, 556 views
gdb tui -- turn off printing function parameters for asm layout
GDB in TUI mode with the asm layout prints something like:
<address+0> <namespace:func(int, int, ..... many many many parameters)+0> instruction1
<address+4> <namespace:func(int, int, ....
8 votes, 0 answers, 296 views
Why is x/10 optimized with an unnecessary shift when x has a restricted range?
I have this function
long long int divideBy10(long long int a){
return a / 10;
}
it's compiled to:
mov rax, rdi
movabs rcx, 7378697629483820647
imul rcx
...
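The constant 7378697629483820647 in the listing equals ceil(2^66 / 10), which suggests the usual multiply-high-then-shift replacement for signed division by a constant. The sketch below writes that technique out in C under some assumptions: it relies on the `__int128` extension of GCC and Clang, it assumes right shifts of negative values are arithmetic (true for those compilers), and the function name is hypothetical. The remainder of the compiler's output is truncated above, so this is not claimed to match it instruction for instruction.

```c
#include <stdint.h>

/* Hypothetical sketch: signed a / 10 without a divide instruction. */
int64_t divide_by_10_mulhi(int64_t a) {
    const int64_t magic = 7378697629483820647LL;  /* ceil(2^66 / 10) */
    __int128 product = (__int128)a * magic;       /* full 128-bit product */
    int64_t high = (int64_t)(product >> 64);      /* upper 64 bits, like rdx after imul */
    int64_t quotient = high >> 2;                 /* total shift of 66 bits */
    quotient += (int64_t)((uint64_t)a >> 63);     /* add 1 for negative a: round toward zero */
    return quotient;
}
```

The shift by 2 after taking the high half is the step the question title calls unnecessary for a restricted input range. With the multiplier fixed at ceil(2^66 / 10), it is needed to complete the division by 2^66.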
8 votes, 0 answers, 1k views
Usage of instruction pxor before SSE instruction cvtsi2ss
I am currently writing various implementations of a color to black/white image converter. I would like to write a:
Simple C++ implementation
Self made ASM implementation
Self made ASM implementation ...