-
Notifications
You must be signed in to change notification settings - Fork 552
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC: convert global opcode string pointers to offsets #6206
base: master
Are you sure you want to change the base?
Conversation
CC @melver |
The goal of this change is to reduce drdecode library size. Static library size is reduced by ~200KB (18%) and runtime VM footprint is reduced by ~50KB (10%). FILE SIZE VM SIZE -------------- -------------- +139% +36.2Ki +140% +36.2Ki .rodata +0.4% +100 [ = ] 0 .strtab +6.1% +79 [ = ] 0 .comment +0.2% +72 [ = ] 0 .symtab +5.9% +64 [ = ] 0 [ELF Header] +4.7% +60 [ = ] 0 [AR Headers] +0.1% +48 [ = ] 0 .rela.text +7.3% +4 [ = ] 0 .llvm_addrsig -0.2% -220 -0.2% -211 .text -0.7% -576 [ = ] 0 [ELF Section Headers] -87.2% -28.2Ki -87.2% -28.2Ki .rodata.str1.1 -19.8% -62.9Ki -19.8% -62.9Ki .data.rel.ro -63.5% -151Ki [ = ] 0 .rela.data.rel.ro -18.6% -206Ki -10.5% -55.1Ki TOTAL Any pointers in .data/.rodata take additional 24 bytes in the binary for a relocation (on top of the pointer and pointee themselves). This change replaces opcode names in decoding tables with offsets into a specially created string that contains all opcodes. This removes the 24-byte relocation and reduces 8-byte pointer with a 2-byte offset (it takes the alignment padding in instr_info_t, so effectively -8 bytes). The idea behind the approach is to keep the source code as readable and as maintainable as possible, while still replacing direct string literal pointers with offsets. The crux of the approach is as follows. A string "foo" is replaced with STROFF(foo). A code generator extracts all STROFF(.*) from source files and generates a single global concatenated string that contains all individual strings. Plus it generates STROFF_foo macro that expands to the offset of the string "foo" in the global string. With the addition of: define STROFF(foo) STROFF_ ## foo the source code does not need any additional modifications. The downside is that strings cannot contain any characters that cannot appear in C identifiers. This is not a problem for real opcodes, but affects some fake internal names (e.g. "(bad)" is converted to "BAD"). If this change looks good in general, I will work on productionizing it: - convert other arches - integrate code generator into the build system - rewrite the script in Python Also: - does the STROFF prefix for everything looks good? should I change it to something else? - suggestions for conversion rules for current strings that are not valid C identifiers are welcome - potentially all strings can be kept as is, if we do: STROFF(bad, "(bad)") normal opcodes can still be STROFF(mov), but not sure if it's worth it
Change instr_info_t::code from ptr_int_t to uint. This strips ~128K from drdecode binary and ~50K from runtime VM overhead: FILE SIZE VM SIZE -------------- -------------- +21% +5.60Ki +22% +5.61Ki .rodata +1.5% +720 [ = ] 0 .symtab +1.8% +431 [ = ] 0 .strtab +0.5% +384 [ = ] 0 [ELF Section Headers] +0.5% +264 [ = ] 0 .rela.text +0.1% +117 +0.1% +137 .text +31% +17 [ = ] 0 .llvm_addrsig +0.2% +15 [ = ] 0 [Unmapped] +0.0% +4 [ = ] 0 .rodata.str1.1 -19.3% -61.3Ki -19.3% -61.3Ki .data.rel.ro -31.0% -73.9Ki [ = ] 0 .rela.data.rel.ro -11.5% -127Ki -10.6% -55.6Ki TOTAL The idea is to change the pointer to table number + index in the table encoded as a single uint. This removes the need in runtime relocations for pointers (each taking 24 bytes in the binary). Along with DynamoRIO#6206 this should also shrink sizeof(instr_info_t) from 48 bytes to 32 bytes (not fully realized w/o DynamoRIO#6206 due to alignment). If this change looks good in general, I will work on productionizing it and update other arches. Format of the uint encoding is discussable. For example, if we collect all instr_info_t into a single global array (and then make individual tables pointers into it), then the encoding can be made simpler and faster, but will require more significant changes.
@@ -41,6 +41,7 @@ | |||
#define DECODE_H |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@abhinav92003 and I discussed this one.
-
We assume PIC/PIE is a requirement for your use case and building a version of the library w/o relocations is not a possible solution?
-
Are there possible compiler optimizations to help here: I guess duplicate string literal and string shared suffix optimizations don't help much and we'd need the compiler to convert our pointer uses into offsets.
-
The code readability impact here isn't too bad; mostly it's a maintenance burden as now a script has to be run. The script would be run as part of the build process, or a dev would run it manually whenever adding new entries?
-
but affects some fake internal names (e.g. "(bad)" is converted to "BAD").
Could we have some indication that
"BAD"
is not really an opcode name?"_BAD_"
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Worth getting feedback from @khuey as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We assume PIC/PIE is a requirement for your use case and building a version of the library w/o relocations is not a possible solution?
Yes, we need to support all build modes.
Are there possible compiler optimizations to help here
I can't think of any. The main problem is with relocations rather than deduping/etc.
The script would be run as part of the build process, or a dev would run it manually whenever adding new entries?
The intention is that it will run automatically, but I din't look into how easy it is to integrate into cmake, I assume should have required support.
Could we have some indication that "BAD" is not really an opcode name? "BAD"?
We sure can adjust them any way you want (and these will also be 100% chanable in future patches).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One other option is to make name char[16]
(embed right in the struct).
Longest real x86 instruction I see are vpbroadcastmw2d/vbroadcasti64x4/vscatterpf1dpd. So 16 should be enough.
The compiler won't ensure that it's 0-terminated, but it can be reliably verified in tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One other option is to make name char[16] (embed right in the struct).
I like that: but it doesn't save as much space?
If we have 6778 table entries: the original proposal removed the 24 bytes of relocs plus 6 bytes for pointer-to-2-byte-offset which is 6778*30 = 203340 == your 200K savings.
Here we'd go from the 8-byte pointer to a 16-byte string, but removing the 24 reloc bytes, for a net reduction of 6778*16 = 108448 == just 100K. But the strings themselves disappear: so if the strings averaged 6 chars with newline we're saving 6778*22 = 149116. Is 150K savings enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. I did a quick prototype and this version saves 76K on top of char[16] version:
FILE SIZE VM SIZE
-------------- --------------
+43% +18.9Ki +44% +18.9Ki .rodata
-27.3% -96.1Ki -27.3% -96.1Ki .data.rel.ro
-7.8% -76.1Ki -14.1% -77.1Ki TOTAL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The goal of this change is to reduce drdecode library size. Static library size is reduced by ~200KB (18%) and runtime VM footprint is reduced by ~50KB (10%).
Any pointers in .data/.rodata take additional 24 bytes in the binary for a relocation (on top of the pointer and pointee themselves).
This change replaces opcode names in decoding tables with offsets into a specially created string that contains all opcodes. This removes the 24-byte relocation and reduces 8-byte pointer with a 2-byte offset (though, this is not necessary realized in this change due to alignment padding, but can be).
The idea behind the approach is to keep the source code as readable and as maintainable as possible, while still replacing direct string literal pointers with offsets.
The crux of the approach is as follows.
A string "foo" is replaced with STROFF(foo).
A code generator extracts all STROFF(.*) from source files and generates a single global concatenated string that contains all individual strings. Plus it generates STROFF_foo macro that expands to the offset of the string "foo" in the global string. With the addition of:
the source code does not need any additional modifications.
The downside is that strings cannot contain any characters that cannot appear in C identifiers. This is not a problem for real opcodes, but affects some fake internal names (e.g. "(bad)" is converted to "BAD").
If this change looks good in general, I will work on productionizing it:
Also: