Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: convert global opcode string pointers to offsets #6206

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

dvyukov
Copy link
Contributor

@dvyukov dvyukov commented Jul 15, 2023

The goal of this change is to reduce drdecode library size. Static library size is reduced by ~200KB (18%) and runtime VM footprint is reduced by ~50KB (10%).

    FILE SIZE        VM SIZE
 --------------  --------------
  +139% +36.2Ki  +140% +36.2Ki    .rodata
  +0.4%    +100  [ = ]       0    .strtab
  +6.1%     +79  [ = ]       0    .comment
  +0.2%     +72  [ = ]       0    .symtab
  +5.9%     +64  [ = ]       0    [ELF Header]
  +4.7%     +60  [ = ]       0    [AR Headers]
  +0.1%     +48  [ = ]       0    .rela.text
  +7.3%      +4  [ = ]       0    .llvm_addrsig
  -0.2%    -220  -0.2%    -211    .text
  -0.7%    -576  [ = ]       0    [ELF Section Headers]
 -87.2% -28.2Ki -87.2% -28.2Ki    .rodata.str1.1
 -19.8% -62.9Ki -19.8% -62.9Ki    .data.rel.ro
 -63.5%  -151Ki  [ = ]       0    .rela.data.rel.ro
 -18.6%  -206Ki -10.5% -55.1Ki    TOTAL

Any pointers in .data/.rodata take additional 24 bytes in the binary for a relocation (on top of the pointer and pointee themselves).
This change replaces opcode names in decoding tables with offsets into a specially created string that contains all opcodes. This removes the 24-byte relocation and reduces 8-byte pointer with a 2-byte offset (though, this is not necessary realized in this change due to alignment padding, but can be).

The idea behind the approach is to keep the source code as readable and as maintainable as possible, while still replacing direct string literal pointers with offsets.

The crux of the approach is as follows.
A string "foo" is replaced with STROFF(foo).
A code generator extracts all STROFF(.*) from source files and generates a single global concatenated string that contains all individual strings. Plus it generates STROFF_foo macro that expands to the offset of the string "foo" in the global string. With the addition of:

#define STROFF(foo) STROFF_ ## foo

the source code does not need any additional modifications.

The downside is that strings cannot contain any characters that cannot appear in C identifiers. This is not a problem for real opcodes, but affects some fake internal names (e.g. "(bad)" is converted to "BAD").

If this change looks good in general, I will work on productionizing it:

  • convert other arches
  • integrate code generator into the build system
  • rewrite the script in Python

Also:

  • does the STROFF prefix for everything looks good? should I change it to something else?
  • suggestions for conversion rules for current strings that are not valid C identifiers are welcome
  • potentially all strings can be kept as is, if we do: STROFF(bad, "(bad)"), normal opcodes can still be STROFF(mov), but not sure if it's worth it (?)
@dvyukov
Copy link
Contributor Author

dvyukov commented Jul 15, 2023

CC @melver

The goal of this change is to reduce drdecode library size.
Static library size is reduced by ~200KB (18%) and
runtime VM footprint is reduced by ~50KB (10%).

    FILE SIZE        VM SIZE
 --------------  --------------
  +139% +36.2Ki  +140% +36.2Ki    .rodata
  +0.4%    +100  [ = ]       0    .strtab
  +6.1%     +79  [ = ]       0    .comment
  +0.2%     +72  [ = ]       0    .symtab
  +5.9%     +64  [ = ]       0    [ELF Header]
  +4.7%     +60  [ = ]       0    [AR Headers]
  +0.1%     +48  [ = ]       0    .rela.text
  +7.3%      +4  [ = ]       0    .llvm_addrsig
  -0.2%    -220  -0.2%    -211    .text
  -0.7%    -576  [ = ]       0    [ELF Section Headers]
 -87.2% -28.2Ki -87.2% -28.2Ki    .rodata.str1.1
 -19.8% -62.9Ki -19.8% -62.9Ki    .data.rel.ro
 -63.5%  -151Ki  [ = ]       0    .rela.data.rel.ro
 -18.6%  -206Ki -10.5% -55.1Ki    TOTAL

Any pointers in .data/.rodata take additional 24 bytes
in the binary for a relocation (on top of the pointer and
pointee themselves).
This change replaces opcode names in decoding tables with
offsets into a specially created string that contains all opcodes.
This removes the 24-byte relocation and reduces 8-byte pointer
with a 2-byte offset (it takes the alignment padding
in instr_info_t, so effectively -8 bytes).

The idea behind the approach is to keep the source code
as readable and as maintainable as possible, while still
replacing direct string literal pointers with offsets.

The crux of the approach is as follows.
A string "foo" is replaced with STROFF(foo).
A code generator extracts all STROFF(.*) from source files
and generates a single global concatenated string that contains
all individual strings. Plus it generates STROFF_foo macro that
expands to the offset of the string "foo" in the global string.
With the addition of:
	define STROFF(foo) STROFF_ ## foo
the source code does not need any additional modifications.

The downside is that strings cannot contain any characters
that cannot appear in C identifiers. This is not a problem
for real opcodes, but affects some fake internal names
(e.g. "(bad)" is converted to "BAD").

If this change looks good in general, I will work on productionizing it:
 - convert other arches
 - integrate code generator into the build system
 - rewrite the script in Python

Also:
 - does the STROFF prefix for everything looks good?
   should I change it to something else?
 - suggestions for conversion rules for current strings
   that are not valid C identifiers are welcome
 - potentially all strings can be kept as is, if we do:
	STROFF(bad, "(bad)")
   normal opcodes can still be STROFF(mov),
   but not sure if it's worth it
dvyukov added a commit to dvyukov/dynamorio that referenced this pull request Jul 17, 2023
Change instr_info_t::code from ptr_int_t to uint.
This strips ~128K from drdecode binary and ~50K
from runtime VM overhead:

    FILE SIZE        VM SIZE
 --------------  --------------
   +21% +5.60Ki   +22% +5.61Ki    .rodata
  +1.5%    +720  [ = ]       0    .symtab
  +1.8%    +431  [ = ]       0    .strtab
  +0.5%    +384  [ = ]       0    [ELF Section Headers]
  +0.5%    +264  [ = ]       0    .rela.text
  +0.1%    +117  +0.1%    +137    .text
   +31%     +17  [ = ]       0    .llvm_addrsig
  +0.2%     +15  [ = ]       0    [Unmapped]
  +0.0%      +4  [ = ]       0    .rodata.str1.1
 -19.3% -61.3Ki -19.3% -61.3Ki    .data.rel.ro
 -31.0% -73.9Ki  [ = ]       0    .rela.data.rel.ro
 -11.5%  -127Ki -10.6% -55.6Ki    TOTAL

The idea is to change the pointer to table number
+ index in the table encoded as a single uint.
This removes the need in runtime relocations for pointers
(each taking 24 bytes in the binary).
Along with DynamoRIO#6206 this should also shrink sizeof(instr_info_t)
from 48 bytes to 32 bytes (not fully realized w/o DynamoRIO#6206 due
to alignment).

If this change looks good in general,
I will work on productionizing it and update other arches.
Format of the uint encoding is discussable.
For example, if we collect all instr_info_t into a single
global array (and then make individual tables pointers into it),
then the encoding can be made simpler and faster,
but will require more significant changes.
@dvyukov dvyukov marked this pull request as draft July 17, 2023 13:15
@@ -41,6 +41,7 @@
#define DECODE_H
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@abhinav92003 and I discussed this one.

  • We assume PIC/PIE is a requirement for your use case and building a version of the library w/o relocations is not a possible solution?

  • Are there possible compiler optimizations to help here: I guess duplicate string literal and string shared suffix optimizations don't help much and we'd need the compiler to convert our pointer uses into offsets.

  • The code readability impact here isn't too bad; mostly it's a maintenance burden as now a script has to be run. The script would be run as part of the build process, or a dev would run it manually whenever adding new entries?

  • but affects some fake internal names (e.g. "(bad)" is converted to "BAD").

    Could we have some indication that "BAD" is not really an opcode name? "_BAD_"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth getting feedback from @khuey as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We assume PIC/PIE is a requirement for your use case and building a version of the library w/o relocations is not a possible solution?

Yes, we need to support all build modes.

Are there possible compiler optimizations to help here

I can't think of any. The main problem is with relocations rather than deduping/etc.

The script would be run as part of the build process, or a dev would run it manually whenever adding new entries?

The intention is that it will run automatically, but I din't look into how easy it is to integrate into cmake, I assume should have required support.

Could we have some indication that "BAD" is not really an opcode name? "BAD"?

We sure can adjust them any way you want (and these will also be 100% chanable in future patches).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One other option is to make name char[16] (embed right in the struct).
Longest real x86 instruction I see are vpbroadcastmw2d/vbroadcasti64x4/vscatterpf1dpd. So 16 should be enough.
The compiler won't ensure that it's 0-terminated, but it can be reliably verified in tests.

Copy link
Contributor

@derekbruening derekbruening Jul 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One other option is to make name char[16] (embed right in the struct).

I like that: but it doesn't save as much space?

If we have 6778 table entries: the original proposal removed the 24 bytes of relocs plus 6 bytes for pointer-to-2-byte-offset which is 6778*30 = 203340 == your 200K savings.

Here we'd go from the 8-byte pointer to a 16-byte string, but removing the 24 reloc bytes, for a net reduction of 6778*16 = 108448 == just 100K. But the strings themselves disappear: so if the strings averaged 6 chars with newline we're saving 6778*22 = 149116. Is 150K savings enough?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. I did a quick prototype and this version saves 76K on top of char[16] version:

    FILE SIZE        VM SIZE    
 --------------  -------------- 
   +43% +18.9Ki   +44% +18.9Ki    .rodata
 -27.3% -96.1Ki -27.3% -96.1Ki    .data.rel.ro
  -7.8% -76.1Ki -14.1% -77.1Ki    TOTAL
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@khuey do you have any feedback/objections to this and #6211?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
2 participants