Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retain base languages for compiled data #5195

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

robertbastian
Copy link
Member

@robertbastian robertbastian commented Jul 8, 2024

#58

Copy link
Member Author

@robertbastian robertbastian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall it looks like it's roughly +50% for the lookup trie, in some cases it's much more, in few cases less. Highlighted some cases where it by far exceeds the previous size, or the size of the actual data.

@@ -4,7 +4,7 @@
/// `icu`'s `_unstable` constructors.
///
/// Using this implementation will embed the following data in the binary's data segment:
/// * 41B for the lookup data structure (2 data identifiers)
/// * 606B for the lookup data structure (151 data identifiers)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not great

@@ -4,7 +4,7 @@
/// `icu`'s `_unstable` constructors.
///
/// Using this implementation will embed the following data in the binary's data segment:
/// * 650B for the lookup data structure (117 data identifiers)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not great

@@ -4,7 +4,7 @@
/// `icu`'s `_unstable` constructors.
///
/// Using this implementation will embed the following data in the binary's data segment:
/// * 130B for the lookup data structure (25 data identifiers)
/// * 631B for the lookup data structure (156 data identifiers)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not great

@jedel1043
Copy link
Contributor

jedel1043 commented Jul 8, 2024

Overall it looks like it's roughly +50% for the lookup trie, in some cases it's much more, in few cases less. Highlighted some cases where it by far exceeds the previous size, or the size of the actual data.

Now that DryDataProvider is a thing, maybe we could move the duplicated locales to it, then always do full deduplication for the DataProvider impl? That could enable DCE for people who don't use dry_load. However, it would make dry_load and load differ on the locale, but maybe that's fine?

@robertbastian
Copy link
Member Author

You mean having separate lookup tries for dry_load and load? That will mean duplication for people who use dry_load. It also means there will be different behaviour between load and dry_load

@jedel1043
Copy link
Contributor

jedel1043 commented Jul 8, 2024

Yeah, but I also was thinking that the Rust compiler cannot deduplicate between DataProvider impls because they don't share the same data, but maybe the compiler could deduplicate between DryDataProvider impls since the metadata is basically the same in a lot of places.

About the differing behaviour, I think it's fine. The trait just guarantees that if DataProvider doesn't fail, then DryDataProvider must not fail for the same request, but I think it wouldn't be out of place to not guarantee that the returned locale will also be the same between them.

@robertbastian robertbastian marked this pull request as ready for review July 8, 2024 19:20
@robertbastian robertbastian requested a review from a team as a code owner July 8, 2024 19:20
@sffc
Copy link
Member

sffc commented Jul 8, 2024

We had a hypothesis at some point that retain-base-languages might be more efficient since it saves a hop during locale fallbacking. Do we have figures for how much it saves on average? Maybe based on one of our examples?

@sffc
Copy link
Member

sffc commented Jul 8, 2024

I think @jedel1043's idea is worth investigating. Maybe a hybrid approach where there are 2 tries, one that has maximal deduplication lookup and the other that holds the diff between retain-base-languages and maximal.

@sffc
Copy link
Member

sffc commented Jul 8, 2024

Or actually, all of the DryDataProvider impls could just use the same trie, which always contains only the base langauges, and all of the impls generated from the same datagen invocation can use the exact same trie.

@robertbastian robertbastian added the discuss Discuss at a future ICU4X-SC meeting label Jul 9, 2024
@sffc
Copy link
Member

sffc commented Jul 9, 2024

Some discussion:

  • @robertbastian - DryDataProvider can be achieved without retain-base-languages, this PR is basically a lookup size/speed tradeoff
  • @sffc - I would generally trade a 10% size increase for a 25% speed increase, or a 1-2% size increase for a 5% speed increase. I would be happy if we made "load formatter and use it once" call sites faster, and data loading is still a nontrivial portion of the time spent in those call sites.
  • @robertbastian - % of what? It really depends on the usage pattern, how many fallback vs non-fallback locales, which ICU4X APIs, etc
  • @Manishearth - I don't think compiled data loading is 25% of a typical ICU4X usage, so Amdahl's law stops us from getting speedups that big. It may be for usage patterns where people never retain formatter objects ("load formatter and use it once"), but I'd like to see benchmarks, and people in those situations have other optimization tradeoffs available to them (cache the objects, which may give them a small size increase trading off for fast lookups)
  • @sffc - We shouldn't add these impls unless we have base languages, because they are useless without base languages.
  • @robertbastian - I don't want to conditionally add them based on deduplication strategy.
  • @sffc - Then let's export the macros without calling the macros on provider::Baked.
@sffc
Copy link
Member

sffc commented Jul 9, 2024

This PR doesn't add DryDataProvider impls yet, right? It's just changing the deduplication mode?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discuss Discuss at a future ICU4X-SC meeting
3 participants