Retain base languages for compiled data #5195

robertbastian · 2024-07-08T15:42:21Z

robertbastian

Overall it looks like it's roughly +50% for the lookup trie, in some cases it's much more, in few cases less. Highlighted some cases where it by far exceeds the previous size, or the size of the actual data.

robertbastian · 2024-07-08T16:25:39Z

provider/data/collator/data/collation_diacritics_v1_marker.rs.data

@@ -4,7 +4,7 @@
 /// `icu`'s `_unstable` constructors.
 ///
 /// Using this implementation will embed the following data in the binary's data segment:
-/// * 41B for the lookup data structure (2 data identifiers)
+/// * 606B for the lookup data structure (151 data identifiers)


robertbastian · 2024-07-08T16:27:16Z

provider/data/datetime/data/buddhist_year_names_v1_marker.rs.data

@@ -4,7 +4,7 @@
 /// `icu`'s `_unstable` constructors.
 ///
 /// Using this implementation will embed the following data in the binary's data segment:
-/// * 650B for the lookup data structure (117 data identifiers)


robertbastian · 2024-07-08T16:30:38Z

provider/data/plurals/data/plural_ranges_v1_marker.rs.data

@@ -4,7 +4,7 @@
 /// `icu`'s `_unstable` constructors.
 ///
 /// Using this implementation will embed the following data in the binary's data segment:
-/// * 130B for the lookup data structure (25 data identifiers)
+/// * 631B for the lookup data structure (156 data identifiers)


jedel1043 · 2024-07-08T17:38:04Z

Overall it looks like it's roughly +50% for the lookup trie, in some cases it's much more, in few cases less. Highlighted some cases where it by far exceeds the previous size, or the size of the actual data.

Now that DryDataProvider is a thing, maybe we could move the duplicated locales to it, then always do full deduplication for the DataProvider impl? That could enable DCE for people who don't use dry_load. However, it would make dry_load and load differ on the locale, but maybe that's fine?

robertbastian · 2024-07-08T17:41:24Z

You mean having separate lookup tries for dry_load and load? That will mean duplication for people who use dry_load. It also means there will be different behaviour between load and dry_load

jedel1043 · 2024-07-08T17:51:13Z

Yeah, but I also was thinking that the Rust compiler cannot deduplicate between DataProvider impls because they don't share the same data, but maybe the compiler could deduplicate between DryDataProvider impls since the metadata is basically the same in a lot of places.

About the differing behaviour, I think it's fine. The trait just guarantees that if DataProvider doesn't fail, then DryDataProvider must not fail for the same request, but I think it wouldn't be out of place to not guarantee that the returned locale will also be the same between them.

sffc · 2024-07-08T20:46:03Z

We had a hypothesis at some point that retain-base-languages might be more efficient since it saves a hop during locale fallbacking. Do we have figures for how much it saves on average? Maybe based on one of our examples?

sffc · 2024-07-08T20:48:00Z

I think @jedel1043's idea is worth investigating. Maybe a hybrid approach where there are 2 tries, one that has maximal deduplication lookup and the other that holds the diff between retain-base-languages and maximal.

sffc · 2024-07-08T20:49:15Z

Or actually, all of the DryDataProvider impls could just use the same trie, which always contains only the base langauges, and all of the impls generated from the same datagen invocation can use the exact same trie.

sffc · 2024-07-09T16:09:00Z

Some discussion:

@robertbastian - DryDataProvider can be achieved without retain-base-languages, this PR is basically a lookup size/speed tradeoff
@sffc - I would generally trade a 10% size increase for a 25% speed increase, or a 1-2% size increase for a 5% speed increase. I would be happy if we made "load formatter and use it once" call sites faster, and data loading is still a nontrivial portion of the time spent in those call sites.
@robertbastian - % of what? It really depends on the usage pattern, how many fallback vs non-fallback locales, which ICU4X APIs, etc
@Manishearth - I don't think compiled data loading is 25% of a typical ICU4X usage, so Amdahl's law stops us from getting speedups that big. It may be for usage patterns where people never retain formatter objects ("load formatter and use it once"), but I'd like to see benchmarks, and people in those situations have other optimization tradeoffs available to them (cache the objects, which may give them a small size increase trading off for fast lookups)
@sffc - We shouldn't add these impls unless we have base languages, because they are useless without base languages.
@robertbastian - I don't want to conditionally add them based on deduplication strategy.
@sffc - Then let's export the macros without calling the macros on provider::Baked.

sffc · 2024-07-09T16:11:43Z

This PR doesn't add DryDataProvider impls yet, right? It's just changing the deduplication mode?

retain base languages for baked

b3ec1c5

robertbastian commented Jul 8, 2024

View reviewed changes

robertbastian marked this pull request as ready for review July 8, 2024 19:20

robertbastian requested a review from a team as a code owner July 8, 2024 19:20

robertbastian added the discuss Discuss at a future ICU4X-SC meeting label Jul 9, 2024

sffc mentioned this pull request Jul 9, 2024

DryDataProvider for baked providers #5211

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Retain base languages for compiled data #5195

Retain base languages for compiled data #5195

robertbastian commented Jul 8, 2024 •

edited

Loading

robertbastian left a comment

robertbastian Jul 8, 2024

robertbastian Jul 8, 2024

robertbastian Jul 8, 2024

jedel1043 commented Jul 8, 2024 •

edited

Loading

robertbastian commented Jul 8, 2024

jedel1043 commented Jul 8, 2024 •

edited

Loading

sffc commented Jul 8, 2024

sffc commented Jul 8, 2024

sffc commented Jul 8, 2024

sffc commented Jul 9, 2024

sffc commented Jul 9, 2024

Retain base languages for compiled data #5195

Are you sure you want to change the base?

Retain base languages for compiled data #5195

Conversation

robertbastian commented Jul 8, 2024 • edited Loading

robertbastian left a comment

Choose a reason for hiding this comment

robertbastian Jul 8, 2024

Choose a reason for hiding this comment

robertbastian Jul 8, 2024

Choose a reason for hiding this comment

robertbastian Jul 8, 2024

Choose a reason for hiding this comment

jedel1043 commented Jul 8, 2024 • edited Loading

robertbastian commented Jul 8, 2024

jedel1043 commented Jul 8, 2024 • edited Loading

sffc commented Jul 8, 2024

sffc commented Jul 8, 2024

sffc commented Jul 8, 2024

sffc commented Jul 9, 2024

sffc commented Jul 9, 2024

robertbastian commented Jul 8, 2024 •

edited

Loading

jedel1043 commented Jul 8, 2024 •

edited

Loading

jedel1043 commented Jul 8, 2024 •

edited

Loading