Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LocaleFallbackPriority for Transliterator fallback #3972

Draft
wants to merge 13 commits into
base: main
Choose a base branch
from

Conversation

skius
Copy link
Member

@skius skius commented Aug 30, 2023

This adds simple language-ID-only fallback that will be used for transliterator constructor fallback. Fallback is only performed on language, script, and region, as the examples in UTS #35 show.

@skius skius marked this pull request as ready for review August 30, 2023 10:59
@skius skius requested review from robertbastian and removed request for dminor, zbraniecki and Manishearth August 30, 2023 10:59
("en-US", &["en-US", "en", "und-Latn"]),
(
"az-Arab-IR",
&["az-Arab-IR", "az-Arab", "az-IR", "az", "und-Arab"],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: do we want und-Arab-IR in this chain too?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The specification says no (that chain is copy-pasted from there), but I don't know a good way to check what ICU4C is doing.

Comment on lines 61 to 63
// 2. Remove the script if it is implied by the other subtags and the fallback priority is
// not transliteration.
if self.config.priority != FallbackPriority::Transliteration {
Copy link
Member

@sffc sffc Aug 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thought/Suggestion: Rather than skipping this step, it seems like we could instead maximize the script at this point. Then, for example, ru-RU becomes ru-Cyrl-RU and we run the fallback chain from that reference point. If you do this then you don't need the max_script function; just change the body of this branch to add instead of remove the script.

Copy link
Member Author

@skius skius Aug 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ru-Cyrl-RU is not part of ru-RU's fallback chain, neither is ru-Cyrl that would be the next step: https://unicode.org/reports/tr35/tr35-general.html#Inheritance

We need to store the maximized script outside the locale itself, because it's invalid to maximize after we've already changed the locale. It needs to be on the initial locale, otherwise e.g. az would fallback to und-Latn, which is the maximized script for az but not for az-IR, and the spec explicitly says az-IR ends up at und-Arab (ignoring supplemental data that ICU4C is also ignoring).

I pushed an alternative version that reuses the normalize code block.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the specification is inconsistent. It also says

For either the source or the target, the fallback starts from the maximized locale ID (using the likely-subtags data).

emphasis mine. What complicates it even more is that

Where there are multiple scripts, the maximized script is tried first, and then the other scripts associated with the language (from supplemental data).

I think we might need supplemental data for this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, ICU4C does what I'm doing AFAICT, though

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I checked that supplemental data is not accessed and that the fallback starts from the initial locale, not the maximized one)

@robertbastian robertbastian added the discuss Discuss at a future ICU4X-SC meeting label Aug 31, 2023
@sffc
Copy link
Member

sffc commented Aug 31, 2023

  • @sffc - ru-Cyrl-RU and ru-RU are equivalent, so they should normalize to the same thing. In most fallback modes we would normalize to ru-RU but for translit maybe we should use ru-Cyrl-RU.
  • @robertbastian - The spec is inconsistent. It says you should add in the script.
  • @robertbastian - Do we want to design a nice script fallback mode and use it in the transliterator, or do we want to handle this as a transliterator fallback mode?
  • @sffc - we should do what makes sense and try to upstream that into the spec. we already did this before. If we can implement a more sensible algorithm I'm 90% sure we can get that upstream.
  • @robertbastian - Transliterator fallback doesn't seem to have been thought about much before
  • @sffc - previous impls might not have thought about fallback, this is something that ICU4X has really championed and we could upstream
  • @sffc - Seems like something we can resolve later.
  • @skius - CLDR discussion on transform fallback: https://cldr.unicode.org/development/development-process/design-proposals/transform-fallback
@robertbastian robertbastian marked this pull request as draft August 31, 2023 19:15
@sffc sffc removed the discuss Discuss at a future ICU4X-SC meeting label Nov 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
3 participants