Possible Encoding Issues #4

Open
nathanhammond opened this issue Jan 30, 2021 · 8 comments · May be fixed by #12
@nathanhammond

It's possible that this library is selecting the wrong encoding for some characters. In comparing the output from this library to the content of https://github.com/lshk-org/jyutping-table I've noticed the following discrepancies.

I believe that these issues should be resolved in this library, and that the other output is correct.

The results below are also included in a related issue filed at lshk-org/jyutping-table#5.

- From the original export, present in `list-20040907.tsv`
+ From a new export, https://github.com/nathanhammond/parse-jyutping-table-full/blob/master/totsv.js

- 十	U+5341	sap6
+ 〸	U+3038	sap6
- 卄	U+5344	jaa6
- 卄	U+5344	je6
- 卄	U+5344	lim6
- 卄	U+5344	nim6
+ 〹	U+3039	jaa6
+ 〹	U+3039	je6
+ 〹	U+3039	lim6
+ 〹	U+3039	nim6
- 卅	U+5345	saa1 aa6
+ 〺	U+303A	saa1 aa6
- 兀	U+5140	at6
- 兀	U+5140	ngat6
+ 兀	U+FA0C	at6
+ 兀	U+FA0C	ngat6
- 嗀	U+55C0	hok3
+ 嗀	U+FA0D	hok3

Further, there is a weird one:

+ 浧	U+6D67	wun3
- 𤧬	U+249EC	wun3

In JPTableFull.pdf that entry is defined as: { ucs2: "E6C5", jyutping: "wun3" }.

I do believe that U+249EC is the correct value here.


chaklim (Owner) commented Jan 30, 2021

The purpose of this repo is to convert characters from old Unicode code points to new ones, so that the converted characters display properly on devices that support the newer Unicode encoding standard.

ref:
https://www.compart.com/en/unicode/U+3038
https://www.compart.com/en/unicode/U+3039
https://www.compart.com/en/unicode/U+303A
https://www.compart.com/en/unicode/U+FA0C
https://www.compart.com/en/unicode/U+FA0D
𤧬 https://www.compart.com/en/unicode/U+249EC

The output under "+ From a new export" should be the one with the newer Unicode encoding.
The previous Unicode encoding for those characters can also be found in the "Decomposition" row of each reference page above.

For the weird one, I think I need some time to investigate it.


nathanhammond commented Jan 30, 2021

Okay, I'm throwing content here so that I can try to figure it out later:

var normalizations = ["NFC", "NFD", "NFKC", "NFKD"];
var u5341 = ["5341", ...normalizations.map((normalization) => "\u5341".normalize(normalization).codePointAt(0).toString(16))];
var u3038 = ["3038", ...normalizations.map((normalization) => "\u3038".normalize(normalization).codePointAt(0).toString(16))];

console.log(["ORIG", ...normalizations].join('\t'));
console.log(u3038.join('\t'));
console.log(u5341.join('\t'));

ORIG	NFC	NFD	NFKC	NFKD
3038	3038	3038	5341	5341
5341	5341	5341	5341	5341
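The same one-way behavior can be probed directly. This is a sketch using only the standard String.prototype.normalize; the helper name hasCompatibilityDecomposition is my own:

```javascript
// Sketch: a code point has a *compatibility* (but not canonical)
// decomposition when NFKD changes it while NFD leaves it alone.
function hasCompatibilityDecomposition(char) {
  return char.normalize("NFKD") !== char && char.normalize("NFD") === char;
}

console.log(hasCompatibilityDecomposition("\u3038")); // true: 〸 decomposes to 十
console.log(hasCompatibilityDecomposition("\u5341")); // false: 十 is already the base form
```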


So, U+5341 is a "compatibility decomposition" of U+3038. You can get from U+3038 to U+5341, but not the other way around. I don't yet know what that means. Especially since there is also something known as a "canonical decomposition." My future reading:

@nathanhammond

Sorry, just realized that my previous comment could be read as passive aggressive. This is actually me trying to figure things out and you're seeing me pause and serialize my state so that I can spend the evening with my family. I'll continue tomorrow.

@nathanhammond

After further review, I'm pretty sure that U+5341 (and all of the others) should be the selected encoding. Reasoning:

  • Every keyboard I've tested for entering 十 outputs the U+5341 code point.
  • For downstream consumers, if a pronunciation is attached to U+3038, input of U+5341 can't be NFKC-normalized to U+3038 to discover the pronunciation, because normalization only maps U+3038 to U+5341. The inverse, however, does work.

This also seems consistent with the intent of compatibility decompositions: the decomposed character serves as the "base" value, without any additional formatting. See below for concrete examples of the limitations.


Example for generically identifying pronunciation of a string if this were stored at U+5341:

var lookup = {
  "5341": "sap6"
};

function getPronunciation(character) {
  return lookup[character.codePointAt(0).toString(16)];
}

var inputString = "\u3038";
var pronunciations = [...inputString.normalize('NFKC')].map(getPronunciation);
// [ "sap6" ]

Example of failure if stored at U+3038:

var lookup = {
  "3038": "sap6"
};

function getPronunciation(character) {
  return lookup[character.codePointAt(0).toString(16)];
}

var inputString = "\u5341";
var pronunciations = [...inputString.normalize('NFKC')].map(getPronunciation);
// [ undefined ]

The latter solution would require an in-application lookup from U+5341 to U+3038:

var lookup = {
  "3038": "sap6"
};

var indirect = {
  "5341": "3038"
};

function getPronunciation(character) {
  var codePoint = character.codePointAt(0).toString(16);
  return lookup[codePoint] || lookup[indirect[codePoint]];
}

var inputString = "\u5341";
var pronunciations = [...inputString.normalize('NFKC')].map(getPronunciation);
// [ "sap6" ]
@nathanhammond

The weird one:

+ 浧	U+6D67	wun3
- 𤧬	U+249EC	wun3

I traced it backward and it is happening because of a duplicate key in hkscs1999.tsv:

0x9447 U+6D67 U+E6C5 # <reserved> <compat> 0xD256

The table you're using comes from here:
https://moztw.org/docs/big5/
https://moztw.org/docs/big5/table/hkscs1999.txt

If you compare it to the values from the original 1999 HKSCS, you can see that 0x9447 shouldn't be mapped; it should be omitted.
https://www.ccli.gov.hk/doc/e_hkscs_1999.pdf

Alternatively, it should be mapped to the compat point:
https://www.ccli.gov.hk/doc/big5cmp2001.txt

So, the presence of that duplicate value is an error in the source data file.
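For anyone regenerating the table, this class of error can be caught by scanning for repeated Big5 keys. A sketch, assuming whitespace-separated lines whose first column is the Big5 code and `#` starts a comment; the function name is my own:

```javascript
// Sketch: report Big5 keys that appear more than once in a mapping table.
function findDuplicateKeys(tableText) {
  var seen = new Set();
  var duplicates = [];
  for (var line of tableText.split("\n")) {
    var key = line.split(/\s+/)[0];
    if (!key || key.startsWith("#")) continue; // skip blanks and comments
    if (seen.has(key)) duplicates.push(key);
    seen.add(key);
  }
  return duplicates;
}
```

Running this over hkscs1999.tsv should surface 0x9447 (and any other duplicated keys) immediately.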

@nathanhammond

https://en.wikipedia.org/wiki/Suzhou_numerals

U+3038, U+3039, and U+303A are Suzhou numerals. For running text we should instead prefer U+5341, U+5EFF, and U+5345.
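That preference can be sketched as a small lookup. Note the mapping here follows the running-text preference above (廿 U+5EFF for twenty) rather than the Unicode compatibility decomposition of U+3039 (卄 U+5344); the names are my own:

```javascript
// Sketch: replace Suzhou numerals with their running-text ideographs.
var suzhouToIdeograph = {
  "\u3038": "\u5341", // 〸 → 十 (ten)
  "\u3039": "\u5EFF", // 〹 → 廿 (twenty, "U" presentation)
  "\u303A": "\u5345"  // 〺 → 卅 (thirty)
};

function toRunningText(input) {
  return [...input].map((c) => suzhouToIdeograph[c] || c).join("");
}

console.log(toRunningText("\u3038\u3039\u303A")); // "十廿卅"
```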

@nathanhammond

Here are details confirming that this library's mapping of U+5341/U+5344/U+5345 is incorrect.

Big5 Graphical Block (0xA140 to 0xA3BF)

These three were eventually elided from Big5, but help to communicate intent.

A2CC (Suzhou 10)
A2CD (Suzhou 20, "H" presentation)
A2CE (Suzhou 30)

Big5 Frequently used characters (0xA440 to 0xC67E)

These are intended to be used in running text.

A451 (Ideograph 10)
A4CA (Ideograph 30)
A4DC (Ideograph 20, "U" presentation)

Unicode

The graphical variants (Suzhou).

U+3038 (Suzhou 10)
U+3039 (Suzhou 20, "H" presentation)
U+303A (Suzhou 30)

The ideograph variants (for running text).

U+5341 (Ideograph 10)
U+5344 (Ideograph 20, "H" presentation)
U+5345 (Ideograph 30)
U+5EFF (Ideograph 20, "U" presentation)

Unicode code points should not be remapped between these two sections; they're distinct. We should assume that the user has intentionally selected one or the other.

JPTableFull.pdf - Specifies Unicode code points in the ideograph range.

U+5341 (Ideograph 10)
U+5344 (Ideograph 20, "H" presentation)
U+5345 (Ideograph 30)
U+5EFF (Ideograph 20, "U" presentation)

hkscs_unicode_converter

Maps inputs of:

U+5341 (Ideograph 10)
U+5344 (Ideograph 20, "H" presentation)
U+5345 (Ideograph 30)

To Suzhou outputs:

U+3038 (Suzhou 10)
U+3039 (Suzhou 20, "H" presentation)
U+303A (Suzhou 30)

This set of facts makes it pretty clear to me that this library should avoid remapping these code points.

@nathanhammond

Here is HKCS with explicit recommendations about how certain code points should be transformed: https://www.ccli.gov.hk/doc/HKCS_En_V10.pdf

These recommendations match what I tracked down.

@nathanhammond nathanhammond linked a pull request Jan 3, 2024 that will close this issue