Possible Encoding Issues #4

Open
nathanhammond opened this issue Jan 30, 2021 · 8 comments · May be fixed by #12
@nathanhammond

It's possible that this library is selecting the wrong encoding for some characters. In comparing the output from this library to the content of https://github.com/lshk-org/jyutping-table I've noticed the following discrepancies.

I believe that these issues should be resolved in this library, and that the other output is correct.

The results below are also included in a related issue filed at lshk-org/jyutping-table#5.

- From the original export, present in `list-20040907.tsv`
+ From a new export, https://github.com/nathanhammond/parse-jyutping-table-full/blob/master/totsv.js

- 十	U+5341	sap6
+ 〸	U+3038	sap6
- 卄	U+5344	jaa6
- 卄	U+5344	je6
- 卄	U+5344	lim6
- 卄	U+5344	nim6
+ 〹	U+3039	jaa6
+ 〹	U+3039	je6
+ 〹	U+3039	lim6
+ 〹	U+3039	nim6
- 卅	U+5345	saa1 aa6
+ 〺	U+303A	saa1 aa6
- 兀	U+5140	at6
- 兀	U+5140	ngat6
+ 兀	U+FA0C	at6
+ 兀	U+FA0C	ngat6
- 嗀	U+55C0	hok3
+ 嗀	U+FA0D	hok3

Further, there is a weird one:

+ 浧	U+6D67	wun3
- 𤧬	U+249EC	wun3

In JPTableFull.pdf that entry is defined as: { ucs2: "E6C5", jyutping: "wun3" }.

I do believe that U+249EC is the correct value here.


chaklim (Owner) commented Jan 30, 2021

The purpose of this repo is to convert characters from old Unicode code points to new ones, so that the converted characters display properly on devices that support the newer Unicode encoding standard.

ref:
https://www.compart.com/en/unicode/U+3038
https://www.compart.com/en/unicode/U+3039
https://www.compart.com/en/unicode/U+303A
https://www.compart.com/en/unicode/U+FA0C
https://www.compart.com/en/unicode/U+FA0D
𤧬 https://www.compart.com/en/unicode/U+249EC

The output under "+ From a new export" should be the one with the newer Unicode encoding.
The previous Unicode encoding for those characters can also be found in the "Decomposition" row of each reference page above.

For the weird one, I think I need some time to investigate it.


nathanhammond commented Jan 30, 2021

Okay, I'm throwing content here so that I can try to figure it out later:

var normalizations = ["NFC", "NFD", "NFKC", "NFKD"];
var u5341 = ["5341", ...normalizations.map((normalization) => "\u5341".normalize(normalization).codePointAt(0).toString(16))];
var u3038 = ["3038", ...normalizations.map((normalization) => "\u3038".normalize(normalization).codePointAt(0).toString(16))];

console.log(["ORIG", ...normalizations].join('\t'));
console.log(u3038.join('\t'));
console.log(u5341.join('\t'));

ORIG	NFC	NFD	NFKC	NFKD
3038	3038	3038	5341	5341
5341	5341	5341	5341	5341
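The same one-way behavior can be probed directly. This is a sketch using only the standard String.prototype.normalize; the helper name hasCompatibilityDecomposition is my own:

```javascript
// Sketch: a code point has a *compatibility* (but not canonical)
// decomposition when NFKD changes it while NFD leaves it alone.
function hasCompatibilityDecomposition(char) {
  return char.normalize("NFKD") !== char && char.normalize("NFD") === char;
}

console.log(hasCompatibilityDecomposition("\u3038")); // true: 〸 decomposes to 十
console.log(hasCompatibilityDecomposition("\u5341")); // false: 十 is already the base form
```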


So, U+5341 is a "compatibility decomposition" of U+3038. You can get from U+3038 to U+5341, but not the other way around. I don't yet know what that means. Especially since there is also something known as a "canonical decomposition." My future reading:

@nathanhammond

Sorry, just realized that my previous comment could be read as passive aggressive. This is actually me trying to figure things out and you're seeing me pause and serialize my state so that I can spend the evening with my family. I'll continue tomorrow.

@nathanhammond

After further review, I'm pretty sure that U+5341 (and all of the others) should be the selected encoding. Reasoning:

  • Every keyboard I've tested for entering 十 outputs the U+5341 code point.
  • For downstream consumers, if a pronunciation is attached to U+3038, input of U+5341 can't be NFKC-normalized to U+3038 to discover the pronunciation, because normalization only maps U+3038 to U+5341. The inverse, however, does work.

This also seems consistent with the intent of compatibility decompositions: the decomposed character serves as the "base" value, without any additional formatting. See below for concrete examples of the limitations.


Example for generically identifying pronunciation of a string if this were stored at U+5341:

var lookup = {
  "5341": "sap6"
};

function getPronunciation(character) {
  return lookup[character.codePointAt(0).toString(16)];
}

var inputString = "\u3038";
var pronunciations = [...inputString.normalize('NFKC')].map(getPronunciation);
// [ "sap6" ]

Example of failure if stored at U+3038:

var lookup = {
  "3038": "sap6"
};

function getPronunciation(character) {
  return lookup[character.codePointAt(0).toString(16)];
}

var inputString = "\u5341";
var pronunciations = [...inputString.normalize('NFKC')].map(getPronunciation);
// [ undefined ]

The latter solution would require an in-application lookup from U+5341 to U+3038:

var lookup = {
  "3038": "sap6"
};

var indirect = {
  "5341": "3038"
};

function getPronunciation(character) {
  var codePoint = character.codePointAt(0).toString(16);
  return lookup[codePoint] || lookup[indirect[codePoint]];
}

var inputString = "\u5341";
var pronunciations = [...inputString.normalize('NFKC')].map(getPronunciation);
// [ "sap6" ]
@nathanhammond

The weird one:

+ 浧	U+6D67	wun3
- 𤧬	U+249EC	wun3

I traced it backward and it is happening because of a duplicate key in hkscs1999.tsv:

0x9447 U+6D67 U+E6C5 # <reserved> <compat> 0xD256

The table you're using comes from here:
https://moztw.org/docs/big5/
https://moztw.org/docs/big5/table/hkscs1999.txt

If you compare it to the values from the original 1999 HKSCS, you can see that 0x9447 shouldn't be mapped; it should be omitted.
https://www.ccli.gov.hk/doc/e_hkscs_1999.pdf

Alternatively, it should be mapped to the compat point:
https://www.ccli.gov.hk/doc/big5cmp2001.txt

So, the presence of that duplicate value is an error in the source data file.
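For anyone regenerating the table, this class of error can be caught by scanning for repeated Big5 keys. A sketch, assuming whitespace-separated lines whose first column is the Big5 code and `#` starts a comment; the function name is my own:

```javascript
// Sketch: report Big5 keys that appear more than once in a mapping table.
function findDuplicateKeys(tableText) {
  var seen = new Set();
  var duplicates = [];
  for (var line of tableText.split("\n")) {
    var key = line.split(/\s+/)[0];
    if (!key || key.startsWith("#")) continue; // skip blanks and comments
    if (seen.has(key)) duplicates.push(key);
    seen.add(key);
  }
  return duplicates;
}
```

Running this over hkscs1999.tsv should surface 0x9447 (and any other duplicated keys) immediately.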

@nathanhammond

https://en.wikipedia.org/wiki/Suzhou_numerals

U+3038, U+3039, and U+303A are Suzhou numerals. For running text we should instead prefer U+5341, U+5EFF, and U+5345.
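That preference can be sketched as a small lookup. Note the mapping here follows the running-text preference above (廿 U+5EFF for twenty) rather than the Unicode compatibility decomposition of U+3039 (卄 U+5344); the names are my own:

```javascript
// Sketch: replace Suzhou numerals with their running-text ideographs.
var suzhouToIdeograph = {
  "\u3038": "\u5341", // 〸 → 十 (ten)
  "\u3039": "\u5EFF", // 〹 → 廿 (twenty, "U" presentation)
  "\u303A": "\u5345"  // 〺 → 卅 (thirty)
};

function toRunningText(input) {
  return [...input].map((c) => suzhouToIdeograph[c] || c).join("");
}

console.log(toRunningText("\u3038\u3039\u303A")); // "十廿卅"
```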

@nathanhammond

Here are details confirming that this library's mapping of U+5341/U+5344/U+5345 is incorrect.

Big5 Graphical Block (0xA140 to 0xA3BF)

These three were eventually elided from Big5, but help to communicate intent.

A2CC (Suzhou 10)
A2CD (Suzhou 20, "H" presentation)
A2CE (Suzhou 30)

Big5 Frequently used characters (0xA440 to 0xC67E)

These are intended to be used in running text.

A451 (Ideograph 10)
A4CA (Ideograph 30)
A4DC (Ideograph 20, "U" presentation)

Unicode

The graphical variants (Suzhou).

U+3038 (Suzhou 10)
U+3039 (Suzhou 20, "H" presentation)
U+303A (Suzhou 30)

The ideograph variants (for running text).

U+5341 (Ideograph 10)
U+5344 (Ideograph 20, "H" presentation)
U+5345 (Ideograph 30)
U+5EFF (Ideograph 20, "U" presentation)

Unicode code points should not be remapped between these two sections; they're distinct. We should assume that the user has intentionally selected one or the other.

JPTableFull.pdf - Specifies Unicode code points in the ideograph range.

U+5341 (Ideograph 10)
U+5344 (Ideograph 20, "H" presentation)
U+5345 (Ideograph 30)
U+5EFF (Ideograph 20, "U" presentation)

hkscs_unicode_converter

Maps inputs of:

U+5341 (Ideograph 10)
U+5344 (Ideograph 20, "H" presentation)
U+5345 (Ideograph 30)

To Suzhou outputs:

U+3038 (Suzhou 10)
U+3039 (Suzhou 20, "H" presentation)
U+303A (Suzhou 30)

This set of facts makes it pretty clear to me that this library should avoid remapping these code points.

@nathanhammond

Here is HKCS with explicit recommendations about how certain code points should be transformed: https://www.ccli.gov.hk/doc/HKCS_En_V10.pdf

These recommendations match what I tracked down.

@nathanhammond nathanhammond linked a pull request Jan 3, 2024 that will close this issue