Possible Encoding Issues #4
Comments
The purpose of this repo is to convert characters from the old Unicode encoding to the new Unicode encoding, ref:
The output from "+ From a new export" should be the one with the newer Unicode encoding. For the weird one, I think I need some time to investigate it. |
Okay, I'm throwing content here so that I can try to figure it out later:

var normalizations = ["NFC", "NFD", "NFKC", "NFKD"];
var u5341 = normalizations.map((normalization) => "\u5341".normalize(normalization).codePointAt(0).toString(16));
var u3038 = normalizations.map((normalization) => "\u3038".normalize(normalization).codePointAt(0).toString(16));
console.log(normalizations.join('\t'));
console.log(u3038.join('\t'));
console.log(u5341.join('\t'));

Results, with the original code point added as the first column:

ORIG	NFC	NFD	NFKC	NFKD
3038	3038	3038	5341	5341
5341	5341	5341	5341	5341
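
To check whether the same compatibility mapping holds for the other Suzhou tens numerals, the experiment above can be generalized. This is only a sketch: `nfkcCodePoints` is a hypothetical helper name, not something from this repo, and it relies solely on the built-in `String.prototype.normalize`.

```javascript
// Hypothetical helper (not part of this repo): for each input code point,
// report the code points of its NFKC-normalized form as uppercase hex.
function nfkcCodePoints(codePoints) {
  return codePoints.map((cp) => {
    const normalized = String.fromCodePoint(cp).normalize("NFKC");
    return {
      from: cp.toString(16).toUpperCase(),
      // Spread iterates by code point, so astral characters stay intact.
      to: [...normalized].map((ch) => ch.codePointAt(0).toString(16).toUpperCase()),
    };
  });
}

// U+3038, U+3039, U+303A: the Suzhou numerals for ten, twenty, thirty.
const suzhou = nfkcCodePoints([0x3038, 0x3039, 0x303a]);
for (const row of suzhou) {
  console.log(`U+${row.from} -> ${row.to.map((hex) => "U+" + hex).join(" ")}`);
}
```

Each of the three should come back as a single code point in the CJK Unified Ideographs block, which matches the NFKC/NFKD columns in the table above for U+3038.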
|
Sorry, just realized that my previous comment could be read as passive aggressive. This is actually me trying to figure things out and you're seeing me pause and serialize my state so that I can spend the evening with my family. I'll continue tomorrow. |
After further review, I'm pretty sure that U+5341 (and all of the others) should be the selected encoding. Reasoning:
This also seems to be in the intent of compatibility characters serving as the "base" value, without any additional formatting included. See below for concrete examples of limitations.

Example for generically identifying pronunciation of a string if this were stored at U+5341:

var lookup = {
  "5341": "sap6"
};
function getPronunciation(character) {
  return lookup[character.codePointAt(0).toString(16)];
}
var inputString = "\u3038";
var pronunciations = [...inputString.normalize('NFKC')].map(getPronunciation);
// [ "sap6" ]

Example of failure if stored at U+3038:

var lookup = {
  "3038": "sap6"
};
function getPronunciation(character) {
  return lookup[character.codePointAt(0).toString(16)];
}
var inputString = "\u5341";
var pronunciations = [...inputString.normalize('NFKC')].map(getPronunciation);
// [ undefined ]

The latter solution would require an in-application lookup from U+5341 to U+3038:

var lookup = {
  "3038": "sap6"
};
var indirect = {
  "5341": "3038"
};
function getPronunciation(character) {
  var codePoint = character.codePointAt(0).toString(16);
  return lookup[codePoint] || lookup[indirect[codePoint]];
}
var inputString = "\u5341";
var pronunciations = [...inputString.normalize('NFKC')].map(getPronunciation);
// [ "sap6" ]
|
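
One possible way to sidestep the hand-written `indirect` table is to normalize the keys once, when the lookup is built, as well as the input. The following is a sketch under that assumption; `buildLookup` and `getPronunciations` are hypothetical names used only for illustration, and the single entry is the same `sap6` example as above.

```javascript
// Hypothetical helper: key the table by the NFKC form of each character,
// so it does not matter whether the source data used U+3038 or U+5341.
function buildLookup(entries) {
  const lookup = new Map();
  for (const [character, jyutping] of entries) {
    lookup.set(character.normalize("NFKC"), jyutping);
  }
  return lookup;
}

// Normalize the input the same way before looking up each character.
function getPronunciations(lookup, inputString) {
  return [...inputString.normalize("NFKC")].map((ch) => lookup.get(ch));
}

// Source data keyed at the Suzhou numeral, as in the failing example.
const lookup = buildLookup([["\u3038", "sap6"]]);
const fromSuzhou = getPronunciations(lookup, "\u3038");
const fromIdeograph = getPronunciations(lookup, "\u5341");
console.log(fromSuzhou, fromIdeograph);
```

The trade-off is that this collapses the graphical and ideograph forms into one key, which only makes sense if the application does not need to tell them apart.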
The weird one:

+ 浧 U+6D67 wun3
- 𤧬 U+249EC wun3

I traced it backward and it is happening because of a duplicate key in hkscs_unicode_converter/hkscs/hkscs1999.tsv (line 1720 in 9c39762).
The table you're using comes from here: If you compare it to the values from the original 1999 HKSCS, you can see that Alternatively, it should be mapped to the compat point: So, the presence of that duplicate value is an error in the source data file. |
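
A duplicate key like this can be found mechanically. The sketch below makes assumptions about the file: that it is tab-separated, that the key is in the first column, and that lines starting with `#` are comments. The actual layout of hkscs1999.tsv may differ. `findDuplicateKeys` is a hypothetical helper, and the sample rows are placeholders, not real data from the table.

```javascript
// Hypothetical helper: returns [key, lineNumbers] pairs for every key that
// appears on more than one line. Assumes tab-separated rows with the key in
// `keyColumn` (default: first column) and `#` marking comment lines.
function findDuplicateKeys(tsvText, keyColumn = 0) {
  const seen = new Map(); // key -> list of 1-based line numbers
  tsvText.split(/\r?\n/).forEach((line, index) => {
    if (line.trim() === "" || line.startsWith("#")) return;
    const key = line.split("\t")[keyColumn];
    if (!seen.has(key)) seen.set(key, []);
    seen.get(key).push(index + 1);
  });
  return [...seen].filter(([, lines]) => lines.length > 1);
}

// Placeholder rows for illustration only.
const sample = "K1\tV1\nK2\tV2\nK1\tV3\n";
const duplicates = findDuplicateKeys(sample);
console.log(duplicates);
```

Against the real file, the text would be read with `fs.readFileSync(path, "utf8")` and `keyColumn` adjusted to whichever column the converter uses as its key.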
U+3038, U+3039, and U+303A are Suzhou numerals (https://en.wikipedia.org/wiki/Suzhou_numerals). For running text we should instead prefer U+5341, U+5EFF, and U+5345. |
Here are details confirming that this library's mapping of U+5341/U+5344/U+5345 is incorrect.

Big5 Graphical Block (0xA140 to 0xA3BF)
These three were eventually elided from Big5, but help to communicate intent.

Big5 Frequently used characters (0xA440 to 0xC67E)
These are intended to be used in running text.

Unicode
The graphical variants (Suzhou).
The ideograph variants (for running text).

Unicode code points should not be remapped between these two sections; they're distinct. We should assume that the user has intentionally selected one or the other.

JPTableFull.pdf - Specifies Unicode code points in the ideograph range.
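
The "do not remap between the two sections" rule could be expressed as a check over a conversion table. This is a minimal sketch, not code from this library: `crossesSections` is a hypothetical name, the Suzhou set is limited to the three tens numerals discussed in this thread, and the ideograph test only covers the main CJK Unified Ideographs block (U+4E00 to U+9FFF).

```javascript
// The three Suzhou tens numerals discussed in this thread.
const SUZHOU_TENS = new Set([0x3038, 0x3039, 0x303a]);

// Main CJK Unified Ideographs block only; extension blocks are ignored here.
function isCjkUnifiedIdeograph(codePoint) {
  return codePoint >= 0x4e00 && codePoint <= 0x9fff;
}

// Hypothetical check: true when a mapping moves a character from the
// graphical (Suzhou) section to the ideograph section, or the reverse.
function crossesSections(from, to) {
  return (
    (SUZHOU_TENS.has(from) && isCjkUnifiedIdeograph(to)) ||
    (isCjkUnifiedIdeograph(from) && SUZHOU_TENS.has(to))
  );
}

console.log(crossesSections(0x5341, 0x3038)); // ideograph -> Suzhou numeral
console.log(crossesSections(0x5341, 0x5341)); // unchanged
```

Running such a check over the converter's output table would flag any entry that silently swaps one section for the other.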
|
Here is HKCS with explicit recommendations about how certain code points should be transformed: https://www.ccli.gov.hk/doc/HKCS_En_V10.pdf These recommendations match what I tracked down. |
It's possible that this library is selecting the wrong encoding for some characters. In comparing the output from this library to the content of https://github.com/lshk-org/jyutping-table I've noticed the following discrepancies.
I believe that these issues should be resolved in this library, and that the other output is correct.
The below results are also included in a related issue filed at lshk-org/jyutping-table#5
Further, there is a weird one:
From JPTableFull.pdf, that is defined as: { ucs2: "E6C5", jyutping: "wun3" }. I do believe that U+249EC is the correct value here.