
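Although 「骨」 has no simplified versus traditional (正体/繁体) distinction, the glyph shown for it differs between Simplified Chinese and Traditional Chinese, which shows up as a different appearance under different fonts. Below, I typed the character twice side by side in Word: on the left in 宋体 (arguably the common, default font for Simplified Chinese), on the right in PMingLiU (the default font for Word's Simplified-to-Traditional conversion, and so arguably a common font for Traditional Chinese). The most obvious difference is that the opening at the top of the character faces the opposite direction, which in turn changes how the character is written and how many strokes it has: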

Word display

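However, their encoding is the same (both are U+9AA8), and (consequently?) there is no standard/variant or simplified/traditional distinction between them. These are two forms whose glyphs differ between contexts (Simplified versus Traditional Chinese, where the context takes the concrete form of a font); the difference is not conspicuous, but in principle it exists. Why do they share the same code point?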
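As a quick check of the claim that there is only one encoded character here, the following is a minimal Python sketch using the standard `unicodedata` module. The variable names are illustrative only; the point is that the string itself carries no information about which regional shape a font will draw.

```python
import unicodedata

# The character from the question; the same string is used whatever font renders it.
ch = "骨"

code_point = f"U+{ord(ch):04X}"      # scalar value in the usual U+XXXX notation
name = unicodedata.name(ch)          # formal Unicode character name
utf8_hex = ch.encode("utf-8").hex()  # bytes stored on disk or sent over the wire

print(code_point)  # U+9AA8
print(name)        # CJK UNIFIED IDEOGRAPH-9AA8
print(utf8_hex)    # e9aaa8
```

Switching the font in Word changes only which glyph is drawn for this one code point; none of the three values above changes.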

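In other words (and more generally), what principle does Unicode follow in handling cases like this? When the digitisation of a language produces this kind of situation, is it for historical reasons, or due to some other consideration? Any guidance is appreciated.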

Note: if this question is not particularly suited to this Stack Exchange site (Chinese Language), for example because it is not closely related, is there a more suitable place on Stack Exchange to ask it?


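Background (can be skipped):

I am a Simplified Chinese user. Today I looked up 「骨」 in a dictionary (《辞源》) using the radical lookup method (my first time using radical lookup in a dictionary of classical Chinese characters). The radical of 「骨」 is 「骨」 itself, so I counted its strokes, got 9, and looked for it among the 9-stroke radicals, but I could not find it anywhere. I suspected I had miscounted, so I searched online for its stroke count (on Simplified Chinese web pages, such as this one) and found that my count was right. Yet I still could not find it in the dictionary.

Later, I happened to find it listed under 10 strokes. This struck me as truly bizarre. Looking closely, I found that the character differs from the 「骨」 I am used to. Specifically, it looks like this, with the form familiar to Simplified Chinese users on the left and the form printed in the dictionary on the right: (comparison image)

After some more searching online, I learned that the latter is the glyph used in Traditional Chinese (fonts). The two forms have no simplified/traditional distinction and share the same code point.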

2 Answers

Recall that the contributors to the CJKV portion of the Unicode standard comprise interested parties from China (PRC), Hong Kong, Macau, Taiwan (ROC), Japan, South Korea, North Korea, and Vietnam, each with its own language administration. The major parties in this field are the PRC, ROC, and Japan, and whatever obscure choices Unicode makes are likely due to one or more of these three.

Why multiple different shapes are (sometimes) encoded under a single Unicode code point is a product of how each of these regions categorises characters:


For 【骨】, the common shape

傳承字形
Common 骨

is simply standard (常用漢字) in Japan, and is considered an Old Character Form (舊字形) in both the PRC and ROC. Note that 舊字形 is not among the character categories recognised by the language administrations of either the PRC or the ROC, so this 【骨】 is neither a 異體字 (in the ROC) nor a 繁體字 or 異體字 (in the PRC); both the PRC and ROC would consider it a font difference and list it under U+9AA8 regardless.

  • The PRC shape

    陸標
    PRC 骨

    is simply a New Character Form (新字形), derived from the 1965 Table of General-Use Chinese Character Forms for Publishing (印刷通用漢字字形表). ROC and Japan don't recognise this shape, and PRC itself doesn't categorise this shape differently, so in the PRC it is simply a 規範字 and doesn't have a special Unicode code point.

  • The ROC shape

    臺標
    ROC 骨

    is simply the shape chosen in ROC's 常用國字標準字體表. PRC and Japan don't recognise this shape, and ROC doesn't categorise this shape differently, so it is simply a 正體字 and doesn't have a special Unicode code point.


Consider two other examples:

  • 【敢】 is 11 strokes in the PRC and 12 strokes everywhere else, but is also under one Unicode code point everywhere (U+6562). This is because the common shape

    傳承字形
    Common 敢

    is considered a 舊字形 (a font difference) in the PRC, with the 新字形 being

    陸標
    PRC 敢

    PRC doesn't categorise the common 敢 as a 繁體字 or 異體字, so it doesn't push for any unique Unicode code point.

  • 【爭】 is 8 strokes in ROC and Japan, and 【争】 is 6 strokes in the PRC, ROC, and Japan. The common shape under U+722D

    傳承字形
    Common 爭

    • Is 人名用漢字 in Japan.

    • Is 舊字形 in ROC, with the standard shape being

      臺標
      ROC 爭

      under the same Unicode code point U+722D

    • Is 舊字形 in the PRC (a font difference) and is totally replaced by 【争】.

    On the other hand, this shape under U+4E89

    争

    • Is 常用漢字 in Japan
    • Is 異體字 in the ROC
    • Is 新字形 in the PRC

    Since the PRC views 【爭】 and 【争】 as font differences, but Japan and the ROC put them in different Chinese character categories, you can hypothesise that if the PRC were the only user of Chinese characters, then 【爭】 and 【争】 would also be under one Unicode code point.
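The contrast between the unified cases (【骨】, 【敢】) and the disunified pair (【爭】 / 【争】) can be checked directly. This is a minimal Python sketch; `code_point`, `unified`, and `disunified_pair` are illustrative names, not anything defined by Unicode.

```python
import unicodedata

def code_point(ch: str) -> str:
    """Return the U+XXXX notation for a single character."""
    return f"U+{ord(ch):04X}"

# Unified: regional shapes differ, but each character has one code point.
unified = ["骨", "敢"]
# Disunified: the two shapes are encoded as separate characters.
disunified_pair = ("爭", "争")

for ch in unified:
    print(ch, code_point(ch))  # 骨 U+9AA8, then 敢 U+6562

old, new = disunified_pair
print(old, code_point(old), "|", new, code_point(new))  # 爭 U+722D | 争 U+4E89

# Normalization does not fold the pair together: each form maps to itself.
folded = {unicodedata.normalize("NFKC", ch) for ch in disunified_pair}
print(len(folded))  # 2
```

For software this means 【爭】 and 【争】 are different characters in search and comparison unless a separate variant-mapping table is applied, while the regional shapes of 【骨】 or 【敢】 can only be selected through fonts.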

  • (+1) "Obscure choices" is an apt description of the decisions made by the authors of Unicode. E.g. for 录 vs 彔 (used as a component): 剥渌禄緑绿録𫘧 have the simplified component, 剝淥祿綠錄 have the traditional component, and 娽椂氯琭盝睩碌箓簶籙粶菉觮趢逯邍醁騄鵦龣㖨㟤㪖㫽㯟䃗䎑䎼䐂䘵䚄䟿䩮䰁䱚䴪 are "unified". It is difficult to see the logic used to arrive at these, but it's probably a combination of (1) competing standards from different regions (as pointed out above); (2) "rare character = don't care = unified"; and (3) whatever the Unicode authors felt like at the time.
    – yawnoc
    Commented 2 days ago

I happened to find it listed under 10 strokes. This struck me as truly bizarre.

You're kidding 😼 Among the 214 radicals used by the 康熙字典, 「骨」 is categorised under 10 strokes, and you find that strange? 🙀

https://www.kangxizidian.com/kxbushou/%E9%AA%A8

國語辭典

What principle does Unicode follow in handling cases like this?

The O'Reilly book CJKV Information Processing is a must-read.

Briefly, which glyph represents a Unicode code point depends on the font used, and fonts are designed. In a CJKV context, a font designer needs to address the regional and language differences.

For the code point U+9AA8, there are three significant glyph variations: Hong Kong 🇭🇰, China, and Taiwan 🇹🇼


Note that the glyph in the font "PingFang HK" is identical to the one in the 康熙字典網上版 (the link provided above) 😸
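One common way to let software pick a regional glyph is to tag the text with a language, so the renderer can choose a matching font. The following is a minimal Python sketch that writes a small HTML test page; `build_page` and the tag list are illustrative names, and whether the three lines actually look different is an assumption that depends on the browser and on which regional fonts are installed.

```python
from html import escape

CODE_POINT = 0x9AA8                    # the single code point under discussion
LANG_TAGS = ["zh-CN", "zh-TW", "zh-HK"]  # language tags for the three regions above

def build_page(code_point: int, lang_tags: list[str]) -> str:
    """Build an HTML page that shows one character once per language tag."""
    ch = chr(code_point)
    rows = [
        f'<p lang="{escape(tag)}">{escape(tag)}: {ch}</p>'
        for tag in lang_tags
    ]
    return '<!DOCTYPE html>\n<meta charset="utf-8">\n' + "\n".join(rows) + "\n"

page = build_page(CODE_POINT, LANG_TAGS)
print(page)
```

Every line of the generated page contains exactly the same character; only the `lang` attribute differs, which is the hint a renderer may use to select a regional font.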

Is there a more suitable place to ask?

Any font design forum, particularly those about Chinese fonts in Taiwan or Hong Kong 😸

have fun :)

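  • Thank you for your answer! Let me explain the remark "I happened to find it listed under 10 strokes. This struck me as truly bizarre": at the time I assumed the dictionary listed the simplified 「骨」, which has 9 strokes. I did not yet know about its other written form (the traditional one), so I was puzzled. The difference lies in that opening: an opening facing left corresponds to a single 横折 stroke, while an opening facing right corresponds to a 竖 plus a 横, which is two strokes. That gives 10 strokes, one more than the simplified form. I only learned this later (after seeing the difference between the two forms); before that I really was puzzled.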
    – Soriak
    Commented Jul 4 at 5:47
