[JMdict] add support for the <stagr> and <ke_inf> entities #388

epistularum · 2022-09-23T05:26:23Z

I have found two valuable elements that are left out on export. <stagr> is critical to understanding some definitions, while <ke_inf> is very nice to have, though its absence doesn't fundamentally break the format.

<stagr>

The JMdict XML file itself documents this element as follows:

<!ELEMENT stagr (#PCDATA)>
        <!-- These elements, if present, indicate that the sense is restricted
        to the lexeme represented by the keb and/or reb. -->

In practice:
The writing "翡翠" can be read as either "ひすい" or "かわせみ" (among others), but the definitions "jade (gem)" and "beautiful lustrous colour similar to that of the kingfisher's feathers" apply exclusively to the reading "ひすい".

For all intents and purposes, "翡翠[ひすい]" and "翡翠[かわせみ]" are two separate words (which is how they are handled in all monolingual dictionaries), but JMdict chose to combine them (and many others) under a single headword.

Because the export drops this restriction, the current implementation leaves a lot of room for ambiguity: every definition appears to apply to every reading.
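To make the ambiguity concrete, here is a minimal Python sketch of how an exporter could split a combined entry by reading once the restriction is known. The `Sense` class and the `senses_by_reading` helper are hypothetical names used for illustration only, not part of pyglossary, and the entry is trimmed to two readings and two senses.

```python
from dataclasses import dataclass, field


@dataclass
class Sense:
    gloss: str
    # Readings this sense is restricted to (taken from <stagr>).
    # An empty list means the sense applies to all readings.
    restricted_to: list[str] = field(default_factory=list)


def senses_by_reading(readings: list[str], senses: list[Sense]) -> dict[str, list[str]]:
    """Group glosses under each reading they actually apply to."""
    grouped: dict[str, list[str]] = {reading: [] for reading in readings}
    for sense in senses:
        # An unrestricted sense applies to every reading of the entry.
        targets = sense.restricted_to or readings
        for reading in targets:
            if reading in grouped:
                grouped[reading].append(sense.gloss)
    return grouped


# Trimmed, hand-written data (assumption: only two readings are shown).
readings = ["かわせみ", "ひすい"]
senses = [
    Sense("kingfisher (esp. the common kingfisher, Alcedo atthis)"),
    Sense("jade (gem)", restricted_to=["ひすい"]),
]

for reading, glosses in senses_by_reading(readings, senses).items():
    print(f"翡翠[{reading}]: {'; '.join(glosses)}")
```

Run as is, this lists the kingfisher gloss under both readings and "jade (gem)" only under "ひすい", which is closer to how the monolingual dictionaries present the word.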

For reference, here is a screenshot of a dictionary I made based on the data extracted by pyglossary:

Here is how it is handled in all monolingual dictionaries (as you can see, they are two separate entries):

<ke_inf>

More often than not, Japanese has several writings for a single word. This element helps in comparing the different writings (especially with regard to which writing is more common).

To give an example without getting into too much detail, the writing "阿弗利加" is classified as "ateji", which means that it is a foreign word (Africa) that was converted phonetically and then forcibly given a kanji writing. Given this information, I can deduce that this kanji writing might not be desirable, as such foreign words are more commonly just written phonetically, in this instance as "アフリカ".

Possible entities are:

<!ENTITY ateji "ateji (phonetic) reading">
<!ENTITY ik "word containing irregular kana usage">
<!ENTITY iK "word containing irregular kanji usage">
<!ENTITY io "irregular okurigana usage">
<!ENTITY oK "word containing out-dated kanji or kanji usage">
<!ENTITY rK "rarely-used kanji form">
<!ENTITY sK "search-only kanji form">
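As a rough illustration of how this information could be read, the following Python sketch maps each kanji writing to its <ke_inf> notes using the standard library's `xml.etree.ElementTree`. The XML fragment is hand-written and trimmed, and `kanji_info` is a hypothetical helper, not an existing pyglossary function. The internal DTD subset declares the one entity the fragment uses so that the parser can expand it.

```python
import xml.etree.ElementTree as ET

# Trimmed, hand-written fragment (assumption: the real entry has more fields).
# The internal DTD subset declares &ateji; so the parser can expand it.
ENTRY_XML = """<!DOCTYPE entry [
<!ENTITY ateji "ateji (phonetic) reading">
]>
<entry>
<k_ele><keb>阿弗利加</keb><ke_inf>&ateji;</ke_inf></k_ele>
<r_ele><reb>アフリカ</reb></r_ele>
<sense><gloss>Africa</gloss></sense>
</entry>"""


def kanji_info(entry):
    """Map each kanji writing (<keb>) to the list of its <ke_inf> notes."""
    return {
        k_ele.findtext("keb"): [inf.text for inf in k_ele.findall("ke_inf")]
        for k_ele in entry.findall("k_ele")
    }


entry = ET.fromstring(ENTRY_XML)
for keb, notes in kanji_info(entry).items():
    # Append the notes in parentheses only when the writing has any.
    suffix = f" ({', '.join(notes)})" if notes else ""
    print(f"{keb}{suffix}")
```

Note that the parser replaces the entity with its description, so the short code (here "ateji") is not available this way; an exporter that wants the short code would need its own lookup from description back to code.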


ilius · 2022-11-12T22:04:00Z

I don't speak Japanese so I can't fully grasp this.
But shouldn't this be fixed in the JMdict itself?

epistularum · 2022-11-13T12:26:39Z

I am not sure I understand what you mean by "But shouldn't this be fixed in the JMdict itself?"
The data is already present within the JMdict file, for instance:

<sense>
<stagr>ひすい</stagr>
<pos>&n;</pos>
<gloss>beautiful lustrous colour similar to that of the kingfisher's feathers</gloss>
</sense>

The stagr element nested in this sense element restricts this particular definition to that particular reading of the word.

For reference, this other definition of the same word has no restriction and thus applies to all readings and writings:

<sense>
<pos>&n;</pos>
<gloss>kingfisher (esp. the common kingfisher, Alcedo atthis)</gloss>
</sense>
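The rule described above (use the <stagr> readings when present, otherwise fall back to all readings of the entry) can be sketched in Python as follows. This is only an illustration under stated assumptions: the surrounding <entry> is hand-written and trimmed to two readings, `applicable_readings` is a hypothetical helper rather than pyglossary code, and the internal DTD subset declares `&n;` so that `xml.etree.ElementTree` can expand it.

```python
import xml.etree.ElementTree as ET

# Trimmed, hand-written entry built around the two senses quoted above.
# The internal DTD subset declares &n; so the parser can expand it.
ENTRY_XML = """<!DOCTYPE entry [
<!ENTITY n "noun (common) (futsuumeishi)">
]>
<entry>
<k_ele><keb>翡翠</keb></k_ele>
<r_ele><reb>かわせみ</reb></r_ele>
<r_ele><reb>ひすい</reb></r_ele>
<sense>
<pos>&n;</pos>
<gloss>kingfisher (esp. the common kingfisher, Alcedo atthis)</gloss>
</sense>
<sense>
<stagr>ひすい</stagr>
<pos>&n;</pos>
<gloss>beautiful lustrous colour similar to that of the kingfisher's feathers</gloss>
</sense>
</entry>"""


def applicable_readings(entry, sense):
    """Return the readings a sense applies to, honouring <stagr> if present."""
    restricted = [el.text for el in sense.findall("stagr")]
    if restricted:
        return restricted
    # No restriction: the sense applies to every reading of the entry.
    return [el.text for el in entry.findall("r_ele/reb")]


entry = ET.fromstring(ENTRY_XML)
for sense in entry.findall("sense"):
    print(applicable_readings(entry, sense), sense.findtext("gloss"))
```

The same pattern would apply to <stagk>, which restricts a sense to particular kanji writings instead of readings.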

Here are some examples of how the data is represented within other projects:
jisho.org

JMdictDB

epistularum mentioned this issue Sep 23, 2022

Read JMDict format #239

Closed

ilius added Feature Improvement labels Nov 12, 2022
