[JMdict] add support for the <stagr> and <ke_inf> entities #388

Open
epistularum opened this issue Sep 23, 2022 · 2 comments


epistularum commented Sep 23, 2022

I have found two valuable JMdict elements that are dropped on export. <stagr> is critical to understanding some definitions, while <ke_inf> is very nice to have, but its absence doesn't fundamentally break the format.

<stagr>

The DTD bundled in the JMdict XML file explains this element as follows:

<!ELEMENT stagr (#PCDATA)>
        <!-- These elements, if present, indicate that the sense is restricted
        to the lexeme represented by the keb and/or reb. -->

In practice:
The writing "翡翠" can be read as either "ひすい" or "かわせみ" (among others), but the definitions "jade (gem)" and "beautiful lustrous colour similar to that of the kingfisher's feathers" apply exclusively to the reading "ひすい".

For all intents and purposes "翡翠[ひすい]" and "翡翠[かわせみ]" are two separate words (which is the way they're handled in all monolingual dictionaries) but JMdict chose to combine them (and many others) into a single header.

Because the current implementation drops <stagr> on export, every definition appears to apply to every reading, which leaves a lot of room for ambiguity.

For reference, here is a screenshot of a dictionary I made based on the data extracted by pyglossary:
image
Here is how it is handled in monolingual dictionaries (as you can see, they are two separate entries):
image

<ke_inf>

More often than not, a Japanese word has several writings. This element helps in comparing the different writings (especially with regard to which writing is more common).

To give an example without going into too much detail: the writing "阿弗利加" is classified as "ateji", which means that it is a foreign word (Africa) that was converted phonetically and then forcibly given a kanji writing. Given this information, I can deduce that this kanji writing might not be desirable, as such foreign words are more commonly written phonetically, in this instance as "アフリカ".

Possible entities are:

<!ENTITY ateji "ateji (phonetic) reading">
<!ENTITY ik "word containing irregular kana usage">
<!ENTITY iK "word containing irregular kanji usage">
<!ENTITY io "irregular okurigana usage">
<!ENTITY oK "word containing out-dated kanji or kanji usage">
<!ENTITY rK "rarely-used kanji form">
<!ENTITY sK "search-only kanji form">
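
To make the request concrete, here is a minimal sketch of reading <ke_inf> next to each kanji writing. It uses only Python's standard library and is not pyglossary's actual code. The sample entry is hand-written for illustration, and the function name kanji_forms_with_info is hypothetical. Only the ateji entity is declared inline so that the snippet stays self-contained; the real file declares all of them in its DTD.

```python
import xml.etree.ElementTree as ET

# Hand-written sample (an assumption, not copied from JMdict). The entity is
# declared in an internal DTD subset so the parser can expand &ateji;.
SAMPLE_XML = """<!DOCTYPE JMdict [
<!ENTITY ateji "ateji (phonetic) reading">
]>
<JMdict>
<entry>
<k_ele>
<keb>阿弗利加</keb>
<ke_inf>&ateji;</ke_inf>
</k_ele>
<r_ele>
<reb>アフリカ</reb>
</r_ele>
</entry>
</JMdict>"""


def kanji_forms_with_info(entry):
    """Return (kanji writing, [ke_inf notes]) pairs for one <entry>."""
    forms = []
    for k_ele in entry.findall("k_ele"):
        keb = k_ele.findtext("keb")
        # A <k_ele> may carry zero or more <ke_inf> children.
        notes = [inf.text for inf in k_ele.findall("ke_inf")]
        forms.append((keb, notes))
    return forms


root = ET.fromstring(SAMPLE_XML)
for entry in root.findall("entry"):
    for keb, notes in kanji_forms_with_info(entry):
        # Show the note in parentheses after the writing, if there is one.
        print(f"{keb} ({'; '.join(notes)})" if notes else keb)
```

An exporter could append these notes to the headword or to its list of alternate writings, so that a reader can tell at a glance which writing is unusual.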

ilius (Owner) commented Nov 12, 2022

I don't speak Japanese so I can't fully grasp this.
But shouldn't this be fixed in the JMdict itself?

epistularum (Author) commented Nov 13, 2022

I am not sure I understand what you mean by "But shouldn't this be fixed in the JMdict itself?"
The data is already present within the JMdict file, for instance:

<sense>
<stagr>ひすい</stagr>
<pos>&n;</pos>
<gloss>beautiful lustrous colour similar to that of the kingfisher's feathers</gloss>
</sense>

The <stagr> element nested inside this <sense> element restricts that definition to the reading "ひすい".

For reference, this other definition of the same word has no restriction and thus applies to all readings and writings:

<sense>
<pos>&n;</pos>
<gloss>kingfisher (esp. the common kingfisher, Alcedo atthis)</gloss>
</sense>
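
As a rough illustration of what honouring <stagr> could look like, here is a minimal sketch using only Python's standard library. It is not pyglossary's actual code. The entry is trimmed by hand (the <pos> elements are left out so that no entity declarations are needed), and the function name senses_by_reading is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-trimmed entry for illustration only; <pos> is omitted on purpose.
ENTRY_XML = """
<entry>
<k_ele><keb>翡翠</keb></k_ele>
<r_ele><reb>かわせみ</reb></r_ele>
<r_ele><reb>ひすい</reb></r_ele>
<sense>
<gloss>kingfisher (esp. the common kingfisher, Alcedo atthis)</gloss>
</sense>
<sense>
<stagr>ひすい</stagr>
<gloss>beautiful lustrous colour similar to that of the kingfisher's feathers</gloss>
</sense>
</entry>
"""


def senses_by_reading(entry):
    """Map each reading to the glosses that apply to it, honouring <stagr>."""
    readings = [reb.text for reb in entry.iter("reb")]
    result = {reading: [] for reading in readings}
    for sense in entry.findall("sense"):
        restricted_to = [tag.text for tag in sense.findall("stagr")]
        glosses = [gloss.text for gloss in sense.findall("gloss")]
        for reading in readings:
            # A sense without <stagr> applies to every reading.
            if not restricted_to or reading in restricted_to:
                result[reading].extend(glosses)
    return result


entry = ET.fromstring(ENTRY_XML.strip())
for reading, glosses in senses_by_reading(entry).items():
    print(reading, glosses)
```

With this grouping, the restricted definition is listed only under "ひすい", while the unrestricted one is listed under both readings. The <stagk> element could be handled the same way for kanji writings.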

Here are some examples of how the data is represented within other projects:
jisho.org
image
JMdictDB
image
