I have found two valuable elements that are left out on export. <stagr> is critical to understanding some definitions, while <ke_inf> is very nice to have but its absence doesn't fundamentally break the format.
<stagr>
The explanation of this element bundled in the JMdict XML file itself:
<!ELEMENT stagr (#PCDATA)>
<!-- These elements, if present, indicate that the sense is restricted
to the lexeme represented by the keb and/or reb. -->
In practice:
The writing "翡翠" can be read as either "ひすい" or "かわせみ" (among others), but the definitions "jade (gem)" and "beautiful lustrous colour similar to that of the kingfisher's feathers" apply exclusively to the reading "ひすい".
For all intents and purposes "翡翠[ひすい]" and "翡翠[かわせみ]" are two separate words (which is the way they're handled in all monolingual dictionaries) but JMdict chose to combine them (and many others) into a single header.
Because the export currently drops <stagr>, nothing indicates which definitions belong to which reading, which leaves a lot of room for ambiguity.
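For reference, here is a screenshot of a dictionary I made from the data extracted by pyglossary:
![image](https://cdn.statically.io/img/user-images.githubusercontent.com/34507493/191894126-d249e65b-cee1-4697-995d-e2d4ac2d689e.png)
Here is how it is handled in all monolingual dictionaries (as you can see, they are two separate entries):
![image](https://cdn.statically.io/img/user-images.githubusercontent.com/34507493/191894309-76af42f9-b60d-4cd9-98ee-31f2cbf2a8d9.png)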
<ke_inf>
More often than not, Japanese has several writings for a single word. This element helps in comparing the different writings (especially with regard to which writing is more common).
To give an example without going into too much detail: the writing "阿弗利加" is classified as "ateji", which here means that a foreign word (Africa) was transcribed phonetically and then given a kanji writing. From this information, I can deduce that this kanji writing is probably not the preferred one, since such foreign words are more commonly written phonetically, in this case as "アフリカ".
Possible entities are:
<!ENTITY ateji "ateji (phonetic) reading">
<!ENTITY ik "word containing irregular kana usage">
<!ENTITY iK "word containing irregular kanji usage">
<!ENTITY io "irregular okurigana usage">
<!ENTITY oK "word containing out-dated kanji or kanji usage">
<!ENTITY rK "rarely-used kanji form">
<!ENTITY sK "search-only kanji form">
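As a rough illustration of how an exporter could pick up this information, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The entry below is abbreviated and hand-written for illustration (it is not copied verbatim from JMdict), and `kanji_forms_with_info` is a hypothetical helper name, not part of pyglossary. Note that the parser expands the entity, so the element text comes back as the long description rather than the short code.

```python
import xml.etree.ElementTree as ET

# Abbreviated, hand-written entry for illustration only.
# The entity declarations mirror two of the entities listed above.
KE_INF_XML = """<!DOCTYPE entry [
<!ENTITY ateji "ateji (phonetic) reading">
<!ENTITY rK "rarely-used kanji form">
]>
<entry>
  <k_ele><keb>阿弗利加</keb><ke_inf>&ateji;</ke_inf></k_ele>
  <r_ele><reb>アフリカ</reb></r_ele>
</entry>"""


def kanji_forms_with_info(entry):
    """Return (kanji writing, [ke_inf notes]) pairs for one entry.

    Hypothetical helper: a writing without any <ke_inf> child gets an
    empty list of notes.
    """
    forms = []
    for k_ele in entry.findall("k_ele"):
        keb = k_ele.findtext("keb")
        notes = [inf.text for inf in k_ele.findall("ke_inf")]
        forms.append((keb, notes))
    return forms


ke_entry = ET.fromstring(KE_INF_XML)
for keb, notes in kanji_forms_with_info(ke_entry):
    # Show each kanji writing followed by its notes, if any.
    print(keb, "; ".join(notes))
```

An exporter could then append these notes next to each kanji writing in the headword, so that readers can tell the unusual writings from the common ones.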
I am not sure I understand what you mean by "But shouldn't this be fixed in the JMdict itself?"
The data is already present within the JMdict file, for instance:
<sense>
<stagr>ひすい</stagr>
<pos>&n;</pos>
<gloss>beautiful lustrous colour similar to that of the kingfisher's feathers</gloss>
</sense>
The <stagr> element nested in this <sense> element restricts this particular definition to that particular reading of the word.
For reference, this other definition of the same word has no restriction and thus applies to all readings and writings:
<sense>
<pos>&n;</pos>
<gloss>kingfisher (esp. the common kingfisher, Alcedo atthis)</gloss>
</sense>
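To make the intended behaviour concrete, here is a minimal sketch of how senses could be grouped per reading while honouring `<stagr>`. It uses only Python's standard `xml.etree.ElementTree`; the entry is abbreviated and hand-written from the examples in this issue (only two of the readings are included), and `glosses_by_reading` is a hypothetical helper name rather than an existing pyglossary function.

```python
import xml.etree.ElementTree as ET

# Abbreviated, hand-written entry for illustration only.
# The DOCTYPE declares the &n; entity so the parser can expand it.
ENTRY_XML = """<!DOCTYPE entry [<!ENTITY n "noun (common) (futsuumeishi)">]>
<entry>
  <k_ele><keb>翡翠</keb></k_ele>
  <r_ele><reb>かわせみ</reb></r_ele>
  <r_ele><reb>ひすい</reb></r_ele>
  <sense>
    <pos>&n;</pos>
    <gloss>kingfisher (esp. the common kingfisher, Alcedo atthis)</gloss>
  </sense>
  <sense>
    <stagr>ひすい</stagr>
    <pos>&n;</pos>
    <gloss>jade (gem)</gloss>
  </sense>
  <sense>
    <stagr>ひすい</stagr>
    <pos>&n;</pos>
    <gloss>beautiful lustrous colour similar to that of the kingfisher's feathers</gloss>
  </sense>
</entry>"""


def glosses_by_reading(entry):
    """Map each reading to the glosses that apply to it.

    A sense with one or more <stagr> children applies only to those
    readings; a sense without <stagr> applies to every reading.
    """
    readings = [reb.text for reb in entry.iter("reb")]
    result = {reading: [] for reading in readings}
    for sense in entry.iter("sense"):
        restricted_to = [s.text for s in sense.findall("stagr")]
        glosses = [g.text for g in sense.findall("gloss")]
        for reading in (restricted_to or readings):
            result.setdefault(reading, []).extend(glosses)
    return result


entry = ET.fromstring(ENTRY_XML)
for reading, glosses in glosses_by_reading(entry).items():
    print(reading, glosses)
```

With this grouping, "かわせみ" only receives the kingfisher gloss, while "ひすい" receives all three. A real exporter would also need to handle `<stagk>`, the equivalent restriction on kanji writings, which this sketch ignores.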
Here are some examples of how the data is represented in other projects: jisho.org and JMdictDB.