Read JMDict format #239

homocomputeris · 2020-08-26T21:03:26Z

Can Pyglossary convert the JMDict files? Maybe the CC-CEDICT reader can be used for this?

ilius · 2020-08-27T00:18:34Z

The format seems to be different.

homocomputeris · 2020-08-27T15:18:37Z

I'm afraid I'm not good enough at programming to understand how exactly Pyglossary represents a dictionary entry when it reads a file, so I don't think I can help a lot.
From the JMDict specs, I got the impression that it's an XML file that can simply be parsed mechanically to create a dictionary entry in a markup language like DSL.
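
To make the "just parse the XML" idea concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. It is not how Pyglossary reads the file; the `read_entries` helper and the tiny sample document are made up for illustration, and only the element names (`entry`, `keb`, `reb`, `sense`, `gloss`) come from the JMDict format.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the JMDict element layout (not the real file).
SAMPLE = """\
<JMdict>
<entry>
<k_ele><keb>翡翠</keb></k_ele>
<r_ele><reb>ひすい</reb></r_ele>
<sense><gloss>jade (gem)</gloss></sense>
</entry>
</JMdict>
"""


def read_entries(xml_text):
    """Yield (headwords, definitions) pairs, one per <entry>. Hypothetical helper."""
    root = ET.fromstring(xml_text)
    for entry in root.iter("entry"):
        # Kanji writings (keb) and kana readings (reb) both serve as headwords.
        words = [e.text for e in entry.iter("keb")]
        words += [e.text for e in entry.iter("reb")]
        # One definition string per <sense>, joining its glosses.
        defs = [
            "; ".join(g.text for g in sense.iter("gloss"))
            for sense in entry.iter("sense")
        ]
        yield words, defs


entries = list(read_entries(SAMPLE))
print(entries)
```

For the full JMDict file, something incremental such as `ET.iterparse` would probably be a better fit than loading everything at once, since the file is large.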

Here are some projects that may be useful depending on their license

ilius · 2020-08-29T18:46:23Z

Added JMDict support, please try it.

homocomputeris · 2020-08-30T09:42:37Z

I have successfully converted the JMDict to the StarDict format, and GoldenDict 1.5 reads it with no problems.

It would be great to be able to drop unneeded languages during conversion:

<gloss xml:lang="rus">...</gloss>
<gloss xml:lang="spa">...</gloss>
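
A rough sketch of what such filtering could look like, using only the standard library. The `keep_languages` helper is hypothetical and not a Pyglossary option. It also assumes that a `<gloss>` without an `xml:lang` attribute is English, which is how I understand the JMDict DTD default.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hand-written sample sense; the "..." glosses stand in for real text.
SAMPLE = """\
<sense>
<gloss>jade (gem)</gloss>
<gloss xml:lang="rus">...</gloss>
<gloss xml:lang="spa">...</gloss>
</sense>
"""


def keep_languages(sense, wanted):
    """Remove every <gloss> whose language is not in `wanted`. Hypothetical helper."""
    # Copy the list first, because we remove children while looping.
    for gloss in list(sense.findall("gloss")):
        lang = gloss.get(XML_LANG, "eng")  # assumption: no attribute means English
        if lang not in wanted:
            sense.remove(gloss)
    return sense


sense = keep_languages(ET.fromstring(SAMPLE), {"eng"})
remaining = [g.text for g in sense.findall("gloss")]
print(remaining)
```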

ilius · 2020-08-30T10:59:37Z

Thanks.

There is also an English-only JMDict file for download.

epistularum · 2022-09-23T04:54:15Z

@ilius

Added JMDict support, please try it.

Thank you very much for supporting the JMdict file format. I have found a very valuable piece of data that is left out on export: it is marked as <stagr> within the XML file.

Bundled explanation of this element within said xml file:

<!ELEMENT stagr (#PCDATA)>
        <!-- These elements, if present, indicate that the sense is restricted
        to the lexeme represented by the keb and/or reb. -->

In practice:
The writing "翡翠" can be read as either "ひすい" or "かわせみ" (among others), but the definitions "jade (gem)" and "beautiful lustrous colour similar to that of the kingfisher's feathers" apply exclusively to the reading "ひすい".

For all intents and purposes "翡翠[ひすい]" and "翡翠[かわせみ]" are two separate words (which is the way they're handled in all monolingual dictionaries) but JMdict chose to combine them (and many others) into a single header.

The current implementation leaves a lot of room for ambiguity.
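
To illustrate what honouring <stagr> could mean, here is a minimal sketch that groups glosses by reading. The `senses_by_reading` helper is hypothetical, and the sample entry is a simplified illustration built from the example above (the "kingfisher" sense and its restriction are my own simplification), not a copy of the real JMdict entry. A sense without any <stagr> is assumed to apply to every reading.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written entry; not copied from the JMdict file.
SAMPLE = """\
<entry>
<k_ele><keb>翡翠</keb></k_ele>
<r_ele><reb>かわせみ</reb></r_ele>
<r_ele><reb>ひすい</reb></r_ele>
<sense><stagr>かわせみ</stagr><gloss>kingfisher</gloss></sense>
<sense><stagr>ひすい</stagr><gloss>jade (gem)</gloss></sense>
</entry>
"""


def senses_by_reading(entry):
    """Map each reading to the glosses that apply to it. Hypothetical helper."""
    readings = [r.text for r in entry.iter("reb")]
    result = {reading: [] for reading in readings}
    for sense in entry.iter("sense"):
        restricted = [s.text for s in sense.findall("stagr")]
        glosses = [g.text for g in sense.findall("gloss")]
        # No <stagr> means the sense is assumed to apply to all readings.
        for reading in restricted or readings:
            result.setdefault(reading, []).extend(glosses)
    return result


grouped = senses_by_reading(ET.fromstring(SAMPLE))
print(grouped)
```

With a mapping like this, an exporter could emit "翡翠[ひすい]" and "翡翠[かわせみ]" as separate entries, each carrying only its own senses.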

For reference, here is a screenshot of a dictionary I made based on the data extracted by pyglossary:

Here is how it is handled in all monolingual dictionaries (as you can see, they're two separate entries):

Less critical but still very useful is the information provided by <ke_inf>, which gives specific information about a given writing. More often than not, a Japanese word has several writings. This information helps in comparing the different writings (especially with regard to which writing is more common).

To give an example without going into too much detail: the writing "阿弗利加" is classified as "ateji", which means that it is a foreign word (Africa) that was converted phonetically and then forcibly given a kanji writing. Given this information, I can deduce that this kanji writing might not be desirable, as such foreign words are more commonly just written phonetically, in this instance as "アフリカ".

Possible entities are:

<!ENTITY ateji "ateji (phonetic) reading">
<!ENTITY ik "word containing irregular kana usage">
<!ENTITY iK "word containing irregular kanji usage">
<!ENTITY io "irregular okurigana usage">
<!ENTITY oK "word containing out-dated kanji or kanji usage">
<!ENTITY rK "rarely-used kanji form">
<!ENTITY sK "search-only kanji form">
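
As a sketch of how this could surface in the output, the snippet below appends the short code to each kanji writing. The `label_writings` helper and the `KE_INF_CODES` table are hypothetical; the descriptions are copied from the entity list above. Since this standalone sample has no DTD, <ke_inf> contains the expanded description directly rather than an entity reference such as `&ateji;`.

```python
import xml.etree.ElementTree as ET

# Expanded entity text -> short code, taken from the list above.
KE_INF_CODES = {
    "ateji (phonetic) reading": "ateji",
    "word containing irregular kana usage": "ik",
    "word containing irregular kanji usage": "iK",
    "irregular okurigana usage": "io",
    "word containing out-dated kanji or kanji usage": "oK",
    "rarely-used kanji form": "rK",
    "search-only kanji form": "sK",
}

# Simplified, hand-written entry for the example discussed above.
SAMPLE = """\
<entry>
<k_ele><keb>阿弗利加</keb><ke_inf>ateji (phonetic) reading</ke_inf></k_ele>
<r_ele><reb>アフリカ</reb></r_ele>
</entry>
"""


def label_writings(entry):
    """Return each kanji writing with its <ke_inf> codes appended. Hypothetical helper."""
    labelled = []
    for k_ele in entry.iter("k_ele"):
        keb = k_ele.findtext("keb")
        # Fall back to the raw text if the description is not in the table.
        codes = [KE_INF_CODES.get(i.text, i.text) for i in k_ele.findall("ke_inf")]
        labelled.append(f"{keb} [{', '.join(codes)}]" if codes else keb)
    return labelled


labels = label_writings(ET.fromstring(SAMPLE))
print(labels)
```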

ilius · 2022-09-23T05:19:17Z

@epistularum Can you please open a new issue for this?

epistularum · 2022-09-23T05:28:39Z

#388

ilius changed the title ~~JMDict/EDICT support~~ Aug 27, 2020

ilius added the Feature label Aug 27, 2020

ilius added a commit that referenced this issue Aug 29, 2020

add read support for JMDict, #239

9949e48

ilius closed this as completed Aug 30, 2020
