Read JMDict format #239

Closed
homocomputeris opened this issue Aug 26, 2020 · 8 comments

@homocomputeris

Can PyGlossary convert JMDict files? Maybe the CC-CEDICT reader could be used for this?

@ilius
Owner

ilius commented Aug 27, 2020

The format seems to be different.

@ilius ilius changed the title JMDict/EDICT support Aug 27, 2020
@ilius ilius added the Feature label Aug 27, 2020
@homocomputeris
Author

homocomputeris commented Aug 27, 2020

I'm afraid I'm not good enough at programming to understand exactly how PyGlossary represents a dictionary entry when it reads a file, so I don't think I can help much.
From the JMDict specs I got the impression that it's an XML file that can simply be parsed mechanically to create a dictionary entry in a markup language like DSL.

Here are some projects that may be useful depending on their license

ilius added a commit that referenced this issue Aug 29, 2020
@ilius
Owner

ilius commented Aug 29, 2020

Added JMDict support, please try it.

@homocomputeris
Author

I have successfully converted JMDict to the StarDict format, and GoldenDict 1.5 reads it with no problems.

It would be great to be able to drop unneeded languages during conversion:

<gloss xml:lang="rus">...</gloss>
<gloss xml:lang="spa">...</gloss>
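
A minimal sketch of the kind of filtering requested above, using only Python's standard `xml.etree.ElementTree` rather than any PyGlossary API. The function name `drop_languages`, the `keep` parameter, and the sample entry are hypothetical; the sketch also assumes that a `<gloss>` without an `xml:lang` attribute is English.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def drop_languages(entry, keep=("eng",)):
    """Remove <gloss> children of each <sense> whose language is not in `keep`.

    Assumption: a <gloss> with no xml:lang attribute counts as "eng".
    """
    for sense in entry.iter("sense"):
        # Copy the list first, because we remove children while looping.
        for gloss in list(sense.findall("gloss")):
            if gloss.get(XML_LANG, "eng") not in keep:
                sense.remove(gloss)
    return entry


# Hand-written sample, not copied from the real JMDict file.
SAMPLE = """<entry>
<sense>
<gloss>jade</gloss>
<gloss xml:lang="rus">(Russian gloss)</gloss>
<gloss xml:lang="spa">(Spanish gloss)</gloss>
</sense>
</entry>"""

entry = drop_languages(ET.fromstring(SAMPLE))
remaining = [g.text for g in entry.iter("gloss")]
print(remaining)
```

Passing `keep=("eng", "rus")` would retain the Russian gloss as well.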
@ilius
Owner

ilius commented Aug 30, 2020

Thanks.

There is also an English-only JMDict file for download.

@ilius ilius closed this as completed Aug 30, 2020
@epistularum

epistularum commented Sep 23, 2022

@ilius

> Added JMDict support, please try it.

Thank you very much for supporting the JMdict file format. I have found a very valuable piece of data that is left out on export: it is marked as <stagr> within the XML file.

Bundled explanation of this element within said xml file:

<!ELEMENT stagr (#PCDATA)>
        <!-- These elements, if present, indicate that the sense is restricted
        to the lexeme represented by the keb and/or reb. -->

In practice:
The writing "翡翠" can be read as either "ひすい" or "かわせみ" (among others), but the definitions "jade (gem)" and "beautiful lustrous colour similar to that of the kingfisher's feathers" apply exclusively to the reading "ひすい".

For all intents and purposes "翡翠[ひすい]" and "翡翠[かわせみ]" are two separate words (which is the way they're handled in all monolingual dictionaries) but JMdict chose to combine them (and many others) into a single header.

The current implementation leaves a lot of room for ambiguity.
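
To illustrate how the <stagr> restriction could be honoured, here is a rough sketch that groups the glosses of one entry by reading. It uses only the standard library, not PyGlossary's reader; `senses_by_reading` is a hypothetical helper, the sample entry is hand-written and simplified, and the sketch assumes that a sense with no <stagr> applies to every reading.

```python
import xml.etree.ElementTree as ET


def senses_by_reading(entry):
    """Map each reading (reb) to the glosses that apply to it.

    Assumption: a <sense> without <stagr> applies to all readings;
    a <sense> with <stagr> applies only to the readings it lists.
    """
    readings = [r.findtext("reb") for r in entry.findall("r_ele")]
    result = {reb: [] for reb in readings}
    for sense in entry.findall("sense"):
        restricted = [s.text for s in sense.findall("stagr")]
        glosses = [g.text for g in sense.findall("gloss")]
        for reb in (restricted or readings):
            result.setdefault(reb, []).extend(glosses)
    return result


# Hand-written, simplified sample; not the actual JMdict entry.
SAMPLE = """<entry>
<k_ele><keb>翡翠</keb></k_ele>
<r_ele><reb>かわせみ</reb></r_ele>
<r_ele><reb>ひすい</reb></r_ele>
<sense><gloss>kingfisher</gloss></sense>
<sense><stagr>ひすい</stagr><gloss>jade (gem)</gloss></sense>
</entry>"""

grouped = senses_by_reading(ET.fromstring(SAMPLE))
for reading, glosses in grouped.items():
    print(reading, glosses)
```

With the glosses grouped like this, a converter could emit "翡翠[ひすい]" and "翡翠[かわせみ]" as separate entries, each carrying only its own senses.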

For reference, here is a screenshot of a dictionary I made based on the data extracted by pyglossary:
[image]
Here is how it is handled in all monolingual dictionaries (as you can see, they're two separate entries):
[image]

Less critical but still very useful is the information provided by <ke_inf>, which gives specific information about a given writing. Japanese more often than not has several writings for a single word. This element helps in comparing the different writings (especially in regard to which writing is more common).

To give an example without getting into too much detail: the writing "阿弗利加" is classified as "ateji", which means that it is a foreign word (Africa) that was phonetically converted and then forcibly given a kanji writing. Given this information, I can deduce that this kanji writing might not be desirable, as such foreign words are more commonly just written phonetically, in this instance as "アフリカ".

Possible entities are:

<!ENTITY ateji "ateji (phonetic) reading">
<!ENTITY ik "word containing irregular kana usage">
<!ENTITY iK "word containing irregular kanji usage">
<!ENTITY io "irregular okurigana usage">
<!ENTITY oK "word containing out-dated kanji or kanji usage">
<!ENTITY rK "rarely-used kanji form">
<!ENTITY sK "search-only kanji form">
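
As a rough sketch of how this information could be collected, the snippet below maps each kanji writing to its <ke_inf> values using the standard library only. `kanji_info` is a hypothetical helper and the sample entry is hand-written. In the real file the <ke_inf> content is an entity reference such as `&ateji;`; the sample assumes the parser has already expanded it to the description text.

```python
import xml.etree.ElementTree as ET


def kanji_info(entry):
    """Map each kanji writing (keb) to the list of its <ke_inf> notes."""
    return {
        k.findtext("keb"): [i.text for i in k.findall("ke_inf")]
        for k in entry.findall("k_ele")
    }


# Hand-written sample; <ke_inf> holds the already-expanded entity text.
SAMPLE = """<entry>
<k_ele><keb>阿弗利加</keb><ke_inf>ateji (phonetic) reading</ke_inf></k_ele>
<r_ele><reb>アフリカ</reb></r_ele>
</entry>"""

info = kanji_info(ET.fromstring(SAMPLE))
for writing, notes in info.items():
    print(writing, notes)
```

A converter could append these notes next to each writing in the headword, so that readers can see at a glance which writings are irregular, rare, or ateji.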
@ilius
Owner

ilius commented Sep 23, 2022

@epistularum Can you please open a new issue for this?

@epistularum