
English translation of examples sometimes missing, included in the original text #604

29jm opened this issue Apr 24, 2024 · 16 comments

Comments

@29jm

29jm commented Apr 24, 2024

Some examples in some Czech words have an issue: instead of having both `text` and `english` fields in the example object, there's only `text`, containing a concatenation of both, e.g. "the original text ― the translated text".

Some examples:

  • ani: only the last example has the issue
  • dobrý: same
  • It's not necessarily the last example, though; see for instance zničit.
  • Some more like this: vzít, hlavní, čas, žít, paní, jméno, smrt, kniha, psát, názor, světový, osobní, minulý, onen, umění, věk, telefon, zástupce, ženský.

If it's a problem in the page markup, I can fix it there, but I didn't see what could cause it.

@kristian-clausal
Collaborator

kristian-clausal commented Apr 24, 2024

Yeah, this is buggy, I'll take a look at it tomorrow.

@kristian-clausal
Collaborator

Sorry, I got stuck trying to figure out a bug(?) with our logging system, so this might have to wait a while.

@29jm
Author

29jm commented Apr 25, 2024

No worries at all, I know how it is :)

@kristian-clausal
Collaborator

kristian-clausal commented Apr 26, 2024

Found part of the issue, and it's a silly one: the examples I looked at (ani and vzít) had a word that was blacklisted from our "what is an English word" set: "He", with a capital H. Our function that determines what kind of language or function a string represents rejects these strings because 1/4 of the words get classified as non-English, so the whole thing gets lumped together as a romanization string and the example stays concatenated.

Actually, looking at vzít, the broken example doesn't get through because it's two-thirds Czech names with one English word.
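
For illustration, the kind of ratio heuristic described here can be sketched roughly like this (the word set and the threshold are made up for the example, not the actual wiktextract values):

```python
# Hypothetical sketch of a ratio-based "is this English?" heuristic.
# ENGLISH_WORDS stands in for a real corpus vocabulary (e.g. Brown);
# a single unrecognized capitalized name can sink a short example.
ENGLISH_WORDS = {"the", "king", "married", "a", "princess"}

def looks_english(text, threshold=0.75):
    tokens = [t.strip(".,;?!\"'").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return False
    known = sum(1 for t in tokens if t in ENGLISH_WORDS)
    return known / len(tokens) >= threshold

print(looks_english("The king married a princess"))  # True
print(looks_english("Eliška married Tomáš"))         # False: 1/3 recognized
```

With a common word like "He" missing from the set, a short four-word example drops to a 3/4 ratio, which is exactly the kind of borderline failure described above.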

I'm writing this message as I go through the examples, and hlavní is a completely different issue: the example isn't in a template that we accept as an 'example' (ux) template, but in a coi template, related to collocations. It outputs the same markup as an example template, and the same kind of content (looks like an example, quacks like an example), so I just added that template and its aliases to our list of "ux" templates.

I'm going to commit these, and hopefully most, if not all, of the issues will be addressed. If you find that some weren't fixed, just point them out.

Unless we add a bunch of Czech names (which is the simplest way), examples that are just "Eliška married Tomáš" are impossible to classify with a simple heuristic.

@29jm
Author

29jm commented Apr 26, 2024

Unfortunate that Wiktionary doesn't tag the translation of examples with a language! If I understood correctly, this is why wiktextract has to implement these heuristics.

Thanks for the fixes already, I'll have a look at the json within the next week!

@29jm
Author

29jm commented May 1, 2024

I'm looking at the most recent JSON, and some problematic words like ani, čas and žít are fixed, while others aren't, e.g. hlavní, názor, osobní. I think many of the remaining ones use coi, though not all; for instance Ukrajina uses uxi.

(in case it helps, a pastebin of all of them here)

@kristian-clausal
Collaborator

Many of those fail simply because they contain words that are not in nltk.corpus.brown: words like "cellphone", "mousetrap", "He's" (with a Unicode apostrophe or other character), "dumbfuck", "peppermint"... Hrm, many of these could be fixed if we could somehow cheaply detect compound words. If anyone has an idea of how to do this super-cheaply ("peppermint" contains "pepper" and "mint" smooshed together)...
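
A cheap compound check along the lines suggested could look like this; `WORDS` is a tiny stand-in for the real vocabulary, and trying every split point is just one naive option:

```python
# Hypothetical sketch: a word counts as a compound if it splits into
# two words that are both in the vocabulary set. WORDS stands in for
# something like the nltk Brown corpus vocabulary.
WORDS = {"pepper", "mint", "mouse", "trap", "cell", "phone"}

def is_compound(word):
    word = word.lower()
    return any(word[:i] in WORDS and word[i:] in WORDS
               for i in range(1, len(word)))

print(is_compound("peppermint"))  # True: "pepper" + "mint"
print(is_compound("mousetrap"))   # True: "mouse" + "trap"
print(is_compound("zničit"))      # False
```

For a vocabulary of Brown's size this is still only O(len(word)) set lookups per word, so it stays cheap.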

Another category of problem is translations that are basically phrases like "common noun" or "adjective", which get classified as tags by the classifier.

@xxyzz
Collaborator

xxyzz commented May 6, 2024

Can't we use template arguments or expanded HTML tags? The "ux" and "coi" templates put the translation text in the third argument; finding that argument should be easier and more reliable than checking whether words are in English.

@kristian-clausal
Collaborator

I was thinking of that, yeah. We can check whether the arguments map onto the template's expanded output and exit early if they conform to the formatting of examples. There might be some pitfalls with this approach, for example if example templates are used for other things, but in context it might be fine.
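
As a rough sketch of the idea (the template names are from the thread; the argument layout is simplified compared to real parsed wikitext):

```python
# Hypothetical: pull the translation straight from the template arguments
# instead of guessing which half of the expanded text is English.
# Positional args are keyed "1", "2", "3" in this simplified layout.
EXAMPLE_TEMPLATES = {"ux", "uxi", "coi"}

def translation_from_template(name, args):
    if name not in EXAMPLE_TEMPLATES:
        return None
    return args.get("3")

args = {"1": "cs", "2": "hlavní město", "3": "capital city"}
print(translation_from_template("coi", args))  # capital city
```

The early exit matters: when the arguments are recovered like this, none of the English-detection heuristics run at all.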

@kristian-clausal
Collaborator

kristian-clausal commented May 6, 2024

EDIT: This is a post that was left unwritten earlier today, posting it here just for completeness.

"cause of death" slipped through because decode_tags classified it as tags. "of death" is parsed as a tag (a space-containing tag, so not in valid_tags), and "cause" is classified as a "topic" for some reason. There's a small piece of boolean logic that accepts the result if there are topics, or a flag is set, or there are no space-containing tags among the collected tags; because "cause" is in topics, the "no tags with spaces" rejection doesn't trigger. This is a super annoying, probably quite rare edge case.

EDIT: This edgecase should be fixed with the template-arguments fix.

@29jm
Author

29jm commented May 14, 2024

That last PR fixed most of the issues! Here's a pastebin of the remaining ones, from a JSON downloaded today: https://pastebin.com/hMyZBXnh

EDIT: If those remaining ones are due to issues on the Wiktionary side, let me know how and I'll fix them one by one.

@kristian-clausal
Collaborator

I'll take a look at these later, thanks for keeping your eye on the output!

kristian-clausal added a commit that referenced this issue Jul 9, 2024
Issue #604, Czech translations (continued)

In translations like

```
lví ucho ― Leonotis nepetifolia (literally, “lion's ear”)
```

the translation part starting with "Leonotis" has its
classification returned as "taxonomic" due to the heuristics
used in classify_desc().

I've been trying to kludge something better here, but for
this specifically the right call is to change it so that if
a description is classified as either "english" or
"taxonomic", it counts as English. There is no meaningful
distinction here in the examples when trying to figure out
translations.

The heuristics could be better, which is what I tried to
figure out, but it works fine for now...
@kristian-clausal
Collaborator

Sorry that this lapsed, I've kludged something small for:

 ucho       | lví ucho ― Leonotis nepetifolia (literally, “lion's ear”)
 ucho       | sloní ucho ― Haemanthus albiflos (literally, “elephant's ear”)
 ucho       | mořské ucho ― Haliotis tuberculata (literally, “sea ear”)

These above examples should soon be fine, as soon as kaikki.org updates. The change made was to accept "taxonomic" text as "english".

As for these:

 liška      | liška jezerní ― raccoon dog (Nyctereutes procyonoides) (= psík mývalovitý)
 liška      | liška japonská ― raccoon dog (Nyctereutes procyonoides) (= psík mývalovitý)
 liška      | liška mořská ― raccoon dog (Nyctereutes procyonoides) (= psík mývalovitý)
 liška      | liška patagonská ― culpeo, Andean fox (Lycalopex culpaeus) (= pes horský)
 liška      | liška Azarova ― hoary fox (Lycalopex vetulus) (= pes šedý)
 liška      | liška habešská ― Ethiopian wolf (Canis simensis) (= vlček etiopský)
 liška      | liška krátkouchá ― short-eared dog (Atelocynus microtis) (= pes krátkouchý)

There's too much non-English in these. Sometimes a couple of taxonomic names don't trigger the heuristics too much, but here we have 4/6 non-English words.

 dílna      | Škola — dílna lidskosti. ― School — the workshop of humanity.

This breaks because of the extra em dash inside the sentence: we end up with "škola" and "dílna lidskosti [...]".
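
To illustrate the failure mode (a hypothetical sketch, not the actual splitting code): example lines separate original and translation with the horizontal bar "―" (U+2015), and a naive split that treats the in-sentence "—" (U+2014) the same way chops the line into four pieces instead of two.

```python
import re

# "―" (U+2015) separates text from translation; "—" (U+2014) here is
# just punctuation inside the Czech sentence and its English rendering.
line = "Škola — dílna lidskosti. ― School — the workshop of humanity."

naive = re.split(r"\s*[―—]\s*", line)
print(naive)   # four pieces: the sentence got chopped at the wrong dash

better = re.split(r"\s*―\s*", line)
print(better)  # two pieces: only the U+2015 separator splits
```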

 paní       | paní Nováková ― Mrs Nováková (see also -ová)

Too little English.

 esemeska   | Až zítra dojedete do Třebíče, pošlete mi esemesku, ať nemám starost. ― When you arrive in Třebíč tomorrow, send me an SMS so I'm not worried.
 tož        | „Tož demokracii bychom už měli; teď ještě nějaké ty demokraty.“ ― „So for now the democracy we already have – now we need some democrats as well.“ (T. G. Masaryk)

Don't yet know what the problem here is. Suspect it might be punctuation.

The rest seem to contain just rare words that the small corpus we use doesn't recognize. I'll add them to the dictionary manually.

@kristian-clausal
Collaborator

kristian-clausal commented Jul 10, 2024

The issue with esemeska was the use of the template argument inline; it broke a condition (`len(arguments) == 3`), but that was reasonably easy to kludge.

There was an issue with "physical property": it was classified as tags by classify_desc(), because both of those words are actual tag data in some language or other. This was fixed by improving the same block of conditions used in the esemeska kludge above, which had to do with template arguments. Example templates have a third argument for translations, which is usually unnamed (and represented by 3 in our data), but sometimes people use alternative keyed forms of the argument, |translation=Foo instead of just |Foo, which broke the condition.
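
A minimal sketch of that condition fix, assuming the argument layout described above (positional "3" vs. a named key); the extra key names beyond "3" and "translation" are illustrative:

```python
# Hypothetical: accept the translation whether it arrives as the unnamed
# third argument (keyed "3" in parsed data) or under a named key such as
# |translation= (the "t" alias here is an assumption for illustration).
def get_translation(args):
    for key in ("3", "translation", "t"):
        value = args.get(key)
        if value:
            return value
    return None

print(get_translation({"1": "cs", "2": "esemeska", "3": "text message"}))
print(get_translation({"1": "cs", "2": "esemeska",
                       "translation": "text message"}))
# both print: text message
```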

I'll check out some of the issues that are left (the ones where there are too many "―" long dashes) and maybe the liška stuff.

kristian-clausal added a commit that referenced this issue Jul 11, 2024
Issue #604, Czech translations (continued)

@kristian-clausal
Collaborator

I've committed some more fixes. The only ones left are the "does not look like English to the classify_desc heuristics" cases in liška and paní; with paní it's also because of the extra text after the template, which means we can't just auto-accept the line as a clean example template. Might not be much to be done about these.

@kristian-clausal
Collaborator

I've just spent HOURS trying to figure out why a regex wasn't working with a specific line, and it turns out there seems to be a rendering bug (or generation bug?) with i + combining acute accent on my computer in all the programs I've tried. Havlín gets normalized to "i" + combining character, which is what breaks the regex, but when it's printed to the terminal it just shows "Havlin", without the combining character... Had to look at the ord()s of the normalized string to find the culprit.

Suffice to say, the issue here is that classify_desc uses a simple regex to gatekeep which text gets checked for being English or not, and that regex doesn't allow anything beyond [a-zA-Z]. I thought of changing the regex to allow extended Latin letters in words starting with uppercase (which is a pain, because native Python regexes don't have [[:upper:]], so I did it by hand), but that didn't work... because the normalized string, which I thought was just stripping Havlín down to Havlin, was actually hiding a combining diacritic character that my terminal didn't render!
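
The combining-character trap is easy to reproduce; this is just an illustration of the failure, not the actual classify_desc regex:

```python
import re
import unicodedata

# "í" can be one precomposed code point (NFC) or "i" + U+0301 (NFD).
# Both usually render identically, so the difference is invisible in a
# terminal, but an ASCII-only regex sees them very differently.
nfd = unicodedata.normalize("NFD", "Havlín")
print([hex(ord(c)) for c in nfd])
# [..., '0x69', '0x301', '0x6e'] -> "i" followed by a combining acute

print(re.findall(r"[a-zA-Z]+", nfd))
# ['Havli', 'n'] -> the invisible combining mark splits the match
```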

Anyhow, this is more of a memo to myself to explain what's going on in my local branch when I get back from vacation.
