
English translation of examples sometimes missing, included in the original text #604

29jm opened this issue Apr 24, 2024 · 16 comments

Comments

@29jm

29jm commented Apr 24, 2024

Some examples in some Czech words have an issue: instead of having both `text` and `english` fields in the example object, there's only `text`, containing a concatenation of both, e.g. "the original text ― the translated text".

Some examples:

  • ani: only the last example has the issue
  • dobrý: same
  • It's not necessarily the last example, though; see for instance zničit.
  • Some more like this: vzít, hlavní, čas, žít, paní, jméno, smrt, kniha, psát, názor, světový, osobní, minulý, onen, umění, věk, telefon, zástupce, ženský.

If it's a problem in the page markup, I can fix it there, but I didn't see what could cause it.

@kristian-clausal
Collaborator

kristian-clausal commented Apr 24, 2024

Yeah, this is buggy, I'll take a look at it tomorrow.

@kristian-clausal
Collaborator

Sorry, I got stuck trying to figure out a bug(?) with our logging system, so this might have to wait a while.

@29jm
Author

29jm commented Apr 25, 2024

No worries at all, I know how it is :)

@kristian-clausal
Collaborator

kristian-clausal commented Apr 26, 2024

Found part of the issue, and it's a silly one: the examples I looked at (ani and vzít) had a word that was blacklisted from our "what is an English word" set: "He", with a capital H. Our function that determines what kind of language or function a string represents rejects these strings because 1/4 of the words get classified as non-English, so the whole thing gets lumped together as a romanization string and the example stays concatenated.

Actually, looking at vzít, the broken example doesn't get through because it's two-thirds Czech names with one English word.
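
For illustration, the kind of ratio heuristic described here can be sketched roughly like this (the word set and the threshold are made up for the example, not the actual wiktextract values):

```python
# Hypothetical sketch of a ratio-based "is this English?" heuristic.
# ENGLISH_WORDS stands in for a real corpus vocabulary (e.g. Brown);
# a single unrecognized capitalized name can sink a short example.
ENGLISH_WORDS = {"the", "king", "married", "a", "princess"}

def looks_english(text, threshold=0.75):
    tokens = [t.strip(".,;?!\"'").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return False
    known = sum(1 for t in tokens if t in ENGLISH_WORDS)
    return known / len(tokens) >= threshold

print(looks_english("The king married a princess"))  # True
print(looks_english("Eliška married Tomáš"))         # False: 1/3 recognized
```

With a common word like "He" missing from the set, a short four-word example drops to a 3/4 ratio, which is exactly the kind of borderline failure described above.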

I'm writing this message as I go through the examples, and hlavní is a completely different issue: the example isn't in a template that we accept as an 'example' (ux) template, but in a coi template, related to collocations. It outputs the same markup as an example template, and the same kind of content (looks like an example, quacks like an example), so I just added that template and its aliases to our list of "ux" templates.

I'm going to commit these, and hopefully most, if not all, of the issues will be addressed. If you find that some weren't fixed, just point them out.

Unless we add a bunch of Czech names (which is the simplest way), examples that are just "Eliška married Tomáš" are impossible to classify with a simple heuristic.

@29jm
Author

29jm commented Apr 26, 2024

Unfortunate that Wiktionary doesn't tag the translation of examples with a language! If I understood correctly, this is why wiktextract has to implement these heuristics.

Thanks for the fixes already, I'll have a look at the json within the next week!

@29jm
Author

29jm commented May 1, 2024

I'm looking at the most recent JSON, and some problematic words like ani, čas and žít are fixed, while others aren't, e.g. hlavní, názor, osobní. I think many of the remaining ones use coi, though not all; for instance Ukrajina uses uxi.

(in case it helps, a pastebin of all of them here)

@kristian-clausal
Collaborator

Many of those fail simply because they contain words that are not in nltk.corpus.brown: words like "cellphone", "mousetrap", "He's" (with a Unicode apostrophe or other character), "dumbfuck", "peppermint"... Hrm, many of these could be fixed if we could somehow cheaply detect compound words. If anyone has an idea of how to do this super-cheaply ("peppermint" contains "pepper" and "mint" smooshed together)...
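
A cheap compound check along the lines suggested could look like this; `WORDS` is a tiny stand-in for the real vocabulary, and trying every split point is just one naive option:

```python
# Hypothetical sketch: a word counts as a compound if it splits into
# two words that are both in the vocabulary set. WORDS stands in for
# something like the nltk Brown corpus vocabulary.
WORDS = {"pepper", "mint", "mouse", "trap", "cell", "phone"}

def is_compound(word):
    word = word.lower()
    return any(word[:i] in WORDS and word[i:] in WORDS
               for i in range(1, len(word)))

print(is_compound("peppermint"))  # True: "pepper" + "mint"
print(is_compound("mousetrap"))   # True: "mouse" + "trap"
print(is_compound("zničit"))      # False
```

For a vocabulary of Brown's size this is still only O(len(word)) set lookups per word, so it stays cheap.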

Another category of problem is translations that are basically phrases like "common noun" or "adjective", which get classified as tags by the classifier.

@xxyzz
Collaborator

xxyzz commented May 6, 2024

Can't we use template arguments or expanded HTML tags? The "ux" and "coi" templates put the translation text in the third argument; finding that argument should be easier and more reliable than checking whether words are in English.

@kristian-clausal
Collaborator

I was thinking of that, yeah. We can check whether the arguments map onto the template's expanded output and exit early if they conform to the formatting of examples. There might be some pitfalls with this approach, for example if example templates are used for other things, but in context it might be fine.
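
As a rough sketch of the idea (the template names are from the thread; the argument layout is simplified compared to real parsed wikitext):

```python
# Hypothetical: pull the translation straight from the template arguments
# instead of guessing which half of the expanded text is English.
# Positional args are keyed "1", "2", "3" in this simplified layout.
EXAMPLE_TEMPLATES = {"ux", "uxi", "coi"}

def translation_from_template(name, args):
    if name not in EXAMPLE_TEMPLATES:
        return None
    return args.get("3")

args = {"1": "cs", "2": "hlavní město", "3": "capital city"}
print(translation_from_template("coi", args))  # capital city
```

The early exit matters: when the arguments are recovered like this, none of the English-detection heuristics run at all.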

@kristian-clausal
Collaborator

kristian-clausal commented May 6, 2024

EDIT: This is a post that was left unwritten earlier today, posting it here just for completeness.

"cause of death" slipped through because decode_tags classified it as tags. "of death" is parsed as a tag (a space-containing tag, so not in valid_tags), and "cause" is classified as a "topic" for some reason. There's a small piece of boolean logic that accepts the result if there are topics, or a flag is set, or there are no space-containing tags among the collected tags; because "cause" is in topics, the "no tags with spaces" rejection doesn't trigger. This is a super annoying, probably quite rare edge case.

EDIT: This edgecase should be fixed with the template-arguments fix.

@29jm
Author

29jm commented May 14, 2024

That last PR fixed most of the issues! Here's a pastebin of the remaining ones, from a JSON downloaded today: https://pastebin.com/hMyZBXnh

EDIT: If those remaining ones are due to issues on the Wiktionary side, let me know how and I'll fix them one by one.

@kristian-clausal
Collaborator

I'll take a look at these later, thanks for keeping your eye on the output!

kristian-clausal added a commit that referenced this issue Jul 9, 2024
Issue #604, Czech translations (continued)

In translations like

```
lví ucho ― Leonotis nepetifolia (literally, “lion's ear”)
```

the translation part starting with "Leonotis" has its
classification returned as "taxonomic" due to the heuristics
used in classify_desc().

I've been trying to kludge something better here, but for
this specifically the right call is to change it so that if
a description is classified as either "english" or
"taxonomic", it counts as English. There is no meaningful
distinction here in the examples when trying to figure out
translations.

The heuristics could be better, which is what I tried to
figure out, but it works fine for now...
@kristian-clausal
Collaborator

Sorry that this lapsed, I've kludged something small for:

 ucho       | lví ucho ― Leonotis nepetifolia (literally, “lion's ear”)
 ucho       | sloní ucho ― Haemanthus albiflos (literally, “elephant's ear”)
 ucho       | mořské ucho ― Haliotis tuberculata (literally, “sea ear”)

These above examples should soon be fine, as soon as kaikki.org updates. The change made was to accept "taxonomic" text as "english".

As for these:

 liška      | liška jezerní ― raccoon dog (Nyctereutes procyonoides) (= psík mývalovitý)
 liška      | liška japonská ― raccoon dog (Nyctereutes procyonoides) (= psík mývalovitý)
 liška      | liška mořská ― raccoon dog (Nyctereutes procyonoides) (= psík mývalovitý)
 liška      | liška patagonská ― culpeo, Andean fox (Lycalopex culpaeus) (= pes horský)
 liška      | liška Azarova ― hoary fox (Lycalopex vetulus) (= pes šedý)
 liška      | liška habešská ― Ethiopian wolf (Canis simensis) (= vlček etiopský)
 liška      | liška krátkouchá ― short-eared dog (Atelocynus microtis) (= pes krátkouchý)

There's too much non-English in these. Sometimes a couple of taxonomic names don't trigger the heuristics too much, but here we have 4/6 non-English words.

 dílna      | Škola — dílna lidskosti. ― School — the workshop of humanity.

This breaks because of the extra em dash inside the sentence: we end up with "škola" and "dílna lidskosti [...]".
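
To illustrate the failure mode (a hypothetical sketch, not the actual splitting code): example lines separate original and translation with the horizontal bar "―" (U+2015), and a naive split that treats the in-sentence "—" (U+2014) the same way chops the line into four pieces instead of two.

```python
import re

# "―" (U+2015) separates text from translation; "—" (U+2014) here is
# just punctuation inside the Czech sentence and its English rendering.
line = "Škola — dílna lidskosti. ― School — the workshop of humanity."

naive = re.split(r"\s*[―—]\s*", line)
print(naive)   # four pieces: the sentence got chopped at the wrong dash

better = re.split(r"\s*―\s*", line)
print(better)  # two pieces: only the U+2015 separator splits
```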

 paní       | paní Nováková ― Mrs Nováková (see also -ová)

Too little English.

 esemeska   | Až zítra dojedete do Třebíče, pošlete mi esemesku, ať nemám starost. ― When you arrive in Třebíč tomorrow, send me an SMS so I'm not worried.
 tož        | „Tož demokracii bychom už měli; teď ještě nějaké ty demokraty.“ ― „So for now the democracy we already have – now we need some democrats as well.“ (T. G. Masaryk)

Don't yet know what the problem here is. Suspect it might be punctuation.

The rest seem to contain just rare words that the small corpus we use doesn't recognize. I'll add them to the dictionary manually.

@kristian-clausal
Collaborator

kristian-clausal commented Jul 10, 2024

The issue with esemeska was the use of the template argument inline; it broke a condition (`len(arguments) == 3`), but that was reasonably easy to kludge.

There was an issue with "physical property": it was classified as tags by classify_desc(), because both of those words are actual tag data in some language or other. This was fixed by improving the same block of conditions used in the esemeska kludge above, which had to do with template arguments. Example templates have a third argument for translations, which is usually unnamed (and represented by 3 in our data), but sometimes people use alternative keyed forms of the argument, |translation=Foo instead of just |Foo, which broke the condition.
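
A minimal sketch of that condition fix, assuming the argument layout described above (positional "3" vs. a named key); the extra key names beyond "3" and "translation" are illustrative:

```python
# Hypothetical: accept the translation whether it arrives as the unnamed
# third argument (keyed "3" in parsed data) or under a named key such as
# |translation= (the "t" alias here is an assumption for illustration).
def get_translation(args):
    for key in ("3", "translation", "t"):
        value = args.get(key)
        if value:
            return value
    return None

print(get_translation({"1": "cs", "2": "esemeska", "3": "text message"}))
print(get_translation({"1": "cs", "2": "esemeska",
                       "translation": "text message"}))
# both print: text message
```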

I'll check out some of the issues that are left (the ones where there are too many "―" long dashes) and maybe the liška stuff.

kristian-clausal added a commit that referenced this issue Jul 11, 2024
Issue #604, Czech translations (continued)

@kristian-clausal
Collaborator

I've committed some more fixes. The only ones left are the "does not look like English to the classify_desc heuristics" cases in liška and paní; with paní it's also because of the extra text after the template, which means we can't just auto-accept the line as a clean example template. Might not be much to be done about these.

@kristian-clausal
Collaborator

I've just spent HOURS trying to figure out why a regex wasn't working with a specific line, and it turns out there seems to be a rendering bug (or generation bug?) with i + combining acute accent on my computer in all the programs I've tried. Havlín gets normalized to "i" + combining character, which is what breaks the regex, but when it's printed to the terminal it just shows "Havlin", without the combining character... Had to look at the ord()s of the normalized string to find the culprit.

Suffice to say, the issue here is that classify_desc uses a simple regex to gatekeep which text gets checked for being English or not, and that regex doesn't allow anything beyond [a-zA-Z]. I thought of changing the regex to allow extended Latin letters in words starting with uppercase (which is a pain, because native Python regexes don't have [[:upper:]], so I did it by hand), but that didn't work... because the normalized string, which I thought was just stripping Havlín down to Havlin, was actually hiding a combining diacritic character that my terminal didn't render!
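
The combining-character trap is easy to reproduce; this is just an illustration of the failure, not the actual classify_desc regex:

```python
import re
import unicodedata

# "í" can be one precomposed code point (NFC) or "i" + U+0301 (NFD).
# Both usually render identically, so the difference is invisible in a
# terminal, but an ASCII-only regex sees them very differently.
nfd = unicodedata.normalize("NFD", "Havlín")
print([hex(ord(c)) for c in nfd])
# [..., '0x69', '0x301', '0x6e'] -> "i" followed by a combining acute

print(re.findall(r"[a-zA-Z]+", nfd))
# ['Havli', 'n'] -> the invisible combining mark splits the match
```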

Anyhow, this is more of a memo to myself to explain what's going on in my local branch when I get back from vacation.
