form_of including trailing gloss text #126

Open
medmunds opened this issue Mar 17, 2022 · 7 comments
Labels
fix on wiktionary Problem with wiktionary data: typoes, bad formatting by users, unparseable but human-readable stuff

Comments

@medmunds

When the Wiktionary definition includes both a form template and additional text, that additional text is being incorrectly included in the extracted form_of root word. Some examples:

"wetlands" - kaikki.org (2022-03-15 extraction)

      "form_of": [
        {
          "extra": "a marsh",
          "word": "wetland An area or region that is characteristically saturated"
        }
      ],

"wetlands" - Wiktionary

# {{plural of|en|wetland}} An area or region that is characteristically saturated; a marsh.

"shields" (verb entry) - kaikki.org (2022-03-15 extraction)

      "form_of": [
        {
          "word": "shield. Protects"
        }
      ],

"shields" - Wiktionary

# {{en-third-person singular of|shield}}. Protects

(Arguably these Wiktionary entries could use some cleanup, but apparently this pattern is in use.)

@kristian-clausal
Collaborator

This is an issue that can't easily be fixed through coding. I'll take some time tomorrow to go through a list of obvious problematic cases we generated today and fix them in Wiktionary. These are articles in languages other than English, and the filter we used was basically "native-language-word english-language-word" when there are exactly two words in the result, which catches a lot of examples where you have a translation that is a single word long: "Form of fragā strawberry", for example.

In the case of wetlands style stuff, the correct thing to do is to use the |t=| parameter in {{plural of}}: https://en.wiktionary.org/wiki/Template:plural_of -- I've corrected the wetlands article if you want to take a look.

@kristian-clausal
Collaborator

I spent the whole day going through a short list of error candidates for this: basically, "form of" entries that are suspicious and have a native-language word plus English-language words as their contents (fraga strawberry).
Many were false positives(?) where the second part of the pair contains a native-language word that has the same spelling as something in English (lots of "in" in Swedish examples, for example), and these were correctly identified as "form of" entries.
But I also found a ton of erroneously parsed glosses in the form of "root of plant" or "indicative of a farm".
However, the biggest and most annoying offenders were the Romanian entries that some random Wiktionarian created over a decade ago, which consistently just did {{form-of-template-stuff}} [[one-word-translation]].

This isn't really feasible to edit by hand, like I did. If we want this to be edited on Wiktionary, we might need to learn about Wikimedia bots that would execute lists of generated edits (inserting |t=|) after getting user approval.
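
The two-word filter described in this comment could be sketched roughly as follows. This is only an illustration, not the script that was actually used: the function name `suspicious_form_of` and the `english_words` set are hypothetical, and the record layout (a `senses` list whose items may carry a `form_of` list) is assumed from the kaikki.org excerpts quoted at the top of the thread.

```python
from typing import Iterable, Iterator


def suspicious_form_of(
    records: Iterable[dict], english_words: set[str]
) -> Iterator[tuple[str, str]]:
    """Yield (headword, form_of word) pairs that look like 'base translation'."""
    for record in records:
        for sense in record.get("senses", []):
            for form in sense.get("form_of", []):
                parts = form.get("word", "").split()
                # Flag only values with exactly two tokens whose second token
                # is a known English word (hypothetical word list).
                if len(parts) == 2 and parts[1].lower() in english_words:
                    yield record.get("word", ""), form["word"]
```

As noted above, a filter like this also flags legitimate two-word bases whose second word happens to be spelled like an English word, so its output is a list of candidates to review rather than a list of errors.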

@medmunds
Author

medmunds commented Apr 5, 2022

Thanks for digging into this. As you say, there are a ton (several tons!) of poorly constructed entries. And the correct form_of is not always a single word (see the second example below). I've also noticed similar problems with alt_of extraction.

Does wiktextract have access to either the wiki source or the rendered HTML where it's trying to extract the base form? I think it's often unambiguous from that, because the base form is a single linked item—either a template parameter or a raw [[wiki link]]—no matter what other text follows.
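
For the wiki-source side of this idea, a minimal sketch might look like the following. It is not wiktextract code: `base_from_wikitext` is a hypothetical helper, and it assumes that any template whose name ends in " of" is a form-of template, and that the base is the second positional argument when a language code comes first and the only positional argument otherwise. Nested templates and piped links inside arguments are not handled.

```python
import re
from typing import Optional

# Matches a flat template such as {{plural of|en|jake}} and captures the
# template name and the raw argument string. Nested templates are not matched.
TEMPLATE_RE = re.compile(r"\{\{([^{}|]+)\|([^{}]*)\}\}")


def base_from_wikitext(gloss_source: str) -> Optional[str]:
    """Return the base form named by the first '... of' template, if any."""
    for match in TEMPLATE_RE.finditer(gloss_source):
        name = match.group(1).strip()
        if not name.endswith(" of"):
            continue
        # Keep positional arguments only; named ones such as t=... are dropped.
        positional = [a.strip() for a in match.group(2).split("|") if "=" not in a]
        if len(positional) >= 2:
            # Assumed layout: language code first, base form second.
            return positional[1] or None
        if positional:
            # Assumed layout: language-specific template with the base only.
            return positional[0] or None
    return None
```

Because the base comes from the template argument, whatever text follows the closing braces never reaches the result, and a gloss with no template at all yields `None`.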

Three examples (2022-04-04 extraction):

  • "jakes" (English noun sense 1)

    • extracted form_of.word: "jake in its various senses"
    • correct form_of.word: "jake"
    • wiki source: # {{plural of|en|jake}} ''in its various senses''.
    • rendered: plural of jake in its various senses.
    • rendered html: <span class="form-of-definition use-with-mention"><a href="/wiki/Appendix:Glossary#plural_number" title="Appendix:Glossary">plural</a> of <span class="form-of-definition-link"><i class="Latn mention" lang="en"><a href="/wiki/jake#English" title="jake">jake</a></i></span></span> <i>in its various senses</i>.
  • "ion channels" (English noun)

    • extracted form_of.word: "ion channel May refer to either multiple ion channel classes or to multiple members of a single class"
    • correct form_of.word: "ion channel"
    • wiki source: # {{plural of|en|ion channel}} May refer to either multiple ion channel classes or to multiple members of a single class.
    • rendered: plural of ion channel May refer to either multiple ion channel classes or to multiple members of a single class.
    • rendered html: <span class="form-of-definition use-with-mention"><a href="/wiki/Appendix:Glossary#plural_number" title="Appendix:Glossary">plural</a> of <span class="form-of-definition-link"><i class="Latn mention" lang="en"><a href="/wiki/ion_channel#English" title="ion channel">ion channel</a></i></span></span> May refer to either multiple ion channel classes or to multiple members of a single class.
  • "corgies" (English noun)

    • extracted form_of.word: "corgi or corgy"
    • correct form_of.word: "corgi" (or ideally separate form_of items for both "corgi" and "corgy", but that's probably not reasonable)
    • wiki source: # {{plural of|en|corgi}} or '''[[corgy]]'''
    • rendered: plural of corgi or corgy
    • rendered html: <span class="form-of-definition use-with-mention"><a href="/wiki/Appendix:Glossary#plural_number" title="Appendix:Glossary">plural</a> of <span class="form-of-definition-link"><i class="Latn mention" lang="en"><a href="/wiki/corgi#English" title="corgi">corgi</a></i></span></span> or <b><a href="/wiki/corgy" title="corgy">corgy</a></b>
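
The rendered-HTML variant can be sketched with the standard library's `html.parser`. The class and function names here are hypothetical, and the only thing assumed about the markup is what the three snippets above show: the base form sits inside a `<span class="form-of-definition-link">` element.

```python
from html.parser import HTMLParser


class FormOfLinkParser(HTMLParser):
    """Collect the text inside <span class="form-of-definition-link">."""

    def __init__(self) -> None:
        super().__init__()
        self.depth = 0  # span nesting depth while inside the target span
        self.bases: list[str] = []
        self._buf: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1  # nested span inside the target span
        elif "form-of-definition-link" in classes:
            self.depth = 1
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.bases.append("".join(self._buf).strip())

    def handle_data(self, data):
        if self.depth:
            self._buf.append(data)


def bases_from_html(html: str) -> list[str]:
    """Return the text of every form-of-definition-link span in the snippet."""
    parser = FormOfLinkParser()
    parser.feed(html)
    parser.close()
    return parser.bases
```

Fed the "jakes" or "ion channels" HTML above, this picks up only the linked base and ignores the trailing text. For "corgies" it returns "corgi" alone, because the second variant is a plain bold link outside the form-of span.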

Extracting form_of from the wiki source or html would also avoid some false positives where the gloss happens to resemble a form-of definition. If there's not a link in the definition, it's not really a form_of. E.g.:

  • "ladyship": "Formal form of address for a lady judge", but this word should not be a form_of "address for a lady judge"
  • "ill": "Indicative of unkind or malevolent intentions", but this word should not be a form_of "unkind or malevolent intentions"
@tatuylonen
Owner

I decided to implement simpler but less general fixes at this time (I am also not sure if the link approach would always work; I don't think there are links in all cases even though I don't immediately have a counterexample).

I fixed "jakes" and other similar cases by recognizing "in its various senses" as not being part of base.
I fixed "ion channels" and other similar cases (if any) by recognizing "May refer to" as not being part of base.
I fixed "corgies" by recognizing "a or b" syntax for base, where "a" and "b" are similar, as being variants and generating two distinct form_of/alt_of entries.
I already had code that was supposed to prevent "form of address" from resulting in form_of, but it did not recognize the case with "formal" at the beginning. I changed it to suppress "form_of" interpretation if "form of address" occurs anywhere in the gloss. This fixed "ladyship".
I fixed "ill" by removing "indicative" from the form_of_tags table in tags.py (I checked and there are no valid form_of glosses with just "indicative of").
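
To make the shape of these text-level fixes concrete, here is a rough sketch. It is not the code that went into wiktextract: the function names, the regular expression, and the 0.6 similarity threshold are all assumptions chosen for illustration, and `difflib` stands in for whatever similarity test the real implementation uses.

```python
import difflib
import re

# Trailing commentary that is not part of the base form. The two phrases are
# the ones named in this comment; the regex itself is an assumption.
TRAILING_RE = re.compile(r"\s+(?:in its various senses|May refer to)\b.*$")


def split_base(base: str) -> list[str]:
    """Strip trailing commentary, then split 'a or b' when a and b look alike."""
    base = TRAILING_RE.sub("", base).strip().rstrip(".")
    parts = base.split(" or ")
    if len(parts) == 2:
        a, b = (p.strip() for p in parts)
        # Assumption: a SequenceMatcher ratio of at least 0.6 means "similar".
        if a and b and difflib.SequenceMatcher(None, a, b).ratio() >= 0.6:
            return [a, b]
    return [base]


def may_be_form_of(gloss: str) -> bool:
    """Return False when 'form of address' occurs anywhere in the gloss."""
    return "form of address" not in gloss.lower()
```

Each returned base would become its own form_of/alt_of entry, which is how "corgi or corgy" ends up as two entries while a phrase-stripped "jake" stays a single one.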

Let me know if you find other similar issues.

The changes should be reflected on https://kaikki.org in a couple of days.

@tatuylonen
Owner

I also did a few other changes, including not interpreting "root of" as a form_of. I checked and none of the "root of" glosses was really a form_of.

@medmunds
Author

This sounds like it should help, thanks. I'm grabbing the latest data now, and will update this after taking a look.

Incidentally, I found these by looking through cases where form_of[].word doesn't have any entry in the same language. (I'm mainly looking at English right now, but the principle should apply to other languages.) Of course, a lot of those are problems in the Wiktionary source itself, rather than extraction problems.
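
A check along these lines could be sketched as below. The function name is hypothetical, and the field names (`word`, `lang_code`, `senses`, `form_of`) are assumed from the kaikki.org JSON lines format as quoted earlier in this thread. For simplicity the sketch holds all records in memory; a full dump would probably need two streaming passes instead.

```python
import json
from typing import Iterable


def dangling_form_of(lines: Iterable[str]) -> list[tuple]:
    """List (lang_code, headword, base) where base has no entry in that language."""
    records = [json.loads(line) for line in lines if line.strip()]
    # Every (language, headword) pair that has an entry in the extraction.
    known = {(r.get("lang_code"), r.get("word")) for r in records}
    missing = []
    for r in records:
        for sense in r.get("senses", []):
            for form in sense.get("form_of", []):
                key = (r.get("lang_code"), form.get("word"))
                if key not in known:
                    missing.append((r.get("lang_code"), r.get("word"), form.get("word")))
    return missing
```

The result mixes extraction errors with problems in the Wiktionary source itself, such as bases that simply have no entry yet, so it still needs manual triage.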

@medmunds
Author

That led to a substantial improvement in the 2022-04-29 extraction—thanks!

> Let me know if you find other similar issues.

There are 126 form_of that seem to be parsing problems in the English 2022-04-29 extraction—full list attached below. Here are some patterns that might be worth special casing:

  • "root [an] alternative form/spelling/letter-case form of…" (benempt, breengeing, cardsharpers, etiologies, wassocks, etc.)
  • "root referring to…" (hypodiploidies, macroreentries, sakes, etc.)
  • extend the "rootA or rootB" syntax to allow "or of" (espaliere form-of "espauliere or of epauliere")
  • "root gerund of…", "root plural of…" and similar (matings, stainings, etc.)
  • "root in its countable senses" (toshes, works)

Also, "so" gloss "Reduced form of 'so that'" is being parsed as a form_of 'so that' (including the quotes). This might be similar to dot removal (#125) but with quotes: so that has an entry, 'so that' does not. (Also, the gloss "reduced form of root" should maybe result in alt-of with tag:ellipsis or tag:clipping, rather than form-of, but that's a different issue.)

Again, I suspect examining rendered links or template parameters could solve all of these. But if that isn't feasible, then the remaining cases are probably best fixed by editing Wiktionary. (And the list is short enough now that I'll probably start doing that.)

Attachment: form-of-misparsed.txt

@kristian-clausal kristian-clausal added the fix on wiktionary Problem with wiktionary data: typoes, bad formatting by users, unparseable but human-readable stuff label May 2, 2022
3 participants