Hi there. First of all, thank you very much for providing this project.
I've been trying to understand the structure of the data, in particular the "glosses" lists. My assumption is that the list contains multiple glosses when there is a nested list in the original data, i.e.
main
  sub 1
  sub 2
turns into { senses: [{ glosses: [main] }, { glosses: [main, sub 1] }, { glosses: [main, sub 2] }] }.
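To make that assumption concrete, here is a minimal sketch of the flattening I have in mind. The `(text, children)` tuple shape and the `flatten` helper are my own illustration, not anything taken from the project's code:

```python
# Hypothetical input shape: each list item is (text, [child items]).
def flatten(item, parents=()):
    """Return one sense per list item, each carrying its ancestors' glosses."""
    text, children = item
    glosses = [*parents, text]
    senses = [{"glosses": glosses}]
    for child in children:
        # Children inherit the full gloss chain of their parent.
        senses.extend(flatten(child, tuple(glosses)))
    return senses


nested = ("main", [("sub 1", []), ("sub 2", [])])
senses = flatten(nested)
```

Under this reading, the last element of each `glosses` list is the item's own text and everything before it comes from its parents.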
I ran some queries for entries that violate that assumption and found a bunch of glosses with template tags of the form :Template:SAFESUBST:#invoke:ordinal in them, for example (rendered on the website):
{
  "word": "iceberg",
  "pos": "noun",
  "senses": [
    {
      "glosses": [
        "The seaward end of a glacier. [:Template:SAFESUBST:#invoke:ordinal–:Template:SAFESUBST:#invoke:ordinal c.]",
        "The seaward end of a glacier."
      ]
    },
    {
      "glosses": [
        "A huge mass of ocean-floating ice which has broken off a glacier or ice shelf [from :Template:SAFESUBST:#invoke:ordinal c.]",
        "A huge mass of ocean-floating ice which has broken off a glacier or ice shelf"
      ]
    },
    {
      "glosses": [
        "An aloof person. [from :Template:SAFESUBST:#invoke:ordinal c.]",
        "An aloof person."
      ]
    },
    {
      "glosses": [
        "An impending disastrous event whose adverse effects are only beginning to show, in reference to one-tenth of the volume of an iceberg being visible above water."
      ]
    }
  ]
}
There were also some cases of entries containing multiple glosses with and without notes in parentheses, leading commas, and trailing colons.
I'm not sure how much of this is intentional but at the very least the template part seems quite wrong. Is there any more info available on the structure of the dataset beyond what's in the README? I'd appreciate any pointers.
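For reference, the check I ran was roughly along these lines. This is a sketch of the idea rather than my exact query, and the rule it encodes (every multi-gloss sense should have its parent chain present as a sense of its own) is only my assumption about the data, not documented behaviour:

```python
def senses_breaking_prefix_rule(entry):
    """Return senses whose parent gloss chain is not itself a sense of the entry."""
    senses = entry.get("senses", [])
    seen = {tuple(s.get("glosses", [])) for s in senses}
    flagged = []
    for sense in senses:
        glosses = tuple(sense.get("glosses", []))
        # A single gloss has no parent chain, so it cannot break the rule.
        if len(glosses) > 1 and glosses[:-1] not in seen:
            flagged.append(sense)
    return flagged


# Abbreviated version of the "iceberg" entry shown above.
iceberg = {
    "word": "iceberg",
    "senses": [
        {"glosses": ["An aloof person. [from :Template:SAFESUBST:#invoke:ordinal c.]",
                     "An aloof person."]},
        {"glosses": ["An impending disastrous event."]},
    ],
}
flagged = senses_breaking_prefix_rule(iceberg)
```

Here the first sense is flagged because its first gloss never appears as a standalone sense, while the single-gloss sense passes.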
The [:Template:SAFESUBST* text appears because the ordinal template, which is used by the century template, is not expanded properly.
Some text between brackets is extracted into the tags or topics fields, but our code can't handle all cases and might break some gloss texts. The original gloss texts are in the raw_glosses field.
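Until the template expansion is fixed, one possible workaround on the consumer side is to filter on the marker text. The following is only a sketch: the `MARKER` constant and both helper names are made up for illustration, and it assumes each sense is a dict with a `glosses` list and an optional `raw_glosses` list as described above.

```python
MARKER = "Template:SAFESUBST"  # substring left behind by the unexpanded template


def has_unexpanded_template(sense):
    """True if any gloss or raw gloss still contains the template marker."""
    texts = sense.get("glosses", []) + sense.get("raw_glosses", [])
    return any(MARKER in text for text in texts)


def usable_glosses(sense):
    """Drop glosses containing the marker; keep the originals if none are clean."""
    glosses = sense.get("glosses", [])
    clean = [g for g in glosses if MARKER not in g]
    return clean or glosses


sense = {
    "glosses": [
        "An aloof person. [from :Template:SAFESUBST:#invoke:ordinal c.]",
        "An aloof person.",
    ]
}
cleaned = usable_glosses(sense)
```

This loses the century information in the bracketed note, so it only makes sense when the plain gloss text is all that is needed.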
The complete list of entries: violating.json