Support for Tamil Wiktionary Word Extraction #330

kupilikula · 2023-09-08T01:22:43Z

When I run wiktwords on the dump from https://dumps.wikimedia.org/tawiktionary/latest/ using the following command:

wiktwords --all --language ta --out tamildata.json tawiktionary-latest-pages-meta-history.xml.bz2

I get a lot of error messages of the form:
"DEBUG: unexpected top-level node: <LEVEL6...".

Only a small fraction of the words end up in the output JSON file. Could you add support for the Tamil Wiktionary (https://ta.wiktionary.org/)?

Thanks!

xxyzz · 2023-09-08T01:38:24Z

You downloaded the wrong dump file. Use "tawiktionary-20230901-pages-articles.xml.bz2" instead (the file shown in bold on the download page: https://dumps.wikimedia.org/tawiktionary/20230901/).
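
For reference, here is a minimal sketch of how the two file names differ. It assumes the naming seen in the file names quoted in this thread; `dump_url` and `is_articles_dump` are hypothetical helpers written for illustration and are not part of wiktextract.

```python
def dump_url(lang_code: str, date: str, kind: str = "pages-articles") -> str:
    """Build the URL of a Wiktionary dump file (hypothetical helper).

    Assumes the layout <lang>wiktionary/<date>/<lang>wiktionary-<date>-<kind>.xml.bz2,
    as in the file names quoted in this thread.
    """
    wiki = f"{lang_code}wiktionary"
    filename = f"{wiki}-{date}-{kind}.xml.bz2"
    return f"https://dumps.wikimedia.org/{wiki}/{date}/{filename}"


def is_articles_dump(filename: str) -> bool:
    """Return True only for the pages-articles dump, not the full-history one."""
    return filename.endswith("-pages-articles.xml.bz2")


if __name__ == "__main__":
    print(dump_url("ta", "20230901"))
    # The file from the original command is a full-history dump, so this prints False.
    print(is_articles_dump("tawiktionary-latest-pages-meta-history.xml.bz2"))
```

The check is only on the file name suffix, so it catches the "pages-meta-history" mix-up but says nothing about whether the edition itself is supported.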

wiktextract currently supports Chinese, English, and French Wiktionary dump files; the Chinese and French extractors are still works in progress. Each Wiktionary edition needs its own extractor code in the "extractor" folder; otherwise the English extractor is used. Our priority for now is improving the Chinese and French code. We may support more editions in the future.

kupilikula · 2023-09-08T01:49:59Z

I ran the same command on the dump file you pointed me to and got similar results: many of the same error messages and only a very short list of words.

xxyzz · 2023-09-08T01:56:03Z

That's because the Tamil Wiktionary is not supported. At a minimum, it would need some subtitle data files in the "data" folder, and you would need to pass the "--dump-file-language-code" option.

kristian-clausal · 2023-09-08T04:43:44Z

Yeah, unfortunately the different editions of Wiktionary are so incompatible that it takes a lot of work to make any one of them work with Wiktextract. All the effort up to now has gone into getting en.wiktionary.org to work, and even that is still incomplete after a few years. At least it's now possible to build on the framework and a lot of almost-universal code when writing extractors for other editions.

You'd also need someone who knows the language the edition is written in (in your case Tamil) to figure out what needs to happen when parsing. They would need to extract all the necessary metadata xxyzz mentioned, which goes in the data directory, and then write the code that handles the pages themselves.
