Support for Tamil Wiktionary Word Extraction #330

Open
kupilikula opened this issue Sep 8, 2023 · 4 comments
@kupilikula

When I run wiktwords on the dump from https://dumps.wikimedia.org/tawiktionary/latest/ using the following command:

wiktwords --all --language ta --out tamildata.json tawiktionary-latest-pages-meta-history.xml.bz2

I get a lot of error messages of the form:
"DEBUG: unexpected top-level node: <LEVEL6...".

Only a small fraction of the words end up in the output JSON file. Can you add support for the Tamil Wiktionary, https://ta.wiktionary.org/ ?

Thanks!

@xxyzz
Collaborator

xxyzz commented Sep 8, 2023

You downloaded the wrong dump file. You should use "tawiktionary-20230901-pages-articles.xml.bz2" (the file shown in bold on the download page: https://dumps.wikimedia.org/tawiktionary/20230901/).
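To see how the two dump files differ, a small script can count pages and revisions: a "pages-meta-history" dump carries many revisions per page, while a "pages-articles" dump carries only the current one. The following is a minimal sketch and not part of wiktextract; the function names `count_pages_and_revisions` and `count_in_dump` are made up for illustration, and it assumes the standard MediaWiki XML export layout of `<page>` elements containing `<revision>` elements.

```python
import bz2
import xml.etree.ElementTree as ET


def count_pages_and_revisions(stream):
    """Count <page> and <revision> elements in a MediaWiki XML export stream.

    Hypothetical helper for illustration; not part of wiktextract.
    """
    pages = 0
    revisions = 0
    for _event, elem in ET.iterparse(stream, events=("end",)):
        # Tags carry the export namespace, so compare only the local name.
        tag = elem.tag.rsplit("}", 1)[-1]
        if tag == "revision":
            revisions += 1
        elif tag == "page":
            pages += 1
            elem.clear()  # drop the finished page's children to save memory
    return pages, revisions


def count_in_dump(path):
    """Open a .xml.bz2 dump and count its pages and revisions."""
    with bz2.open(path, "rb") as f:
        return count_pages_and_revisions(f)


# Example call (assumes the file has been downloaded locally):
# pages, revisions = count_in_dump("tawiktionary-20230901-pages-articles.xml.bz2")
```

If the revision count is much larger than the page count, the file is most likely a history dump rather than the articles dump.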

wiktextract currently supports Chinese, English, and French Wiktionary dump files; the Chinese and French extractors are both works in progress. Each Wiktionary edition should have its own extractor code in the "extractor" folder; otherwise the English extractor is used. Our priority for now is improving the Chinese and French code. We may support new languages in the future.

@kupilikula
Author

I tried the same command on the dump file you pointed me to and got similar results: many of the same error messages and only a very short list of words.
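One way to quantify "a very short list" is to count the entries in the output file. The sketch below is not from the wiktextract codebase, and it rests on two assumptions: that the file given to `--out` contains one JSON object per line, and that each object has a `lang_code` field. The helper name `count_entries_by_lang` is hypothetical.

```python
import json
from collections import Counter


def count_entries_by_lang(lines):
    """Tally entries per language code from an iterable of JSON lines.

    Assumes one JSON object per line with a "lang_code" key (an assumption
    about the output format); entries without that key are counted under
    "unknown".
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        counts[entry.get("lang_code", "unknown")] += 1
    return counts


# Example call (assumes tamildata.json exists in the current directory):
# with open("tamildata.json", encoding="utf-8") as f:
#     print(count_entries_by_lang(f))
```

Comparing the total against the page count of the dump gives a rough idea of how much was dropped.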

@xxyzz
Collaborator

xxyzz commented Sep 8, 2023

That is because Tamil Wiktionary is not supported. At a minimum, it would need some subtitle data files in the "data" folder, and you would need to pass the "--dump-file-language-code" option.

@kristian-clausal
Collaborator

kristian-clausal commented Sep 8, 2023

Yeah, unfortunately the different editions of Wiktionary are so incompatible that it takes a lot of work to make one of them work with Wiktextract. All the effort up to now has gone into getting en.wiktionary.org to work, and even that is still incomplete after a few years. At least it is now possible to build on the framework and a lot of almost-universal code when writing extractors for other editions. You would also need someone who knows the language the edition is written in (in your case Tamil) to figure out what needs to happen when parsing. They would need to extract all the necessary metadata xxyzz mentioned, which goes in the data directory, and then write the code that handles the pages themselves.
