Crawler for linguistic corpora

google/corpuscrawler

Corpus Crawler

Corpus Crawler is a tool for Corpus Linguistics.

Modern linguistic research relies on language corpora, which are large samples of “real world” text. This crawler helps to build such corpora: it follows links to publicly accessible web pages known to be written in a certain language, removes boilerplate and HTML markup, and writes the resulting text to plaintext files. The crawler implements the Robots Exclusion Standard, and it is intentionally slow so that it places little load on the crawled websites.
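As a rough illustration of the politeness rules described above, the following sketch checks URLs against a site's robots.txt and pauses between requests. It uses only Python's standard `urllib.robotparser`. The user agent name, the default delay, and the function names are assumptions made for this example, not the crawler's actual code or settings.

```python
import time
import urllib.robotparser

USER_AGENT = "ExampleCorpusBot"   # hypothetical name, not the crawler's real user agent
DEFAULT_DELAY_SECONDS = 5.0       # assumed pause between requests


def make_robots_checker(robots_txt):
    """Parse the text of a robots.txt file into a RobotFileParser."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_fetch_plan(parser, urls, sleep=time.sleep):
    """Return the URLs that robots.txt allows, pausing between each one.

    A real crawler would download each allowed page here; this sketch
    only records which URLs would be fetched.
    """
    # Honor a Crawl-delay directive if the site declares one.
    delay = parser.crawl_delay(USER_AGENT) or DEFAULT_DELAY_SECONDS
    allowed = []
    for url in urls:
        if not parser.can_fetch(USER_AGENT, url):
            continue  # excluded by the Robots Exclusion Standard
        if allowed:
            sleep(delay)  # stay slow so the site sees little load
        allowed.append(url)
    return allowed
```

Passing `sleep` as a parameter keeps the function easy to test without actually waiting.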

This is not an official Google product. But if you’re a linguistic researcher, or if you’re writing a spell checker (or similar language-processing software) for an “exotic” language, you might find Corpus Crawler useful.

To build corpora for not-yet-supported languages, please read the contribution guidelines and send us GitHub pull requests.

Supported Languages

IETF BCP47 Code Language Tokens¹
ae Avestan 129K 💾
ae-Latn Avestan (Latin) 141K 💾
am Amharic 2,170K 💾
az Azerbaijani 3,413K 💾
be Belarusian 1,441K 💾
bg Bulgarian 10,597K 💾
bm Bambara 30K 💾
bn Bangla 7,258K 💾
bo Tibetan 5,642K 💾
bs Bosnian 8,993K 💾
ccp Chakma 79K 💾
cs Czech 3,141K 💾
de German 7,894K² 💾
dz Dzongkha 61K 💾
el Greek 5,470K 💾
es Spanish 13,511K² 💾
fa Persian 9,114K 💾
fa-AF Dari 7,363K 💾
fi Finnish 4,837K 💾
fit Tornedalen Finnish 292K 💾
fo Faroese 851K 💾
fuv Nigerian Fulfulde 13K 💾
gsw-u-sd-chag Swiss German (Aargau) 99K 💾
gsw-u-sd-chbe Swiss German (Bern) 73K 💾
gsw-u-sd-chfr Swiss German (Fribourg) 42K 💾
gv Manx Gaelic 152K 💾
ha Hausa 1,775K 💾
hi Hindi 10,004K 💾
hr Croatian 8,188K 💾
id Indonesian 6,634K 💾
ig Igbo 13K 💾
ja Japanese 2,116K 💾
kj Kuanyama 1,474K 💾
kk Kazakh 642K 💾
km Khmer 20,908K 💾
ku Kurdish 2,479K 💾
ky Kyrgyz 4,380K² 💾
lo Lao 4,384K 💾
mk Macedonian 10,422K 💾
mnw Mon 1,836K 💾
mt Maltese 3,331K 💾
my Burmese 1,007K 💾
my-t-d0-zawgyi Burmese (Zawgyi encoding) 593K 💾
pl Polish 7,148K 💾
ps Pashto 7,343K 💾
rm-puter Romansh (Puter) 1,068K 💾
rm-rumgr Romansh (Grischun) 4,794K 💾
rm-surmiran Romansh (Surmiran) 2,540K 💾
rm-sursilv Romansh (Sursilvan) 11,678K 💾
rm-sutsilv Romansh (Sutsilvan) 1,007K 💾
rm-vallader Romansh (Vallader) 5,560K 💾
ro Romanian 13,962K 💾
ru Russian 6,216K² 💾
rw Kinyarwanda 605K 💾
shn Shan 1,435K 💾
sn Shona 2,542K 💾
so Somali 874K 💾
sq Albanian 10,104K 💾
sr-Latn Serbian (Latin) 10,143K 💾
sv Swedish 23,803K² 💾
sw Swahili 8,817K 💾
ta Tamil 1,413K 💾
taq Tamasheq 66K 💾
ti Tigrinya 803K 💾
tr Turkish 13,846K 💾
ug Uyghur 9,493K 💾
uk Ukrainian 12,921K 💾
ur Urdu 3,622K 💾
yo Yoruba 80K 💾

¹ To count tokens, we use an ICU word break iterator and count all tokens whose break status is one of UBRK_WORD_LETTER, UBRK_WORD_KANA, or UBRK_WORD_IDEO. Downloadable files include counts for each token. To get the raw text, run the crawler yourself.
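The counting rule in footnote ¹ can be sketched in Python as follows. The numeric ranges come from ICU's `UWordBreak` enum, where each status is a range running from the named constant up to its `_LIMIT` value. The `count_tokens` helper assumes the third-party PyICU package is installed, and it is an illustration of the rule rather than the code used to produce the published counts.

```python
import collections

# Rule-status ranges from ICU's UWordBreak enum (ubrk.h).
UBRK_WORD_LETTER = 200      # letters: [200, 300)
UBRK_WORD_KANA = 300        # kana: [300, 400)
UBRK_WORD_IDEO = 400        # ideographs: [400, 500)
UBRK_WORD_IDEO_LIMIT = 500


def is_counted_token(rule_status):
    """True for letter, kana, and ideographic words.

    Spaces and punctuation (status 0-99) and numbers (100-199) are excluded.
    """
    return UBRK_WORD_LETTER <= rule_status < UBRK_WORD_IDEO_LIMIT


def count_tokens(text, locale="und"):
    """Count word tokens in text with an ICU word break iterator (needs PyICU)."""
    import icu  # imported here so the helper above works without PyICU

    iterator = icu.BreakIterator.createWordInstance(icu.Locale(locale))
    iterator.setText(text)
    # ICU reports boundaries in UTF-16 code units, so slice a UTF-16 copy.
    units = text.encode("utf-16-le")
    counts = collections.Counter()
    start = iterator.first()
    for end in iterator:
        if is_counted_token(iterator.getRuleStatus()):
            counts[units[2 * start:2 * end].decode("utf-16-le")] += 1
        start = end
    return counts
```

With PyICU available, `sum(count_tokens(text).values())` would give a total comparable to the Tokens column, and the counter itself mirrors the per-token counts mentioned for the downloadable files.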

² Crawl is still in progress; the final number will be larger.

Running the Crawler

./corpuscrawler --language=rm --output=./corpus
