Crawler for linguistic corpora

google/corpuscrawler

Corpus Crawler

Corpus Crawler is a tool for corpus linguistics.

Modern linguistic research works on language corpora, which are large samples of “real world” text. This crawler helps to build such corpora: it follows links to publicly accessible web pages known to be written in a certain language; it removes boilerplate and HTML markup; finally, it writes its output into plaintext files. The crawler implements the Robots Exclusion Standard, and it is intentionally slow so it does not cause much load on the crawled web sites.
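The politeness rules described above (honoring robots.txt and pacing requests) can be illustrated with Python's standard `urllib.robotparser`. This is a minimal sketch and not the crawler's actual code: the user agent name, the default delay, the sample robots.txt, and all helper names are assumptions made up for this example.

```python
import time
import urllib.robotparser

USER_AGENT = "ExampleCorpusBot"   # hypothetical user agent name
DEFAULT_DELAY_SECONDS = 2.0       # hypothetical pause between requests


def build_robots_parser(robots_txt):
    """Parse the text of a robots.txt file that was already downloaded."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def crawlable_urls(parser, urls, user_agent=USER_AGENT):
    """Keep only the URLs that robots.txt allows this user agent to fetch."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


def polite_delay(parser, user_agent=USER_AGENT, default=DEFAULT_DELAY_SECONDS):
    """Use the site's Crawl-delay if it declares one, else the default."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


def crawl(parser, urls, fetch):
    """Fetch each allowed URL with `fetch`, sleeping between requests."""
    pages = []
    for url in crawlable_urls(parser, urls):
        pages.append(fetch(url))
        time.sleep(polite_delay(parser))  # deliberately slow
    return pages


# Sample robots.txt, written by hand for this sketch.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = build_robots_parser(ROBOTS_TXT)
urls = [
    "https://example.org/news/1.html",
    "https://example.org/private/2.html",
]
allowed = crawlable_urls(parser, urls)
delay = polite_delay(parser)
```

Here `allowed` contains only the `/news/` URL, and `delay` follows the site's declared `Crawl-delay`. The `crawl` function takes the download step as a `fetch` callable so that the pacing logic stays independent of any particular HTTP library.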

This is not an official Google product. But if you’re a linguistic researcher, or if you’re writing a spell checker (or similar language-processing software) for an “exotic” language, you might find Corpus Crawler useful.

To build corpora for not-yet-supported languages, please read the contribution guidelines and send us GitHub pull requests.

Supported Languages

IETF BCP47 Code Language Tokens¹
ae Avestan 129K 💾
ae-Latn Avestan (Latin) 141K 💾
am Amharic 2,170K 💾
ar Arabic 14,345K² 💾
az Azerbaijani 3,413K 💾
ba Bashkir 666K 💾
be Belarusian 1,441K 💾
bg Bulgarian 10,597K 💾
bm Bambara 30K 💾
bn Bangla 7,258K 💾
bo Tibetan 5,642K 💾
bs Bosnian 8,993K 💾
ccp Chakma 79K 💾
cs Czech 3,141K 💾
cy Welsh 11,519K 💾
de German 46,431K 💾
dz Dzongkha 61K 💾
el Greek 5,470K 💾
es Spanish 32,670K 💾
fa Persian 9,114K 💾
fa-AF Dari 7,363K 💾
fi Finnish 4,837K 💾
fit Tornedalen Finnish 292K 💾
fo Faroese 851K 💾
fuv Nigerian Fulfulde 13K 💾
ga Irish 298K 💾
gd Scottish Gaelic 17,105K 💾
gsw-u-sd-chag Swiss German (Aargau) 99K 💾
gsw-u-sd-chbe Swiss German (Bern) 73K 💾
gsw-u-sd-chfr Swiss German (Fribourg) 42K 💾
gv Manx Gaelic 152K 💾
ha Hausa 1,775K 💾
haw Hawaiian 2,221K 💾
hi Hindi 10,004K 💾
hr Croatian 8,188K 💾
hy Armenian 25,972K 💾
id Indonesian 6,634K 💾
ig Igbo 13K 💾
iu Inuktitut 98K 💾
ja Japanese 2,116K 💾
kab Kabyle 66K 💾
kj Kuanyama 1,474K 💾
kk Kazakh 642K 💾
km Khmer 29,110K 💾
ku Kurdish 2,479K 💾
ky Kyrgyz 18,597K 💾
la Latin 48K 💾
lb Luxembourgish 5,173K 💾
lo Lao 4,384K 💾
mi Maori 1,504K 💾
mk Macedonian 10,422K 💾
mnw Mon 1,836K 💾
mr Marathi 16,594K 💾
mt Maltese 3,331K 💾
my Burmese 1,007K 💾
my-t-d0-zawgyi Burmese (Zawgyi encoding) 593K 💾
nl Dutch 24,289K² 💾
ny Nyanja 356K 💾
osa Osage 3K 💾
pcm Nigerian Pidgin 315K 💾
pa Punjabi 28,446K² 💾
pl Polish 7,148K 💾
ps Pashto 7,343K 💾
rm-puter Romansh (Puter) 1,068K 💾
rm-rumgr Romansh (Grischun) 4,794K 💾
rm-surmiran Romansh (Surmiran) 2,540K 💾
rm-sursilv Romansh (Sursilvan) 11,678K 💾
rm-sutsilv Romansh (Sutsilvan) 1,007K 💾
rm-vallader Romansh (Vallader) 5,560K 💾
ro Romanian 13,962K 💾
ru Russian 40,987K² 💾
rw Kinyarwanda 605K 💾
sah Sakha 2,457K 💾
shn Shan 1,435K 💾
si Sinhala 1,046K 💾
sl Slovenian 10,975K 💾
sn Shona 2,542K 💾
so Somali 874K 💾
sq Albanian 10,104K 💾
sr-Latn Serbian (Latin) 10,143K 💾
sv Swedish 33,633K 💾
sw Swahili 8,817K 💾
ta Tamil 1,413K 💾
ti Tigrinya 803K 💾
tpi Tok Pisin 8,049K 💾
tr Turkish 13,846K 💾
tt Tatar 1,356K 💾
ug Uyghur 9,493K 💾
uk Ukrainian 12,921K 💾
ur Urdu 3,622K 💾
vec Venetian 2K 💾
vec-u-sd-itpd Venetian (Padua) 813K 💾
vec-u-sd-itts Venetian (Trieste) 12K 💾
vec-u-sd-itvr Venetian (Verona) 16K 💾
yo Yoruba 80K 💾

¹ To count tokens, we use an ICU word break iterator and count all tokens whose break status is one of UBRK_WORD_LETTER, UBRK_WORD_KANA, or UBRK_WORD_IDEO. Downloadable files include counts for each token. To get the raw text, run the crawler yourself.

² Crawl is still in progress; the final number will be larger.
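The counting rule in footnote 1 can be sketched as a filter on the rule status that ICU reports for each segment. The numeric ranges below are taken from ICU's `UWordBreak` enum; the function names and the sample segments are assumptions for illustration. The `(text, status)` pairs are written by hand here, whereas in practice they would come from an ICU word break iterator.

```python
from collections import Counter

# Rule-status ranges from ICU's UWordBreak enum (ubrk.h).
# Each category covers the half-open range [start, limit).
UBRK_WORD_LETTER, UBRK_WORD_LETTER_LIMIT = 200, 300
UBRK_WORD_KANA, UBRK_WORD_KANA_LIMIT = 300, 400
UBRK_WORD_IDEO, UBRK_WORD_IDEO_LIMIT = 400, 500


def is_counted(status):
    """True if the break status is a letter, kana, or ideographic word."""
    return (
        UBRK_WORD_LETTER <= status < UBRK_WORD_LETTER_LIMIT
        or UBRK_WORD_KANA <= status < UBRK_WORD_KANA_LIMIT
        or UBRK_WORD_IDEO <= status < UBRK_WORD_IDEO_LIMIT
    )


def count_tokens(segments):
    """Count tokens from (text, status) pairs, skipping spaces, punctuation, and numbers."""
    counts = Counter()
    for text, status in segments:
        if is_counted(status):
            counts[text] += 1
    return counts


# Hand-written sample segments (assumption: not real ICU output).
segments = [
    ("Allegra", 200),  # letters
    (" ", 0),          # whitespace: UBRK_WORD_NONE range
    ("2024", 100),     # digits: UBRK_WORD_NUMBER range
    (",", 0),          # punctuation
    ("allegra", 200),  # letters
    ("かな", 300),      # kana
    ("語", 400),        # ideograph
]
counts = count_tokens(segments)
total = sum(counts.values())
```

In this sketch, whitespace, punctuation, and the number are excluded, so `total` covers only the four word-like segments. No case folding is applied, because the footnote does not mention any.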

Running the Crawler

./corpuscrawler --language=rm --output=./corpus
