Corpus Crawler is a tool for Corpus Linguistics.
Modern linguistic research works on language corpora, which are large samples of “real world” text. This crawler helps to build such corpora: it follows links to publicly accessible web pages known to be written in a certain language; it removes boilerplate and HTML markup; finally, it writes its output into plaintext files. The crawler implements the Robots Exclusion Standard, and it is intentionally slow so it does not cause much load on the crawled web sites.
This is not an official Google product. But if you’re a linguistic researcher, or if you’re writing a spell checker (or similar language-processing software) for an “exotic” language, you might find Corpus Crawler useful.
To build corpora for not-yet-supported languages, please read the contribution guidelines and send us GitHub pull requests.
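The polite-crawling behaviour described above (honouring the Robots Exclusion Standard and deliberately waiting between requests) can be illustrated with Python's standard `urllib.robotparser` module. This is only a minimal sketch, not the crawler's actual code: the user agent `examplebot`, the sample robots.txt content, the `example.org` URLs, the 2-second default delay, and the helper names `make_policy`, `polite_delay`, and `fetchable` are all assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def make_policy(robots_txt):
    """Parse robots.txt text into a RobotFileParser without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_delay(parser, user_agent, minimum=2.0):
    """Return the seconds to wait between requests.

    Uses the site's Crawl-delay when it is larger than our own minimum
    (the 2-second minimum is an assumed value, not the crawler's setting).
    """
    declared = parser.crawl_delay(user_agent)
    if declared is None:
        return minimum
    return max(minimum, float(declared))


def fetchable(parser, user_agent, urls):
    """Keep only the URLs that robots.txt allows this user agent to fetch."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    policy = make_policy(ROBOTS_TXT)
    candidates = [
        "https://example.org/news/1",
        "https://example.org/private/x",
    ]
    print(fetchable(policy, "examplebot", candidates))
    print(polite_delay(policy, "examplebot"))
```

A crawl loop built on this sketch would call `time.sleep(polite_delay(...))` between requests to the same host, which keeps the load on each site low at the cost of a slower crawl.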
IETF BCP47 Code | Language | Tokens¹ |
---|---|---|
ae | Avestan | 129K 💾 |
ae-Latn | Avestan (Latin) | 141K 💾 |
am | Amharic | 2,170K 💾 |
az | Azerbaijani | 3,413K 💾 |
be | Belarusian | 1,441K 💾 |
bg | Bulgarian | 10,597K 💾 |
bm | Bambara | 30K 💾 |
bn | Bangla | 7,258K 💾 |
bo | Tibetan | 5,642K 💾 |
bs | Bosnian | 8,993K 💾 |
ccp | Chakma | 79K 💾 |
cs | Czech | 3,141K 💾 |
de | German | 7,894K² 💾 |
dz | Dzongkha | 61K 💾 |
el | Greek | 5,470K 💾 |
es | Spanish | 13,511K² 💾 |
fa | Persian | 9,114K 💾 |
fa-AF | Dari | 7,363K 💾 |
fi | Finnish | 4,837K 💾 |
fit | Tornedalen Finnish | 292K 💾 |
fo | Faroese | 851K 💾 |
fuv | Nigerian Fulfulde | 13K 💾 |
gsw-u-sd-chag | Swiss German (Aargau) | 99K 💾 |
gsw-u-sd-chbe | Swiss German (Bern) | 73K 💾 |
gsw-u-sd-chfr | Swiss German (Fribourg) | 42K 💾 |
gv | Manx Gaelic | 152K 💾 |
ha | Hausa | 1,775K 💾 |
hi | Hindi | 10,004K 💾 |
hr | Croatian | 8,188K 💾 |
id | Indonesian | 6,634K 💾 |
ig | Igbo | 13K 💾 |
ja | Japanese | 2,116K 💾 |
kj | Kuanyama | 1,474K 💾 |
kk | Kazakh | 642K 💾 |
km | Khmer | 20,908K 💾 |
ku | Kurdish | 2,479K 💾 |
ky | Kyrgyz | 4,380K² 💾 |
lo | Lao | 4,384K 💾 |
mk | Macedonian | 10,422K 💾 |
mnw | Mon | 1,836K 💾 |
mt | Maltese | 3,331K 💾 |
my | Burmese | 1,007K 💾 |
my-t-d0-zawgyi | Burmese (Zawgyi encoding) | 593K 💾 |
pl | Polish | 7,148K 💾 |
ps | Pashto | 7,343K 💾 |
rm-puter | Romansh (Puter) | 1,068K 💾 |
rm-rumgr | Romansh (Grischun) | 4,794K 💾 |
rm-surmiran | Romansh (Surmiran) | 2,540K 💾 |
rm-sursilv | Romansh (Sursilvan) | 11,678K 💾 |
rm-sutsilv | Romansh (Sutsilvan) | 1,007K 💾 |
rm-vallader | Romansh (Vallader) | 5,560K 💾 |
ro | Romanian | 13,962K 💾 |
ru | Russian | 6,216K² 💾 |
rw | Kinyarwanda | 605K 💾 |
shn | Shan | 1,435K 💾 |
sn | Shona | 2,542K 💾 |
so | Somali | 874K 💾 |
sq | Albanian | 10,104K 💾 |
sr-Latn | Serbian (Latin) | 10,143K 💾 |
sv | Swedish | 23,803K² 💾 |
sw | Swahili | 8,817K 💾 |
ta | Tamil | 1,413K 💾 |
taq | Tamasheq | 66K 💾 |
ti | Tigrinya | 803K 💾 |
tr | Turkish | 13,846K 💾 |
ug | Uyghur | 9,493K 💾 |
uk | Ukrainian | 12,921K 💾 |
ur | Urdu | 3,622K 💾 |
yo | Yoruba | 80K 💾 |
¹ To count tokens, we use an ICU word break iterator and count all tokens whose break status is one of `UBRK_WORD_LETTER`, `UBRK_WORD_KANA`, or `UBRK_WORD_IDEO`. Downloadable files include counts for each token. To get the raw text, run the crawler yourself.
² Crawl is still in progress; the final number will be larger.
./corpuscrawler --language=rm --output=./corpus
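
The token-counting rule from footnote ¹ can be sketched in Python as follows. This is an illustration under stated assumptions, not the project's own counting script: it assumes the third-party PyICU package (`icu`) is installed, it reads each of the three status names as the start of its range in ICU's `UWordBreak` enum (letters 200 to 299, kana 300 to 399, ideographs 400 to 499), and the helper names `is_counted` and `count_tokens` are hypothetical.

```python
# Rule-status ranges from ICU's UWordBreak enum (ubrk.h).
UBRK_WORD_LETTER = 200       # words containing letters start here
UBRK_WORD_KANA = 300         # words containing kana characters start here
UBRK_WORD_IDEO = 400         # words containing ideographs start here
UBRK_WORD_IDEO_LIMIT = 500   # upper bound (exclusive) of the ideograph range


def is_counted(status):
    """Return True for letter, kana, and ideograph tokens.

    Statuses below 200 (whitespace, punctuation, numbers) are not counted.
    """
    return UBRK_WORD_LETTER <= status < UBRK_WORD_IDEO_LIMIT


def count_tokens(text, locale="en"):
    """Count word tokens in text using an ICU word break iterator.

    Requires PyICU; the import is local so that is_counted() can be
    used without the dependency installed.
    """
    import icu  # third-party: PyICU

    iterator = icu.BreakIterator.createWordInstance(icu.Locale(locale))
    iterator.setText(text)
    count = 0
    # Iterating yields successive boundaries; after each step,
    # getRuleStatus() describes the segment that just ended.
    for _boundary in iterator:
        if is_counted(iterator.getRuleStatus()):
            count += 1
    return count
```

With PyICU available, `count_tokens("Allegra, co fast ti?", "rm")` would count the words and skip the punctuation and spaces between them.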