Corpus Crawler is a tool for Corpus Linguistics.
Modern linguistic research works on language corpora, which are large samples of “real world” text. This crawler helps to build such corpora: it follows links to publicly accessible web pages known to be written in a certain language; it removes boilerplate and HTML markup; finally, it writes its output into plaintext files. The crawler implements the Robots Exclusion Standard, and it is intentionally slow so it does not cause much load on the crawled web sites.
This is not an official Google product. But if you’re a linguistic researcher, or if you’re writing a spell checker (or similar language-processing software) for an “exotic” language, you might find Corpus Crawler useful.
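
The behaviour described above (honour robots.txt, pause between requests, strip markup) can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not Corpus Crawler's actual code: the names `TextExtractor`, `html_to_text`, `make_robots_checker`, `fetch_url`, `crawl`, and the `ExampleCrawler` user agent are all hypothetical. It also only drops tags plus `<script>`/`<style>` contents, does not follow links, and does no real boilerplate removal.

```python
import time
import urllib.request
import urllib.robotparser
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Returns the text content of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def make_robots_checker(robots_txt, user_agent="ExampleCrawler"):
    """Builds a url -> bool function from the text of a robots.txt file."""
    rules = urllib.robotparser.RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return lambda url: rules.can_fetch(user_agent, url)


def fetch_url(url):
    """Downloads a page; replaces undecodable bytes instead of failing."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(urls, fetch, is_allowed, delay_seconds=2.0):
    """Fetches each allowed URL, sleeping after every request."""
    texts = {}
    for url in urls:
        if not is_allowed(url):
            continue  # excluded by robots.txt
        texts[url] = html_to_text(fetch(url))
        time.sleep(delay_seconds)  # deliberately slow, to limit server load
    return texts
```

Passing `fetch` and `is_allowed` as parameters keeps the sketch testable without network access; in a real run one would pass `fetch_url` and a checker built from the site's own robots.txt.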
To build corpora for not-yet-supported languages, please read the contribution guidelines and send us GitHub pull requests.
IETF BCP47 Code | Language | Tokens¹ |
---|---|---|
aai | Arifama-Miniafia | 181K 💾 |
aak | Ankave | 194K 💾 |
aby | Aneme Wake | 233K 💾 |
ace | Aceh/Acehnese | 817K 💾 |
ae | Avestan | 129K 💾 |
ae-Latn | Avestan (Latin) | 141K 💾 |
aey | Amele | 218K 💾 |
agd | Agarabi | 256K 💾 |
agg | Angor | 214K 💾 |
agm | Angaataha | 238K 💾 |
akh | Akha | 408K 💾 |
amm | Ama (Papua New Guinea) | 246K 💾 |
amp | Alamblak | 241K 💾 |
am | Amharic | 2,170K 💾 |
aom | Ömie | 231K 💾 |
aon | Bumbita Arapesh | 294K 💾 |
ape | Bukiyip | 294K 💾 |
apr | Arop-Lokep | 373K 💾 |
apz | Safeyoka | 235K 💾 |
ar | Arabic | 14,345K² 💾 |
aso | Dano | 290K 💾 |
ata | Pele-Ata | 248K 💾 |
auy | Awiyaana | 164K 💾 |
avt | Au | 263K 💾 |
awb | Awa (Papua New Guinea) | 179K 💾 |
az | Azerbaijani | 3,413K 💾 |
ba | Bashkir | 666K 💾 |
bbb | Barai | 289K 💾 |
bbr | Girawa | 245K 💾 |
bch | Bariai | 248K 💾 |
bdd | Bunama | 171K 💾 |
bef | Benabena | 239K 💾 |
be | Belarusian | 1,441K 💾 |
bg | Bulgarian | 10,597K 💾 |
bhl | Bimin | 324K 💾 |
big | Biangai | 229K 💾 |
bjr | Binumarien | 226K 💾 |
bmh | Kein | 253K 💾 |
bmu | Somba-Siawari | 234K 💾 |
bm | Bambara | 30K 💾 |
bnp | Bola | 263K 💾 |
bn | Bangla | 7,258K 💾 |
boj | Anjam | 255K 💾 |
bon | Bine | 244K 💾 |
bo | Tibetan | 5,642K 💾 |
bs | Bosnian | 8,993K 💾 |
buk | Bugawac | 264K 💾 |
byx | Qaqet | 387K 💾 |
bzh | Mapos Buang | 251K 💾 |
ccp | Chakma | 79K 💾 |
cjv | Chuave | 286K 💾 |
cs | Czech | 3,141K 💾 |
cy | Welsh | 11,519K 💾 |
dad | Marik | 197K 💾 |
dah | Gwahatike | 274K 💾 |
ded | Dedua | 146K 💾 |
de | German | 46,431K 💾 |
dgz | Daga | 219K 💾 |
dob | Dobu | 179K 💾 |
dww | Dawawa | 208K 💾 |
dz | Dzongkha | 61K 💾 |
el | Greek | 5,470K 💾 |
emi | Mussau-Emira | 176K 💾 |
enq | Enga | 217K 💾 |
eri | Ogea | 269K 💾 |
es | Spanish | 32,670K 💾 |
fa | Persian | 9,114K 💾 |
fa-AF | Dari | 7,363K 💾 |
faa | Fasu | 238K 💾 |
fai | Faiwol | 256K 💾 |
fi | Finnish | 4,837K 💾 |
fit | Tornedalen Finnish | 292K 💾 |
for | Fore | 169K 💾 |
fo | Faroese | 851K 💾 |
fuv | Nigerian Fulfulde | 13K 💾 |
gah | Alekano | 210K 💾 |
gam | Kandawo | 250K 💾 |
gaw | Nobonob | 246K 💾 |
ga | Irish | 298K 💾 |
gdn | Umanakaina | 306K 💾 |
gdr | Wipi | 271K 💾 |
gd | Scottish Gaelic | 17,105K 💾 |
gfk | Patpatar | 294K 💾 |
ghs | Guhu-Samane | 186K 💾 |
gsw-u-sd-chag | Swiss German (Aargau) | 99K 💾 |
gsw-u-sd-chbe | Swiss German (Bern) | 73K 💾 |
gsw-u-sd-chfr | Swiss German (Fribourg) | 42K 💾 |
gvf | Golin | 276K 💾 |
gv | Manx Gaelic | 152K 💾 |
ha | Hausa | 1,775K 💾 |
haw | Hawaiian | 2,221K 💾 |
hi | Hindi | 10,004K 💾 |
hla | Halia | 273K 💾 |
hot | Hote | 222K 💾 |
ho | Hiri Motu | 240K 💾 |
hr | Croatian | 8,188K 💾 |
hui | Huli | 232K 💾 |
hy | Armenian | 25,972K 💾 |
ian | Iatmul | 224K 💾 |
id | Indonesian | 6,634K 💾 |
ig | Igbo | 13K 💾 |
imo | Imbongu | 280K 💾 |
ino | Inoke-Yate | 236K 💾 |
iou | Tuma-Irumu | 225K 💾 |
ipi | Ipili | 312K 💾 |
iu | Inuktitut | 98K 💾 |
iws | Sepik Iwam | 307K 💾 |
jae | Yabem | 186K 💾 |
ja | Japanese | 2,116K 💾 |
kab | Kabyle | 66K 💾 |
kbm | Iwal | 298K 💾 |
kbq | Kamano | 156K 💾 |
kew | West Kewa | 247K 💾 |
kgf | Kube | 175K 💾 |
khz | Keapara | 196K 💾 |
kjs | East Kewa | 251K 💾 |
kj | Kuanyama | 1,474K 💾 |
kk | Kazakh | 642K 💾 |
kmg | Kâte | 127K 💾 |
kmo | Kwoma | 213K 💾 |
kms | Kamasau | 293K 💾 |
kmu | Kanite | 214K 💾 |
km | Khmer | 29,110K 💾 |
kpf | Komba | 174K 💾 |
kpr | Korafe-Yegha | 262K 💾 |
kpw | Kobon | 288K 💾 |
kpx | Mountain Koiali | 190K 💾 |
kqc | Doromu-Koki | 209K 💾 |
kqw | Kandas | 201K 💾 |
ksd | Kuanua | 228K 💾 |
ksr | Borong | 233K 💾 |
kto | Kuot | 286K 💾 |
kud | ‘Auhelawa | 167K 💾 |
kue | Kuman (Papua New Guinea) | 230K 💾 |
kup | Kunimaipa | 279K 💾 |
ku | Kurdish | 2,479K 💾 |
kwj | Kwanga | 290K 💾 |
kyc | Kyaka | 220K 💾 |
kyg | Keyagana | 190K 💾 |
ky | Kyrgyz | 18,597K 💾 |
kze | Kosena | 164K 💾 |
la | Latin | 48K 💾 |
lb | Luxembourgish | 5,173K 💾 |
lcm | Tungag | 239K 💾 |
leu | Kara (Papua New Guinea) | 255K 💾 |
lid | Nyindrou | 308K 💾 |
lo | Lao | 4,384K 💾 |
mbh | Mangseng | 321K 💾 |
mcq | Ese | 158K 💾 |
med | Melpa | 283K 💾 |
mee | Mengen | 301K 💾 |
mek | Mekeo | 234K 💾 |
meu | Motu | 175K 💾 |
mhl | Mauwake | 235K 💾 |
mi | Maori | 1,504K 💾 |
mk | Macedonian | 10,422K 💾 |
mlh | Mape | 235K 💾 |
mlp | Bargam | 297K 💾 |
mmo | Mangga Buang | 269K 💾 |
mmx | Madak | 271K 💾 |
mna | Mbula | 257K 💾 |
mnw | Mon | 1,836K 💾 |
mox | Molima | 222K 💾 |
mpt | Mian | 256K 💾 |
mpx | Misima-Panaeati | 227K 💾 |
mr | Marathi | 16,594K 💾 |
msy | Aruamu | 229K 💾 |
mti | Maiwa (Papua New Guinea) | 166K 💾 |
mt | Maltese | 3,331K 💾 |
mux | Bo-Ung | 363K 💾 |
mva | Manam | 231K 💾 |
my | Burmese | 1,007K 💾 |
my-t-d0-zawgyi | Burmese (Zawgyi encoding) | 593K 💾 |
myw | Muyuw | 150K 💾 |
naf | Nabak | 220K 💾 |
nak | Nakanai | 333K 💾 |
nas | Naasioi | 168K 💾 |
nca | Iyo | 203K 💾 |
nho | Takuu | 309K 💾 |
nl | Dutch | 58,357K 💾 |
nop | Numanggang | 183K 💾 |
nou | Ewage-Notu | 266K 💾 |
nsn | Nehan | 248K 💾 |
nvm | Namiae | 290K 💾 |
ny | Nyanja | 356K 💾 |
okv | Orokaiva | 212K 💾 |
ong | Olo | 284K 💾 |
opm | Oksapmin | 332K 💾 |
osa | Osage | 3K 💾 |
pa | Punjabi | 28,446K² 💾 |
pcm | Nigerian Pidgin | 315K 💾 |
pl | Polish | 7,148K 💾 |
ppo | Folopa | 258K 💾 |
ps | Pashto | 7,343K 💾 |
ptp | Patep | 294K 💾 |
pwg | Gapapaiwa | 208K 💾 |
rai | Ramoaaina | 273K 💾 |
rm-puter | Romansh (Puter) | 1,068K 💾 |
rm-rumgr | Romansh (Grischun) | 4,794K 💾 |
rm-surmiran | Romansh (Surmiran) | 2,540K 💾 |
rm-sursilv | Romansh (Sursilvan) | 11,678K 💾 |
rm-sutsilv | Romansh (Sutsilvan) | 1,007K 💾 |
rm-vallader | Romansh (Vallader) | 5,560K 💾 |
roo | Rotokas | 292K 💾 |
ro | Romanian | 13,962K 💾 |
rro | Waima | 177K 💾 |
ru | Russian | 40,987K² 💾 |
rw | Kinyarwanda | 605K 💾 |
sah | Sakha | 2,457K 💾 |
sgz | Sursurunga | 327K 💾 |
shn | Shan | 1,435K 💾 |
sim | Mende (Papua New Guinea) | 273K 💾 |
si | Sinhala | 1,046K 💾 |
sll | Salt-Yui | 264K 💾 |
sl | Slovenian | 10,975K 💾 |
snc | Sinaugoro | 216K 💾 |
sny | Saniyo-Hiyewe | 348K 💾 |
sn | Shona | 2,542K 💾 |
soq | Kanasi | 213K 💾 |
so | Somali | 874K 💾 |
spl | Selepet | 244K 💾 |
sps | Saposa | 324K 💾 |
sq | Albanian | 10,104K 💾 |
sr-Latn | Serbian (Latin) | 10,143K 💾 |
ssd | Siroi | 210K 💾 |
ssg | Seimat | 221K 💾 |
ssx | Samberigi | 233K 💾 |
sua | Sulka | 458K 💾 |
sue | Suena | 227K 💾 |
sv | Swedish | 33,633K 💾 |
swp | Suau | 175K 💾 |
sw | Swahili | 8,817K 💾 |
taw | Tai | 268K 💾 |
ta | Tamil | 1,413K 💾 |
tbc | Takia | 278K 💾 |
tbo | Tawala | 198K 💾 |
tgo | Sudest | 216K 💾 |
tif | Tifal | 413K 💾 |
tim | Timbe | 206K 💾 |
ti | Tigrinya | 803K 💾 |
tlf | Telefol | 422K 💾 |
tpi | Tok Pisin | 8,049K 💾 |
tpz | Tinputz | 370K 💾 |
tr | Turkish | 13,846K 💾 |
tte | Bwanabwana | 198K 💾 |
tt | Tatar | 1,356K 💾 |
ubr | Ubir | 222K 💾 |
ug | Uyghur | 9,493K 💾 |
uk | Ukrainian | 12,921K 💾 |
ur | Urdu | 3,622K 💾 |
usa | Usarufa | 171K 💾 |
uvl | Lote | 277K 💾 |
vec | Venetian | 2K 💾 |
vec-u-sd-itpd | Venetian (Padua) | 813K 💾 |
vec-u-sd-itts | Venetian (Trieste) | 12K 💾 |
vec-u-sd-itvr | Venetian (Verona) | 16K 💾 |
viv | Iduna | 220K 💾 |
waj | Waffa | 236K 💾 |
wer | Weri | 209K 💾 |
wiu | Wiru | 232K 💾 |
wnc | Wantoat | 238K 💾 |
wnu | Usan | 234K 💾 |
wos | Hanga Hundi | 264K 💾 |
wrs | Waris | 213K 💾 |
wsk | Waskia | 239K 💾 |
wuv | Wuvulu-Aua | 187K 💾 |
xla | Kamula | 230K 💾 |
xsi | Sio | 319K 💾 |
yby | Yaweyuha | 219K 💾 |
yle | Yele | 298K 💾 |
yml | Iamalele | 245K 💾 |
yo | Yoruba | 80K 💾 |
yuj | Karkar-Yuri | 258K 💾 |
yut | Yopno | 227K 💾 |
yuw | Yau (Morobe Province) | 243K 💾 |
zia | Zia | 242K 💾 |
¹ To count tokens, we use an ICU word break iterator and count all tokens whose break status is one of `UBRK_WORD_LETTER`, `UBRK_WORD_KANA`, or `UBRK_WORD_IDEO`. Downloadable files include counts for each token. To get the raw text, run the crawler yourself.
² Crawl is still in progress; the final number will be larger.
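
As a rough illustration of the per-token counts mentioned in the first footnote, the sketch below tallies word-like tokens with the Python standard library. It is an approximation, not the ICU word break iterator the project uses: it treats every maximal run of Unicode letters and combining marks as one token, so digits and punctuation are skipped (much as non-letter break statuses are excluded), but words with internal apostrophes are split and ideographic or kana text is not segmented the way ICU would segment it. The function names are hypothetical.

```python
import unicodedata
from collections import Counter


def approximate_tokens(text):
    """Splits text into maximal runs of letters and combining marks."""
    tokens = []
    current = []
    for ch in text:
        # General categories starting with "L" are letters, "M" are marks.
        if unicodedata.category(ch)[0] in ("L", "M"):
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


def token_counts(text):
    """Returns a Counter mapping each token to its number of occurrences."""
    return Counter(approximate_tokens(text))
```

For example, `token_counts("La chasa, la chasa 2024!")` counts `chasa` twice and `La` and `la` once each; the number `2024` and the punctuation are ignored.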
./corpuscrawler --language=rm --output=./corpus