Corpus Crawler

Corpus Crawler is a tool for Corpus Linguistics.

Modern linguistic research relies on language corpora: large samples of “real world” text. This crawler helps build such corpora. It follows links to publicly accessible web pages known to be written in a certain language, removes boilerplate and HTML markup, and writes the result to plaintext files. The crawler implements the Robots Exclusion Standard, and it is intentionally slow so that it does not put much load on the crawled web sites.
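
As a rough illustration of what honoring the Robots Exclusion Standard and crawling slowly can look like, here is a minimal sketch built on Python's standard `urllib.robotparser`. It is not taken from Corpus Crawler's own code: the function names `parse_robots` and `plan_crawl`, the user agent string `ExampleCrawler`, and the one-second default delay are all assumptions made for this example.

```python
import urllib.robotparser


def parse_robots(robots_txt):
    """Parse the text of a robots.txt file into a rule set."""
    rules = urllib.robotparser.RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def plan_crawl(rules, urls, user_agent="ExampleCrawler", default_delay=1.0):
    """Return (allowed_urls, delay_in_seconds) for a polite crawl.

    The user agent name and the default delay are hypothetical values
    chosen for this sketch, not Corpus Crawler settings.
    """
    delay = rules.crawl_delay(user_agent)
    if delay is None:
        # The site did not ask for a specific delay; stay slow anyway.
        delay = default_delay
    allowed = [url for url in urls if rules.can_fetch(user_agent, url)]
    return allowed, delay


if __name__ == "__main__":
    robots = "User-agent: *\nCrawl-delay: 5\nDisallow: /private/\n"
    urls = [
        "https://example.org/news/1.html",
        "https://example.org/private/a.html",
    ]
    allowed, delay = plan_crawl(parse_robots(robots), urls)
    print(allowed, delay)
```

A real crawl loop would fetch each allowed URL and sleep for `delay` seconds between requests, which is what keeps the load on the crawled site low.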

This is not an official Google product. However, if you’re a linguistic researcher, or if you’re writing a spell checker (or similar language-processing software) for an “exotic” language, you might find Corpus Crawler useful.

To build corpora for not-yet-supported languages, please read the contribution guidelines and send us GitHub pull requests.

Supported Languages

IETF BCP47 Code Language Tokens¹
aai Arifama-Miniafia 181K 💾
aak Ankave 194K 💾
aby Aneme Wake 233K 💾
ace Aceh/Acehnese 817K 💾
ae Avestan 129K 💾
ae-Latn Avestan (Latin) 141K 💾
aey Amele 218K 💾
agd Agarabi 256K 💾
agg Angor 214K 💾
agm Angaataha 238K 💾
akh Akha 408K 💾
amm Ama (Papua New Guinea) 246K 💾
amp Alamblak 241K 💾
am Amharic 2,170K 💾
aom Ömie 231K 💾
aon Bumbita Arapesh 294K 💾
ape Bukiyip 294K 💾
apr Arop-Lokep 373K 💾
apz Safeyoka 235K 💾
ar Arabic 14,345K² 💾
aso Dano 290K 💾
ata Pele-Ata 248K 💾
auy Awiyaana 164K 💾
avt Au 263K 💾
awb Awa (Papua New Guinea) 179K 💾
az Azerbaijani 3,413K 💾
ba Bashkir 666K 💾
bbb Barai 289K 💾
bbr Girawa 245K 💾
bch Bariai 248K 💾
bdd Bunama 171K 💾
bef Benabena 239K 💾
be Belarusian 1,441K 💾
bg Bulgarian 10,597K 💾
bhl Bimin 324K 💾
big Biangai 229K 💾
bjr Binumarien 226K 💾
bmh Kein 253K 💾
bmu Somba-Siawari 234K 💾
bm Bambara 30K 💾
bnp Bola 263K 💾
bn Bangla 7,258K 💾
boj Anjam 255K 💾
bon Bine 244K 💾
bo Tibetan 5,642K 💾
bs Bosnian 8,993K 💾
buk Bugawac 264K 💾
byx Qaqet 387K 💾
bzh Mapos Buang 251K 💾
ccp Chakma 79K 💾
cjv Chuave 286K 💾
cs Czech 3,141K 💾
cy Welsh 11,519K 💾
dad Marik 197K 💾
dah Gwahatike 274K 💾
ded Dedua 146K 💾
de German 46,431K 💾
dgz Daga 219K 💾
dob Dobu 179K 💾
dww Dawawa 208K 💾
dz Dzongkha 61K 💾
el Greek 5,470K 💾
emi Mussau-Emira 176K 💾
enq Enga 217K 💾
eri Ogea 269K 💾
es Spanish 32,670K 💾
fa Persian 9,114K 💾
fa-AF Dari 7,363K 💾
faa Fasu 238K 💾
fai Faiwol 256K 💾
fi Finnish 4,837K 💾
fit Tornedalen Finnish 292K 💾
for Fore 169K 💾
fo Faroese 851K 💾
fuv Nigerian Fulfulde 13K 💾
gah Alekano 210K 💾
gam Kandawo 250K 💾
gaw Nobonob 246K 💾
ga Irish 298K 💾
gdn Umanakaina 306K 💾
gdr Wipi 271K 💾
gd Scottish Gaelic 17,105K 💾
gfk Patpatar 294K 💾
ghs Guhu-Samane 186K 💾
gsw-u-sd-chag Swiss German (Aargau) 99K 💾
gsw-u-sd-chbe Swiss German (Bern) 73K 💾
gsw-u-sd-chfr Swiss German (Fribourg) 42K 💾
gvf Golin 276K 💾
gv Manx Gaelic 152K 💾
ha Hausa 1,775K 💾
haw Hawaiian 2,221K 💾
hi Hindi 10,004K 💾
hla Halia 273K 💾
hot Hote 222K 💾
ho Hiri Motu 240K 💾
hr Croatian 8,188K 💾
hui Huli 232K 💾
hy Armenian 25,972K 💾
ian Iatmul 224K 💾
id Indonesian 6,634K 💾
ig Igbo 13K 💾
imo Imbongu 280K 💾
ino Inoke-Yate 236K 💾
iou Tuma-Irumu 225K 💾
ipi Ipili 312K 💾
iu Inuktitut 98K 💾
iws Sepik Iwam 307K 💾
jae Yabem 186K 💾
ja Japanese 2,116K 💾
kab Kabyle 66K 💾
kbm Iwal 298K 💾
kbq Kamano 156K 💾
kew West Kewa 247K 💾
kgf Kube 175K 💾
khz Keapara 196K 💾
kjs East Kewa 251K 💾
kj Kuanyama 1,474K 💾
kk Kazakh 642K 💾
kmg Kâte 127K 💾
kmo Kwoma 213K 💾
kms Kamasau 293K 💾
kmu Kanite 214K 💾
km Khmer 29,110K 💾
kpf Komba 174K 💾
kpr Korafe-Yegha 262K 💾
kpw Kobon 288K 💾
kpx Mountain Koiali 190K 💾
kqc Doromu-Koki 209K 💾
kqw Kandas 201K 💾
ksd Kuanua 228K 💾
ksr Borong 233K 💾
kto Kuot 286K 💾
kud ‘Auhelawa 167K 💾
kue Kuman (Papua New Guinea) 230K 💾
kup Kunimaipa 279K 💾
ku Kurdish 2,479K 💾
kwj Kwanga 290K 💾
kyc Kyaka 220K 💾
kyg Keyagana 190K 💾
ky Kyrgyz 18,597K 💾
kze Kosena 164K 💾
la Latin 48K 💾
lb Luxembourgish 5,173K 💾
lcm Tungag 239K 💾
leu Kara (Papua New Guinea) 255K 💾
lid Nyindrou 308K 💾
lo Lao 4,384K 💾
mbh Mangseng 321K 💾
mcq Ese 158K 💾
med Melpa 283K 💾
mee Mengen 301K 💾
mek Mekeo 234K 💾
meu Motu 175K 💾
mhl Mauwake 235K 💾
mi Maori 1,504K 💾
mk Macedonian 10,422K 💾
mlh Mape 235K 💾
mlp Bargam 297K 💾
mmo Mangga Buang 269K 💾
mmx Madak 271K 💾
mna Mbula 257K 💾
mnw Mon 1,836K 💾
mox Molima 222K 💾
mpt Mian 256K 💾
mpx Misima-Panaeati 227K 💾
mr Marathi 16,594K 💾
msy Aruamu 229K 💾
mti Maiwa (Papua New Guinea) 166K 💾
mt Maltese 3,331K 💾
mux Bo-Ung 363K 💾
mva Manam 231K 💾
my Burmese 1,007K 💾
my-t-d0-zawgyi Burmese (Zawgyi encoding) 593K 💾
myw Muyuw 150K 💾
naf Nabak 220K 💾
nak Nakanai 333K 💾
nas Naasioi 168K 💾
nca Iyo 203K 💾
nho Takuu 309K 💾
nl Dutch 58,357K 💾
nop Numanggang 183K 💾
nou Ewage-Notu 266K 💾
nsn Nehan 248K 💾
nvm Namiae 290K 💾
ny Nyanja 356K 💾
okv Orokaiva 212K 💾
ong Olo 284K 💾
opm Oksapmin 332K 💾
osa Osage 3K 💾
pa Punjabi 28,446K² 💾
pcm Nigerian Pidgin 315K 💾
pl Polish 7,148K 💾
ppo Folopa 258K 💾
ps Pashto 7,343K 💾
ptp Patep 294K 💾
pwg Gapapaiwa 208K 💾
rai Ramoaaina 273K 💾
rm-puter Romansh (Puter) 1,068K 💾
rm-rumgr Romansh (Grischun) 4,794K 💾
rm-surmiran Romansh (Surmiran) 2,540K 💾
rm-sursilv Romansh (Sursilvan) 11,678K 💾
rm-sutsilv Romansh (Sutsilvan) 1,007K 💾
rm-vallader Romansh (Vallader) 5,560K 💾
roo Rotokas 292K 💾
ro Romanian 13,962K 💾
rro Waima 177K 💾
ru Russian 40,987K² 💾
rw Kinyarwanda 605K 💾
sah Sakha 2,457K 💾
sgz Sursurunga 327K 💾
shn Shan 1,435K 💾
sim Mende (Papua New Guinea) 273K 💾
si Sinhala 1,046K 💾
sll Salt-Yui 264K 💾
sl Slovenian 10,975K 💾
snc Sinaugoro 216K 💾
sny Saniyo-Hiyewe 348K 💾
sn Shona 2,542K 💾
soq Kanasi 213K 💾
so Somali 874K 💾
spl Selepet 244K 💾
sps Saposa 324K 💾
sq Albanian 10,104K 💾
sr-Latn Serbian (Latin) 10,143K 💾
ssd Siroi 210K 💾
ssg Seimat 221K 💾
ssx Samberigi 233K 💾
sua Sulka 458K 💾
sue Suena 227K 💾
sv Swedish 33,633K 💾
swp Suau 175K 💾
sw Swahili 8,817K 💾
taw Tai 268K 💾
ta Tamil 1,413K 💾
tbc Takia 278K 💾
tbo Tawala 198K 💾
tgo Sudest 216K 💾
tif Tifal 413K 💾
tim Timbe 206K 💾
ti Tigrinya 803K 💾
tlf Telefol 422K 💾
tpi Tok Pisin 8,049K 💾
tpz Tinputz 370K 💾
tr Turkish 13,846K 💾
tte Bwanabwana 198K 💾
tt Tatar 1,356K 💾
ubr Ubir 222K 💾
ug Uyghur 9,493K 💾
uk Ukrainian 12,921K 💾
ur Urdu 3,622K 💾
usa Usarufa 171K 💾
uvl Lote 277K 💾
vec Venetian 2K 💾
vec-u-sd-itpd Venetian (Padua) 813K 💾
vec-u-sd-itts Venetian (Trieste) 12K 💾
vec-u-sd-itvr Venetian (Verona) 16K 💾
viv Iduna 220K 💾
waj Waffa 236K 💾
wer Weri 209K 💾
wiu Wiru 232K 💾
wnc Wantoat 238K 💾
wnu Usan 234K 💾
wos Hanga Hundi 264K 💾
wrs Waris 213K 💾
wsk Waskia 239K 💾
wuv Wuvulu-Aua 187K 💾
xla Kamula 230K 💾
xsi Sio 319K 💾
yby Yaweyuha 219K 💾
yle Yele 298K 💾
yml Iamalele 245K 💾
yo Yoruba 80K 💾
yuj Karkar-Yuri 258K 💾
yut Yopno 227K 💾
yuw Yau (Morobe Province) 243K 💾
zia Zia 242K 💾

¹ To count tokens, we use an ICU word break iterator and count all tokens whose break status is one of UBRK_WORD_LETTER, UBRK_WORD_KANA, or UBRK_WORD_IDEO. Downloadable files include counts for each token. To get the raw text, run the crawler yourself.

² Crawl is still in progress; the final number will be larger.
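
The counting rule from footnote ¹ can be sketched as follows. The example assumes the text has already been segmented by an ICU word break iterator (for instance through a binding such as PyICU, not shown here) into `(segment, rule_status)` pairs. The numeric ranges are the rule-status values from ICU's `ubrk.h`; the function names `is_counted` and `count_tokens` are made up for this illustration and are not the project's actual code.

```python
from collections import Counter

# Rule-status ranges from ICU's ubrk.h. Each category spans
# [TAG, TAG_LIMIT); letter, kana, and ideographic are contiguous.
UBRK_WORD_NONE = 0       # spaces, punctuation: not counted
UBRK_WORD_NUMBER = 100   # numbers: not counted
UBRK_WORD_LETTER = 200
UBRK_WORD_KANA = 300
UBRK_WORD_IDEO = 400
UBRK_WORD_IDEO_LIMIT = 500


def is_counted(rule_status):
    """True if the status is in the letter, kana, or ideographic range."""
    return UBRK_WORD_LETTER <= rule_status < UBRK_WORD_IDEO_LIMIT


def count_tokens(segments):
    """Count each token whose break status qualifies as a word.

    `segments` is an iterable of (text, rule_status) pairs, assumed to
    come from an ICU word break iterator.
    """
    counts = Counter()
    for text, status in segments:
        if is_counted(status):
            counts[text] += 1
    return counts


if __name__ == "__main__":
    segments = [
        ("allegra", UBRK_WORD_LETTER),
        (" ", UBRK_WORD_NONE),
        ("2024", UBRK_WORD_NUMBER),
        (" ", UBRK_WORD_NONE),
        ("allegra", UBRK_WORD_LETTER),
    ]
    counts = count_tokens(segments)
    print(counts, sum(counts.values()))
```

Summing the per-token counts gives a corpus-level total of the kind shown in the Tokens column, while the per-token counts correspond to what the downloadable files contain.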

Running the Crawler

./corpuscrawler --language=rm --output=./corpus