Crawler for linguistic corpora

google/corpuscrawler

Corpus Crawler

Corpus Crawler is a tool for Corpus Linguistics.

Modern linguistic research works on language corpora, which are large samples of “real world” text. This crawler helps to build such corpora: it follows links to publicly accessible web pages known to be written in a certain language; it removes boilerplate and HTML markup; finally, it writes its output into plaintext files. The crawler implements the Robots Exclusion Standard, and it is intentionally slow so it does not cause much load on the crawled web sites.
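
As an illustration of the Robots Exclusion Standard check described above, here is a minimal sketch using Python's standard `urllib.robotparser` module. It is not taken from the Corpus Crawler source: the sample robots.txt, the `ExampleCrawler` user agent, the URLs, and the helper names `build_parser`, `allowed_urls`, and `polite_delay` are all assumptions made for this example.

```python
import urllib.robotparser

# Hypothetical robots.txt content; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser without touching the network."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser, urls, user_agent="ExampleCrawler"):
    """Keep only the URLs that the robots.txt rules allow this user agent to fetch."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


def polite_delay(parser, user_agent="ExampleCrawler", default=1.0):
    """Seconds to wait between requests: the site's Crawl-delay, else a default."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    candidates = [
        "https://example.org/news/1.html",
        "https://example.org/private/page.html",
    ]
    print(allowed_urls(parser, candidates))
    print(polite_delay(parser))
```

A crawler built along these lines would sleep for `polite_delay(...)` seconds between requests to the same site, which is one simple way to stay "intentionally slow".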

This is not an official Google product. But if you’re a linguistic researcher, or if you’re writing a spell checker (or similar language-processing software) for an “exotic” language, you might find Corpus Crawler useful.

To build corpora for not-yet-supported languages, please read the contribution guidelines and send us GitHub pull requests.

Supported Languages

IETF BCP47 Code Language Tokens¹
aai Arifama-Miniafia 181K 💾
aak Ankave 194K 💾
aby Aneme Wake 233K 💾
ace Aceh/Acehnese 817K 💾
ae Avestan 129K 💾
ae-Latn Avestan (Latin) 141K 💾
aey Amele 218K 💾
agd Agarabi 256K 💾
agg Angor 214K 💾
agm Angaataha 238K 💾
akh Akha 408K 💾
amm Ama (Papua New Guinea) 246K 💾
amp Alamblak 241K 💾
am Amharic 2,170K 💾
aom Ömie 231K 💾
aon Bumbita Arapesh 294K 💾
ape Bukiyip 294K 💾
apr Arop-Lokep 373K 💾
apz Safeyoka 235K 💾
ar Arabic 14,345K² 💾
aso Dano 290K 💾
ata Pele-Ata 248K 💾
auy Awiyaana 164K 💾
avt Au 263K 💾
awb Awa (Papua New Guinea) 179K 💾
az Azerbaijani 3,413K 💾
ba Bashkir 666K 💾
bbb Barai 289K 💾
bbr Girawa 245K 💾
bch Bariai 248K 💾
bdd Bunama 171K 💾
bef Benabena 239K 💾
be Belarusian 1,441K 💾
bg Bulgarian 10,597K 💾
bhl Bimin 324K 💾
big Biangai 229K 💾
bjr Binumarien 226K 💾
bmh Kein 253K 💾
bmu Somba-Siawari 234K 💾
bm Bambara 30K 💾
bnp Bola 263K 💾
bn Bangla 7,258K 💾
boj Anjam 255K 💾
bon Bine 244K 💾
bo Tibetan 5,642K 💾
bs Bosnian 8,993K 💾
buk Bugawac 264K 💾
byx Qaqet 387K 💾
bzh Mapos Buang 251K 💾
ccp Chakma 79K 💾
cjv Chuave 286K 💾
cs Czech 3,141K 💾
cy Welsh 11,519K 💾
dad Marik 197K 💾
dah Gwahatike 274K 💾
ded Dedua 146K 💾
de German 46,431K 💾
dgz Daga 219K 💾
dob Dobu 179K 💾
dww Dawawa 208K 💾
dz Dzongkha 61K 💾
el Greek 5,470K 💾
emi Mussau-Emira 176K 💾
enq Enga 217K 💾
eri Ogea 269K 💾
es Spanish 32,670K 💾
fa Persian 9,114K 💾
fa-AF Dari 7,363K 💾
faa Fasu 238K 💾
fai Faiwol 256K 💾
fi Finnish 4,837K 💾
fit Tornedalen Finnish 292K 💾
for Fore 169K 💾
fo Faroese 851K 💾
fuv Nigerian Fulfulde 13K 💾
gah Alekano 210K 💾
gam Kandawo 250K 💾
gaw Nobonob 246K 💾
ga Irish 298K 💾
gdn Umanakaina 306K 💾
gdr Wipi 271K 💾
gd Scottish Gaelic 17,105K 💾
gfk Patpatar 294K 💾
ghs Guhu-Samane 186K 💾
gsw-u-sd-chag Swiss German (Aargau) 99K 💾
gsw-u-sd-chbe Swiss German (Bern) 73K 💾
gsw-u-sd-chfr Swiss German (Fribourg) 42K 💾
gvf Golin 276K 💾
gv Manx Gaelic 152K 💾
ha Hausa 1,775K 💾
haw Hawaiian 2,221K 💾
hi Hindi 10,004K 💾
hla Halia 273K 💾
hot Hote 222K 💾
ho Hiri Motu 240K 💾
hr Croatian 8,188K 💾
hui Huli 232K 💾
hy Armenian 25,972K 💾
ian Iatmul 224K 💾
id Indonesian 6,634K 💾
ig Igbo 13K 💾
imo Imbongu 280K 💾
ino Inoke-Yate 236K 💾
iou Tuma-Irumu 225K 💾
ipi Ipili 312K 💾
iu Inuktitut 98K 💾
iws Sepik Iwam 307K 💾
jae Yabem 186K 💾
ja Japanese 2,116K 💾
kab Kabyle 66K 💾
kbm Iwal 298K 💾
kbq Kamano 156K 💾
kew West Kewa 247K 💾
kgf Kube 175K 💾
khz Keapara 196K 💾
kjs East Kewa 251K 💾
kj Kuanyama 1,474K 💾
kk Kazakh 642K 💾
kmg Kâte 127K 💾
kmo Kwoma 213K 💾
kms Kamasau 293K 💾
kmu Kanite 214K 💾
km Khmer 29,110K 💾
kpf Komba 174K 💾
kpr Korafe-Yegha 262K 💾
kpw Kobon 288K 💾
kpx Mountain Koiali 190K 💾
kqc Doromu-Koki 209K 💾
kqw Kandas 201K 💾
ksd Kuanua 228K 💾
ksr Borong 233K 💾
kto Kuot 286K 💾
kud ‘Auhelawa 167K 💾
kue Kuman (Papua New Guinea) 230K 💾
kup Kunimaipa 279K 💾
ku Kurdish 2,479K 💾
kwj Kwanga 290K 💾
kyc Kyaka 220K 💾
kyg Keyagana 190K 💾
ky Kyrgyz 18,597K 💾
kze Kosena 164K 💾
la Latin 48K 💾
lb Luxembourgish 5,173K 💾
lcm Tungag 239K 💾
leu Kara (Papua New Guinea) 255K 💾
lid Nyindrou 308K 💾
lo Lao 4,384K 💾
mbh Mangseng 321K 💾
mcq Ese 158K 💾
med Melpa 283K 💾
mee Mengen 301K 💾
mek Mekeo 234K 💾
meu Motu 175K 💾
mhl Mauwake 235K 💾
mi Maori 1,504K 💾
mk Macedonian 10,422K 💾
mlh Mape 235K 💾
mlp Bargam 297K 💾
mmo Mangga Buang 269K 💾
mmx Madak 271K 💾
mna Mbula 257K 💾
mnw Mon 1,836K 💾
mox Molima 222K 💾
mpt Mian 256K 💾
mpx Misima-Panaeati 227K 💾
mr Marathi 16,594K 💾
msy Aruamu 229K 💾
mti Maiwa (Papua New Guinea) 166K 💾
mt Maltese 3,331K 💾
mux Bo-Ung 363K 💾
mva Manam 231K 💾
my Burmese 1,007K 💾
my-t-d0-zawgyi Burmese (Zawgyi encoding) 593K 💾
myw Muyuw 150K 💾
naf Nabak 220K 💾
nak Nakanai 333K 💾
nas Naasioi 168K 💾
nca Iyo 203K 💾
nho Takuu 309K 💾
nl Dutch 58,357K 💾
nop Numanggang 183K 💾
nou Ewage-Notu 266K 💾
nsn Nehan 248K 💾
nvm Namiae 290K 💾
ny Nyanja 356K 💾
okv Orokaiva 212K 💾
ong Olo 284K 💾
opm Oksapmin 332K 💾
osa Osage 3K 💾
pa Punjabi 28,446K² 💾
pcm Nigerian Pidgin 315K 💾
pl Polish 7,148K 💾
ppo Folopa 258K 💾
ps Pashto 7,343K 💾
ptp Patep 294K 💾
pwg Gapapaiwa 208K 💾
rai Ramoaaina 273K 💾
rm-puter Romansh (Puter) 1,068K 💾
rm-rumgr Romansh (Grischun) 4,794K 💾
rm-surmiran Romansh (Surmiran) 2,540K 💾
rm-sursilv Romansh (Sursilvan) 11,678K 💾
rm-sutsilv Romansh (Sutsilvan) 1,007K 💾
rm-vallader Romansh (Vallader) 5,560K 💾
roo Rotokas 292K 💾
ro Romanian 13,962K 💾
rro Waima 177K 💾
ru Russian 40,987K² 💾
rw Kinyarwanda 605K 💾
sah Sakha 2,457K 💾
sgz Sursurunga 327K 💾
shn Shan 1,435K 💾
sim Mende (Papua New Guinea) 273K 💾
si Sinhala 1,046K 💾
sll Salt-Yui 264K 💾
sl Slovenian 10,975K 💾
snc Sinaugoro 216K 💾
sny Saniyo-Hiyewe 348K 💾
sn Shona 2,542K 💾
soq Kanasi 213K 💾
so Somali 874K 💾
spl Selepet 244K 💾
sps Saposa 324K 💾
sq Albanian 10,104K 💾
sr-Latn Serbian (Latin) 10,143K 💾
ssd Siroi 210K 💾
ssg Seimat 221K 💾
ssx Samberigi 233K 💾
sua Sulka 458K 💾
sue Suena 227K 💾
sv Swedish 33,633K 💾
swp Suau 175K 💾
sw Swahili 8,817K 💾
taw Tai 268K 💾
ta Tamil 1,413K 💾
tbc Takia 278K 💾
tbo Tawala 198K 💾
tgo Sudest 216K 💾
tif Tifal 413K 💾
tim Timbe 206K 💾
ti Tigrinya 803K 💾
tlf Telefol 422K 💾
tpi Tok Pisin 8,049K 💾
tpz Tinputz 370K 💾
tr Turkish 13,846K 💾
tte Bwanabwana 198K 💾
tt Tatar 1,356K 💾
ubr Ubir 222K 💾
ug Uyghur 9,493K 💾
uk Ukrainian 12,921K 💾
ur Urdu 3,622K 💾
usa Usarufa 171K 💾
uvl Lote 277K 💾
vec Venetian 2K 💾
vec-u-sd-itpd Venetian (Padua) 813K 💾
vec-u-sd-itts Venetian (Trieste) 12K 💾
vec-u-sd-itvr Venetian (Verona) 16K 💾
viv Iduna 220K 💾
waj Waffa 236K 💾
wer Weri 209K 💾
wiu Wiru 232K 💾
wnc Wantoat 238K 💾
wnu Usan 234K 💾
wos Hanga Hundi 264K 💾
wrs Waris 213K 💾
wsk Waskia 239K 💾
wuv Wuvulu-Aua 187K 💾
xla Kamula 230K 💾
xsi Sio 319K 💾
yby Yaweyuha 219K 💾
yle Yele 298K 💾
yml Iamalele 245K 💾
yo Yoruba 80K 💾
yuj Karkar-Yuri 258K 💾
yut Yopno 227K 💾
yuw Yau (Morobe Province) 243K 💾
zia Zia 242K 💾

¹ To count tokens, we use an ICU word break iterator and count all tokens whose break status is one of UBRK_WORD_LETTER, UBRK_WORD_KANA, or UBRK_WORD_IDEO. Downloadable files include counts for each token. To get the raw text, run the crawler yourself.
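
To give a feel for what "counts for each token" means, here is a rough, standard-library-only sketch in Python. It does not use ICU and is not the project's implementation: it approximates word-like tokens as runs of Unicode letters (plus any combining marks that follow a letter), so numbers and punctuation are skipped, much as tokens outside the three break statuses are skipped in the ICU-based count. Real ICU word breaking follows different rules (for example for apostrophes and for scripts written without spaces), so results will differ. The function name `count_tokens` is made up for this example.

```python
import unicodedata
from collections import Counter


def count_tokens(text):
    """Count word-like tokens: maximal runs of letters and trailing combining marks.

    This is a simplified stand-in for an ICU word break iterator, not an
    equivalent of it.
    """
    counts = Counter()
    current = []
    for ch in text:
        category = unicodedata.category(ch)
        is_letter = category.startswith("L")
        # A combining mark only continues a token that a letter has started.
        is_mark = category.startswith("M") and bool(current)
        if is_letter or is_mark:
            current.append(ch)
        elif current:
            counts["".join(current)] += 1
            current = []
    if current:
        counts["".join(current)] += 1
    return counts


if __name__ == "__main__":
    counts = count_tokens("the cat saw the 2 dogs, the end")
    print(sum(counts.values()), counts.most_common(1))
```

In this sample, the digit and the comma are not counted, so the total covers only the seven letter-based tokens.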

² Crawl is still in progress; the final number will be larger.

Running the Crawler

./corpuscrawler --language=rm --output=./corpus
