Corpus Crawler

Corpus Crawler is a tool for Corpus Linguistics.

Modern linguistic research works on language corpora, which are large samples of “real world” text. This crawler helps to build such corpora: it follows links to publicly accessible web pages known to be written in a certain language; it removes boilerplate and HTML markup; finally, it writes its output into plaintext files. The crawler implements the Robots Exclusion Standard, and it is intentionally slow so it does not cause much load on the crawled web sites.
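As a rough illustration of two of the steps described above (honoring the Robots Exclusion Standard and stripping HTML markup), here is a minimal sketch that uses only the Python standard library. It is not the crawler's own code: the names `TextExtractor`, `html_to_text`, and `allowed_by_robots` are hypothetical, and real boilerplate removal usually needs site-specific rules that are not shown here.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical helper names; this is an illustration, not Corpus Crawler's code.
BLOCK_TAGS = {"p", "div", "br", "li", "h1", "h2", "h3", "h4", "h5", "h6", "tr"}
SKIPPED_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    """Collects visible text and ignores the contents of script/style."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "br":
            self.chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            # End of a block-level element: start a new output line.
            self.chunks.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string, one block per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    lines = (" ".join(line.split()) for line in "".join(parser.chunks).split("\n"))
    return "\n".join(line for line in lines if line)


def allowed_by_robots(robots_txt, user_agent, url):
    """Check a URL against the text of an already downloaded robots.txt."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules.can_fetch(user_agent, url)
```

A polite crawler would call `allowed_by_robots` before every request and pause between requests, for example with `time.sleep`, so that the crawled site sees little load.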

This is not an official Google product. But if you’re a linguistic researcher, or if you’re writing a spell checker (or similar language-processing software) for an “exotic” language, you might find Corpus Crawler useful.

To build corpora for not-yet-supported languages, please read the contribution guidelines and send us GitHub pull requests.

Supported Languages

IETF BCP47 Code	Language	Tokens¹
`ae`	Avestan	129K 💾
`ae-Latn`	Avestan (Latin)	141K 💾
`am`	Amharic	2,170K 💾
`ar`	Arabic	14,345K² 💾
`az`	Azerbaijani	3,413K 💾
`ba`	Bashkir	666K 💾
`be`	Belarusian	1,441K 💾
`bg`	Bulgarian	10,597K 💾
`bm`	Bambara	30K 💾
`bn`	Bangla	7,258K 💾
`bo`	Tibetan	5,642K 💾
`bs`	Bosnian	8,993K 💾
`ccp`	Chakma	79K 💾
`cs`	Czech	3,141K 💾
`cy`	Welsh	11,519K 💾
`de`	German	46,431K 💾
`dz`	Dzongkha	61K 💾
`el`	Greek	5,470K 💾
`es`	Spanish	32,670K 💾
`fa`	Persian	9,114K 💾
`fa-AF`	Dari	7,363K 💾
`fi`	Finnish	4,837K 💾
`fit`	Tornedalen Finnish	292K 💾
`fo`	Faroese	851K 💾
`fuv`	Nigerian Fulfulde	13K 💾
`ga`	Irish	298K 💾
`gd`	Scottish Gaelic	17,105K 💾
`gsw-u-sd-chag`	Swiss German (Aargau)	99K 💾
`gsw-u-sd-chbe`	Swiss German (Bern)	73K 💾
`gsw-u-sd-chfr`	Swiss German (Fribourg)	42K 💾
`gv`	Manx Gaelic	152K 💾
`ha`	Hausa	1,775K 💾
`haw`	Hawaiian	2,221K 💾
`hi`	Hindi	10,004K 💾
`hr`	Croatian	8,188K 💾
`hy`	Armenian	25,972K 💾
`id`	Indonesian	6,634K 💾
`ig`	Igbo	13K 💾
`iu`	Inuktitut	98K 💾
`ja`	Japanese	2,116K 💾
`kab`	Kabyle	66K 💾
`kj`	Kuanyama	1,474K 💾
`kk`	Kazakh	642K 💾
`km`	Khmer	29,110K 💾
`ku`	Kurdish	2,479K 💾
`ky`	Kyrgyz	18,597K 💾
`la`	Latin	48K 💾
`lb`	Luxembourgish	5,173K 💾
`lo`	Lao	4,384K 💾
`mi`	Maori	1,504K 💾
`mk`	Macedonian	10,422K 💾
`mnw`	Mon	1,836K 💾
`mr`	Marathi	16,594K 💾
`mt`	Maltese	3,331K 💾
`my`	Burmese	1,007K 💾
`my-t-d0-zawgyi`	Burmese (Zawgyi encoding)	593K 💾
`nl`	Dutch	24,289K² 💾
`ny`	Nyanja	356K 💾
`osa`	Osage	3K 💾
`pcm`	Nigerian Pidgin	315K 💾
`pa`	Punjabi	28,446K² 💾
`pl`	Polish	7,148K 💾
`ps`	Pashto	7,343K 💾
`rm-puter`	Romansh (Puter)	1,068K 💾
`rm-rumgr`	Romansh (Grischun)	4,794K 💾
`rm-surmiran`	Romansh (Surmiran)	2,540K 💾
`rm-sursilv`	Romansh (Sursilvan)	11,678K 💾
`rm-sutsilv`	Romansh (Sutsilvan)	1,007K 💾
`rm-vallader`	Romansh (Vallader)	5,560K 💾
`ro`	Romanian	13,962K 💾
`ru`	Russian	40,987K² 💾
`rw`	Kinyarwanda	605K 💾
`sah`	Sakha	2,457K 💾
`shn`	Shan	1,435K 💾
`si`	Sinhala	1,046K 💾
`sl`	Slovenian	10,975K 💾
`sn`	Shona	2,542K 💾
`so`	Somali	874K 💾
`sq`	Albanian	10,104K 💾
`sr-Latn`	Serbian (Latin)	10,143K 💾
`sv`	Swedish	33,633K 💾
`sw`	Swahili	8,817K 💾
`ta`	Tamil	1,413K 💾
`ti`	Tigrinya	803K 💾
`tpi`	Tok Pisin	8,049K 💾
`tr`	Turkish	13,846K 💾
`tt`	Tatar	1,356K 💾
`ug`	Uyghur	9,493K 💾
`uk`	Ukrainian	12,921K 💾
`ur`	Urdu	3,622K 💾
`vec`	Venetian	2K 💾
`vec-u-sd-itpd`	Venetian (Padua)	813K 💾
`vec-u-sd-itts`	Venetian (Trieste)	12K 💾
`vec-u-sd-itvr`	Venetian (Verona)	16K 💾
`yo`	Yoruba	80K 💾

¹ To count tokens, we use an ICU word break iterator and count all tokens whose break status is one of UBRK_WORD_LETTER, UBRK_WORD_KANA, or UBRK_WORD_IDEO. Downloadable files include counts for each token. To get the raw text, run the crawler yourself.
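The counting rule in footnote ¹ can be sketched as follows. This is an approximation under stated assumptions, not the script that produced the table: `count_tokens`, `is_counted`, and `icu_segments` are hypothetical names, and `icu_segments` assumes the third-party PyICU package is installed. The numeric ranges are the rule-status values ICU defines in `ubrk.h`, where each category spans from its tag up to (but not including) the matching `_LIMIT` value.

```python
from collections import Counter

# Rule-status ranges from ICU's ubrk.h; each category covers [TAG, TAG_LIMIT).
UBRK_WORD_LETTER = 200      # words containing letters
UBRK_WORD_KANA = 300        # kana words
UBRK_WORD_IDEO = 400        # ideographic words
UBRK_WORD_IDEO_LIMIT = 500


def is_counted(status):
    """True for LETTER, KANA, and IDEO segments; false for spaces,
    punctuation (status 0..99), and numbers (status 100..199)."""
    return UBRK_WORD_LETTER <= status < UBRK_WORD_IDEO_LIMIT


def count_tokens(segments):
    """Count tokens from an iterable of (text, rule_status) pairs."""
    counts = Counter()
    for text, status in segments:
        if is_counted(status):
            counts[text] += 1
    return counts


def icu_segments(text, locale="en"):
    """Yield (segment, rule_status) pairs using PyICU (assumed installed).

    Caveat: ICU reports boundaries in UTF-16 code units, so slicing a
    Python str this way is only exact for text within the Basic
    Multilingual Plane.
    """
    import icu  # third-party package "PyICU"

    breaker = icu.BreakIterator.createWordInstance(icu.Locale(locale))
    breaker.setText(text)
    start = breaker.first()
    for end in breaker:
        yield text[start:end], breaker.getRuleStatus()
        start = end
```

With PyICU available, `count_tokens(icu_segments("Hello, world!"))` would count the word segments and skip the punctuation and whitespace.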

² Crawl is still in progress; the final number will be larger.

Running the Crawler

./corpuscrawler --language=rm --output=./corpus
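To crawl several languages in a row, the command above can be wrapped in a small script. The sketch below only assembles the same two flags shown in the command; `crawl_command` is a hypothetical helper, and it assumes that codes from the Supported Languages table are accepted as `--language` values.

```python
import subprocess


def crawl_command(language, output_dir="./corpus", executable="./corpuscrawler"):
    """Build the argument list for one crawler run (hypothetical helper)."""
    return [executable, f"--language={language}", f"--output={output_dir}"]


def crawl_all(languages, output_dir="./corpus"):
    """Run the crawler once per language code, stopping at the first failure."""
    for language in languages:
        subprocess.run(crawl_command(language, output_dir), check=True)
```

For example, `crawl_all(["gv", "ga"])` would run the crawler for Manx Gaelic and then for Irish, one after the other.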
