google/corpuscrawler
Commits
Commit History
Commits on Nov 1, 2017
some of the files in the sitemap do not exist
jimregan
authored and
brawer
committed
Nov 1, 2017
c53b003
deal with RTE's funky sitemap
jimregan
authored and
brawer
committed
Nov 1, 2017
bedad9d
change to Translation.en:
jimregan
authored and
brawer
committed
Nov 1, 2017
54e8098
skip articles that describe a news programme (with identical text each time)
jimregan
authored and
brawer
committed
Nov 1, 2017
bf63f98
make tags conditional, they aren't always present
jimregan
authored and
brawer
committed
Nov 1, 2017
01901b6
some fixes; also crawl the sport section
jimregan
authored and
brawer
committed
Nov 1, 2017
d9997f7
these articles have tags, output them
jimregan
authored and
brawer
committed
Nov 1, 2017
8beafc5
[mi] first pass at Maori
jimregan
authored and
brawer
committed
Nov 1, 2017
b186132
strip \r, replace multiple \n with one
jimregan
authored and
brawer
committed
Nov 1, 2017
9b5c070
remove page numbers and blank page notices
jimregan
authored and
brawer
committed
Nov 1, 2017
78b2c9d
first pass
jimregan
authored and
brawer
committed
Nov 1, 2017
ef42503
[util] Add filepath to FetchResult
behnam
authored and
brawer
committed
Nov 1, 2017
cd67262
basic crawler for Irish; fetch_sitemap returning nothing :/
jimregan
authored and
brawer
committed
Nov 1, 2017
ec6bcb3
Commits on Oct 31, 2017
[gd] Add basic crawler for Scots Gaelic
jimregan
authored and
brawer
committed
Oct 31, 2017
eb8b1cd
Commits on Oct 28, 2017
Add crawler for Sinhala
keshan
authored and
brawer
committed
Oct 28, 2017
f9d2aab
Commits on Oct 26, 2017
[ky] Filter out articles without title or text
brawer
committed
Oct 26, 2017
b141972
Right-align token count in README
brawer
committed
Oct 26, 2017
c0d30ed
Clean up README file
brawer
committed
Oct 26, 2017
032434a
Add download link for word count files to README
brawer
committed
Oct 26, 2017
5eda744
[ky] Make content extraction for Kyrgyz newspaper a bit more resilient
brawer
committed
Oct 26, 2017
f8929ce
[mr] Crawl a Marathi language corpus
brawer
committed
Oct 26, 2017
5186e86
Commits on Oct 25, 2017
[util] Replace unichr() for narrow Python builds
behnam
authored and
brawer
committed
Oct 25, 2017
54207e6
[de, es, sv] Update token counts for German, Spanish, Swedish corpora
brawer
committed
Oct 25, 2017
b2ead9a
[mk, uk] Update token counts for Macedonian and Ukrainian corpus
brawer
committed
Oct 25, 2017
ab5386c
[ar] Add bbc_news and sputnik_news
behnam
authored and
brawer
committed
Oct 25, 2017
f1433de
[ky] Improve Kyrgyz content extraction
brawer
committed
Oct 25, 2017
3150c19
Commits on Oct 24, 2017
[pl, tr] Update token counts for Polish and Turkish
brawer
committed
Oct 24, 2017
93567c3
[util/fetch_sitemap] Add subsitemap_filter option (#2)
behnam
authored and
brawer
committed
Oct 24, 2017
425ab82
[ar] Add Modern Standard Arabic: UDHR and DW (#5)
behnam
authored and
brawer
committed
Oct 24, 2017
eeabd58
[util/fetch] Add more prints for showing progress (#4)
behnam
authored and
brawer
committed
Oct 24, 2017
4c2b01b
Commits on Oct 23, 2017
[sv] Handle pages from Sverigesradio that have no title
brawer
committed
Oct 23, 2017
6b226cd
Commits on Oct 18, 2017
[sn] Crawl a Shona language corpus
brawer
committed
Oct 18, 2017
4881b93
[fi, hi] Update token counts for Finnish and Hindi language corpus
brawer
committed
Oct 18, 2017
ac1784a
Commits on Oct 12, 2017
[el, sr-Latn] Update token counts for Greek and Serbian (Latin) corpora
brawer
committed
Oct 12, 2017
8200386
Commits on Oct 11, 2017
[de, es] Crawl language corpora in German and Spanish
brawer
committed
Oct 11, 2017
b214b66