google/corpuscrawler
Commit History
Commits on Nov 3, 2017
[mi] (public domain) Bible scraper
jimregan
authored and
brawer
committed
Nov 3, 2017
2d1431c
another sentence start to omit
jimregan
authored and
brawer
committed
Nov 3, 2017
313402a
[es, mi, si] Add token counts for Maori and Sinhala; update Spanish
brawer
committed
Nov 3, 2017
249d011
Commits on Nov 2, 2017
[ga] conditions were right, needed to cast to int
jimregan
authored and
brawer
committed
Nov 2, 2017
7d68c92
need more ns/no ns handling here
jimregan
authored and
brawer
committed
Nov 2, 2017
97a70e3
Commits on Nov 1, 2017
[ga] url conditions were backwards
jimregan
authored and
brawer
committed
Nov 1, 2017
d1f51a9
handle mixed broken/unbroken namespaces
jimregan
authored and
brawer
committed
Nov 1, 2017
039de18
fix regex
jimregan
authored and
brawer
committed
Nov 1, 2017
63554b8
[ga] get rid of duplicate sitemap crawler
jimregan
authored and
brawer
committed
Nov 1, 2017
20427b2
strip cookie warnings, etc.
jimregan
authored and
brawer
committed
Nov 1, 2017
4ea2d51
some of the files in the sitemap do not exist
jimregan
authored and
brawer
committed
Nov 1, 2017
c53b003
deal with RTE's funky sitemap
jimregan
authored and
brawer
committed
Nov 1, 2017
bedad9d
change to Translation.en:
jimregan
authored and
brawer
committed
Nov 1, 2017
54e8098
skip articles that describe a news programme (with identical text each time)
jimregan
authored and
brawer
committed
Nov 1, 2017
bf63f98
make tags conditional, they aren't always present
jimregan
authored and
brawer
committed
Nov 1, 2017
01901b6
some fixes; also crawl the sport section
jimregan
authored and
brawer
committed
Nov 1, 2017
d9997f7
these articles have tags, output them
jimregan
authored and
brawer
committed
Nov 1, 2017
8beafc5
[mi] first pass at Maori
jimregan
authored and
brawer
committed
Nov 1, 2017
b186132
strip \r, replace multiple \n with one
jimregan
authored and
brawer
committed
Nov 1, 2017
9b5c070
remove page numbers and blank page notices
jimregan
authored and
brawer
committed
Nov 1, 2017
78b2c9d
first pass
jimregan
authored and
brawer
committed
Nov 1, 2017
ef42503
[util] Add filepath to FetchResult
behnam
authored and
brawer
committed
Nov 1, 2017
cd67262
basic crawler for Irish; fetch_sitemap returning nothing :/
jimregan
authored and
brawer
committed
Nov 1, 2017
ec6bcb3
Commits on Oct 31, 2017
[gd] Add basic crawler for Scots Gaelic
jimregan
authored and
brawer
committed
Oct 31, 2017
eb8b1cd
Commits on Oct 28, 2017
Add crawler for Sinhala
keshan
authored and
brawer
committed
Oct 28, 2017
f9d2aab
Commits on Oct 26, 2017
[ky] Filter out articles without title or text
brawer
committed
Oct 26, 2017
b141972
Right-align token count in README
brawer
committed
Oct 26, 2017
c0d30ed
Clean up README file
brawer
committed
Oct 26, 2017
032434a
Add download link for word count files to README
brawer
committed
Oct 26, 2017
5eda744
[ky] Make content extraction for Kyrgyz newspaper a bit more resilient
brawer
committed
Oct 26, 2017
f8929ce
[mr] Crawl a Marathi language corpus
brawer
committed
Oct 26, 2017
5186e86
Commits on Oct 25, 2017
[util] Replace unichr() for narrow Python builds
behnam
authored and
brawer
committed
Oct 25, 2017
54207e6
[de, es, sv] Update token counts for German, Spanish, Swedish corpora
brawer
committed
Oct 25, 2017
b2ead9a
[mk, uk] Update token counts for Macedonian and Ukrainian corpus
brawer
committed
Oct 25, 2017
ab5386c
[ar] Add bbc_news and sputnik_news
behnam
authored and
brawer
committed
Oct 25, 2017
f1433de