google/corpuscrawler
Public
Commit History
Branch: master
Author: jimregan
Period: All time
Commits on Aug 10, 2021
[ga] skip search results also
jimregan
committed
Aug 10, 2021
7a218c2
[ga] update crawler
jimregan
committed
Aug 10, 2021
3a9c446
Commits on Nov 20, 2019
[ga] new crawlers
jimregan
committed
Nov 20, 2019
3f7aff5
Commits on Nov 16, 2019
Irish Times changed the section name
jimregan
committed
Nov 16, 2019
c7922ba
use context setter for Irish Times (requires at least TLSv1_2)
jimregan
committed
Nov 16, 2019
a24ed38
make (ssl) context a property, add setter
jimregan
committed
Nov 16, 2019
bc012db
Commits on Nov 6, 2019
strip more boilerplate
jimregan
committed
Nov 6, 2019
ba8c432
fix regex
jimregan
committed
Nov 6, 2019
0a2089e
Commits on May 8, 2018
US embassy crawler for Polish (#38)
jimregan
authored and
brawer
committed
May 8, 2018
cb81515
Commits on Dec 27, 2017
[ga] Crawl additional Irish sites
jimregan
authored and
brawer
committed
Dec 27, 2017
2180472
Commits on Dec 26, 2017
CHG crawler (#35)
jimregan
authored and
brawer
committed
Dec 26, 2017
4b181a8
remove comment
jimregan
committed
Dec 26, 2017
a65b268
Irish Times
jimregan
committed
Dec 26, 2017
8dbbf74
Commits on Dec 15, 2017
continue if no text available; Achi bible has text in browser but not from crawler (or curl)
jimregan
authored and
brawer
committed
Dec 15, 2017
d12e53c
Belarusian, Bulgarian, Bambara
jimregan
authored and
brawer
committed
Dec 15, 2017
09f24f1
Bashkir
jimregan
authored and
brawer
committed
Dec 15, 2017
aec1108
Add Amharic
jimregan
authored and
brawer
committed
Dec 15, 2017
d11ac86
move crawl_bibleis to util; add for Ukrainian
jimregan
authored and
brawer
committed
Dec 15, 2017
b1756ac
Commits on Dec 14, 2017
bible crawl
jimregan
authored and
brawer
committed
Dec 14, 2017
7a7ca05
basic crawler for Aceh
jimregan
authored and
brawer
committed
Dec 14, 2017
3142598
Commits on Nov 3, 2017
[mi] (public domain) Bible scraper
jimregan
authored and
brawer
committed
Nov 3, 2017
2d1431c
another sentence start to omit
jimregan
authored and
brawer
committed
Nov 3, 2017
313402a
Commits on Nov 2, 2017
[ga] conditions were right, needed to cast to int
jimregan
authored and
brawer
committed
Nov 2, 2017
7d68c92
need more ns/no ns handling here
jimregan
authored and
brawer
committed
Nov 2, 2017
97a70e3
Commits on Nov 1, 2017
[ga] url conditions were backwards
jimregan
authored and
brawer
committed
Nov 1, 2017
d1f51a9
handle mixed broken/unbroken namespaces
jimregan
authored and
brawer
committed
Nov 1, 2017
039de18
fix regex
jimregan
authored and
brawer
committed
Nov 1, 2017
63554b8
[ga] get rid of duplicate sitemap crawler
jimregan
authored and
brawer
committed
Nov 1, 2017
20427b2
strip cookie warnings, etc.
jimregan
authored and
brawer
committed
Nov 1, 2017
4ea2d51
some of the files in the sitemap do not exist
jimregan
authored and
brawer
committed
Nov 1, 2017
c53b003
deal with RTE's funky sitemap
jimregan
authored and
brawer
committed
Nov 1, 2017
bedad9d
change to Translation.en:
jimregan
authored and
brawer
committed
Nov 1, 2017
54e8098
skip articles that describe a news programme (with identical text each time)
jimregan
authored and
brawer
committed
Nov 1, 2017
bf63f98
make tags conditional, they aren't always present
jimregan
authored and
brawer
committed
Nov 1, 2017
01901b6
some fixes; also crawl the sport section
jimregan
authored and
brawer
committed
Nov 1, 2017
d9997f7