a Web Spider - to crawl the latest papers on arXiv

cckenny/arXiv-Spider

 
 

arXiv-Spider

A web spider that crawls papers on arXiv by submission date.

You can change the configuration in main.py.

The variables you are most likely to change are explained below:

  • BASE_URLs: a list of arXiv.org mirrors. The program picks a mirror at random when downloading, which reduces the chance that the crawler is detected.
  • SAVE_DIR: the path where the crawled files are saved.
  • WORKER_NUM: the total number of worker threads.
  • BEGIN_YEAR, BEGIN_MONTH, END_YEAR, END_MONTH: the range of submission dates to crawl. With the default configuration, the program crawls papers submitted from January 2015 to March 2018.
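
To illustrate how these settings could fit together, here is a minimal sketch. It is not the code in main.py: the mirror URLs, the SAVE_DIR and WORKER_NUM values, and the helpers `iter_months` and `pick_mirror` are assumptions made for this example. Only the date range matches the defaults described above.

```python
import random

# Placeholder values for illustration; the real defaults live in main.py.
BASE_URLs = ["https://arxiv.org", "https://export.arxiv.org"]
SAVE_DIR = "./papers"
WORKER_NUM = 4

# Date range described in this README: January 2015 to March 2018.
BEGIN_YEAR, BEGIN_MONTH = 2015, 1
END_YEAR, END_MONTH = 2018, 3


def iter_months(begin_year, begin_month, end_year, end_month):
    """Yield (year, month) pairs from the begin month to the end month, inclusive."""
    year, month = begin_year, begin_month
    while (year, month) <= (end_year, end_month):
        yield year, month
        month += 1
        if month > 12:
            year, month = year + 1, 1


def pick_mirror(base_urls):
    """Return one mirror chosen uniformly at random (hypothetical helper)."""
    return random.choice(base_urls)


months = list(iter_months(BEGIN_YEAR, BEGIN_MONTH, END_YEAR, END_MONTH))
print(len(months), months[0], months[-1])  # 39 (2015, 1) (2018, 3)
print(pick_mirror(BASE_URLs))              # one of the entries in BASE_URLs
```

With this range the crawler would have 39 months of submissions to cover, so narrowing the BEGIN_* and END_* values is the easiest way to shorten a run.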

Tip: If you get a 403 error or an "Access Denied" error, don't worry. It means the crawler downloaded so fast that the server detected it. Be patient and try again a few minutes later.
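
The waiting step can also be automated. The sketch below is one possible approach and is not part of main.py: `fetch_with_retry` and its parameters are hypothetical names, and the one-minute default wait is an assumption, not a value documented by arXiv.

```python
import time
import urllib.error
import urllib.request


def fetch_with_retry(url, retries=3, wait_seconds=60.0,
                     opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch url and return the body as bytes, pausing and retrying on HTTP 403.

    The pause grows with each attempt (wait_seconds, 2 * wait_seconds, ...).
    Any other HTTP error, or a 403 on the last attempt, is re-raised.
    """
    for attempt in range(retries + 1):
        try:
            with opener(url) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            if err.code != 403 or attempt == retries:
                raise
            sleep(wait_seconds * (attempt + 1))
```

The `opener` and `sleep` parameters exist only so the function can be exercised without network access or real delays. In normal use, `fetch_with_retry(url)` is enough.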


Author: Xingw Xiong

Date: 2018-04-08

Date: 2019-04-09
Changes for running on Python 3:

  • Replaced urllib2 with urllib.request.
  • Resolved the mismatch between the bytes-like object in the variable page and the string pattern by using encode('utf-8').
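
For context on the second change: in Python 3, reading a response opened with urllib.request returns bytes, and the re module refuses to mix a str pattern with bytes input. The sketch below shows two common ways to make the types agree. The sample page content and the pattern are made up for illustration and are not taken from main.py.

```python
import re

# Stand-in for the bytes returned by urllib.request.urlopen(url).read().
page = b'<a href="/abs/1501.00001">arXiv:1501.00001</a>'

# Hypothetical pattern; the real one in main.py may differ.
pattern = r"/abs/(\d{4}\.\d{5})"

# Option 1: decode the page so that pattern and input are both str.
ids_from_text = re.findall(pattern, page.decode("utf-8"))

# Option 2: encode the pattern so that pattern and input are both bytes.
ids_from_bytes = re.findall(pattern.encode("utf-8"), page)

print(ids_from_text)   # ['1501.00001']
print(ids_from_bytes)  # [b'1501.00001']
```

Calling `re.findall(pattern, page)` without either conversion raises a TypeError, which is the error the 2019 change addresses.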

Usage

Run main.py

Languages

  • Python 50.4%
  • Tcl 36.2%
  • C 6.7%
  • C++ 6.7%