dalindev/Scrapy-crawler

Scrapy



Used: Python 2.7.9, Scrapy, Git
System: OS X (10.x)

How to run it:
scrapy crawl dalin -o dainhuang.json -t json

Category path:
Department >> Sub_Department >> Sub_Sub_Department >> this_category
(example): Home >> TV & Video >> Televisions >> 25 - 31" Televisions
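
The category path format above can be sketched as a small helper that joins breadcrumb levels with " >> ". This is only an illustration: the function names `build_category_path` and `split_category_path` are assumptions and are not taken from the crawler's code.

```python
SEPARATOR = " >> "


def build_category_path(levels):
    """Join breadcrumb levels into one path string, skipping empty levels."""
    cleaned = [level.strip() for level in levels if level and level.strip()]
    return SEPARATOR.join(cleaned)


def split_category_path(path):
    """Split a path string back into its individual levels."""
    return [part.strip() for part in path.split(">>")]


# The example path from the README, built from its four levels.
path = build_category_path(
    ["Home", "TV & Video", "Televisions", '25 - 31" Televisions']
)
print(path)
print(split_category_path(path))
```

Keeping the path as one string makes it easy to store in a single JSON field, while the split helper recovers the department levels when needed.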

Features:
• Fast: 6639 products retrieved in 5 to 7 minutes (DOWNLOAD_DELAY = 0.01)
• Product info is read from the product list pages (15 products per page), which is much faster than visiting each product page
• No wasted requests: the spider is guided through the categories
• The Gift Card and Bundles Sub_Departments are handled separately because their page formats are different
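
A rough request count shows why reading the list pages is faster. The sketch below uses only the two figures given above (6639 products, 15 per list page). It assumes every list page is full, so the result is a lower bound: the last page of each category may hold fewer than 15 products, and category navigation pages are not counted.

```python
import math

TOTAL_PRODUCTS = 6639     # products retrieved, from the figure above
PRODUCTS_PER_PAGE = 15    # products shown on each list page

# Lower bound on list-page requests, assuming every list page is full.
list_page_requests = int(math.ceil(TOTAL_PRODUCTS / float(PRODUCTS_PER_PAGE)))

# Visiting each product page instead would need one request per product.
product_page_requests = TOTAL_PRODUCTS

print(list_page_requests)     # 443
print(product_page_requests)  # 6639
```

Under these assumptions, the list-page approach needs roughly one fifteenth of the requests.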

Known Issues:
• The output data structure (JSON) is not well formatted, but the data is still easy to process as JSON
• Some category names shown on the http://www.visions.ca/ home page differ from those on the sub-category pages
[ I used the category path shown on the product page ]
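
As a sketch of how the JSON output might be post-processed, the snippet below groups records by category path. The field names (`title`, `price`, `availability`, `category`) and the sample records are assumptions for illustration; the crawler's actual output may use different keys.

```python
import json
from collections import defaultdict

# Hypothetical sample of the crawler's output; the real field names may differ.
RAW_OUTPUT = """
[
  {"title": "Sample TV A", "price": "199.99", "availability": "In stock",
   "category": "Home >> TV & Video >> Televisions"},
  {"title": "Sample TV B", "price": "249.99", "availability": "Out of stock",
   "category": "Home >> TV & Video >> Televisions"},
  {"title": "Sample Gift Card", "price": "25.00", "availability": "In stock",
   "category": "Home >> Gift Card"}
]
"""


def group_by_category(products):
    """Map each category path to the list of products found under it."""
    grouped = defaultdict(list)
    for product in products:
        grouped[product.get("category", "unknown")].append(product)
    return dict(grouped)


products = json.loads(RAW_OUTPUT)
grouped = group_by_category(products)
for category in sorted(grouped):
    print("%s: %d product(s)" % (category, len(grouped[category])))
```

To use this on the real output, replace `RAW_OUTPUT` with the contents of the file written by the crawl command.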




REQUIREMENT:

A simple tool that will scrape product information from http://www.visions.ca/, returning at a minimum the following information:
• The product categories available on the website
• At least one product per category
Each product returned should have the following information:
• Product title
• Product sale or regular price where applicable
• Product availability
As a bonus (but not doing this will not count against you), you may want to use the following in your solution:
• Python
• The scrapy, lxml, or requests python libraries
• Xpaths or CSS selectors
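
The requirement above can be checked against scraped records with a short script. This is a minimal sketch under stated assumptions: the field names in `REQUIRED_FIELDS`, the `category` key, and the sample data are all hypothetical and not taken from the crawler.

```python
REQUIRED_FIELDS = ("title", "price", "availability")  # assumed field names


def missing_fields(product):
    """Return the required fields that are absent or empty in one record."""
    return [field for field in REQUIRED_FIELDS if not product.get(field)]


def categories_without_products(categories, products):
    """Return the categories that have no product in the scraped records."""
    covered = set(product.get("category") for product in products)
    return [category for category in categories if category not in covered]


# Hypothetical records for illustration only.
categories = ["Televisions", "Headphones"]
products = [
    {"title": "Sample TV", "price": "199.99",
     "availability": "In stock", "category": "Televisions"},
    {"title": "Sample TV 2", "price": "",
     "availability": "In stock", "category": "Televisions"},
]

print(missing_fields(products[1]))
print(categories_without_products(categories, products))
```

An empty result from both helpers would mean every record has the three required fields and every category has at least one product.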
