
Questions tagged [web-crawler]

A Web crawler (also known as a Web spider) is a computer program that browses the World Wide Web in a methodical, automated manner. Other terms for Web crawlers are ants, automatic indexers, bots, Web robots, or, especially in the FOAF community, Web scutters.
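
As a rough illustration of "methodical, automated" browsing, here is a minimal breadth-first crawler sketch. It is not tied to any question on this page: the `crawl` function, the `fetch` callable, and the in-memory `PAGES` dictionary are all hypothetical names made up for this example, and the pages are fake so that it runs without network access. A real crawler would also need to respect robots.txt, rate limits, and error handling.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl; returns URLs in the order they were visited.

    `fetch` is a hypothetical callable mapping a URL to HTML text,
    or None when the page cannot be retrieved.
    """
    seen = {start_url}          # URLs already queued, to avoid revisits
    queue = deque([start_url])  # frontier of URLs still to visit
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# Tiny in-memory "web" (made-up pages) standing in for real HTTP requests.
PAGES = {
    "https://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://example.com/a": '<a href="/b">B</a> <a href="/">home</a>',
    "https://example.com/b": '<a href="/c">C</a>',
}

order = crawl("https://example.com/", PAGES.get)
print(order)
```

The `seen` set is what keeps the crawl orderly: each URL is queued at most once, so pages that link back to each other do not cause an endless loop.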

0 votes
0 answers
9 views

Crawl data from IMDb Top 250 Movies

Please, I need someone to help me. I can't understand why I only crawl 25 movies instead of 250. My code: import pandas as pd import requests from bs4 import BeautifulSoup headers = {'User-Agent': '...
Vu-Hoang Duong's user avatar
-1 votes
0 answers
8 views

Does Webflow pagination hurt SEO? [closed]

I'm using Webflow for a certain website and a lot of paginated pages end up in the GSC tab: crawled, currently not indexed. For example: https://www.example.com/blog?65b097f7_page=5 Is this hurting ...
Ruben's user avatar
0 votes
1 answer
29 views

How to exclude div classes 'modal-content' and 'modal-body' from pyppeteer web scraper?

I'm building a scraper that gets text data from a list of articles. A common pattern in the text content I'm scraping at the moment is this message at the bottom: "As a subscriber, ...
Shehzadi Aziz's user avatar
0 votes
0 answers
11 views

Sudden increase in requests received

My application suddenly had a huge increase in the number of requests being made to it. I believe the only change of merit was adding a sitemap.xml, and I believe the increase in requests is due to ...
egauzens's user avatar
0 votes
0 answers
7 views

GitHub Action: problem overwriting/replacing/updating a .json file

I want to use the Google web API to scrape some coffee shop info from my country. There is already an original version of the .json file in my repo, but if a new coffee shop is created, I need to ...
Yatayork's user avatar
0 votes
0 answers
18 views

AWS crawler creating Null values for partition columns

I have some country-level partitioned data in S3, and a crawler is crawling this root folder and creating a table. There are no Null values for country code. But when I looked in Athena, there ...
Ananth's user avatar
-3 votes
0 answers
36 views

Download ICD-10 codes (International Classification of Diseases)

We can easily browse the ICD-10 codes: https://icd.who.int/browse10/2019/en Unfortunately, there is no way to download all of the codes as TXT (or XLS) file in order to parse with Python, or import ...
JoyfulPanda's user avatar
-1 votes
0 answers
20 views

crawler - rotten tomatoes website - problem with pages

I'm trying to crawl the Rotten Tomatoes website, but I have a problem getting the HTML for page 5 and above of the movies, for example: https://www.rottentomatoes.com/browse/movies_at_home/?page=**8** ...
Nadav Goldin's user avatar
1 vote
1 answer
62 views

Scrapy Spider does not work with multiple urls

I wrote a Scrapy spider and used Selenium in it to scrape the products on devgrossonline.com. It does not work with multiple category URLs, but it works when I provide only one URL. Here is my spider: ...
serkan ertas's user avatar
-1 votes
0 answers
22 views

The time obtained by the Python crawler is incorrect when getting comments

When I use Python to crawl stock comments from a website, the time parsed from the website is different from the time obtained by my crawler. For example: when I use F12 to inspect the website, I find ...
Ohhhhh's user avatar
0 votes
1 answer
37 views

TYPO3 indexed search fails to index PDF files

I'm hoping to get help with a problem I can't solve. The working environment is as follows: SYSTEM Debian 12 bookworm PHP 7.4 (tried 8.2 and 8.3 with failure on crawler) + FPM/FastCGI /usr/bin/...
Alessandro Tuveri's user avatar
0 votes
0 answers
13 views

How to download PDFs using Norconex Web Crawler?

I have tried to download PDFs from certain URLs (e.g. https://example.com) using the Norconex Web Crawler (v3.0) and the configuration below, but had no luck. Can someone please help me with this? <?xml ...
teklot's user avatar
0 votes
0 answers
40 views

Getting subsequent GET calls for some PUT and POST APIs in a web site

I'm observing subsequent GET calls for some PUT and POST APIs. I already checked the code and there are no GET calls created for those endpoints. But I'm seeing these calls in my server logs. Say for example ...
coding life's user avatar
-2 votes
0 answers
39 views

TikTok finding username with videoID

I am currently working on a project that deals with the data of the DSA transparency database. Specifically, I am looking at the TikTok data. Now I would like to go one step further and check if the ...
Moritz's user avatar
0 votes
0 answers
11 views

Issues with Crawling Yahoo Auction During Peak Hours in a Cross-Border E-commerce System (Errors 404, 500)

I am seeking assistance with a critical issue we are facing in our cross-border e-commerce auction and proxy purchase platform. Our system relies heavily on web crawling technology to access Yahoo ...
Nguyễn Nam Hải's user avatar
