
I'm trying to scrape a table from a webpage using Selenium and BeautifulSoup, but I'm not sure how to get to the actual data with BeautifulSoup.

webpage: https://leetify.com/app/match-details/5c438e85-c31c-443a-8257-5872d89e548c/details-general

I tried extracting table rows (tag <tr>), but when I call find_all, the array is empty.

When I inspect the element, I see several elements with a tr tag, so why don't they show up with BeautifulSoup.find_all()?

Code:

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()

driver.get("https://leetify.com/app/match-details/5c438e85-c31c-443a-8257-5872d89e548c/details-general")

html_source = driver.page_source

soup = BeautifulSoup(html_source, 'html.parser')

table = soup.find_all("tbody")
print(len(table))
for entry in table:
    print(entry)
    print("\n")

3 Answers


why don't they show up with BeautifulSoup.find_all()?

After taking a quick glance, it seems like the page takes a long time to load.

The thing is, when you pass driver.page_source to BeautifulSoup, not all of the HTML has loaded yet.

So, the solution would be to use an Explicit wait:

Wait until page is loaded with Selenium WebDriver for Python
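For example, a minimal sketch of that approach (the "table tr" selector and the 20-second timeout are assumptions; adjust them to whichever element you actually need):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://leetify.com/app/match-details/5c438e85-c31c-443a-8257-5872d89e548c/details-general")

# Wait (up to 20 seconds) until at least one table row is present in the DOM
# before handing the page source to BeautifulSoup.
WebDriverWait(driver, 20).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "table tr"))
)

soup = BeautifulSoup(driver.page_source, "html.parser")
print(len(soup.find_all("tr")))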

or even (less recommended):

from time import sleep
sleep(10)

But I'm not 100% sure, since I don't currently have Selenium installed on my machine.


However, I'd like to suggest a completely different solution:

If you take a look at your browser's network calls (press F12 to open the developer tools), you'll see that the data (the table) you're looking for is loaded by sending a GET request to their API.

The endpoint is under:

https://api.leetify.com/api/games/5c438e85-c31c-443a-8257-5872d89e548c

which you can view directly from your browser.

So, you can use the requests library directly to make a GET request to the above endpoint, which will be much more efficient:

import requests
from pprint import pprint

response = requests.get('https://api.leetify.com/api/games/5c438e85-c31c-443a-8257-5872d89e548c')
data = response.json()


pprint(data)

Prints (truncated):

{'agents': [{'gameFinishedAt': '2024-07-06T07:10:02.000Z',
             'gameId': '5c438e85-c31c-443a-8257-5872d89e548c',
             'id': '63e38340-d1ae-4e19-b51c-e278e3325bbb',
             'model': 'customplayer_tm_balkan_variantk',
             'steam64Id': '76561198062922849',
             'teamNumber': 2},
            {'gameFinishedAt': '2024-07-06T07:10:02.000Z',
             'gameId': '5c438e85-c31c-443a-8257-5872d89e548c',
             'id': 'e10f9fc4-759d-493b-a17f-a85db2fcd09d',
             'model': 'customplayer_ctm_fbi_variantg',
             'steam64Id': '76561198062922849',
             'teamNumber': 3},

This approach bypasses the need to wait for the page to load, allowing you to directly access the data.
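If you only need specific fields, you can index into the parsed JSON directly. A small sketch using the 'agents' key from the truncated output above (any other keys in the payload are assumptions you'd want to verify in your browser):

import requests

response = requests.get('https://api.leetify.com/api/games/5c438e85-c31c-443a-8257-5872d89e548c')
data = response.json()

# 'agents' is visible in the truncated output above; print a few of its fields.
for agent in data['agents']:
    print(agent['steam64Id'], agent['teamNumber'], agent['model'])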


You don't need to use BeautifulSoup, as everything can be done with the selenium module as follows:

from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver import ChromeOptions
from selenium.webdriver.support.expected_conditions import presence_of_all_elements_located as POAEL

URL = "https://leetify.com/app/match-details/5c438e85-c31c-443a-8257-5872d89e548c/details-general"

options = ChromeOptions()
options.add_argument("--headless")

with webdriver.Chrome(options=options) as driver:
    driver.get(URL)
    wait = WebDriverWait(driver, 10)
    selector = By.CSS_SELECTOR, "table tr"
    for tr in wait.until(POAEL(selector)):
        print(tr.text)

The point about the wait is that the page you're scraping is JavaScript-driven, so you need to be sure that the table element has been rendered before you try to analyse its contents.

The HTML content on this page is slightly unusual in that the table has more than one tbody element, so you'll probably want to handle that. The code in this answer simply emits the text from all/any tr elements in the table with no consideration for the tbody they came from; see the sketch below for one way to group them.
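If you do want to keep the rows grouped by their tbody, one possible sketch, placed inside the with block above (the "table tbody" selector is an assumption):

    # Group the rows by the <tbody> they belong to instead of flattening them.
    for index, tbody in enumerate(driver.find_elements(By.CSS_SELECTOR, "table tbody")):
        print(f"tbody {index}:")
        for tr in tbody.find_elements(By.TAG_NAME, "tr"):
            print("   ", tr.text)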


You need to give the webpage time to load. On most websites nowadays, the data is filled in on the client using AJAX. The code below can help, but it's not the ideal way: you should add logic that waits until a particular element is visible (a sketch of that follows the code). Refer to https://selenium-python.readthedocs.io/waits.html

from selenium import webdriver
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome()

driver.get("https://leetify.com/app/match-details/5c438e85-c31c-443a-8257-5872d89e548c/details-general")

time.sleep(2)

html_source = driver.page_source

soup = BeautifulSoup(html_source, 'html.parser')

table = soup.find_all("tbody")
print(len(table))
for entry in table:
    print(entry)
    print("\n")
