
SmartScraperGraph only extracts a small part of items requested #710

Open
sillasgonzaga opened this issue Sep 29, 2024 · 8 comments

@sillasgonzaga
Describe the bug
This is not exactly an error: I am trying to scrape this AliExpress search page, which lists 60 products on its first page, but SmartScraperGraph only returns data for 10 of them. It's probably due to how the page is loaded. Is there any parameter I could use to increase the wait time before the source code of the requested page is extracted?

To Reproduce

from scrapegraphai.graphs import SmartScraperGraph

# Define the configuration for the scraping pipeline
graph_config = {
    "llm": {
        "api_key": "MY_KEY",
        "model": "openai/gpt-4o-mini",
    },
    "library": "selenium",
    "verbose": False,
    "headless": True
}

smart_scraper_graph = SmartScraperGraph(
    prompt="Return the data about the products listed, including product id and product name",
    source="https://pt.aliexpress.com/w/wholesale-TECIDO-PAET%C3%8A-ROSA.html",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)
@VinciGit00
Collaborator

@sillasgonzaga
Author

@VinciGit00 thanks, but sadly it did not work; it kept returning just 10 results.

@VinciGit00
Collaborator

Have you tried a config like this?

graph_config = {
    "llm": {
        "api_key": openai_key,
        "model": "openai/gpt-4o",
    },
    "verbose": True,
    "headless": False,
}

headless should be False.
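For reference, a fuller sketch of a config that also forwards options to the page loader so the page has more time to render before extraction. The "loader_kwargs" key and its "slow_mo" entry are assumptions here; check them against the installed scrapegraphai version before relying on them.

```python
# Hypothetical config sketch: "loader_kwargs" and "slow_mo" are assumptions
# to verify against the installed scrapegraphai version.
graph_config = {
    "llm": {
        "api_key": "MY_KEY",
        "model": "openai/gpt-4o",
    },
    "loader_kwargs": {
        "slow_mo": 2000,  # assumed option: ms delay between browser actions
    },
    "verbose": True,
    "headless": False,  # per the suggestion above, keep headless off
}
```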

@djds4rce

djds4rce commented Oct 1, 2024

Tried with headless set to False too; same behaviour.

@SwapnilSonker
Contributor

@VinciGit00 I want to work on this issue. Any additional information would help me get started, and if the issue is unassigned, please assign it to me.

@VinciGit00
Collaborator

Please tell me what you want to do

@SwapnilSonker
Contributor

@VinciGit00 Sure: I will increase the page-loading wait parameters so that all products are extracted once the page finishes loading, and if there is any other bug besides that, I will try to handle it myself.

@SwapnilSonker
Contributor

@VinciGit00
PR - #849

I also have a second way to work around this: since @sillasgonzaga is using Selenium, here is a custom fetch function.

# Imports included for convenience
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
import time

def selenium_fetch(url, wait_time=5, scroll_pause=2):
    # Configure Selenium WebDriver
    options = Options()
    # Runs with a visible browser by default; on Selenium 4+, enable
    # headless mode with: options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)

    try:
        # Open the URL
        driver.get(url)
        time.sleep(wait_time)  # Allow initial page load

        # Simulate scrolling to load more products
        last_height = driver.execute_script("return document.body.scrollHeight")
        while True:
            driver.find_element(By.TAG_NAME, "body").send_keys(Keys.END)
            time.sleep(scroll_pause)  # Allow time for additional products to load
            new_height = driver.execute_script("return document.body.scrollHeight")
            if new_height == last_height:  # Break if no new content is loaded
                break
            last_height = new_height

        # Return the full page source
        return driver.page_source

    finally:
        driver.quit()
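The scroll loop above can be factored into a small helper and sanity-checked without a browser. In this sketch, get_height and scroll_once are illustrative stand-ins (not Selenium APIs) injected as callables, so the stop condition — "keep scrolling until the page height stops growing" — can be exercised with fakes:

```python
def scroll_until_stable(get_height, scroll_once, max_rounds=50):
    """Scroll repeatedly until the reported page height stops growing.

    get_height and scroll_once are illustrative callables (not Selenium
    APIs), so the loop can be tested without a real browser. max_rounds
    caps the loop to guard against endlessly growing feeds.
    """
    last_height = get_height()
    for _ in range(max_rounds):
        scroll_once()
        new_height = get_height()
        if new_height == last_height:  # no new content loaded; stop
            break
        last_height = new_height
    return last_height

# Simulate a page that grows on the first two scrolls, then stabilizes.
heights = iter([2000, 3000, 3000])
state = {"h": 1000}
print(scroll_until_stable(lambda: state["h"],
                          lambda: state.update(h=next(heights))))  # prints 3000
```

In real use, get_height would wrap driver.execute_script("return document.body.scrollHeight") and scroll_once would send Keys.END to the body, with a pause between scrolls as in selenium_fetch above.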

VinciGit00 added a commit that referenced this issue Jan 5, 2025
#710 - just a check added for the detailed logging