
HTML source raises a ValueError in SmartScraperGraph when the documentation indicates it should work #845

Open
Kaoticz opened this issue Dec 16, 2024 · 6 comments


Kaoticz commented Dec 16, 2024

Description

Passing raw HTML as the "source" parameter to a SmartScraperGraph causes a ValueError to be raised when executing the run() method of the scraper.

To Reproduce

SmartScraperGraph(prompt, raw_html, graph_config).run()

Expected behavior

The run() method should return a string with the result of the scraping operation.

Actual behavior

The following error is raised:

Traceback (most recent call last):
  File ".../repository_path/main.py", line 72, in <module>
    json: str = scrape("""
                ^^^^^^^^^^
  File ".../repository_path/main.py", line 64, in scrape
    return smart_scraper_graph.run()
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../repository_path/venv/lib/python3.12/site-packages/scrapegraphai/graphs/smart_scraper_graph.py", line 280, in run
    self.final_state, self.execution_info = self.graph.execute(inputs)
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../repository_path/venv/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 321, in execute
    return self._execute_standard(initial_state)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../repository_path/venv/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 270, in _execute_standard
    raise e
  File ".../repository_path/venv/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 243, in _execute_standard
    result, node_exec_time, cb_data = self._execute_node(
                                      ^^^^^^^^^^^^^^^^^^^
  File ".../repository_path/venv/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 171, in _execute_node
    result = current_node.execute(state)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../repository_path/venv/lib/python3.12/site-packages/scrapegraphai/nodes/fetch_node.py", line 135, in execute
    if self.is_valid_url(source):
       ^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../repository_path/venv/lib/python3.12/site-packages/scrapegraphai/nodes/fetch_node.py", line 97, in is_valid_url
    raise ValueError(
ValueError: Invalid URL format: <html>big-ass html code here</html>. URL must start with http(s):// and contain a valid domain.

Desktop:

  • OS: Arch Linux
  • Browser: Chromium 131.0.6778.139
  • ScrapegraphAI Version: 1.33.3

Additional context

I narrowed down the problem to this code:

try:
    if self.is_valid_url(source):
        return self.handle_web_source(state, source)
except ValueError as e:
    # Re-raise the exception from is_valid_url
    raise
return self.handle_local_source(state, source)

In this function, a series of checks are performed to determine the nature of the source data (if it's a json file, an xml file, a pdf file, etc). When all of them fail, the code checks if the source is a web URL. If that also fails, it considers the source to be "local" and proceeds with that.

The issue lies in the check for the web URL. is_valid_url(source) either returns True when the source is a web URL, or raises an error when it's not (weird design decision, it should just return False instead).

def is_valid_url(self, source: str) -> bool:
    """
    Validates if the source string is a valid URL using regex.

    Parameters:
        source (str): The URL string to validate

    Raises:
        ValueError: If the URL is invalid
    """
    import re

    url_pattern = r"^https?://[^\s/$.?#].[^\s]*$"
    if not bool(re.match(url_pattern, source)):
        raise ValueError(
            f"Invalid URL format: {source}. URL must start with http(s):// and contain a valid domain."
        )
    return True

When the source fails the check, the error is caught and then immediately re-raised again. That means the call to handle_local_source(...) is unreachable because the try/except will either return from the function or raise an error before the call can be performed.
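The control flow can be demonstrated with a self-contained sketch (hypothetical simplified names, not the library's actual code): because of the re-raise, a non-URL source can never reach the local-source branch.

```python
# Minimal reproduction of the control flow in FetchNode.execute
# (names simplified for illustration).

def is_valid_url(source: str) -> bool:
    """Mimics the library's validator: raises instead of returning False."""
    if not source.startswith(("http://", "https://")):
        raise ValueError(f"Invalid URL format: {source}")
    return True

def fetch(source: str) -> str:
    try:
        if is_valid_url(source):
            return "web"      # taken for valid URLs
    except ValueError:
        raise                 # re-raised immediately for raw HTML
    return "local"            # unreachable: every path above returns or raises
```

Calling fetch on a URL takes the "web" branch, while raw HTML raises, so the final return statement is dead code.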

Ideally, the try/except should be removed and is_valid_url(...) should be rewritten so it either returns True or False (so no raising errors). If that's not a possibility due to other parts of the code base relying on the behavior of is_valid_url(...), then the raise in the try/except should be changed to a pass so the function can proceed to calling handle_local_source(...).
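A minimal sketch of the first suggestion (hypothetical code, not a maintainer patch): is_valid_url rewritten to return a boolean instead of raising, so the caller can simply fall through to handle_local_source(...).

```python
import re

def is_valid_url(source: str) -> bool:
    """Return True if source looks like an http(s) URL, False otherwise."""
    url_pattern = r"^https?://[^\s/$.?#].[^\s]*$"
    return bool(re.match(url_pattern, source))
```

With this version the try/except in execute() becomes unnecessary: a raw-HTML source returns False and execution proceeds to the local-source handling.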


teemall commented Dec 19, 2024

Bump


SwapnilSonker commented Dec 23, 2024

I believe there is no need to change the ValueError to False, as this could cause confusion about the type of error encountered. Raising an error when a raw HTML input is passed is appropriate because it clearly indicates to the user where the issue lies. Raw HTML is not meant to be processed in this context.

Here, is the documentation from where I found the above solution.
https://scrapegraph-doc.onrender.com/docs/Graphs/smart_scraper_graph

If you are still having issues, feel free to ask, @Kaoticz.

@VinciGit00, do you have any idea what has to be done on this?


Kaoticz commented Dec 23, 2024

@SwapnilSonker If that's the case, the documentation here should be changed accordingly. More specifically, the comment in the codeblock that states the following:

   # also accepts a string with the already downloaded HTML code
   source="https://perinim.github.io/projects",

I believe there is no need to change the ValueError to False, as this could cause confusion about the type of error encountered.

I don't think so. If is_valid_url(...) were to return True or False, the code in execute(...) could've been altered to the following:

if self.is_valid_url(source): 
    return self.handle_web_source(state, source)
else:
    raise ValueError('source is not a valid URL.')

This would make the intent of the code clearer without changing its behavior.

Here, is the documentation from where I found the above solution.

It seems that you forgot to link the relevant documentation.

Either way, I'm concerned that, if raw HTML is not supposed to be passed there, how can we pass HTML to the LLM model? Every scraper graph seems to only accept URLs, which is a problem for scraping pages that dynamically load the desired data.

@SwapnilSonker

Have you tried raising a PR for this issue? I've added the documentation I was referring to.


Kaoticz commented Dec 23, 2024

I have, but soon realized that the relevant code had been changed to only trigger when source is a URL, so I assumed a PR would be unnecessary.

if input_type == "url":
    try:
        if self.is_valid_url(source):
            return self.handle_web_source(state, source)
    except ValueError as e:
        # Re-raise the exception from is_valid_url
        raise

@SwapnilSonker

I understood this while reading the code, but your test case seemed too obvious, so I checked the documentation and found it does mention your use case:

   # also accepts a string with the already downloaded HTML code
   source="https://perinim.github.io/projects",

Now the error message makes it clear.

Thank you @Kaoticz for your cooperation; let's connect.
