
HTML source raises a ValueError in SmartScraperGraph when the documentation indicates it should work #845

Open
Kaoticz opened this issue Dec 16, 2024 · 6 comments


Kaoticz commented Dec 16, 2024

Description

Passing raw HTML as the "source" parameter to a SmartScraperGraph causes a ValueError to be raised when executing the run() method of the scraper.

To Reproduce

SmartScraperGraph(prompt, raw_html, graph_config).run()

Expected behavior

The run() method should return a string with the result of the scraping operation.

Actual behavior

The following error is raised:

Traceback (most recent call last):
  File ".../repository_path/main.py", line 72, in <module>
    json: str = scrape("""
                ^^^^^^^^^^
  File ".../repository_path/main.py", line 64, in scrape
    return smart_scraper_graph.run()
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../repository_path/venv/lib/python3.12/site-packages/scrapegraphai/graphs/smart_scraper_graph.py", line 280, in run
    self.final_state, self.execution_info = self.graph.execute(inputs)
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../repository_path/venv/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 321, in execute
    return self._execute_standard(initial_state)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../repository_path/venv/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 270, in _execute_standard
    raise e
  File ".../repository_path/venv/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 243, in _execute_standard
    result, node_exec_time, cb_data = self._execute_node(
                                      ^^^^^^^^^^^^^^^^^^^
  File ".../repository_path/venv/lib/python3.12/site-packages/scrapegraphai/graphs/base_graph.py", line 171, in _execute_node
    result = current_node.execute(state)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../repository_path/venv/lib/python3.12/site-packages/scrapegraphai/nodes/fetch_node.py", line 135, in execute
    if self.is_valid_url(source):
       ^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../repository_path/venv/lib/python3.12/site-packages/scrapegraphai/nodes/fetch_node.py", line 97, in is_valid_url
    raise ValueError(
ValueError: Invalid URL format: <html>big-ass html code here</html>. URL must start with http(s):// and contain a valid domain.

Desktop:

  • OS: Arch Linux
  • Browser: Chromium 131.0.6778.139
  • ScrapegraphAI Version: 1.33.3

Additional context

I narrowed down the problem to this code:

try:
    if self.is_valid_url(source):
        return self.handle_web_source(state, source)
except ValueError as e:
    # Re-raise the exception from is_valid_url
    raise
return self.handle_local_source(state, source)

In this function, a series of checks are performed to determine the nature of the source data (if it's a json file, an xml file, a pdf file, etc). When all of them fail, the code checks if the source is a web URL. If that also fails, it considers the source to be "local" and proceeds with that.

The issue lies in the check for the web URL. is_valid_url(source) either returns True when the source is a web URL, or raises an error when it's not (weird design decision, it should just return False instead).

def is_valid_url(self, source: str) -> bool:
    """
    Validates if the source string is a valid URL using regex.

    Parameters:
        source (str): The URL string to validate

    Raises:
        ValueError: If the URL is invalid
    """
    import re

    url_pattern = r"^https?://[^\s/$.?#].[^\s]*$"
    if not bool(re.match(url_pattern, source)):
        raise ValueError(
            f"Invalid URL format: {source}. URL must start with http(s):// and contain a valid domain."
        )
    return True

When the source fails the check, the error is caught and then immediately re-raised again. That means the call to handle_local_source(...) is unreachable because the try/except will either return from the function or raise an error before the call can be performed.
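The control flow can be demonstrated with a self-contained sketch (hypothetical simplified names, not the library's actual code): because of the re-raise, a non-URL source can never reach the local-source branch.

```python
# Minimal reproduction of the control flow in FetchNode.execute
# (names simplified for illustration).

def is_valid_url(source: str) -> bool:
    """Mimics the library's validator: raises instead of returning False."""
    if not source.startswith(("http://", "https://")):
        raise ValueError(f"Invalid URL format: {source}")
    return True

def fetch(source: str) -> str:
    try:
        if is_valid_url(source):
            return "web"      # taken for valid URLs
    except ValueError:
        raise                 # re-raised immediately for raw HTML
    return "local"            # unreachable: every path above returns or raises
```

Calling fetch on a URL takes the "web" branch, while raw HTML raises, so the final return statement is dead code.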

Ideally, the try/except should be removed and is_valid_url(...) should be rewritten so it either returns True or False (so no raising errors). If that's not a possibility due to other parts of the code base relying on the behavior of is_valid_url(...), then the raise in the try/except should be changed to a pass so the function can proceed to calling handle_local_source(...).
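A minimal sketch of the first suggestion (hypothetical code, not a maintainer patch): is_valid_url rewritten to return a boolean instead of raising, so the caller can simply fall through to handle_local_source(...).

```python
import re

def is_valid_url(source: str) -> bool:
    """Return True if source looks like an http(s) URL, False otherwise."""
    url_pattern = r"^https?://[^\s/$.?#].[^\s]*$"
    return bool(re.match(url_pattern, source))
```

With this version the try/except in execute() becomes unnecessary: a raw-HTML source returns False and execution proceeds to the local-source handling.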


teemall commented Dec 19, 2024

Bump


SwapnilSonker commented Dec 23, 2024

I believe there is no need to change the ValueError to False, as this could cause confusion about the type of error encountered. Raising an error when a raw HTML input is passed is appropriate because it clearly indicates to the user where the issue lies. Raw HTML is not meant to be processed in this context.

Here, is the documentation from where I found the above solution.
https://scrapegraph-doc.onrender.com/docs/Graphs/smart_scraper_graph

If you are still having issues, feel free to ask, @Kaoticz.

@VinciGit00, do you have any idea what has to be done on this?


Kaoticz commented Dec 23, 2024

@SwapnilSonker If that's the case, the documentation here should be changed accordingly. More specifically, the comment in the codeblock that states the following:

   # also accepts a string with the already downloaded HTML code
   source="https://perinim.github.io/projects",

I believe there is no need to change the ValueError to False, as this could cause confusion about the type of error encountered.

I don't think so. If is_valid_url(...) were to return True or False, the code in execute(...) could've been altered to the following:

if self.is_valid_url(source): 
    return self.handle_web_source(state, source)
else:
    raise ValueError('source is not a valid URL.')

This would make the intent of the code clearer without changing its behavior.

Here, is the documentation from where I found the above solution.

It seems that you forgot to link the relevant documentation.

Either way, I'm concerned that, if raw HTML is not supposed to be passed there, how can we pass HTML to the LLM model? Every scraper graph seems to only accept URLs, which is a problem for scraping pages that dynamically load the desired data.

@SwapnilSonker

Have you tried raising a PR for this issue? I've added the documentation I was referring to.


Kaoticz commented Dec 23, 2024

I have, but soon realized that the relevant code had been changed to only trigger when source is a URL, so I assumed a PR would be unnecessary.

if input_type == "url":
    try:
        if self.is_valid_url(source):
            return self.handle_web_source(state, source)
    except ValueError as e:
        # Re-raise the exception from is_valid_url
        raise

@SwapnilSonker

I understood this while reading the code, but your test case seemed too obvious, so I checked the documentation and found it does mention your use case:

   # also accepts a string with the already downloaded HTML code
   source="https://perinim.github.io/projects",

Now the error message makes it clear.

Thank you @Kaoticz for your cooperation; let's connect.
