HTML source raises a ValueError in SmartScraperGraph when the documentation indicates it should work #845
Comments
Bump
I believe there is no need to change the ValueError to False, as this could cause confusion about the type of error encountered. Raising an error when a raw HTML input is passed is appropriate because it clearly indicates to the user where the issue lies. Raw HTML is not meant to be processed in this context. Here is the documentation from where I found the above solution. If you are having issues now, feel free to ask. @VinciGit00, do you have any idea what has to be done on this?
@SwapnilSonker If that's the case, the documentation here should be changed accordingly. More specifically, the comment in the codeblock that states the following:

```python
# also accepts a string with the already downloaded HTML code
source="https://perinim.github.io/projects",
```
I don't think so. If anything, the check could be rewritten as:

```python
if self.is_valid_url(source):
    return self.handle_web_source(state, source)
else:
    raise ValueError('source is not a valid URL.')
```

This would make the intent of the code clearer without changing its behavior.
It seems that you forgot to link the relevant documentation. Either way, I'm concerned that, if raw HTML is not supposed to be passed there, then how can we pass HTML to the LLM model? Every scraper graph seems to only accept URLs, which is a problem for scraping pages that dynamically load the desired data.
Have you tried raising a PR for this issue?
I have, but soon realized that the relevant code had been changed to only trigger when:

Scrapegraph-ai/scrapegraphai/nodes/fetch_node.py, lines 134 to 140 in a44b74a
I understood this already while reading the code, but your test case seemed so obvious that it made me curious. While reading the documentation, I found a mention of your issue and decided to research it, and now the error message makes it clear! Thank you @Kaoticz for cooperating, let's connect.
Description
Passing raw HTML as the "source" parameter to a SmartScraperGraph causes a ValueError to be raised when executing the run() method of the scraper.

To Reproduce
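The original reproduction snippet was not preserved here, but a minimal sketch of the scenario looks like this. The model name and config values are placeholders, not taken from the report.

```python
from scrapegraphai.graphs import SmartScraperGraph

# A small HTML document standing in for an already-downloaded page.
raw_html = """
<html>
  <body>
    <h1>Projects</h1>
    <ul>
      <li>Project A - a scraping library</li>
      <li>Project B - a visualization tool</li>
    </ul>
  </body>
</html>
"""

graph_config = {
    "llm": {
        "api_key": "YOUR_OPENAI_API_KEY",  # placeholder
        "model": "openai/gpt-4o-mini",     # placeholder model name
    },
}

scraper = SmartScraperGraph(
    prompt="List all the projects with their descriptions",
    source=raw_html,  # raw HTML instead of a URL
    config=graph_config,
)

result = scraper.run()  # raises ValueError instead of returning the scraped data
print(result)
```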
Expected behavior
The run() method should return a string with the result of the scraping operation.

Actual behavior
The following error is raised:
Desktop:
Additional context
I narrowed down the problem to this code:
Scrapegraph-ai/scrapegraphai/nodes/fetch_node.py, lines 134 to 141 in 515e12d
In this function, a series of checks is performed to determine the nature of the source data (whether it's a JSON file, an XML file, a PDF file, etc.). When all of them fail, the code checks if the source is a web URL. If that also fails, it considers the source to be "local" and proceeds with that.

The issue lies in the check for the web URL. is_valid_url(source) either returns True when the source is a web URL, or raises an error when it's not (a weird design decision; it should just return False instead).

Scrapegraph-ai/scrapegraphai/nodes/fetch_node.py, lines 83 to 100 in 515e12d
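Paraphrased, the structure described above looks roughly like the sketch below. This is not the repository's actual code; the stub bodies and the simplified execute() are made up purely to illustrate the control flow.

```python
class FetchNode:
    """Minimal stand-in for the real node; only the control flow matters here."""

    def handle_web_source(self, state, source):
        return {"fetched": f"downloaded content of {source}"}

    def handle_local_source(self, state, source):
        return {"fetched": source}  # treat the input itself as the document

    def is_valid_url(self, source: str) -> bool:
        # Returns True for web URLs; raises instead of returning False otherwise.
        if source.startswith(("http://", "https://")):
            return True
        raise ValueError("Invalid URL")

    def execute(self, state, source):
        # Simplified sketch of the branch in question.
        try:
            if self.is_valid_url(source):
                return self.handle_web_source(state, source)
        except ValueError as e:
            # The error is caught and immediately re-raised, so execution
            # never falls through to the local-source call below.
            raise ValueError(f"Invalid URL provided: {e}") from e
        return self.handle_local_source(state, source)  # unreachable
```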
When the source fails the check, the error is caught and then immediately re-raised. That means the call to handle_local_source(...) is unreachable, because the try/except will either return from the function or raise an error before the call can be performed.

Ideally, the try/except should be removed and is_valid_url(...) should be rewritten so it either returns True or False (no raising errors). If that's not possible because other parts of the code base rely on the current behavior of is_valid_url(...), then the raise in the try/except should be changed to a pass so the function can proceed to calling handle_local_source(...).
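A sketch of the first option, assuming a urllib.parse-based check (the real is_valid_url may use different validation rules), could look like this:

```python
from urllib.parse import urlparse


def is_valid_url(source: str) -> bool:
    """Return True if the source looks like a web URL, False otherwise (no exceptions)."""
    parsed = urlparse(source)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


# With a boolean check, the caller no longer needs the try/except, and raw HTML
# naturally falls through to the local-source branch:
#
#     if self.is_valid_url(source):
#         return self.handle_web_source(state, source)
#     return self.handle_local_source(state, source)

print(is_valid_url("https://perinim.github.io/projects"))  # True
print(is_valid_url("<html><body>raw HTML</body></html>"))  # False
```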