-
Notifications
You must be signed in to change notification settings - Fork 342
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Add AdaptivePlaywrightCrawler #872
base: master
Are you sure you want to change the base?
Conversation
Add method to BasicCrawler to handle just one request.
Add statistics TODO: Make mypy happy about statistics. Wrap existing statistics from init in adaptive statistics. Silent subcrawler statistics and loggers in general. (Set level to error?)
Ignore and create TODO follow up issue for refactoring Statistics class after technical discussion.
Handle use state.
Pre-navigation hooks delegation to sub crawler hooks.
Statistics were marked as generics, but in reality were not. Hardcoding state_model to make it explicit and clear.
WIP KVS handling. Currently it does not go through Result handler.
This reverts commit d345259.
Fix wrong id for predictor_state persistence.
Add test for pre nav hook Add test for statistics in crawler init
…nternals of sub crawlers) Cleanup commit results.
340c53d
to
f12f605
Compare
src/crawlee/crawlers/_adaptive_playwright/_adaptive_playwright_crawler.py
Outdated
Show resolved
Hide resolved
src/crawlee/crawlers/_adaptive_playwright/_adaptive_playwright_crawler.py
Outdated
Show resolved
Hide resolved
src/crawlee/crawlers/_adaptive_playwright/_adaptive_playwright_crawler.py
Outdated
Show resolved
Hide resolved
src/crawlee/crawlers/_adaptive_playwright/_adaptive_playwright_crawler.py
Outdated
Show resolved
Hide resolved
src/crawlee/crawlers/_adaptive_playwright/_adaptive_playwright_crawler.py
Outdated
Show resolved
Hide resolved
src/crawlee/crawlers/_adaptive_playwright/_adaptive_playwright_crawler.py
Outdated
Show resolved
Hide resolved
Add it in adaptive crawler instead at the cost of accessing many private members.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I finally managed to read this one through again 🙂 It needs some cleaning up, but everything important looks correct - nice job!
tests/unit/crawlers/_adaptive_playwright/test_adaptive_playwright_crawler.py
Outdated
Show resolved
Hide resolved
src/crawlee/crawlers/_adaptive_playwright/_adaptive_playwright_crawler.py
Outdated
Show resolved
Hide resolved
…ght_crawler.py Co-authored-by: Jan Buchar <[email protected]>
src/crawlee/crawlers/_adaptive_playwright/_adaptive_playwright_crawling_context.py
Outdated
Show resolved
Hide resolved
tests/unit/crawlers/_adaptive_playwright/test_adaptive_playwright_crawler.py
Outdated
Show resolved
Hide resolved
src/crawlee/crawlers/_adaptive_playwright/_adaptive_playwright_crawler.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
I have to integrate properly |
baa2052
to
8bce425
Compare
Co-authored-by: Jan Buchar <[email protected]>
Add AdaptivePlaywrightCrawler, based on JS implementation
Issues
#249
Purpose:
Adaptive crawler can choose to crawl page with either static crawler(like BeautifulSoupCrawler or ParselCrawler) or browser-based PlaywrightCrawler.
This is done to theoretically benefit from both approaches - use faster static crawler where it is possible and fallback to slower browser-based crawler where needed.
Implementation
Python implementation differs from JS implementation mainly in the way adaptive crawler performs the browser crawling or the static crawling.
JS implementation explicitly defines the code that is used for static/browser crawling - this makes it more simple and easier to understand, but it is some code duplication(code that already exists in other crawlers) and also it does drop some features that exist on those crawlers, like pre-navigation hooks.
Python implementation does not define explicitly code for static/browser crawling. It instead delegates this task to "sub crawlers" (instance of implementation of AbstractHttpCrawler for static crawling and instance of PlaywrightCrawler.)
This avoids code duplication and supports more features and control over the static/browser crawling, but it adds complexity of connecting all the components together.
"Sub crawlers" are directly used only in
__init__
method of top crawler to create their context pipelines. These are then used by top crawler to perform single request crawls when needed. This approach re-uses existing code from other crawlers and fully preserve their context pipeline handling.Crawling contexts
Since
AdaptivePlaywrightCrawler
can crawl page in two different manners the contexts can also look differently. I selected approach where there is unified context for both crawl types(static/browser) and this unified context contains all the attributes that are found both on static context and browser context, but browser based attributes will raiseAdaptiveContextError(RuntimeError)
when accessed during static crawling. It is impossible to know in advance which mode of crawling will be used, so this forces users to usetry-catch
pattern when working with these unified contexts.For user defined pre-navigation hooks there is
AdaptivePlaywrightPreNavCrawlingContext
(very nice name!) that haspage
property that can raiseAdaptiveContextError
.For user defined request handler there is
AdaptivePlaywrightCrawlingContext
that haspage
,infinite_scroll
andresponse
properties (all copied from PlaywrightCrawlingContext) that can raiseAdaptiveContextError
.Missing in this PR
Documentation
Documentation, examples and template will be added as last step in follow up PRs. The reason is that this feature should be documented only when it is in state that is ready to be used by users. Updating documentation too early would lead to users to try to use incomplete feature.
Rendering type predictor
JS implementation uses statistical regression based predictor to decide what crawl type should be used. This predictor is not implemented in this PR and is replaced by dummy version -
RandomRenderingTypePredictor
.Proper predictor will be implemented in follow up PR.
Helper methods on adaptive context
JS implementation adds several helper methods on adaptive context. Since adaptive crawler in JS is not generic with respect to static crawling, it is straight forward to implement it with the only option Cheerio.
If same was to be done in Python it would require more handling of generics as it would basically require to wrap some sort of generic "select" method from whatever generic parser is used and it's return value would have to be another generic parameter.
This change is standalone and it is better to do it separately.
Parity changes to JS code
Some improvements could be ported back to JS implementation. These are improvements and issues are summarized in apify/crawlee#2798