-
Notifications
You must be signed in to change notification settings - Fork 342
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Add AdaptivePlaywrightCrawler #872
Open
Pijukatel
wants to merge
68
commits into
master
Choose a base branch
from
adaptive-PwCrawler
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from 57 commits
Commits
Show all changes
68 commits
Select commit
Hold shift + click to select a range
d8b2438
WIP
Pijukatel c12acda
More feasible version of composition.
Pijukatel 623d341
Pass properly all kwargs to subcrawlers
Pijukatel 548349e
Add run decision logic from JS version
Pijukatel 29d510a
Statistics class change lead too many ripple changes.
Pijukatel 04eefd9
Handle sub crawlers loggers
Pijukatel 33efed3
Statistics change to be usable without ignores.
Pijukatel 202aceb
Align use_state with JS implementation.
Pijukatel d474a94
Remove "fake generics" from Statistics.
Pijukatel e95ff19
Align result comparator witrh JS implementation.
Pijukatel 2408d85
Add doc strings.
Pijukatel d345259
use_state through RequestHandlerRunResult
Pijukatel bfa9290
Revert "use_state through RequestHandlerRunResult"
Pijukatel 63f278a
Add basic delegation test.
Pijukatel e190788
Add context test.
Pijukatel 0ecb137
Add tests for use_state and predictor.
Pijukatel b73c702
Remove unintended edit.
Pijukatel 5c79d0d
Add tests for statistics.
Pijukatel 957915a
Add test for error handling adn commiting correct results.
Pijukatel f12f605
Add crawl_one_required_contexts property. (Alternative to accessing i…
Pijukatel 5256af2
Lint
Pijukatel 2fd7aae
Remove BasicCrawler modifications.
Pijukatel 714b5bd
Make _commit_result consistent with how other result components are h…
Pijukatel b38dda1
Remove subcrawlers and add _OrphanPipeline
Pijukatel ffb2a78
Use dummy statistics in subcrawlers.
Pijukatel 3b05228
Keep predictor related functions on predictor_state
Pijukatel dc06490
Unify pre-nav hooks.
Pijukatel 0766b7a
Simplify pre-nav hook common context.
Pijukatel 4a63c2c
Make static crawling part of AdaptiveCrawler generic.
Pijukatel bd72a84
Update tests to remove bs references.
Pijukatel c964d44
Revert accidental Lint edits to website/*.py
Pijukatel a471395
Review comments.
Pijukatel dbf6310
Sub crawler timeout handling + test
Pijukatel 8ee8f99
Simplify prenav hooks
janbuchar c291d3e
Simplify context manager handling
janbuchar 2c23238
Review comments - _run_request_handler + timeouts
Pijukatel b2a29c1
Statistics.
Pijukatel 0786f87
Make statistics generic again!
Pijukatel 4f7d7f4
Merge remote-tracking branch 'origin/master' into adaptive-PwCrawler
Pijukatel 1234ea7
Mock requests in tests.
Pijukatel e85ec9f
Improve error readability.
Pijukatel 6e8635a
Mock both static and browser requests in tests.
Pijukatel 0e9146a
Create proper example code
Pijukatel a4eac8f
Merge remote-tracking branch 'origin/master' into adaptive-PwCrawler
Pijukatel 44ad898
Relax timeout in test to avoid flakiness in CI
Pijukatel f219453
Remove AdaptivePlaywrightCrawlerStatistics
Pijukatel 64d9e54
WIP
Pijukatel 08fc81f
Update options typed dicts
Pijukatel 9bde9dc
Add docstrings to adaptive context public stuff
Pijukatel 9a14569
Make crawl_one_with private.
Pijukatel fd8dd82
Update tests/unit/crawlers/_adaptive_playwright/test_adaptive_playwri…
Pijukatel 4221219
Review comments
Pijukatel 565d36b
Remove _run_subcrawler_pipeline
Pijukatel 4bd8251
Remove Orphans
Pijukatel a3ab2e8
Merge remote-tracking branch 'origin/master' into adaptive-PwCrawler
Pijukatel fc95132
Move SubCrawlerRun to where it is used
Pijukatel 4f316b5
Use custom _TestInput dataclass
Pijukatel 949c4ff
Review comments
Pijukatel 56ad33a
Add optional argument to pre navigation hook decorator
Pijukatel 781d5ff
Remove _push_result_to_context and add result argument/return to _run…
Pijukatel a93d6a1
Merge remote-tracking branch 'origin/master' into adaptive-PwCrawler
Pijukatel b4ba31b
Add `block_request` to adaptive pre nav context
Pijukatel 8bce425
Use context result map for handling request handler results
Pijukatel 5f8c26c
Review comments based comments
Pijukatel e1d0c7e
Merge remote-tracking branch 'origin/master' into adaptive-PwCrawler
Pijukatel aacb90a
Integrate RenderingTypePredictor
Pijukatel ab1c40e
Update src/crawlee/crawlers/_basic/_basic_crawler.py
Pijukatel 2bf43f6
Finalize exports and re exports
Pijukatel File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,55 @@ | ||
import asyncio | ||
|
||
from playwright.async_api import Route | ||
|
||
from crawlee.crawlers._adaptive_playwright._adaptive_playwright_crawler import AdaptivePlaywrightCrawler | ||
from crawlee.crawlers._adaptive_playwright._adaptive_playwright_crawling_context import ( | ||
AdaptiveContextError, | ||
AdaptivePlaywrightCrawlingContext, | ||
AdaptivePlaywrightPreNavCrawlingContext, | ||
) | ||
|
||
|
||
async def main() -> None: | ||
crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser( | ||
max_requests_per_crawl=5, playwright_crawler_specific_kwargs={'headless': False} | ||
) | ||
|
||
@crawler.router.default_handler | ||
async def request_handler(context: AdaptivePlaywrightCrawlingContext) -> None: | ||
# Code that will be executed in both crawl types | ||
context.log.info(f'User handler processing: {context.request.url} ...') | ||
|
||
try: | ||
some_locator = context.page.locator('div').first | ||
# Code that will be executed only in Playwright crawl. | ||
# Trying to access `context.page` in static crawl will throw `AdaptiveContextError`. | ||
|
||
await some_locator.wait_for() | ||
# Do stuff with locator... | ||
context.log.info(f'Playwright processing of: {context.request.url} ...') | ||
except AdaptiveContextError: | ||
# Code that will be executed in only in static crawl | ||
context.log.info(f'Static processing of: {context.request.url} ...') | ||
|
||
# FInd more links and enqueue them. | ||
await context.enqueue_links() | ||
await context.push_data({'Top crawler Url': context.request.url}) | ||
|
||
@crawler.pre_navigation_hook | ||
async def hook(context: AdaptivePlaywrightPreNavCrawlingContext) -> None: | ||
async def some_routing_function(route: Route) -> None: | ||
await route.continue_() | ||
|
||
try: | ||
await context.page.route('*/**', some_routing_function) | ||
context.log.info(f'Playwright pre navigation hook for: {context.request.url} ...') | ||
except AdaptiveContextError: | ||
context.log.info(f'Static pre navigation hook for: {context.request.url} ...') | ||
|
||
# Run the crawler with the initial list of URLs. | ||
await crawler.run(['https://warehouse-theme-metal.myshopify.com/']) | ||
|
||
|
||
if __name__ == '__main__': | ||
asyncio.run(main()) |
Large diffs are not rendered by default.
Oops, something went wrong.
janbuchar marked this conversation as resolved.
Show resolved
Hide resolved
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
from crawlee.crawlers._adaptive_playwright._adaptive_playwright_crawler import AdaptivePlaywrightCrawler | ||
from crawlee.crawlers._adaptive_playwright._adaptive_playwright_crawling_context import ( | ||
AdaptivePlaywrightCrawlingContext, | ||
AdaptivePlaywrightPreNavCrawlingContext, | ||
) | ||
|
||
__all__ = ['AdaptivePlaywrightCrawler', 'AdaptivePlaywrightCrawlingContext', 'AdaptivePlaywrightPreNavCrawlingContext'] |
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is technically correct, but not the way it is supposed to be used. Here's what I'd do:
enqueue_links
AdaptiveContextError
- that's an implementation detail and we shouldn't make it look like this is something you should doThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As of now this example is more of a starting point for the reviewers and not really for end users given the fact that it is not mentioned in docs. I agree with the changes you suggest, but I would prefer to do them once the context helpers are in place, so that the example is matching the current state of the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get that, but I believe that even though that example is not actually accessible from anywhere, it shouldn't contain code that goes against how the adaptive crawler should be used. I'd prefer to update that file now and then once more after we have those context helpers in place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, but unless we split pre-navigation hooks to two different functions I do not think we can avoid try-catch there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or unless we use the new
block_requests
context helper there and just have it do nothing in a static crawling context. Would that work?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But that covers only some specific use cases. There can be any code in those hooks. So to cover general case and have 1 to 1 matching behavior with both PlaywrightCrawler and whatever variant of ParsedHttpCrawler, I still se just two options: Try-catch or separate hooks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, the point of the adaptive crawler is that you shouldn't need to make separate branches for browser/static. Is there any operation that you can do in a prenav hook in the static context but not in the browser? If not (I can't think of any, but that's not saying anything), then we can just provide
block_requests
(a possibly common use case) andpage
for all the other usage scenarios (but it will force browser-based crawling when used).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I tried some compromise solution.
@crawler.pre_navigation_hook
This decorator registers hooks for both static and playwright sub crawlers -> if anyone tries to access context.page, it will raise an exception. (The exception text points directly to the example below)
@crawler.pre_navigation_hook(playwright_only=True)
This decorator registers hooks only playwright sub crawler -> safe to access context.page
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, this is fine by me