-
Notifications
You must be signed in to change notification settings - Fork 395
feat: Add AdaptivePlaywrightCrawler #872
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from all commits
Commits
Show all changes
71 commits
Select commit
Hold shift + click to select a range
d8b2438
WIP
Pijukatel c12acda
More feasible version of composition.
Pijukatel 623d341
Pass properly all kwargs to subcrawlers
Pijukatel 548349e
Add run decision logic from JS version
Pijukatel 29d510a
Statistics class change lead too many ripple changes.
Pijukatel 04eefd9
Handle sub crawlers loggers
Pijukatel 33efed3
Statistics change to be usable without ignores.
Pijukatel 202aceb
Align use_state with JS implementation.
Pijukatel d474a94
Remove "fake generics" from Statistics.
Pijukatel e95ff19
Align result comparator witrh JS implementation.
Pijukatel 2408d85
Add doc strings.
Pijukatel d345259
use_state through RequestHandlerRunResult
Pijukatel bfa9290
Revert "use_state through RequestHandlerRunResult"
Pijukatel 63f278a
Add basic delegation test.
Pijukatel e190788
Add context test.
Pijukatel 0ecb137
Add tests for use_state and predictor.
Pijukatel b73c702
Remove unintended edit.
Pijukatel 5c79d0d
Add tests for statistics.
Pijukatel 957915a
Add test for error handling adn commiting correct results.
Pijukatel f12f605
Add crawl_one_required_contexts property. (Alternative to accessing i…
Pijukatel 5256af2
Lint
Pijukatel 2fd7aae
Remove BasicCrawler modifications.
Pijukatel 714b5bd
Make _commit_result consistent with how other result components are h…
Pijukatel b38dda1
Remove subcrawlers and add _OrphanPipeline
Pijukatel ffb2a78
Use dummy statistics in subcrawlers.
Pijukatel 3b05228
Keep predictor related functions on predictor_state
Pijukatel dc06490
Unify pre-nav hooks.
Pijukatel 0766b7a
Simplify pre-nav hook common context.
Pijukatel 4a63c2c
Make static crawling part of AdaptiveCrawler generic.
Pijukatel bd72a84
Update tests to remove bs references.
Pijukatel c964d44
Revert accidental Lint edits to website/*.py
Pijukatel a471395
Review comments.
Pijukatel dbf6310
Sub crawler timeout handling + test
Pijukatel 8ee8f99
Simplify prenav hooks
janbuchar c291d3e
Simplify context manager handling
janbuchar 2c23238
Review comments - _run_request_handler + timeouts
Pijukatel b2a29c1
Statistics.
Pijukatel 0786f87
Make statistics generic again!
Pijukatel 4f7d7f4
Merge remote-tracking branch 'origin/master' into adaptive-PwCrawler
Pijukatel 1234ea7
Mock requests in tests.
Pijukatel e85ec9f
Improve error readability.
Pijukatel 6e8635a
Mock both static and browser requests in tests.
Pijukatel 0e9146a
Create proper example code
Pijukatel a4eac8f
Merge remote-tracking branch 'origin/master' into adaptive-PwCrawler
Pijukatel 44ad898
Relax timeout in test to avoid flakiness in CI
Pijukatel f219453
Remove AdaptivePlaywrightCrawlerStatistics
Pijukatel 64d9e54
WIP
Pijukatel 08fc81f
Update options typed dicts
Pijukatel 9bde9dc
Add docstrings to adaptive context public stuff
Pijukatel 9a14569
Make crawl_one_with private.
Pijukatel fd8dd82
Update tests/unit/crawlers/_adaptive_playwright/test_adaptive_playwri…
Pijukatel 4221219
Review comments
Pijukatel 565d36b
Remove _run_subcrawler_pipeline
Pijukatel 4bd8251
Remove Orphans
Pijukatel a3ab2e8
Merge remote-tracking branch 'origin/master' into adaptive-PwCrawler
Pijukatel fc95132
Move SubCrawlerRun to where it is used
Pijukatel 4f316b5
Use custom _TestInput dataclass
Pijukatel 949c4ff
Review comments
Pijukatel 56ad33a
Add optional argument to pre navigation hook decorator
Pijukatel 781d5ff
Remove _push_result_to_context and add result argument/return to _run…
Pijukatel a93d6a1
Merge remote-tracking branch 'origin/master' into adaptive-PwCrawler
Pijukatel b4ba31b
Add `block_request` to adaptive pre nav context
Pijukatel 8bce425
Use context result map for handling request handler results
Pijukatel 5f8c26c
Review comments based comments
Pijukatel e1d0c7e
Merge remote-tracking branch 'origin/master' into adaptive-PwCrawler
Pijukatel aacb90a
Integrate RenderingTypePredictor
Pijukatel ab1c40e
Update src/crawlee/crawlers/_basic/_basic_crawler.py
Pijukatel 2bf43f6
Finalize exports and re exports
Pijukatel 5a6d07c
Review comments
Pijukatel e20aa49
Merge remote-tracking branch 'origin/master' into adaptive-PwCrawler
Pijukatel 1f48d53
Review comments
Pijukatel File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,57 @@ | ||
import asyncio | ||
|
||
from playwright.async_api import Route | ||
|
||
from crawlee.crawlers import ( | ||
AdaptivePlaywrightCrawler, | ||
AdaptivePlaywrightCrawlingContext, | ||
AdaptivePlaywrightPreNavCrawlingContext, | ||
) | ||
|
||
|
||
async def main() -> None: | ||
crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser( | ||
max_requests_per_crawl=5, playwright_crawler_specific_kwargs={'headless': False} | ||
) | ||
|
||
@crawler.router.handler(label='label') | ||
async def request_handler_for_label(context: AdaptivePlaywrightCrawlingContext) -> None: | ||
# Do some processing using `page` | ||
some_locator = context.page.locator('div').first | ||
await some_locator.wait_for() | ||
# Do stuff with locator... | ||
context.log.info(f'Playwright processing of: {context.request.url} ...') | ||
|
||
@crawler.router.default_handler | ||
async def request_handler(context: AdaptivePlaywrightCrawlingContext) -> None: | ||
context.log.info(f'User handler processing: {context.request.url} ...') | ||
# Do some processing using `parsed_content` | ||
context.log.info(context.parsed_content.title) | ||
|
||
# Find more links and enqueue them. | ||
await context.enqueue_links() | ||
await context.push_data({'Top crawler Url': context.request.url}) | ||
|
||
@crawler.pre_navigation_hook | ||
async def hook(context: AdaptivePlaywrightPreNavCrawlingContext) -> None: | ||
"""Hook executed both in static sub crawler and playwright sub crawler.""" | ||
# Trying to access context.page in this hook would raise `AdaptiveContextError` for pages crawled | ||
# without playwright. | ||
context.log.info(f'pre navigation hook for: {context.request.url} ...') | ||
|
||
@crawler.pre_navigation_hook(playwright_only=True) | ||
async def hook_playwright(context: AdaptivePlaywrightPreNavCrawlingContext) -> None: | ||
"""Hook executed only in playwright sub crawler.""" | ||
|
||
async def some_routing_function(route: Route) -> None: | ||
await route.continue_() | ||
|
||
await context.page.route('*/**', some_routing_function) | ||
context.log.info(f'Playwright only pre navigation hook for: {context.request.url} ...') | ||
|
||
# Run the crawler with the initial list of URLs. | ||
await crawler.run(['https://warehouse-theme-metal.myshopify.com/']) | ||
|
||
|
||
if __name__ == '__main__': | ||
asyncio.run(main()) |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
janbuchar marked this conversation as resolved.
Show resolved
Hide resolved
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
try: | ||
from ._rendering_type_predictor import RenderingType, RenderingTypePrediction, RenderingTypePredictor | ||
except ImportError as exc: | ||
raise ImportError( | ||
janbuchar marked this conversation as resolved.
Show resolved
Hide resolved
|
||
"To import this, you need to install the 'adaptive-playwright' extra. " | ||
"For example, if you use pip, run `pip install 'crawlee[adaptive-playwright]'`.", | ||
) from exc | ||
|
||
from ._adaptive_playwright_crawler import AdaptivePlaywrightCrawler | ||
from ._adaptive_playwright_crawling_context import ( | ||
AdaptivePlaywrightCrawlingContext, | ||
AdaptivePlaywrightPreNavCrawlingContext, | ||
) | ||
|
||
__all__ = [ | ||
'AdaptivePlaywrightCrawler', | ||
'AdaptivePlaywrightCrawlingContext', | ||
'AdaptivePlaywrightPreNavCrawlingContext', | ||
'RenderingType', | ||
'RenderingTypePrediction', | ||
'RenderingTypePredictor', | ||
] |
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.